named cpu usage pretty high because of dns_dnssec_findzonekeys2 -> file not found

Mon Mar 11 15:53:08 UTC 2019

On 11-Mar-19 03:52, Mark Andrews wrote:
> Because you removed the key from disk before it was removed from the zone.  Presumably named
> was logging other error messages before you removed the key from disk or the machine was off
> for a period or you mismanaged the key roll and named kept the key alive.
>
> Named’s re-signing strategy is different to when you are signing the whole zone at once as
> you are signing it incrementally.  You should be allowing most of the sig-validity interval
> before you delete the DNSKEY after you inactivate it.  One should check that there are no RRSIGs
> still present in the zone before deleting the DNSKEY from the zone.  Inactivating it stops the
> DNSKEY being used to generate new signatures but it needs to stay around until all those RRSIGs
> have expired from caches which only happens after new replacement signatures have been generated.

There are a lot of these "administrator should know" events and timeouts
in DNSSEC.  One could argue that these complexities are one of the
barriers to adoption.

It seems worth considering ways to make life easier, for administrators
and automation alike.

A few thoughts come immediately to mind - no doubt there are more:

- Rather than documenting "wait for n TTLs (or sig-validity interval)",
have bind log events that require/enable administrator actions (at
non-debug levels), such as:

"key (keyid) /foo/bar/.. no longer required and can be removed" - issued
at inactivation + max TTL of any RRSIG it signed.  Allows an admin (or
script) to know when it's safe rather than requiring research and/or math.

"key (keyid) /foo/baz... is now signing zone(s)
example.net,example.org.  It expires on <> and will be removed on <>"

- Provide an "obsolete-keys" directory - have named move keys that are
no longer required there.  (Or delete the files.  But emptying
obsolete-keys, like emptying /tmp, can be automated, and deleting a key
might be a problem if forensics - or audits - is required.) The key idea
is that an admin never removes a file from "keys".  And that should
prevent mistakes.

- Rather than relying on the keys directory for signing, use it only to
import/update keys.  Once named starts using a key, put a copy (or move
it) to ".active-keys" - or a database file - that persists as long as
the protocol requires it.  If the file in the keys directory is updated
with new dates, generate the appropriate events - but work from
.active-keys.  If the file disappears from "keys" before it should, use
.active-keys to restore it -- and add a comment explaining why.  ("#
Restored by named at 1-apr-2411: sig-validity interval for
lost.example.net (internal) extends to 15-may-2412")

- Provide an rndc "show"-class command (or stats channel output) that
explains the status/fate of each signing key.  Perhaps a table:

     Key                      Zone         view      State      created     publish     active      deactivate   remove      next_event
     key (keyid) /foo/baz...  example.net  external  Published  1-jan-2000  1-jun-2000  1-Jul-2000  31-dec-2000  1-feb-2001  activate 1-Jun-2000
     # Assumes today is 11-Mar-2000
     key (keyid) /foo/baz...  example.org  external  Published  1-jan-2000  1-jun-2000  1-Jul-2000  31-dec-2000  1-feb-2001  activate 1-Jun-2000
     # Same key, different zone

- Think more about what admins want to do, rather than how named (and
the protocols) do it.  E.g. "sign a zone", "roll key now|every month",
"use latest|specified|safest signature algorithm | key length", 
"enable/disable nsec|nsec3", "unsign zone"... Provide scripts and/or
named primitives that do this.  "dnssec-settime -xyz" doesn't do a good
job of specifying intent - one has to do a lot of math, and the intent
isn't logged - just the date change.
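
As a rough sketch of the first suggestion above (logging when a key can
be removed), here is how the "safe to remove" time might be computed so
that nobody has to do the math by hand.  The rule used is a deliberately
conservative assumption of mine, not named's actual behaviour:
inactivation time, plus the full sig-validity interval, plus the largest
TTL in the zone.  The function names and the key label "12345" are
hypothetical.

```python
from datetime import datetime, timedelta, timezone


def safe_removal_time(inactive_at: datetime,
                      sig_validity: timedelta,
                      max_ttl: timedelta) -> datetime:
    """Earliest time a key file may be removed under the assumed rule."""
    # Assumption: an RRSIG made just before inactivation has expired once
    # sig_validity has passed; one more max_ttl is added as a safety margin
    # for copies still held in caches.
    return inactive_at + sig_validity + max_ttl


def removal_message(key_label: str, inactive_at: datetime,
                    sig_validity: timedelta, max_ttl: timedelta,
                    now: datetime) -> str:
    """Build the kind of log line an admin (or script) could act on."""
    safe_at = safe_removal_time(inactive_at, sig_validity, max_ttl)
    if now >= safe_at:
        return f"key {key_label} no longer required and can be removed"
    return (f"key {key_label} still required; "
            f"can be removed after {safe_at.isoformat()}")


if __name__ == "__main__":
    inactive = datetime(2019, 1, 1, tzinfo=timezone.utc)
    print(removal_message("12345", inactive, timedelta(days=30),
                          timedelta(days=1),
                          now=datetime(2019, 3, 11, tzinfo=timezone.utc)))
```

A script watching for the first message could then move the key file
aside without anyone consulting the documentation.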

I'm aware of the dnssec-keymgr effort - it's still more oriented to
timeouts and e.g. coverage periods than to what one wants to
accomplish.  (As far as I can tell, it also doesn't support multiple
views - which makes it unusable for me.  I don't think this is an
unusual configuration...)

If you look at validate() in policy.py.in, there are 6 different errors
for conditions involving timer relationships.  [And the errors are
reported in seconds, not even as something vaguely human - such as
57w2d1h30m12s.] Why not (by default) adjust the timers & log the result?
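
Reporting a timer in something like the 57w2d1h30m12s form is not much
code.  A minimal sketch follows; format_duration is a hypothetical helper
of mine, not anything that exists in policy.py.in.

```python
def format_duration(seconds: int) -> str:
    """Render a non-negative number of seconds as e.g. '57w2d1h30m12s'."""
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    units = (("w", 604800), ("d", 86400), ("h", 3600), ("m", 60), ("s", 1))
    parts = []
    remaining = seconds
    for suffix, size in units:
        # Peel off as many whole units as fit, largest unit first.
        count, remaining = divmod(remaining, size)
        if count:
            parts.append(f"{count}{suffix}")
    # A zero-length duration would otherwise produce an empty string.
    return "".join(parts) or "0s"


if __name__ == "__main__":
    print(format_duration(34651812))
```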

I'm sure someone will opine that for every case, there's a choice
between shrinking one timer and extending the other. This is undoubtedly
true.  But better to pick a strategy that is consistent with safe
practice than to kick back each error to an admin.  An admin who has
particular requirements can read the log.  But for those who "just want
things to work", I suspect that we can identify a driver (I nominate key
lifetime) & adjust everything else to fit...
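
To make the "pick a driver and adjust" idea concrete, here is a sketch
under assumptions of my own.  It does not reproduce the checks in
validate(); the Timers fields, the fit_to_lifetime name, and the rule
"pre-publish plus post-publish may use at most half the key lifetime"
are all hypothetical choices made for illustration.

```python
import logging
from dataclasses import dataclass, replace

log = logging.getLogger("policy-sketch")


@dataclass(frozen=True)
class Timers:
    key_lifetime: int   # seconds between rolls; the fixed "driver"
    pre_publish: int    # seconds a key is published before activation
    post_publish: int   # seconds a key stays published after inactivation


def fit_to_lifetime(t: Timers) -> Timers:
    """Shrink pre/post-publish to fit the key lifetime, and log the change."""
    total = t.pre_publish + t.post_publish
    if total < t.key_lifetime:
        return t  # already consistent, nothing to adjust
    # Assumed strategy: give the two intervals half the lifetime between
    # them, split in the same proportion the admin originally asked for.
    budget = t.key_lifetime // 2
    pre = t.pre_publish * budget // total
    post = budget - pre
    adjusted = replace(t, pre_publish=pre, post_publish=post)
    log.warning("pre-publish/post-publish %ds/%ds do not fit key lifetime "
                "%ds; adjusted to %ds/%ds", t.pre_publish, t.post_publish,
                t.key_lifetime, pre, post)
    return adjusted


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(fit_to_lifetime(Timers(key_lifetime=100, pre_publish=80,
                                 post_publish=40)))
```

An admin with particular requirements sees the warning and can override
it; everyone else gets a working configuration instead of an error.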

I'm sure there are some challenges in the details - but I hope the
message is clear.  Avoid blaming the admin for trying to make things
work.  Instead, package actions at admin-oriented levels of
abstraction.  Guard data that named needs, and avoid having the admin
manipulate live files (where mistakes can be made).

I do want to acknowledge the considerable efforts already made to make
DNSSEC more usable.  They have helped, but as evidenced by the exchange
that precipitated this note, the level of abstraction is still too low.
