validating ... bad cache hit

Fri Apr 24 10:17:50 UTC 2020

Hi,

we got reports about a temporary resolution failure for some names
under norid.no this morning.  Digging through the logs, the first
instance appears to be

Apr 24 08:35:02 resolver named[244]: validating zabbix-test.norid.no/CNAME: bad cache hit (norid.no/DNSKEY)

and, a couple of minutes later, a rash of entries pointing to the same
bad cache hit.  The last entry following this pattern was logged some
10 minutes later.
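
For what it's worth, here is a minimal sketch of how such entries can
be pulled out of the log to see which names were affected and over
what time span.  It assumes the syslog line format shown above; the
function name summarize_bad_cache_hits is made up for illustration,
and timestamps are kept as plain strings in log order rather than
parsed.

```python
import re

# Matches lines such as:
#   Apr 24 08:35:02 resolver named[244]: validating
#   zabbix-test.norid.no/CNAME: bad cache hit (norid.no/DNSKEY)
# (a single line in the real log). The format is assumed from that one sample.
BAD_CACHE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+named\[\d+\]:\s+"
    r"validating\s+(?P<name>\S+)/(?P<type>[\w-]+):\s+"
    r"bad cache hit\s+\((?P<bad>[^)]+)\)"
)


def summarize_bad_cache_hits(lines):
    """Group "bad cache hit" log lines by the cache entry they point to.

    Returns a dict mapping e.g. "norid.no/DNSKEY" to a dict with the
    number of hits, the first and last timestamp seen (as strings, in
    log order) and the set of name/type pairs being validated.
    """
    summary = {}
    for line in lines:
        m = BAD_CACHE_RE.match(line)
        if m is None:
            continue  # not a "bad cache hit" line
        entry = summary.setdefault(
            m["bad"],
            {"count": 0, "first": m["ts"], "last": m["ts"], "names": set()},
        )
        entry["count"] += 1
        entry["last"] = m["ts"]
        entry["names"].add(f'{m["name"]}/{m["type"]}')
    return summary
```

Passing it an open log file (any iterable of lines will do) gives one
summary record per bad cache entry, which makes it easy to read off
the first and last occurrence.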

Looking at the code in BIND 9.14.10 (BIND 9.16.2 doesn't appear to be
significantly different in this regard), there appears to be a "cache
of bad records" implemented by lib/dns/badcache.c.  There are two
invocations of dns_resolver_addbadcache() in lib/dns/resolver.c, with
fairly complicated preconditions to reach each of those two points.
However, it appears that unless query tracing is turned on (ours is
not; I think we enabled it previously, but found it to be a severe and
noticeable performance hit), nothing is logged to indicate which of
the two conditions was hit, or why.  So the trail towards the root
cause of why norid.no/DNSKEY was temporarily marked bad goes cold at
this point, as far as I can see.

Our logging is configured to (among other things) log the
"dnssec" and "security" categories at severity info and higher.

Is there anything that can be done to improve the diagnostics for
such situations?  I don't suppose there is anything more to be found
for this particular problem at the moment?

Regards,

- Håvard