DNSSEC troubleshooting on a recursive server.

Wed Aug 7 08:53:40 UTC 2013

On 08/07/2013 12:09 AM, Grant Keller wrote:
> Hello,
>
> We have 7 recursive DNS servers running Bind 9.9.2, and we are seeing
> some strange behavior validating DNSSEC. We have seen this happen a few
> times, and in the past the problem has gone away when the server is
> rebooted, so my first guess is that some record is stuck in the cache.

"Rebooted" is a bit extreme; did you actually reboot the OS, or do you 
mean "restart bind"? When the problem occurs, have you tried "rndc 
flush" to see if that corrects it?

Are you using any forwarders, or might your upstream be doing 
transparent DNS caching? Unlikely, but not unheard of.

> # dig a zygo.com @127.0.0.1 +nocomments

+nocomments has hidden the response header, including the rcode
(NOERROR, SERVFAIL, etc.), so this output is not much help here.
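
As a rough sketch, the rcode can be recovered by leaving +nocomments off
and pulling the status field out of the header. The helper name rcode_of
is made up for illustration, and it assumes dig's usual
";; ->>HEADER<<-" line format:

```shell
# Hypothetical helper: read dig output on stdin and print the rcode
# taken from the ";; ->>HEADER<<- opcode: ..., status: X, id: N" line.
rcode_of() {
    sed -n 's/^;; ->>HEADER<<-.*status: \([A-Z]*\),.*/\1/p'
}

# Usage (assumes a resolver listening on 127.0.0.1):
#   dig a zygo.com @127.0.0.1 | rcode_of
```

An empty answer section with NOERROR and one with SERVFAIL look the same
once the header is hidden, which is why the status matters here.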

http://dnsviz.net/d/zygo.com/dnssec/

...suggests there might be an oddity with the TTL on the TXT records at
the zone apex, but not the A record. Otherwise the zone looks OK.

You could try:

rndc dumpdb -cache

...then inspect the dumped cache for the zones in question and see if
anything obvious stands out. Pay particular attention to the "bad" cache
(at the end of the file) and to which DNSKEY key IDs and RRSIGs are in
the cache.
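
A minimal sketch of that inspection, assuming the dump marks the section
with a "; Bad cache" comment line (check your own dump file, as the
exact marker may differ between versions). The helper name inspect_dump
and the file path in the usage line are hypothetical:

```shell
# Hypothetical helper: show dump lines that mention a name, then
# everything from the "; Bad cache" marker to the end of the file.
inspect_dump() {
    dump=$1 name=$2
    # Case-insensitive, fixed-string match, with line numbers.
    grep -n -i -F "$name" "$dump"
    echo "--- bad cache section ---"
    sed -n '/^; Bad cache/,$p' "$dump"
}

# Usage (path is an assumption; see "dump-file" in named.conf):
#   inspect_dump /var/named/data/cache_dump.db zygo.com
```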

Note you might need to set "dump-file" in named.conf to a path that
named can write to.
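
For example, something along these lines in the options block. The path
is only an assumption; use any location that the user named runs as can
write to:

```
options {
    // Assumed path; any file writable by the named user will do.
    dump-file "/var/named/data/cache_dump.db";
};
```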

>
> ; <<>> DiG 9.7.0-P2-RedHat-9.7.0-17.P2.el5_9.2 <<>> a zygo.com
> @127.0.0.1 +nocomments
> ;; global options: +cmd
> ;zygo.com.            IN    A
> ;; Query time: 162 msec
> ;; SERVER: 127.0.0.1#53(127.0.0.1)
> ;; WHEN: Tue Aug  6 16:06:10 2013
> ;; MSG SIZE  rcvd: 26
>
> # dig rrsig zygo.com @127.0.0.1 +nocomments
>

Hmm. This *is* odd. We're on BIND 9.9.3 and it seems "dig domain.com
rrsig" always returns TTL=0.

I wonder if this is new? I don't recall seeing it before.

In any event, as Mark has suggested, you don't want to dig the RRSIG 
yourself. Rather, use:

dig +dnssec zygo.com a

...and if you get a SERVFAIL:

dig +dnssec +cd zygo.com a
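
The +cd flag sets the Checking Disabled bit, which asks the resolver to
return the data without validating it, so comparing the two results
hints at where the failure lies. A rough sketch of that comparison
(classify is a made-up helper name, and the wording of its verdicts is
mine, not anything dig prints):

```shell
# Hypothetical helper: interpret the rcodes from the two queries.
#   $1 = status from "dig +dnssec name type"
#   $2 = status from "dig +dnssec +cd name type"
classify() {
    if [ "$1" != "SERVFAIL" ]; then
        echo "no SERVFAIL from the validating query"
    elif [ "$2" = "SERVFAIL" ]; then
        echo "fails even with +cd: probably not a validation problem"
    else
        echo "fails only without +cd: likely a DNSSEC validation failure"
    fi
}

# Usage: classify SERVFAIL NOERROR
```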

> The thing that really confuses me is that the ttl on the RRSIG DS record
> has been stuck at 5 for about a day now. I tried doing a rndc flushname
> zygo.com, which did not help. What else can I do to troubleshoot this,
> and if it is a cache problem, what can I do to clear the records? Thanks.

Well, try "rndc flush" (which clears the whole cache) rather than
"flushname" (which clears only the one name), but as above, examine the
cache first and see if the problem stands out.