DNSSEC troubleshooting on a recursive server.
p.mayers at imperial.ac.uk
Wed Aug 7 08:53:40 UTC 2013
On 08/07/2013 12:09 AM, Grant Keller wrote:
> We have 7 recursive DNS servers running Bind 9.9.2, and we are seeing
> some strange behavior validating DNSSEC. We have seen this happen a few
> times, and in the past the problem has gone away when the server is
> rebooted, so my first guess is that some record is stuck in the cache.
"Rebooted" is a bit extreme; did you actually reboot the OS, or do you
mean "restart bind"? When the problem occurs, have you tried "rndc
flush" to see if that corrects it?
Are you using any forwarders, or might your upstream be doing
transparent DNS caching? Unlikely, but not unheard of.
> # dig a zygo.com @127.0.0.1 +nocomments
+nocomments has hidden the response header, including the rcode
(NOERROR, SERVFAIL, etc.), so this output is not entirely helpful.
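To make the rcode easy to spot, something like the following could be used. It is only a sketch: it assumes dig's default output, where the header line looks like ";; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: ...", and the function name dig_status is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read dig output on stdin and print the rcode
# taken from the ";; ->>HEADER<<- ... status: XXX, ..." line.
# Prints nothing if that line is absent (for example with +nocomments).
dig_status() {
    sed -n 's/.*->>HEADER<<-.*status: \([A-Z0-9]*\),.*/\1/p' | head -n 1
}

# Example use (commented out; needs dig and a reachable resolver):
#   dig a zygo.com @127.0.0.1 | dig_status
```

Run without +nocomments, otherwise the header line is not printed and there is nothing to extract.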
...suggests there might be an oddity with the TTL on the TXT records at
zone apex, but not the A record. Otherwise zone looks ok.
You could try:
rndc dumpdb -cache
...then go and inspect the dumped cache for the zones and see if
anything obvious stands out. Pay particular attention to the "bad" cache
(at the end of the file) and what DNSKEY ID# and RRSIGs are in-cache.
Note you might need to set "dump-file" in named.conf to a path named can
write to.
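As a rough sketch of the inspection step, the dump can be filtered for the name in question. The function name cache_lines_for_zone and the dump path in the comment are placeholders, not anything BIND defines; use whatever "dump-file" points to on your server.

```shell
#!/bin/sh
# Hypothetical helper: print every line of a cache dump that mentions
# a given name, prefixed with its line number so the entry can be
# located in the full file afterwards.
# Usage: cache_lines_for_zone DUMPFILE NAME
cache_lines_for_zone() {
    grep -n -i -F -- "$2" "$1"
}

# Example use (commented out; needs rndc access to a running named):
#   rndc dumpdb -cache
#   cache_lines_for_zone /path/to/your/dump-file zygo.com
```

The match is a plain, case-insensitive substring search, so it will also pick up comment lines and the "bad" cache section if they mention the name.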
> ; <<>> DiG 9.7.0-P2-RedHat-9.7.0-17.P2.el5_9.2 <<>> a zygo.com
> @127.0.0.1 +nocomments
> ;; global options: +cmd
> ;zygo.com. IN A
> ;; Query time: 162 msec
> ;; SERVER: 127.0.0.1#53(127.0.0.1)
> ;; WHEN: Tue Aug 6 16:06:10 2013
> ;; MSG SIZE rcvd: 26
> # dig rrsig zygo.com @127.0.0.1 +nocomments
Hmm. This *is* odd. We're on bind 9.9.3 and it seems "dig domain.com
rrsig" always returns TTL=0.
I wonder if this is new? I don't recall seeing it before.
In any event, as Mark has suggested, you don't want to dig the RRSIG
yourself. Rather, use:
dig +dnssec zygo.com a
...and if you get a SERVFAIL:
dig +dnssec +cd zygo.com a
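The point of the second query is the comparison: +cd asks the server to skip validation, so a SERVFAIL that disappears with +cd points at validation rather than at resolution in general. A small sketch of that reasoning follows; the function name and the wording of the messages are mine, and the two arguments are the rcodes you read from the two dig runs.

```shell
#!/bin/sh
# Hypothetical helper: interpret the rcodes of two lookups for the same
# name, the first sent normally and the second sent with +cd.
# Usage: interpret_cd_test STATUS_NORMAL STATUS_WITH_CD
interpret_cd_test() {
    if [ "$1" = "SERVFAIL" ] && [ "$2" != "SERVFAIL" ]; then
        # Fails only when the server validates.
        echo "validation failure likely"
    elif [ "$1" = "SERVFAIL" ]; then
        # Fails even with checking disabled.
        echo "failure not specific to validation"
    else
        echo "no failure seen"
    fi
}

# Example use, with the rcodes read off the two dig outputs:
#   interpret_cd_test SERVFAIL NOERROR
```

This is only a first cut; a "validation failure likely" result still needs the cache dump or the logs to say which key or signature is at fault.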
> The thing that really confuses me is that the ttl on the RRSIG DS record
> has been stuck at 5 for about a day now. I tried doing a rndc flushname
> zygo.com, which did not help. What else can I do to troubleshoot this,
> and if it is a cache problem, what can I do to clear the records? Thanks.
Well, "rndc flush" rather than "flushname", but as above, examine the
cache first and see if the problem stands out.