timeouts and negative caching

Thu Jun 11 13:27:09 UTC 2015

Hi,

I'm running bind as a recursive resolver behind a thin internet line.
When the line is clogged, requests sometimes time out. When the dns client
retries the query, bind usually retries the request and eventually succeeds.
So far so good.

But now I sometimes see that bind does not retry immediately, but somehow 
caches the error for up to 5 minutes (300 secs). The negative answer is then 
given right away, without checking again if the remote server can be reached 
now.

Here is an example:

> time dig @localhost www.strato.de
; <<>> DiG 9.9.3-P2-RedHat-9.9.3-2.P2.i2n <<>> @localhost www.strato.de
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 43535
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.strato.de.                 IN      A

;; Query time: 4397 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Jun 11 14:14:17 CEST 2015
;; MSG SIZE  rcvd: 42

real    0m0.007s
user    0m0.004s
sys     0m0.000s

When I look into the bind cache I see this:

> rndc dumpdb -all
> cat cache_dump.db
[...]
; authauthority
strato.de.              85530   NS      ns3.strato.de.
                        85530   NS      ns4.strato.de.
                        85530   NS      ns1.strato.de.
                        85530   NS      ns2.strato.de.
; additional
ns1.strato.de.          85530   A       193.141.40.1
; additional
ns2.strato.de.          85530   A       81.169.144.234
; additional
ns3.strato.de.          85530   A       195.122.141.2
; additional
                        85530   AAAA    2a00:e10:2004::2
; additional
ns4.strato.de.          85530   A       192.166.192.4
; additional
                        85530   AAAA    2a01:238:e100:192::4
[...]
;
; Address database dump
;
[...]
; ns2.strato.de [v4 TTL 59] [v4 failure] [v6 unexpected]
; ns3.strato.de [v4 TTL 59] [v4 failure] [v6 unexpected]
; ns4.strato.de [v4 TTL 59] [v4 failure] [v6 unexpected]
; ns1.strato.de [v4 TTL 59] [v4 failure] [v6 unexpected]

I've seen this "[v4 TTL 59]" go up to 300. 
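
To watch how this TTL develops over time, the failed entries can be pulled out
of the dump with a small script. This is only a rough sketch: the line format
is taken from the dump above and may look different in other bind versions,
and failed_v4_entries is just a name I made up for illustration.

```python
import re

# Matches address database lines as they appear in my dump, e.g.
#   ; ns2.strato.de [v4 TTL 59] [v4 failure] [v6 unexpected]
# Assumption: this layout is only what bind 9.9.3 printed here.
ADB_LINE = re.compile(
    r"^;\s+(?P<name>\S+)\s+\[v4 TTL (?P<ttl>\d+)\]\s+\[v4 (?P<state>[^\]]+)\]"
)


def failed_v4_entries(dump_text):
    """Return {server name: remaining TTL} for entries marked '[v4 failure]'."""
    entries = {}
    for line in dump_text.splitlines():
        match = ADB_LINE.match(line)
        if match and match.group("state") == "failure":
            entries[match.group("name")] = int(match.group("ttl"))
    return entries


sample = """\
;
; Address database dump
;
; ns2.strato.de [v4 TTL 59] [v4 failure] [v6 unexpected]
; ns3.strato.de [v4 TTL 59] [v4 failure] [v6 unexpected]
"""

print(failed_v4_entries(sample))
```

Feeding it the contents of cache_dump.db after each "rndc dumpdb -all" shows
which name servers are currently marked as failed and for how much longer.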

So there must be some kind of "negative caching" that caches timeouts as well
and not, like the real negative caching, just actual negative answers.

Where do these 300 seconds come from, and how can I configure them? I'd like to
reduce them drastically, to something like 10 seconds, so that bind tries to
resolve a query again shortly after a timeout.

Thank you.

Kind regards,

Gerd