Seemingly random ServFail issues on a caching server
lyle at lcrcomputer.net
Wed Aug 31 14:05:08 UTC 2011
On 8/31/2011 8:40 AM, Florian CROUZAT wrote:
> Florian CROUZAT wrote on 2011-08-25:
>> Hi list,
>> On a few domains (we'll consider only one domain for this example) I
>> sometimes encounter seemingly random ServFails while resolving domain
>> names. A client (192.168.147.2) asks my caching server (192.168.151.100)
>> to resolve a target (www.leclercdrive.fr).
>> Here are the relevant logs:
>> Aug 24 17:14:19 ns named: 24-Aug-2011 17:14:19.377 queries: info: client 192.168.147.2#34502: view internal: query: www.leclercdrive.fr IN A +
>> Aug 24 17:14:19 ns named: 24-Aug-2011 17:14:19.380 queries: info: client 192.168.147.2#34502: view internal: query: www.leclercdrive.fr IN A +
>> Aug 24 17:14:19 ns named: 24-Aug-2011 17:14:19.382 queries: info: client 192.168.147.2#34502: view internal: query: www.leclercdrive.fr IN A +
>> A tcpdump on the local side of the NS server shows the A request and the
>> instant ServFail. A tcpdump on the external side of the NS server shows
>> no traffic at all in this case meaning it fails internally and doesn't
>> even try to forward the A request to the Internet.
>> 17:14:19.377608 IP 192.168.147.2.34502 > 192.168.151.100.53: 26340+ A? www.leclercdrive.fr. (37)
>> 17:14:19.378845 IP 192.168.151.100.53 > 192.168.147.2.34502: 26340 ServFail 0/0/0 (37)
>> 17:14:19.380607 IP 192.168.147.2.34502 > 192.168.151.100.53: 52628+ A? www.leclercdrive.fr. (37)
>> 17:14:19.381383 IP 192.168.151.100.53 > 192.168.147.2.34502: 52628 ServFail 0/0/0 (37)
>> 17:14:19.382605 IP 192.168.147.2.34502 > 192.168.151.100.53: 58933+ A? www.leclercdrive.fr. (37)
>> 17:14:19.383406 IP 192.168.151.100.53 > 192.168.147.2.34502: 58933 ServFail 0/0/0 (37)
>> A few minutes before, or later, it worked just fine, see:
>> 17:15:58.736177 IP 192.168.147.2.34502 > 192.168.151.100.53: 49610+ A? www.leclercdrive.fr. (37)
>> 17:15:58.784470 IP 192.168.151.100.53 > 192.168.147.2.34502: 49610 3/3/6 CNAME[|domain]
>> The TTL of the www.leclercdrive.fr entry is 300 - which seems short to
>> me - maybe the ServFail happens when a request is processed at the exact
>> moment the TTL reaches zero and the cache entry is being flushed? I
>> tried flushing the cache using rndc, but the first request after that
>> worked just fine (of course...)
>> Any ideas/hints are welcome.
>> The DNS server runs 1:9.5.1.dfsg.P3-1+lenny1
>> cat /etc/debian_version => 5.0.4
>> (I have no control over the version of the tools)
> I found a few other domains in my logfiles where the ServFails happen; their
> respective TTLs are all different, from 300 to 86400 seconds.
> I still have no idea how to resolve this issue and, as far as I have
> investigated, I haven't been able to identify a pattern in those ServFails.
> I'm not even sure the TTL is involved, since I saw two ServFails separated in
> time by less than the TTL value of the entry...
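One way to look for a pattern is to pair each query in the quoted tcpdump text with its response and tabulate the result and the turnaround time. The following is only a rough sketch under assumptions: it handles just the two line shapes quoted above (with or without a space before the ">"), and the names `LINE`, `QUERY` and `pair_queries` are made up for this example.

```python
import re
from datetime import datetime

# Matches the tcpdump line shapes quoted in this thread, for example:
#   17:14:19.377608 IP 192.168.147.2.34502 > 192.168.151.100.53: 26340+ A? www.leclercdrive.fr. (37)
#   17:14:19.378845 IP 192.168.151.100.53 > 192.168.147.2.34502: 26340 ServFail 0/0/0 (37)
LINE = re.compile(
    r"^(?P<ts>\d\d:\d\d:\d\d\.\d+) IP (?P<src>\S+?)\s*> (?P<dst>\S+): "
    r"(?P<qid>\d+)(?P<flags>[+*%-]*) (?P<rest>.*)$"
)
# A query body looks like "A? www.leclercdrive.fr. (37)".
QUERY = re.compile(r"^(?P<qtype>\w+)\? (?P<name>\S+)")


def pair_queries(lines):
    """Pair queries with responses by (client address, query ID)."""
    pending = {}
    results = []
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # not a line shape this sketch understands
        ts = datetime.strptime(m["ts"], "%H:%M:%S.%f")
        q = QUERY.match(m["rest"])
        if q:
            # Query: remember when the client sent it.
            pending[(m["src"], m["qid"])] = (ts, q["name"], q["qtype"])
            continue
        # Response: the client is the destination this time.
        sent = pending.pop((m["dst"], m["qid"]), None)
        if sent is None:
            continue  # response without a captured query
        sent_ts, name, qtype = sent
        results.append({
            "name": name,
            "qtype": qtype,
            "servfail": m["rest"].startswith("ServFail"),
            "latency_ms": (ts - sent_ts).total_seconds() * 1000.0,
        })
    return results
```

Fed the captures above, this gives a turnaround of roughly 1.2 ms for the first ServFail and roughly 48 ms for the lookup that succeeded, which fits the observation that the failing queries never leave the server.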
The authoritative name servers for leclercdrive.fr are a.dns.gandi.net,
b.dns.gandi.net and c.dns.gandi.net. I don't know how big gandi.net is,
but traceroutes to those servers end up going through Level3 in
Baltimore, MD from here. They did have a hurricane go through there and
I would not be surprised if traffic levels have been a bit high for the
last few days.
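
If you want to check reachability of those three servers from your side, something like the following could be run from the caching server. It is a minimal sketch using only the Python standard library, not a replacement for dig: it sends one non-recursive A query over UDP and reports the response code and elapsed time. The function names `build_query`, `parse_rcode` and `probe` are invented for this example, and the 2 second timeout is an arbitrary assumption.

```python
import socket
import struct
import time


def build_query(name, qid, qtype=1):
    """Build a minimal DNS query: one question, class IN, RD bit clear."""
    header = struct.pack("!HHHHHH", qid, 0x0000, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def parse_rcode(packet):
    """Return (query ID, RCODE) from the first four header bytes."""
    qid, flags = struct.unpack("!HH", packet[:4])
    return qid, flags & 0x000F


def probe(server, name, timeout=2.0):
    """Send one A query to `server`; return (rcode, seconds), or None
    on timeout or if the reply ID does not match."""
    qid = int(time.time() * 1000) & 0xFFFF
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        sock.sendto(build_query(name, qid), (server, 53))
        try:
            data, _ = sock.recvfrom(512)
        except socket.timeout:
            return None
        elapsed = time.monotonic() - start
    rid, rcode = parse_rcode(data)
    if rid != qid:
        return None
    return rcode, elapsed


# Example use (not run here, needs network access):
# for ns in ("a.dns.gandi.net", "b.dns.gandi.net", "c.dns.gandi.net"):
#     print(ns, probe(ns, "www.leclercdrive.fr"))
```

An RCODE of 0 means no error and 2 means SERVFAIL; a `None` result points at loss or delay on the path. Repeating the probe over a few hours and logging the timeouts would show whether the upstream servers are intermittently unreachable.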