Seemingly random ServFail issues on a caching server

Florian CROUZAT gentoo at floriancrouzat.net
Tue Sep 6 14:39:12 UTC 2011


Florian CROUZAT wrote on 2011-08-31:

> Lyle Giese wrote on 2011-08-31:
>
>> On 8/31/2011 8:40 AM, Florian CROUZAT wrote:
>>> Florian CROUZAT wrote on 2011-08-25:
>>>
>>>> Hi list,
>>>>
>>>> On a few domains (we'll consider only one domain for this example) I
>>>> sometimes encounter seemingly random ServFails while resolving
>>>> domain names. A client (192.168.147.2) asks my caching server
>>>> (192.168.151.100) to resolve a target (www.leclercdrive.fr).
>>>>
>>>> Here are the relevant logs:
>>>>
>>>> Aug 24 17:14:19 ns named[24929]: 24-Aug-2011 17:14:19.377 queries: info: client 192.168.147.2#34502: view internal: query: www.leclercdrive.fr IN A +
>>>> Aug 24 17:14:19 ns named[24929]: 24-Aug-2011 17:14:19.380 queries: info: client 192.168.147.2#34502: view internal: query: www.leclercdrive.fr IN A +
>>>> Aug 24 17:14:19 ns named[24929]: 24-Aug-2011 17:14:19.382 queries: info: client 192.168.147.2#34502: view internal: query: www.leclercdrive.fr IN A +
>>>>
>>>>
>>>> A tcpdump on the local side of the NS server shows the A request and
>>>> the instant ServFail. A tcpdump on the external side of the NS server
>>>> shows no traffic at all in this case meaning it fails internally and
>>>> doesn't even try to forward the A request to the Internet.
>>>>
>>>> 17:14:19.377608 IP 192.168.147.2.34502 > 192.168.151.100.53: 26340+ A? www.leclercdrive.fr. (37)
>>>> 17:14:19.378845 IP 192.168.151.100.53 > 192.168.147.2.34502: 26340 ServFail 0/0/0 (37)
>>>> 17:14:19.380607 IP 192.168.147.2.34502 > 192.168.151.100.53: 52628+ A? www.leclercdrive.fr. (37)
>>>> 17:14:19.381383 IP 192.168.151.100.53 > 192.168.147.2.34502: 52628 ServFail 0/0/0 (37)
>>>> 17:14:19.382605 IP 192.168.147.2.34502 > 192.168.151.100.53: 58933+ A? www.leclercdrive.fr. (37)
>>>> 17:14:19.383406 IP 192.168.151.100.53 > 192.168.147.2.34502: 58933 ServFail 0/0/0 (37)
>>>>
>>>> A few minutes before, or later, it worked just fine, see:
>>>>
>>>> 17:15:58.736177 IP 192.168.147.2.34502 > 192.168.151.100.53: 49610+ A? www.leclercdrive.fr. (37)
>>>> 17:15:58.784470 IP 192.168.151.100.53 > 192.168.147.2.34502: 49610 3/3/6 CNAME[|domain]
>>>>
>>>> The TTL of the www.leclercdrive.fr entry is 300 - which seems short
>>>> to me - maybe the ServFail happens when a request is handled at the
>>>> exact time the TTL reaches zero and the cache entry is being
>>>> flushed? I tried flushing the cache using rndc but the first request
>>>> after that worked just fine (of course...)
>>>>
>>>> Any ideas/hints are welcome.
>>>>
>>>> The DNS server runs 1:9.5.1.dfsg.P3-1+lenny1
>>>> cat /etc/debian_version =>  5.0.4
>>>> (I have no control on the version of the tools)
>>>
>>>
>>>
>>> I found in my logfiles a few other domains where the ServFails happen;
>>> their TTLs all differ, from 300 to 86400 seconds. I still
>>> have no idea how to resolve this issue and, as far as I have
>>> investigated, I haven't been able to identify a pattern in those
>>> ServFails. I'm not even sure the TTL is involved, since I saw two
>>> ServFails separated in time by less than the TTL value of the entry...
>>>
>>> Florian
>>>
>>
>> The authoritative name servers for leclercdrive.fr are a.dns.gandi.net,
>> b.dns.gandi.net and c.dns.gandi.net.  I don't know how big gandi.net
>> is, but traceroutes to those servers end up going through Level3 in
>> Baltimore, MD from here.  They did have a hurricane go through there
>> and I would not be surprised if traffic levels have been a bit high for
>> the last few days.
>>
>> Lyle
>
> Well, it's a French registrar, my servers are in France and my clients
> are French too, so from here the traceroute is pretty clean. Anyway my
> problem isn't (apparently) Gandi related, or even www.leclercdrive.fr
> related, since the ServFails happen internally and instantly in my
> BIND, which doesn't even try to forward the A request.
>
>
> Florian

Apparently -- even if I don't understand why -- the problem seems to be that
the NS ({a,b,c}.dns.gandi.net) of leclercdrive.fr and other domains which
ServFail have AAAA entries and my caching server has IPv6 enabled but my
network doesn't route or handle IPv6.
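
For anyone who wants to check whether they are in the same situation, here is
a minimal sketch of the two tests. The function names are hypothetical, and it
assumes that `dig` and iproute2's `ip` are installed on the caching server.

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration.

# Print the AAAA records (if any) published for each name server given.
ns_aaaa() {
    for ns in "$@"; do
        printf '%s: ' "$ns"
        dig +short AAAA "$ns" | tr '\n' ' '
        printf '\n'
    done
}

# Return 0 if the kernel has a default IPv6 route, 1 otherwise
# (also 1 when the `ip` tool is not available).
has_v6_default_route() {
    command -v ip >/dev/null 2>&1 || return 1
    [ -n "$(ip -6 route show default 2>/dev/null)" ]
}
```

Something like `ns_aaaa a.dns.gandi.net b.dns.gandi.net c.dns.gandi.net`
followed by `has_v6_default_route || echo "no IPv6 default route"` should show
whether the name servers publish AAAA records while the host has no way to
reach them over IPv6.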

All I had to do to get rid of those ServFails was to add "-4" to the
startup options of BIND (Debian: /etc/default/bind9, OPTIONS=).
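
On the Debian box described above, the change would look roughly like this.
The "-u bind" part is an assumption about what the packaged file already
contains; keep whatever options are already there and only append "-4".

```shell
# /etc/default/bind9 (Debian)
# "-u bind" is assumed to be the existing packaged default.
# "-4" tells named to use IPv4 only, even if the host is capable of IPv6.
OPTIONS="-u bind -4"
```

named has to be restarted afterwards for the new option to take effect.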

I still don't know whether this is a bug in BIND that only shows up when
the interface has a link-local IPv6 address, the remote name servers have
AAAA records, and the network doesn't handle IPv6.

The solution I applied works, but I'm not satisfied with it.
Any clarification is of course welcome.

Greetings,
Florian







