Resolver timeouts, EDNS and networking

Fri Sep 28 20:58:01 UTC 2007

Christian Robottom Reis wrote:
> On Thu, Sep 27, 2007 at 07:27:10PM -0400, Kevin Darcy wrote:
>   
>>> Has anyone seen this before? Is the EDNS0 issue a red herring, or is
>>> what I'm seeing indicative of EDNS being broken at a few sites,
>>> including my forwarders? I can issue manual EDNS queries (using dig
>>> +bufsize=500) just fine, so I would think not.
>>>   
>>>       
>> Hmmm... bufsize of 500 is rather silly, since that's _below_ the default 
>> buffer size (512). I'd set it to something higher. In fact, I'd probably 
>> do a packet trace of the forwarded queries and then try to replicate 
>> them *exactly* with "dig", including EDNS0 buffer size, source address, 
>> even source port. In the unlikely event that you're TSIG-signing your 
>> queries, I'd mimic that behavior as well. Assuming that you're still 
>> getting timeouts on precisely-mimic'ed queries, then I'd start changing 
>> things to see what makes it work better. A DNS query packet has only a 
>> finite number of attributes -- it should be possible to home in on the 
>> attribute or combination of attributes that is giving rise to the problem.
>>     
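For anyone wanting to script that kind of replay rather than drive it through dig, here is a rough Python sketch that builds a query with or without an EDNS0 OPT record and sends it from a chosen source address and port. The names build_query and send_query, the fixed query ID and the default values are my own assumptions for illustration; copy the real values from your packet trace.

```python
import socket
import struct


def build_query(name, qtype=1, query_id=0x1234, edns_bufsize=None):
    """Return a DNS query in wire format.

    If edns_bufsize is given, an EDNS0 OPT pseudo-record advertising that
    UDP payload size is appended to the additional section.
    """
    arcount = 1 if edns_bufsize is not None else 0
    # Header: ID, flags (RD bit set), QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, arcount)
    # QNAME: length-prefixed labels terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    packet = header + qname + struct.pack("!HH", qtype, 1)  # QCLASS = IN
    if edns_bufsize is not None:
        # OPT RR: root owner name, TYPE 41, CLASS = payload size,
        # TTL = 0 (extended rcode, version, flags), RDLENGTH = 0
        packet += b"\x00" + struct.pack("!HHIH", 41, edns_bufsize, 0, 0)
    return packet


def send_query(packet, server, source=("0.0.0.0", 0), timeout=5.0):
    """Send the packet over UDP from a specific source address and port.

    Binding to a port below 1024 normally needs root. Raises
    socket.timeout if no reply arrives within the timeout.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(source)
        sock.settimeout(timeout)
        sock.sendto(packet, (server, 53))
        return sock.recvfrom(4096)[0]
```

Varying one argument at a time (EDNS0 on or off, buffer size, source port) and comparing which combinations time out is the scripted equivalent of changing dig options by hand.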
>
> So after a day and night of this, ISTM that the resolvers appear to be
> red herrings. I disabled the resolvers last night, but since that was
> outside office peak hours the timeouts lessened anyway, and today, as
> soon as the office got busy, I am seeing timeouts peak at 87 in a single
> minute (just counting the "too many timeouts" string in the debug log).
>
> A few are for servers I would not expect to time out:
>
>       6 0x8288158(api.del.icio.us/A'):
>       4 0xb489e018(ns-2.nipcable.com.br/A'):
>       3 0xb452f9d0(f3.yahoofs.com/A'):
>       3 0x8296f68(unisys.com/A'):
>       2 0xb489b080(zd.akadns.org/A'):
>       2 0xb455c250(row.bc.yahoo.com/A'):
>
> Am I right in assuming that when the server logs a "too many timeouts"
> it's likely that the client resolver library will have given up and
> reported an error upstream?
>   
Maybe, maybe not. Stub resolvers differ greatly in their timeouts, and 
this is even tunable in many resolvers. If the resolver times out, most 
likely it'll retry the lookup and you'll see this as a separate query in 
your logs. Or, it may be configured with no retries at all and just 
fail. "Too many timeouts" means that named is giving up; it doesn't 
necessarily have anything to do with what the client is doing.
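To illustrate how much the client side can vary, here is a minimal Python sketch of a stub-resolver-style retry loop. The function name query_with_retries, its defaults (2 second timeout, 2 retries) and the make_socket hook are assumptions for illustration, not the behaviour of any particular resolver library.

```python
import socket


def query_with_retries(payload, server, timeout=2.0, retries=2, make_socket=None):
    """Send a UDP payload and wait for a reply, retrying on timeout.

    Returns (reply, attempts). reply is None if every attempt timed out.
    This sketch does not check that the reply matches the query ID.
    """
    if make_socket is None:
        def make_socket():
            return socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    attempts = 0
    for _ in range(retries + 1):
        attempts += 1
        sock = make_socket()
        try:
            sock.settimeout(timeout)
            sock.sendto(payload, server)
            reply, _addr = sock.recvfrom(4096)
            return reply, attempts
        except socket.timeout:
            # No answer in time: fall through and try again, if allowed.
            continue
        finally:
            sock.close()
    return None, attempts
```

With retries=0 the client fails after a single timeout; with a higher value each retry reaches named as a separate query, which is why the server's "too many timeouts" count and the client's view of failures need not line up.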
> The fact that the problems are really intermittent and that I am unable
> to reproduce any EDNS-related failures (just following the hint I picked
> up at http://lists.debian.org/debian-user/2005/10/msg03334.html)
> suggests to me that either the network latency rises too high (it's
> around 40ms to my upstream hop, and I can see some packet loss, though
> not more than 5%) or the server is overloaded doing reverse-DNS
> queries for apache and DNSBL-related queries for sendmail.
>   
It's still not completely satisfying, though, since you don't know the 
real root cause.
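One way to get closer to a root cause is to tally the timeouts per name and per minute, then line the busy minutes up against load or packet-loss measurements. A rough Python sketch follows; the log line layout (a leading "DD-Mon-YYYY HH:MM" timestamp and a "(name/TYPE')" token, as in the counts quoted above) is an assumption, so adjust the patterns to whatever your debug log actually contains.

```python
import re
from collections import Counter

# Assumed patterns; the exact debug log layout is not confirmed here.
NAME_RE = re.compile(r"\(([^()/]+)/([A-Z0-9]+)'?\)")
MINUTE_RE = re.compile(r"^(\d{2}-\w{3}-\d{4} \d{2}:\d{2})")


def tally_timeouts(lines, needle="too many timeouts"):
    """Count matching log lines per (name, type) and per minute."""
    per_name = Counter()
    per_minute = Counter()
    for line in lines:
        if needle not in line:
            continue
        name_match = NAME_RE.search(line)
        if name_match:
            per_name[(name_match.group(1), name_match.group(2))] += 1
        minute_match = MINUTE_RE.match(line)
        if minute_match:
            per_minute[minute_match.group(1)] += 1
    return per_name, per_minute


if __name__ == "__main__":
    import sys

    # Usage: python tally.py < debug.log   (file name is hypothetical)
    names, minutes = tally_timeouts(sys.stdin)
    for (name, rtype), count in names.most_common(10):
        print(count, name, rtype)
    for minute, count in sorted(minutes.items()):
        print(minute, count)
```

If the worst minutes coincide with bursts of reverse-DNS or DNSBL lookups, that points at server load; if they coincide with packet loss on the upstream link, that points at the network.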

- Kevin