Resolver timeouts, EDNS and networking

Fri Sep 28 13:41:46 UTC 2007

On Thu, Sep 27, 2007 at 07:27:10PM -0400, Kevin Darcy wrote:
> > Has anyone seen this before? Is the EDNS0 issue a red herring, or is
> > what I'm seeing indicative of EDNS being broken at a few sites,
> > including my forwarders? I can issue manual EDNS queries (using dig
> > +bufsize=500) just fine, so I would think not.
> >   
> Hmmm... bufsize of 500 is rather silly, since that's _below_ the default 
> buffer size (512). I'd set it to something higher. In fact, I'd probably 
> do a packet trace of the forwarded queries and then try to replicate 
> them *exactly* with "dig", including EDNS0 buffer size, source address, 
> even source port. In the unlikely event that you're TSIG-signing your 
> queries, I'd mimic that behavior as well. Assuming that you're still 
> getting timeouts on precisely-mimic'ed queries, then I'd start changing 
> things to see what makes it work better. A DNS query packet has only a 
> finite number of attributes -- it should be possible to home in on the 
> attribute or combination of attributes that is giving rise to the problem.
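
To replay a captured query along those lines, the dig invocation could
be assembled with something like the sketch below. The helper name and
the example values are my own (hypothetical), and it only uses the dig
options I am sure of: "@server", "+bufsize=" and "-b address#port".
TSIG signing is not covered.

```python
def dig_command(server, name, qtype="A", bufsize=4096,
                src_addr=None, src_port=None):
    """Build a dig argv list that mimics a captured forwarded query.

    bufsize is the EDNS0 UDP buffer size to advertise; 4096 is only an
    example value, use whatever the packet trace shows.
    """
    cmd = ["dig", f"@{server}", name, qtype, f"+bufsize={bufsize}"]
    if src_addr is not None:
        # dig accepts "-b address" or "-b address#port"
        bind = src_addr if src_port is None else f"{src_addr}#{src_port}"
        cmd += ["-b", bind]
    return cmd
```

The list can be handed to subprocess.run() as is, for example
dig_command("192.0.2.53", "bugs.launchpad.net", src_addr="192.0.2.10",
src_port=32768), where the addresses and port are placeholders.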

So after a day and a night of this, it seems to me that the resolvers
are a red herring. I disabled them last night and the timeouts
lessened, but that was outside office peak hours. Today, as soon as the
office got busy, I am seeing timeouts peak at 87 in a single minute
(just counting the "too many timeouts" string in the debug log).

A few are for servers I would not expect to time out:

      6 0x8288158(api.del.icio.us/A'):
      4 0xb489e018(ns-2.nipcable.com.br/A'):
      3 0xb452f9d0(f3.yahoofs.com/A'):
      3 0x8296f68(unisys.com/A'):
      2 0xb489b080(zd.akadns.org/A'):
      2 0xb455c250(row.bc.yahoo.com/A'):
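
For reference, a per-minute and per-name tally of this kind can be
produced with something like the sketch below. It assumes each debug
log line starts with a timestamp of the form "28-Sep-2007 13:41:46.123"
and carries the "(name/type):" tag right before "too many timeouts";
that layout is an assumption, so adjust the regex to the real log. It
groups by the name inside the parentheses only and ignores the 0x...
address in front of it.

```python
import re
from collections import Counter

# Assumed line layout (adjust to the actual debug log):
#   28-Sep-2007 13:41:46.123 ... 0x8288158(api.del.icio.us/A'): too many timeouts ...
LINE_RE = re.compile(
    r"^(?P<minute>\d{2}-\w{3}-\d{4} \d{2}:\d{2}):\d{2}"  # timestamp cut to the minute
    r".*?\((?P<name>[^)]*)\):\s*too many timeouts"       # (name/type): too many timeouts
)

def tally_timeouts(lines):
    """Return (per_minute, per_name) Counters of "too many timeouts" lines."""
    per_minute = Counter()
    per_name = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # not a timeout line, or not in the assumed layout
        per_minute[m.group("minute")] += 1
        per_name[m.group("name")] += 1
    return per_minute, per_name
```

Feeding it an open log file and calling per_minute.most_common(1) gives
the busiest minute; per_name.most_common(6) gives a list like the one
above.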

Am I right in assuming that when the server logs a "too many timeouts"
message, it's likely that the client resolver library will already have
given up and reported an error upstream?

The fact that the problems are really intermittent and that I am unable
to reproduce any EDNS-related failures (just following the hint I picked
up at http://lists.debian.org/debian-user/2005/10/msg03334.html)
suggests to me that either the network latency rises too high (it's
around 40ms to my upstream hop, and I can see some packet loss, though
not more than 5%) or the server is overloaded doing reverse-DNS
queries for Apache and DNSBL-related queries for sendmail.

> Note that the 10-minute TTL on bugs.launchpad.net is going to incur a 
> fairly high fetch rate, and if there is some sort of connectivity 
> problem between your ISP's nameservers and the launchpad.net 
> nameservers, you could very well get timeouts. Is it possible that all 

Yeah, I raised that with the sysadmins and have requested they increase
that, but I am still left with my general problem.

> Another, somewhat non-scalable, high-maintenance "middle ground" option 
> would be to keep your forwarding configuration, but define 
> "launchpad.net" as a "type stub" zone. The high-maintenance part comes 

The problem is that I'm not really restricted to launchpad.net -- we get
timeouts for assorted queries. I just picked launchpad.net because I
care about it, and because it was easy to find in the logs!
-- 
Christian Robottom Reis | http://async.com.br/~kiko/ | [+55 16] 3376 0125