Resolver timeouts, EDNS and networking
Christian Robottom Reis
kiko at async.com.br
Fri Sep 28 13:41:46 UTC 2007
On Thu, Sep 27, 2007 at 07:27:10PM -0400, Kevin Darcy wrote:
> > Has anyone seen this before? Is the EDNS0 issue a red herring, or is
> > what I'm seeing indicative of EDNS being broken at a few sites,
> > including my forwarders? I can issue manual EDNS queries (using dig
> > +bufsize=500) just fine, so I would think not.
> Hmmm... bufsize of 500 is rather silly, since that's _below_ the default
> buffer size (512). I'd set it to something higher. In fact, I'd probably
> do a packet trace of the forwarded queries and then try to replicate
> them *exactly* with "dig", including EDNS0 buffer size, source address,
> even source port. In the unlikely event that you're TSIG-signing your
> queries, I'd mimic that behavior as well. Assuming that you're still
> getting timeouts on precisely-mimic'ed queries, then I'd start changing
> things to see what makes it work better. A DNS query packet has only a
> finite number of attributes -- it should be possible to home in on the
> attribute or combination of attributes that is giving rise to the problem.
So after a day and a night of this, ISTM that the resolvers are a red
herring. I disabled the resolvers last night and saw the timeouts
lessen, but that was outside office peak hours. Today, as soon as the
office got busy, I am seeing timeouts peak at 87 in a single minute
(just counting the "too many timeouts" string in the debug log).
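For anyone wanting to repeat that per-minute count, here is a minimal
sketch in Python. It assumes each debug log line starts with a timestamp
of the form "28-Sep-2007 13:41:46.123"; that layout, the function names
and the log path are assumptions for illustration, so adjust the pattern
if your log differs.

```python
import re
from collections import Counter

# Assumed timestamp layout at the start of each log line, e.g.
# "28-Sep-2007 13:41:46.123". Group 1 captures date plus hour:minute.
STAMP = re.compile(r"^(\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}):\d{2}")


def timeouts_per_minute(lines, needle="too many timeouts"):
    """Count lines containing `needle`, grouped by the minute they were logged."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        match = STAMP.match(line)
        if match:  # lines without a recognisable timestamp are skipped
            counts[match.group(1)] += 1
    return counts


def report(path, top=10):
    """Print the `top` busiest minutes found in the log file at `path`."""
    with open(path, errors="replace") as fh:
        for minute, count in timeouts_per_minute(fh).most_common(top):
            print(minute, count)
```

Calling report() with the path to the debug log (hypothetical, e.g.
"/var/log/named/debug.log") lists the worst minutes first, which makes
it easy to line the peaks up against office hours.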
A few are for servers I would not expect to time out:
Am I right in assuming that when the server logs a "too many timeouts"
it's likely that the client resolver library will have given up and
reported an error upstream?
The fact that the problems are really intermittent and that I am unable
to reproduce any EDNS-related failures (just following the hint I picked
up at http://lists.debian.org/debian-user/2005/10/msg03334.html)
suggests to me that either the network latency rises too high (it's
around 40ms to my upstream hop, and I can see some packet loss, though
not more than 5%) or the server is overloaded doing reverse-DNS
queries for apache and DNSBL-related queries for sendmail.
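To rule EDNS in or out more systematically than with one-off dig runs,
the query can also be built by hand so that every attribute is under
control. The following is only a sketch under stated assumptions: it
uses nothing but the Python standard library, the function names are
made up for illustration, and it builds a plain recursive query carrying
one EDNS0 OPT pseudo-record with a chosen UDP buffer size. It does not
handle TSIG, nor does it pin the source address or port.

```python
import socket
import struct


def build_query(name, qtype=1, bufsize=4096, qid=0x1234):
    """Build a DNS query for `name` carrying one EDNS0 OPT pseudo-record."""
    # Header: ID, flags (only RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0,
    # ARCOUNT=1 (the OPT record lives in the additional section).
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 1)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QTYPE, QCLASS=IN
    # OPT: root name, TYPE=41, CLASS=advertised UDP payload size,
    # TTL=0 (extended RCODE, version 0, no flags), RDLENGTH=0.
    opt = b"\x00" + struct.pack("!HHIH", 41, bufsize, 0, 0)
    return header + question + opt


def send_query(server, packet, timeout=5.0, port=53):
    """Send `packet` over UDP; raises socket.timeout if no reply arrives."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(packet, (server, port))
        data, _ = sock.recvfrom(65535)
        return data
```

Running send_query() in a loop against a forwarder with a few different
bufsize values (above 512, as Kevin suggests) and counting the
socket.timeout exceptions would show whether the failures track the
EDNS buffer size or simply the time of day.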
> Note that the 10-minute TTL on bugs.launchpad.net is going to incur a
> fairly high fetch rate, and if there is some sort of connectivity
> problem between your ISP's nameservers and the launchpad.net
> nameservers, you could very well get timeouts. Is it possible that all
Yeah, I raised that with the sysadmins and have requested they increase
that, but I am still left with my general problem.
> Another, somewhat non-scalable, high-maintenance "middle ground" option
> would be to keep your forwarding configuration, but define
> "launchpad.net" as a "type stub" zone. The high-maintenance part comes
The problem is that I'm not really restricted to launchpad.net -- we get
timeouts for assorted queries. I just picked launchpad.net because I
care about it, and because it was easy to find in the logs!
Christian Robottom Reis | http://async.com.br/~kiko/ | [+55 16] 3376 0125