Resolver timeouts, EDNS and networking

Thu Sep 27 23:20:09 UTC 2007

If it were me, I would start by disabling EDNS0, which is often the
cause of problems like this. You can try lowering the advertised buffer
size to 512 bytes (edns-udp-size), but if that doesn't solve it, turn
EDNS off entirely for the forwarders (edns no; in a server statement).

If that solves it, then probably the forwarders are the problem. Ask  
your ISP about this; they may be using some kind of security software  
that is not able to handle EDNS0.
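
One way to check this yourself before calling the ISP is to send the
same question with and without an EDNS0 OPT record straight at each
forwarder. Here is a rough Python sketch of that idea, not anything
BIND ships. The names build_query and probe are made up for
illustration, and 192.0.2.1 in the usage note is only a placeholder
for a forwarder address.

```python
import socket
import struct


def build_query(name, edns_bufsize=None, txid=0x1234):
    """Build a DNS A query; append an EDNS0 OPT record if edns_bufsize is set."""
    arcount = 1 if edns_bufsize is not None else 0
    # Header: ID, flags (RD bit set), QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, arcount)
    # QNAME: length-prefixed labels terminated by the root label
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    if edns_bufsize is None:
        return header + question
    # OPT pseudo-record: root name, TYPE=41, CLASS carries the UDP payload
    # size, TTL=0 (extended RCODE 0, version 0, no flags), RDLENGTH=0
    opt = b"\x00" + struct.pack("!HHIH", 41, edns_bufsize, 0, 0)
    return header + question + opt


def probe(server, name, edns_bufsize=None, timeout=5.0):
    """Send one UDP query; return the reply length in bytes, or None on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, edns_bufsize), (server, 53))
        try:
            reply, _ = sock.recvfrom(65535)
        except socket.timeout:
            return None
    return len(reply)
```

If probe("192.0.2.1", "bugs.launchpad.net", edns_bufsize=4096) comes
back None while the same call without edns_bufsize gets an answer,
that would point at something in the path dropping EDNS0 queries or
their larger replies.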

There will always be some timeouts. That's just the nature of the
Internet today. But they should not affect a large percentage of your
server's resolution attempts.
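
To put a number on "a large percentage", you could count how many
fetch contexts in the debug log hit at least one timeout. This is a
rough Python sketch under two assumptions: each log message sits on a
single line, and the messages use the "fctx 0x...(name/type'): create"
and "...: timeout" wording seen in the excerpt quoted below. The name
timeout_share is made up, and the parsing is a guess at the format,
not a documented BIND interface.

```python
import re

# Matches resolver debug lines such as
#   ... fctx 0xb49971e0(bugs.launchpad.net/A'): create
#   ... fctx 0xb49971e0(bugs.launchpad.net/A'): timeout
FCTX_RE = re.compile(r"fctx (0x[0-9a-f]+)\([^)]*\): (create|timeout)\s*$")


def timeout_share(lines):
    """Return (fetches created, fetches that timed out at least once)."""
    created = 0
    timed_out = 0
    flagged = set()  # fctx addresses already counted as timed out
    for line in lines:
        match = FCTX_RE.search(line)
        if not match:
            continue
        addr, event = match.groups()
        if event == "create":
            created += 1
            # Addresses can be reused by later fetches, so reset the flag.
            flagged.discard(addr)
        elif addr not in flagged:
            flagged.add(addr)
            timed_out += 1
    return created, timed_out
```

Feeding it the lines of the log file gives two counts whose ratio you
can compare with forwarders on and off.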

Chris Buxton
Men & Mice

On Sep 27, 2007, at 2:55 PM, Christian Robottom Reis wrote:

> Hello there,
>
>     For quite a while now, a number of users on our internal network
> have had queries time out, and I've been investigating over the past
> week to try to find out why. The symptom is simply that at times:
>
>     % host bugs.launchpad.net
>     Nameserver not responding
>     bugs.launchpad.net A record not found, try again
>
> I enabled debugging and looked at my name server logs. What I could see
> was (this is simplified; I only left in messages at important timestamp
> boundaries, and the full log is at
> http://async.com.br/~kiko/bugs-dns1.log):
>
>     27-Sep-2007 16:22:41.870
>         database: debug 1: no_references: delete from rbt:
>         0xb4930e 40 bugs.launchpad.net
>
> So the entry got evicted from the cache. Then I see a query coming in
> about five minutes later:
>
>     27-Sep-2007 16:27:36.078
>         queries: info: client 192.168.99.4#43950:
>         query: bugs.launchpad.net IN A +
>     27-Sep-2007 16:27:36.078
>         resolver: debug 1: createfetch: bugs.launchpad.net A
>     27-Sep-2007 16:27:36.078
>         resolver: debug 3: fctx 0xb49971e0(bugs.launchpad.net/A'): create
>
> Then, 5 seconds later, I see this:
>
>     27-Sep-2007 16:27:41.090
>         resolver: debug 3: fctx 0xb49971e0(bugs.launchpad.net/A'): timeout
>
> After about 3 iterations, we finally get this:
>
>     27-Sep-2007 16:27:56.099
>          resolver: debug 3: fctx 0xb49971e0(bugs.launchpad.net/A'):
>          too many timeouts, disabling EDNS0
>
> I think at this point the client got an error.
>
> The funny thing is that bugs.launchpad.net /does/ resolve fine using
> other name servers, including my forwarders directly (though there is of
> course a timing difference). And if I insist, it eventually resolves
> fine on my server too:
>
>     http://async.com.br/~kiko/bugs-dns3.log
>
> Of course, once it's cached, it's fine and everyone enjoys the entry:
>
>     http://async.com.br/~kiko/bugs-dns3.log
>
> I was originally using 3 ISP-provided forwarders since they are intended
> to improve performance; however, I decided to try disabling them to see
> if it improved things. And, if I do a profile of the timeout counts,
> indeed it does reduce the timeouts, though I do still get a few server
> failures on a few sites, which suggests this isn't the entire problem --
> but maybe the sites are just broken or taking too long to answer.
>
> Has anyone seen this before? Is the EDNS0 issue a red herring, or is
> what I'm seeing indicative of EDNS being broken at a few sites,
> including my forwarders? I can issue manual EDNS queries (using dig
> +bufsize=500) just fine, so I would think not.
>
> I'm behind a Linux firewall, but I have complete control over it, and I
> doubt there is any UDP issue there (can I test somehow?)
>
> If this just means that the network is acting flaky (high-latency or
> otherwise) and it's going to happen, why does it improve when I disable
> forwarders, and is there a way to mitigate the problem, perhaps using
> caching, a different protocol, traffic shaping or an act of god?
>
> Thanks!
> -- 
> Christian Robottom Reis | http://async.com.br/~kiko/ | [+55 16] 3376 0125
>
>