Hello there,

For quite a while now, a number of users on our internal network have been seeing DNS queries time out, and I've spent the past week trying to find out why. The symptom is simply that at times:

    % host bugs.launchpad.net
    Nameserver not responding
    bugs.launchpad.net A record not found, try again

I enabled debugging and looked at my name server logs. What I could see was the following (this is simplified; I only left in messages at important timestamp boundaries, and the full log is at http://async.com.br/~kiko/bugs-dns1.log):

    27-Sep-2007 16:22:41.870 database: debug 1: no_references: delete from rbt: 0xb4930e40 bugs.launchpad.net

So the entry got evicted from the cache. Then I see a query coming in about five minutes later:

    27-Sep-2007 16:27:36.078 queries: info: client 192.168.99.4#43950: query: bugs.launchpad.net IN A +
    27-Sep-2007 16:27:36.078 resolver: debug 1: createfetch: bugs.launchpad.net A
    27-Sep-2007 16:27:36.078 resolver: debug 3: fctx 0xb49971e0(bugs.launchpad.net/A'): create

Then, 5 seconds later, I see this:

    27-Sep-2007 16:27:41.090 resolver: debug 3: fctx 0xb49971e0(bugs.launchpad.net/A'): timeout

After about 3 more iterations, 20 seconds after the fetch was created, we finally get this:

    27-Sep-2007 16:27:56.099 resolver: debug 3: fctx 0xb49971e0(bugs.launchpad.net/A'): too many timeouts, disabling EDNS0

I think at this point the client got an error.

The funny thing is that bugs.launchpad.net /does/ resolve fine using other name servers, including my forwarders when queried directly (though there is of course a timing difference). And if I insist, it eventually resolves fine on my server too:

    http://async.com.br/~kiko/bugs-dns3.log

Of course, once it's cached, it's fine and everyone enjoys the entry:

    http://async.com.br/~kiko/bugs-dns3.log

I was originally using 3 ISP-provided forwarders, since they are intended to improve performance; however, I decided to try disabling them to see if that improved things.
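
To compare runs with and without forwarders, the resolver messages quoted above can be tallied per name. What follows is only a minimal sketch: it assumes the log lines look exactly like the excerpts in this message, and the function name count_timeouts and the sample list are made up for illustration.

```python
import re
from collections import Counter

# Matches resolver lines of the form seen in the excerpts, e.g.
#   ... resolver: debug 3: fctx 0xb49971e0(bugs.launchpad.net/A'): timeout
# Assumption: the name contains no "/" and the message follows "): ".
FCTX_RE = re.compile(
    r"resolver: debug \d+: fctx 0x[0-9a-f]+\(([^/]+)/(\w+)'?\): (.+)$"
)

def count_timeouts(lines):
    """Return (timeout count per name, set of names where EDNS0 was disabled)."""
    timeouts = Counter()
    edns_disabled = set()
    for line in lines:
        match = FCTX_RE.search(line.rstrip())
        if not match:
            continue  # not a fetch-context message
        name, _rtype, event = match.groups()
        if event == "timeout":
            timeouts[name] += 1
        elif event.startswith("too many timeouts"):
            edns_disabled.add(name)
    return timeouts, edns_disabled

# Sample taken from the simplified excerpts above (not the full log).
sample = [
    "27-Sep-2007 16:27:36.078 resolver: debug 3: fctx 0xb49971e0(bugs.launchpad.net/A'): create",
    "27-Sep-2007 16:27:41.090 resolver: debug 3: fctx 0xb49971e0(bugs.launchpad.net/A'): timeout",
    "27-Sep-2007 16:27:56.099 resolver: debug 3: fctx 0xb49971e0(bugs.launchpad.net/A'): too many timeouts, disabling EDNS0",
]

timeouts, edns_disabled = count_timeouts(sample)
for name, count in timeouts.most_common():
    flag = " (EDNS0 disabled)" if name in edns_disabled else ""
    print(f"{name}: {count} timeout(s){flag}")
```

Feeding it the lines of a full log file (for instance via open() on a local copy) instead of the sample gives one count per name, which makes the two configurations easy to compare side by side.
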
And when I profile the timeout counts, disabling the forwarders does indeed reduce the timeouts, though I still get server failures on a few sites. That suggests the forwarders aren't the entire problem, but maybe those sites are just broken or taking too long to answer.

Has anyone seen this before? Is the EDNS0 issue a red herring, or is what I'm seeing indicative of EDNS being broken at a few sites, including my forwarders? I can issue manual EDNS queries (using dig +bufsize=500) just fine, so I would think not. I'm behind a Linux firewall, but I have complete control over it, and I doubt there is any UDP issue there (can I test somehow?).

If this just means that the network is acting flaky (high latency or otherwise) and timeouts are going to happen, why does it improve when I disable forwarders? And is there a way to mitigate the problem, perhaps using caching, a different protocol, traffic shaping or an act of god?

Thanks!
-- 
Christian Robottom Reis | http://async.com.br/~kiko/ | [+55 16] 3376 0125
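
One possible way to check the UDP path through the firewall, beyond dig +bufsize=500, is to send the same EDNS0 query with several advertised buffer sizes and see which ones get an answer. The sketch below builds the query by hand with the standard library only; the helper names build_edns_query and probe are hypothetical, the server address comes from the command line, and a missing reply on a single attempt proves little on its own.

```python
import socket
import struct
import sys

def build_edns_query(name, bufsize=4096, txid=0x1234):
    """Build a DNS A query for `name` carrying an EDNS0 OPT record."""
    # Header: id, flags (RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 1)
    # Question: length-prefixed labels, root label, QTYPE=A (1), QCLASS=IN (1)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)
    # OPT pseudo-record: root name, TYPE=41, CLASS=advertised UDP payload
    # size, TTL=0 (extended RCODE, version 0, no flags), RDLENGTH=0
    opt = struct.pack("!BHHIH", 0, 41, bufsize, 0, 0)
    return header + question + opt

def probe(server, name, bufsize, timeout=5.0):
    """Send one query over UDP; return the reply size in bytes, or None on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_edns_query(name, bufsize), (server, 53))
        try:
            reply, _addr = sock.recvfrom(65535)
        except socket.timeout:
            return None
    return len(reply)

# Usage (hypothetical script name): python edns_probe.py <server-ip> <name>
if __name__ == "__main__" and len(sys.argv) == 3:
    for size in (512, 1232, 4096):
        result = probe(sys.argv[1], sys.argv[2], size)
        print(f"bufsize={size}: {'timeout' if result is None else result}")
```

Running it from a host behind the firewall and again from one outside it, against the same server, would show whether the behaviour differs between the two paths.
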