Bind timeout bug still in 9.6.1, it seems

Fri Aug 28 17:34:24 UTC 2009

About a year ago, in message <g93sh2$1s05$1 at sf1.isc.org>, Matus Uhlar
noted that it was probably a BIND bug when Aliet Santiesteban Sifontes
saw a lot of bogus EDNS0 failures because his named repeatedly timed out
and closed pending requests after only 600ms, even though the hardcoded
timeout in the source code (bin/named/client.c) is 30s or 60s depending
on the state.

The thread subject was "EDNS and DNSSEC impossible to use in Satellite
links", and it was a follow-up to earlier threads by Aliet about the
problems that were traced to this apparent bug.

Monday night this week, I saw the same problem with named compiled from
BIND 9.6.1 sources.  I had compiled for 32 bit x86 Linux in a 32 bit
chroot under a 64 bit x86_64 Linux 2.6.30 single-cpu kernel.
Running named with -g -d 3 produced a log on stderr showing timeouts of
less than 2 seconds, followed by named doing all kinds of wrong handling
for the fictitious error, such as disabling EDNS and reporting SERVFAIL
to dig.

A wireshark capture showed that most of the timed out queries did
receive responses in less than 5 seconds, but those responses were
rejected with "port (already) closed" ICMP messages (that were
fortunately dropped by a firewall before annoying the root servers).
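
To check from outside named whether a given server answers an EDNS0
query within a chosen time, a small probe along the following lines
could be used.  This is only a sketch, not something taken from BIND or
from the test setup above: the choice of Python, the function names
build_query and probe, and the fixed query ID are all assumptions made
for illustration.

```python
import socket
import struct
import time


def build_query(name, qtype=1, edns_size=4096, do_bit=True, query_id=0x1234):
    """Build a DNS query with one question and an EDNS0 OPT record.

    Hypothetical helper; query_id is fixed here only to keep the sketch
    short, a real client should randomize it.
    """
    # Header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 1)
    # Question: length-prefixed labels, root terminator, QTYPE, QCLASS=IN
    labels = [part for part in name.rstrip(".").split(".") if part]
    qname = b"".join(bytes([len(p)]) + p.encode("ascii") for p in labels)
    question = qname + b"\x00" + struct.pack("!HH", qtype, 1)
    # OPT pseudo-RR: root name, TYPE=41, CLASS=UDP payload size,
    # TTL=extended rcode/version/flags (DO is bit 0x8000), RDLEN=0
    ttl = 0x8000 if do_bit else 0
    opt = struct.pack("!BHHIH", 0, 41, edns_size, ttl, 0)
    return header + question + opt


def probe(server, name, timeout=5.0, port=53):
    """Send one EDNS0 query over UDP.

    Returns (elapsed_seconds, reply_length), or None if nothing arrived
    within `timeout` seconds.  The reply is not parsed or validated.
    """
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        sock.sendto(query, (server, port))
        try:
            reply, _addr = sock.recvfrom(65535)
        except socket.timeout:
            return None
        elapsed = time.monotonic() - start
    return elapsed, len(reply)
```

Calling probe() with a server address and a name such as "." and
comparing the elapsed time against the timeouts seen in the named log
would show whether the server itself is slow or named gives up early.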

The named.conf I was testing had DNSSEC enabled with dlv.isc.org as the
only trust anchor.  IPv6 was also enabled.  The setup was a prototype
for a recursive, DNSSEC-validating, caching-mostly server.  This was
a test run only, running on a different machine and with one manual dig
invocation at a time as the only client.

So does anyone know anything about this bug and how to get rid of it?  I
suspect it is responsible for many misdiagnosed complaints about EDNS
failures and such.