Strange Problem: Caching nameservers stopped working properly

Wed Aug 15 02:21:07 UTC 2007

I have a really strange problem. I have several servers running BIND
9.2.x and BIND 9.4.x as caching nameservers, and about a week ago many
of them stopped working properly. Certain lookups now fail outright,
and certain other queries have become very slow to answer.

BIND is configured identically on all the machines, but they are
spread out over four different subnets. All of the servers on two of
the subnets still work fine, whereas all of the servers on the other
two subnets are having problems.

Here's the actual output of two queries, first run on one of the
servers without any problem, then on a server with the problem. I
included one query type that works on the broken server just to prove
that bind is indeed running:

============ Working Server ============

$ host -t ns w3.org localhost
Using domain server:
Name: localhost
Address: 127.0.0.1#53
Aliases:

w3.org name server ns3.w3.org.
w3.org name server ns1.w3.org.
w3.org name server ns2.w3.org.

$ host -t ns zen.spamhaus.org localhost
Using domain server:
Name: localhost
Address: 127.0.0.1#53
Aliases:

zen.spamhaus.org name server o.ns.spamhaus.org.
zen.spamhaus.org name server q.ns.spamhaus.org.
...
zen.spamhaus.org name server n.ns.spamhaus.org.

============ Broken Server =============

$ host -t ns w3.org localhost
Using domain server:
Name: localhost
Address: 127.0.0.1#53
Aliases:

w3.org name server ns3.w3.org.
w3.org name server ns1.w3.org.
w3.org name server ns2.w3.org.

$ host -t ns zen.spamhaus.org localhost
;; connection timed out; no servers could be reached

==============================

You can see that on the "broken" server, the second query for the
zen.spamhaus.org nameservers timed out. This is very consistent;
lookups that work always work, and lookups that are broken are always
broken. The problem is -- I cannot figure out any pattern between the
queries that work and the ones that don't.
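
One way to hunt for a pattern would be to run the same batch of
lookups against a working and a broken server and compare the results.
The following is only a minimal sketch, assuming Python 3 and a
resolver listening on UDP port 53 of 127.0.0.1; the helper names
build_query and probe are made up for illustration, and the fixed
query ID and 5 second timeout are arbitrary choices.

```python
import socket
import struct


def build_query(name, qtype=2, query_id=0x1234):
    """Build a minimal DNS query (qtype 2 = NS, class IN, recursion desired).

    Hypothetical helper; the fixed query_id is an arbitrary assumption.
    """
    # Header: ID, flags (RD bit set), 1 question, no other records.
    header = struct.pack("!6H", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed with its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    return header + qname + b"\x00" + struct.pack("!HH", qtype, 1)


def probe(name, server="127.0.0.1", port=53, timeout=5.0):
    """Send one NS query over UDP and report what came back.

    Returns ("ok", {"size": ..., "rcode": ...}), ("timeout", None),
    or ("error", message).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(build_query(name), (server, port))
        data, _ = sock.recvfrom(4096)
    except socket.timeout:
        return ("timeout", None)
    except OSError as exc:
        return ("error", str(exc))
    finally:
        sock.close()
    if len(data) < 12:
        return ("error", "reply shorter than a DNS header")
    # The low four bits of the second flags byte hold the response code.
    return ("ok", {"size": len(data), "rcode": data[3] & 0x0F})
```

Looping over a list of names on each server, for example
`for n in names: print(n, probe(n))`, would give two result lists that
can be compared side by side. The reply size is reported as well, in
case it turns out to correlate with which lookups fail.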

All the servers are identically configured, and the problem started at
the same time across all the affected servers, so that seems to rule
out a software or hardware issue.

The key point seems to be that all the servers that are failing are on
two particular subnets, and all the servers that are working are on
the two other subnets.

I've run an strace on the named process while it's failing and it
gives the following output:

==========  strace -fp <PID> =============

recvmsg(24, {msg_name(16)={sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("192.35.51.32")},
msg_iov(1)=[{"T\35\204\0\0\1\0\1\0\10\0\n\5henna\4ARIN\3NET\0\0\1\0\1"...,
4096}], msg_controllen=20, msg_control=0x81c95e8, , msg_flags=0}, 0) =
376
brk(0x8201000)                          = 0x8201000
recvmsg(24, 0xbffff850, 0)              = -1 EAGAIN (Resource
temporarily unavailable)
gettimeofday({1187143611, 172447}, NULL) = 0
gettimeofday({1187143611, 172478}, NULL) = 0
gettimeofday({1187143611, 172508}, NULL) = 0
gettimeofday({1187143611, 172717}, NULL) = 0
gettimeofday({1187143611, 172759}, NULL) = 0
gettimeofday({1187143611, 172789}, NULL) = 0
gettimeofday({1187143611, 172823}, NULL) = 0
gettimeofday({1187143611, 172853}, NULL) = 0
gettimeofday({1187143611, 172910}, NULL) = 0
gettimeofday({1187143611, 172942}, NULL) = 0
gettimeofday({1187143611, 172973}, NULL) = 0
gettimeofday({1187143611, 173144}, NULL) = 0
gettimeofday({1187143611, 173185}, NULL) = 0
gettimeofday({1187143611, 173215}, NULL) = 0
gettimeofday({1187143611, 173243}, NULL) = 0
gettimeofday({1187143611, 173273}, NULL) = 0
gettimeofday({1187143611, 173329}, NULL) = 0
gettimeofday({1187143611, 173387}, NULL) = 0
gettimeofday({1187143611, 173443}, NULL) = 0
gettimeofday({1187143611, 173573}, NULL) = 0
gettimeofday({1187143611, 173663}, NULL) = 0
gettimeofday({1187143611, 173751}, NULL) = 0
select(25, [20 21 22 23 24], [], NULL, {0, 0}) = 1 (in [24], left {0, 0})
gettimeofday({1187143611, 173990}, NULL) = 0
recvmsg(24, {msg_name(16)={sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("192.35.51.32")},
msg_iov(1)=[{"\327\217\204\0\0\1\0\1\0\10\0\n\6indigo\4ARIN\3NET\0\0"...,
4096}], msg_controllen=20, msg_control=0x81c95e8, , msg_flags=0}, 0) =
377
recvmsg(24, 0xbffff850, 0)              = -1 EAGAIN (Resource
temporarily unavailable)

=====================================

I'm not sure what to make of this strace output; hopefully someone
more familiar with bind can glean useful information from it.
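
For what it's worth, the bytes that strace shows for the first recvmsg
buffer can be decoded as the start of a DNS message. This is a minimal
sketch, assuming Python 3, that the strace escapes are C-style octal,
and that the buffer begins at the DNS message ID; decode_header and
decode_question are hypothetical helper names. strace truncated the
buffer ("..."), so only the header and the question are decoded.

```python
import struct

# Visible bytes of the first recvmsg buffer, with the octal escapes from
# strace (\35, \204, \10, ...) transcribed into Python hex escapes.
raw = (b"T\x1d\x84\x00\x00\x01\x00\x01\x00\x08\x00\n"
       b"\x05henna\x04ARIN\x03NET\x00\x00\x01\x00\x01")


def decode_header(msg):
    """Decode the fixed 12-byte DNS header into a dict."""
    ident, flags, qd, an, ns, ar = struct.unpack("!6H", msg[:12])
    return {
        "id": ident,
        "is_response": bool(flags & 0x8000),    # QR bit
        "authoritative": bool(flags & 0x0400),  # AA bit
        "truncated": bool(flags & 0x0200),      # TC bit
        "rcode": flags & 0x000F,
        "questions": qd,
        "answers": an,
        "authority": ns,
        "additional": ar,
    }


def decode_question(msg, offset=12):
    """Decode the first question: (name, qtype, qclass)."""
    labels = []
    while msg[offset] != 0:
        length = msg[offset]
        labels.append(msg[offset + 1:offset + 1 + length].decode("ascii"))
        offset += 1 + length
    qtype, qclass = struct.unpack("!HH", msg[offset + 1:offset + 5])
    return ".".join(labels), qtype, qclass


print(decode_header(raw))
print(decode_question(raw))
```

Decoded this way, the packet looks like an authoritative response
(rcode 0, not truncated) from 192.35.51.32 to an A query for
henna.ARIN.NET, carrying 1 answer, 8 authority and 10 additional
records. That would suggest that at least some upstream replies are
reaching named on this server, although it says nothing about the
lookups that time out.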

I can provide any other information if necessary, run diagnostics,
etc. -- I just hope someone can help me figure this out. I've had to
switch to third-party nameservers in the meantime.

Thanks,
Mike