Strange BIND9 issue

Wed Jan 12 07:36:15 UTC 2005

On 2005-01-12, Brad Knowles <brad at stop.mail-abuse.org> wrote:
> At 7:42 PM -0600 2005-01-11, Will Yardley wrote:

>>  radon: 04:56pm# while true ; do dig yahoo.com @66.33.216.127 | grep Query ; done
>>  ;; Query time: 790 msec
>>  ;; Query time: 868 msec
>>  ;; Query time: 753 msec
>>  ;; Query time: 798 msec
>>  ;; Query time: 982 msec
>>  ;; Query time: 1178 msec
>>  ;; Query time: 1284 msec
>>  ;; Query time: 1291 msec
>>  ;; Query time: 1208 msec
>>  ;; Query time: 738 msec

> You're completely by-passing the local caching BIND nameserver here.
> You're going directly to the nameserver specified in the command
> line, and the local copy of BIND is not involved at all.  Unless that
> is the public IP address of your machine, but then queries to
> 127.0.0.1 or the public IP address should be going to the same copy of
> BIND running on the same machine, and I don't understand why this
> would result in the kind of difference you're seeing.

Right - same copy of BIND, same machine, which is why I find it so odd.

Sorry I wasn't more clear about that (though the IP was listed as a
listen-on address in the config snippet).

> Have you seen this kind of behaviour regardless of which IP address
> you query?

When the problem comes up, queries via the machine's loopback interface
are unaffected; queries to the public IP (whether from the machine
itself, which should be essentially the same thing, or from outside)
hit /huge/ timeouts. The actual problem we see is that other services
(ssh, smtp, etc.) then become slow, because DNS lookups on machines with
this nameserver listed first in resolv.conf have to time out and fall
back to the second nameserver listed.

Barry's response is interesting, but we're not doing any funny
networking stuff here (definitely no anycast) - and I'm testing the
query from the local machine, so I think it should be exactly the same
whether the loopback interface or a public interface is being queried.
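
The loopback-versus-public comparison described above could be scripted
roughly as follows. This is only a sketch: the helper names (query_msec,
time_lookup, compare_servers) and the count of five lookups per address
are made up for illustration; only dig, the query name and the public
address come from the thread.

```shell
#!/bin/sh
# Extract the millisecond figure from dig's ";; Query time: N msec" line.
query_msec() {
    awk '/^;; Query time:/ { print $4 }'
}

# Run one lookup of NAME ($2) against SERVER ($1) and print its query time.
# (Hypothetical helper, not part of dig itself.)
time_lookup() {
    dig "$2" "@$1" | query_msec
}

# Print "server<TAB>msec" for five lookups of NAME against each server given.
compare_servers() {
    name=$1
    shift
    for server in "$@"; do
        i=0
        while [ "$i" -lt 5 ]; do
            printf '%s\t%s\n' "$server" "$(time_lookup "$server" "$name")"
            i=$((i + 1))
        done
    done
}
```

Something like `compare_servers yahoo.com 127.0.0.1 66.33.216.127`, run on
the nameserver itself while the slowdown is happening, would put the two
sets of timings side by side.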

>>          recursive-clients 6000;
>>          tcp-clients 1500;
>>          max-cache-size 150000000;

> Why have you defined these?  Why not make the configuration simpler
> and disable them?  If this fixes your problem, then you know where to
> look.  If not, then you know to look elsewhere.

The max-cache-size thing is an attempt to somewhat limit the amount
of memory BIND sucks up... the default is unlimited. Setting it
explicitly should just mean that old records get purged early, no? I
can try disabling this if people really think it will help - before we
switched to rbldnsd for the blocklists (pretty huge zones), we had to
run two BIND instances and they were sucking down /huge/ amounts of
memory between them.

The recursive-clients and tcp-clients were added when this problem or a
similar problem came up just to make sure that we weren't hitting one of
those limits. Re-reading the ARM, it looks like I somewhat misunderstood
the meaning of this option, though. I'll try disabling both of those and
see if that helps...
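
One way to check whether either client limit is actually being reached
is to watch the counters that `rndc status` reports. The sketch below
assumes the counters appear as lines such as "recursive clients: 12/1000"
and "tcp clients: 3/100"; the exact format may differ between BIND
versions, and client_usage is a made-up helper name.

```shell
#!/bin/sh
# Read `rndc status` output on stdin and print "name current limit" for
# the recursive and tcp client counters.
# Assumption: each counter is printed as "current/limit". Some versions
# print "current/soft/limit", so the first and last fields are used.
client_usage() {
    awk -F': ' '/^(recursive|tcp) clients:/ {
        n = split($2, parts, "/")
        print $1, parts[1], parts[n]
    }'
}
```

Running `rndc status | client_usage` during a slow period (this needs a
working rndc setup) would show how close the server is to each limit.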

>>          /* only allow queries from internal networks */
>>          allow-query { dh_known_networks; 127.0.0.0/8; };

> Well, that would pretty much stop you from doing queries to the
> external IP address.

Well, a) I was doing the query from the local machine, and b)
dh_known_networks is defined in the ACL section that I snipped from the
config... it's an ACL which lists all the networks allowed to query the
machine.

Thanks...

w