Strange BIND9 issue

Thu Jan 13 21:52:27 UTC 2005

Will Yardley wrote:
> On 2005-01-12, Brad Knowles <brad at stop.mail-abuse.org> wrote:
> 
>>At 7:42 PM -0600 2005-01-11, Will Yardley wrote:
> 
> 
>>> radon: 04:56pm# while true ; do dig yahoo.com @66.33.216.127 | grep Query ; done
>>> ;; Query time: 790 msec
>>> ;; Query time: 868 msec
>>> ;; Query time: 753 msec
>>> ;; Query time: 798 msec
>>> ;; Query time: 982 msec
>>> ;; Query time: 1178 msec
>>> ;; Query time: 1284 msec
>>> ;; Query time: 1291 msec
>>> ;; Query time: 1208 msec
>>> ;; Query time: 738 msec
> 
> 
>>You're completely by-passing the local caching BIND nameserver here.
>>You're going directly to the nameserver specified in the command
>>line, and the local copy of BIND is not involved at all.  Unless that
>>is the public IP address of your machine, but then queries to
>>127.0.0.1 or the public IP address should be going to the same copy of
>>BIND running on the same machine, and I don't understand why this
>>would result in the kind of difference you're seeing.
> 
> 
> Right - same copy of BIND, same machine, which is why I find it so odd.
> 
> Sorry I wasn't more clear about that (though the IP was listed as a
> listen-on address in the config snippet).
> 
> 
>>Have you seen this kind of behaviour regardless of which IP address
>>you query?
> 
> 
> When the problem comes up, querying via the machine's loopback interface
> doesn't seem to have the problem; querying the public IP (whether from
> the machine itself, which should be essentially the same thing, or from
> outside) experiences /huge/ timeouts. The actual problem we see is that
> other services (ssh, smtp, etc.) then become slow, because DNS lookups on
> machines with this nameserver listed first in resolv.conf have to time
> out before falling back to the second nameserver listed.
> 
> Barry's response is interesting, but we're not doing any funny
> networking stuff here (definitely no anycast) - and I'm testing the
> query from the local machine, so I think it should be exactly the same
> whether the loopback interface or a public interface is being queried.
It's not only a question of anycast.
If one IP gets "hammered" with queries, the OS queue for that IP
can fill up and packets get dropped.
You could test this by configuring another IP on the interface.
If you get slow or no responses on the "known" IP but normal or
fast responses on the second IP, the OS queue is probably
overflowing.
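
A rough sketch of that comparison, assuming a POSIX shell, awk and dig.
The helper names and the second address (192.0.2.53, a documentation
placeholder) are made up for illustration, and named would also have to
be listening on the new address (listen-on) before it answers there.

```shell
#!/bin/sh
# Sketch only: compare dig query times across addresses of one server.

# Read dig output on stdin; print "<mean msec> <answers seen>".
# Timed-out queries print no "Query time" line, so they are not averaged;
# a low answer count is itself a sign of dropped packets.
avg_query_time() {
    awk '/Query time:/ { sum += $4; n++ }
         END { if (n) { printf "%d %d\n", int(sum / n), n } else { print "- 0" } }'
}

# probe SERVER NAME [COUNT]: query NAME at SERVER COUNT times (default 10).
probe() {
    server=$1; name=$2; count=${3:-10}
    i=0
    while [ "$i" -lt "$count" ]; do
        dig "$name" "@$server" +tries=1 +time=2
        i=$((i + 1))
    done | avg_query_time
}

# Example (192.0.2.53 is a placeholder for the extra test address):
#   for ip in 127.0.0.1 66.33.216.127 192.0.2.53; do
#       printf '%s: %s\n' "$ip" "$(probe "$ip" yahoo.com)"
#   done
```

If the extra address answers quickly while the busy one is slow or
mostly silent, that points at the per-address queue rather than at BIND.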
> 
> 
>>>         recursive-clients 6000;
>>>         tcp-clients 1500;
>>>         max-cache-size 150000000;
> 
> 
>>Why have you defined these?  Why not make the configuration simpler
>>and disable them.  If this fixes your problem, then you know where to
>>look.  If not, then you know to look elsewhere.
> 
> 
> The max-cache-size thing is an attempt to somewhat limit the amount
> of memory BIND sucks up... the default is unlimited. Setting it
> explicitly should just mean that old records get purged early, no? I
> can try disabling this if people really think it will help - before we
> switched to rbldnsd for the blocklists (pretty huge zones), we had to
> run two BIND instances and they were sucking down /huge/ amounts of
> memory between them.
I wouldn't disable max-cache-size.
If you disable it, BIND could eat up all memory and the
server would start to swap.
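
To keep an eye on this, something like the following sketch could
compare named's resident memory with the configured limit. The function
name is made up, and since max-cache-size only bounds the cache, an RSS
somewhat above the limit is to be expected.

```shell
#!/bin/sh
# Sketch only: compare a process RSS (in KB, as ps reports it) with a
# byte limit such as the max-cache-size value from named.conf.

# rss_vs_limit RSS_KB LIMIT_BYTES: prints "over" or "ok".
rss_vs_limit() {
    rss_bytes=$(( $1 * 1024 ))
    if [ "$rss_bytes" -gt "$2" ]; then
        echo over
    else
        echo ok
    fi
}

# Example (pidof is an assumption; use whatever finds the named PID):
#   rss_vs_limit "$(ps -o rss= -p "$(pidof named)")" 150000000
```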
> 
> The recursive-clients and tcp-clients were added when this problem or a
> similar problem came up just to make sure that we weren't hitting one of
> those limits. Re-reading the ARM, it looks like I somewhat misunderstood
> the meaning of this option, though. I'll try disabling both of those and
> see if that helps...
The defaults are quite low (tcp 100, udp ?).
We hit the limits all the time, and increasing them helps.
If you set them too high, the server will also have problems.
Our numbers are similar to (or higher than) yours.
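
One way to see how close you are to those limits is the client counts
in "rndc status". A sketch, assuming lines of the form
"recursive clients: 12/6000" (the exact format differs between BIND
versions; some print current/soft/hard); the function name is made up.

```shell
#!/bin/sh
# Sketch only: summarise client usage from "rndc status" output.

# Read "rndc status" output on stdin and print, per matching line,
# "<kind> <current> <limit> <percent used>".
# Only the first and last number are used, so both "cur/max" and
# "cur/soft/max" style lines are handled.
client_usage() {
    awk -F': ' '/(recursive|tcp) clients:/ {
        n = split($2, part, "/")
        cur = part[1]; max = part[n]
        pct = (max > 0) ? int(100 * cur / max) : 0
        split($1, kind, " ")
        print kind[1], cur, max, pct
    }'
}

# Example: rndc status | client_usage
```

Run from cron while the slowdown is happening, this should show whether
either limit is actually being reached.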
> 
> 
>>>         /* only allow queries from internal networks */
>>>         allow-query { dh_known_networks; 127.0.0.0/8; };
> 
> 
>>Well, that would pretty much prevent you from doing queries to the
>>external IP address.
> 
> 
> Well, a) I was doing the query from the local machine, and b)
> dh_known_networks is defined in the ACL section that I snipped from the
> config... it's an ACL which lists all the networks allowed to query the
> machine.
> 
> Thanks...
> 
> w
> 
> 
> 
Guido