bind 9.6.2 with threads hangs

Cathy Almond cathya at isc.org
Tue Mar 23 10:46:20 UTC 2010


Fabien Seisen wrote:
>> This doesn't sound like a hugely loaded server,
>>
> Exactly; in my own test (with "real life" queries), the server can handle
> ~70000 queries/s with a response time of ~1ms at 70% CPU and no
> packet loss.
> 
>> else it's somewhat throttled (not particularly large cache and probably
>> default limit on recursive clients).  What kind of query rates do you 
>> have?  Do you get any logging that suggests resource problems?  If so, 
>> you might need to increase some of the limits.
> 
> We have a pool of several more or less identical servers with a
> load-balancer in front.
> 
> On average, each server gets ~1800 queries/s, and ~4000/s at peak.
> 
> The problem occurs every few weeks, and never on all servers at the same time.
> 
> The recursive-clients setting is not modified (rndc status: recursive clients:
> 188/2900/3000) and we have
> - on avg: 200 recursive clients
> - at peak: 600

OK - so reasonably well-loaded, but not struggling, and it doesn't sound
like it should be hitting any resource problems (although I think your
max-cache-size might be on the small side - would you consider
increasing it?  Are you doing a 64-bit or 32-bit build?)
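If you do decide to raise it, it's a one-line change in the options
section of named.conf.  A minimal sketch (the 512M figure is purely
illustrative - size it to the memory actually available on the box):

    options {
        // Illustrative value only: let the cache grow to 512 MB (per view)
        // before named starts purging entries early to stay under the limit.
        max-cache-size 512M;
    };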

>> It's intriguing that you're seeing the same issues on two bind versions
>> and two OSes (and that other people's experience is different from yours)
>>
> only Solaris 10
> - Solaris 10 U6 with bind 9.5.1-P3 with threads compiled with SUNSpro 12
> - Solaris 10 U6 with bind 9.6.2      with threads compiled with gcc
>
>> - it suggests to me that it's specific to your configuration or client
>> base/queries or your environment.
>>
> 
> We get real-life queries from customers (evil?).

Well, the nameserver is there to answer queries - good, bad,
ill-considered, typos, etc. - and it should accommodate them all.

> A simple "rndc flush" revives named.
> 
> Perhaps a badly formatted packet freezes named or creates a cache deadlock.
> 
> Can something go wrong in the cache?

Yes, sometimes there can be cache contamination (usually confined to a
particular domain though, and due to admin mistakes on the part of that
domain's owners).  It would surprise me to find cache contamination with
this far-reaching an effect, although it's not unknown where you're
using forwarding and your forwarders use different root or high-level
domain NS records.

It's interesting that rndc flush clears the problem - so it might be
cache-related.

You could take a 'normal' cache dump and then a cache dump while the
problem is ongoing.  Look particularly for incorrect NS/A record pairs
fairly high up in your resolution path (I say 'resolution path' because
I don't know whether you're resolving directly or via any forwarding).
Use "rndc dumpdb -all".  Good luck - it can be a bit like looking for a
needle in a haystack.
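A rough sketch of that comparison, assuming the default dump-file name
(named_dump.db) and that you run it from named's working directory -
adjust the paths to your own setup:

    # baseline, taken while named is behaving normally
    rndc dumpdb -all
    mv named_dump.db named_dump.good

    # again when the hang occurs, then compare the two
    rndc dumpdb -all
    diff named_dump.good named_dump.db | less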

> I am not fluent with core files, but I have one in my pocket.

A core file is useful for seeing what named was doing at the instant
that it was created.  It may or may not be useful in this case because
it's only a snapshot.  It would show you a deadlock, for example, but
where named is not hung, just not doing what you expect, a single
snapshot is often not what you need for troubleshooting.  You would need
gdb or dbx to analyze it, along with the exact same binary that created
it, and preferably on the same box (so that the dynamic libs match).
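If you want to take a quick look yourself, something along these lines
with gdb (dbx has equivalent commands; the binary and core paths below
are only examples) will show whether any threads are parked waiting on
locks:

    gdb /usr/local/sbin/named /path/to/named.core
    (gdb) info threads            # list every thread in the process
    (gdb) thread apply all bt     # back-trace each one; several threads
                                  # sitting in mutex/lock routines would
                                  # point towards a deadlock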

>> For troubleshooting I'd start by looking at the logging output - if
>> you've got any categories going to null, un-suppress them temporarily;
>> and add query-errors (see 9.6.2 ARM).  Then perhaps do some sampling of
>> network traffic (perhaps there's a UDP message size/fragmentation issue)
>>  to see what's happening (or not).
>>
> 
> All categories go to non-null channels, and we do not use any 9.6.2-specific
> configuration.  I did not notice any weird log messages (besides the regular:
> shutting down due to TCP receive error: 202.96.209.6#53: connection reset)
> 
> here is our log config:
>     category client { client.log; };
>     category config { config.log; default_syslog; };
>     category database { database.log; default_syslog; };
>     category default { default.log; default_syslog; };
>     category delegation-only { delegation-only.log; };
>     category dispatch { dispatch.log; };
>     category general { default.log; };
>     category lame-servers { lamers.log; };
>     category network { network.log; };
>     category notify { notify.log; default_syslog; };
>     category queries { queries.log; };
>     category resolver { resolver.log; };
>     category security { security; };
>     category unmatched { unmatched.log; };
>     category update { update.log; };
>     category xfer-in { xfer-in.log; default_syslog; };
>     category xfer-out { xfer-out.log; default_syslog; };

The other side of this is the various logging channels used by these
categories - what severity level are they logging at?  I would definitely
recommend the new query-errors category for your 9.6.2 build: set it up
to log to its own channel with severity dynamic, then, when things start
to go wrong, raise the trace level via rndc to debug level 2 and see
whether there are any clues in what you're seeing.  (I'd also recommend
sampling what's output here while named is running normally - failing to
resolve sometimes is expected behaviour!)

(Also note that debug level 2 can be rather busy in the other categories
too.)
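A minimal sketch of the query-errors setup, added to your existing
logging block (the channel name and file sizes are just placeholders,
chosen to match the naming style of your other channels):

    channel query-errors.log {
        file "query-errors.log" versions 3 size 20m;
        severity dynamic;     // follows the server's global debug level
        print-time yes;
        print-category yes;
    };
    category query-errors { query-errors.log; };

Then, when the problem starts:

    rndc trace 2        # raise the global debug level to 2
    rndc notrace        # drop it back to 0 afterwards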

Cathy



