URGENT, PLEASE READ: 9.5.0-P1 now available

Chris Thompson cet1 at hermes.cam.ac.uk
Thu Jul 10 11:44:29 UTC 2008


On Jul 10 2008, JINMEI Tatuya / 神明達哉 wrote:

>For anyone experiencing this problem, I'd like to get the following
>information:

This is for the BIND 9.4.2-P1 nameservers running under Solaris 10_x86
(SunOS 5.10 Generic_127112-10, in non-global zones, on a Sun X4100 M2,
if any of that matters) which I mentioned before - we have had a couple
more incidents overnight.

>- checks whether the server constantly opens such a large number of
>  sockets, e.g., by using lsof

Looking at /proc/[pid]/fd shows that most of the time the high-water
mark is around 60-70. They are nearly all sockets, as expected.
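
For anyone wanting to repeat that check, here is a minimal sketch,
assuming a /proc filesystem that exposes one entry per descriptor under
/proc/[pid]/fd (as on Solaris and Linux). The function name count_fds
is illustrative only and is not part of BIND or any system tool.

```python
import os
import stat


def count_fds(pid):
    """Return (total, sockets) for the descriptors listed in /proc/<pid>/fd."""
    fd_dir = "/proc/%d/fd" % pid
    total = 0
    sockets = 0
    for name in os.listdir(fd_dir):
        try:
            # stat() follows the entry to the underlying open file
            mode = os.stat(os.path.join(fd_dir, name)).st_mode
        except OSError:
            # Descriptor was closed after listdir(), or cannot be examined
            continue
        total += 1
        if stat.S_ISSOCK(mode):
            sockets += 1
    return total, sockets
```

Calling count_fds() on the named process ID every few seconds and
keeping the maximum gives a rough high-water mark; a short-lived spike
between samples will be missed.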

>- checks how many clients the server is normally handling, by
>  executing 'rndc status' several times.  (note: you may have to
>  specify a smaller value for the recursive-clients option so that
>  there's at least one TCP socket is available for rndc)

Outstanding queries at any one time are mostly below 50, but we know
from past experience that this is *extremely* variable. Some host
starts unloading lots of slowly unanswerable queries on us when its
primary resolver doesn't respond fast enough; these servers are much
used as backup resolvers in resolv.conf files or their equivalent.

I wouldn't be at all surprised to find that the incidents are caused
by attempts at retrospective Apache (or other) log analysis, a common
cause of query rate spikes.

>- checks query rate, cache hit rate, number of queries sent from the
>  server per some time unit.  you can get these numbers by executing
>  'rndc stats' periodically and several times (note: some of the
>  numbers are only available for 9.5)

A bit of query logging shows one of them running at 225 queries/sec and
the other at 85 queries/sec. But of course as above, this is a highly
variable thing. Other stats not yet to hand.

>Also, for those who can try beta versions in the operational
>environment, I'd like you to try it to see whether the problem still
>happens with them.

It would help if you could go into more detail about the differences
in socket handling in 9.4.2-P1 vs 9.4.3b2. I see that there has already
been one report of the same or similar problem against 9.4.3b2 on a 
Linux system.

I have been wondering whether the problem is in fact an effective 256
file descriptor limit, despite the larger resource limit settings. The
named binary is a 32-bit executable, not a 64-bit one (default BIND make
on these Opteron processors). Has anyone tried a 64-bit one?
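
As a rough sketch of how one might check both halves of that question:
the EI_CLASS byte at offset 4 of an ELF header is 1 for a 32-bit object
and 2 for a 64-bit one, and getrlimit() reports the descriptor limits a
process sees. The helper names and the path in the comment below are
assumptions for illustration; note that fd_limits() reports the limits
of the process running the script, not those of a running named.

```python
import resource


def elf_class(path):
    """Return 32 or 64 according to the EI_CLASS byte of an ELF file."""
    with open(path, "rb") as f:
        ident = f.read(5)
    if len(ident) < 5 or ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file: %s" % path)
    ei_class = ident[4]
    if ei_class == 1:
        return 32
    if ei_class == 2:
        return 64
    raise ValueError("unknown ELF class %d in %s" % (ei_class, path))


def fd_limits():
    """Return the (soft, hard) RLIMIT_NOFILE values of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


# Example use (the path is hypothetical; substitute the real binary):
#   print(elf_class("/usr/local/sbin/named"), fd_limits())
```

This only tells you what kind of binary it is and what the resource
limits are; it does not by itself show whether some lower effective
limit applies inside the process.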

-- 
Chris Thompson
Email: cet1 at cam.ac.uk

