bind9 is taking little Breaks for Some Reason.

Tue May 15 17:19:16 UTC 2007

When the event is happening, can you connect to the system?  We had a series
of Dell 1655 blades acting as DNS servers that would take a break for no
apparent reason.  In our case, though, the entire network connection would
drop for 30 seconds, not just BIND.  It turned out to be an issue with the
NIC's specific Broadcom chipset, not BIND itself, but BIND would trigger it.
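
One way to answer the "can you connect" question during an event is to
probe the server over UDP port 53 and over TCP at the same moment.  The
following is only a rough sketch, not something we ran at the time: the
function names, the query ID, and the two-second timeout are all made up
for illustration, and it assumes the server accepts TCP on port 53.

```python
import socket
import struct
import time


def build_query(name, qid=0x1234):
    """Build a minimal DNS query for an A record (recursion desired)."""
    # Header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed with its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def probe_udp(server, name, timeout=2.0, port=53):
    """Return round-trip seconds for one UDP query, or None on timeout/error."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(build_query(name), (server, port))
            sock.recvfrom(512)
        except OSError:  # includes socket.timeout
            return None
        return time.monotonic() - start


def probe_tcp(server, timeout=2.0, port=53):
    """Return True if a TCP connection to the server can be opened."""
    try:
        with socket.create_connection((server, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling probe_udp() and probe_tcp() once a second against the DNS server
and logging both results would show whether only UDP port 53 goes quiet
or whether the whole host drops off the network, which is what we saw.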

-Matt

> -----Original Message-----
> From: bind-users-bounce at isc.org 
> [mailto:bind-users-bounce at isc.org] On Behalf Of Martin McCormick
> Sent: Tuesday, May 15, 2007 10:43 AM
> To: bind-users at isc.org
> Subject: bind9 is taking little Breaks for Some Reason.
> 
> 	This is a mystery that has no good suspects yet.  We
> have bind9.3.4 running on a Dell server with a peak DNS query
> load that sometimes tops a million queries per hour but
> usually stays in the 600,000-per-hour range.  The system loading
> is around 0.05 to 0.11 or so with no signs of stress.
> 
> 	The problem is that at random times, attempts to update
> the DNS from clients on the same network, even ones on ports
> in the same switch, fall on deaf ears, so to speak.
> 
> 	I discovered it when I noticed that the DHCP server
> which is in the same subnet on the same switch, was giving
> occasional timed out errors in clusters that would sometimes
> span over a minute but are usually limited to a couple of
> seconds.
> 
> 	If I look at the logs on the DNS, I see no squawks about
> anything bad.  There is simply a little gap in the normal tempo
> of messages about Microsoft clients trying to update us and us
> refusing, etc.  The gap is wide enough to be certain that
> something is wrong because it will be significantly longer than
> the normal period of time between messages.
> 
> 	It also does not seem to be related to traffic levels as
> we see it in the wee hours of the morning as well as mid-day on
> a Wednesday which is one of our heaviest usage days.
> 
> 	Our slave is on a brand new Dell 2650 running FreeBSD6.2
> and apparently exhibits the same behavior: a client recently
> had a query time out, then repeated it and it
> worked.
> 
> 	Some other background follows:
> 
> 	The master DNS is presently on FreeBSD4.11.  We began
> running bind9.3.4 on March 21 after upgrading from bind9.3.2 or
> similar.  A check of logs over the last 6 months shows one
> little holiday in a whole 24-hour period in September.  The
> number of naps had increased to 6 or 8 or so per day by March
> 15 (older bind).  After March 21 (bind9.3.4), the problem did
> not get any better or worse.
> 
> 	I did ping our master DNS from yet another device on
> that same network while we were having one of those cat-naps,
> and never lost a packet.
> 
> 	Does this sound familiar?  Again, bind does not appear
> to be having any trouble.  I wrote an expect script to run on
> our dhcp server to initiate a rndc status call to our master at
> the first "timed out" message in syslog.  Each time, bind
> responds with a clean bill of health.  In fact, the number of
> recursive clients is always extremely low.  Our usual count is
> 30 to 50 per 1000.  When we do the status check after a
> time-out, it may be as low as 2.
> 
> 	To be clear, this affects queries as well as
> updates.  Anything that is port 53 and UDP seems to be affected.
> The switch being used is a Cisco 3750.
> 
> 	Any ideas, advice, etc. are greatly appreciated.
> 
> Martin McCormick WB5AGZ  Stillwater, OK 
> Systems Engineer
> OSU Information Technology Department Network Operations Group
> 
>
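
The gaps in the log tempo described above could be located automatically
rather than by eye.  Here is a minimal sketch, assuming the traditional
syslog line format ("May 15 10:43:01 host named[123]: ...") with no year
in the timestamp; the function names and the threshold are hypothetical,
and the year has to be supplied by the caller.

```python
from datetime import datetime


def parse_stamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line, or return None."""
    parts = line.split(None, 3)
    if len(parts) < 3:
        return None
    stamp_text = " ".join(parts[:3])
    try:
        return datetime.strptime(f"{year} {stamp_text}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, threshold_seconds, year):
    """Return (before, after, seconds) for each silence longer than threshold."""
    gaps = []
    previous = None
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue  # skip lines without a recognizable timestamp
        if previous is not None:
            delta = (stamp - previous).total_seconds()
            if delta > threshold_seconds:
                gaps.append((previous, stamp, delta))
        previous = stamp
    return gaps
```

Feeding it the named log with a threshold a few times the normal spacing
between messages would list each quiet period with its start, end, and
length, which could then be lined up against the DHCP server's "timed
out" clusters.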