Strange Failure in Bind-9.1.1rc3

Tue Mar 20 22:31:41 UTC 2001

	Over the weekend, I upgraded our primary and secondary
name servers to bind-9.1.1rc3.  Both are actually slaves to a
stealth primary.  We are extremely pleased so far, but there was
one fly in the ointment today.  Our primary system, running on a
Sun SPARC Ultra with 512 megs of RAM, suddenly stopped serving
DNS.  The system did not crash at all, but named simply stopped
working.  It still had a live process ID and looked for all
the world like it was good, but it wouldn't answer queries.  I
tried  rndc reload  on it and it just hung.  Sending a normal
kill to the process ID wouldn't kill it and I finally had to do a
kill -9 to make it die.

	As with any good red-blooded failure, it left absolutely
no tracks.  There was not one message in the name server log or in
any other normal channel for finding trouble, such as syslog or the
console messages.  It just quit.

	During this same day, our ISP had had some hiccups and
we may have been having intermittent connectivity trouble.
Does this sound familiar to anybody?

	Everything started working perfectly as soon as I killed
named and restarted it.  Lookups take anywhere from 3 to 20
milliseconds, with most of them in the 6-8 ms range, so it seems
to be fine again.  Our secondary DNS server, which has identical
hardware and software, continued to operate properly during this
time.

	Is there anything else I should have looked at before
killing named?  At the time, the goal was to get it working again
as quickly as possible.

	Both SPARCs are running Solaris 7.

Martin McCormick