dhcp-3.0.1rc12 server daemon woes :-(

Wed Oct 15 16:57:34 UTC 2003

Check to make sure the partners are communicating (look at the "my state"
entry in dhcpd.leases).  Also check to make sure the clocks are in sync (run
an ntp daemon on both servers).
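
If it helps, here is a rough Python sketch for pulling the failover state
out of the lease file.  It assumes the state is recorded as
'failover peer "<name>" state { my state ...; partner state ...; }' blocks
and that the last block in the file is the current one.  The peer name
"example", the sample text and the function name are made up for
illustration.

```python
import re

# Assumed layout of a failover state block in dhcpd.leases:
#   failover peer "example" state {
#     my state normal at 3 2003/10/15 09:20:44;
#     partner state normal at 3 2003/10/15 09:20:44;
#   }
BLOCK_RE = re.compile(r'failover peer "([^"]+)" state \{(.*?)\}', re.S)
# Accept "partner" or "peer" as the label for the other server's state.
STATE_RE = re.compile(r'(my|partner|peer) state ([\w-]+) at \d+ ([\d/]+ [\d:]+);')

# Hypothetical sample input, not taken from a real server.
SAMPLE = '''
failover peer "example" state {
  my state communications-interrupted at 3 2003/10/15 09:12:01;
  partner state normal at 3 2003/10/15 08:00:00;
}
failover peer "example" state {
  my state normal at 3 2003/10/15 09:20:44;
  partner state normal at 3 2003/10/15 09:20:44;
}
'''


def latest_failover_states(text):
    """Return {peer_name: {label: (state, timestamp)}} from lease-file text.

    Later blocks overwrite earlier ones, so the result reflects the last
    state written for each peer name.
    """
    result = {}
    for name, body in BLOCK_RE.findall(text):
        result[name] = {who: (state, when)
                        for who, state, when in STATE_RE.findall(body)}
    return result


if __name__ == "__main__":
    for name, states in latest_failover_states(SAMPLE).items():
        for who, (state, when) in states.items():
            print(name, who, state, when)
```

Run it against the lease file on each server; anything other than "normal"
on either side suggests the partners are not talking.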

-----Original Message-----
From: Nick Garfield [mailto:Nicholas.Garfield at cern.ch] 
Sent: Wednesday, October 15, 2003 10:48 AM
To: dhcp-hackers at isc.org
Subject: dhcp-3.0.1rc12 server daemon woes :-(

Hello Hackers list,

I sent the email below to the dhcp-server list earlier today.  I think
that this list is probably a more appropriate place for it :-)

I cannot offer much knowledge of the code, but if you can tell me where and
how to begin debugging it, then I might be of some use.

Please see the email text below for details of the problem.

Thanks

Nick

------------------------------------------
Hi,

I upgraded our failover servers to the latest dhcpd on 9th September
from dhcp-3.0.1rc11.  The only noticeable "feature" to begin with was a
lot of "ping timeout" statements in the log files.  I posted a query
about these but got no response from the list.

Now, it seems, I have run into more serious problems which have taken
some time to verify.  Previously, with rc11, I had problems with "peer
holds all free leases" messages on both servers.  This situation would
occur only when the communication link was cut between the two servers.
Re-establishing the link (say, after rebooting a router) would leave the
servers in the "communications interrupted/lost" state.

The situation above was not so bad - I wrote a script to send an alarm
when/if the problem occurs.

The new error is worse.  Since I upgraded to rc12 I have been getting
occasional phone calls: Mr. X can't get an address on subnet (1), Mr. Y
can't get an address on subnet (2), etc.

I had to rule out saturated hubs, switches, wireless access points, etc.
No problem there.

Next, I had to rule out the pools being over-utilized - I wrote a
few Perl functions to write reports on this.  The pools run between
0-50% utilization.
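
For illustration only, the utilization figure can be computed along these
lines.  This is a Python sketch rather than the Perl functions mentioned
above; the function name, the address range and the list of in-use
addresses are all hypothetical.

```python
import ipaddress


def pool_utilization(first, last, in_use):
    """Return the fraction of the range [first, last] found in in_use.

    first and last are the inclusive ends of a pool range; in_use is any
    iterable of address strings.  Addresses outside the range are ignored.
    """
    lo = int(ipaddress.ip_address(first))
    hi = int(ipaddress.ip_address(last))
    size = hi - lo + 1
    used = sum(1 for ip in set(in_use)
               if lo <= int(ipaddress.ip_address(ip)) <= hi)
    return used / size


if __name__ == "__main__":
    # Hypothetical pool and in-use list (documentation address space).
    busy = ["192.0.2.10", "192.0.2.11", "192.0.2.12", "192.0.2.13"]
    print("%.1f%%" % (100 * pool_utilization("192.0.2.10", "192.0.2.63", busy)))
```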

Armed with the information above I confidently tell the users, "There is
no problem here....".

That is, until yesterday.  I receive the usual "I can't get an
address..." phone call: I go through the usual routine....servers
up.....servers communicating.....network alive.....pools have lots of
free addresses....   hmmm.....  what is going on?

The pool in question has 54 addresses: 4 were in use and 50 were free.

Normally at this point I would say, "Everything is fine", but I instead
looked in the logs:

On the secondary there were A LOT of "peer holds all free leases"
messages.  On the primary there were none of these messages.

This makes no sense, because my report scripts show the binding state of
each lease, and the split was roughly 50/50 free/backup, with only one
address reported as expired and a few active.
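
In case anyone wants to reproduce this kind of report, a rough equivalent
is sketched below in Python (not the actual report scripts).  It assumes
each lease entry in dhcpd.leases carries a "binding state <state>;" line
and that the last entry for an address is the current one.  The sample
leases and the function name are made up.

```python
import re
from collections import Counter

# A lease entry runs from "lease <ip> {" to a "}" at the start of a line.
LEASE_RE = re.compile(r'^lease (\S+) \{(.*?)^\}', re.S | re.M)
# Anchor at line start so "next binding state ...;" is not picked up.
STATE_RE = re.compile(r'^\s*binding state (\w+);', re.M)

# Hypothetical sample input, not taken from a real server.
SAMPLE = '''
lease 192.0.2.10 {
  binding state active;
  next binding state free;
}
lease 192.0.2.11 {
  binding state free;
}
lease 192.0.2.12 {
  binding state backup;
}
lease 192.0.2.10 {
  binding state expired;
}
'''


def binding_state_counts(text):
    """Count addresses per binding state, using the last entry per address."""
    latest = {}
    for ip, body in LEASE_RE.findall(text):
        match = STATE_RE.search(body)
        if match:
            latest[ip] = match.group(1)  # a later entry replaces an earlier one
    return Counter(latest.values())


if __name__ == "__main__":
    for state, count in sorted(binding_state_counts(SAMPLE).items()):
        print(state, count)
```

Running the same tally against the lease file on each server makes it easy
to compare what the primary and the secondary each believe about the
free/backup split.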