dhcp-3.0.1rc12 server daemon woes :-(
Stephens, Bill {PBSG}
Bill.Stephens at pbsg.com
Wed Oct 15 16:57:34 UTC 2003
Check to make sure the partners are communicating (look at the "my
state" entry in dhcpd.leases). Also check to make sure the clocks are in
sync (run an ntp daemon on both servers).
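
As a rough illustration of the first check, the sketch below pulls the
most recent "my state" and "partner state" values out of the text of a
dhcpd.leases file. It assumes the failover state blocks look like the
ones dhcpd 3.0.x writes; the peer name "example-failover", the sample
timestamps and the function name failover_states are made up for the
example.

```python
import re

# One failover state block, in the layout assumed for dhcpd.leases.
STATE_BLOCK = re.compile(
    r'failover peer "(?P<name>[^"]+)" state \{(?P<body>.*?)\}', re.DOTALL)
# "my state <state> at ..." or "partner state <state> at ..."
STATE_LINE = re.compile(r'(my|partner) state ([\w-]+) at')


def failover_states(leases_text):
    """Return {peer name: {"my": state, "partner": state}}.

    dhcpd appends to the leases file, so a later block for the same
    peer overrides an earlier one.
    """
    result = {}
    for block in STATE_BLOCK.finditer(leases_text):
        states = dict(STATE_LINE.findall(block.group("body")))
        result[block.group("name")] = states
    return result


# Hypothetical sample data, not taken from a real server.
SAMPLE = '''
failover peer "example-failover" state {
  my state communications-interrupted at 3 2003/10/15 14:02:11;
  partner state normal at 3 2003/10/15 13:58:40;
}
failover peer "example-failover" state {
  my state normal at 3 2003/10/15 16:40:02;
  partner state normal at 3 2003/10/15 16:40:02;
}
'''

if __name__ == "__main__":
    for name, states in failover_states(SAMPLE).items():
        print(name, states)
```

If either server reports anything other than "normal" for both values,
the peers are probably not talking to each other properly.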
-----Original Message-----
From: Nick Garfield [mailto:Nicholas.Garfield at cern.ch]
Sent: Wednesday, October 15, 2003 10:48 AM
To: dhcp-hackers at isc.org
Subject: dhcp-3.0.1rc12 server daemon woes :-(
Hello Hackers list,
I sent the email below to the dhcp-server list earlier today. I think
that this list is probably a more appropriate place for it :-)
I cannot offer much help in the knowledge of the code....but.... if you
can tell me what/how to begin debugging the code then I might be of some
use.
Please see the email text below for details of the problem.
Thanks
Nick
------------------------------------------
Hi,
I upgraded our failover servers to the latest dhcpd on 9th September
from dhcp-3.0.1rc11. The only noticeable "feature" to begin with was
lots of "ping timeout" statements in the log files. I posted a query
about these but got no response from the list.
Now, it seems, I have run into more serious problems which have taken
some time to verify. Previously, with rc11, I had problems with "peer
holds all free leases" messages on both servers. This situation would
occur only when the communication link was cut between the two servers.
Re-establishment of the link - say rebooting a router - would leave the
servers in the "communications interrupted/lost" state.
The situation above was not so bad - I wrote a script to send an alarm
when/if the problem occurs.
The new error is worse. Since I upgraded to rc12 I have been getting
occasional phone calls: Mr. X can't get an address on subnet (1), Mr. Y
can't get an address (2), etc.
I had to rule out saturated hubs, switches, wireless access points,
etc. No problem here.
Next, I had to rule out over-utilized pools - I wrote a few perl
functions to write reports on this. The pools run between 0% and 50%
utilization.
Armed with the information above I confidently tell the users, "There is
no problem here....".
That is, until yesterday. I receive the usual "I can't get an
address..." phone call: I go through the usual routine....servers
up.....servers communicating.....network alive.....pools have lots of
free addresses.... hmmm..... what is going on?
The pool in question has 54 addresses: 4 were in use and 50 were free.
Normally at this point I would say, "Everything is fine", but instead I
looked in the logs:
On the secondary there were A LOT of "peer holds all free leases"
messages. On the primary there were none of these messages.
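
A comparison like this can be made repeatable with a small log scan.
The sketch below counts lines containing the message text, grouped by
the host field of the log line. It assumes a classic syslog layout
(month, day, time, host, ...); the host names, MAC addresses and the
function name count_refusals are invented for illustration.

```python
from collections import Counter

MESSAGE = "peer holds all free leases"


def count_refusals(log_lines):
    """Count lines containing MESSAGE, keyed by the syslog host field."""
    counts = Counter()
    for line in log_lines:
        if MESSAGE in line:
            fields = line.split()
            # Assumed layout: "Oct 15 16:40:02 <host> dhcpd: ..."
            host = fields[3] if len(fields) > 3 else "unknown"
            counts[host] += 1
    return counts


# Hypothetical log lines, not copied from the real servers.
SAMPLE_LOG = [
    "Oct 14 09:12:01 dhcp2 dhcpd: DHCPDISCOVER from 00:11:22:33:44:55 "
    "via 192.0.2.1: peer holds all free leases",
    "Oct 14 09:12:05 dhcp2 dhcpd: DHCPDISCOVER from 00:11:22:33:44:55 "
    "via 192.0.2.1: peer holds all free leases",
    "Oct 14 09:12:06 dhcp1 dhcpd: DHCPOFFER on 192.0.2.20 to "
    "00:11:22:33:44:66 via 192.0.2.1",
]

if __name__ == "__main__":
    for host, n in count_refusals(SAMPLE_LOG).items():
        print(host, n)
```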
This makes no sense because my report scripts show the binding-state of
each lease and the split was roughly 50/50 free/backup with only one
address reported as expired and a few active.
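
For reference, a binding-state tally of this kind can be sketched as
follows (in Python here rather than the perl mentioned above). It
assumes the usual dhcpd.leases layout, where each lease block carries a
"binding state" line and a later block for the same address supersedes
an earlier one. The addresses and the function name
binding_state_counts are hypothetical.

```python
import re
from collections import Counter

# A lease block: "lease <ip> {" ... "}" with the brace at line start.
LEASE_BLOCK = re.compile(r'^lease (\S+) \{(.*?)^\}', re.DOTALL | re.MULTILINE)
# Match "binding state X;" but not "next binding state X;".
BINDING = re.compile(r'^\s*binding state (\w+);', re.MULTILINE)


def binding_state_counts(leases_text):
    """Tally the current binding state of each address."""
    latest = {}
    for ip, body in LEASE_BLOCK.findall(leases_text):
        match = BINDING.search(body)
        if match:
            latest[ip] = match.group(1)  # later entries win
    return Counter(latest.values())


# Hypothetical sample data.
SAMPLE_LEASES = '''
lease 192.0.2.10 {
  starts 3 2003/10/15 10:00:00;
  ends 3 2003/10/15 12:00:00;
  binding state active;
  next binding state free;
  hardware ethernet 00:11:22:33:44:55;
}
lease 192.0.2.11 {
  binding state free;
}
lease 192.0.2.12 {
  binding state backup;
}
lease 192.0.2.10 {
  binding state free;
}
'''

if __name__ == "__main__":
    print(dict(binding_state_counts(SAMPLE_LEASES)))
```

Running the same tally against the leases file on each server and
comparing the free/backup counts shows whether the two servers agree on
how the pool is split.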