dhcp-3.0.1rc12 server daemon woes :-(

Wed Oct 15 21:08:35 UTC 2003

Both DHCP servers run ntpd v4.1 and are synchronised to within 1 ms of
UTC via stratum 1 servers.  I monitor this with rrdtool and ntpq.  The
stratum 1 servers are synchronised to UTC via 2 Trimble Acutime GPS
receivers on our Campus Network.

I have a script that checks syslog for "communications interrupted".  Is
this not equivalent to monitoring the "my state" line?  It is not a problem
for me to monitor the leases file for changes to the "my state" line.  I
will make the changes tomorrow.
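
A minimal sketch of such a leases-file check, in Python.  It assumes the
failover state is written to dhcpd.leases as a block of the form
failover peer "NAME" state { my state ...; partner state ...; } and that the
last block in the file is the current one.  The function names and the
peer name "dhcp-ha" in the usage note are hypothetical, and the layout
should be verified against a real leases file before relying on it.

```python
import re

# Assumed layout of a failover state block in dhcpd.leases (verify locally).
STATE_BLOCK = re.compile(
    r'failover peer "(?P<peer>[^"]+)" state \{(?P<body>.*?)\}', re.S)
STATE_LINE = re.compile(
    r'(?P<who>my|partner) state (?P<state>[\w-]+) at \d+ '
    r'(?P<when>[\d/]+ [\d:]+);')


def latest_failover_state(leases_text):
    """Return {peer: {"my": (state, when), "partner": (state, when)}}.

    The leases file is append-only between rewrites, so a later block
    for the same peer overrides an earlier one.
    """
    result = {}
    for block in STATE_BLOCK.finditer(leases_text):
        states = {}
        for line in STATE_LINE.finditer(block.group("body")):
            states[line.group("who")] = (line.group("state"),
                                         line.group("when"))
        result[block.group("peer")] = states
    return result


def needs_alarm(states):
    """True unless both 'my' and 'partner' report the state 'normal'."""
    return any(states.get(who, ("unknown", None))[0] != "normal"
               for who in ("my", "partner"))
```

Typical use would be to read the leases file, call
latest_failover_state(text) and raise the alarm when
needs_alarm(result["dhcp-ha"]) is true.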

Tx

Nick

> -----Original Message-----
> From: Stephens, Bill {PBSG} [mailto:Bill.Stephens at pbsg.com]
> Sent: Wednesday, October 15, 2003 6:58 PM
> To: 'dhcp-hackers at isc.org'
> Subject: RE: dhcp-3.0.1rc12 server daemon woes :-(
>
>
> Check to make sure the partners are communicating (my state in
> dhcpd.leases).  Also check to make sure clocks are in sync (run an ntp
> daemon on both servers).
>
> -----Original Message-----
> From: Nick Garfield [mailto:Nicholas.Garfield at cern.ch]
> Sent: Wednesday, October 15, 2003 10:48 AM
> To: dhcp-hackers at isc.org
> Subject: dhcp-3.0.1rc12 server daemon woes :-(
>
> Hello Hackers list,
>
> I sent the email below to the dhcp-server list earlier today.  I think
> that this list is probably a more appropriate place for it :-)
>
> I cannot offer much help in the knowledge of the code....but.... if you
> can tell me what/how to begin debugging the code then I might be of some
> use.
>
> Please see the email text below for details of the problem.
>
> Thanks
>
> Nick
>
>
>
> ------------------------------------------
> Hi,
>
> I upgraded our failover servers to the latest dhcpd on 9th September
> from dhcp-3.0.1rc11.  The only noticeable "feature" to begin with was
> lots of "ping timeout statements" in the log files.  I posted a query
> about these with no response from the list.
>
> Now, it seems, I have run into more serious problems which have taken
> some time to verify.  Previously, with rc11, I had problems with "peer
> holds all free leases" messages on both servers.  This situation would
> occur only when the communication link was cut between the two servers.
> Re-establishment of the link - say rebooting a router - would leave the
> servers in the "communications interrupted/lost" state.
>
> The situation above was not so bad - I wrote a script to send an alarm
> when/if the problem occurs.
>
> The new error is worse.  Since I upgraded to rc12 I have been getting
> occasional phone calls: Mr. X can't get an address on subnet (1), Mr. Y
> can't get an address on subnet (2), etc.
>
> I had to rule out saturated hubs, switches, wireless access points etc
> etc.  No problem here.
>
> Next, I had to rule out over-utilized pools - I wrote a
> few perl functions to write reports on this.  The pools run at between
> 0 and 50% utilization.
>
> Armed with the information above I confidently tell the users, "There is
> no problem here....".
>
> That is, until yesterday.  I receive the usual "I can't get an
> address..." phone call: I go through the usual routine....servers
> up.....servers communicating.....network alive.....pools have lots of
> free addresses....   hmmm.....  what is going on?
>
> The pool in question has 54 addresses: 4 were used and 50 free.
>
> Normally at this point I would say, "Everything is fine", but I instead
> looked in the logs:
>
> On the secondary there were A LOT of "peer holds all free leases"
> messages.  On the primary there were none of these messages.
>
> This makes no sense because my report scripts show the binding-state of
> each lease and the split was roughly 50/50 free/backup with only one
> address reported as expired and a few active.
>
> From the primary log:
>
> Oct 15 00:58:42 ip-srv-3 dhcpd: DHCPDISCOVER from 00:0b:ac:e6:b3:c8 via 137.138.194.65: load balance to peer boson
>
> From the secondary log:
>
> Oct 15 00:58:42 ip-srv-4 dhcpd: DHCPDISCOVER from 00:0b:ac:e6:b3:c8 via 137.138.194.65: peer holds all free leases
>
> So here is the problem.....when this situation occurs NO OFFER WAS SENT
> TO THE CLIENT :-(
> I do not understand why the primary has tried to communicate with the
> failover server on receiving a DISCOVER message.  IMO the load balancing
> should only occur AFTER a DHCPACK has been received.
>
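
One way to find every occurrence of this pattern, rather than waiting for a
phone call, is to cross-check the two syslog files.  The Python sketch below
assumes log lines shaped like the two quoted above, and it pairs entries by
identical timestamp and MAC address, which is crude and only sensible
because both clocks are NTP-synchronised.  The function names are
hypothetical.

```python
import re

# Shape taken from the two log lines quoted in this mail; an optional
# "[pid]" after "dhcpd" is tolerated as an assumption about other setups.
DISCOVER = re.compile(
    r'^(?P<ts>\w{3}\s+\d+ [\d:]{8}) (?P<host>\S+) dhcpd(?:\[\d+\])?: '
    r'DHCPDISCOVER from (?P<mac>[0-9a-f:]{17}) via (?P<via>\S+?)'
    r'(?:: (?P<note>.*))?$')


def parse_discovers(log_lines):
    """Return one dict per DHCPDISCOVER line; other lines are skipped."""
    parsed = []
    for line in log_lines:
        match = DISCOVER.match(line.strip())
        if match:
            parsed.append(match.groupdict())
    return parsed


def unanswered(primary_lines, secondary_lines):
    """Return sorted (timestamp, mac) pairs that the primary handed to its
    peer while the secondary logged 'peer holds all free leases'."""
    deferred = {(d["ts"], d["mac"]) for d in parse_discovers(primary_lines)
                if d["note"] and d["note"].startswith("load balance to peer")}
    refused = {(d["ts"], d["mac"]) for d in parse_discovers(secondary_lines)
               if d["note"] == "peer holds all free leases"}
    return sorted(deferred & refused)
```

Feeding it the lines of the primary and secondary logs gives a list of
clients that, by this heuristic, received no offer from either server.
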
> I conclude that there are two choices.
>=20
> (1) Roll-back to rc11.
>
> (2) Play with MCLT as suggested in the DHCP Handbook 2nd ed.  The HA
> part of the conf file looks like this:
>
>   max-response-delay 60;
>   max-unacked-updates 10;
>   mclt 600;
>   split 128;
>   load balance max seconds 3;
>
> and the global options in the conf file are as follows:
>=20
> not authoritative; #by default, see PB services below
> allow bootp;
> ddns-update-style none;
> deny duplicates;
> default-lease-time 1800; # for all devices but network devices
> max-lease-time 3600;
>
> I believe the MCLT and lease-times are configured correctly, therefore I
> plan to roll-back to rc11, unless anyone can give me a good reason to do
> otherwise.
>
> The total conf file is 30,000 lines (26,000 lines used in host
> statements), therefore too large to send in an email.
> Both deny/allow- unknown/known clients and client-classing (blocking)
> are used in the config.
>
> I would be pleased to provide the maintainer with the configurations,
> leases files, diagrams and trace files if they would be of any help.
>
> If anyone has any useful suggestions about why the above is happening
> then I would be very pleased to hear about your experiences with rc12.
>
> Thanks,
>
> Nick
>
>
>
>
> Nick Garfield
> IT/CS Campus Networking Section
> CERN
> Geneva
> Switzerland
>
> Tel: +41 22 76 74 533
>
>