dhcp-3.0.1rc12 server daemon woes :-(

Thu Oct 16 13:27:30 UTC 2003

What do the leases look like when you run the leasestate script from Kevin
Miller's site (http://www.contrib.andrew.cmu.edu/~kevinm/dhcp)?

I noticed, going back through the archives, that you ran into this same
problem in July.  How did you correct the issue then?

-----Original Message-----
From: Nick Garfield [mailto:Nicholas.Garfield at cern.ch] 
Sent: Thursday, October 16, 2003 4:12 AM
To: dhcp-hackers at isc.org
Subject: RE: dhcp-3.0.1rc12 server daemon woes :-(

Bill,

I rolled back to rc11 yesterday and I actually see no improvement in
the "peer holds all free leases" problem on the secondary.

ip-srv-3 = primary
ip-srv-4 = secondary

Time on both servers is OK.

ip-srv-3.cern.ch/ROOT[4] date
Thu Oct 16 11:07:32 CEST 2003

ip-srv-4.cern.ch/ROOT[5] date
Thu Oct 16 11:07:34 CEST 2003

Let's have a look at how many "peer holds all free leases" messages there
are on both servers.

ip-srv-3.cern.ch/ROOT[3] grep holds /var/log/today/dhcp.log | wc -l
      0

ip-srv-4.cern.ch/ROOT[8] grep "peer holds all free leases" /var/log/today/dhcp.log | wc -l
   1248

Here we can clearly see the problem.
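
The same per-server comparison could be sketched in Python when both logs are collected in one place. This is only an illustration: the function name `count_by_host` is hypothetical, it assumes the classic syslog layout ("Mon DD HH:MM:SS host program: text"), and the two sample lines are the log entries quoted further down in this thread.

```python
from collections import Counter

MESSAGE = "peer holds all free leases"

def count_by_host(log_lines, message=MESSAGE):
    """Count syslog lines containing `message`, keyed by the logging host."""
    counts = Counter()
    for line in log_lines:
        if message not in line:
            continue
        fields = line.split()
        # Assumed layout: month, day, time, host, program, text...
        host = fields[3] if len(fields) > 3 else "unknown"
        counts[host] += 1
    return counts

sample = [
    "Oct 15 00:58:42 ip-srv-3 dhcpd: DHCPDISCOVER from 00:0b:ac:e6:b3:c8 "
    "via 137.138.194.65: load balance to peer boson",
    "Oct 15 00:58:42 ip-srv-4 dhcpd: DHCPDISCOVER from 00:0b:ac:e6:b3:c8 "
    "via 137.138.194.65: peer holds all free leases",
]
print(dict(count_by_host(sample)))  # {'ip-srv-4': 1}
```

Against the real logs, the lines would come from something like `open("/var/log/today/dhcp.log")` on each server rather than from the in-line sample.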

Communication is OK:

ip-srv-3.cern.ch/ROOT[1] grep "my state" /usr/local/etc/dhcpd.leases
  my state normal at 4 2003/10/16 08:11:52;
  my state normal at 4 2003/10/16 08:11:52;
  my state normal at 4 2003/10/16 08:11:52;
  my state normal at 4 2003/10/16 08:11:52;
  my state normal at 4 2003/10/16 08:11:52;
  my state communications-interrupted at 4 2003/10/16 08:48:52;
  my state communications-interrupted at 4 2003/10/16 08:48:52;
  my state normal at 4 2003/10/16 08:48:52;
ip-srv-3.cern.ch/ROOT[2]

ip-srv-4.cern.ch/ROOT[1] grep "my state" /usr/local/etc/dhcpd.leases
  my state normal at 4 2003/10/16 08:48:46;
  my state normal at 4 2003/10/16 08:48:46;
  my state normal at 4 2003/10/16 08:48:46;
  my state normal at 4 2003/10/16 08:48:46;
  my state normal at 4 2003/10/16 08:48:46;
ip-srv-4.cern.ch/ROOT[2]
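
Because the grep output repeats each state once per failover record, the actual transitions are easier to read once duplicates are collapsed. A minimal sketch, assuming the "my state" line format shown above (state, weekday number, then date and time); the helper name `state_history` is hypothetical and the sample is the primary's output copied from above:

```python
import re
from datetime import datetime

# Matches e.g. "  my state normal at 4 2003/10/16 08:11:52;"
STATE_RE = re.compile(
    r"^\s*my state (\S+) at \d+ (\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2});"
)

def state_history(lease_lines):
    """Return (timestamp, state) pairs in file order, collapsing
    consecutive duplicates."""
    history = []
    for line in lease_lines:
        m = STATE_RE.match(line)
        if not m:
            continue
        when = datetime.strptime(m.group(2), "%Y/%m/%d %H:%M:%S")
        entry = (when, m.group(1))
        if not history or history[-1] != entry:
            history.append(entry)
    return history

sample = [
    "  my state normal at 4 2003/10/16 08:11:52;",
    "  my state normal at 4 2003/10/16 08:11:52;",
    "  my state communications-interrupted at 4 2003/10/16 08:48:52;",
    "  my state communications-interrupted at 4 2003/10/16 08:48:52;",
    "  my state normal at 4 2003/10/16 08:48:52;",
]
for when, state in state_history(sample):
    print(when, state)
```

On the primary's lines this leaves three entries: normal at 08:11:52, then communications-interrupted and back to normal, both stamped 08:48:52.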

Is there any information that I can give you to make more sense of this?

Thanks

Nick

> -----Original Message-----
> From: Stephens, Bill {PBSG} [mailto:Bill.Stephens at pbsg.com]
> Sent: Wednesday, October 15, 2003 6:58 PM
> To: 'dhcp-hackers at isc.org'
> Subject: RE: dhcp-3.0.1rc12 server daemon woes :-(
>
> Check to make sure the partners are communicating (my state in
> dhcpd.leases).  Also check to make sure clocks are in sync (run an ntp
> daemon on both servers).
>
> -----Original Message-----
> From: Nick Garfield [mailto:Nicholas.Garfield at cern.ch]
> Sent: Wednesday, October 15, 2003 10:48 AM
> To: dhcp-hackers at isc.org
> Subject: dhcp-3.0.1rc12 server daemon woes :-(
>
> Hello Hackers list,
>
> I sent the email below to the dhcp-server list earlier today.  I think
> that this list is probably a more appropriate place for it :-)
>
> I cannot offer much knowledge of the code, but if you can tell me how
> to begin debugging it then I might be of some use.
>
> Please see the email text below for details of the problem.
>
> Thanks
>
> Nick
>
> ------------------------------------------
> Hi,
>
> I upgraded our failover servers from dhcp-3.0.1rc11 to the latest dhcpd
> (rc12) on 9th September.  The only noticeable "feature" to begin with was
> lots of "ping timeout" statements in the log files.  I posted a query
> about these but got no response from the list.
>
> Now, it seems, I have run into more serious problems which have taken
> some time to verify.  Previously, with rc11, I had problems with "peer
> holds all free leases" messages on both servers.  This situation would
> occur only when the communication link was cut between the two servers.
> Re-establishment of the link - say rebooting a router - would leave the
> servers in the "communications interrupted/lost" state.
>
> The situation above was not so bad - I wrote a script to send an alarm
> when/if the problem occurs.
>
> The new error is worse.  Since I upgraded to rc12 I have been getting
> occasional phone calls: Mr. X can't get an address on subnet (1), Mr. Y
> can't get an address on subnet (2), etc.
>
> I had to rule out saturated hubs, switches, wireless access points, etc.
> No problem here.
>
> Next, I had to rule out over-utilized pools - I wrote a few Perl
> functions to write reports on this.  The pools run between 0-50%
> utilization.
>
> Armed with the information above I confidently tell the users, "There is
> no problem here....".
>
> That is, until yesterday.  I receive the usual "I can't get an
> address..." phone call: I go through the usual routine....servers
> up.....servers communicating.....network alive.....pools have lots of
> free addresses....   hmmm.....  what is going on?
>
> The pool in question has 54 addresses: 4 were used and 50 free.
>
> Normally at this point I would say, "Everything is fine", but I instead
> looked in the logs:
>
> On the secondary there were A LOT of "peer holds all free leases"
> messages.  On the primary there were none of these messages.
>
> This makes no sense because my report scripts show the binding-state of
> each lease and the split was roughly 50/50 free/backup with only one
> address reported as expired and a few active.
>
> From the primary log:
>
> Oct 15 00:58:42 ip-srv-3 dhcpd: DHCPDISCOVER from 00:0b:ac:e6:b3:c8 via 137.138.194.65: load balance to peer boson
>
> From the secondary log:
>
> Oct 15 00:58:42 ip-srv-4 dhcpd: DHCPDISCOVER from 00:0b:ac:e6:b3:c8 via 137.138.194.65: peer holds all free leases
>
> So here is the problem.....when this situation occurs NO OFFER WAS SENT
> TO THE CLIENT :-(
> I do not understand why the primary has tried to communicate with the
> failover server on receiving a DISCOVER message.  IMO the load balancing
> should only occur AFTER a DHCPACK has been received.
>
> I conclude that there are two choices.
>
> (1) Roll-back to rc11.
>
> (2) Play with MCLT as suggested in the DHCP Handbook 2nd ed.  The HA
> part of the conf file looks like this:
>
>   max-response-delay 60;
>   max-unacked-updates 10;
>   mclt 600;
>   split 128;
>   load balance max seconds 3;
>
> and the global options in the conf file are as follows:
>
> not authoritative; #by default, see PB services below
> allow bootp;
> ddns-update-style none;
> deny duplicates;
> default-lease-time 1800; # for all devices but network devices
> max-lease-time 3600;
>
> I believe the MCLT and lease-times are configured correctly, therefore I
> plan to roll-back to rc11, unless anyone can give me a good reason to do
> otherwise.
>
> The total conf file is 30,000 lines (26,000 of them in host
> statements), and is therefore too large to send in an email.
> Both deny/allow unknown/known clients and client-classing (blocking)
> are used in the config.
>
> I would be pleased to provide the maintainer with the configurations,
> leases files, diagrams and trace files if they would be of any help.
>
> If anyone has any useful suggestions about why the above is happening
> then I would be very pleased to hear about your experiences with rc12.
>
> Thanks,
>
> Nick
>
> Nick Garfield
> IT/CS Campus Networking Section
> CERN
> Geneva
> Switzerland
>
> Tel: +41 22 76 74 533