mclt Values, was 2 Instances of dhcp on Same Platform

Mon Feb 26 12:35:41 UTC 2007

Glenn Satchell writes:
> I'm curious as to why you have MCLT issues when restarting the servers.
> 
> I have run several large sites with failover, and we always used the
> procedure to stop and then start the secondary. Wait for it to sync
> (usually only a few seconds), then stop and start the primary. Both
> servers continue to hand out leases with the normal lease times after
> restarting this way.

	Hmm.  I have a log file from a weekday last fall which
summarizes what almost always happened when we did exactly what
you describe.  I had an expect script that ran omshell to do an
orderly shutdown and then ran a tail -1f on syslog until it saw
the right message come across.  In this annotated example, landlord is the
primary and slumlord is the secondary.  Our failover-specific
directives were as follows:

authoritative;
ddns-update-style interim;

failover peer "stw" {
  primary; # declare this to be the primary server
  address 10.1.2.3;
  port 520;
  peer address 10.1.2.5;
  peer port 520;
  max-response-delay 30;
  max-unacked-updates 10;
  load balance max seconds 3;
  mclt 300;
  split 128;
}
include "/usr/local/etc/dhcpd.conf.tested";


	Here's a log.  We had started out with the recommended
MCLT value, but after adding some static bootp records we
didn't get back to normal until an hour had passed, so we
changed it to 300 seconds (5 minutes).

	We shut down the secondary:

Nov 28 09:14:59 landlord dhcpd: failover peer stw: peer moves from normal to shutdown
Nov 28 09:14:59 landlord dhcpd: failover peer stw: I move from normal to partner-down
Nov 28 09:20:00 landlord dhcpd: failover peer stw: I move from partner-down to normal
Nov 28 09:20:00 landlord dhcpd: failover peer stw: peer moves from recover-done to normal

	Okay.  We're half done.  Bounce landlord.

Nov 28 09:20:16 landlord dhcpd: failover peer stw: I move from normal to shutdown
Nov 28 09:20:16 landlord dhcpd: failover peer stw: peer moves from normal to partner-down
Nov 28 09:25:17 landlord dhcpd: failover peer stw: peer moves from partner-down to normal
Nov 28 09:25:17 landlord dhcpd: failover peer stw: I move from recover-done to normal
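
	To put numbers on those waits, here is a minimal Python sketch
that measures how long each side sat in partner-down.  It was not part
of our setup; the function name and the regular expression are my own
illustration, matched only against the message shape shown above, and
the year is an assumption because syslog does not record it.

```python
import re
from datetime import datetime

# Transitions copied from the log above, one message per line.
LOG = """\
Nov 28 09:14:59 landlord dhcpd: failover peer stw: I move from normal to partner-down
Nov 28 09:20:00 landlord dhcpd: failover peer stw: I move from partner-down to normal
Nov 28 09:20:16 landlord dhcpd: failover peer stw: peer moves from normal to partner-down
Nov 28 09:25:17 landlord dhcpd: failover peer stw: peer moves from partner-down to normal
"""

# Assumed message shape: "<timestamp> <host> dhcpd: failover peer <name>: <who> from <old> to <new>"
PATTERN = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ dhcpd: failover peer \S+: "
    r"(I move|peer moves) from (\S+) to (\S+)$"
)


def partner_down_waits(text, year=2006):
    """Return (who, seconds) for each completed stay in partner-down.

    The year is an assumption; it is only needed to build a full datetime.
    """
    entered = {}   # who -> time that side entered partner-down
    waits = []
    for line in text.splitlines():
        match = PATTERN.match(line.strip())
        if not match:
            continue
        stamp, who, old, new = match.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if new == "partner-down":
            entered[who] = when
        elif old == "partner-down" and who in entered:
            elapsed = when - entered.pop(who)
            waits.append((who, int(elapsed.total_seconds())))
    return waits


if __name__ == "__main__":
    for who, seconds in partner_down_waits(LOG):
        print(f"{who}: {seconds} s in partner-down")
    # I move: 301 s in partner-down
    # peer moves: 301 s in partner-down
```

	Both waits come out to 301 seconds, one second more than the
mclt of 300 in the configuration above.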

	On rare occasions, the secondary came up to normal within
a few seconds.  I think I once saw the primary come back up
immediately, but I never got through one of those restart
cycles without at least one MCLT wait.  Our sites are very
busy, as I am sure yours are.  On the day in question, we had 1,269,677
lines of dhcpd messages.

	On our network, the routers forward dhcp broadcasts to
the servers and we do not appear to be having connectivity
issues.

	I am open to any suggestions as to possible causes and
ways to avoid the mclt issue on update, because
we really should keep the recommended mclt value of 30 minutes,
but even a 30-minute mclt is way too long for us.  We probably
modify static dhcp/bootp data five to ten times a day on a random
basis, which would mean recover-wait states most of the time. :-(

	The other solution I am looking at is to run a jail on
both servers and put all the bootp data on one instance of dhcpd
and all the dynamic operation on another.  Each box would run one
of each for a total of 4 dhcp servers.  If everything we have can
eventually talk to all 4, we've got it made.

	Again, thanks for any thoughts.  When we had failover
going, it was otherwise very good.  As I mentioned last week, the
final blow was when we discovered that our present wireless
network authentication devices could only talk to one dhcp server
of the pair.

Martin McCormick WB5AGZ  Stillwater, OK 
Systems Engineer
OSU Information Technology Department Network Operations Group