Questions and Suggested mclt on DHCP with failover

Wed Jan 15 23:58:45 UTC 2014

NERCI> We are beginning to get failover DHCP set up on our network
NERCI> after a recent server failure. We have most of the setup
NERCI> ready, but have some concerns about mclt. We have a little over
NERCI> 4,000 customers that will be using these servers when all is
NERCI> said and done and don't want to cause too many issues if
NERCI> something does fail.

NERCI> That being said, with mclt, when a server is done with
NERCI> recovery and waiting for the mclt to expire, does it respond to
NERCI> new discover requests, or does it not respond to any requests
NERCI> at all? All of the documentation I have seen seems to point to
NERCI> the latter, not the former.

NERCI> With that question in mind, do most of you use the default
NERCI> time of 3600, or do you shorten the timer to allow a faster recovery?

I can't answer your root question - but there's a newer feature [in
4.2.x, IIRC] that essentially allows a server in the
"communications-interrupted" state to "rewind" a lease and re-issue
it as an active lease.

See this in the release notes:

- An optimization described in the failover protocol draft is now included,
  which permits a DHCP server operating in communications-interrupted state
  to 'rewind' a lease to the state most recently transmitted to its peer,
  greatly increasing a server's endurance in communications-interrupted.
  This is supported using a new 'rewind state' record on the dhcpd.leases
  entry for each lease.

I think this is a key feature that would allow a tight pool of IPs
to survive an extended communications-interrupted state.

[I've personally run into a case where running a very tight pool and
going into the communications-interrupted state exhausted the pool:
the one "up" dhcpd server ran out of IPs to lease, because all the
remaining IPs in the split pool were "stuck" on the down server.]

One other possible help, though a riskier one, is "auto-partner-down".

[I'd guess this isn't an appropriate "solution" for you as an ISP:
if the peer server is not actually down, but simply unable to
communicate, then both peers go into partner-down mode and start
leasing the whole pool as though the other partner doesn't exist -
meaning the same IPs will probably get leased to more than one
client. This would be bad (tm). Can you say, "Grasshopper has weak
dhcp-foo?" :) ]
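
For reference, here is a rough sketch of where these knobs live in
dhcpd.conf on the primary. Treat it as an illustration, not a tested
config: the peer name, all addresses (documentation ranges), and all
timer values are placeholders I made up, and you should check
dhcpd.conf(5) for your version before using auto-partner-down, which
I've left commented out for the reason above.

```
# Primary server's failover declaration (illustrative values only).
failover peer "dhcp-failover" {
  primary;
  address 192.0.2.10;            # this server (placeholder)
  port 647;
  peer address 192.0.2.11;       # the secondary (placeholder)
  peer port 647;
  max-response-delay 60;
  max-unacked-updates 10;
  mclt 3600;                     # set on the primary only
  split 128;                     # even load-balance split, primary only
  load balance max seconds 3;
  # auto-partner-down 7200;      # risky: see the duplicate-lease caveat
}

subnet 198.51.100.0 netmask 255.255.255.0 {
  pool {
    failover peer "dhcp-failover";
    range 198.51.100.10 198.51.100.250;
  }
}
```

The secondary would carry a matching declaration with "secondary;"
and the addresses swapped, and without the mclt and split lines.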

HTH

-Greg