DHCP peer failure and pool exhaustion...

Mon Sep 9 19:58:41 UTC 2013

For what it is worth, ISC DHCP 4.2 has added an "auto partner-down" configuration option:

The auto-partner-down statement

auto-partner-down seconds;

This statement instructs the server to initiate a timed delay upon entering the communications-interrupted state (any situation of being out-of-contact with the remote failover peer). At the conclusion of the timer, the server will automatically enter the partner-down state. This permits the server to allocate leases from the partner's free lease pool after an STOS+MCLT timer expires, which can be dangerous if the partner is in fact operating at the time (the two servers will give conflicting bindings).

Think very carefully before enabling this feature. The partner-down and communications-interrupted states are intentionally segregated because there do exist situations where a failover server can fail to communicate with its peer, but still has the ability to receive and reply to requests from DHCP clients. In general, this feature should only be used in those deployments where the failover servers are directly connected to one another, such as by a dedicated hardwired link ("a heartbeat cable").

A zero value disables the auto-partner-down feature (also the default), and any positive value indicates the time in seconds to wait before automatically entering partner-down.
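
To make the man page text above concrete, here is a minimal sketch of where the statement would sit in dhcpd.conf, assuming it goes inside the failover peer declaration alongside the other failover parameters. The peer name, the addresses (taken from the 192.0.2.0/24 documentation range), and every timer value are placeholders for illustration, not recommendations:

```
# Hypothetical primary-side failover declaration; names, addresses,
# and timer values are illustrative assumptions only.
failover peer "dhcp-failover" {
    primary;
    address 192.0.2.1;
    port 647;
    peer address 192.0.2.2;
    peer port 647;
    max-response-delay 60;
    max-unacked-updates 10;
    mclt 3600;
    split 128;
    load balance max seconds 3;

    # Enter partner-down automatically after 300 seconds in
    # communications-interrupted. 0 (the default) disables this.
    auto-partner-down 300;
}
```

Given the warning above, a value like this only makes sense where loss of contact with the peer reliably means the peer is really down, for example over a dedicated link.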

Regards,
Greg Rabil

-----Original Message-----
From: dhcp-users-bounces+greg.rabil=bt.com at lists.isc.org [mailto:dhcp-users-bounces+greg.rabil=bt.com at lists.isc.org] On Behalf Of Steven Carr
Sent: Monday, September 09, 2013 3:33 PM
To: Users of ISC DHCP
Subject: Re: DHCP peer failure and pool exhaustion...

On 9 September 2013 20:13, Simon Hobson <dhcp1 at thehobsons.co.uk> wrote:
> I believe part of the reason for the current state of affairs is from a viewpoint that there are network topologies that could mean the peers are unable to communicate with each other, but both of them can communicate with their clients. If you were to put both peers into partner down state - then chaos would ensue as they proceeded to issue duplicate leases.

That's precisely my reasoning for it being a "bad thing".

Putting a peer into partner-down when its partner is not actually down causes chaos. If both systems are put into partner-down, you can end up in a situation where neither peer issues leases for the MCLT period (which I believe someone on the list has run into in the past, IIRC). Your network then ends up in more sh*t than it was already in: before, at least some clients could get online; now none can.

Unless you have a 100% guarantee that your script is flawless and can only trigger partner-down when a peer is actually dead, the only other method is human intervention.

And Greg, yes, it's a sucky answer, but that's only because it's the answer you didn't want to hear. At some point you need to deal with the legacy crap you've been left with and fix it; the tools can only go so far to assist. DHCP failover isn't perfect, no one said it was, and it does have its gotchas. Sadly, you've run into this one.

As a temporary measure you could have your monitoring system alert you when a peer goes down, DHCPD isn't running, or "COMMUNICATIONS-INTERRUPTED" appears in syslog so that you can then access the systems, see if it is really down and apply a band-aid (set partner-down) before it has a detrimental impact on production systems.
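
As a rough illustration of the syslog check suggested above, a monitoring hook could scan log lines for the state name. This is only a sketch: the function name `find_failover_alerts` is made up here, and the match is case-insensitive because the exact wording and casing of the dhcpd log message is an assumption, not something confirmed in this thread.

```python
import re
from typing import Iterable, List

# Case-insensitive on purpose: the exact casing dhcpd uses when it logs
# the state change is an assumption in this sketch.
PATTERN = re.compile(r"communications-interrupted", re.IGNORECASE)


def find_failover_alerts(lines: Iterable[str]) -> List[str]:
    """Return the log lines that mention the communications-interrupted state."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]
```

A cron job or monitoring plugin could call this on the tail of the syslog file and page a human when the result is non-empty. It deliberately does not set partner-down itself; that decision stays with the operator.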

Steve