DHCP peer failure and pool exhaustion...

Mon Sep 9 21:41:49 UTC 2013

Several answers:

1) Spare blocks to get me some breathing room. Pretty much exhausted.
We have another pool we can start moving people into, and are doing
so. [Even more quickly now! :) ]

2) Let's assume the worst. They are both in partner down state, and
issuing leases willy-nilly. Recovery, as I see it, would entail:
-stopping both peers
-deleting the dhcpd.leases file on each peer
-making sure communication was working [or that only a single peer
would come up]
-bringing up one or both peers
-forcing every DHCP client to get a new lease.
[does that sound right?]

---
Well, thanks all for an enlightening discussion.

Yeah, I can see that the technical details are more complex than I
appreciated - and ISC's trying to make one DHCP server that satisfies
everyone from tens-of-thousands enterprise setups to mom-and-pop little
business setups. And doing that means you probably design for the
enterprise setups - which means if it's not totally bullet-proof it's
not getting rolled out.

And I can see that telling apart a peer that has merely lost contact
with its partner from one that is actually down for DHCP clients is a
hard and thorny problem.

I still *wish* for something better - but the auto-partner-down in
4.2 probably resolves that. [Though I'll have to compile 4.2 myself -
which I always hate doing.]

Anyway, as always, the discussion has been excellent, and I find
myself grumpily agreeing with the devs' reasoning and admitting that
what I want probably isn't such a great idea as I thought. :)

I'll consider my options, but thanks for the full info to base a
decision on!

-Greg

grbc> For what it is worth, ISC DHCP 4.2 has added an "auto
grbc> partner-down" configuration option:

grbc> The auto-partner-down statement

grbc> auto-partner-down seconds;

grbc> This statement instructs the server to initiate a timed delay
grbc> upon entering the communications-interrupted state (any
grbc> situation of being out-of-contact with the remote failover
grbc> peer). At the conclusion of the timer, the server will
grbc> automatically enter the partner-down state. This permits the
grbc> server to allocate leases from the partner's free lease pool
grbc> after an STOS+MCLT timer expires, which can be dangerous if the
grbc> partner is in fact operating at the time (the two servers will give conflicting bindings).

grbc> Think very carefully before enabling this feature. The
grbc> partner-down and communications-interrupted states are
grbc> intentionally segregated because there do exist situations where
grbc> a failover server can fail to communicate with its peer, but
grbc> still has the ability to receive and reply to requests from DHCP
grbc> clients. In general, this feature should only be used in those
grbc> deployments where the failover servers are directly connected to
grbc> one another, such as by a dedicated hardwired link ("a heartbeat cable").

grbc> A zero value disables the auto-partner-down feature (also the
grbc> default), and any positive value indicates the time in seconds
grbc> to wait before automatically entering partner-down.

grbc> Regards,
grbc> Greg Rabil

grbc> -----Original Message-----
grbc> From: dhcp-users-bounces+greg.rabil=bt.com at lists.isc.org
grbc> [mailto:dhcp-users-bounces+greg.rabil=bt.com at lists.isc.org] On Behalf Of Steven Carr
grbc> Sent: Monday, September 09, 2013 3:33 PM
grbc> To: Users of ISC DHCP
grbc> Subject: Re: DHCP peer failure and pool exhaustion...

grbc> On 9 September 2013 20:13, Simon Hobson <dhcp1 at thehobsons.co.uk> wrote:
>> I believe part of the reason for the current state of affairs is
>> from a viewpoint that there are network topologies that could mean
>> the peers are unable to communicate with each other, but both of
>> them can communicate with their clients. If you were to put both
>> peers into partner down state - then chaos would ensue as they
>> proceeded to issue duplicate leases.

grbc> That's precisely my reasoning for it being a "bad thing".

grbc> Putting a peer into partner-down when it's not actually down
grbc> causes chaos, and if both systems were put into partner-down
grbc> then you can end up in the situation where neither peer is
grbc> issuing leases for MCLT (which I believe someone on the list has
grbc> had in the past IIRC). Your network ends up in more sh*t than it
grbc> already was in: before, at least some clients could get online; now none can.

grbc> Unless you have a 100% guarantee that your script is flawless
grbc> and can only trigger partner-down when a peer is actually dead
grbc> then the only other method is human intervention.

grbc> And Greg, yes it's a sucky answer, but that's only because it's
grbc> the answer you didn't want to hear. At some point you need to
grbc> deal with the legacy crap you've been left with and fix it, the
grbc> tools can only go so far to assist. DHCP failover isn't perfect,
grbc> no-one said it was, and it does have its gotchas; sadly you've run into this one.

grbc> As a temporary measure you could have your monitoring system
grbc> alert you when a peer goes down, DHCPD isn't running, or
grbc> "COMMUNICATIONS-INTERRUPTED" appears in syslog so that you can
grbc> then access the systems, see if it is really down and apply a band aid (set
grbc> partner-down) before it has a detrimental impact on production systems.
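
[A rough sketch of the syslog check suggested above, in Python. This is
only an illustration, not anything ISC ships: the function name
find_interrupted is made up, and the exact wording dhcpd uses in its log
messages is an assumption, so the match is a plain case-insensitive
search for the state name.]

```python
import re

# Assumption: the state name appears verbatim somewhere in the log line.
# Matching is case-insensitive because the exact dhcpd message format is
# not given in this thread.
PATTERN = re.compile(r"communications-interrupted", re.IGNORECASE)


def find_interrupted(lines):
    """Return the log lines that mention the communications-interrupted state."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


def check_log(path):
    """Scan one log file and return the matching lines (hypothetical helper)."""
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(path, errors="replace") as log:
        return find_interrupted(log)
```

[A monitoring job could call check_log() on whatever file your syslog
writes dhcpd messages to and raise an alert when the list is non-empty.
A human still has to confirm the peer is really down before setting
partner-down.]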

grbc> Steve
grbc> _______________________________________________
grbc> dhcp-users mailing list
grbc> dhcp-users at lists.isc.org
grbc> https://lists.isc.org/mailman/listinfo/dhcp-users

-- 
Gregory Sloop, Principal: Sloop Network & Computer Consulting
Voice: 503.251.0452 x82
EMail: gregs at sloop.net
http://www.sloop.net
---