DHCP peer failure and pool exhaustion...

Mon Sep 9 19:13:10 UTC 2013

Gregory Sloop wrote:
>SC> No, and that's generally a bad thing to do, that type of DR scenario
>SC> should not be automated as it could be triggered accidentally by a
>SC> number of situations.
>
>Yes, I can see how that could happen - and if I understand things
>correctly, it would be a bad thing(tm) if the peer that was
>automagically put into "partner-down" suddenly came up.

Actually, no, it isn't a bad thing. When it comes up, the first thing it will try to do is talk to its peer. Assuming it can, the two then go through a process of getting each other "up to date" on the state of the network - which principally means syncing all the lease changes to the previously down partner.

>SC> You should calculate the pools to ensure that there is enough IPs for
>SC> all the clients, plus some buffer space to allow for a host to fail. I
>SC> would usually size the pool at 120-150% of the number of clients
>SC> depending on the size of the subnet to start with.
>
>Yeah, I understand this too. It's simply a failing of the network I'm
>working this in, and it's something we can't just change easily.
>[Though we are working to change it.] There is a lot of churn. Yeah,
>it's ugly too.

Yes, it can be a pain inheriting something in a "we wouldn't start from here" state. Are there really no odd small blocks dotted around the subnet that you could steal?
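
For reference, the 120-150% rule of thumb quoted above is simple arithmetic. The following is only a sketch of it in Python; the function names are made up for illustration, and the subnet check assumes an IPv4 prefix of /30 or shorter and ignores any addresses reserved for routers or static hosts.

```python
def pool_size_range(clients: int) -> tuple[int, int]:
    """Return (low, high) pool sizes for 120% and 150% of the client count."""
    if clients < 0:
        raise ValueError("clients must be non-negative")
    # Integer ceiling division avoids floating-point rounding surprises.
    low = (clients * 120 + 99) // 100
    high = (clients * 150 + 99) // 100
    return low, high


def fits_in_subnet(pool_size: int, prefix_len: int) -> bool:
    """True if pool_size addresses fit in an IPv4 subnet of this prefix length.

    Assumption: only the network and broadcast addresses are excluded.
    """
    usable = 2 ** (32 - prefix_len) - 2
    return pool_size <= usable


low, high = pool_size_range(200)
print(low, high, fits_in_subnet(low, 24), fits_in_subnet(high, 24))
```

With 200 clients the lower bound still fits in a /24 but the upper bound does not, which is the sort of squeeze being described in this thread.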

>However, I *still* think that the DHCP server fail-over/peer could
>work in a way to handle this. I understand it probably requires some
>structural changes and some heavy lifting, but I think it's generally
>what people expect fail-over/peering to look like - not what we
>currently have.

Well, if you are happy that "the partner is not visible" almost certainly means it is not "there and working", then it shouldn't take too much scripting to either watch the logs or poll the server for status. Then you can wait a bit and put the live peer into partner-down state.
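
As a rough illustration of the "wait a bit" part, here is a sketch in Python of the decision logic only. How the peer is polled, and how the live server is actually told to enter partner-down, are deliberately left out; the class name, the `observe` method and the 300-second grace period are all hypothetical choices, not anything the DHCP server provides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PartnerDownWatchdog:
    """Decide when a peer has been invisible long enough to act on."""

    grace_seconds: float
    _unreachable_since: Optional[float] = None
    _declared: bool = False

    def observe(self, peer_visible: bool, now: float) -> bool:
        """Record one poll result.

        Returns True exactly once per outage, at the first poll where the
        peer has been continuously invisible for at least grace_seconds.
        """
        if peer_visible:
            # Peer is back: reset so a later outage starts a fresh timer.
            self._unreachable_since = None
            self._declared = False
            return False
        if self._unreachable_since is None:
            self._unreachable_since = now
        expired = now - self._unreachable_since >= self.grace_seconds
        if expired and not self._declared:
            self._declared = True
            return True
        return False


# Example: polls at the given timestamps (seconds), peer never visible.
watchdog = PartnerDownWatchdog(grace_seconds=300)
for t in (0, 120, 240, 360):
    if watchdog.observe(peer_visible=False, now=t):
        print(f"t={t}: would put the live peer into partner-down here")
```

The caller would hook its own status check and its own state-change command into this. The grace period is there so that a short restart or network blip does not trigger the change.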

There has been a certain amount of discussion about this over the years - and one suggestion has been to add a config option to control it: keep the default as it is, but allow the administrator to change it. I believe part of the reason for the current state of affairs is the view that there are network topologies in which the peers cannot communicate with each other, yet both of them can still communicate with their clients. If both peers were put into partner-down state in that situation, chaos would ensue as they proceeded to issue duplicate leases.
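
To make the duplicate-lease problem concrete, here is a deliberately simplified toy model in Python. It is not how the server allocates addresses: it assumes each server hands out the lowest address it believes is free, that in normal operation the free addresses are split evenly between the peers, and that in partner-down a server treats the whole pool as its own. The addresses and function names are invented for the example.

```python
def lowest_free(candidates, known_leases):
    """Return the first address this server believes is unleased."""
    for addr in candidates:
        if addr not in known_leases:
            return addr
    raise RuntimeError("no free address")


def hand_out_during_partition(both_partner_down: bool):
    """Each server answers one new client while the peers cannot talk."""
    pool = [f"192.0.2.{i}" for i in range(10, 20)]
    half = len(pool) // 2
    if both_partner_down:
        # Each server believes its partner is gone and uses the whole pool.
        share_a, share_b = pool, pool
    else:
        # Each server sticks to its own share of the free addresses.
        share_a, share_b = pool[:half], pool[half:]

    # Lease databases are separate because no updates cross the broken link.
    leases_a, leases_b = {}, {}
    addr_a = lowest_free(share_a, leases_a)
    leases_a[addr_a] = "client-1"
    addr_b = lowest_free(share_b, leases_b)
    leases_b[addr_b] = "client-2"
    return addr_a, addr_b


print(hand_out_during_partition(both_partner_down=False))
print(hand_out_during_partition(both_partner_down=True))
```

In the second case both clients end up with the same address, which is exactly the situation the current default is trying to avoid.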