Restarting DHCP safely whilst avoiding partner-down state

Fri May 13 16:01:53 UTC 2016

On 13 May 2016 at 15:48, Pallissard, Matthew
<matthew.paul at pallissard.net> wrote:
> I just tested it on a standalone box I use for testing to see if it brought
> down dhcp cleanly.  Give it a whirl on a test environment and let us know
> how it goes.

Testing shows that, as alluded to in my original description of the
problem, the OMAPI method is unworkable for restarting a failover
pair.

In its dying breath the server transitions from normal -> shutdown -> recover:

May 13 16:34:12 dhcp1 dhcpd: failover peer dhcp1: I move from normal to shutdown
May 13 16:34:17 dhcp1 dhcpd: failover peer dhcp1: I move from shutdown to recover

Upon start it resumes its recover state, sends an "update request all"
message and enters recover-wait (pushing the peer into partner-down):

May 13 16:34:20 dhcp1 dhcpd: failover peer dhcp1: I move from recover to startup
May 13 16:34:20 dhcp1 dhcpd: Server starting service.
May 13 16:34:20 dhcp1 dhcpd: failover peer dhcp1: peer moves from normal to communications-interrupted
May 13 16:34:20 dhcp1 dhcpd: failover peer dhcp1: I move from startup to recover
May 13 16:34:20 dhcp1 dhcpd: Sent update request all message to dhcp1
May 13 16:34:20 dhcp1 dhcpd: failover peer dhcp1: peer moves from communications-interrupted to partner-down
May 13 16:34:22 dhcp1 dhcpd: failover peer dhcp1: peer update completed.
May 13 16:34:22 dhcp1 dhcpd: failover peer dhcp1: I move from recover to recover-wait
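
For anyone wanting to check their own test pair for this, the
transitions can be pulled out of the log with something like the
following. It is only a sketch: it assumes the message wording shown in
the excerpts above, and the function names are made up for
illustration.

```shell
#!/bin/sh
# Print one "who from to" line per failover state transition on stdin.
# "who" is "local" for "I move from ..." messages and "peer" for
# "peer moves from ..." messages. Other log lines are ignored.
failover_transitions() {
    sed -n \
        -e 's/.*failover peer [^:]*: I move from \([^ ]*\) to \([^ ]*\).*/local \1 \2/p' \
        -e 's/.*failover peer [^:]*: peer moves from \([^ ]*\) to \([^ ]*\).*/peer \1 \2/p'
}

# Succeed if any line on stdin shows the peer entering partner-down.
peer_went_partner_down() {
    failover_transitions | grep -q '^peer [^ ]* partner-down$'
}

# Example use (the log path is an assumption and varies by distribution):
#   peer_went_partner_down < /var/log/messages && echo "partner-down seen"
```

Note that this expects each log message on a single line, as written by
syslog, not wrapped as a mail client may display it.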

This is far too disruptive for a frequent reload of the configuration
of both instances in a pair.
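
For comparison, the odd/even-minute gating that Steve describes further
down the thread (using signals rather than OMAPI) could be expressed in
shell roughly as below. This is a sketch of his Perl fragment, not his
actual script; the function name and the "primary"/"secondary" role
labels are assumptions introduced here.

```shell
#!/bin/sh
# may_restart_now MINUTE ROLE
# Return 0 when a server with the given role may restart in this minute.
# As in the Perl fragment: the secondary skips even minutes and the
# primary skips odd minutes, so the two never restart in the same minute.
may_restart_now() {
    minute=$1
    role=$2
    # Strip one leading zero so "08" and "09" are not read as bad octal,
    # then fall back to 0 if that left nothing (input "0").
    minute=${minute#0}
    minute=${minute:-0}
    parity=$(( $minute % 2 ))
    case $role in
        secondary) [ "$parity" -eq 1 ] ;;
        primary)   [ "$parity" -eq 0 ] ;;
        *)         return 2 ;;   # unknown role
    esac
}

# Example use from a per-minute cron job ($spath as in the Perl fragment):
#   role=primary
#   [ -e "$spath/dhcpd.i.am.secondary" ] && role=secondary
#   may_restart_now "$(date +%M)" "$role" || exit 0
#   ... test new config, stop the running server, start the new one
```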

> On 05/13/2016 09:42 AM, Terry Burton wrote:
>>
>> On 13 May 2016 at 15:37, Pallissard, Matthew
>> <matthew.paul at pallissard.net> wrote:
>>>
>>> I just tested this and it seemed to work for me.
>>
>>
>> Do you not find if you tail the log on the partner that it transitions
>> to "partner-down" rather than "communications-interrupted"?
>>
>> Thanks!
>>
>>
>>> #dhcpd4.service
>>> [Unit]
>>> Description=IPv4 DHCP server
>>> After=network.target
>>>
>>> [Service]
>>> Type=forking
>>> PIDFile=/run/dhcpd4.pid
>>> ExecStart=/usr/bin/dhcpd -4 -q -cf /etc/dhcpd.conf -pf /run/dhcpd4.pid
>>> ExecStop=/path/to/shutdown/script.sh
>>>
>>> [Install]
>>> WantedBy=multi-user.target
>>>
>>> #/path/to/shutdown/script.sh
>>> #copy-pasted from
>>>
>>> https://kb.isc.org/article/AA-00475/0/Sending-a-Server-Shutdown-Message-Via-OMAPI.html
>>> #
>>> #!/bin/sh
>>>
>>> #  uses omshell to connect to a dhcp server on the
>>> #  local machine, create a control object, set the
>>> #  state of the control object, and update the
>>> #  running server to cause that server to shut down
>>> #  gracefully.
>>> #
>>> #  per dhcpd man page, server shutdown can take
>>> #  several seconds as the server waits for close
>>> #  on all OMAPI connections.  Watching log files
>>> #  for shutdown messages is recommended.
>>>
>>> omshell << END_OF_INPUT > /dev/null 2> /dev/null
>>> server localhost
>>> port 7911
>>> key omapi_key Ofakekeyfakekeyfakekey==
>>> connect
>>> new control
>>> open
>>> set state=2
>>> update
>>> END_OF_INPUT
>>>
>>> echo "done sending shutdown instruction to dhcp server.."
>>>
>>> Matt Pallissard
>>>
>>>
>>> On 05/13/2016 09:33 AM, Terry Burton wrote:
>>>>
>>>>
>>>> On 13 May 2016 at 15:10, Steve van der Burg
>>>> <steve.vanderburg at lhsc.on.ca>
>>>> wrote:
>>>>>
>>>>>
>>>>> Here we push out new configs to a partner pair from a central server.
>>>>> The config for one of the partners contains an extra file
>>>>> (dhcpd.i.am.secondary).  Each of the partners runs this every minute
>>>>> (perl script):
>>>>>
>>>>>   if ( -e "$spath/dhcpd.i.am.secondary" ) {
>>>>>      exit if (localtime)[1] % 2 == 0;
>>>>>   }
>>>>>   else {
>>>>>      exit if (localtime)[1] % 2 == 1;
>>>>>   }
>>>>>
>>>>>   ... continue (test new config, kill running server, start new one, etc)
>>>>>
>>>>> So the config change, stop, start, etc, can only happen on odd minutes
>>>>> for one server and even minutes for the other.  As long as startup time
>>>>> is less than a minute (and it's much, much less than that) it all works
>>>>> smoothly.
>>>>
>>>>
>>>>
>>>> Thanks Steve. We've also been pushing configs around then
>>>> synchronously restarting servers back-to-back (without sleeping) for
>>>> several years without incident.
>>>>
>>>> It makes me a little suspicious about whether just killing the process
>>>> is indeed unsafe... But then maybe we've been lucky.
>>>>
>>>> As mentioned I want to improve on what distributions are currently
>>>> doing so I'm deliberately setting the bar high and it would be great
>>>> if ISC could provide a single, approved, safe shutdown/restart
>>>> mechanism or describe what is required to develop such a mechanism.
>>>> Unfortunately the detail of Bug #36066 (retracting support for gentle
>>>> shutdown) isn't available as it would be interesting to see what
>>>> issues were encountered with the previous approach.
>>>>
>>>>
>>>>> Chuck Anderson <cra at WPI.EDU> wrote:
>>>>>>
>>>>>>
>>>>>> FWIW, we've been using the "kill" method for over a decade without any
>>>>>> noticeable side-effects (the default init.d scripts from RHEL 6
>>>>>> (actually Scientific Linux 6) dhcp package).  We've never had to
>>>>>> manually clean up a corrupted lease file.  We restart the services
>>>>>> automatically on a 20 minute cycle, as needed.  We do one, then
>>>>>> immediately do the other.  We do not wait to restart the other, and we
>>>>>> do not monitor to see if failover has reconnected and rebalanced
>>>>>> before restarting the other, but since we are SSH-ing into each server
>>>>>> to do the restart, there might be enough of a built-in delay between
>>>>>> restarting each server.
>>>>>>
>>>>>> I don't know if a corrupted lease file would cause a failure to start
>>>>>> the dhcp server, or if it would just go unnoticed, perhaps with a log
>>>>>> message.  But like I said, we've never had a failure to start the
>>>>>> server that was caused by a lease file issue.
>>>>>>
>>>>>> Our script does test the config file before doing the restart:
>>>>>>
>>>>>> #!/bin/bash
>>>>>> echo -n "Testing DHCP configuration: "
>>>>>> if sudo /etc/rc.d/init.d/dhcpd configtest; then
>>>>>>         echo "Restarting DHCP"
>>>>>>         sudo /etc/rc.d/init.d/dhcpd restart
>>>>>> else
>>>>>>         echo "FAIL: Not restarting DHCP"
>>>>>> fi
>>>>>>
>>>>>> which in CentOS 6 does the following:
>>>>>>
>>>>>> exec=/usr/sbin/dhcpd
>>>>>> configtest() {
>>>>>>     [ -x $exec ] || return 5
>>>>>>     [ -f $config ] || return 6
>>>>>>     $exec -q -t -cf $config
>>>>>>     RETVAL=$?
>>>>>>     if [ $RETVAL -eq 1 ]; then
>>>>>>         $exec -t -cf $config
>>>>>>     else
>>>>>>         echo "Syntax: OK" >&2
>>>>>>     fi
>>>>>>     return $RETVAL
>>>>>> }
>>>>>>
>>>>>>
>>>>>> On Fri, May 13, 2016 at 02:00:03PM +0100, Terry Burton wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm attempting to write a systemd .service file for my own uses of
>>>>>>> ISC DHCP. However, if it can be made sufficiently generic then I would
>>>>>>> intend to push this upstream or at least into distributions.
>>>>>>>
>>>>>>> It needs to be suitable for managing failover pairs and I'm
>>>>>>> struggling with the age-old problem of restarting a dhcpd instance.
>>>>>>> From reading around there does not currently appear to be a method
>>>>>>> for restarting dhcpd that is both *safe* and *useful* in such a setup.
>>>>>>>
>>>>>>>
>>>>>>> Restarting with signals:
>>>>>>>
>>>>>>> From AA-01043 (Last Updated: 2015-03-18): "kill is the recommended
>>>>>>> option, except where there is a high turnover of leases and the
>>>>>>> production environment requires a high degree of reliability from
>>>>>>> DHCP. In that case, we'd suggest that administrators consider using
>>>>>>> OMAPI to control the daemon instead and to request a graceful
>>>>>>> shutdown. The reason for this is that there is the slight possibility
>>>>>>> that by using kill, administrators may stop dhcpd in the middle of
>>>>>>> appending a lease to the leases file (in which case it may become
>>>>>>> corrupted). This risk, while tiny, may be significant enough for some
>>>>>>> administrators to prefer to use OMAPI instead."
>>>>>>>
>>>>>>> In other words this is recommending that casual users take the risk
>>>>>>> that their service might not recover after restarting. This may be
>>>>>>> unlikely but it's still dangerous advice! The documentation does
>>>>>>> indicate that a feature for "gentle shutdown" in response to a
>>>>>>> signal was added in the 4.2 time frame and then subsequently removed:
>>>>>>>
>>>>>>> "Added support for gentle shutdown after signal is received.
>>>>>>> [ISC-Bugs #32692] [ISC-Bugs 34945]"
>>>>>>> "Disable the gentle shutdown functionality until we can determine the
>>>>>>> best way to present it to remove or reduce the side effects.
>>>>>>> [ISC-Bugs #36066]"
>>>>>>>
>>>>>>> Is it still the case that kill isn't suitable for production
>>>>>>> purposes?
>>>>>>>
>>>>>>>
>>>>>>> With OMAPI:
>>>>>>>
>>>>>>> You can cleanly shut down via OMAPI ("set state=2", etc.); however, the
>>>>>>> effect on the failover protocol is less ideal than with signals.
>>>>>>>
>>>>>>> OMAPI shutdown will place the partner into "partner-down" state, making
>>>>>>> it become active for all leases in the failover pools, which isn't
>>>>>>> ideal when briefly restarting an instance. Contrast this with the
>>>>>>> effect of restarting an instance with kill, which is to briefly place
>>>>>>> the partner into "communications-interrupted" state, from which it
>>>>>>> immediately reverts to "normal" once the restarted instance is available
>>>>>>> (with auto-partner-down taking care of things if the instance does
>>>>>>> not recover).
>>>>>>>
>>>>>>>
>>>>>>> Is there a safe way to restart DHCP that has minimal impact on the
>>>>>>> failover protocol?
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Terry