Restarting DHCP safely whilst avoiding partner-down state

Terry Burton tez at terryburton.co.uk
Fri May 13 19:30:58 UTC 2016


On 13 May 2016 at 20:06, Terry Burton <tez at terryburton.co.uk> wrote:
> On 13 May 2016 at 19:25, dave c <dhcp at gvtc.drakkar.org> wrote:
>> Are folks forgetting that the default action of the kill command is to send
>> the TERM signal? That signal should tell the daemon to do an orderly
>> shutdown, close the leases file cleanly, send whatever signals to the
>> partner that are required and then exit when everything is ready.
>>
>> All the concern I am seeing below would be true if folks were issuing a kill
>> -9 to stop the service. At which point the leases file would get potentially
>> corrupted.
> <...snip...>
>> So it sounds like a lot of angst over nothing... a TERM signal is defined as
>> closing all processes and threads cleanly, writing out the last bits of data
>> and stopping things in an orderly fashion. So seems that issuing kill {dhcpd
>> pid} would be perfectly acceptable to close things down even in a partner
>> scenario.
>
> Where do you get the definition of a SIGTERM causing a graceful
> shutdown (other than by convention) and if this were the case for ISC
> DHCP then why the warning about truncated leases given in AA-01043?
>
> The effect of receiving a handleable signal is to immediately jump
> into the trap handler if one is configured for that signal, otherwise
> to die.
>
> Unless a handler takes care to ensure that everything is consistent
> before it exits, SIGTERM, SIGINT, etc. are potentially dangerous.
>
> The release notes indicate that a "gentle shutdown" feature was added
> in the past and then subsequently removed because the semantics chosen
> caused operational issues - but what these were isn't known because
> the associated bug report isn't publicly available.
>
> I need to find time to understand the current codebase, but what I'd
> like to know is the intended semantics and what issues would be
> encountered in implementing them in the way that Simon Hobson suggests.

So currently there are no trap handlers for SIGTERM or SIGINT and
therefore no cleanup whatsoever at exit.

There is a compiled-out option, ENABLE_GENTLE_SHUTDOWN, which installs
handlers for these signals. When activated, however, it implements the
harmful semantics of putting the server through a
recovery+partner-down transition, which isn't useful for a quick
configuration reload:

/* Enable the gentle shutdown signal handling.  Currently this
   means that on SIGINT or SIGTERM a client will release its
   address and a server in a failover pair will go through
   partner down.  Both of which can be undesireable in some
   situations.  We plan to revisit this feature and may
   make non-backwards compatible changes including the
   removal of this define.  Use at your own risk.  */
/* #define ENABLE_GENTLE_SHUTDOWN */

#if defined(ENABLE_GENTLE_SHUTDOWN)
        /* no signal handlers until we deal with the side effects */
        /* install signal handlers */
        signal(SIGINT, dhcp_signal_handler);   /* control-c */
        signal(SIGTERM, dhcp_signal_handler);  /* kill */
#endif

Having a more basic signal handler that defers the exit until any
outstanding lease has been written out seems better. Perhaps one
could even differentiate the exit semantics based on SIGINT vs
SIGTERM.

If someone who can speak for ISC is able to indicate whether this
would be a sensible approach then I am happy to work up a patch.
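
To make the idea concrete, here is a rough sketch of the kind of
deferred-exit handling I have in mind. It is not taken from the ISC
codebase: install_deferred_exit(), write_lease_record() and
write_until_signalled() are hypothetical names, and the lease "record"
is just a string appended to a file. The handler only records which
signal arrived; the write loop checks that flag between records, so a
record that has been started is always finished and flushed before the
process decides to exit.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Set by the handler, polled by the write loop. A volatile sig_atomic_t
   is the only kind of object that is safe to assign from a handler. */
static volatile sig_atomic_t pending_signal = 0;

static void note_signal(int signo)
{
    pending_signal = signo;   /* record and return; no I/O in here */
}

/* Install the same flag-setting handler for SIGINT and SIGTERM.
   Returns 0 on success, -1 on failure. */
int install_deferred_exit(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = note_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* let an interrupted write resume */
    if (sigaction(SIGINT, &sa, NULL) != 0)
        return -1;
    if (sigaction(SIGTERM, &sa, NULL) != 0)
        return -1;
    return 0;
}

/* Hypothetical stand-in for committing one lease: append the record,
   then push it through stdio and the kernel to stable storage. */
int write_lease_record(FILE *out, const char *record)
{
    assert(out != NULL && record != NULL);
    if (fputs(record, out) == EOF)
        return -1;
    if (fflush(out) != 0)
        return -1;
    return fsync(fileno(out));
}

/* Write records until they are all done or a signal has been noted.
   The flag is only checked between records, so no record is cut short.
   Stores the number of complete records in *written and returns the
   pending signal number, or 0 if none arrived. */
int write_until_signalled(FILE *out, const char *const *records,
                          size_t count, size_t *written)
{
    size_t i;

    for (i = 0; i < count && pending_signal == 0; i++) {
        if (write_lease_record(out, records[i]) != 0)
            break;
    }
    *written = i;
    return pending_signal;
}
```

Because write_until_signalled() returns the signal number, the caller
could branch on SIGINT vs SIGTERM after the loop, for example exiting
immediately on one and doing additional cleanup on the other. Which
behaviour belongs to which signal is exactly the open question above.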


>> On 5/13/16 13:02, Chuck Anderson wrote:
>>>
>>> On Fri, May 13, 2016 at 04:02:23PM +0100, Terry Burton wrote:
>>>>
>>>> On 13 May 2016 at 15:57, Chuck Anderson <cra at wpi.edu> wrote:
>>>>>
>>>>> On Fri, May 13, 2016 at 03:23:25PM +0100, Terry Burton wrote:
>>>>>>
>>>>>> On 13 May 2016 at 14:22, Chuck Anderson <cra at wpi.edu> wrote:
>>>>>>>
>>>>>>> I don't know if a corrupted lease file would cause a failure to start
>>>>>>> the dhcp server, or if it would just go unnoticed, perhaps with a log
>>>>>>> message.  But like I said, we've never had a failure to start the
>>>>>>> server that was caused by a lease file issue.
>>>>>>
>>>>>>
>>>>>> In our experience leases files corrupted by other means can cause a
>>>>>> failure to start. I don't recall whether that was due to mere
>>>>>> truncation though...
>>>>>
>>>>>
>>>>> There is also the -T parameter to test the lease file:
>>>>>
>>>>>        The -T flag can be used to test the lease database file in a
>>>>>        similar way.
>>>>>
>>>>> It might be a good idea to also use this test before restarting.
>>>>> While it won't fix a corrupted lease file, it may prevent you from
>>>>> losing all DHCP service due to a failure to restart.
>>>>
>>>>
>>>> I think this will require the leases file to be closed at the point of
>>>> testing, i.e. the daemon has already exited.
>>>>
>>>> For the more general issue with systemd verifying the configuration
>>>> see:
>>>> https://lists.freedesktop.org/archives/systemd-devel/2016-May/036481.html
>>>
>>>
>>> Is there a way to signal dhcpd to write out the lease file so it can
>>> be checked?
>>>
>>> It seems that dhcpd needs a journaling mechanism similar to named,
>>> where it writes the changes to a .jnl file and periodically
>>> incorporates those changes into the main zone file.

