Restarting DHCP safely whilst avoiding partner-down state

Terry Burton tez at terryburton.co.uk
Fri May 13 19:06:51 UTC 2016


On 13 May 2016 at 19:25, dave c <dhcp at gvtc.drakkar.org> wrote:
> Are folks forgetting that the default action of the kill command is to send
> the TERM signal? That signal should tell the daemon to do an orderly
> shutdown, close the leases file cleanly, send whatever signals to the
> partner that are required and then exit when everything is ready.
>
> All the concern I am seeing below would be true if folks were issuing a kill
> -9 to stop the service. At which point the leases file would get potentially
> corrupted.
<...snip...>
> So it sounds like a lot of angst over nothing... a TERM signal is defined as
> closing all processes and threads cleanly, writing out the last bits of data
> and stopping things in an orderly fashion. So seems that issuing kill {dhcpd
> pid} would be perfectly acceptable to close things down even in a partner
> scenario.

Where do you get the definition of SIGTERM causing a graceful
shutdown (other than by convention)? And if this were the case for ISC
DHCP, then why the warning about truncated leases given in AA-01043?

The effect of receiving a handleable signal is to immediately jump
into the trap handler if one is configured for that signal; otherwise
the default action applies, which for SIGTERM and SIGINT is to
terminate the process on the spot.

Unless a handler takes care to ensure that everything is consistent
before exiting, SIGTERM, SIGINT, etc. are potentially dangerous.
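
To make the point about handlers concrete, here is a minimal Python
sketch of what "taking care" might look like. It is not how dhcpd
works, and the names serve, request_stop and write_leases_atomically
are made up for illustration. The handler only records that a stop was
requested; the main loop finishes what it is doing, writes the file
via a temporary file and a rename, and only then returns.

```python
import os
import signal
import tempfile


def write_leases_atomically(path, text):
    """Write text to a temp file in the same directory, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before the rename
        os.replace(tmp_path, path)  # readers see the old file or the new one
    except BaseException:
        os.unlink(tmp_path)
        raise


def serve(path, work_items):
    """Hypothetical main loop: stop cleanly once SIGTERM has been seen."""
    stop = {"requested": False}

    def request_stop(signum, frame):
        # Do the bare minimum in the handler: note the request and return.
        stop["requested"] = True

    previous = signal.signal(signal.SIGTERM, request_stop)
    leases = []
    try:
        for item in work_items:
            if stop["requested"]:
                break  # stop taking on new work
            leases.append(item)
    finally:
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)
    # Only now, with nothing half-done, write the file out and return.
    write_leases_atomically(path, "".join(line + "\n" for line in leases))
    return leases
```

Without the handler installed, the same SIGTERM would end the process
wherever it happened to be, which is exactly the case I am worried
about.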

The release notes indicate that a "gentle shutdown" feature was added
in the past and then subsequently removed because the semantics chosen
caused operational issues - but what these were isn't known because
the associated bug report isn't publicly available.

I need to find time to understand the current codebase, but what I'd
like to know is the intended semantics and what issues are encountered
in implementing these in the way that Simon Hobson suggests.

> What I don't yet have a clear handle on is the timing considerations of a
> partner system being manipulated by external command and control processes
> e.g. adding a new vlan definition to both servers and restarting them at the
> same time or within seconds of each other.
>
> Do I need to incorporate a delay as was done by one of the earlier posters
> on this thread or is that precaution an unneeded complication? What happens
> when both partners are restarted at the same time? Does it delay the startup
> and cause DHCP responses to be ignored until they work things out among
> themselves?
>
> I am seeing reports in this thread from both extremes... one who forces a
> delay with even/odd minute detection and another who seems to not care how
> closely in time the two restart.
>
> That's the question I believe we should be caring about here...

That's *your question* (perhaps an interesting one, no offence
intended) but it is not the one I'm asking here so feel free to open a
new thread.


Thanks,

Terry


> On 5/13/16 13:02, Chuck Anderson wrote:
>>
>> On Fri, May 13, 2016 at 04:02:23PM +0100, Terry Burton wrote:
>>>
>>> On 13 May 2016 at 15:57, Chuck Anderson <cra at wpi.edu> wrote:
>>>>
>>>> On Fri, May 13, 2016 at 03:23:25PM +0100, Terry Burton wrote:
>>>>>
>>>>> On 13 May 2016 at 14:22, Chuck Anderson <cra at wpi.edu> wrote:
>>>>>>
>>>>>> I don't know if a corrupted lease file would cause a failure to start
>>>>>> the dhcp server, or if it would just go unnoticed, perhaps with a log
>>>>>> message.  But like I said, we've never had a failure to start the
>>>>>> server that was caused by a lease file issue.
>>>>>
>>>>>
>>>>> In our experience leases files corrupted by other means can cause a
>>>>> failure to start. I don't recall whether that was due to mere
>>>>> truncation though...
>>>>
>>>>
>>>> There is also the -T parameter to test the lease file:
>>>>
>>>>        The -T flag can be used to test the lease database file in a
>>>> similar way.
>>>>
>>>> It might be a good idea to also use this test before restarting.
>>>> While it won't fix a corrupted lease file, it may prevent you from
>>>> losing all DHCP service due to a failure to restart.
>>>
>>>
>>> I think this will require the leases file to be closed at the point of
>>> testing, i.e. the daemon has already exited.
>>>
>>> For the more general issue with systemd verifying the configuration
>>> see:
>>> https://lists.freedesktop.org/archives/systemd-devel/2016-May/036481.html
>>
>>
>> Is there a way to signal dhcpd to write out the lease file so it can
>> be checked?
>>
>> It seems that dhcpd needs a journaling mechanism similar to named,
>> where it writes the changes to a .jnl file and periodically
>> incorporates those changes into the main zone file.
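
On the -T suggestion quoted above, a check in the window between the
daemon exiting and being started again could be sketched roughly as
follows. The -T, -cf and -lf flags are from the dhcpd man page; the
function names are made up, the paths in the usage line are only
examples that vary by distribution, and I am assuming that dhcpd exits
non-zero when the test fails.

```python
import subprocess


def lease_test_command(config_path, lease_path, dhcpd="dhcpd"):
    """Build the command line that asks dhcpd to test the lease file."""
    return [dhcpd, "-T", "-cf", config_path, "-lf", lease_path]


def leases_look_valid(config_path, lease_path, run=subprocess.run):
    """Return True if the lease test exits with status 0 (assumed to mean OK)."""
    result = run(lease_test_command(config_path, lease_path))
    return result.returncode == 0
```

A restart script might then call
leases_look_valid("/etc/dhcp/dhcpd.conf", "/var/lib/dhcp/dhcpd.leases")
and refuse to start the daemon on a False result. As noted in the
quoted text, this won't repair a corrupted file.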

