[bind10-dev] shutdown problem

Fri Oct 15 09:41:23 UTC 2010

Jinmei,

On Thu, 2010-10-07 at 23:40 +0900, JINMEI Tatuya / 神明達哉 wrote:
> At Wed, 06 Oct 2010 16:04:13 +0200,
> Shane Kerr <shane at isc.org> wrote:
> 
> > So the main problem is Python programs. We could make a simple module
> > used like:
> > 
> > import bind10shutdown
> > 
> > This could then be something like:
> [snip]
> > I think if we used something like this in our various Python programs we
> > would be okay.
> > 
> > Opinions?
> 
> It would depend on how much of our signal handling code is a duplicate
> and could be unified.  I don't have a comprehensive view on this point
> yet (so I cannot say it makes sense or not).

Two points before I go into detail:

     1. We should probably pick a policy for fatal signal handling
        across BIND 10 applications.
     2. Our current Python implementations often use an unsafe approach.

Policy For Fatal Signal Handling
--------------------------------

Thinking about this more, I think the best general approach may be to
adopt a methodology where we don't try to take any evasive action
when we get a fatal signal.

We need our software to survive when it crashes, whether this is a
hardware crash, kernel panic, "kill -9", or a simple shutdown. This
means that it always needs to be able to restart when it comes back up.
Since we have this need, we should be able to simply kill any component
at any time and have things return to their normal state as quickly as
possible.

There will be exceptions to this, when you can perform some cleanup that
is helpful, although not completely necessary. For example, a DHCP
client may want to release a lease. Not strictly necessary, but nice. Or
a DNS cache using Unix shared memory may want to destroy the segment.

A current example of something that should do cleanup is the boss, which
needs to take the rest of the system down when it dies. (In the original
requirements for the boss, I wrote that it should check the system
status when it restarts, in case someone issued a "kill -9" or it failed
for some reason. However, this was removed for simplicity - but this
requirement could be added back at any time if we thought it was
important.)

Because of this cleanup, it might be useful to have most processes
ignore SIGINT. This is to avoid the case where the administrator hits
Ctrl-C and:

      * a BIND 10 child process gets the SIGINT
      * the child process dies
      * the boss gets SIGCHLD
      * the boss restarts the child process
      * the boss gets SIGINT
      * the boss kills the child process

So, I'd like to propose a general BIND 10 policy that most processes
should only handle SIGINT, and that by setting it to SIG_IGN, unless
there is a specific reason for handling other signals.
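
For a child process that has no cleanup of its own, this policy might
come down to a single call at startup. This is only a sketch; the
function name ignore_interrupt() is made up for illustration and is not
existing BIND 10 code:

```python
import signal


def ignore_interrupt():
    """Ignore SIGINT so that a Ctrl-C on the terminal is left to the boss.

    Every other signal keeps its default disposition, so SIGTERM still
    ends the process immediately. Returns the previous SIGINT handler.
    """
    return signal.signal(signal.SIGINT, signal.SIG_IGN)
```

The boss would still catch SIGINT itself and take the children down, so
the child never races with the boss over who reacts to Ctrl-C.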

Signal Handling Anti-Pattern
----------------------------
I looked a bit more closely, and I think rather than a shared module
like the one I suggested, we need to fix a pattern we are using in our
Python code. Or rather, an anti-pattern. :)

So, for code that *does* need to handle signals, we should make sure we
do it right.

What we now have is something like this:

-----------------------------------------------------------------------
def signal_handler(signal, frame):
    if zonemgrd:
        zonemgrd.shutdown()
        sys.exit(0)

def set_signal_handler():
    signal.signal(signal.SIGTERM, signal_handler)
    signal.signal(signal.SIGINT, signal_handler)
-----------------------------------------------------------------------

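The problem here is the same one that you have in a C/C++ daemon, which
is that signals are asynchronous in the extreme sense: the handler can
run at almost any point in the main code, so calling shutdown() and
sys.exit() from it can interrupt the program while its data is in an
inconsistent state. The typical C solution is to only set a variable in
the signal handler. The cfgmgr program does something like that: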

-----------------------------------------------------------------------
cm = None

def signal_handler(signal, frame):
    global cm
    if cm:
        cm.running = False

def main():
    global cm
    try:
        cm = ConfigManager(DATA_PATH)
        signal.signal(signal.SIGINT, signal_handler)
        signal.signal(signal.SIGTERM, signal_handler)
        cm.read_config()
        cm.notify_boss()
        cm.run()
-----------------------------------------------------------------------

This is a good approach, and one we should use in most cases.

There is the slight drawback that if the ConfigManager is waiting on I/O
it won't be notified of the signal until it gets something and checks
it.
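
To make the flag pattern concrete, here is a minimal self-contained
sketch. Worker and install_handlers() are hypothetical stand-ins for
ConfigManager and its setup, not the real cfgmgr code:

```python
import signal


class Worker:
    """Stand-in for a daemon object whose main loop polls a flag."""

    def __init__(self):
        self.running = True
        self.steps = 0

    def request_stop(self, signum, frame):
        # The handler only records the request; it does no cleanup and
        # does not call sys.exit().
        self.running = False

    def run(self, step):
        # The flag is checked once per iteration, so a step that blocks
        # on I/O delays the reaction to the signal.
        while self.running:
            step()
            self.steps += 1


def install_handlers(worker):
    """Route SIGINT and SIGTERM to the worker's flag-setting handler."""
    signal.signal(signal.SIGINT, worker.request_stop)
    signal.signal(signal.SIGTERM, worker.request_stop)
```

Any cleanup then happens in the normal flow of control after run()
returns, where the program state is known to be consistent.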

If we need more timely alerts of fatal signals, we can use the
technique from the boss process, which uses the Python signal module's
concept of a "wakeup file descriptor":

-----------------------------------------------------------------------
    # Create wakeup pipe for signal handlers
    wakeup_pipe = os.pipe()
    signal.set_wakeup_fd(wakeup_pipe[1])

    # Set signal handlers for catching child termination, as well
    # as our own demise.
    signal.signal(signal.SIGCHLD, reaper)
    signal.siginterrupt(signal.SIGCHLD, False)
    signal.signal(signal.SIGINT, fatal_signal)
    signal.signal(signal.SIGTERM, fatal_signal)

    # Ignore SIGPIPE, as we don't want it to end this process
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)

      .
      .
      .

    wakeup_fd = wakeup_pipe[0]
    ccs_fd = boss_of_bind.ccs.get_socket().fileno()

      .
      .
      .

        # select() can fail with EINTR when a signal arrives,
        # even if system calls are marked as restartable, so we
        # have to catch the exception
        try:
            (rlist, wlist, xlist) = select.select([wakeup_fd, ccs_fd], [], [],
                                                  wait_time)
        except select.error as err:
            if err.args[0] == errno.EINTR:
                (rlist, wlist, xlist) = ([], [], [])
            else:
                sys.stderr.write("[bind10] Error with select(); %s\n" % err)
                break
-----------------------------------------------------------------------

In this case the signal handlers run, set a flag, and then continue on.
If the process is waiting on I/O, it gets woken up, and starts to
shut down right away.
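
The pieces fit together roughly as in the following standalone sketch.
It assumes Python 3.5 or later, where set_wakeup_fd() requires a
non-blocking descriptor and select() is retried automatically after a
handler returns, so no EINTR check is needed. The WakeupLoop class and
its method names are invented for illustration and are not the boss
code:

```python
import os
import select
import signal


class WakeupLoop:
    """Flag-setting handlers plus a pipe that wakes up select()."""

    def __init__(self):
        self.shutdown_requested = False
        self.read_fd, self.write_fd = os.pipe()
        # The wakeup fd must be non-blocking; making the read end
        # non-blocking too keeps the drain in wait() from stalling.
        os.set_blocking(self.read_fd, False)
        os.set_blocking(self.write_fd, False)
        self._old_wakeup_fd = signal.set_wakeup_fd(self.write_fd)

    def fatal_signal(self, signum, frame):
        # Only set a flag; the main loop does the actual shutdown.
        self.shutdown_requested = True

    def install(self):
        signal.signal(signal.SIGINT, self.fatal_signal)
        signal.signal(signal.SIGTERM, self.fatal_signal)

    def wait(self, other_fds, timeout):
        """Return the readable fds among other_fds.

        Returns an empty list on timeout, or when only a signal woke
        us up.
        """
        fds = [self.read_fd] + list(other_fds)
        rlist, _, _ = select.select(fds, [], [], timeout)
        if self.read_fd in rlist:
            try:
                # Drain the byte(s) written when the signal arrived.
                os.read(self.read_fd, 512)
            except BlockingIOError:
                pass
        return [fd for fd in rlist if fd != self.read_fd]

    def close(self):
        # Detach the pipe from the signal module before closing it.
        signal.set_wakeup_fd(self._old_wakeup_fd)
        os.close(self.read_fd)
        os.close(self.write_fd)
```

A main loop would then call wait() with its own descriptors and a
timeout, and leave the loop as soon as shutdown_requested is set.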

--
Shane