[bind10-dev] Handling Disappearing Terminals

Fri Mar 18 13:30:01 UTC 2011

All,

We need to think about what happens to the server when the terminal it
is running in disappears.

History
-------
(Skip if you are impatient for the good stuff.)

At the end of last month, Jeremy sent a mail about his problems setting
up a forwarding resolver:

https://lists.isc.org/pipermail/bind10-dev/2011-February/002038.html

He reported this:

        I know why my bind10 was killed; it doesn't daemonize so when I
        closed terminal it was running in, it was killed -- but
        sometimes children didn't get killed. HUP or whatever signal was
        not trapped or passed to children?

This led me to make a ticket so that we handle SIGHUP and other signals
that might kill the boss process:

http://bind10.isc.org/ticket/642

However, Michal noted that this didn't seem to do anything at all when
he started a process in the background and the terminal was closed. So I
had a look and discovered that the behavior for processes varies quite a
bit depending on the exact details of how the controlling terminal goes
away.

Details of Terminal Closing
---------------------------
I looked at what happens to a process started in each of 3 ways:

1. Running the program
2. Using "su" and then running the program 
3. Using "sudo" to run the program

My theory was that each of these might behave slightly differently, and
that turns out to be true.

I tried 3 types of test:

A. Start program and close the terminal window
B. Start program in the background (with & at the shell) then logout
C. Start program in the background then close the terminal window

I wrote small Python programs to use for this test, to concentrate on
figuring out the behavior.

My 1st program intercepted all signals possible, and then just waited
around for a KILL signal. :)
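
For illustration, here is a minimal sketch of what such a program might
look like. It is not the original test program; the names received,
log_fd, install_handlers and wait_idle are invented for this example, and
it assumes Python 3.8 or later on a POSIX system.

```python
import os
import signal
import time

received = []   # names of signals seen, in arrival order
log_fd = None   # optional fd of a log file; the terminal may be gone


def _record(signum, frame):
    """Remember each signal; never write to the terminal from here."""
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = "signal %d" % signum
    received.append(name)
    if log_fd is not None:
        os.write(log_fd, (name + "\n").encode())


def install_handlers(signals=None):
    """Install _record for each signal that can be caught; return names."""
    if signals is None:
        signals = sorted(signal.valid_signals())
    installed = []
    for sig in signals:
        if sig in (signal.SIGKILL, signal.SIGSTOP):
            continue  # these two can never be caught
        try:
            signal.signal(sig, _record)
        except (OSError, ValueError):
            continue  # the platform refuses a handler for this one
        installed.append(getattr(sig, "name", str(sig)))
    return installed


def wait_idle(seconds):
    """Do nothing for about `seconds`; handlers run as signals arrive."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        time.sleep(0.05)
```

To repeat the idle test, one would call install_handlers(), point log_fd
at a file opened with os.open() (results cannot be printed to a terminal
that has vanished), and then call wait_idle() in a loop until killed.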

My 2nd program intercepted all signals possible, and then wrote a stream
to STDOUT in a loop.

My 3rd program intercepted all signals possible, and then used select()
to see if anything was available for reading, and tried to read if it
was.
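
The writing and reading probes could be sketched roughly as follows.
Again this is an assumption about the shape of such a test, not the
original code; probe_write and probe_read are made-up names, and the
return strings simply mirror the key used in the results.

```python
import os
import select


def probe_write(fd, payload=b"tick\n"):
    """Attempt one write to fd; return 'ok', or 'err' on an I/O error."""
    try:
        os.write(fd, payload)
        return "ok"
    except OSError:
        return "err"


def probe_read(fd, timeout=1.0):
    """Wait up to `timeout` for input on fd, then try to read it.

    Returns 'idle' if nothing was readable, 'err' on an I/O error,
    'EOF' if the read returned 0 bytes, and 'data' otherwise.
    """
    readable, _, _ = select.select([fd], [], [], timeout)
    if not readable:
        return "idle"
    try:
        data = os.read(fd, 4096)
    except OSError:
        return "err"
    return "EOF" if data == b"" else "data"
```

Calling probe_write(1) in a loop approximates the 2nd program, and
calling probe_read(0) in a loop approximates the 3rd; in both cases the
outcome should be logged to a file rather than to the terminal.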

Results:

--[ 1: idle ]----------------------------------------------------------
              Start/Close     Background/Logout    Background/Close
normal          SIGHUP            nothing               SIGHUP
su              nothing           nothing               nothing
sudo          SIGHUP (3x)         nothing               SIGHUP

--[ 2: writing ]-------------------------------------------------------
              Start/Close      Background/Logout   Background/Close
normal   SIGHUP, SIGTSTP, err         err             SIGHUP, err
su       err, SIGHUP, SIGTSTP         err                 err
sudo     SIGHUP, SIGTSTP, err         err             SIGHUP, err

--[ 3: reading ]-------------------------------------------------------
              Start/Close      Background/Logout   Background/Close
normal        SIGHUP, EOF             EOF             SIGHUP, EOF
su                EOF             SIGTTIN, EOF       SIGTTIN, EOF
sudo          SIGHUP, err      SIGTTIN, SIGTSTP,      SIGHUP, EOF
                             SIGTERM, SIGTSTP, EOF

If more than one thing happened, they are listed in the order they
occurred.

Key:
  SIGXXX is a signal arriving
  err is an I/O error (either writing or reading)
  EOF means a read returned 0 bytes, indicating EOF

Michal's Observation
--------------------
I think we can understand Michal's results:

      * When the terminal window closed, the boss got no signal at all. 
      * Then when one of the child processes tried to output some
        message, it got a write error.
      * When the boss caught the dying child, it tried to output a
        message explaining this and *also* got a write error.
      * Over time, more and more children got write errors and died.

Analysis
--------
Based on the results above, the boss process can detect that the terminal
has gone away (a signal, an I/O error, or EOF on read) and adapt by
sending its output to /dev/null (or, better yet, by switching its output
calls to empty functions).
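
As a rough sketch of the /dev/null variant, assuming a POSIX system: the
helper names silence and write_or_silence are hypothetical and are not
part of the boss process.

```python
import os


def silence(fd):
    """Point fd at /dev/null so later writes succeed and are discarded."""
    devnull = os.open(os.devnull, os.O_WRONLY)
    if devnull != fd:
        os.dup2(devnull, fd)
        os.close(devnull)


def write_or_silence(fd, data):
    """Write data to fd; on an I/O error, silence fd and return False."""
    try:
        os.write(fd, data)
        return True
    except OSError:
        silence(fd)
        return False
```

Because the redirection happens at the file descriptor level, child
processes started afterwards with that descriptor inherited would also
write to /dev/null. Children that are already running keep their own
copy of the dead terminal, so this alone does not help them.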

The problem becomes what we do with child processes. If we want them to
write to the console, then they will get some sort of error too.

      * We could let the children die, and restart them, but this is...
        inelegant.
      * We could perhaps have the boss act as a proxy and use pipes to
        read the output.
      * We could do the same thing, but with pseudo-ttys. Python even
        has a module for this:
        http://docs.python.org/py3k/library/pty.html
      * We could shut down.
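
The proxy option might be sketched like this for a single child. This is
only an illustration under stated assumptions: run_proxied and the sink
callback are invented names, and a real boss would need to multiplex
several children (for example with select()) rather than block on one.

```python
import subprocess
import sys


def run_proxied(argv, sink):
    """Run argv with its output piped to us; forward each line to sink.

    If sink raises OSError (say, the terminal went away), forwarding
    stops but the pipe is still drained, so the child never sees a
    write error. Returns the child's exit status.
    """
    child = subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    sink_ok = True
    for line in child.stdout:
        if sink_ok:
            try:
                sink(line)
            except OSError:
                sink_ok = False  # keep draining, stop forwarding
    child.stdout.close()
    return child.wait()
```

With sink set to sys.stdout.buffer.write this behaves much like running
the child directly, except that only the boss ever touches the terminal.
The pseudo-tty variant would have the same shape, with the pipe replaced
by a descriptor pair from pty.openpty().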

I realize some people want us to 'properly' daemonize. This would make
the problem go away, but we'll have to change all of the processes to
live in such an environment, and we'll *still* have to deal with these
issues when the program is run in the equivalent of '-f' or '-g' from
BIND 9 (run in foreground).

Please let me know what you think.

--
Shane