BIND 10 #1365: Restarts in #213

BIND 10 Development do-not-reply at isc.org
Wed Nov 2 23:59:13 UTC 2011


#1365: Restarts in #213
-------------------------------------+-------------------------------------
                   Reporter:         |                 Owner:  jinmei
  vorner                             |                Status:  reviewing
                       Type:  task   |             Milestone:
                   Priority:  major  |  Sprint-20111108
                  Component:  Boss   |            Resolution:
  of BIND                            |             Sensitive:  0
                   Keywords:         |           Sub-Project:  Core
            Defect Severity:  N/A    |  Estimated Difficulty:  0
Feature Depending on Ticket:         |           Total Hours:  0
        Add Hours to Ticket:  0      |
                  Internal?:  0      |
-------------------------------------+-------------------------------------

Comment (by jinmei):

 Replying to [comment:1 vorner]:
 > Hello
 >
 > I looked at the system tests. It looks like the socket creator (together
 with some more processes) refused to stop in a peaceful way, so boss
 started to send SIGTERMs and SIGKILLs. However, socket creator reset it's
 internal process object for some reason, so it couldn't send it, therefore
 not being able to stop. The patch should solve the problem in the boss
 side, but I'm not clear on why the processes didn't want to stop in the
 first place.

 I chased it further, and found the real cause.  If we reset the creator,
 we implicitly delete the Popen object.  It internally waits() for the
 child process in `__del__()`, which results in incosistent state
 between the boss (who believes the process is still alive) and the system
 (where the process doesn't exist).

 This is tricky, so I added some more detailed documentation about this
 point (pushed).

 Other general comments about this subbranch:

 - maybe rename BoB.processes to BoB.components?
 - Likewise, I'd rename the 'info' parameter of register_process() to
   'component' and document that it's expected to be an object of
 (sub)class
   of BaseComponent.

 - restart_processes() is now never-called but kept for reference,
   right?  I'd add a comment or doc string about that not to confuse
   people who happen to see this code.

 - some part of restart_processes() are now buggy.  e.g. this is
   obviously wrong:
 {{{
                     self.processes[proc_info.pid] = proc_info
 }}}
   but I'm okay with leaving it if we address the previous bullet point.

 - It seems we can now remove BoB.sockcreator.
 - Some log messages should be cleaned up, e.g. BIND10_SOCKCREATOR_CRASHED.

 - test_show_processes_started: I'm not sure if this replacement is
   okay.  Actually I'm not sure about the original intent of the test,
   but from the test name it seems to try to test whether the intended
   set of processes (components) have started up.  If so, the specific
   list of components had a meaning.  (We may eventually have to
   replace them from a list of processes based on the default config,
   but right now they are hardcoded).

 The following are suggestions that I found useful when I chased the
 systest failure further:

 - suggestion: adding this to the end of SockCreator._start_internal()
         self._boss.log_started(self.pid())

 - suggestion: in reap_children(), log the event of final child
   shutdown:
 {{{
             if pid in self.processes:
                 # One of the processes we know about.  Get information on
 it.
                 component = self.processes.pop(pid)
                 if component.running() and self.runnable:
                     # Tell it it failed. But only if it matters (we are
                     # not shutting down and the component considers itself
                     # to be running.
                     component.failed(exit_status);

                 # Additional log here:
                 logger.info(something_like_indicating_shutdown,
                             component.name(), component.pid(),
                             exit_status)
 }}}

-- 
Ticket URL: <http://bind10.isc.org/ticket/1365#comment:3>
BIND 10 Development <http://bind10.isc.org>
BIND 10 Development


More information about the bind10-tickets mailing list