BIND 10 #530: [kean] msgq error message for unable to create socket

Tue Nov 19 09:32:44 UTC 2013

#530: [kean] msgq error message for unable to create socket
-------------------------------------+-------------------------------------
            Reporter:  jreed         |                        Owner:  muks
                Type:  defect        |                       Status:  closed
            Priority:  medium        |                    Milestone:  Sprint-20131015
           Component:  msgq          |                   Resolution:  fixed
            Keywords:                |                 CVSS Scoring:
           Sensitive:  0             |              Defect Severity:  N/A
         Sub-Project:  DNS           |  Feature Depending on Ticket:
Estimated Difficulty:  0.0           |          Add Hours to Ticket:  0
         Total Hours:  0             |                    Internal?:  0
-------------------------------------+-------------------------------------
Changes (by kean):

 * status:  reopened => closed
 * resolution:   => fixed

Comment:

 Replying to [comment:10 muks]:
 > Replying to [comment:8 kean]:
 > > if you look at the code I pointed you to you will see how the decision
 is made.
 >
 > I think I was not clear in my last comments. I asked about **why**
 SIGKILL has to be sent, i.e., why it has to be shutdown forcefully like
 that. It is not about the code that sends the SIGKILL signal. In the
 normal way of things, components should gracefully shutdown.

 If you won't look at the code, my responses to you will be meaningless. It
 **explains** why it sends SIGKILL. I will quote the code here so that,
 hopefully, this will be clear to you, as it addresses all your questions
 except one: which component is being killed. The answer is any of them. It
 doesn't matter, because they will all be killed the same way, but in this
 particular case it was msgq.

 {{{
         # Send TERM and KILL signals to modules if we're not prevented
         # from doing so
         if not self.nokill:
             # next try sending a SIGTERM
             self.__kill_children(False)
             # finally, send SIGKILL (unmaskable termination) until everybody
             # dies
             while self.components:
                 # XXX: some delay probably useful... how much is uncertain
                 time.sleep(0.1)
                 self.reap_children()
                 self.__kill_children(True)
             logger.info(BIND10_SHUTDOWN_COMPLETE)
 }}}

 So what happens first is the call to {{{__kill_children(False)}}}. This
 sends the SIGTERM signal to each component to ask it to terminate. If any
 of those processes has a SIGTERM handler and doesn't exit *immediately*
 (and by immediately I mean in the time it takes this loop to execute), the
 code drops into the {{{while}}} loop and calls
 {{{__kill_children(True)}}}, which sends a SIGKILL.

 The root of the problem is that there is no pause between sending the
 SIGTERM and the SIGKILL. Even the most trivial SIGTERM handler will take
 longer than the amount of time it takes for the code to drop into sending
 SIGKILL. Therefore, as stated above in comment 8, the thing to do is to
 put in some kind of pause, possibly even the Python equivalent of a
 wait() in that loop.
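
A minimal sketch of that suggestion, not the b10-init code. The names `shutdown_children`, `grace_period`, and `poll_interval` are hypothetical, and the children are assumed to be `subprocess.Popen` objects rather than BIND 10 components. It sends SIGTERM, polls until a deadline, and sends SIGKILL only to processes that are still alive afterwards.

```python
import signal
import subprocess
import sys
import time


def shutdown_children(children, grace_period=5.0, poll_interval=0.1):
    """Send SIGTERM, wait up to grace_period seconds, then SIGKILL survivors.

    `children` is a list of subprocess.Popen objects (an assumption of this
    sketch). Returns the children that had to be killed forcefully.
    """
    # Step 1: ask every running child to terminate.
    for child in children:
        if child.poll() is None:
            child.send_signal(signal.SIGTERM)

    # Step 2: give SIGTERM handlers time to run before escalating.
    deadline = time.monotonic() + grace_period
    remaining = [c for c in children if c.poll() is None]
    while remaining and time.monotonic() < deadline:
        time.sleep(poll_interval)
        remaining = [c for c in remaining if c.poll() is None]

    # Step 3: only children that outlived the grace period get SIGKILL.
    for child in remaining:
        child.kill()
        child.wait()
    return remaining


if __name__ == "__main__":
    # Hypothetical child: a Python process that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"])
    forced = shutdown_children([child], grace_period=1.0)
    print("children needing SIGKILL:", len(forced))
```

How long the grace period should be is a judgment call. The value here is a placeholder, not a recommendation for BIND 10.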

 > What happens that causes the ticket description's log?
 There was a permissions problem on the socket file or directory and the
 system was unable to start up.

 > * What component is being shutdown? Is it `b10-msgq`?
 Yes.

 > * Why is the component being shutdown at this point by `b10-init` when
 startup itself has failed and there is no process to kill?
 Because it is msgq itself that was being started up, one of the very first
 things init attempts. So until that point startup **hasn't** failed yet -
 it is the failure to start msgq that **is** the failure.

 > * If `b10-msgq` startup fails, it's fatal. How should `b10-init` handle
 it? Getting such a traceback would be bad. It would need to be logged
 appropriately and BIND 10 should stop gracefully. This problem will recur
 if we leave it as-is.
 As was stated above, there **is no more traceback**. I quoted the output
 showing what the failure case looks like. Your statement about the problem
 recurring isn't correct. In the 3 years since this bug was filed, changes
 have been made to catch the exceptions and log them cleanly rather than
 generating tracebacks due to unhandled exceptions.

 It is **very important to note** that the above diagnosis pertains only to
 the old code. As the code currently stands in master, none of the above is
 relevant at all because the correct exception handling is done. The entire
 line of questions is moot. The only fixes that were needed were to stop
 squelching the stdout output from b10-msgq, so that startup errors are
 visible, and to do a better job of displaying the socket name during
 exception handling. However, the above issue of not waiting for SIGTERM to
 be able to do its job is still relevant.
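
To illustrate the kind of handling described above (a clean log line that names the socket instead of a traceback), here is a rough sketch. It is not the b10-msgq code: `SocketSetupError`, `open_listen_socket`, `start`, and the logger name are all hypothetical, and a UNIX-domain stream socket is assumed.

```python
import logging
import os
import socket
import tempfile

logger = logging.getLogger("msgq_sketch")  # hypothetical logger name


class SocketSetupError(Exception):
    """Hypothetical error whose message includes the failing socket path."""


def open_listen_socket(path):
    """Bind a UNIX-domain listening socket, naming the path on failure."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
        sock.listen(5)
    except OSError as err:
        sock.close()
        # Put the socket name into the message so the log shows what failed.
        raise SocketSetupError(
            "unable to create socket %s: %s" % (path, err)) from err
    return sock


def start(path):
    """Return True on success; log one clean error line on failure."""
    try:
        sock = open_listen_socket(path)
    except SocketSetupError as err:
        logger.error("%s", err)
        return False
    # A real daemon would keep the socket; this sketch just cleans up.
    sock.close()
    os.unlink(path)
    return True


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    workdir = tempfile.mkdtemp()
    # The "missing" subdirectory does not exist, so bind() fails.
    start(os.path.join(workdir, "missing", "msgq.sock"))
    os.rmdir(workdir)
```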

-- 
Ticket URL: <http://bind10.isc.org/ticket/530#comment:11>
BIND 10 Development <http://bind10.isc.org>