BIND 10 #530: [kean] msgq error message for unable to create socket
BIND 10 Development
do-not-reply at isc.org
Tue Nov 19 10:23:36 UTC 2013
#530: [kean] msgq error message for unable to create socket
-------------------------------------+-------------------------------------
            Reporter:  jreed         |                Owner:  muks
                Type:  defect        |               Status:  closed
            Priority:  medium        |            Milestone:  Sprint-20131015
           Component:  msgq          |             Keywords:
           Sensitive:  0             |           Resolution:  fixed
         Sub-Project:  DNS           |         CVSS Scoring:
Estimated Difficulty:  0.0           |      Defect Severity:  N/A
         Total Hours:  0             |  Feature Depending on Ticket:
                                     |  Add Hours to Ticket:  0
                                     |            Internal?:  0
-------------------------------------+-------------------------------------
Comment (by muks):
Replying to [comment:11 kean]:
> If you won't look at the code my responses to you will be meaningless.
> It **explains** why it sends SIGKILL. I will quote the code here so that
> hopefully this will be clear to you, as it addresses all your questions
> except one (which component is being killed - the answer is: any - it
> doesn't matter, they will all be killed the same way, but in this
> particular case it was msgq).
You have pointed at this code several times now. I'm aware of what this
code does, having written it. Let me put it another way. The questions
are not about this code at all. What was asked is:
* Why does the component have to be forcefully terminated? This is not
normal behavior. Components need to shut down gracefully on their own.
Sending TERM or KILL is a worst-case scenario. That code should not even
need to exist if components shut down properly. And because this is a
`b10-msgq` startup failure, there is nothing running to terminate, which
leads to the next question.
* If the component failed to start up, why is its pid being passed to
`os.kill()` at all? The pid will never exist, so the call always fails
(a sketch of guarding against this follows the snippet below):
{{{
os.kill(self.pid, sig)
OSError: [Errno 3] No such process
}}}
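For illustration only (this is not the BIND 10 code; the `Component`
class and `is_running()` helper below are assumptions), one way a
supervisor can avoid signalling a process that never existed is to record
the pid only after a successful start and skip the kill otherwise:
{{{
import os
import signal

class Component:
    """Hypothetical sketch of a supervised child process."""

    def __init__(self):
        self.pid = None            # set only after a successful fork/exec

    def is_running(self):
        # A pid of None means the child never started, so there is
        # nothing to signal.
        return self.pid is not None

    def kill(self, forceful=False):
        if not self.is_running():
            return                 # nothing to kill; avoids the OSError above
        sig = signal.SIGKILL if forceful else signal.SIGTERM
        os.kill(self.pid, sig)
}}}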
> The root of the problem is that there is no pause between sending the
> SIGTERM and the SIGKILL. Even the most trivial of SIGTERM handlers will
> take longer than the amount of time it takes for the code to drop into
> sending SIGKILL. Therefore, as stated above in comment 8, the thing to do
> is to put in some kind of pause, possibly even the Python equivalent of a
> wait() in that loop.
The topic of such a pause has been discussed before; check the mailing
lists and earlier bug reports. I don't remember the exact discussion, but
what we do now is based on it.
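As a sketch only (the helper name and the 5-second grace period are
assumptions, not what BIND 10 does), the kind of pause being discussed is
usually a TERM, a bounded wait, and only then a KILL:
{{{
import errno
import os
import signal
import time

def terminate_child(pid, grace_period=5.0, poll_interval=0.1):
    """Send SIGTERM, give the child time to exit, then fall back to SIGKILL."""
    try:
        os.kill(pid, signal.SIGTERM)
    except OSError as e:
        if e.errno == errno.ESRCH:
            return                 # process is already gone
        raise

    deadline = time.time() + grace_period
    while time.time() < deadline:
        # WNOHANG makes waitpid() non-blocking; (0, 0) means still running.
        done_pid, _status = os.waitpid(pid, os.WNOHANG)
        if done_pid == pid:
            return                 # the child exited on its own after SIGTERM
        time.sleep(poll_interval)

    # Grace period expired; force termination and reap the child.
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
}}}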
> > * If `b10-msgq` startup fails, it's fatal. How should `b10-init`
> > handle it? Getting such a traceback would be bad. It would need to be
> > logged appropriately and BIND 10 should stop gracefully. This problem
> > will recur if we leave it as-is.
> As was stated above, there **is no more traceback**. I quoted the output
> showing what the failure case looks like. Your statement about the
> problem recurring isn't correct. In the 3 years since this bug was filed,
> changes have been made to catch the exceptions and log them cleanly
> rather than generating tracebacks due to unhandled exceptions.
I looked at the entire startup codepath in `init.py.in`, and it's the
`try/except` in `__kill_children()` that catches this `OSError`, due to
#1858. This is the explanation I was asking for regarding the `OSError`
traceback in the ticket description.
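To make that concrete, the guard being described has roughly the
following shape; this is a simplified illustration, not a quote from
`init.py.in`:
{{{
import os
import signal

def kill_children(pids):
    """Illustrative only: signal every known child, tolerating pids that
    never existed or have already exited."""
    for pid in pids:
        try:
            os.kill(pid, signal.SIGTERM)
        except OSError:
            # Raised as "[Errno 3] No such process" when the component
            # failed at startup; swallow it so shutdown can continue.
            pass
}}}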
--
Ticket URL: <http://bind10.isc.org/ticket/530#comment:12>
BIND 10 Development <http://bind10.isc.org>