BIND 10 #530: [kean] msgq error message for unable to create socket
BIND 10 Development
do-not-reply at isc.org
Tue Nov 19 10:23:36 UTC 2013
#530: [kean] msgq error message for unable to create socket
-------------------------------------+-------------------------------------
            Reporter:  jreed         |                Owner:  muks
                Type:  defect        |               Status:  closed
            Priority:  medium        |            Milestone:  Sprint-20131015
           Component:  msgq          |             Keywords:
           Sensitive:  0             |           Resolution:  fixed
         Sub-Project:  DNS           |         CVSS Scoring:
Estimated Difficulty:  0.0           |      Defect Severity:  N/A
         Total Hours:  0             |  Feature Depending on Ticket:
                                     |  Add Hours to Ticket:  0
                                     |            Internal?:  0
-------------------------------------+-------------------------------------
Comment (by muks):
Replying to [comment:11 kean]:
> If you won't look at the code my responses to you will be meaningless.
> It **explains** why it sends SIGKILL. I will quote the code here so that
> hopefully this will be clear to you, as it addresses all your questions
> except one (which component is being killed - the answer is: any - it
> doesn't matter, they will all be killed the same way, but in this
> particular case it was msgq).
You have pointed at this code several times now. I'm aware of what this
code does, having written it. Let me put it another way. The questions
are not about this code at all. What was asked is:
* Why does the component have to be forcefully terminated? This is not
normal behavior. Components need to shut down gracefully on their own.
Sending TERM or KILL is a worst-case scenario. That code should not even
need to exist if components shut down properly. And because this is a
`b10-msgq` startup failure, there is nothing running to terminate, which
leads to the next question.
* If the component failed to start up, why is its pid being passed to
`os.kill()` at all? The pid will never exist, so the call always fails
(a sketch of guarding against this follows the snippet below):
{{{
os.kill(self.pid, sig)
OSError: [Errno 3] No such process
}}}
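For illustration only (this is not the BIND 10 code; the `Component`
class and `is_running()` helper below are assumptions), one way a
supervisor can avoid signalling a process that never existed is to record
the pid only after a successful start and skip the kill otherwise:
{{{
import os
import signal

class Component:
    """Hypothetical sketch of a supervised child process."""

    def __init__(self):
        self.pid = None            # set only after a successful fork/exec

    def is_running(self):
        # A pid of None means the child never started, so there is
        # nothing to signal.
        return self.pid is not None

    def kill(self, forceful=False):
        if not self.is_running():
            return                 # nothing to kill; avoids the OSError above
        sig = signal.SIGKILL if forceful else signal.SIGTERM
        os.kill(self.pid, sig)
}}}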
> The root of the problem is that there is no pause between sending the
> SIGTERM and the SIGKILL. Even the most trivial of SIGTERM handlers will
> take longer than the amount of time it takes for the code to drop into
> sending SIGKILL. Therefore, as stated above in comment 8, the thing to do
> is to put in some kind of pause, possibly even the Python equivalent of a
> wait() in that loop.
The topic of such a pause has been discussed before; check the mailing
lists and earlier bug reports. I don't remember the exact discussion, but
what we do now is based on it.
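As a sketch only (the helper name and the 5-second grace period are
assumptions, not what BIND 10 does), the kind of pause being discussed is
usually a TERM, a bounded wait, and only then a KILL:
{{{
import errno
import os
import signal
import time

def terminate_child(pid, grace_period=5.0, poll_interval=0.1):
    """Send SIGTERM, give the child time to exit, then fall back to SIGKILL."""
    try:
        os.kill(pid, signal.SIGTERM)
    except OSError as e:
        if e.errno == errno.ESRCH:
            return                 # process is already gone
        raise

    deadline = time.time() + grace_period
    while time.time() < deadline:
        # WNOHANG makes waitpid() non-blocking; (0, 0) means still running.
        done_pid, _status = os.waitpid(pid, os.WNOHANG)
        if done_pid == pid:
            return                 # the child exited on its own after SIGTERM
        time.sleep(poll_interval)

    # Grace period expired; force termination and reap the child.
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
}}}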
> > * If `b10-msgq` startup fails, it's fatal. How should `b10-init`
> > handle it? Getting such a traceback would be bad. It would need to be
> > logged appropriately and BIND 10 should stop gracefully. This problem
> > will recur if we leave it as-is.
> As was stated above, there **is no more traceback**. I quoted the output
> showing what the failure case looks like. Your statement about the
> problem recurring isn't correct. In the 3 years since this bug was filed,
> changes have been made to catch the exceptions and log them cleanly
> rather than generating tracebacks due to unhandled exceptions.
I looked at the entire startup codepath in `init.py.in`, and it's the
`try/except` in `__kill_children()` that catches this `OSError`, due to
#1858. This is the explanation I was asking for regarding the `OSError`
traceback in the ticket description.
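To make that concrete, the guard being described has roughly the
following shape; this is a simplified illustration, not a quote from
`init.py.in`:
{{{
import os
import signal

def kill_children(pids):
    """Illustrative only: signal every known child, tolerating pids that
    never existed or have already exited."""
    for pid in pids:
        try:
            os.kill(pid, signal.SIGTERM)
        except OSError:
            # Raised as "[Errno 3] No such process" when the component
            # failed at startup; swallow it so shutdown can continue.
            pass
}}}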
--
Ticket URL: <http://bind10.isc.org/ticket/530#comment:12>
BIND 10 Development <http://bind10.isc.org>