BIND 10 #530: [kean] msgq error message for unable to create socket
BIND 10 Development
do-not-reply at isc.org
Tue Nov 19 09:32:44 UTC 2013
#530: [kean] msgq error message for unable to create socket
-------------------------------------+-------------------------------------
Reporter: jreed                      | Owner: muks
Type: defect                         | Status: closed
Priority: medium                     | Milestone: Sprint-20131015
Component: msgq                      | Resolution: fixed
Keywords:                            | CVSS Scoring:
Sensitive: 0                         | Defect Severity: N/A
Sub-Project: DNS                     | Feature Depending on Ticket:
Estimated Difficulty: 0.0            | Add Hours to Ticket: 0
Total Hours: 0                       | Internal?: 0
-------------------------------------+-------------------------------------
Changes (by kean):
* status: reopened => closed
* resolution: => fixed
Comment:
Replying to [comment:10 muks]:
> Replying to [comment:8 kean]:
> > if you look at the code I pointed you to you will see how the decision
> > is made.
>
> I think I was not clear in my last comments. I asked about **why**
> SIGKILL has to be sent, i.e., why it has to be shutdown forcefully like
> that. It is not about the code that sends the SIGKILL signal. In the
> normal way of things, components should gracefully shutdown.
If you won't look at the code, my responses to you will be meaningless. The
code **explains** why it sends SIGKILL. I will quote it here so that
hopefully this will be clear to you, as it addresses all your questions
except one: which component is being killed. The answer is any of them; it
doesn't matter, they will all be killed the same way, but in this
particular case it was msgq.
{{{
        # Send TERM and KILL signals to modules if we're not prevented
        # from doing so
        if not self.nokill:
            # next try sending a SIGTERM
            self.__kill_children(False)
            # finally, send SIGKILL (unmaskable termination) until everybody
            # dies
            while self.components:
                # XXX: some delay probably useful... how much is uncertain
                time.sleep(0.1)
                self.reap_children()
                self.__kill_children(True)
        logger.info(BIND10_SHUTDOWN_COMPLETE)
}}}
So what happens first is the call to {{{__kill_children(False)}}}. This
sends the SIGTERM signal to each component to kill it. If any of those
processes has a SIGTERM handler and doesn't exit *immediately* (and by
immediately I mean in the time it takes this code to reach the loop), init
drops into the {{{while}}} loop and calls {{{__kill_children(True)}}},
which sends a SIGKILL.
The root of the problem is that there is no pause between sending the
SIGTERM and the SIGKILL. Even the most trivial SIGTERM handler will
take longer than the amount of time it takes for the code to drop into
sending SIGKILL. Therefore, as stated above in comment 8, the thing to do
is to put in some kind of pause, possibly even the Python equivalent of a
wait(), in that loop.
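
As a minimal sketch of such a pause (this is not the b10-init code; it
assumes the children are plain {{{subprocess.Popen}}} objects on a POSIX
system, and the names {{{stop_children}}} and {{{grace}}} are hypothetical),
the idea is to send SIGTERM, wait up to a grace period for each child, and
only then fall back to SIGKILL:

```python
import subprocess
import time


def stop_children(children, grace=2.0):
    """Send SIGTERM to every child, wait up to `grace` seconds in total,
    then send SIGKILL to whatever is still running.

    `children` is a list of subprocess.Popen objects (an assumption made
    for this sketch). Returns a dict mapping pid -> exit status; a negative
    status means the child was terminated by that signal number.
    """
    # First pass: ask every running child to stop.
    for child in children:
        if child.poll() is None:
            child.terminate()  # SIGTERM on POSIX

    # Second pass: give the SIGTERM handlers time to run before escalating.
    deadline = time.monotonic() + grace
    for child in children:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            child.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            child.kill()  # SIGKILL, cannot be caught or ignored
            child.wait()  # reap the killed child

    return {child.pid: child.returncode for child in children}
```

With a wait like this, a component whose SIGTERM handler needs a moment to
clean up exits on its own, and only a component that is still running when
the grace period expires receives SIGKILL.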
> What happens that causes the ticket description's log?
There was a permissions problem on the socket file or directory and the
system was unable to start up.
> * What component is being shutdown? Is it `b10-msgq`?
Yes.
> * Why is the component being shutdown at this point by `b10-init` when
>   startup itself has failed and there is no process to kill?
Because it is msgq itself that was being started up, which is one of the
very first things init attempts. Until that point startup **hasn't** failed
yet; it is the failure to start msgq that **is** the failure.
> * If `b10-msgq` startup fails, it's fatal. How should `b10-init` handle
>   it? Getting such a traceback would be bad. It would need to be logged
>   appropriately and BIND 10 should stop gracefully. This problem will recur
>   if we leave it as-is.
As was stated above, there **is no more traceback**. I quoted the output
showing what the failure case looks like. Your statement about the problem
recurring isn't correct. In the 3 years since this bug was filed, changes
have been made to catch the exceptions and log them cleanly rather than
generating tracebacks due to unhandled exceptions.
It is **very important to note** that the above diagnosis pertains only to
the old code. As the code currently stands in master, the startup-failure
questions above are not relevant at all, because the correct exception
handling is done. The entire line of questions is moot. The only fixes
needed were to stop squelching the stdout output from b10-msgq, so that
startup errors are visible, and to do a better job of displaying the socket
name during exception handling. However, the issue of not giving SIGTERM
time to do its job before SIGKILL is sent is still relevant.
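
To illustrate what "displaying the socket name during exception handling"
can look like, here is a small sketch. It is not the actual b10-msgq change;
the function name {{{bind_unix_socket}}} and the message wording are
assumptions made for the example. The point is only that the error raised
on a failed bind carries the socket path alongside the OS error:

```python
import socket


def bind_unix_socket(path):
    """Create a UNIX domain socket bound to `path`.

    On failure, close the socket and raise an error that names the path,
    so a permissions or missing-directory problem is easy to diagnose.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
    except OSError as err:
        sock.close()
        # Include the socket path next to the underlying OS error text.
        raise RuntimeError(
            "unable to create socket {!r}: {}".format(path, err)
        ) from err
    return sock
```

A caller can log the resulting message as-is; it names the file that could
not be created instead of reporting only a bare errno string.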
--
Ticket URL: <http://bind10.isc.org/ticket/530#comment:11>
BIND 10 Development <http://bind10.isc.org>