BIND 10 #1365: Restarts in #213
BIND 10 Development
do-not-reply at isc.org
Wed Nov 2 23:59:13 UTC 2011
#1365: Restarts in #213
-------------------------------------+-------------------------------------
Reporter: | Owner: jinmei
vorner | Status: reviewing
Type: task | Milestone:
Priority: major | Sprint-20111108
Component: Boss | Resolution:
of BIND | Sensitive: 0
Keywords: | Sub-Project: Core
Defect Severity: N/A | Estimated Difficulty: 0
Feature Depending on Ticket: | Total Hours: 0
Add Hours to Ticket: 0 |
Internal?: 0 |
-------------------------------------+-------------------------------------
Comment (by jinmei):
Replying to [comment:1 vorner]:
> Hello
>
> I looked at the system tests. It looks like the socket creator (together
with some more processes) refused to stop in a peaceful way, so boss
started to send SIGTERMs and SIGKILLs. However, socket creator reset it's
internal process object for some reason, so it couldn't send it, therefore
not being able to stop. The patch should solve the problem in the boss
side, but I'm not clear on why the processes didn't want to stop in the
first place.
I chased it further, and found the real cause. If we reset the creator,
we implicitly delete the Popen object. It internally waits() for the
child process in `__del__()`, which results in incosistent state
between the boss (who believes the process is still alive) and the system
(where the process doesn't exist).
This is tricky, so I added some more detailed documentation about this
point (pushed).
Other general comments about this subbranch:
- maybe rename BoB.processes to BoB.components?
- Likewise, I'd rename the 'info' parameter of register_process() to
'component' and document that it's expected to be an object of
(sub)class
of BaseComponent.
- restart_processes() is now never-called but kept for reference,
right? I'd add a comment or doc string about that not to confuse
people who happen to see this code.
- some part of restart_processes() are now buggy. e.g. this is
obviously wrong:
{{{
self.processes[proc_info.pid] = proc_info
}}}
but I'm okay with leaving it if we address the previous bullet point.
- It seems we can now remove BoB.sockcreator.
- Some log messages should be cleaned up, e.g. BIND10_SOCKCREATOR_CRASHED.
- test_show_processes_started: I'm not sure if this replacement is
okay. Actually I'm not sure about the original intent of the test,
but from the test name it seems to try to test whether the intended
set of processes (components) have started up. If so, the specific
list of components had a meaning. (We may eventually have to
replace them from a list of processes based on the default config,
but right now they are hardcoded).
The following are suggestions that I found useful when I chased the
systest failure further:
- suggestion: adding this to the end of SockCreator._start_internal()
self._boss.log_started(self.pid())
- suggestion: in reap_children(), log the event of final child
shutdown:
{{{
if pid in self.processes:
# One of the processes we know about. Get information on
it.
component = self.processes.pop(pid)
if component.running() and self.runnable:
# Tell it it failed. But only if it matters (we are
# not shutting down and the component considers itself
# to be running.
component.failed(exit_status);
# Additional log here:
logger.info(something_like_indicating_shutdown,
component.name(), component.pid(),
exit_status)
}}}
--
Ticket URL: <http://bind10.isc.org/ticket/1365#comment:3>
BIND 10 Development <http://bind10.isc.org>
BIND 10 Development
More information about the bind10-tickets
mailing list