BIND 10 #516: xfrin/zonmgr will cause system stall
BIND 10 Development
do-not-reply at isc.org
Thu Jan 20 00:10:32 UTC 2011
#516: xfrin/zonmgr will cause system stall
-------------------------------------+-------------------------------------
Reporter: jinmei | Owner: UnAssigned
Type: defect |
Priority: critical | Status: new
Component: Inter-module communication | Milestone: A-Team-Sprint-20110126
Sensitive: 0 | Keywords:
Add Hours to Ticket: 0 | Estimated Number of Hours: 0
Total Hours: 0 | Billable?: 1
| Internal?: 0
-------------------------------------+-------------------------------------
I just figured out (at least one of) the root causes of the b10-auth hang
I mentioned in #513. b10-xfrin and b10-zonemgr were the culprits.
These programs send control commands via CC channels, but never receive
the responses. For example, b10-xfrin has the following code:
{{{
    def publish_xfrin_news(self, zone_name, zone_class, xfr_result):
        ...
        try:
            self._send_cc_session.group_sendmsg(msg,
                                                XFROUT_MODULE_NAME)
            self._send_cc_session.group_sendmsg(msg,
                                                ZONE_MANAGER_MODULE_NAME)
        except socket.error as err:
            ...
}}}
But the other end of the command always returns a response (either error
or success), so the unread response messages will eventually fill up the
socket receive queue. At that point, b10-msgq will start blocking when
sending further commands to these processes, and eventually the entire
system will stall.
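
To illustrate the failure mode in isolation, here is a minimal sketch that
has nothing to do with the actual b10-msgq code: it writes to one end of a
local socket pair whose other end is never read. The function name, chunk
size, and iteration limit are made up for the illustration, and the
non-blocking mode is used only to detect the point where a blocking sender
would stall.

```python
import socket


def fill_until_blocked(chunk=b"x" * 4096, limit=100_000):
    """Write to a socket whose peer never reads; report when it would block.

    Hypothetical helper for illustration only. Returns a tuple of
    (bytes accepted by the kernel, whether the sender would have blocked).
    """
    reader, writer = socket.socketpair()
    # Non-blocking mode turns "would stall forever" into an exception.
    writer.setblocking(False)
    sent = 0
    blocked = False
    try:
        for _ in range(limit):
            try:
                sent += writer.send(chunk)
            except BlockingIOError:
                # The peer's receive queue is full: a blocking sender
                # would be stuck here until the peer reads something.
                blocked = True
                break
    finally:
        reader.close()
        writer.close()
    return sent, blocked
```

The number of bytes accepted before the sender blocks depends on the
platform's socket buffer sizes, so the sketch reports it rather than
assuming a value.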
An easy fix is to add group_recvmsg() after each call to group_sendmsg()
and just ignore the result. Since this bug is quite critical, I think we
need such a quick fix soon anyway.
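
The quick fix could look roughly like the sketch below. It runs against a
stub, not the real CC session class: StubSession, its reply format, the
helper name send_and_discard_reply(), and the two module name values are
all assumptions made up for the illustration, and the real group_sendmsg()
and group_recvmsg() may take different arguments.

```python
from collections import deque

# Assumed values; the real constants live in the BIND 10 sources.
XFROUT_MODULE_NAME = "Xfrout"
ZONE_MANAGER_MODULE_NAME = "Zonemgr"


class StubSession:
    """Hypothetical stand-in for a CC session; not the real isc.cc API."""

    def __init__(self):
        # Models the socket receive queue of the sending process.
        self.pending_replies = deque()

    def group_sendmsg(self, msg, group):
        # The peer always answers, so every command queues one reply.
        self.pending_replies.append({"result": [0], "from": group})

    def group_recvmsg(self):
        # Hand back the oldest queued reply, or None if nothing is waiting.
        if self.pending_replies:
            return self.pending_replies.popleft()
        return None


def send_and_discard_reply(session, msg, group):
    """Send a command, then read its reply so it cannot pile up."""
    session.group_sendmsg(msg, group)
    session.group_recvmsg()  # result intentionally ignored


def notify_all(session, msg):
    """Notify both modules, draining each reply right after sending."""
    for group in (XFROUT_MODULE_NAME, ZONE_MANAGER_MODULE_NAME):
        send_and_discard_reply(session, msg, group)
```

With this pattern the stub's queue stays empty no matter how many times
notify_all() is called, whereas calling group_sendmsg() alone leaves one
unread reply behind per command.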
On top of that, we'd have to think about whether and how to handle
the case where sendmsg() succeeds but recvmsg() doesn't. We'll also need
to make these command exchanges asynchronous eventually. And, as we
discussed at the f2f, we may even have to think about adopting a different
messaging framework.
I'm going to add this ticket to the current A team sprint since this bug
is quite critical and I don't want us to lose track of it in the backlog.
But I suspect it's beyond the scope of this sprint, and we should actually
consider it fodder for the next sprint.
--
Ticket URL: <https://bind10.isc.org/ticket/516>
BIND 10 Development <http://bind10.isc.org>