BIND 10 #516: xfrin/zonmgr will cause system stall

BIND 10 Development do-not-reply at isc.org
Thu Jan 20 00:10:32 UTC 2011


#516: xfrin/zonmgr will cause system stall
-------------------------------------+-------------------------------------
           Reporter:  jinmei         |                      Owner:
               Type:  defect         |  UnAssigned
           Priority:  critical       |                     Status:  new
          Component:  Inter-module   |                  Milestone:  A-Team-
  communication                      |  Sprint-20110126
          Sensitive:  0              |                   Keywords:
Add Hours to Ticket:  0              |  Estimated Number of Hours:  0
        Total Hours:  0              |                  Billable?:  1
                                     |                  Internal?:  0
-------------------------------------+-------------------------------------
 I just figured out (at least one of) the root causes of the b10-auth hang
 I mentioned in #513.  b10-xfrin and b10-zonemgr were the culprits.

 These programs send control commands via CC channels, but never read the
 responses.  For example, b10-xfrin has the following code:
 {{{
     def publish_xfrin_news(self, zone_name, zone_class,  xfr_result):
 ...
             try:
                 self._send_cc_session.group_sendmsg(msg, XFROUT_MODULE_NAME)
                 self._send_cc_session.group_sendmsg(msg, ZONE_MANAGER_MODULE_NAME)
             except socket.error as err:
 ...
 }}}

 But the receiving end of each command always returns a response (either
 success or error).  Since nobody reads them, the response messages will
 eventually fill up the socket receive queue.  At that point, b10-msgq will
 start blocking when sending further commands to these processes, and
 eventually the entire system will stall.
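
 The queue-filling effect can be reproduced in isolation.  The following is
 a minimal sketch using a plain socket pair, not the actual msgq code: the
 "sender" stands in for b10-msgq delivering responses, and the "receiver"
 stands in for a process that never reads them.  The variable names and the
 1024-byte message size are assumptions made for illustration.

```python
import socket

# A connected pair of sockets: "sender" plays the role of the message
# queue, "receiver" plays a process that never reads its responses.
receiver, sender = socket.socketpair()
sender.setblocking(False)    # non-blocking so the demo reports instead of hanging
receiver.setblocking(False)

chunk = b"x" * 1024          # hypothetical response message
sent_before_full = 0
try:
    while True:
        sender.send(chunk)   # nobody reads, so the queue keeps growing
        sent_before_full += 1
except BlockingIOError:
    # The queue is full.  A blocking sender would stall right here.
    pass

# Reading the pending data frees the queue again.
drained = 0
try:
    while True:
        data = receiver.recv(65536)
        if not data:
            break
        drained += len(data)
except BlockingIOError:
    pass                     # nothing left to read

sent_after_drain = sender.send(chunk)   # succeeds once the queue is drained

sender.close()
receiver.close()
```

 With a blocking socket, the send that raises BlockingIOError here would
 instead wait indefinitely, which matches the stall described above.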

 An easy fix is to call group_recvmsg() after each call to group_sendmsg()
 and just ignore the result.  Since this bug is quite critical, I think we
 need such a quick fix soon anyway.
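
 A rough sketch of that quick fix follows.  StubSession is a hypothetical
 stand-in so the example runs on its own; it is not the real CC session
 class, and its method signatures, the module name values, and the free
 function form of publish_xfrin_news() are assumptions made for
 illustration.

```python
import socket
from collections import deque

# Placeholder values; the real constants live in the BIND 10 sources.
XFROUT_MODULE_NAME = "Xfrout"
ZONE_MANAGER_MODULE_NAME = "Zonemgr"


class StubSession:
    """Hypothetical stand-in for the CC session (not the real API)."""

    def __init__(self):
        self.pending = deque()   # responses waiting to be read

    def group_sendmsg(self, msg, group):
        # Model the peer behaviour: every command produces one response.
        self.pending.append(({"result": [0]}, {"from": group}))

    def group_recvmsg(self, nonblock=False):
        # Return (answer, envelope), or (None, None) if nothing is queued.
        if not self.pending:
            return None, None
        return self.pending.popleft()


def publish_xfrin_news(session, msg):
    for module in (XFROUT_MODULE_NAME, ZONE_MANAGER_MODULE_NAME):
        try:
            session.group_sendmsg(msg, module)
            # Quick fix: read the response and deliberately ignore it,
            # so that it does not pile up in the receive queue.
            session.group_recvmsg(False)
        except socket.error:
            pass   # error handling is elided here, as in the excerpt


session = StubSession()
for _ in range(1000):
    publish_xfrin_news(session, {"command": ["notify"]})
```

 After the loop, session.pending is empty.  Without the group_recvmsg()
 call it would hold 2000 unread responses.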

 On top of that, we'd have to think about whether and how to handle the
 case where sendmsg() succeeds but recvmsg() doesn't.  We'll also need
 to make these command exchanges asynchronous eventually.  And, as we
 discussed at the f2f meeting, we may even have to think about adopting a
 different messaging framework.

 I'm going to add this ticket to the current A team sprint since this bug
 is quite critical and I don't want us to forget it in the backlog.
 But I suspect it's beyond the scope of this sprint, and we should actually
 consider it fodder for the next sprint.

-- 
Ticket URL: <https://bind10.isc.org/ticket/516>
BIND 10 Development <http://bind10.isc.org>