BIND 10 #2790: Lettuce tests timing & missed messages

Fri Mar 15 03:09:46 UTC 2013

#2790: Lettuce tests timing & missed messages
-------------------------------------+-------------------------------------
            Reporter:  vorner        |                        Owner:
                Type:  defect        |                       Status:  new
            Priority:  medium        |                    Milestone:
           Component:  Unclassified  |  Sprint-20130319
            Keywords:                |                   Resolution:
           Sensitive:  0             |                 CVSS Scoring:
         Sub-Project:  Core          |              Defect Severity:  N/A
Estimated Difficulty:  9             |  Feature Depending on Ticket:
         Total Hours:  0             |          Add Hours to Ticket:  0
                                     |                    Internal?:  0
-------------------------------------+-------------------------------------

Comment (by jinmei):

 I figured out one of the common failure cases, where it appears
 as if xfrin hung.  Actually, what happened is that one of the zonemgr
 threads blocked on a lock and failed to send the refresh command to
 xfrin.

 The blocking thread is running `ZonemgrRefresh`.  On receiving a
 NOTIFY, this thread is eventually invoked and tries to send the
 refresh command to xfrin:
 {{{#!python
     def _send_command(self, module_name, command_name, params):
         """Send command between modules."""
         try:
             self._mccs.rpc_call(command_name, module_name, params=params)
 }}}

 `rpc_call` eventually reaches `Session.sendmsg()`, where the thread
 tries to acquire a lock:
 {{{#!python
     def sendmsg(self, env, msg=None):
         with self._lock:
 }}}

 On the other hand, the main thread normally blocks on the same socket
 (`Session.recvmsg` via `ModuleCCSession.check_command`), while
 holding the same lock:

 {{{#!python
     def recvmsg(self, nonblock = True, seq = None):
         with self._lock:
 ...
             data = self._receive_full_buffer(nonblock) # <= this can block
 }}}

 So, if the main thread blocks before the refresh thread completes the
 command exchange, zonemgr falls into a semi-deadlock (not a complete
 deadlock, because the lock is released when a new command arrives).
 And that's what actually happens when the lettuce test fails this way.
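
 The interaction can be reproduced in isolation.  The following is only
 a toy sketch, not the BIND 10 code: `ToySession`, `demo` and the
 `socketpair` standing in for the msgq connection are all made up for
 illustration.  It shows that `sendmsg()` cannot proceed until the
 thread blocked in `recvmsg()` receives something and releases the lock:

```python
import socket
import threading
import time


class ToySession:
    """Hypothetical stand-in for Session; only the locking is modelled."""

    def __init__(self, sock):
        self._sock = sock
        self._lock = threading.Lock()

    def sendmsg(self, data):
        with self._lock:
            self._sock.sendall(data)

    def recvmsg(self):
        with self._lock:
            # Blocks until the peer sends something, still holding _lock.
            return self._sock.recv(4096)


def demo(delay=0.5):
    """Return (data read by the main thread, seconds sendmsg() had to wait)."""
    local, peer = socket.socketpair()
    session = ToySession(local)
    waited = []

    def refresh():
        time.sleep(0.1)  # let the main thread enter recvmsg() first
        start = time.monotonic()
        session.sendmsg(b"refresh")  # stuck until recvmsg() returns
        waited.append(time.monotonic() - start)

    def new_command():
        time.sleep(delay)
        peer.sendall(b"new command")  # this is what finally unblocks things

    threads = [threading.Thread(target=refresh),
               threading.Thread(target=new_command)]
    for t in threads:
        t.start()
    received = session.recvmsg()  # plays the role of the main thread
    for t in threads:
        t.join()
    local.close()
    peer.close()
    return received, waited[0]
```

 Calling `demo()` should show the refresh thread waiting for roughly
 `delay` seconds; with no incoming command it would wait indefinitely.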

 At the very least, a thread shouldn't block while holding a lock.  We
 should probably also clarify which attributes `_lock` is meant to
 protect and try to narrow the critical section (if possible).
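
 One possible shape for a narrower critical section, as a rough sketch
 only: wait for the socket to become readable without the lock, and take
 the lock just for the read itself.  `NarrowLockSession` and `demo` are
 hypothetical names, the real `Session` has framing and sequence
 handling that this ignores, and the sketch assumes a single reader
 thread (with several readers the `recv` could still block).

```python
import select
import socket
import threading
import time


class NarrowLockSession:
    """Hypothetical variant of the toy session; not the real Session API."""

    def __init__(self, sock):
        self._sock = sock
        self._lock = threading.Lock()

    def sendmsg(self, data):
        with self._lock:
            self._sock.sendall(data)

    def recvmsg(self, timeout=None):
        # Wait for data WITHOUT holding the lock, so senders can proceed.
        readable, _, _ = select.select([self._sock], [], [], timeout)
        if not readable:
            return None  # timed out, nothing to read
        with self._lock:
            # Data is already available (single-reader assumption), so
            # this read should not block while the lock is held.
            return self._sock.recv(4096)


def demo():
    """Return (data seen by peer, data read by reader, sendmsg wait time)."""
    local, peer = socket.socketpair()
    session = NarrowLockSession(local)
    result = {}

    def reader():
        result["msg"] = session.recvmsg(timeout=2.0)

    t = threading.Thread(target=reader)
    t.start()
    time.sleep(0.1)  # the reader is now waiting in select(); lock is free
    start = time.monotonic()
    session.sendmsg(b"refresh")  # no longer stuck behind the reader
    waited = time.monotonic() - start
    sent = peer.recv(4096)
    peer.sendall(b"reply")  # wakes the reader up
    t.join()
    local.close()
    peer.close()
    return sent, result["msg"], waited
```

 Here `sendmsg()` returns promptly even though another thread is waiting
 for input.  Whether this is safe for the real class depends on what
 `_lock` is actually supposed to protect.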

 BTW, in case it's not clear: this is a real bug that can happen in a
 deployed environment, not just a test-only regression.

-- 
Ticket URL: <http://bind10.isc.org/ticket/2790#comment:10>
BIND 10 Development <http://bind10.isc.org>