BIND 10 #2790: Lettuce tests timing & missed messages
BIND 10 Development
do-not-reply at isc.org
Fri Mar 15 03:09:46 UTC 2013
#2790: Lettuce tests timing & missed messages
-------------------------------------+-------------------------------------
            Reporter:  vorner        |                        Owner:
                Type:  defect        |                       Status:  new
            Priority:  medium        |                    Milestone:  Sprint-20130319
           Component:  Unclassified  |
            Keywords:                |                   Resolution:
           Sensitive:  0             |                 CVSS Scoring:
         Sub-Project:  Core          |              Defect Severity:  N/A
Estimated Difficulty:  9             |  Feature Depending on Ticket:
         Total Hours:  0             |          Add Hours to Ticket:  0
                                     |                    Internal?:  0
-------------------------------------+-------------------------------------
Comment (by jinmei):
I figured out one of the common failure cases, where it appears
as if xfrin hung. Actually, what happened is that one of the zonemgr
threads blocked on a lock and failed to send the refresh command to
xfrin.
The blocking thread is running `ZonemgrRefresh`. On receiving a
NOTIFY, this thread is eventually invoked and tries to send the
refresh command to xfrin:
{{{#!python
def _send_command(self, module_name, command_name, params):
    """Send command between modules."""
    try:
        self._mccs.rpc_call(command_name, module_name, params=params)
    # (exception handlers omitted from this excerpt)
}}}
`rpc_call` eventually reaches `Session.sendmsg()`, where the thread
tries to acquire a lock:
{{{#!python
def sendmsg(self, env, msg=None):
    with self._lock:
        ...
}}}
On the other hand, the main thread normally blocks on the same socket
(`Session.recvmsg` via `ModuleCCSession.check_command`), while
holding the lock:
{{{#!python
def recvmsg(self, nonblock = True, seq = None):
    with self._lock:
        ...
        data = self._receive_full_buffer(nonblock) # <= this can block
}}}
So, if the main thread blocks before the refresh thread completes the
command exchange, zonemgr falls into a semi-deadlock (not a complete
deadlock, because the lock is released when a new command arrives). And
that's what actually happens when the lettuce test fails this way.
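A minimal, self-contained sketch of this interaction follows. It is not the BIND 10 code: `ToySession`, the `receiving` event and `demo_semi_deadlock` are hypothetical stand-ins, and a local socket pair plays the role of the msgq connection. It only illustrates how a receiver that blocks while holding the lock stalls a sender until some unrelated message arrives.

```python
import socket
import threading


class ToySession:
    """Hypothetical stand-in for a session whose send and receive share one lock."""

    def __init__(self, sock):
        self._sock = sock
        self._lock = threading.Lock()
        self.receiving = threading.Event()  # test hook: set once the lock is held

    def recvmsg(self):
        with self._lock:
            self.receiving.set()
            return self._sock.recv(4096)  # blocks while still holding the lock

    def sendmsg(self, data):
        with self._lock:  # cannot proceed while recvmsg() is blocked
            self._sock.sendall(data)


def demo_semi_deadlock():
    ours, peer = socket.socketpair()
    session = ToySession(ours)
    sent = threading.Event()

    def refresh():
        session.sendmsg(b"refresh")
        sent.set()

    # "Main thread": waits for the next command with the lock held.
    main = threading.Thread(target=session.recvmsg, daemon=True)
    main.start()
    session.receiving.wait()

    # "Refresh thread": tries to send while the receiver holds the lock.
    worker = threading.Thread(target=refresh, daemon=True)
    worker.start()
    stalled = not sent.wait(0.5)  # the send has not happened yet

    # Any incoming message wakes the receiver, which releases the lock.
    peer.sendall(b"new command")
    released = sent.wait(2.0)
    refresh_msg = peer.recv(4096) if released else b""

    main.join(2.0)
    worker.join(2.0)
    ours.close()
    peer.close()
    return stalled, released, refresh_msg
```

In this toy setup the refresh message only goes out after the peer sends something, which mirrors the "unlocked on receiving a new command" behavior described above.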
At the very least, a thread shouldn't block while holding a lock. We
should probably also clarify which attributes the `_lock` is meant to
protect and try to narrow the critical section (if possible).
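One possible way to narrow the critical section, sketched under assumptions rather than taken from the actual `Session` implementation, is to wait for readability outside the lock and take the lock only for the read itself. `NarrowLockSession` and `demo_narrow_lock` are hypothetical names, and the sketch assumes a single reader thread (with several readers, the `recv` after `select` could still block).

```python
import select
import socket
import threading


class NarrowLockSession:
    """Hypothetical session that never waits for input while holding the lock."""

    def __init__(self, sock):
        self._sock = sock
        self._lock = threading.Lock()

    def recvmsg(self, timeout=None):
        # Wait for data WITHOUT holding the lock.
        readable, _, _ = select.select([self._sock], [], [], timeout)
        if not readable:
            return None  # timed out, nothing to read
        with self._lock:  # lock held only for the actual read
            return self._sock.recv(4096)

    def sendmsg(self, data):
        with self._lock:
            self._sock.sendall(data)


def demo_narrow_lock():
    ours, peer = socket.socketpair()
    session = NarrowLockSession(ours)
    received = []
    sent = threading.Event()

    def main_loop():
        received.append(session.recvmsg(timeout=5.0))

    def refresh():
        session.sendmsg(b"refresh")
        sent.set()

    main = threading.Thread(target=main_loop, daemon=True)
    main.start()
    worker = threading.Thread(target=refresh, daemon=True)
    worker.start()

    # The send completes even though no command has arrived for the receiver.
    sent_without_wakeup = sent.wait(1.0)
    refresh_msg = peer.recv(4096)

    peer.sendall(b"new command")
    main.join(5.0)
    worker.join(2.0)
    ours.close()
    peer.close()
    return sent_without_wakeup, refresh_msg, received
```

Whether this is safe for the real code depends on what `_lock` is supposed to protect, which is exactly the open question above.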
BTW, in case it's not clear: this is a real bug that can happen in a
deployed environment, not just a test-only regression.
--
Ticket URL: <http://bind10.isc.org/ticket/2790#comment:10>
BIND 10 Development <http://bind10.isc.org>