BIND 10 #2398: msgq crash
BIND 10 Development
do-not-reply at isc.org
Fri Nov 23 15:03:33 UTC 2012
#2398: msgq crash
-------------------------------------+-------------------------------------
Reporter: jreed | Owner: vorner
Type: | Status: reviewing
defect | Milestone:
Priority: | Sprint-20121204
medium | Resolution:
Component: msgq | Sensitive: 0
Keywords: | Sub-Project: Core
Defect Severity: High | Estimated Difficulty: 4
Feature Depending on Ticket: | Total Hours: 0
Add Hours to Ticket: 0 |
Internal?: 0 |
-------------------------------------+-------------------------------------
Changes (by jelte):
* owner: jelte => vorner
Comment:
Replying to [comment:7 vorner]:
> Hello
>
> Replying to [comment:5 jelte]:
> > It does *not* handle and recover any other exceptions (except the
socket errors that were already handled). I don't think we should catchall
here, as depending on the exception we should handle it differently (and
perhaps simply crash, if it is a syntax error for example). Also, I kept
it to send() calls only. If we want we can do something similar for recv()
(which should now be a lot easier, as I think the tests could be extended
to test that too).
>
> I don't know. For one, I think syntax errors are not descendants of
Exceptions.
yeah they are; http://docs.python.org/2/library/exceptions.html#exception-
hierarchy, though a few tend to be thrown from the python parser (like
IndentationError), and therefore not be caught normally, unless something
like eval() is used, a number of them are (like NameError). This has
bitten me quite a few times already in parts of the code enclosed in
catchalls that ignore errors.
> Even when leaving these out, looking at `man send`, I think there are
more
> interesting cases there:
> * `ECONNRESET` (it would probably be similar to `EPIPE`).
> * `ENOBUFS` (the other side probably does not read, or something, so
the same).
> * `EINTR` (just retry like with `EAGAIN`).
>
Added these (and moved each type into a test that loops to reduce code
duplication). I wonder if we should maybe also kill it in the case of any
other socket errors, but for now I think this is good enough.
> Also, there are few points to the code:
> * What happens if the `kill_socket` is called (because of `EPIPE`), but
the
> socket has some data to read and has the read even queued?
I suspect we'll need to do something similar for any problem that may
arise while *reading*. Hopefully doing that would be a lot easier now that
we at least have somewhat of a testing framework.
Or maybe we should just go async completely :p (which will of course come
with all kinds of other fun problems)
> * I think there's +-3 error in this comment:
> {{{
> raise_on_send: integer. If send_exception is not None, it will
be
> raised on this byte (i.e. 1 = on the
first
> call to send(), 1 = on the 4th call to
send)
> Note: if 0, send_exception will not be
raised.
> }}}
That will teach me to change behaviour without rewriting the entire
docstring. Simplified it.
> * I don't like timeouts in testing code much. These are probably
reasonable, because they time out only in case something in the test fail.
But is 0.2 enough even on slow systems?
> {{{#!python
> # Give it a chance to stop, if it doesn't, no problem, it'll
> # die when the program does
> msgq_thread.join(0.2)
> }}}
> * Also, the code directly below, is it safe? In case the timeout
happens, the
> thread could still be running. And it could actually be modifying
some of
> the variables that are being examined. I know this is mostly a
theoretical
> concern and I know that if it was in the unsafe case, the test would
be
> failing anyway, but maybe it would be better to assert first the
thread is
> not alive.
> {{{#!python
> # Check the exception from the thread, if any
> self.assertEqual(expect_send_exception,
msgq_thread.caught_exception)
> }}}
While playing around with it a bit more, I ran into more problems. The
base problem was that in the case of *success* there was no way to
actually stop msgq (apart from outright killing it).
So I caved in and did what i had originally tried not to; built in a
mechanism to stop msgq. It is not a full command (yet), though when we
hook msgq up to itself, the stopping mechanism at least is there already
:) (oh and we can consider using it in boss)
This involved making the main poll loops not while True but while
self.running, and adding a timeout to the calls to poll() and
kqueue.control(). I chose 2 seconds for this, rather arbitrarily, but it's
a choice between not looping too often and not having to wait too long for
it to realize it should shut down.
That allowed the testing code to have much less timeouts, and timeouts
that can be much higher; I still have 1 second for an initial join() in
the cases where a kill_socket should occur (to give it a chance to handle
existing events before killing it), but the main and final join now waits
a full minute. If it hasn't closed down by then it is considered an error
and the test fails.
I'll keep an eye on that 1 second if it causes problems on slower systems.
--
Ticket URL: <http://bind10.isc.org/ticket/2398#comment:8>
BIND 10 Development <http://bind10.isc.org>
BIND 10 Development
More information about the bind10-tickets
mailing list