BIND 10 #2398: msgq crash

Fri Nov 23 15:03:33 UTC 2012

#2398: msgq crash
-------------------------------------+-------------------------------------
                   Reporter:  jreed  |                 Owner:  vorner
                       Type:         |                Status:  reviewing
  defect                             |             Milestone:
                   Priority:         |  Sprint-20121204
  medium                             |            Resolution:
                  Component:  msgq   |             Sensitive:  0
                   Keywords:         |           Sub-Project:  Core
            Defect Severity:  High   |  Estimated Difficulty:  4
Feature Depending on Ticket:         |           Total Hours:  0
        Add Hours to Ticket:  0      |
                  Internal?:  0      |
-------------------------------------+-------------------------------------
Changes (by jelte):

 * owner:  jelte => vorner

Comment:

 Replying to [comment:7 vorner]:
 > Hello
 >
 > Replying to [comment:5 jelte]:
 > > It does *not* handle and recover any other exceptions (except the
 socket errors that were already handled). I don't think we should catchall
 here, as depending on the exception we should handle it differently (and
 perhaps simply crash, if it is a syntax error for example). Also, I kept
 it to send() calls only. If we want we can do something similar for recv()
 (which should now be a lot easier, as I think the tests could be extended
 to test that too).
 >
 > I don't know. For one, I think syntax errors are not descendants of
 Exceptions.

 yeah they are; http://docs.python.org/2/library/exceptions.html#exception-
 hierarchy, though a few tend to be thrown from the python parser (like
 IndentationError), and therefore not be caught normally, unless something
 like eval() is used, a number of them are (like NameError). This has
 bitten me quite a few times already in parts of the code enclosed in
 catchalls that ignore errors.

 > Even when leaving these out, looking at `man send`, I think there are
 more
 > interesting cases there:
 >  * `ECONNRESET` (it would probably be similar to `EPIPE`).
 >  * `ENOBUFS` (the other side probably does not read, or something, so
 the same).
 >  * `EINTR` (just retry like with `EAGAIN`).
 >

 Added these (and moved each type into a test that loops to reduce code
 duplication). I wonder if we should maybe also kill it in the case of any
 other socket errors, but for now I think this is good enough.

 > Also, there are few points to the code:
 >  * What happens if the `kill_socket` is called (because of `EPIPE`), but
 the
 >    socket has some data to read and has the read even queued?

 I suspect we'll need to do something similar for any problem that may
 arise while *reading*. Hopefully doing that would be a lot easier now that
 we at least have somewhat of a testing framework.

 Or maybe we should just go async completely :p (which will of course come
 with all kinds of other fun problems)

 >  * I think there's +-3 error in this comment:
 > {{{
 >         raise_on_send: integer. If send_exception is not None, it will
 be
 >                                 raised on this byte (i.e. 1 = on the
 first
 >                                 call to send(), 1 = on the 4th call to
 send)
 >                                 Note: if 0, send_exception will not be
 raised.
 > }}}

 That will teach me to change behaviour without rewriting the entire
 docstring. Simplified it.

 >  * I don't like timeouts in testing code much. These are probably
 reasonable, because they time out only in case something in the test fail.
 But is 0.2 enough even on slow systems?
 > {{{#!python
 >         # Give it a chance to stop, if it doesn't, no problem, it'll
 >         # die when the program does
 >         msgq_thread.join(0.2)
 > }}}
 >  * Also, the code directly below, is it safe? In case the timeout
 happens, the
 >    thread could still be running. And it could actually be modifying
 some of
 >    the variables that are being examined. I know this is mostly a
 theoretical
 >    concern and I know that if it was in the unsafe case, the test would
 be
 >    failing anyway, but maybe it would be better to assert first the
 thread is
 >    not alive.
 > {{{#!python
 >         # Check the exception from the thread, if any
 >         self.assertEqual(expect_send_exception,
 msgq_thread.caught_exception)
 > }}}

 While playing around with it a bit more, I ran into more problems. The
 base problem was that in the case of *success* there was no way to
 actually stop msgq (apart from outright killing it).

 So I caved in and did what i had originally tried not to; built in a
 mechanism to stop msgq. It is not a full command (yet), though when we
 hook msgq up to itself, the stopping mechanism at least is there already
 :) (oh and we can consider using it in boss)

 This involved making the main poll loops not while True but while
 self.running, and adding a timeout to the calls to poll() and
 kqueue.control(). I chose 2 seconds for this, rather arbitrarily, but it's
 a choice between not looping too often and not having to wait too long for
 it to realize it should shut down.

 That allowed the testing code to have much less timeouts, and timeouts
 that can be much higher; I still have 1 second for an initial join() in
 the cases where a kill_socket should occur (to give it a chance to handle
 existing events before killing it), but the main and final join now waits
 a full minute. If it hasn't closed down by then it is considered an error
 and the test fails.
 I'll keep an eye on that 1 second if it causes problems on slower systems.

-- 
Ticket URL: <http://bind10.isc.org/ticket/2398#comment:8>
BIND 10 Development <http://bind10.isc.org>
BIND 10 Development