BIND 10 #640: cfgmgr hanging and command and configurations for no-longer-running components

Wed Feb 8 16:26:23 UTC 2012

#640: cfgmgr hanging and command and configurations for no-longer-running
components
-------------------------------------+-------------------------------------
                   Reporter:  jreed  |                 Owner:  jinmei
                       Type:         |                Status:  reviewing
  defect                             |             Milestone:
                   Priority:  major  |  Sprint-20120221
                  Component:         |            Resolution:
  Unclassified                       |             Sensitive:  0
                   Keywords:         |           Sub-Project:  DNS
            Defect Severity:  N/A    |  Estimated Difficulty:  0.0
Feature Depending on Ticket:         |           Total Hours:  0
        Add Hours to Ticket:  0      |
                  Internal?:  0      |
-------------------------------------+-------------------------------------
Changes (by jelte):

 * owner:  jelte => jinmei

Comment:

 Answering both your comment fields as one, in an attempt to cope with the
 potentially confusing non-threaded trac comment view :)

 Replying to [comment:10 jinmei]:
 > First, comments on the rest of the original branch (before you updated
 > it based on the initial comments).
 >
 > '''some higher level points'''
 >
 > The relationship between bindctl and module CC session is complicated
 > and difficult to understand.  For example, it took some time for me to
 > understand the interaction of UIModuleCCSession and bindctl - these
 > two (defined in separate modules) are tightly coupled and work based
 > on some implicit assumption of each other's features.  So, to
 > understand changes to the module CC session I needed to look into
 > bindctl.  Is it only me, or is this a known issue and kind of a longer
 > term TODO?  In any case, however, I'm not requesting it be resolved
 > within this ticket.
 >

 It isn't an explicit TODO so far, but it grew this way partly because of
 the way bindctl was originally written and how it changed over time. The
 idea should be that bindctl is just one tool that can use 'the API'
 (UIModuleCCSession) to control everything (hence a lot of functionality
 does not reside in bindctl itself), but I readily admit that it may have
 gotten quite muddled over time. Also see next point.

 > Maybe unrelated to this ticket but: does bindctl try to get the full
 > latest spec and config data of all running modules on accepting every
 > command?  If so, is the overhead acceptable?
 >

 Yes, it (currently) does; IIRC this got put in by Likun when he made
 bindctl a stateless http server, and we found that on the one hand you
 want to modify things locally (i.e. you need to be able change multiple
 settings and submit them as one), but do want things like tab-completion,
 for which you need at least the full specs and config. Rather than
 inventing a 'given your previous state, these are the changes' protocol he
 opted for a full fetch of the remote data on every command, since that was
 easiest at the time. As long as it's run locally and we don't have
 kilobytes of data to send, I think the overhead is acceptable. There are a
 few easy targets for improvements; we could consider having a very simple
 'anything changed?' request/response prior to requesting all data, and we
 could make the actual data one big request (instead of a small one for
 each module). But I must say I have not yet though much about doing it
 completely differently (we could for instance go the other way completely
 and make it more like AJAX websites; putting all interactive knowledge on
 the server side in a session and making bindctl a much more basic tool).

 > When the application doesn't die gracefully, sending the stop message
 > can be skipped.  Is it a TODO thing when it supports some kind of
 > RAII?  For that matter, the whatever successor of msgq may make it
 > possible for cfgmgr to (easily) detect it when a specific module is
 > disconnected.  If so, we may not have to have a "send stopping"
 > primitive, either directly or indirectly.
 >

 Yes, this is what I meant by making ModuleCCSession a Context; the way to
 do RAII in python is different from C++, since the often implicit
 reference counting messes up much of the reliability of __del__(), so what
 you do is make your objects implement the Context interface; give it a
 method __enter__() and __exit()__, and use your object with the with-
 statement. So what I had in mind would look something like
 {{{
     with ModuleCCSession(args) as mccs:
         main_proccessing_loop()
 }}}

 But since this requires some fundamental changes in nearly all modules, I
 deemed it out of scope for this ticket itself, but have not made a new
 ticket yet (want to work out a bit of an approach for that first).

 > '''ccsession.py'''
 >
 > I've come up with additional comments on this file while reading other
 > part of the branch.
 >
 > - request_specifications: it would be helpful if it had more detailed
 >   description, including it clears the previous data.  It would also
 >   be helpful how this method (via update_specs_and_config()) is
 >   expected to be used - and, for that matter, whether
 >   request_specification() is expected to be public or essentially
 >   private (in the latter case adding `__` will clarify the intent a
 >   bit clearer, although this will make some update to the test code).
 > - request_specifications: not directly related to this ticket, but I
 >   suspect we can remove or at least revise the comment.
 >

 updated the docs a bit (and moved the update function to below the two it
 calls). In short, the update one is a convenience function, since most of
 the time you need them both. There are exceptions (mainly in the tests, at
 this moment, but if we change the way we do the updates to something more
 efficient, this may or may not be the case for bindctl too).

 > '''ccsession_mock.py'''
 >
 > - I'd like to have some pydoc description for MockModuleCCSession.
 >

 ok

 > '''cfgmgr_test.py'''
 >
 > - This part is a bit confusing:
 > {{{#!python
 >         # drop the two messages for now
 >         self.assertEqual(len(self.fake_session.message_queue), 2)
 >         self.fake_session.get_message("Cmdctl", None)
 >         self.fake_session.get_message("TestModule", None)
 >
 >         self.assertEqual(len(self.fake_session.message_queue), 0)
 > }}}
 >   It's probably for clearing the faked message queue for the
 >   convenience of subsequent tests, but this part itself raises some
 >   unrelated questions: what are these two messages?  Where do the
 >   magic string of "Cmdctl" or "TestModule" come from?  Maybe the
 >   answer can be found if I read the other tests above this part (I
 >   didn't), but unless these preceding tests are prerequisite of the
 >   added tests, I'd rather be free from the duty of understanding it.
 >   Also related, test_handle_msg() now looks too long and is generally
 >   difficult to understand when something is added.  I'd suggest
 >   separating the message handling tests into independent sub tests.  I
 >   suspect the handling of "stopping" can be a separate test case, and
 >   if so, it will much easier to understand.
 >

 Refactored the entire test, split it into four separate ones (and redid a
 bit of the internals)

 > '''cmdctl.py.in'''
 >
 > - While the comment says: "If it is None, delete the module if it is
 >   known (and ignore otherwise)", it seems that None will be set in the
 >   "otherwise" case.  And, is it okay to ignore it anyway?  If this is
 >   a bogus input I think we should rather return an error, log it, or
 >   both.   Or, if it's an okay value, the comment should be updated.
 >

 i originally opted for lenience, but returning an error is probably a good
 idea (though i'm not sure what we should do with that error). Anyway the
 code itself was wrong to put in that none, so added a test for that as
 well, and it now sends back an error

 > '''cmdctl_test.py'''
 >
 > - I think it should also test the case when arg[1] is NULL and args[0]
 >   is not in modules_spec.  See also the comment on cmdctl.py.
 >

 ah, yes, just did :)

 > '''bind10_src.py'''
 >
 > - Is "rest of the system" correct (or appropriate)?
 > {{{#!python
 >         # If ccsession is still there, inform rest of the system this
 module
 >         # is stopping.
 > }}}
 >   I guess it's only delivered to cfgmgr and cmdctl (and perhaps
 >   bindctl via cmdctl).

 I used 'system' here because the fact that this message goes to the
 ConfigManager channel is kind of an implementation detail for
 ModuleCCSession (and if we want to be pedantic, this is just a channel
 that cfgmgr happens to listen on, its name being the same as the module
 name by convention).

 > - Is it possible that self.ccs is None?
 > {{{#!python
 >         if self.ccs is not None:
 >             self.ccs.send_stopping()
 > }}}
 >

 hmm no maybe i was overly conservative there, removed the check

 > '''ddns_test.py'''
 >
 > - Why doesn't this use `MockModuleCCSession`?
 >

 these tests use much more details about the CCSession than the rest, and
 tbh i figured if we're gonna seriously refactor those parts of the tests
 anyway, we could better do it then (for starters, mockmoduleccsession
 would need to be a subclass of configdata for these tests, and so it needs
 specfiles for input, etc.). So I chose to defer that part, and left the
 MyCCSession in here.

 > '''Others'''
 >
 > - stats and xfrout don't seem to be tested.

 i knew i missed some :p

 Or actually just one; in the first commit after the review I've added the
 b10-stats test.

 Xfrout went into a separate commit; the main Xfrout code wasn't tested at
 all, and there is a problem with the interaction between
 XfroutServer.shutdown() and threads in unit tests. I've modified the
 shutdown code to only wait for other threads when the was a
 UnixSocketServer, but simply go on and not join() them if there wasn't.
 I'm not entirely sure whether this is correct, so i've put it in a
 separate commit for now.

 (next points will go into a third commit)

 Replying to [comment:11 jinmei]:
 > Replying to [comment:9 jelte]:
 >
 > > Oh, sorry for not being clear; that ConfigManager timeout was actually
 closely related; the timeout values for communication from bindctl to
 configmanager and from configmanager to any other (in this case non-
 running) module are the same; so the 'configmanager is not responding' was
 not because configmanager was hanging, but because it was simply waiting
 for a timeout itself. So this would address the root cause, but not that
 specific problem (we probably need to make cmdctl->cfgmgr timeouts a bit
 higher than the rest).
 >
 > Okay, I actually suspected it was not hanging, just a long timeout.
 > What about the traceback?
 >

 I could not reproduce it at this time, and if it is an existing bug I
 consider it a separate problem.

 > > > - logging itself could cause an exception, so the destructor could
 > > >   still throw.  However, this is not the only point that has this
 > > >   problem, so maybe we can leave it open for now...
 > > >
 > >
 > > hmm, imo we should either make logging calls exception-free or add
 exception-free ones, since it is usually about the only thing you can do
 in a destructor.
 >
 > Yeah, I made a ticket before, but it didn't seem to be very easy.
 >

 ok, added a catchall there, but we should probably 'try again' :)

 > > > - In our convention we'd explicitly add 'virtual' (for clarity) to
 the
 > > >   destructor declaration.
 > >
 > > actually, it's even needed for technical reasons for classes that are
 subclassed, but this one isn't supposed to be subclassed afaik I realize
 we can't stop people from doing it, but personally i tend to look at the
 virtualness of the destructor to see if it should be subclassed in the
 first place. Of course, one could argue that it would be better to add it
 when it's not necessary than to not have it when it is, so if you insist
 i'll make it virtual.
 >
 > I meant this is because it has a base class and the destructor is
 > derived from it, not because this class could be derived from others.
 > I thought it was in the coding guideline, but I couldn't find it...
 >

 oh, ok. Can't find it either, but do recall some discussion about this.
 OK, made it virtual.

 > > > '''cfgmgr.py'''
 > > >
 > > > - if `_handle_module_stopping` is private, should it better be
 > > >   prefixed with `__`?  Same for `_send_module_spec_to_cmdctl`.
 > >
 > > as you can see from the rest of the code here, i tend to only use the
 one when I don't think the class/method name mangling is necessary. Should
 we start doing this as a principle? (and if so, should we update all
 'private' methods?)
 >
 > I think it's a good practice, but not a strong opinion.  In Python
 > it's not fully enforcible anyway.
 >

 ok, changed, there were a number of tests that actually used them, but
 they were easily converted to use the general public method

 > > > - Not directly related to this ticket, but the if-else in handle_msg
 > > >   is getting too big to understand/manage.  I guess it's time to
 > > >   consider refactoring, maybe in some OO way.
 > >
 > > depends on how; as it is now, it is indeed getting long, but is
 essentially a switch. I'm open to proposals, but some 'smart' solutions
 might actually be more complicated :p
 >
 > It's probably related to extract common initial patterns of Python
 > apps.  In any case it'd be beyond the scope of this ticket.
 >

 noted

 > > > '''ccsession_test.py'''
 > > >
 > > > - test_stop: not necessarily a problem, but it seems the same
 pattern
 > > >   repeats in multiple tests.  It would be better to unify the common
 > > >   sequence of tests for conciseness.
 > >
 > > Yes, actually, more generally, this is partly why I was looking at the
 callcounter thingy a few nights ago; [...]
 >
 > I'm not sure if we are talking about the same thing...what I meant is
 > something rather simple: repeating something like this:
 > {{{#!python
 >         fake_session = FakeModuleCCSession()
 >         self.assertFalse("Spec1" in fake_session.subscriptions)
 >         mccs = self.create_session("spec1.spec", None, None,
 fake_session)
 >         self.assertTrue("Spec1" in fake_session.subscriptions)
 > }}}
 > but anyway, I don't request it be solved within this ticket.  The
 > branch is already quite large, so it's better to complete it only with
 > focused and relevant changes.
 >

 ah, right, noted again :)

 > One other comment on the revised diff:
 >
 > '''(lettuce)bind10_control.py'''
 >
 > - run_bindctl: I'd set the default for cmdctl_port to None to avoid
 >   duplicate hardcoding of 47805.
 >

 ack, done

-- 
Ticket URL: <http://bind10.isc.org/ticket/640#comment:14>
BIND 10 Development <http://bind10.isc.org>
BIND 10 Development