BIND 10 #213: Change hard-coded process startups to configuration-driven

Sat Oct 22 13:31:01 UTC 2011

#213: Change hard-coded process startups to configuration-driven
-------------------------------------+-------------------------------------
                   Reporter:  shane  |                 Owner:  vorner
                       Type:         |                Status:  reviewing
  enhancement                        |             Milestone:
                   Priority:  major  |  Sprint-20111025
                  Component:  Boss   |            Resolution:
  of BIND                            |             Sensitive:  0
                   Keywords:         |           Sub-Project:  Core
            Defect Severity:  N/A    |  Estimated Difficulty:  9
Feature Depending on Ticket:         |           Total Hours:
        Add Hours to Ticket:         |
                  Internal?:  0      |
-------------------------------------+-------------------------------------

Comment (by vorner):

 Hello

 Replying to [comment:14 jinmei]:
 > (In order to not disappoint you due to the amount of comment:-) In
 > general the design looks sensible to me.  But since this is a newly
 > introduced concept, I encountered many points to discuss.

 I kind of expected a lot of comments, as the branch is quite large.

 > - This branch seems to break system tests (maybe related to the next
 one)

 Hmm, maybe I need to look at the configurations there. I'll handle it
 sometime soon.

 > - When I stopped the system via bindctl 'Boss shutdown', the bind10
 >   process raised an exception:
 > {{{
 > Traceback (most recent call last):
 >   File "bind10", line 952, in <module>
 >     main()
 >   File "bind10", line 947, in main
 >     boss_of_bind.shutdown()
 >   File "bind10", line 646, in shutdown
 >     component.pid())
 >   File
 "/Users/jinmei/src/isc/git/bind10-213/src/lib/python/isc/bind10/component.py",
 line 232, in pid
 >     return self.__creator.pid()
 > AttributeError: 'NoneType' object has no attribute 'pid'
 > }}}
 >   and b10-msgq process remained alive.

 I couldn't reproduce this, it worked for me. But I added some handling of
 this, so it shouldn't cause an exception any more (I expect this happens
 in some time when the component is already stopped, but it didn't yet get
 the SIGCHLD or something).

 Anyway, I find it strange it did bring the system down without killing
 msgq. Or maybe not, considering where the exception happened.

 > - Do we allow multiple instances (processes) of the same component?
 >   Like multiple auth processes for multiple cores?  If we do, can we
 >   handle that scenario in this framework?

 This framework should be able to handle that, provided their names are
 different. I actually expected it to be used that way (or maybe having a
 'count': 64 option for a component sometime in future, if copy-pasting
 bunch of components is deemed uncomfortable).

 However, the rest of the system can't handle it yet (like we need them to
 have different addresses on the message bus or something). Maybe we should
 warn the user about it in config, that starting two auths won't do what he
 wants.

 > - Don't we need an interval (and/or allow having an interval) before
 >   restarting a component?  (and should the interval also be
 >   configured?).  This is probably a TODO thing, which is okay for now,
 >   but just checking.

 We do, it's in the TODO somewhere.

 > - I wouldn't consider Auth/Resolver/CmdCtl "needed" components.  For
 >   example, if the system is intended to be DHCP only, we don't need
 >   either auth or resolver.

 I'm not sure, maybe the "needed" name is a bit misleading. It says that it
 should not start if these can't be started, but not bring the system down
 if they crash later on.

 If someone uses boss to start dhcp, he would just remove auth and resolver
 from configuration and have the dhcp part as needed.

 > - I'd keep this module independent from the knowledge of which
 >   component is special for the boss, and let it focus on the generic
 >   framework.  In that sense, it's awkward to see classes like
 >   SockCreator, Msgq, etc and the mapping dictionary of 'special' be
 >   defined in this module.  It seems to me to be more reasonable if
 >   they are defined in the boss module (or even independently from the
 >   module implementation itself, e.g. in a separate configuration file
 >   or something) and passed to this module as kind of generic
 >   parameters (see also the comment about 'special' in Configurator
 >   below).

 I put it to a different module.

 > - An object of this class is a sort of finite state machine, but since
 >   its transition is represented via changes to multiple mutable
 >   attributes (__running (which can be True or False), __dead) it's
 >   difficult to follow at least to me (I needed to draw a state
 >   transition diagram to understand the code and be sure about its
 >   soundness).  I'd suggest making this point more explicit by, e.g.,
 >   introducing a single attribute representing the current state. (see
 >   bullets below for more specifics)

 Yes, you're right. It happened in kind of evolutionary way, the __running
 one was there first, then the __dead appeared later on and I didn't think
 about it. This way it looks simpler. I also added the diagram.

 > - What if stop_internal() raises an exception?

 Then we have a problem.

 Actually, the component is considered shut down at the time and the
 exception is propagated. The idea behind this is, we can't really consider
 it running, because it might be already stopped and if there's problem
 stopping, if we try again (during system shutdown or sometime else), it
 would fail again. This way, if it happens during real shutdown and the
 process is still running, it will be at last killed. If it happens during
 reconfiguration, I don't know. Any ideas what to do then?

 > - It seems to make sense to allow derived class to override some of
 >   the methods.  But I suspect the requirement to the overridden
 >   methods should be more clarified in terms of which part (attribute)
 >   of the base class they can modify/ignore/refer to, or even which
 >   method can be overridden:

 I added some documentation.

 > - A related point to the previous general issue, but not specific to
 >   that: strange thing will happen if pid() is called before _procinfo
 >   is set (e.g., at the "Init" state).  Also, what should happen if
 >   this method is called without the underlying process?  Further, this
 >   is actually not specific to PID, but also generally applicable to
 >   _procinfo.

 The _procinfo should not be accessed if it's not running and it should in
 fact be more private than protected. I'm not sure if I access it from some
 kind of tests, or if it's leftover from some older version. Anyway, pid()
 now checks and returns None if it's not running (or derived component is
 bad).

 > - On the specific point of the relationship between start_internal()
 >   and _procinfo, I guess it's better to request derived method not
 >   modify base attributes and instead return an ProcessInfo object as a
 >   contract.  The start() method will then set _procinfo to the
 >   returned value, and we can make sure _procinfo is always valid as
 >   long as start() succeeds.

 I think there are none that modify it. This is actually what _start_func
 hook is for, it returns the procinfo and the rest is handled the usual
 way.

 Overriding start_internal is for bigger differences, where there might be
 no procinfo involved or no process (therefore PID) at all.

 I added some documentation there.

 > '''SockCreator class'''
 >
 > - stop_internal: as long as the guideline is held __creator should
 >   never be None when this method is called (in other words should this
 >   ever happen it means a bug).  So I guess raising an exception is
 >   probably better.

 Yes, you're right. It's leftover from the copy-paste from boss.

 > - isn't it better (at least in terms of conciseness) to pass 'address'
 >   to the base class constructor rather than setting it by themselves?
 >   that is, instead of

 Yes, evolution O:-). Modified.

 > - (A related point) Isn't it okay to not perform syntax check against
 >   the given parameters?  What if an unknown parameter is given?  What
 >   if a parameter value is of an unexpected type?  Or is this assumed
 >   to be checked at the caller side? (if so, it should be documented
 >   so).

 It's not completely OK, but it's one of the TODO items (better error
 handling and checks). It currently implicitly relies on bindctl not
 allowing anything that's not allowed by spec file, but the checks should
 be definitelly added.

 >   essentially specifies a specific derived Component class (if
 >   explicitly specified).  So I'd rather simply call it something like
 >   'component_class', specify the actual class Component class name
 >   (e.g. 'SockCreator'), and use some reflection technique to
 >   instantiate an object of the specified class.

 I'm not against some renaming (but the component_class is rather long and
 expects that someone has an idea what a class is). But I'd be against the
 reflection techniques, for one, they complicate tests, for another, I
 don't want the user to be able to type any class name there and expect
 things to work.

 > - _run_plan(): Like the configuration, I'd like to see doc string that
 >   explains more details of 'plan'.

 I added a little bit, but that thing is mostly private to the configurator
 and tests.

 > - A related comment: since _old_config and _components are inter
 >   related (and have one-to-one relationships if I understand it
 >   correctly), I'd rather manage them in a combined way, e.g.
 > {{{
 >   _components = { 'msgq': (comonent_obj, componet_config), ... }
 > }}}
 >   That way we can be more sure, when reading the code, about the
 >   consistency between them, and it will be easier to ensure the
 >   consistency at least per component basis.

 I implemented this. But should we try some kind of rollback in case of
 failure-to-start or something?

 > - The exception message seems to be too broad:
 > {{{
 >                         raise NotImplementedError('Changing
 configuration of' +
 >                                                   ' a running component
 is ' +
 >                                                   'not yet supported.
 Remove' +
 >                                                   ' and re-add ' + cname
 +
 >                                                   'to get the same
 effect')
 > }}}
 >   because it only checks a subset of known configuration parameters
 >   (e.g. even if 'address' is modified it doesn't complain)

 Not checking address was actually a bug (as the check was there before the
 address was added). Anyway, I'd like this one to go sometime soon and be
 able to handle the change if at all possible without bothering the user.

 > - is it okay to allow Configurator.reconfigure() to even stop core
 >   components?  It seems to contradict the idea that the system cannot
 >   even start if such components fail to start.

 Yes, as described at the class's docstring, it's up to the caller to
 ensure this doesn't happen (and boss does that by making sure the three
 core components are always there).

 I actually want to allow user to provide more core components (eg. say
 that auth is core) and simulate kind of semi-brittle mode with it. But
 then, the user should be able to remove the auth and shut it down. And the
 configurator is not the place which should know which are system-core
 components and which are core because user wants them so.

 > - what should happen if we do Configurator.shutdown() and try to
 >   start it again?  Is it expected to be an okay behavior?  Unexpected
 >   but not prohibited?  Or should be prohibited?

 I didn't think about that one. Boss doesn't use it so and I don't think
 there ever will be another user than a boss.

 But I guess it would work, so I wouldn't bother forbidding it explicitly
 or making sure it does unless it's really needed.

 > {{{
 >     isc.log.init("bind10") # FIXME Should this be needed?
 > }}}

 Hmm, it seems to fail without it. It's just a copied fixme from some other
 test, I'd like the tests to be able to start by themself without the init
 in tests (as there's a test-specific call just below it).

 More answers and code will follow, I just need a flush O:-).

-- 
Ticket URL: <https://bind10.isc.org/ticket/213#comment:19>
BIND 10 Development <http://bind10.isc.org>
BIND 10 Development