BIND 10 #213: Change hard-coded process startups to configuration-driven

Thu Oct 20 03:57:07 UTC 2011

#213: Change hard-coded process startups to configuration-driven
-------------------------------------+-------------------------------------
                   Reporter:  shane  |                 Owner:  jinmei
                       Type:         |                Status:  reviewing
  enhancement                        |             Milestone:
                   Priority:  major  |  Sprint-20111025
                  Component:  Boss   |            Resolution:
  of BIND                            |             Sensitive:  0
                   Keywords:         |           Sub-Project:  Core
            Defect Severity:  N/A    |  Estimated Difficulty:  9
Feature Depending on Ticket:         |           Total Hours:
        Add Hours to Ticket:         |
                  Internal?:  0      |
-------------------------------------+-------------------------------------

Comment (by jinmei):

 I'll first provide comments on the component module (and its tests).

 (In order to not disappoint you due to the amount of comment:-) In
 general the design looks sensible to me.  But since this is a newly
 introduced concept, I encountered many points to discuss.

 '''Regressions'''
 - This branch seems to break system tests (maybe related to the next one)
 - When I stopped the system via bindctl 'Boss shutdown', the bind10
   process raised an exception:
 {{{
 Traceback (most recent call last):
   File "bind10", line 952, in <module>
     main()
   File "bind10", line 947, in main
     boss_of_bind.shutdown()
   File "bind10", line 646, in shutdown
     component.pid())
   File
 "/Users/jinmei/src/isc/git/bind10-213/src/lib/python/isc/bind10/component.py",
 line 232, in pid
     return self.__creator.pid()
 AttributeError: 'NoneType' object has no attribute 'pid'
 }}}
   and b10-msgq process remained alive.

 '''Design/General issues'''

 - Do we allow multiple instances (processes) of the same component?
   Like multiple auth processes for multiple cores?  If we do, can we
   handle that scenario in this framework?

 - Don't we need an interval (and/or allow having an interval) before
   restarting a component?  (and should the interval also be
   configured?).  This is probably a TODO thing, which is okay for now,
   but just checking.

 - I wouldn't consider Auth/Resolver/CmdCtl "needed" components.  For
   example, if the system is intended to be DHCP only, we don't need
   either auth or resolver.

 - I'd avoid using magic numbers/values directly in the source code as
   much as possible: e.g. 10 seconds, some well-known command names
   like 'start', 'stop', so that we can unify the place where they are
   defined and can be easily changed at once (when necessary) and also
   so that the intent of the magic values is clearer.

 - I'd keep this module independent from the knowledge of which
   component is special for the boss, and let it focus on the generic
   framework.  In that sense, it's awkward to see classes like
   SockCreator, Msgq, etc and the mapping dictionary of 'special' be
   defined in this module.  It seems to me to be more reasonable if
   they are defined in the boss module (or even independently from the
   module implementation itself, e.g. in a separate configuration file
   or something) and passed to this module as kind of generic
   parameters (see also the comment about 'special' in Configurator
   below).

 '''Component class'''

 - An object of this class is a sort of finite state machine, but since
   its transition is represented via changes to multiple mutable
   attributes (__running (which can be True or False), __dead) it's
   difficult to follow at least to me (I needed to draw a state
   transition diagram to understand the code and be sure about its
   soundness).  I'd suggest making this point more explicit by, e.g.,
   introducing a single attribute representing the current state. (see
   bullets below for more specifics)

 - Whether or not adopting the idea of explicit state transition, a
   diagram in the documentation would still help for others to read and
   understand the code.  It would look something like:
 {{{
                          start_intl fail
                            OR failed()
         start()          /---------->Failed----*----->Dead
   Init----------->Running<----+----/(call
                   (call            failed_intl)
                  start_intl)
                     ^  |           *: if kind == core
              start()|  |stop()        or kind == needed and
                     |  v              it fails too soon
                   Stopped          +: otherwise
                (call stop_intl)
 }}}

 - What if stop_internal() raises an exception?

 - The notion of "dead" is not documented.

 - failed() doesn't check if it's called with __dead being True while
   start() does this check.  While it may not be a bug because if
   __dead is true __running must be False, in order to understand that
   you need to follow how and where these values can be changed.  If we
   had a single attribute of e.g., __state, it would be immediately
   clear that the check of "__state != RUNNING" covers "__state ==
   DEAD".  (You may or may not agree with this idea itself, but at
   least it was the case for me that it took time to understand this
   implementation was okay due to the existence of multiple
   attributes).

 - It seems to make sense to allow derived class to override some of
   the methods.  But I suspect the requirement to the overridden
   methods should be more clarified in terms of which part (attribute)
   of the base class they can modify/ignore/refer to, or even which
   method can be overridden:
   - Can running() be overridden?  If it can, and if the overridden
     version ignores __running, behaviors that depend on the result of
     running() may not meet the assumption of the implementation (and
     may cause disruption).
   - if a derived class overrides start_internal() and in its version
     it doesn't set _procinfo, then pid() won't work unless pid() is
     also overridden.
   - there may be some more similar cases that I'm missing

 - A related point to the previous general issue, but not specific to
   that: strange thing will happen if pid() is called before _procinfo
   is set (e.g., at the "Init" state).  Also, what should happen if
   this method is called without the underlying process?  Further, this
   is actually not specific to PID, but also generally applicable to
   _procinfo.

 - On the specific point of the relationship between start_internal()
   and _procinfo, I guess it's better to request derived method not
   modify base attributes and instead return an ProcessInfo object as a
   contract.  The start() method will then set _procinfo to the
   returned value, and we can make sure _procinfo is always valid as
   long as start() succeeds.

 - Shouldn't xxx_internal better be named like "protected" methods,
   e.g. _start_internal()?

 '''SockCreator class'''

 - stop_internal: as long as the guideline is held __creator should
   never be None when this method is called (in other words should this
   ever happen it means a bug).  So I guess raising an exception is
   probably better.

 '''CfgMgr class, etc'''
 - isn't it better (at least in terms of conciseness) to pass 'address'
   to the base class constructor rather than setting it by themselves?
   that is, instead of
 {{{
         self._address = 'ConfigManager'
 }}}
   do this:
 {{{
         Component.__init__(self, process, boss, kind, 'ConfigManager')
 }}}

 '''Configurator class'''

 - The syntax and semantics of the 'configuration' (argument to
   startup() and reconfigure()) are not clear enough in the
   documentation.  I needed to look into the implementation and test
   cases to understand how it should be and how it works.  This should
   be documented somewhere (probably part of the class doc string),
   more explicitly.  It would cover the following points:
   - 'configuration' is a dictionary, whose keys are the identifiers of
     components (often derived from the program invoked for the
     component)
   - each value of a dictionary entry is another dictionary, which
     gives the configuration for the component.  The (currently
     available) keys are: 'special', 'process', 'kind', 'address',
     and 'param'.  The expected type of each key value is explained
     with other restrictions (e.g., 'kind' can only be one of the
     well-known kinds).  For those configuration parameters that can be
     omitted, the default values are shown.  And explain how these
     values are used.

 - (A related point) Isn't it okay to not perform syntax check against
   the given parameters?  What if an unknown parameter is given?  What
   if a parameter value is of an unexpected type?  Or is this assumed
   to be checked at the caller side? (if so, it should be documented
   so).

 - Configurator.running doesn't have doc string.

 - The notion of 'special' is confusing, at least to me.  This
   essentially specifies a specific derived Component class (if
   explicitly specified).  So I'd rather simply call it something like
   'component_class', specify the actual class Component class name
   (e.g. 'SockCreator'), and use some reflection technique to
   instantiate an object of the specified class.

 - _run_plan(): Like the configuration, I'd like to see doc string that
   explains more details of 'plan'.

 - _run_plan(): s/beforehead/beforehand/?
 {{{
         Run a plan, created beforehead by _build_plan.
 }}}

 - _run_plan(): this doesn't seem to provide the strong exception
    guarantee, i.e. self._components can have an incomplete update when
    an exception is raised.  Is that okay?  (for example, _old_config
    and _components will be inconsistent once this happens).

 - A related comment: since _old_config and _components are inter
   related (and have one-to-one relationships if I understand it
   correctly), I'd rather manage them in a combined way, e.g.
 {{{
   _components = { 'msgq': (comonent_obj, componet_config), ... }
 }}}
   That way we can be more sure, when reading the code, about the
   consistency between them, and it will be easier to ensure the
   consistency at least per component basis.

 - The exception message seems to be too broad:
 {{{
                         raise NotImplementedError('Changing configuration
 of' +
                                                   ' a running component is
 ' +
                                                   'not yet supported.
 Remove' +
                                                   ' and re-add ' + cname +
                                                   'to get the same
 effect')
 }}}
   because it only checks a subset of known configuration parameters
   (e.g. even if 'address' is modified it doesn't complain)

 - reconfigure() docstring: I'd like to see more desc, like the fact
   that currently running components that are not specified in
   'configuration' will be stopped.

 - build_plan: using "params" as a local variable while also having
   "'params'" in the configuration is a bit confusing.  I'd rename the
   former to something else.

 - is it okay to allow Configurator.reconfigure() to even stop core
   components?  It seems to contradict the idea that the system cannot
   even start if such components fail to start.

 - what should happen if we do Configurator.shutdown() and try to
   start it again?  Is it expected to be an okay behavior?  Unexpected
   but not prohibited?  Or should be prohibited?

 '''component_test.py'''

 === ComponentTests ===
 - __check_dead: what about calling failed()? (also see above for the
   comments about the class)
 - __check_dead: 'failed' instead of 'stopped'?  ('dead' is a result of
   'failed' not 'stop')
 {{{
         # Surely it can't be stopped again
 }}}
 - __check_dead: should also check it's not None?
 {{{
         self.assertNotEqual(0, self._exitcode)
 }}}

 === ConfiguratorTest ===

 - Is it intentional to test these two? (same for the shutdown case.
   also note it's inconsistent with the other two cases)
 {{{
         self.assertTrue(configurator._running)
         self.assertTrue(configurator.running())
 }}}

 === main ===
 - I don't know about this, but referring to other similar cases it
   doesn't seem to be necessary:
 {{{
     isc.log.init("bind10") # FIXME Should this be needed?
 }}}

-- 
Ticket URL: <http://bind10.isc.org/ticket/213#comment:14>
BIND 10 Development <http://bind10.isc.org>
BIND 10 Development