BIND 10 #213: Change hard-coded process startups to configuration-driven

Tue Oct 25 18:29:01 UTC 2011

#213: Change hard-coded process startups to configuration-driven
--------------------------------------------+------------------------------------------
                   Reporter:  shane         |                 Owner:  vorner
                       Type:  enhancement   |                Status:  reviewing
                   Priority:  major         |             Milestone:  Sprint-20111108
                  Component:  Boss of BIND  |            Resolution:
                   Keywords:                |             Sensitive:  0
            Defect Severity:  N/A           |           Sub-Project:  Core
Feature Depending on Ticket:                |  Estimated Difficulty:  9
        Add Hours to Ticket:                |           Total Hours:
                  Internal?:  0             |
--------------------------------------------+------------------------------------------

Comment (by jinmei):

 Some responses:

 > > - Do we allow multiple instances (processes) of the same component?
 > >   Like multiple auth processes for multiple cores?  If we do, can we
 > >   handle that scenario in this framework?
 >
 > This framework should be able to handle that, provided their names are
 > different. I actually expected it to be used that way (or maybe having a
 > 'count': 64 option for a component sometime in future, if copy-pasting
 > bunch of components is deemed uncomfortable).
 >
 > However, the rest of the system can't handle it yet (like we need them
 > to have different addresses on the message bus or something). Maybe we
 > should warn the user about it in config, that starting two auths won't do
 > what he wants.

 It would be nice to document this limitation somewhere.  In any case,
 actually supporting multiple instances is far beyond the scope of this
 ticket.

 > > - I wouldn't consider Auth/Resolver/CmdCtl "needed" components.  For
 > >   example, if the system is intended to be DHCP only, we don't need
 > >   either auth or resolver.
 >
 > I'm not sure, maybe the "needed" name is a bit misleading. It says that
 > it should not start if these can't be started, but not bring the system
 > down if they crash later on.
 >
 > If someone uses boss to start dhcp, he would just remove auth and
 > resolver from configuration and have the dhcp part as needed.

 Perhaps the point to consider is what should be specified as 'needed'
 by default in bob.spec.  If we see BIND 10 as a generic framework for
 various kinds of Internet servers (starting with DNS, then DHCP, and
 perhaps even HTTP, etc.), it would be more reasonable to begin with an
 empty list of specific servers.  A user who wants to use the framework
 for DNS services would then specify auth (and/or resolver) as
 'needed'.  On the other hand, realistically speaking, most people will
 see BIND 10 as DNS software (at least for the initial N years), so that
 might be overgeneralization that just increases the configuration
 overhead.  Right now I don't have a strong opinion either way.  Maybe
 one option is to decide it at ./configure time, with the DNS-related
 servers as the default.  But in any case I'm okay with deferring this
 point.

 I don't have a strong opinion about the naming of 'needed', btw.

 > > - I'd keep this module independent from the knowledge of which
 > >   component is special for the boss, and let it focus on the generic
 > >   framework. [...]
 >
 > I put it to a different module.

 Okay.

 > > - An object of this class is a sort of finite state machine, [...]
 >
 > Yes, you're right. It happened in kind of evolutionary way, the
 > __running one was there first, then the __dead appeared later on and I
 > didn't think about it. This way it looks simpler. I also added the
 > diagram.
 >
 > > - What if stop_internal() raises an exception?
 >
 > Then we have a problem.
 >
 > Actually, the component is considered shut down at the time and the
 > exception is propagated. The idea behind this is, we can't really consider
 > it running, because it might be already stopped and if there's problem
 > stopping, if we try again (during system shutdown or sometime else), it
 > would fail again. This way, if it happens during real shutdown and the
 > process is still running, it will be at last killed. If it happens during
 > reconfiguration, I don't know. Any ideas what to do then?

 Thinking about it more, now that I've been asked explicitly, I think we
 should keep track of the status of spawned processes more precisely.
 Right now (both before and after this branch), it seems that we are not
 very accurate on this point.

 A child process can have the following states:
 - dead (process doesn't exist)
 - alive but not ready to run (in initialization)
 - alive and running
 - alive and shutting down (boss has sent a shutdown command)
 - alive but hung (process exists but cannot do any active work and
   cannot even receive a shutdown command)
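
 To make the list above concrete, here is a minimal sketch of how these
 states could be written down explicitly.  Everything in it is
 hypothetical: the names ChildState, ALLOWED and can_transition do not
 exist in the boss, and the set of allowed transitions is only my
 assumption about what would be reasonable, not something we have agreed
 on.

```python
from enum import Enum, auto


class ChildState(Enum):
    """Hypothetical names for the five states listed above."""
    DEAD = auto()           # process doesn't exist
    STARTING = auto()       # alive but not ready to run (in initialization)
    RUNNING = auto()        # alive and running
    SHUTTING_DOWN = auto()  # alive, boss has sent a shutdown command
    HUNG = auto()           # alive but cannot do any work or take commands


# Assumed transition table; the real set of legal transitions would
# have to be decided as part of the clarification work.
ALLOWED = {
    ChildState.DEAD: {ChildState.STARTING},
    ChildState.STARTING: {ChildState.RUNNING, ChildState.HUNG,
                          ChildState.DEAD},
    ChildState.RUNNING: {ChildState.SHUTTING_DOWN, ChildState.HUNG,
                         ChildState.DEAD},
    ChildState.SHUTTING_DOWN: {ChildState.HUNG, ChildState.DEAD},
    ChildState.HUNG: {ChildState.DEAD},
}


def can_transition(old, new):
    """Return True if moving from state old to state new is allowed."""
    return new in ALLOWED[old]
```

 With a table like this, the boss could reject (or at least log) a
 transition such as DEAD -> RUNNING that skips the initialization state.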

 We (at least partly) manage these states via BoB.processes and
 BoB._componet_configurator(._components), but the relationship between
 the two doesn't seem to be clearly defined.  That causes some real
 problems:

 - since we don't explicitly recognize the 'not ready' state, we have
   problems like #1271.  We can (and should) fix individual problems, but
   I suspect it's the tip of the iceberg.
 - as far as I know we don't have any explicit way to detect the "hung"
   state.
 - we don't explicitly recognize the "shutting down" state, and once
   the boss sends a shutdown command it basically forgets about that
   component (and cannot deal with the situation where the process
   somehow doesn't die)

 My original question about stop_internal() is related to the last
 point.  Based on this observation, for this particular issue I believe
 we should keep track of the transition from "shutting down" to "dead"
 more closely.  For example, instead of immediately removing the
 component on stop(), we could keep it in some "shutting down queue" and
 watch the process.  If it doesn't die within a certain period, the
 boss would kill it more forcefully.  (This is just an example sketch of
 an idea, rather than a concrete proposal.)
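
 The "shutting down queue" idea could look roughly like the following.
 This is only a sketch under assumptions: ShutdownQueue, its methods and
 the grace_period parameter are made-up names, children are represented
 by subprocess.Popen objects, and the caller is assumed to have already
 sent the shutdown command by whatever means the boss normally uses.

```python
import subprocess
import sys
import time


class ShutdownQueue:
    """Hypothetical tracker for children that were asked to shut down."""

    def __init__(self, grace_period=5.0):
        self.grace_period = grace_period
        self._pending = {}  # name -> (Popen object, kill deadline)

    def add(self, name, proc):
        """Remember proc; a shutdown command was already sent to it."""
        deadline = time.monotonic() + self.grace_period
        self._pending[name] = (proc, deadline)

    def poll(self):
        """Check all pending children once.

        Children still alive after their deadline are killed forcefully.
        Returns the names of the children confirmed dead by this call.
        """
        dead = []
        for name, (proc, deadline) in list(self._pending.items()):
            if proc.poll() is None and time.monotonic() >= deadline:
                proc.kill()   # SIGKILL on POSIX
                proc.wait()   # reap it so poll() below sees it as dead
            if proc.poll() is not None:
                dead.append(name)
                del self._pending[name]
        return dead

    def pending(self):
        """Names of children that are still shutting down."""
        return sorted(self._pending)
```

 The boss would call add() from stop() and call poll() from its main
 loop, removing a component from its own tables only once poll() reports
 it as dead.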

 In longer term, I believe we should clarify the above relationship,
 then define, implement, and test the behavior based on the
 clarification.

 I think we should also check how other multi-process systems such as
 postfix and xorp handle the issue of managing child processes.

 All that said, this is beyond the scope of this already-fat task.  The
 pre-213 implementation isn't good in this respect either, so as long as
 the goal is to port the current behavior to a new framework, we don't
 have to solve it now.  So, for the moment, I'm okay with just leaving a
 comment that e.g. stop_process() is generally expected to be exception
 free (for now) and that the behavior is undefined if and when it does
 raise one.

-- 
Ticket URL: <http://bind10.isc.org/ticket/213#comment:25>
BIND 10 Development <http://bind10.isc.org>