BIND 10 #2676: Provide ccsession::rpc_call and replace usual calls with it
From: BIND 10 Development <do-not-reply at isc.org>
Date: Tue Feb 19 18:33:39 UTC 2013
#2676: Provide ccsession::rpc_call and replace usual calls with it
-------------------------------------+-------------------------------------
            Reporter:  vorner        |  Owner:  jinmei
                Type:  enhancement   |  Status:  reviewing
            Priority:  medium        |  Milestone:  Sprint-20130305
           Component:  Inter-module  |  Resolution:
                       communication |  CVSS Scoring:
            Keywords:                |  Defect Severity:  N/A
           Sensitive:  0             |  Feature Depending on Ticket:
         Sub-Project:  Core          |  Add Hours to Ticket:  0
Estimated Difficulty:  2.5           |  Internal?:  0
         Total Hours:  0             |
-------------------------------------+-------------------------------------
Comment (by jinmei):
Replying to [comment:9 vorner]:
> It seems the ModuleCCSession is busy locked in one thread in the loop
> checking commands. It doesn't cope with another request to wait for
> messages there. We may want to refactor the libraries around there a
> lot, because we had some similar problems in the C++ library as well
> (not with threads though, during the startup).
>
> Anyway, thinking about this, it might be solved by some kind of
> asynchronous wait on the session. Or, just by creating a ModuleCCSession
> instead of a CCSession as the second one. That could be easy. Should I
> try whether it helps?
It depends on the amount of work, which I'm not really sure about.
But my personal impression at the moment is that it's better to
complete this task and create another ticket for that work.
> > - a lettuce test (xfrin_notify_handling.feature) often, but not
> > always, fails. I'm not sure if it's related to this branch.
[...]
> I'd propose not trying to hunt down the lettuce timing here, but to
> create a ticket for it.
That's fine.
> > - `__notify_start_forwarder`: also not related to this branch, but
> > this case is one example where just getting a response from one
> > instance (of possibly multiple ones) of the component isn't
> > sufficient. We'll have to clarify this in the protocol doc (and
> > refinement) task and revise the implementation so it'll work
> > robustly.
>
> That is true, but just logging here seems like not the way to go
> either. I believe it should just crash the whole DDNS in such a case.
Probably.
> Anyway, it is not
> related to this ticket.
Right, this can be left open as far as this ticket is concerned.
> > see the detailed description of DDNS_UPDATE_NOTIFY_FAIL.
[...]
> Actually, the current behaviour of waiting for one (random) answer is
> wrong too, because:
>  * There may be no recipients on purpose. We can have no XfrOut running
>    (because the server is not primary). We can have a hidden master
>    updated by DDNS, not running Auth. We can have neither; for example,
>    a user might have added their own tool to write blog articles
>    automatically about the changes being made, and the server just gets
>    copies of the updates sent to the main server.
>  * What if the first of the Auths succeeds and the others fail?
At least ddns confirmed auth was running at the time of its startup,
so it's reasonable that it expects to get a response from auth. DDNS
also confirms the updated zone is primary (in an ad hoc way though,
and it's still true that xfrout may not be running). Also, although
not the point of this discussion, a "hidden master" can't work without
auth anyway; xfrout doesn't accept xfr requests directly, at least
right now.
All that said, your point is valid: it's wrong for DDNS to be
satisfied with getting only one answer.
> What I think should be done is:
>  * Make sure the sender (DDNS in this case) does not need to know what
>    the recipients will be. In that case, we can run no Auth without any
>    problem, and the customer can also run a custom module that wants to
>    know about updates the developers are not aware of.
>  * Leave the responsibility of guaranteed delivery to all on the msgq
>    or the libraries. Modules should not babysit the communication
>    system and check if it was delivered.
>  * If something fails, we currently only log. Let the logging be done
>    on the recipient side.
>  * Maybe introduce advanced response tracking and collate it into one
>    answer of the form:
>    - All OK
>    - or: these modules failed
I don't fully understand all of them, but I agree that these kinds of
solutions are generally what we should do for this type of issue.
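As a toy illustration of the "collate into one answer" item in the
list above (only the answer format follows the proposal; the
(rcode, message) convention and everything else are made up for the
example):
{{{#!python
# Toy sketch of collating per-module answers into one result of the
# proposed form: "All OK" or "These modules failed: ...".
# The (rcode, message) convention is assumed, not BIND 10's actual API.
def collate(answers):
    """answers: dict mapping module name -> (rcode, message); rcode 0 == OK."""
    failed = sorted(name for name, (rcode, _) in answers.items()
                    if rcode != 0)
    if not failed:
        return (0, 'All OK')
    return (1, 'These modules failed: ' + ', '.join(failed))

print(collate({'Auth': (0, ''), 'Xfrout': (0, '')}))
print(collate({'Auth': (0, ''), 'Xfrout': (1, 'no such zone')}))
}}}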
> > - do_polling: doesn't it change the behavior? Previously if the
> > recipient doesn't exist it should have resulted in a timeout and an
> > exception. It's now reported as `RPCError` and ignored.
> > Furthermore, there now seems to be a possibility that `modules` can
> > be used without set (even if detailed conditions prevent that from
> > happening, it's not obvious from the code, and it means the quality
> > of this code is low - although it's not a fault of your change).
>
> I changed it to mimic the original. But I don't understand what you
> mean by the „without set“ part.
I guess the revised code highlights the concern more clearly.
Consider what happens at `type(value)` after `rpc_call` raises
`RPCError`: the `pass` leaves `value` unassigned, so the `if` line
fails with a `NameError`.
{{{#!python
try:
    value = self.mccs.rpc_call('show_processes', 'Init')
except isc.config.RPCRecipientMissing:
    # This has been SessionTimeout before, so we keep the original
    # behavior.
    raise
except isc.config.RPCError:
    # TODO: Is it OK to just pass? As part of refactoring, preserving
    # the original behaviour.
    pass
if type(value) is list:
}}}
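One way to make the failure case explicit while keeping the original
names (just a sketch, not necessarily the right refactoring):
{{{#!python
value = None   # bind the name up front so the "no answer" case is explicit
try:
    value = self.mccs.rpc_call('show_processes', 'Init')
except isc.config.RPCRecipientMissing:
    raise      # was SessionTimeout before; keep propagating
except isc.config.RPCError:
    pass       # value stays None instead of being unbound
if type(value) is list:
    ...
}}}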
> > - update_modules: likewise, the behavior seems to be slightly changed.
>
> It is changed only slightly. If the module doesn't exist, it raises
> StatsError instead of SessionTimeout. But I don't think either is
> handled on the levels above. I can restore the original behaviour, but
> I don't know if it is important enough to complicate the code because
> of that.
I'm okay with keeping the code without converting the exception per
se. This should have been caught by tests (and made us think about
it), though; it suggests stats are only poorly tested (not an issue of
this branch itself).
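For reference, converting the exception back would be roughly a
one-line wrapper like this (a hypothetical sketch; the names are taken
from the discussion, and whether SessionTimeout is even the right type
to restore is exactly the judgment call above):
{{{#!python
# Hypothetical sketch: re-raise RPCError under the exception type the
# pre-rpc_call callers expected, preserving the old contract.
try:
    answer = self.mccs.rpc_call('show', module_name)
except isc.config.RPCError as err:
    raise isc.cc.session.SessionTimeout(str(err))
}}}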
> > - about the FIXME comment: as stated above, IMO it's another
> > incident that xfr's are a student's hobby project, not a
> > production-quality system, that it simply ignores the response or
> > timeout. (But this specific case may not have to be an "RPC" anyway
> > - all xfrin should know and handle if the message was surely
> > delivered to zonemgr).
>
> Do you propose to fix it in this branch? I don't think it is in scope.
> The errors were ignored before.
No, it's just a comment about the FIXME as it contained a question.
It should be fixed so it won't be a student toy, but that's outside
the scope of this task.
On the revised branch:
'''other changes'''
I made a few unrelated cleanups.
'''ddns.py.in'''
- apparently I forgot to note this before, but see (and confirm)
3bec0f1. We should now be able to remove that comment.
'''stats.py.in'''
- s/round-time trips/round-trip time(s)/?
{{{
# collect all the answers. This eliminates some of the round-time
# trips. Unfortunately, rpc_call is not flexible enough to allow
}}}
--
Ticket URL: <http://bind10.isc.org/ticket/2676#comment:10>
BIND 10 Development <http://bind10.isc.org>