BIND 10 #2676: Provide ccsession::rpc_call and replace usual calls with it
BIND 10 Development
do-not-reply at isc.org
Tue Feb 19 10:04:41 UTC 2013
#2676: Provide ccsession::rpc_call and replace usual calls with it
-------------------------------------+-------------------------------------
            Reporter:  vorner        |                Owner:  jinmei
                Type:  enhancement   |               Status:  reviewing
            Priority:  medium        |            Milestone:  Sprint-20130219
           Component:  Inter-module  |           Resolution:
                       communication |         CVSS Scoring:
            Keywords:                |      Defect Severity:  N/A
           Sensitive:  0             |  Feature Depending on Ticket:
         Sub-Project:  Core          |  Add Hours to Ticket:  0
Estimated Difficulty:  2.5           |            Internal?:  0
         Total Hours:  0             |
-------------------------------------+-------------------------------------
Changes (by vorner):
* owner: vorner => jinmei
Comment:
Hello
As always, what is not commented has been handled.
Replying to [comment:7 jinmei]:
> I didn't yet understand why we needed two sessions in these cases, but
> if they are about xfrin and zonemgr, I guess it's just another
> incident that xfrs and zonemgr only have experimental quality,
> especially about the use of threads. I'm not sure when we can make
> them production ready, but I'd generally defer tricky cases related to
> the buggy use of threads in these modules to fundamental redesign
> work.
It seems the ModuleCCSession is busy-locked in one thread in the loop
checking for commands, and it doesn't cope with another request waiting
for messages there. We may want to refactor the libraries around there a
lot, because we had similar problems in the C++ library as well (not
with threads, though, but during startup).
Anyway, thinking about this, it might be solved by some kind of
asynchronous wait on the session, or just by creating a ModuleCCSession
instead of a plain CCSession as the second one. That could be easy.
Should I try whether it helps?
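For illustration, a minimal sketch of the two-session setup in Python,
assuming the isc.config/isc.cc interfaces; the constructor arguments are
from memory and this is not a tested patch:
{{{
import isc.cc
import isc.config

def make_sessions(spec_file, config_handler, command_handler):
    # Session driven by the main thread's command-checking loop.
    module_session = isc.config.ModuleCCSession(spec_file, config_handler,
                                                command_handler)
    # Independent session for the other thread's blocking waits, so the
    # two threads do not contend over one socket.  Whether this second
    # one should stay a plain Session or become a ModuleCCSession is
    # exactly the question above.
    second_session = isc.cc.Session()
    return module_session, second_session
}}}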
> - a lettuce test (xfrin_notify_handling.feature) often, but not
> always, fails. I'm not sure if it's related to this branch.
I don't know. When I was running the lettuce tests, they failed several
times for me, in various unrelated parts. When I tried to investigate,
the log messages the tests timed out waiting for were actually present
in the logs. So I guess it is some strange kind of race condition in the
lettuce framework, possibly happening when many messages are output at
once (maybe a full 4kB read happens and splits a message in the middle,
or something like that). It might be related to this branch in the sense
that the timing of sends and logs is slightly different, but I don't see
how it could directly cause a new race there.
I'd propose not trying to hunt down the lettuce timing issue here, but
creating a ticket for it.
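To illustrate the suspected mechanism: if the framework matches each
fixed-size read separately, a message straddling a 4kB boundary is never
seen whole, and line buffering would avoid that. This is a plain-Python
sketch of the technique, not the actual lettuce reader:
{{{
def read_lines(fileobj, chunk_size=4096):
    """Yield complete log lines even when a fixed-size read splits one.

    Matching raw chunks instead would miss any message that crosses a
    chunk boundary, which is the race suspected above.
    """
    buffer = ""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            if buffer:           # trailing partial line at EOF
                yield buffer
            return
        buffer += chunk
        lines = buffer.split("\n")
        buffer = lines.pop()     # keep the incomplete tail for later
        for line in lines:
            yield line
}}}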
> - totally a nit, but did you intentionally use "ourself" instead of
> "ourselves"?
No, I just don't speak English well enough to notice it's wrong.
> - some log messages seem to have to be updated. For example, the
> description of DDNS_START_FORWARDER_ERROR is not very accurate any
> more because it includes the case that auth doesn't exist (or at
> least isn't listening to the commands). I suggest you go through
> all relevant log messages (including detailed description) and
> revise them appropriately.
Actually, I changed the behaviour here back to the original. But I did
change one message, removing the several lines about how bad the
timeouts are, since they should no longer happen.
> - `__notify_start_forwarder`: also not related to this branch, but
> this case is one example where just getting a response from one
> instance (of possibly multiple ones) of the component isn't
> sufficient. We'll have to clarify this in the protocol doc (and
> refinement) task and revise the implementation so it'll work
> robustly.
That is true, but just logging here doesn't seem like the way to go
either. I believe it should crash the whole DDNS in such a case. Anyway,
it is not related to this ticket.
> see the detailed description of DDNS_UPDATE_NOTIFY_FAIL. And, in
> fact, IMO if we want to make production quality system, not a toy of
> a student project, we shouldn't just "notify and forget" in many
> more cases. IMO our implementation is rather too loose in this
> sense. In this particular case, the discussion is quite subtle
> though... first, if the error is `RPCRecipientMissing`, it can be
> less serious: while the fact that auth/xfrout has likely terminated
> may indicate something wrong in auth/xfrout themselves, the entire
> system can recover from that situation, as they will be restarted by
> the init process. So we could handle this specific case separately,
> with a lower level of log. But then we face the fundamental
> question of the inter-module communication model. This is another
> case where, if not all recipients receive the message, the system
> can be in an inconsistent state.
Actually, the current behaviour of waiting for one (random) answer is
wrong too, because:
 * There may be no recipients on purpose. We can have no XfrOut running
   (because the server is not a primary). We can have a hidden master
   updated by DDNS, not running Auth. We can have neither; for example,
   a user might have added their own tool that automatically writes blog
   articles about the changes being made, and this server just gets
   copies of the updates sent to the main server.
 * What if the first of the Auths succeeds and the others fail?
What I think should be done is:
 * Make sure the sender (DDNS in this case) does not need to know who
   the recipients are. Then we can run without Auth without any problem,
   and a customer can also run a custom module interested in updates
   that the developers are not aware of.
 * Leave the responsibility for guaranteed delivery to all recipients to
   the msgq or the libraries. Modules should not babysit the
   communication system and check whether a message was delivered.
 * If something fails, we currently only log. Let the logging be done on
   the recipient's side.
 * Maybe introduce advanced response tracking and collate the responses
   into one answer of the form "all OK" or "these modules failed" (see
   the sketch below).
However, you worry about not all of the recipients receiving the
message. I think we can consider a module that is not connected to msgq
as not existing. But we should worry about the case where a receiving
module fails to do whatever is needed.
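A sketch of that collation idea, with a hypothetical reply shape (the
real wire format would be defined by the msgq protocol work; none of
these names come from it):
{{{
def collate_replies(replies):
    """Collapse per-recipient replies into one summary answer.

    `replies` is assumed to map a module name to an (rcode, message)
    pair, purely for illustration.
    """
    failed = {name: msg for name, (rcode, msg) in replies.items()
              if rcode != 0}
    if not failed:
        return {"result": "all OK"}
    return {"result": "failed", "modules": failed}

# Example: one Auth succeeded, another failed.
print(collate_replies({"Auth-1": (0, None),
                       "Auth-2": (1, "disk full")}))
}}}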
> - do_polling: doesn't it change the behavior? Previously if the
> recipient doesn't exist it should have resulted in a timeout and an
> exception. It's now reported as `RPCError` and ignored.
> Furthermore, there now seems to be a possibility that `modules` can
> be used without set (even if detailed conditions prevent that from
> happening, it's not obvious from the code, and it means the quality
> of this code is low - although it's not a fault of your change).
I changed it to mimic the original. But I don't understand what you mean
by the „without set“ part.
> - update_modules: likewise, the behavior seems to be slightly changed.
It is changed only slightly: if the module doesn't exist, it raises
StatsError instead of SessionTimeout. But I don't think either is
handled at the levels above. I can restore the original behaviour, but I
don't know whether it is important enough to complicate the code for.
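If preserving the original exception type ever turns out to matter, the
translation would be small. A sketch assuming the rpc_call interface
introduced by this ticket; the command name and the exact exception
locations are from memory, not verified:
{{{
import isc.config
import isc.cc.session

def poll_module(ccs, module_name):
    # Map the new-style "recipient missing" error back to the
    # SessionTimeout the callers used to see; whether that is worth the
    # extra code is exactly the question above.
    try:
        return ccs.rpc_call("getstats", module_name)
    except isc.config.RPCRecipientMissing:
        raise isc.cc.session.SessionTimeout(
            "no " + module_name + " to answer the stats poll")
}}}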
> - get_stats_data: raising `StatsHttpdDataError` in case of
> `RPCRecipientMissing` doesn't seem to be correct (at least it
> doesn't preserve the previous behavior).
Ah, I misread the original code and thought both cases raised
`StatsHttpdError`, while one of them is actually a different exception.
Fixed.
> - about the FIXME comment: as stated above, IMO it's another
> incident that xfr's are a student's hobby project, not a
> production-quality system, that it simply ignores the response or
> timeout. (But this specific case may not have to be an "RPC" anyway
> - all xfrin should know and handle if the message was surely
> delivered to zonemgr).
Do you propose fixing it in this branch? I don't think it is in scope;
the errors were ignored before, too.
--
Ticket URL: <http://bind10.isc.org/ticket/2676#comment:9>
BIND 10 Development <http://bind10.isc.org>