BIND 10 #2738: Clarify high-level design of the CC protocol
BIND 10 Development
do-not-reply at isc.org
Mon Apr 8 08:53:32 UTC 2013
#2738: Clarify high-level design of the CC protocol
-------------------------------------+-------------------------------------
            Reporter:  vorner        |                Owner:  jinmei
                Type:  task          |               Status:  reviewing
            Priority:  medium        |            Milestone:  Sprint-20130423
           Component:  Inter-module  |           Resolution:
                       communication |         CVSS Scoring:
            Keywords:                |      Defect Severity:  N/A
           Sensitive:  0             |  Feature Depending on Ticket:
         Sub-Project:  DNS           |  Add Hours to Ticket:  0
Estimated Difficulty:  5             |            Internal?:  0
         Total Hours:  0             |
-------------------------------------+-------------------------------------
Changes (by vorner):
* owner: vorner => jinmei
Comment:
Hello
Replying to [comment:11 jinmei]:
> First, please be careful about the status of the branch: I
> accidentally merged a different branch to trac2738, then reverted it,
> and restored the original commits by cherry-pick (apparently I also
> did something wrong in the revert). I hope I've recovered the
> original state, but please check if it doesn't break anything.
I examined the branch and concluded the history doesn't look nice (the
branch included some non-reverted changes; I believe you reverted the
other side of the merge than the one you wanted), so I rebuilt it
completely. Update the branch with:
{{{
git fetch
git reset --hard origin/trac2738
}}}
instead of `git pull`.
> Secondly, after thinking over the details I realized I had a higher
> level concern and/or perhaps I really didn't understand what was
> expected in this task. The current documentation seems to be a
> mixture of high-level concepts (e.g., the concept of RPC - which
> seems to be a convenient wrapper on top of the IPC system, not an
> essential part of the system itself) and low-level ones (e.g., the
> mention of EINTR), while still not defining basic notions (client,
> lname, group, etc).
Actually, I believe it was me who didn't understand the goal properly;
after all, the ticket was created because you asked for it in some
review. On the other hand, I don't really understand the purpose of
this kind of document. Is it a clarification for us of how we should
use the system (which was my original impression), or documentation
for others? If the first, do we really need to formally define what a
session is? All of us know. If the second, how low-level do we want to
go? Anyone above the cc.session library will never come into contact
with the getlname message, for example. And people who go digging
inside the library will need to read the code anyway (since there's
not enough detail for that here), or at least the low-level CC
protocol.
Anyway, I'd like to discuss the goal and also the details of your
version first, before trying to merge them, so I didn't update
anything on the branch.
Is the content meant as an example only, or is it really what you
envision? Because I don't think I can agree with everything there:
* I don't think the session establishment is non-blocking.
* Is it OK to consider the synchronous read non-blocking, even in the
case of talking to msgq? I'm OK with calling it fast, but I don't think we
can call it non-blocking.
* What error can be reported by an asynchronous read (I mean a
low-level error, not a payload signifying an error response)? If there
were an error reading the message, it would be impossible to decide
which callback to pick, so it would have to be handled somewhere other
than in the callback.
* Also, I believe the asynchronous read currently has no timeout.
Since it is asynchronous, a timeout can be implemented on the client
side; with the synchronous one that is not possible.
* The concept of watch you describe seems different from what I
thought, on several levels:
- You seem to combine the one-shot command to get the list of sessions
in a group with subscribing to the notifications for that group. I
don't think these should be combined; you might want one without the
other (e.g. you don't need to be subscribed to the notifications if
you want to send a command to the whole group, you just need the
current list of subscribers).
- I envisioned the notifications would be done by subscribing to a
group. By combining it with the answer, you make this impossible, so
you actually increase the amount of special-case code in both msgq and
the libraries.
- After thinking about how the command-to-group would work, a client
would constantly be subscribing to and unsubscribing from the
notifications of various groups. So I think it might be better to just
subscribe to all these notifications (not to a specific group, but to
all the groups at once) and pass both the lname and the group in
question with the notification.
* As mentioned above, the message types seem very low level. On the
other hand, I think we should document the higher-level (JSON) format
somewhere too ‒ the format of a command, a reply, etc.
* The unique ID is in all messages, and it is returned from
sendGroupMsg at all times, not only for the messages that need a
response. It's just that the ID is ignored if no response is needed.
* The ID is unique only per sender, so the sender can track responses.
At the recipient (or whole-system) level the IDs are not unique. Also,
the per-sender uniqueness is not actually mandated by the protocol; it
would work even if IDs were reused, as long as the sender no longer
expects responses to the original message with the same ID.
* How can the system know the difference between close and termination?
It just gets EOF in both cases.
* I don't think we should forbid sending a command (i.e. a message
that wants a response) to a group. As I mentioned, there are
„singleton groups“ ‒ more like aliases than actual groups. Examples
are msgq itself, cfgmgr, or the stats daemon. There's no sense in
having multiple of them and it probably wouldn't even work properly.
Requiring the whole round of getting the list with the single
subscriber, collecting the responses, etc., seems suboptimal. Also, I
expect there'd be two interfaces: a simpler one (for single-recipient
RPC) that just provides the response (rpcCall) or calls a callback
with it (rpcCallAsync), and one for collecting all the responses
(rpcCallMulti). The interface of the latter must be more complex,
because it needs to return multiple answers/errors at once. I don't
think it makes sense to disallow the simpler interface in the (much
more usual) case where at most one subscriber is expected. A sketch of
these interfaces follows this list.
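To make that concrete, here is a minimal sketch of how the two
interfaces could look, assuming a Session object with group_sendmsg
and group_recvmsg behaving roughly like the current isc.cc.session
ones. The rpcCall/rpcCallMulti names are the ones proposed above; the
class name, payload layout and error handling are illustrative only:
{{{
class RpcError(Exception):
    """Raised when the remote side answers with an error result."""

class RpcClient:
    def __init__(self, session):
        self._session = session

    def rpcCall(self, command, group):
        # Single-recipient RPC, e.g. for "singleton groups" such as
        # cfgmgr: send one command, wait for the one answer. The
        # sequence number returned by group_sendmsg is unique per
        # sender only, which is all we need to match the reply.
        seq = self._session.group_sendmsg(command, group,
                                          want_answer=True)
        answer, env = self._session.group_recvmsg(nonblock=False,
                                                  seq=seq)
        result = answer.get('result', [1, "malformed answer"])
        if result[0] != 0:
            raise RpcError(result)
        return result[1] if len(result) > 1 else None

    def rpcCallMulti(self, command, group, members):
        # Group RPC: 'members' is the current subscriber list,
        # obtained beforehand from the system. Returns
        # {lname: result}, so multiple answers/errors are reported at
        # once (timeouts and dead recipients are ignored for brevity).
        pending = {}
        for lname in members:
            seq = self._session.group_sendmsg(command, group,
                                              to=lname,
                                              want_answer=True)
            pending[seq] = lname
        results = {}
        while pending:
            answer, env = self._session.group_recvmsg(nonblock=False)
            lname = pending.pop(env.get('reply'), None)
            if lname is not None:
                results[lname] = answer.get('result')
        return results
}}}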
> At the high level, I thought "daemon" is also too implementation
> specific. Even though we currently implement the "bus" as a separate
> daemon process, I've believed we conceptually regard it as an abstract
> messaging system at this level. In ipc-high-2, I simply called it the
> "system".
I can't imagine any way this would reasonably work without a routing
daemon, given all these assumptions. But I don't mind „system“ either.
> What's the point of keeping it open? If we leave it open (not
> prohibiting it), it would simply open up the possibility of arbitrary
> understanding and use based on it, just like we currently have it in
> the implementation. Now that we are introducing the more reliable
> membership (subscribers) management/notification framework, it seems
> to me more helpful at the design level to just prohibit it (while
> noting until we fully migrate to the membership notification we need
> to keep using this model of group communication).
I'm not keeping it open, I'm quite clearly saying „Don't do it“, with
an explanation of why it is wrong. OK, it might be a more British
„Don't do it“ than an American one.
If you write „undefined behaviour“, you are not explaining why it is
wrong, and you also scare a reader who finds an occurrence of exactly
this thing in our code (and there are such occurrences), because they
may think „Hmm. Now anything can happen, including, for example,
termination of msgq“.
> > I don't really agree here it's only optimisation. There are
> > modules that are not expected to take long to answer. For example
> > the statistics daemon doesn't do anything but collect and answer
> > statistics. But it doesn't have to be there.
>
> On working on another version of the doc, I now actually feel more
> strongly that it is optional. Regarding the statistics example
> above, it would be implemented as by-group communication, right (the
> cmdctl, on receipt of the command from bindctl, sends a message to
> the "Stats" group)? If so, design-wise cmdctl should get the
> subscribers of the group first, because direct group communication
> with a response is either at best undefined and discouraged (in the
> current doc) or prohibited (my suggestion). But then cmdctl doesn't
> have to rely on the "undeliverable" result; it can simply avoid
> sending the hopeless message if there's no subscriber. There's still
> a subtle case where the recipient dies during the message exchange,
> which could lead to a longer timeout, but that also applies to the
> "undeliverable" case.
Only if you really insist on forbidding the singleton groups (e.g.
groups expected to contain at most one client).
> - I still don't understand how these would be used:
> {{{
> * Client connected (sent with the lname of the client)
> * Client disconnected (sent with the lname of the client)
> }}}
Well, I imagined these four kinds of notifications could happen
(examples):
* Notification: A new client with lname = '12345' connected to the system.
* Notification: Client with lname = '12345' subscribed to group named
'Group'.
* Notification: Client with lname = '12345' unsubscribed from group named
'Group'.
* Notification: Client with lname = '12345' disconnected from the system.
These would be sent to whoever is subscribed to a group
'SessionManagement' (or any other well-known name); a sketch of
possible payloads follows.
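Purely as an illustration, the payloads might look something like
this; the field names and layout below are not a proposed wire format,
just an example of what information would be carried:
{{{
# Hypothetical notification payloads (JSON objects), delivered to
# subscribers of a well-known group such as 'SessionManagement'.
{"notification": "connected",    "lname": "12345"}
{"notification": "subscribed",   "lname": "12345", "group": "Group"}
{"notification": "unsubscribed", "lname": "12345", "group": "Group"}
{"notification": "disconnected", "lname": "12345"}
}}}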
> - On writing my own version, I realized the RPC call is better
> considered application-level API sugar on top of the IPC system,
> rather than part of the system itself. That is, it's an
> encapsulation of send-and-receive operations, where the message data
> somehow means executing something at the receiver's side (and
> returning its result). But the semantics of the data is basically a
> matter between the two users (clients). The IPC system itself
> doesn't have to care about that level.
Well, RPC is one part of IPC, seen from the application level. It's
true the „system“ itself (msgq, the isc.cc.session library) doesn't
have to care about it, but the higher levels (from isc.config.Session
up) and the applications do care. It's part of what we do with the
system, and it makes sense to mention it. I think it is worth at least
noting that such functions exist, so clients don't reinvent the wheel
(preferably with a description of how each type of communication
works). A sketch of that split follows.
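To illustrate where the RPC convention would live, here is a sketch of
the two endpoints; msgq just routes opaque JSON between them. The
command/result layout mirrors the one isc.config uses today, while the
helper names, the exact signatures and the COMMAND_HANDLERS registry
are hypothetical, and the Session methods are assumed to behave
roughly like the current isc.cc.session ones:
{{{
# Client side: wrap the call in the agreed JSON convention and send
# it; everything below this level sees only an opaque message.
def call_remote(session, group, name, args):
    seq = session.group_sendmsg({'command': [name, args]}, group,
                                want_answer=True)
    answer, env = session.group_recvmsg(nonblock=False, seq=seq)
    return answer            # e.g. {'result': [0, value]} on success

# Hypothetical registry mapping command names to handler functions;
# the application fills this in.
COMMAND_HANDLERS = {}

# Server side: unwrap, run the application logic, answer in the same
# convention.
def handle_message(session, msg, env):
    name, args = msg['command']
    try:
        value = COMMAND_HANDLERS[name](args)
        reply = {'result': [0, value]}
    except Exception as exc:
        reply = {'result': [1, str(exc)]}   # error code + description
    session.group_reply(env, reply)
}}}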
And, I believe `lname` means „link name“.
--
Ticket URL: <http://bind10.isc.org/ticket/2738#comment:13>
BIND 10 Development <http://bind10.isc.org>