[bind10-dev] Resend: Re: proposed system design and inter-process protocol for shared memory

Michal 'vorner' Vaner michal.vaner at nic.cz
Thu Mar 7 08:49:00 UTC 2013


Hello

On Thu, Mar 07, 2013 at 12:07:46AM -0800, JINMEI Tatuya / 神明達哉 wrote:
> > So, first, our messaging system uses TCP. That is, deliver or die principle. You
> 
> A minor correction first: it doesn't use TCP; it's IPC between UNIX
> domain sockets.  I'm not really sure if any standard (POSIX?)
> guarantees it's reliable, but at least it seems possible to assume
> that as a matter of de facto.  So the rest of the discussion holds
> whether it's TCP or UNIX domain IPC.

As a matter of fact, I still believe it is TCP. The socket is created with
something like:

  socket(AF_UNIX, SOCK_STREAM, 0)

That is, TCP over a unix-domain socket instead of TCP on top of IPv4/IPv6, but TCP
anyway. Or, well, the OS may optimise some unnecessary parts out of it (like
ACKs, but it can do so with IP on 127.0.0.1 as well), but I guess this is clear
enough (from man socket):

   SOCK_STREAM     Provides sequenced, reliable, two-way, connection-based byte streams.
                   An out-of-band data transmission mechanism may be supported.

and:

 The communications protocols which implement a SOCK_STREAM ensure that data is not lost or
 duplicated. If a piece of data for which the peer protocol has buffer space cannot be successfully
 transmitted within a reasonable length of time, then the connection is considered to be dead. When
 SO_KEEPALIVE is enabled on the socket the protocol checks in a protocol-specific manner if the other
 end is still alive. A SIGPIPE signal is raised if a process sends or receives on a broken stream;
 this causes naive processes, which do not handle the signal, to exit.
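
(Just to illustrate the semantics we rely on, a toy Python example, nothing to
do with our actual code:)

  import socket

  # a connected pair of AF_UNIX/SOCK_STREAM sockets, the same kind of link
  # every module has to msgq
  a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
  a.sendall(b"hello")
  assert b.recv(5) == b"hello"   # bytes arrive complete and in order,
                                 # or the connection is considered dead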

> > can never lose a message in the middle of TCP stream. We also say we consider
> > death of the connection an unrecoverable error, since we could lose messages in
> > between. If a module loses connection to msgq, it crashes and it is for this
> > purpose.
> 
> I'm afraid this might be too harsh.  We've been seeing some strange
> network errors on the UNIX domain sockets, so such cases do not seem
> to be super exceptional (and it's not clear whether they are
> recoverable or fatal).  In the C++ case additional layers of ASIO
> makes it more complicated.  But, at least for the initial
> implementation I can live with that assumption and seeing how it goes.
> We should include this requirement as part of the API contract (e.g.,
> require applications not to catch and handle some specific exception
> but to terminate).

What strange errors do you mean? Anyway, if part of the data may be lost, we are
doomed anyway, because the stream does _not_ preserve record boundaries. So, we may
have lost part of one message and the header of another, and we don't know how long
the rest is.
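
To show what I mean, a minimal sketch of length-prefixed framing over a stream
(only an illustration of the problem, the real msgq wire format differs in
details): once any bytes in the middle are lost, the next "length" is read from
the middle of some message and we never resynchronise.

  import struct

  def _recv_exact(sock, n):
      buf = b""
      while len(buf) < n:
          chunk = sock.recv(n - len(buf))
          if not chunk:
              raise ConnectionError("stream closed")
          buf += chunk
      return buf

  def send_msg(sock, payload):
      # length prefix; the stream itself carries no record boundaries
      sock.sendall(struct.pack("!I", len(payload)) + payload)

  def recv_msg(sock):
      # if earlier bytes were silently dropped, this "length" is really
      # message data and everything that follows is garbage
      length, = struct.unpack("!I", _recv_exact(sock, 4))
      return _recv_exact(sock, length)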

> > If msgq itself dies, then whole bind10 goes down. The only way a
> > message might not be delivered to a recipient is when the recipient dies (or
> > enters an infinite loop). The first needs to be solved (see below), the second
> > won't be solved by retransmits anyway. So, please don't require retransmits,
> > retransmits over TCP are wrong.
> 
> I knew we can simplify the protocol if we could clarify the
> reliability of the message system (depending on how we clarify it).
> The initially proposed one was based on the assumption that we won't
> have time for it with accompanying extensions.  So the question we
> should discuss is whether we include these tasks in this shared-memory
> feature.  It depends on the expected amount of work, but personally I
> think it's better to include them.  These clarifications and msgq
> extensions such as the ability of getting a list of live components
> will be necessary for other purposes, and will help make the overall
> system more robust.  But we should discuss this project-wide.

I agree we should look at the msgq first. Not only because the system will be more
robust with it; I believe it might actually simplify the protocol enough to
pay back the time we spend on improving msgq. And the receptionist (assuming we
will implement it) will benefit from it too, and possibly the resolver.

> With this approach, I suspect we also need to ensure the high level
> "send message" operation is non-blocking.  It's now critical since at
> the lower level it can block, and the application may not wait, and we
> now cannot simply let it fail and reconnect (because it now means
> suicide).  We should implement it with an intermediate buffer +
> internal asynchronous write, or by even introducing more generic
> asynchronous message exchanges and using it.

Well, my thinking is this. We send only small messages. We have the intermediate
buffers in msgq for sending, and msgq is ready to read at all times. So unless we
really overload msgq so that it can't keep up, it'll make sure the send buffers of
connected modules stay empty. Even if a message is long, the send itself would
block only for a really short time (a switch to msgq and back), which is IMO
acceptable.

What we do need to make asynchronous is waiting for answers. The rpcCall is the
blocking version, which might be fine for things like cfgmgr or stats. We need
an asynchronous version of that too (a rough sketch follows below this list), but:
 • There's an asynchronous recv in C++ already, so here it would be only a
   wrapper of similar length as the blocking rpcCall. The method would be
   slightly less convenient than the blocking one, because you need a callback.
 • The python one needs the async recv to be implemented too. But auth is in C++
   and the manager could in theory block for a short while without causing
   disruption. And do we plan to write it in Python or C++?
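
To illustrate what I mean by "only a wrapper", a rough Python sketch (the method
name group_sendmsg and the shape of the reply envelope are assumptions for the
example, not the real session API; the real thing would hook into the module's
event loop):

  class AsyncRpc:
      def __init__(self, session):
          self._session = session
          self._pending = {}              # sequence number -> callback

      def rpc_call_async(self, command, group, params, callback):
          # send the command and remember which callback the answer belongs to
          seq = self._session.group_sendmsg({"command": [command, params]}, group)
          self._pending[seq] = callback

      def handle_reply(self, envelope, msg):
          # invoked from the event loop whenever an answer arrives
          callback = self._pending.pop(envelope.get("reply"), None)
          if callback is not None:
              callback(msg)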

> And, I wonder whether we can totally eliminate timers.  For example,
> if a user/consumer process has a bug and does not correctly respond to
> an update mapping notification, the entire system will be effectively
> in a deadlock (in terms of shared memory management).  Even if we can
> detect it with a timer there may not be a way of graceful recovery,
> but I guess it's still better than a deadlock like situation.  That
> can be skipped in the initial implementation, though.

Hmm, right. But that might be only a single timer in the manager (time out, and if
the file is not yet freed, kill everyone that still has it active). But the timeout
would have to be long enough; what if XfrOut is just sending the zone out and needs
to finish that first?

We may consider having three versions of the files instead of two: one being
updated, one being switched to, and one being released.

> - what's "the consumer channel"?

Well, currently when the Auth starts, it subscribes to the group named „Auth“
(and some others, like „tsig_keys“). It would subscribe to some group
„ShmemConsumers“, and so would any other module that wants the shared memory
(XfrOut). So the group would be the channel (sorry for confusing the terminology
in an attempt to make it cleaner).
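
In Python that could be as little as one more call at startup (group_subscribe
as the method name is only illustrative here):

  # the consumer simply joins one more group when it starts
  session.group_subscribe("Auth")
  session.group_subscribe("tsig_keys")
  session.group_subscribe("ShmemConsumers")   # the "consumer channel"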

> - how can msgq/manager/consumers distinguish them?  In particular, how
>   can the manager distinguish individual consumers?

Each connection to msgq has a unique address, the l-name in its terminology (I
don't know where that comes from). We use it to send answers directly to the
originator of a command; it is in the „from“ header of each message. Nothing stops
us from storing the l-name as the identification of a consumer and directing
messages to it (so you would specify the „to“ parameter of the message when doing
groupSendMsg, instead of leaving it at the default „*“).
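
A hypothetical sketch of that (the command name and its parameters are made up
for the example, and the exact signature of the send call may differ):

  # the manager remembers the consumer's l-name from the "from" header...
  consumer_lname = envelope["from"]

  # ...and later addresses that one consumer instead of broadcasting
  session.group_sendmsg({"command": ["release_segment", {"file": "zone.v1"}]},
                        "ShmemConsumers", to=consumer_lname)   # not the default "*"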

> - how does msgq know it when a component (e.g. auth) crashes?  And
>   how does it tell another component (e.g., memmgr) that?

It knows because the connection is closed. OK, to be exact, it doesn't know whether
the component crashed or shut down gracefully, but that doesn't matter; either way,
the component is gone.

It removes the component from any groups it is subscribed to. Unsubscribing from
a group is the event we would notify about.

And it can tell others, since it already keeps a connection to itself, so it is
kind of a full citizen on the bus. It can send a notification that would look
exactly the same as any other notification.

How I imagine a notification would look ‒ there would be a group for each
kind of notification (like „Notifications/Subscriptions“ or
„Notifications/ZoneChanges“). Anyone interested in that kind of notification
just subscribes to that group and gets the messages sent there. The message itself
would look slightly different, so it could be distinguished from a command.
So, it would be {"notification": ["name", {params}]} instead of
{"command": ["name", {params}]}.

But we already have most of the needed building blocks for notifications in
place. We just need a place to register callbacks to be called when a
notification comes.
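
For instance, the receiving side could tell the two apart roughly like this (a
sketch only; the handler and callback registries are assumptions, not existing
code):

  # dispatch on the top-level key of the message body
  def dispatch(msg, command_handlers, notification_callbacks):
      if "command" in msg:
          name, params = msg["command"]
          return command_handlers[name](params)        # commands expect an answer
      if "notification" in msg:
          name, params = msg["notification"]
          for callback in notification_callbacks.get(name, []):
              callback(params)                          # notifications are one-way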

With regards

-- 
"It can be done in C++" says nothing. "It can be done in C++ easily" says nobody.

Michal 'vorner' Vaner