[bind10-dev] proposed system design and inter-process protocol for shared memory

Thu Mar 7 07:49:10 UTC 2013

Hello

Resending once more to the list, because I hit the reply button instead of reply
to list :-(. I'm going to forward Jinmei's answer too in the next email.

On Tue, Mar 05, 2013 at 07:25:52PM -0800, JINMEI Tatuya / 神明達哉 wrote:
> It's quite long, but since the tentatively-named "memory manager" and
> other processes need to be synchronized in a reliable way, I suspect
> some level of complexity is inevitable.  I still tried to minimize the
> initial set of description so we won't get lost in a too big design
> picture.

There's something I'd strongly disagree with. The design is a fine example of
how we try to babysit the messaging system and work around some missing parts,
instead of fixing it. This is not the way forward, since, as you admit, the
protocol is complicated and it is needlessly so.

So, first, our messaging system uses TCP. That is the deliver-or-die principle:
you can never lose a message in the middle of a TCP stream. We also say we
consider the death of a connection an unrecoverable error, since we could lose
messages in between. If a module loses its connection to msgq, it crashes, and
it does so for exactly this reason. If msgq itself dies, then the whole of
bind10 goes down. The only way a message might not be delivered to a recipient
is when the recipient dies (or enters an infinite loop). The first needs to be
solved (see below), the second won't be solved by retransmits anyway. So, please
don't require retransmits; retransmits over TCP are wrong.

This is a counter-proposal:

Obviously, we need to know, in a reliable manner, when a module disconnects.
We need msgq to send notifications about that disconnection/unsubscription from
a group, because no other module can reliably know that. Once it does that, we
can have subscription notifications too for almost the same amount of work. We
can use that.

So, the thing could work more cleanly like this, employing notifications. There
are these types of messages:
 • „Map!“ notification, from manager to consumer (no direct answer to that).
   Says which data source and which version is to be mapped.
 • „Give me mapping.“ command from consumer to manager. The corresponding answer
   contains the information needed to map the file.
 • „I release the mapping.“ command from consumer to manager. Answer is sent
   just to satisfy the CC protocol and contains no useful information.
 • „This one subscribed to the consumer channel.“ notification from msgq to
   manager.
 • „This one unsubscribed from the consumer channel.“ notification from msgq to
   manager.
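
To make the message set above concrete, here is a rough Python sketch of it as
data. The wire names ("map", "get_mapping" and so on) and the helper function
are made up for illustration; only the senders, recipients and the
notification/command split come from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    NOTIFICATION = "notification"  # no direct answer expected
    COMMAND = "command"            # an answer is always sent back


@dataclass(frozen=True)
class MessageType:
    name: str       # hypothetical wire name, not taken from any real code
    kind: Kind
    sender: str
    recipient: str


MESSAGE_TYPES = [
    MessageType("map", Kind.NOTIFICATION, "manager", "consumer"),
    MessageType("get_mapping", Kind.COMMAND, "consumer", "manager"),
    MessageType("release_mapping", Kind.COMMAND, "consumer", "manager"),
    MessageType("subscribed", Kind.NOTIFICATION, "msgq", "manager"),
    MessageType("unsubscribed", Kind.NOTIFICATION, "msgq", "manager"),
]


def expects_answer(name):
    """Return True if the named message is a command, i.e. gets an answer."""
    for message_type in MESSAGE_TYPES:
        if message_type.name == name:
            return message_type.kind is Kind.COMMAND
    raise KeyError(name)
```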

Now, I don't assume any specific startup order of the components, and they can
die at any time.

Startup:
 • When a consumer connects, it only subscribes to the consumer channel and
   waits (the data source is not mapped yet).
 • When the manager starts, it prepares the current version of the file for
   each data source (either by finding it already on disk, creating it,
   updating an existing one, whatever). Once each file is ready, it broadcasts
   „Map!“ to tell all already connected consumers the data source is ready to
   be mapped.
 • When each consumer gets the „Map!“ notification, it sends „Give me
   mapping.“ to the manager and gets the information needed to map it. The
   manager remembers this consumer is active on the mapping and that the
   mapping can't be modified yet.
 • When (another) consumer connects, it subscribes to the consumer channel and
   waits. The manager gets a notification that it subscribed and sends „Map!“
   to it for each file it already has ready.
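
The manager side of the startup steps above could look roughly like the
following Python sketch. Everything here is an assumption made for
illustration: the class and method names, the in-memory "outbox" that stands in
for sending through msgq, the "*" marker for a broadcast, and the file name
layout in the answer.

```python
class Manager:
    """Hypothetical manager-side bookkeeping for the startup sequence."""

    def __init__(self):
        self.ready = {}   # data source name -> current version
        self.users = {}   # (data source, version) -> set of consumer ids
        self.outbox = []  # (recipient, message) pairs that would go to msgq

    def file_ready(self, source, version):
        """A file was prepared: remember it and broadcast "Map!"."""
        self.ready[source] = version
        self.users.setdefault((source, version), set())
        self.outbox.append(("*", ("map", source, version)))

    def on_subscribed(self, consumer):
        """msgq reports a new consumer: send it "Map!" for each ready file."""
        for source, version in sorted(self.ready.items()):
            self.outbox.append((consumer, ("map", source, version)))

    def get_mapping(self, consumer, source):
        """Answer "Give me mapping." and record the consumer as a user."""
        version = self.ready.get(source)
        if version is None:
            return {"error": "not ready"}
        self.users[(source, version)].add(consumer)
        # The file name layout is invented for this sketch.
        return {"version": version, "file": f"{source}.{version}.mapped"}


manager = Manager()
manager.file_ready("ds", 1)        # broadcast to consumers already connected
manager.on_subscribed("auth-2")    # a late consumer gets a directed "Map!"
reply = manager.get_mapping("auth-2", "ds")
```

Note that the late consumer needs no special case on its own side: it reacts to
the directed „Map!“ exactly as it would to the broadcast.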

Update:
 • The manager prepares a new file.
 • The manager broadcasts „Map!“ to the channel, saying there's a new version.
 • Each consumer sends „Give me mapping.“ and maps the new file.
 • Each consumer then releases the old mapping by saying „I release the
   mapping.“ The manager removes the consumer from the list on the old file. If
   the list is empty, it can do as it wishes with the file.
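
On the consumer side, the update steps above might be sketched like this in
Python. The names are hypothetical, "send_command" stands in for a blocking CC
command to the manager, and the actual mapping of the file is left out; the
point is only the order of operations (take the new mapping first, release the
old one afterwards).

```python
class Consumer:
    """Hypothetical consumer-side handling of an update."""

    def __init__(self, send_command):
        # send_command(name, **args) -> answer dict. It stands in for a
        # blocking CC command to the manager (an assumption of this sketch).
        self.send_command = send_command
        self.mapped = {}  # data source -> (version, file name)

    def on_map(self, source, version):
        """Handle "Map!": take the new mapping, then release the old one."""
        # The version in the notification is informational here; the
        # manager's answer says which version was actually handed out.
        answer = self.send_command("get_mapping", source=source)
        if "error" in answer:
            return  # nothing new to map, keep the current mapping
        old = self.mapped.get(source)
        # Switch first, so the data source is never left unmapped.
        self.mapped[source] = (answer["version"], answer["file"])
        if old is not None and old[0] != answer["version"]:
            # Only now tell the manager the old version is no longer used.
            self.send_command("release_mapping", source=source, version=old[0])
```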

Shutdown:
 • The consumer sends „I release the mapping.“ If the manager is still running,
   it removes the consumer from the list. If not, the consumer gets an error,
   but that's OK.

Crash of consumer:
 • The manager gets the unsubscription notification from msgq and removes the
   consumer from all lists, freeing files as appropriate.
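
The cleanup after a crashed consumer is small. A minimal Python sketch,
assuming the manager keeps a dictionary from (data source, version) to the set
of consumers using that file (this layout and the function name are
assumptions, not existing code):

```python
def drop_consumer(users, consumer):
    """Remove a consumer from every user list.

    users maps (data source, version) -> set of consumer ids.
    Returns the keys whose list became empty, so the caller can decide
    what to do with those files.
    """
    unused = []
    for key, consumers in users.items():
        if consumer in consumers:
            consumers.discard(consumer)
            if not consumers:
                unused.append(key)
    return unused


users = {("ds", 1): {"auth-1"}, ("ds", 2): {"auth-1", "auth-2"}}
unused = drop_consumer(users, "auth-1")  # called on the msgq notification
```

Here version 1 ends up with no users and is reported back, while version 2 is
still held by the other consumer.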

Crash of manager:
 • The safest thing would probably be to unmap all files, but I'm not sure we
   want to do that. If so, each consumer would listen for notifications about
   the manager unsubscribing.

Update of configuration:
 • We don't know if the consumer or the manager is ready with it first. So, we
   do both:
   ◦ When the manager is reconfigured and the new file is ready, it broadcasts
     „Map!“. All consumers already waiting for it ask for the mapping and get
     it; the ones not yet configured ignore it.
   ◦ When the consumer is reconfigured, it sends „Give me mapping.“. If the
     manager already has the mapping ready, it sends it. If not, it returns
     an error and the consumer just waits for the appropriate „Map!“
     notification.
 • When a data source is removed, it just works: the consumer sends „I release
   the mapping.“
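
The "do both" rule for configuration updates can be checked with a small Python
sketch of the consumer side. The class, the "wanted" set and the fake manager
in the usage part are assumptions for illustration; the behaviour (ignore
„Map!“ for unconfigured data sources, ask on reconfiguration, wait on error)
follows the steps above.

```python
class ConfiguredConsumer:
    """Hypothetical consumer that copes with either configuration order."""

    def __init__(self, send_command):
        # send_command(name, **args) -> answer dict (assumed interface).
        self.send_command = send_command
        self.wanted = set()  # data sources this consumer is configured for
        self.mapped = {}     # data source -> mapped version

    def on_configured(self, source):
        """The consumer got its new configuration: ask right away."""
        self.wanted.add(source)
        self._try_map(source)

    def on_map(self, source, version):
        """Handle "Map!"; ignore it for data sources not configured yet."""
        if source in self.wanted:
            self._try_map(source)

    def _try_map(self, source):
        answer = self.send_command("get_mapping", source=source)
        if "error" not in answer:
            self.mapped[source] = answer["version"]
        # On error do nothing: the later "Map!" will trigger another try.


ready = {}  # what a fake manager has ready: data source -> version


def fake_manager(command, source):
    if source in ready:
        return {"version": ready[source]}
    return {"error": "not ready"}


consumer = ConfiguredConsumer(fake_manager)
consumer.on_configured("ds")  # consumer first: gets an error and waits
ready["ds"] = 1               # manager finishes preparing the file
consumer.on_map("ds", 1)      # the notification completes the mapping
```

Both orders end in the same state without any timer or retry loop.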

There's no need for timers or retransmits, which lets the manager and consumers
concentrate on their job and not on communication. It also handles all the
crashes, starts and stops of components.

With regards

-- 
Fragile. Do not turn umop ap1sdn!

Michal 'vorner' Vaner