[bind10-dev] Resend: Re: proposed system design and inter-process protocol for shared memory
JINMEI Tatuya / 神明達哉
jinmei at isc.org
Thu Mar 7 22:46:07 UTC 2013
At Thu, 7 Mar 2013 09:49:00 +0100,
Michal 'vorner' Vaner <michal.vaner at nic.cz> wrote:
> > I'm afraid this might be too harsh. We've been seeing some strange
> > network errors on the UNIX domain sockets, so such cases do not seem
> > to be super exceptional (and it's not clear whether they are
> > recoverable or fatal). In the C++ case, additional layers of ASIO
> > make it more complicated. But, at least for the initial
> > implementation, I can live with that assumption and see how it goes.
> > We should include this requirement as part of the API contract (e.g.,
> > require applications not to catch and handle some specific exception
> > but to terminate).
>
> What strange errors do you mean?
Such as this one: http://bind10.isc.org/ticket/1937 (to be very
precise, though, it was not clear whether it happened on the UNIX
domain socket or on the UDP or TCP DNS socket; the problem had somehow
gone away).
> Anyway, if part of the data may be
> lost, we are doomed anyway, because it does _not_ preserve record
> boundaries. So, we may have lost part of one message, header of
> another, and we don't know how long the rest is.
That's the case in your model. If we ensure synchronization outside
of msgq's reliability, with explicit timeouts and resends, we can
re-establish the state. I'm not saying this to propose
re-introducing application-side retransmission, but I'd like to
reserve the possibility until actual experiments show such failures
are sufficiently exceptional.
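To make that concrete, here is a minimal sketch of what such
application-level timeout and resend could look like. This is an
illustration only: the session object and its send()/try_receive()
methods are hypothetical stand-ins, not the actual msgq API.

    import time

    def send_with_resend(session, msg, seq, timeout=5.0, max_tries=3):
        # Send msg tagged with a sequence number and wait for a
        # matching ack; on timeout, resend.  After max_tries the
        # caller falls back to re-establishing its state from scratch.
        for _ in range(max_tries):
            session.send({'seq': seq, 'body': msg})
            deadline = time.time() + timeout
            while True:
                remaining = deadline - time.time()
                if remaining <= 0:
                    break  # timed out; resend on the next iteration
                reply = session.try_receive(timeout=remaining)
                if reply is not None and reply.get('ack') == seq:
                    return True  # peer confirmed this message
                # unrelated or stale message: ignore, keep waiting
        return False  # give up; caller re-synchronizes its state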
> > With this approach, I suspect we also need to ensure the high level
> > "send message" operation is non-blocking. [...]
>
> Well, here's what I'm thinking. We send only small messages. We have the
> intermediate buffers in msgq to send, and msgq is ready to read at
> all times. So unless we really overload msgq so that it can't keep up,
> it'll make sure the send buffers of connected modules are
> empty. Even if the message is long, the send itself would block only
> for a really short time (switch to msgq and back), which is IMO
> acceptable.
IMO, cases like this are one example of where the difference between
production-quality systems and students' hobby projects lies. In
extreme environments like "carrier-grade" operations, reality
always betrays our naive imagination of what could happen. So,
assumptions like "it should be really short and should be acceptable"
aren't acceptable to me if we want to develop something professional.
As I said in the previous message, I'm okay with skipping that part in
the initial implementation. But I'm opposed to excluding it entirely
from our estimate of the total amount of work (and thereby pretending
the overhead is small).
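For illustration, a genuinely non-blocking send path would buffer
whatever it cannot write immediately instead of blocking the caller.
A rough sketch with plain sockets (not the real msgq code; the class
and its methods are invented for this example):

    import errno
    import socket

    class NonBlockingSender:
        # Buffers data that cannot be written immediately; flush() is
        # meant to be called again when select/poll reports the socket
        # writable, so the caller itself never blocks.
        def __init__(self, sock):
            sock.setblocking(False)
            self._sock = sock
            self._pending = b''

        def send(self, data):
            self._pending += data
            self.flush()

        def flush(self):
            while self._pending:
                try:
                    n = self._sock.send(self._pending)
                except socket.error as e:
                    if e.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
                        return  # kernel buffer full; retry later
                    raise  # real error: let the caller decide
                self._pending = self._pending[n:]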
> • The Python one needs the async recv to be implemented too. But auth is in C++
> and the manager could in theory block for a short while without causing
> disruption. And, do we plan to make it in Python or C++?
Is this about which language (C++ or Python) we use for the "memory
manager"? I'm currently inclined toward Python, because libdatasrc (in
C++) will be able to handle most of the tricky (and possibly
performance-sensitive) jobs, and it shouldn't be a big task to extend
the current Python wrapper to cover that part of the library.
> > And, I wonder whether we can totally eliminate timers. For example,
> > if a user/consumer process has a bug and does not correctly respond to
> > an update mapping notification, the entire system will be effectively
> > in a deadlock (in terms of shared memory management). Even if we can
> > detect it with a timer, there may not be a way of graceful recovery,
> > but I guess it's still better than a deadlock-like situation. That
> > can be skipped in the initial implementation, though.
>
> Hmm, right. But that might be only a single timer in the manager (time
> out, and if the file is not yet freed, kill everyone still having it
> active). But the timeout would have to be long enough; what if XfrOut is
> just sending the zone out and needs to finish that?
Maybe we need some keep-alive mechanism to distinguish this case from
real errors. But the example of a long-running xfrout raises other
issues. It's possible that, while the manager cannot update the image
due to a remaining reader, another update arrives (e.g., as a result
of an incoming xfr or DDNS). Delaying one such update is itself bad, and
if more updates keep coming the situation may become catastrophic.
I don't have a perfect solution to that situation. Separating memory
segments (mapped files) for different (sets of) zones may help, but it
introduces additional complexity (and is still not really a "perfect"
solution).
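As a sketch of what such a keep-alive could look like on the manager
side (all names here, including the interval constants, are invented
for illustration):

    import time

    KEEPALIVE_INTERVAL = 30        # readers report at least this often
    GRACE = 2 * KEEPALIVE_INTERVAL # allowance before declaring one dead

    def on_keepalive(readers, reader_id):
        # A reader (e.g., a long-running xfrout) reports it is still
        # using the segment; record when we last heard from it.
        readers[reader_id] = time.time()

    def expire_dead_readers(readers):
        # Periodic timer in the manager: a busy-but-alive reader keeps
        # sending keep-alives, so only readers that stopped reporting
        # (crashed or deadlocked) get dropped here.
        now = time.time()
        for reader_id, last_seen in list(readers.items()):
            if now - last_seen > GRACE:
                del readers[reader_id]  # segment may now be freed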
> We may consider having three versions of the files, instead of two. One is being
> updated, one is being switched to and one is being released.
I suspect we can't solve this issue by adding more versions, as yet
another update may be required even while the third (or N-th, for
arbitrarily large N) version still has a reader.
One other thing I forgot to mention in the previous message: I guess
we can avoid having "release mapping" at all. If I understand it
correctly, it's only for graceful removal of a segment user/consumer.
Gracefulness is good, but as long as we have a timely notification
mechanism for when a user disappears, we do not necessarily need to
rely on it.
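Concretely, the manager could treat a "user disappeared" notification
the same way as an explicit "release mapping" message. A minimal
sketch, assuming such a notification exists (every name below is
hypothetical):

    class SegmentInfo:
        # Hypothetical record the manager keeps per mapped-file segment.
        def __init__(self):
            self.users = set()   # IDs of processes mapping this segment

        def release_old_version(self):
            pass                 # would remove the obsolete file here

    def on_user_disappeared(segments, user_id):
        # Called when the manager learns a user process is gone; this
        # plays the same role an explicit "release mapping" would.
        for seg in segments:
            seg.users.discard(user_id)
            if not seg.users:
                seg.release_old_version()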
And, finally, not really a critical point:
> > A minor correction first: it doesn't use TCP; it's IPC over UNIX
> > domain sockets.
>
> As a matter of fact, I still believe that is TCP. The socket is created with
> something like:
Could I suggest not playing with words, especially with well-
standardized terminology? When people are told TCP, they interpret
it to mean the specific protocol defined in RFC 793, with all of its
features beyond reliability, such as flow control and urgent
(out-of-band) data support. Maybe that's not the case for you; to you
the word "TCP" may simply mean "(stream-based) reliable communication
model". But I bet that usage is a vast *minority*, even if it's not
only you; Francis's responses showed there's at least one other person
on this mailing list who doesn't use the word as you do. For those
people, it's simply confusing to be told "TCP" when we are only
discussing reliability.
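For what it's worth, the distinction is visible right at socket
creation: a UNIX domain stream socket gives a reliable byte stream
without involving TCP at all. An illustration (not msgq's actual
code):

    import socket

    # UNIX domain stream socket: reliable, connection-oriented byte
    # stream, but no TCP (no IP, no ports, no TCP flow-control
    # windows, no urgent data).
    unix_sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)

    # TCP proper would be an AF_INET (or AF_INET6) stream socket:
    tcp_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)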
---
JINMEI, Tatuya
Internet Systems Consortium, Inc.