[bind10-dev] Resend: Re: proposed system design and inter-process protocol for shared memory
Michal 'vorner' Vaner
michal.vaner at nic.cz
Fri Mar 8 09:05:54 UTC 2013
Hello
On Thu, Mar 07, 2013 at 02:46:07PM -0800, JINMEI Tatuya / 神明達哉 wrote:
> > What strange errors you mean? Anyway, if part of the data may be
>
> Such as this one: http://bind10.isc.org/ticket/1937 (to be very
> precise, though, it was not clear whether it happened on the unix
> domain socket or on the UDP or TCP DNS socket; the problem had somehow
> gone).
There's nothing suggesting that the problem described in 1937 had anything to do
with CC or unix-domain sockets or sockets at all. The server just kept crashing
from time to time.
I did look at the backtrace in the linked email, but that core is obviously from
the unclean shutdown at the end of logging, not from one of the crashes. Looking
at the code, it should have logged CC_WRITE_ERROR, and these were logged many
times at the end of the log pasted there ‒ it seems msgq terminated too soon, so
all the other processes that wanted to send anything got a broken pipe. That is
bad (msgq terminating too soon), but not related to the crashes.
> > lost, we are doomed anyway, because it does _not_ preserve record
> > boundaries. So, we may have lost part of one message, header of
> > another, and we don't know how the rest is long.
>
> That's the case in your model. If we ensure synchronization outside
> of the msgq reliability by explicit timeout and resend, we can
> re-establish the state. I'm not saying this to propose
> re-introduction of application-side retransmission, but I'd like to
> reserve the possibility until we can determine it's sufficiently
> exceptional in actual experiments.
Well, if a part of the stream is lost (and we don't know about it), we'll
eventually get the impression that the next message is 5378924 bytes long, because
we'll read some random part of the stream and take it for the length. In that
case, we'll either explode because there's not enough memory in the system, or
block forever, because that much data will never come. Missing the middle of the
stream is not recoverable, just as SEGFAULTs are not really recoverable (not that
it would be impossible to try with either).
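
Just to illustrate what I mean (a toy sketch, not the actual CC wire format;
assume a 4-byte length prefix before each message):

    import struct

    def frame(payload):
        # hypothetical framing: 4-byte big-endian length, then the payload
        return struct.pack('>I', len(payload)) + payload

    stream = frame(b'{"command": "update"}') + frame(b'{"command": "shutdown"}')
    damaged = stream[7:]  # pretend the first 7 bytes were silently lost

    (length,) = struct.unpack('>I', damaged[:4])
    # 'length' is now some huge garbage number; we'd either allocate for it
    # or wait forever for bytes that will never arrive

Once the reader is off by even one byte, every following "length" is garbage
too, so there is no point in the stream where it could resynchronize.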
If you're talking about recovering from a lost connection, then AFAIK it was a
design decision in earlier times that we would not attempt that. That is the
reason why msgq is a „core“ component and why we shut down if we lose it.
I believe the decision was made because any number of messages could be lost
between the disconnect and the reconnect. Keeping the system consistent without
race conditions and deadlocks would be very hard, if possible at all. Who could
think of all the possible things that could happen and handle them? Our main goal
is to write a DNS server, not to work around problems in the IPC system.
Besides, what we have now was written with the assumption that losing the CC
session is fatal and that nothing would reconnect. I don't want to be the one
who reviews all the communication we have to ensure it still works if we ever
try to remove that assumption.
> IMO, cases like this are one example where the difference between
> production-quality systems and student's hobby projects reside. In
> extreme environments like "carrier-grade" operations, the reality
> always betrays our naive imagination of what could happen. So,
> assumptions like "it should be really short and should be acceptable"
> aren't acceptable to me if we want to develop something professional.
For some reason, it reminds me of the history of Multics. I was told it was so
well designed it was impossible to finish in a reasonable time. Then the simple
and ugly hack of Unix (which was, officially, a system for editing text, and the
original purpose can still be seen in it from time to time) came along and was
much more successful.
I'm OK with fixing the problem of a blocking send on the message bus if I (or
anybody else) ever see it happen when nobody has sent SIGSTOP to msgq. Until then,
I'll worry more about unhandled short reads/writes, which I still sometimes find
in the CC code ‒ yes, the quality of the code there is that poor. It's not that I
wouldn't like the code to be perfect; it's about priorities.
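
For the record, the kind of handling I mean looks roughly like this (a sketch,
not the actual CC code; os.write() on a pipe or socket may accept only part of
the buffer, and code that doesn't loop silently drops the tail):

    import os

    def write_all(fd, data):
        """Keep calling os.write() until the whole buffer has been written."""
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]

The same applies on the read side: a single read may return less than a whole
message and has to be repeated until enough bytes have arrived.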
> > Hmm, right. But that might be only single timer in the manager (time
> > out and if the file is not yet freed, kill everyone still having it
> > active). But that should be long enough, what if XfrOut is just
> > sending the zone out and needs to finish that?
>
> Maybe we need some keep alive mechanism to distinguish this case from
> real errors. But the example of long-standing xfrout raises other
> issues. It's possible that, while the manager cannot update the image
> due to a remaining reader, another update happens (e.g., as a result
> of incoming xfr or DDNS). Delaying one such update is itself bad, and
> if more updates are coming the situation may become more catastrophic.
Do we want to support AXFR from the memory image anyway? Doing it from the
database might be enough, and then we wouldn't need to worry about this problem.
We need to look into the database for IXFR anyway (since the differences are not
in memory).
Delaying multiple updates would be, IMO, the same as delaying one. Once the
manager starts updating, it must always apply all the changes available at that
point, because there may have been multiple updates since the last one we applied
anyway.
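
Something along these lines in the manager's update loop (the names are made up,
this is only to show the coalescing I mean):

    def run_update(segment, pending):
        """Apply everything queued so far in one pass over the image."""
        if not pending:
            return
        while pending:
            segment.apply(pending.pop(0))   # hypothetical: apply one queued diff
        segment.publish()                   # hypothetical: tell the auths to switch

So whether one update or ten were delayed, the auths still see a single switch.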
> > We may consider having three versions of the files, instead of two. One is being
> > updated, one is being switched to and one is being released.
>
> I suspect we can't solve this issue by increasing versions as another
> update(s) may be required even when the third (or N-th for however
> large N) version still has a reader.
I don't want to increase the number to 3 because of XFR, but to increase
throughput for auth updates. There's some (nontrivial) time needed for the IPC to
happen and for the auths to switch over to the new image, which blocks two images
at once (one auth may still be on the old one while another is already on the new
one). During that time, the manager is doing nothing.
> One other thing I forgot to mention in the previous message: I guess
> we can avoid having "release mapping". If I understand it correctly
> it's only for graceful removal of a segment user/consumer.
> Gracefulness is good, but as long as we have a timely notification
> mechanism when the user disappears, we do not necessarily rely on it.
We can't. The graceful shutdown is a nice side effect, but the real reason is to
eliminate a race condition on the switch during an update.
The fact that auth asked for the new version does not mean it has stopped using
the old one yet. I don't think we want to unmap the old one first and only then
ask for the new one, because between the time we do that and the time the answer
comes, we couldn't handle queries. And if we keep the old one mapped until the
answer comes, there's a window between the manager sending the answer and auth
handling it, so the manager can't know when it is really safe to release the old
image.
So, the release is actually an ACK that the new image is in use and the old one
no longer is.
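
Roughly this sequence on the auth side (the message names and helpers here are
made up for illustration, not the actual protocol):

    # 1. ask for the new version; the old image still serves queries meanwhile
    session.group_sendmsg({"command": "get_new_version"}, "memmgr")
    answer, env = session.group_recvmsg()
    # 2. map the new file and switch query handling over to it
    new_image = map_segment(answer["file"])   # hypothetical helper
    serve_from(new_image)                     # hypothetical helper
    old_image.unmap()
    # 3. the "release" message is the ACK; only now does the manager know
    #    the old file has no readers left
    session.group_sendmsg({"command": "release", "file": old_image.name}, "memmgr")

Without that last message, the manager would have to guess when the old file can
be reused.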
With regards
--
Hello, I'm an extension to the infamous signature virus, called spymail.
Could you please copy me into your signature and send back what you were
doing last night between 10pm and 3am?
Michal 'vorner' Vaner