[bind10-dev] Resend: Re: proposed system design and inter-process protocol for shared memory

JINMEI Tatuya / 神明達哉 jinmei at isc.org
Fri Mar 8 19:17:24 UTC 2013


At Fri, 8 Mar 2013 10:05:54 +0100,
Michal 'vorner' Vaner <michal.vaner at nic.cz> wrote:

> > > What strange errors you mean? Anyway, if part of the data may be
> > 
> > Such as this one: http://bind10.isc.org/ticket/1937 (to be very
> > precise, though, it was not clear whether it happened on the unix
> > domain socket or on the UDP or TCP DNS socket; the problem had somehow
> > gone).
> 
> There's nothing suggesting that the problem described in 1937 had anything to do
> with CC or unix-domain sockets or sockets at all. The server just kept crashing
> from time to time.

True, as I noted myself, but we never really figured out how those
crashes happened, so we cannot be sure they were unrelated to the UNIX
domain socket.

> I did look at the backtrace at the linked email, but that core is obviously from
> the unclean shutdown at the end of logging, not from one of the crashes (looking
> at the code, it should have logged CC_WRITE_ERROR and these were logged many
> times at the end of the log pasted there ‒ it seems msgq terminated too soon, so
> all the other processes that wanted to send anything got a broken pipe, which is
> bad (msgq terminating too soon), but not related).

I don't understand this part... some lower-level exceptions are
propagated to the top level of b10-auth, which then terminates the
process, and if an exception isn't caught at all it can result in a
core dump.  Is that what you mean by the core?  Other than that I
don't see how it's related to an "unclean shutdown".

> > > lost, we are doomed anyway, because it does _not_ preserve record
> > > boundaries. So, we may have lost part of one message, header of
> > > another, and we don't know how the rest is long.
> > 
> > That's the case in your model.  If we ensure synchronization outside
> > of the msgq reliability by explicit timeout and resend, we can
> > re-establish the state.  I'm not saying this to propose
> > re-introduction of application-side retransmission, but I'd like to
> > reserve the possibility until we can determine it's sufficiently
> > exceptional in actual experiments.
[...]
> If you're talking about recovering from lost connection, then AFAIK it was a
> design decision in earlier times that we would not attempt so. That is the
> reason why msgq is „core“ component and we shut down if we lose it.

I talked about recovering from a lost connection or, more generally,
from errors in socket operations on the unix domain socket to msgq.
Hmm, I don't remember that we made such a design decision.  I remember
the decision to allow msgq to be a single point of failure, but that
doesn't mean another module must die when it sees an error on the unix
domain socket to msgq.  At the very least, such a policy, if it
exists, is not applied explicitly and comprehensively in the
implementation.  For example, xfrin catches Python's socket.error and
keeps going, just logging it:

                logger.error(XFRIN_MSGQ_SEND_ERROR, XFROUT_MODULE_NAME, ZONE_MANAGER_MODULE_NAME)
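
(The surrounding pattern is roughly the following; this is a
simplified sketch from memory with placeholder names, not the exact
xfrin code:)

    try:
        # notify the other modules over the CC session (msgq)
        session.group_sendmsg(msg, XFROUT_MODULE_NAME)
    except socket.error:
        # an error on the msgq socket is merely logged;
        # xfrin keeps running instead of terminating
        logger.error(XFRIN_MSGQ_SEND_ERROR, XFROUT_MODULE_NAME,
                     ZONE_MANAGER_MODULE_NAME)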

> I believe the decision was made because between the reconnect, there could be
> any number of lost messages. Keeping the system consistent without race
> conditions and deadlocks would be very hard if at all possible. Who could think

One thing I actually remember is what we did for distributing sockets
(from sockcreator via init/boss to end modules).  The information
about who uses which sockets must be consistent throughout the system,
and (although I don't remember whether it was the final decision or a
temporary measure until we decide) we explicitly abort the module:

    } catch (const exception& e) {
        // Any other kind of exception is fatal. It might mean we are in
        // inconsistent state with the b10-init/socket creator, so we abort
        // to make sure it doesn't last.
        LOG_FATAL(logger, SRVCOMM_EXCEPTION_ALLOC).arg(e.what());
        abort();

But I don't remember such a decision for the communication on the CC
bus (msgq), and, as the above example of xfrin shows, we don't really
follow that decision fully if it exists.  I'm not necessarily opposed
to making that decision now, with the reservation that we'll still see
whether the errors are rare enough, but in that case it should be more
explicit and enforceable at the API level: instead of leaving it to
each module whether to catch and handle the generic socket.error, the
CC library/module should either abort() the module immediately after
an error is detected, or throw a separate exception that cannot be
"handled" (it can only be caught at the highest level of the app to
terminate it gracefully).
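
To illustrate what I mean by the latter, here's a minimal sketch with
made-up names (not a concrete code proposal):

    import socket

    class SessionFatalError(Exception):
        """Raised by the CC library on an unrecoverable socket error.

        Modules are not expected to "handle" it; at most it is caught
        at the top level of the application to terminate gracefully.
        """

    class Session:
        def __init__(self, sock):
            self._socket = sock

        def sendmsg(self, msg):
            try:
                self._socket.sendall(msg)
            except socket.error as e:
                # convert the generic socket.error into a CC-level
                # fatal error, so the policy is enforced in one place
                # rather than left to each module
                raise SessionFatalError("CC send failed: %s" % e)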

> of all the possible things that could happen and handle them? Our main goal is
> to write DNS server, not work around problems in IPC system.

I see some point here, but BIND 10 is not just a set of DNS servers.
Even ignoring the DHCP servers, my understanding is that BIND 10 is an
infrastructure framework for core Internet services, in which IPC
plays a key role.  Ideally we could outsource the IPC sub-framework so
we could concentrate on higher-level things, but I thought we accepted
that we'd need to maintain that part ourselves, too, when we gave up
on adopting dbus.  And, (un)fortunately, in a group as small as ours
the same set of developers needs to work both on higher-level
applications like DNS servers and on system infrastructure such as
IPC.

> > IMO, cases like this are one example where the difference between
> > production-quality systems and student's hobby projects reside.  In
> > extreme environments like "carrier-grade" operations, the reality
> > always betrays our naive imagination of what could happen.  So,
> > assumptions like "it should be really short and should be acceptable"
> > aren't acceptable to me if we want to develop something professional.
> 
> For some reason, it reminds me of the history of Multics. I was told it was so
> well designed it was impossible to finish in reasonable time. Then, the simple
> and ugly hack of Unix (that was, officially, a system for editing text and the
> original purpose can be seen there from time to time too) came and it was much
> more successful.

Unless it was an attempt to use cynicism to dismiss someone's opinion,
IMO it's irrelevant to this discussion, because such things are
generally a case-by-case matter.  Whether or not that's the reason
Multics wasn't deployed but Unix was, we have to decide what we should
do by assessing the importance of our specific issue.  Sarcasm aside,
I guess what you tried to say is that this is a minor issue and that
the BIND 10 system can be of professional use even if we assume this
problem won't happen in the toughest environments.  I disagree (and I
disagree because, through bug reports for BIND 9 in severe production
environments, I've seen so many things that "shouldn't happen"
actually happen), but I understand it's a YMMV type of issue.  It's
something on which we should try to reach group consensus.

> I'm OK with fixing the problem of blocking send on the message bus if I ever see
> it happen (or anybody) in case nobody sends SIGSTOP to msgq. Until then, I'll
> more worry about unhandled short reads/writes which I still sometimes find in
> the CC code ‒ yes, the quality of code there is that poor. That's not that I
> would not like the code to be perfect, that's about priorities.

It's less likely that we can see it ourselves, because this is the
type of thing we only learn about when we are embarrassed by a report
from professional users that "b10-auth strangely drops some queries
sometimes".  If we were not talking about introducing msgq
improvements now, this could be discussed separately.  But since we
are actually discussing that particular topic, I'd argue now is the
time to do it.
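
(As an aside, the blocking-send problem I worry about is essentially
that a plain blocking write can stall the whole process when msgq
stops reading; a non-blocking send that also handles short writes,
roughly sketched below with made-up names, would address both
concerns.  A simplified sketch, not the actual CC code:)

    import errno
    import socket

    def try_send(sock, data):
        """Send as much of 'data' as possible on a non-blocking socket.

        Returns the number of bytes actually sent; the caller buffers
        the rest instead of blocking the whole process.
        """
        sent = 0
        while sent < len(data):
            try:
                sent += sock.send(data[sent:])  # may be a short write
            except socket.error as e:
                if e.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
                    # would block: stop here rather than stall the process
                    break
                raise
        return sent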

> > > Hmm, right. But that might be only single timer in the manager (time
> > > out and if the file is not yet freed, kill everyone still having it
> > > active). But that should be long enough, what if XfrOut is just
> > > sending the zone out and needs to finish that?
> > 
> > Maybe we need some keep alive mechanism to distinguish this case from
> > real errors.  But the example of long-standing xfrout raises other
> > issues.  It's possible that, while the manager cannot update the image
> > due to a remaining reader, another update happens (e.g., as a result
> > of incoming xfr or DDNS).  Delaying one such update is itself bad, and
> > if more updates are coming the situation may become more catastrophic.
> 
> Do we want to support AXFR from the memory image anyway? Doing it from database
> might be enough and we wouldn't need to worry about this problem. We need to
> look into the database anyway for IXFR (since the differences are not in the
> memory).

True, so one other option is to prohibit the use of the shared-memory
data source image for such possibly time-consuming operations.  In any
case, it should be okay to exclude this case from the initial
implementation.

> > I suspect we can't solve this issue by increasing versions as another
> > update(s) may be required even when the third (or N-th for however
> > large N) version still has a reader.
> 
> I don't want to increase the number to 3 because of XFR, but to increase
> throughput for auth updates. There's some (nontrivial) time for the IPC to
> happen and the auths to switch over to the new one, blocking two images at once
> (one auth may be still on the old, another on the new one). At this time, the
> manager is doing nothing.

Ah, okay, but I'd assume this migration is quick enough.  If we use
three copies, it means we could need to have all three copies in
memory, even if only temporarily.  So I'm inclined to begin with the
two-copy approach.

> > One other thing I forgot to mention in the previous message: I guess
> > we can avoid having "release mapping".  If I understand it correctly
> > it's only for graceful removal of a segment user/consumer.
> > Gracefulness is good, but as long as we have a timely notification
> > mechanism when the user disappears, we do not necessarily rely on it.
[...]
> So, the release is actually an ACK that the new one is used and the old one
> no more.

Ah, okay.  I didn't realize it was an ACK to the mapping notification.
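
So, just to confirm my understanding of the two-copy flow, the
exchange would be roughly like this (only a sketch of the message
flow; the command names and fields are placeholders, not a protocol
proposal):

    # manager -> each reader (e.g. b10-auth), after building the new copy
    {"command": "segment_update", "generation": 2, "path": "/zone-image.2"}

    # reader -> manager, once it has remapped to the new copy; this
    # "release" doubles as the ACK that generation 1 is no longer in use
    {"command": "segment_release", "generation": 1}

    # only after collecting the release/ACK from every reader can the
    # manager reuse the generation-1 copy for the next update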

---
JINMEI, Tatuya
Internet Systems Consortium, Inc.

