[bind10-dev] Resend: Re: proposed system design and inter-process protocol for shared memory
Michal 'vorner' Vaner
michal.vaner at nic.cz
Mon Mar 11 08:27:34 UTC 2013
Good morning
On Fri, Mar 08, 2013 at 11:17:24AM -0800, JINMEI Tatuya / 神明達哉 wrote:
> At Fri, 8 Mar 2013 10:05:54 +0100,
> > I did look at the backtrace in the linked email, but that core is obviously from
> > the unclean shutdown at the end of logging, not from one of the crashes (looking
> > at the code, it should have logged CC_WRITE_ERROR, and these were logged many
> > times at the end of the log pasted there ‒ it seems msgq terminated too soon, so
> > all the other processes that wanted to send anything got a broken pipe, which is
> > bad on its own, but not related).
>
> I don't understand this part...some lower level exceptions are
> propagated to the top level of b10-auth and then it terminates the
> process, and if the exception isn't caught it can result in a core
> dump. Is that what you mean by core? Other than that I don't see how
> it's related to "unclean shutdown".
Yes, I mean the core dump.
My impression is that in the ticket there's some exception and we never
figured out which one. It probably happened during the handling of
queries, or in ASIO, or, really, anywhere.
Then there was the email linked from the ticket. I think it describes a
problem completely unrelated to the one in the ticket. There, msgq
terminated (or dropped dead, I don't know). As a result, all the auths
crashed after getting EPIPE on read/write on the unix domain socket. What
I wanted to say is, it is msgq (us) who screwed it up, not the underlying
unix domain socket.
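To illustrate (a minimal, standalone sketch of that failure mode, not
our actual code):

import errno
import socket

# Once the peer end of a unix domain stream socket is gone (msgq
# terminating), a write on the surviving end fails with EPIPE.
ours, msgq_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
msgq_side.close()  # msgq "drops dead"
try:
    ours.sendall(b"message to msgq")
except socket.error as e:  # BrokenPipeError on Python 3
    assert e.errno == errno.EPIPE
    print("broken pipe, just like the auths got")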
> > If you're talking about recovering from a lost connection, then AFAIK it was a
> > design decision in earlier times that we would not attempt that. That is the
> > reason why msgq is a „core“ component and we shut down if we lose it.
>
> I talked about recovering from a lost connection, or, more generally,
> errors in socket operations on the unix domain socket to msgq.
> Hmm, I don't remember us making such a design decision. I remember the
> decision of allowing msgq to be a single point of failure. But that
> doesn't mean that when another module sees an error on the unix domain
> socket to msgq it must die. At the very least, such a policy, if it
> exists, is not applied explicitly and comprehensively in the implementation. For
> example, xfrin catches Python socket.error and keeps going with just
> logging it:
>
> logger.error(XFRIN_MSGQ_SEND_ERROR, XFROUT_MODULE_NAME, ZONE_MANAGER_MODULE_NAME)
I don't remember that decision being made either. I just remember it
being mentioned by someone, probably Shane, during some talk at an F2F
meeting some 2.5 years ago. Since then, I've considered it general
knowledge, and I believe much of the code is written with that
assumption.
I also remember someone asking why we don't restart msgq. The answer was
that if we did, we wouldn't know whether any messages were lost in
between and whether the system had become inconsistent. Thinking about
it, the same problem applies to a single module losing its connection ‒
while it was disconnected, it might have missed several messages,
configuration updates for example. That would be bad and we would need
to reinitialize it internally anyway.
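Something like this is what reconnecting would force on a module (a
sketch only; the names are hypothetical, not an actual API of ours):

def reconnect_and_resync(session, rebuild_internal_state):
    # Anything sent while we were gone (config updates included) is
    # lost for good, so cached state can't be trusted: after getting a
    # new connection, we have to re-fetch everything and start over.
    session.connect()                   # new connection to msgq
    session.group_subscribe("Auth")     # re-subscribe to our groups
    config = session.get_full_config()  # hypothetical call
    rebuild_internal_state(config)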
As for the xfrin example, I don't know how compelling that argument is.
A few months ago, I was removing a bit of strange code from it:
try:
    import pydnspp
except ImportError:
    pass  # We should keep running even if we don't find the libraries
I guess the part with MSGQ_SEND_ERROR is a similar thing.
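The pattern there is roughly this (a sketch from memory with
approximate names, not the literal xfrin source):

import socket

def notify_zonemgr(session, logger):
    # The send is attempted, a failure is logged, and then execution
    # simply continues as if the message had arrived.
    try:
        session.group_sendmsg({"command": "xfrin done"}, "Zonemgr")
    except socket.error:
        logger.error("failed to notify via msgq; carrying on anyway")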
> But I don't remember the decision for the communication in the CC bus
> (msgq), and, as the above example of xfrin shows, we really don't
> fully follow that decision if it exists. I'm not necessarily opposed
> to making that decision now, with the reservation that we'll still see
> whether the errors are rare enough, but in that case it should be more
> explicit and enforceable at the API level; instead of leaving it to
> each module whether to catch and handle the generic socket.error, the
> CC library/module should abort() the module immediately after an error
> is detected, or throw a separate exception that cannot be "handled"
> (one that can only be caught at the highest level of the app to
> gracefully terminate it).
I'd be for clarifying it, because we don't recover from the errors
anyway (the XfrIn example is obviously wrong if that error ever happens,
since it is more or less ignored ‒ logging it is not a solution). We
definitely don't reconnect there or anywhere else.
I'd just say we don't need to abort on all errors ‒ EINTR is quite safe to
ignore, for example.
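For illustration, this is the distinction I mean (a sketch, not our
actual CC library code):

import errno
import socket

def careful_sendall(sock, data):
    # EINTR only means a signal interrupted the call, so retrying is
    # safe; anything else is a real error and should propagate (and,
    # under the policy above, take the module down).
    while True:
        try:
            return sock.sendall(data)
        except socket.error as e:
            if e.errno == errno.EINTR:
                continue
            raise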
> > of all the possible things that could happen and handle them? Our main goal is
> > to write a DNS server, not to work around problems in the IPC system.
>
> I see some point here, but BIND 10 is not just a set of DNS servers.
> Even ignoring DHCP servers, my understanding is that BIND 10 is an
> infrastructure framework for core Internet services, where IPC takes a
> key role. Ideally we could outsource the IPC sub-framework so we can
> concentrate on higher level things, but I thought we accepted we'd
> need to maintain that part ourselves, too, when we gave up adopting
> dbus. And, (un)fortunately, in such a small group as us the same set
> of developers need to work both on higher level applications like DNS
> servers and on the system infrastructure such as IPC.
I know we need the IPC to work, and we would need that even if our only
goal were the DNS, because the DNS wouldn't work without it.
What I mean is: if we abort the module on disconnect, it gets
reinitialized and everything is OK (not optimal, but working). If we go
with recovery, we are likely to get it wrong, because it would be
complex. It would also take a long time, and I don't think that would be
time well invested.
Also, ideally, the IPC mechanism would be self-sufficient. If you're
going to say an application must learn to recover, then you put the
burden onto each application. The IPC should be handled in msgq and the
libraries; the applications should be able to just use it and not care
too much.
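To show what I imagine by that (a hypothetical sketch, not the current
library):

import errno
import socket

class CCSessionError(Exception):
    # The one exception applications are not supposed to handle; only
    # the top level of a module catches it, to shut down cleanly and
    # get restarted.
    pass

class CCSession:
    def __init__(self, sock):
        self._sock = sock

    def sendmsg(self, data):
        while True:
            try:
                return self._sock.sendall(data)
            except socket.error as e:
                if e.errno == errno.EINTR:
                    continue  # handled inside the library, invisibly
                raise CCSessionError("lost connection to msgq")

That way the decision is made once, in the library, and not again in
every application.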
> > For some reason, it reminds me of the history of Multics. I was told it was so
> > well designed it was impossible to finish in a reasonable time. Then the simple
> > and ugly hack of Unix (which was, officially, a system for editing text, and the
> > original purpose can still be seen there from time to time) came along and was
> > much more successful.
>
> If that was not an attempt to apply cynicism to disregard someone's
> opinion, IMO it's irrelevant to this discussion, because such things
> are generally decided on a case-by-case basis. Whether or not that's
> the reason why Multics wasn't deployed but Unix was, we have to decide
> what we should do by assessing the importance of our specific issue.
> Removing the sarcasm, I guess what you tried to say is that this is a
> minor issue and the BIND 10 system can be of professional use even if
> we assume this problem won't happen in the toughest environments. I
> disagree (and I disagree because, through bug reports for BIND 9 in
> severe production environments, I've seen so many things that
> "shouldn't happen" actually happen), but I understand it's a YMMV type
> of issue. It's something on which we should try to reach group
> consensus.
I admit there was a bit of cynicism, but it was not just to disregard
opinions; I wanted to illustrate what I mean.
What I generally fear with BIND 10 is that it tries to solve many issues
by introducing more complexity. I'm not really a fan of that approach,
because with more complexity you get more bugs. There's a saying that
you can make a thing so simple there is obviously no error, or so
complex that there is no obvious error.
A DNS server is complex by itself, and there will be bugs, in tough
environments and elsewhere. It is sad; we can try hard not to write
buggy code, but AFAIK all attempts to do so have failed everywhere. We
are just not capable of handling everything; there's no manpower for
that. What I want to say is, I believe this failure to be rare enough
that it is a low-priority issue. And, with a bit of my cynicism (I won't
try calling it realism), I think it'll stay lower priority than some
other issues we have, until it actually happens to someone.
> It's less likely we can see it ourselves, because this type of thing
> is something we only know happens when we are embarrassed by getting a
> report from professional users that "b10-auth strangely drops some
> queries sometimes". If we were not talking about introducing msgq
> improvements now, this could be discussed separately. But since we are
> actually discussing that particular topic, I'd argue it's the time to
> do it.
As I said, the IPC system is in such a poor state that this is, IMHO, a
much lower-priority issue than many others. I'd much rather clarify the
semantics, make sure it is easily usable so the applications don't grow
additional bugs just because the IPC is so cumbersome, etc. And there
are tests to be written (there's even one for stress tests). It may
happen that these tests show we really do need a sending queue. But I'd
rather see the tests done than blindly go and implement the send queue.
> > Do we want to support AXFR from the memory image anyway? Doing it from the
> > database might be enough, and then we wouldn't need to worry about this problem.
> > We need to look into the database anyway for IXFR (since the differences are
> > not in memory).
>
> True, so one other option is to prohibit the use of shared-memory data
> source image for such possibly time-consuming operations. In any
> case, it should be okay to exclude this case from the initial
> implementation.
From the implementation, yes. But I don't think we want to exclude it
from the design.
> > I don't want to increase the number to 3 because of XFR, but to increase
> > throughput for auth updates. There's some (nontrivial) time for the IPC to
> > happen and for the auths to switch over to the new image, which blocks two
> > images at once (one auth may still be on the old one, another already on the
> > new). During this time, the manager is doing nothing.
>
> Ah, okay, but I'd assume this migration is quick enough. If we use
> three copies, it means we could possibly need to have all three copies
> in memory, even if only temporarily. So I'm inclined to begin with the
> two-copy approach.
I think it could be a configuration option, turned off by default, or
something like that. And yes, we can begin with the two-copy one. It was
just an idea.
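For completeness, the two-copy switch-over as I understand it (a
hypothetical sketch; the names are made up, not the actual manager
interface):

class SegmentManager:
    def __init__(self):
        self.images = [dict(), dict()]  # stand-ins for the two segments
        self.live = 0                   # index of the copy the auths use

    def publish_update(self, apply_changes, notify_auths):
        standby = 1 - self.live
        apply_changes(self.images[standby])  # rebuild the off-line copy
        self.live = standby
        notify_auths(standby)
        # Until every auth acknowledges the switch, both copies are in
        # use at once and the manager sits idle; that is the gap a third
        # copy (as an off-by-default option) would fill.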
With regards
--
Never underestimate the bandwidth of a station wagon full of HDDs.
Michal 'vorner' Vaner