[bind10-dev] proposed design of scalable in-memory zone loading/updating

JINMEI Tatuya / 神明達哉 jinmei at isc.org
Fri Jun 22 18:13:19 UTC 2012


At Fri, 22 Jun 2012 10:25:47 +0200,
Michal 'vorner' Vaner <michal.vaner at nic.cz> wrote:

> > The main purpose is to allow asynchronous (background) loading and
> > releasing, but the proposal introduces more generalized framework so
> > it can also be used for shared-memory type architectures (like the
> > case of the other proposal, at the risk of offering something
> > over-engineered).

Thanks for the feedback.  All good points.

> I have a few comments there. I see that we'll need something like this in the
> end, and I prefer doing it the final way right away rather than having 3
> refactors on the way, so I think it is better.
> 
> However:
> • The „MemoryEventManager“ seems to suggest there's a main-loopish thing hidden
>   inside and that it'll dispatch events somehow.

First, it's "MemoryEventHandler", not manager.  And the event loop is
supposed to be outside of the handler.  So, I'm not really sure about
your concern, but if you mean the naming is confusing (even after
clarifying it's not named "manager"), I'm open to suggestions.  I'm
not very good at naming things.

> • I'm a little bit worried about auth pushing the updates to the in-memory
>   clients. When we switch to the new data sources, the in-memory clients would
>   be effectively hidden from auth and there'd be multiple ones (as caches for
>   multiple data sources). I think we'll need it the other way around ‒ each
>   in-memory client contacts the manager and asks for data for the zones which
>   should be inside (and, possibly, updates).

Yeah, I realized that and considered several other options for the
cleanest (or least ugly) design.  The current one is a kind of
compromise.

- I wanted to keep the client working as synchronously as possible.  If
  the in-memory client directly relies on the memory manager (via the
  event handler), whether that manager is within auth or an external
  process, the client needs to manage blocking events itself, like
  loading large zones directly or communicating with an external
  memory manager.
- auth cannot be completely free from housekeeping anyway: it has to be
  aware of both the (list of) clients and the memory events, manage the
  periodic event invocation on the local event handler (or the subthread
  for it, if we use a thread), and handle incoming commands from an
  external manager process.  That's because it's auth that has to keep
  responding to queries, manages the internal event loop (IOService),
  and manages the config/command sessions.

(BTW, as noted in the wiki page, while auth may seem to deal directly
with in-memory clients, that's just for simplicity.  It actually handles
a higher level abstraction of "client list" (without knowing exactly
what is in there), and this design should work with it.)

So, the basic idea of the current design is:

- auth knows it needs a ClientList and (possibly) a MemoryEventHandler,
  and knows the former (possibly) needs segment information the latter
  provides.  But auth doesn't care about their further internal
  details - both will be created from the datasrc config by some
  factory.  So auth is basically just a transparent message forwarder
  between them.
- MemoryEventHandler doesn't know who actually uses the memory
  segments it generates, and, especially in the case of the shmem or
  mmap variants, it's mostly stateless - it just opens the corresponding
  memory segment based on the parameters associated with each event and
  returns it to the caller.
- As described above, InMemoryClient is designed to be lightweight and
  free from any blocking operations.  So the caller (which in practice
  is auth) doesn't have to worry that a call to a client method might
  block and suspend query processing.
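
To make the division of labor a bit more concrete, here is a very rough
C++ sketch of what I have in mind (all class and method names here are
purely illustrative, not the actual interfaces described on the wiki):

  #include <vector>

  // Illustrative only: an opaque handle to a (local or shared) memory segment.
  struct MemorySegment;

  // The handler only produces segments; it doesn't know who consumes them.
  class MemoryEventHandler {
  public:
      virtual ~MemoryEventHandler() {}
      // Called periodically (or on command) from auth's event loop; returns
      // any segments that became ready since the previous call.
      virtual std::vector<MemorySegment*> handleEvents() = 0;
  };

  // The client list only consumes segments; it doesn't know how they were built.
  class ClientList {
  public:
      virtual ~ClientList() {}
      virtual void updateSegments(
          const std::vector<MemorySegment*>& segments) = 0;
  };

  // auth is just a transparent forwarder between the two; both objects would
  // be created from the datasrc configuration by some factory.
  void
  authPeriodicCallback(MemoryEventHandler& handler, ClientList& clients) {
      const std::vector<MemorySegment*> ready = handler.handleEvents();
      if (!ready.empty()) {
          clients.updateSegments(ready);
      }
  }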

Regarding the fact that InMemoryClient is hidden inside ClientList,
what I expected (not described in the wiki) is:

- we give each data source (effectively each entry of ClientList) a
  name: "sqlite3" (or "sqlite3-A/B/C" if there are many sources based
  on sqlite3), "text", etc.
- MemoryEventHandler actually gives a tuple of (data source name,
  table segment, list of zone segments) to auth (or a list of these
  tuples).
- auth passes the tuple to ClientList.  ClientList identifies the
  corresponding list item from the "name" element of the tuple, and
  passes the table and zone segments to the associated InMemoryClient
  (if any).

Here, I assume ClientList knows the "cache" is an instance of
InMemoryClient, so this part is not polymorphic.  But I suspect that,
since the concept of a cache is quite special, it makes more sense for
ClientList to deal with InMemoryClient explicitly.
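
Just to illustrate that idea in code (again, all names and types below
are hypothetical, not the actual API):

  #include <cstddef>
  #include <list>
  #include <string>
  #include <vector>

  struct MemorySegment;                   // opaque handle to a memory segment

  // The concrete cache implementation; the method below is made up just to
  // show the intent of the call.
  class InMemoryClient {
  public:
      void resetSegments(MemorySegment* /*table_segment*/,
                         const std::vector<MemorySegment*>& /*zone_segments*/) {
          // ...rebuild the zone tables from the given segments...
      }
  };

  // What MemoryEventHandler would hand to auth: one entry per data source.
  struct SegmentUpdate {
      std::string datasrc_name;            // e.g. "sqlite3-A", "text"
      MemorySegment* table_segment;
      std::vector<MemorySegment*> zone_segments;
  };

  // Inside ClientList: each entry knows its name and (optionally) its cache.
  struct ClientListEntry {
      std::string name;
      InMemoryClient* cache;               // NULL if this source has no cache
  };

  // auth just passes the update through; ClientList finds the entry by name
  // and hands the segments to the concrete InMemoryClient directly (i.e.,
  // non-polymorphically).
  void
  applySegmentUpdate(std::list<ClientListEntry>& entries,
                     const SegmentUpdate& update) {
      for (std::list<ClientListEntry>::iterator it = entries.begin();
           it != entries.end(); ++it) {
          if (it->name == update.datasrc_name && it->cache != NULL) {
              it->cache->resetSegments(update.table_segment,
                                       update.zone_segments);
              break;
          }
      }
  }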

But in any event, I see this part of the design will require more
consideration.  There may be a better approach or something I simply
missed.  Specific suggestions are welcome.

> • Would it be better to just return the continuation object after some amount
>   of work, no matter whether it is IXFR or AXFR (because there can be a small
>   AXFR and a large IXFR, and we will need to keep track of how much was done
>   in the chunk anyway)?

Hmm, in the case of IXFR (or DDNS, or in general "update", not
"replacement"), I thought the local handler would make the update
directly to the "in-use" data (whereas the shmem-based memory manager
would keep two copies and always make modifications to the unused
one).  So this update process cannot be done incrementally; otherwise
queries arriving in the middle of the update could be answered with a
partially updated version of the zone.

I knew there could be cases where a large set of "incremental" updates
would suspend query handling for too long.  In the first design
proposal, my assumption was that if we need to handle this, we should
(develop and) use the shared-memory based memory management.  But if we
need a middle-ground solution, one possibility is to let the handler
know the size of the update (e.g., the number of RRs in the IXFR diff)
and allow the handler to switch to a full reload (making a separate
copy and building it incrementally) if the update is deemed to be too
big.

Aside from these difficulties, however, always doing it asynchronously
may make sense; then the caller-side code can be unified.
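
In code, the middle-ground idea could look roughly like this (the
threshold value, class, and method names are made up purely for
illustration):

  #include <cstddef>

  // Hypothetical threshold: diffs larger than this many RRs are considered
  // too big to apply to the in-use data while we keep answering queries.
  const size_t INPLACE_UPDATE_RR_LIMIT = 10000;   // arbitrary number

  // Illustrative interface only.
  class ZoneDataUpdater {
  public:
      // Apply the diff directly to the in-use zone data (blocking, but short).
      void applyDiffInPlace() { /* ... */ }
      // Build a separate copy incrementally in the background and swap it in
      // when complete, as in the full-reload (AXFR) case.
      void scheduleFullReload() { /* ... */ }
  };

  void
  handleIncrementalUpdate(ZoneDataUpdater& updater, size_t diff_rr_count) {
      if (diff_rr_count <= INPLACE_UPDATE_RR_LIMIT) {
          updater.applyDiffInPlace();
      } else {
          // The diff is too big: fall back to a full reload so that query
          // handling isn't suspended and queries never see a half-applied
          // zone.
          updater.scheduleFullReload();
      }
  }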

> • Would releasing really be time consuming? Wouldn't just dropping the memory
>   segment be enough? That would be fast.

I'm afraid it can be time consuming to a non-negligible degree if the
data is built locally, as we'll need to go through the entire tree and
the node data, releasing the corresponding memory chunks one by one.
Judging from our experience with BIND 9, that could be a 10-second-ish
task.

...or, perhaps you are assuming we use process-local, exclusive memory
segments preallocated by mmap or something?

> But I'm most worried about the continuation objects (which would be,
> effectively, coroutines). That could get really tricky (because we'll need to
> remember state somehow, etc). In short, I consider working with continuation
> objects a really interesting part of programming, but it might be better to
> avoid it. So I'm wondering two things:
> • Would it make more sense to launch a thread? Because the thread would do no
>   communication with the rest of the app, it would be safe and we wouldn't
>   need much locking (only to check if it has ended already). Then we could do
>   the loading with more straightforward, sequential logic.

Hmm, possibly.  And, considering that the memory handling part of the
local handler is actually emulating an external "manager", it may be a
better instantiation.
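
For the pure replacement (full reload) case, the thread version could
indeed be quite simple; a minimal sketch, assuming C++11-style threads
and made-up names (in practice we'd probably use our own thread
wrapper):

  #include <atomic>
  #include <thread>

  // Hypothetical loader: it builds the new zone data into a private segment
  // and touches nothing shared with the main auth thread, so no locking is
  // needed while it runs.
  class BackgroundLoader {
  public:
      BackgroundLoader() : done_(false) {}

      void start() {
          thread_ = std::thread(&BackgroundLoader::load, this);
      }

      // The main loop only needs this cheap check; once it returns true, auth
      // can join the thread and swap the newly built data in.
      bool finished() const { return done_.load(); }

      void join() { thread_.join(); }

  private:
      void load() {
          // ...load the zone(s) into a private memory segment...
          done_.store(true);
      }

      std::atomic<bool> done_;
      std::thread thread_;
  };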

One obvious concern is the "update" case (see the IXFR discussion
above).  If the updating thread modifies the zone data being used by
the main auth thread (which is responding to queries), the contention
will cause a disaster.  We should (probably) also think about the case
where multiple replacement requests happen at the same time (which may
or may not be an issue).

At this moment I'm not fully confident about which one is better in
terms of development and maintenance cost.  But that's certainly worth
considering.

> • Would it be more work to do the external manager process and shared memories
>   than the continuations? If not, we might want to support only the external
>   manager.

Actually, after completing these initial designs, I'm feeling we may
not be so far from the shmem-based architecture.  We'll still need to
build a completely separate program from scratch, implement
communication between the related components, and handle other tricky
issues with shared memory (a strategy for which zone should go to
which segment, what to do when a segment is becoming full, etc.), and
I'm afraid these shouldn't be underestimated, but I think it's also
worth considering.

That said, I guess we probably also want to keep the "local only"
memory management anyway.  While asynchronous events or thread
synchronization may cause their own stability issues, it will
generally be more stable than the approach involving shared memory and
another external process, and operators at a moderate scale may well
prefer that at the cost of using a bit more memory.

---
JINMEI, Tatuya
Internet Systems Consortium, Inc.

