[bind10-dev] new in-memory zone data design

Wed Jun 20 22:22:21 UTC 2012

At Wed, 20 Jun 2012 11:29:55 +0200,
Michal 'vorner' Vaner <michal.vaner at nic.cz> wrote:

> > I've described a proposed design of the memory-efficient version of
> > in-memory zone data in the following wiki page:
> > http://bind10.isc.org/wiki/InMemoryZoneDesign
> > (there's a link from http://bind10.isc.org/wiki/SystemComponents)

The biggest one first:

> However, my biggest concern is, this looks complicated like hell. I don't know
> if we have time for all this logic and if we won't get lost in it.

I agree it's not super simple, but I'm not sure specifically which
part you think was hell.  Possibilities are:

1. how to encode names in tree nodes
2. how to encode RDATA
3. how to maintain and use of shortcut

If we don't like #1, we could store all part of the name instead of
its portion (store "www.example.org." even if the corresponding node
is just for "www").  That will make it less memory efficient, but will
make some other processing simpler.

If we don't like #2, we might make it more straightforward like the
current RdataFields implementation does.  It will also be less memory
efficient, though.

Maybe the most complicated part is #3.  If we don't like it, we could
give up making shortcut additional processing and just encode all
RDATA names in the encoded RRset, and use the generic logic
implemented in auth::Query.  This will make query handling
significantly slower, but if we are okay with that, that's one
solution, and it's likely we can do it int shorter development time.

As for the necessary development time, my gut feeling is that it's not
that huge as it might look.  Our current implementation is actually
quite close to the revised version in terms of code logic, and I
myself already did some experiments using the data structure very
similar to this.  So, for example, I'm reasonably confident that I can
complete a prototype version of it in 1 week.  Of course, for a full
development with detailed tests and in the form of team tasks, it will
be much longer.  My gut feeling is that it's a 2-3-sprint feature, but
as usual it can be wrong.

Overall, it's a decision between performance and development time.
I personally think the benefit of the full feature is worth the
development time based on my gut-feeling estimation, but of course
different people have different opinions.

One possible way forward is for me to actually spend one week to see
whether I can make a workable prototype to show it's actually possible
or to realize it's actually too difficult.

Another is to try task breakdown and estimation for these first.

Also we can at least discuss the plan in the next biweekly call.

> I've had a look and I have few questions:
>  * It seems to be taken as a fact that it'll be a red-black tree. Is there a

Actually, not so.  Design-wise, the most of the part doesn't depend on
the fact that the underlying tree is red-black (some structures are
named "RB", but they don't depend on the fact that the tree is
red-black).  As long as it's a tree-of-tree, subtreeing to represent
subdomains, it should be basically applicable (if we change it to a
completely different structure such as a big hash table, that'll be a
totally different story).  It's just that our current implementation
was ported from BIND 9, using red-black tree, and I thought it's
faster for development to reuse the maintenance code of it (mainly
about rotation).

>    reason for that? Are we able to reuse our implementation? If not, we might
>    want to consider different kinds of trees (2-3 tree, 2-4 tree or AVL), since
>    they can have benefits we might like.

So, algorithm-wise, it doesn't have to be a red-black tree as long as
we can afford the development time.  But as I already commented in
trac ticket #2054, I suspect the constant-factor difference on the
lookup performance doesn't buy much in terms of the overall query
handling performance, i.e., qps.  If we want to make it really fast,
we need to improve additional section handling and name compression
(and minimize other overhead in rendering) substantially, and in a non
trivial way.  If we heavily rely on many tree lookups for additional
section handling, for example, the constant factor may matter.  But
the end result will be much slower anyway, compared to implementations
that has more fundamental optimization such as using shortcut
pointers.

I'm also not so sure if the update overhead is so marginal.  I've
heard an environment that relies on very high frequent dynamic
updates (I've never heard how frequent though).  Also, "updating"
includes the initial build, so it affects start-up time, too (unless
we use something like mmap rendered memory image), and, unlike the
query handling case, the tree update complexity would be much more
dominant part of building the entire zone.

>  * What about memory management? What we do with a space where we deleted
>    something? Just keep track of how much „wasted“ space we have and then
>    recreate the structure when too suboptimal?

I'm not sure if I understand the question.  The initial version will
just use the C++ new/delete.  So a deleted space won't be a waste
(modulo fragmentation problems with new/delete).  If and when we
successfully switch to a shared-memory version, we use a dedicated
allocator for the shared memory region (we can begin with the one that
comes with boost managed_shared_memory initially, and if it's not
efficient we can then think about an in-house version).  These
allocators will also recycle de-allocated memory region.  The
different from the normal new/delete is that we'll eventually hit the
originally allocated space for the shared memory segment.  At that
point we'll need to do a non trivial task - reallocating a completely
new space then copy the image, etc.  I've not fully thought about
these advanced cases, but is such a case your question?

>  * Again, when we delete something and something else pointed to it, what
>    happens?

With reference counting, we cannot delete anything as long as
something points to it.

>  * Is it possible to deduce the name from the node only, or do we need the
>    path?

It's not possible without the path in the initially proposed version.
Depending on a tradeoff between memory efficiency and processing
efficiency, there can be other choices.

---
JINMEI, Tatuya
Internet Systems Consortium, Inc.