[bind10-dev] Persistence vs. shared memory

Florian Weimer fw at deneb.enyo.de
Tue Sep 6 18:51:49 UTC 2011


* Michal Vaner:

> Hello
>
> On Mon, Sep 05, 2011 at 07:41:39PM +0200, Florian Weimer wrote:
>> > It seems we already decided to go with the shared memory model, so
>> > it might be off-topic, but I asked the friend I mentioned, about
>> > persistence, just out of curiosity. He sent me some lecture notes
>> > mentioning work of Sleator and Tarjan (but it doesn't contain exact
>> > reference, unfortunately). After reading through them, I can think
>> > of three ways of implementing the persistence for the DNS tree, so
>> > I'm sharing it just in case someone might be interested.
>> 
>> I'm not sure what this is about exactly, but there's also RCU
>> (a variant of persistence).
>
> Sorry, I forgot to mention, we talked about how we will do updates in the
> in-memory data source for storing authoritative data.

Ah, okay.  So you probably need small and quick updates for dynamic
DNS, and atomically applied large updates for incoming AXFR or IXFR.

I would start with something like Ingo Molnar's brlocks: each reader
locks its own thread-specific lock, and the writer locks all of them
around writing.  Under their lock, readers look up two data structures
in order, the first one usually empty.  For small updates, the writer
modifies the second data structure directly, under the write lock.
For large updates which must be applied atomically, the writer
prepares an overlay which contains the new data and masks lookups
into the changing part of the second data structure.  It installs the
overlay into the first data structure under the write lock, updates
the second data structure (without holding the lock), uninstalls the
overlay under the lock, and deallocates it.  The second locking
operation can be avoided by delaying deallocation until the next
update.
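The scheme could be sketched roughly as below.  This is a minimal,
hypothetical C++ illustration, not BIND 10 code: the names
(BrLockedStore, kReaderSlots, Table) are invented, a plain hash map
stands in for the DNS tree, and a real implementation would need a
data structure that tolerates concurrent reads of the unchanged part
while the writer rebuilds it (a hash map can rehash, so this sketch
is only safe as shown when the writer phases are not concurrent with
readers).

```cpp
#include <array>
#include <memory>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

using Table = std::unordered_map<std::string, std::string>;

constexpr int kReaderSlots = 4;  // one lock slot per reader thread (assumed fixed)

class BrLockedStore {
public:
    // Reader: lock only its own slot, then consult the overlay first,
    // falling back to the main table (the overlay is usually empty).
    std::optional<std::string> lookup(int reader_id, const std::string& name) {
        std::lock_guard<std::mutex> guard(slots_[reader_id]);
        if (overlay_) {
            auto it = overlay_->find(name);
            if (it != overlay_->end()) {
                return it->second;
            }
            // In the full scheme the overlay would also mask lookups into
            // the changing part of the main table; omitted for brevity.
        }
        auto it = main_.find(name);
        if (it == main_.end()) {
            return std::nullopt;
        }
        return it->second;
    }

    // Writer, small update: take every reader slot, mutate in place.
    void small_update(const std::string& name, const std::string& value) {
        auto locks = lock_all();
        main_[name] = value;
    }

    // Writer, large atomic update: install an overlay under the lock,
    // update the main table without holding any lock, then uninstall.
    void large_update(const Table& new_data) {
        auto overlay = std::make_unique<Table>(new_data);
        {
            auto locks = lock_all();
            overlay_ = std::move(overlay);  // install overlay under the lock
        }
        // Readers that hit changed names are served from the overlay
        // while the main table is being rebuilt here.
        for (const auto& kv : *overlay_) {
            main_[kv.first] = kv.second;
        }
        {
            auto locks = lock_all();
            overlay_.reset();  // uninstall; delaying this to the next
                               // update would save one locking pass
        }
    }

private:
    std::array<std::unique_lock<std::mutex>, kReaderSlots> lock_all() {
        std::array<std::unique_lock<std::mutex>, kReaderSlots> locks;
        for (int i = 0; i < kReaderSlots; ++i) {
            locks[i] = std::unique_lock<std::mutex>(slots_[i]);
        }
        return locks;
    }

    std::array<std::mutex, kReaderSlots> slots_;
    Table main_;                      // the "second" data structure
    std::unique_ptr<Table> overlay_;  // the "first", usually empty
};
```

The per-slot mutexes are what make an uncontended reader lock cheap;
the writer pays by taking every slot, which accounts for the brief
global pause.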

Unless BIND 10 has to run on certain increasingly outdated large
multiprocessors, I'm confident that this will be a good start.  On
modern CPUs, taking a non-contended lock which is in cache and not
shared is cheap.  Apart from that, it is very portable because it does
not require multi-word CAS.  It's also fairly easy to implement.  On
the downside, update performance is not as good as it could be with
more effort, and there is a global pause (on the order of
microseconds).

> RCU in the strict sense copies everything, which is too expensive,
> considering the data might be several GB.

The RCU technique I was alluding to uses quite a few tricks to bring
down the cost of reaching a safepoint.  I suspect that many
persistence-based techniques mostly shift pressure to the memory
allocator.

My understanding is that lock-free algorithms are complex and only a
win if you have both reads and writes, and that it's quite difficult
to get very good performance under those conditions, no matter what
you do.
