[bind10-dev] Persistence vs. shared memory

Michal 'vorner' Vaner michal.vaner at nic.cz
Sun Sep 4 18:33:31 UTC 2011


Hello

It seems we already decided to go with the shared memory model, so this might be
off-topic, but out of curiosity I asked the friend I mentioned about
persistence. He sent me some lecture notes mentioning work of Sleator and Tarjan
(unfortunately without an exact reference). After reading through them, I can
think of three ways of implementing persistence for the DNS tree, so I'm
sharing them just in case someone might be interested.

• Shared memory: We talked about it a lot; my friend mentioned that updates
  may be slow because copy-on-write works on whole pages (4k, 64k, or 1M,
  depending on the mood of the kernel at the time).

• Fat nodes: Keep two copies of each pointer in each node. At the beginning,
  both pointers in each pair point to the same thing. A lookup is told to use
  either the 0th or the 1st pointer set.

  Usually, lookups use the 0th set. When an update (or a batch of updates)
  arrives, the changes are made on the 1st set. After the update is done,
  lookups switch to the 1st set (which has the new data wherever an update
  happened and is the same as the 0th elsewhere), and the update thread brings
  the 0th copy up to date with the 1st, deleting all unneeded nodes and such.
  Then lookups are switched back to the 0th set and another update may come.

  The disadvantage is that it takes a lot more memory. On the other hand, it
  looks quite simple to implement and shouldn't cause any slowdown on the
  lookup side.
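  To make the fat-node idea concrete, here is a toy Python sketch on a plain
  binary search tree: every node stores both pointer sets, readers follow the
  active set, writers modify the inactive one, and a commit flips the index and
  re-synchronizes the stale set. All names (FatNode, FatTree, commit) are
  illustrative, not anything from the BIND 10 code.

```python
class FatNode:
    def __init__(self, key):
        self.key = key
        # children[s][0] = left child in pointer set s, children[s][1] = right
        self.children = [[None, None], [None, None]]

class FatTree:
    def __init__(self):
        self.root = [None, None]   # one root pointer per pointer set
        self.active = 0            # the set lookups currently use

    def find(self, key):
        s = self.active
        node = self.root[s]
        while node is not None:
            if key == node.key:
                return node
            node = node.children[s][0 if key < node.key else 1]
        return None

    def insert(self, key):
        """Update only the inactive pointer set; lookups are undisturbed."""
        s = 1 - self.active
        if self.root[s] is None:
            self.root[s] = FatNode(key)
            return
        node = self.root[s]
        while key != node.key:
            d = 0 if key < node.key else 1
            if node.children[s][d] is None:
                node.children[s][d] = FatNode(key)
                return
            node = node.children[s][d]

    def commit(self):
        """Publish the updated set, then make the stale set identical again."""
        old = self.active
        self.active = 1 - old            # lookups now see the new data
        self.root[old] = self.root[self.active]
        stack = [self.root[old]]
        while stack:
            n = stack.pop()
            if n is None:
                continue
            n.children[old] = list(n.children[self.active])
            stack.extend(n.children[old])
```

  Note the batch semantics: inserts are invisible to lookups until commit()
  flips the active index, which matches the "switch, then re-synchronize the
  other copy" step described above.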

• Propagated copy: When updating a node somewhere in the tree, all nodes on the
  path to the root are copied and made to point to the new data (but they keep
  pointing to the original data where they leave the path). To avoid copying
  the root n times in an update batch of n items (and copying a bunch of other
  nodes many times), we can add a "generation" field to each node. If a node's
  generation is lower than the current one, we copy it and set the current
  generation; if it is the same, we just update it in place, because we created
  it in this very batch.

  After the update is done, the root is replaced by the new one, and once all
  lookup threads have switched, we can traverse the original and the new tree
  in tandem to discard the unused nodes (we don't need to enter areas where the
  nodes are the same ‒ there the whole subtrees must be the same).

  If the generation counter overflows, the whole tree is traversed and the
  generations are reset. But with a 32-bit generation counter, that would
  happen quite seldom.

  This way, an update still costs O(m·log n) (where n is the number of nodes in
  the tree and m the size of the update batch), the memory usage is smaller
  than with a complete second set of pointers, and there should be no slowdown
  on the lookup side (at least in the RAM model).
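  The path-copying step above can be sketched as follows, again on a toy binary
  search tree in Python. The generation field makes sure each node on a root
  path is copied at most once per batch; the old root remains a valid,
  searchable snapshot, and subtrees off the copied path are shared. All names
  here are made up for illustration.

```python
class Node:
    def __init__(self, key, left=None, right=None, gen=0):
        self.key, self.left, self.right, self.gen = key, left, right, gen

class PathCopyTree:
    def __init__(self):
        self.root = None   # published root, read by lookups
        self.gen = 0       # generation of the batch being built

    def find(self, key, root=None):
        """Search the published root, or any older root snapshot."""
        node = self.root if root is None else root
        while node is not None:
            if key == node.key:
                return node
            node = node.left if key < node.key else node.right
        return None

    def _fresh(self, node):
        """Copy `node`, unless it was already created in this batch."""
        if node.gen == self.gen:
            return node    # created in this batch: safe to mutate in place
        return Node(node.key, node.left, node.right, self.gen)

    def insert_batch(self, keys):
        """Insert a whole batch, then publish the new root in one swap."""
        self.gen += 1
        new_root = self.root
        for key in keys:
            new_root = self._insert(new_root, key)
        self.root = new_root

    def _insert(self, node, key):
        if node is None:
            return Node(key, gen=self.gen)
        node = self._fresh(node)           # copy-on-write along the path
        if key < node.key:
            node.left = self._insert(node.left, key)
        elif key > node.key:
            node.right = self._insert(node.right, key)
        return node
```

  The sketch omits the tandem-traversal cleanup and the generation-overflow
  reset; in Python the garbage collector reclaims the unused old-path nodes
  anyway, while a C++ version would do the cleanup walk described above.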

• Hybrid tag-change approach: Each node has its payload and pointers as usual,
  but also one "change" field. The change field can hold a modification of any
  one of the pointers (so it has a place for the new pointer value and a place
  to say which pointer, if any, is the changed one).

  So, when a lookup operates on the original, unchanged data, it ignores the
  change field. If it operates on the new data and wants to traverse a pointer,
  it first checks whether the change field holds a change to that pointer; if
  so, it uses the value there, otherwise it uses the usual pointer.

  During an update, if we want to change a pointer and the change field is
  still empty, we put the new value into it. If it is already used, we create a
  new copy of the node, with the change from the change field and our new
  change both merged into it (therefore leaving the change field empty in the
  new copy). Then we propagate the new address of this node to the parent,
  again checking whether its change field is full or empty.

  After the update is done, we can switch all the lookup threads to the new
  version and do a cleanup (remove unused versions of nodes, merge the changes
  back into the pointers). Then the lookup threads can start ignoring the
  change fields again (for performance reasons, we might have two versions of
  the lookup function, one with the change checking and one without ‒ actually,
  the compiler should be clever enough to generate both for us if we switch at
  the entry point).

  The advantage here is that we do O(1) write operations per updated datum
  (amortized; it can be proven with a suitable potential function), which puts
  less strain on the CPU's cache-coherence machinery and does fewer
  allocations than the previous approaches.
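  A toy Python sketch of the tag-change mechanics, once more on a binary search
  tree: each node carries one spare change slot (which child changed, and to
  what); a new-version lookup consults the slot, an old-version lookup ignores
  it, and a node is copied only when its slot is already occupied by a change
  to the other pointer. Names are illustrative, not BIND 10 code.

```python
LEFT, RIGHT = 0, 1

class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.child = [left, right]
        self.change = None          # None, or (direction, new_child)

def get_child(node, d, new_version):
    """Follow child pointer `d`, honouring the change slot if asked to."""
    if new_version and node.change is not None and node.change[0] == d:
        return node.change[1]
    return node.child[d]

def set_child(node, d, new_child):
    """Record a pointer change; return the node the parent should now
    reference (the same node, or a fresh copy if the slot was occupied)."""
    if node.change is None or node.change[0] == d:
        node.change = (d, new_child)
        return node
    # Slot holds a change to the other pointer: make a copy with that
    # change merged in and an empty slot, then apply our change directly.
    fresh = Node(node.key, node.child[0], node.child[1])
    fresh.child[node.change[0]] = node.change[1]
    fresh.child[d] = new_child
    return fresh

def find(root, key, new_version):
    node = root
    while node is not None:
        if key == node.key:
            return node
        node = get_child(node, LEFT if key < node.key else RIGHT, new_version)
    return None

def insert(root, key):
    """Insert and return the new root; the address change propagates
    upward only where change slots overflow into copies."""
    if root is None:
        return Node(key)
    if key == root.key:
        return root
    d = LEFT if key < root.key else RIGHT
    child = get_child(root, d, new_version=True)
    new_child = insert(child, key)
    if new_child is child:
        return root                 # nothing changed below us
    return set_child(root, d, new_child)
```

  The cleanup pass (merging change slots back into the plain pointers after
  readers switch) is left out of the sketch, but it would be a simple tree
  walk over the new version.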

As a side note, these versions are simplified ‒ the original work expected an
arbitrarily long history of updates to be kept and a general graph/pointer
structure. We don't need the full machinery.

So, just in case anybody finds it interesting.

With regards

-- 
I have a theory that it's impossible to prove anything, but I can't prove it.

Michal 'vorner' Vaner
