[bind10-dev] Proposal for a BIND 10 shared cache resolver
Michal 'vorner' Vaner
michal.vaner at nic.cz
Fri Feb 22 09:00:07 UTC 2013
Hello
On Fri, Feb 22, 2013 at 08:44:07AM +0100, Carsten Strotmann wrote:
> the attached idea on how a BIND 10 resolver with a shared cache could be
> implemented came to my mind today. Please excuse me if this has already
> been discussed somewhere, or if it is total bogus.
>
> The idea stretches the DNS protocol quite a bit, but at least for me it
> has some charm.
I admit it looks elegant in a way. But I'm afraid it would have quite poor
performance. Why so? Well, even putting the communication over IP/pipe/whatever
aside, on each query to a cache you render the query message and the cache has
to parse it. These are quite expensive operations, and you may query several
caches for one lookup, and do several lookups for a single query from the user.
If I compare that with just computing a hash of (name, type) and looking into a
hash table, as you could with direct access to the cache memory, the difference
might be quite huge.
Also, we may need a way to tell the difference between „The cache doesn't know
this record“ and „The cache knows the record doesn't exist“.
Though I didn't measure the difference.
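To illustrate what I mean by the direct access (just a sketch with invented
names, not real BIND 10 code), the whole per-lookup cost could be one hash and
one table probe, and the three-way answer above falls out naturally:

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    // Sketch only; invented names, not real BIND 10 code.
    enum class CacheResult { HIT, KNOWN_NEGATIVE, UNKNOWN };

    struct Key {
        std::string name;   // canonical (lower-cased) owner name
        uint16_t type;      // RR type
        bool operator==(const Key& other) const {
            return type == other.type && name == other.name;
        }
    };

    struct KeyHash {
        size_t operator()(const Key& k) const {
            // One cheap hash over (name, type)
            return std::hash<std::string>()(k.name) * 31 + k.type;
        }
    };

    struct Entry {
        bool negative;          // true: we know the record doesn't exist
        std::string rrset_data; // placeholder for the actual RRset
    };

    typedef std::unordered_map<Key, Entry, KeyHash> DirectCache;

    // The whole per-lookup cost is one hash and one table probe;
    // no message rendering or parsing anywhere.
    CacheResult lookup(const DirectCache& cache, const Key& key, Entry& out) {
        const DirectCache::const_iterator it = cache.find(key);
        if (it == cache.end()) {
            return CacheResult::UNKNOWN;    // the cache doesn't know this record
        }
        out = it->second;
        return it->second.negative ? CacheResult::KNOWN_NEGATIVE
                                   : CacheResult::HIT;
    }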
But my idea was something slightly similar. There would be one or more caches
available to each resolver, arranged in a hierarchy (from the closest but
smallest to the bigger but further away ones). All the caches are mapped into
the address space of the resolver, but all except the closest one are read-only.

Now, when we need to look into the cache, we start with the closest one and
check if the record is there. If yes, we win and are done. If not, we try the
next bigger cache, and continue until we either find it or run out of caches.
If we don't find it anywhere, we query upstream. Either way, we end up with the
record (or know it doesn't exist) and ask all the caches below the one it was
found in to update themselves (so the far-away caches are updated less often
than the closest ones).
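In (hypothetical) code, the walk through the hierarchy and the back-filling
could look roughly like this; the names are made up, and the shared levels
would of course not be written directly, but via the update channel described
below:

    #include <cstdint>
    #include <string>
    #include <vector>

    struct Record {
        std::string data;   // stand-in for the real RRset
    };

    // Stand-in for the actual recursive lookup to the authoritative servers.
    Record query_upstream(const std::string&, uint16_t) {
        return Record();
    }

    // One level of the hierarchy.  For the shared (read-only mapped) levels,
    // store() would go through the update/RPC channel instead of touching
    // the memory directly.
    class Cache {
    public:
        virtual ~Cache() {}
        virtual bool find(const std::string& name, uint16_t type,
                          Record& out) const = 0;
        virtual void store(const std::string& name, uint16_t type,
                           const Record& record) = 0;
    };

    // caches[0] is the private L1, then L2, L3, ... further and further away.
    Record resolve(const std::vector<Cache*>& caches,
                   const std::string& name, uint16_t type) {
        Record record;
        for (size_t i = 0; i < caches.size(); ++i) {
            if (caches[i]->find(name, type, record)) {
                // Found at level i: let every closer cache update itself.
                for (size_t j = 0; j < i; ++j) {
                    caches[j]->store(name, type, record);
                }
                return record;
            }
        }
        // Ran out of caches: ask upstream and tell every level about the answer.
        record = query_upstream(name, type);
        for (size_t j = 0; j < caches.size(); ++j) {
            caches[j]->store(name, type, record);
        }
        return record;
    }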
As the L1 cache would be private to the resolver, it would be directly
writable, so updates would be simple. The shared ones would be notified through
some kind of RPC mechanism (shared memory or a pipe or whatever) and would
update themselves (as they'd probably be separate processes). The data
structure would be persistent, so an update wouldn't need locking (just
swapping pointers so the top pointer points to the newest version, while the
old versions keep working). From time to time, each resolver would report the
newest version it knows about, and the cache could delete the old versions
nobody cares about any more.
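The reader/updater split could then be as simple as this (a rough sketch; all
names are invented and the actual persistent data structure is left out):

    #include <algorithm>
    #include <atomic>
    #include <cstdint>
    #include <vector>

    // One published generation of the cache.  The real persistent lookup
    // structure (sharing unchanged nodes with older generations) would hang
    // off of this.
    struct CacheVersion {
        uint64_t generation;
        // ... persistent data structure lives here ...
    };

    // The only thing the updater touches when publishing.  Readers never
    // take a lock; they just follow the current top pointer.
    std::atomic<CacheVersion*> top_version(nullptr);

    // Reader side: grab a snapshot and use it.  It stays valid even if
    // newer versions get published meanwhile.
    CacheVersion* reader_snapshot() {
        return top_version.load(std::memory_order_acquire);
    }

    // Updater side: publish a freshly built version with one pointer swap.
    void publish(CacheVersion* fresh) {
        top_version.store(fresh, std::memory_order_release);
    }

    // Garbage collection: each resolver reports the newest generation it has
    // seen (and, by then, has dropped any older snapshots), so everything
    // older than the minimum of those reports can be reclaimed.
    uint64_t oldest_generation_in_use(const std::vector<uint64_t>& reported) {
        // one entry per resolver; assumed non-empty
        return *std::min_element(reported.begin(), reported.end());
    }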
Some guesses now:
• If the L1 cache is small (1 MB), it fits into the processor cache (and is
therefore very fast), but it could still get something like an 80% hit rate. I
believe Jinmei claimed something like 80% of queries are for the top 10 names,
or something like that.
• The further caches could be larger (let's say an L2 shared by a small number
of resolvers, up to 100 MB each, and a machine-global L3 shared by all of them,
a few GB). The top-level cache would probably be large enough that records time
out faster than they'd need to be evicted because of size.
• As the top-level cache would not be very busy, we could put some clever logic
onto it without slowing down the rest of the system. It could be things like
synchronization between machines in a cluster or some pre-fetching of records
that will time out soon.

For that, I'd imagine some kind of protocol over TCP. I'm reluctant to use
DNS messages, since their design was aimed at keeping them as small as
possible and we'd like to avoid the CPU overhead of that. On the LAN between
the cluster machines, network bandwidth is cheap (and with support for jumbo
frames and DMA, even the handling in the kernel would be quite cheap). So I
think we'd want to provide more information than what is in a DNS packet
(pre-computed hashes, for example) and not do any compression; a rough sketch
of such an update message follows after this list. I'd really avoid UDP,
since OS network stacks are optimised for TCP more than for UDP and we don't
want to lose messages.

If we sent all updates going to the L3 to the other L3s on different machines,
they would propagate nicely, and the smaller caches would receive the data
from them once a query for them arrives.
• The caches could do some expiration of timed-out records and other clean-ups
while idle (when there are no updates waiting, for example).
• For NUMA, I don't think we need to care too much. My guess is that the kernel
knows better than us, and maybe even better than the administrator, how the
memories are connected to the CPUs. If we leave it to do its magic (and provide
a design it can take advantage of), it can do pretty well, since it can migrate
processes as it likes and migrate physical pages of memory (or even copy them
to several NUMA nodes). Also, I don't think the cost of accessing another CPU's
memory is anywhere near as high as that of network communication, so even if we
didn't care at all, it might turn out the difference is negligible.
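The update message mentioned above could carry something like this (a rough
sketch, all field names invented):

    #include <cstdint>
    #include <string>
    #include <vector>

    // The point is that the sender ships a pre-computed hash and the
    // uncompressed data, so the receiving cache can insert the record
    // without any DNS-style parsing or name decompression.
    struct CacheUpdate {
        uint64_t key_hash;          // hash of (name, type), computed once by the sender
        std::string owner_name;     // full owner name, no name compression
        uint16_t rrtype;
        uint16_t rrclass;
        uint32_t ttl;
        bool negative;              // "we know this doesn't exist"
        std::vector<uint8_t> rdata; // raw RDATA, copied as-is
    };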
With regards
--
BOFH Excuse #452:
Somebody ran the operating system through a spelling checker.
Michal 'vorner' Vaner