[bind10-dev] Proposal for a BIND 10 shared cache resolver

Carsten Strotmann carsten at strotmann.de
Fri Feb 22 09:14:44 UTC 2013


Hi Michal,

thanks for your feedback.

Michal 'vorner' Vaner wrote:
> Hello
> 
> On Fri, Feb 22, 2013 at 08:44:07AM +0100, Carsten Strotmann wrote:
>> the attached idea on how a BIND 10 resolver with a shared cache
>> could be implemented came to my mind today. Please excuse me if
>> this has already been discussed somewhere, or if it is totally
>> bogus.
>> 
>> The idea stretches the DNS protocol quite a bit, but at least for
>> me it has some charm.
> 
> I admit it looks elegant in a way. But I'm afraid this will have
> quite poor performance.

Yes, it is quite a crazy idea.

> Why so? Well, putting the communication over IP/pipe/whatever aside,
> each query to the cache means rendering the query and parsing it on
> the cache side. These are quite expensive operations, and you
> possibly query several caches for one lookup and do several lookups
> for one query from the user. If I compare that with just computing a
> hash of (name, type) and looking into a hash table, as direct access
> to the cache memory would allow, the difference might be quite huge.

It might be interesting to do a small test to measure the difference
once the other approach (the protocol over TCP) is implemented.
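
Even before that, a toy comparison like the one below could give a
rough feeling for the per-message CPU cost. I'm using the dnspython
library here purely as a stand-in for message rendering and parsing
(it is not what the BIND 10 resolver uses internally), so the numbers
would only be a crude indication:

import timeit

import dns.message
import dns.rdatatype

NAME = "www.example.com."
KEY = (NAME, dns.rdatatype.A)

def render_and_parse():
    # Roughly the CPU cost of a wire-format query to a separate cache
    # process, ignoring the IPC/network round trip itself.
    wire = dns.message.make_query(NAME, dns.rdatatype.A).to_wire()
    dns.message.from_wire(wire)

cache = {KEY: "dummy rrset"}

def direct_lookup():
    # Roughly the cost of a direct in-memory hash-table lookup.
    return cache[KEY]

if __name__ == "__main__":
    n = 100000
    print("render + parse:", timeit.timeit(render_and_parse, number=n))
    print("direct lookup: ", timeit.timeit(direct_lookup, number=n))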

> Also, we may need a way to tell the difference between "The cache
> doesn't know this record" and "The cache knows the record doesn't
> exist".

That was part of my idea:

"The cache doesn't know this record" -> RCODE NOTAUTH
"The cache knows the record doesn't exist" -> RCODE NXDOMAIN
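
In code, the mapping I have in mind is roughly the sketch below.
CacheEntry and the helper function are made-up names, and I'm
borrowing dnspython's RCODE constants just for illustration:

from dataclasses import dataclass
from typing import Optional

import dns.rcode

@dataclass
class CacheEntry:
    # Invented placeholder for whatever the cache stores per (name, type).
    rdata: str = ""
    negative: bool = False  # True when the cache holds a cached NXDOMAIN

def cache_reply_rcode(entry: Optional[CacheEntry]) -> int:
    """RCODE the cache would put into its reply to the resolver."""
    if entry is None:
        return dns.rcode.NOTAUTH    # the cache doesn't know this record
    if entry.negative:
        return dns.rcode.NXDOMAIN   # the cache knows it doesn't exist
    return dns.rcode.NOERROR        # positive answer, records in the reply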

> 
> Though, I didn't measure the difference.
> 
> But my idea was something slightly similar. There are one or more
> caches available to each resolver, arranged in a hierarchy (from the
> closest but smallest one to the bigger but more distant ones). All
> the caches are mapped into the address space of the resolver, but
> all except the closest one are read-only.
> 
> Now, when we need to look into the cache, we start with the closest
> one and look if it is there. If yes, we win and are done. If not, we
> try the next bigger cache, and continue until we either find it or
> run out of caches. If we don't find it, query upstream. Either way,
> we have the record (or know it doesn't exist) and ask all the caches
> down from the one it was stored in to update themselves (so the
> far-away caches are updated less often than the closest ones).
> 
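Just to check that I read the lookup order correctly, a rough Python
sketch of it could look like this (every name here is invented for
illustration; the real thing would of course live in C++ and shared
memory):

from typing import Callable, Optional, Sequence

class CacheLevel:
    """One cache in the hierarchy; index 0 is the private L1."""

    def __init__(self) -> None:
        self._data: dict = {}

    def get(self, key) -> Optional[object]:
        return self._data.get(key)

    def store(self, key, value) -> None:
        # A direct write for the private L1; for the shared levels this
        # would be the RPC/notification mechanism you describe below.
        self._data[key] = value

def lookup(caches: Sequence[CacheLevel], key,
           query_upstream: Callable[[object], object]):
    """Try L1, then L2, ..., then upstream; update the closer levels."""
    for level, cache in enumerate(caches):
        answer = cache.get(key)
        if answer is not None:
            closer = caches[:level]      # only the levels below the hit
            break
    else:
        answer = query_upstream(key)     # nobody had it: ask upstream
        closer = list(caches)            # every level should learn it
    for cache in closer:
        cache.store(key, answer)
    return answer
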
> As the L1 cache would be private to the resolver, it would be
> directly writable, so updates would be simple. The shared ones would
> be notified through some kind of RPC mechanism (shared memory, a pipe
> or whatever) and would update themselves (as they'd probably be
> separate processes). The data structure would be persistent, so an
> update wouldn't need locking (just swing the top pointer to the
> newest version; the old versions still work). From time to time, each
> resolver would report the newest version it knows about, and the
> cache could delete the old versions nobody cares about any more.
> 
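The lock-free update through a persistent, versioned structure could
look roughly like this toy sketch (plain Python objects instead of
shared memory, all names invented; the writer-side lock only stands in
for whatever the single cache process would do internally):

import threading

class VersionedCache:
    def __init__(self):
        self._versions = {0: {}}       # version number -> snapshot
        self._top = 0                  # readers always start here
        self._lock = threading.Lock()  # writer side only; readers never lock

    def snapshot(self):
        """Readers grab the newest version and keep using it lock-free."""
        version = self._top
        return version, self._versions[version]

    def update(self, key, value):
        """The writer builds a new snapshot and swings the top pointer."""
        with self._lock:
            new_version = self._top + 1
            new_snapshot = dict(self._versions[self._top])
            new_snapshot[key] = value
            self._versions[new_version] = new_snapshot
            self._top = new_version    # old versions remain readable

    def reclaim(self, oldest_in_use):
        """Drop versions that no resolver reports as in use any more."""
        with self._lock:
            for version in [v for v in self._versions if v < oldest_in_use]:
                del self._versions[version]
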
> Some guesses now:
> 
> - If the L1 cache is small (1 MB), it fits into the processor cache
>   (and is therefore very fast), but it still gets something like an
>   80% hit rate. I believe Jinmei claimed something like 80% of
>   queries are for the top 10 names, or something like that.
> - The further caches could be larger (say, an L2 shared by a small
>   number of resolvers at up to 100 MB each, and a machine-global L3
>   shared by everything at a few GB). The top-level cache would
>   probably be large enough that records time out faster than they
>   get evicted because of size.
> - As the top-level cache would not be very busy, we could put some
>   clever logic into it without slowing down the rest of the system.
>   It could be things like synchronization between machines in a
>   cluster or some pre-fetching of records that time out soon.
> 
> For that, I'd imagine some kind of protocol over TCP. I'm reluctant
> to use DNS messages, since their design looks like it was aimed at
> keeping them as small as possible, and we would like to avoid CPU
> overhead there. On the LAN between the cluster machines, network
> bandwidth is cheap (with support for jumbo frames and DMA, even the
> handling in the kernel would be quite cheap). So I'd think we want to
> provide more information than what is in a DNS packet (pre-computed
> hashes, for example) and not do any compression. I'd really avoid
> UDP, since OS network stacks are optimised more for TCP than for UDP
> and we don't want to lose messages.
> 
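If it helps the discussion, one record in such an uncompressed,
hash-carrying format could look roughly like the sketch below. The
field layout is entirely made up, and CRC32 only stands in for
whatever hash the in-memory cache tables would actually use:

import socket
import struct
import zlib

def pack_cache_record(name: bytes, rtype: int, ttl: int, rdata: bytes) -> bytes:
    # Pre-computed hash of the cache key, so the receiver can index its
    # hash table without re-hashing on its side.
    key_hash = zlib.crc32(name + struct.pack("!H", rtype))
    body = (struct.pack("!IHIH", key_hash, rtype, ttl, len(name)) + name
            + struct.pack("!H", len(rdata)) + rdata)
    # Length prefix: whole records can be read off the TCP stream with
    # no DNS-style name compression and no clever parsing.
    return struct.pack("!I", len(body)) + body

def send_cache_record(sock: socket.socket, record: bytes) -> None:
    sock.sendall(record)

# Example: one uncompressed A record for www.example.com.
record = pack_cache_record(b"www.example.com.", 1, 300, bytes([192, 0, 2, 1]))
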
> If we sent all updates going into the L3 to the other L3s on
> different machines, they'd nicely propagate, and the smaller caches
> would receive the data from them once a query for them arrives.
> - The caches could do some timing-out and other clean-ups while idle
>   (when there are no updates waiting, for example).
> - For NUMA, I don't think we want to care too much. My guess is that
>   the kernel knows better than us, and maybe even better than the
>   administrator, how the memories are connected to the CPUs. If we
>   let it do its magic (and provide a design it can take advantage
>   of), it can get pretty good results, since it can migrate processes
>   as it likes and migrate physical pages of memory (or even copy them
>   to several NUMA nodes). Also, I don't think the cost of accessing
>   another CPU's memory is anywhere near as high as going over the
>   network, so even if we didn't care at all, the difference might
>   turn out to be negligible.

That sounds like a good design. I'm looking forward to the new resolver.

I need to buy a large machine with many CPUs for testing :)

Best regards

Carsten