[bind10-dev] Thinking aloud: Multi-CPU auth

Michal 'vorner' Vaner michal.vaner at nic.cz
Mon Dec 20 19:18:38 UTC 2010


Hello

On Mon, Dec 20, 2010 at 09:56:51AM -0800, Jerry Scharf wrote:
> My personal experience is that shared memory becomes almost mandatory 
> when trying to have many accessors against the same data. It would seem 
> that you might  want to go to a multiple reader, single updater model as 
> well. (Auth asking the datastore fronting cache or recursive asking the 
> primary cache.)

Well, if you fork(), all the memory is shared until one of the processes writes
something there. That is what I meant by getting it for free ‒ if none of the
worker processes writes there (and they shouldn't), only reads, they all keep
the same mapping, which is equivalent to shared memory (you just arrived at the
situation in a different way).

If the parent does updates, it gets copies of the changed pages; once it forks
again, everything is shared again.

> I also do not consider a 2, 4 or even 8 core system to be that large. To 
> satisfy the needs of an ISP for recursive or a registry for auth 
> services, we are talking more like 16-24 cores today, more in the 
> future. If the system has good scaling, this could extend even farther 
> over time (48 cores is the top right now in a simple server.) I would 
> personally recommend that there be a 24 core system (2x12 core opteron 
> or 3x8core xeon) be acquired for testing. Cache control and memory 
> bandwidth becomes a critical issues at these levels.

The 8 cores was just an example; surely it can be more. But the point was that
having 8 copies of a huge zone is too much, and having 48 copies is even worse.
So we don't want that, or rather, we can't have that. The point was that we need
to solve it somehow, and the advantage of fork() is that we get it solved as a
side effect.

As for the testing, I'd probably wait; we don't even have single-core
performance at any reasonable level yet. I believe that once we hunt the code
down to minimise cache misses/memory fetches, it will have an even greater
effect on more cores, as the memory bandwidth is partially shared between the
cores.

And I think that with oprofile (sorry for doing so much advertising for it, I
just like it) and instruction-based sampling (a nice AMD trick), we can get
pretty good data on any system (I'm sitting at a quad-core one right now). If it
is NUMA, even better, because you can see things like fetches from the memory of
a different northbridge (a pity my computer is UMA, with only one northbridge,
but those are a lot cheaper), together with the exact instruction that caused
them.

Maybe I should run some experiments sometime (I'm trying to write some
cache-oblivious algorithms for my thesis, so having some tests and showing the
data here wouldn't hurt, to illustrate what the effect is).

Anyway, this is a little bit off-topic (for the thread, not for the list). If
you want to talk about that, we can start a different thread.

Have a nice day

-- 
Commenting perl code is useless. You have to fully parse it anyway to find comments.

Michal 'vorner' Vaner