[bind10-dev] Performance of Various Receptionist Designs

Shane Kerr shane at isc.org
Tue Jul 13 11:50:22 UTC 2010


All,

On Mon, 2010-07-12 at 11:49 +0100, Stephen Morris wrote:
> I've made the first set of measurements with both client and server
> processes running on the same machine.  The results can be found in
> the attachment to ticket 245 (http://bind10.isc.org/ticket/245);
> comments on both methodology and results are invited.

Interesting!

Thanks for the code, Stephen.

I ran the benchmarks on my laptop here (eventually remembering to turn
off CPU scaling), and the results were different from what you came up
with. I didn't do fully rigorous tests, but the numbers seem to be
more-or-less consistent across multiple runs.

I used a packet size of 256 and a count of 65536 queries.
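
For reference, the shape of the timing loop I have in mind is roughly
the following. This is my own sketch, not Stephen's ticket 245 code;
send_burst() and recv_burst() are hypothetical stand-ins for the real
socket I/O:

    // Sketch of a burst-oriented client timing loop (illustration only).
    #include <chrono>
    #include <cstddef>
    #include <initializer_list>
    #include <iostream>

    const std::size_t PACKET_SIZE = 256;    // packet size parameter used above
    const std::size_t COUNT       = 65536;  // queries per measurement

    // Hypothetical stand-ins for the real socket I/O.
    void send_burst(std::size_t /*n*/, std::size_t /*size*/) { /* write one burst */ }
    void recv_burst(std::size_t /*n*/)                       { /* read the replies */ }

    // Average wall-clock time per query, in seconds.
    double time_per_query(std::size_t burst) {
        const auto start = std::chrono::steady_clock::now();
        for (std::size_t sent = 0; sent < COUNT; sent += burst) {
            send_burst(burst, PACKET_SIZE);  // fire off a whole burst...
            recv_burst(burst);               // ...then wait for all the answers
        }
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        return elapsed.count() / COUNT;      // e.g. 1.3e-05 for the direct server
    }

    int main() {
        for (std::size_t burst : {1, 8, 32, 64, 128, 256}) {
            std::cout << burst << " -> " << time_per_query(burst) << std::endl;
        }
    }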

Running the "server" version (answering directly) I get the following
breakdown by burst size:

  1 -> 4.18581e-05 (roughly 42 usec/query)
  8 -> 1.32895e-05 (roughly 13 usec/query)
 32 -> 1.34901e-05 (roughly 13 usec/query)
 64 -> 1.28801e-05 (roughly 13 usec/query)
128 -> 1.27999e-05 (roughly 13 usec/query)
256 -> 1.23392e-05 (roughly 12 usec/query)

Running the "receptionist" version (passing off to a worker) I get the
following breakdown by burst size:

  1 -> 6.53568e-05 (roughly 65 usec/query)
  8 -> 3.17896e-05 (roughly 32 usec/query)
 32 -> 3.66329e-05 (roughly 37 usec/query)
 64 -> 3.50105e-05 (roughly 35 usec/query)
128 -> 3.50873e-05 (roughly 35 usec/query)
256 -> 3.51004e-05 (roughly 35 usec/query)

Running the "intermediary" version (passing back-and-forth to a
contractor) I get the following breakdown by burst size:

  1 -> 7.35345e-05 (roughly 74 usec/query)
  8 -> 4.26593e-05 (roughly 43 usec/query)
 32 -> 4.0748e-05  (roughly 41 usec/query)
 64 -> 4.17246e-05 (roughly 42 usec/query)
128 -> 4.01508e-05 (roughly 40 usec/query)
256 -> 3.94077e-05 (roughly 39 usec/query)

So, what we end up with is something like:

server       @ about 13 usec/query
receptionist @ about 35 usec/query
intermediary @ about 41 usec/query

Now, this is on a dual-core machine. We might see different results on
a quad-core machine; depending on exactly how data flows between the
processes, it may be possible to keep 3 cores busy (client,
receptionist, and worker, for example). In that case the receptionist
may look more like the server in terms of delay, although overall
system CPU usage will be higher.

In terms of absolute numbers, this is the difference between roughly
77k operations per second and 29k operations per second on my laptop.
This is exactly the kind of performance difference that I was
expecting, and what made me nervous about the receptionist model.
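
(Those figures are just the reciprocals of the per-query times; a
trivial check, using my rounded numbers:)

    // Sanity check of the throughput figures quoted above.
    #include <cstdio>

    int main() {
        const double server_sec_per_query       = 13e-6;  // ~13 usec/query
        const double receptionist_sec_per_query = 35e-6;  // ~35 usec/query

        // Throughput is just the reciprocal of the time per query.
        std::printf("server:       %.0f ops/sec\n", 1.0 / server_sec_per_query);       // ~77000
        std::printf("receptionist: %.0f ops/sec\n", 1.0 / receptionist_sec_per_query); // ~28600
    }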

A different implementation would probably give different absolute
numbers, but the Boost IPC library is a shared-memory based
communication mechanism that should be reasonably efficient.
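
I'm assuming "Boost IPC" here means Boost.Interprocess, where the
receptionist-to-worker hand-off is essentially a shared-memory message
queue. A minimal sketch of that pattern (my own illustration, not the
benchmark code; the queue name and message layout are made up) looks
something like:

    // Illustration of a shared-memory hand-off with Boost.Interprocess.
    // Queue name and message layout are invented for this sketch.
    #include <boost/interprocess/ipc/message_queue.hpp>
    #include <cstddef>
    #include <cstring>

    using namespace boost::interprocess;

    const char* const QUEUE_NAME   = "b10_recept_to_worker";  // hypothetical
    const std::size_t MAX_MSG_SIZE = 256;                     // one packet

    // Receptionist side: copy a received packet into the shared-memory queue.
    void receptionist_forward(const char* packet, std::size_t len) {
        message_queue mq(open_or_create, QUEUE_NAME, 1024, MAX_MSG_SIZE);
        mq.send(packet, len, 0);
    }

    // Worker side: block until a packet arrives, then process it.
    void worker_loop() {
        message_queue mq(open_or_create, QUEUE_NAME, 1024, MAX_MSG_SIZE);
        char buffer[MAX_MSG_SIZE];
        message_queue::size_type recvd;
        unsigned int priority;
        for (;;) {
            mq.receive(buffer, sizeof(buffer), recvd, priority);
            // ... build and send the answer here; in the "intermediary"
            // case the result would go back over a second queue ...
        }
    }

    int main(int argc, char** argv) {
        // Run "./ipc_sketch worker" in one terminal and "./ipc_sketch"
        // in another (illustration only).
        if (argc > 1 && std::strcmp(argv[1], "worker") == 0) {
            worker_loop();
        } else {
            const char packet[MAX_MSG_SIZE] = "dummy query";
            receptionist_forward(packet, sizeof(packet));
        }
    }

Each hop through a queue like that is at least one copy into shared
memory plus a wake-up of the other process, which is presumably where
most of the extra ~20 usec per query goes.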

Let's discuss further, but I am leaning away from the receptionist model.

--
Shane



