[bind10-dev] Receptionist experiment

Michal 'vorner' Vaner michal.vaner at nic.cz
Thu Feb 28 14:06:17 UTC 2013


Hello

I did a little experiment with the receptionist approach for #2776. I slightly
modified auth and put another process in front of it.

First, a little warning. The code is not tested, does not adhere to (probably
any) coding guidelines, and is probably buggy (I've seen it segfault sometimes,
for example). There are also several limitations; for instance, it can send the
answers back to only a single client (it keeps only one address; I imagine we
would send the client address with the message to auth and auth would send it
back with the answer, so the overhead introduced by this would be minimal, I
just didn't want to tweak auth that much). It uses C and the low-level network
API, since that is easier to write as a prototype than fighting with ASIO.

Also, it probably works only on new-enough Linux systems, since I used the
sendmmsg, recvmmsg and epoll calls.
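
Roughly, the read side could look like this (a sketch only, not the actual
prototype code; the names are mine, and epoll setup and most error handling
are omitted):

#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/epoll.h>
#include <netinet/in.h>
#include <string.h>

#define BATCH  32
#define MAXLEN 512

/* Wait until the UDP socket is readable, then pull up to BATCH datagrams
 * out of the kernel with a single recvmmsg() call, remembering each
 * client's address for the reply. Returns the number of messages read,
 * or -1 on error. */
static int read_batch(int epfd, int udp_sock, struct mmsghdr *msgs,
                      struct iovec *iovs, char (*bufs)[MAXLEN],
                      struct sockaddr_in *clients)
{
    struct epoll_event ev;
    if (epoll_wait(epfd, &ev, 1, -1) < 1)
        return -1;
    for (int i = 0; i < BATCH; ++i) {
        memset(&msgs[i], 0, sizeof msgs[i]);
        iovs[i].iov_base = bufs[i];
        iovs[i].iov_len = MAXLEN;
        msgs[i].msg_hdr.msg_iov = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
        msgs[i].msg_hdr.msg_name = &clients[i];
        msgs[i].msg_hdr.msg_namelen = sizeof clients[i];
    }
    return recvmmsg(udp_sock, msgs, BATCH, MSG_DONTWAIT, NULL);
}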


Now, how it works. It reads messages from clients over UDP. For each message,
it chooses one auth server to send it to and puts it in a local queue. It uses
random() to pick for now, but deciding by the RD bit should be lightweight too
(see the sketch below). Once the local queue is large enough, it is sent over
TCP to the auth as a single batch (so we minimise the number of context
switches; that seems to be important). Currently there's no timeout, so if no
more messages come, the queued ones are stuck there forever, but that's just to
keep the experimental code simpler.
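
To illustrate why deciding by the RD bit would be cheap: RD is the lowest bit
of the third byte of the DNS header, so the check needs no real parsing at all
(a hypothetical sketch):

#include <stdint.h>
#include <stddef.h>

/* RD (recursion desired) is the low bit of the third header byte;
 * 12 bytes is the fixed DNS header size. */
static int wants_recursion(const uint8_t *msg, size_t len)
{
    return len >= 12 && (msg[2] & 0x01);
}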

The auth server reads the whole batch, produces the answers and sends them all
back at once. The receptionist then sends the whole lot with sendmmsg.
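
The reply path could then be a single syscall per batch, something like this
(again only a sketch with names of my own; it assumes the client addresses
were saved at receive time, as in the read sketch above):

#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <string.h>

/* Pair each answer with the client address saved when its query was
 * received, and push the whole batch back to the clients in one
 * sendmmsg() call. Returns the number of messages sent, or -1. */
static int send_replies(int udp_sock, struct iovec *answers,
                        struct sockaddr_in *clients,
                        struct mmsghdr *msgs, unsigned n)
{
    for (unsigned i = 0; i < n; ++i) {
        memset(&msgs[i], 0, sizeof msgs[i]);
        msgs[i].msg_hdr.msg_iov = &answers[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
        msgs[i].msg_hdr.msg_name = &clients[i];
        msgs[i].msg_hdr.msg_namelen = sizeof clients[i];
    }
    return sendmmsg(udp_sock, msgs, n, 0);
}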

The receptionist is mostly a loop around epoll, reading, copying data and
writing it back to a socket.

The auth was modified slightly. For one, the TCP side now first reads a single
4-byte length, which is the length of the whole batch. Then it reads the data
and uses the usual 2-byte per-message TCP lengths to extract the separate
messages. I also disabled the synchronous optimisation on UDP, since TCP does
not have this optimisation. It would be possible to add it, but disabling it on
UDP seemed easier and makes the comparison fairer.
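
Splitting the batch is trivial. Assuming the 4-byte batch length was already
consumed and buf holds exactly that many bytes, the extraction could look like
this (a sketch; handle() stands in for the normal message processing):

#include <stdint.h>
#include <stddef.h>

/* Walk the batch: each message carries the usual 2-byte big-endian
 * DNS-over-TCP length prefix. Returns 0 on success, -1 on a truncated
 * or malformed batch. */
static int split_batch(const uint8_t *buf, size_t total,
                       void (*handle)(const uint8_t *msg, uint16_t len))
{
    size_t pos = 0;
    while (pos + 2 <= total) {
        uint16_t len = (uint16_t)((buf[pos] << 8) | buf[pos + 1]);
        pos += 2;
        if (len > total - pos)
            return -1;
        handle(buf + pos, len);
        pos += len;
    }
    return pos == total ? 0 : -1;
}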

Now, the results, done with a real zone (vorner.cz) and queryperf, with a total
of 4 auths running on a 4-core machine.

With receptionist:
  
$ ./queryperf -d ~/input -p 5310 -q 100 -l 20 -t 1

DNS Query Performance Testing Tool
Version: $Id: queryperf.c,v 1.12 2007-09-05 07:36:04 marka Exp $

[Status] Processing input data
[Status] Sending queries (beginning with 127.0.0.1)
[Timeout] Query timed out: msg id 27218
[Timeout] Query timed out: msg id 27219
[Timeout] Query timed out: msg id 27222
[Timeout] Query timed out: msg id 27226
[Timeout] Query timed out: msg id 27227
[Timeout] Query timed out: msg id 27231
[Timeout] Query timed out: msg id 27233
[Timeout] Query timed out: msg id 27234
[Timeout] Query timed out: msg id 27235
[Timeout] Query timed out: msg id 27238
[Status] Testing complete

Statistics:

  Parse input file:     multiple times
  Run time limit:       20 seconds
  Ran through file:     795511 times

  Queries sent:         2386534 queries
  Queries completed:    2386524 queries
  Queries lost:         10 queries
  Queries delayed(?):   0 queries

  RTT max:              0.040414 sec
  RTT min:              0.000159 sec
  RTT average:          0.000788 sec
  RTT std deviation:    0.000435 sec
  RTT out of range:     0 queries

  Percentage completed: 100.00%
  Percentage lost:        0.00%

  Started at:           Thu Feb 28 14:23:46 2013
  Finished at:          Thu Feb 28 14:24:07 2013
  Ran for:              21.000004 seconds

  Queries per second:   113643.978354 qps

Directly to auth (no receptionist):

$ ./queryperf -d ~/input -p 5311 -q 70 -l 20 -t 1

DNS Query Performance Testing Tool
Version: $Id: queryperf.c,v 1.12 2007-09-05 07:36:04 marka Exp $

[Status] Processing input data
[Status] Sending queries (beginning with 127.0.0.1)
[Timeout] Query timed out: msg id 9181
[Timeout] Query timed out: msg id 32829
[Timeout] Query timed out: msg id 32828
[Status] Testing complete

Statistics:

  Parse input file:     multiple times
  Run time limit:       20 seconds
  Ran through file:     739629 times

  Queries sent:         2218888 queries
  Queries completed:    2218885 queries
  Queries lost:         3 queries
  Queries delayed(?):   0 queries

  RTT max:              0.596407 sec
  RTT min:              0.000061 sec
  RTT average:          0.000618 sec
  RTT std deviation:    0.000578 sec
  RTT out of range:     0 queries

  Percentage completed: 100.00%
  Percentage lost:        0.00%

  Started at:           Thu Feb 28 14:22:52 2013
  Finished at:          Thu Feb 28 14:23:12 2013
  Ran for:              20.003071 seconds

  Queries per second:   110927.217126 qps

Now, you can notice several things:
 • I used a different number of outstanding queries. The receptionist is
   slower when the buffers inside are not properly filled up. On a real
   server, that would be no problem: there would either be enough queries to
   answer, or the system would be idle, so slightly suboptimal performance
   wouldn't matter. On the other hand, without the receptionist, it started
   to drop packets with more than 70 outstanding queries.
 • There's a batch of "dropped" queries in the receptionist case. This is
   because the queues were not flushed when queryperf stopped sending, so
   those queries were never delivered to auth. In a real deployment this
   would be solved by timeouts of a few milliseconds on the queues.
 • The RTT is slightly longer with the receptionist, but not by much.
 • The receptionist actually achieves slightly higher QPS in this case.

Now, the last thing might be surprising. There are several possible
explanations, though I don't know which of them applies:
 • I used the recvmmsg and sendmmsg calls, which can receive/send multiple
   messages in a single system call. The auth server then gets a whole batch
   at once, with just two calls to recv. The switches to kernel space might
   be slow. These calls are not available everywhere, so we probably want a
   generic version for other systems too. But if someone wants the best
   possible performance, we may want to say which OS is best.
 • The locking on the UDP socket shared between the auths may slow things
   down. We read from it single-threaded and then distribute the queries to
   separate TCP connections.
 • ASIO overhead. It may be that the several layers of templates take their
   toll. The receptionist is low-level and quite simple, and the batches
   limit the number of ASIO reads/writes needed.
 • The batches might help the CPU cache (if we don't leave the process for
   kernel handling after each message, we might still have some useful data
   in the cache).

Also, the way I hacked the batch processing into the TCP server is probably
suboptimal (there are more data copies than necessary), so it could be sped up
further.

However, without the batching, the receptionist was something like 25-50%
slower than direct access to the auths.

You can find the code in the trac2776 branch. I don't think it is worth a code
review at this point, but if you want to see more clearly what it does, you can
have a look.

Anyway, I believe it is possible to write the receptionist without much of a
performance penalty, and it seems to me to be the most flexible of the
approaches we have considered. I'd propose going down the receptionist path
(there are some pros/cons of each approach in git).

With regards

-- 
XML is like violence. If it doesn't solve your problem, use more.

Michal 'vorner' Vaner