[bind10-dev] optimization: use some additional polling in SyncUDPServer
JINMEI Tatuya / 神明達哉
jinmei at isc.org
Mon Jul 2 18:34:32 UTC 2012
While running benchmark tests for the new in-memory data source, I
noticed that the system-level qps (measured by a queryperf-like tool)
was not as good as the benchmark results of query handling alone
(which is basically everything the server does except network I/O).
So I added a quick-hack optimization to the SyncUDPServer class:
after processing one query in an asynchronous callback, try to
receive some more queries directly, i.e., without going back to the
ASIO loop, in non-blocking mode (which was already enabled within
ASIO anyway).
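To illustrate the idea, here's a minimal sketch of what the receive
handler could look like with boost::asio. The names here
(MAX_QUERIES_PER_EVENT, processAndRespond, SketchUDPServer) are made
up for illustration; this is not the actual SyncUDPServer code, and
real query processing and error handling are elided.

#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <cstddef>

using boost::asio::ip::udp;

// The "N" in Nrecv/ev: max queries handled per ASIO event
// (hypothetical name).
const std::size_t MAX_QUERIES_PER_EVENT = 10;

class SketchUDPServer {
public:
    SketchUDPServer(boost::asio::io_service& io, unsigned short port) :
        socket_(io, udp::endpoint(udp::v4(), port))
    {
        // Make the would-block condition visible to our direct
        // (synchronous) receive calls.
        socket_.non_blocking(true);
        startAsyncReceive();
    }

private:
    void startAsyncReceive() {
        socket_.async_receive_from(
            boost::asio::buffer(data_, sizeof(data_)), sender_,
            boost::bind(&SketchUDPServer::handleReceive, this,
                        boost::asio::placeholders::error,
                        boost::asio::placeholders::bytes_transferred));
    }

    void handleReceive(const boost::system::error_code& ec,
                       std::size_t len) {
        if (!ec) {
            processAndRespond(len);
            // The optimization: poll the socket directly for up to
            // N-1 more queries before returning to the ASIO loop.
            for (std::size_t i = 1; i < MAX_QUERIES_PER_EVENT; ++i) {
                boost::system::error_code poll_ec;
                const std::size_t n = socket_.receive_from(
                    boost::asio::buffer(data_, sizeof(data_)), sender_,
                    0, poll_ec);
                if (poll_ec) { // typically would_block: nothing pending
                    break;
                }
                processAndRespond(n);
            }
        }
        startAsyncReceive();    // hand control back to ASIO
    }

    // Stand-in for actual query processing; just echoes the datagram.
    void processAndRespond(std::size_t len) {
        boost::system::error_code ignored;
        socket_.send_to(boost::asio::buffer(data_, len), sender_,
                        0, ignored);
    }

    udp::socket socket_;
    udp::endpoint sender_;
    char data_[4096];
};

int main() {
    boost::asio::io_service io;
    SketchUDPServer server(io, 5300);   // arbitrary test port
    io.run();
    return 0;
}

The point is that a burst of back-to-back queries gets drained with
plain receive_from() calls instead of paying one trip through the
ASIO event loop per query; the inner loop gives up as soon as the
socket would block or the N limit is reached.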
In general it improved the total throughput by about 10%, as
summarized in the following graph:
http://bind10.isc.org/~jinmei/qps3.png
(see my previous post for plum/bamboo/pine, but the difference between
them is not important here).
1recv/ev is equivalent to what we currently do.
Nrecv/ev (N>1) are the optimized versions, varying the maximum number
of queries that can be handled at once.
The following graph compares various versions of BIND 10 (and, for
reference, BIND 9):
http://bind10.isc.org/~jinmei/qps2.png
"BIND10-XXX+"s are BIND 10 variants including this optimization (with
10 max queries). The difference between "BIND 10 current+" and
"BIND10-pine+" (the most optimized one) are closer to the local
(excluding IO) benchmark result between these two versions than the
previous one: http://bind10.isc.org/~jinmei/qps.png
10% is not very big, but the required change should be pretty small
(about 50 lines of additional code, completely local to the
SyncUDPServer class), so I think it's worth doing. The major drawback
of this optimization is that if the server has multiple interfaces
(today many servers have at least two: one for IPv6 and the other for
IPv4), it could worsen the worst-case latency: the next query on the
other interface has to wait until all N queries on the first
interface are handled. But I guess with a reasonably chosen limit
it's acceptable; if the server can handle 20K queries/s on average,
the delay for handling 10 queries is about 10/20000 s = 0.5 ms.
Also, if the server runs multiple instances of b10-auth, queries from
different interfaces are more likely to be handled in different
instances, with higher concurrency.
So I suggest we include this optimization sometime along with the
current (upcoming) scalability work.
Two additional notes:
- The amount of improvement (10%) matches my recollection of the ASIO
abstraction overhead I mentioned at the last f2f meeting.
- The same optimization will work well for the resolver.
---
JINMEI, Tatuya