[bind10-dev] b10-auth query performance: experiments and suggestions

JINMEI Tatuya / 神明達哉 jinmei at isc.org
Wed Jan 18 23:14:42 UTC 2012


As some of you know, I did some quick experiments on improving query
performance of b10-auth (especially with the in-memory data source) to
see how much we can realistically achieve by the end of project year 3
(the end of this March).

In short, I'm quite confident that we can reasonably reach at least
"BIND 9 equivalent" performance (e.g. within +/- 10% of its max qps),
and perhaps even something better.  I'm also pretty sure that we can
achieve much better performance, such as twice the speed of BIND 9,
with an additional 2-3 sprint cycles (although that will be beyond the
y3 goal).

Here are some more specific details of the optimizations I tried in
the experiments.  The experimental setup mentioned below is a root
server configuration (without DNSSEC-related RRs), tested using
queryperf with some (old but) real traffic to a root server.

1. Pre-establish shortcuts for additional section processing (in
   memory only)
   This is similar to BIND 9's "acache", and to what NSD does
   internally.  But unlike acache, since our current implementation
   uses a fairly static in-memory image, everything can be prepared at
   load time (so it's not difficult to implement).

   It improves max qps by about 23% compared to the current BIND 10
   implementation.
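The shortcut idea can be sketched roughly as follows.  This is a
minimal, hypothetical illustration: the types, fields and the
establishShortcuts() function are made up for the sketch, not the
actual BIND 10 classes.  The point is that each NS target is resolved
against the zone's glue once at load time, so additional section
processing at query time is just a pointer dereference.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the in-memory data source types; the
// names are illustrative only.
struct RRset {
    std::string owner;               // owner name of the RRset
    std::vector<std::string> rdata;  // e.g. glue addresses for an NS name
};

struct NSRecord {
    std::string nsdname;      // name server name from the NS rdata
    const RRset* additional;  // shortcut to its glue, set at load time
};

// At zone load time, resolve each NS target against the glue table
// once.  Query-time additional section processing then just follows
// the stored pointer instead of doing a fresh lookup per query.
std::vector<NSRecord> establishShortcuts(
    const std::vector<std::string>& ns_names,
    const std::unordered_map<std::string, RRset>& glue)
{
    std::vector<NSRecord> result;
    for (const auto& name : ns_names) {
        const auto it = glue.find(name);
        result.push_back({name, it != glue.end() ? &it->second : nullptr});
    }
    return result;
}
```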

2. Eliminate the overhead of ASIO wrapper
   We currently use generic {UDP/TCP}Server classes that can support
   recursive resolution.  Due to the possibility of recursion they are
   inherently inefficient: they allocate temporary resources, such as
   the output buffer and the message renderer, for every query.  This
   is an obvious waste for b10-auth, which always handles queries
   sequentially and should be able to reuse such resources.  I've
   introduced a "UDPSyncServer", which reuses as many resources as
   possible and never forks.

   Combined with optimization #1, this improves max qps by about 18%
   over the version with only optimization #1 (+45% over the current
   implementation).
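The resource-reuse idea, reduced to a toy sketch (the class and member
names here are made up, not the real UDPSyncServer interface): keep
the per-query resources as long-lived members and clear them between
queries, so the memory allocated for earlier responses is reused
instead of reallocated.

```cpp
#include <string>
#include <vector>

// Illustrative output buffer: clear() drops the contents but keeps the
// allocated capacity, so the next response of similar size needs no
// new allocation.
struct OutputBuffer {
    std::vector<unsigned char> data;
    void clear() { data.clear(); }
};

// Toy synchronous server: since queries are handled strictly one at a
// time, a single long-lived buffer can serve every query.
class SyncServer {
public:
    // Handle one query.  The buffer is cleared, not reconstructed,
    // between queries.
    const OutputBuffer& handleQuery(const std::string& query) {
        buffer_.clear();
        // Placeholder for actual response rendering:
        buffer_.data.assign(query.begin(), query.end());
        return buffer_;
    }
private:
    OutputBuffer buffer_;  // reused for the server's whole lifetime
};
```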

3. Replace the name compression logic
   The current implementation of name compression in the
   MessageRenderer class turned out to be inefficient.  I suspect one
   major reason is that it internally uses an STL set, and the set
   elements are always cleaned up and re-allocated for every query
   even if the renderer is reused.  It has other bottlenecks, as
   pointed out in #1568, but IMO we should first consider a bigger
   organizational change than that level of micro-optimization.

   I changed the internal implementation to something similar to BIND
   9: maintain compression offset information in a hash table, and
   reuse the hash entries (up to some limit) when the renderer itself
   is reused.

   The effect of this optimization was remarkable.  Combined with
   optimizations #1 and #2, it improves max qps by about 80% over the
   version with only #1 and #2 (and is about 2.6 times faster than the
   current implementation).  In my experiment, BIND 10 beat BIND 9 at
   this point.  (Note: this optimization wouldn't be that effective
   without optimization #2, because it implicitly relies on the fact
   that the renderer is reused.)

   I'd also note that this optimization eliminates the need for #1568,
   because the new compression logic doesn't use the class that #1568
   optimizes in the first place.  That's why I was not so enthusiastic
   about doing micro-profiling too early (the measurement itself isn't
   necessarily bad, aside from the time it takes, but it often leads
   to jumping to premature micro-optimization ideas).
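The hash-based scheme boils down to something like this toy version
(again, the class and method names are purely illustrative, and the
real renderer works on wire-format labels rather than std::string):
record the message offset at which each name was first rendered, so a
later occurrence of the same name can be written as a compression
pointer to that offset.

```cpp
#include <string>
#include <unordered_map>

// Toy compression table for a message renderer.
class CompressTable {
public:
    // If the name was already rendered, return its recorded offset so
    // the caller can emit a compression pointer; otherwise record the
    // new offset and return -1 (name must be written in full).
    int findOrAdd(const std::string& name, int offset) {
        const auto it = offsets_.find(name);
        if (it != offsets_.end()) {
            return it->second;
        }
        offsets_[name] = offset;
        return -1;
    }
    // Called when the renderer is reused for the next query: the
    // entries are dropped, but typical unordered_map implementations
    // keep the bucket array, so the table need not re-grow from
    // scratch on every query.
    void clear() { offsets_.clear(); }
private:
    std::unordered_map<std::string, int> offsets_;
};
```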

It shouldn't be that difficult to fully implement these
optimizations.  They are also independent in terms of implementation
(though they may have mutual effects, as with #1 and #3), so we
should be able to parallelize developing them.  My gut feeling is
that we can quite likely complete all of these within 2 sprints, and
perhaps even in a single sprint.

I propose these as the first tasks we should take on in the
performance work phase (after NSEC3 for in memory).

I'll explain further optimizations in a separate message.

---
JINMEI, Tatuya


