[bind10-dev] production experience notes

JINMEI Tatuya / 神明達哉 jinmei at isc.org
Tue Dec 20 19:29:50 UTC 2011


At Mon, 19 Dec 2011 14:49:00 -0600 (CST),
"Jeremy C. Reed" <jreed at isc.org> wrote:
> 
> I was asked to look at a production system running an AS112 service. That 
> is, it hosts a zone for hostname.as112.net and empty zones (SOA and 
> NS only) for the RFC 1918 zones.  The system was running the latest 
> snapshot release as installed from the FreeBSD port. It had been running 
> for about ten days, but I am not sure when traffic was switched over to 
> it (at least a couple of days ago).

Thanks for the report.  That's very interesting.

> 0) I was given a user login. Luckily bindctl worked for me (using 
> defaults). We need to secure the defaults for this.
> 
> b10-auth was at around 100% CPU usage. netstat showed it was averaging 
> 60,000 packets in per second and only around 200 packets out per second. 
> Every dig I did against it timed out.

It's amazing that an AS112 server is actually receiving such a high
volume of queries.  It's not surprising that the current implementation
of b10-auth can't handle the load, but I'm quite sure we can improve it
to the point of handling such a volume of queries.

> 6) I originally thought the problem was due to a huge database and queries 
> coming in for things not in the database. This is a known issue where 
> the auth server temporarily hangs, but I can't find the ticket at the 
> moment.

I guess it's #414 (which was closed as a "duplicate" of #324).  But in
any case it's a different issue, as you already noticed.

> simple and small zones. I confirmed this by using the sqlite3 command 
> line to look at the records.
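
(As an aside, that kind of check can also be scripted.  Below is a
minimal sketch using Python's sqlite3 module; the database path and the
zones/records schema it assumes are illustrative, so adjust both for
the actual installation.)

#!/usr/bin/env python3
# Sketch: list the records of one zone in the sqlite3 data source.
# The DB path and the zones/records schema here are assumptions; adjust
# them for the actual deployment.
import sqlite3

DB_PATH = "/usr/local/var/bind10-devel/zone.sqlite3"  # hypothetical path
ZONE = "hostname.as112.net."

conn = sqlite3.connect(DB_PATH)
cur = conn.cursor()
cur.execute("SELECT id FROM zones WHERE name = ?", (ZONE,))
row = cur.fetchone()
if row is None:
    raise SystemExit("zone not found: " + ZONE)
for name, ttl, rdtype, rdata in cur.execute(
        "SELECT name, ttl, rdtype, rdata FROM records WHERE zone_id = ?"
        " ORDER BY name, rdtype", (row[0],)):
    print(name, ttl, rdtype, rdata)
conn.close()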

> 12) The incoming packets increased to 61,000 per second and the outbound 
> packets jumped to around 3,100 per second. The CPU usage for b10-auth 
> dropped to around 45%, but b10-xfrout was also now around 45%. I used 
> bindctl to remove the b10-xfrout component. b10-auth went up to around 
> 79% CPU usage and netstat showed the outbound packets were now around 
> 8,000 per second. I could now sometimes dig against it and get the 
> expected response.

This is interesting.  So the queries included a non-marginal number of
xfr requests?  I guess this suggests we should think about rate limiting
incoming xfr requests even on the auth side, so that it won't waste too
many resources forwarding the requests to xfrout when the latter is too
busy to handle them.
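
(Just to illustrate the rate-limiting idea: a token-bucket sketch in
Python.  The class name and the numbers are made up for illustration
and are not an existing b10-auth interface; the real thing would live
in the C++ query path.)

# Illustrative token-bucket limiter for forwarded xfr requests.
# Nothing here is an actual BIND 10 API; it only sketches the idea.
import time

class XfrForwardLimiter:
    """Allow at most `rate` forwards per second, with a small burst."""
    def __init__(self, rate=10.0, burst=20.0):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller would answer REFUSED instead of forwarding

# In the query path (hypothetical helper names):
#   if is_xfr_request(msg) and not limiter.allow():
#       return make_error_response(msg, rcode=REFUSED)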

It's also strange to me that b10-auth still doesn't consume 100% CPU
time even though it should be very busy and is actually dropping queries.

Maybe related to this point, I guess we should allow b10-auth to simply
return REFUSED or some other error to xfr requests immediately if
it's not expected to answer them.  The ability to disable xfrout is
nice, but what it currently means is that b10-auth tries to forward the
requests but fails to establish a connection to xfrout (resulting in
an exception) for every request.  That's an obvious waste, and maybe
that's the reason for the idle time.
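
(Again, just to sketch the "fail fast" idea in Python pseudocode; the
helper names here are hypothetical, not the actual b10-auth code, which
is C++.)

# Fail fast on xfr requests when xfrout is disabled: build a REFUSED
# response locally instead of attempting an IPC connection per request.
# All names below are illustrative.
REFUSED = 5  # DNS RCODE value for REFUSED

def handle_xfr_request(msg, xfrout_enabled, forward_to_xfrout,
                       make_error_response):
    if not xfrout_enabled:
        # Cheap local answer; no connection attempt, no exception.
        return make_error_response(msg, rcode=REFUSED)
    try:
        return forward_to_xfrout(msg)
    except ConnectionError:
        # xfrout unreachable; still answer rather than silently dropping.
        return make_error_response(msg, rcode=REFUSED)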

> 14) The admin of the service needed better performance, maybe up to 80K 
> QPS for spikes. (From these results and from my benchmarking data from 
> another system, I knew we couldn't reach it.) So for now, the admin 
> changed the router to have the traffic go to a known-working BIND 9 
> system. The inbound and outbound packets dropped to 0 per second and 
> b10-auth went to 0% CPU usage.

Maybe we should consider a quick-hack performance task?  For example,
we could consider a custom data source module (library) for static zone
content that mostly hardcodes response images (for much better
performance).
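
(Roughly what I mean by "hardcoded response images", sketched in Python:
render the answer for each name/type in the static zone once at startup
and serve it with a dictionary lookup, so the hot path never touches the
data source.  The names below are illustrative; a real module would be
a C++ data source plugged into b10-auth.)

# Sketch of precomputed ("hardcoded") response images for a static zone.
# Illustrative only; not an existing BIND 10 interface.
PRECOMPUTED = {}  # (qname, qtype) -> wire-format answer section

def register(qname, qtype, answer_wire):
    PRECOMPUTED[(qname.lower(), qtype)] = answer_wire

def lookup(qname, qtype):
    """Return the prerendered answer image, or None to fall back to the
    normal data source path."""
    return PRECOMPUTED.get((qname.lower(), qtype))

# At startup something like this would run for, e.g., the SOA and NS of
# an empty RFC 1918 reverse zone (rendered_soa is hypothetical bytes):
#   register("10.in-addr.arpa.", 6, rendered_soa)   # 6 = SOA
# At query time the server copies the image into the response buffer and
# only rewrites the message ID and question section.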

---
JINMEI, Tatuya


