[bind10-dev] production experience notes
Jeremy C. Reed
jreed at isc.org
Mon Dec 19 20:49:00 UTC 2011
I was asked to look at a production system running an AS112 service. That
is, it hosts a zone for hostname.as112.net plus empty zones (SOA and NS
records only) for the RFC 1918 reverse zones. The system was running the
latest snapshot release, installed via the FreeBSD port. It had been running
for about ten days; I am not sure exactly when traffic was switched over to
it, but it was at least a couple of days ago.
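For readers not familiar with AS112: each of those empty zones is just an
SOA plus NS records and nothing else, roughly like this (the values here are
illustrative, not copied from the actual zone files on that box):

$TTL 604800
@   IN SOA  blackhole-1.iana.org. hostmaster.as112.net. (
            1         ; serial
            604800    ; refresh
            86400     ; retry
            2419200   ; expire
            604800 )  ; negative TTL
    IN NS   blackhole-1.iana.org.
    IN NS   blackhole-2.iana.org.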
$ pkg_info | grep bind10
bind10-devel-20111128 Development version of ISC BIND 10 DNS Suite
This email describes my experience with that system (some of it unrelated
to the specific problem I was asked about).
0) I was given a user login. Luckily, bindctl worked for me (using the
defaults). We need to secure those defaults.
b10-auth was sitting at around 100% CPU usage. netstat showed the box
averaging 60,000 packets in per second but only around 200 packets out per
second. Every dig I ran against it timed out.
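For reference, that kind of check is just netstat's per-interval counters
plus plain dig queries; on FreeBSD roughly the following (the interface name
is hypothetical):

$ netstat -w 1 -I em0                 # packets in/out per second
$ dig @127.0.0.1 hostname.as112.net TXT
$ dig @127.0.0.1 -x 192.168.1.1       # PTR query the RFC 1918 zones should answer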
1) I saw no logging; none was configured. I found an rc script that started
bind10 using FreeBSD's daemon(8) with the -f switch, so all standard output
and standard error were lost. (I had previously started an email thread about
this same problem:
https://lists.isc.org/pipermail/bind10-dev/2011-December/002911.html .)
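To spell out the problem: daemon -f points the child's stdin, stdout and
stderr at /dev/null, so anything bind10 prints before logging is configured
is simply gone. The invocation is roughly of this shape, and one stopgap is
to drop -f and redirect through a shell (the user, pidfile and paths here are
illustrative, not quoted from the actual port):

$ daemon -f -u bind10 -p /var/run/bind10.pid /usr/local/sbin/bind10
$ daemon -u bind10 -p /var/run/bind10.pid \
    /bin/sh -c '/usr/local/sbin/bind10 >> /var/log/bind10-console.log 2>&1'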
2) I configured logging to a file. It logged a great deal of needless
output, all of it about out-of-bailiwick data from NS records in the served
zones. This is covered by existing ticket http://bind10.isc.org/ticket/1102
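For the archive, the file logging I set up was along these lines in bindctl
(the path is just an example, and I am quoting the configuration tree from
memory, so check "config show Logging" for the exact item names):

> config add Logging/loggers
> config set Logging/loggers[0]/name "*"
> config set Logging/loggers[0]/severity "INFO"
> config add Logging/loggers[0]/output_options
> config set Logging/loggers[0]/output_options[0]/destination "file"
> config set Logging/loggers[0]/output_options[0]/output "/var/log/bind10.log"
> config commit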
3) I noticed bindctl tab completion didn't work with numbered sets. I
opened a ticket for this:
http://bind10.isc.org/ticket/1519
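By "numbered sets" I mean list items addressed by an index; completion stops
being helpful once the index is involved in something like:

> config show Logging/loggers[0]/severity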
4) Full debugging (debuglevel 99) showed it was answering some queries. At
60,000 queries per second, my logs rotated fast. I set the debug level back
to 0; the logs rotated, all but my most recent file were reset to zero bytes,
and the oldest rotated file grew beyond the configured maxsize. I opened a
ticket:
http://bind10.isc.org/ticket/1518
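The knobs involved there are roughly the following (again from memory, so
the exact names may differ slightly; the sizes are just examples):

> config set Logging/loggers[0]/severity "DEBUG"
> config set Logging/loggers[0]/debuglevel 99
> config set Logging/loggers[0]/output_options[0]/maxsize 1048576
> config set Logging/loggers[0]/output_options[0]/maxver 8
> config commit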
5) I opened another ticket about bindctl usage:
http://bind10.isc.org/ticket/1520
6) I originally thought the problem was due to a huge database and queries
coming in for names not in the database. This is a known issue where the auth
server temporarily hangs, but I can't find the ticket at the moment. That
guess turned out to be wrong, because the AS112 setup is very simple and the
zones are small. I confirmed this by looking at the records with the sqlite3
command line, roughly as sketched below.
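(The database path below is the port's default as far as I recall, and the
table names are from memory, so treat this as a sketch:)

$ sqlite3 /usr/local/var/bind10-devel/zone.sqlite3 'SELECT name FROM zones;'
$ sqlite3 /usr/local/var/bind10-devel/zone.sqlite3 'SELECT COUNT(*) FROM records;'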
7) Logging showed:
2011-12-19 18:47:12.512 DEBUG [b10-auth.datasrc] DATASRC_SQLITE_FINDREC
looking for record '*.206.168.192.in-addr.arpa./PTR'
2011-12-19 18:47:12.513 DEBUG [b10-auth.datasrc] DATASRC_CACHE_INSERT
inserting item 'negative entry for *.206.168.192.in-addr.arpa. IN PTR
' into the hotspot cache
2011-12-19 18:47:12.513 DEBUG [b10-auth.datasrc] DATASRC_CACHE_OLD_FOUND
older instance of hotspot cache item 'negative entry for *.206.168.192.in-addr.arpa. IN PTR
' found, replacing
2011-12-19 18:47:12.513 DEBUG [b10-auth.datasrc] DATASRC_CACHE_REMOVE
removing 'negative entry for *.206.168.192.in-addr.arpa. IN PTR
' from the hotspot cache
What is the purpose of the final "removing" message? This happened a lot,
many times per second. I don't know the reason, but it seems inefficient.
Let's discuss this and open a ticket if needed.
8) I knew the SQLite3 backend, even with its hotspot cache, could not keep
up, so I switched to the in-memory datasource. I had several zones to create,
so I added an Auth/datasources entry, set its type to memory, and then added
two zones to it (roughly as sketched below). bindctl crashed, so I opened a
ticket: http://bind10.isc.org/ticket/1515
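What I typed was approximately the following (the zone names and file paths
here are examples, and the configuration tree is again from memory):

> config add Auth/datasources
> config set Auth/datasources[0]/type "memory"
> config add Auth/datasources[0]/zones
> config set Auth/datasources[0]/zones[0]/origin "hostname.as112.net"
> config set Auth/datasources[0]/zones[0]/file "/usr/local/etc/bind10/hostname.as112.net.zone"
> config add Auth/datasources[0]/zones
> config set Auth/datasources[0]/zones[1]/origin "10.in-addr.arpa"
> config set Auth/datasources[0]/zones[1]/file "/usr/local/etc/bind10/10.in-addr.arpa.zone"
> config commit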
9) I converted the simple zone files to canonical format, as currently
required by the memory datasource backend, using named-checkzone -D via a
simple shell script (sketched below). Note that due to this limitation I had
to regenerate a single zone file into 20 different zone files.
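The script was nothing more than a loop of this shape (the source file name
is hypothetical, and the origin list is shortened here; the real one listed
all 20 zones):

#!/bin/sh
# dump each zone in canonical form for the memory backend
for origin in hostname.as112.net 10.in-addr.arpa 168.192.in-addr.arpa 16.172.in-addr.arpa
do
    named-checkzone -D -o ${origin}.canonical ${origin} master.zone
done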
10) I also used a trivial shell one-liner to output the bindctl commands,
which I then copied and pasted into bindctl (something along the lines of the
example below).
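Purely illustrative, and wrapped here for readability:

i=0
for z in hostname.as112.net 10.in-addr.arpa 168.192.in-addr.arpa
do
    echo "config add Auth/datasources[0]/zones"
    echo "config set Auth/datasources[0]/zones[$i]/origin \"$z\""
    echo "config set Auth/datasources[0]/zones[$i]/file \"/usr/local/etc/bind10/$z.canonical\""
    i=$((i+1))
done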
11) The log output showed the memory datasource being disabled and then
re-enabled, and it mentioned the sqlite3 database, on every commit. I created
a ticket for this: http://bind10.isc.org/ticket/1517
12) The incoming packets increased to 61,000 per second and the outbound
packets jumped to around 3,100 per second. CPU usage for b10-auth dropped to
around 45%, but b10-xfrout was now also around 45%. I used bindctl to remove
the b10-xfrout component (see below). b10-auth went up to around 79% CPU
usage and netstat showed outbound packets were now around 8,000 per second. I
could now sometimes dig against the server and get the expected response.
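Removing the component was just the following (this matches what I remember
of the Boss component configuration; double-check with "config show
Boss/components" first):

> config remove Boss/components b10-xfrout
> config commit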
13) I also opened another ticket about logging:
http://bind10.isc.org/ticket/1516
14) The admin of the service needed better performance, perhaps up to 80K
QPS during spikes. (From these results, and from my benchmarking data on
other systems, I knew we couldn't reach that.) So for now, the admin changed
the router to send the traffic to a known-working BIND 9 system. The inbound
and outbound packets dropped to 0 per second and b10-auth went to 0% CPU
usage.
Jeremy C. Reed
ISC