[bind10-dev] production experience notes
Jeremy C. Reed
jreed at isc.org
Mon Dec 19 20:49:00 UTC 2011
I was asked to look at a production system running an AS112 service. That
is, it hosts a zone for hostname.as112.net plus empty zones (SOA and NS
records only) for the RFC 1918 reverse zones. The system was running the
latest snapshot release, installed via the FreeBSD port. It had been running
for about ten days; I am not sure exactly when traffic was switched over to
it, but it was at least a couple of days ago.
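For readers not familiar with AS112: each of those empty zones is just an
SOA plus NS records and nothing else, roughly like this (the values here are
illustrative, not copied from the actual zone files on that box):

$TTL 604800
@   IN SOA  blackhole-1.iana.org. hostmaster.as112.net. (
            1         ; serial
            604800    ; refresh
            86400     ; retry
            2419200   ; expire
            604800 )  ; negative TTL
    IN NS   blackhole-1.iana.org.
    IN NS   blackhole-2.iana.org.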
$ pkg_info | grep bind10
bind10-devel-20111128 Development version of ISC BIND 10 DNS Suite
This email describes my experience with that system (some of it unrelated
to the specific problem I was asked about).
0) I was given a user login. Luckily, bindctl worked for me (using the
defaults). We need to secure those defaults.
b10-auth was sitting at around 100% CPU usage. netstat showed the box
averaging 60,000 packets in per second but only around 200 packets out per
second. Every dig I ran against it timed out.
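For reference, that kind of check is just netstat's per-interval counters
plus plain dig queries; on FreeBSD roughly the following (the interface name
is hypothetical):

$ netstat -w 1 -I em0                 # packets in/out per second
$ dig @127.0.0.1 hostname.as112.net TXT
$ dig @127.0.0.1 -x 192.168.1.1       # PTR query the RFC 1918 zones should answer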
1) I saw no logging; none was configured. I found an rc script that started
bind10 using FreeBSD's daemon(8) with the -f switch, so all standard output
and standard error were lost. (I had previously started an email thread about
this same problem:
https://lists.isc.org/pipermail/bind10-dev/2011-December/002911.html .)
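To spell out the problem: daemon -f points the child's stdin, stdout and
stderr at /dev/null, so anything bind10 prints before logging is configured
is simply gone. The invocation is roughly of this shape, and one stopgap is
to drop -f and redirect through a shell (the user, pidfile and paths here are
illustrative, not quoted from the actual port):

$ daemon -f -u bind10 -p /var/run/bind10.pid /usr/local/sbin/bind10
$ daemon -u bind10 -p /var/run/bind10.pid \
    /bin/sh -c '/usr/local/sbin/bind10 >> /var/log/bind10-console.log 2>&1'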
2) I configured logging to a file. It logged a great deal of needless
output, all of it about out-of-bailiwick data from NS records in the served
zones. This is covered by existing ticket http://bind10.isc.org/ticket/1102
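For the archive, the file logging I set up was along these lines in bindctl
(the path is just an example, and I am quoting the configuration tree from
memory, so check "config show Logging" for the exact item names):

> config add Logging/loggers
> config set Logging/loggers[0]/name "*"
> config set Logging/loggers[0]/severity "INFO"
> config add Logging/loggers[0]/output_options
> config set Logging/loggers[0]/output_options[0]/destination "file"
> config set Logging/loggers[0]/output_options[0]/output "/var/log/bind10.log"
> config commit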
3) I noticed bindctl tab completion didn't work with numbered sets. I
opened a ticket for this:
http://bind10.isc.org/ticket/1519
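By "numbered sets" I mean list items addressed by an index; completion stops
being helpful once the index is involved in something like:

> config show Logging/loggers[0]/severity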
4) Full debugging (debuglevel 99) showed it was answering some queries. At
60,000 queries per second, my logs rotated fast. I set the debug level back
to 0; the logs rotated, all but my most recent file were reset to zero bytes,
and the oldest rotated file grew beyond the configured maxsize. I opened a
ticket:
http://bind10.isc.org/ticket/1518
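The knobs involved there are roughly the following (again from memory, so
the exact names may differ slightly; the sizes are just examples):

> config set Logging/loggers[0]/severity "DEBUG"
> config set Logging/loggers[0]/debuglevel 99
> config set Logging/loggers[0]/output_options[0]/maxsize 1048576
> config set Logging/loggers[0]/output_options[0]/maxver 8
> config commit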
5) I opened another ticket about bindctl usage:
http://bind10.isc.org/ticket/1520
6) I originally thought the problem was due to a huge database and queries
coming in for names not in the database. This is a known issue where the auth
server temporarily hangs, but I can't find the ticket at the moment. That
guess turned out to be wrong, because the AS112 setup is very simple and the
zones are small. I confirmed this by looking at the records with the sqlite3
command line, roughly as sketched below.
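(The database path below is the port's default as far as I recall, and the
table names are from memory, so treat this as a sketch:)

$ sqlite3 /usr/local/var/bind10-devel/zone.sqlite3 'SELECT name FROM zones;'
$ sqlite3 /usr/local/var/bind10-devel/zone.sqlite3 'SELECT COUNT(*) FROM records;'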
7) Logging showed:
2011-12-19 18:47:12.512 DEBUG [b10-auth.datasrc] DATASRC_SQLITE_FINDREC
looking for record '*.206.168.192.in-addr.arpa./PTR'
2011-12-19 18:47:12.513 DEBUG [b10-auth.datasrc] DATASRC_CACHE_INSERT
inserting item 'negative entry for *.206.168.192.in-addr.arpa. IN PTR
' into the hotspot cache
2011-12-19 18:47:12.513 DEBUG [b10-auth.datasrc] DATASRC_CACHE_OLD_FOUND
older instance of hotspot cache item 'negative entry for *.206.168.192.in-addr.arpa. IN PTR
' found, replacing
2011-12-19 18:47:12.513 DEBUG [b10-auth.datasrc] DATASRC_CACHE_REMOVE
removing 'negative entry for *.206.168.192.in-addr.arpa. IN PTR
' from the hotspot cache
What is the purpose of the final "removing" message? This happened a lot,
many times per second. I don't know the reason, but it seems inefficient.
Let's discuss this and open a ticket if needed.
8) I knew the SQLite3 backend, even with its hotspot cache, could not keep
up, so I switched to the in-memory datasource. I had several zones to create,
so I added an Auth/datasources entry, set its type to memory, and then added
two zones to it (roughly as sketched below). bindctl crashed, so I opened a
ticket: http://bind10.isc.org/ticket/1515
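What I typed was approximately the following (the zone names and file paths
here are examples, and the configuration tree is again from memory):

> config add Auth/datasources
> config set Auth/datasources[0]/type "memory"
> config add Auth/datasources[0]/zones
> config set Auth/datasources[0]/zones[0]/origin "hostname.as112.net"
> config set Auth/datasources[0]/zones[0]/file "/usr/local/etc/bind10/hostname.as112.net.zone"
> config add Auth/datasources[0]/zones
> config set Auth/datasources[0]/zones[1]/origin "10.in-addr.arpa"
> config set Auth/datasources[0]/zones[1]/file "/usr/local/etc/bind10/10.in-addr.arpa.zone"
> config commit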
9) I converted the simple zone files to canonical format, as currently
required by the memory datasource backend, using named-checkzone -D via a
simple shell script (sketched below). Note that due to this limitation I had
to regenerate a single zone file into 20 different zone files.
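The script was nothing more than a loop of this shape (the source file name
is hypothetical, and the origin list is shortened here; the real one listed
all 20 zones):

#!/bin/sh
# dump each zone in canonical form for the memory backend
for origin in hostname.as112.net 10.in-addr.arpa 168.192.in-addr.arpa 16.172.in-addr.arpa
do
    named-checkzone -D -o ${origin}.canonical ${origin} master.zone
done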
10) I also used a trivial shell one-liner to output the bindctl commands,
which I then copied and pasted into bindctl (something along the lines of the
example below).
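Purely illustrative, and wrapped here for readability:

i=0
for z in hostname.as112.net 10.in-addr.arpa 168.192.in-addr.arpa
do
    echo "config add Auth/datasources[0]/zones"
    echo "config set Auth/datasources[0]/zones[$i]/origin \"$z\""
    echo "config set Auth/datasources[0]/zones[$i]/file \"/usr/local/etc/bind10/$z.canonical\""
    i=$((i+1))
done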
11) The log output showed the memory datasource being disabled and then
re-enabled, and it mentioned the sqlite3 database, on every commit. I created
a ticket for this: http://bind10.isc.org/ticket/1517
12) The incoming packets increased to 61,000 per second and the outbound
packets jumped to around 3,100 per second. CPU usage for b10-auth dropped to
around 45%, but b10-xfrout was now also around 45%. I used bindctl to remove
the b10-xfrout component (see below). b10-auth went up to around 79% CPU
usage and netstat showed outbound packets were now around 8,000 per second. I
could now sometimes dig against the server and get the expected response.
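Removing the component was just the following (this matches what I remember
of the Boss component configuration; double-check with "config show
Boss/components" first):

> config remove Boss/components b10-xfrout
> config commit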
13) I also opened another ticket about logging:
http://bind10.isc.org/ticket/1516
14) The admin of the service needed better performance, perhaps up to 80K
QPS during spikes. (From these results, and from my benchmarking data on
other systems, I knew we couldn't reach that.) So for now, the admin changed
the router to send the traffic to a known-working BIND 9 system. The inbound
and outbound packets dropped to 0 per second and b10-auth went to 0% CPU
usage.
Jeremy C. Reed
ISC