BIND 9 scaling

Jim Reid jim at rfc1035.com
Tue Feb 15 20:15:40 UTC 2005


>>>>> "Phill" == Phill Wood <phill.wood at gmail.com> writes:

    Phill> We're master for a relatively small number of zones (~
    Phill> 1500) and provide a caching nameserver service as
    Phill> well. I've never encountered any performance issues myself
    Phill> but my management are asking what the limits might be.

    Phill> We served 6,000,000 requests yesterday using one CPU of a
    Phill> dual-CPU Sparc box (running with -n 1 due to threading bug
    Phill> in 9.2.3). Maximum load average was 0.46.

    Phill> Can I say, therefore, that we will be able to serve
    Phill> 12,000,000 requests easily and then performance may wane?
    Phill> And at what level of hits would we start to suffer?

It depends. All other things being equal, your assumption is not
unreasonable. However, there are a number of variables that you've not
explained which could complicate matters. What you should do is
capture your name server's queries and replay them in a testbed that's
the same as your operational environment.  The queryperf tool in the
BIND contrib tree can then be used to replay those queries and
generate increased load. Observe the results and you'll see how much
capacity and throughput you have to spare. Changing the server
configuration or platform will give comparative numbers. And since
this benchmarking is based on your actual traffic/environment, the
results will be more meaningful than some artificial loads or back of
the envelope guesses.
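
For example (the filenames and addresses below are just placeholders),
the exercise looks roughly like this:

    # capture a day's queries on the production server
    tcpdump -n -s 0 -w queries.pcap udp port 53

    # convert the capture into queryperf's input format: one query
    # per line as "name type", e.g. "www.example.com A"

    # replay against the test server and watch the results
    queryperf -d queries.txt -s 192.0.2.1

queryperf's summary tells you how many queries completed or were lost
and the rate it achieved, which gives you the comparative numbers you
need.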

Broadly speaking, every authoritative answer costs about the same to
produce. It's a CPU-bound activity, provided the name server stores
everything in RAM as it normally does. So authoritative throughput
generally scales linearly with CPU speed and/or the number of CPUs.
Multiprocessors complicate this as the OS thread libraries and BIND9's
internal locking create bottlenecks and guzzle CPU cycles. So throwing
twice as many CPUs in one box won't necessarily deliver twice the
throughput. OTOH doubling the CPU speed of a uniprocessor would double
the capacity, all other things being equal and assuming that the
network can keep up with the increased traffic.
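
To put rough numbers on that (illustrative arithmetic only, not a
benchmark): 6,000,000 queries a day averages out at about 70 queries
per second over 24 hours. Even if your peak rate is several times the
average, that is a very light load for one CPU, which squares with the
0.46 load average you saw. The queryperf exercise above will tell you
how much headroom that really leaves, rather than relying on sums like
these.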

For resolving servers, CPU is not usually the limiting factor. Network
bandwidth, lossy links and the number of concurrent queries that the
server is in the process of resolving tend to matter more. So does
the amount of RAM for the server's cache. More or faster CPUs won't
help much if the DNS packets have lousy RTTs to remote name servers or
the name server can't allocate enough internal data structures for the
incoming query load from stub resolvers.
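
If a resolving server does start to struggle, the first knobs to look
at are in named.conf. The values below are purely illustrative -- size
them from your own measurements and RAM budget:

    options {
        // how many recursive lookups the server will work on at once
        recursive-clients 2000;

        // upper bound on the memory the cache may consume
        max-cache-size 256M;
    };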

BTW, it's also a very good idea to keep the authoritative and
resolving server functionality separated even though BIND allows
these to be combined in one process. Keeping the roles apart also
simplifies capacity planning, since each set of servers has a distinct
and predictable workload. One set can be optimised to service
queries from other name servers while the other set deals with queries
from local stub resolvers.
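
In named.conf terms the split is straightforward. Roughly (the address
block is a placeholder for your own client networks):

    // on the authoritative servers: answer only for your own zones
    options {
        recursion no;
    };

    // on the resolving servers: recurse, but only for your own clients
    options {
        recursion yes;
        allow-recursion { 192.0.2.0/24; };
    };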

Benchmarking and capacity planning for name servers is much more
subtle than it might seem. Check out the Nominum White Paper on the
subject. It explains what to measure and why. ISTR a famous physicist
saying "to measure is to know". The WP is a good place to start.


