BIND 9.x caching performance under heavy loads
jim at rfc1035.com
Thu Mar 3 14:06:16 UTC 2005
>>>>> "Srini" == Srini Avirneni <avirsri at gmail.com> writes:
>> Please be more specific. You seem to be saying that "bad
>> things happen when your name servers are under heavy
>> load". This shouldn't exactly be a surprise. What's the actual
>> problem are you experiencing and/or trying to solve?
Srini> Yes, it is a surprise. It's a surprise to see Bind as a
Srini> process peak at 25% CPU on day 1. Then, on day 2, the peak
Srini> climbs to over 50%. Load has not changed.
Srini> By day 3, CPU will exceed 75% and Bind will no longer
Srini> respond in a meaningful way (< 500 queries/sec). This is
Srini> on very fast hardware.
Whether it's fast hardware or not doesn't matter. What you're saying
is that CPU utilisation increases and throughput drops over time,
right? Well, there could be lots of reasons for that. See below and my
previous posting on this thread. You can expect this kind of behaviour
if you're making the name server do more work by arbitrarily
constraining the size of the cache. Which you've more or less admitted.
I paraphrase: "when we make the cache smaller, the server falls over".
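[For anyone following along: the knob in question is max-cache-size in named.conf. A sketch of the options block; the 256M figure is an arbitrary illustration, not a recommendation:

```
options {
    // 0 (the default in BIND 9 of this era) means no fixed ceiling:
    // the cache grows and entries expire by their TTLs.
    // A small ceiling forces early eviction, so the server must
    // re-resolve names it would otherwise have answered from cache --
    // more upstream queries, more CPU work per client query.
    max-cache-size 256M;
};
```

Letting the cache grow without limit, as suggested below, means setting this back to 0 or removing it.]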
Once you're convinced these factors are not to blame, it's time to
profile what the name server is actually doing. Maybe at these extremes
on your OS BIND9 is tickling some obscure lock contention problems,
possibly threading-related. That might require external consultancy
and/or a support contract for BIND.
Srini> As load increased, the CPU time ran away quicker. One would
Srini> expect that a Bind server that runs at 25% CPU during peak,
Srini> would within a few points, operate as such continually, and
Srini> scale somewhat linearly.
Srini> The query loads we see are fairly consistent.
>> Query rates of 2500 qps are rather high. Is this a real,
>> operational load or something you've cooked up in a testbed? If
>> it's the former,
Srini> It's real load. High is a relative term, which was one
Srini> reason for submitting this question. These are not small
Srini> servers, but Dual PIVs with plenty of RAM. Notice I stated
Srini> servers, not server. :)
That does not help. Is 2500 qps the load per server or is it 2500 qps
aggregated across all your caching servers (how many?)? Not that this
really makes much difference.
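[A rough way to pin down what one server actually sustains is to time a batch of queries against it directly. A minimal stdlib-only Python sketch; the server address, the name list, and the helper names are my own illustration, nothing from this thread:

```python
import socket
import struct
import time

def build_dns_query(qname, qtype=1, qid=0x1234):
    """Build a minimal DNS query packet (QTYPE 1 = A, class IN)."""
    # Header: ID, flags with only the RD bit set, QDCOUNT=1,
    # ANCOUNT/NSCOUNT/ARCOUNT all zero (RFC 1035 wire format).
    header = struct.pack(">HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels, terminated by a zero byte.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in qname.rstrip(".").split(".")
    ) + b"\x00"
    return header + labels + struct.pack(">HH", qtype, 1)

def measure_qps(server, names, timeout=2.0):
    """Fire one UDP query per name; report answered queries per second."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    start, answered = time.time(), 0
    for name in names:
        sock.sendto(build_dns_query(name), (server, 53))
        try:
            sock.recv(512)
            answered += 1
        except socket.timeout:
            pass  # dropped or slow answer; counts against throughput
    elapsed = time.time() - start
    return answered / elapsed if elapsed else 0.0
```

e.g. measure_qps("192.0.2.1", ["example.com"] * 1000). This loop is serial, so it measures latency-bound throughput; a real load generator such as queryperf (in the BIND source's contrib tree) keeps many queries in flight at once.]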
Srini> This is the crux of the question: have others running Bind
Srini> 9.x with substantial load noticed any similar behavior?
To answer this question meaningfully, you need to provide more
information so that people can make sensible comparisons with their
own operating environment. For example, are the clients querying for
names that make the server do excessive work to resolve them? e.g. long
chains of NS records, zones that are almost lame, zones on name
servers behind very lossy links, etc, etc. What happens when you let
the name server's cache grow without limit? Have you run BIND9 with
threading disabled? Are your servers forwarding queries? Are you using
views? Are you using large or complex access control lists? These are
rhetorical questions BTW. If you're looking for help, I think you need
to hire a consultant or clueful support rather than ask this list. The
advice you get here is worth exactly what you pay for it.
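[For readers working through that checklist themselves, these are the named.conf constructs to look for. An illustration only -- the ACL name, addresses, and view are invented:

```
// ACLs must be defined before they are referenced. Large or
// complex ACLs are evaluated on every matching query.
acl trusted { 192.0.2.0/24; 198.51.100.0/24; };

options {
    // Forwarding makes this server's throughput depend on the
    // forwarders' own performance and reachability.
    // forwarders { 192.0.2.53; };
    allow-query { trusted; };
};

// Each view is effectively a separate server with its own cache,
// and every incoming query is matched against the views in turn.
view "internal" {
    match-clients { trusted; };
    recursion yes;
};
```

As for threading: if memory serves, named's -n option sets the number of worker threads, and BIND 9 of this vintage could also be built with threading disabled; check the ARM for your version rather than taking my word for it.]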
BIND9 can handle lots of queries, though extreme loads from stub
resolvers can be troublesome. This is well known. Check the list
archives for details. The question you should really be asking
(yourself) is "if I'm/we're having operational problems, what am I/we
going to do about this?" rather than "has anyone else seen this?".