Performance Test Metrics for dns server performance.

Tue May 8 18:28:50 UTC 2001

> >>>>> "Raul" == Raul Miller <raul at usatoday.com> writes:
> 
>     Raul> If you've used a web browser which has bind as its dns
>     Raul> resolver, you'll often have to reload the page the first
>     Raul> time you visit a new site.  Sometimes several times -- until
>     Raul> bind gets enough information in the cache to resolve the
>     Raul> query.
> 
<snip>
> 
> The original poster did not give enough information about his setup
> and testing methodology to draw any conclusions from the results he
> found. For instance, what was in the server's cache,

You are wrong, I did give that information. 

   <pasted from first email>
   The above commands (1-3) were dumped into shell script and then executed
   in the following fashion:

     <start dnscache on server>
     time ./iplist-runtest-1000.sh > iplist-1000.out1     # empty cache
     time ./iplist-runtest-1000.sh > iplist-1000.out2     # primed cache
     time ./iplist-runtest-1000.sh > iplist-1000.out3     # primed cache
     <stop dnscache on server and start BIND, repeat test>

     <stop BIND, start dnscache, start test>
     time ./iplist-runtest-10000.sh > iplist-10000.out1   # empty cache
     time ./iplist-runtest-10000.sh > iplist-10000.out2   # primed cache
     time ./iplist-runtest-10000.sh > iplist-10000.out3   # primed cache
     <stop dnscache, start BIND, repeat test>

     <stop BIND, start dnscache, start test>
     time ./iplist-runtest-100000.sh > iplist-100000.out1  # empty cache
     time ./iplist-runtest-100000.sh > iplist-100000.out2  # primed cache
     time ./iplist-runtest-100000.sh > iplist-100000.out3  # primed cache
     <stop dnscache, start BIND, repeat test>

   After the test runs it was simple to do a few greps on the output to
   check the number of successes, timeouts, and temporary failures:
   </paste>
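
For anyone who wants to repeat the counting step, here is a minimal sketch of
the kind of grep meant above. The helper name count_matches and the two
patterns ('timed out', 'temporary failure') are placeholders made up for
illustration; substitute whatever strings your lookup tool actually prints.

```shell
#!/bin/sh
# count_matches is a hypothetical helper; the patterns passed to it below are
# placeholders, not the strings the real lookup tool printed.
count_matches() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # the count is 0, so "|| true" keeps that from aborting under "set -e".
    grep -c -- "$1" "$2" || true
}

# Illustrative usage against one of the output files, if it exists.
out=iplist-1000.out1
if [ -f "$out" ]; then
    echo "timed out:      $(count_matches 'timed out' "$out")"
    echo "temporary fail: $(count_matches 'temporary failure' "$out")"
fi
```

Counting lines rather than parsing them keeps the check independent of the
exact output layout, as long as each lookup produces one line.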

Since some folks didn't get it, I'll describe the test procedure in detail.
Step one was to start dnscache on the test server and run three consecutive
tests. In the first test, the cache would obviously be empty. In the two
subsequent tests, the cache would hold all the results gathered while running
the first test. For that reason I expected the two subsequent tests to run
much faster. That turned out not to be the case. I'm assuming that's because
the servers spent most of the query time waiting for failed lookups to time
out. This would seem to be a very valid "real world" performance test.

So, in answer to the question "what was in the cache":  
  test run #1: nothing
  test run #2: the number of responses cached from test 1
  test run #3: the number of responses cached from test 1 & test 2

That number varied wildly between dnscache and BIND 8. In tests 2 & 3
dnscache would have had roughly 40,000 cached responses and 50,000 failed
responses (per the numbers I posted). BIND, on the other hand, would have had
roughly 15,000 responses cached for the second and third runs.

All 90,000 IP addresses that were queried are local addresses here in our
NOC. All 90,000 addresses are SWIPed to the same two name servers (also
here in our NOC). Therefore, all the lookups go through the normal
lookup hierarchy (arpa --> in-addr.arpa --> xx.in-addr.arpa) and get
resolved locally by our two BIND 8.2.3 name servers, which are both
authoritative for those IP blocks. The network between the test
caching servers and the authoritative servers is a 100BaseT switched network
that is not saturated. Network conditions are constant across all the
tests.
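
To make the hierarchy concrete: a reverse lookup for an IPv4 address asks for
the PTR record of a name built by reversing the octets and appending
in-addr.arpa. A minimal sketch follows; reverse_name is a hypothetical helper
that is not part of the test scripts, and 192.0.2.10 is a documentation
address rather than one of ours.

```shell
#!/bin/sh
# Build the in-addr.arpa name that a PTR lookup for an IPv4 address queries.
# reverse_name is a hypothetical helper, not part of the original test scripts.
reverse_name() {
    # Split the dotted quad on "." and emit the octets in reverse order.
    echo "$1" | awk -F. '{ printf "%s.%s.%s.%s.in-addr.arpa\n", $4, $3, $2, $1 }'
}

# 192.0.2.10 is a documentation address used purely for illustration.
reverse_name 192.0.2.10
```

The cache resolves that name from the right-hand side: arpa first, then
in-addr.arpa, then the zone for the leading octet, and so on down to the
servers that are authoritative for the block.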

> what was in the caches of the servers that were queried by that server,

Judging by the output of dnscache, about 45,000 IPs have reverse DNS set up;
the other 45,000 don't.

> whether the NS records were prone to generate query restarts,

And how would I know this?

> if the network and
> router loads and topologies were the same, etc, etc. He also didn't
> define what was meant by "42% accuracy" or how this was even measured.

I did too. Read the original post again. I very specifically stated that I
was querying 90,112 IP addresses that were all local in our NOC. I also
stated that the caching dns servers I was testing ARE on the same LAN as the
authoritative name servers. Here's the relevant paragraph:

   <pasted>
   The file "iplist" is a compilation of 90,112 IP addresses that my company
   owns and are local IP's (to our NOC). So the answers will all be found on
   our three local name servers (solaris & bind 8) which are on the same LAN.
   By limiting it to our IP space I'm limiting the tests skew for network
   conditions, etc. The only real factor I have no control over is the load on
   the "real" name servers I'm querying so I've staggered the timing of the
   tests to limit variance.
   </pasted>

As for the definition of "42% accuracy", I very specifically gave you
the total number of IPs queried.

  <pasted>
  dnscache-1.0.5 - 290MB - 90,112 requests
  simultaneous   time(s)   completed      timed-out      temp fail
  1,000          964       40,308         9,281          10,073
  10,000         976       40,496         8,928          11,132
  100,000        875       40,786         7,562          11,816

  BIND 8.2.3-REL - 6-8MB - 90,112 requests
  simultaneous   time(s)   completed      timed-out      temp fail
  1,000          1144      18,966         5,203          47,638
  10,000         1157      14,299         4,899          54,236
  100,000        1200      12,771         5,185          56,575
  </pasted>

I also gave you the number of successful (completed) lookups. I even
described exactly how I counted them from the output. If you can handle
seventh grade math (dependent on local school system quality (or lack
thereof)), you can calculate accuracy. Is that too challenging for you?
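
For the record, the arithmetic can be sketched as below, assuming "accuracy"
means completed lookups divided by total requests, expressed as a percentage.
The helper name accuracy is made up for illustration; the inputs are the
1,000-simultaneous rows from the tables above.

```shell
#!/bin/sh
# accuracy is a hypothetical helper: percentage of requests that completed.
accuracy() {
    # $1 = completed lookups, $2 = total requests
    awk -v ok="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * ok / total }'
}

# First row of each table in the post (1,000 simultaneous queries).
echo "dnscache: $(accuracy 40308 90112)%"
echo "BIND:     $(accuracy 18966 90112)%"
```

The same call works for the other rows; only the completed count changes,
since every run issued 90,112 requests.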

Matt