drafts of write-ups on 9.2.0rc8 perf

Rick Jones raj at cup.hp.com
Tue Nov 6 20:31:08 UTC 2001


Folks -

I have been doing my usual netperf thing against bind 9.2.0rc8, and a
bit against 8.2.5, and have some small papers ready to go online. I
thought I'd run them past you first for your perusal.

I think I am still on bind-workers, but best to cc me if you want me to
see your feedback.

rick jones

ftp://ftp.cup.hp.com/dist/networking/briefs/

-- 
Wisdom Teeth are impacted, people are affected by the effects of events.
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to raj in cup.hp.com  but NOT BOTH...

-- Attached file included as plaintext by Listar --
-- File: compet_dns_server_results.txt




	       Copyright 2000, Hewlett-Packard Company



     The Performance of Competitors' Boxes as Domain Name Servers
		   Running Various Versions of BIND

		     Rick Jones <raj at cup.hp.com>
		       Hewlett-Packard Company
			    Cupertino, CA

		    Revision 0.3; November 5, 2001
			   Add Memory Usage

		    Revision 0.2; October 26, 2001
			    Add Blade 1000

		     Revision 0.1; June 14, 2000


 ftp://ftp.cup.hp.com/dist/networking/briefs/compet_server_results.txt


Introduction:

One of the fundamental and crucial pieces of the Internet's
infrastructure is the Domain Name System (DNS).  The pre-eminent
implementation of a DNS server is found in named from the Berkeley
Internet Name Domain (BIND) distribution maintained by isc.org.

This evolving brief will take a look at the performance of various
competitors to the HP 9000 running the named server from various
revisions of BIND.

This document builds on prior documents discussing the DNS server
performance of HP 9000 systems running named. Those documents can be
found at the URLs:


  ftp://tardy.cup.hp.com/dist/networking/briefs/named_performance.txt
  ftp://tardy.cup.hp.com/dist/networking/briefs/dns_server_results.txt





Summary:

The astute reader will notice that altering the number of CPUs does
not appear to have a significant effect on the number of names a
system can resolve running a single copy of named. The named is a
single-threaded process, and the bulk of the CPU time is spent in
user-space. These two things are the reason one does not see much CPU
scaling. Basically, a single CPU is saturated, leaving any others
rather idle.

		      Named Performance Summary
	netperf DNS_RR test requesting 1000 names out of cup.hp.com

                8.1.2        8.2.2-P5    9.2.0rc8     9.2.0rc8
                "Sun"         "Sun"       "stock"      "-fast"
   System    +====================================================
 UE420 1x450 |    4,272   |            |            |            |
 UE420 2x450 |    4,700   |            |            |            |
 UE420 4x450 |    5,156   |            |            |            |
 B1000 1x750 |            |   5,586    |   1,225    |   2,336    |
             +====================================================

 [ additional information about the flavors of named can be found
   in the section on configuration. ]

Based upon the UE420 tests, every 1000 requests per second required
approximately 2.8 Mbit/s of network bandwidth leaving the server and
approximately 0.8 Mbit/s of inbound network bandwidth. Of course, this
will depend entirely on the nature of the names being requested and
does not consider anything other than "A" record requests.
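
[ For perspective, 2.8 Mbit/s at 1000 requests per second works out to
  roughly 2,800 bits, or about 350 bytes, on the wire per response,
  and 0.8 Mbit/s works out to roughly 800 bits, or about 100 bytes,
  per request, including the Ethernet, IP and UDP headers. ]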





Methodology:

On each system/revision of named measured, a caching-only server was
set up and seeded by requesting all of the names in the cup.hp.com
domain. At the time of the UE420 measurements, this was ~33,000
names. By the time of the Blade 1000 measurements this had increased
to ~49,000 entries. For the actual measurements, the first 1000 names
in cup.hp.com were requested by the load generators running the DNS_RR
test of netperf3. Between 1 and 16 simultaneous, synchronous streams
of requests were measured. Each measurement lasted 10 seconds (UE420R)
or 15 seconds (Blade 1000) and was taken three times. The number
reported in the summary table is the peak measured across all thread
counts.

The load generators were eight (UE420R) or twelve (Blade 1000) HP 9000
Model J5000 systems running revision B.11.00.47 (aka 11.ACE) of the
HP-UX 11 operating system. To ensure that network bandwidth was never
an issue, all systems were connected via Gigabit Ethernet links.
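
As an illustration of what one of those synchronous streams does, the
sketch below issues blocking A-record queries in a loop and counts the
responses. It is only a rough stand-in, not netperf itself; the names,
run time and buffer size are placeholders.

/*
 * Minimal sketch of one synchronous DNS request/response stream, in
 * the spirit of the netperf DNS_RR test (this is NOT netperf).
 * Link with -lresolv on most systems.
 */
#include <sys/types.h>
#include <stdio.h>
#include <time.h>
#include <netinet/in.h>
#include <arpa/nameser.h>
#include <resolv.h>

int
main(void)
{
    static const char *names[] = {      /* stand-ins for the first 1000 */
        "host001.cup.hp.com",           /* names in the cup.hp.com zone */
        "host002.cup.hp.com",
        "host003.cup.hp.com",
    };
    const int nnames = sizeof(names) / sizeof(names[0]);
    const int duration = 15;            /* seconds per measurement */
    unsigned char answer[512];          /* room for a UDP response */
    long ops = 0;
    time_t end;
    int i = 0;

    res_init();                         /* reads /etc/resolv.conf */

    end = time(NULL) + duration;
    while (time(NULL) < end) {
        /* one synchronous A-record query; blocks until the reply
           (or a resolver timeout) comes back */
        if (res_query(names[i], C_IN, T_A, answer, sizeof(answer)) >= 0)
            ops++;
        i = (i + 1) % nnames;
    }

    printf("%ld responses in %d seconds (%.1f per second)\n",
           ops, duration, (double)ops / duration);
    return 0;
}

netperf runs many such streams in parallel and reports the aggregate
transaction rate; the peak across 1 to 16 streams is what appears in
the tables that follow.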

Configurations:

The UE420R was configured with 4GB of RAM and either one, two or four
450 MHz UltraSPARC II CPUs. A Sun PCI Gigabit Ethernet 2.0 NIC was
used for the test network. The BIND 8.1.2 "Sun" flavor of named
running on the UE420R is simply the in.named shipped by Sun as part of
Solaris 8.

The Blade 1000 was configured with 2GB of RAM and one 750 MHz
UltraSPARC-III CPU. A Sun PCI Gigabit Ethernet NIC was used for the
test network and was inserted into the (sole?) 66 MHz PCI slot in the
system. The uname -a output from the Blade 1000 was:

SunOS hpisp801 5.8 Generic_108528-09 sun4u sparc SUNW,Sun-Blade-1000

and the what(1) strings from in.named implied that the in.named for
the "Sun" flavor was based on BIND 8.2.2-P5.

The "stock" 9.2.0rc8 is as the named compiles out of the tar
file. From a brief examination of the logs it would appear this is a
-g compilation with no other options.

The "-fast" 9.2.0rc8 is a compilation of the named with "-fast"
replacing "-g." It is the author's understanding that this will cause
a compilation for the SPARC revision on which the compiler is running
and will enable appropriate optimizations. The "-fast" flag was used in
Sun's SPECcpu2000 submittals for the Blade 1750. The method used to
cause "-fast" to be set was:

$ CFLAGS="-fast" ./configure

While there was "only" 2GB of RAM in the system, all memory slots were
filled, so presumably the system had maximum memory
interleaving. Also, the system was no RAM starved - the named
(9.2.0rc8 at least) consumed less than 9MB of memory.





Rare data:

The "threads" in these data represent separate, synchronous, streams
of requests. The data presented here is the summary of the raw data
collected. If you are keenly interested in the minutiae, you can
contact the author.


 UE420R 1x450 8.1.2 "Sun"

 Threads             Ops
       1            2294
       2            2667
       3            3950
       4            4265
       5            4266
       6            4268
       7            4269
       8            4264
       9            4269
      10            4262
      11            4267
      12            4267
      13            4269
      14            4272
      15            4264
      16            4269

 Peak ops: 4272 with 14 threads


 UE420R 2x450 8.1.2 "Sun"


 Threads             Ops
       1            2297
       2            2672
       3            3960
       4            4638
       5            4680
       6            4539
       7            4532
       8            4524
       9            4594
      10            4700
      11            4666
      12            4564
      13            4590
      14            4536
      15            4622
      16            4559

 Peak ops: 4700 with 10 threads

 UE420R 4x450 8.1.2 "Sun"


 Threads             Ops
       1            2314
       2            2700
       3            3995
       4            5090
       5            5115
       6            5156
       7            5152
       8            5116
       9            5136
      10            5119
      11            5123
      12            5127
      13            5156
      14            5132
      15            5098
      16            5129

 Peak ops: 5156 with 6 threads

 Blade 1000 1x750 8.2.2-P5 "Sun"

 Threads             Ops
       1            2434
       2            4758
       3            5252
       4            5483
       5            5522
       6            5512
       7            5536
       8            5540
       9            5549
      10            5583
      11            5563
      12            5564
      13            5586
      14            5572
      15            5554
      16            5579

 Peak ops: 5586 with 13 threads

 Blade 1750 1x750 9.2.0rc8 "stock"

 Threads             Ops
       1             981
       2            1214
       3            1224
       4            1218
       5            1224
       6            1222
       7            1219
       8            1224
       9            1217
      10            1225
      11            1224
      12            1219
      13            1224
      14            1224
      15            1219
      16            1220

 Peak ops: 1225 with 10 threads

 Blade 1750 1x750 9.2.0rc8 "-fast"

 Threads             Ops
       1            1570
       2            2311
       3            2326
       4            2313
       5            2330
       6            2314
       7            2319
       8            2331
       9            2323
      10            2332
      11            2326
      12            2324
      13            2335
      14            2336
      15            2322
      16            2334

 Peak ops: 2336 with 14 threads


-- Attached file included as plaintext by Listar --
-- File: j6700_dns_server_results.txt


	       Copyright 2001, Hewlett-Packard Company



       The Performance of the HP j6700 as a Domain Name Server
			Running BIND 9.2.0rc8

		     Rick Jones <raj at cup.hp.com>
		       Hewlett-Packard Company
			    Cupertino, CA


		    Revision 0.4; November 5, 2001
			 Add Memory Footprint

		    Revision 0.3; November 2, 2001
			     Add +O4 data


 ftp://ftp.cup.hp.com/dist/networking/briefs/j6700_dns_server_results.txt


Introduction:

One of the fundamental and crucial pieces of the Internet's
infrastructure is the Domain Name System (DNS).  The pre-eminent
implementation of a DNS server is found in named from the Berkeley
Internet Name Domain (BIND) distribution maintained by isc.org.

This evolving brief will take a look at the performance of various
models of the HP 9000 running the named server from various revisions
of BIND.

This document builds on prior documents discussing ways to improve the
performance of an HP 9000 system running named. Those documents can be
found at the URLs:


  ftp://ftp.cup.hp.com/dist/networking/briefs/named_performance.txt
  ftp://ftp.cup.hp.com/dist/networking/briefs/dns_server_results.txt

Other related documents include, but are not limited to:

  ftp://ftp.cup.hp.com/dist/networking/briefs/compet_dns_server_results.txt



Summary:

A j6700 system running HP-UX 11.00 can handle over 7500 DNS name
requests per second running the named from BIND 9.2.0rc8. A great deal
of this performance is made possible by the advanced optimization
features of the HP ANSI C compiler. Compiler optimization can improve
the performance of named by as much as 70 percent over the way the
BIND 9.2.0rc8 distribution ships from www.isc.org.

		      Named Performance Summary
	netperf DNS_RR test requesting 1000 names out of cup.hp.com

                9.2.0rc8     9.2.0rc8     9.2.0rc8      8.2.5
                 "stock"       "-O"        "+O4"         "-O"
   System    +====================================================
 j6700 1x750 |   2,894    |   4,631    |            |            |
 j6700 2x750 |            |   7,098    |   7,543    |  10,682    |
             +====================================================

 [ additional information about the "flavors" of named can be found
   in the section on configuration. ]




Methodology:

On each system/revision of named measured, a caching-only server was
set up and seeded by requesting all of the ~39,000 names in the
cup.hp.com domain. For the actual measurements, the first 1000 names
in cup.hp.com were requested by the load generators running the DNS_RR
test of netperf3. Between 1 and 16 simultaneous, synchronous streams
of requests were measured. Each measurement lasted 15 seconds and was
taken three times. The number reported in the summary table is the
peak measured across all thread counts.

The load generators were twelve HP 9000 Model J5000 systems running
revision B.11.00.47 (aka 11.ACE) of the HP-UX 11 operating system. To
ensure that network bandwidth was never an issue, all systems were
connected via Gigabit Ethernet links.

Configurations:

The j6700 was running revision B.11.00.47 (aka 11.ACE) of the HP-UX 11
operating system and had an add-on PCI Gigabit Ethernet NIC (product
number A4926A) installed, along with revision B.11.00.11 of the
Gigabit NIC driver and basically "up-to-date" Transport and related
patches.

The j6700 was configured with 16GB of RAM and either one or two 750
MHz PA-8700 CPUs. It should be noted that the j6700 ships with a
minimum of two processors. The single CPU results are presented to
give some idea of MP scaling. For the single CPU results, the second
CPU was disabled in firmware.

While the system had 16GB of RAM, the memory footprint of named was
less than 9MB. For this reason, it may be assumed that the same
performance could be achieved with a significantly smaller RAM
configuration. The benefit of being able to hold 16 GB of RAM comes
when one wants to hold a very large domain in memory. Perhaps even an
entire gTLD :)

The "stock" identifier indicates a binary built using the default
HP-UX 11 build rules from the 9.2.0rc8 distribution. The resulting
32-bit PA 2.0 binary includes debug information from the use of the -g
compiler option.

The "-O" binary remains a 32-bit PA 2.0 binary, compiled with -O
replacing -g. This is also the case for 8.2.5.

The "+O3" binary remains a 32-bit PA 2.0 binary, compiled with +O3.

The "+O4" binary remains a 32-bit PA 2.0 binary, compiled with +O4.

A "full" binary is a 64 bit PA 2.0 binary compiled with +DA2.0W and
includes Profile Based Optimization (PBO - +I/+P), +O4,
+Ostaticprediction and +Oentrysched. It also includes the use of the
chatr utility to set the default instruction and data page sizes to
"L" for largest possible. A full binary was not created for this
writeup.

One very interesting observation was that while 8.2.5 was
significantly faster than 9.2.0 during the measurement run, it took
much, Much, MUCH longer to get everything cached.




Rare data:

The "threads" in these data represent separate, synchronous, streams
of requests. The data presented here is the summary of the raw data
collected.

 j6700 1x750 MHz 9.2.0rc8 "stock" :

 Threads             Ops
       1            1679
       2            2773
       3            2846
       4            2848
       5            2849
       6            2852
       7            2849
       8            2862
       9            2889
      10            2893
      11            2894
      12            2893
      13            2892
      14            2893
      15            2892
      16            2893

 Peak ops: 2894 with 11 threads

 j6700 1x750 MHz 9.2.0rc8 "O2" :

 Threads             Ops
       1            2153
       2            4273
       3            4587
       4            4618
       5            4625
       6            4624
       7            4625
       8            4625
       9            4624
      10            4630
      11            4630
      12            4629
      13            4629
      14            4631
      15            4623
      16            4628

 Peak ops: 4631 with 14 threads

 j6700 2x750 MHz 9.2.0rc8 "O2" :

 Threads             Ops
       1            1322
       2            3484
       3            4195
       4            6522
       5            7075
       6            7096
       7            7089
       8            7060
       9            7073
      10            7098
      11            7085
      12            7089
      13            7082
      14            7090
      15            7082
      16            7077

 Peak ops: 7098 with 10 threads

 j6700 2x750 MHz 8.2.5 "O2" :

 Threads             Ops
       1            3013
       2            5925
       3            8860
       4           10586
       5           10675
       6           10682
       7           10675
       8           10660
       9           10675
      10           10679
      11           10680
      12           10664
      13           10656
      14           10672
      15           10665
      16           10664

 Peak ops: 10682 with 6 threads

 j6700 2x750 MHz 8.2.5 "O3" :

 Threads             Ops
       1            1417
       2            3419
       3            4355
       4            6682
       5            7197
       6            7230
       7            7208
       8            7226
       9            7212
      10            7220
      11            7217
      12            7223
      13            7217
      14            7229
      15            7215
      16            7222

 Peak ops: 7230 with 6 threads

 j6700 2x750 MHz 8.2.5 "O3" :

 Threads             Ops
       1            1221
       2            3418
       3            4687
       4            6797
       5            7487
       6            7537
       7            7522
       8            7543
       9            7514
      10            7516
      11            7508
      12            7515
      13            7499
      14            7507
      15            7514
      16            7501

 Peak ops: 7543 with 8 threads




Appendix P: Brief perf analysis

The vast majority of the CPU time is spent in user mode while running
BIND 9.2.0rc8 on the j6700. When the system is 100% CPU saturated on
both CPUs, roughly 90% of the CPU time is spent in user space, and
only 10% of the time is spent in the kernel.

The table below, massaged from the output of Glance, shows the system
calls and their rates when the test is running:

		     named 9.2.0rc8 system calls

		System Call Name   ID    Count  Rate/s
		--------------------------------------
		 gettimeofday      116   37817 7272.5
		 recvmsg           284   37813 7271.7
		 sendmsg           286   37813 7271.7
		 sched_yield       341     997  191.7
		 clock_gettime     346       2    0.3
		 ksleep            398      15    2.8
		 kwakeup           399      14    2.6

We can see that the bulk of the system calls are gettimeofday(),
recvmsg() and sendmsg(). It would seem that there is a gettimeofday()
call per request. The ksleep(), kwakeup() and sched_yield() syscalls
are common when acquiring and releasing mutexes. 

In HP-UX 11, if the system is MP and a thread attempts to acquire a
locked mutex, it will spin for some amount of time, make a call to
sched_yield(), and try again to acquire the mutex. If, after a set
number of times through that loop, it is still unable to acquire the
mutex, it calls ksleep() to go to sleep waiting for the mutex. When
the owner releases the mutex, it calls kwakeup() to wake the sleeper.
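
In user-space terms, the general pattern looks something like the
sketch below. This is purely illustrative and is not the HP-UX
libpthread code; the spin and pass counts are made-up placeholders,
and a real implementation blocks via ksleep()/kwakeup() rather than
falling back to pthread_mutex_lock() as done here.

/*
 * Illustrative sketch of spin / yield / sleep mutex acquisition.
 * NOT the HP-UX libpthread implementation.
 */
#include <pthread.h>
#include <sched.h>

#define SPIN_TRIES   100   /* placeholder: trylock attempts per pass    */
#define YIELD_PASSES   8   /* placeholder: yield passes before sleeping */

void
acquire_mutex(pthread_mutex_t *m)
{
    int pass, spin;

    for (pass = 0; pass < YIELD_PASSES; pass++) {
        /* spin, hoping the current holder releases the mutex soon */
        for (spin = 0; spin < SPIN_TRIES; spin++) {
            if (pthread_mutex_trylock(m) == 0)
                return;                /* acquired without sleeping */
        }
        sched_yield();                 /* let the holder run */
    }

    /* still contended: block until the holder wakes us (the kernel
       analogue of this step is the ksleep()/kwakeup() pair above) */
    pthread_mutex_lock(m);
}

Given the rates above - roughly 192 sched_yield() calls per second
against roughly 7,272 requests per second - it would appear that the
named threads only rarely have to go very far down this path.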


