drafts of write-ups on 9.2.0rc8 perf
Rick Jones
raj at cup.hp.com
Tue Nov 6 20:31:08 UTC 2001
Folks -
I have been doing my usual netperf thing against BIND 9.2.0rc8, and a
bit against 8.2.5, and have some small papers ready to go online. I
thought I'd run them past you first for your perusal.
I think I am still on bind-workers, but best to cc me if you want me to
see your feedback.
rick jones
ftp://ftp.cup.hp.com/dist/networking/briefs/
--
Wisdom Teeth are impacted, people are affected by the effects of events.
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to raj in cup.hp.com but NOT BOTH...
-- Attached file included as plaintext by Listar --
-- File: compet_dns_server_results.txt
Copyright 2000, Hewlett-Packard Company
The Performance of Competitors' Boxes as Domain Name Servers
Running Various Versions of BIND
Rick Jones <raj at cup.hp.com>
Hewlett-Packard Company
Cupertino, CA
Revision 0.3; November 5, 2001
Add Memory Usage
Revision 0.2; October 26, 2001
Add Blade 1000
Revision 0.1; June 14, 2000
ftp://ftp.cup.hp.com/dist/networking/briefs/compet_server_results.txt
Introduction:
One of the fundamental and crucial pieces of the Internet's
infrastructure is the Domain Name System (DNS). The pre-eminent
implementation of a DNS server is found in named from the Berkeley
Internet Name Domain (BIND) distribution maintained by isc.org.
This evolving brief will take a look at the performance of various
competitors to the HP 9000 running the named server from various
revisions of BIND.
This document builds on prior documents discussing the DNS server
performance of HP 9000 systems running named. Those documents can be
found at the URLs:
ftp://tardy.cup.hp.com/dist/networking/briefs/named_performance.txt
ftp://tardy.cup.hp.com/dist/networking/briefs/dns_server_results.txt
Summary:
The astute reader will notice that altering the number of CPUs does
not appear to have a significant effect on the number of names a
system can resolve when running a single copy of named. The named
process is single-threaded, and the bulk of the CPU time is spent in
user space. These two things are the reason one does not see much CPU
scaling. Basically, a single CPU is saturated, leaving any others
rather idle.
Named Performance Summary
netperf DNS_RR test requesting 1000 names out of cup.hp.com
8.1.2 8.2.2-P5 9.2.0rc8 9.2.0rc8
"Sun" "Sun" "stock" "-fast"
System +====================================================
UE420 1x450 | 4,272 | | | |
UE420 2x450 | 4,700 | | | |
UE420 4x450 | 5,156 | | | |
B1750 1x750 | | 5,586 | 1,225 | 2,336 |
+====================================================
[ additional information about the flavors of named can be found
in the section on configuration. ]
Based upon the UE420 tests, every 1000 requests per second required
approximately 2.8 mbit/s of network bandwidth leaving the server and
approximately 0.8 mbit/s of inbound network bandwidth. Of course, this
will depend entirely on the nature of the names being requested and
does not consider anything other than "A" record requests.
Methodology:
On each system/revision of named measured, a caching-only server was
set up and seeded by requesting all of the names in the cup.hp.com
domain. At the time of the UE420 measurements, this was ~33,000
names. By the time of the Blade 1750 measurements this had increased
to ~49,000 entries. For the actual measurements the first 1000 names in
cup.hp.com were requested by the load generators running the DNS_RR
test of netperf3. Between 1 and 16 simultaneous, synchronous streams
of requests were measured. Each measurement lasted 10 seconds (UE420R)
or 15 seconds (Blade 1750) and was taken three times. The number reported
in the summary table is the peak measured across the number of
threads.
The load generators were eight (UE420R) or twelve (Blade 1750) HP 9000
Model J5000 systems running revision B.11.00.47 (aka 11.ACE) of the
HP-UX 11 operating system. To ensure that network bandwidth was never
an issue all systems were connected via Gigabit Ethernet links.
Configurations:
The UE420R was configured with 4GB of RAM and either 1, 2 or 4 450 MHz
UltraSPARC II CPUs. A Sun PCI Gigabit Ethernet 2.0 NIC was used for
the test network. The UE420R was running BIND 8.1.2; the "Sun" flavor
of named is simply the in.named shipped by Sun as part of Solaris 8.
The Blade 1750 was configured with 2GB of RAM and 1 750 MHz
UltraSPARC-III CPU. A Sun PCI Gigabit Ethernet NIC was used for the
test network and was inserted into the (sole?) 66 MHz PCI slot in the
system. The uname -a output from the Blade 1750 was:
SunOS hpisp801 5.8 Generic_108528-09 sun4u sparc SUNW,Sun-Blade-1000
and the what strings from in.named implied that the in.named for the
"Sun" flavor was based on BIND 8.2.2-P5.
The "stock" 9.2.0rc8 is the named as it compiles out of the tar
file. From a brief examination of the logs it would appear this is a
-g compilation with no other options.
The "-fast" 9.2.0rc8 is a compilation of the named with "-fast"
replacing "-g." It is the author's understanding that this will cause
a compilation for the SPARC revision on which the compiler is running
and will enable appropriate optimizations. The "-fast" flag was used in
Sun's SPECcpu2000 submittals for the Blade 1750. The method used to
cause "-fast" to be set was:
$ CFLAGS="-fast" ./configure
While there was "only" 2GB of RAM in the system, all memory slots were
filled, so presumably the system had maximum memory
interleaving. Also, the system was not RAM starved - the named
(9.2.0rc8 at least) consumed less than 9MB of memory.
Rare data:
The "threads" in these data represent separate, synchronous, streams
of requests. The data presented here is the summary of the raw data
collected. If you are keenly interested in the minutiae, you can
contact the author.
UE420R 1x450 8.1.2 "Sun"
Threads Ops
1 2294
2 2667
3 3950
4 4265
5 4266
6 4268
7 4269
8 4264
9 4269
10 4262
11 4267
12 4267
13 4269
14 4272
15 4264
16 4269
Peak ops: 4272 with 14 threads
UE420R 2x450 8.1.2 "Sun"
Threads Ops
1 2297
2 2672
3 3960
4 4638
5 4680
6 4539
7 4532
8 4524
9 4594
10 4700
11 4666
12 4564
13 4590
14 4536
15 4622
16 4559
Peak ops: 4700 with 10 threads
UE420R 4x450 8.1.2 "Sun"
Threads Ops
1 2314
2 2700
3 3995
4 5090
5 5115
6 5156
7 5152
8 5116
9 5136
10 5119
11 5123
12 5127
13 5156
14 5132
15 5098
16 5129
Peak ops: 5156 with 6 threads
Blade 1750 1x750 8.2.2-P5 "Sun"
Threads Ops
1 2434
2 4758
3 5252
4 5483
5 5522
6 5512
7 5536
8 5540
9 5549
10 5583
11 5563
12 5564
13 5586
14 5572
15 5554
16 5579
Peak ops: 5586 with 13 threads
Blade 1750 1x750 9.2.0rc8 "stock"
Threads Ops
1 981
2 1214
3 1224
4 1218
5 1224
6 1222
7 1219
8 1224
9 1217
10 1225
11 1224
12 1219
13 1224
14 1224
15 1219
16 1220
Peak ops: 1225 with 10 threads
Blade 1750 1x750 9.2.0rc8 "-fast"
Threads Ops
1 1570
2 2311
3 2326
4 2313
5 2330
6 2314
7 2319
8 2331
9 2323
10 2332
11 2326
12 2324
13 2335
14 2336
15 2322
16 2334
Peak ops: 2336 with 14 threads
-- Attached file included as plaintext by Listar --
-- File: j6700_dns_server_results.txt
Copyright 2001, Hewlett-Packard Company
The Performance of the HP j6700 as a Domain Name Server
Running BIND 9.2.0rc8
Rick Jones <raj at cup.hp.com>
Hewlett-Packard Company
Cupertino, CA
Revision 0.4; November 5, 2001
Add Memory Footprint
Revision 0.3; November 2, 2001
Add +O4 data
ftp://ftp.cup.hp.com/dist/networking/briefs/j6700_dns_server_results.txt
Introduction:
One of the fundamental and crucial pieces of the Internet's
infrastructure is the Domain Name System (DNS). The pre-eminent
implementation of a DNS server is found in named from the Berkeley
Internet Name Domain (BIND) distribution maintained by isc.org.
This evolving brief will take a look at the performance of various
models of the HP 9000 running the named server from various revisions
of BIND.
This document builds on prior documents discussing ways to improve the
performance of an HP 9000 system running named. Those documents can be
found at the URLs:
ftp://ftp.cup.hp.com/dist/networking/briefs/named_performance.txt
ftp://ftp.cup.hp.com/dist/networking/briefs/dns_server_results.txt
Other related documents include, but are not limited to:
ftp://ftp.cup.hp.com/dist/networking/briefs/compet_dns_server_results.txt
Summary:
A j6700 system running HP-UX 11.00 can handle over 7500 DNS name
requests per second running the named from BIND 9.2.0rc8. A great deal
of this performance is made possible by the advanced optimization
features of the HP ANSI C compiler. Compiler optimization can improve
the performance of named by as much as 70 percent over the way the
BIND 9.2.0rc8 distribution ships from www.isc.org.
Named Performance Summary
netperf DNS_RR test requesting 1000 names out of cup.hp.com
9.2.0rc8 9.2.0rc8 9.2.0rc8 8.2.5
"stock" "-O" "+O4" "-O"
System +====================================================
j6700 1x750 | 2,894 | 4,631 | | |
j6700 2x750 | | 7,098 | 7,543 | 10,682 |
+====================================================
[ additional information about the "flavors" of named can be found
in the section on configuration. ]
Methodology:
On each system/revision of named measured, a caching-only server was
set up and seeded by requesting all of the ~39,000 names in the
cup.hp.com domain. For the actual measurements the first 1000 names in
cup.hp.com were requested by the load generators running the DNS_RR
test of netperf3. Between 1 and 16 simultaneous, synchronous streams
of requests were measured. Each measurement lasted 15 seconds and was
taken three times. The number reported in the summary table is the
peak measured across the number of threads.
The load generators were twelve HP 9000 Model J5000 systems running
revision B.11.00.47 (aka 11.ACE) of the HP-UX 11 operating system. To
ensure that network bandwidth was never an issue all systems were
connected via Gigabit Ethernet links.
Configurations:
The j6700 was running revision B.11.00.47 (aka 11.ACE) of the HP-UX 11
operating system and had an add-on PCI Gigabit Ethernet NIC (product
number A4926A) installed, along with revision B.11.00.11 of the Gigabit
NIC driver and basically "up-to-date" Transport and related patches.
The j6700 was configured with 16GB of RAM and either 1 or 2 750 MHz
PA-8700 CPUs. It should be noted that the j6700 ships with a minimum
of two processors. The single CPU results are presented to give some
idea of MP scaling. For the single CPU results, the second CPU was
disabled in firmware.
While the system had 16GB of RAM, the memory footprint of named was
less than 9MB. For this reason, it may be assumed that the same
performance could be achieved with a significantly smaller RAM
configuration. The benefit of being able to hold 16 GB of RAM comes
when one wants to hold a very large domain in memory. Perhaps even an
entire gTLD :)
The "stock" identifier indicates a binary built using the default
HP-UX 11 build rules from the 9.2.0rc8 distribution. The resulting
32-bit PA 2.0 binary includes debug information from the use of the -g
compiler option.
The "-O" binary remains a 32-bit PA 2.0 binary, compiled with -O
replacing -g. This is also the case for 8.2.5.
The "+O3" binary remains a 32-bit PA 2.0 binary, compiled with +O3.
The "+O4" binary remains a 32-bit PA 2.0 binary, compiled with +O4.
A "full" binary is a 64 bit PA 2.0 binary compiled with +DA2.0W and
includes Profile Based Optimization (PBO - +I/+P), +O4,
+Ostaticprediction and +Oentrysched. It also includes the use of the
chatr utility to set the default instruction and data page sizes to
"L" for largest possible. A full binary was not created for this
writeup.
One very interesting observation was that while 8.2.5 was
significantly faster than 9.2.0 during the measurement run, it took
much, Much, MUCH longer to get everything cached.
Rare data:
The "threads" in these data represent separate, synchronous, streams
of requests. The data presented here is the summary of the raw data
collected.
j6700 1x750 MHz 9.2.0rc8 "stock" :
Threads Ops
1 1679
2 2773
3 2846
4 2848
5 2849
6 2852
7 2849
8 2862
9 2889
10 2893
11 2894
12 2893
13 2892
14 2893
15 2892
16 2893
Peak ops: 2894 with 11 threads
j6700 1x750 MHz 9.2.0rc8 "-O" :
Threads Ops
1 2153
2 4273
3 4587
4 4618
5 4625
6 4624
7 4625
8 4625
9 4624
10 4630
11 4630
12 4629
13 4629
14 4631
15 4623
16 4628
Peak ops: 4631 with 14 threads
j6700 2x750 MHz 9.2.0rc8 "-O" :
Threads Ops
1 1322
2 3484
3 4195
4 6522
5 7075
6 7096
7 7089
8 7060
9 7073
10 7098
11 7085
12 7089
13 7082
14 7090
15 7082
16 7077
Peak ops: 7098 with 10 threads
j6700 2x750 MHz 8.2.5 "-O" :
Threads Ops
1 3013
2 5925
3 8860
4 10586
5 10675
6 10682
7 10675
8 10660
9 10675
10 10679
11 10680
12 10664
13 10656
14 10672
15 10665
16 10664
Peak ops: 10682 with 6 threads
j6700 2x750 MHz 9.2.0rc8 "+O3" :
Threads Ops
1 1417
2 3419
3 4355
4 6682
5 7197
6 7230
7 7208
8 7226
9 7212
10 7220
11 7217
12 7223
13 7217
14 7229
15 7215
16 7222
Peak ops: 7230 with 6 threads
j6700 2x750 MHz 9.2.0rc8 "+O4" :
Threads Ops
1 1221
2 3418
3 4687
4 6797
5 7487
6 7537
7 7522
8 7543
9 7514
10 7516
11 7508
12 7515
13 7499
14 7507
15 7514
16 7501
Peak ops: 7543 with 8 threads
Appendix P: Brief perf analysis
The vast majority of the CPU's time is spent in user mode while
running BIND 9.2.0rc8 on the j6700. When the system is 100% CPU
saturated on both CPUs, roughly 90% of the CPU time is spent in
user space, and only 10% of the time is spent in the kernel.
The table below, massaged from the output of Glance, shows the system
calls and their rates when the test is running:
named 9.2.0rc8 system calls
System Call Name ID Count Rate
--------------------------------------
gettimeofday 116 37817 7272.5
recvmsg 284 37813 7271.7
sendmsg 286 37813 7271.7
sched_yield 341 997 191.7
clock_gettime 346 2 0.3
ksleep 398 15 2.8
kwakeup 399 14 2.6
We can see that the bulk of the system calls are gettimeofday(),
recvmsg() and sendmsg(). It would seem that there is a gettimeofday()
call per request. The ksleep(), kwakeup() and sched_yield() syscalls
are common when acquiring and releasing mutexes.
In HP-UX 11, if the system is MP and a thread attempts to acquire a
locked mutex, it will spin some amount of time, make a call to
sched_yield(), and try again to acquire the mutex. If after a set
number of times through that loop it is unable to acquire the mutex,
it then calls ksleep() to go to sleep waiting for the mutex. The
owner of the mutex will then call kwakeup() when it releases it.