Final pre-publication draft of the 9.1.0 perf writeup. Mk II
Rick Jones
raj at cup.hp.com
Wed Feb 14 23:28:36 UTC 2001
Copyright 2001, Hewlett-Packard Company
The Performance of BIND 9.1.0 on the HP 9000 L3000
Rick Jones <raj at cup.hp.com>
Hewlett-Packard Company
Cupertino, CA
Revision 0.7; February 13, 2001
ftp://ftp.cup.hp.com/dist/networking/briefs/bind9_perf.txt
Introduction:
One of the fundamental and crucial pieces of the Internet's
infrastructure is the Domain Name System (DNS). The preeminent
implementation of a DNS server is found in named from the Berkeley
Internet Name Domain (BIND) maintained by the Internet Software
Consortium (http://www.isc.org).
This evolving brief will take a look at the performance of
the named server from various revisions of BIND 9.
This document builds on prior documents discussing DNS server
performance which can be found under the URL:
ftp://ftp.cup.hp.com/dist/networking/briefs/
Summary:
BIND 9.1.0 named Performance Summary, HP-UX 11.ACE
netperf DNS_RR requesting 1000 names out of cup.hp.com

                                    +O2       +O2
                "stock"    "+O2"   +DD64   nothread   "full"
System      +=================================================+
L3000 1x550 |   2,065 |   3,127 |   3,039 |   4,303 |   4,943 |
L3000 2x550 |         |   5,338 |         |         |   6,771 |
L3000 4x550 |         |   8,065 |         |   4,757 |  10,006 |
J5600 2x552 |         |         |         |         |   7,459 |
            +=================================================+

BIND 9.1.0 named Performance Summary, HP-UX 11i, TOPS Enabled
netperf DNS_RR requesting 1000 names out of cup.hp.com

                                    +O2       +O2
                "stock"    "+O2"   +DD64   nothread   "full"
System      +=================================================+
L3000 1x550 |         |         |         |         |         |
L3000 2x550 |         |         |         |         |   6,750 |
L3000 4x550 |         |   7,792 |         |         |   9,599 |
J5600 2x552 |         |         |         |         |         |
            +=================================================+

BIND 9.1.0 named Performance Summary, HP-UX 11i, TOPS Disabled
netperf DNS_RR requesting 1000 names out of cup.hp.com

                                    +O2       +O2
                "stock"    "+O2"   +DD64   nothread   "full"
System      +=================================================+
L3000 1x550 |         |         |         |         |   4,832 |
L3000 2x550 |         |         |         |         |         |
L3000 4x550 |         |   7,930 |         |         |   9,761 |
J5600 2x552 |         |         |         |         |         |
            +=================================================+
[ Additional information about the flavors of named can be found in
later sections. ]
Observations and Conclusions:
Apart from the mutex-related calls, and perhaps set_offsets(), the CPU
utilization appears to be spread out over a very large number of
routines. This implies that higher levels of optimization, especially
those that can do automagic in-lining, would be worth considering.
This was validated in part by the performance boost of the version
with "full" optimization. The CPU time being spread out over a large
number of routines also suggests a great deal (perhaps too much?) of
abstraction and layering.
One-to-two-CPU scaling of the +O2 binary appears to be ~1.7X, and
one-to-four-CPU scaling almost 2.6X. For the "full" binary, two-CPU
scaling is only 1.37X and four-CPU scaling only 2X.
The performance of BIND 9.1.0's named appears to be roughly 1/2 that
of BIND 8.2.2pl5's based on figures in the documents at the URL above.
Even if all the time spent in pthread routines (shown below) were to
magically evaporate, it would seem that the BIND 9.1.0 named would
still be substantially slower.
Even when compiled as a non-threaded application, the 9.1.0 named's
performance is still far below that of 8.2.2pl5 for similar compiler
optimization levels. (With the HP-UX ANSI C compiler "-O" and "+O2"
are the same.) This implies (but does not confirm) that a great part
of the slowdown in the 9.1.0 named is _not_ in the HP-UX 11 platform's
threads implementation.
As seen in the profiles below, a great deal of time is spent in the
pthread routines for locking and unlocking mutexes. Analysis of the
mutexes appears to show that a DNS server serving requests via a
single IP address will suffer contention for the mutex protecting
data structures associated with the socket bound to that IP
address. Once this contention is lessened by going to more than one IP
address, contention appears to shift to the mutex protecting the
global ns_g_mctx data structure.
Beyond the contention on mutexes, there appear to be a great many
mutex lock/unlock pairs per transaction. One analysis had the number
of mutex lock/unlock pairs at over one hundred per transaction.
It also appears that during its lifetime, a request can be passed from
one thread to another. This may be related to the apparently large number
of mutex calls per transaction.
The author's naive assumption of the design of a multi-threaded named
would have been one in which each thread would behave as if it were a
named unto itself. It would read requests from the socket(s) assigned
to it and access the shared database with appropriate locks to
maintain the database's integrity. Any statistics kept would be kept
on a per-thread basis, and retrieval of overall statistics would
"walk" the threads in turn. It has been the author's experience
thus far that MP scaling is greatest when the design comes as close to
N separate instances as possible.
HP-UX 11i includes a new feature in the networking stack known as
"TOPS" or "Thread Optimized Packet Scheduling." At a high level, TOPS
remembers the CPU on which a socket was last accessed, and completes
inbound processing of packets for that socket on that CPU. This is a
refinement of the "Inbound Packet Scheduling" (IPS) of HP-UX 10.20
that only made decisions based on the addressing information in a
packet. The primary design center of TOPS is one where a single
thread of execution accesses one or more sockets and there is a high
packet rate through a single NIC, or significantly fewer NICs than
there are CPUs.
However, it would appear that the workflow model of BIND 9.1.0 is such
that multiple threads can be accessing the socket(s). Since these
threads execute on separate CPUs, there is no consistent CPU for any
one socket. Further, the actual packet rate for BIND 9.1.0 is very
low (less than 20 Mbit/s in these tests) and the bulk of the CPU time
is spent in user space, not in protocol processing. In this
environment, TOPS is not beneficial. One can disable TOPS by using adb
to poke the kernel variable "enable_tops" to zero.
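For reference, one common form of that adb poke is sketched below.
This is a sketch only, and the exact syntax may vary by HP-UX release;
writing kernel memory with adb is unforgiving, so consult your system
documentation before trying it.

```shell
# Sketch: set enable_tops to 0 in the running kernel. In adb, "/W"
# writes to the core file (/dev/kmem, i.e. the running kernel);
# "enable_tops?W 0" would instead patch the on-disk kernel image
# (/stand/vmunix) so the change survives a reboot.
echo "enable_tops/W 0" | adb -w /stand/vmunix /dev/kmem
```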
Appendix A: Configuration information
The initial HP-UX 11 tests were run on a system with the following
software installed:
A4926A B.11.00.11 1000Base-SX PCI Gigabit Ethernet Driver
A5230A B.11.00.05 100BT/9000 PCI
B3693AA C.03.20.00 HP GlancePlus/UX for s800 11.0
HPUXEng64RT B.11.00.01 English HP-UX 64-bit Runtime Environment
UXCoreMedia B.11.00.02 HP-UX Media Kit
XSWGR1100 B.11.00.47.08 General Release Patches, Nov. 1999 (ACE)
PHCO_21267 1.0 cumulative SAM/ObAM patch
PHNE_21865 1.0 1000Base-SX/T B.11.00.[02-11] cumul. patch
PHNE_21936 1.0 ftpd(1M) and ftp(1) patch
PHNE_22086 1.0 Streams Pty cumulative patch
PHNE_22244 1.0 Built-in PCI 100BASE-T patch
PHNE_22331 1.0 LAN product cumulative patch
PHNE_22397 1.0 cumulative ARPA Transport patch
PHNE_22542 1.0 PCI 100BT lan cumulative patch
The HP-UX 11i tests were run with the following software installed:
B3693AA C.03.20.00 HP GlancePlus/UX for s800 11i
CDE-English B.11.11 English CDE Environment
FDDI-00 B.11.11.01 PCI FDDI;Supptd HW=A3739A/A3739B;SW=J3626AA
FibrChanl-00 B.11.11.03 PCI FibreChannel;Supptd HW=A5158A
GigEther-00 B.11.11.14 PCI/HSC GigEther;Supptd HW=...
HPUX11i-OE-MC B.11.11 HP-UX Mission Critical Op. Env. Component
HPUXBase64 B.11.11 HP-UX 64-bit Base OS
HPUXBaseAux B.11.11 HP-UX Base OS Auxiliary
RAID-00 B.11.11.00 PCI RAID; Supptd HW=A5856A
Appendix B. Prospect Data
Some of the author's prior DNS server work employed a tool named puma
to get profiling information used in tuning and the like. Since that
time the puma profiling tool has been obsoleted. In its place, the
author used a tool named prospect. As of this writing, the author is
not certain of the public availability of the prospect tool. What
follows is some edited output of the prospect tool showing where time
was spent in various configurations. Full tool output will be
available via links in another appendix once the public disposition of
the prospect tool is determined.
One-CPU, L3000, named.9.1.0.O2, 11.0
pcnt accum% Routine name                 Instr     Filename
6% 6% set_offsets ENTRY ./named.9.1.0.O2
5% 11% pthread_mutex_lock CODE /usr/lib/libpthread.1
4% 15% dns_name_concatenate CODE ./named.9.1.0.O2
3% 18% dns_name_getlabelsequence CODE ./named.9.1.0.O2
3% 22% pthread_mutex_unlock CODE /usr/lib/libpthread.1
3% 25% dns_name_fullcompare CODE ./named.9.1.0.O2
3% 28% dns_rbt_findnode CODE ./named.9.1.0.O2
2% 30% dns_name_hash CODE ./named.9.1.0.O2
2% 33% pthread_mutex_lock ENTRY /usr/lib/libpthread.1
2% 35% memcpy CODE /usr/lib/libc.2
2% 37% dns_name_equal CODE ./named.9.1.0.O2
2% 39% __spin_lock CODE /usr/lib/libpthread.1
2% 41% __spin_unlock CODE /usr/lib/libpthread.1
2% 42% pthread_mutex_unlock ENTRY /usr/lib/libpthread.1
2% 44% dns_rdataset_towiresorted CODE ./named.9.1.0.O2
1% 45% dns_name_split CODE ./named.9.1.0.O2
1% 47% dns_compress_findglobal CODE ./named.9.1.0.O2
1% 48% dns_name_clone CODE ./named.9.1.0.O2
1% 49% find_up ENTRY ./named.9.1.0.O2
1% 50% dns_name_fromregion CODE ./named.9.1.0.O2
1% 52% query_addadditional ENTRY ./named.9.1.0.O2
1% 53% dns_name_towire CODE ./named.9.1.0.O2
1% 54% cache_find ENTRY ./named.9.1.0.O2
1% 55% memcpy ENTRY /usr/lib/libc.2
1% 56% isc__buffer_add CODE ./named.9.1.0.O2
1% 57% chain_name ENTRY ./named.9.1.0.O2
1% 58% detachnode ENTRY ./named.9.1.0.O2
1% 59% isc__mempool_get CODE ./named.9.1.0.O2
1% 60% __pthread_mutex_lock ENTRY /usr/lib/libpthread.1
1% 61% dns_name_init CODE ./named.9.1.0.O2
1% 62% __pthread_mutex_unlock ENTRY /usr/lib/libpthread.1
1% 62% __memset20 CODE /usr/lib/libc.2
1% 63% find_deepest_zonecut ENTRY ./named.9.1.0.O2
1% 64% isc__buffer_clear CODE ./named.9.1.0.O2
1% 65% memmove CODE /usr/lib/libc.2
1% 65% isc__buffer_putuint16 CODE ./named.9.1.0.O2
1% 66% qsort CODE /usr/lib/libc.2
1% 67% $$dyncall_external_20 MILLIC ./named.9.1.0.O2
1% 67% dns_aclelement_match CODE ./named.9.1.0.O2
1% 68% isc_rwlock_unlock CODE ./named.9.1.0.O2
1% 68% isc__mempool_put CODE ./named.9.1.0.O2
1% 69% new_reference ENTRY ./named.9.1.0.O2
1% 69% isc_rwlock_lock CODE ./named.9.1.0.O2
0% 70% dns_compress_add CODE ./named.9.1.0.O2
0% 70% isc__buffer_availableregion CODE ./named.9.1.0.O2
0% 71% dns_name_setbuffer CODE ./named.9.1.0.O2
0% 71% dns_acl_match CODE ./named.9.1.0.O2
0% 72% pthread_mutex_lock STUB ./named.9.1.0.O2
0% 72% dns_rdata_towire CODE ./named.9.1.0.O2
0% 73% isc_buffer_getuint16 CODE ./named.9.1.0.O2
0% 73% memmove ENTRY /usr/lib/libc.2
0% 74% dns_message_rendersection CODE ./named.9.1.0.O2
0% 74% dns_rdataset_isassociated CODE ./named.9.1.0.O2
0% 74% pthread_mutex_unlock STUB ./named.9.1.0.O2
0% 75% memcpy STUB ./named.9.1.0.O2
0% 75% isc__buffer_init CODE ./named.9.1.0.O2
0% 76% cache_findrdataset ENTRY ./named.9.1.0.O2
0% 76% dns_name_getlabel CODE ./named.9.1.0.O2
0% 76% dns_compress_invalidate CODE ./named.9.1.0.O2
0% 77% client_request ENTRY ./named.9.1.0.O2
0% 77% dns_message_findname CODE ./named.9.1.0.O2
0% 77% isc_netaddr_eqprefix CODE ./named.9.1.0.O2
0% 78% mallinfo CODE /usr/lib/libc.2
0% 78% dns_zt_find CODE ./named.9.1.0.O2
0% 78% query_find ENTRY ./named.9.1.0.O2
0% 79% dns_name_fromwire CODE ./named.9.1.0.O2
0% 79% exit_check ENTRY ./named.9.1.0.O2
0% 79% dns_db_find CODE ./named.9.1.0.O2
0% 80% dns_db_findrdataset CODE ./named.9.1.0.O2
0% 80% msgresetnames ENTRY ./named.9.1.0.O2
0% 80% dns_rdata_fromregion CODE ./named.9.1.0.O2
0% 80% dns_rdataset_current CODE ./named.9.1.0.O2
Two-CPU, L3000, named.9.1.0.O2, 11.0
pcnt accum% Routine name                 Instr     Filename
10% 10% pthread_mutex_lock CODE /usr/lib/libpthread.1
5% 16% set_offsets ENTRY ./named.9.1.0.O2
3% 19% pthread_mutex_unlock CODE /usr/lib/libpthread.1
3% 22% dns_rbt_findnode CODE ./named.9.1.0.O2
3% 26% find_up ENTRY ./named.9.1.0.O2
3% 28% __spin_lock CODE /usr/lib/libpthread.1
2% 31% dns_name_getlabelsequence CODE ./named.9.1.0.O2
2% 33% dns_name_equal CODE ./named.9.1.0.O2
2% 34% dns_name_fullcompare CODE ./named.9.1.0.O2
2% 36% __memset20 CODE /usr/lib/libc.2
2% 38% dns_name_concatenate CODE ./named.9.1.0.O2
2% 40% dns_name_hash CODE ./named.9.1.0.O2
2% 42% pthread_mutex_lock ENTRY /usr/lib/libpthread.1
2% 43% __spin_unlock CODE /usr/lib/libpthread.1
2% 45% memcpy CODE /usr/lib/libc.2
1% 46% dns_rdataset_towiresorted CODE ./named.9.1.0.O2
1% 47% pthread_mutex_unlock ENTRY /usr/lib/libpthread.1
1% 49% cache_find ENTRY ./named.9.1.0.O2
1% 50% dns_compress_findglobal CODE ./named.9.1.0.O2
1% 51% detachnode ENTRY ./named.9.1.0.O2
1% 52% dns_name_clone CODE ./named.9.1.0.O2
1% 53% isc_rwlock_unlock CODE ./named.9.1.0.O2
1% 54% query_addadditional ENTRY ./named.9.1.0.O2
1% 55% dns_name_split CODE ./named.9.1.0.O2
1% 56% dns_name_towire CODE ./named.9.1.0.O2
1% 57% dns_name_fromregion CODE ./named.9.1.0.O2
1% 58% __pthread_mutex_unlock ENTRY /usr/lib/libpthread.1
1% 59% isc_rwlock_lock CODE ./named.9.1.0.O2
1% 60% isc__mempool_get CODE ./named.9.1.0.O2
1% 60% dns_aclelement_match CODE ./named.9.1.0.O2
1% 61% __pthread_mutex_lock ENTRY /usr/lib/libpthread.1
1% 62% chain_name ENTRY ./named.9.1.0.O2
1% 63% qsort CODE /usr/lib/libc.2
1% 63% isc__buffer_add CODE ./named.9.1.0.O2
1% 64% isc__mempool_put CODE ./named.9.1.0.O2
1% 65% dns_name_init CODE ./named.9.1.0.O2
1% 65% dns_compress_add CODE ./named.9.1.0.O2
1% 66% memcpy ENTRY /usr/lib/libc.2
1% 67% isc__buffer_putuint16 CODE ./named.9.1.0.O2
1% 67% new_reference ENTRY ./named.9.1.0.O2
1% 68% $$dyncall_external_20 MILLIC ./named.9.1.0.O2
1% 68% __errno STUB /usr/lib/libc.2
1% 69% cache_findrdataset ENTRY ./named.9.1.0.O2
1% 69% memmove CODE /usr/lib/libc.2
1% 70% dns_zt_find CODE ./named.9.1.0.O2
1% 71% dns_rdata_towire CODE ./named.9.1.0.O2
0% 71% mem_getunlocked ENTRY ./named.9.1.0.O2
0% 71% isc_netaddr_eqprefix CODE ./named.9.1.0.O2
0% 72% dns_message_rendersection CODE ./named.9.1.0.O2
0% 72% isc__buffer_availableregion CODE ./named.9.1.0.O2
0% 73% dns_name_setbuffer CODE ./named.9.1.0.O2
0% 73% find_deepest_zonecut ENTRY ./named.9.1.0.O2
0% 74% mallinfo CODE /usr/lib/libc.2
0% 74% dns_acl_match CODE ./named.9.1.0.O2
0% 74% mem_putunlocked ENTRY ./named.9.1.0.O2
0% 75% dns_message_findname CODE ./named.9.1.0.O2
0% 75% isc__buffer_clear CODE ./named.9.1.0.O2
0% 76% attach ENTRY ./named.9.1.0.O2
0% 76% pthread_mutex_lock STUB ./named.9.1.0.O2
0% 76% dns_rdataset_isassociated CODE ./named.9.1.0.O2
0% 77% isc_timer_reset CODE ./named.9.1.0.O2
0% 77% isc_buffer_getuint16 CODE ./named.9.1.0.O2
0% 77% isc__buffer_init CODE ./named.9.1.0.O2
0% 78% msgresetnames ENTRY ./named.9.1.0.O2
0% 78% pthread_mutex_unlock STUB ./named.9.1.0.O2
0% 78% client_request ENTRY ./named.9.1.0.O2
0% 79% dns_compress_invalidate CODE ./named.9.1.0.O2
0% 79% dispatch ENTRY ./named.9.1.0.O2
0% 79% memcmp ENTRY /usr/lib/libc.2
0% 79% dns_rdataset_init CODE ./named.9.1.0.O2
[ The rest is not shown, to save space. ]
Four-CPU, L3000, named.9.1.0.O2, 11.0
pcnt accum% Routine name                 Instr     Filename
17% 17% pthread_mutex_lock CODE /usr/lib/libpthread.1
4% 21% set_offsets ENTRY ./named.9.1.0.O2
3% 24% pthread_mutex_unlock CODE /usr/lib/libpthread.1
3% 27% dns_rbt_findnode CODE ./named.9.1.0.O2
3% 30% __spin_lock CODE /usr/lib/libpthread.1
3% 32% find_up ENTRY ./named.9.1.0.O2
2% 35% dns_name_getlabelsequence CODE ./named.9.1.0.O2
2% 36% dns_name_fullcompare CODE ./named.9.1.0.O2
2% 38% dns_name_concatenate CODE ./named.9.1.0.O2
2% 40% dns_name_hash CODE ./named.9.1.0.O2
2% 42% __memset20 CODE /usr/lib/libc.2
2% 44% dns_name_equal CODE ./named.9.1.0.O2
2% 45% __spin_unlock CODE /usr/lib/libpthread.1
1% 47% isc_rwlock_unlock CODE ./named.9.1.0.O2
1% 48% isc_rwlock_lock CODE ./named.9.1.0.O2
1% 49% memcpy CODE /usr/lib/libc.2
1% 51% cache_find ENTRY ./named.9.1.0.O2
1% 52% dns_rdataset_towiresorted CODE ./named.9.1.0.O2
1% 53% pthread_mutex_lock ENTRY /usr/lib/libpthread.1
1% 55% detachnode ENTRY ./named.9.1.0.O2
1% 56% pthread_mutex_unlock ENTRY /usr/lib/libpthread.1
1% 57% dns_name_clone CODE ./named.9.1.0.O2
1% 58% dns_compress_findglobal CODE ./named.9.1.0.O2
1% 59% isc__mempool_get CODE ./named.9.1.0.O2
1% 59% query_addadditional ENTRY ./named.9.1.0.O2
1% 60% dns_name_split CODE ./named.9.1.0.O2
1% 61% dns_name_towire CODE ./named.9.1.0.O2
1% 62% dns_aclelement_match CODE ./named.9.1.0.O2
1% 63% isc__buffer_add CODE ./named.9.1.0.O2
1% 63% new_reference ENTRY ./named.9.1.0.O2
1% 64% dns_name_fromregion CODE ./named.9.1.0.O2
1% 65% chain_name ENTRY ./named.9.1.0.O2
1% 65% __pthread_mutex_unlock ENTRY /usr/lib/libpthread.1
1% 66% isc__mempool_put CODE ./named.9.1.0.O2
1% 67% memcpy ENTRY /usr/lib/libc.2
1% 67% __pthread_mutex_lock ENTRY /usr/lib/libpthread.1
1% 68% attach ENTRY ./named.9.1.0.O2
1% 68% $$dyncall_external_20 MILLIC ./named.9.1.0.O2
1% 69% mem_getunlocked ENTRY ./named.9.1.0.O2
1% 70% dns_zt_find CODE ./named.9.1.0.O2
1% 70% __errno STUB /usr/lib/libc.2
1% 71% qsort CODE /usr/lib/libc.2
1% 71% isc__buffer_putuint16 CODE ./named.9.1.0.O2
1% 72% memmove CODE /usr/lib/libc.2
0% 72% dns_name_init CODE ./named.9.1.0.O2
0% 73% client_request ENTRY ./named.9.1.0.O2
0% 73% cache_findrdataset ENTRY ./named.9.1.0.O2
0% 74% dns_message_rendersection CODE ./named.9.1.0.O2
0% 74% dns_acl_match CODE ./named.9.1.0.O2
0% 74% mem_putunlocked ENTRY ./named.9.1.0.O2
0% 75% dns_compress_add CODE ./named.9.1.0.O2
0% 75% detach ENTRY ./named.9.1.0.O2
0% 76% dns_rdata_towire CODE ./named.9.1.0.O2
0% 76% isc__buffer_clear CODE ./named.9.1.0.O2
0% 76% mallinfo CODE /usr/lib/libc.2
0% 77% isc__buffer_availableregion CODE ./named.9.1.0.O2
0% 77% isc_timer_reset CODE ./named.9.1.0.O2
0% 78% find_deepest_zonecut ENTRY ./named.9.1.0.O2
0% 78% dns_message_findname CODE ./named.9.1.0.O2
0% 78% dns_name_fromwire CODE ./named.9.1.0.O2
0% 79% query_getcachedb ENTRY ./named.9.1.0.O2
0% 79% dns_name_setbuffer CODE ./named.9.1.0.O2
0% 79% isc_heap_delete CODE ./named.9.1.0.O2
0% 79% sooner ENTRY ./named.9.1.0.O2
One-CPU, L3000, named.9.1.0.O2.nothreads, 11.0
pcnt accum% Routine name                 Instr     Filename
8% 8% set_offsets ENTRY ./named.9.1.0.O2.nothreads
5% 13% dns_name_getlabelsequence CODE ./named.9.1.0.O2.nothreads
4% 17% dns_rbt_findnode CODE ./named.9.1.0.O2.nothreads
4% 21% dns_name_fullcompare CODE ./named.9.1.0.O2.nothreads
3% 25% dns_name_concatenate CODE ./named.9.1.0.O2.nothreads
3% 28% dns_name_hash CODE ./named.9.1.0.O2.nothreads
3% 31% dns_name_equal CODE ./named.9.1.0.O2.nothreads
2% 33% memcpy CODE /usr/lib/libc.2
2% 35% dns_compress_findglobal CODE ./named.9.1.0.O2.nothreads
2% 37% dns_rdataset_towiresorted CODE ./named.9.1.0.O2.nothreads
2% 38% dns_name_clone CODE ./named.9.1.0.O2.nothreads
1% 40% __memset20 CODE /usr/lib/libc.2
1% 41% dns_name_split CODE ./named.9.1.0.O2.nothreads
1% 43% find_up ENTRY ./named.9.1.0.O2.nothreads
1% 44% isc__buffer_add CODE ./named.9.1.0.O2.nothreads
1% 46% query_addadditional ENTRY ./named.9.1.0.O2.nothreads
1% 47% dns_name_towire CODE ./named.9.1.0.O2.nothreads
1% 48% dns_name_fromregion CODE ./named.9.1.0.O2.nothreads
1% 49% detachnode ENTRY ./named.9.1.0.O2.nothreads
1% 51% cache_find ENTRY ./named.9.1.0.O2.nothreads
1% 52% isc__mempool_get CODE ./named.9.1.0.O2.nothreads
1% 53% memcpy ENTRY /usr/lib/libc.2
1% 54% dns_name_init CODE ./named.9.1.0.O2.nothreads
1% 55% chain_name ENTRY ./named.9.1.0.O2.nothreads
1% 56% $$dyncall_external_20 MILLIC ./named.9.1.0.O2.nothreads
1% 57% find_deepest_zonecut ENTRY ./named.9.1.0.O2.nothreads
1% 58% qsort CODE /usr/lib/libc.2
1% 58% dns_aclelement_match CODE ./named.9.1.0.O2.nothreads
1% 59% isc__buffer_putuint16 CODE ./named.9.1.0.O2.nothreads
1% 60% isc__buffer_clear CODE ./named.9.1.0.O2.nothreads
1% 61% dns_rdataset_isassociated CODE ./named.9.1.0.O2.nothreads
1% 61% memmove CODE /usr/lib/libc.2
1% 62% new_reference ENTRY ./named.9.1.0.O2.nothreads
1% 62% name_hash ENTRY ./named.9.1.0.O2.nothreads
1% 63% isc__mempool_put CODE ./named.9.1.0.O2.nothreads
1% 64% dns_name_setbuffer CODE ./named.9.1.0.O2.nothreads
1% 64% dns_compress_add CODE ./named.9.1.0.O2.nothreads
1% 65% dns_name_fromwire CODE ./named.9.1.0.O2.nothreads
1% 65% dns_rdata_towire CODE ./named.9.1.0.O2.nothreads
1% 66% isc__buffer_init CODE ./named.9.1.0.O2.nothreads
1% 66% dns_message_rendersection CODE ./named.9.1.0.O2.nothreads
1% 67% cache_findrdataset ENTRY ./named.9.1.0.O2.nothreads
1% 67% isc__buffer_availableregion CODE ./named.9.1.0.O2.nothreads
0% 68% dns_name_getlabel CODE ./named.9.1.0.O2.nothreads
0% 68% rdataset_current ENTRY ./named.9.1.0.O2.nothreads
0% 69% query_find ENTRY ./named.9.1.0.O2.nothreads
0% 69% isc_netaddr_eqprefix CODE ./named.9.1.0.O2.nothreads
0% 70% dns_compress_invalidate CODE ./named.9.1.0.O2.nothreads
0% 70% dns_acl_match CODE ./named.9.1.0.O2.nothreads
0% 71% dns_db_find CODE ./named.9.1.0.O2.nothreads
0% 71% dns_message_findname CODE ./named.9.1.0.O2.nothreads
0% 72% memcpy STUB ./named.9.1.0.O2.nothreads
0% 72% client_request ENTRY ./named.9.1.0.O2.nothreads
0% 72% dns_rdataset_current CODE ./named.9.1.0.O2.nothreads
0% 73% mallinfo CODE /usr/lib/libc.2
0% 73% msgresetnames ENTRY ./named.9.1.0.O2.nothreads
0% 74% dns_rdataset_init CODE ./named.9.1.0.O2.nothreads
0% 74% dns_db_findrdataset CODE ./named.9.1.0.O2.nothreads
0% 74% dns_rdataset_next CODE ./named.9.1.0.O2.nothreads
0% 75% memmove ENTRY /usr/lib/libc.2
0% 75% dns_db_attach CODE ./named.9.1.0.O2.nothreads
0% 75% query_newname ENTRY ./named.9.1.0.O2.nothreads
0% 76% isc_rwlock_lock CODE ./named.9.1.0.O2.nothreads
0% 76% dns_rdataset_additionaldata CODE ./named.9.1.0.O2.nothreads
0% 76% dns_zt_find CODE ./named.9.1.0.O2.nothreads
0% 76% dns_compress_getmethods CODE ./named.9.1.0.O2.nothreads
0% 77% isc_buffer_getuint16 CODE ./named.9.1.0.O2.nothreads
0% 77% dispatch ENTRY ./named.9.1.0.O2.nothreads
0% 77% query_getcachedb ENTRY ./named.9.1.0.O2.nothreads
0% 78% memset ENTRY /usr/lib/libc.2
0% 78% dns_rdata_fromregion CODE ./named.9.1.0.O2.nothreads
0% 78% isc_rwlock_unlock CODE ./named.9.1.0.O2.nothreads
0% 79% dns_db_detach CODE ./named.9.1.0.O2.nothreads
0% 79% towire_ns ENTRY ./named.9.1.0.O2.nothreads
0% 79% isc_socket_recv CODE ./named.9.1.0.O2.nothreads
0% 79% memset CODE /usr/lib/libc.2
0% 80% mem_getunlocked ENTRY ./named.9.1.0.O2.nothreads
0% 80% dns_rdata_additionaldata CODE ./named.9.1.0.O2.nothreads
0% 80% dns_db_detachnode CODE ./named.9.1.0.O2.nothreads
0% 80% dns_rdata_toregion CODE ./named.9.1.0.O2.nothreads
Appendix X. Half-cooked Data
The following are the aggregated data for each of the runs mentioned
elsewhere in the document.
L3000, UP, named9.1.0stock - compiled as it comes out of the tarball
for HP-UX (-g, no optimization):
Threads Ops
1 1218
2 1757
3 1861
4 1920
5 1961
6 1984
7 2000
8 2024
9 2032
10 2039
11 2050
12 2041
13 2055
14 2059
15 2063
16 2065
Peak ops: 2065 with 16 threads
L3000, UP, named9.1.0.O2 - compilation with +O2 instead of -g
Threads Ops
1 1704
2 2442
3 2728
4 2836
5 2903
6 2966
7 2984
8 3043
9 3063
10 3083
11 3088
12 3078
13 3103
14 3118
15 3122
16 3127
Peak ops: 3127 with 16 threads
L3000, UP, named9.1.0.O2 - compilation with +O2 +DD64 for a 64-bit binary
Threads Ops
1 1717
2 2474
3 2682
4 2793
5 2851
6 2900
7 2928
8 2985
9 3011
10 3000
11 3011
12 3016
13 3039
14 3029
15 3027
16 3037
Peak ops: 3039 with 13 threads
L3000, 2P, named9.1.0.O2 - compilation with +O2 instead of -g
Threads Ops
2 2647
4 3838
6 4650
8 4811
10 4978
12 5130
14 5212
16 5285
18 5282
20 5308
22 5316
24 5332
26 5326
28 5332
30 5338
32 5326
Peak ops: 5338 with 30 threads
L3000, 4P, named9.1.0.O2, 11.ACE - compilation with +O2 instead of -g
Threads Ops
2 2766
4 4324
6 5524
8 6466
10 7153
12 7459
14 7741
16 7790
18 7776
20 7679
22 7970
24 7866
26 7989
28 8065
30 7985
32 7935
Peak ops: 8065 with 28 threads
L3000, 4P, named9.1.0.O2, 11i, TOPS - compilation with +O2 instead of -g
Threads Ops
2 2782
4 4724
6 5869
8 6491
10 7107
12 7388
14 7592
16 7670
18 7710
20 7724
22 7729
24 7745
26 7755
28 7783
30 7735
32 7792
Peak ops: 7792 with 32 threads
L3000, 4P, named9.1.0.O2, 11i, noTOPS - compilation with +O2 instead of -g
Threads Ops
2 2793
4 4780
6 5950
8 6568
10 7229
12 7513
14 7733
16 7783
18 7853
20 7895
22 7883
24 7916
26 7910
28 7873
30 7930
32 7927
Peak ops: 7930 with 30 threads
L3000, 4P, named.9.1.0.O2.nothreads, 11.ACE - compilation with
+O2 instead of -g, also configure --disable-threads
Threads Ops
2 4253
4 4556
6 4622
8 4802
10 4806
12 4689
14 4659
16 4637
18 4673
20 4717
22 4775
24 4810
26 4726
28 4695
30 4781
32 4774
Peak ops: 4810 with 24 threads
L3000, 4P, named.9.1.0.O2.nothreads, 11.ACE - compilation with +O2
instead of -g, also configure --disable-threads
Threads Ops
2 4207
4 4554
6 4696
8 4756
10 4757
12 4753
14 4748
16 4741
18 4742
20 4747
22 4750
24 4744
26 4747
28 4746
30 4747
32 4748
Peak ops: 4757 with 10 threads
L3000, UP, named.9.1.0.O2.nothreads, 11.ACE - compilation with +O2
instead of -g, also configure --disable-threads
Threads Ops
2 3889
4 4175
6 4243
8 4302
10 4286
12 4285
14 4301
16 4303
18 4298
20 4279
22 4277
24 4293
26 4275
28 4285
30 4285
32 4290
Peak ops: 4303 with 16 threads
L3000, 2P, named.9.1.0.O2.PBO, 11.ACE - compilation with +O2
instead of -g, PBO based on training done on the two-CPU configuration
Threads Ops
2 2809
4 3962
6 4963
8 5275
10 5458
12 5612
14 5713
16 5776
18 5826
20 5808
22 5798
24 5836
26 5868
28 5858
30 5860
32 5722
Peak ops: 5868 with 26 threads
L3000, 2P, named.9.1.0.O4.PBO, 11.ACE - compilation with +O4
instead of -g, PBO based on training done on the two-CPU configuration
Threads Ops
2 3096
4 4029
6 5325
8 5719
10 5878
12 6136
14 6103
16 6133
18 6220
20 6287
22 6266
24 6230
26 6134
28 6137
30 6118
32 6109
Peak ops: 6287 with 20 threads
L3000, UP, named.9.1.0.O4.PBO.chatr, 11.ACE - compilation with +O4, PBO
based on training done on the two-CPU configuration, static branch
prediction, large pages for instruction and data
Threads Ops
2 3539
4 4338
6 4597
8 4751
10 4821
12 4840
14 4867
16 4883
18 4897
20 4929
22 4928
24 4919
26 4943
28 4936
30 4930
32 4937
Peak ops: 4943 with 26 threads
L3000, 2P, named.9.1.0.O4.PBO.chatr, 11.ACE - compilation with +O4, PBO
based on training done on the two-CPU configuration, static branch
prediction, large pages for instruction and data
Threads Ops
2 3085
4 4789
6 5641
8 6045
10 6422
12 6634
14 6711
16 6771
18 6736
20 6728
22 6754
24 6730
26 6731
28 6706
30 6750
32 6721
Peak ops: 6771 with 16 threads
L3000, 4P, named.9.1.0.O4.PBO.chatr, 11.ACE - compilation with +O4, PBO
based on training done on the two-CPU configuration, static branch
prediction, large pages for instruction and data
Threads Ops
2 3271
4 5148
6 6429
8 7615
10 8313
12 9021
14 9353
16 9660
18 9793
20 9871
22 9856
24 9903
26 9941
28 10003
30 10006
32 9873
Peak ops: 10006 with 30 threads
J5600, 2P, named.9.1.0.O4.PBO.chatr, 11.ACE - compilation with +O4, PBO
based on training done on the two-CPU configuration, static branch
prediction, large pages for instruction and data
Threads Ops
2 3386
4 5195
6 5956
8 6463
10 6856
12 7230
14 7427
16 7459
18 7307
20 7369
22 7458
24 7434
26 7382
28 7421
30 7428
32 7386
Peak ops: 7459 with 16 threads
L3000, UP, named.9.1.0.O4.PBO.chatr, 11i - compilation with +O4, PBO
based on training done on the two-CPU configuration, static branch
prediction, large pages for instruction and data, TOPS enabled
Threads Ops
2 2927
4 4000
6 4342
8 4447
10 4539
12 4676
14 4694
16 4696
18 4731
20 4768
22 4747
24 4758
26 4778
28 4774
30 4829
32 4832
Peak ops: 4832 with 32 threads
L3000, 2P, named.9.1.0.O4.PBO.chatr, 11i - compilation with +O4, PBO
based on training done on the two-CPU configuration, static branch
prediction, large pages for instruction and data, TOPS enabled
Threads Ops
2 3332
4 5093
6 5767
8 6209
10 6396
12 6624
14 6655
16 6696
18 6690
20 6712
22 6741
24 6660
26 6733
28 6750
30 6722
32 6734
Peak ops: 6750 with 28 threads
L3000, 4P, named.9.1.0.O4.PBO.chatr, 11i - compilation with +O4, PBO
based on training done on the two-CPU configuration, static branch
prediction, large pages for instruction and data, TOPS enabled
Threads Ops
2 3375
4 5524
6 6840
8 7680
10 8440
12 8894
14 9297
16 9417
18 9417
20 9534
22 9514
24 9538
26 9584
28 9599
30 9566
32 9563
Peak ops: 9599 with 28 threads
L3000, 4P, named.9.1.0.O4.PBO.chatr, 11i - compilation with +O4, PBO
based on training done on the two-CPU configuration, static branch
prediction, large pages for instruction and data, TOPS disabled
Threads Ops
2 3344
4 5562
6 6921
8 7731
10 8587
12 9082
14 9431
16 9571
18 9638
20 9613
22 9683
24 9706
26 9702
28 9730
30 9752
32 9761
Peak ops: 9761 with 32 threads
Appendix T: A Brief Demonstration of TOPS
While the comparatively low packet rate, high per-transaction
overhead, and multi-threaded model of the named from BIND 9.1.0 might
not fit the model for which TOPS was designed, TOPS does indeed have
value :)
To demonstrate that value, the results of two sets of aggregate
netperf TCP_RR tests are presented.
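For reference, each instance of such an aggregate test looks roughly
like the invocation below (the hostname is a placeholder; several
concurrent copies are run and their transaction rates summed):

```shell
# One instance of a single-byte request/response TCP_RR test:
# -H names the remote netserver, -l sets the test length in seconds,
# and after the "--" the test-specific -r option sets the request
# and response sizes to one byte each.
netperf -H testsystem -t TCP_RR -l 60 -- -r 1,1
```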
Figure T1.
TOPS versus no TOPS
netperf single-byte, aggregate TCP_RR
Core 100BT, 4x550 MHz L3000
Throughput
Trans/s
no TOPS 51,704
TOPS 52,526
The difference above is constrained because the NIC is beginning to
approach the maximum packet-per-second limits of a 100BT network. The
maximum for 100BT is on the order of 140,000 pps, and 52,526
transactions per second is ~105,000 packets per second (two packets
per transaction).
This 100BT NIC was also connected to a "timeshare" network rather than
a test network. Further, had this been a system with more and/or
less-powerful CPUs (the 550 MHz PA-8600 being the most powerful
PA-RISC processor on the market as of this writing), the performance
benefits of TOPS would have been more striking.
In the output of vsar below (similar data can be seen in the "per CPU"
statistics of Glance) with TOPS disabled we see that one of the four
CPUs - the CPU taking interrupts from the NIC - is completely
saturated, but there is non-trivial idle time remaining on the other
three CPUs.
CPU usr sys intr idle
0 20.94 61.01 0.49 17.52
1 0.37 6.21 92.49 0.91
2 20.06 61.91 0.31 17.71
3 18.16 63.73 0.29 17.71
ALL 14.88 48.22 23.40 13.46
Now, with TOPS enabled, we can see some of that interrupt processing
spread out across the CPUs, and a higher percentage of idle time
overall.
CPU usr sys intr idle
0 23.93 44.96 11.64 19.47
1 8.19 17.17 64.65 10.00
2 20.30 47.89 12.07 19.74
3 20.15 47.80 12.26 19.79
ALL 18.14 39.45 25.16 17.25
Appendix S. Tracking-down the poor scaling
Well, if you have read this far, you deserve a reward, so let us
discuss the scaling of the named from BIND 9.1.0 in greater detail.
As you probably noticed, while absolute performance increased at
higher levels of optimization, MP scaling decreased. This implies
that as the code became more efficient, contention for mutexes
increased. The increase in time spent in mutex calls, as shown in the
prospect profiles, points the same way.
HP has a number of tools which can be used to measure mutex contention
in a threaded process. One of these is an internal tool known as
mutex_perf, which can decode the _pth.mumble file created when an
application is linked against an instrumented libpthread library. That same
library includes a call, pthread_mutex_setname_np(mp, name), which can
be used to associate an ASCII string of up to 32 characters with the
mutex pointed to by mp. A small change to lib/isc/pthreads/thread.c
adds a routine:
#ifdef TRACED_MUTEXES
#include <pthread.h>
#include <stdio.h>

isc_result_t
isc_mutex_init_traced(pthread_mutex_t *mp, int line, char *filename) {
        char name[32];

        /* build a name from the call site, e.g. "line 1146 file socket.c" */
        snprintf(name, sizeof(name), "line %d file %s", line, filename);
        if (pthread_mutex_init(mp, NULL) == 0) {
                /* give it a name */
                pthread_mutex_setname_np(mp, name);
                return ISC_R_SUCCESS;
        } else {
                return ISC_R_UNEXPECTED;
        }
}
#endif /* TRACED_MUTEXES */
and then to lib/isc/pthreads/include/isc/mutex.h:
#ifdef TRACED_MUTEXES
#define isc_mutex_init(mp) isc_mutex_init_traced((mp),__LINE__,__FILE__)
#else
#define isc_mutex_init(mp) \
((pthread_mutex_init((mp), NULL) == 0) ? \
ISC_R_SUCCESS : ISC_R_UNEXPECTED)
#endif /* TRACED_MUTEXES */
and then poof, we can get overall mutex information like this:
General information:
Filename of instrumentation: _pth.mutex.21591
Instrumentation version: MUTEX_PERF_VERSION_1
Process ID: 21591
Run date: Tue Feb 13 10:48:18 2001
Number of processors available: 4
Total number created threads: 7
Maximum simul. running threads: 7
Ordering of mutexes displayed: Highest used mutexes first
Types of mutexes displayed: Used and unused mutexes
Count of initialized mutexes: 2508
Count of used mutexes: 1187
Mutexes with immediate acquisition: 1187
Mutexes with acquisition after spin: 31
Mutexes with acquisition after yield: 34
Mutexes with acquisition after sleep: 26
Mutexes with waiters on unlock: 26
Maximum number of waiters on an unlock: 1
and per mutex information like this:
Per lock statistics:
Mutex name: line 1146 file socket.c
Mutex init order: 228
Mutex address: 0x4006afa0
Mutex init: 0x4006afa0 (address of init or 1st lock)
Mutex type: PTHREAD_PROCESS_PRIVATE
Mutex type: PTHREAD_MUTEX_NORMAL
Mutex spin cnt: 700
Mutex yield freq: every 100 spins
Lock success calls: 14417389
Lock failed calls: 0
Trylock success calls: 0
Trylock failed calls: 0
Unlock success calls: 14417389
Unlock failed calls: 0
Lock immediately: 10310698
Lock during 1st spin: 1067505
Lock after yield1: 1834545
Lock after yield2: 630044
Lock after yield3: 271664
Lock after yield4: 116507
Lock after yield5: 52017
Lock after yield6: 28416
Lock after yield7: 9384
Lock after yield8: 0
Lock after yield9: 0
Lock after yield10/more:0
Lock after sleep (hdf): 0
Lock after sleep1: 85724
Lock after sleep2/more: 10885
Lock had waiters: 90558
Lock max waiters: 4
Lock total waiters: 104595
Unlock had waiters: 108550
Unlock max waiters: 4
Unlock total waiters: 131540
This run was simply requesting two separate names. You will notice
that a very large number of mutexes were created, but a smaller number
were "active." In a test accessing a larger number of names, even more
mutexes are created.
The per-mutex data for "line 1146 file socket.c" shows that the mutex
is acquired immediately about 10 times out of every 14. About 4 times
out of 14, however, it was necessary either to spin for a short while
or actually go to sleep for some time before the mutex could be
acquired. In other words, there was contention for this mutex.
Going back to the code, line 1146 in socket.c is in the routine
allocate_socket(). So, it would seem that there are per-socket data
structures being protected by this lock. Given that the test used only
one active socket, it is not all that surprising that there could be
some contention on that lock. One easy way to minimize that contention
might be to have more than one socket. We can do that by adding an
address alias to the Gigabit interface in use and updating some of the
clients.
#active sockets    relative perf
      one              1.00
      two              1.08
     four              0.97
There was a nearly 8% improvement when named was accessing two
sockets. However, something reduced performance when four sockets were
being accessed. Going back to the instrumented named, when four
sockets are in use, the socket mutexes no longer appear at the top of
the list; they have been displaced by another one.
There are now four socket mutexes active, each looking something like
this:
Mutex name: line 1146 file socket.c
Mutex init order: 228
Mutex address: 0x4006afa0
Mutex init: 0x4006afa0 (address of init or 1st lock)
Mutex type: PTHREAD_PROCESS_PRIVATE
Mutex type: PTHREAD_MUTEX_NORMAL
Mutex spin cnt: 700
Mutex yield freq: every 100 spins
Lock success calls: 318577
Lock failed calls: 0
Trylock success calls: 0
Trylock failed calls: 0
Unlock success calls: 318577
Unlock failed calls: 0
Lock immediately: 283532
Lock during 1st spin: 8789
Lock after yield1: 15935
Lock after yield2: 5291
Lock after yield3: 1841
Lock after yield4: 892
Lock after yield5: 353
Lock after yield6: 230
Lock after yield7: 83
Lock after yield8: 0
Lock after yield9: 0
Lock after yield10/more:0
Lock after sleep (hdf): 0
Lock after sleep1: 1555
Lock after sleep2/more: 76
Lock had waiters: 1394
Lock max waiters: 2
Lock total waiters: 1442
Unlock had waiters: 1709
Unlock max waiters: 3
Unlock total waiters: 1913
You can see how the number of times the lock was acquired without
contention has increased, to roughly 28 out of every 31.
This lock, though, has risen to the top:
Mutex name: line 684 file mem.c
Mutex init order: 86
Mutex address: 0x40042070
Mutex init: 0x40042070 (address of init or 1st lock)
Mutex type: PTHREAD_PROCESS_PRIVATE
Mutex type: PTHREAD_MUTEX_NORMAL
Mutex spin cnt: 700
Mutex yield freq: every 100 spins
Lock success calls: 18431179
Lock failed calls: 0
Trylock success calls: 0
Trylock failed calls: 0
Unlock success calls: 18431179
Unlock failed calls: 0
Lock immediately: 14035389
Lock during 1st spin: 4042204
Lock after yield1: 314045
Lock after yield2: 13290
Lock after yield3: 6502
Lock after yield4: 7347
Lock after yield5: 4936
Lock after yield6: 2848
Lock after yield7: 1335
Lock after yield8: 0
Lock after yield9: 0
Lock after yield10/more:0
Lock after sleep (hdf): 0
Lock after sleep1: 3268
Lock after sleep2/more: 15
Lock had waiters: 3416
Lock max waiters: 3
Lock total waiters: 4398
Unlock had waiters: 3298
Unlock max waiters: 3
Unlock total waiters: 4706
with roughly 4 contended lockings out of every 18 attempts. There are
two "mem.c" files - one is in openssl - and since this was not an
encrypted test, we can probably assume it was lib/isc/mem.c, where the
routine is isc_mem_createx(). There were four mutexes initialized by
"line 684 file mem.c", and a search of the source code found only two
files with "isc_mem_createx" present: ./lib/isc/include/isc/mem.h and
./lib/isc/mem.c. As it turns out, isc_mem_createx is only called from
isc_mem_create, and that call is the only thing in isc_mem_create (CS
101 coding style and complete faith in the compiler, I suppose :) The
routine isc_mem_create() is called all over the place, and the author
did not care to wade through all of that too deeply :)
OK, the author lied - about an hour later, it dawned on him that it
would be possible to further modify the isc_mutex_init_traced routine
to dump a stack trace to a file for each mutex initialized. In this
way, it would be possible to go from a mutex address to a stack trace
of its initialization. The mods to thread.c become:
#ifdef TRACED_MUTEXES
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

extern void U_STACK_TRACE(void);

/* temporarily point stderr at filest so U_STACK_TRACE()'s output
   lands there, then restore the original stderr */
void redirect_stack_trace(FILE *filest) {
        int fd, fd2;

        fd = fileno(filest);
        fflush(filest);
        fflush(stderr);
        fd2 = dup(STDERR_FILENO);       /* save the real stderr */
        dup2(fd, STDERR_FILENO);
        U_STACK_TRACE();
        fflush(stderr);
        dup2(fd2, STDERR_FILENO);       /* put it back */
        close(fd2);
}

isc_result_t
isc_mutex_init_traced(pthread_mutex_t *mp, int line, char *funcname) {
        char name[32];
        FILE *tracefile;
        char tracefilename[64];

        /* this is a rather heavy-handed way to find who
           inits a mutex at a given address... */
        snprintf(tracefilename, sizeof(tracefilename),
                 "/var/tmp/mutex%p", mp);
        tracefile = fopen(tracefilename, "a");
        if (tracefile != NULL) {
                redirect_stack_trace(tracefile);
                fclose(tracefile);
        }
        snprintf(name, sizeof(name), "line %d file %s", line, funcname);
        if (pthread_mutex_init(mp, NULL) == 0) {
                /* give it a name */
                pthread_mutex_setname_np(mp, name);
                return ISC_R_SUCCESS;
        } else {
                return ISC_R_UNEXPECTED;
        }
}
#endif /* TRACED_MUTEXES */
So, now we can see who is initializing the mutex in mem.c that is the
next highest on the list (the address is different in this run since
the binary and some of its layout changed):
# more /var/tmp/mutex400436d8
( 0) 0x00170f84 redirect_stack_trace + 0x4c [./named_init_trace]
( 1) 0x00171020 isc_mutex_init_traced + 0x4c [./named_init_trace]
( 2) 0x001574e8 isc_mem_createx + 0x234 [./named_init_trace]
( 3) 0x001575e0 isc_mem_create + 0x28 [./named_init_trace]
( 4) 0x00013268 main + 0x7c [./named_init_trace]
( 5) 0xc01409f8 _start + 0xa0 [/usr/lib/libc.2]
( 6) 0x0000ac70 $START$ + 0x1a0 [./named_init_trace]
So, it is something fairly close to the beginning in main.c for named:
int
main(int argc, char *argv[]) {
        isc_result_t result;

        program_name = argv[0];
        isc_assertion_setcallback(assertion_failed);
        isc_error_setfatal(library_fatal_error);
        isc_error_setunexpected(library_unexpected_error);
        ns_os_init(program_name);
        result = isc_app_start();
        if (result != ISC_R_SUCCESS)
                ns_main_earlyfatal("isc_app_start() failed: %s",
                                   isc_result_totext(result));
        result = isc_mem_create(0, 0, &ns_g_mctx);
        if (result != ISC_R_SUCCESS)
                ns_main_earlyfatal("isc_mem_create() failed: %s",
                                   isc_result_totext(result));
        dns_result_register();
        dst_result_register();
There is only one call to isc_mem_create() in main(), to set up
whatever is pointed to by ns_g_mctx. From the looks of things, this is
a fairly central data structure, but in all honesty, as of yet, the
author has not figured out exactly what it is for :)
Using this same tracking technique, we can also look at some of the
non-socket mutexes that lurk further down in the list:
Mutex name: line 95 file rwlock.c
Mutex init order: 432
Mutex address: 0x404b6980
Mutex init: 0x404b6980 (address of init or 1st lock)
Mutex type: PTHREAD_PROCESS_PRIVATE
Mutex type: PTHREAD_MUTEX_NORMAL
Mutex spin cnt: 700
Mutex yield freq: every 100 spins
Lock success calls: 28937186
Lock failed calls: 0
Trylock success calls: 0
Trylock failed calls: 0
Unlock success calls: 28937186
Unlock failed calls: 0
Lock immediately: 26611307
Lock during 1st spin: 2247119
Lock after yield1: 45440
Lock after yield2: 5038
Lock after yield3: 3643
Lock after yield4: 4056
Lock after yield5: 5534
Lock after yield6: 6576
Lock after yield7: 2880
Lock after yield8: 0
Lock after yield9: 0
Lock after yield10/more:0
Lock after sleep (hdf): 0
Lock after sleep1: 5566
Lock after sleep2/more: 27
Lock had waiters: 6146
Lock max waiters: 3
Lock total waiters: 7136
Unlock had waiters: 5620
Unlock max waiters: 3
Unlock total waiters: 6806
# cat /var/tmp/mutex404b6980
Stack_Trace: error while unwinding stack
( 0) 0x00170f84 redirect_stack_trace + 0x4c [./named_init_trace]
( 1) 0x00171020 isc_mutex_init_traced + 0x4c [./named_init_trace]
( 2) 0x0015c164 isc_rwlock_init + 0x88 [./named_init_trace]
( 3) 0x0010dee8 dns_rbtdb_create + 0xd8 [./named_init_trace]
( 4) 0x00049414 dns_db_create + 0xb8 [./named_init_trace]
( 5) 0x00047b70 dns_cache_create + 0x168 [./named_init_trace]
( 6) 0x0001a588 configure_view + 0x6f4 [./named_init_trace]
( 7) 0x0001bee8 load_configuration + 0x20c [./named_init_trace]
( 8) 0x0001ccd0 run_server + 0x148 [./named_init_trace]
( 9) 0x0016199c dispatch + 0x2f0 [./named_init_trace]
(10) 0x00161b88 run + 0x10 [./named_init_trace]
(11) 0xc11eb8e4 __pthread_create_system + 0x434 [/tmp/instr_pthreads/libpthread.1]
Mutex name: line 3992 file rbtdb.c
Mutex init order: 431
Mutex address: 0x404b6924
Mutex init: 0x404b6924 (address of init or 1st lock)
Mutex type: PTHREAD_PROCESS_PRIVATE
Mutex type: PTHREAD_MUTEX_NORMAL
Mutex spin cnt: 700
Mutex yield freq: every 100 spins
Lock success calls: 28984849
Lock failed calls: 0
Trylock success calls: 0
Trylock failed calls: 0
Unlock success calls: 28984849
Unlock failed calls: 0
Lock immediately: 26232294
Lock during 1st spin: 2671616
Lock after yield1: 46090
Lock after yield2: 5229
Lock after yield3: 4275
Lock after yield4: 4794
Lock after yield5: 6074
Lock after yield6: 6268
Lock after yield7: 2732
Lock after yield8: 0
Lock after yield9: 0
Lock after yield10/more:0
Lock after sleep (hdf): 0
Lock after sleep1: 5444
Lock after sleep2/more: 33
Lock had waiters: 5784
Lock max waiters: 3
Lock total waiters: 6806
Unlock had waiters: 5510
Unlock max waiters: 3
Unlock total waiters: 6774
# cat /var/tmp/mutex404b6924
Stack_Trace: error while unwinding stack
( 0) 0x00170f84 redirect_stack_trace + 0x4c [./named_init_trace]
( 1) 0x00171020 isc_mutex_init_traced + 0x4c [./named_init_trace]
( 2) 0x0010decc dns_rbtdb_create + 0xbc [./named_init_trace]
( 3) 0x00049414 dns_db_create + 0xb8 [./named_init_trace]
( 4) 0x00047b70 dns_cache_create + 0x168 [./named_init_trace]
( 5) 0x0001a588 configure_view + 0x6f4 [./named_init_trace]
( 6) 0x0001bee8 load_configuration + 0x20c [./named_init_trace]
( 7) 0x0001ccd0 run_server + 0x148 [./named_init_trace]
( 8) 0x0016199c dispatch + 0x2f0 [./named_init_trace]
( 9) 0x00161b88 run + 0x10 [./named_init_trace]
(10) 0xc11eb8e4 __pthread_create_system + 0x434 [/tmp/instr_pthreads/libpthread.1]
Another "interesting" consequence of putting stack traces into files
named after mutex addresses is that one can see when a mutex is
initialized more than once. This is definitely the case with the mutex
at address 403dbd3c, which has a _long_ series of inits - 464 of them
in the testing done. It would seem that 463 of those initializations
have this stack trace:
Stack_Trace: error while unwinding stack
( 0) 0x00170f84 redirect_stack_trace + 0x4c [./named_init_trace]
( 1) 0x00171020 isc_mutex_init_traced + 0x4c [./named_init_trace]
( 2) 0x00162cc4 isc_timer_create + 0x418 [./named_init_trace]
( 3) 0x0009be00 fctx_create + 0x2bc [./named_init_trace]
( 4) 0x000a0f3c dns_resolver_createfetch + 0x4d4 [./named_init_trace]
( 5) 0x00017a24 query_recurse + 0x1c0 [./named_init_trace]
( 6) 0x00018250 query_find + 0x320 [./named_init_trace]
( 7) 0x000196c4 ns_query_start + 0x2b8 [./named_init_trace]
( 8) 0x0000dec4 client_request + 0x2e4 [./named_init_trace]
( 9) 0x0016199c dispatch + 0x2f0 [./named_init_trace]
(10) 0x00161b88 run + 0x10 [./named_init_trace]
(11) 0xc11eb8e4 __pthread_create_system + 0x434 [/tmp/instr_pthreads/libpthread.1]
and one has this:
Stack_Trace: error while unwinding stack
( 0) 0x00170f84 redirect_stack_trace + 0x4c [./named_init_trace]
( 1) 0x00171020 isc_mutex_init_traced + 0x4c [./named_init_trace]
( 2) 0x00162cc4 isc_timer_create + 0x418 [./named_init_trace]
( 3) 0x0000e9ac client_create + 0xfc [./named_init_trace]
( 4) 0x0000fb3c ns_clientmgr_createclients + 0x2f8 [./named_init_trace]
( 5) 0x0000f428 ns_client_replace + 0x68 [./named_init_trace]
( 6) 0x000178b4 query_recurse + 0x50 [./named_init_trace]
( 7) 0x00018250 query_find + 0x320 [./named_init_trace]
( 8) 0x000196c4 ns_query_start + 0x2b8 [./named_init_trace]
( 9) 0x0000dec4 client_request + 0x2e4 [./named_init_trace]
(10) 0x0016199c dispatch + 0x2f0 [./named_init_trace]
(11) 0x00161b88 run + 0x10 [./named_init_trace]
(12) 0xc11eb8e4 __pthread_create_system + 0x434 [/tmp/instr_pthreads/libpthread.1]
The author did not attempt to determine whether this was a mutex being
created on the stack, or something else. There were at least six other
mutex addresses with multiple initializations, but they are not shown
here.
One additional data point was gleaned from the output of the mutex_perf
tool. Dividing the sum of all the successful locks in a two-name test
by the number of transactions performed yields the rather surprising
figure of over one hundred mutex lock/unlock pairs per
transaction. (The "precise" count was 58792905 mutex locks in a
90-second run at a rate of 4616.15 transactions per second, which
works out to roughly 141 lock/unlock pairs per transaction.) If this
is accurate, it would go a long way towards explaining all the time
spent in pthread_mutex_lock and friends.