Final pre-publication draft of the 9.1.0 perf writeup. Mk II

Rick Jones raj at cup.hp.com
Wed Feb 14 23:28:36 UTC 2001


This message should have the attachment this time... 

rick jones
-- 
ftp://ftp.cup.hp.com/dist/networking/misc/rachel/
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to email, OR post, but please do NOT do BOTH...
my email address is raj in the cup.hp.com domain...

-- Attached file included as plaintext by Listar --
-- File: bind9_perf.txt


	       Copyright 2001, Hewlett-Packard Company




	  The Performance of BIND 9.1.0 on the HP 9000 L3000

		     Rick Jones <raj at cup.hp.com>
		       Hewlett-Packard Company
			    Cupertino, CA

		   Revision 0.7; February 13, 2001

     ftp://ftp.cup.hp.com/dist/networking/briefs/bind9_perf.txt



Introduction:

One of the fundamental and crucial pieces of the Internet's
infrastructure is the Domain Name System (DNS).  The preeminent
implementation of a DNS server is the named server from the Berkeley
Internet Name Domain (BIND) distribution, maintained by the Internet
Software Consortium (http://www.isc.org).

This evolving brief will take a look at the performance of
the named server from various revisions of BIND 9.

This document builds on prior documents discussing DNS server
performance, which can be found under the URL:


  ftp://ftp.cup.hp.com/dist/networking/briefs/

Summary:

	  BIND 9.1.0 named Performance Summary HP-UX 11.ACE
	netperf DNS_RR requesting 1000 names out of cup.hp.com

                                     +O2       +O2
               "stock"   "+O2"      +DD64   nothread   "full"
   System    +=================================================+
 L3000 1x550 |  2,065  |  3,127  |  3,039  |  4,303  |  4,943  |
 L3000 2x550 |         |  5,338  |         |         |  6,771  |
 L3000 4x550 |         |  8,065  |         |  4,757  | 10,006  |
 J5600 2x552 |         |         |         |         |  7,459  |
             +=================================================+


	    BIND 9.1.0 named Performance Summary HP-UX 11i
			     TOPS Enabled
	netperf DNS_RR requesting 1000 names out of cup.hp.com

                                     +O2       +O2
               "stock"   "+O2"      +DD64   nothread   "full"
   System    +=================================================+
 L3000 1x550 |         |         |         |         |         |
 L3000 2x550 |         |         |         |         |  6,750  |
 L3000 4x550 |         |  7,792  |         |         |  9,599  |
 J5600 2x552 |         |         |         |         |         |
             +=================================================+

	    BIND 9.1.0 named Performance Summary HP-UX 11i
			    TOPS Disabled
	netperf DNS_RR requesting 1000 names out of cup.hp.com

                                     +O2       +O2
               "stock"   "+O2"      +DD64   nothread   "full"
   System    +=================================================+
 L3000 1x550 |         |         |         |         |  4,832  |
 L3000 2x550 |         |         |         |         |         |
 L3000 4x550 |         |  7,930  |         |         |  9,761  |
 J5600 2x552 |         |         |         |         |         |
             +=================================================+

 [ Additional information about the flavors of named can be found in
   later sections. ]



Observations and Conclusions:

Apart from the mutex-related calls, and perhaps set_offsets(), the CPU
utilization appears to be spread out over a very large number of
routines. This implies that higher levels of optimization, especially
those that can do automagic in-lining, would be worth considering.
This was validated in part by the performance boost of the version
with "full" optimization. The CPU time being spread out over a large
number of routines also suggests a great deal (perhaps too much?) of
abstraction and layering.

One-to-two CPU scaling of the +O2 binary appears to be ~1.7X (3,127
to 5,338 transactions per second).  One-to-four CPU scaling is almost
2.6X (3,127 to 8,065). Two-CPU scaling for the "full" binary is 1.37X
(4,943 to 6,771) and four-CPU scaling is only 2X (4,943 to 10,006).

The performance of BIND 9.1.0's named appears to be roughly 1/2 that
of BIND 8.2.2pl5's, based on figures in the documents at the URL above.
Even if all the time spent in pthread routines (shown below) were to
magically evaporate, it would seem that the BIND 9.1.0 named would
still be substantially slower.

Even when compiled as a non-threaded application, the 9.1.0 named's
performance is still far below that of 8.2.2pl5 for similar compiler
optimization levels. (With the HP-UX ANSI C compiler "-O" and "+O2"
are the same.)  This implies (but does not confirm) that a great part
of the slowdown in the 9.1.0 named is _not_ in the HP-UX 11 platform's
threads implementation.

As seen in the profiles below, a great deal of time is spent in the
pthread routines for locking and unlocking mutexes. Analysis of the
mutexes appears to show that a DNS server serving requests via a
single IP address will suffer contention for the mutex protecting
data structures associated with the socket bound to that IP
address. Once this contention is lessened by going to more than one IP
address, contention appears to shift to the mutex protecting the
global ns_g_mctx data structure.

Beyond the contention on mutexes, there appear to be a great many
mutex lock/unlock pairs per transaction. One analysis had the number
of mutex lock/unlock pairs at over one hundred per transaction.

It also appears that during its lifetime, a request can be passed from 
one thread to another. This may be related to the apparently large number
of mutex calls per transaction.

The author's naive assumption of the design of a multi-threaded named
would have been one in which each thread would behave as if it were a
named unto itself.  It would read requests from the socket(s) assigned
to it and access the shared database with appropriate locks to
maintain the database's integrity.  Any statistics kept would be kept
on a per-thread basis, and retrieval of overall statistics would
"walk" the threads in turn. It has been the author's experience
thus far that MP scaling is greatest when the design comes as close to
N separate instances as possible.
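
To make that more concrete, what follows is a minimal, purely
illustrative sketch of such an "N separate nameds" structure - it is
not BIND code, and the socket setup, port numbers, and the
lookup_shared_db() routine are stand-ins invented for the example.
Only the shared database takes a lock; the statistics stay per-thread
and are summed only when someone asks for them.

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

#define NUM_WORKERS 4
#define BASE_PORT   5300        /* illustrative; not the real DNS port */

typedef struct {
        pthread_t tid;
        int       sock;         /* this worker's own UDP socket */
        long      queries;      /* per-thread statistic, no lock needed */
} worker_t;

static worker_t workers[NUM_WORKERS];
static pthread_rwlock_t db_lock = PTHREAD_RWLOCK_INITIALIZER;

/* stand-in for a lookup in the shared database; simply echoes the query */
static int
lookup_shared_db(const char *qbuf, int qlen, char *abuf, int alen) {
        int n = (qlen < alen) ? qlen : alen;
        memcpy(abuf, qbuf, n);
        return n;
}

static void *
worker_main(void *arg) {
        worker_t *w = (worker_t *)arg;
        char query[512], answer[512];
        struct sockaddr_in from;
        socklen_t fromlen;
        int n, alen;

        for (;;) {
                fromlen = sizeof(from);
                n = recvfrom(w->sock, query, sizeof(query), 0,
                             (struct sockaddr *)&from, &fromlen);
                if (n <= 0)
                        continue;
                /* only the shared database needs a lock */
                pthread_rwlock_rdlock(&db_lock);
                alen = lookup_shared_db(query, n, answer, sizeof(answer));
                pthread_rwlock_unlock(&db_lock);
                if (alen > 0)
                        sendto(w->sock, answer, alen, 0,
                               (struct sockaddr *)&from, fromlen);
                w->queries++;   /* private counter; no mutex traffic */
        }
        return NULL;
}

int
main(void) {
        struct sockaddr_in sin;
        long sum;
        int i;

        for (i = 0; i < NUM_WORKERS; i++) {   /* error checking omitted */
                workers[i].sock = socket(AF_INET, SOCK_DGRAM, 0);
                memset(&sin, 0, sizeof(sin));
                sin.sin_family = AF_INET;
                sin.sin_addr.s_addr = htonl(INADDR_ANY);
                sin.sin_port = htons(BASE_PORT + i);
                bind(workers[i].sock, (struct sockaddr *)&sin, sizeof(sin));
                pthread_create(&workers[i].tid, NULL, worker_main,
                               &workers[i]);
        }
        for (;;) {
                sleep(10);
                /* overall statistics "walk" the per-thread counters in turn */
                for (sum = 0, i = 0; i < NUM_WORKERS; i++)
                        sum += workers[i].queries;
                printf("queries so far: %ld\n", sum);
        }
}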

HP-UX 11i includes a new feature in the networking stack known as
"TOPS" or "Thread Optimized Packet Scheduling."  At a high level, TOPS
remembers the CPU on which a socket was last accessed, and completes
inbound processing of packets for that socket on that CPU.  This is a
refinement of the "Inbound Packet Scheduling" (IPS) of HP-UX 10.20
that only made decisions based on the addressing information in a
packet.  The primary design center of TOPS is one where a single
thread of execution accesses one or more sockets and there is a high
packet rate through a single NIC, or significantly fewer NICs than
there are CPUs.  

However, it would appear that the workflow model of BIND 9.1.0 is such
that multiple threads can be accessing the socket(s).  Since these
threads execute on separate CPUs, there is no consistent CPU for any
one socket.  Further, the actual packet rate for BIND 9.1.0 is very
low (less than 20 Mbit/s in these tests) and the bulk of the CPU time
is spent in user space, not in protocol processing. In this
environment, TOPS is not beneficial. One can disable TOPS by using adb
to poke the value of the kernel variable "enable_tops" to a value of
zero.
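
For reference, such a poke would look something along the following
lines (the exact adb incantation here is illustrative rather than
gospel, and it only changes the running kernel, not the on-disk
image):

  # echo "enable_tops/W 0" | adb -w /stand/vmunix /dev/kmem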



Appendix A: Configuration information

The initial HP-UX 11 tests were run on a system with the following
software installed:

  A4926A            B.11.00.11     1000Base-SX PCI Gigabit Ethernet Driver
  A5230A            B.11.00.05     100BT/9000 PCI
  B3693AA           C.03.20.00     HP GlancePlus/UX for s800 11.0
  HPUXEng64RT       B.11.00.01     English HP-UX 64-bit Runtime Environment
  UXCoreMedia       B.11.00.02     HP-UX Media Kit
  XSWGR1100         B.11.00.47.08  General Release Patches, Nov. 1999 (ACE)
  PHCO_21267        1.0            cumulative SAM/ObAM patch
  PHNE_21865        1.0            1000Base-SX/T B.11.00.[02-11] cumul. patch
  PHNE_21936        1.0            ftpd(1M) and ftp(1) patch
  PHNE_22086        1.0            Streams Pty cumulative patch
  PHNE_22244        1.0            Built-in PCI 100BASE-T patch
  PHNE_22331        1.0            LAN product cumulative patch
  PHNE_22397        1.0            cumulative ARPA Transport patch
  PHNE_22542        1.0            PCI 100BT lan cumulative patch

The HP-UX 11i tests were run with the following software installed:

  B3693AA          C.03.20.00  HP GlancePlus/UX for s800 11i 
  CDE-English      B.11.11     English CDE Environment 
  FDDI-00          B.11.11.01  PCI FDDI;Supptd HW=A3739A/A3739B;SW=J3626AA 
  FibrChanl-00     B.11.11.03  PCI FibreChannel;Supptd HW=A5158A 
  GigEther-00      B.11.11.14  PCI/HSC GigEther;Supptd HW=...
  HPUX11i-OE-MC    B.11.11     HP-UX Mission Critical Op. Env. Component 
  HPUXBase64       B.11.11     HP-UX 64-bit Base OS 
  HPUXBaseAux      B.11.11     HP-UX Base OS Auxiliary 
  RAID-00          B.11.11.00  PCI RAID; Supptd HW=A5856A 

Appendix B. Prospect Data

Some of the author's prior DNS server work employed a tool named puma
to get profiling information used in tuning and the like.  Since that
time the puma profiling tool has been obsoleted. In its place, the
author used a tool named prospect.  As of this writing, the author is
not certain of the public availability of the prospect tool.  What
follows is some edited output of the prospect tool showing where time
was spent in various configurations.  Full tool output will be
available via links in another appendix once the public disposition of
the prospect tool is determined.

One-CPU, L3000, named.9.1.0.O2, 11.0

pcnt accum% Routine name          Instructi Filename
  6%     6% set_offsets              ENTRY  ./named.9.1.0.O2
  5%    11% pthread_mutex_lock       CODE   /usr/lib/libpthread.1
  4%    15% dns_name_concatenate     CODE   ./named.9.1.0.O2
  3%    18% dns_name_getlabelsequence CODE  ./named.9.1.0.O2
  3%    22% pthread_mutex_unlock     CODE   /usr/lib/libpthread.1
  3%    25% dns_name_fullcompare     CODE   ./named.9.1.0.O2
  3%    28% dns_rbt_findnode         CODE   ./named.9.1.0.O2
  2%    30% dns_name_hash            CODE   ./named.9.1.0.O2
  2%    33% pthread_mutex_lock       ENTRY  /usr/lib/libpthread.1
  2%    35% memcpy                   CODE   /usr/lib/libc.2
  2%    37% dns_name_equal           CODE   ./named.9.1.0.O2
  2%    39% __spin_lock              CODE   /usr/lib/libpthread.1
  2%    41% __spin_unlock            CODE   /usr/lib/libpthread.1
  2%    42% pthread_mutex_unlock     ENTRY  /usr/lib/libpthread.1
  2%    44% dns_rdataset_towiresorted CODE  ./named.9.1.0.O2
  1%    45% dns_name_split           CODE   ./named.9.1.0.O2
  1%    47% dns_compress_findglobal  CODE   ./named.9.1.0.O2
  1%    48% dns_name_clone           CODE   ./named.9.1.0.O2
  1%    49% find_up                  ENTRY  ./named.9.1.0.O2
  1%    50% dns_name_fromregion      CODE   ./named.9.1.0.O2
  1%    52% query_addadditional      ENTRY  ./named.9.1.0.O2
  1%    53% dns_name_towire          CODE   ./named.9.1.0.O2
  1%    54% cache_find               ENTRY  ./named.9.1.0.O2
  1%    55% memcpy                   ENTRY  /usr/lib/libc.2
  1%    56% isc__buffer_add          CODE   ./named.9.1.0.O2
  1%    57% chain_name               ENTRY  ./named.9.1.0.O2
  1%    58% detachnode               ENTRY  ./named.9.1.0.O2
  1%    59% isc__mempool_get         CODE   ./named.9.1.0.O2
  1%    60% __pthread_mutex_lock     ENTRY  /usr/lib/libpthread.1
  1%    61% dns_name_init            CODE   ./named.9.1.0.O2
  1%    62% __pthread_mutex_unlock   ENTRY  /usr/lib/libpthread.1
  1%    62% __memset20               CODE   /usr/lib/libc.2
  1%    63% find_deepest_zonecut     ENTRY  ./named.9.1.0.O2
  1%    64% isc__buffer_clear        CODE   ./named.9.1.0.O2
  1%    65% memmove                  CODE   /usr/lib/libc.2
  1%    65% isc__buffer_putuint16    CODE   ./named.9.1.0.O2
  1%    66% qsort                    CODE   /usr/lib/libc.2
  1%    67% $$dyncall_external_20    MILLIC ./named.9.1.0.O2
  1%    67% dns_aclelement_match     CODE   ./named.9.1.0.O2
  1%    68% isc_rwlock_unlock        CODE   ./named.9.1.0.O2
  1%    68% isc__mempool_put         CODE   ./named.9.1.0.O2
  1%    69% new_reference            ENTRY  ./named.9.1.0.O2
  1%    69% isc_rwlock_lock          CODE   ./named.9.1.0.O2
  0%    70% dns_compress_add         CODE   ./named.9.1.0.O2
  0%    70% isc__buffer_availableregion COD ./named.9.1.0.O2
  0%    71% dns_name_setbuffer       CODE   ./named.9.1.0.O2
  0%    71% dns_acl_match            CODE   ./named.9.1.0.O2
  0%    72% pthread_mutex_lock       STUB   ./named.9.1.0.O2
  0%    72% dns_rdata_towire         CODE   ./named.9.1.0.O2
  0%    73% isc_buffer_getuint16     CODE   ./named.9.1.0.O2
  0%    73% memmove                  ENTRY  /usr/lib/libc.2
  0%    74% dns_message_rendersection CODE  ./named.9.1.0.O2
  0%    74% dns_rdataset_isassociated CODE  ./named.9.1.0.O2
  0%    74% pthread_mutex_unlock     STUB   ./named.9.1.0.O2
  0%    75% memcpy                   STUB   ./named.9.1.0.O2
  0%    75% isc__buffer_init         CODE   ./named.9.1.0.O2
  0%    76% cache_findrdataset       ENTRY  ./named.9.1.0.O2
  0%    76% dns_name_getlabel        CODE   ./named.9.1.0.O2
  0%    76% dns_compress_invalidate  CODE   ./named.9.1.0.O2
  0%    77% client_request           ENTRY  ./named.9.1.0.O2
  0%    77% dns_message_findname     CODE   ./named.9.1.0.O2
  0%    77% isc_netaddr_eqprefix     CODE   ./named.9.1.0.O2
  0%    78% mallinfo                 CODE   /usr/lib/libc.2
  0%    78% dns_zt_find              CODE   ./named.9.1.0.O2
  0%    78% query_find               ENTRY  ./named.9.1.0.O2
  0%    79% dns_name_fromwire        CODE   ./named.9.1.0.O2
  0%    79% exit_check               ENTRY  ./named.9.1.0.O2
  0%    79% dns_db_find              CODE   ./named.9.1.0.O2
  0%    80% dns_db_findrdataset      CODE   ./named.9.1.0.O2
  0%    80% msgresetnames            ENTRY  ./named.9.1.0.O2
  0%    80% dns_rdata_fromregion     CODE   ./named.9.1.0.O2
  0%    80% dns_rdataset_current     CODE   ./named.9.1.0.O2


Two-CPU, L3000, named.9.1.0.O2, 11.0

pcnt accum%  Routine name          Instrtn   Filename
 10%    10%  pthread_mutex_lock       CODE   /usr/lib/libpthread.1
  5%    16%  set_offsets              ENTRY  ./named.9.1.0.O2
  3%    19%  pthread_mutex_unlock     CODE   /usr/lib/libpthread.1
  3%    22%  dns_rbt_findnode         CODE   ./named.9.1.0.O2
  3%    26%  find_up                  ENTRY  ./named.9.1.0.O2
  3%    28%  __spin_lock              CODE   /usr/lib/libpthread.1
  2%    31%  dns_name_getlabelsequence CODE   ./named.9.1.0.O2
  2%    33%  dns_name_equal           CODE   ./named.9.1.0.O2
  2%    34%  dns_name_fullcompare     CODE   ./named.9.1.0.O2
  2%    36%  __memset20               CODE   /usr/lib/libc.2
  2%    38%  dns_name_concatenate     CODE   ./named.9.1.0.O2
  2%    40%  dns_name_hash            CODE   ./named.9.1.0.O2
  2%    42%  pthread_mutex_lock       ENTRY  /usr/lib/libpthread.1
  2%    43%  __spin_unlock            CODE   /usr/lib/libpthread.1
  2%    45%  memcpy                   CODE   /usr/lib/libc.2
  1%    46%  dns_rdataset_towiresorted CODE   ./named.9.1.0.O2
  1%    47%  pthread_mutex_unlock     ENTRY  /usr/lib/libpthread.1
  1%    49%  cache_find               ENTRY  ./named.9.1.0.O2
  1%    50%  dns_compress_findglobal  CODE   ./named.9.1.0.O2
  1%    51%  detachnode               ENTRY  ./named.9.1.0.O2
  1%    52%  dns_name_clone           CODE   ./named.9.1.0.O2
  1%    53%  isc_rwlock_unlock        CODE   ./named.9.1.0.O2
  1%    54%  query_addadditional      ENTRY  ./named.9.1.0.O2
  1%    55%  dns_name_split           CODE   ./named.9.1.0.O2
  1%    56%  dns_name_towire          CODE   ./named.9.1.0.O2
  1%    57%  dns_name_fromregion      CODE   ./named.9.1.0.O2
  1%    58%  __pthread_mutex_unlock   ENTRY  /usr/lib/libpthread.1
  1%    59%  isc_rwlock_lock          CODE   ./named.9.1.0.O2
  1%    60%  isc__mempool_get         CODE   ./named.9.1.0.O2
  1%    60%  dns_aclelement_match     CODE   ./named.9.1.0.O2
  1%    61%  __pthread_mutex_lock     ENTRY  /usr/lib/libpthread.1
  1%    62%  chain_name               ENTRY  ./named.9.1.0.O2
  1%    63%  qsort                    CODE   /usr/lib/libc.2
  1%    63%  isc__buffer_add          CODE   ./named.9.1.0.O2
  1%    64%  isc__mempool_put         CODE   ./named.9.1.0.O2
  1%    65%  dns_name_init            CODE   ./named.9.1.0.O2
  1%    65%  dns_compress_add         CODE   ./named.9.1.0.O2
  1%    66%  memcpy                   ENTRY  /usr/lib/libc.2
  1%    67%  isc__buffer_putuint16    CODE   ./named.9.1.0.O2
  1%    67%  new_reference            ENTRY  ./named.9.1.0.O2
  1%    68%  $$dyncall_external_20    MILLIC ./named.9.1.0.O2
  1%    68%  __errno                  STUB   /usr/lib/libc.2
  1%    69%  cache_findrdataset       ENTRY  ./named.9.1.0.O2
  1%    69%  memmove                  CODE   /usr/lib/libc.2
  1%    70%  dns_zt_find              CODE   ./named.9.1.0.O2
  1%    71%  dns_rdata_towire         CODE   ./named.9.1.0.O2
  0%    71%  mem_getunlocked          ENTRY  ./named.9.1.0.O2
  0%    71%  isc_netaddr_eqprefix     CODE   ./named.9.1.0.O2
  0%    72%  dns_message_rendersection CODE   ./named.9.1.0.O2
  0%    72%  isc__buffer_availableregion CODE  ./named.9.1.0.O2
  0%    73%  dns_name_setbuffer       CODE   ./named.9.1.0.O2
  0%    73%  find_deepest_zonecut     ENTRY  ./named.9.1.0.O2
  0%    74%  mallinfo                 CODE   /usr/lib/libc.2
  0%    74%  dns_acl_match            CODE   ./named.9.1.0.O2
  0%    74%  mem_putunlocked          ENTRY  ./named.9.1.0.O2
  0%    75%  dns_message_findname     CODE   ./named.9.1.0.O2
  0%    75%  isc__buffer_clear        CODE   ./named.9.1.0.O2
  0%    76%  attach                   ENTRY  ./named.9.1.0.O2
  0%    76%  pthread_mutex_lock       STUB   ./named.9.1.0.O2
  0%    76%  dns_rdataset_isassociated CODE   ./named.9.1.0.O2
  0%    77%  isc_timer_reset          CODE   ./named.9.1.0.O2
  0%    77%  isc_buffer_getuint16     CODE   ./named.9.1.0.O2
  0%    77%  isc__buffer_init         CODE   ./named.9.1.0.O2
  0%    78%  msgresetnames            ENTRY  ./named.9.1.0.O2
  0%    78%  pthread_mutex_unlock     STUB   ./named.9.1.0.O2
  0%    78%  client_request           ENTRY  ./named.9.1.0.O2
  0%    79%  dns_compress_invalidate  CODE   ./named.9.1.0.O2
  0%    79%  dispatch                 ENTRY  ./named.9.1.0.O2
  0%    79%  memcmp                   ENTRY  /usr/lib/libc.2
  0%    79%  dns_rdataset_init        CODE   ./named.9.1.0.O2

The rest of the profile was not shown to save space.

Four-CPU, L3000, named.9.1.0.O2, 11.0

pcnt accum%  Routine name          Instrctn  Filename
 17%    17%  pthread_mutex_lock       CODE   /usr/lib/libpthread.1
  4%    21%  set_offsets              ENTRY  ./named.9.1.0.O2
  3%    24%  pthread_mutex_unlock     CODE   /usr/lib/libpthread.1
  3%    27%  dns_rbt_findnode         CODE   ./named.9.1.0.O2
  3%    30%  __spin_lock              CODE   /usr/lib/libpthread.1
  3%    32%  find_up                  ENTRY  ./named.9.1.0.O2
  2%    35%  dns_name_getlabelsequence CODE  ./named.9.1.0.O2
  2%    36%  dns_name_fullcompare     CODE   ./named.9.1.0.O2
  2%    38%  dns_name_concatenate     CODE   ./named.9.1.0.O2
  2%    40%  dns_name_hash            CODE   ./named.9.1.0.O2
  2%    42%  __memset20               CODE   /usr/lib/libc.2
  2%    44%  dns_name_equal           CODE   ./named.9.1.0.O2
  2%    45%  __spin_unlock            CODE   /usr/lib/libpthread.1
  1%    47%  isc_rwlock_unlock        CODE   ./named.9.1.0.O2
  1%    48%  isc_rwlock_lock          CODE   ./named.9.1.0.O2
  1%    49%  memcpy                   CODE   /usr/lib/libc.2
  1%    51%  cache_find               ENTRY  ./named.9.1.0.O2
  1%    52%  dns_rdataset_towiresorted CODE  ./named.9.1.0.O2
  1%    53%  pthread_mutex_lock       ENTRY  /usr/lib/libpthread.1
  1%    55%  detachnode               ENTRY  ./named.9.1.0.O2
  1%    56%  pthread_mutex_unlock     ENTRY  /usr/lib/libpthread.1
  1%    57%  dns_name_clone           CODE   ./named.9.1.0.O2
  1%    58%  dns_compress_findglobal  CODE   ./named.9.1.0.O2
  1%    59%  isc__mempool_get         CODE   ./named.9.1.0.O2
  1%    59%  query_addadditional      ENTRY  ./named.9.1.0.O2
  1%    60%  dns_name_split           CODE   ./named.9.1.0.O2
  1%    61%  dns_name_towire          CODE   ./named.9.1.0.O2
  1%    62%  dns_aclelement_match     CODE   ./named.9.1.0.O2
  1%    63%  isc__buffer_add          CODE   ./named.9.1.0.O2
  1%    63%  new_reference            ENTRY  ./named.9.1.0.O2
  1%    64%  dns_name_fromregion      CODE   ./named.9.1.0.O2
  1%    65%  chain_name               ENTRY  ./named.9.1.0.O2
  1%    65%  __pthread_mutex_unlock   ENTRY  /usr/lib/libpthread.1
  1%    66%  isc__mempool_put         CODE   ./named.9.1.0.O2
  1%    67%  memcpy                   ENTRY  /usr/lib/libc.2
  1%    67%  __pthread_mutex_lock     ENTRY  /usr/lib/libpthread.1
  1%    68%  attach                   ENTRY  ./named.9.1.0.O2
  1%    68%  $$dyncall_external_20    MILLIC ./named.9.1.0.O2
  1%    69%  mem_getunlocked          ENTRY  ./named.9.1.0.O2
  1%    70%  dns_zt_find              CODE   ./named.9.1.0.O2
  1%    70%  __errno                  STUB   /usr/lib/libc.2
  1%    71%  qsort                    CODE   /usr/lib/libc.2
  1%    71%  isc__buffer_putuint16    CODE   ./named.9.1.0.O2
  1%    72%  memmove                  CODE   /usr/lib/libc.2
  0%    72%  dns_name_init            CODE   ./named.9.1.0.O2
  0%    73%  client_request           ENTRY  ./named.9.1.0.O2
  0%    73%  cache_findrdataset       ENTRY  ./named.9.1.0.O2
  0%    74%  dns_message_rendersection CODE  ./named.9.1.0.O2
  0%    74%  dns_acl_match            CODE   ./named.9.1.0.O2
  0%    74%  mem_putunlocked          ENTRY  ./named.9.1.0.O2
  0%    75%  dns_compress_add         CODE   ./named.9.1.0.O2
  0%    75%  detach                   ENTRY  ./named.9.1.0.O2
  0%    76%  dns_rdata_towire         CODE   ./named.9.1.0.O2
  0%    76%  isc__buffer_clear        CODE   ./named.9.1.0.O2
  0%    76%  mallinfo                 CODE   /usr/lib/libc.2
  0%    77%  isc__buffer_availableregion CODE ./named.9.1.0.O2
  0%    77%  isc_timer_reset          CODE   ./named.9.1.0.O2
  0%    78%  find_deepest_zonecut     ENTRY  ./named.9.1.0.O2
  0%    78%  dns_message_findname     CODE   ./named.9.1.0.O2
  0%    78%  dns_name_fromwire        CODE   ./named.9.1.0.O2
  0%    79%  query_getcachedb         ENTRY  ./named.9.1.0.O2
  0%    79%  dns_name_setbuffer       CODE   ./named.9.1.0.O2
  0%    79%  isc_heap_delete          CODE   ./named.9.1.0.O2
  0%    79%  sooner                   ENTRY  ./named.9.1.0.O2

One-CPU, L3000, named.9.1.0.O2.nothreads, 11.0

pcnt accum%  Routine name          Instruct Filename
  8%     8%  set_offsets              ENTRY ./named.9.1.0.O2.nothreads
  5%    13%  dns_name_getlabelsequence CODE  ./named.9.1.0.O2.nothreads
  4%    17%  dns_rbt_findnode         CODE  ./named.9.1.0.O2.nothreads
  4%    21%  dns_name_fullcompare     CODE  ./named.9.1.0.O2.nothreads
  3%    25%  dns_name_concatenate     CODE  ./named.9.1.0.O2.nothreads
  3%    28%  dns_name_hash            CODE  ./named.9.1.0.O2.nothreads
  3%    31%  dns_name_equal           CODE  ./named.9.1.0.O2.nothreads
  2%    33%  memcpy                   CODE  /usr/lib/libc.2
  2%    35%  dns_compress_findglobal  CODE  ./named.9.1.0.O2.nothreads
  2%    37%  dns_rdataset_towiresorted CODE  ./named.9.1.0.O2.nothreads
  2%    38%  dns_name_clone           CODE  ./named.9.1.0.O2.nothreads
  1%    40%  __memset20               CODE  /usr/lib/libc.2
  1%    41%  dns_name_split           CODE  ./named.9.1.0.O2.nothreads
  1%    43%  find_up                  ENTRY ./named.9.1.0.O2.nothreads
  1%    44%  isc__buffer_add          CODE  ./named.9.1.0.O2.nothreads
  1%    46%  query_addadditional      ENTRY ./named.9.1.0.O2.nothreads
  1%    47%  dns_name_towire          CODE  ./named.9.1.0.O2.nothreads
  1%    48%  dns_name_fromregion      CODE  ./named.9.1.0.O2.nothreads
  1%    49%  detachnode               ENTRY ./named.9.1.0.O2.nothreads
  1%    51%  cache_find               ENTRY ./named.9.1.0.O2.nothreads
  1%    52%  isc__mempool_get         CODE  ./named.9.1.0.O2.nothreads
  1%    53%  memcpy                   ENTRY /usr/lib/libc.2
  1%    54%  dns_name_init            CODE  ./named.9.1.0.O2.nothreads
  1%    55%  chain_name               ENTRY ./named.9.1.0.O2.nothreads
  1%    56%  $$dyncall_external_20    MILLIC ./named.9.1.0.O2.nothreads
  1%    57%  find_deepest_zonecut     ENTRY ./named.9.1.0.O2.nothreads
  1%    58%  qsort                    CODE  /usr/lib/libc.2
  1%    58%  dns_aclelement_match     CODE  ./named.9.1.0.O2.nothreads
  1%    59%  isc__buffer_putuint16    CODE  ./named.9.1.0.O2.nothreads
  1%    60%  isc__buffer_clear        CODE  ./named.9.1.0.O2.nothreads
  1%    61%  dns_rdataset_isassociated CODE  ./named.9.1.0.O2.nothreads
  1%    61%  memmove                  CODE  /usr/lib/libc.2
  1%    62%  new_reference            ENTRY ./named.9.1.0.O2.nothreads
  1%    62%  name_hash                ENTRY ./named.9.1.0.O2.nothreads
  1%    63%  isc__mempool_put         CODE  ./named.9.1.0.O2.nothreads
  1%    64%  dns_name_setbuffer       CODE  ./named.9.1.0.O2.nothreads
  1%    64%  dns_compress_add         CODE  ./named.9.1.0.O2.nothreads
  1%    65%  dns_name_fromwire        CODE  ./named.9.1.0.O2.nothreads
  1%    65%  dns_rdata_towire         CODE  ./named.9.1.0.O2.nothreads
  1%    66%  isc__buffer_init         CODE  ./named.9.1.0.O2.nothreads
  1%    66%  dns_message_rendersection CODE  ./named.9.1.0.O2.nothreads
  1%    67%  cache_findrdataset       ENTRY ./named.9.1.0.O2.nothreads
  1%    67%  isc__buffer_availableregion CODE ./named.9.1.0.O2.nothreads
  0%    68%  dns_name_getlabel        CODE  ./named.9.1.0.O2.nothreads
  0%    68%  rdataset_current         ENTRY ./named.9.1.0.O2.nothreads
  0%    69%  query_find               ENTRY ./named.9.1.0.O2.nothreads
  0%    69%  isc_netaddr_eqprefix     CODE  ./named.9.1.0.O2.nothreads
  0%    70%  dns_compress_invalidate  CODE  ./named.9.1.0.O2.nothreads
  0%    70%  dns_acl_match            CODE  ./named.9.1.0.O2.nothreads
  0%    71%  dns_db_find              CODE  ./named.9.1.0.O2.nothreads
  0%    71%  dns_message_findname     CODE  ./named.9.1.0.O2.nothreads
  0%    72%  memcpy                   STUB  ./named.9.1.0.O2.nothreads
  0%    72%  client_request           ENTRY ./named.9.1.0.O2.nothreads
  0%    72%  dns_rdataset_current     CODE  ./named.9.1.0.O2.nothreads
  0%    73%  mallinfo                 CODE  /usr/lib/libc.2
  0%    73%  msgresetnames            ENTRY ./named.9.1.0.O2.nothreads
  0%    74%  dns_rdataset_init        CODE  ./named.9.1.0.O2.nothreads
  0%    74%  dns_db_findrdataset      CODE  ./named.9.1.0.O2.nothreads
  0%    74%  dns_rdataset_next        CODE  ./named.9.1.0.O2.nothreads
  0%    75%  memmove                  ENTRY /usr/lib/libc.2
  0%    75%  dns_db_attach            CODE  ./named.9.1.0.O2.nothreads
  0%    75%  query_newname            ENTRY ./named.9.1.0.O2.nothreads
  0%    76%  isc_rwlock_lock          CODE  ./named.9.1.0.O2.nothreads
  0%    76%  dns_rdataset_additionaldata CODE ./named.9.1.0.O2.nothreads
  0%    76%  dns_zt_find              CODE  ./named.9.1.0.O2.nothreads
  0%    76%  dns_compress_getmethods  CODE  ./named.9.1.0.O2.nothreads
  0%    77%  isc_buffer_getuint16     CODE  ./named.9.1.0.O2.nothreads
  0%    77%  dispatch                 ENTRY ./named.9.1.0.O2.nothreads
  0%    77%  query_getcachedb         ENTRY ./named.9.1.0.O2.nothreads
  0%    78%  memset                   ENTRY /usr/lib/libc.2
  0%    78%  dns_rdata_fromregion     CODE  ./named.9.1.0.O2.nothreads
  0%    78%  isc_rwlock_unlock        CODE  ./named.9.1.0.O2.nothreads
  0%    79%  dns_db_detach            CODE  ./named.9.1.0.O2.nothreads
  0%    79%  towire_ns                ENTRY ./named.9.1.0.O2.nothreads
  0%    79%  isc_socket_recv          CODE  ./named.9.1.0.O2.nothreads
  0%    79%  memset                   CODE  /usr/lib/libc.2
  0%    80%  mem_getunlocked          ENTRY ./named.9.1.0.O2.nothreads
  0%    80%  dns_rdata_additionaldata CODE  ./named.9.1.0.O2.nothreads
  0%    80%  dns_db_detachnode        CODE  ./named.9.1.0.O2.nothreads
  0%    80%  dns_rdata_toregion       CODE  ./named.9.1.0.O2.nothreads

Appendix X. Half-cooked Data

The following are the aggregated data for each of the runs mentioned
elsewhere in the document.

L3000, UP, named9.1.0stock - compiled as it comes out of the tarball
for HP-UX (-g, no optimization):


 Threads             Ops
       1            1218
       2            1757
       3            1861
       4            1920
       5            1961
       6            1984
       7            2000
       8            2024
       9            2032
      10            2039
      11            2050
      12            2041
      13            2055
      14            2059
      15            2063
      16            2065

 Peak ops: 2065 with 16 threads


L3000, UP, named9.1.0.O2 - compilation with +O2 instead of -g

 Threads             Ops
       1            1704
       2            2442
       3            2728
       4            2836
       5            2903
       6            2966
       7            2984
       8            3043
       9            3063
      10            3083
      11            3088
      12            3078
      13            3103
      14            3118
      15            3122
      16            3127

 Peak ops: 3127 with 16 threads

L3000, UP, named9.1.0.O2 - compilation with +O2 +DD64 for a 64-bit binary
 Threads             Ops
       1            1717
       2            2474
       3            2682
       4            2793
       5            2851
       6            2900
       7            2928
       8            2985
       9            3011
      10            3000
      11            3011
      12            3016
      13            3039
      14            3029
      15            3027
      16            3037

 Peak ops: 3039 with 13 threads

L3000, 2P, named9.1.0.O2 - compilation with +O2 instead of -g

 Threads             Ops
       2            2647
       4            3838
       6            4650
       8            4811
      10            4978
      12            5130
      14            5212
      16            5285
      18            5282
      20            5308
      22            5316
      24            5332
      26            5326
      28            5332
      30            5338
      32            5326

 Peak ops: 5338 with 30 threads

L3000, 4P, named9.1.0.O2, 11.ACE - compilation with +O2 instead of -g

 Threads             Ops
       2            2766
       4            4324
       6            5524
       8            6466
      10            7153
      12            7459
      14            7741
      16            7790
      18            7776
      20            7679
      22            7970
      24            7866
      26            7989
      28            8065
      30            7985
      32            7935

 Peak ops: 8065 with 28 threads

L3000, 4P, named9.1.0.O2, 11i, TOPS - compilation with +O2 instead of -g

 Threads             Ops
       2            2782
       4            4724
       6            5869
       8            6491
      10            7107
      12            7388
      14            7592
      16            7670
      18            7710
      20            7724
      22            7729
      24            7745
      26            7755
      28            7783
      30            7735
      32            7792

 Peak ops: 7792 with 32 threads

L3000, 4P, named9.1.0.O2, 11i, noTOPS - compilation with +O2 instead of -g


 Threads             Ops
       2            2793
       4            4780
       6            5950
       8            6568
      10            7229
      12            7513
      14            7733
      16            7783
      18            7853
      20            7895
      22            7883
      24            7916
      26            7910
      28            7873
      30            7930
      32            7927

 Peak ops: 7930 with 30 threads

L3000, 4P, named.9.1.0.O2.nothreads, 11.ACE - compilation with
+O2 instead of -g, also configure --disable-threads

 Threads             Ops
       2            4253
       4            4556
       6            4622
       8            4802
      10            4806
      12            4689
      14            4659
      16            4637
      18            4673
      20            4717
      22            4775
      24            4810
      26            4726
      28            4695
      30            4781
      32            4774

 Peak ops: 4810 with 24 threads

L3000, 4P, named.9.1.0.O2.nothreads, 11.ACE - compilation with +O2
instead of -g, also configure --disable-threads

 Threads             Ops
       2            4207
       4            4554
       6            4696
       8            4756
      10            4757
      12            4753
      14            4748
      16            4741
      18            4742
      20            4747
      22            4750
      24            4744
      26            4747
      28            4746
      30            4747
      32            4748

 Peak ops: 4757 with 10 threads

L3000, UP, named.9.1.0.O2.nothreads, 11.ACE - compilation with +O2
instead of -g, also configure --disable-threads


 Threads             Ops
       2            3889
       4            4175
       6            4243
       8            4302
      10            4286
      12            4285
      14            4301
      16            4303
      18            4298
      20            4279
      22            4277
      24            4293
      26            4275
      28            4285
      30            4285
      32            4290

 Peak ops: 4303 with 16 threads

L3000, 2P, named.9.1.0.O2.PBO, 11.ACE - compilation with +O2
instead of -g, PBO based on training done on the two-CPU configuration

 Threads             Ops
       2            2809
       4            3962
       6            4963
       8            5275
      10            5458
      12            5612
      14            5713
      16            5776
      18            5826
      20            5808
      22            5798
      24            5836
      26            5868
      28            5858
      30            5860
      32            5722

 Peak ops: 5868 with 26 threads

L3000, 2P, named.9.1.0.O4.PBO, 11.ACE - compilation with +O4
instead of -g, PBO based on training done on the two-CPU configuration

 Threads             Ops
       2            3096
       4            4029
       6            5325
       8            5719
      10            5878
      12            6136
      14            6103
      16            6133
      18            6220
      20            6287
      22            6266
      24            6230
      26            6134
      28            6137
      30            6118
      32            6109

 Peak ops: 6287 with 20 threads

L3000, UP, named.9.1.0.O4.PBO.chatr, 11.ACE - compilation with +O4, PBO
based on training done on the two-CPU configuration, static branch
prediction, large pages for instruction and data

 Threads             Ops
       2            3539
       4            4338
       6            4597
       8            4751
      10            4821
      12            4840
      14            4867
      16            4883
      18            4897
      20            4929
      22            4928
      24            4919
      26            4943
      28            4936
      30            4930
      32            4937

 Peak ops: 4943 with 26 threads

L3000, 2P, named.9.1.0.O4.PBO.chatr, 11.ACE - compilation with +O4, PBO
based on training done on the two-CPU configuration, static branch
prediction, large pages for instruction and data

 Threads             Ops
       2            3085
       4            4789
       6            5641
       8            6045
      10            6422
      12            6634
      14            6711
      16            6771
      18            6736
      20            6728
      22            6754
      24            6730
      26            6731
      28            6706
      30            6750
      32            6721

 Peak ops: 6771 with 16 threads

L3000, 4P, named.9.1.0.O4.PBO.chatr, 11.ACE - compilation with +O4, PBO
based on training done on the two-CPU configuration, static branch
prediction, large pages for instruction and data

 Threads             Ops
       2            3271
       4            5148
       6            6429
       8            7615
      10            8313
      12            9021
      14            9353
      16            9660
      18            9793
      20            9871
      22            9856
      24            9903
      26            9941
      28           10003
      30           10006
      32            9873

 Peak ops: 10006 with 30 threads


J5600, 2P, named.9.1.0.O4.PBO.chatr, 11.ACE - compilation with +O4, PBO
based on training done on the two-CPU configuration, static branch
prediction, large pages for instruction and data


 Threads             Ops
       2            3386
       4            5195
       6            5956
       8            6463
      10            6856
      12            7230
      14            7427
      16            7459
      18            7307
      20            7369
      22            7458
      24            7434
      26            7382
      28            7421
      30            7428
      32            7386

 Peak ops: 7459 with 16 threads

L3000, UP, named.9.1.0.O4.PBO.chatr, 11i - compilation with +O4, PBO
based on training done on the two-CPU configuration, static branch
prediction, large pages for instruction and data, TOPS enabled


 Threads             Ops
       2            2927
       4            4000
       6            4342
       8            4447
      10            4539
      12            4676
      14            4694
      16            4696
      18            4731
      20            4768
      22            4747
      24            4758
      26            4778
      28            4774
      30            4829
      32            4832

 Peak ops: 4832 with 32 threads


L3000, 2P, named.9.1.0.O4.PBO.chatr, 11i - compilation with +O4, PBO
based on training done on the two-CPU configuration, static branch
prediction, large pages for instruction and data, TOPS enabled

 Threads             Ops
       2            3332
       4            5093
       6            5767
       8            6209
      10            6396
      12            6624
      14            6655
      16            6696
      18            6690
      20            6712
      22            6741
      24            6660
      26            6733
      28            6750
      30            6722
      32            6734

 Peak ops: 6750 with 28 threads

L3000, 4P, named.9.1.0.O4.PBO.chatr, 11i - compilation with +O4, PBO
based on training done on the two-CPU configuration, static branch
prediction, large pages for instruction and data, TOPS enabled


 Threads             Ops
       2            3375
       4            5524
       6            6840
       8            7680
      10            8440
      12            8894
      14            9297
      16            9417
      18            9417
      20            9534
      22            9514
      24            9538
      26            9584
      28            9599
      30            9566
      32            9563

 Peak ops: 9599 with 28 threads

L3000, 4P, named.9.1.0.O4.PBO.chatr, 11i - compilation with +O4, PBO
based on training done on the two-CPU configuration, static branch
prediction, large pages for instruction and data, TOPS disabled


 Threads             Ops
       2            3344
       4            5562
       6            6921
       8            7731
      10            8587
      12            9082
      14            9431
      16            9571
      18            9638
      20            9613
      22            9683
      24            9706
      26            9702
      28            9730
      30            9752
      32            9761

 Peak ops: 9761 with 32 threads



Appendix T: A Brief Demonstration of TOPS

While the comparatively low packet rate/high overhead and
multi-threaded model used by named from BIND 9.1.0 might not fit the
model for which TOPS was designed, TOPS does indeed have value :)
To demonstrate that value, the results of two sets of aggregate
netperf TCP_RR tests are presented.

			      Figure T1.
			 TOPS versus no TOPS
		netperf single-byte, aggregate TCP_RR
		     Core 100BT, 4x550 MHz L3000

                                   Throughput
                                    Trans/s

                      no TOPS        51,704
                      TOPS           52,526

The difference above is constrained as the NIC is beginning to
approach the maximum packet per second limits of a 100BT network.
Maximum packets per second for 100BT is on the order of 140,000 pps,
and 52,526 transactions per second is ~105,000 packets per second
(each transaction being one request packet and one response packet).
This 100BT NIC was also connected to a "timeshare" network rather than
a test network.  Further, had this been a system with more and/or
less-powerful CPUs (the 550 MHz PA-8600 being the most powerful
PA-RISC processor on the market as of this writing), the performance
benefits of TOPS would have been more striking.

In the output of vsar below (similar data can be seen in the "per CPU"
statistics of Glance), with TOPS disabled, we see that one of the four
CPUs - the CPU taking interrupts from the NIC - is completely
saturated, but there is non-trivial idle time remaining on the other
three CPUs.

CPU    usr    sys   intr   idle
  0  20.94  61.01   0.49  17.52
  1   0.37   6.21  92.49   0.91
  2  20.06  61.91   0.31  17.71
  3  18.16  63.73   0.29  17.71
ALL  14.88  48.22  23.40  13.46

Now, with TOPS enabled, we can see some of that interrupt processing
spread out across the CPUs, and a higher percentage of idle time
overall.

CPU    usr    sys   intr   idle
  0  23.93  44.96  11.64  19.47
  1   8.19  17.17  64.65  10.00
  2  20.30  47.89  12.07  19.74
  3  20.15  47.80  12.26  19.79
ALL  18.14  39.45  25.16  17.25



Appendix S. Tracking down the poor scaling

Well, if you have read this far, you deserve a reward, so let us
discuss the scaling of the named from BIND 9.1.0 in greater detail. 

As you probably noticed, while absolute performance increased at
higher levels of optimization, the MP scaling decreased. This implies
that as the code became more efficient, contention for mutexes
increased. This is also implied by the increased time shown in the
mutex calls in the prospect profiles.

HP has a number of tools which can be used to measure mutex contention
in a threaded process. One of these is an internal tool known as
mutex_perf, which can decode the _pth.mumble file created when an
application is linked against an instrumented libpthread library. That same
library includes a call, pthread_mutex_setname_np(mp,name), which can
be used to associate an ASCII string of up to 32 characters with the
mutex pointed to by mp. A small change to lib/isc/pthreads/thread.c to
add a routine:

#ifdef TRACED_MUTEXES
#include <pthread.h>
#include <stdio.h>
isc_result_t
isc_mutex_init_traced(pthread_mutex_t *mp, int line, char *funcname) {

char name[32];

        snprintf(name,31,"line %d file %s",line,funcname);
        if (pthread_mutex_init(mp,NULL) == 0) {
                /* give it a name */
                pthread_mutex_setname_np(mp,name);
                return ISC_R_SUCCESS;
        }
        else {
                return ISC_R_UNEXPECTED;
        }
}
#endif /* TRACED_MUTEXES */

and then to lib/isc/pthreads/include/isc/mutex.h:


#ifdef TRACED_MUTEXES
#define isc_mutex_init(mp) isc_mutex_init_traced((mp),__LINE__,__FILE__)
#else
#define isc_mutex_init(mp) \
        ((pthread_mutex_init((mp), NULL) == 0) ? \
         ISC_R_SUCCESS : ISC_R_UNEXPECTED)
#endif /* TRACED_MUTEXES */
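
With TRACED_MUTEXES defined at compile time (say, by adding
-DTRACED_MUTEXES to the CFLAGS used to build named), an existing
initialization such as the one at line 1146 of socket.c - shown here
with an illustrative variable name:

  isc_mutex_init(&sock->lock);

expands via the macro above into:

  isc_mutex_init_traced(&sock->lock, 1146, "socket.c");

which is why the mutexes appear in the mutex_perf output below under
names like "line 1146 file socket.c".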

and then poof, we can get overall mutex information like this:


General information:
        Filename of instrumentation:    _pth.mutex.21591
        Instrumentation version:        MUTEX_PERF_VERSION_1
        Process ID:                     21591
        Run date:                       Tue Feb 13 10:48:18 2001

        Number of processors available: 4
        Total number created threads:   7
        Maximum simul. running threads: 7

        Ordering of mutexes displayed:  Highest used mutexes first
        Types of mutexes displayed:     Used and unused mutexes
        Count of initialized mutexes:   2508
        Count of used mutexes:          1187

        Mutexes with immediate acquisition:     1187
        Mutexes with acquisition after spin:    31
        Mutexes with acquisition after yield:   34
        Mutexes with acquisition after sleep:   26
        Mutexes with waiters on unlock:         26
        Maximum number of waiters on an unlock: 1

and per mutex information like this:

Per lock statistics:
        Mutex name:             line 1146 file socket.c
        Mutex init order:       228
        Mutex address:          0x4006afa0
        Mutex init:             0x4006afa0 (address of init or 1st lock)
        Mutex type:             PTHREAD_PROCESS_PRIVATE
        Mutex type:             PTHREAD_MUTEX_NORMAL
        Mutex spin cnt:         700
        Mutex yield freq:       every 100 spins
        Lock success calls:     14417389
        Lock failed calls:      0
        Trylock success calls:  0
        Trylock failed calls:   0
        Unlock success calls:   14417389
        Unlock failed calls:    0
        Lock immediately:       10310698
        Lock during 1st spin:   1067505
        Lock after yield1:      1834545
        Lock after yield2:      630044
        Lock after yield3:      271664
        Lock after yield4:      116507
        Lock after yield5:      52017
        Lock after yield6:      28416
        Lock after yield7:      9384
        Lock after yield8:      0
        Lock after yield9:      0
        Lock after yield10/more:0
        Lock after sleep (hdf): 0
        Lock after sleep1:      85724
        Lock after sleep2/more: 10885
        Lock had waiters:       90558
        Lock max waiters:       4
        Lock total waiters:     104595
        Unlock had waiters:     108550
        Unlock max waiters:     4
        Unlock total waiters:   131540

This run was simply requesting two separate names. You will notice
that a very large number of mutexes were created, but a smaller number
were "active." In a test accessing a larger number of names, even more
mutexes are created.

The per mutex data for "line 1146 file socket.c" shows that about 10
times out of every 14 the mutex is acquired immediately. However,
about 4 times out of 14, it was necessary to either spin for a short
while, or actually go to sleep for some time before the mutex could
be acquired. In other words, there was contention for this mutex.

Going back to the code, line 1146 in socket.c is in the routine
allocate_socket(). So, it would seem that there are per-socket data
structures being protected by this lock. Given that the test used only
one active socket, it is not all that surprising that there could be
some contention on that lock. One easy way to minimize that contention
might be to have more than one socket. We can do that by adding an
address alias to the Gigabit interface in use and updating some of the
clients.
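
On HP-UX, adding such an alias amounts to configuring a logical
interface, something along these lines (the interface name and
address here are placeholders, not the ones used in the tests):

  # ifconfig lan1:1 192.168.1.2 netmask 255.255.255.0 up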

    #active sockets   relative perf
          one             1.00
          two             1.08
         four             0.97

There was a nearly 8% improvement when named was accessing two
sockets. However, something dropped performance when there were four
sockets being accessed. Going back to the instrumented named, when
four sockets are being accessed, the socket mutexes no longer appear
at the top of the list; they have been replaced there by another mutex.

There are now four socket mutexes active, each looking something like
this:

        Mutex name:             line 1146 file socket.c
        Mutex init order:       228
        Mutex address:          0x4006afa0
        Mutex init:             0x4006afa0 (address of init or 1st lock)
        Mutex type:             PTHREAD_PROCESS_PRIVATE
        Mutex type:             PTHREAD_MUTEX_NORMAL
        Mutex spin cnt:         700
        Mutex yield freq:       every 100 spins
        Lock success calls:     318577
        Lock failed calls:      0
        Trylock success calls:  0
        Trylock failed calls:   0
        Unlock success calls:   318577
        Unlock failed calls:    0
        Lock immediately:       283532
        Lock during 1st spin:   8789
        Lock after yield1:      15935
        Lock after yield2:      5291
        Lock after yield3:      1841
        Lock after yield4:      892
        Lock after yield5:      353
        Lock after yield6:      230
        Lock after yield7:      83
        Lock after yield8:      0
        Lock after yield9:      0
        Lock after yield10/more:0
        Lock after sleep (hdf): 0
        Lock after sleep1:      1555
        Lock after sleep2/more: 76
        Lock had waiters:       1394
        Lock max waiters:       2
        Lock total waiters:     1442
        Unlock had waiters:     1709
        Unlock max waiters:     3
        Unlock total waiters:   1913


You can see how the number of times the lock was acquired without
contention has increased, to a ratio of roughly 28 out of every 31.

This lock, though, has risen to the top:

        Mutex name:             line 684 file mem.c
        Mutex init order:       86
        Mutex address:          0x40042070
        Mutex init:             0x40042070 (address of init or 1st lock)
        Mutex type:             PTHREAD_PROCESS_PRIVATE
        Mutex type:             PTHREAD_MUTEX_NORMAL
        Mutex spin cnt:         700
        Mutex yield freq:       every 100 spins
        Lock success calls:     18431179
        Lock failed calls:      0
        Trylock success calls:  0
        Trylock failed calls:   0
        Unlock success calls:   18431179
        Unlock failed calls:    0
        Lock immediately:       14035389
        Lock during 1st spin:   4042204
        Lock after yield1:      314045
        Lock after yield2:      13290
        Lock after yield3:      6502
        Lock after yield4:      7347
        Lock after yield5:      4936
        Lock after yield6:      2848
        Lock after yield7:      1335
        Lock after yield8:      0
        Lock after yield9:      0
        Lock after yield10/more:0
        Lock after sleep (hdf): 0
        Lock after sleep1:      3268
        Lock after sleep2/more: 15
        Lock had waiters:       3416
        Lock max waiters:       3
        Lock total waiters:     4398
        Unlock had waiters:     3298
        Unlock max waiters:     3
        Unlock total waiters:   4706

with roughly 4 contended lockings out of every 18 attempts. There are
two "mem.c" files - one is in openssl, and since this was not an
encrypted test, we can probably assume it was lib/isc/mem.c, and the
routine is isc_mem_createx(). There were four mutexes initialized by
"line 684 file mem.c" and a search of the source code found only two
files with "isc_mem_createx" present: ./lib/isc/include/isc/mem.h and
./lib/isc/mem.c. As it turns out, isc_mem_createx is only called from
isc_mem_create, and that call is the only thing in isc_mem_create (CS
101 coding style and complete faith in the compiler I suppose :) The
routine isc_mem_create() is called all over the place, and the author
did not care to wade through all that deeply :)

OK, the author lied - about an hour later, it dawned on him that it
would be possible to further modify the isc_mutex_init_traced routine
to dump a stack trace to a file for each mutex initialized. In this
way, it would be possible to go from a mutex address to a stack trace
of its initialization. The mods to thread.c become:

#ifdef TRACED_MUTEXES
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

extern void U_STACK_TRACE(void);
void redirect_stack_trace(FILE *filest) {
   int fd, fd2, err;
   fd = fileno(filest);
   fflush(filest);
   fflush(stderr);
   fd2 = dup(STDERR_FILENO);
   err = dup2(fd, STDERR_FILENO);
   U_STACK_TRACE();
   fflush(stderr);
   fd = dup2(fd2, STDERR_FILENO);
   close(fd2);
}
isc_result_t
isc_mutex_init_traced(pthread_mutex_t *mp, int line, char *funcname) {

char name[32];
FILE *tracefile;
char tracefilename[64];

        /* this is a rather heavy-handed way to find who
           inits a mutex at a given address... */
        snprintf(tracefilename,63,"/var/tmp/mutex%p",mp);
        tracefile=fopen(tracefilename,"a");
        if (tracefile != NULL) {
                redirect_stack_trace(tracefile);
                fclose(tracefile);
        }
        snprintf(name,31,"line %d file %s",line,funcname);
        if (pthread_mutex_init(mp,NULL) == 0) {
                /* give it a name */
                pthread_mutex_setname_np(mp,name);
                return ISC_R_SUCCESS;
        }
        else {
                return ISC_R_UNEXPECTED;
        }
}
#endif /* TRACED_MUTEXES */

So, now we can see who is initializing the mem.c mutex that is the
next highest in the list (the address is different in this run since
the binary and some of the layout changed):

# more /var/tmp/mutex400436d8
( 0)  0x00170f84   redirect_stack_trace + 0x4c  [./named_init_trace]
( 1)  0x00171020   isc_mutex_init_traced + 0x4c  [./named_init_trace]
( 2)  0x001574e8   isc_mem_createx + 0x234  [./named_init_trace]
( 3)  0x001575e0   isc_mem_create + 0x28  [./named_init_trace]
( 4)  0x00013268   main + 0x7c  [./named_init_trace]
( 5)  0xc01409f8   _start + 0xa0  [/usr/lib/libc.2]
( 6)  0x0000ac70   $START$ + 0x1a0  [./named_init_trace]

So, it is something fairly close to the beginning of main() in named's main.c:


int
main(int argc, char *argv[]) {
        isc_result_t result;

        program_name = argv[0];
        isc_assertion_setcallback(assertion_failed);
        isc_error_setfatal(library_fatal_error);
        isc_error_setunexpected(library_unexpected_error);

        ns_os_init(program_name);

        result = isc_app_start();
        if (result != ISC_R_SUCCESS)
                ns_main_earlyfatal("isc_app_start() failed: %s",
                                   isc_result_totext(result));

        result = isc_mem_create(0, 0, &ns_g_mctx);
        if (result != ISC_R_SUCCESS)
                ns_main_earlyfatal("isc_mem_create() failed: %s",
                                   isc_result_totext(result));

        dns_result_register();
        dst_result_register();


There is only one call to isc_mem_create in main, to set up whatever is
pointed to by ns_g_mctx. From the looks of things, this is a fairly
central data structure, but in all honesty, as of yet, the author has
not figured out what it is for exactly :)

Using this same tracking technique, we can also look at some of the
non-socket mutexes that lurk further down in the list:

        Mutex name:             line 95 file rwlock.c
        Mutex init order:       432
        Mutex address:          0x404b6980
        Mutex init:             0x404b6980 (address of init or 1st lock)
        Mutex type:             PTHREAD_PROCESS_PRIVATE
        Mutex type:             PTHREAD_MUTEX_NORMAL
        Mutex spin cnt:         700
        Mutex yield freq:       every 100 spins
        Lock success calls:     28937186
        Lock failed calls:      0
        Trylock success calls:  0
        Trylock failed calls:   0
        Unlock success calls:   28937186
        Unlock failed calls:    0
        Lock immediately:       26611307
        Lock during 1st spin:   2247119
        Lock after yield1:      45440
        Lock after yield2:      5038
        Lock after yield3:      3643
        Lock after yield4:      4056
        Lock after yield5:      5534
        Lock after yield6:      6576
        Lock after yield7:      2880
        Lock after yield8:      0
        Lock after yield9:      0
        Lock after yield10/more:0
        Lock after sleep (hdf): 0
        Lock after sleep1:      5566
        Lock after sleep2/more: 27
        Lock had waiters:       6146
        Lock max waiters:       3
        Lock total waiters:     7136
        Unlock had waiters:     5620
        Unlock max waiters:     3
        Unlock total waiters:   6806

# cat /var/tmp/mutex404b6980
Stack_Trace: error while unwinding stack
( 0)  0x00170f84   redirect_stack_trace + 0x4c  [./named_init_trace]
( 1)  0x00171020   isc_mutex_init_traced + 0x4c  [./named_init_trace]
( 2)  0x0015c164   isc_rwlock_init + 0x88  [./named_init_trace]
( 3)  0x0010dee8   dns_rbtdb_create + 0xd8  [./named_init_trace]
( 4)  0x00049414   dns_db_create + 0xb8  [./named_init_trace]
( 5)  0x00047b70   dns_cache_create + 0x168  [./named_init_trace]
( 6)  0x0001a588   configure_view + 0x6f4  [./named_init_trace]
( 7)  0x0001bee8   load_configuration + 0x20c  [./named_init_trace]
( 8)  0x0001ccd0   run_server + 0x148  [./named_init_trace]
( 9)  0x0016199c   dispatch + 0x2f0  [./named_init_trace]
(10)  0x00161b88   run + 0x10  [./named_init_trace]
(11)  0xc11eb8e4   __pthread_create_system + 0x434  [/tmp/instr_pthreads/libpthread.1]

        Mutex name:             line 3992 file rbtdb.c
        Mutex init order:       431
        Mutex address:          0x404b6924
        Mutex init:             0x404b6924 (address of init or 1st lock)
        Mutex type:             PTHREAD_PROCESS_PRIVATE
        Mutex type:             PTHREAD_MUTEX_NORMAL
        Mutex spin cnt:         700
        Mutex yield freq:       every 100 spins
        Lock success calls:     28984849
        Lock failed calls:      0
        Trylock success calls:  0
        Trylock failed calls:   0
        Unlock success calls:   28984849
        Unlock failed calls:    0
        Lock immediately:       26232294
        Lock during 1st spin:   2671616
        Lock after yield1:      46090
        Lock after yield2:      5229
        Lock after yield3:      4275
        Lock after yield4:      4794
        Lock after yield5:      6074
        Lock after yield6:      6268
        Lock after yield7:      2732
        Lock after yield8:      0
        Lock after yield9:      0
        Lock after yield10/more:0
        Lock after sleep (hdf): 0
        Lock after sleep1:      5444
        Lock after sleep2/more: 33
        Lock had waiters:       5784
        Lock max waiters:       3
        Lock total waiters:     6806
        Unlock had waiters:     5510
        Unlock max waiters:     3
        Unlock total waiters:   6774

# cat /var/tmp/mutex404b6924
Stack_Trace: error while unwinding stack
( 0)  0x00170f84   redirect_stack_trace + 0x4c  [./named_init_trace]
( 1)  0x00171020   isc_mutex_init_traced + 0x4c  [./named_init_trace]
( 2)  0x0010decc   dns_rbtdb_create + 0xbc  [./named_init_trace]
( 3)  0x00049414   dns_db_create + 0xb8  [./named_init_trace]
( 4)  0x00047b70   dns_cache_create + 0x168  [./named_init_trace]
( 5)  0x0001a588   configure_view + 0x6f4  [./named_init_trace]
( 6)  0x0001bee8   load_configuration + 0x20c  [./named_init_trace]
( 7)  0x0001ccd0   run_server + 0x148  [./named_init_trace]
( 8)  0x0016199c   dispatch + 0x2f0  [./named_init_trace]
( 9)  0x00161b88   run + 0x10  [./named_init_trace]
(10)  0xc11eb8e4   __pthread_create_system + 0x434  [/tmp/instr_pthreads/libpthread.1]

Another "interesting" consequence of putting stack traces into files
based on their address is one can see when a mutex is initialized more 
than once. This is definitely the case with the mutex at address
403dbd3c which has a _long_ series of inits - 464 of them in the
testing done. It would seem that 463 of these initializations have
this stack trace:

Stack_Trace: error while unwinding stack
( 0)  0x00170f84   redirect_stack_trace + 0x4c  [./named_init_trace]
( 1)  0x00171020   isc_mutex_init_traced + 0x4c  [./named_init_trace]
( 2)  0x00162cc4   isc_timer_create + 0x418  [./named_init_trace]
( 3)  0x0009be00   fctx_create + 0x2bc  [./named_init_trace]
( 4)  0x000a0f3c   dns_resolver_createfetch + 0x4d4  [./named_init_trace]
( 5)  0x00017a24   query_recurse + 0x1c0  [./named_init_trace]
( 6)  0x00018250   query_find + 0x320  [./named_init_trace]
( 7)  0x000196c4   ns_query_start + 0x2b8  [./named_init_trace]
( 8)  0x0000dec4   client_request + 0x2e4  [./named_init_trace]
( 9)  0x0016199c   dispatch + 0x2f0  [./named_init_trace]
(10)  0x00161b88   run + 0x10  [./named_init_trace]
(11)  0xc11eb8e4   __pthread_create_system + 0x434  [/tmp/instr_pthreads/libpthread.1]

and one has this:

Stack_Trace: error while unwinding stack
( 0)  0x00170f84   redirect_stack_trace + 0x4c  [./named_init_trace]
( 1)  0x00171020   isc_mutex_init_traced + 0x4c  [./named_init_trace]
( 2)  0x00162cc4   isc_timer_create + 0x418  [./named_init_trace]
( 3)  0x0000e9ac   client_create + 0xfc  [./named_init_trace]
( 4)  0x0000fb3c   ns_clientmgr_createclients + 0x2f8  [./named_init_trace]
( 5)  0x0000f428   ns_client_replace + 0x68  [./named_init_trace]
( 6)  0x000178b4   query_recurse + 0x50  [./named_init_trace]
( 7)  0x00018250   query_find + 0x320  [./named_init_trace]
( 8)  0x000196c4   ns_query_start + 0x2b8  [./named_init_trace]
( 9)  0x0000dec4   client_request + 0x2e4  [./named_init_trace]
(10)  0x0016199c   dispatch + 0x2f0  [./named_init_trace]
(11)  0x00161b88   run + 0x10  [./named_init_trace]
(12)  0xc11eb8e4   __pthread_create_system + 0x434  [/tmp/instr_pthreads/libpthread.1]

The author did not attempt to see if this was a mutex being created on
the stack, or something else. There were at least six other mutex
addresses with multiple initializations, but they are not shown here.

One additional data point was gleaned from the output of the mutex_perf
tool. Dividing the sum of all the successful locks in a two-name test
by the number of transactions performed yields the rather surprising
figure of over one hundred mutex lock/unlock pairs per
transaction. (The "precise" count was 58792905 mutex locks in a
90-second run at a rate of 4616.15 transactions per second, which
yields 141 lock/unlock pairs per transaction.) If this is accurate, it
could go a long way towards explaining all the time in
pthread_mutex_lock and such.



