Questions on BIND Start/stop Timings Solaris 9 vs. Ubuntu hardy

b19141 at anl.gov b19141 at anl.gov
Thu Apr 29 15:30:10 UTC 2010


I wrote on April 16:
>I did some timings with BIND 9.6.1-P3 and 9.7.0-P1 on two servers:
>
>     SunOS 5.9 sun4u sparc SUNW,Sun-Blade-1500  (old hardware)
>     Ubuntu hardy x86_64 GNU/Linux              (more modern hardware)
>
>I had noticed long times for "rndc reload" to complete, and I wanted to
>see if 9.6.1-P3 was different than 9.7.0-P1.
>
>My test DNS has three small "regular" zones defined.  It also has our
>list of 75,507 spyware/malware zones that we define to point to a
>"honeypot" machine to log and respond to the traffic.
>
>I have marked with "<=====" areas where I have questions.
>
>SunOS 5.9:
>     ---------------
>     9.7.0-P1
>     Apr 15 10:27:48  ./rndc reload
>     Apr 15 10:28:21  reloading configuration succeeded
>     Apr 15 10:28:21  zone binhafeez.ae/IN: loaded serial ...
>     Apr 15 10:30:38  zone thabengmanagement.co.za/IN: loaded serial ...
>                      named process taking 99.44% of CPU.   <=====
>     Apr 15 10:32:51  reloading zones succeeded
>     Apr 15 10:32:51  next command prompt
>     ---------------
>     Apr 15 10:46:07  ./rndc stop
>     Apr 15 10:46:07  shutting down: flushing changes
>     Apr 15 10:46:07  no longer listening on 146.137.180.21
>     Apr 15           named process not shown in "top" output.
>     Apr 15 10:47:41  ps -ef | grep named shows named process gone.  <=====
>     ---------------
>     Apr 15 10:53:09  Start BIND.
>     Apr 15 10:53:09  built with '--prefix=/export/home/named/bind'
>                                 '--with-openssl=/krb5'
>                                 '--sysconfdir=/export/home/named'
>                                 '--enable-threads'
>                                 '--localstatedir=/var'
>     Apr 15 10:53:56  named taking 87% of CPU
>     Apr 15 10:54:49  ps -ef | grep named shows two named processes.
>                           root 21638 21637 94 10:53:09 ?        1:34 /e...
>                           root 21637 19863  0 10:53:09 pts/2    0:00 /e...
>     Apr 15 10:53:40  zone binhafeez.ae/IN: loaded serial ...
>     Apr 15 10:55:55  zone thabengmanagement.co.za/IN: loaded serial ...
>     Apr 15 10:58:08  two named processes   <=====
>     Apr 15 10:58:09  next command prompt
>     Apr 15 10:58:13  one named process
>     ---------------
>     Apr 15 11:29:07  ./rndc stop
>     Apr 15 11:29:07  next command prompt
>                      named taking 90% - 95%  of CPU   <=====
>     Apr 15 11:30:45  named process finally stops
>     ---------------
>     9.6.1-P3
>     Apr 15 11:32:34  Start BIND.
>     Apr 15 11:32:34  built with '--prefix=/export/home/named/bind'
>                                 '--with-openssl=/krb5'
>                                 '--sysconfdir=/export/home/named'
>                                 '--enable-threads'
>                                 '--localstatedir=/var'
>                      two named processes running
>                      named taking 80% - 99% of CPU
>     Apr 15 11:33:05  zone binhafeez.ae/IN: loaded serial ...
>     Apr 15 11:35:17  zone thabengmanagement.co.za/IN: loaded serial ...
>     Apr 15 11:37:31  running  <=====
>     Apr 15 11:37:31  next comand prompt
>     Apr 15 11:37:32  one named process running
>     ---------------
>
>
>Ubuntu hardy:
>     ---------------
>     9.7.0-P1
>     Apr 15 10:38:54  ./rndc reload
>                      named process taking 11% - 20% of CPU
>     Apr 15 10:39:04  zone binhafeez.ae/IN: loaded serial ...
>     Apr 15 10:40:06  zone thabengmanagement.co.za/IN: loaded serial ...
>     Apr 15 10:40:19  next command prompt
>     ----------------
>     Apr 15 11:07:51  ./rndc stop
>     Apr 15 11:07:51  next command prompt
>     Apr 15 11:07:52  no longer listening on 146.139.115. 146#53
>     Apr 15           named taking 87% of CPU.
>     Apr 15 11:10:00  exiting
>     Apr 15 11:10:06  named process is gone.
>     ---------------
>     Apr 15 11:12:27  Start BIND.
>     Apr 15 11:12:27  built with '--prefix=/etc/iscbind/bind/'
>                                 '--sysconfdir=/etc/iscbind'
>                                 '--mandir=/usr/share/man'
>                                 '--infodir=/usr/share/info'
>                      named taking 80% of CPU
>                      two named processes running
>                           root     31162     1 60 11:12 ?        00:00:38 /et
>     Apr 15 11:12:37  zone binhafeez.ae/IN: loaded serial ...
>     Apr 15 11:13:12  zone thabengmanagement.co.za/IN: loaded serial ...
>     Apr 15 11:13:14  next command prompt
>     Apr 15 11:13:15  one named process running
>     ---------------
>     Apr 15 11:18:25  ./rndc stop
>     ---------------
>     9.6.1-P3
>     Apr 15 11:22:57  Start BIND.
>     Apr 15 11:22:58  built with '--prefix=/etc/iscbind/bind/'
>                                 '--sysconfdir=/etc/iscbind'
>                                 '--localstatedir=/var'
>                                 '--mandir=/usr/share/man'
>                                 '--infodir=/usr/share/info'
>     Apr 15 11:23:07  zone binhafeez.ae/IN: loaded serial ...
>     Apr 15 11:23:27  zone thabengmanagement.co.za/IN: loaded serial ...
>     Apr 15 11:23:31  two named processes running
>                      named taking 99% of CPU
>     Apr 15 11:23:31  next command prompt
>     Apr 15 11:23:36  one named process running
>     ---------------
>
>The Ubuntu hardy (running on newer hardware) is faster.  My questions:
>
>1) On SunOS 5.9 after the last malware zone is loaded, there is a
>   wait of 2 minutes and 13 seconds before control is returned to
>   the command prompt, and DNS begins servicing queries.  During this
>   time BIND is using close to 100% of the CPU.
>   On Ubuntu hardy, the wait is much shorter - up to 13 seconds.
>   What is BIND doing after the last zone is loaded?  Why does it take
>   much longer on Solaris 9?
>
>2) When BIND starts (on both operating systems), "ps -ef" shows two
>   processes running:
>
>	ps -ef | grep named
>
>	root     31162     1 60 11:12 ?        00:00:38  /etc...
>	root     31162 31161 60 11:12 ?        00:00:20  /etc...
>
>   When control is returned to the command prompt, the "ps -ef"
>   command shows only one process running.  What is this second
>   process that is running during startup?
>
>3) With "rndc stop" it takes a second before control is returned
>   to the command prompt, but it takes 1.5 minutes before BIND
>   terminates.  What is happening during this time?  Is BIND writing
>   all of the zones to disk from memory?
>
>Thanks.

I have done more research.  I ran a truss trace, and the trace
ends when BIND issues the "running" message, and only one named
process is running.  The truss command must be tracing the process
that ends when the "running" message is issued.

I ran a truss trace with the BIND "-f" parameter (Run the server in
the foreground (that is, do not daemonize)).  The truss trace was
run on Solaris 10_x86.  Other timings were done on Solaris 9
Sun-Blade-1500 and Ubuntu hardy x86_64).

Here are delta timings on the two Solaris boxes and an Ubuntu/hardy
x86_64 box.  There are about 75,000 spyware/malware zones being loaded.

                                Sol9 x86  Ubu
-------------------------------|----|----|----|
first zone loaded              |0:00|0:00|0:00|
last zone loaded               |2:14|1:21|0:22|
next command prompt/"running"  |2:13|1:32|0:03|
BIND now using 0% CPU          |  ??|0:43|0:05|

It takes longer on Solaris to load the 75,000 zones, but that may be
due to faster hardware on the Ubuntu/hardy machine.  On Ubuntu,
after the last zone has been loaded, it takes about three seconds
until control is returned to the command prompt and BIND begins
responding to DNS requests.  It takes much longer (with close to
100% CPU usage) on both Solaris boxes.  What is named doing during
this time?

In the truss trace I see after the last zone has been loaded MANY
lines like these:

     lwp_unpark(4)            = 0
     lwp_park(0xFE94FE60, 0)  = 0
     lwp_park(0x00000000, 0)  = 0
     lwp_park(0xFE94FE60, 0)  = 0
     lwp_unpark(4)            = 0
     yield()                  = 0

There are fewer "yield()" lines than "lwp_" lines.

After the "running" message is issued, I see LOTS of lines like these:

     lwp_unpark(2)            = 0
     lwp_park(0x00000000, 0)  = 0
     lwp_unpark(3)            = 0
     lwp_park(0x00000000, 0)  = 0
     yield()                  = 0

(again with fewer "yield()" lines) until the BIND/named process is
using 0% of the CPU.  Then the truss trace has two lines without
timestamps:

     lwp_park(0x00000000, 0)         (sleeping...)
     lwp_park(0x00000000, 0)         (sleeping...)

and then there are no more trace entries until I shutdown BIND.

I found one Sun blog site

     http://blogs.sun.com/chrisg/entry/lwp_park_and_lwp_unpark

that explains the lwp_ sytem calls, but I do not know the BIND source
code well enough to determine what code is causing these lwp_ and
yield calls, which are using close to 100% of the CPU.
----------------------------------------------------------------------
Barry S. Finkel
Computing and Information Systems Division
Argonne National Laboratory          Phone:    +1 (630) 252-7277
9700 South Cass Avenue               Facsimile:+1 (630) 252-4601
Building 240, Room 5.B.8             Internet: BSFinkel at anl.gov
Argonne, IL   60439-4828             IBMMAIL:  I1004994



More information about the bind-users mailing list