BIND 9.7 Serial Number Decrease Problem

Barry Finkel bsfinkel at anl.gov
Mon Jun 6 19:01:32 UTC 2011


In message <4DE9045C.2050509 at anl.gov>, Barry Finkel writes:
>> I have a problem with BIND 9.7.x on Ubuntu.
>> I have two servers that are running 9.7.3.
>> They slave 332 zones, and they also master 213,750
>> malware/spyware zones that we have defined to reroute these
>> domains to a local machine.
>>
>> When I was upgrading the BIND to 9.7.3-P1 yesterday, an
>>
>>        ./rndc stop
>>
>> command ran over 8 minutes, and named did not stop.
>> A "kill" command did not work; I had to resort to a
>> "kill -9" command.  What was BIND doing?  Gracefully
>> closing all of the zones?
>
> Most probably.  "rndc stop" ensures that master files are up to date
> before exiting.  "rndc halt" does not try to flush master files
> before exiting.
>
> There could also have been a reference leak causing named to not
> stop.
>
>>   BIND 9.7.3-P1 came up fine, but there are two things that concern me:
>>
>> 1) After BIND began responding to queries, it was using
>>      100% of the CPU for about three minutes.  I am not sure what
>>      BIND was doing.  This is not major because BIND was handling
>>      customer queries, and after the three minutes the CPU usage
>>      dropped to a normal 1%.
>>
>> 2) Two zones reported serial number decreases.  This is bad.
>>
>> I did some research on the two zones - both Microsoft
>> Active Directory zones (one _tcp and one _udp) that are mastered
>> on a Windows Domain Controller and slaved on my BIND boxes.
>> I have around 44 AD zones I slave, and only these two reported
>> problems - on my two internal Ubuntu slaves and my two Solaris 10
>> slaves.  The two Solaris 10 slaves do not run the spyware zones,
>> so I had no problem with "./rndc stop".  I therefore am not sure
>> that the serial number problems are due to the "kill -9".
>
> They shouldn't be.  The handling of master files and journals is
> designed to have the power be pulled at any time provided the filesystem
> supports atomic replacement of files.
>
>> I looked at the serial number issue on these two zones in detail;
>> I capture the serial numbers on all the AD zones each morning at
>> 6:10.  Here is information for the _tcp zone:
>>
>>        Date        Zone  Mast Slav Slav
>>        20 Oct 2010 _tcp. 1233 1233 1233
>>        21 Oct 2010 _tcp. 1239 1239 1239 The master incremented the serial.
>>        ...
>>        09 Nov 2010 _tcp. 1239 1239 1239
>>        10 Nov 2010 _tcp. 1238 1239 1239 Master decreased due to MS patch
>>        11 Nov 2010 _tcp. 1238 1238 1238
>>        ...
>>        03 Dec 2010 _tcp. 1238 1238 1238
>>        04 Dec 2010 _tcp. 1238 1238 1239 ??
>>        05 Dec 2010 _tcp. 1238 1239 1238 ??
>>        06 Dec 2010 _tcp. 1238 1238 1238
>>        ...
>>        09 Dec 2010 _tcp. 1238 1238 1238
>>        10 Dec 2010 _tcp. 1238 1238 1239 ??
>>        11 Dec 2010 _tcp. 1238 1239 1238 ??
>>        12 Dec 2010 _tcp. 1238 1238 1238
>>        ...
>>        05 Jan 2011 _tcp. 1238 1238 1238
>>        06 Jan 2011 _tcp. 1238 1239 1239 ??
>>        07 Jan 2011 _tcp. 1238 1238 1238
>>        ...
>>        02 Mar 2011 _tcp. 1238 1238 1238 Upgrade 9.7.2-P3 to 9.7.3
>>        03 Mar 2011 _tcp. 1238 1239 1239
>>        04 Mar 2011 _tcp. 1238 1238 1238
>>        ...
>>        16 Apr 2011 _tcp. 1238 1238 1238
>>        17 Apr 2011 _tcp. 1238 1238 1238 1238 1238 Two Sol10 slaves added.
>>        ...
>>        02 Jun 2011 _tcp. 1238 1238 1238 1238 1238 Upgrade 9.7.3 to 9.7.3-P1
>>        03 Jun 2011 _tcp. 1238 1239 1239 1239 1239
>>
>> Both Ubuntu slaves have been up for 149 days (reboot around Jan 15).
>> The zone serial was 1239 until a MS patch run on the Domain
>> Controller decreased the serial by one on the evening of Nov 9.
>> I did nothing to correct the problem; I waited for the two zones
>> to expire, and then new zones were transferred from the Windows
>> master server.  The serial number was 1238 on the master and
>> slaves.  On a few days, the serial on the slaves increased
>> by one, and I am not sure what happened on those days.
>>
>> On Mar 02 I upgraded BIND from 9.7.2-P3 to 9.7.3, and the
>> serial numbers on the two upgraded BIND slaves reverted to the
>> higher 1239 serial.  Again, I did no fixup, and on Mar 04
>> the serials were the same at the lower value.  I think that the
>> serial number decrease was temporary during the patch run.
>> On Apr 17 I added the two Solaris 10 slaves to my morning report, and
>> all five serials were constant at 1238 until I upgraded BIND Tuesday (on
>> the Solaris 10 boxes) and yesterday (on the Ubuntu boxes).  Immediately
>> after the upgrade BIND reported the serial number problem on these two
>> zones.  The other AD zones have had no serial number problems.
>>
>> I have no idea why BIND would remember the increased 1239
>> serial number, when the serial number for the zone has been constant
>> at 1238 since Mar 04.  I have to assume that between Mar 04 and
>> Jun 03 BIND would have written the zone to disk, either in the
>> base zone file or a .jnl file.
>>
>> --
>> ----------------------------------------------------------------------
>> Barry S. Finkel

Phil Mayers suggested a corrupt .jnl file; I am not sure.
How do I debug this?
I have the following situation now:

      1) The master (on an MS DNS Server) has serial 1238.

      2) The zone file on a Solaris 10 slave box has serial 1239.
         The slave is running 9.7.3-P1.

      3) There is a zone.jnl file on the box, and when I do a dig for
         SOA the response is the correct value, 1238.  This decreased
         serial number must be somewhere in the .jnl file.  I am not
         sure how to decode the .jnl file; I have not looked at the
         code in detail.

      4) There are no complaints about serial number mismatches, as
         the .jnl file has the decreased serial number to match the
         master's serial number.
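
For reference, SOA serials are compared with RFC 1982 serial number arithmetic, which would explain why a slave holding 1239 does not pick up a master that has gone back to 1238. The following is only an illustrative sketch of that comparison, not code taken from named; the function names serial_gt and slave_would_transfer are hypothetical.

```python
SERIAL_BITS = 32
HALF = 1 << (SERIAL_BITS - 1)      # 2**31
MASK = (1 << SERIAL_BITS) - 1      # 2**32 - 1


def serial_gt(a: int, b: int) -> bool:
    """Return True if serial a is 'greater than' serial b under RFC 1982.

    The difference is taken modulo 2**32; a counts as newer only when that
    difference lies strictly between 0 and 2**31.
    """
    diff = (a - b) & MASK
    return 0 < diff < HALF


def slave_would_transfer(master_serial: int, slave_serial: int) -> bool:
    """Hypothetical helper: a slave refreshes only if the master looks newer."""
    return serial_gt(master_serial, slave_serial)


if __name__ == "__main__":
    # Master went back to 1238 while the slave still holds 1239.
    print(slave_would_transfer(1238, 1239))
    # Normal case: master moved ahead of the slave.
    print(slave_would_transfer(1239, 1238))
```

Under this rule a decreased master serial simply looks older to the slave, so nothing is transferred until the slave's copy expires or is removed by hand.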


Here is an "od -c" for the .jnl file:

0000000   ;   B   I   N   D       L   O   G       V   9  \n  \0  \0  \0
0000020  \0  \0  \0   &  \0  \0 002  \0  \0  \0  \0   '  \0  \0 004 264
0000040  \0  \0  \0   8  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
0000060  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
0000100  \0  \0  \0   &  \0  \0 002  \0  \0  \0  \0  \0  \0  \0  \0  \0
0000120  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0001000  \0  \0 002 250  \0  \0  \0   &  \0  \0  \0   '  \0  \0  \0   l
0001020 004   _   t   c   p 004   c   i   e   p 003   d   i   s 003   a
0001040   n   l 003   g   o   v  \0  \0 006  \0 001  \0  \0 016 020  \0
0001060   K 007   c   i   e   p   d   c   1 004   c   i   e   p 003   d
0001100   i   s 003   a   n   l 003   g   o   v  \0  \n   h   o   s   t
0001120   m   a   s   t   e   r 004   c   i   e   p 003   d   i   s 003
0001140   a   n   l 003   g   o   v  \0  \0  \0  \0   &  \0  \0 003 204
0001160  \0  \0 002   X  \0 001   Q 200  \0  \0 016 020  \0  \0  \0   ;
0001200 004   _   t   c   p 004   c   i   e   p 003   d   i   s 003   a
0001220   n   l 003   g   o   v  \0  \0 002  \0 001  \0  \0 016 020  \0
0001240 032 007   c   i   e   p   d   c   2 004   c   i   e   p 003   d
0001260   i   s 003   a   n   l 003   g   o   v  \0  \0  \0  \0   / 004
0001300   _   t   c   p 004   c   i   e   p 003   d   i   s 003   a   n
0001320   l 003   g   o   v  \0  \0 002  \0 001  \0  \0 016 020  \0 016
0001340 004   d   n   s   2 003   a   n   l 003   g   o   v  \0  \0  \0
0001360  \0   / 004   _   t   c   p 004   c   i   e   p 003   d   i   s
0001400 003   a   n   l 003   g   o   v  \0  \0 002  \0 001  \0  \0 016
0001420 020  \0 016 004   d   n   s   1 003   a   n   l 003   g   o   v
0001440  \0  \0  \0  \0   ; 004   _   t   c   p 004   c   i   e   p 003
0001460   d   i   s 003   a   n   l 003   g   o   v  \0  \0 002  \0 001
0001500  \0  \0 016 020  \0 032 007   c   i   e   p   d   c   1 004   c
0001520   i   e   p 003   d   i   s 003   a   n   l 003   g   o   v  \0
0001540  \0  \0  \0   l 004   _   t   c   p 004   c   i   e   p 003   d
0001560   i   s 003   a   n   l 003   g   o   v  \0  \0 006  \0 001  \0
0001600  \0 016 020  \0   K 007   c   i   e   p   d   c   1 004   c   i
0001620   e   p 003   d   i   s 003   a   n   l 003   g   o   v  \0  \n
0001640   h   o   s   t   m   a   s   t   e   r 004   c   i   e   p 003
0001660   d   i   s 003   a   n   l 003   g   o   v  \0  \0  \0  \0   '
0001700  \0  \0 003 204  \0  \0 002   X  \0 001   Q 200  \0  \0 016 020
0001720  \0  \0  \0   ; 004   _   t   c   p 004   c   i   e   p 003   d
0001740   i   s 003   a   n   l 003   g   o   v  \0  \0 002  \0 001  \0
0001760  \0 016 020  \0 032 007   c   i   e   p   d   c   1 004   c   i
0002000   e   p 003   d   i   s 003   a   n   l 003   g   o   v  \0  \0
0002020  \0  \0   / 004   _   t   c   p 004   c   i   e   p 003   d   i
0002040   s 003   a   n   l 003   g   o   v  \0  \0 002  \0 001  \0  \0
0002060 016 020  \0 016 004   d   n   s   1 003   a   n   l 003   g   o
0002100   v  \0  \0  \0  \0   / 004   _   t   c   p 004   c   i   e   p
0002120 003   d   i   s 003   a   n   l 003   g   o   v  \0  \0 002  \0
0002140 001  \0  \0 016 020  \0 016 004   d   n   s   2 003   a   n   l
0002160 003   g   o   v  \0  \0  \0  \0   ; 004   _   t   c   p 004   c
0002200   i   e   p 003   d   i   s 003   a   n   l 003   g   o   v  \0
0002220  \0 002  \0 001  \0  \0 016 020  \0 032 007   c   i   e   p   d
0002240   c   2 004   c   i   e   p 003   d   i   s 003   a   n   l 003
0002260   g   o   v  \0
0002264
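
One way to start decoding the dump is to read the fixed fields at the front of the file. The sketch below assumes a layout inferred from the bytes above and from a reading of BIND 9's journal code: a 16-byte format string, followed by big-endian 32-bit values for begin serial, begin offset, end serial, end offset, and index size. That layout and the name parse_journal_header are assumptions, so treat the result as a hint rather than an authoritative decode. The sample input is the first 36 bytes transcribed from the dump.

```python
import struct

MAGIC = b";BIND LOG V9\n"


def parse_journal_header(data: bytes) -> dict:
    """Decode the assumed fixed fields at the start of a BIND 9 .jnl file."""
    if len(data) < 36:
        raise ValueError("need at least 36 bytes of journal header")
    if not data.startswith(MAGIC):
        raise ValueError("not a ';BIND LOG V9' journal")
    # Assumed layout: 16-byte format string, then five big-endian uint32s.
    fields = struct.unpack(">5I", data[16:36])
    names = ("begin_serial", "begin_offset", "end_serial",
             "end_offset", "index_size")
    return dict(zip(names, fields))


# First 36 bytes of the od -c output above, written as hex.
sample = MAGIC + b"\0\0\0" + bytes.fromhex(
    "00000026" "00000200" "00000027" "000004b4" "00000038"
)
header = parse_journal_header(sample)

if __name__ == "__main__":
    for name, value in header.items():
        print(f"{name:13s} {value}")
```

Read this way, the header describes a journal running from serial 38 at offset 512 to serial 39 at offset 1204. The end offset agrees with the dump, which stops at octal 2264 (1204 decimal). To run it against the real file, pass the first 64 bytes of the .jnl (opened in binary mode) to parse_journal_header.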


If I were to stop and restart BIND, is there any tracing I can
do to see what happens?  I have no problem with doing a quick
restart on this Solaris 10 server because it does not have many
users, and it does not have the large number of malware/spyware
zones.  Thanks.
-- 
----------------------------------------------------------------------
Barry S. Finkel
Computing and Information Systems Division
Argonne National Laboratory          Phone:    +1 (630) 252-7277
9700 South Cass Avenue               Facsimile:+1 (630) 252-4601
Building 240, Room 5.B.8             Internet: BSFinkel at anl.gov
Argonne, IL   60439-4828             IBMMAIL:  I1004994


