Journal errors

Fri Apr 28 13:55:12 UTC 2006

>> "Phillip" <preeves1 at gmail.com> wrote:
>> 
>>>> We have BIND 9.2.3 running on RHE3.   Our Red Hat servers are slaves to
>>>> windows servers.  The errors that are piling up in our logs are as
>>>> follows:
>>>>
>>>> named[27177]: malformed transaction: db.mydomain.com.jnl last serial
>>>> 20065 != transaction first serial 20064
>>>> named[27177]: transfer of 'mydomain.com/IN' from 192.168.100.10#53:
>>>> failed while receiving responses: unexpected error
>>>> named[27177]: transfer of 'mydomain.com/IN' from 192.168.100.10#53: end
>>>> of transfer
>>>> named[27177]: zone mydomain.com/IN: transferred serial 20065
>>>>
>>>> These are the only errors or warnings that we receive in our logs and
>>>> we get about 50 a day.  Not a big deal but getting annoying.  I've been
>>>> searching for the past few weeks for the answer to this question
>>>> myself.   I read a little and it says that jnl files are created on a
>>>> BIND server that permits IXFR transfers to reduce network traffic.
>>>> These files are created when a primary server announces that an update
>>>> has been made.  The jnl files hold for 15 minutes (I think) waiting on
>>>> any more updates and then once the time expires BIND attempts to merge
>>>> the changes into the existing db file on the slave server.  When I
>>>> monitored the changes I would look at the Windows servers and notice
>>>> that they might make 2 or 3 changes per 15 minutes, maybe more.  Every
>>>> change it would update the jnl file with an updated SOA.  All this
>>>> makes sense to me even the message...
>>>> "named[27177]: malformed transaction: db.mydomain.com.jnl last serial
>>>> 20065 != transaction first serial 20064"
>>>>
>>>> However I don't understand why it errors in the updating of the db
>>>> record after this and surely it has nothing to do with the diff in
>>>> SOAs.  I was wondering if this was a bug in BIND 9.2.X

>> And I replied:
>> 
>>> I am seeing messages like this:
>>>
>>>     Apr 18 03:12:44 dns0.anl.gov named[163]: [ID 873579 daemon.error]
>>>       malformed transaction: cmt.anl.gov.jnl last serial 2001077345 !=
>>>       transaction first serial 2001077344
>>>
>>> In my case, the master for the zone in question is a MS W2k+3 DNS
>>> Server, with many DDNS updates throughout the day from a MS W2k+3 DHCP
>>> Server.  I slave the zone internally on BIND dns1.anl.gov.  I want the
>>> zone on dns0 (a hidden BIND "master") so that I can process it along
>>> with my other zones via scripts.  For various reasons I cannot have
>>> dns0 be a slave to the W2k+3 DNS Server, so I use dns1 as the master.
>>> I think in this case that dns1 starts an IXFR to dns0, and during the
>>> IXFR an IXFR arrives at dns1 from the real W2k+3 DNS master.  So, this
>>> error is between two BIND 9.2.4 systems.  I know that I need to upgrade
>>> to the latest BIND 9.3.x.  As with Phillip, "I was wondering if this was
>>> a bug in BIND 9.2.X."  I am not seeing any of these errors in the
>>> AXFR/IXFR from the W2k+3 DNS Server to any of my four BIND slave
>>> servers.

and later I replied:
>> I got a snoop trace of this, and I combined it with the dns.log file
>> from my W2k+3 DNS Server.  I am not an expert in decoding DNS packets,
>> especially IXFR packets.  I can send the trace records and my summary
>> to anyone who wants to look at this.  I probably will not file a bug
>> report until I can reproduce it on a more current level of BIND.
>> I have not looked at the change log in the newer BINDs to see if this
>> is listed as a known bug that has been resolved.
>> 
>> In my case, I can get around the problem by having the dns1 server
>> "also-notify" the dns0 server when the zones from the Windows box
>> are updated.  For a number of technical reasons, I cannot have the
>> Windows DNS Server notify dns0 when a zone is updated.  The problem
>> seems to occur when there are a number of updates to the master zone.
>> That zone gets successfully IXFRed to dns1 after each DDNS update.
>> But dns0 is not notified.  At the refresh interval, dns0 asks dns1 for
>> the zone SOA, and dns0 sees an increased zone serial number.  When the
>> IXFR from dns1 to dns0 occurs, there could be multiple updates included
>> in the IXFR, and this is where the error is occurring.  A decoding of
>> the snoop packets would shed light on exactly what is happening.
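
To make the scenario above concrete, here is a rough Python sketch of how
several updates can pile up on dns1 between two refresh checks from dns0, so
that one IXFR has to carry more than one serial step.  This is only a model
of the timing, not anything taken from BIND.  The function name, the
900-second refresh interval, and the assumption that each DDNS update bumps
the serial by exactly one are all mine, for illustration only.

```python
def ixfr_batches(update_times, refresh_interval, start_serial):
    """Group zone updates into the IXFR that each refresh poll would carry.

    update_times: seconds at which dns1 receives an update.
                  ASSUMPTION: every update increases the serial by exactly 1.
    refresh_interval: seconds between SOA checks from dns0 (hypothetical value).
    Returns a list of (poll_time, from_serial, to_serial), one entry per poll
    that found a newer serial on dns1.
    """
    batches = []
    slave_serial = start_serial    # what dns0 currently holds
    master_serial = start_serial   # what dns1 currently holds
    pending = sorted(update_times)
    i = 0
    poll = refresh_interval
    while i < len(pending):
        # Apply every update that reached dns1 before this poll.
        while i < len(pending) and pending[i] <= poll:
            master_serial += 1
            i += 1
        # dns0 only transfers when the SOA serial has increased.
        if master_serial > slave_serial:
            batches.append((poll, slave_serial, master_serial))
            slave_serial = master_serial
        poll += refresh_interval
    return batches


if __name__ == "__main__":
    # Three updates before the first poll, one more before the second.
    for poll, old, new in ixfr_batches([60, 200, 500, 1000], 900, 20062):
        print(f"t={poll}s: IXFR {old} -> {new} ({new - old} update(s))")
```

With NOTIFY (or "also-notify") in place, dns0 would instead check shortly
after each update, so in this model every batch would span a single serial
step.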

Danny Mayer <mayer at gis.net> replied to my posting:

>I seem to recall that Windows DNS doesn't do IXFR correctly and it's
>very hard to get right. BIND 8 was never quite right and it wasn't until
>BIND 9 that BIND was able to do it correctly. There were also some
>interesting requirements discussed on namedroppers that needed to be
>implemented on conforming servers in order for this to work. I suggested
>you fall back to AXFR.

As far as I can tell, there is no problem with the IXFRs from Windows
to my BIND server dns1.  The problem is with the accumulation of 
updates on dns1 being packaged correctly to do an IXFR to another BIND
server.  As dns1 does not NOTIFY dns0 when there is a new copy of the
zone, the IXFRs on dns1 get stacked until the refresh timer on dns0
determines that it is time for another SOA serial check on dns1.
But as I previously wrote, I have not looked at the snoop packets in
detail to determine exactly what is causing the BIND error message.
I do not know if the original IXFR packets from Windows to dns1 have
problems, and those problems are not reported as errors until the
subsequent IXFR to the dns0 BIND box.
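
For anyone trying to read the error message itself, the following Python
sketch shows the kind of continuity check that the wording of the message
suggests: each transaction runs from one serial to the next, and a new
transaction is refused when its first serial is not the last serial already
recorded.  This is a simplified model inferred from the log text only, not
BIND's journal code, and the function name is hypothetical.

```python
def replay(transactions):
    """Apply (first_serial, last_serial) steps in order.

    Returns (journal_last_serial, errors).  A step is rejected when its
    first serial does not match the last serial already recorded, which
    is the condition the log message appears to describe.
    ASSUMPTION: this models the message text only, not BIND internals.
    """
    journal_last = None
    errors = []
    for first, last in transactions:
        if journal_last is not None and first != journal_last:
            errors.append(
                f"last serial {journal_last} != "
                f"transaction first serial {first}"
            )
            continue  # the rejected step leaves the journal unchanged
        journal_last = last
    return journal_last, errors


if __name__ == "__main__":
    # The step 20064 -> 20065 arrives a second time after it was applied.
    steps = [(20063, 20064), (20064, 20065), (20064, 20065)]
    last, errs = replay(steps)
    print("journal last serial:", last)
    for e in errs:
        print("malformed transaction:", e)
```

In this model a repeated or overlapping step gives the same serial pair as in
Phillip's log (last serial 20065, first serial 20064).  Whether that is what
actually happens on the wire is exactly what the snoop trace would have to
confirm.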
----------------------------------------------------------------------
Barry S. Finkel
Computing and Information Systems Division
Argonne National Laboratory          Phone:    +1 (630) 252-7277
9700 South Cass Avenue               Facsimile:+1 (630) 252-4601
Building 222, Room D209              Internet: BSFinkel at anl.gov
Argonne, IL   60439-4828             IBMMAIL:  I1004994