Secondaries sometimes don't respond to notify

Thu Mar 3 21:51:31 UTC 2005

Re our problem with occasional dropped notifies:

I'm thinking of experimenting with the Solaris setting udp_recv_hiwat.
Can anyone share experience with raising udp_recv_hiwat
on loaded query servers? Did raising it visibly help with any problems
or hurt anything?

Sockets have a setting, SO_RCVBUF, controllable by setsockopt,
that determines the size of the buffer through which the socket's
incoming UDP packets pass, but as far as I can see, BIND 9.3
doesn't touch the setting.  Our Solaris uses 8192 bytes by default, and
has a system-wide setting controllable by ndd (udp_recv_hiwat)
which is the default for SO_RCVBUF.  I'm thinking of raising
udp_recv_hiwat to a moderately larger value, e.g.
doubling or quadrupling it, then restarting named, to see what
difference it makes.
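
To make the SO_RCVBUF relationship concrete, here is a minimal sketch
(Python, not anything BIND itself does) that opens a UDP socket, reads
the receive buffer size it inherited from the system default, and then
asks for a larger one.  The helper name udp_rcvbuf and the 4 x 8192
request are my own illustrative choices.  The value reported after the
request depends on the OS, which may round, double, or cap it.

```python
import socket

def udp_rcvbuf(requested=None):
    """Return (default, effective) SO_RCVBUF sizes for a fresh UDP socket.

    'default' is what the socket inherits from the system-wide setting;
    'effective' is what the OS reports after an optional setsockopt request.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        default = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
        if requested is not None:
            # The kernel is free to adjust or cap the requested size.
            s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        effective = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
        return default, effective
    finally:
        s.close()

if __name__ == "__main__":
    # Hypothetical request: quadruple the 8192-byte figure mentioned above.
    default, effective = udp_rcvbuf(4 * 8192)
    print("default:", default, "after request:", effective)
```

On Solaris the system-wide default itself can be read with
"ndd -get /dev/udp udp_recv_hiwat" and changed with
"ndd -set /dev/udp udp_recv_hiwat <bytes>"; a process started
afterwards picks up the new default for its sockets.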

John Wobus
Cornell CIT

I wrote:

> I'm not solving this, so I'll give an update and see if anyone has
> good ideas to offer me.
>
> The problem is two secondaries that randomly "drop" notifies from
> the primary (BIND 9.3, more details in previous message).
>
> The servers are on Solaris 8 and I used snoop (like tcpdump)
> to verify that the packets do indeed cross the network to the
> secondary.  Then, sometimes the notify works as advertised,
> but at random times two kinds of failures occur.  Sometimes
> named on the secondary never logs that it received the notify.
> In far fewer instances, named does log it, but snoop never
> shows it responding with the notify response.  I've tried looking at
> truss output, but it didn't help me piece together much
> more of the picture.
>
> The ideas I still have left to try are (1) crank up and pore through
> BIND debugging logging or (2) put the secondary on a bigger
> server and see if the problem disappears.  I can believe that
> the secondary is simply too busy, but I would expect some sort
> of logging, or confirmation that other sites have seen
> this before.  As supporting evidence, I do see the problem occurring
> more often during busier times.
>
> Any ideas/inspirations appreciated.
>
> John Wobus
>
> On Jan 24, 2005, at 5:23 PM, John Wobus wrote:
>
> > Notifies sometimes get lost between our BIND 9.3 servers.  What can I
> > look for as a cause?
> >
> > Two secondary servers are showing the problem with the same primary
> > server.  When the failure occurs, the primary server logs that it sent
> > notifies, then logs 'notify retries exceeded' for the secondary in
> > question.  The secondary's log shows nothing.  Zones and secondaries
> > affected at any particular instance are random: failure occurs for only
> > 10-40% of the notifications.  When one secondary fails for a particular
> > zone, the other one often succeeds in loading it.  The new zone files
> > have updated SOA serial numbers.  The failing secondary later transfers
> > the zone successfully, when the refresh interval expires.  None of the
> > servers have firewall software.  The servers serve fewer than 300
> > zones.
> >
> > I've checked the network, the BIND config file options (which are
> > generally the defaults), looked for other problems in the logs,
> > searched my BIND books/manuals and searched online, and I have run out
> > of ideas.
> >
> > John Wobus
> > Cornell CIT
> >
> >