Secondaries sometimes don't respond to notify
jw354 at cornell.edu
Thu Mar 3 21:51:31 UTC 2005
Re our problem with occasional dropped notifies:
I'm thinking of experimenting with Solaris's setting, "udp_recv_hiwat".
Can anyone share experience with raising udp_recv_hiwat
on loaded query servers? Did raising it visibly help any problems
or hurt anything?
Sockets have a setting, SO_RCVBUF, controllable by setsockopt,
that determines the size of a buffer through which the socket's
incoming UDP packets pass, but as far as I can see, BIND9.3
doesn't touch the setting. Our Solaris uses 8192 by default, and
has a system-wide setting controllable by ndd (udp_recv_hiwat)
which is the default for SO_RCVBUF. I'm thinking of raising
udp_recv_hiwat to some moderately increased value, e.g.
doubling or quadrupling it, then restarting named, to see what
difference it makes.
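
For anyone who wants to check what a fresh UDP socket gets on their own
system, here is a minimal sketch in Python (my choice for illustration,
not something named uses). The helper name probe_rcvbuf is hypothetical.
It reads the default SO_RCVBUF, asks for a larger value, and reads the
result back. The kernel may round, scale or cap the request, so the
value read back need not equal the value requested. It only inspects
its own socket and changes nothing for a running named.

```python
import socket


def probe_rcvbuf(requested):
    """Return (default, effective) SO_RCVBUF sizes in bytes for a new UDP socket.

    probe_rcvbuf is a hypothetical helper for illustration only.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        # Size the kernel assigned by default (on Solaris this default
        # comes from udp_recv_hiwat, per the discussion above).
        default = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
        # Ask for a different size; the kernel may adjust or cap it.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        # Read back what the kernel actually granted.
        effective = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return default, effective


if __name__ == "__main__":
    default, effective = probe_rcvbuf(4 * 8192)  # quadruple the 8192 default
    print("default SO_RCVBUF:", default)
    print("after requesting 32768:", effective)
```

Running it before and after changing udp_recv_hiwat should show whether
the new system-wide default is what new sockets receive.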
> I'm not solving this, so I'll give an update and see if anyone has
> good ideas to offer me.
> The problem is two secondaries that randomly "drop" notifies from
> the primary (BIND9.3, more details in previous message).
> The servers are on Solaris 8 and I used snoop (like tcpdump)
> to verify that the packets do indeed cross the network to the
> secondary. Then, sometimes the notify works as advertised,
> but at random times two kinds of failures occur. Sometimes
> named on the secondary never logs that it received the notify.
> In far fewer instances, named does log it, but snoop never
> shows it responding with the notify response. I've tried looking at
> truss output, but didn't make progress fitting together much
> more of the picture.
> The ideas I still have left to try are (1) crank up and pore through
> BIND debugging logging or (2) put the secondary on a bigger
> server and see if the problem disappears. I can believe that
> the secondary is simply too busy, but would expect some sort
> of logging or to hear confirmation that other sites have seen
> this before. As evidence, I do see the problem occurring
> more often during busier times.
> Any ideas/inspirations appreciated.
> John Wobus
> On Jan 24, 2005, at 5:23 PM, John Wobus wrote:
> > Notifies sometimes get lost between our bind 9.3 servers. What can I
> > look for as a cause?
> > Two secondary servers are showing the problem with the same primary
> > server. When the failure occurs, the primary server logs that it
> > notifies, then logs 'notify retries exceeded' for the secondary in
> > question. The secondary's log shows nothing. Zones and secondaries
> > affected at any particular instance are random: failure occurs for
> > 10-40% of the notifications. When one secondary fails for a
> > zone, the other one often succeeds in loading it. The new zone files
> > have updated SOA serial numbers. The failing secondary later loads
> > the zone successfully, when the refresh interval expires. None of
> > the servers have firewall software. The servers serve fewer than 300
> > zones.
> > I've checked the network, the bind config file options (which are
> > generally the defaults), looked for other problems in the logs,
> > searched my bind books/manuals and searched online and I have run out
> > of ideas.
> > John Wobus
> > Cornell CIT