BIND slave server ignoring responses to all UDP-based SOA queries (zone refresh) for hours at a time

Irwin Tillman irwin at princeton.edu
Thu Jun 4 18:36:29 UTC 2015


Apologies in advance for this lengthy description.

Since I made a configuration change a few weeks ago, every 1-3 days my
BIND 9.9.7 server experiences several hours of retry/timeout failures while
performing UDP-based SOA serial number queries (zone refresh).

My server acts as if it doesn't hear the master servers' responses to its
UDP-based SOA queries.  Each such UDP-based SOA query times out and is retried several times;
BIND then falls back to using TCP for the query, and it does "hear" that TCP-based answer.

This affects all the zones for which my server acts as a slave.

For example, focusing on just a single slave zone, my server logs (names and addresses anonymized):

May 30 14:56:23 foo.example.com named[16217]: general: info: zone 1.10.in-addr.arpa/IN/default: refresh: retry limit for master 192.168.1.2#53 exceeded (source 0.0.0.0#0)
May 30 14:56:23 foo.example.com named[16217]: general: info: zone 1.10.in-addr.arpa/IN/default: Transfer started.
May 30 14:56:23 foo.example.com named[16217]: xfer-in: info: transfer of '1.10.in-addr.arpa/IN/default' from 192.168.1.2#53: connected using 172.16.3.4#63325
May 30 14:56:23 foo.example.com named[16217]: general: info: zone 1.10.in-addr.arpa/IN/default: transferred serial 11707064: TSIG 'key-foo-bar.example.com'
May 30 14:56:23 foo.example.com named[16217]: xfer-in: info: transfer of '1.10.in-addr.arpa/IN/default' from 192.168.1.2#53: Transfer completed: 1 messages, 52 records, 9065 bytes, 0.146 secs (62089 bytes/sec)
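
For context, the zone in those log lines is declared on my server along roughly
these lines (a hypothetical sketch; the file name and any other per-zone options
are placeholders, but the master address and TSIG key correspond to the log lines above):

zone "1.10.in-addr.arpa" {
   type slave;
   file "slave/1.10.in-addr.arpa";
   masters { 192.168.1.2 key "key-foo-bar.example.com"; };
};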

--

Each time my server gets into this state:

* It affects all the slave zones my server is trying to refresh.
It doesn't affect only some of the zones due for refresh.

Throughout the incident, the messages above are logged for all those slave
zones, not just selected ones.

* It's not confined to the slave zones my server refreshes from particular masters.
Some of the masters my server pulls from are the source of dozens of zones.
Some masters are the source of just one or two zones I pull.
Some of the masters are local, while some are across the Internet.
The issue affects refresh checks to all of the masters; it's not specific to just some of the masters from which we pull.

* Each time this happens, it lasts for an extended period. 
I've seen this range from 3 hours to 13 hours.
The problem seems to appear and then vanish "on its own."

* It generally happens every one to three days.

* While it's happening, it doesn't affect just "some" zone refresh UDP-based SOA query/responses.
It affects *all* of them.

* While it's happening, it doesn't affect zone refresh TCP-based SOA query/responses.
Those proceed fine.

* While it's happening, it doesn't affect handling of incoming NOTIFY messages.
My server continues to see incoming NOTIFY messages (UDP).

If the message is from an appropriate master and specifies a serial number
higher than my server has, it correctly triggers my server to pull an updated zone.
If the serial number is not higher, my server does not pull an updated zone.
If the message is not from an appropriate master, my server ignores it.

* While this is happening, and my server is in the middle of a UDP-based SOA query retry/timeout
for a zone, my server receives NOTIFY messages from various legitimate masters for the zone.
My server (reasonably) logs:

May 30 14:55:48 foo.example.com named[16217]: notify: info: client 192.168.1.3#62867/key key-foo-bar.example.com: view default: received notify for zone '1.10.IN-ADDR.ARPA': TSIG 'key-foo-bar.example.com'
May 30 14:55:48 foo.example.com named[16217]: general: info: zone 1.10.in-addr.arpa/IN/default: notify from 192.168.1.3#62867: refresh in progress, refresh check queued

--

./lib/dns/zone.c refresh_callback() shows that the "refresh: retry limit for master .. exceeded"
message above is logged after the server has tried to query a master for the SOA (using EDNS) over UDP
and failed (after retries).  After logging this message, the code goes on to try using TCP.
The messages logged above imply that this succeeded.

During one of these events, I performed packet capture to observe
the refresh activity.  The packet capture shows my server sending the SOA query
to the master using UDP, and the master replying a fraction of a second later.
The master's reply looks good to me.

After a timeout, this repeats (as if the server didn't receive the master's reply, or didn't like it).
This happens several times, then the server logs the "refresh: retry limit for master ... exceeded" message above.
Then my server tries the SOA query over TCP, gets a reply, and now appears satisfied.

--

While my server is in this state (for some hours at a time),
I observe it is otherwise receiving DNS queries from clients and sending
responses to clients normally, both over UDP and TCP.

--

It seems my server is able to send and receive using both UDP and TCP during these events.
It's clearly hearing *some* of the incoming UDP-based DNS messages.

It's only acting like it doesn't hear (or doesn't like) the incoming UDP-based 
SOA responses to its UDP-based SOA queries.

It's not apparent to me what causes the server to enter this state,
or to return to normal some hours later.

--

Platform:

BIND 9.9.7 (also verified it happens in 9.9.6-P1)
Solaris 10 on SPARC 
authoritative-only server
119 zones (42 master, 77 slave)
1 view


RRL is in use.
All the masters my server pulls from are exempt from RRL:

acl masters-we-pull-from {
   192.168.1.2; 192.168.1.3; ...
};

acl clients-exempt-from-rate-limiting {
   ...
   masters-we-pull-from;
   ...
};

...

options {
   ...
   rate-limit {
      ...
      exempt-clients { clients-exempt-from-rate-limiting; };
      ...
   };
   ...
};

RRL logs each time it rate-limits any netblock.
It is not logging any rate-limiting being enforced on masters my server pulls from.


--

This is a DNS server that's been running for years without this issue.

The problem began three days after I made a configuration change to my server.
As that was the only recent change, I imagine this configuration change might be what triggered the
sporadic problematic behavior. (I cannot test my guess by temporarily backing out that change; 
a customer is relying on that configuration change.)

My configuration change was:

My old default was to not request IXFR, with request-ixfr enabled only for specific slave zones.
My new default is to request IXFR, with request-ixfr disabled for one specific master server.

So on my server I changed:

In global "options":

Comment out:
   request-ixfr no;
Add:
   max-journal-size 100M;

For one master server I pull slave zones from, add to its "server" statement:
   request-ixfr no;

For several zones I am a slave for, add to each "zone" statement
(to document my intent that this zone attempt IXFR regardless of
the global setting):
    request-ixfr yes;
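
Putting those together, the relevant named.conf fragments now look roughly like this
(a sketch only; the enclosing view and most other statements are omitted, and the
server address and zone name below are placeholders, not my real ones):

options {
   ...
   // was: request-ixfr no;   (now commented out, so requesting IXFR is the default)
   max-journal-size 100M;
   ...
};

// the one master from which I no longer request IXFR (placeholder address)
server 192.0.2.1 {
   request-ixfr no;
};

// each of several slave zones where I want IXFR attempted regardless of the
// global setting (placeholder zone name and master address)
zone "busy.example.com" {
   type slave;
   masters { 192.0.2.2; };
   file "slave/busy.example.com";
   request-ixfr yes;
};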

I needed to make this change because some of the zones for which I slave
change so often, and are so large, that using AXFR caused my server to lag
well behind the zones' other servers.

--

I'll mention (without knowing whether it is relevant) that:

* my server receives lots of NOTIFY messages it refuses because they come from non-masters

* because some of the zones my server slaves for are changed very often,
my server performs IXFR very often.  


--

Any ideas?

Irwin Tillman
OIT Networking & Monitoring Systems, Princeton University


