Why short Retry times?

Thu Oct 4 23:09:19 UTC 2001

In another thread, Brad Knowles was critiquing someone's zone file, and
said "Generally speaking you want your Refresh to be at least 3 times your
Retry."

We operate an ISP nameserver that's slaving thousands of zones, and short
Retry times are a big annoyance.  If a master for lots of zones crashes,
the slave fills up with named-xfer processes trying to connect to it.  BIND
8 has a limit of 20 concurrent zone transfers, and they take 7 minutes to
give up: Solaris's connect() times out in about 3.5 minutes, and if you use
transfer-source BIND tries a second time using the machine's regular source
address.  We're not running BIND 9 yet, so I don't know if it has similar
behavior; even though it doesn't run named-xfer as a separate process, I
suspect it still has internal limits on the number of zone transfers it
will try.  Many domains have 10- and 15-minute Retry times.  At 20
transfers per 7-minute round, we get through only about 40 zones in 15
minutes, so if a site has more than about 40 domains the first one will be
back in the zone transfer queue before we've finished trying all the others.
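
As a rough back-of-the-envelope check of that figure, here is a small
Python sketch.  It assumes every attempt against the dead master ties up a
transfer slot for the full 7 minutes and that slots are refilled in
lockstep rounds; the function name and this simplified model are only
illustrative, not a description of how BIND schedules transfers internally.

```python
def zones_before_requeue(slots, attempt_minutes, retry_minutes):
    """Rough count of zones attempted before the first zone's Retry expires.

    Simplified model (an assumption): transfers run in lockstep rounds of
    `slots` attempts, each lasting the full `attempt_minutes`.
    """
    # Number of complete rounds that fit inside one Retry interval.
    full_rounds = retry_minutes // attempt_minutes
    return slots * full_rounds


# 20 concurrent transfers, 7 minutes per failed attempt (figures from above).
for retry in (10, 15):
    print(retry, zones_before_requeue(slots=20, attempt_minutes=7,
                                      retry_minutes=retry))
# prints:
# 10 20
# 15 40
```

So with a 10-minute Retry the queue starts cycling after only about 20
zones under this model.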

By setting transfers-per-ns lower than transfers-in we can keep that one
dead master from totally monopolizing the server, but we don't want
transfers-per-ns to be too low or it can take a while to get in sync when a
customer changes lots of zones.  Regardless, this constant retrying seems
wrong.  Although the general wisdom is that Retry should be much shorter
than Refresh, I wonder about the logic of it.  If the master is having
problems, how will querying it more often help?
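
The transfers-per-ns trade-off can be sketched the same way.  The Python
below is a hypothetical model, not BIND code: it assumes a dead master
always keeps its full transfers-per-ns allowance busy, that each failed
attempt lasts 7 minutes, and it uses a made-up site of 100 zones.

```python
import math


def dead_master_impact(transfers_in, transfers_per_ns, zones,
                       attempt_minutes=7):
    """Return (slots left for other masters, hours to try every zone once).

    Assumes the dead master keeps min(transfers_per_ns, transfers_in)
    slots occupied and every attempt runs the full `attempt_minutes`.
    """
    busy = min(transfers_per_ns, transfers_in)
    free_slots = transfers_in - busy
    # Zones are tried `busy` at a time, one timeout-length round per batch.
    rounds = math.ceil(zones / busy)
    return free_slots, rounds * attempt_minutes / 60


for per_ns in (2, 5, 20):
    free, hours = dead_master_impact(20, per_ns, zones=100)
    print(f"transfers-per-ns={per_ns}: {free} slots free, "
          f"{hours:.1f} h to try all zones once")
```

A low setting leaves most slots free for healthy masters, but the same
limit also throttles a healthy master whose admin has just changed a lot
of zones, which is the cost described above.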

The only sense I can make of it is the assumption that the server has
been shut down while the admin is updating DB files.  If this is the case,
the failure implies that changes are likely, so retrying quickly should get
them to propagate faster.  But in my experience, most failures are not due
to this.  Admins don't usually shut down the server while editing the
files; they edit the files and then restart or reload the server.  This
usually results in only a brief outage, and the slave is unlikely to have
attempted a zone transfer during it, so the short Retry never kicks in.

Furthermore, BIND now implements the NOTIFY mechanism.  When it starts up
it sends out a notify message for every zone, and that will get the slaves
to do zone transfers right away.  I believe the Windows NT and Windows 2000
name servers also do this when they start up.

In most other contexts where things are retried, exponential backoff is
pretty common.  Examples are TCP retransmissions, resolver queries, and
some mail servers.  When things break, we assume that it will take a while
to get better, so we don't compound the problem by trying more often.
Short Retry times seem almost as pointless as repeatedly hitting the
elevator call button.
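
For comparison, here is a minimal Python sketch of what an exponentially
backed-off Retry schedule could look like.  This is purely illustrative:
it is not something BIND does, and both the function name and the choice
to cap the delay at the Refresh interval are my own assumptions.

```python
def backoff_schedule(base_retry, refresh, attempts):
    """Retry intervals (seconds) that double after each failure.

    The delay is capped at `refresh` (an assumed policy), so a slave never
    waits longer between attempts than it would for a healthy zone.
    """
    intervals = []
    delay = base_retry
    for _ in range(attempts):
        intervals.append(min(delay, refresh))
        delay *= 2
    return intervals


# 15-minute Retry, 3-hour Refresh, six consecutive failures.
print(backoff_schedule(base_retry=900, refresh=10800, attempts=6))
# [900, 1800, 3600, 7200, 10800, 10800]
```

With a fixed 15-minute Retry the slave would make about a dozen attempts
in the first three hours; under this schedule it makes four.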

We used to have a customer with about 15,000 zones that we were secondary
for (they're a web hosting firm that caters to the porn industry, and their
customers like to come up with all possible variants and misspellings of
their site names).  Occasionally their master server went down for a day or
so and it practically brought our server to its knees (I think that was the
impetus for us to reduce transfers-per-ns).  Now the most we pull from a
single master is about 1,200, and that server has always been very
reliable, but I've asked them to bump their Retry times up to an hour or
two just in case (they haven't gotten around to it, and I'm not sure they
ever will).  Even sites with only 50-100 zones can result in noticeable
delays in refreshing other customers' zones when the master goes down.

-- 
Barry Margolin, barmar at genuity.net
Genuity, Woburn, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.