TTLs for glue records in different TLDs -> unresolvable domains

Wed Apr 11 19:02:40 UTC 2001

In protocol terms, this shouldn't be a problem: the nameserver would just
go out and fetch the missing A record (for dns1.domain.to).

However, BIND 8 lacks what is called "query restart": it loses track of
where it is in the process of resolving a query if too many steps are
involved. So when some but not all of the A RRs associated with a given
cached NS RRset happen to be missing, it gives up on the query halfway
through instead of fetching the missing addresses. Usually, the lack of
query restart is masked by the fact that the client will retry its query
and eventually the nameserver will complete resolution. But that doesn't
happen in all cases, and, besides, the application will often time out
before enough queries are attempted.
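
To make the timing concrete, here is a rough sketch in Python of the
cache state described in the quoted message. It is a toy model, not how
BIND is implemented. The names and the 1D/2D glue TTLs come from the
quoted message; everything else is an assumption for illustration: that
the NS RRset itself is still cached for the whole period, that all
records were cached at the same moment, and that a resolver with query
restart succeeds in re-fetching any missing address.

```python
DAY = 24 * 60 * 60  # seconds

# Nameservers listed for domain.to in the quoted message.
NS_NAMES = ["dns1.domain.to", "dns1.backup.com", "dns2.backup.com"]

# Glue TTLs from the quoted message: 1D from the .to servers, 2D from .com.
GLUE_TTL = {
    "dns1.domain.to": 1 * DAY,
    "dns1.backup.com": 2 * DAY,
    "dns2.backup.com": 2 * DAY,
}


def cached_addresses(age):
    """Nameservers whose A record is still cached `age` seconds after
    everything was fetched (assumed: all fetched at the same moment)."""
    return [name for name in NS_NAMES if age < GLUE_TTL[name]]


def servers_tried(age, restart):
    """Which nameservers the resolver can actually send the query to."""
    if restart:
        # Assumption: with query restart, a missing address is looked up
        # and the original query then continues, so every NS is reachable.
        return list(NS_NAMES)
    # Without restart, only servers with a cached address get queried.
    return cached_addresses(age)


def resolves(age, up, restart=False):
    """True if at least one server that gets tried is actually up."""
    return any(name in up for name in servers_tried(age, restart))


if __name__ == "__main__":
    up = {"dns1.domain.to"}  # both backup.com servers are down
    for hours in (12, 36):
        age = hours * 3600
        print(hours, resolves(age, up), resolves(age, up, restart=True))
    # prints:
    # 12 True True
    # 36 False True
```

The failure window in this model is any cache age between 1 and 2 days:
the only server that is up is the one whose address has already expired.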

BIND 9 supposedly has query restart, so eventually this problem will go
away.

- Kevin

abuse at x32174aba94e4.xmx.sez.to wrote:

> Today I encountered a situation where a certain host was unable to
> resolve entries in my domain despite the fact that one of my
> nameservers was functional.  I am wondering if I have stumbled across
> a limitation of the DNS protocol, a bug in BIND, or if I have subtly
> misconfigured something.
>
> The domain in question is in the .to TLD.  The authoritative
> nameservers for the domain are:
>
> dns1.domain.to
> dns1.backup.com
> dns2.backup.com
>
> So, when queried for domain.to, the authoritative servers for the .to
> domain send back a glue record for dns1.domain.to, but not for
> dns1.backup.com or dns2.backup.com.
>
> Somewhere along the line, both dns1.backup.com and dns2.backup.com
> fell over.  This still left dns1.domain.to up, so I believed that DNS
> resolution would continue to work for domain.to.  However, one of my
> client hosts would not resolve hosts inside domain.to.
>
> As it turns out, the .to servers send glue records with a 1D TTL,
> while the .com servers send glue records with a 2D TTL.
>
> The client, running BIND 8.2.3, had cached the 3 NS entries for
> domain.to.  However, since it had been more than 1 day and less than 2
> days since it last queried domain.to, dns1.domain.to's A record had
> expired.  It was only querying the dns1.backup.com and dns2.backup.com
> nameservers, which were of course down.  Since it didn't have a valid
> A record for dns1.domain.to, it never queried that host, and it chose
> not to attempt to refresh that record.
>
> Is this working as designed?  Should the expiration of a glue record
> when other glue records have a higher TTL prevent that host from being
> queried?  How can I prevent clients from experiencing this problem in
> the future?
>
> --
> Pablo