Trying again on SERVFAIL
vesely at tana.it
Wed Feb 10 08:05:38 UTC 2021
Thanks for your reply.
On Tue 09/Feb/2021 18:15:43 +0100 Havard Eidnes wrote:
>> is there a way to know that a query has already been tried a few
>> minutes ago, and failed?
> From whose perspective?
> A well-behaved application could remember it asked the same query
> a short while ago, of course, but that's up to the application.
For an application, caching queries feels like stealing the resolver's job.
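If an application did want to remember recent failures itself, a minimal sketch could look like the following. The class name, method names, and the 300-second lifetime are all hypothetical choices for illustration. The lifetime is picked to stay within the five-minute ceiling that RFC 2308 sets for caching server failures. It is not taken from any particular DKIM filter.

```python
import time


class RecentFailureCache:
    """Remember failed lookups for a short, fixed time (hypothetical helper)."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        # ttl_seconds is an assumed value, not a resolver-provided TTL.
        self.ttl = ttl_seconds
        self.clock = clock
        self._failed = {}  # (name, rrtype) -> expiry timestamp

    @staticmethod
    def _key(name, rrtype):
        # DNS names and type mnemonics are case-insensitive.
        return (name.lower(), rrtype.upper())

    def record_failure(self, name, rrtype):
        self._failed[self._key(name, rrtype)] = self.clock() + self.ttl

    def failed_recently(self, name, rrtype):
        key = self._key(name, rrtype)
        expiry = self._failed.get(key)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            # Entry is stale: forget it so the next query goes out again.
            del self._failed[key]
            return False
        return True
```

The clock is injectable only so that the behaviour can be checked without waiting; a real filter would use the default.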
> Or is the perspective that of a recursive resolver? As far as I
> remember, BIND used as a recursive resolver will "cache" this
> knowledge, but I'm not entirely certain for how long, since it
> can't use the method from an NXDOMAIN reply which includes the
> SOA record (and uses the re-purposed "minimum" field for the TTL
> for the negative cache entry).
I too recall that NXDOMAIN can be cached for a while. I'd guess some kinds of
failures are also cached.
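For reference, the negative-caching rule mentioned above can be written down as a small sketch. Per RFC 2308, the TTL of a negative answer is the smaller of the SOA record's own TTL and its "minimum" field, and a cached server failure must not be kept longer than five minutes. The function names and the `local_cap` parameter are hypothetical; actual resolvers apply their own configurable limits.

```python
SERVFAIL_MAX_SECONDS = 300  # RFC 2308 upper bound for caching server failures


def negative_cache_ttl(soa_ttl, soa_minimum, local_cap=None):
    """TTL for an NXDOMAIN/NODATA entry: min(SOA TTL, SOA minimum field).

    local_cap is an assumed, optional resolver-side limit.
    """
    ttl = min(soa_ttl, soa_minimum)
    if local_cap is not None:
        ttl = min(ttl, local_cap)
    return ttl


def servfail_cache_ttl(configured_seconds):
    """Clamp a configured SERVFAIL lifetime to the RFC 2308 ceiling."""
    return max(0, min(configured_seconds, SERVFAIL_MAX_SECONDS))
```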
>> It happens seldomly, but sometimes the DKIM mail filter gets a
>> SERVFAIL when it tries to authenticate an incoming message.
>> SERVFAIL occurs when DNSSEC check fails.
> ...or when none of the name servers for the containing zone
> responds with an answer. I.e. it's not *just* DNSSEC failure
> which can trigger SERVFAIL.
Yes, of course. Yet, however sporadic, DNSSEC failure seems to be the most
>> Trying again is useless, it has to be treated as a permanent
> Well, now... Basically nothing in the DNS is permanent, because
> it is not completely static; hence most information in the DNS
> has a TTL attached to it. So the question then becomes how an
> application, say a mail server should treat SERVFAIL. It may
> very well be that the "maximum retry time" of the mail server is
> far longer than any of the TTLs for the pieces of DNS data that
> you could not look up, so it may be appropriate to treat SERVFAIL
> as a signal to "re-queue the message and try again in 30
> minutes", so in essence converting SERVFAIL into a "temporary
> failure" in the context of the mail server.
That's what I've been doing. For an incoming message, a temporary failure
means replying with a 4xx code. The sender keeps the message in its queue, and
eventually gives up. Once upon a time, MTAs used to retry sending for five
days. Nowadays, several servers don't let queued messages grow older than one day.
In the most severe case, a failed DKIM signature might entail a rejection. So the
best course of action seems to be to reserve temporary failures for this case.
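The policy described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Lookup` enum, the function name, and the `reject_on_fail` flag are hypothetical, and the bare reply codes 250, 451 and 550 stand in for whatever the filter actually returns.

```python
from enum import Enum


class Lookup(Enum):
    """Outcome of the DKIM key lookup (hypothetical classification)."""
    OK = "ok"
    NXDOMAIN = "nxdomain"
    SERVFAIL = "servfail"
    TIMEOUT = "timeout"


def smtp_reply_code(outcome, signature_valid, reject_on_fail):
    """Pick an SMTP reply code for an incoming message.

    reject_on_fail is an assumed policy flag: True when a failed
    signature would cause the message to be rejected.
    """
    if outcome in (Lookup.SERVFAIL, Lookup.TIMEOUT):
        # The key could not be fetched. Ask the sender to retry only
        # when the alternative would be an outright rejection.
        return 451 if reject_on_fail else 250
    if outcome is Lookup.OK and signature_valid:
        return 250
    # Definitive failure: no key (NXDOMAIN) or a bad signature.
    return 550 if reject_on_fail else 250
```

With this shape, a SERVFAIL only delays mail that would otherwise be refused; messages that would be accepted regardless of the DKIM result are not held up.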
Still, being able to differentiate local network congestion from a bad remote
configuration would help.