Trying again on SERVFAIL

Wed Feb 10 08:05:38 UTC 2021

Hi Havard,

thanks for your reply.

On Tue 09/Feb/2021 18:15:43 +0100 Havard Eidnes wrote:
>> is there a way to know that a query has already been tried a few
>> minutes ago, and failed?
> 
> From whose perspective?
> 
> A well-behaved application could remember it asked the same query
> a short while ago, of course, but that's up to the application.

For an application, caching queries feels like stealing the resolver's job.
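
Still, if an application did want to remember a recent failure itself, a
minimal sketch in Python could look like the following.  The class name
FailureMemory, its methods, and the 300-second hold time are all made up
for illustration; they don't come from any DKIM filter or resolver library.

```python
import time


class FailureMemory:
    """Remember which queries failed recently (hypothetical helper)."""

    def __init__(self, hold_seconds=300.0, clock=time.monotonic):
        # hold_seconds is an arbitrary assumption, not a value taken from
        # any resolver; clock is injectable so the class can be tested.
        self.hold_seconds = hold_seconds
        self.clock = clock
        self._failed = {}  # (name, rrtype) -> time the failure was recorded

    @staticmethod
    def _key(name, rrtype):
        # DNS names and type mnemonics compare case-insensitively.
        return (name.lower(), rrtype.upper())

    def record_failure(self, name, rrtype):
        """Note that the query for (name, rrtype) just failed."""
        self._failed[self._key(name, rrtype)] = self.clock()

    def failed_recently(self, name, rrtype):
        """Return True if the same query failed within the hold time."""
        key = self._key(name, rrtype)
        when = self._failed.get(key)
        if when is None:
            return False
        if self.clock() - when > self.hold_seconds:
            del self._failed[key]  # entry expired, forget it
            return False
        return True
```

A filter would call record_failure() when a key lookup returns SERVFAIL, and
check failed_recently() before deciding how to treat the next message that
needs the same key.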

> Or is the perspective that of a recursive resolver?  As far as I
> remember, BIND used as a recursive resolver will "cache" this
> knowledge, but I'm not entirely certain for how long, since it
> can't use the method from an NXDOMAIN reply which includes the
> SOA record (and uses the re-purposed "minimum" field for the TTL
> for the negative cache entry).

I too recall that NXDOMAIN can be cached for a while.  I'd guess some kinds of 
failures are also cached.

>> It happens seldomly, but sometimes the DKIM mail filter gets a
>> SERVFAIL when it tries to authenticate an incoming message.
>> SERVFAIL occurs when DNSSEC check fails.
> 
> ...or when none of the name servers for the containing zone
> responds with an answer.  I.e. it's not *just* DNSSEC failure
> which can trigger SERVFAIL.

Yes, of course.  Yet, however sporadic, DNSSEC failure seems to be the most 
frequent cause.

>> Trying again is useless, it has to be treated as a permanent
>> error.
> 
> Well, now...  Basically nothing in the DNS is permanent, because
> it is not completely static; hence most information in the DNS
> has a TTL attached to it.  So the question then becomes how an
> application, say a mail server should treat SERVFAIL.  It may
> very well be that the "maximum retry time" of the mail server is
> far longer than any of the TTLs for the pieces of DNS data that
> you could not look up, so it may be appropriate to treat SERVFAIL
> as a signal to "re-queue the message and try again in 30
> minutes", so in essence converting SERVFAIL into a "temporary
> failure" in the context of the mail server.

That's what I've been doing.  For an incoming message, a temporary failure 
means replying with a 4xx code.  The sender keeps the message in its queue, and 
eventually gives up.  Once upon a time, MTAs used to retry sending for five 
days.  Nowadays, several servers don't let queued messages grow older than one day.

In the most severe case, a failed DKIM signature might entail a reject.  So the 
best course of action seems to be to reserve temporary failures for this case.
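
As a rough sketch of that rule in Python (the function name, its arguments
and the returned strings are invented for this example, not taken from any
existing filter):

```python
def action_on_key_lookup(rcode, fail_means_reject):
    """Decide what to do after the DNS lookup of a DKIM key.

    rcode             -- DNS response code as a string, e.g. "NOERROR",
                         "NXDOMAIN" or "SERVFAIL"
    fail_means_reject -- True if local policy would reject this message
                         when its DKIM signature fails

    Returns "tempfail" (reply with a 4xx code so the sender retries later)
    or "continue" (carry on and evaluate the message as usual).
    """
    if rcode.upper() == "SERVFAIL" and fail_means_reject:
        # Only here could a transient DNS problem turn into a wrongly
        # rejected message, so only here do we ask the sender to retry.
        return "tempfail"
    return "continue"
```

With this, a SERVFAIL on a message that would be accepted anyway doesn't
delay delivery; only messages at risk of rejection are deferred.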

Still, being able to differentiate local network congestion from a bad remote 
configuration would help.

Best
Ale
--