Increase in retry and timeout errors post 9.9.4 -> 9.11.4 upgrade

I set send-cookie no; globally to test this theory, but the pattern of retries and timeouts continued. I was, however, able to determine that the retries/timeouts follow the same pattern as the resolver statistic for truncated responses received, which suggests they are related.

When I look at the same graph on one of the other servers, it shows no truncated responses but instead a lot of NXDOMAIN errors, which the upgraded server does not have.


Well, BIND 9.11+ sends DNS COOKIE by default, and there are some servers that mishandle EDNS requests with a DNS COOKIE option present.  Unknown EDNS options are supposed to be ignored, but there are servers/firewalls that just drop such queries.  Others return FORMERR, others return NXDOMAIN when there is an answer w/o the option being present, others echo unknown options, and others still send back a DNS COOKIE response but fail to correctly copy the client cookie part to the response.  show how servers for .GOV zone behave when presented with an unknown EDNS option.  Other datasets are similar.

You can use "server <prefix> { send-cookie no; };" to work around known broken servers.
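
As a rough sketch of what that could look like in named.conf: the addresses below are documentation placeholders (192.0.2.53 and 198.51.100.0/24 are assumptions, not servers known to be broken), so substitute whatever your own testing identifies.

```
// Hypothetical example: stop sending DNS COOKIE to one misbehaving server.
server 192.0.2.53 {
    send-cookie no;
};

// Hypothetical example: the same workaround for a whole prefix.
server 198.51.100.0/24 {
    send-cookie no;
};
```

Scoping the exception to a server or prefix keeps DNS COOKIE enabled for everything else, unlike setting send-cookie no; in the global options.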


> I have three CentOS 7 servers running BIND acting as internal resolvers. There was an update released that upgrades them from 0:9.9.4-74.el7_6.2 to 32:9.11.4-16.P2.el7_8.2. After performing this upgrade on one of the servers, there has been a notable increase in retry and timeout errors as measured by data collected from the statistics channel. Where previously the number of retry and timeout errors was < 10 per 2 minutes, I now regularly see spikes > 50 per 2 minutes, while the error levels have remained consistent on the other two servers. When I downgrade the server back to 9.9.4, the error rate drops as well.
> I increased the log level for the query-errors log and observed that the number of entries on the upgraded and non-upgraded servers was about the same, so there doesn't appear to be an increase in errors.
> I'm not sure whether the issue is that I'm not looking in the correct place to identify the source of the retries/timeouts. The other possibility that occurred to me is that what those retry/timeout counters represent changed between the two versions, so the increased rate is not a problem but just reflects different information.
