EDNS request problem on TTL=0 data

Fri Jun 24 19:22:30 UTC 2011

Hi,

I'm investigating an outage that happened on a BIND server. It was
configured as a caching resolver and was forwarding for
one specific zone. This zone had two nameservers/forwarders, one of which
was unreachable at some point due to a cable cut. The other nameserver
turned out to be dropping any requests with the DO bit set.

What seems to have happened is:

1 the BIND nameserver would send 3 queries 1s apart with EDNS0 + DO bit set, which were
   dropped.
2 BIND sends out a query without the DO bit and gets a response with TTL=0
3 a burst of queued-up requests for that exact query gets rushed through for 1s
   (probably not more than max-clients-per-query though, which was set to 100)
4 goto 1
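
To make the cycle concrete, here is a rough Python sketch of the timeline as I
understand it. It is only a model of the four steps above, not of BIND's actual
retry logic; the function and parameter names are made up for illustration, and
the 3 retries, 1s spacing and 1s serving window are simply the values I observed.

```python
def simulate_cycle(total_seconds, edns_retries=3, retry_interval=1, serve_window=1):
    """Return one boolean per second: True if answers can be served in that second.

    Assumed model: each cycle starts with edns_retries dropped EDNS0+DO queries,
    spaced retry_interval seconds apart, followed by serve_window seconds in which
    the TTL=0 answer from the plain query is handed to waiting clients.
    """
    blocked_seconds = edns_retries * retry_interval
    cycle_length = blocked_seconds + serve_window
    timeline = []
    for second in range(total_seconds):
        position = second % cycle_length
        timeline.append(position >= blocked_seconds)
    return timeline


timeline = simulate_cycle(3600)  # one hour
blocked_fraction = timeline.count(False) / len(timeline)
print(f"fraction of time without answers: {blocked_fraction}")
```

With these values the model spends 3 out of every 4 seconds waiting on dropped
EDNS queries, which matches what I saw.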

This not only caused resolution failures for the forwarded data, but within the hour
caused the entire server to collapse under load. The number of clients asking
for this data was higher than the max-clients-per-query setting.
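
As a back-of-the-envelope illustration of why the load keeps growing, the sketch
below counts unanswered requests under a deliberately simplified assumption: at
most max-clients-per-query requests get answered once per 4s cycle, and nothing
is ever timed out or dropped. The arrival rates are hypothetical, not measured
values, and the function name is made up for this example.

```python
def unserved_after(seconds, arrivals_per_second, max_clients_per_query=100, cycle_seconds=4):
    """Count requests still unanswered after the given number of seconds.

    Simplification: once per cycle, at most max_clients_per_query waiting
    requests are answered; real client timeouts and server-side drops are ignored.
    """
    backlog = 0
    for second in range(1, seconds + 1):
        backlog += arrivals_per_second
        if second % cycle_seconds == 0:
            backlog -= min(backlog, max_clients_per_query)
    return backlog


# Hypothetical rates: 20/s stays within the per-cycle limit, 50/s does not.
print(unserved_after(3600, 20))
print(unserved_after(3600, 50))
```

In this model, any rate above 25 requests per second (100 per 4s cycle) leaves a
backlog that grows for as long as the cycle continues.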

My questions:

1 Is this problem happening because EDNS failure is not remembered for forwarders?
2 Is this problem happening because EDNS failure is forgotten once there is no more
   data cached that used the specified nameserver?
3 Does max-clients-per-query apply to forward zone queries too, or is this ignored?
4 Can this behaviour be changed via a configuration option so that this EDNS
   failure is remembered and we're not unable to answer queries for 3 out of 4 seconds?

Paul