query dropping vs. returning nxdomain

Paul Vixie paul at vix.com
Tue Mar 7 17:32:25 UTC 2006


i'm about to follow my own advice and rate-limit myself on this thread, so
that voices other than the-usual-half-dozen can be heard on this topic.  but
i'll use up today's bind-workers quota following up geertjan's excellent post:

# While I don't have a good answer, I don't think that crippling
# an application is the answer to DDoS attacks.

while i agree, i think those are the cards we were dealt.  the IP stack and
its routing system and basic application set were all conceived on a closed
network that had no untrustworthy parties connected to it, and it's all very
fragile.  now don't get me wrong, a day never goes by that i don't miss jon
postel and his friendly wisdom, but "be liberal in what you accept" is just
not a scaling property for a completely open network with bad people on it.

we crippled smtp by closing open-relay.  we're in the process now of crippling
DNS by closing open-recursion.  we'll gradually cripple everything we grew up
with in our never-ending struggle to keep things running in some shape/form.

# Secondly, I think that the vast majority of nameservers don't have a high
# packet rate and hence would continue to work with these rate-limiters in
# place by default. Having them in by default would mean that 95% of the world
# can run this protection by default, w/o any configuration.
# 
# The ones that have a higher load that would trigger the rate-limiter are
# hopefully run by propellerheads with clue, and at least they would need clue
# to change the rate limiter.
# 
# Makes sense?

unfortunately, no it does not.  more and more high-volume services are run by
less and less technically savvy operators, and there's just no bar to this --
if someone runs a service without understanding its knobs, it can be months or
years before there's enough back pressure toward that operator to get them to
learn what they need to know and change what they need to change.

considering rate limiting, i think we may be onto something, but not in the
usual per-flow way that some of those lurking here may be thinking.  per-flow
rate limiting requires holding flow state (even if only a quota counter, not
full connection state), and if your attacker can artificially drive up the
number of
flows you see (which is easy, since BCP38 is not universally deployed and any
attacker can source a packet from any IP source address they want) then they
can make you use up all available flow state on flows that may or may
not be relevant.  then you begin to churn -- to discard, probably on an LRU
basis, older flow states in order to make room for new ones.  this means you
can be made to discard information necessary for effective rate limiting.  and
if rate limiting is effective, then this attack WILL be launched to make it
ineffective -- we know enough about the bad guys now to know that in advance.
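to make that concrete, here's a toy sketch in C -- invented names and sizes,
nothing resembling actual BIND internals -- of a fixed-size per-flow counter
table with LRU eviction.  note that under a spoofed-source flood, the eviction
path runs on nearly every lookup, so no counter survives long enough to ever
trigger a limit:

#include <stdint.h>

#define FLOW_SLOTS 4096                     /* invented table size */

struct flow {
        uint32_t src;                       /* source address keying this flow */
        uint32_t count;                     /* queries seen from this source */
        uint64_t last_used;                 /* logical clock for LRU eviction */
};

static struct flow table[FLOW_SLOTS];
static uint64_t now;

/* returns the updated query count for this source address */
static uint32_t
flow_count(uint32_t src)
{
        struct flow *victim = &table[0];
        int i;

        for (i = 0; i < FLOW_SLOTS; i++) {
                if (table[i].src == src) {
                        table[i].last_used = ++now;
                        return (++table[i].count);      /* existing flow */
                }
                if (table[i].last_used < victim->last_used)
                        victim = &table[i];             /* remember LRU slot */
        }
        /* table is full of other flows: evict the least recently used
         * entry.  a spoofed-source flood forces this path constantly,
         * discarding exactly the history that effective rate limiting
         * depends on. */
        victim->src = src;
        victim->count = 1;
        victim->last_used = ++now;
        return (1);
}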

on the other hand, statistical non-flow-based rate limiting is a reasonable
way forward.  what if there were only a ((random() % 8) == 1) chance that a
given ACL-denied query would be told DNS-REFUSED?  obviously this would have
to be an option of some kind -- queries from RFC1918 space should be able to
be dropped with no result at all -- but simulating lossiness would be a good
way to let people do some kind of debugging (repeating queries until they got
a positive or negative answer) without giving attackers an interesting
reflector (whether amplifying or not).
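a minimal sketch of that idea, again in C: struct query and its fields here
are hypothetical stand-ins for whatever the server already knows about a
query, and only the 1-in-8 coin flip is the actual proposal:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct query {
        uint32_t src_addr;                  /* IPv4 source, host byte order */
        bool acl_denied;                    /* set by normal ACL processing */
};

enum action { ACT_ANSWER, ACT_REFUSE, ACT_DROP };

/* true if the source falls in RFC1918 private space */
static bool
is_rfc1918(uint32_t a)
{
        return ((a >> 24) == 10 ||          /* 10.0.0.0/8 */
                (a >> 20) == 0xac1 ||       /* 172.16.0.0/12 */
                (a >> 16) == 0xc0a8);       /* 192.168.0.0/16 */
}

static enum action
classify(const struct query *q)
{
        if (!q->acl_denied)
                return (ACT_ANSWER);        /* normal processing */
        if (is_rfc1918(q->src_addr))
                return (ACT_DROP);          /* bogus source: total silence */
        /* a 1-in-8 chance of an explicit REFUSED: someone debugging can
         * repeat the query until a definitive answer comes back, while a
         * reflection attacker gets mostly silence and very little
         * amplification. */
        return (((random() % 8) == 1) ? ACT_REFUSE : ACT_DROP);
}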
