Tuning for lots of SERVFAIL responses

Tony Finch dot at dotat.at
Fri Feb 19 02:05:23 UTC 2016

John Miller <johnmill at brandeis.edu> wrote:

> Thanks for the reply, Tony. With the recent glibc bug, I figured most
> folks would be off putting out those fires!

If they haven't done it by now then, gosh, I feel sorry for them.

(It's SO NICE to have a redundant service that you can patch and upgrade
without affecting users. Operational priority 1: reduce anxiety and
improve mental health!)

And, you (dear readers) should not have to care about the rest of this
post - there's very little a hostmaster can do to improve things when
you've lost your uplink. All I have been able to do for my users is
explain why the DNS servers did the best they could, while my colleagues
did the real work to improve our uplinks and reduce these failures.

> > [...] but it's still "slow" when we lose external connectivity
> > because (I think) of attempts at TLS OCSP lookups :-(
> We've run into similar issues in the past: people were hitting a
> captive portal that didn't allow access to the CAs for OCSP
> verification.

OCSP is just one example of unexpected external dependencies. I mention
it particularly because it was confusing - different browsers make
different trade-offs about TLS certificate revocation, so we got
inconsistent problem reports.

> We're not quite there with regard to traffic volume: we're somewhere
> around 150 qps on each server (maybe 5-600 qps campus-wide), but as
> happened to you, we saw the same 3-4x spike in volume.

Right. This number should be fairly consistent across different sites
because the traffic increase is due to two things: firstly, stub resolvers
usually try a query three times (with a 10s timeout for each query, so an
overall 30s timeout), and secondly, users might retry manually (or they
might go and get coffee).
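For the glibc stub resolver those retry parameters are tunable in resolv.conf. A hedged sketch, with illustrative values only (the built-in defaults vary by platform and libc version, and are commonly smaller than 10s x 3):

```
# /etc/resolv.conf -- illustrative values, not recommendations
nameserver 192.0.2.53
nameserver 192.0.2.54
# roughly attempts x timeout before the application gives up,
# i.e. ~30s worst case with these settings
options timeout:10 attempts:3
```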

> Likewise, we went from roughly 20 active clients per server (going off
> of UDP socket stats from sar) to over 1000.

The other statistic you can look at is the client count from `rndc
status`. Its numbers should be basically the same as your socket stats,
modulo a factor of 2 for IPv4 and IPv6. But! when you hit a limit the
numbers will be clipped, so they are no use for measuring demand and no use
for estimating what will happen when you lift the limit.
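On BIND 9 the relevant line of `rndc status` output reads `recursive clients: current/soft-limit/hard-limit`. A sketch of pulling out the current count, run here against a hypothetical sample line rather than a live server:

```shell
# Hypothetical sample of the relevant `rndc status` line; on a live
# server you would pipe `rndc status` in instead of this variable.
sample='recursive clients: 940/900/1000'

# Split on spaces and slashes; the third field is the current client
# count (here it already exceeds the soft limit of 900).
current=$(printf '%s\n' "$sample" | awk -F'[ /]' '{print $3}')
echo "$current"
```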

Also, there's another factor of 3 between queries from stub resolvers and
queries by named - each query from named has a 3s request timeout and an
overall 9s or 10s timeout (IIRC).

So (I think) this basically means when you lose your uplink you can
roughly expect the BIND client count to be about 3x3 = 9 times your
normal qps, plus angry manual retries, minus frustrated coffee breaks.
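That back-of-envelope sum can be written out explicitly. A sketch only: the two 3x factors come from the retry behaviour described above, and the function name and structure are mine, not anything in BIND:

```python
# Rough estimate of BIND's recursive client count during a total
# uplink outage, when every recursive query times out.
# Intuition (Little's law): concurrent clients scale with the query
# arrival rate times how long each client is held open, and the
# stub-side retries (~3x) multiply with named's own retry/timeout
# amplification (~3x) to give the ~9x figure from the post.

def expected_recursive_clients(normal_qps, stub_retries=3, named_amplification=3):
    """Very rough estimate of concurrent recursive clients during an outage."""
    return normal_qps * stub_retries * named_amplification

# e.g. ~150 qps per server, as quoted earlier in the thread:
print(expected_recursive_clients(150))  # -> 1350
```

Reassuringly, 9 x 150 qps is about 1350 clients, in the same ballpark as the "over 1000" observed above (before adding angry manual retries and subtracting coffee breaks).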

> The servers themselves were quietly twiddling their thumbs at 0.1 load:
> strictly a case of the application doing the throttling.

Yep, mostly waiting for replies that will never come, which doesn't
require much CPU.

f.anthony.n.finch  <dot at dotat.at>  http://dotat.at/
Forties, Cromarty: Southwest 4 or 5, backing south 6 to gale 8, perhaps severe
gale 9 later. Slight or moderate, becoming rough or very rough later. Showers,
then rain. Good, becoming moderate or poor.

More information about the bind-users mailing list