BIND 9.20.6: spurious recursive lookup failures after longish uptime
Havard Eidnes
he at uninett.no
Thu Mar 13 11:21:32 UTC 2025
Hi,
I wondered for a while whether this would be more appropriate to
post here or as an issue in ISC's gitlab, but came to the conclusion
that for now the best place would be here. The reason is that
the "how to reproduce the problem" bit is quite fuzzy.
If someone from ISC wants this reported as a gitlab issue as
well, I can do that, of course.
Context: we are running 4 nodes in an anycast setup, providing
our users with DNS recursor service, and RPZ service to a subset
of these users.
We have been using BIND 9.20 for a while, and have followed the
ISC upgrades shortly after they were published, so we were up
until recently running 9.20.6 for this service.
Recently we started receiving reports from some of our users that
... "DNS lookups are un-reliable". An example which I managed to
catch / reproduce (based on a report for one of the other 3
nodes):
$ dig @osl-res.uninett.no. freebsd.org. a
; <<>> DiG 9.14.7 <<>> @osl-res.uninett.no. freebsd.org. a
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 51745
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 16c89ea584a0a45c0100000067d2ad42211e91f71ee4fdcc (good)
;; QUESTION SECTION:
;freebsd.org. IN A
;; Query time: 27 msec
;; SERVER: 2001:700:0:102::ca53#53(2001:700:0:102::ca53)
;; WHEN: Thu Mar 13 11:02:42 CET 2025
;; MSG SIZE rcvd: 68
$ dig @osl-res.uninett.no. freebsd.org. a
; <<>> DiG 9.14.7 <<>> @osl-res.uninett.no. freebsd.org. a
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2380
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 893098498db1b2330100000067d2ad4b2f511f5ac2cf4c48 (good)
;; QUESTION SECTION:
;freebsd.org. IN A
;; ANSWER SECTION:
freebsd.org. 3600 IN A 96.47.72.84
;; Query time: 30 msec
;; SERVER: 2001:700:0:102::ca53#53(2001:700:0:102::ca53)
;; WHEN: Thu Mar 13 11:02:51 CET 2025
;; MSG SIZE rcvd: 84
$
The name server in question does not have any connectivity issues
that I'm aware of, and ... it really doesn't make a whole lot of
sense to me that it would at one instant reply with SERVFAIL only
to seconds later respond with a DNSSEC-validated OK reply. I've
unsuccessfully looked in the logs for the SERVFAIL for this
domain, but apparently our logging does not catch those.
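For anyone who wants to put a number on how often the SERVFAILs
show up from the client side, a probe loop along the following
lines might help. This is only a sketch: the function names
dig_status and probe are made up for illustration, the timeout and
retry values are arbitrary, and it assumes a dig and a POSIX shell
on the probing host.

```shell
# Extract the RCODE (the "status:" field) from dig output read on stdin.
# Prints nothing if no header line is present (e.g. on a timeout).
dig_status() {
    sed -n 's/^;; ->>HEADER<<- .*status: \([A-Z]*\),.*/\1/p'
}

# Query repeatedly and print a UTC-timestamped line for every answer
# that is not NOERROR. Arguments (all hypothetical names):
#   server, name, type, number of queries, delay between queries (s)
probe() {
    server=$1 name=$2 qtype=$3 count=$4 delay=$5
    i=0
    while [ "$i" -lt "$count" ]; do
        # One attempt per query so that retries do not mask a failure.
        status=$(dig "@$server" "$name" "$qtype" +tries=1 +time=5 | dig_status)
        [ "$status" = "NOERROR" ] || echo "$(date -u +%FT%TZ) ${status:-NOREPLY}"
        i=$((i + 1))
        sleep "$delay"
    done
}
```

It could be run as, for example,
"probe osl-res.uninett.no. freebsd.org. a 1000 5", with the output
redirected to a file. Note that most of these queries will be
answered from cache, so the name and interval may need varying to
actually exercise recursion.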
At the time when this was done, the name server had been running
for weeks:
osl-res: {1} ps axu | egrep 'PID|named'
USER PID %CPU %MEM VSZ RSS TTY STAT STARTED TIME COMMAND
named 6739 114 2.6 1363112 866384 ? Osl 27Feb25 14435:20.10 /usr/p
osl-res: {2}
This node peaks at around 3000 qps, and rarely if ever serves
less than 700 qps during a 24-hour cycle. This makes it somewhere
between difficult and impossible to provide a precise reproducer
description, which is obviously preferred for a proper bug report.
It also has an instance of RFC 9462 applied, which is "discovery
of designated resolvers", pointing clients to the DoT and DoH
endpoints this instance serves by publishing _dns.resolver.arpa
SVCB records in the DNS view for the clients. As a consequence,
a fair number of queries (20%? 30%?) arrive over those
transports.
For now we have downgraded BIND to 9.18.34 on the two nodes where
similar trouble has been reported, and we will in all probability
do the same for the remaining two nodes in the cluster. ...which
is a shame, really, but having this sort of issue pop up at
unpredictable times and exposing our users to it is ... not
exactly ideal.
So... What I guess I'm doing with this message is asking if anyone
else has been experiencing anything resembling this problem, or
if anyone has any more clues to share to guide further debugging
of this problem?
FWIW, we're running BIND on NetBSD/amd64 10.0 on these nodes.
Best regards,
- Håvard