Sporadic but noticeable SERVFAILs in specific nodes of an anycast resolving farm running BIND

Daniel Dawalibi daniel.dawalibi at idm.net.lb
Fri Mar 7 07:58:36 UTC 2014


Hello

We are facing a similar problem: intermittent SERVFAILs on several
domains, specifically during periods of high traffic.
Please note that IPv6 dual stack is not configured in the operating
system and we are not using any IPv6 options in the BIND configuration file.
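
As a quick sanity check (standard CentOS tools, nothing BIND-specific assumed), something like the following can confirm the IPv4-only setup:

# list any IPv6 addresses configured on the box; empty output when IPv6 is not set up
ip -6 addr show
# named should only appear bound to IPv4 addresses on port 53
netstat -plnu | grep named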

1- We compiled several BIND versions on different CentOS platforms:
CentOS release 5.10 with BIND 9.9.5 and BIND 9.7.2-P2: problem persists
CentOS release 5.6 with BIND 9.9.5 and BIND 9.7.2-P2: problem persists

2- We bypassed all network devices (firewall, shaper, IPS, load balancer):
problem persists

3- A tcpdump taken on the name servers shows the SERVFAIL responses in the
capture

4- dig debugging output shows intermittent SERVFAILs (seen for other domains
as well):

dig www.mcafee.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 49448


5- During debugging we also noticed a failure when using dig +trace (see the check after this list):

;; Received 493 bytes from 192.5.5.241#53(f.root-servers.net) in 64 ms

dig: couldn't get address for 'k.gtld-servers.net': failure
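
To narrow down whether the failure lies in fetching the gTLD server addresses themselves, a check along these lines can help (plain dig invocations; -4 simply forces IPv4 transport):

dig +short a k.gtld-servers.net.      # should return the server's A record
dig +short aaaa k.gtld-servers.net.   # and its AAAA record, if resolvable
dig -4 +trace www.mcafee.com          # repeat the trace over IPv4 only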



Regards 

Daniel Dawalibi
Senior Systems Engineer
e-mail:daniel.dawalibi at idm.net.lb

Jisr Al Bacha P.O. Box 11-316 Beirut Lebanon
tel +961 1 512513 ext. 366| fax +961 1 510474
tech support 1282 | http://www.idm.net.lb
 






-----Original Message-----
From: bind-users-bounces+daniel.dawalibi=idm.net.lb at lists.isc.org
[mailto:bind-users-bounces+daniel.dawalibi=idm.net.lb at lists.isc.org] On
Behalf Of Kostas Zorbadelos
Sent: Wednesday, March 05, 2014 3:16 PM
To: Bind Users Mailing List
Subject: Sporadic but noticeable SERVFAILs in specific nodes of an anycast
resolving farm running BIND


Greetings to all,

we operate an anycast caching resolving farm for our customer base, based on
CentOS (6.4 or 6.5), BIND (9.9.2, 9.9.5 or the stock CentOS package BIND
9.8.2rc1-RedHat-9.8.2-0.23.rc1.el6_5.1) and quagga (the stock CentOS
package).

The problem is that we have noticed sporadic but noticeable SERVFAILs on 3
out of 10 machines. Cacti measurements obtained via the BIND XML interface
show traffic ranging from 1.5K queries/sec (lowest-loaded machines) to 15K
queries/sec (highest). On the 3 affected machines, which share a geolocation,
SERVFAILs in resolutions start to appear somewhere between half an hour and
several hours after a BIND restart. These 3 machines do not carry the highest
load in the farm (6-8K q/sec). The resolution problems are noticeable to the
customers who end up on these machines, but they do not show up as high
numbers in the BIND XML resolver statistics (ServFail counter).
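
As a side note, the same ServFail counter can be polled directly from the statistics channel; a minimal sketch, assuming the channel is enabled on 127.0.0.1 port 8080 in named.conf (the port is an arbitrary choice):

statistics-channels { inet 127.0.0.1 port 8080 allow { 127.0.0.1; }; };

# dump the XML statistics and pull out the SERVFAIL-related counters
curl -s http://127.0.0.1:8080/ | grep -i servfail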

We reproduce the problem by querying for a specific domain name using a
loop of the form

while [ 1 ]; do clear; rndc flushname www.linux-tutorial.info; sleep 1; dig www.linux-tutorial.info @localhost; sleep 2; done | grep SERVFAIL

www.linux-tutorial.info is of course not the only domain experiencing
resolution problems. The above loop can run for hours without any issues
during low-traffic hours (at night, after a clean BIND restart), but during
the day it shows quite a few SERVFAILs, which affect other domains as well.
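
A variation of the same loop that timestamps each failure (functionally the same as the loop above, just easier to correlate with packet captures) looks like this:

while true; do
  rndc flushname www.linux-tutorial.info
  sleep 1
  # print a timestamp whenever the answer comes back SERVFAIL
  dig www.linux-tutorial.info @localhost | grep -q SERVFAIL && date "+%T SERVFAIL"
  sleep 2
done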

During the problem we notice with tcpdump that, when a SERVFAIL is produced,
no query packet leaves the server for resolution. We have noticed nothing in
the BIND logs (we even tried to raise the debugging levels and log all
relevant categories). An example capture while running the above loop:

# tcpdump -nnn -i any -p dst port 53 or src port 53 | grep 'linux-tutorial'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes

14:33:03.590908 IP6 ::1.53059 > ::1.53: 15773+ A? www.linux-tutorial.info. (41)
14:33:03.591292 IP 83.235.72.238.45157 > 213.133.105.6.53: 19156% [1au] A? www.linux-tutorial.info. (52)
^^^^ Success

14:33:06.664411 IP6 ::1.45090 > ::1.53: 48526+ A? www.linux-tutorial.info. (41)
14:33:06.664719 IP6 2a02:587:50da:b::1.23404 > 2a00:1158:4::add:a3.53: 30244% [1au] A? www.linux-tutorial.info. (52)
^^^^ Success

14:33:31.434209 IP6 ::1.43397 > ::1.53: 26607+ A? www.linux-tutorial.info. (41)
^^^^ SERVFAIL

14:33:43.672405 IP6 ::1.58282 > ::1.53: 27125+ A? www.linux-tutorial.info. (41)
^^^^ SERVFAIL

14:33:49.706645 IP6 ::1.54936 > ::1.53: 40435+ A? www.linux-tutorial.info. (41)
14:33:49.706976 IP6 2a02:587:50da:b::1.48961 > 2a00:1158:4::add:a3.53: 4287% [1au] A? www.linux-tutorial.info. (52)
^^^^ Success
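
A full-packet dump to a file, rather than grepping the live output, is also handy for later correlation; a minimal sketch (the output file name is arbitrary):

# capture all DNS traffic on every interface, full packets, for offline analysis
tcpdump -nn -i any -s 0 -w /var/tmp/dns-servfail.pcap port 53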

The main actions we have taken on the problem machines are:

- change the BIND version (we initially used a custom-compiled 9.9.2, then
  moved to 9.9.5 and finally switched over to the CentOS stock package
  9.8.2rc1). We noticed the problem in all versions.

- disable iptables (we use a ruleset with connection tracking on all of
  our machines, with no problems on the other machines in the
  farm). Again, no solution.

- introduce a query-source-v6 address in named.conf (we already had
  query-source; see the configuration sketch after this list). Each machine
  has a single physical interface and 3 loopbacks with the anycast IPs,
  announced via Quagga ospfd to the rest of the network. No solution.
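
For clarity, the relevant named.conf fragment looks roughly like this (the addresses below are documentation placeholders, not our real ones):

options {
        // IPv4 source address for outgoing queries
        query-source address 192.0.2.1;
        // IPv6 source address for outgoing queries
        query-source-v6 address 2001:db8::1;
};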

The main difference between the 3 problem machines and the rest is IPv6
operation. Those machines are dual-stack, having a /30 (v4) and a /127 (v6)
on the physical interface. Needless to say, the next trial is to remove the
relevant IPv6 configuration (a sketch of one way to do that follows below).
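
If it comes to that, the least intrusive test is probably to leave the interface configuration alone and just stop named from using IPv6 transport; a sketch, assuming the stock CentOS init script honours OPTIONS in /etc/sysconfig/named:

# /etc/sysconfig/named -- start named with IPv4 transport only
OPTIONS="-4"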

I understand that there are many parameters to the problem; we have been
trying to debug the issue for several days now. Any suggestion, suspicion or
hint is highly welcome. I can provide all sorts of traces from the machines
(I already have pcap files from the moment of the problem, plus pstack,
rndc status, OS process limits, rndc recursing, rndc dumpdb -all, according to
https://kb.isc.org/article/AA-00341/0/What-to-do-with-a-misbehaving-BIND-server.html)

Thanks in advance,

Kostas
 
-- 
Kostas Zorbadelos		
twitter:@kzorbadelos		http://gr.linkedin.com/in/kzorba 
----------------------------------------------------------------------------
()  www.asciiribbon.org - against HTML e-mail & proprietary attachments /\