Intermittent Issues Resolving Microsoft Hostnames

Wed May 4 18:31:23 UTC 2016

I ran several digs using:

dig @ns1-prodeodns.glbdns.o365filtering.com. A zulily-com.mail.protection.outlook.com. +short

without error.  As mentioned previously by Mark Andrews:

> SERVFAIL usually means that the server is configured for the zone
> but doesn't have a current copy.

You included a snippet of the logged error, but you might also consider capturing the traffic with tcpdump to see both sides of the actual conversation.  It might provide additional insight.

John

________________________________
From: bind-users-bounces at lists.isc.org <bind-users-bounces at lists.isc.org> on behalf of Rob Heilman <rheilman at echolabs.net>
Sent: Wednesday, May 4, 2016 1:02 PM
To: bind-users at lists.isc.org
Subject: Intermittent Issues Resolving Microsoft Hostnames

We run BIND 9.9.5-9 on Debian x86_64 to support a moderately sized email hosting system.  System info is listed at the end of this message.  We are seeing intermittent but frequent failures resolving Microsoft records.  The hostnames are usually of the form *.mail.protection.outlook.com or *.mail.eo.outlook.com.  They belong to organizations ranging from K-12 schools and universities to small businesses and large commercial companies.  Some examples follow:

03-May-2016 09:16:48.001 query-errors: debug 1: client 10.10.10.95#44080 (zulily-com.mail.protection.outlook.com): query failed (SERVFAIL) for zulily-com.mail.protection.outlook.com/IN/A at query.c:7004
03-May-2016 09:16:48.002 query-errors: debug 2: fetch completed at resolver.c:3074 for zulily-com.mail.protection.outlook.com/A in 0.000067: failure/success [domain:mail.protection.outlook.com,referral:0,restart:1,qrysent:0,timeout:0,lame:0,neterr:0,badresp:0,adberr:2,findfail:0,valfail:0]

04-May-2016 09:32:38.498 query-errors: debug 1: client 10.10.10.95#44080 (hanes-com.mail.protection.outlook.com): query failed (SERVFAIL) for hanes-com.mail.protection.outlook.com/IN/A at query.c:7004
04-May-2016 09:32:38.498 query-errors: debug 2: fetch completed at resolver.c:3074 for hanes-com.mail.protection.outlook.com/A in 0.004677: failure/success [domain:mail.protection.outlook.com,referral:0,restart:1,qrysent:0,timeout:0,lame:0,neterr:0,badresp:0,adberr:2,findfail:0,valfail:0]

04-May-2016 12:47:12.935 query-errors: debug 1: client 10.10.10.95#44080 (pitt-edu.mail.protection.outlook.com): query failed (SERVFAIL) for pitt-edu.mail.protection.outlook.com/IN/A at query.c:7004
04-May-2016 12:47:12.935 query-errors: debug 2: fetch completed at resolver.c:3074 for pitt-edu.mail.protection.outlook.com/A in 0.000085: failure/success [domain:mail.protection.outlook.com,referral:0,restart:1,qrysent:0,timeout:0,lame:0,neterr:0,badresp:0,adberr:2,findfail:0,valfail:0]

04-May-2016 12:47:30.918 query-errors: debug 1: client 10.10.10.96#48950 (mdfoodbank-org.mail.eo.outlook.com): query failed (SERVFAIL) for mdfoodbank-org.mail.eo.outlook.com/IN/A at query.c:7004
04-May-2016 12:47:30.918 query-errors: debug 2: fetch completed at resolver.c:3074 for mdfoodbank-org.mail.eo.outlook.com/A in 0.000078: failure/success [domain:mail.eo.outlook.com,referral:0,restart:1,qrysent:0,timeout:0,lame:0,neterr:0,badresp:0,adberr:2,findfail:0,valfail:0]
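
The bracketed counters at the end of each "fetch completed" line are easier to compare across many failures once they are split into fields.  The following is a minimal sketch, assuming Python 3 and the log layout shown in the examples above; parse_fetch_counters is a hypothetical helper name and is not part of BIND.  It only extracts the key:value pairs as logged and makes no claim about what each counter means inside the resolver.

```python
import re

# Matches the trailing "[domain:...,referral:0,...]" block of a
# "fetch completed" line, as seen in the query-errors examples above.
COUNTERS_RE = re.compile(r"\[([^\]]*)\]\s*$")


def parse_fetch_counters(line):
    """Return the bracketed key:value pairs as a dict.

    Purely numeric values are converted to int; anything else
    (such as the domain) is kept as a string.  Lines without a
    trailing bracketed block yield an empty dict.
    """
    match = COUNTERS_RE.search(line)
    if match is None:
        return {}
    counters = {}
    for field in match.group(1).split(","):
        key, _, value = field.partition(":")
        counters[key] = int(value) if value.isdigit() else value
    return counters


# One of the logged lines from above, with the link residue removed.
sample = (
    "03-May-2016 09:16:48.002 query-errors: debug 2: fetch completed at "
    "resolver.c:3074 for zulily-com.mail.protection.outlook.com/A in 0.000067: "
    "failure/success [domain:mail.protection.outlook.com,referral:0,restart:1,"
    "qrysent:0,timeout:0,lame:0,neterr:0,badresp:0,adberr:2,findfail:0,valfail:0]"
)

counters = parse_fetch_counters(sample)

# Keep only the numeric counters that are not zero.
nonzero = {k: v for k, v in counters.items() if isinstance(v, int) and v != 0}
print(nonzero)  # prints {'restart': 1, 'adberr': 2}
```

Run against all four "fetch completed" lines above, the same pattern appears each time: qrysent is 0, and restart and adberr are the only non-zero counters.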

I have added config statements to send query-errors to dedicated files and increased debugging to 10 on that channel.  The referenced sections of resolver.c and query.c are as follows:

resolver.c

fctx_try(fetchctx_t *fctx, isc_boolean_t retrying, isc_boolean_t badcache) {
        isc_result_t result;
        dns_adbaddrinfo_t *addrinfo;

        FCTXTRACE("try");

        REQUIRE(!ADDRWAIT(fctx));

        addrinfo = fctx_nextaddress(fctx);
        if (addrinfo == NULL) {
                /*
                 * We have no more addresses.  Start over.
                 */
                fctx_cancelqueries(fctx, ISC_TRUE);
                fctx_cleanupfinds(fctx);
                fctx_cleanupaltfinds(fctx);
                fctx_cleanupforwaddrs(fctx);
                fctx_cleanupaltaddrs(fctx);
                result = fctx_getaddresses(fctx, badcache);
                if (result == DNS_R_WAIT) {
                        /*
                         * Sleep waiting for addresses.
                         */
                        FCTXTRACE("addrwait");
                        fctx->attributes |= FCTX_ATTR_ADDRWAIT;
                        return;
                } else if (result != ISC_R_SUCCESS) {
                        /*
                         * Something bad happened.
                         */
                        fctx_done(fctx, result, __LINE__);

query.c

                /*
                 * Switch to the new qname and restart.
                 */
                ns_client_qnamereplace(client, fname);
                fname = NULL;
                want_restart = ISC_TRUE;
                if (!WANTRECURSION(client))
                        options |= DNS_GETDB_NOLOG;
                goto addauth;
        default:
                /*
                 * Something has gone wrong.
                 */
                QUERY_ERROR(DNS_R_SERVFAIL);

Does anyone know what these logged errors indicate or where I can research them further in the documentation?  So far my searches are coming up empty.

Thanks,
Rob Heilman

# uname -a
Linux fe2 3.16.0-4-686-pae #1 SMP Debian 3.16.7-ckt25-1 (2016-03-06) i686 GNU/Linux
# /usr/sbin/named -v
BIND 9.9.5-9+deb8u6-Debian (Extended Support Version)
#
sar reports a 1-minute load average under 0.5 and CPU idle over 90%.
