Issues resolving outlook.office365.com

Mark Andrews marka at isc.org
Mon Jun 20 06:39:06 UTC 2016


A fix for this is in review and should be in the next maintainance
release.

Mark

In message <16A2CDFD-694D-444A-A760-17C9D7517BCD at open.ch>, Thomas Sturm writes:
>
> I am now able to reliably reproduce the behaviour with dig querying BIND
> 9.10.4-P1 (not 9.9, apparently) with "prefetch 0”:
>
> $ while true; do dig outlook.office365.com +noauthority +noadditional
> +tries=1 +retry=0; sleep 0.1; done
>
> Wait for 5 minutes, once the TTL expires, this should show about 5-7
> SERVFAIL responses.
>
> prefetch 1 or 2 makes it harder to reproduce and it only happens
> (sometimes) on loaded systems. prefetch 10 makes it go away.
>
> It never happens after restarting or flushing the cache. And it never
> happens when querying x seconds _after_ the TTL expired. Could there be
> an issue processing cached client requests during cache expiry, and since
> it only happens on 9.10, potentially related to prefetching?
>
>
>
> > On 16.06.2016, at 10:00, Thomas Sturm <tst at open.ch> wrote:
> >
> > Hi,
> >
> > We are experiencing strange intermittent issues when resolving
> outlook.office365.com, but also with other domains like e.g.
> amazonaws.com or snort.org. But let’s choose office365.com as example for
> now. outlook.office365.com is a CNAME to lb.geo.office365.com, and
> office365.com delegates the geo subdomain to different nameservers; 2 of
> them are showing some issues on intodns.com [1] (which may or may not be
> related to this problem).
> >
> > When querying one of the office365.com nameservers, it correctly
> delegates, as far as I understand:
> >
> > # dig a lb.geo.office365.com @ns1.msft.net +noadditional +nostats
> >
> > ; <<>> DiG 9.10.4 <<>> a lb.geo.office365.com @ns1.msft.net
> +noadditional +nostats
> > ;; global options: +cmd
> > ;; Got answer:
> > ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37098
> > ;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 6, ADDITIONAL: 5
> > ;; WARNING: recursion requested but not available
> >
> > ;; OPT PSEUDOSECTION:
> > ; EDNS: version: 0, flags:; udp: 4000
> > ;; QUESTION SECTION:
> > ;lb.geo.office365.com.		IN	A
> >
> > ;; AUTHORITY SECTION:
> > geo.office365.com.	300	IN	NS	
> glb1.glbdns2.microsoft.com.
> > geo.office365.com.	300	IN	NS	ns1.p21.dynect.net.
> > geo.office365.com.	300	IN	NS	ns3.p21.dynect.net.
> > geo.office365.com.	300	IN	NS	ns4.p21.dynect.net.
> > geo.office365.com.	300	IN	NS	ns2.p21.dynect.net.
> > geo.office365.com.	300	IN	NS	
> glb2.glbdns2.microsoft.com.
> >
> > Still, BIND (sometimes) decides to return SERVFAIL to the client
> immediately after receiving this response. Some interesting debug log
> lines:
> >
> > resolver: debug 3: resquery 0x7f26fecc8010 (fctx
> 0x7f26fecb4458(lb.geo.office365.com/A)): sent
> > resolver: debug 3: resquery 0x7f26fecc8010 (fctx
> 0x7f26fecb4458(lb.geo.office365.com/A)): response
> > resolver: debug 10: received packet:
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A):
> noanswer_response
> > resolver: debug 10: log_ns_ttl: fctx 0x7f26fecb4458: noanswer_response:
> lb.geo.office365.com (in 'office365.com'?): 1 172499
> > resolver: debug 10: log_ns_ttl: fctx 0x7f26fecb4458: DELEGATION:
> lb.geo.office365.com (in 'geo.office365.com'?): 0 172499
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A):
> cache_message
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A):
> [result: success] query canceled in response();  responding
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A):
> cancelquery
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A):
> nameservers now above QDOMAIN
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): done
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A):
> stopeverything
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A):
> cancelqueries
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A):
> sendevents
> > client: error: query client=0x7f2700055ca0 thread=0x7f2709813700
> (lb.geo.office365.com/A): query_find: unexpected error after resuming:
> SERVFAIL
> > query-errors: debug 1: client 127.0.0.1#35062 (outlook.office365.com):
> query failed (SERVFAIL) for outlook.office365.com/IN/A at query.c:7837
> >
> > “nameservers now above QDOMAIN” sounds like a geo.office365.com
> nameserver refers back to an office365.com nameserver? The thing is
> though, I cannot see any such response packet in tcpdump. Is this
> information taken (wrongly) from cache then? The same log message appears
> at all times for any of the failing domains we’ve seen so far.
> >
> > Note that this doesn’t seem to happen with an empty cache and we are
> also not able to trigger it on a test machine. It only happens on loaded
> machines once the cache TTL of the queried record expires. We can
> reproduce it with the latest patch levels of both 9.10 and 9.9.
> >
> > Regards,
> > Thomas
> >
> >
> > [1]
> http://intodns.com/geo.office365.com______________________________________
> _________
> > Please visit https://lists.isc.org/mailman/listinfo/bind-users to
> unsubscribe from this list
> >
> > bind-users mailing list
> > bind-users at lists.isc.org
> > https://lists.isc.org/mailman/listinfo/bind-users
>
>
> --
> thomas sturm
> principal engineer
>
> open systems ag
> raeffelstrasse 29
> ch-8045 zurich
> t: +41 58 100 10 10
> f: +41 58 100 10 11
>
> tst at open.ch
>
> http://www.open.ch
>
>

-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742                 INTERNET: marka at isc.org


More information about the bind-users mailing list