Issues resolving outlook.office365.com

Thu Jun 16 14:48:56 UTC 2016

I am now able to reliably reproduce the behaviour with dig querying BIND 9.10.4-P1 (not 9.9, apparently) with "prefetch 0”:

$ while true; do dig outlook.office365.com +noauthority +noadditional +tries=1 +retry=0; sleep 0.1; done

Wait for 5 minutes, once the TTL expires, this should show about 5-7 SERVFAIL responses. 

prefetch 1 or 2 makes it harder to reproduce and it only happens (sometimes) on loaded systems. prefetch 10 makes it go away. 

It never happens after restarting or flushing the cache. And it never happens when querying x seconds _after_ the TTL expired. Could there be an issue processing cached client requests during cache expiry, and since it only happens on 9.10, potentially related to prefetching?

> On 16.06.2016, at 10:00, Thomas Sturm <tst at open.ch> wrote:
> 
> Hi,
> 
> We are experiencing strange intermittent issues when resolving outlook.office365.com, but also with other domains like e.g. amazonaws.com or snort.org. But let’s choose office365.com as example for now. outlook.office365.com is a CNAME to lb.geo.office365.com, and office365.com delegates the geo subdomain to different nameservers; 2 of them are showing some issues on intodns.com [1] (which may or may not be related to this problem). 
> 
> When querying one of the office365.com nameservers, it correctly delegates, as far as I understand:
> 
> # dig a lb.geo.office365.com @ns1.msft.net +noadditional +nostats
> 
> ; <<>> DiG 9.10.4 <<>> a lb.geo.office365.com @ns1.msft.net +noadditional +nostats
> ;; global options: +cmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37098
> ;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 6, ADDITIONAL: 5
> ;; WARNING: recursion requested but not available
> 
> ;; OPT PSEUDOSECTION:
> ; EDNS: version: 0, flags:; udp: 4000
> ;; QUESTION SECTION:
> ;lb.geo.office365.com.		IN	A
> 
> ;; AUTHORITY SECTION:
> geo.office365.com.	300	IN	NS	glb1.glbdns2.microsoft.com.
> geo.office365.com.	300	IN	NS	ns1.p21.dynect.net.
> geo.office365.com.	300	IN	NS	ns3.p21.dynect.net.
> geo.office365.com.	300	IN	NS	ns4.p21.dynect.net.
> geo.office365.com.	300	IN	NS	ns2.p21.dynect.net.
> geo.office365.com.	300	IN	NS	glb2.glbdns2.microsoft.com.
> 
> Still, BIND (sometimes) decides to return SERVFAIL to the client immediately after receiving this response. Some interesting debug log lines:
> 
> resolver: debug 3: resquery 0x7f26fecc8010 (fctx 0x7f26fecb4458(lb.geo.office365.com/A)): sent
> resolver: debug 3: resquery 0x7f26fecc8010 (fctx 0x7f26fecb4458(lb.geo.office365.com/A)): response
> resolver: debug 10: received packet:
> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): noanswer_response
> resolver: debug 10: log_ns_ttl: fctx 0x7f26fecb4458: noanswer_response: lb.geo.office365.com (in 'office365.com'?): 1 172499
> resolver: debug 10: log_ns_ttl: fctx 0x7f26fecb4458: DELEGATION: lb.geo.office365.com (in 'geo.office365.com'?): 0 172499
> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): cache_message
> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): [result: success] query canceled in response();  responding
> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): cancelquery
> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): nameservers now above QDOMAIN
> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): done
> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): stopeverything
> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): cancelqueries
> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): sendevents
> client: error: query client=0x7f2700055ca0 thread=0x7f2709813700 (lb.geo.office365.com/A): query_find: unexpected error after resuming: SERVFAIL
> query-errors: debug 1: client 127.0.0.1#35062 (outlook.office365.com): query failed (SERVFAIL) for outlook.office365.com/IN/A at query.c:7837
> 
> “nameservers now above QDOMAIN” sounds like a geo.office365.com nameserver refers back to an office365.com nameserver? The thing is though, I cannot see any such response packet in tcpdump. Is this information taken (wrongly) from cache then? The same log message appears at all times for any of the failing domains we’ve seen so far. 
> 
> Note that this doesn’t seem to happen with an empty cache and we are also not able to trigger it on a test machine. It only happens on loaded machines once the cache TTL of the queried record expires. We can reproduce it with the latest patch levels of both 9.10 and 9.9. 
> 
> Regards,
> Thomas
> 
> 
> [1] http://intodns.com/geo.office365.com_______________________________________________
> Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list
> 
> bind-users mailing list
> bind-users at lists.isc.org
> https://lists.isc.org/mailman/listinfo/bind-users

-- 
thomas sturm
principal engineer

open systems ag
raeffelstrasse 29
ch-8045 zurich
t: +41 58 100 10 10
f: +41 58 100 10 11

tst at open.ch

http://www.open.ch

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4117 bytes
Desc: not available
URL: <https://lists.isc.org/pipermail/bind-users/attachments/20160616/5aabe2b1/attachment.bin>