Issues resolving outlook.office365.com

Thomas Sturm tst at open.ch
Wed Jul 6 13:29:52 UTC 2016


Hi Mark,

I may have found another (possibly related?) bug:

I noticed that when validating a signed zone with delv against a local BIND caching server (v9.10.3-P4), delv sometimes suddenly reports "no valid RRSIG". Indeed, "dig ds mydomain +dnssec" then returns the DS records but no RRSIG at all. The following sequence of commands (output simplified) makes me think this might be related to prefetch/cache expiry as well (prefetch value 2): the DS TTL is refreshed to 3600 while the RRSIG TTL keeps counting down to 0, after which the RRSIG is gone.

$ while true; do dig ds mydomain +dnssec; sleep 1; done
;; ANSWER SECTION:
mydomain.		3	IN	DS	[…]
mydomain.		3	IN	DS	[…]
mydomain.		3	IN	RRSIG	DS […]

;; ANSWER SECTION:
mydomain.		3600	IN	DS	[…]
mydomain.		3600	IN	DS	[…]
mydomain.		2	IN	RRSIG	DS […]

;; ANSWER SECTION:
mydomain.		3599	IN	DS	[…]
mydomain.		3599	IN	DS	[…]
mydomain.		1	IN	RRSIG	DS […]

;; ANSWER SECTION:
mydomain.		3598	IN	DS	[…]
mydomain.		3598	IN	DS	[…]
mydomain.		0	IN	RRSIG	DS […]

;; ANSWER SECTION:
mydomain.		3597	IN	DS	[…]
mydomain.		3597	IN	DS	[…]


What’s your take on this?

Regards,
Thomas


> On 20.06.2016, at 08:39, Mark Andrews <marka at isc.org> wrote:
> 
> 
> A fix for this is in review and should be in the next maintenance
> release.
> 
> Mark
> 
> In message <16A2CDFD-694D-444A-A760-17C9D7517BCD at open.ch>, Thomas Sturm writes:
>> 
>> I am now able to reliably reproduce the behaviour with dig querying BIND
>> 9.10.4-P1 (not 9.9, apparently) with "prefetch 0":
>> 
>> $ while true; do dig outlook.office365.com +noauthority +noadditional +tries=1 +retry=0; sleep 0.1; done
>> 
>> Wait for 5 minutes; once the TTL expires, this should show about 5-7
>> SERVFAIL responses.
>> 
>> prefetch 1 or 2 makes it harder to reproduce and it only happens
>> (sometimes) on loaded systems. prefetch 10 makes it go away.
>> 
>> It never happens after restarting or flushing the cache. And it never
>> happens when querying x seconds _after_ the TTL expired. Could there be
>> an issue processing cached client requests during cache expiry, and since
>> it only happens on 9.10, potentially related to prefetching?
>> 
>> 
>> 
>>> On 16.06.2016, at 10:00, Thomas Sturm <tst at open.ch> wrote:
>>> 
>>> Hi,
>>> 
>>> We are experiencing strange intermittent issues when resolving
>>> outlook.office365.com, but also with other domains such as
>>> amazonaws.com or snort.org. But let’s use office365.com as the example for
>>> now. outlook.office365.com is a CNAME to lb.geo.office365.com, and
>>> office365.com delegates the geo subdomain to different nameservers; 2 of
>>> them are showing some issues on intodns.com [1] (which may or may not be
>>> related to this problem).
>>> 
>>> When querying one of the office365.com nameservers, it correctly
>>> delegates, as far as I understand:
>>> 
>>> # dig a lb.geo.office365.com @ns1.msft.net +noadditional +nostats
>>> 
>>> ; <<>> DiG 9.10.4 <<>> a lb.geo.office365.com @ns1.msft.net +noadditional +nostats
>>> ;; global options: +cmd
>>> ;; Got answer:
>>> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37098
>>> ;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 6, ADDITIONAL: 5
>>> ;; WARNING: recursion requested but not available
>>> 
>>> ;; OPT PSEUDOSECTION:
>>> ; EDNS: version: 0, flags:; udp: 4000
>>> ;; QUESTION SECTION:
>>> ;lb.geo.office365.com.		IN	A
>>> 
>>> ;; AUTHORITY SECTION:
>>> geo.office365.com.	300	IN	NS	glb1.glbdns2.microsoft.com.
>>> geo.office365.com.	300	IN	NS	ns1.p21.dynect.net.
>>> geo.office365.com.	300	IN	NS	ns3.p21.dynect.net.
>>> geo.office365.com.	300	IN	NS	ns4.p21.dynect.net.
>>> geo.office365.com.	300	IN	NS	ns2.p21.dynect.net.
>>> geo.office365.com.	300	IN	NS	glb2.glbdns2.microsoft.com.
>>> 
>>> Still, BIND (sometimes) decides to return SERVFAIL to the client
>>> immediately after receiving this response. Some interesting debug log
>>> lines:
>>> 
>>> resolver: debug 3: resquery 0x7f26fecc8010 (fctx 0x7f26fecb4458(lb.geo.office365.com/A)): sent
>>> resolver: debug 3: resquery 0x7f26fecc8010 (fctx 0x7f26fecb4458(lb.geo.office365.com/A)): response
>>> resolver: debug 10: received packet:
>>> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): noanswer_response
>>> resolver: debug 10: log_ns_ttl: fctx 0x7f26fecb4458: noanswer_response: lb.geo.office365.com (in 'office365.com'?): 1 172499
>>> resolver: debug 10: log_ns_ttl: fctx 0x7f26fecb4458: DELEGATION: lb.geo.office365.com (in 'geo.office365.com'?): 0 172499
>>> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): cache_message
>>> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): [result: success] query canceled in response();  responding
>>> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): cancelquery
>>> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): nameservers now above QDOMAIN
>>> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): done
>>> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): stopeverything
>>> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): cancelqueries
>>> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): sendevents
>>> client: error: query client=0x7f2700055ca0 thread=0x7f2709813700 (lb.geo.office365.com/A): query_find: unexpected error after resuming: SERVFAIL
>>> query-errors: debug 1: client 127.0.0.1#35062 (outlook.office365.com): query failed (SERVFAIL) for outlook.office365.com/IN/A at query.c:7837
>>> 
>>> “nameservers now above QDOMAIN” sounds like a geo.office365.com
>>> nameserver refers back to an office365.com nameserver? The thing is
>>> though, I cannot see any such response packet in tcpdump. Is this
>>> information taken (wrongly) from cache then? The same log message appears
>>> at all times for any of the failing domains we’ve seen so far.
>>> 
>>> Note that this doesn’t seem to happen with an empty cache and we are
>>> also not able to trigger it on a test machine. It only happens on loaded
>>> machines once the cache TTL of the queried record expires. We can
>>> reproduce it with the latest patch levels of both 9.10 and 9.9.
>>> 
>>> Regards,
>>> Thomas
>>> 
>>> 
>>> [1] http://intodns.com/geo.office365.com
> 
> -- 
> Mark Andrews, ISC
> 1 Seymour St., Dundas Valley, NSW 2117, Australia
> PHONE: +61 2 9871 4742                 INTERNET: marka at isc.org


-- 
thomas sturm
principal engineer

open systems ag
raeffelstrasse 29
ch-8045 zurich
t: +41 58 100 10 10
f: +41 58 100 10 11

tst at open.ch

http://www.open.ch

More information about the bind-users mailing list