NXDOMAIN problems

Tue Nov 17 05:41:29 UTC 2020

One other detail may be important: I just added a bridge interface and virtual machines.  I presume the VPN tunnel was using the hardware interface (enp5s0) before, and is using the bridge (br0) now.  OpenConnect creates the tunnel (tun0); both the name and inspection of the code indicate the tunnel is based on the TUN interface, at the IP layer, instead of the TAP interface, at the MAC layer.  If some of the communication is not using IP then I presume it could be disappearing at the bridge.

This theory seems to imply that DNS lookups will always fail, which is not the case.  dig has always worked (though I have not run many tests), while general lookups rarely work.  I presume the general lookups go through bind, though maybe lwres is involved.  If dig and bind use different communication methods with different abilities to traverse the network stack, that might explain some of the differences.

I don't think the virtual network is running any DNS servers since a) with bridging it is not an option and b) they are getting IPs from my main machine.  But if they were, that could definitely mess things up.

This is on Debian 10 (buster) with a Linux 4.19 kernel and bind 9.11.5.

________________________________________
From: Boylan, Ross
Sent: Monday, November 16, 2020 2:58 PM
To: bind-users at lists.isc.org
Cc: Ross Boylan
Subject: NXDOMAIN problems

I have been experiencing NXDOMAIN errors persistently, though not 100% of the time, for a machine I am trying to reach.  The queries worked OK before today.  Not only do I not know what's causing it, I am also having trouble tracing what's going on inside of bind.  I'd be grateful for help on either front, getting DNS to work or debugging.

There are a lot of complications.  In brief, the machine and name resolution for it are only available through VPN; I have a search list which should cause some failed lookups if the original doesn't work; and I'm using views.  Some details follow, and then discussion of my debugging attempts.

DETAILS

The remote machine is only accessible through VPN, and the nameserver that knows how to find it is also accessible only through VPN.  The IP of that nameserver is first on my forwarders list on my local machine.  When failures happen, the replies indicate the request was addressed to the public-facing nameservers; it is good that they don't provide any info, but they shouldn't be getting the request.

I also added the target domain (ucsf.edu) to my search list.  So when I ask for mymachine.ucsf.edu, this will also generate a query for mymachine.ucsf.edu.ucsf.edu if the first query fails.  The second query is asking for a non-existent domain, and so maybe that is the proximate source of the NXDOMAIN.
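To make the search-list behaviour concrete, here is a rough Python sketch of the order in which a stub resolver would typically try names.  It assumes glibc-style rules with the default ndots of 1; the function name candidate_names is made up for illustration, and this is not code from bind or glibc.

```python
def candidate_names(name, search, ndots=1):
    """Return the fully qualified names a stub resolver would try, in order.

    Assumes glibc-style behaviour: a name ending in "." is absolute; a name
    with at least `ndots` dots is tried as-is first and then with each
    search suffix; otherwise the search suffixes are tried first.
    """
    if name.endswith("."):
        return [name]
    as_is = name + "."
    with_suffix = [f"{name}.{suffix.rstrip('.')}." for suffix in search]
    if name.count(".") >= ndots:
        return [as_is] + with_suffix
    return with_suffix + [as_is]


if __name__ == "__main__":
    # Names taken from the message above; the search list holds one entry.
    for fqdn in candidate_names("mymachine.ucsf.edu", ["ucsf.edu"]):
        print(fqdn)
```

Under these assumptions the second candidate is mymachine.ucsf.edu.ucsf.edu., which matches the extra query described above.  Writing the name with a trailing dot makes it absolute, so no search suffix is appended.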

The machine I'm making the query from is in my own domain, which is why I'm running BIND.  I use views, and the query is processed through my "inside" view according to the logs.  ucsf.edu is NOT a domain I manage.

DEBUGGING

I directed, either explicitly or via default, all channels to a file and I have tried rndc trace as high as 4.  But I can't tell if the values are coming from the cache or where external queries are going.  Even after flushing the cache I didn't see any info on outbound queries.  I tried using the query-errors channel first, but it didn't seem to capture anything.  I guess NXDOMAIN is not considered an error.

Occasionally I've had success, particularly after flushing the cache (though that doesn't always work).  But when I try 30 seconds later I get NXDOMAIN.
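One way to pin down the success-then-failure pattern is to repeat the same lookup at a fixed interval and record each outcome.  The following is only a sketch: probe is a hypothetical helper, and it uses Python's socket.getaddrinfo, which on Linux goes through the system resolver configuration (including the search list) rather than querying a nameserver directly the way dig does.

```python
import socket
import time


def probe(name, attempts=5, delay=1.0, resolve=socket.getaddrinfo, sleep=time.sleep):
    """Resolve `name` repeatedly and return a list of (attempt, outcome) pairs.

    outcome is a sorted list of addresses on success, or a string starting
    with "error:" when the lookup raises socket.gaierror.
    """
    results = []
    for attempt in range(attempts):
        try:
            infos = resolve(name, None)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the address string.
            outcome = sorted({info[4][0] for info in infos})
        except socket.gaierror as exc:
            outcome = f"error: {exc}"
        results.append((attempt, outcome))
        if attempt < attempts - 1:
            sleep(delay)
    return results


# Example use (needs the VPN up; the name is the placeholder from this message):
#   for attempt, outcome in probe("mymachine.ucsf.edu", attempts=10, delay=30):
#       print(attempt, outcome)
```

Comparing the timestamps of such a run against the bind log might show whether the failures line up with cache expiry or with queries leaving by the wrong route.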

Every query I have directed explicitly (with dig) at the campus nameserver has succeeded.
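To compare the campus nameserver with the local bind on equal terms, the same question can be sent to each over plain UDP and the response code read from the header.  This is a minimal sketch, not a replacement for dig: the helper names are invented, it does not check the reply ID or handle truncation, and the server addresses must be supplied by the reader since the campus nameserver's IP is not given here.

```python
import socket
import struct

# A few common response codes; other values are returned as their number.
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, qid=0x1234, qtype=1):
    """Build a DNS query for `name` (qtype 1 = A, class IN, recursion desired)."""
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def rcode_of(response):
    """Extract the RCODE (low 4 bits of the flags word) from a DNS response."""
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODES.get(code, str(code))


def ask(server, name, timeout=3.0):
    """Send one UDP query to `server` on port 53 and return the RCODE name."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        response, _ = sock.recvfrom(4096)
    return rcode_of(response)


# Example use (addresses are placeholders, not taken from this message):
#   print(ask("127.0.0.1", "mymachine.ucsf.edu"))
#   print(ask("<campus nameserver IP>", "mymachine.ucsf.edu"))
```

If the local server returns NXDOMAIN while the campus server returns NOERROR for the identical question, that would point at forwarding or caching in the local bind rather than at the VPN path itself.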

The VPN connection has always been a bit touchy, and the problem first arose immediately after it went down for somewhere between 30 seconds and a couple of minutes.  My initial theory was that the outage had caused a failure to be cached, but getting failures right after successes is not consistent with that.

Thanks for any help.

Ross Boylan