BIND Lameness

Wed Apr 4 21:30:31 UTC 2012

On 04/03/2012 02:30 PM, Barry Margolin wrote:
> In article<mailman.421.1333467523.63724.bind-users at lists.isc.org>,
>   Keith Burgoyne<keith at silverorange.com>  wrote:
>
>> On 04/03/2012 11:14 AM, Barry Margolin wrote:
>>> In article<mailman.419.1333434497.63724.bind-users at lists.isc.org>,
>>>    Chuck Swiger<cswiger at mac.com>   wrote:
>>>
>>>> On 4/2/2012 10:37 PM, Keith Burgoyne wrote:
>>>> [ ... ]
>>>>> I've recently replaced the master server at 24.222.7.11, and am now
>>>>> running bind 9.7.3.
>>>>>
>>>>> My question is: I keep seeing log entries like
>>>>>
>>>>> Apr 2 23:24:17 clementine named[5870]: lame server resolving
>>>>> 'comuna.silverorange.com' (in 'silverorange.com'?): 24.222.7.12#53
>>>>> Apr 2 23:24:01 clementine named[5870]: lame server resolving 'veseys.com'
>>>>> (in 'veseys.com'?): 24.222.7.12#53
>>>>>
>>>>> and the list goes on. I don't get a lot, probably a few a minute. But
>>>>> where do they come from?
>>>>
>>>> Does the following help:
>>>>
>>>>      http://www.dnsvalidation.com/reports/4f7a96b37d79ee3769000012
>>>>      http://www.dnsvalidation.com/reports/4f7a97bd7d79ee3d4200000c
>>>>
>>>> ns3.silverorange.com seems to be down, and the other two nameservers being
>>>
>>> Since the log message is specifically about ns1, how could ns3's status
>>> be relevant?
>>>
>>>> listed aren't responding to TCP port 53.
>>>
>>> Why would clementine be trying TCP?  His server appears to support
>>> EDNS0, so it shouldn't need it.
>>>
>>> I'm not saying this isn't a problem, but I don't think it would cause
>>> this symptom.
>>>
>>
>> @Chuck: I've sorted out the TCP issue with ns2. I'm not sure why the
>> tests are failing for ns3, though. I can dig successfully on it from a
>> variety of networks, using both TCP and UDP. I re-ran the tests, and ns1
>> seems to check out.
>>
>> @Barry: The previous sysadmin set some of our NS entries on some of our
>> domains in no particular order. As a result, some of the domains list
>> the master (ns1.silverorange.com) as the second or third entry. Not sure
>> if that matters. I'm correcting that now, and will know if that makes a
>> difference a little later.
>
> There's no significance to order of any records in DNS.  By default,
> BIND shuffles them each time it answers, to get simple load balancing
> (unless the client sorts them).  And client nameservers use NS records
> by remembering response times, and preferring the faster servers (a
> simple form of GSLB).
>
> If a server doesn't respond, you don't get a "lame server" message, it
> just goes on to the next server.  And it will be remembered as a slow
> server, so it won't be tried again for a while.
>

I think I have narrowed the problem down to my internal DNS.
ns1.silverorange.com is split into two separate views: internal and
external. The internal view handles all local hostnames and allows
recursive lookups for local clients. The external view refuses recursive
lookups for any domain we don't host.

By enabling "querylog yes;" in named.conf, I could see that the "lame
server" messages appear to be generated only by queries from the internal
side for domains we host in the external view. For example:

Apr  4 17:55:24 clementine named[22480]: client 192.168.0.254#34358: 
view internal: query: silverorange.com IN A -EDC (192.168.0.12)
Apr  4 17:55:24 clementine named[22480]: lame server resolving 
'silverorange.com' (in 'silverorange.com'?): 24.222.7.12#53
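
To check that pairing across the whole log rather than by eye, something
like the following could help. It is only a rough sketch: the two regular
expressions are based solely on the two message formats shown above, it
assumes each syslog entry sits on a single line (unwrapped, unlike in this
mail), and the names correlate, QUERY_RE and LAME_RE are just ones made up
for this example.

```python
import re

# Patterns derived only from the two log formats quoted in this thread.
QUERY_RE = re.compile(
    r"client (?P<client>[\d.]+)#\d+: view (?P<view>\S+): "
    r"query: (?P<name>\S+) IN (?P<type>\S+)"
)
LAME_RE = re.compile(
    r"lame server resolving '(?P<name>[^']+)' "
    r"\(in '(?P<zone>[^']+)'\?\): (?P<server>[\d.]+)#\d+"
)


def correlate(lines):
    """Pair each "lame server" message with the most recent query
    logged for the same name, if there was one."""
    last_query = {}  # lowercased query name -> (view, client address)
    hits = []
    for line in lines:
        query = QUERY_RE.search(line)
        if query:
            last_query[query["name"].lower()] = (query["view"], query["client"])
            continue
        lame = LAME_RE.search(line)
        if lame:
            view, client = last_query.get(lame["name"].lower(), (None, None))
            hits.append((lame["name"], lame["server"], view, client))
    return hits


# The two entries from above, joined back onto single lines.
SAMPLE = [
    "Apr  4 17:55:24 clementine named[22480]: client 192.168.0.254#34358: "
    "view internal: query: silverorange.com IN A -EDC (192.168.0.12)",
    "Apr  4 17:55:24 clementine named[22480]: lame server resolving "
    "'silverorange.com' (in 'silverorange.com'?): 24.222.7.12#53",
]

if __name__ == "__main__":
    for name, server, view, client in correlate(SAMPLE):
        print(f"{name}: lame at {server}, last query from view={view} client={client}")
```

A lame message that comes back with view=None would mean no matching query
was logged before it, which would count against the theory that only
internal-view queries trigger these.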

Why would a query through the internal view produce a "lame server" message?

Thanks again for your help!