BIND 9.5.x

Tue Sep 16 18:58:17 UTC 2008

Hello,

I'm seeing some interesting behavior that others may have run into and
that I've been unable to solve.

Here's the scenario. I see this issue with both 9.5.1b1 and 9.5.0-P2.

Both builds compile fine, and I do set FD_SETSIZE.  I originally had the
crash related to the sockets issue, but that isn't the problem here:
setting FD_SETSIZE takes care of the crash.  I thought all was good to go
until...

During normal use, which can last from one day to perhaps a week or two,
the servers run fine.  At some point there is either a loss of connectivity
to the roots or a large burst of recursive queries, which apparently sends
the recursive query count all the way up to the recursive limit.
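
For anyone wanting to watch that count, here is a minimal sketch that
pulls the recursive client numbers out of `rndc status` output.  The line
format is an assumption on my part (some releases print active/max, others
active/soft/max), and the function names and the 0.9 threshold are purely
illustrative, so check them against your own output.

```python
import re

# Assumed format of the "recursive clients" line in `rndc status` output.
# Two numbers are read as active/max, three as active/soft/max.
_LINE = re.compile(r"recursive clients:\s*(\d+)/(?:(\d+)/)?(\d+)")


def parse_recursive_clients(status_text):
    """Return (active, limit) from rndc status text, or None if absent."""
    m = _LINE.search(status_text)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(3))


def near_limit(status_text, threshold=0.9):
    """True when the active count is at or above threshold * limit."""
    parsed = parse_recursive_clients(status_text)
    if parsed is None:
        return False
    active, limit = parsed
    return limit > 0 and active >= threshold * limit
```

The text could come from something like
subprocess.run(["rndc", "status"], capture_output=True, text=True).stdout,
run from cron so the count is logged in the run-up to the event.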

In BIND 9.5.0-P2, once the recursive query limit is reached, I cannot
reach the roots from the server.  However, I can reach them from other
machines not running BIND.  I also couldn't ping certain key gateways from
the server, although I could reach those gateways from other nearby
machines, and that's the disturbing part.  Once the server is in this
state, other applications on it cannot reach known good destinations.
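
As an illustration of the kind of check involved, here is a minimal sketch
that sends one DNS query for the root NS set straight to a root server
over UDP, bypassing named entirely.  The address is a.root-servers.net;
the function names and the timeout are hypothetical choices.  A reply only
shows that a fresh UDP socket can get out and back from this host; it says
nothing about what named itself is doing.

```python
import socket
import struct

# a.root-servers.net (IPv4); swap in another root if preferred.
ROOT_ADDR = "198.41.0.4"


def build_root_ns_query(query_id=0x1234):
    """Build a minimal DNS query for '. IN NS', recursion not requested."""
    # Header: ID, flags=0, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # Question: root name (single zero byte), QTYPE=NS (2), QCLASS=IN (1)
    question = b"\x00" + struct.pack("!HH", 2, 1)
    return header + question


def probe(addr=ROOT_ADDR, timeout=3.0):
    """Return True if a matching UDP reply arrives from addr:53 in time."""
    query = build_root_ns_query()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(query, (addr, 53))
            reply, _ = s.recvfrom(4096)
        except OSError:
            # Covers timeouts as well as "no route" / "no buffer" errors.
            return False
    # A valid reply has at least a 12-byte header and echoes the query ID.
    return len(reply) >= 12 and reply[:2] == query[:2]
```

Running probe() on the affected server and on a nearby machine at the same
moment would show whether the failure is local to the BIND host.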

When I stop the BIND process, I can then reach the roots and the previously
unreachable hops on the route path towards the roots.  So everything
works fine for a while, until some event occurs which seems to be related
to sockets.  The condition is also sticky, in that it may take a little
time for the system to return to normal after BIND is stopped.

So it sounds like a resource issue for the app, in that the patched
version has used up all the available network resources and no new
resources can be reserved, so other applications like ping stop working
until the resources, including those held by BIND, are freed up.

I guess this could be a combined issue between the OS and the BIND
application.  Or is there something else I need to adjust?  I could offload
some of the querying applications, but that seems to be a temporary fix at
best.

In BIND 9.5.0-P2, once this event happens the server apparently never
recovers, since it can't reach the roots and the queries from clients never
stop.  That continues until the process is killed and restarted, at which
point there is a good chance the problem will repeat itself because of the
backlog and the need to rebuild the cache.

BIND 9.5.1b1 seems better behaved, in that it eventually recovers, but it
shows the same symptom of not being able to reach destinations that are
reachable from other machines.  So I guess I'm at a loss to understand what
needs to be changed or adjusted so this doesn't happen.  Is this a
scalability issue?  A DoS vector?  Do I need a patch for the OS?  Should I
run an earlier release of BIND?

My OS is Solaris 10, and I've also seen this with Solaris 9.

Thanks for any insight,

Robert