question about negative caching (Servfail)
    Jan Koum 
    jkb at yahoo-inc.com
       
    Thu Jan 17 06:48:57 UTC 2002
    
    
  
folks,
looking at rfc2308 section 7.2 it states:
   A server MAY cache a dead server indication.  If it does so it MUST
   NOT be deemed dead for longer than five (5) minutes.  The indication
   MUST be stored against query tuple <query name, type, class, server
   IP address> unless there was a transport layer indication that the
   server does not exist, in which case it applies to all queries to
   that specific IP address.
is this feature implemented anywhere?  why am i asking this:  we have
discovered a situation in which named will fail.  we run a very large
site, so imagine the following scenario:
- auth servers for hotmail.com go down.
- 400+ yahoo MTA machines try to send large volumes of mail to hotmail
- those 400+ machines begin to query our caching servers, which in turn
  try to talk to hotmail but can't, and they get back SERVFAIL.
- for some reason very large numbers of SERVFAILs create a situation where
  bind is taking up all the CPU (97% or more, leaving 0 idle CPU)
  [our guess is that when the servers are down, each SERVFAIL takes a very
  long time to come back, so named keeps state (memory, cpu, etc.) for all
  of these pending requests.]
- more and more mail gets queued up and the situation gets worse....
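
to make the question concrete, here is a rough python sketch of the kind of dead-server cache that section 7.2 seems to describe, applied to the scenario above. it is only an illustration, not how named works internally: the class name, method names and the injectable clock are all made up for this example. the only parts taken from the rfc are the five minute cap and the choice of key (full query tuple, or just the ip address after a transport layer failure).

```python
import time

MAX_DEAD_SECONDS = 300  # rfc2308 section 7.2: never deemed dead for longer than 5 minutes


class DeadServerCache:
    """Hypothetical cache of servers that recently failed to answer."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so the cache can be tested without sleeping
        self._expiry = {}     # key -> time at which the entry stops applying

    def mark_dead(self, server_ip, qname, qtype, qclass="IN",
                  ttl=MAX_DEAD_SECONDS, transport_failure=False):
        # A transport layer indication covers every query to that address;
        # otherwise the entry is stored against the full query tuple.
        if transport_failure:
            key = (server_ip,)
        else:
            key = (qname.lower(), qtype, qclass, server_ip)
        # Clamp the lifetime so an entry can never outlive the 5 minute cap.
        self._expiry[key] = self._clock() + min(ttl, MAX_DEAD_SECONDS)

    def is_dead(self, server_ip, qname, qtype, qclass="IN"):
        now = self._clock()
        for key in ((server_ip,), (qname.lower(), qtype, qclass, server_ip)):
            expires = self._expiry.get(key)
            if expires is None:
                continue
            if expires > now:
                return True
            del self._expiry[key]  # entry has aged out, forget it
        return False
```

with something like this in front of the resolver, a caching server could answer SERVFAIL immediately for a tuple that is marked dead instead of holding state for yet another slow upstream query, and would retry the real servers at most five minutes later.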
the hack-ish solution we have right now is a little perl script which runs
tcpdump, and if 10% of the last X packets show SERVFAIL, we blackhole the
domain by creating a fake zone entry in named.conf.
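
the decision logic of that workaround can be sketched roughly as below. this is a python illustration, not the perl script itself, and it rests on several assumptions: responses are counted per domain, they have already been parsed out of the capture into (domain, rcode) pairs, min_samples stands in for "X", and db.blackhole is a hypothetical empty zone file.

```python
from collections import Counter

SERVFAIL = 2  # DNS RCODE for "server failure" (rfc1035)


def domains_to_blackhole(responses, threshold=0.10, min_samples=100):
    """responses: iterable of (domain, rcode) pairs from one capture window.

    Returns the sorted list of domains whose share of SERVFAIL responses
    is at least `threshold`, ignoring domains with too few samples.
    """
    total = Counter()
    failed = Counter()
    for domain, rcode in responses:
        domain = domain.lower().rstrip(".")  # normalise case and trailing dot
        total[domain] += 1
        if rcode == SERVFAIL:
            failed[domain] += 1
    return sorted(d for d, n in total.items()
                  if n >= min_samples and failed[d] / n >= threshold)


def blackhole_zone(domain, zone_file="db.blackhole"):
    # Declaring the zone locally makes named answer from the (empty) zone
    # file instead of recursing to the unreachable servers.
    return 'zone "%s" { type master; file "%s"; };\n' % (domain, zone_file)
```

for example, a window holding 10 SERVFAILs out of 100 responses for hotmail.com would return ["hotmail.com"], and blackhole_zone("hotmail.com") would give the stanza to append to named.conf before reloading. the stanza has to be removed again once the real servers come back, which is part of why this is a hack.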
i have never spent any time profiling or debugging bind in these
situations to see where it is taking up so much CPU, since usually these
problems need to get fixed right away.
i am curious to know if this is a known problem or if we have discovered
something new?  maybe next time i can do some gdb/ktrace to see where
exactly in named the problem is...
thanks,
-- yan
P.S. - this is my first post to this mailing list.  please excuse me if
this is not a correct forum for these types of questions.
    
    