named 8.3.4 dies under recursive cache load (with ACL'ed zones?)

Sat Feb 15 18:30:20 UTC 2003

[ On Saturday, February 15, 2003 at 17:52:39 (+0000), Paul Vixie wrote: ]
> Subject: Re: named 8.3.4 dies under recursive cache load (with ACL'ed zones?) 
>
> greg, allow me to apologize on behalf of the bind4.9 and bind8 effort
> for the fact that named cannot recover from a malloc() failure.

Heh.

Seriously, though, I really don't think that can be what's happening.
The process right now is running with a virtual size of 105MB and I've
never seen it rise above 200MB (though I haven't been able to look at it
just before it crashes, of course).

# ps -lp 24945      
  UID   PID PPID CPU PRI NI    VSZ    RSS WCHAN  STAT TT     TIME COMMAND
32769 24945    1   4   2  4 105888 105220 select SNs  ?? 46:38.61 /usr/sbin/named -u dns -g dns -u dns -g dns 

The process is running with its resource limits set to far exceed its
normal usage:

# sysctl proc.24945
proc.24945.corename = %n.core
proc.24945.rlimit.cputime.soft = unlimited
proc.24945.rlimit.cputime.hard = unlimited
proc.24945.rlimit.filesize.soft = unlimited
proc.24945.rlimit.filesize.hard = unlimited
proc.24945.rlimit.datasize.soft = 134217728
proc.24945.rlimit.datasize.hard = 1073741824
proc.24945.rlimit.stacksize.soft = 2097152
proc.24945.rlimit.stacksize.hard = 33554432
proc.24945.rlimit.coredumpsize.soft = unlimited
proc.24945.rlimit.coredumpsize.hard = unlimited
proc.24945.rlimit.memoryuse.soft = 520945664
proc.24945.rlimit.memoryuse.hard = 520945664
proc.24945.rlimit.memorylocked.soft = 173648554
proc.24945.rlimit.memorylocked.hard = 520945664
proc.24945.rlimit.maxproc.soft = 500
proc.24945.rlimit.maxproc.hard = 4116
proc.24945.rlimit.descriptors.soft = 13196
proc.24945.rlimit.descriptors.hard = 13196
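
To make the comparison with the ps output easier, the byte counts can be
converted to MiB.  This is only a rough sketch: the numbers are copied from
the sysctl and ps output above, the VSZ column is assumed to be in units of
1024 bytes, and the names (LIMITS_BYTES, to_mib, headroom_mib) are made up
purely for illustration.

```python
# Values copied from the sysctl and ps output above.
LIMITS_BYTES = {
    "datasize.soft": 134217728,
    "datasize.hard": 1073741824,
    "stacksize.soft": 2097152,
    "stacksize.hard": 33554432,
    "memoryuse.soft": 520945664,
}
VSZ_KB = 105888  # ps VSZ column; ASSUMED to be in units of 1024 bytes


def to_mib(nbytes):
    """Convert a byte count to mebibytes."""
    return nbytes / (1024 * 1024)


def headroom_mib(limit_bytes, vsz_kb=VSZ_KB):
    """MiB left between the current virtual size and a given limit."""
    return to_mib(limit_bytes - vsz_kb * 1024)


def report():
    """Print the virtual size and each limit in MiB."""
    print("VSZ             %8.1f MiB" % to_mib(VSZ_KB * 1024))
    for name, value in LIMITS_BYTES.items():
        print("%-15s %8.1f MiB" % (name, to_mib(value)))
```

Calling report() prints one line per entry.  With these numbers the soft
data size limit works out to 128 MiB and the hard limit to 1024 MiB, against
a current virtual size of about 103 MiB.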

The system has 512MB of RAM and a gigabyte of swap that it never
really touches (except for things like swapping out getty processes that
never ever execute):

# head /var/run/dmesg.boot                                                                             
NetBSD 1.5W (ACI) #0: Thu Mar 28 16:43:52 EST 2002
    woods at proven:/work/woods/NetBSD-src/sys/arch/i386/compile/ACI
cpu0: Intel Pentium II (Klamath) (686-class), 299.21 MHz
cpu0: I-cache 16 KB 32b/line 4-way, D-cache 16 KB 32b/line 2-way
cpu0: L2 cache 512 KB 32b/line 4-way
cpu0: features 80fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features 80fbff<PGE,MCA,CMOV,MMX>
total memory = 511 MB
avail memory = 445 MB
using 900 buffers containing 52360 KB of memory
# swapctl -lk
Device      1K-blocks     Used    Avail Capacity  Priority
/dev/sd0b      524170     1776   522394     0%    0
/dev/sd1b      524170     1748   522422     0%    0
Total         1048340     3524  1044816     0%

Unless maybe it's blowing the stack at some point...

More likely though is that a wild pointer is corrupting the heap and
confusing the heck out of malloc() and friends.

I suppose I could try building it against Electric Fence, or maybe
Vmalloc or dmalloc.  If there's heap corruption then one of those might
help find it without having to run under the debugger.

>  you
> should upgrade to bind9 which handles this condition properly.

So I give up on a significant number of features I find quite useful
(and in fact almost critical in this scenario) just so that I can have a
named that'll sit spinning its thumbs when it runs out of memory instead
of one that dies ungracefully?  I'm not sure what to think.

>  if you
> remain with bind8 then you should put it in a restart wrapper like:

Yes, of course.  That doesn't help though if there's a serious bug.
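
For reference, a restart wrapper of that general kind can be very small.
What follows is only a rough sketch in Python, and it is not the wrapper
from Paul's message (which isn't reproduced above).  The function name, the
restart limit, and the delay are all made up for illustration, and it
assumes the daemon can be kept in the foreground so that the wrapper can
wait on it.

```python
import subprocess
import sys
import time


def run_with_restart(cmd, max_restarts=5, delay=1.0):
    """Run cmd in the foreground and restart it after an abnormal exit.

    Returns the list of exit statuses seen.  Stops after a clean exit
    (status 0) or once max_restarts restarts have been used up.
    The limit and delay are illustrative defaults, not recommendations.
    """
    statuses = []
    while True:
        # subprocess.call() waits for the child; on POSIX a negative
        # status means the child was killed by that signal number.
        status = subprocess.call(cmd)
        statuses.append(status)
        if status == 0:
            break  # clean shutdown, don't restart
        if len(statuses) > max_restarts:
            break  # give up rather than loop forever on a hard failure
        sys.stderr.write("%s exited with status %d, restarting\n"
                         % (cmd[0], status))
        time.sleep(delay)
    return statuses
```

It would be called with the daemon's command line as a list, for instance
something like ["/usr/sbin/named", "-f", "-u", "dns", "-g", "dns"], where
the use of -f to stay in the foreground is my assumption about how one
would run it here.  The restart cap is there so that a process which dies
immediately on every start doesn't get respawned in a tight loop.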

-- 
								Greg A. Woods

+1 416 218-0098;            <g.a.woods at ieee.org>;           <woods at robohack.ca>
Planix, Inc. <woods at planix.com>; VE3TCP; Secrets of the Weird <woods at weird.com>