innfeed crashing on Linux

Anne Wilson anne at unidata.ucar.edu
Wed Sep 29 15:48:20 UTC 2004


Hi Mark,

Thanks so much for the info.  A few responses are below.


Mark Hittinger wrote:
>>Sep 28 15:46:29 innfeed[30295]: bigbird:0 Setting up a reopen callback
>>[Crash here -aw]
>>...
>>This is a fairly recent behavior.  Any ideas???
> 
> 
> Anne the reopen callback queues up a subtask to check on the connection
> later - like its busy now or not responding now.
> 
> Innfeed is allocating some memory on the stack to hold the "note to self"
> and then sticking the note on its wall of post-it notes.  I think the wall
> might be filling up with too many :-)
> 

Thanks - this clears up some things.


> Are you getting a core file?  It would be very helpful to get a stack backtrace
> on the core file with gdb.
> 

Although I used to have *lots*, I don't have any core files at the 
moment as I turned off all relay to bigbird and tidied up.  (I assumed 
turning on debugging would be sufficient.)  And, I may not be able to 
get any more soon as bigbird is apparently finally back.  It had been 
down for almost three weeks, so your thoughts are consistent with the 
symptoms I saw.  That is, I didn't start having problems until it had 
been down quite a while.

It will be interesting to see what happens when I restart the relay to 
bigbird.
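
If another core does show up, this is roughly how I'd go about getting 
the backtrace (the binary path and core file name below are just guesses 
for our setup):

```shell
#!/bin/sh
# Make sure core files can actually be written before the next crash.
ulimit -c unlimited   # lift the core-size limit for this shell
ulimit -c             # show the limit now in effect

# Once a core appears, a quick non-interactive backtrace:
#   echo bt | gdb /usr/local/news/bin/innfeed core.30295
```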


> Anyway one idea is that you are running into the process stack limit.  If
> you log into your Linux box and run the "limit" command under csh (or
> "ulimit -s" under bash) you'll see something like this:
> 
> [bugs at hlt ~]$ limit
> ...
> stacksize       8192 kbytes
> ...
> 
> Linux will bomb the program if the stack tries to grow beyond 8meg.  I have
> seen programs with large data structures reach this limit and die unexpectedly.
> 

My stacksize is a little larger than that:

stacksize       10240 kbytes

But, if it's innfeed's stack, could it grow to 10M within three seconds 
or less?  It was crashing very quickly.
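
(For the record, since "limit" is csh syntax, this is how I checked it 
under bash; raising it for a shell and its children would look like 
this, assuming a stock bash/sh:)

```shell
#!/bin/sh
# "limit" is csh/tcsh syntax; under bash/sh the equivalent is ulimit.
ulimit -s             # soft stack limit in kbytes, e.g. 10240
ulimit -s unlimited   # raise it for children of this shell
ulimit -s             # confirm the new limit
```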


> If I remember right your site is processing a bunch of tiny files.   My
> thinking is that your unique load footprint could cause the innfeed to get to
> the 8meg of stack under the extreme condition of the destination host being
> always unresponsive.
> 

We are now handling very large volumes of files of all sizes.  (I 
recently added a new data stream with files of sizes that are, at times, 
greater than 20MB - love the way INN just handles that!)

I'm guessing we're handling a peak hourly volume of roughly 3GB, an 
average hourly volume of 2.1GB, and about 120,000 articles/hour.  That 
would be quite a backlog for bigbird after it's 
been down for a while.  I think I started seeing this problem after it 
had been down for almost two weeks.

I'm not sure if our load footprint is more or less unique than some of 
the top Usenet sites.


> You could set up a separate innfeed process for bigbird.  That way the others
> would not restart.
> 

That's an interesting idea.  One thing, though, is that all the crashing 
was hindering innd enough that article handling was noticeably impacted. 
Even if that process were only feeding one site, the excessive restart 
overhead might be too much.


> If the stacksize limit is the culprit we might have to add a bit of code to
> innfeed to kick up the limit.  Another trick would be to set the stacksize
> to unlimited in the rc.news script, i.e.:
> 
> ##  Start the show.
> echo 'Starting innd.'
> limit stacksize unlimited
> eval ${WHAT} ${RFLAG} ${INNFLAGS}
> 
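
(Side note: "limit stacksize unlimited" is csh syntax; if rc.news runs 
under /bin/sh, as ours does, I'd expect the equivalent fragment to be 
ulimit, something like this -- untested:)

```shell
##  Start the show.
echo 'Starting innd.'
ulimit -s unlimited
eval ${WHAT} ${RFLAG} ${INNFLAGS}
```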

Well, for our purposes it doesn't make sense to actually try to relay 
the entire backlog to bigbird (besides the fact that all those articles 
are long gone out of the cycbuff).  Our foremost goal is near-real-time 
delivery, so trying to push (as opposed to pull) articles that old is 
pointless.**

I'm not sure how to best handle this situation.  It's rare for one of 
our machines to be down so long, but, obviously, it happens.  Could 
logic be added to innfeed to determine the length of time a site has 
been down and deal with the backlog in a better way?  For now, I'm 
just going to try to remember that if a site has been down a long time 
we should turn off the relay to it.
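
(To make that concrete, the manual procedure I have in mind is roughly 
the following; the site name and spool path are just examples for our 
setup:)

```shell
# 1. Comment out bigbird's entry in the newsfeeds file, then:
#      ctlinnd reload newsfeeds 'dropping bigbird while it is down'
# 2. Discard whatever backlog innfeed has queued for it, e.g.:
#      rm -f /usr/local/news/spool/innfeed/bigbird*
```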

**This brings me to a feature that I would like to see in INN.  That is, 
I'd like to be able to specify an article age beyond which the server 
would not push the article (as opposed to artcutoff on the receiving 
side).  Have I missed something?


> This is just a guess and the core file (if there is one) backtrace would let
> us make a better guess.
> 

If I get another, I'll run gdb on it and see what I get.  Though if I'm 
lucky, this won't happen again for a long time!

Thanks, Mark!!

Anne
-- 
***************************************************
Anne Wilson			UCAR Unidata Program		
anne at unidata.ucar.edu		       P.O. Box 3000
               			  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://my.unidata.ucar.edu/
****************************************************


