Another sort of interesting/sort of sad failure mode

Sat May 18 01:31:14 UTC 2002

Hi,

Running INN 2.4.0 (20020315 prerelease) with CNFS/tradover, I recently 
had the overviews partition fill up... in typical "let's pound on it with 
a hammer and see what happens" fashion the thought came to mind that one 
could cd down into the overviews directory and then do a find . -size 
+<value>c -exec rm {} \; to clean up the worst of the bloated overview 
files, and then soldier on (albeit with a loss of article data for those 
particular groups). Empirically, however, that appears to be incorrect. :-)
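
In hindsight, a dry run that only lists the candidates would have been the
gentler first step. A minimal sketch, assuming a POSIX find and overview
paths without embedded spaces; the function name and the threshold in the
usage line are made up for illustration and are not part of INN:

```shell
# list_big_overview DIR MIN_BYTES
# Print "size path" for every regular file under DIR that is larger than
# MIN_BYTES bytes, biggest first. Read-only: nothing is removed.
# (Hypothetical helper name; not part of INN.)
list_big_overview() {
    dir=$1
    min_bytes=$2
    # -size +Nc matches files of more than N bytes; field 5 of "ls -l"
    # is the size and the last field is the path.
    find "$dir" -type f -size +"${min_bytes}c" -exec ls -l {} \; |
        awk '{ print $5, $NF }' |
        sort -rn
}
```

Something like list_big_overview . 10000000 | head -20, run from the
overviews directory, would then show the worst offenders before anything
gets deleted.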

Having done my find-and-rm, I appear to have uncovered an interesting 
situation where INN semi-starts, but then just sits and spins at 99.9% of 
the processor capacity on one (of two) processors... looking at the logs, 
it appears to successfully do incoming.conf name resolution, but then 
just hangs, probably at the point one or more of the box's feeds tries
to hand it an article. 

Attaching strace -c -d -f -F -p <pid> to the magic INN process shows 
that it is, sure enough, stuck sitting in a wait state:

  [wait(0x137f) = 912]
 pid 912 stopped, [SIGSTOP]

1) My first working hypothesis was that my brute force find-and-rm cleanup 
on the overviews partition resulted in deletion of one or more large 
overview .DAT files, but did not always also remove the corresponding 
(but typically smaller) .IDX file, a hypothesis which appears to be 
correct:

  % ls -laR | grep DAT | wc -l
  26008
  % ls -laR | grep IDX | wc -l
  26414

One or more of those 406 "extra" IDX files (or 406 "missing" DAT files) 
might then have resulted in a blocking condition at startup.

Unfortunately, removing the corresponding "extra" IDX files did not
correct the observed "just sit and spin" behavior. 
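
The counts above only show that a mismatch exists. A rough way to see
which files are unpaired, assuming each group's data and index files sit
side by side and differ only in the .DAT/.IDX suffix (an assumption based
on the listing above, not checked against the tradover source; the
function name is likewise made up):

```shell
# list_orphans DIR
# Print every .IDX file under DIR with no sibling .DAT, and every .DAT
# file with no sibling .IDX. Read-only: nothing is removed.
# (Hypothetical helper name; not part of INN.)
list_orphans() {
    find "$1" -type f \( -name '*.IDX' -o -name '*.DAT' \) | sort |
    while IFS= read -r f; do
        # Work out the name of the file that should accompany this one.
        case $f in
            *.IDX) mate=${f%.IDX}.DAT ;;
            *.DAT) mate=${f%.DAT}.IDX ;;
        esac
        [ -e "$mate" ] || printf '%s\n' "$f"
    done
}
```

Running list_orphans . from the overviews directory lists the unpaired
files so they can be reviewed before anything is removed.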

2) On to hypothesis B: group.index appeared to be rather small (just a 
couple hundred K; no doubt it got zing'd by my none-too-discriminating 
find-and-rm project). 

Renaming that group.index to group.other, and then letting group.index 
rebuild, however, appears to have been just what the doctor ordered -- 
or at least the server stopped its sit-and-spin behavior and began
to accept articles again. 

There is, however, still an irritating bit of behavior in that if you 
connect via trn to one of the groups that got the find-and-rm massage, 
trn does: 

Caught a SIGSEGV--.newsrc restored
Abort (core dumped)

but then if you restart trn, it rolls right along... I haven't had a chance
to look at this further yet. 

Thoughts for a Friday:

-- better overview management tools continue to be something that would
   be REALLY nice to have available

-- the existing tradover code should be more defensive with respect
   to handling potentially incoherent overview state

-- nnrpd is doing something which trn obviously believes to be less than
   sane, possibly associated with differing expectations between active
   and overviews for groups that have been massaged. Less than sane 
   state should result in something more graceful than a SIGSEGV from
   trn (but nnrpd shouldn't get that crazy anyhow).

Regards,

Joe