Another sort of interesting/sort of sad failure mode
Joe St Sauver
JOE at OREGON.UOREGON.EDU
Sat May 18 01:31:14 UTC 2002
Hi,
Running INN 2.4.0 (20020315 prerelease) with CNFS/tradover, I recently
had the overviews partition fill up... in typical "let's pound on it with
a hammer and see what happens" fashion the thought came to mind that one
could cd down into the overviews directory and then do a find . -size
+<value>c -exec rm {} \; to clean up the worst of the bloated overview
files, and then soldier on (albeit with a loss of article data for those
particular groups). Empirically, however, that appears to be incorrect. :-)
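A gentler first pass is to list the oversized files without removing anything, so the candidates can be reviewed before any rm. The following is a minimal sketch in Python, not an INN tool: the function name list_oversized and its arguments are made up for illustration, and it mirrors only the "larger than N bytes" test of find -size +Nc.

```python
import os
import stat


def list_oversized(root, min_bytes):
    """Return (size, path) pairs for regular files under root whose size
    is strictly greater than min_bytes, largest first. Nothing is deleted."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished while we were scanning
            # Only consider regular files; skip symlinks, sockets, etc.
            if stat.S_ISREG(st.st_mode) and st.st_size > min_bytes:
                hits.append((st.st_size, path))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    import sys

    # Hypothetical usage: script.py <overview-dir> <min-bytes>
    for size, path in list_oversized(sys.argv[1], int(sys.argv[2])):
        print(size, path)
```

Because this only prints, it is safe to run against a live overview tree; whether removing any of the listed files is safe is a separate question, as the rest of this message shows.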
Having done my find-and-rm, I appear to have uncovered an interesting
situation where INN semi-starts, but then just sits and spins at 99.9% of
the processor capacity on one (of two) processors... looking at the logs,
it appears to successfully do incoming.conf name resolution, but then
just hangs, probably at the point one or more of the box's feeds tries
to hand it an article.
Running strace -c -d -f -F -p on the magic INN process, the process
appears, sure enough, to be stuck sitting in a wait state:
[wait(0x137f) = 912]
pid 912 stopped, [SIGSTOP]
1) My first working hypothesis was that my brute force find-and-rm cleanup
on the overviews partition resulted in deletion of one or more large
overview .DAT files, but did not also always result in the removal of the
corresponding (but typically smaller) .IDX files, a hypothesis which appears
to be correct:
% ls -laR | grep DAT | wc -l
26008
% ls -laR | grep IDX | wc -l
26414
One or more of those 406 "extra" IDX files (or 406 "missing" DAT files)
might then have resulted in a blocking condition at startup.
Unfortunately, removing the corresponding "extra" IDX files did not
correct the observed "just sit and spin" behavior.
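Counting with ls and grep shows that a mismatch exists, but not which files are involved. A minimal sketch in Python of one way to name the unpaired files follows. It assumes that each group's .DAT and .IDX files share a base name and sit in the same directory; the function name unpaired_overview_files is hypothetical and is not part of INN.

```python
import os


def unpaired_overview_files(root):
    """Walk root and return two sorted lists of paths:
    .IDX files with no sibling .DAT, and .DAT files with no sibling .IDX.
    Assumes a pair shares its base name and directory."""
    idx_without_dat = []
    dat_without_idx = []
    for dirpath, _dirnames, filenames in os.walk(root):
        names = set(filenames)
        for name in filenames:
            # splitext splits at the last dot, so dotted group names
            # such as "alt.foo.bar.IDX" keep their full base name.
            base, ext = os.path.splitext(name)
            if ext == ".IDX" and base + ".DAT" not in names:
                idx_without_dat.append(os.path.join(dirpath, name))
            elif ext == ".DAT" and base + ".IDX" not in names:
                dat_without_idx.append(os.path.join(dirpath, name))
    return sorted(idx_without_dat), sorted(dat_without_idx)


if __name__ == "__main__":
    import sys

    # Hypothetical usage: script.py <overview-dir>
    orphan_idx, orphan_dat = unpaired_overview_files(sys.argv[1])
    for path in orphan_idx:
        print("IDX without DAT:", path)
    for path in orphan_dat:
        print("DAT without IDX:", path)
```

The sketch only reports; it does not remove anything.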
2) On to hypothesis B: group.index appeared to be rather small (just a
couple hundred K, no doubt it got zing'd by my none-too-discriminating
find-and-rm project).
Renaming that group.index to group.other, and then letting group.index
rebuild, however, appears to have been just what the doctor ordered --
or at least the server stopped doing the sit and spin problem, and began
to accept articles again.
There is, however, still an irritating bit of behavior in that if you
connect via trn to one of the groups that got the find-and-rm massage,
trn does:
Caught a SIGSEGV--.newsrc restored
Abort (core dumped)
but then if you restart trn, it rolls right along... I haven't had a chance
to look at this further yet.
Thoughts for a Friday:
-- better overview management tools continue to be something that would
be REALLY nice to have available
-- the existing tradover code should be more defensive with respect
to handling potentially incoherent overview state
-- nnrpd is doing something which trn obviously believes to be less than
sane, possibly associated with differing expectations between active
and overviews for groups that have been massaged. Less than sane
state should result in something more graceful than a SIGSEGV from
trn (but nnrpd shouldn't get that crazy anyhow).
Regards,
Joe