Natterings about history files

Sat Feb 24 23:32:04 UTC 2001

Fabien Tassin <fta at sofaraway.org> writes:
> According to Bill Davidsen:

>> My view is that trying to eliminate the performance hit of expire is
>> like trying to improve the code in a bubble sort. No matter how
>> cleverly you code it, it is still not efficient.

Well, you still have to perform the functionality of expire in one place
or another, unless I'm missing something major.

> I don't want to try to improve the code of expire but rather change the
> way entries pass from the exist state to the remembered state, including
> article removal. It can be done without changing a single line in
> expire.  If article removal can be done on the fly, the daily task will
> only have to rewrite the history files without too old remembered
> entries.

But that's still expire; you're just doing the same thing continuously
throughout the day that INN is currently doing in batch at night.  It's
not clear to me that that's a gain from the performance perspective,
although it's definitely a gain for a few things from a quality of
implementation perspective (it would cut down on expired articles showing
up in CNFS groups in overview, for example).

Oh, BTW, note that for a transit-only server you don't need to store *any*
information about articles in history other than the fact that you've seen
the message ID.  You don't have to support article retrieval at all, so
it's okay for articles to be identified solely by storage token.  That
simplifies the history file issues dramatically.
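
To make the transit-only case concrete, here is a minimal sketch of a
history that remembers nothing except whether a message ID has been seen.
The class name TransitHistory, the method offer(), and the choice of a
truncated SHA-256 as the fixed-size key are all assumptions made for
illustration, not anything INN does today.

```python
import hashlib


class TransitHistory:
    """Hypothetical seen-only history for a transit-only server."""

    def __init__(self):
        # Only fixed-size hashes of message IDs are kept: no storage
        # tokens, arrival times, or newsgroup information.
        self._seen = set()

    @staticmethod
    def _key(message_id):
        # Assumption: any fixed-size hash would do; 16 bytes of SHA-256
        # is used here purely for convenience.
        return hashlib.sha256(message_id.encode("utf-8")).digest()[:16]

    def offer(self, message_id):
        """Return True and record the ID if it is new, else return False."""
        key = self._key(message_id)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

Since nothing ever looks an article up by message ID, the only operation
needed is "have I seen this?", which is why the record can shrink to a
bare hash.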

>> Your comment about CNFS type performance is sort of on the track of
>> what I was thinking, but how about an expire daemon which simply trims
>> old info as it reaches /remember/ time? Going to history per day would
>> make that quite easy, going to a database (and based on how well db
>> hasn't worked for many people, some *other* database) setup would allow
>> deletion on the fly.

Note that we *have* a database already, and with some minor changes we can
support deletion on the fly.  But I'm really worried about write
performance.

As near as I can tell, you want all of your history writes either to be
buffered or to go to memory-mapped memory that's only periodically
flushed.  Otherwise, people see the history write problems that they're
seeing now (where the writes to history.index are completely unbuffered
and have bad locality of reference).  That drastically limits the viable
history designs.
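
As a rough illustration of the buffered option, the sketch below collects
history entries in memory and writes them out in one batch once a
threshold is reached.  The class name BufferedHistoryWriter, the
flush_every parameter, and the one-line-per-entry format are hypothetical;
this shows the general idea only, not how INN writes history.

```python
class BufferedHistoryWriter:
    """Hypothetical writer that batches history appends."""

    def __init__(self, path, flush_every=1000):
        self._file = open(path, "ab")
        self._pending = []
        self._flush_every = flush_every
        self.flush_count = 0  # number of batches actually written

    def add(self, line):
        # Queue the entry in memory; only write once enough have piled up.
        self._pending.append(line.encode("utf-8") + b"\n")
        if len(self._pending) >= self._flush_every:
            self.flush()

    def flush(self):
        # Write all queued entries in a single sequential append.
        if not self._pending:
            return
        self._file.write(b"".join(self._pending))
        self._file.flush()
        self._pending.clear()
        self.flush_count += 1

    def close(self):
        self.flush()
        self._file.close()
```

The trade-off is the usual one: entries still sitting in the buffer are
lost if the server dies before the next flush, so the flush interval
bounds how much history can go missing.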

The main problem with expire performance right now is due to the fact that
we're doing roughly the same work twice; first, expireover checks to see
if each article exists when doing its run, and then expire does the same
thing.  I've been trying to think of some better approaches; one idea
would be to have expireover generate a hash of articles that were removed,
and then expire could just query that.  (That would require running
expireover first.)  Another approach would be for expire to do the checks
to see if the articles still exist and for expireover to just query
history for every article, additionally rewriting history entries in
place when it removes articles (for group-based expire).
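
The first idea might look something like the sketch below: expireover
does the existence checks once and hands expire a set of removed storage
tokens.  The function names, the dict-based overview and history, and the
article_exists callback are all stand-ins invented for the example; the
real programs work on files, not in-memory dicts.

```python
def run_expireover(overview, article_exists):
    """Drop overview entries for missing articles; return removed tokens.

    overview maps group name -> list of storage tokens (hypothetical layout).
    """
    removed = set()
    for group, tokens in overview.items():
        kept = []
        for token in tokens:
            if article_exists(token):
                kept.append(token)
            else:
                removed.add(token)
        overview[group] = kept
    return removed


def run_expire(history, removed):
    """Rewrite history using the removed set instead of re-checking storage.

    history maps message ID -> storage token, or None if only remembered.
    """
    new_history = {}
    for message_id, token in history.items():
        if token in removed:
            new_history[message_id] = None  # remembered, article is gone
        else:
            new_history[message_id] = token
    return new_history
```

Note that this only catches articles that expireover actually looked at;
anything in history without an overview entry would still need its own
check.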

-- 
Russ Allbery (rra at stanford.edu)             <http://www.eyrie.org/~eagle/>