Natterings about history files

Sat Mar 3 11:29:54 UTC 2001

Bill Davidsen <davidsen at tmr.com> writes:

> I think the idea of circular history is impractical for the reasons
> given. However, steady state history, spreading the work and impact of
> expire over a day is certainly a desirable goal. News feed and readers
> are more and more a 24 hour effort; there is no good time to run expire
> in terms of load, and no good time to force a bandwidth spike to
> catch up.

I'm really unconvinced that this is true for most sites.  It *definitely*
isn't true for Stanford, or for most other small or medium-sized sites.
I'm unsure that it's even true for ISPs, but I suppose you'd know better
than I.

> And while disk is cheap, building a copy of the database on the fly is
> still a lot of i/o.

So's doing it throughout the entire day.  The total I/O cannot be *more*
by doing it at night; it's almost certainly *less*.

> Russ, you can make the argument that batch processing is more efficient
> than timesharing, too. But don't expect anyone to go back to handing
> the operators a deck of punch cards (in the figurative sense, for
> sure).

That's a really bad analogy for a whole bunch of reasons.

First off, batch processing *is* more efficient; ask anyone who's tried to
run a compute server.  Most of us running university systems have at least
tried at some point to set up a batch compute job system because it's
rather staggering how much more work the expensive hardware can get done
if you pipeline jobs rather than slam them all into the processor at the
same time.  There have been Usenix papers about this, as I recall.  If you
try to run more jobs than CPUs, you end up thrashing your processor cache,
or even swapping jobs out to disk, and *everything* ends up taking longer.
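
As a rough illustration of that claim, here is a toy single-CPU model in
Python.  It is only a sketch: the function names, the quantum, and the
switch_cost parameter are all made up for this example, and switch_cost is
a crude stand-in for the cache refills and swapping described above rather
than a measurement of anything.

```python
def run_batch(jobs):
    """Run jobs one after another; return (finish times, total time)."""
    clock = 0.0
    finish = []
    for work in jobs:
        clock += work
        finish.append(clock)
    return finish, clock


def run_timeshared(jobs, quantum=1.0, switch_cost=0.1):
    """Round-robin on one CPU, paying switch_cost (an assumed, made-up
    penalty) every time the CPU moves from one job to a different one."""
    remaining = list(jobs)
    finish = [0.0] * len(jobs)
    clock = 0.0
    last = None
    while any(r > 0 for r in remaining):
        for i, r in enumerate(remaining):
            if r <= 0:
                continue
            if last is not None and last != i:
                clock += switch_cost  # cost of evicting the previous job
            step = min(quantum, r)
            clock += step
            remaining[i] = r - step
            last = i
            if remaining[i] <= 0:
                finish[i] = clock
    return finish, clock


if __name__ == "__main__":
    jobs = [10.0, 10.0, 10.0, 10.0]
    print("batch:     ", run_batch(jobs))
    print("timeshared:", run_timeshared(jobs))
```

With four equal jobs of 10 units each, the batch run is done at 40 units,
while the round-robin run pays 39 switches and is done at roughly 43.9.
In this particular input, even the first timeshared job to finish does so
after the last batch job has already completed.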

Second, news expiration isn't batch processing; there aren't users waiting
for the results.  It's something that runs on the server that should be
pretty much transparent to the users under most circumstances.

Third, it *is* more efficient on disk I/O to do it all at the same time
rather than spreading out small reads and writes throughout the day and
mixed into your normal news traffic so that your disk cache isn't as
useful.  I think that's pretty obvious.
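
A small Python sketch of the disk I/O argument, under loudly stated
assumptions: this is not how INN's expire or its history file actually
work.  It only counts how many distinct fixed-size pages get touched when
a set of scattered entries is removed in one pass versus many small
passes, and it assumes that no page survives in the disk cache between
passes (because normal news traffic has pushed it out).  The names
pages_touched and io_cost and the 4096-byte page size are invented for the
example.

```python
import random


def pages_touched(offsets, page_size=4096):
    """Set of page numbers that contain at least one of the offsets."""
    return {off // page_size for off in offsets}


def io_cost(expired_offsets, runs, page_size=4096):
    """Total page rewrites if the expired entries are removed in `runs`
    separate passes, assuming nothing stays cached between passes."""
    if not expired_offsets:
        return 0
    chunk = -(-len(expired_offsets) // runs)  # ceiling division
    total = 0
    for start in range(0, len(expired_offsets), chunk):
        batch = expired_offsets[start:start + chunk]
        total += len(pages_touched(batch, page_size))
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    # 5000 expired entries scattered over a hypothetical 1 MB file.
    offsets = rng.sample(range(1_000_000), 5000)
    print("one nightly pass:", io_cost(offsets, runs=1))
    print("24 hourly passes:", io_cost(offsets, runs=24))
```

A single pass can never cost more in this model, since the pages touched
by the one big pass are the union of the pages touched by the small ones,
and a union is never larger than the sum of its parts.  How much of that
gap shows up on a real server depends on how well the cache holds up
during the day, which the model deliberately assumes away.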

> And for folks running article per file, there's the issue that they get
> a lot of inodes in use up until expire, and then a sparse deletion. This
> has performance implications which depend on the underlying o/s, but are
> never a plus that I can see.

I can see places where spreading some portions of expire out is better,
and I think INN should support that.  And you may be correct that
spreading out all of expire makes for a cleaner and easier architecture to
maintain.  But I don't think it's as obvious a win as you think.

> I care about this only for that reason. I really would like to see a
> better mechanism for removing entries, and having a standard way to
> interface would make that easier. CNFS has made the article part of
> expire steady state, I'd like to do that for history as well, because I
> don't see what we have as scaling.

Sure, it should be possible for people to try stuff out.  I think you're
fooling yourself if you think that steady-state expiration is going to buy
you any scaling over nightly expire given the same basic underlying
structures, though.  (In other words, different structures will give you
better performance, but those performance gains would be realized by the
same structure using a nightly expire too.)

-- 
Russ Allbery (rra at stanford.edu)             <http://www.eyrie.org/~eagle/>