Natterings about history files
Bill Davidsen
davidsen at tmr.com
Mon Feb 26 11:51:17 UTC 2001
On 24 Feb 2001, Russ Allbery wrote:
>
> Fabien Tassin <fta at sofaraway.org> writes:
> > According to Russ Allbery:
>
> >> Unfortunately, if I understand how this works correctly, we can't use a
> >> perfect hash function even for old history files, since the point of a
> >> perfect hash function is that you'll only ever look up keys that you
> >> fed to your function generator when you made it. Since one of the main
> >> purposes (and the most common queries) for old history entries is
> >> whether one has an article that one *doesn't* have, this doesn't work
> >> very well. We'd be looking up a whole bunch of keys that we didn't
> >> give to the perfect hash generator, and which therefore would likely
> >> collide left and right in unuseful ways.
>
> > Do you have an idea of the collision rate in the current model?
> > I assume it is quite high, unfortunately. In the past, Matt Dillon did
> > this exercise for his Diablo history system, but I can't lay my hands
> > on his report.
>
> I don't have a firm idea, no; bear in mind that as our history format is
> currently designed, a lot of collisions are to be expected, since we're
> using a table that's 2/3rds full. (Those are collisions when hashing
> into the table; we have a separate problem of multiple message-IDs
> hashing to the same value, and I'm worried about that, since we're not
> using all of the MD5 hash as we really should.)
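To put rough numbers on the two problems Russ distinguishes, here is a back-of-the-envelope sketch: the expected probe count for a miss in an open-addressing table at a given load factor (the standard uniform-hashing approximation), and a birthday-bound estimate of message-ID hash collisions. The truncated bit counts below are illustrative assumptions, not INN's actual choice.

```python
import math

def expected_probes_miss(load):
    """Expected probes for an unsuccessful lookup in an open-addressing
    hash table at the given load factor (uniform-hashing approximation:
    1 / (1 - load))."""
    return 1.0 / (1.0 - load)

def collision_probability(n, bits):
    """Birthday-bound estimate of the chance that at least two of n
    distinct message-IDs share the same bits-wide hash value:
    approximately 1 - exp(-n^2 / 2^(bits+1))."""
    return 1.0 - math.exp(-n * n / float(2 ** (bits + 1)))

# At 2/3 full, each unsuccessful lookup costs about three probes on
# average -- one reason lookups for articles we *don't* have hurt.
print(expected_probes_miss(2.0 / 3.0))

# A full MD5 digest is 128 bits; smaller values are hypothetical
# truncations, shown only to illustrate how fast collisions appear.
for bits in (32, 64, 128):
    print(bits, collision_probability(10_000_000, bits))
```

With ten million stored message-IDs, a 32-bit truncation makes a collision a near-certainty, while the full 128 bits makes it negligible, which is the substance of the worry about not using all of MD5.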
I think the idea of circular history is impractical for the reasons
given. However, steady-state history, spreading the work and impact of
expire over the day, is certainly a desirable goal. News feeding and
reading are more and more a 24-hour effort; there is no good time to run
expire in terms of load, and no good time to force a bandwidth spike to
catch up. And while disk is cheap, building a copy of the database on the
fly is still a lot of I/O.
> What do you mean by "expired"? (I'm not at all sure that just spreading
> the work that expire does throughout the day is a good idea. You then
> have to deal with locking problems on top of everything else, and you
> can't batch your writes for better performance so it would be even
> slower.)
Russ, you can make the argument that batch processing is more efficient
than timesharing, too. But don't expect anyone to go back to handing
the operators decks of punch cards (in the figurative sense, for sure).
And for folks running one article per file, there's the issue that they
get a lot of inodes in use up until expire, and then a sparse deletion.
This has performance implications which depend on the underlying OS, but
they're never a plus that I can see.
> > are your keys HASH or message-id ?
>
> Message ID. The details of the hash should be internal to the history
> implementation.
I like that. The whole thing should be a black box.
> That's the goal. One should be able to drop a Berkeley DB history
> implementation in place if one wants to try it without having the rest of
> INN care.
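The "black box" idea Russ describes can be sketched as a minimal interface, with the hash and storage hidden behind it. This is an illustration only; the class and method names here are mine, not INN's actual API.

```python
from typing import Optional

class HistoryStore:
    """Hypothetical black-box history interface: callers speak only in
    message-IDs; hashing and storage stay internal to the backend."""

    def add(self, message_id: str, token: str) -> None:
        raise NotImplementedError

    def lookup(self, message_id: str) -> Optional[str]:
        """Return the stored token, or None for an article we don't have
        -- the most common query against old history entries."""
        raise NotImplementedError

class DictHistory(HistoryStore):
    """Trivial in-memory backend. A Berkeley DB backend could be swapped
    in behind the same interface without the callers noticing."""

    def __init__(self) -> None:
        self._table = {}

    def add(self, message_id: str, token: str) -> None:
        self._table[message_id] = token

    def lookup(self, message_id: str) -> Optional[str]:
        return self._table.get(message_id)

h = DictHistory()
h.add("<1234@example.com>", "@token@")
print(h.lookup("<1234@example.com>"))    # -> @token@
print(h.lookup("<absent@example.com>"))  # -> None
```

The point of the abstraction is exactly what the thread asks for: the rest of the server depends only on the interface, so a different storage or expiry strategy is a drop-in change.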
I care about this only for that reason. I really would like to see a
better mechanism for removing entries, and having a standard way to
interface would make that easier. CNFS has made the article part of expire
steady state; I'd like to do that for history as well, because I don't
see what we have now scaling.
--
bill davidsen <davidsen at tmr.com>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.