Natterings about history files

Russ Allbery rra at stanford.edu
Sat Mar 3 11:42:55 UTC 2001


This is a *really* good idea.  In several months of thinking about this
problem, I hadn't considered using the Date header as a hint to help with
multiple history files.  Cool!  An entirely new technique to throw into
the mix.

Forrest J Cavalier <mibsoft at epix.net> writes:

> Rolling history files are very nice, except the downside pointed out by
> Russ.  But I think that can be solved this way:

>   1. Keep an N hour cache of articles offered/accepted.  (N==4
>      should be about right, I think, but it can be larger.)

>   2. When offered, check ONLY the cache, not the history file
>      for that message ID.  If not found, tell the peer to
>      send it.  (HISlookup times will be VERY short.)

> You say: "Wait!  That causes duplicates!"

> Hold on there, I'm not finished....

>   3. When you finally do get the article, get the date header.  If
>      it was in the last N hours, you presume the lookup from
>      the cache was correct.  Store it.  
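
In sketch form, the offer-time path is just a cache probe.  All of the
names below are hypothetical, not INN's actual internals:

    /* Minimal sketch of the proposed offer-time check: consult only a
     * small in-memory cache of recently seen message IDs, never the
     * on-disk history files. */

    #include <stdbool.h>
    #include <stddef.h>
    #include <time.h>

    #define CACHE_HOURS 4               /* N above */

    struct cache_entry {
        char msgid[256];                /* message ID */
        time_t seen;                    /* when it was offered/accepted */
    };

    /* Hypothetical: returns NULL when the ID is not in the cache. */
    extern struct cache_entry *cache_find(const char *msgid);

    /* True if we should tell the peer to send the article. */
    bool
    want_article(const char *msgid, time_t now)
    {
        struct cache_entry *entry = cache_find(msgid);

        /* A cache miss means "send it", even though the article might
         * be recorded in an older history file. */
        return entry == NULL || now - entry->seen > CACHE_HOURS * 3600;
    }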

The drawback here is that you're assuming a strong correlation between
the article's Date header and its arrival time.  That assumption fails
for articles dated into the future, and the failure mode is accepting a
duplicate.

In other words, suppose you have an article dated 12 hours in the future,
and you keep a 12 hour cache of message IDs.  You accepted the article 18
hours ago, so it has since fallen out of the cache.  You're offered the
article again, and since it's not in the cache you say to send it.  You
accept it, look at the Date header, see that it's only 6 hours in the
past (18 hours have passed since an arrival that was dated 12 hours
ahead), assume the cache is correct, and accept the duplicate.

The solution may be to just adjust INN's date fuzz so that you keep the
cache for fuzz + N hours instead of the same N hour threshold you use for
deciding whether the cache is trustworthy.  I'm guessing that INN's
current allowance of a full day in the future is too generous already and
that an hour or two is probably sufficient.
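
As a sketch, the only change is splitting the retention window from the
trust window; the constants below are illustrative, not INN's actual
settings:

    #include <stdbool.h>
    #include <time.h>

    /* Retain cache entries for fuzz + N hours, but only trust a cache
     * miss when the Date header is within the last N hours. */
    #define TRUST_SECS  (4 * 3600)      /* N */
    #define FUZZ_SECS   (2 * 3600)      /* max future-dating accepted */
    #define RETAIN_SECS (TRUST_SECS + FUZZ_SECS)

    /* An article dated FUZZ in the future and accepted T seconds ago
     * shows a Date that looks T - FUZZ old.  It can pass this test
     * only while T <= TRUST + FUZZ, and for exactly that long it is
     * still held by a cache retained for RETAIN_SECS. */
    bool
    trust_cache(time_t date_header, time_t now)
    {
        return now - date_header <= TRUST_SECS;
    }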

Another drawback is that with slow peers you end up accepting more network
traffic, but that's probably a blip.

>      If the article date header shows it is older than N hours, do the
>      full lookup. Check the history file corresponding to that date.
>      (This can be done by a downstream process, actually, since the
>      number of articles you get this way is presumably small.)

Yup.
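
Sketched out, the old-article path looks like the following; one file per
day and the naming scheme are my assumptions:

    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    /* Hypothetical: does this history file record this message ID? */
    extern bool history_file_contains(const char *path, const char *msgid);

    /* Full lookup for an article whose Date header falls outside the
     * cache window: consult only the per-date history file covering
     * that date.  A real implementation would likely also probe the
     * neighboring days to allow for clock skew. */
    bool
    old_article_is_duplicate(const char *msgid, time_t date_header)
    {
        char path[32];

        strftime(path, sizeof(path), "history.%Y%m%d",
                 gmtime(&date_header));
        return history_file_contains(path, msgid);
    }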

> nnrpd lookup by message ID should change to be a look through the
> overview file, and then a look through every per-date file.

How do you do lookups by message ID via overview?

For nnrpd lookup by message ID, just search all of the history files.  I
don't think that's really a speed problem; isn't NEWNEWS the only client
mechanism that does many lookups by message ID?  nnrpd performs that
operation rarely, and searching multiple small history files isn't *that*
slow.  It's just too slow to do for every incoming article.
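
A sketch of that client-side scan, with hypothetical helpers for
enumerating the per-date files:

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical enumeration of the per-date history files. */
    extern size_t history_file_count(void);
    extern const char *history_file_path(size_t i);
    extern bool history_file_contains(const char *path, const char *msgid);

    /* Scan every per-date file.  Fine for the occasional nnrpd
     * request; too slow for every incoming article. */
    bool
    client_lookup(const char *msgid)
    {
        size_t i;

        for (i = 0; i < history_file_count(); i++)
            if (history_file_contains(history_file_path(i), msgid))
                return true;
        return false;
    }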

> To cancel/expire an article, you just write a hole, leaving the file
> offset position of all articles in the history intact.  This means you
> can do on-the-fly expire, instead of one big grind-to-a-crawl nightly
> process.

Yup, it's just another form of HISreplace, but with the designated
"non-existent article" token.

> Eventually, at /remember/ you just delete a per-date history file, as
> long as it is entirely empty.

And optionally you can combine the last few history files if you tend to
have stragglers in them.
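
In sketch form, the /remember/-time cleanup is then trivial, assuming a
hypothetical helper that detects an all-holes file:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical: true when every record in the file is a hole. */
    extern bool history_file_all_holes(const char *path);

    /* A per-date file that is entirely holes can simply be unlinked;
     * nothing references its offsets any longer. */
    void
    maybe_drop_history_file(const char *path)
    {
        if (history_file_all_holes(path))
            remove(path);
    }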

> The downside of having holes in the history file is more required
> storage.  But already INN needs 2X the history file for temp storage
> during expire, so it's a big win.

We also win back a lot of space over the current system by switching to a
binary file format (and gain speed from not having to write as much
data).
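
Purely to illustrate the size difference, a fixed-width binary record
might look something like this; the exact fields are my assumption, not a
settled format:

    #include <stdint.h>

    /* Roughly 40 bytes per article, versus the much longer text line
     * the current history format writes. */
    struct history_rec {
        unsigned char hash[16];     /* hash of the message ID */
        uint32_t arrived;           /* arrival time */
        uint32_t expires;           /* Expires header, or 0 */
        unsigned char token[16];    /* storage token; all zero = hole */
    };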

> For long-lived archives, it still would be possible to rewrite history
> files and reclaim space.  IMPORTANT: dbz is not going to work for this.
> Deleting a key from that kind of hashed index (linear probing on
> collision insert!  blech!) is not acceptable, it breaks a data structure
> invariant condition.

Actually, I don't see any reason why dbz wouldn't work.  You never delete
keys, you only delete storage API tokens (which are part of the data).
Deleting a key would mean dropping memory of a message ID, and that's done
by either deleting an entire old history file or by building a new history
file out of the contents of several old ones.
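
As a sketch, with hypothetical wrappers standing in for the real dbz
calls:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical wrappers; not dbz's real API. */
    extern long dbz_offset_for(const char *msgid);    /* -1 if unknown */
    extern long token_offset(long record_offset);
    extern int  write_hole(FILE *hist, long token_off);

    /* Expire the article's storage but keep remembering its message
     * ID: the dbz key stays put, so later offers are still refused,
     * and the hash table's probe chains are never disturbed. */
    bool
    expire_article(FILE *hist, const char *msgid)
    {
        long rec = dbz_offset_for(msgid);

        if (rec < 0)
            return false;                 /* never saw this article */
        return write_hole(hist, token_offset(rec)) == 0;
    }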

-- 
Russ Allbery (rra at stanford.edu)             <http://www.eyrie.org/~eagle/>

