Natterings about history files

Fri Mar 2 04:56:34 UTC 2001

In article <200102282219.f1SMJap24722 at bean.epix.net>,
Forrest J. Cavalier III <forrest at mibsoftware.com> wrote:
| 
| While people are throwing out ideas, I'll throw in
| mine.  (I'd love to see someone fund me doing it.  Hint.
| Hint.  This is the kind of contract work I do.)
| -----------------------------------------------------------
| 
| Rolling history files are very nice, except the downside
| pointed out by Russ.  But I think that can be solved
| this way:
| 
|   1. Keep an N hour cache of articles offered/accepted.  (N==4
|      should be about right, I think, but it can be larger.)
| 
|   2. When offered, check ONLY the cache, not the history file
|      for that message ID.  If not found, tell the peer to
|      send it.  (HISlookup times will be VERY short.)
| 
| You say: "Wait!  That causes duplicates!"
| 
| Hold on there, I'm not finished....
| 
|   3. When you finally do get the article, get the date header.  If
|      it was in the last N hours, you presume the lookup from
|      the cache was correct.  Store it.  
| 
|      If the article date header shows it is older than N hours,
|      do the full lookup. Check the history file corresponding to
|      that date.  (This can be done by a downstream process,
|      actually, since the number of articles you get this way
|      is presumably small.)

  No, you've got it and didn't say it! When something rolls out of the
"recent" history, you store the info in bins by posting time (and I
don't think it matters if it's wrong by a bit; recent is by arrival).
Now you don't have to do a full lookup; you only have to look in the bin
which matches the Date: header.
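
  A rough sketch of that lookup, in Python for brevity. Everything named
here (RollingHistory, bin_key, one bin per UTC day, in-memory dicts
instead of files) is an assumption for illustration, not anything in INN:

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

WINDOW = timedelta(hours=4)  # the "N hour" recent window; N==4 as suggested above


def bin_key(posted):
    """Name of the bin for a posting time (one bin per UTC day, an assumption)."""
    return posted.astimezone(timezone.utc).strftime("%Y%m%d")


class RollingHistory:
    def __init__(self):
        self.recent = {}  # message-id -> (arrival time, posting time)
        self.bins = {}    # bin key -> set of message-ids

    def offered(self, msgid):
        """On offer, check ONLY the recent cache. True means 'send it'."""
        return msgid not in self.recent

    def received(self, msgid, date_header, now):
        """Once the article is in hand, decide whether to store it."""
        posted = parsedate_to_datetime(date_header)
        if posted.tzinfo is None:  # a "-0000" zone parses as naive; treat as UTC
            posted = posted.replace(tzinfo=timezone.utc)
        if msgid in self.recent:
            return False
        # Only articles older than the window need the bin lookup, and
        # only the one bin matching the Date: header is consulted.
        if now - posted > WINDOW and msgid in self.bins.get(bin_key(posted), ()):
            return False
        self.recent[msgid] = (now, posted)
        return True

    def roll(self, now):
        """Move entries that arrived more than WINDOW ago into date bins."""
        for msgid, (arrived, posted) in list(self.recent.items()):
            if now - arrived > WINDOW:
                self.bins.setdefault(bin_key(posted), set()).add(msgid)
                del self.recent[msgid]


if __name__ == "__main__":
    h = RollingHistory()
    now = datetime(2001, 3, 2, 4, 0, tzinfo=timezone.utc)
    print(h.received("<a@example>", "Fri, 02 Mar 2001 03:00:00 +0000", now))
```

Note that an entry is rolled by arrival time but binned by posting time,
so the later lookup only needs the Date: header to find the right bin.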

	[... snip ...]

The below brings out some issues about expire which fit in with
what F. said, but not with what I just said. But if the files are small
you can do micro expires every now and again. Kind of like a rolling
blackout, I guess: lock the file for a few seconds, which makes the
update a pain, but it should be a few ms on one part of the data.
| ----------------------------------------------------------
| I have done the on-paper planning enough to see that the
| rest of the pieces fall into place.  But I won't explain
| them in detail here.  Just an outline:
| 
| To cancel/expire an article, you just write a hole, leaving
| the file offset position of all articles in the history
| intact.  This means you can do on-the-fly expire, instead of
| one big grind-to-a-crawl nightly process.
| 
| Eventually, at /remember/ you just delete a per-date history
| file, as long as it is entirely empty.  
| 
| If I recall, even the INN 2.2 style overview gets WAY nicer,
| because you have history files that are "addressable" and
| don't change.  But we don't use that overview any more anyway.
| 
| The downside of having holes in the history file is
| more required storage.  But already INN needs 2X the
| history file for temp storage during expire, so it's
| a big win.
| 
| For long-lived archives, it still would be possible to
| rewrite history files and reclaim space.  IMPORTANT: dbz
| is not going to work for this.  Deleting a key from that
| kind of hashed index (linear probing on collision insert!
| blech!) is not acceptable, it breaks a data structure
| invariant condition.

  I really have to think on this. If we can really always look in the
right file, and the files are kept small, you may be able to just
reindex, and the hash would be fewer bits, and all good things would
come of it. I don't think keys need be deleted; the data entry could
just be marked gone-walkabout until the file is cleaned.
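
  To make the "gone-walkabout" idea concrete, here is a toy
linear-probing table where expire marks the slot instead of emptying it,
so later keys in the same probe chain stay reachable. This is not dbz;
the class, the GONE marker and the method names are all made up for
illustration:

```python
EMPTY = None
GONE = object()  # "gone-walkabout" marker: slot was used, entry has expired


class ProbeTable:
    def __init__(self, size=8):
        self.slots = [EMPTY] * size  # fixed size; no resizing in this sketch

    def _probe(self, key):
        """Yield slot indexes in linear-probe order, starting at the hash."""
        start = hash(key) % len(self.slots)
        for i in range(len(self.slots)):
            yield (start + i) % len(self.slots)

    def insert(self, key, offset):
        # GONE slots are deliberately not reused until the table is cleaned.
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY or (slot is not GONE and slot[0] == key):
                self.slots[i] = (key, offset)
                return
        raise RuntimeError("table full; time to clean and reindex")

    def lookup(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return None  # only a truly empty slot ends the probe chain
            if slot is not GONE and slot[0] == key:
                return slot[1]
        return None

    def expire(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return False
            if slot is not GONE and slot[0] == key:
                self.slots[i] = GONE  # mark, don't empty
                return True
        return False

    def cleaned(self):
        """Reindex: build a fresh table holding only the live entries."""
        fresh = ProbeTable(len(self.slots))
        for slot in self.slots:
            if slot is not EMPTY and slot is not GONE:
                fresh.insert(*slot)
        return fresh
```

With small per-date files, the cleaned() pass is the "just reindex" step:
cheap enough to run per file rather than as one big nightly grind.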

  Am I overly optimistic or is there a good idea floating around here?
-- 
bill davidsen <davidsen at tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.