Natterings about history files

Tue Jan 16 20:37:29 UTC 2001

More old e-mail that I'm going through.  This is timely, since I was just
thinking today about history file design.

I've actually been fairly impressed each time I've tried to come up with a
better design of the history file.  I can't see a database format that's
fundamentally better without making other tradeoffs.  Sure, there are a
few other obvious improvements that we can make, but they're not
conceptual ones; the conceptual changes aren't clearly better.

Bill Davidsen <davidsen at tmr.com> writes:

> Russ was talking about multiple history files and an implementation
> conforming to the API. I have a thought on that, which I doubt I'll
> investigate until vacation in October...

> Once upon a time I stumbled on a program called "mph" for 'Make Perfect
> Hash.' The characteristics of a perfect hash are that every lookup will
> be satisfied in exactly one test, with no collisions. Of course this
> needs to be recalculated when the keys change (may need to be).

> If multiple history files were being used, it might be possible to make
> a perfect hash for the old entries, since no new entries would be
> added. The payback for this CPU intensive effort is that I believe there
> would be NO hash table, just a description of the perfect hash
> function. Sounds like a real time and disk space saver.

> Any thoughts on this, or am I remembering it wrong? I was using it to
> find entries on a VERY slow disk subsystem, and it worked really well.

Thanks... that provoked some interesting reading.

Unfortunately, if I understand how this works correctly, we can't use a
perfect hash function even for old history files, since the point of a
perfect hash function is that you'll only ever look up keys that you fed
to your function generator when you made it.  Since one of the main
purposes (and the most common queries) for old history entries is whether
one has an article that one *doesn't* have, this doesn't work very well.
We'd be looking up a whole bunch of keys that we didn't give to the
perfect hash generator, and which therefore would likely collide left and
right in unuseful ways.

It's also not necessarily true that a perfect hash function (at least as
generated by mph) would be a disk-space saver.  Sure, you don't need a
hash table in the current form, but the generated function uses an
internal hash table as part of the algorithm that's linear in the size of
the key space you feed the generator.  So you end up with a large hash
table anyway.

Anyway, one thing that I think might be a win if we stay with the nightly
expire idea for history entries would be separate, daily history files.
The lookup time would be somewhat slower, but it would save a lot of time
with expire since there's no need to even look at the history files that
are younger than /remember/.

The API that I'm currently considering is:

    struct history *HISopen(const char *path, int flags);
    bool HISclose(struct history *);
    bool HISsync(struct history *);

    void HISsetcache(struct history *, size_t size);

    bool HISlookup(struct history *, const char *key, time_t *arrived,
                   time_t *posted, time_t *expires, TOKEN *token);
    bool HIScheck(struct history *, const char *key);

    bool HISwrite(struct history *, const char *key, time_t arrived,
                  time_t posted, time_t expires, const TOKEN *token);
    bool HISreplace(struct history *, const char *key, time_t arrived,
                    time_t posted, time_t expires, const TOKEN *token);

    bool HISexpire(struct history *, time_t threshold,
                   bool (*exists)(TOKEN *));

    struct histstats *HISstats(struct history *);

    const char *HISerror(struct history *);

I want to hide details like the hashing function used inside the history
mechanism so that we can replace them and fiddle with them without having
to worry about the interface.  I'd also like to make the history struct
opaque to all callers.  The HISopen flags would be things like HIS_RDWR,
HIS_READ, HIS_MMAP, HIS_CREATE, and the like and would use the path as the
prefix (so, for example, the standard history file would be at
concat(innconf->pathdb, "/history")).

I think the best organizational structure to use in terms of code is to
make a new directory called history and build a library out of the history
code in there.  libinnhist?  libhistory is too general and would conflict
with part of readline.

-- 
Russ Allbery (rra at stanford.edu)             <http://www.eyrie.org/~eagle/>