Ternary search trees added to TODO
Forrest J. Cavalier III
mibsoft at epix.net
Mon Apr 23 13:13:51 UTC 2001
> makes the performance difference a little harder to judge. Plus
> unsuccessful lookup is frequently much faster, and the operation at each
> character of the key can be quite a bit faster; it depends on whether
> pointer chasing is faster than the mathematical operations the hash
> function is performing.
A lot depends on the data set.
I really don't want to quibble on list. Russ, you are doing great
work!
But I want to point out for future reference, that for INN:
- Most lookups are going to be successful.
- Most unsuccessful newsgroup lookups will match at least
the first few parts of the newsgroup name. (I.e. in the ternary
tree, you will always match "comp." or "alt." plus one or more other
names in the newsgroup hierarchy.) The failure will happen in
the lowest parts of the tree.
(Doing the ternary tree backwards helps here, but seriously
complicates wildmat searches of the tree, if that ever gets
implemented.)
- The ternary tree isn't going to be optimally balanced if you use
an ASCII character as the element. Consider what happens when
you insert a whole hierarchy. Most of the key comparisons match
until you get to the lowest layers of the tree. In effect, the
common newsgroup hierarchies cause the tree to branch less at
the top, and each branch isn't helping you narrow the search space.
- The current hash function is a multiply and add per character.
I haven't looked at the generated code for any CPUs, but I think
it would pipeline pretty nicely, whereas scattered memory
references don't.
- Pointer chasing isn't faster if you are in swap, which can be
helped if the ternary tree implementation uses some kind of
pooled allocation. (But if your INN machine gets into swap, you
have big problems anyway, so it might not be worth the worry.)
I am not familiar enough with caching systems to know what kind
of row size is getting used. I expect that memory buses are
cache filling using at least 4 bytes, and maybe 8 or more bytes
on each read. That helps the hash function more than the
pointer chasing.
I wouldn't bet my life on saying that the differences are clear
either way. It is just that the TODO comment seemed a more
definite on the benefits, when they aren't at all clear to me.
More information about the inn-workers
mailing list