History hash collisions

Forrest J. Cavalier III mibsoft at mibsoftware.com
Fri Nov 12 13:38:53 UTC 1999

Looking at another problem (not Usenet related) I
got to inspecting the Clayton dbz improvements.

I seem to recall discussing it at the time on this list,
but can't remember the details.  I think the discussion
also had applications to the Clayton (or was it Garzik)
pre-commit caches....

Apparently, the non-tagged DBZ code assumes that
the first 6 bytes of the md5 hash are unique for
all articles present in the history file.

If the 6 bytes are not unique, you will not be able to
retrieve an article, and I think you will get a duplicate
hash message during expire.

The old dbz code (and the tagged hash, I think) do
not assume that the hash (then 4 bytes) was unique.  If there
is a duplicate hash, the code checks against the key in the
data file.  

One could erroneously assume that the probability of collision
is simply 2^48, and that present usenet is safe from collisions.

Perhaps you have heard of the "Birthday Problem."  This is the
fact that even in a small set of people (N < 30) there is a very
high probability (>70%) that two of them have the same birthday.

The real probability of a collision of N 6 byte hashes is actually
one minus
         S! / (S^N) / (S-N)!
where S=2^48.

If I've done my math and simulation correctly (not loosing precision)
using N of 2.5 million articles gets a probability over 1.1%.  (This
means that each time a new 2.5 million articles comes through, there
is a 1.1% chance of having a duplicate hash resulting in failure.)

If you want to keep a big newsspool (archived storage, etc)
of 25 million articles, there is a 67% chance of the same problem.

What are typical sizes of history (number of entries) now?

Is anyone else bothered by this potential for failure?
A 1.1% chance of failureseems like a lot to me.  Then
again, it is only a 1.1% chance of loosing a single
article.  That happens on Usenet all the time for other

And you can also put it into perspective with Highwind's
products, which don't even guarantee that you can retrieve by
message ID....You have to be very on-top of things to make
sure you have enough storage defined for it....

I noticed a thread on news.software.nntp recently.
Should this discussion get moved there?

More information about the inn-workers mailing list