Date parsing

Russ Allbery rra at stanford.edu
Tue Sep 3 05:54:24 UTC 2002


So today, in-between writing with friends, I played about with something
that I've been wanting to play with since I first saw parsedate, namely a
hand-written RFC 2822 date parser.

First, the good news.  As I thought, yacc-generated general date parsers
are really slow and clunky when it comes to parsing specialized date
formats.

Using parsedate
233.50u 0.75s 3:58.05 98.4%
Using parsedate_rfc2822
27.96u 0.73s 0:29.05 98.7%

Those are times for parsing 4,025,010 date headers (aka every date header
I had in the overview database on my server).  So parsedate takes about
59 microseconds to parse a date header, and my new routine takes about 7
microseconds.  (Well, some amount of that is file I/O overhead, but as I
was using QIO, that probably wasn't too high.)

Obviously this is not a performance bottleneck in the server, though.  :)
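
For the curious, the measurement doesn't require anything fancy; a driver
along these lines, run under time(1), would produce numbers in that form.
(This is only a sketch, not my actual test program: it assumes one Date
header value per line in a file and a parser with the signature
time_t parsedate_rfc2822(const char *), and it uses plain stdio rather
than QIO.)

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Prototype for the hand-written parser; the signature is assumed. */
    extern time_t parsedate_rfc2822(const char *date);

    int
    main(int argc, char *argv[])
    {
        FILE *file;
        char buffer[1024];
        unsigned long total = 0, failed = 0;

        if (argc != 2) {
            fprintf(stderr, "Usage: %s <file-of-date-headers>\n", argv[0]);
            exit(1);
        }
        file = fopen(argv[1], "r");
        if (file == NULL) {
            perror(argv[1]);
            exit(1);
        }
        while (fgets(buffer, sizeof(buffer), file) != NULL) {
            buffer[strcspn(buffer, "\r\n")] = '\0';
            total++;
            if (parsedate_rfc2822(buffer) == (time_t) -1)
                failed++;
        }
        fclose(file);
        printf("%lu parsed, %lu rejected\n", total - failed, failed);
        return 0;
    }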

My new code is also under 400 lines of C and only took me an afternoon to
write from (mostly) scratch, as opposed to the 863 lines of
difficult-to-understand C and yacc that make up parsedate.
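
To give a feel for the approach, the core of such a parser looks roughly
like this.  (This is a simplified sketch with made-up names, not the code
I actually wrote: it handles only the basic
"[day-of-week,] DD Mon YYYY HH:MM:SS +HHMM" form, and punts on comments,
obsolete zone names, two- and three-digit years, and the rest of the
obsolete syntax.)

    #include <ctype.h>
    #include <stdlib.h>
    #include <string.h>
    #include <strings.h>
    #include <time.h>

    /* Parse exactly two digits, advancing *pp; return -1 on failure. */
    static int
    two_digits(const char **pp)
    {
        const char *p = *pp;

        if (!isdigit((unsigned char) p[0]) || !isdigit((unsigned char) p[1]))
            return -1;
        *pp = p + 2;
        return (p[0] - '0') * 10 + (p[1] - '0');
    }

    time_t
    parsedate_rfc2822_sketch(const char *p)
    {
        static const char *const months[] = {
            "Jan", "Feb", "Mar", "Apr", "May", "Jun",
            "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
        };
        static const int days_before[] = {
            0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334
        };
        const char *comma;
        char *end;
        long zone, days;
        int mday, mon, year, hour, min, sec, sign, i, y;

        /* Optional day of week: just skip past any comma. */
        comma = strchr(p, ',');
        if (comma != NULL)
            p = comma + 1;

        /* Day of month, one or two digits (strtol skips leading spaces). */
        mday = (int) strtol(p, &end, 10);
        if (end == p || mday < 1 || mday > 31)
            return (time_t) -1;
        for (p = end; isspace((unsigned char) *p); p++)
            ;

        /* Three-letter month name. */
        for (mon = 0; mon < 12; mon++)
            if (strncasecmp(p, months[mon], 3) == 0)
                break;
        if (mon == 12)
            return (time_t) -1;
        for (p += 3; isspace((unsigned char) *p); p++)
            ;

        /* Four-digit year (shorter years are obsolete syntax). */
        year = (int) strtol(p, &end, 10);
        if (end - p != 4 || year < 1970)
            return (time_t) -1;
        for (p = end; isspace((unsigned char) *p); p++)
            ;

        /* HH:MM:SS, each field exactly two digits. */
        hour = two_digits(&p);
        if (hour < 0 || hour > 23 || *p++ != ':')
            return (time_t) -1;
        min = two_digits(&p);
        if (min < 0 || min > 59 || *p++ != ':')
            return (time_t) -1;
        sec = two_digits(&p);
        if (sec < 0 || sec > 60)
            return (time_t) -1;
        for (; isspace((unsigned char) *p); p++)
            ;

        /* Numeric time zone, +HHMM or -HHMM (zone names are obsolete). */
        if (*p != '+' && *p != '-')
            return (time_t) -1;
        sign = (*p == '-') ? -1 : 1;
        for (i = 0, zone = 0, p++; i < 4; i++, p++) {
            if (!isdigit((unsigned char) *p))
                return (time_t) -1;
            zone = zone * 10 + (*p - '0');
        }
        zone = sign * ((zone / 100) * 60 + zone % 100) * 60;

        /* Count days since the epoch and subtract the zone offset. */
        days = mday - 1 + days_before[mon];
        if (mon > 1
            && ((year % 4 == 0 && year % 100 != 0) || year % 400 == 0))
            days++;
        for (y = 1970; y < year; y++)
            days += ((y % 4 == 0 && y % 100 != 0) || y % 400 == 0)
                        ? 366 : 365;
        return (time_t) (days * 86400L + hour * 3600L + min * 60L
                         + sec - zone);
    }

The whole thing is a single left-to-right scan with no backtracking.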

Now, the bad news.  Of those 4,025,010 date headers, 70,759 of them would
be rejected by the new parser, or about 1.76%.  Here's the analysis that I
did of them and posted to USEFOR:

    Sat, 24 Aug 2002  9:48:06 GMT
    Thu, 29 Aug 2002 9:49:18 +1000

Most common error by *far*, with 65,152 occurrences.  Missing the
mandatory first digit for the hour.  I even double-checked that this
really was required, in both the regular and obsolete syntax.

    Sat, 24 Aug 2002 12:57:10 BST

1,096 articles with BST as a time zone, which isn't one of the allowable
time zones in RFC 2822.  Another 365 with UTC and 127 with GMT+1, as well
as a few other non-numeric time zones not allowed by the obsolete syntax.
(UNDEFINED is my favorite.)

    Sat, 18 Aug 2002 20:41:32

No time zone.  2,074 of these.

    Sun, 25 Aug 2 12:33:49 GMT

That was cute.  Single-digit year.  Only 19 of those.

There are a bunch of other random, scattered problems, but far and away
the most common is the lack of a leading digit on the hour.  Which is
rather interesting.  I'm guessing that this must be a flaw in a particular
piece of software.

Incidentally, there was only one date that had a comment anywhere other
than at the end of the date string, and that was:

    (comment) Sat, 31 Aug 2002 14:03:41 +0200

I don't know if there were any folded dates, since I believe INN would
have rejected them.

So, I'm thinking about where we should go from here.  I think I'm going to
try throwing a lot more tests at my new date parser to make sure it really
gets all of the vagaries of RFC 2822 obsolete syntax, since the nested
comment stuff is tricky.  If anyone has a collection of pathological RFC
822 dates somewhere, please let me know.
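
For what it's worth, the nested comment handling itself doesn't have to be
complicated.  One way to do it is a small helper along these lines
(an illustrative sketch, not the code I wrote) that skips whitespace and
parenthesized comments, including nesting and backslash-escaped
characters:

    #include <stddef.h>

    /* Skip RFC 2822 CFWS: whitespace plus parenthesized comments, which
       may nest and may contain backslash-escaped characters.  Returns a
       pointer past the CFWS, or NULL on an unterminated comment. */
    static const char *
    skip_cfws(const char *p)
    {
        int depth;

        for (;;) {
            while (*p == ' ' || *p == '\t')
                p++;
            if (*p != '(')
                return p;
            depth = 0;
            do {
                if (*p == '\0')
                    return NULL;
                if (*p == '\\' && p[1] != '\0')
                    p++;                /* quoted-pair: skip escaped char */
                else if (*p == '(')
                    depth++;
                else if (*p == ')')
                    depth--;
                p++;
            } while (depth > 0);
        }
    }

The idea is to call that anywhere the grammar allows CFWS and otherwise
ignore comments entirely.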

I'm also going to try implementing a flag to the parser that will cause it
to only accept non-obsolete RFC 2822 dates (in other words, no obsolete
time zones, none of the bizarre comments, four digit years, and so forth).
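
The interface might end up looking something like this (just a sketch;
the names are invented for the sake of discussion):

    /* Possible flags for the parser; these names are only illustrative. */
    enum {
        PD_RFC2822_STRICT   = 0x00,  /* current syntax only: numeric zones,
                                        four-digit years, no comments */
        PD_RFC2822_OBSOLETE = 0x01   /* also accept the obsolete syntax */
    };

    extern time_t parsedate_rfc2822_flags(const char *date,
                                          unsigned int flags);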

Whether to use the parser in INN is another question.  Examining the INN
source base, the only program that uses parsedate to do anything other
than parse date headers in articles is convdate, and as useful as convdate
sometimes can be, one can whip up something very similar with Perl and
Date::Manip that can recognize a lot more dates.  So if I have a better
and more maintainable date parser, it would make sense to replace
parsedate with it, get a step closer to eliminating a yacc dependency, and
get rid of 900-odd lines of code that I doubt any of us really want to
maintain.

I'm not sure what to do about all those articles that would be rejected,
though.

I think that using the stricter parser in nnrpd for local posts is an
obvious thing to do, in the "be strict in what we generate" department.  I
wondered if it would be worth going a step further and using a strict
parser that doesn't permit the obsolete syntax of RFC 2822, but I'm
guessing that would catch a lot of news readers that are doing things they
shouldn't (like using GMT as the time zone instead of +0000, which is very
common) and basically just cause headaches for people.  Using the new
parser in innd, though, looks like a trickier question.

If the parser were modified to accept one-digit hours, that would cut the
percentage of articles rejected down to 0.14% (5,607 articles), but I'm
guessing that there would still be a few noticeable ones in that pile.
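
In terms of the sketch earlier in this message, accepting one-digit hours
amounts to swapping the strict two-digit helper, for the hour field only,
with something like this (again, just a sketch):

    #include <ctype.h>

    /* Like the strict two-digit parse, but tolerate a single digit, which
       covers the missing-leading-zero hours above. */
    static int
    one_or_two_digits(const char **pp)
    {
        const char *p = *pp;
        int value;

        if (!isdigit((unsigned char) p[0]))
            return -1;
        value = p[0] - '0';
        if (isdigit((unsigned char) p[1])) {
            value = value * 10 + (p[1] - '0');
            p++;
        }
        *pp = p + 1;
        return value;
    }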

Accepting dates with BST and UTC as time zones and dates with no time
zones would cut the rejected count down to 2,072 articles (0.05%), and
actually about 1,200 of those are articles from 1992 through 1995 on my
server in the slac.* hierarchy that have fully spelled-out weekday names,
so with those changes the rejections would probably be in the noise.
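
That kind of laxness could be a small fallback in the zone handling:
accept the RFC 2822 obsolete zone names plus the specific offenders seen
here, and treat a missing zone as +0000.  (Another sketch; the caller is
assumed to have already isolated the zone token as a string, the names
are invented, and the table offsets are minutes east of UTC.)

    #include <stddef.h>
    #include <strings.h>

    /* Zone names to accept when being lax: the RFC 2822 obsolete names
       plus BST and UTC.  Offsets are minutes east of UTC. */
    static const struct {
        const char *name;
        long offset;
    } lax_zones[] = {
        { "UT",     0 }, { "GMT",    0 }, { "UTC",    0 }, { "BST",   60 },
        { "EST", -300 }, { "EDT", -240 }, { "CST", -360 }, { "CDT", -300 },
        { "MST", -420 }, { "MDT", -360 }, { "PST", -480 }, { "PDT", -420 }
    };

    /* Return the offset in seconds east of UTC.  An empty zone (no zone
       at all) is treated as +0000; an unrecognized name is still
       rejected, via *ok. */
    static long
    lax_zone_offset(const char *zone, int *ok)
    {
        size_t i;

        *ok = 1;
        if (*zone == '\0')
            return 0;
        for (i = 0; i < sizeof(lax_zones) / sizeof(lax_zones[0]); i++)
            if (strcasecmp(zone, lax_zones[i].name) == 0)
                return lax_zones[i].offset * 60;
        *ok = 0;
        return 0;
    }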

Anyone have any thoughts about all this?

-- 
Russ Allbery (rra at stanford.edu)             <http://www.eyrie.org/~eagle/>

    Please send questions to the list rather than mailing me directly.
     <http://www.eyrie.org/~eagle/faqs/questions.html> explains why.

