Replacing parsedate

Mon Jun 7 10:13:41 UTC 2004

I've finally gotten around to writing a lax version of the date parser.
To verify that it parses all of the random crap found on the wire that
parsedate can handle, I duplicated several parsedate bugs in the new code
and then ran both it and parsedate against 4,455,339 dates (all of the
dates in my current news spool).  With these bug duplications added, the
new code exactly matches the output of parsedate for all dates except the
following:

??, 17 ??? 2005 14:05:39 -0600
??, 17 ??? 2005 14:02:24 -0600
??, 17 ??? 2005 14:02:18 -0600
??, 17 ??? 2004 13:59:28 -0600
??, 17 ??? 2005 14:02:29 -0600
??, 17 ??? 2005 14:02:28 -0600
??, 17 ??? 2005 14:05:49 -0600
??, 17 ??? 2004 13:59:31 -0600
??, 17 ??? 2005 14:02:33 -0600

where the ? characters are various accented characters that the mailing
list software will barf on.  The new code rejects these dates as invalid;
the old parsedate code, for some inexplicable reason, parses them all as
2004-06-07 00:00:00.  I don't see any need to preserve this behavior.

I am going to remove some of the code that I put in only to duplicate the
parsedate bugs.  The new parser that I'll commit will have the following
additional differences from parsedate:

 * When a date does not have a time zone, parsedate interprets that date
   as being in the local time zone, but its handling of daylight saving
   time is buggy: it sometimes adjusts for it twice, getting the date
   wrong by an hour.  The new code will treat all dates without a time
   zone as being GMT.

 * parsedate accepts time zones like "+4", but interprets them as a time
   zone four minutes east of GMT.  The new code will correctly treat this
   as a time zone that's four hours east.

 * As a side effect of the previous change, the new code will handle cases
   like GMT+2 and GMT-5 correctly; parsedate does not.
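
As a minimal sketch of the hours-versus-minutes difference described
above: the helper name zone_offset and its interface are hypothetical and
are not taken from the new parser.  It assumes that a one- or two-digit
zone is a count of hours and that a four-digit zone is HHMM.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper, not the code from the new parser.  Converts a
 * numeric zone such as "+4", "-10", or "+0400" into an offset in seconds
 * east of GMT.  Returns false unless the string is a sign followed by
 * exactly 1, 2, or 4 digits.  Hours are not range-checked in this sketch.
 */
static bool
zone_offset(const char *zone, long *offset)
{
    size_t digits, i;
    long value = 0, hours, minutes;
    int sign;

    if (zone[0] != '+' && zone[0] != '-')
        return false;
    sign = (zone[0] == '-') ? -1 : 1;
    digits = strlen(zone + 1);
    if (digits != 1 && digits != 2 && digits != 4)
        return false;
    for (i = 1; i <= digits; i++) {
        if (!isdigit((unsigned char) zone[i]))
            return false;
        value = value * 10 + (zone[i] - '0');
    }
    if (digits <= 2) {
        /* "+4" means four hours, not four minutes. */
        hours = value;
        minutes = 0;
    } else {
        /* "+0400" is HHMM. */
        hours = value / 100;
        minutes = value % 100;
    }
    if (minutes > 59)
        return false;
    *offset = sign * (hours * 3600 + minutes * 60);
    return true;
}
```

With this reading, "+4" yields 14400 seconds (four hours) rather than the
240 seconds that a minutes interpretation gives.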

The lax mode allows the following that the strict mode does not:

 * No attempt is made to verify weekday names at all.  If the date starts
   with something other than a digit, all characters up to the first space
   or comma are skipped and then parsing continues, looking for the day of
   the month.  This allows for things like "Wednesday" as well as
   localized day names.

 * In addition to the standard abbreviations (Jan, Jun), the abbreviation
   with a trailing period (Jan., Jun.) and the full name (January, June)
   are also allowed.  This matches what parsedate allows.

 * Hours, minutes, and seconds are permitted to omit the leading zero.

 * One- and two-digit time zones like +4 and -10 are allowed, and are
   taken to be in hours.

 * Five-digit time zones are allowed (there is some software out there
   that always adds a leading zero even if the time zone has a 10 hour
   offset or greater).  Three-digit time zones are also allowed, omitting
   the leading zero on the hour.

 * All of the additional time zone abbreviations supported by parsedate
   are also supported, in addition to the ones allowed by RFC 2822.
   They're allowed with the same offsets as in parsedate, even though in
   some cases those look rather dubious to me.

 * If the time zone isn't present or can't be parsed, the parser assumes
   GMT rather than rejecting the date.

 * Multiple time zones are allowed, and only the last is used.  This is
   primarily there to support a time zone like "GMT+4", but there are some
   dates out there with things like "-0300 EST" at the end.

 * Any trailing garbage present in the date is just ignored, rather than
   causing a parse failure.
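
The weekday and month rules in the list above can be sketched roughly as
follows.  The function names skip_weekday and parse_month are hypothetical
and this is not the code from the new parser.  The sketch matches month
names case-sensitively, which is an assumption made to keep it short, not
something stated above.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

static const char *const MONTHS[12] = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December"
};

/*
 * Hypothetical helper.  If the date starts with something other than a
 * digit, skip everything up to the first space or comma, then skip any
 * run of spaces and commas, so that parsing resumes at the day of the
 * month.  The weekday name itself is never checked.
 */
static const char *
skip_weekday(const char *p)
{
    if (isdigit((unsigned char) *p))
        return p;
    while (*p != '\0' && *p != ' ' && *p != ',')
        p++;
    while (*p == ' ' || *p == ',')
        p++;
    return p;
}

/*
 * Hypothetical helper.  Accepts a three-letter abbreviation, the
 * abbreviation followed by a period, or the full month name.  Returns the
 * month (0-11) and sets *end just past what was consumed, or returns -1
 * if no month name is found at p.
 */
static int
parse_month(const char *p, const char **end)
{
    size_t len = 0, full;
    int i;

    while (isalpha((unsigned char) p[len]))
        len++;
    for (i = 0; i < 12; i++) {
        /* Abbreviation first, so that "May." also consumes the period. */
        if (len == 3 && strncmp(p, MONTHS[i], 3) == 0) {
            *end = p + 3 + (p[3] == '.' ? 1 : 0);
            return i;
        }
        full = strlen(MONTHS[i]);
        if (len == full && strncmp(p, MONTHS[i], full) == 0) {
            *end = p + len;
            return i;
        }
    }
    return -1;
}
```

A caller would run skip_weekday first, parse the day of the month, and then
hand the remainder to parse_month; a date whose month is not a recognized
name is rejected at that point.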

My intention is to remove parsedate and replace it with a call to this new
parser, so that we can be free of one yacc parser and all of the namespace
pollution that it causes.

The testing method used was:

news:~/spool/overview> find . -name \*.DAT -print | xargs cut -f4 > /tmp/dates
news:~/spool/overview> /tmp/compdate < /tmp/dates

where /tmp/compdate is the following program:

#include "config.h"
#include "clibrary.h"
#include <time.h>

#include "libinn.h"

int
main(void)
{
    char buffer[BUFSIZ];
    time_t old, new;

    while (fgets(buffer, sizeof(buffer), stdin) != NULL) {
        /* Strip the trailing newline, if present. */
        buffer[strcspn(buffer, "\n")] = '\0';
        old = parsedate(buffer, NULL);
        new = parsedate_rfc2822_lax(buffer);
        if (old != new)
            printf("%s mismatch (old: %ld, new: %ld)\n", buffer,
                   (long) old, (long) new);
    }
    return 0;
}

-- 
Russ Allbery (rra at stanford.edu)             <http://www.eyrie.org/~eagle/>

    Please send questions to the list rather than mailing me directly.
     <http://www.eyrie.org/~eagle/faqs/questions.html> explains why.