Replacing parsedate
Russ Allbery
rra at stanford.edu
Mon Jun 7 10:13:41 UTC 2004
I've finally gotten around to writing a lax version of the date parser.
I've verified that it parses all of the random crap found on the wire that
parsedate can handle by duplicating several parsedate bugs in the new code
and then running both it and parsedate against 4,455,339 dates (all of the
dates in my current news spool). The new code, with these bug
duplications added, exactly matches the output of parsedate for all dates
except the following:
??, 17 ??? 2005 14:05:39 -0600
??, 17 ??? 2005 14:02:24 -0600
??, 17 ??? 2005 14:02:18 -0600
??, 17 ??? 2004 13:59:28 -0600
??, 17 ??? 2005 14:02:29 -0600
??, 17 ??? 2005 14:02:28 -0600
??, 17 ??? 2005 14:05:49 -0600
??, 17 ??? 2004 13:59:31 -0600
??, 17 ??? 2005 14:02:33 -0600
where the ? characters are various accented characters that the mailing
list software will barf on. The new code rejects these dates as invalid;
the old parsedate code, for some inexplicable reason, parses them all as
2004-06-07 00:00:00. I don't see any need to preserve this behavior.
I am going to remove some of the code that I put in only to duplicate the
parsedate bugs. The new parser that I'll commit will have the following
additional differences from parsedate:
* When a date does not have a time zone, parsedate interprets that date
as being in the local time zone, except that it has buggy handling of
daylight saving time: it sometimes adjusts for it twice, which leaves
the result off by an hour. The new code will treat all dates without a
time zone as being GMT.
* parsedate accepts time zones like "+4", but interprets them as a time
zone four minutes east of GMT. The new code will correctly treat this
as a time zone that's four hours east.
* As a side effect of the previous change, the new code will handle cases
like GMT+2 and GMT-5 correctly; parsedate does not.
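To make the "+4" case concrete, here is a minimal sketch of the intended interpretation. It is not the code that will be committed: zone_offset is a hypothetical helper invented for this example, and it covers only one-, two-, and four-digit numeric zones.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper for illustration only, not the actual INN parser.
 * Parses a numeric zone such as "+4", "-10", or "-0600" and stores the
 * offset east of GMT, in seconds, in *offset.  One or two digits are taken
 * as hours; four digits are taken as HHMM.  Returns false on anything else.
 */
bool
zone_offset(const char *zone, long *offset)
{
    long sign, value, hours, minutes;
    size_t i, digits;

    if (zone[0] == '+')
        sign = 1;
    else if (zone[0] == '-')
        sign = -1;
    else
        return false;

    digits = strlen(zone + 1);
    if (digits != 1 && digits != 2 && digits != 4)
        return false;

    value = 0;
    for (i = 1; i <= digits; i++) {
        if (!isdigit((unsigned char) zone[i]))
            return false;
        value = value * 10 + (zone[i] - '0');
    }

    if (digits <= 2) {
        /* "+4" means four hours, not four minutes. */
        hours = value;
        minutes = 0;
    } else {
        hours = value / 100;
        minutes = value % 100;
    }
    if (minutes > 59)
        return false;

    *offset = sign * (hours * 3600 + minutes * 60);
    return true;
}
```

With this sketch, zone_offset("+4", &offset) stores 14400 (four hours east), where the behavior described for parsedate would correspond to 240 (four minutes east).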
The lax mode allows the following that the strict mode does not:
* No attempt is made to verify weekday names at all. If the date starts
with something other than a digit, all characters up to the first space
or comma are skipped and then parsing continues, looking for the day of
the month. This allows for things like "Wednesday" as well as
localized day names.
* In addition to the standard abbreviations (Jan, Jun), the abbreviation
with a trailing period (Jan., Jun.) and the full name (January, June)
are also allowed. This matches what parsedate allows.
* Hours, minutes, and seconds are permitted to omit the leading zero.
* One- and two-digit time zones like +4 and -10 are allowed, and are
taken to be in hours.
* Five-digit time zones are allowed (there is some software out there
that always adds a leading zero even if the time zone has a 10-hour
offset or greater). Three-digit time zones are also allowed, omitting
the leading zero on the hour.
* All of the additional time zone abbreviations supported by parsedate
are also supported, in addition to the ones allowed by RFC 2822.
They're allowed with the same offsets as in parsedate, even though in
some cases those look rather dubious to me.
* If the time zone isn't present or can't be parsed, GMT is assumed
rather than rejecting the date.
* Multiple time zones are allowed, and only the last is used. This is
primarily there to support a time zone like "GMT+4", but there are some
dates out there with things like "-0300 EST" at the end.
* Any trailing garbage present in the date is just ignored, rather than
causing a parse failure.
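Two of the lax rules above, the weekday skip and the month-name matching, can be sketched roughly as follows. This is an illustration under stated assumptions rather than the real parser: skip_weekday and lax_month are hypothetical names, the month is assumed to have already been split out as a NUL-terminated token, and matching is case-sensitive here, which may not be what the committed code does.

```c
#include <ctype.h>
#include <string.h>

static const char *const MONTHS[12] = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December"
};

/*
 * Hypothetical helper.  If the date does not start with a digit, skip
 * everything up to the first space or comma without checking whether it is
 * a valid weekday name, then skip the separators that follow.  Returns a
 * pointer to where the day of the month is expected to begin.
 */
const char *
skip_weekday(const char *p)
{
    if (isdigit((unsigned char) *p))
        return p;
    while (*p != '\0' && *p != ' ' && *p != ',')
        p++;
    while (*p == ' ' || *p == ',')
        p++;
    return p;
}

/*
 * Hypothetical helper.  Accepts the three-letter abbreviation ("Jun"), the
 * abbreviation with a trailing period ("Jun."), or the full name ("June").
 * Returns the month number from 0 to 11, or -1 if nothing matches.
 */
int
lax_month(const char *token)
{
    size_t length = strlen(token);
    int i;

    for (i = 0; i < 12; i++) {
        if (length == 3 && strncmp(token, MONTHS[i], 3) == 0)
            return i;
        if (length == 4 && token[3] == '.'
            && strncmp(token, MONTHS[i], 3) == 0)
            return i;
        if (strcmp(token, MONTHS[i]) == 0)
            return i;
    }
    return -1;
}
```

Because skip_weekday never inspects what it skips, "Wednesday, 7 Jun 2004" and a localized day name are handled the same way; both leave the pointer at "7 Jun 2004".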
My intention is to remove parsedate and replace it with a call to this new
parser, so that we can be free of one yacc parser and all of the namespace
pollution that it causes.
The testing method used was:
    news:~/spool/overview> find . -name \*.DAT -print | xargs cut -f4 > /tmp/dates
    news:~/spool/overview> /tmp/compdate < /tmp/dates
where /tmp/compdate is the following program:
    #include "config.h"
    #include "clibrary.h"
    #include <time.h>

    #include "libinn.h"

    int
    main(void)
    {
        char buffer[BUFSIZ];
        time_t old, new;

        while (fgets(buffer, sizeof(buffer), stdin) != NULL) {
            /* Strip the trailing newline, if any, left by fgets. */
            buffer[strcspn(buffer, "\n")] = '\0';

            /* Parse with both the old and the new parser and compare. */
            old = parsedate(buffer, NULL);
            new = parsedate_rfc2822_lax(buffer);
            if (old != new)
                printf("%s mismatch (old: %ld, new: %ld)\n", buffer,
                       (long) old, (long) new);
        }
        return 0;
    }
--
Russ Allbery (rra at stanford.edu) <http://www.eyrie.org/~eagle/>
Please send questions to the list rather than mailing me directly.
<http://www.eyrie.org/~eagle/faqs/questions.html> explains why.