Fwd: [Off-Topic] Interesting Date header trivia

Russ Allbery rra at stanford.edu
Tue Sep 3 22:27:15 UTC 2002


Some additional findings from Andrew.

Based on this, it sounds to me like if a sloppy mode of the parser
supports one-digit hours, the additional time zones BST, CEST, CET, and
UTC, and articles without time zones, we get pretty much everything that
we really care about.
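
As a rough illustration of what such a sloppy mode might accept, here is a
minimal sketch in Python (not INN's parser, and not anything Andrew ran).
The names SLOPPY, NAMED_ZONES, and parse_sloppy are hypothetical, the
offsets assume BST means British Summer Time, and an article with no time
zone is reported with an offset of None so the caller can decide what to
assume.

```python
import re

MONTHS = "jan feb mar apr may jun jul aug sep oct nov dec".split()

# Minutes east of UTC for the named zones listed above.
# Assumption: BST is British Summer Time (+0100).
NAMED_ZONES = {"gmt": 0, "utc": 0, "bst": 60, "cet": 60, "cest": 120}

# Hypothetical lenient pattern: one-digit hours are allowed, and the zone
# may be one of the extra names, a 4-digit offset, or missing entirely
# (in which case nothing but whitespace may follow the time).
SLOPPY = re.compile(
    r"^(?:(?:mon|tue|wed|thu|fri|sat|sun),\s+)?"
    r"(\d?\d)\s+(" + "|".join(MONTHS) + r")\s+(20\d\d)\s+"
    r"(\d?\d):(\d\d)(?::(\d\d))?"
    r"(?:\s*(gmt|utc|bst|cest|cet|[+-]\d\d\d\d)|\s*$)",
    re.IGNORECASE | re.ASCII,
)

def parse_sloppy(date_header):
    """Return the parsed fields as a dict, or None if the header is rejected.

    No range checks are done on the numeric fields.
    """
    m = SLOPPY.match(date_header.strip())
    if m is None:
        return None
    day, mon, year, hour, minute, second, zone = m.groups()
    if zone is None:
        offset = None                      # no time zone given
    elif zone.lower() in NAMED_ZONES:
        offset = NAMED_ZONES[zone.lower()]
    else:
        sign = -1 if zone[0] == "-" else 1
        offset = sign * (int(zone[1:3]) * 60 + int(zone[3:5]))
    return {
        "day": int(day),
        "month": MONTHS.index(mon.lower()) + 1,
        "year": int(year),
        "hour": int(hour),
        "minute": int(minute),
        "second": int(second) if second else 0,
        "offset_minutes": offset,
    }
```

With this sketch, "Tue, 3 Sep 2002 9:27:15 BST" parses with an offset of 60
minutes, while a header ending in an unlisted zone such as PST is rejected
rather than silently treated as having no zone.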

I'm going to check with him on what large news outsourcers do.


To: usenet-format at landfield.com
Subject: Re: [Off-Topic] Interesting Date header trivia
From: Andrew Gierth <andrew at erlenstar.demon.co.uk>
Date: 03 Sep 2002 21:56:23 +0100

>>>>> "AG" == Andrew Gierth <andrew at erlenstar.demon.co.uk> writes:

 AG> I'm re-running my analysis out of curiosity and will post the
 AG> results in due course.

Total headers analysed: 14,543,552 (a little under 9 days' worth of a
more or less full feed, including spam etc., excluding control
messages).

My approach is to start with the strictest possible interpretation
and work from there looking at variations. So I define a "perfect"
date header as one which matches:

$D = qr/mon|tue|wed|thu|fri|sat|sun/i;
$M = qr/jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec/i;
$Z = qr/gmt|[+-]\d\d\d\d/i;

/^($D,\s+)?(\d?\d)\s+($M)\s+(20\d\d)\s+(\d\d):(\d\d)(:\d\d)?\s*($Z)/

i.e. this expression requires 4-digit years, all digits present in
hours/minutes, timezone to be either GMT or a 4-digit offset, strict
3-letter names for both day-of-week and month, etc. Trailing material
after the timezone was ignored. (Note: leading whitespace was stripped
before the above test was applied.)
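
For anyone who wants to try the same test outside Perl, the following is a
sketch of a Python port of the pattern above. The name is_perfect is
hypothetical; the port assumes that applying case-insensitivity to the
whole expression is equivalent to the per-fragment /i flags (the rest of
the pattern contains no letters), and it adds re.ASCII so that \d matches
only the digits 0-9.

```python
import re

# Python port of the $D, $M and $Z fragments quoted above.
D = r"(?:mon|tue|wed|thu|fri|sat|sun)"
M = r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)"
Z = r"(?:gmt|[+-]\d\d\d\d)"

# Same structure as the Perl expression; nothing is anchored at the end,
# so trailing material after the timezone is ignored.
PERFECT = re.compile(
    rf"^({D},\s+)?(\d?\d)\s+({M})\s+(20\d\d)\s+(\d\d):(\d\d)(:\d\d)?\s*({Z})",
    re.IGNORECASE | re.ASCII,
)

def is_perfect(date_header):
    """True if the header matches the strict definition."""
    # Leading whitespace is stripped before the test, as described above.
    return PERFECT.match(date_header.lstrip()) is not None
```

For example, is_perfect("Tue, 03 Sep 2002 21:56:23 +0100") is true, while
"Sun, 25 Aug 2002 19:35:28 -800" fails because the offset has only three
digits.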

out of 14,543,552 headers,
  335,815 (2.3%) were not "perfect" by the above definition.

Of those, 281,665 failed only due to use of a timezone other than GMT
or a numeric offset. The breakdown of those was:

 215317 BST   (215025 of which are from clara.net)
  33520 CDT   (33379 of which are from giganews.com)
  18623 EST   (18507 of which are spam from Vespice/breezes.net)
   7133 PST   (7085 of which are test posts from microsoft.com)
   2169 UTC   (772 of which are a ba.weather posting robot,
               1324 of which are "nobody at deathstar" posts to "junk")
   1427 CST
   1130 CEST
   1026 PDT
   1001 EDT
    141 CET
     87 SAMST
     47 GM
     22 UT
      8 pdt
      3 UNDEFINED
      2 MEZ
      2 MEST
      1 est
      1 WET
      1 Mountain
      1 LON
      1 1200
      1 0600
      1 0000
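
A breakdown like the one above could be produced along these lines. This
is only a sketch in Python, not the script actually used for the analysis:
the names ANY_ZONE and tally_odd_zones are hypothetical, and the shape of
the zone token (a run of letters or a bare 4-digit number) is guessed from
the entries in the table. Case is preserved, so "PDT" and "pdt" are
counted separately.

```python
import re
from collections import Counter

D = r"(?:mon|tue|wed|thu|fri|sat|sun)"
M = r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)"

# Everything before the timezone must already be "perfect"; the (?!:)
# stops the optional seconds from being skipped when they are present.
PREFIX = rf"^(?:{D},\s+)?\d?\d\s+{M}\s+20\d\d\s+\d\d:\d\d(?::\d\d)?(?!:)\s*"

# Assumed token shape: a run of letters, or four bare digits.
ANY_ZONE = re.compile(PREFIX + r"([A-Za-z]+|\d{4})", re.IGNORECASE | re.ASCII)

# The timezone forms accepted by the strict definition.
STRICT_ZONE = re.compile(r"gmt|[+-]\d\d\d\d", re.IGNORECASE | re.ASCII)

def tally_odd_zones(headers):
    """Count zone tokens on headers that fail only because of the zone."""
    counts = Counter()
    for header in headers:
        m = ANY_ZONE.match(header.lstrip())
        if m and not STRICT_ZONE.match(m.group(1)):
            counts[m.group(1)] += 1
    return counts
```

Calling tally_odd_zones on a list of Date header strings returns a Counter
keyed by the zone token; its most_common() method gives the entries in
descending order, as in the table.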

That leaves 54,150 (0.37%) dates which were imperfect for some
other reason. Of those:

   4686 lacked a timezone completely
   7835 had 2-digit years but no other faults

That leaves 41,606 (0.29%) with other formatting problems.
It's worth noting that more than 75% of those are blatant spam,
and several of the common malformed date formats are found _only_
in spam (at least in this sample); most notable is probably the
PostAgent program, which uses the full localised weekday name,
and which accounts for 23,065 of those 41,606 articles.

Eliminating most of the spam leaves 9,428 (0.065%) articles, which
include:

   6252 posts made with Hogwasher, which leaves off the leading zero
        from the 'hour' field
   1573 other posts with missing digits from the 'hour', 'minute' or
        'second' field
    701 posts with '+-nnnn' as the timezone, all of which look like
        probable spam
    549 posts with '-500 GMT' as the timezone, all of them from a
        jobs autoposting site
    288 posts from "Posting-Version: version 0.001b; nntp-post",
        with dates like "Tuesday, 27-Aug-02 21:36:19 GMT"
     24 posts with '02002' as the year, all from a jobs autoposter
     17 posts with '0102' as the year, all spam through mail2news gates

That leaves 24 posts, as follows:

<akbs54$t33$1 at nnrp.gol.com>
soc.culture.korean,soc.culture.china,soc.culture.japan
Mon, 26 Aug 2 00:07:33 GMT

<akbsj2$t3l$1 at nnrp.gol.com>
soc.culture.korean,soc.culture.china,soc.culture.japan
Mon, 26 Aug 2 00:14:59 GMT

<3d6993f0.577e.0 at pcez.com>
scnr.plug
Sun, 25 Aug 2002 19:35:28 -800

<3d69bcfb.60c4.0 at pcez.com>
scnr.plug
Sun, 25 Aug 2002 22:30:35 -800

<MSGID_2=3a237=2f10.10_3d6c0b15 at fidonet.org>
dk.videnskab.historie.genealogi
Tuesday, 27 Aug 2002 23:27:39 +100

<MSGID_2=3a280=2f17.0_d6bbf180 at fidonet.org>
nl.media.radio.zeezenders
Tuesday, 27 Aug 2002 19:03:36 +100

<7796b604-node=1.10.41 at FidoNet>
bit.listserv.catholic,catholic.fcml
Wed, 28 Aug 2002 23:37:45 +0

<nutchu45fgo1gntt4ba6r39fep21ee5635 at 4ax.com>
relcom.comp.os.windows.nt
\317\355, 24 \310\376\355 2002 05:41:59 +0400

<70qchuo4ktu0s11s2uuamlopg4jptn8v7e at 4ax.com>
relcom.politics
\317\355, 24 \310\376\355 2002 04:34:03 +0400

<DYR7153W37498.9097222222 at anonymous.poster>
comp.lang.postscript
30 Aug 2002 19.50.00 -0000

<MSGID_2=3a250=2f220=40fidonet_791230f8 at fidonet.org>
uk.radio.amateur
Friday, 30 Aug 2002 18:16:44 -500

<NOMSGID_2=3a252=2f110_020831_013049__e1ce7869 at fidonet.org>
uk.rec.gardening
Saturday, 31 Aug 2002 01:30:49 +000

<akpj9f$cac$1 at terabinaries.xmission.com>
comp.databases.informix
Sat,31 Aug 2002 12:38:56 +0800

<20020831101453Z16401-26943+4 at humbolt.nl.linux.org>
nlo.lists.securedistros
Sat,31 Aug 2002 18:15:07 +0800

<blablabla2 at enyo.de>
de.test
(comment) Sat, 31 Aug 2002 14:03:41 +0200

<akqjbu$28s$1$830fa17d at news.demon.co.uk>
demon.test
Sat,31 Aug 2002 15:23 +0100

<m2u1l9xzp8$hbh at microsoft.com>
alt.test
Sun Sep  1 21:45:23

<MSGID_2=3a341=2f14.91_3d72954f at fidonet.org>
es.comp.infosistemas.bbs
Monday, 02 Sep 2002 00:31:03 +100

<3D6E2F36.MD-1.0a.ponuda at madjionicar.com>
rs.oglasi
29 Aug 102 14:27:03 +0200

<MSGID_2=3a250=2f220=40fidonet_7a8adcc2 at fidonet.org>
uk.radio.amateur
Monday, 02 Sep 2002 11:04:42 -500

<MSGID_2=3a237=2f10.10_3d73e8eb at fidonet.org>
dk.videnskab.historie.genealogi
Monday, 02 Sep 2002 22:36:08 +100

<7b018d40-node=1.10.41 at FidoNet>
bit.listserv.catholic,catholic.fcml
Tue, 03 Sep 2002 02:26:06 +0

<7b018d3f-node=1.10.41 at FidoNet>
bit.listserv.catholic,catholic.fcml
Tue, 03 Sep 2002 02:26:00 +0

<MSGID_2=3a341=2f14.91_3d72a110 at fidonet.org>
es.comp.infosistemas.bbs
Monday, 02 Sep 2002 01:21:00 +100

-- 
Andrew.

