wildmat routines and text

Sun Jul 23 00:41:15 UTC 2000

Please note:  This message is crossposted between two mailing lists which
are closed to non-subscribers.  Please direct any followups appropriately;
if they're about INN, please limit them to inn-workers, and please send
all NNTP standards discussion to ietf-nntp.

I've finished a new wildmat implementation for INN that adds support for
comma and ! (and optionally @) as discussed in other messages here.  I
haven't checked it into INN yet because I also want to add UTF-8 support
(which doesn't look that hard), make sure it's fully tested, and get some
more eyes on it, given how central a routine it is.  I will probably put
it on the current development branch shortly.

Below is the documentation that I wrote for INN on how wildmat patterns
work, with all the references to @ removed.  This may be suitable for the
standard, although it could probably use some pruning before being put
into an RFC, since it's intended to be wordy and clear right now.

At <http://www.eyrie.org/~eagle/nntp/> you'll find wildmat.c, the new
implementation, and wildmat-t.c, the test suite that I wrote while writing
it (which may resolve any additional ambiguities).  The test suite is in
the public domain; wildmat.c, being based on Rich $alz's implementation,
is covered by the license found in LICENSE in that directory (basically
BSD with the advertising clause).  Also in that directory are wildmat.pod
and wildmat.3, the man page for those routines in two formats, which
includes the text below with some additions for @.

Any comments, corrections, and feedback on any of this are very much
welcome.

  A wildmat expression follows rules similar to those of shell filename
  wildcards but with some additions and changes.  A wildmat expression
  is composed of one or more wildmat patterns separated by commas.  Each
  character in the wildmat pattern matches a literal occurrence of that
  same character in the text, with the exception of the following
  metacharacters:

  ?       Matches any single character.

  *       Matches any sequence of zero or more characters.

  [...]   A character set, which matches any single character that falls
          within that set.  The presence of a character between the
          brackets adds that character to the set; for example, "[amv]"
          specifies the set containing the characters "a", "m", and "v". 
          A range of characters may be specified using "-"; for example,
          "[0-5abc]" is equivalent to "[012345abc]".  The order of
          characters is as defined in the UTF-8 character set.  If the
          start character of a range falls after its ending character in
          that ordering, the results of attempting a match with that
          pattern are undefined.

          In order to include a literal "]" character in the set, it must
          be the first character of the set (possibly following "^"); for
          example, "[]a]" matches either "]" or "a".  To include a literal
          "-" character in the set, it must be either the first or the
          last character of the set.  Backslashes have no special meaning
          inside a character set, nor do any other of the wildmat
          metacharacters.

  [^...]  A negated character set.  Follows the same rules as a character
          set above, but matches any character not contained in the set. 
          So, for example, "[^]-]" matches any character except "]" and
          "-".

  \       Turns off any special meaning of the following character; the
          following character will match itself in the text.  "\" will
          escape any character, including another backslash or a comma
          that otherwise would separate a pattern from the next pattern in
          an expression.  Note that "\" is not special inside a character
          set (no metacharacters are).

  In addition, "!" (and possibly "@") have special meaning as the first
  character of a pattern; see below.
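
  As a rough illustration of the character set rules above, here is a
  minimal C sketch.  It is not taken from wildmat.c: the name set_match
  and its return convention are made up for this example, and it compares
  single bytes only, so it ignores UTF-8.  A reversed range, which the
  text above leaves undefined, simply matches nothing here.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only.  Test whether byte c
 * belongs to the character set whose text starts at set, the character
 * just after the opening "[".  Returns -1 if the set has no closing "]";
 * otherwise returns 1 for a match or 0 for no match and stores the
 * position following the closing "]" in *end.
 */
int
set_match(const char *set, unsigned char c, const char **end)
{
    const unsigned char *p = (const unsigned char *) set;
    bool negated = false;
    bool found = false;
    bool first = true;

    if (*p == '^') {
        negated = true;
        p++;
    }

    /* "]" is literal only as the first member; any later "]" ends the set. */
    while (*p != '\0' && (first || *p != ']')) {
        unsigned char lo = *p++;
        unsigned char hi = lo;

        /* "-" forms a range unless it is the first or last member. */
        if (p[0] == '-' && p[1] != ']' && p[1] != '\0') {
            hi = p[1];
            p += 2;
        }
        /* Backslash gets no special treatment: it is an ordinary member. */
        if (lo <= c && c <= hi)
            found = true;
        first = false;
    }
    if (*p != ']')
        return -1;
    *end = (const char *) (p + 1);
    return found != negated;
}
```

  With this sketch, set_match("^]-]", c, &end) returns 0 only when c is
  "]" or "-", which agrees with the "[^]-]" example above.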

  When matching a wildmat expression against some text, each
  comma-separated pattern is matched in order from left to right.  In
  order to match, the pattern must match the whole text; in regular
  expression terminology, it's implicitly anchored at both the beginning
  and the end.  For example, the pattern "a" matches only the text "a"; it
  doesn't match "ab" or "ba" or even "aa".  If none of the patterns match,
  the whole expression doesn't match.  Otherwise, whether the expression
  matches is determined entirely by the rightmost matching pattern; the
  expression matches the text if and only if the rightmost matching
  pattern is not negated.

  For example, consider the text "news.misc".  The expression "*" matches
  this text, of course, as does "comp.*,news.*" (because the second
  pattern matches).  "news.*,!news.misc" does not match this text because
  both patterns match, meaning that the rightmost takes precedence, and
  the rightmost matching pattern is negated.  "news.*,!news.misc,*.misc"
  does match this text, since the rightmost matching pattern is not
  negated.

  Note that the expression "!news.misc" can't match anything.  Either the
  pattern doesn't match, in which case no patterns match and the
  expression doesn't match, or the pattern does match, in which case the
  expression doesn't match because the pattern is negated.  "*,!news.misc",
  on the other hand, is a useful expression that matches anything except
  "news.misc".

  "!" has significance only as the first character of a pattern; anywhere
  else in the pattern, it matches a literal "!" in the text like any other
  non-metacharacter.
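
  The evaluation rule described above (every pattern is anchored, and the
  rightmost matching pattern decides) can be sketched in C roughly as
  follows.  This is a simplified illustration, not the code in wildmat.c:
  the names pattern_match and expression_match are made up here, character
  sets and "@" are left out to keep it short, and it works on bytes rather
  than UTF-8 characters.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: match text against one pattern occupying
 * [p, pend).  Handles only "?", "*", and "\"; character sets are
 * omitted.  The pattern must match the whole text.
 */
bool
pattern_match(const char *p, const char *pend, const char *text)
{
    for (; p < pend; p++, text++) {
        if (*p == '*') {
            /* Try every possible length for the text consumed by "*". */
            for (;; text++) {
                if (pattern_match(p + 1, pend, text))
                    return true;
                if (*text == '\0')
                    return false;
            }
        }
        if (*text == '\0')
            return false;
        if (*p == '?')
            continue;
        /* A backslash makes the next pattern character literal. */
        if (*p == '\\' && p + 1 < pend)
            p++;
        if (*p != *text)
            return false;
    }
    return *text == '\0';
}

/*
 * Hypothetical helper: match text against a comma-separated expression.
 * Every pattern is tried from left to right, and the last one that
 * matches decides the result.
 */
bool
expression_match(const char *expr, const char *text)
{
    bool result = false;
    const char *start = expr;

    for (;;) {
        const char *end = start;
        bool negated = false;

        /* Find the next comma that is not escaped by a backslash. */
        while (*end != '\0' && *end != ',') {
            if (*end == '\\' && end[1] != '\0')
                end++;
            end++;
        }
        /* "!" is special only as the first character of a pattern. */
        if (*start == '!') {
            negated = true;
            start++;
        }
        if (pattern_match(start, end, text))
            result = !negated;
        if (*end == '\0')
            return result;
        start = end + 1;
    }
}
```

  Under these assumptions, expression_match("news.*,!news.misc",
  "news.misc") is false and expression_match("news.*,!news.misc,*.misc",
  "news.misc") is true, as in the examples above.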

  If the wildmat_poison interface is used, then "@" behaves the same as
  "!" except that if an expression fails to match because the rightmost
  matching pattern began with "@", WILDMAT_POISON is returned instead of
  WILDMAT_FAIL.
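
  To make the three possible outcomes concrete, here is a small C sketch
  of the decision step alone.  It assumes each pattern has already been
  tested against the text.  WILDMAT_POISON and WILDMAT_FAIL are the names
  used above; WILDMAT_MATCH, the numeric values, struct pattern_outcome,
  and resolve are assumptions made for this example and may not correspond
  to the real wildmat_poison interface.

```c
#include <stdbool.h>
#include <stddef.h>

/* WILDMAT_MATCH and the numeric values are assumed for illustration. */
enum wildmat_result {
    WILDMAT_FAIL = 0,
    WILDMAT_MATCH = 1,
    WILDMAT_POISON = 2
};

/* Hypothetical record of how one pattern fared against the text. */
struct pattern_outcome {
    char flag;    /* '!' or '@' if the pattern began with one, else 0 */
    bool matched; /* whether the rest of the pattern matched the text */
};

/*
 * Hypothetical helper: combine per-pattern outcomes, listed left to
 * right, into a result.  The rightmost matching pattern decides; if no
 * pattern matched, the result is WILDMAT_FAIL.
 */
enum wildmat_result
resolve(const struct pattern_outcome *outcomes, size_t count)
{
    enum wildmat_result result = WILDMAT_FAIL;
    size_t i;

    for (i = 0; i < count; i++) {
        if (!outcomes[i].matched)
            continue;
        if (outcomes[i].flag == '@')
            result = WILDMAT_POISON;
        else if (outcomes[i].flag == '!')
            result = WILDMAT_FAIL;
        else
            result = WILDMAT_MATCH;
    }
    return result;
}
```

  A poisoned result is only reported when the "@" pattern is the last one
  to match; a later plain pattern that also matches overrides it, just as
  it would override "!".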

-- 
Russ Allbery (rra at stanford.edu)             <http://www.eyrie.org/~eagle/>