INN config file parsing infrastructure

Russ Allbery rra at stanford.edu
Sun Apr 30 11:42:46 UTC 2000


I've been thinking about this off and on for the past few weeks and
fiddling with a few things, and I think I'm ready to start writing code.
Before I do that, I figured I should try to write out my ideas.  :)

The goal is to implement one configuration file parsing library that all
parts of INN can use and which understands one standard syntax.  After
various discussions here and elsewhere, I'm currently favoring the
following syntax:

A parameter takes the form:

    parameter: value

with arbitrary leading whitespace.  The parameter itself should consist of
alphanumeric characters plus '-' and is taken from a fixed set of valid
parameters.  The parser should have an option controlling whether unknown
parameters are ignored or reported as errors.

Parameters are of three basic types:  strings, numbers, and booleans.  (We
may want to divide numbers into ints and longs.)  Booleans use the same
syntax as inn.conf does now, namely that any of "true", "yes", or "on" is
accepted as a true value, and probably "1" as well.  The value is checked
to make sure it matches the parameter type.
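
As a rough sketch of that type check (not settled code), the two
non-string types could be validated as below.  The function names are
made up for illustration, the false spellings ("false", "no", "off", "0")
are my assumption by symmetry with the true ones, and matching is assumed
to be case-sensitive.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert a boolean value to a bool.  Returns true
   and stores the result in *result on success, false if the text is not
   a recognized boolean.  The false spellings are an assumption. */
static bool
parse_boolean(const char *text, bool *result)
{
    static const char *const truthy[] = { "true", "yes", "on", "1" };
    static const char *const falsy[] = { "false", "no", "off", "0" };
    size_t i;

    for (i = 0; i < sizeof(truthy) / sizeof(truthy[0]); i++)
        if (strcmp(text, truthy[i]) == 0) {
            *result = true;
            return true;
        }
    for (i = 0; i < sizeof(falsy) / sizeof(falsy[0]); i++)
        if (strcmp(text, falsy[i]) == 0) {
            *result = false;
            return true;
        }
    return false;
}

/* Hypothetical helper: convert a numeric value to a long, rejecting empty
   strings, trailing garbage, and out-of-range values. */
static bool
parse_number(const char *text, long *result)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text || *end != '\0' || errno == ERANGE)
        return false;
    *result = value;
    return true;
}
```

Both helpers report failure through their return value so that the caller
can decide how to complain about a value of the wrong type.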

Values, regardless of type, may be enclosed in double quotes.  They must
be enclosed in double quotes if they contain any whitespace, or either of
the characters '}' or ';'.  Inside double quotes, a backslash escapes the
next character, regardless of what it is, and can be used to put double
quotes in the string and to continue lines.  Note that something like:

    parameter: "first line\
        second line"

would result in a literal newline being part of the string assigned to
that parameter.
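
To make the quoting rule concrete, here is a minimal sketch of how a
quoted value could be decoded.  The function name and its interface are
hypothetical; the only behavior taken from the rules above is that a
backslash makes the following character literal, whatever it is.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: start points at the opening double quote.  Returns
   a newly allocated copy of the contents with backslash escapes resolved,
   or NULL if the string is unterminated or memory runs out.  If end is not
   NULL, *end is set to point just past the closing quote. */
static char *
unquote(const char *start, const char **end)
{
    const char *p = start + 1;          /* skip the opening quote */
    char *result, *out;

    /* The contents can never be longer than the input minus the opening
       quote, so strlen(start) bytes leave room for the terminating nul. */
    result = malloc(strlen(start));
    if (result == NULL)
        return NULL;
    out = result;
    while (*p != '\0' && *p != '"') {
        if (*p == '\\') {
            p++;                        /* take the next character as-is */
            if (*p == '\0')
                break;
        }
        *out++ = *p++;
    }
    if (*p != '"') {                    /* no closing quote found */
        free(result);
        return NULL;
    }
    *out = '\0';
    if (end != NULL)
        *end = p + 1;
    return result;
}
```

Since the backslash only removes the special meaning of the next
character, a backslash followed by a newline leaves the newline in the
resulting string, matching the example above.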

Whitespace between the colon and the first non-whitespace character or
opening quote of the value is ignored.

Parameters may occur at the "top level" of the file, or they may be
enclosed in groups.  In the underlying implementation, the top level of
the file is an implicit group.  Different parameters may be valid inside
different groups.  A group contained inside another group (bearing in mind
that the top level is an implicit group) inherits the values of any
parameters that it has in common with its enclosing group.

The syntax for a group is:

    name tag {
       ...
    }

The name is chosen from a fixed set of possible groups, with the same
range of valid characters as parameters.  tag is an arbitrary
user-supplied tag and is subject to the same quoting rules as parameter
values.  (Some configuration files may put additional restrictions on the
tags, of course.)

A parameter value may be terminated with *either* a newline or a ';'
character.  So, for example:

    peer example { accept-from: news.example.com; feed-to: news.example.com }

is equivalent to:

    peer example {
        accept-from: news.example.com
        feed-to: news.example.com
    }

which is also equivalent to:

    peer "example" {
        accept-from:    news.example.com
        feed-to:        "news.example.com"
    }

Semicolons inside quotes don't terminate the value, of course.  Finally,
# (outside of a quoted value) begins a comment that lasts to the end of
the line.  The parser should be able to handle lines of arbitrary length.
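
For unquoted values, the termination rules boil down to a small set of
stop characters.  A minimal sketch, assuming that '#' ends an unquoted
value even with no whitespace before it (the rules above don't say either
way) and using a made-up function name:

```c
#include <string.h>

/* Hypothetical helper: length of the unquoted value starting at p.  The
   value ends at whitespace, a newline, ';', '}', or the start of a comment.
   Treating '#' in the middle of a token as a comment is an assumption. */
static size_t
bare_value_length(const char *p)
{
    return strcspn(p, " \t\r\n;}#");
}
```

Quoted values would be handled separately, since none of these characters
are special inside double quotes.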

Now, for the parser.  The goal of the parser implementation is for it to
be as data-driven as possible, so that the absolute minimum of code has to
be written for each separate configuration file, only those things that
are unique about that file.  That should also make the parser code much
easier and simpler to follow.

The basic idea of the parser is that every group (including the implicit
group around the entire file) generates a populated struct, containing the
value of all of its parameters (with the parameters set to their default
values if they weren't specified) and containing pointers to any child
groups that it had (probably most usefully as a null-terminated array).
So calling the parse function on a file, giving it a file definition (see
below), will return a pointer to the implicit group around the entire
file, and the rest of the program can then pull the information out of
that group or its subgroups.
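
As an illustration of how a caller might pull a subgroup out of the parsed
result, here is a sketch that searches a null-terminated array of child
groups.  It uses the result struct described later in this message (with
const added to the strings for the sake of the example); find_subgroup is
a made-up name, not an existing function.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Parsed group, following the layout proposed later in this message. */
struct inn_conf_data {
    const char *group_name;
    const char *group_tag;
    bool *set;
    void *data;
    struct inn_conf_data **subgroups;   /* null-terminated, or NULL */
};

/* Hypothetical helper: return the direct child of parent with the given
   group name and tag, or NULL if there is no such child. */
static struct inn_conf_data *
find_subgroup(const struct inn_conf_data *parent, const char *name,
              const char *tag)
{
    struct inn_conf_data **child;

    if (parent->subgroups == NULL)
        return NULL;
    for (child = parent->subgroups; *child != NULL; child++)
        if (strcmp((*child)->group_name, name) == 0
            && strcmp((*child)->group_tag, tag) == 0)
            return *child;
    return NULL;
}
```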

The core specification for a configuration file will be in the form of a
.def file that will be parsed by a Perl script and used to generate the
documentation for all the parameters (in the form of an =over/=back list
that can be substituted into the appropriate documentation file
automatically, at the location of some appropriate token), a header file
with the basic structure definitions, and a C file that contains the data
used by the parser as well as a static const instantiation of the
structures with the default values.

The structure of the .def file should be valid Perl code (so that we can
let the Perl interpreter do all the syntax verification for us
automatically).  A .def file is divided into two types of entries, group
definitions and sets of parameters.  A set of parameters is just a list of
related parameters, including their name, type, default value, one-line
documentation string (for a comment in the header file), and full
documentation (in POD).  A group definition specifies the name of the
group, the name of the struct that corresponds to it, a list of the
parameter sets that are allowed in that group, an optional function used
to initialize a new group (which will be called at the beginning of the
group before any of its parameters are read and can be used to set
dynamically determined default values), and an optional function to call
after the group has been fully parsed (to set derived default values,
apply overrides from environment variables, etc.).

A script will generate, from the .def file, .c and .h files that contain
the data used by the parser and the rest of the program.  Each group will
have a structure generated for it, looking something like:

struct example_group {
    char *      parameter1;     /* Short comment */
    /* ... */
};

All the valid parameters for that group are given, with the short comments
taken from the short description in the .def file.  This looks very much
like the innconf struct.

In the .c file, there will be one static const instantiation of this
structure for each group that contains all the defaults.  There will also
be an array of parameter structs for each group:

struct inn_conf_parameter {
    char *      name;
    size_t      offset;
    parameter_t type;
};

where offset is the offset into the group struct (from offsetof) and
parameter_t is an enum of VAR_STRING, VAR_NUMBER, and VAR_BOOLEAN.  Each
group will also have an instantiation of a struct like this:

struct inn_conf_group {
    char *      name;

    group_init_t init;
    group_final_t final;

    struct inn_conf_group **subgroups;
    struct inn_conf_parameter **parameters;
};

group_init_t and group_final_t are the types of functions to initialize a
new group and to handle derived defaults and the like after the group has
been parsed.
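
To give a feel for what the generated .c file might contain for one group,
here is a sketch.  The peer_group struct, its max-connections and
streaming parameters, and the find_parameter helper are all invented for
illustration; the table is terminated with a NULL name, which is just one
possible convention.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum { VAR_STRING, VAR_NUMBER, VAR_BOOLEAN } parameter_t;

struct inn_conf_parameter {
    const char *name;
    size_t      offset;
    parameter_t type;
};

/* Hypothetical group struct, as the generated header might declare it. */
struct peer_group {
    char *      accept_from;        /* Host to accept articles from */
    long        max_connections;    /* Invented numeric parameter */
    bool        streaming;          /* Invented boolean parameter */
};

/* Static const instantiation holding the defaults for the group. */
static const struct peer_group peer_defaults = { NULL, 10, true };

/* Parameter table for the group, terminated by a NULL name. */
static const struct inn_conf_parameter peer_parameters[] = {
    { "accept-from",     offsetof(struct peer_group, accept_from),
      VAR_STRING },
    { "max-connections", offsetof(struct peer_group, max_connections),
      VAR_NUMBER },
    { "streaming",       offsetof(struct peer_group, streaming),
      VAR_BOOLEAN },
    { NULL, 0, VAR_STRING }
};

/* Hypothetical helper: look up a parameter by name, or return NULL if it
   is not valid in this group. */
static const struct inn_conf_parameter *
find_parameter(const struct inn_conf_parameter *table, const char *name)
{
    for (; table->name != NULL; table++)
        if (strcmp(table->name, name) == 0)
            return table;
    return NULL;
}
```

A NULL result from the lookup is the point where the parser would either
ignore the unknown parameter or complain about it, depending on how it was
configured.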

Once all the arrays of parameters and subgroups (and their parameters)
have been filled out, the structure of the configuration file that can be
parsed will be completely determined.  The script that processes the .def
file will generate a static const instantiation of that description that
can be passed to the parser.  For each group it reads, the parser will
then generate something like:

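struct inn_conf_data {
    char *      group_name;     /* Name of the group */
    char *      group_tag;      /* Tag given at the start of the group */
    bool *      set;            /* Whether each parameter was set */

    void *      data;           /* The struct for this group */

    struct inn_conf_data **subgroups;
};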

where group_name and group_tag are the group name and the tag given at the
start of the group, set is an array recording whether each parameter was
set (used for checking for duplicate settings and the like), and data is a
pointer to the actual struct corresponding to that group.  A value can be
written into that struct by casting the pointer to char *, adding the
offset for a given parameter, and then writing the value (of the type
given by the parameter type) into that memory location.
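
A minimal sketch of that write, assuming numbers are stored as long and
that the value has already been checked against the parameter type.  The
conf_value union, the store_value function, and the parameter2 member are
names invented for the example.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { VAR_STRING, VAR_NUMBER, VAR_BOOLEAN } parameter_t;

struct inn_conf_parameter {
    const char *name;
    size_t      offset;
    parameter_t type;
};

/* Example group struct; parameter2 is invented for illustration. */
struct example_group {
    char *      parameter1;
    long        parameter2;
};

/* Hypothetical container for a value that has already been type-checked. */
union conf_value {
    char *string;
    long  number;
    bool  boolean;
};

/* Hypothetical helper: store value into the group struct that data points
   to, using the parameter's offset and type.  index is the position of the
   parameter in the set array.  Returns false without changing anything if
   the parameter was already set. */
static bool
store_value(void *data, bool *set, const struct inn_conf_parameter *param,
            size_t index, union conf_value value)
{
    char *slot = (char *) data + param->offset;

    if (set[index])
        return false;                   /* duplicate setting */
    switch (param->type) {
    case VAR_STRING:
        *(char **) (void *) slot = value.string;
        break;
    case VAR_NUMBER:
        *(long *) (void *) slot = value.number;
        break;
    case VAR_BOOLEAN:
        *(bool *) (void *) slot = value.boolean;
        break;
    }
    set[index] = true;
    return true;
}
```

Because the offsets come from offsetof on the real struct, the computed
address is correctly aligned for the member being written.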

I believe this syntax and parsing structure will be flexible and easy
enough to use that we can replace all the current configuration parsing in
INN with files parsed via this library in the long term.  Part of this
effort will definitely need to be a script to upgrade existing
configuration files to the new syntax, since there are subtle differences
from even those configuration files (like inn.conf) that are currently
somewhat close.

What do people think?  Any additional comments before I start writing
code?  This structure should also be writeable as well as readable,
although I haven't fully fleshed out all of those details in my head (but
I've gone far enough to convince myself there are no fundamental
limitations that would interfere with that).

-- 
Russ Allbery (rra at stanford.edu)             <http://www.eyrie.org/~eagle/>


