Zone parser in source code?

Sat Feb 9 02:09:41 UTC 2002

> "Justin Scott" <lists at darktech.org> wrote in message
> news:a3v67o$t2j at pub3.rc.vix.com...
> >
> > Hey all..  perhaps one of the developers can help me out here..
> >
> > I am in the process of writing a zone parser in ColdFusion for a web
> > application I am building.  I have a fairly working parser in place,
> > but it...
> 
> To write a parser is not easy!!
> If you do not know C it is hard to understand the parser.
> I know that a lot of programmers do not write parsers directly in C because
> it is difficult, but they use tools such as LEX, a sort of language for
> defining grammar rules that generates output code in C.
>  Regards
> Marco.

I agree with Marco.  Even a simplistic parser that doesn't blow up
with observed real-world data takes some effort if starting from
scratch.  Here was my first stab at this sort of thing:

while (<ZONEFILE>) {
    $linenum++;
    next if /^$/ || /^\s*;/;
    if ($multi) {
        $original_line .= $_;               # Save the original input line
    } else {                                # so that it can be echoed in
        $original_line = $_;                # case there's an error.
    }
    s/([^\\]);.*$/$1/;                      # strip comments
    s/\s+$//;                               # chop trailing whitespace
    if ($multi || /(.*)\s\(/) {             # deal with multi-line records
        $n = ($multi) ? " " : $1;           # force issue if already known
        $n = ($n =~ s/[^\\]\"/$&/g) + 0;    # count unescaped double quotes
        unless ($n % 2) {                   # even count means unquoted ' ('
            if ($multi) {                   # already past the first line
                s/^\s+//;                   # remove leading whitespace
                $n = " ";                   # blank replaces removed newline
            } else {                        # still in the first line
                $n = "";                    # leave the first line intact
            }
            $multi .= "$n$_";               # append continuation fragment
            next if $multi !~ /(.*)\s\)/;   # next line if no ' )' found
            $n = $1;                        # save ' )' prefix for testing
            $n = ($n =~ s/[^\\]\"/$&/g) + 0;# count unescaped double quotes
            next if ($n % 2);               # ' )' turned out to be quoted
            $_ = $multi;                    # else transfer concat'd lines
            $multi = "";                    # not doing this is a Bad Thing
        }
    }
    ################################################
    #
    # The "$_" variable now contains the assembled
    # DNS record for the tasks that need to be done.
    #
    ################################################
}
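
A minimal, self-contained sketch of the quote-parity test used above to
decide whether a ' (' or ' )' is real syntax or part of a quoted string.
The helper names count_unescaped_quotes and paren_is_unquoted are
hypothetical and not from the original script, and the lookbehind pattern
is a small variation that also counts a quote in the first column, which
the [^\\]\" pattern cannot match.  Like the original, it is fooled by an
escaped backslash that sits directly in front of a quote.

```perl
use strict;
use warnings;

# Hypothetical helper: count the double quotes in $text that are not
# immediately preceded by a backslash.
sub count_unescaped_quotes {
    my ($text) = @_;
    my $count = () = $text =~ /(?<!\\)"/g;
    return $count;
}

# Hypothetical helper: $prefix is everything to the left of a ' (' or ' )'.
# An even number of unescaped quotes means the parenthesis is not quoted.
sub paren_is_unquoted {
    my ($prefix) = @_;
    return count_unescaped_quotes($prefix) % 2 == 0;
}

for my $line ('@ IN SOA ns1.example.com. root.example.com. (',
              'info IN TXT "see RFC 1035 (section 5)"') {
    my ($prefix) = $line =~ /(.*)\s\(/;     # same capture as the loop above
    printf "%-45s => %s\n", $line,
        paren_is_unquoted($prefix) ? "starts a multi-line record"
                                   : "quoted, ordinary text";
}
```

The parity check is cheap, but it only looks at one line at a time, which
is why it cannot cope with quotes that span lines.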

Robustly handling non-real-world but still-legal syntax takes
a lot more effort.  The following parser was recently
reverse-engineered from observing the behavior of the BIND 9
'named-checkzone' utility and how it dealt with various
syntactic pathologies.  I don't claim that it is a 100%
translation of 'lib/isc/lex.c' into Perl but it's close:

$continuation_line = $open_paren_count = $open_quote = 0;
$line_num = 0;
#
# The following code block for reading text from a master zone data
# file and assembling that data into DNS records emulates the strict
# behavior of the BIND 9 lexer.  DNS records which are not compliant
# with the syntax specifications of RFC-1035 will be flagged as errors
# and skipped from further processing.
# Illegal zone file syntax that the lexers in BIND 4/8 mistakenly
# allowed must now be fixed.
#
while (<ZONEFILE>) {
    $line_num++;
    if ($continuation_line || /["()]/) {
        unless ($continuation_line) {
            $original_line = $_;
            $line_buffer = "";
            $split_line_num = $line_num;
            $bad_record = 0;
        } else {
            #
            # As a sanity check for unbalanced quotes/parentheses,
            # keep track of the accumulated length of the concatenated
            # continuation lines and quit reading the zone file if it
            # exceeds a certain threshold.
            # This threshold number was determined by deliberately
            # breaking a zone file and submitting it to the BIND 9
            # utility 'named-checkzone'.  The isc_lex_gettoken()
            # call fails with a "ran out of space" error after
            # running through about 131500 bytes (approximately
            # 3000 lines of a typical zone file).
            #
            $original_line .= $_;
            last if length($original_line) > 131500;
        }

        # Scan the line character-by-character keeping in mind
        # the following special characters and quoting hierarchy:
        #
        #   '\'  ->  An escape (backslash) cancels any special meaning
        #            of the character that immediately follows.  This
        #            includes backslashes, double-quotes, semicolons,
        #            left and right parentheses, and the newline.
        #
        #   '"'  ->  The double-quote character quotes whitespace, the
        #            ";()" characters which are special to RFC-1035,
        #            and escaped newlines until a matching unescaped
        #            double-quote is reached.
        #
        #   ';'  ->  Signifies the start of a comment.  It and
        #            the remaining characters that follow are
        #            ignored up to and including the next escaped
        #            or unescaped newline character.
        #
        #   '('  ->  Signifies that subsequent newline characters
        #            are to be ignored, i.e., quoted, if encountered.
        #            Does not perform any other quoting function.
        #            May be nested to multiple levels.
        #
        #   ')'  ->  Cancels the effect of the "(" character at
        #            the current nesting level.  Newlines are still
        #            ignored until the outer-most opening "(" is
        #            balanced by the corresponding ")" character.
        #
        #   1. An escaped newline is only valid within matching
        #      double-quote characters.  The lexer will report
        #      an "unexpected end of input" error otherwise.
        #
        #   2. An unescaped newline effectively cancels an open
        #      double-quote character.  The lexer will report an
        #      "unbalanced quotes" error if this situation occurs.
        #      If, however, there are also one or more open parentheses
        #      in effect, the lexer will continue to scan for their
        #      closing ")" counterparts to try to complete the disposition
        #      of the defective record.  Quoting will be cancelled and
        #      not be toggled by subsequent double-quote characters until
        #      the balancing parentheses are found.
        #
        #   3. If the nesting level of parentheses goes negative, the
        #      lexer will immediately report the imbalance and discard
        #      the rest of the line.  If an odd number of double-quote
        #      characters are part of the refuse, this may have a
        #      side-effect of introducing an "unbalanced quotes" error
        #      in a subsequent line.  Since resynchronization has to
        #      occur at some point, however, the lexer's chosen priority
        #      is to balance parentheses.
        #
        chomp;                # remove the newline only, never a data byte
        $last_char = "";      # don't carry an escape from the previous line
        while (length($_)) {
            ($char, $_) = split(//, $_, 2);
            if ($char eq "\\" && $last_char eq "\\") {
                #
                # An escape character which is itself escaped
                # becomes an ordinary backslash character.
                # Move it into the buffer and remove its ability
                # to escape the next character in the byte stream.
                #
                $line_buffer .= $char;
                $last_char = "";
                next;
            }
            if ($char eq '"' && $last_char ne "\\") {
                $open_quote = !$open_quote unless $bad_record;
                $line_buffer .= $char;
                $last_char = $char;
                next;
            }
            unless ($open_quote || $last_char eq "\\") {
                #
                # Encountering an unquoted and unescaped semicolon
                # marks the start of a comment.  There is no need
                # to scan the rest of the line. 
                # 
                last if $char eq ";";
                #
                # Unquoted and unescaped parentheses are not part of
                # the DNS resource record but an RFC-1035 construct
                # to ignore embedded newlines.
                # Keep track of them to maintain the current nesting
                # level but do not include these characters in the
                # line buffer of the record that we are assembling.
                #
                if ($char eq "\(") {
                    $open_paren_count++;
                    next;
                }
                if ($char eq "\)") {
                    $open_paren_count--;
                    last if $open_paren_count < 0;
                    next;
                }
            }
            if ($open_quote) {
                if ($char eq "\\" && $last_char ne "\\" && !length($_)) {
                    #
                    # An escaped newline has been encountered.
                    # Replace the backslash with a newline so
                    # it can be converted to BIND 9 presentation
                    # format in the next block.
                    #
                    $char = "\n";
                }
                if (ord($char) < 32) {
                    #
                    # Adopt the BIND 9 presentation format in which
                    # quoted non-printing characters other than a
                    # space get converted into a backslash character
                    # followed by the non-printing character's
                    # three-digit decimal equivalent ASCII value.
                    # An escaped newline followed by a tab, for
                    # example, would appear as "\010\009".
                    #
                    $line_buffer .= "\\0";
                    $line_buffer .= "0" if ord($char) < 10;
                    $line_buffer .= ord($char);
                } else {
                    $line_buffer .= $char;
                }
                $last_char = $char;
            } else {
                #
                # Preservation of cosmetic whitespace is unnecessary
                # since the assembled record will be parsed again into
                # its DNS components.
                #
                $char = " " if $char eq "\t";
                unless ($char eq " " && $last_char eq " ") {
                    $line_buffer .= $char;
                    $last_char = $char;
                }
            }
        }
        # Assess the situation now that the character scan
        # of the current line is complete.
        #
        if ($open_paren_count < 0) {
            print STDERR "Unbalanced parentheses; file '$spcl_file', line $line_num\n";
            print STDERR "> $original_line";
            $continuation_line = $open_paren_count = 0;
            next;
        }
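        if ($open_quote && $last_char ne "\n") {
            print STDERR "Unbalanced quotes; file '$spcl_file', line $line_num\n";
            print STDERR "> $original_line";
            $open_quote = 0;
            $bad_record = 1 if $open_paren_count;
            #
            # Keep scanning for the closing ")" only if parentheses are
            # still open; otherwise the next line starts a new record.
            #
            $continuation_line = $open_paren_count;
            next;
        }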
        $continuation_line = $open_quote + $open_paren_count;
        next if $continuation_line || $bad_record;
        $_ = $line_buffer;
        s/\s+$//;                           # chop trailing whitespace
        next if /^$/;                       # line was only a comment 
    } else {
        next if /^\s*$/ || /^\s*;/;
        $original_line = $_;
        s/([^\\]);.*/$1/;                   # strip comments
        if (/\\$/) {
            #
            # Escaped newlines are only valid when quoted.
            #
            print STDERR "Unexpected end of input; file '$spcl_file', line $line_num\n";
            print STDERR "> $original_line";
            next;
        }
        s/\s+$//;                           # chop trailing whitespace
    }
    ################################################
    #
    # The "$_" variable now contains the assembled
    # DNS record for the tasks that need to be done.
    #
    ################################################
}
&CLOSE(*ZONEFILE);
if ($open_quote || $open_paren_count) {
    $char = ($open_quote) ? "quotes" : "parentheses";
    print STDERR "Unable to process file '$spcl_file'\n",
                 "due to unbalanced $char.  ",
                 "The syntax problem begins at line $split_line_num.\n";
}
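
To see the quoting hierarchy from the comments above in action, here is a
stripped-down, self-contained sketch that joins physical lines into
logical records.  The name assemble_records is hypothetical.  Unlike the
parser above, it silently ignores unbalanced quotes and parentheses
instead of reporting them, and it does not convert control characters to
the \DDD presentation format.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original post).  Joins the physical
# lines of $text into logical records.  Only escapes, quotes, comments, and
# parenthesis nesting are modelled; the error recovery is left out.
sub assemble_records {
    my ($text) = @_;
    my @records;
    my $buf = "";
    my ($depth, $in_quote, $in_comment, $escaped, $ws) = (0, 0, 0, 0, 0);

    # Append one character, preceded by a single blank if unquoted
    # whitespace was skipped since the previous append.
    my $emit = sub {
        $buf .= " " if $ws;
        $ws = 0;
        $buf .= $_[0];
    };

    for my $ch (split //, $text) {
        if ($in_comment) {                    # a comment runs to the newline
            next if $ch ne "\n";
            $in_comment = 0;
        }
        if ($escaped)    { $emit->($ch); $escaped = 0; next; }
        if ($ch eq "\\") { $emit->($ch); $escaped = 1; next; }
        if ($ch eq '"')  { $emit->($ch); $in_quote = !$in_quote; next; }
        unless ($in_quote) {
            if ($ch eq ";") { $in_comment = 1; next; }
            if ($ch eq "(") { $depth++; next; }
            if ($ch eq ")") { $depth--; next; }
            if ($ch eq "\n" && $depth <= 0) { # end of a logical record
                push @records, $buf if $buf =~ /\S/;
                ($buf, $ws, $depth) = ("", 0, 0);
                next;
            }
            if ($ch =~ /\s/) { $ws = 1; next; }
        }
        $emit->($ch);
    }
    push @records, $buf if $buf =~ /\S/;      # input lacked a final newline
    return @records;
}

my $zone = <<'EOT';
@    IN SOA ns1.example.com. hostmaster.example.com. (
            2002020901 ; serial
            3600 900   ; refresh, retry
            604800 (   ; expire, followed by a nested parenthesis
            86400 ) )  ; minimum
info IN TXT "a ; b ( c"   ; only this last part is a comment
EOT

print "$_\n" for assemble_records($zone);
```

Each element of the returned list plays the role of "$_" at the bottom of
the loop above: one complete record with comments and grouping
parentheses removed and unquoted whitespace collapsed.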

Enjoy,

Andris Kalnozols
Hewlett-Packard Laboratories
andris at hpl.hp.com