Zone parser in source code?
Andris Kalnozols
andris at hpl.hp.com
Sat Feb 9 02:09:41 UTC 2002
> "Justin Scott" <lists at darktech.org> wrote in message
> news:a3v67o$t2j at pub3.rc.vix.com...
> >
> > Hey all.. perhaps one of the developers can help me out here..
> >
> > I am in the process of writing a zone parser in ColdFusion for a web
> > application I am building. I have a fairly working parser in place,
> > but it...
>
> Writing a parser is not easy!
> If you do not know C, it is hard to understand the parser.
> I know that a lot of programmers do not write parsers directly in C because
> it is difficult; instead they use tools such as LEX, a sort of language for
> defining grammar rules that generates output code in C.
> Regards
> Marco.
I agree with Marco. Even a simplistic parser that doesn't blow up
with observed real-world data takes some effort if starting from
scratch. Here was my first stab at this sort of thing:
while (<ZONEFILE>) {
    $linenum++;
    next if /^$/ || /^\s*;/;
    if ($multi) {
        $original_line .= $_;               # Save the original input line
    } else {                                # so that it can be echoed in
        $original_line = $_;                # case there's an error.
    }
    s/([^\\]);.*$/$1/;                      # strip comments
    s/\s+$//;                               # chop trailing whitespace
    if ($multi || /(.*)\s\(/) {             # deal with multi-line records
        $n = ($multi) ? " " : $1;           # force issue if already known
        $n = ($n =~ s/[^\\]\"/$&/g) + 0;    # count unescaped double quotes
        unless ($n % 2) {                   # even count means unquoted ' ('
            if ($multi) {                   # already past the first line
                s/^\s+//;                   # remove leading whitespace
                $n = " ";                   # blank replaces removed newline
            } else {                        # still in the first line
                $n = "";                    # leave the first line intact
            }
            $multi .= "$n$_";               # append continuation fragment
            next if $multi !~ /(.*)\s\)/;   # next line if no ' )' found
            $n = $1;                        # save ' )' prefix for testing
            $n = ($n =~ s/[^\\]\"/$&/g) + 0;    # count unescaped double quotes
            next if ($n % 2);               # ' )' turned out to be quoted
            $_ = $multi;                    # else transfer concat'd lines
            $multi = "";                    # not doing this is a Bad Thing
        }
    }
    ################################################
    #
    # The "$_" variable now contains the assembled
    # DNS record for the tasks that need to be done.
    #
    ################################################
}
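One caveat about the quote-counting idiom used above: the pattern `[^\\]\"` needs a character in front of each quote, so it misses a quote at the very start of the string, the second quote of an empty string (`""`), and a quote that follows an escaped backslash. A minimal sketch of a character-by-character count that avoids those cases follows; `count_unescaped_quotes` is a hypothetical helper name, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script: count the
# double quotes in $text that are not cancelled by a preceding escape.
sub count_unescaped_quotes {
    my ($text) = @_;
    my ($count, $escaped) = (0, 0);
    for my $ch (split //, $text) {
        if ($escaped) {             # this character was escaped; skip it
            $escaped = 0;
        } elsif ($ch eq "\\") {     # a backslash escapes the next character
            $escaped = 1;
        } elsif ($ch eq '"') {      # an unescaped double quote
            $count++;
        }
    }
    return $count;
}

# The regex idiom counts only one quote here; the scan counts both.
print count_unescaped_quotes('host TXT "" ('), "\n";    # prints 2
```

An even result means any ' (' or ' )' that follows the counted prefix is outside a quoted string, which is the same test the loop above applies.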
Robustly handling non-real-world but still-legal syntax takes
a lot more effort. The following parser was recently
reverse-engineered from observing the behavior of the BIND 9
'named-checkzone' utility and how it dealt with various
syntactic pathologies. I don't claim that it is a 100%
translation of 'lib/isc/lex.c' into Perl but it's close:
$continuation_line = $open_paren_count = $open_quote = 0;
$line_num = 0;
#
# The following code block for reading text from a master zone data
# file and assembling that data into DNS records emulates the strict
# behavior of the BIND 9 lexer.  DNS records which are not compliant
# with the syntax specifications of RFC-1035 will be flagged as errors
# and skipped from further processing.
# Illegal zone file syntax that the lexers in BIND 4/8 mistakenly
# allowed must now be fixed.
#
while (<ZONEFILE>) {
    $line_num++;
    if ($continuation_line || /["()]/) {
        unless ($continuation_line) {
            $original_line = $_;
            $line_buffer = "";
            $split_line_num = $line_num;
            $bad_record = 0;
        } else {
            #
            # As a sanity check for unbalanced quotes/parentheses,
            # keep track of the accumulated length of the concatenated
            # continuation lines and quit reading the zone file if it
            # exceeds a certain threshold.
            # This threshold number was determined by deliberately
            # breaking a zone file and submitting it to the BIND 9
            # utility 'named-checkzone'.  The isc_lex_gettoken()
            # call fails with a "ran out of space" error after
            # running through about 131500 bytes (approximately
            # 3000 lines of a typical zone file).
            #
            $original_line .= $_;
            last if length($original_line) > 131500;
        }
        # Scan the line character-by-character keeping in mind
        # the following special characters and quoting hierarchy:
        #
        #   '\' -> An escape (backslash) cancels any special meaning
        #          of the character that immediately follows.  This
        #          includes backslashes, double-quotes, semicolons,
        #          left and right parentheses, and the newline.
        #
        #   '"' -> The double-quote character quotes whitespace, the
        #          ";()" characters which are special to RFC-1035,
        #          and escaped newlines until a matching unescaped
        #          double-quote is reached.
        #
        #   ';' -> Signifies the start of a comment.  It and
        #          the remaining characters that follow are
        #          ignored up to and including the next escaped
        #          or unescaped newline character.
        #
        #   '(' -> Signifies that subsequent newline characters
        #          are to be ignored, i.e., quoted, if encountered.
        #          Does not perform any other quoting function.
        #          May be nested to multiple levels.
        #
        #   ')' -> Cancels the effect of the "(" character at
        #          the current nesting level.  Newlines are still
        #          ignored until the outer-most opening "(" is
        #          balanced by the corresponding ")" character.
        #
        # 1. An escaped newline is only valid within matching
        #    double-quote characters.  The lexer will report
        #    an "unexpected end of input" error otherwise.
        #
        # 2. An unescaped newline effectively cancels an open
        #    double-quote character.  The lexer will report an
        #    "unbalanced quotes" error if this situation occurs.
        #    If, however, there are also one or more open parentheses
        #    in effect, the lexer will continue to scan for their
        #    closing ")" counterparts to try to complete the disposition
        #    of the defective record.  Quoting will be cancelled and
        #    not be toggled by subsequent double-quote characters until
        #    the balancing parentheses are found.
        #
        # 3. If the nesting level of parentheses goes negative, the
        #    lexer will immediately report the imbalance and discard
        #    the rest of the line.  If an odd number of double-quote
        #    characters are part of the refuse, this may have a
        #    side-effect of introducing an "unbalanced quotes" error
        #    in a subsequent line.  Since resynchronization has to
        #    occur at some point, however, the lexer's chosen priority
        #    is to balance parentheses.
        #
        chomp;              # chomp, not chop: a final line lacking a newline keeps its last character
        $last_char = "";    # don't carry an escape from the previous line
        while (length($_)) {
            ($char, $_) = split(//, $_, 2);
            if ($char eq "\\" && $last_char eq "\\") {
                #
                # An escape character which is itself escaped
                # becomes an ordinary backslash character.
                # Move it into the buffer and remove its ability
                # to escape the next character in the byte stream.
                #
                $line_buffer .= $char;
                $last_char = "";
                next;
            }
            if ($char eq '"' && $last_char ne "\\") {
                $open_quote = !$open_quote unless $bad_record;
                $line_buffer .= $char;
                $last_char = $char;
                next;
            }
            unless ($open_quote || $last_char eq "\\") {
                #
                # Encountering an unquoted and unescaped semicolon
                # marks the start of a comment.  There is no need
                # to scan the rest of the line.
                #
                last if $char eq ";";
                #
                # Unquoted and unescaped parentheses are not part of
                # the DNS resource record but an RFC-1035 construct
                # to ignore embedded newlines.
                # Keep track of them to maintain the current nesting
                # level but do not include these characters in the
                # line buffer of the record that we are assembling.
                #
                if ($char eq "\(") {
                    $open_paren_count++;
                    next;
                }
                if ($char eq "\)") {
                    $open_paren_count--;
                    last if $open_paren_count < 0;
                    next;
                }
            }
            if ($open_quote) {
                if ($char eq "\\" && $last_char ne "\\" && !length($_)) {
                    #
                    # An escaped newline has been encountered.
                    # Replace the backslash with a newline so
                    # it can be converted to BIND 9 presentation
                    # format in the next block.
                    #
                    $char = "\n";
                }
                if (ord($char) < 32) {
                    #
                    # Adopt the BIND 9 presentation format in which
                    # quoted non-printing characters other than a
                    # space get converted into a backslash character
                    # followed by the non-printing character's
                    # three-digit decimal equivalent ASCII value.
                    # An escaped newline followed by a tab, for
                    # example, would appear as "\010\009".
                    #
                    $line_buffer .= "\\0";
                    $line_buffer .= "0" if ord($char) < 10;
                    $line_buffer .= ord($char);
                } else {
                    $line_buffer .= $char;
                }
                $last_char = $char;
            } else {
                #
                # Preservation of cosmetic whitespace is unnecessary
                # since the assembled record will be parsed again into
                # its DNS components.
                #
                $char = " " if $char eq "\t";
                unless ($char eq " " && $last_char eq " ") {
                    $line_buffer .= $char;
                    $last_char = $char;
                }
            }
        }
        # Assess the situation now that the character scan
        # of the current line is complete.
        #
        if ($open_paren_count < 0) {
            print STDERR "Unbalanced parentheses; file '$spcl_file', line $line_num\n";
            print STDERR "> $original_line";
            $continuation_line = $open_paren_count = 0;
            next;
        }
        if ($open_quote && $last_char ne "\n") {
            print STDERR "Unbalanced quotes; file '$spcl_file', line $line_num\n";
            print STDERR "> $original_line";
            $open_quote = 0;
            $bad_record = 1 if $open_paren_count;
            next;
        }
        $continuation_line = $open_quote + $open_paren_count;
        next if $continuation_line || $bad_record;
        $_ = $line_buffer;
        s/\s+$//;               # chop trailing whitespace
        next if /^$/;           # line was only a comment
    } else {
        next if /^\s*$/ || /^\s*;/;
        $original_line = $_;
        s/([^\\]);.*/$1/;       # strip comments
        if (/\\$/) {
            #
            # Escaped newlines are only valid when quoted.
            #
            print STDERR "Unexpected end of input; file '$spcl_file', line $line_num\n";
            print STDERR "> $original_line";
            next;
        }
        s/\s+$//;               # chop trailing whitespace
    }
    ################################################
    #
    # The "$_" variable now contains the assembled
    # DNS record for the tasks that need to be done.
    #
    ################################################
}
&CLOSE(*ZONEFILE);
if ($open_quote || $open_paren_count) {
    $char = ($open_quote) ? "quotes" : "parentheses";
    print STDERR "Unable to process file '$spcl_file'\ndue to unbalanced $char. The syntax problem begins at line $split_line_num.\n";
}
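For experimenting with the quoting rules described in the comments above without a zone file on disk, here is a minimal sketch that applies them to a string in a single pass. The name `assemble_records` is hypothetical. Unlike the loop above, it simply dies on the first unbalanced quote, unbalanced parenthesis, or unquoted escaped newline instead of reporting the error and resynchronizing. Treat it as an illustration of the quoting hierarchy under those assumptions, not as a translation of 'lib/isc/lex.c'.

```perl
use strict;
use warnings;

# Hypothetical helper: split zone-file text into assembled records.
sub assemble_records {
    my ($text) = @_;
    my @records;
    my ($buf, $quoted, $depth, $comment) = ("", 0, 0, 0);
    my @chars = split //, $text;
    while (@chars) {
        my $c = shift @chars;
        if ($comment) {                 # discard comment text up to the newline
            next if $c ne "\n";
            $comment = 0;               # the newline itself is processed below
        }
        if ($c eq "\\" && @chars) {     # an escape cancels any special meaning
            my $e = shift @chars;
            if ($e eq "\n") {           # escaped newline: only valid when quoted
                die "unexpected end of input\n" unless $quoted;
                $buf .= sprintf("\\%03d", ord $e);
            } else {
                $buf .= "\\$e";         # keep the escape pair verbatim
            }
            next;
        }
        if ($c eq '"') {                # toggle quoting, keep the quote itself
            $quoted = !$quoted;
            $buf .= $c;
            next;
        }
        if ($quoted) {
            die "unbalanced quotes\n" if $c eq "\n";
            # Non-printing characters become \DDD, as in the code above.
            $buf .= ord($c) < 32 ? sprintf("\\%03d", ord $c) : $c;
            next;
        }
        if ($c eq ";") { $comment = 1; next; }
        if ($c eq "(") { $depth++;     next; }
        if ($c eq ")") {
            die "unbalanced parentheses\n" if --$depth < 0;
            next;
        }
        if ($c eq "\n" && $depth == 0) {    # an unquoted newline ends the record
            $buf =~ s/ +$//;
            push @records, $buf if length $buf;
            $buf = "";
            next;
        }
        if ($c =~ /\s/) {               # collapse runs of unquoted whitespace
            $buf .= " " unless $buf =~ / $/;
            next;
        }
        $buf .= $c;
    }
    die "unbalanced quotes\n"      if $quoted;
    die "unbalanced parentheses\n" if $depth;
    $buf =~ s/ +$//;
    push @records, $buf if length $buf;
    return @records;
}

my $zone = <<'EOT';
@ IN SOA ns1 admin ( 1 ; serial
    3600 ) ; done
www IN TXT "a;b (c)" ; note
EOT
print "$_\n" for assemble_records($zone);
```

With the sample text, the two records printed are `@ IN SOA ns1 admin 1 3600` and `www IN TXT "a;b (c)"`: the parentheses and comments are gone, while the quoted semicolon and parentheses survive.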
Enjoy,
Andris Kalnozols
Hewlett-Packard Laboratories
andris at hpl.hp.com