BIND 10 #2375: implement MasterLexer getNextToken() and ungetToken()
BIND 10 Development
do-not-reply at isc.org
Tue Nov 20 10:19:38 UTC 2012
#2375: implement MasterLexer getNextToken() and ungetToken()
-------------------------------------+-------------------------------------
            Reporter:  jinmei        |                 Owner:  jinmei
                Type:  task          |                Status:  reviewing
            Priority:  medium        |             Milestone:  Sprint-20121120
           Component:  libdns++      |            Resolution:
            Keywords:                |             Sensitive:  0
     Defect Severity:  N/A           |           Sub-Project:  DNS
 Feature Depending on Ticket:        |  Estimated Difficulty:  5
   loadzone-ng                       |           Total Hours:  0
 Add Hours to Ticket:  0             |
           Internal?:  0             |
-------------------------------------+-------------------------------------
Comment (by jinmei):
Replying to [comment:12 vorner]:
> Sometimes, I wonder if you write more code or more trac comments :-P
In this particular case I had a lot of time to kill on my flight (and
writing code wasn't the best choice as I wanted to keep the battery
alive longer :-)
> Replying to [comment:7 jinmei]:
> > '''missing other signature of getNextToken'''
> >
> > It's probably not clear from the ticket's description, but it was
> > actually intended to implement another version of getNextToken
> > (I thought it was mentioned in the main ticket of `MasterLexer`).
> > It's the equivalent of BIND 9's isc_lex_getmastertoken(). This can
> > wait, however, until we implement the rdata extensions to support
> > the generic lexer. In that case we'll need to create a ticket for it.
>
> To clarify, so it should be a thin wrapper around the current
> getNextToken?
Yes (depending on the definition of "thin", now that the body of our
getNextToken() is also pretty small). The other version of
`getNextToken` would call the currently defined one with some
particular set of options, and handle the cases of an unexpected
EOL or an unexpected string when a number is requested.
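As a rough illustration, the wrapper could look something like this
(the signature, the option mapping and the error type below are my
assumptions for the sketch, not a committed design):
{{{#!cpp
MasterLexer::Token
MasterLexer::getNextToken(Token::Type expect, bool eol_ok) {
    // Map the expected token type to lexer options (assumption: QSTRING
    // is the only type that needs a special option).
    Options options = NONE;
    if (expect == Token::QSTRING) {
        options = QSTRING;
    }
    const Token token = getNextToken(options);
    if ((token.getType() == Token::END_OF_LINE ||
         token.getType() == Token::END_OF_FILE) && !eol_ok) {
        isc_throw(SomeLexerError, "unexpected end of input"); // hypothetical
    }
    if (expect == Token::NUMBER && token.getType() == Token::STRING) {
        isc_throw(SomeLexerError, "expected a number"); // hypothetical
    }
    return (token);
}
}}}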
> > Secondly, I don't think we need to catch exceptions (and recover from
> > it) in `getNextToken`. [...]
>
> I'm not sure. I could remove it, but I think it is better this way for
> two reasons:
> * The general approach of trying to be safe.
> * In case a big master file includes another one and the other one is
>   not readable for some reason, it might be good to continue parsing
>   (and reporting errors) at least for the big one. It is probably not
>   that important, because it would be a rather rare occasion, but I
>   don't think it would be that hard to implement either.
>
> So, I'd ask. Is there a disadvantage to handling the exceptions?
At least an explicit try-catch can be expensive (even in the normal
case where no exception is thrown), depending on how exception
handling is implemented in that environment (compiler/std C++ lib).
Secondly, try-catch complicates the code logic (just like any explicit
error handling does). If we need an explicit try-catch immediately
above its helper class, that seems to indicate a design problem to me,
because we've lost one big advantage of exceptions: deferring the
explicit error handling to a sufficiently high layer that has enough
context to tell how to deal with the error, keeping the rest of the
code simpler.
So, to me, the question to be asked is the opposite: whether there's
any advantage to handling the exception here.
And I don't see much in this case; it's rethrown anyway, so the upper
layer still needs to handle it properly, and I suspect it's not only
rare (as you noted above) but also quite hopeless to recover from. If
we encounter an IO error after successfully opening the file, it's
very unlikely that the next read from the stream/file will succeed.
So the only meaningful recovery action the upper layer
(`MasterLoader`) can take is to pop this source and try to continue.
But just calling ungetToken() doesn't ensure we get successfully back
to the state at the time we pushed this source.
As such, I'd still remove the try-catch. But as I wrote these
comments I realized that the main purpose of catching the exception
is to ensure `ungetToken` is called when an exception happens. As
already stated I doubt its effectiveness, but if we still want to
ensure it, we can do it with a guard object:
{{{#!cpp
namespace {
// RAII guard: calls ungetToken() on destruction unless release()d first.
class NextTokenGuard {
public:
    NextTokenGuard(MasterLexer* lexer) : lexer_(lexer) {}
    ~NextTokenGuard() {
        if (lexer_ != NULL) {
            lexer_->ungetToken();
        }
    }
    void release() { lexer_ = NULL; }
private:
    MasterLexer* lexer_;
};
}

MasterLexer::Token
MasterLexer::getNextToken(Options options) {
    ...
    NextTokenGuard guard(this);
    impl_->token_ = Token(Token::NO_TOKEN_PRODUCED);
    ...
    guard.release();
    return (impl_->token_);
}
}}}
Maybe you won't like it because the total amount of code in this
version is even bigger than just using try-catch. But at least it
simplifies the main code of `getNextToken()` and avoids the
performance issue under any try-catch implementation unless an
exception is actually thrown.
> > Thirdly (probably a minor point), the BIND 9 version of ungettoken
> > doesn't reset last_was_eol. [...]
>
> I don't know if it is a bug on the bind9 side, or this case was just
> never needed. But from the logical point of view, it should be reset,
> at least if the job description for ungetToken is to return to the
> state before the last token. I think that "peeking" at a token should
> be a safe and side-effect-free operation. If there's something that
> depends on the fact that ungetToken does not return the last-eol
> condition back in bind9, I think it is time to rethink the approach,
> because that's hard to understand and read. Such code is expected in
> perl, not in C++ ;-).
I agree that your version is more logical. But I don't think it wise
to just ignore the fact that the BIND 9 version has been working for
quite a long time against many very unusual or even broken inputs.
So, while I'm okay with resetting last_was_eol in the end, I suggest
you really think about any problematic cases that the difference
could cause, rather than just dismissing it "because it's more
logical". I'd also ask this simple question (why it doesn't
preserve/restore last_was_eol) at bind10-dev (some core BIND 9
developers subscribe to it). Actually, I suspect it's unlikely you'll
get an answer, but then we'd at least know there's no simple reason
for it, if any.
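Just to make the alternative concrete, the preserve-and-restore
variant we are discussing would be along these lines (the member names
here are hypothetical, not the actual internals of the branch):
{{{#!cpp
void
MasterLexer::ungetToken() {
    if (!impl_->has_previous_token_) {      // hypothetical flag
        isc_throw(isc::InvalidOperation, "No token to unget");
    }
    impl_->has_previous_token_ = false;
    impl_->token_restored_ = true;
    // The point of contention: BIND 9's isc_lex_ungettoken() leaves
    // last_was_eol untouched, while restoring the saved value here
    // makes ungetToken() a true undo of the last getNextToken().
    impl_->last_was_eol_ = impl_->previous_was_eol_;
}
}}}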
> I looked at the current master, which already contains the string
> states. There doesn't seem to be any state that returns something
> other than NULL from its handle. The first question therefore is, is
> this OK for tests or is it incomplete then? The second, if there's no
> such state, will there be? If not, could we simplify the design by a
> lot?
Hmm, I'm not sure how it relates to this discussion, but that's a good
question anyway. After clarifying and simplifying the state
transitions for our limited needs, we basically have only two cases:
- everything is completed within State::start(): start() sets the
token and returns NULL
- one transition from start() to another derived state: each such
state terminates the transition, and its handle() always sets the
token and returns NULL
This won't change when we complete the transitions with the NUMBER
state.
I'm not sure it simplifies the code, though. Instead of the for
loop, we could do this:
{{{#!cpp
const State* state = start(options); // or State::start(lexer, options)
if (state != NULL) {
    state->handle(*this);
}
}}}
or, we might introduce a null state class and write as follows:
{{{#!cpp
start(options)->handle(*this);
// or State::start(*this, options)->handle(*this);
}}}
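For completeness, the null state could be a trivial singleton whose
handle() is a no-op (a sketch; the names are made up):
{{{#!cpp
namespace {
class NullState : public State {
public:
    // Reaching this state means the token was already set in start(),
    // so there is nothing left to do.
    virtual void handle(MasterLexer&) const {}
};
const NullState NULL_STATE;
}
// start() would then return &NULL_STATE instead of NULL.
}}}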
> I'm not sure it can be reasonably tested by read-only accessors. The
> goal of the fake state is to change the internal state and mess
> around with it.
I know, but since `getNextToken()` itself doesn't rely on the internal
states much, I don't see the need for arbitrarily modifying the
states.
> > - we should probably unify NOT_STARTED and NO_TOKEN_PRODUCED. The
> >   former is only used as an initial placeholder value.
>
> I'm not sure. I was thinking about that, but then I decided not to.
> The situations are really different. The NOT_STARTED is part of the
> normal lifetime of the lexer. The NO_TOKEN_PRODUCED should actually
> never ever happen except for the short time in which the new token is
> being computed. Considering the enum values are cheap (the enum
> already takes a whole byte, so adding a few more changes nothing), I
> don't see a compelling reason to have them unified. I think that
> could be more confusing than having them separated.
But NOT_STARTED is also never expected to be visible outside
`MasterLexer`, and in that sense it seems quite similar to
NO_TOKEN_PRODUCED. But I don't insist on it strongly.
> > - getNextToken(): the behavior on atEOF() is different from that of
> >   the BIND 9 version:
> > {{{#!c
> > if (isc_buffer_remaininglength(source->pushback) == 0 &&
> >     source->at_eof)
> > {
> >     if ((options & ISC_LEXOPT_DNSMULTILINE) != 0 &&
> >         lex->paren_count != 0) {
> >         lex->paren_count = 0;
> >         return (ISC_R_UNBALANCED);
> >     }
> >     if ((options & ISC_LEXOPT_EOF) != 0) {
> >         tokenp->type = isc_tokentype_eof;
> >         return (ISC_R_SUCCESS);
> >     }
> >     return (ISC_R_EOF);
> > }
> > }}}
> > (Note: in our implementation we can assume LEXOPT_DNSMULTILINE and
> > LEXOPT_EOF).
>
> I'm probably lost here. I added a test to see the unbalanced
> parenthesis is reported (and it is). So, what other difference is
> there? I'm not really confident in bind9 code, I didn't write any
> bit of it O:-).
I didn't write any bit of lex.c and am not particularly familiar with
it either; I've read this implementation for the purpose of this
porting mostly as a first-time reader. So, in this context we
shouldn't be so different. And the difference between the two versions
on this point was pretty clear to me. This check takes place before
doing any state transition:
{{{#!c
if (isc_buffer_remaininglength(source->pushback) == 0 &&
    source->at_eof)
{
    if ((options & ISC_LEXOPT_DNSMULTILINE) != 0 &&
        lex->paren_count != 0) {
        lex->paren_count = 0;
        return (ISC_R_UNBALANCED);
    }
    if ((options & ISC_LEXOPT_EOF) != 0) {
        tokenp->type = isc_tokentype_eof;
        return (ISC_R_SUCCESS);
    }
    return (ISC_R_EOF);
}
}}}
and the equivalent part of your version is:
{{{#!cpp
if (impl_->source_ == NULL || impl_->source_->atEOF()) {
    isc_throw(isc::InvalidOperation, "No source to read tokens from");
}
}}}
so, in the BIND 9 version, if the caller calls "get (next) token"
after the input source reaches EOF, it returns an EOF token
(or an unbalanced error), while your version throws an exception
(which would indicate a bug in the caller).
When I made this comment I didn't know whether this causes any real
disruption - I assumed that was beyond the responsibility of a
reviewer. But now that we've had this discussion, I've taken a closer
look at it. Assume we are parsing the following (broken) RR in
"lenient" mode.
{{{
foo.example.com. IN TXT "incomplete qstring<EOF>
}}}
What we'd expect in this case is that the unbalanced qstring is
detected (resulting in an ERROR token), and that the next call to
getNextToken() returns EOF. If I read it correctly, BIND 9's
implementation works that way, but your version doesn't. It can be
confirmed by the following test:
{{{#!cpp
TEST_F(MasterLexerTest, eofAfterUnbalancedQstring) {
    ss << "\"unbalanced then EOF";
    lexer.pushSource(ss);
    EXPECT_EQ(MasterLexer::Token::ERROR,
              lexer.getNextToken(MasterLexer::QSTRING).getType());
    // We've reached EOF, and the next call to getNextToken() should
    // return it.
    EXPECT_EQ(MasterLexer::Token::END_OF_FILE,
              lexer.getNextToken().getType());
}
}}}
Another example is an unbalanced '(' followed by EOF. We first need
to detect the error and then detect EOF; see the test sketch below.
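Something like this (a sketch following the pattern of the test above;
I haven't run it):
{{{#!cpp
TEST_F(MasterLexerTest, eofAfterUnbalancedParen) {
    ss << "(";
    lexer.pushSource(ss);
    // The open parenthesis followed by EOF should first produce an
    // error token (UNBALANCED_PAREN), and only the next call should
    // report EOF.
    EXPECT_EQ(MasterLexer::Token::ERROR, lexer.getNextToken().getType());
    EXPECT_EQ(MasterLexer::Token::END_OF_FILE,
              lexer.getNextToken().getType());
}
}}}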
So, we should clearly do something, at least if we want to be
compatible with BIND 9.
One obvious way would be to add the same code logic as BIND 9's to
`getNextToken()`. But from a closer look, I guess we can probably
just stop throwing the exception; the EOF will be handled in the
start() phase. This way we can handle the above cases as intended
and avoid duplicate code (although we then cannot detect a truly
redundant call to `getNextToken()` after EOF). I also noticed that
if we fix the code that way, we probably won't need the atEOF()
method.
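In other words, the entry check of `getNextToken()` would shrink to
something like this (a sketch of the suggested fix, not tested code):
{{{#!cpp
MasterLexer::Token
MasterLexer::getNextToken(Options options) {
    if (impl_->source_ == NULL) {
        isc_throw(isc::InvalidOperation, "No source to read tokens from");
    }
    // No atEOF() check here: once the source is exhausted, the start()
    // phase itself produces the END_OF_FILE token (or UNBALANCED_PAREN
    // if we are still within parentheses).
    ...
}
}}}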
> > - this might be safe depending on what happens in handle(), but I'd
> >   protect state in scoped_ptr:
> > {{{#!cpp
> > const State* state(State::getFakeState(NULL, 0, &token_));
> > state->handle(*this);
> > delete state;
> > }}}
>
> I addressed this. But I didn't notice having `const` in my code. That
could
> suggest you did your usual constify run through the code, but forgot to
push.
> If it is the case, could you push to the original branch and I'd cherry-
pick it
> onto the current branch?
Ah, right. I've just pushed it. I also made another minor editorial
fix on the old branch, which has been pushed too.
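For reference, the scoped_ptr protection suggested in the quoted
comment would be along these lines (assuming getFakeState() returns a
heap-allocated state, as the original snippet implies):
{{{#!cpp
#include <boost/scoped_ptr.hpp>

const boost::scoped_ptr<const State> state(
    State::getFakeState(NULL, 0, &token_));
state->handle(*this); // the state is deleted even if handle() throws
}}}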
'''comments on the revised version of code'''
- `MasterLexerTest::getUnbalanced`: what we really need to test is
whether we can get an EOF after detecting `UNBALANCED_PAREN`;
see the discussion above.
- master_lexer_unittest.cc: in the following comment "getToken" should
be "getNextToken"?
{{{#!cpp
// By calling getToken again, we verify even the source got back to
// original (as the second fake state checks it gets "234"). We must
// push it as a fake start again so it is picked.
}}}
also, maybe "fake start" should be "fake state"? (It still doesn't
quite parse for me, although I think I can understand what it's
trying to say.)
'''other general things'''
- maybe I forgot to comment on this: I wonder if it would be better to
return a `const MasterLexer::Token&` referring to the internal
`token_` from `getNextToken()`. While copying a `Token` is probably
not very expensive in this case, I don't see any need to make a copy
here anyway (in the case of a string, the copied `Token` object holds
a reference to a placeholder within `MasterLexer`, which will be
quite ephemeral anyway).
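Concretely, the signature change I have in mind is just (a sketch):
{{{#!cpp
const MasterLexer::Token&
MasterLexer::getNextToken(Options options) {
    ...
    return (impl_->token_); // a reference to the lexer's internal token
}
}}}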
--
Ticket URL: <http://bind10.isc.org/ticket/2375#comment:13>
BIND 10 Development <http://bind10.isc.org>