[bind10-dev] arbitrary character and plain char
JINMEI Tatuya / 神明達哉
jinmei at isc.org
Tue May 8 01:17:43 UTC 2012
At Mon, 30 Apr 2012 09:37:57 +0200,
Michal 'vorner' Vaner <michal.vaner at nic.cz> wrote:
> I agree that signed char can cause problems, but I believe it's the other way
> around ‒ the \255 is not safe, but a binary byte with high value is (assuming we
> never try to act as if it is a number).
>
> If we have code:
> char c;
> read(0, &c, 1);
> char c = d;
Do you mean "d = c" here? Assuming so...
> write(1, &d, 1);
>
> It'll work even with bytes >127, the same thing as was put inside will be
> output. I believe this is guaranteed by the fact that any data type is
> pointer-aliasable to array of unsigned chars and we use the char only as a place
> where we store data by a pointer here, not a variable with some actual
> properties.
I don't think this would (necessarily) work as we wish.
As I understand it, at the point of read(), the one-byte region of
memory will be filled with some implementation-defined bit pattern
corresponding to the (unsigned) 8-bit value retrieved by the read() call.
But if, for example, the corresponding unsigned char value for the
retrieved data is 255,
char d = c;
will essentially mean
char d = 255;
and, where plain char is signed, the result is implementation-defined.
Naturally, we cannot be sure, portably, what kind of bit pattern will
be written out in the subsequent write() call.
If, for example, the code were
char c;
read(0, &c, 1);
write(1, &c, 1);
then it would probably preserve the retrieved bit pattern regardless of
its value. In this case, the char is indeed used as an opaque
placeholder. But that doesn't mean anything for us, because for the
purposes of the lexer, we cannot handle it as completely opaque data; we
need to examine the character to see whether it's a '"', '(', ' ',
etc. Once we do something like
if (c == '"') {
// start a quoted string...
}
the compiler will need to (re)interpret c as a plain char value, and
the magic bit pattern stored there could make the expression true even
if the value is not actually ord('"') = 0x22.
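To make the concern concrete, here is a minimal sketch of examining a
byte as an unsigned 8-bit value, so the comparison does not depend on
whether plain char is signed. This is not actual BIND 10 code: the
names CharClass and classify are hypothetical, and the hex constants
assume an ASCII-compatible encoding of the zone file.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical classification of one input byte (not BIND 10 code).
enum class CharClass { Quote, LeftParen, RightParen, Space, HighByte, Other };

// The byte is taken as uint8_t, so values above 127 stay in 128..255
// instead of possibly becoming negative as they could in a signed char.
CharClass classify(std::uint8_t byte) {
    if (byte >= 128) {
        return CharClass::HighByte;
    }
    switch (byte) {
    case 0x22: return CharClass::Quote;       // '"' (ASCII assumed)
    case 0x28: return CharClass::LeftParen;   // '('
    case 0x29: return CharClass::RightParen;  // ')'
    case 0x20: return CharClass::Space;       // ' '
    default:   return CharClass::Other;
    }
}
```

A caller would read into an unsigned char (or uint8_t) buffer and pass
each element to classify(), so no value ever goes through plain char.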
Anyway...
> I don't disagree with the fact we should switch to arrays of unsigned chars, or
> more correctly, uint8 variables. It's just I don't think mandating them to be
> spelled as \255 will help in this regard anyhow (unless we really handle the
> case of char = signed char and high bit somehow explicitly).
Okay, so let's focus on whether it works if we generally assume an
escaped form for character values larger than 127, rather than
discussing tricky language details. What I'd suggest is:
State that a zone file that our lexer accepts must consist of
character values smaller than 128 (so that they can safely be
represented by a plain char). If the file contains other character
values, it may or may not work (actually I guess it will happen to
work on many platforms in practice even if we naively handle them
via plain chars), but we say technically the behavior is undefined.
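If we wanted to detect such input rather than leave it undefined, a
check along the following lines might do. This is only a sketch under
my own assumptions: firstNonAscii is a hypothetical name, and holding
the data in a std::string is just for illustration, not a statement
about how the lexer reads its input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical check (not BIND 10 code): return the offset of the first
// byte whose value is 128 or larger, or std::string::npos if none exists.
std::size_t firstNonAscii(const std::string& data) {
    for (std::size_t i = 0; i < data.size(); ++i) {
        // Convert through unsigned char so the comparison does not depend
        // on whether plain char is signed on this platform.
        if (static_cast<unsigned char>(data[i]) >= 128) {
            return i;
        }
    }
    return std::string::npos;
}
```

The conversion from char to unsigned char is well defined for any
value, so the check itself behaves the same whichever way plain char
is signed.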
Now I don't understand why you don't think it will work...could you be
more specific about how exactly it will fail?
---
JINMEI, Tatuya
Internet Systems Consortium, Inc.