MIME Distribution: headers

Julien ÉLIE julien at trigofacile.com
Sat Aug 22 19:18:32 UTC 2009


Hi,

>> Another question:  with the active.times file, I do not know what is the
>> best we can do in order to write the newsgroup creator's name in
>> UTF-8...  I think that only ctlinnd matters for that (mod-active and
>> controlchan write "usenet" or something like that -- I have not
>> checked).  Is putting a warning in the man page of ctlinnd enough?  The
>> encoding depends on the one of the shell used!
>
> Yeah, that one is hard.  I'm not sure there's any really good solution
> there other than a warning... I guess the other option would be to check
> the string we're about to write to be sure it's correctly formed UTF-8,
> and if it isn't, fail with an error instead of creating the group.
>
> We probably need a general function to check for correctly formed UTF-8
> anyway.

Does someone know a good UTF-8 checker to validate a string?

I have seen the regexp here:
    http://www.w3.org/International/questions/qa-forms-utf-8

$field =~
  m/\A(
     [\x09\x0A\x0D\x20-\x7E]            # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*\z/x;


But I have not yet found a good, standalone implementation in C.

For instance, this implementation <http://snowplow.org/martin/utf8checker/>
accepts control characters like \x08, which the regexp above rejects.
(It is released under the Apache License v2 but, as the README file says,
we could ask for GPL v2 to match what most of INN uses.)

Perl implements it fine in utf8.c (Perl_is_utf8_string) but it is not
standalone.

Maybe the most straightforward method would be to write a direct check
in C that follows the regexp, alternative by alternative?
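
Something along these lines, as a rough sketch only.  The name
is_valid_utf8 is hypothetical (it is not an existing INN function), and I
assume the string is NUL-terminated.  Each branch mirrors one or more
alternatives of the regexp, so most control characters are rejected,
like in the W3C version.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if c is a UTF-8 continuation byte (0x80-0xBF). */
static bool
is_cont(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/*
 * Hypothetical helper:  returns true if the NUL-terminated string text is
 * well-formed UTF-8 according to the W3C regexp (TAB, LF, CR and printable
 * ASCII only; no overlong forms; no surrogates; nothing above U+10FFFF).
 * A NUL byte is never a valid trailing byte, so the short-circuit
 * evaluation below never reads past the end of the string.
 */
bool
is_valid_utf8(const char *text)
{
    const unsigned char *p = (const unsigned char *) text;

    while (*p != '\0') {
        unsigned char c = p[0];
        unsigned char lo = 0x80, hi = 0xBF; /* Allowed range for p[1]. */

        if (c == 0x09 || c == 0x0A || c == 0x0D
            || (c >= 0x20 && c <= 0x7E)) {
            /* ASCII. */
            p += 1;
        } else if (c >= 0xC2 && c <= 0xDF) {
            /* Non-overlong 2-byte sequence. */
            if (!is_cont(p[1]))
                return false;
            p += 2;
        } else if (c >= 0xE0 && c <= 0xEF) {
            /* 3-byte sequence. */
            if (c == 0xE0)
                lo = 0xA0; /* Excluding overlongs. */
            if (c == 0xED)
                hi = 0x9F; /* Excluding surrogates. */
            if (p[1] < lo || p[1] > hi || !is_cont(p[2]))
                return false;
            p += 3;
        } else if (c >= 0xF0 && c <= 0xF4) {
            /* 4-byte sequence. */
            if (c == 0xF0)
                lo = 0x90; /* Planes 1-3, excluding overlongs. */
            if (c == 0xF4)
                hi = 0x8F; /* Plane 16, nothing above U+10FFFF. */
            if (p[1] < lo || p[1] > hi || !is_cont(p[2]) || !is_cont(p[3]))
                return false;
            p += 4;
        } else {
            /* Control character, stray continuation byte, 0xC0, 0xC1
               or 0xF5-0xFF. */
            return false;
        }
    }
    return true;
}
```

ctlinnd could then call such a function on the creator's name and fail
with an error when it returns false, instead of creating the group.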

-- 
Julien ÉLIE

« Les mathématiques peuvent être définies comme une science
  dans laquelle on ne sait jamais de quoi on parle, ni si ce qu'on dit
  est vrai. » (Bertrand Russell) 




More information about the inn-workers mailing list