sed and encodings

Julien ÉLIE julien at trigofacile.com
Mon Jan 19 19:39:09 UTC 2009


An interesting pointer:
    http://linuxproblem.org/art_21.html

%%
sed behaving strangely in UTF-8 environment

You are using a Linux distribution with UTF-8 encoding
such as SuSE 9.1.  You are using sed to operate on files
containing German umlauts or other non-ASCII characters.
sed is behaving quite strangely:  an expression like

    sed 's/.*/x/'

normally should replace an arbitrary string by a single x.
The dot, however, does not match non-ASCII characters any more!

The problem occurs if you operate on ISO-8859 (Latin)
encoded files.  A non-ASCII character is misinterpreted
in UTF-8 as a sequence of characters or - even worse -
as an invalid UTF-8 string.  So sed classifies the character
as something not being matched by a dot.  Strange and dangerous...
%%
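
The behaviour described in that excerpt can be reproduced with a small
sketch like the one below.  It assumes GNU sed and assumes that a UTF-8
locale named C.UTF-8 is installed (substitute any UTF-8 locale you have);
the helper name latin1_line is only for illustration.  The second command
runs the same expression under the C locale, where every byte counts as
one character, for comparison.

```shell
#!/bin/sh
# Emit one Latin-1 encoded line: "caf" followed by byte 0xE9
# (e-acute in ISO-8859-1).  \351 is the octal escape for 0xE9;
# on its own, that byte is not a valid UTF-8 sequence.
latin1_line() {
    printf 'caf\351\n'
}

# Assumption: C.UTF-8 exists on this system.  In a UTF-8 locale the dot
# may refuse to match the invalid byte, leaving it behind after the x.
latin1_line | LC_ALL=C.UTF-8 sed 's/.*/x/' | od -c

# In the C locale each byte is a single character, so the dot matches
# the 0xE9 byte too and the whole line is replaced.
latin1_line | LC_ALL=C sed 's/.*/x/' | od -c
```

The output is piped through od -c so that a leftover 0xE9 byte is
visible even on a terminal that cannot display it.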


That's weird.
Does anybody know how to prevent sed from behaving like that?

It is becoming a real problem!

-- 
Julien ÉLIE

« Loving unconditional means forgiving and learning to live
  with his imperfections.  Because in the end
  you'll realize that it is what you love the most. »



