sed and encodings

Mon Jan 19 20:56:57 UTC 2009

Julien ÉLIE wrote:
> Hi William,
> >    sed 's/.*/x/'
> > normally should replace an arbitrary string by a single x.
> > The dot, however, does not match non-ASCII characters any more!
> I replaced all the occurrences of sed /.*/ with cut or perl.

I would expect that cut and perl display the same kind of behaviour as
sed on these systems. If you tell the program that its input is UTF-8,
but you feed it something completely different, then the "garbage in,
garbage out" principle applies. What should a program really do when it
receives only half a character?

A reasonable solution might be to force the locale to one that doesn't
support multibyte characters (e.g. LANG=C, or LC_ALL=C if LC_ALL or
LC_CTYPE is already set, since those take precedence over LANG) and
treat all input as a stream of bytes.
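
A minimal sketch of that workaround, assuming a POSIX shell. The sample
input (the byte 0xE9, a Latin-1 "é", which is not valid UTF-8 on its
own) is an illustrative choice and not taken from the original report.

```sh
# "café" in Latin-1: the last character is the single byte 0xE9 (octal 351),
# which a UTF-8 decoder sees as the start of an incomplete sequence.
# LC_ALL=C is set for this one sed invocation only, so "." matches any
# single byte regardless of what LANG or LC_CTYPE say in the environment.
result=$(printf 'caf\351\n' | LC_ALL=C sed 's/.*/x/')

printf '%s\n' "$result"    # prints: x
```

Setting the variable on the command line, rather than exporting it,
keeps the rest of the script in the user's normal locale.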

However, I don't have a system on which to test this: sed with .* still
works on all the machines I have access to, even when I feed it broken
UTF-8 input.
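
For anyone who wants to check their own machine, something along these
lines might do. It is only a sketch: the `verdict` variable and the
sample byte are made up for illustration, and it assumes `locale -a`
lists at least one locale whose name ends in "UTF-8" or "utf8".

```sh
# Pick any installed UTF-8 locale; the exact name varies between systems.
utf8_locale=$(locale -a 2>/dev/null | grep -i -E 'utf-?8$' | head -n 1)

if [ -z "$utf8_locale" ]; then
    verdict="no UTF-8 locale installed, cannot test"
else
    # Feed sed one line ending in a stray 0xE9 byte (invalid as UTF-8)
    # while telling it that the input is UTF-8.
    out=$(printf 'caf\351\n' | LC_ALL="$utf8_locale" sed 's/.*/x/')
    if [ "$out" = "x" ]; then
        verdict="not affected"   # .* swallowed the invalid byte too
    else
        verdict="affected"       # something other than a lone x came back
    fi
fi

printf '%s\n' "$verdict"
```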

Greetings,
Johan