How does Perl's length() function count Unicode characters? - perl

Why does length() say this is 4 logical characters (I would expect it to say 1)?
$ perl -lwe 'print length("🐪")'
4
I guess something is wrong with my expectation. :-) What is it?

Unless you tell Perl that the source code of the script is in UTF-8, Perl treats it as a sequence of single bytes. This means that by default the Perl interpreter sees 🐪 as 4 separate characters, one per UTF-8 byte. If you change your one-liner to perl -Mutf8 -lwe 'print length("🐪")', length gives the output you expect.
The utf8 pragma tells Perl that the source unit is in utf8 and not ASCII. See perldoc utf8 for more info.

Related

Wide character error in print statement in perl [duplicate]

If I run the following Perl program:
perl -e 'use utf8; print "鸡\n";'
I get this warning:
Wide character in print at -e line 1.
If I run this Perl program:
perl -e 'print "鸡\n";'
I do not get a warning.
I thought use utf8 was required to use UTF-8 characters in a Perl script. Why does this not work and how can I fix it? I'm using Perl 5.16.2. I have the same issue if this is in a file instead of being a one liner on the command line.
Without use utf8 Perl interprets your string as a sequence of single byte characters. There are four bytes in your string as you can see from this:
$ perl -E 'say join ":", map { ord } split //, "鸡\n";'
233:184:161:10
The first three bytes make up your character, the last one is the line-feed.
The call to print sends these four characters to STDOUT. Your console then works out how to display these characters. If your console is set to use UTF8, then it will interpret those three bytes as your single character and that is what is displayed.
If we add in the utf8 pragma, things are different. In this case, Perl interprets your string as just two characters.
$ perl -Mutf8 -E 'say join ":", map { ord } split //, "鸡\n";'
40481:10
By default, Perl's IO layer assumes that it is working with single-byte characters. So when you try to print a multi-byte character, Perl thinks that something is wrong and gives you a warning. As ever, you can get more explanation for this error by including use diagnostics. It will say this:
(S utf8) Perl met a wide character (>255) when it wasn't expecting
one. This warning is by default on for I/O (like print). The easiest
way to quiet this warning is simply to add the :utf8 layer to the
output, e.g. binmode STDOUT, ':utf8'. Another way to turn off the
warning is to add no warnings 'utf8'; but that is often closer to
cheating. In general, you are supposed to explicitly mark the
filehandle with an encoding, see open and perlfunc/binmode.
As others have pointed out you need to tell Perl to accept multi-byte output. There are many ways to do this (see the Perl Unicode Tutorial for some examples). One of the simplest ways is to use the -CS command line flag - which tells the three standard filehandles (STDIN, STDOUT and STDERR) to deal with UTF8.
$ perl -Mutf8 -e 'print "鸡\n";'
Wide character in print at -e line 1.
鸡
vs
$ perl -Mutf8 -CS -e 'print "鸡\n";'
鸡
Unicode is a big and complex area. As you've seen, many simple programs appear to do the right thing, but for the wrong reasons. When you start to fix part of the program, things will often get worse until you've fixed all of the program.
All use utf8; does is tell Perl the source code is encoded using UTF-8. You need to tell Perl how to encode your text:
use open ':std', ':encoding(UTF-8)';
Encode all standard output as UTF-8:
binmode STDOUT, ":utf8";
You can get close to "just do utf8 everywhere" by using the CPAN module utf8::all.
perl -Mutf8::all -e 'print "鸡\n";'
When print receives something that it can't print (character larger than 255 when no :encoding layer is provided), it assumes you meant to encode it using UTF-8. It does so, after warning about the problem.
You can also run the script with the -CS switch:
perl -CS filename
That also gets rid of the warning.
With Spanish text you can run into this error when, besides using:
use utf8;
your editor saves the file in a different encoding. What you see in the editor is then not what Perl reads. To fix it, change the editor's encoding to Unicode/UTF-8.

How to print hex string in linux shell with perl

In a Linux shell, if I run perl -e "print '\x6f\xe3\xff\xff\xff\x7f' x 10", the output is:
\x6f\xe3\xff\xff\xff\x7f\x6f\xe3\xff\xff\xff\x7f\x6f\xe3\xff\xff\xff\x7f\x6f\xe3\xff\xff\xff\x7f\x6f\xe3\xff\xff\xff\x7f\x6f\xe3\xff\xff\xff\x7f\x6f\xe3\xff\xff\xff\x7f\x6f\xe3\xff\xff\xff\x7f\x6f\xe3\xff\xff\xff\x7f\x6f\xe3\xff\xff\xff\x7f
But what I need is the actual characters that \x6f\xe3\xff\xff\xff\x7f stands for. How can I get them?
Escape sequences like \x6f are only expanded in double-quoted strings. You have them in a single-quoted string. Reverse the use of quotes in your example (I've also added -CS to make the characters print properly):
$ perl -CS -e 'print "\x6f\xe3\xff\xff\xff\x7f" x 10'
oãÿÿÿoãÿÿÿoãÿÿÿoãÿÿÿoãÿÿÿoãÿÿÿoãÿÿÿoãÿÿÿoãÿÿÿoãÿÿÿ
Update: The perlrun manual page explains about the various Perl command-line switches.
The -C switch controls various Unicode features. -CS is a quick way to tell Perl that the three standard filehandles (STDIN, STDOUT and STDERR) should be treated as providing or expecting a stream of UTF8-encoded data. This means that anything read from STDIN will be decoded from UTF8 to Perl characters and anything sent to STDOUT or STDERR will be encoded from Perl characters to UTF8. In this case, I only really needed -CO (which only applies that transformation to STDOUT) but I have got into the habit of handling all three filehandles together.

How can I make Perl 6 be round-trip safe for Unicode data?

A naïve Perl 6 program is not round-trip safe with respect to Unicode. It appears as if it internally uses Normalization Form Composition (NFC) for the Str type:
$ perl -CO -E 'say "e\x{301}"' | perl6 -ne '.say' | perl -CI -ne 'printf "U+%04x\n", ord for split //'
U+00e9
U+000a
Poking through the docs I can't see anything about this behavior and I find it very shocking. I can't believe you have to drop back to the byte level to round-trip text:
$ perl -CO -E 'say "e\x{301}"' | perl6 -e 'while (my $byte = $*IN.read(1)) { $*OUT.write($byte) }' | perl -CI -ne 'printf "U+%04x\n", ord for split //'
U+0065
U+0301
U+000a
Do all text files have to be in NFC to be safely round-tripped with Perl 6? What if the document is supposed to be in NFD? I must be missing something here. I cannot believe this is intentional behavior.
The answer seems to be to use the Uni type (the base class for NFD, NFC, etc), but it doesn't really do that now and there is no good way to get the file into a Uni string. So, until some unnamed point in the future, you cannot roundtrip a non-normalized file unless you treat it as bytes.
Use UTF8-C8. From the documentation:
You can use UTF8-C8 with any file handle to read the exact bytes as
they are on disk. They may look funny when printed out, if you print
it out using a UTF8 handle. If you print it out to a handle where the
output is UTF8-C8, then it will render as you would normally expect,
and be a byte for byte exact copy.

Process text as utf-16 via perl one-liner?

Perl has the -C option for processing UTF-8. Is it possible to tell a Perl one-liner that the input is UTF-16 encoded? A BEGIN block could be used to change the encoding explicitly, but is there a simpler way?
Can Encode do what you want? You then might have to use encode() and decode() in your script so it might be no shorter than:
perl -nE 'BEGIN {binmode STDIN, ":encoding(utf16)" } ; ...'
There is a PERL_UNICODE environment variable, but it is fairly limited: it simply mimics -C if I recall correctly.
I once tried to find out why there aren't -C switches for "popular" forms of UTF and it seemed to come down to whether or not they are frequently used; are or are not well understood (endianness sometimes counts - who knew?); are - or should be - obsolete; ... : in other words it's not as simple as it seems.
perl -MEncode -E 'say for Encode->encodings(":all")' will show ~ 9 different UTF encodings.
In addition to the usual suspects (perlrun, perlunitut, perlunicode, etc.), one of the most interesting Perl resources on Unicode is right here on Stack Overflow and makes for fascinating reading.
cf. @Leon Timmerman's example and perldoc open which is fairly thorough:
% perl -Mopen=":std,:encoding(utf-16)" -E 'print <>' UTF16.txt > other.txt
% file other.txt
other.txt: Big-endian UTF-16 Unicode text, with CRLF line terminators
Edit: Another recent discussion asking how to "Turn Off" binmode(STDOUT, ":utf8") Locally touches on PerlIO and "layers" and has a neat solution that might lend itself to a one-liner. See UTF-16 perl input output as well.
I will try to find a real example using Encode to preserve encoding that can be one-lined. It would go something like this "round trip". e.g.:
% file UTF16.txt
UTF16.txt: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
... slurp it up and redirect it to a different file:
% perl -00 -MEncode="encode,decode" -E '
$text = decode("UTF-16LE", <>) ;
print encode("UTF-16LE", $text)' UTF16.txt > other.txt
% file other.txt
other.txt: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
diff and print the size of the file in bytes:
% diff UTF16.txt other.txt
% perl -E 'say [stat]->[7] for @ARGV' UTF16.txt other.txt
2220
2220
You can do that using perl -Mopen=":std,IN,:encoding(utf-16)" -e '...'
