If I run the following Perl program:
perl -e 'use utf8; print "鸡\n";'
I get this warning:
Wide character in print at -e line 1.
If I run this Perl program:
perl -e 'print "鸡\n";'
I do not get a warning.
I thought use utf8 was required to use UTF-8 characters in a Perl script. Why does this not work and how can I fix it? I'm using Perl 5.16.2. I have the same issue if this is in a file instead of being a one liner on the command line.
Without use utf8 Perl interprets your string as a sequence of single byte characters. There are four bytes in your string as you can see from this:
$ perl -E 'say join ":", map { ord } split //, "鸡\n";'
233:184:161:10
The first three bytes make up your character, the last one is the line-feed.
The call to print sends these four characters to STDOUT. Your console then works out how to display these characters. If your console is set to use UTF8, then it will interpret those three bytes as your single character and that is what is displayed.
If we add in the utf8 pragma, things are different. In this case, Perl interprets your string as just two characters.
$ perl -Mutf8 -E 'say join ":", map { ord } split //, "鸡\n";'
40481:10
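The same bytes-versus-characters distinction can be reproduced inside a script with the core Encode module. This is only a minimal sketch, not part of the original answer; the variable names are illustrative, and the character is written as byte escapes so the result does not depend on how the file is saved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The three UTF-8 bytes of the character, written as escapes.
my $bytes = "\xE9\xB8\xA1";

# decode() turns the byte string into a character string.
my $chars = decode('UTF-8', $bytes);

my @byte_ords = map { ord } split //, $bytes;   # one entry per byte
my @char_ords = map { ord } split //, $chars;   # one entry per character

# encode() goes the other way, producing bytes ready for output.
my $round_trip = encode('UTF-8', $chars);

print join(':', @byte_ords), "\n";
print join(':', @char_ords), "\n";
```

Decoding on input and encoding on output is what the I/O layers discussed in this thread do automatically.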
By default, Perl's IO layer assumes that it is working with single-byte characters. So when you try to print a character with a code point above 255, Perl thinks that something is wrong and gives you a warning. As ever, you can get more explanation for this warning by including use diagnostics. It will say this:
(S utf8) Perl met a wide character (>255) when it wasn't expecting
one. This warning is by default on for I/O (like print). The easiest
way to quiet this warning is simply to add the :utf8 layer to the
output, e.g. binmode STDOUT, ':utf8'. Another way to turn off the
warning is to add no warnings 'utf8'; but that is often closer to
cheating. In general, you are supposed to explicitly mark the
filehandle with an encoding, see open and perlfunc/binmode.
As others have pointed out you need to tell Perl to accept multi-byte output. There are many ways to do this (see the Perl Unicode Tutorial for some examples). One of the simplest ways is to use the -CS command line flag - which tells the three standard filehandles (STDIN, STDOUT and STDERR) to deal with UTF8.
$ perl -Mutf8 -e 'print "鸡\n";'
Wide character in print at -e line 1.
鸡
vs
$ perl -Mutf8 -CS -e 'print "鸡\n";'
鸡
Unicode is a big and complex area. As you've seen, many simple programs appear to do the right thing, but for the wrong reasons. When you start to fix part of the program, things will often get worse until you've fixed all of the program.
All use utf8; does is tell Perl the source code is encoded using UTF-8. You need to tell Perl how to encode your text:
use open ':std', ':encoding(UTF-8)';
Encode all standard output as UTF-8:
binmode STDOUT, ":utf8";
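To see what an encoding layer actually does, one option is to print to an in-memory filehandle and inspect the bytes that arrive. This is a sketch under that assumption (the names $buffer and $out are made up for illustration); the same layer string works on a real file or on STDOUT via binmode.

```perl
use strict;
use warnings;

my $text = "\x{9E21}\n";    # one wide character plus a newline

# Write to an in-memory file so the resulting bytes can be inspected.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open failed: $!";
print {$out} $text;         # the layer encodes, so no "Wide character" warning
close($out) or die "close failed: $!";

printf "%d characters in, %d bytes out\n", length($text), length($buffer);
```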
You can get close to "just do utf8 everywhere" by using the CPAN module utf8::all.
perl -Mutf8::all -e 'print "鸡\n";'
When print receives something that it can't print (character larger than 255 when no :encoding layer is provided), it assumes you meant to encode it using UTF-8. It does so, after warning about the problem.
You can use this,
perl -CS filename.
It will also get rid of that warning.
When working with Spanish text, you can run into this error if, besides using:
use utf8;
your editor saves the file in a different encoding. In that case, what you see in the editor is not what Perl reads. To fix it, change the editor's encoding to Unicode/UTF-8.
Why does length() say this is 4 logical characters (I would expect it to say 1):
$ perl -lwe 'print length("🐪")'
4
I guess something is wrong with my expectation. :-) What is it?
Unless you tell Perl that the source code of the script is in UTF-8, Perl treats it as a sequence of single-byte characters. This means that by default the Perl interpreter sees 🐪 as 4 separate characters, one for each byte of its UTF-8 encoding. If you change your one-liner to perl -Mutf8 -lwe 'print length("🐪")', length gives your expected output.
The utf8 pragma tells Perl that the source unit is encoded in UTF-8 rather than in single-byte characters. See perldoc utf8 for more info.
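If the string arrives as data rather than as a literal in the source, the pragma does not apply, but the built-in utf8::decode function performs the same conversion in place. A minimal sketch, with the camel written as byte escapes and variable names chosen only for illustration:

```perl
use strict;
use warnings;

my $str = "\xF0\x9F\x90\xAA";        # UTF-8 bytes of U+1F42A

my $byte_len = length($str);         # counts bytes before decoding

# utf8::decode converts the bytes to characters in place and
# returns false if they are not valid UTF-8.
utf8::decode($str) or die "not valid UTF-8";

my $char_len = length($str);         # counts characters after decoding

print "$byte_len $char_len\n";
```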
In a Linux shell, if I do perl -e "print '\x6f\xe3\xff\xff\xff\x7f' x 10", the output is:
\x6f\xe3\xff\xff\xff\x7f\x6f\xe3\xff\xff\xff\x7f\x6f\xe3\xff\xff\xff\x7f\x6f\xe3\xff\xff\xff\x7f\x6f\xe3\xff\xff\xff\x7f\x6f\xe3\xff\xff\xff\x7f\x6f\xe3\xff\xff\xff\x7f\x6f\xe3\xff\xff\xff\x7f\x6f\xe3\xff\xff\xff\x7f\x6f\xe3\xff\xff\xff\x7f
But what I need is the ASCII string that \x6f\xe3\xff\xff\xff\x7f stands for. How do I produce it?
Escape sequences like \x6f are only expanded in double-quoted strings. You have them in a single-quoted string. Reverse the use of quotes in your example (I've also added -CS to make the characters print properly):
$ perl -CS -e 'print "\x6f\xe3\xff\xff\xff\x7f" x 10'
oãÿÿÿoãÿÿÿoãÿÿÿoãÿÿÿoãÿÿÿoãÿÿÿoãÿÿÿoãÿÿÿoãÿÿÿoãÿÿÿ
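The difference between the two quoting styles is easy to check by comparing lengths. A small sketch (variable names are illustrative):

```perl
use strict;
use warnings;

# Single quotes: backslash, x and hex digits stay as literal text.
my $single = '\x6f\xe3\xff\xff\xff\x7f';

# Double quotes: each \xHH escape becomes one character.
my $double = "\x6f\xe3\xff\xff\xff\x7f";

printf "single: %d characters\n", length($single);
printf "double: %d characters\n", length($double);
```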
Update: The perlrun manual page explains about the various Perl command-line switches.
The -C switch controls various Unicode features. -CS is a quick way to tell Perl that the three standard filehandles (STDIN, STDOUT and STDERR) should be treated as providing or expecting a stream of UTF8-encoded data. This means that anything read from STDIN will be decoded from UTF8 to Perl characters and anything sent to STDOUT or STDERR will be encoded from Perl characters to UTF8. In this case, I only really needed -CO (which only applies that transformation to STDOUT) but I have got into the habit of handling all three filehandles together.
Perl has the -C option to process UTF-8. Is it possible to tell a Perl one-liner that the input is in UTF-16 encoding? A BEGIN block might be used to change the encoding explicitly; is there any simpler way?
Can Encode do what you want? You then might have to use encode() and decode() in your script so it might be no shorter than:
perl -nE 'BEGIN {binmode STDIN, ":encoding(utf16)" } ; ...'
There is a PERL_UNICODE environment variable, but it is fairly limited: it simply mimics -C if I recall correctly.
I once tried to find out why there aren't -C switches for other "popular" forms of UTF, and it seemed to come down to whether they are frequently used, whether they are well understood (endianness sometimes counts, who knew?), and whether they are, or should be, obsolete. In other words, it's not as simple as it seems.
perl -MEncode -E 'say for Encode->encodings(":all")' will show ~ 9 different UTF encodings.
In addition to the usual suspects (perlrun, perlunitut, perlunicode, etc.), one of the most interesting Perl resources on Unicode is right here on Stack Overflow and makes for fascinating reading.
cf. @Leon Timmerman's example and perldoc open, which is fairly thorough:
% perl -Mopen=":std,:encoding(utf-16)" -E 'print <>' UTF16.txt > other.txt
% file other.txt
other.txt: Big-endian UTF-16 Unicode text, with CRLF line terminators
Edit: Another recent discussion asking how to "Turn Off" binmode(STDOUT, ":utf8") Locally touches on PerlIO and "layers" and has a neat solution that might lend itself to a one-liner. See UTF-16 perl input output as well.
I will try to find a real example using Encode to preserve encoding that can be one-lined. It would go something like this "round trip". e.g.:
% file UTF16.txt
UTF16.txt: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
... slurp it up and redirect it to a different file:
% perl -00 -MEncode="encode,decode" -E '
$text = decode("UTF-16LE", <>) ;
print encode("UTF-16LE", $text)' UTF16.txt > other.txt
% file other.txt
other.txt: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
diff and print the size of the file in bytes:
% diff UTF16.txt other.txt
% perl -E 'say [stat]->[7] for @ARGV' UTF16.txt other.txt
2220
2220
You can do that using perl -Mopen=":std,IN,:encoding(utf-16)" -e '...'
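One detail worth checking when choosing between the utf-16 and UTF-16LE names used in this thread: as documented for Encode::Unicode, plain UTF-16 writes a byte-order mark and expects one when decoding, while the LE/BE variants carry none. A minimal sketch using in-memory strings in place of real files (the sample text and variable names are assumptions for illustration):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "hi\n";

# Plain UTF-16: big-endian output with a BOM (FE FF) prepended.
my $with_bom = encode('UTF-16', $text);

# Explicit endianness: no BOM is written.
my $le = encode('UTF-16LE', $text);

# Decoding plain UTF-16 uses the BOM to pick the byte order and drops it.
my $from_bom = decode('UTF-16', $with_bom);
my $from_le  = decode('UTF-16LE', $le);

printf "UTF-16:   %s\n", unpack('H*', $with_bom);
printf "UTF-16LE: %s\n", unpack('H*', $le);
```

If a file has no byte-order mark, naming the endianness explicitly is the safer choice.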
I have a header creation Perl script and it works great most of the time, but every once in a while the thing breaks. I'll get right down to the meat of it: given the CRC number 772423333, Perl's pack function breaks.
my $dec = 772423333;
my $broken = pack("N", $dec);
print "Good:\t", uc(sprintf("%x", $dec)), "\nBad:\t$broken"; # eg. 2E0D0A3EA5
Forgive me for not knowing how to print the readable HEX, but this is what it returns.
Good: 2E0A3EA5
Bad: 2E0D0A3EA5
How do I remove the 0D?
Your example output isn't what your program prints. Your program prints the "Bad" value as raw binary (as if it were printable characters, though it's not), not in hex.
It works here (once I pipe it to a hex dumper, so I can read it), but I'm on Linux.
Most likely, where you're going wrong is that you need to call binmode on your output file handle (or alternatively open it with a :raw layer); you are seeing newline to CRLF translation. If you add binmode *STDOUT; immediately before your print (in your example code), I suspect you'll get the expected output.
[ On Unix, there is no newline-to-CRLF translation, so it works ]
Stop using Windows? 0D0A are the character codes of a Windows line ending (more commonly seen as "\r\n"), and you observe them because you are printing character 0A ("\n") to a handle (STDOUT) with the :crlf layer, which automatically converts any \n characters to the sequence \r\n.
Call binmode on STDOUT to disable this translation. Here's the view using an MSWin32 build of perl with the Cygwin utility od:
$ winperl -e 'print pack("N",772423333)' | od -c
0000000 . \r \n > 245
0000005
$ winperl -e 'binmode STDOUT; print pack("N",772423333)' | od -c
0000000 . \n > 245
0000004
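For the "readable HEX" the question asks about, one approach that avoids printing raw bytes to the terminal is to unpack the packed value with the H* template. A short sketch (variable names are illustrative):

```perl
use strict;
use warnings;

my $dec    = 772423333;
my $packed = pack('N', $dec);           # 4 bytes, big-endian

# "H*" renders every byte as two hex digits, high nybble first.
my $hex = uc(unpack('H*', $packed));

print "Packed length: ", length($packed), " bytes\n";
print "Hex:\t$hex\n";
```

The packed value itself is always four bytes; the extra 0D only appears when those bytes pass through a handle that translates line endings.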