Perl ord and chr working with Unicode

To my horror I've just found out that chr doesn't work with Unicode, although it does something. The man page is anything but clear:
Returns the character represented by that NUMBER in the character set. For example, chr(65) is "A" in either ASCII or Unicode, and chr(0x263a) is a Unicode smiley face.
Indeed I can print a smiley using
perl -e 'print chr(0x263a)'
but things like chr(0x00C0) do not work. I see that my perl v5.10.1 is a bit ancient, but when I paste various strange letters in the source code, everything's fine.
I've tried funny things like use utf8 and use encoding 'utf8'. I haven't tried use v5.12 or use feature 'unicode_strings', as they don't work with my version. I fooled around with Encode::decode, only to find out that I need no decoding, since I have no byte array to decode. I've read more documentation than ever before and found quite a few interesting things, but nothing helpful. It looks like a case of the Unicode Bug, but no usable solution is given there. Moreover, I don't care about the whole string semantics; all I need is a trivial function.
So how can I convert a number into a string consisting of the single character corresponding with it, so that for example real_chr(0xC0) eq 'À' holds?
The first answer I got explains quite a lot about I/O, but I still don't understand why
#!/usr/bin/perl -w
use strict;
use utf8;
use encoding 'utf8';
print chr(0x00C0) eq 'À' ? 'eq1' : 'ne1', " - ", chr(0x263a) eq '☺' ? 'eq1' : 'ne1', "\n";
print 'À' =~ /\w/ ? "match1" : "no_match1", " - ", chr(0x00C0) =~ /\w/ ? "match2" : "no_match2", "\n";
prints
ne1 - eq1
match1 - no_match2
It means that the manually entered 'À' differs from chr(0x00C0). Moreover, the former is a word constituent character (correct!) while the latter is not (but should be!).

First,
perl -le'print chr(0x263A);'
is buggy. Perl even tells you as much:
Wide character in print at -e line 1.
That doesn't qualify as "working". So while they differ in how they fail to provide what you want, neither of the following gives you what you want:
perl -le'print chr(0x263A);'
perl -le'print chr(0x00C0);'
To properly output the UTF-8 encoding of those Unicode code points, you need to tell Perl to encode the code points as UTF-8.
$ perl -le'use open ":std", ":encoding(UTF-8)"; print chr(0x263A);'
☺
$ perl -le'use open ":std", ":encoding(UTF-8)"; print chr(0x00C0);'
À
Now on to the "why".
File handles can only transmit bytes, so unless you tell them otherwise, Perl file handles expect bytes. That means the string you provide to print cannot contain anything but bytes; in other words, it cannot contain characters above 255. The output is exactly what you provide:
$ perl -e'print map chr, 0x00, 0x65, 0xC0, 0xF0' | od -t x1
0000000 00 65 c0 f0
0000004
This is useful. It is different from what you want, but that doesn't make it wrong. If you want something different, you just need to tell Perl what you want.
Once you add an :encoding layer, the handle expects a string of Unicode characters, or as I call it, "text". The layer tells Perl how to convert the text into bytes.
$ perl -e'
use open ":std", ":encoding(UTF-8)";
print map chr, 0x00, 0x65, 0xC0, 0xF0, 0x263a;
' | od -t x1
0000000 00 65 c3 80 c3 b0 e2 98 ba
0000011
You're right that chr doesn't know or care about Unicode. Like length, substr, ord and reverse, chr implements a basic string function, not a Unicode function. That doesn't mean it can't be used to work with text strings. As you've seen, the problem wasn't with chr but with what you did with the string after you built it.
A character is an element of a string, and a character is a number. That means a string is just a sequence of numbers. Whether you treat those numbers as Unicode code points (text), packed IP addresses or temperature measurements is entirely up to you and the functions to which you pass the strings.
Here are a few examples of operators that do assign meaning to the strings they receive as operands:
m// expects a string of Unicode code points.
connect expects a sequence of bytes that represent a sockaddr_in structure.
print with a handle without :encoding expects a sequence of bytes.
print with a handle with :encoding expects a sequence of Unicode code points.
etc.
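For instance, here is a rough sketch, not from the original answer, of handing the same chr-built string to two of those consumers:
my $s = chr(0xC0);                    # one element whose number is 0xC0
utf8::upgrade($s);                    # needed on older perls; see the /\w/ caveat below
print $s =~ /\w/ ? "word char\n" : "not a word char\n";   # m// treats it as a code point
binmode STDOUT, ':encoding(UTF-8)';   # now the handle expects text ...
print $s, "\n";                       # ... and encodes it to the two bytes C3 80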
So how can I convert a number into a string consisting of the single character corresponding with it, so that for example real_chr(0xC0) eq 'À' holds?
chr(0xC0) eq 'À' does hold. Did you remember to tell Perl you encoded your source code using UTF-8 by using use utf8;? If you didn't, Perl actually sees a two-character string on the RHS.
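A minimal way to check this, assuming the source file really is saved as UTF-8:
use utf8;                                   # declare that the source file itself is UTF-8
print chr(0xC0) eq 'À' ? "eq\n" : "ne\n";   # prints "eq"
Without use utf8;, the literal 'À' is parsed as the two bytes C3 80, i.e. a two-character string, and the comparison fails.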
Regarding the question you've added:
There are problems with the encoding pragma. I recommend against using it. Instead, use
use open ':std', ':encoding(UTF-8)';
That'll fix one of the problems. The other problem you are encountering is with
chr(0x00C0) =~ /\w/
It's a known bug that's intentionally left broken for backwards compatibility reasons. That is, unless you request a more recent version of the language as follows:
use 5.014; # use 5.012; *might* suffice.
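A sketch of that route (assuming a perl newer than the 5.10.1 in the question is available):
use 5.014;                              # enables unicode_strings (among other features)
use utf8;
use open ':std', ':encoding(UTF-8)';
say chr(0x00C0) =~ /\w/ ? "match" : "no match";   # prints "match"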
A workaround that works as far back as 5.8:
my $x = chr(0x00C0);
utf8::upgrade($x);
$x =~ /\w/

Related

Perl: pack a UTF-8 Chinese character, then unpack to get the character back?

I am learning the pack function in Perl. I found I cannot unpack and get the original value back. The code is below; the file is encoded in UTF-8. How can I unpack and get the Chinese character back?
I have checked perldoc, but I am not sure which TEMPLATE I should use. The documentation says:
U A Unicode character number. Encodes to a character in character mode and UTF-8 (or UTF-EBCDIC in EBCDIC platforms) in byte mode.
So I tried U, but it didn't work.
use Encode;
open(DAT,"+>T.dat");
binmode(DAT,":raw");
print DAT pack("f",-3.938345);
print DAT pack("l",1234556);
print DAT pack("U*","我");
seek(DAT,0,0);
read(DAT,$Val,4);
$V=unpack("f",$Val);
print "V $V\n";
read(DAT,$int,4);
$I=unpack("l",$int);
print "int $I\n";
read(DAT,$HZ,4);
$HZ=unpack("U*",$HZ);
print("HZ $HZ\n");
close(DAT);
And I have another question: I know one Chinese character takes only 2 bytes if encoded in GB2312. How can I pack one character so that it takes only 2 bytes of space?
Unicode pack and unpack in Perl work the other way round: pack("U*", ...) takes code point numbers and builds characters, while unpack("U*", ...) takes characters and gives back their code points:
use utf8;
binmode STDOUT,":utf8";
my $packed = pack("U*", 0x6211);
print "$packed\n"; # 我
my $unpacked = unpack("U*", "我");
printf "0x%X\n", $unpacked; # 0x6211

Why does decoding "&euro;" to "€" also turn "é" into "Ã©" in the output?

I'm new to Perl scripting, and I'm facing some issues in decoding a string:
use HTML::Entities;
my $string='Rémunération &euro;';
$string=decode_entitie($string);
print "$string";
The output I get looks like RÃ©munÃ©ration €, when it should look like Rémunération €.
Can anyone please help me with this?
If you run this version of your code (with the typo in decode_entities fixed, strict mode and warnings enabled, and an extra print added) at a terminal:
use strict;
use warnings;
use HTML::Entities;
my $string='Rémunération &euro;';
print "$string\n";
$string=decode_entities($string);
print "$string\n";
you should see the following output:
Rémunération &euro;
Wide character in print at test.pl line 7.
RÃ©munÃ©ration €
What happens is the following chain of events:
Your code is written in UTF-8 but doesn't have use utf8; in it, so Perl parses your source code (and, in particular, any string literals in it) byte by byte. Thus, the string literal 'é' is parsed as a two-character string, because the UTF-8 encoding of é takes up two bytes.
Normally, this doesn't matter (much), because your STDOUT is also not in UTF-8 mode, and so it just takes any byte string you give it and spits it out byte by byte, and your terminal then interprets the resulting output as UTF-8 (or tries to).
So, when you do print 'é'; Perl thinks you're printing a two-character string in byte mode, and writes out two bytes, which just happen to make up the UTF-8 encoding of the single character é.
However, when you run your string through decode_entities(), it decodes the &euro; into an actual Unicode € character, which does not fit inside a single byte.
When you try to print the resulting string, Perl notices the "wide" € character. It can't print it as a single byte, so instead, it falls back to encoding the whole string as UTF-8 (and emitting a warning, if you have those enabled, as you should). But that causes the és (which were already encoded, since Perl never decoded them while parsing your code) to get double-UTF8-encoded, producing the mojibake output you see.
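Here is a rough illustration of that double encoding, not from the original answer (the byte values assume a UTF-8 source and terminal):
use Encode qw(encode_utf8);
my $bytes  = "\xC3\xA9";              # the two bytes Perl parsed for 'é' without use utf8
my $double = encode_utf8($bytes);     # what effectively happens when a wide char forces UTF-8 output
print join(" ", map { sprintf "%02X", ord } split(//, $double)), "\n";
# C3 83 C2 A9  -- which a UTF-8 terminal displays as "Ã©"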
A simple fix is to add use utf8; to your code, and also set all your filehandles (including STDIN / STDOUT / STDERR) to UTF-8 mode by default, e.g. like this:
use utf8;
use open qw(:std :utf8);
With those lines prepended to the test script above, the output you get should be:
Rémunération &euro;
Rémunération €

Perl uri_escape_utf8 with Arabic

I am trying to escape some Arabic text to send via LWP::UserAgent. I am testing this with the script below:
my $files = "/home/root/temp.txt";
unlink ($files);
open (OUTFILE, '>>', $files);
my $text = "ضثصثضصثشس";
print OUTFILE uri_escape_utf8($text)."\n";
close (OUTFILE);
However, this seems to cause the following:
%C3%96%C3%8B%C3%95%C3%8B%C3%96%C3%95%C3%8B%C3%94%C3%93
which is not correct. Any pointers to what I need to do in order to escape this correctly?
Thank you for your help in advance.
Regards,
Olli
Perl considers your source file to be encoded as Latin-1 until you tell it otherwise with use utf8. If we do that, the string "ضثصثضصثشس" does not contain jumbled bytes, but is rather a string of code points.
uri_escape_utf8 expects a string of code points (not bytes!), encodes them as UTF-8, and then URI-escapes them. Ergo, the correct thing to do is
use utf8;
use URI::Escape;
print uri_escape_utf8("ضثصثضصثشس"), "\n";
Output: %D8%B6%D8%AB%D8%B5%D8%AB%D8%B6%D8%B5%D8%AB%D8%B4%D8%B3
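An optional round-trip check (a sketch; uri_unescape and decode_utf8 are standard parts of URI::Escape and Encode, but this check is not part of the original answer):
use utf8;
use URI::Escape qw(uri_escape_utf8 uri_unescape);
use Encode qw(decode_utf8);
binmode STDOUT, ':encoding(UTF-8)';
my $escaped   = uri_escape_utf8("ضثصثضصثشس");
my $roundtrip = decode_utf8(uri_unescape($escaped));
print $roundtrip eq "ضثصثضصثشس" ? "round-trip ok\n" : "mismatch\n";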
If we fail to use utf8, then uri_escape_utf8 gets a string of bytes (which happen to be encoded in UTF-8), so we should have used uri_escape instead:
die "This is the wrong way to do it";
use URI::Escape;
print uri_escape("ضثصثضصثشس"), "\n";
which produces the same output as above – but only by accident.
Using uri_escape_utf8 with a byte string (one that would decode to Arabic characters) produces the totally wrong
%C3%98%C2%B6%C3%98%C2%AB%C3%98%C2%B5%C3%98%C2%AB%C3%98%C2%B6%C3%98%C2%B5%C3%98%C2%AB%C3%98%C2%B4%C3%98%C2%B3
because this effectively double-encodes the data. It is the same as
use utf8;
use URI::Escape;
use Encode;
print uri_escape(encode "utf8", encode "utf8", "ضثصثضصثشس"), "\n";
Edit: So you used CP-1256, which is a non-portable single-byte encoding. It cannot encode arbitrary Unicode characters and should therefore be avoided, along with other pre-Unicode encodings. You didn't declare your encoding, so Perl assumed Latin-1. This means that what you saw as "ضثصثضصثشس" was actually the byte sequence D6 CB D5 CB D6 D5 CB D4 D3, which Latin-1 decodes to the unrelated characters ÖËÕËÖÕËÔÓ. That is exactly the text that got UTF-8-encoded and URI-escaped in your first output above.
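As a hedged aside (assuming Encode knows the encoding under the name "cp1256"; this is not part of the original answer), the intended text can be recovered from those bytes by decoding them explicitly:
use Encode qw(decode);
binmode STDOUT, ':encoding(UTF-8)';
my $bytes = "\xD6\xCB\xD5\xCB\xD6\xD5\xCB\xD4\xD3";   # what the source file really contained
print decode("cp1256", $bytes), "\n";                  # ضثصثضصثشس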
Edit: So you want to decode command line arguments. The Encode::Locale module should manage this. Before accessing any parameters from @ARGV, do
use Encode::Locale;
decode_argv(Encode::FB_CROAK); # possibly: BEGIN { decode_argv(...) }
or use the locale pseudoencoding which it provides:
my $decoded_string = decode "locale", $some_binary_data;
Use this as part of an overall strategy of decoding all input and encoding all output.
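Putting it together, a minimal "decode all input, encode all output" scaffold might look like this (a sketch built on the Encode::Locale recommendation above; the fully qualified call avoids relying on what the module exports):
use Encode::Locale;
use Encode qw(decode);
use open ':std', ':encoding(UTF-8)';
BEGIN { Encode::Locale::decode_argv(Encode::FB_CROAK) }   # @ARGV is now text, not bytes
my ($word) = @ARGV;
print length($word), " characters\n";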

How to make File::Queue able to process UTF-8 strings in Perl?

I'm processing some data from XML files in Perl and want to use the FIFO module File::Queue to divide and speed up the work.
One Perl script parses the XML file and prepares JSON output for another script:
#!/usr/bin/perl -w
binmode STDOUT, ":utf8";
use utf8;
use strict;
use XML::Rules;
use JSON;
use File::Queue;
#do the XML magic: %data contains result
my $q = new File::Queue (File => './importqueue', Mode => 0666);
my $json = new JSON;
my $qItem = $json->allow_nonref->encode(\%data);
$q->enq($qItem);
As long as %data contains only numeric and a-z data, this works fine. But when one of the wide characters occurs (e.g. ł, ą, ś, ż), I get: Wide character in syswrite at /usr/lib/perl/5.10/IO/Handle.pm line 207.
I have tried to check if the string is valid utf8:
print utf8::is_utf8($qItem). ':' . utf8::valid($qItem)
and I did get 1:1 - so yes I do have the correct utf8 string.
I have found out that the reason could be that syswrite gets a filehandle to the queue file which does not have the :utf8 layer enabled.
Am I right? If so, is there any way to force File::Queue to use a :utf8 filehandle?
Maybe File::Queue is not the best choice; should I use something else to create a FIFO queue between two Perl scripts?
utf8::is_utf8 does not tell you whether your string is encoded using UTF-8 or not. (That information is not even available.)
>perl -MEncode -E"say utf8::is_utf8(encode_utf8(chr(0xE9))) || 0"
0
utf8::valid does not tell you whether your string is valid UTF-8 or not.
>perl -MEncode -E"say utf8::valid(qq{\xE9}) || 0"
1
Both check some internal storage details. You should never have a need for either.
File::Queue can only transmit strings of bytes. It's up to you to serialise the data you want to transmit into a string.
The primary means of serialising text is character encoding, or just encoding for short. UTF-8 is a character encoding.
For example, the string
dostępu
consists of the following chars (each a Unicode code point):
64 6F 73 74 119 70 75
Not all of those chars fit in bytes, so the string can't be sent using File::Queue. If you were to encode that string using UTF-8, you'd get a string composed of the following chars:
64 6F 73 74 C4 99 70 75
Those chars fit in bytes, so that string can be sent using File::Queue.
JSON, as you used it, returns strings of Unicode code points. As such, you need to apply a character encoding.
File::Queue doesn't provide an option to automatically encode strings for you, so you'll have to do it yourself.
You could use encode_utf8 and decode_utf8 from the Encode module
my $json = JSON->new->allow_nonref;
$q->enq(encode_utf8($json->encode(\%data)));
my $data = $json->decode(decode_utf8($q->deq()));
or you can let JSON do the encoding/decoding for you.
my $json = JSON->new->utf8->allow_nonref;
$q->enq($json->encode(\%data));
my $data = $json->decode($q->deq());
Looking at the docs:
perldoc -f syswrite
WARNING: If the filehandle is marked ":utf8", Unicode
characters encoded in UTF-8 are written instead of bytes, and
the LENGTH, OFFSET, and return value of syswrite() are in
(UTF8-encoded Unicode) characters. The ":encoding(...)" layer
implicitly introduces the ":utf8" layer. Alternately, if the
handle is not marked with an encoding but you attempt to write
characters with code points over 255, raises an exception. See
"binmode", "open", and the "open" pragma, open.
man 3perl open
use open OUT => ':utf8';
...
with the "OUT" subpragma you can declare the default
layers of output streams. With the "IO" subpragma you can control
both input and output streams simultaneously.
So I'd guess that adding use open OUT => ':utf8' to the top of your program would help.

In Perl, can I treat a string as a byte array?

In Perl, is it appropriate to use a string as a byte array containing 8-bit data? All the documentation I can find on this subject focuses on 7-bit strings.
For instance, if I read some data from a binary file into $data
my $data;
open FILE, "<", $filepath;
binmode FILE;
read FILE, $data, 1024;
and I want to get the first byte out, is substr($data,1,1) appropriate? (again, assuming it is 8-bit data)
I come from a mostly C background, and I am used to passing a char pointer to a read() function. My problem might be that I don't understand what the underlying representation of a string is in Perl.
The bundled documentation for the read command, reproduced here, provides a lot of information that is relevant to your question.
read FILEHANDLE,SCALAR,LENGTH,OFFSET
read FILEHANDLE,SCALAR,LENGTH
Attempts to read LENGTH characters of data into variable SCALAR
from the specified FILEHANDLE. Returns the number of
characters actually read, 0 at end of file, or undef if there
was an error (in the latter case $! is also set). SCALAR will
be grown or shrunk so that the last character actually read is
the last character of the scalar after the read.
An OFFSET may be specified to place the read data at some place
in the string other than the beginning. A negative OFFSET
specifies placement at that many characters counting backwards
from the end of the string. A positive OFFSET greater than the
length of SCALAR results in the string being padded to the
required size with "\0" bytes before the result of the read is
appended.
The call is actually implemented in terms of either Perl's or
system's fread() call. To get a true read(2) system call, see
"sysread".
Note the characters: depending on the status of the filehandle,
either (8-bit) bytes or characters are read. By default all
filehandles operate on bytes, but for example if the filehandle
has been opened with the ":utf8" I/O layer (see "open", and the
"open" pragma, open), the I/O will operate on UTF-8 encoded
Unicode characters, not bytes. Similarly for the ":encoding"
pragma: in that case pretty much any characters can be read.
See perldoc -f pack and perldoc -f unpack for how to treat strings as byte arrays.
You probably want to use sysopen and sysread if you want to read bytes from a binary file.
See also perlopentut.
Whether this is appropriate or necessary depends on what exactly you are trying to do.
#!/usr/bin/perl -l
use strict; use warnings;
use autodie;
use Fcntl;
sysopen my $bin, 'test.png', O_RDONLY;
sysread $bin, my $header, 4;
print map { sprintf '%02x', ord($_) } split //, $header;
Output:
C:\Temp> t
89504e47
Strings are strings of "characters", which are bigger than a byte.1 You can store bytes in them and manipulate them as though they are characters, taking substrs of them and so on, and so long as you're just manipulating entities in memory, everything is pretty peachy. The data storage is weird, but that's mostly not your problem.2
When you try to read and write from files, the fact that your characters might not map to bytes becomes important and interesting. Not to mention annoying. This annoyance is actually made a bit worse by Perl trying to do what you want in the common case: If all the characters in the string fit into a byte and you happen to be on a non-Windows OS, you don't actually have to do anything special to read and write bytes. Perl will complain, however, if you have stored a non-byte-sized character and try to write it without giving it a clue about what to do with it.
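A quick way to see that complaint, and the effect of giving Perl the missing clue, using one-liners like the ones earlier on this page:
$ perl -lwe 'print chr(0x263A)'
Wide character in print at -e line 1.
☺
$ perl -lwe 'binmode STDOUT, ":encoding(UTF-8)"; print chr(0x263A)'
☺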
This is getting a little far afield, largely because encoding is a large and confusing topic. Let me leave it off there with some references: Look at Encode(3perl), open(3perl), perldoc open, and perldoc binmode for lots of hilarious and gory details.
So the summary answer is "Yes, you can treat strings as though they contained bytes if they do in fact contain bytes, which you can assure by only reading and writing bytes.".
1: Or pedantically, "which can express a larger range of values than a byte, though they are stored as bytes when that is convenient". I think.
2: For the record, strings in Perl are internally represented by a data structure called a 'PV' which in addition to a character pointer knows things like the length of the string and the current value of pos.3
3: Well, it will start storing the current value of pos if it starts being interesting. See also
use Devel::Peek;
my $x = "bluh bluh bluh bluh";
Dump($x);
$x =~ /bluh/mg;
Dump($x);
$x =~ /bluh/mg;
Dump($x);
It might help more if you tell us what you are trying to do with the byte array. There are various ways to work with binary data, and each lends itself to a different set of tools.
Do you want to convert the data into a Perl array? If so, pack and unpack are a good start. split could also come in handy.
Do you want to access individual elements of the string without unpacking it? If so, substr is fast and will do the trick for 8-bit data. If you want other bit depths, take a look at the vec function, which treats a string as a bit vector.
Do you want to scan the string and convert certain bytes to other bytes? Then the s/// or tr/// constructs might be useful.
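A small illustrative sketch of those three approaches (not from the original answer):
my $data = "\x89\x50\x4E\x47";            # four raw bytes (a PNG signature prefix)
# 1. convert to a Perl array of numbers with unpack
my @bytes = unpack "C*", $data;            # (137, 80, 78, 71)
# 2. access individual elements without unpacking
my $first  = ord substr($data, 0, 1);      # 137
my $nibble = vec($data, 1, 4);             # second 4-bit group of the vector: 8
# 3. rewrite certain bytes in place
(my $copy = $data) =~ tr/\x89/\xFF/;       # every 0x89 byte becomes 0xFF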
Allow me to post a small example of treating a string as a binary array. I myself found it hard to believe that something called "substr" would handle null bytes, but it seemingly does. Below is a snippet of a Perl debugger session (showing both the string and the array/list approaches):
$ perl -d
Loading DB routines from perl5db.pl version 1.32
Editor support available.
Enter h or `h h' for help, or `man perldebug' for more help.
^D
Debugged program terminated. Use q to quit or R to restart,
use o inhibit_exit to avoid stopping after program termination,
h q, h R or h o to get additional info.
DB<1> $str="\x00\x00\x84\x00"
DB<2> print $str
�
DB<3> print unpack("H*",$str) # show content of $str as hex via `unpack`
00008400
DB<4> $str2=substr($str,2,2)
DB<5> print unpack("H*",$str2)
8400
DB<6> $str2=substr($str,1,3)
DB<7> print unpack("H*",$str2)
008400
[...]
DB<30> @stra=split('',$str); print @stra # convert string to array (by splitting at the empty string)
�
DB<31> print unpack("H*",$stra[3]) # print indiv. elems. of array as hex
00
DB<32> print unpack("H*",$stra[2])
84
DB<33> print unpack("H*",$stra[1])
00
DB<34> print unpack("H*",$stra[0])
00
DB<35> print unpack("H*",join('',@stra[1..3])) # print only a portion of the array/list via an index slice (the two-dot range operator)
008400