Converting hex into UTF8 not working as expected in perl - perl

I'm trying to understand UTF8 in perl.
I have the following string Alizéh. If I look up the hex for this string I get 416c697ac3a968 from https://onlineutf8tools.com/convert-utf8-to-hexadecimal (this matches the original source of the string).
So I thought packing that hex and encoding it to utf8 should produce the unicode string. But it produces something very different.
Is anyone able to explain what I'm getting wrong?
Here is a simple test program to show my working.
#!/usr/bin/perl
use strict;
use warnings;
use Text::Unaccent;
use Encode;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';
print "First test that the utf8 string Alizéh prints as expected\n\n";
print "=========================================== Hex to utf8 test start\n";
my $hexRepresentationOfTheString = '416c697ac3a968';
my $packedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "The hex of the string is $hexRepresentationOfTheString\n";
print "The string after packing prints as $packedHexIntoPlainString\n";
utf8::encode($packedHexIntoPlainString);
print "Utf8 encoding the string produces $packedHexIntoPlainString\n";
print "=========================================== Hex to utf8 test finish\n\n";
print "=========================================== utf8 from code test start\n";
my $utf8FromCode = "Alizéh";
print "Variable prints as $utf8FromCode\n";
my ($hex) = unpack("H*", $utf8FromCode);
print "Hex of this string is now $hex\n";
print "Decoding the utf8 string\n";
utf8::decode($utf8FromCode);
$hex = unpack ("H*", $utf8FromCode);
print "Hex string is now $hex\n";
print "=========================================== utf8 from code test finish\n\n";
This prints:
First test that the utf8 string Alizéh prints as expected
=========================================== Hex to utf8 test start
The hex of the string is 416c697ac3a968
The string after packing prints as Alizéh
Utf8 encoding the string produces Alizéh
=========================================== Hex to utf8 test finish
=========================================== utf8 from code test start
Variable prints as Alizéh
Hex of this string is now 416c697ae968
Decoding the utf8 string
Hex string is now 416c697ae968
=========================================== utf8 from code test finish
Any tips on how to take the hex value of a UTF8 string and turn it into a valid UTF8 scalar in perl?
There is some further weirdness I'll explain in this extended version
#!/usr/bin/perl
use strict;
use warnings;
use Text::Unaccent;
use Encode;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';
print "First test that the utf8 string Alizéh prints as expected\n\n";
print "=========================================== Hex to utf8 test start\n";
my $hexRepresentationOfTheString = '416c697ac3a968';
my $packedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "The hex of the string is $hexRepresentationOfTheString\n";
print "The string after packing prints as $packedHexIntoPlainString\n";
utf8::encode($packedHexIntoPlainString);
print "Utf8 encoding the string produces $packedHexIntoPlainString\n";
print "=========================================== Hex to utf8 test finish\n\n";
print "=========================================== utf8 from code test start\n";
my $utf8FromCode = "Alizéh";
print "Variable prints as $utf8FromCode\n";
my ($hex) = unpack("H*", $utf8FromCode);
print "Hex of this string is now $hex\n";
print "Decoding the utf8 string\n";
utf8::decode($utf8FromCode);
$hex = unpack ("H*", $utf8FromCode);
print "Hex string is now $hex\n";
print "=========================================== utf8 from code test finish\n\n";
print "=========================================== Unaccent test start\n";
my $plaintest = unac_string('utf8', "Alizéh");
print "Alizéh passed to the unaccent gives $plaintest\n";
my $cleanpackedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "Packed version of the hex string prints as $cleanpackedHexIntoPlainString\n";
my $packedtest = unac_string('utf8', $cleanpackedHexIntoPlainString);
print "Unaccenting the packed version gives $packedtest\n";
utf8::encode($cleanpackedHexIntoPlainString);
print "encoding the packed version it now prints as $cleanpackedHexIntoPlainString\n";
$packedtest = unac_string('utf8', $cleanpackedHexIntoPlainString);
print "Now unaccenting the packed version gives $packedtest\n";
print "=========================================== Unaccent test finish\n\n";
This prints:
First test that the utf8 string Alizéh prints as expected
=========================================== Hex to utf8 test start
The hex of the string is 416c697ac3a968
The string after packing prints as Alizéh
Utf8 encoding the string produces Alizéh
=========================================== Hex to utf8 test finish
=========================================== utf8 from code test start
Variable prints as Alizéh
Hex of this string is now 416c697ae968
Decoding the utf8 string
Hex string is now 416c697ae968
=========================================== utf8 from code test finish
=========================================== Unaccent test start
Alizéh passed to the unaccent gives Alizeh
Packed version of the hex string prints as Alizéh
Unaccenting the packed version gives Alizeh
encoding the packed version it now prints as Alizéh
Now unaccenting the packed version gives AlizA©h
=========================================== Unaccent test finish
In this test it seems that the unaccent library accepts the packed version of the string's hex. I'm not sure why; could anyone please help me understand why that works?

Unicode strings are first-class values in Perl; you do not need to jump through these hoops. You just need to recognize and keep track of when you have bytes and when you have characters, because Perl will not differentiate for you, and all byte strings are also valid character strings. Indeed, you are double-encoding your strings: the result is still valid, but as the UTF-8 encoded bytes representing (the characters corresponding to) your UTF-8 encoded bytes.
use utf8; tells Perl to decode your source code from UTF-8, so the literal strings that follow are already Unicode strings and can be passed to any API that correctly accepts characters. To get the same from a string of UTF-8 bytes (as you are producing by packing the hex representation of the bytes), use decode from Encode (or my nicer wrapper).
use strict;
use warnings;
use utf8;
use Encode 'decode';
my $str = 'Alizéh'; # already decoded
my $hex = '416c697ac3a968';
my $bytes = pack 'H*', $hex;
my $chars = decode 'UTF-8', $bytes;
Unicode strings need to be encoded to UTF-8 for output to something that expects bytes, such as STDOUT; a :encoding(UTF-8) layer can be applied to such handles to do this automatically, and the same to automatically decode from input handles. The exact nature of what should be applied depends entirely on where your characters are coming from and where they are going. See this answer for way too much information on the options available.
use Encode 'encode';
print encode 'UTF-8', "$chars\n";
binmode *STDOUT, ':encoding(UTF-8)'; # warning: global effect
print "$chars\n";

Related

Perl: Packing a sequence of bytes into a string

I'm trying to run a simple test whereby I want to have differently formatted binary strings and print them out. In fact, I'm trying to investigate a problem whereby sprintf cannot deal with a wide-character string passed in for the placeholder %s.
In this case, the binary string shall just contain the Cyrillic "д" (because it's above ISO-8859-1)
The code below works when I use the character directly in the source.
But nothing that passes through pack works.
For the UTF-8 case, I apparently need to set the UTF-8 flag on the string $ch, but how?
The UCS-2 case fails, and I suppose it's because there is no way for Perl to distinguish UCS-2 from ISO-8859-1, so that test is probably bollocks, right?
The code:
#!/usr/bin/perl
use utf8; # Meaning "This lexical scope (i.e. file) contains utf8"
# https://perldoc.perl.org/open.html
use open qw(:std :encoding(UTF-8));
sub showme {
my ($name,$ch) = @_;
print "-------\n";
print "This is test: $name\n";
my $ord = ord($ch); # ordinal computed outside of "use bytes"; actually should yield the unicode codepoint
{
# https://perldoc.perl.org/bytes.html
use bytes;
my $mark = (utf8::is_utf8($ch) ? "yes" : "no");
my $txt = sprintf("Received string of length: %i byte, contents: %vd, ordinal x%04X, utf-8: %s\n", length($ch), $ch, $ord, $mark);
print $txt,"\n";
}
print $ch, "\n";
print "Combine: $ch\n";
print "Concat: " . $ch . "\n";
print "Sprintf: " . sprintf("%s",$ch) . "\n";
print "-------\n";
}
showme("Cyrillic direct" , "д");
showme("Cyrillic UTF-8" , pack("HH","D0","B4")); # UTF-8 of д is D0B4
showme("Cyrillic UCS-2" , pack("HH","04","34")); # UCS-2 of д is 0434
Current output:
Looks good
-------
This is test: Cyrillic direct
Received string of length: 2 byte, contents: 208.180, ordinal x0434, utf-8: yes
д
Combine: д
Concat: д
Sprintf: д
-------
That's a no. Where does the 176 come from??
-------
This is test: Cyrillic UTF-8
Received string of length: 2 byte, contents: 208.176, ordinal x00D0, utf-8: no
а
Combine: а
Concat: а
Sprintf: а
-------
This is even worse.
-------
This is test: Cyrillic UCS-2
Received string of length: 2 byte, contents: 0.48, ordinal x0000, utf-8: no
0
Combine: 0
Concat: 0
Sprintf: 0
-------
You have two problems.
Your calls to pack are incorrect
Each H represents one hex digit.
$ perl -e'printf "%vX\n", pack("HH", "D0", "B4")' # XXX
D0.B0
$ perl -e'printf "%vX\n", pack("H2H2", "D0", "B4")' # Ok
D0.B4
$ perl -e'printf "%vX\n", pack("(H2)2", "D0", "B4")' # Ok
D0.B4
$ perl -e'printf "%vX\n", pack("(H2)*", "D0", "B4")' # Better
D0.B4
$ perl -e'printf "%vX\n", pack("H*", "D0B4")' # Alternative
D0.B4
STDOUT is expecting decoded text, but you are providing encoded text
First, let's take a look at strings you are producing (once the problem mentioned above is fixed). All you need for that is the %vX format, which provides the period-separated value of each character in hex.
"д" produces a one-character string. This character is the Unicode Code Point for д.
$ perl -e'use utf8; printf("%vX\n", "д");'
434
pack("H*", "D0B4") produces a two-character string. These characters are the UTF-8 encoding of д.
$ perl -e'printf("%vX\n", pack("H*", "D0B4"));'
D0.B4
pack("H*", "0434") produces a two-character string. These characters are the UCS-2be and UTF-16be encodings of д.
$ perl -e'printf("%vX\n", pack("H*", "0434"));'
4.34
Normally, a file handle expects a string of bytes (characters with values in 0..255) to be printed to it. These bytes are output verbatim.[1][2]
When an encoding layer (e.g. :encoding(UTF-8)) is added to a file handle, it expects a string of Unicode Code Points (aka decoded text) to be printed to it instead.
Your program adds an encoding layer to STDOUT (through its use of the use open pragma), so you must provide UCP (decoded text) to print and say. You can obtain decoded text from encoded text using, for example, Encode's decode function.
use utf8;
use open qw( :std :encoding(UTF-8) );
use feature qw( say );
use Encode qw( decode );
say "д"; # ok (UCP of "д")
say pack("H*", "D0B4"); # XXX (UTF-8 encoding of "д")
say pack("H*", "0434"); # XXX (UCS-2be and UTF-16be encoding of "д")
say decode("UTF-8", pack("H*", "D0B4")); # ok (UCP of "д")
say decode("UCS-2be", pack("H*", "0434")); # ok (UCP of "д")
say decode("UTF-16be", pack("H*", "0434")); # ok (UCP of "д")
For the UTF-8 case, I need to set the UTF-8 flag on
No, you need to decode the strings.
The UTF-8 flag is irrelevant. Whether the flag is set or not originally is irrelevant. Whether the flag is set or not after the string is decoded is irrelevant. The flag indicates how the string is stored internally, something you shouldn't care about.
For example, take
use strict;
use warnings;
use open qw( :std :encoding(UTF-8) );
use feature qw( say );
my $x = chr(0xE9);
utf8::downgrade($x); # Tell Perl to use the UTF8=0 storage format.
say sprintf "%s %vX %s", utf8::is_utf8($x) ? "UTF8=1" : "UTF8=0", $x, $x;
utf8::upgrade($x); # Tell Perl to use the UTF8=1 storage format.
say sprintf "%s %vX %s", utf8::is_utf8($x) ? "UTF8=1" : "UTF8=0", $x, $x;
It outputs
UTF8=0 E9 é
UTF8=1 E9 é
Regardless of the UTF8 flag, the UTF-8 encoding (C3 A9) of the provided UCP (U+00E9) is output.
I suppose it's because there is no way for Perl UCS-2 from ISO-8859-1, so that test is probably bollocks, right?
At best, one could employ heuristics to guess whether a string is encoded using iso-latin-1 or UCS-2be. I suspect one could get rather accurate results (like those you'd get for iso-latin-1 and UTF-8.)
I'm not sure why you bring up iso-latin-1 since nothing else in your question relates to iso-latin-1.
Except on Windows, where a :crlf layer is added to handles by default.
You get a Wide character warning if you provide a string that contains a character that's not a byte, and the utf8 encoding of the string is output instead.
Please see if the following demonstration code is of any help.
use strict;
use warnings;
use feature 'say';
use utf8; # https://perldoc.perl.org/utf8.html
use Encode; # https://perldoc.perl.org/Encode.html
my $str;
my $utf8 = 'Привет Москва';
my $ucs2le = '1f044004380432043504420420001c043e0441043a0432043004'; # Little Endian
my $ucs2be = '041f044004380432043504420020041c043e0441043a04320430'; # Big Endian
my $utf16 = '041f044004380432043504420020041c043e0441043a04320430';
my $utf32 = '0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430';
# https://perldoc.perl.org/functions/binmode.html
binmode STDOUT, ':utf8';
# https://perldoc.perl.org/feature.html#The-'say'-feature
say 'UTF-8: ' . $utf8;
# https://perldoc.perl.org/Encode.html#THE-PERL-ENCODING-API
$str = pack('H*',$ucs2be);
say 'UCS-2BE: ' . decode('UCS-2BE',$str);
$str = pack('H*',$ucs2le);
say 'UCS-2LE: ' . decode('UCS-2LE',$str);
$str = pack('H*',$utf16);
say 'UTF-16: '. decode('UTF16',$str);
$str = pack('H*',$utf32);
say 'UTF-32: ' . decode('UTF32',$str);
Output
UTF-8: Привет Москва
UCS-2BE: Привет Москва
UCS-2LE: Привет Москва
UTF-16: Привет Москва
UTF-32: Привет Москва
Supported Cyrillic encodings
use strict;
use warnings;
use feature 'say';
use Encode;
use utf8;
binmode STDOUT, ':utf8';
my $utf8 = 'Привет Москва';
my @encodings = qw/UCS-2 UCS-2LE UCS-2BE UTF-16 UTF-32 ISO-8859-5 CP855 CP1251 KOI8-F KOI8-R KOI8-U/;
say '
:: Supported Cyrillic encoding
---------------------------------------------
UTF-8 ', $utf8;
for (@encodings) {
printf "%-11s %s\n", $_, unpack('H*', encode($_,$utf8));
}
Output
:: Supported Cyrillic encoding
---------------------------------------------
UTF-8 Привет Москва
UCS-2 041f044004380432043504420020041c043e0441043a04320430
UCS-2LE 1f044004380432043504420420001c043e0441043a0432043004
UCS-2BE 041f044004380432043504420020041c043e0441043a04320430
UTF-16 feff041f044004380432043504420020041c043e0441043a04320430
UTF-32 0000feff0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430
ISO-8859-5 bfe0d8d2d5e220bcdee1dad2d0
CP855 dde1b7eba8e520d3d6e3c6eba0
CP1251 cff0e8e2e5f220cceef1eae2e0
KOI8-F f0d2c9d7c5d420edcfd3cbd7c1
KOI8-R f0d2c9d7c5d420edcfd3cbd7c1
KOI8-U f0d2c9d7c5d420edcfd3cbd7c1
Documentation Encode::Supported
Both are good answers. Here is a slight extension of Polar Bear's code to print details about the string:
use strict;
use warnings;
use feature 'say';
use utf8;
use Encode;
sub about {
my($str) = @_;
# https://perldoc.perl.org/bytes.html
my $charlen = length($str);
my $txt;
{
use bytes;
my $mark = (utf8::is_utf8($str) ? "yes" : "no");
my $bytelen = length($str);
$txt = sprintf("Length: %d byte, %d chars, utf-8: %s, contents: %vd\n",
$bytelen,$charlen,$mark,$str);
}
return $txt;
}
my $str;
my $utf8 = 'Привет Москва';
my $ucs2le = '1f044004380432043504420420001c043e0441043a0432043004'; # Little Endian
my $ucs2be = '041f044004380432043504420020041c043e0441043a04320430'; # Big Endian
my $utf16 = '041f044004380432043504420020041c043e0441043a04320430';
my $utf32 = '0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430';
binmode STDOUT, ':utf8';
say 'UTF-8: ' . $utf8;
say about($utf8);
{
my $str = pack('H*',$ucs2be);
say 'UCS-2BE: ' . decode('UCS-2BE',$str);
say about($str);
}
{
my $str = pack('H*',$ucs2le);
say 'UCS-2LE: ' . decode('UCS-2LE',$str);
say about($str);
}
{
my $str = pack('H*',$utf16);
say 'UTF-16: '. decode('UTF16',$str);
say about($str);
}
{
my $str = pack('H*',$utf32);
say 'UTF-32: ' . decode('UTF32',$str);
say about($str);
}
# Try identity transcoding
{
my $str_encoded_in_utf16 = encode('UTF16',$utf8);
my $str = decode('UTF16',$str_encoded_in_utf16);
say 'The same: ' . $str;
say about($str);
}
Running this gives:
UTF-8: Привет Москва
Length: 25 byte, 13 chars, utf-8: yes, contents: 208.159.209.128.208.184.208.178.208.181.209.130.32.208.156.208.190.209.129.208.186.208.178.208.176
UCS-2BE: Привет Москва
Length: 26 byte, 26 chars, utf-8: no, contents: 4.31.4.64.4.56.4.50.4.53.4.66.0.32.4.28.4.62.4.65.4.58.4.50.4.48
UCS-2LE: Привет Москва
Length: 26 byte, 26 chars, utf-8: no, contents: 31.4.64.4.56.4.50.4.53.4.66.4.32.0.28.4.62.4.65.4.58.4.50.4.48.4
UTF-16: Привет Москва
Length: 26 byte, 26 chars, utf-8: no, contents: 4.31.4.64.4.56.4.50.4.53.4.66.0.32.4.28.4.62.4.65.4.58.4.50.4.48
UTF-32: Привет Москва
Length: 52 byte, 52 chars, utf-8: no, contents: 0.0.4.31.0.0.4.64.0.0.4.56.0.0.4.50.0.0.4.53.0.0.4.66.0.0.0.32.0.0.4.28.0.0.4.62.0.0.4.65.0.0.4.58.0.0.4.50.0.0.4.48
The same: Привет Москва
Length: 25 byte, 13 chars, utf-8: yes, contents: 208.159.209.128.208.184.208.178.208.181.209.130.32.208.156.208.190.209.129.208.186.208.178.208.176
And a little diagram I made as an overview for next time, covering encode, decode and pack. Because one better be ready for next time.

Perl - Convert utf-8 char to hyphen - read utf-8 as single char

I am new to Perl. I have a requirement where I have to convert the UTF-8 characters in a string to hyphens (-).
Input string - "IVM IST 20150324095652 31610150096 10ÑatÑ25ÑDisco 0000000091"
Expected output - "IVM IST 20150324095652 31610150096 10-at-25-Disco 0000000091".
But the program below, which I have written, reads each UTF-8 character as two separate bytes, so I get the output "10--at--25--Disco" instead.
[root@MAVBGL-351L cdr]# cat ../asciifilter.pl
#!/usr/bin/perl
use strict;
use Encode;
my @chars;
my $character;
my $num;
while(my $row = <>) {
@chars = split(//,$row);
foreach $character (@chars) {
$num = ord($character);
if($num < 127) {
print $character;
} else {
print "-";
}
}
}
Output:
[root@MAVBGL-351L cdr]# echo "IVM IST 20150324095652 31610150096 10ÑatÑ25ÑDisco 0000000091" | ../asciifilter.pl
IVM IST 20150324095652 31610150096 10--at--25--Disco 0000000091
But this particular 4th column has a fixed length of 14 characters, so the additional hyphens are creating a problem.
Can someone give me some clue on how to read UTF-8 char as single character ?
The main thing you need is perl -CSD. With that, the script can be as simple as
perl -CSD -pe 's/[^\x00-\x7F]/-/g'
See man perlrun for a discussion of the options; but briefly, -CS means STDIN, STDOUT, and STDERR are in UTF-8; and -CD means UTF-8 is the default PerlIO layer for both input and output streams. (This script only uses STDIN and STDOUT so the D isn't strictly necessary; but if you only learn one magic incantation, learn -CSD.)

How do I decode a double backslashed PERLQQ escaped string into Perl characters?

I read lines from a file which contains semi-utf8 encoding and I wish to convert it to Perl-internal representation for further operations.
file.in (plain ASCII):
MO\\xc5\\xbdN\\xc3\\x81
NOV\\xc3\\x81
These should translate to MOŽNÁ and NOVÁ.
I load the lines and upgrade them to proper utf8 notation, i.e. \\xc5\\xbd -> \x{00c5}\x{00bd}. Then I would like to take this upgraded $line and have Perl represent it internally:
for my $line (@lines) {
$line =~ s/x(..)/x{00$1}/g;
eval { $l = "$line"; };
}
Unfortunately, without success.
use File::Slurp qw(read_file);
use Encode qw(decode);
use Encode::Escape qw();
my $string =
decode 'UTF-8', # octets → characters
decode 'unicode-escape', # \x → octets
decode 'ascii-escape', # \\x → \x
read_file 'file.in';
Read from the bottom upwards.

How can I guess the encoding of a string in Perl?

I have a Unicode string and don't know what its encoding is. When this string is read by a Perl program, is there a default encoding that Perl will use? If so, how can I find out what it is?
I am trying to get rid of non-ASCII characters from the input. I found this on some forum that will do it:
my $line = encode('ascii', normalize('KD', $myutf), sub {$_[0] = ''});
How will the above work when no input encoding is specified? Should it be specified like the following?
my $line = encode('ascii', normalize('KD', decode('input-encoding', $myutf)), sub {$_[0] = ''});
To find out which encoding something unknown uses, you just have to try and look. The modules Encode::Detect and Encode::Guess automate that. (If you have trouble compiling Encode::Detect, try its fork Encode::Detective instead.)
use Encode::Detect::Detector;
my $unknown = "\x{54}\x{68}\x{69}\x{73}\x{20}\x{79}\x{65}\x{61}\x{72}\x{20}".
"\x{49}\x{20}\x{77}\x{65}\x{6e}\x{74}\x{20}\x{74}\x{6f}\x{20}".
"\x{b1}\x{b1}\x{be}\x{a9}\x{20}\x{50}\x{65}\x{72}\x{6c}\x{20}".
"\x{77}\x{6f}\x{72}\x{6b}\x{73}\x{68}\x{6f}\x{70}\x{2e}";
my $encoding_name = Encode::Detect::Detector::detect($unknown);
print $encoding_name; # gb18030
use Encode;
my $string = decode($encoding_name, $unknown);
I find encode 'ascii' a lame solution for getting rid of non-ASCII characters. Everything non-ASCII will be substituted with question marks; this is too lossy to be useful.
# Bad example; don't do this.
use utf8;
use Encode;
my $string = 'This year I went to 北京 Perl workshop.';
print encode('ascii', $string); # This year I went to ?? Perl workshop.
If you want readable ASCII text, I recommend Text::Unidecode instead. This, too, is a lossy encoding, but not as terrible as plain encode above.
use utf8;
use Text::Unidecode;
my $string = 'This year I went to 北京 Perl workshop.';
print unidecode($string); # This year I went to Bei Jing Perl workshop.
However, avoid those lossy encodings if you can help it. In case you want to reverse the operation later, pick either one of PERLQQ or XMLCREF.
use utf8;
use Encode qw(encode PERLQQ XMLCREF);
my $string = 'This year I went to 北京 Perl workshop.';
print encode('ascii', $string, PERLQQ); # This year I went to \x{5317}\x{4eac} Perl workshop.
print encode('ascii', $string, XMLCREF); # This year I went to &#x5317;&#x4eac; Perl workshop.
The Encode module has a way that you can try to do this. You decode the raw octets with what you think the encoding is. If the octets don't represent a valid encoding, it blows up and you catch it with an eval. Otherwise, you get back a properly encoded string. For example:
use Encode;
my $a_with_ring =
eval { decode( 'UTF-8', "\x6b\xc5", Encode::FB_CROAK ) }
or die "Could not decode string: $@";
This has the drawback that the same octet sequence can be valid in multiple encodings.
I have more to say about this in the upcoming Effective Perl Programming, 2nd Edition, which has an entire chapter on dealing with Unicode. I think my publisher would get mad if I posted the whole thing though. :)
You might also want to see Juerd's Unicode Advice, as well as some of the Unicode docs that come with Perl.
I like mscha's solution here, but simplified using Perl's defined-or operator (//):
sub slurp {
my ($file) = @_;
local $/; # slurp the whole file at once
open(my $fh, '<:raw', $file) or return undef();
my $raw = <$fh>;
close($fh);
# return the first successful decoding result
return
eval { Encode::decode('utf-8', $raw, Encode::FB_CROAK); } // # Try UTF-8
eval { Encode::decode('windows-1252', $raw, Encode::FB_CROAK); } // # Try windows-1252 (a superset of iso-8859-1 and ascii)
$raw; # Give up and return the raw bytes
}
The first successful decoding is returned. Plain ASCII content succeeds in the first decoding.
If you are working directly with string variables instead of reading in files, you can use just the successive-eval expression.
You can also use the following code to encrypt and decrypt a string:
sub ENCRYPT_DECRYPT {
my $Str_Message=$_[0];
my $Len_Str_Message=length($Str_Message);
my $Str_Encrypted_Message="";
for (my $Position = 0;$Position<$Len_Str_Message;$Position++){
my $Key_To_Use = (($Len_Str_Message+$Position)+1);
$Key_To_Use =(255+$Key_To_Use) % 255;
my $Byte_To_Be_Encrypted = substr($Str_Message, $Position, 1);
my $Ascii_Num_Byte_To_Encrypt = ord($Byte_To_Be_Encrypted);
my $Xored_Byte = $Ascii_Num_Byte_To_Encrypt ^ $Key_To_Use;
my $Encrypted_Byte = chr($Xored_Byte);
$Str_Encrypted_Message .= $Encrypted_Byte;
}
return $Str_Encrypted_Message;
}
my $var=&ENCRYPT_DECRYPT("hai");
print &ENCRYPT_DECRYPT($var);

How do I find the length of a Unicode string in Perl?

The perldoc page for length() tells me that I should use bytes::length(EXPR) to find the size of a Unicode string in bytes, and the bytes page echoes this.
use bytes;
$ascii = 'Lorem ipsum dolor sit amet';
$unicode = 'Lørëm ípsüm dölör sît åmét';
print "ASCII: " . length($ascii) . "\n";
print "ASCII bytes: " . bytes::length($ascii) . "\n";
print "Unicode: " . length($unicode) . "\n";
print "Unicode bytes: " . bytes::length($unicode) . "\n";
The output of this script, however, disagrees with the manpage:
ASCII: 26
ASCII bytes: 26
Unicode: 35
Unicode bytes: 35
It seems to me length() and bytes::length() return the same for both ASCII & Unicode strings. I have my editor set to write files as UTF-8 by default, so I figure Perl is interpreting the whole script as Unicode—does that mean length() automatically handles Unicode strings properly?
Edit: See my comment; my question doesn't make a whole lot of sense, because length() is not working "properly" in the above example: it is showing the length of the Unicode string in bytes, not characters. The reason I originally stumbled across this is a program in which I need to set the Content-Length header (in bytes) of an HTTP message. I had read up on Unicode in Perl and was expecting to have to do some fanciness to make things work, but when length() returned exactly what I needed right off the bat, I was confused! See the accepted answer for an overview of use utf8, use bytes, and no bytes in Perl.
If your scripts are encoded in UTF-8, then please use the utf8 pragma. The bytes pragma on the other hand will force byte semantics on length, even if the string is UTF-8. Both work in the current lexical scope.
$ascii = 'Lorem ipsum dolor sit amet';
{
use utf8;
$unicode = 'Lørëm ípsüm dölör sît åmét';
}
$not_unicode = 'Lørëm ípsüm dölör sît åmét';
no bytes; # default, can be omitted
print "Character semantics:\n";
print "ASCII: ", length($ascii), "\n";
print "Unicode: ", length($unicode), "\n";
print "Not-Unicode: ", length($not_unicode), "\n";
print "----\n";
use bytes;
print "Byte semantics:\n";
print "ASCII: ", length($ascii), "\n";
print "Unicode: ", length($unicode), "\n";
print "Not-Unicode: ", length($not_unicode), "\n";
This outputs:
Character semantics:
ASCII: 26
Unicode: 26
Not-Unicode: 35
----
Byte semantics:
ASCII: 26
Unicode: 35
Not-Unicode: 35
The purpose of the bytes pragma is to replace the length function (and several other string related functions) in the current scope. So every call to length in your program is a call to the length that bytes provides. This is more in line with what you were trying to do:
#!/usr/bin/perl
use strict;
use warnings;
sub bytes($) {
use bytes;
return length shift;
}
my $ascii = "foo"; #really UTF-8, but everything is in the ASCII range
my $utf8 = "\x{24d5}\x{24de}\x{24de}";
print "[$ascii] characters: ", length $ascii, "\n",
"[$ascii] bytes : ", bytes $ascii, "\n",
"[$utf8] characters: ", length $utf8, "\n",
"[$utf8] bytes : ", bytes $utf8, "\n";
Another subtle flaw in your reasoning is the assumption that there is such a thing as Unicode bytes. Unicode is an enumeration of characters. It says, for instance, that U+24D5 is ⓕ (CIRCLED LATIN SMALL LETTER F). What Unicode does not specify is how many bytes a character takes up; that is left to the encodings. UTF-8 says it takes up 3 bytes, UTF-16 says it takes up 2 bytes, UTF-32 says it takes 4 bytes, etc. Here is a comparison of Unicode encodings. Perl uses UTF-8 for its strings internally by default. UTF-8 has the benefit of being identical in every way to ASCII for the first 128 characters.
I found that it is possible to use the Encode module to influence how length works, if $string is a UTF-8 encoded string:
Encode::_utf8_on($string); # the length function will show number of code points after this.
Encode::_utf8_off($string); # the length function will show number of bytes in the string after this.
There’s a fair bit of problematic commentary here.
Perl doesn’t know—and doesn’t care—which strings are “Unicode” and which aren’t. All it knows is the code points that make up the string.
Peeking at Perl’s internal UTF8 flag indicates you likely have the wrong idea about Perl strings. A “UTF-8 encoded string”—that is, the result of an encode operation like utf8::encode—usually does NOT have that flag set, for example.
There are some interfaces where that abstraction leaks, and strings with the internal UTF8 flag set DO behave differently from the same set of code points without that flag (that is, after utf8::downgrade). It’s unwise to rely on these behaviours since Perl’s own maintainers regard them as bugs. Most are fixed by the “unicode_strings” and “unicode_eval” features, and the rest by Sys::Binmode from CPAN.