I am getting some interesting results while trying to discern the differences between using Encode::decode("utf8", $var) and utf8::decode($var). I've already discovered that calling the former multiple times on a variable will eventually result in the error "Cannot decode string with wide characters at ...", whereas the latter will happily run as many times as you want, simply returning false.
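A minimal sketch of that difference in failure modes (my illustration, not part of the original post):
use Encode ();
# Encode::decode() returns a new string and throws on failure:
my $octets = "\xc3\xa8";                      # UTF-8 octets for U+00E8
my $chars  = Encode::decode('utf8', $octets); # "\x{e8}"
# Repeated decoding eventually yields code points above 0xFF, at which
# point Encode::decode() croaks:
#   "Cannot decode string with wide characters at ..."
# utf8::decode() modifies its argument in place and merely returns
# false when the buffer is not valid UTF-8:
my $s = "\xe8";                               # a lone lead byte, invalid UTF-8
utf8::decode($s) or warn "not valid UTF-8; string left unchanged\n";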
What I'm having trouble understanding is how the length function returns different results depending on which method you use to decode. The problem arises because I am dealing with "doubly encoded" UTF-8 text from an outside file. To demonstrate the issue, I created a text file "test.txt" with the following Unicode characters on one line: U+00E8, U+00AB, U+008B, U+000A. These Unicode characters are the double encoding of the Unicode character U+8ACB, along with a newline character. The file was written to disk as UTF-8. I then run the following Perl script:
#!/usr/bin/perl
use strict;
use warnings;
require "Encode.pm";
require "utf8.pm";
open FILE, "test.txt" or die $!;
my #lines = <FILE>;
my $test = $lines[0];
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
my @unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
my @hex = (unpack('H*', $test));
print "Hex:\n@hex\n";
print "==============\n";
$test = Encode::decode("utf8", $test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
@unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
@hex = (unpack('H*', $test));
print "Hex:\n@hex\n";
print "==============\n";
$test = Encode::decode("utf8", $test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
@unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
@hex = (unpack('H*', $test));
print "Hex:\n@hex\n";
This gives the following output:
Length: 7
utf8 flag:
Unicode:
195 168 194 171 194 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
232 171 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 2
utf8 flag: 1
Unicode:
35531 10
Hex:
e8ab8b0a
This is what I would expect. The length is originally 7 because Perl thinks $test is just a series of bytes. After decoding once, Perl knows that $test is a series of characters that are UTF-8 encoded (i.e., instead of returning a length of 7 bytes, Perl returns a length of 4 characters, even though $test still occupies 7 bytes in memory). After the second decoding, $test contains 4 bytes interpreted as 2 characters, which is what I would expect, since Encode::decode took the 4 code points, interpreted them as UTF-8-encoded bytes, and produced 2 characters. The strange thing is when I modify the code to call utf8::decode instead (replacing every $test = Encode::decode("utf8", $test); with utf8::decode($test);).
This gives almost identical output; only the result of length differs:
Length: 7
utf8 flag:
Unicode:
195 168 194 171 194 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
232 171 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
35531 10
Hex:
e8ab8b0a
It seems like perl first counts the bytes before decoding (as expected), then counts the characters after the first decoding, but then counts the bytes again after the second decoding (not expected). Why would this switch happen? Is there a lapse in my understanding of how these decoding functions work?
Thanks, Matt
You are not supposed to use the functions from the utf8 pragma module. Its documentation says so:
Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.
Always use the Encode module, and also see the question "Checklist for going the Unicode way with Perl". unpack is too low-level; it does not even give you error checking.
You are going wrong with the assumption that the octets E8 AB 8B 0A are the result of UTF-8 double-encoding the characters 請 and a newline. This is the representation of a single UTF-8 encoding of these characters. Perhaps the whole confusion on your side stems from that mistake.
length is inappropriately overloaded: at certain times it determines the length in characters, at others the length in octets. Use better tools, such as Devel::Peek.
#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';
use Devel::Peek qw(Dump);
use Encode qw(decode);
my $test = "\x{00e8}\x{00ab}\x{0086}\x{000a}";
# or read the octets without implicit decoding from a file, does not matter
Dump $test;
# FLAGS = (PADMY,POK,pPOK)
# PV = 0x8d8520 "\350\253\213\n"\0
$test = decode('UTF-8', $test, Encode::FB_CROAK);
Dump $test;
# FLAGS = (PADMY,POK,pPOK,UTF8)
# PV = 0xc02850 "\350\253\213\n"\0 [UTF8 "\x{8acb}\n"]
Turns out this was a bug: https://rt.perl.org/rt3//Public/Bug/Display.html?id=80190.
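With the fix in place, the utf8::decode path agrees with Encode::decode. A quick sketch of the scenario above (my paraphrase of the post's script):
my $s = "\xc3\xa8\xc2\xab\xc2\x8b\x0a"; # the doubly encoded octets
utf8::decode($s);                       # 7 octets -> 4 characters
utf8::decode($s);                       # 4 code points -> 2 characters
print length($s), "\n";                 # 2 on fixed perls; the bug made
                                        # length() report a stale 4 here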
Related
I'm trying to run a simple test whereby I want to have differently formatted binary strings and print them out. In fact, I'm trying to investigate a problem whereby sprintf cannot deal with a wide-character string passed in for the placeholder %s.
In this case, the binary string shall just contain the Cyrillic "д" (because it's above ISO-8859-1)
The code below works when I use the character directly in the source.
But nothing that passes through pack works.
For the UTF-8 case, I need to set the UTF-8 flag on the string $ch, but how?
The UCS-2 case fails, and I suppose it's because there is no way for Perl to distinguish UCS-2 from ISO-8859-1, so that test is probably bollocks, right?
The code:
#!/usr/bin/perl
use utf8; # Meaning "This lexical scope (i.e. file) contains utf8"
# https://perldoc.perl.org/open.html
use open qw(:std :encoding(UTF-8));
sub showme {
my ($name,$ch) = @_;
print "-------\n";
print "This is test: $name\n";
my $ord = ord($ch); # ordinal computed outside of "use bytes"; actually should yield the unicode codepoint
{
# https://perldoc.perl.org/bytes.html
use bytes;
my $mark = (utf8::is_utf8($ch) ? "yes" : "no");
my $txt = sprintf("Received string of length: %i byte, contents: %vd, ordinal x%04X, utf-8: %s\n", length($ch), $ch, $ord, $mark);
print $txt,"\n";
}
print $ch, "\n";
print "Combine: $ch\n";
print "Concat: " . $ch . "\n";
print "Sprintf: " . sprintf("%s",$ch) . "\n";
print "-------\n";
}
showme("Cryillic direct" , "д");
showme("Cyrillic UTF-8" , pack("HH","D0","B4")); # UTF-8 of д is D0B4
showme("Cyrillic UCS-2" , pack("HH","04","34")); # UCS-2 of д is 0434
Current output:
Looks good
-------
This is test: Cyrillic direct
Received string of length: 2 byte, contents: 208.180, ordinal x0434, utf-8: yes
д
Combine: д
Concat: д
Sprintf: д
-------
That's a no. Where does the 176 come from??
-------
This is test: Cyrillic UTF-8
Received string of length: 2 byte, contents: 208.176, ordinal x00D0, utf-8: no
а
Combine: а
Concat: а
Sprintf: а
-------
This is even worse.
-------
This is test: Cyrillic UCS-2
Received string of length: 2 byte, contents: 0.48, ordinal x0000, utf-8: no
0
Combine: 0
Concat: 0
Sprintf: 0
-------
You have two problems.
Your calls to pack are incorrect
Each H represents one hex digit.
$ perl -e'printf "%vX\n", pack("HH", "D0", "B4")' # XXX
D0.B0
$ perl -e'printf "%vX\n", pack("H2H2", "D0", "B4")' # Ok
D0.B4
$ perl -e'printf "%vX\n", pack("(H2)2", "D0", "B4")' # Ok
D0.B4
$ perl -e'printf "%vX\n", pack("(H2)*", "D0", "B4")' # Better
D0.B4
$ perl -e'printf "%vX\n", pack("H*", "D0B4")' # Alternative
D0.B4
STDOUT is expecting decoded text, but you are providing encoded text
First, let's take a look at the strings you are producing (once the problem mentioned above is fixed). All you need for that is the %vX format, which prints the period-separated value of each character in hex.
"д" produces a one-character string. This character is the Unicode Code Point for д.
$ perl -e'use utf8; printf("%vX\n", "д");'
434
pack("H*", "D0B4") produces a two-character string. These characters are the UTF-8 encoding of д.
$ perl -e'printf("%vX\n", pack("H*", "D0B4"));'
D0.B4
pack("H*", "0434") produces a two-character string. These characters are the UCS-2be and UTF-16be encodings of д.
$ perl -e'printf("%vX\n", pack("H*", "0434"));'
4.34
Normally, a file handle expects a string of bytes (characters with values in 0..255) to be printed to it. These bytes are output verbatim.[1][2]
When an encoding layer (e.g. :encoding(UTF-8)) is added to a file handle, it expects a string of Unicode Code Points (aka decoded text) to be printed to it instead.
Your program adds an encoding layer to STDOUT (through its use of the use open pragma), so you must provide UCP (decoded text) to print and say. You can obtain decoded text from encoded text using, for example, Encode's decode function.
use utf8;
use open qw( :std :encoding(UTF-8) );
use feature qw( say );
use Encode qw( decode );
say "д"; # ok (UCP of "д")
say pack("H*", "D0B4"); # XXX (UTF-8 encoding of "д")
say pack("H*", "0434"); # XXX (UCS-2be and UTF-16be encoding of "д")
say decode("UTF-8", pack("H*", "D0B4")); # ok (UCP of "д")
say decode("UCS-2be", pack("H*", "0434")); # ok (UCP of "д")
say decode("UTF-16be", pack("H*", "0434")); # ok (UCP of "д")
For the UTF-8 case, I need to set the UTF-8 flag on
No, you need to decode the strings.
The UTF-8 flag is irrelevant. Whether the flag is set or not originally is irrelevant. Whether the flag is set or not after the string is decoded is irrelevant. The flag indicates how the string is stored internally, something you shouldn't care about.
For example, take
use strict;
use warnings;
use open qw( :std :encoding(UTF-8) );
use feature qw( say );
my $x = chr(0xE9);
utf8::downgrade($x); # Tell Perl to use the UTF8=0 storage format.
say sprintf "%s %vX %s", utf8::is_utf8($x) ? "UTF8=1" : "UTF8=0", $x, $x;
utf8::upgrade($x); # Tell Perl to use the UTF8=1 storage format.
say sprintf "%s %vX %s", utf8::is_utf8($x) ? "UTF8=1" : "UTF8=0", $x, $x;
It outputs
UTF8=0 E9 é
UTF8=1 E9 é
Regardless of the UTF8 flag, the UTF-8 encoding (C3 A9) of the provided UCP (U+00E9) is output.
I suppose it's because there is no way for Perl to distinguish UCS-2 from ISO-8859-1, so that test is probably bollocks, right?
At best, one could employ heuristics to guess whether a string is encoded using iso-latin-1 or UCS-2be. I suspect one could get rather accurate results (like those you'd get for iso-latin-1 and UTF-8). A crude version of such a heuristic is sketched below.
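This sketch is purely illustrative (my assumption, not from the answer): it relies on the text's code points lying below U+0800 (Latin, Cyrillic, Greek, ...), so that every UCS-2BE unit has a high octet below 0x08, which is rare in iso-latin-1 prose.
sub looks_like_ucs2be {
    my ($octets) = @_;
    return 0 unless length $octets;
    return 0 if length($octets) % 2;    # UCS-2 units are two octets each
    my $units = length($octets) / 2;
    my $small = 0;
    for my $i (0 .. $units - 1) {
        $small++ if ord(substr($octets, 2 * $i, 1)) < 0x08;
    }
    return $small / $units > 0.9;       # nearly every unit looks like UCS-2BE
}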
I'm not sure why you bring up iso-latin-1 since nothing else in your question relates to iso-latin-1.
[1] Except on Windows, where a :crlf layer is added to handles by default.
[2] You get a "Wide character" warning if you provide a string that contains a character that's not a byte, and the UTF-8 encoding of the string is output instead.
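For example (my illustration, assuming a UTF-8 terminal):
$ perl -e'print "\x{2126}\n"'   # no encoding layer on STDOUT
Wide character in print at -e line 1.
Ω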
Please see if the following demonstration code is of any help.
use strict;
use warnings;
use feature 'say';
use utf8; # https://perldoc.perl.org/utf8.html
use Encode; # https://perldoc.perl.org/Encode.html
my $str;
my $utf8 = 'Привет Москва';
my $ucs2le = '1f044004380432043504420420001c043e0441043a0432043004'; # Little Endian
my $ucs2be = '041f044004380432043504420020041c043e0441043a04320430'; # Big Endian
my $utf16 = '041f044004380432043504420020041c043e0441043a04320430';
my $utf32 = '0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430';
# https://perldoc.perl.org/functions/binmode.html
binmode STDOUT, ':utf8';
# https://perldoc.perl.org/feature.html#The-'say'-feature
say 'UTF-8: ' . $utf8;
# https://perldoc.perl.org/Encode.html#THE-PERL-ENCODING-API
$str = pack('H*',$ucs2be);
say 'UCS-2BE: ' . decode('UCS-2BE',$str);
$str = pack('H*',$ucs2le);
say 'UCS-2LE: ' . decode('UCS-2LE',$str);
$str = pack('H*',$utf16);
say 'UTF-16: '. decode('UTF16',$str);
$str = pack('H*',$utf32);
say 'UTF-32: ' . decode('UTF32',$str);
Output
UTF-8: Привет Москва
UCS-2BE: Привет Москва
UCS-2LE: Привет Москва
UTF-16: Привет Москва
UTF-32: Привет Москва
Supported Cyrillic encodings
use strict;
use warnings;
use feature 'say';
use Encode;
use utf8;
binmode STDOUT, ':utf8';
my $utf8 = 'Привет Москва';
my @encodings = qw/UCS-2 UCS-2LE UCS-2BE UTF-16 UTF-32 ISO-8859-5 CP855 CP1251 KOI8-F KOI8-R KOI8-U/;
say '
:: Supported Cyrillic encoding
---------------------------------------------
UTF-8 ', $utf8;
for (@encodings) {
printf "%-11s %s\n", $_, unpack('H*', encode($_,$utf8));
}
Output
:: Supported Cyrillic encoding
---------------------------------------------
UTF-8 Привет Москва
UCS-2 041f044004380432043504420020041c043e0441043a04320430
UCS-2LE 1f044004380432043504420420001c043e0441043a0432043004
UCS-2BE 041f044004380432043504420020041c043e0441043a04320430
UTF-16 feff041f044004380432043504420020041c043e0441043a04320430
UTF-32 0000feff0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430
ISO-8859-5 bfe0d8d2d5e220bcdee1dad2d0
CP855 dde1b7eba8e520d3d6e3c6eba0
CP1251 cff0e8e2e5f220cceef1eae2e0
KOI8-F f0d2c9d7c5d420edcfd3cbd7c1
KOI8-R f0d2c9d7c5d420edcfd3cbd7c1
KOI8-U f0d2c9d7c5d420edcfd3cbd7c1
Documentation Encode::Supported
Both are good answers. Here is a slight extension of Polar Bear's code that prints details about the string:
use strict;
use warnings;
use feature 'say';
use utf8;
use Encode;
sub about {
my($str) = @_;
# https://perldoc.perl.org/bytes.html
my $charlen = length($str);
my $txt;
{
use bytes;
my $mark = (utf8::is_utf8($str) ? "yes" : "no");
my $bytelen = length($str);
$txt = sprintf("Length: %d byte, %d chars, utf-8: %s, contents: %vd\n",
$bytelen,$charlen,$mark,$str);
}
return $txt;
}
my $str;
my $utf8 = 'Привет Москва';
my $ucs2le = '1f044004380432043504420420001c043e0441043a0432043004'; # Little Endian
my $ucs2be = '041f044004380432043504420020041c043e0441043a04320430'; # Big Endian
my $utf16 = '041f044004380432043504420020041c043e0441043a04320430';
my $utf32 = '0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430';
binmode STDOUT, ':utf8';
say 'UTF-8: ' . $utf8;
say about($utf8);
{
my $str = pack('H*',$ucs2be);
say 'UCS-2BE: ' . decode('UCS-2BE',$str);
say about($str);
}
{
my $str = pack('H*',$ucs2le);
say 'UCS-2LE: ' . decode('UCS-2LE',$str);
say about($str);
}
{
my $str = pack('H*',$utf16);
say 'UTF-16: '. decode('UTF16',$str);
say about($str);
}
{
my $str = pack('H*',$utf32);
say 'UTF-32: ' . decode('UTF32',$str);
say about($str);
}
# Try identity transcoding
{
my $str_encoded_in_utf16 = encode('UTF16',$utf8);
my $str = decode('UTF16',$str_encoded_in_utf16);
say 'The same: ' . $str;
say about($str);
}
Running this gives:
UTF-8: Привет Москва
Length: 25 byte, 13 chars, utf-8: yes, contents: 208.159.209.128.208.184.208.178.208.181.209.130.32.208.156.208.190.209.129.208.186.208.178.208.176
UCS-2BE: Привет Москва
Length: 26 byte, 26 chars, utf-8: no, contents: 4.31.4.64.4.56.4.50.4.53.4.66.0.32.4.28.4.62.4.65.4.58.4.50.4.48
UCS-2LE: Привет Москва
Length: 26 byte, 26 chars, utf-8: no, contents: 31.4.64.4.56.4.50.4.53.4.66.4.32.0.28.4.62.4.65.4.58.4.50.4.48.4
UTF-16: Привет Москва
Length: 26 byte, 26 chars, utf-8: no, contents: 4.31.4.64.4.56.4.50.4.53.4.66.0.32.4.28.4.62.4.65.4.58.4.50.4.48
UTF-32: Привет Москва
Length: 52 byte, 52 chars, utf-8: no, contents: 0.0.4.31.0.0.4.64.0.0.4.56.0.0.4.50.0.0.4.53.0.0.4.66.0.0.0.32.0.0.4.28.0.0.4.62.0.0.4.65.0.0.4.58.0.0.4.50.0.0.4.48
The same: Привет Москва
Length: 25 byte, 13 chars, utf-8: yes, contents: 208.159.209.128.208.184.208.178.208.181.209.130.32.208.156.208.190.209.129.208.186.208.178.208.176
And a little diagram I made as an overview for next time, covering encode, decode and pack. Because one better be ready for next time.
(The above diagram & its graphml file available here)
I am attempting to obfuscate some code, using a short .pm library. The code is in the cgi-bin. But I am getting "Identifier too long" errors ... yet the line it's deciphering is only 1016 chars in plain form.
Here is the library:
package dml;
use Filter::Simple;
FILTER {
$_=~tr/[hjndr 9_45|863!7]/[acefbd2461507938]/;
}
And the actual program itself ...
BEGIN {
$path = 'D:/home/cristofa/';
}
use lib $path;
use dml;
! 9_44! 96476_6_68!h9d9d6666669n4!4d454_4 4|4 9n4!4d4 9d4j45634d6|6_9d4 63 ...
no dml;
I have shortened the code for obvious reasons.
As well as the "identifier too long", I can change other bits, (I think removing filter::simple and using tr~ on its own) and then get "NO is not allowed" referring to the 'no dml' line. I tried putting the data into $_='! 9_44 ...' but that comes back re changing a read only value!!!
If you're curious, the first two figures above SHOULD convert to "3d". I step through the decoded string two at a time, and thus hex for the above is "=", (since the first line of the decoded file is "$f='xyz';" - and I ran into problems trying to substitute the Dollar back to a variable - I ended up using "=$f='xyz';" in the script and then using $data=~s/=\$/\$/g; when converting)
But my 'dilemma' is why that 1016 byte line is causing the script to throw a "wobbly" when I have another program using a library which decodes 2678 bytes with no problem.
$ perl -E'
$_ = "! 9_44! 96476_6_68!h9d9d6666669n4!4d454_4 4|4 9n4!4d4 9d4j45634d6|6_9d4 63 ...";
tr/[hjndr 9_45|863!7]/[acefbd2461507938]/;
say;
'
3d24663d27687474703a2f2f7777772e636f61646d656d2e636f6d2f6c61796f75742f6d79d...
That is indeed a very very long identifier.
That looks like hex. Let's try converting the sequence from hex into bytes and rendering them on a UTF-8 terminal.
$ perl -E'
$_ = "! 9_44! 96476_6_68!h9d9d6666669n4!4d454_4 4|4 9n4!4d4 9d4j45634d6|6_9d4 63 ...";
tr/[hjndr 9_45|863!7]/[acefbd2461507938]/;
$_ = pack("H*", $_);
say;
'
=$f='http://www.coadmem.com/layout/my<garbage>
Bingo! You forgot $_ = pack("H*", $_); in your filter.
By the way, tr/[abc]/[def]/ is equivalent to tr/][abc/][def/, which is equivalent to tr/abc/def/ (except for the returned value, which you ignore). Get rid of the [ and ]!
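Putting both fixes together, the filter might look like this (a sketch on my part; it assumes the filtered block contains only the hex payload, and strips newlines, which are not part of the cipher alphabet, before unpacking):
package dml;
use Filter::Simple;
FILTER {
    tr/hjndr 9_45|863!7/acefbd2461507938/; # undo the character substitution
    s/\n//g;                               # pack('H*') must see only hex digits
    $_ = pack('H*', $_);                   # hex digits -> bytes (the missing step)
};
1;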
I'm trying to write up an example of testing query string parsing when I got stumped on a Unicode issue. In short, the letter "Omega" (Ω) doesn't seem to be decoded correctly.
Unicode: U+2126
3-byte sequence: \xe2\x84\xa6
URI encoded: %E2%84%A6
So I wrote this test program to verify that I could "decode" Unicode query strings with URI::Encode.
use strict;
use warnings;
use utf8::all; # use before Test::Builder clones STDOUT, etc.
use URI::Encode 'uri_decode';
use Test::More;
sub parse_query_string {
my $query_string = shift;
my @pairs = split /[&;]/ => $query_string;
my %values_for;
foreach my $pair (@pairs) {
my ( $key, $value ) = split( /=/, $pair );
$_ = uri_decode($_) for $key, $value;
$values_for{$key} ||= [];
push @{ $values_for{$key} } => $value;
}
return \%values_for;
}
my $omega = "\N{U+2126}";
my $query = parse_query_string('alpha=%E2%84%A6');
is_deeply $query, { alpha => [$omega] }, 'Unicode should decode correctly';
diag $omega;
diag $query->{alpha}[0];
done_testing;
And the output of the test:
query.t ..
not ok 1 - Unicode should decode correctly
# Failed test 'Unicode should decode correctly'
# at query.t line 23.
# Structures begin differing at:
# $got->{alpha}[0] = 'â¦'
# $expected->{alpha}[0] = 'Ω'
# Ω
# â¦
1..1
# Looks like you failed 1 test of 1.
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/1 subtests
Test Summary Report
-------------------
query.t (Wstat: 256 Tests: 1 Failed: 1)
Failed test: 1
Non-zero exit status: 1
Files=1, Tests=1, 0 wallclock secs ( 0.03 usr 0.01 sys + 0.05 cusr 0.00 csys = 0.09 CPU)
Result: FAIL
It looks to me like URI::Encode may be broken here, but switching to URI::Escape and using the uri_unescape function reports the same error. What am I missing?
The URI-encoded characters simply represent UTF-8 byte sequences, and URI::Encode and URI::Escape simply decode them into a UTF-8 byte string; neither of them decodes that byte string as UTF-8 (which is correct behavior for a generic URI-decoding library).
Put another way, your code basically does:
is "\N{U+2126}", "\xe2\x84\xa6", and that will fail, since upon comparison Perl treats the latter as a 3-character Latin-1 string.
You have to manually decode the input value with Encode::decode_utf8 after uri_decode, or instead compare against the encoded UTF-8 byte sequence.
URI escaping represents octets and knows nothing about character encodings, so you have to decode from UTF-8 octets to characters yourself, e.g.:
$_ = decode_utf8(uri_decode($_)) for $key, $value;
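For instance, a self-contained sketch of the two-step decode, here using URI::Escape's uri_unescape in place of uri_decode:
use strict;
use warnings;
use Encode qw( decode_utf8 );
use URI::Escape qw( uri_unescape );
my $pct    = '%E2%84%A6';
my $octets = uri_unescape($pct);   # undo the URI encoding -> "\xe2\x84\xa6"
my $omega  = decode_utf8($octets); # undo the UTF-8 encoding -> "\x{2126}"
printf "%vX\n", $omega;            # 2126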
The problem can be seen in the incorrect details of your own explanation. What you are dealing with is really:
Unicode codepoint: U+2126
UTF-8 encoding of codepoint: \xe2\x84\xa6
URI encoding of UTF-8 encoding of codepoint: %E2%84%A6
The problem is that you only undid one of the encodings.
Solutions have already been presented. I just wanted to present an alternate explanation.
I'd recommend that you have a look at Why does modern Perl avoid UTF-8 by default? for a thorough discussion on this topic.
I would add to the discussion there:
You'll notice a lot of odd glyphs on the page. This was intentional on the part of the author.
I've tried the Symbola font recommended in the thread and it looked horrible on Win 7. YMMV.
Reading Why does modern Perl avoid UTF-8 by default? too frequently may lead to depression and lingering doubts about your life choices.
I recently wrote a script which parsed a text representation of a single-byte binary month field.
(Don't ask :-{ )
After fiddling with sprintf for a while, I gave up and did this:
our %months = qw / x01 1
x02 2
x03 3
x04 4
x05 5
x06 6
x07 7
x08 8
x09 9
x0a 10
x0b 11
x0c 12 /;
...
my $month = $months{$text};
Which I get away with, because I'm only using 12 numbers, but is there a better way of doing this?
If you have
$hex_string = "0x10";
you can use:
$hex_val = hex($hex_string);
And you'll get: $hex_val == 16
hex doesn't require the "0x" at the beginning of the string. If it's missing it will still translate a hex string to a number.
You can also use oct to translate binary, octal or hex strings to numbers based on the prefix:
0b - binary
0 - octal
0x - hex
See hex and/or oct.
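For example (a quick sketch of both functions):
print hex("0x10"), "\n";  # 16
print hex("10"),   "\n";  # 16 -- the 0x prefix is optional for hex()
print oct("0b101"), "\n"; # 5  -- binary
print oct("017"),   "\n"; # 15 -- octal
print oct("0x1F"),  "\n"; # 31 -- hex
print oct("20"),    "\n"; # 16 -- without a prefix, oct() assumes octal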
#!/usr/bin/perl
use strict;
use warnings;
my @months = map hex, qw/x01 x02 x03 x04 x05 x06 x07 x08 x09 x0a x0b x0c/;
print "$_\n" for @months;
If I understand correctly, you have one byte per month: not the string "0x10", but rather a byte with the value 10 in it.
In that case, you should use unpack:
my $in = "\x0a";
print length($in), "\n";
my ($out) = unpack("c", $in);
print length($out), "\n", $out, "\n"
output:
1
2
10
If the input is 3 characters, like "x05", then the change is also quite simple:
my $in = "x0a";
my $out = hex($in);
Here's another way that may be more practical for directly converting the hexadecimals contained in a string.
This makes use of the /e (e for eval) regex modifier on s///.
Starting from this string:
$hello_world = "\\x48\\x65\\x6c\\x6c\\x6f\\x20\\x57\\x6f\\x72\\x6c\\x64";
Hexadecimals to characters :
print $hello_world =~ s/\\x([0-9a-fA-F]{2})/chr hex $1/gre;
Hexadecimals to decimal numbers :
print $hello_world =~ s/\\x([0-9a-fA-F]{2})/hex $1/gre;
Drop the /r modifier to substitute the string in-place.
One day I used a Python script that did stuff with a binary file, and I was stuck with a bytes literal (b'\x09\xff...') containing only hexadecimal digits.
I managed to get back my bytes with a one-liner that was a variant of the above.
The perldoc page for length() tells me that I should use bytes::length(EXPR) to find the length of a Unicode string in bytes, and the bytes page echoes this.
use bytes;
$ascii = 'Lorem ipsum dolor sit amet';
$unicode = 'Lørëm ípsüm dölör sît åmét';
print "ASCII: " . length($ascii) . "\n";
print "ASCII bytes: " . bytes::length($ascii) . "\n";
print "Unicode: " . length($unicode) . "\n";
print "Unicode bytes: " . bytes::length($unicode) . "\n";
The output of this script, however, disagrees with the manpage:
ASCII: 26
ASCII bytes: 26
Unicode: 35
Unicode bytes: 35
It seems to me length() and bytes::length() return the same for both ASCII & Unicode strings. I have my editor set to write files as UTF-8 by default, so I figure Perl is interpreting the whole script as Unicode—does that mean length() automatically handles Unicode strings properly?
Edit: See my comment; my question doesn't make a whole lot of sense, because length() is not working "properly" in the above example - it is showing the length of the Unicode string in bytes, not characters. The reason I originally stumbled across this is a program in which I need to set the Content-Length header (in bytes) of an HTTP message. I had read up on Unicode in Perl and was expecting to have to do some fanciness to make things work, but when length() returned exactly what I needed right off the bat, I was confused! See the accepted answer for an overview of use utf8, use bytes, and no bytes in Perl.
If your scripts are encoded in UTF-8, then please use the utf8 pragma. The bytes pragma on the other hand will force byte semantics on length, even if the string is UTF-8. Both work in the current lexical scope.
$ascii = 'Lorem ipsum dolor sit amet';
{
use utf8;
$unicode = 'Lørëm ípsüm dölör sît åmét';
}
$not_unicode = 'Lørëm ípsüm dölör sît åmét';
no bytes; # default, can be omitted
print "Character semantics:\n";
print "ASCII: ", length($ascii), "\n";
print "Unicode: ", length($unicode), "\n";
print "Not-Unicode: ", length($not_unicode), "\n";
print "----\n";
use bytes;
print "Byte semantics:\n";
print "ASCII: ", length($ascii), "\n";
print "Unicode: ", length($unicode), "\n";
print "Not-Unicode: ", length($not_unicode), "\n";
This outputs:
Character semantics:
ASCII: 26
Unicode: 26
Not-Unicode: 35
----
Byte semantics:
ASCII: 26
Unicode: 35
Not-Unicode: 35
The purpose of the bytes pragma is to replace the length function (and several other string related functions) in the current scope. So every call to length in your program is a call to the length that bytes provides. This is more in line with what you were trying to do:
#!/usr/bin/perl
use strict;
use warnings;
sub bytes($) {
use bytes;
return length shift;
}
my $ascii = "foo"; #really UTF-8, but everything is in the ASCII range
my $utf8 = "\x{24d5}\x{24de}\x{24de}";
print "[$ascii] characters: ", length $ascii, "\n",
"[$ascii] bytes : ", bytes $ascii, "\n",
"[$utf8] characters: ", length $utf8, "\n",
"[$utf8] bytes : ", bytes $utf8, "\n";
Another subtle flaw in your reasoning is the assumption that there is such a thing as "Unicode bytes". Unicode is an enumeration of characters: it says, for instance, that U+24D5 is ⓕ (CIRCLED LATIN SMALL LETTER F). What Unicode does not specify is how many bytes a character takes up; that is left to the encodings. UTF-8 says it takes up 3 bytes, UTF-16 says 2 bytes, UTF-32 says 4 bytes, and so on. Here is a comparison of Unicode encodings. Perl uses UTF-8 for its strings internally by default. UTF-8 has the benefit of being identical to ASCII for the first 128 characters.
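A short sketch of that point (my addition), using Encode to count the octets each encoding needs for U+24D5:
use Encode qw( encode );
my $ch = "\x{24d5}";   # CIRCLED LATIN SMALL LETTER F, one character
printf "%-9s %d octets\n", $_, length encode($_, $ch)
    for qw( UTF-8 UTF-16BE UTF-32BE );
# UTF-8     3 octets
# UTF-16BE  2 octets
# UTF-32BE  4 octets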
I found that it is possible to use the Encode module to influence how length works.
If $string is a UTF-8 encoded byte string:
Encode::_utf8_on($string); # the length function will show number of code points after this.
Encode::_utf8_off($string); # the length function will show number of bytes in the string after this.
There’s a fair bit of problematic commentary here.
Perl doesn’t know—and doesn’t care—which strings are “Unicode” and which aren’t. All it knows is the code points that make up the string.
Peeking at Perl’s internal UTF8 flag indicates you likely have the wrong idea about Perl strings. A “UTF-8 encoded string”—that is, the result of an encode operation like utf8::encode—usually does NOT have that flag set, for example.
There are some interfaces where that abstraction leaks, and strings with the internal UTF8 flag set DO behave differently from the same set of code points without that flag (that is, after utf8::downgrade). It’s unwise to rely on these behaviours since Perl’s own maintainers regard them as bugs. Most are fixed by the “unicode_strings” and “unicode_eval” features, and the rest by Sys::Binmode from CPAN.