Perl LWP::UserAgent mishandling UTF-8 response

When I use LWP::UserAgent to retrieve content encoded in UTF-8, it seems LWP::UserAgent doesn't handle the encoding correctly.
Here's the output after setting the Command Prompt window to Unicode with the command chcp 65001. Note that this initially gives the appearance that all is well, but I think it's just the shell reassembling bytes and decoding UTF-8. From the other output you can see that Perl itself is not handling wide characters correctly.
C:\>perl getutf8.pl
======================================================================
HTTP/1.1 200 OK
Connection: close
Date: Fri, 31 Dec 2010 19:24:04 GMT
Accept-Ranges: bytes
Server: Apache/2.2.8 (Win32) PHP/5.2.6
Content-Length: 75
Content-Type: application/xml; charset=utf-8
Last-Modified: Fri, 31 Dec 2010 19:20:18 GMT
Client-Date: Fri, 31 Dec 2010 19:24:04 GMT
Client-Peer: 127.0.0.1:80
Client-Response-Num: 1
<?xml version="1.0" encoding="UTF-8"?>
<name>Budějovický Budvar</name>
======================================================================
response content length is 33
....v....1....v....2....v....3....v....4
<name>Budějovický Budvar</name>
. . . . v . . . . 1 . . . . v . . . . 2 . . . . v . . . . 3 . . . .
3c6e616d653e427564c49b6a6f7669636bc3bd204275647661723c2f6e616d653e
< n a m e > B u d � � j o v i c k � � B u d v a r < / n a m e >
Above you can see the payload length is 31 characters but Perl thinks it is 33.
For confirmation, in the hex, we can see that the UTF-8 sequences c49b and c3bd are being interpreted as four separate characters and not as two Unicode characters.
Here's the code
#!perl
use strict;
use warnings;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new();
my $response = $ua->get('http://localhost/Bud.xml');
if (! $response->is_success) { die $response->status_line; }
print '='x70,"\n",$response->as_string(), '='x70,"\n";
my $r = $response->decoded_content((charset => 'UTF-8'));
$/ = "\x0d\x0a"; # seems to be \x0a otherwise!
chomp($r);
# Remove any xml prologue
$r =~ s/^<\?.*\?>\x0d\x0a//;
print "Response content length is ", length($r), "\n\n";
print "....v....1....v....2....v....3....v....4\n";
print $r,"\n";
print ". . . . v . . . . 1 . . . . v . . . . 2 . . . . v . . . . 3 . . . . \n";
print unpack("H*", $r), "\n";
print join(" ", split("", $r)), "\n";
Note that Bud.xml is UTF-8 encoded without a BOM.
How can I persuade LWP::UserAgent to do the right thing?
P.S. Ultimately I want to translate the Unicode data into an ASCII encoding, even if it means replacing each non-ASCII character with one question mark or other marker.
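As a minimal sketch of that fallback idea (assuming Encode's default substitution behaviour, which replaces characters that the target encoding cannot represent with '?'; the literal byte string here just stands in for the fetched data):
use strict;
use warnings;
use Encode qw(decode encode);

my $chars = decode('UTF-8', "Bud\xc4\x9bjovick\xc3\xbd Budvar");
my $ascii = encode('ascii', $chars);   # unmappable characters become '?'
print $ascii, "\n";                    # Bud?jovick? Budvar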
Update 1
I have accepted Ysth's "upgrade" answer because I know it is the right thing to do when possible. However, there is a workaround to fix up the data into a well-formed Perl Unicode string.
$r = decode("utf8", $r);
Update 2
My data gets fed to a non-Perl application that displays the data using Code Page 437 to Putty/Reflection/Teraterm terminals at many locations. The app is currently displaying something like:
Bud├ä┬øjovick├â┬¢ Budvar
I am going to use ($r = decode("UTF-8", $r)) =~ s/[\x80-\x{FFFF}]/\xFE/g; to get the app to display:
Bud■jovick■ Budvar
Moving away from CP437 would be a major job, so that is not going to happen in the short to medium term.
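A slightly more general sketch of that substitution (the sample string is inlined here for illustration; matching [^\x00-\x7F] also catches code points above U+FFFF, which the \x80-\x{FFFF} range would miss):
use strict;
use warnings;
use Encode qw(decode);

my $r = decode('UTF-8', "Bud\xc4\x9bjovick\xc3\xbd Budvar");
$r =~ s/[^\x00-\x7F]/\xFE/g;    # every non-ASCII character becomes byte 0xFE
print $r, "\n";                 # renders as "Bud■jovick■ Budvar" under CP437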
Update 3
CPAN has some interesting Unicode modules such as:
Text::Unidecode
Unicode::Map8
Unicode::Map
Unicode::Escape
Unicode::Transliterate
Text::Unidecode translated "Budějovický Budvar" into "Budejovicky Budvar", which didn't seem to me a particularly impressive attempt at phonetic transliteration, but then I don't speak Czech. English speakers might prefer it to "Bud■jovick■ Budvar", though.
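For reference, a minimal sketch of how Text::Unidecode is typically used (the output layer is only needed because the source literal is UTF-8):
use strict;
use warnings;
use utf8;                     # the string literal below is UTF-8 source text
use Text::Unidecode;

binmode STDOUT, ':encoding(UTF-8)';
print unidecode("Budějovický Budvar"), "\n";   # prints "Budejovicky Budvar"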

Upgrade to a newer libwww-perl. The old version you are using only honored the charset argument to decoded_content for text/* content types; the newer version also does so for application/xml and for anything ending in +xml.
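For illustration, a minimal sketch of the same request against an up-to-date libwww-perl (the URL is the one from the question; the output layer avoids wide-character warnings):
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

binmode STDOUT, ':encoding(UTF-8)';    # print wide characters cleanly

my $ua       = LWP::UserAgent->new();
my $response = $ua->get('http://localhost/Bud.xml');
die $response->status_line unless $response->is_success;

# With a recent libwww-perl this honours charset=utf-8 for application/xml
# and returns a Perl character string.
my $xml = $response->decoded_content;
print "Length in characters: ", length($xml), "\n";
print $xml;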

Related

Perl: Homograph attacks. It is possible to compare ascii / non-ascii strings, visually similar?

I faced this so-called "homograph attack" and I want to reject domains whose decoded Punycode visually appears to be alphanumeric only. For example, www.xn--80ak6aa92e.com will display as www.apple.com in the browser (Firefox). The domains are visually the same, but the character set is different. Chrome has already patched this and the browser displays the Punycode.
I have example below.
#!/usr/bin/perl
use strict;
use warnings;
use Net::IDN::Encode ':all';
use utf8;
my $testdomain = "www.xn--80ak6aa92e.com";
my $IDN = domain_to_unicode($testdomain);
my $visual_result_ascii = "www.apple.com";
print "S1: $IDN\n";
print "S2: $visual_result_ascii";
print "MATCH" if ($IDN eq $visual_result_ascii);
Visually they are the same, but they won't match. Is it possible to compare a Unicode string ($IDN) against an alphanumeric string that is visually the same?
Your example converted by the Punycode converter results in this UTF-8 string:
www.аррӏе.com
$ perl -e 'printf("%02x ", ord) for split("", "www.аррӏе.com"); print "\n"'
77 77 77 2e d0 b0 d1 80 d1 80 d3 8f d0 b5 2e 63 6f 6d
As Unicode:
$ perl -Mutf8 -e 'printf("%04x ", ord) for split("", "www.аррӏе.com"); print "\n"'
0077 0077 0077 002e 0430 0440 0440 04cf 0435 002e 0063 006f 006d
Using @ikegami's input:
$ perl -Mutf8 -MEncode -e 'print encode("UTF-8", $_) for ("www.аррӏе.com" =~ /\p{Cyrillic}/g); print "\n"'
аррӏе
$ perl -Mutf8 -MEncode -e 'print encode("UTF-8", $_) for ("www.аррӏе.com" =~ /\P{Cyrillic}/g); print "\n"'
www..com
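Building on the same \p{Script} properties, here is a minimal mixed-script check (the hostname is hard-coded purely for illustration; in practice it would come from domain_to_unicode):
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);

# Hostname hard-coded for illustration only.
my $host  = "www.аррӏе.com";
my $mixed = ($host =~ /\p{Latin}/) && ($host =~ /\p{Cyrillic}/);

print encode('UTF-8', $host),
      $mixed ? " mixes Latin and Cyrillic - suspicious\n"
             : " uses a single script\n";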
Original idea
I'm not sure if code for this exists, but my first idea would be to create a map from \N{xxxx} to a visually equivalent ASCII/UTF-8 character. Then you could apply the map to the Unicode string to "convert" it to ASCII/UTF-8 and compare the resulting string against a list of domains.
Example code (I'm skipping the IDN decoding and using the UTF-8 result directly in the test data). This could probably still be improved, but at least it shows the idea.
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode;
# Unicode (in HEX) -> visually equal ASCII/ISO-8859-1/... character
my %unicode_to_equivalent = (
'0430' => 'a',
'0435' => 'e',
'04CF' => 'l',
'0440' => 'p',
);
while (<DATA>) {
    chomp;
    # assuming that this returns a valid Perl UTF-8 string
    #my $IDN = domain_to_unicode($_);
    my ($IDN, $compare) = split(' ', $_);   # already decoded in test data
    my $visually_decoded =
        join('',                            # merge result
            map {                           # map, if mapping exists
                $unicode_to_equivalent{sprintf("%04X", ord($_))} // $_
            }
            split('', $IDN)                 # split to characters
        );
    print "Testing: ", encode('UTF-8', $IDN), " -> $compare ";
    print "Visual match!"
        if ($visually_decoded eq $compare);
    print "\n";
}
exit 0;
__DATA__
www.аррӏе.com www.apple.com
Test run (depends if copy & paste from the answer preserves the original UTF-8 strings)
$ perl dummy.pl
Testing: www.аррӏе.com -> www.apple.com Visual match!
Counting the # of scripts in the string
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode;
use Unicode::UCD qw(charscript);
while (<DATA>) {
    chomp;
    # assuming that this returns a valid Perl UTF-8 string
    #my $IDN = domain_to_unicode($_);
    my ($IDN) = $_;   # already decoded in test data
    # Unicode characters
    my @characters = split('', $IDN);
    # See UTR #39: Unicode Security Mechanisms
    my %scripts =
        map { (charscript(ord), 1) }   # Codepoint to script
        @characters;
    delete $scripts{Common};
    print 'Testing: ',
        encode('UTF-8', $IDN),
        ' (', join(' ', map { sprintf("%04X", ord) } @characters), ')',
        (keys %scripts == 1) ? ' not' : '', " suspicious\n";
}
exit 0;
__DATA__
www.аррӏе.com
www.apple.com
www.école.fr
Test run (depends if copy & paste from the answer preserves the original UTF-8 strings)
$ perl dummy.pl
Testing: www.аррӏе.com (0077 0077 0077 002E 0430 0440 0440 04CF 0435 002E 0063 006F 006D) suspicious
Testing: www.apple.com (0077 0077 0077 002E 0061 0070 0070 006C 0065 002E 0063 006F 006D) not suspicious
Testing: www.école.fr (0077 0077 0077 002E 00E9 0063 006F 006C 0065 002E 0066 0072) not suspicious
After some research and thanks to your comments, I have a conclusion now.
The most frequent issues come from Cyrillic. This script contains a lot of characters that are visually similar to Latin ones, and you can make many combinations.
I have identified some scammy IDN domains including these names:
"аррӏе" "сһаѕе" "сіѕсо"
Maybe here, with this font, you can see a difference, but in the browser there is absolutely no visual difference.
Consulting https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode I was able to create a table with 12 visually similar characters.
Update: I found 4 more Latin-like characters in the Cyrillic charset, 16 in total now.
It is possible to create many combinations of these, producing IDNs that are 100% visually similar to legitimate domains.
0430 a CYRILLIC SMALL LETTER A
0441 c CYRILLIC SMALL LETTER ES
0501 d CYRILLIC SMALL LETTER KOMI DE
0435 e CYRILLIC SMALL LETTER IE
04bb h CYRILLIC SMALL LETTER SHHA
0456 i CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
0458 j CYRILLIC SMALL LETTER JE
043a k CYRILLIC SMALL LETTER KA
04cf l CYRILLIC SMALL LETTER PALOCHKA
043e o CYRILLIC SMALL LETTER O
0440 p CYRILLIC SMALL LETTER ER
051b q CYRILLIC SMALL LETTER QA
0455 s CYRILLIC SMALL LETTER DZE
051d w CYRILLIC SMALL LETTER WE
0445 x CYRILLIC SMALL LETTER HA
0443 y CYRILLIC SMALL LETTER U
The problem happens with the second-level domain. Extensions can also be IDNs, but they are verified, cannot be spoofed, and are not the subject of this issue.
The domain registrar will check whether all letters are from the same script; an IDN will not be accepted if it mixes Latin and non-Latin characters, so extra validation for that is pointless.
My idea is simple: we split the domain and decode only the SLD part, then we match it against the list of visually similar Cyrillic characters.
If all letters are visually similar to Latin, then the result is almost certainly a scam.
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use open ':std', ':encoding(UTF-8)';
use Net::IDN::Encode ':all';
use Array::Utils qw(:all);
my @latinlike_cyrillics = qw(0430 0441 0501 0435 04bb 0456 0458 043a 04cf 043e 0440 051b 0455 051d 0445 0443);
# maybe you can find better examples
my $domain1 = "www.xn--80ak6aa92e.com";
my $domain2 = "www.xn--d1acpjx3f.xn--p1ai";
test_domain ($domain1);
test_domain ($domain2);
sub test_domain {
    my $testdomain = shift;
    my ($tLD, $sLD, $topLD) = split(/\./, $testdomain);
    my $IDN = domain_to_unicode($sLD);
    my @decoded;
    push(@decoded, sprintf("%04x", ord)) for split("", $IDN);
    my @checker = array_minus(@decoded, @latinlike_cyrillics);
    if (@checker) { print "$testdomain [$IDN] seems to be ok\n" }
    else          { print "$testdomain [$IDN] is possibly scam\n" }
}

Chunked transfer encoding browser experience

Why does the output of this simple Perl script >>
print "Content-type: text/plain\n";
print "Transfer-Encoding: chunked\n\n";
print "11\n\n";
print "0123456789ABCDEF\n";
print "11\n\n";
print "0123456789ABCDEF\n";
print "0\n\n";
...work for the Chrome browser but not for IE10?
You’ve implemented the chunked transfer coding wrong: Each chunk consists of the chunk size in bytes in hexadecimal notation, followed by a CRLF sequence, followed by the chunk data:
chunk = chunk-size [ chunk-extension ] CRLF
chunk-data CRLF
chunk-size = 1*HEX
last-chunk = 1*("0") [ chunk-extension ] CRLF
chunk-data = chunk-size(OCTET)
So your code should look like this:
print "Content-type: text/plain\r\n";
print "Transfer-Encoding: chunked\r\n";
print "\r\n";
# first chunk
print "10\r\n";
print "0123456789ABCDEF\r\n";
# second chunk
print "10\r\n";
print "0123456789ABCDEF\r\n";
# last chunk
print "0\r\n";
print "\r\n";

neglect warnings in open 3 HANDLE_OUT [duplicate]

This question already exists:
Suppress SSL warnings
I am executing open3 as shown below and I am getting the following lines from SYSOUT:
<May 7, 2013 1:21:59 AM IST> <Info> <Security> <BEA-090905> <Disabling CryptoJ JCE Provider self-integrity check for better startup performance. To enable this check, specify -Dweblogic.security.allowCryptoJDefaultJCEVerification=true>
<May 7, 2013 1:21:59 AM IST> <Info> <Security> <BEA-090906> <Changing the default Random Number Generator in RSA CryptoJ from ECDRBG to FIPS186PRNG. To disable this change, specify -Dweblogic.security.allowCryptoJDefaultPRNG=true>
<May 7, 2013 1:21:59 AM IST> <Notice> <Security> <BEA-090898> <Ignoring the trusted CA certificate "CN=CertGenCA,OU=FOR TESTING ONLY,O=MyOrganization,L=MyTown,ST=MyState,C=ka". The loading of the trusted certificate list raised a certificate parsing exception PKIX: Unsupported OID in the AlgorithmIdentifier object: 1.2.840.113549.1.1.11.>
My expected string
<Composites>
i=0
compositedetail=swlib:soaprov/soacomposite=eis/FileAdapter#eis/FileAdapter#
swlib:soaprov/soacomposite=eis/FileAdapter#eis/FileAdapter# starts with swlib
</Composites>
I want to ignore the lines from BEA security and print only my expected string. How can I do it?
my $command = $java . ' -classpath ' . $classpath . ' ' . $secOptions . ' ' . $className . ' ' . $serviceUrl . ' ' . $composites;
local (*HANDLE_IN, *HANDLE_OUT, *HANDLE_ERR);
my $pid = open3( *HANDLE_IN, *HANDLE_OUT, *HANDLE_ERR, "$command") ;
my $nextLine;
while(<HANDLE_OUT>) {
$nextLine= $_;
print $nextLine;
}
You could use regexes to do that. Of course you could use some kind of XML parser too, but it would be overkill in this case.
my $debug = 1;   # set to 1 for debugging
while (my $nextLine = <HANDLE_OUT>) {
    chomp($nextLine);
    if ($nextLine =~ m!<BEA-!) {
        print "Skipping this line (BEA): |$nextLine|\n" if $debug;
        next;    # skip the WebLogic security lines
    }
    print $nextLine . "\n";
}
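One caveat, sketched below under the assumption that a shell processes the command string: the child's STDERR is never read in the question's code, so a chatty child can fill that pipe and stall. Merging stderr into stdout keeps everything on one filtered stream (the command here is only a placeholder):
use strict;
use warnings;
use IPC::Open3 qw(open3);
use Symbol qw(gensym);

# Placeholder command; "2>&1" merges the child's stderr into stdout so a
# single read loop sees (and can filter) everything.
my $command = 'java -cp myapp.jar com.example.Main 2>&1';

my $err = gensym();                     # open3 still wants a real glob for stderr
my $pid = open3(my $in, my $out, $err, $command);

while (my $nextLine = <$out>) {
    next if $nextLine =~ /<BEA-\d+>/;   # drop the WebLogic security notices
    print $nextLine;
}
waitpid($pid, 0);                       # reap the child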

Perl UTF8 encoding error. Neither LWP::UserAgent->decoded_content nor Encode::decode works. Other ideas?

I have an encoding issue in Perl when trying to pull back global addresses from web pages using both LWP::UserAgent and Encode for character encoding. I've tried googling for solutions but nothing seems to work. I'm using Strawberry Perl 5.12.3.
As an example take the address page of the US embassy in Czech Republic (http://prague.usembassy.gov/contact.html). All I want is to pull back the address:
Address: Tržiště 15 118 01 Praha 1 - Malá Strana Czech Republic
Firefox displays this correctly using the character encoding UTF-8, which is the same as the web page's header charset. But when I try to use Perl to pull this back and write it to a file, the encoding looks messed up despite using decoded_content from LWP::UserAgent or Encode::decode.
I've tried using regexes on the data to check that the error isn't introduced when the data is printed (i.e. that it is internally correct in Perl), but the error seems to be in how Perl handles the encoding.
Here's my code:
#!/usr/bin/perl
require Encode;
require LWP::UserAgent;
use utf8;
my $ua = LWP::UserAgent->new;
$ua->timeout(30);
$ua->env_proxy;
my $output_file;
$output_file = "C:/Documents and Settings/ian/Desktop/utf8test.txt";
open (OUTPUTFILE, ">$output_file") or die("Could not open output file $output_file: $!" );
binmode OUTPUTFILE, ":utf8";
binmode STDOUT, ":utf8";
# US embassy in Czech Republic webpage
$url = "http://prague.usembassy.gov/contact.html";
$ua_response = $ua->get($url);
if (!$ua_response->is_success) { die "Couldn't get data from $url";}
print 'CONTENT TYPE: '.$ua_response->content_charset."\n";
print OUTPUTFILE 'CONTENT TYPE: '.$ua_response->content_charset."\n";
my $content_not_decoded;
my $content_ua_decoded;
my $content_Endode_decoded;
my $content_double_decoded;
$ua_response->content =~ /<p><b>Address(.*?)<\/p>/;
$content_not_decoded = $1;
$ua_response->decoded_content =~ /<p><b>Address(.*?)<\/p>/;
$content_ua_decoded = $1;
Encode::decode_utf8($ua_response->content) =~ /<p><b>Address(.*?)<\/p>/;
$content_Endode_decoded = $1;
Encode::decode_utf8($ua_response->content) =~ /<p><b>Address(.*?)<\/p>/;
$content_double_decoded = $1;
# get the content without decoding
print 'UNDECODED CONTENT:'.$content_not_decoded."\n";
print OUTPUTFILE 'UNDECODED CONTENT:'.$content_not_decoded."\n";
# print the decoded content
print 'DECODED CONTENT:'.$content_ua_decoded."\n";
print OUTPUTFILE 'DECODED CONTENT:'.$content_ua_decoded."\n";
# use Encode to decode the content
print 'ENCODE::DECODED CONTENT:'.$content_Endode_decoded."\n";
print OUTPUTFILE 'ENCODE::DECODED CONTENT:'.$content_Endode_decoded."\n";
# try both!
print 'DOUBLE-DECODED CONTENT:'.$content_double_decoded."\n";
print OUTPUTFILE 'DOUBLE-DECODED CONTENT:'.$content_double_decoded."\n";
# check for #-digit character in the strings (to guard against the error coming in the print statement)
if ($content_not_decoded =~ /\&/) {
print "AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR\n";
print OUTPUTFILE "AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR\n";
}
if ($content_ua_decoded =~ /\&/) {
print "AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR\n";
print OUTPUTFILE "AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR\n";
}
if ($content_Endode_decoded =~ /\&/) {
print "AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR\n";
print OUTPUTFILE "AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR\n";
}
if ($content_double_decoded =~ /\&/) {
print "AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR\n";
print OUTPUTFILE "AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR\n";
}
close (OUTPUTFILE);
exit;
And here's the output to terminal:
CONTENT TYPE: UTF-8 UNDECODED CONTENT::Tr├à┬╛išt├ä┬¢
15118 01 Praha 1 - Malá StranaCzech Republic
DECODED CONTENT::Tr┼╛išt─¢ 15118 01 Praha 1 -
Malá StranaCzech Republic ENCODE::DECODED
CONTENT::Tr┼╛išt─¢ 15118 01 Praha 1 -
Malá StranaCzech Republic DOUBLE-DECODED CONTENT::Tr┼╛išt─¢ 15118 01 Praha 1 - Malá StranaCzech Republic AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING
ERROR AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR
And here's what goes to the file (note this is slightly different from the terminal output, but still not correct). OK, wow: this is showing as correct on Stack Overflow but not in Bluefish, LibreOffice, Excel, Word or anything else on my computer. So the data is there, just incorrectly encoded. I really don't get what's going on.
CONTENT TYPE: UTF-8 UNDECODED CONTENT::TržištÄ
15118 01 Praha 1 - Malá StranaCzech Republic
DECODED CONTENT::Tržiště 15118 01 Praha 1 -
Malá StranaCzech Republic ENCODE::DECODED
CONTENT::Tržiště 15118 01 Praha 1 - Malá
StranaCzech Republic DOUBLE-DECODED CONTENT::Tržiště 15118 01 Praha 1 - Malá StranaCzech Republic AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY
ENCODING ERROR AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING
ERROR AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING
ERROR AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR
Any pointers on how this can be fixed would be really appreciated.
Thanks,
Ian/Montecristo
The mistake is using regex to parse HTML. You lack decoding of HTML entities, at the least. You can do that manually, or leave it to a robust parser:
use strictures;
use Web::Query 'wq';
use autodie qw(:all);
open my $output, '>:encoding(UTF-8)', '/tmp/embassy-prague.txt';
print {$output} wq('http://prague.usembassy.gov/contact.html')->find('p')->first->html; # or perhaps ->text
#!/usr/bin/env perl
use v5.12;
use strict;
use warnings;
use warnings qw(FATAL utf8);
use open qw(:std :utf8);
use LWP::Simple;
use HTML::Entities;
my $content = get 'http://prague.usembassy.gov/contact.html';
my ($address) = ($content =~ m{<p><b>Address(.*?)</p>});
decode_entities($address);
say $address;
From the command line:
C:\temp> uu > tt.txt
C:\temp> gvim tt.txt
and the following text is displayed in GVim (which is in UTF-8 mode):
</b>:<br />Tržiště 15<br />118 01 Praha 1 - Malá Strana<br />Czech Republic
See also Tom Christiansen's standard preamble.
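That preamble boils down to something like the following sketch (an approximation, not a verbatim copy; see his original answer for the full version):
use utf8;                         # the source code itself is UTF-8
use strict;
use warnings;
use warnings  qw(FATAL utf8);     # make encoding glitches fatal
use open      qw(:std :utf8);     # STDIN/STDOUT/STDERR and new handles are UTF-8
use feature   qw(unicode_strings);
use charnames qw(:full :short);   # enable \N{CHARNAME} escapes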

Perl: utf8::decode vs. Encode::decode

I am having some interesting results trying to discern the differences between using Encode::decode("utf8", $var) and utf8::decode($var). I've already discovered that calling the former multiple times on a variable will eventually result in an error "Cannot decode string with wide characters at..." whereas the latter method will happily run as many times as you want, simply returning false.
What I'm having trouble understanding is how the length function returns different results depending on which method you use to decode. The problem arises because I am dealing with "doubly encoded" utf8 text from an outside file. To demonstrate this issue, I created a text file "test.txt" with the following Unicode characters on one line: U+00e8, U+00ab, U+0086, U+000a. These Unicode characters are the double-encoding of the Unicode character U+8acb, along with a newline character. The file was encoded to disk in UTF8. I then run the following perl script:
#!/usr/bin/perl
use strict;
use warnings;
require "Encode.pm";
require "utf8.pm";
open FILE, "test.txt" or die $!;
my @lines = <FILE>;
my $test = $lines[0];
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
my @unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
my @hex = (unpack('H*', $test));
print "Hex:\n@hex\n";
print "==============\n";
$test = Encode::decode("utf8", $test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
@unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
@hex = (unpack('H*', $test));
print "Hex:\n@hex\n";
print "==============\n";
$test = Encode::decode("utf8", $test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
@unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
@hex = (unpack('H*', $test));
print "Hex:\n@hex\n";
This gives the following output:
Length: 7
utf8 flag:
Unicode:
195 168 194 171 194 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
232 171 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 2
utf8 flag: 1
Unicode:
35531 10
Hex:
e8ab8b0a
This is what I would expect. The length is originally 7 because Perl thinks that $test is just a series of bytes. After decoding once, Perl knows that $test is a series of characters that are utf8-encoded (i.e. instead of returning a length of 7 bytes, Perl returns a length of 4 characters, even though $test is still 7 bytes in memory). After the second decoding, $test contains 4 bytes interpreted as 2 characters, which is what I would expect since Encode::decode took the 4 code points and interpreted them as utf8-encoded bytes, resulting in 2 characters. The strange thing is when I modify the code to call utf8::decode instead (replacing all occurrences of $test = Encode::decode("utf8", $test); with utf8::decode($test)).
This gives almost identical output, only the result of length differs:
Length: 7
utf8 flag:
Unicode:
195 168 194 171 194 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
232 171 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
35531 10
Hex:
e8ab8b0a
It seems like perl first counts the bytes before decoding (as expected), then counts the characters after the first decoding, but then counts the bytes again after the second decoding (not expected). Why would this switch happen? Is there a lapse in my understanding of how these decoding functions work?
Thanks, Matt
You are not supposed to use the functions from the utf8 pragma module. Its documentation says so:
Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.
Always use the Encode module, and also see the question Checklist for going the Unicode way with Perl. unpack is too low-level, it does not even give you error-checking.
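For example, a minimal sketch of that error checking with Encode (the truncated byte string is purely illustrative):
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\xe8\xab";   # truncated UTF-8 sequence, purely illustrative
my $chars  = eval { decode('UTF-8', $octets, Encode::FB_CROAK) };
print defined $chars ? "decoded ok\n" : "decode failed: $@";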
You are going wrong with the assumption that the octets E8 AB 86 0A are the result of UTF-8 double-encoding the characters 諆 and a newline. This is the representation of a single UTF-8 encoding of these characters. Perhaps the whole confusion on your side stems from that mistake.
length is inappropriately overloaded: at certain times it determines the length in characters, at other times the length in octets. Use better tools such as Devel::Peek.
#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';
use Devel::Peek qw(Dump);
use Encode qw(decode);
my $test = "\x{00e8}\x{00ab}\x{0086}\x{000a}";
# or read the octets without implicit decoding from a file, does not matter
Dump $test;
# FLAGS = (PADMY,POK,pPOK)
# PV = 0x8d8520 "\350\253\206\n"\0
$test = decode('UTF-8', $test, Encode::FB_CROAK);
Dump $test;
# FLAGS = (PADMY,POK,pPOK,UTF8)
# PV = 0xc02850 "\350\253\206\n"\0 [UTF8 "\x{8ac6}\n"]
Turns out this was a bug: https://rt.perl.org/rt3//Public/Bug/Display.html?id=80190.