What is the difference between :encoding(utf8) and :utf8? [duplicate] - perl

So as you probably know, in Perl "utf8" means Perl's looser understanding of UTF-8, which allows code points that technically aren't valid in UTF-8 (for example surrogates and values above U+10FFFF). By contrast, "UTF-8" (or "utf-8") is Perl's stricter understanding of UTF-8, which doesn't allow those invalid code points.
I have a few usage questions related to this distinction:
Encode::encode by default will replace invalid characters with a substitution character. Is that true even if you are passing the looser "utf8" as the encoding?
What happens when you read and write files which were open'd using "UTF-8"? Does character substitution happen to bad characters or does something else happen?
What is the difference between using open with a layer like '>:utf8' and a layer like '>:encoding(utf8)' ? Can both approaches be used with both 'utf8' and 'UTF-8'?

Layer               On Read: invalid encoding      On Read: outside of Unicode,     On Write: outside of Unicode,
                    other than sequence length     Unicode nonchar, or surrogate    Unicode nonchar, or surrogate
:encoding(UTF-8)    Warns and Replaces             Warns and Replaces               Warns and Replaces
:encoding(utf8)     Warns and Replaces             Accepts                          Warns and Outputs
:utf8               Corrupt scalar                 Accepts                          Warns and Outputs
(This is the state in Perl 5.26.)
Note that :encoding(UTF-8) actually decodes using utf8, then checks if the resulting character is in the acceptable range. This reduces the number of error messages for bad input, so it's good.
(Encoding names are case-insensitive.)
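As a small illustration of how the layers are named (not part of the original answer; the file names are placeholders): :utf8 is a PerlIO layer of its own and takes no encoding name, while :encoding(...) goes through Encode and accepts any name Encode recognizes, with the name matched case-insensitively, so both the loose and the strict spelling work there.
use strict;
use warnings;
# :utf8 is a PerlIO layer flag; there is no encoding name to pass.
open my $fh_layer, '>:utf8', 'loose.txt' or die $!;
# :encoding(...) routes through Encode, so both spellings work here,
# and the case of the name does not matter.
open my $fh_strict, '>:encoding(UTF-8)', 'strict.txt' or die $!;
open my $fh_loose,  '>:encoding(utf8)',  'loose2.txt' or die $!;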
Tests used to generate the above table:
On read
:encoding(UTF-8)
$ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
perl -MB -nle'
use open ":std", ":encoding(UTF-8)";
my $sv = B::svref_2object(\$_);
printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
'
utf8 "\xFFFF" does not map to Unicode.
utf8 "\xD800" does not map to Unicode.
utf8 "\x200000" does not map to Unicode.
utf8 "\x80" does not map to Unicode.
E9 (internal: C3.A9, UTF8=1)
5C.78.7B.46.46.46.46.7D = \x{FFFF} (internal: 5C.78.7B.46.46.46.46.7D, UTF8=1)
5C.78.7B.44.38.30.30.7D = \x{D800} (internal: 5C.78.7B.44.38.30.30.7D, UTF8=1)
5C.78.7B.32.30.30.30.30.30.7D = \x{200000} (internal: 5C.78.7B.32.30.30.30.30.30.7D, UTF8=1)
5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)
:encoding(utf8)
$ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
perl -MB -nle'
use open ":std", ":encoding(utf8)";
my $sv = B::svref_2object(\$_);
printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
'
utf8 "\x80" does not map to Unicode.
E9 (internal: C3.A9, UTF8=1)
FFFF (internal: EF.BF.BF, UTF8=1)
D800 (internal: ED.A0.80, UTF8=1)
200000 (internal: F8.88.80.80.80, UTF8=1)
5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)
:utf8
$ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
perl -MB -nle'
use open ":std", ":utf8";
my $sv = B::svref_2object(\$_);
printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
'
E9 (internal: C3.A9, UTF8=1)
FFFF (internal: EF.BF.BF, UTF8=1)
D800 (internal: ED.A0.80, UTF8=1)
200000 (internal: F8.88.80.80.80, UTF8=1)
Malformed UTF-8 character: \x80 (unexpected continuation byte 0x80, with no preceding start byte) in printf at -e line 4, <> line 5.
0 (internal: 80, UTF8=1)
On write
:encoding(UTF-8)
$ perl -e'
use open ":std", ":encoding(UTF-8)";
print "\x{E9}\n";
print "\x{FFFF}\n";
print "\x{D800}\n";
print "\x{20_0000}\n";
' >a
Unicode non-character U+FFFF is not recommended for open interchange in print at -e line 4.
Unicode surrogate U+D800 is illegal in UTF-8 at -e line 5.
Code point 0x200000 is not Unicode, may not be portable in print at -e line 6.
"\x{ffff}" does not map to utf8.
"\x{d800}" does not map to utf8.
"\x{200000}" does not map to utf8.
$ od -t c a
0000000 303 251 \n \ x { F F F F } \n \ x { D
0000020 8 0 0 } \n \ x { 2 0 0 0 0 0 } \n
0000040
$ cat a
é
\x{FFFF}
\x{D800}
\x{200000}
:encoding(utf8)
$ perl -e'
use open ":std", ":encoding(utf8)";
print "\x{E9}\n";
print "\x{FFFF}\n";
print "\x{D800}\n";
print "\x{20_0000}\n";
' >a
Unicode surrogate U+D800 is illegal in UTF-8 at -e line 4.
Code point 0x200000 is not Unicode, may not be portable in print at -e line 5.
$ od -t c a
0000000 303 251 \n 355 240 200 \n 370 210 200 200 200 \n
0000015
$ cat a
é
▒
▒
:utf8
Same results as :encoding(utf8).
Tested using Perl 5.26.
Encode::encode by default will replace invalid characters with a substitution character. Is that true even if you are passing the looser "utf8" as the encoding?
Perl strings are strings of 32-bit or 64-bit characters, depending on the build. utf8 can encode any 72-bit integer, so it is capable of encoding every character it can be asked to encode; there is nothing invalid for it to replace.
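A small sketch to illustrate that point (not from the original answer; exact warnings vary by Perl and Encode version): with the strict "UTF-8" encoding a surrogate cannot be represented and should be replaced with the encoding of U+FFFD, while the loose "utf8" encoding writes it out unchanged.
use strict;
use warnings;
use Encode qw( encode );

my $surrogate = "\x{D800}";   # not a valid Unicode scalar value

my $strict = encode('UTF-8', $surrogate);   # should warn and substitute U+FFFD
my $loose  = encode('utf8',  $surrogate);   # no substitution; extended UTF-8 bytes

printf "UTF-8: %vX\n", $strict;   # expected: EF.BF.BD
printf "utf8:  %vX\n", $loose;    # expected: ED.A0.80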

Related

Perl: Packing a sequence of bytes into a string

I'm trying to run a simple test whereby I want to have differently formatted binary strings and print them out. In fact, I'm trying to investigate a problem whereby sprintf cannot deal with a wide-character string passed in for the placeholder %s.
In this case, the binary string shall just contain the Cyrillic "д" (because it's above ISO-8859-1)
The code below works when I use the character directly in the source.
But nothing that passes through pack works.
For the UTF-8 case, I need to set the UTF-8 flag on the string $ch, but how?
The UCS-2 case fails, and I suppose it's because there is no way for Perl to tell UCS-2 from ISO-8859-1, so that test is probably bollocks, right?
The code:
#!/usr/bin/perl
use utf8; # Meaning "This lexical scope (i.e. file) contains utf8"
# https://perldoc.perl.org/open.html
use open qw(:std :encoding(UTF-8));
sub showme {
my ($name,$ch) = @_;
print "-------\n";
print "This is test: $name\n";
my $ord = ord($ch); # ordinal computed outside of "use bytes"; actually should yield the unicode codepoint
{
# https://perldoc.perl.org/bytes.html
use bytes;
my $mark = (utf8::is_utf8($ch) ? "yes" : "no");
my $txt = sprintf("Received string of length: %i byte, contents: %vd, ordinal x%04X, utf-8: %s\n", length($ch), $ch, $ord, $mark);
print $txt,"\n";
}
print $ch, "\n";
print "Combine: $ch\n";
print "Concat: " . $ch . "\n";
print "Sprintf: " . sprintf("%s",$ch) . "\n";
print "-------\n";
}
showme("Cryillic direct" , "д");
showme("Cyrillic UTF-8" , pack("HH","D0","B4")); # UTF-8 of д is D0B4
showme("Cyrillic UCS-2" , pack("HH","04","34")); # UCS-2 of д is 0434
Current output:
Looks good
-------
This is test: Cryillic direct
Received string of length: 2 byte, contents: 208.180, ordinal x0434, utf-8: yes
д
Combine: д
Concat: д
Sprintf: д
-------
That's a no. Where does the 176 come from??
-------
This is test: Cyrillic UTF-8
Received string of length: 2 byte, contents: 208.176, ordinal x00D0, utf-8: no
а
Combine: а
Concat: а
Sprintf: а
-------
This is even worse.
-------
This is test: Cyrillic UCS-2
Received string of length: 2 byte, contents: 0.48, ordinal x0000, utf-8: no
0
Combine: 0
Concat: 0
Sprintf: 0
-------
You have two problems.
Your calls to pack are incorrect
Each H represents one hex digit.
$ perl -e'printf "%vX\n", pack("HH", "D0", "B4")' # XXX
D0.B0
$ perl -e'printf "%vX\n", pack("H2H2", "D0", "B4")' # Ok
D0.B4
$ perl -e'printf "%vX\n", pack("(H2)2", "D0", "B4")' # Ok
D0.B4
$ perl -e'printf "%vX\n", pack("(H2)*", "D0", "B4")' # Better
D0.B4
$ perl -e'printf "%vX\n", pack("H*", "D0B4")' # Alternative
D0.B4
STDOUT is expecting decoded text, but you are providing encoded text
First, let's take a look at strings you are producing (once the problem mentioned above is fixed). All you need for that is the %vX format, which provides the period-separated value of each character in hex.
"д" produces a one-character string. This character is the Unicode Code Point for д.
$ perl -e'use utf8; printf("%vX\n", "д");'
434
pack("H*", "D0B4") produces a two-character string. These characters are the UTF-8 encoding of д.
$ perl -e'printf("%vX\n", pack("H*", "D0B4"));'
D0.B4
pack("H*", "0434") produces a two-character string. These characters are the UCS-2be and UTF-16be encodings of д.
$ perl -e'printf("%vX\n", pack("H*", "0434"));'
4.34
Normally, a file handle expects a string of bytes (characters with values in 0..255) to be printed to it. These bytes are output verbatim.[1][2]
When an encoding layer (e.g. :encoding(UTF-8)) is added to a file handle, it expects a string of Unicode Code Points (aka decoded text) to be printed to it instead.
Your program adds an encoding layer to STDOUT (through its use of the use open pragma), so you must provide UCP (decoded text) to print and say. You can obtain decoded text from encoded text using, for example, Encode's decode function.
use utf8;
use open qw( :std :encoding(UTF-8) );
use feature qw( say );
use Encode qw( decode );
say "д"; # ok (UCP of "д")
say pack("H*", "D0B4"); # XXX (UTF-8 encoding of "д")
say pack("H*", "0434"); # XXX (UCS-2be and UTF-16be encoding of "д")
say decode("UTF-8", pack("H*", "D0B4")); # ok (UCP of "д")
say decode("UCS-2be", pack("H*", "0434")); # ok (UCP of "д")
say decode("UTF-16be", pack("H*", "0434")); # ok (UCP of "д")
For the UTF-8 case, I need to set the UTF-8 flag on
No, you need to decode the strings.
The UTF-8 flag is irrelevant. Whether the flag is set or not originally is irrelevant. Whether the flag is set or not after the string is decoded is irrelevant. The flag indicates how the string is stored internally, something you shouldn't care about.
For example, take
use strict;
use warnings;
use open qw( :std :encoding(UTF-8) );
use feature qw( say );
my $x = chr(0xE9);
utf8::downgrade($x); # Tell Perl to use the UTF8=0 storage format.
say sprintf "%s %vX %s", utf8::is_utf8($x) ? "UTF8=1" : "UTF8=0", $x, $x;
utf8::upgrade($x); # Tell Perl to use the UTF8=1 storage format.
say sprintf "%s %vX %s", utf8::is_utf8($x) ? "UTF8=1" : "UTF8=0", $x, $x;
It outputs
UTF8=0 E9 é
UTF8=1 E9 é
Regardless of the UTF8 flag, the UTF-8 encoding (C3 A9) of the provided UCP (U+00E9) is output.
I suppose it's because there is no way for Perl to tell UCS-2 from ISO-8859-1, so that test is probably bollocks, right?
At best, one could employ heuristics to guess whether a string is encoded using iso-latin-1 or UCS-2be. I suspect one could get rather accurate results (like those you'd get for iso-latin-1 and UTF-8.)
I'm not sure why you bring up iso-latin-1 since nothing else in your question relates to iso-latin-1.
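One way to experiment with such heuristics is the Encode::Guess module shipped with the Encode distribution (this is only a sketch of the idea, not something the answer prescribes): it tries a list of candidate encodings against the bytes. Note that very permissive candidates such as iso-latin-1 accept almost any byte sequence, so the guess frequently comes back ambiguous, which is exactly why this can only ever be a heuristic.
use strict;
use warnings;
use Encode::Guess;   # exports guess_encoding()

my $bytes = pack('H*', 'D0B4');   # bytes that could be UTF-8 "д", among other things

# guess_encoding() returns an encoding object on success, an error string otherwise.
my $enc = guess_encoding($bytes, qw( UTF-8 UCS-2BE ));
if (ref $enc) {
    printf "guessed %s, decodes to %vX\n", $enc->name, $enc->decode($bytes);
} else {
    print "no unambiguous guess: $enc\n";
}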
[1] Except on Windows, where a :crlf layer is added to handles by default.
[2] You get a Wide character warning if you provide a string that contains a character that's not a byte, and the utf8 encoding of the string is output instead.
Please see whether the following demonstration code is of any help.
use strict;
use warnings;
use feature 'say';
use utf8; # https://perldoc.perl.org/utf8.html
use Encode; # https://perldoc.perl.org/Encode.html
my $str;
my $utf8 = 'Привет Москва';
my $ucs2le = '1f044004380432043504420420001c043e0441043a0432043004'; # Little Endian
my $ucs2be = '041f044004380432043504420020041c043e0441043a04320430'; # Big Endian
my $utf16 = '041f044004380432043504420020041c043e0441043a04320430';
my $utf32 = '0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430';
# https://perldoc.perl.org/functions/binmode.html
binmode STDOUT, ':utf8';
# https://perldoc.perl.org/feature.html#The-'say'-feature
say 'UTF-8: ' . $utf8;
# https://perldoc.perl.org/Encode.html#THE-PERL-ENCODING-API
$str = pack('H*',$ucs2be);
say 'UCS-2BE: ' . decode('UCS-2BE',$str);
$str = pack('H*',$ucs2le);
say 'UCS-2LE: ' . decode('UCS-2LE',$str);
$str = pack('H*',$utf16);
say 'UTF-16: '. decode('UTF16',$str);
$str = pack('H*',$utf32);
say 'UTF-32: ' . decode('UTF32',$str);
Output
UTF-8: Привет Москва
UCS-2BE: Привет Москва
UCS-2LE: Привет Москва
UTF-16: Привет Москва
UTF-32: Привет Москва
Supported Cyrillic encodings
use strict;
use warnings;
use feature 'say';
use Encode;
use utf8;
binmode STDOUT, ':utf8';
my $utf8 = 'Привет Москва';
my @encodings = qw/UCS-2 UCS-2LE UCS-2BE UTF-16 UTF-32 ISO-8859-5 CP855 CP1251 KOI8-F KOI8-R KOI8-U/;
say '
:: Supported Cyrillic encoding
---------------------------------------------
UTF-8 ', $utf8;
for (@encodings) {
printf "%-11s %s\n", $_, unpack('H*', encode($_,$utf8));
}
Output
:: Supported Cyrillic encoding
---------------------------------------------
UTF-8 Привет Москва
UCS-2 041f044004380432043504420020041c043e0441043a04320430
UCS-2LE 1f044004380432043504420420001c043e0441043a0432043004
UCS-2BE 041f044004380432043504420020041c043e0441043a04320430
UTF-16 feff041f044004380432043504420020041c043e0441043a04320430
UTF-32 0000feff0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430
ISO-8859-5 bfe0d8d2d5e220bcdee1dad2d0
CP855 dde1b7eba8e520d3d6e3c6eba0
CP1251 cff0e8e2e5f220cceef1eae2e0
KOI8-F f0d2c9d7c5d420edcfd3cbd7c1
KOI8-R f0d2c9d7c5d420edcfd3cbd7c1
KOI8-U f0d2c9d7c5d420edcfd3cbd7c1
Documentation Encode::Supported
Both are good answers. Here is a slight extension of Polar Bear's code that prints details about the string:
use strict;
use warnings;
use feature 'say';
use utf8;
use Encode;
sub about {
my($str) = @_;
# https://perldoc.perl.org/bytes.html
my $charlen = length($str);
my $txt;
{
use bytes;
my $mark = (utf8::is_utf8($str) ? "yes" : "no");
my $bytelen = length($str);
$txt = sprintf("Length: %d byte, %d chars, utf-8: %s, contents: %vd\n",
$bytelen,$charlen,$mark,$str);
}
return $txt;
}
my $str;
my $utf8 = 'Привет Москва';
my $ucs2le = '1f044004380432043504420420001c043e0441043a0432043004'; # Little Endian
my $ucs2be = '041f044004380432043504420020041c043e0441043a04320430'; # Big Endian
my $utf16 = '041f044004380432043504420020041c043e0441043a04320430';
my $utf32 = '0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430';
binmode STDOUT, ':utf8';
say 'UTF-8: ' . $utf8;
say about($utf8);
{
my $str = pack('H*',$ucs2be);
say 'UCS-2BE: ' . decode('UCS-2BE',$str);
say about($str);
}
{
my $str = pack('H*',$ucs2le);
say 'UCS-2LE: ' . decode('UCS-2LE',$str);
say about($str);
}
{
my $str = pack('H*',$utf16);
say 'UTF-16: '. decode('UTF16',$str);
say about($str);
}
{
my $str = pack('H*',$utf32);
say 'UTF-32: ' . decode('UTF32',$str);
say about($str);
}
# Try identity transcoding
{
my $str_encoded_in_utf16 = encode('UTF16',$utf8);
my $str = decode('UTF16',$str_encoded_in_utf16);
say 'The same: ' . $str;
say about($str);
}
Running this gives:
UTF-8: Привет Москва
Length: 25 byte, 13 chars, utf-8: yes, contents: 208.159.209.128.208.184.208.178.208.181.209.130.32.208.156.208.190.209.129.208.186.208.178.208.176
UCS-2BE: Привет Москва
Length: 26 byte, 26 chars, utf-8: no, contents: 4.31.4.64.4.56.4.50.4.53.4.66.0.32.4.28.4.62.4.65.4.58.4.50.4.48
UCS-2LE: Привет Москва
Length: 26 byte, 26 chars, utf-8: no, contents: 31.4.64.4.56.4.50.4.53.4.66.4.32.0.28.4.62.4.65.4.58.4.50.4.48.4
UTF-16: Привет Москва
Length: 26 byte, 26 chars, utf-8: no, contents: 4.31.4.64.4.56.4.50.4.53.4.66.0.32.4.28.4.62.4.65.4.58.4.50.4.48
UTF-32: Привет Москва
Length: 52 byte, 52 chars, utf-8: no, contents: 0.0.4.31.0.0.4.64.0.0.4.56.0.0.4.50.0.0.4.53.0.0.4.66.0.0.0.32.0.0.4.28.0.0.4.62.0.0.4.65.0.0.4.58.0.0.4.50.0.0.4.48
The same: Привет Москва
Length: 25 byte, 13 chars, utf-8: yes, contents: 208.159.209.128.208.184.208.178.208.181.209.130.32.208.156.208.190.209.129.208.186.208.178.208.176
And a little diagram I made as an overview for next time, covering encode, decode and pack. Because one better be ready for next time.

perl - How to print utf8 code points for each byte

I'm trying to print the code points for all possible byte values.
My test file :
$ perl -e ' open($fh,">raw_bytes.dat");while($i++<256){ print $fh chr($i-1) } close($fh)'
$ ls -l raw_bytes.dat
-rw-rw-r--+ 1 uuuuu Domain Users 256 Mar 20 15:41 raw_bytes.dat
$
What should go into the below #---> part so that I print the code points of utf8 $x in hexadecimal?
perl -e ' use utf8; open($fh,"<raw_bytes.dat");binmode($fh);
while($rb=read($fh,$x,1)) { utf8::encode($x);
#--->
} '
I tried %02x using printf, but it didn't work. Also, I want the solution only using core modules.
Use unpack('H*'):
$ perl -e '$x="\x80"; utf8::encode($x); print unpack("H*", $x), "\n"'
c280
For your example file I get
$ perl -e 'open($fh, "<", "raw_bytes.dat"); binmode($fh);
while ($rb=read($fh,$x,1)) { utf8::encode($x);
print unpack("H*", $x), "\n";
}'
00
01
02
03
...
7f
c280
c281
c282
c283
...
c3bd
c3be
c3bf
Variants:
$ perl -e '$x="\x80"; utf8::encode($x);
print uc(unpack("H*", $x)), "\n"'
C280
$ perl -e '$x="\x80"; utf8::encode($x);
($r = uc(unpack("H*", $x))) =~ s/(..)/\\X\1/g;
print "$r\n"'
\XC2\X80
# a little bit pointless example, but assume that $x is a provided Perl scalar....
$ perl -e '$x="\N{U+0080}\N{U+0081}";
printf("U+%04x ", ord($_)) foreach(split(//, $x));
print "\n";'
U+0080 U+0081
Please remember the difference between
a scalar holding a raw string: split(//) returns octets, e.g. \x80
a scalar holding a properly encoded string: split(//) returns characters, e.g. \N{U+0080}
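A short sketch of that difference (not part of the original answer; the byte values mirror the bullet points above):
use strict;
use warnings;
use Encode qw( decode );

my $raw     = "\xC2\x80";                  # raw octets, nothing decoded yet
my $decoded = decode('UTF-8', "\xC2\x80"); # one character, U+0080

printf "raw:     length %d, %vX\n", length($raw),     $raw;       # length 2, C2.80
printf "decoded: length %d, %vX\n", length($decoded), $decoded;   # length 1, 80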
I tried %02x using printf, but it didn't work.
You can use
printf "%vX\n", $x;
According to perldoc sprintf:
vector flag
This flag tells Perl to interpret the supplied string as a vector of
integers, one for each character in the string. Perl applies the
format to each integer in turn, then joins the resulting strings with
a separator (a dot . by default). This can be useful for displaying
ordinal values of characters in arbitrary strings.
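For instance, applied to the single-byte example from earlier in this answer, it should print the dot-separated hex value of each UTF-8 byte:
$ perl -e '$x = "\x80"; utf8::encode($x); printf "%vX\n", $x'
C2.80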

Perl: Homograph attacks. It is possible to compare ascii / non-ascii strings, visually similar?

I came across this so-called "homograph attack" and I want to reject domains whose decoded punycode merely looks like plain alphanumeric ASCII. For example, www.xn--80ak6aa92e.com will display as www.apple.com in the browser (Firefox). The domains are visually the same, but the character set is different. Chrome has already patched this and displays the punycode instead.
I have an example below.
#!/usr/bin/perl
use strict;
use warnings;
use Net::IDN::Encode ':all';
use utf8;
my $testdomain = "www.xn--80ak6aa92e.com";
my $IDN = domain_to_unicode($testdomain);
my $visual_result_ascii = "www.apple.com";
print "S1: $IDN\n";
print "S2: $visual_result_ascii";
print "MATCH" if ($IDN eq $visual_result_ascii);
Visually they are the same, but they won't match. Is it possible to compare a Unicode string ($IDN) against an alphanumeric string that is visually the same?
Your example converted by the Punycode converter results in this UTF-8 string:
www.аррӏе.com
$ perl -e 'printf("%02x ", ord) for split("", "www.аррӏе.com"); print "\n"'
77 77 77 2e d0 b0 d1 80 d1 80 d3 8f d0 b5 2e 63 6f 6d
As Unicode:
$ perl -Mutf8 -e 'printf("%04x ", ord) for split("", "www.аррӏе.com"); print "\n"'
0077 0077 0077 002e 0430 0440 0440 04cf 0435 002e 0063 006f 006d
Using @ikegami's input:
$ perl -Mutf8 -MEncode -e 'print encode("UTF-8", $_) for ("www.аррӏе.com" =~ /\p{Cyrillic}/g); print "\n"'
аррӏе
$ perl -Mutf8 -MEncode -e 'print encode("UTF-8", $_) for ("www.аррӏе.com" =~ /\P{Cyrillic}/g); print "\n"'
www..com
Original idea
I'm not sure if code for this exists, but my first idea would be to create a map \N{xxxx} -> "visual equivalent ASCII/UTF-8 code". Then you could apply the map on the Unicode string to "convert" it to ASCII/UTF-8 code and compare the resulting string with a list of domains.
Example code (I'm skipping the IDN decoding stuff and use the UTF-8 result directly in the test data). This could probably still be improved, but at least it shows the idea.
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode;
# Unicode (in HEX) -> visually equal ASCII/ISO-8859-1/... character
my %unicode_to_equivalent = (
'0430' => 'a',
'0435' => 'e',
'04CF' => 'l',
'0440' => 'p',
);
while (<DATA>) {
chomp;
# assuming that this returns a valid Perl UTF-8 string
#my $IDN = domain_to_unicode($_);
my($IDN, $compare) = split(' ', $_) ; # already decoded in test data
my $visually_decoded =
join('', # merge result
map { # map, if mapping exists
$unicode_to_equivalent{sprintf("%04X", ord($_))} // $_
}
split ('', $IDN) # split to characters
);
print "Testing: ", encode('UTF-8', $IDN), " -> $compare ";
print "Visual match!"
if ($visually_decoded eq $compare);
print "\n";
}
exit 0;
__DATA__
www.аррӏе.com www.apple.com
Test run (results depend on whether copy & paste from the answer preserves the original UTF-8 strings)
$ perl dummy.pl
Testing: www.аррӏе.com -> www.apple.com Visual match!
Counting the # of scripts in the string
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode;
use Unicode::UCD qw(charscript);
while (<DATA>) {
chomp;
# assuming that this returns a valid Perl UTF-8 string
#my $IDN = domain_to_unicode($_);
my($IDN) = $_; # already decoded in test data
# Unicode characters
my @characters = split('', $IDN);
# See UTR #39: Unicode Security Mechanisms
my %scripts =
map { (charscript(ord), 1) } # Codepoint to script
@characters;
delete %scripts{Common};
print 'Testing: ',
encode('UTF-8', $IDN),
' (', join(' ', map { sprintf("%04X", ord) } @characters), ')',
(keys %scripts == 1) ? ' not' : '', " suspicious\n";
}
exit 0;
__DATA__
www.аррӏе.com
www.apple.com
www.école.fr
Test run (results depend on whether copy & paste from the answer preserves the original UTF-8 strings)
$ perl dummy.pl
Testing: www.аррӏе.com (0077 0077 0077 002E 0430 0440 0440 04CF 0435 002E 0063 006F 006D) suspicious
Testing: www.apple.com (0077 0077 0077 002E 0061 0070 0070 006C 0065 002E 0063 006F 006D) not suspicious
Testing: www.école.fr (0077 0077 0077 002E 00E9 0063 006F 006C 0065 002E 0066 0072) not suspicious
After some research, and thanks to your comments, I have reached a conclusion.
The most frequent issues come from Cyrillic. This script contains many characters that are visually similar to Latin ones, and you can build many combinations from them.
I have identified some scammy IDN domains including these names:
"аррӏе" "сһаѕе" "сіѕсо"
Maybe here, with this font, you can see a difference, but in a browser there is absolutely no visual difference.
Consulting https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode I was able to create a table with 12 visually similar characters.
Update: I found 4 more Latin-like characters in the Cyrillic charset, 16 in total now.
It is possible to combine these to create IDNs that are 100% visually identical to legitimate domains.
0430 a CYRILLIC SMALL LETTER A
0441 c CYRILLIC SMALL LETTER ES
0501 d CYRILLIC SMALL LETTER KOMI DE
0435 e CYRILLIC SMALL LETTER IE
04bb h CYRILLIC SMALL LETTER SHHA
0456 i CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
0458 j CYRILLIC SMALL LETTER JE
043a k CYRILLIC SMALL LETTER KA
04cf l CYRILLIC SMALL LETTER PALOCHKA
043e o CYRILLIC SMALL LETTER O
0440 p CYRILLIC SMALL LETTER ER
051b q CYRILLIC SMALL LETTER QA
0455 s CYRILLIC SMALL LETTER DZE
051d w CYRILLIC SMALL LETTER WE
0445 x CYRILLIC SMALL LETTER HA
0443 y CYRILLIC SMALL LETTER U
The problem occurs in the second-level domain. Extensions (TLDs) can also be IDNs, but they are verified, cannot be spoofed, and are not subject to this issue.
The domain registrar will check whether all letters come from the same script; an IDN will not be accepted if it mixes Latin and non-Latin characters, so extra validation there is pointless.
My idea is simple: split the domain, decode only the SLD part, and then match it against a list of visually similar Cyrillic characters.
If all letters are visually similar to Latin, the result is almost certainly a scam.
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use open ':std', ':encoding(UTF-8)';
use Net::IDN::Encode ':all';
use Array::Utils qw(:all);
my @latinlike_cyrillics = qw(0430 0441 0501 0435 04bb 0456 0458 043a 04cf 043e 0440 051b 0455 051d 0445 0443);
# maybe you can find better examples
my $domain1 = "www.xn--80ak6aa92e.com";
my $domain2 = "www.xn--d1acpjx3f.xn--p1ai";
test_domain ($domain1);
test_domain ($domain2);
sub test_domain {
my $testdomain = shift;
my ($tLD, $sLD, $topLD) = split(/\./, $testdomain);
my $IDN = domain_to_unicode($sLD);
my @decoded; push(@decoded, sprintf("%04x", ord)) for ( split("", $IDN) );
my @checker = array_minus( @decoded, @latinlike_cyrillics );
if (@checker){print "$testdomain [$IDN] seems to be ok\n"}
else {print "$testdomain [$IDN] is possibly scam\n"}
}


Perl: utf8::decode vs. Encode::decode

I am having some interesting results trying to discern the differences between using Encode::decode("utf8", $var) and utf8::decode($var). I've already discovered that calling the former multiple times on a variable will eventually result in an error "Cannot decode string with wide characters at..." whereas the latter method will happily run as many times as you want, simply returning false.
What I'm having trouble understanding is how the length function returns different results depending on which method you use to decode. The problem arises because I am dealing with "doubly encoded" utf8 text from an outside file. To demonstrate this issue, I created a text file "test.txt" with the following Unicode characters on one line: U+00e8, U+00ab, U+0086, U+000a. These Unicode characters are the double-encoding of the Unicode character U+8acb, along with a newline character. The file was encoded to disk in UTF8. I then run the following perl script:
#!/usr/bin/perl
use strict;
use warnings;
require "Encode.pm";
require "utf8.pm";
open FILE, "test.txt" or die $!;
my @lines = <FILE>;
my $test = $lines[0];
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
my @unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
my @hex = (unpack('H*', $test));
print "Hex:\n@hex\n";
print "==============\n";
$test = Encode::decode("utf8", $test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
@unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
@hex = (unpack('H*', $test));
print "Hex:\n@hex\n";
print "==============\n";
$test = Encode::decode("utf8", $test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
@unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
@hex = (unpack('H*', $test));
print "Hex:\n@hex\n";
This gives the following output:
Length: 7
utf8 flag:
Unicode:
195 168 194 171 194 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
232 171 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 2
utf8 flag: 1
Unicode:
35531 10
Hex:
e8ab8b0a
This is what I would expect. The length is originally 7 because perl thinks that $test is just a series of bytes. After decoding once, perl knows that $test is a series of characters that are utf8-encoded (i.e. instead of returning a length of 7 bytes, perl returns a length of 4 characters, even though $test is still 7 bytes in memory). After the second decoding, $test contains 4 bytes interpreted as 2 characters, which is what I would expect since Encode::decode took the 4 code points and interpreted them as utf8-encoded bytes, resulting in 2 characters. The strange thing is when I modify the code to call utf8::decode instead (replace all $test = Encode::decode("utf8", $test); with utf8::decode($test))
This gives almost identical output, only the result of length differs:
Length: 7
utf8 flag:
Unicode:
195 168 194 171 194 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
232 171 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
35531 10
Hex:
e8ab8b0a
It seems like perl first counts the bytes before decoding (as expected), then counts the characters after the first decoding, but then counts the bytes again after the second decoding (not expected). Why would this switch happen? Is there a lapse in my understanding of how these decoding functions work?
Thanks, Matt
You are not supposed to use the functions from the utf8 pragma module. Its documentation says so:
Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.
Always use the Encode module, and also see the question Checklist for going the Unicode way with Perl. unpack is too low-level; it does not even give you error-checking.
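A minimal sketch of the error-checking Encode gives you (not part of the original answer; the malformed byte string is just an illustration): with the third CHECK argument set to Encode::FB_CROAK, decode dies on bad input instead of quietly substituting U+FFFD, so mistakes surface immediately.
use strict;
use warnings;
use Encode qw( decode );

my $bytes = "\xC3\x28";   # "\xC3" starts a two-byte sequence, but "\x28" is not a continuation byte

my $text = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
if (!defined $text) {
    print "decode failed: $@";   # the malformed input is reported instead of being papered over
}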
You are going wrong with the assumption that the octets E8 AB 86 0A are the result of UTF-8 double-encoding the characters 諆 and newline. This is the representation of a single UTF-8 encoding of these characters. Perhaps the whole confusion on your side stems from that mistake.
length is inappropriately overloaded: at times it gives the length in characters, at times the length in octets. Use better tools such as Devel::Peek.
#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';
use Devel::Peek qw(Dump);
use Encode qw(decode);
my $test = "\x{00e8}\x{00ab}\x{0086}\x{000a}";
# or read the octets without implicit decoding from a file, does not matter
Dump $test;
# FLAGS = (PADMY,POK,pPOK)
# PV = 0x8d8520 "\350\253\206\n"\0
$test = decode('UTF-8', $test, Encode::FB_CROAK);
Dump $test;
# FLAGS = (PADMY,POK,pPOK,UTF8)
# PV = 0xc02850 "\350\253\206\n"\0 [UTF8 "\x{8ac6}\n"]
Turns out this was a bug: https://rt.perl.org/rt3//Public/Bug/Display.html?id=80190.