I'm trying to write up an example of testing query string parsing when I got stumped on a Unicode issue. In short, the letter "Omega" (Ω) doesn't seem to be decoded correctly.
Unicode: U+2126
3-byte sequence: \xe2\x84\xa6
URI encoded: %E2%84%A6
So I wrote this test program verify that I could "decode" unicode query strings with URI::Encode.
use strict;
use warnings;
use utf8::all; # use before Test::Builder clones STDOUT, etc.
use URI::Encode 'uri_decode';
use Test::More;
sub parse_query_string {
my $query_string = shift;
my #pairs = split /[&;]/ => $query_string;
my %values_for;
foreach my $pair (#pairs) {
my ( $key, $value ) = split( /=/, $pair );
$_ = uri_decode($_) for $key, $value;
$values_for{$key} ||= [];
push #{ $values_for{$key} } => $value;
}
return \%values_for;
}
my $omega = "\N{U+2126}";
my $query = parse_query_string('alpha=%E2%84%A6');
is_deeply $query, { alpha => [$omega] }, 'Unicode should decode correctly';
diag $omega;
diag $query->{alpha}[0];
done_testing;
And the output of the test:
query.t ..
not ok 1 - Unicode should decode correctly
# Failed test 'Unicode should decode correctly'
# at query.t line 23.
# Structures begin differing at:
# $got->{alpha}[0] = 'â¦'
# $expected->{alpha}[0] = 'Ω'
# Ω
# â¦
1..1
# Looks like you failed 1 test of 1.
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/1 subtests
Test Summary Report
-------------------
query.t (Wstat: 256 Tests: 1 Failed: 1)
Failed test: 1
Non-zero exit status: 1
Files=1, Tests=1, 0 wallclock secs ( 0.03 usr 0.01 sys + 0.05 cusr 0.00 csys = 0.09 CPU)
Result: FAIL
It looks to me like URI::Encode may be broken here, but switching to URI::Escape and using the uri_unescape function reports the same error. What am I missing?
the URI encoded characters simply represents utf-8 sequences, and URI::Encode and URI::Escape simply decodes them into a utf-8 byte string, and neither of them decode the bytestrings as UTF-8 (which is a correct behavior as a generic URI decoding library).
Put it another way, your code basically does:
is "\N{U+2126}", "\xe2\x84\xa6" and that will fail, since upon comparison, perl upgrades the latter as a 3-character-length latin-1 strings.
You have to manually decode the input value with Encode::decode_utf8 after uri_decode, or instead compare encoded utf8 byte sequence.
URI escaping represents octets and knows nothing about character encodings, so you have to decode from UTF-8 octets to characters yourself, e.g.:
$_ = decode_utf8(uri_decode($_)) for $key, $value;
The problem can be seen in incorrect details in your own explanation of the problem. What you are dealing with is really:
Unicode codepoint: U+2126
UTF-8 encoding of codepoint: \xe2\x84\xa6
URI encoding of UTF-8 encoding of codepoint: %E2%84%A6
The problem is that you only undid one of the encodings.
Solutions have already been presented. I just wanted to present an alternate explanation.
I'd recommend that you have a look at Why does modern Perl avoid UTF-8 by default? for a thorough discussion on this topic.
I would add to the discussion there:
You'll notice a lot of odd glyphs on the page. This was intentional on the part of the author.
I've tried the Symbola font recommended in the thread and it looked horrible on Win 7. YMMV.
Reading Why does modern Perl avoid UTF-8 by default? too frequently may lead to depression and lingering doubts about your life choices.
Related
I want to escape a string, per RFC 4515. So, the string "u1" would be transformed to "\75\31", that is, the ordinal value of each character, in hex, preceded by backslash.
Has to be done in Perl. I already know how to do it in Python, C++, Java, etc., but Perl if baffling.
Also, I cannot use Net::LDAP and I may not be able to add any new modules, so, I want to do it with basic Perl features.
Skimming through RFC 4515, this encoding escapes the individual octets of multi-byte UTF-8 characters, not codepoints. So, something that works with non-ASCII text too:
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw/say/;
sub valueencode ($) {
# Unpack format returns octets of UTF-8 encoded text
my #bytes = unpack "U0C*", $_[0];
sprintf '\%02x' x #bytes, #bytes;
}
say valueencode 'u1';
say valueencode "Lu\N{U+010D}i\N{U+0107}"; # Lučić, from the RFC 4515 examples
Example:
$ perl demo.pl
\75\31
\4c\75\c4\8d\69\c4\87
Or an alternative using the vector flag:
use Encode qw/encode/;
sub valueencode ($) {
sprintf '\%*vx', "\\", encode('UTF-8', $_[0]);
}
Finally, a smarter version that only escapes ASCII characters when it has to (And multi-byte characters, even though upon a closer read of the RFC they don't actually need to be if they're valid UTF-8):
# Encode according to RFC 4515 valueencoding grammar rules:
#
# Text is UTF-8 encoded. Bytes can be escaped with the sequence
# \XX, where the X's are hex digits.
#
# The characters NUL, LPAREN, RPAREN, ASTERISK and BACKSLASH all MUST
# be escaped.
#
# Bytes > 0x7F that aren't part of a valid UTF-8 sequence MUST be
# escaped. This version assumes there are no such bytes and that input
# is a ASCII or Unicode string.
#
# Single bytes and valid multibyte UTF-8 sequences CAN be escaped,
# with each byte escaped separately. This version escapes multibyte
# sequences, to give ASCII results.
sub valueencode ($) {
my $encoded = "";
for my $byte (unpack 'U0C*', $_[0]) {
if (($byte >= 0x01 && $byte <= 0x27) ||
($byte >= 0x2B && $byte <= 0x5B) ||
($byte >= 0x5D && $byte <= 0x7F)) {
$encoded .= chr $byte;
} else {
$encoded .= sprintf '\%02x', $byte;
}
}
return $encoded;
}
This version returns the strings 'u1' and 'Lu\c4\8di\c4\87' from the above examples.
In short, one way is just as the question says: split the string into characters, get their ordinals then convert format to hex; then put it back together. I don't know how to get the \nn format so I'd make it 'by hand'. For instance
my $s = join '', map { sprintf '\%x', ord } split //, 'u1';
Or use vector flag %v to treat the string as a "vector" of integers
my $s = sprintf '\%*vx', '\\', 'u1';
With %v the string is broken up into numerical representation of characters, each is converted (%x), and they're joined back, with . between them. That (optional) * allows us to specify our string by which to join them instead, \ (escaped) here.
This can also be done with pack + unpack, see the link below. Also see that page if there is a wide range of input characters.†
See ord and sprintf, and for more pages like this one.
† If there is non-ASCII input then you may need to encode it so to get octets, if they are to escape (and not whole codepoints)
use Encode qw(encode);
my $s = sprintf '\%*vx', '\\', encode('UTF_8', $input);
See the linked page for more.
I am using substr on a very long UTF-8 string (~250,000,000 characters).
The thing is my program almost freeze around the 200,000,000th character.
Does somebody know about this issue? What are my options?
As I am indexing a document using a suffix array, I need:
to keep my string in one piece;
to access variable length substrings using an index.
As for a MWE:
use strict;
use warnings;
use utf8;
my $text = 'あいうえお' x 50000000;
for( my $i = 0 ; $i < length($text) ; $i++ ){
print "\r$i";
my $char = substr($text,$i,1);
}
print "\n";
Perl has two string storage formats. One that's capable of storing 8-bit characters, and one capable of storing 72-bit characters (practically limited to 32 or 64). Your string necessarily uses the latter format. This wide-character format uses a variable number of bytes per character like UTF-8 does.
Finding the ith element of a string in the first format is trivial: Add the offset to the string pointer. With the second format, finding the ith character requires scanning the string from the beginning, just like you would have to scan a file from the beginning to find the nth line. There is a mechanism that caches information about the string as it's discovered, but it's not perfect.
The problem goes away if you use a fixed number of bytes per character.
use utf8;
use Encode qw( encode );
my $text = 'あいうえお' x 50000000;
my $packed = encode('UCS-4le', $text);
for my $i (0..length($packed)/4) {
print "\r$i";
my $char = chr(unpack('V', substr($packed, $i*4, 4)));
}
print "\n";
Note that the string will use 33% more memory for hiragana characters. Or maybe not, since there's no cache anymore.
I suggest that you use a regular expression instead of substr.
Benchmarking these two methods shows that a regex is nearly 100 times faster:
use strict;
use warnings;
use utf8;
my $text = 'あいうえお' x 50_000;
sub mysubstr {
for( my $i = 0 ; $i < length($text) ; $i++ ){
my $char = substr($text,$i,1);
}
}
sub myregex {
while ($text =~ /(.)/g) {
my $char = $1;
}
}
use Benchmark qw(:all) ;
timethese(10, {
'substr' => \&mysubstr,
'regex' => \&myregex,
});
Outputs:
Benchmark: timing 10 iterations of regex, substr...
regex: 2 wallclock secs ( 2.18 usr + 0.00 sys = 2.18 CPU) # 4.58/s (n=10)
substr: 198 wallclock secs (184.66 usr + 0.16 sys = 184.81 CPU) # 0.05/s (n=10)
It is a known issue listed under Bugs for Perl 5.20.0:
http://perldoc.perl.org/perlunicode.html#Speed
The most important part is the first paragraph of my quote:
Speed
Some functions are slower when working on UTF-8 encoded strings than on byte encoded strings. All functions that need to hop over characters such as length(), substr() or index(), or matching regular expressions can work much faster when the underlying data are byte-encoded.
In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1 a caching scheme was introduced which will hopefully make the slowness somewhat less spectacular, at least for some operations. In general, operations with UTF-8 encoded strings are still slower. As an example, the Unicode properties (character classes) like \p{Nd} are known to be quite a bit slower (5-20 times) than their simpler counterparts like \d (then again, there are hundreds of Unicode characters matching Nd compared with the 10 ASCII characters matching d ).
The easiest way to avoid it is using byte-strings instead of unicode-strings.
In your particular sample, you can just remove characters from the beginning of the $text string as they are processed in order to avoid the linear lookup:
use utf8;
use Encode qw( encode );
$| = 1;
my $text = 'あいうえお' x 50000000;
while ($text ne '') {
print ".";
my $char = substr($text, 0, 1, '');
}
print "\n";
I have utf8 sequence of bytes and need to trim it to say 30bytes. This may result in incomplete sequence at the end. I need to figure out how to remove the incomplete sequence.
e.g
$b="\x{263a}\x{263b}\x{263c}";
my $sstr;
print STDERR "length in utf8 bytes =" . length(Encode::encode_utf8($b)) . "\n";
{
use bytes;
$sstr= substr($b,0,29);
}
#After this $sstr contains "\342\230\272\342"\0
# How to remove \342 from the end
UTF-8 has some neat properties that allow us to do what you want while dealing with UTF-8 rather than characters. So first, you need UTF-8.
use Encode qw( encode_utf8 );
my $bytes = encode_utf8($str);
Now, to split between codepoints. The UTF-8 encoding of every code point will start with a byte matching 0b0xxxxxxx or 0b11xxxxxx, and you will never find those bytes in the middle of a code point. That means you want to truncate before
[\x00-\x7F\xC0-\xFF]
Together, we get:
use Encode qw( encode_utf8 );
my $max_bytes = 8;
my $str = "\x{263a}\x{263b}\x{263c}"; # ☺☻☼
my $bytes = encode_utf8($str);
$bytes =~ s/^.{0,$max_bytes}(?![^\x00-\x7F\xC0-\xFF])\K.*//s;
# $bytes contains encode_utf8("\x{263a}\x{263b}")
# instead of encode_utf8("\x{263a}\x{263b}") . "\xE2\x98"
Great, yes? Nope. The above can truncate in the middle of a grapheme. A grapheme (specifically, an "extended grapheme cluster") is what someone would perceive as a single visual unit. For example, "é" is a grapheme, but it can be encoded using two codepoints ("\x{0065}\x{0301}"). If you cut between the two code points, it would be valid UTF-8, but the "é" would become a "e"! If that's not acceptable, neither is the above solution. (Oleg's solution suffers from the same problem too.)
Unfortunately, UTF-8's properties are no longer sufficient to help us here. We'll need to grab one grapheme at a time, and add it to the output until we can't fit one.
my $max_bytes = 6;
my $str = "abcd\x{0065}\x{0301}fg"; # abcdéfg
my $bytes = '';
my $bytes_left = $max_bytes;
while ($str =~ /(\X)/g) {
my $grapheme = $1;
my $grapheme_bytes = encode_utf8($grapheme);
$bytes_left -= length($grapheme_bytes);
last if $bytes_left < 0;
$bytes .= $grapheme_bytes;
}
# $bytes contains encode_utf8("abcd")
# instead of encode_utf8("abcde")
# or encode_utf8("abcde") . "\xCC"
First, please don't use bytes (and never assume that any internal encoding in Perl). As documentation says: This pragma reflects early attempts to incorporate Unicode into perl and has since been superseded <...> use of this module for anything other than debugging purposes is strongly discouraged.
To strip incomplete sequence at end of line, assuming it contains octets, use Encode::decode's Encode::FB_QUIET handling mode to stop processing once you hit invalid sequence and then just encode result back:
my $valid = Encode::decode('utf8', $sstr, Encode::FB_QUIET);
$sstr = Encode::encode('utf8', $valid);
Note that if you plan to use it with another encoding in future, not all of encodings may support this handling method.
I am having some interesting results trying to discern the differences between using Encode::decode("utf8", $var) and utf8::decode($var). I've already discovered that calling the former multiple times on a variable will eventually result in an error "Cannot decode string with wide characters at..." whereas the latter method will happily run as many times as you want, simply returning false.
What I'm having trouble understanding is how the length function returns different results depending on which method you use to decode. The problem arises because I am dealing with "doubly encoded" utf8 text from an outside file. To demonstrate this issue, I created a text file "test.txt" with the following Unicode characters on one line: U+00e8, U+00ab, U+0086, U+000a. These Unicode characters are the double-encoding of the Unicode character U+8acb, along with a newline character. The file was encoded to disk in UTF8. I then run the following perl script:
#!/usr/bin/perl
use strict;
use warnings;
require "Encode.pm";
require "utf8.pm";
open FILE, "test.txt" or die $!;
my #lines = <FILE>;
my $test = $lines[0];
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
my #unicode = (unpack('U*', $test));
print "Unicode:\n#unicode\n";
my #hex = (unpack('H*', $test));
print "Hex:\n#hex\n";
print "==============\n";
$test = Encode::decode("utf8", $test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
#unicode = (unpack('U*', $test));
print "Unicode:\n#unicode\n";
#hex = (unpack('H*', $test));
print "Hex:\n#hex\n";
print "==============\n";
$test = Encode::decode("utf8", $test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
#unicode = (unpack('U*', $test));
print "Unicode:\n#unicode\n";
#hex = (unpack('H*', $test));
print "Hex:\n#hex\n";
This gives the following output:
Length: 7
utf8 flag:
Unicode:
195 168 194 171 194 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
232 171 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 2
utf8 flag: 1
Unicode:
35531 10
Hex:
e8ab8b0a
This is what I would expect. The length is originally 7 because perl thinks that $test is just a series of bytes. After decoding once, perl knows that $test is a series of characters that are utf8-encoded (i.e. instead of returning a length of 7 bytes, perl returns a length of 4 characters, even though $test is still 7 bytes in memory). After the second decoding, $test contains 4 bytes interpreted as 2 characters, which is what I would expect since Encode::decode took the 4 code points and interpreted them as utf8-encoded bytes, resulting in 2 characters. The strange thing is when I modify the code to call utf8::decode instead (replace all $test = Encode::decode("utf8", $test); with utf8::decode($test))
This gives almost identical output, only the result of length differs:
Length: 7
utf8 flag:
Unicode:
195 168 194 171 194 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
232 171 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
35531 10
Hex:
e8ab8b0a
It seems like perl first counts the bytes before decoding (as expected), then counts the characters after the first decoding, but then counts the bytes again after the second decoding (not expected). Why would this switch happen? Is there a lapse in my understanding of how these decoding functions work?
Thanks,Matt
You are not supposed to use the functions from the utf8 pragma module. Its documentation says so:
Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.
Always use the Encode module, and also see the question Checklist for going the Unicode way with Perl. unpack is too low-level, it does not even give you error-checking.
You are going wrong with the assumption that the octects E8 AB 86 0A are the result of UTF-8 double-encoding the characters 諆 and newline. This is the representation of a single UTF-8 encoding of these characters. Perhaps the whole confusion on your side stems from that mistake.
length is unappropriately overloaded, at certain times it determines the length in characters, or the length in octets. Use better tools such as Devel::Peek.
#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';
use Devel::Peek qw(Dump);
use Encode qw(decode);
my $test = "\x{00e8}\x{00ab}\x{0086}\x{000a}";
# or read the octets without implicit decoding from a file, does not matter
Dump $test;
# FLAGS = (PADMY,POK,pPOK)
# PV = 0x8d8520 "\350\253\206\n"\0
$test = decode('UTF-8', $test, Encode::FB_CROAK);
Dump $test;
# FLAGS = (PADMY,POK,pPOK,UTF8)
# PV = 0xc02850 "\350\253\206\n"\0 [UTF8 "\x{8ac6}\n"]
Turns out this was a bug: https://rt.perl.org/rt3//Public/Bug/Display.html?id=80190.
I have a Unicode string and don't know what its encoding is. When this string is read by a Perl program, is there a default encoding that Perl will use? If so, how can I find out what it is?
I am trying to get rid of non-ASCII characters from the input. I found this on some forum that will do it:
my $line = encode('ascii', normalize('KD', $myutf), sub {$_[0] = ''});
How will the above work when no input encoding is specified? Should it be specified like the following?
my $line = encode('ascii', normalize('KD', decode($myutf, 'input-encoding'), sub {$_[0] = ''});
To find out in which encoding something unknown uses, you just have to try and look. The modules Encode::Detect and Encode::Guess automate that. (If you have trouble compiling Encode::Detect, try its fork Encode::Detective instead.)
use Encode::Detect::Detector;
my $unknown = "\x{54}\x{68}\x{69}\x{73}\x{20}\x{79}\x{65}\x{61}\x{72}\x{20}".
"\x{49}\x{20}\x{77}\x{65}\x{6e}\x{74}\x{20}\x{74}\x{6f}\x{20}".
"\x{b1}\x{b1}\x{be}\x{a9}\x{20}\x{50}\x{65}\x{72}\x{6c}\x{20}".
"\x{77}\x{6f}\x{72}\x{6b}\x{73}\x{68}\x{6f}\x{70}\x{2e}";
my $encoding_name = Encode::Detect::Detector::detect($unknown);
print $encoding_name; # gb18030
use Encode;
my $string = decode($encoding_name, $unknown);
I find encode 'ascii' is a lame solution for getting rid of non-ASCII characters. Everything will be substituted with questions marks; this is too lossy to be useful.
# Bad example; don't do this.
use utf8;
use Encode;
my $string = 'This year I went to 北京 Perl workshop.';
print encode('ascii', $string); # This year I went to ?? Perl workshop.
If you want readable ASCII text, I recommend Text::Unidecode instead. This, too, is a lossy encoding, but not as terrible as plain encode above.
use utf8;
use Text::Unidecode;
my $string = 'This year I went to 北京 Perl workshop.';
print unidecode($string); # This year I went to Bei Jing Perl workshop.
However, avoid those lossy encodings if you can help it. In case you want to reverse the operation later, pick either one of PERLQQ or XMLCREF.
use utf8;
use Encode qw(encode PERLQQ XMLCREF);
my $string = 'This year I went to 北京 Perl workshop.';
print encode('ascii', $string, PERLQQ); # This year I went to \x{5317}\x{4eac} Perl workshop.
print encode('ascii', $string, XMLCREF); # This year I went to 北京 Perl workshop.
The Encode module has a way that you can try to do this. You decode the raw octets with what you think the encoding is. If the octets don't represent a valid encoding, it blows up and you catch it with an eval. Otherwise, you get back a properly encoded string. For example:
use Encode;
my $a_with_ring =
eval { decode( 'UTF-8', "\x6b\xc5", Encode::FB_CROAK ) }
or die "Could not decode string: $#";
This has the drawback that the same octet sequence can be valid in multiple encodings
I have more to say about this in the upcoming Effective Perl Programming, 2nd Edition, which has an entire chapter on dealing with Unicode. I think my publisher would get mad if I posted the whole thing though. :)
You might also want to see Juerd's Unicode Advice, as well as some of the Unicode docs that come with Perl.
I like mscha's solution here, but simplified using Perl's defined-or operator (//):
sub slurp($file)
local $/;
open(my $fh, '<:raw', $file) or return undef();
my $raw = <$fh>;
close($fh);
# return the first successful decoding result
return
eval { Encode::decode('utf-8', $raw, Encode::FB_CROAK); } // # Try UTF-8
eval { Encode::decode('windows-1252', $raw, Encode::FB_CROAK); } // # Try windows-1252 (a superset of iso-8859-1 and ascii)
$raw; # Give up and return the raw bytes
}
The first successful decoding is returned. Plain ASCII content succeeds in the first decoding.
If you are working directly with string variables instead of reading in files, you can use just the successive-eval expression.
You can use the following code also, to encrypt and decrypt the code
sub ENCRYPT_DECRYPT() {
my $Str_Message=$_[0];
my $Len_Str_Message=length($Str_Message);
my $Str_Encrypted_Message="";
for (my $Position = 0;$Position<$Len_Str_Message;$Position++){
my $Key_To_Use = (($Len_Str_Message+$Position)+1);
$Key_To_Use =(255+$Key_To_Use) % 255;
my $Byte_To_Be_Encrypted = substr($Str_Message, $Position, 1);
my $Ascii_Num_Byte_To_Encrypt = ord($Byte_To_Be_Encrypted);
my $Xored_Byte = $Ascii_Num_Byte_To_Encrypt ^ $Key_To_Use;
my $Encrypted_Byte = chr($Xored_Byte);
$Str_Encrypted_Message .= $Encrypted_Byte;
}
return $Str_Encrypted_Message;
}
my $var=&ENCRYPT_DECRYPT("hai");
print &ENCRYPT_DECRYPT($var);