perl uri_escape_utf8 with arabic

I am trying to escape some Arabic text before passing it to LWP::UserAgent. I am testing this with the script below:
use URI::Escape;
my $files = "/home/root/temp.txt";
unlink ($files);
open (OUTFILE, '>>', $files);
my $text = "ضثصثضصثشس";
print OUTFILE uri_escape_utf8($text)."\n";
close (OUTFILE);
However, this produces the following output:
%C3%96%C3%8B%C3%95%C3%8B%C3%96%C3%95%C3%8B%C3%94%C3%93
which is not correct. Any pointers to what I need to do in order to escape this correctly?
Thank you for your help in advance.
Regards,
Olli

Perl considers your source file to be encoded as Latin-1 until you tell it to use utf8. If we do that, the string "ضثصثضصثشس" does not contain some jumbled bytes, but is rather a string of codepoints.
The uri_escape_utf8 expects a string of codepoints (not bytes!), encodes them, and then URI-escapes them. Ergo, the correct thing to do is
use utf8;
use URI::Escape;
print uri_escape_utf8("ضثصثضصثشس"), "\n";
Output: %D8%B6%D8%AB%D8%B5%D8%AB%D8%B6%D8%B5%D8%AB%D8%B4%D8%B3
If we fail to use utf8, then uri_escape_utf8 gets a string of bytes (which are accidentally encoded in UTF8), so we should have used uri_escape:
die "This is the wrong way to do it";
use URI::Escape;
print uri_escape("ضثصثضصثشس"), "\n";
which produces the same output as above – but only by accident.
Using uri_escape_utf8 with a byte string (that would decode to Arabic characters) produces the totally wrong
%C3%98%C2%B6%C3%98%C2%AB%C3%98%C2%B5%C3%98%C2%AB%C3%98%C2%B6%C3%98%C2%B5%C3%98%C2%AB%C3%98%C2%B4%C3%98%C2%B3
because this effectively double-encodes the data. It is the same as
use utf8;
use URI::Escape;
use Encode;
print uri_escape(encode "utf8", encode "utf8", "ضثصثضصثشس"), "\n";
Edit: So you used CP-1256, which is a non-portable single-byte encoding. It is unable to encode arbitrary Unicode characters, and should therefore be avoided along with other pre-Unicode encodings. You didn't declare your encoding, so perl thinks you meant Latin-1. This means that what you saw as "ضثصثضصثشس" was actually the byte stream D6 CB D5 CB D6 D5 CB D4 D3, which Perl interpreted as the Latin-1 characters «ÖËÕËÖÕËÔÓ»; that is exactly why your output starts with %C3%96, the UTF-8 encoding of «Ö».
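If your Arabic data really does arrive as CP-1256 bytes (say, read from a legacy file), decode those bytes first and only then escape. A minimal sketch, where the file name is made up for illustration:
use strict;
use warnings;
use Encode qw(decode);
use URI::Escape qw(uri_escape_utf8);
# Hypothetical legacy input encoded as CP-1256 (Windows Arabic).
open my $in, '<:raw', '/home/root/legacy_cp1256.txt' or die "open: $!";
my $cp1256_bytes = do { local $/; <$in> };
close $in;
my $text = decode('cp1256', $cp1256_bytes);   # bytes -> Unicode code points
chomp $text;                                  # drop a trailing newline, if any
print uri_escape_utf8($text), "\n";           # code points -> UTF-8 -> %XX escapes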
Edit: So you want to decode command-line arguments. The Encode::Locale module should manage this. Before accessing any parameters from @ARGV, do
use Encode;
use Encode::Locale qw(decode_argv);
decode_argv(Encode::FB_CROAK); # possibly: BEGIN { decode_argv(...) }
or use the locale pseudoencoding which it provides:
my $decoded_string = decode("locale", $some_binary_data);
Use this as a part in the overall strategy of decoding all input, and always encoding your output.
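Applied to the script from your question, that strategy might look something like this sketch (assuming the source file is saved as UTF-8, and keeping the temp-file path from the question):
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                # the source file itself is saved as UTF-8
use URI::Escape qw(uri_escape_utf8);
my $files = "/home/root/temp.txt";
unlink $files;
# uri_escape_utf8() returns plain ASCII (%XX sequences), so the output
# handle needs no encoding layer for this particular data.
open my $out, '>>', $files or die "Cannot open $files: $!";
my $text = "ضثصثضصثشس";
print {$out} uri_escape_utf8($text), "\n";
close $out;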

Related

Why do I get garbled output when I decode some HTML entities but not others?

In Perl, I am trying to decode strings which contain numeric HTML entities using HTML::Entities. Some entities work, while "newer" entities don't. For example:
decode_entities('&#174;');  # returns ® as expected
decode_entities('&#937;');  # returns garbage instead of Ω
decode_entities('&#9733;'); # returns garbage instead of ★
Is there a way to decode these "newer" HTML entities in Perl? In PHP, the html_entity_decode function seems to decode all of these entities without any problem.
The decoding works fine. It's how you're outputting the decoded strings that's wrong. For example, you may have sent them to a terminal without encoding them for that terminal first. Encoding the output for the terminal is achieved through the open pragma in the following program:
$ perl -e'
use open ":std", ":encoding(UTF-8)";
use HTML::Entities qw( decode_entities );
CORE::say decode_entities($_)
for "®", "Ω", "★";
'
®
Ω
★
Make sure your terminal can handle UTF-8 encoding. It looks like it's having problems with multibyte characters. You can also try to set UTF-8 for STDOUT in case you get wide character warnings.
use strict;
use warnings;
use HTML::Entities;
binmode STDOUT, ':encoding(UTF-8)';
print decode_entities('&#174;');  # returns ®
print decode_entities('&#937;');  # returns Ω
print decode_entities('&#9733;'); # returns ★
This gives me the correct/expected results.

Perl HTML Encoding Named Entities

I would like to encode 'special chars' to their named entity.
My code:
use HTML::Entities;
print encode_entities('“');
Desired output:
&ldquo;
And not:
&#147;
Does anyone have an idea? Greetings
If you don't use use utf8;, the file is expected to be encoded using iso-8859-1 (or subset US-ASCII).
«“» is not found in iso-8859-1's charset.
If you use use utf8;, the file is expected to be encoded using UTF-8.
«“» is found in UTF-8's charset, Unicode.
You indicated your file isn't saved as UTF-8, so as far as Perl is concerned, your source file cannot possibly contain «“».
Odds are that you encoded your file using cp1252, an extension of iso-8859-1 that adds «“». That's not a valid choice.
Options:
[Best option] Save the file as UTF-8 and use the following:
use utf8;
use HTML::Entities;
print encode_entities('“');
Save the file as cp1252, but only use US-ASCII characters.
use charnames ':full';
use HTML::Entities;
print encode_entities("\N{LEFT DOUBLE QUOTATION MARK}");
or
use HTML::Entities;
print encode_entities("\N{U+201C}");
or
use HTML::Entities;
print encode_entities("\x{201C}");
[Unrecommended] Save the file as cp1252 and decode literals explicitly
use Encode;
use HTML::Entities;
print encode_entities(decode('cp1252', '“'));
Perl sees:
use Encode;
use HTML::Entities;
print encode_entities(decode('cp1252', "\x93"));
Perl doesn't know the encoding of your source file. If you include any special characters, you should always save it with UTF-8-encoding and put
use utf8;
at the top of your code. This will make sure your string literals contain codepoints, not just bytes.
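A quick way to see the effect, assuming the file is saved as UTF-8: without use utf8; the literal «“» is three separate characters (the UTF-8 bytes E2 80 9C), with it it is the single code point U+201C.
use utf8;                    # source file saved as UTF-8
use HTML::Entities;
my $s = '“';
printf "length=%d ord=%X\n", length($s), ord($s);  # length=1 ord=201C
print encode_entities($s), "\n";                   # &ldquo;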
I had the same problem and applied all of the above hints. It worked from within my perl script (CGI), e.g. encode_entities("ä") produced the correct result &auml;. Yet applying encode_entities(param("test")) would encode the individual bytes.
I found this advice: http://blog.endpoint.com/2010/12/character-encoding-in-perl-decodeutf8.html
Putting it together this is my solution which finally works:
use CGI qw/:standard/;
use utf8;
use HTML::Entities;
use Encode;
print encode_entities(decode_utf8(param("test")));
It is not clear to me why that was required, but it works. HTH
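For what it's worth, the explicit decode_utf8 is needed because CGI parameters arrive as raw bytes. CGI.pm can also do that decoding itself via its -utf8 import pragma; a rough sketch of the same idea (the parameter name test is just the one from above):
use strict;
use warnings;
use CGI qw(:standard -utf8);   # -utf8: param() returns decoded UTF-8 text
use HTML::Entities;
print header(-charset => 'UTF-8');
print encode_entities(param('test') // '');
Note that the CGI documentation cautions that -utf8 can interfere with binary file uploads, so the explicit decode_utf8 variant above remains the more conservative choice.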

How to let File::Queue process utf8 strings in perl?

I'm processing some data from XML files in Perl and want to use the FIFO File::Queue to divide and speed up the process.
One perl script parses the XML file and prepares JSON output for another script:
#!/usr/bin/perl -w
binmode STDOUT, ":utf8";
use utf8;
use strict;
use XML::Rules;
use JSON;
use File::Queue;
my %data;  # ... do the XML magic here; %data contains the result
my $q = new File::Queue (File => './importqueue', Mode => 0666);
my $json = new JSON;
my $qItem = $json->allow_nonref->encode(\%data);
$q->enq($qItem);
As long as %data contains only numeric and a-z data, this works fine. But when one of the wide characters occurs (e.g. ł, ą, ś, ż etc.) I'm getting: Wide character in syswrite at /usr/lib/perl/5.10/IO/Handle.pm line 207.
I have tried to check if the string is valid utf8:
print utf8::is_utf8($qItem). ':' . utf8::valid($qItem)
and I did get 1:1 - so yes I do have the correct utf8 string.
I have found out that the reason could be that syswrite gets a filehandle to the queue file which is not marked as a :utf8 handle.
Am I right? If so, is there any way to force File::Queue to use a :utf8 filehandle?
Maybe File::Queue is not the best choice - should I use something else to create a FIFO queue between two Perl scripts?
utf8::is_utf8 does not tell you whether your string is encoded using UTF-8 or not. (That information is not even available.)
>perl -MEncode -E"say utf8::is_utf8(encode_utf8(chr(0xE9))) || 0"
0
utf8::valid does not tell you whether your string is valid UTF-8 or not.
>perl -MEncode -E"say utf8::valid(qq{\xE9}) || 0"
1
Both check some internal storage details. You should never have a need for either.
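If the intent was to verify that $qItem is well-formed UTF-8, the usual approach is to attempt a strict decode and catch the failure. A small sketch (the helper name is made up):
use Encode qw(decode);
# Returns true if the byte string decodes cleanly as UTF-8.
sub looks_like_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}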
File::Queue can only transmit strings of bytes. It's up to you to serialise the data you want to transmit into a string.
The primary means of serialising text is character encoding, or just encoding for short. UTF-8 is a character encoding.
For example, the string
dostępu
consists of the following chars (each a Unicode code point):
64 6F 73 74 119 70 75
Not all of those chars fit in bytes, so the string can't be sent using File::Queue. If you were to encode that string using UTF-8, you'd get a string composed of the following chars:
64 6F 73 74 C4 99 70 75
Those chars fit in bytes, so that string can be sent using File::Queue.
JSON, as you used it, returns strings of Unicode code points. As such, you need to apply a character encoding.
File::Queue doesn't provide an option to automatically encode strings for you, so you'll have to do it yourself.
You could use encode_utf8 and decode_utf8 from the Encode module
use Encode qw( encode_utf8 decode_utf8 );
my $json = JSON->new->allow_nonref;
$q->enq(encode_utf8($json->encode(\%data)));
my $data = $json->decode(decode_utf8($q->deq()));
or you can let JSON do the encoding/decoding for you.
my $json = JSON->new->utf8->allow_nonref;
$q->enq($json->encode(\%data));
my $data = $json->decode($q->deq());
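For completeness, the consuming script mirrors this. A sketch of the receiving side, assuming the same queue file, the JSON-handles-the-encoding variant, and that deq() returns undef once the queue is empty:
#!/usr/bin/perl
use strict;
use warnings;
use File::Queue;
use JSON;
my $q    = File::Queue->new(File => './importqueue', Mode => 0666);
my $json = JSON->new->utf8->allow_nonref;
# deq() hands back the UTF-8 byte string that was enqueued;
# ->utf8 makes decode() turn those bytes back into code points.
while (defined(my $item = $q->deq())) {
    my $data = $json->decode($item);
    # ... process $data ...
}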
Looking at the docs.....
perldoc -f syswrite
WARNING: If the filehandle is marked ":utf8", Unicode
characters encoded in UTF-8 are written instead of bytes, and
the LENGTH, OFFSET, and return value of syswrite() are in
(UTF8-encoded Unicode) characters. The ":encoding(...)" layer
implicitly introduces the ":utf8" layer. Alternately, if the
handle is not marked with an encoding but you attempt to write
characters with code points over 255, raises an exception. See
"binmode", "open", and the "open" pragma, open.
man 3perl open
use open OUT => ':utf8';
...
with the "OUT" subpragma you can declare the default
layers of output streams. With the "IO" subpragma you can control
both input and output streams simultaneously.
So I'd guess adding use open OUT => ':utf8' to the top of your program would help.

Perl ord and chr working with unicode

To my horror I've just found out that chr doesn't work with Unicode, although it does something. The man page is anything but clear:
Returns the character represented by that NUMBER in the character set. For example, chr(65) is "A" in either ASCII or Unicode, and chr(0x263a) is a Unicode smiley face.
Indeed I can print a smiley using
perl -e 'print chr(0x263a)'
but things like chr(0x00C0) do not work. I see that my perl v5.10.1 is a bit ancient, but when I paste various strange letters in the source code, everything's fine.
I've tried funny things like use utf8 and use encoding 'utf8', I haven't tried funny things like use v5.12 and use feature 'unicode_strings' as they don't work with my version, I was fooling around with Encode::decode to find out that I need no decoding as I have no byte array to decode. I've read much more documentation than ever before, and found quite a few interesting things but nothing helpful. It looks like a sort of the Unicode Bug but there's no usable solution given. Moreover I don't care about the whole string semantics, all I need is a trivial function.
So how can I convert a number into a string consisting of the single character corresponding with it, so that for example real_chr(0xC0) eq 'À' holds?
The first answer I got explains quite a lot about IO, but I still don't understand why
#!/usr/bin/perl -w
use strict;
use utf8;
use encoding 'utf8';
print chr(0x00C0) eq 'À' ? 'eq1' : 'ne1', " - ", chr(0x263a) eq '☺' ? 'eq1' : 'ne1', "\n";
print 'À' =~ /\w/ ? "match1" : "no_match1", " - ", chr(0x00C0) =~ /\w/ ? "match2" : "no_match2", "\n";
prints
ne1 - eq1
match1 - no_match2
It means that the manually entered 'À' differs from chr(0x00C0). Moreover, the former is a word constituent character (correct!) while the latter is not (but should be!).
First,
perl -le'print chr(0x263A);'
is buggy. Perl even tells you as much:
Wide character in print at -e line 1.
That doesn't qualify as "working". So while they differ in how they fail to provide what you want, neither of the following gives you what you want:
perl -le'print chr(0x263A);'
perl -le'print chr(0x00C0);'
To properly output the UTF-8 encoding of those Unicode code points, you need to tell Perl to encode the code points using UTF-8.
$ perl -le'use open ":std", ":encoding(UTF-8)"; print chr(0x263A);'
☺
$ perl -le'use open ":std", ":encoding(UTF-8)"; print chr(0x00C0);'
À
Now on to the "why".
File handles can only transmit bytes, so unless you tell them otherwise, Perl file handles expect bytes. That means the string you provide to print cannot contain anything but bytes, or in other words, it cannot contain characters over 255. The output is exactly what you provide:
$ perl -e'print map chr, 0x00, 0x65, 0xC0, 0xF0' | od -t x1
0000000 00 65 c0 f0
0000004
This is useful. This is different from what you want, but that doesn't make it wrong. If you want something different, you just need to tell Perl what you want.
By adding an :encoding layer, the handle now expects a string of Unicode characters, or as I call it, "text". The layer tells Perl how to convert the text into bytes.
$ perl -e'
use open ":std", ":encoding(UTF-8)";
print map chr, 0x00, 0x65, 0xC0, 0xF0, 0x263a;
' | od -t x1
0000000 00 65 c3 80 c3 b0 e2 98 ba
0000011
You're right that chr doesn't know or care about Unicode. Like length, substr, ord and reverse, chr implements a basic string function, not a Unicode function. That doesn't mean it can't be used to work with text strings. As you've seen, the problem wasn't with chr but with what you did with the string after you built it.
A character is an element of a string, and a character is a number. That means a string is just a sequence of numbers. Whether you treat those numbers as Unicode code points (text), packed IP addresses or temperature measurements is entirely up to you and the functions to which you pass the strings.
Here are a few examples of operators that do assign meaning to the strings they receive as operands:
m// expects a string of Unicode code points.
connect expects a sequence of bytes that represent a sockaddr_in structure.
print with a handle without :encoding expects a sequence of bytes.
print with a handle with :encoding expects a sequence of Unicode code points.
etc
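To make the "a string is just a sequence of numbers" point concrete, here is a tiny illustration; the printed output is plain ASCII, so no I/O layer is needed:
# Build a string from three arbitrary numbers and read them back with ord.
my $str = chr(0x65) . chr(0xC0) . chr(0x263A);
print join(' ', map { sprintf '%X', ord } split //, $str), "\n";   # 65 C0 263A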
So how can I convert a number into a string consisting of the single character corresponding with it, so that for example real_chr(0xC0) eq 'À' holds?
chr(0xC0) eq 'À' does hold. Did you remember to tell Perl you encoded your source code using UTF-8 by using use utf8;? If you didn't tell Perl, Perl actually sees a two-character string on the RHS.
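A quick self-check, assuming the source file is saved as UTF-8: with use utf8; the literal is one character and the comparison holds; remove the pragma and the same literal becomes the two bytes C3 80.
use utf8;                                     # tell Perl the source is UTF-8-encoded
print length('À'), "\n";                      # 1 with use utf8; 2 without it
print chr(0xC0) eq 'À' ? "eq" : "ne", "\n";   # eq with use utf8; ne without it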
Regarding the question you've added:
There are problems with the encoding pragma. I recommend against using it. Instead, use
use open ':std', ':encoding(UTF-8)';
That'll fix one of the problems. The other problem you are encountering is with
chr(0x00C0) =~ /\w/
It's a known bug that's intentionally left broken for backwards compatibility reasons. That is, unless you request a more recent version of the language as follows:
use 5.014; # use 5.012; *might* suffice.
A workaround that works as far back as 5.8:
my $x = chr(0x00C0);
utf8::upgrade($x);
$x =~ /\w/

Perl: String literal in module in latin1 - I want utf8

In the Date::Holidays::DK module, the names of certain Danish holidays are written in Latin1 encoding. For example, January 1st is 'Nytårsdag'. What should I do to $x below in order to get a proper utf8-encoded string?
use Date::Holidays::DK;
my $x = is_dk_holiday(2011,1,1);
I tried various combinations of use utf8 and no utf8 before/after use Date::Holidays::DK, but it does not seem to have any effect. I also tried to use Encode's decode, with no luck. More specifically,
use Date::Holidays::DK;
use Encode;
use Devel::Peek;
my $x = decode("iso-8859-1",
is_dk_holiday(2011,1,1)
);
Dump($x);
print "January 1st is '$x'\n";
gives the output
SV = PV(0x15eabe8) at 0x1492a10
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x1593710 "Nyt\303\245rsdag"\0 [UTF8 "Nyt\x{e5}rsdag"]
CUR = 10
LEN = 16
January 1st is 'Nyt sdag'
(with an invalid character between t and s).
use utf8 and no utf8 before/after use Date::Holidays::DK, but it does not seem to have any effect.
Correct. The utf8 pragma only indicates that the source code of the program is written in UTF-8.
I also tried to use Encode's decode, with no luck.
You did not perceive this correctly, you in fact did the right thing. You now have a string of Perl characters and can manipulate it.
with an invalid character between t and s
You also interpret this wrong, it is in fact the å character.
You want to output UTF-8, so you are lacking the encoding step.
my $octets = encode 'UTF-8', $x;
print $octets;
Please read http://p3rl.org/UNI for an introduction to the topic of encoding. You always must decode and encode, either explicitly or implicitly.
use utf8 is only a hint to the perl interpreter/compiler that your file is UTF-8 encoded. If you have string literals with the high bit set, it will automatically decode them into Perl's internal Unicode string form.
If you have a variable that is encoded in iso-8859-1 you must decode it. Then your variable is in the internal unicode format. That's utf8, but you shouldn't care which encoding perl uses internally.
Now if you want to print such a string you need to convert the unicode string back to a byte string. You need to do an encode on this string. If you don't encode manually, perl itself will encode it back to iso-8859-1, the default encoding.
Before you print your variable $x, you need to do a $x = encode('UTF-8', $x) on it.
For correct handling of UTF-8 you always need to decode() every external input over I/O. And you always need to encode() everything that leaves your program.
To change the default input/output encoding you can use something like this.
use utf8;
use open ':encoding(UTF-8)';
use open ':std', ':encoding(UTF-8)';
The first line says that your source code is encoded in utf8. The second line says that every input/output handle should automatically encode/decode UTF-8. It is important to notice that open() then also opens files in utf8 mode. If you work with binary files you need to call binmode() on the handle.
But the second line does not change the handling of STDIN, STDOUT or STDERR. The third line changes that (the :std subpragma only takes effect when combined with an encoding, which is why the encoding is repeated there).
You can probably use the module utf8::all, which makes this process easier. But it is always good to understand how all this works behind the scenes.
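If you do reach for utf8::all, the boilerplate above collapses to a single line; a minimal sketch, assuming the module is installed and the script itself is saved as UTF-8:
use utf8::all;         # roughly: use utf8 plus UTF-8 layers on STDIN/STDOUT/STDERR, @ARGV and open()
print "Nytårsdag\n";   # prints correctly without an explicit encode()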
To correct your example, one possible way is this:
#!/usr/bin/env perl
use Date::Holidays::DK;
use Encode;
use Devel::Peek;
my $x = decode("iso-8859-1",
is_dk_holiday(2011,1,1)
);
Dump($x);
print encode("UTF-8", "January 1st is '$x'\n");