How to avoid wide characters in LWP::UserAgent? - perl

I am trying to download the contents (formulas) of a web page in Perl. I used the LWP::UserAgent module to fetch the content and took care to check for UTF-8. The code is as follows:
use LWP::UserAgent;
my $ua = new LWP::UserAgent;
my $response = $ua->get('http://www.abc.org/patent/formulae');
my $content =$response->decoded_content();
if (utf8::is_utf8($content))
{
binmode STDOUT,':utf8';
}
else
{
binmode STDOUT,':raw';
}
print $content;
But I still get wide characters & the output is as follows:
"Formula =
Ï
Ì
â¡
(
c
+
/
c
0
)
â
1
"
Whereas I want:
"Formula = Ï Ì â¡ ( c + / c 0 ) â 1 "
How can we avoid that?

The decoded_content uses encoding and charset information available in the HTTP header to decode your data. However, HTML files may specify a different encoding.
If you want your output file to be utf8, you should always apply the :utf8 layer. What you are trying to do with is_utf8 is wrong.
Perl strings are internally stored with two different encodings. This is absolutely irrelevant to you, the programmer. The is_utf8 just reads the value of an internal flag that determines this internal representation. Just because this flag isn't set doesn't mean that one codepoint in your string may not be encoded as multiple bytes when encoded as utf8.
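To make the point about the internal flag concrete, here is a minimal sketch that uses only the core Encode module. The sample string is an assumption chosen for illustration; it is not taken from the page in the question.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character, U+00E9 (e with acute). Perl stores it as a single
# byte internally, so the internal UTF8 flag is off.
my $string = "\xE9";
my $flag   = utf8::is_utf8($string) ? 1 : 0;

# Encoding the very same string as UTF-8 still needs two bytes.
my $bytes = encode('UTF-8', $string);

printf "flag=%d chars=%d utf8_bytes=%d\n",
    $flag, length($string), length($bytes);
# prints "flag=0 chars=1 utf8_bytes=2"
```

The flag is off, yet the character is still multi-byte in UTF-8, so the flag is not a reliable way to choose an output layer.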
The data you fetch from the server has several levels of encoding:
1. content encodings like compression
2. the charset declared in the HTTP headers
3. the charset specified by the HTML
4. HTML entities like &quot;
The decoded_content takes care of the first two levels, the rest is left for you. To remove entities, you can use the HTML::Entities module. Duh.
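For the charset that the HTML itself declares, one possible approach is sketched below. It assumes you already hold the undecoded bytes of the page; the helper name decode_by_meta, the sample markup, and the regex-based sniffing are all assumptions made for illustration, not something LWP does in this form.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: decode raw HTML bytes using the charset named
# in a <meta> tag, falling back to $fallback if none is usable.
sub decode_by_meta {
    my ($raw, $fallback) = @_;
    my ($name) = $raw =~ /<meta\b[^>]*\bcharset\s*=\s*["']?([\w.:-]+)/i;
    $name = $fallback unless defined $name && find_encoding($name);
    return decode($name, $raw);
}

# Sample page (an assumption): byte 0xEA is U+0119 in ISO-8859-2.
my $raw  = qq{<meta charset="iso-8859-2"><p>dost\xEApu</p>};
my $text = decode_by_meta($raw, 'UTF-8');

printf "U+%04X\n", ord(substr($text, index($text, 'dost') + 4, 1));
# prints "U+0119"
```

A regex is a rough way to read markup; a real HTML parser would be more robust, but the decoding step stays the same.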
use open qw/:std :utf8/; # Apply :utf8 layer to STD{IN,OUT,ERR}
use HTML::Entities; # provides decode_entities
...;
if ($response->is_success) {
my $content = $response->decoded_content;
print decode_entities $content;
}
Note that I cannot verify that this works; the URL you gave 404s (irritatingly, without sending the 404 status code).

Related

How do I avoid double UTF-8 encoding in XML::LibXML

My program receives UTF-8 encoded strings from a data source. I need to tamper with these strings, then output them as part of an XML structure.
When I serialize my XML document, it will be double encoded and thus broken. When I serialize only the root element, it will be fine, but of course lacking the header.
Here's a piece of code trying to visualize the problem:
use strict; use diagnostics; use feature 'unicode_strings';
use utf8; use v5.14; use encoding::warnings;
binmode(STDOUT, ":encoding(UTF-8)"); use open qw( :encoding(UTF-8) :std );
use XML::LibXML;
# Simulate actual data source with a UTF-8 encoded file containing '¿Üßıçñíïì'
open( IN, "<", "./input" ); my $string = <IN>; close( IN ); chomp( $string );
$string = "Value of '" . $string . "' has no meaning";
# create example XML document as <response><result>$string</result></response>
my $xml = XML::LibXML::Document->new( "1.0", "UTF-8" );
my $rsp = $xml->createElement( "response" ); $xml->setDocumentElement( $rsp );
$rsp->appendTextChild( "result", $string );
# Try to forward the resulting XML to a receiver. Using STDOUT here, but files/sockets etc. yield the same results
# This will not warn and be encoded correctly but lack the XML header
print( "Just the root document looks good: '" . $xml->documentElement->serialize() . "'\n" );
# This will include the header but wide chars are mangled
print( $xml->serialize() );
# This will even issue a warning from encoding::warnings
print( "The full document looks mangled: '" . $xml->serialize() . "'\n" );
Spoiler 1: Good case:
<response><result>Value of '¿Üßıçñíïì' has no meaning</result></response>
Spoiler 2: Bad case:
<?xml version="1.0" encoding="UTF-8"?><response><result>Value of '¿ÃÃıçñíïì' has no meaning</result></response>
The root element and its contents are already UTF-8 encoded. XML::LibXML accepts the input and is able to work on it and output it again as valid UTF-8. As soon as I try to serialize the whole XML document, the wide characters inside get mangled. In a hex dump, it looks as if the already UTF-8 encoded string gets passed through a UTF-8 encoder again. I've searched, tried and read a lot, from Perl's own Unicode tutorial all the way through tchrist's great answer to the Why does modern Perl avoid UTF-8 by default? question. I don't think this is a general Unicode problem, though, but rather a specific issue between me and XML::LibXML.
What do I need to do to be able to output a full XML document including the header so that its contents remain correctly encoded? Is there a flag/property/switch to set?
(I'll gladly accept links to the corresponding part(s) of TFM that I should have R for as long as they are actually helpful ;)
ikegami is correct, but he didn't really explain what's wrong. To quote the docs for XML::LibXML::Document:
IMPORTANT: unlike toString for other nodes, on document nodes this function returns the XML as a byte string in the original encoding of the document (see the actualEncoding() method)!
(serialize is just an alias for toString)
When you print a byte string to a filehandle marked with an :encoding layer, each byte is treated as an ISO-8859-1 character and encoded again. Since your string already contains UTF-8 bytes, it gets double encoded.
As ikegami said, use binmode(STDOUT) to remove the encoding layer from STDOUT. You could also decode the result of serialize back into characters before printing it, but that assumes the document is using the same encoding you have set on your output filehandle. (Otherwise, you'll emit an XML document whose actual encoding doesn't match what its header claims.) If you're printing to a file instead of STDOUT, open it with '>:raw' to avoid double encoding.
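The double encoding can be reproduced without XML::LibXML. The following is a minimal sketch using only the core Encode module; calling encode on a byte string stands in for what the :encoding(UTF-8) layer does to the output of serialize. The sample character is taken from the question's test string.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars = "\x{DC}";                  # U+00DC, one character
my $bytes = encode('UTF-8', $chars);   # what serialize() hands back: C3 9C

# Passing bytes through a UTF-8 encoder treats each byte as a
# character in the range 0..255 and encodes it again.
my $double = encode('UTF-8', $bytes);

# Decoding first turns the bytes back into characters, which an
# encoding layer can then encode exactly once.
my $fixed = encode('UTF-8', decode('UTF-8', $bytes));

my $hex = sub { join ' ', map { sprintf '%02X', ord($_) } split //, shift };
print "double: ", $hex->($double), "\n";   # double: C3 83 C2 9C
print "fixed:  ", $hex->($fixed),  "\n";   # fixed:  C3 9C
```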
Since XML documents are parsed without needing any external information, they are binary files rather than text files.
You're telling Perl to encode anything sent to STDOUT[1], but then you proceed to output an XML document to it. You can't apply a character encoding to a binary file as it corrupts it.
Replace
binmode(STDOUT, ":encoding(UTF-8)");
with
binmode(STDOUT);
Note: This assumes the rest of the text you are outputting is just temporary debugging information. The output doesn't otherwise make sense.
[1] In fact, you do this twice: once using use open qw( :encoding(UTF-8) :std );, and then a second time using binmode(STDOUT, ":encoding(UTF-8)");.
I prefer not to change the settings of STDOUT just because toString() behaves differently in XML::LibXML::Document and XML::LibXML::Element.
Instead, I add Encode::encode where it is required. You may run the following example:
use strict;
use warnings FATAL => 'all';
use Encode; # for Encode::encode below
use XML::LibXML;
my ( $doc, $main, $nodelatin, $nodepolish );
$doc = XML::LibXML::Document->createDocument( '1.0', 'UTF-8' );
$main = $doc->createElement('main');
$doc->addChild($main);
$nodelatin = $doc->createElement('latin');
$nodelatin->appendTextNode('Lorem ipsum dolor sit amet');
$main->addChild($nodelatin);
print __LINE__, ' ', $doc->toString(); # printed OK
print __LINE__, ' ', $doc->documentElement()->toString(), "\n\n"; # printed OK
$nodepolish = $doc->createElement('polish');
$nodepolish->appendTextNode('Zażółć gęślą jaźń');
$main->addChild($nodepolish);
print __LINE__, ' ', $doc->toString(); # printed OK
print __LINE__, ' ', Encode::encode("UTF-8", $doc->documentElement()->toString()), "\n"; # printed OK
print __LINE__, ' ', $doc->documentElement()->toString(), "\n"; # Wide character in print

How can I convert a Base64 string to binary using Perl

Is it possible to take a Base64 string and convert it to binary using basic Perl (i.e. just the packages in a standard release, no 3rd party libraries from CPAN)? If so, how?
I came across the module MIME::Base64 which appears to convert from plain text->Base64 and Base64->plain text but I can't seem to find anything to go from Base64 to binary.
-----EDIT-----
It's possible my notion of binary is confused. Essentially, I have a Base64 string passed via an HTML form field. I would like to convert that string into whatever format is necessary to download that file to the user's browser.
From what I understand, if I first print the correct MIME type headers and then print the raw file data, that should work.
You have it backwards. MIME::Base64, like the encoding, only handles bytes. If you have decoded text, you would have to encode it first.
This demonstrates its ability to handle arbitrary bytes:
use MIME::Base64 qw( decode_base64 encode_base64 );
my $expected = join '', map chr, 0x00..0xFF;
my $base64 = encode_base64($expected);
print($base64);
my $got = decode_base64($base64);
print($got eq $expected ? "ok" : "error", "\n");
AAECAwQFBgcICQoLDA0ODxAREhMUFRYXGBkaGxwdHh8gISIjJCUmJygpKissLS4vMDEyMzQ1Njc4
OTo7PD0+P0BBQkNERUZHSElKS0xNTk9QUVJTVFVWV1hZWltcXV5fYGFiY2RlZmdoaWprbG1ub3Bx
cnN0dXZ3eHl6e3x9fn+AgYKDhIWGh4iJiouMjY6PkJGSk5SVlpeYmZqbnJ2en6ChoqOkpaanqKmq
q6ytrq+wsbKztLW2t7i5uru8vb6/wMHCw8TFxsfIycrLzM3Oz9DR0tPU1dbX2Nna29zd3t/g4eLj
5OXm5+jp6uvs7e7v8PHy8/T19vf4+fr7/P3+/w==
ok
This demonstrates its inability to handle text that hasn't first been encoded into bytes:
use MIME::Base64 qw( encode_base64 );
encode_base64("\x{2660}");
print("ok\n");
Wide character in subroutine entry at a.pl line 2.
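For the download scenario described in the question, a minimal sketch might look like the following. The Base64 value (the 8-byte PNG signature), the header values, and the in-memory filehandle are assumptions made so the example is self-contained; in a CGI script you would print to STDOUT after calling binmode(STDOUT).

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);

# Assumed form field value: Base64 of the 8-byte PNG signature.
my $field  = 'iVBORw0KGgo=';
my $binary = decode_base64($field);   # a string of raw bytes

# Write headers and body through a :raw handle so that no encoding
# or newline translation touches the bytes. The in-memory handle
# stands in for STDOUT here.
open my $out, '>:raw', \my $buffer or die "Cannot open buffer: $!";
print {$out} "Content-Type: application/octet-stream\r\n",
             "Content-Length: ", length($binary), "\r\n\r\n",
             $binary;
close $out;

printf "decoded %d bytes\n", length($binary);   # decoded 8 bytes
```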

perl uri_escape_utf8 with arabic

I am trying to escape some Arabic to LWP::UserAgent. I am testing this with a script below:
my $files = "/home/root/temp.txt";
unlink ($files);
open (OUTFILE, '>>', $files);
my $text = "ضثصثضصثشس";
print OUTFILE uri_escape_utf8($text)."\n";
close (OUTFILE);
However, this seems to cause the following:
%C3%96%C3%8B%C3%95%C3%8B%C3%96%C3%95%C3%8B%C3%94%C3%93
which is not correct. Any pointers to what I need to do in order to escape this correctly?
Thank you for your help in advance.
Regards,
Olli
Perl considers your source file to be encoded as Latin-1 until you tell it to use utf8. If we do that, the string "ضثصثضصثشس" does not contain some jumbled bytes, but is rather a string of codepoints.
The uri_escape_utf8 expects a string of codepoints (not bytes!), encodes them, and then URI-escapes them. Ergo, the correct thing to do is
use utf8;
use URI::Escape;
print uri_escape_utf8("ضثصثضصثشس"), "\n";
Output: %D8%B6%D8%AB%D8%B5%D8%AB%D8%B6%D8%B5%D8%AB%D8%B4%D8%B3
If we fail to use utf8, then uri_escape_utf8 gets a string of bytes (which are accidentally encoded in UTF8), so we should have used uri_escape:
die "This is the wrong way to do it";
use URI::Escape;
print uri_escape("ضثصثضصثشس"), "\n";
which produces the same output as above – but only by accident.
Using uri_escape_utf8 with a bytestring (that would decode to Arabic characters) produces the totally wrong
%C3%98%C2%B6%C3%98%C2%AB%C3%98%C2%B5%C3%98%C2%AB%C3%98%C2%B6%C3%98%C2%B5%C3%98%C2%AB%C3%98%C2%B4%C3%98%C2%B3
because this effectively double-encodes the data. It is the same as
use utf8;
use URI::Escape;
use Encode;
print uri_escape(encode "utf8", encode "utf8", "ضثصثضصثشس"), "\n";
Edit: So you used CP-1256, which is a non-portable single byte encoding. It is unable to encode arbitrary Unicode characters, and should therefore be avoided along with other pre-Unicode encodings. You didn't declare your encoding, so perl thinks you meant Latin-1. This means that what you saw as "ضثصثضصثشس" was actually the byte stream D6 CB D5 CB D6 D5 CB D4 D3, which Latin-1 decodes to the unrelated characters ÖËÕËÖÕËÔÓ.
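If the source really is CP-1256, the bytes can be decoded explicitly before escaping. This is a minimal sketch using only the core Encode module; the manual percent-escaping stands in for uri_escape so that the example has no CPAN dependency, and it is only equivalent here because every byte in the sample is non-ASCII.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The byte stream from the question, written out explicitly.
my $raw = "\xD6\xCB\xD5\xCB\xD6\xD5\xCB\xD4\xD3";

# Decode with the encoding the bytes are actually in.
my $text = decode('cp1256', $raw);

# Encode to UTF-8, then percent-escape every byte.
my $escaped = join '',
    map { sprintf '%%%02X', ord($_) } split //, encode('UTF-8', $text);

print "$escaped\n";
# prints "%D8%B6%D8%AB%D8%B5%D8%AB%D8%B6%D8%B5%D8%AB%D8%B4%D8%B3"
```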
Edit: So you want to decode command line arguments. The Encode::Locale module should manage this. Before accessing any parameters from @ARGV, do
use Encode::Locale;
decode_argv(Encode::FB_CROAK); # possibly: BEGIN { decode_argv(...) }
or use the locale pseudoencoding which it provides:
use Encode; # decode() comes from Encode
my $decoded_string = decode("locale", $some_binary_data);
Use this as a part in the overall strategy of decoding all input, and always encoding your output.

How to let File::Queue to be able to process utf8 strings in perl?

I'm processing some data from XML files in perl and wanna use the FIFO File::Queue to divide and speed up the process.
One perl script parses the XML file and prepares JSON output for another script:
#!/usr/bin/perl -w
binmode STDOUT, ":utf8";
use utf8;
use strict;
use XML::Rules;
use JSON;
use File::Queue;
#do the XML magic: %data contains result
my $q = new File::Queue (File => './importqueue', Mode => 0666);
my $json = new JSON;
my $qItem = $json->allow_nonref->encode(\%data);
$q->enq($qItem);
As long as %data contains numeric and a-z data only, this works fine. But when one of the wide chars occurs (e.g. ł, ą, ś, ż etc.) I'm getting: Wide character in syswrite at /usr/lib/perl/5.10/IO/Handle.pm line 207.
I have tried to check if the string is valid utf8:
print utf8::is_utf8($qItem). ':' . utf8::valid($qItem)
and I did get 1:1 - so yes I do have the correct utf8 string.
I have found out that the reason could be that syswrite gets a filehandle for the queue file that is not marked as :utf8.
Am I right? If so, is there any way to force File::Queue to use a :utf8 filehandle?
Maybe File::Queue is not the best choice - should I use something else to create a FIFO queue between two perl scripts?
utf8::is_utf8 does not tell you whether your string is encoded using UTF-8 or not. (That information is not even available.)
>perl -MEncode -E"say utf8::is_utf8(encode_utf8(chr(0xE9))) || 0"
0
utf8::valid does not tell you whether your string is valid UTF-8 or not.
>perl -MEncode -E"say utf8::valid(qq{\xE9}) || 0"
1
Both check some internal storage details. You should never have a need for either.
File::Queue can only transmit strings of bytes. It's up to you to serialise the data you want to transmit into a string.
The primary means of serialising text is character encoding, or just encoding for short. UTF-8 is a character encoding.
For example, the string
dostępu
consists of the following chars (each a Unicode code point):
64 6F 73 74 119 70 75
Not all of those chars fit in bytes, so the string can't be sent using File::Queue. If you were to encode that string using UTF-8, you'd get a string composed of the following chars:
64 6F 73 74 C4 99 70 75
Those chars fit in bytes, so that string can be sent using File::Queue.
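The two sequences above can be reproduced with the core Encode module. This is a minimal sketch; the ę is written as \x{119} so the example does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $text = "dost\x{119}pu";

# Code points of the character string (one is above 0xFF).
my $code_points = join ' ',
    map { sprintf '%X', ord($_) } split //, $text;

# The same string after UTF-8 encoding: every element fits in a byte.
my $utf8_bytes = join ' ',
    map { sprintf '%02X', ord($_) } split //, encode_utf8($text);

print "$code_points\n";   # 64 6F 73 74 119 70 75
print "$utf8_bytes\n";    # 64 6F 73 74 C4 99 70 75
```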
JSON, as you used it, returns strings of Unicode code points. As such, you need to apply a character encoding.
File::Queue doesn't provide an option to automatically encode strings for you, so you'll have to do it yourself.
You could use encode_utf8 and decode_utf8 from the Encode module:
my $json = JSON->new->allow_nonref;
$q->enq(encode_utf8($json->encode(\%data)));
my $data = $json->decode(decode_utf8($q->deq()));
or you can let JSON do the encoding/decoding for you.
my $json = JSON->new->utf8->allow_nonref;
$q->enq($json->encode(\%data));
my $data = $json->decode($q->deq());
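The difference the utf8 option makes can be checked without a queue. The sketch below uses the core JSON::PP module, on the assumption that its new, utf8, allow_nonref, encode and decode methods behave like those of the JSON module used in the question; the sample data is made up.

```perl
use strict;
use warnings;
use JSON::PP;

my %data = (word => "dost\x{119}pu");

# Without ->utf8 the result is a string of characters.
my $as_chars = JSON::PP->new->allow_nonref->encode(\%data);
# With ->utf8 the result is a string of UTF-8 bytes.
my $as_bytes = JSON::PP->new->utf8->allow_nonref->encode(\%data);

# Count elements that do not fit in a byte.
my $wide_in_chars = grep { ord($_) > 255 } split //, $as_chars;
my $wide_in_bytes = grep { ord($_) > 255 } split //, $as_bytes;

# The byte form round-trips through the matching ->utf8 decoder.
my $back = JSON::PP->new->utf8->allow_nonref->decode($as_bytes);

printf "wide without utf8: %d, wide with utf8: %d\n",
    $wide_in_chars, $wide_in_bytes;
# prints "wide without utf8: 1, wide with utf8: 0"
```

Only the byte form is safe to hand to something that calls syswrite on a plain filehandle.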
Looking at the docs.....
perldoc -f syswrite
WARNING: If the filehandle is marked ":utf8", Unicode
characters encoded in UTF-8 are written instead of bytes, and
the LENGTH, OFFSET, and return value of syswrite() are in
(UTF8-encoded Unicode) characters. The ":encoding(...)" layer
implicitly introduces the ":utf8" layer. Alternately, if the
handle is not marked with an encoding but you attempt to write
characters with code points over 255, raises an exception. See
"binmode", "open", and the "open" pragma, open.
man 3perl open
use open OUT => ':utf8';
...
with the "OUT" subpragma you can declare the default
layers of output streams. With the "IO" subpragma you can control
both input and output streams simultaneously.
So I'd guess that adding use open OUT => ':utf8' to the top of your program would help.

Perl: String literal in module in latin1 - I want utf8

In the Date::Holidays::DK module, the names of certain Danish holidays are written in Latin1 encoding. For example, January 1st is 'Nytårsdag'. What should I do to $x below in order to get a proper utf8-encoded string?
use Date::Holidays::DK;
my $x = is_dk_holiday(2011,1,1);
I tried various combinations of use utf8 and no utf8 before/after use Date::Holidays::DK, but it does not seem to have any effect. I also tried to use Encode's decode, with no luck. More specifically,
use Date::Holidays::DK;
use Encode;
use Devel::Peek;
my $x = decode("iso-8859-1",
is_dk_holiday(2011,1,1)
);
Dump($x);
print "January 1st is '$x'\n";
gives the output
SV = PV(0x15eabe8) at 0x1492a10
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x1593710 "Nyt\303\245rsdag"\0 [UTF8 "Nyt\x{e5}rsdag"]
CUR = 10
LEN = 16
January 1st is 'Nyt sdag'
(with an invalid character between t and s).
"... use utf8 and no utf8 before/after use Date::Holidays::DK, but it does not seem to have any effect."
Correct. The utf8 pragma only indicates that the source code of the program is written in UTF-8.
"I also tried to use Encode's decode, with no luck."
You did not perceive this correctly; you in fact did the right thing. You now have a string of Perl characters and can manipulate it.
"(with an invalid character between t and s)"
You are misreading this as well; it is in fact the å character.
You want to output UTF-8, so you are lacking the encoding step.
my $octets = encode 'UTF-8', $x;
print $octets;
Please read http://p3rl.org/UNI for the introduction to the topic of encoding. You always must decode and encode, either explicitly or implicitly.
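The decode and encode steps can be tried without installing Date::Holidays::DK. In this minimal sketch the Latin-1 bytes that the module returns are written out by hand, which is an assumption based on the Dump output in the question.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for is_dk_holiday(2011,1,1): "Nytaarsdag" with the
# a-ring as the single Latin-1 byte E5.
my $latin1 = "Nyt\xE5rsdag";

my $text   = decode('iso-8859-1', $latin1);   # 9 characters
my $octets = encode('UTF-8', $text);          # 10 bytes, a-ring is C3 A5

printf "%d characters, %d UTF-8 bytes\n", length($text), length($octets);
# prints "9 characters, 10 UTF-8 bytes"
```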
use utf8 is only a hint to the perl interpreter/compiler that your source file is UTF-8 encoded. If you have string literals with high-bit bytes, it will automatically decode them into Unicode characters.
If you have a variable that is encoded in iso-8859-1, you must decode it. Then your variable is in the internal Unicode format. That happens to be utf8, but you shouldn't care which encoding perl uses internally.
Now if you want to print such a string, you need to convert the Unicode string back to a byte string by calling encode on it. If you don't encode manually, perl itself will encode it back to iso-8859-1. This is the default encoding.
Before you print your variable $x, you need to do a $x = encode('UTF-8', $x) on it.
For correct handling of UTF-8 you always need to decode() every external input over I/O. And you always need to encode() everything that leaves your program.
To change the default input/output encoding you can use something like this.
use utf8;
use open ':encoding(UTF-8)';
use open ':std';
The first line says that your source code is encoded in utf8. The second line says that every input/output handle should automatically decode and encode as utf8. It is important to notice that open() then also opens files in utf8 mode. If you work with binary files, you need to call binmode() on the handle.
But the second line does not change handling of STDIN,STDOUT or STDERR. The third line will change that.
You can probably use the module utf8::all, which makes this process easier. But it is always good to understand how all this works behind the scenes.
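What an output layer does can be observed with an in-memory filehandle. This is a minimal sketch; the explicit '>:encoding(UTF-8)' mode is used here in place of the open pragma so that the effect is visible on a single handle, and the sample string is an assumption.

```perl
use strict;
use warnings;

# A character string: the a-ring is the single character U+00E5.
my $text = "Nyt\x{E5}rsdag";

# The layer encodes characters to UTF-8 bytes on the way out.
open my $out, '>:encoding(UTF-8)', \my $buffer
    or die "Cannot open in-memory handle: $!";
print {$out} $text;
close $out;

# $buffer now holds the encoded bytes.
printf "%d characters in, %d bytes out\n", length($text), length($buffer);
# prints "9 characters in, 10 bytes out"
```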
To correct your example, one possible way is this:
#!/usr/bin/env perl
use Date::Holidays::DK;
use Encode;
use Devel::Peek;
my $x = decode("iso-8859-1",
is_dk_holiday(2011,1,1)
);
Dump($x);
print encode("UTF-8", "January 1st is '$x'\n");