How do I avoid double UTF-8 encoding in XML::LibXML - perl

My program receives UTF-8 encoded strings from a data source. I need to tamper with these strings, then output them as part of an XML structure.
When I serialize my XML document, it will be double encoded and thus broken. When I serialize only the root element, it will be fine, but of course lacking the header.
Here's a piece of code trying to visualize the problem:
use strict; use diagnostics; use feature 'unicode_strings';
use utf8; use v5.14; use encoding::warnings;
binmode(STDOUT, ":encoding(UTF-8)"); use open qw( :encoding(UTF-8) :std );
use XML::LibXML;
# Simulate actual data source with a UTF-8 encoded file containing '¿Üßıçñíïì'
open( IN, "<", "./input" ) or die "Cannot open ./input: $!"; my $string = <IN>; close( IN ); chomp( $string );
$string = "Value of '" . $string . "' has no meaning";
# create example XML document as <response><result>$string</result></response>
my $xml = XML::LibXML::Document->new( "1.0", "UTF-8" );
my $rsp = $xml->createElement( "response" ); $xml->setDocumentElement( $rsp );
$rsp->appendTextChild( "result", $string );
# Try to forward the resulting XML to a receiver. Using STDOUT here, but files/sockets etc. yield the same results
# This will not warn and be encoded correctly but lack the XML header
print( "Just the root document looks good: '" . $xml->documentElement->serialize() . "'\n" );
# This will include the header but wide chars are mangled
print( $xml->serialize() );
# This will even issue a warning from encoding::warnings
print( "The full document looks mangled: '" . $xml->serialize() . "'\n" );
Spoiler 1: Good case:
<response><result>Value of '¿Üßıçñíïì' has no meaning</result></response>
Spoiler 2: Bad case:
<?xml version="1.0" encoding="UTF-8"?><response><result>Value of '¿ÃÃıçñíïì' has no meaning</result></response>
The root element and its contents are already UTF-8 encoded. XML::LibXML accepts the input and is able to work on it and output it again as valid UTF-8. As soon as I try to serialize the whole XML document, the wide characters inside get mangled. In a hex dump, it looks as if the already UTF-8 encoded string gets passed through a UTF-8 encoder again. I've searched, tried and read a lot, from Perl's own Unicode tutorial all the way through tchrist's great answer to the Why does modern Perl avoid UTF-8 by default? question. I don't think this is a general Unicode problem, though, but rather a specific issue between me and XML::LibXML.
What do I need to do to be able to output a full XML document including the header so that its contents remain correctly encoded? Is there a flag/property/switch to set?
(I'll gladly accept links to the corresponding part(s) of TFM that I should have read, as long as they are actually helpful ;)

ikegami is correct, but he didn't really explain what's wrong. To quote the docs for XML::LibXML::Document:
IMPORTANT: unlike toString for other nodes, on document nodes this function returns the XML as a byte string in the original encoding of the document (see the actualEncoding() method)!
(serialize is just an alias for toString)
When you print a byte string to a filehandle marked with an :encoding layer, it gets encoded as if it were ISO-8859-1. Since you have a string containing UTF-8 bytes, it gets double encoded.
As ikegami said, use binmode(STDOUT) to remove the encoding layer from STDOUT. You could also decode the result of serialize back into characters before printing it, but that assumes the document is using the same encoding you have set on your output filehandle. (Otherwise, you'll emit a XML document whose actual encoding doesn't match what its header claims.) If you're printing to a file instead of STDOUT, open it with '>:raw' to avoid double encoding.
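For reference, here is a minimal sketch of both options against the $xml object from the question; it assumes the document really was created with encoding UTF-8, as in the example above. In a real program you would pick one of the two; the sketch just shows both in sequence.
use Encode qw( decode );
# Option 1: drop the :encoding layer and print the byte string untouched.
binmode(STDOUT);
print $xml->serialize();
# Option 2: decode the byte string back into characters and print it through a handle carrying an :encoding(UTF-8) layer.
binmode(STDOUT, ':encoding(UTF-8)');
print decode('UTF-8', $xml->serialize());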

Since XML documents carry their own encoding information and are parsed without needing any external information, they are binary files rather than text files.
You're telling Perl to encode anything sent to STDOUT[1], but then you proceed to output an XML document to it. You can't apply a character encoding to a binary file as it corrupts it.
Replace
binmode(STDOUT, ":encoding(UTF-8)");
with
binmode(STDOUT);
Note: This assumes the rest of the text you are outputting is just temporary debugging information. The output doesn't otherwise make sense.
[1] In fact, you do this twice! Once using use open qw( :encoding(UTF-8) :std );, and then a second time using binmode(STDOUT, ":encoding(UTF-8)");.
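If you want to see which layers actually ended up on a handle, you can inspect them; PerlIO::get_layers() is built into perl, and the exact list it reports varies by platform and perl version:
print join( ' ', PerlIO::get_layers(STDOUT) ), "\n";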

I don't like changing the settings of STDOUT just because of the specific behaviour of toString() in the XML::LibXML::Document and XML::LibXML::Element modules.
So I prefer to add Encode::encode where it is required. You can run the following example:
use strict;
use warnings FATAL => 'all';
use XML::LibXML;
use Encode;
my ( $doc, $main, $nodelatin, $nodepolish );
$doc = XML::LibXML::Document->createDocument( '1.0', 'UTF-8' );
$main = $doc->createElement('main');
$doc->addChild($main);
$nodelatin = $doc->createElement('latin');
$nodelatin->appendTextNode('Lorem ipsum dolor sit amet');
$main->addChild($nodelatin);
print __LINE__, ' ', $doc->toString(); # printed OK
print __LINE__, ' ', $doc->documentElement()->toString(), "\n\n"; # printed OK
$nodepolish = $doc->createElement('polish');
$nodepolish->appendTextNode('Zażółć gęślą jaźń');
$main->addChild($nodepolish);
print __LINE__, ' ', $doc->toString(); # printed OK
print __LINE__, ' ', Encode::encode("UTF-8", $doc->documentElement()->toString()), "\n"; # printed OK
print __LINE__, ' ', $doc->documentElement()->toString(), "\n"; # Wide character in print

Related

Perl drop down menus and Unicode

I've been going around on this for some time now and can't quite get it. This is Perl 5 on Ubuntu. I have a drop down list on my web page:
$output .= start_form . "Student: " . popup_menu(-name=>'student', -values=>['', @students], -labels=>\%labels, -onChange=>'Javascript:submit()') . end_form;
It's just a set of names in the form "Last, First" that are coming from a SQL Server table. The labels are created from the SQL columns like so:
$labels{uc($record->{'id'})} = $record->{'lastname'} . ", " . $record->{'firstname'};
The issue is that the drop down isn't displaying some Unicode characters correctly. For instance, "Søren" shows up in the drop down as "SÃ¸ren". I have in my header:
use utf8;
binmode(STDOUT, ":utf8");
...and I've also played around with various takes on the "decode( )" function, to no avail. To me, the funny thing is that if I pull $labels into a test script and print the list to the console, the names appear just fine! So what is it about the drop down that is causing this? Thank you in advance.
EDIT:
This is the relevant functionality, which I've stripped down to this script that runs in the console and yields the correct results for three entries that have Unicode characters:
#!/usr/bin/perl
use DBI;
use lib '/home/web/library';
use mssql_util;
use Encode;
binmode(STDOUT, ":utf8");
$query = "[SQL query here]";
$dbh = &connect;
$sth = $dbh->prepare($query);
$result = $sth->execute();
while ($record = $sth->fetchrow_hashref())
{
    if ($record->{'id'})
    {
        $labels{uc($record->{'id'})} = Encode::decode('UTF-8', $record->{'lastname'} . ", " . $record->{'nickname'} . " (" . $record->{'entryid'} . ")");
    }
}
$sth->finish();
print "$labels{'ST123'}\n";
print "$labels{'ST456'}\n";
print "$labels{'ST789'}\n";
The difference in what the production script is doing is that instead of printing to the console like above, it's printing to HTTP:
$my_output = "<p>$labels{'ST123'}</p><br>
<p>$labels{'ST456'}</p><br>
<p>$labels{'ST789'}</p>";
$template =~ s/\$body/$my_output/;
print header(-cookie=>$cookie) . $template;
This gives, i.e., strings like "ZoÃ«" and "SÃ¸ren" on the page. BUT, if I remove binmode(STDOUT, ":utf8"); from the top of the production script, then the strings appear just fine on the page (i.e. I get "Zoë" and "Søren").
I believe that the binmode( ) line is necessary when writing UTF-8 to output, and yet removing it here produces the correct results. What gives?
Problem #1: Decoding inputs
53.C3.B8.72.65.6E is the UTF-8 encoding for Søren. When you instruct Perl to encode it all over again (by printing it to handle with the :utf8 layer), you are producing garbage.
You need to decode your inputs ($record->{id}, $record->{lastname}, $record->{firstname}, etc)! This will transform the UTF-8 bytes 53.C3.B8.72.65.6E ("encoded text") into the Unicode code points 53.F8.72.65.6E ("decoded text").
In this form, you will be able to use uc, regex matches, etc. You will also be able to print them out to a handle with an encoding layer (e.g. :encoding(UTF-8), or the improper :utf8).
You let on that these inputs come from a database. Most DBDs have a flag that causes strings to be decoded for you. For example, if it's a MySQL database, you should pass mysql_enable_utf8mb4 => 1 to connect.
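If your driver doesn't offer such a flag, here is a minimal sketch of decoding the columns by hand; the column names come from your question, and it assumes the database really hands back UTF-8 bytes:
use Encode qw( decode );
while ( my $record = $sth->fetchrow_hashref() ) {
    # Decode each text column from UTF-8 bytes into Perl characters.
    for my $field (qw( id lastname firstname )) {
        $record->{$field} = decode( 'UTF-8', $record->{$field} ) if defined $record->{$field};
    }
    $labels{ uc $record->{id} } = "$record->{lastname}, $record->{firstname}";
}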
Problem #2: Communicating encoding
If you're going to output UTF-8, don't tell the browser it's ISO-8859-1!
$ perl -e'use CGI qw( :standard ); print header()'
Content-Type: text/html; charset=ISO-8859-1
Fixed:
$ perl -e'use CGI qw( :standard ); print header( -type => "text/html; charset=UTF-8" )'
Content-Type: text/html; charset=UTF-8
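As far as I recall, CGI.pm's header() also accepts a dedicated -charset parameter that produces the same result:
$ perl -e'use CGI qw( :standard ); print header( -charset => "UTF-8" )'
Content-Type: text/html; charset=UTF-8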
Hard to give a definitive solution as you don't give us much useful information. But here are some pointers that might help.
use utf8 only tells Perl that your source code is encoded as UTF-8. It does nothing useful here.
Reading perldoc perlunitut would be a good start.
Do you know how your database tables are encoded?
Do you know whether your database connection is configured to automatically decode data coming from the database into Perl characters?
What encoding are you telling the browser that you have encoded your HTTP response in?

How to clear non-utf characters while reading a utf-8 file in Perl?

I am parsing a very large log file with Perl.
The code is:
open(my $input_handle, '<:encoding(UTF-8)', $input_file) or die "Cannot open $input_file: $!";
while (<$input_handle>) {
    ...
}
close($input_handle);
However, sometimes the log file contains faulty characters, and I get the following message:
utf8 "\xD0" does not map to Unicode at log_parser.pl line 32, <$input_handle> line 10920.
I am aware of the characters and I would just like to ignore them without the log message flooding my (Windows!) build server logs. I tried no warnings 'utf8'; but it did not help.
How can I suppress the message?
You could do the decoding yourself instead of using the :encoding layer. By default, Encode's decode and decode_utf8 simply replace each bad character with U+FFFD (the Unicode replacement character) rather than warning.
$ perl -e'
use Encode qw( decode_utf8 );
$bytes = "\xD0 \x92 \xD0\x92\n";
$text = decode_utf8($bytes);
printf("U+%v04X\n", $text);
'
U+FFFD.0020.FFFD.0020.0412.000A
If the file is a mix of UTF-8, iso-8859-1 and cp1252, it may be possible to fix the file rather than simply silencing the errors, as detailed here.
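Applied to the read loop from the question, that approach would look roughly like this (a sketch only: read raw bytes and decode each line yourself, so malformed sequences silently become U+FFFD):
use Encode qw( decode_utf8 );
open( my $input_handle, '<:raw', $input_file ) or die "Cannot open $input_file: $!";
while ( my $line = <$input_handle> ) {
    $line = decode_utf8($line);   # bad bytes become U+FFFD without a warning
    # ... existing parsing logic here ...
}
close($input_handle);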

perl uri_escape_utf8 with arabic

I am trying to escape some Arabic text before passing it to LWP::UserAgent. I am testing this with the script below:
use URI::Escape;
my $files = "/home/root/temp.txt";
unlink ($files);
open (OUTFILE, '>>', $files) or die "Cannot open $files: $!";
my $text = "ضثصثضصثشس";
print OUTFILE uri_escape_utf8($text)."\n";
close (OUTFILE);
However, this seems to cause the following:
%C3%96%C3%8B%C3%95%C3%8B%C3%96%C3%95%C3%8B%C3%94%C3%93
which is not correct. Any pointers to what I need to do in order to escape this correctly?
Thank you for your help in advance.
Regards,
Olli
Perl considers your source file to be encoded as Latin-1 until you tell it to use utf8. If we do that, the string "ضثصثضصثشس" does not contain some jumbled bytes, but is rather a string of codepoints.
The uri_escape_utf8 function expects a string of codepoints (not bytes!), encodes them as UTF-8, and then URI-escapes them. Ergo, the correct thing to do is
use utf8;
use URI::Escape;
print uri_escape_utf8("ضثصثضصثشس"), "\n";
Output: %D8%B6%D8%AB%D8%B5%D8%AB%D8%B6%D8%B5%D8%AB%D8%B4%D8%B3
If we fail to use utf8, then uri_escape_utf8 gets a string of bytes (which are accidentally encoded in UTF8), so we should have used uri_escape:
die "This is the wrong way to do it";
use URI::Escape;
print uri_escape("ضثصثضصثشس"), "\n";
which produces the same output as above – but only by accident.
Using uri_escape_utf8 with a byte string (that would decode to Arabic characters) produces the totally wrong
%C3%98%C2%B6%C3%98%C2%AB%C3%98%C2%B5%C3%98%C2%AB%C3%98%C2%B6%C3%98%C2%B5%C3%98%C2%AB%C3%98%C2%B4%C3%98%C2%B3
because this effectively double-encodes the data. It is the same as
use utf8;
use URI::Escape;
use Encode;
print uri_escape(encode "utf8", encode "utf8", "ضثصثضصثشس"), "\n";
Edit: So you used CP-1256, which is a non-portable single byte encoding. It is unable to encode arbitrary Unicode characters, and should therefore be avoided along with other pre-Unicode encodings. You didn't declare your encoding, so perl thinks you meant Latin-1. This means that what you saw as "ضثصثضصثشس" was actually the byte stream D6 CB D5 CB D6 D5 CB D4 D3, which decodes to some unprintable junk in Latin-1.
Edit: So you want to decode command line arguments. The Encode::Locale module should manage this. Before accessing any parameters from @ARGV, do
use Encode::Locale;
decode_argv(Encode::FB_CROAK); # possibly: BEGIN { decode_argv(...) }
or use the locale pseudoencoding which it provides:
my $decoded_string = decode "locale", $some_binary_data;
Use this as a part in the overall strategy of decoding all input, and always encoding your output.
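Putting the pieces together, here is a rough sketch that decodes one command line argument and URI-escapes it (the argument itself is made up for illustration):
use Encode::Locale;
use Encode qw( decode );
use URI::Escape qw( uri_escape_utf8 );
# Decode the raw bytes from the command line into characters, then escape them.
my $text = decode( 'locale', $ARGV[0] // '' );
print uri_escape_utf8($text), "\n";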

Print other language character in csv using perl file handling

I am scraping a German-language site and trying to store its content in a CSV file using Perl, but I am getting garbage values in the CSV. The code I use is:
open my $fh, '>> :encoding(UTF-8)', 'output.csv' or die "Cannot open output.csv: $!";
print {$fh} qq|"$title"\n|;
close $fh;
For example: I expect Weiß, Römersandalen, but I get WeiÃŸ, RÃ¶mersandalen.
Update :
Code
use strict;
use warnings;
use utf8;
use WWW::Mechanize::Firefox;
use autodie qw(:all);
my $m = WWW::Mechanize::Firefox->new();
print "\n\n *******Program Begins********\n\n";
$m->get($url) or die "unable to get $url";
my $Home_Con=$m->content;
my $title='';
if($Home_Con=~m/<span id="btAsinTitle">([^<]*?)<\/span>/is){
    $title=$1;
    print "title ::$1\n";
}
open my $fh, '>> :encoding(UTF-8)', 's.txt'; #<= (Weiß)
print {$fh} qq|"$title"\n|;
close $fh;
open $fh, '>> :encoding(UTF-8)', 's1.csv'; #<= (Weiß)
print {$fh} qq|"$title"\n|;
close $fh;
print "\n\n *******Program ends********";
<>;
This is the part of code. The method works fine in text files, but not in csv.
You've shown us the code where you're encoding the data correctly as you write it to the file.
What we also need to see is how the data gets into your program. Are you decoding it correctly at that point?
Update:
If the code was really just my $title='Weiß ,Römersandalen' as you say in the comments, then the solution would be as simple as adding use utf8 to your code.
The point is that Perl needs to know how to interpret the stream of bytes that it's dealing with. Outside your program, data exists as bytes in various encodings. You need to decode that data as it enters your program (decoding turns a stream of bytes into a string of characters) and encode it again as it leaves your program. You're doing the encoding step correctly, but not the decoding step.
The reason that use utf8 fixes that in the simple example you've given is that use utf8 tells Perl that your source code should be interpreted as a stream of bytes encoded as utf8. It then converts that stream of bytes into a string of characters containing the correct characters for 'Weiß ,Römersandalen'. It can then successfully encode those characters into bytes representing those characters encoded as utf8 as they are written to the file.
Your data is actually coming from a web page. I assume you're using LWP::Simple or something like that. That data might be encoded as utf8 (I doubt it, given the problems you're having) but it might also be encoded as ISO-8859-1 or ISO-8859-9 or CP1252 or any number of other encodings. Unless you know what the encoding is and correctly decode the incoming data, you will see the results that you are getting.
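For illustration only: if the content really does arrive as raw UTF-8 bytes (check the documentation of your module and the Content-Type header of the site), the missing decoding step would look something like this:
use Encode qw( decode );
# Decode the fetched bytes into characters before matching; the
# :encoding(UTF-8) layer on the output handle then re-encodes them correctly.
my $Home_Con = decode( 'UTF-8', $m->content );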
Check if there are any weird characters at start or anywhere in the file using commands like head or tail

How to avoid wide characters in LWP::UserAgent?

I am trying to download contents (formulas) of a web page in Perl. I have used "LWP::UserAgent" module to parse the content and taken care to check for UTF8 format. The code is as follows:
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
my $response = $ua->get('http://www.abc.org/patent/formulae');
my $content = $response->decoded_content();
if (utf8::is_utf8($content))
{
    binmode STDOUT, ':utf8';
}
else
{
    binmode STDOUT, ':raw';
}
print $content;
But I still get wide characters & the output is as follows:
"Formula =
Ï
Ì
â¡
(
c
+
/
c
0
)
â
1
"
Whereas I want:
"Fromula = Ï Ì â¡ ( c + / c 0 ) â 1 "
How can we avoid that?
The decoded_content uses encoding and charset information available in the HTTP header to decode your data. However, HTML files may specify a different encoding.
If you want your output file to be utf8, you should always apply the :utf8 layer. What you are trying to do with is_utf8 is wrong.
Perl strings are internally stored with two different encodings. This is absolutely irrelevant to you, the programmer. The is_utf8 just reads the value of an internal flag that determines this internal representation. Just because this flag isn't set doesn't mean that one codepoint in your string may not be encoded as multiple bytes when encoded as utf8.
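To illustrate how little the flag means, here is a tiny sketch; the values in the comments are what a typical perl reports:
use Encode qw( decode );
my $chars = decode( 'UTF-8', "caf\xC3\xA9" );   # the 4-character string "café"
printf "is_utf8=%d length=%d\n", (utf8::is_utf8($chars) ? 1 : 0), length $chars;   # 1 4
utf8::downgrade($chars);   # changes only the internal storage, not the string itself
printf "is_utf8=%d length=%d\n", (utf8::is_utf8($chars) ? 1 : 0), length $chars;   # 0 4
The string is identical before and after, so the flag tells you nothing about whether your data has been decoded.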
The data you fetch from the server has various levels of encodings:
encodings like compression
charsets
the charset specified by the HTML
HTML entities like &quot;
The decoded_content takes care of the first two levels; the rest is left for you. To remove entities, you can use the HTML::Entities module.
use open qw/:std :utf8/; # Apply :utf8 layer to STD{IN,OUT,ERR}
use HTML::Entities qw( decode_entities );
...;
if ($response->is_success) {
    my $content = $response->decoded_content;
    print decode_entities $content;
}
Note that I cannot verify that this works; the URL you gave 404s (irritatingly, without sending the 404 status code).