Encode module and inverted commas - perl

I am scraping a web page and extracting a specific section from it. That section includes inverted commas (’, character 146). I'm trying to print my extracted data to a text file, but it's giving me â€™ instead of the inverted comma. I have tried the following:
$content =~ s/’/'/g;
my $invComma = chr 146;
$content =~ s/$invComma/'/g;
$content =~ s/\x{0092}/'/g;
None of it has worked. I can't decode('UTF-8', $content) because it has wide characters. When I try to encode('UTF-8', $content) the ’ changes to â€™ instead. I have already tried use utf8 as well, to no effect.
I know that my text file viewer can display inverted commas, because I printed one to a test file and opened it. The problem is therefore in my script.
What am I doing wrong, and how do I fix it?
UPDATE: I am able to do $content =~ s/’/'/g to replace it with a simple apostrophe, but I still don't know why nothing else works. I'd also like a fix that actually solves the problem, instead of just solving one of the symptoms.
UPDATE 2: I have been informed by hobbs that the character is actually U+2019 RIGHT SINGLE QUOTATION MARK, and I have changed my regex to use chr 0x2019, which now works.

The character you're trying to replace is only 0x92 / 146 in the Windows-1252 encoding. Perl uses Unicode, where that character is U+2019 RIGHT SINGLE QUOTATION MARK, aka "\x{2019}", chr(0x2019), or chr(8217).
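A quick way to convince yourself of that mapping (a small standalone sketch, not part of the original script):
use Encode qw( decode );

# Byte 0x92 only means "right single quotation mark" under cp1252;
# decoding it yields the Unicode code point U+2019 (8217).
my $ch = decode( 'cp1252', "\x92" );
printf "U+%04X (%d)\n", ord $ch, ord $ch;   # prints U+2019 (8217)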

Start by finding out what $content contains. You can use the following:
use Data::Dumper;
local $Data::Dumper::Useqq = 1;
warn(Dumper($content));
If you get the following, $content is decoded
$VAR1 = "...\x{2019}...";
Any of the following will work.
use utf8; # Source code is encoded using UTF-8.
$content =~ s/’/'/g;
$content =~ s/\x{2019}/'/g;
$content =~ s/\N{U+2019}/'/g;
$content =~ s/\N{RIGHT SINGLE QUOTATION MARK}/'/g;
If you get the following, $content is encoded using UTF-8.
$VAR1 = "...\342\200\231...";
Start by decoding the value of $content using either of the following:
utf8::decode($content) or die;
use Encode qw( decode_utf8 );
$content = decode_utf8($content);
Then use any of the solutions for decoded content (above).
If you get the following, $content is encoded using cp1252.
$VAR1 = "...\222...";
Start by decoding the value of $content.
use Encode qw( decode );
$content = decode("cp1252", $content);
Then use any of the solutions for decoded content (above).
By the way, â€™ is what the UTF-8 encoding of ’ (E2 80 99) would look like if decoded as cp1252.
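Putting the pieces together, here is a hedged end-to-end sketch for the cp1252 case (the file names are placeholders; swap the decode() argument for whatever encoding the Dumper output revealed):
use strict;
use warnings;
use Encode qw( decode );

open my $in, '<:raw', 'scraped.html' or die "open: $!";   # raw bytes in
my $content = do { local $/; <$in> };
close $in;

$content = decode( 'cp1252', $content );   # bytes -> characters
$content =~ s/\x{2019}/'/g;                # the code point now matches

open my $out, '>:encoding(UTF-8)', 'out.txt' or die "open: $!";
print {$out} $content;                     # encoded once, on the way out
close $out;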

The problem was not in my script; it was in my editor. The script works properly, and the question is based on a false premise. I was using gVim on Windows, which did not play nicely with Unicode. My script was properly decoding the content, but when I opened the output file in gVim, it mangled the text and displayed it incorrectly. My attempts to use regular expressions to change the characters failed because I was using the wrong codepoint: it wasn't 0x92, it was 0x2019. This was another failing of gVim. Thanks to hobbs and ikegami for helping me figure this out.

Related

MIME::Base64::decode_base64 wrong characters

I am having some trouble using Perl's MIME::Base64::decode_base64.
Here is my code:
#!/usr/bin/perl
use MIME::Base64;
$string_to_decrypt="lVvfrx23jX7vX3HghyJGxo4oivqBIg";
$content=MIME::Base64::decode_base64($string_to_decrypt);
open(WRITE,">/home/laurent/decrypted.txt");
print WRITE $content;
close(WRITE);
exit;
Using an online decoder (like https://www.base64decode.org/), the result should be:
[߯·~ï_qà"FÆ(ú"
But in my file, I get:
<95>[߯^]·<8d>~ï_qà<87>"FÆ<8e>(<8a>ú<81>"
I don't know how to get rid of:
<95>, ^], <8d>, <87>, ...
Thanks
Laurent
This is clearly not text, so it's no surprise it doesn't render properly when printed as text. base64decode.org actually produces the same correct result as decode_base64, which is the following bytes:
95.5B.DF.AF.1D.B7.8D.7E.EF.5F.71.E0.87.22.46.C6.8E.28.8A.FA.81.22
You can use either of the following to remove the characters you identified, but that is most definitely the wrong thing to do.
$content =~ tr/\x1D\x87\x8D\x95//d;
-or-
$content =~ s/[\x1D\x87\x8D\x95]//g;
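If the goal is only to inspect what decode_base64 produced, a hex dump sidesteps the text-rendering problem entirely (a small sketch along the lines of the bytes listed above):
use MIME::Base64 qw( decode_base64 );

my $content = decode_base64('lVvfrx23jX7vX3HghyJGxo4oivqBIg');
# Show the raw octets instead of printing them as if they were text.
print join( '.', map { sprintf '%02X', $_ } unpack( 'C*', $content ) ), "\n";
# prints 95.5B.DF.AF.1D.B7.8D.7E.EF.5F.71.E0.87.22.46.C6.8E.28.8A.FA.81.22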

How to replace Ã with a space using perl

Apologies if this is a dupe (I tried all manner of searches!). This is driving me nuts...
I need a quick fix to replace Ã with a space.
I've tried the following, with no success:
$str =~ s/Ã/ /g;
$str =~ s/\xC3/ /g;
What am I doing wrong here ?
The statement "replace Ã with a space" is meaningless, because it does not specify which encoding is used for the character in question.
The character could be encoded using UTF-8, for example, or one of several ISO-8859 encodings, or maybe even UTF-16 or UTF-32.
So, for starters, you need to specify, at least, which encoding you are using. And after that, it's also necessary to specify where the input or the output is coming from.
Assuming:
1) You are using UTF-8 encoding
2) You are reading/writing STDIN and STDOUT
Then here's a short example of a filter that shows how to replace this character with a space. Assuming, of course, that the Perl script itself is also encoded in UTF-8.
use utf8;
use feature 'unicode_strings';
binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");
while (<STDIN>)
{
    s/Ã/ /g;
    print;
}
You need to specify that the input is UTF-8-encoded and not Latin-1 (or some other encoding).
If you're reading from a file then:
#!/usr/bin/perl
open INFILE, '<:encoding(UTF-8)', '/mypath/file';
while (<INFILE>) {
    s/\xc3/ /g;
    print;
}
I'll break that down better for you:
In <:encoding(UTF-8) you are specifying that you want to read (the <), and that the file's contents should be decoded as UTF-8 (the :encoding(UTF-8) part).
If you weren't using unicode you would use:
open INFILE, '<', '/mypath/file';
or
open INFILE, '/mypath/file';
because by default perl opens files for reading. If you want to write, you use >:encoding(UTF-8), and if you want to append (because > overwrites the file) you use >>:encoding(UTF-8).
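For example, hedged sketches of the write and append forms (the paths are just placeholders):
# Write (creates or overwrites), encoding characters as UTF-8 on the way out:
open my $out, '>:encoding(UTF-8)', '/mypath/file.out' or die "open: $!";
# Append instead of overwriting:
open my $log, '>>:encoding(UTF-8)', '/mypath/file.log' or die "open: $!";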
Hope it helped!
There is another answer that specifies how to do binmode(STDIN, ":utf8") if you're trying to read Unicode from STDIN.
Following this, for the simple "quick fix" Wonko was looking for:
tr/ -~//cd;
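A brief illustration of what that transliteration does (the sample string is made up): it keeps only printable ASCII, space (0x20) through tilde (0x7E), and deletes everything else.
my $str = "caf\x{e9} costs \x{20ac}5";    # contains é and €
(my $clean = $str) =~ tr/ -~//cd;         # keep space..tilde, delete the rest
print "$clean\n";                         # prints "caf costs 5"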

How do I avoid double UTF-8 encoding in XML::LibXML

My program receives UTF-8 encoded strings from a data source. I need to tamper with these strings, then output them as part of an XML structure.
When I serialize my XML document, it will be double encoded and thus broken. When I serialize only the root element, it will be fine, but of course lacking the header.
Here's a piece of code trying to visualize the problem:
use strict; use diagnostics; use feature 'unicode_strings';
use utf8; use v5.14; use encoding::warnings;
binmode(STDOUT, ":encoding(UTF-8)"); use open qw( :encoding(UTF-8) :std );
use XML::LibXML;
# Simulate actual data source with a UTF-8 encoded file containing '¿Üßıçñíïì'
open( IN, "<", "./input" ); my $string = <IN>; close( IN ); chomp( $string );
$string = "Value of '" . $string . "' has no meaning";
# create example XML document as <response><result>$string</result></response>
my $xml = XML::LibXML::Document->new( "1.0", "UTF-8" );
my $rsp = $xml->createElement( "response" ); $xml->setDocumentElement( $rsp );
$rsp->appendTextChild( "result", $string );
# Try to forward the resulting XML to a receiver. Using STDOUT here, but files/sockets etc. yield the same results
# This will not warn and be encoded correctly but lack the XML header
print( "Just the root document looks good: '" . $xml->documentElement->serialize() . "'\n" );
# This will include the header but wide chars are mangled
print( $xml->serialize() );
# This will even issue a warning from encoding::warnings
print( "The full document looks mangled: '" . $xml->serialize() . "'\n" );
Spoiler 1: Good case:
<response><result>Value of '¿Üßıçñíïì' has no meaning</result></response>
Spoiler 2: Bad case:
<?xml version="1.0" encoding="UTF-8"?><response><result>Value of '¿ÃÃıçñíïì' has no meaning</result></response>
The root element and its contents are already UTF-8 encoded. XML::LibXML accepts the input and is able to work on it and output it again as valid UTF-8. As soon as I try to serialize the whole XML document, the wide characters inside get mangled. In a hex dump, it looks as if the already UTF-8 encoded string gets passed through a UTF-8 encoder again. I've searched, tried and read a lot, from Perl's own Unicode tutorial all the way through tchrist's great answer to the Why does modern Perl avoid UTF-8 by default? question. I don't think this is a general Unicode problem, though, but rather a specific issue between me and XML::LibXML.
What do I need to do to be able to output a full XML document including the header so that its contents remain correctly encoded? Is there a flag/property/switch to set?
(I'll gladly accept links to the corresponding part(s) of TFM that I should have R for as long as they are actually helpful ;)
ikegami is correct, but he didn't really explain what's wrong. To quote the docs for XML::LibXML::Document:
IMPORTANT: unlike toString for other nodes, on document nodes this function returns the XML as a byte string in the original encoding of the document (see the actualEncoding() method)!
(serialize is just an alias for toString)
When you print a byte string to a filehandle marked with an :encoding layer, it gets encoded as if it were ISO-8859-1. Since you have a string containing UTF-8 bytes, it gets double encoded.
As ikegami said, use binmode(STDOUT) to remove the encoding layer from STDOUT. You could also decode the result of serialize back into characters before printing it, but that assumes the document is using the same encoding you have set on your output filehandle. (Otherwise, you'll emit an XML document whose actual encoding doesn't match what its header claims.) If you're printing to a file instead of STDOUT, open it with '>:raw' to avoid double encoding.
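A hedged sketch of that second option, reusing $xml from the question's code (it only works here because the document's declared encoding, UTF-8, matches the layer already on STDOUT):
use Encode qw( decode );

# serialize() on a document node returns octets in the document's encoding,
# so turn them back into characters and let the :encoding(UTF-8) layer
# on STDOUT re-encode them exactly once.
print decode( 'UTF-8', $xml->serialize() );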
Since XML documents are parsed without needing any external information, they are binary files rather than text files.
You're telling Perl to encode anything sent to STDOUT[1], but then you proceed to output an XML document to it. You can't apply a character encoding to a binary file as it corrupts it.
Replace
binmode(STDOUT, ":encoding(UTF-8)");
with
binmode(STDOUT);
Note: This assumes the rest of the text you are outputting is just temporary debugging information. The output doesn't otherwise make sense.
[1] In fact, you do this twice: once using use open qw( :encoding(UTF-8) :std );, and then a second time using binmode(STDOUT, ":encoding(UTF-8)");.
I do not like changing the settings of STDOUT because of the specific behaviour of toString() in the two classes XML::LibXML::Document and XML::LibXML::Element.
So I prefer to add Encode::encode where it is required. You can run the following example:
use strict;
use warnings FATAL => 'all';
use XML::LibXML;
use Encode;
my ( $doc, $main, $nodelatin, $nodepolish );
$doc = XML::LibXML::Document->createDocument( '1.0', 'UTF-8' );
$main = $doc->createElement('main');
$doc->addChild($main);
$nodelatin = $doc->createElement('latin');
$nodelatin->appendTextNode('Lorem ipsum dolor sit amet');
$main->addChild($nodelatin);
print __LINE__, ' ', $doc->toString(); # printed OK
print __LINE__, ' ', $doc->documentElement()->toString(), "\n\n"; # printed OK
$nodepolish = $doc->createElement('polish');
$nodepolish->appendTextNode('Zażółć gęślą jaźń');
$main->addChild($nodepolish);
print __LINE__, ' ', $doc->toString(); # printed OK
print __LINE__, ' ', Encode::encode("UTF-8", $doc->documentElement()->toString()), "\n"; # printed OK
print __LINE__, ' ', $doc->documentElement()->toString(), "\n"; # Wide character in print

perl save utf-8 text problem

I am playing around with pplog, a single-file, file-based blog.
The code that writes to the file:
open(FILE, ">$config_postsDatabaseFolder/$i.$config_dbFilesExtension");
my $date = getdate($config_gmt);
print FILE $title.'"'.$content.'"'.$date.'"'.$category.'"'.$i; # 0: Title, 1: Content, 2: Date, 3: Category, 4: FileName
print 'Your post '. $title.' has been saved. Go to Index';
close FILE;
The input text:
春眠不覺曉,處處聞啼鳥. 夜來風雨聲,花落知多小.
After storing it to the file, it becomes:
春眠不覺�›�,處處聞啼鳥. 夜來風�›�聲,花落知多小.
I can use Eclipse to edit the file and make it render normally. The problem occurs while printing to the file.
Some basic info:
Strawberry perl 5.12
without use utf8;
I tried use utf8;, but it doesn't have any effect.
Thank you.
--- EDIT ---
Thanks for the comments. I traced the code:
Codes add new content:
# Blog Add New Entry Page
my $pass = r('pass');
#BK 7JUL09 patch from fedekun, fix post with no title that caused zero-byte message...
my $title = r('title');
my $content = '';
if($config_useHtmlOnEntries == 0)
{
    $content = bbcode(r('content'));
}
else
{
    $content = basic_r('content');
}
my $category = r('category');
my $isPage = r('isPage');
sub r
{
    escapeHTML(param($_[0]));
}
sub r forwards the call to a CGI.pm function.
In CGI.pm:
sub escapeHTML {
    # hack to work around earlier hacks
    push @_,$_[0] if @_==1 && $_[0] eq 'CGI';
    my ($self,$toencode,$newlinestoo) = CGI::self_or_default(@_);
    return undef unless defined($toencode);
    $toencode =~ s{&}{&amp;}gso;
    $toencode =~ s{<}{&lt;}gso;
    $toencode =~ s{>}{&gt;}gso;
    if ($DTD_PUBLIC_IDENTIFIER =~ /[^X]HTML 3\.2/i) {
        # $quot; was accidentally omitted from the HTML 3.2 DTD -- see
        # <http://validator.w3.org/docs/errors.html#bad-entity> /
        # <http://lists.w3.org/Archives/Public/www-html/1997Mar/0003.html>.
        $toencode =~ s{"}{&#34;}gso;
    }
    else {
        $toencode =~ s{"}{&quot;}gso;
    }
    # Handle bug in some browsers with Latin charsets
    if ($self->{'.charset'}
        && (uc($self->{'.charset'}) eq 'ISO-8859-1'   # This line causes trouble: it treats Chinese chars as ISO-8859-1
            || uc($self->{'.charset'}) eq 'WINDOWS-1252')) {
        $toencode =~ s{'}{&#39;}gso;
        $toencode =~ s{\x8b}{&#8249;}gso;
        $toencode =~ s{\x9b}{&#8250;}gso;
        if (defined $newlinestoo && $newlinestoo) {
            $toencode =~ s{\012}{&#10;}gso;
            $toencode =~ s{\015}{&#13;}gso;
        }
    }
    return $toencode;
}
Tracing the problem further, I found that the browser defaulted to ISO-8859-1; even when manually set to UTF-8, it sent the string back to the server as ISO-8859-1.
Finally,
print header(-charset => qw(utf-8)), '<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
Adding the -charset => qw(utf-8) param to header fixed it. The Chinese poem is still a Chinese poem.
Thanks to Schwern for his comments; they inspired me to trace out the problem and learn a lesson.
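For reference, a minimal hedged sketch of that header call in context. The -utf8 import shown here is a separate CGI.pm pragma for decoding incoming parameters; it is an assumption on my part, not something taken from pplog.
#!/usr/bin/perl
use strict;
use warnings;
use CGI qw( -utf8 header param escapeHTML );   # -utf8 decodes incoming params as UTF-8

binmode STDOUT, ':encoding(UTF-8)';

my $content = escapeHTML( param('content') // '' );
print header( -charset => 'utf-8' ),
      qq{<!DOCTYPE html>\n<html><head><meta charset="utf-8" /></head>\n},
      qq{<body><p>$content</p></body></html>\n};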
Getting UTF-8 really working in Perl involves flipping on a lot of individual features. use utf8 only makes your source code UTF-8 (string literals, identifiers, regexes...); you have to handle filehandles separately.
It's complicated, and the simplest thing is to use utf8::all, which will make UTF-8 the default for your code, your files, @ARGV, STDIN, STDOUT and STDERR. UTF-8 support is constantly improving in Perl, and utf8::all will add new features as they become available.
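A minimal sketch, assuming the utf8::all CPAN module is installed (the file name and sample text are placeholders):
use strict;
use warnings;
use utf8::all;   # UTF-8 defaults for source code, @ARGV, std handles and newly opened filehandles

my $text = "\x{6625}";   # U+6625 (春), the first character of the poem
open my $fh, '>', 'post.txt' or die "open: $!";
print {$fh} $text, "\n"; # encoded as UTF-8 on the way out, no "Wide character" warning
close $fh;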
I'm unsure of how your code can produce that output—for example, the quote marks are missing. Of course, this could be due to "corruption" somewhere between your file and me seeing the page. SO may filter corrupted UTF-8. I suggest providing hex dumps in the future!
Anyway, to get UTF-8 output working in Perl, there are several approaches:
1. Work with character data; that is, let Perl know that your variables contain Unicode. This is probably the best method. Confirm that utf8::is_utf8($var) is true (you do not need to, and should not, use utf8 for this). If it isn't, look into the Encode module's decode function to let Perl know it's Unicode. Once Perl knows your data is characters, that print will give warnings (which you do have enabled, right?). To fix that, enable the :utf8 or :encoding(utf8) layer on your file (the latter version provides error checking). You can do this in your open (open FILE, '>:utf8', "$fname") or alternatively enable it with binmode (binmode FILE, ':utf8'). Note that you can also use other encodings; see the encoding and PerlIO::encoding docs. There is a sketch of this approach after the second option below.
2. Treat your Unicode as opaque binary data. utf8::is_utf8($var) must be false. You must be very careful when manipulating strings; for example, if you've got UTF-16-BE, this would be a bad idea: print "$data\n", because you actually need print "$data\0\n". UTF-8 has fewer of these issues, but you need to be aware of them.
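To make the first approach concrete, a minimal hedged sketch (the byte string and file name are placeholders, not the asker's actual data):
use strict;
use warnings;
use Encode qw( decode );

my $raw   = "\xE6\x98\xA5";                 # UTF-8 octets for U+6625 (春)
my $chars = decode( 'UTF-8', $raw );        # now character data
my $state = utf8::is_utf8($chars) ? "character data" : "still bytes";
print "$state\n";

open my $fh, '>:encoding(UTF-8)', 'out.txt' or die "open: $!";
print {$fh} $chars;                         # re-encoded once, on the way out
close $fh;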
I suggest reading the perluniintro, perlunitut, perlunicode, and perlunifaq manpages/pods.
Also, use utf8; just tells Perl that your script is written in UTF-8. Its effects are very limited; see its pod docs.
You're not showing the code that is actually running. I successfully processed the text you supplied as input with both 5.10.1 on Cygwin and 5.12.3 on Windows. So I suspect a bug in your code. Try narrowing down the problem by writing a short, self-contained test case.

Perl: String literal in module in latin1 - I want utf8

In the Date::Holidays::DK module, the names of certain Danish holidays are written in Latin1 encoding. For example, January 1st is 'Nytårsdag'. What should I do to $x below in order to get a proper utf8-encoded string?
use Date::Holidays::DK;
my $x = is_dk_holiday(2011,1,1);
I tried various combinations of use utf8 and no utf8 before/after use Date::Holidays::DK, but it does not seem to have any effect. I also tried to use Encode's decode, with no luck. More specifically,
use Date::Holidays::DK;
use Encode;
use Devel::Peek;
my $x = decode("iso-8859-1",
    is_dk_holiday(2011,1,1)
);
Dump($x);
print "January 1st is '$x'\n";
gives the output
SV = PV(0x15eabe8) at 0x1492a10
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x1593710 "Nyt\303\245rsdag"\0 [UTF8 "Nyt\x{e5}rsdag"]
CUR = 10
LEN = 16
January 1st is 'Nyt sdag'
(with an invalid character between t and s).
use utf8 and no utf8 before/after use Date::Holidays::DK, but it does not seem to have any effect.
Correct. The utf8 pragma only indicates that the source code of the program is written in UTF-8.
I also tried to use Encode's decode, with no luck.
You did not perceive this correctly; you did, in fact, do the right thing. You now have a string of Perl characters and can manipulate it.
with an invalid character between t and s
You are also interpreting this incorrectly; it is in fact the å character.
You want to output UTF-8, so you are lacking the encoding step.
my $octets = encode 'UTF-8', $x;
print $octets;
Please read http://p3rl.org/UNI for an introduction to the topic of encoding. You must always decode and encode, either explicitly or implicitly.
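A tiny round-trip illustrating that rule, using the same Latin-1 data as above:
use Encode qw( decode encode );

my $bytes = "Nyt\xE5rsdag";                  # iso-8859-1 octets (0xE5 is å)
my $chars = decode( 'iso-8859-1', $bytes );  # decode external input
print encode( 'UTF-8', "January 1st is '$chars'\n" );   # encode everything you output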
use utf8 is only a hint to the perl interpreter/compiler that your source file is UTF-8 encoded. String literals containing high-bit bytes will then be decoded into Perl's internal Unicode format.
If you have a variable that is encoded in iso-8859-1, you must decode it. Then your variable is in the internal Unicode format. That happens to be utf8, but you shouldn't care which encoding perl uses internally.
Now if you want to print such a string, you need to convert it back to a byte string by calling encode on it. If you don't encode manually, perl itself will encode it back to iso-8859-1, which is the default encoding.
Before you print your variable $x, you need to do a $x = encode('UTF-8', $x) on it.
For correct handling of UTF-8 you always need to decode() every external input over I/O. And you always need to encode() everything that leaves your program.
To change the default input/output encoding you can use something like this.
use utf8;
use open ':encoding(UTF-8)';
use open ':std', ':encoding(UTF-8)';
The first line says that your source code is encoded in UTF-8. The second line says that every filehandle you open() should automatically be decoded/encoded as UTF-8; note that this applies to files you open(), so if you work with binary files you need to call binmode() on the handle.
But the second line does not change the handling of STDIN, STDOUT or STDERR; the third line does that (the ':std' subpragma has no effect on its own, which is why the encoding layer is repeated there).
You can probably use the module utf8::all, which makes this process easier. But it is always good to understand how all this works behind the scenes.
To correct your example, one possible way is this:
#!/usr/bin/env perl
use Date::Holidays::DK;
use Encode;
use Devel::Peek;
my $x = decode("iso-8859-1",
    is_dk_holiday(2011,1,1)
);
Dump($x);
print encode("UTF-8", "January 1st is '$x'\n");