Clarification on binmode - perl

I'm new to Perl.
I was getting errors in my print statement: "Wide character in print"
Adding this line of code made it work: binmode(STDOUT, ":utf8");
I read the docs. Put simply, binmode encodes characters in a way that the platform can understand.
Without it, the platform may interpret the characters as something else, because it is expecting a different encoding.
Or is my understanding of binmode off?
Is there a way in Perl to find out what encoding the platform is using?

use open ':std', ':locale';
can help. Doesn't work on all systems, though.
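If you want to see which encoding the locale reports, a minimal sketch along these lines may help. It assumes the core I18N::Langinfo module works on your platform (it does not everywhere). The fallback to UTF-8 when Encode does not recognise the codeset name is my own assumption, not something Perl does for you.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);
use I18N::Langinfo qw(langinfo CODESET);

# Ask the C library which character set the current locale uses.
my $codeset = langinfo(CODESET);

# Check whether Encode knows that name; fall back to UTF-8 if it does not
# (the fallback is an assumption made for this sketch).
my $encoding = length $codeset ? find_encoding($codeset) : undef;
my $layer    = $encoding
    ? ':encoding(' . $encoding->name . ')'
    : ':encoding(UTF-8)';

binmode STDOUT, $layer;
print "Locale codeset: $codeset\n";
print "Layer applied to STDOUT: $layer\n";
```

On a typical Linux terminal with a UTF-8 locale the reported codeset is usually UTF-8, but the value depends entirely on your environment.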

Related

Reading file breaks encoding in Perl

I have a script for reading HTML files in Perl. It works, but it breaks the encoding.
This is my script:
use utf8;
use Data::Dumper;
open my $fr, '<', 'file.html' or die "Can't open file $!";
my $content_from_file = do { local $/; <$fr> };
print Dumper($content_from_file);
Content of file.html:
<span class="previews-counter">Počet hodnotení: [%product.rating_votes%]</span>
[%L10n.msg('Zobraziť recenzie')%]
Output from reading:
<span class=\"previews-counter\">Po\x{10d}et hodnoten\x{ed}: [%product.rating_votes%]</span>
[%L10n.msg('Zobrazi\x{165} recenzie')%]
As you can see, a lot of characters are escaped. How can I read this file and show its content as it is?
You open the file with perl's default encoding:
open my $fh, '<', ...;
If that encoding doesn't match the actual encoding, Perl might translate some characters incorrectly. If you know the encoding, specify it in the open mode:
open my $fh, '<:utf8', ...;
You aren't done yet, though. Now that you have a properly decoded string, you want to output it, and you have the same problem again. The standard output filehandle's encoding has to match what you are printing to. If you've set up your terminal (or whatever) to expect UTF-8, you need to actually output UTF-8. One way to fix that is to make the standard filehandles use UTF-8:
use open qw(:std :utf8);
You have use utf8, but that only signals the encoding for your program file.
I've written a much longer primer for Perl and Unicode in the back of Learning Perl. The StackOverflow question Why does modern Perl avoid UTF-8 by default? has lots of good advice.
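Putting the two halves together, here is a small sketch of decoding on input and encoding on output. The temporary file and its sample text are stand-ins for file.html, created only so the example is self-contained.

```perl
use strict;
use warnings;
use utf8;                               # this source file contains UTF-8 literals
use File::Temp qw(tempfile);

binmode STDOUT, ':encoding(UTF-8)';     # encode characters on the way out

# Write a sample file first (a stand-in for file.html).
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':encoding(UTF-8)';
print {$tmp_fh} "Počet hodnotení\n";
close $tmp_fh or die "Can't close $path: $!";

# Decode the bytes into characters while reading.
open my $fh, '<:encoding(UTF-8)', $path or die "Can't open $path: $!";
my $content = do { local $/; <$fh> };
close $fh;

# Print the text itself; Data::Dumper would show escapes such as \x{10d}.
print $content;
```

Printing the string directly, rather than through Data::Dumper, is what shows the content as it is.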

How to replace Ã with a space using perl

Apologies if this is a dupe (I tried all manner of searches!). This is driving me nuts...
I need a quick fix to replace Ã with a space.
I've tried the following, with no success:
$str =~ s/Ã/ /g;
$str =~ s/\xC3/ /g;
What am I doing wrong here ?
The statement "replace Ã with a space" is ambiguous, because it does not say which encoding is used for the character in question.
The data could be in UTF-8, for example, or in one of several ISO-8859 encodings. Or maybe even UTF-16 or UTF-32.
So, for starters, you need to specify which encoding you are using. After that, you also need to specify where the input comes from and where the output goes.
Assuming:
1) You are using UTF-8 encoding
2) You are reading/writing STDIN and STDOUT
Then here's a short example of a filter that shows how to replace this character with a space. Assuming, of course, that the Perl script itself is also encoded in UTF-8.
use utf8;
use feature 'unicode_strings';
binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");
while (<STDIN>)
{
    s/Ã/ /g;
    print;
}
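To see why the encoding matters, here is a small sketch that decodes a byte string and only then does the substitution. The sample bytes are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the UTF-8 bytes for "cafÃ bar" (Ã is C3 83 in UTF-8).
my $bytes = "caf\xC3\x83 bar";

# Decoding turns the two bytes C3 83 into the single character U+00C3.
my $chars = decode('UTF-8', $bytes);

# On the decoded string, \x{C3} matches exactly that character.
(my $fixed = $chars) =~ s/\x{C3}/ /g;

printf "%d bytes, %d characters\n", length($bytes), length($chars);
print encode('UTF-8', $fixed), "\n";
```

Running the same substitution on the undecoded bytes would remove only the first byte of the pair and leave a stray \x83 behind.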
You need to tell Perl that the data is Unicode (UTF-8 here) and not Latin-1 (or another encoding).
If you're reading from a file then:
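#!/usr/bin/perl
use strict;
use warnings;
# Decode the input as UTF-8 while reading, and stop if the file can't be opened.
open INFILE, '<:encoding(UTF-8)', '/mypath/file' or die "Can't open /mypath/file: $!";
binmode STDOUT, ':encoding(UTF-8)'; # encode the output as UTF-8 again
while (<INFILE>) {
    s/\xc3/ /g; # after decoding, \xc3 matches the character U+00C3
    print;
}
close INFILE;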
I'll break that down for you:
In <:encoding(UTF-8) you are specifying that you want to read (the <), and that the file contains Unicode text encoded as UTF-8 (the :encoding(UTF-8) part).
If you weren't using Unicode you would use:
open INFILE, '<', '/mypath/file';
or
open INFILE, '/mypath/file';
because Perl opens files for reading by default. If you want to write, use >:encoding(UTF-8), and if you want to append (because > overwrites the file), use >>:encoding(UTF-8).
Hope it helped!
There is another answer that shows how to use binmode(STDIN, ":utf8") if you're trying to read Unicode from STDIN.
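A short sketch of the three modes together, using a temporary directory and a made-up file name so it can run anywhere:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/demo.txt";          # hypothetical file name

# '>' creates or truncates the file; characters are encoded as UTF-8.
open my $out, '>:encoding(UTF-8)', $path or die "Can't write $path: $!";
print {$out} "first \x{C3}\n";
close $out or die "Can't close $path: $!";

# '>>' appends to the existing content instead of overwriting it.
open my $app, '>>:encoding(UTF-8)', $path or die "Can't append to $path: $!";
print {$app} "second\n";
close $app or die "Can't close $path: $!";

# '<' reads; the layer decodes the bytes back into characters.
open my $in, '<:encoding(UTF-8)', $path or die "Can't read $path: $!";
my @lines = <$in>;
close $in;

print scalar(@lines), " lines read\n";
```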
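Following this, for the simple "quick fix" Wonko was looking for (note that it deletes every character outside printable ASCII, newlines included, rather than replacing it with a space):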
tr/ -~//cd;
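If you would rather get a space in place of each stray character, and keep the newlines, a variant without the d modifier may be closer to what was asked. This is only a sketch on a made-up sample string.

```perl
use strict;
use warnings;

my $str = "caf\x{C3} au lait\n";     # sample input containing U+00C3

# Delete everything outside space..tilde (this also removes the newline).
(my $deleted = $str) =~ tr/ -~//cd;

# Replace everything except newline and printable ASCII with a space.
(my $spaced = $str) =~ tr/\n\x20-\x7E/ /c;

print "deleted: [$deleted]\n";
print "spaced:  [$spaced]\n";
```

With the c modifier and a one-character replacement list, every character not in the search list is mapped to that one character.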

Print other language character in csv using perl file handling

I am scraping a German-language site and trying to store its content in a CSV file using Perl, but I am getting garbage values in the CSV. The code I use is:
open my $fh, '>> :encoding(UTF-8)', 'output.csv';
print {$fh} qq|"$title"\n|;
close $fh;
For example: I expect Weiß ,Römersandalen , but I get Weiß, Römersandalen
Update :
Code
use strict;
use warnings;
use utf8;
use WWW::Mechanize::Firefox;
use autodie qw(:all);
my $m = WWW::Mechanize::Firefox->new();
print "\n\n *******Program Begins********\n\n";
$m->get($url) or die "unable to get $url";
my $Home_Con=$m->content;
my $title='';
if($Home_Con=~m/<span id="btAsinTitle">([^<]*?)<\/span>/is){
$title=$1;
print "title ::$1\n";
}
open my $fh, '>> :encoding(UTF-8)', 's.txt'; #<= (Weiß)
print {$fh} qq|"$title"\n|;
close $fh;
open $fh, '>> :encoding(UTF-8)', 's1.csv'; #<= (Weiß)
print {$fh} qq|"$title"\n|;
close $fh;
print "\n\n *******Program ends********";
<>;
This is part of the code. The method works fine for text files, but not for CSV.
You've shown us the code where you're encoding the data correctly as you write it to the file.
What we also need to see is how the data gets into your program. Are you decoding it correctly at that point?
Update:
If the code was really just my $title='Weiß ,Römersandalen' as you say in the comments, then the solution would be as simple as adding use utf8 to your code.
The point is that Perl needs to know how to interpret the stream of bytes that it's dealing with. Outside your program, data exists as bytes in various encodings. You need to decode that data as it enters your program (decoding turns a stream of bytes into a string of characters) and encode it again as it leaves your program. You're doing the encoding step correctly, but not the decoding step.
The reason that use utf8 fixes that in the simple example you've given is that use utf8 tells Perl that your source code should be interpreted as a stream of bytes encoded as utf8. It then converts that stream of bytes into a string of characters containing the correct characters for 'Weiß ,Römersandalen'. It can then successfully encode those characters into bytes representing those characters encoded as utf8 as they are written to the file.
Your data is actually coming from a web page. I assume you're using LWP::Simple or something like that. That data might be encoded as utf8 (I doubt it, given the problems you're having) but it might also be encoded as ISO-8859-1 or ISO-8859-9 or CP1252 or any number of other encodings. Unless you know what the encoding is and correctly decode the incoming data, you will see the results that you are getting.
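As a rough illustration of the decode-then-encode round trip, here is a sketch. It assumes the incoming bytes are ISO-8859-1, which is only a guess for the sake of the example, and it uses a temporary file instead of your real CSV.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Hypothetical input: the bytes for "Weiß" as an ISO-8859-1 page would send them.
my $raw_bytes = "Wei\xDF";

# Decode on the way in: bytes become a string of characters.
my $title = decode('ISO-8859-1', $raw_bytes);

# Encode on the way out: the layer turns characters into UTF-8 bytes.
my ($fh, $path) = tempfile(SUFFIX => '.csv', UNLINK => 1);
binmode $fh, ':encoding(UTF-8)';
print {$fh} qq|"$title"\n|;
close $fh or die "Can't close $path: $!";

# Read the file back as raw bytes to see what was actually written.
open my $in, '<:raw', $path or die "Can't open $path: $!";
my $written = do { local $/; <$in> };
close $in;

print join(' ', map { sprintf '%02X', ord } split //, $written), "\n";
```

The ß ends up as the two bytes C3 9F in the file, which is its UTF-8 form. If you decode with the wrong encoding name, the file will still be valid UTF-8 but will contain the wrong characters.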
Check whether there are any weird characters at the start or anywhere else in the file, using commands like head or tail.

Unicode string mess in perl

I have an external module that returns some strings to me. I am not sure exactly how the strings are returned, and I don't really understand how Unicode strings work and why.
The module should return, for example, the Czech word "být", meaning "to be". (If you cannot see the second letter - it should look like this.) If I display the string, returned by the module, with Data Dumper, I see it as b\x{fd}t.
However, if I try to print it with print $s, I get a "Wide character in print" warning, and ? instead of ý.
If I try Encode::decode(whatever, $s);, the resulting string cannot be printed anyway (always with the "Wide character" warning, sometimes with mangled characters, sometimes right), no matter what I put in whatever.
If I try Encode::encode("utf-8", $s);, the resulting string CAN be printed without the problems or error message.
If I use use encoding 'utf8';, printing works without any need of encoding/decoding. However, if I use IO::CaptureOutput or Capture::Tiny module, it starts shouting "Wide character" again.
I have a few questions, mostly about what exactly happens. (I tried to read the perldocs, but they did not make me much wiser.)
1) Why can't I print the string right after getting it from the module?
2) Why can't I print the string decoded by "decode"? What exactly did "decode" do?
3) What exactly did "encode" do, and why was there no problem printing the string after encoding?
4) What exactly does use encoding do? Why is the default encoding different from utf-8?
5) What do I have to do if I want to print the scalars without any problems, even when I use one of the capturing modules?
edit: Some people tell me to use -C or binmode or PERL_UNICODE. That is great advice. However, somehow, both capturing modules magically destroy the UTF8-ness of STDOUT. That seems to be more a bug in the modules, but I am not really sure.
edit2: OK, the best solution was to dump the modules and write the "capturing" myself (with much less flexibility).
1) Because you output a string in Perl's internal form (utf8) to a non-Unicode filehandle.
2) The decode function decodes a sequence of bytes assumed to be in ENCODING into Perl's internal form (utf8). Your input seems to be already decoded.
3) The encode() function encodes a string from Perl's internal form into ENCODING.
4) The encoding pragma allows you to write your script in any encoding you like. String literals are automatically converted to Perl's internal form.
5) Make sure Perl knows which encoding your data comes in as and which encoding it should go out as.
See also perluniintro, perlunicode, Encode module, binmode() function.
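A tiny sketch of what encode and decode do to the string from the question, counting characters and bytes at each step:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $chars = "b\x{fd}t";                  # character string: b, ý, t
my $bytes = encode('UTF-8', $chars);     # byte string: ý becomes C3 BD
my $back  = decode('UTF-8', $bytes);     # characters again

printf "characters: %d, bytes: %d, round trip ok: %s\n",
    length($chars), length($bytes), ($back eq $chars ? 'yes' : 'no');
```

The encoded version prints without complaint because it contains only bytes, which is what a filehandle without an encoding layer expects.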
I recommend reading the Unicode chapter of my book Effective Perl Programming. We put together all the docs we could find and explained Unicode in Perl much more coherently than I've seen anywhere else.
This program works fine for me:
#!perl
use utf8;
use 5.010;
binmode STDOUT, ':utf8';
my $string = return_string();
say $string;
sub return_string { 'být' }
Additionally, Capture::Tiny works just fine for me:
#!perl
use utf8;
use 5.010;
use Capture::Tiny qw(capture);
binmode STDOUT, ':utf8';
my( $stdout, $stderr ) = capture {
system( $^X, '/Users/brian/Desktop/czech.pl' );
};
say "STDOUT is [$stdout]";
IO::CaptureOutput seems to have some problems though:
#!perl
use utf8;
use 5.010;
use IO::CaptureOutput qw(capture);
binmode STDOUT, ':utf8';
capture {
system( $^X, '/Users/brian/Desktop/czech.pl' );
} \my $stdout, \my $stderr;
say "STDOUT is [$stdout]";
For this I get:
STDOUT is [být
]
However, that's easy to fix. Don't use that module. :)
You should also look at the PERL_UNICODE environment variable, which is the same as using the -C option. That allows you to set STDIN/STDOUT/STDERR (and @ARGV) to be UTF-8 without having to alter your scripts.
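To check what -C or PERL_UNICODE actually switched on, you can inspect the ${^UNICODE} variable. This sketch only reports the bits for the standard handles and @ARGV, using the values documented for -C in perlrun.

```perl
use strict;
use warnings;

# ${^UNICODE} is set at startup from -C / PERL_UNICODE and is read-only.
my $flags = ${^UNICODE};

# Bit values as documented for the -C switch in perlrun.
my @bits = ( [ STDIN => 1 ], [ STDOUT => 2 ], [ STDERR => 4 ], [ '@ARGV' => 32 ] );

for my $bit (@bits) {
    my ( $name, $mask ) = @$bit;
    printf "%-6s %s\n", $name, ( $flags & $mask ) ? 'UTF-8' : 'not set by -C';
}
```

Try it once as a plain run and once with perl -CSA (or PERL_UNICODE=SA in the environment) to see the difference.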

Why do my Perl tests fail with use encoding 'utf8'?

I'm puzzled with this test script:
#!perl
use strict;
use warnings;
use encoding 'utf8';
use Test::More 'no_plan';
ok('áá' =~ m/á/, 'ok direct match');
my $re = qr{á};
ok('áá' =~ m/$re/, 'ok qr-based match');
like('áá', $re, 'like qr-based match');
All three tests fail, but I was expecting that use encoding 'utf8' would upgrade both the literal áá and the qr-based regexps to utf8 strings, so that the tests would pass.
If I remove the use encoding line the tests pass as expected, but I can't figure out why they would fail in utf8 mode.
I'm using perl 5.8.8 on Mac OS X (system version).
Do not use the encoding pragma. It’s broken. (Juerd Waalboer gave a great talk where he mentioned this at YAPC::EU 2k8.)
It does at least two things at once that do not belong together:
1) It specifies an encoding for your source file.
2) It specifies an encoding for your file input/output.
And to add injury to insult it also does #1 in a broken fashion: it reinterprets \xNN sequences as being undecoded octets as opposed to treating them like codepoints, and decodes them, preventing you from being able to express characters outside the encoding you specified and making your source code mean different things depending on the encoding. That’s just astonishingly wrong.
Write your source code in ASCII or UTF-8 only. In the latter case, the utf8 pragma is the correct thing to use. If you don’t want to use UTF-8, but you do want to include non-ASCII characters, escape or decode them explicitly.
And use I/O layers explicitly or set them using the open pragma to have I/O automatically transcoded properly.
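A minimal sketch of that combination, assuming the script itself is saved as UTF-8: the utf8 pragma covers the source code, and the open pragma covers the I/O layers.

```perl
use strict;
use warnings;
use utf8;                                 # source code is UTF-8
use open qw(:std :encoding(UTF-8));       # I/O layers, including STDIN/STDOUT/STDERR

my $re = qr{á};

# Count the matches: both literals are character strings, so á is one character.
my $matches = () = 'áá' =~ /$re/g;

print "matched $matches times\n";
```

This covers the same ground as the question's tests without the encoding pragma.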
It works fine on my computer (on perl 5.10). Maybe you should try replacing that use encoding 'utf8' with use utf8.
What version of perl are you using? I think older versions had bugs with UTF-8 in regexps.
The Test::More documentation contains a fix for this issue, which I just found today (and this entry shows higher in the googles).
utf8 / "Wide character in print"
If you use utf8 or other non-ASCII characters with Test::More you might get a "Wide character in print" warning. Using binmode STDOUT, ":utf8" will not fix it. Test::Builder (which powers Test::More) duplicates STDOUT and STDERR. So any changes to them, including changing their output disciplines, will not be seen by Test::More. The workaround is to change the filehandles used by Test::Builder directly.
my $builder = Test::More->builder;
binmode $builder->output, ":utf8";
binmode $builder->failure_output, ":utf8";
binmode $builder->todo_output, ":utf8";
I added this bit of boilerplate to my testing code and it works a charm.