I have a script for reading HTML files in Perl. It works, but it breaks the encoding.
This is my script:
use utf8;
use Data::Dumper;
open my $fr, '<', 'file.html' or die "Can't open file $!";
my $content_from_file = do { local $/; <$fr> };
print Dumper($content_from_file);
Content of file.html:
<span class="previews-counter">Počet hodnotení: [%product.rating_votes%]</span>
[%L10n.msg('Zobraziť recenzie')%]
Output from reading:
<span class=\"previews-counter\">Po\x{10d}et hodnoten\x{ed}: [%product.rating_votes%]</span>
[%L10n.msg('Zobrazi\x{165} recenzie')%]
As you can see, a lot of characters are escaped. How can I read this file and show its content as-is?
You open the file with perl's default encoding:
open my $fh, '<', ...;
If that encoding doesn't match the actual encoding, Perl might translate some characters incorrectly. If you know the encoding, specify it in the open mode:
open my $fh, '<:utf8', ...;
You aren't done yet, though. Now that you have a probably decoded string, you want to output it. You have the same problem again. The standard output file handle's encoding has to match what you are trying to print to. If you've set up your terminal (or whatever) to expect UTF-8, you need to actually output UTF-8. One way to fix that is to make the standard filehandles use UTF-8:
use open qw(:std :utf8);
You have use utf8, but that only signals the encoding for your program file.
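Putting those pieces together, here is a minimal sketch of the corrected script. (It writes a small sample file first so that it is self-contained; `:encoding(UTF-8)` is the stricter spelling of `:utf8`.)

```perl
use strict;
use warnings;
use open qw(:std :utf8);    # make STDIN/STDOUT/STDERR (and default opens) UTF-8

# Write a small sample file first so the sketch is self-contained.
open my $fw, '>:encoding(UTF-8)', 'file.html' or die "Can't write file.html: $!";
print {$fw} "Po\x{10d}et hodnoten\x{ed}\n";   # "Počet hodnotení"
close $fw;

# Read it back through a UTF-8 decoding layer.
open my $fr, '<:encoding(UTF-8)', 'file.html' or die "Can't open file.html: $!";
my $content = do { local $/; <$fr> };
close $fr;

# Printing the string directly shows the characters as-is; the \x{10d}
# escapes in the question are just Data::Dumper's display format for
# wide characters, not corruption of the string itself.
print $content;
```

Note that the `\x{...}` escapes in the question's output are Data::Dumper's representation of non-ASCII characters; the string in memory was decoded correctly once the input layer matched the file's encoding.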
I've written a much longer primer for Perl and Unicode in the back of Learning Perl. The StackOverflow question Why does modern Perl avoid UTF-8 by default? has lots of good advice.
Related
I have the following code for extracting text from HTML files and writing it to a text file. The HTML contains Kannada text (UTF-8). When the program runs I get a text file, and it does contain the text, but not in a proper format; the text is unreadable.
use utf8;
use HTML::FormatText;
my $string = HTML::FormatText->format_file(
'a.html',
leftmargin => 0, rightmargin => 50
);
open mm,">t1.txt";
print mm "$string";
So please help me: how do I handle file encodings while processing files like this?
If I understand you correctly, you want the output file to be UTF-8 encoded so that the characters from the Kannada language are encoded in the output correctly. Your code is probably trying (and failing) to encode incorrectly into ISO-8859-1 instead.
If so, then what you can do is make sure your file is opened with a UTF-8 encoding filter.
use HTML::FormatText;
open my $htmlfh, '<:encoding(UTF-8)', 'a.html' or die "cannot open a.html: $!";
my $content = do { local $/; <$htmlfh> }; # read all content from file
close $htmlfh;
my $string = HTML::FormatText->format_string(
$content,
leftmargin => 0, rightmargin => 50
);
open my $mm, '>:encoding(UTF-8)', 't1.txt' or die "cannot open t1.txt: $!";
print $mm $string;
For further reading, I recommend checking out these docs:
perlunitut
perlunifaq
perlunicode
A few other notes:
The use utf8 line only makes it so that your Perl script/library may contain UTF formatting. It does not make any changes to how you read or write files.
Avoid using the two-argument form of open() like in your example. In certain cases it may allow a malicious user to compromise your system. (Though your usage in this example happens to be safe.)
When opening a file, you need to add an or die check afterwards, or failures to read or write the file will be silently ignored.
Update 3/12: I changed it to read the file in UTF-8 and send that to HTML::FormatText. If your a.html file is saved with a BOM character at the start, it may have done the right thing anyway, but this should make it always assume UTF-8 for the incoming file.
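The two-argument-open point above can be illustrated with a short sketch (the filename in `$name` is hypothetical, standing in for attacker-controlled input): with two arguments, the mode is parsed out of the string, so a leading `>` in a filename changes the mode; with three arguments, the whole string is a literal filename.

```perl
use strict;
use warnings;

my $name = '>important.txt';   # hypothetical attacker-controlled filename

# Two-argument open parses the mode out of the string, so this would
# open important.txt for writing and CLOBBER it:
#     open my $fh, $name;
#
# Three-argument open treats the whole string as a literal filename,
# so the leading '>' cannot change the mode:
my $opened = open my $fh, '<', $name;
print $opened ? "opened '$name'\n" : "cannot open '$name': $!\n";
```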
I have a folder of several hundred text files. Each file has the same format; for instance, the file with the name ATextFile1.txt reads
ATextFile1.txt 09 Oct 2013
1
2
3
4
...
I have a simplified Perl script that is supposed to read the file and print it back out in the terminal window:
#!/usr/bin/perl
use warnings;
use strict;
my $fileName = shift(@ARGV);
open(my $INFILE, "<:encoding(UTF-8)", $fileName) || die("Cannot open $fileName: $!.\n");
foreach (<$INFILE>){
print("$_"); # Uses the newline character from the file
}
When I use this script on files generated by the Windows version of the program that creates ATextFile1.txt, the output is exactly what I expect (the content of the text file). However, when I run this script on files generated by the Mac version of the same program, the output looks like the following:
2016tFile1.txt 09 Oct 2013
After some testing, it seems that it prints only the first line of the text, with the first four characters overwritten by what can be expressed as the regex /[0-9][0-9]16/. If I replace the output statement in my Perl script with print("\t$_");, I get the following line printed to STDOUT:
2016 ATextFile1.txt 09 Oct 2013
Each of these files can be read normally using any standard text editor but for some reason, my Perl script can't seem to properly read and write from the file. Any help would be greatly appreciated (I'm hoping it's something obvious that I'm missing). Thanks in advance!
Note that if you are printing UTF-8 characters to STDOUT you will need to use
binmode STDOUT, ':encoding(utf8)';
beforehand.
It looks as if your Mac files have just CR as the line ending. I understood that recent versions of Macintosh systems use LF as the line ending (the same as Linux), but Mac OS 9 uses just CR, while Windows uses the two characters CR LF inside the file, which are converted to just LF by the PerlIO layer when perl is running on a Windows platform.
If there are no linefeeds in the file, then Perl will read the entire file as a single record, and printing it will overlay all lines on top of one another.
As long as the files are relatively small, the easiest way to read either file format with the same Perl code is to read the whole file and split it on either CR or LF. Anything else will need different code according to the source of the input files.
Try this version of your code.
use strict;
use warnings;
my @contents = do {
    open my $fh, '<:encoding(UTF-8)', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    local $/;
    my $contents = <$fh>;
    split /[\r\n]+/, $contents;
};
print "$_\n" for @contents;
Update
One alternative you might try is to use the PerlIO::eol module, which provides a PerlIO layer that translates any line ending to LF when the record is read. I'm not certain that it plays nice with UTF-8, but as long as you add it after the encoding layer it should be fine.
It is not a core module so you will probably need to install it, but after that the program becomes just
use strict;
use warnings;
open my $fh, '<:encoding(UTF-8):eol(LF)', $ARGV[0];
binmode STDOUT, ':encoding(utf8)';
print while <$fh>;
I have created Windows-, Linux-, and Mac-style text files and this program works fine with all of them, but I have been unable to check whether a UTF-8 character that has 0x0D or 0x0A as part of its encoding is passed through properly, so be careful.
Update 2
After thinking briefly about this: of course there are no UTF-8 encodings that contain CR or LF apart from those characters themselves. All characters outside the ASCII range are encoded using only bytes with the top bit set, so every byte is 0x80 or above and can never be 0x0D or 0x0A.
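The byte-structure argument above can be checked directly with the core Encode module (the character ö here is an arbitrary choice):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode one non-ASCII character (U+00F6, "ö") and look at the raw
# bytes of its UTF-8 encoding.
my @bytes = unpack 'C*', encode('UTF-8', "\x{F6}");
printf "0x%02X\n", $_ for @bytes;   # 0xC3, 0xB6

# Every byte of a multi-byte UTF-8 sequence has the top bit set, so
# none of them can be mistaken for CR (0x0D) or LF (0x0A).
die "unexpected byte below 0x80" if grep { $_ < 0x80 } @bytes;
```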
I am scraping a German-language site and trying to store its content in a CSV file using Perl, but I am getting garbage values in the CSV. The code I use is:
open my $fh, '>> :encoding(UTF-8)', 'output.csv';
print {$fh} qq|"$title"\n|;
close $fh;
For example: I expect Weiß, Römersandalen, but I get mojibake (garbled characters) instead.
Update :
Code
use strict;
use warnings;
use utf8;
use WWW::Mechanize::Firefox;
use autodie qw(:all);
my $m = WWW::Mechanize::Firefox->new();
print "\n\n *******Program Begins********\n\n";
$m->get($url) or die "unable to get $url";
my $Home_Con=$m->content;
my $title='';
if($Home_Con=~m/<span id="btAsinTitle">([^<]*?)<\/span>/is){
$title=$1;
print "title ::$1\n";
}
open my $fh, '>> :encoding(UTF-8)', 's.txt'; #<= (Weiß)
print {$fh} qq|"$title"\n|;
close $fh;
open $fh, '>> :encoding(UTF-8)', 's1.csv'; #<= (Weiß)
print {$fh} qq|"$title"\n|;
close $fh;
print "\n\n *******Program ends********";
<>;
This is the relevant part of the code. The method works fine for text files, but not for the CSV.
You've shown us the code where you're encoding the data correctly as you write it to the file.
What we also need to see is how the data gets into your program. Are you decoding it correctly at that point?
Update:
If the code was really just my $title='Weiß ,Römersandalen' as you say in the comments, then the solution would be as simple as adding use utf8 to your code.
The point is that Perl needs to know how to interpret the stream of bytes that it's dealing with. Outside your program, data exists as bytes in various encodings. You need to decode that data as it enters your program (decoding turns a stream of bytes into a string of characters) and encode it again as it leaves your program. You're doing the encoding step correctly, but not the decoding step.
The reason that use utf8 fixes that in the simple example you've given is that use utf8 tells Perl that your source code should be interpreted as a stream of bytes encoded as utf8. It then converts that stream of bytes into a string of characters containing the correct characters for 'Weiß ,Römersandalen'. It can then successfully encode those characters into bytes representing those characters encoded as utf8 as they are written to the file.
Your data is actually coming from a web page. I assume you're using LWP::Simple or something like that. That data might be encoded as utf8 (I doubt it, given the problems you're having) but it might also be encoded as ISO-8859-1 or ISO-8859-9 or CP1252 or any number of other encodings. Unless you know what the encoding is and correctly decode the incoming data, you will see the results that you are getting.
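A sketch of the missing decode step described above, using the core Encode module. The byte string here is a stand-in for whatever the scraper actually returned; with a real page you would take the encoding name from the HTTP headers or the page's meta tag:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for raw bytes returned by the scraper: "Weiß" encoded as UTF-8.
my $raw_bytes = "Wei\xC3\x9F";

# Decode bytes -> characters on the way IN ...
my $title = decode('UTF-8', $raw_bytes);

# ... and let the ':encoding(UTF-8)' layer encode characters -> bytes
# on the way OUT, exactly as in the question's code.
open my $fh, '>>:encoding(UTF-8)', 'output.csv' or die "cannot open output.csv: $!";
print {$fh} qq|"$title"\n|;
close $fh;
```

With both steps in place, the string inside the program is a four-character string of characters, not a five-byte string of bytes, and it round-trips to the file correctly.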
Check whether there are any weird characters at the start of the file, or anywhere in it, using commands like head or tail.
I'm new to Perl.
I was getting a warning from my print statement: "Wide character in print"
And adding this line of code made it work: binmode(STDOUT, ":utf8");
I read the docs. Simply put, binmode (with an encoding layer) encodes your characters in a manner that the platform can understand.
Without it, the platform may take the bytes to mean something else, because it is using a different encoding.
Or is my understanding of binmode off?
Is there a way with Perl to find out what encoding the platform is using?
use open ':std', ':locale';
can help. Doesn't work on all systems, though.
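One way to ask the platform directly is the core I18N::Langinfo module, which is what the ':locale' layer keys off. This is a sketch for POSIX-ish systems; support on Windows is spotty, and the exact string returned depends on your locale:

```perl
use strict;
use warnings;
use I18N::Langinfo qw(langinfo CODESET);

# CODESET names the character encoding of the current locale.
# Typical answers: "UTF-8", or "ANSI_X3.4-1968" (plain ASCII)
# in the default "C" locale.
my $codeset = langinfo(CODESET);
print "locale encoding: $codeset\n";
```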
I'm not talking about reading file content in UTF-8 or non-UTF-8 encoding and such. It's about file names. Usually I save my Perl scripts in the system default encoding, "GB2312" in my case, and I don't have any file-open problems. But for processing purposes I now have some Perl scripts saved in UTF-8 encoding. The problem is that these scripts cannot open files whose names consist of characters encoded in GB2312, and I don't like the idea of having to rename my files.
Does anyone happen to have any experience in dealing with this kind of situation? Thanks like always for any guidance.
Edit
Here's the minimized code to demonstrate my problem:
#!perl -w
# I'm running ActivePerl 5.10.1 on Windows XP (Simplified Chinese version)
# The file system is NTFS
use autodie;
my $file = "./测试.txt"; #the file name consists of two Chinese characters
open my $in,'<',"$file";
while (<$in>){
print;
}
This test script runs fine if saved in "ANSI" encoding (I assume ANSI encoding is the same as GB2312, which is used to display Chinese characters). But it won't work if saved as "UTF-8", and the error message is as follows:
Can't open './娴嬭瘯.txt' for reading: 'No such file or directory'.
In this warning message, "娴嬭瘯" are meaningless junk characters.
Update
I tried first encoding the file name as GB2312 but it does not seem to work :(
Here's what I tried:
#!perl -w
use autodie;
use Encode;
my $file = "./测试.txt";
encode("gb2312", decode("utf-8", $file));
open my $in,'<',"$file";
while (<$in>){
print;
}
My current thinking is: the file name in my OS is 测试.txt but it is encoded as GB2312. In the Perl script the file name looks the same to human eyes, still 测试.txt. But to Perl, they are different because they have different internal representations. But I don't understand why the problem persists when I already converted my file name in Perl to GB2312 as shown in the above code.
Update
I made it, finally made it :)
@brian's suggestion is right. I made a mistake in the above code: I didn't assign the encoded file name back to $file.
Here's the solution:
#!perl -w
use autodie;
use Encode;
my $file = "./测试.txt";
$file = encode("gb2312", decode("utf-8", $file));
open my $in,'<',"$file";
while (<$in>){
print;
}
If you
use utf8;
in your Perl script, that merely tells perl that the source is in UTF-8. It doesn't affect how perl deals with the outside world. Are you turning on any other Perl Unicode features?
Are you having problems with every filename, or just some of them? Can you give us some examples, or a small demonstration script? I don't have a filesystem that encodes names as GB2312, but have you tried encoding your filenames as GB2312 before you call open?
If you want specific strings encoded with a specific encoding, you can use the Encode module. Try that with your filenames that you give to open.