How to fix UTF-8 encoding error with Russian words - perl

My Perl script reads from an text file which contains mainly English ANSI words.
But there are Russian words sometimes, which I can not convert back to UTF-8.
See same example (the words in brackets are the English translations):
Êîìïîíåíò (Component)
Àâòîð (Author)
Ãýíäàëüô (Gandalf)
Äàòà ñîçäàíèÿ (Create date): 20-ìàé(may)-2003
Äàòà êîððåêöèè (Last correction Date): 25-ìàð(mar)-2003
Âåðñèÿ (Version): 0.92
Áëàãîäàðíîñòè (Thanks):
Íîâîå â (New in):
Ïîääåðæêà (Support)
Î÷åíü ìíîãî (Very much)
I've read the UTF-8 Encoding Debugging Chart and tried also the following
$s='Àâòîð';
from_to($s, "iso-8859-5","utf-8");
print "$s\n";
my $s = Encode::decode( 'iso-8859-5', 'Àâòîð' );
from_to($s, "iso-8859-5","utf-8");
print "$s\n";
I've tried also cp1252 instead of iso-8859-5, but nothing helps.
I've tried also Encode::Guess, but the result is not helpful: iso-8859-5 or cp1251 or koi8-r or iso-8859-1.
Any idea how to convert 'Àâòîð' back to the Cyrillic text 'автор'?

After some tries, I get the expected output Автор when switching the (Windows) console code page to 65001 (UTF-8) and decoding the input data from Windows-1251:
perl -MEncode -wle "print encode('UTF-8',decode('Windows-1251',shift))" "Àâòîð"
This suggests that the input data is encoded as Windows-1251 and decoding from that should give you the cyrrillic letters in Unicode. To output the data to a file, make sure you either set the encoding when opening the file (easiest) or encode each string to the target encoding on output (hard to keep track of):
my $octets = <$input_file>;
my $data = decode('Windows-1251', $octets;
open my $fh, '>:encoding(UTF-8)', $filename
or die "Couldn't write to $filename: $!";
print $fh decode('Windows-1251', $data);

Your bytes sequence is 0xc0 0xe2 0xf2 0xee 0xf0. This is russian word 'author' in cp1251. Representation given by you can be get if your application assumes that this is cp1252 encoding. Now the question is here what codepage do you like to have? Or, what codepage needed to your application?
To read file in cp1251 in correct way you have to use construction like this:
open (my $tmp_h,"<:encoding(cp-1251)", $ARGV[0]) or die $!;
That allows perl to know what codepage do you use in your file. And then when you will read file into string it allows perl to correctly convert values from cp1251 to Perl's internal form (UTF-8) and use these string as you want without any problems.
For internal form perl set UTF8 flag you can check using Devel::Peek module.
I think, that using internal form also will give you chance to use any string operation correctly and will help avoid mistakes.
I would recommend to use "use utf8" pragma in our source code. Now, all literals in the source code will be threated as utf8 and automatically converted into internal form correctly. Now, we know that our source code is in UTF8 (and it would also better if with BOM, because detecting BOM usualy is the first thing different IDE and editor will typical do). Later, we can open other files in any encoding using "<:encoding(....)" construction get data from the web, from the databases and again make sure that data were converted into internal form correctly checking utf8 flag. Having all this, we would be able to work with all this data in one manner, correcly compare string, use regular expression and so on.

Related

In what encoding does readpipe return the result of an executed command?

Here's a simple perl script that is supposed to write a utf-8 encoded file:
use warnings;
use strict;
open (my $out, '>:encoding(utf-8)', 'tree.out') or die;
print $out readpipe ('tree ~');
close $out;
I have expected readpipe to return a utf-8 encoded string since LANG is set toen_US.UTF-8. However, looking at tree.out (while making sure the editor recognizes it a as utf-8 encoded) shows me all garbled text.
If I change the >:encoding(utf-8) in the open statement to >:encoding(latin-1), the script creates a utf-8 file with the expected
text.
This is all a bit strange to me. What is the explanation for this behavior?
readpipe is returning to perl a string of undecoded bytes. We know that that string is UTF-8 encoded, but you've not told Perl.
The IO layer on your output handle is taking that string, assuming it is Unicode code-points and re-encoding them as UTF-8 bytes.
The reason that the latin-1 IO layer appears to be functioning correctly is that it is writing out each undecoded byte unmolested because the 1st 256 unicode code-points correspond nicely with latin-1.
The proper thing to do would be to decode the byte-string returned by readpipe into a code-point-string, before feeding it to an IO-layer. The statement use open ':utf8', as mentioned by Borodin, should be a viable solution as readpipe is specifically mentioned in the open manual page.

Corrupt Spanish characters when saving variables to a text file in Perl

I think I have an encoding problem. My knowledge of perl is not great. Much better with other languages, but I have tried everything I can think of and checked lots of other posts.
I am collecting a name and address. This can contain non english characters. In this case Spanish.
A php process uses curl to execute a .pl script and passes the values URLEncoded
The .pl executes a function in a .pm which writes the data to a text file. No database is involved.
Both the .pl and .pm have
use Encode;
use utf8;
binmode (STDIN, 'utf8');
binmode (STDOUT, 'utf8');
defined. Below is the function which is writing the text to a file
sub bookingCSV(#){
my $filename = "test.csv";
utf8::decode($_[1]{booking}->{LeadNameFirst});
open OUT, ">:utf8", $filename;
$_="\"$_[1]{booking}->{BookingNo}¦¦$_[1]{booking}->{ShortPlace}¦¦$_[1]{booking}->{ShortDev}¦¦$_[1]{booking}->{ShortAcc}¦¦$_[1]{booking}->{LeadNameFirst}¦¦$_[1]{booking}->{LeadNameLast}¦¦$_[1]{booking}->{Email}¦¦$_[1]{booking}->{Telephone}¦¦$_[1]{booking}->{Company}¦¦$_[1]{booking}->{Address1}¦¦$_[1]{booking}->{Address2}¦¦$_[1]{booking}->{Town}¦¦$_[1]{booking}->{County}¦¦$_[1]{booking}->{Zip}¦¦$_[1]{booking}->{Country}¦¦";
print OUT $_;
close (OUT);
All Spanish characters are corrupted in the text file. I have tried decode on one specific field "LeadNameFirst" but that has not made a difference. I left the code in place just in case it is useful.
Thanks for any help.
What is the encoding of the input? If the input encoding is not utf-8, then it will not do you any good to decode it as utf-8 input.
Does the input come from an HTML form? Then the encoding probably matches the encoding of the web page it came from. ISO-8859-1 is a common default encoding for American/European locales. Anyway, once you discover the encoding, you can decode the input with it:
$name = decode('iso-8859-1',$_[1]{booking}->{LeadNameFirst});
print OUT "name is $name\n"; # utf8 layer already enabled
Some browsers look for and respect a accept-charset attribute inside a <form> tag, e.g.,
<form action="/my_form_processor.php" accept-charset="UTF-8">
...
</form>
This will (cross your fingers) cause you to receive the form input as utf-8 encoded.

How do I find "wide characters" printed by perl?

A perl script that scrapes static html pages from a website and writes them to individual files appears to work, but also prints many instances of wide character in print at ./script.pl line n to console: one for each page scraped.
However, a brief glance at the html files generated does not reveal any obvious mistakes in the scraping. How can I find/fix the problem character(s)? Should I even care about fixing it?
The relevant code:
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
...
foreach (#urls) {
$mech->get($_);
print FILE $mech->content; #MESSAGE REFERS TO THIS LINE
...
This is on OSX with Perl 5.8.8.
If you want to fix up the files after the fact, then you could pipe them through fix_latin which will make sure they're all UTF-8 (assuming the input is some mixture of ASCII, Latin-1, CP1252 or UTF-8 already).
For the future, you could use $mech->response->decoded_content which should give you UTF-8 regardless of what encoding the web server used. The you would binmode(FILE, ':utf8') before writing to it, to ensure that Perl's internal string representation is converted to strict UTF-8 bytes on output.
I assume you're crawling images or something of that sort, anyway you can get around the problem by adding binmode(FILE); or if they are webpages and UTF-8 then try binmode( FILE, ':utf8' ). See perldoc -f binmode, perldoc perlopentut, and perldoc PerlIO for more information..
The ":bytes", ":crlf", and ":utf8", and any other directives of the form ":...", are called I/O layers. The "open" pragma can be used to establish default I/O layers. See open.
To mark FILEHANDLE as UTF-8, use ":utf8" or ":encoding(utf8)". ":utf8" just marks the data as UTF-8 without further checking, while ":encoding(utf8)" checks the data for actually being
valid UTF-8. More details can be found in PerlIO::encoding.

How can I convert an input file to UTF-8 encoding in Perl?

I already know how to convert the non-utf8-encoded content of a file line by line to UTF-8 encode, using something like the following code:
# outfile.txt is in GB-2312 encode
open my $filter,"<",'c:/outfile.txt';
while(<$filter>){
#convert each line of outfile.txt to UTF-8 encoding
$_ = Encode::decode("gb2312", $_);
...}
But I think Perl can directly encode the whole input file to UTF-8 format, so I've tried something like
#outfile.txt is in GB-2312 encode
open my $filter,"<:utf8",'c:/outfile.txt';
(Perl says something like "utf8 "\xD4" does not map to Unicode" )
and
open my $filter,"<",'c:/outfile.txt';
$filter = Encode::decode("gb2312", $filter);
(Perl says "readline() on unopened filehandle!)
They don't work. But is there some way to directly convert the input file to UTF-8 encode?
Update:
Looks like things are not as simple as I thought. I now can convert the input file to UTF-8 code in a roundabout way. I first open the input file and then encode the content of it to UTF-8 and then output to a new file and then open the new file for further processing. This is the code:
open my $filter,'<:encoding(gb2312)','c:/outfile.txt';
open my $filter_new, '+>:utf8', 'c:/outfile_new.txt';
print $filter_new $_ while <$filter>;
while (<$filter_new>){
...
}
But this is too much work and it is even more troublesome than simply encode the content of $filter line by line.
I think I misunderstood your question. I think what you want to do is read a file in a non-UTF-8 encoding, then play with the data as UTF-8 in your program. That's something much easier. After you read the data with the right encoding, Perl represents it internally as UTF-8. So, just do what you have to do.
When you write it back out, use whatever encoding you want to save it as. However, you don't have to put it back in a file to use it.
old answer
The Perl I/O layers only read the data assuming it's already properly encoded. It's not going to convert encoding for you. By telling open to use utf8, you're telling it that it already is utf8.
You have to use the Encode module just as you've shown (unless you want to write your own I/O layer). You can convert bytes to UTF-8, or if you know the encoding, you can convert from one encoding to another. Since it looks like you already know the encoding, you might want the from_to() function.
If you're just starting out with Perl and Unicode, go through Juerd's Perl Unicode Advice before you do anything.
The :encoding layer will return UTF-8, suitable for perl's use. That is, perl will recognize each character as a character, even if they are multiple bytes. Depending on what you are going to do next with the data, this may be adequate.
But if you are doing something with the data where perl will try to downgrade it from utf8, you either need to tell perl not to (for instance, doing a binmode(STDOUT, ":utf8") to tell perl that output to stdout should be utf8), or you need to have perl treat your utf8 as binary data (interpreting each byte separately, and knowing nothing about the utf8 characters.)
To do that, all you need is to apply an additional layer to your open:
open my $foo, "<:encoding(gb2312):bytes", ...;
Note that the output of the following will be the same:
perl -we'open my $foo, "<:encoding(gb2312):bytes", "foo"; $bar = <$foo>; print $bar'
perl -CO -we'open my $foo, "<:encoding(gb2312)", "foo"; $bar = <$foo>; print $bar'
but in one case, perl knows that data read is utf8 (and so length($bar) will report the number of utf8 characters) and has to be explicitly told (by -CO) that STDOUT will accept utf8, and in the other, perl makes no assumptions about the data (and so length($bar) will report the number of bytes), and just prints it out as is.

With a utf8-encoded Perl script, can it open a filename encoded as GB2312?

I'm not talking about reading in the file content in utf-8 or non-utf-8 encoding and stuff. It's about file names. Usually I save my Perl script in the system default encoding, "GB2312" in my case and I won't have any file open problems. But for processing purposes, I'm now having some Perl script files saved in utf-8 encoding. The problem is: these scripts cannot open the files whose names consist of characters encoded in "GB2312" encoding and I don't like the idea of having to rename my files.
Does anyone happen to have any experience in dealing with this kind of situation? Thanks like always for any guidance.
Edit
Here's the minimized code to demonstrate my problem:
# I'm running ActivePerl 5.10.1 on Windows XP (Simplified Chinese version)
# The file system is NTFS
#!perl -w
use autodie;
my $file = "./测试.txt"; #the file name consists of two Chinese characters
open my $in,'<',"$file";
while (<$in>){
print;
}
This test script can run well if saved in "ANSI" encoding (I assume ANSI encoding is the same as GB2312, which is used to display Chinese charcters). But it won't work if saved as "UTF-8" and the error message is as follows:
Can't open './娴嬭瘯.txt' for reading: 'No such file or directory'.
In this warning message, "娴嬭瘯" are meaningless junk characters.
Update
I tried first encoding the file name as GB2312 but it does not seem to work :(
Here's what I tried:
#!perl -w
use autodie;
use Encode;
my $file = "./测试.txt";
encode("gb2312", decode("utf-8", $file));
open my $in,'<',"$file";
while (<$in>){
print;
}
My current thinking is: the file name in my OS is 测试.txt but it is encoded as GB2312. In the Perl script the file name looks the same to human eyes, still 测试.txt. But to Perl, they are different because they have different internal representations. But I don't understand why the problem persists when I already converted my file name in Perl to GB2312 as shown in the above code.
Update
I made it, finally made it :)
#brian's suggestion is right. I made a mistake in the above code. I didn't give the encoded file name back to the $file.
Here's the solution:
#!perl -w
use autodie;
use Encode;
my $file = "./测试.txt";
$file = encode("gb2312", decode("utf-8", $file));
open my $in,'<',"$file";
while (<$in>){
print;
}
If you
use utf8;
in your Perl script, that merely tells perl that the source is in UTF-8. It doesn't affect how perl deals with the outside world. Are you turning on any other Perl Unicode features?
Are you having problems with every filename, or just some of them? Can you give us some examples, or a small demonstration script? I don't have a filesystem that encodes names as GB2312, but have you tried encoding your filenames as GB2312 before you call open?
If you want specific strings encoded with a specific encoding, you can use the Encode module. Try that with your filenames that you give to open.