Reading Cyrillic characters from a file in Perl

I'm having trouble reading Cyrillic characters from a file in perl.
The text file is written in Notepad and contains "абвгдежзийклмнопрстуфхцчшщъьюя".
Here's my code:
#!/usr/bin/perl
use warnings;
use strict;
open FILE, "text.txt" or die $!;
while (<FILE>) {
print $_;
}
If I save the text file using the ANSI encoding, I get:
рстуфхцчшщъыьэюяЁёЄєЇїЎў°∙·№■
If I save it using the UTF-8 encoding, and I use the function decode('UTF-8', $_) from the package Encode, I get:
Wide character in print at test.pl line 11, <TEXT> line 1.
and a bunch of unreadable characters.
I'm using the command prompt in Windows 7 x64.

You're decoding your inputs, but "forgot" to encode your outputs.
Your file is probably encoded using cp1251.
Your terminal expects cp866.
Use
use open ':std', ':encoding(cp866)';
use open IO => ':encoding(cp1251)';
open(my $FILE, '<', 'text.txt')
or die $!;
or
use open ':std', ':encoding(cp866)';
open(my $FILE, '<:encoding(cp1251)', 'text.txt')
or die $!;
Use :encoding(UTF-8) instead of :encoding(cp1251) if you saved as UTF-8.
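The mismatch between the two code pages can be checked directly on the bytes. The following is a minimal sketch using the core Encode module; the helper name cp1251_to_cp866 is made up for illustration, and it assumes the file really is cp1251 and the console really is cp866, as guessed above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: take raw cp1251 bytes (the assumed "ANSI" encoding of
# the file) and return the bytes a cp866 console expects.
sub cp1251_to_cp866 {
    my ($bytes) = @_;
    my $text = decode('cp1251', $bytes);    # bytes -> Perl characters
    return encode('cp866', $text);          # characters -> console bytes
}

# "абв" is E0 E1 E2 in cp1251 and A0 A1 A2 in cp866.
my $console_bytes = cp1251_to_cp866("\xE0\xE1\xE2");
print join(' ', map { sprintf '%02X', ord } split //, $console_bytes), "\n";
# prints "A0 A1 A2"
```

This also explains the output in the question: byte E0 is "а" in cp1251 but "р" in cp866, so undecoded cp1251 text shows up on the console starting with "рсту...".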

Related

Perl (wrong?) encoding of output file

I am running Active Perl 5.16.3 on Windows 7 (32 bits).
My (short) program massages an input text file (encoded in UTF-8). I wish the output encoding to be in Latin1, so my code is:
open (OUT, '>;encoding(Latin1)', "out.txt") || die "Cannot open output file: $!\n";
print OUT "$string\n";
yet the resulting file is still in UTF-8. What am I doing wrong?
Firstly, the encoding layer is separated from the open mode by a colon, not a semicolon.
open OUT, '>:encoding(latin1)', "out.txt" or die "Cannot open output file: $!\n";
Secondly, Latin-1 can only encode a small subset of UTF-8. Furthermore, most of this subset is encoded the same in both encodings. We therefore have to use a test file with characters that are not encoded the same, e.g. \N{MULTIPLICATION SIGN} U+00D7 ×, which is \xD7 in Latin-1, and \xC3\x97 in UTF-8.
Also make sure that you actually decode the input file.
Here is how you could generate the test file:
$ perl -CSA -E'say "\N{U+00D7}"' > input.txt
Here is how you can test that you are properly recoding the file:
use strict;
use warnings;
use autodie;
open my $in, "<:encoding(UTF-8)", "input.txt";
open my $out, ">:encoding(latin1)", "output.txt";
while (<$in>) {
print { $out } $_;
}
The input.txt and output.txt should afterwards have different lengths (3 bytes → 2 bytes).
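The size check can be scripted as well. This is a sketch under the assumption that it runs somewhere without CRLF translation (for example Linux or macOS); on Windows the default text mode is likely to add one byte to each count. It generates its own input in a temporary directory rather than using the shell one-liner.

```perl
use strict;
use warnings;
use autodie;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my ($in_name, $out_name) = ("$dir/input.txt", "$dir/output.txt");

# Write U+00D7 plus a newline as UTF-8 (bytes C3 97 0A).
open my $gen, '>:encoding(UTF-8)', $in_name;
print {$gen} "\N{U+00D7}\n";
close $gen;

# Recode: decode on read, encode on write.
open my $in,  '<:encoding(UTF-8)',  $in_name;
open my $out, '>:encoding(latin1)', $out_name;
print {$out} $_ while <$in>;
close $in;
close $out;

my $in_size  = -s $in_name;
my $out_size = -s $out_name;
print "input: $in_size bytes, output: $out_size bytes\n";
```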

How can I get Perl to respect the locale encoding for STDIN/STDOUT/STDERR, without affecting file IO?

What is the best way to ensure Perl uses the locale encoding (as in LANG=en_US.UTF-8) for STDIN/STDOUT/STDERR, without affecting file IO?
If I use
use open ':locale';
say "mañana";
open (my $f, '>', 'test.txt'); say $f "mañana";
then the locale encoding is used for STDIN/STDOUT/STDERR, but also in test.txt, which is not very well-behaved: you don't want the encoding of a file to depend on the way you logged in.
To add the encoding layers to STDIN, STDOUT and STDERR, you need to use
use open ':std', ':locale';
instead of
use open ':locale';
But that doesn't just add an encoding layer to STDIN, STDOUT and STDERR; it causes the same layer to be added to file handles opened in scope by default. So we need to override that default with
open(my $fh, '>:encoding(UTF-8)', $qfn)
or
use open ':encoding(UTF-8)';
open(my $fh, '>', $qfn)
All together:
use open ':std', ':locale';
use open ':encoding(UTF-8)';
open(my $fh_txt, '>', $qfn); # Text
open(my $fh_bin, '>:raw', $qfn); # Binary
or
use open ':std', ':locale';
open(my $fh_txt, '>:encoding(UTF-8)', $qfn); # Text
open(my $fh_bin, '>:raw', $qfn); # Binary
Result:
my $s = chr(0xE9);
say $s; # U+E9 encoded as per locale
say $fh_txt $s; # U+E9 encoded using UTF-8
say $fh_bin $s; # Byte E9
(You can use binmode($fh); instead of :raw for binary files, if you prefer.)
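The file-handle half of this can be verified by inspecting the bytes that end up on disk. The sketch below deliberately leaves out use open ':std', ':locale', since its effect depends on the environment; the temporary file names and the hex_bytes helper are made up for illustration.

```perl
use strict;
use warnings;
use autodie;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my $s   = chr(0xE9);    # a single character, U+00E9

open my $fh_txt, '>:encoding(UTF-8)', "$dir/text.txt";     # Text
open my $fh_bin, '>:raw',             "$dir/binary.bin";   # Binary
print {$fh_txt} $s;
print {$fh_bin} $s;
close $fh_txt;
close $fh_bin;

# Hypothetical helper: return a file's raw bytes as a hex string.
sub hex_bytes {
    my ($path) = @_;
    open my $fh, '<:raw', $path;
    local $/;                      # slurp the whole file
    my $bytes = <$fh>;
    close $fh;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

print "text:   ", hex_bytes("$dir/text.txt"),   "\n";    # C3 A9
print "binary: ", hex_bytes("$dir/binary.bin"), "\n";    # E9
```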

Perl program keeps waiting when opening a file using utf8

I am trying to read a UTF-8 encoded XML file. The file is around 8 MB in size and contains only one line.
I used the following code to open this single-line XML file:
open(INP,"<:utf8","$infile") or die "Couldn't open file passed as input, $!";
local $/ = undef;
my $inputfile = <INP>;
print $inputfile; ## Not working..
But after this line the program gets stuck and keeps waiting.
I have tried other methods like binmode and decode, but I get the same issue.
The same program works when I change the file-opening code above to:
open(INP,"$infile") or die "Couldn't open file passed as input, $!";
local $/ = undef;
my $inputfile = <INP>;
print $inputfile; ## It works..
open(INP,"$infile") or die "Couldn't open file passed as input, $!";
binmode(INP, ":utf8");
local $/ = undef;
my $inputfile = <INP>;
print $inputfile; ## Not working..
Can you please help me see what I am doing wrong here? I need to perform some operations on the input data and produce UTF-8 encoded output.
I tried your last snippet here (Ubuntu 12.04, perl 5.14.2) and it works as expected. The only problem I have is a difference between the input and the output: the input file is UTF-8 and the output is ISO-8859-1.
When I add
use utf8;
use open qw(:std :utf8);
this problem is gone, though. So this must be an environment issue.
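For the stated goal (read UTF-8, operate on the data, write UTF-8), explicit layers on each handle avoid relying on the environment. This is only a sketch: the file names and the tiny XML sample are invented stand-ins for the real 8 MB file, and uc stands in for whatever operation is actually needed.

```perl
use strict;
use warnings;
use autodie;
use File::Temp qw(tempdir);

my $dir    = tempdir(CLEANUP => 1);
my $infile = "$dir/in.xml";    # hypothetical stand-in for the real input

# Create a small one-line UTF-8 sample.
open my $gen, '>:encoding(UTF-8)', $infile;
print {$gen} "<name>ma\N{U+00F1}ana</name>";
close $gen;

# Slurp the whole file as decoded characters.
my $content = do {
    open my $in, '<:encoding(UTF-8)', $infile;
    local $/;    # slurp mode, limited to this block
    <$in>;
};

# Operate on characters, then encode again on output.
my $changed = uc $content;
open my $out, '>:encoding(UTF-8)', "$dir/out.xml";
print {$out} $changed;
close $out;

printf "%d characters read\n", length($content);
```

Keeping local $/ inside the do block restores normal line-by-line reading for the rest of the program.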

Copy/rename images with utf8 names using csv file

I'm working on a script to batch rename and copy images based on a CSV file. The CSV consists of column 1: old name and column 2: new name. I want to use the CSV file as input for the Perl script so that it checks the old name and makes a copy using the new name in a new folder. The problem that (I think) I'm having has to do with the image names. They contain non-ASCII characters like ß. When I run the script it prints out this: Barfu├ƒg├ñsschen where it should be Barfußgässchen, followed by this error:
Unsuccessful stat on filename containing newline at C:/Perl64/lib/File/Copy.pm line 148, <$INFILE> line 1.
Copy failed: No such file or directory at X:\Script directory\correction.pl line 26, <$INFILE> line 1.
I know it has to do with binmode and utf8, but even when I try a simple script (seen here: How can I output UTF-8 from Perl?):
use strict;
use utf8;
my $str = 'Çirçös';
binmode(STDOUT, ":utf8");
print "$str\n";
it prints out this: Ãirþ÷s
This is my entire script. Can someone explain to me where I'm going wrong? (It's not the cleanest code because I was testing things out.)
use strict;
use warnings;
use File::Copy;
use utf8;
my $inputfile = shift || die "give input!\n";
#my $outputfile = shift || die "Give output!\n";
open my $INFILE, '<', $inputfile or die "In use / not found :$!\n";
#open my $OUTFILE, '>', $outputfile or die "In use / not found :$!\n";
binmode($INFILE, ":encoding(utf8)");
#binmode($OUTFILE, ":encoding(utf8)");
while (<$INFILE>) {
s/"//g;
my @elements = split /;/, $_;
my $old = $elements[1];
my $new = "new/$elements[3]";
binmode STDOUT, ':utf8';
print "$old | $new\n";
copy("$old","$new") or die "Copy failed: $!";
#copy("Copy.pm",\*STDOUT);
# my $output_line = join(";", @elements);
# print $OUTFILE $output_line;
#print "\n"
}
close $INFILE;
#close $OUTFILE;
exit 0;
You need to ensure every step of the process is using UTF-8.
When you create the input CSV, you need to make sure that it's saved as UTF-8, preferably without a BOM. Windows Notepad will add a BOM so try Notepad++ instead which gives you more control of the encoding.
You also have the problem that the Windows console is not UTF-8 compliant by default. See Unicode characters in Windows command line - how?. Either set the codepage with chcp 65001 or don't change the STDOUT encoding.
In terms of your code, the first error regarding the newline is probably due to the trailing newline on each line read from the CSV. Add chomp; right after while (<$INFILE>) {
Update:
To "address" the file you need to encode your filenames in the encoding the system expects. See How do you create unicode file names in Windows using Perl and What is the universal way to use file I/O API with unicode filenames?. Assuming you're using Western 1252 / Latin, and with the Encode module loaded, your copy command will look like:
copy(encode("cp1252", $old), encode("cp1252", $new))
Your open should encode the filename as well:
open my $INFILE, '<', encode("cp1252", $inputfile)
Update 2:
As you're running in a DOS window, remove binmode(STDOUT, ":utf8"); and leave the default codepage in place.
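Put together, the chomp and the filename encoding might look like the sketch below. The helper name names_from_line and the sample line are made up; it assumes a two-column "old;new" layout as described in the question, that the line has already been decoded from UTF-8, and that the system code page is cp1252.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn one decoded CSV line ("old;new") into a pair of
# byte-string filenames suitable for File::Copy::copy.
# Assumption: the system code page is cp1252.
sub names_from_line {
    my ($line) = @_;
    chomp $line;          # drop the trailing newline that broke stat()
    $line =~ s/"//g;      # strip quotes, as in the original script
    my ($old, $new) = split /;/, $line;
    return map { encode('cp1252', $_) } $old, "new/$new";
}

my ($old, $new) = names_from_line(
    qq{"Barfu\N{U+00DF}g\N{U+00E4}sschen.jpg";"street.jpg"\n}
);

# Show the bytes that would be handed to copy($old, $new).
print join(' ', map { sprintf '%02X', ord } split //, $old), "\n";
```

The returned strings are bytes, not characters, so they should be passed to copy as they are and not printed through a handle that encodes again.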

How can I output UTF-8 from Perl?

I am trying to write a Perl script using the utf8 pragma, and I'm getting unexpected results. I'm using Mac OS X 10.5 (Leopard), and I'm editing with TextMate. All of my settings for both my editor and operating system are defaulted to writing files in utf-8 format.
However, when I enter the following into a text file, save it as a ".pl", and execute it, I get the friendly "diamond with a question mark" in place of the non-ASCII characters.
#!/usr/bin/env perl -w
use strict;
use utf8;
my $str = 'Çirçös';
print( "$str\n" );
Any idea what I'm doing wrong? I expect to get 'Çirçös' in the output, but I get '�ir��s' instead.
use utf8; does not enable Unicode output - it enables you to type Unicode in your program. Add this to the program, before your print() statement:
binmode(STDOUT, ":utf8");
See if that helps. That should make STDOUT output in UTF-8 instead of ordinary ASCII.
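The difference between characters in the program and bytes on output can be made visible with an in-memory file handle. This is a small sketch, assuming the script itself is saved as UTF-8; the $buffer variable is only there so the encoded bytes can be counted.

```perl
use strict;
use warnings;
use utf8;    # the source file itself is saved as UTF-8

my $str = 'Çirçös';

# Print to an in-memory handle with a UTF-8 layer so the bytes can be inspected.
my $buffer = '';
open my $fh, '>:encoding(UTF-8)', \$buffer
    or die "Cannot open in-memory handle: $!";
print {$fh} $str;
close $fh;

printf "%d characters, %d bytes\n", length($str), length($buffer);
# prints "6 characters, 9 bytes"
```

The three non-ASCII letters each take two bytes in UTF-8, which is why the byte count is higher than the character count.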
You can use the open pragma.
For example, the following sets STDOUT, STDIN and STDERR to use UTF-8:
use open qw/:std :utf8/;
TMTOWTDI; choose the method that best fits how you work. I use the environment method so I don't have to think about it.
In the environment:
export PERL_UNICODE=SDL
On the command line:
perl -CSDL -le 'print "\x{1815}"';
or with binmode:
binmode(STDOUT, ":utf8"); #treat as if it is UTF-8
binmode(STDIN, ":encoding(utf8)"); #actually check if it is UTF-8
or with PerlIO:
open my $fh, ">:utf8", $filename
or die "could not open $filename: $!\n";
open my $fh, "<:encoding(utf-8)", $filename
or die "could not open $filename: $!\n";
or with the open pragma:
use open ":encoding(utf8)";
use open IN => ":encoding(utf8)", OUT => ":utf8";
You also want to declare that the strings in your code are UTF-8. See Why does modern Perl avoid UTF-8 by default?. So set not only PERL_UNICODE=SDAL but also PERL5OPT=-Mutf8.
Thanks, I finally got a solution that doesn't require putting utf8::encode all over the code.
To summarize and cover other cases, such as writing and reading files in UTF-8, the following also works with LoadFile on a UTF-8 YAML file:
use utf8;
use open ':encoding(utf8)';
binmode(STDOUT, ":utf8");
open(FH, ">test.txt");
print FH "something éá";
use YAML qw(LoadFile Dump);
my $PUBS = LoadFile("cache.yaml");
my $f = "2917";
my $ref = $PUBS->{$f};
print "$f \"".$ref->{name}."\" ". $ref->{primary_uri}." ";
where cache.yaml is:
---
2917:
id: 2917
name: Semanário
primary_uri: 2917.xml
Run this in your shell:
$ env |grep LANG
This will probably show that your shell is not using a utf-8 locale.