Perl (wrong?) encoding of output file

I am running Active Perl 5.16.3 on Windows 7 (32 bits).
My (short) program massages an input text file (encoded in UTF-8). I wish the output encoding to be in Latin1, so my code is:
open (OUT, '>;encoding(Latin1)', "out.txt") || die "Cannot open output file: $!\n";
print OUT "$string\n";
yet the resulting file is still in UTF-8. What am I doing wrong?

Firstly, the encoding layer is separated from the open mode by a colon, not a semicolon.
open OUT, '>:encoding(latin1)', "out.txt" or die "Cannot open output file: $!\n";
Secondly, Latin-1 can only encode a small subset of Unicode. Furthermore, the ASCII part of this subset is encoded the same in Latin-1 and UTF-8. We therefore have to use a test file with characters that are not encoded the same, e.g. \N{MULTIPLICATION SIGN} U+00D7 ×, which is \xD7 in Latin-1 and \xC3\x97 in UTF-8.
Also make sure that you actually decode the input file.
Here is how you could generate the test file:
$ perl -CSA -E'say "\N{U+00D7}"' > input.txt
Here is how you can test that you are properly recoding the file:
use strict;
use warnings;
use autodie;
open my $in, "<:encoding(UTF-8)", "input.txt";
open my $out, ">:encoding(latin1)", "output.txt";
while (<$in>) {
print { $out } $_;
}
The input.txt and output.txt should afterwards have different lengths (3 bytes → 2 bytes).
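If you want to see the byte difference without creating any files, here is a minimal sketch using the core Encode module. The helper name hex_dump is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00D7 MULTIPLICATION SIGN, the test character discussed above
my $char = "\x{D7}";

# Encode the same character both ways and compare the raw bytes
my $utf8_bytes   = encode('UTF-8',      $char);
my $latin1_bytes = encode('iso-8859-1', $char);

# Hypothetical helper: render a byte string as space-separated hex
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

printf "UTF-8:   %s\n", hex_dump($utf8_bytes);     # C3 97
printf "Latin-1: %s\n", hex_dump($latin1_bytes);   # D7
```

Two bytes versus one for the character itself, plus the newline, gives the 3 bytes → 2 bytes difference mentioned above.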

Related

Reading Cyrillic characters from file in perl

I'm having trouble reading Cyrillic characters from a file in perl.
The text file is written in Notepad and contains "абвгдежзийклмнопрстуфхцчшщъьюя".
Here's my code:
#!/usr/bin/perl
use warnings;
use strict;
open FILE, "text.txt" or die $!;
while (<FILE>) {
print $_;
}
If I save the text file using the ANSI encoding, I get:
рстуфхцчшщъыьэюяЁёЄєЇїЎў°∙·№■
If I save it using the UTF-8 encoding, and I use the function decode('UTF-8', $_) from the package Encode, I get:
Wide character in print at test.pl line 11, <TEXT> line 1.
and a bunch of unreadable characters.
I'm using the command prompt in windows 7x64
You're decoding your inputs, but "forgot" to encode your outputs.
Your file is probably encoded using cp1251.
Your terminal expects cp866.
Use
use open ':std', ':encoding(cp866)';
use open IO => ':encoding(cp1251)';
open(my $FILE, '<', 'text.txt')
or die $!;
or
use open ':std', ':encoding(cp866)';
open(my $FILE, '<:encoding(cp1251)', 'text.txt')
or die $!;
Use :encoding(UTF-8) instead of :encoding(cp1251) if you saved as UTF-8.
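To see why the ANSI file came out as "рсту…", here is a small sketch with the Encode module, assuming the file really is cp1251 and the console really is cp866. It takes the bytes for "абв" and shows what happens with and without recoding. Code points are printed instead of the characters so the output stays ASCII.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "абв" as a cp1251 ("ANSI") file would store it
my $file_bytes = "\xE0\xE1\xE2";

# Correct path: decode with the file's encoding, encode for the console
my $text          = decode('cp1251', $file_bytes);
my $console_bytes = encode('cp866',  $text);

# Broken path: the raw cp1251 bytes are sent to a console that assumes cp866
my $misread = decode('cp866', $file_bytes);

# Hypothetical helpers for readable output
sub code_points { join ' ', map { sprintf 'U+%04X', ord($_) } split //, $_[0] }
sub hex_bytes   { join ' ', map { sprintf '%02X',   ord($_) } split //, $_[0] }

print "decoded text:  ", code_points($text),        "\n";  # U+0430 U+0431 U+0432 (абв)
print "console bytes: ", hex_bytes($console_bytes), "\n";  # A0 A1 A2
print "misread text:  ", code_points($misread),     "\n";  # U+0440 U+0441 U+0442 (рст)
```

The misread result starts with "рст", which matches the garbled output in the question.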

Perl program keeps waiting when opening a file using utf8

I am trying to read a UTF-8 encoded XML file. The file is around 8 MB and contains only one line.
I used the code below to open this single-line XML file:
open(INP,"<:utf8","$infile") or die "Couldn't open file passed as input, $!";
local $/ = undef;
my $inputfile = <INP>;
print $inputfile; ## Not working..
But after this line the program gets stuck and keeps waiting.
I have tried other methods like binmode and decode but get the same issue.
The same program works when I change the file-opening code above to:
open(INP,"$infile") or die "Couldn't open file passed as input, $!";
local $/ = undef;
my $inputfile = <INP>;
print $inputfile; ## It works..
open(INP,"$infile") or die "Couldn't open file passed as input, $!";
binmode(INP, ":utf8");
local $/ = undef;
my $inputfile = <INP>;
print $inputfile; ## Not working..
Can you please help me see what I am doing wrong here? I need to perform some operations on the input data and have to produce UTF-8 encoded output.
I tried your last snippet here (Ubuntu 12.04, perl 5.14.2) and it works as expected. The only problem I have is a difference between the input and the output: the input file is UTF-8 and the output is ISO-8859-1.
When I add
use utf8;
use open qw(:std :utf8);
this problem is gone, though. So this must be an environment issue.

Copy/rename images with utf8 names using csv file

I'm working on a script to batch rename and copy images based on a csv file. The csv consists of column 1: old name and column 2: new name. I want to use the csv file as input for the perl script so that it looks up the old name and makes a copy under the new name in a new folder. The problem (I think) has to do with the image names. They contain non-ASCII characters like ß. When I run the script it prints out Barfu├ƒg├ñsschen where it should be Barfußgässchen, followed by this error:
Unsuccessful stat on filename containing newline at C:/Perl64/lib/File/Copy.pm line 148, <$INFILE> line 1.
Copy failed: No such file or directory at X:\Script directory\correction.pl line 26, <$INFILE> line 1.
I know it has to do with binmode and utf8, but even when I try a simple script (saw it here: How can I output UTF-8 from Perl?):
use strict;
use utf8;
my $str = 'Çirçös';
binmode(STDOUT, ":utf8");
print "$str\n";
it prints out this: Ãirþ÷s
This is my entire script. Can someone explain where I'm going wrong? (It's not the cleanest code because I was testing things out.)
use strict;
use warnings;
use File::Copy;
use utf8;
my $inputfile = shift || die "give input!\n";
#my $outputfile = shift || die "Give output!\n";
open my $INFILE, '<', $inputfile or die "In use / not found :$!\n";
#open my $OUTFILE, '>', $outputfile or die "In use / not found :$!\n";
binmode($INFILE, ":encoding(utf8)");
#binmode($OUTFILE, ":encoding(utf8)");
while (<$INFILE>) {
s/"//g;
my @elements = split /;/, $_;
my $old = $elements[1];
my $new = "new/$elements[3]";
binmode STDOUT, ':utf8';
print "$old | $new\n";
copy("$old","$new") or die "Copy failed: $!";
#copy("Copy.pm",\*STDOUT);
# my $output_line = join(";", @elements);
# print $OUTFILE $output_line;
#print "\n"
}
close $INFILE;
#close $OUTFILE;
exit 0;
You need to ensure every step of the process is using UTF-8.
When you create the input CSV, you need to make sure that it's saved as UTF-8, preferably without a BOM. Windows Notepad will add a BOM so try Notepad++ instead which gives you more control of the encoding.
You also have the problem that the Windows console is not UTF-8 compliant by default. See Unicode characters in Windows command line - how?. Either set the codepage with chcp 65001 or don't change the STDOUT encoding.
In terms of your code, the first error regarding the new line is probably due to the trailing new line from the CSV. Add chomp() after while (<$INFILE>) {
Update:
To "address" the file you need to encode your filenames in the correct locale - See How do you create unicode file names in Windows using Perl and What is the universal way to use file I/O API with unicode filenames?. Assuming you're using Western 1252 / Latin, this means your copy command will look like:
copy(encode("cp1252", $old), encode("cp1252", $new))
Also, your open should also encode the filename:
open my $INFILE, '<', encode("cp1252", $inputfile)
Update 2:
As you're running in a DOS window, remove binmode(STDOUT, ":utf8"); and leave the default codepage in place.
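Putting the chomp and the filename encoding together, a rough sketch of the per-line handling could look like this. The helper name rename_pair, the two-column "old;new" layout, the sample file names, and the cp1252 codepage are all assumptions for the illustration, so adjust the column indexes and the codepage to your data and system.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn one decoded CSV line into (old, new) filenames
# encoded as byte strings for the file system.
# Assumes "old;new" columns and a cp1252 system codepage.
sub rename_pair {
    my ($line) = @_;
    chomp $line;                           # drop the trailing newline
    $line =~ s/"//g;                       # strip the CSV quotes
    my ($old, $new) = split /;/, $line;
    return map { encode('cp1252', $_) } ($old, "new/$new");
}

# A decoded sample line, written with escapes so the source file's own
# encoding does not matter (\x{DF} is ß, \x{E4} is ä)
my $line = qq{"Barfu\x{DF}g\x{E4}sschen.jpg";"lane.jpg"\n};

my ($old_bytes, $new_bytes) = rename_pair($line);
printf "%d-byte source name -> %s\n", length($old_bytes), $new_bytes;

# $old_bytes and $new_bytes are what would be passed to File::Copy::copy()
```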

How to use Perl to process a file whose format is similar to Unicode?

I have a legacy program that generates a log file when it runs. Now I need to analyze this log file.
But the file format is very strange. As shown below, when I open it in vi it looks like a Unicode file, but it does not start with FFFE. After I opened it in Notepad, saved it, and opened it again, I found that Notepad had added FFFE. I can then use the command 'type log.txt > log1.txt' to convert the whole file to ANSI format, and afterwards use /TDD/ in Perl to search for what I need.
But right now I can't deal with this file format directly.
Any comments or ideas would be much appreciated.
0000000: 5400 4400 4400 3e00 2000 4c00 6f00 6100 T.D.D.>. .L.o.a.
After notepad save it
0000000: fffe 5400 4400 4400 3e00 2000 4c00 6f00 ..T.D.D.>. .L.o.
open STDIN, "< log.txt";
while(<>)
{
if (/TDD/)
{
# Add my logic.
}
}
I have read the thread which is very useful, but still can't resolve my problem.
How can I open a Unicode file with Perl?
I can't add an answer, so I am editing my question.
Thanks Michael,
I tried your script but got the following error. My Perl version is 5.1 and the OS is Windows 2008.
* ascii
* ascii-ctrl
* iso-8859-1
* null
* utf-8-strict
* utf8
UTF-16:Unrecognised BOM 5400 at test.pl line 12.
Update
I tried the UTF-16LE with the command:
perl.exe open.pl utf-16le utf-16 <my log file>.txt
but I still got an error like
UTF-16LE:Partial character at open.pl line 18, <$fh> line 1824.
I also tried utf-16be and got the same error.
If I use utf-16, I get the error
UTF-16:Unrecognised BOM 5400 at open.pl line 18.
open.pl line 18
is "print while <$fh>;"
Any idea?
Updated: 5/11/2011.
Thank you all for your help. I resolved the problem.
I found that the data in the log file is not UTF-16 after all. So I had to write a .NET project in Visual Studio that reads the log file as UTF-16 and writes a new file in UTF-8. Then I used a Perl script to parse that file and generate the result data. It works now.
So, if any of you know how to use Perl to read a file with a lot of garbage data, please tell me. Thank you very much.
E.g. a sample of the garbage data:
tests.cpp:34)
਍吀䐀䐀㸀 䰀漀愀搀椀渀最 挀挀洀挀漀爀攀⸀搀氀
Opened in a hex viewer:
0000070: a88d e590 80e4 9080 e490 80e3 b880 e280 ................
0000080: 80e4 b080 e6bc 80e6 8480 e690 80e6 a480 ................
0000090: e6b8 80e6 9c80 e280 80e6 8c80 e68c 80e6 ................
00000a0: b480 e68c 80e6 bc80 e788 80e6 9480 e2b8 ................
Your file seems to be encoded in UTF-16LE. The bytes notepad adds are called "Byte Order Mark", or just BOM.
Here's how you can read your file using Perl:
use strict;
use warnings;
use Encode;
# list loaded encodings
print STDERR map "* $_\n", Encode->encodings;
# read arguments
my $enc = shift || 'utf16';
die "no files :-(\n" unless @ARGV;
# process files
for ( @ARGV ) {
open my $fh, "<:encoding($enc)", $_ or die "open $_: $!";
print <$fh>;
close $fh;
}
# loaded more encodings now
print STDERR map "* $_\n", Encode->encodings;
Proceed like this, taking care to supply the correct encoding for your file:
perl open.pl utf16 open.utf16be.txt
perl open.pl utf16 open.utf16le.txt
perl open.pl utf16le open.utf16le.nobom.txt
Here's the revised version following tchrist's suggestions:
use strict;
use warnings;
use Encode;
# read arguments
my $enc_in = shift || die 'pass file encoding as first parameter';
my $enc_out = shift || die 'pass STDOUT encoding as second parameter';
print STDERR "going to read files as encoded in: $enc_in\n";
print STDERR "going to write to standard output in: $enc_out\n";
die "no files :-(\n" unless @ARGV;
binmode STDOUT, ":encoding($enc_out)"; # latin1, cp1252, utf8, UTF-8
print STDERR map "* $_\n", Encode->encodings; # list loaded encodings
for ( @ARGV ) { # process files
open my $fh, "<:encoding($enc_in)", $_ or die "open $_: $!";
print while <$fh>;
close $fh;
}
print STDERR map "* $_\n", Encode->encodings; # more encodings now
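As a quick check that does not need the log file, here is a minimal sketch that decodes the first 16 bytes from the hex dump in the question. Because there is no BOM, the byte order is named explicitly (UTF-16LE). It only covers the clean bytes shown in that first dump, not the garbage data shown later.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The 16 bytes from the first hex dump in the question (no BOM)
my $bytes = pack 'H*', '5400440044003e0020004c006f006100';

# Without a BOM, plain "UTF-16" cannot guess the byte order,
# so the little-endian variant is requested explicitly
my $text = decode('UTF-16LE', $bytes);

print "decoded: $text\n";
print "matched TDD\n" if $text =~ /TDD/;
```

Once the data is decoded, an ordinary regex such as /TDD/ matches as usual.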

How can I output UTF-8 from Perl?

I am trying to write a Perl script using the utf8 pragma, and I'm getting unexpected results. I'm using Mac OS X 10.5 (Leopard), and I'm editing with TextMate. All of my settings for both my editor and operating system are defaulted to writing files in utf-8 format.
However, when I enter the following into a text file, save it as a ".pl", and execute it, I get the friendly "diamond with a question mark" in place of the non-ASCII characters.
#!/usr/bin/env perl -w
use strict;
use utf8;
my $str = 'Çirçös';
print( "$str\n" );
Any idea what I'm doing wrong? I expect to get 'Çirçös' in the output, but I get '�ir��s' instead.
use utf8; does not enable Unicode output - it enables you to type Unicode in your program. Add this to the program, before your print() statement:
binmode(STDOUT, ":utf8");
See if that helps. That should make STDOUT output in UTF-8 instead of ordinary ASCII.
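To make the difference visible without depending on your terminal, here is a small sketch that prints the same string into an in-memory file, once with no encoding layer and once through :encoding(UTF-8), and then compares the bytes. The helper name bytes_written is made up for this illustration.

```perl
use strict;
use warnings;

# 'Çirçös' written with escapes, so the source file's encoding does not matter
my $str = "\x{C7}ir\x{E7}\x{F6}s";

# Hypothetical helper: print $text through the given layer into an
# in-memory file and return the bytes that were actually written
sub bytes_written {
    my ($layer, $text) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "cannot open in-memory file: $!";
    print {$fh} $text;
    close $fh;
    return $buffer;
}

my $plain = bytes_written(':raw',             $str);
my $utf8  = bytes_written(':encoding(UTF-8)', $str);

printf "no encoding layer: %d bytes\n", length($plain);   # 6
printf "UTF-8 layer:       %d bytes\n", length($utf8);    # 9
```

Without an encoding layer each character goes out as a single byte (0xC7 for Ç and so on), and those bytes are not valid UTF-8, so a UTF-8 terminal shows the replacement character instead.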
You can use the open pragma.
For example, the line below sets STDOUT, STDIN and STDERR to use UTF-8:
use open qw/:std :utf8/;
TMTOWTDI, choose the method that best fits how you work. I use the environment method so I don't have to think about it.
In the environment:
export PERL_UNICODE=SDL
on the command line:
perl -CSDL -le 'print "\x{1815}"';
or with binmode:
binmode(STDOUT, ":utf8"); #treat as if it is UTF-8
binmode(STDIN, ":encoding(utf8)"); #actually check if it is UTF-8
or with PerlIO:
open my $fh, ">:utf8", $filename
or die "could not open $filename: $!\n";
open my $fh, "<:encoding(utf-8)", $filename
or die "could not open $filename: $!\n";
or with the open pragma:
use open ":encoding(utf8)";
use open IN => ":encoding(utf8)", OUT => ":utf8";
You also want to declare that the strings in your code are UTF-8. See Why does modern Perl avoid UTF-8 by default?. So set not only PERL_UNICODE=SDAL but also PERL5OPT=-Mutf8.
Thanks, I finally got a solution that avoids putting utf8::encode all over the code.
To summarize and cover other cases, such as writing and reading files in UTF-8, the following also works with LoadFile on a UTF-8 YAML file:
use utf8;
use open ':encoding(utf8)';
binmode(STDOUT, ":utf8");
open(FH, ">test.txt");
print FH "something éá";
use YAML qw(LoadFile Dump);
my $PUBS = LoadFile("cache.yaml");
my $f = "2917";
my $ref = $PUBS->{$f};
print "$f \"".$ref->{name}."\" ". $ref->{primary_uri}." ";
where cache.yaml is:
---
2917:
id: 2917
name: Semanário
primary_uri: 2917.xml
Run this in your shell:
$ env |grep LANG
This will probably show that your shell is not using a utf-8 locale.