I have a CSV file, say win.csv, whose text is encoded in Windows-1252. First I use iconv to convert it to UTF-8.
$ iconv -o test.csv -f windows-1252 -t utf-8 win.csv
Then I read the converted CSV file with the following Perl script (utfcsv.pl).
#!/usr/bin/perl
use utf8;
use Text::CSV;
use Encode::Detect::Detector;
my $csv = Text::CSV->new({ binary => 1, sep_char => ';',});
open my $fh, "<encoding(utf8)", "test.csv";
while (my $row = $csv->getline($fh)) {
my $line = join " ", @$row;
my $enc = Encode::Detect::Detector::detect($line);
print "($enc) $line\n";
}
$csv->eof || $csv->error_diag();
close $fh;
$csv->eol("\r\n");
exit;
Then the output is like the following.
(UTF-8) .........
() .....
Namely, the encoding of every line is detected as UTF-8 (or ASCII). But the actual output does not seem to be UTF-8. In fact, if I save the output to a file
$./utfcsv.pl > output.txt
then the encoding of output.txt is detected as windows-1252.
Question: How can I get the output text in UTF-8?
Notes:
Environment: openSUSE 13.2 x86_64, perl 5.20.1
I do not use Text::CSV::Encoded because its installation fails. (And since test.csv has already been converted to UTF-8, it would be strange to need Text::CSV::Encoded anyway.)
I use the following script to check the encoding. (I also use it to find out the encoding of the initial CSV file win.csv.)
#!/usr/bin/perl
use Encode::Detect::Detector;
open my $in, "<", $ARGV[0] or die "open failed: $!";
while (my $line = <$in>) {
my $enc = Encode::Detect::Detector::detect($line);
chomp $enc;
if ($enc) {
print "$enc\n";
}
}
You have set the encoding of the input file handle (which, by the way, should be <:encoding(utf8) -- note the colon), but you haven't specified the encoding of the output channel, so Perl will send unencoded character values to the output.
The Unicode values for characters that fit in a single byte -- Basic Latin (ASCII) between 0 and 0x7F, and Latin-1 Supplement between 0x80 and 0xFF -- are very similar to Windows code page 1252. In particular, a small letter u with a diaeresis is 0xFC in both Unicode and CP1252, so the text will look like CP1252 if it is output unencoded, instead of the two-byte sequence 0xC3 0xBC, which is the same code point encoded in UTF-8.
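You can see the two byte sequences directly with a minimal sketch (not part of the original answer; it only assumes the core Encode module):

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00FC (small u with diaeresis) is the single byte 0xFC in both
# Latin-1 and CP1252, but the two-byte sequence 0xC3 0xBC in UTF-8.
my $u_umlaut = "\x{FC}";

printf "CP1252: %v02X\n", encode('cp1252', $u_umlaut);  # FC
printf "UTF-8:  %v02X\n", encode('UTF-8',  $u_umlaut);  # C3.BC
```

So when such characters are printed without an encoding layer, the raw code-point bytes happen to coincide with CP1252, which is why a detector labels the output file as Windows-1252.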
If you use binmode on STDOUT to set the encoding then the data will be output correctly, but it is simplest to use the open pragma like this
use open qw/ :std :encoding(utf-8) /;
which will set the encoding for STDIN, STDOUT and STDERR, as well as any newly-opened file handles. That means you don't have to specify it when you open the CSV file, and your code will look like this
Note that I have also added use strict and use warnings, which are essential in any Perl program. I have also
used autodie to remove the need for checks on the status of all IO operations, and I have taken advantage of the way Perl interpolates arrays inside double quotes by putting a space between the elements which avoids the need for a join call
#!/usr/bin/perl
use utf8;
use strict;
use warnings 'all';
use open qw/ :std :encoding(utf-8) /;
use autodie;
use Text::CSV;
my $csv = Text::CSV->new({ binary => 1, sep_char => ';' });
open my $fh, '<', 'test.csv';
while ( my $row = $csv->getline($fh) ) {
print "@$row\n";
}
close $fh;
This question already has answers here: How can I output UTF-8 from Perl? (6 answers)
I have a problem with Perl output: the French word "préféré" is sometimes output as "pr�f�r�".
The sample script :
devel@k0:~/tmp$ cat 02.pl
#!/usr/bin/env perl
use strict;
use warnings;
print "préféré\n";
open( my $fh, '<:encoding(UTF-8)', 'text' ) ;
while ( <$fh> ) { print $_ }
close $fh;
exit;
The execution :
devel@k0:~/tmp$ ./02.pl
préféré
pr�f�r�
devel@k0:~/tmp$ cat text
préféré
devel@k0:~/tmp$ file text
text: UTF-8 Unicode text
Can someone please help me?
Decode your inputs, encode your outputs. You have two bugs related to failure to properly decode and encode.
Specifically, you're missing
use utf8;
use open ":std", ":encoding(UTF-8)";
Details follow.
Perl source code is expected to be ASCII (with 8-bit clean string literals) unless you use use utf8 to tell Perl it's UTF-8.
I believe you have a UTF-8 terminal. We can conclude from the fact that cat 02.pl works that your source code is encoded using UTF-8. This means Perl sees the equivalent of this:
print "pr\x{C3}\x{A9}f\x{C3}\x{A9}r\x{C3}\x{A9}\n"; # C3 A9 = é encoded using UTF-8
You should be using use utf8; so Perl sees the equivalent of
print "pr\x{E9}f\x{E9}r\x{E9}\n"; # E9 = Unicode Code Point for é
You correctly decode the file you read.
The file presumably contains
70 72 C3 A9 66 C3 A9 72 C3 A9 0A # préféré␊ encoded using UTF-8
Because of the encoding layer you add, you are effectively doing
$_ = decode( "UTF-8", "\x{70}\x{72}\x{C3}\x{A9}\x{66}\x{C3}\x{A9}\x{72}\x{C3}\x{A9}\x{0A}" );
or
$_ = "pr\x{E9}f\x{E9}r\x{E9}\n";
This is correct.
Finally, you fail to encode your outputs.
The following does what you want:
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
BEGIN {
binmode( STDIN, ":encoding(UTF-8)" ); # Well, not needed here.
binmode( STDOUT, ":encoding(UTF-8)" );
binmode( STDERR, ":encoding(UTF-8)" );
}
print "préféré\n";
open( my $fh, '<:encoding(UTF-8)', 'text' ) or die $!;
while ( <$fh> ) { print $_ }
close $fh;
But the open pragma makes it a lot cleaner.
The following does what you want:
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use open ":std", ":encoding(UTF-8)";
print "préféré\n";
open( my $fh, '<', 'text' ) or die $!;
while ( <$fh> ) { print $_ }
close $fh;
UTF-8 is an interesting problem. At first your print seems to work, because you don't do any UTF-8 handling at all: you have a UTF-8 byte string, Perl doesn't know it is UTF-8, and it prints the bytes as-is. On a UTF-8 terminal everything therefore looks fine, even though it isn't actually correct.
When you add use utf8; to your source code, you will see that your print now produces the same garbage. But if your source contains UTF-8 string literals, that's what you should do.
use utf8;
# Now also prints garbage
print "préféré\n";
open my $fh, '<:encoding(UTF-8)', 'text';
while ( <$fh> ) {
print $_;
}
close $fh;
Next: every input you read from the outside world needs a decode, and every output you produce needs an encode.
use utf8;
use Encode qw(encode decode);
# Now correct
print encode("UTF-8", "préféré\n");
open my $fh, '<:encoding(UTF-8)', 'text';
while ( <$fh> ) {
print encode("UTF-8", $_);
}
close $fh;
This can be tedious, but you can enable automatic encoding on a filehandle with binmode:
use utf8;
# Activate UTF-8 Encode on STDOUT
binmode STDOUT, ':utf8';
print "préféré\n";
open my $fh, '<:encoding(UTF-8)', 'text';
while ( <$fh> ) {
print $_;
}
close $fh;
Now everything is UTF-8! You can also activate it on STDERR. Remember that if you want to print binary data on STDOUT (for whatever reason), you must disable the layer first:
binmode STDOUT, ':raw';
I have a perl string containing Unicode characters and I want to create a file with this string as a filename. It should work on Windows, Linux and Mac whatever the locale used.
Here is my code:
use strict;
use warnings FATAL => 'all';
use Encode::Locale;
use Encode;
# ファイル.c
my $file = "\x{30D5}\x{30A1}\x{30A4}\x{30EB}.c";
$file = encode(locale_fs => $file);
open(my $filehdl, '>', $file) or die("Unable to create file: $!");
close($filehdl);
I use the encode function because, according to this answer:
Perl treats file names as opaque strings of bytes. They need to be encoded as per your "locale"'s encoding (ANSI code page).
However, this code fails with the following error:
Unable to create file: Invalid argument at .\perl.pl line 15.
I took a deeper look at how the string is encoded by encode:
my $rep = sprintf '%v02X', $file;
print($rep);
This prints:
3F.3F.3F.3F.2E.63
In my current locale (CP-1252), this corresponds to ????.c. We can see that each Unicode character has been replaced by a question mark.
I think it is normal to have question marks here because the characters in my string are not representable using CP-1252 encoding.
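The substitution can be reproduced in isolation (a standalone sketch, separate from my script above): by default, encode replaces every character the target encoding cannot represent with '?'.

```perl
use strict;
use warnings;
use Encode qw(encode);

# ファイル.c -- four katakana characters plus ".c"
my $name = "\x{30D5}\x{30A1}\x{30A4}\x{30EB}.c";

# With the default substitution mode, each character that CP1252
# cannot represent becomes a question mark (0x3F).
my $bytes = encode('cp1252', $name);

printf "%v02X\n", $bytes;  # 3F.3F.3F.3F.2E.63
```

Passing Encode::FB_CROAK as a third argument would make encode die instead of silently substituting, which would at least have made the problem visible earlier.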
So, my question is: is there a way to create a file with a name containing Unicode characters?
For Windows there is the module Win32::LongPath, which not only allows long file names but also Unicode characters in them.
I wrote myself a module for all the file and directory I/O I need; on Windows it uses that module's functions, and elsewhere the standard Perl ones, like so:
use Carp;
use Fcntl qw( :flock :seek );
use constant USE_LONG => ($^O eq 'MSWin32') ? 1 : 0; # note: /Win/i would also match "darwin" and "cygwin"
use if USE_LONG, 'Win32::LongPath', ':funcs';
sub open
{
my $f = shift; # file
my $m = shift; # mode
my $l = @_ ? (shift) : 'utf8'; # encoding
my $lock = $m eq '<' ? LOCK_SH : LOCK_EX;
length $l
and $m .= ":$l";
my $h;
USE_LONG ? openL( \$h, $m, $f ) : open( $h, $m, $f ) # openL needs REF on Handle!
or confess "Can't open file: '$f' ($^E)";
flock( $h, $lock );
return $h;
}
That way the code is portable. It runs on a Linux server as well as on my Windows PC at home.
I am trying to print a warning message when a file I am reading (which is supposed to contain valid UTF-8) contains invalid UTF-8. However, if the invalid data is at the end of the file, I am not able to output any warnings. The following MVCE creates a file containing invalid UTF-8 data (the creation of the file is not relevant to the general question; it was just added to produce an MVCE):
use feature qw(say);
use strict;
use warnings;
binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';
my $bytes = "\x{61}\x{E5}\x{61}"; # 3 bytes in iso 8859-1: aåa
test_read_invalid( $bytes );
$bytes = "\x{61}\x{E5}"; # 2 bytes in iso 8859-1: aå
test_read_invalid( $bytes );
sub test_read_invalid {
my ( $bytes ) = @_;
say "Running test case..";
my $fn = 'test.txt';
open ( my $fh, '>:raw', $fn ) or die "Could not open file '$fn': $!";
print $fh $bytes;
close $fh;
my $str = '';
open ( $fh, "<:encoding(utf-8)", $fn ) or die "Could not open file '$fn': $!";
$str = do { local $/; <$fh> };
close $fh;
say "Read string: '$str'\n";
}
The output is:
Running test case..
utf8 "\xE5" does not map to Unicode at ./p.pl line 22.
Read string: 'a\xE5a'
Running test case..
Read string: 'a'
In the last test case, the invalid byte at the end of the file seems to be silently ignored by the PerlIO layer :encoding(utf-8).
Essentially what you're seeing is the PerlIO system attempting to deal with a block read that ends in the middle of a UTF-8 sequence. The raw byte buffer still holds the invalid byte you want, but the decoded buffer does not yet have that content, because it doesn't decode properly yet and the layer is hoping to find the rest of the character later. You can check for this by popping the encoding layer off, doing another read, and checking the length.
binmode $fh, ':pop';
my $remainder = do { local $/; <$fh>};
die "Unread Characters" if length $remainder;
I'm not sure; you may want to have your open start with :raw, or do binmode $fh, ':raw' instead. I've never paid much attention to the layers themselves since it usually just works, but I do know that this code block works for your test case. :)
I'm not sure what you are asking. To detect encoding errors in a string, you can simply attempt to decode the string. As for getting an error when writing to the file, maybe close returns an error, or you can use chomp($_); print($fh "$_\n"); (seeing as Unix text files should always end with a newline anyway).
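The "attempt to decode" check can be sketched like this (the is_valid_utf8 helper is a hypothetical name, not from any module): decode with Encode::FB_CROAK inside an eval, and treat a die as "invalid".

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if $bytes is well-formed UTF-8.
# FB_CROAK makes decode() die on malformed input; LEAVE_SRC
# keeps the argument itself unmodified.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return defined eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
}

print is_valid_utf8("a\xC3\xA5a") ? "valid\n" : "invalid\n";  # valid
print is_valid_utf8("a\xE5")      ? "valid\n" : "invalid\n";  # invalid
```

This works on whole strings you already have in memory, so it sidesteps the buffering issue at end-of-file described above.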
open ( my $fh, '>:raw', $fn ) or die "Could not open file '$fn': $!";
# the file needs to end with one extra byte (here a space) for the trailing invalid UTF-8 character to be reported
print $fh "$bytes ";
Output:
Running test case..
utf8 "\xE5" does not map to Unicode at ent.pl line 23.
Read string: 'a\xE5a '
Running test case..
utf8 "\xE5" does not map to Unicode at ent.pl line 23.
Read string: 'a\xE5 '
I'd like some advice about Perl.
I have text files I want to process with Perl. Those text files are encoded in CP932, but for some reason they may contain malformed characters.
My program is like:
#! /usr/bin/perl -w
use strict;
use encoding 'utf-8';
# 'workfile.txt' is supposed to be encoded in cp932
open my $in, "<:encoding(cp932)", "./workfile.txt";
while ( my $line = <$in> ) {
# my process comes here
print $line;
}
If workfile.txt includes malformed characters, Perl complains:
cp932 "\x81" does not map to Unicode at ./my_program.pl line 8, <$in> line 1234.
Perl knows when its input contains malformed characters, so I want to rewrite my program to check whether each input line is good or bad and act accordingly: print all good lines (lines that do not contain malformed characters) to output filehandle A, and lines that do contain malformed characters to output filehandle B.
#! /usr/bin/perl -w
use strict;
use encoding 'utf-8';
use English;
# 'workfile.txt' is supposed to be encoded in cp932
open my $in, "<:encoding(cp932)", "./workfile.txt";
open my $output_good, ">:encoding(utf8)", "good.txt";
open my $output_bad, ">:encoding(utf8)", "bad.txt";
select $output_good; # in most cases workfile.txt lines are good
while ( my $line = <$in> ) {
if ( $line contains malformed characters ) {
select $output_bad;
}
print "$INPUT_LINE_NUMBER: $line";
select $output_good;
}
My question is how I can write the "if ($line contains malformed characters)" part. How can I check whether input is good or bad?
Thanks in advance.
#! /usr/bin/perl -w
use strict;
use utf8; # Source encoded using UTF-8
use open ':std', ':encoding(UTF-8)'; # STD* is UTF-8;
# UTF-8 is default encoding for open.
use Encode qw( decode );
open my $fh_in, "<:raw", "workfile.txt"
or die $!;
open my $fh_good, ">", "good.txt"
or die $!;
open my $fh_bad, ">:raw", "bad.txt"
or die $!;
while ( my $line = <$fh_in> ) {
my $decoded_line =
eval { decode('cp932', $line, Encode::FB_CROAK|Encode::LEAVE_SRC) };
if (defined($decoded_line)) {
print($fh_good "$. $decoded_line");
} else {
print($fh_bad "$. $line");
}
}
I have been given a file, (probably) encoded in Latin-1 (ISO 8859-1), and there are some conversions and data mining to be done with it. The output is supposed to be in UTF-8, and I have tried about everything I could find about encoding conversion in Perl; none of it produced any usable output.
I know that use utf8; does nothing to begin with. I have tried the Encode package, which looked promising:
open FILE, '<', $ARGV[0] or die $!;
my %tmp = ();
my $last_num = 0;
while (<FILE>) {
$_ = decode('ISO-8859-1', encode('UTF-8', $_));
chomp;
next unless length;
process($_);
}
I tried that in every combination I could think of, and also threw in binmode(STDOUT, ":utf8");, open FILE, '<:encoding(ISO-8859-1)', $ARGV[0] or die $!; and much more. The results were either scrambled umlauts, an error message like \xC3 is not a valid UTF-8 character, or even mixed text (some in UTF-8, some in Latin-1).
All I want is a simple way to read in a Latin-1 text file and produce UTF-8 output on the console via print. Is there any simple way to do that in Perl?
See Perl encoding introduction and the Unicode cookbook.
Easiest with piconv:
$ piconv -f Latin1 -t UTF-8 < input.file > output.file
Easy, with encoding layers:
use autodie qw(:all);
open my $input, '<:encoding(Latin1)', $ARGV[0];
binmode STDOUT, ':encoding(UTF-8)';
Moderately, with manual de-/encoding:
use Encode qw(decode encode);
use autodie qw(:all);
open my $input, '<:raw', $ARGV[0];
binmode STDOUT, ':raw';
while (my $raw = <$input>) {
my $line = decode 'Latin1', $raw, Encode::FB_CROAK | Encode::LEAVE_SRC;
my $result = process($line);
print {STDOUT} encode 'UTF-8', $result, Encode::FB_CROAK | Encode::LEAVE_SRC;
}
Maybe as:
$_ = encode('utf-8', decode('ISO-8859-1', $_));
The data is GB2312-encoded, so this converts it to UTF-8:
#!/usr/bin/env perl
use Encode qw(encode decode);
while (<DATA>) {
$_ = encode('utf-8', decode('gb2312', $_));
print;
}
__DATA__
Â׶ذÂÔË»á
$_ = decode('ISO-8859-1', encode('UTF-8', $_));
This line has two problems with it. Firstly you are encoding your input to UTF-8 and then decoding it from ISO-8859-1. These two operations are the wrong way round.
Secondly, you almost certainly don't want to decode and encode at the same time. The Golden Rule of handling character encodings in Perl is to follow this process:
Decode data as soon as you get it from the outside world. This takes your input bytestream and converts it into Perl's internal representation for character strings.
Process the data according to your requirements.
Encode the data just before sending it to the outside world. This takes Perl's internal representation for character strings and converts it to a correctly-encoded bytestream for your required output encoding.
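Applied to this question, the Golden Rule reduces to a three-step pipeline. The sketch below assumes Latin-1 input and UTF-8 output, with uc standing in for whatever processing you actually need:

```perl
use strict;
use warnings;
use feature 'unicode_strings';  # Unicode semantics for uc/lc on all strings
use Encode qw(decode encode);

my $input_bytes = "caf\xE9";  # "café" as Latin-1 bytes

# 1. Decode at the boundary: bytes -> Perl character string.
my $text = decode('ISO-8859-1', $input_bytes);

# 2. Process as characters (placeholder transformation).
$text = uc $text;

# 3. Encode on the way out: characters -> UTF-8 bytes.
my $output_bytes = encode('UTF-8', $text);

printf "%v02X\n", $output_bytes;  # 43.41.46.C3.89 ("CAFÉ" in UTF-8)
```

Note that the decode and encode steps name different encodings: decoding always uses the encoding the data actually arrived in, and encoding always uses the encoding the destination expects, which is why swapping the two (as in the original code) cannot work.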