Perl and utf8 output from file [duplicate] - perl

This question already has answers here:
How can I output UTF-8 from Perl?
(6 answers)
Closed 6 months ago.
I have a problem with perl output : the french word "préféré" is sometimes outputted "pr�f�r�" :
The sample script :
devel#k0:~/tmp$ cat 02.pl
#!/usr/bin/env perl
use strict;
use warnings;
print "préféré\n";
open( my $fh, '<:encoding(UTF-8)', 'text' ) ;
while ( <$fh> ) { print $_ }
close $fh;
exit;
The execution :
devel#k0:~/tmp$ ./02.pl
préféré
pr�f�r�
devel#k0:~/tmp$ cat text
préféré
devel#k0:~/tmp$ file text
text: UTF-8 Unicode text
Can please someone help me ?

Decode your inputs, encode your outputs. You have two bugs related to failure to properly decode and encode.
Specifically, you're missing
use utf8;
use open ":std", ":encoding(UTF-8)";
Details follow.
Perl source code is expected to be ASCII (with 8-bit clean string literals) unless you use use utf8 to tell Perl it's UTF-8.
I believe you have a UTF-8 terminal. We can conclude from the fact that cat 02.pl works that your source code is encoded using UTF-8. This means Perl sees the equivalent of this:
print "pr\x{C3}\x{A9}f\x{C3}\x{A9}r\x{C3}\x{A9}\n"; # C3 A9 = é encoded using UTF-8
You should be using use utf8; so Perl sees the equivalent of
print "pr\x{E9}f\x{E9}r\x{E9}\n"; # E9 = Unicode Code Point for é
You correctly decode the file you read.
The file presumably contains
70 72 C3 A9 66 C3 A9 72 C3 A9 0A # préféré␊ encoded using UTF-8
Because of the encoding layer you add, you are effectively doing
$_ = decode( "UTF-8", "\x{70}\x{72}\x{C3}\x{A9}\x{66}\x{C3}\x{A9}\x{72}\x{C3}\x{A9}\x{0A}" );
or
$_ = "pr\x{E9}f\x{E9}r\x{E9}\n";
This is correct.
Finally, you fail to encode your outputs.
The following does what you want:
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
BEGIN {
binmode( STDIN, ":encoding(UTF-8)" ); # Well, not needed here.
binmode( STDOUT, ":encoding(UTF-8)" );
binmode( STDERR, ":encoding(UTF-8)" );
}
print "préféré\n";
open( my $fh, '<:encoding(UTF-8)', 'text' ) or die $!;
while ( <$fh> ) { print $_ }
close $fh;
But the open pragma makes it a lot cleaner.
The following does what you want:
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use open ":std", ":encoding(UTF-8)";
print "préféré\n";
open( my $fh, '<', 'text' ) or die $!;
while ( <$fh> ) { print $_ }
close $fh;

UTF-8 is an interesting problem. First, your Perl itself will print correctly, because you don't do any UTF-8 Handling. You have an UTF-8 String, but Perl itself don't really know that it is UTF-8, and it will also print it, as-is.
So an an UTF-8 Terminal. Everything looks fine. Even that's not the case.
When you add use utf8; to your source-code. You will see, that your print now will produce the same garbage. But if you have string containing UTF-8. That's what you should do.
use utf8;
# Now also prints garbage
print "préféré\n";
open my $fh, '<:encoding(UTF-8)', 'text';
while ( <$fh> ) {
print $_;
}
close $fh;
Next. For every input you do from the outside, you need to do an decode, and for every output you do. You need todo an encode.
use utf8;
use Encode qw(encode decode);
# Now correct
print encode("UTF-8", "préféré\n");
open my $fh, '<:encoding(UTF-8)', 'text';
while ( <$fh> ) {
print encode("UTF-8", $_);
}
close $fh;
This can be tedious. But you can enable Auto-Encoding on a FileHandle with binmode
use utf8;
# Activate UTF-8 Encode on STDOUT
binmode STDOUT, ':utf8';
print "préféré\n";
open my $fh, '<:encoding(UTF-8)', 'text';
while ( <$fh> ) {
print $_;
}
close $fh;
Now everything is UTF-8! You also can activate it on STDERR. Remember that if you want to print binary data on STDOUT (for whatever reason) you must disable the Layer.
binmode STDOUT, ':raw';

Related

Error:Wide character in print at X at line 35, <$fh> ?(read text files from command line)

i am newbie to perl. and this is my second assignment i should create program to parse n files and print m sentences using n-grams model. long story short, i wrote this script that will take n arguments, where the first and second arguments are numeric but the rest are files names, however i am getting this error Wide character in print at ngram.pl line 35, line 1.
steps to reproduce it :
input from command line : perl ngram.pl 5 10 tale-cities.txt bleak-house.txt papers.txt
output : Wide character in print at ngram.pl line 35, line 1.
#!/usr/bin/perl
use strict;
use warnings FATAL => 'all';
use Scalar::Util qw(looks_like_number);
use utf8;
use Encode;
#Charles Dickens
sub checkIfNumberic
{
my ($inp)=#_;
if (looks_like_number($inp)){
return "True";
}
else{
return "False" ;
}
}
sub main
{
my $correctInput=", your input must be something like this 5 10 somefile.txt somefile2.txt ";
my #inputs= #ARGV;
if (checkIfNumberic($inputs[0]) eq "False"){
die "first argument must be numberic $correctInput\n";
}
if (checkIfNumberic($inputs[1]) eq "False"){
die "second argument must be numberic $correctInput\n";
}
for (my $i=2; $i< scalar #inputs ;$i++)
{
if (open(my $fh, '<:encoding(UTF-8)', $inputs[$i])) {
while (my $line = <$fh>) {
chomp $line;
print "$line \n";
}
}
}
}
main();
You decoded your inputs (the script, with use utf8;; and the file, with :encoding(UTF-8)), but you didn't encode your outputs. Add
use open ':std', ':encoding(UTF-8)';
This is equivalent to
BEGIN {
binmode STDIN, ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';
binmode STDERR, ':encoding(UTF-8)';
}
It also sets the default encoding for file handles opened in its lexical scope, you can remove the existing :encoding(UTF-8) if you want.

Trying to improve Encode::decode warning message: Segfault in $SIG{__WARN__} handler

I am trying to improve the warning message issued by Encode::decode(). Instead of printing the name of the module and the line number in the module, I would like it to print the name of the file being read and the line number in that file where the malformed data was found. To a developer, the origial message can be useful, but to an end user not familiar with Perl, it is probably quite meaningless. The end user would probably rather like to know which file is giving the problem.
I first tried to solve this using a $SIG{__WARN__} handler (which is probably not a good idea), but I get a segfault. Probably a silly mistake, but I could not figure it out:
#! /usr/bin/env perl
use feature qw(say);
use strict;
use warnings;
use Encode ();
binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';
my $fn = 'test.txt';
write_test_file( $fn );
# Try to improve the Encode::FB_WARN fallback warning message :
#
# utf8 "\xE5" does not map to Unicode at <module_name> line xx
#
# Rather we would like the warning to print the filename and the line number:
#
# utf8 "\xE5" does not map to Unicode at line xx of file <filename>.
my $str = '';
open ( my $fh, "<:encoding(utf-8)", $fn ) or die "Could not open file '$fn': $!";
{
local $SIG{__WARN__} = sub { my_warn_handler( $fn, $_[0] ) };
$str = do { local $/; <$fh> };
}
close $fh;
say "Read string: '$str'";
sub my_warn_handler {
my ( $fn, $msg ) = #_;
if ( $msg =~ /\Qdoes not map to Unicode\E/ ) {
recover_line_number_and_char_pos( $fn, $msg );
}
else {
warn $msg;
}
}
sub recover_line_number_and_char_pos {
my ( $fn, $err_msg ) = #_;
chomp $err_msg;
$err_msg =~ s/(line \d+)\.$/$1/; # Remove period at end of sentence.
open ( $fh, "<:raw", $fn ) or die "Could not open file '$fn': $!";
my $raw_data = do { local $/; <$fh> };
close $fh;
my $str = Encode::decode( 'utf-8', $raw_data, Encode::FB_QUIET );
my ($header, $last_line) = $str =~ /^(.*\n)([^\n]*)$/s;
my $line_no = $str =~ tr/\n//;
++$line_no;
my $pos = ( length $last_line ) + 1;
warn "$err_msg, in file '$fn' (line: $line_no, pos: $pos)\n";
}
sub write_test_file {
my ( $fn ) = #_;
my $bytes = "Hello\nA\x{E5}\x{61}"; # 2 lines ending in iso 8859-1: åa
open ( my $fh, '>:raw', $fn ) or die "Could not open file '$fn': $!";
print $fh $bytes;
close $fh;
}
Output:
utf8 "\xE5" does not map to Unicode at ./p.pl line 27
, in file 'test.txt' (line: 2, pos: 2)
Segmentation fault (core dumped)
Here is another way to locate where the warning fires, with un-buffered sysread
use warnings;
use strict;
binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';
my $file = 'test.txt';
open my $fh, "<:encoding(UTF-8)", $file or die "Can't open $file: $!";
$SIG{__WARN__} = sub { print "\t==> WARN: #_" };
my $char_cnt = 0;
my $char;
while (sysread($fh, $char, 1)) {
++$char_cnt;
print "$char ($char_cnt)\n";
}
The file test.txt was written by the posted program, except that I had to add to it to reproduce the behavior -- it runs without warnings on v5.10 and v5.16. I added \x{234234} to the end. The line number can be tracked with $char =~ /\n/.
The sysread returns undef on error. It can be moved into the body of while (1) to allow reads to continue and catch all warnings, breaking out on 0 (returned on EOF).
This prints
H (1)
e (2)
l (3)
l (4)
o (5)
(6)
A (7)
å (8)
a (9)
==> WARN: Code point 0x234234 is not Unicode, may not be portable at ...
(10)
While this does catch the character warned about, re-reading the file using Encode may well be better than reaching for sysread, in particular if sysread uses Encode.
However, Perl is utf8 internally and I am not sure that sysread needs Encode.
Note. The page for sysread supports its use on data with encoding layers
Note that if the filehandle has been marked as :utf8 Unicode
characters are read instead of bytes (the LENGTH, OFFSET, and the
return value of sysread are in Unicode characters). The
:encoding(...) layer implicitly introduces the :utf8 layer.
See binmode, open, and the open pragma.
Note Apparently, things have moved on and after a certain version sysread does not support encoding layers. The link above, while for an older version (v5.10 for one) indeed shows what is quoted, with a newer version tells us that there'll be an exception.

How to detect malformed UTF-8 at the end of the file?

I am trying to print a warning message when reading a file (that is supposed to contain valid UTF-8) contains invalid UTF-8. However, if the invalid data is at the end of the file I am not able to output any warnings. The following MVCE creates a file containing invalid UTF-8 data (creation of the file is not relevant to the the general question, it was just added here to produce a MVCE):
use feature qw(say);
use strict;
use warnings;
binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';
my $bytes = "\x{61}\x{E5}\x{61}"; # 3 bytes in iso 8859-1: aåa
test_read_invalid( $bytes );
$bytes = "\x{61}\x{E5}"; # 2 bytes in iso 8859-1: aå
test_read_invalid( $bytes );
sub test_read_invalid {
my ( $bytes ) = #_;
say "Running test case..";
my $fn = 'test.txt';
open ( my $fh, '>:raw', $fn ) or die "Could not open file '$fn': $!";
print $fh $bytes;
close $fh;
my $str = '';
open ( $fh, "<:encoding(utf-8)", $fn ) or die "Could not open file '$fn': $!";
$str = do { local $/; <$fh> };
close $fh;
say "Read string: '$str'\n";
}
The output is:
Running test case..
utf8 "\xE5" does not map to Unicode at ./p.pl line 22.
Read string: 'a\xE5a'
Running test case..
Read string: 'a'
In the last test case, the invalid byte at the end of the file seems to be silently ignored by the PerlIO layer :encoding(utf-8).
Essentially what you're seeing is the perlIO system attempting to deal with a block read ending in the middle of a utf-8 sequence. So the raw byte buffer still has the invalid byte you want, but the encoded buffer does not yet have that content because it doesn't decode properly yet and it's hoping to find another character later. You can check for this by popping the encoding layer off and doing another read and checking the length.
binmode $fh, ':pop';
my $remainder = do { local $/; <$fh>};
die "Unread Characters" if length $remainder;
I'm not sure, you may want to have your open encoding start with :raw or do binmode $fh, ':raw' instead, I've never paid much attention to the layers themselves since it usually just works. I do know that this code block works for your test case :)
I'm not sure what you are asking. To detect encoding errors in a string, you can simply attempt to decode the string. As for getting an error from writing to the file, maybe close returns an error, or you can use chomp($_); print($fh "$_\n"); (seeing as unix text files should always end with a newline anyway).
open ( my $fh, '>:raw', $fn ) or die "Could not open file '$fn': $!";
#the end of the file need a single space to find a invalid UTF-8 characters.
print $fh "$bytes ";
Output:
Running test case..
utf8 "\xE5" does not map to Unicode at ent.pl line 23.
Read string: 'a\xE5a '
Running test case..
utf8 "\xE5" does not map to Unicode at ent.pl line 23.
Read string: 'a\xE5a '

How to make the output from Text::CSV utf8?

I have a CSV file, say win.csv, whose text is encoded in windows-1252. First I use iconv to make it in utf8.
$iconv -o test.csv -f windows-1252 -t utf-8 win.csv
Then I read the converted CSV file with the following Perl script (utfcsv.pl).
#!/usr/bin/perl
use utf8;
use Text::CSV;
use Encode::Detect::Detector;
my $csv = Text::CSV->new({ binary => 1, sep_char => ';',});
open my $fh, "<encoding(utf8)", "test.csv";
while (my $row = $csv->getline($fh)) {
my $line = join " ", #$row;
my $enc = Encode::Detect::Detector::detect($line);
print "($enc) $line\n";
}
$csv->eof || $csv->error_diag();
close $fh;
$csv->eol("\r\n");
exit;
Then the output is like the following.
(UFT-8) .........
() .....
Namely the encoding of all lines are detected as UTF-8 (or ASCII). But the actual output does not seem to be UTF-8. In fact, if I save the output on a file
$./utfcsv.pl > output.txt
then the encoding of output.txt is detected as windows-1252.
Question: How can I get the output text in UFT-8?
Notes:
Environment: openSUSE 13.2 x86_64, perl 5.20.1
I do not use Text::CSV::Encoded because the installation fails. (Because test.csv is converted in UTF-8, so it is strange to use Text::CSV::Encoded.)
I use the following script to check the encoding. (I also use it to find out the encoding of the initial CSV file win.csv.)
.
#!/usr/bin/perl
use Encode::Detect::Detector;
open my $in, "<","$ARGV[0]" || die "open failed";
while (my $line = <$in>) {
my $enc = Encode::Detect::Detector::detect($line);
chomp $enc;
if ($enc) {
print "$enc\n";
}
}
You have set the encoding of the input file handle (which, by the way, should be <:encoding(utf8) -- note the colon) but you haven't specified the encoding of the output channel, so Perl will send unencoded character values to the output
The Unicode values for characters that will fit in a single byte -- Basic Latin (ASCII) between 0 and 0x7F, and Latin-1 Supplement between 0x80 and 0xFF -- are very similar to Windows code page 1252. In particular a small letter u with a diaresis is 0xFC in both Unicode and CP1252, so the text will look like CP1252 if it is output unencoded, instead of the two-byte sequence 0xC3 0xBC which is the same codepoint encoded in UTF-8
If you use binmode on STDOUT to set the encoding then the data will be output correctly, but it is simplest to use the open pragma like this
use open qw/ :std :encoding(utf-8) /;
which will set the encoding for STDIN, STDOUT and STDERR, as well as any newly-opened file handles. That means you don't have to specify it when you open the CSV file, and your code will look like this
Note that I have also added use strict and use warnings, which are essential in any Perl program. I have also
used autodie to remove the need for checks on the status of all IO operations, and I have taken advantage of the way Perl interpolates arrays inside double quotes by putting a space between the elements which avoids the need for a join call
#!/usr/bin/perl
use utf8;
use strict;
use warnings 'all';
use open qw/ :std :encoding(utf-8) /;
use autodie;
use Text::CSV;
my $csv = Text::CSV->new({ binary => 1, sep_char => ';' });
open my $fh, '<', 'test.csv';
while ( my $row = $csv->getline($fh) ) {
print "#$row\n";
}
close $fh;

Perl Handling Malformed Characters

I'd like advice about Perl.
I have text files I want to process with Perl. Those text files are encoded in cp932, but for some reasons they may contain malformed characters.
My program is like:
#! /usr/bin/perl -w
use strict;
use encoding 'utf-8';
# 'workfile.txt' is supposed to be encoded in cp932
open my $in, "<:encoding(cp932)", "./workfile.txt";
while ( my $line = <$in> ) {
# my process comes here
print $line;
}
If workfile.txt includes malformed characters, Perl complains:
cp932 "\x81" does not map to Unicode at ./my_program.pl line 8, <$in> line 1234.
Perl knows if its input contains malformed characters. So I want to rewrite to see if my input is good or bad and act accordingly, say print all good lines (lines that do not contain malformed characters) to output filehandle A, and print lines that do contain malformed characters to output filehandle B.
#! /usr/bin/perl -w
use strict;
use encoding 'utf-8';
use English;
# 'workfile.txt' is supposed to be encoded in cp932
open my $in, "<:encoding(cp932)", "./workfile.txt";
open my $output_good, ">:encoding(utf8)", "good.txt";
open my $output_bad, ">:encoding(utf8)", "bad.txt";
select $output_good; # in most cases workfile.txt lines are good
while ( my $line = <$in> ) {
if ( $line contains malformed characters ) {
select $output_bad;
}
print "$INPUT_LINE_NUMBER: $line";
select $output_good;
}
My question is how I can write this "if ($line contains malfoomed characters)" part. How can I check if input is good or bad.
Thanks in advance.
#! /usr/bin/perl -w
use strict;
use utf8; # Source encoded using UTF-8
use open ':std', ':encoding(UTF-8)'; # STD* is UTF-8;
# UTF-8 is default encoding for open.
use Encode qw( decode );
open my $fh_in, "<:raw", "workfile.txt"
or die $!;
open my $fh_good, ">", "good.txt"
or die $!;
open my $fh_bad, ">:raw", "bad.txt"
or die $!;
while ( my $line = <$fh_in> ) {
my $decoded_line =
eval { decode('cp932', $line, Encode::FB_CROAK|Encode::LEAVE_SRC) };
if (defined($decoded_line)) {
print($fh_good "$. $decoded_line");
} else {
print($fh_bad "$. $line");
}
}