I'd like advice about Perl.
I have text files I want to process with Perl. Those text files are encoded in cp932, but for some reason they may contain malformed characters.
My program is like:
#! /usr/bin/perl -w
use strict;
use encoding 'utf-8';
# 'workfile.txt' is supposed to be encoded in cp932
open my $in, "<:encoding(cp932)", "./workfile.txt";
while ( my $line = <$in> ) {
# my process comes here
print $line;
}
If workfile.txt includes malformed characters, Perl complains:
cp932 "\x81" does not map to Unicode at ./my_program.pl line 8, <$in> line 1234.
Perl knows if its input contains malformed characters. So I want to rewrite to see if my input is good or bad and act accordingly, say print all good lines (lines that do not contain malformed characters) to output filehandle A, and print lines that do contain malformed characters to output filehandle B.
#! /usr/bin/perl -w
use strict;
use encoding 'utf-8';
use English;
# 'workfile.txt' is supposed to be encoded in cp932
open my $in, "<:encoding(cp932)", "./workfile.txt";
open my $output_good, ">:encoding(utf8)", "good.txt";
open my $output_bad, ">:encoding(utf8)", "bad.txt";
select $output_good; # in most cases workfile.txt lines are good
while ( my $line = <$in> ) {
if ( $line contains malformed characters ) {
select $output_bad;
}
print "$INPUT_LINE_NUMBER: $line";
select $output_good;
}
My question is how I can write this "if ($line contains malformed characters)" part. How can I check whether the input is good or bad?
Thanks in advance.
#! /usr/bin/perl -w
use strict;
use utf8; # Source encoded using UTF-8
use open ':std', ':encoding(UTF-8)'; # STD* is UTF-8;
# UTF-8 is default encoding for open.
use Encode qw( decode );
open my $fh_in, "<:raw", "workfile.txt"
or die $!;
open my $fh_good, ">", "good.txt"
or die $!;
open my $fh_bad, ">:raw", "bad.txt"
or die $!;
while ( my $line = <$fh_in> ) {
my $decoded_line =
eval { decode('cp932', $line, Encode::FB_CROAK|Encode::LEAVE_SRC) };
if (defined($decoded_line)) {
print($fh_good "$. $decoded_line");
} else {
print($fh_bad "$. $line");
}
}
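The check at the heart of the loop above can be tried in isolation. This is only a minimal sketch: the helper name is_valid_cp932 and the two sample byte strings are assumptions made for illustration, not part of the original program.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: true if $bytes decodes cleanly as cp932.
sub is_valid_cp932 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps $bytes from being modified.
    my $decoded = eval {
        decode( 'cp932', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC() );
    };
    return defined $decoded;
}

# Illustrative samples: a valid two-byte character, and a lone lead byte.
for my $sample ( "\x82\xA0\n", "\x81\n" ) {
    printf "%vX: %s\n", $sample, is_valid_cp932($sample) ? 'good' : 'bad';
}
```

Reading with :raw and decoding by hand is what makes a per-line decision possible; with an :encoding(cp932) layer on the handle, the failure is reported as a warning during the read instead.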
I have a problem with Perl output: the French word "préféré" is sometimes output as "pr�f�r�":
The sample script :
devel@k0:~/tmp$ cat 02.pl
#!/usr/bin/env perl
use strict;
use warnings;
print "préféré\n";
open( my $fh, '<:encoding(UTF-8)', 'text' ) ;
while ( <$fh> ) { print $_ }
close $fh;
exit;
The execution :
devel@k0:~/tmp$ ./02.pl
préféré
pr�f�r�
devel@k0:~/tmp$ cat text
préféré
devel@k0:~/tmp$ file text
text: UTF-8 Unicode text
Can someone please help me?
Decode your inputs, encode your outputs. You have two bugs related to failure to properly decode and encode.
Specifically, you're missing
use utf8;
use open ":std", ":encoding(UTF-8)";
Details follow.
Perl source code is expected to be ASCII (with 8-bit clean string literals) unless you use use utf8 to tell Perl it's UTF-8.
I believe you have a UTF-8 terminal. We can conclude from the fact that cat 02.pl works that your source code is encoded using UTF-8. This means Perl sees the equivalent of this:
print "pr\x{C3}\x{A9}f\x{C3}\x{A9}r\x{C3}\x{A9}\n"; # C3 A9 = é encoded using UTF-8
You should be using use utf8; so Perl sees the equivalent of
print "pr\x{E9}f\x{E9}r\x{E9}\n"; # E9 = Unicode Code Point for é
You correctly decode the file you read.
The file presumably contains
70 72 C3 A9 66 C3 A9 72 C3 A9 0A # préféré␊ encoded using UTF-8
Because of the encoding layer you add, you are effectively doing
$_ = decode( "UTF-8", "\x{70}\x{72}\x{C3}\x{A9}\x{66}\x{C3}\x{A9}\x{72}\x{C3}\x{A9}\x{0A}" );
or
$_ = "pr\x{E9}f\x{E9}r\x{E9}\n";
This is correct.
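A quick way to see the difference between the encoded and the decoded form is to compare lengths. This is just a minimal sketch; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# "préféré" as UTF-8 encoded bytes (what the file contains).
my $bytes = "pr\xC3\xA9f\xC3\xA9r\xC3\xA9";

# The same word as decoded text (what Perl should work with).
my $chars = decode( 'UTF-8', $bytes );

printf "%d bytes, %d characters\n", length($bytes), length($chars);

# Encoding the text again gives back the original bytes.
my $round_trip = encode( 'UTF-8', $chars );
print "round trip ok\n" if $round_trip eq $bytes;
```

Each é takes two bytes in UTF-8 but is a single character once decoded, which is why the two lengths differ.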
Finally, you fail to encode your outputs.
The following does what you want:
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
BEGIN {
binmode( STDIN, ":encoding(UTF-8)" ); # Well, not needed here.
binmode( STDOUT, ":encoding(UTF-8)" );
binmode( STDERR, ":encoding(UTF-8)" );
}
print "préféré\n";
open( my $fh, '<:encoding(UTF-8)', 'text' ) or die $!;
while ( <$fh> ) { print $_ }
close $fh;
But the open pragma makes it a lot cleaner.
The following does what you want:
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use open ":std", ":encoding(UTF-8)";
print "préféré\n";
open( my $fh, '<', 'text' ) or die $!;
while ( <$fh> ) { print $_ }
close $fh;
UTF-8 is an interesting problem. First, your Perl script appears to print the literal correctly, because you don't do any UTF-8 handling. You have a UTF-8 string, but Perl doesn't know that it is UTF-8, so it prints the bytes as-is.
So on a UTF-8 terminal everything looks fine, even though it isn't.
When you add use utf8; to your source code, you will see that your print now produces the same garbage. But if your source contains UTF-8 string literals, that is what you should do.
use utf8;
# Now also prints garbage
print "préféré\n";
open my $fh, '<:encoding(UTF-8)', 'text';
while ( <$fh> ) {
print $_;
}
close $fh;
Next: every input you read from the outside needs a decode, and every output you write needs an encode.
use utf8;
use Encode qw(encode decode);
# Now correct
print encode("UTF-8", "préféré\n");
open my $fh, '<:encoding(UTF-8)', 'text';
while ( <$fh> ) {
print encode("UTF-8", $_);
}
close $fh;
This can be tedious, but you can enable automatic encoding on a filehandle with binmode:
use utf8;
# Activate UTF-8 Encode on STDOUT
binmode STDOUT, ':utf8';
print "préféré\n";
open my $fh, '<:encoding(UTF-8)', 'text';
while ( <$fh> ) {
print $_;
}
close $fh;
Now everything is UTF-8! You can also activate it on STDERR. Remember that if you want to print binary data on STDOUT (for whatever reason), you must disable the layer:
binmode STDOUT, ':raw';
I have a perl string containing Unicode characters and I want to create a file with this string as a filename. It should work on Windows, Linux and Mac whatever the locale used.
Here is my code:
use strict;
use warnings FATAL => 'all';
use Encode::Locale;
use Encode;
# ファイル.c
my $file = "\x{30D5}\x{30A1}\x{30A4}\x{30EB}.c";
$file = encode(locale_fs => $file);
open(my $filehdl, '>', $file) or die("Unable to create file: $!");
close($filehdl);
I use encode function because, according to this answer:
Perl treats file names as opaque strings of bytes. They need to be encoded as per your "locale"'s encoding (ANSI code page).
However, this code fails with the following error:
Unable to create file: Invalid argument at .\perl.pl line 15.
I took a deeper look on how the string is encoded by encode:
my $rep = sprintf '%v02X', $file;
print($rep);
This prints:
3F.3F.3F.3F.2E.63
In my current locale (CP-1252), it corresponds to ????.c. We can see that each Unicode character has been replaced by a question mark.
I think it is normal to have question marks here because the characters in my string are not representable using CP-1252 encoding.
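The substitution can be reproduced without Encode::Locale. This is a minimal sketch that assumes cp1252 stands in for whatever locale_fs resolves to on this machine; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw( encode );

# ファイル.c
my $name = "\x{30D5}\x{30A1}\x{30A4}\x{30EB}.c";

# Default behaviour: each unmappable character becomes the
# encoding's substitution character.
my $lossy = encode( 'cp1252', $name );
printf "%vX\n", $lossy;

# With FB_CROAK, encode() dies instead of substituting, so the
# loss can be detected before the file is created.
my $strict = eval {
    encode( 'cp1252', $name, Encode::FB_CROAK() | Encode::LEAVE_SRC() );
};
print "not representable in cp1252\n" unless defined $strict;
```

Checking with FB_CROAK at least turns the silent replacement into a detectable error; it does not make the name representable.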
So, my question is: is there a way to create a file with a name containing Unicode characters?
For Windows there is the module Win32::LongPath, which allows not only long file names but also Unicode characters in them.
I wrote myself a module for all the file and directory I/O I need. On Windows it uses that module's functions, and elsewhere the standard Perl ones, like so:
use Carp;
use Fcntl qw( :flock :seek );
# Match native Windows only; /Win/i would also match "darwin" (macOS) and "cygwin".
use constant USE_LONG => ( $^O eq 'MSWin32' ) ? 1 : 0;
use if USE_LONG, 'Win32::LongPath', ':funcs';
sub open
{
my $f = shift; # file
my $m = shift; # mode
my $l = @_ ? (shift) : 'utf8'; # encoding
my $lock = $m eq '<' ? LOCK_SH : LOCK_EX;
length $l
and $m .= ":$l";
my $h;
USE_LONG ? openL( \$h, $m, $f ) : open( $h, $m, $f ) # openL needs REF on Handle!
or confess "Can't open file: '$f' ($^E)";
flock( $h, $lock );
return $h;
}
That way the code is portable. It runs on a Linux server as well as on my Windows PC at home.
I am a newbie to Perl, and this is my second assignment: I should create a program that parses n files and prints m sentences using an n-gram model. Long story short, I wrote this script, which takes n arguments, where the first and second arguments are numeric and the rest are file names. However, I am getting this error: Wide character in print at ngram.pl line 35, <$fh> line 1.
steps to reproduce it :
input from command line : perl ngram.pl 5 10 tale-cities.txt bleak-house.txt papers.txt
output : Wide character in print at ngram.pl line 35, <$fh> line 1.
#!/usr/bin/perl
use strict;
use warnings FATAL => 'all';
use Scalar::Util qw(looks_like_number);
use utf8;
use Encode;
#Charles Dickens
sub checkIfNumberic
{
my ($inp)=@_;
if (looks_like_number($inp)){
return "True";
}
else{
return "False" ;
}
}
sub main
{
my $correctInput=", your input must be something like this 5 10 somefile.txt somefile2.txt ";
my @inputs= @ARGV;
if (checkIfNumberic($inputs[0]) eq "False"){
die "first argument must be numberic $correctInput\n";
}
if (checkIfNumberic($inputs[1]) eq "False"){
die "second argument must be numberic $correctInput\n";
}
for (my $i=2; $i< scalar #inputs ;$i++)
{
if (open(my $fh, '<:encoding(UTF-8)', $inputs[$i])) {
while (my $line = <$fh>) {
chomp $line;
print "$line \n";
}
}
}
}
main();
You decoded your inputs (the script, with use utf8;; and the file, with :encoding(UTF-8)), but you didn't encode your outputs. Add
use open ':std', ':encoding(UTF-8)';
This is equivalent to
BEGIN {
binmode STDIN, ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';
binmode STDERR, ':encoding(UTF-8)';
}
It also sets the default encoding for file handles opened in its lexical scope, so you can remove the existing :encoding(UTF-8) if you want.
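The effect of an encoding layer on output can be seen without touching STDOUT. This is a minimal sketch that prints to in-memory file handles; that setup, the sample character and the variable names are all assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $text = "\x{263A}\n";    # one character above 0xFF, plus a newline

# Without an encoding layer, print warns "Wide character in print".
my @warnings;
{
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    open my $plain, '>', \my $buffer or die "open failed: $!";
    print {$plain} $text;
    close $plain;
}
print "warned: $_" for @warnings;

# With an encoding layer, the text is encoded on the way out
# and no warning is issued.
open my $out, '>:encoding(UTF-8)', \my $encoded or die "open failed: $!";
print {$out} $text;
close $out;
printf "%d bytes written\n", length $encoded;
```

The same thing happens on STDOUT: the layer added by use open ':std', ':encoding(UTF-8)' does the encoding that print otherwise has to guess at.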
I am trying to improve the warning message issued by Encode::decode(). Instead of printing the name of the module and the line number in the module, I would like it to print the name of the file being read and the line number in that file where the malformed data was found. To a developer, the original message can be useful, but to an end user not familiar with Perl, it is probably quite meaningless. The end user would probably rather know which file is causing the problem.
I first tried to solve this using a $SIG{__WARN__} handler (which is probably not a good idea), but I get a segfault. Probably a silly mistake, but I could not figure it out:
#! /usr/bin/env perl
use feature qw(say);
use strict;
use warnings;
use Encode ();
binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';
my $fn = 'test.txt';
write_test_file( $fn );
# Try to improve the Encode::FB_WARN fallback warning message :
#
# utf8 "\xE5" does not map to Unicode at <module_name> line xx
#
# Rather we would like the warning to print the filename and the line number:
#
# utf8 "\xE5" does not map to Unicode at line xx of file <filename>.
my $str = '';
open ( my $fh, "<:encoding(utf-8)", $fn ) or die "Could not open file '$fn': $!";
{
local $SIG{__WARN__} = sub { my_warn_handler( $fn, $_[0] ) };
$str = do { local $/; <$fh> };
}
close $fh;
say "Read string: '$str'";
sub my_warn_handler {
my ( $fn, $msg ) = @_;
if ( $msg =~ /\Qdoes not map to Unicode\E/ ) {
recover_line_number_and_char_pos( $fn, $msg );
}
else {
warn $msg;
}
}
sub recover_line_number_and_char_pos {
my ( $fn, $err_msg ) = @_;
chomp $err_msg;
$err_msg =~ s/(line \d+)\.$/$1/; # Remove period at end of sentence.
open ( $fh, "<:raw", $fn ) or die "Could not open file '$fn': $!";
my $raw_data = do { local $/; <$fh> };
close $fh;
my $str = Encode::decode( 'utf-8', $raw_data, Encode::FB_QUIET );
my ($header, $last_line) = $str =~ /^(.*\n)([^\n]*)$/s;
my $line_no = $str =~ tr/\n//;
++$line_no;
my $pos = ( length $last_line ) + 1;
warn "$err_msg, in file '$fn' (line: $line_no, pos: $pos)\n";
}
sub write_test_file {
my ( $fn ) = @_;
my $bytes = "Hello\nA\x{E5}\x{61}"; # 2 lines ending in iso 8859-1: åa
open ( my $fh, '>:raw', $fn ) or die "Could not open file '$fn': $!";
print $fh $bytes;
close $fh;
}
Output:
utf8 "\xE5" does not map to Unicode at ./p.pl line 27
, in file 'test.txt' (line: 2, pos: 2)
Segmentation fault (core dumped)
Here is another way to locate where the warning fires, using unbuffered sysread:
use warnings;
use strict;
binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';
my $file = 'test.txt';
open my $fh, "<:encoding(UTF-8)", $file or die "Can't open $file: $!";
$SIG{__WARN__} = sub { print "\t==> WARN: @_" };
my $char_cnt = 0;
my $char;
while (sysread($fh, $char, 1)) {
++$char_cnt;
print "$char ($char_cnt)\n";
}
The file test.txt was written by the posted program, except that I had to add to it to reproduce the behavior -- it runs without warnings on v5.10 and v5.16. I added \x{234234} to the end. The line number can be tracked with $char =~ /\n/.
The sysread returns undef on error. It can be moved into the body of while (1) to allow reads to continue and catch all warnings, breaking out on 0 (returned on EOF).
This prints
H (1)
e (2)
l (3)
l (4)
o (5)
(6)
A (7)
å (8)
a (9)
==> WARN: Code point 0x234234 is not Unicode, may not be portable at ...
(10)
While this does catch the character warned about, re-reading the file using Encode may well be better than reaching for sysread, in particular if sysread uses Encode.
However, Perl is utf8 internally and I am not sure that sysread needs Encode.
Note: the documentation page for sysread describes its use on handles with encoding layers:
Note that if the filehandle has been marked as :utf8 Unicode
characters are read instead of bytes (the LENGTH, OFFSET, and the
return value of sysread are in Unicode characters). The
:encoding(...) layer implicitly introduces the :utf8 layer.
See binmode, open, and the open pragma.
Note: apparently things have moved on, and after a certain version sysread no longer supports encoding layers. The link above, for an older version (v5.10, for one), does show what is quoted, but the page for a newer version says that there will be an exception.
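Since sysread on a handle with an encoding layer is no longer an option on newer versions, here is a minimal sketch of the "re-read using Encode" alternative mentioned above. The helper name first_malformed is hypothetical; it takes the raw bytes of the file and reports where decoding first stops.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: returns (line, position) of the first malformed
# UTF-8 sequence in $bytes, or an empty list if everything decodes.
sub first_malformed {
    my ($bytes) = @_;
    my $line_no = 0;
    for my $line ( split /^/, $bytes ) {    # keep the line endings
        ++$line_no;
        my $rest = $line;
        # FB_QUIET stops at the first bad sequence, returns what was
        # decoded so far and leaves the unprocessed part in $rest.
        my $ok = decode( 'UTF-8', $rest, Encode::FB_QUIET() );
        return ( $line_no, length($ok) + 1 ) if length $rest;
    }
    return;
}

# Same bytes as the test file written by the question's program.
my ( $line, $pos ) = first_malformed("Hello\nA\x{E5}\x{61}");
print "malformed data at line $line, pos $pos\n" if defined $line;
```

Splitting on newlines before decoding is safe for UTF-8, because a newline byte never occurs inside a multi-byte sequence.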
I have a file with one phrase/term per line, which I read into Perl from STDIN. I have a list of stopwords (like "á", "são", "é"), and I want to compare each of them with each term and remove the term if they are equal. The problem is that I'm not certain of the file's encoding.
I get this from the file command:
words.txt: Non-ISO extended-ASCII English text
My Linux terminal uses UTF-8, and it shows the right content for some words but not for others. Here is the output for some of them:
condi<E3>
conte<FA>dos
ajuda, mas não resolve
mo<E7>ambique
pedagógico são fenómenos
You can see that the 3rd and 5th lines correctly show words with accents and special characters, while the others don't. The correct output for the other lines should be: condiã, conteúdos and moçambique.
If I use binmode(STDOUT, utf8), the "incorrect" lines now output correctly while the other ones don't. For example, the 3rd line:
ajuda, mas nÃ£o resolve
What should I do?
I strongly suggest you create a filter that takes a file with lines in mixed encodings and translates them to pure UTF-8. Then instead of
open(INPUT, "< badstuff.txt") || die "open failed: $!";
you would open either the fixed version, or a pipe from the fixer, like:
open(INPUT, "fixit < badstuff.txt |") || die "open failed: $!";
In either event, you would then
binmode(INPUT, ":encoding(UTF-8)") || die "binmode failed";
Then the fixit program could just do this:
use strict;
use warnings;
use Encode qw(decode FB_CROAK);
binmode(STDIN, ":raw") || die "can't binmode STDIN";
binmode(STDOUT, ":utf8") || die "can't binmode STDOUT";
while (my $line = <STDIN>) {
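# Decode into a separate variable: assigning the eval result back to
# $line would leave it undef on failure, with nothing left to retry.
my $text = eval { decode("UTF-8", $line, FB_CROAK() | Encode::LEAVE_SRC()) };
if (!defined $text) {
$text = decode("CP1252", $line, FB_CROAK()); # no eval{}!
}
$text =~ s/\R\z/\n/; # fix raw mode reads
print STDOUT $text;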
}
close(STDIN) || die "can't close STDIN: $!";
close(STDOUT) || die "can't close STDOUT: $!";
exit 0;
See how that works? Of course, you could change it to default to some other encoding, or have multiple fall backs. Probably it would be best to take a list of them in @ARGV.
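The "multiple fall backs" idea could look roughly like this. It is only a sketch: the helper name decode_first_fit is hypothetical, and the list of encodings is passed in directly here rather than read from @ARGV.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: try each encoding in order and return the first
# decoding that succeeds, or undef if none of them fits.
sub decode_first_fit {
    my ( $bytes, @encodings ) = @_;
    for my $encoding (@encodings) {
        # LEAVE_SRC keeps $bytes intact for the next attempt.
        my $text = eval {
            decode( $encoding, $bytes,
                Encode::FB_CROAK() | Encode::LEAVE_SRC() );
        };
        return $text if defined $text;
    }
    return undef;
}

# The first sample is valid UTF-8; the second only fits cp1252.
for my $bytes ( "n\xC3\xA3o", "mo\xE7ambique" ) {
    my $text = decode_first_fit( $bytes, 'UTF-8', 'cp1252' );
    printf "%vX => %d characters\n", $bytes, length $text;
}
```

The order matters: UTF-8 has to come first, because almost any byte sequence decodes without error in a single-byte encoding such as cp1252.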
It works like this:
C:\Dev\Perl :: chcp
Aktive Codepage: 1252.
C:\Dev\Perl :: type mixed-encoding.txt
eins zwei drei Käse vier fünf Wurst
eins zwei drei Käse vier fünf Wurst
C:\Dev\Perl :: perl mixed-encoding.pl < mixed-encoding.txt
eins zwei drei vier fünf
eins zwei drei vier fünf
Where mixed-encoding.pl goes like this:
use strict;
use warnings;
use utf8; # source in UTF-8
use Encode 'decode_utf8';
use List::MoreUtils 'any';
my @stopwords = qw( Käse Wurst );
while ( <> ) { # read octets
chomp;
my @tokens;
for ( split /\s+/ ) {
# Try UTF-8 first. If that fails, assume legacy Latin-1.
my $token = eval { decode_utf8 $_, Encode::FB_CROAK };
$token = $_ if $@;
push @tokens, $token unless any { $token eq $_ } @stopwords;
}
print "@tokens\n";
}
Note that the script doesn't have to be encoded in UTF-8. It's just that if you have funky character data in your script you have to make sure the encoding matches, so use utf8 if your encoding is UTF-8, and don't if it isn't.
Update based on tchrist's sound advice:
use strict;
use warnings;
# source in Latin1
use Encode 'decode';
use List::MoreUtils 'any';
my @stopwords = qw( Käse Wurst );
while ( <> ) { # read octets
chomp;
my @tokens;
for ( split /\s+/ ) {
# Try UTF-8 first. If that fails, assume 8-bit encoding.
my $token = eval { decode utf8 => $_, Encode::FB_CROAK };
$token = decode Windows1252 => $_, Encode::FB_CROAK if $@;
push @tokens, uc $token unless any { $token eq $_ } @stopwords;
}
print "@tokens\n";
}