Perl - Correcting char encoding on command line input

I am writing a program to fix mangled encoding, specifically latin1(iso-8859-1) to greek (iso-8859-7).
I created a function that works as intended; a variable with badly encoded text is converted properly.
When I try to convert $ARGV[0] with this function it doesn't seem to correctly interpret the input.
Here is a test program to demonstrate the issue:
#!/usr/bin/env perl
use 5.018;
use utf8;
use strict;
use open qw(:std :encoding(utf-8));
use Encode qw(encode decode);
sub unmangle {
my $input = shift;
print $input . "\n";
print decode('iso-8859-7', encode('latin1',$input)) . "\n";
}
my $test = "ÁöéÝñùìá"; # should be Αφιέρωμα
say "fix variable:";
unmangle($test);
say "\nfix stdin:";
unmangle($ARGV[0]);
When I run this program with the same input as my $test variable, the results are not the same (as I expected they would be):
$ ./fix_bad_encoding.pl "ÁöéÝñùìá"
fix variable:
ÁöéÝñùìá
Αφιέρωμα
fix stdin:
ÃöéÃñùìá
ΓΓΆΓ©ΓñùìÑ
How do I get $ARGV[0] to behave the way the $test variable does?

You decoded the source (use utf8). You set up decoding for STDIN (which you don't use), STDOUT and STDERR (use open). But not for @ARGV.
$_ = decode("UTF-8", $_) for @ARGV;

The -CA command-line flag tells Perl the arguments are UTF-8 encoded. Alternatively, you can decode the argument from UTF-8 yourself:
unmangle(decode('UTF-8', $ARGV[0]));
Also, it's not "stdin" (that would be reading from *STDIN), but "argument".
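For instance, a minimal sketch of the manual approach applied to the test program above (assuming, as in the question, that the terminal supplies UTF-8 bytes):
use Encode qw(decode);
@ARGV = map { decode('UTF-8', $_) } @ARGV;   # decode all arguments up front
unmangle($ARGV[0]);   # now behaves the same as the $test variable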

Related

Sorting UTF-8 input

I need to sort lines from a file saved as UTF-8. The lines can start with Cyrillic or Latin characters. My code handles the Cyrillic ones incorrectly.
sub sort_by_default {
my @sorted_lines = sort {
$a <=> $b
||
fc($a) cmp fc($b)
} @_;
}
The cmp used with sort can't help with this; it has no notion of encodings and merely compares by codepoint, character by character, with surprises in many languages. Use Unicode::Collate.† See this post for a bit more, tchrist's post for far more, and this perl.com article.
The other issue is of reading (decoding) input and writing (encoding) output in utf8 correctly. One way to ensure that data on standard streams is handled is via the open pragma, with which you can set "layers" so that input and output is decoded/encoded as data is read/written.
Altogether, an example
use warnings;
use strict;
use feature 'say';
use Unicode::Collate;
use open ":std", ":encoding(UTF-8)";
my $file = ...;
open my $fh, '<', $file or die "Can't open $file: $!";
my @lines = <$fh>;
chomp @lines;
my $uc = Unicode::Collate->new();
my @sorted = $uc->sort(@lines);
say for @sorted;
The module's cmp method can be used for individual comparisons (if the data is in a complex data structure and not just a flat list of lines, for instance)
my @sorted = sort { $uc->cmp($a, $b) } @data;
where the sort block extracts from $a and $b (elements of @data) whatever needs comparing.
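For example, a small sketch with a hypothetical array of hashrefs, comparing on a name field:
my @data   = ( { name => 'Ärger' }, { name => 'Beere' } );
my @sorted = sort { $uc->cmp( $a->{name}, $b->{name} ) } @data;   # 'Ärger' sorts before 'Beere'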
If you have UTF-8 data right in the source you need use utf8, while if you receive UTF-8 via other channels (@ARGV included) you may need to manually Encode::decode those strings.
Please see the linked post (and links in it) and documentation for more detail. See this perlmonks post for far more rounded information. See this Effective Perler article on custom sorting.
† Example: by codepoint comparison ä > b while the accepted order in German is ä < b
perl -MUnicode::Collate -wE'use utf8; binmode STDOUT, ":encoding(UTF-8)";
@s = qw(ä b);
say join " ", sort { $a cmp $b } @s; #--> b ä
say join " ", Unicode::Collate->new->sort(@s); #--> ä b
'
so we need to use Unicode::Collate (or a custom sort routine).
To open a file saved as UTF-8, use the appropriate layer:
open my $FH, '<:encoding(UTF-8)', 'filename' or die $!;
Don't forget to set the same layer for the output.
#! /usr/bin/perl
use warnings;
use strict;
binmode *DATA, ':encoding(UTF-8)';
binmode *STDOUT, ':encoding(UTF-8)';
print for sort <DATA>;
__DATA__
Борис
Peter
John
Владимир
The key to handling UTF-8 correctly in Perl is to make sure that Perl knows that a certain source or destination of information is in UTF-8. This is done differently depending on the way you get info in or out. If the UTF-8 is coming from an input file, the way to open the file is:
open( my $fh, '<:encoding(UTF-8)', "filename" ) or die "Cannot open file: $!\n";
If you are going to have UTF-8 inside the source of your script, then make sure you have:
use utf8;
At the beginning of the script.
If you are going to get UTF-8 characters from STDIN, use this at the beginning of the script:
binmode(STDIN, ':encoding(UTF-8)');
For STDOUT use:
binmode(STDOUT, ':encoding(UTF-8)');
Also, make sure you read UTF-8 vs. utf8 vs. UTF8 to know the difference between the encoding names. utf8 or UTF8 will allow valid UTF-8 and also non-valid UTF-8 (according to the first proposed UTF-8 standard) and will not complain about non-valid codepoints. UTF-8 will allow valid UTF-8 but will not allow non-valid codepoint combinations; it is a short name for utf-8-strict. You may also read the question How do I sanitize invalid UTF-8 in Perl?
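To illustrate the difference, a small sketch (the byte string below encodes the surrogate U+D800, which strict UTF-8 forbids but the lax variant accepts):
use Encode qw(decode);
my $bytes  = "\xED\xA0\x80";          # surrogate U+D800 in lax utf8 form
my $lax    = decode('utf8', $bytes);  # accepted by the lax variant (may warn)
my $strict = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
print defined $strict ? "accepted\n" : "rejected: $@";   # rejected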
Finally, following @zdim's advice, you may use at the beginning of the script:
use open ':encoding(UTF-8)';
And other variants as described here. That will set the encoding layer for all open instructions that do not specify a layer explicitly.
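For instance, a brief sketch (the filenames are placeholders):
use open ':encoding(UTF-8)';
open my $in,  '<', 'in.txt' or die $!;        # gets :encoding(UTF-8) implicitly
open my $raw, '<:raw', 'image.png' or die $!; # an explicit layer still wins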

Perl: substitute .* to §

I have to substitute multiple substrings in expressions like $fn = '1./(4.*z.^2-1)' in Perl (v5.24.1@Win10):
$fn =~ s/.\//_/g;
$fn =~ s/.\*/§/g;
$fn =~ s/.\^/^/g;
but § is not working; I get a ┬º in the expression result (1_(4┬ºz^2-1)). I need this for folder and file names, and it worked fine in Matlab@Win10 with fn = strrep(fn, '.*', '§').
How can I get the § in the Perl substitution result?
It works for me:
#! /usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use utf8;
use open IO => ':encoding(UTF-8)', ':std';
my $fn = '1./(4.*z.^2-1)';
s/.\//_/g,
s/.\*/§/g,
s/.\^/^/g
for $fn;
say $fn;
Output:
1_(4§z^2-1)
Note the use utf8: it tells Perl the source code is in UTF-8, so make sure you save the source as UTF-8.
The use open sets the UTF-8 encoding for standard input and output. Make sure the terminal to which you print is configured to work in UTF-8, too.
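Since the question is on Win10, the console code page matters as well. As a hedged aside (not part of the original answer; Win32::Console is a separate CPAN module), the UTF-8 code page can be selected from the script:
use Win32::Console;
Win32::Console::OutputCP(65001);   # switch the console to UTF-8, same as running `chcp 65001` first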

Perl file processing on SHIFT_JIS encoded Japanese files

I have a set of SHIFT_JIS (Japanese) encoded csv files from Windows, which I am trying to process on a Linux server running Perl v5.10.1, using regular expressions to make string replacements.
Here is my requirement:
I want the Perl script's regular expressions to be human readable (at least to a Japanese person), i.e. like this:
s/北/0/g;
Instead of having it littered with hex codes
s/\x{4eba}/0/g;
Right now, I am editing the Perl script in Notepad++ on Windows, and pasting the string I need to search for from the csv data file into the Perl script.
I have the following working test script below:
use strict;
use warnings;
use utf8;
open (IN1, "<:encoding(shift_jis)", "${work_dir}/tmp00.csv") or die "Error: tmp00.csv\n";
open (OUT1, "+>:encoding(shift_jis)" , "${work_dir}/tmp01.csv") or die "Error: tmp01.csv\n";
while (<IN1>)
{
print $_ . "\n";
chomp;
s/北/0/g;
s/10:00/9:00/g;
print OUT1 "$_\n";
}
close IN1;
close OUT1;
This would successfully replace the 10:00 with 9:00 in the csv file, but the issue is I was unable to replace 北 (i.e. North) with 0 unless use utf8 is also included at the top.
Questions:
1) In the open documentation, http://perldoc.perl.org/functions/open.html, I didn't see use utf8 listed as a requirement; is it implicit?
a) If I had use utf8 only, then the first print statement in the loop would print garbage characters to my xterm screen.
b) If I had called open with :encoding(shift_jis) only, then the first print statement in the loop would print Japanese characters to my xterm screen, but the replacement would not happen. There is no warning that use utf8 was not specified.
c) If I used both a) and b), then this example works.
How does “use utf8” modify the behavior of calling open with :encoding(shift_jis) in this Perl script?
2) I also tried opening the file without any encoding specified. Wouldn't Perl treat the file's strings as raw bytes, and be able to perform the regular expression match that way, as long as the strings I pasted into the script are in the same encoding as the text in the original data file? I was able to do filename replacement earlier this way without specifying any encoding whatsoever (please refer to my related post: Perl Japanese to English filename replacement).
Thanks.
UPDATES 1
Testing a simple localization sample in Perl for filename and file text replacement in Japanese
In Windows XP, copy the 南 character from within a .csv data file to the clipboard, then use it as both the file name (i.e. 南.txt) and the file content (南). In Notepad++, reading the file under encoding UTF-8 shows \x93\xEC; reading it under SHIFT_JIS displays 南.
Script:
Use the following Perl script south.pl, which will be run on a Linux server with Perl 5.10
#!/usr/bin/perl
use feature qw(say);
use strict;
use warnings;
use utf8;
use Encode qw(decode encode);
my $user_dir="/usr/frank";
my $work_dir = "${user_dir}/test_south";
# forward declare the function prototypes
sub fileProcess;
opendir(DIR, ${work_dir}) or die "Cannot open directory " . ${work_dir};
# readdir OPTION 1 - shift_jis
#my @files = map { Encode::decode("shift_jis", $_); } readdir DIR; # Note filename could not be decoded as shift_jis
#binmode(STDOUT,":encoding(shift_jis)");
# readdir OPTION 2 - utf8
my @files = map { Encode::decode("utf8", $_); } readdir DIR; # Note filename could be decoded as utf8
binmode(STDOUT,":encoding(utf8)"); # setting display to output utf8
say @files;
# pass an array reference of files that will be modified
fileNameTranslate();
fileProcess();
closedir(DIR);
exit;
sub fileNameTranslate
{
foreach (@files)
{
my $original_file = $_;
#print "original_file: " . "$original_file" . "\n";
s/南/south/;
my $new_file = $_;
# print "new_file: " . "$_" . "\n";
if ($new_file ne $original_file)
{
print "Rename " . $original_file . " to \n\t" . $new_file . "\n";
rename("${work_dir}/${original_file}", "${work_dir}/${new_file}") or print "Warning: rename failed because: $!\n";
}
}
}
sub fileProcess
{
# file process OPTION 3, open file as shift_jis, the search and replace would work
# open (IN1, "<:encoding(shift_jis)", "${work_dir}/south.txt") or die "Error: south.txt\n";
# open (OUT1, "+>:encoding(shift_jis)" , "${work_dir}/south1.txt") or die "Error: south1.txt\n";
# file process OPTION 4, open file as utf8, the search and replace would not work
open (IN1, "<:encoding(utf8)", "${work_dir}/south.txt") or die "Error: south.txt\n";
open (OUT1, "+>:encoding(utf8)" , "${work_dir}/south1.txt") or die "Error: south1.txt\n";
while (<IN1>)
{
print $_ . "\n";
chomp;
s/南/south/g;
print OUT1 "$_\n";
}
close IN1;
close OUT1;
}
Result:
(BAD) Uncomment Option 1 and 3, (Comment Option 2 and 4)
Setup: Readdir encoding, SHIFT_JIS; file open encoding SHIFT_JIS
Result: file name replacement failed.
Error: utf8 "\x93" does not map to Unicode at .//south.pl line 68.
\x93
(BAD) Uncomment Option 2 and 4 (Comment Option 1 and 3)
Setup: Readdir encoding, utf8; file open encoding utf8
Result: file name replacement worked, south.txt generated
But south1.txt file content replacement failed , it has the content \x93 ().
Error: "\x{fffd}" does not map to shiftjis at .//south.pl line 25.
... -Ao?= (Bx{fffd}.txt
(GOOD) Uncomment Option 2 and 3, (Comment Option 1 and 4)
Setup: Readdir encoding, utf8; file open encoding SHIFT_JIS
Result: file name replacement worked, south.txt generated
South1.txt file content replacement worked, it has the content south.
Conclusion:
I had to use different encoding schemes for this example to work properly: readdir utf8, and file processing SHIFT_JIS, as the content of the csv file was SHIFT_JIS encoded.
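Here is a sketch of that conclusion in one place (the filename encoding is an assumption; Linux filenames are raw bytes, taken here to be UTF-8):
use strict;
use warnings;
use utf8;   # the literal 南 below is stored as UTF-8 in the source
use Encode qw(decode encode);
my $work_dir = '/usr/frank/test_south';   # path from the question
my $fs_enc   = 'UTF-8';                   # assumed filename encoding on this box
opendir my $dh, $work_dir or die "Cannot open $work_dir: $!";
for my $name (map { decode($fs_enc, $_) } readdir $dh) {
    (my $new = $name) =~ s/南/south/ or next;   # skip names that don't match
    # encode back to bytes before handing the names to the OS
    rename encode($fs_enc, "$work_dir/$name"), encode($fs_enc, "$work_dir/$new")
        or warn "rename failed: $!";
}
closedir $dh;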
A good place to start would be the documentation for the utf8 module, which says:
The use utf8 pragma tells the Perl parser to allow UTF-8 in the program text in the current lexical scope (allow UTF-EBCDIC on EBCDIC based platforms). The no utf8 pragma tells Perl to switch back to treating the source text as literal bytes in the current lexical scope.
If you don't have use utf8 in your code, then the Perl compiler assumes that your source code is in your system's native single-byte encoding. And the character '北' will make little sense. Adding the pragma tells Perl that your code includes Unicode characters and everything starts to work.
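A minimal sketch of the difference (the string here stands in for a line decoded from the shift_jis file):
use strict;
use warnings;
use utf8;   # the source below contains the literal character 北
use open ':std', ':encoding(UTF-8)';
my $line = "北海道";                 # as if decoded from the input file
print $line =~ s/北/0/gr, "\n";     # prints "0海道"
# Without "use utf8" the pattern /北/ would be compiled from the raw
# UTF-8 bytes of the source and would not match the decoded string.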

In Perl how do I pass unicode arguments to external commands?

The root cause for this question is my attempt to write tests for a new option/argument processing module (OptArgs) for Perl. This of course involves parsing @ARGV, which I am doing based on the answers to this question. This works fine on systems where I18N::Langinfo::CODESET is defined[1].
On systems where langinfo(CODESET) is not available I would like to at least make a best effort based on observed behaviour. However my tests so far indicate that on some systems I cannot even pass a Unicode argument to an external script properly.
I have managed to run something like the following on various systems, where "test_script" is a Perl script that merely does a print Dumper(@ARGV):
use utf8;
my $utf8 = '¥';
my $result = qx/$^X test_script $utf8/;
What I have found is that on FreeBSD the test_script receives bytes which can be decoded into Perl's internal format. However on OpenBSD and Solaris test_script appears to get the string "\x{fffd}\x{fffd}" which contains only the unicode replacement character (twice?).
I don't know the mechanism underlying the qx operator. I presume it either exec's or shells out, but unlike filehandles (where I can binmode them for encoding) I don't know how to ensure it does what I want. Same with system() for that matter. So my question is what am I not doing correctly above? Otherwise what is different with Perl or the shell or the environment on OpenBSD and Solaris?
[1] Actually I think so far that is only Linux according to CPAN testers results.
Update(x2): I currently have the following running its way through cpantester's setups to test Schwern's hypothesis:
use strict;
use warnings;
use Data::Dumper;
BEGIN {
if (@ARGV) {
require Test::More;
Test::More::diag( "\npre utf8::all: "
. Dumper( { utf8 => $ARGV[0], bytes => $ARGV[1] } ) );
}
}
use utf8;
use utf8::all;
BEGIN {
if (@ARGV) {
Test::More::diag( "\npost utf8::all: "
. Dumper( { utf8 => $ARGV[0], bytes => $ARGV[1] } ) );
exit;
}
}
use Encode;
use Test::More;
my $builder = Test::More->builder;
binmode $builder->output, ':encoding(UTF-8)';
binmode $builder->failure_output, ':encoding(UTF-8)';
binmode $builder->todo_output, ':encoding(UTF-8)';
my $utf8 = '¥';
my $bytes = encode_utf8($utf8);
diag( "\nPassing: " . Dumper( { utf8 => $utf8, bytes => $bytes, } ) );
open( my $fh, '-|', $^X, $0, $utf8, $bytes ) || die "open: $!";
my $result = join( '', <$fh> );
close $fh;
ok(1);
done_testing();
I'll post the results from various systems as they come through. Any comments on the validity and/or correctness of this would be appreciated. Note that it is not intended to be a valid test; the purpose of the above is to be able to compare what is received on different systems.
Resolution: The real underlying issue turns out to be something not addressed in my question nor by Schwern's answer below. What I discovered is that some cpantesters machines only have an ASCII locale installed/available. I should not expect any attempt to pass UTF-8 characters to programs in such an environment to work. So in the end my problem was invalid test conditions, not invalid code.
I have seen nothing so far to indicate that the qx operator or the utf8::all module has any effect on how parameters are passed to external programs. The critical component appears to be the LANG and/or LC_ALL environment variables, which inform the external program what locale it is running in.
By the way, my original assertion that my code was working on all systems where I18N::Langinfo::CODESET is defined was incorrect.
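In other words, a sketch of what that implies (the locale name is an assumption; pick one that locale -a lists on the target system):
use utf8;
use Encode qw(encode);
$ENV{LC_ALL} = 'en_US.UTF-8';     # tell the child it runs in a UTF-8 locale
my $arg = encode('UTF-8', '¥');   # hand the child raw UTF-8 bytes
my $result = qx/$^X test_script $arg/;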
qx makes a call to the shell and it may be interfering.
To avoid that, use utf8::all to switch on all the Perl Unicode voodoo. Then use the open function to open a pipe to your program, avoiding the shell.
use utf8::all;
my $utf8 = '¥';
open my $read_from_script, "-|", "test_script", $utf8;
print <$read_from_script>,"\n";
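If you would rather not pull in utf8::all, a hedged alternative (not from the answer above) is to encode the argument explicitly before spawning:
use utf8;
use Encode qw(encode);
my $utf8 = '¥';
open my $read_from_script, '-|', 'test_script', encode('UTF-8', $utf8)
    or die "open: $!";
print while <$read_from_script>;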

How do I read UTF-8 with diamond operator (<>)?

I want to read UTF-8 input in Perl, no matter if it comes from the standard input or from a file, using the diamond operator: while(<>){...}.
So my script should be callable in these two ways, as usual, giving the same output:
./script.pl utf8.txt
cat utf8.txt | ./script.pl
But the outputs differ! Only the second call (using cat) seems to work as designed, reading UTF-8 properly. Here is the script:
#!/usr/bin/perl -w
binmode STDIN, ':utf8';
binmode STDOUT, ':utf8';
while(<>){
my @chars = split //, $_;
print "$_\n" foreach (@chars);
}
How can I make it read UTF-8 correctly in both cases? I would like to keep using the diamond operator <> for reading, if possible.
EDIT:
I realized I should probably describe the different outputs. My input file contains this sequence: a\xCA\xA7b. The method with cat correctly outputs:
a
\xCA\xA7
b
But the other method gives me this:
a
\xC3\x8A
\xC2\xA7
b
Try to use the pragma open instead:
use strict;
use warnings;
use open qw(:std :utf8);
while(<>){
my @chars = split //, $_;
print "$_" foreach (@chars);
}
You need to do this because the <> operator is magical. As you know, it will read from STDIN or from the files in @ARGV. Reading from STDIN causes no problem, as STDIN is already open and thus binmode works well on it. The problem is when reading from the files in @ARGV: when your script starts and calls binmode, those files are not open yet. binmode sets STDIN to UTF-8, but that IO channel is not used when @ARGV has files; in that case the <> operator opens a new file handle for each file in @ARGV, and each new file handle misses the UTF-8 layer. By using the open pragma you force each newly opened handle to be UTF-8.
Your script works if you do this:
#!/usr/bin/perl -w
binmode STDOUT, ':utf8';
while(<>){
binmode ARGV, ':utf8';
my @chars = split //, $_;
print "$_\n" foreach (@chars);
}
The magic filehandle that <> reads from is called *ARGV, and it is opened when you call readline.
But really, I am a fan of explicitly using Encode::decode and Encode::encode when appropriate.
You can switch on UTF8 by default with the -C flag:
perl -CSD -ne 'print join("\n",split //);' utf8.txt
The switch -CSD turns on UTF8 unconditionally; if you use simply -C it will turn on UTF8 only if the relevant environment variables (LC_ALL, LC_CTYPE and LANG) indicate so. See perlrun for details.
This is not recommended if you don't invoke perl directly (in particular, it might not work reliably if you pass options to perl from the shebang line). See the other answers in that case.
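If you cannot invoke perl directly, the PERL_UNICODE environment variable (documented in perlrun) is equivalent to -C and sidesteps the shebang limitation:
PERL_UNICODE=SD ./script.pl utf8.txt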
If you put a call to binmode inside of the while loop, then it will switch the handle to utf8 mode AFTER the first line is read in. That is probably not what you want to do.
Something like the following might work better:
#!/usr/bin/env perl -w
binmode STDOUT, ':utf8';
eof() ? exit : binmode ARGV, ':utf8';
while( <> ) {
my @chars = split //, $_;
print "$_\n" foreach (@chars);
} continue {
binmode ARGV, ':utf8' if eof && !eof();
}
The call to eof() with parens is magical, as it checks for end of file on the pseudo-filehandle used by <>. It will, if necessary, open the next handle that needs to be read, which typically has the effect of making *ARGV valid, but without reading anything out of it. This allows us to binmode the first file that's read from, before anything is read from it.
Later, eof (without parens) is used; this checks the last handle that was read from for end of file. It will be true after we process the last line of each file from the command line (or when stdin reaches its end).
Obviously, if we've just processed the last line of one file, calling eof() (with parens) opens the next file (if there is one), makes *ARGV valid (if it can), and tests for end of file on that next file. If that next file is present, and isn't at end of file, then we can safely use binmode on ARGV.