In Perl how do I pass unicode arguments to external commands?

The root cause for this question is my attempt to write tests for a new option/argument processing module (OptArgs) for Perl. This of course involves parsing @ARGV, which I am doing based on the answers to this question. This works fine on systems where I18N::Langinfo::CODESET is defined[1].
On systems where langinfo(CODESET) is not available I would like to at least make a best effort based on observed behaviour. However my tests so far indicate that on some systems I cannot even pass a unicode argument to an external script properly.
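For reference, where the module is available the codeset query is just this (a minimal sketch):

use I18N::Langinfo qw(langinfo CODESET);

my $codeset = langinfo(CODESET);    # e.g. "UTF-8" under a UTF-8 locale
print "Native codeset: $codeset\n";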
I have managed to run something like the following on various systems, where "test_script" is a Perl script that merely does a print Dumper(@ARGV):
use utf8;
my $utf8 = '¥';
my $result = qx/$^X test_script $utf8/;
What I have found is that on FreeBSD the test_script receives bytes which can be decoded into Perl's internal format. However on OpenBSD and Solaris test_script appears to get the string "\x{fffd}\x{fffd}" which contains only the unicode replacement character (twice?).
I don't know the mechanism underlying the qx operator. I presume it either exec's or shells out, but unlike filehandles (where I can binmode them for encoding) I don't know how to ensure it does what I want. Same with system() for that matter. So my question is what am I not doing correctly above? Otherwise what is different with Perl or the shell or the environment on OpenBSD and Solaris?
[1] Actually I think so far that is only Linux according to CPAN testers results.
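For comparison, one variation is to encode the argument to bytes myself before it reaches qx, so that no implicit conversion happens at the process boundary (a sketch, assuming the child expects UTF-8):

use utf8;
use Encode qw(encode);

my $utf8   = '¥';
my $bytes  = encode('UTF-8', $utf8);        # explicit octets
my $result = qx/$^X test_script $bytes/;    # or: system $^X, 'test_script', $bytes;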
Update(x2): I currently have the following running its way through CPAN testers' setups to test Schwern's hypothesis:
use strict;
use warnings;
use Data::Dumper;
BEGIN {
if (@ARGV) {
require Test::More;
Test::More::diag( "\npre utf8::all: "
. Dumper( { utf8 => $ARGV[0], bytes => $ARGV[1] } ) );
}
}
use utf8;
use utf8::all;
BEGIN {
if (@ARGV) {
Test::More::diag( "\npost utf8::all: "
. Dumper( { utf8 => $ARGV[0], bytes => $ARGV[1] } ) );
exit;
}
}
use Encode;
use Test::More;
my $builder = Test::More->builder;
binmode $builder->output, ':encoding(UTF-8)';
binmode $builder->failure_output, ':encoding(UTF-8)';
binmode $builder->todo_output, ':encoding(UTF-8)';
my $utf8 = '¥';
my $bytes = encode_utf8($utf8);
diag( "\nPassing: " . Dumper( { utf8 => $utf8, bytes => $bytes, } ) );
open( my $fh, '-|', $^X, $0, $utf8, $bytes ) || die "open: $!";
my $result = join( '', <$fh> );
close $fh;
ok(1);
done_testing();
I'll post the results on various systems when they come through. Any comments on the validity and/or correctness of this would be appreciated. Note that it is not intended to be a valid test. The purpose of the above is to be able to compare what is received on different systems.
Resolution: The real underlying issue turns out to be something not addressed in my question nor by Schwern's answer below. What I discovered is that some cpantesters machines only have an ascii locale installed/available. I should not expect any attempt to pass UTF-8 characters to programs in this type of environment to work. So in the end my problem was invalid test conditions, not invalid code.
I have seen nothing so far to indicate that the qx operator or the utf8::all module have any effect on how parameters are passed to external programs. The critical component appears to be the LANG and/or LC_ALL environment variables, which inform the external program what locale it is running in.
By the way, my original assertion that my code was working on all systems where I18N::Langinfo::CODESET is defined was incorrect.
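Based on that, a guard along these lines (a sketch using Test::More's skip_all; the regex is a rough heuristic) keeps the test from running in an ASCII-only environment:

use Test::More;

my $locale = $ENV{LC_ALL} || $ENV{LC_CTYPE} || $ENV{LANG} || '';
plan skip_all => 'no UTF-8 locale available'
    unless $locale =~ /UTF-?8/i;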

qx makes a call to the shell and it may be interfering.
To avoid that, use utf8::all to switch on all the Perl Unicode voodoo. Then use the open function to open a pipe to your program, avoiding the shell.
use utf8::all;
my $utf8 = '¥';
open my $read_from_script, "-|", "test_script", $utf8;
print <$read_from_script>,"\n";
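The bytes coming back through the pipe still need a decoding layer; if you'd rather not rely on utf8::all for that, an :encoding layer can be stacked onto the open mode directly (a sketch, assuming the script emits UTF-8):

use Encode qw(encode);
binmode STDOUT, ':encoding(UTF-8)';
my $utf8 = '¥';
open my $read_from_script, '-|:encoding(UTF-8)',
    'test_script', encode('UTF-8', $utf8)
    or die "open: $!";
print <$read_from_script>;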

Related

Perl disable shell access

Certain builtins like system and exec (as well as backticks) will use the shell (I think sh by default) if passed a single argument containing shell metacharacters. If I want to write a portable program that avoids making any assumptions about the underlying shell, is there a pragma or some other option I can use to either disable shell access or trigger a fatal error immediately?
I write about this extensively in Mastering Perl. The short answer is to use system in its list form.
system '/path/to/command', @args;
This doesn't interpret any special characters in @args.
At the same time, you should enable taint checking to help catch bad data before you pass it to the system. See the perlsec documentation for details.
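For illustration, a minimal taint-mode sketch: under -T any data from outside the program is tainted and must be laundered through a regex capture before it can reach system:

#!/usr/bin/perl -T
use strict;
use warnings;

$ENV{PATH} = '/bin:/usr/bin';             # PATH must be untainted too

my ($file) = $ARGV[0] =~ /\A([\w.\/-]+)\z/
    or die "unsafe argument\n";           # the capture untaints
system '/bin/ls', '-l', $file;            # list form: no shell involved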
There are limited options to do this; keep in mind that these are core routines, and completely disabling them may have some unexpected consequences. You do have a few options.
Override Locally
You can override system and exec locally by using the subs pragma; this will only affect the package into which you have imported the subroutine:
#!/usr/bin/env perl
use subs 'system';
sub system { die('Do not use system calls!!'); }
# .. more code here; note this will run
my $out = system('ls -lah'); # I will die at this point, but not before
print $out;
Override Globally
To override globally, in the current perl process, you need to import your function into the CORE::GLOBAL pseudo-namespace at compile time:
#!/usr/bin/env perl
BEGIN {
*CORE::GLOBAL::system = sub {
die('Do not use system calls.');
};
*CORE::GLOBAL::exec = sub {
die('Do not use exec.');
};
*CORE::GLOBAL::readpipe = sub {
die('Do not use back ticks.');
};
}
#...
my $out = system('ls -lah'); # I will die at this point, but not before
print $out;
Prevent anything from running if it is in the source
If you want to prevent any code from running before getting to a system call, you can include the following. Note this is fairly loose in its matching; I've written it to be easy to modify or update:
package Acme::Noshell;
open 0 or print "Can't execute '$0'\n" and exit;
my $source = join "", <0>;
die("Found back ticks in '$0'") if($source =~ m/^.*[^#].*\`/g);
die("Found 'exec' in '$0'") if($source =~ / exec\W/g);
die("Found 'system' in '$0'") if($source =~ / system\W/g);
1;
Which can be used as follows:
#!/usr/bin/env perl
use strict;
use warnings;
use Acme::Noshell;
print "I wont print because of the call below";
my $out = system('ls -lah');

Setting diamond operator to work in binmode?

The diamond operator (<>) works in text mode by default; is it possible to change it to binmode? The binmode function seems to accept a handle only.
See perldoc perlopentut:
Binary Files
On certain legacy systems with what could charitably be called terminally
convoluted (some would say broken) I/O models, a file isn't a file--at
least, not with respect to the C standard I/O library. On these old
systems whose libraries (but not kernels) distinguish between text and
binary streams, to get files to behave properly you'll have to bend over
backwards to avoid nasty problems. On such infelicitous systems, sockets
and pipes are already opened in binary mode, and there is currently no way
to turn that off. With files, you have more options.
Another option is to use the "binmode" function on the appropriate handles
before doing regular I/O on them:
binmode(STDIN);
binmode(STDOUT);
while (<STDIN>) { print }
Passing "sysopen" a non-standard flag option will also open the file in
binary mode on those systems that support it. This is the equivalent of
opening the file normally, then calling "binmode" on the handle.
sysopen(BINDAT, "records.data", O_RDWR | O_BINARY)
|| die "can't open records.data: $!";
Now you can use "read" and "print" on that handle without worrying about
the non-standard system I/O library breaking your data. It's not a pretty
picture, but then, legacy systems seldom are. CP/M will be with us until
the end of days, and after.
On systems with exotic I/O systems, it turns out that, astonishingly
enough, even unbuffered I/O using "sysread" and "syswrite" might do sneaky
data mutilation behind your back.
while (sysread(WHENCE, $buf, 1024)) {
syswrite(WHITHER, $buf, length($buf));
}
Depending on the vicissitudes of your runtime system, even these calls may
need "binmode" or "O_BINARY" first. Systems known to be free of such
difficulties include Unix, the Mac OS, Plan 9, and Inferno.
<> is a convenience. If it only iterated through filenames specified on the command line, you could use $ARGV within while (<>) to detect when a new file was opened, binmode it, and then seek back to the beginning. Of course, this does not work in the presence of redirection (console input is a whole other story).
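That $ARGV-tracking approach would look roughly like this (a sketch; the seek back to the start discards the line that was already read in text mode, and it fails on unseekable input such as pipes):

my $current = '';
while (<>) {
    if ($ARGV ne $current) {    # <> just opened a new file
        $current = $ARGV;
        binmode ARGV;
        seek ARGV, 0, 0;        # reread from the top in binary mode
        next;
    }
    print;
}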
One solution is to detect whether @ARGV contains something and open each file individually, defaulting to reading from STDIN. A rudimentary implementation of this using an iterator could be:
#!/usr/bin/env perl
use strict;
use warnings;
use Carp qw( croak );
my $argv = sub {
@_ or return sub {
my $done;
sub {
$done and return;
$done = 1;
binmode STDIN;
\*STDIN;
}
}->();
my @argv = @_;
sub {
@argv or return;
my $file = shift @argv;
open my $fh, '<', $file
or croak "Cannot open '$file': $!";
binmode $fh;
$fh;
};
}->(@ARGV);
binmode STDOUT;
while (my $fh = $argv->()) {
while (my $line = <$fh>) {
print $line;
}
}
Note:
C:\...\Temp> xxd test.txt
00000000: 7468 6973 2069 7320 6120 7465 7374 0a0a this is a test..
Without binmode:
C:\...\Temp> perl -e "print while <>" test.txt | xxd
00000000: 7468 6973 2069 7320 6120 7465 7374 0d0a this is a test..
00000010: 0d0a ..
With the script above:
C:\...\Temp> perl argv.pl test.txt | xxd
00000000: 7468 6973 2069 7320 6120 7465 7374 0a0a this is a test..
Same results using perl ... < test.txt | xxd, or piping text through perl ...

Is there a module that searches for superfluous code?

Is there a module which can find code that is not needed?
As an example, here is a script with code that is not needed to run it:
#!/usr/bin/env perl
use warnings;
use 5.12.0;
use utf8;
binmode STDOUT, ':utf8';
use DateTime;
use WWW::Mechanize;
sub my_print {
my ( $string, $tab, $color ) = @_;
say $string;
}
sub check {
my $string = shift;
return if length $string > 10;
return $string;
}
my_print( 'Hello World' );
Not categorically. Perl is notoriously difficult to analyze without actually executing it, to the point that compiling a Perl program to be run later actually requires including a copy of the perl interpreter! As a result there are very few code analysis tools for Perl. What you can do is use a profiler, but this is a bit overkill (and, as I mentioned, requires actually executing the program). I like Devel::NYTProf. This will spit out some HTML files showing how many times each line or sub was executed, as well as how much time was spent there, but this only works for that specific execution of the program. It will allow you to see that WWW::Mechanize is loaded but never called, but it will not be able to tell you if warnings or binmode had any effect on execution.
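For reference, a typical Devel::NYTProf run looks like:

perl -d:NYTProf yourscript.pl
nytprofhtml    # writes the HTML report under ./nytprof/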
Devel::Cover provides code coverage metrics that may be of some use here.
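Devel::Cover is driven much the same way, e.g.:

perl -MDevel::Cover yourscript.pl
cover    # summarizes the collected cover_db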

Unicode string mess in perl

I have an external module that is returning some strings to me. I am not sure how exactly the strings are returned. I don't really know how Unicode strings work and why.
The module should return, for example, the Czech word "být", meaning "to be". If I display the string returned by the module with Data::Dumper, I see it as b\x{fd}t.
However, if I try to print it with print $s, I get a "Wide character in print" warning, and ? instead of ý.
If I try Encode::decode(whatever, $s);, the resulting string cannot be printed anyway (always with the "Wide character" warning, sometimes with mangled characters, sometimes right), no matter what I put in whatever.
If I try Encode::encode("utf-8", $s);, the resulting string CAN be printed without the problems or error message.
If I use use encoding 'utf8';, printing works without any need of encoding/decoding. However, if I use the IO::CaptureOutput or Capture::Tiny module, it starts shouting "Wide character" again.
I have a few questions, mostly about what exactly happens. (I tried to read the perldocs, but I was not much wiser afterwards.)
Why can't I print the string right after getting it from the module?
Why can't I print the string decoded by "decode"? What exactly did "decode" do?
What exactly did "encode" do, and why was there no problem printing it after encoding?
What exactly does use encoding do? Why is the default encoding different from utf-8?
What do I have to do if I want to print the scalars without any problems, even when using one of the capturing modules?
edit: Some people tell me to use -C or binmode or PERL_UNICODE. That is great advice. However, somehow, both of the capturing modules magically destroy the UTF8-ness of STDOUT. That seems to be more of a bug in the modules, but I am not really sure.
edit2: OK, the best solution was to dump the modules and write the "capturing" myself (with much less flexibility).
Because you output a string in perl's internal form (utf8) to a non-unicode filehandle.
The decode function decodes a sequence of bytes assumed to be in ENCODING into Perl's internal form (utf8). Your input seems to be already decoded.
The encode() function encodes a string from Perl's internal form into ENCODING.
The encoding pragma allows you to write your script in any encoding you like. String literals are automatically converted to perl's internal form.
Make sure Perl knows which encoding your data comes in, and which encoding it should go out in.
See also perluniintro, perlunicode, Encode module, binmode() function.
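To make the decode/encode distinction concrete, here is a minimal round trip (the byte values shown assume UTF-8):

use Encode qw(decode encode);

my $bytes  = "b\xC3\xBDt";              # UTF-8 octets for "být"
my $string = decode('UTF-8', $bytes);   # internal form: length($string) == 3
my $octets = encode('UTF-8', $string);  # back to octets: length($octets) == 4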
I recommend reading the Unicode chapter of my book Effective Perl Programming. We put together all the docs we could find and explained Unicode in Perl much more coherently than I've seen anywhere else.
This program works fine for me:
#!perl
use utf8;
use 5.010;
binmode STDOUT, ':utf8';
my $string = return_string();
say $string;
sub return_string { 'být' }
Additionally, Capture::Tiny works just fine for me:
#!perl
use utf8;
use 5.010;
use Capture::Tiny qw(capture);
binmode STDOUT, ':utf8';
my( $stdout, $stderr ) = capture {
system( $^X, '/Users/brian/Desktop/czech.pl' );
};
say "STDOUT is [$stdout]";
IO::CaptureOutput seems to have some problems though:
#!perl
use utf8;
use 5.010;
use IO::CaptureOutput qw(capture);
binmode STDOUT, ':utf8';
capture {
system( $^X, '/Users/brian/Desktop/czech.pl' );
} \my $stdout, \my $stderr;
say "STDOUT is [$stdout]";
For this I get:
STDOUT is [být
]
However, that's easy to fix. Don't use that module. :)
You should also look at the PERL_UNICODE environment variable, which is the same as using the -C option. That allows you to set STDIN/STDOUT/STDERR (and #ARGV) to be UTF-8 without having to alter your scripts.
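For example, either of these turns on UTF-8 for the standard streams, @ARGV, and newly opened handles without touching the script (see perlrun for the flag letters):

perl -CSDA script.pl
PERL_UNICODE=SDA perl script.pl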

How do I read UTF-8 with diamond operator (<>)?

I want to read UTF-8 input in Perl, no matter if it comes from the standard input or from a file, using the diamond operator: while(<>){...}.
So my script should be callable in these two ways, as usual, giving the same output:
./script.pl utf8.txt
cat utf8.txt | ./script.pl
But the outputs differ! Only the second call (using cat) seems to work as designed, reading UTF-8 properly. Here is the script:
#!/usr/bin/perl -w
binmode STDIN, ':utf8';
binmode STDOUT, ':utf8';
while(<>){
my @chars = split //, $_;
print "$_\n" foreach(@chars);
}
How can I make it read UTF-8 correctly in both cases? I would like to keep using the diamond operator <> for reading, if possible.
EDIT:
I realized I should probably describe the different outputs. My input file contains this sequence: a\xCA\xA7b. The method with cat correctly outputs:
a
\xCA\xA7
b
But the other method gives me this:
a
\xC3\x8A
\xC2\xA7
b
Try to use the pragma open instead:
use strict;
use warnings;
use open qw(:std :utf8);
while(<>){
my @chars = split //, $_;
print "$_" foreach(@chars);
}
You need to do this because the <> operator is magical. As you know, it will read from STDIN or from the files in @ARGV. Reading from STDIN causes no problem, as STDIN is already open and thus binmode works well on it. The problem is when reading from the files in @ARGV: when your script starts and calls binmode, the files are not open yet. This causes STDIN to be set to UTF-8, but that IO channel is not used when @ARGV has files. In that case the <> operator opens a new file handle for each file in @ARGV. Each file handle gets reset and loses its UTF-8 attribute. By using the pragma open you force each newly opened file handle to be in UTF-8.
Your script works if you do this:
#!/usr/bin/perl -w
binmode STDOUT, ':utf8';
while(<>){
binmode ARGV, ':utf8';
my @chars = split //, $_;
print "$_\n" foreach(@chars);
}
The magic filehandle that <> reads from is called *ARGV, and it is
opened when you call readline.
But really, I am a fan of explicitly using Encode::decode and
Encode::encode when appropriate.
You can switch on UTF8 by default with the -C flag:
perl -CSD -ne 'print join("\n",split //);' utf8.txt
The switch -CSD turns on UTF8 unconditionally; if you use simply -C it will turn on UTF8 only if the relevant environment variables (LC_ALL, LC_CTYPE and LANG) indicate so. See perlrun for details.
This is not recommended if you don't invoke perl directly (in particular, it might not work reliably if you pass options to perl from the shebang line). See the other answers in that case.
If you put a call to binmode inside of the while loop, then it will switch the handle to utf8 mode AFTER the first line is read in. That is probably not what you want to do.
Something like the following might work better:
#!/usr/bin/env perl -w
binmode STDOUT, ':utf8';
eof() ? exit : binmode ARGV, ':utf8';
while( <> ) {
my @chars = split //, $_;
print "$_\n" foreach(@chars);
} continue {
binmode ARGV, ':utf8' if eof && !eof();
}
The call to eof() with parens is magical, as it checks for end of file on the pseudo-filehandle used by <>. It will, if necessary, open the next handle that needs to be read, which typically has the effect of making *ARGV valid, but without reading anything out of it. This allows us to binmode the first file that's read from, before anything is read from it.
Later, eof (without parens) is used; this checks the last handle that was read from for end of file. It will be true after we process the last line of each file from the command line (or when STDIN reaches its end).
Obviously, if we've just processed the last line of one file, calling eof() (with parens) opens the next file (if there is one), makes *ARGV valid (if it can), and tests for end of file on that next file. If that next file is present, and isn't at end of file, then we can safely use binmode on ARGV.