Assume that I have following code to open filehandle:
open my $fh, "command1 | command2 |";
I found command1 may output that command2 can not handle well, so I'm trying to insert a perl filter between command1 and command2 to deal with them:
use Config;
open my $fh, "command1 | $Config{perlpath} -ple 'blah blah' | command2 |";
My questions are:
Is it OK to use $Config{perlpath} in system call directly?
Calling own perl binary seems nuts. Are there any better solutions?
Thanks
Is it OK to use $Config{perlpath} in system call directly?
Relatively. It's about as portable as the rest of your code (which already depends on running on something unix-ish). There's some security worry, but I'd say not a very large one because someone who can affect the value of that variable already has scope to cause havoc. $^X is probably safer in that regard. You might want to try quoting it using String::ShellQuote just for safety, since you can't bypass the shell in the midst of a pipeline.
Calling own perl binary seems nuts. Are there any better solutions?
Depends on your definition of "better". There's definitely another way around, which is to run both command1 and command2 separately, process command1's output in your original perl process, hand it to command2, and read command2's output. However, you have to be careful how you do it. There are two safe ways.
The first way is easier, but takes more time and memory: run command1 first, read and process all of its output into a string, then run command2 providing the buffered output as input. The best way to do this is to use IPC::Run to handle command2's input and output (and maybe both commands, just for consistency); if you try to just print all the data to command2's input handle and then read all the output you can deadlock, but IPC::Run will interleave reads and writes if necessary behind the scenes, and give you a nice easy interface.
The second way is more complicated but closer in behavior to the original: you need some kind of async framework like IO::Async or POE, using its classes for process construction, and set up handlers to communicate between them, do your filtering, and gather the output.
Here's a tested toy example of that (gist because it's a couple screenfuls of code) that does the equivalent of ls | perl -pe '$_ = uc' | rev, except with the middle part of the pipeline running in the parent perl process. You may never use it, but I thought it was worth illustrating.
Related
Sorry, this is pretty basic, and I suspect a duplicate, but after some searching I'm coming up empty:
Given the following script:
#!/usr/bin/perl
use strict;
use warnings;
use IPC::Run3;
my $stdout2;
print $ARGV[0];
print "\n";
my #cmd1 = split /\s+/, $ARGV[0] ;
run3 (\#cmd1, \undef, \$stdout2, \$stdout2);
print $stdout2
And running it like so:
£ perl comp.pl "md5sum *(.)"
md5sum *(.)
md5sum: '*(.)': No such file or directory
Fair enough. The *(.) isn't being intrepreted by the shell and probably most would consider this a feature. But I would like it to be intepreted by the current shell (or zsh specifically would be fine).
The question is how I can do this without complicating the shell command to run the perl script.
Prepending "zsh" and "-c" to cmd1 is ok if that's a reasonable way to do it. It just seems like...it isn't.
My intention is also to pass slightly more complex commands to this script eventually, like so:
perl comp.pl 'md5sum *(.)' 'ssh remoteHost "md5sum *(.)"'
I have no objection to non-perl answers to the problem you can probably infer I'm trying to solve (I suspect rsync could do this) but I'm primarily interested in solving this through Perl as there'll eventually be business-specific logic in this comparison.
EDIT
I tried various forms of:
my $cmd = $ARGV[0];
run3 (\$cmd, \undef, \$stdout2, \$stdout2);
the documentation seems to think this would be ok, but I get:
Not an ARRAY reference at /usr/local/share/perl/5.22.1/IPC/Run3.pm line 320.
The IPC::Run3 docs say that one can pass a string instead of an arrayref for the command
run3($cmd, $stdin, $stdout, $stderr, \%options)
...
$cmd
Usually $cmd will be an ARRAY reference and the child is invoked via
system #$cmd;
But $cmd may also be a string in which case the child is invoked via
system $cmd;
In this case the string $cmd is passed to the shell if it contains shell metacharacters. So take input without splitting it, $cmd = $ARGV[0], or join it after validation, $cmd = join ' ', #cmd;
Even in general this is not the preferred way, and the docs warn to see system for "pitfalls" of it.
Things are yet much worse here since you'd be passing user input directly for execution! Never mind possible nefarious intents, just think of what a good typo can do. Even without that, there is simply a difference between typing a command at the terminal and passing it to a script, which may edit it, may get modified, pick up bugs, etc.
If nothing else, I'd urge to add code for substantial checks of submitted input. An analysis may involve identifying the known and accepted metacharacters while suitably quoting parts of input that shouldn't be interpreted, for example using String::ShellQuote.
But I'd really suggest to reconsider the design, so to not submit complete commands to the script. Rather, specify with keywords what should happen. Things like globbing (assembling a file list) are done from Perl really nicely and with a lot of control. Do outside only what is necessary; generally there'll be no need for the shell then.
I want to create a simple IO object that represents a pipe opened to another program to that I can periodically write to another program's STDIN as my app runs. I want it to be bullet-proof (in that it catches all errors) and cross-platform. The best options I can find are:
open
sub io_read {
local $SIG{__WARN__} = sub { }; # Silence warning.
open my $pipe, '|-', #_ or die "Cannot exec $_[0]: $!\n";
return $pipe;
}
Advantages:
Cross-platform
Simple
Disadvantages
No $SIG{PIPE} to catch errors from the piped program
Are other errors caught?
IO::Pipe
sub io_read {
IO::Pipe->reader(#_);
}
Advantages:
Simple
Returns an IO::Handle object for OO interface
Supported by the Perl core.
Disadvantages
Still No $SIG{PIPE} to catch errors from the piped program
Not supported on Win32 (or, at least, its tests are skipped)
IPC::Run
There is no interface for writing to a file handle in IPC::Run, only appending to a scalar. This seems…weird.
IPC::Run3
No file handle interface here, either. I could use a code reference, which would be called repeatedly to spool to the child, but looking at the source code, it appears that it actually writes to a temporary file, and then opens it and spools its contents to the pipe'd command's STDIN. Wha?
IPC::Cmd
Still no file handle interface.
What am I missing here? It seems as if this should be a solved problem, and I'm kind of stunned that it's not. IO::Pipe comes closest to what I want, but the lack of $SIG{PIPE} error handling and the lack of support for Windows is distressing. Where is the piping module that will JDWIM?
Thanks to guidance from #ikegami, I have found that the best choice for interactively reading from and writing to another process in Perl is IPC::Run. However, it requires that the program you are reading from and writing to have a known output when it is done writing to its STDOUT, such as a prompt. Here's an example that executes bash, has it run ls -l, and then prints that output:
use v5.14;
use IPC::Run qw(start timeout new_appender new_chunker);
my #command = qw(bash);
# Connect to the other program.
my ($in, #out);
my $ipc = start \#command,
'<' => new_appender("echo __END__\n"), \$in,
'>' => new_chunker, sub { push #out, #_ },
timeout(10) or die "Error: $?\n";
# Send it a command and wait until it has received it.
$in .= "ls -l\n";
$ipc->pump while length $in;
# Wait until our end-of-output string appears.
$ipc->pump until #out && #out[-1] =~ /__END__\n/m;
pop #out;
say #out;
Because it is running as an IPC (I assume), bash does not emit a prompt when it is done writing to its STDOUT. So I use the new_appender() function to have it emit something I can match to find the end of the output (by calling echo __END__). I've also used an anonymous subroutine after a call to new_chunker to collect the output into an array, rather than a scalar (just pass a reference to a scalar to '>' if you want that).
So this works, but it sucks for a whole host of reasons, in my opinion:
There is no generally useful way to know that an IPC-controlled program is done printing to its STDOUT. Instead, you have to use a regular expression on its output to search for a string that usually means it's done.
If it doesn't emit one, you have to trick it into emitting one (as I have done here—god forbid if I should have a file named __END__, though). If I was controlling a database client, I might have to send something like SELECT 'IM OUTTA HERE';. Different applications would require different new_appender hacks.
The writing to the magic $in and $out scalars feels weird and action-at-a-distance-y. I dislike it.
One cannot do line-oriented processing on the scalars as one could if they were file handles. They are therefore less efficient.
The ability to use new_chunker to get line-oriented output is nice, if still a bit weird. That regains a bit of the efficiency on reading output from a program, though, assuming it is buffered efficiently by IPC::Run.
I now realize that, although the interface for IPC::Run could potentially be a bit nicer, overall the weaknesses of the IPC model in particular makes it tricky to deal with at all. There is no generally-useful IPC interface, because one has to know too much about the specifics of the particular program being run to get it to work. This is okay, maybe, if you know exactly how it will react to inputs, and can reliably recognize when it is done emitting output, and don't need to worry much about cross-platform compatibility. But that was far from sufficient for my need for a generally useful way to interact with various database command-line clients in a CPAN module that could be distributed to a whole host of operating systems.
In the end, thanks to packaging suggestions in comments on a blog post, I decided to abandon the use of IPC for controlling those clients, and to use the DBI, instead. It provides an excellent API, robust, stable, and mature, and suffers none of the drawbacks of IPC.
My recommendation for those who come after me is this:
If you just need to execute another program and wait for it to finish, or collect its output when it is done running, use IPC::System::Simple. Otherwise, if what you need to do is to interactively interface with something else, use an API whenever possible. And if it's not possible, then use something like IPC::Run and try to make the best of it—and be prepared to give up quite a bit of your time to get it "just right."
I've done something similar to this. Although it depends on the parent program and what you are trying to pipe. My solution was to spawn a child process (leaving $SIG{PIPE} to function) and writing that to the log, or handling the error in what way you see fit. I use POSIX to handle my child process and am able to utilize all the functionality of the parent. However if you're trying to have the child communicate back to the parent - then things get difficult. Do you have an example of the main program and what you're trying to PIPE?
There is a system() call in a Perl script with multiple pipes, using a single scalar argument. The call looks more or less like this:
system("zcat /foo.gz | grep '^.{6}X|Y|Z' | awk '{print $2,$3,$4,$6}' | bzip2 > /foo.processed.bz2");
The file in question (foo.gz) is quite large, about 2GB compressed in size. I guess that's why it was originally done via a system call.
Questions:
The problem now is, that this system call always seem to return 0, whether one of the system commands fail or not. I assume this is because it gets invoked via sh -c '...'. Is that correct?
Is there a way to check if a system() call was successful if only a single scalar argument is passed?
Is there a better way to process a large file like this, in a way thats equally or more efficient (in terms of speed mainly)?
Thanks for any hints as I am not really familiar with Perl.
Two things:
When you do a system call, the value returned is the last value in the pipeline. Thus, you're getting the status code of the bzip2 command.
The reason the program is doing this is because the people who wrote the program probably didn't know any better. I've seen Perl programs use system calls for finding the basename of the file, doing a find, and even doing a copy/rename/move. These are all things that can be done faster and easier inside the Perl program. And, you don't have the whole Windows/Unix compatibility issues.
You're always better off using Perl modules for things like this. In this case, I bet the Perl modules will be even faster than the shell pipeline, and you'll have more control over the entire operation.
There's a set called IO::Compress that can handle both Zip and BZip2.
I use Archive::Zip which is a great module, but you want to use the Bzip2 compression algorithm, and Archive::Zip can't handle that.
system() returns what the /bin/sh shell returns. When multiple commands are pipelined, the shell forks a new process for each of them and the status code of the last command in the chain is returned, in this case bzip2.
Based on your comments and answers, I'd do it like that now:
$infile =~ s/(.*\.gz)\s*$/gzip -dc < $1|/;
open(OUTFH, "| /bin/bzip > $outfile") or die "Can't open $outfile: $!";
open(INFH, $infile) or die "Can't open $infile: $!";
while (my $line = <INFH>) {
if ($line =~ /^.{6}X|Y|Z) {
# TODO: the awk part...
print OUTFH $line;
}
}
close(INFH);
close(OUTFH);
Please feel free to comment and vote up/down.
You'd be better doing the text processing from within perl itself - that's what perl's for :)
system() only ever returns 0 or 1. To capture actual output, try calling it via backticks: `command` rather than system('command')
In Perl, to run another Perl script from my script, or to run any system commands like mv, cp, pkgadd, pkgrm, pkginfo, rpm etc, we can use the following:
system()
exec()
`` (Backticks)
Are all the three the same, or are they different? Do all the three give the same result in every case? Are they used in different scenarios, like to call a Perl program we have to use system() and for others we have to use ``(backticks).
Please advise, as I am currently using system() for all the calls.
They're all different, and the docs explain how they're different. Backticks capture and return output; system returns an exit status, and exec never returns at all if it's successful.
IPC::System::Simple is probably what you want.
It provides safe, portable alternatives to backticks, system() and other IPC commands.
It also allows you to avoid the shell for most of said commands, which can be helpful in some circumstances.
The best option is to use some module, either in the standard library or from CPAN, that does the job for you. It's going to be more portable, and possibly even faster for quick tasks (no forking to the system).
However, if that's not good enough for you, you can use one of those three, and no, they are not the same. Read the perldoc pages on system(), exec(), and backticks to see the difference.
Calling system is generally a mistake. For instance, instead of saying
system "mv $foo /tmp" == 0
or die "could not move $foo to /tmp";
system "cp $foo /tmp" == 0
or die "could not copy $foo to /tmp";
you should say
use File::Copy;
move $foo, "/tmp"
or die "could not move $foo to /tmp: $!";
copy $foo, "/tmp"
or die "could not copy $foo to /tmp: $!";
Look on CPAN for modules that handle other commands for you. If you find yourself writing a lot of calls to system, it may be time to drop back into a shell script instead.
Well, the more people the more answers.
My answer is to generally avoid external commands execution. If you can - use modules. There is no point executing "cp", "mv" and a lot of another command - there exist modules that do it. And the benefit of using modules is that they usually work cross-platform. While your system("mv") might not.
When put in situation that I have no other way, but to call external command, I prefer to use IPC::Run. The idea is that all simplistic methods (backticks, qx, system, open with pipe) are inherently insecure, and require attention to parameters.
With IPC::Run, I run commands like I would do with system( #array ), which is much more secure, and I can bind to stdin, stdout and stderr separately, using variables, or callbacks, which is very cool when you'll be in situation that you have to pass data to external program from long-running code.
Also, it has built-in handling of timeouts, which come handy more than once :)
If you don't want the shell getting involved (usually you don't) and if waiting for the system command is acceptable, I recommend using IPC::Run3. It is simple, flexible, and does the common task of executing a program, feeding it input and capturing its output and error streams right.
Many beginning programmers write code like this:
sub copy_file ($$) {
my $from = shift;
my $to = shift;
`cp $from $to`;
}
Is this bad, and why? Should backticks ever be used? If so, how?
A few people have already mentioned that you should only use backticks when:
You need to capture (or supress) the output.
There exists no built-in function or Perl module to do the same task, or you have a good reason not to use the module or built-in.
You sanitise your input.
You check the return value.
Unfortunately, things like checking the return value properly can be quite challenging. Did it die to a signal? Did it run to completion, but return a funny exit status? The standard ways of trying to interpret $? are just awful.
I'd recommend using the IPC::System::Simple module's capture() and system() functions rather than backticks. The capture() function works just like backticks, except that:
It provides detailed diagnostics if the command doesn't start, is killed by a signal, or returns an unexpected exit value.
It provides detailed diagnostics if passed tainted data.
It provides an easy mechanism for specifying acceptable exit values.
It allows you to call backticks without the shell, if you want to.
It provides reliable mechanisms for avoiding the shell, even if you use a single argument.
The commands also work consistently across operating systems and Perl versions, unlike Perl's built-in system() which may not check for tainted data when called with multiple arguments on older versions of Perl (eg, 5.6.0 with multiple arguments), or which may call the shell anyway under Windows.
As an example, the following code snippet will save the results of a call to perldoc into a scalar, avoids the shell, and throws an exception if the page cannot be found (since perldoc returns 1).
#!/usr/bin/perl -w
use strict;
use IPC::System::Simple qw(capture);
# Make sure we're called with command-line arguments.
#ARGV or die "Usage: $0 arguments\n";
my $documentation = capture('perldoc', #ARGV);
IPC::System::Simple is pure Perl, works on 5.6.0 and above, and doesn't have any dependencies that wouldn't normally come with your Perl distribution. (On Windows it depends upon a Win32:: module that comes with both ActiveState and Strawberry Perl).
Disclaimer: I'm the author of IPC::System::Simple, so I may show some bias.
The rule is simple: never use backticks if you can find a built-in to do the same job, or if their is a robust module on the CPAN which will do it for you. Backticks often rely on unportable code and even if you untaint the variables, you can still open yourself up to a lot of security holes.
Never use backticks with user data unless you have very tightly specified what is allowed (not what is disallowed -- you'll miss things)! This is very, very dangerous.
Backticks should be used if and only if you need to capture the output of a command. Otherwise, system() should be used. And, of course, if there's a Perl function or CPAN module that does the job, this should be used instead of either.
In either case, two things are strongly encouraged:
First, sanitize all inputs: Use Taint mode (-T) if the code is exposed to possible untrusted input. Even if it's not, make sure to handle (or prevent) funky characters like space or the three kinds of quote.
Second, check the return code to make sure the command succeeded. Here is an example of how to do so:
my $cmd = "./do_something.sh foo bar";
my $output = `$cmd`;
if ($?) {
die "Error running [$cmd]";
}
Another way to capture stdout(in addition to pid and exit code) is to use IPC::Open3 possibily negating the use of both system and backticks.
Use backticks when you want to collect the output from the command.
Otherwise system() is a better choice, especially if you don't need to invoke a shell to handle metacharacters or command parsing. You can avoid that by passing a list to system(), eg system('cp', 'foo', 'bar') (however you'd probably do better to use a module for that particular example :))
In Perl, there's always more than one way to do anything you want. The primary point of backticks is to get the standard output of the shell command into a Perl variable. (In your example, anything that the cp command prints will be returned to the caller.) The downside of using backticks in your example is you don't check the shell command's return value; cp could fail and you wouldn't notice. You can use this with the special Perl variable $?. When I want to execute a shell command, I tend to use system:
system("cp $from $to") == 0
or die "Unable to copy $from to $to!";
(Also observe that this will fail on filenames with embedded spaces, but I presume that's not the point of the question.)
Here's a contrived example of where backticks might be useful:
my $user = `whoami`;
chomp $user;
print "Hello, $user!\n";
For more complicated cases, you can also use open as a pipe:
open WHO, "who|"
or die "who failed";
while(<WHO>) {
# Do something with each line
}
close WHO;
From the "perlop" manpage:
That doesn't mean you should go out of
your way to avoid backticks when
they're the right way to get something
done. Perl was made to be a glue
language, and one of the things it
glues together is commands. Just
understand what you're getting
yourself into.
For the case you are showing using the File::Copy module is probably best. However, to answer your question, whenever I need to run a system command I typically rely on IPC::Run3. It provides a lot of functionality such as collecting the return code and the standard and error output.
Whatever you do, as well as sanitising input and checking the return value of your code, make sure you call any external programs with their explicit, full path. e.g. say
my $user = `/bin/whoami`;
or
my $result = `/bin/cp $from $to`;
Saying just "whoami" or "cp" runs the risk of accidentally running a command other than what you intended, if the user's path changes - which is a security vulnerability that a malicious attacker could attempt to exploit.
Your example's bad because there are perl builtins to do that which are portable and usually more efficient than the backtick alternative.
They should be used only when there's no Perl builtin (or module) alternative. This is both for backticks and system() calls. Backticks are intended for capturing output of the executed command.
Backticks are only supposed to be used when you want to capture output. Using them here "looks silly." It's going to clue anyone looking at your code into the fact that you aren't very familiar with Perl.
Use backticks if you want to capture output.
Use system if you want to run a command. One advantage you'll gain is the ability to check the return status.
Use modules where possible for portability. In this case, File::Copy fits the bill.
In general, it's best to use system instead of backticks because:
system encourages the caller to check the return code of the command.
system allows "indirect object" notation, which is more secure and adds flexibility.
Backticks are culturally tied to shell scripting, which might not be common among readers of the code.
Backticks use minimal syntax for what can be a heavy command.
One reason users might be temped to use backticks instead of system is to hide STDOUT from the user. This is more easily and flexibly accomplished by redirecting the STDOUT stream:
my $cmd = 'command > /dev/null';
system($cmd) == 0 or die "system $cmd failed: $?"
Further, getting rid of STDERR is easily accomplished:
my $cmd = 'command 2> error_file.txt > /dev/null';
In situations where it makes sense to use backticks, I prefer to use the qx{} in order to emphasize that there is a heavy-weight command occurring.
On the other hand, having Another Way to Do It can really help. Sometimes you just need to see what a command prints to STDOUT. Backticks, when used as in shell scripts are just the right tool for the job.
Perl has a split personality. On the one hand it is a great scripting language that can replace the use of a shell. In this kind of one-off I-watching-the-outcome use, backticks are convenient.
When used a programming language, backticks are to be avoided. This is a lack of error
checking and, if the separate program backticks execute can be avoided, efficiency is
gained.
Aside from the above, the system function should be used when the command's output is not being used.
Backticks are for amateurs. The bullet-proof solution is a "Safe Pipe Open" (see "man perlipc"). You exec your command in another process, which allows you to first futz with STDERR, setuid, etc. Advantages: it does not rely on the shell to parse #ARGV, unlike open("$cmd $args|"), which is unreliable. You can redirect STDERR and change user priviliges without changing the behavior of your main program. This is more verbose than backticks but you can wrap it in your own function like run_cmd($cmd,#args);
sub run_cmd {
my $cmd = shift #_;
my #args = #_;
my $fh; # file handle
my $pid = open($fh, '-|');
defined($pid) or die "Could not fork";
if ($pid == 0) {
open STDERR, '>/dev/null';
# setuid() if necessary
exec ($cmd, #args) or exit 1;
}
wait; # may want to time out here?
if ($? >> 8) { die "Error running $cmd: [$?]"; }
while (<$fh>) {
# Have fun with the output of $cmd
}
close $fh;
}