In Perl scripts, should we use shell commands or call Perl functions that imitate shell operations? - perl

I want to know about the best practices here. Suppose I want to get the content of some line of a file. I can use a one-line shell command to get my answer, or write a subroutine, as shown in the code below.
A text file named some_text:
She laughed. Then both continued eating in silence, like strangers,
but after dinner they walked side by side; and there sprang up
between them the light jesting conversation of people who are free
and satisfied, to whom it does not matter where they go or what
they talk about.
Code to get content of line 5 of the file
#!perl
use warnings;
use strict;
my $file = "some_text";
my $lnum = 5;
my $shellcmd = "awk 'NR==$lnum' $file";
print qx($shellcmd);
print getSrcLine($file, $lnum);
sub getSrcLine {
    my ($file, $lnum) = @_;
    open FILE, $file or die "$!";
    my @ray = <FILE>;
    return $ray[$lnum-1];
}
I ask this because I see a lot of Perl scripts where at some point, a shell command was called, while at some later point, the same task was done by a call to a (library or handwritten) function, for example, rm -rf versus File::Path::rmtree. I just want to make it consistent.
What is the recommended thing to do?

If there's a Perl function for the operation, Perl thinks you should use its version. However, you give an example of a Perl module providing a pure Perl way to do it. That's much different. There's no single answer (as in most things), so you have to decide for yourself what to do:
Does the pure Perl approach do it correctly? For example, File::Copy has some limitations because it makes some awkward decisions for the user, so many people think it's broken. See, for instance, File::Copy versus cp/mv. (There is a sketch contrasting the two approaches after this list.)
Does the pure Perl approach do it in an acceptable time? Sometimes the external program is orders of magnitude faster. Sometimes it's a lot slower.
External commands usually are portable within a family of systems (e.g. all Linux-like systems) but probably not across families (e.g. Windows and Linux). Your tolerance for that might affect your answer. Even if you think you are running the same command, the different flavors of Unix-like systems might have different switches for the operations.
Passing complicated arguments—spaces, quotes, and special characters—to external commands can make you cry. You have to do a lot of fiddly work to make sure you're handling arguments correctly. Perl subroutines don't care though.
You have to pay much more attention to what you are doing when you are using the external command. If you just call rm, Perl is going to search through your PATH and use the first thing called rm. That doesn't mean it's the program you think it is. I write about this quite a bit in the "Secure Programming Techniques" in Mastering Perl.
If the pure Perl approach requires a module, especially if that module has many complicated dependencies, you might be in for dependency or distribution hell down the road.
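Here is the sketch referred to above: a minimal, hedged comparison of a pure Perl copy and an external cp. The file names are invented, and /bin/cp is assumed to be where cp lives on your system.
use strict;
use warnings;
use File::Copy qw(copy);

my ($src, $dst) = ('notes.txt', 'notes.bak');   # hypothetical file names

# Pure Perl: portable and no extra process, but you inherit File::Copy's quirks.
copy($src, $dst) or die "copy failed: $!";

# External tool: exact cp semantics, at the cost of a process. The list form
# of system() keeps the shell out of it, so odd file names need no quoting.
system('/bin/cp', '--', $src, $dst) == 0
    or die "cp exited with status " . ($? >> 8);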
Personally, I start with the pure Perl approach until it doesn't work for the situation.
For your particular examples, I'd use Perl. Shelling out to awk, which is a proto-Perl, is just odd. You should be able to do everything awk does right in Perl. If you have an awk program, you can convert it to Perl with the a2p program:
NR==5
a2p turns that into (modulo some setup bits at the start):
while (<>) {
    print $_ if $. == 5;
}
Notice that it still scans the entire file even after it has printed the fifth line. However, you can use the translated program as a start:
while (<>) {
    if ($. == 5) {
        print;
        last;
    }
}
I don't think you should shell out to some other program to avoid that Perl code.
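For completeness, a minimal sketch of the same idea wrapped in a subroutine with error handling, using a lexical filehandle and the three-arg open (the file name matches the example above):
use strict;
use warnings;

# Return line $lnum (1-based) of $file, or undef if the file is shorter than that.
sub get_src_line {
    my ($file, $lnum) = @_;
    open my $fh, '<', $file or die "Can't open $file: $!";
    while (my $line = <$fh>) {
        return $line if $. == $lnum;   # $. is the current line number
    }
    return;                            # fewer than $lnum lines
}

print get_src_line('some_text', 5) // "no such line\n";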
To remove a directory tree, I like File::Path. It has some dependencies, but they are all in the Perl Standard Library. There's very little pain, if any, associated with that module. I'd use it until I ran into a problem where it didn't work.
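As a hedged sketch of that approach (the directory name is made up; remove_tree is the interface provided by File::Path 2.x, which ships with modern Perls):
use strict;
use warnings;
use File::Path qw(remove_tree);

my $dir = 'build/tmp';    # hypothetical directory

remove_tree($dir, { error => \my $errors });
for my $diag (@$errors) {
    my ($path, $message) = %$diag;
    warn $path ? "problem removing $path: $message\n"
               : "general error: $message\n";
}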

If you want your app to be portable to non-unix systems, then definitely code everything in Perl.
If not, it's really up to you... creating a new process is slower, but if it's not important for the task then it doesn't matter. Personally, I would pick whichever solution I can implement more quickly.

It seems to me that code that works should be the first priority. Yours fails if the file name has a space in it, for example.
Using the shell makes it harder to code correctly since your program needs to properly generate another program to be run by sh. (This problem goes away if you use the multi-arg version of system to avoid the shell.)
Furthermore, using external tools can make it hard to handle errors. You didn't even attempt to do so!
On the flip side, there are multiple reasons for using external tools. For example, Perl doesn't provide as good a file copy utility as cp; using the sort tool allows you to sort arbitrarily large files with limited RAM; etc.
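For example, here is a minimal sketch of leaning on the external sort for a file too large to hold in memory, on a Unix-like system. The list form of open keeps the shell out of the picture, so the (made-up) file name needs no quoting:
use strict;
use warnings;

my $file = 'huge_log.txt';    # hypothetical large file

# Let sort(1) do the heavy lifting; it spills to temporary files as needed.
open my $sorted, '-|', 'sort', $file
    or die "Can't run sort: $!";

while (my $line = <$sorted>) {
    # process the lines in sorted order
    print $line;
}
close $sorted or warn "sort reported failure, exit status " . ($? >> 8);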

Related

Executing system commands safely while coding in Perl

Should one really use external commands while coding in Perl? I see several disadvantages: it's not system independent, and there may also be security risks. What do you think? If there is no way around it and you have to use shell commands from Perl, then what is the safest way to execute that particular command (like checking pid, uid, etc.)?
It depends on how hard it is going to be to replicate the functionality in Perl. If I needed to run the m4 macro processor on something, I'd not think of trying to replicate that functionality in Perl myself, and since there's no module on http://search.cpan.org/ that looks suitable, it would appear others agree with me. In that case, then, using the external program is sensible. On the other hand, if I needed to read the contents of a directory, then the combination of readdir() et al plus stat() or lstat() inside Perl is more sensible than futzing with the output of ls.
If you need to execute commands, think very carefully about how you invoke them. In particular, you probably want to avoid the shell interpreting the arguments, so use the array form of system (see also exec), etc, rather than a single string for the command plus arguments (which means the shell is used to process the command line).
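A small sketch of the difference; the file name with a space in it is deliberately awkward and entirely made up:
use strict;
use warnings;

my $file = 'my report.txt';   # hypothetical name containing a space

# Single string: the shell parses it, so the space splits the name into two
# arguments and any shell metacharacters in $file get interpreted.
system("wc -l $file");        # breaks (or worse) on this file name

# List form: no shell is involved; $file is passed as a single argument verbatim.
system('wc', '-l', $file) == 0
    or warn "wc failed with status " . ($? >> 8);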
Executing external commands can be expensive simply because it involves forking a new process and watching its output if you need it.
Probably more importantly, should the external process fail for any reason, it may be difficult for your script to understand what happened. Worse still, surprisingly often an external process can get stuck forever, and so will your script. You can use special tricks like opening a pipe and watching for output in a loop, but this is itself error-prone.
Perl is very capable of doing many things. So, if you stick to using only Perl's native constructs and modules to accomplish your tasks, not only will it be faster because you never fork, but it will also be more reliable and easier to catch errors by looking at the native Perl objects and structures returned by library routines. And of course, it will be automatically portable to different platforms.
If your script runs under elevated permissions (like root or under sudo), you should be very careful about which external programs you execute. One simple way to ensure basic security is to always specify commands by full name, like /usr/bin/grep (but still think twice, and just do the grep in Perl itself!). However, even this may not be enough if an attacker is using the LD_PRELOAD mechanism to inject rogue shared libraries.
If you want to be very secure, it is suggested to enable taint checking with the -T flag, like this:
#!/usr/bin/perl -T
The taint flag will also be enabled by Perl automatically if your script is determined to have different real and effective user or group IDs.
Taint mode will severely limit your ability to do many things (like the system() call) without Perl complaining - see more at http://perldoc.perl.org/perlsec.html#Taint-mode - but it will give you much higher confidence in your security.
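As a minimal sketch of what taint mode then forces you to do (the untainting pattern here is only an example; choose one that matches what you actually consider safe):
#!/usr/bin/perl -T
use strict;
use warnings;

# Taint mode refuses to run external commands with an untrusted environment.
$ENV{PATH} = '/usr/bin:/bin';
delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};

my $name = $ARGV[0] // die "usage: $0 filename\n";   # command line input is tainted

# Untaint by extracting only the characters we explicitly allow.
my ($safe) = $name =~ /\A([\w.-]+)\z/
    or die "unsafe file name: $name\n";

system('/bin/ls', '-l', $safe) == 0
    or die "ls failed with status " . ($? >> 8);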
Should one really use external commands while coding in Perl?
There's no single answer to this question. It all depends on what you are doing within the wide range of potential uses of Perl.
Are you using Perl as a glorified shell script on your local machine, or just trying to find a quick-and-dirty solution to your problem? In that case, it makes a lot of sense to run system commands if that is the easiest way to accomplish your task. Security and speed are not that important; what matters is the ability to code quickly.
On the other hand, are you writing a production program? In that case, you want secure, portable, efficient code. It is often preferable to write the functionality in Perl (or use a module), rather than calling an external program. At least, you should think hard about the benefits and drawbacks.

Filehandles and XML::Simple -> Memory corruption. Can't isolate problem

In a small test file, I can run
#!/usr/bin/perl
use warnings;
use strict;
use open qw{:utf8 :std};
use XML::Simple;
my @cmdline = ("hg", "log", "-v", "--style", "xml");
open my $xml, "@cmdline |";
my $xmllog = XMLin($xml, ForceArray => ['logentry', 'parent', 'copy', 'path']);
foreach my $rev (@{$xmllog->{logentry}}) {
    # do stuff
}
and it works fine. When I run the same code in a larger program (with the same XML input), it terminates with
*** glibc detected *** /usr/bin/perl: malloc(): memory corruption: 0x0a40e308 ***
(full crash log at pastebin.com)
However, if I do the exchange
#open my $xml, "@cmdline |";
my $xml = `@cmdline`;
then it works (in both files), so this is more a question of curiosity than a real problem for me.
Does anyone have any pointers on what the difference between my test case and the larger code base might be?
Is there a speed/memory/? difference in the different command calls? Best practices?
Debian Sid: Perl 5.12.4-1.
(This is my first Perl encounter, so don't assume too much about what I "should" know about the language. I just dove into existing code.)
(The larger program is ikiwiki, so the code is not a secret, but I don't know where to look for trouble, and I can't include all the code in this post for practical reasons. This concerns the Mercurial backend.)
As per suggestion from cjm, I added print "$_\n" for sort grep /XML/, keys %INC; which gave output
RPC/XML.pm
RPC/XML/Client.pm
RPC/XML/ParserFactory.pm
XML/NamespaceSupport.pm
XML/Parser.pm
XML/Parser/Expat.pm
XML/SAX.pm
XML/SAX/Base.pm
XML/SAX/Exception.pm
XML/SAX/Expat.pm
XML/SAX/ParserFactory.pm
XML/Simple.pm
in the large project, and
XML/NamespaceSupport.pm
XML/Parser.pm
XML/Parser/Expat.pm
XML/SAX.pm
XML/SAX/Base.pm
XML/SAX/Exception.pm
XML/SAX/Expat.pm
XML/SAX/ParserFactory.pm
XML/Simple.pm
in the test file.
Update: I installed the Debian package libxml-libxml-perl and added $XML::SAX::ParserPackage = "XML::LibXML::SAX"; as suggested. This also crashed, with a different message this time:
*** stack smashing detected ***: /usr/bin/perl terminated
full backtrace at pastebin.com
This time it happened consistently in both the large and the small file, though. Also, only when using open, not when using backticks.
I also installed libxml-libxml-simple-perl, but that turned out to be little more than a wrapper that always uses XML::LibXML as the parser. It also behaved differently and complained about the options passed to XMLin(), so I discarded it.
Trying to explicitly (and blindly) make the program use each of the alternatives listed by print "$_\n" for sort grep /XML/, keys %INC; seems to confirm that XML::SAX::Expat is used by default, as cjm said (all the other alternatives exit with errors, and XML::SAX::Expat behaves exactly like the original problem in both files; explicitly demanding XML::Simple goes into a loop that allocates all my memory).
I'm thankful for learning about different XML parsers and that XML::Simple automatically chooses different ones. Both parts of my original question somewhat remain though:
Why do the programs behave differently? Even if I explicitly set $XML::SAX::ParserPackage = "XML::SAX::Expat" in both programs, one crashes (using open) and the other works.
Should I use another method to receive output from the external command? Is it even wrong to expect XMLin() to work with open (but then why does it work in one case)?
Or are they simply the "wrong" questions to ask (i.e. irrelevant)?
UPDATE: More than a week has passed, not a flurry of activity here, and I solve it a bit differently now, without problems. I mark cjm's answer as correct, since it got me further in the error analysis. Thanks!
XML::Simple is pure-Perl, so it's unlikely to cause the memory corruption you report. It depends on a lower-level XML parser, and it's likely the bug you've encountered is in there. But there are multiple parsers it could be using, and we'd need to know which one.
Try adding this line right after the XMLin line in your sample program, and update your question with the results:
print "$_\n" for sort grep /XML/, keys %INC;
This will tell us which XML parser you're actually using on your system.
Update: Since it looks like you're using XML::Parser (through its SAX interface XML::SAX::Expat), I'd suggest trying XML::LibXML::SAX instead. Libxml2 is considered one of the better XML parsers.
If you don't already have XML::LibXML::SAX installed, just installing it should switch your default SAX parser to it. If it is installed, try putting
$XML::SAX::ParserPackage = "XML::LibXML::SAX";
at the beginning of your program. (See XML::SAX::ParserFactory for how the SAX parser is selected.)

Is it okay to use modules from within subroutines?

Recently I started playing with OO Perl and I've been creating quite a few new objects for a new project that I'm working on. I'm unfamiliar with best practices regarding OO Perl, and we're in a bit of a rush to get it done. :P
I'm putting a lot of this kind of code into each of my functions:
sub funcx {
    use ObjectX; # i don't declare this on top of the pm file
                 # but inside the function itself
    my $obj = new ObjectX;
}
I was wondering if this will cause any negative impact versus putting the use ObjectX line at the top of the Perl module, outside of any function scope.
I was doing this so that I feel it's cleaner in case I need to shift the function around.
The other thing I have noticed is that when I run a test.pl script, which tests my objects, directly on the Unix server, it is slow as heck. But when the same code is run through CGI connected to an Apache server, the web page doesn't load as slowly.
Where to put use?
use occurs at compile time, so it doesn't matter where you put it. At least from a purely pragmatic, 'will it work', point of view. Because it happens at compile time use will always be executed, even if you put it in a conditional. Never do this: if( $foo eq 'foo' ) { use SomeModule }
In my experience, it is best to put all your use statements at the top of the file. It makes it easy to see what is being loaded and what your dependencies are.
Update:
As brian d foy points out, things compiled before the use statement will not be affected by it. So, the location can matter. For a typical module, location does not matter, however, if it does things that affect compilation (for example it imports functions that have prototypes), the location could matter.
Also, Chas Owens points out that it can affect compilation. Modules that are designed to alter compilation are called pragmas. Pragmas are, by convention, given names in all lower-case. These effects apply only within the scope where the module is used. Chas uses the integer pragma as an example in his answer. You can also disable a pragma or module over a limited scope with the keyword no.
use strict;
use warnings;
my $foo;
print $foo; # Generates a warning
{ no warnings 'uninitialized'; # turn off warnings for working with uninitialized values.
print $foo; # No warning here
}
print $foo; # Generates a warning
Indirect object syntax
In your example code you have my $obj = new ObjectX;. This is called indirect object syntax, and it is best avoided as it can lead to obscure bugs. It is better to use this form:
my $obj = ObjectX->new;
Why is your test script slow on the server?
There is no way to tell with the info you have provided.
But the easy way to find out is to profile your code and see where the time is being consumed. Devel::NYTProf is a popular profiling tool you may want to check out.
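Running it is roughly a two-step affair, assuming Devel::NYTProf is installed (your_script.pl is a placeholder):
perl -d:NYTProf your_script.pl   # writes profiling data to ./nytprof.out
nytprofhtml                      # turns nytprof.out into an HTML report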
Best practices
Check out Perl Best Practices, and the quick reference card. This page has a nice run down of Damian Conway's OOP advice from PBP.
Also, you may wish to consider using Moose. If the long script startup time is acceptable in your usage, then Moose is a huge win.
question 1
It depends on what the module does. If it has lexical effects, then it will only affect the scope it is used in:
my $x;
{
    use integer;
    $x = 5/2; # $x is now 2
}
my $y = 5/2; # $y is now 2.5
If it is a normal module then it makes no difference where you use it, but it is common to use all of those modules at the top of the program.
question 2
Things that can affect the speed of a program between machines
speed of the processor
version of modules installed (some modules have XS versions that are much faster)
version of Perl
number of entries in PERL5LIB
speed of the drive
daotoad and Chas. Owens already answered the part of your question pertaining to the position of use statements. Let me remark on something else here:
I was doing this so that I feel it's cleaner in case I need to shift the function around.
Personally, I find it much cleaner to have all the used modules in one place at the top of the file. You won't have to search for use statements to see what other modules are being used and a quick glance will tell you what is being used and even what is not being used.
Regarding your performance problem: with Apache and mod_perl the Perl interpreter will have to parse and compile your used modules only once. The next time the script is run, execution should be much faster. On the command line, however, a second run doesn't get this benefit.

Why shouldn't I use shell tools in Perl code?

It is generally advised not to use additional Linux tools in Perl code;
e.g. if someone wants to print the last line of a text file, he can use:
$last_line = `tail -1 $file` ;
or otherwise open the file and read it line by line:
open(INFO, $file);
while (<INFO>) {
    $last_line = $_ if eof;
}
What are the pitfalls of using the previous and why should I avoid using shell tools in my code?
thanx,
Efficiency - you don't have to spawn a new process
Portability - you don't have to worry about an executable not existing, accepting different switches, or having different output
Ease of use - you don't have to parse the output, the results are already in a usable form
Error handling - you have finer-grained control over errors and what to do about them in Perl.
It's better to keep all the action in Perl because it's faster and because it's more secure. It's faster because you're not spawning a new process, and it's more secure because you don't have to worry about shell meta character trickery.
For example, in your first case if $file contained "afilename ; rm -rf ~" you would be a very unhappy camper.
P.S. The best all-Perl way to do the tail is to use File::ReadBackwards.
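A minimal sketch of that, assuming File::ReadBackwards (a CPAN module, not core) is installed; the file name is made up:
use strict;
use warnings;
use File::ReadBackwards;

my $file = 'some.log';    # hypothetical file name

my $bw = File::ReadBackwards->new($file)
    or die "Can't read $file: $!";
my $last_line = $bw->readline;    # the first line it returns is the file's last line
print defined $last_line ? $last_line : "file was empty\n";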
One of the primary reasons (besides portability) for not executing shell commands is that it introduces overhead by spawning another process. That's why much of the same functionality is available via CPAN in Perl modules.
One reason is that your Perl code might be running in an environment where there is no shell tool called 'tail'.
It's a personal call depending on the project:
Is it going to be always used in shell environments with tail?
Do you care about only using pure Perl code?
Using tail? Fine. But that's really a special case, since it's so easy to use and since it is so trivial.
The problem in general is not really efficiency or portability, that is largely irrelevant; the issue is ease of use. To run an external utility, you have to find out what arguments it accepts, write code to transform your program's data structures to that format, quote them properly, build the command line, and run the application. Then, you might have to feed it data and read data from it (involving complexity like an event loop, worrying about deadlocking, etc.), and finally interpret the return value. (UNIX processes consider "0" true and anything else false, but Perl assumes the opposite. foo() and die is hard to read.) This is a lot of work to do, and that's why people avoid it. It's much easier to create an instance of a class and call methods on it to get the data you need.
(You can abstract away processes this way; see Crypt::GpgME for example. It handles the complexity associated with invoking gpg, which would normally involve creating multiple filehandles other than STDOUT, STDIN, and STDERR, among other things.)
The main reason I see for doing it all in Perl would be for robustness. Your use of tail will fail if the filename has shell metacharacters or spaces or doesn't exist or isn't accessible. From Perl, characters in the filename aren't an issue, and you can distinguish between errors in accessing the file. Sometimes being robust is more important than speedy coding and sometimes it's not.

What are the best-practices for implementing a CLI tool in Perl?

I am implementing a CLI tool using Perl.
What are the best-practices we can follow here?
As a preface, I spent 3 years engineering and implementing a pretty complicated command line toolset in Perl for a major financial company. The ideas below are basically part of our team's design guidelines.
User Interface
Command line options: allow as many as possible to have default values.
NO positional parameters for any command that has more than 2 options.
Have readable option names. If the length of the command line is a concern for non-interactive calling (e.g. some un-named legacy shells have short limits on command lines), provide short aliases - Getopt::Long allows that easily.
At the very least, print all options' default values in the '-help' message.
Better yet, print all the options' "current" values (e.g. if a parameter and a value are supplied along with "-help", the help message will print that parameter's value from the command line). That way, people can assemble the command line string for a complicated command and verify it by appending "-help" before actually running it.
Follow Unix standard convention of exiting with non-zero return code if program terminated with errors.
If your program may produce useful (e.g. worth capturing/grepping/whatnot) output, make sure any error/diagnostic messages go to STDERR so they are easily separable.
Ideally, allow the user to specify input/output files via command line parameters, instead of forcing "<" / ">" redirects - this makes life MUCH simpler for people who need to build complicated pipes using your command. Ditto for error messages - have a logfile option.
If a command has side effects, having a "whatif/no_post" (dry-run) option is usually a Very Good Idea.
Implementation
As noted previously, don't re-invent the wheel. Use standard command line parameter handling modules - MooseX::Getopt or Getopt::Long.
For Getopt::Long, assign all the parameters to a single hash as opposed to individual variables. Many useful patterns include passing that CLI args hash to object constructors. (There is a sketch of this pattern after this list.)
Make sure your error messages are clear and informative... E.g. include "$!" in any IO-related error messages. It's worth spending an extra minute and two lines of code to have separate "file not found" and "file not readable" errors, as opposed to spending 30 minutes in a production emergency because a non-readable file error was misdiagnosed by Production Operations as "No input file" - this is a real-life example.
Not really CLI-specific, but validate all parameters, ideally right after getting them.
CLI doesn't allow for a "front-end" validation like webapps do, so be super extra vigilant.
As discussed above, modularize the business logic. Among other reasons already listed, the number of times I have had to re-implement an existing CLI tool as a web app is vast - and it is not that difficult if the logic is already in a properly designed Perl module.
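Here is the sketch of the options-in-one-hash pattern mentioned above; the option names, defaults, and the My::Tool class are invented for illustration:
use strict;
use warnings;
use Getopt::Long;

# Defaults live in the same hash the parsed options land in.
my %opt = (
    verbose => 0,
    logfile => '/tmp/tool.log',   # hypothetical default
);

GetOptions(\%opt,
    'verbose!',
    'logfile=s',
    'input=s',
    'whatif',                     # dry-run switch, per the list above
) or die "Bad command line options\n";

die "Missing required --input\n" unless defined $opt{input};

# The whole hash can now be handed to a constructor in one go, e.g.
# my $job = My::Tool->new(%opt);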
Interesting links
CLI Design Patterns - I think this is ESR's
I will try to add more bullets as I recall them.
Use POD to document your tool and follow the guidelines for manpages; include at least the following sections: NAME, SYNOPSIS, DESCRIPTION, AUTHOR. Once you have proper POD you can generate a man page with pod2man, and view the documentation at the console with perldoc your-script.pl.
Use a module that handles command line options for you. I really like using Getopt::Long in conjunction with Pod::Usage this way invoking --help will display a nice help message.
Make sure that your script returns a proper exit value indicating whether it was successful or not.
Here's a small skeleton of a script that does all of these:
#!/usr/bin/perl
=head1 NAME

simple - simple program

=head1 SYNOPSIS

    simple [OPTION]... FILE...
        -v, --verbose    use verbose mode
        --help           print this help message

Where I<FILE> is a file name.

Examples:

    simple /etc/passwd /dev/null

=head1 DESCRIPTION

This is a simple program.

=head1 AUTHOR

Me.

=cut
use strict;
use warnings;
use Getopt::Long qw(:config auto_help);
use Pod::Usage;
exit main();
sub main {
    # Argument parsing
    my $verbose;
    GetOptions(
        'verbose' => \$verbose,
    ) or pod2usage(1);
    pod2usage(1) unless @ARGV;

    my (@files) = @ARGV;
    foreach my $file (@files) {
        if (-e $file) {
            printf "File $file exists\n" if $verbose;
        }
        else {
            print "File $file doesn't exist\n";
        }
    }
    return 0;
}
Some lessons I've learned:
1) Always use Getopt::Long
2) Provide help on usage via --help, ideally with examples of common scenarios. It helps people who don't know or have forgotten how to use the tool (i.e., you in six months).
3) Unless it's pretty obvious to the user why, don't go for long periods (>5s) without output to the user. Something like print "Row $row...\n" unless ($row % 1000); goes a long way.
4) For long running operations, allow the user to recover if possible. It really sucks to get through 500k of a million, die, and start over again.
5) Separate the logic of what you're doing into modules and leave the actual .pl script as barebones as possible: parsing options, displaying help, invoking basic methods, etc. You're inevitably going to find something you want to reuse, and this makes it a heck of a lot easier.
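A minimal sketch of point 5; the module name and its new/run interface are made up for illustration:
#!/usr/bin/perl
# frobnicate.pl - thin wrapper; the real work lives in a module
use strict;
use warnings;
use Getopt::Long;
use My::App::Frobnicate;    # hypothetical module holding the business logic

my %opt = (verbose => 0);
GetOptions(\%opt, 'help|h', 'verbose!', 'input=s')
    or die "Bad command line options\n";

if ($opt{help}) {
    print "usage: frobnicate.pl --input FILE [--verbose]\n";
    exit 0;
}

# Everything else is delegated; the same module could later back a web app.
exit My::App::Frobnicate->new(%opt)->run;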
The most important thing is to have standard options.
Don't try to be clever, be simply consistent with already existing tools.
How to achieve this is also important, but only comes second.
Actually, this is quite generic to all CLI interfaces.
There are a couple of modules on CPAN that will make writing CLI programs a lot easier:
App::CLI
App::Cmd
If your app is Moose-based, also have a look at MooseX::Getopt and MooseX::Runnable.
The following points aren't specific to Perl but I've found many Perl CL scripts to be deficient in these areas:
Use common command line options. To show the version number, implement -v or --version, not --ver. For recursive processing, use -r (or perhaps -R, although in my GNU/Linux experience -r is more common), not --rec. People will use your script if they can remember the parameters. It's easy to learn a new command if you can remember "it works like grep" or some other familiar utility.
Many command line tools process "things" (files or directories) within the "current directory". While this can be convenient, make sure you also add command line options for explicitly identifying the files or directories to process. This makes it easier to put your utility in a pipeline without developers having to issue a bunch of cd commands and remember which directory they're in.
You should use Perl modules to make your code reusable and easy to understand.
You should also have a look at Perl Best Practices.