What does this Perl foreach do?

I am stuck trying to understand what this foreach does; I am new to Perl programming.
-e && print ("$_\n") foreach $base_name, "build/$base_name";
Here build is a directory. Thanks.

Not very pretty code somebody left you with there. :(
Either of
for ($base_name, "build/$base_name") { say if -e }
or
for my $file ($base_name, "build/$base_name") {
    say $file if -e $file;
}
would be clearer.

Checks whether the file exists, and if it does, prints its name.
It can be written in a clearer way.
It iterates over $base_name and then build/$base_name, setting $_ to each file name in turn.

it is essentially the same as:
foreach ( $base_name, "build/$base_name" ) {
    if ( -e ) {
        print("$_\n");
    }
}

I hope whoever wrote that isn't prolific.
How many ways to do the same thing in a clearer fashion?
grep
grep filters a list and removes anything that returns false for the check condition.
The current value under consideration is set to $_, so it is very convenient to use file test operators.
Since say and print both handle lists of strings well, it makes sense to filter output before passing it to them.
say grep -e, $base_name, "build/$base_name";
On older perls without say, map is sometimes used to apply newlines to the output before it is printed.
print map "$_\n", grep -e, $base_name, "build/$base_name";
for
Here we safely use a statement modifier if. IMHO, statement modifier forms can be very handy as long as the LHS is kept very simple.
for my $dir ( $base_name, "build/$base_name" ) {
    print "$dir\n" if -e $dir;
}
punctuation
Still ugly as sin, but breaking up the line and using some extra parens helps identify what is being looped over and separate it from the looping code.
-e && print("$_\n")
    foreach ( $base_name, "build/$base_name" );
It's better, but please don't do this. If you want to do ($foo if $bar) for @baz; DON'T. Use a full-sized block. I can count one time in my life where it felt OK to use a $bar and $foo for @baz construct like this one--and it was probably wrong to use it.
sub
The OP's mystery code is nicer if it is merely wrapped in a sub with a good name.
I'll write this in the ugliest way I know how:
sub print_existing_files { -e && print("$_\n") foreach @_; }
The body of the sub remains a puzzle, but at least there's some kind of clue in the name.
Conclusion
I'm sure I left out many variations on how this could be done (I don't have anything that uses until, and damn near anything that isn't intentionally obfuscated would be clearer). But that is beside the point.
The point of this is really to say that in any language there are many ways to achieve any given task. It is, therefore, important that each programmer who works on a system remain mindful of those who follow.
Code exists to serve two purposes: to communicate with future programmers and to instruct the computer how to operate--in that order.


How to get first file in directory by alphabetical order?

I want to execute $filename = `ls *.gz | head -n 1`; through Perl, but I think the pipe is causing an error: Execution of -e aborted due to compilation errors.
This will be part of the perl script rather than run via -e.
What would be the correct way to do this?
How about
my $filename = (sort glob "*.gz")[0];
You state the alphabetical order, thus the sort, which by default uses the "standard string comparison order". Note that ls can be aliased, and its defaults may also depend on the system.
Going out to the shell would make sense only if you use some particular strengths of ls, that would take a lot of work to do in Perl. For mere sorting there is no reason to go out of your program. It is far less efficient and adds a whole list of new problems to solve.
A good point was raised by mob. Because of sort's invocation syntax, sort BLOCK|SUB LIST, one could wonder whether glob above could be taken in an unintended way, as a SUB. It's not, as the builtin certainly runs first. However, that's cutting it a little close, and this is just clearer:
my ($filename) = sort (glob "*.gz");
zdim's solution uses sort and glob, both of which do a lot of needless work.
The glob is something I explain in Wasting time thinking about wasted time, in which I look at some really bad benchmarks we had in Intermediate Perl.
The sort compares a bunch of files to each other, even if it already knows that neither of the two files it is comparing can be the one you want. You only care about which one comes first. A linear scan does just fine (and we show an example of that in Learning Perl to find the maximum number in a list):
opendir my $dh, $dir or die "Could not open dir $dir: $!";
my $first;
while ( my $file = readdir $dh ) {
    $first = $file if ! defined $first or $file lt $first;
}
It's a bit more complicated as you probably want to filter out some files (the virtual dirs . and .., and maybe all the hidden files), but a grep handles that:
my @files = grep { ! /\A\.\.?\z/ } readdir $dh;
Even better, the scan wouldn't even have to know how to get the next file. Some other sort of iterator would provide that without the high_water-style loop knowing how it works. That's a more complicated example that I won't present here (and is one of those areas where Perl should look to Python, and that I demonstrate in Object::Iterate).
But, sometimes the easier, less pure thing is a good enough solution. Maybe not for this problem, but for some other problem. For example, if you have an external process (ls) generate all the input, all its resources can be freed once it does its job. A Perl pipe can do that:
$ perl -e 'open my $ph, q(-|), q(/bin/ls *.gz); print scalar <$ph>'
a.gz
Or, with head as well (notice the disappearing scalar) so I print all the lines instead of the first line (even though in this example there is only one line):
$ perl -e 'open my $ph, q(-|), q(/bin/ls *.gz | /usr/bin/head -n 1); print <$ph>'
a.gz
In your simple case, where you only get one line of input, the backticks can do the job:
$ perl -e 'print `/bin/ls *.gz | /usr/bin/head -n 1`'
a.gz
I don't know what problems you were having with your one-liner, but that's something to consider: the shell things tend to be fragile and difficult to get right. Then, when you get it right, someone else messes it up. Or simple translations from Perl strings to the shell don't come out as expected.
I suspect that you had quoting issues, which is why I use generalized quoting, q(), inside my Perl one-liners. It gets even more fun on Windows where you can't use single ticks to quote the argument to -e, but if you use double ticks in a unix shell, you get interpolation.
Remember though, asking for an external process means you have to be really careful. I used /bin/ls to be sure I got what I wanted and not some other thing in my path (although you can also limit $ENV{PATH}). I write much more about that in Mastering Perl, although perlsec has some advice too.
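For instance, a minimal sketch of pinning down the environment before shelling out (the directory list is only an example):
{
    local $ENV{PATH} = '/bin:/usr/bin';   # only trust commands from these directories
    my $first = `ls *.gz | head -n 1`;
}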

Finding files with Perl

File::Find and the wanted subroutine
This question is much simpler than the original title ("prototypes and forward declaration of subroutines"!) lets on. I'm hoping the answer, however simple, will help me understand subroutines/functions, prototypes and scoping and the File::Find module.
With Perl, subroutines can appear pretty much anywhere and you normally don't need to make forward declarations (except if the sub declares a prototype, which I'm not sure how to do in a "standard" way in Perl). For what I usually do with Perl there's little difference between these different ways of running somefunction:
sub somefunction; # Forward declares the function
&somefunction;
somefunction();
somefunction; # Bare word warning under `strict subs`
I often use find2perl to generate code which I crib/hack into parts of scripts. This could well be bad style and now my dirty laundry is public, but so be it :-) For File::Find the wanted function is a required subroutine - find2perl creates it and adds sub wanted; to the script it generates. Sometimes, when I edit the script I'll remove the "sub" from sub wanted and it ends up as &wanted; or wanted();. But without the sub wanted; forward declaration form I get this warning:
Use of uninitialized value $_ in lstat at findscript.pl line 29
My question is: why does this happen and is it a real problem? It is "just a warning", but I want to understand it better.
The documentation and code say $_ is localized inside of sub wanted {}. Why would it be undefined if I use wanted(); instead of sub wanted;?
Is wanted using prototypes somewhere? Am I missing something obvious in File/Find.pm?
Is it because wanted returns a code reference? (???)
My guess is that the forward declaration form "initializes" wanted in some way so that the first use doesn't have an empty default variable. I guess this would be how prototypes - even Perl prototypes, such as they exist - would work as well. I tried grepping through the Perl source code to get a sense of what sub is doing when a function is called using sub function instead of function(), but that may be beyond me at this point.
Any help deepening (and speeding up) my understanding of this is much appreciated.
EDIT: Here's a recent example script here on Stack Overflow that I created using find2perl's output. If you remove the sub from sub wanted; you should get the same error.
EDIT: As I noted in a comment below (but I'll flag it here too): for several months I've been using Path::Iterator::Rule instead of File::Find. It requires perl >5.10, but I never have to deploy production code at sites with odd, "never upgrade", 5.8.* only policies so Path::Iterator::Rule has become one of those modules I never want to do with out. Also useful is Path::Class. Cheers.
I'm not a big fan of File::Find. It just doesn't work right. The find command doesn't return a list of files, so you either have to use a non-local array variable in your find to capture your list of files you've found (not good), or place your entire program in your wanted subroutine (even worse). Plus, the separate subroutine means that your logic is separate from your find command. It's just ugly.
What I do is inline my wanted subroutine inside my find command. The subroutine stays with the find. Plus, my non-local array variable is now just part of my find command and doesn't look so bad.
Here's how I handle the File::Find -- assuming I want files that have a .pl suffix:
my @file_list;
find( sub {
    return unless -f;          # Must be a file
    return unless /\.pl$/;     # Must end with `.pl` suffix
    push @file_list, $File::Find::name;
}, $directory );

# At this point, @file_list contains all of the files I found.
This is exactly the same as:
my @file_list;
find( \&wanted, $directory );

sub wanted {
    return unless -f;
    return unless /\.pl$/;
    push @file_list, $File::Find::name;
}

# At this point, @file_list contains all of the files I found.
Inlining just looks nicer. And, it keeps my code together. Plus, my non-local array variable doesn't look so freaky.
I also like taking advantage of the shorter syntax in this particular way. Normally, I don't like using the inferred $_, but in this case, it makes the code much easier to read. My original wanted is the same as this:
sub wanted {
    my $file_name = $_;
    if ( -f $file_name and $file_name =~ /\.pl$/ ) {
        push @file_list, $File::Find::name;
    }
}
File::Find isn't that tricky to use. You just have to remember:
When you find a file you don't want, you use return to go to the next file.
$_ contains the file name without the directory, and you can use that for testing the file.
The file's full name is $File::Find::name.
The file's directory is $File::Find::dir.
And, the easiest way is to push the files you want into an array, and then use that array later in your program.
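A minimal sketch tying those points together (the directory and the .pl suffix are placeholders for this example):
use File::Find;

my @file_list;
find( sub {
    return unless -f;                      # $_ is the bare file name here
    return unless /\.pl$/;                 # so it is handy for tests
    print "walking $File::Find::dir\n";    # directory currently being walked
    push @file_list, $File::Find::name;    # full path, kept for later
}, '/some/dir' );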
Removing the sub from sub wanted; just makes it a call to the wanted function, not a forward declaration.
However, the wanted function hasn't been designed to be called directly from your code - it's been designed to be called by File::Find. File::Find does useful stuff like populating $_ before calling it.
There's no need to forward-declare wanted here, but if you want to remove the forward declaration, remove the whole sub wanted; line - not just the word sub.
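A hedged sketch of the difference (the wanted body here is a stand-in for find2perl's output): called through File::Find, $_ holds each file name; called directly, nothing has set $_, so the lstat warning fires.
use strict;
use warnings;
use File::Find;

sub wanted {
    my @stat = lstat($_);   # warns "Use of uninitialized value $_ in lstat" if $_ is unset
}

wanted();               # direct call: $_ is undef, the warning appears
find( \&wanted, '.' );  # File::Find sets $_ to each file name before each call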
Instead of File::Find, I would recommend using the find_wanted function from File::Find::Wanted.
find_wanted takes two arguments:
a subroutine that returns true for any filename that you would want.
a list of the files you are searching for.
find_wanted returns an array containing the list of filenames that it found.
I used code like the following to find all the JPEG files in certain directories on a computer:
my @files = find_wanted( sub { -f && /\.jpg$/i }, @dirs );
Explanation of some of the syntax, for those that might need it:
sub {...} is an anonymous subroutine, where ... is replaced with the code of the subroutine.
-f checks that a filename refers to a "plain file"
&& is boolean and
/\.jpg$/i is a regular expression that checks that a filename ends in .jpg (case insensitively).
@dirs is an array containing the directory names to be searched. A single directory could be searched as well, in which case a scalar works too (e.g. $dir).
Why not use open and invoke the shell find? The user can edit $findcommand (below) to be anything they want, or can define it in real time based on arguments and options passed to a script.
#!/usr/bin/perl
use strict; use warnings;
my $findcommand = 'find . -type f -mtime 0';
open( FILELIST, "$findcommand |" ) || die("can't open $findcommand |");
my @filelist = <FILELIST>;
close FILELIST;
my $Nfilelist = scalar(@filelist);
print "Number of files is $Nfilelist \n";

Why does Perl::Critic dislike using shift to populate subroutine variables?

Lately, I've decided to start using Perl::Critic more often on my code. After programming in Perl for close to 7 years now, I've been settled in with most of the Perl best practices for a long while, but I know that there is always room for improvement. One thing that has been bugging me though is the fact that Perl::Critic doesn't like the way I unpack @_ for subroutines. As an example:
sub my_way_to_unpack {
    my $variable1 = shift @_;
    my $variable2 = shift @_;
    my $result = $variable1 + $variable2;
    return $result;
}
This is how I've always done it, and, as it's been discussed on both PerlMonks and Stack Overflow, it's not necessarily evil either.
Changing the code snippet above to...
sub perl_critics_way_to_unpack {
    my ($variable1, $variable2) = @_;
    my $result = $variable1 + $variable2;
    return $result;
}
...works too, but I find it harder to read. I've also read Damian Conway's book Perl Best Practices and I don't really understand how my preferred approach to unpacking falls under his suggestion to avoid using @_ directly, as Perl::Critic implies. I've always been under the impression that Conway was talking about nastiness such as:
sub not_unpacking {
    my $result = $_[0] + $_[1];
    return $result;
}
The above example is bad and hard to read, and I would never ever consider writing that in a piece of production code.
So in short, why does Perl::Critic consider my preferred way bad? Am I really committing a heinous crime unpacking by using shift?
Would this be something that people other than myself think should be brought up with the Perl::Critic maintainers?
The simple answer is that Perl::Critic is not following PBP here. The book explicitly states that the shift idiom is not only acceptable, but is actually preferred in some cases.
Running perlcritic with --verbose 11 explains the policies. It doesn't look like either of these explanations applies to you, though.
Always unpack @_ first at line 1, near
'sub xxx{ my $aaa= shift; my ($bbb,$ccc) = @_;}'.
Subroutines::RequireArgUnpacking (Severity: 4)
Subroutines that use `@_' directly instead of unpacking the arguments to
local variables first have two major problems. First, they are very hard
to read. If you're going to refer to your variables by number instead of
by name, you may as well be writing assembler code! Second, `@_'
contains aliases to the original variables! If you modify the contents
of a `@_' entry, then you are modifying the variable outside of your
subroutine. For example:

    sub print_local_var_plus_one {
        my ($var) = @_;
        print ++$var;
    }
    sub print_var_plus_one {
        print ++$_[0];
    }

    my $x = 2;
    print_local_var_plus_one($x);  # prints "3", $x is still 2
    print_var_plus_one($x);        # prints "3", $x is now 3 !
    print $x;                      # prints "3"

This is spooky action-at-a-distance and is very hard to debug if it's
not intentional and well-documented (like `chop' or `chomp').
An exception is made for the usual delegation idiom
`$object->SUPER::something( @_ )'. Only `SUPER::' and `NEXT::' are
recognized (though this is configurable) and the argument list for the
delegate must consist only of `( @_ )'.
It's important to remember that a lot of the stuff in Perl Best Practices is just one guy's opinion on what looks the best or is the easiest to work with, and it doesn't matter if you do it another way. Damian says as much in the introductory text to the book. That's not to say it's all like that -- there are many things in there that are absolutely essential: using strict, for instance.
So as you write your code, you need to decide for yourself what your own best practices will be, and using PBP is as good a starting point as any. Then stay consistent with your own standards.
I try to follow most of the stuff in PBP, but Damian can have my subroutine-argument shifts and my unlesses when he pries them from my cold, dead fingertips.
As for Critic, you can choose which policies you want to enforce, and even create your own if they don't exist yet.
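For example, if you decide the shift idiom stays, your .perlcriticrc can switch that policy off entirely; a minimal sketch:
# .perlcriticrc: a leading dash disables a policy
[-Subroutines::RequireArgUnpacking]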
In some cases Perl::Critic cannot enforce PBP guidelines precisely, so it may enforce an approximation that attempts to match the spirit of Conway's guidelines. And it is entirely possible that we have misinterpreted or misapplied PBP. If you find something that doesn't smell right, please mail a bug report to bug-perl-critic@rt.cpan.org and we'll look into it right away.
Thanks,
-Jeff
I think you should generally avoid shift if it is not really necessary!
Just ran into code like this:
sub way {
    my $file = shift;
    if ( !$file ) {
        $file = 'newfile';
    }
    my $target  = shift;
    my $options = shift;
}
If you start changing something in this code, there is a good chance you might accidentally change the order of the shifts, or maybe skip one, and everything goes south. Furthermore, it's hard to read, because you cannot be sure you are really seeing all the parameters for the sub; some lines below there might be another shift somewhere... And if you use some regexes in between, they might replace the contents of $_ and weird stuff begins to happen...
A direct benefit of using the unpacking my (...) = @_ is that you can just copy the (...) part, paste it where you call the method, and have a nice signature :) you can even use the same variable names beforehand and don't have to change a thing!
I think shift implies list operations where the length of the list is dynamic and you want to handle its elements one at a time, or where you explicitly need a list without the first element. But if you just want to assign the whole list to x parameters, your code should say so with my (...) = @_; no one has to wonder.
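As an illustrative sketch (all names here are invented): shift reads naturally for peeling off an invocant, while the list assignment documents a fixed signature that the call site can mirror.
package Copier;
sub new { bless {}, shift }

sub process {
    my $self = shift;                        # peel off the invocant first...
    my ($source, $target, %options) = @_;    # ...then unpack the rest in one go
    print "copy $source -> $target\n" if $options{verbose};
}

package main;

# The call site mirrors the unpacking list:
my $copier = Copier->new;
$copier->process( 'a.txt', 'b.txt', verbose => 1 );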

How can I identify and remove redundant code in Perl?

I have a Perl codebase, and there are a lot of redundant functions and they are spread across many files.
Is there a convenient way to identify those redundant functions in the codebase?
Is there any simple tool that can verify my codebase for this?
You could use the B::Xref module to generate cross-reference reports.
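A minimal invocation might look like this (the script name is a placeholder); the report lists where subroutines and variables are defined and used, which helps spot functions nothing calls:
perl -MO=Xref myscript.pl > myscript.xref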
I've run into this problem myself in the past. I've slapped together a quick little program that uses PPI to find subroutines. It normalizes the code a bit (whitespace normalized, comments removed) and reports any duplicates. Works reasonably well. PPI does all the heavy lifting.
You could make the normalization a little smarter by normalizing all variable names in each routine to $a, $b, $c and maybe doing something similar for strings; a rough sketch of that idea follows the program below. Depends on how aggressive you want to be.
#!perl
use strict;
use warnings;
use PPI;
my %Seen;
for my $file (@ARGV) {
    my $doc = PPI::Document->new($file);
    $doc->prune("PPI::Token::Comment");    # strip comments
    my $subs = $doc->find('PPI::Statement::Sub');
    next unless $subs;                     # find() returns false when nothing matches
    for my $sub (@$subs) {
        my $code = $sub->block;
        $code =~ s/\s+/ /g;                # normalize whitespace
        next if $code =~ /^{\s*}$/;        # ignore empty routines
        if ( $Seen{$code} ) {
            printf "%s in $file is a duplicate of $Seen{$code}\n", $sub->name;
        }
        else {
            $Seen{$code} = sprintf "%s in $file", $sub->name;
        }
    }
}
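Along the lines suggested above, here is a rough, untested sketch of identifier normalization: each distinct variable is renamed to a positional name in order of first appearance before the body is compared. Note it is naive; it will happily rename anything that looks like a variable, built-ins included.
sub normalize_identifiers {
    my $code = shift;
    my ( %map, $n );
    # Rename every $foo/@foo/%foo to $v1/@v2/... by order of first appearance.
    $code =~ s/([\$\@\%])(\w+)/$1 . 'v' . ( $map{$2} //= ++$n )/ge;
    return $code;
}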
It may not be convenient, but the best tool for this is your brain. Go through all the code and get an understanding of its interrelationships. Try to see the common patterns. Then, refactor!
I've tagged your question with "refactoring". You may find some interesting material on this site filed under that subject.
If you are on Linux you might use grep to help you list all of the functions in your codebase. You will probably need to do what Ether suggests and really go through the code to understand it if you haven't already.
Here's an over-simplified example:
grep -r "sub " codebase/* > function_list
You can look for duplicates this way too. This idea may be less effective if you are using Perl's OOP capability.
It might also be worth mentioning NaturalDocs, a code documentation tool. This will help you going forward.

Should I manually set Perl's @ARGV so I can use <> to open, scan, and close files?

I have recently started learning Perl and one of my latest assignments involves searching a bunch of files for a particular string. The user provides the directory name as an argument and the program searches all the files in that directory for the pattern. Using readdir() I have managed to build an array with all the searchable file names, and now need to search each and every file for the pattern; my implementation looks something like this:
sub searchDir($) {
    my $dirN = shift;
    my @dirList = glob("$dirN/*");
    for (@dirList) {
        push @fileList, $_ if -f $_;
    }
    @ARGV = @fileList;
    while (<>) {
        ## Search for pattern
    }
}
My question is: is it alright to manually load the @ARGV array as has been done above and use the <> operator to scan in individual lines, or should I open / scan / close each file individually? Will it make any difference if this processing exists in a subroutine and not in the main function?
On the topic of manipulating @ARGV - that's definitely working code, Perl certainly allows you to do that. I don't think it's a good coding habit though. Most of the code I've seen that uses the "while (<>)" idiom is using it to read from standard input, and that's what I initially expect your code to do. A more readable pattern might be to open/close each input file individually:
foreach my $file (@files) {
    open FILE, "<$file" or die "Error opening file $file ($!)";
    my @lines = <FILE>;
    close FILE or die $!;
    foreach my $line (@lines) {
        if ( $line =~ /$pattern/ ) {
            # do something here!
        }
    }
}
That would read more easily to me, although it is a few more lines of code. Perl allows you a lot of flexibility, but I think that makes it that much more important to develop your own style in Perl that's readable and understandable to you (and your co-workers, if that's important for your code/career).
Putting the processing in the main function or in a subroutine is also mostly a stylistic decision that you should play around with and think about. Modern computers are so fast at this stuff that style and readability are much more important for scripts like this, as you're not likely to encounter situations in which such a script over-taxes your hardware.
Good luck! Perl is fun. :)
Edit: It's of course true that if he had a very large file, he should do something smarter than slurping the entire file into an array. In that case, something like this would definitely be better:
while ( my $line = <FILE> ) {
    if ( $line =~ /$pattern/ ) {
        # do something here!
    }
}
The point when I wrote "you're not likely to encounter situations in which such a script over-taxes your hardware" was meant to cover that, sorry for not being more specific. Besides, who even has 4GB hard drives, let alone 4GB files? :P
Another Edit: After perusing the Internet on the advice of commenters, I've realized that there are hard drives that are much larger than 4GB available for purchase. I thank the commenters for pointing this out, and promise in the future to never-ever-ever try to write a sarcastic comment on the internet.
I would prefer this more explicit and readable version:
#!/usr/bin/perl -w
foreach my $file (<$ARGV[0]/*>) {
    open(F, $file) or die "$!: $file";
    while (<F>) {
        # search for pattern
    }
    close F;
}
But it is also okay to manipulate @ARGV:
#!/usr/bin/perl -w
@ARGV = <$ARGV[0]/*>;
while (<>) {
    # search for pattern
}
Yes, it is OK to adjust the argument list before you start the 'while (<>)' loop; it would be foolhardy to adjust it while inside the loop. If you process option arguments, for instance, you typically remove items from @ARGV; here, you are adding items, but it still changes the original value of @ARGV.
It makes no odds whether the code is in a subroutine or in the 'main function'.
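A small sketch of that option-stripping pattern (the -v flag and the pattern are invented for the example):
my $verbose = 0;
if ( @ARGV and $ARGV[0] eq '-v' ) {
    shift @ARGV;                          # consume the option; <> now sees only file names
    $verbose = 1;
}
while (<>) {
    next unless /pattern/;                # placeholder match
    print $verbose ? "$ARGV:$.: $_" : $_;
}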
The previous answers cover your main Perl-programming question rather well.
So let me comment on the underlying question: How to find a pattern in a bunch of files.
Depending on the OS it might make sense to call a specialised external program, say
grep -l <pattern> <path>
on unix.
Depending on what you need to do with the files containing the pattern, and how big the hit/miss ratio is, this might save quite a bit of time (and re-uses proven code).
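A hedged sketch of driving grep -l from Perl (the pattern and path are placeholders; -r makes grep descend into the directory):
my $pattern = 'some_string';
my $path    = '/some/dir';
open my $ph, '-|', 'grep', '-rl', $pattern, $path
    or die "Cannot run grep: $!";
chomp( my @matching_files = <$ph> );
close $ph;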
The big issue with tweaking @ARGV is that it is a global variable. Also, you should be aware that while (<>) has special magic attributes (reading each file in @ARGV, or processing STDIN if @ARGV is empty, and testing for definedness rather than truth). To reduce the magic that needs to be understood, I would avoid it, except for quickie hack jobs.
You can get the filename of the current file by checking $ARGV.
You may not realize it, but you are actually affecting two global variables, not just @ARGV. You are also hitting $_. It is a very, very good idea to localize $_ as well.
You can reduce the impact of munging globals by using local to localize the changes.
BTW, there is another important, subtle bit of magic with <>. Say you want to return the line number of the match in the file. You might think, ok, check perlvar and find that $. gives the line number in the last handle accessed--great. But there is an issue lurking here: $. is not reset between @ARGV files. This is great if you want to know how many lines total you have processed, but not if you want a line number for the current file. Fortunately there is a simple trick with eof that will solve this problem.
use strict;
use warnings;

...

searchDir( 'foo' );

sub searchDir {
    my $dirN    = shift;
    my $pattern = shift;

    local $_;

    my @fileList = grep { -f $_ } glob("$dirN/*");
    return unless @fileList;    # Don't want to process STDIN.

    local @ARGV;
    @ARGV = @fileList;

    while (<>) {
        my $found = 0;
        ## Search for pattern
        if ( $found ) {
            print "Match at $. in $ARGV\n";
        }
    }
    continue {
        # reset line numbering after each file.
        close ARGV if eof;    # don't use eof().
    }
}
WARNING: I just modified your code in my browser. I have not run it, so it may have typos and probably won't work without a bit of tweaking.
Update: The reason to use local instead of my is that they do very different things. my creates a new lexical variable that is only visible in the contained block and cannot be accessed through the symbol table. local saves the existing package variable and aliases it to a new variable. The new localized version is visible in any subsequent code, until we leave the enclosing block. See perlsub: Temporary Values Via local().
In the general case of making new variables and using them, my is the correct choice. local is appropriate when you are working with globals, but you want to make sure you don't propagate your changes to the rest of the program.
This short script demonstrates local:
$foo = 'foo';

print_foo();
print_bar();
print_foo();

sub print_bar {
    local $foo;
    $foo = 'bar';
    print_foo();
}

sub print_foo {
    print "Foo: $foo\n";
}