Finding files with Perl - perl

File::Find and the wanted subroutine
This question is much simpler than the original title ("prototypes and forward declaration of subroutines"!) lets on. I'm hoping the answer, however simple, will help me understand subroutines/functions, prototypes and scoping and the File::Find module.
With Perl, subroutines can appear pretty much anywhere and you normally don't need to make forward declarations (except if the sub declares a prototype, which I'm not sure how to do in a "standard" way in Perl). For what I usually do with Perl there's little difference between these different ways of running somefunction:
sub somefunction; # Forward declares the function
&somefunction;
somefunction();
somefunction; # Bare word warning under `strict subs`
I often use find2perl to generate code which I crib/hack into parts of scripts. This could well be bad style and now my dirty laundry is public, but so be it :-) For File::Find the wanted function is a required subroutine - find2perl creates it and adds sub wanted; to the resulting script it creates. Sometimes, when I edit the script I'll remove the "sub" from sub wanted and it ends up as &wanted; or wanted();. But without the sub wanted; forward declaration form I get this warning:
Use of uninitialized value $_ in lstat at findscript.pl line 29
My question is: why does this happen and is it a real problem? It is "just a warning", but I want to understand it better.
The documentation and code say $_ is localized inside of sub wanted {}. Why would it be undefined if I use wanted(); instead of sub wanted;?
Is wanted using prototypes somewhere? Am I missing something obvious in Find/File.pm?
Is it because wanted returns a code reference? (???)
My guess is that the forward declaration form "initializes" wanted in some way so that the first use doesn't have an empty default variable. I guess this would be how prototypes - even Perl prototypes, such as they exist - would work as well. I tried grepping through the Perl source code to get a sense of what sub is doing when a function is called using sub function instead of function(), but that may be beyond me at this point.
Any help deepening (and speeding up) my understanding of this is much appreciated.
EDIT: Here's a recent example script here on Stack Overflow that I created using find2perl's output. If you remove the sub from sub wanted; you should get the same error.
EDIT: As I noted in a comment below (but I'll flag it here too): for several months I've been using Path::Iterator::Rule instead of File::Find. It requires perl >5.10, but I never have to deploy production code at sites with odd, "never upgrade", 5.8.* only policies so Path::Iterator::Rule has become one of those modules I never want to do with out. Also useful is Path::Class. Cheers.

I'm not a big fan of File::Find. It just doesn't work right. The find command doesn't return a list of files, so you either have to use a non-local array variable in your find to capture your list of files you've found (not good), or place your entire program in your wanted subroutine (even worse). Plus, the separate subroutine means that your logic is separate from your find command. It's just ugly.
What I do is inline my wanted subroutine inside my find command. Subroutine stays with the find. Plus, my non-local array variable is now just part of my find command and doesn't look so bad
Here's how I handle the File::Find -- assuming I want files that have a .pl suffix:
my #file_list;
find ( sub {
return unless -f; #Must be a file
return unless /\.pl$/; #Must end with `.pl` suffix
push #file_list, $File::Find::name;
}, $directory );
# At this point, #file_list contains all of the files I found.
This is exactly the same as:
my #file_list;
find ( \&wanted, $directory );
sub wanted {
return unless -f;
return unless /\.pl$/;
push #file_list, $File::Find::name;
}
# At this point, #file_list contains all of the files I found.
In lining just looks nicer. And, it keep my code together. Plus, my non-local array variable doesn't look so freaky.
I also like taking advantage of the shorter syntax in this particular way. Normally, I don't like using the inferred $_, but in this case, it makes the code much easier to read. My original Wanted is the same as this:
sub wanted {
my $file_name = $_;
if ( -f $file_name and $file_name =~ /\.pl$/ ) {
push #file_list, $File::Find::name;
}
}
File::Find isn't that tricky to use. You just have to remember:
When you find a file you don't want, you use return to go to the next file.
$_ contains the file name without the directory, and you can use that for testing the file.
The file's full name is $File::Find::name.
The file's directory is $File::Find::dir.
And, the easiest way is to push the files you want into an array, and then use that array later in your program.

Removing the sub from sub wanted; just makes it a call to the wanted function, not a forward declaration.
However, the wanted function hasn't been designed to be called directly from your code - it's been designed to be called by File::Find. File::Find does useful stuff like populating$_ before calling it.
There's no need to forward-declare wanted here, but if you want to remove the forward declaration, remove the whole sub wanted; line - not just the word sub.

Instead of File::Find, I would recommend using the find_wanted function from File::Find::Wanted.
find_wanted takes two arguments:
a subroutine that returns true for any filename that you would want.
a list of the files you are searching for.
find_wanted returns an array containing the list of filenames that it found.
I used code like the following to find all the JPEG files in certain directories on a computer:
my #files = find_wanted( sub { -f && /\.jpg$/i }, #dirs );
Explanation of some of the syntax, for those that might need it:
sub {...} is an anonymous subroutine, where ... is replaced with the code of the subroutine.
-f checks that a filename refers to a "plain file"
&& is boolean and
/\.jpg$/i is a regular expression that checks that a filename ends in .jpg (case insensitively).
#dirs is an array containing the directory names to be searched. A single directory could be searched as well, in which case a scalar works too (e.g. $dir).

Why not use open and invoke the shell find? The user can edit $findcommand (below) to be anything they want, or can define it in real time based on arguments and options passed to a script.
#!/usr/bin/perl
use strict; use warnings;
my $findcommand='find . -type f -mtime 0';
open(FILELIST,"$findcommand |")||die("can't open $findcommand |");
my #filelist=<FILELIST>;
close FILELIST;
my $Nfilelist = scalar(#filelist);
print "Number of files is $Nfilelist \n";

Related

Simple perl file copy methods without using File::Copy

I have very little perl experience and have been trying a few methods on OS X before I attempt to use Macperl on a more difficult to access OS 9 with very limited memory.
I have been trying simple file copy methods without using File::Copy.
Both of the following appear to work:
open R,"<old";
open W,">new";
print W <R>;
open $r,"<old";
open $w,">new";
while (<$r>) { print $w $_ }
Why can't I use $r and $w with the first method?
How does the first method work without a 'while'?
Is there a better way to do this?
Is there a better way to do this?
There sure is... File::Copy is a core module (no installation requred), so there's little reason to avoid using it.
use File::Copy;
copy('old', 'new');
Alternatively, you can use a system call to the underlying OS, in your case, OS-X
system('cp', 'old', 'new');
UPDATE Oops, you're on OS9, so I'm not sure what system calls are available.
You can use your first method with lexical file handles, but you need to disambiguate a little.
open $r, '<', 'old';
open $w, '>', 'new';
print {$w} <$r>;
Bare in mind this is unidiomatic code, and if you just want to create a direct copy, the first method is preferable (EDIT If your memory constraints allow for it).
Perl operators and functions can return different things depending on what their context expects.
The first method works because the print function creates what is called a list context for the <> operator - then thee operator will "slurp in" the entire file.
In the second example, the <> operator is called in the condition of the loop, which creates a scalar context, returning one line at a time (some asterisks here, but that's another story.)
Here is some explanation about contexts: http://perlmaven.com/scalar-and-list-context-in-perl.
And, both methods should work with the R and W filehandles (which are old fashioned Perl filehandles that are different from regular variables), and with the $r/$w notation that actually denotes variables which hold a filehandle reference. The difference is subtle but in most everyday use cases these can be used interchangeably. Have you tried using $ variables in the first example?
In addition to the Hellmar Becker's answer:
The print $w <$r>; does not work (gives a syntax error) because if the FILEHANDLE argument is a variable, perl tries to interpret the print's argument list beginning ($w <$r) as an operator (see the NOTE in http://perldoc.perl.org/functions/print.html). To disambigue put parentheses around the <$r>:
print $w (<$r>);

writing reflective method to load variables from conf file, and assigning references?

I'm working with ugly code and trying to do a cleanup by moving values in a module into a configuration file. I want to keep the modules default values if a variable doesn't exist in the conf file, otherwise use the conf file version. There are lots of variables (too many) in the module so I wanted a helper method to support this. This is a first refactoring step, I likely will go further to better handle config variables later, but one step at a time.
I want a method that would take a variable in my module and either load the value from conf or set a default. So something like this (writing this from scratch, so treat it as just pseudocode for now)
Our ($var_a, $var_b ...);
export($var_a, $var_b ...);
my %conf = #load config file
load_var(\$var_a, "foo");
load_var(\$var_b, "$var_abar");
sub load_var($$){
my($variable_ref, $default) = #_
my $variale_name = Dumper($$variable_ref); #get name of variable
my $variable_value = $conf{$variable_name} // $default;
#update original variable by having $variable_ref point to $variable_value
}
So two questions here. First, does anyone know if some functionality like my load_var already exists which I an reuse?
Second, if I have to write it from scratch, can i do it with a perl version older then 5.22? when I read perlref it refers to setting references as being a new feature in 5.22, but it seems odd that such a basic behavior of references wasn't implemented sooner, so I'm wonder if I'm misunderstanding the document. Is there a way to pass a variable to my load_var method and ensure it's actually updated?
For this sort of problem, I would be thinking along the lines of using the AUTOLOAD - I know it's not quite what you asked, but it's sort of doing a similar thing:
If you call a subroutine that is undefined, you would ordinarily get an immediate, fatal error complaining that the subroutine doesn't exist. (Likewise for subroutines being used as methods, when the method doesn't exist in any base class of the class's package.) However, if an AUTOLOAD subroutine is defined in the package or packages used to locate the original subroutine, then that AUTOLOAD subroutine is called with the arguments that would have been passed to the original subroutine.
Something like:
#!/usr/bin/env perl
package Narf;
use Data::Dumper;
use strict;
use warnings;
our $AUTOLOAD;
my %conf = ( fish => 1,
carrot => "banana" );
sub AUTOLOAD {
print "Loading $AUTOLOAD\n";
##read config file
my $target = $AUTOLOAD =~ s/.*:://gr;
print $target;
return $conf{$target} // 0;
}
sub boo {
print "Boo!\n";
}
You can either call it OO style, or just 'normally' - but bear in mind this creates subs, not variables, so you might need to specify the package (or otherwise 'force' an import/export)
#!/usr/bin/env perl
use strict;
use warnings;
use Narf;
print Narf::fish(),"\n";
print Narf::carrot(),"\n";
print Narf::somethingelse(),"\n";
print Narf::boo;
Note - as these are autoloaded, they're not in the local namespace. Related to variables you've got this perlmonks discussion but I'm not convinced that's a good line to take, for all the reasons outlined in Why it's stupid to `use a variable as a variable name'

Find a file and return after the first match in a perl script

I am using the find from perl. It works but I want to return (exit) from subrutine wanted after a first match is found, I would like to stop the find. I put the return but it doesn't work.
Here is my code:
find(\&wanted, $dir);
sub wanted {
print "Found it $File::Find::dir/$_\n" if /$file/i;
$found_file = "$File::Find::dir/$_";
return "$File::Find::dir/$_";
}
print $found_file;
$dir is the directory I am searching in and $file is the file I need.
Where should I put the returi in the sub wanted. I am new to perl, any help is appreciated. Thanks.
File::Find is a wacky module. It almost requires you code your entire program inside the wanted section. If you don't want to do this, you are usually required to use some sort of badly scoped variable which makes things look ugly. Plus, it uses a separate subroutine that makes the code harder to follow.
The problem is that File::Find isn't object oriented. If it was, you could fetch the files in a loop using a method, and when you get the file you want, you can leave the loop.
Fortunately, there's already a File::Find::Object which is an object oriented version of File::Find.
sub find_file {
my #directories = #_;
my $dir_tree = File::Find::Object->new( #directories );
while ( my $file = $dir_tree->next ) {
next unless ... # Code used to determine if file is what you want
return $file
}
return; #Nothing found
}
The interface is clean and easy to understand. It's how File::Find should work. I don't know why it's not a standard distribution module.
The main problem with File::Find is that it's old (From the Perl 3.x) days, and was never really intended to be a module. It was made to work with the find2perl program that still comes with Perl distributions. Want to see something ugly and scary? Run a find query through find2perl and see the results. This is what old Perl code use to look like.

Perl Subdirectory Traversal

I am writing a script that goes through our large directory of Perl Scripts and checks for certain things. Right now, it takes two kinds of input: the main directory, or a single file. If a single file is provided, it runs the main function on that file. If the main directory is provided, it runs the main function on every single .pm and .pl inside that directory (due to the recursive nature of the directory traversal).
How can I write it (or what package may be helpful)- so that I can also enter one of the seven SUBdirectories, and it will traverse ONLY that subdirectory (instead of the entire thing)?
I can't really see the difference in processing between the two directory arguments. Surely, using File::Find will just do the right thing in both instances.
Something like this...
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
my $input = shift;
if (-f $input) {
# Handle a single file
handle_a_file($input);
} else {
# Handler a directory
handle_a_directory($input);
}
sub handle_a_file {
my $file = shift;
# Do whatever you need to with a single file
}
sub handle_a_directory {
my $dir = shift;
find(\&do this, $dir);
}
sub do_this {
return unless -f;
return unless /\.p[ml]$/;
handle_a_file($File::Find::name);
}
One convenient way would be to use the excellent Path::Class module, more precisely: the traverse() method of Path::Class::Dir. You'd control what to process from within the callback function which is supplied as the first argument to traverse(). The manpages has sample snippets.
Using the built-ins like opendir is perfectly fine, of course.
I've just turned to using Path::Class almost everywhere, though, as it has so many nice convenience methods and simply feels right. Be sure to read the docs for Path::Class::File to know what's available. Really does the job 99% of the time.
If you know exactly what directory and subdirectories you want to look at you can use glob("$dir/*/*/*.txt") for example to get ever .txt file in 3rd level of the given $dir

Should I manually set Perl's #ARGV so I can use <> to open, scan, and close files?

I have recently started learning Perl and one of my latest assignments involves searching a bunch of files for a particular string. The user provides the directory name as an argument and the program searches all the files in that directory for the pattern. Using readdir() I have managed to build an array with all the searchable file names and now need to search each and every file for the pattern, my implementation looks something like this -
sub searchDir($) {
my $dirN = shift;
my #dirList = glob("$dirN/*");
for(#dirList) {
push #fileList, $_ if -f $_;
}
#ARGV = #fileList;
while(<>) {
## Search for pattern
}
}
My question is - is it alright to manually load the #ARGV array as has been done above and use the <> operator to scan in individual lines or should I open / scan / close each file individually? Will it make any difference if this processing exists in a subroutine and not in the main function?
On the topic of manipulating #ARGV - that's definitely working code, Perl certainly allows you to do that. I don't think it's a good coding habit though. Most of the code I've seen that uses the "while (<>)" idiom is using it to read from standard input, and that's what I initially expect your code to do. A more readable pattern might be to open/close each input file individually:
foreach my $file (#files) {
open FILE, "<$file" or die "Error opening file $file ($!)";
my #lines = <FILE>;
close FILE or die $!;
foreach my $line (#file) {
if ( $line =~ /$pattern/ ) {
# do something here!
}
}
}
That would read more easily to me, although it is a few more lines of code. Perl allows you a lot of flexibility, but I think that makes it that much more important to develop your own style in Perl that's readable and understandable to you (and your co-workers, if that's important for your code/career).
Putting subroutines in the main function or in a subroutine is also mostly a stylistic decision that you should play around with and think about. Modern computers are so fast at this stuff that style and readability is much more important for scripts like this, as you're not likely to encounter situations in which such a script over-taxes your hardware.
Good luck! Perl is fun. :)
Edit: It's of course true that if he had a very large file, he should do something smarter than slurping the entire file into an array. In that case, something like this would definitely be better:
while ( my $line = <FILE> ) {
if ( $line =~ /$pattern/ ) {
# do something here!
}
}
The point when I wrote "you're not likely to encounter situations in which such a script over-taxes your hardware" was meant to cover that, sorry for not being more specific. Besides, who even has 4GB hard drives, let alone 4GB files? :P
Another Edit: After perusing the Internet on the advice of commenters, I've realized that there are hard drives that are much larger than 4GB available for purchase. I thank the commenters for pointing this out, and promise in the future to never-ever-ever try to write a sarcastic comment on the internet.
I would prefer this more explicit and readable version:
#!/usr/bin/perl -w
foreach my $file (<$ARGV[0]/*>){
open(F, $file) or die "$!: $file";
while(<F>){
# search for pattern
}
close F;
}
But it is also okay to manipulate #ARGV:
#!/usr/bin/perl -w
#ARGV = <$ARGV[0]/*>;
while(<>){
# search for pattern
}
Yes, it is OK to adjust the argument list before you start the 'while (<>)' loop; it would be more nearly foolhardy to adjust it while inside the loop. If you process option arguments, for instance, you typically remove items from #ARGV; here, you are adding items, but it still changes the original value of #ARGV.
It makes no odds whether the code is in a subroutine or in the 'main function'.
The previous answers cover your main Perl-programming question rather well.
So let me comment on the underlying question: How to find a pattern in a bunch of files.
Depending on the OS it might make sense to call a specialised external program, say
grep -l <pattern> <path>
on unix.
Depending on what you need to do with the files containing the pattern, and how big the hit/miss ratio is, this might save quite a bit of time (and re-uses proven code).
The big issue with tweaking #ARGV is that it is a global variable. Also, you should be aware that while (<>) has special magic attributes. (reading each file in #ARGV or processing STDIN if #ARGV is empty, testing for definedness rather than truth). To reduce the magic that needs to be understood, I would avoid it, except for quickie-hack-jobs.
You can get the filename of the current file by checking $ARGV.
You may not realize it, but you are actually affecting two global variables, not just #ARGV. You are also hitting $_. It is a very, very good idea to localize $_ as well.
You can reduce the impact of munging globals by using local to localize the changes.
BTW, there is another important, subtle bit of magic with <>. Say you want to return the line number of the match in the file. You might think, ok, check perlvar and find $. gives the linenumber in the last handle accessed--great. But there is an issue lurking here--$. is not reset between #ARGV files. This is great if you want to know how many lines total you have processed, but not if you want a line number for the current file. Fortunately there is a simple trick with eof that will solve this problem.
use strict;
use warnings;
...
searchDir( 'foo' );
sub searchDir {
my $dirN = shift;
my $pattern = shift;
local $_;
my #fileList = grep { -f $_ } glob("$dirN/*");
return unless #fileList; # Don't want to process STDIN.
local #ARGV;
#ARGV = #fileList;
while(<>) {
my $found = 0;
## Search for pattern
if ( $found ) {
print "Match at $. in $ARGV\n";
}
}
continue {
# reset line numbering after each file.
close ARGV if eof; # don't use eof().
}
}
WARNING: I just modified your code in my browser. I have not run it so it, may have typos, and probably won't work without a bit of tweaking
Update: The reason to use local instead of my is that they do very different things. my creates a new lexical variable that is only visible in the contained block and cannot be accessed through the symbol table. local saves the existing package variable and aliases it to a new variable. The new localized version is visible in any subsequent code, until we leave the enclosing block. See perlsub: Temporary Values Via local().
In the general case of making new variables and using them, my is the correct choice. local is appropriate when you are working with globals, but you want to make sure you don't propagate your changes to the rest of the program.
This short script demonstrates local:
$foo = 'foo';
print_foo();
print_bar();
print_foo();
sub print_bar {
local $foo;
$foo = 'bar';
print_foo();
}
sub print_foo {
print "Foo: $foo\n";
}