About searching recursively in Perl

I have a Perl script that I, well, mostly pieced together from questions on this site. I've read the documentation on some parts to better understand it. Anyway, here it is:
#!/usr/bin/perl
use File::Find;

my $dir = '/home/jdoe';
my $string = "hard-coded pattern to match";

find(\&printFile, $dir);

sub printFile
{
    my $element = $_;
    if (-f $element && $element =~ /\.txt$/)
    {
        open my $in, "<", $element or die $!;
        while (<$in>)
        {
            if (/\Q$string\E/)
            {
                print "$File::Find::name\n";
                last; # stops looking after match is found
            }
        }
    }
}
This is a simple script that, similar to grep, recursively searches directories for a matching string and then prints the location of each file that contains it. It works, but only if the files are located in my home directory. If I change the hard-coded directory to a different one that I have permissions in, for example /admin/programs, the script no longer seems to do anything: no output is displayed, even when I know it should match at least one file (tested by creating a file in /admin/programs containing the hard-coded pattern). Why am I seeing this behavior?
Also, I might as well admit that this isn't a really useful script (heck, this would be so easy with grep or awk!), but understanding how to do this in Perl is important to me right now. Thanks.
EDIT: Found the problem. A simple oversight: the files in the directory I was searching did not have a .txt extension. Thanks for helping me find that.
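For anyone hitting the same issue, here is a hypothetical variant of the script above (not the poster's final code) that drops the .txt restriction and takes the directory and pattern from the command line, so the filter can't silently exclude the files you care about:
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Hypothetical usage: perl findstring.pl /admin/programs "pattern to match"
my ($dir, $string) = @ARGV;
die "Usage: $0 <directory> <pattern>\n" unless defined $dir && defined $string;

find(\&print_file, $dir);

sub print_file
{
    return unless -f $_;    # skip directories and other non-plain files
    open my $in, '<', $_ or do { warn "Cannot open $File::Find::name: $!"; return };
    while (<$in>)
    {
        if (/\Q$string\E/)
        {
            print "$File::Find::name\n";
            last;           # stop after the first match in this file
        }
    }
    close $in;
}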

I was able to get the desired output using the code you pasted by making a few changes, like:
use strict;
use warnings;
You should always use them, as they warn you about errors in your code that you might otherwise miss.
Next I changed the line:
my $dir = './home/jdoe'; ##'./admin/programs'
The leading . signifies the current directory. Also, if you still face problems, try using the absolute path (from the root) instead of a relative path. Do let me know if this solves your problem.

This script works fine without any issue. The one thing hidden from us in this script is the pattern. You can share the pattern and tell us what you expect it to match, so that we can validate it.
You could also run your program in debug mode, i.e.
perl -d your_program
That drops you into the debugger, where there are a lot of options available for inspecting the flow. Type 'n' at the debug prompt to step through the code and understand how it flows; each 'n' prints the statement about to be executed.
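For reference, a few other commonly used debugger commands (these come from the standard perldebug documentation, not from the script above):
s        step into a subroutine call
n        execute the next statement, stepping over subroutine calls
c        continue until the next breakpoint or the end of the program
b LINE   set a breakpoint at a line
p EXPR   print the value of an expression
x EXPR   evaluate and dump a data structure
q        quit the debugger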

Related

How can I get perl to correctly pass a command line argument with multiple arguments and complex file paths (spaces and symbols)?

I have a small Perl script which collects file paths from an Excel file and passes them on the command line to perltex, which then compiles a PDF based on the files and paths chosen.
My problem is that the moment I introduce more complex file paths (which is necessary based on the network setup of the final user pool) perltex fails to find the file paths, cutting them at the space.
An MWE is as follows:
#!/usr/bin/perl
use strict;
use warnings;
use 5.14.2;
use Text::Template;
use Spreadsheet::Read;
use Spreadsheet::ParseXLSX;
use utf8;
use charnames qw( :full :short );
use autodie;

my $row = 5;
my $col = 15;
my $File = "C:/Users/me/Desktop/Reporting-Static/Input-test1.xlsm";

my $parser = Spreadsheet::ParseXLSX->new();
my $workbook = $parser->parse($File);
my $worksheet = $workbook->worksheet("Input");
my $cell = $worksheet->get_cell($row, $col);
my $Filename = $cell->Value();

my $texfile = "C:/Users/me/Desktop/Reporting-Static/file.tex";

# can't find this file if there are spaces in the address
system("perltex", "--latex=pdflatex", "--nosafe", "--jobname=$Filename", "$texfile");

if ( $? == -1 )
{
    print "command failed: $!\n";
}
else
{
    printf "command exited with value %d", $? >> 8;
}
exit;
However, the moment I change the folder name to one with spaces, e.g. "Reporting Static", it fails to find the tex file.
I have read several other posts about this on Stack Exchange and other websites, but for whatever reason the proposed solutions do not appear to work for me. I have tried
my $texfile = "C:/Users/me/Desktop/Reporting Static/file.tex";
my $texfile = C:/Users/me/Desktop/"Reporting Static"/file.tex;
my $texfile = "\"C:/Users/me/Desktop/"Reporting Static"/file.tex\"";
my $texfile = "\"C:/Users/me/Desktop/Reporting Static/file.tex\"";
my $texfile = "C:/Users/me/Desktop/Reporting^ Static/file.tex";
my $texfile = "C:/Users/me/Desktop/Reporting\^ Static/file.tex";
As well as a few other combinations or variations of the above, all without success. I have also tried replacing the double quotes with single quotes so that Perl doesn't interpolate the contents.
I have also tried manually typing all of the above into the command prompt to check whether there was a small issue with the way perl passed the commands to the command line but still no luck.
I am aware that I can use the 'dir /X ~1 c:\' command to find the short (8.3) names that avoid spaces, but the idea is that the filename and location will be dynamic and change as a function of department and site, so I would prefer to avoid writing a script that goes and finds this pathname and uses it to replace all locations containing spaces or other special characters.
The final idea I had is that this problem could be connected to the way that perltex passes its arguments, yet I am unable to find any documentation (that I can follow...) on the specifics of how this particular aspect of the program works.
So my questions are: is there something I am missing, not mentioned in the other answers I have read, about how to correctly pass these paths to perltex? Is there perhaps some sort of incompatibility in how I'm trying to go about this? Is this more probably linked to perltex as opposed to Perl or cmd, or is there something completely different that I am unaware of that is stopping this from working?
EDIT:
From the cmd prompt, perltex returns "unable to find path X, please enter another file location". Until now I hadn't really tested retyping the path: by entering 'C:/Users/me/Desktop/"Reporting Static"/file.tex' (no quotes at the beginning) it is subsequently accepted and runs, but initially passing it this path does not work. This suggests that some internal perltex code handles the initial path differently from the same path re-entered after an error... not quite sure what to make of this.
EDIT:
The contents of @latexcmdline that I extracted:
$VAR1 = [
    'pdflatex',
    '--jobname=--',
    '\\makeatletter\\def\\plmac@tag{AYNNNUVKQVJGZKKPGPTH}\\def\\plmac@tofile{Perl.topl}\\def\\plmac@fromfile{Perl.frpl}\\def\\plmac@toflag{Perl.tfpl}\\def\\plmac@fromflag{Perl.ffpl}\\def\\plmac@doneflag{Perl.dfpl}\\def\\plmac@pipe{Perl.pipe}\\makeatother\\input C:/Users/me/Desktop/PERLTEST/Perl',
    'Modules/RevuedeProjetDB.tex'
];
This was done by inserting
use File::Slurp;
use Data::Dumper;
write_file 'C:\Users\me\Desktop\PERLTEST\mydump.log', Dumper( \@latexcmdline );
before the exec command.
Update
I initially recommended String::ShellQuote, but that module is for Linux only, so I deleted my answer when I realised that your question was about Windows.
It seems that there's also Win32::ShellQuote, which does the same thing for Windows, so I am renewing my suggestion.
As I said before, the issue is that perltex itself doesn't properly handle paths containing whitespace, even if they are correctly passed as a single element of @ARGV. I believe the solution is to pass the path including enclosing quotes, although I have never been able to test this properly as I have no LaTeX installation.
Unfortunately, even if I pass qq{"$texfile"}, the quotes are still stripped before they reach the target program, so they must be protected in some way.
You need the quote_system function from that module, which will prepare a list of strings so that they retain any quotation marks.
Using a parameter of quote_system(qq{"$texfile"}) produces the correct result in my tests. It is the equivalent of passing qq{"\\"$texfile\\""} but less ugly.
So your system call should be like this, with no modification to perltex.pl. I have applied the same principle to $Filename, as it may well be that this also contains whitespace:
use Win32::ShellQuote 'quote_system';

system(quote_system(
    'perltex',
    '--latex=pdflatex',
    '--nosafe',
    qq{--jobname="$Filename"},
    qq{"$texfile"},
));
Okay, well I have a solution of sorts.
The issue, as I suspected, is that although the path is passed as a single string to perltex.pl, the latter doesn't handle paths with spaces properly after it has received them.
The temporary fix is to hack perltex.pl.
Line 82 of my version of perltex.pl (there is no version number in the source) reads:
$latexcmdline[$firstcmd] = "\\input $option";
If you change that to
$latexcmdline[$firstcmd] = qq{\\input "$option"};
then all should be well. However, this is a solid fix only once it is distributed by the author of perltex; meanwhile I am looking for a nicer solution from the calling side.
There are two steps to solving this problem.
1. Work out how to get the correct arguments into an external program.
2. Work out how to do that from a Perl program.
For step 1, I find a program like this to be useful.
#!/usr/bin/perl
use strict;
use warnings;

print "Received ", scalar @ARGV, " arguments:\n";

for (1 .. @ARGV) {
    print "$_: $ARGV[$_ - 1]\n";
}
It simply reports what arguments it receives on the command line. You can use it in place of "perltex" for testing purposes.
You'll see that if you give it an argument that contains spaces, then that is interpreted by the called program as multiple arguments. The way to get round that is to quote the argument that contains a space. And I seem to remember that Windows insists on double quotes (for reasons that I can never remember).
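For instance, saving the test script above as showargs.pl (a hypothetical name) and calling it from a shell that does the word-splitting makes the effect easy to see (the paths are placeholders):
perl showargs.pl C:/path/with space/file.tex
Received 2 arguments:
1: C:/path/with
2: space/file.tex
perl showargs.pl "C:/path/with space/file.tex"
Received 1 arguments:
1: C:/path/with space/file.tex
Note that the quoted form arrives as a single argument.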
So I think that you want this:
system('perltex', '--latex=pdflatex', '--nosafe', "--jobname=\"$Filename\"", "\"$texfile\"");
I've double-quoted both of the filenames. Of course, those escaped double-quote characters look really ugly, and Perl gives us qq(...) to make that look nicer.
system('perltex', '--latex=pdflatex', '--nosafe', qq(--jobname="$Filename"), qq("$texfile"));
If that's not quite right, then the program I showed earlier will make it easier to find the solution.
Update: Borodin's comment below about this making no difference to $texfile is accurate. The fact that we're passing a list to system() means that the shell isn't involved at all.

Find a file and return after the first match in a perl script

I am using find from Perl's File::Find. It works, but I want to return (exit) from the subroutine wanted after the first match is found; that is, I would like to stop the find. I put in a return, but it doesn't work.
Here is my code:
find(\&wanted, $dir);

sub wanted {
    print "Found it $File::Find::dir/$_\n" if /$file/i;
    $found_file = "$File::Find::dir/$_";
    return "$File::Find::dir/$_";
}

print $found_file;
$dir is the directory I am searching in and $file is the file I need.
Where should I put the return in the sub wanted? I am new to Perl; any help is appreciated. Thanks.
File::Find is a wacky module. It almost requires you to code your entire program inside the wanted subroutine. If you don't want to do this, you usually have to use some sort of badly scoped variable, which makes things look ugly. Plus, it uses a separate subroutine, which makes the code harder to follow.
The problem is that File::Find isn't object oriented. If it were, you could fetch the files in a loop using a method, and when you got the file you wanted, you could leave the loop.
Fortunately, there's already a File::Find::Object which is an object oriented version of File::Find.
sub find_file {
    my @directories = @_;
    # first argument to new() is the options hashref, then the target directories
    my $dir_tree = File::Find::Object->new( {}, @directories );
    while ( my $file = $dir_tree->next ) {
        next unless ... # Code used to determine if file is what you want
        return $file;
    }
    return;    # Nothing found
}
The interface is clean and easy to understand. It's how File::Find should work. I don't know why it's not a standard distribution module.
The main problem with File::Find is that it's old (from the Perl 3.x days) and was never really intended to be a module. It was made to work with the find2perl program that still comes with Perl distributions. Want to see something ugly and scary? Run a find query through find2perl and see the results. This is what old Perl code used to look like.
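As a quick illustration, here is a hypothetical call to the find_file sketch above (the directory and the match test are placeholders you would fill in yourself):
use File::Find::Object;

my $match = find_file('/home/jdoe/docs');
if (defined $match) {
    print "First match: $match\n";    # next() returns the path as a string
}
else {
    print "No matching file found\n";
}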

Perl: Looping through filehandles

I'm a self-taught Perler, seeking assistance from the Perl experts:
I keep getting an error saying I can't use the filehandle within a foreach loop, even though I'm sure to close it (or undef it, I've tried both). See the full error here: http://cl.ly/image/2b2D1T403s14
The code is available on GitHub: https://github.com/bsima/yeast-TRX
The code in question can be found in the file "script.pl" at around line 90:
foreach my $species (keys %Saccharomyces) {
    open(RAW, ">./data/$species/$geneName/raw.csv");
    print RAW "gene,dinucleotide,position,trx.score,energy.score\n";
    undef RAW;

    open(SMOOTH, ">./data/$species/$geneName/smooth.csv");
    print SMOOTH "gene,position,trx.score,energy.score\n";
    undef SMOOTH;
}
Help is much appreciated! I don't know the intricacies of how Perl works with filehandles, probably because of my lack of formal training. Any comments on my overall code quality are welcome too, if someone is feeling particularly helpful.
EDIT: Found the problem. open cannot create directories on the fly, so the $species/$geneName directory was never even being created. I added a line at the beginning of the foreach loop that said simply mkdir("$species/$geneName"); and that solved the issue.
You are getting a warning that is quite telling:
Bareword "RAW" not allowed while "strict subs" in use
Also, undef FILEHANDLE is not as good as close FILEHANDLE.
The solution is to use normal lexically scoped variables for filehandles and close them, something like this:
foreach my $species (keys %Saccharomyces) {
    open my $raw, ">", "./data/$species/$geneName/raw.csv";
    print $raw "gene,dinucleotide,position,trx.score,energy.score\n";
    close $raw;

    open my $smooth, ">", "./data/$species/$geneName/smooth.csv";
    print $smooth "gene,position,trx.score,energy.score\n";
    close $smooth;
}
Also, you should check if $raw and $smooth were opened before trying to write to them.
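Combining that advice with the directory problem mentioned in the question's EDIT, a minimal sketch might look like this (File::Path is core Perl; the make_path call and the error handling are additions not present in the original answer):
use File::Path 'make_path';

foreach my $species (keys %Saccharomyces) {
    my $outdir = "./data/$species/$geneName";
    make_path($outdir);    # creates any missing intermediate directories

    open my $raw, ">", "$outdir/raw.csv"
        or die "Cannot open $outdir/raw.csv: $!";
    print $raw "gene,dinucleotide,position,trx.score,energy.score\n";
    close $raw or die $!;

    open my $smooth, ">", "$outdir/smooth.csv"
        or die "Cannot open $outdir/smooth.csv: $!";
    print $smooth "gene,position,trx.score,energy.score\n";
    close $smooth or die $!;
}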

Renaming items in a File::Find folder traversal

I am supposed to traverse a whole tree of folders and rename everything (including folders) to lower case. I looked around quite a bit and saw that the best way was to use File::Find. I tested this code:
#!/usr/bin/perl -w
use File::Find;
use strict;

print "Folder: ";
chomp(my $dir = <STDIN>);

find(\&lowerCase, $dir);

sub lowerCase {
    print $_, " = ", lc($_), "\n";
    rename $_, lc($_);
}
and it seems to work fine. But can anyone tell me if I might run into trouble with this code? I remember posts about how you might run into trouble because of renaming folders before files, or something like that.
If you are on Windows, as comments stated, then no, renaming files or folders in any order won't be a problem, because a path DIR1/file1 is the same as dir1/file1 to Windows.
It MAY be a problem on Unix though, in which case you are better off doing a recursive BFS by hand.
Also, when making system calls like rename, ALWAYS check the result:
rename($from, $to) || die "Error renaming $from to $to: $!";
As noted in the comments, take care when renaming "ABC" to "abc". On Windows this is not a problem.
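If you do end up rolling the traversal by hand on Unix, a minimal sketch might look like the following (this is an illustration of one possible approach, not code from the original answers; the starting directory is a placeholder). It renames each entry before descending into it, so the recursion never holds a stale path, and it checks every rename as advised above:
#!/usr/bin/perl
use strict;
use warnings;

sub lowercase_tree {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;

    for my $entry (@entries) {
        my $old = "$dir/$entry";
        my $new = "$dir/" . lc $entry;
        if ($old ne $new) {
            rename $old, $new or die "Error renaming $old to $new: $!";
        }
        lowercase_tree($new) if -d $new;
    }
}

lowercase_tree('/some/top/dir');    # placeholder starting directory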
Personally, I prefer to:
1. List the files to be renamed, using find dir/ > 2b_renamed
2. Review the list manually, using an editor of choice (vim 2b_renamed, in my case)
3. Use the rename from CPAN on that list: xargs rename 'y/A-Z/a-z/' < 2b_renamed
That manual review is very important to me, even when I can easily roll back changes (via git or even Time Machine).

Should I manually set Perl's @ARGV so I can use <> to open, scan, and close files?

I have recently started learning Perl and one of my latest assignments involves searching a bunch of files for a particular string. The user provides the directory name as an argument and the program searches all the files in that directory for the pattern. Using readdir() I have managed to build an array with all the searchable file names and now need to search each and every file for the pattern. My implementation looks something like this:
sub searchDir($) {
    my $dirN = shift;
    my @dirList = glob("$dirN/*");
    for (@dirList) {
        push @fileList, $_ if -f $_;
    }
    @ARGV = @fileList;
    while (<>) {
        ## Search for pattern
    }
}
My question is: is it alright to manually load the @ARGV array as has been done above and use the <> operator to scan individual lines, or should I open / scan / close each file individually? Will it make any difference if this processing exists in a subroutine and not in the main function?
On the topic of manipulating @ARGV: that's definitely working code; Perl certainly allows you to do that. I don't think it's a good coding habit, though. Most of the code I've seen that uses the "while (<>)" idiom uses it to read from standard input, and that's what I initially expected your code to do. A more readable pattern might be to open/close each input file individually:
foreach my $file (@files) {
    open FILE, "<$file" or die "Error opening file $file ($!)";
    my @lines = <FILE>;
    close FILE or die $!;
    foreach my $line (@lines) {
        if ( $line =~ /$pattern/ ) {
            # do something here!
        }
    }
}
That would read more easily to me, although it is a few more lines of code. Perl allows you a lot of flexibility, but I think that makes it that much more important to develop your own style in Perl that's readable and understandable to you (and your co-workers, if that's important for your code/career).
Putting subroutines in the main function or in a subroutine is also mostly a stylistic decision that you should play around with and think about. Modern computers are so fast at this stuff that style and readability is much more important for scripts like this, as you're not likely to encounter situations in which such a script over-taxes your hardware.
Good luck! Perl is fun. :)
Edit: It's of course true that if he had a very large file, he should do something smarter than slurping the entire file into an array. In that case, something like this would definitely be better:
while ( my $line = <FILE> ) {
    if ( $line =~ /$pattern/ ) {
        # do something here!
    }
}
The point when I wrote "you're not likely to encounter situations in which such a script over-taxes your hardware" was meant to cover that, sorry for not being more specific. Besides, who even has 4GB hard drives, let alone 4GB files? :P
Another Edit: After perusing the Internet on the advice of commenters, I've realized that there are hard drives that are much larger than 4GB available for purchase. I thank the commenters for pointing this out, and promise in the future to never-ever-ever try to write a sarcastic comment on the internet.
I would prefer this more explicit and readable version:
#!/usr/bin/perl -w
foreach my $file (<$ARGV[0]/*>) {
    open(F, $file) or die "$!: $file";
    while (<F>) {
        # search for pattern
    }
    close F;
}
But it is also okay to manipulate @ARGV:
#!/usr/bin/perl -w
@ARGV = <$ARGV[0]/*>;
while (<>) {
    # search for pattern
}
Yes, it is OK to adjust the argument list before you start the 'while (<>)' loop; it would be more nearly foolhardy to adjust it while inside the loop. If you process option arguments, for instance, you typically remove items from @ARGV; here, you are adding items, but it still changes the original value of @ARGV.
It makes no odds whether the code is in a subroutine or in the 'main function'.
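As an illustration of that option-handling point, a minimal sketch (the -v flag is purely hypothetical) might strip options off @ARGV before the loop starts:
my $verbose = 0;
if (@ARGV && $ARGV[0] eq '-v') {
    shift @ARGV;        # remove the option so <> only sees file names
    $verbose = 1;
}
while (<>) {
    print "processing $ARGV line $.\n" if $verbose;
    # search for pattern here
}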
The previous answers cover your main Perl-programming question rather well.
So let me comment on the underlying question: How to find a pattern in a bunch of files.
Depending on the OS, it might make sense to call a specialised external program, say
grep -l <pattern> <path>
on Unix.
Depending on what you need to do with the files containing the pattern, and how big the hit/miss ratio is, this might save quite a bit of time (and re-uses proven code).
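For example, one way to drive such a program from Perl (assuming a Unix-like system with grep on the PATH; the pattern and directory are placeholders) is to use the list form of open, so the shell never gets a chance to mangle the arguments:
my $pattern = 'text to find';    # placeholder
my $dir     = '/home/jdoe';      # placeholder

# -r recurse, -l list matching file names only, -F fixed-string match
open my $grep, '-|', 'grep', '-rlF', '--', $pattern, $dir
    or die "Cannot run grep: $!";
my @matching_files = <$grep>;
close $grep;

chomp @matching_files;
print "$_\n" for @matching_files;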
The big issue with tweaking @ARGV is that it is a global variable. Also, you should be aware that while (<>) has special magic attributes (reading each file in @ARGV, or processing STDIN if @ARGV is empty, and testing for definedness rather than truth). To reduce the magic that needs to be understood, I would avoid it except for quickie hack jobs.
You can get the filename of the current file by checking $ARGV.
You may not realize it, but you are actually affecting two global variables, not just @ARGV. You are also hitting $_. It is a very, very good idea to localize $_ as well.
You can reduce the impact of munging globals by using local to localize the changes.
BTW, there is another important, subtle bit of magic with <>. Say you want to return the line number of the match in the file. You might think: OK, check perlvar and find that $. gives the line number in the last handle accessed--great. But there is an issue lurking here: $. is not reset between @ARGV files. This is great if you want to know how many lines in total you have processed, but not if you want a line number for the current file. Fortunately, there is a simple trick with eof that will solve this problem.
use strict;
use warnings;

...

searchDir( 'foo' );

sub searchDir {
    my $dirN    = shift;
    my $pattern = shift;

    local $_;

    my @fileList = grep { -f $_ } glob("$dirN/*");
    return unless @fileList;    # Don't want to process STDIN.

    local @ARGV;
    @ARGV = @fileList;

    while (<>) {
        my $found = 0;
        ## Search for pattern
        if ( $found ) {
            print "Match at $. in $ARGV\n";
        }
    }
    continue {
        # reset line numbering after each file
        close ARGV if eof;    # don't use eof().
    }
}
WARNING: I just modified your code in my browser. I have not run it, so it may have typos and probably won't work without a bit of tweaking.
Update: The reason to use local instead of my is that they do very different things. my creates a new lexical variable that is only visible in the containing block and cannot be accessed through the symbol table. local saves the existing value of the package variable and gives it a new, temporary value; the localized value is visible in any code called subsequently, until we leave the enclosing block. See perlsub: Temporary Values Via local().
In the general case of making new variables and using them, my is the correct choice. local is appropriate when you are working with globals, but you want to make sure you don't propagate your changes to the rest of the program.
This short script demonstrates local:
$foo = 'foo';

print_foo();
print_bar();
print_foo();

sub print_bar {
    local $foo;
    $foo = 'bar';
    print_foo();
}

sub print_foo {
    print "Foo: $foo\n";
}