Perl: Looping through filehandles

I'm a self-taught Perler, seeking assistance from the Perl experts:
I keep getting an error saying I can't use the filehandle within a foreach loop, even though I'm sure to close it (or undef it, I've tried both). See the full error here: http://cl.ly/image/2b2D1T403s14
The code is available on GitHub: https://github.com/bsima/yeast-TRX
The code in question can be found in the file "script.pl" at around line 90:
foreach my $species (keys %Saccharomyces) {
open(RAW,">./data/$species/$geneName/raw.csv");
print RAW "gene,dinucleotide,position,trx.score,energy.score\n";
undef RAW;
open(SMOOTH,">./data/$species/$geneName/smooth.csv");
print SMOOTH "gene,position,trx.score,energy.score\n";
undef SMOOTH;
}
Help is much appreciated! I don't know the intricacies of how Perl works with filehandles, probably because of my lack of formal training. Any comments on my overall code quality are welcome too, if someone is feeling particularly helpful.
EDIT: Found the problem. Perl's open() does not create missing directories on the fly, so the $species/$geneName directory was never even being created. I added a line at the beginning of the foreach loop that said simply mkdir("$species/$geneName"); and that solved the issue.

You are getting a warning that is quite telling:
Bareword RAW not allowed while "strict subs" in use
Also, undef FILEHANDLE is not as good as close FILEHANDLE.
The solution is to use lexically scoped variables for filehandles and to close them, something like this:
foreach my $species (keys %Saccharomyces) {
open my $raw, ">", "./data/$species/$geneName/raw.csv";
print $raw "gene,dinucleotide,position,trx.score,energy.score\n";
close $raw;
open my $smooth, ">", "./data/$species/$geneName/smooth.csv";
print $smooth "gene,position,trx.score,energy.score\n";
close $smooth;
}
Also, you should check if $raw and $smooth were opened before trying to write to them.
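For instance, a minimal sketch of the first open with error checking (the message text is just illustrative):
open my $raw, ">", "./data/$species/$geneName/raw.csv"
    or die "Cannot open raw.csv for $species/$geneName: $!";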

Perl's open() does not create missing directories on the fly, so the $species/$geneName directory was never even being created. I added a line at the beginning of the foreach loop that said simply mkdir("$species/$geneName"); and that solved the issue.
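Note that mkdir creates only one directory level at a time, so mkdir("$species/$geneName") works only if the $species directory already exists. A minimal sketch using the core File::Path module, which creates any missing intermediate directories (and throws a fatal error on failure unless its error option is used):
use File::Path qw(make_path);
# creates ./data/$species and ./data/$species/$geneName as needed
make_path("./data/$species/$geneName");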


About searching recursively in Perl

I have a Perl script that I, well, mostly pieced together from questions on this site. I've read the documentation on some parts to better understand it. Anyway, here it is:
#!/usr/bin/perl
use File::Find;
my $dir = '/home/jdoe';
my $string = "hard-coded pattern to match";
find(\&printFile, $dir);
sub printFile
{
my $element = $_;
if(-f $element && $element =~ /\.txt$/)
{
open my $in, "<", $element or die $!;
while(<$in>)
{
if (/\Q$string\E/)
{
print "$File::Find::name\n";
last; # stops looking after match is found
}
}
}
}
This is a simple script that, similar to grep, looks recursively through directories for a matching string and then prints the location of each file that contains it. It works, but only if the files are located in my home directory. If I change the hard-coded search to look in a different directory (that I have permissions in), for example /admin/programs, the script no longer seems to do anything: no output is displayed, even when I know it should be matching at least one file (tested by making a file in /admin/programs with the hard-coded pattern). Why am I experiencing this behavior?
Also, might as well disclaim that this isn't a really useful script (heck, this would be so easy with grep or awk!), but understanding how to do this in Perl is important to me right now. Thanks
EDIT: Found the problem. A simple oversight: the files in the directory I was searching did not have .txt as their extension. Thanks for helping me find that.
I was able to get the desired output using the code you pasted by making a few changes, like adding:
use strict;
use warnings;
You should always use them, as they flag many errors in your code that you might otherwise miss.
Next I changed the line:
my $dir = './home/jdoe'; ##'./admin/programs'
The leading . signifies the current directory. If you still face problems, try using an absolute path instead of a relative one. Do let me know if this solves your problem.
The script itself works without any issue. The one thing hidden from us is the pattern. Share the pattern and what you expect it to match, so that we can validate it.
You could also run your program in debug mode, i.e.:
perl -d your_program
That takes you into the debugger, which has many options for inspecting the program flow. Type 'n' at the debug prompt to step through the code; each step prints the current execution point and its result.
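As an aside, hard-coding both the directory and the pattern makes this sort of oversight easy to miss. Here is a minimal sketch (the script name and usage line are illustrative) of the same search taking both from the command line:
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# usage: perl findstr.pl <directory> <pattern>
my ($dir, $string) = @ARGV;
die "usage: $0 <directory> <pattern>\n" unless defined $dir && defined $string;

find(sub {
    return unless -f $_ && /\.txt$/;      # only regular .txt files
    open my $in, '<', $_ or die "$File::Find::name: $!";
    while (<$in>) {
        if (/\Q$string\E/) {
            print "$File::Find::name\n";
            last;                         # stop after the first match in this file
        }
    }
}, $dir);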

Perl: renaming doesn't work for $value filename [closed]

I want to fill a folder with copies of the same file, each with a different name. I created a filelist.txt of the desired filenames using Windows cmd, and then wrote the following code:
use strict; # safety net
use warnings; # safety net
use File::NCopy qw(copy);
open FILE, 'C:\blabla\filelist.txt';
my @filelist = <FILE>;
my $filelistnumber = @filelist + 1;
my $file = 0;
## my $filename = 'null.txt';
my $filename = $filelist[$file];
while( $file < $filelistnumber ){
copy('base.smp','temp.smp');
rename 'temp.smp', $filename;
$file = $file + 1;
};
If I try renaming it into 'test.smp' or whatever, it works. If I try the code above, I get this:
Use of uninitialized value $filename in print at blablabla/bla/bla.pl line 25, <FILE> line 90.
What am I doing wrong? I feel there's some kind of little mistake, a syntax mistake probably, that keeps evading me.
First, here's some improved code:
use strict;
use warnings;
use File::Copy;
while (<>) {
chomp;
copy('base.smp', $_) or die $!;
}
You'll save it as script.pl and invoke it like this:
$ perl script.pl C:\blabla\filelist.txt
In what ways is this code an improvement?
It uses the core module File::Copy instead of the deprecated File::NCopy.
It uses the null filehandle or "diamond operator" (<>) to implicitly iterate over a file given as a command line parameter, which is simple and elegant.
It handles errors in the event that copy() fails for some reason.
It doesn't use a while loop or a C-style for loop to iterate over an array, which are both prone to off-by-one errors and forgetting to re-assign the iterator, as you've discovered.
It doesn't use the old 2-argument syntax for open(). (Well, not explicitly, but that's kind of beyond the scope of this answer.)
What am I doing wrong? I feel there's some kind of little mistake, a syntax mistake probably, that keeps evading me.
A syntax error would have resulted in an error message saying that there was a syntax error. But since you asked what you're doing wrong, let's walk through it:
use File::NCopy qw(copy);
This module was last updated in 2007 and is marked as deprecated. Don't use it.
open FILE, 'C:\blabla\filelist.txt';
You should use the three-argument form of open, use a lexical filehandle, and always check the return values of system calls.
my @filelist = <FILE>;
Rarely do you need to slurp an entire file into memory. In this case, you don't.
my $filelistnumber = @filelist + 1;
There's nothing inherently wrong with this line, but there is when you consider how you're using it later on. Remember that arrays are 0-indexed, so you've just set yourself up for an out of bounds array index. But we'll get to that in a second.
my $filename = $filelist[$file];
You would typically want to do this assignment inside your loop, lest you forget to update it after incrementing your counter (which is exactly what happened here).
while( $file < $filelistnumber ){
This is an odd way to iterate over an array in Perl. You could use a typical C-style for loop, but the most Perlish thing to do would be to use a foreach-style loop:
for my $element (@array) {
...
}
Each element of the list is localized to the loop, and you don't have to worry about counters, conditions, or array bounds.
copy('base.smp','temp.smp');
Again, always check the return values of system calls.
rename 'temp.smp', $filename;
No need to do a copy and a rename. You can copy to your final destination filename the first time. But if you are going to rename, always check the return values of system calls.
};
Blocks don't need to be terminated with a semicolon like simple statements do.
You should avoid using bareword filehandles. When opening, use a lexical filehandle and make sure you catch the failure if the open fails:
open(my $fh, '<', 'C:\blabla\filelist.txt') or die "Cannot open filelist.txt: $!";
The $fh variable will contain your filehandle.
For your problem, it looks as though your filelist.txt must be empty. Try using Data::Dumper to print out your @filelist to determine its contents:
use Data::Dumper;
print Dumper(\@filelist);
EDIT:
Looks like you also want to set the $filename variable to the next name in the list on each iteration, so put $filename = $filelist[$file]; at the beginning of your loop.
Your problem could also be that you are looping too far. Try getting rid of the + 1 in my $filelistnumber = @filelist + 1;
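Putting those fixes together, a minimal corrected sketch of the original loop (it copies straight to the target name, which makes the temp-file rename unnecessary):
use strict;
use warnings;
use File::Copy;

open my $fh, '<', 'C:\blabla\filelist.txt' or die "Cannot open filelist.txt: $!";
my @filelist = <$fh>;
close $fh;

for my $filename (@filelist) {
    chomp $filename;                        # strip the trailing newline
    copy('base.smp', $filename) or die $!;  # copy directly to the target name
}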

How does while work with a filehandle when reading a gigantic file in Perl

I have a very large file to read, so when I use while to read it line by line, the script takes more and more time per line the deeper it gets into the file; and, I should mention, the rise is exponential.
while (<$fh>)
{do something}
Does while have to parse through all the lines it has already read to get to the next unread line, or something like that?
How can I overcome such a situation?
EDIT 1:
My code:
$line=0;
%values;
open my $fh1, '<', "file.xml" or die $!;
while (<$fh1>)
{
$line++;
if ($_=~ s/foo//gi)
{
chomp $_;
$values{'id'} = $_;
}
elsif ($_=~ s/foo//gi)
{
chomp $_;
$values{'type'} = $_;
}
elsif ($_=~ s/foo//gi)
{
chomp $_;
$values{'pattern'} = $_;
}
if (keys(%values) == 3)
{
open FILE, ">>temp.txt" or die $!;
print FILE "$values{'id'}\t$values{'type'}\t$values{'pattern'}\n";
close FILE;
%values = ();
}
if($line == ($line1+1000000))
{
$line1=$line;
$read_time = time();
$processing_time = $read_time - $start_time - $processing_time;
print "xml file parsed till line $line, time taken $processing_time sec\n";
}
}
EDIT 2
<?xml version="1.0"?>
<!DOCTYPE Entrezgene-Set PUBLIC "-//NLM//DTD NCBI-Entrezgene, 21st January 2005//EN" "http://www.ncbi.nlm.nih.gov/data_specs/dtd/NCBI_Entrezgene.dtd">
<Entrezgene-Set>
<Entrezgene>
<Entrezgene_track-info>
<Gene-track>
<Gene-track_geneid>816394</Gene-track_geneid>
<Gene-track_create-date>
<Date>
<Date_std>
<Date-std>
<Date-std_year>2003</Date-std_year>
<Date-std_month>7</Date-std_month>
<Date-std_day>30</Date-std_day>
<Date-std_hour>19</Date-std_hour>
<Date-std_minute>53</Date-std_minute>
<Date-std_second>0</Date-std_second>
</Date-std>
</Date_std>
</Date>
</Gene-track_create-date>
<Gene-track_update-date>
<Date>
<Date_std>
<Date-std>
<Date-std_year>2015</Date-std_year>
<Date-std_month>1</Date-std_month>
<Date-std_day>8</Date-std_day>
<Date-std_hour>15</Date-std_hour>
<Date-std_minute>41</Date-std_minute>
<Date-std_second>0</Date-std_second>
</Date-std>
</Date_std>
</Date>
</Gene-track_update-date>
</Gene-track>
</Entrezgene_track-info>
<Entrezgene_type value="protein-coding">6</Entrezgene_type>
<Entrezgene_source>
<BioSource>
<BioSource_genome value="chromosome">21</BioSource_genome>
<BioSource_org>
<Org-ref>
<Org-ref_taxname>Arabidopsis thaliana</Org-ref_taxname>
<Org-ref_common>thale cress</Org-ref_common>
<Org-ref_db>
<Dbtag>
<Dbtag_db>taxon</Dbtag_db>
This is just a gist of the original XML file; if you like, you can check the whole XML file from Here. Select any one entry and send it to a file as XML.
EDIT 3
Many pioneers have suggested that I should avoid using substitution, but I feel it is essential in my code because, from a line in the XML file such as:
<Gene-track_geneid>816394</Gene-track_geneid>
I want to take only the ID, which here is 816394 but can be any number (with any number of digits) for other entries; so how can I avoid using substitution?
Thanks in advance
ANSWER:
First, I would like to apologize for taking so long to reply; I started Perl again from the ground up, and this time came out clean with use strict, which helped me maintain linear time. Also, using an XML parser is a good thing to do when handling large XML files.
Thanks, all, for the help and suggestions.
Further to my comment above, you should get into the habit of using the strict and warnings pragmas at the start of every script. warnings picks up mistakes that might otherwise not be found until runtime. strict enforces a number of good rules, including declaring all variables with my. The variable then exists only in the scope (typically the code block) it was declared in.
Try something like this and see if you get any improvement.
use strict;
use warnings;
my %values;
my $line = 0;
open my $XML, '<', "file.xml" or die $!;
open my $TEMP, '>>', "temp.txt" or die $!;
while (<$XML>) {
chomp;
$line++;
if (s/foo//gi) { $values{id} = $_; }
elsif (s/foo//gi) { $values{type} = $_; }
elsif (s/foo//gi) { $values{pattern} = $_; }
if (keys(%values) == 3) {
print $TEMP "$values{id}\t$values{type}\t$values{pattern}\n";
undef %values;
}
# if ($line = ...
}
close $TEMP;
Ignore my one-line-if formatting, I did that for brevity; format however you like.
The main thing I've done, which I hope helps, is declare the %values hash with my so it doesn't have a "global" scope, and then undef it at the end of each block, which, if I recall correctly, should free the memory it was using. Also, opening and closing your output file only once cuts out a lot of unnecessary operations.
I also cleaned up a few things. Since you are acting on the topic variable $_, you can leave it out of operations like chomp (which now occurs only once, at the beginning of the loop) and your regex substitutions.
EDIT
It just occurred to me that you might be waiting multiple loop iterations until %values reaches 3 keys, in which case clearing it at the end of every iteration would not work, so I moved the undef back inside the if.
MORE EDIT
As has been commented below, you should look into installing and using an XML parser from CPAN. If you are for whatever reason unable to use a module, a capturing regex might work better than a replacement, e.g. my ($tag) = m{^</(\w+)>}; captures the name of a closing tag.
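For the gene ID line from the question, a sketch of such a capture (the tag name is taken from the sample document):
# capture the digits between the tags instead of substituting them away
if ( my ($id) = m{<Gene-track_geneid>(\d+)</Gene-track_geneid>} ) {
    $values{id} = $id;
}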
There's no reason I see why that code would take exponentially more time. I don't see any memory leaks. %values will not grow. Looping over each line in a file depends on the line size, not the file size. I even made an XML file with 4 million lines in it from your linked XML data to test it.
My thoughts are...
There's something you're not showing us (those regexes aren't real, $start_time is not initialized).
You're on a wacky filesystem, perhaps a network filesystem. (OP is on NTFS)
You're using a very old version of Perl with a bug. (OP is using Perl 5.20.1)
A poorly implemented network filesystem could slow down while reading an enormous file. It could also misbehave because of how you're rapidly opening and closing temp.txt; you could be chewing through file handles. temp.txt should be opened once, before the loop. @Joshua's improvement suggestions are good (though the concern about %values is a red herring).
As also noted, you should not be parsing XML by hand. For a file this large, use a SAX-style parser, which works on the XML a piece at a time and keeps memory costs down, as opposed to a DOM parser, which reads the whole file. There are many to choose from.
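For example, a minimal sketch using XML::Twig, one of several stream-oriented options on CPAN (the handler below only pulls the gene ID out of the sample document):
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new(
    twig_handlers => {
        # called once for each <Gene-track_geneid> element as it is parsed
        'Gene-track_geneid' => sub {
            my ($t, $elt) = @_;
            print $elt->text, "\n";
            $t->purge;   # free the parts of the tree already processed
        },
    },
);
$twig->parsefile('file.xml');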
while (<$fh>) {...} doesn't reread the file from the start on each iteration, no
The most likely cause of your problem is that you're keeping data in memory on each iteration, causing memory usage to grow as you work your way through the file. The slowdown comes in when physical memory is exhausted and the computer has to start paging out to virtual memory, ultimately producing a situation where you could be spending more time just moving memory pages back and forth between RAM and disk than on actual work.
If you can produce a brief, runnable test case which demonstrates your problem, I'm sure we can give more specific advice to fix it. If that's not possible, just a description of your {do something} process could give us enough to go on.
Edit after Edit 1 to question:
Looking at the code posted, I suspect that your slowdown may be caused by how you're handling your output. Closing and reopening the output file each time you add a line would definitely slow things down relative to just keeping it open, and, depending on your OS/filesystem combination, each reopen may need to seek through the entire file to find the end for appending.
Nothing else stands out to me as potentially causing performance issues, but a couple other minor points:
After your regex substitutions, $_ will never contain line ends (unless you explicitly include them in the foo patterns), so you can probably skip the chomp $_; lines.
You should open the output file the same way as you open the input file (lexical filehandle, three-argument open) instead of doing it the old way.
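That is, something along these lines (the filename and hash are the ones from the question):
open my $out, '>>', 'temp.txt' or die "Cannot open temp.txt: $!";
print $out "$values{'id'}\t$values{'type'}\t$values{'pattern'}\n";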

Need help with my first Perl program

I am only a few days in and have only made a couple of things from the book I have been going through, so go easy. :P I have tried searching online and have tried different approaches, but I cannot seem to pass the MAC properly to system().
What I am trying to achieve: have Perl open a .txt file of MACs, each on its own line, and, reading line by line, take one MAC at a time and pass it to system() as an argument so that aircrack receives it. I have it printing each MAC correctly as the lines are read, but I cannot figure out why aircrack complains that the MAC it is given is not valid. Is this due to me not chomping the line?
What I have not tried yet, due to this complication: eventually I want it to print a found key to a file if aircrack says it has found one, or, if it does not find one, move on to the next BSSID, continuing until there are no more MACs in the file to try.
The MACs are in a .txt file like so:
00:00:00:00:00:00
00:00:00:00:00:00
00:00:00:00:00:00
and so on
#!/usr/bin/perl
use strict;
use warnings;
my $file = '/kismetLOGS/PcapDumpKismet/WEPMACS.txt';
open my $info, $file or die "Could not open $file: $!";
while( my $line = <$info>)
{
print $line;
system("/usr/local/bin/aircrack-ng", "-b $line", "*.pcap *.pcapdump");
last if $. == 0;
}
close $info;
exit;
Thanks for any help, tips, and pointers; I'm not looking to be spoon-fed. :) Hopefully I posted this properly. If I am way off in how I am approaching my end goal, please feel free to say so; any tips about the correct route to try would be appreciated.
You can either combine all your arguments together, like
system("/usr/local/bin/aircrack-ng -b $line *.pcap *.pcapdump");
or separate them all, like
system("/usr/local/bin/aircrack-ng", "-b","$line", "*.pcap","*.pcapdump");
The latter is usually safer, since spaces in the items don't need to be escaped. But then globbing doesn't work, as the arguments are passed directly to the system for execution.
If you want *.pcap to work, you'll need to go with the first version.
$line ends with a newline character. You should remove the newline character.
chomp $line;
About last if $. == 0;: change it to last if $. ~~ 0, which infers the types of the variables when doing the comparison. Remove it entirely if you want to iterate over all of the MAC addresses; as written, it would only run on the first (0th) line.
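Putting the chomp and argument-separation advice together, a sketch of the loop (paths as in the original; since the list form of system() does not glob, the capture files are expanded with Perl's glob instead):
while ( my $line = <$info> ) {
    chomp $line;   # strip the trailing newline so aircrack-ng gets a clean MAC
    print "$line\n";
    system('/usr/local/bin/aircrack-ng',
           '-b', $line,                     # option and value as separate arguments
           glob('*.pcap *.pcapdump')) == 0
        or warn "aircrack-ng failed for $line: $?";
}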

Should I manually set Perl's @ARGV so I can use <> to open, scan, and close files?

I have recently started learning Perl, and one of my latest assignments involves searching a bunch of files for a particular string. The user provides the directory name as an argument, and the program searches all the files in that directory for the pattern. Using readdir() I have managed to build an array with all the searchable file names, and now need to search each and every file for the pattern. My implementation looks something like this:
sub searchDir($) {
my $dirN = shift;
my @dirList = glob("$dirN/*");
for(@dirList) {
push @fileList, $_ if -f $_;
}
@ARGV = @fileList;
while(<>) {
## Search for pattern
}
}
My question is: is it alright to manually load the @ARGV array as has been done above and use the <> operator to scan in individual lines, or should I open / scan / close each file individually? Will it make any difference if this processing exists in a subroutine and not in the main function?
On the topic of manipulating @ARGV: that's definitely working code; Perl certainly allows you to do it. I don't think it's a good coding habit, though. Most of the code I've seen that uses the "while (<>)" idiom is using it to read from standard input, and that's what I initially expect your code to do. A more readable pattern might be to open/close each input file individually:
foreach my $file (@files) {
open FILE, "<$file" or die "Error opening file $file ($!)";
my @lines = <FILE>;
close FILE or die $!;
foreach my $line (@lines) {
if ( $line =~ /$pattern/ ) {
# do something here!
}
}
}
That would read more easily to me, although it is a few more lines of code. Perl allows you a lot of flexibility, but I think that makes it that much more important to develop your own style in Perl that's readable and understandable to you (and your co-workers, if that's important for your code/career).
Putting the code in the main function or in a subroutine is also mostly a stylistic decision that you should play around with and think about. Modern computers are so fast at this stuff that style and readability are much more important for scripts like this, as you're not likely to encounter situations in which such a script over-taxes your hardware.
Good luck! Perl is fun. :)
Edit: It's of course true that if he had a very large file, he should do something smarter than slurping the entire file into an array. In that case, something like this would definitely be better:
while ( my $line = <FILE> ) {
if ( $line =~ /$pattern/ ) {
# do something here!
}
}
The point when I wrote "you're not likely to encounter situations in which such a script over-taxes your hardware" was meant to cover that, sorry for not being more specific. Besides, who even has 4GB hard drives, let alone 4GB files? :P
Another Edit: After perusing the Internet on the advice of commenters, I've realized that there are hard drives that are much larger than 4GB available for purchase. I thank the commenters for pointing this out, and promise in the future to never-ever-ever try to write a sarcastic comment on the internet.
I would prefer this more explicit and readable version:
#!/usr/bin/perl -w
foreach my $file (<$ARGV[0]/*>){
open(F, $file) or die "$!: $file";
while(<F>){
# search for pattern
}
close F;
}
But it is also okay to manipulate @ARGV:
#!/usr/bin/perl -w
@ARGV = <$ARGV[0]/*>;
while(<>){
# search for pattern
}
Yes, it is OK to adjust the argument list before you start the 'while (<>)' loop; it would be more nearly foolhardy to adjust it while inside the loop. If you process option arguments, for instance, you typically remove items from @ARGV; here, you are adding items, but it still changes the original value of @ARGV.
It makes no odds whether the code is in a subroutine or in the 'main function'.
The previous answers cover your main Perl-programming question rather well.
So let me comment on the underlying question: How to find a pattern in a bunch of files.
Depending on the OS it might make sense to call a specialised external program, say
grep -l <pattern> <path>
on unix.
Depending on what you need to do with the files containing the pattern, and how big the hit/miss ratio is, this might save quite a bit of time (and re-uses proven code).
The big issue with tweaking @ARGV is that it is a global variable. Also, you should be aware that while (<>) has special magic attributes (it reads each file in @ARGV, or processes STDIN if @ARGV is empty, and it tests for definedness rather than truth). To reduce the magic that needs to be understood, I would avoid it except for quickie hack jobs.
You can get the filename of the current file by checking $ARGV.
You may not realize it, but you are actually affecting two global variables, not just #ARGV. You are also hitting $_. It is a very, very good idea to localize $_ as well.
You can reduce the impact of munging globals by using local to localize the changes.
BTW, there is another important, subtle bit of magic with <>. Say you want to return the line number of the match in the file. You might think, ok, check perlvar and find $. gives the line number in the last handle accessed--great. But there is an issue lurking here--$. is not reset between @ARGV files. This is great if you want to know how many lines total you have processed, but not if you want a line number for the current file. Fortunately there is a simple trick with eof that will solve this problem.
use strict;
use warnings;
...
searchDir( 'foo', 'pattern' );
sub searchDir {
my $dirN = shift;
my $pattern = shift;
local $_;
my @fileList = grep { -f $_ } glob("$dirN/*");
return unless @fileList; # Don't want to process STDIN.
local @ARGV;
@ARGV = @fileList;
while(<>) {
my $found = 0;
## Search for pattern
if ( $found ) {
print "Match at $. in $ARGV\n";
}
}
continue {
# reset line numbering after each file.
close ARGV if eof; # don't use eof().
}
}
WARNING: I just modified your code in my browser. I have not run it, so it may have typos and probably won't work without a bit of tweaking.
Update: The reason to use local instead of my is that they do very different things. my creates a new lexical variable that is only visible in the contained block and cannot be accessed through the symbol table. local saves the existing package variable and aliases it to a new variable. The new localized version is visible in any subsequent code, until we leave the enclosing block. See perlsub: Temporary Values Via local().
In the general case of making new variables and using them, my is the correct choice. local is appropriate when you are working with globals, but you want to make sure you don't propagate your changes to the rest of the program.
This short script demonstrates local:
$foo = 'foo';
print_foo();
print_bar();
print_foo();
sub print_bar {
local $foo;
$foo = 'bar';
print_foo();
}
sub print_foo {
print "Foo: $foo\n";
}