Avoiding regex match variable being reused - perl

Basically, I'm looping through HTML files looking for a couple of regexes. The matching itself works, but although I don't expect every file to contain matches, every iteration of the loop reports the same match (even when it isn't in that file). I assume that by using $1 the match is persisting through each iteration.
I've tried running an arbitrary regex straight after each real match to reset it, but that doesn't seem to work. The thread I got that idea from had a lot of argument about best practice and the original question's problem, so I thought it would be worth asking for advice specific to my code. It's likely not written in a great way either:
# array of diff filenames
opendir(TDIR, "$folder/diff/$today") || die "can't opendir $today: $!";
@diffList = grep !/^\.\.?$/, readdir(TDIR);
closedir TDIR;
# List of diff files
print "List of Diff files:\n" . join("\n", @diffList) . "\n\n";
for($counter = 0; $counter < scalar(@diffList); $counter++) {
    # Open diff file, read in to string
    $filename = $diffList[$counter];
    open FILE, "<", "$folder/diff/$today/$filename";
    while(<FILE>) {
        $lines .= $_;
    }
    close FILE or warn "$0: close today/$filename: $!";
    # Use regular expressions to extract the found differences
    if($lines =~ m/$plus1(.*?)$span/s) {
        $plus = $1;
        "a" =~ m/a/;
    } else {$plus = "0";}
    if($lines =~ m/$minus1(.*?)$span/s) {
        $minus = $1;
        "a" =~ m/.*/;
    } else {$minus = "0";}
    # If changes were found, send them to the database
    if($plus ne "0" && $minus ne "0") {
        # Do stuff
    }
    $plus = "0";
    $minus = "0";
}
If I put a print inside the "do stuff" if, the condition is always true and always shows the same two values, which are found in just one of the files.
Hopefully I've explained my situation well enough. Any advice is appreciated, thanks.

Your code appends the lines of each newly-read file onto $lines without ever clearing it, so after the first match every later iteration still contains the text that matched. Have you tried explicitly clearing $lines at the start of each iteration?
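A minimal sketch of that fix, reusing the question's variable names but with made-up, in-memory "files" purely for illustration:

```perl
use strict;
use warnings;

# In-memory "files" standing in for the question's HTML diff files
# (hypothetical names and contents, just for demonstration).
my %fakefile = (
    'a.html' => "plus MATCH-A span\n",
    'b.html' => "nothing of interest here\n",
);

my @results;
for my $filename (sort keys %fakefile) {
    # Declared inside the loop, so $lines starts empty for every file;
    # nothing carries over from the previous iteration.
    my $lines = '';
    open my $fh, '<', \$fakefile{$filename} or die $!;
    $lines .= $_ while <$fh>;
    close $fh;

    # Stand-in for the question's m/$plus1(.*?)$span/s match.
    my $plus = $lines =~ /plus\s+(.*?)\s+span/s ? $1 : "0";
    push @results, $plus;
}
print "@results\n";   # prints "MATCH-A 0"
```

With `my $lines` scoped to the loop body, the second file correctly reports no match instead of inheriting the first file's text.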

It's already been answered, but you could also consider a different syntax for reading the file. It can be noticeably quicker and helps you avoid little bugs like this.
Just add this to read the file between the open/close:
local $/ = undef;
$lines = <FILE>;
That'll temporarily unset the line separator so it reads the whole file at once. Just enclose it in a { } block if you need to read another file in the same scope.
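For example, wrapping the slurp in its own block keeps the change to $/ from leaking into later reads (an in-memory filehandle is used here just so the sketch is self-contained):

```perl
use strict;
use warnings;

my $data = "line one\nline two\nline three\n";
open my $fh, '<', \$data or die $!;

my $lines;
{
    # $/ is only undef inside this block, so readline slurps everything.
    local $/ = undef;
    $lines = <$fh>;
}
# Outside the block $/ is back to "\n"; any further reads on other
# handles are line-by-line again.

my $count = () = $lines =~ /\n/g;
print "$count lines slurped\n";   # prints "3 lines slurped"
```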

Related

How can I know if diamond operator moved to the next file?

I have the following code in a file perl_script.pl:
while (my $line = <>) {
    chomp $line;
    # etc.
}
I call the script with more than 1 file e.g.
perl perl_script.pl file1.txt file2.txt
Is there a way to know when $line has started being read from file2.txt, etc.?
The $ARGV variable contains the name of the current file when reading from <>. You can save the name, test on every line to see whether it changed, and update it when it does.
If it is really just about getting to a specific file, as the question seems to say, then it's easier, since you can also use @ARGV, which contains the command-line arguments, to test directly for the needed name.
One other option is to use eof (the form without parentheses!) to test for end of file, so you'll know that the next file is coming in the next iteration -- you'll need a flag of some sort as well.
A variation on this is to explicitly close the filehandle at the end of each file so that $. gets reset for each new file, which normally doesn't happen for <>; then $. == 1 is the first line of a newly opened file:
while (<>) {
    if ($. == 1) { say "new file: $ARGV" }
}
continue {
    close ARGV if eof;
}
A useful trick which is documented in perldoc -f eof is the } continue { close ARGV if eof } idiom on a while (<>) loop. This causes $. (input line number) to be reset between files of the ARGV iteration, meaning that it will always be 1 on the first line of a given file.
There's the eof trick, but good luck explaining that to people. I usually find that I want to do something with the old filename too.
Depending on what you want to do, you can track the filename you're
working on so you can recognize when you change to a new file. That way
you know both names at the same time:
use v5.10;
my %line_count;
my $current_file = $ARGV[0];
while( <> ) {
    if( $ARGV ne $current_file ) {
        say "Change of file from $current_file to $ARGV";
        $current_file = $ARGV;
    }
    $line_count{$ARGV}++;
}
use Data::Dumper;
say Dumper( \%line_count );
Now you see when the file changes, and you can use $ARGV to know which file you're currently reading:
Change of file from cache-filler.pl to common.pl
Change of file from common.pl to wc.pl
Change of file from wc.pl to wordpress_posts.pl
$VAR1 = {
    'cache-filler.pl' => 102,
    'common.pl' => 13,
    'wordpress_posts.pl' => 214,
    'wc.pl' => 15
};
Depending on what I'm doing, I might not let the diamond operator do all the work. This gives me a lot more control over what's happening and how I can respond to things:
foreach my $arg ( @ARGV ) {
    next unless open my $fh, '<', $arg;
    while( <$fh> ) {
        ...
    }
}

Run a script in multiple directories with multiple output files in Perl (problems comparing hash key values)

I have a script which looks something like the following. I want to use it to search through the current directory, open all directories in that directory, open all files that match certain REs (fastq files formatted so that every four lines go together), do some work with these files, and write some results to a file in each directory. (Note: the actual script does a lot more than this, but I think I have a structural issue with the iteration over folders, because the script works when a simplified version is used in one folder, so I am posting a simplified version here.)
#!/usr/bin/perl
#Created by C. Pells, M. R. Snyder, and N. T. Marshall 2017
#Script trims and merges high throughput sequencing reads from fastq files for a specific primer set
use Cwd;
use warnings;
my $StartTime = localtime;
my $MasterDir = getcwd; #obtains a full path to the current directory
opendir (DIR, $MasterDir);
my @objects = readdir (DIR);
closedir (DIR);
foreach (@objects){
    print $_,"\n";
}
my @Dirs = ();
foreach my $O (0..$#objects){
    my $CurrDir = "";
    if ((length ($objects[$O]) < 7) && ($O > 1)){ #Checking if the length of the object name is < 7 characters. All samples are 6 or less. Removing the first two elements: "." and ".."
        $CurrDir = $MasterDir."/".$objects[$O]; #appends directory name to full path
        push (@Dirs, $CurrDir);
    }
}
foreach (@Dirs){
    print $_,"\n"; #checks that all directories were read in
}
foreach my $S (0..$#Dirs){
    my @files = ();
    opendir (DIR, $Dirs[$S]) || die "cannot open $Dirs[$S]: $!";
    @files = readdir DIR; #reads in all files in a directory
    closedir DIR;
    my @AbsFiles = ();
    foreach my $F (0..$#files){
        my $AbsFileName = $Dirs[$S]."/".$files[$F]; #appends file name to full path
        push (@AbsFiles, $AbsFileName);
    }
    foreach my $AF (0..$#AbsFiles){
        if ($AbsFiles[$AF] =~ /_R2_001\.fastq$/m){ #finds reverse fastq file
            my @readbuffer = ();
            #read in reverse fastq
            my %RSeqHash;
            my $cc = 0;
            print "Reading, reversing, complementing, and trimming reverse fastq file $AbsFiles[$AF]\n";
            open (INPUT1, $AbsFiles[$AF]) || die "Can't open file: $!\n";
            while (<INPUT1>){
                chomp ($_);
                push(@readbuffer, $_);
                if (@readbuffer == 4) {
                    $rsn = substr($readbuffer[0], 0, 45); #trims reverse seq name
                    $cc++ % 10000 == 0 and print "$rsn\n";
                    $RSeqHash{$rsn} = $readbuffer[1];
                    @readbuffer = ();
                }
            }
        }
    }
    foreach my $AFx (0..$#AbsFiles){
        if ($AbsFiles[$AFx] =~ /_R1_001\.fastq$/m){ #finds forward fastq file
            print "Reading forward fastq file $AbsFiles[$AFx]\n";
            open (INPUT2, $AbsFiles[$AFx]) || die "Can't open file: $!\n";
            my $OutMergeName = $Dirs[$S]."/"."Merged.fasta";
            open (OUT, ">", "$OutMergeName");
            my $cc = 0;
            my @readbuffer = ();
            while (<INPUT2>){
                chomp ($_);
                push(@readbuffer, $_);
                if (@readbuffer == 4) {
                    my $fsn = substr($readbuffer[0], 0, 45); #trims forward seq name
                    #$cc++ % 10000 == 0 and print "$fsn\n$readbuffer[1]\n";
                    if ( exists($RSeqHash{$fsn}) ){ #checks to see if forward seq name is present in reverse seq hash
                        print "$fsn was found in Reverse Seq Hash\n";
                        print OUT "$fsn\n$readbuffer[1]\n";
                    }
                    else {
                        $cc++ % 10000 == 0 and print "$fsn not found in Reverse Seq Hash\n";
                    }
                    @readbuffer = ();
                }
            }
            close INPUT1;
            close INPUT2;
            close OUT;
        }
    }
}
my $EndTime= localtime;
print "Script began at\t$StartTime\nCompleted at\t$EndTime\n";
Again, I know that the script works without iterating over folders, but with this version I just get empty output files. From the print statements I inserted, I've determined that Perl can't find the variable $fsn as a key in the hash while reading INPUT2. I can't understand why, because each file is there, and I know the keys match since it works when I don't iterate over folders. So either there is something simple I am missing, or I have hit some sort of memory limitation in Perl. Any help is appreciated!
Turns out my issue was with where I was declaring the hash. For some reason, even though I only declare it after finding the first input file, the script fails unless I declare the hash before the foreach loop that cycles through all items in @AbsFiles searching for the first input file. That's fine, because it means the hash is cleared in every new directory, but I don't understand why it failed the way it did, since it should only be declaring (or clearing) the hash when it finds the input file name. I guess I don't NEED to know why it didn't work before, but some help understanding it would be nice.
I have to give credit to another user for helping me realize this. They attempted to answer my question but did not, and then gave me this hint about where I declare my hash in a comment on that answer. That answer has since been deleted, so I can't credit the user for pointing me in this direction. I would love to know what they understand about Perl that I do not, which made it clear to them that this was the problem. I apologize that I was busy with data analysis and a conference, so I could not respond to that comment sooner.
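For what it's worth, the likely explanation is lexical scope: a hash declared with my inside one block ceases to exist when that block ends, and a %RSeqHash mentioned in a later, sibling block is a different variable. Without use strict, that second mention silently becomes an empty package hash, which is why exists never finds the keys. A minimal sketch of the effect, with hypothetical file names:

```perl
use strict;
use warnings;

my %RSeqHash;    # declared before both loops, so both loops share it

for my $file ('sample_R2_001.fastq') {
    # Had the hash been declared with my *inside* this block, it would
    # be destroyed at the block's end, and any %RSeqHash mentioned
    # later would be a different (empty) variable.
    $RSeqHash{'SEQ1'} = 'ACGT';
}

my $status;
for my $file ('sample_R1_001.fastq') {
    # Works only because %RSeqHash lives in the enclosing scope.
    $status = exists $RSeqHash{'SEQ1'} ? 'found' : 'not found';
}
print "$status\n";   # prints "found"
```

Declaring the hash before the foreach loop, as you did, is exactly this fix.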

How to get rid of the syntax error in this code?

Below I've provided just a chunk of a huge Perl script I am trying to write. I am getting syntax errors in the else statement, but the console window only says "syntax error at perl script" without clearly identifying the error. I am trying to create a variable file file_no_$i.txt, copy the contents of t_code.txt into it, and then find and replace a string in the variable file with some selected keys of the hash %defines_2.
open ( my $pointer, "<", "t_code.txt" ) or die $!;
my $out_pointer;
for (my $i=0 ; $i <= $#match ; $i++) {
    for (my $j=0; $j <= $#match ; $j++) {
        if ($match[$i] =~ $match[$j]) {
            next;
        }
        else {
            my $file_name = "file_no_$i.txt";
            open $out_pointer, ">", $file_name or die "Can't open the output file!";
            copy("$file_name", "t_code.txt") or die "Copy failed: $!";
            my @lin = <$out_pointer>;
            foreach $_ (@lin) {
                $_ =~ s/UART90_BASE_ADDRESS/$defines_2{ $_ = grep{/$match[$i]/} (keys %defines_2)};
            }
        }
    }
}
You cannot use an unescaped / inside an s/// construct. Rather than escaping each one with a backslash, you can use different delimiters:
s#UART90_BASE_ADDRESS#$defines_2{ $_ = grep{/$match[$i]/} (keys %defines_2)}#;
It fixes the syntax error, but I fear it still won't do what you want. Without data, it's hard to test, though.
What I think you're doing is editing a number of text files whose names look like file_no_1.txt etc. You're doing that by copying the current file to t_code.txt, then reading that file line by line, editing as required, and writing the lines back to the original text file.
The problem with that approach is that the file will be copied and rewritten many times. It would be better to read the whole file into an array, make all the edits, and then write them back in one operation. That would be fine unless the file is enormous, say several GB.
Here's some code that implements that approach. You can see that $file_name is defined and @lines is filled outside the inner loop. The innermost loop modifies the elements of @lines and, outside that loop again, @lines is written back to the original text file.
I couldn't fathom a couple of things about your code.
I'm not sure whether you should be using =~ or whether you intended a simple eq. The former does a "contains" test, and you had a problem in the past where you meant to check that the first string had the second at the end.
The grep call
grep{/$match[$i]/} (keys %defines_2)
worries me, as it can potentially return more than one key of the %defines_2 hash, in which case your own code will insert what is pretty much a random selection from the hash elements
If your code is working then that's fine, but if not then I hope this helps you fix it. If you need more help on this chunk of code then you should include a small sample of the data so that we can better understand what is going on.
for my $i (0 .. $#match) {
    my $file_name = "file_no_$i.txt";
    my @lines = do {
        open my $in_fh, '<', 't_code.txt' or die $!;
        <$in_fh>;
    };
    for my $j (0 .. $#match) {
        next if $match[$i] =~ $match[$j];
        for ( @lines ) {
            my ($match) = grep { /$match[$i]/ } keys %defines_2;
            s/UART90_BASE_ADDRESS/$defines_2{$match}/;
        }
    }
    open my $out_fh, '>', $file_name or die qq{Can't open "$file_name" for output: $!};
    print $out_fh $_ for @lines;
    close $out_fh or die qq{Failed to close output file "$file_name": $!};
}

nested while loop comparing two files, outside loop stops after one iteration, need to compare every line of files

Firstly, I apologise if my formatting here is incorrect; I am very new to writing scripts (3 days) and this is my first post on this site.
I have two files which are tab separated, File a contains 14 columns, and File b contains 8 columns.
One column in File b has a numeric value which correlates to a range of numbers generated by two numeric fields from File a.
For every line in File a, I need to search through File b and print a combination of data from fields in both files. There will be multiple matches for each line of File a because a numeric range is accepted.
The code I have created does exactly what I want, but only for the first line of File a; the outer loop doesn't continue. I have looked all over the internet, and I believe it may be something to do with the fact that both files are read from standard input. I have tried to correct this problem, but I can't seem to get anything to work.
My current understanding is that my loop may work if I change one file to read from a different file descriptor... with something such as >$3, but I don't really understand this very well despite my research. Or possibly by using the grep function, which I am also struggling with.
Here is the outline of the code I am using now:
use strict;
use warnings;
print "which file read from?\n";
my $filea = <STDIN>;
chomp $filea;
{
    unless (open ( FILEA, $filea)) {
        print "cannot open, do you want to try again? y/n?\n";
        my $again = <STDIN>;
        chomp $again;
        if ($again =~ 'n') {
            exit;
        } else {
            print "\n";
            $filea = <STDIN>;
            chomp $filea;
            redo;
        }
    }
}
#I also open fileb the same way, but won't write it all out to save space and your time.
my $output = 'output.txt';
open (OUTPUT, ">>$output");
while (my $loop1 = <FILEA>) {
    chomp $loop1;
    ( my $var1, my $var2, my $var3, my $var4, my $var5, my $var6,
      my $var7, my $var8, my $var9, my $var10, my $var11, my $var12,
      my $var13, my $var14 ) = split ( "\t", $loop1);
    #create the range of numbers which needs to be matched from file b.
    my $length = length ($var4);
    my $range = ($var2 + $length);
    #perform the search loop through fileb
    while (my $loop2 = <FILEB>) {
        chomp $loop2;
        ( my $vala, my $valb, my $valc, my $vald, my $vale, my $valf,
          my $valg ) = split ( "\t", $loop2 );
        #there are then several functions and additions of the data, which all work basically, so I'll just use a quick example.
        if ($vald >= $var3 && $vald <= $range) {
            print OUTPUT "$var1, $vald, $var11, $valf, $vala, $var5 \n";
        }
    }
}
I hope this all makes sense. I tried to make everything as clear as possible; if anyone could help me edit the code so that the loop continues through all of file a, that would be great.
If possible please explain what you've done. Ideally I'd like it if its possible to obtain this result without changing the code too much.
Thanks guys!!!
Avoid bareword filehandles when possible; use a lexical filehandle like $fh instead of FH.
You can use until instead of unless, and skip the redo:
print "Enter the file name\n";
my $file_a = <STDIN>;
chomp $file_a;
my $fh_a;
until(open $fh_a, '<', $file_a) {
print "Re-enter the file name or 'n' to cancel\n";
$file_a = <STDIN>;
chomp $file_a;
if($file_a eq 'n') {
exit;
}
}
You can (should) use an array instead of all those individual column variables: my @cols_a = split /\t/, $line;
You should read file B into an array, once, and then search that array each time you need to: my @file_b = <$fh_b>;
The result will look something like this:
#Assume we have opened both files already . . .
my @file_b = <$fh_b>;
chomp @file_b;
while(my $line = <$fh_a>) {
    chomp $line;
    my @cols_a = split /\t/, $line;
    #Remember, most arrays (Perl included) are zero-indexed,
    #so $cols_a[1] is actually the SECOND column.
    my $range = ($cols_a[1] + length $cols_a[3]);
    foreach my $line_b (@file_b) {
        #This loop will run once for every single line of file A.
        #Not efficient, but it will work.
        #There are, of course, lots of optimisations you can make
        #(starting with, for example, storing file B as an array of array
        #references so you don't have to split each line every time)
        my @cols_b = split /\t/, $line_b;
        if($cols_b[3] > $cols_a[2] && $cols_b[3] < ($cols_a[2] + $range)) {
            #Do whatever here
        }
    }
}

Perl editting a file

I'm trying to open a file, search for a specific string in the file to begin my search from, and then perform a replacement on a string later in the file. For example, my file looks like:
Test Old
Hello World
Old
Data
Begin_search_here
New Data
Old Data
New Data
I want to open the file, begin my search from "Begin_search_here", and then replace the next instance of the word "Old" with "New". My code is shown below. I'm correctly finding the string, but for some reason I'm not writing to the correct location.
open(FILE, "+<$filename") || die "problem opening file";
my $search = 0;
while(my $line = <FILE>)
{
    if($line =~ m/Begin_search_here/)
    {
        $search = 1;
    }
    if($search == 1 && $line =~ m/Old/)
    {
        $line =~ s/Old/New/;
        print FILE $line;
    }
}
close FILE;
Here ya go:
local $^I = '.bak';
local @ARGV = ($filename);
local $_;
my $replaced = 0;
while (<>) {
    if (!$replaced && /Begin_search_here/ .. $replaced) {
        $replaced = s/Old/New/;
    }
    print;
}
Explanation:
Setting the $^I variable enables in-place editing, just as if you had run perl with the -i flag. The original file will be saved under the same name with the extension ".bak" appended; replace ".bak" with "" if you don't want a backup made.
@ARGV is set to the list of files to do in-place editing on; here, just your single file named in the variable $filename.
$_ is localized to prevent overwriting this commonly-used variable in the event this code snippet occurs in a subroutine.
The flip-flop operator .. is used to figure out what part of the file to perform substitutions in. It will be false until the first time a line matching the pattern Begin_search_here is encountered, and then will remain true until the first time a substitution occurs (as recorded in the variable $replaced), when it will turn off.
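To see the flip-flop in isolation, here is a toy example (hypothetical input lines, not the question's file). The range is false until the left pattern matches, then stays true up to and including the line where the right pattern matches:

```perl
use strict;
use warnings;

my @lines = ('before', 'START', 'middle', 'END', 'after');
my @inside;

for (@lines) {
    # /START/ .. /END/ is false for 'before', turns true at 'START',
    # remains true through 'END', and is false again at 'after'.
    push @inside, $_ if /START/ .. /END/;
}
print "@inside\n";   # prints "START middle END"
```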
You would probably be best served by opening the input file in read mode (open( my $fh, '<', $file ) or die ...;), and writing the modified text to a temporary output file, then copying the temporary file overtop of the input file when you're done doing your processing.
You are misusing the random-access file mode. By the time you update $line and say print FILE $line, the "cursor" of your filehandle is already positioned at the beginning of the next line. So the original line is not changed and the next line is over-written, instead of overwriting the original line.
Inplace editing (see perlrun) looks like it would be well suited for this problem.
Otherwise, you need to read up on the tell function to save your file position before you read a line and seek back to that position before you rewrite the line. Oh, and the data that you write must be exactly the same size as the data you are overwriting, or you will totally fubar your file -- see this question.
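A minimal sketch of that tell/seek approach, run against an in-memory file purely for illustration; note the replacement text must be exactly the same length as what it overwrites:

```perl
use strict;
use warnings;

# In-memory file standing in for the real one; '+<' opens it read-write.
my $data = "Old Data\nOld Data\n";
open my $fh, '+<', \$data or die $!;

while (1) {
    my $pos  = tell $fh;            # remember where this line starts
    my $line = <$fh>;
    last unless defined $line;
    if ($line =~ s/^Old/New/) {     # "New" is the same length as "Old"
        seek $fh, $pos, 0;          # go back to the start of the line
        print $fh $line;            # overwrite it in place
        last;                       # replace only the first occurrence
    }
}
close $fh;
print $data;   # first line is now "New Data"; the second is untouched
```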
I have done a number of edits like this, so I came up with a generic (yet stripped-down) strategy:
use strict;
use warnings;
use English qw<$INPLACE_EDIT>;
use Params::Util qw<_CODE>;

local $INPLACE_EDIT = '.bak';
local @ARGV = '/path/to/file';

my @line_actions
    = ( qr/^Begin_search_here/
      , qr/^Old Data/ => sub { s/^Old/New/ }
      );
my $match = shift @line_actions;
while ( <> ) {
    if ( $match and /$match/ ) {
        if ( _CODE( $line_actions[0] )) {
            shift( @line_actions )->( $_ );
        }
        $match = shift @line_actions;
    }
    print;
}
This works. It will, as you specified, replace only one occurrence.
#! /usr/bin/perl -pi.bak
if (not $match_state) {
    if (/Begin_search_here/) {
        $match_state = "accepting";
    }
}
elsif ($match_state eq "accepting") {
    if (s/Old/New/) {
        $match_state = "done";
    }
}
Be very careful about editing a file in place. If the data you're replacing is a different length, you wreck the file. Also, if your program fails in the middle, you end up with a destroyed file.
Your best bet is to read in each line, process the line, and write each line to a new file. This will even allow you to run your program, examine the output, and if you have an error, fix it and rerun the program. Then, once everything is okay, add in the step to move your new file to the old name.
I've been using Perl since version 3.x, and I can't think of a single time I modified a file in place.
use strict;
use warnings;
open (INPUT, "$oldfile") or die qq(Can't open file "$oldfile" for reading);
open (OUTPUT, ">$oldfile.$$") or die qq(Can't open file "$oldfile.$$" for writing);
my $startFlag = 0;
while (my $line = <INPUT>) {
    if ($line =~ /Begin_search_here/) {
        $startFlag = 1;
    }
    if ($startFlag) {
        $line =~ s/Old/New/;
    }
    print OUTPUT "$line";
}
#
# Only implement these two steps once you've tested your program
#
unlink $oldfile;
rename "$oldfile.$$", $oldfile;