Ignore lines in a file till match and process lines after that - perl

I am looping over the lines of a file, and once a particular line matches, I want to process only the lines after that matched line. I can do it like this:
open my $fh, '<', "abc" or die "Cannot open!!";
while (my $line = <$fh>){
next if($line !~ m/Important Lines below this Line/);
last;
}
while (my $line = <$fh>){
print $line;
}
Is there a better way to do this? (The code needs to be part of a bigger Perl script.)

I'd use the flip-flop operator:
while(<DATA>) {
next if 1 .. /Important/;
print $_;
}
__DATA__
skip
skip
Important Lines below this Line
keep
keep
output:
keep
keep
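One caveat for a bigger script: a constant operand of the flip-flop, such as the 1 above, is compared against $., the line number of the most recently read filehandle. If the handle has already been partly read, an explicit flag avoids that dependency. A minimal sketch, assuming an in-memory sample in place of the real file (the names $text, $seen_marker and @kept are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical sample input, held in memory so the sketch is self-contained.
my $text = "skip\nskip\nImportant Lines below this Line\nkeep\nkeep\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @kept;
my $seen_marker = 0;
while (my $line = <$fh>) {
    if (!$seen_marker) {
        # Still before (or on) the marker line: remember if we hit it, then skip.
        $seen_marker = 1 if $line =~ /Important Lines below this Line/;
        next;
    }
    push @kept, $line;    # everything after the marker line
}
close $fh;

print @kept;    # prints the two "keep" lines
```

The marker line itself is skipped, which matches the behaviour of the flip-flop version.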

Populate an array by splitting a string

I am trying to convert a string into an array based on space delimiter.
My input file looks like this:
>Reference
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnctcACCATGGTGTCGACTC
TTCTATGGAAACAGCGTGGATGGCGTCTCCAGGCGATCTGACGGTTCACTAAACGAGCTC
Ignoring the line starting with >, the length of rest of the string is 360.
I am trying to convert this into an array.
Here's my code so far:
#!/usr/bin/perl
use strict;
use warnings;
#### To change bases with less than 10X coverage to N #####
#### Take depth file and consensus fasta file as input arguments ####
my ($in2) = @ARGV;
my $args = $#ARGV + 1;
if ( $args != 1 ) {
print "Error!!! Insufficient Number of Arguments\n";
print "Usage: $0 <consensus fasta file> \n";
}
#### Open a filehandle to read in consensus fasta file ####
my $FH2;
my $line;
my @consensus;
my $char;
open($FH2, '<', $in2) || die "Could not open file $in2\n";
while ( <$FH2> ) {
$line = $_;
chomp $line;
next if $line =~ />/; # skip header line
$line =~ s/\s+//g;
my $len = length($line);
print "$len\n";
#print "$line";
@consensus = split(//, $line);
print "$#consensus\n";
#print "@consensus\n";
#for $char (0 .. $#consensus){
# print "$char: $consensus[$char]\n";
# }
}
The problem is that $len comes out as 60 instead of 360, and $#consensus as 59 instead of 360, which is the length of the string.
I have removed the whitespace from each line with $line =~ s/\s+//g; but it still is not working.
It looks like your code is essentially working. It's just your checking logic that makes no sense. I'd do the following:
use strict;
use warnings;
if (@ARGV != 1) {
print STDERR "Usage: $0 <consensus fasta file>\n";
exit 1;
}
open my $fh, '<', $ARGV[0] or die "$0: cannot open $ARGV[0]: $!\n";
my @consensus;
while (my $line = readline $fh) {
next if $line =~ /^>/;
$line =~ s/\s+//g;
push @consensus, split //, $line;
}
print "N = ", scalar @consensus, "\n";
Main things to note:
Error messages should go to STDERR, not STDOUT.
If an error occurs, the program should exit with an error code, not keep running.
Error messages should include the name of the program and the reason for the error.
chomp is redundant if you're going to remove all whitespace anyway.
As you're processing the input line by line, you can just keep pushing elements to the end of @consensus. At the end of the loop it'll have accumulated all characters across all lines.
Examining @consensus within the loop makes little sense as it hasn't finished building yet. Only after the loop do we have all characters we're interested in.
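The last two points can also be sketched with a single accumulated string, which makes the numbers from the question easy to check. This is only an illustration under stated assumptions: the short FASTA record and the names $fasta and $sequence are hypothetical, not taken from the real input file.

```perl
use strict;
use warnings;

# Hypothetical three-line FASTA record standing in for the real file.
my $fasta = ">Reference\nnnnnACGT\nTTGA\n";
open my $fh, '<', \$fasta or die "cannot open in-memory file: $!";

my $sequence = '';
while (my $line = <$fh>) {
    next if $line =~ /^>/;    # skip the header line
    $line =~ s/\s+//g;        # drop the newline and any other whitespace
    $sequence .= $line;       # accumulate across lines
}
close $fh;

# Split only once, after the whole sequence has been read.
my @consensus = split //, $sequence;
printf "length = %d, elements = %d, last index = %d\n",
    length($sequence), scalar(@consensus), $#consensus;
# prints: length = 12, elements = 12, last index = 11
```

Note that $#consensus is the index of the last element, so it is always one less than the number of elements. For the 360-character input in the question it would be 359, not 360.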

Perl script - Confusing error

When I run this code, I am purely trying to get all the lines containing the word "that" in them. Sounds easy enough. But when I run it, I get a list of matches that contain the word "that" but only at the end of the line. I don't know why it's coming out like this and I have been going crazy trying to solve it. I am currently getting an output of 268 total matches, and the output I need is only 13. Please advise!
#!/usr/bin/perl -w
#Usage: conc.shift.pl textfile word
open (FH, "$ARGV[0]") || die "cannot open";
@array = (1,2,3,4,5);
$count = 0;
while($line = <FH>) {
chomp $line;
shift @array;
push(@array, $line);
$count++;
if ($line =~ /that/)
{
$output = join(" ",@array);
print "$output \n";
}
}
print "Total matches: $count\n";
Don't you want to increment your $count variable only if the line contains "that", i.e.:
if ($line =~ /that/) {
$count++;
instead of incrementing the counter before checking if $line contains "that", as you have it:
$count++;
if ($line =~ /that/) {
Similarly, I suspect that your push() and join() calls, for stashing a matching line in @array, should also be within the if block, only executed if the line contains "that".
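A minimal sketch of the counter fix, under the assumption (not confirmed in the question) that the five-element array is meant to hold the preceding lines as context. In that case the shift and push stay outside the if block and only the counter moves. The sample lines and the name @window are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input lines; the real script reads them from the file in $ARGV[0].
my @lines = ("one", "that is two", "three", "four that", "five", "six");

my @window = ('') x 5;    # assumed: the last five lines, kept as context
my $count  = 0;
for my $line (@lines) {
    shift @window;
    push @window, $line;  # the current line is always the last element
    if ($line =~ /that/) {
        $count++;         # count matching lines, not every line read
        print join(' ', @window), "\n";
    }
}
print "Total matches: $count\n";    # Total matches: 2
```

Under that assumption, the matching line is always the last element of the window, which would explain why the matches show up at the end of each printed line.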
Hope this helps!

Want to add random string to identifier line in fasta file

I want to add a random string to the existing identifier line in a fasta file.
So I get:
MMETSP0259|AmphidiniumcarteCMP1314aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Then the sequence follows on the next lines as normal. I think the problem is in how I format the output. This is what I get:
MMETSP0259|AmphidiniumCMP1314aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
CTTCATCGCACATGGATAACTGTGTACCTGACTaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab
TCTGGGAAAGGTTGCTATCATGAGTCATAGAATaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
It's added to every line. (I altered length to fit here.) I want just to add to the identifier line.
This is what i have so far:
use strict;
use warnings;
my $currentId = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
my $header_line;
my $seq;
my $uniqueID;
open (my $fh,"$ARGV[0]") or die "Failed to open file: $!\n";
open (my $out_fh, ">$ARGV[0]_longer_ID_MMETSP.fasta");
while( <$fh> ){
if ($_ =~ m/^(\S+)\s+(.*)/) {
$header_line = $1;
$seq = $2;
$uniqueID = $currentId++;
print $out_fh "$header_line$uniqueID\n$seq";
} # if
} # while
close $fh;
close $out_fh;
Thanks very much, any ideas will be greatly appreciated.
Your program isn't working because the regex ^(\S+)\s+(.*) matches every line in the input file. For instance, \S+ matches CTTCATCGCACATGGATAACTGTGTACCTGACT; the newline at the end of the line matches \s+; and .* matches the empty string that is left.
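This is easy to confirm with a small test. The sketch below uses two lines from the question, each with its trailing newline as the readline operator would return it. It also shows Perl's string increment, which is where the trailing b and c in the question's output come from. The names @matched, $id and @ids are hypothetical.

```perl
use strict;
use warnings;

# Two lines from the question, each with its trailing newline.
my @lines = (
    "MMETSP0259|AmphidiniumCMP1314\n",
    "CTTCATCGCACATGGATAACTGTGTACCTGACT\n",
);

my @matched;
for my $line (@lines) {
    # \S+ takes the whole line, \s+ takes the newline, (.*) is left empty.
    if ($line =~ m/^(\S+)\s+(.*)/) {
        push @matched, $1;
        print "matched: '$1' with empty remainder '$2'\n";
    }
}
print scalar(@matched), " of ", scalar(@lines), " lines matched\n";    # 2 of 2 lines matched

# String increment: post-increment returns the old value, then steps the string.
my $id  = 'aaa';
my @ids = map { $id++ } 1 .. 3;
print "@ids\n";    # aaa aab aac
```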
Here's how I would code your solution. It simply appends $current_id to the end of any line that contains a pipe | character:
use strict;
use warnings;
use 5.010;
use autodie;
my ($filename) = @ARGV;
my $current_id = 'a' x 57;
open my $in_fh, '<', $filename;
open my $out_fh, '>', "${filename}_longer_ID_MMETSP.fasta";
while ( my $line = <$in_fh> ) {
chomp $line;
$line .= $current_id if $line =~ tr/|//;
print {$out_fh} $line, "\n"; # write to the output file, not STDOUT
}
close $out_fh;
output
MMETSP0259|AmphidiniumCMP1314aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
CTTCATCGCACATGGATAACTGTGTACCTGACT
TCTGGGAAAGGTTGCTATCATGAGTCATAGAAT

Perl: printing original file with changes

I wrote this code and it works: it finds the lines that do not start with 'SID' and adds a pipe | at the beginning of each of them. But the way I wrote it, the output contains only the changed lines, the ones with a pipe. What I actually want is to leave the file as it is and only add the pipes to the lines that match. Thank you.
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $fh;
open $fh, '<', 'file1.csv';
my $out = 'file2.csv';
open(FILE, '>', $out);
my $myline = "";
while (my $line = <$fh>) {
chomp $line;
unless ($line =~ m/^SID/) {
$line =~ m/^(.*)$/;
$myline = "\|$1";
}
print FILE $myline . "\n";
}
close $fh;
close FILE;
my file example:
SID,bla
foo bar <- my code adds the pipe to the beginning of this line
output should be like this:
SID,bla
| foo bar
but in my case I only print $myline, I know:
| foo bar
The line
$line =~ m/^(.*)$/
is misguided: all it does is put the contents of $line into $1, so the following statement
$myline = "\|$1"
may as well be
$myline = "|$line"
(The pipe | doesn't need escaping unless it is part of a regular expression.)
Since you are printing $myline at the end of your loop you are never seeing the contents of unmodified lines.
You can fix that by printing $line or $myline according to which one contains the required output, like this
while (my $line = <$fh>) {
chomp $line;
if ($line =~ m/^SID/) {
print "$line\n";
}
else {
my $myline = "|$line";
print "$myline\n";
}
}
or, much more simply, by dropping the intermediate variable and using the default $_ for the input lines, like this
while (<$fh>) {
print '|' unless /^SID/;
print;
}
Note that I have also removed the chomp as it just means you have to put the newline back on the end of the string when you print it.
Instead of creating a new variable $myline, use the one you already have:
while (my $line =<$fh>) {
$line = '|' . $line if $line !~ /^SID/;
print FILE $line;
}
Also, you can use a lexical filehandle for the output file as well. Moreover, you should check the return value of open:
open my $OUT, '>', $out or die $!;
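Putting these suggestions together, the whole loop might look like the sketch below. It uses in-memory filehandles in place of file1.csv and file2.csv so that it runs on its own. The sample data and the names $input, $output, $in and $out are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for file1.csv and file2.csv.
my $input  = "SID,bla\nfoo bar\nSID,more\nbaz\n";
my $output = '';
open my $in,  '<', \$input  or die "cannot read input: $!";
open my $out, '>', \$output or die "cannot write output: $!";

while (my $line = <$in>) {
    # Prefix a pipe only on lines that do not start with SID;
    # every line is printed, changed or not.
    $line = "|$line" unless $line =~ /^SID/;
    print {$out} $line;
}
close $in;
close $out;

print $output;
# prints:
# SID,bla
# |foo bar
# SID,more
# |baz
```

With real files, only the two open calls change; the loop stays the same.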

perl script miscounting because of empty lines

The script below captures the second column and sums its values. The only minor issue is that the file has empty lines at the end (that is how the values are exported), and because of these empty lines the script miscounts. Any ideas, please? Thanks.
my $sum_column_b = 0;
open my $file, "<", "file_to_count.txt" or die($!);
while( my $line = <$file>) {
$line =~ m/\s+(\d+)/; #regexpr to catch second column values
$sum_column_b += $1;
}
print $sum_column_b, "\n";
I think the main issue has been established: you are using $1 without making it conditional on the regex match, so values get added when they should not be. This is an alternative solution:
$sum_column_b += $1 if $line =~ m/\s+(\d+)/;
Typically, you should never use $1 unless you check that the regex you expect it to come from succeeded. Use either something like this:
if ($line =~ /(\d+)/) {
$sum += $1;
}
Or use direct assignment to a variable:
my ($num) = $line =~ /(\d+)/;
$sum += $num;
Note that you need to use list context by adding parentheses around the variable, or the regex will simply return 1 for success. Also note that, like Borodin says, this will give an undefined value when the match fails, and you must add code to check for that.
This can be handy when capturing several values:
my @nums = $line =~ /(\d+)/g;
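For the summing problem in the question, the list-context assignment together with the defined check mentioned above could look like this. It is a sketch on made-up data; @lines and $num are hypothetical names, and the trailing "\n" entries stand for the empty lines at the end of the exported file.

```perl
use strict;
use warnings;

# Hypothetical data: two valid rows followed by two empty lines.
my @lines = ("a 10\n", "b 20\n", "\n", "\n");

my $sum = 0;
for my $line (@lines) {
    # List context: $num gets the capture, or undef if the match fails.
    my ($num) = $line =~ /\s+(\d+)/;
    next unless defined $num;    # skip lines without a second column
    $sum += $num;
}
print "$sum\n";    # 30
```

Because $num is a fresh lexical on every iteration, a failed match cannot reuse a value left over from an earlier line.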
The main problem is that if the regex does not match, then $1 will hold the value it received in the previous successful match. So every empty line will cause the previous line to be counted again.
An improvement would be:
my $sum_column_b = 0;
open my $file, "<", "file_to_count.txt" or die($!);
while( my $line = <$file>) {
next if $line =~ /^\s*$/; # skip "empty" lines
# ... maybe skip other known invalid lines
if ($line =~ m/\s+(\d+)/) { #regexpr to catch second column values
$sum_column_b += $1;
} else {
warn "problematic line '$line'\n"; # report invalid lines
}
}
print $sum_column_b, "\n";
The else-block is of course optional but can help noticing invalid data.
Try putting this line just after the while line:
next if ( $line =~ /^$/ );
Basically, loop around to the next line if the current line has no content.
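One thing worth checking in the exported file: /^$/ only skips lines that are truly empty, while /^\s*$/ also skips lines that hold nothing but spaces or tabs. A small sketch of the difference, using made-up sample lines:

```perl
use strict;
use warnings;

# Hypothetical lines: one truly empty, one holding only spaces, one with data.
my @lines = ("\n", "   \n", "a 10\n");

my @skipped_by_empty = grep { /^$/ }    @lines;
my @skipped_by_blank = grep { /^\s*$/ } @lines;

printf '/^$/ skips %d line(s), /^\s*$/ skips %d line(s)' . "\n",
    scalar(@skipped_by_empty), scalar(@skipped_by_blank);
# prints: /^$/ skips 1 line(s), /^\s*$/ skips 2 line(s)
```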
#!/usr/bin/perl
use warnings;
use strict;
my $sum_column_b = 0;
open my $file, "<", "file_to_count.txt" or die($!);
while (my $line = <$file>) {
next if $line =~ m/^\s*$/; # skip blank lines (match against $line, not $_)
if ($line =~ m/\s+(\d+)/) {
$sum_column_b += $1;
}
}
print "$sum_column_b\n";