Best way to keep track of previous and following line in perl - perl

What is the best/right way, in perl, of keeping the information from the previous and/or following line. For example, with this code:
while (<IN>) {
print;
}
how can it be changed to not print the line only if the previous or the next line in the file match foo, but printing otherwise?
Could you give code examples. Thanks.

Updated: Simplified exposition.
Basically, you need to keep track of two extra lines if you want to print the current lines based on information contained in two other lines. Here is a simple script with everything hard-coded:
#!/usr/bin/env perl
use strict;
use warnings;
my $prev = undef;
my $candidate = scalar <DATA>;
while (defined $candidate) {
my $next = <DATA>;
unless (
(defined($prev) && ($prev =~ /foo/)) ||
(defined($next) && ($next =~ /foo/))
) {
print $candidate;
}
($prev, $candidate) = ($candidate, $next);
}
__DATA__
1
2
foo
3
4
5
foo
6
foo
7
8
9
foo
We can generalize this to a function that takes a filehandle and a test (as a subroutine reference):
#!/usr/bin/env perl
use strict; use warnings;
print_mid_if(\*DATA, sub{ return !(
(defined($_[0]) && ($_[0] =~ /foo/)) ||
(defined($_[1]) && ($_[1] =~ /foo/))
)} );
sub print_mid_if {
my $fh = shift;
my $test = shift;
my $prev = undef;
my $candidate = scalar <$fh>;
while (defined $candidate) {
my $next = <$fh>;
print $candidate if $test->($prev, $next);
($prev, $candidate) = ($candidate, $next);
}
}
__DATA__
1
2
foo
3
4
5
foo
6
foo
7
8
9
foo

You could read your line into an array, and then if you get something that signals you in some way, pop out the last few elements of the array. Once you've finished reading everything in, you could print it:
use strict;
use warnings;
use feature qw(say);
use autodie; #Won't catch attempt to read from an empty file
use constant FILE_NAME => "some_name.txt"
or die qq(Cannot open ) . FILE_NAME . qq(for reading: $!\n);
open my $fh, "<", FILE_NAME;
my #output;
LINE:
while ( my $line = <DATA> ) {
chomp $line;
if ( $line eq "foo" ) {
pop #output; #The line before foo
<DATA>; #The line after foo
next LINE; #Skip line foo. Don't push it into the array
}
push #output, $line;
}
From there, you can print out the array with the values you don't want printed already taken care of.
for my $line ( #output ) {
say $line;
}
The only problem is that this takes memory. If your file is extremely large, you could run out of memory.
One way to get around this is to use a buffer. You store your values in an array, and shift out the last value when you push another in the array. If the value read in is foo, you can reset the array. In this case, the buffer will contain at most one line:
#! /usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw(say);
my #buffer;
LINE:
while ( my $line = <DATA> ) {
chomp $line;
if ( $line eq "foo" ) {
#buffer = (); #Empty buffer of previous line
<DATA>; #Get rid of the next line
next LINE; #Foo doesn't get pushed into the buffer
}
push #buffer, $line;
if ( #buffer > 1 ) { #Buffer is "full"
say shift #buffer; #Print out previous line
}
}
#
# Empty out buffer
#
for my $line ( #buffer ) {
say $line;
}
__DATA__
2
3
4
5
6
7
8
9
10
11
12
13
1
2
foo
3
4
5
foo
6
7
8
9
foo
Note that it is very possible that I might attempt to read from an empty file when I skip the next line. This is okay. The <$fh> will return either an empty string or undef, but I can ignore that. I'll catch the error when I go back to the top of my loop.

I didn't see that you had any specific criteria for "best", so I'll give you a solution that may be "best" along a different axis than those presented so far. You could use Tie::File and treat the entire file as an array, then iterate the array using an index. The previous and next lines are just $index-1 and $index+1 respectively. You just have to worry a little about your indices going beyond the bounds of your array. Here's an example:
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010; # just for "say"
use Tie::File;
tie my #array, 'Tie::File', "filename" or die;
for my $i (0..$#array) {
if ($i > 0 && $i < $#array) { # ensure $i-1 and $i+1 make sense
next if $array[$i-1] =~ /BEFORE/ &&
$array[$i+1] =~ /AFTER/;
}
say $array[$i];
}
If it's more convenient, you can specify a filehandle instead of a filename and Tie::File also has some parameters to control memory usage or change what it means to be a "line" if you want that. Check the docs for more info.
Anyway, that's another way to do what you want that might be conceptually simpler if you are familiar with arrays and like to think in terms of arrays.

I would read the file into an array, with each line being an array element, then you can do the comparisons. The only real design consideration is the size of the file being read into memory.

Related

Sorting 5th column in descending order error message

The text file I am trying to sort:
MYNETAPP01-NY
700000123456
Filesystem total used avail capacity Mounted on
/vol/vfiler_PROD1_SF_NFS15K01/ 1638GB 735GB 903GB 45% /vol/vfiler_PROD1_SF_NFS15K01/
/vol/vfiler_PROD1_SF_NFS15K01/.snapshot 409GB 105GB 303GB 26% /vol/vfiler_PROD1_SF_NFS15K01/.snapshot
/vol/vfiler_PROD1_SF_isci_15K01/ 2048GB 1653GB 394GB 81% /vol/vfiler_PROD1_SF_isci_15K01/
snap reserve 0TB 0TB 0TB ---% /vol/vfiler_PROD1_SF_isci_15K01/..
I am trying to sort this text file by its 5th column (the capacity field) in descending order.
When I first started this there was a percentage symbol mixed with the numbers. I solved this by substituting the the value like so: s/%/ %/g for #data;. This made it easier to sort the numbers alone. Afterwards I will change it back to the way it was with s/ %/%/g.
After running the script, I received this error:
#ACI-CM-L-53:~$ ./netapp.pl
Can't use string ("/vol/vfiler_PROD1_SF_isci_15K01/"...) as an ARRAY ref while "strict refs" in use at ./netapp.pl line 20, line 24 (#1)
(F) You've told Perl to dereference a string, something which
use strict blocks to prevent it happening accidentally. See
"Symbolic references" in perlref. This can be triggered by an # or $
in a double-quoted string immediately before interpolating a variable,
for example in "user #$twitter_id", which says to treat the contents
of $twitter_id as an array reference; use a \ to have a literal #
symbol followed by the contents of $twitter_id: "user \#$twitter_id".
Uncaught exception from user code:
Can't use string ("/vol/vfiler_PROD1_SF_isci_15K01/"...) as an ARRAY ref while "strict refs" in use at ./netapp.pl line 20, <$DATA> line 24.
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
open (my $DATA, "<raw_info.txt") or die "$!";
my $systemName = <$DATA>;
my $systemSN = <$DATA>;
my $header = <$DATA>;
my #data;
while ( <$DATA> ) {
#data = (<$DATA>);
}
s/%/ %/g for #data;
s/---/000/ for #data;
print #data;
my #sorted = sort { $b->[5] <=> $a->[5] } #data;
print #sorted;
close($DATA);
Here is an approach using Text::Table which will nicely align your output into neat columns.
#!/usr/bin/perl
use strict;
use warnings;
use Text::Table;
open my $DATA, '<', 'file1' or die $!;
<$DATA> for 1 .. 2; # throw away first two lines
chomp(my $hdr = <$DATA>); # header
my $tbl = Text::Table->new( split ' ', $hdr, 6 );
$tbl->load( map [split /\s{2,}/], sort by_percent <$DATA> );
print $tbl;
sub by_percent {
my $keya = $a =~ /(\d+)%/ ? $1 : '0';
my $keyb = $b =~ /(\d+)%/ ? $1 : '0';
$keyb <=> $keya
}
The output generated is:
Filesystem total used avail capacity Mounted on
/vol/vfiler_PROD1_SF_isci_15K01/ 2048GB 1653GB 394GB 81% /vol/vfiler_PROD1_SF_isci_15K01/
/vol/vfiler_PROD1_SF_NFS15K01/ 1638GB 735GB 903GB 45% /vol/vfiler_PROD1_SF_NFS15K01/
/vol/vfiler_PROD1_SF_NFS15K01/.snapshot 409GB 105GB 303GB 26% /vol/vfiler_PROD1_SF_NFS15K01/.snapshot
snap reserve 0TB 0TB 0TB ---% /vol/vfiler_PROD1_SF_isci_15K01/..
Update
To explain some of the advanced parts of the program.
my $tbl = Text::Table->new( split ' ', $hdr, 6 );
This creates the Text::Table object with the header split into 6 columns. Without the limit of 6 columns, it would have created 7 columns (because the last field, 'mounted on', also contains a space. It would have been incorrectly split into 2 columns for a total of 7).
$tbl->load( map [split /\s{2,}/], sort by_percent <$DATA> );
The statement above 'loads' the data into the table. The map applies a transformation to each line from <$DATA>. Each line is split into an anonymous array, (created by [....]). The split is on 2 or more spaces, \s{2,}. If that wasn't specified, then the data `snap reserve' with 1 space would have been incorrectly split.
I hope this makes whats going on more clear.
And a simpler example that doesn't align the columns like Text::Table, but leaves them in the form they originally were read might be:
open my $DATA, '<', 'file1' or die $!;
<$DATA> for 1 .. 2; # throw away first two lines
my $hdr = <$DATA>; # header
print $hdr;
print sort by_percent <$DATA>;
sub by_percent {
my $keya = $a =~ /(\d+)%/ ? $1 : '0';
my $keyb = $b =~ /(\d+)%/ ? $1 : '0';
$keyb <=> $keya
}
In addition to skipping the fourth line of the file, this line is wrong
my #sorted = sort { $b->[5] <=> $a->[5] } #data
But presumably you knew that as the error message says
at ./netapp.pl line 20
$a and $b are lines of text from the array #data, but you're treating them as array references. It looks like you need to extract the fifth "field" from both variables before you compare them, but no one can tell you how to do that
You code is quite far from what you want. Trying to change it as little as possible, this works:
#!/usr/bin/perl
use strict;
use warnings;
open (my $fh, "<", "raw_info.txt") or die "$!";
my $systemName = <$fh>;
my $systemSN = <$fh>;
my $header = <$fh>;
my #data;
while( my $d = <$fh> ) {
chomp $d;
my #fields = split '\s{2,}', $d;
if( scalar #fields > 4 ) {
$fields[4] = $fields[4] =~ /(\d+)/ ? $1 : 0;
push #data, [ #fields ];
}
}
foreach my $i ( #data ) {
print join("\t", #$i), "\n";
}
my #sorted = sort { $b->[4] <=> $a->[4] } #data;
foreach my $i ( #sorted ) {
$i->[4] .= '%';
print join("\t", #$i), "\n";
}
close($fh);
Let´s make a few things clear:
If using the $ notation, it is customary to define file variables in lower case as $fd. It is also typical to name the file descriptor as "fd".
You define but not use the first three variables. If you don´t apply chomp to them, the final CR will be added to them. I have not done it as they are not used.
You are defining a list with a line in each element. But then you need a list ref inside to separate the fields.
The separation is done using split.
Empty lines are skipped by counting the number of fields.
I use something more compact to get rid of the % and transform the --- into a 0.
Lines are added to list #data using push and turning the list to add into a list ref with [ #list ].
A list of list refs needs two loops to get printed. One traverses the list (foreach), another (implicit in join) the columns.
Now you can sort the list and print it out in the same way. By the way, Perl lists (or arrays) start at index 0, so the 5th column is 4.
This is not the way I would have coded it, but I hope it is clear to you as it is close to your original code.

How to find lines containing a match between two files in perl?

I'm a novice at using perl. What I want to do is compare two files. One is my index file that I am calling "temp." I am attempting to use this to search through a main file that I am calling "array." The index file has only numbers in it. There are lines in my array that have those numbers. I've been trying to find the intersection between those two files, but my code is not working. Here's what I've been trying to do.
#!/usr/bin/perl
print "Enter the input file:";
my $filename=<STDIN>;
open (FILE, "$filename") || die "Cannot open file: $!";
my #array=<FILE>;
close(FILE);
print "Enter the index file:";
my $temp=<STDIN>;
open (TEMP, "$temp") || die "Cannot open file: $!";
my #temp=<TEMP>;
close(TEMP);
my %seen= ();
foreach (#array) {
$seen{$_}=1;
}
my #intersection=grep($seen{$_}, #temp);
foreach (#intersection) {
print "$_\n";
}
If I can't use intersection, then what else can I do to move each line that has a match between the two files?
For those of you asking for the main file and the index file:
Main file:
1 CP TRT
...
14 C1 MPE
15 C2 MPE
...
20 CA1 MPE
Index file
20
24
22
17
18
...
I want to put those lines that contain one of the numbers in my index file into a new array. So using this example, only
20 CA1 MPE would be placed into a new array.
My main file and index file are both longer than what I've shown, but that hopefully gives you an idea on what I'm trying to do.
I am assuming something like this?
use strict;
use warnings;
use Data::Dumper;
# creating arrays instead of reading from file just for demo
# based on the assumption that your files are 1 number per line
# and no need for any particular parsing
my #array = qw/1 2 3 20 60 50 4 5 6 7/;
my #index = qw/10 12 5 3 2/;
my #intersection = ();
my %hash1 = map{$_ => 1} #array;
foreach (#index)
{
if (defined $hash1{$_})
{
push #intersection, $_;
}
}
print Dumper(\#intersection);
==== Out ====
$VAR1 = [
'5',
'3',
'2'
];
A few things:
Always have use strict; and use warnings; in your program. This will catch a lot of possible errors.
Always chomp after reading input. Perl automatically adds \n to the end of lines read. chomp removes the \n.
Learn a more modern form of Perl.
Use nemonic variable names. $temp doesn't cut it.
Use spaces to help make your code more readable.
You never stated the errors you were getting. I assume it has to do with the fact that the input from your main file doesn't match your index file.
I use a hash to create an index that the index file can use via my ($index) = split /\s+/, $line;:
#! /usr/bin/env perl
#
use strict;
use warnings;
use autodie;
use feature qw(say);
print "Input file name: ";
my $input_file = <STDIN>;
chomp $input_file; # Chomp Input!
print "Index file name: ";
my $index_file = <STDIN>;
chomp $index_file; # Chomp Input!
open my $input_fh, "<", $input_file;
my %hash;
while ( my $line = <$input_fh> ) {
chomp $line;
#
# Using split to find the item to index on
#
my ($index) = split /\s+/, $line;
$hash{$index} = $line;
}
close $input_fh;
open my $index_fh, "<", $index_file;
while ( my $index = <$index_fh> ) {
chomp $index;
#
# Now index can look up lines
#
if( exists $hash{$index} ) {
say qq(Index: $index Line: "$hash{$index}");
}
else {
say qq(Index "$index" doesn't exist in file.);
}
}
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
#ARGV = 'main_file';
open(my $fh_idx, '<', 'index_file');
chomp(my #idx = <$fh_idx>);
close($fh_idx);
while (defined(my $r = <>)) {
print $r if grep { $r =~ /^[ \t]*$_/ } #idx;
}
You may wish to replace those hardcoded file names for <STDIN>.
FYI: The defined call inside a while condition might be "optional".

Selecting records from a file based on keys from a second file

My first file looks like:
CHR id position
1 rs58108140 10583
1 rs189107123 10611
1 rs180734498 13302
1 rs144762171 13327
1 chr1:13957:D 13957
And my second file looks like:
CHR SNP POS RiskAl OTHER_ALLELE RAF logOR Pval
10 rs1999138 110140096 T C 0.449034245446375 0.0924443 1.09e-06
6 rs7741604 20839503 C A 0.138318264238111 0.127947 1.1e-06
8 rs1486006 82553172 G C 0.833130882716561 0.147456 1.12727730194884e-06
My script reads in the first file and stores it in an array, and then I would like to find rsIDs from column 2 of the first file that are in column 2 in the second file. I think I am having a problem with how I'm matching the expressions. Here is my script:
#! perl -w
use strict;
use warnings;
my $F = shift #ARGV;
my #snps;
open IN, "$F";
while (<IN>) {
next if m/CHR/;
my #L = split;
push #snps, [$L[0], $L[1], $L[2]] if $L[0] !~ m/[XY]/;
}
close IN;
open IN, "DIAGRAMv3sansWTCCCqc0clumpd_noTCF7L2regOrLeadOrPlt1em6clumps- CHR_SNP_POS_RiskAl_OtherAl_RAF_logOR_Pval.txt";
while (<IN>) {
my #L = split;
next if m/CHR/;
foreach (#snps) {
next if ($L[0] != ${$_}[0]);
# if not on same chromosome
if ($L[0] = ${$_}[0]) {
# if on same chromosome
if ($L[1] =~ ${$_}[1]) {
print "$L[0] $L[1] ${$_}[2]\n";
last;
}
}
}
}
Your code doesn't seem to correspond to your description. You are comparing both the first and second columns of the file rather than just the second.
The main problems are:
You use $L[0] = ${$_}[0] to compare the first columns. This will do an assigmment instead of a comparison. You should use $L[0] == ${$_}[0] instead or, better, $L[0] == $_->[0]
You use $L[1] =~ ${$_}[1] to compare the second columns. This will check whether ${$_}[1] is a substring of $L[1]. You could use anchors like $L[1] =~ /^${$_}[1]$/ but it's much better to just do a string comparison as $L[1] eq $_->[1]
The easiest way is to read the second file first so as to build a list of values that you want included from the first file. I have written it so that it does what your code looks like it's supposed to do, i.e. match the first two columns.
That would look like this
use strict;
use warnings;
use autodie;
my ($f1, $f2) = #_;
my %include;
open my $fh2, '<', $f2;
while (<$fh2>) {
my #fields = split;
my $key = join '|', #fields[0,1];
++$include{$key};
}
close $fh2;
open my $fh1, '<', $f1;
while (<$fh1>) {
my #fields = split;
my $key = join '|', #fields[0,1];
print "#fields[0,1,2]\n" if $include{$key};
}
close $fh1;
output
Unfortunately your choice of sample data doesn't include any records in the first file that have matching keys in the second, so there is no output!
Update
This is a corrected version of your own program. It should work, but it is far more efficient and concise to use hashes, as above
use strict;
use warnings;
use autodie;
my ($filename) = #ARGV;
my #snps;
open my $in_fh, '<', $filename;
<$in_fh>; # Discard header line
while (<$in_fh>) {
my #fields = split;
push #snps, \#fields unless $fields[0] =~ /[XY]/;
}
close $in_fh;
open $in_fh, '<', 'DIAGRAMv3sansWTCCCqc0clumpd_noTCF7L2regOrLeadOrPlt1em6clumps- CHR_SNP_POS_RiskAl_OtherAl_RAF_logOR_Pval.txt';
<$in_fh>; # Discard header line
while (<$in_fh>) {
my #fields = split;
for my $snp (#snps) {
next unless $fields[0] == $snp->[0] and $fields[1] eq $snp->[1];
print "$fields[0] $fields[1] $snp->[2]\n";
last;
}
}
close $in_fh;

When trying to print an array from sub only the first element prints

I'm writing a Perl script that requires me to pull out a whole column from a file and manipulate it. For example take out column A and compare it to another column in another file
A B C
A B C
A B C
So far I have:
sub routine1
{
( $_ = <FILE> )
{
next if $. < 2; # to skip header of file
my #array1 = split(/\t/, $_);
my $file1 = $array1[#_];
return $file1;
}
}
I have most of it done. The only problem is that when I call to print the subroutine it only prints the first element in the array (i.e. it will only print one A).
I am sure that what you actually have is this
sub routine1
{
while ( $_ = <FILE> )
{
next if $. < 2; # to skip header of file
my #array1 = split(/\t/, $_);
my $file1 = $array1[#_];
return $file1;
}
}
which does compile, and reads the file one line at a time in a loop.
There are two problems here. First of all, as soon as your loop has read the first line of the file (after the header) the return statement exits the subroutine, returning the only field it has read. That is why you get only a single value.
Secondly, you have indexed your #array1 with #_. What that does is take the number of elements in #_ (usually one) and use that to index #array1. You will therefore always get the second element of the array.
I'm not clear what you expect as a result, but you should write something like this. It accumulates all the values from the specified column into the array #retval, and passes the file handle into the subroutine instead of just using a global, which is poor programming practice.
use strict;
use warnings;
open my $fh, '<', 'myfile.txt' or die $!;
my #column2 = routine1($fh, 1);
print "#column2\n";
sub routine1 {
my ($fh, $index) = #_;
my #retval;
while ($_ = <$fh>) {
next if $. < 2; # to skip header of file
my #fields = split /\t/;
my $field = $fields[$index];
push #retval, $field;
}
return #retval;
}
output
B B
Try replacing most of your sub with something like this:
my #aColumn = ();
while (<FILE>)
{
chomp;
($Acol, $Bcol, $Ccol) = split("\t");
push(#aColumn, $Acol);
}
return #aColumn
Jumping to the end, the following will pull out the first column in your file blah.txt and put it in an array for you to manipulate later:
use strict;
use warnings;
use autodie;
my $file = 'blah.txt';
open my $fh, '<', $file;
my #firstcol;
while (<$fh>) {
chomp;
my #cols = split;
push #firstcol, $cols[0];
}
use Data::Dump;
dd \#firstcol;
What you have right now isn't actually looping on the contents of the file, so you aren't going to be building an array.
Here's are a few items for you to consider when crafting a subroutine solution for obtaining an array of column values from a file:
Skip the file header before entering the while loop to avoid a line-number comparison for each file line.
split only the number of columns you need by using split's LIMIT. This can significantly speed up the process.
Optionally, initialize a local copy of Perl's #ARGV with the file name, and let Perl handle the file i/o.
Borodin's solution to create a subroutine that takes both the file name column number is excellent, so it's implemented below, too:
use strict;
use warnings;
my #colVals = getFileCol( 'File.txt', 0 );
print "#colVals\n";
sub getFileCol {
local #ARGV = (shift);
my ( $col, #arr ) = shift;
<>; # skip file header
while (<>) {
my $val = ( split ' ', $_, $col + 2 )[$col] or next;
push #arr, $val;
}
return #arr;
}
Output on your dataset:
A A
Hope this helps!

Reading the next line in the file and keeping counts separate

Another question for everyone. To reiterate I am very new to the Perl process and I apologize in advance for making silly mistakes
I am trying to calculate the GC content of different lengths of DNA sequence. The file is in this format:
>gene 1
DNA sequence of specific gene
>gene 2
DNA sequence of specific gene
...etc...
This is a small piece of the file
>env
ATGCTTCTCATCTCAAACCCGCGCCACCTGGGGCACCCGATGAGTCCTGGGAA
I have established the counter and to read each line of DNA sequence but at the moment it is do a running summation of the total across all lines. I want it to read each sequence, print the content after the sequence read then move onto the next one. Having individual base counts for each line.
This is what I have so far.
#!/usr/bin/perl
#necessary code to open and read a new file and create a new one.
use strict;
my $infile = "Lab1_seq.fasta";
open INFILE, $infile or die "$infile: $!";
my $outfile = "Lab1_seq_output.txt";
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!";
#establishing the intial counts for each base
my $G = 0;
my $C = 0;
my $A = 0;
my $T = 0;
#initial loop created to read through each line
while ( my $line = <INFILE> ) {
chomp $line;
# reads file until the ">" character is encounterd and prints the line
if ($line =~ /^>/){
print OUTFILE "Gene: $line\n";
}
# otherwise count the content of the next line.
# my percent counts seem to be incorrect due to my Total length counts skewing the following line. I am currently unsure how to fix that
elsif ($line =~ /^[A-Z]/){
my #array = split //, $line;
my $array= (#array);
# reset the counts of each variable
$G = ();
$C = ();
$A = ();
$T = ();
foreach $array (#array){
#if statements asses which base is present and makes a running total of the bases.
if ($array eq 'G'){
++$G;
}
elsif ( $array eq 'C' ) {
++$C; }
elsif ( $array eq 'A' ) {
++$A; }
elsif ( $array eq 'T' ) {
++$T; }
}
# all is printed to the outfile
print OUTFILE "G:$G\n";
print OUTFILE "C:$C\n";
print OUTFILE "A:$A\n";
print OUTFILE "T:$T\n";
print OUTFILE "Total length:_", ($A+=$C+=$G+=$T), "_base pairs\n";
print OUTFILE "GC content is(percent):_", (($G+=$C)/($A+=$C+=$G+=$T)*100),"_%\n";
}
}
#close the outfile and the infile
close OUTFILE;
close INFILE;
Again I feel like I am on the right path, I am just missing some basic foundations. Any help would be greatly appreciated.
The final problem is in the final counts printed out. My percent values are wrong and give me the wrong value. I feel like the total is being calculated then that new value is incorporated into the total.
Several things:
1. use hash instead of declaring each element.
2. assignment such as $G = (0); is indeed working, but it is not the right way to assign scalar. What you did is declaring an array, which in scalar context $G = is returning the first array item. The correct way is $G = 0.
my %seen;
$seen{/^([A-Z])/}++ for (grep {/^\>/} <INFILE>);
foreach $gene (keys %seen) {
print "$gene: $seen{$gene}\n";
}
Just reset the counters when a new gene is found. Also, I'd use hashes for the counting:
use strict; use warnings;
my %counts;
while (<>) {
if (/^>/) {
# print counts for the prev gene if there are counts:
print_counts(\%counts) if keys %counts;
%counts = (); # reset the counts
print $_; # print the Fasta header
} else {
chomp;
$counts{$_}++ for split //;
}
}
print_counts(\%counts) if keys %counts; # print counts for last gene
sub print_counts {
my ($counts) = #_;
print "$_:=", ($counts->{$_} || 0), "\n" for qw/A C G T/;
}
Usage: $ perl count-bases.pl input.fasta.
Example output:
> gene 1
A:=3
C:=1
G:=5
T:=5
> gene 2
A:=1
C:=5
G:=0
T:=13
Style comments:
When opening a file, always use lexical filehandles (normal variables). Also, you should do a three-arg open. I'd also recommend the autodie pragma for automatic error handling (since perl v5.10.1).
use autodie;
open my $in, "<", $infile;
open my $out, ">", $outfile;
Note that I don't open files in my above script because I use the special ARGV filehandle for input, and print to STDOUT. The output can be redirected on the shell, like
$ perl count-bases.pl input.fasta >counts.txt
Declaring scalar variables with their values in parens like my $G = (0) is weird, but works fine. I think this is more confusing than helpful. → my $G = 0.
Your intendation is a bit weird. It is very unusual and visually confusing to put closing braces on the same line with another statement like
...
elsif ( $array eq 'C' ) {
++$C; }
I prefer cuddling elsif:
...
} elsif ($base eq 'C') {
$C++;
}
This statement my $array= (#array); puts the length of the array into $array. What for? Tip: You can declare variables right inside foreach-loops, like for my $base (#array) { ... }.