Perl: matching multiple strings in multiple files

I would like to use
myscript.pl targetfolder/* > result.csv
to grep some numbers from multiple ASCII files.
The data table looks like this:
| 44.2 | 3123.7 | 3123 |
+--------+--------+--------+
(the +--------+ rule marks the end of the data table)
myscript.pl
#!/usr/bin/env perl
use warnings;
use strict;
use Data::Dumper; # for debugging
$Data::Dumper::Useqq=1;
#####start######
Title1();
Title2();
print "\n";
#####Grep#######
foreach my $currentfile (@ARGV) { # ARGV is the target files list
print Dumper($currentfile); # debug
open my $filehanlder, '<', $currentfile or die "$currentfile: $!";
while ($r <= $#fswf) { #judge end of the open file
Value1();
Value2();
Print1();
Print2();
print "\n";
$r++;
} #go next line output
close $filehanlder;
}
#####sub#######
sub Title1 {
print "title1,title2";
}
sub Title2{
print "title5,title6,title7,title8";
}
sub Value1 {
my ($line);
while ($line = <$filehanlder>) {
if ($line =~ /^\|\sMachine\:(\S+)\s+Release\:(\S+)\s+/) {
our ($machine) = $1;our ($sw) = $2;
}
}
}
sub Value2 {
my ($line);
while ($line = <$filehanlder>) {
if ($line =~ /^\|\sProduction\sResult\s+\|\s(\S+)\s+\|/) {
next if 1..4;
my (@b) = "";
$r = 1;
@result1 = @result2 = @result3 = @result4 = "";
while ($line !~ /\+\-/) {
chomp $line;
@b = split / *\| */, $line;
our ($result1[$r]) = $b[1];
our ($result2[$r]) = $b[2];
our ($result3[$r]) = $b[3];
our ($result4[$r]) = $b[4];
$r++;
$line = (<$filehanlder>);
@b = "";
}
}
}
}
## I need a value as a file counter, but I'm not sure where to put it.
sub Print1 {
print "$machine,$sw,"; # this keeps same cross lines from same file
}
sub Print2 {
print "$result1[$r],$result2[$r],$result3[$r],$result4[$r],"; # change every line
}
#####sub#######
I don't know if this is the correct way to pass $filehanlder to a subroutine, or how to share it between different subroutines.
@Dave Cross: Thanks for pointing that out. It is exactly as you said: if I loop in a subroutine, one subroutine reads to the end of the file and the other gets nothing. So should I do the while loop in the main program? Or should I open the file in every subroutine, so the filehandle is reset to the first line each time? Also, if I grep multiple @result arrays in sub Value2, how can I print them using the maximum line count among them? For example, if @result5 has 7 entries and @result6 has 12, I would like to print 12 records: the first 7 lines with the @result5 values, and for the last 5 lines the result5 column stays empty while the result6 column continues to print.

Your filehandle is just stored in a scalar variable ($filehanlder) so it can be passed into a subroutine in exactly the same way as any other variable.
some_subroutine($filehanlder);
And, inside the subroutine:
sub some_subroutine {
my ($fh) = @_;
# do something with $fh
}
But I think you have more serious problems to worry about. You have two subroutines that have a while (<$filehanlder>) loop in them. The first of those to be called will go to the end of the file, leaving the second with no data to process.
You probably want to rethink the design of this code.
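To illustrate, here is a minimal single-pass sketch (the patterns, column positions, and CSV layout are assumptions based on the question's sample data, not a definitive implementation). Each line is classified as it is read, so no subroutine races another to the end of the handle, and the filehandle is passed around as an ordinary scalar:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# One pass over the file: classify each line as it arrives.
# The regex patterns and column positions are illustrative only.
sub scan_file {
    my ($fh) = @_;                      # the handle arrives like any scalar
    my ($machine, $sw) = ('', '');
    my @rows;
    while (my $line = <$fh>) {
        if ($line =~ /^\|\s*Machine:(\S+)\s+Release:(\S+)/) {
            ($machine, $sw) = ($1, $2);
        }
        elsif ($line =~ /^\|/) {        # a data row; +---+ rules don't match
            chomp $line;
            my @cells = split /\s*\|\s*/, $line;
            push @rows, join ',', $machine, $sw, @cells[1 .. 3];
        }
    }
    return @rows;
}

for my $file (@ARGV) {
    open my $fh, '<', $file or die "$file: $!";
    print "$_\n" for scan_file($fh);
    close $fh;
}
```

Header rows other than the Machine/Release one would need their own branch; the point is that exactly one while loop owns the filehandle.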

Related

Extract and filter a range of lines from the input using Perl

I'm quite new to Perl and I have some problems in skipping lines using a foreach loop. I want to copy some lines of a text file to a new one.
When the first words of a line are FIRST ITERATION, skip two more lines and print everything following until the end of the file or an empty line is encountered.
I've tried to find a similar post, but nobody talks about working with text files.
This is the form I thought of
use 5.010;
use strict;
use warnings;
open( INPUT, "xxx.txt" ) or die("Could not open log file.");
open( OUT, ">>yyy.txt" );
foreach my $line (<INPUT>) {
if ( $line =~ m/^FIRST ITERATION/ ) {
# print OUT
}
}
close(OUT);
close(INPUT);
I tried using next and $line++ but my program prints only the line that begins with FIRST ITERATION.
I may try to use a for loop but I don't know how many lines my file may have, nor do I know how many lines there are between "First Iteration" and the next empty line.
The simplest way is to process the file a line at a time and keep a state flag which is set to 1 if the current line begins with FIRST ITERATION, and to 0 if the line is blank; otherwise it is incremented if it is already positive, so that it provides a count of the line number within the current block.
This solution expects the path to the input file as a parameter on the command line and prints its output to STDOUT, so you will need to redirect the output to a file on the command line as necessary.
Note that the regex pattern /\S/ checks whether there is a non-blank character anywhere in the current line, so not /\S/ is true if the line is empty or contains only blank characters.
use strict;
use warnings;
my $lines = 0;
while ( <> ) {
if ( /^FIRST ITERATION/ ) {
$lines = 1;
}
elsif ( not /\S/ ) {
$lines = 0;
}
elsif ( $lines > 0 ) {
++$lines;
}
print if $lines > 3;
}
This can be simplified substantially by using Perl's built-in range operator, which keeps its own internal state and returns the number of times it has been evaluated. So the above may be written
use strict;
use warnings;
while ( <> ) {
my $s = /^FIRST ITERATION/ ... not /\S/;
print if $s and $s > 3;
}
And the last can be rewritten as a one-line command line program like this
$ perl -ne '$s = /^FIRST ITERATION/ ... not /\S/; print if $s and $s > 3' myfile.txt
Use an additional counter that says on which condition to print a line. Something like this:
$skipCounter = 3;
And in foreach:
if ($skipCounter == 2) {
# print OUT
}
if ( $line =~ m/^FIRST ITERATION/) {
$skipCounter = 0;
}
$skipCounter++;
Advice: use STDIN and STDOUT instead of files; this allows you to change them without modifying the script.
Code:
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
open(INPUT, "xxx.txt" ) or die "Could not open log file: $!.";
open(OUT, ">yyy.txt") or die "Could not open output file: $!";
while( my $line = <INPUT> )
{
if ( $line =~ m/^FIRST ITERATION/) {
<INPUT>; # skip line
<INPUT>; # skip line
while( $line = <INPUT>) # print till empty line
{
last if $line eq "\n";
print OUT $line;
}
};
};
close (OUT);
close (INPUT);
You're on the right track. What you need to use is the flip-flop operator (which is basically the range operator) ... It will toggle for you between two matches, so you get everything in between. After that, it's a matter of keeping track of the lines you want to skip.
So basically we are checking for FIRST ITERATION and for an empty line, and grab everything in between those. $skip is used to remember how many lines were skipped. It starts at 0 and gets incremented for the first two lines after we start being in the flip-flop if block. In the else case, where we are after the flip-flop, it gets reset to 0 so we can start over with the next block.
Since you know how to open and write files, I'll skip that.
use strict;
use warnings;
my $skip = 0;
while (<DATA>) {
if (/^FIRST ITERATION$/ .. /^$/) {
next if $skip++ <= 2;
print $_;
} else {
$skip = 0;
}
}
__DATA__
FIRST ITERATION
skip1
skip2
foo
bar
baz
don't print this
The output of this is:
foo
bar
baz
To stick with your own code, here's a very verbose solution that uses a foreach and no flip-flop. It does the same thing, just with a lot more words.
my $skip = 0; # skip lines
my $match = 0; # keep track of if we're in between the borders
foreach my $line (<DATA>) {
if ( $line =~ m/^FIRST ITERATION/ ) {
$match = 1; # we are inside the match
next;
}
if ($line =~ m/^$/) {
$match = 0; # we are done matching
next;
}
if ($match) {
$skip++; # count skip-lines
if ($skip <= 2) {
next; # ... and skip the first two
}
print $line; # this is the content we want
}
}
Using paragraph mode (which returns blocks separated by blank lines rather than lines):
local $/ = ""; # Paragraph mode.
while (<>) {
s/\n\K\n+//; # Get rid of trailing empty lines.
print /^FIRST ITERATION\n[^\n]*\n[^\n]*\n(.*)/ms;
}
Using the flip-flop operator:
while (<>) {
if (my $line_num = /^FIRST ITERATION$/ .. /^$/) {
print if $line_num > 3 && $line_num !~ /E0/;
}
}
$line_num !~ /E0/ is true when the flip-flop is flopping (i.e. for the first empty line after FIRST ITERATION). This is checked to avoid printing the blank line.
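A small self-contained demonstration (the input lines are made up here) shows what the flip-flop returns line by line: a plain sequence number while flipped, with "E0" appended on the line that makes it flop:

```perl
use strict;
use warnings;

my @seen;
for ("FIRST ITERATION\n", "skip1\n", "skip2\n", "data\n", "\n") {
    if (my $n = /^FIRST ITERATION$/ .. /^$/) {
        push @seen, $n;   # collect the flip-flop's return values
    }
}
# @seen is now (1, 2, 3, 4, "5E0"): only the final, blank line carries
# the E0 tag, which is exactly what $line_num !~ /E0/ filters out.
```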

Append a new column to a file in Perl

I've got the follow function inside a perl script:
sub fileSize {
my $file = shift;
my $opt = shift;
open (FILE, $file) or die "Could not open file $file: $!";
$/ = ">";
my $junk = <FILE>;
my $g_size = 0;
while ( my $rec = <FILE> ) {
chomp $rec;
my ($name, @seqLines) = split /\n/, $rec;
my $sec = join('', @seqLines);
$g_size+=length($sec);
if ( $opt == 1 ) {
open TMP, ">>", "tmp" or die "Could not open chr_sizes.log: $!\n";
print TMP "$name\t", length($sec), "\n";
}
}
if ( $opt == 0 ) {
PrintLog( "file_size: $g_size", 0 );
}
else {
print TMP "file_size: $g_size\n";
close TMP;
}
$/ = "\n";
close FILE;
}
Input file format:
>one
AAAAA
>two
BBB
>three
C
I have several input files with that format. The line beginning with ">" is the same but the other lines can be of different length. The output of the function with only one file is:
one 5
two 3
three 1
I want to execute the function in a loop with this for each file:
foreach my $file ( @refs ) {
fileSize( $file, 1 );
}
When running the next iteration, let's say with this file:
>one
AAAAABB
>two
BBBVFVF
>three
CS
I'd like to obtain this output:
one 5 7
two 3 7
three 1 2
How can I modify the function or the script to get this? As can be seen, my function appends the text to the file.
Thanks!
I've left out your options and the file IO operations and have concentrated on showing a way to do this with an array of arrays from the command line. I hope it helps. I'll leave wiring it up to your own script and subroutines mostly up to you :-)
Running this one liner against your first data file:
perl -lne ' $name = s/>//r if /^>/ ;
push @strings , [$name, length $_] if !/^>/ ;
END { print "@{$_} " for @strings }' datafile1.txt
gives this output:
one 5
two 3
three 1
Substituting the second version or instance of the data file (i.e where record one contains AAAAABB) gives the expected results as well.
one 7
two 7
three 2
In your script above, you save to an output file in this format. So, to append columns to each row in your output file, we can just munge each of your data files in the same way (with any luck this might mean things can be converted into a function that will work in a foreach loop). If we save the transformed data to be output into an array of arrays (AoA), then we can just push the length values we get for each data file string onto the corresponding anonymous array element and then print out the array. Voilà! Now let's hope it works ;-)
You might want to install Data::Printer which can be used from the command line as -MDDP to visualize data structures.
First - run the above script and redirect the output to a file with > /tmp/output.txt
Next - try this longish one-liner that uses DDP and p to show the structure of the array we create:
perl -MDDP -lne 'BEGIN{ local @ARGV=shift;
@tmp = map { [split] } <>; p @tmp }
$name = s/>//r if /^>/ ;
push @out , [ $name, length $_ ] if !/^>/ ;
END{ p @out ; }' /tmp/output.txt datafile2.txt
In the BEGIN block we local-ize @ARGV ; shift off the first file (our version of your TMP file) - {local @ARGV=shift} is almost a perl idiom for handling multiple input files; we then split it inside an anonymous array constructor ([]) and map { } that into the @tmp array which we display with DDP's p() function. Once we are out of the BEGIN block, the implicit while (<>){ ... } that we get with perl's -n command line switch takes over and reads in the remaining file from @ARGV ; we process lines starting with > - stripping the leading character and assigning the string that follows to the $name variable; the while continues and we push $name and the length of any line that does not start with > (if !/^>/) wrapped as elements of an anonymous array [] into the @out array which we display with p() as well (in the END{} block so it doesn't print inside our implicit while() loop). Phew!!
See the AoA that results as a gist @Github.
Finally - building on that, and now we have munged things nicely - we can change a few things in our END{...} block (add a nested for loop to push things around) and put this all together to produce the output we want.
This one liner:
perl -MDDP -lne 'BEGIN{ local @ARGV=shift; @tmp = map {[split]} <>; }
$name = s/>//r if /^>/ ; push @out, [ $name, length $_ ] if !/^>/ ;
END{ foreach $row (0..$#tmp) { push $tmp[$row] , $out[$row][-1]} ;
print "@$_" for @tmp }' output.txt datafile2.txt
produces:
one 5 7
two 3 7
three 1 2
We'll have to convert that into a script :-)
The script consists of three rather wordy subroutines that read the log file, parse the data file, and merge them. We run them in order. The first one checks to see if there is an existing log, creates one if not, and then does an exit to skip any further parsing/merging steps.
You should be able to wrap them in a loop of some kind that feeds files to the subroutines from an array instead of fetching them from STDIN. One caution - I'm using IO::All because it's fun and easy!
use 5.14.0 ;
use IO::All;
my @file = io(shift)->slurp ;
my $log = "output.txt" ;
&readlog;
&parsedatafile;
&mergetolog;
####### subs #######
sub readlog {
if (! -R $log) {
print "creating first log entry\n";
my @newlog = &parsedatafile ;
open(my $fh, '>', $log) or die "I CAN HAZ WHA????" ;
print $fh "@$_ \n" for @newlog ;
exit;
}
else {
map { [split] } io($log)->slurp ;
}
}
sub parsedatafile {
my (@out, $name) ;
for (@file) {
chomp ;
$name = s/>//r if /^>/;
push @out, [$name, length $_] if !/^>/ ;
}
@out;
}
sub mergetolog {
my @tmp = readlog ;
my @data = parsedatafile ;
foreach my $row (0 .. $#tmp) {
push $tmp[$row], $data[$row][-1]
}
open(my $fh, '>', $log) or die "Foobar!!!" ;
print $fh "@$_ \n" for @tmp ;
}
The subroutines do all the work here - you can likely find ways to shorten; combine; improve them. Is this a useful approach for you?
I hope this explanation is clear and useful to someone - corrections and comments welcome. Probably the same thing could be done with place editing (i.e with perl -pie '...') which is left as an exercise to those that follow ...
You need to open the output file itself. First in read mode, then in write mode.
I have written a script that does what you are asking. What really matters is the part that appends new data to old data. Adapt that to your fileSize function.
So you have the output file, output.txt
Of the form,
one 5
two 3
three 1
And an array of input files, input1.txt, input2.txt, etc, saved in the @inputfiles variable.
Of the form,
>one
AAAAA
>two
BBB
>three
C
>four
DAS
and
>one
AAAAABB
>two
BBBVFVF
>three
CS
Respectively.
After running the following perl script,
# First read previous output file.
open OUT, '<', "output.txt" or die $!;
my @outlines;
while (my $line = <OUT> ) {
chomp $line;
push @outlines, $line;
}
close OUT;
my $outsize = scalar @outlines;
# Suppose you have your array of input file names already prepared
my @inputfiles = ("input1.txt", "input2.txt");
foreach my $file (@inputfiles) {
open IN, '<', $file or die $!;
my $counter = 1; # Used to compare against output size
while (my $line = <IN>) {
chomp $line;
$line =~ m/^>(.*)$/;
my $name = $1;
my $sequence = <IN>;
chomp $sequence;
my $seqsize = length($sequence);
# Here is where I append a column to output data.
if($counter <= $outsize) {
$outlines[$counter - 1] .= " $seqsize";
} else {
$outlines[$counter - 1] = "$name $seqsize";
}
$counter++;
}
close IN;
}
# Now rewrite the results to output.txt
open OUT, '>', "output.txt" or die $!;
foreach (@outlines) {
print OUT "$_\n";
}
close OUT;
You generate the output,
one 5 5 7
two 3 3 7
three 1 1 2
four 3

Perl Text-Parsing; Which algorithm is correct?

I am writing a Perl script that takes two files as input: the first is a tab-separated table with an identifier of interest in the second column; the second is a list of identifiers that match the second column of the first file.
THE GOAL is to print only those lines of the table which contain an identifier in the second column, and to print each line only once. I have written three versions of this program and have been finding different numbers of lines printed by each.
Version 1:
# TAB-SEPARATED TABLE FILE
open (FILE, $file);
while (<FILE>) {
my $line = $_;
chomp $line;
# ARRAY CONTAINING EACH IDENTIFIER AS A SEPARATE ELEMENT
foreach(@refs) {
my $ref = $_;
chomp $ref;
if ( $line =~ $ref) { print "$line\n"; next; }
}
}
Version 2:
# ARRAY CONTAINING EVERY LINE OF THE TAB-SEPARATED TABLE AS A SEPARATE LINE
foreach(@doc) {
my $full = $_;
# IF LOOP FOR PRINTING THE HEADER BUT NOT COMPARING IT TO ARRAY BELOW
if ( $counter == 0 ) {
print "$full\n";
$counter++;
next; }
# EXTRACT IDENTIFIER FROM LINE
my @cells = split('\t', $full);
my $gene = $cells[1];
foreach(@refs) {
my $text = $_;
if ( $gene =~ $text && $counter == 1 ) { # COMPARE IDENTIFIER
print "$full\n";
next;
}
}
$counter--;
}
Version 3:
# LIST OF IDENTIFIERS
foreach(@refs) {
my $ref = $_;
# LIST OF EACH ROW OF THE TABLE
foreach(@doc) {
my $line = $_;
my @cells = split('\t', $line);
my $gene = $cells[1];
if ( $gene =~ $ref ) { print "$line\n"; next; }
}
}
Each of these approaches gives me different output and I do not understand why. I also do not understand if I can trust any of them to give me the right output. The right output should not contain any duplicate lines but more than one row might match any identifier from the list.
Sample Input File:
Position Symbol Name REF ALT
chr1:887801 NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) A G
chr1:888639 NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) T C
chr1:888659 NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) T C
chr1:897325 KLHL17 kelch-like 17 (Drosophila) G C
chr1:909238 PLEKHN1 pleckstrin homology domain containing, family N member 1 G C
chr1:982994 AGRN agrin T C
chr1:1254841 CPSF3L cleavage and polyadenylation specific factor 3-like C G
chr1:3301721 PRDM16 PR domain containing 16 C T
chr1:3328358 PRDM16 PR domain containing 16 T C
List is pulled from a file that looks like this:
A1BG
A2M
A2ML1
AAK1
ABCA12
ABCA13
ABCA2
ABCA4
ABCC2
It's put into an array using this code:
open (REF, $ref_file);
while (<REF>) {
my $line = $_;
chomp $line;
push(@refs, $line);
}
close REF;
Whenever you hear "I need to look up something", think hashes.
What you can do is create a hash that contains the elements you want to pull out of file #1. Then, use a second hash to track whether or not you printed it before:
#!/usr/bin/env perl
use warnings;
use strict;
use feature qw(say);
use autodie; # This way, I don't have to check my open for failures
use constant {
TABLE_FILE => "file1.txt",
LOOKUP_FILE => "file2.txt",
};
open my $lookup_fh, "<", LOOKUP_FILE;
my %lookup_table;
while ( my $symbol = <$lookup_fh> ) {
chomp $symbol;
$lookup_table{$symbol} = 1;
}
close $lookup_fh;
open my $table_file, "<", TABLE_FILE;
my %is_printed;
while ( my $line = <$table_file> ) {
chomp $line;
my @line_array = split /\s+/, $line;
my $symbol = $line_array[1];
if ( exists $lookup_table{$symbol} and not exists $is_printed{$symbol} ) {
say $line;
$is_printed{$symbol} = 1;
}
}
Two loops, but much more efficient. In yours, if you had 100 items in the first file and 1,000 items in the second file, you would have to loop 100 * 1,000 = 100,000 times. In this, you only loop the total number of lines in both files.
I use the three-parameter method of the open command which allows you to handle files with names that start with | or <, etc. Also, I use variables for my file handles which make it easier to pass the file handle to a subroutine if so desired.
I use use autodie; which handles issues such as what if my file doesn't open. In your program, the program would continue on its merry way. If you don't want to use autodie, you need to do this:
open $fh, "<", $my_file or die qq(Couldn't open "$my_file" for reading);
I use two hashes. The first is %lookup_table which stores the Symbols you want to print. When I go through the first file, I can simply check if $lookup_table{$symbol} exists. If it doesn't, I don't print the line; if it does, I print it.
The second hash %is_printed keeps track of Symbols I've already printed. If $is_printed{$symbol} exists, I know I've already printed that line.
Even though you said the second table is tab separated, I use /\s+/ as the split regular expression. This will catch a tab, but it will also catch if someone used two tabs (to keep things looking nice) or accidentally typed a space before that tab.
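As a quick made-up illustration of that robustness (the stray space before the tab is hypothetical):

```perl
use strict;
use warnings;

# A hypothetical row where someone typed a space before the tab separator.
my $line   = "chr1:887801 \tNOC2L\tnucleolar complex";
my @by_tab = split /\t/,  $line;   # first field keeps the stray space
my @by_ws  = split /\s+/, $line;   # any whitespace run is one separator
# $by_tab[0] is "chr1:887801 " (trailing space), but $by_ws[1] is "NOC2L"
```

Note that /\s+/ also splits inside values that themselves contain spaces (like the Name column here), which is fine as long as the symbol you need comes before them.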
I'm pretty sure this should work:
$ awk '
NR == FNR {Identifiers[$1]; next}
$2 in Identifiers {
$1 = ""; $0 = $0; if(!Printed[$0]++) {print}
}' identifiers_file.txt data_file.txt
Given identifiers_file.txt such as this (to which I added NOC2L since there were no matching identifiers in your sample):
A1BG
A2M
A2ML1
AAK1
ABCA12
ABCA13
ABCA2
ABCA4
ABCC2
NOC2L
then your output will be:
$ awk '
NR == FNR {Identifiers[$1]; next}
$2 in Identifiers {
$1 = ""; $0 = $0; if(!Printed[$0]++) {print}
}' idents.txt data.txt
NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) A G
NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) T C
If that's correct and you want a Perl version, you can just:
$ echo 'NR == FNR {Identifiers[$1]; next} $2 in Identifiers { $1 = ""; $0 = $0; if(!Printed[$0]++) {print} }' \
| a2p
I suggest you mix the first version and the second, and add hashes to them.
The first version is good because it clearly parses your data file line by line.
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
open (REF, $ARGV[0]);
my %refs;
while (<REF>) {
my $line = $_;
chomp $line;
$refs{$line} = 0;
}
close REF;
#for head printing
$refs{'Symbol'} = 0;
open (FILE, $ARGV[1]);
while (<FILE>) {
my $line = $_;
my #cells = split('\t', $line);
my $gene = $cells[1];
#print $line, "\n" if exists $refs{$gene};
if(exists $refs{$gene} and $refs{$gene} == 0)
{
$refs{$gene}++;
print $line;
}
}
close FILE;

When trying to print an array from sub only the first element prints

I'm writing a Perl script that requires me to pull out a whole column from a file and manipulate it. For example take out column A and compare it to another column in another file
A B C
A B C
A B C
So far I have:
sub routine1
{
( $_ = <FILE> )
{
next if $. < 2; # to skip header of file
my @array1 = split(/\t/, $_);
my $file1 = $array1[@_];
return $file1;
}
}
I have most of it done. The only problem is that when I call to print the subroutine it only prints the first element in the array (i.e. it will only print one A).
I am sure that what you actually have is this
sub routine1
{
while ( $_ = <FILE> )
{
next if $. < 2; # to skip header of file
my @array1 = split(/\t/, $_);
my $file1 = $array1[@_];
return $file1;
}
}
which does compile, and reads the file one line at a time in a loop.
There are two problems here. First of all, as soon as your loop has read the first line of the file (after the header) the return statement exits the subroutine, returning the only field it has read. That is why you get only a single value.
Secondly, you have indexed your @array1 with @_. What that does is take the number of elements in @_ (usually one) and use that to index @array1. You will therefore always get the second element of the array.
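A tiny example makes the trap concrete (the names here are hypothetical): @_ used as an array index is evaluated in scalar context, so it yields the argument count rather than an argument value.

```perl
use strict;
use warnings;

my @fields = qw(A B C);

sub pick_wrong { return $fields[@_] }                # index = scalar(@_)
sub pick_right { my ($i) = @_; return $fields[$i] }  # unpack the argument

my $wrong = pick_wrong(0);   # one argument, so the index is 1 -> "B"
my $right = pick_right(0);   # index 0 -> "A"
```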
I'm not clear what you expect as a result, but you should write something like this. It accumulates all the values from the specified column into the array @retval, and passes the file handle into the subroutine instead of just using a global, which is poor programming practice.
use strict;
use warnings;
open my $fh, '<', 'myfile.txt' or die $!;
my @column2 = routine1($fh, 1);
print "@column2\n";
sub routine1 {
my ($fh, $index) = @_;
my @retval;
while ($_ = <$fh>) {
next if $. < 2; # to skip header of file
my @fields = split /\t/;
my $field = $fields[$index];
push @retval, $field;
}
return @retval;
}
output
B B
Try replacing most of your sub with something like this:
my @aColumn = ();
while (<FILE>)
{
chomp;
($Acol, $Bcol, $Ccol) = split("\t");
push(@aColumn, $Acol);
}
return @aColumn
Jumping to the end, the following will pull out the first column in your file blah.txt and put it in an array for you to manipulate later:
use strict;
use warnings;
use autodie;
my $file = 'blah.txt';
open my $fh, '<', $file;
my @firstcol;
while (<$fh>) {
chomp;
my @cols = split;
push @firstcol, $cols[0];
}
use Data::Dump;
dd \@firstcol;
What you have right now isn't actually looping on the contents of the file, so you aren't going to be building an array.
Here are a few items for you to consider when crafting a subroutine solution for obtaining an array of column values from a file:
Skip the file header before entering the while loop to avoid a line-number comparison for each file line.
split only the number of columns you need by using split's LIMIT. This can significantly speed up the process.
Optionally, initialize a local copy of Perl's @ARGV with the file name, and let Perl handle the file i/o.
Borodin's solution to create a subroutine that takes both the file name and the column number is excellent, so it's implemented below, too:
use strict;
use warnings;
my @colVals = getFileCol( 'File.txt', 0 );
print "@colVals\n";
sub getFileCol {
local @ARGV = (shift);
my ( $col, @arr ) = shift;
<>; # skip file header
while (<>) {
my $val = ( split ' ', $_, $col + 2 )[$col] or next;
push @arr, $val;
}
return @arr;
}
Output on your dataset:
A A
Hope this helps!

Reading the next line in the file and keeping counts separate

Another question for everyone. To reiterate, I am very new to Perl and I apologize in advance for making silly mistakes.
I am trying to calculate the GC content of different lengths of DNA sequence. The file is in this format:
>gene 1
DNA sequence of specific gene
>gene 2
DNA sequence of specific gene
...etc...
This is a small piece of the file
>env
ATGCTTCTCATCTCAAACCCGCGCCACCTGGGGCACCCGATGAGTCCTGGGAA
I have established the counters and read each line of DNA sequence, but at the moment it does a running summation of the total across all lines. I want it to read each sequence, print the content after the sequence is read, then move on to the next one, keeping individual base counts for each line.
This is what I have so far.
#!/usr/bin/perl
#necessary code to open and read a new file and create a new one.
use strict;
my $infile = "Lab1_seq.fasta";
open INFILE, $infile or die "$infile: $!";
my $outfile = "Lab1_seq_output.txt";
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!";
#establishing the intial counts for each base
my $G = 0;
my $C = 0;
my $A = 0;
my $T = 0;
#initial loop created to read through each line
while ( my $line = <INFILE> ) {
chomp $line;
# reads file until the ">" character is encounterd and prints the line
if ($line =~ /^>/){
print OUTFILE "Gene: $line\n";
}
# otherwise count the content of the next line.
# my percent counts seem to be incorrect due to my Total length counts skewing the following line. I am currently unsure how to fix that
elsif ($line =~ /^[A-Z]/){
my @array = split //, $line;
my $array = (@array);
# reset the counts of each variable
$G = ();
$C = ();
$A = ();
$T = ();
foreach $array (@array){
#if statements asses which base is present and makes a running total of the bases.
if ($array eq 'G'){
++$G;
}
elsif ( $array eq 'C' ) {
++$C; }
elsif ( $array eq 'A' ) {
++$A; }
elsif ( $array eq 'T' ) {
++$T; }
}
# all is printed to the outfile
print OUTFILE "G:$G\n";
print OUTFILE "C:$C\n";
print OUTFILE "A:$A\n";
print OUTFILE "T:$T\n";
print OUTFILE "Total length:_", ($A+=$C+=$G+=$T), "_base pairs\n";
print OUTFILE "GC content is(percent):_", (($G+=$C)/($A+=$C+=$G+=$T)*100),"_%\n";
}
}
#close the outfile and the infile
close OUTFILE;
close INFILE;
Again I feel like I am on the right path, I am just missing some basic foundations. Any help would be greatly appreciated.
The final problem is in the final counts printed out. My percent values are wrong and give me the wrong value. I feel like the total is being calculated then that new value is incorporated into the total.
Several things:
1. Use a hash instead of declaring each counter separately.
2. An assignment such as $G = (0); does work, but it is not the right way to assign a scalar. The parentheses build a list, and in scalar context the assignment yields its last item. The correct way is $G = 0.
my %seen;
$seen{/^([A-Z])/}++ for (grep {/^\>/} <INFILE>);
foreach $gene (keys %seen) {
print "$gene: $seen{$gene}\n";
}
Just reset the counters when a new gene is found. Also, I'd use hashes for the counting:
use strict; use warnings;
my %counts;
while (<>) {
if (/^>/) {
# print counts for the prev gene if there are counts:
print_counts(\%counts) if keys %counts;
%counts = (); # reset the counts
print $_; # print the Fasta header
} else {
chomp;
$counts{$_}++ for split //;
}
}
print_counts(\%counts) if keys %counts; # print counts for last gene
sub print_counts {
my ($counts) = @_;
print "$_:=", ($counts->{$_} || 0), "\n" for qw/A C G T/;
}
Usage: $ perl count-bases.pl input.fasta.
Example output:
> gene 1
A:=3
C:=1
G:=5
T:=5
> gene 2
A:=1
C:=5
G:=0
T:=13
Style comments:
When opening a file, always use lexical filehandles (normal variables). Also, you should do a three-arg open. I'd also recommend the autodie pragma for automatic error handling (since perl v5.10.1).
use autodie;
open my $in, "<", $infile;
open my $out, ">", $outfile;
Note that I don't open files in my above script because I use the special ARGV filehandle for input, and print to STDOUT. The output can be redirected on the shell, like
$ perl count-bases.pl input.fasta >counts.txt
Declaring scalar variables with their values in parens like my $G = (0) is weird, but works fine. I think this is more confusing than helpful. → my $G = 0.
Your indentation is a bit weird. It is very unusual and visually confusing to put closing braces on the same line as another statement, like
...
elsif ( $array eq 'C' ) {
++$C; }
I prefer cuddling elsif:
...
} elsif ($base eq 'C') {
$C++;
}
This statement my $array = (@array); puts the length of the array into $array. What for? Tip: You can declare variables right inside foreach-loops, like for my $base (@array) { ... }.