I'm new to Perl and I would like to read a table and sum some values from specific lines. This is a simplified example of my input file:
INPUT :
Gene Size Feature
GeneA 1200 Intron 1
GeneB 100 Intron 1
GeneB 200 Intron 1
GeneB 150 Intron 2
GeneC 300 Intron 5
OUTPUT :
GeneA 1200 Intron 1
GeneB 300 Intron 1 <-- the size values are summed
GeneB 150 Intron 2
GeneC 300 Intron 5
Because Gene B is present for intron 1 with two different sizes, I would like to sum these two values and print only one line per intron number.
This is the code I have tried so far. I would like to make it more complicated later, once I understand how to handle this kind of data.
#!/usr/bin/perl
use strict;
use warnings;
my $sum;
my @GAP_list;
my $prevline = 'na';
open INFILE,"Table.csv";
while (my $ligne = <INFILE>)
{
    chomp ($ligne);
    my @list = split /\t/, $ligne;
    my $gene= $list[0];
    my $GAP_size= $list[2];
    my $intron= $list[3];
    my $intron_number=$list[4];
    if($prevline eq 'na'){
        push @GAP_list, $GAP_size;
    }
    elsif($prevline ne 'na') {
        my @list_p = split /\t/,$prevline;
        my $gene_p= $list_p[0];
        my $GAP_size_p= $list_p[2];
        my $intron_p= $list_p[3];
        my $intron_number_p=$list_p[4];
        if (($gene eq $gene_p) && ($intron eq $intron_p) && ($intron_number eq $intron_number_p)){
            push @GAP_list, $GAP_size;
        }
    }
    else{
        $sum = doSum(@GAP_list);
        print "$gene\tGAP\t$GAP_size\t$intron\t$intron_number\t$sum\n";
        $prevline=$ligne;
    }
}
# Subroutine
sub doSum {
    my $sum = 0;
    foreach my $x (@_) {
        $sum += $x;
    }
    return $sum;
}
Assuming the fields are separated by tabs, the following strategy would work. It buffers the last line, either adding up the sizes if the other fields are equal, or printing the old data and then replacing the buffer with the current line.
After the whole input has been processed, we must not forget to print out the contents that are still in the buffer.
my $first_line = do { my $l = <>; chomp $l; $l };
my ($last_gene, $last_tow, $last_intron) = split /\t/, $first_line;
while(<>) {
chomp;
my ($gene, $tow, $intron) = split /\t/;
if ($gene eq $last_gene and $intron eq $last_intron) {
$last_tow += $tow;
} else {
print join("\t", $last_gene, $last_tow, $last_intron), "\n";
($last_gene, $last_tow, $last_intron) = ($gene, $tow, $intron);
}
}
print join("\t", $last_gene, $last_tow, $last_intron), "\n";
This works fine as long as genes that may be folded together are always consecutive. If the joinable records are spread all over the file, we have to keep a data structure of all records. After the whole file is parsed, we can emit nicely sorted sums.
We will use a multilevel hash that uses the gene as first level key, and the intron as 2nd level key. The value is the count/tow/whatever:
my %records;
# parse the file
while (<>) {
chomp;
my ($gene, $tow, $intron) = split /\t/;
$records{$gene}{$intron} += $tow;
}
# emit the data:
for my $gene (sort keys %records) {
for my $intron (sort keys %{ $records{$gene} }) {
print join("\t", $gene, $records{$gene}{$intron}, $intron), "\n";
}
}
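To try the hash-based approach without an input file, here is a minimal self-contained sketch. The sub name sum_by_gene_and_feature and the inline sample rows are assumptions made for illustration, and it skips a header row by checking for the literal word "Gene" in the first column.

```perl
use strict;
use warnings;

# Hypothetical helper: sums the size column per (gene, feature) pair.
# Expects tab-separated lines of the form "gene\tsize\tfeature".
sub sum_by_gene_and_feature {
    my @lines = @_;
    my %records;
    for my $line (@lines) {
        chomp $line;
        my ($gene, $size, $feature) = split /\t/, $line;
        next if $gene eq 'Gene';    # assumed header row, skipped
        $records{$gene}{$feature} += $size;
    }
    my @out;
    for my $gene (sort keys %records) {
        for my $feature (sort keys %{ $records{$gene} }) {
            push @out, join("\t", $gene, $records{$gene}{$feature}, $feature);
        }
    }
    return @out;
}

# Sample rows taken from the question, joined with tabs.
my @sample = (
    "Gene\tSize\tFeature",
    "GeneA\t1200\tIntron 1",
    "GeneB\t100\tIntron 1",
    "GeneB\t200\tIntron 1",
    "GeneB\t150\tIntron 2",
    "GeneC\t300\tIntron 5",
);
print "$_\n" for sum_by_gene_and_feature(@sample);
```

Because the sums live in a hash, the two GeneB / Intron 1 rows are folded together even if they are not adjacent in the input.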
This seems like something that can be done easily with a simple SQL query, especially as your files are in a database table format. I couldn't comment on your question to ask you more about it, as I don't have enough reputation to do so.
So I'm assuming that you get your data from a table. That is not to say you can't solve this problem in Perl, but I strongly recommend letting the database do such calculations when fetching the data, as that seems much easier. I am not sure why you chose to do it in Perl, especially when you have lots of such fields in a file and want to do such operations on all of them. You could still use Perl to interact with your database when solving your problem via an SQL query.
So my proposed solution in SQL, if the data is collected from a database, is:
Write an SQL statement with a GROUP BY on the gene and feature fields that aggregates the size column.
If your table looks exactly like what you described (let us call it the GeneInformation table) and you have loaded your data file into an SQL database (SQLite, maybe), then your select query would be:
SELECT gene, feature, SUM(size) FROM GeneInformation
GROUP BY gene, feature;
That should give you a list of genes, features and their corresponding total sizes.
If an SQL solution is completely impossible for you, then I will talk about the Perl solution.
I noticed that the Perl solutions are based on the assumption that a particular gene's values appear consecutively in the file. If that is the case, then I would like to upvote amon's answer (which I can't do at the moment).
I have multiple CSV files and I want to merge all of them.
Some of my sample CSV files are shown below.
M1DL1_Interpro_sum.csv
IPR017690,Outer membrane, omp85 target,821
IPR014729,Rossmann,327
IPR013785,Aldolase,304
IPR015421,Pyridoxal,224
IPR003594,ATPase,179
IPR000531,TonB receptor,150
IPR018248,EF-hand,10
M1DL2_Interpro_sum.csv
IPR017690,Outer membrane, omp85 target,728
IPR013785,Aldolase,300
IPR014729,Rossmann,261
IPR015421,Pyridoxal,189
IPR011991,Winged,113
IPR000873,AMP-dependent synthetase/ligase,111
M1DL3_Interpro_sum.csv
IPR017690,Outer membrane,905
IPR013785,Aldolase,367
IPR014729,Rossmann,338
IPR015421,Pyridoxal,271
IPR003594,ATPase,158
IPR018248,EF-hand,3
To merge these files I have tried the following code:
@ARGV = <merge_csvfiles/*.csv>;
print @ARGV[0],"\n";
open(PAGE,">outfile.csv") || die"Can't open outfile.csv\n";
while($i<scalar(@ARGV))
{
    open(FILE,@ARGV[$i]) || die"Can't open ...@ARGV[$i]...\n";
    $data.=join("",<FILE>);
    close FILE;
    print"file completed...",$i+1,"\n";
    $i++;
}
@data=split("\n",$data);
@data2=@data;
print scalar(@data);
for($i=0;$i<scalar(@data);$i++)
{
    @id1=split(",",@data[$i]);
    $id_1=@id1[0];
    @data[$j]=~s/\n//;
    if(@data[$i] ne "")
    {
        print PAGE "\n@data[$i],";
        for($j=$i+1;$j<scalar(@data2);$j++)
        {
            @id2=split(",",@data2[$j]);
            $id_2=@id2[0];
            if($id_1 eq $id_2)
            {
                @data[$j]=~s/\n//;
                print PAGE "@data2[$j],";
                @data2[$j]="";
                @data[$j]="";
                print "match found at ",$i+1," and ",$j+1,"\n";
            }
        }
    }
    print $i+1,"\n";
}
merge_csvfiles is a folder which contains all the files.
The output of the above code is:
IPR017690,Outer membrane,821,IPR017690,Outer membrane ,728,IPR017690,Outer membrane,905
IPR014729,Rossmann,327,IPR014729,Rossmann,261,IPR014729,Rossmann,338
IPR013785,Aldolase,304,IPR013785,Aldolase,300,IPR013785,Aldolase,367
IPR015421,Pyridoxal,224,IPR015421,Pyridoxal,189,IPR015421,Pyridoxal,271
IPR003594,ATPase,179,IPR003594,ATPase,158
IPR000531,TonB receptor,150
IPR018248,EF-hand,10,IPR018248,EF-hand,3
IPR011991,Winged,113
IPR000873,AMP-dependent synthetase/ligase
But I want the output in the following format:
IPR017690,Outer membrane,821,IPR017690,Outer membrane ,728,IPR017690,Outer membrane,905
IPR014729,Rossmann,327,IPR014729,Rossmann,261,IPR014729,Rossmann,338
IPR013785,Aldolase,304,IPR013785,Aldolase,300,IPR013785,Aldolase,367
IPR015421,Pyridoxal,224,IPR015421,Pyridoxal,189,IPR015421,Pyridoxal,271
IPR003594,ATPase,179,0,0,0,IPR003594,ATPase,158
IPR000531,TonB receptor,150,0,0,0,0,0,0
IPR018248,EF-hand,10,0,0,0,IPR018248,EF-hand,3
0,0,0,IPR011991,Winged,113,0,0,0
0,0,0,IPR000873,AMP-dependent synthetase/ligase,111,0,0,0
Does anybody have an idea how I can do this?
Thank you for the help
As mentioned in Miguel Prz's comment, you haven't explained how you want the merge to be performed, but, judging by the "desired output" sample, it appears that what you want is to concatenate lines with matching IDs from all three input files into a single line in the output file, with "0,0,0" taking the place of any lines which don't appear in a given file.
So, then:
#!/usr/bin/env perl
use strict;
use warnings;
my @input_files = glob 'merge_csvfiles/*.csv';
my %data;
for my $i (0 .. $#input_files) {
    open my $infh, '<', $input_files[$i]
        or die "Failed to open $input_files[$i]: $!";
    while (<$infh>) {
        chomp;
        my $id = (split ',', $_, 2)[0];
        $data{$id}[$i] = $_;
    }
    print "Input file read: $input_files[$i]\n";
}
open my $outfh, '>', 'outfile.csv' or die "Failed to open outfile.csv: $!";
for my $id (sort keys %data) {
    my @merge_data;
    for my $i (0 .. $#input_files) {
        push @merge_data, $data{$id}[$i] || '0,0,0';
    }
    print $outfh join(',', @merge_data) . "\n";
}
The first loop collects all the lines from each file into a hash of arrays. The hash keys are the IDs, so the lines for that ID from all files are kept together, and the value for each key is (a reference to) an array of the line associated with that ID in each file; using an array for this allows us to keep track of values which are missing as well as those which are present.
The second loop then takes the keys of that hash (in alphabetical order) and, for each one, creates a temporary array of the values associated with that ID, substituting "0,0,0" for missing values, joins them into a single string, and prints that to the output file.
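The two loops can also be tried on in-memory data. This is only a sketch: the sub name merge_by_id and the idea of passing one array of lines per file are assumptions for illustration, and it uses the defined-or operator (//, Perl 5.10 or later) rather than || for the placeholder.

```perl
use strict;
use warnings;

# Hypothetical helper: $files is an array ref holding one array ref
# of CSV lines per input file.
sub merge_by_id {
    my ($files) = @_;
    my %data;

    # First loop: file the line for each ID under its file index.
    for my $i (0 .. $#$files) {
        for my $line (@{ $files->[$i] }) {
            my $id = (split /,/, $line, 2)[0];
            $data{$id}[$i] = $line;
        }
    }

    # Second loop: one output row per ID, "0,0,0" where a file has no line.
    my @merged;
    for my $id (sort keys %data) {
        my @row;
        for my $i (0 .. $#$files) {
            push @row, $data{$id}[$i] // '0,0,0';
        }
        push @merged, join(',', @row);
    }
    return @merged;
}

my @files = (
    [ 'IPR003594,ATPase,179', 'IPR000531,TonB receptor,150' ],
    [ 'IPR011991,Winged,113' ],
    [ 'IPR003594,ATPase,158' ],
);
print "$_\n" for merge_by_id(\@files);
```

Indexing the inner array by file number is what lets a missing line show up as an undefined slot instead of silently shifting the later columns to the left.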
The results, in outfile.csv, are:
IPR000531,TonB receptor,150,0,0,0,0,0,0
0,0,0,IPR000873,AMP-dependent synthetase/ligase,111,0,0,0
IPR003594,ATPase,179,0,0,0,IPR003594,ATPase,158
0,0,0,IPR011991,Winged,113,0,0,0
IPR013785,Aldolase,304,IPR013785,Aldolase,300,IPR013785,Aldolase,367
IPR014729,Rossmann,327,IPR014729,Rossmann,261,IPR014729,Rossmann,338
IPR015421,Pyridoxal,224,IPR015421,Pyridoxal,189,IPR015421,Pyridoxal,271
IPR017690,Outer membrane, omp85 target,821,IPR017690,Outer membrane, omp85 target,728,IPR017690,Outer membrane,905
IPR018248,EF-hand,10,0,0,0,IPR018248,EF-hand,3
Edit: Added explanations requested by OP in comments
Can you explain how my $id = (split ',', $_, 2)[0]; and $# work in this program?
my $id = (split ',', $_, 2)[0]; gets the text prior to the first comma in the last line of text that was read:
Because I didn't specify what variable to put the data in, while (<$infh>) reads it into the default variable $_.
split ',', $_, 2 splits up the value of $_ into a list of comma-separated fields. The 2 at the end tells it to only produce at most 2 fields; the code will work fine without the 2, but, since I only need the first field, splitting into more parts isn't necessary.
Putting (...)[0] around the split command takes a list slice of the returned fields and returns the first element of that list. It's the same as if I'd written my @fields = split ',', $_, 2; my $id = $fields[0];, but shorter and without the extra variable.
$#array returns the highest-numbered index in the array @array, so for my $i (0 .. $#array) just means "loop over the indexes for all elements in @array". (Note that, if I hadn't needed the value of the index counter, I would have instead looped over the array's data directly, by using for my $filename (@input_files), but it would have been less convenient to keep track of the missing values if I'd done it that way.)
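A small standalone sketch of both constructs, using one of the sample lines from the question. The file names in @input_files are made up for illustration.

```perl
use strict;
use warnings;

my $line = 'IPR017690,Outer membrane, omp85 target,821';

# List slice: take element 0 of the list returned by split.
my $id = (split /,/, $line, 2)[0];

# With a limit of 2, everything after the first comma stays in one piece.
my @two_fields = split /,/, $line, 2;

my @input_files = ('a.csv', 'b.csv', 'c.csv');   # hypothetical names
my $last_index  = $#input_files;                 # one less than the element count

print "id: $id\n";
print "rest: $two_fields[1]\n";
print "last index: $last_index, count: ", scalar(@input_files), "\n";
```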
I have written a Perl script that splits 3 columns into scalars and replaces various values in the second column using regex. This part works fine, as shown below. What I would like to do, though, is change the first column ($item_id) into a series of sequential numbers that restart when the original (numeric) value of $item_id changes.
For example:
123
123
123
123
2397
2397
2397
2397
8693
8693
8693
8693
would be changed to something like this (in a column):
1
2
3
4
1
2
3
4
1
2
3
4
This could either replace the first column or be a new fourth column.
I understand that I might do this through a series of if-else statements and tried this, but that doesn't seem to play well with the while procedure I've already got working for me. - Thanks, Thom Shepard
open(DATA,"< text_to_be_processed.txt");
while (<DATA>)
{
chomp;
my ($item_id,$callnum,$data)=split(/\|/);
$callnum=~s/110/\%I/g;
$callnum=~s/245/\%T/g;
$callnum=~s/260/\%U/g;
print "$item_id\t$callnum\t$data\n";
} #End while
close DATA;
The basic steps are:
Outside of the loop declare the counter and a variable holding the previous $item_id.
Inside the loop you do three things:
reset the counter to 1 if the current $item_id differs from the previous one, otherwise increase it
use that counter, e.g. print it
remember the previous value
In code, this could look something like this (untested):
my ($counter, $prev_item_id) = (0, '');
while (<DATA>) {
# do your thing
$counter = $item_id eq $prev_item_id ? $counter + 1 : 1;
$prev_item_id = $item_id;
print "$item_id\t$counter\t...\n";
}
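For a version that can be run as-is, here is a sketch that applies the same counter logic to a plain list of IDs instead of a filehandle. The sub name number_runs is an assumption for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: numbers each ID by its position within a run
# of consecutive equal IDs.
sub number_runs {
    my @ids = @_;
    my ($counter, $prev) = (0, '');
    my @numbers;
    for my $id (@ids) {
        # Restart at 1 when the ID changes, otherwise keep counting.
        $counter = $id eq $prev ? $counter + 1 : 1;
        $prev    = $id;
        push @numbers, $counter;
    }
    return @numbers;
}

print join(' ', number_runs(123, 123, 123, 2397, 2397, 8693)), "\n";
```

Note that this only looks at the previous ID, so an ID that reappears later in the input starts again at 1.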
This goes a little further than just what you asked...
Use lexical filehandles
autodie makes open throw an error automatically
Replace the call nums using a table
Don't assume the data is sorted by item ID
Here's the code.
use strict;
use warnings;
use autodie;
open(my $fh, "<", "text_to_be_processed.txt");
my %Callnum_Map = (
110 => '%I',
245 => '%T',
260 => '%U',
);
my %item_id_count;
while (<$fh>) {
chomp;
my($item_id,$callnum,$data) = split m{\|};
for my $search (keys %Callnum_Map) {
my $replace = $Callnum_Map{$search};
$callnum =~ s{$search}{$replace}g;
}
my $item_count = ++$item_id_count{$item_id};
print "$item_id\t$callnum\t$data\t$item_count\n";
}
By using a hash, it does not presume the data is sorted by item ID. So if it sees...
123|foo|bar
456|up|down
123|left|right
789|this|that
456|black|white
123|what|huh
It will produce...
1
1
2
1
2
3
This is more robust, assuming you want a count of how many times you've seen an item ID in the whole file. If you want how many times it's been seen consecutively, use Mortiz's solution.
Is this what you are looking for?
open(DATA,"< text_to_be_processed.txt");
my $counter = 0;
my $prev;
while (<DATA>)
{
chomp;
my ($item_id,$callnum,$data)=split(/\|/);
$callnum=~s/110/\%I/g;
$callnum=~s/245/\%T/g;
$callnum=~s/260/\%U/g;
#reset counter if $item_id is different from the previous one
$counter = 0 if !defined $prev || $prev ne $item_id;
$prev = $item_id;
++$counter;
print "$counter\t$callnum\t$data\n";
} #End while
close DATA;
I've been trying to write a program to read columns of text-formatted numbers into Perl variables.
Basically, I have a file with descriptions and numbers:
ref 5.25676 0.526231 6.325135
ref 1.76234 12.62341 9.1612345
etc.
I'd like to put the numbers into variables with different names, e.g.
ref_1_x=5.25676
ref_1_y=0.526231
etc.
Here's what I've got so far:
print "Loading file ...";
open (FILE, "somefile.txt");
@text=<FILE>;
close FILE;
print "Done!\n";
my $count=0;
foreach $line (@text){
    @coord[$count]=split(/ +/, $line);
}
I'm trying to compare the positions written in the file to each other, so will need another loop after this.
Sorry, you weren't terribly clear on what you're trying to do and what "ref" refers to. If I misunderstood your problem, please comment and clarify.
First of all, I would strongly recommend against using variable names to structure data (e.g. using $ref_1_x to store x coordinate for the first row with label "ref").
If you want to store x, y and z coordinates, you can do so as an array of 3 elements, pretty much like you did - the only difference is that you want to store an array reference (you can't store an array as a value in another array in Perl):
my ($first_column, @data) = split(/ +/, $line); # Remove first "ref" column
$coordinates[$count++] = \@data; # Store the reference to coordinate array
Then, to access the x coordinate for row 2, you do:
$coordinates[1]->[0]; # index 1 for row 2; then sub-index 0 for x coordinate.
If you insist on storing the 3 coordinates in a named data structure, because sub-index 0 for x coordinate looks less readable - which is a valid concern in general but not really an issue with 3 columns - use a hash instead of an array:
my ($first_column, @data) = split(/ +/, $line); # Remove first "ref" column
$coordinates[$count++] = { x => $data[0], y => $data[1], z => $data[2] };
# curly braces - {} - to store hash reference again
Then, to access the x coordinate for row 2, you do:
$coordinates[1]->{x}; # index 1 for row 2
Now, if you ALSO want to store the rows that have a first column value "ref" in a separate "ref"-labelled data structure, you can do that by wrapping the original @coordinates array into being a value in a hash with a key of "ref".
my ($label, @data) = split(/ +/, $line); # Save first "ref" label
$coordinates{$label} ||= []; # Assign an empty array ref
                             # if we did not create the array for a given label yet.
push @{ $coordinates{$label} }, { x => $data[0], y => $data[1], z => $data[2] };
# Since we don't want to bother counting per individual label,
# simply push the coordinate hash at the end of the appropriate array.
# Since the coordinate array is stored as an array reference,
# we must dereference for push() to work using @{ MY_ARRAY_REF } syntax
Then, to access the x coordinate for row 2 for label "ref", you do:
$label = "ref";
$coordinates{$label}->[1]->{x}; # index 1 for row 2 for $label
Also, your original example code has a couple of outdated idioms that you may want to write in a better style (use 3-argument form of open(), check for errors on IO operations like open(); use of lexical filehandles; storing entire file in a big array instead of reading line by line).
Here's a slightly modified version:
use strict;
use warnings;
my %coordinates;
print "Loading file ...";
open (my $file, "<", "somefile.txt") || die "Can't read file somefile.txt: $!";
while (<$file>) {
    chomp;
    my ($label, @data) = split(/ +/); # Splitting $_ where while puts next line
    $coordinates{$label} ||= []; # Assign empty array ref if not yet assigned
    push @{ $coordinates{$label} }
        , { x => $data[0], y => $data[1], z => $data[2] };
}
close($file);
print "Done!\n";
It is not clear what you want to compare to what, so can't advise on that without further clarifications.
The problem is that you likely need a two-dimensional array (or a hash, or ...). Instead of this:
@coord[$count]=split(/ +/, $line);
Use:
$coord[$count++]=[split(/ +/, $line)];
Which puts the entire results of the split into a sub array. Thus,
print $coord[0][1];
should output "5.25676".
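A runnable sketch of that idea, with the two sample lines from the question held in an array instead of being read from somefile.txt (an assumption made so the snippet runs on its own):

```perl
use strict;
use warnings;

# Sample lines from the question, standing in for the file contents.
my @text = (
    "ref 5.25676 0.526231 6.325135\n",
    "ref 1.76234 12.62341 9.1612345\n",
);

my @coord;
for my $line (@text) {
    chomp $line;
    # Store each row as an anonymous array: label, x, y, z.
    push @coord, [ split / +/, $line ];
}

print $coord[0][1], "\n";             # x of the first row
print scalar(@{ $coord[1] }), "\n";   # number of fields in the second row
```

Using push instead of an explicit $count index avoids forgetting to increment the counter, which is what made the original loop overwrite the same slot.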