Perl script for combining 2 files with multiple entries

I have a tab-delimited text file like this:
contig11 GO:100 other columns of data
contig11 GO:289 other columns of data
contig11 GO:113 other columns of data
contig22 GO:388 other columns of data
contig22 GO:101 other columns of data
And another like this:
contig11 3 N
contig11 1 Y
contig22 1 Y
contig22 2 N
I need to combine them so that each 'multiple' entry of one of the files is duplicated and populated with its data in the other, so that I get:
contig11 3 N GO:100 other columns of data
contig11 3 N GO:289 other columns of data
contig11 3 N GO:113 other columns of data
contig11 1 Y GO:100 other columns of data
contig11 1 Y GO:289 other columns of data
contig11 1 Y GO:113 other columns of data
contig22 1 Y GO:388 other columns of data
contig22 1 Y GO:101 other columns of data
contig22 2 N GO:388 other columns of data
contig22 2 N GO:101 other columns of data
I have little scripting experience, but I have done this where e.g. "contig11" occurs only once in one of the files, using hashes/keys. But I can't even begin to get my head around how to do this! I'd really appreciate some help or hints on how to tackle this problem.
EDIT: I have tried ikegami's suggestion (see answers) with the script below. It has produced the output I needed, except for the GO:100 column onwards ($rest in the script???). Any ideas what I'm doing wrong?
#!/usr/bin/env perl
use warnings;

open (GOTERMS, "$ARGV[0]") or die "Error opening the input file with GO terms";
open (SNPS, "$ARGV[1]") or die "Error opening the input file with SNPs";

my %goterm;
while (<GOTERMS>)
{
    my($id, $rest) = /^(\S++)(,*)/s;
    push @{$goterm{$id}}, $rest;
}

while (my $row2 = <SNPS>)
{
    chomp($row2);
    my ($id) = $row2 =~ /^(\S+)/;
    for my $rest (@{ $goterm{$id} })
    {
        print("$row2$rest\n");
    }
}

close GOTERMS;
close SNPS;

Look at your output. It's clearly produced by
for each row of the second file,
for each row of the first file with the same id,
print out the combined rows
So the question is: How do you find the rows of the first file with the same id as a row of the second file?
The answer is: You store the rows of the first file in a hash indexed by the row's id.
my %file1;
while (<$file1_fh>) {
    my ($id, $rest) = /^(\S++)(.*)/s;
    push @{ $file1{$id} }, $rest;
}
So the earlier pseudo code resolves to
while (my $row2 = <$file2_fh>) {
    chomp($row2);
    my ($id) = $row2 =~ /^(\S+)/;
    for my $rest (@{ $file1{$id} }) {
        print("$row2$rest");
    }
}
#!/usr/bin/env perl
use strict;
use warnings;

open(my $GOTERMS, $ARGV[0])
    or die("Error opening GO terms file \"$ARGV[0]\": $!\n");
open(my $SNPS, $ARGV[1])
    or die("Error opening SNP file \"$ARGV[1]\": $!\n");

my %goterm;
while (<$GOTERMS>) {
    my ($id, $rest) = /^(\S++)(.*)/s;
    push @{ $goterm{$id} }, $rest;
}

while (my $row2 = <$SNPS>) {
    chomp($row2);
    my ($id) = $row2 =~ /^(\S+)/;
    for my $rest (@{ $goterm{$id} }) {
        print("$row2$rest");
    }
}
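For reference, assuming the script were saved under the hypothetical name combine.pl, it would be invoked with the GO terms file first and the SNP file second, e.g. perl combine.pl goterms.txt snps.txt > combined.txt.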

I will describe how you can do this. You need to read each file into an array (each line is an array item). Then you just need to compare these arrays in the needed way. You need two loops: a main loop over each record of the array/file containing the strings you will use to compare (in your example, the second file), and under it another loop over each record of the array/file you are comparing against. Then just check each record of one array against each record of the other and process the results.
foreach my $record2 (@array2) {
    foreach my $record1 (@array1) {
        if ($record2->{field} eq $record1->{field}) {
            # here you need to create the string which you will show
            my $res_string = $record2->{field}.$record1->{field};
            print "$res_string\n";
        }
    }
}
Or don't use arrays: just read the files and compare each line with each line of the other file. The general idea is the same )) A rough sketch of that variant follows.
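For illustration, here is a minimal, untested sketch of that line-by-line variant. The file names goterms.txt and snps.txt are placeholders for the two input files:
#!/usr/bin/env perl
use strict;
use warnings;

# For every line of the SNP file, rescan the GO terms file from the top
# and print the combined line whenever the ids match.
open my $snps, '<', 'snps.txt' or die "Can't open snps.txt: $!";
while (my $row2 = <$snps>) {
    chomp $row2;
    my ($id2) = $row2 =~ /^(\S+)/;
    open my $go, '<', 'goterms.txt' or die "Can't open goterms.txt: $!";
    while (my $row1 = <$go>) {
        chomp $row1;
        my ($id1, $rest) = $row1 =~ /^(\S+)(.*)/s;
        next unless defined $id1;              # skip blank lines
        print "$row2$rest\n" if $id1 eq $id2;  # same id => combine rows
    }
    close $go;  # closing lets the next pass start from the top again
}
close $snps;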

Related

Parse tab delimited file into hash of array

I'm a Perl novice attempting to perform the following:
1) Take a user input
2) Match the input against instances of that value in column 1 of file 1 and store the corresponding value from column 2 in a hash, hash of arrays, or hash of hashes. (The code below stores it in a hash of arrays, but I'm not sure if this is optimal for accomplishing 3 below.)
3) Find all instances (if they exist) where the first column of file 2 equals column 2 of file 1.
For simplicity I've provided sample file below.
I'm attempting to take a user input of 'AAA', match it in column 1 of the input file, and use it as the key for all corresponding values in column 2.
My input file has multiple instances of 'AAA' in column 1 with different values for column 2, also there are multiple instances of 'AAA' and 'BBB' in columns 1 & 2. I believe in order to output this properly I need to use a hash of hash but I'm not sure syntactically how to approach it.
I've tried searching this site and found some examples but I'm afraid I'm only confusing myself more.
Example of input file.
AAA BBB
AAA CCC
AAA BBB
BBB DDD
CCC AAA
Example of my code
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
use Data::Dumper;

# declare values
my %hash = ();

# Get protein name from user
print "Get column 1 value: ";
my $value = <STDIN>;
chomp $value;

# open input file
open FILE, "file" or die("unable to open file\n");
while (my $line = <FILE>) {
    chomp($line);
    my ($column1, $column2) = split("\t", $line);
    if ($column1 eq $value) {
        push @{ $hash{$column1} }, $column2;
    }
}
close FILE;
print Dumper(\%hash);
Code output
$VAR1 = {
    'AAA' => [
        'BBB',
        'CCC'
    ]
};
My question is will my current hash of array setup work best for reading column 1 in file 2 and comparing it with column 2 of file 1? Or should I approach it differently?
Your current code overwrites the value of $hash{$column1} on each iteration. You can use push to add a new element to the array instead of overwriting by changing this line:
$hash{$column1} = [$column2];
to
push @{ $hash{$column1} }, $column2;
Note that the data structure you're creating is not a hash of hashes but a hash of arrays.
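As for whether the hash of arrays will work for step 3: yes. Here is a minimal, untested sketch, assuming the %hash and $value from your code and a hypothetical second file named file2 with the same tab-delimited layout:
# Build a lookup set from the column-2 values stored for the user's input,
# then scan file 2 and report lines whose first column is in that set.
my %wanted = map { $_ => 1 } @{ $hash{$value} || [] };

open my $fh2, '<', 'file2' or die "unable to open file2: $!";
while (my $line = <$fh2>) {
    chomp $line;
    my ($col1) = split /\t/, $line;
    print "$line\n" if $wanted{$col1};  # first column matches a stored value
}
close $fh2;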

Merge lines and do operations if a condition is satisfied

I'm new to Perl and I would like to read a table and sum some values from specific lines. This is a simplified example of my input file:
INPUT :
Gene Size Feature
GeneA 1200 Intron 1
GeneB 100 Intron 1
GeneB 200 Intron 1
GeneB 150 Intron 2
GeneC 300 Intron 5
OUTPUT :
GeneA 1200 Intron 1
GeneB 300 Intron 1 <-- the size values are summed
GeneB 150 Intron 2
GeneC 300 Intron 5
Because GeneB is present for intron 1 with two different sizes, I would like to sum these two values and print only one line per intron number.
This is an example of the code I'm trying to write, though I would like to extend it once I understand how to handle this kind of data.
#!/usr/bin/perl
use strict;
use warnings;

my $sum;
my @GAP_list;
my $prevline = 'na';

open INFILE, "Table.csv";
while (my $ligne = <INFILE>)
{
    chomp ($ligne);
    my @list = split /\t/, $ligne;
    my $gene = $list[0];
    my $GAP_size = $list[2];
    my $intron = $list[3];
    my $intron_number = $list[4];
    if ($prevline eq 'na') {
        push @GAP_list, $GAP_size;
    }
    elsif ($prevline ne 'na') {
        my @list_p = split /\t/, $prevline;
        my $gene_p = $list_p[0];
        my $GAP_size_p = $list_p[2];
        my $intron_p = $list_p[3];
        my $intron_number_p = $list_p[4];
        if (($gene eq $gene_p) && ($intron eq $intron_p) && ($intron_number eq $intron_number_p)) {
            push @GAP_list, $GAP_size;
        }
    }
    else {
        $sum = doSum(@GAP_list);
        print "$gene\tGAP\t$GAP_size\t$intron\t$intron_number\t$sum\n";
        $prevline = $ligne;
    }
}

# Subroutine
sub doSum {
    my $sum = 0;
    foreach my $x (@_) {
        $sum += $x;
    }
    return $sum;
}
Assuming the fields are separated by tabs, the following strategy works. It buffers the last line, either summing into it if the other fields are equal, or printing the buffered data and then replacing the buffer with the current line.
After the whole input was processed, we must not forget to print out the contents that are still in the buffer.
my $first_line = do { my $l = <>; chomp $l; $l };
my ($last_gene, $last_tow, $last_intron) = split /\t/, $first_line;

while (<>) {
    chomp;
    my ($gene, $tow, $intron) = split /\t/;
    if ($gene eq $last_gene and $intron eq $last_intron) {
        $last_tow += $tow;
    } else {
        print join("\t", $last_gene, $last_tow, $last_intron), "\n";
        ($last_gene, $last_tow, $last_intron) = ($gene, $tow, $intron);
    }
}
print join("\t", $last_gene, $last_tow, $last_intron), "\n";
This works fine as long as genes that may be folded together are always consecutive. If the joinable records are spread all over the file, we have to keep a data structure of all records. After the whole file is parsed, we can emit nicely sorted sums.
We will use a multilevel hash that uses the gene as first level key, and the intron as 2nd level key. The value is the count/tow/whatever:
my %records;

# parse the file
while (<>) {
    chomp;
    my ($gene, $tow, $intron) = split /\t/;
    $records{$gene}{$intron} += $tow;
}

# emit the data:
for my $gene (sort keys %records) {
    for my $intron (sort keys %{ $records{$gene} }) {
        print join("\t", $gene, $records{$gene}{$intron}, $intron), "\n";
    }
}
This seems like something that can be done easily with a simple SQL query, especially as your files are in a database table format. (I couldn't comment on your question to ask for more details, as I don't have enough reputation to do so.)
So I'm assuming that your data comes from a table. Not that you can't solve this problem in Perl, but I strongly recommend letting the database do such calculations when fetching the data, as that seems much easier, especially when you have many such fields in a file and want the same operation on all of them. You can still use Perl to interact with your database and run the SQL query.
So my proposed solution in SQL, if the data is collected from a database is:
Write an SQL statement involving a GROUP BY on the GENE and feature field and aggregate the size column.
If your table looked exactly like what you described, let us call it GeneInformation table and you loaded your data file to the SQL database (SQLLite maybe) then your select query would be:
SELECT gene, feature, SUM(size)
FROM GeneInformation
GROUP BY gene, feature;
That should give you a list of genes, features, and their corresponding total sizes.
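For completeness, here is a minimal, untested sketch of running that query from Perl with DBI, assuming DBD::SQLite is installed and the data has been loaded into a hypothetical SQLite file genes.db as a GeneInformation table:
#!/usr/bin/env perl
use strict;
use warnings;
use DBI;

# Connect to the (hypothetical) SQLite database file.
my $dbh = DBI->connect('dbi:SQLite:dbname=genes.db', '', '',
                       { RaiseError => 1 });

# Let the database do the grouping and summing.
my $sth = $dbh->prepare(
    'SELECT gene, feature, SUM(size) FROM GeneInformation GROUP BY gene, feature'
);
$sth->execute;

while (my ($gene, $feature, $total) = $sth->fetchrow_array) {
    print "$gene\t$total\t$feature\n";  # matches the Gene Size Feature layout
}
$dbh->disconnect;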
If SQL solution is completely impossible for you then I will talk about the Perl solution.
I noticed that the Perl solutions are based on the assumption that a particular gene's values appear consecutively in the file. If that is the case, then I would like to upvote amon's answer (which I can't do at the moment).

I want to merge multiple CSV files by a specific condition using Perl

I have multiple CSV files that I want to merge.
I am showing some of my sample CSV files below...
M1DL1_Interpro_sum.csv
IPR017690,Outer membrane, omp85 target,821
IPR014729,Rossmann,327
IPR013785,Aldolase,304
IPR015421,Pyridoxal,224
IPR003594,ATPase,179
IPR000531,TonB receptor,150
IPR018248,EF-hand,10
M1DL2_Interpro_sum.csv
IPR017690,Outer membrane, omp85 target,728
IPR013785,Aldolase,300
IPR014729,Rossmann,261
IPR015421,Pyridoxal,189
IPR011991,Winged,113
IPR000873,AMP-dependent synthetase/ligase,111
M1DL3_Interpro_sum.csv
IPR017690,Outer membrane,905
IPR013785,Aldolase,367
IPR014729,Rossmann,338
IPR015421,Pyridoxal,271
IPR003594,ATPase,158
IPR018248,EF-hand,3
Now, to merge these files I have tried the following code:
@ARGV = <merge_csvfiles/*.csv>;
print @ARGV[0],"\n";
open(PAGE,">outfile.csv") || die"Can't open outfile.csv\n";
while($i<scalar(@ARGV))
{
    open(FILE,@ARGV[$i]) || die"Can't open ...@ARGV[$i]...\n";
    $data.=join("",<FILE>);
    close FILE;
    print"file completed...",$i+1,"\n";
    $i++;
}
@data=split("\n",$data);
@data2=@data;
print scalar(@data);
for($i=0;$i<scalar(@data);$i++)
{
    @id1=split(",",@data[$i]);
    $id_1=@id1[0];
    @data[$j]=~s/\n//;
    if(@data[$i] ne "")
    {
        print PAGE "\n@data[$i],";
        for($j=$i+1;$j<scalar(@data2);$j++)
        {
            @id2=split(",",@data2[$j]);
            $id_2=@id2[0];
            if($id_1 eq $id_2)
            {
                @data[$j]=~s/\n//;
                print PAGE "@data2[$j],";
                @data2[$j]="";
                @data[$j]="";
                print "match found at ",$i+1," and ",$j+1,"\n";
            }
        }
    }
    print $i+1,"\n";
}
merge_csvfiles is a folder which contains all the files
The output of the above code is:
IPR017690,Outer membrane,821,IPR017690,Outer membrane ,728,IPR017690,Outer membrane,905
IPR014729,Rossmann,327,IPR014729,Rossmann,261,IPR014729,Rossmann,338
IPR013785,Aldolase,304,IPR013785,Aldolase,300,IPR013785,Aldolase,367
IPR015421,Pyridoxal,224,IPR015421,Pyridoxal,189,IPR015421,Pyridoxal,271
IPR003594,ATPase,179,IPR003594,ATPase,158
IPR000531,TonB receptor,150
IPR018248,EF-hand,10,IPR018248,EF-hand,3
IPR011991,Winged,113
IPR000873,AMP-dependent synthetase/ligase
But I want the output in the following format:
IPR017690,Outer membrane,821,IPR017690,Outer membrane ,728,IPR017690,Outer membrane,905
IPR014729,Rossmann,327,IPR014729,Rossmann,261,IPR014729,Rossmann,338
IPR013785,Aldolase,304,IPR013785,Aldolase,300,IPR013785,Aldolase,367
IPR015421,Pyridoxal,224,IPR015421,Pyridoxal,189,IPR015421,Pyridoxal,271
IPR003594,ATPase,179,0,0,0,IPR003594,ATPase,158
IPR000531,TonB receptor,150,0,0,0,0,0,0
IPR018248,EF-hand,10,0,0,0,IPR018248,EF-hand,3
0,0,0,IPR011991,Winged,113,0,0,0
0,0,0,IPR000873,AMP-dependent synthetase/ligase,111,0,0,0
Has anybody got any idea how I can do this?
Thank you for the help.
As mentioned in Miguel Prz's comment, you haven't explained how you want the merge to be performed, but, judging by the "desired output" sample, it appears that what you want is to concatenate lines with matching IDs from all three input files into a single line in the output file, with "0,0,0" taking the place of any lines which don't appear in a given file.
So, then:
#!/usr/bin/env perl
use strict;
use warnings;

my @input_files = glob 'merge_csvfiles/*.csv';

my %data;
for my $i (0 .. $#input_files) {
    open my $infh, '<', $input_files[$i]
        or die "Failed to open $input_files[$i]: $!";
    while (<$infh>) {
        chomp;
        my $id = (split ',', $_, 2)[0];
        $data{$id}[$i] = $_;
    }
    print "Input file read: $input_files[$i]\n";
}

open my $outfh, '>', 'outfile.csv' or die "Failed to open outfile.csv: $!";
for my $id (sort keys %data) {
    my @merge_data;
    for my $i (0 .. $#input_files) {
        push @merge_data, $data{$id}[$i] || '0,0,0';
    }
    print $outfh join(',', @merge_data) . "\n";
}
The first loop collects all the lines from each file into a hash of arrays. The hash keys are the IDs, so the lines for a given ID from all files are kept together, and the value for each key is (a reference to) an array of the lines associated with that ID, one per file; using an array for this lets us keep track of values which are missing as well as those which are present.
The second loop then takes the keys of that hash (in alphabetical order) and, for each one, creates a temporary array of the values associated with that ID, substituting "0,0,0" for missing values, joins them into a single string, and prints that to the output file.
The results, in outfile.csv, are:
IPR000531,TonB receptor,150,0,0,0,0,0,0
0,0,0,IPR000873,AMP-dependent synthetase/ligase,111,0,0,0
IPR003594,ATPase,179,0,0,0,IPR003594,ATPase,158
0,0,0,IPR011991,Winged,113,0,0,0
IPR013785,Aldolase,304,IPR013785,Aldolase,300,IPR013785,Aldolase,367
IPR014729,Rossmann,327,IPR014729,Rossmann,261,IPR014729,Rossmann,338
IPR015421,Pyridoxal,224,IPR015421,Pyridoxal,189,IPR015421,Pyridoxal,271
IPR017690,Outer membrane, omp85 target,821,IPR017690,Outer membrane, omp85 target,728,IPR017690,Outer membrane,905
IPR018248,EF-hand,10,0,0,0,IPR018248,EF-hand,3
Edit: Added explanations requested by OP in comments
Can you explain the working of my $id = (split ',', $_, 2)[0]; and $# in this program?
my $id = (split ',', $_, 2)[0]; gets the text prior to the first comma in the last line of text that was read:
Because I didn't specify what variable to put the data in, while (<$infh>) reads it into the default variable $_.
split ',', $_, 2 splits up the value of $_ into a list of comma-separated fields. The 2 at the end tells it to only produce at most 2 fields; the code will work fine without the 2, but, since I only need the first field, splitting into more parts isn't necessary.
Putting (...)[0] around the split command takes a slice of the returned list of fields and gives back its first element. It's the same as if I'd written my @fields = split ',', $_, 2; my $id = $fields[0];, but shorter and without the extra variable.
$#array returns the highest-numbered index in the array @array, so for my $i (0 .. $#array) just means "loop over the indexes for all elements in @array". (Note that, if I hadn't needed the value of the index counter, I would have instead looped over the array's data directly, by using for my $filename (@input_files), but it would have been less convenient to keep track of the missing values if I'd done it that way.)
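A quick, self-contained demonstration of both idioms, using a made-up input line and hypothetical file names:
#!/usr/bin/env perl
use strict;
use warnings;

# split with a limit of 2, and the (...)[0] list slice
$_ = 'IPR013785,Aldolase,304';           # a made-up input line
my @fields = split ',', $_, 2;           # ('IPR013785', 'Aldolase,304')
my $id     = (split ',', $_, 2)[0];      # 'IPR013785' -- same as $fields[0]
print "$id\n";

# $#array is the highest index, one less than the element count
my @input_files = ('a.csv', 'b.csv', 'c.csv');
print "$#input_files\n";                 # prints 2
for my $i (0 .. $#input_files) {         # loops over indexes 0, 1, 2
    print "$i: $input_files[$i]\n";
}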

How do I replace row identifiers with sequential numbers?

I have written a Perl script that splits 3 columns into scalars and replaces various values in the second column using a regex. This part works fine, as shown below. What I would like to do, though, is change the first column ($item_id) into a series of sequential numbers that restart when the original (numeric) value of $item_id changes.
For example:
123
123
123
123
2397
2397
2397
2397
8693
8693
8693
8693
would be changed to something like this (in a column):
1
2
3
4
1
2
3
4
1
2
3
4
This could either replace the first column or be a new fourth column.
I understand that I might do this through a series of if-else statements and tried this, but that doesn't seem to play well with the while procedure I've already got working for me. - Thanks, Thom Shepard
open(DATA,"< text_to_be_processed.txt");
while (<DATA>)
{
    chomp;
    my ($item_id,$callnum,$data)=split(/\|/);
    $callnum=~s/110/\%I/g;
    $callnum=~s/245/\%T/g;
    $callnum=~s/260/\%U/g;
    print "$item_id\t$callnum\t$data\n";
} #End while
close DATA;
The basic steps are:
Outside of the loop declare the counter and a variable holding the previous $item_id.
Inside the loop you do three things:
reset the counter to 1 if the current $item_id differs from the previous one, otherwise increase it
use that counter, e.g. print it
remember the previous value
With code, this could look something like the following (untested):
my ($counter, $prev_item_id) = (0, '');
while (<DATA>) {
    # do your thing
    $counter = $item_id eq $prev_item_id ? $counter + 1 : 1;
    $prev_item_id = $item_id;
    print "$item_id\t$counter\t...\n";
}
This goes a little further than just what you asked...
Use lexical filehandles
autodie makes open throw an error automatically
Replace the call nums using a table
Don't assume the data is sorted by item ID
Here's the code.
use strict;
use warnings;
use autodie;

open(my $fh, "<", "text_to_be_processed.txt");

my %Callnum_Map = (
    110 => '%I',
    245 => '%T',
    260 => '%U',
);

my %item_id_count;
while (<$fh>) {
    chomp;
    my($item_id,$callnum,$data) = split m{\|};

    for my $search (keys %Callnum_Map) {
        my $replace = $Callnum_Map{$search};
        $callnum =~ s{$search}{$replace}g;
    }

    my $item_count = ++$item_id_count{$item_id};
    print "$item_id\t$callnum\t$data\t$item_count\n";
}
By using a hash, it does not presume the data is sorted by item ID. So if it sees...
123|foo|bar
456|up|down
123|left|right
789|this|that
456|black|white
123|what|huh
It will produce...
1
1
2
1
2
3
This is more robust, assuming you want a count of how many times an item ID has been seen in the whole file. If you want how many times it has been seen consecutively, use moritz's solution.
Is this what you are looking for?
open(DATA,"< text_to_be_processed.txt");
my $counter = 0;
my $prev = '';
while (<DATA>)
{
    chomp;
    my ($item_id,$callnum,$data)=split(/\|/);
    $callnum=~s/110/\%I/g;
    $callnum=~s/245/\%T/g;
    $callnum=~s/260/\%U/g;
    #reset counter if $prev is different than $item_id
    $counter = 0 if ($prev ne $item_id);
    ++$counter;
    $prev = $item_id;
    print "$counter\t$callnum\t$data\n";
} #End while
close DATA;

Reading numbers from a file to variables (Perl)

I've been trying to write a program to read columns of text-formatted numbers into Perl variables.
Basically, I have a file with descriptions and numbers:
ref 5.25676 0.526231 6.325135
ref 1.76234 12.62341 9.1612345
etc.
I'd like to put the numbers into variables with different names, e.g.
ref_1_x=5.25676
ref_1_y=0.526231
etc.
Here's what I've got so far:
print "Loading file ...";
open (FILE, "somefile.txt");
@text=<FILE>;
close FILE;
print "Done!\n";

my $count=0;
foreach $line (@text){
    @coord[$count]=split(/ +/, $line);
}
I'm trying to compare the positions written in the file to each other, so I will need another loop after this.
Sorry, you weren't terribly clear on what you're trying to do and what "ref" refers to. If I misunderstood your problem, please comment and clarify.
First of all, I would strongly recommend against using variable names to structure data (e.g. using $ref_1_x to store x coordinate for the first row with label "ref").
If you want to store x, y and z coordinates, you can do so as an array of 3 elements, pretty much like you did - the only difference is that you want to store an array reference (you can't store an array as a value in another array in Perl):
my ($first_column, @data) = split(/ +/, $line); # Remove first "ref" column
@coordinates[$count++] = \@data; # Store the reference to coordinate array
Then, to access the x coordinate for row 2, you do:
$coordinates[1]->[0]; # index 1 for row 2; then sub-index 0 for x coordinate.
If you insist on storing the 3 coordinates in named data structure, because sub-index 0 for x coordinate looks less readable - which is a valid concern in general but not really an issue with 3 columns - use a hash instead of array:
my ($first_column, @data) = split(/ +/, $line); # Remove first "ref" column
@coordinates[$count++] = { x => $data[0], y => $data[1], z => $data[2] };
# curly braces - {} - to store hash reference again
Then, to access the x coordinate for row 2, you do:
$coordinates[1]->{x}; # index 1 for row 2
Now, if you ALSO want to store the rows that have a first column value "ref" in a separate "ref"-labelled data structure, you can do that by wrapping the original #coordinates array into being a value in a hash with a key of "ref".
my ($label, @data) = split(/ +/, $line); # Save first "ref" label
$coordinates{$label} ||= []; # Assign an empty array ref
# if we did not create the array for a given label yet.
push @{ $coordinates{$label} }, { x => $data[0], y => $data[1], z => $data[2] };
# Since we don't want to bother counting per individual label,
# simply push the coordinate hash at the end of the appropriate array.
# Since the coordinate array is stored as an array reference,
# we must dereference for push() to work, using @{ MY_ARRAY_REF } syntax
Then, to access the x coordinate for row 2 for label "ref", you do:
$label = "ref";
$coordinates{$label}->[1]->{x}; # index 1 for row 2 for $label
Also, your original example code has a couple of outdated idioms that you may want to improve: use the 3-argument form of open(), check for errors on IO operations like open(), use lexical filehandles, and read line by line instead of storing the entire file in a big array.
Here's a slightly modified version:
use strict;

my %coordinates;

print "Loading file ...";
open (my $file, "<", "somefile.txt") || die "Can't read file somefile.txt: $!";
while (<$file>) {
    chomp;
    my ($label, @data) = split(/ +/); # Splitting $_ where while puts next line
    $coordinates{$label} ||= []; # Assign empty array ref if not yet assigned
    push @{ $coordinates{$label} }
        , { x => $data[0], y => $data[1], z => $data[2] };
}
close($file);
print "Done!\n";
It is not clear what you want to compare to what, so can't advise on that without further clarifications.
The problem is that you likely need a two-dimensional array (or hash, or ...). Instead of this:
@coord[$count]=split(/ +/, $line);
Use:
@coord[$count++]=[split(/ +/, $line)];
Which puts the entire results of the split into a sub array. Thus,
print $coord[0][1];
should output "5.25676".
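To make that concrete, here is a small self-contained check, inlining the two sample rows from the question rather than reading somefile.txt:
#!/usr/bin/env perl
use strict;
use warnings;

my @text = (
    "ref 5.25676 0.526231 6.325135\n",
    "ref 1.76234 12.62341 9.1612345\n",
);

my $count = 0;
my @coord;
foreach my $line (@text) {
    $coord[$count++] = [ split / +/, $line ];  # each row becomes a sub-array
}

print $coord[0][1], "\n";  # prints 5.25676 (field 0 is the "ref" label)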