Perl extract columns from two files based on condition

Perl extract columns from two files based on condition - perl

I have 2 files, say file1 and file2.
file1.txt
RAC1 GK1 111
RAC2 GK2 222
RAC1 GK3 333
RAC1 GK4 222
RAC2 GK5 111
file2.txt
R1,PAAE,222,TESTA,COLA,NO
R2,RWWG,111,TESTB,COLM,YES
R3,TDAS,444,TESTC,COLZ,NO
I am comparing 2 files and trying to extract data from them. Condition here is if Column3 value of file1 matches with Column3 value of file2 then print the following output -
RAC1,GK1,111,R2,RWWG,TESTB,COLM,YES
RAC2,GK5,111,R2,RWWG,TESTB,COLM,YES
RAC2,GK2,222,R1,PAAE,TESTA,COLA,NO
RAC1,GK4,222,R1,PAAE,TESTA,COLA,NO
I have written a script for the same, by taking file1 column2 value as key. But this column value doesn't exists in file2. So comparison is not working.
Even I am not able to take column3(from file1) as key, because its having duplicated values.
Code below -
my %hash1 = ();
open(FH1, "file1.txt");
while(<FH1>){
chomp($_);
my #val = split(' ', $_);
$hash1{$val[1]}{'RAC_VAL'} = $val[0];
$hash1{$val[1]}{'ID'} = $val[2];
}
#print Dumper(\%hash1);
open(FH2, "file2.txt");
while(<FH2>){
chomp($_);
my #array = split(',', $_);
print "$hash1{$array[2]}{'RAC_VAL'},,$hash1{$array[2]}{'ID'},$array[0],$array[1],$array[3],$array[4],$array[5]\n" if(exists $hash1{$array[2]}{'ID'});
}
Please help me to get output for above data files based on the above said condition.

Here is an example using array of arrays as values in %hash1 (since the keys are not unique):
use feature qw(say);
use strict;
use warnings;
my %hash1;
open(FH1, "file1.txt");
while(<FH1>){
chomp($_);
my #val = split(' ', $_);
push #{ $hash1{$val[2]} }, [ #val[0,1] ];
}
open(FH2, "file2.txt");
while(<FH2>){
chomp($_);
my #array = split(',', $_);
if ( exists $hash1{$array[2]} ) {
for my $item ( #{ $hash1{$array[2]} } ) {
say join ',', #$item, #array[0,1,3,4,5];
}
}
}
Output:
RAC2,GK2,R1,PAAE,TESTA,COLA,NO
RAC1,GK4,R1,PAAE,TESTA,COLA,NO
RAC1,GK1,R2,RWWG,TESTB,COLM,YES
RAC2,GK5,R2,RWWG,TESTB,COLM,YES

Related

Best way to compare multi-dimension hash in perl

I want to write a script that can compare two multi-dimension hash to see whether they match or not. This two hash has same value of key, the comparison will done in a pair mean starting with $j=0 first compare $line[0] $line[1] then follow by $line[2] $line[3]
$hash{"key"}{$key}{"$row $col"}= "$line[$j] $line[$j+1]";
$hash1{"key"}{$key1}{"$row1 $col1"}="$line1[$j] $line1[$j+1]";
my $line and $line1 is take from a huge file and will process the huge file line by line. For example:
#line = 1 2 3 4 5 6 7 8 #fst line from file1
#line1 = 1 2 3 3 4 5 6 7 7 #fst line from file2
when $key and $key1 match the scripts will continue to compare the $row $col then lastly compare $line[$j] $line[$j+1]. I am trying to use one of the module Test::More that mention at previous similar question but it can only compare the 1st not matching and the output is in default format.
not ok 1 - data structures should be the same
# Failed test 'data structures should be the same'
# at hash_check1.pl line 80.
# Structures begin differing at:
# $got->{key}{100}{2 3} = '62 19'
# $expected->{key}{100}{2 3} = '12 24'
# Tests were run but no plan was declared and done_testing() was not seen.
Any best way to compare this type of multi-dimension hash?
8/22 Edited
If Test::More module is able to compare all the $key match for two hash i can take that as well but if i can output the matching and non matching in my own prefer format will be great. For example i would like to output like
$key at $row $col no match with value $line[$j] $line[$j+1] ( expected value $line1[$j] $line1[$j+1] )
8/22 Edited
Below is part of my code
use strict;
use warnings;
use Test::More;
open ( FILE1 , '<', "file.txt" ) or die $!;
open ( FILE2 , '<' , "file1.txt" ) or die $!;
chomp (my #file1 = <FILE1>);
chomp (my #file2 = <FILE2>);
my %hash=();
my %hash1=();
for ( $i =0 ; $i<=511 ; $i++ ) {
my #line = split(" ",$file1[$i]);
my #line1 = split(" ",$file2[$i]);
my $key = ($i+1)*10;
my $key1 = ($i+2)*10;
for ( $j=0; $j<=15 ; $j+=2){
my $col = hex($j);
my $col1 = hex($j+1);
$hash{"key"}{$key}{"$row $col1"}= "$line[$j] $line[$j+1]";
$hash1{"key"}{$key1}{"$row1 $col1"}= "$line1[$j] $line1[$j+1]";
}
}
##############comparison part start here###################
is_deeply(\%hash, \%hash1, 'data structures should be the same'); #can only print one mismatch
################Any better way?###########

Here is an example that will compare nested hashes with string values:
#! /usr/bin/env perl
use feature qw(say);
use warnings;
use strict;
my %hash;
my %hash1;
$hash{"key"}{A}{"2 3"}= "1 2";
$hash1{"key"}{A}{"2 3"}="2 3";
$hash{"key"}{B}{"2 3"}= "1 2";
$hash1{"key"}{C}{"2 3"}="2 3";
compare(\%hash, \%hash1);
sub compare {
return _compare( #_, "" );
}
sub get_key_str { return $_[0] . '{' . $_[1] . '}' }
sub _process_key {
my ( $h1, $h2, $info, $key, $str1, $str2 ) = #_;
if ( exists $h2->{$key} ) {
my $val1 = $h1->{$key};
my $val2 = $h2->{$key};
if ( ref $val1 eq "HASH" and ref $val2 eq "HASH" ) {
_compare( $val1, $val2, get_key_str( $info, $key ) );
}
else {
die "Expected string value" if ref $val1 or ref $val2;
if ( $val1 ne $val2 ) {
say "Value '$val1' in $str1 hash for key " . get_key_str( $info, $key )
. " does not match value '$val2' in $str2 hash";
}
}
}
else {
my $cur_key = get_key_str( $info, $key );
say "Got key $cur_key in $str1 hash, but missing in $str2 hash";
}
}
sub _compare {
my ( $h1, $h2, $info ) = #_;
my %processed_keys;
for (keys %$h1) {
_process_key( $h1, $h2, $info, $_, "first", "second" );
$processed_keys{$_}++;
}
for (keys %$h2) {
next if exists $processed_keys{$_};
_process_key( $h2, $h1, $info, $_, "second", "first" );
}
}
Output:
Got key {key}{B} in first hash, but missing in second hash
Value '1 2' in first hash for key {key}{A}{2 3} does not match value '2 3' in second hash
Got key {key}{C} in second hash, but missing in first hash

Perl: Matching columns from two different files

(Note: Column headers are there for readability and are not in the actual files)
File 1
COLUMN1 COLUMN2 COLUMN3
AG_446337835.1 example1 grgsdt
AG_448352465.1 example2 190197
AG_449465753.1 example3 h837h8
AG_449366462.1 example4 d34tw4
AG_444725037.1 example5 f45ge4
AG_441227463.1 example6 f3fw4t
AG_449986090.1 example7 gft7r4
AG_445666926.1 example8 4vsr55
AG_441004541.1 example9 fh893b
AG_444837264.1 example0 k3883d
File 2
COLUMN1 COLUMN2
grgsdt AAHG
h837h8 JUJN
190197 POKJ
f45ge4 DFRF
gft7r4 NNHN
d34tw4
fh893b YUNIP
k3883d YUNIP
f3fw4t YUNIP
190197 YUNIP
4vsr55 GHGF
Desired Output file
COLUMN1 COLUMN2 COLUMN3 COLUMN4 (formerly column2 from file2)
AG_446337835.1 example1 grgsdt AAHG
AG_448352465.1 example2 190197 POKJ YUNIP
AG_449465753.1 example3 h837h8 JUJN
AG_449366462.1 example4 d34tw4
AG_444725037.1 example5 f45ge4 DFRF
AG_441227463.1 example6 f3fw4t YUNIP
AG_449986090.1 example7 gft7r4 NNHN
AG_445666926.1 example8 4vsr55 GHGF
AG_441004541.1 example9 fh893b YUNIP
AG_444837264.1 example0 k3883d YUNIP
I am barely familiar with Perl (or programming general) and I was wondering if you would mind advising me with this problem.
Essentially, Column 3 in file1 corresponds to Column 1 in File2.
I want to take each line in file1, read column 3 of that line, search file2 for a matching entry, if a matching entry exists print the line from file1 with an extra column from file 2 to a new file (as seen in the example output).
The file sizes are
File1: 2GB
File2: 718MB
This script will be run off a machine with 250GB of ram so memory is not an issue.
This is what I have so far
#!/usr/bin/perl ;
#use warnings;
use Getopt::Long qw(GetOptions);
use experimental 'smartmatch';
#Variable to store inputted text file data
my $db ;
my $db2 ;
#Open and read File one into memory
open FPIN, "file1.txt" or die "Could not open";
my #file1 = <FPIN> ;
close FPIN;
#Open and read file two into memory
open FPIN, "file2.tab" or die "Could not open";
my #file2 = <FPIN> ;
close FPIN ;
foreach (#file2)
{
if (/(^\w+)\t(.+)/)
{
split /\t/, $2;
$db2->{$1}->{"geneName"} = $1 ;
$db2->{$1}->{"protein"} = $2 ;
}
}
foreach (#file1)
{
#if line begins with any word character tab and anything
if (/(^\w+.\d+)\t(.+)/)
{
my #fields = split /\t/, $2;
my $refSeqID = $1;
#assign the data in the array to variables
my ($geneSymbol, $geneName) = #fields[0, 1];
#Create database data structure and fill it with the info
$db->{$2}->{"refSeqID"} = $refSeqID ;
$db->{$2}->{"geneSymbol"} = $geneSymbol ;
$db->{$2}->{"geneName"} = $geneName ;
}
}
foreach my $id (sort keys %{$db2})
{
if ( exists $db->{$id} )
{
print $db2->{$id}."\t".$db->{$id}->{$geneSymbol}."\t".$db->{$id}->
{$refSeqID}."\t".$db2->{$id}->{$protein}."\n";
}
}
I seem to be able to read both files into memory correctly.
However I have been completely unable to compare the files to each other and I am dumbstruck on how to approach it.
Actually printing it will be another issue I need to tackle.

This will do as you ask
It starts by reading file2.txt and building a hash %f2 that relates the value of the first column to the value of the second
Thereafter it's just a matter of reading through file1.txt, splitting it into fields, and adding a further field obtained by accessing the hash using the value of the third field
I've used autodie to save the trouble of handling errors in the open calls. Otherwise everything is standard
Update
I've just noticed that a column 1 value may be repeated in file2.txt, so I've changed the code to make each key of the hash correspond to an array of values. All the values in the array appear, space-separated, in column 4 of the output
use strict;
use warnings 'all';
use autodie;
my %f2;
{
open my $fh, '<', 'file2.txt';
while ( <$fh> ) {
my ($key, $val) = split;
$f2{$key} //= [];
push #{ $f2{$key} }, $val if $val;
}
}
open my $fh, '<', 'file1.txt';
while ( <$fh> ) {
my #line = split;
my $c4 = $f2{$line[2]};
push #line, $c4 ? join(' ', #$c4) : '';
local $" = "\t";
print "#line\n";
}
output
AG_446337835.1 example1 grgsdt AAHG
AG_448352465.1 example2 190197 POKJ YUNIP
AG_449465753.1 example3 h837h8 JUJN
AG_449366462.1 example4 d34tw4
AG_444725037.1 example5 f45ge4 DFRF
AG_441227463.1 example6 f3fw4t YUNIP
AG_449986090.1 example7 gft7r4 NNHN
AG_445666926.1 example8 4vsr55 GHGF
AG_441004541.1 example9 fh893b YUNIP
AG_444837264.1 example0 k3883d YUNIP

This one makes a left join. The key idea is to use geneName as a key in a hash.
#! /usr/bin/perl
use strict;
use warnings;
my %data = ();
open $a, "file1";
while (<$a>) {
chomp;
my #c = split;
$data{$c[2]} = [$c[0], $c[1], $c[2]];
}
open $b, "file2";
while (<$b>) {
chomp;
my #c = split;
push #{$data{$c[0]}}, exists $c[1] ? $c[1] : "";
}
print map { "#{$_}\n" } values %data;

perl match part of string in two files

I'm using a perl script to look for matches between columns in two tab-delimited files. However for one column I only want to look for a partial match between two strings in two columns.
It concerns $row[4] of $table2 and $row{d} of $table1.
The values in $row[4] of $table2 look like this:
'xxxx'.
The values in $row{d} of $table1 look like this:
'xxxx.aaa'.
If the part before the '.' is the same, there is a match. If not, there is no match. I'm not sure how to implement this in my script. This is what I have so far. I only looks for complete matches between different columns. '...' denotes code that is not important for this question
#! /usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
local $Data::Dumper::Useqq = 1;
use Getopt::Long qw(GetOptions);
...
...
chomp( my #header_table2 = split /\t/, <$table2> );
my %lookup;
while(<$table2>){
chomp;
my #row = split(/\t/);
$lookup{ $row[0] }{ $row[1] }{ $row[4] }{ $row[5] }{ $row[6] }{ $row[7] }{ $row[8] } = [ $row[9], $row[10] ];
}
my #header = do {
my $header = <$table1>;
$header =~ s/\t?\n\z//;
split /\t/, $header;
};
print $table3 join ("\t", #header, qw/ name1 name2 /), "\n";
{
no warnings 'uninitialized';
while(<$table1>){
s/\t?\n\z//;
my %row;
#row{#header} = split /\t/;
print $table3 join ( "\t", #row{#header},
#{ $lookup{ $row{a} }{ $row{b} }{ $row{c} }{ $row{d} }{ $row{e} }{ $row{f} }{ $row{g} }
// [ "", "" ] }), "\n";
}
}

This is looking like a job for a database
The solution below isn't going to work, because you are building your %lookup hash with nine levels of keys ($row[0] .. $row[8]) , and accessing it with only seven levels ($row{a} .. $row{g}), so you're going to have to edit in the real situation
I see no reason to next your hashes so deeply. A single key formed by using join on the relevant fields will work fine and probably a little faster. I also see no reason to extract table2 fields into an array and table1 fields into a hash. An array seems fine in both cases
I've solved your immediate problem by copying each #row from table1 into array #key, and removing the last dot and anything following from the fourth element before building the $key string
In view of your history of adding a spare tab character before the newline at the end of each record, I've also added four die statements that verify the size of the header row and columns rows before continuing. You will probably need to tweak those values according to your real data
use strict;
use warnings 'all';
use Data::Dumper;
local $Data::Dumper::Useqq = 1;
use Getopt::Long qw(GetOptions);
use constant TABLE1_COLUMNS => 9;
use constant TABLE2_COLUMNS => 11;
open my $table2, '<', 'table2.txt' or die $!;
my #header_table2 = do {
my $header = <$table2>;
$header =~ s/\t?\n\z//;
split /\t/, $header;
};
die "Incorrect table 2 header count " . scalar #header_table2
unless #header_table2 == TABLE2_COLUMNS;
my %lookup;
while ( <$table2> ) {
chomp;
my #row = split /\t/;
die "Incorrect table 2 column count " . scalar #row
unless #row == TABLE2_COLUMNS;
my $key = do {
local $" = "\n";
"#row[0..8]";
};
$lookup{ $key } = [ #row[9,10] ];
}
open my $table1, '<', 'table1.txt' or die $!;
my #header = do {
my $header = <$table1>;
$header =~ s/\t?\n\z//;
split /\t/, $header;
};
die "Incorrect table 1 header count " . scalar #header
unless #header == TABLE1_COLUMNS;
open my $table3, '>', 'table3.txt' or die $!;
print $table3 join ("\t", #header, qw/ name1 name2 /), "\n";
while ( <$table1> ) {
s/\t?\n\z//;
my #row = split /\t/;
die "Incorrect table 1 column count " . scalar #row
unless #row == TABLE1_COLUMNS;
my $key = do {
my #key = #row;
$key[3] =~ s/\.[^.]*\z//;
local $" = "\n";
"#key";
};
my $lookup = $lookup{ $key } // [ "", "" ];
print $table3 join("\t", #row, #$lookup), "\n";
}

You're going to have a scoping problem because your array #row and your hash %row both exist in completely different scopes.
But if you have variables (say, $foo and $bar) and you want to know if $foo starts with the contents of $bar followed by a dot, then you can do that using a regular expression check like this:
if ($foo =~ /^$bar\./) {
# match
} else {
# no match
}

Perl compare individual elements of two arrays

I have two files with two columns each:
FILE1
A B
1 #
2 #
3 !
4 %
5 %
FILE 2
A B
3 #
4 !
2 &
1 %
5 ^
The Perl script must compare column A in both both files, and only if they are equal, column B of FIlE 2 must be printed
So far I have the following code but all I get is an infinite loop with # from column B
use strict;
use warnings;
use 5.010;
print "enter site:"."\n";
chomp(my $s = <>);
print "enter protein:"."\n";
chomp(my $p = <>);
open( FILE, "< $s" ) or die;
open( OUT, "> PSP.txt" ) or die;
open( FILE2, "< $p" ) or die;
my #firstcol;
my #secondcol;
my #thirdcol;
while ( <FILE> )
{
next if $. <2;
chomp;
my #cols = split;
push #firstcol, $cols[0];
push #secondcol, $cols[1]."\t"."\t".$cols[3]."\t"."\t"."\t"."N\/A"."\n";
}
my #firstcol2;
my #secondcol2;
my #thirdcol2;
while ( <FILE2> )
{
next if $. <2;
my #cols2 = split(/\t/, $_);
push #firstcol2, $cols2[0];
push #secondcol2, $cols2[4]."\n";
}
my $size = #firstcol;
my $size2 = #firstcol2;
for (my $i = 0; $i <= #firstcol ; $i++) {
for (my $j = 0; $j <= #firstcol2; $j++) {
if ( $firstcol[$i] eq $firstcol2[$j] )
{
print $secondcol2[$i];
}
}
}

my (#first, #second);
while(<first>){
chomp;
my $foo = split / /, $_;
push #first , $foo;
}
while(<second>){
chomp;
my $bar = split / / , $_;
push #second, $bar;
}
my %first = #first;
my %second = #second;
Build a hash of the first file as %first and second file as %second with first column as key and second column as value.
for(keys %first)
{
print $second{$_} if exists $second{$_}
}
I couldn't check it as I am on mobile. hope that gives you an idea.

I assume that column A is ordered and that you actually want to compare the first entry in File 1 to the first entry in File 2, and so on.
If that's true, you have nested loop that you don't need. Simplify your last while as such:
for my $i (0..$#firstcol) {
if ( $firstcol[$i] eq $firstcol2[$i] )
{
print $secondcol2[$i];
}
}
Also, if you're at all concerned about the files being of different length, then you can adjust the loop:
use List::Util qw(min);
for my $i (0..min($#firstcol, $#firstcol2)) {
Additional Note: You aren't chomping your data in the second file loop while ( <FILE2> ). That might introduce a bug later.

If your files are called file1.txt and file2.txt the next:
use Modern::Perl;
use Path::Class;
my $files;
#{$files->{$_}} = map { [split /\s+/] } grep { !/^\s*$/ } file("file$_.txt")->slurp for (1..2);
for my $line1 (#{$files->{1}}) {
my $line2 = shift #{$files->{2}};
say $line2->[1] if ($line1->[0] eq $line2->[0]);
}
prints:
B
^
equals in column1 only the lines A and 5
without the CPAN modules - produces the same result
use strict;
use warnings;
my $files;
#{$files->{$_}} = map { [split /\s+/] } grep { !/^\s*$/ } do { local(#ARGV)="file$_.txt";<> } for (1..2);
for my $line1 (#{$files->{1}}) {
my $line2 = shift #{$files->{2}};
print $line2->[1],"\n" if ($line1->[0] eq $line2->[0]);
}

comparing the contents of 2 lines of a file and adding them using perl

I have a file called listofvalues.txt. The file has over 1000 lines and 5 columns.
1,232,3434,54343,434343
1,232,100,4546,3456
1,122,45454,4546,3456
2,212,334,5555,4654
...
...
I want to add up the values of the third column if column 1 and 2 are equal and print the result into a file like as follows
1,232,3534,54343,434343
1,122,45454,4546,3456
2,212,334,5555,4654
....
.........
.........
......
how do you think I can do it in Perl? Since I am new to Perl I am finding it hard to do.

This program works by maintaining an array #data containing a list of all records with unique column1|column2 keys. The first time a new key is encountered in the file the complete record is pushed onto the stack. Each subsequent encounter just adds the third field of the record to the original value.
The hash %data maintains references to the element of #data corresponding to each different value of the key.
use strict;
use warnings;
open my $fh, '<', 'listofvalues.txt' or die $!;
my #data;
my %data;
while (<$fh>) {
chomp;
my #record = split /,/;
my $key = join '|', #record[0,1];
if ($data{$key}) {
$data{$key}[2] += $record[2];
}
else {
push #data, ($data{$key} = \#record);
}
}
print join(',', #$_), "\n" for #data;
output
1,232,3534,54343,434343
1,122,45454,4546,3456
2,212,334,5555,4654
Update
A one-line solution
perl -F, -ane '$k="#F[0,1]";$s{$k}?$s{$k}[2]+=$F[2]:do{push#d,$k;$s{$k}=[#F]};END{$\"=',';print"#{$s{$_}}"for#d}' listofvalues.txt

Here is another one-linerish:
perl -F, -lane '
BEGIN { $, = "," }
if(defined(#A)) {
if($A[0] == $F[0] and $A[1] == $F[1]) {
$A[3] += $F[3];
} else {
print #A;
#A = (#F);
}
} else {
#A = (#F);
}
END { print #A }' listofvalues.txt
See perlrun(1) for the implications of switches.

Just because you can do it in a one-liner doesn't mean you should ;)
$ perl -F',' -lane '
push #order, [ #F[0,1] ]
unless $seen{$F[0]}{$F[1]}++; # Preserve order
$total{$F[0]}{$F[1]} += $F[2]; # Sum up
$value{$F[0]}{$F[1]} = join ',' => #F[0,1], $total{$F[0]}{$F[1]}, #F[3..$#F];
} END {
print $value{$_->{0]}{$_->[1]} for #order;
' file.txt

You could try the database approach, although it doesn't deal with col4 or col5.
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
my $dbh = DBI->connect("DBI:CSV:");
$dbh->{'csv_tables'}->{'data'} = { 'file' => 'o33.txt',
'col_names' => [qw/col1 col2 col3 col4 col5/]};
my $sql = <<SQL;
select col1, col2, SUM(col3)
from data
group by col1, col2
order by col1, col2
SQL
my $sth = $dbh->prepare( $sql );
$sth->execute;
{
local $" = ',';
while ( my $row = $sth->fetchrow_arrayref ) {
print "#$row\n";
}
}
__END__
C:\Old_Data\perlp>type o33.txt
1,232,3434,54343,434343
1,232,100,4546,3456
1,122,45454,4546,3456
2,212,334,5555,4654
C:\Old_Data\perlp>perl t3.pl
1,122,45454
1,232,3534
2,212,334

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Perl extract columns from two files based on condition - perl

Related

Best way to compare multi-dimension hash in perl

Perl: Matching columns from two different files

perl match part of string in two files

Perl compare individual elements of two arrays

comparing the contents of 2 lines of a file and adding them using perl

Categories

Resources