Perl: Matching columns from two different files

(Note: Column headers are there for readability and are not in the actual files)
File 1
COLUMN1 COLUMN2 COLUMN3
AG_446337835.1 example1 grgsdt
AG_448352465.1 example2 190197
AG_449465753.1 example3 h837h8
AG_449366462.1 example4 d34tw4
AG_444725037.1 example5 f45ge4
AG_441227463.1 example6 f3fw4t
AG_449986090.1 example7 gft7r4
AG_445666926.1 example8 4vsr55
AG_441004541.1 example9 fh893b
AG_444837264.1 example0 k3883d
File 2
COLUMN1 COLUMN2
grgsdt AAHG
h837h8 JUJN
190197 POKJ
f45ge4 DFRF
gft7r4 NNHN
d34tw4
fh893b YUNIP
k3883d YUNIP
f3fw4t YUNIP
190197 YUNIP
4vsr55 GHGF
Desired Output file
COLUMN1 COLUMN2 COLUMN3 COLUMN4 (formerly column2 from file2)
AG_446337835.1 example1 grgsdt AAHG
AG_448352465.1 example2 190197 POKJ YUNIP
AG_449465753.1 example3 h837h8 JUJN
AG_449366462.1 example4 d34tw4
AG_444725037.1 example5 f45ge4 DFRF
AG_441227463.1 example6 f3fw4t YUNIP
AG_449986090.1 example7 gft7r4 NNHN
AG_445666926.1 example8 4vsr55 GHGF
AG_441004541.1 example9 fh893b YUNIP
AG_444837264.1 example0 k3883d YUNIP
I am barely familiar with Perl (or programming in general) and I was wondering if you would mind advising me on this problem.
Essentially, column 3 in file1 corresponds to column 1 in file2.
I want to take each line in file1, read column 3 of that line, and search file2 for a matching entry; if a matching entry exists, print the line from file1 with an extra column from file2 to a new file (as seen in the example output).
The file sizes are:
File1: 2GB
File2: 718MB
This script will be run on a machine with 250GB of RAM, so memory is not an issue.
This is what I have so far
#!/usr/bin/perl ;
#use warnings;
use Getopt::Long qw(GetOptions);
use experimental 'smartmatch';
#Variable to store inputted text file data
my $db ;
my $db2 ;
#Open and read File one into memory
open FPIN, "file1.txt" or die "Could not open";
my @file1 = <FPIN>;
close FPIN;
#Open and read file two into memory
open FPIN, "file2.tab" or die "Could not open";
my @file2 = <FPIN>;
close FPIN ;
foreach (@file2)
{
if (/(^\w+)\t(.+)/)
{
split /\t/, $2;
$db2->{$1}->{"geneName"} = $1 ;
$db2->{$1}->{"protein"} = $2 ;
}
}
foreach (@file1)
{
#if line begins with any word character tab and anything
if (/(^\w+.\d+)\t(.+)/)
{
my @fields = split /\t/, $2;
my $refSeqID = $1;
#assign the data in the array to variables
my ($geneSymbol, $geneName) = @fields[0, 1];
#Create database data structure and fill it with the info
$db->{$2}->{"refSeqID"} = $refSeqID ;
$db->{$2}->{"geneSymbol"} = $geneSymbol ;
$db->{$2}->{"geneName"} = $geneName ;
}
}
foreach my $id (sort keys %{$db2})
{
if ( exists $db->{$id} )
{
print $db2->{$id}."\t".$db->{$id}->{$geneSymbol}."\t".$db->{$id}->{$refSeqID}."\t".$db2->{$id}->{$protein}."\n";
}
}
I seem to be able to read both files into memory correctly.
However I have been completely unable to compare the files to each other and I am dumbstruck on how to approach it.
Actually printing it will be another issue I need to tackle.

This will do as you ask
It starts by reading file2.txt and building a hash %f2 that relates the value of the first column to the value of the second
Thereafter it's just a matter of reading through file1.txt, splitting it into fields, and adding a further field obtained by accessing the hash using the value of the third field
I've used autodie to save the trouble of handling errors in the open calls. Otherwise everything is standard
Update
I've just noticed that a column 1 value may be repeated in file2.txt, so I've changed the code to make each key of the hash correspond to an array of values. All the values in the array appear, space-separated, in column 4 of the output
use strict;
use warnings 'all';
use autodie;
my %f2;
{
open my $fh, '<', 'file2.txt';
while ( <$fh> ) {
my ($key, $val) = split;
$f2{$key} //= [];
push @{ $f2{$key} }, $val if $val;
}
}
open my $fh, '<', 'file1.txt';
while ( <$fh> ) {
my @line = split;
my $c4 = $f2{$line[2]};
push @line, $c4 ? join(' ', @$c4) : '';
local $" = "\t";
print "#line\n";
}
output
AG_446337835.1 example1 grgsdt AAHG
AG_448352465.1 example2 190197 POKJ YUNIP
AG_449465753.1 example3 h837h8 JUJN
AG_449366462.1 example4 d34tw4
AG_444725037.1 example5 f45ge4 DFRF
AG_441227463.1 example6 f3fw4t YUNIP
AG_449986090.1 example7 gft7r4 NNHN
AG_445666926.1 example8 4vsr55 GHGF
AG_441004541.1 example9 fh893b YUNIP
AG_444837264.1 example0 k3883d YUNIP
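The question asks for the result to go into a new file; the script above prints to STDOUT, so the simplest option is shell redirection (perl script.pl > output.txt). Alternatively, here is a small sketch of printing to a file handle directly (the name output.txt is just an example):
open my $out, '>', 'output.txt';   # autodie, already loaded above, handles any open error
# then, inside the file1.txt loop, print to that handle instead of STDOUT
print {$out} "@line\n";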

This one makes a left join. The key idea is to use geneName as a key in a hash.
#! /usr/bin/perl
use strict;
use warnings;
my %data = ();
open $a, "file1";
while (<$a>) {
chomp;
my @c = split;
$data{$c[2]} = [$c[0], $c[1], $c[2]];
}
open $b, "file2";
while (<$b>) {
chomp;
my @c = split;
push @{$data{$c[0]}}, exists $c[1] ? $c[1] : "";
}
print map { "#{$_}\n" } values %data;

Related

Perl: Read columns and convert to array

I am new to Perl and I'm trying to read a file with columns and create arrays from it.
I have a file with the following columns.
file.txt
A 15
A 20
A 33
B 20
B 45
C 32
C 78
I want to create an array for each unique item in the first column, with its values taken from the second column.
eg:
@A = (15,20,33)
@B = (20,45)
@C = (32,78)
I tried the following code, but it only prints the two columns:
use strict;
use warnings;
my $filename = $ARGV[0];
open(FILE, $filename) or die "Could not open file '$filename' $!";
my %seen;
while (<FILE>)
{
chomp;
my $line = $_;
my @elements = split (" ", $line);
my $row_name = join "\t", @elements[0,1];
print $row_name . "\n" if ! $seen{$row_name}++;
}
close FILE;
Thanks
Firstly some general Perl advice. These days, we like to use lexical variables as filehandles and pass three arguments to open().
open(my $fh, '<', $filename) or die "Could not open file '$filename' $!";
And then...
while (<$fh>) { ... }
But, given that you have your filename in $ARGV[0], another tip is to use an empty file input operator (<>) which will return data from the files named in @ARGV without you having to open them. So you can remove your open() line completely and replace the while with:
while (<>) { ... }
Second piece of advice - don't store this data in individual arrays. Far better to store it in a more complex data structure. I'd suggest a hash where the key is the letter and the value is an array containing all of the numbers matching that letter. This is surprisingly easy to build:
use strict;
use warnings;
use feature 'say';
my %data; # I'd give this a better name if I knew what your data was
while (<>) {
chomp;
my ($letter, $number) = split; # splits $_ on whitespace by default
push @{ $data{$letter} }, $number;
}
# Walk the hash to see what we've got
for (sort keys %data) {
say "$_ : #{ $data{$_ } }";
}
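If separate arrays such as @A really are wanted, as in the question, they can be pulled out of the hash afterwards. A small sketch, assuming the %data hash built above (the // default requires Perl 5.10 or later, which the say feature above already implies):
my @A = @{ $data{A} // [] };   # (15, 20, 33) with the sample input
my @B = @{ $data{B} // [] };
my @C = @{ $data{C} // [] };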
Change the loop to be something like:
while (my $line = <FILE>)
{
chomp($line);
my #elements = split (" ", $line);
push(@{$seen{$elements[0]}}, $elements[1]);
}
This will create/append a list of each item as it is found, and result in a hash where the keys are the left items, and the values are lists of the right items. You can then process or reassign the values as you wish.

perl read file from specified line thru the end

I'm new to Perl. I'm trying to read a large comma-separated file, split it, and grab only some columns. I was able to put this together with some help from the internet, but I'm struggling to change the code so that it starts reading from a specific line through the end of the file.
My need is: open the file, start reading at line 12, split on ',', grab columns 0, 2, 10, and 11, and concatenate those columns with '\t'.
Here is my code:
#!/usr/bin/perl
my $filename = 'file_to_read.csv';
open(FILER, $filename) or die "Could not read $filename.";
open(FILEW, ">$filename.txt") || die "couldn't create the file\n";
while(<FILER>) {
chomp;
my @fields = split(',', $_);
print FILEW "$fields[0]\t$fields[3]\t$fields[10]\t$fields[11]\n";
}
close FILER;
close FILEW;
Here is an example of the file:
[Header]
GSGT Version: X
Processing Date:12/01/2010 7:20 PM
Content:
Num SNPs:
Total SNPs:
Num Samples:
Total Samples:
Sample:
[Data]
SNP Name,Chromosome,Pos,GC Score,Theta,R,X,Y,X Raw,Y Raw,B Allele Freq,Log R Ratio,Allele1 - TOP,Allele2 - TOP
1:10001102-G-T,1,10001102,0.4159,0.007,0.477,0.472,0.005,6281,126,0.0000,-0.2581,A,A
1:100011159-T-G,1,100011159,0.4259,0.972,0.859,0.036,0.822,807,3648,0.9942,-0.0304,C,C
1:10002775-GA,1,10002775,0.4234,0.977,1.271,0.043,1.228,809,5140,0.9892,0.0111,G,G
Rather than skipping until a specific line number, which may vary from file to file, it is best to keep track of the current section of the file marked by [Header], [Data] etc.
This solution keeps a state variable $section which is updated to the current section name every time a [Section] label is encountered in the file. Everything from the Data section is summarised and printed
A similar thing could be done with the column headers, using names instead of numbers to select the fields to be output, but I have chosen to keep the complexity down; there is a sketch of that idea after the output below.
use strict;
use warnings 'all';
use feature 'say';
my $filename = 'file_to_read.csv';
open my $fh, '<', $filename or die qq{Unable to open "$filename" for input: $!};
my $section = "";
while ( <$fh> ) {
next unless /\S/; # Skip empty lines
if ( $section eq 'Data' ) { # Skip unless we're in the [Data] section
chomp;
my @fields = split /,/;
say join ',', @fields[0,3,10,11];
}
elsif ( /\[(\w+)\]/ ) {
$section = $1;
}
}
output
SNP Name,GC Score,B Allele Freq,Log R Ratio
1:10001102-G-T,0.4159,0.0000,-0.2581
1:100011159-T-G,0.4259,0.9942,-0.0304
1:10002775-GA,0.4234,0.9892,0.0111
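As mentioned above, the fields could also be selected by their header names instead of hard-coded indices. Here is a rough sketch of that idea (not part of the original solution; it assumes the column names shown in the sample data):
use strict;
use warnings 'all';
use feature 'say';
my $filename = 'file_to_read.csv';
open my $fh, '<', $filename or die qq{Unable to open "$filename" for input: $!};
my @want = ('SNP Name', 'GC Score', 'B Allele Freq', 'Log R Ratio');
my @indices;          # column positions, resolved from the header line
my $section = "";
while ( <$fh> ) {
    next unless /\S/;
    chomp;
    if ( /\[(\w+)\]/ ) {
        $section = $1;
    }
    elsif ( $section eq 'Data' ) {
        my @fields = split /,/;
        if ( !@indices ) {                        # the first Data line holds the column names
            my %pos;
            $pos{ $fields[$_] } = $_ for 0 .. $#fields;
            @indices = @pos{@want};
        }
        say join ',', @fields[@indices];
    }
}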
Alternatively, assign a variable to count the lines processed, like my $line_count = 0;, increment it at the beginning of the while loop with $line_count++;, and skip any line before line 12, i.e. next if $line_count < 12;
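A rough sketch of that counter approach, spliced into the loop from the question (it hard-codes line 12 as the start of the data, which is why the section-based answer above is more robust):
my $line_count = 0;
while (<FILER>) {
    $line_count++;
    next if $line_count < 12;   # skip the header block on lines 1-11
    chomp;
    my @fields = split(',', $_);
    print FILEW "$fields[0]\t$fields[3]\t$fields[10]\t$fields[11]\n";
}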

When trying to print an array from sub only the first element prints

I'm writing a Perl script that requires me to pull out a whole column from a file and manipulate it. For example, take out column A and compare it to another column in another file:
A B C
A B C
A B C
So far I have:
sub routine1
{
( $_ = <FILE> )
{
next if $. < 2; # to skip header of file
my @array1 = split(/\t/, $_);
my $file1 = $array1[@_];
return $file1;
}
}
I have most of it done. The only problem is that when I print what the subroutine returns, it only prints the first element of the array (i.e. it will only print one A).
I am sure that what you actually have is this
sub routine1
{
while ( $_ = <FILE> )
{
next if $. < 2; # to skip header of file
my @array1 = split(/\t/, $_);
my $file1 = $array1[@_];
return $file1;
}
}
which does compile, and reads the file one line at a time in a loop.
There are two problems here. First of all, as soon as your loop has read the first line of the file (after the header) the return statement exits the subroutine, returning the only field it has read. That is why you get only a single value.
Secondly, you have indexed your @array1 with @_. What that does is take the number of elements in @_ (usually one) and use that to index @array1. You will therefore always get the second element of the array.
I'm not clear what you expect as a result, but you should write something like this. It accumulates all the values from the specified column into the array @retval, and passes the file handle into the subroutine instead of just using a global, which is poor programming practice.
use strict;
use warnings;
open my $fh, '<', 'myfile.txt' or die $!;
my @column2 = routine1($fh, 1);
print "@column2\n";
sub routine1 {
my ($fh, $index) = @_;
my @retval;
while ($_ = <$fh>) {
next if $. < 2; # to skip header of file
my @fields = split /\t/;
my $field = $fields[$index];
push @retval, $field;
}
return @retval;
}
output
B B
Try replacing most of your sub with something like this:
my @aColumn = ();
while (<FILE>)
{
chomp;
($Acol, $Bcol, $Ccol) = split("\t");
push(@aColumn, $Acol);
}
return @aColumn
Jumping to the end, the following will pull out the first column in your file blah.txt and put it in an array for you to manipulate later:
use strict;
use warnings;
use autodie;
my $file = 'blah.txt';
open my $fh, '<', $file;
my @firstcol;
while (<$fh>) {
chomp;
my @cols = split;
push @firstcol, $cols[0];
}
use Data::Dump;
dd \@firstcol;
What you have right now isn't actually looping on the contents of the file, so you aren't going to be building an array.
Here are a few items for you to consider when crafting a subroutine solution for obtaining an array of column values from a file:
Skip the file header before entering the while loop to avoid a line-number comparison for each file line.
split only the number of columns you need by using split's LIMIT. This can significantly speed up the process.
Optionally, initialize a local copy of Perl's @ARGV with the file name, and let Perl handle the file I/O.
Borodin's solution of creating a subroutine that takes both the file name and the column number is excellent, so it's implemented below, too:
use strict;
use warnings;
my @colVals = getFileCol( 'File.txt', 0 );
print "@colVals\n";
sub getFileCol {
local @ARGV = (shift);
my ( $col, @arr ) = shift;
<>; # skip file header
while (<>) {
my $val = ( split ' ', $_, $col + 2 )[$col] or next;
push @arr, $val;
}
return @arr;
}
Output on your dataset:
A A
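As an aside, here is a standalone illustration of split's LIMIT argument mentioned above (my own example, not from the original answer):
my @all   = split ' ', 'A 15 20 33';      # ('A', '15', '20', '33')
my @first = split ' ', 'A 15 20 33', 2;   # ('A', '15 20 33') - stops after the first field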
Hope this helps!

perl to loop and check across files

I'm still having trouble with Perl programming, and I need a push to make a script work.
I have two files, and I want to use the list file to "extract" rows from the data file. The problem is that the list file is formatted as follows:
X1 A B
X2 C D
X3 E F
And my data looks like this:
A X1 2 5
B X1 3 7
C X2 1 4
D X2 1 5
I need to obtain the element pairs from the list file and use them to select rows in the data file. At the same time I would like to write an output like this:
X1 A B 2 5 3 7
X2 C D 1 4 1 5
I'm trying to write some Perl code, but I'm not able to produce anything useful. I'm at this point:
open (LIST, "< $fils_list") || die "impossibile open the list";
@list = <LIST>;
close (LIST);
open (HAN, "< $data") || die "Impossible open data";
@r = <HAN>;
close (HAN);
for ($p=0; $p<=$#list; $p++){
chomp ($list[$p]);
($x, $id1, $id2) = split (/\t/, $list[$p]);
$pair_one = $id1."\t".$x;
$pair_two = $id2."\t".$x;
for ($i=0; $i<=$#r; $i++){
chomp ($r[$i]);
($a, $b, $value1, $value2) = split (/\t/, $r[$i]);
$bench = $a."\t".$b;
if (($pair_one eq $bench) || ($pair_two eq $bench)){
print "I don't know what does this script must print!\n";
}
}
}
I'm not able to work out what the script should print.
Any kind of suggestion is very welcome!
A few general recommendations:
Indent your code to show the structure of your program.
Use meaningful variable names, not $a or $value1 (if I do so below, this is due to my lack of domain knowledge).
Use data structures that suit your program.
Don't do operations like parsing a line more than once.
In Perl, every program should use strict; use warnings;.
use autodie for automatic error handling.
Also, use the open function like open my $fh, "<", $filename as this is safer.
Remember what I said about data structures? In the second file, you have entries like
A X1 2 5
This looks like a secondary key, a primary key, and some data columns. Key-value relationships are best expressed through a hash table.
use strict; use warnings; use autodie;
use feature 'say'; # available since 5.010
open my $data_fh, "<", $data;
my %data;
while (<$data_fh>) {
chomp; # remove newlines
my ($id2, $id1, @data) = split /\t/;
$data{$id1}{$id2} = \@data;
}
Now %data is a nested hash which we can use for easy lookups:
open my $list_fh, "<", $fils_list;
LINE: while(<$list_fh>) {
chomp;
my ($id1, @id2s) = split /\t/;
my $data_id1 = $data{$id1};
defined $data_id1 or next LINE; # maybe there isn't anything here. Then skip
my @values = map @{ $data_id1->{$_} }, @id2s; # map the 2nd level ids to their values and flatten the list
# now print everything out:
say join "\t", $id1, #id2s, #values;
}
The map function is a bit like a foreach loop, and builds a list of values. We need the @{ ... } here because the data structure doesn't hold arrays, but references to arrays. The @{ ... } is a dereference operator.
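In case that combination is unfamiliar, here is a tiny standalone illustration with made-up data (not from the files in the question):
my %lookup = (
    A => [ 2, 5 ],    # each value is a reference to an array
    B => [ 3, 7 ],
);
my @ids    = ('A', 'B');
my @values = map @{ $lookup{$_} }, @ids;   # dereference each array and flatten
print "@values\n";                         # prints: 2 5 3 7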
This is how I would do it, mostly using hashes, or rather hash and array references (test1.txt and test2.txt contain the data you provided in your example):
use strict;
use warnings;
open(my $f1, '<','test1.txt') or die "Cannot open file1: $!\n";
open(my $f2, '<','test2.txt') or die "Cannot open file2: $!\n";
my @data1 = <$f1>;
my @data2 = <$f2>;
close($f1);
close($f2);
chomp @data1;
chomp @data2;
my %result;
foreach my $line1 (@data1) {
my @fields1 = split(' ',$line1);
$result{$fields1[0]}->{$fields1[1]} = [];
$result{$fields1[0]}->{$fields1[2]} = [];
}
foreach my $line2 (@data2){
my @fields2 = split(' ',$line2);
push @{$result{$fields2[1]}->{$fields2[0]}}, $fields2[2];
push @{$result{$fields2[1]}->{$fields2[0]}}, $fields2[3];
}
foreach my $res (sort keys %result){
foreach (sort keys %{$result{$res}}){
print $res . " " . $_ . " " . join (" ", sort #{$result{$res}->{$_}}) . "\n";
}
}

Multiple values in a hash for a single key

I have two files.
One contains a unique list of names, while the other is a redundant list of names with ages.
for example
File1:    File2:
Gaia      Gaia 3
Matt      Matt 12
Jane      Gaia 89
          Reuben 4
My aim is to match File1 and File2 and to retrieve the highest age for each name.
So far I have written the below code.
The bit that does not work quite well is: when the same key is found in the hash, keep the bigger value.
Any suggestion/comment is welcome!
Thanks!!
#!/usr/bin/perl -w
use strict;
open (FILE1, $ARGV[0] )|| die "unable to open arg1\n"; #Opens first file for comparison
open (FILE2, $ARGV[1])|| die "unable to open arg2\n"; #2nd for comparison
my @not_red = <FILE1>;
my @exonslength = <FILE2>;
#2) Produce a hash of File2. If the key is already in the hash, keep the key-value pair with the highest value. Otherwise, next.
my %hash_doc2;
my @split_exons;
my $key;
my $value;
foreach my $line (#exonslength) {
@split_exons = split "\t", $line;
@hash_doc2{$split_exons[0]} = ($split_exons[1]);
if (exists $hash_doc2{$split_exons[0]}) {
if ( $hash_doc2{$split_exons[0]} > values %hash_doc2) {
$hash_doc2{$split_exons[0]} = ($split_exons[1]);
} else {next;}
}
}
#3) grep the non-redundant list of genes from the hash, with the corresponding values
my @a = grep (@not_red,%hash_doc2);
print "@a\n";
Do you need to keep all the values? If not, you can keep only the max value:
@split_exons = split "\t", $line;
if (not exists $hash_doc2{$split_exons[0]}
or $hash_doc2{$split_exons[0]} < $split_exons[1]) {
$hash_doc2{$split_exons[0]} = $split_exons[1];
}
Your code does not keep all the values, either. You cannot store an array into a hash value, you have to store a reference. Adding a new value to an array can be done with push:
push @{ $hash_doc2{$split_exons[0]} }, $split_exons[1];
Your use of numeric comparison against values is also not doing what you think: the comparison operator imposes a scalar context, so values returns the number of values. Another option would be to store the values sorted (numerically, since they are ages) and always ask for the highest value:
$hash_doc2{$split_exons[0]} = [ sort { $a <=> $b } @{ $hash_doc2{$split_exons[0]} }, $split_exons[1] ];
# max for $x is at $hash_doc2{$x}[-1]
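Alternatively (my suggestion, not in the original answer), the values can be stored unsorted and the highest one computed on demand with the core List::Util module:
use List::Util qw(max);
for my $name (keys %hash_doc2) {
    my $highest = max @{ $hash_doc2{$name} };   # numeric maximum of all ages stored for this name
    print "$name\t$highest\n";
}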
Instead of reading in the whole of file2 into an array (which will be bad if it's big), you could loop through and process the data file line by line:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
use Data::Dumper;
open( my $nameFh, '<', $ARGV[0]);
open( my $dataFh, '<', $ARGV[1]);
my $dataHash = {};
my $processedHash = {};
while(<$dataFh>){
chomp;
my ( $name, $age ) = split /\s+/, $_;
if(! defined($dataHash->{$name}) or $dataHash->{$name} < $age ){
$dataHash->{$name} = $age
}
}
while(<$nameFh>){
chomp;
$processedHash->{$_} = $dataHash->{$_} if defined $dataHash->{$_};
}
print Dumper($processedHash);