Compare three files based on columns using Perl


I have three files, and I need to match the first column of file 1 to the first column of file 2 and then match the second column of file 1 with the first column of file 3.
file 1:
fji01dde AIDJFMGKG
dlp02sle VMCFIJGM
cmr03lsp CKEIFJ
and so on...
file 2:
fji01dde 25 30
dlp02sle 40 50
cmr03lsp 60 70
and so on...
file 3:
AIDJFMGKG
CKEIFJ
output needs to be:
fji01dde AIDJFMGKG 25 30
cmr03lsp CKEIFJ 60 70
and so on...
I only want lines that are common in all three files.
The below code results in the following output:
AIDJFMGKG
CKEIFJ
fji01dde 25
dlp02sle 40
cmr03lsp 60
#!/usr/bin/env perl
use strict;
use warnings;
my %data;
while (<>) {
my ( $key, $value ) = split;
push( @{ $data{$key} }, $value );
}
foreach my $key ( sort keys %data ) {
if ( @{ $data{$key} } >= @ARGV ) {
print join( "\t", $key, @{ $data{$key} } ), "\n";
}
}
Any ideas? Thanks in advance!

OK, looking at it: your problem is with that split. By default it splits on whitespace, so your second file has three fields by that yardstick, not two.
But you're also not actually cross-referencing the same things, so your while ( <> ) { loop isn't going to do the trick.
In file 1 - you want to check for the value.
In file2, you're checking the key (and appending the values).
In file3, you have no value, just a key.
So with that in mind:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
#read file1 into a hash - but inverted, so it's value => key instead:
# 'CKEIFJ' => 'cmr03lsp',
# etc.
open( my $file1, '<', "file1.txt" ) or die $!;
my %file1_content = map { reverse split } <$file1>;
close($file1);
print Dumper \%file1_content;
#read file 2 - read keys, store the values.
#split _2_ fields, so we keep both numbers as a substring:
#e.g.:
# 'cmr03lsp' => '60 70
#',
open( my $file2, '<', "file2.txt" ) or die $!;
my %file2_content = map { split( " ", $_, 2 ) } <$file2>;
close($file2);
print Dumper \%file2_content;
#then iterate file 3, checking if:
#file1 has a matching 'key' (but inverted - as a value)
#file2 has a cross reference.
open( my $file3, '<', "file3.txt" ) or die $!;
while ( my $line = <$file3> ) {
chomp $line;
if ( $file1_content{$line}
and $file2_content{ $file1_content{$line} } )
{
print
"$file1_content{$line} $line $file2_content{$file1_content{$line}}";
}
}
close($file3);
This prints (excluding the "dumper" output):
fji01dde AIDJFMGKG 25 30
cmr03lsp CKEIFJ 60 70
When I run this code, I get an error message: "Odd number of elements in hash assignment at line 10." Also, the columns in these files are separated by tabs.
Not with that sample data you don't. But yes - if your first file has more than two words per line, this will happen.
You can unroll that map into a while loop:
my %file1_content;
while ( <$file1> ) {
my @fields = split;
warn "Too many fields on line $.\n" if @fields > 2;
$file1_content{$fields[1]} = $fields[0];
}
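Since the files are said to be tab-separated, the same three-way lookup can be sketched with an explicit split on tabs. This is only a minimal sketch, not the answerer's code: the sub name join_three is hypothetical, and it assumes the lines of each file have already been read into arrays.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the answer above): takes the lines of the
# three files as array references and returns the joined output lines.
sub join_three {
    my ( $file1, $file2, $file3 ) = @_;

    # file 1: second column => first column (inverted, as in the answer)
    my %id_for;
    for my $line (@$file1) {
        chomp( my $copy = $line );
        my ( $id, $seq ) = split /\t/, $copy;
        next unless defined $seq;    # skip malformed lines
        $id_for{$seq} = $id;
    }

    # file 2: first column => remaining columns, kept as one string
    my %rest_for;
    for my $line (@$file2) {
        chomp( my $copy = $line );
        my ( $id, $rest ) = split /\t/, $copy, 2;
        next unless defined $rest;
        $rest_for{$id} = $rest;
    }

    # file 3: keep a sequence only if both lookups succeed
    my @out;
    for my $line (@$file3) {
        chomp( my $seq = $line );
        my $id = $id_for{$seq};
        next unless defined $id and exists $rest_for{$id};
        push @out, join "\t", $id, $seq, $rest_for{$id};
    }
    return @out;
}

# Sample data from the question, assumed here to be tab-separated.
my @f1 = ( "fji01dde\tAIDJFMGKG\n", "dlp02sle\tVMCFIJGM\n", "cmr03lsp\tCKEIFJ\n" );
my @f2 = ( "fji01dde\t25\t30\n", "dlp02sle\t40\t50\n", "cmr03lsp\t60\t70\n" );
my @f3 = ( "AIDJFMGKG\n", "CKEIFJ\n" );

print "$_\n" for join_three( \@f1, \@f2, \@f3 );
```

Splitting on /\t/ rather than whitespace keeps fields that contain spaces intact, and the defined checks skip short lines instead of producing the "Odd number of elements" error.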

Related

Join element in an array and separate with space

I want to join the 1st to 16th words, the 17th to 32nd, and so on in an array into one line each, separated by spaces, but I do not know why the code does not work. Hope to get help here. Thanks
my @file = <FILE>;
for ( $i=0; $i<=$#file; $i+=16 ){
my $string = join ( " ", @file[$i..$i+15] );
print FILE1 "$string\n";
}
Below is part of my file.
1
2
3
...
What I want to print is
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
17 18 19 20 21....

I wouldn't do it the way you've done it.
Instead I would:
open ( my $input, '<', "your_file_name" ) or die $!;
chomp ( my @file = <$input> );
print join (" ",splice (@file, 0, 16)),"\n" while @file;
Note - I've used a lexical file handle with a 3 argument open, because that's better style.
splice removes the first 16 elements from @file each iteration, and continues until @file is empty.
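The splice idiom can be tried without a file by wrapping it in a small sub. This is a sketch under assumptions: the name rows_of and the group-size parameter are made up for illustration and are not part of the answer.

```perl
use strict;
use warnings;

# Hypothetical helper: group a list into space-joined rows of $size items.
sub rows_of {
    my ( $size, @items ) = @_;
    my @rows;

    # splice removes up to $size items from the front on each pass;
    # the last row is simply shorter if the list runs out.
    push @rows, join( " ", splice( @items, 0, $size ) ) while @items;
    return @rows;
}

print "$_\n" for rows_of( 16, 1 .. 35 );
```

With 35 items this prints two full rows of 16 and a final row holding 33 34 35, because splice returns whatever is left when fewer than $size items remain.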

Your lines have newlines attached to them. Remove them with chomp. Then loop over the array, remove 16 items and print them.
my @file = <FILE>;
chomp @file;
while (@file) {
my @temp;
INNER: for ( 0 .. 15 ) {
push @temp, shift(@file) // last INNER; # // (not "or"): binds tighter than the comma, and a line of "0" is kept
}
print join( q{ }, @temp ), "\n";
}
This is the long implementation of the splice solution Sobrique suggested in the comments. It does the same thing, just far more verbosely.
This is the old answer before the edit:
If you only want the first 16, this is way more efficient.
my $string = join q{ }, map { chomp( my $line = <FILE> ); $line } 1 .. 16;
This reads 16 lines, chomps each of them, then joins them.
You might also want to use lexical file handles $fh instead of the GLOB FILE.
open my $fh, '<', $path_to_file or die $!;

If you are reading from a file, don't store the whole file in an array. Instead, loop through it line by line and check the line number with the $. special variable.
use warnings;
use strict;
open my $fh,"<","input.txt" or die $!;
my $line = "";
while (<$fh>)
{
chomp;
$line.= $_." ";
print "$line\n" and $line="" if($. % 16 == 0);
}
print "$line\n" if length $line; # print the last, partial group
Or this will also work
use warnings;
use strict;
open my $wh,"<","input.txt";
my $line;
foreach (;;)
{
my $data = join " ",(map { my $m=<$wh> || ""; chomp($m); $m} (0..15));
last if ($data =~m/^\s+$/);
print $data,"\n";
}

Assuming that you have FILE and FILE1 descriptors open, try:
$.%16?s!\s+! !:1 and print FILE1 $_ while <FILE>;

Using perl to cycle through a list of values (x) and compare to another file with value ranges

I have two files. One file has a list of values like so
NC_SNPStest.txt
250
275
375
The other file has space delimited information. Column one is the first value of a range, Column two has the second value of a range, Column 5 has the name of the range, and Column eight has what acts on that range.
promoterstest.txt
20 100 yaaX F yaaX 5147 5.34 Sigma70 99
200 300 yaaA R yaaAp1 6482 6.54 Sigma70 35
350 400 yaaA R yaaAp2 6498 2.86 Sigma70 51
I am trying to write a script that takes the first line from file 1 and then parses file 2 line by line to see if that value falls within the range given by the first two columns.
When the first match is found, I want to print the value from file 1 and then the values in file 2 for columns 5 and 8 from the line with the match. If no match is found in File 2 then just print the value from File 1 and move on.
It seems like it should be a simple enough task but I'm having an issue cycling though both files.
This is what I have written:
#!/usr/bin/perl
use warnings;
use strict;
open my $PromoterFile, '<', 'promoterstest.txt' or die $!;
open my $SNPSFile, '<', 'NC_SNPtest.txt' or die $!;
open (FILE, ">PromoterMatchtest.txt");
while (my $SNPS = <$SNPSFile>) {
chomp ($SNPS);
while (my $Cord = <$PromoterFile>) {
chomp ($Cord);
my @CordFile =split(/\s/, $Cord);
my $Lend = $CordFile[0];
my $Rend = $CordFile[1];
my $Promoter = $CordFile[4];
my $SigmaFactor = $CordFile[7];
foreach $a ($SNPS)
{
if ($a >= $Lend && $a <= $Rend)
{
print FILE "$a\t$CordFile[4]\t$CordFile[7]\n";
}
else
{
print FILE "$a\n";
}
}
}
}
close FILE;
close $PromoterFile;
close $SNPSFile;
exit;
So far my output looks like so:
250
250 yaaAp1 Sigma70
250
Where the first line of file 1 is being called and file 2 is being cycled through. But the else command is being used on each line of file 2 and the script never cycles through the other lines of file 1.

Your problem is that you're not resetting your progress through the second file. You read one line from $SNPSFile and check it against every line in the second file.
But when you start over, you're already at the end of the file, so:
while (my $Cord = <$PromoterFile>) {
Doesn't have anything to read.
A quick fix for this would be to add a seek call in there, but that makes for inefficient code. I'd suggest instead reading the promoter file into an array, and referencing that instead.
Here's a first draft rewrite that may help.
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
open my $PromoterFile, '<', 'promoterstest.txt' or die $!;
open my $SNPSFile, '<', 'NC_SNPtest.txt' or die $!;
open my $output, ">", "PromoterMatchtest.txt" or die $!;
my @data;
while (<$PromoterFile>) {
chomp;
my #CordFile = split;
my $Lend = $CordFile[0];
my $Rend = $CordFile[1];
my $Promoter = $CordFile[4];
my $SigmaFactor = $CordFile[7];
push(
@data,
{ lend => $CordFile[0],
rend => $CordFile[1],
promoter => $CordFile[4],
sigmafactor => $CordFile[7]
}
);
}
print Dumper \@data;
foreach my $value (<$SNPSFile>) {
chomp $value;
my $found = 0;
foreach my $element (@data) {
if ( $value >= $element->{lend}
and $value <= $element->{rend} )
{
#print "Found $value\n";
print {$output} join( "\t",
$value, $element->{promoter}, $element->{sigmafactor} ),
"\n";
$found++;
last;
}
}
if ( not $found ) {
print {$output} $value,"\n";
}
}
close $output;
close $PromoterFile;
close $SNPSFile;
First - we open file2 and read its contents into an array of hashes. (If any of the elements there are unique, we could key off that instead.)
Then we read through the SNPs file one line at a time, checking each value against the ranges - printing the match if there is one (once only, on the first hit) and printing just the value if there isn't.
This generates the output:
250 yaaAp1 Sigma70
275 yaaAp1 Sigma70
375 yaaAp2 Sigma70
Was that what you were aiming for?
That is aside from the 'Dumper' statement, which outputs the content of @data like this:
$VAR1 = [
{
'sigmafactor' => 'Sigma70',
'promoter' => 'yaaX',
'lend' => '20',
'rend' => '100'
},
{
'sigmafactor' => 'Sigma70',
'promoter' => 'yaaAp1',
'rend' => '300',
'lend' => '200'
},
{
'promoter' => 'yaaAp2',
'sigmafactor' => 'Sigma70',
'rend' => '400',
'lend' => '350'
}
];

Here's my take on a programming solution. It's important to:
Use lexical file handles and the three-parameter form of open
Keep to lower-case letters, digits and underscores for local variables
I have also used the autodie pragma to remove the need to test the status of open explicitly, and the first function from the core library List::Util to make the code clearer and more concise.
use strict;
use warnings;
use 5.010;
use autodie;
use List::Util 'first';
my @promoters;
{
open my $fh, '<', 'promoterstest.txt';
while ( <$fh> ) {
my @fields = split;
push @promoters, [ @fields[0,1,4,7] ];
}
}
open my $fh, '<', 'NC_SNPStest.txt';
open my $out_fh, '>', 'PromoterMatchtest.txt';
select $out_fh;
while ( <$fh> ) {
my ($num) = split;
my $match = first { $num >= $_->[0] and $num <= $_->[1] } @promoters;
if ( $match ) {
print join("\t", $num, @{$match}[2,3]), "\n";
}
else {
print $num, "\n";
}
}
output
250 yaaAp1 Sigma70
275 yaaAp1 Sigma70
375 yaaAp2 Sigma70

Parsing out text from string

I have a tab-delimited file1:
20 50 80 110
520 590 700 770
410 440 20 50
300 340 410 440
read and put them into an array:
while(<INPUT>)
{
chomp;
push @inputarray, $_;
}
Now I'm looping through another file2:
20, 410, 700
80, 520
300
For each number on each line of file2, I want to search @inputarray for the number. If it exists, I want to grab the number that follows it. For instance, for number 20, I want to grab the number 50. I assume that they are still separated by a tab in the string stored as an array element in @inputarray.
while(my $line = <INPUT2>)
{
chomp $line;
my @linearray = split("\t", $line);
foreach my $start (@linearray)
{
if (grep ($start, @inputarray))
{
#want to grab the corresponding number
}
}
}
Once grep finds it, I don't know how to grab that array element to find the position of the number and extract the corresponding number, perhaps using the substr function. How do I grab the array element that grep found?
A desired output would be:
line1:
20 50
410 440
700 770
line2:
80 110
520 590
line3:
300 340

IMHO, it would be best to store the numbers from file1 in a hash. Going by the example content of file1 you provided above, you could have something like this:
{
'20' => '50',
'80' => '110',
'520'=> '590',
'700'=> '770',
'410'=> '440',
'20' => '50',
'300'=> '340',
'410' => '440'
}
A sample piece of code would look like this:
my %inputarray;
while(<INPUT>)
{
my @numbers = split; # split the line on whitespace
# walk the fields two at a time: start => following number
for ( my $i = 0; $i + 1 < @numbers; $i += 2 )
{
$inputarray{$numbers[$i]} = $numbers[$i+1];
}
}
A demonstration of the above loop
index: 0 1 2 3
numbers: 20 50 80 110
first iteration: $i=0
$inputarray{$numbers[0]} = $numbers[1];
$i = 2; #$i += 2;
second iteration: $i=2
$inputarray{$numbers[2]} = $numbers[3];
And then while parsing file2, you just need to treat the number as the key of %inputarray.
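The lookup side can be sketched as follows. This is a minimal sketch under assumptions: the sub name lookup_line is hypothetical, the hash is filled in by hand with the pairs from the question's file1, and file2 is assumed to use comma-separated values as shown in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: for one comma-separated line of file2, return
# "start<TAB>following" for every number present as a key in the hash.
sub lookup_line {
    my ( $pairs, $line ) = @_;
    chomp $line;
    my @found;
    for my $start ( split /\s*,\s*/, $line ) {
        next unless exists $pairs->{$start};    # number not in file1
        push @found, "$start\t$pairs->{$start}";
    }
    return @found;
}

# Pairs from the question's file1, written out by hand for the example.
my %inputarray = (
    20  => 50,  80  => 110, 520 => 590,
    700 => 770, 410 => 440, 300 => 340,
);

my $lineno = 0;
for my $line ( "20, 410, 700\n", "80, 520\n", "300\n" ) {
    $lineno++;
    print "line$lineno:\n";
    print "$_\n" for lookup_line( \%inputarray, $line );
}
```

Using exists rather than testing the value means a pair whose second number is 0 would still be found.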

I believe this gets you close to what you want.
#!/usr/bin/perl -w
my %follows;
open my $file1, "<", $ARGV[0] or die "could not open $ARGV[0]: $!\n";
while (<$file1>)
{
chomp;
my $prev = undef;
foreach my $curr ( split /\s+/ )
{
$follows{$prev} = $curr if defined $prev; # defined, so a value of 0 still counts
$prev = $curr;
}
}
close $file1;
open my $file2, "<", $ARGV[1] or die "could not open $ARGV[1]: $!\n";
my $lineno = 1;
while (<$file2>)
{
chomp;
print "line $lineno\n";
$lineno++;
foreach my $val ( split /,\s+/, $_ )
{
print $val, " ", ($follows{$val} // "no match"), "\n";
}
print "\n";
}
If you only want to consider numbers from file1 in pairs, as opposed to seeing which numbers follow what other numbers without taking pair boundaries into account, then you need to change the logic in the first while loop slightly.
#!/usr/bin/perl -w
my %follows;
open my $file1, "<", $ARGV[0] or die "could not open $ARGV[0]: $!\n";
while (<$file1>)
{
chomp;
my $line = $_;
while ( $line =~ s/(\S+)\s+(\S+)\s*// )
{
$follows{$1} = $2;
}
}
close $file1;
open my $file2, "<", $ARGV[1] or die "could not open $ARGV[1]: $!\n";
my $lineno = 1;
while (<$file2>)
{
chomp;
print "line $lineno\n";
$lineno++;
foreach my $val ( split /,\s+/, $_ )
{
print $val, " ", ($follows{$val} // "no match"), "\n";
}
print "\n";
}

If you want to read the input once but check for numbers a lot, you might be better off splitting each input line into individual numbers, then adding each number as a key in a hash with the following number as its value. That makes reading slower and takes more memory, but the second part, where you look up the following number, will be a breeze thanks to exists and the nature of hashes.
If I understood your question correctly, you could use just one big hash. That is of course assuming that every number is always followed by the same number.

if value lies between two values then add another value to corresponding line

This is a description of my problem: I have two text files (here $variants and $annotation). I want to check if the value from column 2 in $variants lies between the values from column 2 and 3 in $annotation. If this is true then the value from column 1 in $annotation should be added to a new column in $variants.
This is what my sample input files look like
$annotation represents a tab-delimited text file
These values can overlap and cannot be perfectly sorted, since I'm working with a circular genome
C0 C1 C2
gene1 0 100
gene2 500 1000
gene3 980 1200
gene4 1500 5
$variants represents a tab-delimited text file
C0 C1
... 5
... 10
... 100
... 540
... 990
The output should look like this ($variants with two other columns added)
C0 C1 C2 C3
... 5 gene1 gene4
... 10 gene1
... 100 gene1
... 540 gene2
... 990 gene2 gene3
This is what my script looks like at the moment
my %hash1=();
while(<$annotation>){
my @column = split(/\t/); #split on tabs
my $keyfield = $column[1] && $column[2]; # I need to remember values from two columns here. How do I do that?
}
while(<$variants>){
my @column=split(/\t/); # split on tabs
my $keyfield = $column[1];
if ($hash1{$keyfield} >= # so the value in column[1] should be between the values from column[1] & [2] in $annotation
push # if true then add values from column[0] in $annotation to new column in $variants
}
So my biggest problems are how to remember two values in a file using hashes and how to put a value from one file to a column in another file. Could someone help me with this?

If the input files are not large and the positions are not too high, you can use arrays to represent all positions:
#!/usr/bin/perl
use warnings;
use strict;
sub skip_header {
my $FH = shift;
<$FH>;
}
open my $ANN, '<', 'annotation' or die $!;
my $max = 0;
while (<$ANN>) {
$_ > $max and $max = $_ for (split)[1, 2];
}
seek $ANN, 0, 0; # Rewind the file back.
my $circular;
my @genes;
while (<$ANN>) {
my ($gene, $from, $to) = split;
if ($from <= $to) {
$genes[$_] .= "$gene " for $from .. $to;
} else {
$circular = 1;
$genes[$_] .= "$gene " for 0 .. $to, $from .. $max + 1;
}
}
chop @genes;
open my $VAR, '<', 'variants' or die $!;
skip_header($VAR);
while (<$VAR>) {
next if /^\s*#/;
chomp;
my ($str, $pos) = split;
$pos = $#genes if $circular and $pos > $#genes;
print "$_ ", $genes[$pos] // q(), "\n";
}

No hashing needed at all. This example expects the annotations to be sorted and not overlapping; it also works only if all the values from variants should be printed.
#!/usr/bin/perl
use warnings;
use strict;
open my $VAR, '<', 'variants' or die $!;
<$VAR>; # skip header
my ($str, $pos) = split ' ', <$VAR>;
open my $ANN, '<', 'annotation' or die $!;
<$ANN>; # skip header
while (<$ANN>) {
my ($gene, $from, $to) = split;
while ($from <= $pos and $pos <= $to) {
print "$str $pos $gene\n";
($str, $pos) = split ' ', <$VAR> or last;
}
}
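For ranges that overlap or wrap around the origin, as in the question's sample, a plain scan over every annotation for each position is a simple alternative. The following is only a sketch under assumptions: the sub name genes_at is hypothetical, the annotation rows are hard-coded from the question, and a range with from > to is assumed to wrap around the end of the circular genome.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of all genes whose range contains
# $pos. A range with from > to is assumed to wrap around the origin.
sub genes_at {
    my ( $pos, @annotation ) = @_;
    my @hits;
    for my $gene (@annotation) {
        my ( $name, $from, $to ) = @$gene;
        my $inside = $from <= $to
            ? ( $pos >= $from and $pos <= $to )    # ordinary range
            : ( $pos >= $from or  $pos <= $to );   # wraps past the end
        push @hits, $name if $inside;
    }
    return @hits;
}

# Annotation rows from the question: [ name, from, to ].
my @annotation = (
    [ gene1 => 0,    100 ],
    [ gene2 => 500,  1000 ],
    [ gene3 => 980,  1200 ],
    [ gene4 => 1500, 5 ],
);

for my $pos ( 5, 10, 100, 540, 990 ) {
    print join( "\t", $pos, genes_at( $pos, @annotation ) ), "\n";
}
```

This costs one pass over the annotations per variant, so it trades speed for not needing sorted input or a position-indexed array.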

parse a tab delimited data using perl

I have tab-delimited data that I want to process using Perl. I am a newbie to Perl and could not figure out how to solve this.
This is sample table: My original file is almost a GB
gi|306963568|gb|GL429799.1|_1316857_1453052 13 1
gi|306963568|gb|GL429799.1|_1316857_1453052 14 1
gi|306963568|gb|GL429799.1|_1316857_1453052 15 1
gi|306963568|gb|GL429799.1|_1316857_1453052 16 1
gi|306963568|gb|GL429799.1|_1316857_1453052 17 1
gi|306963568|gb|GL429799.1|_1316857_1453052 360 1
gi|306963568|gb|GL429799.1|_1316857_1453052 361 1
gi|306963568|gb|GL429799.1|_1316857_1453052 362 1
gi|306963568|gb|GL429799.1|_1316857_1453052 363 1
gi|306963568|gb|GL429799.1|_1316857_1453052 364 1
gi|306963568|gb|GL429799.1|_1316857_1453052 365 1
gi|306963568|gb|GL429799.1|_1316857_1453052 366 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38640 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38641 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38642 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38643 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38644 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38645 1
I would like to get the output as
Name, start value, end value, average
gi|306963568|gb|GL429799.1|_1316857_1453052 13 17 1
gi|306963568|gb|GL429799.1|_1316857_1453052 360 366 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38640 38645 1
it will be great if someone could share their wisdom.

The general pattern is
use strict;
use warnings;
open my $fh, '<', 'myfile' or die $!;
while (<$fh>) {
chomp;
my #fields = split /\t/;
...
}
Within the loop the fields can be accessed as $fields[0] through $fields[2].
Update
I have understood your question better, and I think this solution will work for you. Note that it assumes the input data is sorted, as you have shown in your question.
It accumulates the start and end values, the total and the count in hash %data, and keeps a list of all the names encountered in @names so that the data can be displayed in the order it was read.
The program expects the input file name as a parameter on the command line.
You need to consider the formatting of the average because it is a floating point value. As it stands, Perl will display it with up to 15 significant figures, and you may want to curtail that using sprintf.
use strict;
use warnings;
my ($filename) = @ARGV;
open my $fh, '<', $filename or die qq{Unable to open "$filename": $!};
my #names;
my %data;
my $current_name = '';
my $last_index;
while (<$fh>) {
chomp;
my ($name, $index, $value) = split /\t/;
if ( $current_name ne $name or $index > $last_index + 1 ) {
push @names, $name unless $data{$name};
push @{ $data{$name} }, {
start => $index,
count => 0,
total => 0,
};
$current_name = $name;
}
my $entry = $data{$name}[-1];
$entry->{end} = $index;
$entry->{count} += 1;
$entry->{total} += $value;
$last_index = $index;
}
for my $name (@names) {
for my $entry (@{ $data{$name} }) {
my ($start, $end, $total, $count) = @{$entry}{qw/ start end total count /};
print join("\t", $name, $start, $end, $total / $count), "\n";
}
}
output
gi|306963568|gb|GL429799.1|_1316857_1453052 13 17 1
gi|306963568|gb|GL429799.1|_1316857_1453052 360 366 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38640 38645 1
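Rounding the average with sprintf could look something like the sketch below. The sub name format_average and the default of two decimal places are assumptions for illustration, not part of the program above.

```perl
use strict;
use warnings;

# Hypothetical helper: average rounded to a fixed number of decimals.
sub format_average {
    my ( $total, $count, $places ) = @_;
    $places = 2 unless defined $places;    # assumed default
    return sprintf "%.*f", $places, $total / $count;
}

print format_average( 10, 3 ), "\n";       # 10/3 with two decimals
print format_average( 7,  7, 0 ), "\n";    # whole number, no decimals
```

In the program above, the call would take the place of $total / $count in the final print statement.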

This will produce the same output for the sample in your question:
#!/usr/bin/env perl -n
#
my ($name, $i, $value) = split(/\t/);
sub print_stats {
print join("\t", $prev_name, $start, $prev_i, $sum / ($prev_i - $start + 1)), "\n";
}
if ($prev_name eq $name && $i == $prev_i + 1) {
$sum += $value;
$prev_i = $i;
}
else {
if ($prev_name) {
&print_stats();
}
$start = $i;
$prev_name = $name;
$sum = $value;
$prev_i = $i;
}
END {
&print_stats();
}
Use it as:
./parser.pl < sample.txt
UPDATE: answers to the questions in comments:
To print output to a file, run like this: ./parser.pl < sample.txt > output.txt
$prev_name and $prev_i are NOT initialized, so they are undef at first (= NULL)

You could do something like this....
open (FILE, 'data.txt');
while (<FILE>) {
chomp;
($name, $start_value, $end_value, $average) = split("\t");
print "Name: $name\n";
print "Start Value: $start_value\n";
print "End Value: $end_value\n";
print "Average: $average\n";
print "---------\n";
}
close (FILE);
exit;
Those look like GenBank files...so I'm unsure where you are getting the start, end values, average.

Here's an example using Text::CSV:
use Text::CSV; # This will implicitly use Text::CSV_XS if it's installed
my $parser = Text::CSV->new( { sep_char => '|' } );
open my $fh, '<', 'myfile' or die $!;
while (my $row = $parser->getline($fh)) {
# $row references an array of field values from the line just read
}
Also, as a minor side detail, your sample data is delimited by pipe characters, not tabs, although that may just be to avoid copy/paste errors for those answering your question. If the actual data is tab-delimited, set sep_char to "\t" instead of '|'.
