sort array and remove duplicates in specific columns Perl - perl

I would like to remove duplicates rows in col 0 of an array in such a way that only the max values in col 1 remain. Data is tab delimited. There are 16 columns.
sample1_EGFR_19 53 exon19 ...
sample1_EGFR_19 12 exon20 ...
sample2_EGFR_19 20 exon19 ...
sample3_EGFR_20 65 exon20 ...
sample2_EGFR_19 25 exon12 ...
sample1_EGFR_20 12 exon20 ...
sample3_EGFR_20 125 exon20 ...
Desired output:
sample1_EGFR_19 53 exon19 ...
sample1_EGFR_20 12 exon20 ...
sample2_EGFR_19 25 exon12 ...
sample3_EGFR_20 125 exon20 ...
I've started with tab delimited text files that I split and populated an array. Then i use a hash and sort by keys. The final output I get the data correctly sorted, however, the duplicates are not removed. How do I remove lines that are now blank in the first column? Thanks
sample1_EGFR_19 53 exon19 ...
12 exon20 ...
sample2_EGFR_19 25 exon12 ...
20 exon19 ...
sample3 EGFR_20 125 exon20 ...
65 exon20 ...
sample1 EGFR_20 12 exon20 ...
Please suggest a straight forward method to accomplish his. Thanks
Here is the code:
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw(uniq);
use List::Util 'first';
use Data::Dumper;
my $filename = "/data/Test/output.txt";
my $output_filename = "/data/Test/output_changed.txt";
my #resultarray;
my #sorted;
open( TXT2, "$filename" );
while ( <TXT2> ) {
push( #resultarray, $_ );
}
close( TXT2 );
foreach ( #resultarray ) {
chop( $_ );
}
foreach ( #resultarray ) {
print( $_);
chomp( $_ );
my ( $key, $val ) = split /\t/, $_, 2;
push #{ $result_hash{$key} }, $val;
}
foreach ( sort { $result_hash{$a} <=> $result_hash{$b} } keys %result_hash ) {
push( #final_array, $_ . "\t" . join "\t", #{ $result_hash{$_} } );
}
undef %{result_hash};
foreach ( #final_array ) {
chomp( $_ );
print( $_);
}
for ( 0 .. #final_array - 1 ) {
my $myuniquearray = $final_array[$_];
open( MYFILE, ">>$output_filename" ); ##opens files with header and adds the rest of the lines.
print MYFILE $myuniquearray . "\n";
close( MYFILE );
}

This is a fairly straightforward UNIX one-liner. Why the requirement to write it in Perl?
$ sort -k1,1 -k2,2rn /data/Test/output.txt | awk '!seen[$1]++' | tee /data/Test/output_changed.txt
sample1_EGFR_19 53 exon19 ...
sample1_EGFR_20 12 exon20 ...
sample2_EGFR_19 25 exon12 ...
sample3_EGFR_20 125 exon20 ...
This sorts it by the first column ascending and by the second column descending and numeric, then uses awk to select the first line from each group. If that awk statement is too confusing, it has the same function as awk 'x != $1 { print; x = $1 }'. (tee writes the lines to the file and displays the output to the terminal.)
If you really must use Perl, here's a simple solution to the described problem:
#!/usr/bin/perl
use strict;
use warnings;
sub sort_func {
# sort by the first col asc and then by the second col desc and numeric
$a->[0] cmp $b->[0] || $b->[1] <=> $a->[1]
}
my %seen;
print
map join("\t", #$_), # re-join the fields with tabs into the original line
grep !$seen{$_->[0]}++, # select the first line of each sorted group
sort sort_func # sort lines using the above sort function
map [split /\t/, $_, 3], # split by tabs so we can sort by the first two fields
<>; # read lines from stdin or the filename given by ARGV[0]
Mark the file executable and use it like so:
./sortlines.pl /data/Test/output.txt >/data/Test/output_changed.txt

Related

Perl extract columns from two files based on condition

I have 2 files, say file1 and file2.
file1.txt
RAC1 GK1 111
RAC2 GK2 222
RAC1 GK3 333
RAC1 GK4 222
RAC2 GK5 111
file2.txt
R1,PAAE,222,TESTA,COLA,NO
R2,RWWG,111,TESTB,COLM,YES
R3,TDAS,444,TESTC,COLZ,NO
I am comparing 2 files and trying to extract data from them. Condition here is if Column3 value of file1 matches with Column3 value of file2 then print the following output -
RAC1,GK1,111,R2,RWWG,TESTB,COLM,YES
RAC2,GK5,111,R2,RWWG,TESTB,COLM,YES
RAC2,GK2,222,R1,PAAE,TESTA,COLA,NO
RAC1,GK4,222,R1,PAAE,TESTA,COLA,NO
I have written a script for the same, by taking file1 column2 value as key. But this column value doesn't exists in file2. So comparison is not working.
Even I am not able to take column3(from file1) as key, because its having duplicated values.
Code below -
my %hash1 = ();
open(FH1, "file1.txt");
while(<FH1>){
chomp($_);
my #val = split(' ', $_);
$hash1{$val[1]}{'RAC_VAL'} = $val[0];
$hash1{$val[1]}{'ID'} = $val[2];
}
#print Dumper(\%hash1);
open(FH2, "file2.txt");
while(<FH2>){
chomp($_);
my #array = split(',', $_);
print "$hash1{$array[2]}{'RAC_VAL'},,$hash1{$array[2]}{'ID'},$array[0],$array[1],$array[3],$array[4],$array[5]\n" if(exists $hash1{$array[2]}{'ID'});
}
Please help me to get output for above data files based on the above said condition.
Here is an example using array of arrays as values in %hash1 (since the keys are not unique):
use feature qw(say);
use strict;
use warnings;
my %hash1;
open(FH1, "file1.txt");
while(<FH1>){
chomp($_);
my #val = split(' ', $_);
push #{ $hash1{$val[2]} }, [ #val[0,1] ];
}
open(FH2, "file2.txt");
while(<FH2>){
chomp($_);
my #array = split(',', $_);
if ( exists $hash1{$array[2]} ) {
for my $item ( #{ $hash1{$array[2]} } ) {
say join ',', #$item, #array[0,1,3,4,5];
}
}
}
Output:
RAC2,GK2,R1,PAAE,TESTA,COLA,NO
RAC1,GK4,R1,PAAE,TESTA,COLA,NO
RAC1,GK1,R2,RWWG,TESTB,COLM,YES
RAC2,GK5,R2,RWWG,TESTB,COLM,YES

In perl, I am trying to print the ENV variables and their values in 2 different columns but its not working

my $longest=0;
foreach my $key ( keys %ENV ) {
my $key_length = length( $key );
$longest = $key_length if $key_length > $longest;
}
foreach my $key ( sort keys %ENV ) {
printf "%-${longest}s %s\n", $key, $ENV{$key};
This does print the ENV variable name and value in two separate columns ;
But the ENV values that are long, wrap around :(
Any pointers on how I can ensure that the second column doesn't wrap around to first columns space ?
Output I am getting
xyz 123
abc 456,
789
Desired output
xyz 123
abc 456,
789
use Text::Wrap qw( wrap );
# local $Text::Wrap::columns = ...; # Default = 76
my $indent = " " x ($longest + 1);
for my $key ( sort keys %ENV ) {
printf("%-${longest}s %s\n", $key, wrap("", $indent, $ENV{$key}));
}
Alternatively, one could probably use Perl6::Form (a Perl5 module that provides Perl6-style forms).

Getting the summation of the 1,2 and 3 digits numbers from text file in perl

Suppose a set of numbers is given in a file number_list.txt.
Find the sum of the numbers from the file in the following categories:
Sum of all 1 digits numbers
Sum of all 2 digits numbers
Sum of all 3 digits numbers
Sum of all numbers starting with a digit 7
Sum of all number ending with a digit 8
write code in perl to find the above sums
example:
If we have "number_list.txt"
23
258
1
24
57
76
85
72
4
654
958
6
46
358
Then we need to get answer like this
Sum of all 1 digits numbers
1 + 4 + 6 = 11
Sum of all 2 digits numbers
23 + 24 + 57 + 76 + 85 + 72 + 46 = 383
Sum of all 3 digits numbers
258 + 654 + 958 + 358 = 2228
Sum of all numbers starting with a digit 7
76 + 72 = 148
Sum of all number ending with a digit 8
258 + 358 + 958 = 1574
And I have done so far.
#!/usr/bin/perl
use strict;
use warnings;
my $filename = "numbers.txt";
open( my $fh, '<', $filename ) or die "Can't open $filename: $!";
my #array1;
my #array2;
my #array3;
my #array4;
print "\n \n";
while ( my $line = <$fh> ) {
#if ( length($line) == 1)
#{
# push (#array1, $line);
#}
if ( length($line) == 2)
{
push (#array2, $line);
}
if ( length($line) == 3)
{
push (#array3, $line);
}
if ( length($line) == 4)
{
push (#array4, $line);
}
}
#print "\n Sum of all digits numbers\n \n";
#for each (#array1) {
# print "$_";
# }
my $sum1 = 0;
my $sum2 = 0;
my $sum3 = 0;
print "\n \n Sum of all 1 digits numbers of:\n \n";
for each my $num2 (#array2) {
print "$num2";
$sum1 = $sum1 + $num2;
}
print "\n Sum = $sum1";
print "\n \n Sum of all 2 digits numbers of:\n \n";
for each my $num3 (#array3) {
print "$num3";
$sum2 = $sum2 + $num3;
}
print "\n Sum = $sum2";
print "\n \n Sum of all 3 digits numbers of:\n \n";
foreach my $num4(#array4) {
print "$num4";
$sum3 = $sum3 +$num4;
}
print "\n Sum = $sum3";
So I have trouble with to make this program in simple way. Is there any simple method have to do this program ?
And also I have trouble with getting
Sum of all numbers starting with a digit 7
Sum of all number ending with a digit 8
Firstly, import sum from List::Util to make summing arrays of numbers easier.
If it's a relatively small file (which it probably is), things are a lot easier if you pull them all into an array first and grep them as you need them
so after you have opened you file...
use List::Util 'sum';
my #nums = <$fh>;
chomp #nums; # Remove trailing newlines;
my #values;
print "Sum of all 3 digits numbers\n";
#values = grep { length == 3 } #nums;
print join( ' + ', #values ), ' = ', sum( #values ), "\n";
print "Sum of all numbers starting with a digit 7\n";
#values = grep { index( $_, 7 ) == 0 } #nums;
print join(' + ', #values), ' = ', sum( #values ), "\n";
print "Sum of all numbers ending with a digit 8\n";
#values = grep { index( reverse($_), 8 ) == 0 } #nums;
print join(' + ', #values), ' = ', sum( #values ), "\n";
Ideally you'd put any duplicate code (like that printing of the sums) inside a sub. Every time you retype the same code, you increase your chances of making an error.
If you want to do it by streaming through the file, then you will have to keep track of multiple arrays as you go
my (#len_3, #len_4, #len_5, #start_7, #end_8);
while (my $n = <$fh>) {
if ( length $n == 3 ) {
push #len_3, $n;
}
# ...
if ( index($n, 7) == 0 ) {
push #start_7, $nl=;
}
}
print "Sum of all 3 digits numbers\n";
print join(' + ', #len_3), ' = ', sum( #len_3 ), "\n";
# ...
The more 'cases' you have, the more arrays you have to keep track of. There are better ways to do this - such as storing array references as hash values - but if your just learning, that may be a little confusing right now.
I'd tackle it like this:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use List::Util qw ( sum );
my %numbers;
my %startswith;
open( my $fh, '<', "numbers.txt") or die $!;
while (<$fh>) {
my ( $num, $start_num ) = m|((\d)\d*)|;
push( #{ $numbers{length($num)} }, $num );
push( #{ $startswith{$start_num} }, $num );
}
close ( $fh );
print Dumper \%numbers;
print Dumper \%startswith;
foreach my $len ( sort keys %numbers ) {
print "Sum of all $len digits numbers:\n";
print join( "+", #{ $numbers{$len} } ), "\n";
print sum ( #{$numbers{$len}}),"\n";
}
foreach my $first ( sort keys %startswith ) {
print "Sum of all numbers starting with $first:\n";
print join( "+", #{ $startswith{$first} } ), "\n";
print sum ( #{ $startswith{$first} } ), "\n";
}
This gives output of:
Sum of all 1 digits numbers:
1+4+6
11
Sum of all 2 digits numbers:
23+24+57+76+85+72+46
383
Sum of all 3 digits numbers:
258+654+958+358
2228
Sum of all numbers starting with 1:
1
1
Sum of all numbers starting with 2:
23+258+24
305
Sum of all numbers starting with 3:
358
358
Sum of all numbers starting with 4:
4+46
50
Sum of all numbers starting with 5:
57
57
Sum of all numbers starting with 6:
654+6
660
Sum of all numbers starting with 7:
76+72
148
Sum of all numbers starting with 8:
85
85
Sum of all numbers starting with 9:
958
958
And will implicitly support arbitrary length numbers. (You could use List::Util for the sum, but I thought I'd offer a non-module.
You can done all with regex.
while( <$fh> ) {
$sum1+= $_ if /^\d$/;
$sum2+= $_ if /^\d{2}$/;
$sum3+= $_ if /^\d{3}$/;
$sum7+= $_ if /^7\d*$/;
$sum8+= $_ if /^\d*8$/;
}
if you put number to list, you can print them as:
local $" = ' + ';
print "#sum1 = $sum1";

Print header row and each row in file the file, side by side in columns

Hi I have a tsv file that i am trying to print the header row and each line of the file side by side ie in columns.
Unfortunatley i am bit confused on how to join the lines in a print statement.
#!/usr/bin/perl
use strict;
use warnings;
local $" = "'\n'";
my #temp;
while (<DATA>) {
chomp;
my #columns = join "\t", $_;
push #temp, #columns;
}
my $Header_row = shift (#temp);
my #head;
my $abc;
my #abc = split(/\t/,$Header_row);
for my $abc(#abc) {
push #head, $abc ."\n";
}
my #roows;
my $elements;
foreach (#temp){
chomp;
my $line = $_;
my #elements = split ("\t", $line);
for $elements(#elements){
push #roows, $elements ."\n";
}
}
#print #head, $abc ."\n";
#print #roows, $elements ."\n";
__DATA__
Year Tonn Class Cargo Type
88 61 T Rice Truck
89 55 G Corn Train
92 93 S Peas Ship
required Output
OUTPUT
Year 88
Tonn 61
Class T
Cargo Rice
Type Truck
Year 89
Tonn 55
Class G
Cargo Corn
Type Train
Year 92
Tonn 93
Class S
Cargo Peas
Type Ship
Based on your source, this should do the trick:
#!/usr/bin/env perl
use strict;
use warnings;
#read the header line into #header;
my $header_line = <DATA>;
chomp $header_line;
chomp ( my #header = split ( ' ', $header_line );
#iteraate data fh
while ( <DATA> ) {
#strip linefeed
chomp;
#read this row into a hash
my %row; #row{#header} = split;
#print this hash in the same order as the header.
#note - $_ is set to each element of header in turn when doing this.
print "$_\t$row{$_}\n" for #header;
#insert extra linefeed
print "\n";
}
__DATA__
Year Tonn Class Cargo Type
88 61 T Rice Truck
89 55 G Corn Train
92 93 S Peas Ship
Note - you can condense further that 'read headers' to:
chomp ( my #header = split ( ' ', <DATA> ) );
Which does the same thing, but might be a bit harder to follow.
There's really no need to read all the lines into #temp before looping through to print them out. It would be more efficient to read just the first line to get the headings and then loop through the remaining lines printing them immediately:
#!/usr/bin/perl
use strict;
use warnings;
my #temp;
my $line = <DATA>;
chomp($line);
my #head = split "\t", $line;
foreach $line (<DATA>) {
chomp($line);
my #elements = split ("\t", $line);
foreach my $i (0..$#head) {
print $head[$i], "\t", $elements[$i], "\n";
}
print "\n";
}
__DATA__
Year Tonn Class Cargo Type
88 61 T Rice Truck
89 55 G Corn Train
92 93 S Peas Ship
The print line could also be written as:
print "$head[$i]\t$elements[$i]\n";
I just thought it was a little clearer to separate out all the parts.

Compare three files based on columns using Perl

I have three files, and I need to match the first column of file 1 to the first column of file 2 and then match the second column of file 1 with the first column of file 3.
file 1:
fji01dde AIDJFMGKG
dlp02sle VMCFIJGM
cmr03lsp CKEIFJ
and so on...
file 2:
fji01dde 25 30
dlp02sle 40 50
cmr03lsp 60 70
and so on...
file 3:
AIDJFMGKG
CKEIFJ
output needs to be:
fji01dde AIDJFMGKG 25 30
cmr03lsp CKEIFJ 60 70
and so on...
I only want lines that are common in all three files.
The below code results in the following output:
AIDJFMGKG
CKEIFJ
fji01dde 25
dlp02sle 40
cmr03lsp 60
#!/usr/bin/env perl
use strict;
use warnings;
my %data;
while (<>) {
my ( $key, $value ) = split;
push( #{ $data{$key} }, $value );
}
foreach my $key ( sort keys %data ) {
if ( #{ $data{$key} } >= #ARGV ) {
print join( "\t", $key, #{ $data{$key} } ), "\n";
}
}
Any ideas? Thanks in advance!
OK, looking at it - your problem is with that split - because by default, it splits on whitespace. Your second file has 3 fields by that yardstick, not two.
But also - you're not actually crossreferecing the same things, so your while ( <> ) { loop isn't going to do the trick.
In file 1 - you want to check for the value.
In file2, you're checking the key (and appending the values).
In file3, you have no value, just a key.
So with that in mind:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
#read file1 into a hash - but invert is it's value => key instead:
# 'CKEIFJ' => 'cmr03lsp',
# etc.
open( my $file1, '<', "file1.txt" ) or die $!;
my %file1_content = map { reverse split } <$file1>;
close($file1);
print Dumper \%file1_content;
#read file 2 - read keys, store the values.
#split _2_ fields, so we keep both numbers as a substring:
#e.g.:
# 'cmr03lsp' => '60 70
#',
open( my $file2, '<', "file2.txt" ) or die $!;
my %file2_content = map { split( " ", $_, 2 ) } <$file2>;
close($file2);
print Dumper \%file2_content;
#then iterate file 3, checking if:
#file1 has a matching 'key' (but inverted - as a value)
#file2 has a cross reference.
open( my $file3, '<', "file3.txt" ) or die $!;
while ( my $line = <$file3> ) {
chomp $line;
if ( $file1_content{$line}
and $file2_content{ $file1_content{$line} } )
{
print
"$file1_content{$line} $line $file2_content{$file1_content{$line}}";
}
}
close($file3);
This prints (excluding the "dumper" output):
fji01dde AIDJFMGKG 25 30
cmr03lsp CKEIFJ 60 70
When I run this code, I get an error message: "Odd number of elements in hash assignment at line 10." Also, the columns in these files are separated by tabs.
Not with that sample data you don't. But yes - if your first file has more than two words per line, this will happen.
You can unroll that loop into a while loop:
while ( <$file1> ) {
my #fields = split;
warn "Too many fields on line $. \n" if #fields > 2;
$file1_data{$fields[1]} = $fields[0];
}