My question is similar to this question posted earlier.
I have many files that I need to merge based on the presence or absence of the ID in the first column. While merging I get lots of empty values in my output file, and I want those to be zero when an ID is not present in another file. The example below uses only two files, but I have many sample files in this (tabular) format.
For example:
File1
ID Value
123 1
231 2
323 3
541 7
File2
ID Value
541 6
123 1
312 3
211 4
Expected Output:
ID File1 File2
123 1 1
231 2 0
323 3 0
541 7 6
312 0 3
211 0 4
Obtaining Output:
ID File1 File2
123 1 1
231 2
323 3
541 7 6
312 undef 3
211 undef 4
As you can see above, I am getting output, but the File2 column is left empty instead of getting a zero, and the File1 column contains undef. I check for undef values, so my final output has zeros in place of undef, but I still have those empty fields. Please find my code below (hardcoded for only two files).
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use Data::Dumper;
my $path = "/home/pranjay/Projects/test";
my @files = ("s1.txt","s2.txt");
my %classic_com;
my $cnt;
my $classic_txt;
my $sample_cnt = 0;
my $classic_txtcomb = "test_classic.txt";
open($classic_txt,">$path/$classic_txtcomb") or die "Couldn't open file $classic_txtcomb for writing,$!";
print $classic_txt "#ID\t"."file1\tfile2\n";
foreach my $file(@files){
$sample_cnt++;
print "$sample_cnt\n";
open($cnt,"<$path/$file")or die "Couldn't open file $file for reading,$!";
while(<$cnt>){
chomp($_);
my @count = ();
next if($_=~/^ID/);
my @record=();
@record=split(/\t/,$_);
my $scnt = $sample_cnt -1;
if((exists($classic_com{$record[0]})) and ($sample_cnt > 0)){
${$classic_com{$record[0]}}[$scnt]=$record[1];
}else{
$count[$scnt] = "$record[1]";
$classic_com{$record[0]}= [@count];
}
}
}
my %final_txt=();
foreach my $key ( keys %classic_com ) {
#print "$key: ";
my @val = @{ $classic_com{$key} };
my @v;
foreach my $i ( @val ) {
if(not defined($i)){
$i = 0;
push(@v, $i);
}else{
push(@v, $i);
next;
}
}
$final_txt{$key} = [@v];
}
#print Dumper %classic_com;
while(my($key,$value)=each(%final_txt)){
my $val=join("\t", @{$value});
print $classic_txt "$key\t"."@{$value}"."\n";
}
Just read the input files into a hash of arrays. The topmost key is the ID, each inner array contains the value for file i on the i-th position. When printing, use the // defined-or operator to replace undefs with zeroes:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my %merged;
my $file_tally = 0;
while (my $file = shift) {
open my $in, '<', $file or die "$file: $!";
<$in>; # skip the header
while (<$in>) {
my ($id, $value) = split;
$merged{$id}[$file_tally] = $value;
}
++$file_tally;
}
for my $value (keys %merged) {
my @values = @{ $merged{$value} };
say join "\t", $value, map $_ // 0, @values[0 .. $file_tally - 1];
}
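For experimenting without input files, the same hash-of-arrays idea can be sketched on in-memory data. This is only an illustration under a few assumptions: merge_tables is a hypothetical helper (not part of the answer above), the column names File1/File2 are taken from the question, and the IDs are assumed numeric so they can be sorted with <=>. It also prints the header line.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list of [ name, { id => value } ] pairs and
# returns the output lines, with 0 wherever an ID is missing from a table.
sub merge_tables {
    my @tables = @_;
    my %merged;
    for my $i (0 .. $#tables) {
        my $rows = $tables[$i][1];
        $merged{$_}[$i] = $rows->{$_} for keys %$rows;
    }
    my @lines = join "\t", 'ID', map { $_->[0] } @tables;
    for my $id (sort { $a <=> $b } keys %merged) {
        # Slice to the full width so trailing gaps exist, then default to 0.
        push @lines, join "\t", $id,
            map { $_ // 0 } @{ $merged{$id} }[0 .. $#tables];
    }
    return @lines;
}

print "$_\n" for merge_tables(
    [ File1 => { 123 => 1, 231 => 2, 323 => 3, 541 => 7 } ],
    [ File2 => { 541 => 6, 123 => 1, 312 => 3, 211 => 4 } ],
);
```

The slice over 0 .. $#tables is what fills the trailing empty fields: an array that was only assigned at index 0 has no element at index 1 until you ask for it.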
program.pl
my %val;
/ (\d+) \s+ (\d+) /x and $val{$1}{$ARGV} = $2 while <>;
pr( 'ID', my @f = sort keys %{{map%$_,values%val}} );
pr( $_, map$_//0, @{$val{$_}}{@f} ) for sort keys %val;
sub pr{ print join("\t",@_)."\n" }
Run:
perl program.pl s1.txt s2.txt
ID s1.txt s2.txt
123 1 1
211 0 4
231 2 0
312 0 3
323 3 0
541 7 6
Pardon me for asking a question without showing any coding effort, but this seems too difficult for me.
I have a data file with three tab-separated data columns (and some repeated header lines):
Sequence ../Output/yy\Programs\NP_416485.4 alignment. Using default output format...
# ../Output/Split_Seq/NP_415931.4.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.89 u-p
1 -5.79 ---
2 0.85 yui
3 0.51 uio
4 0.66 -08
Sequence ../Output/yy\Programs\YP_986467.7 alignment. Using default output format...
# ../Output/Split_Seq/YP_986467.7.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.001 -s-
1 0.984 ---
2 0.564 -fg
3 0.897 -sr
From the second data column, for the values that are greater than 0.5, I want to extract the corresponding first-column number (or range).
For the above Input, the output would be:
NP_416485.4: 1, 3-5
YP_986467.7: 2-4
Here, "NP_416485.4" and "YP_986467.7" come from the header descriptor (after \Programs). (Note that the actual result for "NP_416485.4", for example, would be "NP_416485.4: 0, 2-4", but I have increased every number by 1 because I don't want to start at 0.)
Thanks for your consideration. I would appreciate any help. Thank you
Here is one approach. In case you have a DOS data file on a Unix machine, I used \r?\n to match a newline, so it will work in both cases:
use feature qw(say);
use strict;
use warnings;
my $file_name = 'input.txt';
open ( my $fh, '<', $file_name ) or die "Could not open file '$file_name': $!";
my $str = do { local $/; <$fh> };
close $fh;
my @chunks = $str =~ /(Sequence(?:.(?!Sequence))*)/sg;
my %ids;
for my $cstr ( @chunks ) {
my ( $id, $data ) = $cstr
=~/Split_Seq\/(\S+)\.fasta.*?\r?\n\r?\n(.*)$/s;
my @lines = split /\n/, $data;
my @vals;
for my $line ( @lines ) {
my @fields = split " ", $line;
push ( @vals, $fields[0] + 1 ) if $fields[1] > 0.5;
}
$ids{$id} = \@vals;
}
for my $id ( keys %ids ) {
my @tmp = sort { $a <=> $b } @{ $ids{$id} };
my ( $first, $last );
my @rr;
for my $i (0..$#tmp) {
if ( $i == 0 ) {
$first = $tmp[0];
$last = undef;
}
if ( $i < $#tmp && ($tmp[$i] == ($tmp[$i+1] - 1 )) ) {
$last = $tmp[$i+1];
next;
}
if ( defined $last ) {
push @rr, "$first-$last";
$last = undef;
}
else {
push @rr, $tmp[$i];
}
$first = ( $i < $#tmp ) ? $tmp[$i+1] : undef;
}
say "$id: ", join ",", @rr;
}
Output:
NP_416485.4: 1,3-5
YP_986467.7: 2-4
You don't really give a good description of your problem, and you haven't made any effort to solve it yourself, but here's a solution to the first part of your problem (parsing the file into a data structure). You'll need to walk the %results hash and produce the output that you want.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Data::Dumper;
my %results;
my $section;
while (<DATA>) {
# Look for a new section
if (/\\Programs\\(\S+)\s/) {
$section = $1;
}
# Look for data lines
if (/^\d+\b/) {
my @data = split;
if ($data[1] > 0.5) {
push @{$results{$section}}, $data[0] + 1;
}
}
}
say Dumper \%results;
__DATA__
Sequence ../Output/yy\Programs\NP_416485.4 alignment. Using default output format...
# ../Output/Split_Seq/NP_415931.4.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.89 u-p
1 -5.79 ---
2 0.85 yui
3 0.51 uio
4 0.66 -08
Sequence ../Output/yy\Programs\YP_986467.7 alignment. Using default output format...
# ../Output/Split_Seq/YP_986467.7.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.001 -s-
1 0.984 ---
2 0.564 -fg
3 0.897 -sr
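The remaining step, walking %results and collapsing consecutive numbers into ranges, could be sketched as follows. collapse_ranges is a hypothetical helper name, and the %results contents here are typed in by hand to match what the parsing code above would collect from the sample data; treat it as a sketch rather than a finished solution.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse a list of integers into "a-b" runs.
sub collapse_ranges {
    my @nums = sort { $a <=> $b } @_;
    my @ranges;
    while (@nums) {
        my $first = my $last = shift @nums;
        # Extend the run while the next number is consecutive.
        $last = shift @nums while @nums && $nums[0] == $last + 1;
        push @ranges, $first == $last ? $first : "$first-$last";
    }
    return join ', ', @ranges;
}

# Hand-written stand-in for the hash built by the parsing loop.
my %results = (
    'NP_416485.4' => [ 1, 3, 4, 5 ],
    'YP_986467.7' => [ 2, 3, 4 ],
);
print "$_: ", collapse_ranges(@{ $results{$_} }), "\n" for sort keys %results;
```

With the sample data this prints "NP_416485.4: 1, 3-5" and "YP_986467.7: 2-4".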
I have 2 tab-delimited files formatted similar to this:
file 1
A 100 90 PASS
B 89 80 PASS
C 79 70 PASS
D 69 60 FAIL
F 59 0 FAIL
file 2
Randy 80
Denis 44
Earl 97
I want to take the values from column 2 in file 2 and compare them with the ranges given between columns 2 and 3 of file 1. Then I want to create a new file that combines this data, printing columns 1 and 2 from file 2 and columns 1 and 4 from file 1:
file 3
Randy 80 B PASS
Denis 44 F FAIL
Earl 97 A PASS
I want to implement this using awk or perl.
You can use this awk:
awk 'BEGIN{FS=OFS="\t"}
FNR==NR {
a[$0] = $2
next
}
{
for (i in a)
if ($2>=a[i] && $3<=a[i])
print i, $1, $4
}' file2 file1
Earl 97 A PASS
Randy 80 B PASS
Denis 44 F FAIL
In perl, I'd probably do something like this:
#!/usr/bin/env perl
use strict;
use warnings 'all';
use Data::Dumper;
open ( my $grades_in, '<', "file1.txt" ) or die $!;
my @grade_lookup = map { [split] } <$grades_in>;
print Dumper \@grade_lookup;
close ( $grades_in );
open ( my $people, '<', "file2.txt" ) or die $!;
while (<$people>) {
chomp;
my ( $person, $score ) = split;
my ( $grade ) = grep { $_ -> [1] >= $score
and $_ -> [2] <= $score } @grade_lookup;
print join " ", $person, $score, $grade -> [0], $grade -> [3], "\n";
}
close ( $people );
output:
Randy 80 B PASS
Denis 44 F FAIL
Earl 97 A PASS
In Perl
use strict;
use warnings 'all';
use autodie;
use List::Util 'first';
my @grades = do {
open my $fh, '<', 'file1.txt';
map [ split ], <$fh>;
};
open my $fh, '<', 'file2.txt';
while ( <$fh>) {
my ($name, $score) = split;
my $grade = first { $_->[2] <= $score } @grades;
print "$name $score @$grade[0,3]\n";
}
output
Randy 80 B PASS
Denis 44 F FAIL
Earl 97 A PASS
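Both Perl versions assume every score falls into some range. A minimal sketch that checks both bounds and copes with a score that matches nothing might look like this; grade_for is a hypothetical helper, and the table is hardcoded from file 1 of the question instead of being read from disk.

```perl
use strict;
use warnings;

# Rows mirror file 1: [ grade, upper bound, lower bound, status ].
my @grades = (
    [ A => 100, 90, 'PASS' ],
    [ B =>  89, 80, 'PASS' ],
    [ C =>  79, 70, 'PASS' ],
    [ D =>  69, 60, 'FAIL' ],
    [ F =>  59,  0, 'FAIL' ],
);

# Hypothetical helper: first row whose range contains $score, or undef.
sub grade_for {
    my ($score) = @_;
    for my $row (@grades) {
        return $row if $score <= $row->[1] && $score >= $row->[2];
    }
    return;
}

for my $person ([ Randy => 80 ], [ Denis => 44 ], [ Earl => 97 ]) {
    my ($name, $score) = @$person;
    my $row = grade_for($score);
    # Print placeholders when no range matches instead of dying on undef.
    print join("\t", $name, $score, $row ? @$row[0, 3] : ('?', '?')), "\n";
}
```

Note that with integer bounds like these, a fractional score such as 89.5 falls between two ranges and matches none, which is the case the placeholder branch is for.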
I have a table having the following structure
gene transcript exon length
A NM_1 1 10
A NM_1 2 5
A NM_1 3 20
A NM_2 1 10
A NM_2 2 5
A NM_2 3 50
B NM_5 1 10
... ... ... ...
So basically, the table consists of a column with all human genes. The second column contains the transcript name. The same gene can have multiple transcripts. The third column contains an exon number. Every gene consists of multiple exons. The fourth column contains the length of each exon.
Now I want to create a new table looking like this:
gene transcript length
A NM_2 65
B NM_5 10
... ... ...
So what I basically want to do is find the longest transcript for each gene.
This means that when there are multiple transcripts (column transcript) for each gene (column gene), I need to make the sum of the values in the length column for all the exons of the transcript of that gene.
So in the example there are two transcripts for gene A: NM_1 and NM_2. Each has three exons. The sum of these three values for NM_1 = 10+5+20 = 35, for NM_2 it's 10+5+50 = 65. So for gene A, NM_2 is the longest transcript, so I want to put this in the new table. For gene B there is only 1 transcript, with one exon of length 10. So in the new table, I just want the length of this transcript reported.
I've worked with hashes before, so I thought of storing 'gene' and 'transcript' as two different keys:
#! /usr/bin/perl
use strict;
use warnings;
open(my $test,'<',"test.txt") || die ("Could not open file $!");
open(my $output, '+>', "output.txt") || die ("Can't write new file: $!");
# skip the header of $test # I know how to do this
my %hash = ();
while(<$test>){
chomp;
my @cols = split(/\t/);
my $keyfield = $cols[0]; #gene name
my $keyfield2 = $cols[1]; # transcript name
push @{ $hash{$keyfield} }, $keyfield2;
...
Given what you're trying to do, I'd be thinking something like this:
use strict;
use warnings;
my %genes;
my $header_line = <DATA>;
#read the data
while (<DATA>) {
my ( $gene, $transcript, $exon, $length ) = split;
$genes{$gene}{$transcript} += $length;
}
print join( "\t", "gene", "transcript", "length_sum" ), "\n";
foreach my $gene ( keys %genes ) {
#sort by length_sum, and 'pop' the top of the list.
my ($longest_transcript) =
( sort { $genes{$gene}{$b} <=> $genes{$gene}{$a} or $a cmp $b }
keys %{ $genes{$gene} } );
print join( "\t",
$gene, $longest_transcript, $genes{$gene}{$longest_transcript} ),
"\n";
}
__DATA__
gene transcript exon length
A NM_1 1 10
A NM_1 2 5
A NM_1 3 20
A NM_2 1 10
A NM_2 2 5
A NM_2 3 50
B NM_5 1 10
output
gene transcript length_sum
B NM_5 10
A NM_2 65
This is made much less untidy using the nmax_by (numeric maximum by) function from List::UtilsBy. This program accumulates the total length in a hash and then picks out the longest transcript for each gene using nmax_by.
I presume you're able to open the input file on $fh instead of using the DATA handle? Or you could pass the path to the input file on the command line and just use <> instead of <$fh>, without explicitly opening anything.
use strict;
use warnings;
use List::UtilsBy qw/ nmax_by /;
my $fh = \*DATA;
<$fh>; # Drop header line
my %genes;
while ( <$fh> ) {
my ($gene, $trans, $exon, $len) = split;
$genes{$gene}{$trans} += $len;
}
my $fmt = "%-7s%-14s%-s\n";
printf $fmt, qw/ gene transcript length /;
for my $gene ( sort keys %genes ) {
my $trans = nmax_by { $genes{$gene}{$_} } keys %{ $genes{$gene} };
printf ' '.$fmt, $gene, $trans, $genes{$gene}{$trans};
}
__DATA__
gene transcript exon length
A NM_1 1 10
A NM_1 2 5
A NM_1 3 20
A NM_2 1 10
A NM_2 2 5
A NM_2 3 50
B NM_5 1 10
output
gene transcript length
A NM_2 65
B NM_5 10
Update
Here's a much shortened version of nmax_by that will work for you to test. You can add this at the top of the program, or if you'd rather put it at the end then you need to pre-declare it with sub nmax_by(&@); at the top, because it has a prototype.
sub nmax_by(&@) {
my $code = shift;
my ($max, $maxval);
for ( @_ ) {
my $val = $code->($_);
($max, $maxval) = ($_, $val) unless defined $maxval and $maxval >= $val;
}
$max;
}
I'm new to Perl. I have the text file below, and from it I want a single Time column followed by the Value columns. How can I create a text file with my desired output in Perl?
Time Value Time Value Time Value
1 0.353366497 1 0.822193251 1 0.780866396
2 0.168834182 2 0.865650713 2 0.42429447
3 0.323540698 3 0.865984245 3 0.856875894
4 0.721728497 4 0.634773162 4 0.563059042
5 0.545131335 5 0.029808531 5 0.645993399
6 0.143720835 6 0.949973296 6 0.14425803
7 0.414601876 7 0.53421424 7 0.826148814
8 0.194818367 8 0.942334356 8 0.837107013
9 0.291448263 9 0.242588271 9 0.939609775
10 0.500159997 10 0.428897293 10 0.41946448
I've tried below code:
use strict;
use warnings;
use IO::File;
my $result;
my @files = (q[1.txt],q[2.txt],q[3.txt]);
my @fhs = ();
foreach my $file (@files) {
my $fh = new IO::File $file, O_RDONLY;
push @fhs, $fh if defined $fh;
}
while(1) {
my @lines = map { $_->getline } @fhs;
last if grep { not defined $_ } @lines[0..(@fhs-1)];
my @result=join(qq[\t], map { s/[\r?\n]+/ /g; $_ } @lines ) . qq[\r\n];
open (MYFILE, '>>Result.txt');
print (MYFILE "@result");
close (MYFILE);
}
I'd go with split.
use warnings;
use strict;
open (my $f, '<', 'your-file.dat') or die;
while (my $line = <$f>) {
my @elems = split ' ', $line;
print join "\t", @elems[0,1,3,5];
print "\n";
}
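The slice [0,1,3,5] is hardcoded for three Time/Value pairs. If the number of pairs can vary, one option is to keep field 0 plus every odd-numbered field. The sketch below assumes whitespace-separated input where Time and Value strictly alternate; keep_time_and_values is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the first Time field plus every Value field.
sub keep_time_and_values {
    my ($line) = @_;
    my @f = split ' ', $line;
    return '' unless @f;    # blank line: nothing to keep
    # Odd indices (1, 3, 5, ...) are the Value columns.
    return join "\t", $f[0], @f[ grep { $_ % 2 } 0 .. $#f ];
}

print keep_time_and_values($_), "\n" for
    'Time Value Time Value Time Value',
    '1 0.353366497 1 0.822193251 1 0.780866396';
```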
This is a one-liner; no need to write a script:
$ perl -lanE '$,="\t"; say @F[0,1,3,5]' 1.txt 2.txt 3.txt
If you like, you can shorten it to:
$ perl -lanE '$,="\t"; say @F[0,1,3,5]' [123].txt
Right now, you're just concatenating the lines of the files together. If that doesn't give you the output you like, you need to chop some columns out.
Since your output looks like you have tab delimited files as input, I split the incoming lines on tabs. And since you only want the second column of each file, I take only the field at index 1 from each split.
my $line_num = 0;
while(1) {
my @lines = map { $_->getline } @fhs;
last if grep { not defined $_ } @lines[0..$#fhs];
$line_num++;
# strip the line ending so it does not end up inside the last field
my @rows = map { s/\r?\n\z//; [ split /\t/ ] } @lines;
my $time_val = $rows[0][0];
die "Time values are not all equal on line #$line_num!"
if grep { $time_val != $_->[0] } @rows
;
my $result = join( qq[\t], $time_val, map { $_->[1] } @rows );
open (MYFILE, '>>Result.txt');
print (MYFILE "$result\n");
close (MYFILE);
}
Of course, there is no reason to do custom coding to split delimited columns:
use Text::CSV;
...
my $csv = Text::CSV->new( { sep_char => "\t" } );
while(1) {
my @rows = map { $csv->getline( $_ ) } @fhs;
last if grep { not defined $_ } @rows[0..$#fhs];
my ( $time_val, @time_vals ) = map { $_->[0] } @rows;
my @values = map { $_->[1] } @rows;
die "Time values are not all equal on line #$line_num!"
if grep { $time_val != $_ } @time_vals
;
my $result = join( qq[\t], $time_val, @values );
...
}
use strict;
use warnings;
open(FH,"<","a.txt");
print "=========== A File content =========== \n";
my $a = `cat a.txt`;
print "$a\n";
my @temp = <FH>;
my (@arr, @entries, @final);
foreach ( @temp ) {
@arr = split ( " ", $_ );
push @entries, @arr;
}
close FH;
my @entries1 = @entries;
for(my $i = 7; $i<=$#entries; $i=$i+2) {
push @final, $entries[$i];
}
my $size = scalar @final;
open FH1, ">", "b.txt";
print FH1 "Time \t Value\n";
for(my $i = 0; $i < $size; $i++) {
my $j = $i+1;
print FH1 "$j \t $final[$i]\n";
}
close FH1;
print "============ B file content ===============\n";
my $b = `cat b.txt`;
print "$b";
O/P:
=========== A File content ===========
Time Value Time Value Time Value
1 0.353366497 1 0.822193251 1 0.780866396
2 0.168834182 2 0.865650713 2 0.42429447
3 0.323540698 3 0.865984245 3 0.856875894
4 0.721728497 4 0.634773162 4 0.563059042
5 0.545131335 5 0.029808531 5 0.645993399
6 0.143720835 6 0.949973296 6 0.14425803
7 0.414601876 7 0.53421424 7 0.826148814
8 0.194818367 8 0.942334356 8 0.837107013
9 0.291448263 9 0.242588271 9 0.939609775
10 0.500159997 10 0.428897293 10 0.41946448
============ B file content ===============
Time Value
1 0.353366497
2 0.822193251
3 0.780866396
4 0.168834182
5 0.865650713
6 0.42429447
7 0.323540698
8 0.865984245
9 0.856875894
10 0.721728497
11 0.634773162
12 0.563059042
13 0.545131335
14 0.029808531
15 0.645993399
16 0.143720835
17 0.949973296
18 0.14425803
19 0.414601876
20 0.53421424
21 0.826148814
22 0.194818367
23 0.942334356
24 0.837107013
25 0.291448263
26 0.242588271
27 0.939609775
28 0.500159997
29 0.428897293
30 0.41946448
I have tab-delimited data that I want to process using Perl. I am a newbie to Perl and could not figure out how to do it.
This is a sample of the table (my original file is almost 1 GB):
gi|306963568|gb|GL429799.1|_1316857_1453052 13 1
gi|306963568|gb|GL429799.1|_1316857_1453052 14 1
gi|306963568|gb|GL429799.1|_1316857_1453052 15 1
gi|306963568|gb|GL429799.1|_1316857_1453052 16 1
gi|306963568|gb|GL429799.1|_1316857_1453052 17 1
gi|306963568|gb|GL429799.1|_1316857_1453052 360 1
gi|306963568|gb|GL429799.1|_1316857_1453052 361 1
gi|306963568|gb|GL429799.1|_1316857_1453052 362 1
gi|306963568|gb|GL429799.1|_1316857_1453052 363 1
gi|306963568|gb|GL429799.1|_1316857_1453052 364 1
gi|306963568|gb|GL429799.1|_1316857_1453052 365 1
gi|306963568|gb|GL429799.1|_1316857_1453052 366 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38640 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38641 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38642 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38643 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38644 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38645 1
I would like to get the output as
Name, start value, end value, average
gi|306963568|gb|GL429799.1|_1316857_1453052 13 17 1
gi|306963568|gb|GL429799.1|_1316857_1453052 360 366 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38640 38645 1
It would be great if someone could share their wisdom.
The general pattern is
use strict;
use warnings;
open my $fh, '<', 'myfile' or die $!;
while (<$fh>) {
chomp;
my #fields = split /\t/;
...
}
Within the loop the fields can be accessed as $fields[0] through $fields[2].
Update
I have understood your question better, and I think this solution will work for you. Note that it assumes the input data is sorted, as you have shown in your question.
It accumulates the start and end values, the total and the count in hash %data, and keeps a list of all the names encountered in @names so that the data can be displayed in the order it was read.
The program expects the input file name as a parameter on the command line.
You need to consider the formatting of the average because it is a floating point value. As it stands it will display the value to sixteen significant figures, and you may want to curtail that using sprintf.
use strict;
use warnings;
my ($filename) = @ARGV;
open my $fh, '<', $filename or die qq{Unable to open "$filename": $!};
my @names;
my %data;
my $current_name = '';
my $last_index;
while (<$fh>) {
chomp;
my ($name, $index, $value) = split /\t/;
if ( $current_name ne $name or $index > $last_index + 1 ) {
push @names, $name unless $data{$name};
push @{ $data{$name} }, {
start => $index,
count => 0,
total => 0,
};
$current_name = $name;
}
my $entry = $data{$name}[-1];
$entry->{end} = $index;
$entry->{count} += 1;
$entry->{total} += $value;
$last_index = $index;
}
for my $name (#names) {
for my $entry (@{ $data{$name} }) {
my ($start, $end, $total, $count) = @{$entry}{qw/ start end total count /};
print join("\t", $name, $start, $end, $total / $count), "\n";
}
}
output
gi|306963568|gb|GL429799.1|_1316857_1453052 13 17 1
gi|306963568|gb|GL429799.1|_1316857_1453052 360 366 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38640 38645 1
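On the point about formatting the average: a small sketch of trimming it with sprintf follows. format_average is a hypothetical helper, and the default of two decimal places is an arbitrary choice for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: format total/count to a fixed number of decimals.
sub format_average {
    my ($total, $count, $places) = @_;
    $places //= 2;    # assumed default, pick whatever suits your data
    # The * in the format takes the precision from the argument list.
    return sprintf '%.*f', $places, $total / $count;
}

print format_average(10, 3), "\n";      # 3.33
print format_average(5, 5, 0), "\n";    # 1
```

In the program above, the print line would then pass $total and $count through such a helper instead of printing $total / $count directly.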
This will produce the same output for the sample in your question:
#!/usr/bin/env perl -n
#
my ($name, $i, $value) = split(/\t/);
sub print_stats {
print join("\t", $prev_name, $start, $prev_i, $sum / ($prev_i - $start + 1)), "\n";
}
if ($prev_name eq $name && $i == $prev_i + 1) {
$sum += $value;
$prev_i = $i;
}
else {
if ($prev_name) {
&print_stats();
}
$start = $i;
$prev_name = $name;
$sum = $value;
$prev_i = $i;
}
END {
&print_stats();
}
Use it as:
./parser.pl < sample.txt
UPDATE: answers to the questions in comments:
To print output to a file, run like this: ./parser.pl < sample.txt > output.txt
$prev_name and $prev_i are NOT initialized, so they are undef at first (= NULL)
You could do something like this....
open (FILE, 'data.txt');
while (<FILE>) {
chomp;
($name, $start_value, $end_value, $average) = split("\t");
print "Name: $name\n";
print "Start Value: $start_value\n";
print "End Value: $end_value\n";
print "Average: $average\n";
print "---------\n";
}
close (FILE);
exit;
Those look like GenBank files...so I'm unsure where you are getting the start, end values, average.
Here's an example using Text::CSV:
use Text::CSV; # This will implicitly use Text::CSV_XS if it's installed
my $parser = Text::CSV->new( { sep_char => '|' } );
open my $fh, '<', 'myfile' or die $!;
while (my $row = $parser->getline($fh)) {
# $row references an array of field values from the line just read
}
Also, as a minor side detail, your sample data is delimited by pipe characters, not tabs, although that may just be to avoid copy/paste errors for those answering your question. If the actual data is tab-delimited, set sep_char to "\t" instead of '|'.