Var_ID sample1 sample2 sample3 sample4 sample5 sample6 sample7
A_1 18.66530716 0 10.45969216 52.71893547 40.04726048 32.16758825 38.27754435
A_2 25.19816467 0 12.5516306 37.95763354 28.39714834 25.7340706 37.581589
A_3 61.5006053 0 6.807664053 4.57493135 23.69514333 9.304974679 29.44245014
A_4 46.71317515 4.988346264 21.47872616 36.08568845 7.47600779 18.34871344 75.02919728
A_5 38.12488272 0 0 28.71499464 19.82997811 19.46785483 66.33787183
A_6 44.16019386 3.313750449 10.70121259 38.35466425 8.691025042 13.40792311 42.72152213
B_1 38.39720331 13.32601073 0 19.28006783 9.985810405 9.803455466 95.44530538
B_2 46.53021582 1.899838598 24.54086634 13.74342921 24.20186228 6.988206544 47.62545788
B_3 48.42890507 0 6.0308135 20.26433556 20.99119304 10.30393217 64.20344867
A_7 32.10687649 0 20.56239825 23.03079775 9.542753971 10.5395511 44.46513374
B_4 34.82673166 0 6.122746633 39.08916191 8.524472297 14.64540603 54.99744731
B_5 32.49685303 2.910517165 15.66506159 35.79294964 8.723952928 10.7058016 52.11522135
B_6 30.38974634 0 0 30.51870034 10.53778987 17.24225836 50.36058827
B_7 59.60856159 0 8.097826192 19.0468412 2.818575518 11.06841746 10.77608287
A_8 36.07790915 6.260541956 0 31.70212496 14.07396097 4.605650219 67.26011453
C_1 0 17.27445836 0 382.0309737 1.849224149 0 0
C_2 344.0389416 119.4010562 32.13217433 0 22.36821531 285.4766232 21.37974841
C_3 235.5547989 37.86357293 22.23167043 2.490045661 2.579360621 30.38709443 14.79226135
C_4 0 2.801263518 0 334.3615367 0 0 0
C_5 9.397916894 128.2900334 187.2504332 25.16745451 22.81140838 14.39668285 0
Here is the data matrix. Row is variable and column is sample ID.
A_1 - A_8 is clusterA, B_1 - B_7 is clusterB, C_1 - C_5 is clusterC.
Now I wanna calculate the mean or median of A_1 - A_8 as the value of clusterA, getting the median result as:
Var_ID sample1 sample2 sample3 sample4 sample5 sample6 sample7
clusterA 37.10139593 0 10.58045238 33.89390671 16.95196954 15.87831827 43.59332793
Could anyone help me solve this problem using perl script?
Calculate both mean and median:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use List::Util qw(sum);
use POSIX qw(floor ceil);
my %data = ();
my %avg = ();
my %median = ();
while (<>) {
next if $. == 1;
my #fields = split;
my $cluster = substr($fields[0],0,1);
$data{$cluster} = [] unless exists($data{$cluster});
push #{$data{$cluster}}, [ #fields[1..$#fields] ];
}
for my $cluster (keys(%data)) {
for my $sampleNo (0..scalar(#{$data{$cluster}[0]})-1) {
my #samples = map { $_->[$sampleNo] } #{$data{$cluster}};
my $cnt = #samples;
$avg{$cluster}[$sampleNo] = sum(#samples)/$cnt;
my #sorted = sort #samples;
$median{$cluster}[$sampleNo] = (#sorted[floor(($cnt+1)/2)-1] +
#sorted[ceil(($cnt+1)/2)-1])/2;
}
}
print "Mean\n";
for my $cluster (sort keys (%data)) {
print join("\t", ($cluster,map {sprintf "%15.9f",$_ } #{$avg{$cluster}})),"\n";
}
print "Median\n";
for my $cluster (sort keys (%data)) {
print join("\t", ($cluster,map {sprintf "%15.9f",$_ } #{$median{$cluster}})),"\n";
}
Output:
perl test.pl <sample.txt
Mean
A 37.818389312 1.820329834 10.320165477 31.642471301 18.969159754 16.697040778 50.139427875
B 41.525459546 2.590909499 8.636759179 25.390783670 12.254808048 11.536782519 53.646221676
C 117.798331479 61.126076882 48.322855592 148.810002114 9.921641692 66.052080096 7.234401952
Median
A 37.101395935 0.000000000 11.626421595 37.021660995 34.222204410 22.600962715 43.593327935
B 38.397203310 0.000000000 24.540866340 20.264335560 24.201862280 14.645406030 52.115221350
C 235.554798900 17.274458360 187.250433200 25.167454510 2.579360621 14.396682850 0.000000000
Here is an example of how you can calculate the medians of the clusters:
use feature qw(say);
use strict;
use warnings;
my $fn = 'data.txt';
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
my $header = <$fh>;
my %clusters;
while (my $line = <$fh>) {
chomp $line;
my ($id, #cols) = split " ", $line;
die "Bad format" if !#cols;
if ( $id =~ /^([A-Za-z]+)_/ ) {
$id = $1;
}
else {
die "Bad ID";
}
if (!exists $clusters{$id} ) {
$clusters{$id} = [];
}
my $samples = $clusters{$id};
for my $i (0..$#cols) {
push #{ $samples->[$i] }, $cols[$i];
}
}
close $fh;
print $header;
for my $id (sort keys %clusters) {
my $samples = $clusters{$id};
my #items;
push #items, sprintf "cluster%s", $id;
for my $sample (#$samples) {
my $median = calculate_median( $sample );
push #items, $median;
}
say join "\t", #items;
}
sub calculate_median {
my ( $sample ) = #_;
my #sorted = sort {$a <=> $b} #$sample;
my $N = scalar #sorted;
my $i = int ($N/2);
if ( $N % 2 == 0 ) {
my $m1 = $sorted[$i-1];
my $m2 = $sorted[$i];
return ($m1 + $m2)/2;
}
else {
return $sorted[$i];
}
}
Output:
Var_ID sample1 sample2 sample3 sample4 sample5 sample6 sample7
clusterA 37.101395935 0 10.580452375 33.893906705 16.95196954 15.878318275 43.593327935
clusterB 38.39720331 0 6.122746633 20.26433556 9.985810405 10.7058016 52.11522135
clusterC 9.397916894 37.86357293 22.23167043 25.16745451 2.579360621 14.39668285 0
I'm trying to use some sort of pointers in perl so that I can look at two at two files that are sorted in alphabetical order and match things in both the files if they have the same name in the first column. The way i'm searching through each file though is I'm looking at which lines first column is lower in alphabetical order and then moving the pointer on that file to the next line. Somewhat similar to the pointers in merge sort. The code below is an example of what I want.
Using these two files.
set1
apple 17 20
boombox 23 29
carl 25 29
cat 22 33
dog 27 44
set2
ants yes
boombox no
carl yes
dentist yes
dice no
dog no
I can make a script that does something like this
($name, $affirmation) = first line in set2; #part I'm confused about I just kind of need some sort of command of something that will do this
while (<>){
#set1 = split;
while ($name < set1[0]){
($name, $affirmation) = next line in set2; # part i'm confused about I just kind of need some sort of command of something that will do this
}
if ($name = $set[0]{
print #set1, $affirmation;
}
This is how I would run it
./script.txt set1
I would end up with
boombox 23 29 no
carl 25 29 yes
dog 27 44 no
.
.
Edit:
I tried some code in some of the answers to see if I could make some functional code out of it but I seem to be running into problems, and some of the syntax in the answers I could not understand so I'm having a lot of trouble figuring out how to debug or solve this.
This is my specific example using the folllowing two text files
text.txt
Apples 0 -1 -1 0 0 0 0 -1
Apricots 0 1 1 0 0 0 0 1
Fruit 0 -1 -1 0 0 0 0 -1
Grapes 0 -2 -1 0 0 0 0 -2
Oranges 0 1 1 0 0 0 0 -1
Peaches 0 -2 -1 0 0 0 0 -2
text2.txt
Apples CHR1 + 1167628 1170420 1 1 N
Apricots CHR1 - 2115898 2144159 1 1 N
Oranges CHR1 - 19665266 19812066 1 1 N
Noidberry CHR1 - 1337728 1329993 1 1 N
Peaches CHR1 - 1337275 1342693 1 1 N
And this script
script.pl
#!/usr/bin/perl
use warnings;
my $file_1 = $ARGV[0];
my $file_2 = $ARGV[1];
open my $fh1, '<', $file_1 or die "Can't open $file_1: $!";
open my $fh2, '<', $file_2 or die "Can't open $file_2: $!";
open(my $single, '>', 'text.txt');
open(my $deep, '>', 'text2.txt');
OUTER: while (my $outer = <$fh1>){
chomp $outer;
#CopyNumber = split(' ', $outer);
($title, $title2) = split('\|', $CopyNumber[0]);
#print 'title: ',$title,' title2: ',$title2,"\n";
my $numLoss = 0;
my $deepLoss = 0;
for ($i = 1; $i <= $#CopyNumber; $i++){
#print "$CopyNumber[$i], $#CopyNumber, $i, \n";
if ($CopyNumber[$i] < 0){
$numLoss = $numLoss + 1;
if ($CopyNumber[$i] <-1){
$deepLoss = $deepLoss + 1;
}
}
}
if ($GeneSym and (($GeneSym cmp $title)==0)){ #or (($GeneSym cmp $title2)==0))){
print $single $Chrom,"\t",$Start,"\t",$Stop,"\t",$numLoss/$#CopyNumber,"\n";
print $deep $Chrom,"\t",$Start,"\t",$Stop,"\t",$deepLoss/$#CopyNumber,"\n";
next OUTER;
}
INNER: while (my $inner = <$fh2>){
($GeneSym, $Chrom, $Strand, $Start, $Stop, $MapId, $TotalMap, $AbnormalMerge, $Overlap) = split(' ', $inner);
$Chrom =~ s/CHR/hs/ee;
my $cmp = ($GeneSym cmp $title);
next OUTER if $cmp < 0;
if ($cmp==0){ #or (($GeneSym cmp $title2)==0)){
print $single $Chrom,"\t",$Start,"\t",$Stop,"\t",$numLoss/$#CopyNumber,"\n";
print $deep $Chrom,"\t",$Start,"\t",$Stop,"\t",$deepLoss/$#CopyNumber,"\n";
next OUTER;
}
}
}
If I run ./script.pl text.txt text2.txt I should get this printed into Number.txt
//corresponding to columns 2,4,5 of text2.txt and the last column being the percentage of columns which have a number lower than 0
hs1 1167628 1170420 0.375 //For Apples
hs1 2115898 2144159 0 //For Apricots
hs1 19665266 19812066 0.125 //For Oranges
hs1 1337275 1342693 0.375 //For Peaches
Instead I get this
hs1 1167628 1170420 0.375
hs1 2115898 2144159 0
hs1 1337275 1342693 0.375
So I'm just getting an error where
hs1 19665266 19812066 0.125 //For Oranges
isn't printing
Quite like you state, with: use cmp for comparison, split line into two terms.
For each line of FILE1 file go through lines of FILE2 file, exiting when a match is found. Once the FILE2 overshoots alphabetically move to the next line of FILE1.
use warnings 'all';
use strict;
sub process {
my ($name, $affirm_1, $affirm_2) = #_;
print "$name $affirm_1 $affirm_2\n";
}
my $file_1 = 'set1.txt';
my $file_2 = 'set2.txt';
open my $fh1, '<', $file_1 or die "Can't open $file_1: $!";
open my $fh2, '<', $file_2 or die "Can't open $file_2: $!";
my ($name_2, $affirm_2);
FILE1: while (my $line1 = <$fh1>) {
chomp $line1;
my ($name_1, $affirm_1) = split ' ', $line1, 2;
if ($name_2) {
my $cmp = $name_1 cmp $name_2;
next FILE1 if $cmp < 0;
if ($cmp == 0) {
process($name_1, $affirm_1, $affirm_2);
next FILE1;
}
}
FILE2: while (my $line2 = <$fh2>) {
chomp $line2;
($name_2, $affirm_2) = split ' ', $line2, 2;
my $cmp = $name_1 cmp $name_2;
next FILE1 if $cmp < 0;
if ($cmp == 0) {
process($name_1, $affirm_1, $affirm_2);
next FILE1;
}
}
}
Comments on a few remaining details.
Once a FILE2 line "overshoots," in the next iteration of FILE1 we need to first check that line, before entering the FILE2 loop to iterate over its remaining lines. For the first FILE1 line the $name_2 is still undef thus if ($name_2).
Updated for edited post.
use warnings 'all';
use strict;
sub process_line {
my ($single, $deep, $rline, $GeneSym, $Chrom, $Start, $Stop) = #_;
my ($numLoss, $deepLoss) = calc_loss($rline);
$Chrom =~ s/CHR/hs/;
print $single (join "\t", $Chrom, $Start, $Stop, $numLoss/$#$rline), "\n";
print $deep (join "\t", $Chrom, $Start, $Stop, $deepLoss/$#$rline), "\n";
}
sub calc_loss {
my ($rline) = #_;
my ($numLoss, $deepLoss) = (0, 0);
for my $i (1.. $#$rline) {
$numLoss += 1 if $rline->[$i] < 0;
$deepLoss += 1 if $rline->[$i] < -1;
}
return $numLoss, $deepLoss;
}
my ($Number, $NumberDeep) = ('Number.txt', 'NumberDeep.txt');
open my $single, '>', $Number or die "Can't open $Number: $!";
open my $deep, '>', $NumberDeep or die "Can't open $NumberDeep: $!";
my ($file_1, $file_2) = ('set1_new.txt', 'set2_new.txt');
open my $fh1, '<', $file_1 or die "Can't open $file_1: $!";
open my $fh2, '<', $file_2 or die "Can't open $file_2: $!";
my ($GeneSym, $Chrom, $Strand, $Start, $Stop,
$MapId, $TotalMap, $AbnormalMerge, $Overlap);
FILE1: while (my $line1 = <$fh1>) {
next if $line1 =~ /^\s*$/;
chomp $line1;
my #line = split ' ', $line1;
if ($GeneSym) {
my $cmp = $line[0] cmp $GeneSym;
next FILE1 if $cmp < 0;
if ($cmp == 0) {
process_line($single, $deep, \#line,
$GeneSym, $Chrom, $Start, $Stop);
next FILE1;
}
}
FILE2: while (<$fh2>) {
next if /^\s*$/;
chomp;
($GeneSym, $Chrom, $Strand, $Start, $Stop,
$MapId, $TotalMap, $AbnormalMerge, $Overlap) = split;
my $cmp = $line[0] cmp $GeneSym;
next FILE1 if $cmp < 0;
if ($cmp == 0) {
process_line($single, $deep, \#line,
$GeneSym, $Chrom, $Start, $Stop);
next FILE1;
}
}
}
This produces the desired output with given sample files. Some shortcuts are taken, please let me know if comments would be helpful. Here are a few
Much error checking should be added around.
I assume the first field of FILE1 to be used as it stands. Otherwise changes are needed.
Processing is split into two functions, calculations being separate. This is not necessary.
$#$rline is the index of the last element of $rline arrayref. If this is too much syntax to stomach use #$rline - 1, for example as (0..#$rline-1)
Some comments on the code posted in the question:
Always, always, please use warnings; (and use strict;)
loop over indices is best written foreach my $i (0..$#array)
The regex modifier /ee is very involved. There is absolutely no need for it here.
You're right. It's exactly like a merge sort, except only matching lines are output.
sub read_and_parse1 {
my ($fh) = #_;
defined( my $line = <$fh> )
or return undef;
my ($id, #copy) = split(' ', $line); # Use split(/\t/, $line) if tab-separated data
my ($gene_sym) = split(/\|/, $id);
return [ $gene_sym, #copy ];
}
sub read_and_parse2 {
my ($fh) = #_;
defined( my $line = <$fh> )
or return undef;
return [ split(' ', $line) ]; # Use split(/\t/, $line) if tab-separated data
}
my $fields1 = read_and_parse1($fh1);
my $fields2 = read_and_parse2($fh2);
while ($fields1 && $fields2) {
my $cmp = $fields1->[0] cmp $fields2->[0];
if ($cmp < 0) { $fields1 = read_and_parse1($fh1); }
elsif ($cmp > 0) { $fields2 = read_and_parse2($fh2); }
else {
my ($gene_sym, #copy) = #$fields1;
my (undef, $chrom, $strand, $start, $stop, $map_id, $total_map, $abnormal_merge, $overlap) = #$fields2;
$chrom =~ s/^CHR/hs/;
my $num_loss = grep { $_ < 0 } #copy;
my $deep_loss = grep { $_ < -1 } #copy;
print($single_fh join("\t", $chrom, $start, $stop, $num_loss/#copy ) . "\n");
print($deep_fh join("\t", $chrom, $start, $stop, $deep_loss/#copy ) . "\n");
$fields1 = read_and_parse1($fh1);
$fields2 = read_and_parse2($fh2);
}
}
Output:
$ cat single.txt
hs1 1167628 1170420 0.375
hs1 2115898 2144159 0
hs1 19665266 19812066 0.125
hs1 1337275 1342693 0.375
$ cat deep.txt
hs1 1167628 1170420 0
hs1 2115898 2144159 0
hs1 19665266 19812066 0
hs1 1337275 1342693 0.25
Pardon me for asking a question without any coding effort. But it seems too much difficult to me.
I have a data file with tab separated three data columns (and some repetitive header lines) as:
Sequence ../Output/yy\Programs\NP_416485.4 alignment. Using default output format...
# ../Output/Split_Seq/NP_415931.4.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.89 u-p
1 -5.79 ---
2 0.85 yui
3 0.51 uio
4 0.66 -08
Sequence ../Output/yy\Programs\YP_986467.7 alignment. Using default output format...
# ../Output/Split_Seq/YP_986467.7.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.001 -s-
1 0.984 ---
2 0.564 -fg
3 0.897 -sr
From the second data column, for those value(s) which are more than 0.5, I want to extract the corresponding first column number (or range).
For the above Input, the output would be:
NP_416485.4: 1, 3-5
YP_986467.7: 2-4
Here, "NP_416485.4" and "YP_986467.7" are from header descriptor (after \Programs). (Note that, the actual value for "NP_416485.4" for example, should be, "NP_416485.4: 0, 2-4", but I increases all of them with +1 as I don't want to start with 0).
Thanks for your consideration. I would appreciate any help. Thank you
Here is one approach. In case you would have a DOS data file on a Unix machine, I used \r?\n to match a new line, so it will work for all cases:
use feature qw(say);
use strict;
use warnings;
my $file_name = 'input.txt';
open ( my $fh, '<', $file_name ) or die "Could not open file '$file_name': $!";
my $str = do { local $/; <$fh> };
close $fh;
my #chunks = $str =~ /(Sequence(?:.(?!Sequence))*)/sg;
my %ids;
for my $cstr ( #chunks ) {
my ( $id, $data ) = $cstr
=~/Split_Seq\/(\S+)\.fasta.*?\r?\n\r?\n(.*)$/s;
my #lines = split /\n/, $data;
my #vals;
for my $line ( #lines ) {
my #fields = split " ", $line;
push ( #vals, $fields[0] + 1 ) if $fields[1] > 0.5;
}
$ids{$id} = \#vals;
}
for my $id ( keys %ids ) {
my #tmp = sort { $a <=> $b } #{ $ids{$id} };
my ( $first, $last );
my #rr;
for my $i (0..$#tmp) {
if ( $i == 0 ) {
$first = $tmp[0];
$last = undef;
}
if ( $i < $#tmp && ($tmp[$i] == ($tmp[$i+1] - 1 )) ) {
$last = $tmp[$i+1];
next;
}
if ( defined $last ) {
push #rr, "$first-$last";
$last = undef;
}
else {
push #rr, $tmp[$i];
}
$first = ( $i < $#tmp ) ? $tmp[$i+1] : undef;
}
say "$id: ", join ",", #rr;
}
Output:
NP_416485.4: 1,3-5
YP_986467.7: 2-4
You don't really give a good description of your problem, and you haven't made any effort to solve it yourself, but here's a solution to the first part of your problem (parsing the file into a data structure). You'll need to walk the %results hash and produce the output that you want.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Data::Dumper;
my %results;
my $section;
while (<DATA>) {
# Look for a new section
if (/\\Programs\\(\S+)\s/) {
$section = $1;
}
# Look for data lines
if (/^\d\b/) {
my #data = split;
if ($data[1] > 0.5) {
push #{$results{$section}}, $data[0] + 1;
}
}
}
say Dumper \%results;
__DATA__
Sequence ../Output/yy\Programs\NP_416485.4 alignment. Using default output format...
# ../Output/Split_Seq/NP_415931.4.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.89 u-p
1 -5.79 ---
2 0.85 yui
3 0.51 uio
4 0.66 -08
Sequence ../Output/yy\Programs\YP_986467.7 alignment. Using default output format...
# ../Output/Split_Seq/YP_986467.7.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.001 -s-
1 0.984 ---
2 0.564 -fg
3 0.897 -sr
I have wrote the following program:
use strict;
use warnings;
use 5.010;
my $nodesNumber = 100 ;
my $communitiesNumber = 10;
my $prob_communities = 0.3;
for my $i (1 .. $nodesNumber){
for my $j (1 .. $communitiesNumber){
my $random_number=rand();
if ($prob_comunities > $random_number){
say "$i $j";
}
}
}
This program gives as output a list of two columns of integers as:
1 2
1 4
2 2
2 5
2 7
...
I would like to create a vector in which the first element in the left column is counted once and the right column elements represents the value of the vector's components. I would like the output to look like:
vector[0][0]= 1
vector[0][1]= 2
vector[0][2]= 4
vector[1][0]= 2
vector[1][1]= 2
vector[1][2]= 5
vector[1][3]= 7
Any help?
#!/usr/bin/env perl
# file: build_vector.pl
use strict;
use warnings;
my #vector; # the 2-d vector
my %mark; # mark the occurrence of the number in the first column
my $index = -1; # first dimensional index of the vector
while (<>) {
chomp;
my ($first, $second) = split /\s+/;
next if $second eq '';
if (not exists $mark{$first}) {
$mark{ $first } = ++$index;
push #{ $vector[$index] }, $first;
}
push #{ $vector[$index] }, $second;
}
# dump results
for my $i (0..$#vector) {
for my $j (0..$#{ $vector[$i] }) {
print "$vector[$i][$j] ";
}
print "\n";
}
This script will processing the output of your script and build the vector in #vector. If your script has filename generator.pl, you can call:
$ perl generator.pl | perl build_vector.pl
UPDATE:
use strict;
use warnings;
my $nodesNumber = 100 ;
my $communitiesNumber = 10;
my $prob_communities = 0.3;
my #vector; # the 2-d vector
my %mark; # mark the occurrence of the number in the first column
my $index = -1; # first dimensional index of the vector
for my $i (1 .. $nodesNumber){
for my $j (1 .. $communitiesNumber){
my $random_number=rand();
if ($prob_communities > $random_number){
if (not exists $mark{$i}) {
$mark{ $i } = ++$index;
push #{ $vector[$index] }, $i;
}
push #{ $vector[$index] }, $j;
}
}
}
# dump results
for my $i (0..$#vector) {
for my $j (0..$#{ $vector[$i] }) {
print "$vector[$i][$j] ";
}
print "\n";
}
#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;
use Const::Fast;
use Math::Random::MT;
const my $MAX_RAND => 10;
my $rng = Math::Random::MT->new;
my #v = map {
my $l = $rng->irand;
[ map 1 + int($rng->rand($MAX_RAND)), 0 .. int($l) ];
} 1 .. 5;
use YAML;
print Dump \#v;
I have a tab delimited data. I want to process that data using perl. I am a newbie to perl and could not figure out how to solve .
This is sample table: My original file is almost a GB
gi|306963568|gb|GL429799.1|_1316857_1453052 13 1
gi|306963568|gb|GL429799.1|_1316857_1453052 14 1
gi|306963568|gb|GL429799.1|_1316857_1453052 15 1
gi|306963568|gb|GL429799.1|_1316857_1453052 16 1
gi|306963568|gb|GL429799.1|_1316857_1453052 17 1
gi|306963568|gb|GL429799.1|_1316857_1453052 360 1
gi|306963568|gb|GL429799.1|_1316857_1453052 361 1
gi|306963568|gb|GL429799.1|_1316857_1453052 362 1
gi|306963568|gb|GL429799.1|_1316857_1453052 363 1
gi|306963568|gb|GL429799.1|_1316857_1453052 364 1
gi|306963568|gb|GL429799.1|_1316857_1453052 365 1
gi|306963568|gb|GL429799.1|_1316857_1453052 366 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38640 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38641 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38642 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38643 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38644 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38645 1
I would like to get the output as
Name, start value, end value, average
gi|306963568|gb|GL429799.1|_1316857_1453052 13 17 1
gi|306963568|gb|GL429799.1|_1316857_1453052 360 366 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38640 38645 1
it will be great if someone could share their wisdom.
The general pattern is
use strict;
use warnings;
open my $fh, '<', 'myfile' or die $!;
while (<$fh>) {
chomp;
my #fields = split /\t/;
...
}
Within the loop the fields can be accessed as $fields[0] through $fields[2].
Update
I have understood your question better, and I think this solution will work for you. Note that it assumes the input data is sorted, as you have shown in your question.
It accumulates the start and end values, the total and the count in hash %data, and keeps a list of all the names encountered in #names so that the data can be displayed in the order it was read.
The program expects the input file name as a parameter on the command line.
You need to consider the formatting of the average because it is a floating point value. As it stands it will display the value to sixteen significant figures, and you may want to curtail that using sprintf.
use strict;
use warnings;
my ($filename) = #ARGV;
open my $fh, '<', $filename or die qq{Unable to open "$filename": $!};
my #names;
my %data;
my $current_name = '';
my $last_index;
while (<$fh>) {
chomp;
my ($name, $index, $value) = split /\t/;
if ( $current_name ne $name or $index > $last_index + 1 ) {
push #names, $name unless $data{$name};
push #{ $data{$name} }, {
start => $index,
count => 0,
total => 0,
};
$current_name = $name;
}
my $entry = $data{$name}[-1];
$entry->{end} = $index;
$entry->{count} += 1;
$entry->{total} += $value;
$last_index = $index;
}
for my $name (#names) {
for my $entry (#{ $data{$name} }) {
my ($start, $end, $total, $count) = #{$entry}{qw/ start end total count /};
print join("\t", $name, $start, $end, $total / $count), "\n";
}
}
output
gi|306963568|gb|GL429799.1|_1316857_1453052 13 17 1
gi|306963568|gb|GL429799.1|_1316857_1453052 360 366 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38640 38645 1
This will produce the same output for the sample in your question:
#!/usr/bin/env perl -n
#
my ($name, $i, $value) = split(/\t/);
sub print_stats {
print join("\t", $prev_name, $start, $prev_i, $sum / ($prev_i - $start + 1)), "\n";
}
if ($prev_name eq $name && $i == $prev_i + 1) {
$sum += $value;
$prev_i = $i;
}
else {
if ($prev_name) {
&print_stats();
}
$start = $i;
$prev_name = $name;
$sum = $value;
$prev_i = $i;
}
END {
&print_stats();
}
Use it as:
./parser.pl < sample.txt
UPDATE: answers to the questions in comments:
To print output to a file, run like this: ./parser.pl < sample.txt > output.txt
$prev_name and $prev_i are NOT initialized, so they are undef at first (= NULL)
You could do something like this....
open (FILE, 'data.txt');
while (<FILE>) {
chomp;
($name, $start_value, $end_value, $average) = split("\t");
print "Name: $name\n";
print "Start Value: $start_value\n";
print "End Value: $End_Value\n";
print "Average: %average
print "---------\n";
}
close (FILE);
exit;
Those look like GenBank files...so I'm unsure where you are getting the start, end values, average.
Here's an example using Text::CSV:
use Text::CSV; # This will implicitly use Text::CSV_XS if it's installed
my $parser = Text::CSV->new( { sep_char => '|' } );
open my $fh, '<', 'myfile' or die $!;
while (my $row = $parser->getline($fh)) {
# $row references an array of field values from the line just read
}
Also, as a minor side detail, your sample data is delimited by pipe characters, not tabs, although that may just be to avoid copy/paste errors for those answering your question. If the actual data is tab-delimited, set sep_char to "\t" instead of '|'.