Using pointers to lines in a file in perl - perl

I'm trying to use some sort of pointers in perl so that I can look at two at two files that are sorted in alphabetical order and match things in both the files if they have the same name in the first column. The way i'm searching through each file though is I'm looking at which lines first column is lower in alphabetical order and then moving the pointer on that file to the next line. Somewhat similar to the pointers in merge sort. The code below is an example of what I want.
Using these two files.
set1
apple 17 20
boombox 23 29
carl 25 29
cat 22 33
dog 27 44
set2
ants yes
boombox no
carl yes
dentist yes
dice no
dog no
I can make a script that does something like this
($name, $affirmation) = first line in set2; #part I'm confused about I just kind of need some sort of command of something that will do this
while (<>){
#set1 = split;
while ($name < set1[0]){
($name, $affirmation) = next line in set2; # part i'm confused about I just kind of need some sort of command of something that will do this
}
if ($name = $set[0]{
print #set1, $affirmation;
}
This is how I would run it
./script.txt set1
I would end up with
boombox 23 29 no
carl 25 29 yes
dog 27 44 no
.
.
Edit:
I tried some code in some of the answers to see if I could make some functional code out of it but I seem to be running into problems, and some of the syntax in the answers I could not understand so I'm having a lot of trouble figuring out how to debug or solve this.
This is my specific example using the folllowing two text files
text.txt
Apples 0 -1 -1 0 0 0 0 -1
Apricots 0 1 1 0 0 0 0 1
Fruit 0 -1 -1 0 0 0 0 -1
Grapes 0 -2 -1 0 0 0 0 -2
Oranges 0 1 1 0 0 0 0 -1
Peaches 0 -2 -1 0 0 0 0 -2
text2.txt
Apples CHR1 + 1167628 1170420 1 1 N
Apricots CHR1 - 2115898 2144159 1 1 N
Oranges CHR1 - 19665266 19812066 1 1 N
Noidberry CHR1 - 1337728 1329993 1 1 N
Peaches CHR1 - 1337275 1342693 1 1 N
And this script
script.pl
#!/usr/bin/perl
use warnings;
my $file_1 = $ARGV[0];
my $file_2 = $ARGV[1];
open my $fh1, '<', $file_1 or die "Can't open $file_1: $!";
open my $fh2, '<', $file_2 or die "Can't open $file_2: $!";
open(my $single, '>', 'text.txt');
open(my $deep, '>', 'text2.txt');
OUTER: while (my $outer = <$fh1>){
chomp $outer;
#CopyNumber = split(' ', $outer);
($title, $title2) = split('\|', $CopyNumber[0]);
#print 'title: ',$title,' title2: ',$title2,"\n";
my $numLoss = 0;
my $deepLoss = 0;
for ($i = 1; $i <= $#CopyNumber; $i++){
#print "$CopyNumber[$i], $#CopyNumber, $i, \n";
if ($CopyNumber[$i] < 0){
$numLoss = $numLoss + 1;
if ($CopyNumber[$i] <-1){
$deepLoss = $deepLoss + 1;
}
}
}
if ($GeneSym and (($GeneSym cmp $title)==0)){ #or (($GeneSym cmp $title2)==0))){
print $single $Chrom,"\t",$Start,"\t",$Stop,"\t",$numLoss/$#CopyNumber,"\n";
print $deep $Chrom,"\t",$Start,"\t",$Stop,"\t",$deepLoss/$#CopyNumber,"\n";
next OUTER;
}
INNER: while (my $inner = <$fh2>){
($GeneSym, $Chrom, $Strand, $Start, $Stop, $MapId, $TotalMap, $AbnormalMerge, $Overlap) = split(' ', $inner);
$Chrom =~ s/CHR/hs/ee;
my $cmp = ($GeneSym cmp $title);
next OUTER if $cmp < 0;
if ($cmp==0){ #or (($GeneSym cmp $title2)==0)){
print $single $Chrom,"\t",$Start,"\t",$Stop,"\t",$numLoss/$#CopyNumber,"\n";
print $deep $Chrom,"\t",$Start,"\t",$Stop,"\t",$deepLoss/$#CopyNumber,"\n";
next OUTER;
}
}
}
If I run ./script.pl text.txt text2.txt I should get this printed into Number.txt
//corresponding to columns 2,4,5 of text2.txt and the last column being the percentage of columns which have a number lower than 0
hs1 1167628 1170420 0.375 //For Apples
hs1 2115898 2144159 0 //For Apricots
hs1 19665266 19812066 0.125 //For Oranges
hs1 1337275 1342693 0.375 //For Peaches
Instead I get this
hs1 1167628 1170420 0.375
hs1 2115898 2144159 0
hs1 1337275 1342693 0.375
So I'm just getting an error where
hs1 19665266 19812066 0.125 //For Oranges
isn't printing

Quite like you state, with: use cmp for comparison, split line into two terms.
For each line of FILE1 file go through lines of FILE2 file, exiting when a match is found. Once the FILE2 overshoots alphabetically move to the next line of FILE1.
use warnings 'all';
use strict;
sub process {
my ($name, $affirm_1, $affirm_2) = #_;
print "$name $affirm_1 $affirm_2\n";
}
my $file_1 = 'set1.txt';
my $file_2 = 'set2.txt';
open my $fh1, '<', $file_1 or die "Can't open $file_1: $!";
open my $fh2, '<', $file_2 or die "Can't open $file_2: $!";
my ($name_2, $affirm_2);
FILE1: while (my $line1 = <$fh1>) {
chomp $line1;
my ($name_1, $affirm_1) = split ' ', $line1, 2;
if ($name_2) {
my $cmp = $name_1 cmp $name_2;
next FILE1 if $cmp < 0;
if ($cmp == 0) {
process($name_1, $affirm_1, $affirm_2);
next FILE1;
}
}
FILE2: while (my $line2 = <$fh2>) {
chomp $line2;
($name_2, $affirm_2) = split ' ', $line2, 2;
my $cmp = $name_1 cmp $name_2;
next FILE1 if $cmp < 0;
if ($cmp == 0) {
process($name_1, $affirm_1, $affirm_2);
next FILE1;
}
}
}
Comments on a few remaining details.
Once a FILE2 line "overshoots," in the next iteration of FILE1 we need to first check that line, before entering the FILE2 loop to iterate over its remaining lines. For the first FILE1 line the $name_2 is still undef thus if ($name_2).
Updated for edited post.
use warnings 'all';
use strict;
sub process_line {
my ($single, $deep, $rline, $GeneSym, $Chrom, $Start, $Stop) = #_;
my ($numLoss, $deepLoss) = calc_loss($rline);
$Chrom =~ s/CHR/hs/;
print $single (join "\t", $Chrom, $Start, $Stop, $numLoss/$#$rline), "\n";
print $deep (join "\t", $Chrom, $Start, $Stop, $deepLoss/$#$rline), "\n";
}
sub calc_loss {
my ($rline) = #_;
my ($numLoss, $deepLoss) = (0, 0);
for my $i (1.. $#$rline) {
$numLoss += 1 if $rline->[$i] < 0;
$deepLoss += 1 if $rline->[$i] < -1;
}
return $numLoss, $deepLoss;
}
my ($Number, $NumberDeep) = ('Number.txt', 'NumberDeep.txt');
open my $single, '>', $Number or die "Can't open $Number: $!";
open my $deep, '>', $NumberDeep or die "Can't open $NumberDeep: $!";
my ($file_1, $file_2) = ('set1_new.txt', 'set2_new.txt');
open my $fh1, '<', $file_1 or die "Can't open $file_1: $!";
open my $fh2, '<', $file_2 or die "Can't open $file_2: $!";
my ($GeneSym, $Chrom, $Strand, $Start, $Stop,
$MapId, $TotalMap, $AbnormalMerge, $Overlap);
FILE1: while (my $line1 = <$fh1>) {
next if $line1 =~ /^\s*$/;
chomp $line1;
my #line = split ' ', $line1;
if ($GeneSym) {
my $cmp = $line[0] cmp $GeneSym;
next FILE1 if $cmp < 0;
if ($cmp == 0) {
process_line($single, $deep, \#line,
$GeneSym, $Chrom, $Start, $Stop);
next FILE1;
}
}
FILE2: while (<$fh2>) {
next if /^\s*$/;
chomp;
($GeneSym, $Chrom, $Strand, $Start, $Stop,
$MapId, $TotalMap, $AbnormalMerge, $Overlap) = split;
my $cmp = $line[0] cmp $GeneSym;
next FILE1 if $cmp < 0;
if ($cmp == 0) {
process_line($single, $deep, \#line,
$GeneSym, $Chrom, $Start, $Stop);
next FILE1;
}
}
}
This produces the desired output with given sample files. Some shortcuts are taken, please let me know if comments would be helpful. Here are a few
Much error checking should be added around.
I assume the first field of FILE1 to be used as it stands. Otherwise changes are needed.
Processing is split into two functions, calculations being separate. This is not necessary.
$#$rline is the index of the last element of $rline arrayref. If this is too much syntax to stomach use #$rline - 1, for example as (0..#$rline-1)
Some comments on the code posted in the question:
Always, always, please use warnings; (and use strict;)
loop over indices is best written foreach my $i (0..$#array)
The regex modifier /ee is very involved. There is absolutely no need for it here.

You're right. It's exactly like a merge sort, except only matching lines are output.
sub read_and_parse1 {
my ($fh) = #_;
defined( my $line = <$fh> )
or return undef;
my ($id, #copy) = split(' ', $line); # Use split(/\t/, $line) if tab-separated data
my ($gene_sym) = split(/\|/, $id);
return [ $gene_sym, #copy ];
}
sub read_and_parse2 {
my ($fh) = #_;
defined( my $line = <$fh> )
or return undef;
return [ split(' ', $line) ]; # Use split(/\t/, $line) if tab-separated data
}
my $fields1 = read_and_parse1($fh1);
my $fields2 = read_and_parse2($fh2);
while ($fields1 && $fields2) {
my $cmp = $fields1->[0] cmp $fields2->[0];
if ($cmp < 0) { $fields1 = read_and_parse1($fh1); }
elsif ($cmp > 0) { $fields2 = read_and_parse2($fh2); }
else {
my ($gene_sym, #copy) = #$fields1;
my (undef, $chrom, $strand, $start, $stop, $map_id, $total_map, $abnormal_merge, $overlap) = #$fields2;
$chrom =~ s/^CHR/hs/;
my $num_loss = grep { $_ < 0 } #copy;
my $deep_loss = grep { $_ < -1 } #copy;
print($single_fh join("\t", $chrom, $start, $stop, $num_loss/#copy ) . "\n");
print($deep_fh join("\t", $chrom, $start, $stop, $deep_loss/#copy ) . "\n");
$fields1 = read_and_parse1($fh1);
$fields2 = read_and_parse2($fh2);
}
}
Output:
$ cat single.txt
hs1 1167628 1170420 0.375
hs1 2115898 2144159 0
hs1 19665266 19812066 0.125
hs1 1337275 1342693 0.375
$ cat deep.txt
hs1 1167628 1170420 0
hs1 2115898 2144159 0
hs1 19665266 19812066 0
hs1 1337275 1342693 0.25

Related

perl script to find matching lines in two files

I have two files that look like (below) and want to find the fields from the first in the second file, but print every field of the second.
#rs116801199 720381
#rs138295790 16057310
#rs131531 16870251
#rs131546 16872281
#rs140375 16873251
#rs131552 16873461
and
#--- rs116801199 720381 0.026 0.939 0.996 0 -1 -1 -1
#1 rs12565286 721290 0.028 1.000 1.000 2 0.370 0.934 0.000
#1 rs3094315 752566 0.432 1.000 1.000 2 0.678 0.671 0.435
#--- rs3131972 752721 0.353 0.906 0.938 0 -1 -1 -1
#--- rs61770173 753405 0.481 0.921 0.950 0 -1 -1 -1
My script looks like:
#! perl -w
my $file1 = shift#ARGV;
my #filtered_snps;
open (IN, $file1) or die "couldn't read file one";
while(<IN>){
my#L=split;
#next if ($L[0] =~ m/peak/);
push #filtered_snps,[$L[0],$L[1]];
}
close IN;
my $file2 = shift#ARGV;
my #snps;
open (IN, $file2);
while (<IN>){
my#L=split;
foreach (#filtered_snps){
if (($L[1] eq ${$_}[0]) && ($L[2] == ${$_}[1])) {
print "#L\n";
next;
}
}
}
I am getting no output, when I should be finding every line from file 1. I've also tried grep with no success.
In first while you are assigning to wrong array, you meant #L here.
Then you have pretty different strings in your first array (from first file) and in other. Try to print them both out in your for-iteration. You'll see they can't match.
Create a hash table of the items from the first file, then iterate over the second file and check if that rs-name exists... I'm also confirming that the number matches the name.
use strict;
use warnings;
my %hash;
my $regex = qr/#.* *(rs\d+) (\d+) *.*/;
open my $file1, '<', shift #ARGV;
while (<$file1>) {
my ($name, $num) = $_ =~ $regex;
$hash{$name} = $num;
}
close $file1;
open my $file2, '<', shift #ARGV;
while (<$file2>) {
my ($name, $num) = $_ =~ $regex;
print if (exists $hash{$name} and $hash{$name} = $num)
}
close $file2;

Pass lines from 2 files to same subroutine

I'm in the process of learning how to use perl for genomics applications. I am trying to clean up paired end reads (1 forward, 1 reverse). These are stored in 2 files, but the lines match. What I'm having trouble doing is getting the relevant subroutines to read from the second file (the warnings I get are for uninitialized values).
These files are set up in 4 line blocks(fastq) where the first line is a run ID, 2nd is a sequence, 3rd is a "+", and the fourth holds quality values for the sequence in line 2.
I had no real trouble with this code when it was applied only for one file, but I think I'm misunderstanding how to handle multiple files.
Any guidance is much appreciated!
My warning in this scenario is as such : Use of uninitialized value $thisline in subtraction (-) at ./pairedendtrim.pl line 137, line 4.
#!/usr/bin/perl
#pairedendtrim.pl by AHU
use strict;
use warnings;
die "usage: readtrimmer.pl <file1> <file2> <nthreshold> " unless #ARGV == 3;
my $nthreshold = "$ARGV[2]";
open( my $fastq1, "<", "$ARGV[0]" );
open( my $fastq2, "<", "$ARGV[1]" );
my #forline;
my #revline;
while ( not eof $fastq2 and not eof $fastq1 ) {
chomp $fastq1;
chomp $fastq2;
$forline[0] = <$fastq1>;
$forline[1] = <$fastq1>;
$forline[2] = <$fastq1>;
$forline[3] = <$fastq1>;
$revline[0] = <$fastq2>;
$revline[1] = <$fastq2>;
$revline[2] = <$fastq2>;
$revline[3] = <$fastq2>;
my $ncheckfor = removen( $forline[1] );
my $ncheckrev = removen( $revline[1] );
my $fortest = 0;
if ( $ncheckfor =~ /ok/ ) { $fortest = 1 }
my $revtest = 0;
if ( $ncheckrev =~ /ok/ ) { $revtest = 1 }
if ( $fortest == 1 and $revtest == 1 ) { print "READ 1 AND READ 2" }
if ( $fortest == 1 and $revtest == 0 ) { print "Read 1 only" }
if ( $fortest == 0 and $revtest == 1 ) { print "READ 2 only" }
}
sub removen {
my ($thisline) = $_;
my $ntotal = 0;
for ( my $i = 0; $i < length($thisline) - 1; $i++ ) {
my $pos = substr( $thisline, $i, 1 );
#print "$pos\n";
if ( $pos =~ /N/ ) { $ntotal++ }
}
my $nout;
if ( $ntotal <= $nthreshold ) #threshold for N
{
$nout = "ok";
} else {
$nout = "bad";
}
return ($nout);
}
The parameters to a subroutine are in #_, not $_
sub removen {
my ($thisline) = #_;
I have a few other tips for you as well:
use autodie; anytime that you're doing file processing.
Assign the values in #ARGV to variables first thing. This quickly documents what the hold.
Do not chomp a file handle. This does not do anything. Instead apply chomp to the values returned from reading.
Do not use the strings ok and bad as boolean values.
tr can be used to count the number times a character is in a string.
The following is a cleaned up version of your code:
#!/usr/bin/perl
#pairedendtrim.pl by AHU
use strict;
use warnings;
use autodie;
die "usage: readtrimmer.pl <file1> <file2> <nthreshold> " unless #ARGV == 3;
my ( $file1, $file2, $nthreshold ) = #ARGV;
open my $fh1, '<', $file1;
open my $fh2, '<', $file2;
while ( not eof $fh2 and not eof $fh1 ) {
chomp( my #forline = map { scalar <$fh1> } ( 1 .. 4 ) );
chomp( my #revline = map { scalar <$fh2> } ( 1 .. 4 ) );
my $ncheckfor = removen( $forline[1] );
my $ncheckrev = removen( $revline[1] );
print "READ 1 AND READ 2" if $ncheckfor and $ncheckrev;
print "Read 1 only" if $ncheckfor and !$ncheckrev;
print "READ 2 only" if !$ncheckfor and $ncheckrev;
}
sub removen {
my ($thisline) = #_;
my $ntotal = $thisline =~ tr/N/N/;
return $ntotal <= $nthreshold; #threshold for N
}

if value lies between two values then add another value to corresponding line

This is a description of my problem: I have two text files (here $variants and $annotation). I want to check if the value from column 2 in $variants lies between the values from column 2 and 3 in $annotation. If this is true then the value from column 1 in $annotation should be added to a new column in $variants.
This is how my sample input files look like
$annotationrepresents a tab-delimited text file
These values can be overlapping and cannot be perfectly sorted, since I'm working with a circular genome
C0 C1 C2
gene1 0 100
gene2 500 1000
gene3 980 1200
gene4 1500 5
$variants represents a tab-delimited text file
C0 C1
... 5
... 10
... 100
... 540
... 990
The output should look like this ($variants with two other columns added)
C0 C1 C2 C3
... 5 gene1 gene4
... 10 gene1
... 100 gene1
... 540 gene2
... 990 gene2 gene3
This is how my script looks like for the moment
my %hash1=();
while(<$annotation>){
my #column = split(/\t/); #split on tabs
my $keyfield = $column[1] && $column[2]; # I need to remember values from two columns here. How do I do that?
}
while(<$variants>){
my #column=split(/\t/); # split on tabs
my $keyfield = $column[1];
if ($hash1{$keyfield} >= # so the value in column[1] should be between the values from column[1] & [2] in $annotation
push # if true then add values from column[0] in $annotation to new column in $variants
}
So my biggest problems are how to remember two values in a file using hashes and how to put a value from one file to a column in another file. Could someone help me with this?
If the input files are not large and the positions are not too high, you can use arrays to represent all positions:
#!/usr/bin/perl
use warnings;
use strict;
sub skip_header {
my $FH = shift;
<$FH>;
}
open my $ANN, '<', 'annotation' or die $!;
my $max = 0;
while (<$ANN>) {
$_ > $max and $max = $_ for (split)[1, 2];
}
seek $ANN, 0, 0; # Rewind the file back.
my $circular;
my #genes;
while (<$ANN>) {
my ($gene, $from, $to) = split;
if ($from <= $to) {
$genes[$_] .= "$gene " for $from .. $to;
} else {
$circular = 1;
$genes[$_] .= "$gene " for 0 .. $to, $from .. $max + 1;
}
}
chop #genes;
open my $VAR, '<', 'variants' or die $!;
skip_header($VAR);
while (<$VAR>) {
next if /^\s*#/;
chomp;
my ($str, $pos) = split;
$pos = $#genes if $circular and $pos > $#genes;
print "$_ ", $genes[$pos] // q(), "\n";
}
No hashing needed at all. This example expects the annotations to be sorted and not overlapping, it also works only if all the values from variants should be printed.
#!/usr/bin/perl
use warnings;
use strict;
open my $VAR, '<', 'variants' or die $!;
<$VAR>; # skip header
my ($str, $pos) = split ' ', <$VAR>;
open my $ANN, '<', 'annotation' or die $!;
<$ANN>; # skip header
while (<$ANN>) {
my ($gene, $from, $to) = split;
while ($from <= $pos and $pos <= $to) {
print "$str $pos $gene\n";
($str, $pos) = split ' ', <$VAR> or last;
}
}

Perl Script for get values range and without duplicate from Symmetric Matrix file

Perl Script for get range value from matrix file
Matrix.txt(content inside)(tap(\t) and newline(\n))
\t100050\t100070\t100100\t100200\t100300\n
100050\t1\t0.0890344\t0.361651\t0.266263\t0.368639\n
100070\t0.0890344\t1\t0.0873663\t0.0267854\t0.148069\n
100100\t0.361651\t0.0873663\t1\t0.0423538\t0.269991\n
100200\t0.266263\t0.0267854\t0.0423538\t1\t0.215814\n
100300\t0.368639\t0.148069\t0.269991\t0.215814\t1
martix file like
--------100050 100070 100100 100200 100300
100050 1 0.0890344 0.361651 0.266263 0.368639
100070 0.0890344 1 0.0873663 0.0267854 0.148069
100100 0.361651 0.0873663 1 0.0423538 0.269991
100200 0.266263 0.0267854 0.0423538 1 0.215814
100300 0.368639 0.148069 0.269991 0.215814 1
I need only value range(0.3 to 1) with two head label (if less then 0.3 not print)
matrix is symmetric: I.e. $m[$i][$j] == $m[$j][$i] for all indices $i and $j.
$m[$i][$j] == $m[$j][$i] value same
no need print $j $i $v if $i $j $v already exit
I.e 100050 100100 0.361651 so need to print (100100 100050 0.361651)
output.txt
Label1 label2 value
100050 100050 1
100050 100100 0.361651
100050 100300 0.368639
100070 100070 1
100100 100100 1
100200 100200 1
100300 100300 1
use strict;
use warnings;
my ($dummy, #headers) = split(/\s+/, <DATA>);
my %seen;
while (<DATA>) {
my ($head, #v) = split;
for (my $i = 0; $i < #v; $i++) {
printf "%10s %10s %8.2f\n",
$head, $headers[$i], $v[$i] if $v[$i] >= 0.3 and not $seen{
join(":", sort ($head, $headers[$i]))
}++;
}
}
__DATA__
-------- 100050 100070 100100 100200 100300
100050 1 0.0890344 0.361651 0.266263 0.368639
100070 0.0890344 1 0.0873663 0.0267854 0.148069
100100 0.361651 0.0873663 1 0.0423538 0.269991
100200 0.266263 0.0267854 0.0423538 1 0.215814
100300 0.368639 0.148069 0.269991 0.215814 1
Lengthy oneliner,
perl -anE 'if(!#h){#h=#F;next} $l{$F[0]}{$h[$_]} = $F[$_] for 1..$#h }{shift#h; $_->[2]<0.3 or say "#$_" for map {$t=$_; map [$t,$_,$l{$t}{$_}], #h}#h' file
or more explicit version
# opening the file
open my $fh, "<", "file" or die $!;
my #header;
my %matrix;
while (my $line = <$fh>) {
my ($label, #F) = split /\s+/, $line; # split the line into fields
if (!#header) {
#header = #F;
next;
}
# assign the fields through a hash slice
#{ $matrix{$label} }{#header} = #F;
}
close $fh;
my #arr = map {
my $label = $_;
map [ $label, $_, $matrix{$label}{$_} ], #header;
} #header;
for my $el (#arr) {
print "#$el\n" if $el->[2] >= 0.3;
}

parse a tab delimited data using perl

I have a tab delimited data. I want to process that data using perl. I am a newbie to perl and could not figure out how to solve .
This is sample table: My original file is almost a GB
gi|306963568|gb|GL429799.1|_1316857_1453052 13 1
gi|306963568|gb|GL429799.1|_1316857_1453052 14 1
gi|306963568|gb|GL429799.1|_1316857_1453052 15 1
gi|306963568|gb|GL429799.1|_1316857_1453052 16 1
gi|306963568|gb|GL429799.1|_1316857_1453052 17 1
gi|306963568|gb|GL429799.1|_1316857_1453052 360 1
gi|306963568|gb|GL429799.1|_1316857_1453052 361 1
gi|306963568|gb|GL429799.1|_1316857_1453052 362 1
gi|306963568|gb|GL429799.1|_1316857_1453052 363 1
gi|306963568|gb|GL429799.1|_1316857_1453052 364 1
gi|306963568|gb|GL429799.1|_1316857_1453052 365 1
gi|306963568|gb|GL429799.1|_1316857_1453052 366 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38640 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38641 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38642 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38643 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38644 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38645 1
I would like to get the output as
Name, start value, end value, average
gi|306963568|gb|GL429799.1|_1316857_1453052 13 17 1
gi|306963568|gb|GL429799.1|_1316857_1453052 360 366 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38640 38645 1
it will be great if someone could share their wisdom.
The general pattern is
use strict;
use warnings;
open my $fh, '<', 'myfile' or die $!;
while (<$fh>) {
chomp;
my #fields = split /\t/;
...
}
Within the loop the fields can be accessed as $fields[0] through $fields[2].
Update
I have understood your question better, and I think this solution will work for you. Note that it assumes the input data is sorted, as you have shown in your question.
It accumulates the start and end values, the total and the count in hash %data, and keeps a list of all the names encountered in #names so that the data can be displayed in the order it was read.
The program expects the input file name as a parameter on the command line.
You need to consider the formatting of the average because it is a floating point value. As it stands it will display the value to sixteen significant figures, and you may want to curtail that using sprintf.
use strict;
use warnings;
my ($filename) = #ARGV;
open my $fh, '<', $filename or die qq{Unable to open "$filename": $!};
my #names;
my %data;
my $current_name = '';
my $last_index;
while (<$fh>) {
chomp;
my ($name, $index, $value) = split /\t/;
if ( $current_name ne $name or $index > $last_index + 1 ) {
push #names, $name unless $data{$name};
push #{ $data{$name} }, {
start => $index,
count => 0,
total => 0,
};
$current_name = $name;
}
my $entry = $data{$name}[-1];
$entry->{end} = $index;
$entry->{count} += 1;
$entry->{total} += $value;
$last_index = $index;
}
for my $name (#names) {
for my $entry (#{ $data{$name} }) {
my ($start, $end, $total, $count) = #{$entry}{qw/ start end total count /};
print join("\t", $name, $start, $end, $total / $count), "\n";
}
}
output
gi|306963568|gb|GL429799.1|_1316857_1453052 13 17 1
gi|306963568|gb|GL429799.1|_1316857_1453052 360 366 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38640 38645 1
This will produce the same output for the sample in your question:
#!/usr/bin/env perl -n
#
my ($name, $i, $value) = split(/\t/);
sub print_stats {
print join("\t", $prev_name, $start, $prev_i, $sum / ($prev_i - $start + 1)), "\n";
}
if ($prev_name eq $name && $i == $prev_i + 1) {
$sum += $value;
$prev_i = $i;
}
else {
if ($prev_name) {
&print_stats();
}
$start = $i;
$prev_name = $name;
$sum = $value;
$prev_i = $i;
}
END {
&print_stats();
}
Use it as:
./parser.pl < sample.txt
UPDATE: answers to the questions in comments:
To print output to a file, run like this: ./parser.pl < sample.txt > output.txt
$prev_name and $prev_i are NOT initialized, so they are undef at first (= NULL)
You could do something like this....
open (FILE, 'data.txt');
while (<FILE>) {
chomp;
($name, $start_value, $end_value, $average) = split("\t");
print "Name: $name\n";
print "Start Value: $start_value\n";
print "End Value: $End_Value\n";
print "Average: %average
print "---------\n";
}
close (FILE);
exit;
Those look like GenBank files...so I'm unsure where you are getting the start, end values, average.
Here's an example using Text::CSV:
use Text::CSV; # This will implicitly use Text::CSV_XS if it's installed
my $parser = Text::CSV->new( { sep_char => '|' } );
open my $fh, '<', 'myfile' or die $!;
while (my $row = $parser->getline($fh)) {
# $row references an array of field values from the line just read
}
Also, as a minor side detail, your sample data is delimited by pipe characters, not tabs, although that may just be to avoid copy/paste errors for those answering your question. If the actual data is tab-delimited, set sep_char to "\t" instead of '|'.