Background: I have a Perl script that I wrote to go through two files. The basic point of the script is to identify overlaps between one list of coordinates, defining the beginnings and ends of randomly selected chromosomal segments, and a second list of coordinates, defining the beginnings and endings of actual gene transcripts.
The first input file contains three columns. The first is the chromosome number, and the second and third are the proximal and distal coordinates, in base pairs, of the randomly selected regions. For example:
chr1 1100349 2035647
chr1 47837656 736474584
. . .
. . .
. . .
The second input file contains four columns: chromosome number, proximal coordinate, distal coordinate, and the name of the gene. For example:
chr1 1588354 2283765 geneA
chr1 55943837 787653743 geneB
Here is a set of test files I used to start off with. First set.
chr1 1 10
chr1 5 10
chr1 5 15
chr1 14 15
chr1 100 101
chr1 11 17
Second set.
chr1 1 5 geneA
chr1 7 10 geneB
chr1 12 16 geneC
chr1 18 21 geneD
chr10 126602211 126609396 B4galnt1
The script reads off the first line from the first list, then reads through all the lines of the second list, and prints for me whether and how the first coordinate pair overlaps with the second coordinate pair (Is the first coordinate pair outside the second pair? Is the first pair inside or overlapping with the second?) Then, the script goes back and reads off the second line from the first list, and repeats the process. The first file has 200,000 lines. The second several thousand. It is running now overnight.
The problem: When the script determines the relationship between the first and second coordinate pairs, it prints out a line to an output file. Not all these print statements need to be sent to output, so I tried to comment them out. However, when I did this, none of the print statements sending information to the output file got printed. Statements are printed to the screen, though, just not to the output file. The script is running, but all the print to output statements are being used, so the output file is getting huge. If the script would just print to output for only those coordinates that overlap, the output file would be very, very much smaller. At present, the output file is now 2,131,294 KB! And that's only up to chromosome 11. There are eight more to go through, albeit smaller ones, but the file size is still going to expand greatly.
Updated information: This is edited in after my original posting. To be more precise, it is only when I comment out the first print $output "..."; statement that is inside the loop (the very first statement is to print a header, and this is before the loop) that the script fails to print anything, even when all the others are left alone (not commented).
In case it matters: I wrote the script on my Mac, using Fraise, but I am running it on a PC, the script contained in a Notepad text file.
Here's the script: Note: there are many print statements in the file, many commented out. The print statements of interest are those printing to the output file. Those are the ones that, when one or more are commented out, wind up never sending information to the output file. Those statements look like:
print $output "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tinside\n";
The actual script:
#!/bin/usr/perl
use strict; use warnings;
#############
## findGenes_after_ASboot_v5.pl
#############
#############
# After making a big list of randomly placed intervals,
# this script uses RefGene.txt file and identifies the
# the gene symbols encompassed or overlapped by each random interval
#############
unless(scalar @ARGV == 2) {
# $0 name of the program being executed;
print "\n usage: $0 filename containig your list of positions and a RefGene-type file \n\n";
exit;
}
#for ( my $i = 0; $i < 25; $i++ ){
# print "#########################################\n";
#}
open( my $positions, "<", $ARGV[0] ) or die;
open( my $RefGene, "<", $ARGV[1] ) or die;
open( my $output, ">>", "output.txt") or die;
# print header
print $output "chr\tpos count\tpos1\tpos2\tchr\tref count\tref1\tref2\tname2\trelationship\n";
my $pos_count = 1;
my $ref_count = 1;
for my $position_line (<$positions>) {
#print "$position_line";
my @posline = split('\t', $position_line);
#print "$posline[0]\t$posline[1]\t$posline[2]";
open( my $RefGene, "<", $ARGV[1] ) or die;
for my $ref (<$RefGene>){
#print "\t$ref";
my @refline = split('\t', $ref);
# print "\t$refline[0]\t$refline[1]\t$refline[2]\t$refline[3]";
chomp $posline[2];
chomp $refline[3];
if ( $posline[0] eq $refline[0] ){
#print "\tchr match\n";
# am i entirely prox to a gene?
if ( $posline[2] < $refline[1] ){
#print "too proximal\n";
print "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\ttoo proximal\n";
#the following print statement is one I'd like to be able to comment out
print $output "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\ttoo proximal\n";
$ref_count++;
next;
}
# am i entirely distal to a gene?
elsif ( $posline[1] > $refline[2] ){
#print "too distal\n";
print "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\ttoo distal\n";
#the following print statement is one I'd like to be able to comment out
print $output "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\ttoo distal\n";
$ref_count++;
next;
}
# am i completely inside a gene?
elsif ( $posline[1] >= $refline[1] &&
$posline[2] <= $refline[2] ){
#print "inside\n";
print "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tinside\n";
print $output "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tinside\n";
$ref_count++;
next;
}
# am i proximally overlapping?
elsif ( $posline[1] < $refline[1] &&
$posline[2] <= $refline[2] ){
#print "proximal overlap\n";
print "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tproximal overlap\n";
print $output "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tproximal overlap\n";
$ref_count++;
next;
}
# am i distally overlapping?
elsif ( $posline[1] >= $refline[1] &&
$posline[2] > $refline[2] ){
#print "distal overlap\n";
print "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tdistal overlap\n";
print $output "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tdistal overlap\n";
$ref_count++;
next;
}
else {
#print "encompassing\n";
print "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tencompassing\n";
print $output "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tencompassing\n";
$ref_count++;
next;
}
} # if a match with chr
else {
next;
}
} # for each reference
$pos_count++;
} # for each position
Data Files:
http://www.filedropper.com/proxdistalpositionsofrandompositions
http://www.filedropper.com/modifiedrefgene
Some output: http://www.filedropper.com/output_17
I see two potential flaws in your code:
Always use while when processing a file instead of for.
Whenever you use the latter, you're actually loading the entire file into memory versus just doing line by line processing. If you're actually able to support doing that though, you should go ahead and load your smaller file entirely and just iterate on the lines.
Split on "\t" or /\t/ rather than '\t'.
split treats its first argument as a pattern, so the single-quoted '\t' still matches a tab, but it reads like a two-character delimiter. Writing "\t" or /\t/ makes the intent obvious.
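A minimal sketch of both points: reading line by line with while, and splitting on a tab with an explicit pattern. It reads from an in-memory string (a hypothetical stand-in for the positions file) so that it runs on its own.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the positions file: three tab-separated columns.
my $data = "chr1\t1\t10\nchr1\t5\t15\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @rows;
# while reads one line per iteration; for (<$fh>) would read every line first.
while ( my $line = <$fh> ) {
    chomp $line;
    my @cols = split /\t/, $line;    # explicit regex: the delimiter is one tab
    push @rows, \@cols;
}
close $fh;

print join( ',', @$_ ), "\n" for @rows;
```

With a real file, only the open line changes (for example open my $fh, '<', $ARGV[0]).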
Anyway, I've cleaned up your code considerably, removing duplicated lines and so on. Since it's untested, some of these changes may not work or may not be what you want. However, if you go through the code, perhaps it will at least give you ideas:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
#############
## findGenes_after_ASboot_v5.pl
#############
#############
# After making a big list of randomly placed intervals,
# this script uses RefGene.txt file and identifies the
# the gene symbols encompassed or overlapped by each random interval
#############
die "\n usage: $0 filename containing your list of positions and a RefGene-type file \n\n"
if @ARGV != 2;
open my $positions, "<", $ARGV[0];
# Cache file by key
my %refgenes;
open my $RefGene, "<", $ARGV[1];
while (<$RefGene>) {
chomp;
my @cols = split "\t";
push @{$refgenes{$cols[0]}}, \@cols;
}
open my $output, ">>", "output.txt";
# print header
print $output "chr\tpos count\tpos1\tpos2\tchr\tref count\tref1\tref2\tname2\trelationship\n";
my $pos_count = 1;
my $ref_count = 1;
while (my $position_line = <$positions>) {
chomp $position_line;
my #posline = split "\t", $position_line;
# Only iterate on matching refs
for my $ref (@{ $refgenes{$posline[0]} }) {
my @refline = @$ref;
my $desc = join "\t", ($posline[0], $pos_count, @posline[1,2], $refline[0], $ref_count, @refline[1,2,3]);
my $message = '';
# am i entirely prox to a gene?
if ( $posline[2] < $refline[1] ){
$message = 'too proximal';
# am i entirely distal to a gene?
} elsif ( $posline[1] > $refline[2] ) {
$message = 'too distal';
# am i completely inside a gene?
} elsif ( $posline[1] >= $refline[1] && $posline[2] <= $refline[2] ) {
$message = 'inside';
# am i proximally overlapping?
} elsif ( $posline[1] < $refline[1] && $posline[2] <= $refline[2] ) {
$message = 'proximal overlap';
# am i distally overlapping?
} elsif ( $posline[1] >= $refline[1] && $posline[2] > $refline[2] ) {
$message = 'distal overlap';
} else {
$message = 'encompassing';
}
print "$desc\t$message\n";
print $output "$desc\t$message\n";
$ref_count++;
} # for each reference
$pos_count++;
} # for each position
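Since the question was about keeping non-overlapping pairs out of the output file, here is a small sketch of one way to do that: put the classification in a subroutine and skip the relationships you don't want before printing. The subroutine name relationship and the %skip hash are made up for illustration; the test data is the small test set from the question.

```perl
use strict;
use warnings;

# Classify how the interval [$p1, $p2] relates to the gene [$g1, $g2],
# using the same tests, in the same order, as the script above.
sub relationship {
    my ( $p1, $p2, $g1, $g2 ) = @_;
    return 'too proximal'     if $p2 < $g1;
    return 'too distal'       if $p1 > $g2;
    return 'inside'           if $p1 >= $g1 && $p2 <= $g2;
    return 'proximal overlap' if $p1 < $g1  && $p2 <= $g2;
    return 'distal overlap'   if $p1 >= $g1 && $p2 > $g2;
    return 'encompassing';
}

# Relationships that should not reach the output file (an assumption
# about what counts as "not overlapping").
my %skip = map { $_ => 1 } ( 'too proximal', 'too distal' );

# One interval from the first test set against the chr1 genes of the second.
my @pos   = ( 'chr1', 5, 15 );
my @genes = ( [ 1, 5, 'geneA' ], [ 7, 10, 'geneB' ], [ 12, 16, 'geneC' ], [ 18, 21, 'geneD' ] );

my @kept;
for my $gene (@genes) {
    my $message = relationship( @pos[ 1, 2 ], @{$gene}[ 0, 1 ] );
    next if $skip{$message};    # drop non-overlapping pairs
    push @kept, join "\t", @pos, @$gene, $message;
}
print "$_\n" for @kept;
```

In the full script the same next if $skip{$message}; line would sit just before the print to the output file, so the screen output can stay complete while the file only receives overlaps.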
I've got the following function inside a Perl script:
sub fileSize {
my $file = shift;
my $opt = shift;
open (FILE, $file) or die "Could not open file $file: $!";
$/ = ">";
my $junk = <FILE>;
my $g_size = 0;
while ( my $rec = <FILE> ) {
chomp $rec;
my ($name, @seqLines) = split /\n/, $rec;
my $sec = join('',@seqLines);
$g_size+=length($sec);
if ( $opt == 1 ) {
open TMP, ">>", "tmp" or die "Could not open chr_sizes.log: $!\n";
print TMP "$name\t", length($sec), "\n";
}
}
if ( $opt == 0 ) {
PrintLog( "file_size: $g_size", 0 );
}
else {
print TMP "file_size: $g_size\n";
close TMP;
}
$/ = "\n";
close FILE;
}
Input file format:
>one
AAAAA
>two
BBB
>three
C
I have several input files with that format. The line beginning with ">" is the same but the other lines can be of different length. The output of the function with only one file is:
one 5
two 3
three 1
I want to execute the function in a loop with this for each file:
foreach my $file ( @refs ) {
fileSize( $file, 1 );
}
When running the next iteration, let's say with this file:
>one
AAAAABB
>two
BBBVFVF
>three
CS
I'd like to obtain this output:
one 5 7
two 3 7
three 1 2
How can I modify the function or the script to get this? As can be seen, my function appends the text to the file.
Thanks!
I've left out your options and the file IO operations and have concentrated on showing a way to do this with an array of arrays from the command line. I hope it helps. I'll leave wiring it up to your own script and subroutines mostly up to you :-)
Running this one liner against your first data file:
perl -lne ' $name = s/>//r if /^>/ ;
push @strings , [$name, length $_] if !/^>/ ;
END { print "@{$_ } " for @strings }' datafile1.txt
gives this output:
one 5
two 3
three 1
Substituting the second version or instance of the data file (i.e where record one contains AAAAABB) gives the expected results as well.
one 7
two 7
three 2
In your script above, you save to an output file in this format. So, to append columns to each row in your output file, we can just munge each of your data files in the same way (with any luck this might mean things can be converted into a function that will work in a foreach loop). If we save the transformed data to be output into an array of arrays (AoA), then we can just push the length values we get for each data file string onto the corresponding anonymous array element and then print out the array. Voilà! Now let's hope it works ;-)
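The array-of-arrays idea can be sketched on its own like this. It is only an illustration: the sub name record_lengths is made up, the two strings stand in for the data files, and it assumes every file lists the same records in the same order with one sequence line each.

```perl
use strict;
use warnings;

# Return [name, length] for each record in FASTA-style text, in file order.
# Assumes one sequence line per record, as in the sample files.
sub record_lengths {
    my ($text) = @_;
    my ( @rows, $name );
    for my $line ( split /\n/, $text ) {
        if ( $line =~ /^>(.*)/ ) { $name = $1 }
        else                     { push @rows, [ $name, length $line ] }
    }
    return @rows;
}

# Hypothetical stand-ins for the two data files in the question.
my $file1 = ">one\nAAAAA\n>two\nBBB\n>three\nC\n";
my $file2 = ">one\nAAAAABB\n>two\nBBBVFVF\n>three\nCS\n";

my @table = record_lengths($file1);    # rows of [name, length]
my @next  = record_lengths($file2);

# Append one new column per file by pushing onto each row's anonymous array.
push @{ $table[$_] }, $next[$_][1] for 0 .. $#table;

print "@$_\n" for @table;
```

For the two sample files this prints the rows one 5 7, two 3 7 and three 1 2.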
You might want to install Data::Printer which can be used from the command line as -MDDP to visualize data structures.
First - run the above script and redirect the output to a file with > /tmp/output.txt
Next - try this longish one-liner that uses DDP and p to show the structure of the array we create:
perl -MDDP -lne 'BEGIN{ local @ARGV=shift;
@tmp = map { [split] } <>; p @tmp }
$name = s/>//r if /^>/ ;
push @out , [ $name, length $_ ] if !/^>/ ;
END{ p @out ; }' /tmp/output.txt datafile2.txt
In the BEGIN block we localize @ARGV and shift off the first file (our version of your TMP file); {local @ARGV=shift} is almost a Perl idiom for handling multiple input files. We then split each line inside an anonymous array constructor ([]) and map { } the results into the @tmp array, which we display with DDP's p() function. Once we are out of the BEGIN block, the implicit while (<>){ ... } that we get with perl's -n command line switch takes over and reads in the remaining file from @ARGV. We process lines starting with >, stripping the leading character and assigning the string that follows to the $name variable. The while continues and we push $name and the length of any line that does not start with > (if !/^>/), wrapped as elements of an anonymous array [], into the @out array, which we display with p() as well (in the END{} block so it doesn't print inside our implicit while() loop). Phew!!
See the AoA that results as a gist @Github.
Finally - building on that, and now we have munged things nicely - we can change a few things in our END{...} block (add a nested for loop to push things around) and put this all together to produce the output we want.
This one liner:
perl -MDDP -lne 'BEGIN{ local @ARGV=shift; @tmp = map {[split]} <>; }
$name = s/>//r if /^>/ ; push @out, [ $name, length $_ ] if !/^>/ ;
END{ foreach $row (0..$#tmp) { push @{ $tmp[$row] } , $out[$row][-1]} ;
print "@$_" for @tmp }' output.txt datafile2.txt
produces:
one 5 7
two 3 7
three 1 2
We'll have to convert that into a script :-)
The script consists of three rather wordy subroutines that read the log file, parse the data file, and merge them. We run them in order. The first one checks whether there is an existing log; if not, it creates one and then exits to skip any further parsing/merging steps.
You should be able to wrap them in a loop of some kind that feeds files to the subroutines from an array instead of fetching them from STDIN. One caution - I'm using IO::All because it's fun and easy!
use 5.14.0 ;
use IO::All;
my @file = io(shift)->slurp ;
my $log = "output.txt" ;
&readlog;
&parsedatafile;
&mergetolog;
####### subs #######
sub readlog {
    if (! -R $log) {
        print "creating first log entry\n";
        my @newlog = &parsedatafile ;
        open(my $fh, '>', $log) or die "I CAN HAZ WHA????" ;
        print $fh "@$_ \n" for @newlog ;
        exit;
    }
    else {
        map { [split] } io($log)->slurp ;
    }
}
sub parsedatafile {
    my (@out, $name) ;
    # iterate over the slurped lines
    for (@file) {
        chomp ;
        $name = s/>//r if /^>/;
        push @out, [$name, length $_] if !/^>/ ;
    }
    @out;
}
sub mergetolog {
    my @tmp = readlog ;
    my @data = parsedatafile ;
    foreach my $row (0 .. $#tmp) {
        # explicit dereference: push on a bare reference fails on newer perls
        push @{ $tmp[$row] }, $data[$row][-1]
    }
    open(my $fh, '>', $log) or die "Foobar!!!" ;
    print $fh "@$_ \n" for @tmp ;
}
The subroutines do all the work here - you can likely find ways to shorten; combine; improve them. Is this a useful approach for you?
I hope this explanation is clear and useful to someone - corrections and comments welcome. Probably the same thing could be done with in-place editing (i.e. with perl -pi -e '...'), which is left as an exercise to those that follow ...
You need to open the output file itself. First in read mode, then in write mode.
I have written a script that does what you are asking. What really matters is the part that appends new data to old data. Adapt that to your fileSize function.
So you have the output file, output.txt
Of the form,
one 5
two 3
three 1
And an array of input files, input1.txt, input2.txt, etc, saved in the @inputfiles variable.
Of the form,
>one
AAAAA
>two
BBB
>three
C
>four
DAS
and
>one
AAAAABB
>two
BBBVFVF
>three
CS
Respectively.
After running the following perl script,
# First read previous output file.
open OUT, '<', "output.txt" or die $!;
my @outlines;
while (my $line = <OUT> ) {
chomp $line;
push @outlines, $line;
}
close OUT;
my $outsize = scalar @outlines;
# Suppose you have your array of input file names already prepared
my @inputfiles = ("input1.txt", "input2.txt");
foreach my $file (@inputfiles) {
open IN, '<', $file or die $!;
my $counter = 1; # Used to compare against output size
while (my $line = <IN>) {
chomp $line;
$line =~ m/^>(.*)$/;
my $name = $1;
my $sequence = <IN>;
chomp $sequence;
my $seqsize = length($sequence);
# Here is where I append a column to output data.
if($counter <= $outsize) {
$outlines[$counter - 1] .= " $seqsize";
} else {
$outlines[$counter - 1] = "$name $seqsize";
}
$counter++;
}
close IN;
}
# Now rewrite the results to output.txt
open OUT, '>', "output.txt" or die $!;
foreach (@outlines) {
print OUT "$_\n";
}
close OUT;
You generate the output,
one 5 5 7
two 3 3 7
three 1 1 2
four 3
This is really frustrating me. The script I'm writing is indexing coordinates in a hash and then using those index numbers to pull out values from an array.
The weird thing is that if the value begins with 2 or 22 it will not print. Any other number works. I'll show you two variations and output of the script.
First variation. This is what I want the script to do. Print chromosome, position, value.
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use Scalar::Util qw(looks_like_number);
open IN, "/home/big/scratch/affy_map.txt" or die "Cannot open reference\n";
my %ref;
my $head = <IN>;
my $index = 0;
while(<IN>){
chomp $_;
my @row = split /\t/, $_;
my $value = join "\t", $row[1],$row[2];
if($row[1] == 2 && $row[2] <= 50000 && $row[2] <= 51113178) { $ref{$index}=$value; print $index."\t".$value."\n";}
if($row[1] == 22 && $row[2] <= 16300001 && $row[2] <= 20500000) { $ref{$index}=$value; print $index."\t".$value."\n"; }
$index++;
}
close(IN);
my @files;
my $masterDirect = "/nfs/archive02/big/Norm/norm_gcc/";
find(\&file_names, $masterDirect);
sub file_names {
if( -f && $File::Find::name=~/\.nzd$/)
{
push @files, $File::Find::name;
}
}
my $count=0;
foreach(@files){
$count++;
if($count % 100 == 0 ){ print "\n","-" x 10, " $count ", "-" x 10,"\n";}
undef my @probes;
open IN, $_;
#file name handling
my @inDir = split "\/", $_;
my $id = pop(@inDir);
$id =~ s/\.gcc.nzd$//;
#header test
$head =<IN>;
if(looks_like_number($head)) { push @probes, $head; }
#open output
open OUT, ">/home/big/scratch/phase1_affy/".$id."_select_probeset.txt";
#load probe array
@probes = <IN>;
close(IN);
foreach my $key (sort keys %ref){
#intended function
print OUT $ref{$key}."\t".$probes[$key];
#testing
my @temp = split "\t", $ref{$key};
foreach(@temp){if($temp[0] == 2){print $key."\t".$ref{$key}."\t".$probes[$key];}}
}
close(OUT);
}
Here's the output for the test. The printing from the reference file is flawless. The first number is the $key or index number. The second is from $probes[$key]. Why is $ref{$key} missing?
146529 0.777314368326637
146529 0.777314368326637
146530 0.116241153901913
146530 0.116241153901913
146531 0.940593233609167
146531 0.940593233609167
Variation 2.
...
foreach my $key (sort keys %ref){
print OUT $ref{$key}."\t".$probes[$key];
my @temp = split "\t", $ref{$key};
foreach(@temp){if($temp[0] == 2){print $key."\t".$ref{$key}."\n";}}
}
And its output. See now it's printing correctly. $key and $ref{$key}
146542 2 31852
146542 2 31852
146543 2 37693
146543 2 37693
146544 2 40415
146544 2 40415
146545 2 40814
I thought it might be a DOS->UNIX file problem, but I ran perl -pi -e 's/\R/\n/g' input_files.txt on all the input the script sees. It prints the same value twice because there are two elements in the @temp array. I'm really at a loss right now.
Here is a hint at a possible issue. In the beginning part,
if($row[1] == 2 && $row[2] <= 50000 && $row[2] <= 51113178) { $ref{$index}=$value; print $index."\t".$value."\n";}
Note that you used two "<=" comparisons for $row[2], which looks peculiar: the second is redundant, since any value that is <= 50000 is also <= 51113178. The next line has the same problem. Please double-check this first; otherwise you may have filtered the rows out in the first place.
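If the intent was to keep positions inside a window, the test needs a lower bound with >= and an upper bound with <=. A minimal sketch, where the sub name in_range is made up and treating 50000 as the lower bound is only a guess at what the question meant:

```perl
use strict;
use warnings;

# True when $pos lies in the closed interval [$low, $high].
sub in_range {
    my ( $pos, $low, $high ) = @_;
    return $pos >= $low && $pos <= $high;
}

# Hypothetical window: the lower limit is assumed, not taken from the question.
for my $pos ( 31852, 60000, 51113178, 51113179 ) {
    printf "%d\t%s\n", $pos, in_range( $pos, 50000, 51113178 ) ? 'kept' : 'skipped';
}
```

With this window 31852 and 51113179 are skipped, while 60000 and 51113178 are kept.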
I got stuck when combining BLAST commands into a Perl script. The problem is that the command line pauses when PART II begins.
PART I is used to crop the fasta sequence.
PART II is used to do BLAST with the file generated by PART I.
Both parts run well individually, but I hit the "pause" problem when combining them.
I guess it is because the $ARGV[1] and $ARGV[3] files generated by PART I cannot be used in PART II. I don't know how to fix it, though I have tried a lot.
Thanks!
#! /usr/bin/perl -w
use strict;
#### PART I
die "usage:4files fasta1 out1 fasta2 out2\n" unless @ARGV==4;
open (S, "$ARGV[0]") || die "cannot open FASTA file to read: $!";
open OUT,">$ARGV[1]" || die "no out\n";
open (S2, "$ARGV[2]") || die "cannot open FASTA file to read: $!";
open OUT2,">$ARGV[3]" || die "no out2\n";
my %s;# a hash of arrays, to hold each line of sequence
my %seq; #a hash to hold the AA sequences.
my $key;
print "how long is the N-terminal(give number,e.g. 30. whole length input \"0\") \n";
chomp(my $nl=<STDIN>);
##delete "\n" for seq.
local $/ = ">";
<S>;
while (<S>){ #Read the FASTA file.
chomp;
my @line=split/\n/;
print OUT ">",$line[0],"\n";
splice @line,0,1;
#print OUT join ("",@line),"\n";
#@line = join("",@line);
#print @line,"\n";
if ($nl == 0){ #whole length
my $seq=join("",@line);
my @amac = split(//,$seq);
splice @amac,0,1; # delete first "MM"
#push @{$s{$key}},@amac;
print OUT @amac,"\n";
}
else { # extract inital aa by number ##Guanhua
my $seq=join("",@line);
#print $seq,"\n";
my @amac = split(//,$seq);
splice @amac,0,1; # delete first "MM"
splice @amac,$nl; ##delete from the N to end
#print @amac,"\n";
#push (@{$s{$key}}, @amac);
print OUT @amac,"\n";
}
}
<S2>;
while (<S2>){ #Read the FASTA file.
chomp;
my @line=split/\n/;
print OUT2 ">",$line[0],"\n";
splice @line,0,1;
#print OUT join ("",@line),"\n";
#@line = join("",@line);
#print @line,"\n";
if ($nl == 0){ #whole length
my $seq=join("",@line);
my @amac = split(//,$seq);
splice @amac,0,1; # delete first "MM"
#push @{$s{$key}},@amac;
print OUT2 @amac,"\n";
}
else { # extract inital aa by number ##Guanhua
my $seq=join("",@line);
#print $seq,"\n";
my @amac = split(//,$seq);
splice @amac,0,1; # delete first "MM"
splice @amac,$nl; ##delete from the N to end
#print @amac,"\n";
#push (@{$s{$key}}, @amac);
print OUT2 @amac,"\n";
}
}
##### PART II
print "nucl or prot?\n";
chomp(my $tp = <STDIN>);
system ("makeblastdb -in $ARGV[1] -dbtype prot");
system ("makeblastdb -in $ARGV[3] -dbtype $tp");
print "blast type? (blastp,blastn)\n";
chomp(my $cmd = <STDIN>);
system ("blastp -query $ARGV[1] -db $ARGV[3] -outfmt 6 -evalue 1e-3 -out 12.out ");
system ("$cmd -db $ARGV[1] -query $ARGV[3] -outfmt 6 -evalue 1e-3 -out 21.out ");
You changed the way perl reads from 'STDIN' when you set '$/' in this line:
local $/ = ">";
The easiest way to fix this is to add an opening brace right before that line and a closing brace just before the '##### PART II' comment, so that the change to '$/' only lasts for that block:
{
local $/ = ">";
...
...
}
##### PART II
(I think theoretically, you could put a ">" at the end of the text you input, but that seems strange, so I wouldn't do it)
That will fix your problem. But something that should be addressed is some of the style choices you made. The two big chunks of code in the middle are both identical as far as I can tell and should probably be put into a subroutine and then called twice. This will eliminate duplication and is less error prone.
You should also use the three argument open call to open files.
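Both suggestions can be sketched together: a subroutine that sets $/ with local (so the change ends when the sub returns) and a lexical filehandle opened with three-argument open. The sub name read_records and the in-memory FASTA text are made up for the example; they are not part of the original script.

```perl
use strict;
use warnings;

# Read FASTA records from a filehandle using ">" as the record separator.
# local restores $/ when the sub returns, so later <STDIN> reads see "\n" again.
sub read_records {
    my ($fh) = @_;
    local $/ = ">";
    my @records;
    my $junk = <$fh>;    # discard the chunk before the first ">"
    while ( my $rec = <$fh> ) {
        chomp $rec;      # removes the trailing ">" because $/ is ">"
        my ( $name, @lines ) = split /\n/, $rec;
        push @records, [ $name, join '', @lines ];
    }
    return @records;
}

# Hypothetical in-memory FASTA data standing in for $ARGV[0].
my $fasta = ">seq1\nMKV\nLAA\n>seq2\nMTT\n";
open my $in, '<', \$fasta or die "Cannot open in-memory file: $!";
my @records = read_records($in);
close $in;

print "$_->[0]\t$_->[1]\n" for @records;
print 'separator is newline again: ', ( $/ eq "\n" ? 'yes' : 'no' ), "\n";
```

Calling the same sub once per input file would also remove the duplicated loop.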
Another question for everyone. To reiterate, I am very new to Perl, and I apologize in advance for making silly mistakes.
I am trying to calculate the GC content of different lengths of DNA sequence. The file is in this format:
>gene 1
DNA sequence of specific gene
>gene 2
DNA sequence of specific gene
...etc...
This is a small piece of the file
>env
ATGCTTCTCATCTCAAACCCGCGCCACCTGGGGCACCCGATGAGTCCTGGGAA
I have set up the counters and can read each line of DNA sequence, but at the moment it keeps a running summation of the total across all lines. I want it to read each sequence, print the content after the sequence is read, then move on to the next one, giving individual base counts for each line.
This is what I have so far.
#!/usr/bin/perl
#necessary code to open and read a new file and create a new one.
use strict;
my $infile = "Lab1_seq.fasta";
open INFILE, $infile or die "$infile: $!";
my $outfile = "Lab1_seq_output.txt";
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!";
#establishing the intial counts for each base
my $G = 0;
my $C = 0;
my $A = 0;
my $T = 0;
#initial loop created to read through each line
while ( my $line = <INFILE> ) {
chomp $line;
# reads file until the ">" character is encounterd and prints the line
if ($line =~ /^>/){
print OUTFILE "Gene: $line\n";
}
# otherwise count the content of the next line.
# my percent counts seem to be incorrect due to my Total length counts skewing the following line. I am currently unsure how to fix that
elsif ($line =~ /^[A-Z]/){
my @array = split //, $line;
my $array= (@array);
# reset the counts of each variable
$G = ();
$C = ();
$A = ();
$T = ();
foreach $array (@array){
#if statements asses which base is present and makes a running total of the bases.
if ($array eq 'G'){
++$G;
}
elsif ( $array eq 'C' ) {
++$C; }
elsif ( $array eq 'A' ) {
++$A; }
elsif ( $array eq 'T' ) {
++$T; }
}
# all is printed to the outfile
print OUTFILE "G:$G\n";
print OUTFILE "C:$C\n";
print OUTFILE "A:$A\n";
print OUTFILE "T:$T\n";
print OUTFILE "Total length:_", ($A+=$C+=$G+=$T), "_base pairs\n";
print OUTFILE "GC content is(percent):_", (($G+=$C)/($A+=$C+=$G+=$T)*100),"_%\n";
}
}
#close the outfile and the infile
close OUTFILE;
close INFILE;
Again I feel like I am on the right path, I am just missing some basic foundations. Any help would be greatly appreciated.
The final problem is in the final counts printed out. My percent values are wrong and give me the wrong value. I feel like the total is being calculated then that new value is incorporated into the total.
Several things:
1. use hash instead of declaring each element.
2. An assignment such as $G = (0); does work, but it is not the usual way to assign a scalar. The parentheses only group the expression, so they add nothing; the clearer form is $G = 0.
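my %seen;
# count every base on the sequence lines (those not starting with ">")
for my $line (grep { !/^>/ } <INFILE>) {
    chomp $line;
    $seen{$_}++ for split //, $line;
}
foreach my $base (sort keys %seen) {
    print "$base: $seen{$base}\n";
}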
Just reset the counters when a new gene is found. Also, I'd use hashes for the counting:
use strict; use warnings;
my %counts;
while (<>) {
if (/^>/) {
# print counts for the prev gene if there are counts:
print_counts(\%counts) if keys %counts;
%counts = (); # reset the counts
print $_; # print the Fasta header
} else {
chomp;
$counts{$_}++ for split //;
}
}
print_counts(\%counts) if keys %counts; # print counts for last gene
sub print_counts {
my ($counts) = @_;
print "$_:=", ($counts->{$_} || 0), "\n" for qw/A C G T/;
}
Usage: $ perl count-bases.pl input.fasta.
Example output:
> gene 1
A:=3
C:=1
G:=5
T:=5
> gene 2
A:=1
C:=5
G:=0
T:=13
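The question also asked about the total length and GC percentage. The counts can be combined as in the sketch below, which uses plain + rather than +=, because += changes the counters while they are being summed. The sub name gc_content is made up, and the example sequence is the first ten bases of the env sequence from the question.

```perl
use strict;
use warnings;

# Count bases in one sequence and return (total length, GC percentage).
# Characters other than A, C, G and T are not included in the total.
sub gc_content {
    my ($seq) = @_;
    my %counts = ( A => 0, C => 0, G => 0, T => 0 );
    $counts{$_}++ for split //, uc $seq;

    # Plain + leaves the counts untouched; += would modify them while summing.
    my $total = $counts{A} + $counts{C} + $counts{G} + $counts{T};
    my $gc    = $total ? 100 * ( $counts{G} + $counts{C} ) / $total : 0;
    return ( $total, $gc );
}

my ( $total, $gc ) = gc_content('ATGCTTCTCA');
printf "Total length: %d base pairs\nGC content: %.1f%%\n", $total, $gc;
```

For this ten-base example the total is 10 and the GC content is 40.0%.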
Style comments:
When opening a file, always use lexical filehandles (normal variables). Also, you should do a three-arg open. I'd also recommend the autodie pragma for automatic error handling (since perl v5.10.1).
use autodie;
open my $in, "<", $infile;
open my $out, ">", $outfile;
Note that I don't open files in my above script because I use the special ARGV filehandle for input, and print to STDOUT. The output can be redirected on the shell, like
$ perl count-bases.pl input.fasta >counts.txt
Declaring scalar variables with their values in parens like my $G = (0) is weird, but works fine. I think this is more confusing than helpful. → my $G = 0.
Your indentation is a bit weird. It is very unusual and visually confusing to put closing braces on the same line with another statement like
...
elsif ( $array eq 'C' ) {
++$C; }
I prefer cuddling elsif:
...
} elsif ($base eq 'C') {
$C++;
}
This statement my $array= (@array); puts the length of the array into $array. What for? Tip: You can declare variables right inside foreach-loops, like for my $base (@array) { ... }.
I'm using Perl to generate a list of unique exons (which are the units of genes).
I've generated a file in this format (with hundreds of thousands of lines):
chr1 1000 2000 gene1
chr1 3000 4000 gene2
chr1 5000 6000 gene3
chr1 1000 2000 gene4
Position 1 is the chromosome, position 2 is the starting coordinate of the exon, position 3 is the ending coordinate of the exon, and position 4 is the gene name.
Because genes are often constructed of different arrangements of exons, you have the same exon in multiple genes (see the first and fourth lines). I want to remove these "duplicates", i.e., delete gene1 or gene4 (it is not important which one gets removed).
I've bashed my head against the wall for hours trying to do what (I think) is a simple task. Could anyone point me in the right direction(s)? I know people often use hashes to remove duplicate elements, but these aren't exactly duplicates (since the gene names are different). It's important that I don't lose the gene name, also. Otherwise this would be simpler.
Here's a totally non-functional loop I've tried. The "exons" array has each line stored as a scalar, hence the subroutine. Don't laugh. I know it doesn't work but at least you can see (I hope) what I'm trying to do:
for (my $i = 0; $i < scalar @exons; $i++) {
my @temp_line = line_splitter($exons[$i]); # runs subroutine turning scalar into array
for (my $j = 0; $j < scalar @exons_dup; $j++) {
my @inner_temp_line = line_splitter($exons_dup[$j]); # runs subroutine turning scalar into array
unless (($temp_line[1] == $inner_temp_line[1]) && # this loop ensures that the the loop
($temp_line[3] eq $inner_temp_line[3])) { # below skips the identical lines
if (($temp_line[1] == $inner_temp_line[1]) && # if the coordinates are the same
($temp_line[2] == $inner_temp_line[2])) { # between the comparisons
splice(@exons, $i, 1); # delete the first one
}
}
}
}
my #exons = (
'chr1 1000 2000 gene1',
'chr1 3000 4000 gene2',
'chr1 5000 6000 gene3',
'chr1 1000 2000 gene4'
);
my %unique_exons = map {
my ($chro, $scoor, $ecoor, $gene) = (split(/\s+/, $_));
"$chro $scoor $ecoor" => $gene
} @exons;
print "$_ $unique_exons{$_} \n" for keys %unique_exons;
This will give you uniqueness, and the last gene name will be included. This results in:
chr1 1000 2000 gene4
chr1 5000 6000 gene3
chr1 3000 4000 gene2
You can use a hash to dedup en passant, but you need a way to join the parts you want to use to detect duplicates into a single string.
sub extract_dup_check_string {
my $exon = shift;
my @parts = line_splitter($exon);
# modify to suit:
my $dup_check_string = join( ';', @parts[0..2] );
return $dup_check_string;
}
my %seen;
my @deduped_exons = grep !$seen{ extract_dup_check_string($_) }++, @exons;
You can use a hash to keep track of duplicates you've already seen and then skip them. This example assumes the fields in your input file are space-delimited:
#!/usr/bin/env perl
use strict;
use warnings;
my %seen;
while (my $line = <>) {
my($chromosome, $exon_start, $exon_end, $gene) = split /\s+/, $line;
my $key = join ':', $chromosome, $exon_start, $exon_end;
if ($seen{$key}) {
next;
}
else {
$seen{$key}++;
print $line;
}
}
As simple as it comes. I tried to use as little magic as possible.
my %exoms = ();
my $input;
open( $input, '<', "lines.in" ) or die $!;
while( <$input> )
{
if( $_ =~ /^(\w+\s+){3}(\w+)$/ ) #ignore lines that are not in expected format
{
my @splits = split( /\s+/, $_ ); #split line in $_ on multiple spaces
my $key = $splits[1] . '_' . $splits[2];
if( !exists( $exoms{$key} ) )
{
#could output or write to a new file here, probably output to a file
#for large sets.
$exoms{$key} = \@splits;
}
}
}
#demo to show what was parsed from demo input
while( my ($key, $value) = each(%exoms) )
{
my @splits = @{$value};
foreach my $position (@splits)
{
print( "$position " );
}
print( "\n" );
}