Get shortest and longest sequence in file - perl

I'm trying to get the shortest and longest sequence in a file containing multiple genbank-like entries. example of the file:
LOCUS NM_182854 2912 bp mRNA linear PRI 20-APR-2016
DEFINITION Homo sapiens mRNA.
ACCESSION NM_182854
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.
ORIGIN
1 gggcgatcag aagcaggtca cacagcctgt ttcctgtttt caaacgggga acttagaaag
61 tggcagcccc tcggcttgtc gccggagctg agaaccaaga gctcgaaggg gccatatgac
//
LOCUS NM_001323410 6992 bp mRNA linear PRI 20-APR-2016
DEFINITION Homo sapiens mRNA.
ACCESSION NM_001323410
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.
ORIGIN
1 actacttccg gcttccccgc cccgccccgt ccccgggcgt ctccattttg gtctcaggtg
61 tggactcggc aagaaccagc gcaagaggga agcagagtta tagctacccc ggc
//
I'd like to print the accession number, the type of the organism from the shortest sequence and the longest sequence
my code so far:
#!/usr/bin/perl
use strict;
use warnings;
print "enter file path\n";
while (my $line = <>){
chomp $line;
my #record = ($line);
foreach my $file(#record){
open(IN, "$file") or die "\n error opening file \n;/\n";
$/="//";
while (my $line = <IN>){
my #gb_seq = split ("ORIGIN", $line);
my $definition = $gb_seq[0];
my $sequence = $gb_seq[1];
$definition =~ m/ORGANISM[\s\t]+(.+)[\n\s\t]+/;
my $organism = $1;
if ($definition =~ m/ACCESSION[\s\t]+(\D\D_\d\d\d\d\d\d(\d*))[\n\s\t]+/){
my $accession = $1;
$sequence =~ s/\d//g;
$sequence =~ s/[\n\s\t]//g;
my $size = length($sequence);
my #sorted_keys = sort { $a <=> $b } keys my %size;
my $shortest = $sorted_keys[0];
my $longest = $sorted_keys[-1];
print "this is the shortest: $accession $organism size: $shortest\n";
print "this is the longest: $accession $organism size: $longest\n";
}
}}}
exit;
I thought about putting the length in a hash to get the shortest and the longest but something is wrong there. I get these errors:
Use of uninitialized value $organism in concatenation (.) or string at test.pl line 39, <IN> chunk 1
Use of uninitialized value $shortest in concatenation (.) or string at test.pl line 39, <IN> chunk 1.
Use of uninitialized value $longest in concatenation (.) or string at test.pl line 40, <IN> chunk 1.
What part should I change? Thanks

We need to find extreme-length entries while being able to identify the record they belong to. Reading records by // is again a nice idea. However, then each record is a string and pulling the sequence out of it directly is harder than breaking it into lines first. Thus we may as well go line by line, given that there are clear markers for everything needed.
A choice of data structure is important and depends on the purpose. Here I organize data so that it is easy to work with, into a hash with elements
%block = ( 'accession' => { 'type' => type, 'sequence' => sequence }, ... )
The search to perform once the data is read in would be greatly aided by organizing this by 'sequence' (instead of by 'accession'), but that would make it very hard to work with. I presume that this may end up being used for more, and that a small loss of speed is not of consequence. If the sole objective here were to answer the specific question with optimal performance other approaches would be more suitable. Comments follow the code.
use warnings;
use strict;
use feature qw(say);
my $file = 'data_seqs.txt';
open my $fh, '<', $file or die "Can't open $file -- $!";
# Hash, helper variables, flag (inside a sequence?), sequence-end marker
my (%block, $accession, $sequence);
my $is_seq = 0;
my $end_marker = qr(\s*//); # marks end of sequence: //
while (my $line = <$fh>)
{
chomp($line);
next if $line =~ /^\s*$/; # skip empty lines
if ($line =~ /$end_marker/) { # done with the sequence
$is_seq = 0;
$sequence = '';
next;
}
if ($line =~ /^\s*ACCESSION\s*(\w+)/) {
$accession = $1;
}
elsif ($line =~ /^\s*ORGANISM\s*(.+)/) {
$block{$accession}{'type'} = $1;
}
elsif ($line =~ /^\s*ORIGIN/) { # start sequence on next line
$is_seq = 1;
}
elsif ($is_seq) { # read (and add to) sequence
if ($line =~ /^\s*\d+\s*(.*)/) {
$block{$accession}{'sequence'} .= $1;
}
else { warn "Not sequence? Line: $line " }
}
}
# Identify keys for max and min lenght. Initialize with any keys
my ($max, $min) = keys %block;
foreach my $acc (keys %block)
{
my $current_len = length($block{$acc}{'sequence'});
if ( $current_len > length($block{$max}{'sequence'}) ) {
$max = $acc;
}
if ( $current_len < length($block{$min}{'sequence'}) ) {
$min = $acc;
}
}
say "Maximum length sequence: ACCESSION: $max, ORGANISM: " . $block{$max}{'type'};
say "Minimum length sequence: ACCESSION: $min, ORGANISM: " . $block{$min}{'type'};
use Data::Dumper;
print Dumper(\%block);
This prints (Dumper's printout omitted)
Maximum length sequence: ACCESSION: NM_182854, ORGANISM Homo sapiens
Minimum length sequence: ACCESSION: NM_001323410, ORGANISM Homo sapiens
A comment on searching efficiency
One common approach would be to first build a reverse lookup hash, then use a library, say from List::Utils, to find max and min, then look up where they belong. For this we do need to build the lookup hash and we'd use the library twice, while searching through it by hand as above makes one pass over the structure and is also simpler. Another option would be to have hash top-level keys be sequences and then directly find max and min. However, such hash would be considerably harder to work with.
Yet another approach would be to organize data into a structure that would allow more efficient retrieval of this specific information, probably based on arrays.
However, the efficiency gain doesn't seem to justify the great loss of convenience. If the speed turns out to be a problem then this should be considered.
If you need to work with multiple files just change the loop to while (<>) and submit them on the command line. All lines from all of them will then be read line by line and the code stays the same.
It may be that I misunderstood some terms. I don't remove empty spaces from the "sequence", and use words on the first line only for "type", just to name a couple of candidates. These are easy to adjust, please let me know.

You state that you want two pieces of data - the accession and the organism - for the longest and shortest sequence. This means your hash values need to store two elements. As well as that, when you use '//' as a record separator, the '//' still appears on the end of each record. So, when you filter out whitespace and digits from you sequence, you're still left with '//' on the end. When I ran your code through the debugger, I was finding the lengths were all out by 2 because of this.
A couple of other things:
When using regexs, use 'extended mode', /x, so you can include whitespace for readabillity
you presume a successful match when you dig out $definition - better to test your regexs and assign on match, die on missmatch
Rather than store the length in the hash (and lose the sequence itself), you might as well store the sequence and calculate the lengths later;
I renamed the variable $line to $chunk as it contains several lines
All the stuff to do with calculating the shortest and longest and printing the resuts needs to move out of the loop. In its place, you simply need to make an entry into the hash. As described above, the hash values need to be an array with two values - the accession and the organism.
You remove digits from the sequence in one command and then whitespace from the sequence in another - might as well do them both togeather. While we're at it, might as well remove the '/'s on the end of the record.
Given the mods above, I get;
use v5.14;
use warnings;
print "Enter file path: ";
chomp(my $filename = <>);
open(IN, $filename) or die "\n error opening file \n;/\n";
$/ = "//" ;
my %organisms ;
while (my $chunk = <IN>) {
next if $chunk =~ /^\s*\n\s*$/ ;
my ($definition , $sequence) = split "ORIGIN", $chunk ;
my $organism ;
$definition =~ m/ ORGANISM [\s\t]+ (.+) [\n\s\t]+ /x
? $organism = $1
: die "Couldnt find ORGANISM line" ;
my $accession ;
$definition =~ m/ ACCESSION [\s\t]+ (\D\D _ \d{6} (\d*)) [\n\s\t]+ /x
? $accession = $1
: die "Cant find ACCESSION line" ;
$sequence =~ s/[\d\n\s\t\/]//g;
$organisms{ $sequence } = [ $accession , $organism ] ;
}
my #sorted_keys = sort { length $a <=> length $b } keys %organisms ;
my $shortest = $sorted_keys[0];
my $longest = $sorted_keys[-1];
say "this is the shortest: ", $organisms{$shortest}->[0],
", ", $organisms{$shortest}->[1],
" size: ", length $shortest, "\n",
" sequence: ", $shortest ;
say "this is the longest: ", $organisms{$longest}->[0],
", ", $organisms{$longest}->[1],
" size: ", length $longest, "\n",
" sequence: ", $longest ;
exit;
when ran on your data, it produces;
$ ./sequence.pl
Enter file path: data.txt
this is the shortest: NM_001323410, Homo sapiens size: 113
sequence: actacttccggcttccccgccccgccccgtccccgggcgtctccattttggtctcaggtgtggactcggcaagaaccagcgcaagagggaagcagagttatagctaccccggc
this is the longest: NM_182854, Homo sapiens size: 120
sequence: gggcgatcagaagcaggtcacacagcctgtttcctgttttcaaacggggaacttagaaagtggcagcccctcggcttgtcgccggagctgagaaccaagagctcgaaggggccatatgac
UPDATE
The problem with the code above is that if the same sequence appears in two chunks, then data is going to be overwritten in the hash and lost. Below is an updated version that stores data in an array of arrays which will advoid the problem. It produces exactly the same output:
use v5.14;
use warnings;
print "Enter file path: ";
chomp(my $filename = <>);
open(IN, $filename) or die "\n error opening file \n;/\n";
$/ = "//" ;
my #organisms ;
while (my $chunk = <IN>) {
next if $chunk =~ /^\s*\n\s*$/ ;
my ($definition , $sequence) = split "ORIGIN", $chunk ;
my $organism ;
$definition =~ m/ ORGANISM [\s\t]+ (.+) [\n\s\t]+ /x
? $organism = $1
: die "Couldnt find ORGANISM line" ;
my $accession ;
$definition =~ m/ ACCESSION [\s\t]+ (\D\D _ \d{6} (\d*)) [\n\s\t]+ /x
? $accession = $1
: die "Cant find ACCESSION line" ;
$sequence =~ s/[\d\n\s\t\/]//g;
push #organisms, [$organism , $accession , $sequence] ;
}
my #sorted_organisms = sort { length $a->[2] <=> length $b->[2] } #organisms ;
my ($organism , $accession , $sequence) = #{ $sorted_organisms[0] };
say "this is the shortest: $accession, $organism, size: ",
length $sequence, "\n", " sequence: ", $sequence ;
($organism , $accession , $sequence) = #{ $sorted_organisms[-1] };
say "this is the longest: $accession, $organism, size: ",
length $sequence, "\n", " sequence: ", $sequence ;
exit;

Related

Error use of uninitialized value although it is initialized

I am trying to make a table looking content of one input file but it constantly gives me an error
Use of uninitialized value $ac[3] in concatenation (.) or string at table.pl
line 58 (#1)
and
Use of uninitialized value $or[2] in concatenation (.) or string at table.pl
line 61 (#1)
and although I made almost every possible changes it still gives me an error and does not print well.
This is how my input file looks like:
HEADER OXIDOREDUCTASE 08-JUN-12 2LU5
EXPDTA SOLID-STATE NMR
REMARK 2 RESOLUTION. NOT APPLICABLE.
HETNAM CU COPPER (II) ION
HETNAM ZN ZINC
FORMUL 2 CU CU 2+
FORMUL 2 ZN ZN 2+
END
This is a script I am using:
#!/usr/bin/env perl
use strict;
use warnings;
use diagnostics;
#my $testfile=shift;
open(INPUT, "$ARGV[0]") or die 'Cannot make it';
my #file=<INPUT>;
close INPUT;
my #ac=();
my #dr=();
my #os=();
my #or=();
my #fo=();
for (my $line=0;$line<=$#file;$line++)
{
chomp($file[$line]);
if ($file[$line] =~ /^HEADER/)
{
print( (split '\s+', $file[$line])[-1]);
print "\t";
while ($file[$line] !~ /^END /)
{
$line++;
if ($file[$line]=~/^EXPDTA/)
{
$file[$line]=~s/^EXPDTA//;
#os=(#os,split '\s+', $file[$line]);
}
if ($file[$line] =~ /^REMARK 2 RESOLUTION./)
{
$file[$line]=~s/^REMARK 2 RESOLUTION.//;
#ac = (#ac,split'\s+',$file[$line]);
}
if ($file[$line] =~ /^HETNAM/)
{
$file[$line]=~s/^HETNAM//;
$file[$line] =~ s/\s+//;
push #dr, $file[$line];
}
if ($file[$line] =~ /^SOURCE 2 ORGANISM_SCIENTIFIC/)
{
$file[$line]=~s/^SOURCE 2 ORGANISM_SCIENTIFIC//;
#or = (#or,split'\s+',$file[$line]);
}
if ($file[$line] =~ /^FORMUL/)
{
$file[$line]=~s/^FORMUL//;
$file[$line] =~ s/\s+//;
push #fo, $file[$line];
}
}
print "$os[1] $os[2]\t";
print "\t";
#os=();
print "$ac[3] $ac[4]\t" or die "Cannot be printed"; #line 58
print "\t";
#ac=();
print "$or[2] $or[3]\t" or die "Cannot be printed"; #line 61
print "\t";
#or=();
foreach (#dr)
{
print "$_";
print "\t\t\t\t\t";
}
#dr=();
print "\n";
}
}
And this is the output it gives me, but it doesnt seems to print well and I am really not sure why:
2LU5 SOLID-STATE NMR CU COPPER (II) ION
Desired output that I am expecting is :
HEADER EXPDTA REMARK HETNAM FORMUL
OXIDOREDUCTASE 2LU5 SOLID-STATE NMR RESOLUTION. NOT APPLICABLE. COPPER (II) ION (here better to say last column because certain diversity exists before "copper") CU 2+
ZN ZINC ZN 2+
The root of your error is that:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my #ac = ();
my $str = "REMARK 2 RESOLUTION. NOT APPLICABLE. ";
$str =~ s/^REMARK 2 RESOLUTION.//;
#ac = ( #ac, split '\s+', $str );
print Dumper \#ac;
The contents of #ac is:
$VAR1 = [
'',
'NOT',
'APPLICABLE.'
];
There is no $ac[3], you only have elements 0,1,2 in there.
With your #or error, you don't have any lines matching: /^SOURCE 2 ORGANISM_SCIENTIFIC/
So that array is empty, and that too, means you've got no $or[2] to print.
More generally - what you're doing here is actually really quite clunky, and there's a much cleaner solution.
How about:
#!/usr/bin/env perl
use strict;
use warnings;
#set the text "END" as our record separator
local $/ = 'END';
#define the fields to print out.
my #field_order = qw ( HEADER EXPDTA REMARK HETNAM FORMUL );
print join ( ",", #field_order), "\n"; #print header row
#iterate STDIN or file named on command line.
#just like you're doing with open (FILE, $ARGV[0])
while ( <> ) {
#select key value pairs into a hash - first word on the line is the 'key'
#and the value is 'anything else'.
my %this_entry = m/^(\w+)\s+(.*)$/gm;
next unless $this_entry{'HEADER'}; #check we have a header.
s/\s+/ /g for values %this_entry; #strip repeated spaces from fields;
s/\s+$//g for values %this_entry; #strip trailing whitespace.
#split 'header' row into separate subfields
#this is an example of how you could transform other fields.
($this_entry{'HEADER'}, $this_entry{'DATE'}, $this_entry{'STRUCT'} ) = split ' ', $this_entry{'HEADER'};
print join (",", #this_entry{#field_order} ), "\n";
}
This will - given your input - print:
HEADER,DATE,STRUCT,EXPDTA,REMARK,HETNAM,FORMUL
OXIDOREDUCTASE,08-JUN-12,2LU5,SOLID-STATE NMR,2 RESOLUTION. NOT APPLICABLE.,CU COPPER (II) ION,2 CU CU 2+
Which isn't quite what your output matches, but hopefully it's illustrated how much simpler this task could be?

Write a Perl script that takes in a fasta and reverses all the sequences (without BioPerl)?

I dont know if this is just a quirk with Stawberry Perl, but I can't seem to get it to run. I just need to take a fasta and reverse every sequence in it.
-The problem-
I have a multifasta file:
>seq1
ABCDEFG
>seq2
HIJKLMN
and the expected output is:
>REVseq1
GFEDCBA
>REVseq2
NMLKJIH
The script is here:
$NUM_COL = 80; ## set the column width of output file
$infile = shift; ## grab input sequence file name from command line
$outfile = "test1.txt"; ## name output file, prepend with “REV”
open (my $IN, $infile);
open (my $OUT, '>', $outfile);
$/ = undef; ## allow entire input sequence file to be read into memory
my $text = <$IN>; ## read input sequence file into memory
print $text; ## output sequence file into new decoy sequence file
my #proteins = split (/>/, $text); ## put all input sequences into an array
for my $protein (#proteins) { ## evaluate each input sequence individually
$protein =~ s/(^.*)\n//m; ## match and remove the first descriptive line of
## the FATA-formatted protein
my $name = $1; ## remember the name of the input sequence
print $OUT ">REV$name\n"; ## prepend with #REV#; a # will help make the
## protein stand out in a list
$protein =~ s/\n//gm; ## remove newline characters from sequence
$protein = reverse($protein); ## reverse the sequence
while (length ($protein) > $NUM_C0L) { ## loop to print sequence with set number of cols
$protein =~ s/(.{$NUM_C0L})//;
my $line = $1;
print $OUT "$line\n";
}
print $OUT "$protein\n"; ## print last portion of reversed protein
}
close ($IN);
close ($OUT);
print "done\n";
This will do as you ask
It builds a hash %fasta out of the FASTA file, keeping array #keys to keep the sequences in order, and then prints out each element of the hash
Each line of the sequence is reversed using reverse before it is added to the hash, and using unshift adds the lines of the sequence in reverse order
The program expects the input file as a parameter on the command line, and prints the result to STDOUT, which may be redirected on the command line
use strict;
use warnings 'all';
my (%fasta, #keys);
{
my $key;
while ( <> ) {
chomp;
if ( s/^>\K/REV/ ) {
$key = $_;
push #keys, $key;
}
elsif ( $key ) {
unshift #{ $fasta{$key} }, scalar reverse;
}
}
}
for my $key ( #keys ) {
print $key, "\n";
print "$_\n" for #{ $fasta{$key} };
}
output
>REVseq1
GFEDCBA
>REVseq2
NMLKJIH
Update
If you prefer to rewrap the sequence so that short lines are at the end, then you just need to rewrite the code that dumps the hash
This alternative uses the length of the longest line in the original file as the limit, and rerwraps the reversed sequence to the same length. It's claer that it would be simple to specify an explicit length instead of calculating it
You will need to add use List::Util 'max' at the top of the program
my $len = max map length, map #$_, values %fasta;
for my $key ( #keys ) {
print $key, "\n";
my $seq = join '', #{ $fasta{$key} };
print "$_\n" for $seq =~ /.{1,$len}/g;
}
Given the original data the output is identical to that of the solution above. I used this as input
>seq1
ABCDEFGHI
JKLMNOPQRST
UVWXYZ
>seq2
HIJKLMN
OPQRSTU
VWXY
with this result. All lines have been wrapped to eleven characters - the length of the longest JKLMNOPQRST line in the original data
>REVseq1
ZYXWVUTSRQP
ONMLKJIHGFE
DCBA
>REVseq2
YXWVUTSRQPO
NMLKJIH
I don't know if this is just for a class that uses toy datasets or actual research FASTAs that can be gigabytes in size. If the latter, it would make sense not to keep the whole data set in memory as both your program and Borodin's do but read it one sequence at a time, print that out reversed and forget about it. The following code does that and also deals with FASTA files that may have asterisks as sequence-end markers as long as they start with >, not ;.
#!/usr/bin/perl
use strict;
use warnings;
my $COL_WIDTH = 80;
my $sequence = '';
my $seq_label;
sub print_reverse {
my $seq_label = shift;
my $sequence = reverse shift;
return unless $sequence;
print "$seq_label\n";
for(my $i=0; $i<length($sequence); $i += $COL_WIDTH) {
print substr($sequence, $i, $COL_WIDTH), "\n";
}
}
while(my $line = <>) {
chomp $line;
if($line =~ s/^>/>REV/) {
print_reverse($seq_label, $sequence);
$seq_label = $line;
$sequence = '';
next;
}
$line = substr($line, 0, -1) if substr($line, -1) eq '*';
$sequence .= $line;
}
print_reverse($seq_label, $sequence);

Open a file.txt and find the possible start and end positions of its genes

Hi I have a file and I would like to open it and find the start and end positions of its genes,also I have some extra imformations.The beginning of each gene is mapped by the following pattern. There is an 8 letter consensus known as the Shine-Dalgarno sequence (TAAGGAGG) followed by 4-10 bases downstream before the initiation codon (ATG). However there are variants of the Shine-Dalgarno sequence with the most common of which being [TA][AC]AGGA[GA][GA].The end of the gene is specified by the stop codon TAA, TAG and TGA. It must be taken care the stop codon is found after the correct Open.Reading Frame (ORF).
Now I have make a txt file with genome and I open it with this code,and the error begin when I go to read the genome and put start and end.Any help?Thanks a lot.:
#!/usr/bin/perl -w
use strict;
use warnings;
# Searching for motifs
# Ask the user for the filename of the file containing
my $proteinfilename = "yersinia_genome.fasta";
print "\nYou open the filename of the protein sequence data: yersinia_genome.fasta \n";
# Remove the newline from the protein filename
chomp $proteinfilename;
# open the file, or exit
unless (open(PROTEINFILE, $proteinfilename) )
{
print "Cannot open file \"$proteinfilename\"\n\n";
exit;
}
# Read the protein sequence data from the file, and store it
# into the array variable #protein
my #protein = <PROTEINFILE>;
# Close the file - we've read all the data into #protein now.
close PROTEINFILE;
# Put the protein sequence data into a single string, as it's easier
# to search for a motif in a string than in an array of
# lines (what if the motif occurs over a line break?)
my $protein = join( '', #protein);
# Remove whitespace.
$protein =~ s/\s//g;
# In a loop, ask the user for a motif, search for the motif,
# and report if it was found.
my $motif='TAAGGAGG';
do
{
print "\n Your motif is:$motif\n";
# Remove the newline at the end of $motif
chomp $motif;
# Look for the motif
if ( $protein =~ /$motif/ )
{
print "I found it!This is the motif: $motif in line $.. \n\n";
}
else
{
print "I couldn't find it.\n\n";
}
}
until ($motif =~ /TAAGGAGG/g);
my $reverse=reverse $motif;
print "Here is the reverse Motif: $reverse. \n\n";
#HERE STARTS THE PROBLEMS,I DONT KNOW WHERE I MAKE THE MISTAKES
#$genome=$motif;
#$genome = $_[0];
my $ORF = 0;
while (my $genome = $proteinfilename) {
chomp $genome;
print "processing $genome\n";
my $mrna = split(/\s+/, $genome);
while ($mrna =~ /ATG/g) {
# $start and $stop are 0-based indexes
my $start = pos($mrna) - 3; # back up to include the start sequence
# discard remnant if no stop sequence can be found
last unless $mrna=~ /TAA|TAG|TGA/g;
#m/^ATG(?:[ATGC]{3}){8,}?(?:TAA|TAG|TGA)/gm;
my $stop = pos($mrna);
my $genlength = $stop - $start;
my $genome = substr($mrna, $start, $genlength);
print "\t" . join(' ', $start+1, $stop, $genome, $genlength) . "\n";
# $ORF ++;
#print "$ORF\n";
}
}
exit;
Thanks,I have make it the solution is :
local $_=$protein;
while(/ATG/g){
my $start = pos()-3;
if(/T(?:TAA|TAG|TGA)/g){
my $stop = pos;
print $start, " " , $stop, " " ,$stop - $start, " " ,
substr ($_,$start,$stop - $start),$/;
}
}
while (my $genome = $proteinfilename) {
This creates an endless loop: you are copying the file name (not the $protein data) over and over.
The purpose of the while loop is unclear; it will never terminate.
Perhaps you simply mean
my ($genome) = $protein;
Here is a simplistic attempt at fixing the obvious problems in your code.
#!/usr/bin/perl -w
use strict;
use warnings;
my $proteinfilename = "yersinia_genome.fasta";
chomp $proteinfilename;
unless (open(PROTEINFILE, $proteinfilename) )
{
# die, don't print & exit
die "Cannot open file \"$proteinfilename\"\n";
}
# Avoid creating a potentially large temporary array
# Read directly into $protein instead
my $protein = join ('', <PROTEINFILE>);
close PROTEINFILE;
$protein =~ s/\s//g;
# As this is a static variable, no point in looping
my $motif='TAAGGAGG';
chomp $motif;
if ( $protein =~ /$motif/ )
{
print "I found it! This is the motif: $motif in line $.. \n\n";
}
else
{
print "I couldn't find it.\n\n";
}
my $reverse=reverse $motif;
print "Here is the reverse Motif: $reverse. \n\n";
# $ORF isn't used; removed
# Again, no point in writing a loop
# Also, $genome is a copy of the data, not the filename
my $genome = $protein;
# It was already chomped, so no need to do that again
my $mrna = split(/\s+/, $genome);
while ($mrna =~ /ATG/g) {
my $start = pos($mrna) - 3; # back up to include the start sequence
last unless $mrna=~ /TAA|TAG|TGA/g;
my $stop = pos($mrna);
my $genlength = $stop - $start;
my $genome = substr($mrna, $start, $genlength);
print "\t" . join(' ', $start+1, $stop, $genome, $genlength) . "\n";
}
exit;

How to parse through tab-delimited file in perl?

I'm new to Perl, and I've hit a mental roadblock. I need to extract information from a tab delimited file as shown below.
#name years risk total
adam 5 100 200
adam 5 50 100
adam 10 20 300
bill 20 5 100
bill 30 10 800
In this example, the tab delimited file shows length of investment, amount of money risked, and total at the end of investment.
I want to parse through this file, and for each name (e.g. adam), calculate sum of years invested 5+5, and calculate sum of earnings (200-100) + (100-50) + (300-20). I also would like to save the totals for each name (200, 100, 300).
Here's what I have tried so far:
my $filename;
my $seq_fh;
open $seq_fh, $frhitoutput
or die "failed to read input file: $!";
while (my $line = <$seq_fh>) {
chomp $line;
## skip comments and blank lines and optional repeat of title line
next if $line =~ /^\#/ || $line =~ /^\s*$/ || $line =~ /^\+/;
#split each line into array
my #line = split(/\s+/, $line);
my $yeartotal = 0;
my $earning = 0;
#$line[0] = name
#$line[1] = years
#$line[2] = start
#$line[3] = end
while (#line[0]){
$yeartotal += $line[1];
$earning += ($line[3]-$line[2]);
}
}
Any ideas of where I went wrong?
The Text::CSV module can be used to read tab-delimited data. Often much nicer than trying to manually hack yourself something up with split and so on when it comes to things like quoting, escaping, etc..
You're wrong here : while(#line[0]){
I'd do:
my $seq_fh;
my %result;
open($seq_fh, $frhitoutput) || die "failed to read input file: $!";
while (my $line = <$seq_fh>) {
chomp $line;
## skip comments and blank lines and optional repeat of title line
next if $line =~ /^\#/ || $line =~ /^\s*$/ || $line =~ /^\+/;
#split each line into array
my #line = split(/\s+/, $line);
$result{$line[0]}{yeartotal} += $line[1];
$result{$line[0]}{earning} += $line[3] - $line[2];
}
You should use hash, something like this:
my %hash;
while (my $line = <>) {
next if $line =~ /^#/;
my ($name, $years, $risk, $total) = split /\s+/, $line;
next unless defined $name and defined $years
and defined $risk and defined $total;
$hash{$name}{years} += $years;
$hash{$name}{risk} += $risk;
$hash{$name}{total} += $total;
$hash{$name}{earnings} += $total - $risk;
}
foreach my $name (sort keys %hash) {
print "$name earned $hash{$name}{earnings} in $hash{$name}{years}\n";
}
Nice opportunity to explore Perl's powerful command line options! :)
Code
Note: this code should be a command line oneliner, but it's a little bit easier to read this way. When writing it in a proper script file, you really should enable strict and warnings and use a little bit better names. This version won't compile under strict, you have to declare our $d.
#!/usr/bin/perl -nal
# collect data
$d{$F[0]}{y} += $F[1];
$d{$F[0]}{e} += $F[3] - $F[2];
# print summary
END { print "$_:\tyears: $d{$_}{y},\tearnings: $d{$_}{e}" for sort keys %d }
Output
adam: years: 20, earnings: 430
bill: years: 50, earnings: 885
Explanation
I make use of the -n switch here which basically lets your code iterate over the input records (-l tells it to use lines). The -a switch lets perl split the lines into the array #F. Simplified version:
while (defined($_ = <STDIN>)) {
chomp $_;
our(#F) = split(' ', $_, 0);
# collect data
$d{$F[0]}{y} += $F[1];
$d{$F[0]}{e} += $F[3] - $F[2];
}
%d is a hash with the names as keys and hashrefs as values, which contain years (y) and earnings (e).
The END block is executed after finishing the input line processing and outputs %d.
Use O's Deparse to view the code which is actually executed:
book:/tmp memowe$ perl -MO=Deparse tsv.pl
BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
chomp $_;
our(#F) = split(' ', $_, 0);
$d{$F[0]}{'y'} += $F[1];
$d{$F[0]}{'e'} += $F[3] - $F[2];
sub END {
print "${_}:\tyears: $d{$_}{'y'},\tearnings: $d{$_}{'e'}" foreach (sort keys %d);
}
;
}
tsv.pl syntax OK
It seems like a fixed-width file, I would use unpack for that

Using Perl hashes to handle tab-delimited files

I have two files:
file_1 has three columns (Marker(SNP), Chromosome, and position)
file_2 has three columns (Chromosome, peak_start, and peak_end).
All columns are numeric except for the SNP column.
The files are arranged as shown in the screenshots. file_1 has several hundred SNPs as rows while file_2 has 61 peaks. Each peak is marked by a peak_start and peak_end. There can be any of the 23 chromosomes in either file and file_2 has several peaks per chromosome.
I want to find if the position of the SNP in file_1 falls within the peak_start and peak_end in file_2 for each matching chromosome. If it does, I want to show which SNP falls in which peak (preferably write output to a tab-delimited file).
I would prefer to split the file, and use hashes where the chromosome is the key. I have found only a few questions remotely similar to this, but I could not understand well the suggested solutions.
Here is the example of my code. It is only meant to illustrate my question and so far doesn't do anything so think of it as "pseudocode".
#!usr/bin/perl
use strict;
use warnings;
my (%peaks, %X81_05);
my #array;
# Open file or die
unless (open (FIRST_SAMPLE, "X81_05.txt")) {
die "Could not open X81_05.txt";
}
# Split the tab-delimited file into respective fields
while (<FIRST_SAMPLE>) {
chomp $_;
next if (m/Chromosome/); # Skip the header
#array = split("\t", $_);
($chr1, $pos, $sample) = #array;
$X81_05{'$array[0]'} = (
'position' =>'$array[1]'
)
}
close (FIRST_SAMPLE);
# Open file using file handle
unless (open (PEAKS, "peaks.txt")) {
die "could not open peaks.txt";
}
my ($chr, $peak_start, $peak_end);
while (<PEAKS>) {
chomp $_;
next if (m/Chromosome/); # Skip header
($chr, $peak_start, $peak_end) = split(/\t/);
$peaks{$chr}{'peak_start'} = $peak_start;
$peaks{$chr}{'peak_end'} = $peak_end;
}
close (PEAKS);
for my $chr1 (keys %X81_05) {
my $val = $X81_05{$chr1}{'position'};
for my $chr (keys %peaks) {
my $min = $peaks{$chr}{'peak_start'};
my $max = $peaks{$chr}{'peak_end'};
if (($val > $min) and ($val < $max)) {
#print $val, " ", "lies between"," ", $min, " ", "and", " ", $max, "\n";
}
else {
#print $val, " ", "does not lie between"," ", $min, " ", "and", " ", $max, "\n";
}
}
}
More awesome code:
http://i.stack.imgur.com/fzwRQ.png
http://i.stack.imgur.com/2ryyI.png
A couple of program hints in Perl:
You can do this:
open (PEAKS, "peaks.txt")
or die "Couldn't open peaks.txt";
Instead of this:
unless (open (PEAKS, "peaks.txt")) {
die "could not open peaks.txt";
}
It's more standard Perl, and it's a bit easier to read.
Talking about Standard Perl, you should use the 3 argument open form, and use scalars for file handles:
open (my $peaks_fh, "<", "peaks.txt")
or die "Couldn't open peaks.txt";
This way, if your file's name just happens to start with a | or >, it will still work. Using scalars variables (variables that start with a $) makes it easier to pass file handles between functions.
Anyway, just to make sure I understand you correctly: You said "I would prefer ... use hashes where the chromosome is the key."
Now, I have 23 pairs of chromosomes, but each of those chromosomes might have thousands of SNPs on it. If you key by chromosome this way, you can only store a single SNP per chromosome. Is this what you want? I notice your data is showing all the same chromosome. That means you can't key by chromosome. I'm ignoring that for now, and using my own data.
I've also noticed a difference in what you said the files contained, and how your program uses them:
You said: "file 1 has 3 columns (SNP, Chromosome, and position)" , yet your code is:
($chr1, $pos, $sample) = #array;
Which I assume is Chromosome, Position, and SNP. Which way is the file arranged?
You've got to clarify exactly what you're asking for.
Anyway, here's the tested version that prints out in tab delimited format. This is in a bit more modern Perl format. Notice that I only have a single hash by chromosome (as you specified). I read the peaks.txt in first. If I find in my position file a chromosome that doesn't exist in my peaks.txt file, I simply ignore it. Otherwise, I'll add in the additional hashes for POSITION and SNP:
I do a final loop that prints everything out (tab delimitated) as you specified, but you didn't specify a format. Change it if you have to.
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie; #No need to check for file open failure
use constant {
PEAKS_FILE => "peak.txt",
POSITION_FILE => "X81_05.txt",
};
open ( my $peak_fh, "<", PEAKS_FILE );
my %chromosome_hash;
while ( my $line = <$peak_fh> ) {
chomp $line;
next if $line =~ /Chromosome/; #Skip Header
my ( $chromosome, $peak_start, $peak_end ) = split ( "\t", $line );
$chromosome_hash{$chromosome}->{PEAK_START} = $peak_start;
$chromosome_hash{$chromosome}->{PEAK_END} = $peak_end;
}
close $peak_fh;
open ( my $position_fh, "<", POSITION_FILE );
while ( my $line = <$position_fh> ) {
chomp $line;
my ( $chromosome, $position, $snp ) = split ( "\t", $line );
next unless exists $chromosome_hash{$chromosome};
if ( $position >= $chromosome_hash{$chromosome}->{PEAK_START}
and $position <= $chromosome_hash{$chromosome}->{PEAK_END} ) {
$chromosome_hash{$chromosome}->{SNP} = $snp;
$chromosome_hash{$chromosome}->{POSITION} = $position;
}
}
close $position_fh;
#
# Now Print
#
say join ("\t", qw(Chromosome, SNP, POSITION, PEAK-START, PEAK-END) );
foreach my $chromosome ( sort keys %chromosome_hash ) {
next unless exists $chromosome_hash{$chromosome}->{SNP};
say join ("\t",
$chromosome,
$chromosome_hash{$chromosome}->{SNP},
$chromosome_hash{$chromosome}->{POSITION},
$chromosome_hash{$chromosome}->{PEAK_START},
$chromosome_hash{$chromosome}->{PEAK_END},
);
}
A few things:
Leave spaces around parentheses on both sides. It makes it easier to read.
I use parentheses when others don't. The current style is not to use them unless you have to. I tend to use them for all functions that take more than a single argument. For example, I could have said open my $peak_fh, "<", PEAKS_FILE;, but I think parameters start to get lost when you have three parameters on a function.
Notice I use use autodie;. This causes the program to quit if it can't open a file. That's why I don't even have to test whether or not the file opened.
I would have preferred to use object oriented Perl to hide the structure of the hash of hashes. This prevents errors such as thinking that the start peek is stored in START_PEEK rather than PEAK_START. Perl won't detect these type of miskeyed errors. Therefore, I prefer to use objects whenever I am doing arrays of arrays or hashes of hashes.
You only need one for loop because you are expecting to find some of the SNPs in the second lot. Hence, loop through your %X81_05 hash and check if any matches one in %peak. Something like:
for my $chr1 (keys %X81_05)
{
if (defined $peaks{$chr1})
{
if ( $X81_05{$chr1}{'position'} > $peaks{$chr1}{'peak_start'}
&& $X81_05{$chr1}{'position'} < $peaks{$chr1}{'peak_end'})
{
print YOUROUTPUTFILEHANDLE $chr1 . "\t"
. $peaks{$chr1}{'peak_start'} . "\t"
. $peaks{$chr1}{'peak_end'};
}
else
{
print YOUROUTPUTFILEHANDLE $chr1
. "\tDoes not fall between "
. $peaks{$chr1}{'peak_start'} . " and "
. $peaks{$chr1}{'peak_end'};
}
}
}
Note: I Have not tested the code.
Looking at the screenshots that you have added, this is not going to work.
The points raised by #David are good; try to incorporate those in your programs. (I have borrowed most of the code from #David's post.)
One thing I didn't understand is that why load both peak values and position in hash, as loading one would suffice. As each chromosome has more than one record, use HoA. My solution is based on that. You might need to change the cols and their positions.
use strict;
use warnings;
our $Sep = "\t";
open (my $peak_fh, "<", "data/file2");
my %chromosome_hash;
while (my $line = <$peak_fh>) {
chomp $line;
next if $line =~ /Chromosome/; #Skip Header
my ($chromosome) = (split($Sep, $line))[0];
push #{$chromosome_hash{$chromosome}}, $line; # Store the line(s) indexed by chromo
}
close $peak_fh;
open (my $position_fh, "<", "data/file1");
while (my $line = <$position_fh>) {
chomp $line;
my ($chromosome, $snp, $position) = split ($Sep, $line);
next unless exists $chromosome_hash{$chromosome};
foreach my $peak_line (#{$chromosome_hash{$chromosome}}) {
my ($start,$end) = (split($Sep, $line))[1,2];
if ($position >= $start and $position <= $end) {
print "MATCH REQUIRED-DETAILS...$line-$peak_line\n";
}
else {
print "NO MATCH REQUIRED-DETAILS...$line-$peak_line\n";
}
}
}
close $position_fh;
I used #tuxuday and #David's code to solve this problem. Here is the final code that did what I wanted. I have not only learned a lot, but I have been able to solve my problem successfully! Kudos guys!
use strict;
use warnings;
use feature qw(say);
# Read in peaks and sample files from command line
my $usage = "Usage: $0 <peaks_file> <sample_file>";
my $peaks = shift #ARGV or die "$usage \n";
my $sample = shift #ARGV or die "$usage \n";
our $Sep = "\t";
open (my $peak_fh, "<", "$peaks");
my %chromosome_hash;
while (my $line = <$peak_fh>) {
chomp $line;
next if $line =~ /Chromosome/; #Skip Header
my ($chromosome) = (split($Sep, $line))[0];
push #{$chromosome_hash{$chromosome}}, $line; # Store the line(s) indexed by chromosome
}
close $peak_fh;
open (my $position_fh, "<", "$sample");
while (my $line = <$position_fh>) {
chomp $line;
next if $line =~ /Marker/; #Skip Header
my ($snp, $chromosome, $position) = split ($Sep, $line);
# Check if chromosome in peaks_file matches chromosome in sample_file
next unless exists $chromosome_hash{$chromosome};
foreach my $peak_line (#{$chromosome_hash{$chromosome}}) {
my ($start,$end,$peak_no) = (split( $Sep, $peak_line ))[1,2,3];
if ( $position >= $start and $position <= $end) {
# Print output
say join ("\t",
$snp,
$chromosome,
$position,
$start,
$end,
$peak_no,
);
}
else {
next; # Go to next chromosome
}
}
}
close $position_fh;