Perl recursion techniques? - perl

I need a bit of help with is this code. I know the sections that should be recursive, or at least I think I do but am not sure how to implement it. I am trying to implement a path finding program from an alignment matrix that will find multiple routes back to the zero value. For example if you excute my code and insert CGCA as the first sequence and CACGTAT as the second sequence, and 1, 0, and -1 as the match, mismatch and gap scores. The program gives off the path as HDHHDD and the aligment as
CACGTAT
CGC--A-.
However there are more possible paths and aligments then this, except I don't know how many. What I want to do is have a piece of my code loop back on itself and find other paths and alignments, using the same code as the first time around, until it runs out of possible alignments. The best way I found on the net to do this is recursion, except no one can explain how to do it. In this case there should be two more paths and aligmennts HDDDHHD and CACGTAT, and C--GCA-, and. HDDDDHH , CACGTAT AND --CGCA-. I just don't know how to code to perform this task.
# Implementation of Needleman and Wunsch Algorithm
my($seq1, $len1, $seq2, $len2, $data, #matrix, $i, $j, $x, $y, $val1, $val2);
my($val3, $pathrow, $pathcol, $seq1loc, $seq2loc, $gapscore, $matchscore, $mismatchscore);
#first obtain the data from the user.
print "Please enter the first sequence for comaprsion\n";
$seq1=<STDIN>;
chomp $seq1;
print "Please enter the second sequence for comparsion\n";
$seq2=<STDIN>;
chomp $seq2;
# adding extra characters so sequences align with matrix
# saves some calculations later on
$seq1 = " " . $seq1;
$seq2 = " " . $seq2;
$len1 = length($seq1);
$len2 = length($seq2);
print "Enter the match score: ";
$matchscore=<STDIN>;
chomp $matchscore;
print "Enter the mismatch score: ";
$mismatchscore=<STDIN>;
chomp $mismatchscore;
print "Enter the gap score: ";
$gapscore=<STDIN>;
chomp $gapscore;
# declare a two dimensional array and initialize to spaces
# array must contain one extra row and one extra column
#matrix = ();
for($i = 0; $i < $len1; $i++){
for($j = 0; $j < $len2; $j++){
$matrix[$i][$j] = ' ';
}
}
# initialize 1st row and 1st column of matrix
$matrix[0][0] = 0;
for ($i = 1; $i < $len1; $i ++){
$matrix[$i][0] = $matrix[$i-1][0] + $gapscore;
}
for ($i = 1; $i < $len2; $i ++){
$matrix[0][$i] = $matrix[0][$i-1] + $gapscore;
}
# STEP 1:
# Fill in rest of matrix using the following rules:
# determine three possible values for matrix[x][y]
# value 1 = add gap score to matrix[x][y-1]
# value 2 = add gap score to matrix[x-1][y]
# value 3 = add match score or mismatch score to
# matrix[x-1][y-1] depending on nucleotide
# match for position x of $seq1 and position y
# of seq2
# place the largest of the three values in matrix[x][y]
#
# Best alignment score appears in matrix[$len1][$len2].
for($x = 1; $x < $len1; $x++){
for($y = 1; $y < $len2; $y++){
$val1 = $matrix[$x][$y-1] + $gapscore;
$val2 = $matrix[$x-1][$y] + $gapscore;
if (substr($seq1, $x, 1) eq substr($seq2, $y, 1)){
$val3 = $matrix[$x-1][$y-1] + $matchscore;
}
else{
$val3 = $matrix[$x-1][$y-1] + $mismatchscore;
}
if (($val1 >= $val2) && ($val1 >= $val3)){
$matrix[$x][$y] = $val1;
}
elsif (($val2 >= $val1) && ($val2 >= $val3)){
$matrix[$x][$y] = $val2;
}
else{
$matrix[$x][$y] = $val3;
}
}
}
# Display scoring matrix
print "MATRIX:\n";
for($x = 0; $x < $len1; $x++){
for($y = 0; $y < $len2; $y++){
print "$matrix[$x][$y] ";
}
print "\n";
}
print "\n";
# STEP 2:
# Begin at matrix[$len1][$len2] and find a path to
# matrix[0][0].
# Build string to hold path pattern by concatenating either
# 'H' (current cell produced by cell horizontally to left),
# 'D' (current cell produced by cell on diagonal),
# 'V' (current cell produced by cell vertically above)
# ***This is were I need help I need this code to be recursive, so I can find more then one path***
$pathrow = $len1-1;
$pathcol = $len2-1;
while (($pathrow != 0) || ($pathcol != 0)){
if ($pathrow == 0){
# must be from cell to left
$path = $path . 'H';
$pathcol = $pathcol - 1;
}
elsif ($pathcol == 0){
# must be from cell above
$path = $path . 'V';
$pathrow = $pathrow - 1;
}
# could be from any direction
elsif (($matrix[$pathrow][$pathcol-1] + $gapscore) == $matrix[$pathrow][$pathcol]){
# from left
$path = $path . 'H';
$pathcol = $pathcol - 1;
}
elsif (($matrix[$pathrow-1][$pathcol] + $gapscore) == $matrix[$pathrow][$pathcol]){
# from above
$path = $path . 'V';
$pathrow = $pathrow - 1;
}
else{
# must be from diagonal
$path = $path . 'D';
$pathrow = $pathrow - 1;
$pathcol = $pathcol - 1;
}
}
print "Path is $path\n";
# STEP 3:
# Determine alignment pattern by reading path string
# created in step 2.
# Create two string variables ($alignseq1 and $alignseq2) to hold
# the sequences for alignment.
# Reading backwards from $seq1, $seq2 and path string,
# if string character is 'D', then
# concatenate to front of $alignseq1, last char in $seq1
# and to the front of $alignseq2, last char in $seq2
# if string character is 'V', then
# concatenate to front of $alignseq1, last char in $seq1
# and to the front of $alignseq2, the gap char
# if string character is 'H', then
# concatenate to front of $alignseq1 the gap char
# and to the front of $alignseq2, last char in $seq2
# Continue process until path string has been traversed.
# Result appears in string $alignseq1 and $seq2
***#I need this code to be recursive as well.***
$seq1loc = $len1-1;
$seq2loc = $len2-1;
$pathloc = 0;
print length($path);
while ($pathloc < length($path)){
if (substr($path, $pathloc, 1) eq 'D'){
$alignseq1 = substr($seq1, $seq1loc, 1) . $alignseq1;
$alignseq2 = substr($seq2, $seq2loc, 1) . $alignseq2;
$seq1loc--;
$seq2loc--;
}
elsif (substr($path, $pathloc, 1) eq 'V'){
$alignseq1 = substr($seq1, $seq1loc, 1) . $alignseq1;
$alignseq2 = '-' . $alignseq2;
$seq1loc--;
}
else{ # must be an H
$alignseq1 = '-' . $alignseq1;
$alignseq2 = substr($seq2, $seq2loc, 1) . $alignseq2;
$seq2loc--;
}
$pathloc++;
}
print "\nAligned Sequences:\n";
print "$alignseq2 \n";
print "$alignseq1 \n";
# statement may be needed to hold output screen
print "Press any key to exit program";
$x = <STDIN>;
If anyone is wondering this is a needleman-wunsch algorithm. Any help here would be greatly apperciated.

I can't provide an answer, because I don't understand exactly what you are try to do, but I can offer some general advice.
Start organizing your code into discrete subroutines that perform narrowly defined tasks. In addition, the subroutines that implement your central algorithms should not be oriented toward receiving input from the keyboard and producing output to the screen; rather they should receive input as arguments and return their results. If there is a need for user input or screen output, those tasks should be in separate subroutines, not comingled with your primary algorithms.
A first (and partial) step down that path is to take you entire program, enclose it in a subroutine definition, and then call the subroutine with the required arguments. Instead of printing its key results, the subroutine should return them -- specifically, a reference to #matrix along with the values for $path, $alignseq1, $alignseq2.
sub NW_algo {
my ($seq1, $seq2, $matchscore, $mismatchscore, $gapscore) = #_;
# The rest of your code here, but with all print
# statements and <STDIN> inputs commented out.
return \#matrix, $path, $alignseq1, $alignseq2;
}
my(#return_values) = NW_algo('CGCA', 'CACGTAT', 1, 0, -1);
Print_matrix($return_values[0]);
sub Print_matrix {
for my $m ( #{$_[0]} ){
print join(' ', #$m), "\n";
}
}
At this point, you'll have an algorithm that can be invoked by other code, making it easier to test and debug your program going forward. For example, you could define various sets of input data and run NW_algo() on each set. Only then will it be possible to think about recursion or other techniques.

Since Needleman-Wunsch is a dynamic-programming algorithm, most of the work is already done by the time you compute your DP matrix. Once you have your DP matrix, you're supposed to backtrack through the matrix to find the optimal alignment. The problem is a bit like taxicab geometry except that you can move diagonally. Essentially, when you need to backtrack through the matrix, instead of choosing between going up, left, or diagonally, you do all three by making three recursive calls, and each of those call themselves for each of up, left, or diagonally, until you reach your starting point. The path traced by each strand of recursion will draw out each alignment.
EDIT: So basically you need to put Step 2 in a subprocedure (that takes position and the path traced so far), so it can call itself over and over. Of course, after you define the procedure you need to make one call to it to actually start the tracing process:
sub tracePaths {
$x = shift;
$y = shift;
$pathSoFar = shift; # assuming you're storing your path as a string
#
# ... do some work ...
#
tracePaths($x - 1, $y, $pathSoFar . something);
tracePaths($x, $y - 1, $pathSoFar . somethingelse);
tracePaths($x - 1, $y - 1, $pathSoFar . somethingelselse);
#
#
#
if(reached the end) return $pathSoFar;
}
# launch the recursion
tracePaths(beginningx, beginningy, "");

This doesn't speak specifically to your problem, but you should maybe check out the book Higher Order Perl. It goes over how to use a lot of higher-level techniques (such as recursion).

Related

Scoping in Perl

As a biology student, I'm trying to extend my programming knowledge and I ran into a problem with Perl.
I'm trying to create a program that generates random DNA strings and performs analysis work on the generated data.
In the first part of the program, I am able to print out the strings stored in the array, but the second part I cannot retrieve all but one of the elements of the array.
Could this be part of the scoping rules of Perl?
#!usr/bin/perl
# generate a random DNA strings and print it to file specified by the user.
$largearray[0] = 0;
print "How many nucleotides for the string?\n";
$n = <>;
$mylong = $n;
print "how many strings?\n";
$numstrings = <>;
# #largearray =();
$j = 0;
while ( $j < $numstrings ) {
$numstring = ''; # start with the empty string;
$dnastring = '';
$i = 0;
while ( $i < $n ) {
$numstring = int( rand( 4 ) ) . $numstring; # generate a new random integer
# between 0 and 3, and concatenate
# it with the existing $numstring,
# assigning the result to $numstring.
$i++; # increase the value of $i by one.
}
$dnastring = $numstring;
$dnastring =~ tr/0123/actg/; # translate the numbers to DNA characters.
#print $dnastring;
#print "\n";
$largearray[j] = $dnastring; #append generated string to end of array
#print $largearray[j];
#print $j;
#IN HERE THERE ARE GOOD ARRAY VALUES
#print "\n";
$j++;
}
# ii will be used to continuously take the next couple of strings from largearray
# for LCS matching.
$mytotal = 0;
$ii = 0;
while ( $ii < $numstrings ) {
$line = $largearray[ii];
print $largearray[ii]; #CANNOT RETRIEVE ARRAY VALUES
print "\n";
$ii++;
#string1 = split( //, $line );
$line = $largearray[ii];
#print $largearray[ii];
#print "\n";
$ii++;
chomp $line;
#string2 = split( //, $line );
$n = #string1; #assigning a list to a scalar just assigns the
#number of elements in the list to the scalar.
$m = #string2;
$v = 1;
$Cm = 0;
$Im = 0;
$V[0][0] = 0; # Assign the 0,0 entry of the V matrix
for ( $i = 1; $i <= $n; $i++ ) { # Assign the column 0 values and print
# String 1 See section 5.2 of Johnson
# for loops
$V[$i][0] = -$Im * $i;
}
for ( $j = 1; $j <= $m; $j++ ) { # Assign the row 0 values and print String 2
$V[0][$j] = -$Im * $j;
}
for ( $i = 1; $i <= $n; $i++ ) { # follow the recurrences to fill in the V matrix.
for ( $j = 1; $j <= $m; $j++ ) {
# print OUT "$string1[$i-1], $string2[$j-1]\n"; # This is here for debugging purposes.
if ( $string1[ $i - 1 ] eq $string2[ $j - 1 ] ) {
$t = 1 * $v;
}
else {
$t = -1 * $Cm;
}
$max = $V[ $i - 1 ][ $j - 1 ] + $t;
# print OUT "For $i, $j, t is $t \n"; # Another debugging line.
if ( $max < $V[$i][ $j - 1 ] - 1 * $Im ) {
$max = $V[$i][ $j - 1 ] - 1 * $Im;
}
if ( $V[ $i - 1 ][$j] - 1 * $Im > $max ) {
$max = $V[ $i - 1 ][$j] - 1 * $Im;
}
$V[$i][$j] = $max;
}
} #outer for loop
print $V[$n][$m];
$mytotal += $V[$n][$m]; # append current result to the grand total
print "\n";
} # end while loop
print "the average LCS value for length ", $mylong, " strings is: ";
print $mytotal/ $numstrings;
This isn't a scoping issue. You have declared none of your variables, which has the effect of implicitly making them all global and accessible everywhere in your code
I reformatted your Perl program so that I could read it, and then added this to the top of your program
use strict;
use warnings 'all';
which are essential in every Perl program you write
Then I added
no strict 'vars';
which is a very bad idea, and lets you get away without declaring any variables
The result is this
Argument "ii" isn't numeric in array element at E:\Perl\source\dna.pl line 60.
Argument "ii" isn't numeric in array element at E:\Perl\source\dna.pl line 61.
Argument "ii" isn't numeric in array element at E:\Perl\source\dna.pl line 67.
Argument "j" isn't numeric in array element at E:\Perl\source\dna.pl line 42.
Bareword "ii" not allowed while "strict subs" in use at E:\Perl\source\dna.pl line 60.
Bareword "ii" not allowed while "strict subs" in use at E:\Perl\source\dna.pl line 61.
Bareword "ii" not allowed while "strict subs" in use at E:\Perl\source\dna.pl line 67.
Bareword "j" not allowed while "strict subs" in use at E:\Perl\source\dna.pl line 42.
Execution of E:\Perl\source\dna.pl aborted due to compilation errors.
Line 42 (of my reformatted version) is
$largearray[j] = $dnastring
and lines 60, 61 and 67 are
$line = $largearray[ii];
print $largearray[ii]; #CANNOT RETRIEVE ARRAY VALUES
and
$line = $largearray[ii];
You are using j and ii as array indexes. Those are Perl subroutine calls, not variables. Adding use strict would have stopped this from compiling unless you had also declared sub ii and sub j
You might get away with it if you just change j and ii to $j and $ii, but you are certain to get into further problems
Please make the same changes to your own code, and declare every variable that you need using my as close as possible to the first place they are used
You should also improve your variable naming. Things like #largearray are pointless: the # says that it's an array, and whether it's large or not is relative, and of little use in understanding your code. If you have no better description of its purpose then #table or #data are probably a little better
Likewise, please avoid capital letters and most single-letter names. #V, $Cm and $Im are meaningless, and you would need fewer comments if those names were better
You certainly wouldn't need comments like # end while loop and # outer for loop if you had indented your blocks properly and kept them short enough so that both the beginning and the end can be seen on the screen at the same time, and the fewer comments you can get away with the better, because they badly clutter the code structure
Finally, it's worth noting that the C-style for loop is rarely the best choice in Perl. Your
for ( $i = 1; $i <= $n; $i++ ) { ... }
is much clearer as
for my $i ( 1 .. $n ) { ... }
and declaring the control variable at that point makes it unnecessary to invent new names like $ii for each new loop
I think you have a typo in your code:
ii => must be $ii
don't forget to put this at the beginning of your code:
use strict;
use warnings;
in order to avoid this (and others) kind of errors

Weird subroutine interaction in perl while drawing pyramids out of stars

I wasn't completely sure what to title this, but here is the issue. My goal is to create a subroutine called drawPyramids that will draw a pyramid out of *, with the number of rows depending on the command line parameter. With each row, the number of * increments by 2, so it goes 1-3-5-7 and so on. Here's what I have so far:
sub drawRow {
my $space = $_[0];
my $star = $_[1];
for ($i=0;$i<$space;$i++) {
print " ";
}
for ($i=0;$i<$star;$i++) {
print "*";
}
}
sub drawPyramid {
my $rows = $_[0];
my $x = 1;
for ($i=1;$i<=$rows;$i++) {
drawRow($rows-$i,$x);
print "\n";
$x+=2;
}
}
if(#ARGV == 0) { #check if user entered parameter by checking size of array
die "ERROR: Please supply command-line parameter\n";
}
foreach $a(#ARGV) { #check if number is negative
if ($a < 0) {
print "ERROR: Number must be non-negative\n";
}
}
$sp = #ARGV[0];
$st = #ARGV[1];
drawPyramid($sp);
Lets say I run it as perl pyramid.pl 5 in my CMD. The expected result is:
*
***
*****
*******
*********
with 4 spaces before the first star on the first row, 3 spaces...and so on. However, this is what I get:
*
***
*****
The third row with 5 stars should have 2 spaces before the stars start, and the program doesn't even print the last two lines (which would have 7 and 9 stars).
What is wrong with the program? Any help is appreciated.
The problem is that both subroutines are reading and modifying the same global variable $i (also known as $::i or $main::i), so they're interfering with each other.
To fix this, you should instead use local variables, declared with my; that is, change this:
for ($i = 1; $i <= ...; $i++) {
to this:
for (my $i = 1; $i <= ...; $i++) {
throughout.

Perl: subroutine throwing error with $i

Good morning,
I'm having trouble using a subroutine - if I put certain code into the subroutine, it throws an error of "use of uninitialised value $i in array element".
I have a very long script, so I will only post the bit I believe is relevant.
The subroutine I am calling is commented out underneath &exon_positive_strand (saves you scrolling down). When I remove the subroutine and uncomment the code, I get no errors. I can only imagine it is something to do with $i but I don't know what...
Any advice would be greatly appreciated.
Many thanks,
Ellie
my ($value, $col, $col2, $l_o_b, $left, $matchedID, $diff_three_prime, $diff_five_prime, $sequence, #three_prime_ss, #five_prime_ss, #reverse_five, #reverse_three);
my $i = 0;
open (EXONS_five, '>fasta_exons_five_non');
open (EXONS_three, '>fasta_exons_three_non');
foreach my $match(#exonic_matches) { ## works out exon from boundary relative to correct strand direction ##
if ($exon_ID[$i] !~ m/unknown/ && $dupmatches[$i] == 0)
{
$sequence = '';
$value = $exon_ID[$i];
$col = $exon_left{$value};
$col2 = $exon_right{$value};
#three_prime_ss = split(",", $col); ##splits left column into subcolumns
#five_prime_ss = split(",", $col2); ## splits right columnn into subcolumns
#reverse_three = reverse(#three_prime_ss);
#reverse_five = reverse(#five_prime_ss);
shift(#reverse_five);
if ($strands{$value} =~ m/\+/) {
&exon_positive_strand;
# $diff_three_prime = $LBP[$i] - $three_prime_ss[$exons2{$value} - 1]; ## minus numbers denote a difference to the left (i.e. upsteam)
# $diff_five_prime = $LBP[$i] - $five_prime_ss[$exons2{$value} - 1]; ## minus numbers denote a difference to the left (i.e. upsteam)
# $matchedID = $ID{$LBP[$i]};
# if ($diff_three_prime !~ m/\-/ && $diff_three_prime <= 3) {
# $BP{$LBP[$i]} =~ s/\[[ACTG]\]/$ref[$i]/i; ## putting variant into 50BP seq
# $l_o_b = 20;
# ##$right_of_boundary = 3;
# $l_o_b = $l_o_b + $diff_three_prime;
# $left = 51 - $l_o_b;
# $sequence = substr($BP{$LBP[$i]}, $left, 23);
# }
# elsif ($diff_five_prime =~ m/\-/ && $diff_five_prime >= -3) {
# $BP{$LBP[$i]} =~ s/\[[ACTG]\]/$ref[$i]/i; ## putting variant into 50BP seq
# $l_o_b = 3;
# ##$right_of_boundary = 6;
# $l_o_b = $l_o_b + $diff_five_prime;
# $left = 51 - $l_o_b;
# $sequence = substr( $BP{$LBP[$i]}, $left, 9);
}
}
my $seq_length = length($sequence);
if ($seq_length == 9) {
print EXONS_five (">" . "$match_exon{$col_exon_no[$i]}" . "\n", lc($sequence),"\n");
}
elsif ($seq_length == 23) {
print EXONS_three (">" . "$match_exon{$col_exon_no[$i]}" . "\n", lc($sequence),"\n");
}
$i++;
}
close (EXONS_five);
close (EXONS_three);
"Use of uninitialized value in array element" is not an error, it's a warning. Diagnostics can tell you what it means:
(W uninitialized) An undefined value was used as if it were already
defined. It was interpreted as a "" or a 0, but maybe it was a mistake.
To suppress this warning assign a defined value to your variables.
To help you figure out what was undefined, perl will try to tell you the
name of the variable (if any) that was undefined. In some cases it cannot
do this, so it also tells you what operation you used the undefined value
in. Note, however, that perl optimizes your program and the operation
displayed in the warning may not necessarily appear literally in your
program. For example, "that $foo" is usually optimized into "that "
. $foo, and the warning will refer to the concatenation (.) operator,
even though there is no . in your program.
You need to pass the $i variable to the subroutine:
exon_positive_strand($i);
and
sub exon_positive_strand {
my $i = shift;
...

A little help with loops on perl

I am having trouble specifiying the correct algorithm. I am iterating over an input file with loops. The issue that I have is on the last loop.
#!/usr/bin/perl
# Lab #4
# Judd Bittman
# http://www-users.cselabs.umn.edu/classes/Spring-2011/csci3003/index.php?page=labs
# this site has what needs to be in the lab
# lab4 is the lab instructions
# yeast protein is the part that is being read
use warnings;
use strict;
my $file = "<YeastProteins.txt";
open(my $proteins, $file);
my #identifier;
my #qualifier;
my #molecularweight;
my #pi;
while (my $line1 = <$proteins>) {
#print $line1;
chomp($line1);
my #line = split(/\t/, $line1);
push(#identifier, $line[0]);
push(#qualifier, $line[1]);
push(#molecularweight, $line[2]);
push(#pi, $line[3]);
}
my $extreme = 0;
my $ex_index = 0;
for (my $index = 1; $index < 6805; $index++) {
if ( defined($identifier[$index])
&& defined($qualifier[$index])
&& defined($molecularweight[$index])
&& defined($pi[$index])) {
# print"$identifier[$index]\t:\t$qualifier[$index]:\t$molecularweight[$index]:\n$pi[$index]";
}
if ( defined($identifier[$index])
&& defined($qualifier[$index])
&& defined($pi[$index])) {
if (abs($pi[$index] - 7) > $extreme && $qualifier[$index] eq "Verified")
{
$extreme = abs($pi[$index] - 7);
$ex_index = $identifier[$index];
print $extreme. " " . $ex_index . "\n";
}
}
}
print $extreme;
print "\n";
print $ex_index;
print "\n";
# the part above does part b of the assignment
# YLR204W,its part of the Mitochondrial inner membrane protein as well as a processor.
my $exindex = 0;
my $high = 0;
# two lines above and below is part c
# there is an error and I know there is something wrong
for (my $index = 1; $index < 6805; $index++) {
if ( defined($qualifier[$index])
&& ($qualifier[$index]) eq "Verified"
&& defined($molecularweight[$index])
&& (abs($molecularweight[$index]) > $high)) {
$high = (abs($molecularweight[$index]) > $high); # something wrong on this line, I know I wrote something wrong
$exindex = $identifier[$index];
}
}
print $high;
print "\n";
print $exindex;
print "\n";
close($proteins);
exit;
On the final loop I want my loop to hold on to the protein that is verified and has the highest molecular mass. This is in the input file. What code can I use to tell the program that I want to hold the highest number and its name? I feel like I am very close but I can't put my finger on it.
First, a note about perl - in general, it's more common to use foreach style loops rather than c-style indexed loops. For example:
for my $protein (#proteins) {
#do something with $p
}
(Your situation might require it, I just thought I'd mention this)
To address your specific issue though:
$high = (abs($molecularweight[$index])>$high);
$high is being set to the result of the boolean test being performed. Remove the >$high part (which is being tested in your if statement) and you'll likely end up with what you intended.
You likely want a more complex data structure, such as a nested hash. It's hard to give a solid example without more knowledge of the data, but, say your first identifier were abc, the second one was def, etc:
my %protein_entries = (
abc => {
qualifier => 'something',
molecular_weight => 1234,
pi => 'something',
},
def => {
qualifier => 'something else',
molecular_weight => 5678,
pi => 'something else',
},
# …
);
Then, rather than having several different arrays and keeping track of which belongs to which, you get at the elements like so:
Then, if you want to get at the highest by molecular weight, you can sort the identifiers by their molecular weight, then slice off the highest one:
my $highest = (sort {
$protein_entries{$a}{molecular_weight}
<=>
$protein_entries{$b}{molecular_weight}
} keys %protein_entries)[1];
You're having problem with your algorithm because you're not structuring your data properly, basically.
In this example, $highest will hold def, then later you can go back and fetch $protein_entries{def}{molecular_weight} or any of the other keys in the anonymous hash referenced by $protein_entries{def}, thus being easily able to recall any relevant associated data.
Just change:
$high = (abs($molecularweight[$index]) > $high);
To this:
$high = abs($molecularweight[$index]) if (abs($molecularweight[$index]) > $high);
At the end of the loop, $high will be the highest value in $molecularweight array.

How can I extract/parse tabular data from a text file in Perl?

I am looking for something like HTML::TableExtract, just not for HTML input, but for plain text input that contains "tables" formatted with indentation and spacing.
Data could look like this:
Here is some header text.
Column One Column Two Column Three
a b
a b c
Some more text
Another Table Another Column
abdbdbdb aaaa
Not aware of any packaged solution, but something not very flexible is fairly simple to do assuming you can do two passes over the file: (the following is partially Perlish pseudocode example)
Assumption: data may contain spaces and is NOT quoted ala CSV if there's a space - if this is not the case, just use Text::CSV(_XS).
Assumption: no tabs used for formatting.
The logic defines a "column separator" to be any consecutive set of vertical rows populated 100% with spaces.
If by accident every row has a space which is part of the data at offset M characters, the logic will consider offset M to be a column separator, since it can't know any better. The ONLY way it can know better is if you require column separation to be at least X spaces where X>1 - see the second code fragment for that.
Sample code:
my $INFER_FROM_N_LINES = 10; # Infer columns from this # of lines
# 0 means from entire file
my $lines_scanned = 0;
my #non_spaces=[];
# First pass - find which character columns in the file have all spaces and which don't
my $fh = open(...) or die;
while (<$fh>) {
last if $INFER_FROM_N_LINES && $lines_scanned++ == $INFER_FROM_N_LINES;
chomp;
my $line = $_;
my #chars = split(//, $line);
for (my $i = 0; $i < #chars; $i++) { # Probably can be done prettier via map?
$non_spaces[$i] = 1 if $chars[$i] ne " ";
}
}
close $fh or die;
# Find columns, defined as consecutive "non-spaces" slices.
my #starts, #ends; # Index at which columns start and end
my $state = " "; # Not inside a column
for (my $i = 0; $i < #non_spaces; $i++) {
next if $state eq " " && !$non_spaces[$i];
next if $state eq "c" && $non_spaces[$i];
if ($state eq " ") { # && $non_spaces[$i] of course => start column
$state = "c";
push #starts, $i;
} else { # meaning $state eq "c" && !$non_spaces[$i] => end column
$state = " ";
push #ends, $i-1;
}
}
if ($state eq "c") { # Last char is NOT a space - produce the last column end
push #ends, $#non_spaces;
}
# Now split lines
my $fh = open(...) or die;
my #rows = ();
while (<$fh>) {
my #columns = ();
push #rows, \#columns;
chomp;
my $line = $_;
for (my $col_num = 0; $col_num < #starts; $col_num++) {
$columns[$col_num] = substr($_, $starts[$col_num], $ends[$col_num]-$starts[$col_num]+1);
}
}
close $fh or die;
Now, if you require column separation to be at least X spaces where X>1, it's also doable but the parser of column locations needs to be a bit more complex :
# Find columns, defined as consecutive "non-spaces" slices separated by at least 3 spaces.
my $min_col_separator_is_X_spaces = 3;
my #starts, #ends; # Index at which columns start and end
my $state = "S"; # inside a separator
NEXT_CHAR: for (my $i = 0; $i < #non_spaces; $i++) {
if ($state eq "S") { # done with last column, inside a separator
if ($non_spaces[$i]) { # start a new column
$state = "c";
push #starts, $i;
}
next;
}
if ($state eq "c") { # Processing a column
if (!$non_spaces[$i]) { # First space after non-space
# Could be beginning of separator? check next X chars!
for (my $j = $i+1; $j < #non_spaces
|| $j < $i+$min_col_separator_is_X_spaces; $j++) {
if ($non_spaces[$j]) {
$i = $j++; # No need to re-scan again
next NEXT_CHAR; # OUTER loop
}
# If we reach here, next X chars are spaces! Column ended!
push #ends, $i-1;
$state = "S";
$i = $i + $min_col_separator_is_X_spaces;
}
}
next;
}
}
Here's a very quick solution, commented with an overview. (My apologies for the length.) Basically, if a "word" appears after the start of column header n, then it ends up in column n, unless most of its body trails into column n + 1, in which case it ends up there instead. Tidying this up, extending it to support multiple different tables, etc. are left as an exercise. You could also use something other than the left offset of the column header as the boundary mark, such as the centre, or some value determined by the column number.
#!/usr/bin/perl
use warnings;
use strict;
# Just plug your headers in here...
my #headers = ('Column One', 'Column Two', 'Column Three');
# ...and get your results as an array of arrays of strings.
my #result = ();
my $all_headers = '(' . (join ').*(', #headers) . ')';
my $found = 0;
my #header_positions;
my $line = '';
my $row = 0;
push #result, [] for (1 .. #headers);
# Get lines from file until a line matching the headers is found.
while (defined($line = <DATA>)) {
# Get the positions of each header within that line.
if ($line =~ /$all_headers/) {
#header_positions = #-[1 .. #headers];
$found = 1;
last;
}
}
$found or die "Table not found! :<\n";
# For each subsequent nonblank line:
while (defined($line = <DATA>)) {
last if $line =~ /^$/;
push #{$_}, "" for (#result);
++$row;
# For each word in line:
while ($line =~ /(\S+)/g) {
my $word = $1;
my $position = $-[1];
my $length = $+[1] - $position;
my $column = -1;
# Get column in which word starts.
while ($column < $#headers &&
$position >= $header_positions[$column + 1]) {
++$column;
}
# If word is not fully within that column,
# and more of it is in the next one, put it in the next one.
if (!($column == $#headers ||
$position + $length < $header_positions[$column + 1]) &&
$header_positions[$column + 1] - $position <
$position + $length - $header_positions[$column + 1]) {
my $element = \$result[$column + 1]->[$row];
$$element .= " $word";
# Otherwise, put it in the one it started in.
} else {
my $element = \$result[$column]->[$row];
$$element .= " $word";
}
}
}
# Output! Eight-column tabs work best for this demonstration. :P
foreach my $i (0 .. $#headers) {
print $headers[$i] . ": ";
foreach my $c (#{$result[$i]}) {
print "$c\t";
}
print "\n";
}
__DATA__
This line ought to be ignored.
Column One Column Two Column Three
These lines are part of the tabular data to be processed.
The data are split based on how much words overlap columns.
This line ought to be ignored also.
Sample output:
Column One: These lines are The data are split
Column Two: part of the tabular based on how
Column Three: data to be processed. much words overlap columns.