How to use hash in perl to reduce the use of loops? - perl

For example, first, I have identified several key words. Then, in another file, I want to find lines that matches these key words exactly.
Using loop, I would loop through the file for each key word. So if there are 10 key words, I would loop the file 10 times. But if the file is huge, this method would be very time inefficient.
I heard proper use of hash could ease such a situation but never saw an example. So exactly how to use hash to achieve the same goal while using loop only once? Could you provide an example?

You have a solution whose time taken is proportional to the size of the file and to the number of keywords (O(N*K)). You would like a solution where the time taken is not affected by the number of keywords (O(N)).
You could indeed achieve this using a hash.
my #keywords = ...;
my %keywords = map { $_ => 1 } #keywords;
while (<$fh>) {
my #words = split ' '; # To be improved.
if (grep { $keywords{$_} } #words) { # Replace with foreach loop to exit ASAP.
...
}
}
I'd use a regex match.
my #keywords = ...;
my $pat = join '|', map quotemeta, #keywords;
my $re = qr/$pat/;
while (<$fh>) {
if (/(?:^|\s)$re(?:\s|\z)/) { # To be improved.
...
}
}

Related

Why is my Tie::IxHash program taking a long time?

Basically, I have a script to create a hash for COGs with corresponding gene IDs:
# Open directory and get all the files in it
opendir(DIR, "/my/path/to/COG/");
my #infiles = grep(/OG-.*\.fasta/, readdir(DIR));
closedir(DIR);
# Create hash for COGs and their corresponding gene IDs
tie my %ids_for, 'Tie::IxHash';
if (! -e '/my/path/to/COG/COG_hash.ref') {
for my $infile (#infiles) {
## $infile
%ids_for = (%ids_for, read_COG_fasta($infile));
}
## %ids_for
store \%ids_for, '/my/path/to/COG/COG_hash.ref';
}
my $id_ref = retrieve('/my/path/to/COG/COG_hash.ref');
%ids_for = %$id_ref;
## %ids_for
The problem isn't that it doesn't work (at least I think), but that it is extremely slow for some reason. When I tried to test run it, it would take weeks for me to have an actual result. Somehow the hash creation is really really slow and I'm sure there is some way to optimize it better for it to work way faster.
Ideally, the paths should be the input of the script that way there would be no need to constantly change the script in case the path changes.
It would also be great if there could be a way to see the progress of the hash creation, like maybe have it show that it is 25% done, 50% done, 75% done and ultimately 100% done. Regarding this last point I have seen things like use Term::ProgressBar but I am not sure if it would be appropriate in this case.
Do you really need Tie::IxHash?
That aside, I suspect your culprit is this set of lines:
for my $infile (#infiles) {
## $infile
%ids_for = (%ids_for, read_COG_fasta($infile));
}
To add a key to the hash, you are creating a list of the current key-value pairs, adding the new pair, then assigning it all back to the hash.
What happens if you take the results from read_COG_fasta and add the keys one at a time?
for my $infile (#infiles) {
my %new_hash = read_COG_fasta($infile);
foreach my $key ( keys %new_hash ) {
$ids_for{$key} = $new_hash{$key};
}
}
As for progress, I usually have something like this when I'm trying to figure out something:
use v5.26;
my $file_count = #files;
foreach my $n ( 0 .. $#files ) {
say "[$n/$file_count] Processing $file[$n]";
my %result = ...;
printf "\tGot %d results", scalar %hash; # v5.26 feature!
}
You could do the same sort of thing with the keys that you get back so you can track the size.

Perl - Data comparison taking huge time

open(INFILE1,"INPUT.txt");
my $modfile = 'Data.txt';
open MODIFIED,'>',$modfile or die "Could not open $modfile : $!";
for (;;) {
my $line1 = <INFILE1>;
last if not defined $line1;
my $line2 = <INFILE1>;
last if not defined $line2;
my ($tablename1, $colname1,$sql1) = split(/\t/, $line1);
my ($tablename2, $colname2,$sql2) = split(/\t/, $line2);
if ($tablename1 eq $tablename2)
{
my $sth1 = $dbh->prepare($sql1);
$sth1->execute;
my $hash_ref1 = $sth1->fetchall_hashref('KEY');
my $sth2 = $dbh->prepare($sql2);
$sth2->execute;
my $hash_ref2 = $sth2->fetchall_hashref('KEY');
my #fieldname = split(/,/, $colname1);
my $colcnt=0;
my $rowcnt=0;
foreach $key1 ( keys(%{$hash_ref1}) )
{
foreach (#fieldname)
{
$colname =$_;
my $strvalue1='';
#val1 = $hash_ref1->{$key1}->{$colname};
if (defined #val1)
{
my #filtered = grep /#val1/, #metadata;
my $strvalue1 = substr(#filtered[0],index(#filtered[0],'||') + 2);
}
my $strvalue2='';
#val2 = $hash_ref2->{$key1}->{$colname};
if (defined #val2)
{
my #filtered = grep /#val2/, #metadata2;
my $strvalue2 = substr(#filtered[0],index(#filtered[0],'||') + 2);
}
if ($strvalue1 ne $strvalue2 )
{
$colcnt = $colcnt + 1;
print MODIFIED "$tablename1\t$colname\t$strvalue1\t$strvalue2\n";
}
}
}
if ($colcnt>0)
{
print "modified count is $colcnt\n";
}
%$hash_ref1 = ();
%$hash_ref2 = ();
}
The program is Read input file in which every line contrain three strings seperated by tab. First is TableName, Second is ALL Column Name with commas in between and third contain the sql to be run. As this utlity is doing comparison of data, so there are two rows for every tablename. One for each DB. So data needs to be picked from each respective db's and then compared column by column.
SQL returns as ID in the result set and if the value is coming from db then it needs be translated to a string by reading from a array (that array contains 100K records with Key and value seperated by ||)
Now I ran this for one set of tables which contains 18K records in each db. There are 8 columns picked from db in each sql. So for every record out of 18K, and then for every field in that record i.e. 8, this script is taking a lot of time.
My question is if someone can look and see if it can be imporoved so that it takes less time.
File contents sample
INPUT.TXT
TABLENAME COL1,COL2 select COL1,COL2 from TABLENAME where ......
TABLENAMEB COL1,COL2 select COL1,COL2 from TABLENAMEB where ......
Metadata array contains something like this(there are two i.e. for each db)
111||Code 1
222||Code 2
Please suggest
Your code does look a bit unusual, and could gain clarity from using subroutines vs. just using loops and conditionals. Here are a few other suggestions.
The excerpt
for (;;) {
my $line1 = <INFILE1>;
last if not defined $line1;
my $line2 = <INFILE1>;
last if not defined $line2;
...;
}
is overly complicated: Not everyone knows the C-ish for(;;) idiom. You have lots of code duplication. And aren't you actually saying loop while I can read two lines?
while (defined(my $line1 = <INFILE1>) and defined(my $line2 = <INFILE1>)) {
...;
}
Yes, that line is longer, but I think it's a bit more self-documenting.
Instead of doing
if ($tablename1 eq $tablename2) { the rest of the loop }
you could say
next if $tablename1 eq $tablename2;
the rest of the loop;
and save a level of intendation. And better intendation equals better readability makes it easier to write good code. And better code might perform better.
What are you doing at foreach $key1 (keys ...) — something tells me you didn't use strict! (Just a hint: lexical variables with my can perform slightly better than global variables)
Also, doing $colname = $_ inside a for-loop is a dumb thing, for the same reason.
for my $key1 (keys ...) {
...;
for my $colname (#fieldname) { ... }
}
my $strvalue1='';
#val1 = $hash_ref1->{$key1}->{$colname};
if (defined #val1)
{
my #filtered = grep /#val1/, #metadata;
my $strvalue1 = substr(#filtered[0],index(#filtered[0],'||') + 2);
}
I don't think this does what you think it does.
From the $hash_ref1 you retrive a single element, then assign that element to an array (a collection of multiple values).
Then you called defined on this array. An array cannot be undefined, and what you are doing is quite deprecated. Calling defined function on a collection returns info about the memory management, but does not indicate ① whether the array is empty or ② whether the first element in that array is defined.
Interpolating an array into a regex isn't likely to be useful: The elements of the array are joined with the value of $", usually a whitespace, and the resulting string treated as a regex. This will wreak havoc if there are metacharacters present.
When you only need the first value of a list, you can force list context, but assign to a single scalar like
my ($filtered) = produce_a_list;
This frees you from weird subscripts you don't need and that only slow you down.
Then you assign to a $strvalue1 variable you just declared. This shadows the outer $strvalue1. They are not the same variable. So after the if branch, you still have the empty string in $strvalue1.
I would write this code like
my $val1 = $hash_ref1->{$key1}{$colname};
my $strvalue1 = defined $val1
? do {
my ($filtered) = grep /\Q$val1/, #metadata;
substr $filtered, 2 + index $filtered, '||'
} : '';
But this would be even cheaper if you pre-split #metadata into pairs and test for equality with the correct field. This would remove some of the bugs that are still lurking in that code.
$x = $x + 1 is commonly written $x++.
Emptying the hashrefs at the end of the iteration is unneccessary: The hashrefs are assigned to a new value at the next iteration of the loop. Also, it is unneccessary to assist Perls garbage collection for such simple tasks.
About the metadata: 100K records is a lot, so either put it in a database itself, or at the very least a hash. Especially for so many records, using a hash is a lot faster than looping through all entries and using slow regexes … aargh!
Create the hash from the file, once at the beginning of the program
my %metadata;
while (<METADATA>) {
chomp;
my ($key, $value) = split /\|\|/;
$metadata{$key} = $value; # assumes each key only has one value
}
Simply look up the key inside the loop
my $strvalue1 = defined $val1 ? $metadata{$val1} // '' : ''
That should be so much faster.
(Oh, and please consider using better names for variables. $strvalue1 doesn't tell me anything, except that it is a stringy value (d'oh). $val1 is even worse.)
This is not really an answer but it won't really fit well in a comment either so, until you provide some more information, here are some observations.
Inside you inner for loop, there is:
#val1 = $hash_ref1->{$key1}->{$colname};
Did you mean #val1 = #{ $hash_ref1->{$key1}->{$colname} };?
Later, you check if (defined #val1)? What did you really want to check? As perldoc -f defined points out:
Use of "defined" on aggregates (hashes and arrays) is
deprecated. It used to report whether memory for that aggregate
had ever been allocated. This behavior may disappear in future
versions of Perl. You should instead use a simple test for size:
In your case, if (defined #val1) will always be true.
Then, you have my #filtered = grep /#val1/, #metadata; Where did #metadata come from? What did you actually intend to check?
Then you have my $strvalue1 = substr(#filtered[0],index(#filtered[0],'||') + 2);
There is some interesting stuff going on in there.
You will need to verbalize what you are actually trying to do.
I strongly suspect there is a single SQL query you can run that will give you what you want but we first need to know what you want.

count number of times string repeated in file perl

I am new to Perl, by the way. I have a Perl script that needs to count the number of times a string appears in the file. The script gets the word from the file itself.
I need it to grab the first word in the file and then search the rest of the file to see if it is repeated anywhere else. If it is repeated I need it to return the amount of times it was repeated. If it was not repeated, it can return 0. I need it to then get the next word in the file and check this again.
I will grab the first word from the file, search the file for repeats of that word, grab the second word from
the file, search the file for repeats of that word, grab the third word from the file, search the file for repeats of that word.
So far I have a while loop that is grabbing each word I need, but I do not know how to get it to search for repeats without resetting the position of my current line. So how do I do this? Any ideas or suggestions are greatly appreciated! Thanks in advance!
while (<theFile>) {
my $line1 = $_;
my $startHere = rindex($line1, ",");
my $theName = substr($line1, $startHere + 1, length($line1) - $startHere);
#print "the name: ".$theName."\n";
}
Use a hashtable;
my %wordcount = ();
while(my $line = <theFile>)
{
chomp($line);
my #words = split(' ', $line);
foreach my $word(#words)
{
$wordCount{$word} += 1;
}
}
# output
foreach my $key(keys %wordCount)
{
print "Word: $key Repeat_Count: " . ($wordCount{$key} - 1) . "\n";
}
The $wordCount{$key} - 1 in the output accounts for the first time a word was seen; Words that only apprear once in the file will have a count of 0
Unless this is actually homework and/or you have to achieve the results in the specific manor you describe, this is going to be FAR more efficient.
Edit: From your comment below:
Each word i am searching for is not "the first word" it is a certain word on the line. Basically i have a csv file and i am skipping to the third value and searching for repeats of it.
I would still use this approach. What you would want to do is:
split on , since this is a CSV file
Pull out the 3rd word in the array on each line and store the words you are interested in in their own hash table
At the end, iterate through the "search word" hash table, and pull out the counts from the wordcount table
So:
my #words = split(',', $line);
$searchTable{#words[2]} = 1;
...
foreach my $key(keys %searchTable)
{
print "Word: $key Repeat_Count: " . ($wordCount{$key} - 1) . "\n";
}
you'll have to adjust according to what rules you have around counting words that repeat in the third column. You could just remove them from #words before the loop that inserts into your wordCount hash.
my $word = <theFile>
chomp($word); #`assuming word is by itself.
my $wordcount = 0;
foreach my $line (<theFile>) {
$line =~ s/$word/$wordcount++/eg;
}
print $wordcount."\n";
Look up the regex flag 'e' for more on what this does. I didn't test the code, but something like it should work. For clarification, the 'e' flag evaluates the second part of the regex (the substitution) as code before replacing, but it's more than that, so with that flag you should be able to make this work.
Now that I understand what you are asking for, the above solution won't work. What you can do, is use sysread to read the entire file into a buffer, and run the same substition after that, but you will have to get the first word off manually, or you can just decrement after the fact. This is because the sysread filehandle and the regular filehandle are handled differently, so try this:
my $word = <theFile>
chomp($word); #`assuming word is by itself.
my $wordcount = 0;
my $srline = '';
#some arbitrary very long length, longer than file
#Looping also possible.
sysread(theFile,$srline,10000000)
$srline =~ s/$word/$wordcount++/eg;
$wordcount--; # I think that the first word will still be in here, causing issues, you should test.
print $wordcount."\n";
Now, given that I read your comment responding to your question, I don't think that your current algorithm is optimal, and you probably want a hash storing up all of the counts for words in a file. This would probably be best done using something like the following:
my %counts = ();
foreach my $line (<theFile>) {
$line =~ s/(\w+)/$counts{$1}++/eg;
}
# now %counts contains key-value pair words for everything in the file.
To find count of all words present in the file you can do something like:
#!/usr/bin/perl
use strict;
use warnings;
my %count_of;
while (my $line = <>) { #read from file or STDIN
foreach my $word (split /\s+/, $line) {
$count_of{$word}++;
}
}
print "All words and their counts: \n";
for my $word (sort keys %count_of) {
print "'$word': $count_of{$word}\n";
}
__END__

How do I remove a a list of character sequences from the beginning of a string in Perl?

I have to read lines from a file and store them into a hash in Perl. Many of these lines have special character sequences at the beginning that I need to remove before storing. These character sequences are
| || ### ## ##||
For example, if it is ||https://ads, I need to get https://ads; if ###http, I need to get http.
I need to exclude these character sequences. I want to do this by having all the character sequences to exclude in a array and then check if the line starts with these character sequences and remove those. What is a good way to do this?
I've gone as far as:
our $ad_file = "C:/test/list.txt";
our %ads_list_hash = ();
my $lines = 0;
# List of lines to ignore
my #strip_characters = qw /| || ### ## ##||/;
# Create a list of substrings in the easylist.txt file
open my $ADS, '<', $ad_file or die "can't open $ad_file";
while(<$ADS>) {
chomp;
$ads_list_hash{$lines} = $_;
$lines ++;
}
close $ADS;
I need to add the logic to remove the #strip_characters from the beginning of each line if any of them are present.
Probably a bit too complex and general for the task, but still..
my $strip = join "|", map {quotemeta} #strip_characters;
# avoid bare [] etc. in the RE
# ... later, in the while()
s/^(?:$strip)+//o;
# /o means "compile $strip into the regex once and for all"
Why don't you do it with a regex? Something like
$line =~ s/^[## |]+//;
should work.
If you want to remove a list of characters (according to your title), then a very simple regular expression will work.
Within the loop, add the following regular expression
while( <$ADS> ) {
chomp;
s/^[## \|]+//;
$ads_list_hash{$lines++} = $_;
}
Note the pipe charachter ('|') is escapted.
However, it appears that you want to remove a list of expressions. You can do the following
while( <$ADS> ) {
chomp;
s/^((\|)|(\|\|)|(###)|(##)|(##\|\|))+//;
$add_list_hash{$lines++} = $_;
}
You said that the list of expression is stored in an array or words. In your sample code, you create this array with 'qw'. If the list of expressions isn't known at compile time, you can build a regular expression in a variable, and use it.
my #strip_expression = ... // get an array of strip expressions
my $re = '^((' . join(')|(',#strip_expression) . '))+';
and then, use the following statement in the loop:
s/$re//;
Finaly, one thing not related to the question can be said about the code: It would be much more appropriate to use Array instead of Hash, to map an integer to a set of strings. Unless you have some other requirement, better have:
our #ads_list; // no need to initialize the array (or the hash) with empty list
...
while( <$ADS> ) {
chomp;
s/.../;
push #ads_list, $_;
}
$ads_list_hash{$lines} = $_;
$lines ++;
Don't do that. If you want an array, use an array:
push #ads_lines, $_;
Shawn's Rule of Programming #7: When creating data structures: if preserving the order is important, use an array; otherwise use a hash.
Because substitutions return whether or not they did anything you can use a
substitution to search the string for your pattern and remove it if it's there.
while( <$ADS> ) {
next unless s/^\s*(?:[#]{2,3}|(?:##)?[|]{1,2})\s*//;
chomp;
$ads_list_hash{$lines} = $_;
$lines ++;
}

Why does my Perl for loop exit early?

I am trying to get a perl loop to work that is working from an array that contains 6 elements. I want the loop to pull out two elements from the array, perform certain functions, and then loop back and pull out the next two elements from the array until the array runs out of elements. Problem is that the loop only pulls out the first two elements and then stops. Some help here would be greatly apperaciated.
my open(infile, 'dnadata.txt');
my #data = < infile>;
chomp #data;
#print #data; #Debug
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
my $aalen = length($aminoacids);
my $i=0;
my $j=0;
my #matrix =();
for(my $i=0; $i<2; $i++){
for( my $j=0; $j<$aalen; $j++){
$matrix[$i][$j] = 0;
}
}
The guidelines for this program states that the program should ignore the presence of gaps in the program. which means that DNA code that is matched up with a gap should be ignored. So the code that is pushed through needs to have alignments linked with gaps removed.
I need to modify the length of the array by two since I am comparing two sequence in this part of the loop.
#$lemseqcomp = $lenarray / 2;
#print $lenseqcomp;
#I need to initialize these saclar values.
$junk1 = " ";
$junk2 = " ";
$seq1 = " ";
$seq2 = " ";
This is the loop that is causeing issues. I belive that the first loop should move back to the array and pull out the next element each time it loops but it doesn't.
for($i=0; $i<$lenarray; $i++){
#This code should remove the the last value of the array once and
#then a second time. The sequences should be the same length at this point.
my $last1 =pop(#data1);
my $last2 =pop(#data1);
for($i=0; $i<length($last1); $i++){
my $letter1 = substr($last1, $i, 1);
my $letter2 = substr($last2, $i, 1);
if(($letter1 eq '-')|| ($letter2 eq '-')){
#I need to put the sequences I am getting rid of somewhere. Here is a good place as any.
$junk1 = $letter1 . $junk1;
$junk2 = $letter1 . $junk2;
}
else{
$seq1 = $letter1 . $seq1;
$seq2 = $letter2 . $seq2;
}
}
}
print "$seq1\n";
print "$seq2\n";
print "#data1\n";
I am actually trying to create a substitution matrix from scratch and return the data. The reason why the code looks weird, is because it isn't actually finished yet and I got stuck.
This is the test sequence if anyone is curious.
YFRFR
YF-FR
FRFRFR
ARFRFR
YFYFR-F
YFRFRYF
First off, if you're going to work with sequence data, use BioPerl. Life will be so much easier. However...
Since you know you'll be comparing the lines from your input file as pairs, it makes sense to read them into a datastructure that reflects that. As elsewhere suggested, an array like #data[[line1, line2],[line3,line4]) ensures that the correct pairs of lines are always together.
What I'm not clear on what you're trying to do is:
a) are you generating a consensus
sequence where the 2 sequences are
difference only by gaps
b) are your 2 sequences significantly
different and you're trying to
exclude the non-aligning parts and
then generate a consensus?
So, does the first pair represent your data, or is it more like the second?
ATCG---AAActctgGGGGG--taGC
ATCGcccAAActctgGGGGGTTtaGC
ATCG---AAActctgGGGGG--taGCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
ATCGcccAAActctgGGGGGTTtaGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
The problem is that you're using $i as the counter variable for both your loops, so the inner loop modifies the counter out from under the outer loop. Try changing the inner loop's counter to $j, or using my to localize them properly.
Don't store your values as an array, store as a two-dimensional array:
my #dataset = ([$val1, $val2], [$val3, $val4]);
or
my #dataset;
push (#dataset, [$val_n1, $val_n2]);
Then:
for my $value (#dataset) {
### Do stuff with $value->[0] and $value->[1]
}
There are lots of strange things in your code: you are initializing a matrix then not using it; reading a whole file into an array; scanning a string C style but then not doing anything with the unmatched values; and finally, just printing the two last processed values (which, in your case, are the two first elements of your array, since you are using pop.)
Here's a guess.
use strict;
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
# Preparing a regular expression. This is kind of useful if processing large
# amounts of data. This will match anything that is not in the string above.
my $regex = qr([^$aminoacids]);
# Our work function.
sub do_something {
my ($a, $b) = #_;
$a =~ s/$regex//g; # removing unwanted characters
$b =~ s/$regex//g; # ditto
# Printing, saving, whatever...
print "Something: $a - $b\n";
return ($a, $b);
}
my $prev;
while (<>) {
chomp;
if ($prev) {
do_something($prev, $_);
$prev = undef;
} else {
$prev = $_;
}
}
print STDERR "Warning: trailing data: $prev\n"
if $prev;
Since you are a total Perl/programming newbie, I am going to show a rewrite of your first code block, then I'll offer you some general advice and links.
Let's look at your first block of sample code. There is a lot of stuff all strung together, and it's hard to follow. I, personally, am too dumb to remember more than a few things at a time, so I chop problems into small pieces that I can understand. This is (was) known as 'chunking'.
One easy way to chunk your program is use write subroutines. Take any particular action or idea that is likely to be repeated or would make the current section of code long and hard to understand, and wrap it up into a nice neat package and get it out of the way.
It also helps if you add space to your code to make it easier to read. Your mind is already struggling to grok the code soup, why make things harder than necessary? Grouping like things, using _ in names, blank lines and indentation all help. There are also conventions that can help, like making constant values (values that cannot or should not change) all capital letters.
use strict; # Using strict will help catch errors.
use warnings; # ditto for warnings.
use diagnostics; # diagnostics will help you understand the error messages
# Put constants at the top of your program.
# It makes them easy to find, and change as needed.
my $AMINO_ACIDS = 'ARNDCQEGHILKMFPSTWYV';
my $AMINO_COUNT = length($AMINO_ACIDS);
my $DATA_FILE = 'dnadata.txt';
# Here I am using subroutines to encapsulate complexity:
my #data = read_data_file( $DATA_FILE );
my #matrix = initialize_matrix( 2, $amino_count, 0 );
# now we are done with the first block of code and can do more stuff
...
# This section down here looks kind of big, but it is mostly comments.
# Remove the didactic comments and suddenly the code is much more compact.
# Here are the actual subs that I abstracted out above.
# It helps to document your subs:
# - what they do
# - what arguments they take
# - what they return
# Read a data file and returns an array of dna strings read from the file.
#
# Arguments
# data_file => path to the data file to read
sub read_data_file {
my $data_file = shift;
# Here I am using a 3 argument open, and a lexical filehandle.
open( my $infile, '<', $data_file )
or die "Unable to open dnadata.txt - $!\n";
# I've left slurping the whole file intact, even though it can be very inefficient.
# Other times it is just what the doctor ordered.
my #data = <$infile>;
chomp #data;
# I return the data array rather than a reference
# to keep things simple since you are just learning.
#
# In my code, I'd pass a reference.
return #data;
}
# Initialize a matrix (or 2-d array) with a specified value.
#
# Arguments
# $i => width of matrix
# $j => height of matrix
# $value => initial value
sub initialize_matrix {
my $i = shift;
my $j = shift;
my $value = shift;
# I use two powerful perlisms here: map and the range operator.
#
# map is a list contsruction function that is very very powerful.
# it calls the code in brackets for each member of the the list it operates against.
# Think of it as a for loop that keeps the result of each iteration,
# and then builds an array out of the results.
#
# The range operator `..` creates a list of intervening values. For example:
# (1..5) is the same as (1, 2, 3, 4, 5)
my #matrix = map {
[ ($value) x $i ]
} 1..$j;
# So here we make a list of numbers from 1 to $j.
# For each member of the list we
# create an anonymous array containing a list of $i copies of $value.
# Then we add the anonymous array to the matrix.
return #matrix;
}
Now that the code rewrite is done, here are some links:
Here's a response I wrote titled "How to write a program". It offers some basic guidelines on how to approach writing software projects from specification. It is aimed at beginners. I hope you find it helpful. If nothing else, the links in it should be handy.
For a beginning programmer, beginning with Perl, there is no better book than Learning Perl.
I also recommend heading over to Perlmonks for Perl help and mentoring. It is an active Perl specific community site with very smart, friendly people who are happy to help you. Kind of like Stack Overflow, but more focused.
Good luck!
Instead of using a C-style for loop, you can read data from an array two elements at a time using splice inside a while loop:
while (my ($letter1, $letter2) = splice(#data, 0, 2))
{
# stuff...
}
I've cleaned up some of your other code below:
use strict;
use warnings;
open(my $infile, '<', 'dnadata.txt');
my #data = <$infile>;
close $infile;
chomp #data;
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
my $aalen = length($aminoacids);
# initialize a 2 x 21 array for holding the amino acid data
my $matrix;
foreach my $i (0 .. 1)
{
foreach my $j (0 .. $aalen-1)
{
$matrix->[$i][$j] = 0;
}
}
# Process all letters in the DNA data
while (my ($letter1, $letter2) = splice(#data, 0, 2))
{
# do something... not sure what?
# you appear to want to look up the letters in a reference table, perhaps $aminoacids?
}