I have here a working code, It works fine with 8 or 10 emails, but if you just put 20 emails it never finishes computing. That is, it is not an infinite loop because otherwise it would never compute anything. Also, if you use just 10 emails but ask it to make lists of more than 2, same thing happens. Yes, as pointed out, there is a while(#address) and somewhere in there, a push into address, that is the reason. I tried to replace that array into which it was pushed by another name, but i get weird errors like it picks one email from the list and it will complain that while strict references are on, i cant use that ...
I understand 100% the code up until the 'map' line. After that, not so much...
If we look at this part:
push #addresses, $address;
$moved{$address}++;
# say "pushing $address to moved"; # debug
one would say that the variable $address would have to be pushed, not into the #addresses, as that is the source of data (hence the loop as pointed out) but onto ..'moved' but, sorry, 'moved' is a hash. You can't push a variable into a hash, can you ? should then 'moved' be actually an array and not a hash? this is where i get lost
I was thinking about this instead, but ...it is just intuition, not real knowledge
push #{ $moved[$i] }, $address
I think I have solved it, taking as departure point the remark from 'Konerak'. Indeed the issue was a never decreasing list. Because I am not knowledgeable in reference arrays I was kind of lost, but somehow reading the code I tried to find similarity in a expected behaviour.
Therefore I created another array called #reserva and I wrote this:
push # {$reserva [$i]}, $address
instead of
push #addresses, $address;
Now, I get lists the size I want regardless of how many emails I enter. I tried with 1000 and had no problem in less than one second.
So, here is the full code
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my $only_index = 3; # Read from command line with $ARGV[0] or use Getopt::Long
my %blacklist = ( # Each key in this hash represents one index/day
'2' => [ 'a', 'b' ], # and has an arrayref of domains that have replied on
'3' => [ 'c' ], # that day. We look at all keys smaller than the current
); # index in each iteration and ignore all these domains
my #domains; # holds the domains we have already seen for each list
my #lists = ([]); # Holds all the lists
my %moved; # the addresses we moved to the back
my $i = 0;
my #addresses = <DATA>;
while (#addresses) {
my $address = shift #addresses;
chomp $address;
$address =~ m/#([a-zA-Z0-9\-.]*)\b/;
my $domain = $1;
# If the domain has answered, do not do it again
next if
grep { /$domain/ }
map { exists $blacklist{$_} ? #{ $blacklist{$_} } : () } (0..$i);
$i++ if (#{ $lists[$i] } == 2
|| (exists $moved{$address} && #addresses < 1));
if (exists $domains[$i]->{$domain}) {
push #addresses, $address;
$moved{$address}++;
# say "pushing $address to moved"; # debug
} else {
$domains[$i]->{$domain}++;
# send the email
# say "added $address to $i"; # debug
push #{ $lists[$i] }, $address;
}
}
# print Dumper \#lists; # Show all lists
print Dumper $lists[$only_index]; # Only show the selected list
1;
__DATA__
1#a
2#a
3#a
1#b
2#b
1#c
2#c
3#c
1#d
2#d
3#d
4#d
1#e
1#f
1#g
1#h
4#a
5#a
4#c
That's some twisty code, it's no wonder you're having trouble following it. I'm not actually sure what the body of the code is supposed to accomplish, but you can at least avoid the infinite loop by not using while (#array) - use foreach my $item (#array) instead and you'll iterate over it and avoid and bizarre behavior that will arise from modifying the array inside the loop.
chomp(#addresses); # chomp called on an array chomps each element
foreach my $address (#addresses) {
# Do work here
}
Related
I have a file like this.
>;1;
AACTCTGGGACAATGGCACACGGGAAACAGATAATGAACGATCAGCACAGGGAACTAGCG
>;2;
AACTCTGGGACAATGGCACACGGGAAACAGATAATGAACGATCAGCACAGGGAACTAGCG
>;3;
AACTCTGGGACAATGGCACACGGGAAACAGATAATGAACGATCAGCACAGGGAACTAGCG
I would like to change each number to a corresponding string.
I wrote the following Perl program but I don't know what is wrong with it.
%lista2 = (
1 => "CAT00.3",
2 => "CAT43.1",
3 => "CAT40.3"
);
open(OA, ">file2.txt");
foreach $key ( keys %lista2 ) {
open(SAL, "file.txt");
while ( <SAL> ) {
chomp;
if( />/ ) {
#w = split("\t");
$r = 0;
s/\;//g;
if ( /%lista2[i]/ ) {
print OA "$_ $lista2{$key}\n" ;
$r = 1;
}
}
}
}
close(SAL);
close(OA);
I want to get this
>CAT00.3
AACTCTGGGACAATGGCACACGGGAAACAGATAATGAACGATCAGCACAGGGAACTAGCG
>CAT43.1
AACTCTGGGACAATGGCACACGGGAAACAGATAATGAACGATCAGCACAGGGAACTAGCG
>CAT40.3
AACTCTGGGACAATGGCACACGGGAAACAGATAATGAACGATCAGCACAGGGAACTAGCG
But I don't know how do that.
Well you were in the right direction, I guess. But somewhere along the path you we're lost and it seems like randomly tried to run in any direction. There are a lot of things wrong in your code.
For instance it's funny how you can have those two lines
if ( /%lista2[i]/ ) {
print OA "$_ $lista2{$key}\n" ;
having one correct attempt accessing a has value ($lista2{$key}) and a totally wrong one (%lista2[i]) so close together.
Then, since you're only printing to OA if ("/$lista2{$key}/"), you'd completely eradicate all other lines in the output. Your example indicates, you don't want that.
Furthermore change the loop nesting. Instead of opening the file over and over again, open it once, iterate over the lines and in each such iteration iterate over the hash keys. Your way wasn't strictly wrong but opening and closing files doesn't come cheap, you know. And speaking of closing files: You didn't close SAL in the body of your outer loop, but that's where you reopen it.
And use at least some very basic error handling. Check if open has failed. A wrong file name and the program fails without any indication why. Make your life easier.
Why use chomp() if you later append an \n to the output anyway and make a line of it again? Skip that.
I don't know how to interpret these lines:
#w = split("\t");
$r = 0;
s/\;//g;
Is that some leftovers? They don't do anything useful.
Last but not least it's recommended to use strict; and possibly use warnings; to get pointers on problematic spots.
Here's one that passes your example.
#!/usr/bin/perl
use strict;
use warnings;
my %lista2 =
(
1 => "CAT00.3",
2 => "CAT43.1",
3 => "CAT40.3"
);
if (!open(OA, ">file2.txt")) {
die($!);
}
if (!open(SAL, "file.txt")) {
die($!);
}
foreach my $line (<SAL>) {
foreach my $key (keys(%lista2)) {
if ($line =~ s/^>;$key;$/>$lista2{$key}/) {
last;
}
}
print(OA $line);
}
close(SAL);
close(OA);
In fact, in the core it can be simplified to a pattern replacement. No splitting or anything is needed. But patterns might be confusing if you're a beginner.
I also raised the level of verbosity to make things clearer.
I'm trying to work out a bigger problem, but have simplified the issue for readability, ultimately the logic below is the reason the extended program is failing.
I am using Perl to search for a short sequence of letters within a larger sequence (protein sequences), and if it is not found, then I'd like to do some calculations. I don't know whether I'm going crazy, but I can't work out why this logic is failing.
sub calculateEpitopeMutations {
my #mutationArray;
my #epitopeArray;
my $count;
my $localEpitope;
open( EPITOPESIN2, $ARGV[5] ) or die "Unable to open file $ARGV[5]\n";
while ( my $line = <EPITOPESIN2> ) {
chomp $line;
push #epitopeArray, $line;
}
while ( my ( $key, $value ) = each our %sequencesForCalculation ) {
foreach ( #epitopeArray ) {
$localEpitope = $_;
if ( $value =~ /($localEpitope)/g ) {
print "$key\n$localEpitope\nexactly the same\n\n";
next;
}
else {
#This is where I'd like to do the further calculations
print "$key\n$localEpitope\nthere is a difference\n\n";
next;
}
}
}
}
$ARGV[5] is the name of a text file containing a list of 9-character sequences, exactly like the following
RVSENIQRF
SFQVDCFLW
The idea is to put these into array #epitopeArray and iterate through these, and compare them with all (currently just one) $value sequences in the hash %sequencesForCalculation.
%sequencesForCalculation is a hash, where $value is a long sequence of characters, like this
MDSNTMSSFQVDCFLWHIRKRFADNGLGDAPFLDRLRRDQKSLKGRGNTLGLDIETATLVGKQIVEWILKEESSETLRMTIASVPTSRYLSDMTLEEMSRDWFMLMPRQKKIGPLCVRLDQAVMEKNIVLKANFSVIFNRLETLILLRAFTEEEAIVGEISPLPSLPGHTYEDVKNAVGVLIGGLEWNGNTVRVSENIQRFAWRNCDENGRPSLPPEQK
Currently, the small 9-character long sequence $localEpitope is contained in the longer sequence $value so when I iterate through the program, I should get this printed every time.
($key contains a header of information about the protein sequences, but is irrelevant so I have shortened it to just the variable name.)
$key
RVSENIQRF
Exactly the same
$key
SFQVDCFLW
Exactly the same
$key
But instead I'm getting this
$key
RVSENIQRF
exactly the same
$key
SFQVDCFLW
there is a difference
$key
Any ideas? Please let me know if anything further is required.
Update
TL;DR: You should change $value =~ /($localEpitope)/g to $value =~ /$localEpitope/
Okay now that we know the real circumstances, the problem (as melpomene points out in his comment) is that you have the /g modifier on your pattern match. There's no reason for that; you don't want check how many times the substring appears, you just want to know whether it's there at all
The problem is that variables subjected to a /g pattern search keep a state that says where the last search ended. So you're searching for $epitopeArray[0] in the longer string and finding it, and then searching for $epitopeArray[1] from where the previous search terminated. The first substring appears after the second one, so only the first is found
For more information on this behaviour, take a look at the pos function which returns the current value of this state. For instance pos($value) will return the character offset where the next m//g will start its search
This short program demonstrates the problem. With the /g modifier only BBB is found. Remove it and both are found
use strict;
use warnings;
use 5.010;
my $long_s = 'xxxAAAxxxBBBxxx';
for my $substr ( qw/ BBB AAA / ) {
if ( $long_s =~ /$substr/g ) {
say "$substr okay";
}
else {
say "$substr nope";
}
}
output
BBB okay
AAA nope
Original
You say
Currently, the small 9-character long sequence ($localEpitope) IS contained in the longer sequence ($value), and so when I iterate through the program, I should get the following printed everytime
So $localEpitope is a substring of $value and you're saying that
$value =~ /($localEpitope)/g
evaluates to true
That is correct behaviour. $value =~ /$localEpitope/ will check whether $localEpitope can be found anywhere in $value
Unfortunately it's not clear enough from what you've written to suggest a solution
I have searched a fair bit and hope I'm not duplicating something someone has already asked. I have what amounts to a CSV that is specifically formatted (as required by a vendor). There are four values that are being delimited as follows:
"Name","Description","Tag","IPAddresses"
The list is quite long (and there are ~150 unique names--only 2 in the sample below) but it basically looks like this:
"2B_AppName-Environment","desc","tag","192.168.1.1"
"2B_AppName-Environment","desc","tag","192.168.22.155"
"2B_AppName-Environment","desc","tag","10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4"
"6G_ServerName-AltEnv","desc","tag","192.192.192.40"
"6G_ServerName-AltEnv","desc","tag","192.168.50.5"
I am hoping for a way in Perl (or sed/awk, etc.) to come up with the following:
"2B_AppName-Environment","desc","tag","192.168.1.1,192.168.22.155,10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4,192.192.192.40,192.168.50.5"
So basically, the resulting file will APPEND the duplicates to the first match -- there should only be one line per each app/server name with a list of comma-separated IP addresses just like what is shown above.
Note that the "Decription" and "Tag" fields don't need to be considered in the duplication removal/append logic -- let's assume these are blank for the example to make things easier. Also, in the vendor-supplied list, the "Name" entries are all already sorted to be together.
This short Perl program should suit you. It expects the path to the input CSV file as a parameter on the command line and prints the result to STDOUT. It keeps track of the appearance of new name fields in the #names array so that it can print the output in the order that each name first appears, and it takes the values for desc and tag from the first occurrence of each unique name.
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({always_quote => 1, eol => "\n"});
my (#names, %data);
while (my $row = $csv->getline(*ARGV)) {
my $name = $row->[0];
if ($data{$name}) {
$data{$name}[3] .= ','.$row->[3];
}
else {
push #names, $name;
$data{$name} = $row;
}
}
for my $name (#names) {
$csv->print(*STDOUT, $data{$name});
}
output
"2B_AppName-Environment","desc","tag","192.168.1.1,192.168.22.155,10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4,192.192.192.40,192.168.50.5"
Update
Here's a version that ignores any record that doesn't have a valid IPv4 address in the fourth field. I've used Regexp::Common as it's the simplest way to get complex regex patterns right. It may need installing on your system.
use strict;
use warnings;
use Text::CSV;
use Regexp::Common;
my $csv = Text::CSV->new({always_quote => 1, eol => "\n"});
my (#names, %data);
while (my $row = $csv->getline(*ARGV)) {
my ($name, $address) = #{$row}[0,3];
next unless $address =~ $RE{net}{IPv4};
if ($data{$name}) {
$data{$name}[3] .= ','.$address;
}
else {
push #names, $name;
$data{$name} = $row;
}
}
for my $name (#names) {
$csv->print(*STDOUT, $data{$name});
}
I would advise you to use a CSV parser like Text::CSV for this type of problem.
Borodin has already pasted a good example of how to do this.
One of the approaches that I'd advise you NOT to use are regular expressions.
The following one-liner demonstrates how one could do this, but this is a very fragile approach compared to an actual csv parser:
perl -0777 -ne '
while (m{^((.*)"[^"\n]*"\n(?:(?=\2).*\n)*)}mg) {
$s = $1;
$s =~ s/"\n.*"([^"\n]+)(?=")/,$1/g;
print $s
}' test.csv
Outputs:
"2B_AppName-Environment","desc","tag","192.168.1.1,192.168.22.155,10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4,192.192.192.40,192.168.50.5"
Explanation:
Switches:
-0777: Slurp the entire file
-n: Creates a while(<>){...} loop for each “line” in your input file.
-e: Tells perl to execute the code on command line.
Code:
while (m{^((.*)"[^"]*"\n(?:(?=\2).*\n)*)}mg): Separate text into matching sections.
$s =~ s/"\n.*"([^"\n]+)(?=")/,$1/g;: Join all ip addresses by a comma in matching sections.
print $s: Print the results.
I want to see if I have repeated items in my array, there are over 16.000 so will automate it
There may be other ways but I started with this and, well, would like to finish it unless there is a straightforward command. What I am doing is shifting and pushing from one array into another and this way, check the destination array to see if it is "in array" (like there is such a command in PHP).
So, I got this sub routine and it works with literals, but it doesn't with variables. It is because of the 'eq' or whatever I should need. The 'sourcefile' will contain one or more of the words of the destination array.
// Here I just fetch my file
$listamails = <STDIN>;
# Remove the newlines filename
chomp $listamails;
# open the file, or exit
unless ( open(MAILS, $listamails) ) {
print "Cannot open file \"$listamails\"\n\n";
exit;
}
# Read the list of mails from the file, and store it
# into the array variable #sourcefile
#sourcefile = <MAILS>;
# Close the handle - we've read all the data into #sourcefile now.
close MAILS;
my #destination = ('hi', 'bye');
sub in_array
{
my ($destination,$search_for) = #_;
return grep {$search_for eq $_} #$destination;
}
for($i = 0; $i <=100; $i ++)
{
$elemento = shift #sourcefile;
if(in_array(\#destination, $elemento))
{
print "it is";
}
else
{
print "it aint there";
}
}
Well, if instead of including the $elemento in there I put a 'hi' it does work and also I have printed the value of $elemento which is also 'hi', but when I put the variable, it does not work, and that is because of the 'eq', but I don't know what else to put. If I put == it complains that 'hi' is not a numeric value.
When you want distinct values think hash.
my %seen;
#seen{ #array } = ();
if (keys %seen == #array) {
print "\#array has no duplicate values\n";
}
It's not clear what you want. If your first sentence is the only one that matters ("I want to see if I have repeated items in my array"), then you could use:
my %seen;
if (grep ++$seen{$_} >= 2, #array) {
say "Has duplicates";
}
You said you have a large array, so it might be faster to stop as soon as you find a duplicate.
my %seen;
for (#array) {
if (++$seen{$_} == 2) {
say "Has duplicates";
last;
}
}
By the way, when looking for duplicates in a large number of items, it's much faster to use a strategy based on sorting. After sorting the items, all duplicates will be right next to each other, so to tell if something is a duplicate, all you have to do is compare it with the previous one:
#sorted = sort #sourcefile;
for (my $i = 1; $i < #sorted; ++$i) { # Start at 1 because we'll check the previous one
print "$sorted[$i] is a duplicate!\n" if $sorted[$i] eq $sorted[$i - 1];
}
This will print multiple dupe messages if there are multiple dupes, but you can clean it up.
As eugene y said, hashes are definitely the way to go here. Here's a direct translation of the code you posted to a hash-based method (with a little more Perlishness added along the way):
my #destination = ('hi', 'bye');
my %in_array = map { $_ => 1 } #destination;
for my $i (0 .. 100) {
$elemento = shift #sourcefile;
if(exists $in_array{$elemento})
{
print "it is";
}
else
{
print "it aint there";
}
}
Also, if you mean to check all elements of #sourcefile (as opposed to testing the first 101 elements) against #destination, you should replace the for line with
while (#sourcefile) {
Also also, don't forget to chomp any values read from a file! Lines read from a file have a linebreak at the end of them (the \r\n or \n mentioned in comments on the initial question), which will cause both eq and hash lookups to report that otherwise-matching values are different. This is, most likely, the reason why your code is failing to work correctly in the first place and changing to use sort or hashes won't fix that. First chomp your input to make it work, then use sort or hashes to make it efficient.
I am trying to get a perl loop to work that is working from an array that contains 6 elements. I want the loop to pull out two elements from the array, perform certain functions, and then loop back and pull out the next two elements from the array until the array runs out of elements. Problem is that the loop only pulls out the first two elements and then stops. Some help here would be greatly apperaciated.
my open(infile, 'dnadata.txt');
my #data = < infile>;
chomp #data;
#print #data; #Debug
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
my $aalen = length($aminoacids);
my $i=0;
my $j=0;
my #matrix =();
for(my $i=0; $i<2; $i++){
for( my $j=0; $j<$aalen; $j++){
$matrix[$i][$j] = 0;
}
}
The guidelines for this program states that the program should ignore the presence of gaps in the program. which means that DNA code that is matched up with a gap should be ignored. So the code that is pushed through needs to have alignments linked with gaps removed.
I need to modify the length of the array by two since I am comparing two sequence in this part of the loop.
#$lemseqcomp = $lenarray / 2;
#print $lenseqcomp;
#I need to initialize these saclar values.
$junk1 = " ";
$junk2 = " ";
$seq1 = " ";
$seq2 = " ";
This is the loop that is causeing issues. I belive that the first loop should move back to the array and pull out the next element each time it loops but it doesn't.
for($i=0; $i<$lenarray; $i++){
#This code should remove the the last value of the array once and
#then a second time. The sequences should be the same length at this point.
my $last1 =pop(#data1);
my $last2 =pop(#data1);
for($i=0; $i<length($last1); $i++){
my $letter1 = substr($last1, $i, 1);
my $letter2 = substr($last2, $i, 1);
if(($letter1 eq '-')|| ($letter2 eq '-')){
#I need to put the sequences I am getting rid of somewhere. Here is a good place as any.
$junk1 = $letter1 . $junk1;
$junk2 = $letter1 . $junk2;
}
else{
$seq1 = $letter1 . $seq1;
$seq2 = $letter2 . $seq2;
}
}
}
print "$seq1\n";
print "$seq2\n";
print "#data1\n";
I am actually trying to create a substitution matrix from scratch and return the data. The reason why the code looks weird, is because it isn't actually finished yet and I got stuck.
This is the test sequence if anyone is curious.
YFRFR
YF-FR
FRFRFR
ARFRFR
YFYFR-F
YFRFRYF
First off, if you're going to work with sequence data, use BioPerl. Life will be so much easier. However...
Since you know you'll be comparing the lines from your input file as pairs, it makes sense to read them into a datastructure that reflects that. As elsewhere suggested, an array like #data[[line1, line2],[line3,line4]) ensures that the correct pairs of lines are always together.
What I'm not clear on what you're trying to do is:
a) are you generating a consensus
sequence where the 2 sequences are
difference only by gaps
b) are your 2 sequences significantly
different and you're trying to
exclude the non-aligning parts and
then generate a consensus?
So, does the first pair represent your data, or is it more like the second?
ATCG---AAActctgGGGGG--taGC
ATCGcccAAActctgGGGGGTTtaGC
ATCG---AAActctgGGGGG--taGCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
ATCGcccAAActctgGGGGGTTtaGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
The problem is that you're using $i as the counter variable for both your loops, so the inner loop modifies the counter out from under the outer loop. Try changing the inner loop's counter to $j, or using my to localize them properly.
Don't store your values as an array, store as a two-dimensional array:
my #dataset = ([$val1, $val2], [$val3, $val4]);
or
my #dataset;
push (#dataset, [$val_n1, $val_n2]);
Then:
for my $value (#dataset) {
### Do stuff with $value->[0] and $value->[1]
}
There are lots of strange things in your code: you are initializing a matrix then not using it; reading a whole file into an array; scanning a string C style but then not doing anything with the unmatched values; and finally, just printing the two last processed values (which, in your case, are the two first elements of your array, since you are using pop.)
Here's a guess.
use strict;
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
# Preparing a regular expression. This is kind of useful if processing large
# amounts of data. This will match anything that is not in the string above.
my $regex = qr([^$aminoacids]);
# Our work function.
sub do_something {
my ($a, $b) = #_;
$a =~ s/$regex//g; # removing unwanted characters
$b =~ s/$regex//g; # ditto
# Printing, saving, whatever...
print "Something: $a - $b\n";
return ($a, $b);
}
my $prev;
while (<>) {
chomp;
if ($prev) {
do_something($prev, $_);
$prev = undef;
} else {
$prev = $_;
}
}
print STDERR "Warning: trailing data: $prev\n"
if $prev;
Since you are a total Perl/programming newbie, I am going to show a rewrite of your first code block, then I'll offer you some general advice and links.
Let's look at your first block of sample code. There is a lot of stuff all strung together, and it's hard to follow. I, personally, am too dumb to remember more than a few things at a time, so I chop problems into small pieces that I can understand. This is (was) known as 'chunking'.
One easy way to chunk your program is use write subroutines. Take any particular action or idea that is likely to be repeated or would make the current section of code long and hard to understand, and wrap it up into a nice neat package and get it out of the way.
It also helps if you add space to your code to make it easier to read. Your mind is already struggling to grok the code soup, why make things harder than necessary? Grouping like things, using _ in names, blank lines and indentation all help. There are also conventions that can help, like making constant values (values that cannot or should not change) all capital letters.
use strict; # Using strict will help catch errors.
use warnings; # ditto for warnings.
use diagnostics; # diagnostics will help you understand the error messages
# Put constants at the top of your program.
# It makes them easy to find, and change as needed.
my $AMINO_ACIDS = 'ARNDCQEGHILKMFPSTWYV';
my $AMINO_COUNT = length($AMINO_ACIDS);
my $DATA_FILE = 'dnadata.txt';
# Here I am using subroutines to encapsulate complexity:
my #data = read_data_file( $DATA_FILE );
my #matrix = initialize_matrix( 2, $amino_count, 0 );
# now we are done with the first block of code and can do more stuff
...
# This section down here looks kind of big, but it is mostly comments.
# Remove the didactic comments and suddenly the code is much more compact.
# Here are the actual subs that I abstracted out above.
# It helps to document your subs:
# - what they do
# - what arguments they take
# - what they return
# Read a data file and returns an array of dna strings read from the file.
#
# Arguments
# data_file => path to the data file to read
sub read_data_file {
my $data_file = shift;
# Here I am using a 3 argument open, and a lexical filehandle.
open( my $infile, '<', $data_file )
or die "Unable to open dnadata.txt - $!\n";
# I've left slurping the whole file intact, even though it can be very inefficient.
# Other times it is just what the doctor ordered.
my #data = <$infile>;
chomp #data;
# I return the data array rather than a reference
# to keep things simple since you are just learning.
#
# In my code, I'd pass a reference.
return #data;
}
# Initialize a matrix (or 2-d array) with a specified value.
#
# Arguments
# $i => width of matrix
# $j => height of matrix
# $value => initial value
sub initialize_matrix {
my $i = shift;
my $j = shift;
my $value = shift;
# I use two powerful perlisms here: map and the range operator.
#
# map is a list contsruction function that is very very powerful.
# it calls the code in brackets for each member of the the list it operates against.
# Think of it as a for loop that keeps the result of each iteration,
# and then builds an array out of the results.
#
# The range operator `..` creates a list of intervening values. For example:
# (1..5) is the same as (1, 2, 3, 4, 5)
my #matrix = map {
[ ($value) x $i ]
} 1..$j;
# So here we make a list of numbers from 1 to $j.
# For each member of the list we
# create an anonymous array containing a list of $i copies of $value.
# Then we add the anonymous array to the matrix.
return #matrix;
}
Now that the code rewrite is done, here are some links:
Here's a response I wrote titled "How to write a program". It offers some basic guidelines on how to approach writing software projects from specification. It is aimed at beginners. I hope you find it helpful. If nothing else, the links in it should be handy.
For a beginning programmer, beginning with Perl, there is no better book than Learning Perl.
I also recommend heading over to Perlmonks for Perl help and mentoring. It is an active Perl specific community site with very smart, friendly people who are happy to help you. Kind of like Stack Overflow, but more focused.
Good luck!
Instead of using a C-style for loop, you can read data from an array two elements at a time using splice inside a while loop:
while (my ($letter1, $letter2) = splice(#data, 0, 2))
{
# stuff...
}
I've cleaned up some of your other code below:
use strict;
use warnings;
open(my $infile, '<', 'dnadata.txt');
my #data = <$infile>;
close $infile;
chomp #data;
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
my $aalen = length($aminoacids);
# initialize a 2 x 21 array for holding the amino acid data
my $matrix;
foreach my $i (0 .. 1)
{
foreach my $j (0 .. $aalen-1)
{
$matrix->[$i][$j] = 0;
}
}
# Process all letters in the DNA data
while (my ($letter1, $letter2) = splice(#data, 0, 2))
{
# do something... not sure what?
# you appear to want to look up the letters in a reference table, perhaps $aminoacids?
}