Sorting a sub-section of a file

I'm in need of some Perl wisdom from those more experienced than myself.
So far, my answer to the below is to simply go through the file line by line, insert the relevant elements into an array, sort the array and then append the contents. But that seems a bit long-winded and not very efficient.
I have a file whose contents look something like this :
# A Comment
# Another comment
:127.100.100.255:Something
.789
.123
.456
:127.200.200.100:Something Else
.bravo.example.com # <----
noperiod.example.com # <---- This list is
.an.example.com # <---- not ordered
.some.example.com # <----
Is there a clever way in Perl (ideally a one-liner that could be piped) to sort the second list ? i.e. so you would get the following result :
# A Comment
# Another comment
:127.100.100.255:Something
.789
.123
.456
:127.200.200.100:Something Else
.an.example.com # <----
.bravo.example.com # <---- NOW this list
noperiod.example.com # <---- IS ordered ;-)
.some.example.com # <----
Four things to note :
The content to be sorted is always at the bottom of the file
The header (":127.200.200.100 etc.") is always the same
Names may or may not start with a period (i.e. .bravo.example.com vs noperiod.example.com)
There may be a large number of items, so needs to be reasonably efficient

Depends what you mean by 'efficient'. I mean, a one liner is rarely efficient and it's also rarely concise or clear as to what it's doing.
But in terms of efficiency? Well, it depends what you're doing already that's inefficient. I mean, pretty fundamentally, if you're sorting something you need to examine the whole data set. Otherwise how would you know that the last line in your file needs to be sorted to the top?
But for what you're doing, I'd approach it like this:
#!/usr/bin/perl
use strict;
use warnings;
sub sort_noperiods {
my $a_np = $a;
$a_np =~ s/\.//g;
my $b_np = $b;
$b_np =~ s/\.//g;
return $a_np cmp $b_np;
}
while ( <> ) {
print;
last if m/Something Else/;
}
print sort sort_noperiods <>;
Which for your sample input, prints:
# A Comment
# Another comment
:127.100.100.255:Something
.789
.123
.456
:127.200.200.100:Something Else
.an.example.com # <---- not ordered
.bravo.example.com # <----
noperiod.example.com # <---- This list is
.some.example.com # <----
I'm keying off the 'Something Else' line in your file, as I couldn't quite tell how you'd identify the last line of the 'header' chunk. Anything else gets read in and sorted according to the 'noperiods' sort mechanism. (There may be a small efficiency gain by caching the result of the regular expressions, but I'm not sure of this).
This can be 'one-linerified' by:
perl -e 'while ( <> ) { print; last if m/Something Else/ }; print sort { $a =~ s/\.//gr cmp $b =~ s/\.//gr } <>; '
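The caching idea mentioned above can be sketched with a Schwartzian transform: compute each line's period-free key once, sort on the cached keys, then throw the keys away. The sub name sort_ignoring_periods is made up for illustration, and whether this is actually faster for your data would need measuring.

```perl
use strict;
use warnings;

# Hypothetical helper: sort lines while ignoring periods, computing each
# sort key only once (decorate, sort, undecorate).
sub sort_ignoring_periods {
    my @lines = @_;
    return map  { $_->[1] }                  # undecorate: keep the original line
           sort { $a->[0] cmp $b->[0] }      # compare the cached keys
           map  { my $key = $_; $key =~ tr/.//d; [ $key, $_ ] }   # decorate
           @lines;
}

my @sorted = sort_ignoring_periods(
    ".bravo.example.com\n", "noperiod.example.com\n",
    ".an.example.com\n",    ".some.example.com\n",
);
print @sorted;
```

In the script above, the last line would then become print sort_ignoring_periods(<>);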

You can sort that with the shell, with a little help from Perl: just prepend a line number to each line before the list, and for the list, use the number of its first line (the $.-- below undoes the automatic increment of $., so every list line gets the same number). Then sort numerically by the numbers, and then by the rest of the line:
perl -ne 'if (1 .. /^:127\.200\.200\.100:.*/) {
print "$.\t$_";
} else {
print $.--, "\t$_"
}' file.txt \
| sort -k1,1n -k2 | cut -f2-

Related

Perl: Find a match, remove the same lines, and to get the last field

Being a Perl newbie, please pardon me for asking this basic question.
I have a text file @server1 that shows a bunch of sentences (white space is the field separator) on many lines in the file.
I needed to match lines with my keyword, remove the same lines, and extract only the last field, so I have tried with:
my @allmatchedlines;
open(output1, "ssh user1@server1 cat /tmp/myfile.txt |");
while(<output1>) {
    chomp;
    @allmatchedlines = $_ if /mysearch/;
}
close(output1);
my @uniqmatchedline = split(/ /, @allmatchedlines);
my $lastfield = $uniqmatchedline[-1];
print "$lastfield\n";
and it gives me the output showing:
1
I don't know why it's giving me just "1".
Could someone please explain why I'm getting "1" and how I can get the last field of the matched line correctly?
Thank you!

my @uniqmatchedline = split(/ /, @allmatchedlines);
You're getting "1" because split takes a scalar, not an array. An array in scalar context returns the number of elements.
You need to split on each individual line. Something like this:
my @uniqmatchedline = map { split(/ /, $_) } @allmatchedlines;
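As a small illustration of the scalar-context point, here is a sketch with made-up sample data (the array names are only for the example):

```perl
use strict;
use warnings;

my @matched_lines = ('the quick brown fox');   # made-up sample line

# split evaluates its second argument in scalar context, so the array
# collapses to its element count (1) before any splitting happens.
my @wrong = split / /, @matched_lines;

# Splitting each line individually gives the actual words.
my @right = map { split / /, $_ } @matched_lines;

print "wrong: @wrong\n";            # wrong: 1
print "last field: $right[-1]\n";   # last field: fox
```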

There are two issues with your code:
split is expecting a scalar value (string) to split on; if you are passing an array, it will convert the array to scalar (which is just the array length)
You did not have a way to remove same lines
To address these, the following code should work (not tested as no data):
my @allmatchedlines;
open(output1, "ssh user1\@server1 cat /tmp/myfile.txt |");
while(<output1>) {
    chomp;
    push @allmatchedlines, $_ if /mysearch/;
}
close(output1);
my %existing;
my @uniqmatchedline = grep !$existing{$_}++, @allmatchedlines; # this will return the unique lines
my @lastfields = map { ((split / /, $_)[-1]) . "\n" } @uniqmatchedline; # this maps the last field in each line into an array
print for @lastfields;

Apart from two errors in the code, I find the statement "remove the same lines and extract only the last field" unclear. Once duplicate matching lines are removed, there may still be multiple distinct sentences with the pattern.
Until a clarification comes, here is code that picks the last field from the last such sentence.
use warnings 'all';
use strict;
use List::MoreUtils qw(uniq);
my $file = '/tmp/myfile.txt';
my $cmd = "ssh user1\@server1 cat $file";
open my $fh, '-|', $cmd or die "Error opening $cmd: $!";
my @allmatchedlines;
while (<$fh>) {
    chomp;
    push @allmatchedlines, $_ if /mysearch/;
}
close $fh;
my @unique_matched_lines = uniq @allmatchedlines;
my $lastfield = ( split ' ', $unique_matched_lines[-1] )[-1];
print $lastfield, "\n";
I changed to the three-argument open, with error checking. Recall that open for a process involves a fork and returns the pid, so an "error" here doesn't relate at all to what happened with the command itself. See open. Also note that @ inside "..." introduces an array and thus needs to be escaped.
The (default) pattern ' ' used in split splits on any amount of whitespace. The regex / / turns off this behavior and splits on a single space. You most likely want to use ' '.
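A quick sketch of the difference between the two patterns, using a made-up line with extra spaces:

```perl
use strict;
use warnings;

my $line = "  alpha  beta";   # two leading spaces, two spaces between the words

my @whitespace_fields   = split ' ', $line;   # ('alpha', 'beta')
my @single_space_fields = split / /, $line;   # ('', '', 'alpha', '', 'beta')

print scalar(@whitespace_fields), " fields vs ",
      scalar(@single_space_fields), " fields\n";
```

With / / every single space is a separator, so leading and repeated spaces produce empty fields; the last field happens to be the same here, but earlier field positions shift.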
For more comments please see the original post below.
The statement @allmatchedlines = $_ if /mysearch/; on every iteration assigns to the array, overwriting whatever has been in it. So you end up with only the last line that matched mysearch. You want push @allmatchedlines, $_ ... to get all those lines.
Also, as shown in the answer by Justin Schell, split needs a scalar so it is taking the length of @allmatchedlines – which is 1 as explained above. You should have
my @words_in_matched_lines = map { split } @allmatchedlines;
When all this is straightened out, you'll have words in the array @uniqmatchedline and if that is the intention then its name is misleading.
To get unique elements of the array you can use the module List::MoreUtils
use List::MoreUtils qw(uniq);
my @unique_elems = uniq @whole_array;

How can I determine if an element exists in an array (perl)

I'm looping through an array, and I want to test if an element is found in another array.
In pseudo-code, what I'm trying to do is this:
foreach $term (@array1) {
    if ($term is found in @array2) {
        #do something here
    }
}
I've got the "foreach" and the "do something here" parts down-pat ... but everything I've tried for the "if term is found in array" test does NOT work ...
I've tried grep:
if grep {/$term/} @array2 { #do something }
# this test always succeeds for values of $term that ARE NOT in @array2
if (grep(/$term/, @array2)) { #do something }
# this test likewise succeeds for values NOT IN the array
I've tried a couple different flavors of "converting the array to a hash" which many previous posts have indicated are so simple and easy ... and none of them have worked.
I am a long-time low-level user of perl, I understand just the basics of perl, do not understand all the fancy obfuscated code that comprises 99% of the solutions I read on the interwebs ... I would really, truly, honestly appreciate any answers that are explicit in the code and provide a step-by-step explanation of what the code is doing ...
... I seriously don't grok $_ and any other kind or type of hidden, understood, or implied value, variable, or function. I would really appreciate it if any examples or samples have all variables and functions named with clear terms ($term as opposed to $_) ... and describe with comments what the code is doing so I, in all my mentally deficient glory, may hope to possibly understand it some day. Please. :-)
...
I have an existing script which uses 'grep' somewhat successfully:
$rc=grep(/$term/, @array);
if ($rc eq 0) { #something happens here }
but I applied that EXACT same code to my new script and it simply does NOT succeed properly ... i.e., it "succeeds" (rc = zero) when it tests a value of $term that I know is NOT present in the array being tested. I just don't get it.
The ONLY difference in my 'grep' approach between 'old' script and 'new' script is how I built the array ... in old script, I built array by reading in from a file:
@array=`cat file`;
whereas in new script I put the array inside the script itself (coz it's small) ... like this:
@array=("element1","element2","element3","element4");
How can that result in different output of the grep function? They're both bog-standard arrays! I don't get it!!!! :-(
########################################################################
addendum ... some clarifications or examples of my actual code:
########################################################################
The term I'm trying to match/find/grep is a word element, for example "word123".
This exercise was just intended to be a quick-n-dirty script to find some important info from a file full of junk, so I skip all the niceties (use strict, warnings, modules, subroutines) by choice ... this doesn't have to be elegant, just simple.
The term I'm searching for is stored in a variable which is instantiated via split:
foreach $line(@array1) {
    chomp($line); # habit
    # every line has multiple elements that I want to capture
    ($term1,$term2,$term3,$term4)=split(/\t/,$line);
    # if a particular one of those terms is found in my other array 'array2'
    if (grep(/$term2/, @array2) {
        # then I'm storing a different element from the line into a 3rd array which eventually will be outputted
        push(@known, $term1) unless $seen{$term1}++;
    }
}
see that grep up there? It ain't workin right ... it is succeeding for all values of $term2 even if it is definitely NOT in array2 ... array1 is a file of a couple thousand lines. The element I'm calling $term2 here is a discrete term that may be in multiple lines, but is never repeated (or part of a larger string) within any given line. Array2 is about a couple dozen elements that I need to "filter in" for my output.
...
I just tried one of the below suggestions:
if (grep $_ eq $term2, @array2)
And this grep failed for all values of $term2 ... I'm getting an all or nothing response from grep ... so I guess I need to stop using grep. Try one of those hash solutions ... but I really could use more explanation and clarification on those.

This is in perlfaq. A quick way to do it is
my %seen;
$seen{$_}++ for @array1;
for my $item (@array2) {
    if ($seen{$item}) {
        # item is also in @array1, do something
    }
}
If letter case is not important, you can set the keys with $seen{ lc($_) } and check with if ($seen{ lc($item) }).
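Since the question asked for explicit names rather than $_, the same hash lookup can be spelled out as below. This is only a sketch: the sample data and the names %is_allowed, @search_terms, @allowed_terms and @found are invented for illustration.

```perl
use strict;
use warnings;

my @search_terms  = ('word123', 'word999');              # made-up sample data
my @allowed_terms = ('word123', 'word456', 'word789');

# Build a lookup table: every allowed term becomes a hash key.
my %is_allowed;
foreach my $allowed_term (@allowed_terms) {
    $is_allowed{$allowed_term} = 1;
}

# Test each search term with a single hash lookup (an exact string match).
my @found;
foreach my $term (@search_terms) {
    if ( exists $is_allowed{$term} ) {
        push @found, $term;
        print "$term is in the list\n";
    }
}
```

Building the hash costs one pass over the second array; after that each test is a direct lookup instead of a scan of the whole array.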
ETA:
With the changed question: If the task is to match single words in @array2 against whole lines in @array1, the task is more complicated. Trying to split the lines and match against hash keys will likely be unsafe, because of punctuation and other such things. So, a regex solution will likely be the safest.
Unless @array2 is very large, you might do something like this:
my $rx = join "|", @array2;
for my $line (@array1) {
    if ($line =~ /\b(?:$rx)\b/) { # group the alternation so the word boundaries apply to every alternative
        # do something
    }
}
If @array2 contains meta characters, such as *?+|, you have to make sure they are escaped, in which case you'd do something like:
my $rx = join "|", map quotemeta, @array2;
# etc

You could use the (infamous) "smart match" operator, provided you are on 5.10 or later:
#!/usr/bin/perl
use strict;
use warnings;
my @array1 = qw/a b c d e f g h/;
my @array2 = qw/a c e g z/;
print "a in \@array1\n" if 'a' ~~ @array1;
print "z in \@array1\n" if 'z' ~~ @array1;
print "z in \@array2\n" if 'z' ~~ @array2;
The example is very simple, but you can use an RE if you need to as well.
I should add that not everyone likes ~~ because there are some ambiguities and, um, "undocumented features"; since Perl 5.18 it is also flagged as experimental and warns when used. Should be OK for this though.

This should work.
#!/usr/bin/perl
use strict;
use warnings;
my @array1 = qw/a b c d e f g h/;
my @array2 = qw/a c e g z/;
for my $term (@array1) {
    if (grep $_ eq $term, @array2) {
        print "$term found.\n";
    }
}
Output:
a found.
c found.
e found.
g found.

#!/usr/bin/perl
@ar = ( '1','2','3','4','5','6','10' );
@arr = ( '1','2','3','4','5','6','7','8','9' ) ;
foreach $var ( @arr ){
    print "$var not found\n " if ( ! ( grep /$var/, @ar )) ;
}

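Pattern matching against the interpolated array is a compact way to test for an element, though it also matches substrings (searching for 1 finds 10). For simple values this would do the trick. Cheers!
print "$element found in the array\n" if ("@array" =~ m/$element/);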

Your 'actual code' shouldn't even compile:
if (grep(/$term2/, @array2) {
should be:
if (grep (/$term2/, @array2)) {
You have unbalanced parentheses in your code. You may also find it easier to use grep with a block that is run for each element of the list. It helps keep the parentheses from blurring together. This is optional, though. It would be:
if (grep {/$term2/} @array2) {
You may want to use strict; and use warnings; to catch issues like this.

The example below might be helpful; it checks whether any element of @my_array is present in @array_sp:
#! /usr/bin/perl -w
@my_array = qw(20001 20003);
@array_sp = qw(20001 20002 20004);
print "@array_sp\n";
foreach $case(@my_array){
    if("@array_sp" =~ m/$case/){
        print "My God!\n";
    }
}
Using pattern matching can solve this. Hope it helps
-QC

1. grep with eq , then
if (grep {$_ eq $term2} @array2) {
    print "$term2 exists in the array";
}
2. grep with regex , then
if (grep {/$term2/} @array2) {
    print "element with pattern $term2 exists in the array";
}
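The two grep forms above do not answer the same question, which a tiny sketch with made-up data shows: eq needs the whole element to be identical, while the regex also succeeds when the term is only part of an element. The anchored variant using \A, \Q...\E and \z is one way (an addition here, not from the answer) to get exact matching with a regex.

```perl
use strict;
use warnings;

my @array2 = ('word1234', 'other');   # made-up sample data
my $term2  = 'word123';

# In scalar context grep returns the number of matching elements.
my $exact_count    = grep { $_ eq $term2 } @array2;          # whole element must be equal
my $pattern_count  = grep { /$term2/ } @array2;              # substring match is enough
my $anchored_count = grep { /\A\Q$term2\E\z/ } @array2;      # regex forced to match the whole element

print "eq: $exact_count, regex: $pattern_count, anchored regex: $anchored_count\n";
# prints: eq: 0, regex: 1, anchored regex: 0
```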

Remove duplicates from list of files in perl

I know this should be pretty simple and the shell version is something like:
$ sort example.txt | uniq -u
in order to remove duplicate lines from a file. How would I go about doing this in Perl?

The interesting spin on this question is the uniq -u! I don't think the other answers I've seen tackle this; they deal with sort -u example.txt or (somewhat wastefully) sort example.txt | uniq.
The difference is that the -u option eliminates all occurrences of duplicated lines, so the output is of lines that appear only once.
To tackle this, you need to know how many times each name appears, and then you need to print the names that appear just once. Assuming the list is to be read from standard input, then this code does the trick:
my %counts;
while (<>)
{
chomp;
$counts{$_}++;
}
foreach my $name (sort keys %counts)
{
print "$name\n" if $counts{$name} == 1;
}
Or, using grep:
my %counts;
while (<>)
{
chomp;
$counts{$_}++;
}
{
local $, = "\n";
print grep { $counts{$_} == 1 } sort keys %counts;
}
Or, if you don't need to remove the newlines (because you're only going to print the names):
my %counts;
$counts{$_}++ for (<>);
print grep { $counts{$_} == 1 } sort keys %counts;
If you do in fact want every name that appears in the input to appear in the output (but only once), then any of the other solutions will do the trick (or, with minimal adaptation, will do the trick). In fact, since the input lines will end with a newline, you can generate the answer in just two lines:
my %counts = map { $_, 1 } <>;
print sort keys %counts;
No, you can't do it in one by simply replacing %counts in the print line with the map in the first line:
print sort keys map { $_, 1 } <>;
You get the error:
Type of arg 1 to keys must be hash or array (not map iterator) at ...

Or use the uniq sub from the List::MoreUtils module after reading the whole file into a list (although it's not a good solution).

Are you wanting to update a list of files to remove duplicate lines?
Or process a list of files, ignoring duplicate lines?
Or remove duplicate filenames from a list?
Assuming the latter:
my %seen;
@filenames = grep !$seen{$_}++, @filenames;
or other solutions from perldoc -q duplicate

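First of all, if the goal is one copy of every line, sort -u xxx.txt is simpler than sort | uniq (note that uniq -u is different: it keeps only the lines that occur exactly once).
Second, perl -ne 'print unless $seen{$_}++' keeps incrementing a counter for every repeat, which is not needed just to detect duplicates; perl -ne 'if(!$seen{$_}){print;$seen{$_}=1}' stores only a flag and seems preferable.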

Perl Map Function

I'm new to the map and grep functions and I'm trying to make an existing script more concise.
I can "grep" the @tracknames successfully but I'm having a problem with "map". I want @trackartist to return true if two consecutive "--" are found in a line and take the value of $1, otherwise false, but it returns the whole line if the upper condition is not met.
What am I doing wrong?
my @tracknames = grep /^\d\d\..*?(\.(?:flac|wv))$/, <*.*>;
my @trackartist = map { s/^\d\d\.\s(.*?)\s--.*?\.(?:flac|wv)$/$1/; $_; } <*.*>;
Sample of files
01. some track artist 1 -- some track name 1.(flac or wv)
02. some track artist 2 -- some track name 2.(flac or wv)
03. some track artist 3 -- some track name 3.(flac or wv)
etc.

Remember that grep is for filtering a list and map is for transforming a list. Right now, your map statement returns $_ for every item in the list. If $_ matches the pattern in your substitution, it will be modified and replaced with the first match. Otherwise, it's not modified and the original $_ is returned.
It sounds like you want to filter out items that don't match the pattern. One way would be to combine a map and a grep:
my @trackartist = map { s/^\d\d\.\s(.*?)\s--.*?\.(?:flac|wv)$/$1/; $_; }
                  grep { /^\d\d\.\s(.*?)\s--.*?\.(?:flac|wv)$/ } <*.*>;
Of course, this means you're doing the same pattern match twice. Another approach is to do a transform with map, but transform anything that doesn't match the pattern into an empty list.
my @trackartist = map { /^\d\d\.\s(.*?)\s--.*?\.(?:flac|wv)$/ ? $1 : ( ) } <*.*>;
This uses the ternary conditional operator (?:) to check if the regex matches (returning a true value). If it does, $1 is returned from the map block, if not, an empty list ( ) is returned, which adds nothing to the list resulting from the map.
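To see the empty-list trick in isolation, here is a sketch that runs the same pattern over a hard-coded list instead of a glob (the file names, including cover.jpg, are made up):

```perl
use strict;
use warnings;

my @file_names = (
    '01. some track artist 1 -- some track name 1.flac',
    '02. some track artist 2 -- some track name 2.wv',
    'cover.jpg',                       # does not match, so it contributes nothing
);

# A match yields the captured artist; a non-match yields the empty list.
my @track_artists =
    map { /^\d\d\.\s(.*?)\s--.*?\.(?:flac|wv)$/ ? $1 : () } @file_names;

print "$_\n" for @track_artists;
```

The result has two elements, one per matching file name, so map is filtering and transforming in a single pass.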
As a side note, you might want to look into using the glob function rather than <>, which has some disadvantages.

I like map and grep as much as the next guy, but your task seems more suited to a divide-and-conquer parsing approach. I say this because your comments suggest that your interest in map is leading you down a road where you'll end up with a data model consisting of parallel arrays -- @tracks, @artists, etc. -- which is often difficult to maintain in the long run. Here's a sketch of what I mean:
my @tracks;
while (my $file_name = <DATA>){ # You'll use glob() or <*.*>
    chomp $file_name;
    # Filter out unwanted files.
    my ($num, $artist_title, $ext) = $file_name =~ /
        ^ (\d\d) \. \s*
        (.*)
        \. (flac|wv) $
    /x;
    next unless $ext;
    # Try to parse the artist and title. Adjust as needed.
    my ($artist, $title) = split /\s+--\s+/, $artist_title, 2;
    ($artist, $title) = ('UNKNOWN', $artist) unless $title;
    # Store all info as a hash ref. No need for parallel arrays.
    push @tracks, {
        file_name => $file_name,
        ext => $ext,
        artist => $artist,
        title => $title,
    };
}
__DATA__
01. Perl Jam -- Open or die.wv
02. Perl Jam -- Map to nowhere.flac
03. Perl Jam -- What the #$#!?.wv
04. Perl Jam -- Regex blues.wv
05. Perl Jam -- Use my package, baby.wv
06. Perl Jam -- No warnings.wv
07. Perl Jam -- Laziness ISA virtue.wv
08. Guido and the Pythons -- Home on the xrange.flac
09. Guido and the Pythons -- You gotta keep em generated.flac
10. StackOverflow medley.wv
foo.txt

Why does my Perl for loop exit early?

I am trying to get a perl loop to work that is working from an array that contains 6 elements. I want the loop to pull out two elements from the array, perform certain functions, and then loop back and pull out the next two elements from the array until the array runs out of elements. Problem is that the loop only pulls out the first two elements and then stops. Some help here would be greatly appreciated.
my open(infile, 'dnadata.txt');
my @data = < infile>;
chomp @data;
#print @data; #Debug
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
my $aalen = length($aminoacids);
my $i=0;
my $j=0;
my @matrix =();
for(my $i=0; $i<2; $i++){
for( my $j=0; $j<$aalen; $j++){
$matrix[$i][$j] = 0;
}
}
The guidelines for this program state that the program should ignore the presence of gaps, which means that DNA code that is matched up with a gap should be ignored. So the code that is pushed through needs to have alignments linked with gaps removed.
I need to divide the length of the array by two since I am comparing two sequences in this part of the loop.
#$lemseqcomp = $lenarray / 2;
#print $lenseqcomp;
#I need to initialize these saclar values.
$junk1 = " ";
$junk2 = " ";
$seq1 = " ";
$seq2 = " ";
This is the loop that is causing issues. I believe that the first loop should move back to the array and pull out the next element each time it loops, but it doesn't.
for($i=0; $i<$lenarray; $i++){
#This code should remove the the last value of the array once and
#then a second time. The sequences should be the same length at this point.
my $last1 =pop(@data1);
my $last2 =pop(@data1);
for($i=0; $i<length($last1); $i++){
my $letter1 = substr($last1, $i, 1);
my $letter2 = substr($last2, $i, 1);
if(($letter1 eq '-')|| ($letter2 eq '-')){
#I need to put the sequences I am getting rid of somewhere. Here is a good place as any.
$junk1 = $letter1 . $junk1;
$junk2 = $letter1 . $junk2;
}
else{
$seq1 = $letter1 . $seq1;
$seq2 = $letter2 . $seq2;
}
}
}
print "$seq1\n";
print "$seq2\n";
print "@data1\n";
I am actually trying to create a substitution matrix from scratch and return the data. The reason why the code looks weird, is because it isn't actually finished yet and I got stuck.
This is the test sequence if anyone is curious.
YFRFR
YF-FR
FRFRFR
ARFRFR
YFYFR-F
YFRFRYF

First off, if you're going to work with sequence data, use BioPerl. Life will be so much easier. However...
Since you know you'll be comparing the lines from your input file as pairs, it makes sense to read them into a data structure that reflects that. As suggested elsewhere, an array like @data = ([line1, line2], [line3, line4]) ensures that the correct pairs of lines are always together.
What I'm not clear on is what you're trying to do:
a) are you generating a consensus sequence where the 2 sequences differ only by gaps, or
b) are your 2 sequences significantly different and you're trying to exclude the non-aligning parts and then generate a consensus?
So, does the first pair represent your data, or is it more like the second?
ATCG---AAActctgGGGGG--taGC
ATCGcccAAActctgGGGGGTTtaGC
ATCG---AAActctgGGGGG--taGCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
ATCGcccAAActctgGGGGGTTtaGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

The problem is that you're using $i as the counter variable for both your loops, so the inner loop modifies the counter out from under the outer loop. Try changing the inner loop's counter to $j, or using my to localize them properly.
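A stripped-down sketch of the effect, with the loop bodies removed and a few of the sample sequences hard-coded (the counter names $row and $col are just for illustration):

```perl
use strict;
use warnings;

my @sequences = ('YFRFR', 'FRFRFR', 'YFYFR-F');   # sample rows

# Shared counter: the inner loop leaves $i at the string length (5), which is
# already past the outer limit (3), so the outer loop stops after one pass.
my $shared_passes = 0;
my $i;
for ($i = 0; $i < @sequences; $i++) {
    $shared_passes++;
    for ($i = 0; $i < length($sequences[0]); $i++) { }
}

# Separate lexical counters: every row is visited.
my $separate_passes = 0;
for (my $row = 0; $row < @sequences; $row++) {
    $separate_passes++;
    for (my $col = 0; $col < length($sequences[$row]); $col++) { }
}

print "shared: $shared_passes, separate: $separate_passes\n";   # shared: 1, separate: 3
```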

Don't store your values as a flat array; store them as a two-dimensional array:
my @dataset = ([$val1, $val2], [$val3, $val4]);
or
my @dataset;
push (@dataset, [$val_n1, $val_n2]);
Then:
for my $value (@dataset) {
    ### Do stuff with $value->[0] and $value->[1]
}

There are lots of strange things in your code: you are initializing a matrix then not using it; reading a whole file into an array; scanning a string C style but then not doing anything with the unmatched values; and finally, just printing the two last processed values (which, in your case, are the two first elements of your array, since you are using pop.)
Here's a guess.
use strict;
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
# Preparing a regular expression. This is kind of useful if processing large
# amounts of data. This will match anything that is not in the string above.
my $regex = qr([^$aminoacids]);
# Our work function.
sub do_something {
my ($a, $b) = @_;
$a =~ s/$regex//g; # removing unwanted characters
$b =~ s/$regex//g; # ditto
# Printing, saving, whatever...
print "Something: $a - $b\n";
return ($a, $b);
}
my $prev;
while (<>) {
chomp;
if ($prev) {
do_something($prev, $_);
$prev = undef;
} else {
$prev = $_;
}
}
print STDERR "Warning: trailing data: $prev\n"
if $prev;

Since you are a total Perl/programming newbie, I am going to show a rewrite of your first code block, then I'll offer you some general advice and links.
Let's look at your first block of sample code. There is a lot of stuff all strung together, and it's hard to follow. I, personally, am too dumb to remember more than a few things at a time, so I chop problems into small pieces that I can understand. This is (was) known as 'chunking'.
One easy way to chunk your program is to write subroutines. Take any particular action or idea that is likely to be repeated or would make the current section of code long and hard to understand, and wrap it up into a nice neat package and get it out of the way.
It also helps if you add space to your code to make it easier to read. Your mind is already struggling to grok the code soup, why make things harder than necessary? Grouping like things, using _ in names, blank lines and indentation all help. There are also conventions that can help, like making constant values (values that cannot or should not change) all capital letters.
use strict; # Using strict will help catch errors.
use warnings; # ditto for warnings.
use diagnostics; # diagnostics will help you understand the error messages
# Put constants at the top of your program.
# It makes them easy to find, and change as needed.
my $AMINO_ACIDS = 'ARNDCQEGHILKMFPSTWYV';
my $AMINO_COUNT = length($AMINO_ACIDS);
my $DATA_FILE = 'dnadata.txt';
# Here I am using subroutines to encapsulate complexity:
my @data = read_data_file( $DATA_FILE );
my @matrix = initialize_matrix( 2, $AMINO_COUNT, 0 );
# now we are done with the first block of code and can do more stuff
...
# This section down here looks kind of big, but it is mostly comments.
# Remove the didactic comments and suddenly the code is much more compact.
# Here are the actual subs that I abstracted out above.
# It helps to document your subs:
# - what they do
# - what arguments they take
# - what they return
# Read a data file and returns an array of dna strings read from the file.
#
# Arguments
# data_file => path to the data file to read
sub read_data_file {
my $data_file = shift;
# Here I am using a 3 argument open, and a lexical filehandle.
open( my $infile, '<', $data_file )
or die "Unable to open $data_file - $!\n";
# I've left slurping the whole file intact, even though it can be very inefficient.
# Other times it is just what the doctor ordered.
my @data = <$infile>;
chomp @data;
# I return the data array rather than a reference
# to keep things simple since you are just learning.
#
# In my code, I'd pass a reference.
return @data;
}
# Initialize a matrix (or 2-d array) with a specified value.
#
# Arguments
# $i => width of matrix
# $j => height of matrix
# $value => initial value
sub initialize_matrix {
my $i = shift;
my $j = shift;
my $value = shift;
# I use two powerful perlisms here: map and the range operator.
#
# map is a list construction function that is very very powerful.
# it calls the code in brackets for each member of the list it operates against.
# Think of it as a for loop that keeps the result of each iteration,
# and then builds an array out of the results.
#
# The range operator `..` creates a list of intervening values. For example:
# (1..5) is the same as (1, 2, 3, 4, 5)
my @matrix = map {
[ ($value) x $i ]
} 1..$j;
# So here we make a list of numbers from 1 to $j.
# For each member of the list we
# create an anonymous array containing a list of $i copies of $value.
# Then we add the anonymous array to the matrix.
return @matrix;
}
Now that the code rewrite is done, here are some links:
Here's a response I wrote titled "How to write a program". It offers some basic guidelines on how to approach writing software projects from specification. It is aimed at beginners. I hope you find it helpful. If nothing else, the links in it should be handy.
For a beginning programmer, beginning with Perl, there is no better book than Learning Perl.
I also recommend heading over to Perlmonks for Perl help and mentoring. It is an active Perl specific community site with very smart, friendly people who are happy to help you. Kind of like Stack Overflow, but more focused.
Good luck!

Instead of using a C-style for loop, you can read data from an array two elements at a time using splice inside a while loop:
while (my ($letter1, $letter2) = splice(@data, 0, 2))
{
# stuff...
}
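As a runnable sketch of that loop, here it is applied to the test sequences from the question (the names @pairs, $first and $second are made up for the example). Note that splice removes the elements it returns, so the array is empty once the loop finishes.

```perl
use strict;
use warnings;

my @data = qw(YFRFR YF-FR FRFRFR ARFRFR YFYFR-F YFRFRYF);

# Take two elements off the front of @data on every pass; the loop ends
# when splice has nothing left to return.
my @pairs;
while ( my ($first, $second) = splice(@data, 0, 2) ) {
    push @pairs, "$first/$second";
}

print "$_\n" for @pairs;
print "elements left in \@data: ", scalar(@data), "\n";
```

If the original array is still needed afterwards, run the loop on a copy.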
I've cleaned up some of your other code below:
use strict;
use warnings;
open(my $infile, '<', 'dnadata.txt') or die "Unable to open dnadata.txt: $!";
my @data = <$infile>;
close $infile;
chomp @data;
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
my $aalen = length($aminoacids);
# initialize a 2 x 20 array for holding the amino acid data
my $matrix;
foreach my $i (0 .. 1)
{
foreach my $j (0 .. $aalen-1)
{
$matrix->[$i][$j] = 0;
}
}
# Process all letters in the DNA data
while (my ($letter1, $letter2) = splice(@data, 0, 2))
{
# do something... not sure what?
# you appear to want to look up the letters in a reference table, perhaps $aminoacids?
}