I have a set of numerical data for which it is important to me to know which pairs of numbers occurred together, and how many times. Each set of data contains 7 numbers between 1 and 20. There are several hundred sets of data.
Essentially, by parsing each set of my data, I want to create a 20 x 20 array that I can use to keep a count of when pairs of numbers occurred together.
I have done a lot of searching, but maybe I've used the wrong key words. I've seen loads of examples how to create a "2D array" - I know perl doesn't actually do that, and that it's really an array of references - and to print the values contained therein, but nothing really on how to work with one particular cell by number and alter it.
Below is my conceptual code. The commented lines don't work, but illustrate what I want to achieve. I'm reasonably new to coding Perl, and this just seems too advanced for me to understand the examples I've seen and translate them into something I can actually use.
my @datapairs;
while (<DATAFILE>)
{
    chomp;
    my @data = split(",",$_);
    for ($prcount=0; $prcount <=5; $prcount++)
    {
        for ($othcount=($prcount+1); $othcount<=6; $othcount++)
        {
            @data[$prcount]=@data[$prcount]+1;
            @data[$othcount]=@data[$othcount]+1;
            @data[$prcount]=@data[$prcount]-1;
            @data[$othcount]=@data[$othcount]-1;
            print @data[$prcount]." ".@data[$othcount]."; ";
            #@datapairs[@data[$prcount]][@data[$othcount]]++;
            #@datapairs[@data[$othcount]][@data[$prcount]]++;
        }
    }
}
Any input or suggestions would be much appreciated.
To access a "cell" in a "2-d array" in Perl (as you already figured out, it's an array of arrayrefs) is simple:
my @datapairs;
# Add 1 for a pair with indexes $i and $j
$datapairs[$i]->[$j]++;
# print that value
print "$datapairs[$i]->[$j]\n";
It's not clear what you mean by "occur together" - if you mean "in the same length-7 array", it's easy:
my @datapairs;
while (<DATAFILE>) {
    chomp;
    my @data = split(",", $_);
    for (my $prcount = 0; $prcount <= 5; $prcount++) {
        for (my $othcount = $prcount + 1; $othcount <= 6; $othcount++) {
            $datapairs[ $data[$prcount] ]->[ $data[$othcount] ]++;
        }
    }
}
# Print (values run from 1 to 20; pairs never seen are undef, so show them as 0)
for (my $i = 1; $i <= 20; $i++) {
    for (my $j = 1; $j <= 20; $j++) {
        print( ($datapairs[$i]->[$j] // 0), ", " );
    }
    print "\n";
}
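If you want the table to be symmetric, so that the count for 3 and 7 can be read from either [3][7] or [7][3], a minimal sketch could look like the following. The sub name count_pairs and the sample rows are made up for illustration; real data would come from the file instead.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each pair of values appears in the same row.
# Both [x][y] and [y][x] are incremented so the table is symmetric.
sub count_pairs {
    my @rows = @_;          # each row is an array reference of numbers
    my @pairs;              # array of array references, grown on demand
    for my $row (@rows) {
        for my $i (0 .. $#$row - 1) {
            for my $j ($i + 1 .. $#$row) {
                my ($x, $y) = @{$row}[$i, $j];
                $pairs[$x][$y]++;
                $pairs[$y][$x]++;
            }
        }
    }
    return \@pairs;
}

# Made-up sample data: two short rows instead of several hundred 7-number sets.
my $pairs = count_pairs([1, 2, 3], [2, 3, 4]);

# Cells that were never touched are undef, so default them to 0 when reading.
printf "2 and 3 occurred together %d time(s)\n", $pairs->[2][3] // 0;
printf "1 and 4 occurred together %d time(s)\n", $pairs->[1][4] // 0;
```

The `// 0` (defined-or, Perl 5.10 and later) avoids "uninitialized value" warnings for pairs that never occurred.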
As a side note, personally, just for stylistic reasons, I strongly prefer to reference EVERYTHING, e.g. use arrayref of arrayrefs instead of array of arrays. E.g.
my $datapairs;
# Add 1 for a pair with indexes $i and $j
$datapairs->[$i]->[$j]++;
# print that value
print "$datapairs->[$i]->[$j]\n";
The second (and third...) arrow dereference operator is optional in Perl but I personally find it significantly more readable to enforce its usage - it spaces out the index expressions.
Related
I found the following code written in Perl about the Sieve of Eratosthenes (an algorithm to find primes in a given range of numbers) and it's working fine, but I don't understand it. Can somebody comment it out for me, so I'll get a better understanding of how the primes are found?
$max= 120;
@primes= ();
@tested= (1);
$j= 1;
while ($j < $max) {
    next if $tested[$j++];
    push @primes, $j;
    for ($k= $j; $k <= $max; $k+=$j) {
        $tested[$k-1]= 1;
    }
}
print "@primes\n";
I would rewrite (clean up) that script like the following to make it more clear.
Take this as lesson that if one gives variables meaningful names, code can become self-documenting:
use strict;
use warnings;
my $max = 120;
my @primes;
my @notprime;
for my $num (2..$max) {
    next if $notprime[$num];
    push @primes, $num;
    for (my $multiple = 2 * $num; $multiple <= $max; $multiple += $num) {
        $notprime[$multiple] = 1;
    }
}
print "@primes\n";
The Wikipedia article on the Sieve of Eratosthenes explains the algorithm fully and provides pretty little visuals of the process. However, the summary is just this:
Iterate over all the integers from 2 to max.
If an integer hasn't been marked as notprime, then it's prime!
Then just cycle through all multiples of the recognized prime so that we can mark them as not prime.
$max= 120;
@primes= ();
@tested might be better named something like @nonprime. Although we put 1 into the array to start with, it doesn't actually do anything useful... it could equally be left empty.
Also, @tested isn't a list of non-primes, but a list of boolean values whose indices are non-primes. If, for some reason, you wanted to mark 2 as non-prime, you'd have to do something like this instead: @tested = (1,1);
@tested= (1);
$j= 1;
Sweep through all the integers from 1 to 120.
while ($j < $max) {
If we've already checked this number for primality and it failed, restart the loop to check the next number.
next if $tested[$j++];
We now know that $j is a prime, because we haven't marked it as being non-prime, so we can add it to the end of our list. The final list will be in ascending order because of this.
push @primes, $j;
Sweep through every remaining number between this one and the end of the array. We increment by our new prime number each time, so we're basically striding over all the multiples of $j.
for ($k= $j; $k <= $max; $k+=$j) {
Mark each multiple as tested. We know it cannot be prime, as it has $j as a factor.
$tested[$k-1]= 1;
}
}
The rest of the script is left as an exercise for the reader.
print "@primes\n";
I want to see if I have repeated items in my array. There are over 16,000 of them, so I want to automate it.
There may be other ways but I started with this and, well, would like to finish it unless there is a straightforward command. What I am doing is shifting and pushing from one array into another and this way, check the destination array to see if it is "in array" (like there is such a command in PHP).
So, I got this subroutine and it works with literals, but it doesn't with variables. I think it is because of the 'eq', or whatever I should use instead. The 'sourcefile' will contain one or more of the words of the destination array.
# Here I just fetch my file
$listamails = <STDIN>;
# Remove the newline from the filename
chomp $listamails;
# open the file, or exit
unless ( open(MAILS, $listamails) ) {
    print "Cannot open file \"$listamails\"\n\n";
    exit;
}
# Read the list of mails from the file, and store it
# into the array variable @sourcefile
@sourcefile = <MAILS>;
# Close the handle - we've read all the data into @sourcefile now.
close MAILS;
my @destination = ('hi', 'bye');
sub in_array
{
    my ($destination,$search_for) = @_;
    return grep {$search_for eq $_} @$destination;
}
for($i = 0; $i <=100; $i ++)
{
    $elemento = shift @sourcefile;
    if(in_array(\@destination, $elemento))
    {
        print "it is";
    }
    else
    {
        print "it aint there";
    }
}
Well, if instead of including the $elemento in there I put a 'hi' it does work and also I have printed the value of $elemento which is also 'hi', but when I put the variable, it does not work, and that is because of the 'eq', but I don't know what else to put. If I put == it complains that 'hi' is not a numeric value.
When you want distinct values think hash.
my %seen;
@seen{ @array } = ();
if (keys %seen == @array) {
    print "\@array has no duplicate values\n";
}
It's not clear what you want. If your first sentence is the only one that matters ("I want to see if I have repeated items in my array"), then you could use:
use feature 'say';   # say needs Perl 5.10 or later
my %seen;
if (grep ++$seen{$_} >= 2, @array) {
    say "Has duplicates";
}
You said you have a large array, so it might be faster to stop as soon as you find a duplicate.
use feature 'say';   # say needs Perl 5.10 or later
my %seen;
for (@array) {
    if (++$seen{$_} == 2) {
        say "Has duplicates";
        last;
    }
}
By the way, when looking for duplicates in a large number of items, it's much faster to use a strategy based on sorting than to search the whole list once per item. After sorting the items, all duplicates will be right next to each other, so to tell if something is a duplicate, all you have to do is compare it with the previous one:
@sorted = sort @sourcefile;
for (my $i = 1; $i < @sorted; ++$i) { # Start at 1 because we'll check the previous one
    print "$sorted[$i] is a duplicate!\n" if $sorted[$i] eq $sorted[$i - 1];
}
This will print multiple dupe messages if there are multiple dupes, but you can clean it up.
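One possible cleanup, as a rough sketch: collect each duplicated value only once, even if it occurs three or more times. The sub name duplicates_sorted and the word list are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return each duplicated value once, using the sort-based approach.
sub duplicates_sorted {
    my @sorted = sort @_;
    my @dupes;
    for my $i (1 .. $#sorted) {
        # Only interested in items equal to their predecessor.
        next unless $sorted[$i] eq $sorted[$i - 1];
        # Skip values already reported (a value occurring 3+ times).
        next if @dupes && $dupes[-1] eq $sorted[$i];
        push @dupes, $sorted[$i];
    }
    return @dupes;
}

my @dupes = duplicates_sorted(qw(bye hi hi ola hi bye));
print "duplicates: @dupes\n";
```

Because the input is sorted first, repeated reports of the same value are always adjacent, so comparing against the last reported value is enough.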
As eugene y said, hashes are definitely the way to go here. Here's a direct translation of the code you posted to a hash-based method (with a little more Perlishness added along the way):
my @destination = ('hi', 'bye');
my %in_array = map { $_ => 1 } @destination;
for my $i (0 .. 100) {
    $elemento = shift @sourcefile;
    if(exists $in_array{$elemento})
    {
        print "it is";
    }
    else
    {
        print "it aint there";
    }
}
Also, if you mean to check all elements of @sourcefile (as opposed to testing the first 101 elements) against @destination, you should replace the for line with
while (@sourcefile) {
Also also, don't forget to chomp any values read from a file! Lines read from a file have a linebreak at the end of them (the \r\n or \n mentioned in comments on the initial question), which will cause both eq and hash lookups to report that otherwise-matching values are different. This is, most likely, the reason why your code is failing to work correctly in the first place and changing to use sort or hashes won't fix that. First chomp your input to make it work, then use sort or hashes to make it efficient.
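To see the effect of the missing chomp in isolation, here is a small sketch. The @lines array stands in for lines read from a file; its contents are invented for the example.

```perl
use strict;
use warnings;

my %in_array = map { $_ => 1 } ('hi', 'bye');

# Simulated file contents: every line still carries its trailing newline.
my @lines = ("hi\n", "ciao\n", "bye\n");

# No matches here, because "hi\n" is not the same string as "hi".
my @raw_hits = grep { exists $in_array{$_} } @lines;

chomp @lines;   # strip the trailing newlines

# Now "hi" and "bye" are found.
my @clean_hits = grep { exists $in_array{$_} } @lines;

printf "before chomp: %d match(es), after chomp: %d match(es)\n",
    scalar @raw_hits, scalar @clean_hits;
```

Note that chomp removes the current input record separator (normally "\n"). If the file has \r\n line endings and is read on a system that does not translate them, something like s/\r?\n\z// on each line may be needed instead.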
Currently learning Perl, and trying to solve a little challenge to find the sum of even terms from the first 4,000,000 Fibonacci terms. I have created a Fibonacci array that seems to work, and then tried different methods to throw out the odd-valued terms, and continually run into an error when trying to sum my resulting array, getting reports of:
Use of uninitialized value in addition (+) at prob2_3.plx line 23
Here is what I have:
#!/usr/bin/perl -w
# prob2_2.plx
use warnings;
use strict;
my @fib; my $i; my $t; my $n;
@fib = (1, 2);
for ($i=2; $i<4000000; $i++) {
my $new= ( $fib[$i-1] + $fib[$i-2] );
push @fib, $new;}
for ($t=3; $t<4000000; $t++) {
if (($fib[$t] % 2) != 0 ) {
delete $fib[$t]; } }
my $total = 0;
for ($n=1; $n<$#fib; $n++) {
$total += $fib[($n+1)];}
print $total;
The warning means you are adding undef to something. delete $fib[$t]; is a bad way of doing $fib[$t] = undef;, which you later add to $total.
You have at least one other error:
The first two Fibonacci numbers are 0 and 1, not 1 and 2.
You have a major problem:
The 4,000,000th Fib number is going to be extremely large, much too large to fit in a double.
For reference purposes,
10,000th has 2090 digits: 207936082371334980721126489886428368250870360940159031196829458665285014234556866489274560343052265155917573432971901580106247942672509731761338101799027380382317897483462355564831914315919245323944200280678103204087244146934628490626683870833080482509206544933408787332263775808474463248737976037347946482581138586315504040810172603812029199438923709428526016473982135544790818235937154295669451493129936648467790904377992847736753792842706601751346648332663776986420121068913557911418727769340808035049567940946482928805660563647181876626689707585373833526774208355741559456585420036347653245410061210124467856891714948032624086026930912116019739382294466360499015319632861596990778804277202892355393296718771829156434190791865251186788568216008975201710704994376570673424008710839088118009762597274318205395542568694608153559184582533982343823604357627598231798961167484242695459246332046141379928508143520187384809235815539889908971514694061316956144977837207434613737562186851068568260906963398154909212537145372418669116042505973537478237332681781821985092402269558264160166900847498160728435824886131848299053831501800478443537515542015738331055219809981238332532612286898240517778465884610797908078283671323847984517940110765690575221586803789615321608583872238829743804839319295412221008003135806885850025988795664632214278204484925650731065958088374016489964235633861097820456341224678729218456064091743606356182168838125623216644428229525375774927153653211342045306867424354545051032697681443701184949063902549349423589040315098773697224370533831653603885951169802459279352259015376349256548723808771830083010745694440024264364147569050945350728047646844921056800247399144905559043913692186963870929181892461571034503870502293006032416114107074539600801709282779518347632167052424858208014238665266338160829214428830954632590804718193292017101478280252213856563402074897963176632788722076077910344317001127535588134788887275038253890668230986833556957181378678829
82111710796422706778536913192342733364556727928018953989153106047379741280794091639429908796650294603536651238230626
20,000th has 4180 digits: 156434434710976384973476536407274345816205094685591588318124541740458080385243381912747793450414331610367123779708718405248715758984639531433510179263266688330118849169885037725338373581201794305978226883528036061875446693240619267490418286859473849950041516659960273730079371201204627548536949560001949500412603959521755609760351083689968282782762685127441783856595846488154988815451156568771516208152702742116792671059216940576437287202326579185127952652109773980204779673801388551261626727322002409621478013256747971164356737251780824526256056242665165939101383798847650612464909253830782732628596463726832802976570798460712096159979633671463236249716995241316337055831128361296103358883633435243286033222287864895050815433116567861737309793964764801555278263839265493855172428938601756693298206544139202536921373467673984506895696627853675723597742112756505546706090653338300162592597859547218109115106279850728679875472845035826608974461646591425579976443150855948585363784108252178032271074802954600198046099069599908704661773131760849831642816417996715035093937470220182181889534962185895489306103459895434193985097367387094618307972802910562478216182762666136701767368192225760417881015443846208021779448910967838688115382683807583205819115313370404262815641934451691786736975534513561898691764200452150953843620429861813036340139554793317764376016113563835708864901446935800651830040403643111314377796939158424693424580073980913561974459880897762824530994153792843943160866552330889496731060052949844694393366546840630629276294240978609784787524001403635391792815622044665057951409203125430805931493161872669237664098744645927633119695078006366417175111008764464977305821311764064008510055292787840451627946143750385701739893709704260725805961225787830700700208691321092262676072834290127276840897490600792122744624255226136250547175172290655823553370907054810978951992040552164783616415667530478409778243586516564040189710782885912183152112656744661171607707576925
70727736979470643298369692498523829762023480374258890310900209762406919497421600887333578755618417601947995348154961041069031847139198476622534838061383124405787321228553883488487360182170328770135310046539023356927619009887093027976852655019726282175288665519954795261956265032471640737937873816433883656184886302556008909245525117676909891863168591593064384770974585858898293269381981298849531784374113154867199274121510545517263254217474626981257677619873008127448800481221389537467960384852814520866808098034693504708441843752586208106527459926314590761926137975454867756514106993272890896285935883951425316590839337463996661618635973577352903873761614402807313987030305904109578400475917216351176771904946586582567709526053146046877043888338973004473003224917205697223117568745348711454351015963467874542581658703105927174736709176384751526054744461889580818981503934814849705815199025822718771412515932592824835393457920091178940848604353269386896643223831238236314944703549417670395851334843313424688061679011669280526389994233115706189811373488915388180272165963004919891812315981511236146510432056564744909231099825952358804464206787003367175349143817295781131697530460839817524651569337902880208418806880838881666593628966489116087163735799448542359973849863029026088215666890266763712687033032074068277379252747813019864807624625944203986376078939610108249793954392253008329316265401792185583459475584721599068739989237674325042788384194790680937789769972764165924212232357196539050713922957353982728518263506456056434704171557195001851435948043743220101895451362055688562765598063167895334506120979001803994409151396470604593219932545661032550115909024081160187229962679568265554349554093909517280228152094120272483530629829115446740071472493266972750107881006669583149658103204327366159628981755853209931288710465528420688675573410073833991808074490301597976726055308352441572561092685275781723143582551795896053353754140820465755571226363643914078619228245294412610038660980664045265419
12783214030236752423547997110159548536582622929575859635210831021463323632502412193578592457118234067116894159316798758933206918936334540039454055299101076302263831614132510576874528929742319396129011617501
Even the 10,000th is too large for a double, and the 20,000th has twice as many digits as the 10,000th, so imagine how large the 4,000,000th will be!
Stylistic issues:
my $i; for ($i=2; $i<4000000; $i++)
is much harder to read than
for my $i (2..$N-1)
with the following at the top to avoid having to repeat the number everywhere:
my $N = 4_000_000;
As if the fact that the 4 millionth Fibonacci number is more than 10^835950 isn't a big enough problem, this is not very good:
for ($t=3; $t<4000000; $t++) {
if (($fib[$t] % 2) != 0 ) {
delete $fib[$t]; } }
my $total = 0;
for ($n=1; $n<$#fib; $n++) {
$total += $fib[($n+1)];}
Why are you walking through the list twice here? Much better would be to combine the two loops into one. You want the sum of the even terms, so sum the even terms. Don't delete the odd terms (stylistically very bad) and then walk over the list again, relying on the fact that undef has a numerical value of 0 (but only with a warning).
And man, the formatting of that code is very, very ugly. Eventually you will write code that someone else needs to read or maintain. My motto: Imagine that the person who will maintain your code is a psychopath who knows where you live.
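As a rough sketch of the single-loop idea, here is a version that sums the even terms as it goes and never stores the sequence. The sub name sum_even_fib is made up, the sequence is assumed to start at 0, 1, and the count is kept tiny on purpose since plain Perl numbers cannot hold anything close to 4,000,000 terms.

```perl
use strict;
use warnings;

# Hypothetical helper: sum the even values among the first $count Fibonacci
# numbers (starting 0, 1, 1, 2, ...), in one pass and without an array.
sub sum_even_fib {
    my ($count) = @_;
    my ($current, $next) = (0, 1);
    my $total = 0;
    for (1 .. $count) {
        $total += $current if $current % 2 == 0;
        ($current, $next) = ($next, $current + $next);
    }
    return $total;
}

# First 10 terms: 0 1 1 2 3 5 8 13 21 34, so the even ones are 0, 2, 8 and 34.
print sum_even_fib(10), "\n";
```

Nothing is deleted and nothing is undef, so the "uninitialized value" warning cannot occur.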
As ikegami points out, your uninitialized problem is assuming delete removes elements from an array, when in fact it just sets them to undef (unless they are at the end of the array).
Given the storage requirements of the larger Fibonacci numbers, you don't want them in an array at all; fortunately, there's no need to keep them around for this problem.
I would do it like this (takes many minutes to run):
use strict;
use warnings;
use Math::BigInt 'lib' => 'GMP';
my $fib_A = Math::BigInt->new(0);
my $fib_B = Math::BigInt->new(1);
my $sum = Math::BigInt->new(0);
# get the next 3999998
for (1..(4000000-2)) {
my $next = $fib_A + $fib_B;
$sum += $next if $next % 2 == 0;
($fib_A, $fib_B) = ($fib_B, $next);
}
print "The sum is $sum\n";
I have inherited some code from a guy whose favorite pastime was to shorten every line to its absolute minimum (and sometimes only to make it look cool). His code is hard to understand but I managed to understand (and rewrite) most of it.
Now I have stumbled on a piece of code which, no matter how hard I try, I cannot understand.
my @heads = grep {s/\.txt$//} OSA::Fast::IO::Ls->ls($SysKey,'fo','osr/tiparlo',qr{^\d+\.txt$}) || ();
my @selected_heads = ();
for my $i (0..1) {
    $selected_heads[$i] = int rand scalar @heads;
    for my $j (0..@heads-1) {
        last if (!grep $j eq $_, @selected_heads[0..$i-1]);
        $selected_heads[$i] = ($selected_heads[$i] + 1) % @heads; #WTF?
    }
    my $head_nr = sprintf "%04d", $i;
    OSA::Fast::IO::Cp->cp($SysKey,'',"osr/tiparlo/$heads[$selected_heads[$i]].txt","$recdir/heads/$head_nr.txt");
    OSA::Fast::IO::Cp->cp($SysKey,'',"osr/tiparlo/$heads[$selected_heads[$i]].cache","$recdir/heads/$head_nr.cache");
}
From what I can understand, this is supposed to be some kind of randomizer, but I never saw a more complex way to achieve randomness. Or are my assumptions wrong? At least, that's what this code is supposed to do. Select 2 random files and copy them.
=== NOTES ===
The OSA framework is a framework of our own. Its modules are named after their UNIX counterparts and do some basic testing so that the application does not need to bother with that.
This looks like some C code with Perl syntax. Sometimes knowing the language the person is thinking in helps you figure out what's going on. In this case, the person's brain is infected with the inner workings of memory management, pointer arithmetic, and other low level concerns, so he wants to minutely control everything:
my @selected_heads = ();
# a tricky way to make a two element array
for my $i (0..1) {
    # choose a random file
    $selected_heads[$i] = int rand @heads;
    # for all the files (could use $#heads instead)
    for my $j (0..@heads-1) {
        # stop if the chosen file is not already in @selected_heads
        # it's that damned ! in front of the grep that's mind-warping
        last if (!grep $j eq $_, @selected_heads[0..$i-1]);
        # if we are this far, the two files we selected are the same
        # choose a different file if we're this far
        $selected_heads[$i] = ($selected_heads[$i] + 1) % @heads; #WTF?
    }
    ...
}
This is a lot of work because the original programmer either doesn't understand hashes or doesn't like them.
my %selected_heads;
until( keys %selected_heads == 2 )
{
    my $try = int rand @heads;
    redo if exists $selected_heads{$try};
    $selected_heads{$try}++;
}
my @selected_heads = keys %selected_heads;
If you still hate hashes and have Perl 5.10 or later, you can use smart-matching to check if a value is in an array (note that smart-matching has been flagged as experimental since Perl 5.18):
my @selected_heads;
until( @selected_heads == 2 )
{
    my $try = int rand @heads;
    redo if $try ~~ @selected_heads;
    push @selected_heads, $try;
}
However, you have a special constraint on this problem. Since you know there are only two elements, you just have to check if the element you want to add is the prior element. In the first case there is no prior element (it is undef, which never equals $try), so the first addition always works. In the second case, it just can't be the last element in the array:
my @selected_heads;
until( @selected_heads == 2 )
{
    my $try = int rand @heads;
    redo if $try eq $selected_heads[-1];
    push @selected_heads, $try;
}
Huh. I can't remember the last time I used until when it actually fit the problem. :)
Note that all of these solutions have the problem that they can cause an infinite loop if the number of original files is less than 2. I'd add a guard condition higher up so the no-file and single-file cases throw an error and perhaps the two-file case doesn't bother to order them.
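A guard along those lines might look something like this sketch. The sub name pick_two_indexes and the error message are invented; it uses the hash approach from above and simply returns both indexes unshuffled when there are exactly two items.

```perl
use strict;
use warnings;

# Hypothetical helper: pick two distinct random indexes, refusing lists that are too short.
sub pick_two_indexes {
    my @items = @_;
    die "Need at least two items, got " . scalar(@items) . "\n" if @items < 2;
    return (0, 1) if @items == 2;   # nothing to choose; order is not randomized

    my %selected;
    while (scalar(keys %selected) < 2) {
        # A repeated index just overwrites its own key, so the loop tries again.
        $selected{ int rand @items } = 1;
    }
    return keys %selected;
}

my @picked = pick_two_indexes('a' .. 'e');
print "picked indexes: @picked\n";
```

With the guard in place the loop can only be reached when at least three candidates exist, so it always terminates.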
Another way you might do this is to shuffle (say, with List::Util) the entire list of original files and just take off the first two files:
use List::Util qw(shuffle);
my @input = 'a' .. 'z';
my @two = ( shuffle( @input ) )[0,1];
print "selected: @two\n";
It selects a random element from @heads.
Then it adds on another random but different element from @heads (if it is the element previously selected, it scrolls through @heads till it finds an element not previously selected).
In summary, it selects N (in your case N=2) different random indexes in the @heads array and then copies the files corresponding to those indexes.
Personally I would write it a bit differently:
# ...
my %selected_previously = ();
foreach my $i (1..$N) { # Generalize for N random files instead of 2
    my $random_head_index = int rand scalar @heads;
    while ($selected_previously{$random_head_index}++) {
        $random_head_index = ($random_head_index + 1) % @heads; # Cache me!!!
    }
    # NOTE: "++" in the while() might be considered a bit of a hack
    # More readable version: $selected_previously{$random_head_index}=1; here.
    # ... copy the files for $heads[$random_head_index] here ...
}
The part you labeled "WTF" isn't so troubling, it's just simply making sure that $selected_heads[$i] remains a valid subscript of @heads. The really troubling part is that it is a pretty inefficient way of making sure he's not selecting the same file.
Then again, if the size of #heads is small, stepping from 0..$#heads is probably more efficient than just generating int rand( 2 ) and testing if they are the same.
But basically it copies two files at random (why?) as a '.txt' file and a '.cache' file.
How about just
for my $i (0..1) {
    my $selected = splice( @heads, rand @heads, 1 );
    my $head_nr = sprintf "%04d", $i;
    OSA::Fast::IO::Cp->cp($SysKey,'',"osr/tiparlo/$selected.txt","$recdir/heads/$head_nr.txt");
    OSA::Fast::IO::Cp->cp($SysKey,'',"osr/tiparlo/$selected.cache","$recdir/heads/$head_nr.cache");
}
unless @heads or @selected_heads are used later.
Here's yet another way to select 2 unique random indices:
my @selected_heads = ();
my @indices = 0..$#heads;
for my $i (0..1) {
    my $j = int rand (@heads - $i);
    push @selected_heads, $indices[$j];
    $indices[$j] = $indices[@heads - $i - 1];
}
I am trying to get a Perl loop to work that is working from an array that contains 6 elements. I want the loop to pull out two elements from the array, perform certain functions, and then loop back and pull out the next two elements from the array until the array runs out of elements. The problem is that the loop only pulls out the first two elements and then stops. Some help here would be greatly appreciated.
open(infile, 'dnadata.txt');
my @data = <infile>;
chomp @data;
#print @data; #Debug
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
my $aalen = length($aminoacids);
my $i=0;
my $j=0;
my @matrix =();
for(my $i=0; $i<2; $i++){
    for( my $j=0; $j<$aalen; $j++){
        $matrix[$i][$j] = 0;
    }
}
The guidelines for this program state that the program should ignore the presence of gaps in the sequences, which means that DNA code that is matched up with a gap should be ignored. So the code that is pushed through needs to have alignments linked with gaps removed.
I need to divide the length of the array by two since I am comparing two sequences in this part of the loop.
#$lemseqcomp = $lenarray / 2;
#print $lenseqcomp;
#I need to initialize these scalar values.
$junk1 = " ";
$junk2 = " ";
$seq1 = " ";
$seq2 = " ";
This is the loop that is causing issues. I believe that the first loop should move back to the array and pull out the next element each time it loops, but it doesn't.
for($i=0; $i<$lenarray; $i++){
    #This code should remove the last value of the array once and
    #then a second time. The sequences should be the same length at this point.
    my $last1 =pop(@data1);
    my $last2 =pop(@data1);
    for($i=0; $i<length($last1); $i++){
        my $letter1 = substr($last1, $i, 1);
        my $letter2 = substr($last2, $i, 1);
        if(($letter1 eq '-')|| ($letter2 eq '-')){
            #I need to put the sequences I am getting rid of somewhere. Here is a good place as any.
            $junk1 = $letter1 . $junk1;
            $junk2 = $letter1 . $junk2;
        }
        else{
            $seq1 = $letter1 . $seq1;
            $seq2 = $letter2 . $seq2;
        }
    }
}
print "$seq1\n";
print "$seq2\n";
print "@data1\n";
I am actually trying to create a substitution matrix from scratch and return the data. The reason why the code looks weird, is because it isn't actually finished yet and I got stuck.
This is the test sequence if anyone is curious.
YFRFR
YF-FR
FRFRFR
ARFRFR
YFYFR-F
YFRFRYF
First off, if you're going to work with sequence data, use BioPerl. Life will be so much easier. However...
Since you know you'll be comparing the lines from your input file as pairs, it makes sense to read them into a data structure that reflects that. As suggested elsewhere, an array like @data = ([line1, line2], [line3, line4]) ensures that the correct pairs of lines are always together.
What I'm not clear on is what you're trying to do:
a) are you generating a consensus sequence where the 2 sequences differ only by gaps, or
b) are your 2 sequences significantly different and you're trying to exclude the non-aligning parts and then generate a consensus?
So, does the first pair represent your data, or is it more like the second?
ATCG---AAActctgGGGGG--taGC
ATCGcccAAActctgGGGGGTTtaGC
ATCG---AAActctgGGGGG--taGCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
ATCGcccAAActctgGGGGGTTtaGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
The problem is that you're using $i as the counter variable for both your loops, so the inner loop modifies the counter out from under the outer loop. Try changing the inner loop's counter to $j, or using my to localize them properly.
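A minimal sketch of the nested loops with separate, lexically scoped counters could look like this. The names @pairs, @kept, $p and $c are made up, and the two pairs are taken from the test sequences in the question; unlike the original, it appends letters rather than prepending them, so the sequences keep their order.

```perl
use strict;
use warnings;

# Two aligned pairs from the question's test data.
my @pairs = (['YFRFR', 'YF-FR'], ['FRFRFR', 'ARFRFR']);

my @kept;   # one [seq1, seq2] entry per pair, with gap columns removed
for my $p (0 .. $#pairs) {                       # outer counter: $p
    my ($first, $second) = @{ $pairs[$p] };
    my ($seq1, $seq2) = ('', '');
    for my $c (0 .. length($first) - 1) {        # inner counter: $c, independent of $p
        my $letter1 = substr($first,  $c, 1);
        my $letter2 = substr($second, $c, 1);
        # Drop the whole column if either sequence has a gap here.
        next if $letter1 eq '-' || $letter2 eq '-';
        $seq1 .= $letter1;
        $seq2 .= $letter2;
    }
    push @kept, [$seq1, $seq2];
}
print "$_->[0] $_->[1]\n" for @kept;
```

Because each `for my ...` declares its own variable, the inner loop can no longer change the outer loop's counter.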
Don't store your values as an array, store as a two-dimensional array:
my @dataset = ([$val1, $val2], [$val3, $val4]);
or
my @dataset;
push (@dataset, [$val_n1, $val_n2]);
Then:
for my $value (@dataset) {
    ### Do stuff with $value->[0] and $value->[1]
}
There are lots of strange things in your code: you are initializing a matrix then not using it; reading a whole file into an array; scanning a string C style but then not doing anything with the unmatched values; and finally, just printing the two last processed values (which, in your case, are the two first elements of your array, since you are using pop.)
Here's a guess.
use strict;
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
# Preparing a regular expression. This is kind of useful if processing large
# amounts of data. This will match anything that is not in the string above.
my $regex = qr([^$aminoacids]);
# Our work function.
sub do_something {
my ($a, $b) = @_;
$a =~ s/$regex//g; # removing unwanted characters
$b =~ s/$regex//g; # ditto
# Printing, saving, whatever...
print "Something: $a - $b\n";
return ($a, $b);
}
my $prev;
while (<>) {
chomp;
if ($prev) {
do_something($prev, $_);
$prev = undef;
} else {
$prev = $_;
}
}
print STDERR "Warning: trailing data: $prev\n"
if $prev;
Since you are a total Perl/programming newbie, I am going to show a rewrite of your first code block, then I'll offer you some general advice and links.
Let's look at your first block of sample code. There is a lot of stuff all strung together, and it's hard to follow. I, personally, am too dumb to remember more than a few things at a time, so I chop problems into small pieces that I can understand. This is (was) known as 'chunking'.
One easy way to chunk your program is to write subroutines. Take any particular action or idea that is likely to be repeated or would make the current section of code long and hard to understand, and wrap it up into a nice neat package and get it out of the way.
It also helps if you add space to your code to make it easier to read. Your mind is already struggling to grok the code soup, why make things harder than necessary? Grouping like things, using _ in names, blank lines and indentation all help. There are also conventions that can help, like making constant values (values that cannot or should not change) all capital letters.
use strict; # Using strict will help catch errors.
use warnings; # ditto for warnings.
use diagnostics; # diagnostics will help you understand the error messages
# Put constants at the top of your program.
# It makes them easy to find, and change as needed.
my $AMINO_ACIDS = 'ARNDCQEGHILKMFPSTWYV';
my $AMINO_COUNT = length($AMINO_ACIDS);
my $DATA_FILE = 'dnadata.txt';
# Here I am using subroutines to encapsulate complexity:
my @data = read_data_file( $DATA_FILE );
my @matrix = initialize_matrix( 2, $AMINO_COUNT, 0 );
# now we are done with the first block of code and can do more stuff
...
# This section down here looks kind of big, but it is mostly comments.
# Remove the didactic comments and suddenly the code is much more compact.
# Here are the actual subs that I abstracted out above.
# It helps to document your subs:
# - what they do
# - what arguments they take
# - what they return
# Read a data file and returns an array of dna strings read from the file.
#
# Arguments
# data_file => path to the data file to read
sub read_data_file {
my $data_file = shift;
# Here I am using a 3 argument open, and a lexical filehandle.
open( my $infile, '<', $data_file )
or die "Unable to open $data_file - $!\n";
# I've left slurping the whole file intact, even though it can be very inefficient.
# Other times it is just what the doctor ordered.
my @data = <$infile>;
chomp @data;
# I return the data array rather than a reference
# to keep things simple since you are just learning.
#
# In my code, I'd pass a reference.
return @data;
}
# Initialize a matrix (or 2-d array) with a specified value.
#
# Arguments
# $i => width of matrix
# $j => height of matrix
# $value => initial value
sub initialize_matrix {
my $i = shift;
my $j = shift;
my $value = shift;
# I use two powerful perlisms here: map and the range operator.
#
# map is a list construction function that is very very powerful.
# it calls the code in brackets for each member of the list it operates against.
# Think of it as a for loop that keeps the result of each iteration,
# and then builds an array out of the results.
#
# The range operator `..` creates a list of intervening values. For example:
# (1..5) is the same as (1, 2, 3, 4, 5)
my @matrix = map {
[ ($value) x $i ]
} 1..$j;
# So here we make a list of numbers from 1 to $j.
# For each member of the list we
# create an anonymous array containing a list of $i copies of $value.
# Then we add the anonymous array to the matrix.
return @matrix;
}
Now that the code rewrite is done, here are some links:
Here's a response I wrote titled "How to write a program". It offers some basic guidelines on how to approach writing software projects from specification. It is aimed at beginners. I hope you find it helpful. If nothing else, the links in it should be handy.
For a beginning programmer, beginning with Perl, there is no better book than Learning Perl.
I also recommend heading over to Perlmonks for Perl help and mentoring. It is an active Perl specific community site with very smart, friendly people who are happy to help you. Kind of like Stack Overflow, but more focused.
Good luck!
Instead of using a C-style for loop, you can read data from an array two elements at a time using splice inside a while loop:
while (my ($letter1, $letter2) = splice(@data, 0, 2))
{
# stuff...
}
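To see the splice idiom run on its own, here is a small sketch using the six test sequences from the question. The names $first, $second and @pairs are made up for the example.

```perl
use strict;
use warnings;

my @data = qw(YFRFR YF-FR FRFRFR ARFRFR YFYFR-F YFRFRYF);

my @pairs;
# splice removes and returns the first two elements on every pass,
# so the loop ends by itself once @data is empty.
while (my ($first, $second) = splice(@data, 0, 2)) {
    push @pairs, "$first/$second";
}

print "$_\n" for @pairs;
print "left in \@data: ", scalar(@data), "\n";
```

Keep in mind that this consumes @data, and that with an odd number of elements the last pass would leave the second variable undefined.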
I've cleaned up some of your other code below:
use strict;
use warnings;
open(my $infile, '<', 'dnadata.txt') or die "Unable to open dnadata.txt - $!\n";
my @data = <$infile>;
close $infile;
chomp @data;
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
my $aalen = length($aminoacids);
# initialize a 2 x 20 array for holding the amino acid data
my $matrix;
foreach my $i (0 .. 1)
{
foreach my $j (0 .. $aalen-1)
{
$matrix->[$i][$j] = 0;
}
}
# Process all letters in the DNA data
while (my ($letter1, $letter2) = splice(@data, 0, 2))
{
# do something... not sure what?
# you appear to want to look up the letters in a reference table, perhaps $aminoacids?
}