I am working on a collatz sequence. I currently have a for loop.
for my $num (1..1000000) {
my $count = 1;
for (my $i = $num; $i != 1; $count++) {
$i = $i % 2 ? 3 * $i + 1 : $i / 2;
}
}
And then I have a simple way of working out the count of the loop (who many times it takes to complete the theory).
if ($count > $max_length) {
$max = $num;
$max_length = $count;
}
I worked out that code could be made quicker by using a simple theory.
If n = 3, it would have this sequence {3,10,5,16,8,4,2,1} [8] If n =
6, it would have this sequence {6,3,10,5,16,8,4,2,1} [9] If n = 12, it
would have this sequence {12,6,3,10,5,16,8,4,2,1} [10]
So I want to save the result of 3, to be able to work out the result of 6 by just adding 1 to the count and so forth.
I tried to tackle this, with what I thought would do the trick but it infact made my program take 1 minute longer to complete, I now have a program that takes 1.49 seconds rather than 30 seconds I had before.
This is how I added the cache(it's probably wrong)
The below is outside of the for loop
my $cache = 0;
my $lengthcache = 0;
I then have this bit of code which sits after the $i line, line 4 in the for loop
$cache = $i;
$lengthcache = $count;
if ($cache = $num*2) {
$lengthcache++;
}
I don't want the answer given to me in full, I just need to understand how to correctly cache without making the code slower.
You just want the length, right? There's no much savings to be obtained to caching the sequence, and the memory usage will be quite large.
Write a recursive function that returns the length.
sub seq_len {
my ($n) = #_;
return 1 if $n == 1;
return 1 + seq_len( $n % 2 ? 3 * $n + 1 : $n / 2 );
}
Cache the result.
my %cache;
sub seq_len {
my ($n) = #_;
return $cache{$n} if $cache{$n};
return $cache{$n} = 1 if $n == 1;
return $cache{$n} = 1 + seq_len( $n % 2 ? 3 * $n + 1 : $n / 2 );
}
Might as well move terminating conditions to the cache.
my %cache = ( 1 => 1 );
sub seq_len {
my ($n) = #_;
return $cache{$n} ||= 1 + seq_len( $n % 2 ? 3 * $n + 1 : $n / 2 );
}
Recursion is not necessary. You can speed it up by flatting it. It's a bit tricky, but you can do it using the usual technique[1].
my %cache = ( 1 => 1 );
sub seq_len {
my ($n) = #_;
my #to_cache;
while (1) {
if (my $length = $cache{$n}) {
$cache{pop(#to_cache)} = ++$length while #to_cache;
return $length;
}
push #to_cache, $n;
$n = $n % 2 ? 3 * $n + 1 : $n / 2;
}
}
Making sure it works:
use strict;
use warnings;
use feature qw( say );
use List::Util qw( sum );
my $calculations;
my %cache = ( 1 => 1 );
sub seq_len {
my ($n) = #_;
my #to_cache;
while (1) {
if (my $length = $cache{$n}) {
$cache{pop(#to_cache)} = ++$length while #to_cache;
return $length;
}
push #to_cache, $n;
++$calculations;
$n = $n % 2 ? 3 * $n + 1 : $n / 2;
}
}
my #results = map { seq_len($_) } 3,6,12;
say for #results;
say "$calculations calculations instead of " . (sum(#results)-#results);
8
9
10
9 calculations instead of 24
Notes:
To remove recursion,
Make the function tail-recursive by rearranging code or by passing down information about what to do on return. (The former is not possible here.)
Replace the recursion with a loop plus stack.
Eliminate the stack if possible. (Not possible here.)
Clean up the result.
Changing your algorithm to cache results so that it can break out early:
use strict;
use warnings;
my #steps = (0,0);
my $max_steps = 0;
my $max_num = 0;
for my $num (2..1_000_000) {
my $count = 0;
my $i = $num;
while ($i >= $num) {
$i = $i % 2 ? 3 * $i + 1 : $i / 2;
$count++;
}
$count += $steps[$i];
$steps[$num] = $count;
if ($max_steps < $count) {
$max_steps = $count;
$max_num = $num;
}
}
print "$max_num takes $max_steps steps\n";
Changes my processing time from 37 seconds to 2.5 seconds.
Why is 2.5 seconds enough of an improvement?
I chose caching in an array #steps because the processing of all integers from 1 to N easily matches the indexes of an array. This also provides a memory benefit over using a hash of 33M vs 96M in a hash holding the same data.
As ikegami pointed out, this does mean that I can't cache all the values of cycles that go past 1million though, as that would quickly use up all memory. For example, the number 704,511 has a cycle that goes up to 56,991,483,520.
In the end, this means that my method does recalculate portions of certain cycles, but overall there is still a speed improvement due to not having to check for caches at every step. When I change this to use a hash and cache every cycle, the speed decreases to 9.2secs.
my %steps = (1 => 0);
for my $num (2..1_000_000) {
my #i = $num;
while (! defined $steps{$i[-1]}) {
push #i, $i[-1] % 2 ? 3 * $i[-1] + 1 : $i[-1] / 2;
}
my $count = $steps{pop #i};
$steps{pop #i} = ++$count while (#i);
#...
And when I use memoize like demonstrated by Oesor, the speed is 23secs.
If you change your implementation to be a recursive function, you can wrap it with Memoize (https://metacpan.org/pod/Memoize) to speed up already calculated responses.
use strict;
use warnings;
use feature qw/say/;
use Data::Printer;
use Memoize;
memoize('collatz');
for my $num (qw/3 6 12 1/) {
my #series = collatz($num);
p(#series);
say "$num : " . scalar #series;
}
sub collatz {
my($i) = #_;
return $i if $i == 1;
return ($i, collatz( $i % 2 ? 3 * $i + 1 : $i / 2 ));
}
Output
[
[0] 3,
[1] 10,
[2] 5,
[3] 16,
[4] 8,
[5] 4,
[6] 2,
[7] 1
]
3 : 8
[
[0] 6,
[1] 3,
[2] 10,
[3] 5,
[4] 16,
[5] 8,
[6] 4,
[7] 2,
[8] 1
]
6 : 9
[
[0] 12,
[1] 6,
[2] 3,
[3] 10,
[4] 5,
[5] 16,
[6] 8,
[7] 4,
[8] 2,
[9] 1
]
12 : 10
[
[0] 1
]
1 : 1
Related
I'm trying to figure out why this keeps printing the "majority element" candidate in every cycle.
The code I've been trying to make work is a Majority Element search (to find an element that is repeated more than half of the length of a list).
I can't separate the processes of finding the candidate and testing against the array because my input is a text file that has an indeterminate number of arrays. It's an exercise from rosalind.info that has different inputs every time you try to solve it.
An example of the input would be
-5 5 5 5 5 5 5 5 -8 7 7 7 1 7 3 7 -7 1 6 5 10 100 1000 1 -5 1 6 7 1 1 10 1
Here's what I've written so far.
foreach my $currentrow (#lists) {
my #row = ();
#row = split( /\s/, $currentrow );
my $length = $#row;
my $count = 0;
my $i = 0;
for $i ( 0 .. $length - 1 ) {
if ( $count == 0 ) {
$candidate = $row[$i];
$count++;
}
if ( ( $count > 0 ) and ( $i = $length - 1 ) ) {
my $counter2 = 0;
for my $j ( 0 .. $length - 1 ) {
if ( $row[$j] == $candidate ) {
$counter2++;
}
}
if ( $counter2 <= ( $#row / 2 ) and ( $i = $length - 1 ) ) {
$candidate = -1;
print $candidate, " ", $i, " ";
}
if ( $counter2 > ( $#row / 2 ) and ( $i = $length - 1 ) ) {
print $candidate, " ", $i, " ";
}
}
if ( $candidate == $row[$i] and $count > 0 ) {
$count = $count + 1;
}
if ( $candidate != $row[$i] and $count > 0 ) {
$count = $count - 1;
}
}
}
Do you have use strict and use warnings 'all' in place?
I imagine that your problem may be because of the test $i = $length - 1, which is an assignment, and should be $i == $length - 1
To find a majority element I would use a hash:
perl -nae '%h=(); $h{$_}+=2 for #F; $h{$_}>#F and print for keys %h; print "\n"'
Each line of input is treated separately. Each line of output matches a line of input and presents its majority element or is empty if there is no such element.
Edit: Now the solution uses autosplit (-a), which is shorter and work not only for numbers.
I am trying to take one set of data and subtract each value in that data by another set of data.
For example:
Data set one (1, 2, 3)
Data set two (1, 2, 3, 4, 5)
So I should get something like (1 - (1 .. 5)) then (2 - (1..5)) and so on.
I currently have:
#!/usr/bin/perl
use strict;
use warnings;
my $inputfile = $ARGV[0];
open( INPUTFILE, "<", $inputfile ) or die $!;
my #array = <INPUTFILE>;
my $protein = 'PROT';
my $chain = 'P';
my $protein_coords;
for ( my $line = 0; $line <= $#array; ++$line ) {
if ( $array[$line] =~ m/\s+$protein\s+/ ) {
chomp $array[$line];
my #splitline = ( split /\s+/, $array[$line] );
my %coordinates = (
x => $splitline[5],
y => $splitline[6],
z => $splitline[7],
);
push #{ $protein_coords->[0] }, \%coordinates;
}
}
print "$protein_coords->[0]->[0]->{'z'} \n";
my $lipid1 = 'MEM1';
my $lipid2 = 'MEM2';
my $lipid_coords;
for ( my $line = 0; $line <= $#array; ++$line ) {
if ( $array[$line] =~ m/\s+$lipid1\s+/ || $array[$line] =~ m/\s+$lipid2\s+/ ) {
chomp $array[$line];
my #splitline = ( split /\s+/, $array[$line] );
my %coordinates = (
x => $splitline[5],
y => $splitline[6],
z => $splitline[7],
);
push #{ $lipid_coords->[1] }, \%coordinates;
}
}
print "$lipid_coords->[1]->[0]->{'z'} \n";
I am trying to take every value in $protein_coords->[0]->[$ticker]->{'z'} minus each value in $lipid_coords->[1]->[$ticker]->{'z'}.
My overall objective is to find (z2-z1)^2 in the equation d = sqrt((x2-x1)^2+(y2-y1)^2-(z2-z1)^2). I think that if I can do this once then I can do it for X and Y also. Technically I am trying to find the distance between every atom in a PDB file against every lipid atom in the same PDB and print the ResID for distance less than 5A.
To iterate on all combinations of two arrays, just embed two for loops:
use strict;
use warnings;
my #dataone = (1, 2, 3);
my #datatwo = (1, 2, 3, 4, 5);
for my $one (#dataone) {
for my $two (#datatwo) {
print "$one - $two\n";
}
}
Outputs:
1 - 1
1 - 2
1 - 3
1 - 4
1 - 5
2 - 1
2 - 2
2 - 3
2 - 4
2 - 5
3 - 1
3 - 2
3 - 3
3 - 4
3 - 5
This will give you the result of subtracting each element of set 2 from each element of set 1 in what I believe is the manner you were asking.
#!/usr/bin/perl
use strict;
use warnings;
my #set1 = (1, 2, 3);
my #set2 = (1, 2, 3, 4, 5);
my #set3 = ();
for my $val (#set1) {
push #set3, map { $val - $_ } #set2;
}
local $" = ', ';
print "#set3\n";
system 'pause';
The result will be an array containing (1 - (1..5), 2 - (1..5), 3 - (1..5)).
Contents of #set3 after script runs:
0, -1, -2, -3, -4, 1, 0, -1, -2, -3, 2, 1, 0, -1, -2
All the other protein and lipid stuff is way over my head, but I hope this at least helps a little. You should now have an array containing the subtracted elements that you can work with to get the rest of your results!
Edit:
Can replace the loop with this one liner :)
my #set3 = map { my $v = $_; map { $v - $_ } #set2 } #set1;
map is a pretty nifty function!
The easiest way to do this is to do your calculations while you're going through file two:
for (my $line = 0; $line <= $#array; ++$line) {
if (($array[$line] =~ m/\s+$lipid1\s+/) | ($array[$line] =~ m/\s+$lipid2\s+/)) {
chomp $array[$line];
my #splitline = (split /\s+/, $array[$line]);
my %coordinates = (x => $splitline[5],
y => $splitline[6],
z => $splitline[7],
);
push #{$lipid_coords->[1]}, \%coordinates;
# go through each of the sets of protein coors in your array...
for my $p (#{$protein_coords->[0]}) {
# you can store this value however you want...
my $difference = $protein_coords->[0][$p]{z} - $coordinates{z};
}
}
}
If I were you, I would use some form of unique identifier to allow me to access the data on each combination -- e.g. build a hash of the form $difference->{<protein_id>}{<lipid_id>} = <difference>.
For clarification, if I had a list of 8 elements, i would want to randomly pick 2. If I had a list of 20 elements, I would want to randomly pick 5. I would also like to assure (though not needed) that two elements don't touch, i.e. if possible not the 3 and then 4 element. Rather, 3 and 5 would be nicer.
The simplest solution:
Shuffle the list
select the 1st quarter.
Example implementation:
use List::Util qw/shuffle/;
my #nums = 1..20;
my #pick = (shuffle #nums)[0 .. 0.25 * $#nums];
say "#pick";
Example output: 10 2 18 3 19.
Your additional restriction “no neighboring numbers” actually makes this less random, and should be avoided if you want actual randomness. To avoid that two neighboring elements are included in the output, I would iteratively splice unwanted elements out of the list:
my #nums = 1..20;
my $size = 0.25 * #nums;
my #pick;
while (#pick < $size) {
my $i = int rand #nums;
push #pick, my $num = $nums[$i];
# check and remove neighbours
my $len = 1;
$len++ if $i < $#nums and $num + 1 == $nums[$i + 1];
$len++, $i-- if 0 < $i and $num - 1 == $nums[$i - 1];
splice #nums, $i, $len;
}
say "#pick";
use strict;
use warnings;
sub randsel {
my ($fact, $i, #r) = (1.0, 0);
while (#r * 4 < #_) {
if (not grep { $_ == $i } #r) {
$fact = 1.0;
# make $fact = 0.0 if you really don't want
# consecutive elements
$fact = 0.1 if grep { abs($i - $_) == 1 } #r;
push(#r, $i) if (rand() < 0.25 * $fact);
}
$i = ($i + 1) % #_;
}
return map { $_[$_] } sort { $a <=> $b } #r;
}
my #l;
$l[$_] = $_ for (0..19);
print join(" ", randsel(#l)), "\n";
I am beginning to delve deeper into Perl, but am having trouble writing "Perl-ly" code instead of writing C in Perl. How can I change the following code to use more Perl idioms, and how should I go about learning the idioms?
Just an explanation of what it is doing: This routine is part of a module that aligns DNA or amino acid sequences(using Needelman-Wunch if you care about such things). It creates two 2d arrays, one to store a score for each position in the two sequences, and one to keep track of the path so the highest-scoring alignment can be recreated later. It works fine, but I know I am not doing things very concisely and clearly.
edit: This was for an assignment. I completed it, but want to clean up my code a bit. The details on implementing the algorithm can be found on the class website if any of you are interested.
sub create_matrix {
my $self = shift;
#empty array reference
my $matrix = $self->{score_matrix};
#empty array ref
my $path_matrix = $self->{path_matrix};
#$seq1 and $seq2 are strings set previously
my $num_of_rows = length($self->{seq1}) + 1;
my $num_of_columns = length($self->{seq2}) + 1;
#create the 2d array of scores
for (my $i = 0; $i < $num_of_rows; $i++) {
push(#$matrix, []);
push(#$path_matrix, []);
$$matrix[$i][0] = $i * $self->{gap_cost};
$$path_matrix[$i][0] = 1;
}
#fill out the first row
for (my $i = 0; $i < $num_of_columns; $i++) {
$$matrix[0][$i] = $i * $self->{gap_cost};
$$path_matrix[0][$i] = -1;
}
#flag to signal end of traceback
$$path_matrix[0][0] = 2;
#double for loop to fill out each row
for (my $row = 1; $row < $num_of_rows; $row++) {
for (my $column = 1; $column < $num_of_columns; $column++) {
my $seq1_gap = $$matrix[$row-1][$column] + $self->{gap_cost};
my $seq2_gap = $$matrix[$row][$column-1] + $self->{gap_cost};
my $match_mismatch = $$matrix[$row-1][$column-1] + $self->get_match_score(substr($self->{seq1}, $row-1, 1), substr($self->{seq2}, $column-1, 1));
$$matrix[$row][$column] = max($seq1_gap, $seq2_gap, $match_mismatch);
#set the path matrix
#if it was a gap in seq1, -1, if was a (mis)match 0 if was a gap in seq2 1
if ($$matrix[$row][$column] == $seq1_gap) {
$$path_matrix[$row][$column] = -1;
}
elsif ($$matrix[$row][$column] == $match_mismatch) {
$$path_matrix[$row][$column] = 0;
}
elsif ($$matrix[$row][$column] == $seq2_gap) {
$$path_matrix[$row][$column] = 1;
}
}
}
}
You're getting several suggestions regarding syntax, but I would also suggest a more modular approach, if for no other reason that code readability. It's much easier to come up to speed on code if you can perceive the big picture before worrying about low-level details.
Your primary method might look like this.
sub create_matrix {
my $self = shift;
$self->create_2d_array_of_scores;
$self->fill_out_first_row;
$self->fill_out_other_rows;
}
And you would also have several smaller methods like this:
n_of_rows
n_of_cols
create_2d_array_of_scores
fill_out_first_row
fill_out_other_rows
And you might take it even further by defining even smaller methods -- getters, setters, and so forth. At that point, your middle-level methods like create_2d_array_of_scores would not directly touch the underlying data structure at all.
sub matrix { shift->{score_matrix} }
sub gap_cost { shift->{gap_cost} }
sub set_matrix_value {
my ($self, $r, $c, $val) = #_;
$self->matrix->[$r][$c] = $val;
}
# Etc.
One simple change is to use for loops like this:
for my $i (0 .. $num_of_rows){
# Do stuff.
}
For more info, see the Perl documentation on foreach loops and the range operator.
I have some other comments as well, but here is the first observation:
my $num_of_rows = length($self->{seq1}) + 1;
my $num_of_columns = length($self->{seq2}) + 1;
So $self->{seq1} and $self->{seq2} are strings and you keep accessing individual elements using substr. I would prefer to store them as arrays of characters:
$self->{seq1} = [ split //, $seq1 ];
Here is how I would have written it:
sub create_matrix {
my $self = shift;
my $matrix = $self->{score_matrix};
my $path_matrix = $self->{path_matrix};
my $rows = #{ $self->{seq1} };
my $cols = #{ $self->{seq2} };
for my $row (0 .. $rows) {
$matrix->[$row]->[0] = $row * $self->{gap_cost};
$path_matrix->[$row]->[0] = 1;
}
my $gap_cost = $self->{gap_cost};
$matrix->[0] = [ map { $_ * $gap_cost } 0 .. $cols ];
$path_matrix->[0] = [ (-1) x ($cols + 1) ];
$path_matrix->[0]->[0] = 2;
for my $row (1 .. $rows) {
for my $col (1 .. $cols) {
my $gap1 = $matrix->[$row - 1]->[$col] + $gap_cost;
my $gap2 = $matrix->[$row]->[$col - 1] + $gap_cost;
my $match_mismatch =
$matrix->[$row - 1]->[$col - 1] +
$self->get_match_score(
$self->{seq1}->[$row - 1],
$self->{seq2}->[$col - 1]
);
my $max = $matrix->[$row]->[$col] =
max($gap1, $gap2, $match_mismatch);
$path_matrix->[$row]->[$col] = $max == $gap1
? -1
: $max == $gap2
? 1
: 0;
}
}
}
Instead of dereferencing your two-dimensional arrays like this:
$$path_matrix[0][0] = 2;
do this:
$path_matrix->[0][0] = 2;
Also, you're doing a lot of if/then/else statements to match against particular subsequences: this could be better written as given statements (perl5.10's equivalent of C's switch). Read about it at perldoc perlsyn:
given ($matrix->[$row][$column])
{
when ($seq1_gap) { $path_matrix->[$row][$column] = -1; }
when ($match_mismatch) { $path_matrix->[$row][$column] = 0; }
when ($seq2_gap) { $path_matrix->[$row][$column] = 1; }
}
The majority of your code is manipulating 2D arrays. I think the biggest improvement would be switching to using PDL if you want to do much stuff with arrays, particularly if efficiency is a concern. It's a Perl module which provides excellent array support. The underlying routines are implemented in C for efficiency so it's fast too.
I would always advise to look at CPAN for previous solutions or examples of how to do things in Perl. Have you looked at Algorithm::NeedlemanWunsch?
The documentation to this module includes an example for matching DNA sequences. Here is an example using the similarity matrix from wikipedia.
#!/usr/bin/perl -w
use strict;
use warnings;
use Inline::Files; #multiple virtual files inside code
use Algorithm::NeedlemanWunsch; # refer CPAN - good style guide
# Read DNA sequences
my #a = read_DNA_seq("DNA_SEQ_A");
my #b = read_DNA_seq("DNA_SEQ_B");
# Read Similarity Matrix (held as a Hash of Hashes)
my %SM = read_Sim_Matrix();
# Define scoring based on "Similarity Matrix" %SM
sub score_sub {
if ( !#_ ) {
return -3; # gap penalty same as wikipedia)
}
return $SM{ $_[0] }{ $_[1] }; # Similarity Value matrix
}
my $matcher = Algorithm::NeedlemanWunsch->new( \&score_sub, -3 );
my $score = $matcher->align( \#a, \#b, { align => \&check_align, } );
print "\nThe maximum score is $score\n";
sub check_align {
my ( $i, $j ) = #_; # #a[i], #b[j]
print "seqA pos: $i, seqB pos: $j\t base \'$a[$i]\'\n";
}
sub read_DNA_seq {
my $source = shift;
my #data;
while (<$source>) {
push #data, /[ACGT-]{1}/g;
}
return #data;
}
sub read_Sim_Matrix {
#Read DNA similarity matrix (scores per Wikipedia)
my ( #AoA, %HoH );
while (<SIMILARITY_MATRIX>) {
push #AoA, [/(\S+)+/g];
}
for ( my $row = 1 ; $row < 5 ; $row++ ) {
for ( my $col = 1 ; $col < 5 ; $col++ ) {
$HoH{ $AoA[0][$col] }{ $AoA[$row][0] } = $AoA[$row][$col];
}
}
return %HoH;
}
__DNA_SEQ_A__
A T G T A G T G T A T A G T
A C A T G C A
__DNA_SEQ_B__
A T G T A G T A C A T G C A
__SIMILARITY_MATRIX__
- A G C T
A 10 -1 -3 -4
G -1 7 -5 -3
C -3 -5 9 0
T -4 -3 0 8
And here is some sample output:
seqA pos: 7, seqB pos: 2 base 'G'
seqA pos: 6, seqB pos: 1 base 'T'
seqA pos: 4, seqB pos: 0 base 'A'
The maximum score is 100
I have the following sparse matrix A.
2 3 0 0 0
3 0 4 0 6
0 -1 -3 2 0
0 0 1 0 0
0 4 2 0 1
Then I would like to capture the following information from there:
cumulative count of entries, as matrix is scanned columnwise.
Yielding:
Ap = [ 0, 2, 5, 9, 10, 12 ];
row indices of entries, as matrix is scanned columnwise.
Yielding:
Ai = [0, 1, 0, 2, 4, 1, 2, 3, 4, 2, 1, 4 ];
Non-zero matrix entries, as matrix is scanned columnwise.
Yielding:
Ax = [2, 3, 3, -1, 4, 4, -3, 1, 2, 2, 6, 1];
Since the actual matrix A is potentially very2 large, is there any efficient way
in Perl that can capture those elements? Especially without slurping all matrix A
into RAM.
I am stuck with the following code. Which doesn't give what I want.
use strict;
use warnings;
my (#Ax, #Ai, #Ap) = ();
while (<>) {
chomp;
my #elements = split /\s+/;
my $i = 0;
my $new_line = 1;
while (defined(my $element = shift #elements)) {
$i++;
if ($element) {
push #Ax, 0 + $element;
if ($new_line) {
push #Ai, scalar #Ax;
$new_line = 0;
}
push #Ap, $i;
}
}
}
push #Ai, 1 + #Ax;
print('#Ax = [', join(" ", #Ax), "]\n");
print('#Ai = [', join(" ", #Ai), "]\n");
print('#Ap = [', join(" ", #Ap), "]\n");
A common strategy for storing sparse data is to drop the values you don't care about (the zeroes) and to store the row and column indexes with each value that you do care about, thus preserving their positional information:
[VALUE, ROW, COLUMN]
In your case, you can economize further since all of your needs can be met by processing the data column-by-column, which means we don't have to repeat COLUMN for every value.
use strict;
use warnings;
use Data::Dumper;
my ($r, $c, #dataC, #Ap, #Ai, #Ax, $cumul);
# Read data row by row, storing non-zero values by column.
# $dataC[COLUMN] = [
# [VALUE, ROW],
# [VALUE, ROW],
# etc.
# ]
$r = -1;
while (<DATA>) {
chomp;
$r ++;
$c = -1;
for my $v ( split '\s+', $_ ){
$c ++;
push #{$dataC[$c]}, [$v, $r] if $v;
}
}
# Iterate through the data column by column
# to compute the three result arrays.
$cumul = 0;
#Ap = ($cumul);
$c = -1;
for my $column (#dataC){
$c ++;
$cumul += #$column;
push #Ap, $cumul;
for my $value (#$column){
push #Ax, $value->[0];
push #Ai, $value->[1];
}
}
__DATA__
2 3 0 0 0
3 0 4 0 6
0 -1 -3 2 0
0 0 1 0 0
0 4 2 0 1
This is what you are looking for, I guess:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper::Simple;
my #matrix;
# Populate #matrix
while (<>) {
push #matrix, [ split /\s+/ ];
}
my $columns = #{ $matrix[0] };
my $rows = #matrix;
my ( #Ap, #Ai, #Ax );
my $ap = 0;
for ( my $j = 0 ; $j <= $rows ; $j++ ) {
for ( my $i = 0 ; $i <= $columns ; $i++ ) {
if ( $matrix[$i]->[$j] ) {
$ap++;
push #Ai, $i;
push #Ax, $matrix[$i]->[$j];
}
}
push #Ap, $ap;
}
print Dumper #Ap;
print Dumper #Ai;
print Dumper #Ax;
Updated based on FM's comment. If you do not want to store any of the original data:
#!/usr/bin/perl
use strict;
use warnings;
my %matrix_info;
while ( <DATA> ) {
chomp;
last unless /[0-9]/;
my #v = map {0 + $_ } split;
for (my $i = 0; $i < #v; ++$i) {
if ( $v[$i] ) {
push #{ $matrix_info{$i}->{indices} }, $. - 1;
push #{ $matrix_info{$i}->{nonzero} }, $v[$i];
}
}
}
my #cum_count = (0);
my #row_indices;
my #nonzero;
for my $i ( sort {$a <=> $b } keys %matrix_info ) {
my $mi = $matrix_info{$i};
push #nonzero, #{ $mi->{nonzero} };
my #i = #{ $mi->{indices} };
push #cum_count, $cum_count[-1] + #i;
push #row_indices, #i;
}
print(
"\#Ap = [#cum_count]\n",
"\#Ai = [#row_indices]\n",
"\#Ax = [#nonzero]\n",
);
__DATA__
2 3 0 0 0
3 0 4 0 6
0 -1 -3 2 0
0 0 1 0 0
0 4 2 0 1
Output:
C:\Temp> m
#Ap = [0 2 5 9 10 12]
#Ai = [0 1 0 2 4 1 2 3 4 2 1 4]
#Ax = [2 3 3 -1 4 4 -3 1 2 2 6 1]
Ap is easy: simply start with zeroes and increment each time you meet a nonzero number. I don't see you trying to write anything into #Ap, so it's no surprise it doesn't end up as you wish.
Ai and Ax are trickier: you want a columnwise ordering while you're scanning rowwise. You won't be able to do anything in-place since you don't know yet how many elements the columns will yield, so you can't know in advance the elements' position.
Obviously, it would be a hell lot easier if you could just alter the requirement to have a rowwise ordering instead. Failing that, you could get complex and collect (i, j, x) triplets. While collecting, they'd naturally be ordered by (i, j). Post-collection, you'd just want to sort them by (j, i).
The code you provided works on a row-by-row basis. To get results sequential by columns you have to accumulate your values into separate arrays, one for each column:
# will look like ([], [], [] ...), one [] for each column.
my #columns;
while (<MATRIX>) {
my #row = split qr'\s+';
for (my $col = 0; $col < #row; $col++) {
# push each non-zero value into its column
push #{$columns[$col]}, $row[$col] if $row[$col] > 0;
}
}
# now you only need to flatten it to get the desired kind of output:
use List::Flatten;
#non_zero = flat #columns;
See also List::Flatten.