I wand to sort a 3-dimension array in Perl. The elements of the array are in the form:
$arr_3d[indA][indB][indC] , and each element for indC=1 is a number
What I need is, for a given value of indA, sort all the sub-arrays indexed/defined by indB, with the decreasing order of the value of $arr_3d[indA][indB][indC=1],.
e.g. for an 1x2x2 array if:
$arr_3d[1][1][1] = 1
$arr_3d[1][1][2] = 4
$arr_3d[1][2][1] = 2
$arr_3d[1][2][2] = 3
Then after sorting :
$arr_3d[1][1][1] = 2
$arr_3d[1][1][2] = 3
$arr_3d[1][2][1] = 1
$arr_3d[1][2][2] = 4
So after sorting the sub-arrays $arr_3d[1][1] and $arr_3d[1][2] are swapped.
Sorry for the messed up description.. Any ideas?
Regards,
Giorgos
This is related to the " Schwartzian transform in Perl? " . You are really just sorting a single array (#{ $arr_3d[$indA] }).
I test this and it works. You are probably using Fortran index notation (starting at 1), so I changed it to C indexing (starting at 0).
use Data::Dumper;
my #arr_3d ;
$arr_3d[0][0][0] = 1;
$arr_3d[0][1][0] = 2;
$arr_3d[0][0][1] = 4;
$arr_3d[0][1][1] = 3;
my $indA = 0;
my $indC = 0;
my #temp = #{ $arr_3d[$indA] };
#{ $arr_3d[$indA] } = sort { $b->[$indC] <=> $a->[$indC] } #temp;
print Dumper(\#arr_3d);
Related
I need to find the next neighbor using the euklid-distance-algorithm.
Given are two hashes with each 100 elements in this format:
$hash{$i}{price}
$hash{$i}{height}
I need to compare each hash elements with each other so that my result is a 100 x 100 matrix.
After that my matrix shoul be sorted after the following rules:
$hash{$i,$j} {lowest_euklid_distance}
$hash{$i,$j+1}{lowest_euklid_distance + 1}
$hash{$i,$j+2}{lowest_euklid_distance + 2}
.
.
$hash{$i+1,$j}{lowest_euklid_distance}
$hash{$i+1,$j}{lowest_euklid_distance + 2}
.
.
$hash{$i+n,$j+n}{lowest_euklid_distance+n}
My problem is to sort these elements properly.
Any adivices?
Thanks in advance.
/edit: adding more information:
I create the distance with the following subroutine:
sub euklid_distance{
#w1 = price_testdata
#w2 = height_testdata
#h1 = price_origindata
#h2 = height_origindata
my $w1 = trim($_[0]);
my $w2 = trim($_[1]);
my $h1 = trim($_[2]);
my $h2 = trim($_[3]);
my $result = (((($w2-$w1)**2)+(($h2-$h1)**2))**(1/2));
return $result;
}
I fetch the test and the origindata from two seperate lists.
The resulthash with the 100x100 matrix is created by the following code:
my %distancehash;
my $countvar=0;
for (my $j=0;$j<100;$j++){
for (my $i=0;$i<100;$i++){
$distancehash{$countvar}{distance} = euklid_distance( $origindata{$i}{price}, $testdata{$j}{price}, $origindata{$i}{height}, $testdata{$j}{height} );
$distancehash{$countvar}{originPrice} = $origindata{$i}{price};
$distancehash{$countvar}{originHeight} = $origindata{$i}{height};
$distancehash{$countvar}{testPrice} = $testdata{$j}{price};
$distancehash{$countvar}{testHeight} = $testdata{$j}{height};
$countvar++;
}
}
where $j goes over the testdata and $i over the origindata.
My goal is to have a new hash which is sorted by the lowest distance from the current $j ascending to the highest.
I have a structure array with a large number of fields that I don't care about, so I want to extract the limited number of fields I DO care about and put it into a separate structure array.
For a structure array of size one, I've done this by creating the new array from scratch, for example:
structOld.a = 1;
structOld.b = 2;
structOld.usefulA = 'useful information';
structOld.usefulB = 'more useful information';
structOld.c = 3;
structOld.d = 'words';
keepFields = {'usefulA','usefulB'};
structNew = struct;
for fn = keepFields
structNew.(fn{:}) = structOld.(fn{:});
end
which gives
structNew =
usefulA: 'useful information'
usefulB: 'more useful information'
Is there a more efficient way of doing this? How can I scale up to an structure array (vector) of size N?
N = 50;
structOld(1).a = 1;
structOld(1).b = 2;
structOld(1).usefulA = 500;
structOld(1).usefulB = 'us';
structOld(1).c = 3;
structOld(1).d = 'ef';
structOld(2).a = 4;
structOld(2).b = 5;
structOld(2).usefulA = 501;
structOld(2).usefulB = 'ul';
structOld(2).c = 6;
structOld(2).d = 'in';
structOld(3).a = 7;
structOld(3).b = '8';
structOld(3).usefulA = 504;
structOld(3).usefulB = 'fo';
structOld(3).c = 9;
structOld(3).d = 'rm';
structOld(N).a = 10;
structOld(N).b = 11;
structOld(N).usefulA = 506;
structOld(N).usefulB = 'at';
structOld(N).c = 12;
structOld(N).d = 'ion';
In this case, I'd like to end up with:
structNew =
1x50 struct array with fields:
usefulA
usefulB
Keeping elements with empty usefulA/usefulB fields is fine; I can get rid of them later if needed.
Using rmfield isn't great because the number of useless fields far outnumbers the useful fields.
You can create a new struct array using existing data as follows:
structNew = struct('usefulA',{structOld.usefulA},'usefulB',{structOld.usefulB});
If you have an arbitrary set of field names that you want to preserve, you could use a loop as follows. Here, I'm first extracting the data from strcutOld into a cell array data, which contains each of the arguments the the struct call in the previous line of code. data{:} is now a comma-separated list of these arguments, the last line of code below is identical to the line above.
keepFields = {'usefulA','usefulB'};
data = cell(2,numel(keepFields));
for ii=1:numel(keepFields)
data{1,ii} = keepFields{ii};
data{2,ii} = {structOld.(keepFields{ii})};
end
structNew = struct(data{:});
I stored some values in the array as follows $score[$userIndex] = :
Score [100, 400, 900]
userIndex (1 , 2 , 3 )
I need to ascending numeric sort based on the score while maintaining the indexes, as they have significance, the result should resemble:
Score [100, 400, 900]
userIndex (2 , 1 , 3 )
2. After the sort is complete how can I access the previous index an element was at(after the sort is complete I no longer need the score just the indexes)?
In the sorted array all i need is for eg: $sorted[0] = 2, $sorted[1] = 1, $sorted[3] = 3
I found this method of doing it after some extensive googling
while ($hashIndex < $#Score) {
$matchHash{$hashIndex} = $Score[$hashIndex];
$hashIndex++;
}
foreach my $score (sort { $matchHash{$a} <=> $matchHash{$b} } keys %matchHash) {
#DS
# printf "%-8s %s\n", $score, $matchHash{$score};
push (#sorted, $score);
}
The way to do this is to sort a list of array indices instead of the data itself. Then you can reorder the corresponding list of users using the same sorted indices.
The code would look like this
use strict;
use warnings;
my #score = (400, 100, 900);
my #users = (1, 2, 3);
my #sorted_indices = sort { $score[$a] <=> $score[$b] } 0 .. $#score;
my #sorted_users = #users[#sorted_indices];
print "#sorted_users\n";
output
2 1 3
Update
Looking at your own code, it looks like perhaps all you want is the sorted array indices, (i.e. the userIndex list in your question should properly be (0, 1, 2)). If that is the case the the sort is complete when #sorted_indices is filled.
print "#sorted_indices\n";
outputs
1 0 2
I have an array ref of about 50,000 users. I want to go through all those users and compare each one to all the others in order to build a weighted list of matches (if the name is an exact match it's worth x, a partial match is worth y etc).
After going through the list and doing all the checks, I then want to go get the 10 highest weighted matches. Here is sort of a example of what I'm doing to help explain:
#!/usr/bin/perl
######################################################################
# Libraries
# ---------
use strict;
use warnings;
my $users = [];
$users->[0]{'Name'} = 'xxx';
$users->[0]{'Address'} = 'yyyy';
$users->[0]{'Phone'} = 'xxx';
$users->[1]{'Name'} = 'xxx';
$users->[1]{'Address'} = 'yyyy';
$users->[1]{'Phone'} = 'xxx';
$users->[2]{'Name'} = 'xxx';
$users->[3]{'Address'} = 'yyyy';
$users->[4]{'Phone'} = 'xxx';
foreach my $user_to_check (#$users) {
my $matched_users = [];
foreach my $user (#$users) {
$user_to_check->{'Weight'} = 0;
if (lc($user_to_check->{'Name'}) eq lc($user->{'Name'})) {
$user_to_check->{'Weight'} = ($user_to_check->{'Weight'} + 10);
} elsif ((length($user_to_check->{'Name'}) > 2) && (length($user->{'Name'}) > 2) && ($user_to_check->{'Name'} =~ /\Q$user->{'Name'}\E/i)) {
$user_to_check->{'Weight'} = ($user_to_check->{'Weight'} + 5);
}
if (lc($user_to_check->{'Address'}) eq lc($user->{'Address'})) {
.....
}
if ($user_to_check->{'Weight'} > 0) {
# We have matches, add to matched users
push (#$matched_users,$user);
}
}
# Now we want to get just the top 10 highest matching users
foreach my $m_user (sort { $b->{'Weight'} <=> $a->{'Weight'} } #$matched_users ) {
last if $counter == 10;
.... # Do stuff with the 10 we want
}
}
The problem is, it's sooo slow. It takes more than a day to run (and I've tried it on multiple machines). I know that the "sort" is a killer but I did also try inserting the results into a tmp mysql table and then at the end instead of doing the Perl sort, I just did an order by select, but the difference in time was very minor.
As I'm just going through a existing data structure and comparing it I'm not sure what I could do (if anything) to speed it up. I'd appreciate any advise.
O(n²)
You compare each element in #$users against every element in there. That is 5E4² = 2.5E9 comparisions. For example, you wouldn't need to compare an element against itself. You also don't need to compare an element against one you have already compared. I.e. in this comparision table
X Y Z
X - + +
Y - - +
Z - - -
there only have to be three comparision to have compared each element against all others. The nine comparisions you are doing are 66% unneccessary (asymptotically: 50% unneccessary).
You can implement this by looping over indices:
for my $i (0 .. $#$users) {
my $userA = $users->[$i];
for my $j ($i+1 .. $#$users) {
my $userB = $users->[$j];
...;
}
}
But this means that upon match, you have to increment the weight of both matching users.
Do things once, not 100,000 times
You lowercase the name of each user 1E5 times. This is 1E5 - 1 times to much! Just do it once for each element, possibly at data input.
As a side note, you shouldn't perform lowercasing, you should do case folding. This is available since at least v16 via the fc feature. Just lowercasing will be buggy when you have non-english data.
use feature 'fc'; # needs v16
$user->[NAME] = fc $name;
or
use Unicode::CaseFold;
$user->[NAME] = fc $name;
When hashes are not fast enough
Hashes are fast, in that a lookup takes constant time. But a single hash lookup is more expensive than an array access. As you only have a small, predefined set of fields, you can use the following trick to use hash-like arrays:
Declare some constants with the names of your fields that map to indices, e.g.
use constant {
WEIGHT => 0,
NAME => 1,
ADDRESS => 2,
...;
};
And then put your data into arrays:
$users->[0][NAME] = $name; ...;
You can access the fields like
$userA->[WEIGHT] += 10;
While this looks like a hash, this is actually a safe method to access only certain fields of an array with minimal overhead.
Regexes are slow
Well, they are quite fast, but there is a better way to determine if a string is a substring of another string: use index. I.e.
$user_to_check->{'Name'} =~ /\Q$user->{'Name'}\E/i
Can be written as
(-1 != index $user_to_check->{Name}, $user->{Name})
assuming both are already lowercased case folded.
Alternative implementation
Edit: this appears to be invalidated by your edit to your question. This assumed you were trying to find some global similarities, not to obtain a set of good matches for each user
Implementing these ideas would make your loops look somewhat like
for my $i (0 .. $#$users) {
my $userA = $users->[$i];
for my $j ($i+1 .. $#$users) {
my $userB = $users->[$j];
if ($userA->[NAME] eq $userB->[NAME]) {
$userA->[WEIGHT] += 10;
$userB->[WEIGHT] += 10;
} elsif ((length($userA->[NAME]) > 2) && (length($userB->[NAME]) > 2))
$userA->[WEIGHT] += 5 if -1 != index $userA->[NAME], $userB->[NAME];
$userB->[WEIGHT] += 5 if -1 != index $userB->[NAME], $userA->[NAME];
}
if ($userA->[ADDRESS] eq $userB->[ADDRESS]) {
..... # More checks
}
}
}
my (#top_ten) = (sort { $b->[WEIGHT] <=> $a->[WEIGHT] } #$users)[0 .. 9];
Divide and conquer
The task you show is highly parallelizable. If you have the memory, using threads is easy here:
my $top10 = Thread::Queue->new;
my $users = ...; # each thread gets a copy of this data
my #threads = map threads->create(\&worker, $_), [0, int($#$users/2)], [int($#$users/2)+1, $#users];
# process output from the threads
while (defined(my $ret = $top10->dequeue)) {
my ($user, #top10) = #$ret;
...;
}
$_->join for #threads;
sub worker {
my ($from, $to) = #_;
for my $i ($from .. $to) {
my $userA = $users->[$i];
for $userB (#$users) {
...;
}
my #top10 = ...;
$top10->enqueue([ $userA, #top10 ]); # yield data to the main thread
}
}
You should probably return your output via a queue (as shown here), but do as much processing as possible inside the threads. With more advanced partitioning of the workload, should spawn as many threads as you have processors available.
But if any kind of pipelining, filtering or caching can decrease the number of iterations needed in the nested loops, you should do such optimizations (think map-reduce-style programming).
Edit: Elegantly reducing complexity through hashes for deduplication
What we are essentially doing is calculating a matrix of how good our records match, e.g.
X Y Z
X 9 4 5
Y 3 9 2
Z 5 2 9
If we assume that X is similar to Y implies Y is similar to X, then the matrix is symmetric, and we only need half of it:
X Y Z
X \ 4 5
Y \ 2
Z \
Such a matrix is equivalent to a weighted, undirected graph:
4 X 5 | X – Y: 4
/ \ | X – Z: 5
Y---Z | Y – Z: 2
2 |
Therefore, we can represent it elegantly as a hash of hashes:
my %graph;
$graph{X}{Y} = 4;
$graph{X}{Z} = 5;
$graph{Y}{Z} = 2;
However, such a hash structure implies a direction (from node X to node Y). To make querying the data easier, we might as well include the other direction too (due to the implementation of hashes, this won't lead to a large memory increase).
$graph{$x}{$y} = $graph{$y}{$x} += 2;
Because each node is now only connected to those nodes it is similar to, we don't have to sort through 50,000 records. For the 100th record, we can get the ten most similar nodes like
my $node = 100;
my #top10 = (sort { $graph{$node}{$b} <=> $graph{$node}{$a} } keys %{ $graph{$node} })[0 .. 9];
This would change the implementation to
my %graph;
# build the graph, using the array indices as node ID
for my $i (0 .. $#$users) {
my $userA = $users->[$i];
for my $j ($i+1 .. $#$users) {
my $userB = $users->[$j];
if ($userA->[NAME] eq $userB->[NAME]) {
$graph{$j}{$i} = $graph{$i}{$j} += 10;
} elsif ((length($userA->[NAME]) > 2) && (length($userB->[NAME]) > 2))
$graph{$j}{$i} = $graph{$i}{$j} += 5
if -1 != index $userA->[NAME], $userB->[NAME]
or -1 != index $userB->[NAME], $userA->[NAME];
}
if ($userA->[ADDRESS] eq $userB->[ADDRESS]) {
..... # More checks
}
}
}
# the graph is now fully populated.
# do somethething with each top10
while (my ($node_id, $similar) = each %graph) {
my #most_similar_ids = (sort { $similar->{$b} <=> $similar->{$a} } keys %$similar)[0 .. 9];
my ($user, #top10) = #$users[ $node_id, #most_similar_ids ];
...;
}
Building the graph this way should take half the time of naive iteration, and if the average number of edges for each node is low enough, going through similar nodes should be considerably faster.
Parallelizing this is a bit harder, as the graph each thread produces has to be combined before the data can be queried. For this, it would be best for each thread to perform the above code with the exception that the iteration bounds are given as parameters, and that only one edge should produced. The pair of edges will be completed in the combination phase:
THREAD A [0 .. 2/3] partial
\ graph
=====> COMBINE -> full graph -> QUERY
/ partial
THREAD B [2/3 .. 1] graph
# note bounds recognizing the triangular distribution of workload
However, this is only beneficial if there are only very few similar nodes for a given node, as combination is expensive.
I have data that looks like this:
my #homopol = (
["T","C","CC","G"], # part1
["T","TT","C","G","A"], #part2
["C","CCC","G"], #part3 ...upto part K=~50
);
my #prob = ([1.00,0.63,0.002,1.00,0.83],
[0.72,0.03,1.00, 0.85,1.00],
[1.00,0.97,0.02]);
# Note also that the dimension of #homopol is always exactly the same with #prob.
# Although number of elements can differ from 'part' to 'part'.
What I want to do is to
Generate all combinations of elements in part1 through out partK
Find the product of the corresponding elements in #prob.
Hence at the end we hope to get this output:
T-T-C 1 x 0.72 x 1 = 0.720
T-T-CCC 1 x 0.72 x 0.97 = 0.698
T-T-G 1 x 0.72 x 0.02 = 0.014
...
G-G-G 1 x 0.85 x 0.02 = 0.017
G-A-C 1 x 1 x 1 = 1.000
G-A-CCC 1 x 1 x 0.97 = 0.970
G-A-G 1 x 1 x 0.02 = 0.020
The problem is that the following code of mine does that by hardcoding
the loops. Since the number of parts of #homopol is can be varied and large
(e.g. ~K=50), we need a flexible and compact way to get the same result. Is there any?
I was thinking to use Algorithm::Loops, but not sure how to achieve that.
use strict;
use Data::Dumper;
use Carp;
my #homopol = (["T","C","CC","G"],
["T","TT","C","G","A"],
["C","CCC","G"]);
my #prob = ([1.00,0.63,0.002,1.00,0.83],
[0.72,0.03,1.00, 0.85,1.00],
[1.00,0.97,0.02]);
my $i_of_part1 = -1;
foreach my $base_part1 ( #{ $homopol[0] } ) {
$i_of_part1++;
my $probpart1 = $prob[0]->[$i_of_part1];
my $i_of_part2 =-1;
foreach my $base_part2 ( #{ $homopol[1] } ) {
$i_of_part2++;
my $probpart2 = $prob[1]->[$i_of_part2];
my $i_of_part3 = -1;
foreach my $base_part3 ( #{ $homopol[2] } ) {
$i_of_part3++;
my $probpart3 = $prob[2]->[$i_of_part3];
my $nstr = $base_part1."".$base_part2."".$base_part3;
my $prob_prod = sprintf("%.3f",$probpart1 * $probpart2 *$probpart3);
print "$base_part1-$base_part2-$base_part3 \t";
print "$probpart1 x $probpart2 x $probpart3 = $prob_prod\n";
}
}
}
I would recommend Set::CrossProduct, which will create an iterator to yield the cross product of all of your sets. Because it uses an iterator, it does not need to generate every combination in advance; rather, it yields each one on demand.
use strict;
use warnings;
use Set::CrossProduct;
my #homopol = (
[qw(T C CC G)],
[qw(T TT C G A)],
[qw(C CCC G)],
);
my #prob = (
[1.00,0.63,0.002,1.00],
[0.72,0.03,1.00, 0.85,1.00],
[1.00,0.97,0.02],
);
# Prepare by storing the data in a list of lists of pairs.
my #combined;
for my $i (0 .. $#homopol){
push #combined, [];
push #{$combined[-1]}, [$homopol[$i][$_], $prob[$i][$_]]
for 0 .. #{$homopol[$i]} - 1;
};
my $iterator = Set::CrossProduct->new([ #combined ]);
while( my $tuple = $iterator->get ){
my #h = map { $_->[0] } #$tuple;
my #p = map { $_->[1] } #$tuple;
my $product = 1;
$product *= $_ for #p;
print join('-', #h), ' ', join(' x ', #p), ' = ', $product, "\n";
}
A solution using Algorithm::Loops without changing the input data would look something like:
use Algorithm::Loops;
# Turns ([a, b, c], [d, e], ...) into ([0, 1, 2], [0, 1], ...)
my #lists_of_indices = map { [ 0 .. #$_ ] } #homopol;
NestedLoops( [ #lists_of_indices ], sub {
my #indices = #_;
my $prob_prod = 1; # Multiplicative identity
my #base_string;
my #prob_string;
for my $n (0 .. $#indices) {
push #base_string, $hompol[$n][ $indices[$n] ];
push #prob_string, sprintf("%.3f", $prob[$n][ $indices[$n] ]);
$prob_prod *= $prob[$n][ $indices[$n] ];
}
print join "-", #base_string; print "\t";
print join "x", #prob_string; print " = ";
printf "%.3f\n", $prob_prod;
});
But I think that you could actually make the code clearer by changing the structure to one more like
[
{ T => 1.00, C => 0.63, CC => 0.002, G => 0.83 },
{ T => 0.72, TT => 0.03, ... },
...
]
because without the parallel data structures you can simply iterate over the available base sequences, instead of iterating over indices and then looking up those indices in two different places.
Why don't you use recursion? Pass the depth as a parameter and let the function call itself with depth+1 inside the loop.
you could do it by creating an array of indicies the same length as the #homopol array (N say), to keep track of which combination you are looking at. In fact this array is just like a
number in base N, with the elements being the digits. Iterate in the same way as you would write down consectutive numbers in base N, e.g (0 0 0 ... 0), (0 0 0 ... 1), ...,(0 0 0 ... N-1), (0 0 0 ... 1 0), ....
Approach 1: Calculation from indices
Compute the product of lengths in homopol (length1 * length2 * ... * lengthN). Then, iterate i from zero to the product. Now, the indices you want are i % length1, (i / length1)%length2, (i / length1 / length2) % length3, ...
Approach 2: Recursion
I got beaten to it, see nikie's answer. :-)