I have a sparse matrix in a data file produced by a code that is not MATLAB. The data file consists of four columns. The first two columns are the real and imaginary parts of a matrix entry, and the third and fourth columns are the corresponding row and column indices respectively.
I convert this into a dense matrix in Matlab using the following script.
tic
dataA = load('sparse_LHS.dat');
toc
% Initialise matrix
tic
Nr = 15; Nz = 15; Neq = 5;
A (Nr*Nz*Neq,Nr*Nz*Neq) = 0;
toc
tic
lA = length(dataA)
rowA = dataA(:,3); colA = dataA(:,4);
toc
tic
for i = 1:lA
A(rowA(i), colA(i)) = complex(dataA(i,1), dataA(i,2));
end
toc
This script is, however, very slow (the for loop is the culprit).
Elapsed time is 0.599023 seconds.
Elapsed time is 0.001978 seconds.
Elapsed time is 0.000406 seconds.
Elapsed time is 275.462138 seconds.
Is there any fast way of doing this in matlab?
Here is what I tried so far:
parfor - This gives me
valid indices are restricted in parfor loops
I tried to recast the for loop as something like this:
A(rowA(:),colA(:)) = complex(dataA(:,1), dataA(:,2));
and I get an error
Subscripted assignment dimension mismatch.
The reason your last try doesn't work is that Matlab can't take a list of subscripts for both columns and rows, and match them to assign elements in order. Instead, it's making all the combinations of rows and columns from the list - this is how it looks:
dataA = magic(4)
dataA =
16 2 3 13
5 11 10 8
9 7 6 12
4 14 15 1
dataA([1,2],[1,4]) =
16 13
5 8
So we got 4 elements ([1,1],[1,4],[2,1],[2,4]) instead of 2 ([1,1] and [2,4]).
In order to use subscripts in a list, you need to convert them to linear indices, and one simple way to do this is the function sub2ind.
Using this function you can write the following code to do it all at once:
% Initialise matrix
Nr = 15; Nz = 15; Neq = 5;
A(Nr*Nz*Neq,Nr*Nz*Neq) = 0;
% Place all complex values from dataA(:,1:2) into A at the subscripts in dataA(:,3:4):
A(sub2ind(size(A),dataA(:,3),dataA(:,4))) = complex(dataA(:,1), dataA(:,2));
sub2ind is not especially fast (though it will be much quicker than your loop), so if you have a lot of data, you might want to compute the linear index yourself:
rowA = dataA(:,3);
colA = dataA(:,4);
% compute the linear index:
ind = (colA-1)*size(A,1)+rowA;
% Place all complex values from dataA(:,1:2) into A at the linear indices 'ind':
A(ind) = complex(dataA(:,1), dataA(:,2));
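As a quick check on a toy matrix (the 4-by-4 size and the three entries below are made up for illustration), the sub2ind result and the hand-computed linear index agree. The sketch also shows MATLAB's built-in sparse constructor as a further option not discussed above; note that, unlike indexed assignment, sparse sums the values of any repeated (row, column) pairs.

```matlab
% Toy data in the same four-column layout: [real, imag, row, col]
dataA = [1.0  0.5  1 1;
         2.0 -1.0  2 4;
         0.0  3.0  3 2];
n = 4;                                   % assumed matrix size for this toy case
vals = complex(dataA(:,1), dataA(:,2));  % complex entries
rowA = dataA(:,3);
colA = dataA(:,4);

% Dense matrix filled through linear indices from sub2ind
A = zeros(n);
ind = sub2ind(size(A), rowA, colA);
A(ind) = vals;

% The same linear indices computed by hand (column-major storage)
indManual = (colA - 1)*size(A,1) + rowA;

% Alternative: build a sparse matrix directly, then densify if needed
S = sparse(rowA, colA, vals, n, n);
Afull = full(S);
```

Here ind and indManual are both [1; 14; 7], and Afull holds the same entries as A.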
P.S.:
If you are using Matlab R2015b or later:
A = zeros(Nr*Nz*Neq,Nr*Nz*Neq);
is quicker than:
A(Nr*Nz*Neq,Nr*Nz*Neq) = 0;
How to randomly pick up N numbers from a vector a with weight assigned to each number?
Let's say:
a = 1:3; % possible numbers
weight = [0.3 0.1 0.2]; % corresponding weights
In this case probability to pick up 1 should be 3 times higher than to pick up 2.
Sum of all weights can be anything.
R = randsample([1 2 3], N, true, [0.3 0.1 0.2])
randsample is included in the Statistics Toolbox
Otherwise you can use some kind of roulette-wheel selection process. See this similar question (although not MATLAB specific). Here's my one-line implementation:
a = 1:3; %# possible numbers
w = [0.3 0.1 0.2]; %# corresponding weights
N = 10; %# how many numbers to generate
R = a( sum( bsxfun(@ge, rand(N,1), cumsum(w./sum(w))), 2) + 1 )
Explanation:
Consider the interval [0,1]. We assign to each element in the list (1:3) a sub-interval whose length is proportional to the weight of that element; therefore 1 gets an interval of length 0.3/(0.3+0.1+0.2), and likewise for the others.
Now if we generate a random number with uniform distribution over [0,1], then any number in [0,1] has an equal probability of being picked, thus the sub-intervals' lengths determine the probability of the random number falling in each interval.
This matches what I'm doing above: pick a number X~U[0,1] (or rather N such numbers), then find which interval each one falls into, in a vectorized way.
You can check the results of the two techniques above by generating a large enough sequence N=1000:
>> tabulate( R )
Value Count Percent
1 511 51.10%
2 160 16.00%
3 329 32.90%
which more or less matches the normalized weights w./sum(w) = [0.5 0.16667 0.33333]
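To make the interval lookup concrete, here is a small sketch with three hand-picked values standing in for rand(N,1) (the values 0.1, 0.55 and 0.9 are assumptions, chosen so that one lands in each sub-interval):

```matlab
a = 1:3;                      % possible numbers
w = [0.3 0.1 0.2];            % corresponding weights
u = [0.1; 0.55; 0.9];         % fixed stand-ins for rand(3,1)

upper = cumsum(w ./ sum(w));  % upper ends of the sub-intervals, roughly [0.5 0.6667 1]

% For each u, count how many upper ends it has reached or passed;
% adding 1 gives the index of the sub-interval that contains it.
k = sum(bsxfun(@ge, u, upper), 2) + 1;
R = a(k);                     % picks 1, 2 and 3 respectively
```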
Amro gives a nice answer (that I upvoted), but it becomes very expensive if you wish to generate many numbers from a large set. This is because the bsxfun operation can generate a huge array, which is then summed. For example, suppose I had a set of 10000 values to sample from, all with different weights, and wanted to generate 1000000 numbers from that sample.
This will take some work to do, since it will generate a 10000x1000000 array internally, with 10^10 elements in it. It will be a logical array, but even so, 10 gigabytes of RAM must be allocated.
A better solution is to use histc. Thus...
a = 1:3
w = [.3 .1 .2];
N = 10;
[~,R] = histc(rand(1,N),cumsum([0;w(:)./sum(w)]));
R = a(R)
R =
1 1 1 2 2 1 3 1 1 1
However, for a large problem of the size I suggested above, it is fast.
a = 1:10000;
w = rand(1,10000);
N = 1000000;
tic
[~,R] = histc(rand(1,N),cumsum([0;w(:)./sum(w)]));
R = a(R);
toc
Elapsed time is 0.120879 seconds.
Admittedly, my version takes 2 lines to write. The indexing operation must happen on a second line since it uses the second output of histc. Also note that I've used a feature of newer MATLAB releases: the tilde (~) placeholder in place of the first output of histc. This causes that first output to be immediately dumped in the bit bucket.
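In newer MATLAB releases histc is flagged as not recommended. A minimal sketch of the same lookup with discretize (available from R2015a, as far as I know) follows. The fixed values in u are assumptions standing in for rand(1,N), and pinning the last edge to exactly 1 is a precaution against round-off in cumsum, not part of the answer above.

```matlab
a = 1:3;
w = [.3 .1 .2];
u = [0.1 0.55 0.9];                    % fixed stand-ins for rand(1,N)

edges = cumsum([0; w(:) ./ sum(w)]);   % bin edges, roughly [0; 0.5; 0.6667; 1]
edges(end) = 1;                        % guard against round-off in the last edge
bin = discretize(u, edges);            % bin index of each value, like histc's second output
R = a(bin);
```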
TL;DR
For maximum performance, if you only need a single sample, use
R = a( sum( (rand(1) >= cumsum(w./sum(w)))) + 1 );
and if you need multiple samples, use
[~, R] = histc(rand(N,1),cumsum([0;w(:)./sum(w)]));
Avoid randsample. Generating multiple samples upfront is three orders of magnitude faster than generating individual values.
Performance metrics
Since this showed up near the top of my Google search, I just wanted to add some performance metrics to show that the right solution will depend very much on the value of N and the requirements of the application, and that changing the design of the application can dramatically increase performance.
For large N, or indeed N > 1:
a = 1:3; % possible numbers
w = [0.3 0.1 0.2]; % corresponding weights
N = 100000000; % number of values to generate
w_normalized = w / sum(w) % normalised weights, for indication
fprintf('randsample:\n');
tic
R = randsample(a, N, true, w);
toc
tabulate(R)
fprintf('bsxfun:\n');
tic
R = a( sum( bsxfun(@ge, rand(N,1), cumsum(w./sum(w))), 2) + 1 );
toc
tabulate(R)
fprintf('histc:\n');
tic
[~, R] = histc(rand(N,1),cumsum([0;w(:)./sum(w)]));
toc
tabulate(R)
Results:
w_normalized =
0.5000 0.1667 0.3333
randsample:
Elapsed time is 2.976893 seconds.
Value Count Percent
1 49997864 50.00%
2 16670394 16.67%
3 33331742 33.33%
bsxfun:
Elapsed time is 2.712315 seconds.
Value Count Percent
1 49996820 50.00%
2 16665005 16.67%
3 33338175 33.34%
histc:
Elapsed time is 2.078809 seconds.
Value Count Percent
1 50004044 50.00%
2 16665508 16.67%
3 33330448 33.33%
In this case, histc is fastest.
However, when it is not possible to generate all N values up front, perhaps because the weights are updated on each iteration (i.e. N=1):
a = 1:3; % possible numbers
w = [0.3 0.1 0.2]; % corresponding weights
I = 100000; % number of values to generate
w_normalized = w / sum(w) % normalised weights, for indication
R=zeros(I,1);
fprintf('randsample:\n');
tic
for i=1:I
R(i) = randsample(a, 1, true, w);
end
toc
tabulate(R)
fprintf('cumsum:\n');
tic
for i=1:I
R(i) = a( sum( (rand(1) >= cumsum(w./sum(w)))) + 1 );
end
toc
tabulate(R)
fprintf('histc:\n');
tic
for i=1:I
[~, R(i)] = histc(rand(1),cumsum([0;w(:)./sum(w)]));
end
toc
tabulate(R)
Results:
0.5000 0.1667 0.3333
randsample:
Elapsed time is 3.526473 seconds.
Value Count Percent
1 50437 50.44%
2 16149 16.15%
3 33414 33.41%
cumsum:
Elapsed time is 0.473207 seconds.
Value Count Percent
1 50018 50.02%
2 16748 16.75%
3 33234 33.23%
histc:
Elapsed time is 1.046981 seconds.
Value Count Percent
1 50134 50.13%
2 16684 16.68%
3 33182 33.18%
In this case, the custom cumsum approach (based on the bsxfun version) is fastest.
In any case, randsample certainly looks like a bad choice all round. It also goes to show that if an algorithm can be arranged to generate all random variables upfront then it will perform much better (note that three orders of magnitude fewer values are generated in the N=1 case in a similar execution time).
Code is available here.
Amro has a really nice answer for this topic. However, one might want a super-fast implementation to sample from huge PDFs where the domain might contain several thousand values. For such scenarios, it might be costly to use bsxfun and cumsum very frequently. Motivated by Gnovice's answer, it makes sense to implement the roulette wheel algorithm with a run-length encoding scheme. I benchmarked Amro's solution against the new code:
%% Toy example: generate random numbers from an arbitrary PDF
a = 1:3; %# domain of PDF
w = [0.3 0.1 0.2]; %# Probability Values (Weights)
N = 10000; %# Number of random generations
%Generate using roulette wheel + run length encoding
factor = 1 / min(w); %Compute min factor to assign 1 bin to min(PDF)
intW = int32(w * factor); %Get replicator indexes for run length encoding
idxArr = zeros(1,sum(intW)); %Create index access array
idxArr([1 cumsum(intW(1:end-1))+1]) = 1;%Tag sample change indexes
sampTable = a(cumsum(idxArr)); %Create lookup table filled with samples
len = size(sampTable,2);
tic;
R = sampTable( uint32(randi([1 len],N,1)) );
toc;
tabulate(R);
Some evaluations of the code above for very large data, where the domain of the PDF is very long:
a ~ 15000, n = 10000
Without table: Elapsed time is 0.006203 seconds.
With table: Elapsed time is 0.003308 seconds.
ByteSize(sampTable) 796.23 kb
a ~ 15000, n = 100000
Without table: Elapsed time is 0.003510 seconds.
With table: Elapsed time is 0.002823 seconds.
a ~ 35000, n = 10000
Without table: Elapsed time is 0.226990 seconds.
With table: Elapsed time is 0.001328 seconds.
ByteSize(sampTable) 2.79 Mb
a ~ 35000 n = 100000
Without table: Elapsed time is 2.784713 seconds.
With table: Elapsed time is 0.003452 seconds.
a ~ 35000 n = 1000000
Without table: bsxfun: out of memory
With table : Elapsed time is 0.021093 seconds.
The idea is to create a run-length encoding table in which frequent values of the PDF are replicated more often than infrequent values. At the end of the day, we sample an index into the weighted sample table using a uniform distribution, and use the corresponding value.
It is memory intensive, but with this approach it is even possible to scale up to PDF lengths of hundreds of thousands, and access is super-fast.
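The table construction can be sketched more compactly with repelem (R2015a or later, to my knowledge) instead of the cumsum tagging used above; this is an illustrative variant, not the benchmarked code. One caveat worth stating: the table reproduces the weights exactly only when every weight is an integer multiple of min(w), because the replication counts are rounded.

```matlab
a = 1:3;                          % domain of the PDF
w = [0.3 0.1 0.2];                % weights
N = 10000;                        % number of samples to draw

counts = round(w ./ min(w));      % replication count per value: [3 1 2]
sampTable = repelem(a, counts);   % lookup table: [1 1 1 2 3 3]

% Drawing a sample is now a uniform index into the table
R = sampTable(randi(numel(sampTable), N, 1));
```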
Let's assume I have the following 9 x 5 matrix:
myArray = [
54.7 8.1 81.7 55.0 22.5
29.6 92.9 79.4 62.2 17.0
74.4 77.5 64.4 58.7 22.7
18.8 48.6 37.8 20.7 43.5
68.6 43.5 81.1 30.1 31.1
18.3 44.6 53.2 47.0 92.3
36.8 30.6 35.0 23.0 43.0
62.5 50.8 93.9 84.4 18.4
78.0 51.0 87.5 19.4 90.4
];
I have 11 "subsets" of this matrix and I need to run a function (let's say max) on each of these subsets. The subsets can be identified with the following matrix of logicals (identified column-wise, not row-wise):
myLogicals = logical([
0 1 0 1 1
1 1 0 1 1
1 1 0 0 0
0 1 0 1 1
1 0 1 1 1
1 1 1 1 0
0 1 1 0 1
1 1 0 0 1
1 1 0 0 1
]);
or via linear indexing:
starts = [2 5 8 10 15 23 28 31 37 40 43]; % index start of each subset
ends = [3 6 9 13 18 25 29 33 38 41 45]; % index end of each subset
such that the first subset is 2:3, the second is 5:6, and so on.
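If only myLogicals is available, the starts and ends vectors can be derived from it. The sketch below is one way to do so, assuming (as in the example) that a run never continues across a column boundary; the row of zeros added above and below each column enforces that. The helper names padded, d, rs, cs, re and ce are illustrative.

```matlab
myLogicals = logical([
    0 1 0 1 1
    1 1 0 1 1
    1 1 0 0 0
    0 1 0 1 1
    1 0 1 1 1
    1 1 1 1 0
    0 1 1 0 1
    1 1 0 0 1
    1 1 0 0 1
    ]);
[m, n] = size(myLogicals);

% Pad each column with a zero on top and bottom so every run has a
% rising edge (+1) at its first row and a falling edge (-1) after its last row.
padded = [zeros(1,n); double(myLogicals); zeros(1,n)];
d = diff(padded);                 % (m+1)-by-n matrix of edges

[rs, cs] = find(d == 1);          % row/column of each run start
[re, ce] = find(d == -1);         % row after each run end, and its column

starts = ((cs - 1)*m + rs).';     % linear index of each run start
ends   = ((ce - 1)*m + re - 1).'; % linear index of each run end
```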
I can find the max of each subset and store it in a vector as follows:
finalAnswers = NaN(11,1);
for n=1:length(starts) % i.e. 1 through the number of subsets
finalAnswers(n) = max(myArray(starts(n):ends(n)));
end
After the loop runs, finalAnswers contains the maximum value of each of the data subsets:
74.4 68.6 78.0 92.9 51.0 81.1 62.2 47.0 22.5 43.5 90.4
Is it possible to obtain the same result without the use of a for loop? In other words, can this code be vectorized? Would such an approach be more efficient than the current one?
EDIT:
I did some testing of the proposed solutions. The data I used was a 1,510 x 2,185 matrix with 10,103 subsets that varied in length from 2 to 916 with a standard deviation of subset length of 101.92.
I wrapped each solution in tic;for k=1:1000 [code here] end; toc; and here are the results:
for loop approach --- Elapsed time is 16.237400 seconds.
Shai's approach --- Elapsed time is 153.707076 seconds.
Dan's approach --- Elapsed time is 44.774121 seconds.
Divakar's approach #2 --- Elapsed time is 127.621515 seconds.
Notes:
I also tried benchmarking Dan's approach by wrapping the k=1:1000 for loop around just the accumarray line (since the rest could theoretically be run just once). In this case the time was 28.29 seconds.
Benchmarking Shai's approach, while leaving the lb = ... line out of the k loop, the time was 113.48 seconds.
When I ran Divakar's code, I got Non-singleton dimensions of the two input arrays must match each other. errors for the bsxfun lines. I "fixed" this by using conjugate transposition (the apostrophe operator ') on trade_starts(1:starts_extent) and intv(1:starts_extent) in the lines of code calling bsxfun. I'm not sure why this error was occurring...
I'm not sure if my benchmarking setup is correct, but it appears that the for loop actually runs the fastest in this case.
One approach is to use accumarray. Unfortunately in order to do that we first need to "label" your logical matrix. Here is a convoluted way of doing that if you don't have the image processing toolbox:
sz=size(myLogicals);
s_ind(sz(1),sz(2))=0;
%// OR: s_ind = zeros(size(myLogicals))
s_ind(starts) = 1;
labelled = cumsum(s_ind(:)).*myLogicals(:);
So that just does what Shai's bwlabeln implementation does (but this will be numel(myLogicals)-by-1 in shape as opposed to size(myLogicals) in shape).
Now you can use accumarray:
accumarray(labelled(myLogicals), myArray(myLogicals), [], @max)
or else it may be faster to try
result = accumarray(labelled+1, myArray(:), [], @max);
result = result(2:end)
This is fully vectorized, but is it worth it? You'll have to do speed tests against your loop solution to know.
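For readers unfamiliar with accumarray, a toy call (the values and labels here are made up) shows the grouping it performs: each value is routed to the group named by its label, and the function handle is applied once per group.

```matlab
vals   = [5 1 7 2 9 3];
labels = [1 1 2 2 2 3];          % group id of each value

% Max per group: group 1 -> max(5,1), group 2 -> max(7,2,9), group 3 -> 3
groupMax = accumarray(labels(:), vals(:), [], @max);
```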
Use bwlabeln with a vertical connectivity:
lb = bwlabeln( myLogicals, [0 1 0; 0 1 0; 0 1 0] );
Now you have a label 1..11 for each region.
To get max value you can use regionprops
props = regionprops( lb, myArray, 'MaxIntensity' );
finalAnswers = [props.MaxIntensity];
You can use regionprops to get some other properties of each subset, but it is not too general.
If you wish to apply a more general function to each region, e.g., median, you can use accumarray:
finalAnswer = accumarray( lb( myLogicals ), myArray( myLogicals ), [], @median );
Ideas behind vectorization and optimization
One of the approaches that one can employ to vectorize this problem is to convert the subsets into regular shaped blocks and then find the max of the elements of those blocks in one go. Converting to regular shaped blocks has one issue here: the subsets are unequal in length. To avoid this issue, one can create a 2D matrix of indices starting from each of the starts elements and extending until the maximum of the subset lengths. The good thing about this is that it allows vectorization, but at the cost of more memory, which depends on the scattered-ness of the subset lengths.
Another issue with this vectorization technique is that it could potentially create out-of-limits indices for the final subsets.
To avoid this, one can think of two possible ways -
Use a bigger input array by extending the input array such that the maximum of the subset lengths plus the starts indices still lies within the confines of the extended array.
Use the original input array for starts until we are within the limits of the original input array and then for the rest of the subsets use the original loop code. We can call it mixed programming just for the sake of having a short title. This saves the memory needed to create the extended array in the first way.
These two ways/approaches are listed next.
Approach #1: Vectorized technique
[m,n] = size(myArray); %// store no. of rows and columns in input array
intv = ends-starts; %// intervals
max_intv = max(intv); %// max interval
max_intv_arr = [0:max_intv]'; %//'# array of max indices extent
[row1,col1] = ind2sub([m n],starts); %// get starts row and column indices
m_ext = max(row1+max_intv); %// no. of rows in extended input array
myArrayExt(m_ext,n)=0; %// extended form of input array
myArrayExt(1:m,:) = myArray;
%// New linear indices for extended form of input array
idx = bsxfun(@plus,max_intv_arr,(col1-1)*m_ext+row1);
%// Index into extended array; select only valid ones by setting rest to nans
selected_ele = myArrayExt(idx);
selected_ele(bsxfun(@gt,max_intv_arr,intv))= nan;
%// Get the max of the valid ones for the desired output
out = nanmax(selected_ele); %// desired output
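A toy instance of the padded-index idea may help (the vector v and its three subsets are made up). It uses plain max, which skips NaN values by default, where the listing above calls nanmax; the starts were chosen so that no padded index runs past the end of v.

```matlab
v      = [4 9 1 7 3 8 2 6];       % toy input, treated as linear storage
starts = [1 4 6];                 % first index of each subset
ends   = [2 5 8];                 % last index of each subset

intv    = ends - starts;          % subset lengths minus one: [1 1 2]
offsets = (0:max(intv)).';        % column of offsets 0..max interval

% One column of indices per subset, all padded to the longest subset
idx = bsxfun(@plus, offsets, starts);
sel = v(idx);                     % same shape as idx

% Blank out the padding that lies beyond each subset's own end
sel(bsxfun(@gt, offsets, intv)) = NaN;

out = max(sel, [], 1);            % column-wise max, NaN entries ignored
```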
Approach #2: Mixed programming
%// PART - I: Vectorized technique for subsets that when normalized
%// with max extents still lie within limits of input array
intv = ends-starts; %// intervals
max_intv = max(intv); %// max interval
%// Find the last subset that when extended by max interval would still
%// lie within the limits of input array
starts_extent = find(starts+max_intv<=numel(myArray),1,'last');
max_intv_arr = [0:max_intv]'; %//'# Array of max indices extent
%// Index into extended array; select only valid ones by setting rest to nans
selected_ele = myArray(bsxfun(@plus,max_intv_arr,starts(1:starts_extent)));
selected_ele(bsxfun(@gt,max_intv_arr,intv(1:starts_extent))) = nan;
out(numel(starts)) = 0; %// storage for output
out(1:starts_extent) = nanmax(selected_ele); %// output values for part-I
%// PART - II: Process rest of input array elements
for n = starts_extent+1:numel(starts)
out(n) = max(myArray(starts(n):ends(n)));
end
Benchmarking
In this section we will compare the two approaches and the original loop code against each other for performance. Let's set up the code before starting the actual benchmarking -
N = 10000; %// No. of subsets
M1 = 1510; %// No. of rows in input array
M2 = 2185; %// No. of cols in input array
myArray = rand(M1,M2); %// Input array
num_runs = 50; %// no. of runs for each method
%// Form the starts and ends by getting a sorted random integers array from
%// 1 to one minus no. of elements in input array. That minus one is
%// compensated later on into ends because we don't want any subset with
%// starts and ends as the same index
y1 = reshape(sort(randi(numel(myArray)-1,1,2*N)),2,[]);
starts = y1(1,:);
ends = y1(2,:)+1;
%// Remove identical starts elements
invalid = [false any(diff(starts,[],2)==0,1)];
starts = starts(~invalid);
ends = ends(~invalid);
%// Create myLogicals
myLogicals = false(size(myArray));
for k1=1:numel(starts)
myLogicals(starts(k1):ends(k1))=1;
end
clear invalid y1 k1 M1 M2 N %// clear unnecessary variables
%// Warm up tic/toc.
for k = 1:100
tic(); elapsed = toc();
end
Now, the placeholder code that gets us the runtimes -
disp('---------------------- With Original loop code')
tic
for iter = 1:num_runs
%// ...... approach #1 codes
end
toc
%// clear out variables used in the above approach
%// repeat this for approach #1,2
Benchmark Results
In your comments, you mentioned using a 1510 x 2185 matrix, so let's do two case runs with that size, with 10000 and 2000 subsets.
Case 1 [Input - 1510 x 2185 matrix, Subsets - 10000]
---------------------- With Original loop code
Elapsed time is 15.625212 seconds.
---------------------- With Approach #1
Elapsed time is 12.102567 seconds.
---------------------- With Approach #2
Elapsed time is 0.983978 seconds.
Case 2 [Input - 1510 x 2185 matrix, Subsets - 2000]
---------------------- With Original loop code
Elapsed time is 3.045402 seconds.
---------------------- With Approach #1
Elapsed time is 11.349107 seconds.
---------------------- With Approach #2
Elapsed time is 0.214744 seconds.
Case 3 [Bigger Input - 3000 x 3000 matrix, Subsets - 20000]
---------------------- With Original loop code
Elapsed time is 12.388061 seconds.
---------------------- With Approach #1
Elapsed time is 12.545292 seconds.
---------------------- With Approach #2
Elapsed time is 0.782096 seconds.
Note that the number of runs num_runs was varied to keep the runtime of the fastest approach close to 1 sec.
Conclusions
So, I guess mixed programming (approach #2) is the way to go! As future work, one could use the standard deviation of the subset lengths as a scattered-ness criterion and, if performance suffers because of the scattered-ness, offload the most scattered subsets (in terms of their lengths) to the loop code.
Efficiency
Measure both the vectorised & for-loop code samples on your respective platform ( be it a <localhost> or Cloud-based ) to see the difference:
MATLAB:7> tic();max( myArray( startIndex(:):endIndex(:) ) );toc() %% Details
Elapsed time is 0.0312 seconds. %% below.
%% Code is not
%% the merit,
%% method is:
and
tic(); %% for/loop
for n = 1:length( startIndex ) %% may be
max( myArray( startIndex(n):endIndex(n) ) ); %% significantly
end %% faster than
toc(); %% vectorised
Elapsed time is 0.125 seconds. %% setup(s)
%% overhead(s)
%% As commented below,
%% subsequent re-runs yield unrealistic results due to caching artifacts
Elapsed time is 0 seconds.
Elapsed time is 0 seconds.
Elapsed time is 0 seconds.
%% which are not so straight visible if encapsulated in an artificial in-vitro
%% via an outer re-run repetitions ( for k=1:1000 ) et al ( ref. in text below )
For a better interpretation of the test results, test on much larger sizes than just a few tens of rows/columns.
EDIT:
Erroneous code removed; thanks to Dan for the notice. My emphasis on quantitative validation, which may show that vectorised code may, but need not in all circumstances, be faster, is of course no excuse for faulty code.
Output - quantitatively comparative data:
While it is often recommended, IMHO it is not fair to assume that memalloc and similar overheads should be excluded from in-vivo testing. Test re-runs typically show VM-page hit improvements and other caching artifacts, while the raw first ("virgin") run is what typically appears in real code deployment (excluding external iterators, for sure). So consider the results with care and retest in your real environment (which is sometimes run as a virtual machine inside a bigger system; that also makes it necessary to take VM-swap mechanics into account once huge matrices start to hurt real-life memory-access patterns).
On other projects I am used to [usec] granularity in realtime test timing, but then even more care has to be taken about the test-execution conditions and O/S background.
So nothing but testing gives relevant answers for your specific code/deployment situation; however, be methodical and compare only data that are comparable in principle.
Alarik's code:
MATLAB:8> tic(); for k=1:1000 % ( flattens memalloc issues & al )
> for n = 1:length( startIndex )
> max( myArray( startIndex(n):endIndex(n) ) );
> end;
> end; toc()
Elapsed time is 0.2344 seconds.
%% time is 0.0002 seconds per k-for-loop <--[ ref.^ remarks on testing ]
Dan's code:
MATLAB:9> tic(); for k=1:1000
> s_ind( size( myLogicals ) ) = 0;
> s_ind( startIndex ) = 1;
> labelled = cumsum( s_ind(:) ).*myLogicals(:);
> result = accumarray( labelled + 1, myArray(:), [], @max );
> end; toc()
error: product: nonconformant arguments (op1 is 43x1, op2 is 45x1)
%%
%% [Work in progress] to find my mistake -- sorry for not being able to reproduce
%% Dan's code and to make it work
%%
%% Both myArray and myLogicals shape was correct ( 9 x 5 )
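One likely cause of the nonconformant error, offered as a hedged reading of the listing rather than a confirmed diagnosis: s_ind( size( myLogicals ) ) = 0 uses the size vector [9 5] as indices, so it touches elements 9 and 5 instead of preallocating a 9-by-5 array. The later assignment at startIndex then grows the vector only up to its largest index, 43, which matches the 43x1 operand in the message. The variable names bad and good below are illustrative.

```matlab
myLogicals = false(9, 5);                      % only the shape matters here
startIndex = [2 5 8 10 15 23 28 31 37 40 43];

% Indexing with the size vector: assigns elements 9 and 5, giving a 1x9 vector
bad = [];
bad(size(myLogicals)) = 0;
bad(startIndex) = 1;                           % grows to 43 elements, not 45

% Preallocating with zeros gives the intended 9x5 array (45 elements)
good = zeros(size(myLogicals));
good(startIndex) = 1;

labelled = cumsum(good(:)) .* myLogicals(:);   % operands now both 45x1
```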
I'd like to look up 3 integers (i.e. [1 2 3]) in a large data set of around a million points.
I'm currently using MATLAB's Map (hashmap), and for each point I'm doing the following:
key = sprintf('%d ', [1 2 3]); % 23 us
% key = '1 2 3 '
result = lookup_map( key ); % 32 us
This is quite time consuming though - 1 million points * 55 us = 55 seconds.
I'd like to move this to the GPU using CUDA, but I'm not sure of the best way of approaching this.
I could transfer four arrays - key1, key2, key3, result, and then perform binary search on the keys, but this would take 20 iterations (2^20 = 1048576) per key. Then I'd also have delays due to concurrent memory access from each thread.
Is there a data structure optimised for parallel (O(1), ideally) multiple key lookups in CUDA?
Q: What are the bounds of the three integers? And what data is looked up?
The integer keys can be between 0 and ~75,000 currently, but may be bigger (200,000+) in the future.
For the purposes of this question, we can assume that the result is an integer between 0 and the size of the data set.
Q: Why don't you pack all three numbers into one 64-bit number (21 bits per number gives you a range of 0-2,097,151), and use that to index into a sparse array?
>> A = uint64(ones(10));
>> sparse_A = sparse(A)
??? Undefined function or method 'sparse' for input arguments of type 'uint64'.
>> A = int64(ones(10));
>> sparse_A = sparse(A)
??? Undefined function or method 'sparse' for input arguments of type 'int64'.
It appears that my matlab doesn't support sparse arrays of 64-bit numbers.
In case this helps anyone else, I wrote a quick function to create a 64-bit key from three <2^21 unsigned integers:
function [key] = to_key(face)
key = bitsll(uint64(face(1)), 42) + bitsll(uint64(face(2)), 21) + uint64(face(3));
end
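For readers without bitsll (which, as far as I know, ships with Fixed-Point Designer), the same packing can be sketched with base MATLAB's bitshift, bitor and bitand on uint64 values. The variable names and the unpacking step are illustrative additions.

```matlab
face = [1 2 3];                          % three integers, each < 2^21
f = uint64(face);                        % cast first so the shifts act on 64-bit integers

% Pack: bits 42..62 hold f(1), bits 21..41 hold f(2), bits 0..20 hold f(3)
key = bitor(bitor(bitshift(f(1), 42), bitshift(f(2), 21)), f(3));

% Unpack: shift right and mask off the low 21 bits
mask = uint64(2^21 - 1);
unpacked = [bitshift(key, -42), ...
            bitand(bitshift(key, -21), mask), ...
            bitand(key, mask)];
```

Casting to uint64 before shifting matters: three 21-bit fields need up to 63 bits, more than a double can represent exactly.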
Q: From @Dennis - why not use logical indexing?
Let's test it!
% Generate ten million rows of random integers between 0 and 999
>> M = int32(floor(rand(10000000,4)*1000));
% Find a point to look for
>> search = M(500000,1:3)
search =
850 910 581
>> tic; idx = M(:,1)==search(1) & M(:,2)==search(2)&M(:,3)==search(3); toc;
Elapsed time is 0.089801 seconds.
>> M(idx,:)
ans =
850 910 581 726
Unfortunately this takes 89801us, which is 1632x slower than my existing solution (55us)! It would take about 25 hours to run this a million times!
We could try filtering M after each search:
>> tic; idx1=M(:,1)==search(1); N=M(idx1,:); idx2=N(:,2)==search(2); N2=N(idx2,:); idx3 = N2(:,3)==search(3); toc;
Elapsed time is 0.038272 seconds.
This is a little faster, but still 696x slower than using Map.
I've been thinking about this some more, and I've decided to profile the speed of re-generating some of the data on the fly from a single key lookup - it might be faster than a 3 key lookup, given the potential problems with this approach.
I'm guessing this question is related to your previous question about tetrahedron faces. I still suggest you should try the sparse storage and sparse matrix-vector multiplication for that purpose:
size(spA)
ans =
1244810 1244810
tic;
vv = spA*v;
idx = find(vv);
toc;
Elapsed time is 0.106581 seconds.
This is just timing analysis, see my previous answer about how to implement it in your case. Before you move to CUDA and do complicated stuff, check out simpler options.
Given the attention this question has already received it feels like this answer is too simple, but why don't you just do it like this:
M=[1:6; 2:7; 3:8; 4:9]'; %Some matrix that contains key 1 2 3, corresponding value is 4
idx=M(:,1)==1&M(:,2)==2&M(:,3)==3;
M(idx,4)
This should evaluate quite fast, even if M is 1 million x 4.
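When many keys need to be looked up at once, a batched variant avoids paying the per-call cost a million times. The sketch below uses ismember with the 'rows' option on the toy matrix above; the queries matrix is made up, and whether this beats the Map on the real data would need measuring.

```matlab
M = [1:6; 2:7; 3:8; 4:9]';               % columns 1-3 are the key, column 4 the value
queries = [3 4 5; 1 2 3; 9 9 9];         % the last row is deliberately absent from M

% found(i) says whether query i exists; loc(i) is its row in M (0 if absent)
[found, loc] = ismember(queries, M(:,1:3), 'rows');
values = M(loc(found), 4);               % values for the keys that exist
```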