3 integer key lookup in CUDA - matlab

I'd like to look up 3 integers (i.e. [1 2 3]) in a large data set of around a million points.
I'm currently using MATLAB's Map (hashmap), and for each point I'm doing the following:
key = sprintf('%d ', [1 2 3]); % 23 us
% key = '1 2 3 '
result = lookup_map( key ); % 32 us
This is quite time consuming though - 1 million points * 55 us = 55 seconds.
I'd like to move this to the GPU using CUDA, but I'm not sure of the best way of approaching this.
I could transfer four arrays - key1, key2, key3, result, and then perform binary search on the keys, but this would take 20 iterations (2^20 = 1048576) per key. Then I'd also have delays due to concurrent memory access from each thread.
Is there a data structure optimised for parallel (O(1), ideally) multiple key lookups in CUDA?
Q: What are the bounds of the three integers? And what data is looked up?
The integer keys can be between 0 and ~75,000 currently, but may be bigger (200,000+) in the future.
For the purposes of this question, we can assume that the result is an integer between 0 and the size of the data set.
Q: Why don't you pack all three numbers into one 64bit number (21 bits per number gives you a range of 0-2,097,152). And use that to index into a sparse array?
>> A = uint64(ones(10));
>> sparse_A = sparse(A)
??? Undefined function or method 'sparse' for input arguments of type 'uint64'.
>> A = int64(ones(10));
>> sparse_A = sparse(A)
??? Undefined function or method 'sparse' for input arguments of type 'int64'.
It appears that my matlab doesn't support sparse arrays of 64-bit numbers.
In case this helps anyone else, I wrote a quick function to create a 64-bit key from three <2^21 unsigned integers:
function [key] = to_key(face)
key = uint64(bitsll(face(1), 42) + bitsll(face(2), 21) + rand(face(3),1));
end
Q: From #Dennis - why not use logical indexing?
Let's test it!
% Generate a million random integers between 0 and 1000
>> M = int32(floor(rand(10000000,4)*1000));
% Find a point to look for
>> search = M(500000,1:3)
search =
850 910 581
>> tic; idx = M(:,1)==search(1) & M(:,2)==search(2)&M(:,3)==search(3); toc;
Elapsed time is 0.089801 seconds.
>> M(idx,:)
ans =
850 910 581 726
Unfortunately this takes 89801us, which is 1632x slower than my existing solution (55us)! It would take 2.5 hours to run this a million times!
We could try filtering M after each search:
>> tic; idx1=M(:,1)==search(1); N=M(idx1,:); idx2=N(:,2)==search(2); N2=N(idx2,:); idx3 = N2(:,3)==search(3); toc;
Elapsed time is 0.038272 seconds.
This is a little faster, but still 696x slower than using Map.
I've been thinking about this some more, and I've decided to profile the speed of re-generating some of the data on the fly from a single key lookup - it might be faster than a 3 key lookup, given the potential problems with this approach.

I'm guessing this question is related to your previous question about tetrahedron faces. I still suggest you should try the sparse storage and sparse matrix-vector multiplication for that purpose:
size(spA)
ans =
1244810 1244810
tic;
vv = spA*v;
idx = find(vv);
toc;
Elapsed time is 0.106581 seconds.
This is just timing analysis, see my previous answer about how to implement it in your case. Before you move to CUDA and do complicated stuff, check out simpler options.

Given the attention this question has already received it feels like this answer is too simple, but why don't you just do it like this:
M=[1:6; 2:7; 3:8; 4:9]'; %Some matrix that contains key 1 2 3, corresponding value is 4
idx=M(:,1)==1&M(:,2)==2&M(:,3)==3;
M(idx,4)
This should evaluate quite fast, even if M is 1 million x 4.

Related

Extract data from multidimentional array into 2 dims based on index

I have a huge (1000000x100x7) matrix and i need to create a (1000000x100x1) matrix based on an index vector (100x1) which holds 1 2 3 4 5 6 or 7 for each location.
I do not want to use loops
The problem (I think)
First, let me try create a minimum working example that I think captures what you want to do. You have a matrix A and an index vector index:
A = rand(1000000, 100, 7);
index = randi(7, [100, 1]);
And you would like to do something like this:
[I,J,K] = size(A);
B = zeros(I,J);
for i=1:I
for j=1:J
B(i,j) = A(i,j,index(j));
end
end
Only you'd like to do so without the loops.
Linear indexing
One way to do this is by using linear indexing. This is kinda a tricky thing that depends on how the matrix is laid out in memory, and I'm gonna do a really terrible job explaining it, but you can also check out the documentation for the sub2ind and ind2sub functions.
Anyways, it means that given your (1,000,000 x 100 x 7) matrix stored in column-major format, you can refer to the same element in many different ways, i.e.:
A(i, j, k)
A(i, j + 100*(k-1))
A(i + 1000000*(j-1 + 100*(k-1)))
all refer to the same element of the matrix. Anyways, the punchline is:
linear_index = (1:J)' + J*(index-1);
B_noloop = A(:, linear_index);
And of course we should verify that this produces the same answer:
>> isequal(B, B_noloop)
ans =
1
Yay!
Performance vs. readability
So testing this on my computer, the nested loops took 5.37 seconds and the no-loop version took 0.29 seconds. However, it's kinda hard to tell what's going on in that code. Perhaps a more reasonable compromise would be:
B_oneloop = zeros(I,J);
for j=1:J
B_oneloop(:,j) = A(:,j,index(j));
end
which vectorizes the longest dimension of the matrix and thus gets most of the way there (0.43 seconds), but maintains the readability of the original code.

Matlab: creating vector from 2 input vectors [duplicate]

I'm trying to insert multiple values into an array using a 'values' array and a 'counter' array. For example, if:
a=[1,3,2,5]
b=[2,2,1,3]
I want the output of some function
c=somefunction(a,b)
to be
c=[1,1,3,3,2,5,5,5]
Where a(1) recurs b(1) number of times, a(2) recurs b(2) times, etc...
Is there a built-in function in MATLAB that does this? I'd like to avoid using a for loop if possible. I've tried variations of 'repmat()' and 'kron()' to no avail.
This is basically Run-length encoding.
Problem Statement
We have an array of values, vals and runlengths, runlens:
vals = [1,3,2,5]
runlens = [2,2,1,3]
We are needed to repeat each element in vals times each corresponding element in runlens. Thus, the final output would be:
output = [1,1,3,3,2,5,5,5]
Prospective Approach
One of the fastest tools with MATLAB is cumsum and is very useful when dealing with vectorizing problems that work on irregular patterns. In the stated problem, the irregularity comes with the different elements in runlens.
Now, to exploit cumsum, we need to do two things here: Initialize an array of zeros and place "appropriate" values at "key" positions over the zeros array, such that after "cumsum" is applied, we would end up with a final array of repeated vals of runlens times.
Steps: Let's number the above mentioned steps to give the prospective approach an easier perspective:
1) Initialize zeros array: What must be the length? Since we are repeating runlens times, the length of the zeros array must be the summation of all runlens.
2) Find key positions/indices: Now these key positions are places along the zeros array where each element from vals start to repeat.
Thus, for runlens = [2,2,1,3], the key positions mapped onto the zeros array would be:
[X 0 X 0 X X 0 0] % where X's are those key positions.
3) Find appropriate values: The final nail to be hammered before using cumsum would be to put "appropriate" values into those key positions. Now, since we would be doing cumsum soon after, if you think closely, you would need a differentiated version of values with diff, so that cumsum on those would bring back our values. Since these differentiated values would be placed on a zeros array at places separated by the runlens distances, after using cumsum we would have each vals element repeated runlens times as the final output.
Solution Code
Here's the implementation stitching up all the above mentioned steps -
% Calculate cumsumed values of runLengths.
% We would need this to initialize zeros array and find key positions later on.
clens = cumsum(runlens)
% Initalize zeros array
array = zeros(1,(clens(end)))
% Find key positions/indices
key_pos = [1 clens(1:end-1)+1]
% Find appropriate values
app_vals = diff([0 vals])
% Map app_values at key_pos on array
array(pos) = app_vals
% cumsum array for final output
output = cumsum(array)
Pre-allocation Hack
As could be seen that the above listed code uses pre-allocation with zeros. Now, according to this UNDOCUMENTED MATLAB blog on faster pre-allocation, one can achieve much faster pre-allocation with -
array(clens(end)) = 0; % instead of array = zeros(1,(clens(end)))
Wrapping up: Function Code
To wrap up everything, we would have a compact function code to achieve this run-length decoding like so -
function out = rle_cumsum_diff(vals,runlens)
clens = cumsum(runlens);
idx(clens(end))=0;
idx([1 clens(1:end-1)+1]) = diff([0 vals]);
out = cumsum(idx);
return;
Benchmarking
Benchmarking Code
Listed next is the benchmarking code to compare runtimes and speedups for the stated cumsum+diff approach in this post over the other cumsum-only based approach on MATLAB 2014B-
datasizes = [reshape(linspace(10,70,4).'*10.^(0:4),1,[]) 10^6 2*10^6]; %
fcns = {'rld_cumsum','rld_cumsum_diff'}; % approaches to be benchmarked
for k1 = 1:numel(datasizes)
n = datasizes(k1); % Create random inputs
vals = randi(200,1,n);
runs = [5000 randi(200,1,n-1)]; % 5000 acts as an aberration
for k2 = 1:numel(fcns) % Time approaches
tsec(k2,k1) = timeit(#() feval(fcns{k2}, vals,runs), 1);
end
end
figure, % Plot runtimes
loglog(datasizes,tsec(1,:),'-bo'), hold on
loglog(datasizes,tsec(2,:),'-k+')
set(gca,'xgrid','on'),set(gca,'ygrid','on'),
xlabel('Datasize ->'), ylabel('Runtimes (s)')
legend(upper(strrep(fcns,'_',' '))),title('Runtime Plot')
figure, % Plot speedups
semilogx(datasizes,tsec(1,:)./tsec(2,:),'-rx')
set(gca,'ygrid','on'), xlabel('Datasize ->')
legend('Speedup(x) with cumsum+diff over cumsum-only'),title('Speedup Plot')
Associated function code for rld_cumsum.m:
function out = rld_cumsum(vals,runlens)
index = zeros(1,sum(runlens));
index([1 cumsum(runlens(1:end-1))+1]) = 1;
out = vals(cumsum(index));
return;
Runtime and Speedup Plots
Conclusions
The proposed approach seems to be giving us a noticeable speedup over the cumsum-only approach, which is about 3x!
Why is this new cumsum+diff based approach better than the previous cumsum-only approach?
Well, the essence of the reason lies at the final step of the cumsum-only approach that needs to map the "cumsumed" values into vals. In the new cumsum+diff based approach, we are doing diff(vals) instead for which MATLAB is processing only n elements (where n is the number of runLengths) as compared to the mapping of sum(runLengths) number of elements for the cumsum-only approach and this number must be many times more than n and therefore the noticeable speedup with this new approach!
Benchmarks
Updated for R2015b: repelem now fastest for all data sizes.
Tested functions:
MATLAB's built-in repelem function that was added in R2015a
gnovice's cumsum solution (rld_cumsum)
Divakar's cumsum+diff solution (rld_cumsum_diff)
knedlsepp's accumarray solution (knedlsepp5cumsumaccumarray) from this post
Naive loop-based implementation (naive_jit_test.m) to test the just-in-time compiler
Results of test_rld.m on R2015b:
Old timing plot using R2015a here.
Findings:
repelem is always the fastest by roughly a factor of 2.
rld_cumsum_diff is consistently faster than rld_cumsum.
repelem is fastest for small data sizes (less than about 300-500 elements)
rld_cumsum_diff becomes significantly faster than repelem around 5 000 elements
repelem becomes slower than rld_cumsum somewhere between 30 000 and 300 000 elements
rld_cumsum has roughly the same performance as knedlsepp5cumsumaccumarray
naive_jit_test.m has nearly constant speed and on par with rld_cumsum and knedlsepp5cumsumaccumarray for smaller sizes, a little faster for large sizes
Old rate plot using R2015a here.
Conclusion
Use repelem below about 5 000 elements and the cumsum+diff solution above.
There's no built-in function I know of, but here's one solution:
index = zeros(1,sum(b));
index([1 cumsum(b(1:end-1))+1]) = 1;
c = a(cumsum(index));
Explanation:
A vector of zeroes is first created of the same length as the output array (i.e. the sum of all the replications in b). Ones are then placed in the first element and each subsequent element representing where the start of a new sequence of values will be in the output. The cumulative sum of the vector index can then be used to index into a, replicating each value the desired number of times.
For the sake of clarity, this is what the various vectors look like for the values of a and b given in the question:
index = [1 0 1 0 1 1 0 0]
cumsum(index) = [1 1 2 2 3 4 4 4]
c = [1 1 3 3 2 5 5 5]
EDIT: For the sake of completeness, there is another alternative using ARRAYFUN, but this seems to take anywhere from 20-100 times longer to run than the above solution with vectors up to 10,000 elements long:
c = arrayfun(#(x,y) x.*ones(1,y),a,b,'UniformOutput',false);
c = [c{:}];
There is finally (as of R2015a) a built-in and documented function to do this, repelem. The following syntax, where the second argument is a vector, is relevant here:
W = repelem(V,N), with vector V and vector N, creates a vector W where element V(i) is repeated N(i) times.
Or put another way, "Each element of N specifies the number of times to repeat the corresponding element of V."
Example:
>> a=[1,3,2,5]
a =
1 3 2 5
>> b=[2,2,1,3]
b =
2 2 1 3
>> repelem(a,b)
ans =
1 1 3 3 2 5 5 5
The performance problems in MATLAB's built-in repelem have been fixed as of R2015b. I have run the test_rld.m program from chappjc's post in R2015b, and repelem is now faster than other algorithms by about a factor 2:

How can I vectorize code that runs a function on subsets of a larger matrix?

Let's assume I have the following 9 x 5 matrix:
myArray = [
54.7 8.1 81.7 55.0 22.5
29.6 92.9 79.4 62.2 17.0
74.4 77.5 64.4 58.7 22.7
18.8 48.6 37.8 20.7 43.5
68.6 43.5 81.1 30.1 31.1
18.3 44.6 53.2 47.0 92.3
36.8 30.6 35.0 23.0 43.0
62.5 50.8 93.9 84.4 18.4
78.0 51.0 87.5 19.4 90.4
];
I have 11 "subsets" of this matrix and I need to run a function (let's say max) on each of these subsets. The subsets can be identified with the following matirx of logicals (identified column-wise, not row-wise):
myLogicals = logical([
0 1 0 1 1
1 1 0 1 1
1 1 0 0 0
0 1 0 1 1
1 0 1 1 1
1 1 1 1 0
0 1 1 0 1
1 1 0 0 1
1 1 0 0 1
]);
or via linear indexing:
starts = [2 5 8 10 15 23 28 31 37 40 43]; #%index start of each subset
ends = [3 6 9 13 18 25 29 33 38 41 45]; #%index end of each subset
such that the first subset is 2:3, the second is 5:6, and so on.
I can find the max of each subset and store it in a vector as follows:
finalAnswers = NaN(11,1);
for n=1:length(starts) #%i.e. 1 through the number of subsets
finalAnswers(n) = max(myArray(starts(n):ends(n)));
end
After the loop runs, finalAnswers contains the maximum value of each of the data subsets:
74.4 68.6 78.0 92.9 51.0 81.1 62.2 47.0 22.5 43.5 90.4
Is it possible to obtain the same result without the use of a for loop? In other words, can this code be vectorized? Would such an approach be more efficient than the current one?
EDIT:
I did some testing of the proposed solutions. The data I used was a 1,510 x 2,185 matrix with 10,103 subsets that varied in length from 2 to 916 with a standard deviation of subset length of 101.92.
I wrapped each solution in tic;for k=1:1000 [code here] end; toc; and here are the results:
for loop approach --- Elapsed time is 16.237400 seconds.
Shai's approach --- Elapsed time is 153.707076 seconds.
Dan's approach --- Elapsed time is 44.774121 seconds.
Divakar's approach #2 --- Elapsed time is 127.621515 seconds.
Notes:
I also tried benchmarking Dan's approach by wrapping the k=1:1000 for loop around just the accumarray line (since the rest could be
theoretically run just once). In this case the time was 28.29
seconds.
Benchmarking Shai's approach, while leaving the lb = ... line out
of the k loop, the time was 113.48 seconds.
When I ran Divakar's code, I got Non-singleton dimensions of the two
input arrays must match each other. errors for the bsxfun lines.
I "fixed" this by using conjugate transposition (the apostrophe
operator ') on trade_starts(1:starts_extent) and
intv(1:starts_extent) in the lines of code calling bsxfun. I'm
not sure why this error was occuring...
I'm not sure if my benchmarking setup is correct, but it appears that the for loop actually runs the fastest in this case.
One approach is to use accumarray. Unfortunately in order to do that we first need to "label" your logical matrix. Here is a convoluted way of doing that if you don't have the image processing toolbox:
sz=size(myLogicals);
s_ind(sz(1),sz(2))=0;
%// OR: s_ind = zeros(size(myLogicals))
s_ind(starts) = 1;
labelled = cumsum(s_ind(:)).*myLogicals(:);
So that just does what Shai's bwlabeln implementation does (but this will be 1-by-numel(myLogicals) in shape as opposed to size(myLogicals) in shape)
Now you can use accumarray:
accumarray(labelled(myLogicals), myArray(myLogicals), [], #max)
or else it may be faster to try
result = accumarray(labelled+1, myArray(:), [], #max);
result = result(2:end)
This is fully vectorized, but is it worth it? You'll have to do speed tests against your loop solution to know.
Use bwlabeln with a vertical connectivity:
lb = bwlabeln( myLogicals, [0 1 0; 0 1 0; 0 1 0] );
Now you have a label 1..11 for each region.
To get max value you can use regionprops
props = regionprops( lb, myArray, 'MaxIntensity' );
finalAnswers = [props.MaxIntensity];
You can use regionprops to get some other properties of each subset, but it is not too general.
If you wish to apply a more general function to each region, e.g., median, you can use accumarray:
finalAnswer = accumarray( lb( myLogicals ), myArray( myLogicals ), [], #median );
Ideas behind vectorization and optimization
One of the approaches that one can employ to vectorize this problem would be to convert the subsets into regular shaped blocks and then finding the max of the elements
of the those blocks in one go. Now, converting to regular shaped blocks has one issue here and it is that the subsets are unequal in lengths. To avoid this issue, one can
create a 2D matrix of indices starting from each of starts elements and extending until the maximum of the subset lengths. Good thing about this is, it allows
vectorization, but at the cost of more memory requirements which would depend on the scattered-ness of the subsets lengths.
Another issue with this vectorization technique would be that it could potentially lead to out-of-limits indices creations for final subsets.
To avoid this, one can think of two possible ways -
Use a bigger input array by extending the input array such that maximum of the subset lengths plus the starts indices still lie within the confinements of the
extended array.
Use the original input array for starts until we are within the limits of original input array and then for the rest of the subsets use the original loop code. We can call it the mixed programming just for the sake of having a short title. This would save us memory requirements on creating the extended array as discussed in the other approach earlier.
These two ways/approaches are listed next.
Approach #1: Vectorized technique
[m,n] = size(myArray); %// store no. of rows and columns in input array
intv = ends-starts; %// intervals
max_intv = max(intv); %// max interval
max_intv_arr = [0:max_intv]'; %//'# array of max indices extent
[row1,col1] = ind2sub([m n],starts); %// get starts row and column indices
m_ext = max(row1+max_intv); %// no. of rows in extended input array
myArrayExt(m_ext,n)=0; %// extended form of input array
myArrayExt(1:m,:) = myArray;
%// New linear indices for extended form of input array
idx = bsxfun(#plus,max_intv_arr,(col1-1)*m_ext+row1);
%// Index into extended array; select only valid ones by setting rest to nans
selected_ele = myArrayExt(idx);
selected_ele(bsxfun(#gt,max_intv_arr,intv))= nan;
%// Get the max of the valid ones for the desired output
out = nanmax(selected_ele); %// desired output
Approach #2: Mixed programming
%// PART - I: Vectorized technique for subsets that when normalized
%// with max extents still lie within limits of input array
intv = ends-starts; %// intervals
max_intv = max(intv); %// max interval
%// Find the last subset that when extended by max interval would still
%// lie within the limits of input array
starts_extent = find(starts+max_intv<=numel(myArray),1,'last');
max_intv_arr = [0:max_intv]'; %//'# Array of max indices extent
%// Index into extended array; select only valid ones by setting rest to nans
selected_ele = myArray(bsxfun(#plus,max_intv_arr,starts(1:starts_extent)));
selected_ele(bsxfun(#gt,max_intv_arr,intv(1:starts_extent))) = nan;
out(numel(starts)) = 0; %// storage for output
out(1:starts_extent) = nanmax(selected_ele); %// output values for part-I
%// PART - II: Process rest of input array elements
for n = starts_extent+1:numel(starts)
out(n) = max(myArray(starts(n):ends(n)));
end
Benchmarking
In this section we will compare the the two approaches and the original loop code against each other for performance. Let's setup codes before starting the actual benchmarking -
N = 10000; %// No. of subsets
M1 = 1510; %// No. of rows in input array
M2 = 2185; %// No. of cols in input array
myArray = rand(M1,M2); %// Input array
num_runs = 50; %// no. of runs for each method
%// Form the starts and ends by getting a sorted random integers array from
%// 1 to one minus no. of elements in input array. That minus one is
%// compensated later on into ends because we don't want any subset with
%// starts and ends as the same index
y1 = reshape(sort(randi(numel(myArray)-1,1,2*N)),2,[]);
starts = y1(1,:);
ends = y1(1,:)+1;
%// Remove identical starts elements
invalid = [false any(diff(starts,[],2)==0,1)];
starts = starts(~invalid);
ends = ends(~invalid);
%// Create myLogicals
myLogicals = false(size(myArray));
for k1=1:numel(starts)
myLogicals(starts(k1):ends(k1))=1;
end
clear invalid y1 k1 M1 M2 N %// clear unnecessary variables
%// Warm up tic/toc.
for k = 1:100
tic(); elapsed = toc();
end
Now, the placebo codes that gets us the runtimes -
disp('---------------------- With Original loop code')
tic
for iter = 1:num_runs
%// ...... approach #1 codes
end
toc
%// clear out variables used in the above approach
%// repeat this for approach #1,2
Benchmark Results
In your comments, you mentioned using 1510 x 2185 matrix, so let's do two case runs with such size and subsets of size 10000 and 2000.
Case 1 [Input - 1510 x 2185 matrix, Subsets - 10000]
---------------------- With Original loop code
Elapsed time is 15.625212 seconds.
---------------------- With Approach #1
Elapsed time is 12.102567 seconds.
---------------------- With Approach #2
Elapsed time is 0.983978 seconds.
Case 2 [Input - 1510 x 2185 matrix, Subsets - 2000]
---------------------- With Original loop code
Elapsed time is 3.045402 seconds.
---------------------- With Approach #1
Elapsed time is 11.349107 seconds.
---------------------- With Approach #2
Elapsed time is 0.214744 seconds.
Case 3 [Bigger Input - 3000 x 3000 matrix, Subsets - 20000]
---------------------- With Original loop code
Elapsed time is 12.388061 seconds.
---------------------- With Approach #1
Elapsed time is 12.545292 seconds.
---------------------- With Approach #2
Elapsed time is 0.782096 seconds.
Note that the number of runs num_runs was varied to keep the runtime of the fastest approach close to 1 sec.
Conclusions
So, I guess the mixed programming (approach #2) is the way to go! As future work, one can use standard deviation into the scattered-ness criteria if the performance suffers because of the scattered-ness and offload the work for most scattered subsets (in terms of their lengths) into the loop code.
Efficiency
Measure both the vectorised & for-loop code samples on your respective platform ( be it a <localhost> or Cloud-based ) to see the difference:
MATLAB:7> tic();max( myArray( startIndex(:):endIndex(:) ) );toc() %% Details
Elapsed time is 0.0312 seconds. %% below.
%% Code is not
%% the merit,
%% method is:
and
tic(); %% for/loop
for n = 1:length( startIndex ) %% may be
max( myArray( startIndex(n):endIndex(n) ) ); %% significantly
end %% faster than
toc(); %% vectorised
Elapsed time is 0.125 seconds. %% setup(s)
%% overhead(s)
%% As commented below,
%% subsequent re-runs yield unrealistic results due to caching artifacts
Elapsed time is 0 seconds.
Elapsed time is 0 seconds.
Elapsed time is 0 seconds.
%% which are not so straight visible if encapsulated in an artificial in-vitro
%% via an outer re-run repetitions ( for k=1:1000 ) et al ( ref. in text below )
For a better interpretation of the test results, rather test on much larger sizes than just on a few tens of row/cols.
EDIT:
An erroneous code removed, thanks Dan for the notice. Having taken more attention to emphasize the quantitative validation, that may prove the assumption that a vectorised code may, but need not in all circumstances, be faster is not an excuse for a faulty code, sure.
Output - quantitatively comparative data:
While recommended, there is not IMHO fair to assume, the memalloc and similar overheads to be excluded from the in-vivo testing. Test re-runs typically show VM-page hits improvements, other caching artifacts, while the raw 1st "virgin" run is what typically appears in the real code deployment ( excl. external iterators, for sure ). So consider the results with care and retest in your real environment ( sometimes being run as a Virtual Machine inside a bigger system -- that also makes VM-swap mechanics necessary to take into account once huge matrices start hurt on real-life memory-access patterns ).
On other Projects I am used to use [usec] granularity of the realtime test timing, but the more care is necessary to be taken into account about the test-execution conditions and O/S background.
So nothing but testing gives relevant answers to your specific code/deployment situation, however be methodic to compare data comparable in principle.
Alarik's code:
MATLAB:8> tic(); for k=1:1000 % ( flattens memalloc issues & al )
> for n = 1:length( startIndex )
> max( myArray( startIndex(n):endIndex() ) );
> end;
> end; toc()
Elapsed time is 0.2344 seconds.
%% time is 0.0002 seconds per k-for-loop <--[ ref.^ remarks on testing ]
Dan's code:
MATLAB:9> tic(); for k=1:1000
> s_ind( size( myLogicals ) ) = 0;
> s_ind( startIndex ) = 1;
> labelled = cumsum( s_ind(:) ).*myLogicals(:);
> result = accumarray( labelled + 1, myArray(:), [], #max );
> end; toc()
error: product: nonconformant arguments (op1 is 43x1, op2 is 45x1)
%%
%% [Work in progress] to find my mistake -- sorry for not being able to reproduce
%% Dan's code and to make it work
%%
%% Both myArray and myLogicals shape was correct ( 9 x 5 )

Find median value of the largest clump of similar values in an array in the most computationally efficient manner

Sorry for the long title, but that about sums it up.
I am looking to find the median value of the largest clump of similar values in an array in the most computationally efficient manner.
for example:
H = [99,100,101,102,103,180,181,182,5,250,17]
I would be looking for the 101.
The array is not sorted, I just typed it in the above order for easier understanding.
The array is of a constant length and you can always assume there will be at least one clump of similar values.
What I have been doing so far is basically computing the standard deviation with one of the values removed and finding the value which corresponds to the largest reduction in STD and repeating that for the number of elements in the array, which is terribly inefficient.
for j = 1:7
G = double(H);
for i = 1:7
G(i) = NaN;
T(i) = nanstd(G);
end
best = find(T==min(T));
H(best) = NaN;
end
x = find(H==max(H));
Any thoughts?
This possibility bins your data and looks for the bin with most elements. If your distribution consists of well separated clusters this should work reasonably well.
H = [99,100,101,102,103,180,181,182,5,250,17];
nbins = length(H); % <-- set # of bins here
[v bins]=hist(H,nbins);
[vm im]=max(v); % find max in histogram
bl = bins(2)-bins(1); % bin size
bm = bins(im); % position of bin with max #
ifb =find(abs(H-bm)<bl/2) % elements within bin
median(H(ifb)) % average over those elements in bin
Output:
ifb = 1 2 3 4 5
H(ifb) = 99 100 101 102 103
median = 101
The more challenging parameters to set are the number of bins and the size of the region to look around the most populated bin. In the example you provided neither of these is so critical, you could set the number of bins to 3 (instead of length(H)) and it still would work. Using length(H) as the number of bins is in fact a little extreme and probably not a good general choice. A better choice is somewhere between that number and the expected number of clusters.
It may help for certain distributions to change bl within the find expression to a value you judge better in advance.
I should also note that there are clustering methods (kmeans) that may work better, but perhaps less efficiently. For instance this is the output of [H' kmeans(H',4) ]:
99 2
100 2
101 2
102 2
103 2
180 3
181 3
182 3
5 4
250 3
17 1
In this case I decided in advance to attempt grouping into 4 clusters.
Using kmeans you can get an answer as follows:
nbin = 4;
km = kmeans(H',nbin);
[mv iv]=max(histc(km,[1:nbin]));
H(km==km(iv))
median(H(km==km(iv)))
Notice however that kmeans does not necessarily return the same value every time it is run, so you might need to average over a few iterations.
I timed the two methods and found that kmeans takes ~10 X longer. However, it is more robust since the bin sizes adapt to your problem and do not need to be set beforehand (only the number of bins does).

Number of values greater than a threshold

I have a matrix A. Now I want to find the number of elements greater than 5 and their corresponding indices. How to solve this in matlab without using for loop?
For example if A = [1 4 6 8 9 5 6 8 9]':
Number of elements > 5: 6
Indices: [3 4 5 7 8 9]
You use find:
index = find(A>5);
numberOfElements = length(index);
You use sum, which allows you to get the number of elements with one command:
numberOfElements = sum(A>5);
Do you really need explicit indices? Because the logical matrix A>5 can also be used as index (usually a tad more efficient than indexing with find):
index = (A>5);
numberOfElements = sum(index);
For completeness: indexing with logicals is the same as with regular indices:
>> A(A>5)
ans =
6 8 9 6 8 9
Motivated by the above discussion with Rody, here is a simple benchmark, which tests speed of integer vs. logical array indexing in MATLAB. Quite an important thing I would say, since 'vectorized' MATLAB is mostly about indexing. So
% random data
a = rand(10^7, 1);
% threashold - how much data meets the a>threashold criterion
% This determines the total indexing time - the more data we extract from a,
% the longer it takes.
% In this example - small threashold meaning most data in a
% will meet the criterion.
threashold = 0.08;
% prepare logical and integer indices (note the uint32 cast)
index_logical = a>threashold;
index_integer = uint32(find(index_logical));
% logical indexing of a
tic
for i=1:10
b = a(index_logical);
end
toc
% integer indexing of a
tic
for i=1:10
b = a(index_integer);
end
toc
On my computer the results are
Elapsed time is 0.755399 seconds.
Elapsed time is 0.728462 seconds.
meaning that the two methods perform almost the same - thats how I chose the example threashold. It is interesing, because the index_integer array is almost 4 times larger!
index_integer 9198678x1 36794712 uint32
index_logical 10000000x1 10000000 logical
For larger values of the threashold integer indexing is faster. Results for threashold=0.5:
Elapsed time is 0.687044 seconds. (logical)
Elapsed time is 0.296044 seconds. (integer)
Unless I am doing something wrong here, integer indexing seems to be the fastest most of the time.
Including the creation of the indices in the test yields very different results however:
a = rand(1e7, 1);
threshold = 0.5;
% logical
tic
for i=1:10
inds = a>threshold;
b = a(inds);
end
toc
% double
tic
for i=1:10
inds = find(a>threshold);
b = a(inds);
end
toc
% integer
tic
for i=1:10
inds = uint32(find(a>threshold));
b = a(inds);
end
toc
Results (Rody):
Elapsed time is 1.945478 seconds. (logical)
Elapsed time is 3.233831 seconds. (double)
Elapsed time is 3.508009 seconds. (integer)
Results (angainor):
Elapsed time is 1.440018 seconds. (logical)
Elapsed time is 1.851225 seconds. (double)
Elapsed time is 1.726806 seconds. (integer)
So it would seem that the actual indexing is faster when indexing with integers, but front-to-back, logical indexing performs much better.
The runtime difference between the last two methods is unexpected though -- it seems Matlab's internals either do not cast the doubles to integers, of perform error-checking on each element before doing the actual indexing. Otherwise, we would have seen virtually no difference between the double and integer methods.
Edit There are two options as I see it:
matlab converts double indices to uint32 indices explicitly before the indexing call (much like we do in the integer test)
matlab passes doubles and performs the double->int cast on the fly during the indexing call
The second option should be faster, because we only have to read the double indexes once. In our explicit conversion test we have to read double indices, write integer indices, and then again read the integer indices during the actual indexing. So matlab should be faster... Why is it not?