Improving the performance of many sub-matrix left division operations (mldivide, \) - matlab

I have two matrices, a1 and a2. a1 is 3x12000 and a2 is 3x4000. I want to create another array that is 3x4000 which is the left matrix division (mldivide, \) of the 3x3 sub-matrices of a1 and the 3x1 sub-matrices of a2. You can do this easily with a for loop:
for ii = 1:3:12000
a = a1(:,ii:ii+2)\a2(:, ceil(ii/3));
end
However, I was wondering if there was a faster way to do this.
Edit: I am aware that preallocation increases speed; the loop was only written that way for illustration.
Edit2: Removed the iterative growing of the array. It seems my question has been misinterpreted a bit. I was mainly wondering whether there are matrix operations I could use to achieve my goal, as that would likely be quicker than a for loop, i.e. reshape a1 to a 3x3x4000 matrix and a2 to a 3x1x4000 matrix and left-divide each slice in one go; however, you can't use left division directly on 3-D matrices.
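For reference, a minimal sketch of that slice-wise idea, assuming a recent MATLAB release (R2022a or newer) where pagemldivide exists; variable names follow the question:
A = reshape(a1, 3, 3, []);   % 3x3x4000 stack of coefficient sub-matrices
b = reshape(a2, 3, 1, []);   % 3x1x4000 stack of right-hand sides
x = pagemldivide(A, b);      % left-divides each 3x3 page independently
a = reshape(x, 3, []);       % back to 3x4000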

You can create one system of equations containing many independent 'sub-systems' of equations by putting the sub-matrices of a1 on the diagonal of a 12000x12000 matrix like this:
a1(1,1) a1(1,2) a1(1,3) 0 0 0 0 0 0
a1(2,1) a1(2,2) a1(2,3) 0 0 0 0 0 0
a1(3,1) a1(3,2) a1(3,3) 0 0 0 0 0 0
0 0 0 a1(1,4) a1(1,5) a1(1,6) 0 0 0
0 0 0 a1(2,4) a1(2,5) a1(2,6) 0 0 0
0 0 0 a1(3,4) a1(3,5) a1(3,6) 0 0 0
0 0 0 0 0 0 a1(1,7) a1(1,8) a1(1,9)
0 0 0 0 0 0 a1(2,7) a1(2,8) a1(2,9)
0 0 0 0 0 0 a1(3,7) a1(3,8) a1(3,9)
and then left-dividing a2(:) by it.
This can be done using kron and a sparse matrix like this (source):
a1_kron = kron(speye(12000/3),ones(3));
a1_kron(logical(a1_kron)) = a1(:);
a = a1_kron\a2(:);
a = reshape(a, [3 12000/3]);
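As a side note (my addition, not the answerer's code), the same block-diagonal matrix can also be assembled directly with sparse(), which avoids the logical-indexing assignment; a sketch assuming the 3x12000 a1 from the question:
n = 12000;
rows = repmat(reshape(1:n, 3, 1, []), [1 3 1]);  % row index of every nonzero
cols = repmat(reshape(1:n, 1, 3, []), [3 1 1]);  % column index of every nonzero
a1_blk = sparse(rows(:), cols(:), a1(:), n, n);  % same block-diagonal matrix as a1_kron
a = reshape(a1_blk \ a2(:), 3, []);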
Advantage - Speed: This is about 3-4 times faster than a for loop with preallocation on my PC.
Disadvantage: There is one disadvantage you must consider with this approach: when using left division, Matlab looks for the best way to solve the systems of linear equations. If you solve each sub-system independently, the best method is chosen for each sub-system; if you solve them as one system, Matlab will pick the best method for all the sub-systems together - not the best for each individual sub-system.
Note: As shown in Stefano M's answer, using one big system of equations (using kron and a sparse matrix) is faster than using a for loop (with preallocation) only for very small sub-systems (on my PC, for about 7 equations per sub-system or fewer); for bigger sub-systems, using a for loop is faster.
Comparing different methods
I wrote and ran code to compare 4 different methods for solving this problem:
for loop, no preallocation
for loop, with preallocation
kron
cellfun
Test:
n = 1200000;
a1 = rand(3,n);
a2 = rand(3,n/3);
disp('Method 1: for loop, no preallocation')
tic
a_method1 = [];
for ii = 1:3:n
a_method1 = [a_method1 a1(:,ii:ii+2)\a2(:, ceil(ii/3))];
end
toc
disp(' ')
disp('Method 2: for loop, with preallocation')
tic
a1_reshape = reshape(a1, 3, 3, []);
a_method2 = zeros(size(a2));
for i = 1:size(a1_reshape,3)
a_method2(:,i) = a1_reshape(:,:,i) \ a2(:,i);
end
toc
disp(' ')
disp('Method 3: kron')
tic
a1_kron = kron(speye(n/3),ones(3));
a1_kron(logical(a1_kron)) = a1(:);
a_method3 = a1_kron\a2(:);
a_method3 = reshape(a_method3, [3 n/3]);
toc
disp(' ')
disp('Method 4: cellfun')
tic
a1_cells = mat2cell(a1, size(a1, 1), repmat(3 ,1,size(a1, 2)/3));
a2_cells = mat2cell(a2, size(a2, 1), ones(1,size(a2, 2)));
a_cells = cellfun(@(x, y) x\y, a1_cells, a2_cells, 'UniformOutput', 0);
a_method4 = cell2mat(a_cells);
toc
disp(' ')
Results:
Method 1: for loop, no preallocation
Elapsed time is 747.635280 seconds.
Method 2: for loop, with preallocation
Elapsed time is 1.426560 seconds.
Method 3: kron
Elapsed time is 0.357458 seconds.
Method 4: cellfun
Elapsed time is 3.390576 seconds.
Comparing the results of the four methods, you can see that using method 3 - kron gives slightly different results:
disp(['sumabs(a_method1(:) - a_method2(:)): ' num2str(sumabs(a_method1(:)-a_method2(:)))])
disp(['sumabs(a_method1(:) - a_method3(:)): ' num2str(sumabs(a_method1(:)-a_method3(:)))])
disp(['sumabs(a_method1(:) - a_method4(:)): ' num2str(sumabs(a_method1(:)-a_method4(:)))])
Result:
sumabs(a_method1(:) - a_method2(:)): 0
sumabs(a_method1(:) - a_method3(:)): 8.9793e-05
sumabs(a_method1(:) - a_method4(:)): 0

You are solving a series of N systems with m linear equations each; the N systems are of the form
Ax = b
You can convert these to a single system of Nm linear equations:
|A1 0 0 ... 0 | |x1| |b1|
|0 A2 0 ... 0 | |x2| |b2|
|0 0 A3 ... 0 | |x3| = |b3|
|. . . ... . | |. | |. |
|0 0 0 ... AN| |xN| |bN|
However, solving this one system of equations is a lot more expensive than solving all the little ones. Typically, the cost is O(n^3), so you go from O(N m^3) to O((Nm)^3). A huge pessimization. (Eliahu proved me wrong here, apparently the sparsity of the matrix can be exploited.)
Reducing the computational cost can be done, but you need to provide guarantees about the data. For example, if the matrices A are positive definite, the systems can be solved more cheaply. Nonetheless, given that you are dealing with 3x3 matrices, the winnings there will be slim, since those are pretty simple systems to solve.
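For illustration only, a hedged sketch of that positive-definite case using an explicit Cholesky factorization (my addition; it is only valid if every 3x3 block really is symmetric positive definite):
A = reshape(a1, 3, 3, []);           % 3x3xN stack of sub-matrices
x = zeros(size(a2));
for i = 1:size(A, 3)
    R = chol(A(:,:,i));              % errors out if the block is not positive definite
    x(:,i) = R \ (R' \ a2(:,i));     % two triangular solves instead of a general \
end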
If you are asking this because you think that loops are inherently slow in MATLAB, you should know that this is no longer the case, and hasn’t been the case since MATLAB gained a JIT 15 years or so ago. Today, many vectorization attempts lead to equally fast code, and oftentimes to slower code, especially for large data. (I could fish up some timings I’ve posted here on SO to prove this if necessary.)
I would think that solving all systems in one go could reduce the number of checks that MATLAB does every time the operator \ is called. That is, hard-coding the problem size and type might improve throughput. But the only way to do so is to write a MEX-file.

The biggest improvement would be to preallocate the output matrix instead of growing it:
A1 = reshape(A1, 3, 3, []);
a = zeros(size(A2));
for i = 1:size(A1,3)
a(:,i) = A1(:,:,i) \ A2(:,i);
end
With the preallocated array, if the Parallel Computing Toolbox is available, you can try parfor.
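A minimal sketch of the same loop with parfor, assuming the Parallel Computing Toolbox is installed:
A1 = reshape(A1, 3, 3, []);
a = zeros(size(A2));
parfor i = 1:size(A1, 3)
    a(:,i) = A1(:,:,i) \ A2(:,i);    % each 3x3 solve is independent, so iterations can run in parallel
end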
Edit
This answer is no longer relevant, since the OP rephrased the question to avoid growing the result array, which was the original major bottleneck.
The problem here is that one has to solve 4000 independent 3x3 linear systems. The matrices are so small that an ad hoc solution could be of interest, especially if one has some information on the matrix properties (symmetric or not, condition number, etc.). However, sticking to the MATLAB \ operator, the best way to speed up computation is to exploit parallelism explicitly, e.g. with the parfor command.
The sparse matrix solution of the other answer by Eliahu Aaron is indeed very clever, but its speed advantage is not general but depends on the specific problem size.
With this function you can explore different problem sizes:
function [t2, t3] = sotest(n, n2)
a1 = rand(n,n*n2);
a2 = rand(n,n2);
tic
a1_reshape = reshape(a1, n, n, []);
a_method2 = zeros(size(a2));
for i = 1:size(a1_reshape,3)
a_method2(:,i) = a1_reshape(:,:,i) \ a2(:,i);
end
t2 = toc;
tic
a1_kron = kron(speye(n2),ones(n));
a1_kron(logical(a1_kron)) = a1(:);
a_method3 = a1_kron\a2(:);
a_method3 = reshape(a_method3, [n n2]);
t3 = toc;
assert ( norm(a_method2 - a_method3, 1) / norm(a_method2, 1) < 1e-8)
Indeed for n=3 the sparse matrix method is clearly superior, but for increasing n it becomes less competitive
The above figure was obtained with
>> for i=1:20; [t(i,1), t(i,2)] = sotest(i, 50000); end
>> loglog(1:20, t, '*-')
My final comment is that an explicit loop with the dense \ operator is indeed fast; the sparse matrix formulation is slightly less accurate and could become problematic in edge cases; and for sure the sparse matrix solution is not very readable. If the number n2 of systems to solve is very very big (>1e6) then maybe ad hoc solutions should be explored.
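For what it's worth, one such ad hoc route is a fully vectorized Cramer's rule for the 3x3 case; the sketch below is my addition, not part of the timed comparison above, and it trades the robustness of \ for speed (Cramer's rule is less numerically stable for ill-conditioned blocks):
A = reshape(a1, 3, 3, []);                        % 3x3xN stack of sub-matrices
a11 = squeeze(A(1,1,:)).'; a12 = squeeze(A(1,2,:)).'; a13 = squeeze(A(1,3,:)).';
a21 = squeeze(A(2,1,:)).'; a22 = squeeze(A(2,2,:)).'; a23 = squeeze(A(2,3,:)).';
a31 = squeeze(A(3,1,:)).'; a32 = squeeze(A(3,2,:)).'; a33 = squeeze(A(3,3,:)).';
b1 = a2(1,:); b2 = a2(2,:); b3 = a2(3,:);
d  = a11.*(a22.*a33-a23.*a32) - a12.*(a21.*a33-a23.*a31) + a13.*(a21.*a32-a22.*a31);  % determinants
x1 = ( b1.*(a22.*a33-a23.*a32) - a12.*( b2.*a33-a23.* b3) + a13.*( b2.*a32-a22.* b3)) ./ d;
x2 = (a11.*( b2.*a33-a23.* b3) -  b1.*(a21.*a33-a23.*a31) + a13.*(a21.* b3- b2.*a31)) ./ d;
x3 = (a11.*(a22.* b3- b2.*a32) - a12.*(a21.* b3- b2.*a31) +  b1.*(a21.*a32-a22.*a31)) ./ d;
x  = [x1; x2; x3];                                % 3xN solutions, one column per sub-system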

Related

Finding the greatest common divisor of a matrix in MATLAB

I'm looking for a way to divide a certain matrix's elements by its lowest common divisor.
For example, I have the matrix
[0,0,0; 2,4,2;-2,0,8]
I can tell the lowest common divisor is 2, so the matrix after the division will be
[0,0,0;1,2,1;-1,0,4]
What is the built-in method that can compute this?
Thanks in advance
P.S. I personally do not like using loops for this computation; it seems like there should be a built-in operation that can perform this element-wise division.
Since you don't like loops, how about recursive functions?
iif = @(varargin) varargin{2 * find([varargin{1:2:end}], 1, 'first')}();
gcdrec=@(v,gcdr) iif(length(v)==1,v, ...
v(1)==1,1, ...
length(v)==2,@()gcd(v(1),v(2)), ...
true,@()gcdr([gcd(v(1),v(2)),v(3:end)],gcdr));
mygcd=@(v)gcdrec(v(:)',gcdrec);
A=[0,0,0; 2,4,2;-2,0,8];
divisor=mygcd(A);
A=A/divisor;
The first function, iif, defines an inline conditional construct. This allows us to define a recursive function, gcdrec, to find the greatest common divisor of your array. This iif works like this: it tests whether the first argument is true; if it is, it returns the second argument. Otherwise it tests the third argument, and if that's true, it returns the fourth, and so on. You need to protect recursive functions and sometimes other quantities appearing inside it with @(), otherwise you can get errors.
Using iif the recursive function gcdrec works like this:
if the input vector is a scalar, it returns it
else if the first component of the vector is 1, the GCD cannot be larger than 1, so it returns 1 (this allows a quick return for large matrices)
else if the input vector is of length 2, it returns the greatest common divisor via gcd
else it calls itself with a shortened vector, in which the first two elements are substituted with their greatest common divisor.
The function mygcd is just a front-end for convenience.
Should be pretty fast, and I guess only the recursion depth could be a problem for very large problems. I did a quick timing check to compare with the looping version of @Adriaan, using A=randi(100,N,N)-50, with N=100, N=1000 and N=5000 and tic/toc.
N=100:
looping 0.008 seconds
recursive 0.002 seconds
N=1000:
looping 0.46 seconds
recursive 0.04 seconds
N=5000:
looping 11.8 seconds
recursive 0.6 seconds
Update: interesting thing is that the only reason that I didn't trip the recursion limit (which is by default 500) is that my data didn't have a common divisor. Setting a random matrix and doubling it will lead to hitting the recursion limit already for N=100. So for large matrices this won't work. Then again, for small matrices @Adriaan's solution is perfectly fine.
I also tried to rewrite it to half the input vector in each recursive step: this indeed solves the recursion limit problem, but it is very slow (2 seconds for N=100, 261 seconds for N=1000). There might be a middle ground somewhere, where the matrix size is large(ish) and the runtime's not that bad, but I haven't found it yet.
A = [0,0,0; 2,4,2;-2,0,8];
B = 1;
kk = max(abs(A(:))); % start at the end
while B~=0 && kk>=0
tmp = mod(A,kk);
B = sum(tmp(:));
kk = kk - 1;
end
kk = kk+1;
This is probably not the fastest way, but it will do for now. What I did here is initialise a variable, B, to store the sum of all elements in your matrix after taking the mod. kk is just a counter which runs through integers. mod(A,kk) computes the modulus after division for each element in A. Thus, if all your elements are wholly divisible by 2, it will return a 0 for each element. sum(tmp(:)) then makes a single column out of the modulo-matrix, which is summed to obtain a single number. If and only if that number is 0 is there a common divisor, since then all elements in A are wholly divisible by kk. As soon as that happens your loop stops and your common divisor is the number in kk. Since kk is decreased every count it is actually one value too low, so one is added.
Note: I just edited the loop to run backwards since you are looking for the Greatest cd, not the Smallest cd. If you'd have a matrix like [4,8;16,8] it would stop at 2, not 4. Apologies for that, this works now, though both other solutions here are much faster.
Finally, dividing matrices can be done like this:
divided_matrix = A/kk;
Agreed, I don't like the loops either! Let's kill them -
unqA = unique(abs(A(A~=0))).';
col_extent = [2:max(unqA)]';
B = repmat(col_extent,1,numel(unqA));
B(bsxfun(@gt,col_extent,unqA)) = 0;
divisor = find(all(bsxfun(@times,bsxfun(@rem,unqA,B)==0,B),2),1,'first');
if isempty(divisor)
out = A;
else
out = A/divisor;
end
Sample runs
Case #1:
A =
0 0 0
2 4 2
-2 0 8
divisor =
2
out =
0 0 0
1 2 1
-1 0 4
Case #2:
A =
0 3 0
5 7 6
-5 0 21
divisor =
1
out =
0 3 0
5 7 6
-5 0 21
Here's another approach. Let A be your input array.
Get nonzero values of A and take their absolute value. Call the resulting vector B.
Test each number from 1 to max(B), and see if it divides all entries of B (that is, if the remainder of the division is zero).
Take the largest such number.
Code:
A = [0,0,0; 2,4,2;-2,0,8]; %// data
B = nonzeros(abs(A)); %// step 1
t = all(bsxfun(@mod, B, 1:max(B))==0, 1); %// step 2
result = find(t, 1, 'last'); %// step 3
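A small follow-up (my addition): applying the divisor, with a guard for an all-zero input, in which case result comes back empty:
if isempty(result)
    out = A;              % A was all zeros; nothing to divide by
else
    out = A/result;       % divide every element by the greatest common divisor
end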

How can I vectorize code that runs a function on subsets of a larger matrix?

Let's assume I have the following 9 x 5 matrix:
myArray = [
54.7 8.1 81.7 55.0 22.5
29.6 92.9 79.4 62.2 17.0
74.4 77.5 64.4 58.7 22.7
18.8 48.6 37.8 20.7 43.5
68.6 43.5 81.1 30.1 31.1
18.3 44.6 53.2 47.0 92.3
36.8 30.6 35.0 23.0 43.0
62.5 50.8 93.9 84.4 18.4
78.0 51.0 87.5 19.4 90.4
];
I have 11 "subsets" of this matrix and I need to run a function (let's say max) on each of these subsets. The subsets can be identified with the following matirx of logicals (identified column-wise, not row-wise):
myLogicals = logical([
0 1 0 1 1
1 1 0 1 1
1 1 0 0 0
0 1 0 1 1
1 0 1 1 1
1 1 1 1 0
0 1 1 0 1
1 1 0 0 1
1 1 0 0 1
]);
or via linear indexing:
starts = [2 5 8 10 15 23 28 31 37 40 43]; % index start of each subset
ends = [3 6 9 13 18 25 29 33 38 41 45]; % index end of each subset
such that the first subset is 2:3, the second is 5:6, and so on.
I can find the max of each subset and store it in a vector as follows:
finalAnswers = NaN(11,1);
for n=1:length(starts) % i.e. 1 through the number of subsets
finalAnswers(n) = max(myArray(starts(n):ends(n)));
end
After the loop runs, finalAnswers contains the maximum value of each of the data subsets:
74.4 68.6 78.0 92.9 51.0 81.1 62.2 47.0 22.5 43.5 90.4
Is it possible to obtain the same result without the use of a for loop? In other words, can this code be vectorized? Would such an approach be more efficient than the current one?
EDIT:
I did some testing of the proposed solutions. The data I used was a 1,510 x 2,185 matrix with 10,103 subsets that varied in length from 2 to 916 with a standard deviation of subset length of 101.92.
I wrapped each solution in tic;for k=1:1000 [code here] end; toc; and here are the results:
for loop approach --- Elapsed time is 16.237400 seconds.
Shai's approach --- Elapsed time is 153.707076 seconds.
Dan's approach --- Elapsed time is 44.774121 seconds.
Divakar's approach #2 --- Elapsed time is 127.621515 seconds.
Notes:
I also tried benchmarking Dan's approach by wrapping the k=1:1000 for loop around just the accumarray line (since the rest could theoretically be run just once). In this case the time was 28.29 seconds.
Benchmarking Shai's approach, while leaving the lb = ... line out of the k loop, the time was 113.48 seconds.
When I ran Divakar's code, I got "Non-singleton dimensions of the two input arrays must match each other." errors for the bsxfun lines. I "fixed" this by using conjugate transposition (the apostrophe operator ') on trade_starts(1:starts_extent) and intv(1:starts_extent) in the lines of code calling bsxfun. I'm not sure why this error was occurring...
I'm not sure if my benchmarking setup is correct, but it appears that the for loop actually runs the fastest in this case.
One approach is to use accumarray. Unfortunately in order to do that we first need to "label" your logical matrix. Here is a convoluted way of doing that if you don't have the image processing toolbox:
sz=size(myLogicals);
s_ind(sz(1),sz(2))=0;
%// OR: s_ind = zeros(size(myLogicals))
s_ind(starts) = 1;
labelled = cumsum(s_ind(:)).*myLogicals(:);
So that just does what Shai's bwlabeln implementation does (but this will be numel(myLogicals)-by-1 in shape as opposed to size(myLogicals) in shape).
Now you can use accumarray:
accumarray(labelled(myLogicals), myArray(myLogicals), [], @max)
or else it may be faster to try
result = accumarray(labelled+1, myArray(:), [], @max);
result = result(2:end)
This is fully vectorized, but is it worth it? You'll have to do speed tests against your loop solution to know.
Use bwlabeln with a vertical connectivity:
lb = bwlabeln( myLogicals, [0 1 0; 0 1 0; 0 1 0] );
Now you have a label 1..11 for each region.
To get max value you can use regionprops
props = regionprops( lb, myArray, 'MaxIntensity' );
finalAnswers = [props.MaxIntensity];
You can use regionprops to get some other properties of each subset, but it is not too general.
If you wish to apply a more general function to each region, e.g., median, you can use accumarray:
finalAnswer = accumarray( lb( myLogicals ), myArray( myLogicals ), [], @median );
Ideas behind vectorization and optimization
One of the approaches that one can employ to vectorize this problem is to convert the subsets into regular-shaped blocks and then find the max of the elements of those blocks in one go. Converting to regular-shaped blocks has one issue: the subsets are unequal in length. To avoid this issue, one can create a 2D matrix of indices starting from each starts element and extending up to the maximum of the subset lengths. The good thing about this is that it allows vectorization, but at the cost of more memory, which depends on how scattered the subset lengths are.
Another issue with this vectorization technique is that it could potentially create out-of-limits indices for the final subsets.
To avoid this, one can think of two possible ways -
Use a bigger input array, extended so that the maximum of the subset lengths plus the starts indices still lies within the confines of the extended array.
Use the original input array for the starts that stay within the limits of the original input array, and for the rest of the subsets use the original loop code. We can call this mixed programming, just for the sake of having a short title. This saves the memory required for creating the extended array discussed in the other approach.
These two ways/approaches are listed next.
Approach #1: Vectorized technique
[m,n] = size(myArray); %// store no. of rows and columns in input array
intv = ends-starts; %// intervals
max_intv = max(intv); %// max interval
max_intv_arr = [0:max_intv]'; %// array of max indices extent
[row1,col1] = ind2sub([m n],starts); %// get starts row and column indices
m_ext = max(row1+max_intv); %// no. of rows in extended input array
myArrayExt(m_ext,n)=0; %// extended form of input array
myArrayExt(1:m,:) = myArray;
%// New linear indices for extended form of input array
idx = bsxfun(@plus,max_intv_arr,(col1-1)*m_ext+row1);
%// Index into extended array; select only valid ones by setting rest to nans
selected_ele = myArrayExt(idx);
selected_ele(bsxfun(@gt,max_intv_arr,intv))= nan;
%// Get the max of the valid ones for the desired output
out = nanmax(selected_ele); %// desired output
Approach #2: Mixed programming
%// PART - I: Vectorized technique for subsets that when normalized
%// with max extents still lie within limits of input array
intv = ends-starts; %// intervals
max_intv = max(intv); %// max interval
%// Find the last subset that when extended by max interval would still
%// lie within the limits of input array
starts_extent = find(starts+max_intv<=numel(myArray),1,'last');
max_intv_arr = [0:max_intv]'; %// Array of max indices extent
%// Index into extended array; select only valid ones by setting rest to nans
selected_ele = myArray(bsxfun(@plus,max_intv_arr,starts(1:starts_extent)));
selected_ele(bsxfun(@gt,max_intv_arr,intv(1:starts_extent))) = nan;
out(numel(starts)) = 0; %// storage for output
out(1:starts_extent) = nanmax(selected_ele); %// output values for part-I
%// PART - II: Process rest of input array elements
for n = starts_extent+1:numel(starts)
out(n) = max(myArray(starts(n):ends(n)));
end
Benchmarking
In this section we will compare the two approaches and the original loop code against each other for performance. Let's set up the code before starting the actual benchmarking -
N = 10000; %// No. of subsets
M1 = 1510; %// No. of rows in input array
M2 = 2185; %// No. of cols in input array
myArray = rand(M1,M2); %// Input array
num_runs = 50; %// no. of runs for each method
%// Form the starts and ends by getting a sorted random integers array from
%// 1 to one minus no. of elements in input array. That minus one is
%// compensated later on into ends because we don't want any subset with
%// starts and ends as the same index
y1 = reshape(sort(randi(numel(myArray)-1,1,2*N)),2,[]);
starts = y1(1,:);
ends = y1(2,:)+1;
%// Remove identical starts elements
invalid = [false any(diff(starts,[],2)==0,1)];
starts = starts(~invalid);
ends = ends(~invalid);
%// Create myLogicals
myLogicals = false(size(myArray));
for k1=1:numel(starts)
myLogicals(starts(k1):ends(k1))=1;
end
clear invalid y1 k1 M1 M2 N %// clear unnecessary variables
%// Warm up tic/toc.
for k = 1:100
tic(); elapsed = toc();
end
Now, the placeholder code that gets us the runtimes -
disp('---------------------- With Original loop code')
tic
for iter = 1:num_runs
%// ...... approach #1 codes
end
toc
%// clear out variables used in the above approach
%// repeat this for approach #1,2
Benchmark Results
In your comments, you mentioned using 1510 x 2185 matrix, so let's do two case runs with such size and subsets of size 10000 and 2000.
Case 1 [Input - 1510 x 2185 matrix, Subsets - 10000]
---------------------- With Original loop code
Elapsed time is 15.625212 seconds.
---------------------- With Approach #1
Elapsed time is 12.102567 seconds.
---------------------- With Approach #2
Elapsed time is 0.983978 seconds.
Case 2 [Input - 1510 x 2185 matrix, Subsets - 2000]
---------------------- With Original loop code
Elapsed time is 3.045402 seconds.
---------------------- With Approach #1
Elapsed time is 11.349107 seconds.
---------------------- With Approach #2
Elapsed time is 0.214744 seconds.
Case 3 [Bigger Input - 3000 x 3000 matrix, Subsets - 20000]
---------------------- With Original loop code
Elapsed time is 12.388061 seconds.
---------------------- With Approach #1
Elapsed time is 12.545292 seconds.
---------------------- With Approach #2
Elapsed time is 0.782096 seconds.
Note that the number of runs num_runs was varied to keep the runtime of the fastest approach close to 1 sec.
Conclusions
So, I guess the mixed programming (approach #2) is the way to go! As future work, one could use the standard deviation of the subset lengths as a scattered-ness criterion and, if performance suffers because of it, offload the most scattered subsets (in terms of their lengths) to the loop code.
Efficiency
Measure both the vectorised & for-loop code samples on your respective platform ( be it a <localhost> or Cloud-based ) to see the difference:
MATLAB:7> tic();max( myArray( startIndex(:):endIndex(:) ) );toc() %% Details
Elapsed time is 0.0312 seconds. %% below.
%% Code is not
%% the merit,
%% method is:
and
tic(); %% for/loop
for n = 1:length( startIndex ) %% may be
max( myArray( startIndex(n):endIndex(n) ) ); %% significantly
end %% faster than
toc(); %% vectorised
Elapsed time is 0.125 seconds. %% setup(s)
%% overhead(s)
%% As commented below,
%% subsequent re-runs yield unrealistic results due to caching artifacts
Elapsed time is 0 seconds.
Elapsed time is 0 seconds.
Elapsed time is 0 seconds.
%% which are not so straight visible if encapsulated in an artificial in-vitro
%% via an outer re-run repetitions ( for k=1:1000 ) et al ( ref. in text below )
For a better interpretation of the test results, rather test on much larger sizes than just on a few tens of row/cols.
EDIT:
Erroneous code removed, thanks Dan for the notice. More attention has now been paid to the quantitative validation; still, the assumption that vectorised code may, but need not in all circumstances, be faster is not an excuse for faulty code, for sure.
Output - quantitatively comparative data:
While often recommended, it is not IMHO fair to assume that memalloc and similar overheads should be excluded from the in-vivo testing. Test re-runs typically show VM-page-hit improvements and other caching artifacts, while the raw 1st "virgin" run is what typically appears in real code deployment ( excl. external iterators, for sure ). So consider the results with care and retest in your real environment ( sometimes being run as a Virtual Machine inside a bigger system -- that also makes it necessary to take VM-swap mechanics into account once huge matrices start to hurt real-life memory-access patterns ).
On other projects I am used to using [usec] granularity for the realtime test timing, but the more care needs to be taken about the test-execution conditions and O/S background.
So nothing but testing gives relevant answers for your specific code/deployment situation; however, be methodical and compare data that are comparable in principle.
Alarik's code:
MATLAB:8> tic(); for k=1:1000 % ( flattens memalloc issues & al )
> for n = 1:length( startIndex )
> max( myArray( startIndex(n):endIndex(n) ) );
> end;
> end; toc()
Elapsed time is 0.2344 seconds.
%% time is 0.0002 seconds per k-for-loop <--[ ref.^ remarks on testing ]
Dan's code:
MATLAB:9> tic(); for k=1:1000
> s_ind( size( myLogicals ) ) = 0;
> s_ind( startIndex ) = 1;
> labelled = cumsum( s_ind(:) ).*myLogicals(:);
> result = accumarray( labelled + 1, myArray(:), [], @max );
> end; toc()
error: product: nonconformant arguments (op1 is 43x1, op2 is 45x1)
%%
%% [Work in progress] to find my mistake -- sorry for not being able to reproduce
%% Dan's code and to make it work
%%
%% Both myArray and myLogicals shape was correct ( 9 x 5 )

Matlab: element 3D matrices multiplication

I have two matrices: B with size 9x100x51 and K with size 34x9x100. I want to multiply all of K(34) with each one of B(9) so as to have a final matrix G with size 34x9x100x51.
For example: the element G(:,5,60,25) is composed as follow
G(:,5,60,25)=K(:,5,60)*B(5,60,25).
I hope that the example helps to understand what I want to do.
Thank you
Any time you find yourself writing nested loops in matlab, there's a good chance you can speed up quite a bit using the built-in vectorized forms of the functions. The code ends up being quite a bit shorter typically too (but often less immediately clear to a reader, so comment your code!).
In this case, does avoiding the nested loops make a difference? Absolutely! Let's get to work. @slayton has provided a 3-loop solution. We can get faster.
Restating the problem a bit, B has 51 9x100 matrices and K has 34 9x100 matrices. For each combination of 51x34, you want to element-wise multiply the respective 9x100 matrices from B and K.
Element-wise multiplication is a great job for bsxfun, so we can conceptually reduce this problem to working along two dimensions (the third dimension of B, first dimension of K):
Initial, two-loop solution:
B = rand(9,100,51);
K = rand(34,9,100);
G = nan(34,9,100,51);
for b=1:size(B,3)
for k=1:size(K,1)
G(k,:,:,b) = bsxfun(@times,B(:,:,b), squeeze(K(k,:,:)));
end
end
Ok, two loops is making progress. Can we do better? Well, let's recognize that the matrices B and K can be replicated along the appropriate dimensions, then element-wise multiplied all at once.
B = rand(9,100,51);
K = rand(34,9,100);
B2 = repmat(permute(B,[4 1 2 3]), [size(K,1) 1 1 1]);
K2 = repmat(K, [1 1 1 size(B,3)]);
G = bsxfun(@times,B2,K2);
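As an aside (my addition, not part of the original answer or timings): on R2016b and newer, implicit expansion makes the repmat/bsxfun machinery unnecessary, assuming the same B and K as above:
G = K .* permute(B, [4 1 2 3]);   % 34x9x100 (x1) times 1x9x100x51 expands to 34x9x100x51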
So, how do the solutions compare speed-wise? I tested them on the Octave online utility, and didn't include the time to generate the initial B and K matrices. I did include the time to preallocate the G matrix for the solutions that needed preallocation. The code is below.
3 loops (@slayton's answer): 4.024471 s
2 loop solution: 1.616120 s
0-loop repmat/bsxfun solution: 1.211850 s
0-loop repmat/bsxfun solution, no temporaries: 0.605838 s
Caveat: The timing may depend quite a bit on your machine; I wouldn't trust the online utility for rigorous timing tests. Changing the order in which the solutions were executed (even taking care not to reuse variables and skew allocation time) did change things a bit; namely, the 2-loop solution was sometimes as fast as the no-loop solution with stored temporaries. However, the more vectorized you can get, the better off you will be.
Here's the code for the speed test:
B = rand(9,100,51);
K = rand(34,9,100);
tic
G1 = nan(34,9,100,51);
for ii = 1:size(B,1)
for jj = 1:size(B,2);
for kk = 1:size(B,3)
G1(:, ii, jj, kk) = K(:,ii,jj) .* B(ii,jj,kk);
end
end
end
t=toc;
printf('Time for 3 loop solution: %f\n' ,t)
tic
G2 = nan(34,9,100,51);
for b=1:size(B,3)
for k=1:size(K,1)
G2(k,:,:,b) = bsxfun(@times,B(:,:,b), squeeze(K(k,:,:)));
end
end
t=toc;
printf('Time for 2 loop solution: %f\n' ,t)
tic
B2 = repmat(permute(B,[4 1 2 3]), [size(K,1) 1 1 1]);
K2 = repmat(K, [1 1 1 size(B,3)]);
G3 = bsxfun(@times,B2,K2);
t=toc;
printf('Time for 0-loop repmat/bsxfun solution: %f\n' ,t)
tic
G4 = bsxfun(@times,repmat(permute(B,[4 1 2 3]), [size(K,1) 1 1 1]),repmat(K, [1 1 1 size(B,3)]));
t=toc;
printf('Time for 0-loop repmat/bsxfun solution, no temporaries: %f\n' ,t)
disp('Are the results equal?')
isequal(G1,G2)
isequal(G1,G3)
Time for 3 loop solution: 4.024471
Time for 2 loop solution: 1.616120
Time for 0-loop repmat/bsxfun solution: 1.211850
Time for 0-loop repmat/bsxfun solution, no temporaries: 0.605838
Are the results equal?
ans = 1
ans = 1
You can do this with nested loops, although it probably won't be terribly fast:
B = rand(9,100,51);
K = rand(34,9,100);
G = nan(34,9,100,51);
for ii = 1:size(B,1)
for jj = 1:size(B,2);
for kk = 1:size(B,3)
G(:, ii, jj, kk) = K(:,ii,jj) .* B(ii,jj,kk);
end
end
end
Its been a long day and my brain is a bit fried, kudos to anyone who can improve this!

Update only one matrix element for iterative computation

I have a 3x3 matrix, A. I also compute a value, g, as the maximum eigen value of A. I am trying to change the element A(3,3) = 0 for all values from zero to one in 0.10 increments and then update g for each of the values. I'd like all of the other matrix elements to remain the same.
I thought a for loop would be the way to do this, but I do not know how to update only one element in a matrix without storing this update as one increasingly larger matrix. If I call the element at A(3,3) = p (thereby creating a new matrix Atry) I am able (below) to get all of the values from 0 to 1 that I desired. I do not know how to update Atry to get all of the values of g that I desire. The state of the code now will give me the same value of g for all iterations, as expected, as I do not know how to to update Atry with the different values of p to then compute the values for g.
Any suggestions on how to do this or suggestions for jargon or phrases for me to web search would be appreciated.
A = [1 1 1; 2 2 2; 3 3 0];
g = max(eig(A));
% This below is what I attempted to achieve my solution
clear all
p(1) = 0;
Atry = [1 1 1; 2 2 2; 3 3 p];
g(1) = max(eig(Atry));
for i=1:100;
p(i+1) = p(i)+ 0.01;
% this makes a one giant matrix, not many
%Atry(:,i+1) = Atry(:,i);
g(i+1) = max(eig(Atry));
end
This will also accomplish what you want to do:
A = @(x) [1 1 1; 2 2 2; 3 3 x];
p = 0:0.01:1;
g = arrayfun(@(x) eigs(A(x),1), p);
Breakdown:
Define A as an anonymous function. This means that the command A(x) will return your matrix A with the (3,3) element equal to x.
Define all steps you want to take in vector p
Then "loop" through all elements in p by using arrayfun instead of an actual loop.
The function looped over by arrayfun is not max(eig(A)) but eigs(A,1), i.e., the single largest eigenvalue. The result will be the same, but the algorithm used by eigs is better suited to your type of problem -- instead of computing all eigenvalues and then only using the maximum one, you only compute the maximum one. Needless to say, this is much faster.
First, you say 0.1 increments in the text of your question, but your code suggests you are actually interested in 0.01 increments? I'm going to operate under the assumption you mean 0.01 increments.
Now, with that out of the way, let me state what I believe you are after given my interpretation of your question. You want to iterate over the matrix A, where for each iteration you increase A(3, 3) by 0.01. Given that you want all values from 0 to 1, this implies 101 iterations. For each iteration, you want to calculate the maximum eigenvalue of A, and store all these eigenvalues in some vector (which I will call gVec). If this is correct, then I believe you just want the following:
% Specify the "Current" A
CurA = [1 1 1; 2 2 2; 3 3 0];
% Pre-allocate the values we want to iterate over for element (3, 3)
A33Vec = (0:0.01:1)';
% Pre-allocate a vector to store the maximum eigenvalues
gVec = NaN * ones(length(A33Vec), 1);
% Loop over A33Vec
for i = 1:1:length(A33Vec)
% Obtain the version of A that we want for the current i
CurA(3, 3) = A33Vec(i);
% Obtain the maximum eigen value of the current A, and store in gVec
gVec(i, 1) = max(eig(CurA));
end
EDIT: Probably best to paste this code into your matlab editor. The stack-overflow automatic text highlighting hasn't done it any favors :-)
EDIT: Go with Rody's solution (+1) - it is much better!

Extremely large weighted average

I am using 64 bit matlab with 32g of RAM (just so you know).
I have a file (vector) of 1.3 million numbers (integers). I want to make another vector of the same length, where each point is a weighted average of the entire first vector, weighted by the inverse distance from that position (actually it's position ^-0.1, not ^-1, but for example purposes). I can't use matlab's 'filter' function, because it can only average things before the current point, right? To explain more clearly, here's an example of 3 elements
data = [ 2 6 9 ]
weights = [ 1 1/2 1/3; 1/2 1 1/2; 1/3 1/2 1 ]
results=data*weights= [ 8 11.5 12.666 ]
i.e.
8 = 2*1 + 6*1/2 + 9*1/3
11.5 = 2*1/2 + 6*1 + 9*1/2
12.666 = 2*1/3 + 6*1/2 + 9*1
So each point in the new vector is the weighted average of the entire first vector, weighting by 1/(distance from that position+1).
I could just remake the weight vector for each point, then calculate the results vector element by element, but this requires 1.3 million iterations of a for loop, each of which contains 1.3million multiplications. I would rather use straight matrix multiplication, multiplying a 1x1.3mil by a 1.3milx1.3mil, which works in theory, but I can't load a matrix that large.
I am then trying to make the matrix using a shell script and index it in matlab so only the relevant column of the matrix is called at a time, but that is also taking a very long time.
I don't have to do this in matlab, so any advice people have about utilizing such large numbers and getting averages would be appreciated. Since I am using a weight of ^-0.1, and not ^-1, it does not drop off that fast - the millionth point is still weighted at 0.25 compared to the original points weighting of 1, so I can't just cut it off as it gets big either.
Hope this was clear enough?
Here is the code for the answer below (so it can be formatted?):
data = load('/Users/mmanary/Documents/test/insertion.txt');
data=data.';
total=length(data);
x=1:total;
datapad=[zeros(1,total) data];
weights = ([(total+1):-1:2 1:total]).^(-.4);
weights = weights/sum(weights);
Fdata = fft(datapad);
Fweights = fft(weights);
Fresults = Fdata .* Fweights;
results = ifft(Fresults);
results = results(1:total);
plot(x,results)
The only sensible way to do this is with FFT convolution, which underpins the filter function and similar operations. It is very easy to do manually:
% Simulate some data
n = 10^6;
x = randi(10,1,n);
xpad = [zeros(1,n) x];
% Setup smoothing kernel
k = 1 ./ [(n+1):-1:2 1:n];
% FFT convolution
Fx = fft(xpad);
Fk = fft(k);
Fxk = Fx .* Fk;
xk = ifft(Fxk);
xk = xk(1:n);
Takes less than half a second for n=10^6!
This is probably not the best way to do it, but with lots of memory you could definitely parallelize the process.
You can construct sparse matrices whose nonzero entries are the weights i^(-1) (where i = 1 .. 1.3 million), multiply each with your original vector, and sum all the results together.
So for your example the product would be essentially:
a = rand(3,1);
b1 = [1 0 0;
0 1 0;
0 0 1];
b2 = [0 1 0;
1 0 1;
0 1 0] / 2;
b3 = [0 0 1;
0 0 0;
1 0 0] / 3;
c = sparse(b1) * a + sparse(b2) * a + sparse(b3) * a;
Of course, you wouldn't construct the sparse matrices this way. If you wanted fewer iterations of the inside loop, you could put more than one of the i's in each matrix.
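A hedged sketch of that band-by-band idea using spdiags (my addition, written with one band per sparse matrix purely to show the structure; for the real 1.3M-element problem you would group many bands into each matrix, as suggested above):
n = numel(data);
result = zeros(n, 1);
for d = 0:n-1                                     % d = distance from the main diagonal
    w = 1/(d+1);                                  % or (d+1)^(-0.1) for the real weighting
    if d == 0
        Bd = spdiags(w*ones(n,1), 0, n, n);       % main diagonal
    else
        Bd = spdiags(w*ones(n,2), [-d d], n, n);  % both off-diagonals at distance d
    end
    result = result + Bd * data(:);               % accumulate this band's contribution
end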
Look into the parfor loop in MATLAB: http://www.mathworks.com/help/toolbox/distcomp/parfor.html
I can't use matlab's 'filter' function, because it can only average things before the current point, right?
That is not correct. You can always add samples (i.e., add or remove zeros) to your data or to the filtered data. Since filtering with filter (you can also use conv, by the way) is a linear operation, it won't change the result (it's like adding and removing zeros, which does nothing, and then filtering; linearity then allows you to swap the order: add samples -> filter -> remove samples).
Anyway, in your example, you can take the averaging kernel to be:
weights = 1 ./ [3 2 1 2 3]; % this kernel introduces a delay of 2 samples
and then simply:
result = filter(weights,1,[data, zeros(1,3)]); % or conv(data, weights)
% removing the delay introduced by the kernel
result = result(3:end-1);
You considered only 2 options: multiplying the 1.3M x 1.3M matrix with a vector once, or multiplying two 1.3M vectors 1.3M times.
But you can divide your weight matrix into as many sub-matrices as you wish and do a multiplication of an n x 1.3M matrix with the vector 1.3M/n times.
I assume the fastest option will be the one with the smallest number of iterations, with n chosen so that the largest sub-matrix still fits in your memory without making your computer start swapping pages to your hard drive.
With your memory size you should start with n=5000.
You can also make it faster by using parfor (with n divided by the number of processors).
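A hedged sketch of that blocked idea (my addition; variable names and the block size are illustrative, and blk should be chosen so a blk-by-1.3M block of doubles fits in RAM):
N = numel(data);
blk = 1000;                                              % rows of the weight matrix built at a time
pos = 1:N;
result = zeros(1, N);
for r = 1:blk:N
    rows = r:min(r+blk-1, N);
    W = (abs(bsxfun(@minus, rows.', pos)) + 1).^(-0.1);  % blk-by-N slice of the weight matrix
    result(rows) = (W * data(:)).';                      % weighted sums for this block of output points
end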
The brute force way will probably work for you, with one minor optimisation in the mix.
The ^-0.1 operations to create the weights will take a lot longer than the + and * operations to compute the weighted-means, but you re-use the weights across all the million weighted-mean operations. The algorithm becomes:
Create a weightings vector with all the weights any computation would need:
weights = (abs(-n:n) + 1).^-0.1
For each element in the vector:
Index the relevant portion of the weights vector to consider the current element as the 'centre'.
Perform the weighted-mean with the weights portion and the entire vector. This can be done with a fast vector dot-multiply followed by a scalar division.
The main loop does n^2 additions and multiplications. With n equal to 1.3 million that's 3.4 trillion operations. A single core of a modern 3GHz CPU can do, say, 6 billion additions/multiplications a second, so that comes out to around 10 minutes. Add time for indexing the weights vector and overheads, and I still estimate you could come in under half an hour.
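A hedged sketch of that precomputed-weights loop (my addition; names are illustrative, and the weights follow the question's (distance+1)^-0.1 form):
n = numel(data);
allw = (abs(-(n-1):(n-1)) + 1).^(-0.1);   % one weight for every possible signed distance
result = zeros(1, n);
for i = 1:n
    w = allw(n-i+1 : 2*n-i);              % slice of the weights centred on element i
    result(i) = w * data(:);              % dot product: weighted sum over the whole vector
end                                       % divide by sum(w) here if a normalized mean is wanted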