I am currently looking for the most efficient way to shift and rearrange large matrices. Essentially, I have data with some parabolic shift that needs to be corrected in order to shift the "signal" to a linear event.
I have tried the following solutions and timed them. Is there any other method that may prove to be more efficient?
DATA = ones(100000,501);
DATA(10000,251) = 100;
for i=1:250
DATA(10000+i^2-1000:10000+i^2+1000,251-i) = 100;
DATA(10000+i^2-1000:10000+i^2+1000,251+i) = 100;
end
k = abs(-250:1:250).^2;
d = size(DATA,1);
figure(99)
imagesc(DATA)
t_INDEX = timeit(@()fun_INDEX(DATA,k))
t_SNIPPET = timeit(@()fun_SNIPPET(DATA,k))
t_CIRCSHIFT = timeit(@()fun_CIRCSHIFT(DATA,k))
t_INDEX_clean = timeit(@()fun_INDEX_clean(DATA,k))
t_SPARSE = timeit(@()fun_SPARSE(DATA,k))
t_BSXFUN = timeit(@()fun_BSXFUN(DATA,k))
function fun_INDEX(DATA,k)
DATA_1 = zeros(size(DATA));
for i=1:size(DATA,2)
DATA_1(:,i) = DATA([k(i)+1:end 1:k(i)],i);
end
figure(1)
imagesc(DATA_1)
end
function fun_SNIPPET(DATA,k)
kmax = max(k);
DATA_2 = zeros(size(DATA,1)-kmax,size(DATA,2));
for i=1:size(DATA,2)
DATA_2(:,i) = DATA(k(i)+1:end-kmax+k(i),i);
end
figure(2)
imagesc(DATA_2)
end
function fun_CIRCSHIFT(DATA,k)
DATA_3 = zeros(size(DATA));
for i=1:size(DATA,2)
DATA_3(:,i) = circshift(DATA(:,i),-k(i),1);
end
figure(3)
imagesc(DATA_3)
end
function fun_INDEX_clean(DATA,k)
[m, n] = size(DATA);
k = size(DATA,1)-k;
DATA_4 = zeros(m, n);
for i = (1 : n)
DATA_4(:, i) = [DATA((m - k(i) + 1 : m), i); DATA((1 : m - k(i) ), i)];
end
figure(4)
imagesc(DATA_4)
end
function fun_SPARSE(DATA,k)
[m,n] = size(DATA);
k = -k;
S = full(sparse(mod(k,m)+1,1:n,1,m,n));
DATA_5 = ifft(fft(DATA).*fft(S),'symmetric');
figure(5)
imagesc(DATA_5)
end
function fun_BSXFUN(DATA,k)
DATA = DATA';
k = -k;
[m,n] = size(DATA);
idx0 = mod(bsxfun(@plus,n-k(:),1:n)-1,n);
DATA_6 = DATA(bsxfun(@plus,(idx0*m),(1:m)'));
figure(6)
imagesc(DATA_6)
end
Is there any way to decrease computation time for this kind of problem?
Thanks in advance for any tips!
One option would be to use MATLAB's GPU functions, if your workstation has a GPU. Depending on whether the entire data fits on the GPU at once, it will start to outperform CPU circshift at around 1000 x 1000 matrix size.
The implementation only requires you to copy your data to the GPU with a single statement, and then call circshift on the newly created GPU array.
A small discussion on its performance can be found here: https://www.mathworks.com/matlabcentral/answers/274619-circshift-slower-on-gpu . In particular, the last post describes a much faster GPU implementation if you don't actually need to circularly shift, but can get away with zero padding on one side, which might be relevant here.
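For illustration, here is a minimal sketch of that approach (untested, assuming the Parallel Computing Toolbox is available; the variable names mirror the question):
DATA_gpu = gpuArray(DATA); % single copy of the data to the GPU
SHIFTED = zeros(size(DATA_gpu),'like',DATA_gpu); % preallocate on the GPU
for i = 1:size(DATA_gpu,2)
SHIFTED(:,i) = circshift(DATA_gpu(:,i),-k(i),1); % same per-column shift as on the CPU
end
RESULT = gather(SHIFTED); % copy the result back to host memory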
Related
This is part of my code in Matlab. I tried to make it parallel, but there is an error:
The variable gax in a parfor cannot be classified.
I know why the error occurs: I should tell Matlab that v is an increasing vector which doesn't contain repeated elements. Could anyone help me use this information to parallelize the code?
v=[1,3,6,8];
ggx=5.*ones(15,14);
gax=ones(15,14);
for m=v
if m > 1
parfor j=1:m-1
gax(j,m-1) = ggx(j,m-1);
end
end
if m<nn
parfor jo=m+1:15
gax(jo,m) = ggx(jo,m);
end
end
end
Optimizing code should be closely related to its purpose, especially when you use parfor. The code you wrote in the question can be written in a much more efficient way and definitely does not need to be parallelized.
However, I understand that you tried to simplify the problem just to get the idea of how to slice your variables, so here is a fixed version that can run with parfor. But this is surely not the way to write this code:
v = [1,3,6,8];
ggx = 5.*ones(15,14);
gax = ones(15,14);
nn = 5;
for m = v
if m > 1
temp_end = m-1;
temp = ggx(:,temp_end);
parfor ja = 1:temp_end
gax(ja,temp_end) = temp(ja);
end
end
if m < nn
temp = ggx(:,m);
parfor jo = m+1:15
gax(jo,m) = temp(jo);
end
end
end
A vectorized implementation will look like this:
v = [1,3,6,8];
ggx = 5.*ones(15,14);
gax = ones(15,14);
nn = 5;
m1 = v>1; % first condition with logical indexing
temp = v(m1)-1; % get the values from v
r = ones(1,sum(temp)); % generate a vector of indices
r(cumsum(temp)) = -temp+1; % place the resetting locations
r = cumsum(r); % calculate the indices
r(cumsum(temp)) = temp; % place the ending points
c = repelem(temp,temp); % create an index vector for the columns
inds1 = sub2ind(size(gax),r,c); % convert the indices to linear
mnn = v<nn; % second condition with logical indexing
temp = v(mnn)+1; % get the values from v
r_max = size(gax,1); % get the height of gax
r_count = r_max-temp+1; % calculate no. of rows per value in v
r = ones(1,sum(r_count)); % generate a vector of indices
r([1 r_count(1:end-1)+1]) = temp; % set the starting indices
r(cumsum(r_count)+1) = -(r_count-temp)+1; % place the resetting locations
r = cumsum(r(1:end-1)); % calculate the indices
c = repelem(temp-1,r_count); % create an index vector for the columns
inds2 = sub2ind(size(gax),r,c); % convert the indices to linear
gax([inds1 inds2]) = ggx([inds1 inds2]); % assign the relevant values
This is indeed quite complicated, and not always necessary. A good thing to remember, though, is that nested for-loops are much slower than a single loop, so in some cases (depending on the size of the output) this may be the fastest solution:
for m = v
if m > 1
gax(1:m-1,m-1) = ggx(1:m-1,m-1);
end
if m<nn
gax(m+1:15,m) = ggx(m+1:15,m);
end
end
I originally wrote the following Matlab code to find intersections between a set of Axis-Aligned Bounding Boxes (AABB) and space partitions (here 8 partitions). I believe it is readable by itself; moreover, I have added some comments for even more clarity.
function [A,B] = AABBPart(bbx,it) % bbx: aabb, it: iteration
global F
IT = it+1;
n = size(bbx,1);
F = cell(n,it);
A = Part([min(bbx(:,1:3)),max(bbx(:,4:6))],it,0); % recursive partitioning
B = F; % matlab does not allow
function s = Part(bx,it,J) % output to be global
s = {};
if it < 1; return; end
s = cell(8,1);
p = bx(1:3);
q = bx(4:6);
h = 0.5*(p+q);
prt = [p,h;... % 8 sub-parts (octa)
h(1),p(2:3),q(1),h(2:3);...
p(1),h(2),p(3),h(1),q(2),h(3);...
h(1:2),p(3),q(1:2),h(3);...
p(1:2),h(1),h(1:2),q(3);...
h(1),p(2),h(3),q(1),h(2),q(3);...
p(1),h(2:3),h(1),q(2:3);...
h,q];
for j=1:8 % check for each sub-part
k = 0;
t = zeros(0,1);
for i=1:n
if all(bbx(i,1:3) <= prt(j,4:6)) && ... % intersection test for
all(prt(j,1:3) <= bbx(i,4:6)) % every aabb and sub-part
k = k+1;
t(k) = i;
end
end
if ~isempty(t)
s{j,1} = [t; Part(prt(j,:),it-1,j)]; % recursive call
for i=1:numel(t) % collecting the results
if isempty(F{t(i),IT-it})
F{t(i),IT-it} = [-J,j];
else
F{t(i),IT-it} = [F{t(i),IT-it}; [-J,j]];
end
end
end
end
end
end
Concerns:
In my tests, it seems that a few intersections are probably being missed, say 10 or so out of a setup of 1000 or more. So I would be glad if you could help find any problematic parts in the code.
I am also concerned about using global F. I prefer to get rid of it.
Any other better solution in terms of speed, will be loved.
Note that the code is complete, and you can easily try it with the following setup.
n = 10000; % in the original application, n would be millions
bbx = rand(n,6);
it = 3;
[A,B] = AABBPart(bbx,it);
I want to take exp() of a large matrix (A) with values that repeat at different indices. To speed up the exp() operation I only perform it on the unique values of A and then reassemble the matrix. However, the reassembly of the matrix is quite slow. The following code provides a working example:
% definition of a grid
gridSp = 5:5:35*5;
X = repmat(gridSp,35,1);
Z = repmat(gridSp',1,35);
% calculation of the distances
locMat = [X(:) Z(:)];
dist=sqrt(bsxfun(@minus,locMat(:,1),locMat(:,1)').^2 +...
bsxfun(@minus,locMat(:,2),locMat(:,2)').^2);
sizeDist = size(dist);
uniqueDist = unique(dist,'stable');
[~, Locb] = ismember(dist,uniqueDist);
nn_A = exp(1i*2*pi*rand(sizeDist(1),100));
H_A = zeros(size(nn_A));
freq = linspace(10^-3,10,100);
psdA = 4096*length(freq).*10.*4.*22.6./((1 + 6.*freq*22.6).^(5/3));
for jj=1:100
b = exp(-8.8*uniqueDist*sqrt((freq(jj)/15).^2 + 10^-7));
b = b.*psdA(jj);
A = b(Locb);
droptol = max(A(:))*10^-10;
if min(A(:))<droptol
A = sparse(A);
HH_A = ichol(A,struct('type','ict','shape','lower','droptol',droptol));
else
HH_A = chol(A,'lower');
end
H_A(:,jj) = HH_A*nn_A(:,jj);
end
Especially the reassembly of the matrix
A = b(Locb);
and the conversion of the matrix to sparse
A = sparse(A);
in the last for-loop take up a lot of time. Is there a quicker way to do this? Interestingly:
B = A + A;
is much faster than
A = b(Locb);
I have to perform these operations far more often than the 100 iterations in the example.
Here is a condensed version of the code, upon request (below).
% definition of a grid
gridSp = 5:5:28*5;
X = repmat(gridSp,35,1);
Z = repmat(gridSp',1,35);
% calculation of the distances
locMat = [X(:) Z(:)];
dist=sqrt(bsxfun(@minus,locMat(:,1),locMat(:,1)').^2 +bsxfun(@minus,locMat(:,2),locMat(:,2)').^2);
uniqueDist = unique(dist,'stable');
[~, Locb] = ismember(dist,uniqueDist);
for jj=1:100
b = exp(jj.*uniqueDist);
A = b(Locb);
end
In your example, the dimension of dist is just 980 x 980, in which case you would be better off just performing the dense matrix operation, i.e.
for jj=1:100
A=exp(jj*dist);
end
which is 2 times faster than
for jj=1:100
b = exp(jj.*uniqueDist);
A = b(Locb);
end
for your given example.
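For reference, here is a rough sketch of how the two variants could be timed with timeit (this assumes dist, uniqueDist and Locb from the setup above are in scope; lookupExp is just an illustrative helper name, not part of the original code):
jj = 50; % pick one representative iteration
t_dense = timeit(@() exp(jj*dist)) % dense element-wise exp
t_lookup = timeit(@() lookupExp(jj, uniqueDist, Locb)) % exp on unique values + reassembly
function A = lookupExp(jj, uniqueDist, Locb)
b = exp(jj*uniqueDist); % exp only on the unique distances
A = b(Locb); % reassemble the full matrix via the index map
end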
This is a follow-up question to How to append an element to an array in MATLAB? That question addressed how to append an element to an array. Two approaches are discussed there:
A = [A elem] % for a row array
A = [A; elem] % for a column array
and
A(end+1) = elem;
The second approach has the obvious advantage of being compatible with both row and column arrays.
However, this question is: which of the two approaches is fastest? My intuition tells me that the second one is, but I'd like some evidence for or against that. Any idea?
The second approach (A(end+1) = elem) is faster
According to the benchmarks below (run with the timeit benchmarking function from File Exchange), the second approach (A(end+1) = elem) is faster and should therefore be preferred.
Interestingly, though, the performance gap between the two approaches is much narrower in older versions of MATLAB than it is in more recent versions.
(Benchmark plots for R2008a and R2013a not shown here.)
Benchmark code
function benchmark
n = logspace(2, 5, 40);
% n = logspace(2, 4, 40);
tf = zeros(size(n));
tg = tf;
for k = 1 : numel(n)
x = rand(round(n(k)), 1);
f = @() append(x);
tf(k) = timeit(f);
g = @() addtoend(x);
tg(k) = timeit(g);
end
figure
hold on
plot(n, tf, 'bo')
plot(n, tg, 'ro')
hold off
xlabel('input size')
ylabel('time (s)')
leg = legend('y = [y, x(k)]', 'y(end + 1) = x(k)');
set(leg, 'Location', 'NorthWest');
end
% Approach 1: y = [y, x(k)];
function y = append(x)
y = [];
for k = 1 : numel(x);
y = [y, x(k)];
end
end
% Approach 2: y(end + 1) = x(k);
function y = addtoend(x)
y = [];
for k = 1 : numel(x);
y(end + 1) = x(k);
end
end
How about this?
function somescript
RStime = timeit(@RowSlow)
CStime = timeit(@ColSlow)
RFtime = timeit(@RowFast)
CFtime = timeit(@ColFast)
function RowSlow
rng(1)
A = zeros(1,2);
for i = 1:1e5
A = [A rand(1,1)];
end
end
function ColSlow
rng(1)
A = zeros(2,1);
for i = 1:1e5
A = [A; rand(1,1)];
end
end
function RowFast
rng(1)
A = zeros(1,2);
for i = 1:1e5
A(end+1) = rand(1,1);
end
end
function ColFast
rng(1)
A = zeros(2,1);
for i = 1:1e5
A(end+1) = rand(1,1);
end
end
end
For my machine, this yields the following timings:
RStime =
30.4064
CStime =
29.1075
RFtime =
0.3318
CFtime =
0.3351
The orientation of the vector does not seem to matter much, but the second approach is about a factor of 100 faster on my machine.
In addition to the fast growing method pointed out above (i.e., A(k+1)), you can also get a speed increase from increasing the array size by some multiple, so that reallocations happen less often as the size increases.
On my laptop using R2014b, conditionally doubling the size results in about a factor of 6 speed increase:
>> SO
GATime =
0.0288
DWNTime =
0.0048
In a real application, A would need to be trimmed to the needed size, or the unfilled entries filtered out in some way.
The code for the SO function is below. Note that I switched to cos(k) since, for some unknown reason, there is a large difference in performance between rand() and rand(1,1) on my machine. But I don't think this affects the outcome too much.
function [] = SO()
GATime = timeit(@GrowAlways)
DWNTime = timeit(@DoubleWhenNeeded)
end
function [] = DoubleWhenNeeded()
A = 0;
sizeA = 1;
for k = 1:1E5
if ((k+1) > sizeA)
A(2*sizeA) = 0;
sizeA = 2*sizeA;
end
A(k+1) = cos(k);
end
end
function [] = GrowAlways()
A = 0;
for k = 1:1E5
A(k+1) = cos(k);
end
end
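If the pre-grown array from DoubleWhenNeeded needs to be trimmed to its filled size, as mentioned above, a one-line sketch (assuming the loop wrote entries 1 through k+1) would be:
A = A(1:k+1); % drop the unused tail left over from the doubling strategy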
Can anyone help vectorize this Matlab code? The specific problem is the sum and bessel function with vector inputs.
Thank you!
N = 3;
rho_g = linspace(1e-3,1,N);
phi_g = linspace(0,2*pi,N);
n = 1:3;
tau = [1 2.*ones(1,length(n)-1)];
for ii = 1:length(rho_g)
for jj = 1:length(phi_g)
% Coordinates
rho_o = rho_g(ii);
phi_o = phi_g(jj);
% factors
fc = cos(n.*(phi_o-phi_s));
fs = sin(n.*(phi_o-phi_s));
Ez_t(ii,jj) = sum(tau.*besselj(n,k(3)*rho_s).*besselh(n,2,k(3)*rho_o).*fc);
end
end
You could try to vectorize this code, which might be possible with some bsxfun or so, but it would result in hard-to-understand code, and it is questionable whether it would run any faster, since your code already uses vector math in the inner loop (even though your vectors only have length 3). The resulting code would become very difficult to read, so you or your colleague will have no idea what it does when you look at it in 2 years' time.
Before wasting time on vectorization, it is much more important that you learn about loop invariant code motion, which is easy to apply to your code. Some observations:
you do not use fs, so remove that.
the term tau.*besselj(n,k(3)*rho_s) does not depend on any of your loop variables ii and jj, so it is constant. Calculate it once before your loop.
you should probably pre-allocate the matrix Ez_t.
the only terms that change during the loop are fc, which depends on jj, and besselh(n,2,k(3)*rho_o), which depends on ii. I guess that the latter costs much more time to calculate, so it is better not to calculate it N*N times in the inner loop, but only N times in the outer loop. If the calculation that depends on jj took more time, you could swap the for-loops over ii and jj, but that does not seem to be the case here.
The resulting code would look something like this (untested):
N = 3;
rho_g = linspace(1e-3,1,N);
phi_g = linspace(0,2*pi,N);
n = 1:3;
tau = [1 2.*ones(1,length(n)-1)];
% constant part, does not depend on ii and jj, so calculate only once!
temp1 = tau.*besselj(n,k(3)*rho_s);
Ez_t = nan(length(rho_g), length(phi_g)); % preallocate space
for ii = 1:length(rho_g)
% calculate stuff that depends on ii only
rho_o = rho_g(ii);
temp2 = besselh(n,2,k(3)*rho_o);
for jj = 1:length(phi_g)
phi_o = phi_g(jj);
fc = cos(n.*(phi_o-phi_s));
Ez_t(ii,jj) = sum(temp1.*temp2.*fc);
end
end
Initialization -
N = 3;
rho_g = linspace(1e-3,1,N);
phi_g = linspace(0,2*pi,N);
n = 1:3;
tau = [1 2.*ones(1,length(n)-1)];
Nested loops form (Copy from your code and shown here for comparison only) -
for ii = 1:length(rho_g)
for jj = 1:length(phi_g)
% Coordinates
rho_o = rho_g(ii);
phi_o = phi_g(jj);
% factors
fc = cos(n.*(phi_o-phi_s));
fs = sin(n.*(phi_o-phi_s));
Ez_t(ii,jj) = sum(tau.*besselj(n,k(3)*rho_s).*besselh(n,2,k(3)*rho_o).*fc);
end
end
Vectorized solution -
%%// Term - 1
term1 = repmat(tau.*besselj(n,k(3)*rho_s),[N*N 1]);
%%// Term - 2
[n1,rho_g1] = meshgrid(n,rho_g);
term2_intm = besselh(n1,2,k(3)*rho_g1);
term2 = transpose(reshape(repmat(transpose(term2_intm),[N 1]),N,N*N));
%%// Term -3
angle1 = repmat(bsxfun(@times,bsxfun(@minus,phi_g,phi_s')',n),[N 1]);
fc = cos(angle1);
%%// Output
Ez_t = sum(term1.*term2.*fc,2);
Ez_t = transpose(reshape(Ez_t,N,N));
Points to note about this vectorization or code simplification -
fs doesn't change the output of the script, Ez_t, so it could be removed for now.
The output is Ez_t, which requires three basic terms in the code: tau.*besselj(n,k(3)*rho_s), besselh(n,2,k(3)*rho_o) and fc. These are calculated separately for vectorization as term1, term2 and fc respectively.
All three terms appear to be of size 1xN. Our aim thus becomes to calculate these three terms without loops. Now, the two loops run N times each, giving a total loop count of NxN. Thus, we must have NxN times the data in each such term compared to when these terms were inside the nested loops.
This is basically the essence of the vectorization done here, as the three terms are represented by term1, term2 and fc itself.
In order to give a self-contained answer, I'll copy the original initialization
N = 3;
rho_g = linspace(1e-3,1,N);
phi_g = linspace(0,2*pi,N);
n = 1:3;
tau = [1 2.*ones(1,length(n)-1)];
and generate some missing data (k(3) and rho_s and phi_s in the dimension of n)
rho_s = rand(size(n));
phi_s = rand(size(n));
k(3) = rand(1);
then you can compute the same Ez_t with multidimensional arrays:
[RHO_G, PHI_G, N] = meshgrid(rho_g, phi_g, n);
[~, ~, TAU] = meshgrid(rho_g, phi_g, tau);
[~, ~, RHO_S] = meshgrid(rho_g, phi_g, rho_s);
[~, ~, PHI_S] = meshgrid(rho_g, phi_g, phi_s);
FC = cos(N.*(PHI_G - PHI_S));
FS = sin(N.*(PHI_G - PHI_S)); % not used
EZ_T = sum(TAU.*besselj(N, k(3)*RHO_S).*besselh(N, 2, k(3)*RHO_G).*FC, 3).';
You can check afterwards that both matrices are the same
norm(Ez_t - EZ_T)