I'm computing a function f(x) = exp(-x) in Matlab, where x is a vector of scalars. The function is computed on GPU, e.g.
x_cpu = [4 5 11 1];
x = gpuArray(x_cpu);
f = exp(-x);
then the result would be:
f = exp(-[4, 5, 11, 1]) = [0.0183, 0.0067, 1.6702e-005, 0.3679].
Note that f(x(3)) = f(11) = exp(-11) = 1.6702e-005 = 0.000016702, which is a pretty small value. So, I would like to avoid computing the function for all x(i) > 10 by simply setting f(x(i)) = 0.
I can probably use the sparse matrix representation for x. However, the Parallel Computing Toolbox does not support operations on sparse matrices on GPU.
How would you approach this?
While the Parallel Computing Toolbox does not support sparse matrix operations on the GPU, Jacket does. So one possible approach is simply to use a different tool.
Full disclosure: I work on Jacket, but I really do think it would be beneficial to you here, since it supports the things you want to do that PCT does not.
PLEASE NOTE: This approach is a workaround meant to address the statement in the question:
So, I would like to avoid computing the function for all x(i) > 10 by
simply setting f(x(i)) = 0.
In no way is this a truly "sparse" numerical method. This is simply a means to "avoid computing the function for all x(i) > 10" on the GPU in MATLAB.
% original input vector
x_cpu = [4 5 10 1 13 8 9];
% logical indices of x where exp(-x) is significant
ix = x_cpu <= 10;
% values of x where exp(-x) is significant ("sparse" x)
x_sp = x_cpu(ix);
% Load our "sparse" vector to GPU
x_gpu = gpuArray(x_sp);
% create a vector of zeros for function output on GPU
f_gpu = parallel.gpu.GPUArray.zeros(size(x_cpu));
% do the calculations only for the "sparse" matrix on the GPU
f_gpu(ix) = exp(-x_gpu);
For when you want to get your computations back in the workspace, use gather:
f_cpu = gather(f_gpu); % GPU --> workspace
NOTE: I have not tested this code
You should combine some of these initializations (x_sp or ix, maybe) to conserve memory and speed up the process. Honestly, the initializations and the transfer of data between the workspace and the GPU might actually make this whole process slower than before. Nothing left to do but try it!
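If the round trips between the workspace and the GPU turn out to be the bottleneck, the thresholding can also be done entirely on the GPU. A minimal sketch of that idea (mine, untested; it assumes a PCT release that supports logical indexing on gpuArray and the zeros(..., 'like', ...) syntax):
% threshold entirely on the GPU, with no CPU-side "sparse" step
x_gpu = gpuArray([4 5 10 1 13 8 9]);
f_gpu = zeros(size(x_gpu), 'like', x_gpu); % gpuArray of zeros
ix = x_gpu <= 10;                          % logical mask, computed on the GPU
f_gpu(ix) = exp(-x_gpu(ix));               % exp only where it is significant
f_cpu = gather(f_gpu);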
My goal is to implement a function which performs Fourier synthesis in MATLAB, as part of learning the language. The function implements the following expression:
y = sum(ak*exp((i*2*pi*k*t)/T))
where k is the index, ak is a vector of Fourier coefficients, t is a vector of sampled times, and T is the period of the signal.
I have tried something like this:
for counter = -N:1:N
k = counter+N+1;
y(k) = ak(k)*exp((i*2*pi*k*t)/T);
% y is a vector of length 2N+1
end
However, this gives me an error that the sides do not have equal numbers of items within them. This makes sense to me, since t is a vector of arbitrary length, and thus I am trying to make y(k) equal to numerous things rather than one thing. Instead, I suspect I need to try something like:
for counter = -N:1:N
k = counter+N+1;
for t = 0:1/fs:1
%sum over t elements for exponential operation
end
%sum over k elements to generate y(k)
end
However, I'm supposedly able to implement this using purely matrix multiplication. How could I do this? I've tried to wrap my head around what Matlab is doing, but honestly, it's so far from the other languages I know that I don't really have any sense of what matlab's doing under the hood. Understanding how to change between operations on matrices and operations in for loops would be profoundly helpful.
You can use kron to reach your goal without for loops, i.e., as a single matrix expression:
y = a*exp(1j*2*pi*kron(k.',t)/T);
where a, k and t are all assumed to be row vectors
Example
N = 3;
k = -N:N;
t = 1:0.5:5;
T = 15;
a = 1:2*N+1;
y = a*exp(1j*2*pi*kron(k.',t)/T);
such that
y =
Columns 1 through 6:
19.1335 + 9.4924i 10.4721 + 10.6861i 2.0447 + 8.9911i -4.0000 + 5.1962i -6.4721 + 0.7265i -5.4611 - 2.8856i
Columns 7 through 9:
-2.1893 - 4.5489i 1.5279 - 3.9757i 4.0000 - 1.7321i
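As an aside (my note, not part of the original answer): since k.' is a column vector and t is a row vector, kron(k.', t) is simply their outer product, so an equivalent form is
y2 = a*exp(1j*2*pi*(k.'*t)/T); % k.'*t builds the same (2N+1)-by-numel(t) grid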
MATLAB's power function, which calculates element-wise exponentials for a constant base and an array of exponents, becomes noticeably faster when the size of the array reaches 512. I expected the computation time to increase with the input size; instead, there is a noticeable drop when there are 512 elements in the array of exponents. Here is a sample code:
x_list = 510:514;
for i = 1:numel(x_list)
x = x_list(i);
tic
for j = 1:10000
y = power(2,1:x);
end
toc
end
The output of the code is
Elapsed time is 0.397649 seconds.
Elapsed time is 0.403687 seconds.
Elapsed time is 0.318293 seconds.
Elapsed time is 0.238875 seconds.
Elapsed time is 0.175525 seconds.
What is happening here?
I see the same effect using random numbers for the exponent, as I see using integers in the range 1:n:
x = 500:540;
t = zeros(size(x));
for ii = 1:numel(x)
%m = 1:x(ii);
m = 500*rand(1,x(ii));
t(ii) = timeit(@()power(2,m));
end
plot(x,t)
When forcing MATLAB to use a single thread with maxNumCompThreads(1), and running the code above again, I see this graph instead (note the y-axis, the peaks are just noise):
It looks to me like MATLAB uses a single core to compute the exponent of 511 values, and fires up all cores if the array is larger. There is an overhead to multithreading, so it is not worthwhile for small arrays. The exact point where the overhead is balanced by the time savings depends on many factors, so hard-coding a fixed threshold for when to switch to multithreaded computation leads to a jump in execution time on systems with characteristics different from those of the system where the threshold was determined.
Note that @norok2 is not seeing this same jump because on their system MATLAB was limited to a single thread.
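If you want to check this on your own machine, maxNumCompThreads returns the previous thread count, so the two regimes can be timed back to back. A quick sketch (mine, untested):
% compare single-threaded vs default timing past the 512-element threshold
nOld = maxNumCompThreads(1);           % force a single computational thread
tSingle = timeit(@() power(2, 1:600));
maxNumCompThreads(nOld);               % restore the previous thread count
tMulti = timeit(@() power(2, 1:600));
fprintf('single: %.3g s, default: %.3g s\n', tSingle, tMulti);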
This is related to the size of the numbers for which the power is computed, rather than the size of the container.
If you use random numbers and vary the container size, you do not observe a jump in the timings:
x = 450:1550;
y = zeros(numel(x), 1);
X = rand(1, 10000);
for i = 1:length(x)
f = @() 2 .^ X(1:x(i));
y(i) = timeit(f);
end
figure()
plot(x, y)
Therefore the issue must be with the computation for very large numbers.
At first I thought that this might be related to overflow, but overflow happens at 2 ^ 1024 == inf, as dictated by the IEEE standard which MATLAB follows, and I thought that producing inf would be much faster than computing an actual number.
This is supported by the following benchmark where the size of the array is kept constant:
x = 450:1550;
y = zeros(numel(x), 1);
X = rand(1, 10000);
for i = 1:length(x)
f = @() 2 .^ (ones(1, 500) * x(i));
y(i) = timeit(f);
end
figure()
plot(x, y)
Why exactly this would become relevant for your setup at 2 ^ 512 rather than at 2 ^ 1024, I do not really understand.
(Note that I used 2 .^ ... instead of power(2, ...) but the results are the same.)
Also, running @CrisLuengo's code on my system does not really reproduce any jump.
x = 500:540;
t = zeros(size(x));
for ii = 1:numel(x)
%m = 1:x(ii);
m = 500*rand(1,x(ii));
t(ii) = timeit(@()power(2,m));
end
plot(x,t)
All the evidence so far indicates that the spike is related to JIT latency/warm-up.
Here's some confirmation of what Cris found, using a 4-core Windows machine running MATLAB R2018a. I first tested the following code to show that the specific value of the exponent wasn't the culprit for the jump:
t = zeros(4, 1000);
for p = 1:size(t, 1)
for n = 1:size(t, 2)
t(p, n) = timeit(@() power(2, (2.^(p-1)).*ones(1, n)));
end
end
And here are the results:
For the degenerate edge cases where the exponent is 1 (return the same value) or 2 (return the value times itself), the computation runs faster, as expected. However, the jump at array sizes of 512 and above shows that overhead is added even to these edge cases, whereas exponents of 4 and 8 see a reduction in computation time once the array size exceeds 512. Larger exponent values simply reproduce the upper curves.
I then ran two more tests: one with array size between 1 and 511, and a second with array size between 512 and 1024. Here's what the processor load looked like:
Processor 3 shows a large load spike during the first test, while all 4 processors show load spikes during the second test. This confirms that multithreading is employed for array sizes of 512 and above. It also explains the slower computation for the edge cases at larger sizes, since the multithreading overhead outweighs the speedup gained by splitting up the simpler calculations.
I can't seem to wrap my head around vectorizing a for loop in my MATLAB code. Basically, I have this code:
% Let that:
%Tp = scalar
%N, = scalar (Say 1000)
%Ac, = 4x4 matrix
%pre = 1x4 matrix
%post = 4x2 matrix
%wy1 = N+1x1 matrix (so it would be 1001*1)
%wy2 = N+1x1 matrix (so it would be 1001*1)
% preallocate
delta_ksi=Tp/N;
AcT =Ac';
sum_matrix=zeros(4,1);
Fl=zeros(4,1);
% calculate the sum
for i=1:N
Fl=delta_ksi*expm(AcT*delta_ksi*i)*post*[wy1(1,i); wy2(1,i);];
sum_matrix= sum_matrix+Fl;
end
%value I need
delta_f_des_ff= pre*sum_matrix;
What I have in mind is to construct a 3D matrix Fl_3D (4 x 1 x 1000) and then do array multiplication over i = 1:1000, but I kept getting an incompatible-dimension error when multiplying by [wy1(1,i); wy2(1,i)], which also uses the index i.
Any clue on what is the best approach to do this? Is vectorization still possible?
Thanks!
Edit:
Background:
I am trying to troubleshoot a bottleneck code in a Simulink project.
Code Profiling shows that the above function hogging most of the runtime.
I am hoping vectorizing could solve the performance issue. I also just found out that the bottleneck comes from the expm() operation. Any suggestion to improve this is also welcome!
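Since the thread leaves this open, one idea worth sketching (mine, untested): AcT commutes with itself, so expm(AcT*delta_ksi*i) equals expm(AcT*delta_ksi)^i. A single expm plus a running 4x4 matrix product then replaces the N matrix exponentials, using the variables defined above:
E = expm(AcT*delta_ksi);     % the only expm call
M = E;                       % running power E^i
sum_matrix = zeros(4,1);
for i = 1:N
    sum_matrix = sum_matrix + M*(post*[wy1(1,i); wy2(1,i)]);
    M = M*E;                 % advance to E^(i+1)
end
delta_f_des_ff = delta_ksi*(pre*sum_matrix);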
I want to make the below code more efficient time-wise, preferably without a loop.
arguments:
t % time values vector
t_index = c % one of the possible indices ranging from 1:length(t).
A % an MxN array where M = length(t)
B % a 1xN array
code:
m = 1;
for k = t_index:length(t)
A(k,1:(end-m+1)) = A(k,1:(end-m+1)) + B(m:end);
m = m + 1;
end
Many thanks.
I'd build from B a matrix of size MxN (call it B2), with zeros in the right places and a triangular form according to the conditions, and then all you need to do is A+B2.
something like this:
N=size(A,2);
B2=zeros(size(A));
k=c:length(t);
B2(k(1):k(N),:) = hankel(B);
result = A + B2;
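To see why hankel fits here: row m of hankel(B) is B(m:end) padded with trailing zeros, which is exactly the shifted slice the loop adds. For example:
hankel([1 2 3])
% ans =
%      1     2     3
%      2     3     0
%      3     0     0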
Note: the fact that it is "vectorized" doesn't mean it is faster these days. MATLAB's JIT makes for loops comparable to, and sometimes faster than, built-in vectorized options.
I am using 64-bit MATLAB with 32 GB of RAM (just so you know).
I have a file (vector) of 1.3 million numbers (integers). I want to make another vector of the same length, where each point is a weighted average of the entire first vector, weighted by the inverse distance from that position (actually it's position ^-0.1, not ^-1, but for example purposes). I can't use matlab's 'filter' function, because it can only average things before the current point, right? To explain more clearly, here's an example of 3 elements
data = [ 2 6 9 ]
weights = [ 1 1/2 1/3; 1/2 1 1/2; 1/3 1/2 1 ]
results = data*weights = [ 8 11.5 12.667 ]
i.e.
8 = 2*1 + 6*1/2 + 9*1/3
11.5 = 2*1/2 + 6*1 + 9*1/2
12.667 = 2*1/3 + 6*1/2 + 9*1
So each point in the new vector is the weighted average of the entire first vector, weighting by 1/(distance from that position+1).
I could just remake the weight vector for each point, then calculate the results vector element by element, but this requires 1.3 million iterations of a for loop, each of which contains 1.3million multiplications. I would rather use straight matrix multiplication, multiplying a 1x1.3mil by a 1.3milx1.3mil, which works in theory, but I can't load a matrix that large.
I am then trying to make the matrix using a shell script and index it in matlab so only the relevant column of the matrix is called at a time, but that is also taking a very long time.
I don't have to do this in matlab, so any advice people have about utilizing such large numbers and getting averages would be appreciated. Since I am using a weight of ^-0.1, and not ^-1, it does not drop off that fast - the millionth point is still weighted at 0.25 compared to the original points weighting of 1, so I can't just cut it off as it gets big either.
Hope this was clear enough?
Here is the code for the answer below (so it can be formatted?):
data = load('/Users/mmanary/Documents/test/insertion.txt');
data=data.';
total=length(data);
x=1:total;
datapad=[zeros(1,total) data];
weights = ([(total+1):-1:2 1:total]).^(-.4);
weights = weights/sum(weights);
Fdata = fft(datapad);
Fweights = fft(weights);
Fresults = Fdata .* Fweights;
results = ifft(Fresults);
results = results(1:total);
plot(x,results)
The only sensible way to do this is with FFT convolution, which underpins the filter function and similar operations. It is very easy to do manually:
% Simulate some data
n = 10^6;
x = randi(10,1,n);
xpad = [zeros(1,n) x];
% Setup smoothing kernel
k = 1 ./ [(n+1):-1:2 1:n];
% FFT convolution
Fx = fft(xpad);
Fk = fft(k);
Fxk = Fx .* Fk;
xk = ifft(Fxk);
xk = xk(1:n);
Takes less than half a second for n=10^6!
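One caveat (my note, not part of the answer): this yields weighted sums, not weighted means, because the kernel is unnormalized. To normalize per point, run a vector of ones through the same circular convolution and divide elementwise, continuing from the variables above:
wsum = ifft(fft([zeros(1,n) ones(1,n)]) .* Fk); % per-point weight sums
xk = xk ./ wsum(1:n);                           % weighted sums -> weighted means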
This is probably not the best way to do it, but with lots of memory you could definitely parallelize the process.
You can construct sparse matrices consisting of entries of your original matrix which have value i^(-1) (where i = 1 .. 1.3 million), multiply them with your original vector, and sum all the results together.
So for your example the product would be essentially:
a = rand(3,1);
b1 = [1 0 0;
0 1 0;
0 0 1];
b2 = [0 1 0;
1 0 1;
0 1 0] / 2;
b3 = [0 0 1;
0 0 0;
1 0 0] / 3;
c = sparse(b1) * a + sparse(b2) * a + sparse(b3) * a;
Of course, you wouldn't construct the sparse matrices this way. If you wanted fewer iterations of the inner loop, you could put more than one of the i's in each matrix.
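In case it helps, here is one way that construction could look with spdiags (a sketch of my own, not the answerer's code): each term is a pair of off-diagonals at distance i-1 holding the weight 1/i:
a = rand(3,1);
n = numel(a);
c = a;                                         % i = 1: the identity term
for i = 2:n
    d = ones(n,1)/i;                           % weight 1/i
    bi = spdiags([d d], [-(i-1), i-1], n, n);  % off-diagonals at distance i-1
    c = c + bi*a;
end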
Look into the parfor loop in MATLAB: http://www.mathworks.com/help/toolbox/distcomp/parfor.html
I can't use matlab's 'filter' function, because it can only average
things before the current point, right?
That is not correct. You can always add samples (i.e., zeros) to your data, and remove samples from the filtered data. Since filtering with filter (you can also use conv, by the way) is a linear operation, this doesn't change the result: you pad with zeros, filter, and then remove the delay samples, and linearity guarantees the answer is the same.
Anyway, in your example, you can take the averaging kernel to be:
weights = 1 ./ [3 2 1 2 3]; % this kernel introduces a delay of 2 samples
and then simply:
result = filter(weights, 1, [data, zeros(1,3)]); % or conv(data, weights)
% removing the delay introduced by the kernel
result = result(3:end-1);
You considered only 2 options:
Multiplying a 1.3M-by-1.3M matrix with a vector once, or multiplying two 1.3M-element vectors 1.3M times.
But you can divide your weight matrix into as many sub-matrices as you wish and do a multiplication of an n-by-1.3M matrix with the vector 1.3M/n times.
I assume that the fastest option will have the smallest number of iterations, with n chosen so that the largest sub-matrix still fits in your memory without making your computer start swapping pages to your hard drive.
With your memory size you should start with n = 5000.
You can also make it faster by using parfor (with n divided by the number of processors).
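A rough sketch of that blocking idea (mine, untested; note that an n-by-1.3M block of doubles costs about n x 10.4 MB, so n may need to be well below 5000 to stay inside 32 GB):
n = 1000;                            % block height, tune to available RAM
total = numel(data);                 % data is the 1x1.3M row vector
results = zeros(1, total);
for b = 1:n:total
    rows = b:min(b+n-1, total);
    % weights for this block of rows: (distance + 1)^-0.1
    W = (abs(bsxfun(@minus, rows.', 1:total)) + 1).^(-0.1);
    results(rows) = (W*data.').' ./ sum(W, 2).'; % weighted means
end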
The brute force way will probably work for you, with one minor optimisation in the mix.
The ^-0.1 operations to create the weights will take a lot longer than the + and * operations to compute the weighted-means, but you re-use the weights across all the million weighted-mean operations. The algorithm becomes:
Create a weightings vector with all the weights any computation would need:
weights = (abs(-n:n) + 1).^-0.1
For each element in the vector:
Index the relevant portion of the weights vector to consider the current element as the 'centre'. Perform the weighted-mean with the weights portion and the entire vector. This can be done with a fast vector dot-multiply followed by a scalar division.
Perform the weighted-mean with the weights portion and the entire vector. This can be done with a fast vector dot-multiply followed by a scalar division.
The main loop does n^2 additions and subractions. With n equal to 1.3 million that's 3.4 trillion operations. A single core of a modern 3GHz CPU can do say 6 billion additions/multiplications a second, so that comes out to around 10 minutes. Add time for indexing the weights vector and overheads, and I still estimate you could come in under half an hour.