MATLAB Optimization column access - matlab

I'm working on optimizing my code. I'm going through MATLAB's own optimization webpage (https://www.mathworks.com/company/newsletters/articles/programming-patterns-maximizing-code-performance-by-optimizing-memory-access.html) and writing code to make sure I understand how to increase performance. Point 2 suggests that when you store and access matrix data, you should do so by columns. However, I can't construct code of my own that shows this performance boost. If I'm reading it right, the two simple snippets below should differ in performance by about 30%, but I'm getting far, far less than that:
x=zeros(10000);
tic
for i = 1:10000
    for j = 1:10000
        if x(i,j) == 0
            x(i,j) = randn(1);
        end
    end
end
toc
Elapsed time is 41.102294 seconds.
x=zeros(10000);
tic
for j = 1:10000
    for i = 1:10000
        if x(i,j) == 0
            x(i,j) = randn(1);
        end
    end
end
toc
Elapsed time is 39.654574 seconds.
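If it helps, a stripped-down variation (my own, not from the article) that drops the randn call and the if test, so the timing is dominated by the array access itself, would look like this:
x = zeros(10000);
% row-wise traversal: j varies fastest, so access is strided in memory
tic
s = 0;
for i = 1:10000
    for j = 1:10000
        s = s + x(i,j);
    end
end
toc
% column-wise traversal: i varies fastest, so access is contiguous in memory
tic
s = 0;
for j = 1:10000
    for i = 1:10000
        s = s + x(i,j);
    end
end
toc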

Related

Show time in the title of a plot, matlab

I'm working with unsteady-state problems, that is, problems that depend on time. I'd like to know how to show the computation time on a graph, something like this:
What I have found so far is:
tic
% PROCESS
toc
which tells me how long a process lasts, but I'm not looking for just the tic/toc output.
Do you know how I can show the elapsed time, in seconds, on the graph?
For example: (Please read: https://www.mathworks.com/help/matlab/ref/title.html)
tic
% PROCESS
t = toc;
x = 1:10;
y = 1:10;
plot(x,y)
title({'Your title...'; ['t=',num2str(t),'s.']})
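A variation on the same idea, using sprintf for the formatting (just an alternative, not required):
t = toc;                      % elapsed time from the tic before your process
plot(1:10, 1:10)
title({'Your title...'; sprintf('t = %.3f s', t)})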

different tic toc results in one-time run or run in steps

I have come across a weird fact! I wrote a MATLAB script that compares the running time of two algorithms. I wrap the code of each algorithm in tic/toc calls so I can compare their time complexities. The weird thing is this:
If I run the script all at once, algorithm 1 takes longer than algorithm 2, but if I put breakpoints after each toc, algorithm 1 ends up with the shorter time!
To make clearer what I mean, consider the following:
tic
% some matlab codes implementing algorithm1
time1 = toc;
disp(['t1 = ',num2str(time1)]) % here is the place for first breakpoint
tic
% some matlab codes implementing algorithm2
time2 = toc;
disp(['t2 = ',num2str(time2)]) % here is the place for second breakpoint
Can anyone please explain why this happens? Thanks
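One way to make such comparisons more repeatable is to warm both implementations up first and average over several runs; a minimal sketch, where algorithm1 and algorithm2 are my placeholders for the two pieces of code wrapped into functions:
% warm-up run so JIT compilation and caching do not favour
% whichever algorithm happens to be timed first
algorithm1(); algorithm2();

nRuns = 20;          % average over this many repetitions
t1 = 0; t2 = 0;
for k = 1:nRuns
    tic; algorithm1(); t1 = t1 + toc;
    tic; algorithm2(); t2 = t2 + toc;
end
disp(['mean t1 = ', num2str(t1/nRuns)])
disp(['mean t2 = ', num2str(t2/nRuns)])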

How does TIC TOC in matlab work?

I am working with loops. I am using tic/toc to check the time. I get a slightly different time each time I run the same loop; the times are close to each other, e.g. 98.2 and 97.7. Secondly, when I reduce the size of the loop to half, I expect the time to drop by half, but it doesn't. Can anyone explain how tic/toc actually works?
Thanks.
tic
for i=1:124
    for j=1:10
        for k=1:11
        end
    end
end
toc
Secondly, I tried to use tic/toc inside the loop as shown below. Will it return the total time? I get a number, but I can't verify whether it actually is the total.
for i=1:124
    tic
    for j=1:10
        for k=1:11
        end
    end
    toc
end
tic and toc just measure elapsed wall-clock time in seconds. MATLAB now has a JIT compiler, which means the actual computation time cannot be predicted exactly.
MATLAB also gives you (at least in this context) no real-time guarantees, so you basically always get slightly different elapsed times for the same code.
Read this, it's nicely explained and will hopefully help: http://www.matlabtips.com/matlab-is-no-longer-slow-at-for-loops/
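Regarding the second snippet: each tic inside the loop restarts the timer, so each toc only reports that single iteration, not a running total. If you want the total, you have to accumulate it yourself; a minimal sketch:
total = 0;
for i = 1:124
    tic
    for j = 1:10
        for k = 1:11
        end
    end
    total = total + toc;   % add this iteration's elapsed time
end
disp(['total = ', num2str(total), ' s'])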

matlab if statements with CUDA

I have the following matlab code:
N = 1000;
randarray = gpuArray(rand(N,1));
tic
g = 0;
for i = 1:N
    if randarray(i) > 10
        g = g + 1;
    end
end
toc
secondrandarray = rand(N,1);
g = 0;
tic
for i = 1:N
    if secondrandarray(i) > 10
        g = g + 1;
    end
end
toc
Elapsed time is 0.221710 seconds.
Elapsed time is 0.000012 seconds.
1) Why is the if clause so slow on the GPU? It is slowing down all my attempts at optimisation
2) What can I do to get around this limitation?
Thanks
This is typically a bad thing to do, whether you are doing it on the CPU or the GPU.
The following would be a good way to do the operation you are looking at:
N = 1000;
randarray = gpuArray(100 * rand(N,1));
tic
g = nnz(randarray > 10);
toc
I do not have PCT and I cannot verify whether this actually works (the number of functions supported on the GPU is fairly limited).
However if you had Jacket, you would definitely be able to do the following.
N = 1000;
randarray = gdouble(100 * rand(N, 1));
tic
g = nnz(randarray > 10);
toc
Full disclosure: I am one of the engineers developing Jacket.
No expert on the Matlab gpuArray implementation, but I would suspect that each randarray(i) access in the first loop triggers a PCI-e transaction to retrieve a value from GPU memory, which will incur a very large latency penalty. You might be better served by calling gather to transfer the whole array in a single transaction instead and then loop over a local copy in host memory.
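A minimal sketch of that idea (assuming randarray is the gpuArray from the question):
local = gather(randarray);   % one bulk transfer from GPU to host memory
g = 0;
for i = 1:numel(local)
    if local(i) > 10         % each access is now an ordinary in-memory read
        g = g + 1;
    end
end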
Using MATLAB R2011b and Parallel Computing Toolbox on a now rather old GPU (Tesla C1060), here's what I see:
>> g = 100*parallel.gpu.GPUArray.rand(1, 1000);
>> tic, sum(g>10); toc
Elapsed time is 0.000474 seconds.
Operating on scalar elements of a gpuArray one at a time is always going to be slow, so using the sum method is much quicker.
I cannot comment on a prior answer because I'm too new, so I will extend Pavan's solution here: the nnz function is not (yet) implemented for gpuArrays, at least in the MATLAB version I'm using (R2012a).
In general, it is much better to vectorize MATLAB code. However, in some cases looped code can run fast in MATLAB because of JIT compilation.
Check the results from
N = 1000;
randarray_cpu = rand(N,1);
randarray_gpu = gpuArray(randarray_cpu);
threshold = 0.5;
% CPU: looped
g=0;
tic
for i=1:N
    if randarray_cpu(i)>threshold
        g=g+1;
    end
end
toc
% CPU: vectorized
tic
g = nnz(randarray_cpu>threshold);
toc
% GPU: looped
tic
g=0;
for i=1:N
    if randarray_gpu(i)>threshold
        g=g+1;
    end
end
toc
% GPU: vectorized
tic
g_d = sum(randarray_gpu > threshold);
g = gather(g_d); % I'm assuming that you want this in the CPU at some point
toc
Which is (on my core i7+ GeForce 560Ti):
Elapsed time is 0.000014 seconds.
Elapsed time is 0.000580 seconds.
Elapsed time is 0.310218 seconds.
Elapsed time is 0.000558 seconds.
So what we see from this case is:
Loops in MATLAB are not considered good practice, but in your particular case the loop does run fast because MATLAB "precompiles" it internally (JIT). I changed your threshold from 10 to 0.5, since rand will never give you a value higher than 1.
The looped GPU version performs horribly because at each loop iteration a kernel is launched (or data is read back from the GPU, however TMW implemented it...), which is slow. Lots of small memory transfers while computing essentially nothing is the worst thing you can do on a GPU.
From the last (best) GPU result the answer would be: unless the data is already on the GPU, it doesn't make sense to do this calculation there. Since the arithmetic complexity of the operation is basically nonexistent, the memory-transfer overhead never pays off. If this is part of a bigger GPU calculation, it's OK. If not... better stick to the CPU ;)

Octave/Matlab: Efficient calc of Frobenius inner product?

I have two matrices A and B and what I want to get is:
trace(A*B)
If I'm not mistaken this is called Frobenius inner product.
My concern here is about efficiency. I'm afraid that this straightforward approach will first do the whole multiplication (my matrices have thousands of rows/columns) and only then take the trace of the product, while the operation I really need is much simpler. Is there a function or a syntax to do this efficiently?
Correct: summing the element-wise products will be quicker:
n = 1000;
A = randn(n);
B = randn(n);
tic
sum(sum(A .* B));
toc
tic
sum(diag(A * B'));
toc
Elapsed time is 0.010015 seconds.
Elapsed time is 0.130514 seconds.
sum(sum(A.*B)) avoids doing the full matrix multiplication
How about using vector multiplication?
(A(:)')*B(:)
Run time check
Comparing four options with A and B of size 1000-by-1000:
1. vector inner product: A(:)'*B(:) (this answer) took only 0.0011 sec.
2. Using element wise multiplication sum(sum(A.*B)) (John's answer) took 0.0035 sec.
3. Trace trace(A*B') (proposed by OP) took 0.054 sec.
4. Sum of diagonal sum(diag(A*B')) (option rejected by John) took 0.055 sec.
Take-home message: MATLAB is extremely efficient when it comes to matrix/vector products. The vector inner product is about 3x faster than even the efficient element-wise multiplication solution.
Benchmark code
Code used to provide the run time checks
t=zeros(1,4);
n=1000; % size of matrices
it=100; % average the results over this many trials
for ii = 1:it
    % random inputs
    A = rand(n);
    B = rand(n);
    % John's rejected solution
    tic;
    n1 = sum(diag(A*B'));
    t(1) = t(1) + toc;
    % element-wise solution
    tic;
    n2 = sum(sum(A.*B));
    t(2) = t(2) + toc;
    % MOST efficient solution - using vector product
    tic;
    n3 = A(:)'*B(:);
    t(3) = t(3) + toc;
    % using trace
    tic;
    n4 = trace(A*B');
    t(4) = t(4) + toc;
    % make sure everything is correct
    assert(abs(n1-n2)<1e-8 && abs(n3-n4)<1e-8 && abs(n1-n4)<1e-8);
end
t./it