I tested the following commands at the Octave, Scilab and Matlab prompts.
>> A = rand(10000,10000);
>> B = rand(10000,1);
>> tic, A\B; toc
The timings were roughly 40, 15.8, and 15.7 seconds, respectively. For comparison, Mathematica's performance was
In[7]:= A = RandomReal[{0, 1}, {10000, 10000}];
In[9]:= B = RandomReal[{0, 1}, 10000];
In[10]:= Timing[LinearSolve[A, B];]
Out[10]= {14.125, Null}
Does this indicate that Octave is less capable than the other packages at solving linear systems?
I think that your tests are flawed. The algorithms behind A\B exploit special patterns and structure in the system of equations, so execution time depends very much on what rand(10000,10000) has generated. On three different runs with Octave 4.0.0 on my machine, your code takes 7.1 s, 95.1 s and 16.4 s. That suggests the first matrix rand generated happened to have structure the solver could exploit, and the same could have been the case when you tested your code with Scilab and Matlab. So unless you make sure the algorithms are evaluating the same thing, or unless you average the execution times in a sound manner (which is not trivial to design), it doesn't make sense to compare them as you did.
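As a starting point, here is a minimal sketch of repeated timing (the repetition count and the summary statistics are my own arbitrary choices, not from the original test):

nreps = 10;                      % number of repetitions; arbitrary
t = zeros(nreps, 1);
for k = 1:nreps
    A = rand(10000, 10000);      % a fresh random system each time
    B = rand(10000, 1);
    tic; A\B; t(k) = toc;
end
fprintf('median %.1f s, min %.1f s, max %.1f s\n', median(t), min(t), max(t));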
@dimitris Thanks for the question; I found your method quite helpful for a quick comparison, and the somewhat different answers I get now are interesting. I didn't have the problems alluded to by the other respondents; the timings were consistent and helpful.
Here are the times on my Windows machine (an Asus with a Ryzen 7 4700U), where I've printed 5 runs each for Scilab 6.1.0, Octave 6.2.0, Python 3.8.7 and Julia 1.5.3. I've been exploring their possible use for a job I currently have. I don't have (can't afford...) either Matlab or Mathematica, so no results from those.
Octave
>> for i=1:5;A=rand(10000,10000);B=rand(10000,1);tic;A\B;toc,end;
Elapsed time is 9.79964 seconds.
Elapsed time is 9.78249 seconds.
Elapsed time is 9.70953 seconds.
Elapsed time is 9.73357 seconds.
Elapsed time is 9.69932 seconds.
Scilab
--> for i=1:5;A=rand(10000,10000);B=rand(10000,1);tic;A\B;toc,end;
ans = 10.407628
ans = 10.706747
ans = 10.490241
ans = 10.773073
ans = 10.517951
Julia
m = 10000;
for i = 1:5
    A = rand(m, m); B = rand(m, 1);
    t = time(); A\B;
    println(time() - t)
end
7.833999872207642
7.7170000076293945
7.675999879837036
7.76200008392334
7.740999937057495
Python
import numpy as np
import datetime as dt

N = 10000
for i in range(5):
    A = np.random.random((N, N))
    B = np.random.random((N, 1))
    tic = dt.datetime.now()
    np.linalg.solve(A, B)
    print(dt.datetime.now() - tic)
0:00:05.567395
0:00:05.703859
0:00:05.050467
0:00:04.995202
0:00:05.294050
You should probably run the tests in each package around a thousand times or more. Also note that they should all be using the same underlying algorithms, just somewhat differently tuned. A more sensible approach is to test many cases over many different dimensions and average the results, along the lines of the sketch below.
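Something like this (the sizes and repetition counts below are arbitrary choices of mine):

sizes = [1000 2000 4000 8000];   % dimensions to sweep; arbitrary
nreps = 20;
avg = zeros(size(sizes));
for s = 1:numel(sizes)
    n = sizes(s);
    t = zeros(nreps, 1);
    for k = 1:nreps
        A = rand(n); B = rand(n, 1);
        tic; A\B; t(k) = toc;
    end
    avg(s) = mean(t);            % mean solve time at this dimension
end
disp([sizes(:) avg(:)]);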
Most of the matrix math comes from LAPACK. The difference is that Matlab ships libraries built from Fortran and C++ that may be slightly better tuned; I believe they make somewhat better use of the vector units in your CPU. The library is Intel's MKL (Math Kernel Library).
The actual algorithm chosen changes depending on the structure and size of the matrix.
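In MATLAB you can check which BLAS and LAPACK builds are actually loaded (version('-blas') and version('-lapack') are documented MATLAB options; Octave reports this differently):

version('-blas')     % e.g. an Intel MKL version string on stock MATLAB
version('-lapack')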
Related
I'm a new Julia user and I need to find eigenvectors of large matrices as quickly as possible*. I'm having trouble getting Julia to run as fast as Matlab for the following example:
Julia
const j = 1000 ::Int
A = Array{Float64}(j,j)
B = Array{Float64}(j,j)
f(x) = eigvecs(x)
A = randn(j,j)
B = f(A)
@time f(A)
output for time: 2.950973 seconds (12.31 k allocations: 76.445 MB, 0.11% gc time)
Matlab
j = 1000;
A = randn(j,j);
tic
[v, d] = eig(A);
toc
Elapsed time is 1.161133 seconds.
I have also checked Matlab restricted to 1 thread for comparison, using maxNumCompThreads = 1, but it still gives a time (1.16 s) similar to before. I've also tried to speed up Julia by running the function twice to precompile, and by setting blas_set_num_threads(4), but neither helps.
I'd really appreciate any advice about how to improve my Julia code!
*(I am using Matlab 2015b and Julia 0.4.7 on OS X El Capitan 10.11.6)
Kind of a duplicate of this discussion.
Usually when talking about Julia performance, you're talking about how the language itself works. In this case, both Julia and MATLAB are just calling well-optimized C/Fortran libraries for the eigenvalue calculation, so performance depends on the BLAS configuration. MATLAB ships with a version of MKL, so it's simply using a different library, one which in many cases is faster than OpenBLAS; but you can build Julia with MKL using the instructions in the README on the Julia GitHub repo. Maybe rebuilding your sysimg could also help:
include(joinpath(dirname(JULIA_HOME),"share","julia","build_sysimg.jl")); build_sysimg(force=true)
If you are using a pre-built binary then it's not optimized for your system, and this will enable the optimizations.
Can anyone help me understand what or where the problem is?
I am comparing the speed of a basic MATLAB function, mean, between two MATLAB versions, R2013b and R2014b, on the same machine,
and surprisingly, R2013b is much faster than R2014b.
Has anyone else had this problem?
Profile summary of mean with R2014b --> 0.024
Profile summary of mean with R2013b --> 0.013
Since I use the mean function very often in my scripts, the difference in running time of the same program between the two versions is huge.
What's going on?
The code to compute the profile time:
A = rand(100, 1);
time_mean = zeros(1000, 1);
for i = 1:1000
    tic
    mean(A);
    time_mean(i) = toc;
end
Firstly, it's not wise to use the profiler to compare timings across releases - it's designed to identify slow portions in a single MATLAB release. Secondly, you should use timeit to time this sort of thing. I compared R2013b and R2014b on my Windows machine over a range of sizes, and can see what appears to be a small fixed overhead in R2014b of around 0.1ms.
Code is essentially:
for exp = 1:6
    A = rand(10^exp, 1);
    t(exp) = timeit(@()mean(A));
end
semilogy(1:6, t);
If you are making lots of individual calls to mean, you might be better off seeing if you can form these into a single call - MATLAB's mean can operate down columns or along rows of a matrix...
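For example (my own toy illustration, not from the original question):

M = rand(100, 50);
colMeans = mean(M, 1);   % 1x50 row vector: mean down each column (the default)
rowMeans = mean(M, 2);   % 100x1 column vector: mean along each row
% One such call replaces a loop like: for k = 1:50, colMeans(k) = mean(M(:,k)); end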
(correctly and instructively answered, see below)
I'm beginning to experiment with MATLAB and a GPU (an Nvidia GTX 660).
I wrote this simple Monte Carlo algorithm to calculate pi. The following is the CPU version:
function pig = mc1vecnocuda(n)
countr = 0;
A = rand(n, 2);
for i = 1:n
    if norm(A(i,:)) < 1
        countr = countr + 1;
    end
end
pig = (countr/n)*4;
end
This takes very little time to execute on the CPU, with 100000 points "thrown" into the unit circle:
>> tic; mc1vecnocuda(100000);toc;
Elapsed time is 0.092473 seconds.
See, instead, what happens with the GPU version of the algorithm:
function pig = mc1veccuda(n)
countr = 0;
gpucountr = gpuArray(countr);
A = gpuArray.rand(n, 2);
parfor (i = 1:n, 1024)
    if norm(A(i,:)) < 1
        gpucountr = gpucountr + 1;
    end
end
pig = (gpucountr/n)*4;
end
Now, this takes a LONG time to execute:
>> tic; mc1veccuda(100000);toc;
Elapsed time is 21.137954 seconds.
I don't understand why. I used a parfor loop with 1024 workers because, querying my Nvidia card with gpuDevice, 1024 is the maximum number of simultaneous threads allowed on the GTX 660.
Can someone help me? Thanks.
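For reference, this is the query I mean (MaxThreadsPerBlock is the relevant property of the gpuDevice object):

d = gpuDevice;           % properties of the currently selected GPU
d.MaxThreadsPerBlock     % reports 1024 on this GTX 660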
Edit: this is the updated version, which avoids the IF:
function pig = mc2veccuda(n)
countr = 0;
gpucountr = gpuArray(countr);
A = gpuArray.rand(n, 2);
parfor (i = 1:n, 1024)
    gpucountr = gpucountr + nnz(norm(A(i,:)) < 1);
end
pig = (gpucountr/n)*4;
end
And this is the code written following Bichoy's guidelines (the right code to achieve the result):
function pig = mc3veccuda(n)
countr = 0;
gpucountr = gpuArray(countr);
A = gpuArray.rand(n, 2);
Asq = A.^2;
Asqsum_big_column = Asq(:,1) + Asq(:,2);
Anorms = Asqsum_big_column.^(1/2);
gpucountr = gpucountr + nnz(Anorms < 1);
pig = (gpucountr/n)*4;
end
Please note the execution time with n = 10 million:
>> tic; mc3veccuda(10000000); toc;
Elapsed time is 0.131348 seconds.
>> tic; mc1vecnocuda(10000000); toc;
Elapsed time is 8.108907 seconds.
I didn't time my original CUDA version (the for/parfor one), since its execution would require hours with n = 10000000.
Great Bichoy! ;)
I guess the problem is with parfor!
parfor is supposed to run on MATLAB workers, that is, on your host, not on the GPU!
I guess what is actually happening is that you are starting 1024 threads on your host (not on your GPU) and each of them is trying to call the GPU. This results in the tremendous time your code is taking.
Try to rewrite your code to use matrix and array operations, not for-loops! That will show some speedup. Also, remember that you need to have much more computation to do on the GPU; otherwise, memory transfer will just dominate your code.
Code:
This is the final code after including all corrections and suggestions from several people:
function pig = mc2veccuda(n)
A = gpuArray.rand(n, 2);         % an nx2 random matrix on the GPU
Asq = A.^2;                      % square each element
Anormsq = Asq(:,1) + Asq(:,2);   % squared norm of each point
gpucountr = nnz(Anormsq < 1);    % norm < 1 iff norm^2 < 1, so no sqrt needed
pig = (gpucountr/n)*4;
end
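One usage note (my addition, not from the original post): the result of GPU arithmetic is itself a gpuArray; gather brings it back to the host, and is a harmless no-op if the value is already an ordinary array.

pig = gather(mc2veccuda(1e7));   % force the result back into host memory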
There are many possible reasons, such as:
Movement of data between host and device
Computation within each loop iteration is very small
The call to rand on the GPU may not run in parallel
The if condition within the loop can cause thread divergence
Accumulation into a common variable may run serially, with overhead
It is difficult to profile MATLAB+CUDA code. You should probably try it in native C++/CUDA and use NVIDIA Parallel Nsight to find the bottleneck.
As Bichoy said, CUDA code should always be vectorized. In MATLAB, unless you're writing a CUDA kernel, the only large speedup you get is that the vectorized operations are dispatched to the GPU, which has thousands of (slow) cores. If you don't have large vectors and vectorized code, it won't help.
Another thing that hasn't been mentioned is that for highly parallel architectures like GPUs you want to use different random number generation algorithms than the "standard" ones. So to add to Bichoy's answer, selecting the generator 'Threefry4x64' (64-bit) or 'Philox4x32-10' (32-bit and a lot faster! Super fast!) can lead to large speedups in CUDA code. MATLAB explains this here: http://www.mathworks.com/help/distcomp/examples/generating-random-numbers-on-a-gpu.html
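For instance, something like this (a sketch based on that MathWorks example; 'Philox4x32-10' is one of the generator names it lists):

stream = parallel.gpu.RandStream('Philox4x32-10', 'Seed', 1);
parallel.gpu.RandStream.setGlobalStream(stream);
A = gpuArray.rand(1e6, 2);   % now drawn with the Philox generator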
I have access to a 12 core machine and some matlab code that relies heavily on fftn. I would like to speed up my code.
Since the fft can be parallelized I would think that more cores would help but I'm seeing the opposite.
Here's an example:
X = peaks(1028);
ncores = feature('numcores');
ntrials = 20;
mtx_power_times = zeros(ncores,ntrials);
fft_times = zeros(ncores, ntrials);
for i = 1:ncores
    for j = 1:ntrials
        maxNumCompThreads(i);
        tic;
        X^2;
        mtx_power_times(i,j) = toc;
        tic
        fftn(X);
        fft_times(i,j) = toc;
    end
end
subplot(1,2,1);
plot(mtx_power_times,'x-')
title('mtx power time vs number of cores');
subplot(1,2,2);
plot(fft_times,'x-');
title('fftn time vs num of cores');
Which produces a plot of matrix-power time and fftn time versus the number of cores.
The speedup for matrix multiplication is great but it looks like my ffts go almost 3x slower when I use all my cores. What's going on?
For reference my version is 7.12.0.635 (R2011a)
Edit: On large 2D arrays, taking 1D transforms, I get the same problem.
Edit: The problem appears to be that FFTW does not see the thread limit that maxNumCompThreads enforces; all the CPUs run at full speed no matter what I set maxNumCompThreads to.
So... is there a way I can specify how many processors I want to use for an fft in Matlab?
Edit: Looks like I can't do this without some careful work in .mex files. http://www.mathworks.com/matlabcentral/answers/35088-how-to-control-number-of-threads-in-fft has an answer. It would be nice if someone has an easy fix...
To use multiple cores explicitly, you can use the Parallel Computing Toolbox. For instance, you could use a parfor loop, factoring the per-iteration work into a function:
function x = f(n, i)
    ...
end

m = ones(8);
parfor i = 1:8
    m(i,:) = f(m(i,:), i);
end
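Note (my addition) that parfor only runs in parallel once a worker pool is open; the call below is for current releases, while releases around R2011a used matlabpool instead:

parpool(4);   % open a pool of 4 workers (matlabpool('open', 4) on old releases)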
More info is available at:
High performance computing
Multithreaded computation
Multithreading
Suppose, in MATLAB, that I have a matrix, A, whose elements are either 0 or 1.
How do I get a vector of the index of the last non-zero element of each column in a faster, vectorized way?
I could do
[B, I] = max(cumsum(A));
and use I, but is there a faster way? (I'm assuming cumsum would cost a bit of time even summing 0's and 1's.)
Edit: I guess I vectorized more aggressively than speed demanded - Mr. Fooz's loop is great, but each loop I write in MATLAB seems to cost me a lot of debugging time, even when it is fast.
Fast is what you should worry about, not necessarily full vectorization. Recent versions of Matlab are much smarter about handling loops efficiently. If there's a compact vectorized way of expressing something, it's usually faster, but loops should not (always) be feared like they used to be.
clc
A = rand(5000)>0.5;
A(1,find(sum(A,1)==0)) = 1; % make sure there is at least one match
% Slow because it is doing too much work
tic;[B,I1]=max(cumsum(A));toc
% Fast because FIND is fast and it runs the inner loop
tic;
I3 = zeros(1,5000);
for i = 1:5000
    I3(i) = find(A(:,i), 1, 'last');
end
toc;
assert(all(I1==I3));
% Even faster because the JIT in Matlab is smart enough now
tic;
I2 = zeros(1,5000);
for i = 1:5000
    I2(i) = 0;
    for j = 5000:-1:1
        if A(j,i)
            I2(i) = j;
            break;
        end
    end
end
toc;
assert(all(I1==I2));
On R2008a, Windows, x64, the cumsum version takes 0.9 seconds. The loop and find version takes 0.02 seconds. The double loop version takes a mere 0.001 seconds.
EDIT: Which one is fastest depends on the actual data. The double-loop version takes 0.05 seconds when you change the 0.5 to 0.999 (because, on average, it takes longer to hit the break). The cumsum and loop-and-find implementations have more consistent speeds.
EDIT 2: gnovice's flipud solution is clever. Unfortunately, on my test machine it takes 0.1 seconds, so it's much faster than cumsum, but slower than the looped versions.
As shown by Mr Fooz, for loops can be pretty fast now with newer versions of MATLAB. However, if you really want to have compact vectorized code, I would suggest trying this:
[B,I] = max(flipud(A));
I = size(A,1)-I+1;
This is faster than your CUMSUM-based answer, but still not quite as fast as Mr Fooz's looping options.
Two additional things to consider:
What results do you want to get for a column that has no ones in it at all? With the above option I gave you, I believe you will get an index of size(A,1) (i.e. the number of rows in A) in such a case. For your option, I believe you will get a 1 in such a case, while the nested-for-loops option from Mr Fooz will give you a 0.
The relative speed of these different options will likely vary based on the size of A and the number of non-zeroes you expect it to have.
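A quick toy check of those edge cases (my own example, not from the answer):

A = [1 0; 0 0];                  % column 2 has no ones at all
[B1, I1] = max(cumsum(A));       % I1 = [1 1]: the zero column maps to 1
[B2, I2] = max(flipud(A));
I2 = size(A,1) - I2 + 1;         % I2 = [1 2]: the zero column maps to size(A,1)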