Julia vs. Matlab benchmarking eigenvector calculations - matlab

I'm a new Julia user and I need to find eigenvectors of large matrices as quickly as possible*. I'm having trouble getting Julia to run as fast as Matlab for the following example:
Julia
const j = 1000 ::Int
A = Array{Float64}(j,j)
B = Array{Float64}(j,j)
f(x) = eigvecs(x)
A = randn(j,j)
B = f(A)
@time f(A)
output for time: 2.950973 seconds (12.31 k allocations: 76.445 MB, 0.11% gc time)
Matlab
j = 1000;
A = randn(j,j);
tic
[v, d] = eig(A);
toc
Elapsed time is 1.161133 seconds.
I have also run Matlab with a single thread, using maxNumCompThreads(1), but it still gives a similar time (1.16 s). I've also tried to speed up Julia by running the function twice, so that compilation is excluded from the timing, and by setting blas_set_num_threads(4), but neither helps.
I'd really appreciate any advice about how to improve my Julia code!
*(I am using Matlab 2015b and Julia 0.4.7 on OSX El Capitan 10.11.6)

Kind of a duplicate of this discussion.
Usually when talking about Julia performance, you're talking about how the language actually works. In this case, however, both Julia and MATLAB are just calling well-optimized C/Fortran libraries to do the eigenvalue calculation, so the timing depends on the BLAS/LAPACK configuration rather than on the language. MATLAB ships with a version of Intel MKL, which in many cases is faster than OpenBLAS, but you can build Julia with MKL using the instructions in the README on the Julia GitHub repo. Rebuilding your sysimg might also help:
include(joinpath(dirname(JULIA_HOME),"share","julia","build_sysimg.jl")); build_sysimg(force=true)
If you are using a pre-built binary, it is not optimized for your system; rebuilding the system image enables those optimizations.

Related

Performance of Octave compared to Matlab and Scilab

I tested the following commands at the Octave, Scilab and Matlab prompts.
>> A = rand(10000,10000);
>> B = rand(10000,1);
>> tic,A\B;, toc
The timings were around 40, 15.8 and 15.7 seconds, respectively. For comparison, Mathematica's performance was
In[7]:= A = RandomReal[{0, 1}, {10000, 10000}];
In[9]:= B = RandomReal[{0, 1}, 10000];
In[10]:= Timing[LinearSolve[A, B];]
Out[10]= {14.125, Null}
Does this indicate that Octave is not as capable as the other packages in the field of linear equations?
I think that your tests are flawed. The algorithms behind A\B make use of special patterns and structures in the system of equations, so execution time depends very much on what rand(10000,10000) has generated. On three different runs with Octave 4.0.0 on my machine, your code returned 7.1s, 95.1s and 16.4s. That indicates that the first matrix generated by rand was probably sparse, and that could also have been the case when you tested your code with Scilab and Matlab. So unless you make sure that the algorithms are evaluating the same thing, or unless you average the execution time in a sound manner (which I do not find trivial to do), it doesn't make sense to compare them as you did.
@dimitris thanks for the question. I found your method quite helpful for a quick comparison, and the somewhat different answers I get now are interesting. I didn't have the problems alluded to by the other respondents: the timings were consistent and helpful.
Here are the times on my Windows machine (Asus with Ryzen 7 4700U), where I've printed 5 runs for each of Scilab 6.1.0, Octave 6.2.0, Python 3.8.7 and Julia 1.5.3. I've been exploring the possible use of them for a job I currently have. I don't have (can't afford) either Matlab or Mathematica, so there are no results from these.
Octave
>> for i=1:5;A=rand(10000,10000);B=rand(10000,1);tic;A\B;toc,end;
Elapsed time is 9.79964 seconds.
Elapsed time is 9.78249 seconds.
Elapsed time is 9.70953 seconds.
Elapsed time is 9.73357 seconds.
Elapsed time is 9.69932 seconds.
Scilab
--> for i=1:5;A=rand(10000,10000);B=rand(10000,1);tic;A\B;toc,end;
ans = 10.407628
ans = 10.706747
ans = 10.490241
ans = 10.773073
ans = 10.517951
Julia
m=10000;
for i=1:5;
A=rand(m,m);B=rand(m,1);
t=time();A\B;
println(time()-t)
end;
7.833999872207642
7.7170000076293945
7.675999879837036
7.76200008392334
7.740999937057495
Python
import numpy as np
import datetime as dt
N = 10000
for i in range(5):
    # Build a fresh random system for each run, outside the timed region
    A = np.random.random((N,N))
    B = np.random.random((N,1))
    tic=dt.datetime.now()
    np.linalg.solve(A,B)
    print((dt.datetime.now()-tic))
0:00:05.567395
0:00:05.703859
0:00:05.050467
0:00:04.995202
0:00:05.294050
You should probably have run the tests in each package many more times, around a thousand or more. Also note that the packages should be using the same algorithms, although some implementations are somewhat less fine-tuned. A more sensible approach is to test many cases over many different dimensions and average the results.
Most matrix math comes from LAPACK. The difference is that Matlab ships DLLs, written in Fortran and C++, that may be slightly better. I believe that they make slightly better use of your math coprocessor. The library is the Intel Math Kernel Library (MKL).
The actual algorithm changes depending on the structure and size of the matrix.
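The advice about repeating the measurement can be sketched as a small timing helper. This is a minimal sketch using only the Python standard library; the names bench and workload are hypothetical, and workload merely stands in for the real A\B solve (the matrices should be built outside the timed function, as the posts above do).

```python
import statistics
import time

def bench(fn, repeats=20, warmup=2):
    """Time fn() several times and summarise the samples (in seconds)."""
    for _ in range(warmup):
        fn()                      # warm-up runs are discarded (JIT, caches)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "runs": len(samples),
        "min": min(samples),
        "median": statistics.median(samples),
        "mean": statistics.mean(samples),
    }

def workload():
    # Hypothetical stand-in for the linear solve being benchmarked.
    return sum(i * i for i in range(10_000))

result = bench(workload, repeats=10)
print(result)
```

Reporting the minimum and median next to the mean makes a single slow outlier (such as the 95.1s run mentioned earlier) easy to spot instead of letting it silently shift the average.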

Why is SCIP taking so long and taking so much memory?

I'm using the SCIP solver in the OPTI toolbox in Matlab to solve a quadratic optimization problem with integer constraints. I ran it with the following settings; it has been running for a day, has already taken up 55 GB of RAM on my system, and is still growing. I'm new to optimization in Matlab. Am I doing something wrong, or is this usual? I tried smaller maxnodes and maxtime values, but then the program stops with the 'Node limit reached' error. Here's the code (H, Aeq etc. have been defined earlier in the code):
X = sym('X%d%d', [104 1]);
fun = @(X) 1/2*X'*H*X;
options = optiset('solver', 'SCIP', 'maxnodes', 20000000, 'maxtime', 100000);
Opt = opti('fun', fun, 'eq', Aeq, Beq, 'xtype', xtype, 'options', options);
[xval,fval,exitflag,info] = solve(Opt)
This is not unusual if the quadratic function(s) are nonconvex. This easily leads to hard problems that cannot be solved to proven optimality with today's algorithms in any reasonably finite amount of time. Note that this does not only depend on the size of the problem, but in general smaller problems (of a similar type) will be easier.
This being said, SCIP might already have found a near-optimal solution that is accessible even when the time or node limit is exceeded.
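One way to check whether the objective 1/2*X'*H*X is convex is to test whether H is positive definite. The following is a rough sketch in plain Python, not part of OPTI or SCIP: the function name is_positive_definite and the tolerance are assumptions, and the test is only a sufficient condition (a singular positive semidefinite H is still convex but would be rejected here).

```python
import math

def is_positive_definite(H, tol=1e-12):
    """Return True if the symmetric part of H has a Cholesky factor."""
    n = len(H)
    # x'Hx depends only on the symmetric part (H + H') / 2.
    S = [[0.5 * (H[i][j] + H[j][i]) for j in range(n)] for i in range(n)]
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= tol:      # non-positive pivot: not positive definite
                    return False
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True

print(is_positive_definite([[2.0, 0.0], [0.0, 1.0]]))  # convex example
print(is_positive_definite([[1.0, 2.0], [2.0, 1.0]]))  # indefinite example
```

If the check fails, the long running time and memory growth described in the question are consistent with the nonconvex case the answer describes.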

Matlab speed problems

Can anyone help me understand what the problem is and where it lies?
I am comparing the speed of a basic Matlab function, mean.m, in two Matlab versions, 2013b and 2014b, on the same machine.
Surprisingly, version 2013b is much faster than 2014b.
Has anyone else had the same problem?
Profile summary of mean with 2014b --> 0.024
Profile summary of mean with 2013b --> 0.013
Since my scripts call the mean function very often, the difference in running time of the same program between the two versions is huge.
What's going on?
The code used to compute the timings:
A=rand(100,1);
time_mean=zeros(1000,1);
for i=1:1000
tic
mean(A);
time_mean(i)= toc;
end
Firstly, it's not wise to use the profiler to compare timings across releases - it's designed to identify slow portions in a single MATLAB release. Secondly, you should use timeit to time this sort of thing. I compared R2013b and R2014b on my Windows machine over a range of sizes, and can see what appears to be a small fixed overhead in R2014b of around 0.1ms.
Code is essentially:
for exp = 1:6
A = rand(10^exp, 1);
t(exp) = timeit(@()mean(A));
end
semilogy(1:6, t);
If you are making lots of individual calls to mean, you might be better off seeing if you can form these into a single call - MATLAB's mean can operate down columns or along rows of a matrix...
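To see why batching matters when each call carries a fixed overhead, one can fit a straight line, time = overhead + per_item * size, to timings taken over a range of sizes. The sketch below uses ordinary least squares in plain Python; the function name fit_line is hypothetical and the timings are made-up numbers chosen for illustration, not measurements from either MATLAB release.

```python
def fit_line(sizes, times):
    """Least-squares fit of times ~ overhead + per_item * size."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    per_item = sxy / sxx
    overhead = mean_y - per_item * mean_x
    return overhead, per_item

# Synthetic timings (assumed): 0.1 ms fixed cost plus 2 ns per element.
sizes = [10 ** e for e in range(1, 7)]
times = [1e-4 + 2e-9 * n for n in sizes]
overhead, per_item = fit_line(sizes, times)

# Cost model: 1000 separate calls on 100 elements vs one call on 100000.
many_calls = 1000 * (overhead + per_item * 100)
one_call = overhead + per_item * 100_000
print(overhead, per_item, many_calls, one_call)
```

Under this model the fixed cost is paid once per call, so a thousand small calls are dominated by overhead while a single batched call is not.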

Matlab CUDA basic experiment

(correctly and instructively answered, see below)
I'm beginning to do experiments with matlab and gpu (nvidia gtx660).
Now, I wrote this simple monte carlo algorithm to calculate PI. The following is the CPU version:
function pig = mc1vecnocuda(n)
countr=0;
A=rand(n,2);
for i=1:n
if norm(A(i,:))<1
countr=countr+1;
end
end
pig=(countr/n)*4;
end
This takes very little time to be executed on CPU with 100000 points "thrown" into the unit circle:
>> tic; mc1vecnocuda(100000);toc;
Elapsed time is 0.092473 seconds.
See, instead, what happens with gpu-ized version of the algorithm:
function pig = mc1veccuda(n)
countr=0;
gpucountr=gpuArray(countr);
A=gpuArray.rand(n,2);
parfor (i=1:n,1024)
if norm(A(i,:))<1
gpucountr=gpucountr+1;
end
end
pig=(gpucountr/n)*4;
end
Now, this takes a LONG time to be executed:
>> tic; mc1veccuda(100000);toc;
Elapsed time is 21.137954 seconds.
I don't understand why. I used parfor loop with 1024 workers, because querying my nvidia card with gpuDevice, 1024 is the maximum number of simultaneous threads allowed on the gtx660.
Can someone help me? Thanks.
Edit: this is the updated version that avoids IF:
function pig = mc2veccuda(n)
countr=0;
gpucountr=gpuArray(countr);
A=gpuArray.rand(n,2);
parfor (i=1:n,1024)
gpucountr = gpucountr+nnz(norm(A(i,:))<1);
end
pig=(gpucountr/n)*4;
end
And this is the code written following Bichoy's guidelines (the right code to achieve the result):
function pig = mc3veccuda(n)
countr=0;
gpucountr=gpuArray(countr);
A=gpuArray.rand(n,2);
Asq = A.^2;
Asqsum_big_column = Asq(:,1)+Asq(:,2);
Anorms=Asqsum_big_column.^(1/2);
gpucountr=gpucountr+nnz(Anorms<1);
pig=(gpucountr/n)*4;
end
Note the execution times with n = 10 million:
>> tic; mc3veccuda(10000000); toc;
Elapsed time is 0.131348 seconds.
>> tic; mc1vecnocuda(10000000); toc;
Elapsed time is 8.108907 seconds.
I didn't test my original cuda version (for/parfor), for its execution would require hours with n=10000000.
Great Bichoy! ;)
I guess the problem is with parfor!
parfor is supposed to run on MATLAB workers, that is, on your host, not on the GPU!
I guess what is actually happening is that you are starting 1024 threads on your host (not on your GPU) and each of them is trying to call the GPU. This results in the tremendous time your code is taking.
Try to rewrite your code to use matrix and array operations, not for-loops! This will show some speed-up. Also, remember that you should have many more calculations to do on the GPU; otherwise, memory transfer will just dominate your code.
Code:
This is the final code after including all corrections and suggestions from several people:
function pig = mc2veccuda(n)
A=gpuArray.rand(n,2); % An nx2 random matrix
Asq = A.^2; % Get the square value of each element
Anormsq = Asq(:,1)+Asq(:,2); % Get the norm squared of each point
gpucountr = nnz(Anormsq<1); % Count points with squared norm < 1 (same as norm < 1)
pig=(gpucountr/n)*4;
end
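The same arithmetic can be sketched on the CPU in plain Python. This is only an illustration of the counting logic, not of GPU performance: it compares the squared distance with 1, so no square root is needed, and counts the hits in a single pass. The function name estimate_pi and the seed parameter are assumptions made for reproducibility.

```python
import random

def estimate_pi(n, seed=0):
    """Monte Carlo estimate of pi from n points in the unit square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x = rng.random()
        y = rng.random()
        # x^2 + y^2 < 1 is equivalent to norm([x y]) < 1, without the sqrt.
        if x * x + y * y < 1.0:
            inside += 1
    return 4.0 * inside / n

approx = estimate_pi(200_000)
print(approx)
```

The fraction of points inside the quarter circle approaches pi/4, which is why the count is multiplied by 4.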
There are many possible reasons, such as:
Movement of data between host & device
Computation within each loop is very small
Call to rand on GPU may not be parallel
if condition within the loop can cause divergence
Accumulation to a common variable may run in serial, with overhead
It is difficult to profile Matlab+CUDA code. You should probably try native C++/CUDA and use Parallel Nsight to find the bottleneck.
As Bichoy said, CUDA code should always be vectorized. In MATLAB, unless you're writing a CUDA kernel, the only large speedup you get is that the vectorized operations are called on the GPU, which has thousands of (slow) cores. If you don't have large vectors and vectorized code, it won't help.
Another thing that hasn't been mentioned is that for highly parallel architectures like GPUs you want to use different random number generating algorithms than the "standard" ones. So to add to Bichoy's answer, adding the parameter 'Threefry4x64' (64-bit) or 'Philox4x32-10' (32-bit and a lot faster! Super fast!) can lead to large speedups in CUDA code. MATLAB explains this here: http://www.mathworks.com/help/distcomp/examples/generating-random-numbers-on-a-gpu.html

Matlab's fftn gets slower with multithreading?

I have access to a 12 core machine and some matlab code that relies heavily on fftn. I would like to speed up my code.
Since the fft can be parallelized I would think that more cores would help but I'm seeing the opposite.
Here's an example:
X = peaks(1028);
ncores = feature('numcores');
ntrials = 20;
mtx_power_times = zeros(ncores,ntrials);
fft_times = zeros(ncores, ntrials);
for i=1:ncores
for j=1:ntrials
maxNumCompThreads(i);
tic;
X^2;
mtx_power_times(i,j) = toc;
tic
fftn(X);
fft_times(i,j) = toc;
end
end
subplot(1,2,1);
plot(mtx_power_times,'x-')
title('mtx power time vs number of cores');
subplot(1,2,2);
plot(fft_times,'x-');
title('fftn time vs num of cores');
Which gives me this:
The speedup for matrix multiplication is great but it looks like my ffts go almost 3x slower when I use all my cores. What's going on?
For reference my version is 7.12.0.635 (R2011a)
Edit: On large 2D arrays taking 1D transforms I get the same problem:
Edit: The problem appears to be that fftw is not seeing the thread limiting that maxNumCompThreads enforces. I'm getting all the cpus going full speed no matter what I set maxNumCompThreads at.
So... is there a way I can specify how many processors I want to use for an fft in Matlab?
Edit: Looks like I can't do this without some careful work in .mex files. http://www.mathworks.com/matlabcentral/answers/35088-how-to-control-number-of-threads-in-fft has an answer. It would be nice if someone has an easy fix...
To use different cores, you should use the Parallel Computing Toolbox. For instance, you could use a parfor loop, and you have to pass the functions as a list of handles:
function x = f(n, i)
...
end
m = ones(8);
parfor i=1:8
m(i,:) = f(m(i,:), i);
end
More info is available at:
High performance computing
Multithreaded computation
Multithreading