I've been using scipy.sparse.linalg.eigs on some large matrices, and not surprisingly, it takes a while. So, I've been looking for ways to speed it up. My understanding is that, under the hood, the scipy code uses ARPACK, and there is a parallel version of ARPACK which uses MPI. Is it possible to have scipy use the parallel version of ARPACK without too much pain? If so, how?
(I should note that MATLAB's equivalent of eigs does seem to be multithreaded, so that may be the least painful option.)
It would seem that the (MPI-) parallel version of ARPACK is an entirely different project called PARPACK:
"A parallel version of the ARPACK library is now availible. The
message passing layers currently supported are BLACS and MPI .
Parallel ARPACK (PARPACK) is provided as an extension to the current
ARPACK library (Release 2.1)."
Have you looked at petsc4py?
Or maybe even
"explore calling a parallel sparse linear algebra library like CUSP or
cuSPARSE from Python if speed is your concern and you have an NVIDIA
GPU."
(see this answer)
I have two for loops running in my MATLAB code. The inner loop is parallelized using matlabpool on 12 processors (the maximum MATLAB allows on a single machine).
I don't have a Distributed Computing license. Please help me do the same using Octave or Scilab. I just want to parallelize the 'for' loop ONLY.
The links I found while searching Google are broken.
parfor is not really implemented in Octave yet. The keyword is accepted, but it is a mere synonym of for (http://octave.1599824.n4.nabble.com/Parfor-td4630575.html).
The pararrayfun and parcellfun functions of the parallel package are handy on multicore machines.
They are often a good replacement for a parfor loop; a minimal sketch follows the install notes below.
For examples, see
http://wiki.octave.org/Parallel_package.
To install, issue (just once):
pkg install -forge parallel
Then, once per session before using the functions, load it:
pkg load parallel
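As a minimal sketch of what that replacement looks like (the anonymous function here is just a stand-in for your real per-iteration work):

pkg load parallel
# evaluate the function on each element of 1:10, spread across the
# available cores; nproc returns the number of processors
y = pararrayfun(nproc, @(x) x^2, 1:10)

Each call receives one element, and the results are collected back into y, just as arrayfun would do serially.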
In Scilab you can use parallel_run:
// square each element of 1:10; parallel_run farms the calls out across cores
function a=g(arg1)
    a = arg1*arg1
endfunction
res = parallel_run(1:10, g);
Limitations:
- On Windows platforms, parallel_run uses only one core.
- For now, parallel_run only handles arguments and results that are matrices of real scalars, and the types argument is not used.
- One should not rely on side effects such as modifying variables from the outer scope: only the data stored in the result variables is copied back into the calling environment.
- Macros called by parallel_run are not allowed to use the JVM.
- No stack resizing (via gstacksize() or stacksize()) should take place during a call to parallel_run.
In GNU Octave you can use the parfor construct:
parfor i=1:10
# do stuff that may run in parallel
endparfor
For more info: help parfor
To see a list of Free and Open Source alternatives to MATLAB/Simulink, please check its AlternativeTo page or my answer here. Specifically for Simulink alternatives, see this post.
Something you should consider is the difference between vectorized, parallel, concurrent, asynchronous, and multithreaded computing. Without going much into the details, vectorized programming is a way to avoid ugly for-loops; for example, the map function and list comprehensions in Python are vectorized computation. It is about the way you write the code, not necessarily how it is handled by the computer. Parallel computation, mostly used for GPU computing (data parallelism), is when you run massive amounts of arithmetic on big arrays using GPU computational units. There is also task parallelism, which mostly refers to running a task on multiple threads, each processed by a separate CPU core. Concurrent or asynchronous computing is when you have just one computational unit, but it does multiple jobs at the same time without blocking the processor unconditionally. Basically like a mom cooking, cleaning, and taking care of her kid at the same time, but doing only one job at a time :)
Given the above description, there is a lot in the FOSS world for each one of these. For Scilab specifically, check this page. There is an MPI interface for distributed computation (multithreading/parallelism across multiple computers), OpenCL interfaces for GPU/data-parallel computation, and an OpenMP interface for multithreading/task-parallelism. The feval function is not parallelism but a way to vectorize a conventional function. Scilab matrix arithmetic and parallel_run are vectorized or parallel depending on the platform, hardware, and version of Scilab.
I have built a standalone MATLAB application. I was expecting it to be faster than running the application from the MATLAB environment, but it is actually a bit slower (1.3 s per iteration vs. 1.5 s per iteration).
I am not counting the init time required by the MCR, only the execution of my code.
Is that the expected performance or should I be obtaining a performance improvement?
I haven't found any settings on the deployment tool that could help to reduce execution time.
Thanks in advance
Applications built with MATLAB Compiler should execute at pretty much exactly the same speed as within MATLAB.
MATLAB Compiler does not convert your MATLAB code into machine code in the same way as a C compiler does for C. What it does is archive and encrypt your MATLAB code (note, it properly encrypts it, not just pcodes it as a comment suggests), create a thin executable wrapper, and package them together, possibly also with the MATLAB Compiler Runtime (MCR). The MCR is very similar to MATLAB itself, without a graphical user interface, and is freely redistributable.
When you run the executable, it unpacks and decrypts your MATLAB code and runs it against the MCR. It should run exactly the same, both in terms of results and speed.
Very old versions of MATLAB Compiler (pre-version 4.0) worked in a different way, converting a subset of the MATLAB language into C code, and compiling this. This provided a potentially significant speed-up, but only a subset of the language was supported and results, unless you were careful, could sometimes be different. Similar functionality is now available in the separate MATLAB Coder product.
There are a few small things you can do to improve performance: for example, within deploytool you can specify which toolboxes your application uses. deploytool uses a dependency checker to package up all MATLAB functionality that it thinks your code might possibly depend on, but it can't always tell exactly, as the functions your code needs might change at runtime. It therefore errs on the side of caution and includes more than necessary. By specifying only the toolboxes you know to be necessary, you can speed things up a little (it also speeds up the build process quite a bit).
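If you prefer scripting the build over the deploytool GUI, my understanding is that the mcc command gives you the same control; the application name and toolbox choices below are placeholders, and the flags are worth verifying against the documentation for your release:

% -N clears the default toolbox path; each -p adds back one named toolbox
mcc -m myapp.m -N -p signal -p images

This keeps the dependency checker from packaging toolboxes your application never touches.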
I've been testing various open source codes for solving a linear system of equations in C++. So far the fastest I've found is Armadillo, using the OpenBLAS package as well. Solving a dense linear NxN system with N=5000 takes around 8.3 seconds on my system, which is really fast (without OpenBLAS installed, it takes around 30 seconds).
One reason for this speedup is that Armadillo+OpenBLAS seems to enable using multiple threads. It runs on two of my cores, whereas Armadillo without OpenBLAS only uses one. I have an i7 processor, so I want to increase the number of cores and test it further. I'm using Ubuntu, so per the OpenBLAS documentation I can do in the terminal:
export OPENBLAS_NUM_THREADS=4
however, running the code again doesn't seem to increase the number of cores being used or the speed. Am I doing something wrong, or is two the maximum for Armadillo's solve(A,b) command? I wasn't able to find Armadillo's source code anywhere to take a look.
Incidentally, does anybody know the method Armadillo/OpenBLAS uses for solving Ax=b (standard LU decomposition with parallelism, or something else)? Thanks!
edit: Actually, the number of cores being stuck at 2 seems to be a bug when installing OpenBLAS with the Synaptic package manager; see here. Reinstalling from source allows it to detect how many cores I actually have (8). Now I can use export OPENBLAS_NUM_THREADS=4 etc. to govern it.
Armadillo doesn't prevent OpenBLAS from using more cores. It's possible that the current implementation of OpenBLAS simply chooses 2 cores for certain operations.
You can see Armadillo's source code directly in the downloadable package (it's open source), in the folder "include". Specifically, have a look at the file "include/armadillo_bits/fn_solve.hpp" (which contains the user-accessible solve() function), and the file "include/armadillo_bits/auxlib_meat.hpp" (which contains the wrapper and housekeeping code for calling the torturous BLAS and LAPACK functions).
If you already have Armadillo installed on your machine, have a look at "/usr/include/armadillo_bits" or "/usr/local/include/armadillo_bits".
Hey all. I'm trying to sort out how to get MATLAB running as well as possible. I have a pretty decent new machine:
12 GB RAM
Core i7 3.2 GHz CPU
lots of free disk space
and a strong graphics card.
However, when I run MATLAB's benchmark test (command bench), it ranks the computer near the worst, around a single-core 1.7 GHz Windows XP machine.
Any ideas why, and how I can improve this?
Thanks very much
Firstly, I would recommend re-running the bench command a few times to make sure MATLAB has fully loaded all the libraries etc. it needs. Much of MATLAB is loaded on demand, so it's always best to time the second or third run.
MATLAB automatically takes advantage of multiple cores when executing certain operations which are multithreaded: for example, many elementwise operations such as + and .*, as well as BLAS-backed operations (and probably others). This page lists the things which are multithreaded.
Parallel Computing Toolbox is useful when MATLAB's intrinsic multithreading can't help (if it can, then it's usually the fastest way to do things). This gives you explicit parallelism via PARFOR, SPMD and distributed arrays.
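As a minimal PARFOR sketch (assuming Parallel Computing Toolbox is installed; matlabpool was the pool command in the releases of that era, replaced by parpool later):

matlabpool open 4              % parpool(4) on newer releases
n = 1e6;
y = zeros(1, n);
parfor i = 1:n
    y(i) = sqrt(i) + log(i);   % iterations are independent, so they may run in parallel
end
matlabpool close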
You need the Parallel Computing Toolbox. A lot of MATLAB functions are multithreaded, but to parallelize your own code, you'll need it. A dumb hack is to open several instances of command-line MATLAB. You could also write multithreaded MEX files, but the right way to go about it would be to purchase and use the aforementioned toolbox.
This may be obvious, but make sure that you have enabled multithreaded computation in the preferences (File > Preferences > General > Multithreading). In some versions of MATLAB, it's not enabled by default.
I have a project that requires lots of image processing and wanted to add GPU support to speed things up.
I was wondering: if I compiled my MATLAB code into a C++ shared library and called it from within an OpenCL program, does that mean the MATLAB code is going to run on the GPU?
My own (semi-educated) guess is that you are going to find this very difficult to do. But others have trodden the same path. This paper might be a good place to start your research, and Googling turned up AccelerEyes and a couple of references to items on the MathWorks File Exchange which you might want to follow up.
Everything in Jacket is written in C/C++/CUDA.
In fact, we now have a beta version, libjacket (http://www.accelereyes.com/downloadLibjacket), which can be used to extend not just MATLAB but other languages as well, if you are willing.
@OSaad: Most of our functions are the fastest options out there, be it in C or MATLAB.
The Parallel Computing Toolbox in the upcoming release R2010b (due September 1st) supports GPU processing for several functions. Unfortunately, it only supports CUDA devices (compute capability 1.3 and later), so with an ATI graphics card, you're out of luck. However, you may just want to buy a dedicated GPU anyway.
Typically, if you can write your MATLAB code in a "vectorized" way, then packages like AccelerEyes' Jacket have a reasonable chance of making things run on the GPU. You can verify this to some extent beforehand by checking whether MATLAB itself is able to run on multiple cores on the CPU (these days MATLAB will use multiple cores if things are parallelizable in an obvious way).
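For a contrived sketch of what "vectorized" means here (the function is illustrative, not a benchmark):

x = rand(1, 1e6);

% loop version: element-by-element, little for a GPU package to work with
y = zeros(size(x));
for i = 1:numel(x)
    y(i) = 3*x(i)^2 + 2*x(i) + 1;
end

% vectorized version: one elementwise expression over the whole array,
% which MATLAB can multithread and a GPU package can map to the device
y = 3*x.^2 + 2*x + 1;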
If that doesn't work, then you need to drop down to C/C++ via MEX and then, from there, call OpenCL yourself. MEX is how MATLAB talks to C code: you write C code that is called by MATLAB (and receives the matrices, etc.), then initialises and calls OpenCL. This is more work, but may be your only route (and, even if the automated packages work to some extent, this approach can still give more speedups, because you can be smarter about, for example, memory management, if you know what you are doing).
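To make that workflow concrete, here is the MATLAB side only; the gateway name mycl_filter is hypothetical, and the C file would contain a standard mexFunction that sets up and launches your OpenCL kernel:

% compile the (hypothetical) C gateway, linking against the OpenCL library
mex mycl_filter.c -lOpenCL

% afterwards it is called like any other MATLAB function: the MEX file
% receives the matrix, runs the OpenCL kernel, and returns the result
img = rand(512);
out = mycl_filter(img);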