I have some MATLAB code with mxn matrix.
Initially, I put first row in it and then the code runs through a for loop which appends remaining m-1 rows one by one; one for each iteration of the loop.
As expected, MATLAB recommends me to pre-allocate the matrix because it is expanding with every iteration of loop.
So, if I pre-allocate zeros in all m rows, MATLAB most probably will append rows after the m rows(starting from m+1 for 1st appended row) because m rows are already filled(even though with zeros!)
Is there any way of pre-allocating matrix in this scenario for improving speed?
You cannot pre-allocate a MATLAB array without also changing it's size, at least not manually. However, MATLAB has improved automatic array growth performance a lot in recent versions, so you might not see a huge performance hit. Still, best practice would be to pre-allocate your array with zeros and index the rows with A(i,:) = rowVec; instead of appending a row (A = [A; rowVec];).
Pre-allocation
If you are determined to squeeze every bit of performance out of MATLAB, Yair Altman has a couple of excellent articles on the topic of memory pre-allocation:
Preallocation performance
Preallocation performance and multithreading
Automatic Array Growth Optimization
If you really want to use dynamic array resizing by growing along a dimension, there are ways to do it right. See this this MathWorks blog post by Steve Eddins. The most important thing to note is that you should grow along the last dimension for best performance. (i.e. add columns in your case). Yair also discusses dynamic array resizing in another post on his blog.
Also, there are ways of allocating an array without initializing using some hairy MEX API acrobatics, but that's it.
Related
I am using both Cplex and Gurobi for an LP program whose inequality constraint matrix A can become truly large -- around 5 to 10GB. When I want to use one of those solvers, I have to create a separate struct with all the problem constraints. This means that I have the matrix A in my workspace, and the matrix A in my solver struct at the same time. Even if I clear it in my Workspace as fast as possible, there is still a time when both exist and my RAM is overloaded.
I am asking if there is some clever method to deliver the matrix A into the model without both existing at the same time. The only thing I can think of right now is delivering it in small chunks...
MATLAB using copy-on-write, or lazy copying. This means that, as long as you don't modify one of the copies, all copies of a matrix share the same data:
A = randn(10000);
B = A; % does not take up extra memory
myfunc(B);
function myfunc(matrix)
C = matrix; % does not take up extra memory.
For reference, see for example on Loren's blog and Undocumented Matlab.
I have data-set of epinions website and want to implement the recommendation system
At the first step I should change the structure of data-set an it should be like 120000*780000 rows and columns
Its really big matrix and because of lack of memory it's not possible to do it
In my work every user should have M-dimensional vector , And M is total number of items that is 780000
I cant use sparse matrix because I need indexes and its too slow
What can I do now? How can I have this big data-set in matlab ?
You can try different things to reduce the amount of data:
Take a random subset of your observations: 120.000 observations is quite a lot, try randomly splitting it in several smaller subsets and check which is the performance of the system.
Use PCA to reduce the dimensionality of your data: 780.000 dimensions is A LOT. You will probably get a drastic reduction of the number of dimensions with PCA.
If your data is mostly zero or constant, you can actually use sparse matrices. Sparse matrices keep track of the indexes of your non-zero data, so don't worry about that.
I have three big 3D arrays of the same size [41*141*12403], named in the Matlab code below alpha, beta and ni. From them I need to calculate another 3D array with the same size, which is obtained elementwise from the original matrices through a calculation that combines an infinite sum and definite integral calculations, using the value of each element. It therefore seems inevitible to have to use several nested loops to make this calculation. The code is already running now for several hours(!) and it is still in the first iteration of the outer loop (which needs to be performed 41 times!! According to my calculation, in this way the program will have to run more than two years!!!). I don't know how to optimize the code. Please help me !!
the code I use:
z_len=size(KELDYSH_PARAM_r_z_t,1); % 41 rows
r_len=size(KELDYSH_PARAM_r_z_t,2); % 141 columns
t_len=size(KELDYSH_PARAM_r_z_t,3); % 12403 slices
sumRes=zeros(z_len,r_len,t_len);
for z_ind=1:z_len
z_ind % in order to track the advancement of the calculation
for r_ind=1:r_len
for t_ind=1:t_len
sumCurrent=0;
sumPrevious=inf;
s=0;
while abs(sumPrevious-sumCurrent)>1e-6
kapa=kapa_0+s; %some scalar
x_of_w=(beta(z_ind,r_ind,t_ind).*(kapa-ni...
(z_ind,r_ind,t_ind))).^0.5;
sumPrevious=sumCurrent;
sumCurrent=sumCurrent+exp(-alpha(z_ind,r_ind,t_ind).* ...
(kapa-ni(z_ind,r_ind,t_ind))).*(x_of_w.^(2*abs(m)+1)/2).* ...
w_m_integral(x_of_w,m);
s=s+1;
end
sumRes(z_ind,r_ind,t_ind)=sumCurrent;
end
end
end
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function res=w_m_integral(x_of_w,m)
res=quad(#integrandFun,0,1,1e-6);
function y=integrandFun(t)
y=exp(-x_of_w^2*t).*t.^(abs(m))./((1-t).^0.5);
end
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Option 1 - more vectorising
It's a pretty complex model you're working with and not all the terms are explained, but some parts can still be further vectorised. Your alpha, beta and ni matrices are presumably static and precomputed? Your s value is a scalar and kapa could be either, so you can probably precompute the x_of_w matrix all in one go too. This would give you a very slight speedup all on its own, though you'd be spending memory to get it - 71 million points is doable these days but will call for an awful lot of hardware. Doing it once for each of your 41 rows would reduce the burden neatly.
That leaves the integral itself. The quad function doesn't accept vector inputs - it would be a nightmare wouldn't it? - and neither does integral, which Mathworks are recommending you use instead. But if your integration limits are the same in each case then why not do the integral the old-fashioned way? Compute a matrix for the value of the integrand at 1, compute another matrix for the value of the integrand at 0 and then take the difference.
Then you can write a single loop that computes the integral for the whole input space then tests the convergence for all the matrix elements. Make a mask that notes the ones that have not converged and recalculate those with the increased s. Repeat until all have converged (or you hit a threshold for iterations).
Option 2 - parallelise it
It used to be the case that matlab was much faster with vectorised operations than loops. I can't find a source for it now but I think I've read that it's become a lot faster recently with for loops too, so depending on the resources you have available you might get better results by parallelising the code you currently have. That's going to need a bit of refactoring too - the big problems are overheads while copying in data to the workers (which you can fix by chopping the inputs up into chunks and just feeding the relevant one in) and the parfor loop not allowing you to use certain variables, usually ones which cover the whole space. Again chopping them up helps.
But if you have a 2 year runtime you will need a factor of at least 100 I'm guessing, so that means a cluster! If you're at a university or somewhere where you might be able to get a few days on a 500-core cluster then go for that...
If you can write the integral in a closed form then it might be amenable to GPU computation. Those things can do certain classes of computation very fast but you have to be able to parallelise the job and reduce the actual computation to something basic comprised mainly of addition and multiplication. The CUDA libraries have done a lot of the legwork and matlab has an interface to them so have a read about those.
Option 3 - reduce the scope
Finally, if neither of the above two results in sufficient speedups, then you may have to reduce the scope of your calculation. Trim the input space as much as you can and perhaps accept a lower convergence threshold. If you know how many iterations you tend to need inside the innermost while loop (the one with the s counter in it) then it might turn out that reducing the convergence criterion reduces the number of iterations you need, which could speed it up. The profiler can help see where you're spending your time.
The bottom line though is that 71 million points are going to take some time to compute. You can optimise the computation only so far, the odds are that for a problem of this size you will have to throw hardware at it.
I have a very large and sparse matrix of size 180GB(text , 30k * 3M) containing only the entries and no additional data. I have to do matrix multiplication , inversion and some similar linear algebra operations over it. I tried octave and simple single-threaded C code for the multiplication but my system RAM of 40GB gets used up very fast and then I can find the program starts thrashing. Is there any other options available to me. I am not familiar with MathLab or any other matrix operational library that can help me in doing so.
When I run a simple matrix multiplication of two matrices with 10 rows and 3 M cols, and its transpose, it gives the following error :
memory exhausted or requested size too large for range of Octave's index type
I am not sure whether the same would work on Matlab or not. For sparse matrix representation and matrix multiplication, is there another library or code.
if there are few enough nonzero entries, I suggest creating a sparse matrix S with appropriate dimensions and max nonzero entries; see matlab create sparse matrix. Then as #oleg komarov described, load the matrix in blocks and assign the nonzero entries from each block into the correct address in the sparse matrix S. I feel that if your matrix is sparse enough, then loading it is really the only difficulty you face. I had similar issues with large transfer operators.
Have you considered performing your processing in blocks? Transposition and multiplications work very well with block matrix processing (see https://en.wikipedia.org/wiki/Block_matrix) and that will get you around any limitations about the indices.
This wouldn't help you with matrix inversion though unless you can decompose your matrix in blocks when blocks that aren't on the diagonal are completely empty, which isn't stated in your assumptions.
Octave has a limit in both the memory resources of about 2GB and the maximum number of indices a matrix can hold of about 2^32 (for 32 bits Octave). MatLab doesn't have such a memory limit, since it will use all of your memory resources, swapping file included. Thus you could try with MatLab by setting a huge swapfile, you may then compute your operations (but it will anyway take quite along time...).
If you are interested by other approaches, you may take a look into out-of-core computing which aims to promote new methods to process huge datasets that cannot reside all in memory, but rather store it on disk and load efficiently the bits that are necessary.
For a practical approach, you may take a look into Blaze for Python (notice: still in development!).
I have this problem which requires solving for X in AX=B. A is of the order 15000 x 15000 and is sparse and symmetric. B is 15000 X 7500 and is NOT sparse. What is the fastest way to solve for X?
I can think of 2 ways.
Simplest possible way, X = A\B
Using for loop,
invA = A\speye(size(A))
for i = 1:size(B,2)
X(:,i) = invA*B(:,i);
end
Is there a better way than the above two? If not, which one is best between the two I mentioned?
First things first - never, ever compute inverse of A. That is never sparse except when A is a diagonal matrix. Try it for a simple tridiagonal matrix. That line on its own kills your code - memory-wise and performance-wise. And computing the inverse is numerically less accurate than other methods.
Generally, \ should work for you fine. MATLAB does recognize that your matrix is sparse and executes sparse factorization. If you give a matrix B as the right-hand side, the performance is much better than if you only solve one system of equations with a b vector. So you do that correctly. The only other technical thing you could try here is to explicitly call lu, chol, or ldl, depending on the matrix you have, and perform backward/forward substitution yourself. Maybe you save some time there.
The fact is that the methods to solve linear systems of equations, especially sparse systems, strongly depend on the problem. But in almost any (sparse) case I imagine, factorization of a 15k system should only take a fraction of a second. That is not a large system nowadays. If your code is slow, this probably means that your factor is not that sparse sparse anymore. You need to make sure that your matrix is properly reordered to minimize the fill (added non-zero entries) during sparse factorization. That is the crucial step. Have a look at this page for some tests and explanations on how to reorder your system. And have a brief look at example reorderings at this SO thread.
Since you can answer yourself which of the two is faster, I'll try yo suggest the next options.
Solve it using a GPU. Plenty of details can be found online, including this SO post, a matlab benchmarking of A/b, etc.
Additionally, there's the MATLAB add-on of LAMG (Lean Algebraic Multigrid). LAMG is a fast graph Laplacian solver. It can solve Ax=b in O(m) time and storage.
If your matrix A is symmetric positive definite, then here's what you can do to solve the system efficiently and stably:
First, compute the cholesky decomposition, A=L*L'. Since you have a sparse matrix, and you want to exploit it to accelerate the inversion, you should not apply chol directly, which would destroy the sparsity pattern. Instead, use one of the reordering method described here.
Then, solve the system by X = L'\(L\B)
Finally, if are not dealing with potential complex values, then you can replace all the L' by L.', which gives a bit further acceleration because it's just trying to transpose instead of computing the complex conjugate.
Another alternative would be the preconditioned conjugate gradient method, pcg in Matlab. This one is very popular in practice, because you can trade off speed for accuracy, i.e. give it less number of iterations, and it will give you a (usually pretty good) approximate solution. You also never need to store the matrix A explicitly, but just be able to compute matrix-vector product with A, if your matrix doesn't fit into memory.
If this takes forever to solve in your tests, you are probably going into virtual memory for the solve. A 15k square (full) matrix will require 1.8 gigabytes of RAM to store in memory.
>> 15000^2*8
ans =
1.8e+09
You will need some serious RAM to solve this, as well as the 64 bit version of MATLAB. NO factorization will help you unless you have enough RAM to solve the problem.
If your matrix is truly sparse, then are you using MATLAB's sparse form to store it? If not, then MATLAB does NOT know the matrix is sparse, and does not use a sparse factorization.
How sparse is A? Many people think that a matrix that is half full of zeros is "sparse". That would be a waste of time. On a matrix that size, you need something that is well over 99% zeros to truly gain from a sparse factorization of the matrix. This is because of fill-in. The resulting factorized matrix is almost always nearly full otherwise.
If you CANNOT get more RAM (RAM is cheeeeeeeeep you know, certainly once you consider the time you have wasted trying to solve this) then you will need to try an iterative solver. Since these tools do not factorize your matrix, if it is truly sparse, then they will not go into virtual memory. This is a HUGE savings.
Since iterative tools often require a preconditioner to work as well as possible, it can take some study to find the best preconditioner.