Solving Ax=b where A is too big to be stored in a single array - matlab

Problem: A is square, full rank, sparse and banded. It has way too many elements to be stored as a single matrix in Matlab (at least ~4.6*10^18 and ideally ~10^40, both of which exceed the max array size. EDIT: A is stored as sparse, and the problem is not with limited memory but with the limited number of elements). Therefore I have to store it as a collection of smaller arrays (rows/diagonals/columns/blocks).
Looking for: a way to solve Ax=b, with A given as a collection of smaller arrays. Ideally in Matlab but not a must.
Alternatively, if not in Matlab: maybe there's a program that can store and solve such a big A?
Found so far: methods if A is tri/pentadiagonal, but my A has N diagonals. Also found something about partitioning A to blocks, but couldn't find a way to then solve a linear system with these blocks.
p.s. The system is 64-bit.
Thanks everyone!

Not using Matlab would allow you to store larger arrays. ROOT is an open source framework developed at CERN that has C++ and Python interfaces and a variety of solvers. It is also capable of handling huge datasets and has a variety of visualization and analysis tools as well.
If you are interested in writing C or Fortran, BLAS (Basic Linear Algebra Subprograms) and CBLAS would be good options. There are many open source and proprietary implementations of BLAS, and they should be available for most Linux/UNIX distributions. There are also plenty of examples online showing how to use the BLAS subroutines in C and Fortran code.

If you have access to MATLAB's Parallel Computing Toolbox together with MATLAB Distributed Computing Server, you may be able to store A as a distributed array, in other words a single array whose elements are distributed across the memories of multiple machines in a cluster. You can call MATLAB's backslash command directly on a distributed array, and MATLAB handles the parallelization for you.

I wanted to put this as a comment, but I think it is better to state it as an answer.
You have a serious problem. It is not only a problem of indexing, it is also a problem of memory: 4.6x10^18 is huge. That is 4.6 exa-elements. If you store them as single-precision reals, you need 4x4.6 exabytes of memory. A computer with such a huge memory does not yet exist, to my knowledge. You would need to gather the storage (hard disk, not RAM) of a significant proportion of all computers in the world to store such a matrix. Think about it. Going to 10^40 elements is simply not practical for the time being. With your 64-bit computer, the 64-bit address space can barely address 4.6x10^18 elements: a 64-bit address (or integer) makes it possible to directly index 2^64 elements, which is roughly 18x10^18. So you have to think twice.
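The arithmetic in this paragraph can be checked in a few lines. This is only a back-of-the-envelope sketch in Python, assuming every one of the 4.6x10^18 elements is actually stored and using decimal exabytes (10^18 bytes); the variable names are mine.

```python
# Rough storage estimate for the element count quoted in the question.
N_ELEMENTS = 4.6e18      # element count from the question
BYTES_SINGLE = 4         # IEEE single precision
BYTES_DOUBLE = 8         # IEEE double precision (MATLAB's default numeric type)
EXA = 10**18             # decimal exabyte, in bytes


def storage_exabytes(n_elements, bytes_per_element):
    """Storage needed for n_elements values, in decimal exabytes."""
    return n_elements * bytes_per_element / EXA


single_eb = storage_exabytes(N_ELEMENTS, BYTES_SINGLE)
double_eb = storage_exabytes(N_ELEMENTS, BYTES_DOUBLE)
index_limit = 2**64      # number of distinct values of a 64-bit index

print(f"single precision: {single_eb:.1f} EB")
print(f"double precision: {double_eb:.1f} EB")
print(f"2^64 = {index_limit:.3e} addressable items")
```

Even in single precision the requirement is in the tens of exabytes, and the element count is already within a factor of about four of the 64-bit index range.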
Going back to the problem itself, there are chances that you can turn your matrix into an implicit operator. By implicit operator, I mean, you do not need to store it, because it has a pattern that you know how to reproduce, or you can apply it to a vector without actually forming the matrix. If you have the matrix in hand, you are very likely in this situation, considering what I said above.
If that is the case, to solve your problem, you simply need to use an iterative solver and provide a black box that does your matrix multiplication. Going to other directions might be a waste of your time.
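As a rough illustration of the "black box" idea, here is a minimal matrix-free sketch in plain Python. It assumes, purely for the example, that A is the symmetric positive definite tridiagonal operator with stencil [-1, 2, -1]; `apply_A` and `conjugate_gradient` are hypothetical names, and conjugate gradient is only appropriate for symmetric positive definite operators (a general banded A would need something like GMRES or BiCGSTAB instead). In MATLAB, iterative solvers such as pcg and gmres accept a function handle in place of the matrix, which plays the role of `apply_A` here.

```python
def apply_A(x):
    """Return A*x for the tridiagonal stencil [-1, 2, -1] without storing A.

    Hypothetical stand-in for the real banded matrix: only the pattern is
    known, the matrix itself is never formed.
    """
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        y[i] = 2.0 * x[i]
        if i > 0:
            y[i] -= x[i - 1]
        if i < n - 1:
            y[i] -= x[i + 1]
    return y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A*x = b given only a function that applies A (A must be SPD)."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A*x for the initial guess x = 0
    p = list(r)              # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break            # residual norm is small enough
        Ap = matvec(p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Build a right-hand side from a known solution, then recover it.
n = 50
x_true = [float(i % 7) for i in range(n)]
b = apply_A(x_true)
x = conjugate_gradient(apply_A, b)
max_err = max(abs(a - c) for a, c in zip(x, x_true))
print("largest error:", max_err)
```

The solver only ever touches vectors of length n, so memory use is a handful of vectors rather than anything proportional to the number of matrix entries.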


Can someone tell me about the kNN search algo that Matlab uses?

I wrote a basic O(n^2) algorithm for a nearest neighbor search. As usual Matlab 2013a's knnsearch(..) method works a lot faster.
Can someone tell me what kind of optimization they used in their implementation?
I am okay with reading any documentation or paper that you may point me to.
PS: I understand the documentation on the site mentions the paper on kd trees as a reference. But as far as I understand kd trees are the default option when column number is less than 10. Mine is 21. Correct me if I'm wrong about it.
The biggest optimization MathWorks have made in implementing nearest-neighbors search is that all the hard stuff is implemented in a MEX file, as compiled C, rather than MATLAB.
With an algorithm such as kNN that (in my limited understanding) is quite recursive and difficult to vectorize, that's likely to give such an improvement that the O() analysis will only be relevant at pretty high n.
In more detail, under the hood the knnsearch command uses createns to create a NeighborSearcher object. By default, when X has 10 or fewer columns, this will be a KDTreeSearcher object, and when X has more than 10 columns it will be an ExhaustiveSearcher object (both KDTreeSearcher and ExhaustiveSearcher are subclasses of NeighborSearcher).
All objects of class NeighborSearcher have a method knnsearch (which you would rarely call directly, using the convenience command knnsearch instead). The knnsearch method of KDTreeSearcher calls straight out to a MEX file for all the hard work. This lives in matlabroot\toolbox\stats\stats\@KDTreeSearcher\private\knnsearchmex.mexw64.
As far as I know, this MEX file performs pretty much the algorithm described in the paper by Friedman, Bentley, and Finkel referenced in the documentation page, with no structural changes. As the title of the paper suggests, each query takes O(log(n)) expected time, so n queries cost roughly O(n log(n)) rather than O(n^2). Unfortunately, the contents of the MEX file are not available for inspection to confirm that.
The code builds a KD-tree space-partitioning structure to speed up nearest neighbor search; think of it like building the indexes commonly used in an RDBMS to speed up lookup operations.
In addition to nearest neighbor searches, this structure also speeds up range searches, which find all points that are within a distance r of a query point.
As pointed out by @SamRoberts, the core of the code is implemented in C/C++ as a MEX-function.
Note that knnsearch chooses to build a KD-tree only under certain conditions, and falls back to an exhaustive search otherwise (by naively searching all points for the nearest one).
Keep in mind that in cases of very high-dimensional data (and few instances), the algorithm degenerates and is no better than an exhaustive search. In general, as you go to dimensions d>30, the cost of searching a KD-tree increases to that of searching almost all the points, and it could even become worse than a brute-force search because of the overhead involved in building the tree.
There are other variations of the algorithm that deal with high dimensions, such as ball trees, which partition the data into a series of nested hyper-spheres (as opposed to partitioning the data along Cartesian axes like KD-trees). Unfortunately those are not implemented in the official Statistics toolbox. If you are interested, here is a paper which presents a survey of available kNN algorithms.
(Figure omitted: an illustration of searching a kd-tree partitioned 2D space, borrowed from the docs.)
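To make the kd-tree idea concrete, here is a small sketch in plain Python. It is not the contents of the MEX file (which, as noted, cannot be inspected); it is a textbook-style kd-tree with a single-nearest-neighbor query, checked against a brute-force search. The function names and the dictionary-based node layout are my own choices for illustration.

```python
import random


def build_kdtree(points, depth=0):
    """Recursively split the points at the median along alternating axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, query, best=None):
    """Return (squared distance, point) of the nearest neighbor of query."""
    if node is None:
        return best
    d = sq_dist(node["point"], query)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the
    # best match so far; this pruning is what saves work in low dimensions.
    if diff * diff < best[0]:
        best = nearest(far, query, best)
    return best


def brute_force(points, query):
    """Exhaustive search: look at every point."""
    return min(((sq_dist(p, query), p) for p in points), key=lambda t: t[0])


random.seed(0)
pts = [(random.random(), random.random()) for _ in range(500)]
tree = build_kdtree(pts)
q = (0.3, 0.7)
kd_result = nearest(tree, q)
bf_result = brute_force(pts, q)
print(kd_result[1], bf_result[1])
```

In high dimensions the pruning test rarely succeeds, so the search visits nearly every node, which is the degeneration described above.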

MATLAB sparse matrix solvers? memory errors

In the context of a finite element problem, I have a 12800x12800 sparse matrix. I'm trying to solve the linear system just using MATLAB's \ operator to solve and I get an out of memory error using mldivide. So I'm just wondering if there's a way to speed this up.
I mean, will something like LU factorization actually help here in terms of not getting the memory error anymore? I increased the heap size to 256 GB in preferences, which is the max I can get it to, and I still get the out of memory error.
Also, just a general question. I have 8GB of RAM on my laptop right now. Will upgrading to 16GB help at all? Or maybe something I can do to allocate more memory to MATLAB? I'm pretty unfamiliar with this stuff.
According to this and this you have some options to avoid out of memory problem in matlab:
Increase operating system's virtual memory
Give Higher priority to MATLAB process in task manager
Use 64-bit version of MATLAB
A few months ago I was working on integer programming in matlab and ran into an "out of memory" problem. I switched to sparse matrices and followed the tips above, and the problem was solved!
Are you locked in to using mldivide? Sounds like the perfect situation for an iterative method - bicg, gmres etc?
While backslash takes advantage of the sparsity of A, the factorization it uses can suffer fill-in and produce factors that are far denser than A, so it can need much more memory than A itself. A few things you can try:
If you're dividing sparse matrices with only a few diagonals, you can try to solve the system with forward/backward substitution
Try breaking the problem up into smaller subproblems
Run whos to see what elements are occupying your memory before you start the matrix division, can any of these be cleared beforehand?
Not applicable to your problem as you've stated it here, but if your system is overdetermined (A has more rows than columns), then using the pseudo-inverse via the normal equations, (A.'*A)\(A.'*b), produces a result using a matrix whose size is the smaller column dimension
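For the "few diagonals" suggestion in the list above, the simplest case is a tridiagonal system, which can be solved with one forward elimination sweep and one backward substitution sweep (the Thomas algorithm). The following is a minimal Python sketch under the assumption that no pivoting is needed (for example, a diagonally dominant matrix); `solve_tridiagonal` and its argument layout are hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system by forward elimination + back substitution.

    lower[i-1] is A[i][i-1], diag[i] is A[i][i], upper[i] is A[i][i+1],
    so lower and upper have length n-1. Assumes no pivoting is required.
    """
    n = len(diag)
    c = [0.0] * n            # modified super-diagonal
    d = [0.0] * n            # modified right-hand side
    if n > 1:
        c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward sweep: eliminate the sub-diagonal.
    for i in range(1, n):
        denom = diag[i] - lower[i - 1] * c[i - 1]
        if i < n - 1:
            c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i - 1] * d[i - 1]) / denom
    # Backward sweep: substitute from the last unknown upwards.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# Small diagonally dominant example with a known solution.
n = 5
lower = [-1.0] * (n - 1)
diag = [4.0] * n
upper = [-1.0] * (n - 1)
x_true = [1.0, 2.0, 3.0, 4.0, 5.0]
rhs = []
for i in range(n):
    value = diag[i] * x_true[i]
    if i > 0:
        value += lower[i - 1] * x_true[i - 1]
    if i < n - 1:
        value += upper[i] * x_true[i + 1]
    rhs.append(value)

x = solve_tridiagonal(lower, diag, upper, rhs)
print(x)
```

Only the three diagonals and two work vectors are stored, so memory grows linearly with n. The same idea extends to wider bands at the cost of more bookkeeping.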
As for adding additional memory: 32-bit MATLAB can address at most 2^32 bytes (4 GB), so increasing the physical RAM on your computer won't help unless you're using the 64-bit version.
MATLAB's \ usually tries several methods to solve a problem. First, if it sees that your matrix is symmetric, it tries a Cholesky factorization. After several steps, if it cannot find a suitable method, the current version of Matlab uses UMFPACK from the SuiteSparse package.
UMFPACK is a specific LU implementation, and it is known for its speed and good use of memory in practice. It also tries to reduce fill-in and keep the matrix as sparse as possible. That is why MATLAB uses this code.
(I am working on UMFPACK for my PhD under supervision of Dr Tim Davis, its creator)
Therefore, using another LU factorization won't help. It is an LU factorization already.
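To show what "fill-in" means and why ordering matters, here is a toy Python sketch. It is not UMFPACK's algorithm, only plain Gaussian elimination without pivoting on a small "arrow" matrix, once with the dense row and column first and once with them last. The helper names `arrow_matrix` and `lu_nonzeros` are made up for this example.

```python
def arrow_matrix(n, dense_index):
    """Diagonal matrix plus one dense row and column at dense_index."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = float(n)       # strong diagonal, so no pivoting is needed
        A[i][dense_index] = 1.0
        A[dense_index][i] = 1.0
    A[dense_index][dense_index] = float(n)
    return A


def count_nonzeros(M):
    return sum(1 for row in M for v in row if v != 0.0)


def lu_nonzeros(A):
    """Factor a copy of A in place (L below the diagonal, U on and above)
    and return the number of nonzeros in the combined factors."""
    n = len(A)
    M = [row[:] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            if M[i][k] != 0.0:
                M[i][k] /= M[k][k]               # multiplier, stored in L
                for j in range(k + 1, n):
                    if M[k][j] != 0.0:
                        M[i][j] -= M[i][k] * M[k][j]
    return count_nonzeros(M)


n = 8
dense_first = arrow_matrix(n, 0)        # dense row/column eliminated first
dense_last = arrow_matrix(n, n - 1)     # same structure, reordered

nnz_original = count_nonzeros(dense_first)
nnz_first = lu_nonzeros(dense_first)
nnz_last = lu_nonzeros(dense_last)
print(nnz_original, nnz_first, nnz_last)
```

Both matrices have the same number of nonzeros, but eliminating the dense row and column first fills the factors completely, while eliminating them last creates no new nonzeros at all. Sparse direct solvers apply fill-reducing orderings for exactly this reason.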
One of the easiest ways to approach this is to test your problem on another machine with more memory to see if it works.
I guess matlab does some garbage collection and wastes some memory, so using UMFPACK directly might help you. You can either call it from C/C++ or use its MATLAB interface. Take a look at the SuiteSparse package.
Based on the structure of your matrix, I think MATLAB tries to use Cholesky; I don't know what MATLAB's strategy is if Cholesky runs out of memory. Take into account that Cholesky is easier to manage in terms of memory.
There are other packages that might help you as well. CSparse is a lightweight package and it might help. There are other well-known packages that might be helpful; search for SuperLU.

How does MATLAB vectorized code work "under the hood"?

I understand how using vectorization in a language like MATLAB speeds up the code by removing the overhead of maintaining a loop variable, but how does the vectorization actually take place in the assembly / machine code? I mean there still has to be a loop somewhere, right?
The Matlab concept of 'vectorization' is completely different from the concept of vector instructions, such as SSE. This is a common misunderstanding between two groups of people: matlab programmers and C/asm programmers. Matlab 'vectorization', as the word is commonly used, is only about expressing loops in the form of (vectors of) matrix indices, and sometimes about writing things in terms of basic matrix/vector operations (BLAS), instead of writing the loop itself. Matlab 'vectorized' code is not necessarily executed as vectorized CPU instructions. Consider the following code:
A = rand(1000);
B = (A(1:2:end,:)+A(2:2:end,:))/2;
This code computes mean values for two adjacent matrix rows. It is a 'vectorized' matlab expression. However, since matlab stores matrices column-wise (columns are contiguous in memory), this operation is not trivially changed into operations on SSE vectors: since we perform the operations row-wise the data you need to load into the vectors is not stored contiguously in the memory.
This code on the other hand
A = rand(1000);
B = (A(:,1:2:end)+A(:,2:2:end))/2;
can take advantage of SSE instructions and streaming instructions, since we operate on two adjacent columns at a time.
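The difference between the two snippets comes down to memory strides. The sketch below, in Python with 0-based indices, computes where the operands of each expression sit in a column-major layout. `column_major_offset` is a hypothetical helper, and whether a given MATLAB version actually emits SSE instructions for either expression is not something this shows; it only illustrates contiguity.

```python
def column_major_offset(i, j, n_rows):
    """Linear offset (in elements) of entry (i, j) when columns are contiguous."""
    return i + j * n_rows


def strides(offsets):
    """Differences between consecutive offsets."""
    return [b - a for a, b in zip(offsets, offsets[1:])]


n_rows = 1000

# Row-pair version, A(1:2:end,:): the left operand takes every other row,
# so within one column its elements are two slots apart.
row_pair_offsets = [column_major_offset(i, 0, n_rows) for i in range(0, 8, 2)]

# Column-pair version, A(:,1:2:end): the left operand is a whole column,
# so consecutive elements are adjacent in memory.
col_pair_offsets = [column_major_offset(i, 0, n_rows) for i in range(4)]

row_pair_strides = strides(row_pair_offsets)
col_pair_strides = strides(col_pair_offsets)
print("row-pair strides:   ", row_pair_strides)
print("column-pair strides:", col_pair_strides)
```

A unit stride means a vector register can be filled with one contiguous load; a stride of two means the data has to be gathered or shuffled first.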
So, matlab 'vectorization' is not equivalent to using CPU vector instructions. It is just a word used to signify the lack of a loop written in MATLAB. To add to the confusion, people sometimes even use the word to say that some loop has been implemented using a built-in function such as arrayfun or bsxfun, which is even more misleading since those functions might be significantly slower than native matlab loops. As robince said, not all loops are slow in matlab nowadays, though you do need to know when they work and when they don't.
In any case you always need a loop; it is just implemented in matlab built-in functions / BLAS instead of in the user's matlab code.
Yes, there is still a loop, but it runs directly in compiled code. Loops in Fortran (on which Matlab was originally based), C or C++ are not inherently slow. That they are slow in Matlab is a property of the dynamic runtime (they are also slower in other dynamic languages like Python).
Since Matlab has introduced a Just-In-Time compiler loop performance has actually increased dramatically - so the old guidelines to avoid loops are less important with recent versions than they once were.

Matlab and GPU/CUDA programming

I need to run several independent analyses on the same data set.
Specifically, I need to run bunches of 100 glm (generalized linear models) analyses and was thinking to take advantage of my video card (GTX580).
As I have access to Matlab and the Parallel Computing Toolbox (and I'm not good with C++), I decided to give it a try.
I understand that a single GLM is not ideal for parallel computing, but as I need to run 100-200 in parallel, I thought that using parfor could be a solution.
My problem is that it is not clear to me which approach I should follow. I wrote a gpuArray version of the matlab function glmfit, but using parfor doesn't have any advantage over a standard "for" loop.
Has this anything to do with the matlabpool setting? It is not even clear to me how to set this to "see" the GPU card. By default, it is set to the number of cores in the CPU (4 in my case), if I'm not wrong.
Am I completely wrong on the approach?
Any suggestion would be highly appreciated.
Thanks. I'm aware of GPUmat and Jacket, and I could start writing in C without too much effort, but I'm testing the GPU computing possibilities for a department where everybody uses Matlab or R. The final goal would be a cluster based on C2050 and the Matlab Distribution Server (or at least this was the first project).
Reading the ads from Mathworks, I was under the impression that parallel computing was possible even without C skills. It is impossible to ask the researchers in my department to learn C, so I'm guessing that GPUmat and Jacket are the better solutions, even if the limitations are quite big and support for several commonly used routines like glm is non-existent.
How can they be interfaced with a cluster? Do they work with some job distribution system?
I would recommend you try either GPUMat (free) or AccelerEyes Jacket (buy, but has free trial) rather than the Parallel Computing Toolbox. The toolbox doesn't have as much functionality.
To get the most performance, you may want to learn some C (no need for C++) and code in raw CUDA yourself. Many of these high level tools may not be smart enough about how they manage memory transfers (you could lose all your computational benefits from needlessly shuffling data across the PCI-E bus).
Parfor will help you utilize multiple GPUs, but not a single GPU. The thing is that a single GPU can do only one thing at a time, so parfor on a single GPU and for on a single GPU will achieve the exact same effect (as you are seeing).
Jacket tends to be more efficient as it can combine multiple operations and run them more efficiently and has more features, but most departments already have parallel computing toolbox and not jacket so that can be an issue. You can try the demo to check.
No experience with gpumat.
The parallel computing toolbox is getting better, but what you need is some large matrix operations. GPUs are good at doing the same thing many times, so you need to either combine your code somehow into one operation or make each operation big enough. We are talking about a need for ~10000 things in parallel at least, although it's not a set of 1e4 matrices but rather a large matrix with at least 1e4 elements.
I do find that with the parallel computing toolbox you still need quite a bit of inline CUDA code to be effective (it's still pretty limited). It does better allow you to inline kernels and transform matlab code into kernels though, something that

How do I obtain the eigenvalues of a huge matrix (size: 2x10^5)

I have a matrix of size 200000 x 200000 and I need to find its eigenvalues. I was using matlab until now, but since matlab cannot handle a matrix of this size I switched to perl, and now perl cannot handle it either: it reports out of memory. I would like to know whether I can find the eigenvalues of this matrix using some other programming language that can handle such a huge amount of data. Most of the elements are nonzero, so a sparse matrix is not an option. Please help me solve this.
I think you may still have luck with MATLAB. Take a look into their distributed computing toolbox. You'd need some kind of parallel environment, a computing cluster.
If you don't have a computational cluster, you might look into distributed eigenvalue/vector calculation methods that could be employed on Amazon EC2 or similar.
There is also a discussion of parallel eigenvalue calculation methods here, which may direct you to better libraries and programming approaches than Perl.