Dealing with a large kernel matrix in SVM - matlab

I have a matrix X, size 40-by-60000
while writing the SVM, I need to form a linear kernel: K = X'*X
And of course I would get an error
Requested 60000x60000 (26.8GB) array exceeds maximum array size preference.
How is it usually done? The data set is Mnist, so someone must have done this before. In this case rank(K) <= 40, I need a way to store K and later pass it to quadprog.

How is it usually done?
Usually kernel matrices for big datasets are not precomputed. Since optimisation methods used (like SMO or gradient descent) do only need access to a subset of samples in each iteration, you simply need a data structure which is a lazy kernel matrix, in other words - each time an optimiser requests K[i,j] you literally compute K(xi,xj) then. Often, there are also caching mechanisms to make sure that often requested kernel values are already prepared etc.

If you're willing to commit to a linear kernel (or any other kernel whose corresponding feature transformation is easily computed) you can avoid allocating O(N^2) memory by using a primal optimization method, which does not construct the full kernel matrix K.
Primal methods represent the model using a weighted sum of the training samples' features, and so will only take O(NxD) memory, where N and D are the number of training samples and their feature dimension.
You could also use liblinear (if you resolve the C++ issues).
Note this comment from their website: "Without using kernels, one can quickly train a much larger set via a linear classifier."

This problem occurs due to the large size of your data set, thus it exceeds the amount of RAM available in your system. In 64-bit systems data processing performs better than in 32-bit, so you'll want to check which of the two your system is.

Related

Hierarquical clustering with big vector

I'm working in MATLAB and I have a vector of 4 million values. I've tried to use the linkage function but I get this error:
Error using linkage (line 240) Requested 1x12072863584470 (40017.6GB) array exceeds maximum array size preference. Creation of arrays greater than this limit may take a long time and cause MATLAB to become unresponsive. See array size limit or preference panel for more information.
I've found that some people avoid this error using the kmeans functions however I would like to know if there is a way to avoid this error and still use the linkage function.
Most hierarchical clustering needs O(n²) memory. So no, you don't want to use these algorithms.
There are some exceptions, such as SLINK and CLINK. These can be implemented with just linear memory. I just don't know if Matlab has any good implementations.
Or you use kmeans or DBSCAN, which also need only linear memory.
Are you absolutely sure that these 4 million values are statistically correct? If yes, your a lucky person and if no, then do the data pre-processing. You'll see the 4 million values have drastically decreased to a meaningful sample which can easily be fit into the memory (RAM) to do hierarchical clustering.

Solving Ax=b where A is too big to be stored in a single array

Problem: A is square, full rank, sparse and banded. It has way too many elements to be stored as a single matrix in Matlab (at least ~4.6*1018 and ideally ~1040, both of which exceed max array size. EDIT: A is stored as sparse, and the problem is not with limited memory but with limited number of elements). Therefore I have to store it as a collection of smaller arrays (rows/diagonals/columns/blocks).
Looking for: a way to solve Ax=b, with A given as a collection of smaller arrays. Ideally in Matlab but not a must.
Alternatively, if not in Matlab: maybe there's a program that can store and solve such a big A?
Found so far: methods if A is tri/pentadiagonal, but my A has N diagonals. Also found something about partitioning A to blocks, but couldn't find a way to then solve a linear system with these blocks.
p.s. The system is 64-bit.
Thanks everyone!
Not using Matlab would allow you to store larger arrays. ROOT is an open source framework developed at CERN that has C++ and Python interfaces and a variety of solvers. It is also capable of handling huge datasets and has a variety of visualization and analysis tools as well.
If you are interested in writing C or Fortran BLAS(Basic Linear Algebra Subroutines) and CBLAS would be good options. There are many open source and proprietary implementations of BLAS that should be available for most Linux/UNIX distributions. There are also plenty of examples showing how to use the BLAS subroutines in C and Fortran code available online.
If you have access to MATLAB's Parallel Computing Toolbox together with MATLAB Distributed Computing Server, you may be able to store A as a distributed array, in other words a single array whose elements are distributed across the memories of multiple machines in a cluster. You can call MATLAB's backslash command directly on a distributed array, and MATLAB handles the parallelization for you.
I wanted to put this as a comment, but I think it is better to state it as an answer.
You have a serious problem. It is not only a problem of indexing, it is also a problem of memory: 4.6x10^18 is huge. That is 4.6 exa elements. If you store them as real single precision, you need 4x4.6 exabyte of memory. A computer which such a huge memory, does not yet exists to my knowledge. You will need to gather all the storage (hard disk, not RAM) of a significant proportion of all computers in the world to store such a matrix. Think about it. Going to 10^40 elements is nearly impractical for the time being. With your 64 bit computers, the 64 bit address space can bearly address 4.6x10^18 elements. 64 bits address (or integer) makes it possible to directly index 2^64 elements which is roughly 16x10^18. So you have to think twice.
Going back to the problem itself, there are chances that you can turn your matrix into an implicit operator. By implicit operator, I mean, you do not need to store it, because it has a pattern that you know how to reproduce, or you can apply it to a vector without actually forming the matrix. If you have the matrix in hand, you are very likely in this situation, considering what I said above.
If that is the case, to solve your problem, you simply need to use an iterative solver and provide a black box that does your matrix multiplication. Going to other directions might be a waste of your time.

MATLAB's svmtrain : save support vector indexes instead of support vectors themselves

I'm working on a machine learning problem which requires me to use multiple support vector machines. It works relatively well; however, the problem is that the number of support vector machines for each SVM tends to be large (~2,000), and the number of input features is on the order of 50,000. I need about a 100 SVMs.
Running this on my laptop quickly exhausts all of the available memory; I think this is because svmtrain creates an SVM which saves (i.e. has another copy) of all the support vectors. Since I'm keeping the original training data around, I was wondering if there was a way to instruct it to save indexes to these support vectors instead, which would use much less memory? Or another way to reduce the amount of memory it needs per SVM?
svmtrain creates SVMStruct object, which stores the indices of support vectors in SupportVectorIndices field. So simply store this variable's values in some container and free the rest of the model.
Accodring to documentation
SupportVectorIndices — Vector of indices that specify the rows in Training, the training data, that were selected as support vectors after the data was normalized, according to the AutoScale argument.

Large and Sparse Matrix Multiplcation

I have a very large and sparse matrix of size 180GB(text , 30k * 3M) containing only the entries and no additional data. I have to do matrix multiplication , inversion and some similar linear algebra operations over it. I tried octave and simple single-threaded C code for the multiplication but my system RAM of 40GB gets used up very fast and then I can find the program starts thrashing. Is there any other options available to me. I am not familiar with MathLab or any other matrix operational library that can help me in doing so.
When I run a simple matrix multiplication of two matrices with 10 rows and 3 M cols, and its transpose, it gives the following error :
memory exhausted or requested size too large for range of Octave's index type
I am not sure whether the same would work on Matlab or not. For sparse matrix representation and matrix multiplication, is there another library or code.
if there are few enough nonzero entries, I suggest creating a sparse matrix S with appropriate dimensions and max nonzero entries; see matlab create sparse matrix. Then as #oleg komarov described, load the matrix in blocks and assign the nonzero entries from each block into the correct address in the sparse matrix S. I feel that if your matrix is sparse enough, then loading it is really the only difficulty you face. I had similar issues with large transfer operators.
Have you considered performing your processing in blocks? Transposition and multiplications work very well with block matrix processing (see https://en.wikipedia.org/wiki/Block_matrix) and that will get you around any limitations about the indices.
This wouldn't help you with matrix inversion though unless you can decompose your matrix in blocks when blocks that aren't on the diagonal are completely empty, which isn't stated in your assumptions.
Octave has a limit in both the memory resources of about 2GB and the maximum number of indices a matrix can hold of about 2^32 (for 32 bits Octave). MatLab doesn't have such a memory limit, since it will use all of your memory resources, swapping file included. Thus you could try with MatLab by setting a huge swapfile, you may then compute your operations (but it will anyway take quite along time...).
If you are interested by other approaches, you may take a look into out-of-core computing which aims to promote new methods to process huge datasets that cannot reside all in memory, but rather store it on disk and load efficiently the bits that are necessary.
For a practical approach, you may take a look into Blaze for Python (notice: still in development!).

Histogram computational efficiency

I am trying to plot a 2 GB matrix using MATLAB hist on a computer with 4 GB RAM. The operation is taking hours. Are there ways to increase the performance of the computation, by pre-sorting the data, pre-determining bin sizes, breaking the data into smaller groups, deleting the raw data as the data is added to bins, etc?
Also, after the data is plotted, I need to adjust the binning to ensure the curve is smooth. This requires starting over and re-binning the raw data. I assume the strategy involving the least computation would be to first bin the data using very small bins and then manipulate the bin size of the output, rather than re-binning the raw data. What is the best way to adjust bin sizes post-binning (assuming the bin sizes can only grow and not shrink)?
I don't like answers to StackOverflow Questions of the form "well even though you asked how to do X, you don't really want to do X, you really want to do Y, so here's a solution to Y"
But that's what i am going to do here. I think such an answer is justified in this rare instance becuase the answer below is in accord with sound practices in statistical analysis and because it avoids the current problem in front of you which is crunching 4 GB of datda.
If you want to represent the distribution of a population using a non-parametric density estimator, and you wwish to avoid poor computational performance, a kernel density estimator (KDE) will do the job far better than a histogram.
To begin with, there's a clear preference for KDEs versus histograms among the majority of academic and practicing statisticians. Among the numerous texts on this topic, ne that i think is particularly good is An introduction to kernel density estimation )
Reasons why KDE is preferred to histogram
the shape of a histogram is strongly influenced by the choice of
total number of bins; yet there is no authoritative technique for
calculating or even estimating a suitable value. (Any doubts about this, just plot a histogram from some data, then watch the entire shape of the histogram change as you adjust the number of bins.)
the shape of the histogram is strongly influenced by the choice of
location of the bin edges.
a histogram gives a density estimate that is not smooth.
KDE eliminates completely histogram properties 2 and 3. Although KDE doesn't produce a density estimate with discrete bins, an analogous parameter, "bandwidth" must still be supplied.
To calculate and plot a KDE, you need to pass in two parameter values along with your data:
kernel function: the most common options (all available in the MATLAB kde function) are: uniform, triangular, biweight, triweight, Epanechnikov, and normal. Among these, gaussian (normal) is probably most often used.
bandwith: the choice of value for bandwith will almost certainly have a huge effect on the quality of your KDE. Therefore, sophisticated computation platforms like MATLAB, R, etc. include utility functions (e.g., rusk function or MISE) to estimate bandwith given oother parameters.
KDE in MATLAB
kde.m is the function in MATLAB that implementes KDE:
[h, fhat, xgrid] = kde(x, 401);
Notice that bandwith and kernel are not supplied when calling kde.m. For bandwitdh: kde.m wraps a function for bandwidth selection; and for the kernel function, gaussian is used.
But will using KDE in place of a histogram solve or substantially eliminate the very slow performance given your 2 GB dataset?
It certainly should.
In your Question, you stated that the lagging performance occurred during plotting. A KDE does not require mapping of thousands (missions?) of data points a symbol, color, and specific location on a canvas--instead it plots a single smooth line. And because the entire data set doesn't need to be rendered one point at a time on the canvas, they don't need to be stored (in memory!) while the plot is created and rendered.