Hierarchical clustering with a big vector - MATLAB

I'm working in MATLAB and I have a vector of 4 million values. I've tried to use the linkage function but I get this error:
Error using linkage (line 240) Requested 1x12072863584470 (40017.6GB) array exceeds maximum array size preference. Creation of arrays greater than this limit may take a long time and cause MATLAB to become unresponsive. See array size limit or preference panel for more information.
I've found that some people avoid this error by using the kmeans function instead; however, I would like to know whether there is a way to avoid the error and still use the linkage function.

Most hierarchical clustering algorithms need O(n²) memory for the pairwise distance matrix. So no, you don't want to use these algorithms on 4 million values.
There are some exceptions, such as SLINK and CLINK. These can be implemented with just linear memory. I just don't know if MATLAB has any good implementations.
Alternatively, use kmeans or DBSCAN, which also need only linear memory.
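For the special case in the question (a one-dimensional vector), single linkage can be computed without any distance matrix: in 1-D, the clusters are always runs of neighbouring values in sorted order, so cutting the sorted vector at the k-1 largest gaps gives the same k clusters as cutting a single-linkage dendrogram. The following is a minimal sketch of that idea, not a replacement for linkage in general; it assumes single linkage is acceptable, and the toy vector x and the cluster count k are made up for illustration.

```matlab
% Single-linkage clustering of a 1-D vector in O(n) memory.
% Assumptions: the data are one-dimensional and single linkage is wanted.
x = [1.0 1.1 1.2 5.0 5.1 9.7 9.8 9.9];   % toy data standing in for the 4M values
k = 3;                                    % desired number of clusters (hypothetical)

[xs, order] = sort(x(:));                 % sorted values and their original positions
gaps = diff(xs);                          % distance between each pair of sorted neighbours
[~, gapIdx] = sort(gaps, 'descend');      % indices of the gaps, largest first
cuts = gapIdx(1:k-1);                     % split at the k-1 largest gaps

isCut = false(numel(gaps), 1);
isCut(cuts) = true;
labelsSorted = 1 + [0; cumsum(isCut)];    % label increases by one after every cut

labels = zeros(numel(xs), 1);
labels(order) = labelsSorted;             % map the labels back to the original order
disp(labels');
```

The only large arrays are a few vectors of length n, so 4 million values are not a problem. Other linkage criteria (complete, average, Ward) do not reduce to this gap rule.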

Are you absolutely sure that all 4 million values are statistically correct? If yes, you're a lucky person; if not, pre-process the data first. You will likely see the 4 million values shrink drastically to a meaningful sample that easily fits into memory (RAM) for hierarchical clustering.

Related

Taking big chunk of Time while Running K means on Python Spark

I have a NumPy array of 0s and 1s with 37k rows and 6k columns.
When I try to run Kmeans Clustering in Pyspark, it takes almost forever to load and I cannot get the output. Is there any way to reduce the processing time or any other tricks to solve this issue?
I think you may have too many columns, so you may be running into the curse of dimensionality. From Wikipedia:
[...] The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. [...]
To solve this problem, have you considered reducing the number of columns and keeping only the relevant ones? From Wikipedia again:
[...] Feature projection transforms the data in the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist. [...]

Matlab optimization toolbox, optimizing hessian

I have never used this toolbox before, and I have a very large problem (i.e. a large number of variables) to be optimized. I'm aware it's possible to optimize the Hessian computation, which is my issue given the error:
Error using eye
Requested 254016x254016 (480.7GB) array exceeds maximum array size preference. Creation of arrays greater than this limit may
take a long time and cause MATLAB to become unresponsive. See array size limit or preference panel for more information.
But according to this quote (from a forum) it must be possible to optimize the hessian computation:
If you are going to use the trust-region algorithm, you will need to
choose some combination of the options 'Hessian', 'HessMult', and
'HessPattern' to avoid full, explicit computation of the Hessian.
I am struggling to find examples of these settings. Does anyone know of any?
My problem is a sparse problem, if such information is necessary.
Basically I'm sure there's some extra options to be put in a line like:
option = optimoptions(@fminunc,...
'Display','iter','GradObj','on','MaxIter',30,...
'ObjectiveLimit',10e-10,'Algorithm','quasi-newton');
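You probably need to add 'HessPattern',Hstr to optimoptions. Note that HessPattern is used by the 'trust-region' algorithm, not by 'quasi-newton', and that trust-region requires you to supply the gradient. The MathWorks documentation has an example in which Hstr is loaded from brownhstr.mat; for your own problem you need to work out the sparsity pattern matrix Hstr of your Hessian yourself.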
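As a rough illustration of what that looks like, here is a small sketch. The objective chainObjective is made up for this example (each variable is coupled only to its two neighbours, so the Hessian is tridiagonal), and n is kept small; your own Hstr should have a nonzero wherever your Hessian can be nonzero. It assumes the Optimization Toolbox and a release that allows local functions in scripts (R2016b or later).

```matlab
% Sketch: pass a Hessian sparsity pattern to fminunc (Optimization Toolbox).
% The objective is a made-up example whose Hessian is tridiagonal.
n = 1000;                                   % number of variables (toy size)
Hstr = spdiags(ones(n, 3), -1:1, n, n);     % sparse pattern: nonzero where H may be nonzero

options = optimoptions(@fminunc, ...
    'Algorithm', 'trust-region', ...        % the algorithm that uses HessPattern
    'SpecifyObjectiveGradient', true, ...   % current name for 'GradObj','on'
    'HessPattern', Hstr, ...
    'Display', 'iter');

x0 = zeros(n, 1);
[x, fval] = fminunc(@chainObjective, x0, options);

function [f, g] = chainObjective(x)
    % f(x) = sum((x-1).^2) + sum(diff(x).^2), minimised at x = ones(n,1).
    d = diff(x);
    f = sum((x - 1).^2) + sum(d.^2);
    if nargout > 1                          % gradient requested by the solver
        g = 2 * (x - 1);
        g(1:end-1) = g(1:end-1) - 2 * d;
        g(2:end)   = g(2:end)   + 2 * d;
    end
end
```

With the pattern supplied, the solver estimates the Hessian by sparse finite differences of the gradient instead of building a dense n-by-n matrix, which is what the eye error in the question was complaining about.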

Clustering, Large dataset, learning large number vocabulary words

I am trying to cluster a large dataset with these dimensions:
rows: 1.4 million
cols:900
expected number of clusters: 10,000 (10k)
The problem is that my dataset is 10 GB and I have 16 GB of RAM. I am trying to implement this in MATLAB. It would be a big help if someone could respond.
P.S. So far I have tried hierarchical clustering. One paper suggested "fixed radius incremental pre-clustering", but I didn't understand the procedure.
Thanks in advance.
Use some algorithm that does not require a distance matrix. Instead, choose one that can be index accelerated.
Anything with a distance matrix will exceed your memory. But even an algorithm that does not require one (e.g., SLINK uses only O(n) memory) may still take too long. Indexes could reduce the runtime to O(n log n), although indexes may have problems on your data.
Index accelerated algorithms are for example: OPTICS, DBSCAN.
Just don't use the really bad Matlab scripts for these algorithms.
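To put numbers on the memory claim, here is the back-of-the-envelope arithmetic for the sizes given in the question, assuming everything is stored as 8-byte doubles and that only the upper triangle of the distance matrix is kept (as pdist does):

```matlab
% Memory estimate for the data and for a pairwise distance matrix.
n = 1.4e6;                 % rows (from the question)
d = 900;                   % columns (from the question)
bytesPerDouble = 8;        % assumption: double precision

dataGB = n * d * bytesPerDouble / 1e9;        % the raw data matrix, about 10 GB
numPairs = n * (n - 1) / 2;                   % one distance per unordered pair
distTB = numPairs * bytesPerDouble / 1e12;    % condensed distance matrix, about 7.8 TB

fprintf('data: %.2f GB, pairwise distances: %.2f TB\n', dataGB, distTB);
```

The data itself just about fits in 16 GB of RAM, but the distance matrix is several hundred times larger than that, which is why only algorithms that avoid it are an option here.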

Dealing with a large kernel matrix in SVM

I have a matrix X of size 40-by-60000.
While writing the SVM, I need to form a linear kernel: K = X'*X
And of course I get an error:
Requested 60000x60000 (26.8GB) array exceeds maximum array size preference.
How is it usually done? The data set is MNIST, so someone must have done this before. In this case rank(K) <= 40, so I need a way to store K and later pass it to quadprog.
How is it usually done?
Usually kernel matrices for big datasets are not precomputed. Since the optimisation methods used (like SMO or gradient descent) only need access to a subset of samples in each iteration, you simply need a data structure that acts as a lazy kernel matrix. In other words, each time the optimiser requests K[i,j], you compute K(xi,xj) on the spot. Often there is also a caching mechanism, so that frequently requested kernel values are already available.
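For the linear kernel in the question this is particularly cheap, because every entry, row block, or matrix-vector product of K = X'*X can be computed from X directly. The sketch below uses a small random X as a stand-in for the 40-by-60000 data so that the result can be checked against the explicit K; the names kij, Kblock and Kv are just illustrative.

```matlab
% Lazy access to a linear kernel K = X'*X without storing K.
rng(0);
X = randn(40, 500);                  % toy stand-in for the 40-by-60000 matrix
v = randn(500, 1);

kij = @(i, j) X(:, i)' * X(:, j);    % single entry (or sub-block) K(i,j) on demand
Kblock = X(:, 1:10)' * X;            % rows 1..10 of K, a 10-by-N block
Kv = X' * (X * v);                   % K*v using two thin products, O(N*D) memory

% Check against the explicit kernel, which is only feasible at toy size.
K = X' * X;
fprintf('max entry error:   %g\n', abs(kij(3, 7) - K(3, 7)));
fprintf('relative Kv error: %g\n', norm(Kv - K * v) / norm(K * v));
```

A solver that only needs products with K or a few rows at a time can work from these pieces. quadprog, as far as I know, expects the matrix itself, so this helps mainly with solvers you write yourself or that accept function handles.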
If you're willing to commit to a linear kernel (or any other kernel whose corresponding feature transformation is easily computed) you can avoid allocating O(N^2) memory by using a primal optimization method, which does not construct the full kernel matrix K.
Primal methods represent the model using a weighted sum of the training samples' features, and so will only take O(NxD) memory, where N and D are the number of training samples and their feature dimension.
You could also use liblinear (if you resolve the C++ issues).
Note this comment from their website: "Without using kernels, one can quickly train a much larger set via a linear classifier."
This problem occurs because your data set is so large that it exceeds the amount of RAM available in your system. A 64-bit MATLAB can address far more memory than a 32-bit one, so you'll want to check which of the two you are running.

One-vs-One Classification and Out of Memory Error in MATLAB

I'm trying to classify 5 kinds of data, each has about 10,000 samples.
Using the one-vs-one and voting method, I have to run the classification 5(5-1)/2=10 times,
for which I have written a loop. But I get the message:
Error using svmclassify (line 114)
An error was encountered during classification.
Out of memory. Type HELP MEMORY for your options.
What should I do?
How can I rewrite the code?
Compress data to reduce memory fragmentation.
If possible, break large matrices into several smaller matrices so that less memory is used at any one time.
If possible, reduce the size of your data.
Add more memory to the system.
Reference: http://www.mathworks.com/help/matlab/matlab_prog/resolving-out-of-memory-errors.html
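The suggestion about breaking large matrices into smaller ones could look something like the sketch below, which classifies the samples block by block so that only one block is in flight at a time. classifyFcn is a hypothetical stand-in (a simple linear rule on random data); in the question's loop it would be replaced by the real call, for example something like @(Xblock) svmclassify(svmStruct, Xblock) with your own trained structure. The block size is an arbitrary choice to tune against available memory.

```matlab
% Classify a large sample matrix in row blocks to limit peak memory use.
rng(1);
X = randn(10000, 20);                        % toy data: 10,000 samples, 20 features
w = randn(20, 1);
classifyFcn = @(Xblock) sign(Xblock * w);    % hypothetical stand-in for the real classifier

blockSize = 1000;                            % rows per block (assumption, tune as needed)
n = size(X, 1);
labels = zeros(n, 1);                        % preallocate the output once

for first = 1:blockSize:n
    last = min(first + blockSize - 1, n);    % the final block may be shorter
    labels(first:last) = classifyFcn(X(first:last, :));
end
```

Preallocating labels and reusing the same loop variables also avoids growing arrays inside the loop, which is one common source of fragmentation.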