One-vs-One Classification and Out of Memory Error in MATLAB - matlab

I'm trying to classify 5 kinds of data, each with about 10,000 samples.
Using the one-vs-one and voting method, I have to run classification 5(5-1)/2 = 10 times,
for which I have written a loop. But I get the message:
Error using svmclassify (line 114)
An error was encountered during classification.
Out of memory. Type HELP MEMORY for your options.
What should I do?
How can I rewrite the code?

Compress data to reduce memory fragmentation.
If possible, break large matrices into several smaller matrices so that less memory is used at any one time.
If possible, reduce the size of your data.
Add more memory to the system.
Reference: http://www.mathworks.com/help/matlab/matlab_prog/resolving-out-of-memory-errors.html
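As a rough illustration of "use less memory at any one time" for one-vs-one voting, here is a minimal sketch in Python (used for illustration only; the same structure applies to a MATLAB loop). Each pairwise model is trained on just the two classes involved, and the samples to classify are processed in chunks. The function names, the chunk_size parameter, and the centroid-based "model" are all assumptions; the centroid model is only a stand-in for the binary SVM in the question.

```python
from collections import Counter
from itertools import combinations


def centroid(rows):
    """Mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_pairwise(data_by_class):
    """Train one binary model per class pair, using only that pair's samples.

    The 'model' is a hypothetical stand-in (the two class centroids);
    in the original setting it would be a binary SVM.
    """
    models = {}
    for a, b in combinations(sorted(data_by_class), 2):
        models[(a, b)] = (centroid(data_by_class[a]), centroid(data_by_class[b]))
    return models


def predict_in_chunks(models, samples, chunk_size=1000):
    """Classify samples one chunk at a time and combine the pairwise votes."""
    labels = []
    for start in range(0, len(samples), chunk_size):
        for x in samples[start:start + chunk_size]:
            votes = Counter()
            for (a, b), (ca, cb) in models.items():
                winner = a if sq_dist(x, ca) <= sq_dist(x, cb) else b
                votes[winner] += 1
            labels.append(votes.most_common(1)[0][0])
    return labels
```

With 5 classes, train_pairwise builds the 10 pairwise models, and no step needs all samples of all classes in a single matrix.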

Related

Taking big chunk of Time while Running K means on Python Spark

I have a NumPy array of 0s and 1s with 37k rows and 6k columns.
When I try to run Kmeans Clustering in Pyspark, it takes almost forever to load and I cannot get the output. Is there any way to reduce the processing time or any other tricks to solve this issue?
I think you may have too many columns; you may be running into the curse of dimensionality. From Wikipedia:
[...] The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. [...]
To address this, have you considered reducing the number of columns and keeping only the relevant ones? Again from Wikipedia:
[...] Feature projection transforms the data in the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist. [...]
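One cheap way to "keep only the relevant columns" before trying PCA is to drop columns that are constant or nearly constant. This is a minimal plain-Python sketch of that idea, not PCA and not a Spark API; the function names and the min_fraction threshold are assumptions chosen for illustration.

```python
def informative_columns(rows, min_fraction=0.01):
    """Indices of 0/1 columns whose rarer value appears in at least
    min_fraction of the rows (hypothetical threshold)."""
    n = len(rows)
    keep = []
    for j, col in enumerate(zip(*rows)):
        ones = sum(col)
        if min(ones, n - ones) / n >= min_fraction:
            keep.append(j)
    return keep


def select_columns(rows, keep):
    """Return the rows restricted to the kept column indices."""
    return [[row[j] for j in keep] for row in rows]
```

A column that is all 0s or all 1s carries no information for clustering, so removing such columns shrinks the 6k-wide input without changing the distances between rows.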

Hierarchical clustering with big vector

I'm working in MATLAB and I have a vector of 4 million values. I've tried to use the linkage function but I get this error:
Error using linkage (line 240) Requested 1x12072863584470 (40017.6GB) array exceeds maximum array size preference. Creation of arrays greater than this limit may take a long time and cause MATLAB to become unresponsive. See array size limit or preference panel for more information.
I've found that some people avoid this error by using the kmeans function; however, I would like to know if there is a way to avoid this error and still use the linkage function.
Most hierarchical clustering needs O(n²) memory. So no, you don't want to use these algorithms.
There are some exceptions, such as SLINK and CLINK. These can be implemented with just linear memory. I just don't know if Matlab has any good implementations.
Or you use kmeans or DBSCAN, which also need only linear memory.
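To show what "linear memory" means for SLINK, here is a minimal Python sketch of the pointer-representation algorithm (Sibson, 1973). It keeps only three arrays of length n and never builds the distance matrix, although the running time is still O(n²). The function name and the dist callback are assumptions for illustration; this is not a MATLAB or toolbox implementation.

```python
import math


def slink(points, dist):
    """Single-linkage clustering in O(n) memory.

    Returns (pi, lam): point j merges into the cluster containing
    point pi[j] at height lam[j]; the last point has height infinity.
    """
    n = len(points)
    pi = [0] * n
    lam = [math.inf] * n
    m = [0.0] * n  # distances from the current point to earlier points
    for i in range(n):
        pi[i] = i
        lam[i] = math.inf
        for j in range(i):
            m[j] = dist(points[i], points[j])
        for j in range(i):
            if lam[j] >= m[j]:
                m[pi[j]] = min(m[pi[j]], lam[j])
                lam[j] = m[j]
                pi[j] = i
            else:
                m[pi[j]] = min(m[pi[j]], m[j])
        for j in range(i):
            if lam[j] >= lam[pi[j]]:
                pi[j] = i
    return pi, lam


pi, lam = slink([0, 1, 10, 11, 5], lambda a, b: abs(a - b))
print(sorted(lam[:-1]))  # merge heights: [1, 1, 4, 5]
```

The merge heights match what single linkage produces from the full distance matrix, but the working set is three length-n lists.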
Are you absolutely sure that all 4 million values are statistically valid? If yes, you're a lucky person; if no, do the data pre-processing first. You'll likely see the 4 million values shrink drastically to a meaningful sample that easily fits into memory (RAM) for hierarchical clustering.

Clustering, Large dataset, learning large number vocabulary words

I am trying to cluster a large dataset with these dimensions:
rows: 1.4 million
cols: 900
expected number of clusters: 10,000 (10k)
The problem is that my dataset is 10 GB and I have 16 GB of RAM. I am trying to implement this in MATLAB. It would be a big help if someone could respond.
P.S. So far I have tried hierarchical clustering. One paper suggested "fixed radius incremental pre-clustering", but I didn't understand the procedure.
Thanks in advance.
Use some algorithm that does not require a distance matrix. Instead, choose one that can be index accelerated.
Anything with a distance matrix will exceed your memory. But even an algorithm that does not require one (e.g., SLINK uses only O(n) memory) may still take too long. Indexes could reduce the runtime to O(n log n), although on your data indexes may have problems.
Index accelerated algorithms are for example: OPTICS, DBSCAN.
Just don't use the really bad Matlab scripts for these algorithms.
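For a sense of scale, the size of a pairwise distance matrix can be estimated directly. This is a small Python sketch; the helper name is hypothetical, and it assumes 8-byte doubles and the condensed form that stores each pair only once.

```python
def condensed_distance_bytes(n_points, bytes_per_value=8):
    """Bytes needed to store one distance per unordered pair of points."""
    pairs = n_points * (n_points - 1) // 2
    return pairs * bytes_per_value


size = condensed_distance_bytes(1_400_000)
print(f"{size / 1e12:.2f} TB")  # 7.84 TB
```

That is several hundred times the 16 GB of RAM mentioned in the question, before the clustering itself allocates anything.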

How can I train neural network with more data than can fit the memory?

I'm trying to train a classifier with a recurrent layer using a lot of data. As a result, not all of the data fits into memory, and I get the following error:
Error using zeros
Requested 1x2114046976 (15.8GB) array exceeds maximum array size preference. Creation of arrays greater than this limit may take a long time and cause MATLAB to become unresponsive. See array size limit or preference panel for more information.
Error in nnMex.perfsGrad (line 3)
TEMP = zeros(1,ceil(hints.tempSizeBG/8)*8);
Error in nnCalcLib/perfsGrad (line 294)
lib.calcMode.perfsGrad(calcNet,lib.calcData,lib.calcHints);
Error in trainscg>initializeTraining (line 153)
[worker.perf,worker.vperf,worker.tperf,worker.gWB,worker.gradient] = calcLib.perfsGrad(calcNet);
Error in nnet.train.trainNetwork>trainNetworkInMainThread (line 28)
worker = localFcns.initializeTraining(archNet,calcLib,calcNet,tr);
Error in nnet.train.trainNetwork (line 16)
[archNet,tr] = trainNetworkInMainThread(archNet,rawData,calcLib,calcNet,tr,feedback,localFcns);
Error in trainscg>train_network (line 147)
[archNet,tr] = nnet.train.trainNetwork(archNet,rawData,calcLib,calcNet,tr,localfunctions);
Error in trainscg (line 59)
[out1,out2] = train_network(varargin{2:end});
Error in network/train (line 369)
[net,tr] = feval(trainFcn,'apply',net,data,calcLib,calcNet,tr);
It should be noted that currently my training input is 11x52266 and the network has ~3k weight elements due to the recurrent layer. I would like, however, to provide 15 times as much data for training.
How can I cope? Are there any techniques to map the local variable it's trying to initialize on my SSD instead of memory?
There is the "reduction" option for training, but it does not seem to make any difference here. The same error occurs regardless.
In general, if your dataset is too big to fit into memory, you'll have to process it in chunks. For training large networks, it's typical to use stochastic gradient descent (which only requires access to a single data point at a time), or minibatch training (which only requires access to the data points in the minibatch). Besides requiring less memory, these methods also tend to converge much faster than batch gradient descent (which uses the entire dataset for each weight update). Disk access is slow so, even though only a few data points are required per update, you should still load as many points as you can, then split them into minibatches, etc. There are other tricks you can play to reduce the number of disk reads, like performing multiple updates before loading the next set of data.
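The chunk-then-minibatch loop described above can be sketched as follows. This is a minimal Python illustration on a toy linear model, not the MATLAB toolbox's training routine; chunks_from_disk is a hypothetical stand-in for reading one block of data from disk, and the learning rate, batch size, and epoch count are arbitrary assumptions.

```python
def chunks_from_disk(n_points=100, chunk_size=25):
    """Hypothetical stand-in for loading one chunk at a time from disk."""
    for start in range(0, n_points, chunk_size):
        stop = min(start + chunk_size, n_points)
        xs = [k / (n_points - 1) for k in range(start, stop)]
        yield [(x, 2.0 * x + 1.0) for x in xs]  # synthetic targets: y = 2x + 1


def minibatches(chunk, batch_size):
    """Split an in-memory chunk into consecutive minibatches."""
    for i in range(0, len(chunk), batch_size):
        yield chunk[i:i + batch_size]


def train(epochs=300, batch_size=5, lr=0.5):
    """Fit y = w*x + b with minibatch gradient descent on 0.5 * mean squared error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for chunk in chunks_from_disk():  # only one chunk is in memory
            for batch in minibatches(chunk, batch_size):
                errors = [(w * x + b - y, x) for x, y in batch]
                grad_w = sum(e * x for e, x in errors) / len(batch)
                grad_b = sum(e for e, _ in errors) / len(batch)
                w -= lr * grad_w
                b -= lr * grad_b
    return w, b
```

Each weight update touches only one minibatch, so peak memory is set by the chunk size rather than by the size of the whole dataset.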
Another point is specific to recurrent neural networks (RNNs). When you train an RNN using backpropagation through time (BPTT), the network has to be 'unrolled' in time and is treated as a very deep feedforward network with a copy of the recurrent layer at each time step. This means that performing BPTT over more time steps requires more memory (and more computation time). A solution is to use truncated BPTT, where the gradient is only propagated back over a fixed number of time steps.
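The data-side part of truncated BPTT is just cutting the long sequence into fixed-length windows. This Python sketch shows only that windowing step; the function name and the steps parameter are assumptions, and the training step itself is left as a comment.

```python
def truncated_windows(sequence, steps):
    """Split a long sequence into consecutive windows of at most `steps` items."""
    return [sequence[i:i + steps] for i in range(0, len(sequence), steps)]


for window in truncated_windows(list(range(7)), 3):
    # In training, the network would be unrolled over this window only;
    # the hidden state is carried into the next window, the gradient is not.
    print(window)
```

The memory needed for unrolling then depends on the window length rather than on the full sequence length.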

MATLAB sparse matrix solvers? memory errors

In the context of a finite element problem, I have a 12800x12800 sparse matrix. I'm trying to solve the linear system using MATLAB's \ operator, and I get an out of memory error from mldivide. So I'm just wondering if there's a way to speed this up.
I mean, will something like LU factorization actually help here in terms of not getting the memory error anymore? I increased the heap size to 256 GB in preferences, which is the max I can get it to, and I still get the out of memory error.
Also, just a general question. I have 8GB of RAM on my laptop right now. Will upgrading to 16GB help at all? Or maybe something I can do to allocate more memory to MATLAB? I'm pretty unfamiliar with this stuff.
According to two linked references, you have some options for avoiding out-of-memory problems in MATLAB:
Increase operating system's virtual memory
Give Higher priority to MATLAB process in task manager
Use 64-bit version of MATLAB
A few months ago, I was working on integer programming in MATLAB. I ran into an "out of memory" problem, so I used sparse matrices and followed the tips above, and the problem was finally solved!
Are you locked in to using mldivide? Sounds like the perfect situation for an iterative method - bicg, gmres etc?
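To illustrate why an iterative method is light on memory, here is a minimal conjugate gradient sketch in Python. Note the swap: this is CG, not bicg or gmres, and it assumes the matrix is symmetric positive definite, which is common for finite element stiffness matrices but is not stated in the question. The sparse format (one dict of column-to-value per row) and the function names are also assumptions.

```python
def matvec(rows, x):
    """Multiply a sparse matrix (list of {col: value} dicts) by a vector."""
    return [sum(v * x[j] for j, v in row.items()) for row in rows]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def conjugate_gradient(rows, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A without factorizing A."""
    x = [0.0] * len(b)
    r = list(b)           # residual b - A x for the initial guess x = 0
    p = list(r)           # search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        ap = matvec(rows, p)
        alpha = rs / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

Apart from the matrix itself, the method stores only a handful of length-n vectors, so there is no fill-in from a factorization.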
While backslash takes advantage of the sparsity of A, the qr method it uses produces full matrices that require (number_occupied_elements)^3 memory to be allocated. A few things you can try:
If you're dividing sparse matrices with only a few diagonals, you can try to solve the system with forward/backward substitution.
Try breaking the problem up into smaller problems.
Run whos to see which variables are occupying memory before you start the matrix division. Can any of them be cleared beforehand?
Not applicable to your problem as you've stated it here, but if your system is overdetermined (A has more rows than columns), then using the pseudo-inverse form (A.'*A)\(A.'*b) produces a result using the smaller column dimension.
As for adding memory: 32-bit MATLAB can address at most 2^32 bytes (4 GB), so increasing the physical RAM on your computer won't help unless you're using the 64-bit version.
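For scale, the numbers in this thread can be checked with a few lines of arithmetic. This Python sketch assumes 8-byte doubles and a fully dense copy of the 12800x12800 matrix, which is a worst case rather than what a sparse solver necessarily allocates.

```python
n = 12_800
dense_bytes = n * n * 8          # every entry stored as an 8-byte double
address_space_32bit = 2 ** 32    # bytes addressable by a 32-bit process

print(dense_bytes)               # 1310720000
print(address_space_32bit)       # 4294967296
```

A single dense copy is about 1.3 GB, so a few dense intermediates of that size would already approach the 32-bit limit.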
MATLAB's \ usually tries several methods to solve a problem. First, if it sees that your matrix is symmetric, it tries a Cholesky factorization. After several steps, if it cannot find a suitable method, the current version of MATLAB uses UMFPACK from the SuiteSparse package.
UMFPACK is a specific LU implementation, and it is known for its speed and good memory usage in practice. It also tries to reduce fill-in and keep the matrix as sparse as possible. That is why MATLAB uses this code.
(I am working on UMFPACK for my PhD under supervision of Dr Tim Davis, its creator)
Therefore, using another LU factorization won't help. It is an LU factorization already.
One of the easiest things to try is running your problem on another machine with more memory to see if it works.
I guess MATLAB does some garbage collection and wastes some memory, so using UMFPACK directly might help. You can either call it from C/C++ or use its MATLAB interface. Take a look at the SuiteSparse package.
Based on the structure of your matrix, I think MATLAB tries to use Cholesky; I don't know what MATLAB's strategy is if Cholesky fails because of memory. Keep in mind that Cholesky is easier to manage in terms of memory.
There are other packages that might help you as well. CSparse is a lightweight package that might help. There are other well-known packages that might be helpful; search for SuperLU.