I have a data set that I am trying to learn using SVM. For context, the data has a dimensionality of 35 and contains approximately 30'000 data points.
I have previously trained decision trees in Matlab with this dataset and it took approximately 20 seconds. Not being totally satisfied with the error rate, I decided to try SVM.
I first tried svmtrain(X,Y). After about 5 seconds, I get the following message:
??? Error using ==> svmtrain at 453
Error calculating the kernel function:
Out of memory. Type HELP MEMORY for your options.
When I looked up this error, it was suggested to me that I use the SMO method: svmtrain(X, Y, 'method', 'SMO');. After about a minute, I get this:
??? Error using ==> seqminopt>seqminoptImpl at 236
No convergence achieved within maximum number (15000) of main loop passes
Error in ==> seqminopt at 100
[alphas offset] = seqminoptImpl(data, targetLabels, ...
Error in ==> svmtrain at 437
[alpha bias] = seqminopt(training, groupIndex, ...
I tried using the other methods (LS and QP), but I get the first behaviour again: a 5-second delay and then
??? Error using ==> svmtrain at 453
Error calculating the kernel function:
Out of memory. Type HELP MEMORY for your options.
I'm starting to think that I'm doing something wrong because decision trees were so effortless to use and here I'm getting stuck on what seems like a very simple operation.
Your help is greatly appreciated.
Did you read the remarks near the end of the svmtrain documentation about the algorithm's memory usage?
Try setting the method to SMO and use a kernelcachelimit value that is appropriate to the memory you have available on your machine.
During learning, the algorithm builds a double matrix of size kernelcachelimit-by-kernelcachelimit; the default value is 5000.
Otherwise, subsample your instances and use techniques like cross-validation to measure the performance of the classifier.
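For illustration, a minimal sketch of such a call; the cache size of 10000 is only an example, pick a value that fits your RAM (the cache is a kernelcachelimit-by-kernelcachelimit matrix of doubles, so 10000 corresponds to roughly 0.8 GB):

kcl = 10000;  % example only: 10000^2 doubles is roughly 0.8 GB of cache
svmStruct = svmtrain(X, Y, 'method', 'SMO', 'kernelcachelimit', kcl);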
Here is the relevant section:
Memory Usage and Out of Memory Error
When you set 'Method' to 'QP', the svmtrain function operates on a
data set containing N elements, and it creates an (N+1)-by-(N+1)
matrix to find the separating hyperplane. This matrix needs at least
8*(N+1)^2 bytes of contiguous memory. If this size of contiguous
memory is not available, the software displays an "out of memory"
error message.
When you set 'Method' to 'SMO' (default), memory consumption is
controlled by the kernelcachelimit option. The SMO algorithm stores
only a submatrix of the kernel matrix, limited by the size specified
by the kernelcachelimit option. However, if the number of data points
exceeds the size specified by the kernelcachelimit option, the SMO
algorithm slows down because it has to recalculate the kernel matrix
elements.
When using svmtrain on large data sets, and you run out of memory or
the optimization step is very time consuming, try either of the
following:
Use a smaller number of samples and use cross-validation to test the performance of the classifier.
Set 'Method' to 'SMO', and set the kernelcachelimit option as large as your system permits.
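As a hedged sketch of the first of those two suggestions (assuming X and Y are the 30'000-point variables from the question and the class labels are numeric), you could subsample and cross-validate like this:

idx = randperm(size(X,1), 5000);               % random subset of 5000 points
Xs = X(idx,:);  Ys = Y(idx);
cp = cvpartition(Ys, 'kfold', 5);              % 5-fold cross-validation
err = zeros(cp.NumTestSets, 1);
for k = 1:cp.NumTestSets
    mdl  = svmtrain(Xs(cp.training(k),:), Ys(cp.training(k)), 'method', 'SMO');
    pred = svmclassify(mdl, Xs(cp.test(k),:));
    err(k) = mean(pred ~= Ys(cp.test(k)));     % misclassification rate, fold k
end
mean(err)                                      % estimated error on the subsample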
I turned my bottleneck function into a parallelized MEX function with MATLAB Coder, and so far it works fine.
But at certain points the function crashes with an "Unexpected unknown exception from MEX file" error.
I added statements at several points in the function that display letters such as 'a', so that I can see where in the MEX function the error happens, since I can no longer set breakpoints inside it. Execution did not even reach the first letter, so the error must happen while the function is initializing.
I recognized that the error always happens when the input variable size exceeds a certain limit.
The main input variables, alongside some 1x1 double values, are three nx1 double arrays and one nxn logical matrix. n can vary and changes size during the iterations, so I declared them as :Inf x 1 and :Inf x :Inf in Coder.
The problem appears when n exceeds a value of about 45000. I don't know the exact value, since in one iteration, when n is below it, everything is fine, and in the next, when n is above it, the function crashes.
Analysing it further, I saw that it seems to happen when the nxn matrix exceeds a size of 2^31 bytes.
When I try to translate my code into a MEX function with a matrix of that size as input, I get an error along the lines of "error with intmax()... not supported".
So I guess that Coder writes MEX functions with 32-bit indexing, and when a matrix larger than that is loaded, there is no way to address the right element in the matrix due to the 32-bit limit?
How do I bypass that problem?
Since the matrix is a boolean matrix, I have now found out that MATLAB supports sparse matrices, which would save a lot of memory since only about 1/10^4 of my elements are 1. But I would still like to know whether the problem above results from 32-bit MEX functions and, if so, how to solve it.
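Regarding the sparse idea, here is a quick sanity check in plain MATLAB (the size and density are placeholders matching the numbers in the question) that shows how much memory a sparse logical matrix saves over a full one; note this says nothing about Coder support:

S = sprand(20000, 20000, 1e-4) > 0;   % sparse logical with ~1e-4 density
F = full(S);                          % full logical: 1 byte per element, ~400 MB
whos S F                              % compare the bytes column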
Yes, this issue is caused by the MEX-file being limited to 32-bit indexing. This is a limit in Coder:
For variable-size arrays that use dynamic memory allocation, the maximum number of elements is the smaller of:
intmax('int32').
The largest power of 2 that fits in the C int data type on the target hardware.
These restrictions apply even on a 64-bit platform.
See this documentation page
As far as I can tell, the only workaround is to have Coder generate C or C++ code for you and then manually edit that code to use size_t instead of int for indexing (and, of course, remove any checks on array size). I have never worked with code generated by Coder, so I am not sure how hard this is. I have written many MEX-files in C and C++ by hand, though, and those MEX-files work perfectly fine with 64-bit indexing and arrays larger than 2 GB.
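If you do go down that route, a rough, untested sketch of asking Coder for standalone C code (instead of a MEX-file) might look like the following; myBottleneck and its argument list are placeholders for your actual function and inputs:

cfg = coder.config('lib');                    % standalone C library instead of a MEX target
vec = coder.typeof(0, [Inf 1], [1 0]);        % :Inf x 1 double
mat = coder.typeof(false, [Inf Inf], [1 1]);  % :Inf x :Inf logical
codegen -config cfg myBottleneck -args {vec, vec, vec, mat, 0}
% then edit the generated .c/.h files to use size_t for the index variables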
My code is:
function eigs_mem_test
    N = 20000;
    density = 0.2;
    numOfModes = 250;
    A = sprand(N, N, density);               % random sparse N-by-N matrix
    profile -memory on                       % profile with memory statistics
    eigs(A, numOfModes, 0.0)                 % numOfModes eigenvalues closest to 0
    profile off
    profsave(profile('info'), 'eigs_test')
    profview
end
And this returns a profiler report saying that MATLAB allocated 18014398508117708.00 Kb, or roughly 1.8e10 Gb -- completely impossible. How did this happen? The code finishes with correct output, and in htop I can see the memory usage vary quite a bit, but it stays under 16G.
For N = 2000, I get sensible results (i.e. 0.2G allocated.)
How can I profile this case effectively, if I want to obtain an upper bound on memory used for large sparse matrices?
I use MATLAB R2017a.
I cannot reproduce your issue in R2017b, with 128GB of RAM on my machine. After running your example code, the profiler reports that the function peaked at 14726148Kb, which taken literally (kilobits) would be ~1.8GB. I'm more confused by the units MATLAB has used here: I saw nearer 14GB of usage in the task manager, which matches your large observed usage (1.4e7 KB is about 14 GB), so I can only think the profiler is meant to state KB (kilobytes) instead of Kb (kilobits).
Ridiculously large, unexpected values like this are often the result of overflow, so this could be an internal overflow bug.
You could use whos to get the size in memory of a variable:
w = whos('A');        % get details of variable A
bytesUsed = w.bytes;  % memory occupied by A, in bytes
This doesn't necessarily tell you how much memory a function like eigs in your example uses, though. You could poll memory within your function to get the current usage.
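For example, a minimal sketch of polling from inside your own code (note that the memory function is only available on Windows):

m = memory;                                            % struct with current usage figures
fprintf('MATLAB is using %.2f GB\n', m.MemUsedMATLAB / 2^30);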
I'll resist exploring this further, since the question of how to profile for memory usage has already been asked and answered.
N.B. I'm not sure why my machine was ~100x slower than yours; I assume the image of your memory usage didn't come from actually running your example code? Or my RAM is awful...
I am trying to train a feedforward neural network for binary classification. My dataset has 6.2M data points with 1.5M dimensions. I am using PyBrain. I am unable to load even a single data point; I am getting a MemoryError.
My code snippet is:
from pybrain.datasets import SupervisedDataSet
import numpy

Train_ds = SupervisedDataSet(FV_length, 1)  # FV_length is a computed value: 150000
feature_vector = numpy.zeros((FV_length), dtype=numpy.int)

# activate feature values
for index in nonzero_index_list:
    feature_vector[index] = 1

Train_ds.addSample(feature_vector, class_label)  # both the arguments are tuples
It looks like your computer simply does not have enough memory to add your feature and class-label arrays to the supervised data set Train_ds.
If there is no way for you to add more memory to your system, it might be a good idea to randomly sample from your data set and train on the smaller sample.
This should still give accurate results assuming the sample is large enough to be representative.
Hi, I am trying to cluster using linkage(). Here is the code I am trying:
Y = pdist(data);
Z = linkage(Y);
T = cluster(Z,'maxclust',4096);
I am getting the following error:
The number of elements exceeds the maximum allowed size in
MATLAB.
Error in ==> linkage at 135
Z = linkagemex(Y,method);
The data size is 56710*128. How can I apply the code on small chunks of data and then merge those clusters optimally? Or is there any other solution to the problem?
Matlab probably cannot cluster this many objects with this algorithm.
Most likely they use distance matrices in their implementation. A pairwise distance matrix for 56710 objects needs 56710*56709/2 = 1,607,983,695 entries, or some 12 GB of RAM; most likely a working copy of this is needed as well. Chances are that the default Matlab data structures are not prepared to handle this amount of data (and you would not want to wait for the algorithm to finish either; that is probably why they "allow" only a certain amount).
Try using a subset, and see how well it scales. If you use 1000 instances, does it work? How long does the computation take? If you increase to 2000, how much longer does it take?
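A hedged sketch of such a scaling test, assuming data is your 56710-by-128 matrix:

for n = [1000 2000 4000]
    idx = randperm(size(data,1), n);           % random subset of n rows
    tic
    Z = linkage(pdist(data(idx,:)));
    fprintf('n = %5d: %.1f seconds\n', n, toc);
end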
I want to use MATLAB linprog to solve a problem, and I check it by a much smaller, much simpler example.
But I wonder whether MATLAB can support my real problem: there may be a 300*300*300*300 matrix...
Maybe I should give the exact problem. There is a directed graph of network nodes, and I want to get the lowest utilization of the edge capacity under some constraints. Let m be the number of edges, and n be the number of nodes. There are mn² variables and nm² constraints. Unfortunately, n may reach 300...
I want to use MATLAB linprog to solve it. As described above, I am afraid MATLAB cannot support it... Lastly, the matrix must be sparse; can it be simplified in some way?
First: a 300*300*300*300 array is not called a matrix, but a tensor (or simply an array). Therefore you cannot use matrix/vector algebra on it, because that is not defined for arrays with dimensionality greater than 2, and you certainly cannot use it in linprog without some kind of interpretation step.
Second: if I interpret that 300⁴ as the number of elements in the matrix (and not its size), it really depends on whether MATLAB (or any other software) can support that.
As ben already answered, if your matrix is full, then the answer is likely to be no: 300^4 doubles would consume almost 65 GB of memory, so it is quite unlikely that any software package is going to be capable of handling all of that in memory (unless you actually have more than 65 GB of RAM). You could use a blockproc-type scheme, where you only load parts of the matrix into memory and leave the rest on the hard disk, but that is insanely slow. Moreover, if you have matrices that huge, it is entirely possible you are overlooking some way in which your problem can be simplified.
If your matrix is sparse (i.e., contains lots of zeros), then maybe. Have a look at MATLAB's sparse command.
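For example, a minimal sketch (with placeholder indices and values) of building a large constraint matrix directly in sparse form, so that only the nonzeros are ever stored:

m = 1e5;  n = 1e5;                               % placeholder dimensions
ii = [1 2 3];  jj = [10 20 30];  vv = [1 -1 2];  % placeholder (row, column, value) triplets
A = sparse(ii, jj, vv, m, n);                    % m-by-n sparse matrix with 3 nonzeros
whos A                                           % a few hundred bytes instead of ~80 GB full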
So, what exactly is your problem? Where does that enormous matrix come from? Perhaps I or someone else sees a way in which to reduce that matrix to something more manageable.
On my system, with 24 GB of RAM installed, running Matlab R2013a, memory gives me:
Maximum possible array: 44031 MB (4.617e+10 bytes) *
Memory available for all arrays: 44031 MB (4.617e+10 bytes) *
Memory used by MATLAB: 1029 MB (1.079e+09 bytes)
Physical Memory (RAM): 24574 MB (2.577e+10 bytes)
* Limited by System Memory (physical + swap file) available.
On a 64-bit version of Matlab, if you have enough RAM, it should be possible to at least create a full matrix as big as the one you suggest, but whether linprog can do anything useful with it in a realistic time is another question entirely.
As well as investigating the use of sparse matrices, you might consider working in single precision: that halves your memory usage for a start.
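A quick illustration of the single-precision point (the size here is arbitrary):

d = zeros(1000, 'double');   % 1000-by-1000, 8 bytes per element
s = zeros(1000, 'single');   % 1000-by-1000, 4 bytes per element
whos d s                     % 8,000,000 vs 4,000,000 bytes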
Well, you could simply try X = zeros(300*300*300*300).
On my system it gives a very clear statement:
>> X=zeros( 300*300*300*300 )
Error using zeros
Maximum variable size allowed by the program is exceeded.
Since zeros is a built-in function that simply fills an array of the given size with zeros, you can assume that handling such an array will not be possible.
You can also use the memory command:
>> memory
Maximum possible array: 21549 MB (2.260e+10 bytes) *
Memory available for all arrays: 21549 MB (2.260e+10 bytes) *
Memory used by MATLAB: 685 MB (7.180e+08 bytes)
Physical Memory (RAM): 12279 MB (1.288e+10 bytes)
* Limited by System Memory (physical + swap file) available.
>> 2.278e+10 /8
%max bytes avail for arrays divided by 8 bytes for double-precision real values
ans =
2.8475e+09
>> 300*300*300*300
ans =
8.1000e+09
This means I do not even have the memory to store such an array.
While this may not answer your question directly, it might still give you some insight.