Clustering in Matlab - matlab

Hi I am trying to cluster using linkage(). Here is the code I am trying..
Y = pdist(data);
Z = linkage(Y);
T = cluster(Z,'maxclust',4096);
I am getting error as follows
The number of elements exceeds the maximum allowed size in
MATLAB.
Error in ==> linkage at 135
Z = linkagemex(Y,method);
data size is 56710*128. How can I apply the code on small chunks of data and then merge those clusters optimally?? Or any other solution to the problem.

Matlab probably cannot cluster this many objects with this algorithm.
Most likely they use distance matrixes in their implementation. A pairwise distance matrix for 56710 objects needs 56710*56709/2=1,607,983,695 entries, or some 12 GB of RAM; most likely also a working copy of this is needed. Chances are that the default Matlab data structures are not prepared to handle this amount of data (and you won't want to wait for the algorithm to finish either; probably that is why they "allow" only a certain amount).
Try using a subset, and see how well it scales. If you use 1000 instances, does it work? How long does the computation take? If you increase to 2000, how much longer does it take?

Related

scikit-learn: Hierarchal Agglomerative Clustering performance with increasing dataset

scikit-learn==0.21.2
Hierarchal Agglomerative Clustering algorithm response time is increasing exponentially when increasing the dataset.
My Data set is textual. Each Document is 7-10 words long.
Using the following code to perform the Clustering.
hac_model = AgglomerativeClustering(affinity=consine,
linkage=complete,
compute_full_tree=True,
connectivity=None, memory=None,
n_clusters=None,
distance_threshold=0.7)
cluster_matrix = hac_model.fit_predict(matrix)
where the matrix of size are:
5000x1500 taking 17 seconds
10000*2000 taking 113 seconds
13000*2418 taking 228 seconds
I can't control 5000, 10000, 15000 as that is the size of input, or the feature set size(i.e 1500,2000,2418) since I am using BOW model(TFIDF).
I end up using all the unique words(after removing stopwords) as my feature list. this list grows as the input size increases.
So two questions.
How do I avoid increase in feature set size irrespective of increase in the size of input data set
Is there a way I can improve on the performance of the Algorithm without compromising on the quality?
Standard AGNES hierarchical clustering is O(n³+n²d) in complexity. So the number of instances is much more a problem than the number of features.
There are approaches that typically run in O(n²d), although the worst case remains the same, so they will be much faster than this. With these you'll usually run into memory limits first... Unfortunately, this isn't implemented in sklearn for all I know, so you'll have to use other clustering tools - or write the algorithm yourself.

Predicting runtime of parallel loop using a-priori estimate of effort per iterand (for given number of workers)

I am working on a MATLAB implementation of an adaptive Matrix-Vector Multiplication for very large sparse matrices coming from a particular discretisation of a PDE (with known sparsity structure).
After a lot of pre-processing, I end up with a number of different blocks (greater than, say, 200), for which I want to calculate selected entries.
One of the pre-processing steps is to determine the (number of) entries per block I want to calculate, which gives me an almost perfect measure of the amount of time each block will take (for all intents and purposes the quadrature effort is the same for each entry).
Thanks to https://stackoverflow.com/a/9938666/2965879, I was able to make use of this by ordering the blocks in reverse order, thus goading MATLAB into starting with the biggest ones first.
However, the number of entries differs so wildly from block to block, that directly running parfor is limited severely by the blocks with the largest number of entries, even if they are fed into the loop in reverse.
My solution is to do the biggest blocks serially (but parallelised on the level of entries!), which is fine as long as the overhead per iterand doesn't matter too much, resp. the blocks don't get too small. The rest of the blocks I then do with parfor. Ideally, I'd let MATLAB decide how to handle this, but since a nested parfor-loop loses its parallelism, this doesn't work. Also, packaging both loops into one is (nigh) impossible.
My question now is about how to best determine this cut-off between the serial and the parallel regime, taking into account the information I have on the number of entries (the shape of the curve of ordered entries may differ for different problems), as well as the number of workers I have available.
So far, I had been working with the 12 workers available under a the standard PCT license, but since I've now started working on a cluster, determining this cut-off becomes more and more crucial (since for many cores the overhead of the serial loop becomes more and more costly in comparison to the parallel loop, but similarly, having blocks which hold up the rest are even more costly).
For 12 cores (resp. the configuration of the compute server I was working with), I had figured out a reasonable parameter of 100 entries per worker as a cut off, but this doesn't work well when the number of cores isn't small anymore in relation to the number of blocks (e.g 64 vs 200).
I have tried to deflate the number of cores with different powers (e.g. 1/2, 3/4), but this also doesn't work consistently. Next I tried to group the blocks into batches and determine the cut-off when entries are larger than the mean per batch, resp. the number of batches they are away from the end:
logical_sml = true(1,num_core); i = 0;
while all(logical_sml)
i = i+1;
m = mean(num_entr_asc(1:min(i*num_core,end))); % "asc" ~ ascending order
logical_sml = num_entr_asc(i*num_core+(1:num_core)) < i^(3/4)*m;
% if the small blocks were parallelised perfectly, i.e. all
% cores take the same time, the time would be proportional to
% i*m. To try to discount the different sizes (and imperfect
% parallelisation), we only scale with a power of i less than
% one to not end up with a few blocks which hold up the rest
end
num_block_big = num_block - (i+1)*num_core + sum(~logical_sml);
(Note: This code doesn't work for vectors num_entr_asc whose length is not a multiple of num_core, but I decided to omit the min(...,end) constructions for legibility.)
I have also omitted the < max(...,...) for combining both conditions (i.e. together with minimum entries per worker), which is necessary so that the cut-off isn't found too early. I thought a little about somehow using the variance as well, but so far all attempts have been unsatisfactory.
I would be very grateful if someone has a good idea for how to solve this.
I came up with a somewhat satisfactory solution, so in case anyone's interested I thought I'd share it. I would still appreciate comments on how to improve/fine-tune the approach.
Basically, I decided that the only sensible way is to build a (very) rudimentary model of the scheduler for the parallel loop:
function c=est_cost_para(cost_blocks,cost_it,num_cores)
% Estimate cost of parallel computation
% Inputs:
% cost_blocks: Estimate of cost per block in arbitrary units. For
% consistency with the other code this must be in the reverse order
% that the scheduler is fed, i.e. cost should be ascending!
% cost_it: Base cost of iteration (regardless of number of entries)
% in the same units as cost_blocks.
% num_cores: Number of cores
%
% Output:
% c: Estimated cost of parallel computation
num_blocks=numel(cost_blocks);
c=zeros(num_cores,1);
i=min(num_blocks,num_cores);
c(1:i)=cost_blocks(end-i+1:end)+cost_it;
while i<num_blocks
i=i+1;
[~,i_min]=min(c); % which core finished first; is fed with next block
c(i_min)=c(i_min)+cost_blocks(end-i+1)+cost_it;
end
c=max(c);
end
The parameter cost_it for an empty iteration is a crude blend of many different side effects, which could conceivably be separated: The cost of an empty iteration in a for/parfor-loop (could also be different per block), as well as the start-up time resp. transmission of data of the parfor-loop (and probably more). My main reason to throw everything together is that I don't want to have to estimate/determine the more granular costs.
I use the above routine to determine the cut-off in the following way:
% function i=cutoff_ser_para(cost_blocks,cost_it,num_cores)
% Determine cut-off between serial an parallel regime
% Inputs:
% cost_blocks: Estimate of cost per block in arbitrary units. For
% consistency with the other code this must be in the reverse order
% that the scheduler is fed, i.e. cost should be ascending!
% cost_it: Base cost of iteration (regardless of number of entries)
% in the same units as cost_blocks.
% num_cores: Number of cores
%
% Output:
% i: Number of blocks to be calculated serially
num_blocks=numel(cost_blocks);
cost=zeros(num_blocks+1,2);
for i=0:num_blocks
cost(i+1,1)=sum(cost_blocks(end-i+1:end))/num_cores + i*cost_it;
cost(i+1,2)=est_cost_para(cost_blocks(1:end-i),cost_it,num_cores);
end
[~,i]=min(sum(cost,2));
i=i-1;
end
In particular, I don't inflate/change the value of est_cost_para which assumes (aside from cost_it) the most optimistic scheduling possible. I leave it as is mainly because I don't know what would work best. To be conservative (i.e. avoid feeding too large blocks to the parallel loop), one could of course add some percentage as a buffer or even use a power > 1 to inflate the parallel cost.
Note also that est_cost_para is called with successively less blocks (although I use the variable name cost_blocks for both routines, one is a subset of the other).
Compared to the approach in my wordy question I see two main advantages:
The relatively intricate dependence between the data (both the number of blocks as well as their cost) and the number of cores is captured much better with the simulated scheduler than would be possible with a single formula.
By calculating the cost for all possible combinations of serial/parallel distribution and then taking the minimum, one cannot get "stuck" too early while reading in the data from one side (e.g. by a jump which is large relative to the data so far, but small in comparison to the total).
Of course, the asymptotic complexity is higher by calling est_cost_para with its while-loop all the time, but in my case (num_blocks<500) this is absolutely negligible.
Finally, if a decent value of cost_it does not readily present itself, one can try to calculate it by measuring the actual execution time of each block, as well as the purely parallel part of it, and then trying to fit the resulting data to the cost prediction and get an updated value of cost_it for the next call of the routine (by using the difference between total cost and parallel cost or by inserting a cost of zero into the fitted formula). This should hopefully "converge" to the most useful value of cost_it for the problem in question.

Fast Algorithms for Finding Pairwise Euclidean Distance (Distance Matrix)

I know matlab has a built in pdist function that will calculate pairwise distances. However, my matrix is so large that its 60000 by 300 and matlab runs out of memory.
This question is a follow up on Matlab euclidean pairwise square distance function.
Is there any workaround for this computational inefficiency. I tried manually coding the pairwise distance calculations and it usually takes a full day to run (sometimes 6 to 7 hours).
Any help is greatly appreciated!
Well, I couldn't resist playing around. I created a Matlab mex C file called pdistc that implements pairwise Euclidean distance for single and double precision. On my machine using Matlab R2012b and R2015a it's 20–25% faster than pdist(and the underlying pdistmex helper function) for large inputs (e.g., 60,000-by-300).
As has been pointed out, this problem is fundamentally bounded by memory and you're asking for a lot of it. My mex C code uses minimal memory beyond that needed for the output. In comparing its memory usage to that of pdist, it looks like the two are virtually the same. In other words, pdist is not using lots of extra memory. Your memory problem is likely in the memory used up before calling pdist (can you use clear to remove any large arrays?) or simply because you're trying to solve a big computational problem on tiny hardware.
So, my pdistc function likely won't be able to save you memory overall, but you may be able to use another feature I built in. You can calculate chunks of your overall pairwise distance vector. Something like this:
m = 6e3;
n = 3e2;
X = rand(m,n);
sz = m*(m-1)/2;
for i = 1:m:sz-m
D = pdistc(X', i, i+m); % mex C function, X is transposed relative to pdist
... % Process chunk of pairwise distances
end
This is considerably slower (10 times or so) and this part of my C code is not optimized well, but it will allow much less memory use – assuming that you don't need the entire array at one time. Note that you could do the same thing much more efficiently with pdist (or pdistc) by creating a loop where you passed in subsets of X directly, rather than all of it.
If you have a 64-bit Intel Mac, you won't need to compile as I've included the .mexmaci64 binary, but otherwise you'll need to figure out how to compile the code for your machine. I can't help you with that. It's possible that you may not be able to get it to compile or that there will be compatibility issues that you'll need to solve by editing the code yourself. It's also possible that there are bugs and the code will crash Matlab. Also, note that you may get slightly different outputs relative to pdist with differences between the two in the range of machine epsilon (eps). pdist may or may not do fancy things to avoid overflows for large inputs and other numeric issues, but be aware that my code does not.
Additionally, I created a simple pure Matlab implementation. It is massively slower than the mex code, but still faster than a naïve implementation or the code found in pdist.
All of the files can be found here. The ZIP archive includes all of the files. It's BSD licensed. Feel free to optimize (I tried BLAS calls and OpenMP in the C code to no avail – maybe some pointer magic or GPU/OpenCL could further speed it up). I hope that it can be helpful to you or someone else.
On my system the following is the fastest (Even faster than the C code pdistc by #horchler):
function [ mD ] = CalcDistMtx ( mX )
vSsqX = sum(mX .^ 2);
mD = sqrt(bsxfun(#plus, vSsqX.', vSsqX) - (2 * (mX.' * mX)));
end
You'll need a very well tuned C code to beat this, I think.
Update
Since MATLAB R2016b MATLAB supports implicit broadcasting without the use of bsxfun().
Hence the code can be written:
function [ mD ] = CalcDistMtx ( mX )
vSsqX = sum(mX .^ 2, 1);
mD = sqrt(vSsqX.'+ vSsqX - (2 * (mX.' * mX)));
end
A generalization is given in my Calculate Distance Matrix project.
P. S.
Using MATLAB's pdist for comparison: squareform(pdist(mX.')) is equivalent to CalcDistMtx(mX).
Namely the input should be transposed.
Computers are not infinitely large, or infinitely fast. People think that they have a lot of memory, a fast CPU, so they just create larger and larger problems, and then eventually wonder why their problem runs slowly. The fact is, this is NOT computational inefficiency. It is JUST an overloaded CPU.
As Oli points out in a comment, there are something like 2e9 values to compute, even assuming you only compute the upper or lower half of the distance matrix. (6e4^2/2 is approximately 2e9.) This will require roughly 16 gigabytes of RAM to store, assuming that only ONE copy of the array is created in memory. If your code is sloppy, you might easily double or triple that. As soon as you go into virtual memory, things get much slower.
Wanting a big problem to run fast is not enough. To really help you, we need to know how much RAM is available. Is this a virtual memory issue? Are you using 64 bit MATLAB, on a CPU that can handle all the needed RAM?

Correlating large matrix

I have a matrix with 100000 columns (variables) and 100 rows (observation).
I need to correlate (pearson) all with all.
I use corrcoef as I found it much faster comparing to corr.
When I take a matrix of 25000 columns the operation takes 15 seconds. However when I increase the size to 50000 after several minutes my matlab RAM increases to 16Gb and matlab (including windows) begins to freeze. Any suggestions? Any patent for splitting? Calculating column by columns turns as extremely inefficient...
Thanks for help,
Vadim
Brute force calculation of such a large array is impossible without a 64 bit version of matlab plus enough memory to store that large array, or storing the array in some other way. You can store the array offline, only bringing in what you need as you use it.
Additionally, if these numbers will always be small integers, then use uint8 or int8, or a logical array, even a single array, all of which will reduce the memory requirements compared to double arrays. Even better if the array is sparse, then use sparse array operations.
An alternative is to use the Parallel Computing Toolbox (and the MATLAB Distributed Computing Server) to harness the memory of several machines simultaneously. This would allow you to write:
matlabpool open <a large number>
x = distributed.zeros( 100000, 100 );
See also this thread for dealing with big matrices...

Mahout K-means has different behavior based on the number of mapping tasks

I experience a strange situation when running Mahout K-means:
Using the a pre-selected set of initial centroids, I run K-means on a SequenceFile generated by lucene.vector. The run is for testing purposes, so the file is small (around 10MB~10000 vectors).
When K-means is executed with a single mapper (the default considering the Hadoop split size which in my cluster is 128MB), it reaches a given clustering result in 2 iterations (Case A).
However, I wanted to test if there would be any improvement/deterioration in the algorithm's execution speed by firing more mapping tasks (the Hadoop cluster has in total 6 nodes).
I therefore set the -Dmapred.max.split.size parameter to 5242880 bytes, in order to make mahout fire 2 mapping tasks (Case B).
I indeed succeeded in starting two mappers, but the strange thing was that the job finished after 5 iterations instead of 2, and that even at the first assignment of points to clusters, the mappers made different choices compared to the single-map execution . What I mean is that after close inspection of the clusterDump for the first iteration for both two cases, I found that in case B some points were not assigned to their closest cluster.
Could this behavior be justified by the existing K-means Mahout implementation?
From a quick look at the sources, I see two problems with the Mahout k-means implementation.
First of all, the way the S0, S1, S2 statistics are kept is probably not numerically stable for large data sets. Oh, and since k-means actually does not even use S2, it is also unnecessary slow. I bet a good implementation can beat this version of k-means by a factor of 2-5 at least.
For small data sets split onto multiple machines, there seems to be an error in the way they compute their means. Ouch. This will amplify if the reducer is applied to more than one input, in particular when the partitions are small. To be more verbose, the cluster mean apparently is initialized with the previous mean instead of the 0 vector. Now if you if you reduce 't' copies of it, the resulting vector will be off by 't' times the previous mean.
Initialization of AbstractCluster:
setS1(center.like());
Update of the mean:
getS1().assign(x, Functions.PLUS);
Merge of multiple copies of a cluster:
setS1(getS1().plus(cl.getS1()));
Finalization to new center:
setCenter(getS1().divide(getS0()));
So with this approach, the center will be offset from the proper value by the previous center times t / n where t is the number of splits, and n the number of objects.
To fix the numerical instability (which arises whenever the data set is not centered on the 0 vector), I recommend replacing the S1 statistic by the true mean, not S0*mean. Both S1 and S2 can be incrementally updated at little cost using the incremental mean formula which AFAICT was used in the original "k-means" publication by MacQueen (which actually is an online kmeans, while this is Lloyd style batch iterations). Well, for an incremental k-means you obviously need the updatable mean vector anyway... I believe the formula was also discussed by Knuth in his essential books. I'm surprised that Mahout does not seem to use it. It's fairly cheap (just a few CPU instructions more, no additional data, so it all happens in the CPU cache line) and gives you extra precision when you are dealing with large data sets.