How to use Mahout Streaming K-Means - cluster-analysis

I have seen that Mahout has a new K-Means implementation called Streaming K-Means, which performs k-means clustering without chained Mapper-Reducer cycles:
https://github.com/dfilimon/mahout/tree/epigrams
I cannot find any articles on its usage. Could anyone point me to useful links, ideally with code examples showing how to use it?

StreamingKMeans is a new feature in Mahout 0.8.
For more details on its algorithms, see:
"Streaming k-means approximation" by N. Ailon, R. Jaiswal, C. Monteleoni
http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
"Fast and Accurate k-means for Large Datasets" by M. Shindler, A. Wong, A. Meyerson
http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf
As you mentioned, there is no article on its usage. As with the other clustering algorithms, there is a driver to which you can pass configuration parameters as a string array, and it will cluster your data:
String[] args1 = new String[] {
    "-i", "/home/name/workspace/XXXXX-vectors/tfidf-vectors",
    "-o", "/home/name/workspace/XXXXX-vectors/tfidf-vectors/SKM-Main-result/",
    "--estimatedNumMapClusters", "200",
    "--searchSize", "2",
    "-k", "12",
    "--numBallKMeansRuns", "3",
    "--distanceMeasure", "org.apache.mahout.common.distance.CosineDistanceMeasure"
};
StreamingKMeansDriver.main(args1);
To get a description of the important parameters, deliberately pass an invalid option such as "-iiii" as the first parameter. The driver will then print the parameters, their descriptions, and their default values.
If you don't want to use it this way, read StreamingKMeansMapper, StreamingKMeansReducer and StreamingKMeansThread. The code of these three classes will help you understand how the algorithm is used and customize it for your needs.
The Mapper uses StreamingKMeans to produce an estimated set of clusters from the input data. To get the k final clusters, the Reducer takes the intermediate points (the centroids generated in the previous step) and uses BallKMeans to cluster them into k clusters.

Here are the steps for running Streaming k-means:
Generate Sparse vectors via seq2sparse.
mahout streamingkmeans -i "" -o ""
--tempDir "" -ow
-sc org.apache.mahout.math.neighborhood.FastProjectionSearch
-k -km
-k = number of clusters
-km = k * log(n), where k is the number of clusters and n is the number of data points to cluster; round this to the nearest integer
For the -sc parameter you can use FastProjectionSearch, ProjectionSearch, or LocalitySensitiveHashSearch.
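The -km rule of thumb above can be computed directly. This is a minimal Python sketch, assuming the logarithm is the natural log (the answer only says "log(n)" and does not name the base); the helper name estimate_km is hypothetical and not part of Mahout.

```python
import math


def estimate_km(k: int, n: int) -> int:
    """Return k * ln(n), rounded to the nearest integer.

    Assumption: natural logarithm; the source only says "log(n)".
    """
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")
    return round(k * math.log(n))


# Example: 12 final clusters over 100,000 data points
km = estimate_km(12, 100_000)
print(km)
```

The resulting integer is what would be passed as the -km value on the command line.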

Problem-Based formulation in MatLab: Sets and Subsets

I am a GAMS user who has to move to MATLAB due to company policy.
I have written a model in GAMS that I am now rewriting in MATLAB, using the problem-based approach.
My question is about sets and subsets.
For example, in GAMS:
sets
NodeIndex Nodes of the system /1*3/
GenIndex(NodeIndex) Generator Index /1/
NoGenIndex(NodeIndex) Nodes with no generation
NoGenIndex(NodeIndex) = not GenIndex(NodeIndex)
As seen, GenIndex(NodeIndex) and NoGenIndex(NodeIndex) are subsets of NodeIndex.
Examples of optimization variables:
PG(NodeIndex) Generated active power
Theta0(GenIndex)
Then, when I bound the problem, I can say that certain sets should have zero generation:
PG.fx(NoGenIndex) = 0;
However, reading the MATLAB documentation for the problem-based approach, I can't find anything similar. Is it possible to define subsets in the MATLAB problem-based formulation?
Cheers!
Yes, you can index into OptimizationVariables or OptimizationExpressions using either numeric index vectors or strings. See for example:
https://www.mathworks.com/help/optim/ug/optimvar.html#mw_9da91e17-8359-4deb-9b42-b08b64a3646b
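The subset logic from the question can be pictured as plain index vectors. The sketch below uses Python only to illustrate the idea; it is not MATLAB code and does not use the Optimization Toolbox. The bound dictionaries pg_lower and pg_upper are hypothetical stand-ins for a variable's bounds. In MATLAB the complement could presumably be built with setdiff and then used as a numeric index vector, as the answer describes.

```python
# Hypothetical three-node system mirroring the GAMS sets in the question
node_index = [1, 2, 3]
gen_index = [1]

# Complement of gen_index within node_index (GAMS: not GenIndex)
no_gen_index = [n for n in node_index if n not in gen_index]

# Per-node bounds; fixing a variable means lower bound == upper bound
inf = float("inf")
pg_lower = {n: -inf for n in node_index}
pg_upper = {n: inf for n in node_index}
for n in no_gen_index:
    pg_lower[n] = 0.0  # GAMS: PG.fx(NoGenIndex) = 0
    pg_upper[n] = 0.0

print(no_gen_index)
```

Only the nodes in the complement end up with both bounds at zero; the generator node keeps its original bounds.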

Reporting log-likelihood / perplexity of spark LDA model (different in local vs distributed models?)

Given a training corpus docsWithFeatures, I've trained an LDA model in Spark (via Scala API) like so:
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel, LocalLDAModel}
val n_topics = 10;
val lda = new LDA().setK(n_topics).setMaxIterations(20)
val ldaModel = lda.run(docsWithFeatures)
val distLDAModel = ldaModel.asInstanceOf[DistributedLDAModel]
And now I want to report the log-likelihood and perplexity of the model.
I can get the log-likelihood like so:
scala> distLDAModel.logLikelihood
res11: Double = -2600097.2875547716
But this is where things get weird. I also wanted the perplexity, which is only implemented for a local model, so I run:
val localModel = distLDAModel.toLocal
Which lets me get the (log) perplexity like so:
scala> localModel.logPerplexity(docsWithFeatures)
res14: Double = 0.36729132682898674
But the local model also supports the log-likelihood calculation, which I run like this:
scala> localModel.logLikelihood(docsWithFeatures)
res15: Double = -3672913.268234148
So what's going on here? Shouldn't the two log-likelihood values be the same? The documentation for a distributed model says
"logLikelihood: log likelihood of the training corpus, given the inferred topics and document-topic distributions"
while for a local model it says:
"logLikelihood(documents): Calculates a lower bound on the provided documents given the inferred topics."
I guess these are different, but it's not clear to me how or why. Which one should I use? That is, which one is the "true" likelihood of the model, given the training documents?
To summarize, two main questions:
1 - How and why are the two log-likelihood values different, and which should I use?
2 - When reporting perplexity, am I correct in thinking that I should use the exponential of the logPerplexity result? (But why does the model give log perplexity instead of just plain perplexity? Am I missing something?)
1) These two log-likelihood values differ because they are computing the log-likelihood for two different models. DistributedLDAModel is effectively computing the log-likelihood w.r.t. a model where the parameters for the topics and the mixing weights for each of the documents are constants (as I mentioned in another post, the DistributedLDAModel is essentially regularized PLSI, though you need to use logPrior to also account for the regularization), while the LocalLDAModel takes the view that the topic parameters as well as the mixing weights for each document are random variables. So in the case of LocalLDAModel you have to integrate (marginalize) out the topic parameters and document mixing weights in order to compute the log-likelihood (and this is what makes the variational approximation/lower bound necessary, though even without the approximation the log-likelihoods would not be the same since the models are just different.)
As for which one you should use, my suggestion (without knowing what you ultimately want to do) would be to go with the log-likelihood method attached to the class you originally trained (i.e. the DistributedLDAModel). As a side note, the primary (only?) reason that I can see to convert a DistributedLDAModel into a LocalLDAModel via toLocal is to enable the computation of topic mixing weights for a new (out-of-training) set of documents (for more info on this see my post on this thread: Spark MLlib LDA, how to infer the topics distribution of a new unseen document?), an operation which is not (but could be) supported in DistributedLDAModel.
2) log-perplexity is just the negative log-likelihood divided by the number of tokens in your corpus. If you divide the log-perplexity by math.log(2.0) then the resulting value can also be interpreted as the approximate number of bits per token needed to encode your corpus (as a bag of words) given the model.
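The relationships in point 2 can be written out as a short calculation. This is a sketch in Python rather than Scala, assuming the log-perplexity is in natural-log units (which the division by math.log(2.0) implies); the function names are hypothetical, and the only real number used is the logPerplexity value quoted in the question.

```python
import math


def log_perplexity(log_likelihood: float, token_count: float) -> float:
    """Negative log-likelihood divided by the number of tokens."""
    return -log_likelihood / token_count


def perplexity(log_perp: float) -> float:
    """Plain perplexity, assuming log_perp is a natural logarithm."""
    return math.exp(log_perp)


def bits_per_token(log_perp: float) -> float:
    """Convert natural-log perplexity to bits per token."""
    return log_perp / math.log(2.0)


# Value reported by localModel.logPerplexity in the question
reported = 0.36729132682898674
print(perplexity(reported), bits_per_token(reported))

# Made-up numbers, only to show the definition
print(log_perplexity(-200.0, 100))
```

Under that assumption, taking the exponential of the logPerplexity result gives the plain perplexity the question asks about.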

Cannot get clustering output Mahout

I am running kmeans in Mahout and as an output I get folders clusters-x, clusters-x-final and clusteredPoints.
If I understood correctly, clusters-x are the centroid locations at each iteration, clusters-x-final are the final centroid locations, and clusteredPoints should be the clustered points with their cluster id and a weight representing the probability of belonging to the cluster (depending on the distance between the point and its centroid). In addition, clusters-x and clusters-x-final contain the cluster centroids, the number of elements, the feature values of each centroid and the radius of the cluster (the distance between the centroid and its farthest point).
How do I examine these outputs?
I used the cluster dumper successfully on clusters-x and clusters-x-final from the terminal, but when I used it on clusteredPoints, I got an empty file. What seems to be the problem?
And how can I get these values from code? I mean the centroid values and the points belonging to each cluster.
For clusteredPoints I used IntWritable as the key and WeightedPropertyVectorWritable as the value in a while loop, but it skips the loop as if there were no elements in clusteredPoints.
This is even stranger because the file that I get with clusterDumper is empty as well.
What could be the problem?
Any help would be greatly appreciated!
I believe your interpretation of the data is correct (I've only been working with Mahout for ~3 weeks, so someone more seasoned should probably weigh in on this).
As far as linking points back to the input that created them I've used NamedVector, where the name is the key for the vector. When you read one of the generated points files (clusteredPoints) you can convert each row (point vector) back into a NamedVector and retrieve the name using .getName().
Update in response to comment
When you initially read your data into Mahout, you convert it into a collection of vectors, which you then write to a file (points) for use in the clustering algorithms later. Mahout gives you several Vector types which you can use, but it also gives you access to a Vector wrapper class called NamedVector, which allows you to identify each vector.
For example, you could create each NamedVector as follows:
NamedVector nVec = new NamedVector(
new SequentialAccessSparseVector(vectorDimensions),
vectorName
);
Then you write your collection of NamedVectors to file with something like:
SequenceFile.Writer writer = new SequenceFile.Writer(...);
VectorWritable writable = new VectorWritable();
// the next two lines will be in a loop, but I'm omitting it for clarity
writable.set(nVec);
writer.append(new Text(nVec.getName()), writable);
You can now use this file as input to one of the clustering algorithms.
After having run one of the clustering algorithms with your points file, it will have generated yet another points file, but it will be in a directory named clusteredPoints.
You can then read in this points file and extract the name you associated to each vector. It'll look something like this:
IntWritable clusterId = new IntWritable();
WeightedPropertyVectorWritable vector = new WeightedPropertyVectorWritable();
while (reader.next(clusterId, vector))
{
NamedVector nVec = (NamedVector)vector.getVector();
// you now have access to the original name using nVec.getName()
}
Check the parameter named "clusterClassificationThreshold".
It should be 0; with a higher threshold, points whose membership probability falls below it are left out of clusteredPoints.
You can check this: http://mail-archives.apache.org/mod_mbox/mahout-user/201211.mbox/%3C50B62629.5020700@windwardsolutions.com%3E

LIBSVM in MATLAB/Octave - what's the output of libsvmread?

The second output of the libsvmread command is a set of features for each given training example.
For example, in the following MATLAB command:
[heart_scale_label, heart_scale_inst] = libsvmread('../heart_scale');
This second variable (heart_scale_inst) holds content in a form that I don't understand, for example:
<1, 1> -> 0.70833
What is the meaning of it? How is it to be used (I can't plot it, the way it is)?
PS. If anyone could please recommend a good LIBSVM tutorial, I'd appreciate it. I haven't found anything useful and the README file isn't very clear... Thanks.
The definitive LIBSVM tutorial for beginners is "A Practical Guide to Support Vector Classification"; it is available from the site of the authors of LIBSVM.
The second output is called the instance matrix. It is a matrix, call it M, where M(1,:) holds the features of data point 1, and so on. The matrix is sparse, which is why it prints out oddly. To see it in full, print full(M).
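To make the sparse versus full distinction concrete, here is a small Python sketch (not MATLAB, and not using LIBSVM itself) that parses one line in the LIBSVM text format "label index:value ..." and expands it to a dense row. The sample line is made up for illustration and is not copied from heart_scale; the function names are hypothetical.

```python
def parse_libsvm_line(line: str):
    """Parse 'label index:value ...' into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)  # indices are 1-based
    return label, features


def to_dense(features: dict, num_features: int) -> list:
    """Expand sparse features to a dense row; missing entries are zero."""
    return [features.get(i, 0.0) for i in range(1, num_features + 1)]


# Illustrative line only; feature 2 and feature 5 are absent, so they are zero
label, feats = parse_libsvm_line("+1 1:0.70833 3:1 4:-0.32")
dense = to_dense(feats, 5)
print(feats)
print(dense)
```

The sparse form stores only the nonzero (index, value) pairs, which matches the "<1, 1> -> 0.70833" style of display in the question; the dense row is the analogue of what full(M) shows.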
[heart_scale_label, heart_scale_inst] = libsvmread('../heart_scale');
With heart_scale_label and heart_scale_inst you should be able to train an SVM by issuing:
mod = svmtrain(heart_scale_label,heart_scale_inst,'-c 1 -t 0');
I strongly suggest you read the guide mentioned above to learn how to set the c parameter (and, for an RBF kernel, the gamma parameter), but the above line is how you would train with that data.
I think it is the probability with which test case has been predicted to heart_scale label category

Unsure about the way mahout produces clusters

So I'm trying to figure out how to interpret/analyse this clustering output I have. I have 50 folders, called clusters-0, clusters-1, clusters-2 and so on. This is because I said '-k 50' in my command. I thought these folders each contained one cluster, but now I'm not sure.
Using '--help' kmeans says that the '-cl' switch will: "If present, run clustering after the iterations have taken place."
So, does that mean that you need to use '-cl' for the clustering to actually happen?
If "-cl" is not used, are all those fifty folders just the output of successive iterations of the k-means algorithm, without any output that actually contains the clusters?
Does each of those folders contain fifty clusters, and the final one is the best, most refined set of clusters?
About the folder structure that Mahout k-means generates:
/clusters - contains the initial centroids of the clusters; the distance measures for each individual data point are computed relative to these points.
/output/clusteredPoints - contains the SequenceFile holding the cluster id and the data used for clustering, in (key, value) format.
/output/clusters-* - each of these folders contains data about the newly computed cluster centroids for one iteration.
/output/clusters-*-final - contains the final cluster details.
Here's what I have in it:
VL-1123{n=615 c=[0.655, 0.175, -1.042] r=[0.254, 0.086, 0.271]}
VL-376{n=1607 c=[-0.068, 0.184, 0.787] r=[0.152, 0.020, 0.113]}
VL-3492{n=375 c=[0.616, 0.111, 0.803] r=[0.289, 0.068, 0.227]}
VL-347{n=507 c=[-0.496, 0.166, 0.574] r=[0.169, 0.078, 0.196]}
VL-992{n=595 c=[0.154, 0.267, -0.394] r=[0.212, 0.083, 0.282]}
VL-2468{n=189 c=[-0.696, -0.008, -0.494] r=[0.247, 0.213, 0.372]}
Here I have 6 clusters, so each line gives
the cluster ID (1123), the number of records in the cluster (n=615), the cluster centroid (c) and the radius (r).
Also, the VL prefix indicates that the cluster has converged, which is a good thing.
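If you want these values in a program rather than reading them by eye, the lines above are regular enough to parse. This is a rough Python sketch that assumes the exact dense text layout shown in this answer (a sparse "index:value" centroid layout would need a different pattern); parse_cluster_line is a hypothetical helper, not part of Mahout.

```python
import re

# Matches lines such as: VL-1123{n=615 c=[0.655, 0.175, -1.042] r=[0.254, 0.086, 0.271]}
LINE_RE = re.compile(
    r"^(?P<state>[A-Z]+)-(?P<id>\d+)\{n=(?P<n>\d+) "
    r"c=\[(?P<c>[^\]]*)\] r=\[(?P<r>[^\]]*)\]\}$"
)


def _to_floats(text: str) -> list:
    """Split a comma-separated list of numbers into floats."""
    return [float(x) for x in text.split(",") if x.strip()]


def parse_cluster_line(line: str) -> dict:
    """Extract id, size, centroid and radius from one dumped cluster line."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized cluster line: {line!r}")
    return {
        "converged": m.group("state") == "VL",  # VL prefix = converged
        "id": int(m.group("id")),
        "n": int(m.group("n")),
        "centroid": _to_floats(m.group("c")),
        "radius": _to_floats(m.group("r")),
    }


cluster = parse_cluster_line(
    "VL-1123{n=615 c=[0.655, 0.175, -1.042] r=[0.254, 0.086, 0.271]}"
)
print(cluster)
```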
Hope it helps!!