Unsure about the way Mahout produces clusters

So I'm trying to figure out how to interpret/analyse this clustering output I have. I have 50 folders, called clusters-0, clusters-1, clusters-2 and so on. This is because I said '-k 50' in my command. I thought these folders each contained one cluster, but now I'm not sure.
Using '--help' kmeans says that the '-cl' switch will: "If present, run clustering after the iterations have taken place."
So, does that mean that you need to use '-cl' for the clustering to actually happen?
If "-cl" is not used, are all those fifty folders just iterations of the k-means algorithm output and it doesn't produce an output that actually has the clusters.
Does each of those folders contain fifty clusters, and the final one is the best, most refined set of clusters?
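For reference, the sort of command I have been running looks like this (paths are placeholders, and I have not been passing '-cl'):
mahout kmeans -i /path/to/vectors -c /path/to/initial-clusters -o /path/to/output -x 10 -k 50 -ow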

About the folder structure that Mahout k-means generates:
/clusters - contains the initial centroids of the clusters; distance measures for each individual data point are computed against these points.
/output/clusteredPoints - contains a SequenceFile holding the cluster id and the data used for clustering, in (key, value) format.
/output/clusters-* - each of these folders contains the newly computed cluster centroids for one iteration.
/output/clusters-*-final - contains the final cluster details.
Here's what I have in mine (the clusters-*-final output):
VL-1123{n=615 c=[0.655, 0.175, -1.042] r=[0.254, 0.086, 0.271]}
VL-376{n=1607 c=[-0.068, 0.184, 0.787] r=[0.152, 0.020, 0.113]}
VL-3492{n=375 c=[0.616, 0.111, 0.803] r=[0.289, 0.068, 0.227]}
VL-347{n=507 c=[-0.496, 0.166, 0.574] r=[0.169, 0.078, 0.196]}
VL-992{n=595 c=[0.154, 0.267, -0.394] r=[0.212, 0.083, 0.282]}
VL-2468{n=189 c=[-0.696, -0.008, -0.494] r=[0.247, 0.213, 0.372]}
Here I have 6 clusters, so for each one it gives the cluster ID (1123), the number of records in the cluster (n=615), the cluster centroid (c) and the radius (r).
Also, the VL prefix means the cluster has converged, which is a good thing.
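To pretty-print these files yourself, the clusterdump utility should work; something like this (paths are placeholders, flags as I remember them from the standard clusterdump driver):
mahout clusterdump -i output/clusters-*-final -p output/clusteredPoints -o clusterdump.txt
Note that the clusteredPoints directory only has content if clustering was actually run (the '-cl' switch from the question).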
Hope it helps!!

Related

How do I compare two weighted regressions in MATLAB?

I've been using MATLAB as a statistics tool. I like how much I can customise and code myself.
I was delighted to find that it's fairly straightforward to do a weighted linear regression in MATLAB. As a slightly silly example, I can load the "carbig" data file and compare horsepower vs mileage for US cars to that of cars from other countries, but decide I only trust 8-cylinder cars.
load carbig
% weight 1 for 8-cylinder cars, 0.5 otherwise
w = (Cylinders==8) + 0.5*(Cylinders~=8);
% flag rows whose origin matches the first row (USA); strcmp only works on one string at a time
o = zeros(length(org),1);
for i = 1:length(org)
    o(i,1) = strcmp(org(i,:), org(1,:));
end
x1 = Horsepower(o==1);  x2 = Horsepower(o==0);
y1 = MPG(o==1);         y2 = MPG(o==0);
w1 = w(o==1);           w2 = w(o==0);
lm1 = fitlm(x1, y1, 'Weights', w1)
lm2 = fitlm(x2, y2, 'Weights', w2)
This way, data from 8-cylinder cars will count as one data point, and data from 3-, 4-, 5- and 6-cylinder cars will count as half a data point.
Problem is, the obvious way to compare the two regressions is to use ANCOVA, which MATLAB has a function for:
aoctool(Horsepower,MPG,o)
This function compares linear regressions on the two groups, but I haven't found an obvious way to include weights.
I suspect I can have a closer look at what the ANCOVA does and include the weights manually. Any easier solution?
I figured that if I give the "trusted" measurements weight 2 and the "untrusted" measurements weight 1, then for regression purposes that's the same as having one extra identical measurement for each trusted one. Setting the weights to 1 and 0.5 should do the same thing. I can do this with a script.
That also inflates the degrees of freedom quite a bit, so I manually set the degrees of freedom to sum(w) - rank instead of n - rank.
x = []; y = []; g = [];
w = (Cylinders==8) + 0.5*(Cylinders~=8);
df = sum(w)   % adjusted degrees of freedom (before subtracting the rank)
% duplicate each point in proportion to its weight:
% weight 1 -> two copies, weight 0.5 -> one copy
for i = 1:length(w)
    while w(i) > 0
        x = [x; Horsepower(i)];
        y = [y; MPG(i)];
        g = [g; o(i)];
        w(i) = w(i) - 0.5;
    end
end
I then copied the aoctool.m file (edit aoctool) and inserted the value of df somewhere in the new file. It isn't elegant, but it seems to work.
edit aoctool.m
%(insert new df somewhere. Save as aoctool2.m)
aoctool2(x,y,g)
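For what it's worth, a possibly easier route (an untested sketch, using only documented fitlm features) is to fit a single weighted model with a group interaction, which is essentially what ANCOVA's separate-lines model does:
% one weighted fit; the HP:Group coefficient tests whether the slopes differ
tbl = table(Horsepower, MPG, categorical(o), w, ...
    'VariableNames', {'HP','MPG','Group','w'});
lm = fitlm(tbl, 'MPG ~ HP*Group', 'Weights', tbl.w)
The p-value on the HP:Group row of the coefficient table then compares the two regressions, with the weights included and the degrees of freedom handled by fitlm itself.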

How to use a binning method to identify which bin an incoming point belongs to?

I have a small query. I have two data sets. On the first data set I did binning and calculated the mean and standard deviation of each bin group. Now I have a second data set of the same parameter, say X. I would like to identify which bin groups of my previous data set this X data belongs to, using MATLAB.
Could you give an example of how to identify which bin group the incoming data points belong to?
I used the following binning, which is available in MATLAB:
binEdges = linspace(botEdge, topEdge, numBins+1);
[h,whichBin] = histc(x, binEdges);
Well... you already have your bin edges. Anything between two specific edges is in that bin.
If you know that the data is inside the ranges you defined, then for each new data point:
newdatabin = find(newdata >= binedges, 1, 'last'); % bin number where the new data point goes (>= matches histc's edge handling)
h(newdatabin) = h(newdatabin) + 1; % add one!
Also consider using histcounts if your MATLAB version is new enough.
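In newer versions, histcounts (R2014b+) and discretize (R2015a+) return bin indices directly; a minimal sketch:
[counts, edges, whichBin] = histcounts(newdata, binEdges); % whichBin holds the bin index of each point
whichBin = discretize(newdata, binEdges);                  % same indices, if you don't need the counts
Points outside [botEdge, topEdge] get index 0 from histcounts, or NaN from discretize.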

Cannot get clustering output in Mahout

I am running kmeans in Mahout and as an output I get folders clusters-x, clusters-x-final and clusteredPoints.
If I understood correctly, clusters-x are the centroid locations at each iteration, clusters-x-final are the final centroid locations, and clusteredPoints should contain the points being clustered, with a cluster id and a weight representing the probability of belonging to that cluster (depending on the distance between the point and its centroid). On the other hand, clusters-x and clusters-x-final contain the cluster centroids, the number of elements, the feature values of the centroid and the radius of the cluster (the distance between the centroid and its farthest point).
How do I examine these outputs?
I used the cluster dumper successfully on clusters-x and clusters-x-final from the terminal, but when I used it on clusteredPoints I got an empty file. What seems to be the problem?
And how can I get these values from code? I mean the centroid values and the points belonging to each cluster?
For clusteredPoints I used IntWritable as the key and WeightedPropertyVectorWritable as the value in a while loop, but it skips the loop as if there were no elements in clusteredPoints.
This is even stranger given that the file I get with clusterDumper is also empty.
What could be the problem?
Any help would be greatly appreciated!
I believe your interpretation of the data is correct (I've only been working with Mahout for ~3 weeks, so someone more seasoned should probably weigh in on this).
As far as linking points back to the input that created them goes, I've used NamedVector, where the name is the key for the vector. When you read one of the generated points files (clusteredPoints), you can convert each row (point vector) back into a NamedVector and retrieve the name using .getName().
Update in response to comment
When you initially read your data into Mahout, you convert it into a collection of vectors which you then write to a file (points) for use in the clustering algorithms later. Mahout gives you several Vector types you can use, but it also gives you access to a Vector wrapper class called NamedVector which will allow you to identify each vector.
For example, you could create each NamedVector as follows:
NamedVector nVec = new NamedVector(
new SequentialAccessSparseVector(vectorDimensions),
vectorName
);
Then you write your collection of NamedVectors to file with something like:
SequenceFile.Writer writer = new SequenceFile.Writer(...);
VectorWritable writable = new VectorWritable();
// the next three lines will be in a loop, but I'm omitting it for clarity
writable.set(nVec);
// append the VectorWritable wrapper, not the NamedVector itself
writer.append(new Text(nVec.getName()), writable);
You can now use this file as input to one of the clustering algorithms.
After having run one of the clustering algorithms with your points file, it will have generated yet another points file, but it will be in a directory named clusteredPoints.
You can then read in this points file and extract the name you associated to each vector. It'll look something like this:
// reader is a SequenceFile.Reader opened on a part file under clusteredPoints
IntWritable clusterId = new IntWritable();
WeightedPropertyVectorWritable vector = new WeightedPropertyVectorWritable();
while (reader.next(clusterId, vector))
{
NamedVector nVec = (NamedVector)vector.getVector();
// you now have access to the original name using nVec.getName()
}
Check the parameter named "clusterClassificationThreshold".
clusterClassificationThreshold should be 0.
You can check this thread: http://mail-archives.apache.org/mod_mbox/mahout-user/201211.mbox/%3C50B62629.5020700#windwardsolutions.com%3E
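If you are running k-means from code rather than the CLI, the threshold is an argument of the driver. A hedged sketch, from memory of the Mahout 0.7/0.8 API (paths are hypothetical; check the KMeansDriver.run signature in your version):
KMeansDriver.run(conf,
    new Path("input/points"),   // input vectors (hypothetical path)
    new Path("input/clusters"), // initial centroids (hypothetical path)
    new Path("output"),         // output directory
    0.001,                      // convergenceDelta
    10,                         // maxIterations
    true,                       // runClustering: actually write clusteredPoints
    0.0,                        // clusterClassificationThreshold = 0: emit every point
    false);                     // runSequential: use MapReduce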

How to use Mahout Streaming K-Means

I have seen that there is a new implementation of k-means in Mahout called Streaming-KMeans, which achieves k-means clustering without chained Mapper-Reducer cycles:
https://github.com/dfilimon/mahout/tree/epigrams
I am not finding any articles about its usage anywhere. Could anyone point out useful links with some code examples of how to use it?
StreamingKMeans is a new feature in Mahout 0.8.
For more details on its algorithms, see:
"Streaming k-means approximation" by N. Ailon, R. Jaiswal, C. Monteleoni
http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
"Fast and Accurate k-means for Large Datasets" by M. Shindler, A. Wong, A. Meyerson,
http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf
As you mentioned, there is no article about its usage. As with the other versions of the clustering algorithms, there is a Driver to which you can pass some configuration parameters as a string array, and it will cluster your data:
String[] args1 = new String[] {
    "-i", "/home/name/workspace/XXXXX-vectors/tfidf-vectors",
    "-o", "/home/name/workspace/XXXXX-vectors/tfidf-vectors/SKM-Main-result/",
    "--estimatedNumMapClusters", "200",
    "--searchSize", "2",
    "-k", "12",
    "--numBallKMeansRuns", "3",
    "--distanceMeasure", "org.apache.mahout.common.distance.CosineDistanceMeasure"
};
StreamingKMeansDriver.main(args1);
To get a description of the important parameters, just make a deliberate mistake, such as passing "-iiii" as the first parameter; it will show you the parameters, their descriptions and their default values.
But if you don't want to use it this way, just read the code of StreamingKMeansMapper, StreamingKMeansReducer and StreamingKMeansThread; these 3 classes will help you understand how the algorithm is used and let you customize it for your needs.
The mapper uses StreamingKMeans to produce estimated clusters of the input data. To get k final clusters, the reducer takes the intermediate points (the centroids generated in the previous step) and clusters them into k clusters using ball k-means.
Here are the steps for running Streaming k-means:
Generate Sparse vectors via seq2sparse.
mahout streamingkmeans -i "" -o ""
--tempDir "" -ow
-sc org.apache.mahout.math.neighborhood.FastProjectionSearch
-k -km
-k = number of clusters
-km = k * log(n), where k = number of clusters and n = number of data points to cluster; round this to the nearest integer (a quick worked example follows below)
You have the option of using FastProjectionSearch, ProjectionSearch or LocalitySensitiveHashSearch for the -sc parameter.
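As a worked example of the -km formula (assuming the natural log; the base only shifts the ballpark slightly): with k = 12 and n = 100,000 points, km = 12 * ln(100000) ≈ 12 * 11.5 ≈ 138.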

Text Categorization datasets for MATLAB

I am looking for a reliable dataset for text categorization tasks in MATLAB format.
I want to run some experiments and don't want to spend too much time preprocessing the text and creating feature vectors. I need something ready so I can plug it into my algorithm. I found MATLAB files for the Reuters dataset here: link text
Everything is ready there, but I want to use a subset of it. In this file, "fea" contains the feature vectors for each document. However, it seems that it is not a normal matrix. I want, for example, to select the top 1000 documents in "fea". If you just download it and load it into MATLAB you will see what I mean.
So, if it is possible, I need a solution for the above-mentioned dataset, or any alternative datasets.
Thanks in advance.
It is stored as a sparse matrix. Extract the first 1000 documents (rows), and if you have enough memory you can convert the result to a full dense matrix:
load Reuters21578.mat
TF = full( fea(1:1000,:) );
Let's check the variables we have:
>> whos
  Name          Size                Bytes  Class     Attributes

  TF         1000x18933         151464000  double
  fea        8293x18933           4749196  double    sparse
  gnd        8293x1                 66344  double
  testIdx    2347x1                 18776  double
  trainIdx   5946x1                 47568  double
so you can see TF is now about 150MB.
Other than that, the rest is self-explanatory:
fea: term-frequency matrix, rows are documents, columns are terms
gnd: category of each document, where numel(unique(gnd)) == 65
trainIdx/testIdx: split of instances (documents) for classification purposes, contains indices of rows, used as: tr = fea(trainIdx,:); tt = fea(testIdx,:);
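If you also want the labels and the train/test split restricted to the same subset, a small sketch:
sub = 1:1000;                    % the first 1000 documents
X   = full( fea(sub,:) );        % dense feature submatrix
y   = gnd(sub);                  % matching category labels
tr  = intersect(trainIdx, sub);  % training rows that fall inside the subset
tt  = intersect(testIdx,  sub);  % test rows that fall inside the subset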