Cannot get clustering output Mahout - cluster-analysis

I am running kmeans in Mahout and as an output I get folders clusters-x, clusters-x-final and clusteredPoints.
If I understood well, clusters-x are centroid locations in each of iterations, clusters-x-final are final centroid locations, and clusteredPoints should be the points being clustered with cluster id and weight which represents probability of belonging to cluster (depending on the distance between point and its centroid). On the other hand, clusters-x and clusters-x-final contain clusters centroids, number of elements, features values of centroid and the radius of the cluster (distance between centroid and its farthest point.
How do I examine this outputs?
I used cluster dumper successfully for clusters-x and clusters-x-final from terminal, but when I used it clusteredPoints, I got an empty file? What seems to be the problem?
And how can I get this values from code? I mean, the centroid values and points belonging to clusters?
FOr clusteredPoint I used IntWritable as key, and WeightedPropertyVectorWritable for value, in a while loop, but it passes the loop like there are no elements in clusteredPoints?
This is even more strange because the file that I get with clusterDumper is empty?
What could be the problem?
Any help would be greatly appreciated!

I believe your interpretation of the data is correct (I've only been working with Mahout for ~3 weeks, so someone more seasoned should probably weigh in on this).
As far as linking points back to the input that created them I've used NamedVector, where the name is the key for the vector. When you read one of the generated points files (clusteredPoints) you can convert each row (point vector) back into a NamedVector and retrieve the name using .getName().
Update in response to comment
When you initially read your data into Mahout, you convert it into a collection of vectors with which you then write to a file (points) for use in the clustering algorithms later. Mahout gives you several Vector types which you can use, but they also give you access to a Vector wrapper class called NamedVector which will allow you to identify each vector.
For example, you could create each NamedVector as follows:
NamedVector nVec = new NamedVector(
new SequentialAccessSparseVector(vectorDimensions),
vectorName
);
Then you write your collection of NamedVectors to file with something like:
SequenceFile.Writer writer = new SequenceFile.Writer(...);
VectorWritable writable = new VectorWritable();
// the next two lines will be in a loop, but I'm omitting it for clarity
writable.set(nVec);
writer.append(new Text(nVec.getName()), nVec);
You can now use this file as input to one of the clustering algorithms.
After having run one of the clustering algorithms with your points file, it will have generated yet another points file, but it will be in a directory named clusteredPoints.
You can then read in this points file and extract the name you associated to each vector. It'll look something like this:
IntWritable clusterId = new IntWritable();
WeightedPropertyVectorWritable vector = new WeightedPropertyVectorWritable();
while (reader.next(clusterId, vector))
{
NamedVector nVec = (NamedVector)vector.getVector();
// you now have access to the original name using nVec.getName()
}

check the parameter named "clusterClassificationThreshold".
clusterClassificationThreshold should be 0.
You can check this http://mail-archives.apache.org/mod_mbox/mahout-user/201211.mbox/%3C50B62629.5020700#windwardsolutions.com%3E

Related

How to use binning method for identifying the incoming point belongs to which bin?

I have small query. I have two data sets. In one data sets for example I did binning and calculated the mean value and std value along with group binning. Now in I have second data sets of same parameters say X. I would like identify this X data sets belong to which bin groups of my previous data sets using matlab.
Could you give some example how to identify the incoming data points belongs to which bin group...??
I used following binning which is available in matlab :
binEdges = linspace(botEdge, topEdge, numBins+1);
[h,whichBin] = histc(x, binEdges);
Well... you already have your bin edges. Anything inside specific edges is in that bin.
If you know that the data is inside the ranges you defined then, for each new data
newdatabin=find(newdata>binedges,1,'last'); %this is the bin number where the new data goes in
h(newdatabin)=h(newdatabin)+1; %add one!
Also consider using histcounts if your MATLAB version is new enough.

How to access the variable range for leaf node in classregtree of matlab?

I used classregtree to fit a tree to my data set in order to classify the data. All of predictors and the response are quantitative. I want to save the range of each variable on terminal nodes, because I am gonna use those ranges in another function.
So is there any way that I can have access to those ranges? I can see the variable ranges in view(tree) plot but I need to save them in like a matrix to use them.
I am not totally sure that this is what you were asking for but this gives you the split criterions for all trees
B = TreeBagger(nTrees,M,tag, 'Method', 'classification','OOBPred','on');
view(B.Trees{1:B.NTrees})
where M is your trainig data set and tag are the classes.

matlab zplane function: handles of vectors

I'm interested in understanding the variety of zeroes that a given function produces with the ultimate goal of identifying the what frequencies are passed in high/low pass filters. My idea is that finding the lowest value zero of a filter will identify the passband for a LPF specifically. I'm attempting to use the [hz,hp,ht] = zplane(z,p) function to do so.
The description for that function reads "returns vectors of handles to the zero lines, hz". Could someone help me with what a vector of a handle is and what I do with one to be able to find the various zeros?
For example, a simple 5-point running average filter:
runavh = (1/5) * ones(1,5);
using zplane(runavh) gives an acceptable pole/zero plot, but running the [hz,hp,ht] = zplane(z,p) function results in hz=175.1075. I don't know what this number represents and how to use it.
Many thanks.
Using the get command, you can find out things about the data.
For example, type G=get(hz) to get a list of properties of the zero lines. Then the XData is given by G.XData, i.e. X=G.XData.
Alternatively, you can only pull out the data you want
X=get(hz,'XData')
Hope that helps.

Unsure about the way mahout produces clusters

So I'm trying to figure out how to interpret/analyse this clustering output I have. I have 50 folders, called clusters-0, clusters-1, clusters-2 and so on. This is because I said '-k 50' in my command. I thought these folders each contained one cluster, but now I'm not sure.
Using '--help' kmeans says that the '-cl' switch will: "If present, run clustering after the iterations have taken place."
So, does that mean that you need to use '-cl' for the clustering to actually happen?
If "-cl" is not used, are all those fifty folders just iterations of the k-means algorithm output and it doesn't produce an output that actually has the clusters.
Does each of those folders contain fifty clusters, and the final one is the best, most refined set of clusters?
About the folder structure that Mahout Kmeans generate:
/clusters - contains initial centroids of the clusters, based on these points distance measures are found for each individual data points.
/output/clusterPoints - contains the sequenceFile which has cluster id and data used for clustering in (key,value) format.
/output/clusters-* - Each of these folder contains data about the newly computed cluster centroid for each iterations.
/output/clusters-*-final - contains the final cluster details
Heres what I have in it.
VL-1123{n=615 c=[0.655, 0.175, -1.042] r=[0.254, 0.086, 0.271]}
VL-376{n=1607 c=[-0.068, 0.184, 0.787] r=[0.152, 0.020, 0.113]}
VL-3492{n=375 c=[0.616, 0.111, 0.803] r=[0.289, 0.068, 0.227]}
VL-347{n=507 c=[-0.496, 0.166, 0.574] r=[0.169, 0.078, 0.196]}
VL-992{n=595 c=[0.154, 0.267, -0.394] r=[0.212, 0.083, 0.282]}
VL-2468{n=189 c=[-0.696, -0.008, -0.494] r=[0.247, 0.213, 0.372]}
Here I have 6 clusters, so it gives
ClusterID(1123), number of record in cluster(n=615), cluster centroid(c) and radius(r)
Also, VL represents the clusters have converged and it`s a good thing.
Hope it helps!!

MATLAB Saving and Loading Feature Vectors

I am trying to load feature vectors into classifiers such as a k-nearest neighbors classifier.
I have my code for GLCM, so I get contrast, correlation, energy, homogeneity in numbers (feature vectors).
My question is, how can I save every set of feature vectors from all the training images? I have seen somewhere that people had a .set file to load into classifiers (may be it is a special case for the particular classifier toolbox).
load 'mydata.set';
for example.
I suppose it does not have to be a .set file.
I'd just need a way to store all the feature vectors from all the training images in a separate file that can be loaded.
I've google,
and I found this that may be useful
but I am not entirely sure.
Thanks for your time and help in advance.
Regards.
If you arrange your feature vectors as the columns of an array called X, then just issue the command
save('some_description.mat','X');
Alternatively, if you want the save file to be readable, say in ASCII, then just use this instead:
save('some_description.txt', 'X', '-ASCII');
Later, when you want to re-use the data, just say
var = {'X'}; % <-- You can modify this if you want to load multiple variables.
load('some_description.mat', var{:});
load('some_description.txt', var{:}); % <-- Use this if you saved to .txt file.
Then the variable named 'X' will be loaded into the workspace and its columns will be the same feature vectors you computed before.
You will want to replace the some_description part of each file name above and instead use something that allows you to easily identify which data set's feature vectors are saved in the file (if you have multiple data sets). Your array of feature vectors may also be called something besides X, so you can change the name accordingly.