Clustering classifier and clustering policy - cluster-analysis

I was going through the K-means algorithm in mahout and when debugging, I noticed that when creating the first clusters it does this following code:
ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);
ClusterClassifier prior = new ClusterClassifier(clusters, policy);
prior.writeToSeqFiles(priorClustersPath);
I was reading the description of these classes and it was not clear for me...
I was wondering what is the meaning of these cluster classifier and policy?
is it related with hierarchical clustering, centroid based clustering, distribution based
clustering etc?
Because I do not know what is the benefit or the reason of using this cluster classifier and policy when using K-means mahout implementation.

The implementation shares code with other variants of k-means and similar algorithms such as Canopy pre-clustering and GMM.
These classes encode only the difference between these algorithms.
Mahout is not a good place to study the k-means algorithm, the implementation is quite a mess. It's also slow. As in really really slow. Most of the time, a single CPU implementation will outright beat Mahout on anything that fits into memory. Maybe even on disk of a single machine. Because of all the map-reduce overhead.

Related

ELKI dbscan examples

I want to use dbscan algorithm in ELKI (not on GUI). Please give me some examples including loading data, running algorithm, displaying results? The docs of ELKI don't have any examples.
Use the example at
https://elki-project.github.io/howto/java_api#PureJavaAPI
Rather than KMeansLloyd, you would use DBSCAN, of course.
This example does not set up an index, so the runtime will be O(n²). For larger data sets, indexes such as the cover tree give considerable performance benefits.
Setting up indexes etc. often is easier with the parameterization API.
Alternatively, you could use the parameterization API, as done in the unit test:
https://github.com/elki-project/elki/blob/master/elki-clustering/src/test/java/de/lmu/ifi/dbs/elki/algorithm/clustering/DBSCANTest.java

Using precision recall metric on a hierarchy of recovered clusters

Context: We are two students intending to write a thesis on reverse engineering namespaces using hierarchical agglomerative clustering algorithms. We have a variation of linking methods and other tweaks to the algorithm we want to try out. We will run the algorithm on popular GitHub repositories and compare the created clusters with the originally existent namespaces. Our work will closely follow the works of this paper. In the paper the authors mentions the use of the “precision recall metric” to measure the accuracy of the clustering algorithm. However looking more closely on the metric and its origin, it seems to be dedicated to flat (non-hierarchical) clusters.
Question:
Is there a way to use the precision recall metric to measure the accuracy of a hierarchy of recovered clusters? If not, what other options exists?

Clustering Algorithm for average energy measurements

I have a data set which consists of data points having attributes like:
average daily consumption of energy
average daily generation of energy
type of energy source
average daily energy fed in to grid
daily energy tariff
I am new to clustering techniques.
So my question is which clustering algorithm will be best for such kind of data to form clusters ?
I think hierarchical clustering is a good choice. Have a look here Clustering Algorithms
The more simple way to do clustering is by kmeans algorithm. If all of your attributes are numerical, then this is the easiest way of doing the clustering. Even if they are not, you would have to find a distance measure for caterogical or nominal attributes, but still kmeans is a good choice. Kmeans is a partitional clustering algorithm... i wouldn't use hierarchical clustering for this case. But that also depends on what you want to do. you need to evaluate if you want to find clusters within clusters or they all have to be totally apart from each other and not included on each other.
Take care.
1) First, try with k-means. If that fulfills your demand that's it. Play with different number of clusters (controlled by parameter k). There are a number of implementations of k-means and you can implement your own version if you have good programming skills.
K-means generally works well if data looks like a circular/spherical shape. This means that there is some Gaussianity in the data (data comes from a Gaussian distribution).
2) if k-means doesn't fulfill your expectations, it is time to read and think more. Then I suggest reading a good survey paper. the most common techniques are implemented in several programming languages and data mining frameworks, many of them are free to download and use.
3) if applying state-of-the-art clustering techniques is not enough, it is time to design a new technique. Then you can think by yourself or associate with a machine learning expert.
Since most of your data is continuous, and it reasonable to assume that energy consumption and generation are normally distributed, I would use statistical methods for clustering.
Such as:
Gaussian Mixture Models
Bayesian Hierarchical Clustering
The advantage of these methods over metric-based clustering algorithms (e.g. k-means) is that we can take advantage of the fact that we are dealing with averages, and we can make assumptions on the distributions from which those average were calculated.

text classification methods? SVM and decision tree

i have a training set and i want to use a classification method for classifying other documents according to my training set.my document types are news and categories are sports,politics,economic and so on.
i understand naive bayes and KNN completely but SVM and decision tree are vague and i dont know if i can implement this method by myself?or there is applications for using this methods?
what is the best method i can use for classifying docs in this way?
thanks!
Naive Bayes
Though this is the simplest algorithm and everything is deemed independent, in real text classification case, this method work great. And I would try this algorithm first for sure.
KNN
KNN is for clustering rather than classification. I think you misunderstand the conception of clustering and classification.
SVM
SVM has SVC(classification) and SVR(Regression) algorithms to do class classification and prediction. It sometime works good, but from my experiences, it has bad performance in text classification, as it has high demands for good tokenizers (filters). But the dictionary of the dataset always has dirty tokens. The accuracy is really bad.
Random Forest (decision tree)
I've never try this method for text classification. Because I think decision tree need several key nodes, while it's hard to find "several key tokens" for text classification, and random forest works bad for high sparse dimensions.
FYI
These are all from my experiences, but for your case, you have no better ways to decide which methods to use but to try every algorithm to fit your model.
Apache's Mahout is a great tool for machine learning algorithms. It integrates three aspects' algorithms: recommendation, clustering, and classification. You could try this library. But you have to learn some basic knowledge about Hadoop.
And for machine learning, weka is a software toolkit for experiences which integrates many algorithms.
Linear SVMs are one of the top algorithms for text classification problems (along with Logistic Regression). Decision Trees suffer badly in such high dimensional feature spaces.
The Pegasos algorithm is one of the simplest Linear SVM algorithms and is incredibly effective.
EDIT: Multinomial Naive bayes also works well on text data, though not usually as well as Linear SVMs. kNN can work okay, but its an already slow algorithm and doesn't ever top the accuracy charts on text problems.
If you are familiar with Python, you may consider NLTK and scikit-learn. The former is dedicated to NLP while the latter is a more comprehensive machine learning package (but it has a great inventory of text processing modules). Both are open source and have great community suport on SO.

ELKI implementation of OPTICS clustering algorithm detects only one cluster

I'm having issue with using OPTICS implementation in ELKI environment. I have used the same data for DBSCAN implementation and it worked like a charm. Probably I'm missing something with parameters but I can't figure it out, everything seems to be right.
Data is a simple 300х2 matrix, consists of 3 clusters with 100 points in each.
DBSCAN result:
Clustering result of DBSCAN
MinPts = 10, Eps = 1
OPTICS result:
Clustering result of OPTICS
MinPts = 10
You apparently already found the solution yourself, but here is the long story:
The OPTICS class in ELKI only computes the cluster order / reachability diagram.
In order to extract clusters, you have different choices, one of which (the one from the original OPTICS publication) is available in ELKI.
So in order to extract clusters in ELKI, you need to use the OPTICSXi algorithm, which will in turn use either OPTICS or the index based DeLiClu to compute the cluster order.
The reason why this is split into two parts in ELKI probably is so that you can on one hand implement another logic for extracting the clusters, and on the other hand implement different methods like DeLiClu for computing the cluster order. That would align well with the modular architecture of ELKI.
IIRC there is at least one more method (apparently not yet in ELKI) that extracts clusters by looking for local maxima, then extending them horizontally until they hit the end of the valley. And there was a different one that used "inflexion points" of the plot.
#AnonyMousse pretty much put it right. I just can't upvote or comment yet.
We hope to have some students contribute the other cluster extraction methods as small student projects over time. They are not essential for our research, but they are good tasks for students that want to learn about ELKI to get started.
ELKI is a fast moving project, and it lives from community contributions. We would be happy to see you contribute some code to it. We know that the codebase is not easy to get started with - it is fairly large, and the generality of the implementation and the support for index structures make it a bit hard to get started. We try to add Tutorials to help you to get started. And once you are used to it, you will actually benefit from the architecture: your algorithms get the benfits of indexing and arbitrary distance functions, while if you would implement from scratch, you would likely only support Euclidean distance, and no index acceleration.
Seeing that you struggled with OPTICS, I will try to write an OPTICS tutorial in the new year. In particular, OPTICS can benefit a lot from using an appropriate index structure.