calculating clustering validity of k-means using rapidminer - cluster-analysis

Well, I have been studying up on the different algorithms used for clustering like k-means, k-mediods etc and I was trying to run the algorithms and analyze their performance on the leaf dataset right here:
http://archive.ics.uci.edu/ml/datasets/Leaf
I was able to cluster the dataset via k-means by first reading the csv file, filtering out unneeded attributes and applying k-means on it. The problem that I am facing here that I wish to calculate measures such as entropy, precision, recall and f-measure for the model developed via k-means. Is there an operator avialable that allows me to do this so that I can quantitatively compare the different clustering algorithms available on rapid-miner?
P.S I know about performance operators like Performance(Classification) that allows me to calculate precision and recall for a model but I dont know any that allow me to calculate entropy.
Help would be much appreciated.

The short answer is to use R. Here's a link to a book chapter about this very subject. There is a revised version coming soon that works for the most recent version of RapidMiner.

Related

Spatial features and pattern analysis of a plan?

I am working on instances from the TSPLIB, which are simply coordinates of nodes in a plan. I'm looking to analyze spatial characteristics and features of a set of instances (e.g. clustered, not clustered, dispersed, etc) and I would like to implement some code in Matlab to analyze and compute specific features.
For example, so far, I have used Nearest Neighbor analysis to identify clusters, as well as quadrant analysis. Can anyone suggest any other spatial features and patterns that could be computed with some relatively simple code? Anybody maybe expert in the Traveling Salesman Problem. Thank you so much!
K-means is a very useful clustering tool that you can use.
https://www.mathworks.com/help/stats/kmeans.html
Nearest Neighbor is a classification methods. if you want to do classification you can use K Nearest Neighbors, SVM or Neural Networks Pattern recognition toolbox. these are all already in Matlab.
Also, check out Matlab Apps. there are some very cool clustering tools available as well with examples.

Clustering Algorithm for average energy measurements

I have a data set which consists of data points having attributes like:
average daily consumption of energy
average daily generation of energy
type of energy source
average daily energy fed in to grid
daily energy tariff
I am new to clustering techniques.
So my question is which clustering algorithm will be best for such kind of data to form clusters ?
I think hierarchical clustering is a good choice. Have a look here Clustering Algorithms
The more simple way to do clustering is by kmeans algorithm. If all of your attributes are numerical, then this is the easiest way of doing the clustering. Even if they are not, you would have to find a distance measure for caterogical or nominal attributes, but still kmeans is a good choice. Kmeans is a partitional clustering algorithm... i wouldn't use hierarchical clustering for this case. But that also depends on what you want to do. you need to evaluate if you want to find clusters within clusters or they all have to be totally apart from each other and not included on each other.
Take care.
1) First, try with k-means. If that fulfills your demand that's it. Play with different number of clusters (controlled by parameter k). There are a number of implementations of k-means and you can implement your own version if you have good programming skills.
K-means generally works well if data looks like a circular/spherical shape. This means that there is some Gaussianity in the data (data comes from a Gaussian distribution).
2) if k-means doesn't fulfill your expectations, it is time to read and think more. Then I suggest reading a good survey paper. the most common techniques are implemented in several programming languages and data mining frameworks, many of them are free to download and use.
3) if applying state-of-the-art clustering techniques is not enough, it is time to design a new technique. Then you can think by yourself or associate with a machine learning expert.
Since most of your data is continuous, and it reasonable to assume that energy consumption and generation are normally distributed, I would use statistical methods for clustering.
Such as:
Gaussian Mixture Models
Bayesian Hierarchical Clustering
The advantage of these methods over metric-based clustering algorithms (e.g. k-means) is that we can take advantage of the fact that we are dealing with averages, and we can make assumptions on the distributions from which those average were calculated.

RapidMiner and WEKA : Different clustering result

I am new in Data Mining analytic and Machine Learning. I have been trying to compare the use of Predictive analysis and Clustering analysis using RapidMiner and Weka for my college assignment.
Just after I study the advantages and disadvantages from both tools and starting to do the analyzing process I found some problems. I tried doing Clustering using K-means and simpleKmeans for Weka and Regression analysis using LinearRegression and I am not quite satisfied with the result, since they contain result that significantly different. all of that I used a same datasets. numerical datasets.
I have been spending a lot of my time trying to figure something out by studying the initialization for each algorithm each tools since the interface is different and there are some parameter that is on RapidMiner but not in Weka or otherwise, so I am a bit confused. (is it the problem?)
Despite that what do you think is wrong? is there some initialization process that I missed? or is it because the code is different in each tools even they use the same algorithm?
Thank you for your answer!
Weka often uses built-in normalization at least in k-means and other algorithms.
Make sure you have disabled this if you want to make results comparable.
Also understand that k-means is a randomized algorithm. Different results even from the same package are to be expected (and desirable).
did you use WEKA itself or rapidminer's WEKA extension? Did you try to compare the results of WEKA with RM WEKA?

ELKI implementation of OPTICS clustering algorithm detects only one cluster

I'm having issue with using OPTICS implementation in ELKI environment. I have used the same data for DBSCAN implementation and it worked like a charm. Probably I'm missing something with parameters but I can't figure it out, everything seems to be right.
Data is a simple 300х2 matrix, consists of 3 clusters with 100 points in each.
DBSCAN result:
Clustering result of DBSCAN
MinPts = 10, Eps = 1
OPTICS result:
Clustering result of OPTICS
MinPts = 10
You apparently already found the solution yourself, but here is the long story:
The OPTICS class in ELKI only computes the cluster order / reachability diagram.
In order to extract clusters, you have different choices, one of which (the one from the original OPTICS publication) is available in ELKI.
So in order to extract clusters in ELKI, you need to use the OPTICSXi algorithm, which will in turn use either OPTICS or the index based DeLiClu to compute the cluster order.
The reason why this is split into two parts in ELKI probably is so that you can on one hand implement another logic for extracting the clusters, and on the other hand implement different methods like DeLiClu for computing the cluster order. That would align well with the modular architecture of ELKI.
IIRC there is at least one more method (apparently not yet in ELKI) that extracts clusters by looking for local maxima, then extending them horizontally until they hit the end of the valley. And there was a different one that used "inflexion points" of the plot.
#AnonyMousse pretty much put it right. I just can't upvote or comment yet.
We hope to have some students contribute the other cluster extraction methods as small student projects over time. They are not essential for our research, but they are good tasks for students that want to learn about ELKI to get started.
ELKI is a fast moving project, and it lives from community contributions. We would be happy to see you contribute some code to it. We know that the codebase is not easy to get started with - it is fairly large, and the generality of the implementation and the support for index structures make it a bit hard to get started. We try to add Tutorials to help you to get started. And once you are used to it, you will actually benefit from the architecture: your algorithms get the benfits of indexing and arbitrary distance functions, while if you would implement from scratch, you would likely only support Euclidean distance, and no index acceleration.
Seeing that you struggled with OPTICS, I will try to write an OPTICS tutorial in the new year. In particular, OPTICS can benefit a lot from using an appropriate index structure.

Matlab: K-means clustering with predefined populations

I am trying to differentiate two populations. Each population is an NxM matrix in which N is fixed between the two and M is variable in length (N=column specific attributes of each run, M=run number). I have looked at PCA and K-means for differentiating the two, but I was curious of the best practice.
To my knowledge, in K-means, there is no initial 'calibration' in which the clusters are chosen such that known bimodal populations can be differentiated. It simply minimizes the distance and assigns the data to an arbitrary number of populations. I would like to tell the clustering algorithm that I want the best fit in which the two populations are separated. I can then use the fit I get from the initial clustering on future datasets. Any help, example code, or reading material would be appreciated.
-R
K-means and PCA are typically used in unsupervised learning problems, i.e. problems where you have a single batch of data and want to find some easier way to describe it. In principle, you could run K-means (with K=2) on your data, and then evaluate the degree to which your two classes of data match up with the data clusters found by this algorithm (note: you may want multiple starts).
It sounds to like you have a supervised learning problem: you have a training data set which has already been partitioned into two classes. In this case k-nearest neighbors (as mentioned by #amas) is probably the approach most like k-means; however Support Vector Machines can also be an attractive approach.
I frequently refer to The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics) by Trevor Hastie (Author), Robert Tibshirani (Author), Jerome Friedman (Author).
It really depends on the data. But just to let you know K-means does get stuck at local minima so if you wanna use it try running it from different random starting points. PCA's might also be useful how ever like any other spectral clustering method you have much less control over the clustering procedure. I recommend that you cluster the data using k-means with multiple random starting points and c how it works then you can predict and learn for each the new samples with K-NN (I don't know if it is useful for your case).
Check Lazy learners and K-NN for prediction.