Recall and precision in RapidMiner - cluster-analysis

There is a dataset in Excel containing labels in column A (I call them cluster labels) and attributes in column B (I call them cluster components). These data represent the best clustering result.
But I don't know how to compute the recall and precision of other clustering methods against these data in RapidMiner!
Can anybody help me?

The following link gives an example of using the RapidMiner operator "Map Clustering on Labels". This maps known cluster labels to the clusters allocated by the clustering algorithm. From this, the output can be used to create a confusion matrix from which precision and recall can be determined.
Hope it helps...

Please note that when you have more than two classes/labels, precision and recall have to be computed for each class individually.
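Outside RapidMiner, the same idea can be sketched in Python: map each predicted cluster to the known label it best matches (here via the Hungarian algorithm on a confusion matrix, one reasonable choice), then compute per-class precision and recall. The label vectors below are hypothetical stand-ins for the Excel columns.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# hypothetical data: known labels (column A) and cluster ids from some algorithm
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])

# find the cluster-to-label mapping that maximizes agreement
C = confusion_matrix(y_true, y_pred)
row, col = linear_sum_assignment(-C)   # maximize trace of the remapped matrix
mapping = dict(zip(col, row))
y_mapped = np.array([mapping[c] for c in y_pred])

# per-class precision and recall, as the note above suggests
prec = precision_score(y_true, y_mapped, average=None)
rec = recall_score(y_true, y_mapped, average=None)
```

With the toy vectors above, cluster 2 maps to label 0, cluster 0 to label 1, and cluster 1 to label 2, and precision/recall then differ per class.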

Related

Negative Silhouette Score for k-means

In sklearn's description of the silhouette_score method, it says that negative values stand for data points that are wrongly assigned to a cluster. I am wondering how this is possible for the k-means algorithm, where each data point is assigned to the nearest cluster, i.e. the one at the lowest distance. If that is the case, how can we get negative silhouette scores? Is this only possible when different objects are weighted unequally?
Thanks in advance!
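A small sketch can make the question concrete: the silhouette compares mean distances to cluster *members*, not distances to centroids, so a point can sit nearest to its own centroid yet be far from its own cluster-mates. The hand-built 1-D data below are hypothetical; the initial centroids are chosen to already be a k-means fixed point so that every point really is assigned to its nearest centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

# five 1-D points; cluster {0.0, 4.5} is spread out, cluster {6, 7, 10} is denser
X = np.array([[0.0], [4.5], [6.0], [7.0], [10.0]])

# these centroids (2.25 and 23/3) are a fixed point of Lloyd's algorithm,
# so KMeans converges immediately and every point keeps its nearest centroid
km = KMeans(n_clusters=2, init=np.array([[2.25], [23.0 / 3.0]]), n_init=1).fit(X)

s = silhouette_samples(X, km.labels_)
# the point at 4.5 is closest to its own centroid (2.25 vs ~7.67), yet its
# only cluster-mate is 0.0, which is farther away on average than the other
# cluster's members, so its silhouette value is negative
```

No unequal weighting is involved; the effect comes purely from the silhouette's member-based distances.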

How to decide the numbers of clusters based on a distance threshold between clusters for agglomerative clustering with sklearn?

With sklearn.cluster.AgglomerativeClustering I need to specify the number of resulting clusters in advance. What I would like to do instead is to merge clusters until a certain maximum distance between clusters is reached, and then stop the clustering process.
Accordingly, the number of clusters might vary depending on the structure of the data. I do not care about the number or the size of the resulting clusters, only that the distance between cluster centroids does not exceed a certain threshold.
How can I achieve this?
This pull request for a distance_threshold parameter in scikit-learn's agglomerative clustering may be of interest:
https://github.com/scikit-learn/scikit-learn/pull/9069
It looks like it'll be merged in version 0.22.
EDIT: See my answer to my own question for an example of implementing single linkage clustering with a distance based stopping criterion using scipy.
Use scipy directly instead of sklearn. IMHO, it is much better.
Hierarchical clustering is a three step process:
Compute the dendrogram
Visualize and analyze
Extract branches
But that doesn't fit the supervised-learning-oriented API of sklearn, which would like everything to implement fit and predict methods...
SciPy has a function for you:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html#scipy.cluster.hierarchy.fcluster
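A minimal sketch of the scipy route on synthetic data: build the dendrogram with `linkage`, then cut it with `fcluster` using `criterion='distance'`, so the number of clusters falls out of the distance threshold rather than being fixed in advance.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# three well-separated synthetic blobs (illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(5.0, 0.1, (20, 2)),
               rng.normal(10.0, 0.1, (20, 2))])

Z = linkage(X, method="single")                     # compute the dendrogram
labels = fcluster(Z, t=2.0, criterion="distance")   # stop merging past distance 2.0
```

Here the threshold of 2.0 recovers the three blobs; with a different threshold (or different data) you would get a different number of clusters, which is exactly the behavior the question asks for.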

Result of overlapping clustering

I'm using the function fcm from Matlab for overlapping clustering. The output of this function is a matrix of size kxn, with k being the number of clusters and n being the number of examples.
Now my problem is: how do I choose clusters for an example? For each example I have membership scores for all clusters, so I can easily find the best-matched cluster, but what about the other clusters?
Many thanks.
It depends on the clustering algorithm, but you can probably interpret those soft clustering values as probabilities. This gives two well-founded options for extracting a hard clustering:
Sample each point's cluster from its cluster distribution (a column in your kxn matrix).
Assign each point to its most probable cluster. This corresponds to the MAP (max a posteriori) solution to the clustering problem.
Option 2 is probably the way to go - a single sample may not be a great representation of what's going on; with MAP, you're at least guaranteed to get something probable.
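Both options can be sketched in a few lines of Python on a hypothetical kxn membership matrix (fcm itself is a Matlab function; the matrix below just stands in for its output, with each column summing to 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical k x n membership matrix: k=3 clusters (rows), n=3 examples (columns)
U = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]])

# option 1: sample each example's cluster from its membership column
sampled = np.array([rng.choice(U.shape[0], p=U[:, j]) for j in range(U.shape[1])])

# option 2 (MAP): assign each example to its highest-membership cluster
map_labels = U.argmax(axis=0)
```

For this matrix, the MAP assignment gives clusters 0, 1, and 2 for the three examples, while the sampled assignment varies from run to run in proportion to the memberships.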

K-medoids with Dynamic Time Warping in RapidMiner

How do I perform K-medoids clustering with Dynamic Time Warping as the distance measure in RapidMiner?
The idea with Dynamic Time Warping is to apply it to time series of different lengths. How can I do that in RapidMiner? I get this error message:
The data contains missing values which is not allowed for KMediods
How can I cluster time series of different lengths?
You could fill the missing values with zeroes; the operator Replace Missing Values does this. I don't know the details of your data nor how RapidMiner calculates DTW distances, so I can't tell whether this approach would yield valid results.
Faced with this, I might use the R extension with the dtw and cluster packages to investigate how distances between different length time series could be used to make clusters. Once you have R working, you can call it from RapidMiner.
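To see why DTW needs no padding at all, here is a minimal textbook dynamic-programming DTW in Python. This is a sketch of the general algorithm, not how RapidMiner or the R dtw package implements it; it shows that two series of different lengths get a well-defined distance directly, with no missing values involved.

```python
import numpy as np

def dtw(a, b):
    """Classic DP Dynamic Time Warping distance between 1-D series a and b.

    The series may have different lengths; no padding is needed.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # D[i, j]: cost of aligning a[:i] with b[:j]
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of match / insertion / deletion
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

For example, `dtw([1, 2, 3], [1, 2, 2, 3])` is 0: the warping path simply matches the single 2 in the first series to both 2s in the second. A pairwise distance matrix built from such a function is exactly what a k-medoids implementation (or the R cluster package) would consume.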

How to do overlapping cluster analysis in Matlab or R?

I have a binary matrix of size 20 by 300. I want to cluster the 20 variables into five or six groups. So far I have used the kmeans and hierarchical clustering algorithms in Matlab with different distance metrics, but both give me non-overlapping clusters. I can see in my data that some of the variables should be located in more than one group. Does anyone know if there is a way to do overlapping clustering in either Matlab or R? Any help is greatly appreciated.
Thanks in advance!
Have a look at Fuzzy clustering in the MATLAB documentation: http://www.mathworks.com/help/toolbox/fuzzy/fp310.html
Look for Weka4OC (Java) or ADPROCLUS (R), which are able to build overlapping clusters.
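If a Python route is also acceptable, any soft-clustering model gives overlapping clusters once you threshold its memberships. This sketch (my own substitution, not one of the tools named above) uses scikit-learn's GaussianMixture on synthetic data: a point joins every cluster whose posterior membership clears a threshold, so boundary points can belong to two groups.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# synthetic two-blob data with an overlap region (illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
               rng.normal(3.0, 1.0, (50, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
memberships = gm.predict_proba(X)   # soft memberships, rows sum to 1

# a point joins every cluster whose membership is at least 0.2,
# so points near the boundary end up in both clusters
clusters_per_point = [np.flatnonzero(m >= 0.2) for m in memberships]
```

The same thresholding idea applies to any soft membership matrix, including the output of Matlab's fcm from the earlier question.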