How to see the data points of most similar cluster(s) to another cluster in k-means clustering? - cluster-analysis

I have a dataset with information on 2500+ houses, such as price, square footage, etc. I wanted to group similar houses on the basis of location, price, number of bedrooms, and number of bathrooms. Taking these 4 as the input parameters, I applied k-means clustering. To determine the number of clusters (i.e. the value of k), I used silhouette analysis and got k = 208 as the value with the best silhouette score, so I divided my dataset into 208 clusters with k-means.
I then created a single sample with a random location, price, number of bedrooms, and number of bathrooms, predicted the cluster number this sample belongs to, and analyzed the data points of that cluster. My problem is that I also want to analyze the data points of the cluster(s) most similar to the sample I created. How can I do this?
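One possible way to approach this, sketched under assumptions: treat "most similar cluster" as "cluster whose centroid is closest to the sample", rank all centroids by Euclidean distance, and then pull the members of the runner-up clusters. The data, the names `rank_clusters_by_distance` and `members_of`, and the choice of Euclidean distance on already-scaled features are all illustrative, not taken from the question. If the model was fitted with scikit-learn, the centroids and assignments would typically come from the fitted model's `cluster_centers_` and `labels_` attributes.

```python
import math

def rank_clusters_by_distance(sample, centroids):
    """Return cluster indices sorted from nearest to farthest centroid."""
    distances = [math.dist(sample, c) for c in centroids]
    return sorted(range(len(centroids)), key=lambda i: distances[i])

def members_of(cluster_id, points, labels):
    """Collect the data points assigned to one cluster."""
    return [p for p, lab in zip(points, labels) if lab == cluster_id]

# Hypothetical, already-scaled toy data (2 features instead of 4).
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
points = [(0.1, 0.0), (0.9, 1.1), (5.2, 4.9), (1.2, 0.8)]
labels = [0, 1, 2, 1]          # cluster assignment of each point
sample = (0.8, 0.9)            # the new, made-up house

ranking = rank_clusters_by_distance(sample, centroids)
own_cluster = ranking[0]       # the cluster the sample would be assigned to
runner_up = ranking[1]         # the next most similar cluster
neighbours = members_of(runner_up, points, labels)
```

Taking `ranking[1:m]` instead of just `ranking[1]` gives the m - 1 most similar clusters. The sample has to be scaled the same way as the training data, otherwise price will dominate the distance.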

Related

Clustering in Weka

I have some data collected using an online survey. Therefore, there are no classes/labels in the data to evaluate clustering results. I am trying to do the clustering in order to cluster participants in some groups for another task.
In the data, I have 10 attributes, such as age and gender, and 111 examples (data points).
This is my first time performing clustering, and it has been difficult to find potential clusters in the data.
Here are the steps I have performed in Weka:
I tried clustering the data using all attributes, all clustering algorithms in Weka (Cobweb, EM, etc.), and different numbers of clusters (1-10). When I visualise the clusters, they don't make any sense and the data are widely spread along the x and y axes.
I applied PCA and selected different attribute combinations according to the ranks obtained from PCA. The best clustering result was obtained using k-means with only 2 combinations of attributes, 3 clusters, and a seed of 7 (sorry, I have no idea what the seed is).
My Questions:
Are the steps I performed to cluster the data correct? If not, please give me some advice.
Is this considered a good clustering result?
How can I optimise or enhance my clusters?
What is meant by the seed in Weka clustering?

Clustering based on correlation

I have a pearson correlation matrix with how different foods are correlated with each other.
I would like to create groups of foods that can be analyzed together, therefore I would like to categorize them into clusters.
I want to cluster these foods into categories using the following criteria:
1) I would like to maximize the correlation within each of the clusters
2) I would like to set a minimum correlation for each group (i.e. each cluster needs to have a correlation of >0.7).
Is there a machine learning algorithm that would be applicable in this case?
Use hierarchical clustering with
Complete linkage
Cut at height 0.7
Transform similarities into distances
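The recipe above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: it assumes the similarity-to-distance transform d = 1 - r (one common choice, not stated in the answer), under which a minimum correlation of 0.7 corresponds to a cut at distance 0.3. The function name and the food data are made up. Because complete linkage measures the distance between the two farthest members of two clusters, stopping at the cut guarantees that every pair inside a cluster meets the minimum correlation.

```python
def complete_linkage_clusters(corr, min_corr):
    """Agglomerative clustering with complete linkage on d = 1 - r.

    Merging stops once the closest pair of clusters is farther apart
    than 1 - min_corr, so every pair inside a cluster has r >= min_corr.
    """
    n = len(corr)
    dist = [[1.0 - corr[i][j] for j in range(n)] for i in range(n)]
    cut = 1.0 - min_corr
    clusters = [[i] for i in range(n)]   # start with one cluster per item

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        pairs = [(linkage(clusters[x], clusters[y]), x, y)
                 for x in range(len(clusters))
                 for y in range(x + 1, len(clusters))]
        d, x, y = min(pairs)
        if d > cut:
            break                        # no pair is similar enough to merge
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Hypothetical correlation matrix for four foods.
foods = ["apple", "pear", "bread", "pasta"]
corr = [[1.0, 0.9, 0.2, 0.1],
        [0.9, 1.0, 0.3, 0.2],
        [0.2, 0.3, 1.0, 0.8],
        [0.1, 0.2, 0.8, 1.0]]

groups = [[foods[i] for i in c] for c in complete_linkage_clusters(corr, 0.7)]
```

For larger matrices, a library implementation of hierarchical clustering will be much faster than this cubic-time loop.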

Clustering techniques for similarity matrix

I have binary data from 128 respondents, based on the digital camera features they selected, where '1' means the feature was selected and '0' means it was not. I have 92 product features in columns and respondents in rows. Each respondent selected exactly 20 features out of the set of 92. I want to create clusters of different user groups based on the features they selected. I have tried some clustering algorithms, such as fuzzy clustering and hierarchical clustering, on this binary data, but they didn't give me any good results and the clusters created were really bad. So I have now computed a Dice coefficient similarity matrix over the respondents, which gives me the similarity score of each respondent with every other respondent. Is it possible to apply a clustering technique to this similarity matrix to get good clusters? Also, what clustering techniques are available that I could apply to this user similarity matrix to identify clusters of users based on their similarity scores? Any suggestions and comments would be really appreciated.
Since your data set is tiny, go with hierarchical clustering.
It can be implemented with distance or with similarity.
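As a rough sketch of the input side of this, the snippet below computes a Dice similarity matrix from 0/1 rows and turns it into a distance matrix with 1 - similarity, which is an assumed (though common) conversion. The three short respondent vectors are invented for illustration, and `dice_similarity` is a hypothetical helper name.

```python
def dice_similarity(a, b):
    """Dice coefficient between two equal-length 0/1 vectors."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    total = sum(a) + sum(b)
    return 2.0 * both / total if total else 0.0

# Hypothetical respondents: each row marks which features were selected.
respondents = [
    [1, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 1],
    [0, 0, 1, 1, 0, 1],
]

similarity = [[dice_similarity(a, b) for b in respondents] for a in respondents]
# Hierarchical clustering implementations usually expect distances.
distance = [[1.0 - s for s in row] for row in similarity]
```

Since every respondent picked exactly 20 features, the Dice score here reduces to the number of shared features divided by 20.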

Determine Cluster Label in K-means

I have a dataset of 150 data points that is actually divided into 3 groups. Each group has its own label.
I run the K-means algorithm to cluster the data.
I need to assign a label to each group created by K-means, so that I can compare the K-means result with the training data.
Could anybody explain how to determine the label of each group?
Read up on cluster evaluation in Wikipedia.
No clustering algorithm will assign a label such as iris_setosa to the cluster, unless you provide the labels to the clustering algorithm somehow (but then it is no longer clustering, actually, but classification).
So you will only have first_cluster, second_cluster, third_cluster type of labels.
There are various measures proposed to compare the structure of the clusters in comparison to the original data set. But usually there will not be a 1:1 correspondence to the original labels.
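One simple comparison of this kind, shown here only as an illustration, is to name each cluster after the most frequent original label among its members and then compute purity (the fraction of points whose label matches their cluster's majority label). This is a swapped-in example of such a measure, not something the answer prescribes; the function names and the nine-point toy data are assumptions.

```python
from collections import Counter, defaultdict

def majority_label_per_cluster(cluster_ids, true_labels):
    """Map each cluster id to the most frequent true label inside it."""
    by_cluster = defaultdict(Counter)
    for cid, lab in zip(cluster_ids, true_labels):
        by_cluster[cid][lab] += 1
    return {cid: counts.most_common(1)[0][0]
            for cid, counts in by_cluster.items()}

def purity(cluster_ids, true_labels):
    """Fraction of points whose label equals their cluster's majority label."""
    mapping = majority_label_per_cluster(cluster_ids, true_labels)
    hits = sum(1 for cid, lab in zip(cluster_ids, true_labels)
               if mapping[cid] == lab)
    return hits / len(true_labels)

# Hypothetical k-means output and original labels.
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2, 2]
true_labels = ["setosa", "setosa", "setosa",
               "versicolor", "versicolor", "virginica",
               "virginica", "virginica", "versicolor"]

mapping = majority_label_per_cluster(cluster_ids, true_labels)
score = purity(cluster_ids, true_labels)
```

Note that two clusters can end up with the same majority label, which is one reason a 1:1 correspondence is not guaranteed.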

MATLAB: Self-Organizing Map (SOM) clustering

I'm trying to cluster some images depending on the angles between body parts.
The features extracted from each image are:
angle1 : torso - torso
angle2 : torso - upper left arm
..
angle10: torso - lower right foot
Therefore the input data is a matrix of size 1057x10, where 1057 stands for the number of images, and 10 stands for angles of body parts with torso.
Similarly a testSet is 821x10 matrix.
I want all the rows in the input data to be grouped into 88 clusters.
Then I will use these clusters to find which cluster each row of the test data falls into.
In previous work, I used K-Means clustering, which is very straightforward: we ask K-Means to cluster the data into 88 clusters, then implement another method that calculates the distance between each row in the test data and the center of each cluster and picks the smallest value. That gives the cluster of the corresponding test data row.
I have two questions:
Is it possible to do this using SOM in MATLAB?
AFAIK SOMs are for visual clustering. But I need to know the actual class of each cluster so that I can later label my test data by calculating which cluster it belongs to.
Do you have a better solution?
A Self-Organizing Map (SOM) is a clustering method that can be considered an unsupervised variation of the Artificial Neural Network (ANN). It uses competitive learning to train the network (nodes compete among themselves to display the strongest activation for a given input).
You can think of SOM as if it consists of a grid of interconnected nodes (square shape, hexagonal, ..), where each node is an N-dim vector of weights (same dimension size as the data points we want to cluster).
The idea is simple: given a vector as input to the SOM, we find the node closest to it, then update its weights and the weights of the neighboring nodes so that they approach the input vector (hence the name self-organizing). This process is repeated for all input data.
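A single training step of this kind could look roughly like the sketch below. It is a simplified illustration under stated assumptions: a rectangular grid stored as a dict from (row, col) to weight vector, a Gaussian neighbourhood function, and fixed `learning_rate` and `radius` values. Real implementations usually shrink both over time, and none of these names come from a particular toolbox.

```python
import math

def best_matching_unit(weights, x):
    """Return the grid position of the node whose weights are closest to x."""
    return min(weights, key=lambda pos: math.dist(weights[pos], x))

def som_update(weights, x, learning_rate, radius):
    """Move the BMU and its grid neighbours toward the input vector x."""
    bmu = best_matching_unit(weights, x)
    for pos, w in weights.items():
        grid_dist = math.dist(pos, bmu)      # distance on the grid, not in data space
        if grid_dist <= radius:
            # Gaussian neighbourhood: the BMU moves most, neighbours less.
            influence = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            weights[pos] = [wi + learning_rate * influence * (xi - wi)
                            for wi, xi in zip(w, x)]
    return bmu

# Hypothetical 2x2 grid with 2-dimensional weight vectors.
weights = {
    (0, 0): [0.0, 0.0], (0, 1): [0.0, 1.0],
    (1, 0): [1.0, 0.0], (1, 1): [1.0, 1.0],
}
x = [0.9, 0.8]
bmu = som_update(weights, x, learning_rate=0.5, radius=1.0)
```

Training consists of repeating `som_update` over all input vectors for several passes.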
The clusters formed are implicitly defined by how the nodes organize themselves and form a group of nodes with similar weights. They can be easily seen visually.
SOMs are in a way similar to the K-Means algorithm, but differ in that we don't impose a fixed number of clusters; instead we specify the number and shape of nodes in the grid that we want to adapt to our data.
Basically, when you have a trained SOM and you want to classify a new test input vector, you simply assign it to the nearest node on the grid (using distance as the similarity measure), called the Best Matching Unit (BMU), and give as the prediction the [majority] class of the vectors belonging to that BMU node.
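The prediction step described above can be sketched like this, assuming the SOM has already been trained and its node weights are available as a plain list of vectors. The node weights, training vectors, and pose labels are invented, and the function names are hypothetical rather than part of any SOM toolbox.

```python
import math
from collections import Counter, defaultdict

def bmu(nodes, x):
    """Index of the node whose weight vector is nearest to x."""
    return min(range(len(nodes)), key=lambda i: math.dist(nodes[i], x))

def label_nodes(nodes, train_x, train_y):
    """Give each node the majority class of the training vectors it wins."""
    votes = defaultdict(Counter)
    for x, y in zip(train_x, train_y):
        votes[bmu(nodes, x)][y] += 1
    return {i: counts.most_common(1)[0][0] for i, counts in votes.items()}

def predict(nodes, node_labels, x):
    """Class of the BMU for x, or None if that node won no training vectors."""
    return node_labels.get(bmu(nodes, x))

# Hypothetical trained node weights and labelled training vectors.
nodes = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
train_x = [[0.1, 0.1], [0.2, 0.0], [0.9, 1.2], [1.1, 0.8], [0.0, 0.3]]
train_y = ["sitting", "sitting", "standing", "standing", "standing"]

node_labels = label_nodes(nodes, train_x, train_y)
prediction = predict(nodes, node_labels, [0.05, 0.05])
```

Nodes that never win a training vector have no class, so a fallback (for example, the nearest labelled node) is needed for test vectors that land on them.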
For MATLAB, you can find a number of toolboxes that implement SOM:
The Neural Network Toolbox from MathWorks can be used for clustering using SOM (see the nctool clustering tool).
Also worth checking out is the SOM Toolbox.