Pass a new DataPoint and predict which cluster it belongs to in kmeans clustering (autompg dataset) - cluster-analysis

How can we pass a new DataPoint and predict which cluster it belongs to in kmeans clustering (using auto mpg dataset)?

Related

KMeans Clustering Can it predict unseen data?

I have been trying to fit KMeans on my training set and then predict clusters for my test set, but it hasn't been working, and I have been at it for at least a week now. Am I misunderstanding how KMeans is used? I am told it's unsupervised. Does that mean it cannot be used to predict clusters for new data, even once it knows how the training data is clustered?
Thank you.
Yes, you can use k-means to predict clusters. Once you have clustered your training data, you get one cluster center for each of the chosen number of clusters. E.g., if you have chosen k=3, your dataset is divided into 3 clusters and hence you get 3 cluster centers.
You can then take your test data and, for each test data point, compute the Euclidean distance to each of the three cluster centers. The center with the minimum distance gives the predicted cluster.
If you are using scikit-learn, KMeans also has a predict method, which does essentially the same thing.
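To make the nearest-center rule above concrete, here is a minimal sketch. The synthetic `train` and `test` arrays are stand-ins I made up (not the auto-mpg data), and `n_init` and `random_state` are only set to keep the run reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the training data: three well-separated blobs (assumption).
rng = np.random.default_rng(0)
train = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0.0, 5.0, 10.0)])
# A few unseen points, one near each blob.
test = np.array([[0.2, -0.1], [5.1, 4.8], [9.7, 10.3]])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(train)

# Euclidean distance from every test point to every cluster center: shape (n_test, k).
distances = np.linalg.norm(test[:, None, :] - model.cluster_centers_[None, :, :], axis=2)

# The predicted cluster is the index of the closest center.
manual_labels = distances.argmin(axis=1)

# scikit-learn's predict makes the same nearest-center assignment.
sklearn_labels = model.predict(test)
print(manual_labels, sklearn_labels)
```

The cluster indices themselves are arbitrary (they depend on initialization), so only the agreement between the two label arrays is meaningful.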
KMeans is an unsupervised ML model. That means there is no labelled data for training or for prediction. It takes the training data, groups it into clusters according to the model settings, and assigns a cluster label to each cluster.
You can then pass new values to the trained model so that it predicts the closest cluster label for the given input. Here is an example Python code snippet.
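import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(data) # data is your training data; learn its mean and std once
model = KMeans(n_clusters=2)
model = model.fit(scaler.transform(data)) # cluster the standardized training data
print(model.labels_) # prints the cluster index of each training row. you can map these to meaningful labels
model.predict(scaler.transform(test)) # test is your data to predict; reuse the training scaler, because scaling test on its own statistics would shift it relative to the centers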

How to see the data points of most similar cluster(s) to another cluster in k-means clustering?

I have a dataset with information on 2500+ houses, such as price, sqft area, etc. I wanted to group similar houses on the basis of location, price, number of bedrooms, and number of bathrooms. Taking these 4 as the input features, I applied k-means clustering. To determine the number of clusters (i.e. the value of k), I used silhouette analysis and got k = 208 as the value with the best silhouette score, so I divided my dataset into 208 clusters with k-means.
I then created a single sample with a random location, price, number of bedrooms, and number of bathrooms, predicted the cluster number this sample belongs to, and analyzed the data points of that cluster. My problem is that I also want to analyze the data points of the cluster(s) most similar to the sample I created. How can we do this?
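One possible approach, sketched under assumptions: rank all cluster centers by their distance to the sample and look at the rows assigned to the next-closest clusters. The `houses` array below is random placeholder data standing in for the scaled house features, and k=8 and "three neighbouring clusters" are arbitrary choices for illustration. `KMeans.transform` returns the distance from each input row to every cluster center.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
houses = rng.normal(size=(300, 4))  # placeholder for the scaled house features (assumption)
model = KMeans(n_clusters=8, n_init=10, random_state=0).fit(houses)

sample = rng.normal(size=(1, 4))  # the single made-up sample

# Distance from the sample to every cluster center, then cluster ids sorted by distance.
dist_to_centers = model.transform(sample)[0]
ranked = np.argsort(dist_to_centers)

own_cluster = ranked[0]           # same cluster that predict() would return
neighbour_clusters = ranked[1:4]  # the next three closest clusters

# Training rows that were assigned to those neighbouring clusters.
neighbour_rows = houses[np.isin(model.labels_, neighbour_clusters)]
print(own_cluster, neighbour_clusters, neighbour_rows.shape)
```

If "similar" should instead mean similar to the sample's own cluster rather than to the sample, the same ranking can be done on the distances between `model.cluster_centers_[own_cluster]` and the other centers.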

clustering vs fitting a mixture model

I have a question about using a clustering method vs fitting the same data with a distribution.
Assuming that I have a dataset with 2 features (feat_A and feat_B) and let's assume that I use a clustering algorithm to divide the data in an optimal number of clusters...say 3.
My goal is to assign to each input data point [feat_Ai, feat_Bi] a probability (or something similar) that the point belongs to cluster 1, 2, or 3.
a. First approach with clustering:
I cluster the data in the 3 clusters and I assign to each point the probability of belonging to a cluster depending on the distance from the center of that cluster.
b. Second approach using mixture model:
I fit a mixture model or mixture distribution to the data. Data are fit to the distribution using an expectation maximization (EM) algorithm, which assigns posterior probabilities to each component density with respect to each observation. Clusters are assigned by selecting the component that maximizes the posterior probability.
In my problem I find the cluster centers (or I fit the model if approach b. is used) with a subsample of data. Then I have to assign a probability to a lot of other data... I would like to know in presence of new data which approach is better to use to still have meaningful assignments.
I would go for a clustering method, for example k-means, because:
If the new data come from a distribution different from the one used to create the mixture model, the assignment could be incorrect.
With new data the posterior probability changes.
The clustering method minimizes the variance within the clusters in order to find a kind of optimal separation border, whereas the mixture model takes the variance of the data into consideration to create the model (I am not sure that the resulting clusters are separated in an optimal way).
More info about the data:
Features shouldn't be assumed dependent.
Feat_A represents the duration of a physical activity and Feat_B the step count. In principle we could say that the step count increases with a higher duration of the activity, but this is not always true.
Please help me to think and if you have any other point please let me know..
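For comparison, here is a rough sketch of approaches a and b side by side in scikit-learn. The data are synthetic placeholders for [feat_A, feat_B], and turning k-means distances into scores with a softmax over negative distances is just one heuristic I picked for illustration; it is not a calibrated probability.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Placeholder subsample used to fit both models, and some "new" data (assumptions).
subsample = np.vstack([rng.normal(loc=c, scale=1.0, size=(100, 2)) for c in (0.0, 6.0, 12.0)])
new_data = rng.normal(loc=6.0, scale=3.0, size=(5, 2))

# Approach a: k-means, then convert distances to the 3 centers into scores.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(subsample)
d = km.transform(new_data)  # shape (n_new, 3): distance to each center
w = np.exp(-(d - d.min(axis=1, keepdims=True)))  # heuristic softmax over negative distance
km_scores = w / w.sum(axis=1, keepdims=True)

# Approach b: Gaussian mixture fitted by EM, posterior probability per component.
gmm = GaussianMixture(n_components=3, random_state=0).fit(subsample)
gmm_post = gmm.predict_proba(new_data)

print(np.round(km_scores, 3))
print(np.round(gmm_post, 3))
```

In both cases the fitted parameters stay fixed when scoring new points, so the score of a given point only changes if the model is refit.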

Find the cluster of an input pattern

Suppose that I performed clustering of iris.data using the SOM Toolbox in Matlab. After clustering, I have an input vector and I want to see which cluster this input belongs to. Any tips on how to map an input pattern onto a trained SOM map?
Once you have trained the SOM, you can classify a new input vector by assigning it to the nearest node in the grid, i.e. the node whose weights are closest to the input (the Best Matching Unit, BMU). The predicted class of the test instance is then the majority class of the training vectors belonging to that BMU node.
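The BMU lookup itself is only a nearest-neighbour search over the node weights. Below is a small NumPy sketch of that step. It does not use the SOM Toolbox API; `best_matching_unit` is a hypothetical helper, and the 2x2 weight grid is hand-written for illustration rather than trained.

```python
import numpy as np

def best_matching_unit(weights, x):
    """Return the (row, col) of the grid node whose weight vector is closest to x."""
    # weights has shape (rows, cols, n_features); x has shape (n_features,).
    dists = np.linalg.norm(weights - x, axis=2)
    return np.unravel_index(np.argmin(dists), dists.shape)

# Hypothetical "trained" 2x2 map with 4-dimensional weights (4 features, as in iris).
weights = np.array([
    [[5.0, 3.4, 1.5, 0.2], [5.9, 2.8, 4.3, 1.3]],
    [[6.6, 3.0, 5.6, 2.1], [5.5, 2.5, 4.0, 1.2]],
])

x = np.array([5.1, 3.5, 1.4, 0.2])  # new input pattern
bmu = best_matching_unit(weights, x)
print(bmu)
```

With a real trained map you would export the node weight vectors and reshape them to `(rows, cols, n_features)` before calling the helper.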

MATLAB: Self-Organizing Map (SOM) clustering

I'm trying to cluster some images depending on the angles between body parts.
The features extracted from each image are:
angle1 : torso - torso
angle2 : torso - upper left arm
..
angle10: torso - lower right foot
Therefore the input data is a matrix of size 1057x10, where 1057 stands for the number of images, and 10 stands for angles of body parts with torso.
Similarly a testSet is 821x10 matrix.
I want all the rows in input data to be clustered with 88 clusters.
Then I will use these clusters to find which cluster each row of the test data falls into.
In previous work I used K-Means clustering, which is very straightforward: we just ask K-Means to cluster the data into 88 clusters, then implement another method that calculates the distance between each row of the test data and the center of each cluster and picks the smallest value. That gives the cluster of the corresponding test data row.
I have two questions:
Is it possible to do this using SOM in MATLAB?
AFAIK SOMs are for visual clustering. But I need to know the actual class of each cluster so that I can later label my test data by calculating which cluster it belongs to.
Do you have a better solution?
The Self-Organizing Map (SOM) is a clustering method that can be seen as an unsupervised variant of the Artificial Neural Network (ANN). It uses competitive learning to train the network (nodes compete among themselves to show the strongest activation for a given input).
You can think of SOM as if it consists of a grid of interconnected nodes (square shape, hexagonal, ..), where each node is an N-dim vector of weights (same dimension size as the data points we want to cluster).
The idea is simple: given a vector as input to the SOM, we find the node closest to it, then update its weights and the weights of the neighboring nodes so that they move towards the input vector (hence the name self-organizing). This process is repeated for all input data.
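A single update step of this kind might look like the following NumPy sketch. The Gaussian neighbourhood function, the fixed `learning_rate`, and the fixed `sigma` are assumptions for illustration; practical SOM implementations usually shrink both over the course of training.

```python
import numpy as np

def som_update(weights, x, learning_rate=0.5, sigma=1.0):
    """One SOM step: find the closest node, then pull it and its grid neighbours towards x."""
    rows, cols, _ = weights.shape
    # Closest node (best matching unit) in weight space.
    dists = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # Squared distance of every node to the BMU, measured on the grid (not in weight space).
    rr, cc = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    grid_d2 = (rr - bmu[0]) ** 2 + (cc - bmu[1]) ** 2
    # Gaussian neighbourhood: 1 at the BMU, decaying for nodes further away on the grid.
    h = np.exp(-grid_d2 / (2 * sigma ** 2))
    return weights + learning_rate * h[:, :, None] * (x - weights)

rng = np.random.default_rng(0)
weights = rng.random((3, 3, 2))  # 3x3 grid of 2-dimensional weight vectors
x = np.array([0.9, 0.1])         # one input vector
updated = som_update(weights, x)

before = np.linalg.norm(weights - x, axis=2)
after = np.linalg.norm(updated - x, axis=2)
print(np.round(before - after, 3))  # how far each node moved towards x
```

Training repeats this step over all input vectors, typically for several passes.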
The clusters formed are implicitly defined by how the nodes organize themselves and form a group of nodes with similar weights. They can be easily seen visually.
SOMs are in a way similar to the K-Means algorithm, but differ in that we don't impose a fixed number of clusters; instead we specify the number and shape of the nodes in the grid that we want to adapt to our data.
Basically when you have a trained SOM, and you want to classify a new test input vector, you simply assign it to the nearest (distance as a similarity measure) node on the grid (Best Matching Unit BMU), and give as prediction the [majority] class of the vectors belonging to that BMU node.
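As a sketch of that last step in NumPy (not the API of either MATLAB toolbox): label each node with the majority class of the training vectors that map to it, then classify a new vector through its BMU. The helper names, the tiny 1x2 weight grid, and the labels are made up for illustration.

```python
import numpy as np
from collections import Counter, defaultdict

def bmu_index(weights, x):
    """Grid position (row, col) of the node whose weights are closest to x."""
    dists = np.linalg.norm(weights - x, axis=2)
    return tuple(int(i) for i in np.unravel_index(np.argmin(dists), dists.shape))

def label_nodes(weights, X_train, y_train):
    """Map each node that wins at least one training vector to its majority class."""
    votes = defaultdict(Counter)
    for x, y in zip(X_train, y_train):
        votes[bmu_index(weights, x)][y] += 1
    return {node: counts.most_common(1)[0][0] for node, counts in votes.items()}

def classify(weights, node_labels, x):
    """Predicted class of x, or None if no training vector mapped to its BMU."""
    return node_labels.get(bmu_index(weights, x))

# Hypothetical trained map: a 1x2 grid with 2-dimensional weights.
weights = np.array([[[0.0, 0.0], [1.0, 1.0]]])
X_train = np.array([[0.1, 0.0], [0.0, 0.2], [0.9, 1.1], [1.0, 0.8], [0.2, 0.1]])
y_train = ["a", "a", "b", "b", "b"]

node_labels = label_nodes(weights, X_train, y_train)
prediction = classify(weights, node_labels, np.array([0.05, 0.05]))
print(node_labels, prediction)
```

If the training data have no class labels, as in the question, the BMU position itself can serve as the cluster id for a test row.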
For MATLAB, you can find a number of toolboxes that implement SOM:
The Neural Network Toolbox from MathWorks can be used for clustering using SOM (see the nctool clustering tool).
Also worth checking out is the SOM Toolbox.