KMeans Clustering Can it predict unseen data? - classification

I have been attempting to fit my training set onto the KMeans Cluster and predict it onto the testing test however it hasn't been working for me trying for atleast a week now. I'm curious if maybe I'm interpreting how KMeans is used? I am told its unsupervised. Does that mean that It can not be use to predict clusters if it knows how the training data is clustered?
Thank you.

Yes you can use k-means to predict clusters. Once you have clustered your training data, you will receive cluster centers for the chosen number of clusters. E.g., if you have chosen k=3, your dataset will be divided into 3 clusters and hence you will receive 3 cluster centers.
Therefore, now you can take your test data and for each test data point you can find the euclidean distance among the the three cluster centers. The one for which the distance is minimum will be the predicted cluster for you.
If you are using scikit-learn there is also a predict method with K-Means, which should do the above basically.

The KMeans Cluster is unsupervised ML model. That means there won't be any labelled data for training and prediction also. It takes training data and based on model tuning it tries cluster the training data and assign cluster labels for each cluster.
And on this trained model you can pass values so that it predicts the optimal cluster label for given input. Here is example python code snippet.
import numpy as np
import matplotlib.pyplot as pyplot
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale
model = KMeans(n_clusters=2)
model = model.fit(scale(data)) # data is your training data
print(model.labels_) # prints labels for clusters. you can map to meaningful labels
model.predict(scale(test)) # test is your data to predict the cluster

Related

Using clustering classification as regression feature?

I am attempting to use KMeans clustering to create a feature for an XGBOOST regression. The problem is, I am not sure if there is data leakage. It is data with a date, so right now I am clustering on the first 70% of data sorted by date, and using the same as my training set.
Included in the clustering is my target variable. Using the cluster as a feature provides a huge boost to test scores, so I worry that this is causing data leakage. However, the clusters used for test scores are unseen data in the test set.
Is this valid, or is it causing data leakage? Thank you

Clustering in Weka

I have some data collected using an online survey. Therefore, there are no classes/labels in the data to evaluate clustering results. I am trying to do the clustering in order to cluster participants in some groups for another task.
In the data, I have 10 attributes like: Age, Gender, etc., and 111 examples or data-points.
It's my first time to perform clustering and it's been difficult to find potential clusters in the data.
Here are the steps I have performed in Weka:
I have tried to cluster the data using all attributes, all types of clustering in Weka (like cobweb, EM .. etc) and using different cluster numbers (1-10). And When I visualise the clusters, they don't make any sense and the data are widely spread between x and y axis.
I have applied PCA and selected different number of attribute combinations according to the ranks obtained in PCA. The best clustering result was obtained using k-means and with only 2 combinations of attributes and the number of clusters selected was 3, and seed was 7 (sorry, I have no idea what the seed is).
My Questions:
Are the steps I performed to cluster data correct? If not please give me advice/s
Is this considered as a good clustering result?
How can I optimise or enhance my clusters?
What is meant with seed in Weka clustering?

Hybrid SOM (with MLP)

Could someone please provide some information on how to properly combine a self organizing map with a multilayer perceptron?
I recently read some articles about this technique in comparison to regular MLPs and it performed way better in prediction tasks. So, I want to use the SOM as front-end for dimension reduction by clustering the input data and pass the results to an MLP back-end.
My current idea of implementing it is it to train the SOM with a couple of training sets and to determine the clusters. Afterwards, I initialize the MLP with as many input units as SOM clusters. Next step would be to train the MLP using the SOM's output (which value?...weights of BMU?) as in input for the network (SOM's Output for the Cluster matching Input Unit and zeros for any other Input Units?).
There is no single way of doing that. Let me list some possibilities:
The one you describe. But then, your MLP will need to have K*D inputs, where K is the number of clusters and D is the input dimension. There is no dimensionality reduction.
Similar to your idea, but instead of using the weights, just send 1 for the BMU and 0 for the remaining clusters. Then your MLP will need K inputs.
Same as above, but instead of 1 or 0, send the distance from the input vector to each cluster.
Same as above, but instead of the distance, compute a Gaussian activation for each cluster.
Since the SOM preserves topology, send only the 2D coordinates of the BMU (possibly normalized between 0 and 1). Then your MLP will need only 2 inputs and you achieve real extreme dimensionality reduction.
You can read about those ideas and some more here: Principal temporal extensions of SOM: Overview. It is not about feeding the output of a SOM to a MLP, but a SOM to itself. But you'll be able to understand the various possibilities when trying to produce some output from a SOM.

Find the cluster of an input pattern

Suppose that I performed clustering of iris.data using SOM Toolbox in Matlab. After clustering, I have an input vector and I want to see which cluster this input belongs to? Any tips please on how to map an input pattern into a trained SOM map.
Once you have trained the SOM, you can classify a new input vector by assigning it to the nearest node in the grid (Best Matching Unit BMU) which have the closest weights. We predict the majority class of the training vectors belonging to that BMU node as the target class of the test instance.

MATLAB: Self-Organizing Map (SOM) clustering

I'm trying to cluster some images depending on the angles between body parts.
The features extracted from each image are:
angle1 : torso - torso
angle2 : torso - upper left arm
..
angle10: torso - lower right foot
Therefore the input data is a matrix of size 1057x10, where 1057 stands for the number of images, and 10 stands for angles of body parts with torso.
Similarly a testSet is 821x10 matrix.
I want all the rows in input data to be clustered with 88 clusters.
Then I will use these clusters to find which clusters does TestData fall into?
In a previous work, I used K-Means clustering which is very straightforward. We just ask K-Means to cluster the data into 88 clusters. And implement another method that calculates the distance between each row in test data and the centers of each cluster, then pick the smallest values. This is the cluster of the corresponding input data row.
I have two questions:
Is it possible to do this using SOM in MATLAB?
AFAIK SOM's are for visual clustering. But I need to know the actual class of each cluster so that I can later label my test data by calculating which cluster it belongs to.
Do you have a better solution?
Self-Organizing Map (SOM) is a clustering method considered as an unsupervised variation of the Artificial Neural Network (ANN). It uses competitive learning techniques to train the network (nodes compete among themselves to display the strongest activation to a given data)
You can think of SOM as if it consists of a grid of interconnected nodes (square shape, hexagonal, ..), where each node is an N-dim vector of weights (same dimension size as the data points we want to cluster).
The idea is simple; given a vector as input to SOM, we find the node closet to it, then update its weights and the weights of the neighboring nodes so that they approach that of the input vector (hence the name self-organizing). This process is repeated for all input data.
The clusters formed are implicitly defined by how the nodes organize themselves and form a group of nodes with similar weights. They can be easily seen visually.
SOM are in a way similar to the K-Means algorithm but different in that we don't impose a fixed number of clusters, instead we specify the number and shape of nodes in the grid that we want it to adapt to our data.
Basically when you have a trained SOM, and you want to classify a new test input vector, you simply assign it to the nearest (distance as a similarity measure) node on the grid (Best Matching Unit BMU), and give as prediction the [majority] class of the vectors belonging to that BMU node.
For MATLAB, you can find a number of toolboxes that implement SOM:
The Neural Network Toolbox from MathWorks can be used for clustering using SOM (see the nctool clustering tool).
Also worth checking out is the SOM Toolbox