When to use the k-means clustering algorithm? - cluster-analysis

Can I use k-means algorithm for a single attribute?
Is there any relationship between the attributes and the number of clusters?
I have a single attribute, performance, and I want to classify the data into 3 clusters: poor, medium, and good.
Is it possible to create 3 clusters with one attribute?

K-means is useful when you already have an idea of how many clusters actually exist in your space. Its main benefit is its speed. There is a relationship between the attributes and the number of observations in your dataset.
Sometimes a dataset can suffer from the curse of dimensionality, where the number of variables/attributes is much greater than the number of observations. Basically, in high-dimensional spaces with few observations, it becomes difficult to separate the observations.
You can certainly have three clusters with one attribute. Consider a quantitative attribute with 7 observations:
1
2
100
101
500
499
501
Notice there are three clusters in this sample, centered at 1.5, 100.5, and 500.
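To make this concrete, here is a minimal sketch (assuming NumPy and scikit-learn are available) that recovers exactly those three centers from the seven observations above:

```python
# Minimal sketch: k-means with k=3 on a single attribute.
import numpy as np
from sklearn.cluster import KMeans

x = np.array([1, 2, 100, 101, 500, 499, 501], dtype=float).reshape(-1, 1)  # 7 observations, 1 attribute
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x)

print(km.labels_)           # cluster assignment for each observation
print(km.cluster_centers_)  # approximately 1.5, 100.5, and 500
```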

If you have one-dimensional data, search Stack Overflow for better approaches than k-means.
K-means and other clustering algorithms shine when you have multivariate data. They will "work" with 1-dimensional data, but they are not very smart anymore.
One-dimensional data is ordered. If you sort your data (or it is even already sorted), it can be processed much more efficiently than with k-means. The complexity of k-means is "just" O(n*k*i), but if your data is sorted and 1-dimensional, you can actually improve k-means to roughly O(k*i). Sorting comes at a cost, but there are very good sort implementations everywhere...
Plus, for 1-dimensional data there are a lot of statistics you can use that are not very well researched or tractable in higher dimensions. One statistic you really should try is kernel density estimation. Maybe also try Jenks Natural Breaks Optimization.
However, if you want to just split your data into poor/medium/high, why don't you just use two thresholds?
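For illustration, a minimal sketch of the two-threshold idea, assuming NumPy; the scores and the cut-offs (40 and 70) are made up:

```python
# Sketch: split a single attribute into poor/medium/good with two thresholds.
import numpy as np

performance = np.array([12, 35, 47, 58, 63, 71, 88, 95], dtype=float)  # hypothetical scores
thresholds = [40, 70]                                                   # assumed cut-offs

labels = np.array(["poor", "medium", "good"])[np.digitize(performance, thresholds)]
print(labels)  # ['poor' 'poor' 'medium' 'medium' 'medium' 'good' 'good' 'good']
```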

As others have answered already, k-means requires prior information about the number of clusters. This may not appear very helpful at first, but I will cite the following scenario, which I worked with and found very helpful.
Color segmentation
Think of a picture with 3 channels of information (red, green, blue). You want to quantize the colors into 20 different bands for the purpose of dimensionality reduction. This is called vector quantization.
Every pixel is a 3 dimensional vector with Red, Green and Blue components. If the image is 100 pixels by 100 pixels then you have 10,000 vectors.
R,G,B
128,100,20
120,9,30
255,255,255
128,100,20
120,9,30
.
.
.
Depending on the type of analysis you intend to perform, you may not need all the R, G, B values. It might be simpler to deal with an ordinal representation.
In the above example, the RGB values might be assigned a flat integer representation:
R,G,B
128,100,20 => 1
120,9,30 => 2
255,255,255=> 3
128,100,20 => 1
120,9,30 => 2
You run the k-means algorithm on these 10,000 vectors and specify 20 clusters. Result: you have reduced your image colors to 20 broad buckets. Obviously some information is lost. However, the intuition for this loss being acceptable is that when the human eye is gazing out over a patch of green meadow, we are unlikely to register all 16 million RGB colours.
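A minimal sketch of this vector quantization, assuming NumPy, Pillow, and scikit-learn; "photo.jpg" is a placeholder file name, not from the original answer:

```python
# Sketch: quantize an image to 20 colors by clustering its (R, G, B) pixel vectors.
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("photo.jpg").convert("RGB"), dtype=float)
pixels = img.reshape(-1, 3)                     # every pixel is a 3-d (R, G, B) vector

km = KMeans(n_clusters=20, n_init=4, random_state=0).fit(pixels)
quantized = km.cluster_centers_[km.labels_]     # replace each pixel by its cluster center
out = quantized.reshape(img.shape).astype(np.uint8)
Image.fromarray(out).save("photo_20_colors.jpg")
```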
YouTube video
https://www.youtube.com/watch?v=yR7k19YBqiw
I have embedded key pictures from this video for your understanding. Attention! I am not the author of this video.
[Image: original image]
[Image: after segmentation using k-means]

Yes, it is possible to use clustering with a single attribute.
No, there is no known relation between the number of clusters and the attributes. However, some studies suggest taking the number of clusters as k = \sqrt{n/2}, where n is the total number of items. This is just one rule of thumb; different studies have suggested different cluster numbers. The best way to determine the cluster number is to select the one that minimizes intra-cluster distance and maximizes inter-cluster distance. Having background knowledge is also important.
The problem you are looking at, with a performance attribute, is more a classification problem than a clustering problem:
Difference between classification and clustering in data mining?
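As a rough sketch of the "minimize intra-cluster distance" criterion, one common approach is to compute the within-cluster sum of squares for several values of k and look for an elbow; scikit-learn is assumed and the data below is synthetic:

```python
# Sketch of the elbow method: track within-cluster sum of squares (inertia) as k grows.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # illustrative data only

for k in range(1, 9):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 1))  # look for the k where the drop in inertia levels off
```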

With only one attribute, you don't need to do k-means. First, I would like to know whether your attribute is numerical or categorical.
If it's numerical, it would be easier to set up two thresholds. And if it's categorical, things get much easier: just specify which classes belong to poor, medium, or good. Then simple data frame operations will work.
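A minimal sketch of those data frame operations with pandas; the class labels and the mapping are assumptions for illustration:

```python
# Sketch: map categorical performance classes directly to poor/medium/good.
import pandas as pd

df = pd.DataFrame({"performance": ["A", "C", "E", "B", "D"]})          # assumed classes
mapping = {"A": "good", "B": "good", "C": "medium", "D": "poor", "E": "poor"}
df["bucket"] = df["performance"].map(mapping)
print(df)
```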
Feel free to send me comments if you are still confused.
Rowen

Related

What is the importance of clustering?

During unsupervised learning we do cluster analysis (like k-means) to bin the data into a number of clusters.
But what is the use of this clustered data in a practical scenario?
I think during clustering we are losing information about the data.
Are there some practical examples where clustering could be beneficial?
The information loss can be intentional. Here are a few examples:
PCM signal quantization (Lloyd's k-means publication). You know that a certain number (say 10) of different signals are transmitted, but with distortion. Quantizing removes the distortion and re-extracts the original 10 different signals. Here, you lose the error and keep the signal.
Color quantization (see Wikipedia). To reduce the number of colors in an image, a quite nice method uses k-means (usually in HSV or Lab space). k is the number of desired output colors. Information loss here is intentional, to better compress the image. k-means attempts to find the least-squared-error approximation of the image with just k colors.
When searching for motifs in time series, you can also use quantization such as k-means to transform your data into a symbolic representation (a minimal sketch of this follows the list below). The bag-of-visual-words approach that was the state of the art for image recognition prior to deep learning also used this.
Explorative data mining (clustering; one may argue that the above use cases are not data mining / clustering, but quantization). If you have a data set of a million points, which points are you going to investigate? Clustering methods try to split the data into groups that are supposed to be more homogeneous within and more different from one another. Then you don't have to look at every object, but only at some from each cluster, to hopefully learn something about the whole cluster (and your whole data set). Centroid methods such as k-means can even provide a "prototype" for each cluster, although it is a good idea to also look at other points within the cluster. You may also want to do outlier detection and look at some of the unusual objects. This scenario is somewhere in between sampling representative objects and reducing the data set size to become more manageable. The key difference from the points above is that the result is usually not "operationalized" automatically; because explorative clustering results are unreliable (and thus require many iterations), they need to be analyzed manually.
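Here is the promised sketch of the time-series symbolization mentioned above, assuming NumPy and scikit-learn; the window length and the number of symbols are arbitrary choices:

```python
# Sketch: quantize fixed-length segments of a time series into k-means "symbols".
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20 * np.pi, 2000)) + 0.1 * rng.standard_normal(2000)

window = 20
segments = np.array([series[i:i + window] for i in range(0, len(series) - window, window)])

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(segments)
symbols = km.labels_  # each segment is now one of 8 symbols
print(symbols[:25])
```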

Comparing k-means clustering

I have 150 images, 15 each of 10 different people. So basically I know which images should belong together, if clustered.
These images have 73 dimensions (feature vectors), and I clustered them into 10 clusters using the kmeans function in MATLAB.
Later, I processed these 150 data points, reduced their dimension from 73 to 3 for my work, and applied the same kmeans function to them.
I want to compare the results obtained on these data sets (processed and unprocessed) by applying the same k-means function, and I wish to know whether the processing that reduced the data to a lower dimension improves the k-means clustering or not.
I thought comparing the variance of each cluster could be one parameter for comparison, but I am not sure if I can directly compare and evaluate my results (within-cluster sum of distances etc.), as the two cases have different dimensions. Could anyone please suggest a way to compare the k-means results, some way to normalize them, or any other comparison that I can make?
I can think of three options. I am unaware of any well-developed methodology to do this specifically with k-means clustering.
1. Look at the confusion matrices between the two approaches.
2. Compare the Mahalanobis distances between the clusters, and between items in clusters and their nearest other clusters.
3. Look at the Voronoi cells and see how far your points are from the boundaries of the cells.
The problem with 3 is that the distance metrics get skewed: 3-D distances and 73-D distances are not commensurate, so I'm not a fan of that approach. I'd recommend reading some books on k-means if you are adamant about that path; rank speculation is fun, but standing on the shoulders of giants is better.
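As a sketch of option 1 (plus one dimension-independent score), assuming scikit-learn; the tiny label arrays below stand in for the real 150-image assignments and the known person identities:

```python
# Sketch: compare two clusterings against known ground truth, independent of feature dimension.
import numpy as np
from sklearn.metrics import confusion_matrix, adjusted_rand_score

truth      = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])   # known person identities (placeholder)
labels_73d = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])   # k-means labels on 73-d features (placeholder)
labels_3d  = np.array([1, 1, 0, 0, 0, 0, 2, 2, 2])   # k-means labels on reduced 3-d features (placeholder)

# Confusion matrices (rows: true identity, columns: assigned cluster).
print(confusion_matrix(truth, labels_73d))
print(confusion_matrix(truth, labels_3d))

# Adjusted Rand index is label- and dimension-independent: higher means better agreement.
print(adjusted_rand_score(truth, labels_73d))
print(adjusted_rand_score(truth, labels_3d))
```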

Do you have to normalize the data for a neural net if it is already scaled?

I'm currently trying to preprocess my training data ready for a multi-layered perceptron. The data I downloaded consists of 20,000 instances and 16 attributes, all of which are coordinate values of pixels as part of letter recognition. The data itself has already been scaled from its original form into values between 0 - 15 before being published.
However, since it's already been scaled, is it still necessary to perform normalization on it? I've tried to read around and look at previous examples, but have come up with conflicting points. Some papers state that scaling is a form of normalization, whereas others say that normalization means bringing the values into the range 0-1.
Since I'm using WEKA I've attempted their normalize filter during a pre-processing stage and it caused the accuracy to decrease by around 2% which makes me think it could be unnecessary. But again, I've read that it may only have a positive effect later in training.
So my question is:
What is the difference between scaling to a range such as 0-15 and normalizing it? Should I still normalize it on top of the scaling that's already been done?
In your case you do not need to. Normalizing data is done so that an attribute on a different scale does not dominate the outcome of distance operations and, ultimately, the clustering or classification results.
For example, say you have two attributes, weight and income. Weight will be between 10 and 200 kg at most, while income can range from $10,000 to $20,000,000. But most people's incomes will be between 10,000 and 120,000, and values above that will be outliers. If you do not normalize your data before using a multi-layer perceptron, the outcome of your neural network will be decided by these outliers.
In your case this situation is already mitigated by the scaling, so you do not need normalization.
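To make the scaling-vs-normalization distinction concrete, a minimal sketch assuming NumPy; the small array stands in for the 0-15 letter-recognition attributes:

```python
# Sketch: attributes already on a common 0-15 scale only need a linear rescaling to reach 0-1.
import numpy as np

X = np.array([[0, 7, 15], [3, 12, 1], [15, 0, 8]], dtype=float)  # values already scaled to 0-15
X_01 = X / 15.0                                                  # "normalized" to the 0-1 range

print(X_01)
# Relative distances between rows are unchanged, which is why normalizing on top of
# the existing scaling makes little difference for this data set.
```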

MANOVA - huge matrices

First, sorry for the tag "ANOVA"; this is about MANOVA (yet to become a tag...).
The tutorials I found all use small matrices, and following them is not feasible for big matrices, which is the case in many studies.
I have 2 matrices for my 14 sampling points: one for the organism IDs (4493 IDs) and the other for the chemical profile (190 variables).
The 2 matrices were correlated by Spearman correlation and, based on the correlation, split into 4 clusters (k-means on squared Euclidean distances), with the IDs in the rows and the chemical profile in the columns.
The differences among them are somewhat clear, but to show this in a more robust way I want to perform a MANOVA on the differences between and within the clusters - that is a key factor for the conclusion, of course.
The problem is that, after 8 hours of trying, I could not even get the data into a format acceptable for the analysis.
The tutorials I found are designed for very few variables, and even when I think I have overcome that, the program says that my matrices can't be compared because of their difference in length.
Each cluster has its own set of IDs sharing all same set of variables.
What should I do?
Thanks in advance.
Diogo Ogawa
If you have missing values in your data (which practically all data sets seem to contain), you can either remove those observations or you can build a model that uses them. Use the first approach if something about your methodology gives you conviction that there is something genuinely different about those observations. Most of the time, it is better to run the model using the missing values. In that case, use a general linear model instead of a balanced ANOVA model; the balanced model will struggle with the missing data.
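As a rough sketch of running such a MANOVA on cluster membership, assuming Python with pandas and statsmodels; the variable names and the synthetic data are placeholders for the real chemical-profile variables and the 4 k-means clusters, and rows with missing values are simply dropped here:

```python
# Sketch: MANOVA with cluster membership as the factor, via a formula-based model.
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cluster": np.repeat(["c1", "c2", "c3", "c4"], 25),   # placeholder cluster labels
    "chem1": rng.normal(size=100),                        # placeholder chemical variables
    "chem2": rng.normal(size=100),
})
df = df.dropna()  # or model the missing values instead of discarding them

maov = MANOVA.from_formula("chem1 + chem2 ~ cluster", data=df)
print(maov.mv_test())
```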

Clustering on non-numeric dimensions

I recently started working on clustering and k-means algorithm and was trying to come up with a good use case and solve it.
I have the following data about the items sold in different cities.
Item City
Item1 New York
Item2 Charlotte
Item1 San Francisco
...
I would like to cluster the data based on the variables city and item, to find groups of cities that might have similar patterns for the items sold. The problem is that the k-means implementation I use does not accept non-numeric input. Any idea how I should proceed with this to find a meaningful solution?
Thanks
SV
Clustering requires a distance definition. A cluster is only a cluster if the items are "closer" according to some distance function; the closer they are, the more likely they belong to the same cluster.
In your case, you can try to cluster based on various data related to the cities, like their geographical coordinates or demographic information, and see if the clusters overlap in the various cases!
In order for k-means to produce usable results, the means must be meaningful.
Even if you were to use, e.g., binary vectors, k-means on these would not make a lot of sense IMHO.
Probably the best use case to get started with k-means is color quantization. Take a picture, and use the RGB values of every pixel as 3-d vectors. Then run k-means with k as the desired number of colors. The color centers are your final palette, and every pixel is mapped to the closest center for color reduction.
The reasons why this works well with k-means are twofold:
the mean actually makes sense for finding the mean color of multiple pixels
the axes R, G and B have a similar meaning and scale, so there is no bias
If you want to go a step further, try to do the same e.g. in HSB space, and you'll run into difficulties if you want it to be really good, because the hue value is cyclic, which is inconsistent with the mean. Assuming the hue is on 0-360 degrees, the "mean" hue of 1 and 359 is not 180 degrees, but 0. So on this kind of data, k-means results will be suboptimal.
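A tiny sketch of that hue problem, assuming NumPy: the arithmetic mean of 1 and 359 degrees is 180, while the circular mean (via sine and cosine) gives the intuitively correct 0:

```python
# Sketch: arithmetic mean vs. circular mean for cyclic hue values (in degrees).
import numpy as np

hues = np.array([1.0, 359.0])
print(hues.mean())  # 180.0 -- clearly wrong for a cyclic quantity like hue

radians = np.deg2rad(hues)
circ = np.rad2deg(np.arctan2(np.sin(radians).mean(), np.cos(radians).mean()))
print(round(circ, 6) % 360)  # 0.0, the sensible "mean" hue
```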
See e.g. https://en.wikipedia.org/wiki/Color_quantization for details as well as the two dozen k-means questions here with respect to sparse and binary data.
You may still need to represent your data abstractly in numerical form. This may help:
http://www.analyticbridge.com/forum/topics/clustering-with-non-numeric?commentId=2004291%3AComment%3A40805
Try to re-analyze the problem and understand whether there is any relationship that you can take advantage of and represent in numerical form.
I worked on a project where I had to represent colors by their RGB values. It worked pretty well.
Hope this helps.
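As one possible numerical representation for the original city/item question, here is a hedged sketch assuming pandas and scikit-learn: build a per-city count vector over the items and cluster those vectors; the tiny transaction table and k=2 are assumptions for illustration only.

```python
# Sketch: turn (item, city) transactions into per-city count vectors, then cluster the cities.
import pandas as pd
from sklearn.cluster import KMeans

sales = pd.DataFrame({
    "item": ["Item1", "Item2", "Item1", "Item1", "Item2", "Item2"],
    "city": ["New York", "Charlotte", "San Francisco", "New York", "New York", "Charlotte"],
})

counts = pd.crosstab(sales["city"], sales["item"])   # one numeric row per city
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(counts)

print(counts.assign(cluster=labels))
```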