Clustering on non-numeric dimensions - cluster-analysis

I recently started working on clustering and the k-means algorithm, and was trying to come up with a good use case to solve.
I have the following data about the items sold in different cities.
Item City
Item1 New York
Item2 Charlotte
Item1 San Francisco
...
I would like to cluster the data based on the variables city and item, to find groups of cities that might have similar patterns for the items sold. The problem is that the k-means implementation I use does not accept non-numeric input. Any idea how I should proceed to find a meaningful solution?
Thanks
SV

Clustering requires a distance definition. A cluster is only a cluster if the items are "closer" according to some distance function. The closer they are, the more likely they belong to the same cluster.
In your case, you can try to cluster based on various data related to the cities, such as their geographical coordinates or demographic information, and see whether the clusters overlap across the different cases.

In order for k-means to produce usable results, the means must be meaningful.
Even if you would e.g. use binary vectors, k-means on these would not make a lot of sense IMHO.
Probably the best use case to get started with k-means is color quantization. Take a picture, and use the RGB values of every pixel as 3d vectors. Then run k-means with k as the desired number of colors. The color centers are your final palette, and every pixel will be mapped to the closest center for color reduction.
There are two reasons why this works well with k-means:
the mean actually makes sense for finding the mean color of multiple pixels
the axes R, G and B have a similar meaning and scale, so there is no bias
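As a rough illustration of the color quantization idea, here is a minimal sketch of plain k-means (Lloyd's algorithm) on RGB tuples. The pixel list, the `kmeans` helper and the seeded random start are assumptions made up for this example, not anything from a particular library.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm on tuples of numbers; returns (centers, labels)."""
    rng = random.Random(seed)
    # Start from k distinct colors picked from the data.
    centers = rng.sample(sorted(set(points)), k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: each center becomes the mean color of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(ch) / len(members) for ch in zip(*members))
    return centers, labels

# Hypothetical "image": a flat list of (R, G, B) pixels.
pixels = [(250, 10, 10), (240, 20, 15), (10, 10, 250), (20, 15, 240), (245, 15, 5)]
palette, labels = kmeans(pixels, k=2)

# Color reduction: replace every pixel by its (rounded) palette color.
quantized = [tuple(round(v) for v in palette[lab]) for lab in labels]
```

Here `palette` plays the role of the final color palette and `quantized` is the reduced image.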
If you want to go a step further, try to do the same in e.g. HSB space. You'll run into difficulties if you want it to be really good, because the hue value is cyclic, which is inconsistent with the mean. Assuming the hue is on 0-360 degrees, the "mean" hue of "1" and "359" is not 180 degrees, but 0. So on this data, k-means results will be suboptimal.
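To make the hue problem concrete, here is a small sketch comparing the arithmetic mean with a circular mean (averaging the angles as unit vectors). The function name is made up for this example, and the circular mean is only one possible way to handle cyclic values.

```python
import math

def circular_mean_degrees(angles):
    """Mean direction of angles in degrees, computed via unit vectors."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c)) % 360.0

hues = [1, 359]
arithmetic = sum(hues) / len(hues)      # 180.0, the opposite side of the color wheel
circular = circular_mean_degrees(hues)  # approximately 0 (mod 360), up to rounding
```

Because of floating-point rounding the circular result can come out as a value extremely close to 0 or to 360, which is the same hue.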
See e.g. https://en.wikipedia.org/wiki/Color_quantization for details as well as the two dozen k-means questions here with respect to sparse and binary data.

You may still need to represent your data abstractly in numerical form. This may help:
http://www.analyticbridge.com/forum/topics/clustering-with-non-numeric?commentId=2004291%3AComment%3A40805
Try to re-analyze the problem and understand whether there is any relationship that you can take advantage of and represent in numerical form.
I worked on a project where I had to represent colors by their RGB values. It worked pretty well.
Hope this helps
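One possible way to get a distance out of the city/item rows without forcing them into a mean-based method: collect the set of items sold per city and compare the sets. The sketch below uses the Jaccard distance, which is my own choice for illustration (the thread does not prescribe it); the rows are the three from the question.

```python
from collections import defaultdict

# (item, city) rows as listed in the question.
sales = [("Item1", "New York"), ("Item2", "Charlotte"), ("Item1", "San Francisco")]

def items_by_city(rows):
    """Map each city to the set of items sold there."""
    basket = defaultdict(set)
    for item, city in rows:
        basket[city].add(item)
    return dict(basket)

def jaccard_distance(a, b):
    """1 - |intersection| / |union|; 0 means identical item sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

baskets = items_by_city(sales)
d_ny_sf = jaccard_distance(baskets["New York"], baskets["San Francisco"])
d_ny_clt = jaccard_distance(baskets["New York"], baskets["Charlotte"])
```

A pairwise distance matrix built this way can be fed to a distance-based method (hierarchical clustering, for instance) rather than to k-means.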

Related

In DBSCAN, what does eps represent actually?

Suppose that I have already found the eps for all density. I applied the methodology from here http://ijiset.com/v1s4/IJISET_V1_I4_48.pdf
If you don't mind, please open page 5 and look at the Proposed Algorithm section. At step 10.1, the paper tells us to calculate the number of objects in the eps-neighborhood.
What does eps actually represent? It is a radius to draw a circle, right? So why is the radius so small, smaller than the distance between two objects? If so, MinPts will be 0 forever.
Yes, if used with Euclidean distance, then it is a radius.
It is not infinitely small (it does not tend to 0). It's just supposed to be small compared to the extent of the data set; the authors could have named it "r" instead.
Use the original paper to understand the algorithm, not some indian journal variant of it.
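A minimal sketch of what "the number of objects in the eps-neighborhood" means with Euclidean distance. The helper names and the sample points are made up; note that in the original DBSCAN definition the neighborhood of a point contains the point itself, so the count cannot be 0.

```python
import math

def eps_neighborhood(points, index, eps):
    """Indices of all points within distance eps of points[index] (itself included)."""
    center = points[index]
    return [j for j, p in enumerate(points) if math.dist(center, p) <= eps]

def is_core_point(points, index, eps, min_pts):
    """A point is a core point if its eps-neighborhood holds at least min_pts points."""
    return len(eps_neighborhood(points, index, eps)) >= min_pts

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]  # hypothetical data
near_origin = eps_neighborhood(points, 0, eps=1.5)
isolated = eps_neighborhood(points, 2, eps=0.1)
```

With an eps smaller than every pairwise distance, each neighborhood shrinks to the point alone, which is the situation described in the question.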
In Euclidean distance, it is the radius. Selection of Eps is a little difficult.
This problem is related to model selection, i.e., the selection of a particular model and its corresponding parametrization. In the case of k-means (which requires from the user the number of clusters as input) there is a plethora of measures in the literature that can help in the selection of the best number of clusters, for instance: silhouette, c-index, dunn, davies-bouldin. These measures are the so-called relative validity criteria.
In the case of Density-based clustering algorithms, there are some measures too, for instance: CDbw and DBCV.

How to set a threshold for retrieving images in a dataset by trial and error?

I have calculated the feature vectors for all images that are present in the dataset. I have used Euclidean distance for calculating the distance between them and retrieving the top 10 similar images from the dataset for every query. Setting a 'Threshold' value is totally new to me; please suggest some examples for selecting it.
Thanks in advance.
There are several possible answers of different degrees of difficulty:
Simple: return the 10 images with smallest distances. If the query image is not very similar to anything in the data set, the returned images won't be very similar, but they would be the most similar anyway. No threshold needed.
More complex: get a few people to rate pairs of images as being similar or non-similar (either yes/no or 0-10 scale). You can figure out from that what euclidean distance most people would say is "not similar" (or would give a score below 5, etc). That is your empirical threshold (but: it may be different for different kinds of images - I still think you will find a typical distance that works pretty well in most cases).
Even more complex: cluster your images with k-means. Try this for many possible numbers of clusters; measure the average cluster size, e.g. as median(distance(feature vector, centroid of cluster) for each image in cluster). Similarly measure the distances between pairs of clusters. This gives you an idea of what is "close" and "not close"; but again, ideally you should use for each image the size of the cluster it is in.
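The simple option, plus an optional empirical cutoff, can be sketched like this. The `retrieve` function, the feature dictionary and the threshold value are hypothetical; the threshold itself would come from something like the rating exercise described above.

```python
import math

def retrieve(query, features, k=10, threshold=None):
    """Return up to k (distance, name) pairs, nearest first.

    If threshold is given, matches farther away than it are dropped.
    """
    ranked = sorted((math.dist(query, vec), name) for name, vec in features.items())
    if threshold is not None:
        ranked = [(d, name) for d, name in ranked if d <= threshold]
    return ranked[:k]

# Hypothetical 2-D feature vectors keyed by image name.
features = {"a.png": (0.0, 0.0), "b.png": (3.0, 4.0), "c.png": (30.0, 40.0)}
top2 = retrieve((0.0, 0.0), features, k=2)
close_only = retrieve((0.0, 0.0), features, k=10, threshold=10.0)
```

Without a threshold you always get k results; with one, a query that resembles nothing in the dataset can come back with fewer results or none.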

When to use k means clustering algorithm?

Can I use k-means algorithm for a single attribute?
Is there any relationship between the attributes and the number of clusters?
I have one attribute, performance, and I want to classify the data into 3 clusters: poor, medium, and good.
Is it possible to create 3 clusters with one attribute?
K-Means is useful when you have an idea of how many clusters actually exists in your space. Its main benefit is its speed. There is a relationship between attributes and the number of observations in your dataset.
Sometimes a dataset can suffer from The Curse of Dimensionality where your number of variables/attributes is much greater than your number of observations. Basically, in high dimensional spaces with few observations, it becomes difficult to separate observations in hyper dimensions.
You can certainly have three clusters with one attribute. Consider a quantitative attribute for which you have 7 observations:
1
2
100
101
500
499
501
Notice there are three clusters in this sample, centered at 1.5, 100.5, and 500.
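For what it's worth, a tiny 1-dimensional k-means run on these seven observations recovers those centers. This is only a sketch: `kmeans_1d` is a made-up helper, and starting from the minimum, median and maximum is an assumption chosen to keep the run deterministic.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Lloyd's algorithm on plain numbers; returns (centers, groups)."""
    centers = list(centers)
    groups = []
    for _ in range(max_iter):
        # Assignment step: put each value in the group of its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda c: abs(v - centers[c]))
            groups[nearest].append(v)
        # Update step: mean of each group (an empty group keeps its center).
        new_centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups

data = [1, 2, 100, 101, 500, 499, 501]
start = [min(data), sorted(data)[len(data) // 2], max(data)]
centers, groups = kmeans_1d(data, start)
# centers == [1.5, 100.5, 500.0]
```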
If you have one dimensional data, search stackoverflow for better approaches than k-means.
K-means and other clustering algorithms shine when you have multivariate data. They will "work" with 1-dimensional data, but they are not very smart anymore.
One-dimensional data is ordered. If you sort your data (or it even is already sorted), it can be processed much more efficiently than with k-means. Complexity of k-means is "just" O(n*k*i), but if your data is sorted and 1-dimensional you can actually improve k-means to O(k*i). Sorting comes at a cost, but there are very good sort implementations everywhere...
Plus, for 1-dimensional data there are a lot of statistics you can use that are not very well researched or tractable in higher dimensions. One statistic you really should try is kernel density estimation. Maybe also try Jenks Natural Breaks Optimization.
However, if you want to just split your data into poor/medium/high, why don't you just use two thresholds?
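The two-threshold approach needs no clustering at all. A minimal sketch, where the cutoffs 40 and 70 are invented for illustration and would have to come from your own knowledge of the performance attribute:

```python
from bisect import bisect_right

def grade(score, low, high):
    """Label a score using two cutoffs: below low, between, or at/above high."""
    labels = ("poor", "medium", "good")
    return labels[bisect_right([low, high], score)]

scores = [10, 40, 55, 70, 95]                  # hypothetical performance values
graded = [grade(s, low=40, high=70) for s in scores]
```

Scores equal to a cutoff fall into the upper band here; flip to `bisect_left` if you want the opposite convention.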
As others have answered already, k-means requires prior information about the count of clusters. This may appear to be not very helpful at the start. But, I will cite the following scenario which I worked with and found to be very helpful.
Color segmentation
Think of a picture with 3 channels of information (Red, Green, Blue). You want to quantize the colors into 20 different bands for the purpose of dimensionality reduction. This is called vector quantization.
Every pixel is a 3 dimensional vector with Red, Green and Blue components. If the image is 100 pixels by 100 pixels then you have 10,000 vectors.
R,G,B
128,100,20
120,9,30
255,255,255
128,100,20
120,9,30
.
.
.
Depending on the type of analysis you intend to perform, you may not need all the R,G,B values. It might be simpler to deal with an ordinal representation.
In the above example, the RGB values might be assigned a flat integral representation
R,G,B
128,100,20 => 1
120,9,30 => 2
255,255,255=> 3
128,100,20 => 1
120,9,30 => 2
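The flat integer representation above can be produced with a dictionary that hands out the next id the first time a color is seen. This is just a sketch with a made-up helper name; keep in mind such ids are arbitrary labels, so numeric distances between them carry no meaning.

```python
def color_ids(pixels):
    """Replace each distinct (R, G, B) tuple by an integer id, in order of first appearance."""
    ids = {}
    encoded = []
    for p in pixels:
        # setdefault stores the next free id only if the color is new.
        encoded.append(ids.setdefault(p, len(ids) + 1))
    return encoded, ids

pixels = [(128, 100, 20), (120, 9, 30), (255, 255, 255), (128, 100, 20), (120, 9, 30)]
encoded, ids = color_ids(pixels)
# encoded == [1, 2, 3, 1, 2]
```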
You run the k-means algorithm on these 10,000 vectors and specify 20 clusters. The result: you have reduced your image colors to 20 broad buckets. Obviously some information is lost. However, the intuition for this loss being acceptable is that when the human eye is gazing out over a patch of green meadow, it is unlikely to register all 16 million RGB colours.
YouTube video
https://www.youtube.com/watch?v=yR7k19YBqiw
I have embedded key pictures from this video for your understanding. Attention! I am not the author of this video.
Original image
After segmentation using K means
Yes, it is possible to use clustering with a single attribute.
No, there is no known relation between the number of clusters and the attributes. However, one rule of thumb suggests taking the number of clusters as k ≈ sqrt(n/2), where n is the total number of items. This is just one suggestion; other studies have proposed different numbers. The best way to determine the number of clusters is to select the one that minimizes intra-cluster distance and maximizes inter-cluster distance. Background knowledge is also important.
The problem you are looking at with the performance attribute is more a classification problem than a clustering problem:
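One common way to put "small intra-cluster distance, large inter-cluster distance" into a single number is the silhouette coefficient. Below is a hand-rolled sketch for 1-dimensional values; the function name, the example labelings and the choice of the silhouette itself are my assumptions, and it expects at least two clusters.

```python
def silhouette(values, labels):
    """Mean silhouette coefficient for 1-D values; closer to 1 is better."""
    clusters = {}
    for v, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(v)
    scores = []
    for v, lab in zip(values, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(abs(v - o) for o in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(sum(abs(v - o) for o in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

values = [1, 2, 100, 101, 499, 500, 501]
three_groups = silhouette(values, [0, 0, 1, 1, 2, 2, 2])
two_groups = silhouette(values, [0, 0, 0, 1, 1, 1, 1])  # a deliberately poor split
```

Comparing the score across several candidate cluster counts, and keeping the highest, is one way to choose k when no background knowledge is available.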
Difference between classification and clustering in data mining?
With only one attribute, you don't need k-means. First, I would like to know whether your attribute is numerical or categorical.
If it's numerical, it would be easier to set up two thresholds. And if it's categorical, things get even easier: just specify which classes belong to poor, medium or good. Then simple data frame operations will do the job.
Feel free to send me comments if you are still confused.
Rowen

Finding elongated clusters using MATLAB

Let me explain what I'm trying to do.
I have plot of an Image's points/pixels in the RGB space.
What I am trying to do is find elongated clusters in this space. I'm fairly new to clustering techniques and maybe I'm not doing things correctly, I'm trying to cluster using MATLAB's inbuilt k-means clustering but it appears as if that is not the best approach in this case.
What I need to do is find "color clusters".
This is what I get after applying K-means on an image.
This is what it should look like:
for an image like this:
Can someone tell me where I'm going wrong, and what I can do to improve my results?
Note: Sorry for the low-res images, these are the best I have.
Are you trying to replicate the results of this paper? I would say just do what they did.
However, I will add since there are some issues with the current answers.
1) Yes, your clusters are not spherical- which is an assumption k-means makes. DBSCAN and MeanShift are two more common methods for handling such data, as they can handle non spherical data. However, your data appears to have one large central clump that spreads outwards in a few finite directions.
For DBSCAN, this means it will either put everything into one cluster or make every point its own cluster, since DBSCAN assumes uniform density and requires that clusters be separated by some margin.
MeanShift will likely have difficulty because everything seems to be coming from one central lump - so that will be the area of highest density that the points will shift toward, and converge to one large cluster.
My advice would be to change color spaces. RGB has issues, and the assumptions most algorithms make will probably not hold up well under it. Which clustering algorithm you should use will then likely change in the different feature space, but hopefully it will make the problem easier to handle.
k-means basically assumes clusters are approximately spherical. In your case they are definitely NOT. Try fitting a Gaussian to each cluster, with a non-spherical covariance matrix.
Basically, you will be following the same expectation-maximization (EM) steps as in k-means with the only exception that you will be modeling and fitting the covariance matrix as well.
Here's an outline of the algorithm:
1. Init: assign each point at random to one of k clusters.
2. For each cluster, estimate its mean and covariance.
3. For each point, estimate its likelihood of belonging to each cluster. Note that this likelihood is based not only on the distance to the center (mean) but also on the shape of the cluster, as encoded by the covariance matrix.
4. Repeat steps 2 and 3 until convergence or until a pre-defined number of iterations is exceeded.
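The outline is essentially what a Gaussian mixture model fitted by EM does. Since the question is about MATLAB, take this only as an illustrative sketch in Python using scikit-learn's `GaussianMixture` with full covariance matrices; the synthetic elongated data is made up, and scikit-learn initializes from k-means by default rather than from a random assignment.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic elongated clusters: wide along x, narrow along y (made-up data).
cluster_a = rng.normal(loc=[0.0, 0.0], scale=[5.0, 0.5], size=(200, 2))
cluster_b = rng.normal(loc=[0.0, 40.0], scale=[5.0, 0.5], size=(200, 2))
points = np.vstack([cluster_a, cluster_b])

# covariance_type="full" lets every component learn its own elongated shape.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(points)

# One mean vector and one 2x2 covariance matrix per fitted component.
means = gmm.means_
covariances = gmm.covariances_
```

The fitted covariance matrices are where the elongation shows up: the variance along x comes out much larger than along y.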
Take a look at density-based clustering algorithms, such as DBSCAN and MeanShift. If you are doing this for segmentation, you might want to add pixel coordinates to your vectors.

Python Clustering Algorithms

I've been looking around scipy and sklearn for clustering algorithms for a particular problem I have. I need some way of characterizing a population of N particles into k groups, where k is not necessarily known, and in addition to this, no a priori linking lengths are known (similar to this question).
I've tried kmeans, which works well if you know how many clusters you want. I've tried dbscan, which does poorly unless you tell it a characteristic length scale on which to stop looking (or start looking) for clusters. The problem is, I have potentially thousands of these clusters of particles, and I cannot spend the time to tell kmeans/dbscan algorithms what they should go off of.
Here is an example of what dbscan find:
You can see that there really are two separate populations here, though adjusting the epsilon factor (the max. distance between neighboring clusters parameter), I simply cannot get it to see those two populations of particles.
Are there any other algorithms which would work here? I'm looking for minimal information upfront - in other words, I'd like the algorithm to be able to make "smart" decisions about what could constitute a separate cluster.
I've found one that requires NO a priori information/guesses and does very well for what I'm asking it to do. It's called Mean Shift and is located in SciKit-Learn. It's also relatively quick (compared to other algorithms like Affinity Propagation).
Here's an example of what it gives:
I also want to point out that the documentation states that it may not scale well.
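For reference, a minimal sketch of Mean Shift in scikit-learn on made-up data with two well-separated particle populations. Strictly speaking there is still a scale involved, the kernel bandwidth, but it can be estimated from the data with `estimate_bandwidth`; the `quantile=0.2` value is an arbitrary choice for this example.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(0)
# Two synthetic particle populations (made-up data).
blob_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
blob_b = rng.normal(loc=[50.0, 50.0], scale=1.0, size=(100, 2))
particles = np.vstack([blob_a, blob_b])

# The bandwidth is derived from nearest-neighbor distances in the data itself.
bandwidth = estimate_bandwidth(particles, quantile=0.2)
model = MeanShift(bandwidth=bandwidth).fit(particles)

n_clusters = len(np.unique(model.labels_))  # number of clusters is not given upfront
centers = model.cluster_centers_
```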
When using DBSCAN it can be helpful to scale/normalize the data or distances beforehand, so that the estimate of epsilon is relative.
There is an implementation of DBSCAN (I think it's the one Anony-Mousse somewhere described as 'floating around') which comes with an epsilon estimator function. It works, as long as it's not fed large datasets.
There are several incomplete versions of OPTICS on GitHub. Maybe you can find one to adapt for your purpose. I'm still trying to figure out myself what effect minPts has when using one and the same extraction method.
You can try a minimum spanning tree (Zahn's algorithm) and then remove the longest edge, similar to alpha shapes. I used it with a Delaunay triangulation and a concave hull: http://www.phpdevpad.de/geofence. You can also try hierarchical clustering, for example clusterfck.
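A simplified sketch of the spanning-tree idea in Python: run Kruskal's algorithm on all pairwise distances and stop once k components remain, which is equivalent to building the minimum spanning tree and cutting its k-1 longest edges. The function name and the parameter k are assumptions for this example; Zahn's actual method decides which edges are "too long" by comparing each edge with its neighbors.

```python
import math
from itertools import combinations

def mst_clusters(points, k):
    """Cluster labels obtained by cutting the k-1 longest edges of the MST."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    components = len(points)
    for _, i, j in edges:
        if components == k:
            break  # the remaining (longer) MST edges are never added
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1

    roots = [find(i) for i in range(len(points))]
    relabel = {r: n for n, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]  # hypothetical particles
labels = mst_clusters(points, k=2)
# labels == [0, 0, 0, 1, 1]
```

This version compares all pairs, so it is only practical for small inputs; restricting the candidate edges to a Delaunay triangulation, as mentioned above, is the usual way to cut that cost.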
Your plot indicates that you chose the minPts parameter way too small.
Have a look at OPTICS, which no longer needs the epsilon parameter of DBSCAN.