Clustering techniques for similarity matrix - cluster-analysis

I have binary data from 128 respondents based on the digital-camera features they selected, where '1' means a feature was selected and '0' means it was not. There are 92 product features in the columns and the respondents in the rows, and each respondent selected exactly 20 of the 92 features. I want to cluster the respondents into different user groups based on the features they selected. I have tried some clustering algorithms, such as fuzzy and hierarchical clustering, directly on this binary data, but the results were poor and the clusters were really bad. So I have now computed a Dice-coefficient similarity matrix over the respondents, which gives me a similarity score between each respondent and every other respondent. Is it possible to apply a clustering technique to this similarity matrix to get good clusters? Also, which clustering techniques could I apply to this user similarity matrix to identify clusters of users based on their similarity scores? Any suggestions and comments would be really appreciated.

Since your data set is tiny, go with hierarchical clustering.
It can work with either a distance matrix or a similarity matrix.
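For example, here is a minimal sketch of that with SciPy, assuming your 128x128 Dice similarity matrix is already available as a NumPy array (the variable names, the file name, and the choice of 5 clusters are illustrative, not from the question):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# similarity: 128 x 128 Dice similarity matrix (values in [0, 1], 1.0 on the diagonal)
similarity = np.load("dice_similarity.npy")   # hypothetical file

# Turn similarity into a distance matrix, then into the condensed form
# that scipy's linkage() expects for a precomputed metric.
distance = 1.0 - similarity
np.fill_diagonal(distance, 0.0)
condensed = squareform(distance, checks=False)

# Average linkage usually behaves reasonably with similarity-derived distances.
Z = linkage(condensed, method="average")

# Cut the dendrogram into, say, 5 respondent groups (tune this number).
labels = fcluster(Z, t=5, criterion="maxclust")
print(labels)
```

Note that SciPy can also compute Dice distances directly from the binary respondent-by-feature matrix via `pdist(X, metric="dice")`, which lets you skip building the similarity matrix by hand.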

Related

Dimensionality reduction for high dimensional sparse data before clustering or spherical k-means?

I am trying to build my first recommender system, where I create a user feature space and then cluster the users into different groups. For a recommendation to work for a particular user, I first find the cluster to which the user belongs and then recommend entities (items) that his or her nearest neighbours showed interest in. The data I am working with is high-dimensional and sparse. Before implementing this approach, there are a few questions whose answers might help me adopt a better approach.
As my data is high-dimensional and sparse, should I apply dimensionality reduction and then cluster, or should I go for an algorithm like spherical k-means that works on sparse, high-dimensional data?
How should I find the nearest neighbours after creating the clusters of users? (Which distance measure should I use, given that I have read that Euclidean distance is not a good measure for high-dimensional data?)
It's not obvious that clustering is the right algorithm here. Clustering is great for data exploration and analysis, but not always for prediction. If your end product is based around the concept of "groups of like users" and the items they share, then go ahead with clustering and simply present a ranked list of items that each user's cluster has consumed (or a weighted average rating, if you have preference information).
You might try standard recommender algorithms that work in sparse high-dimensional situations, such as item-item collaborative filtering or sparse SVD.
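As a rough sketch of the sparse-SVD route with SciPy and scikit-learn (the matrix here is randomly generated purely as a placeholder for your user-item data, and the factor count k is arbitrary):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds
from sklearn.neighbors import NearestNeighbors

# Placeholder sparse user-item matrix: 10,000 users x 50,000 items, ~0.1% filled.
R = sparse_random(10_000, 50_000, density=0.001, format="csr", random_state=0)

# Truncated SVD keeps only the k strongest latent factors.
k = 50
U, s, Vt = svds(R, k=k)

# Dense, low-dimensional user representations: rows of U scaled by the singular values.
user_factors = U * s          # shape (10000, k)
item_factors = Vt.T * s       # shape (50000, k)

# Nearest neighbours are then found with cosine similarity in the reduced space,
# which is better behaved than Euclidean distance on the raw sparse vectors.
nn = NearestNeighbors(metric="cosine").fit(user_factors)
dists, idx = nn.kneighbors(user_factors[:1], n_neighbors=6)
print(idx)    # the user itself plus its 5 closest neighbours
```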

Comparing k-means clustering

I have 150 images, 15 each of 10 different people, so basically I know which images should belong together if clustered.
These images have 73-dimensional feature vectors, and I clustered them into 10 clusters using the kmeans function in MATLAB.
Later, I processed these 150 data points, reduced the dimensionality from 73 to 3 for my work, and applied the same kmeans function to them.
I want to compare the results obtained on these two data sets (processed and unprocessed) with the same k-means function, and to know whether the processing that reduced the dimensionality improves the k-means clustering or not.
I thought comparing the variance of each cluster could be one criterion for comparison, but I am not sure I can directly compare and evaluate the results (within-cluster sum of distances, etc.), since the two cases have different dimensionalities. Could anyone suggest a way to compare the k-means results, some way to normalize them, or any other comparison I could make?
I can think of three options. I am unaware of any well developed methodology to do this specifically with K-means clustering.
Look at the confusion matrices between the two approaches.
Compare the mahalanobis distances between the clusters, and between items in clusters to their nearest other clusters.
Look at the Voronoi cells and see how far your points are from the boundaries of the cells.
The problem with 3 is that the distance metrics get skewed: 3-D distances and 73-D distances are not commensurate, so I'm not a fan of that approach. I'd recommend reading some books on k-means if you are adamant about that path; rank speculation is fun, but standing on the shoulders of giants is better.
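Since the person each image belongs to is known, option 1 can be made concrete with a confusion matrix, or with a label-permutation-free score such as the adjusted Rand index, which is independent of the feature dimensionality and so lets the 73-D and 3-D runs be compared directly. A sketch (in Python rather than MATLAB, with random placeholder data standing in for the real features):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix, adjusted_rand_score

# Placeholder data: X_full is 150 x 73, X_reduced is 150 x 3,
# y_true holds the known person id (0..9) for each image.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(150, 73))
X_reduced = X_full[:, :3]
y_true = np.repeat(np.arange(10), 15)

labels_full = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_full)
labels_reduced = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_reduced)

# Higher adjusted Rand index = closer agreement with the true grouping.
print("ARI, 73-D:", adjusted_rand_score(y_true, labels_full))
print("ARI,  3-D:", adjusted_rand_score(y_true, labels_reduced))

# Confusion matrix between true ids and the 3-D clustering (cluster ids are arbitrary).
print(confusion_matrix(y_true, labels_reduced))
```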

When to use k means clustering algorithm?

Can I use k-means algorithm for a single attribute?
Is there any relationship between the attributes and the number of clusters?
I have one attribute, performance, and I want to classify the data into 3 clusters: poor, medium, and good.
Is it possible to create 3 clusters with one attribute?
K-means is useful when you have an idea of how many clusters actually exist in your space. Its main benefit is its speed. There is a relationship between the attributes and the number of observations in your dataset.
Sometimes a dataset can suffer from the curse of dimensionality, where the number of variables/attributes is much greater than the number of observations. Basically, in high-dimensional spaces with few observations, it becomes difficult to separate the observations.
You can certainly have three clusters with one attribute. Consider a quantitative attribute with these 7 observations:
1
2
100
101
500
499
501
Notice that there are three clusters in this sample, centered at 1.5, 100.5, and 500.
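A quick check of that toy example (scikit-learn is used here purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

values = np.array([1, 2, 100, 101, 500, 499, 501], dtype=float).reshape(-1, 1)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)

print(km.labels_)                            # e.g. [0 0 1 1 2 2 2]; label ids are arbitrary
print(np.sort(km.cluster_centers_.ravel()))  # [  1.5 100.5 500. ]
```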
If you have one dimensional data, search stackoverflow for better approaches than k-means.
K-means and other clustering algorithms shine when you have multivariate data. They will "work" with 1-dimensional data, but they are not very smart anymore.
One-dimensional data is ordered. If you sort your data (or it is already sorted), it can be processed much more efficiently than with general k-means. The complexity of k-means is "just" O(n*k*i), but if your data is sorted and one-dimensional you can actually improve k-means to roughly O(k*i). Sorting comes at a cost, but there are very good sort implementations everywhere...
Plus, for one-dimensional data there are many statistical tools that are not well researched or tractable in higher dimensions. One statistic you really should try is kernel density estimation. Maybe also try Jenks natural breaks optimization.
However, if you want to just split your data into poor/medium/high, why don't you just use two thresholds?
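To make the kernel-density suggestion concrete, here is a minimal sketch for one-dimensional data: estimate the density, take its local minima as break points, and bin the values accordingly (the scores are made up, and the bandwidth may need tuning on real data):

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import argrelmin

# Made-up 1-D performance scores with three rough groups.
rng = np.random.default_rng(0)
scores = np.concatenate([
    rng.normal(20, 3, 50),   # poor
    rng.normal(55, 4, 50),   # medium
    rng.normal(85, 3, 50),   # good
])

kde = gaussian_kde(scores)
grid = np.linspace(scores.min(), scores.max(), 1000)
density = kde(grid)

# Local minima of the estimated density are natural break points between groups.
breaks = grid[argrelmin(density)[0]]
print("split thresholds:", breaks)

labels = np.digitize(scores, breaks)   # 0 = poor, 1 = medium, 2 = good
```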
As others have already answered, k-means requires prior knowledge of the number of clusters. This may not seem very helpful at first, but I will describe a scenario I worked with where it proved very useful.
Color segmentation
Think of a picture with 3 channels of information (red, green, blue). You want to quantize the colors into 20 different bands for the purpose of dimensionality reduction; this is called vector quantization.
Every pixel is a 3-dimensional vector with red, green, and blue components. If the image is 100 pixels by 100 pixels, then you have 10,000 vectors.
R,G,B
128,100,20
120,9,30
255,255,255
128,100,20
120,9,30
...
Depending on the type of analysis you intend to perform, you may not need all the R, G, B values. It might be simpler to deal with an ordinal representation.
In the above example, the RGB values might be assigned a flat integer representation:
R,G,B
128,100,20 => 1
120,9,30 => 2
255,255,255=> 3
128,100,20 => 1
120,9,30 => 2
You run the k-means algorithm on these 10,000 vectors and specify 20 clusters. Result: you have reduced your image's colors to 20 broad buckets. Obviously some information is lost. However, the intuition for why this loss is acceptable is that when the human eye gazes out over a patch of green meadow, we are unlikely to register all 16 million RGB colors.
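A small sketch of this kind of color quantization in Python with scikit-learn (the image file name is a placeholder):

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Load an RGB image and flatten it into a list of (R, G, B) vectors.
img = np.asarray(Image.open("meadow.jpg").convert("RGB"), dtype=float)  # hypothetical file
h, w, _ = img.shape
pixels = img.reshape(-1, 3)        # e.g. 10,000 x 3 for a 100 x 100 image

# Quantize the colors into 20 broad bands.
km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(pixels)

# Replace each pixel by its cluster centre to obtain the segmented image.
quantized = km.cluster_centers_[km.labels_].reshape(h, w, 3).astype(np.uint8)
Image.fromarray(quantized).save("meadow_20_colours.png")
```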
YouTube video (I am not the author): https://www.youtube.com/watch?v=yR7k19YBqiw
It shows the original image and the result after segmentation with k-means.
Yes, it is possible to use clustering with a single attribute.
No, there is no known relation between the number of clusters and the attributes. However, some studies suggest taking the number of clusters as k ≈ √(n/2), where n is the total number of items. This is just one study; different studies have suggested different cluster numbers. The best way to determine the number of clusters is to choose the one that minimizes intra-cluster distance and maximizes inter-cluster distance. Having background knowledge is also important.
The problem you describe, with a performance attribute, is more a classification problem than a clustering problem; see:
Difference between classification and clustering in data mining?
With only one attribute, you don't need to do k-means. First, I would like to know whether your attribute is numerical or categorical.
If it's numerical, it would be easy to set up two thresholds. And if it's categorical, things get even easier: just specify which classes belong to poor, medium, or good, and simple data frame operations will do the job.
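For the numerical case, the two-threshold idea is literally a couple of lines; a sketch with pandas (the scores and the cut points 40 and 70 are placeholders you would choose from your own data):

```python
import pandas as pd

df = pd.DataFrame({"performance": [12, 55, 71, 38, 90, 64]})   # made-up scores

# Two thresholds split the single attribute into poor / medium / good.
df["group"] = pd.cut(df["performance"],
                     bins=[-float("inf"), 40, 70, float("inf")],
                     labels=["poor", "medium", "good"])
print(df)
```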
Feel free to send me comments if you are still confused.
Rowen

Clustering: a training dataset of variable data dimensions

I have a dataset of n data points, where each point is represented by a set of extracted features. Generally, clustering algorithms require all input data to have the same dimensionality (the same number of features); that is, the input data X is an n*d matrix of n data points, each of which has d features.
In my case, I've previously extracted some features from my data, but the number of extracted features is most likely different for each data point (I mean, I have a dataset X whose data points do not all have the same number of features).
Is there any way to adapt them so that I can cluster them using common clustering algorithms that require data of the same dimensionality?
Thanks
It sounds like the problem you have is that it's a 'sparse' data set. There are generally two options.
Reduce the dimensionality of the input data set using multi-dimensional scaling techniques, for example sparse SVD (e.g. the Lanczos algorithm) or sparse PCA, then apply traditional clustering on the dense, lower-dimensional outputs.
Directly apply a sparse clustering algorithm, such as sparse k-means. Note that you can probably find a PDF of this paper if you look hard enough online (try scholar.google.com).
[Updated after problem clarification]
In the problem, a handwritten word is analyzed visually for connected components (lines). For each component, a fixed number of multi-dimensional features is extracted. We need to cluster the words, each of which may have one or more connected components.
Suggested solution:
Classify the connected components first, into 1000(*) unique component classifications. Then classify the words against the classified components they contain (a sparse problem described above).
*Note, the exact number of component classifications you choose doesn't really matter as long as it's high enough as the MDS analysis will reduce them to the essential 'orthogonal' classifications.
There are also clustering algorithms, such as DBSCAN, that in fact do not care what your data looks like. All this algorithm needs is a distance function. So if you can specify a distance function for your features, then you can use DBSCAN (or OPTICS, which is an extension of DBSCAN that doesn't need the epsilon parameter).
So the key question here is how you want to compare your features. This doesn't have much to do with clustering and is highly domain dependent. If your features are, for example, word occurrences, cosine distance is a good choice (using 0s for non-present features). But if you have, say, a set of SIFT keypoints extracted from a picture, there is no obvious way to relate the different features to each other efficiently, because the features have no order (so you cannot simply compare the first keypoint with the first keypoint, and so on). A possible approach here is to derive another, uniform, set of features. Typically, bag-of-words features are used for such a situation; for images these are also known as visual words. Essentially, you first cluster the sub-features to obtain a limited vocabulary, then assign each of the original objects a "text" composed of these "words", and use a distance function such as cosine distance on them.
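A minimal sketch of DBSCAN on a precomputed distance matrix, assuming you already have some pairwise distance for your objects (here cosine distance on made-up bag-of-words counts; eps and min_samples always need tuning):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

# Made-up bag-of-words counts: 6 objects over a vocabulary of 5 "words".
X = np.array([
    [3, 0, 1, 0, 0],
    [2, 0, 2, 0, 0],
    [0, 4, 0, 1, 0],
    [0, 3, 0, 2, 0],
    [0, 0, 0, 0, 5],
    [0, 0, 1, 0, 4],
], dtype=float)

# Any custom distance works, as long as it ends up in a square matrix.
D = pairwise_distances(X, metric="cosine")

labels = DBSCAN(eps=0.3, min_samples=2, metric="precomputed").fit_predict(D)
print(labels)   # -1 marks noise points
```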
I see two options here:
Restrict yourself to those features for which all your data-points have a value.
See if you can generate sensible default values for missing features.
However, if possible, you should probably resample all your data-points, so that they all have values for all features.

Data clustering algorithm

What is the most popular text clustering algorithm that deals with large dimensions and huge datasets, and is fast?
I am getting confused after reading so many papers and so many approaches. Now I just want to know which one is used most, to have a good starting point for writing a clustering application for documents.
To deal with the curse of dimensionality, you can try to determine the blind sources (i.e. topics) that generated your dataset. You could use principal component analysis or factor analysis to reduce the dimensionality of your feature set and to compute useful indexes.
PCA is what is used in latent semantic indexing, since SVD can be shown to be essentially PCA : )
Remember that you can lose interpretability when you take the principal components of your dataset or its factors, so you may want to go the non-negative matrix factorization route. (And here is the punch: k-means is a particular NNMF!) In NNMF the dataset can be explained just by its additive, non-negative components.
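As a hedged sketch of the reduce-then-cluster route with scikit-learn (NMF here; swapping in TruncatedSVD gives the LSI variant; the tiny corpus is made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

docs = [
    "cheap digital camera with zoom lens",
    "camera lens zoom review",
    "football match score highlights",
    "match highlights and final score",
]   # made-up corpus

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)              # sparse document-term matrix

# Factor the matrix into 2 additive, non-negative "topics".
nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(X)                   # document-topic weights

# Cluster the documents in the low-dimensional topic space.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(W)
print(labels)
```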
There is no one-size-fits-all approach. Hierarchical clustering is always an option. If you want distinct groups formed from the data, you can go with k-means clustering (it is also supposedly less computationally intensive).
The two most popular document clustering approaches are hierarchical clustering and k-means. k-means is faster, as it is linear in the number of documents (hierarchical clustering is quadratic), but hierarchical clustering is generally believed to give better results. Each document in the dataset is usually represented as an n-dimensional vector (n is the number of words), with the magnitude of each word's dimension equal to its term frequency-inverse document frequency (tf-idf) score. The tf-idf score reduces the importance of high-frequency words in the similarity calculation. Cosine similarity is often used as the similarity measure.
A paper comparing experimental results between hierarchical clustering and bisecting k-means, a cousin algorithm to k-means, can be found here.
The simplest approaches to dimensionality reduction in document clustering are: a) throwing out all rare and highly frequent words (say, occurring in less than 1% or more than 60% of documents; this is somewhat arbitrary, so you need to try different ranges for each dataset and see the impact on results); b) stopping: throwing out all words in a stop list of common English words (lists can be found online); and c) stemming, i.e. removing suffixes to leave only word roots. The most common stemmer is the one designed by Martin Porter; implementations in many languages can be found here. Usually this will reduce the number of unique words in a dataset to a few hundred or low thousands, and further dimensionality reduction may not be required. Otherwise, techniques like PCA can be used.
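Those three steps map directly onto vectorizer options plus a stemmer; a sketch assuming scikit-learn and NLTK are installed (the 1%/60% cut-offs are the arbitrary values mentioned above, and on a tiny corpus they will prune aggressively):

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def stem_tokens(text):
    # Crude whitespace tokenization followed by Porter stemming.
    return [stemmer.stem(tok) for tok in text.lower().split()]

vectorizer = TfidfVectorizer(
    tokenizer=stem_tokens,     # (c) stemming
    stop_words="english",      # (b) stop-word removal
    min_df=0.01,               # (a) drop words in fewer than 1% of documents...
    max_df=0.60,               #     ...and in more than 60% of documents
)

docs = ["the cameras were cheap",
        "cheap camera with a zoom lens",
        "zoomed lenses"]       # toy corpus
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
```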
I would stick with k-medoids, since you can compute the distance from any point to any other point at the beginning of the algorithm; you only need to do this once, and it saves time, especially when there are many dimensions. This algorithm chooses as the center of a cluster the actual data point that is most central to it, rather than a centroid calculated from the averages of the points belonging to that cluster. Therefore all the possible distance calculations are already done for you in this algorithm.
In case you aren't looking for semantic text clustering (I can't tell whether this is a requirement from your original question), try using Levenshtein distance and building a similarity matrix with it. From this, you can use k-medoids to cluster and subsequently validate your clustering with silhouette coefficients. Unfortunately, Levenshtein can be quite slow, but there are ways to speed it up using thresholds and other methods.
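A rough sketch of that pipeline, assuming the scikit-learn-extra package for KMedoids (the strings are toy data, and the Levenshtein implementation here is the plain dynamic-programming one, so it is slow on long texts):

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids   # from the scikit-learn-extra package

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

strings = ["kitten", "sitting", "mitten", "flask", "flasks", "flash"]   # toy data
D = np.array([[levenshtein(s, t) for t in strings] for s in strings], dtype=float)

km = KMedoids(n_clusters=2, metric="precomputed", random_state=0).fit(D)
print(km.labels_)
print("silhouette:", silhouette_score(D, km.labels_, metric="precomputed"))
```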
Another way to deal with the curse of dimensionality would be to find 'contrast sets': conjunctions of attribute-value pairs that are more prominent in one group than in the rest. You can then use those contrast sets as dimensions, either in lieu of the original attributes or alongside a restricted number of them.