Clustering words into groups - cluster-analysis

This is a Homework question. I have a huge document full of words. My challenge is to classify these words into different groups/clusters that adequately represent the words. My strategy to deal with it is using the K-Means algorithm, which as you know takes the following steps.
Generate k random means for the entire group
Create K clusters by associating each word with the nearest mean
Compute centroid of each cluster, which becomes the new mean
Repeat Step 2 and Step 3 until a certain benchmark/convergence has been reached.
Theoretically, I kind of get it, but not quite. I think at each step, I have questions that correspond to it, these are:
How do I decide on k random means, technically I could say 5, but that may not necessarily be a good random number. So is this k purely a random number or is it actually driven by heuristics such as size of the dataset, number of words involved etc
How do you associate each word with the nearest mean? Theoretically I can conclude that each word is associated by its distance to the nearest mean, hence if there are 3 means, any word that belongs to a specific cluster is dependent on which mean it has the shortest distance to. However, how is this actually computed? Between two words "group", "textword" and assume a mean word "pencil", how do I create a similarity matrix.
How do you calculate the centroid?
When you repeat step 2 and step 3, you are assuming each previous cluster as a new data set?
Lots of questions, and I am obviously not clear. If there are any resources that I can read from, it would be great. Wikipedia did not suffice :(

As you don't know exact number of clusters - I'd suggest you to use a kind of hierarchical clustering:
Imagine that all your words just a points in non-euclidean space. Use Levenshtein distance to calculate distance between words (it works great, in case, if you want to detect clusters of lexicographically similar words)
Build minimum spanning tree which contains all of your words
Remove links, which have length greater than some threshold
Linked groups of words are clusters of similar words
Here is small illustration:
P.S. you can find many papers in web, where described clustering based on building of minimal spanning tree
P.P.S. If you want to detect clusters of semantically similar words, you need some algorithms of automatic thesaurus construction

That you have to choose "k" for k-means is one of the biggest drawbacks of k-means.
However, if you use the search function here, you will find a number of questions that deal with the known heuristical approaches to choosing k. Mostly by comparing the results of running the algorithm multiple times.
As for "nearest". K-means acutally does not use distances. Some people believe it uses euclidean, other say it is squared euclidean. Technically, what k-means is interested in, is the variance. It minimizes the overall variance, by assigning each object to the cluster such that the variance is minimized. Coincidentially, the sum of squared deviations - one objects contribution to the total variance - over all dimensions is exactly the definition of squared euclidean distance. And since the square root is monotone, you can also use euclidean distance instead.
Anyway, if you want to use k-means with words, you first need to represent the words as vectors where the squared euclidean distance is meaningful. I don't think this will be easy or maybe not even possible.

About the distance: In fact, Levenshtein (or edit) distance satisfies triangle inequality. It also satisfies the rest of the necessary properties to become a metric (not all distance functions are metric functions). Therefore you can implement a clustering algorithm using this metric function, and this is the function you could use to compute your similarity matrix S:
-> S_{i,j} = d(x_i, x_j) = S_{j,i} = d(x_j, x_i)
It's worth to mention that the Damerau-Levenshtein distance doesn't satisfy the triangle inequality, so be careful with this.
About the k-means algorithm: Yes, in the basic version you must define by hand the K parameter. And the rest of the algorithm is the same for a given metric.

Related

Cluster data into good and bad

I need to divide data points into those that are similar to each other("good" points) and everyone else("bad" points).
It looks like some kind of clustering problem and what do I do:
I am assuming that there are at least two "good" points.
Find pairwise distance between all types of points.
Find minimum distance (minDist).
Do hierarchical clustering for all points.
Make a cut at the height of 5*minDist.
Say that all points that are in the same cluster as pair with minDist and under that cut belong to the desired "good" cluster.
And this works pretty well, but if there are two points that are very close to each other. minDist is very small and this 5*minDist cut is also small => only these 2 points are in the desired "good" cluster.
I would think that either I need to change this approach completely and here is question number 1:
[1] "What methods do exist to separate similar points from everyone else?"
Or I need to modify this 5*minDist to some other function of minDist. And question is:
[2] "What may I choose as reasonable alternative to 5*minDist?"
Vladimir
Instead of doing clustering, you want to do outlier detection.
There are dozens of algorithms for this (see ELKI for a large collection). Some very basic methods may solve your problem:
The number of neighbors in radius r. If +i < threshold, the point is an outlier.
Distance to the k nearest neighbor. Choose k>1 to avoid these 2 element clusters you are seeing.
Also, DBSCAN clustering could work for you. Consider all clusters to be good, and only noise to be bad!

In DBSCAN, what does eps represent actually?

Suppose that I have already found the eps for all density. I applied the methodology from here http://ijiset.com/v1s4/IJISET_V1_I4_48.pdf
If you don't mind, please open page 5 and see at Proposed Algorithm section. At step 10.1, the paper tells us to calculate the number of objects in eps-neighborhood.
What does eps represent actually? It is a radius to draw a circle right? So, why the radius is so small, smaller than distances between two objects? If so, the MinPts will be 0 forever.
Yes, if used with Euclidean distance, then it is a radius.
It is not infinitely small (it does not tend to 0). It's just supposed to be small compared to the data set extends, but the authors could have named it "r" instead.
Use the original paper to understand the algorithm, not some indian journal variant of it.
In Euclidean distance, it is the radius. Selection of Eps is a little difficult.
This problem is related to model selection, i.e., the selection of a particular model and its corresponding parametrization. In the case of k-means (which requires from the user the number of clusters as input) there is a plethora of measures in the literature that can help in the selection of the best number of clusters, for instance: silhouette, c-index, dunn, davies-bouldin. These measures are the so-called relative validity criteria.
In the case of Density-based clustering algorithms, there are some measures too, for instance: CDbw and DBCV.

Is it possible to calculate Euclidean Distance between two varying length matrices?

I have started working on online signature data-set for verification purpose. I have two matrices containing digitized data of two signatures of varying length (the number of rows differ). e.g. one is 177×7 and second is 170×7.
I want to treat each column as one time series and I'd like to compare one time series of a signature with the corresponding time series of second signature.
How should I align the two time series?
I think this question really belongs on Math.StackExchange, but I will do my best to answer it here. The short answer is that the Euclidean distance cannot be applied in this case and you will need to define your own notion of distance. This may or may not actually be feasible.
The notion of distance relies on the existence of a "metric" defined on the space of interest. If your vectors are of different lengths then traditional metrics (including the Euclidean distance) are ill-defined and you need to define a new metric that works for you.
There are two things you'll need to do here:
Define the space you're working with. This seems to be the set of vectors of length 177 or length 170. This is a very unusual set.
Define your metric (and ensure that it actually meets all the properties of a metric).
The most obvious solution is to project vectors of length 177 into the space of vectors of length 170 and then compute the Euclidean distance as usual. For example, you could just ignore the last 7 elements of the vector. Note that this is not a metric on your original set as it violates the condition ( d(x,y)=0 iff x=y ), but it is a metric on the projected vectors. There may be a clever solution on the original set, but there is not an obvious one. Again, the people on Math.StackExchange may be able to help you more.

KNN classification with categorical data

I'm busy working on a project involving k-nearest neighbor (KNN) classification. I have mixed numerical and categorical fields. The categorical values are ordinal (e.g. bank name, account type). Numerical types are, for e.g. salary and age. There are also some binary types (e.g., male, female).
How do I go about incorporating categorical values into the KNN analysis?
As far as I'm aware, one cannot simply map each categorical field to number keys (e.g. bank 1 = 1; bank 2 = 2, etc.), so I need a better approach for using the categorical fields. I have heard that one can use binary numbers. Is this a feasible method?
You need to find a distance function that works for your data. The use of binary indicator variables solves this problem implicitly. This has the benefit of allowing you to continue your probably matrix based implementation with this kind of data, but a much simpler way - and appropriate for most distance based methods - is to just use a modified distance function.
There is an infinite number of such combinations. You need to experiment which works best for you. Essentially, you might want to use some classic metric on the numeric values (usually with normalization applied; but it may make sense to also move this normalization into the distance function), plus a distance on the other attributes, scaled appropriately.
In most real application domains of distance based algorithms, this is the most difficult part, optimizing your domain specific distance function. You can see this as part of preprocessing: defining similarity.
There is much more than just Euclidean distance. There are various set theoretic measures which may be much more appropriate in your case. For example, Tanimoto coefficient, Jaccard similarity, Dice's coefficient and so on. Cosine might be an option, too.
There are whole conferences dedicated to the topics of similarity search - nobody claimed this is trivial in anything but Euclidean vector spaces (and actually, not even there): http://www.sisap.org/2012
The most straight forward way to convert categorical data into numeric is by using indicator vectors. See the reference I posted at my previous comment.
Can we use Locality Sensitive Hashing (LSH) + edit distance and assume that every bin represents a different category? I understand that categorical data does not show any order and the bins in LSH are arranged according to a hash function. Finding the hash function that gives a meaningful number of bins sounds to me like learning a metric space.

Data clustering algorithm

What is the most popular text clustering algorithm which deals with large dimensions and huge dataset and is fast?
I am getting confused after reading so many papers and so many approaches..now just want to know which one is used most, to have a good starting point for writing a clustering application for documents.
To deal with the curse of dimensionality you can try to determine the blind sources (ie topics) that generated your dataset. You could use Principal Component Analysis or Factor Analysis to reduce the dimensionality of your feature set and to compute useful indexes.
PCA is what is used in Latent Semantic Indexing, since SVD can be demonstrated to be PCA : )
Remember that you can lose interpretation when you obtain the principal components of your dataset or its factors, so you maybe wanna go the Non-Negative Matrix Factorization route. (And here is the punch! K-Means is a particular NNMF!) In NNMF the dataset can be explained just by its additive, non-negative components.
There is no one size fits all approach. Hierarchical clustering is an option always. If you want to have distinct groups formed out of the data, you can go with K-means clustering (it is also supposedly computationally less intensive).
The two most popular document clustering approaches, are hierarchical clustering and k-means. k-means is faster as it is linear in the number of documents, as opposed to hierarchical, which is quadratic, but is generally believed to give better results. Each document in the dataset is usually represented as an n-dimensional vector (n is the number of words), with the magnitude of the dimension corresponding to each word equal to its term frequency-inverse document frequency score. The tf-idf score reduces the importance of high-frequency words in similarity calculation. The cosine similarity is often used as a similarity measure.
A paper comparing experimental results between hierarchical and bisecting k-means, a cousin algorithm to k-means, can be found here.
The simplest approaches to dimensionality reduction in document clustering are: a) throw out all rare and highly frequent words (say occuring in less than 1% and more than 60% of documents: this is somewhat arbitrary, you need to try different ranges for each dataset to see impact on results), b) stopping: throw out all words in a stop list of common english words: lists can be found online, and c) stemming, or removing suffixes to leave only word roots. The most common stemmer is a stemmer designed by Martin Porter. Implementations in many languages can be found here. Usually, this will reduce the number of unique words in a dataset to a few hundred or low thousands, and further dimensionality reduction may not be required. Otherwise, techniques like PCA could be used.
I will stick with kmedoids, since you can compute the distance from any point to anypoint at the beggining of the algorithm, You only need to do this one time, and it saves you time, specially if there are many dimensions. This algorithm works by choosing as a center of a cluster the point that is nearer to it, not a centroid calculated in base of the averages of the points belonging to that cluster. Therefore you have all possible distance calculations already done for you in this algorithm.
In the case where you aren't looking for semantic text clustering (I can't tell if this is a requirement or not from your original question), try using Levenshtein distance and building a similarity matrix with it. From this, you can use k-medoids to cluster and subsequently validate your clustering through use of silhouette coefficients. Unfortunately, Levensthein can be quite slow, but there are ways to speed it up through uses of thresholds and other methods.
Another way to deal with the curse of dimensionality would be to find 'contrasting sets,', conjunctions of attribute-value pairs that are more prominent in one group than in the rest. You can then use those contrasting sets as dimensions either in lieu of the original attributes or with a restricted number of attributes.