ELKI DBSCAN epsilon value issue - cluster-analysis

I am trying to cluster word vectors using ELKI DBSCAN. I wish to use cosine distance to cluster word vectors of 300 dimensions. The dataset contains 19,000 words (a 19000 x 300 matrix). These are word vectors computed using gensim word2vec, and the list output is saved as a CSV.
Below is the command I passed in the UI:
KDDCLIApplication -dbc.in "D:\w2v\vectors.csv" -parser.colsep '","' -algorithm clustering.DBSCAN -algorithm.distancefunction CosineDistanceFunction -dbscan.epsilon 1.02 -dbscan.minpts 5 -vis.window.single
I played around with the epsilon value and tried three values: 0.8, 0.9, and 1.0.
For 0.8 and 0.9, I got "There are very few neighbors found. Epsilon may be too small."
For 1.0, I got "There are very many neighbors found. Epsilon may be too large."
What am I doing wrong here? I am quite new to ELKI, so any help is appreciated.

At 300 dimensions, you will be seeing the curse of dimensionality.
Contrary to popular claims, the curse of dimensionality does exist for cosine (as cosine is equivalent to Euclidean on normalized vectors, it can be at best 1 dimension "better" than Euclidean). What often makes cosine applications still work is that the intrinsic dimensionality of text is much lower than the representation dimensionality (i.e., while your vocabulary may have thousands of words, only a few occur in the intersection of two documents).
Word vectors are usually not sparse, so your intrinsic dimension may be quite high, and you will see the curse of dimensionality.
So it is not surprising to see the cosine distances concentrate, and you may then need to choose a threshold with a few digits of precision.
1.0 is not a useful threshold for cosine distance: it corresponds to orthogonal vectors, and for non-negative data it is the maximum possible distance (with negative components, as in word vectors, the distance can range up to 2.0). So yes, you will need to try 0.95 and 0.99, for example.
You can use the KNNDistancesSampler to help you choose DBSCAN parameters, or you can use for example OPTICS (which will allow you to find clusters with different thresholds, not just one single threshold).
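A rough way to see where the cosine distances concentrate is to look at the sorted k-nearest-neighbor distances, which is the idea behind a k-distance plot. The NumPy sketch below is only an illustration and is not ELKI's KNNDistancesSampler; the function names are made up for this example, and it builds a full pairwise matrix, so for 19,000 vectors you would probably run it on a random sample.

```python
import numpy as np

def cosine_distance_matrix(X):
    """Pairwise cosine distances (1 - cosine similarity) between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms  # assumption: no all-zero rows
    sim = np.clip(unit @ unit.T, -1.0, 1.0)
    return 1.0 - sim

def k_distances(X, k):
    """Distance of every point to its k-th nearest neighbor, sorted descending."""
    dist = cosine_distance_matrix(X)
    np.fill_diagonal(dist, np.inf)         # exclude each point from its own neighbors
    kth = np.sort(dist, axis=1)[:, k - 1]  # k-th smallest distance per row
    return np.sort(kth)[::-1]
```

Plotting k_distances(sample, k) for a k close to your minpts and looking for the "knee" of the curve gives a candidate epsilon. If the curve is nearly flat, the distances have concentrated and a single threshold will be hard to pick.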
Beware that word vectors are trained for a very specific scenario: substitutability. They are by far not as universal as popularly interpreted based on the "king-man+woman=queen" example. Just try "king-man+boy", which often also returns "queen" (or "kings"); the result arises mostly because the nearest neighbors of "king" are "queen" and "kings". The "capital" example is similarly overfitted to the training data. The model is trained on news articles, which often begin the text with "capital, country, blah blah". If you omit "capital", or if you omit "country", you get almost the exact same context. So the word2vec model learns that they are "substitutable". This works as long as the capital is also where the major US newspapers are based (e.g., Berlin, Paris). It often fails for countries like Canada, the U.S., or Australia, where the main reporting hubs are located elsewhere, e.g., in Toronto, New York, or Sydney. And it does not really prove that the vectors have learned what a capital is. It worked in the first place only because of overfitting on the news training data.

Related

In DBSCAN, what does eps represent actually?

Suppose that I have already found the eps for all densities. I applied the methodology from http://ijiset.com/v1s4/IJISET_V1_I4_48.pdf
If you don't mind, please open page 5 and look at the Proposed Algorithm section. Step 10.1 of the paper tells us to calculate the number of objects in the eps-neighborhood.
What does eps actually represent? It is a radius used to draw a circle, right? So why is the radius so small, smaller than the distance between two objects? If so, the MinPts will be 0 forever.
Yes, if used with Euclidean distance, then it is a radius.
It is not infinitely small (it does not tend to 0). It is just supposed to be small compared to the extent of the data set; the authors could just as well have named it "r".
Use the original DBSCAN paper to understand the algorithm, not a derived variant of it.
In Euclidean distance, it is the radius. Selection of Eps is a little difficult.
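To make the role of eps concrete, here is a minimal sketch of the eps-neighborhood count under Euclidean distance. The function names are made up for this illustration, and the scan is a plain linear pass rather than anything index-accelerated. Note that in the original DBSCAN definition the neighborhood contains the query point itself, so the count is never 0.

```python
import math

def region_query(points, index, eps):
    """Indices of all points within distance eps of points[index], itself included."""
    center = points[index]
    return [j for j, p in enumerate(points) if math.dist(center, p) <= eps]

def is_core_point(points, index, eps, min_pts):
    """A point is a core point if its eps-neighborhood holds at least min_pts points."""
    return len(region_query(points, index, eps)) >= min_pts
```

With points such as (0, 0), (0, 1), (1, 0) and (5, 5), an eps of 1.0 makes the first point a core point for min_pts = 3, while the isolated point (5, 5) only ever finds itself.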
This problem is related to model selection, i.e., the selection of a particular model and its corresponding parametrization. In the case of k-means (which requires the number of clusters as user input) there is a plethora of measures in the literature that can help in selecting the best number of clusters, for instance: silhouette, C-index, Dunn, and Davies-Bouldin. These measures are the so-called relative validity criteria.
In the case of density-based clustering algorithms, there are some measures too, for instance CDbw and DBCV.

Clustering words into groups

This is a Homework question. I have a huge document full of words. My challenge is to classify these words into different groups/clusters that adequately represent the words. My strategy to deal with it is using the K-Means algorithm, which as you know takes the following steps.
1. Generate k random means for the entire group.
2. Create k clusters by associating each word with the nearest mean.
3. Compute the centroid of each cluster, which becomes the new mean.
4. Repeat Step 2 and Step 3 until a certain benchmark/convergence has been reached.
Theoretically, I kind of get it, but not quite. I think at each step, I have questions that correspond to it, these are:
1. How do I decide on k random means? Technically I could say 5, but that may not necessarily be a good number. So is k purely a random number, or is it actually driven by heuristics such as the size of the dataset, the number of words involved, etc.?
2. How do you associate each word with the nearest mean? Theoretically I can conclude that each word is associated by its distance to the nearest mean; hence, if there are 3 means, the cluster a word belongs to depends on which mean it has the shortest distance to. However, how is this actually computed? Given two words "group" and "textword", and assuming a mean word "pencil", how do I create a similarity matrix?
3. How do you calculate the centroid?
4. When you repeat step 2 and step 3, are you treating each previous cluster as a new data set?
Lots of questions, and I am obviously not clear. If there are any resources that I can read from, it would be great. Wikipedia did not suffice :(
As you don't know the exact number of clusters, I'd suggest using a kind of hierarchical clustering:
Imagine that all your words are just points in a non-Euclidean space. Use Levenshtein distance to calculate the distance between words (it works great if you want to detect clusters of lexicographically similar words).
Build a minimum spanning tree that contains all of your words.
Remove links whose length is greater than some threshold.
The remaining linked groups of words are clusters of similar words.
P.S. You can find many papers on the web that describe clustering based on building a minimum spanning tree.
P.P.S. If you want to detect clusters of semantically similar words, you need some algorithm for automatic thesaurus construction.
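The minimum-spanning-tree steps from this answer can be sketched as follows. This is a minimal illustration with made-up function names, not a tuned implementation: it computes all pairwise Levenshtein distances (quadratic in the number of words) and runs Kruskal's algorithm with a union-find structure. Stopping Kruskal at the threshold yields the same groups as building the full tree and then removing the links longer than the threshold.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def mst_clusters(words, threshold):
    """Group words by keeping only spanning-tree links of length <= threshold."""
    n = len(words)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((levenshtein(words[i], words[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    for dist, i, j in edges:
        if dist > threshold:
            break  # every remaining link would be cut anyway
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj  # link joins two components, so it is a tree edge

    groups = {}
    for i, word in enumerate(words):
        groups.setdefault(find(i), []).append(word)
    return sorted(sorted(group) for group in groups.values())
```

For example, with a threshold of 1 the words cat, cut, cot, house, mouse and horse fall into two groups, one per word family. The threshold is the only parameter, and picking it remains a judgment call.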
That you have to choose "k" for k-means is one of the biggest drawbacks of k-means.
However, if you use the search function here, you will find a number of questions that deal with the known heuristic approaches to choosing k, mostly by comparing the results of running the algorithm multiple times.
As for "nearest": k-means actually does not use distances. Some people believe it uses Euclidean, others say it is squared Euclidean. Technically, what k-means is interested in is the variance. It minimizes the overall variance by assigning each object to the cluster such that the variance is minimized. Coincidentally, the sum of squared deviations over all dimensions (one object's contribution to the total variance) is exactly the definition of squared Euclidean distance. And since the square root is monotone, you can also use Euclidean distance instead.
Anyway, if you want to use k-means with words, you first need to represent the words as vectors for which the squared Euclidean distance is meaningful. I don't think this will be easy, and it may not even be possible.
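For reference, the assignment and update steps discussed above look like this in a bare-bones NumPy sketch of the standard (Lloyd) iteration. It assumes the data is already a numeric matrix X with one row per object, which is exactly the part that is hard for raw words; the function name and defaults are just for this example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iteration: assign by squared Euclidean distance, then re-average."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # New center = mean of the assigned points (keep the old one if a cluster is empty).
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

The whole data set is reassigned in every round; previous clusters are not treated as separate data sets. Different seeds can give different results, which is why the usual heuristics for choosing k rely on several runs.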
About the distance: In fact, Levenshtein (or edit) distance satisfies the triangle inequality. It also satisfies the rest of the properties required of a metric (not all distance functions are metrics). Therefore you can implement a clustering algorithm using this metric function, and this is the function you could use to compute your similarity matrix S:
-> S_{i,j} = d(x_i, x_j) = S_{j,i} = d(x_j, x_i)
It's worth mentioning that the restricted Damerau-Levenshtein distance (optimal string alignment) doesn't satisfy the triangle inequality, so be careful with it.
About the k-means algorithm: Yes, in the basic version you must define the K parameter by hand. The rest of the algorithm is the same for a given metric.

How to find the "optimal" cut-off point (threshold)

I have a set of weighted features for machine learning. I'd like to reduce the feature set and just use those with a very large or very small weight.
So given the image of sorted weights below, I'd only like to use the features that have weights above the higher or below the lower yellow line.
What I'm looking for is some kind of slope change detection so I can discard all the features until the first/last slope coefficient increase/decrease.
While I (think I) know how to code this myself (with first and second numerical derivatives), I'm interested in any established methods. Perhaps there's some statistic or index that computes something like that, or anything I can use from SciPy?
Edit:
At the moment, I'm using 1.8*positive.std() as positive and 1.8*negative.std() as negative threshold (fast and simple), but I'm not mathematician enough to determine how robust this is. I don't think it is, though. ⍨
If the data are (approximately) Gaussian distributed, then just using a multiple
of the standard deviation is sensible.
If you are worried about heavier tails, then you may want to base your analysis on order
statistics.
Since you've plotted it, I'll assume you're willing to sort all of the
data.
Let N be the number of data points in your sample.
Let x[i] be the i'th value in the sorted list of values.
Then 0.5 * (x[int(0.8413*N)] - x[int(0.1587*N)]) is an estimate of the standard deviation
which is more robust against outliers. This estimate of the std can be used as you
indicated above. (The magic numbers above are the fraction of data that are
less than [mean+1sigma] and [mean-1sigma] respectively).
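As a small NumPy sketch of that order-statistics estimate (the function name is made up for this example):

```python
import numpy as np

def robust_std(values):
    """Std estimate from the ~15.87% and ~84.13% order statistics of the sample."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    upper = x[int(0.8413 * n)]  # roughly mean + 1 sigma for Gaussian data
    lower = x[int(0.1587 * n)]  # roughly mean - 1 sigma for Gaussian data
    return 0.5 * (upper - lower)
```

Because only two interior order statistics are used, a handful of extreme weights at either end does not change the estimate, unlike the ordinary standard deviation.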
There are also conditions where just keeping the highest 10% and lowest 10% would be
sensible as well; and these cutoffs are easily computed if you have the sorted data
on hand.
These are somewhat ad hoc approaches based on the content of your question.
The general sense of what you're trying to do is (a form of) anomaly detection,
and you can probably do a better job of it if you're careful in defining/estimating
what the shape of the distribution is near the middle, so that you can tell when
the features are getting anomalous.

Combining different similarities to build one final similarity

I'm pretty new to data mining and recommendation systems, and I'm now trying to build some kind of recommender system for users that have the following parameters:
city
education
interest
To calculate similarity between them, I'm going to apply cosine similarity and discrete similarity.
For example:
city : if x = y then d(x,y) = 0. Otherwise, d(x,y) = 1.
education : here i will use cosine similarity as words appear in the name of the department or bachelors degree
interest : there will be hardcoded number of interest user can choose and cosine similarity will be calculated based on two vectors like this:
1 0 0 1 0 0 ... n
1 1 1 0 1 0 ... n
where 1 means the presence of the interest and n is the total number of all interests.
My question is:
How do I combine those 3 similarities in an appropriate way? I mean, just summing them doesn't sound very smart, does it? Also, I would like to hear comments on my "newbie similarity system", hah.
There are no hard-and-fast answers, since the answers here depend greatly on your input and problem domain. A lot of the work of machine learning is the art (not science) of preparing your input, for this reason. I could give you some general ideas to think about. You have two issues: making meaningful similarities out of each of these items, and then combining them.
The city similarity sounds reasonable but really depends on your domain. Is it really the case that being in the same city means everything, and being in neighboring cities means nothing? For example does being in similarly-sized cities count for anything? In the same state? If they do your similarity should reflect that.
Education: I understand why you might use cosine similarity but that is not going to address the real problem here, which is handling different tokens that mean the same thing. You need "eng" and "engineering" to match, and "ba" and "bachelors", things like that. Once you prepare the tokens that way it might give good results.
Interest: I don't think cosine will be the best choice here; try a simple Tanimoto coefficient similarity (just the size of the intersection over the size of the union).
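A minimal sketch of that Tanimoto coefficient on interest sets follows. The function name is made up for this example, and treating two empty sets as fully similar is an assumption you may want to change.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two sets: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumption: two users with no interests count as identical
    return len(a & b) / len(union)
```

For the two interest vectors in the question (1 0 0 1 0 0 and 1 1 1 0 1 0), the sets of chosen interests are {0, 3} and {0, 1, 2, 4}, which share one interest out of five in total.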
You can't just sum them, as I assume you still want a value in the range [0,1]. You could average them. That makes the assumption that the output of each of these are directly comparable, that they're the same "units" if you will. They aren't here; for example it's not as if they are probabilities.
It might still work OK in practice to average them, perhaps with weights. For example, being in the same city here is as important as having exactly the same interests. Is that true or should it be less important?
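A weighted average of the three per-attribute similarities could look like the sketch below. The attribute names and weights are placeholders for illustration; suitable weights would have to come from testing against your own data.

```python
def combined_similarity(similarities, weights):
    """Weighted average of per-attribute similarities, each assumed to lie in [0, 1]."""
    total = sum(weights.values())
    return sum(weights[name] * similarities[name] for name in weights) / total
```

With city weighted 1 and education and interest weighted 2 each, similarities of 1.0, 0.5 and 0.2 combine to 0.48, and the result stays in [0, 1] as long as the weights are positive.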
You can try and test different variations and weights as hopefully you have some scheme for testing against historical data. I would point you at our project, Mahout, as it has a complete framework for recommenders and evaluation.
However all these sorts of solutions are hacky and heuristic. I think you might want to take a more formal approach to feature encoding and similarities. If you're willing to buy a book and like Mahout, Mahout in Action has good coverage in the clustering chapters on how to select and encode features and then how to make one similarity out of them.
Here's the usual trick in machine learning.
city : if x = y then d(x,y) = 0. Otherwise, d(x,y) = 1.
I take this to mean you use a one-of-K coding. That's good.
education : here i will use cosine similarity as words appear in the name of the department or bachelors degree
You can also use a one-of-K coding here, to produce a vector of size |V| where V is the vocabulary, i.e. all words in your training data.
If you now normalize the interest number so that it always falls in the range [0,1], then you can use ordinary L1 (Manhattan) or L2 (Euclidean) distance metrics between your final vectors. The latter corresponds to the cosine similarity of information retrieval when the vectors are scaled to unit length.
Experiment with L1 and L2 to decide which is best.
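The link between L2 and cosine holds for vectors scaled to unit length, where the squared Euclidean distance equals 2 * (1 - cosine similarity). A quick numerical check, with helper names made up for this example:

```python
import numpy as np

def unit(v):
    """Scale a vector to unit L2 length (assumes it is not all zeros)."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def cosine_similarity(a, b):
    return float(np.dot(unit(a), unit(b)))

def squared_l2_on_unit_vectors(a, b):
    d = unit(a) - unit(b)
    return float(np.dot(d, d))
```

Since the two quantities are tied together by this identity, ranking neighbors by Euclidean distance on unit vectors gives the same order as ranking them by cosine similarity.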

Data clustering algorithm

What is the most popular text clustering algorithm which deals with large dimensions and huge dataset and is fast?
I am getting confused after reading so many papers and so many approaches..now just want to know which one is used most, to have a good starting point for writing a clustering application for documents.
To deal with the curse of dimensionality you can try to determine the blind sources (i.e., topics) that generated your dataset. You could use Principal Component Analysis or Factor Analysis to reduce the dimensionality of your feature set and to compute useful indexes.
PCA is what is used in Latent Semantic Indexing, since SVD can be demonstrated to be PCA : )
Remember that you can lose interpretation when you obtain the principal components of your dataset or its factors, so you maybe wanna go the Non-Negative Matrix Factorization route. (And here is the punch! K-Means is a particular NNMF!) In NNMF the dataset can be explained just by its additive, non-negative components.
There is no one size fits all approach. Hierarchical clustering is an option always. If you want to have distinct groups formed out of the data, you can go with K-means clustering (it is also supposedly computationally less intensive).
The two most popular document clustering approaches are hierarchical clustering and k-means. k-means is faster, as it is linear in the number of documents, whereas hierarchical clustering is quadratic but is generally believed to give better results. Each document in the dataset is usually represented as an n-dimensional vector (n is the number of words), with the magnitude of the dimension corresponding to each word equal to its term frequency-inverse document frequency (tf-idf) score. The tf-idf score reduces the importance of high-frequency words in the similarity calculation. Cosine similarity is often used as the similarity measure.
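As an illustration of that representation, here is a small pure-Python sketch of tf-idf vectors and cosine similarity on tokenized documents. It uses one common weighting (raw term count times log(N / document frequency)), which is an assumption; libraries differ in smoothing and normalization, and the function names are made up for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs is a list of token lists; returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def sparse_cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

A word that occurs in every document gets weight 0 under this scheme, so it contributes nothing to the similarity, which is the down-weighting effect described above.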
A paper comparing experimental results between hierarchical and bisecting k-means, a cousin algorithm to k-means, can be found here.
The simplest approaches to dimensionality reduction in document clustering are: a) throw out all rare and highly frequent words (say, occurring in less than 1% or more than 60% of documents; this is somewhat arbitrary, and you need to try different ranges for each dataset to see the impact on results), b) stopping: throw out all words in a stop list of common English words (lists can be found online), and c) stemming, or removing suffixes to leave only word roots. The most common stemmer is the one designed by Martin Porter. Implementations in many languages can be found here. Usually, this will reduce the number of unique words in a dataset to a few hundred or low thousands, and further dimensionality reduction may not be required. Otherwise, techniques like PCA could be used.
I would stick with k-medoids, since you can compute the distance from every point to every other point at the beginning of the algorithm. You only need to do this once, which saves time, especially if there are many dimensions. This algorithm chooses as the center of a cluster the actual data point that is most central to it (the medoid), not a centroid calculated by averaging the points belonging to that cluster. Therefore all the distance calculations you need have already been done up front.
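A minimal sketch of k-medoids on a precomputed distance matrix is shown below. It uses the simple alternating scheme (assign points, then re-pick each medoid) rather than the classic PAM swap search, and the function name and defaults are made up for this example.

```python
import numpy as np

def k_medoids(dist, k, n_iter=100, seed=0):
    """Alternating k-medoids on a precomputed (n x n) distance matrix."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assign every point to its nearest medoid using only matrix lookups.
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if len(members) == 0:
                continue  # keep the old medoid if a cluster ends up empty
            # New medoid: the member with the smallest total distance to the others.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return labels, medoids
```

Because the loop only reads from dist, any distance function can be plugged in by filling the matrix beforehand. The matrix itself needs memory quadratic in the number of points, which limits this approach on huge data sets.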
In the case where you aren't looking for semantic text clustering (I can't tell from your original question whether this is a requirement), try using Levenshtein distance and building a similarity matrix with it. From this, you can use k-medoids to cluster and subsequently validate your clustering using silhouette coefficients. Unfortunately, Levenshtein can be quite slow, but there are ways to speed it up through the use of thresholds and other methods.
Another way to deal with the curse of dimensionality would be to find "contrasting sets": conjunctions of attribute-value pairs that are more prominent in one group than in the rest. You can then use those contrasting sets as dimensions, either in lieu of the original attributes or with a restricted number of attributes.