Combining different similarities to build one final similarity - cluster-analysis

I'm pretty new to data mining and recommendation systems, and I'm now trying to build a recommendation system for users that have the following parameters:
city
education
interest
To calculate similarity between users, I'm going to apply cosine similarity and discrete similarity.
For example:
city : if x = y then d(x,y) = 0. Otherwise, d(x,y) = 1.
education : here I will use cosine similarity on the words appearing in the name of the department or bachelor's degree
interest : there will be a hardcoded number of interests a user can choose, and cosine similarity will be calculated based on two vectors like this:
1 0 0 1 0 0 ... n
1 1 1 0 1 0 ... n
where 1 means the presence of the interest and n is the total number of all interests.
My question is:
How do I combine those 3 similarities in an appropriate way? Just summing them doesn't sound very smart, does it? Also, I would like to hear comments on my "newbie similarity system", hah.

There are no hard-and-fast answers, since everything here depends greatly on your input and problem domain. For this reason, a lot of the work of machine learning is the art (not science) of preparing your input. I can give you some general ideas to think about. You have two issues: making a meaningful similarity out of each of these items, and then combining them.
The city similarity sounds reasonable, but it really depends on your domain. Is it really the case that being in the same city means everything, and being in neighboring cities means nothing? For example, does being in similarly-sized cities count for anything? In the same state? If so, your similarity should reflect that.
Education: I understand why you might use cosine similarity, but that is not going to address the real problem here, which is handling different tokens that mean the same thing. You need "eng" and "engineering" to match, "ba" and "bachelors", and so on. Once you prepare the tokens that way, it might give good results.
Interest: I don't think cosine will be the best choice here; try a simple Tanimoto coefficient similarity (just size of intersection over size of union).
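For the binary interest vectors, the Tanimoto coefficient only takes a few lines; here is a minimal Python sketch, using the made-up vectors from the question:
def tanimoto(a, b):
    # size of intersection over size of union of the "1" positions
    intersection = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    union = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return intersection / union if union else 1.0

u1 = [1, 0, 0, 1, 0, 0]
u2 = [1, 1, 1, 0, 1, 0]
print(tanimoto(u1, u2))  # 1 shared interest out of 5 distinct interests -> 0.2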
You can't just sum them, since I assume you still want a value in the range [0,1]. You could average them, but that assumes the outputs of each of these are directly comparable, that they're in the same "units", if you will. They aren't here; for example, it's not as if they are probabilities.
It might still work OK in practice to average them, perhaps with weights. For example, an unweighted average says that being in the same city is as important as having exactly the same interests. Is that true, or should it be less important?
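As a hedged sketch of that weighted-average idea (the weights below are pure placeholders, not recommendations, and would have to be tuned against your historical data):
def combined_similarity(city_sim, edu_sim, interest_sim, weights=(0.2, 0.3, 0.5)):
    # weighted average of three similarities, each assumed to lie in [0, 1]
    w_city, w_edu, w_interest = weights
    total = w_city + w_edu + w_interest
    return (w_city * city_sim + w_edu * edu_sim + w_interest * interest_sim) / total

print(combined_similarity(1.0, 0.4, 0.2))  # same city, partly similar education and interests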
You can try and test different variations and weights as hopefully you have some scheme for testing against historical data. I would point you at our project, Mahout, as it has a complete framework for recommenders and evaluation.
However, all these sorts of solutions are hacky and heuristic. I think you might want to take a more formal approach to feature encoding and similarities. If you're willing to buy a book and you like Mahout, Mahout in Action has good coverage in the clustering chapters on how to select and encode features and then how to make one similarity out of them.

Here's the usual trick in machine learning.
city : if x = y then d(x,y) = 0. Otherwise, d(x,y) = 1.
I take this to mean you use a one-of-K coding. That's good.
education : here i will use cosine similarity as words appear in the name of the department or bachelors degree
You can also use a one-of-K coding here, to produce a vector of size |V| where V is the vocabulary, i.e. all words in your training data.
If you now normalize the interest number so that it always falls in the range [0,1], then you can use ordinary L1 (Manhattan) or L2 (Euclidean) distance metrics between your final vectors. The latter corresponds to the cosine similarity metric of information retrieval.
Experiment with L1 and L2 to decide which is best.
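A minimal sketch of that encoding and the two distance metrics, in plain Python with a made-up vocabulary:
import math

def one_of_k(tokens, vocabulary):
    # one-of-K / bag-of-words encoding over the training vocabulary
    return [1.0 if word in tokens else 0.0 for word in vocabulary]

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

vocab = ["engineering", "computer", "science", "arts", "bachelors"]  # hypothetical V
u1 = one_of_k({"computer", "science", "bachelors"}, vocab)
u2 = one_of_k({"engineering", "bachelors"}, vocab)
print(l1(u1, u2), l2(u1, u2))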

Related

ELKI DBSCAN epsilon value issue

I am trying to cluster word vectors using ELKI DBSCAN. I wish to use cosine distance to cluster word vectors of 300 dimensions. The size of the dataset is 19,000 words (a 19000*300 matrix). These are word vectors computed using gensim word2vec, and the output is saved as a CSV.
Below is the command I passed in the UI:
KDDCLIApplication -dbc.in "D:\w2v\vectors.csv" -parser.colsep '","' -algorithm clustering.DBSCAN -algorithm.distancefunction CosineDistanceFunction -dbscan.epsilon 1.02 -dbscan.minpts 5 -vis.window.single
I played around with the epsilon value, trying three values: 0.8, 0.9, and 1.0.
For 0.8 and 0.9 I got "There are very few neighbors found. Epsilon may be too small.",
while for 1.0 I got "There are very many neighbors found. Epsilon may be too large."
What am I doing wrong here? I am quite new to ELKI, so any help is appreciated.
At 300 dimensions, you will be seeing the curse of dimensionality.
Contrary to popular claims, the curse of dimensionality does exist for cosine (as cosine is equivalent to Euclidean on normalized vectors, it can be at best 1 dimension "better" than Euclidean). What often makes cosine applications still work is that the intrinsic dimensionality is much less than the representation dimensionality on text (i.e., while your vocabulary may have thousands of words, only a few occur in the intersection of two documents).
Word vectors are usually not sparse, so your intrinsic dimension may be quite high, and you will see the curse of dimensionality.
So it is not surprising to see the cosine distances concentrate, and then you may need to choose a threshold with several digits of precision.
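You can see this concentration in a quick simulation; the random Gaussian vectors below are only a stand-in for real word vectors (numpy assumed):
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))                   # dense 300-d vectors
X /= np.linalg.norm(X, axis=1, keepdims=True)      # normalize to unit length
sims = X @ X.T                                     # pairwise cosine similarities
dists = 1.0 - sims[np.triu_indices(len(X), k=1)]   # upper-triangle cosine distances
print(dists.mean(), dists.std())                   # mean near 1.0, very small spread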
For obvious reasons, 1.0 is a nonsense threshold for cosine distance. The maximum cosine distance is 1.0! So yes, you will need to try 0.95 and 0.99, for example.
You can use the KNNDistancesSampler to help you choose DBSCAN parameters, or you can use for example OPTICS (which will allow you to find clusters with different thresholds, not just one single threshold).
Beware that word vectors are trained for a very specific scenario: substitutability. They are by far not as universal as popularly interpreted based on the "king-man+woman=queen" example. Just try "king-man+boy", which often also returns "queen" (or "kings"): the result is mostly because the nearest neighbors of "king" are "queen" and "kings". And the "capital" example is similarly overfitted to the training data. It's trained on news articles, which often begin the text with "capital, country, blah blah". If you omit "capital", or if you omit "country", you get almost exactly the same context, so the word2vec model learns that they are "substitutable". This works as long as the capital is also where the country's major newspapers are based (e.g., Berlin, Paris). It often fails for countries like Canada, the U.S., or Australia, where the main reporting hubs are in, e.g., Toronto, New York, or Sydney. And it does not really prove that the vectors have learned what a capital is; the reason it worked in the first place is overfitting on the news training data.

Cosine similarity for user recommendations

Is cosine similarity a good approach for deciding if 2 users are similar based on responses to questions?
I'm trying to have users answer 10 questions and to resolve those responses to a 10-dimensional vector of integers. I then plan to use cosine similarity to find similar users.
I considered resolving each question to an integer and summing the integers to resolve each user to a single integer, but the problem with this approach is that the similarity measure isn't question specific: in other words, if a user gives an answer to question 1 that resolves to 5 and an answer to question 2 that resolves to 0, and another user responds to question 1 with 0 and question 2 with 5, both users "sum to 5", but answered each question fundamentally differently.
So will cosine similarity give a good similarity measure based on each attribute?
Summing all the integers into a single integer per user does not seem right.
I think cosine similarity actually works well here as a similarity measure; you can also try others such as Jaccard, Euclidean, Mahalanobis, etc.
What might help is the intuition behind cosine similarity. Once you create the 10-dimensional vectors, you are working in a 10-dimensional space; each user is a vector in that space, so the value of each component matters. The cosine between two vectors tells you how well aligned they are. If they are parallel, the angle is 0: they point in the same direction, their components are all proportional, and the similarity is maximal (for example, two users who answered every question with exactly the same numbers). As the components start to differ, as in your example where one user answers a question with 5 and the other with 0, the vectors point in different directions; the larger the difference between the answers, the larger the angle between them, which results in a lower cosine and hence lower similarity.
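To make that intuition concrete, here is a minimal sketch for the 10-question vectors (the answer values are invented):
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

user_a = [5, 0, 3, 1, 4, 2, 0, 5, 1, 3]   # answers to the 10 questions
user_b = [0, 5, 3, 1, 4, 2, 0, 5, 1, 3]   # same answer sum, but questions 1 and 2 differ
print(cosine_similarity(user_a, user_b))  # < 1.0: the per-question difference is picked up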
There are other similarity measures, as I mentioned above; one thing people usually do is try several of these measures against a test set and see which one performs better.

Clustering words into groups

This is a homework question. I have a huge document full of words. My challenge is to classify these words into different groups/clusters that adequately represent them. My strategy is to use the k-means algorithm, which, as you know, takes the following steps:
Generate k random means for the entire group
Create K clusters by associating each word with the nearest mean
Compute centroid of each cluster, which becomes the new mean
Repeat Step 2 and Step 3 until a certain benchmark/convergence has been reached.
Theoretically, I kind of get it, but not quite. At each step I have a corresponding question; these are:
How do I decide on the k random means? Technically, I could say 5, but that may not necessarily be a good number. So is this k purely arbitrary, or is it driven by heuristics such as the size of the dataset, the number of words involved, etc.?
How do you associate each word with the nearest mean? Theoretically, I can conclude that each word is associated by its distance to the nearest mean, so if there are 3 means, any word's cluster depends on which mean it has the shortest distance to. However, how is this actually computed? Given two words such as "group" and "textword", and a mean word "pencil", how do I create a similarity matrix?
How do you calculate the centroid?
When you repeat Step 2 and Step 3, are you treating each previous cluster as a new data set?
Lots of questions, and I am obviously not clear on this. If there are any resources I can read, that would be great. Wikipedia did not suffice :(
As you don't know the exact number of clusters, I'd suggest using a kind of hierarchical clustering (a rough code sketch follows these steps):
Imagine that all your words are just points in a non-Euclidean space. Use the Levenshtein distance to calculate the distance between words (it works great if you want to detect clusters of lexicographically similar words)
Build a minimum spanning tree which contains all of your words
Remove links whose length is greater than some threshold
The linked groups of words that remain are clusters of similar words
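A rough sketch of these steps, assuming scipy is available and using a hand-rolled Levenshtein distance (the word list is made up):
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

words = ["group", "groups", "grouping", "textword", "textwords", "pencil"]
D = np.array([[levenshtein(a, b) for b in words] for a in words], dtype=float)

mst = minimum_spanning_tree(D).toarray()   # step 2: MST over the complete distance graph
mst[mst > 3] = 0                           # step 3: drop links longer than the threshold
n_clusters, labels = connected_components(mst + mst.T, directed=False)
print(list(zip(words, labels)))            # step 4: remaining components are the clusters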
P.S. You can find many papers on the web that describe clustering based on building a minimum spanning tree.
P.P.S. If you want to detect clusters of semantically similar words, you need algorithms for automatic thesaurus construction.
Having to choose "k" is one of the biggest drawbacks of k-means.
However, if you use the search function here, you will find a number of questions that deal with the known heuristic approaches to choosing k, mostly by comparing the results of running the algorithm multiple times.
As for "nearest". K-means acutally does not use distances. Some people believe it uses euclidean, other say it is squared euclidean. Technically, what k-means is interested in, is the variance. It minimizes the overall variance, by assigning each object to the cluster such that the variance is minimized. Coincidentially, the sum of squared deviations - one objects contribution to the total variance - over all dimensions is exactly the definition of squared euclidean distance. And since the square root is monotone, you can also use euclidean distance instead.
Anyway, if you want to use k-means with words, you first need to represent the words as vectors for which the squared Euclidean distance is meaningful. I don't think this will be easy; it may not even be possible.
About the distance: in fact, the Levenshtein (or edit) distance satisfies the triangle inequality. It also satisfies the rest of the properties necessary to be a metric (not all distance functions are metrics). Therefore you can implement a clustering algorithm using this metric, and this is the function you could use to compute your similarity matrix S:
S_{i,j} = d(x_i, x_j) = d(x_j, x_i) = S_{j,i}
It's worth mentioning that the Damerau-Levenshtein distance doesn't satisfy the triangle inequality, so be careful with this.
About the k-means algorithm: yes, in the basic version you must define the K parameter by hand. The rest of the algorithm is the same for a given metric.

KNN classification with categorical data

I'm busy working on a project involving k-nearest neighbor (KNN) classification. I have mixed numerical and categorical fields. The categorical values are ordinal (e.g., bank name, account type). Numerical types are, for example, salary and age. There are also some binary types (e.g., male, female).
How do I go about incorporating categorical values into the KNN analysis?
As far as I'm aware, one cannot simply map each categorical field to number keys (e.g. bank 1 = 1; bank 2 = 2, etc.), so I need a better approach for using the categorical fields. I have heard that one can use binary numbers. Is this a feasible method?
You need to find a distance function that works for your data. Using binary indicator variables solves this problem implicitly. This has the benefit of allowing you to keep your (probably matrix-based) implementation with this kind of data, but a much simpler way, and one appropriate for most distance-based methods, is to just use a modified distance function.
There is an infinite number of such combinations, and you need to experiment to find which works best for you. Essentially, you might want to use some classic metric on the numeric values (usually with normalization applied, though it may make sense to move that normalization into the distance function itself), plus a distance on the other attributes, scaled appropriately.
In most real application domains of distance-based algorithms, this is the most difficult part: optimizing your domain-specific distance function. You can see it as part of preprocessing: defining similarity.
There is much more than just Euclidean distance. There are various set-theoretic measures which may be much more appropriate in your case: for example, the Tanimoto coefficient, Jaccard similarity, Dice's coefficient, and so on. Cosine might be an option, too.
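A hedged sketch of such a combined distance function (the field names, ranges, and weights are invented for illustration and would need tuning for your domain):
def mixed_distance(a, b, numeric_fields, categorical_fields, numeric_ranges,
                   numeric_weight=1.0, categorical_weight=1.0):
    # range-normalized differences for numeric fields, 0/1 mismatch for categorical ones
    d_numeric = sum(abs(a[f] - b[f]) / numeric_ranges[f] for f in numeric_fields)
    d_categorical = sum(0.0 if a[f] == b[f] else 1.0 for f in categorical_fields)
    return numeric_weight * d_numeric + categorical_weight * d_categorical

p = {"salary": 50000, "age": 30, "bank": "bank A", "account_type": "savings"}
q = {"salary": 62000, "age": 45, "bank": "bank B", "account_type": "savings"}
ranges = {"salary": 100000, "age": 80}   # rough spreads used for normalization
print(mixed_distance(p, q, ["salary", "age"], ["bank", "account_type"], ranges))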
There are whole conferences dedicated to the topic of similarity search; nobody claims this is trivial in anything but Euclidean vector spaces (and actually, not even there): http://www.sisap.org/2012
The most straightforward way to convert categorical data into numeric data is to use indicator vectors. See the reference I posted in my previous comment.
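For example, a minimal indicator-vector encoding (the bank names are made up):
banks = ["bank A", "bank B", "bank C"]            # all categories seen in training
def indicator(value, categories):
    return [1 if value == c else 0 for c in categories]
print(indicator("bank B", banks))                 # -> [0, 1, 0]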
Can we use Locality Sensitive Hashing (LSH) + edit distance and assume that every bin represents a different category? I understand that categorical data does not show any order and the bins in LSH are arranged according to a hash function. Finding the hash function that gives a meaningful number of bins sounds to me like learning a metric space.

Mutual Information and Chi Square relationship

I've used the following code to compute the Mutual Information and Chi Square values for feature selection in Sentiment Analysis.
MI = ((N11/N)*math.log((N*N11)/((N11+N10)*(N11+N01)), 2)
      + (N01/N)*math.log((N*N01)/((N01+N00)*(N11+N01)), 2)
      + (N10/N)*math.log((N*N10)/((N10+N11)*(N00+N10)), 2)
      + (N00/N)*math.log((N*N00)/((N10+N00)*(N01+N00)), 2))
where N11, N01, N10, and N00 are the observed frequencies of the two features in my data set.
NOTE : I am trying to calculate the mutual information and Chi Squared values between 2 features and not the mutual information between a particular feature and a class. I'm doing this so I'll know if the two features are related in any way.
The Chi Squared formula I've used is:
E00 = N*((N00+N10)/N)*((N00+N01)/N)
E01 = N*((N01+N11)/N)*((N01+N00)/N)
E10 = N*((N10+N11)/N)*((N10+N00)/N)
E11 = N*((N11+N10)/N)*((N11+N01)/N)
chi = ((N11-E11)**2)/E11 + ((N00-E00)**2)/E00 + ((N01-E01)**2)/E01 + ((N10-E10)**2)/E10
where E00, E01, E10, and E11 are the expected frequencies.
By the definition of mutual information, a low value should mean that one feature gives little information about the other, and by the definition of chi-square, a low chi-square value means that the two features must be independent.
But for a certain pair of features, I got a mutual information score of 0.00416 and a chi-square value of 4373.9. This doesn't make sense to me, since the mutual information score indicates the features aren't closely related, but the chi-square value seems high enough to indicate they aren't independent either. I think I'm going wrong with my interpretation.
The values I got for the observed frequencies are
N00 = 312412
N01 = 276116
N10 = 51120
N11 = 68846
MI and Pearson's large-sample statistic are, under the usual conditions concerning sample size, directly proportional. This is quite well known; a proof is given in:
Morris, A.C. (2002), "An information theoretic measure of sequence recognition performance".
It can be downloaded from this page: https://sites.google.com/site/andrewcameronmorris/Home/publications
Therefore, unless there is some mistake in your calculations, if one is high/low the other must be high/low.
The chi-squared independence test examines raw counts, while the mutual information score examines only the marginal and joint probability distributions. Hence, chi-squared also takes the sample size into account.
If the dependence between x and y is very subtle, then knowing one won't help very much in predicting the other. However, as the size of the dataset increases, we can become increasingly confident that some relationship exists.
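To make this concrete with the counts from the question: the G statistic 2*N*MI (with MI in nats) is the quantity that chi-square approximates, so for a fixed joint distribution both grow linearly with N. A small check in Python:
import math

N00, N01, N10, N11 = 312412, 276116, 51120, 68846
N = N00 + N01 + N10 + N11

# mutual information in bits, using the same cell/marginal pairing as the question
MI = sum((nxy / N) * math.log((N * nxy) / (nx * ny), 2)
         for nxy, nx, ny in [(N11, N10 + N11, N01 + N11),
                             (N10, N10 + N11, N00 + N10),
                             (N01, N00 + N01, N01 + N11),
                             (N00, N00 + N01, N00 + N10)])

# chi-square from the question's expected-frequency formulas
expected = {"00": (N00 + N10) * (N00 + N01) / N, "01": (N01 + N11) * (N01 + N00) / N,
            "10": (N10 + N11) * (N10 + N00) / N, "11": (N11 + N10) * (N11 + N01) / N}
observed = {"00": N00, "01": N01, "10": N10, "11": N11}
chi = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)

G = 2 * N * MI * math.log(2)   # convert bits to nats; G and chi come out in the same ballpark
print(MI, chi, G)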
You can try https://github.com/ranmoshe/Inference; it calculates both MI and the p-value statistic using chi-square.
It also knows how to calculate the degrees of freedom for each feature, including taking a conditional group into account (where the degrees of freedom for a feature may differ between different values).