How to evaluate a Content-based Recommender System - recommendation-engine

I'm building a content-based movie recommender system. It's simple: the user enters a movie title and the system finds the movies with the most similar features.
After calculating the similarities and sorting the scores in descending order, I take the movies corresponding to the 5 highest scores and return them to the user.
Everything worked well until I wanted to evaluate the accuracy of the system. The formulas I found on Google evaluate accuracy based on rating values (comparing predicted and actual ratings, as RMSE does). I did not convert similarity scores into ratings (on a scale from 1 to 5), so I couldn't apply any of those formulas.
Can you suggest a way to convert a similarity score into a predicted rating so that I can then apply RMSE? Or is there another way to approach this problem?

Do you have any ground truth? For instance, do you have information about the movies that a user has liked/seen/bought in the past? It doesn't have to be a rating, but in order to evaluate the recommendations you need some information about the user's preferences.
If you do, then there are other ways to measure accuracy besides RMSE. RMSE is used when we predict ratings (as you said, it is the error between the real rating and the prediction), but in your case you are generating top-N recommendations. In that case you can use precision and recall to evaluate your recommendations. They are widely used in Information Retrieval applications (see Wikipedia) and they are also very common in Recommender Systems. You can also compute the F1 metric, which is the harmonic mean of precision and recall. You'll see they are very simple formulas and easy to implement.
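As a rough illustration, here is a minimal sketch of precision@k, recall@k, and F1 for top-N recommendations, assuming you have some ground truth per user (e.g. movies they actually liked). The function names and toy data are invented for the example.

```python
def precision_recall_at_k(recommended, relevant, k=5):
    # recommended: ranked list returned by the system; relevant: ground-truth items.
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def f1(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: the system recommended 5 movies; the user actually liked 3 movies.
recommended = ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"]
relevant = ["Movie B", "Movie E", "Movie F"]

p, r = precision_recall_at_k(recommended, relevant, k=5)
print(p, r, f1(p, r))  # 0.4, 0.666..., 0.5
```

In practice you would average these scores over all users in a held-out test set.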
"Evaluating Recommendar Systems" by Guy Shani is a very good paper on how to evaluate recommender systems and will give you a good insight into all this. You can find the paper here.

Related

Which clustering algorithms can be used with Word Mover's Distance from M. Kusner's paper?

I am new to machine learning and I am now interested in document clustering (short texts of different lengths) according to their semantic similarity (I just want to go beyond the standard TF/IDF approach). I read the paper http://proceedings.mlr.press/v37/kusnerb15.pdf where the Word Mover's Distance for word embeddings is explained. In the paper they used it for classification. My question now is: can I use it for clustering? If so, is there a paper where this kind of usage is described?
P.S.: I am basically interested in clustering that takes semantic similarity into account, so even a word2vec or doc2vec approach would do the job. I just couldn't find any papers where they are used for clustering.
If you can afford to compute the entire distance matrix, then you can do hierarchical clustering, for example.
It's also easy these days to find other clustering algorithms that accept an arbitrary distance and use a threshold. These could even use the distance bounds from the paper for performance. But it's not obvious that they will work well on such data.
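For instance, here is a minimal sketch of hierarchical clustering on a precomputed distance matrix, assuming you have already computed pairwise Word Mover's Distances into `wmd_matrix` (the toy values below just stand in for real WMD scores):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy symmetric matrix standing in for real pairwise WMD values.
wmd_matrix = np.array([
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0],
])

# scipy expects the condensed (upper-triangular) form of the distance matrix.
condensed = squareform(wmd_matrix)

# Average linkage only needs pairwise dissimilarities, so any metric works.
Z = linkage(condensed, method="average")

# Cut the dendrogram at a distance threshold to obtain flat clusters.
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)  # e.g. [1 1 2 2]
```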

Minimum amount of data for an item-based collaborative filter

I'm working on a recommendation engine which uses an item-based collaborative filter to create recommendations for restaurants. Each restaurant has reviews with a rating from 1-5.
Every recommendation algorithm struggles with the data sparsity issue, so I have been looking for solutions to calculate a correct correlation.
I'm using an adjusted cosine similarity between restaurants.
When you want to compute the similarity between two restaurants, you need users who have rated both of them. But what is the minimum number of users who must have rated both restaurants to get a reliable correlation?
From testing, I have found that requiring only 1 user who has rated both restaurants results in bad similarities (obviously); often the score is -1 or 1. So I increased it to 2 users who have rated both restaurants, which gave me better similarities. I just find it difficult to determine whether these similarities are good enough. Is there a method to test the accuracy of the similarities, or are there guidelines on what the minimum should be?
The short answer is a parameter sweep: try several values of "minimum number of users who have rated both restaurants" and measure the outcomes. With more co-rating users you'll get a better sense of the similarity between items (restaurants), but your similarity information will be sparser. That is, you'll focus on the more popular items and be less able to recommend items in the long tail. This means you'll always have a tradeoff, and you should measure everything that lets you evaluate that tradeoff: for instance, measure predictive accuracy (e.g., RMSE) as well as the number of items you can still recommend.
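To make the sweep concrete, here is a small, self-contained sketch of adjusted cosine similarity with a minimum co-rater threshold; the toy rating matrix and threshold values are invented, and in a real sweep you would also measure RMSE on held-out ratings for each threshold:

```python
import numpy as np

# Rows = users, columns = restaurants, 0 = no rating (toy data).
R = np.array([
    [5, 4, 0, 1],
    [4, 0, 4, 1],
    [1, 2, 5, 0],
    [0, 1, 4, 5],
], dtype=float)

def adjusted_cosine(R, i, j, min_overlap):
    """Adjusted cosine between items i and j, or None if too few co-raters."""
    rated_both = (R[:, i] > 0) & (R[:, j] > 0)
    if rated_both.sum() < min_overlap:
        return None  # not enough evidence to trust a similarity score
    # Subtract each user's mean rating (over the items they actually rated).
    masked = np.where(R > 0, R, np.nan)
    user_means = np.nanmean(masked, axis=1)
    x = R[rated_both, i] - user_means[rated_both]
    y = R[rated_both, j] - user_means[rated_both]
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(np.dot(x, y) / denom) if denom else None

for min_overlap in (1, 2, 3):
    sims = {(i, j): adjusted_cosine(R, i, j, min_overlap)
            for i in range(R.shape[1]) for j in range(i + 1, R.shape[1])}
    kept = sum(s is not None for s in sims.values())
    print(f"min_overlap={min_overlap}: {kept}/{len(sims)} item pairs get a score")
```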
If your item space becomes too sparse, you may want to find other ways to do item-item similarity beyond user ratings. For instance, you can use content-based filtering methods to include information about each restaurant's cuisine, then create an intermediate step to learn each user's cuisine preferences. That will allow you to make recommendations even when you don't have item-item similarity scores.

What is the metric for testing item-item similarity?

For item-item collaborative filtering, the similarity score between two items is sim(x,y) = dot(x,y)/(norm(x)*norm(y)). But how do you check if the result you get is accurate?
Different similarity measures may return different results. For instance, someone's appearance may be more similar to their father's than their mother's, while their attitude may be more similar to their mother's than their father's. So in this case, which similarity measure is more accurate? Both are accurate, just from different perspectives.
Accuracy is measured on the predictions (MAE, RMSE, etc.) and on the recommendation results (precision, recall, etc.). To find the best similarity measure for your data set, you should try different similarity measures under the same conditions.
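As a small illustration of why the choice matters, here is a sketch comparing cosine similarity with Pearson correlation (which is just cosine on mean-centred vectors) on the same toy item vectors; which one is "better" can only be decided by plugging each into the recommender and measuring MAE/RMSE or precision/recall on held-out data:

```python
import numpy as np

x = np.array([5.0, 3.0, 0.0, 4.0])  # toy rating vectors for two items
y = np.array([4.0, 0.0, 2.0, 5.0])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson(a, b):
    # Pearson correlation = cosine similarity of the mean-centred vectors.
    return cosine(a - a.mean(), b - b.mean())

print("cosine :", cosine(x, y))   # uses the raw magnitudes
print("pearson:", pearson(x, y))  # removes each vector's offset first
```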

Dimensionality reduction for high dimensional sparse data before clustering or spherical k-means?

I am trying to build my first recommender system, where I create a user feature space and then cluster users into different groups. To generate recommendations for a particular user, I first find the cluster to which the user belongs and then recommend the entities (items) that their nearest neighbors have shown interest in. The data I am working with is high-dimensional and sparse. Before implementing this approach, I have a few questions whose answers might help me adopt a better one.
As my data is high-dimensional and sparse, should I apply dimensionality reduction and then cluster, or should I use an algorithm like spherical k-means that works directly on sparse high-dimensional data?
How should I find the nearest neighbors after creating the clusters of users? (Which distance measure should I use, given that Euclidean distance is reportedly not a good measure for high-dimensional data?)
It's not obvious that clustering is the right algorithm here. Clustering is great for data exploration and analysis, but not always for prediction. If your end product is based around the concept of "groups of like users" and the items they share, then go ahead with clustering and simply present a ranked list of items that each user's cluster has consumed (or a weighted average rating, if you have preference information).
You might try standard recommender algorithms that work in sparse high-dimensional situations, such as item-item collaborative filtering or sparse SVD.
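If you go the second route, here is a minimal sketch of reducing a sparse user-item matrix with a truncated (sparse) SVD and then finding neighbours with cosine distance; the matrix shape, density, and parameters are purely illustrative:

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

# Stand-in for a sparse users x items interaction matrix.
X = sparse_random(1000, 5000, density=0.001, random_state=0, format="csr")

# TruncatedSVD works directly on scipy sparse matrices (no densifying needed).
svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X)  # shape (1000, 50), dense

# Cosine distance tends to be more meaningful than Euclidean here.
nn = NearestNeighbors(n_neighbors=10, metric="cosine")
nn.fit(X_reduced)
distances, neighbours = nn.kneighbors(X_reduced[:1])  # neighbours of user 0
print(neighbours)
```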

Combining different similarities to build one final similarity

I'm pretty new to data mining and recommendation systems, and I'm now trying to build some kind of recommender system for users that have the following attributes:
city
education
interest
To calculate similarity between users, I'm going to apply cosine similarity and a discrete similarity.
For example:
city : if x = y then d(x,y) = 0. Otherwise, d(x,y) = 1.
education : here I will use cosine similarity over the words that appear in the name of the department or bachelor's degree
interest : there will be a hard-coded number of interests a user can choose, and cosine similarity will be calculated between two vectors like these:
1 0 0 1 0 0 ... n
1 1 1 0 1 0 ... n
where 1 indicates the presence of an interest and n is the total number of interests.
My question is:
How do I combine those 3 similarities in an appropriate way? I mean, just summing them doesn't sound very smart, does it? I would also like to hear comments on my "newbie similarity system", hah.
There are no hard-and-fast answers, since they depend greatly on your input and problem domain. For this reason, a lot of the work in machine learning is the art (not science) of preparing your input. I can give you some general ideas to think about. You have two issues: making a meaningful similarity out of each of these attributes, and then combining them.
The city similarity sounds reasonable but really depends on your domain. Is it really the case that being in the same city means everything, and being in neighboring cities means nothing? For example, does being in similarly sized cities count for anything? In the same state? If it does, your similarity should reflect that.
Education: I understand why you might use cosine similarity, but that is not going to address the real problem here, which is handling different tokens that mean the same thing. You need "eng" and "engineering" to match, and "ba" and "bachelors", things like that. Once you normalize the tokens that way, it might give good results.
Interest: I don't think cosine will be the best choice here; try a simple Tanimoto coefficient similarity (just the size of the intersection over the size of the union).
You can't just sum them, as I assume you still want a value in the range [0,1]. You could average them, but that assumes the outputs of the three similarities are directly comparable, that they're in the same "units" if you will. They aren't here; for example, it's not as if they are probabilities.
It might still work OK in practice to average them, perhaps with weights. With an unweighted average, being in the same city counts exactly as much as having identical interests. Is that true, or should it be less important?
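As a hedged sketch of what a weighted combination might look like (the weights, encodings, and example users below are made up, and the city distance from the question is flipped into a similarity where 1 means the same city):

```python
import numpy as np

def city_similarity(a, b):
    # The question defines d(x,y) as a distance (0 = same city); flipped here to a similarity.
    return 1.0 if a == b else 0.0

def cosine_similarity(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def tanimoto_similarity(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0  # intersection over union

def combined_similarity(u, v, weights=(0.2, 0.3, 0.5)):
    sims = (
        city_similarity(u["city"], v["city"]),
        cosine_similarity(u["education_vec"], v["education_vec"]),
        tanimoto_similarity(u["interests"], v["interests"]),
    )
    # A weighted average stays in [0, 1] as long as the weights sum to 1.
    return sum(w * s for w, s in zip(weights, sims))

u = {"city": "CityA", "education_vec": np.array([1, 0, 1, 0]), "interests": {1, 3, 5}}
v = {"city": "CityA", "education_vec": np.array([1, 1, 0, 0]), "interests": {3, 5, 7}}
print(combined_similarity(u, v))  # 0.2*1.0 + 0.3*0.5 + 0.5*0.5 = 0.6
```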
You can try different variations and weights, assuming you have some scheme for testing against historical data. I would point you at our project, Mahout, as it has a complete framework for recommenders and evaluation.
However, all these sorts of solutions are hacky and heuristic. I think you might want to take a more formal approach to feature encoding and similarities. If you're willing to buy a book and like Mahout, Mahout in Action has good coverage in the clustering chapters of how to select and encode features and then how to make one similarity out of them.
Here's the usual trick in machine learning.
city : if x = y then d(x,y) = 0. Otherwise, d(x,y) = 1.
I take this to mean you use a one-of-K coding. That's good.
education : here I will use cosine similarity over the words that appear in the name of the department or bachelor's degree
You can also use a one-of-K coding here, to produce a vector of size |V| where V is the vocabulary, i.e. all words in your training data.
If you now normalize the interest counts so that they always fall in the range [0,1], then you can use ordinary L1 (Manhattan) or L2 (Euclidean) distance metrics between your final vectors. On normalized vectors, the latter is closely related to the cosine similarity metric of information retrieval.
Experiment with L1 and L2 to decide which is best.
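A small, self-contained sketch of that pipeline, with an invented vocabulary and toy users, might look like this:

```python
import numpy as np

cities = ["london", "paris", "berlin"]            # one-of-K vocabulary for cities
edu_vocab = ["bachelors", "engineering", "arts"]  # vocabulary V of education words
n_interests = 5                                   # hard-coded number of interests

def encode(city, edu_words, interests):
    """Concatenate one-of-K encodings of city, education words, and interests."""
    city_vec = np.array([1.0 if city == c else 0.0 for c in cities])
    edu_vec = np.array([1.0 if w in edu_words else 0.0 for w in edu_vocab])
    interest_vec = np.zeros(n_interests)
    interest_vec[list(interests)] = 1.0
    return np.concatenate([city_vec, edu_vec, interest_vec])

u = encode("london", {"bachelors", "engineering"}, {0, 2})
v = encode("paris", {"engineering"}, {0, 2, 4})

l1 = np.abs(u - v).sum()    # Manhattan distance
l2 = np.linalg.norm(u - v)  # Euclidean distance
print(l1, l2)
```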