Minimum amount of data for an item-based collaborative filter - recommendation-engine

I'm working on a recommendation engine which uses an item-based collaborative filter to create recommendations for restaurants. Each restaurant has reviews with a rating from 1-5.
Every recommendation algorithm struggles with the data sparsity issue, so I have been looking for solutions to calculate a correct correlation.
I'm using an adjusted cosine similarity between restaurants.
When you want to compute a similarity between restaurants, you need users who have rated both restaurants. But what would be the minimum of users who have rated both restaurants to get a correct correlation?
From testing, I have discovered that requiring only 1 user who has rated both restaurants results in bad similarities (obviously); often it's -1 or 1. So I have increased the minimum to 2 users who have rated both restaurants, which gave me better similarities. I just find it difficult to determine whether this similarity is good enough. Is there a method that tests the accuracy of this similarity, or are there guidelines on what the minimum should be?

The short answer is a parameter sweep: try several values of "minimum users who have rated both restaurants" and measure the outcomes. With more users, you'll get a better sense of the similarity between items (restaurants). But your similarity information will be sparser. That is, you'll focus on the more popular items and be less able to recommend items in the long tail. This means you'll always have a tradeoff, and you should measure everything that will allow you to make the tradeoff. For instance, measure predictive accuracy (e.g., RMSE) as well as the number of items possible to recommend.
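As a rough illustration of such a sweep, here is a minimal sketch in plain Python. The toy ratings, the function names (adjusted_cosine, sweep) and the threshold values are all made up for the example, and it only measures one side of the tradeoff (how many restaurant pairs still get a similarity score). In practice you would also compute RMSE on held-out ratings for each threshold.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy data: ratings[user][restaurant] = rating on a 1-5 scale.
ratings = {
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 4, "B": 5, "C": 2, "D": 4},
    "u3": {"A": 1, "B": 2, "C": 5},
    "u4": {"B": 4, "D": 5},
    "u5": {"A": 2, "C": 4, "D": 1},
}

def adjusted_cosine(ratings, item_a, item_b, min_common):
    """Adjusted cosine similarity, or None if too few users rated both items."""
    num = den_a = den_b = 0.0
    common = 0
    for user_ratings in ratings.values():
        if item_a in user_ratings and item_b in user_ratings:
            # Subtract the user's own mean rating (the "adjusted" part).
            mean = sum(user_ratings.values()) / len(user_ratings)
            da = user_ratings[item_a] - mean
            db = user_ratings[item_b] - mean
            num += da * db
            den_a += da * da
            den_b += db * db
            common += 1
    if common < min_common or den_a == 0 or den_b == 0:
        return None
    return num / (sqrt(den_a) * sqrt(den_b))

def sweep(ratings, thresholds):
    """For each threshold, count the item pairs that still get a similarity."""
    items = sorted({item for r in ratings.values() for item in r})
    pairs = list(combinations(items, 2))
    coverage = {}
    for t in thresholds:
        sims = [adjusted_cosine(ratings, a, b, t) for a, b in pairs]
        coverage[t] = sum(s is not None for s in sims)
    return coverage

coverage = sweep(ratings, thresholds=[1, 2, 3, 4])
for threshold, n_pairs in coverage.items():
    print(f"min co-raters = {threshold}: {n_pairs} pairs with a similarity")
```

Note that with a single co-rater the formula reduces to the sign of the product of the two deviations, which is why a minimum of 1 tends to give exactly -1 or 1.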
If your item space becomes too sparse, you may want to find other ways to do item-item similarity beyond user ratings. For instance, you can use content-based filtering methods to include information about each restaurant's cuisine, then create an intermediate step to learn each user's cuisine preferences. That will allow you to make recommendations even when you don't have item-item similarity scores.

Related

What mechanism can be used to quantify similarity between non-numeric lists?

I have a database of recipes which is essentially structured as a list of ingredients and their associated quantities. If you are given a recipe how would you identify similar recipes allowing for variations and omissions? For example using milk instead of water, or honey instead of sugar or entirely omitting something for flavour.
The current strategy is to do multiple inner joins for combinations of the main ingredients, but this can be exceedingly slow with a large database. Is there another way to do this? Something equivalent to perceptual hashing would be ideal!
How about cosine similarity?
This technique is commonly used in machine learning and text analysis as a similarity measure. With it, you can calculate the similarity between two texts (actually, between any two vectors), which can be interpreted as how alike those texts are (the closer the value is to 1, the more alike).
Take a look at this great question that explains cosine similarity in a simple way. In general, you could use any similarity measure to compare your recipes. This article talks about different similarity measures; you can check it out if you wish to know more.
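To make this concrete for recipes, here is a minimal sketch that treats each recipe as a sparse vector of ingredient quantities. The recipes, the quantities and the function name cosine_similarity are invented for illustration, and it assumes quantities have already been converted to comparable units.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    # Only ingredients present in both recipes contribute to the dot product.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical recipes: ingredient -> quantity in some normalized unit.
pancakes = {"flour": 2, "milk": 1, "egg": 1, "sugar": 1}
crepes = {"flour": 2, "water": 1, "egg": 1, "honey": 1}
salad = {"lettuce": 2, "tomato": 1, "oil": 1}

print(cosine_similarity(pancakes, crepes))  # shares flour and egg
print(cosine_similarity(pancakes, salad))   # no shared ingredients
```

As written, milk and water (or sugar and honey) count as unrelated ingredients. To tolerate substitutions you would probably need to map interchangeable ingredients to a shared key (for example "liquid" or "sweetener") before building the vectors.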

Collaborative Filtering Algorithm

If I have the following users with the following ratings for movies they watched:
User1 Movie1-5 Movie2-4
User2 Movie2-5 Movie2-3 Movie3-4
User3 Movie1-4 Movie2-4 Movie4-4
How would I use collaborative filtering to suggest movie3 to user1 and how do I calculate the probability of user1 giving movie3 a 4 or better?
Well, there are a few different ways of generating recommendations using collaborative filtering. I'll explain the user-based and item-based collaborative filtering methods, which are the most widely used in recommendation algorithms.
User-based collaborative filtering
This basically calculates a similarity between users. The similarity can be a Pearson correlation or a cosine similarity. There are other similarity measures, but these are the most commonly used. This article gives a good explanation of how to calculate this.
User-based filtering does come with a few challenges. First is the data sparsity issue, which occurs when there are a lot of movies with only a few reviews each. This makes it difficult to calculate a correlation between users. This Wikipedia page explains more about this.
Second is the scalability issue. When you have millions of users with thousands of movies, the performance of calculating correlations between users is going to drop tremendously.
Item-based collaborative filtering
This method differs from user-based filtering because it calculates a similarity between movies instead of users. You can then use this similarity to predict a rating for a user. I have found that this presentation explains it very well.
Item-based filters have often outperformed user-based filters. They suffer from the same issues, but to a lesser degree.
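A minimal sketch of an item-based prediction, to show the idea. The ratings below are hypothetical (loosely modelled on the table in the question, not copied from it), the function names are made up, and it uses plain cosine similarity over users who rated both movies, which is only one of several possible choices.

```python
from math import sqrt

# Hypothetical ratings: ratings[user][movie] = rating on a 1-5 scale.
ratings = {
    "User1": {"Movie1": 5, "Movie2": 4},
    "User2": {"Movie1": 5, "Movie2": 3, "Movie3": 4},
    "User3": {"Movie1": 4, "Movie2": 4, "Movie3": 5, "Movie4": 4},
}

def item_similarity(ratings, a, b):
    """Cosine similarity between two movies over users who rated both."""
    pairs = [(r[a], r[b]) for r in ratings.values() if a in r and b in r]
    if not pairs:
        return 0.0
    dot = sum(x * y for x, y in pairs)
    norm_a = sqrt(sum(x * x for x, _ in pairs))
    norm_b = sqrt(sum(y * y for _, y in pairs))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def predict(ratings, user, target):
    """Similarity-weighted average of the user's own ratings."""
    num = den = 0.0
    for movie, rating in ratings[user].items():
        sim = item_similarity(ratings, movie, target)
        if sim > 0:  # ignore movies with no positive similarity to the target
            num += sim * rating
            den += sim
    return num / den if den else None

prediction = predict(ratings, "User1", "Movie3")
print(prediction)
```

This gives a predicted rating, not a probability. Estimating the chance of a rating of 4 or better would need an extra modelling step that is not shown here.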
Content-based filtering
Seeing your data, it's going to be difficult to generate recommendations because you have too little data from users. I would suggest using a content-based filter until you have enough data to use collaborative filtering methods. It's a very simple method which basically looks at the user's profile and compares it to certain tags of a movie. This page explains it in more detail.
I hope this answered some of your questions!
You can calculate similarity either between users or between items. Two easy methods for finding similarity are cosine similarity and Pearson correlation.
This GFG page explains the user-based approach, with an example that finds the similarity among users and uses it to make predictions for items they haven't watched yet.

What is the metric for testing item-item similarity?

For item-item collaborative filtering, the similarity score between two items is sim(x,y) = dot(x,y)/(norm(x)*norm(y)). But how do you check if the result you get is accurate?
Different similarity measures may return different results. For instance, someone's appearance may be more similar to his father's than to his mother's, while his attitude may be more similar to his mother's than to his father's. So in this case, which similarity measure is more accurate? They are both accurate from different perspectives.
Accuracy depends on the prediction error (MAE, RMSE, etc.) and on the quality of the recommendation results (precision, recall, etc.). To find the best similarity measure for a data set, you should try different similarity measures under the same conditions.
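For example, MAE and RMSE can be computed like this once you have predictions for a held-out set of ratings. The numbers below are placeholder values invented purely to show the calculation, not results from any real system, and the variable names are assumptions.

```python
from math import sqrt

def mae(predicted, actual):
    """Mean absolute error between predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Placeholder held-out ratings and the predictions produced with two
# different similarity measures (all values are made up for illustration).
actual = [4, 3, 5, 2]
pred_with_cosine = [3.5, 3.0, 4.0, 3.0]
pred_with_pearson = [4.0, 2.5, 4.5, 2.5]

for name, pred in [("cosine", pred_with_cosine), ("pearson", pred_with_pearson)]:
    print(name, "MAE:", mae(pred, actual), "RMSE:", rmse(pred, actual))
```

The important part is that both sets of predictions are scored against the same held-out ratings, so the only thing that differs is the similarity measure.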

What kind of analysis to use in SPSS for finding out groups/grouping?

My research question is about elderly people, and I have to find underlying groups. The data comes from a questionnaire. I have thought about cluster analysis, but what I would really like to study is perceived health and which things affect it, e.g. what kinds of groups of elderly people rank their health as bad.
I have some 30 questions I would like to check with the analysis, to see if for example widows have better or worse health than the average. I also have weights in my data so I need to use complex samples.
How can I use an already existing function, or what analysis should I use?
The key challenge you have to solve first is to specify a similarity measure. Once you can measure similarity, various clustering algorithms become available.
But questionnaire data doesn't make a very good vector space, so you can't just use Euclidean distance.
If you want to generate clusters using SPSS, standard options include k-means, hierarchical cluster analysis, or two-step clustering. I have some general notes on cluster analysis in SPSS here. See from slide 34.
If you want to see if widows differ in their health, then you need to form a measure of health and compare means on that measure between widows and non-widows (presumably using a between groups t-test). If you have 30 questions related to health, then you may want to do a factor analysis to see how the items group together.
If you are trying to develop a general model of what predicts perceived health, then there is a wide range of modelling options available. Multiple regression would be an obvious starting point. If you have many potential predictors, then you have a lot of choices regarding whether you are going to test particular models or take a more data-driven model-building approach.
More generally, it sounds like you need to clarify the aims of your analyses and the particular hypotheses that you want to test.

How to evaluate a Content-based Recommender System

I'm building a content-based movie recommender system. It's simple, just let a user enter a movie title and the system will find a movie which has the most similar features.
After calculating similarity and sorting the scores in descending order, I find the movies with the 5 highest similarity scores and return them to the user.
Everything worked well until I wanted to evaluate the accuracy of the system. The formulas that I found on Google evaluate accuracy based on rating values (comparing the predicted rating with the actual rating, as RMSE does). I did not convert the similarity score into a rating (on a scale from 1 to 5), so I couldn't apply any of these formulas.
Can you suggest a way to convert the similarity score into a predicted rating so that I can apply RMSE? Or is there another way to solve this problem?
Do you have any ground truth? For instance, do you have information about the movies that a user has liked/seen/bought in the past? It doesn't have to be a rating but in order to evaluate the recommendation you need to know some information about the user's preferences.
If you do, then there are other ways to measure accuracy besides RMSE. RMSE is used when we predict ratings (as you said, it is the error between the real rating and the prediction), but in your case you are generating top-N recommendations. In that case you can use precision and recall to evaluate your recommendations. They are widely used in Information Retrieval applications (see Wikipedia) and are also very common in Recommender Systems. You can also compute the F1 metric, which is the harmonic mean of precision and recall. You'll see that these are very simple formulas and easy to implement.
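A minimal sketch of those three metrics for a single top-N list. The movie titles and the function name precision_recall_f1 are placeholders, and "relevant" here simply means whatever ground truth you have, such as movies the user is known to have liked.

```python
def precision_recall_f1(recommended, relevant):
    """Precision, recall and F1 for one list of recommendations."""
    recommended = list(recommended)
    relevant = set(relevant)
    hits = sum(1 for item in recommended if item in relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical example: 5 recommended movies, 4 movies the user actually liked.
recommended = ["Alien", "Heat", "Up", "Jaws", "Big"]
relevant = {"Alien", "Up", "Rocky", "Fargo"}

precision, recall, f1 = precision_recall_f1(recommended, relevant)
print(precision, recall, f1)
```

Here 2 of the 5 recommendations are relevant and 2 of the 4 relevant movies were found. To score the whole system you would typically average these values over many users or query movies.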
"Evaluating Recommendation Systems" by Guy Shani and Asela Gunawardana is a very good paper on how to evaluate recommender systems and will give you good insight into all this. You can find the paper here.