Results of test statistics evaluation of Recommender Systems given data - recommendation-engine

I was wondering if there is a source, where given some train data and test data, the test statistics of evaluation of Recommender Systems are also provided. For example, given two files train.dat and test.dat, where the data have already been split into a training and a test set which contain user_id, item_id and ratings (just like in grouplens dataset) and in the end some answer for precision or recall or map#k test is provided for the performance of a nonpersonalized recommender system (like the top rated items-most viewed) or any other recommender system.
Thank you in advance,
Regards,
Marios

The fastest way to find out how known recommenders performance on the new dataset is to grab existing implementation and run it. If You have prepared split - it's OK, but make sure that splitting strategy is adequate what You are trying to measure.
Helpful project for evaluation:
- librec
- project Rival
- MediaLens
And make sure, the right measure is used for the right purpose. In the question the rating prediction example is given and then the ranking evaluation measures are listed.

Related

How does word embedding/ word vectors work/created?

How does word2vec create vectors for words? I trained two word2vec models using two different files (from commoncrawl website) but I am getting same word vectors for a given word from both models.
Actually, I have created multiple word2vec models using different text files from the commoncrawl website. Now I want to check which model is better among all. How can select the best model out of all these models and why I am getting same word vectors for different models?
Sorry, If the question is not clear.
If you are getting identical word-vectors from models that you've prepared from different text corpuses, something is likely wrong in your process. You may not be performing any training at all, perhaps because of a problem in how the text iterable is provided to the Word2Vec class. (In that case, word-vectors would remain at their initial, randomly-initialized values.)
You should enable logging, and review the logs carefully to see that sensible counts of words, examples, progress, and incremental-progress are displayed during the process. You should also check that results for some superficial, ad-hoc checks look sensible after training. For example, does model.most_similar('hot') return other words/concepts somewhat like 'hot'?
Once you're sure models are being trained on varied corpuses – in which case their word-vectors should be very different from each other – deciding which model is 'best' depends on your specific goals with word-vectors.
You should devise a repeatable, quantitative way to evaluate a model against your intended end-uses. This might start crudely with a few of your own manual reviews of results, like looking over most_similar() results for important words for better/worse results – but should become more extensive. rigorous, and automated as your project progresses.
An example of such an automated scoring is the accuracy() method on gensim's word-vectors object. See:
https://github.com/RaRe-Technologies/gensim/blob/6d6f5dcfa3af4bc61c47dfdf5cdbd8e1364d0c3a/gensim/models/keyedvectors.py#L652
If supplied with a specifically-formatted file of word-analogies, it will check how well the word-vectors solve those analogies. For example, the questions-words.txt of Google's original word2vec code release includes the analogies they used to report vector quality. Note, though, that the word-vectors that are best for some purposes, like understanding text topics or sentiment, might not also be the best at solving this style of analogy, and vice-versa. If training your own word-vectors, it's best to choose your training corpus/parameters based on your own goal-specific criteria for what 'good' vectors will be.

Clustering techniques for Binary Data

I want to use clustering techniques for binary data analysis. I have collected the data through survey in which i asked the users to select exactly 20 features out of list of 94 product features. The columns in my data represents the 94 product features and the rows represents the participants. I am trying to cluster the similar users in different user groups based on the product features they selected. Each user cluster should also tell me the product features associated with each cluster. I am using some open source clustering tools like NCSS and JMP. I was trying to use fuzzy clustering technique for achieveing my goal but unfortunately these tools do not deal with binary data. Can you please suggest me which technique would really be appropriate for my tasks , also which online tool i can use for using the cluster analysis on my data? As beacuse of the time limitation, I am not looking to code myself and i am only looking for some open source tools that have all the functionality available in them which i can use as it is.
Clustering for binary data is not really well defined.
Rather than looking for some tool/function that may or may not work by trial and error, you should first try to answer a 'simple" question:
What is a good cluster, mathematically?
Vague terms not allowed. The next questions to answer then are: I) when is clustering A better than clustering B (I.e. how does the computer compute quality), and ii) how can this be found efficiently.
You won't get far if you don't understand what you are doing just by calling random functions...
Also, is clustering actually what you are looking for? Most of the time with binary data e.g. frequent itemset mining is the better choice.

how can I set a goal on recommendation system ?(mean average precision, baselineRmse)

I starting to develop offline recommendation system using ALS algorithm.
and I need to set a goal about system.
so I wanna know what criteria used to evaluate recommendation system.
I already know MAP (mean average precision) and improvement to baselineRmse and I would like to know: how is the performance on these criterions in modern recommendation systems to set my goal.
Back in the early days of recommenders people thought predicting ratings was a good idea. This has since proven to be nearly useless of itself. If you have enough space in a UI to show a few recommendations are you going to pick the one you think the user will pick with the highest ratings? That will always result in bad performance. Rating prediction is what RMSE was designed to measure.
MAP#k on the other hand is meant to find the predictiveness in a recommender. It measures how well the training data predicts what is in the test data. It also accounts for the ordering of recommendations. Ranking/ordering of recommendations has more recently been discovered to have a much greater effect on the effectiveness of recommendations because if you can only show a limited number they had better be the most likely to cause a user to take action.
MAP#k also takes account of ranking in the sense that if you measure MAP#1 and MAP#10, you will see decreasing MAP scores if your first recommendation was more likely to be in the test data than the 10th. This means you are ordering recommendations roughly correct.
For these reason we use MAP#k. Split the "gold standard" dataset you will use in later rests and keep the split static—something like 80%-20% will work split by random choice or by time, the most recent 20% used as the test split. Build you model on the 80%, then for each interaction in the 20% get recommendations and see if the recommendations contain the item actually interacted with in the test set. The aggregate of all these will go into the MAP#k calculation, k is based on how many recommendation you ask for.
See these references and some tools we have to do this:
Kaggle blog references python code they and we ActionML use. https://www.kaggle.com/wiki/MeanAveragePrecision
ActionML analysis python code to split data sets and run MAP#k, where we use the Kaggle function. https://github.com/actionml/analysis-tools

Unsupervised Anomaly Detection with Mixed Numeric and Categorical Data

I am working on a data analysis project over the summer. The main goal is to use some access logging data in the hospital about user accessing patient information and try to detect abnormal accessing behaviors. Several attributes have been chosen to characterize a user (e.g. employee role, department, zip-code) and a patient (e.g. age, sex, zip-code). There are about 13 - 15 variables under consideration.
I was using R before and now I am using Python. I am able to use either depending on any suitable tools/libraries you guys suggest.
Before I ask any question, I do want to mention that a lot of the data fields have undergone an anonymization process when handed to me, as required in the healthcare industry for the protection of personal information. Specifically, a lot of VARCHAR values are turned into random integer values, only maintaining referential integrity across the dataset.
Questions:
An exact definition of an outlier was not given (it's defined based on the behavior of most of the data, if there's a general behavior) and there's no labeled training set telling me which rows of the dataset are considered abnormal. I believe the project belongs to the area of unsupervised learning so I was looking into clustering.
Since the data is mixed (numeric and categorical), I am not sure how would clustering work with this type of data.
I've read that one could expand the categorical data and let each category in a variable to be either 0 or 1 in order to do the clustering, but then how would R/Python handle such high dimensional data for me? (simply expanding employer role would bring in ~100 more variables)
How would the result of clustering be interpreted?
Using clustering algorithm, wouldn't the potential "outliers" be grouped into clusters as well? And how am I suppose to detect them?
Also, with categorical data involved, I am not sure how "distance between points" is defined any more and does the proximity of data points indicate similar behaviors? Does expanding each category into a dummy column with true/false values help? What's the distance then?
Faced with the challenges of cluster analysis, I also started to try slicing the data up and just look at two variables at a time. For example, I would look at the age range of patients accessed by a certain employee role, and I use the quartiles and inter-quartile range to define outliers. For categorical variables, for instance, employee role and types of events being triggered, I would just look at the frequency of each event being triggered.
Can someone explain to me the problem of using quartiles with data that's not normally distributed? And what would be the remedy of this?
And in the end, which of the two approaches (or some other approaches) would you suggest? And what's the best way to use such an approach?
Thanks a lot.
You can decide upon a similarity measure for mixed data (e.g. Gower distance).
Then you can use any of the distance-based outlier detection methods.
You can use k-prototypes algorithm for mixed numeric and categorical attributes.
Here you can find a python implementation.

Doubts about clustering methods for tweets

I'm fairly new to clustering and related topics so please forgive my questions.
I'm trying to get introduced into this area by doing some tests, and as a first experiment I'd like to create clusters on tweets based on content similarity. The basic idea for the experiment would be storing tweets on a database and periodically calculate the clustering (ie. using a cron job). Please note that the database would obtain new tweets from time to time.
Being ignorant in this field, my idea (probably naive) would be to do something like this:
1. For each new tweet in the db, extract N-grams (N=3 for example) into a set
2. Perform Jaccard similarity and compare with each of the existing clusters. If result > threshold then it would be assigned to that cluster
3. Once finished I'd get M clusters containing similar tweets
Now I see some problems with this basic approach. Let's put aside computational cost, how would the comparison between a tweet and a cluster be done? Assuming I have a tweet Tn and a cluster C1 containing T1, T4, T10 which one should I compare it to? Given that we're talking about similarity, it could well happen that sim(Tn,T1) > threshold but sim(Tn,T4) < threshold. My gut feeling tells me that something like an average should be used for the cluster, in order to avoid this problem.
Also, it could happen that sim(Tn, C1) and sim(Tn, C2) are both > threshold but similarity with C1 would be higher. In that case Tn should go to C1. This could be done brute force as well to assign the tweet to the cluster with maximum similarity.
And last of all, it's the computational issue. I've been reading a bit about minhash and it seems to be the answer to this problem, although I need to do some more research on it.
Anyway, my main question would be: could someone with experience in the area recommend me which approach should I aim to? I read some mentions about LSA and other methods, but trying to cope with everything is getting a bit overwhelming, so I'd appreciate some guiding.
From what I'm reading a tool for this would be hierarchical clustering, as it would allow regrouping of clusters whenever new data enters. Is this correct?
Please note that I'm not looking for any complicated case. My use case idea would be being able to cluster similar tweets into groups without any previous information. For example, tweets from Foursquare ("I'm checking in ..." which are similar to each other would be one case, or "My klout score is ..."). Also note that I'd like this to be language independent, so I'm not interested in having to deal with specific language issues.
It looks like to me that you are trying to address two different problems in one, i.e. "syntactic" and "semantic" clustering. They are quite different problems, expecially if you are in the realm of short-text analysis (and Twitter is the king of short-text analysis, of course).
"Syntactic" clustering means aggregating tweets that come, most likely, from the same source. Your example of Foursquare fits perfectly, but it is also common for retweets, people sharing online newspaper articles or blog posts, and many other cases. For this type of problem, using a N-gram model is almost mandatory, as you said (my experience suggests that N=2 is good for tweets, since you can find significant tweets that have as low as 3-4 features). Normalization is also an important factor here, removing RT tag, mentions, hashtags might help.
"Semantic" clustering means aggregating tweets that share the same topic. This is a much more difficult problem, and it won't likely work if you try to aggregate random sample of tweets, due to the fact that they, usually, carry too little information. These techniques might work, though, if you restrict your domain to a specific subset of tweets (i.e. the one matching a keyword, or an hashtag). LSA could be useful here, while it is useless for syntactic clusters.
Based on your observation, I think what you want is syntactic clustering. Your biggest issue, though, is the fact that you need online clustering, and not static clustering. The classical clustering algorithms that would work well in the static case (like hierarchical clustering, or union find) aren't really suited for online clustering , unless you redo the clustering from scratch every time a new tweet gets added to your database. "Averaging" the clusters to add new elements isn't a great solution according to my experience, because you need to retain all the information of every cluster member to update the "average" every time new data gets in. Also, algorithms like hierarchical clustering and union find work well because they can join pre-existant clusters if a link of similarity is found between them, and they don't simply assign a new element to the "closest" cluster, which is what you suggested to do in your post.
Algorithms like MinHash (or SimHash) are indeed more suited to online clustering, because they support the idea of "querying" for similar documents. MinHash is essentially a way to obtain pairs of documents that exceed a certain threshold of similarity (in particular, MinHash can be considered an estimator of Jaccard similarity) without having to rely on a quadratic algorithm like pairwise comparison (it is, in fact, O(nlog(n)) in time). It is, though, quadratic in space, therefore a memory-only implementation of MinHash is useful for small collections only (say 10000 tweets). In your case, though, it can be useful to save "sketches" (i.e., the set of hashes you obtain by min-hashing a tweet) of your tweets in a database to form an "index", and query the new ones against that index. You can then form a similarity graph, by adding edges between vertices (tweets) that matched the similarity query. The connected components of your graph will be your clusters.
This sounds a lot like canopy pre-clustering to me.
Essentially, each cluster is represented by the first object that started the cluster.
Objects within the outer radius join the cluster. Objects that are not within the inner radius of at least one cluster start a new cluster. This way, you get an overlapping (non-disjoint!) quantization of your dataset. Since this can drastically reduce the data size, it can be used to speed up various algorithms.
However don't expect useful results from clustering tweets. Tweet data is just to much noise. Most tweets have just a few words, too little to define a good similarity. On the other hand, you have the various retweets that are near duplicates - but trivial to detect.
So what would be a good cluster of tweets? Can this n-gram similarity actually capture this?