Matrix factorization based recommendation for like/dislike/unknown data - recommendation-engine

Most literature focus on either explicit rating data or implicit (like/unknown) data. Are there any good publications to handle like/dislike/unknown data? That is, in the data matrix there are three values, and I'd like to recommend from unknown entries.
And are there any good open source implementations on this?
Thanks.

With like and dislike, you already have explicit rating data. You can use standard collaborative filtering with user and item normalization. You can also check out OrdRec: An Ordinal Model for Predicting Personalized Item Rating Distributions, which just takes an ordinal ranking of item ratings. That is, you can say that Like is better than Dislike, and let the algorithm figure out the best ranking-to-rating mapping before doing standard item-item collaborative filtering. Download LensKit and use the included OrdRec algorithm.

Related

How does word embedding/ word vectors work/created?

How does word2vec create vectors for words? I trained two word2vec models using two different files (from commoncrawl website) but I am getting same word vectors for a given word from both models.
Actually, I have created multiple word2vec models using different text files from the commoncrawl website. Now I want to check which model is better among all. How can select the best model out of all these models and why I am getting same word vectors for different models?
Sorry, If the question is not clear.
If you are getting identical word-vectors from models that you've prepared from different text corpuses, something is likely wrong in your process. You may not be performing any training at all, perhaps because of a problem in how the text iterable is provided to the Word2Vec class. (In that case, word-vectors would remain at their initial, randomly-initialized values.)
You should enable logging, and review the logs carefully to see that sensible counts of words, examples, progress, and incremental-progress are displayed during the process. You should also check that results for some superficial, ad-hoc checks look sensible after training. For example, does model.most_similar('hot') return other words/concepts somewhat like 'hot'?
Once you're sure models are being trained on varied corpuses – in which case their word-vectors should be very different from each other – deciding which model is 'best' depends on your specific goals with word-vectors.
You should devise a repeatable, quantitative way to evaluate a model against your intended end-uses. This might start crudely with a few of your own manual reviews of results, like looking over most_similar() results for important words for better/worse results – but should become more extensive. rigorous, and automated as your project progresses.
An example of such an automated scoring is the accuracy() method on gensim's word-vectors object. See:
https://github.com/RaRe-Technologies/gensim/blob/6d6f5dcfa3af4bc61c47dfdf5cdbd8e1364d0c3a/gensim/models/keyedvectors.py#L652
If supplied with a specifically-formatted file of word-analogies, it will check how well the word-vectors solve those analogies. For example, the questions-words.txt of Google's original word2vec code release includes the analogies they used to report vector quality. Note, though, that the word-vectors that are best for some purposes, like understanding text topics or sentiment, might not also be the best at solving this style of analogy, and vice-versa. If training your own word-vectors, it's best to choose your training corpus/parameters based on your own goal-specific criteria for what 'good' vectors will be.

Clustering techniques for Binary Data

I want to use clustering techniques for binary data analysis. I have collected the data through survey in which i asked the users to select exactly 20 features out of list of 94 product features. The columns in my data represents the 94 product features and the rows represents the participants. I am trying to cluster the similar users in different user groups based on the product features they selected. Each user cluster should also tell me the product features associated with each cluster. I am using some open source clustering tools like NCSS and JMP. I was trying to use fuzzy clustering technique for achieveing my goal but unfortunately these tools do not deal with binary data. Can you please suggest me which technique would really be appropriate for my tasks , also which online tool i can use for using the cluster analysis on my data? As beacuse of the time limitation, I am not looking to code myself and i am only looking for some open source tools that have all the functionality available in them which i can use as it is.
Clustering for binary data is not really well defined.
Rather than looking for some tool/function that may or may not work by trial and error, you should first try to answer a 'simple" question:
What is a good cluster, mathematically?
Vague terms not allowed. The next questions to answer then are: I) when is clustering A better than clustering B (I.e. how does the computer compute quality), and ii) how can this be found efficiently.
You won't get far if you don't understand what you are doing just by calling random functions...
Also, is clustering actually what you are looking for? Most of the time with binary data e.g. frequent itemset mining is the better choice.

Unsupervised Anomaly Detection with Mixed Numeric and Categorical Data

I am working on a data analysis project over the summer. The main goal is to use some access logging data in the hospital about user accessing patient information and try to detect abnormal accessing behaviors. Several attributes have been chosen to characterize a user (e.g. employee role, department, zip-code) and a patient (e.g. age, sex, zip-code). There are about 13 - 15 variables under consideration.
I was using R before and now I am using Python. I am able to use either depending on any suitable tools/libraries you guys suggest.
Before I ask any question, I do want to mention that a lot of the data fields have undergone an anonymization process when handed to me, as required in the healthcare industry for the protection of personal information. Specifically, a lot of VARCHAR values are turned into random integer values, only maintaining referential integrity across the dataset.
Questions:
An exact definition of an outlier was not given (it's defined based on the behavior of most of the data, if there's a general behavior) and there's no labeled training set telling me which rows of the dataset are considered abnormal. I believe the project belongs to the area of unsupervised learning so I was looking into clustering.
Since the data is mixed (numeric and categorical), I am not sure how would clustering work with this type of data.
I've read that one could expand the categorical data and let each category in a variable to be either 0 or 1 in order to do the clustering, but then how would R/Python handle such high dimensional data for me? (simply expanding employer role would bring in ~100 more variables)
How would the result of clustering be interpreted?
Using clustering algorithm, wouldn't the potential "outliers" be grouped into clusters as well? And how am I suppose to detect them?
Also, with categorical data involved, I am not sure how "distance between points" is defined any more and does the proximity of data points indicate similar behaviors? Does expanding each category into a dummy column with true/false values help? What's the distance then?
Faced with the challenges of cluster analysis, I also started to try slicing the data up and just look at two variables at a time. For example, I would look at the age range of patients accessed by a certain employee role, and I use the quartiles and inter-quartile range to define outliers. For categorical variables, for instance, employee role and types of events being triggered, I would just look at the frequency of each event being triggered.
Can someone explain to me the problem of using quartiles with data that's not normally distributed? And what would be the remedy of this?
And in the end, which of the two approaches (or some other approaches) would you suggest? And what's the best way to use such an approach?
Thanks a lot.
You can decide upon a similarity measure for mixed data (e.g. Gower distance).
Then you can use any of the distance-based outlier detection methods.
You can use k-prototypes algorithm for mixed numeric and categorical attributes.
Here you can find a python implementation.

How to generate recommendation with matrix factorization

I've read some papers of Matrix Factorization(Latent Factor Model) in Recommendation System,and I can implement the algorithm.I can get the similar RMSE result like the paper said on the MovieLens dataset.
However I find out that,if I try to generate a top-K(e.g K=10) recommended movies list for every user by rank the predicted rating,it seems that the movies that are thought to be rated high point of all users are the same.
Is that just what it works or I've got something wrong?
This is a known problem in recommendation.
It is sometimes called "Harry Potter" effect - (almost) everybody likes Harry Potter.
So most automated procedures will find out which items are generally popular, and recommend those to the users.
You can either filter out very popular items, or multiply the predicted rating by a factor that is lower the more globally popular an item is.

Using Conditional Random Fields for Named Entity Recognition

What is Conditional Random Field?
How does exactly Conditional Random Field identify proper names as person, organization, or place in a structured or unstructured text?
For example: This product is ordered by StackOverFlow Inc.
What does Conditional Random Field do to identify StackOverFlow Inc. as an organization?
A CRF is a discriminative, batch, tagging model, in the same general family as a Maximum Entropy Markov model.
A full explanation is book-length.
A short explanation is as follows:
Humans annotate 200-500K words of text, marking the entities.
Humans select a set of features that they hope indicate entities. Things like capitalization, or whether the word was seen in the training set with a tag.
A training procedure counts all the occurrences of the features.
The meat of the CRF algorithm search the space of all possible models that fit the counts to find a pretty good one.
At runtime, a decoder (probably a Viterbi decoder) looks at a sentence and decides what tag to assign to each word.
The hard parts of this are feature selection and the search algorithm in step 4.
Well to understand that you got to study a lot of things.
For start
Understand the basic of markov and bayesian networks.
Online course available in coursera by daphne coller
https://class.coursera.org/pgm/lecture/index
CRF is a special type of markov network where we have observation and hidden states.
The objective is to find the best State Assignment to the unobserved variables also known as MAP problem.
Be Prepared for a lot of probability and Optimization. :-)