how to categorize "tastekid" and "clerkdogs" recommender system - recommendation-engine

What category of recommender system do Tastekid and Clerkdogs belong to? Neither seems to require any ratings from users.

I'm not sure about Clerkdogs, but Tastekid seems to be a collaborative filtering system (see here). Even when users don't provide ratings, you can still apply collaborative filtering techniques. For instance, you can use a binary representation where 1 means a user liked an item and 0 means they didn't like it (or that their opinion is unknown).
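As a rough sketch of what collaborative filtering on such a binary representation could look like: the data and function names below are made up for illustration, and nothing here is known about how Tastekid actually works. Each item is treated as a 0/1 column over users, and items are compared by the cosine similarity of those columns.

```python
from math import sqrt

# Hypothetical "likes": user -> set of items they liked (the 1s in the binary matrix).
# Anything not listed counts as 0 (disliked or unknown).
likes = {
    "alice": {"Inception", "Memento", "Interstellar"},
    "bob": {"Inception", "Memento"},
    "carol": {"Memento", "Amelie"},
    "dave": {"Amelie", "Interstellar"},
}

def item_users(likes):
    """Invert user -> items into item -> set of users who liked it."""
    inverted = {}
    for user, items in likes.items():
        for item in items:
            inverted.setdefault(item, set()).add(user)
    return inverted

def cosine(users_a, users_b):
    """Cosine similarity of two binary columns, given as sets of users."""
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / sqrt(len(users_a) * len(users_b))

def similar_items(item, likes, top_n=3):
    """Rank other items by how similar their audience is to `item`'s audience."""
    inverted = item_users(likes)
    target = inverted.get(item, set())
    scores = [
        (other, cosine(target, users))
        for other, users in inverted.items()
        if other != item
    ]
    # Highest similarity first; ties broken alphabetically for stable output.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:top_n]
```

Calling similar_items("Inception", likes) ranks "Memento" first, because both users who liked "Inception" also liked it. No ratings are involved at any point.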

Related

How to implement Associative Rules Analysis or Market Basket Analysis from scratch?

I went through numerous articles trying to understand what my first step should be to incorporate association analysis (maybe market basket analysis) into my system. They all go deep into the implementation of the algorithms, but none of them talk about how to store the data in the first place.
I would really appreciate it if someone could give me some starting pointers or article links to begin with.
The first thing I want to implement is to track user clicks and provide suggestions based on tracked data.
E.g. a user clicked on link A and subsequently on link B and link C. I can track this activity with some associated metadata (user, user organization, user role etc.).
I do not want it to be limited to links. In the future, I want to add a number of similar use cases to the system and make it smart. E.g. if a user sets specific values for fields A and B, he/she will most likely set value <bla> for field C.
My system may generate several thousand such data points a day (e.g. user clicks, field selections etc.).
Below are my questions:
How should I store my data? SQL or NoSQL? (I briefly looked into MongoDB and it looked promising.)
What tool should I use to perform the association analysis? Are there any open source tools I can use?
It depends. Is your data suitable for a NoSQL database? To answer this question, it's better to read about the CAP theorem and its case studies: https://en.wikipedia.org/wiki/CAP_theorem or http://robertgreiner.com/2014/06/cap-theorem-explained/
Sometimes you want consistency (depending on your data) and availability, in which case it's better to use a relational database like MySQL. Try to read the case studies and analyse your data to pick the best tool.
There are a large number of open source libraries, but in my opinion it's better to first read about the concepts and algorithms. Try searching for the Apriori, ECLAT and FP-Growth algorithms and understand the concepts behind them. Then you can pick a tool or write the code yourself. Some useful tools (depending on your programming language):
Python: https://github.com/asaini/Apriori, https://github.com/enaeseth/python-fp-growth, https://github.com/enaeseth/python-fp-growth/blob/master/fp_growth.py
PHP: https://github.com/sigidhanafi/fp-growth-php
JAVA: https://github.com/goodinges/FP-Growth-Java, http://www.philippe-fournier-viger.com/spmf/
Also you can use Spark: https://spark.apache.org/docs/1.1.1/mllib-guide.html
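To get a feel for the concepts before picking a tool, here is a minimal Apriori sketch in plain Python. It is not taken from any of the libraries above. It assumes (my assumption, not something from the question) that the tracked clicks have already been grouped into one set of items per user session, which is also a reasonable shape to store them in, whichever database you choose.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: every single item is a candidate.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: build size k+1 candidates from the frequent size-k itemsets.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its size k-1 subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

def rules(frequent, min_confidence):
    """Yield (antecedent, consequent, confidence) for rules above min_confidence."""
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                confidence = count / frequent[antecedent]
                if confidence >= min_confidence:
                    yield antecedent, itemset - antecedent, confidence

# Hypothetical click sessions: each set holds the links clicked in one session.
sessions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]
itemsets = frequent_itemsets(sessions, min_support=3)
found_rules = list(rules(itemsets, min_confidence=0.7))
```

A rule such as {A} -> {B} with confidence 0.75 reads as "in 75% of the sessions where A was clicked, B was clicked too". This brute-force counting is fine for a few thousand sessions; for larger data, FP-Growth or one of the tools listed above is likely a better fit.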

What are the non-rating based recommendation systems

What are the non-rating based recommendation systems?
I could have used Recommenderlab, but it needs a rating matrix as input.
If you are looking for recommender system packages which can handle purchase data, I would recommend taking a look at https://github.com/benfred/implicit or https://github.com/lyst/lightfm. Both packages can handle data sets with no explicit rating information (such as 5-star movie ratings). Search for "implicit feedback recommender systems", which is probably what you want.
I think your data set is a boolean data set of the form userId, itemId, where no rating exists.
Here is a solution with Apache Mahout for boolean data sets:
http://kickstarthadoop.blogspot.com.tr/2011/05/generating-recommendations-with-mahout_26.html
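For illustration only, here is a small plain-Python sketch of a user-based recommender on boolean userId, itemId pairs. It is not the Mahout code from the linked post. It uses the Jaccard (Tanimoto) coefficient as the user similarity, and the sample pairs and function names are invented.

```python
from collections import defaultdict

def load_pairs(pairs):
    """Group (userId, itemId) pairs into user -> set of items."""
    items_by_user = defaultdict(set)
    for user_id, item_id in pairs:
        items_by_user[user_id].add(item_id)
    return items_by_user

def jaccard(a, b):
    """Jaccard (Tanimoto) similarity between two sets of items."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def recommend(user_id, items_by_user, top_n=3):
    """Score unseen items by the summed similarity of the users who have them."""
    seen = items_by_user.get(user_id, set())
    scores = defaultdict(float)
    for other, items in items_by_user.items():
        if other == user_id:
            continue
        weight = jaccard(seen, items)
        if weight == 0.0:
            continue  # users with nothing in common contribute nothing
        for item in items - seen:
            scores[item] += weight
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]

# Hypothetical purchase data: (userId, itemId), no ratings.
pairs = [
    (1, "milk"), (1, "bread"),
    (2, "milk"), (2, "bread"), (2, "butter"),
    (3, "bread"), (3, "jam"),
    (4, "tea"),
]
purchases = load_pairs(pairs)
```

Here recommend(1, purchases) puts "butter" ahead of "jam", since user 2 overlaps more with user 1 than user 3 does.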

Can I use Apache Mahout Taste for User Preferences matching?

I am trying to match objects based on predefined user preferences. A simple example would be finding the best matching vehicle.
Let's say a user 'Tom' is offered a rental vehicle for travel based on his predefined preferences. In this case, the predefined user preferences will be -
** Pre-defined user preferences for Tom:
PreferredVehicle (Make='ANY', Type='3-wheeler/4-wheeler',
Category='Sedan/Hatchback', AC/Non-AC='AC')
** while the 10 available vehicles are -
Vehicle1(Make='Toyota', Type='4-wheeler', Category='Hatchback', AC/Non-AC='AC')
Vehicle2(Make='Tata', Type='3-wheeler', Category='Transport', AC/Non-AC='Non-AC')
Vehicle3(Make='Honda', Type='4-wheeler', Category='Sedan', AC/Non-AC='AC')
;
;
and so on up to 'Vehicle10'
All I want to do is - choose a vehicle for Tom that best matches his preferences and also probably give him choices in order, i.e. best match first.
Questions I have :
Can this be done with Mahout Taste?
If yes, can someone please point me to some example code where I can start quickly?
A recommender may not be the best tool for the job here, for a few reasons. First, I don't expect that the best answers are all that personal in this domain. If I wanted a Ford Focus, the best alternative you have is likely about the same for most every user. Second, there is not much of a discovery problem here. I'm searching for a vehicle that meets certain needs; I don't particularly want or need to find new and unknown vehicles, like I would for music. Finally you don't have much data per user; I assume most users have never rented before, and very few have even 3+ rentals.
Can you throw this data at a recommender anyway? Sure, try Mahout Taste (I'm the author). If you have the book Mahout in Action it will walk you through it. Since it's non-rating data, I can also recommend the successor project, Myrrix (http://myrrix.com) as it will be easier to set up and run. You can at least evaluate the results to see if it's anywhere near useful.
Either way, your work will just be to make a CSV file of "userID,vehicleID" pairs from your data and feed it in. Then it will give you vehicle IDs as recommendations for any user ID.
But I imagine you will do much better to analyze what people picked when the car they wanted wasn't available, look at the difference, learn which attributes are most and least likely to be sacrificed, and learn to score the alternatives that way. This is entirely feasible since this data set is small and you have rich item attribute data.
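A minimal sketch of that kind of attribute scoring, using the vehicles from the question. The weights are invented placeholders; in practice they would be learned from which substitutions renters actually accepted. This is plain Python, not Mahout.

```python
# Hypothetical weights: how much a match on each attribute is worth.
# In practice these would be learned from what renters accepted as substitutes.
WEIGHTS = {"make": 1.0, "type": 3.0, "category": 2.0, "ac": 2.5}

def score(vehicle, preferences, weights=WEIGHTS):
    """Sum the weights of the attributes on which the vehicle satisfies the user."""
    total = 0.0
    for attribute, accepted in preferences.items():
        # "ANY" means the user does not care about this attribute.
        if accepted == "ANY" or vehicle[attribute] in accepted:
            total += weights[attribute]
    return total

def rank(vehicles, preferences):
    """Return vehicle names, best match first (ties broken by name)."""
    return sorted(vehicles, key=lambda name: (-score(vehicles[name], preferences), name))

tom = {
    "make": "ANY",
    "type": {"3-wheeler", "4-wheeler"},
    "category": {"Sedan", "Hatchback"},
    "ac": {"AC"},
}
vehicles = {
    "Vehicle1": {"make": "Toyota", "type": "4-wheeler", "category": "Hatchback", "ac": "AC"},
    "Vehicle2": {"make": "Tata", "type": "3-wheeler", "category": "Transport", "ac": "Non-AC"},
    "Vehicle3": {"make": "Honda", "type": "4-wheeler", "category": "Sedan", "ac": "AC"},
}
ranking = rank(vehicles, tom)
```

With these weights, Vehicle1 and Vehicle3 match every preference and tie at the top, while Vehicle2 only matches on make and type and comes last.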

Tools for getting intent from Twitter statuses? [closed]

I am considering a project in which a publication's content is augmented by relevant, publicly available tweets from people in the area. But how could I programmatically find the relevant Tweets? I know that generating a structure representing the meaning of natural language is pretty much the holy grail of NLP, but perhaps there's some tool I can use to at least narrow it down a bit?
Alternatively, I could just use hashtags. But that requires more work on behalf of the users. I'm not super familiar with Twitter - do most people use hashtags (even for smaller scale issues), or would relying on them cut off a large segment of data?
I'd also be interested in grabbing Facebook statuses (with permission from the poster, of course), and hashtag use is pretty rare on Facebook.
I could use simple keyword search to crudely narrow the field, but that's more likely to require human intervention to determine which tweets should actually be posted alongside the content.
Ideas? Has this been done before?
There are two straightforward ways to go about finding tweets relevant to your content.
The first would be to treat this as a supervised document classification task, whereby you would train a classifier to annotate tweets with a certain predetermined set of topic labels. You could then use the labels to select tweets that are appropriate for whatever content you'll be augmenting. If you don't like using a predetermined set of topics, another approach would be to simply score tweets according to their semantic overlap with your content. You could then display the top n tweets with the most semantic overlap.
Supervised Document Classification
Using supervised document classification would require that you have a training set of tweets labeled with the set of topics you'll be using. e.g.,
tweet: NBA finals rocked label: sports
tweet: Googlers now allowed to use Ruby! label: programming
tweet: eating lunch label: other
If you want to collect training data without having to manually label the tweets with topics, you could use hashtags to assign topic labels to the tweets. The hashtags could be identical with the topic labels, or you could write rules to map tweets with certain hashtags to the desired label. For example, tweets tagged either #NFL or #NBA could all be assigned a label of sports.
Once you have the tweets labeled by topic, you can use any number of existing software packages to train a classifier that assigns labels to new tweets. A few available packages include:
NLTK (Python) - see Chapter 6 in the NLTK book on Learning to Classify Text
Classifier4J (Java)
nBayes (C#)
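To make the idea concrete without depending on any of those packages, here is a small, self-contained sketch: hashtags are mapped to topic labels to build the training set, and a multinomial Naive Bayes classifier (my choice for the example, other classifiers would work too) is trained on the remaining words. The hashtag rules and sample tweets are invented.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical rules mapping hashtags to topic labels.
HASHTAG_LABELS = {"#nfl": "sports", "#nba": "sports", "#ruby": "programming"}

def tokenize(text):
    """Lowercase the text and split it into words and hashtags."""
    return re.findall(r"[#\w]+", text.lower())

def label_from_hashtags(text, default="other"):
    """Assign a topic label from the first known hashtag in the tweet."""
    for token in tokenize(text):
        if token in HASHTAG_LABELS:
            return HASHTAG_LABELS[token]
    return default

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of tweets
        self.vocabulary = set()

    def train(self, text, label):
        # Drop hashtags so the classifier learns from the words, not the tag itself.
        words = [t for t in tokenize(text) if not t.startswith("#")]
        self.word_counts[label].update(words)
        self.label_counts[label] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        words = [t for t in tokenize(text) if not t.startswith("#")]
        total_tweets = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.label_counts):
            total_words = sum(self.word_counts[label].values())
            # Log prior plus the sum of smoothed log likelihoods.
            log_prob = math.log(self.label_counts[label] / total_tweets)
            for word in words:
                count = self.word_counts[label][word] + 1
                log_prob += math.log(count / (total_words + len(self.vocabulary)))
            if log_prob > best_score:
                best_label, best_score = label, log_prob
        return best_label

training_tweets = [
    "NBA finals rocked #NBA",
    "What a touchdown in the game last night #NFL",
    "Googlers now allowed to use Ruby! #ruby",
    "Refactoring my code all day #ruby",
    "eating lunch",
]
classifier = NaiveBayes()
for tweet in training_tweets:
    classifier.train(tweet, label_from_hashtags(tweet))
```

After training, classifier.classify("great game in the finals") can label a tweet that carries no hashtag at all. A real system would need far more training tweets than this toy set.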
Semantic Overlap
Finding tweets using their semantic overlap with your content avoids the need for a labeled training set. The simplest way to estimate the semantic overlap between your content and the tweets that you're scoring is to use a vector space model. To do this, represent your document and each tweet as a vector, with each dimension in the vector corresponding to a word. The value assigned to each vector position then represents how important that word is to the meaning of the document. One way to estimate this would be to simply use the number of times the word occurs in the document. However, you'll likely get better results by using something like TF-IDF, which up-weights rare terms and down-weights more common ones.
Once you've represented your content and the tweets as vectors, you can score the tweets by their semantic similarity to your content by taking the cosine similarity of the vector for your content and the vector for each tweet.
There's no need to code any of this yourself. You can just use a package like Classifier4J, which includes a VectorClassifier class that scores document similarity using a vector space model.
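If you do want to see the mechanics, the following is a small plain-Python sketch of TF-IDF weighting plus cosine similarity. It is not Classifier4J's implementation; the smoothed IDF formula is one common variant chosen for the example, and the sample content and tweets are invented.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def tfidf_vectors(documents):
    """Return one sparse {word: weight} vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n = len(tokenized)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF: rare words get a larger weight, common words a smaller one.
        vectors.append({
            word: count * (math.log((1 + n) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_tweets(content, tweets, top_n=2):
    """Return the top_n (tweet, score) pairs by similarity to the content."""
    vectors = tfidf_vectors([content] + tweets)
    content_vec, tweet_vecs = vectors[0], vectors[1:]
    scored = sorted(
        zip(tweets, (cosine(content_vec, v) for v in tweet_vecs)),
        key=lambda pair: -pair[1],
    )
    return scored[:top_n]

content = "The city council voted to close the downtown bridge for repairs"
tweets = [
    "Stuck in traffic because the bridge downtown is closed",
    "eating lunch",
    "Council vote on bridge repairs was long overdue",
]
ranked = rank_tweets(content, tweets, top_n=3)
```

The tweet "eating lunch" shares no words with the content, so it scores exactly 0 and lands at the bottom of the ranking.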
Better Semantic Overlap
One problem you might run into with vector space models that use one term per dimension is that they don't do a good job of handling different words that mean roughly the same thing. For example, such a model would say that there is no similarity between "The small automobile" and "A little car".
There are more sophisticated modeling frameworks such as latent semantic analysis (LSA) and latent Dirichlet allocation (LDA) that can be used to construct more abstract representations of the documents being compared to each other. Such models can be thought of as scoring documents not based on simple word overlap, but rather in terms of overlap in the underlying meaning of the words.
In terms of software, the package Semantic Vectors provides a scalable LSA-like framework for document similarity. For LDA, you could use David Blei's implementation or the Stanford Topic Modeling Toolbox.
Great question. I think for Twitter your best bet is to use hashtags, because otherwise you need to create, or find, algorithms that do language analysis and improve over time based on user input/feedback.
For Facebook you can do something like what Bing implemented a while back, as I covered in this article:
http://www.socialtimes.com/2010/06/bing-adds-facebook-and-twitter-features-steps-up-social-services/
I wrote: For example, a search for “NBA Finals” will return fan-page content from Facebook, including posts from a local TV station. So if you're trying to augment NBA-related content, you could do a search similar to the one Bing provides, searching publicly available fan-page content the way spiders index it for search engines. I'm not a developer so I'm not sure of the intricacies, but I know it can be done.
Also, for non-fan-page content, popular shared links from users who publish to ‘everyone’ are aggregated, and you can display those. I'm not sure if this is limited to links published to 'everyone' and/or being 'popular', although I would assume so, but you can double-check that.
Hope this helps
The problem with NLP is not the algorithm (although that is an issue); the problem is the resources. There are some open source shallow parsing tools (that's all you would need to get intent) that you could use, but parsing thousands or millions of tweets would cost a fortune in computer time.
On the other hand, as you said, not all tweets have hashtags, and there is no promise they will be relevant.
Maybe you can use a mixture: first a keyword search to narrow the field to a few possibilities (those with the highest keyword density), and then a deeper analysis to pick the top 1 or 2. This would keep computing resources to a minimum, and you should still get relevant tweets.
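A sketch of that two-stage idea, assuming keyword density means "matching words divided by total words". The expensive_score argument is a placeholder for whatever deeper analysis you end up using; the sample tweets are invented.

```python
import re

def keyword_density(text, keywords):
    """Fraction of the words in `text` that are keywords."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in keywords)
    return hits / len(tokens)

def shortlist(tweets, keywords, keep=10):
    """Cheap first pass: keep the `keep` tweets with the highest keyword density."""
    scored = [(keyword_density(tweet, keywords), tweet) for tweet in tweets]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: -pair[0])
    return [tweet for _, tweet in scored[:keep]]

def pick_best(tweets, keywords, expensive_score, keep=10, top_n=2):
    """Run the costly scorer only on the shortlisted tweets."""
    candidates = shortlist(tweets, keywords, keep)
    return sorted(candidates, key=expensive_score, reverse=True)[:top_n]

tweets = [
    "NBA finals tonight",
    "lunch",
    "the nba finals were great but long and I was tired",
    "finals week at school",
]
keywords = {"nba", "finals"}
candidates = shortlist(tweets, keywords, keep=2)
```

The point of the design is that expensive_score is called at most `keep` times, no matter how many tweets come in. Note that density alone still lets through "finals week at school", which is exactly the kind of false positive the second stage has to catch.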

How do I adapt my recommendation engine to cold starts?

I am curious what methods or approaches there are to overcome the "cold start" problem: when a new user or item enters the system, making recommendations is difficult because there is no information about this new entity.
I can think of doing some prediction-based recommendation (based on gender, nationality and so on).
You can cold start a recommendation system.
There are two types of recommendation systems: collaborative filtering and content-based. Content-based systems use metadata about the things you are recommending. The question is then: what metadata is important? The second approach is collaborative filtering, which doesn't care about the metadata; it just uses what people did or said about an item to make a recommendation. With collaborative filtering you don't have to worry about which terms in the metadata are important. In fact, you don't need any metadata to make the recommendation. The problem with collaborative filtering is that you need data. Before you have enough data, you can use content-based recommendations. You can provide recommendations that are based on both methods: at the beginning they are 100% content-based, then as you get more data you start to mix in collaborative filtering.
That is the method I have used in the past.
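A minimal sketch of such a blend, assuming both recommenders return a score per item on a comparable scale. The linear ramp and the threshold of 50 interactions are arbitrary choices for the example, not a recommended setting.

```python
def blend_weight(n_interactions, full_trust_at=50):
    """Share of the final score given to collaborative filtering (0.0 to 1.0)."""
    return min(1.0, n_interactions / full_trust_at)

def hybrid_scores(content_scores, collaborative_scores, n_interactions, full_trust_at=50):
    """Mix the two score dictionaries according to how much data we have."""
    w = blend_weight(n_interactions, full_trust_at)
    items = set(content_scores) | set(collaborative_scores)
    return {
        item: (1 - w) * content_scores.get(item, 0.0)
        + w * collaborative_scores.get(item, 0.0)
        for item in items
    }

# Hypothetical scores from the two recommenders for the same user.
content = {"a": 0.8, "b": 0.2}
collaborative = {"a": 0.1, "b": 0.9}
cold = hybrid_scores(content, collaborative, n_interactions=0)
warm = hybrid_scores(content, collaborative, n_interactions=25)
```

With no interactions the result equals the content-based scores; at 25 interactions the two are weighted equally; from 50 onwards only collaborative filtering counts.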
Another common technique is to treat the content-based portion as a simple search problem. You just put in the metadata as the text or body of your document, then index your documents. You can do this with Lucene & Solr without writing any code.
If you want to know how basic collaborative filtering works, check out Chapter 2 of "Programming Collective Intelligence" by Toby Segaran.
Maybe there are times you just shouldn't make a recommendation? "Insufficient data" should qualify as one of those times.
I just don't see how prediction recommendations based on "gender, nationality and so on" will amount to more than stereotyping.
IIRC, places such as Amazon built up their databases for a while before rolling out recommendations. It's not the kind of thing you want to get wrong; there are lots of stories out there about inappropriate recommendations based on insufficient data.
Working on this problem myself, but this paper from Microsoft on Boltzmann machines looks worthwhile: http://research.microsoft.com/pubs/81783/gunawardana09__unified_approac_build_hybrid_recom_system.pdf
This has been asked several times before (naturally, I cannot find those questions now :/), but the general conclusion was that it's better to avoid such recommendations. In various parts of the world, the same names belong to different sexes, and so on ...
Recommendations based on "similar users liked..." clearly must wait. You can give out coupons or other incentives to survey respondents if you are absolutely committed to doing predictions based on user similarity.
There are two other ways to cold-start a recommendation engine.
Build a model yourself.
Get your suppliers to fill in key information to a skeleton model. (Also may require $ incentives.)
Lots of potential pitfalls in all of these, which are too common sense to mention.
As you might expect, there is no free lunch here. But think about it this way: recommendation engines are not a business plan. They merely enhance the business plan.
There are three things needed to address the Cold-Start Problem:
The data must have been profiled such that you have many different features (with product data the term used for 'feature' is often 'classification facets'). If you don't properly profile data as it comes in the door, your recommendation engine will stay 'cold' as it has nothing with which to classify recommendations.
MOST IMPORTANT: You need a user-feedback loop with which users can review the personalization engine's suggestions. For example, a Yes/No button for 'Was This Suggestion Helpful?' should queue a review that moves participants from one training dataset (i.e. the 'Recommend' training dataset) to another training dataset (i.e. the 'DO NOT Recommend' training dataset).
The model used for (Recommend/DO NOT Recommend) suggestions should never be considered to be a one-size-fits-all recommendation. In addition to classifying the product or service to suggest to a customer, how the firm classifies each specific customer matters too. If functioning properly, one should expect that customers with different features will get different suggestions for (Recommend/DO NOT Recommend) in a given situation. That would be the 'personalization' part of personalization engines.