Recommendation algorithm for suggesting jobs to workers (crowdsourcing platform) - recommendation-engine

I have crawled the MTurk website and have 260 HITs as a dataset. A number of users have selected HITs from this dataset and assigned a rating to each HIT they selected. Now I want to give these users recommendations based on their selections. How can this be done? Can anyone recommend a suitable recommendation algorithm?

It sounds like you should go for one of the Collaborative Filtering (CF) algorithms, since the users gave explicit feedback in the form of ratings. First, I would suggest implementing a simple item- or user-based k-Nearest Neighbours algorithm. If the results do not satisfy you, or your data turns out to be very sparse, matrix factorization techniques should probably do the trick. A good recent survey that I read is [1]; it compares the different methods under different data settings.
If you feel comfortable with this and realize that what you actually need is a ranked list of top-N predictions rather than ratings, I would suggest reading about, e.g., Bayesian Personalized Ranking [2].
And the best part is that these algorithms are really well known, and most of them are available for almost every programming language, e.g. Python.
[1] J. Lee, M. Sun, and G. Lebanon, “A Comparative Study of Collaborative Filtering Algorithms,” ArXiv, pp. 1–27, 2012.
[2] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, “BPR: Bayesian Personalized Ranking from Implicit Feedback,” in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 2009, pp. 452–461.
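To make the item-based k-Nearest Neighbours suggestion a bit more concrete, here is a minimal sketch in plain Python. The ratings dictionary, the HIT ids and the function names are made up for illustration, and cosine similarity over sparse rating vectors is only one of several reasonable similarity choices, so treat it as a starting point rather than a reference implementation.

```python
from math import sqrt

# Hypothetical ratings: user -> {HIT id -> rating on a 1-5 scale}.
ratings = {
    "u1": {"hit_a": 5, "hit_b": 4, "hit_c": 1},
    "u2": {"hit_a": 4, "hit_b": 5, "hit_d": 2},
    "u3": {"hit_c": 5, "hit_d": 4},
    "u4": {"hit_a": 5, "hit_c": 2, "hit_d": 1},
}

def item_vectors(ratings):
    """Invert user -> item ratings into item -> {user: rating}."""
    items = {}
    for user, user_ratings in ratings.items():
        for item, r in user_ratings.items():
            items.setdefault(item, {})[user] = r
    return items

def cosine(a, b):
    """Cosine similarity of two sparse rating vectors (missing = 0)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def predict(ratings, user, item, k=2):
    """Similarity-weighted mean of the user's ratings on the k items
    most similar to `item`; None if nothing similar was rated."""
    items = item_vectors(ratings)
    neighbours = sorted(
        ((cosine(items[item], items[other]), r)
         for other, r in ratings[user].items() if other != item),
        reverse=True,
    )[:k]
    total_sim = sum(sim for sim, _ in neighbours)
    if total_sim == 0:
        return None
    return sum(sim * r for sim, r in neighbours) / total_sim

def recommend(ratings, user, n=3, k=2):
    """Rank the items the user has not rated yet by predicted rating."""
    scored = []
    for item in item_vectors(ratings):
        if item not in ratings[user]:
            p = predict(ratings, user, item, k)
            if p is not None:
                scored.append((p, item))
    return [item for _, item in sorted(scored, reverse=True)[:n]]

print(recommend(ratings, "u3"))
```

With only 260 HITs, the full item-item similarity table is tiny, so precomputing it once instead of recomputing it per prediction would be an easy optimisation.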


Clearing Mesh of Graph

When we visualize information from documents, the graph generated across multiple documents often forms a mesh. To get a clear picture it helps to build the graph with a minimal data load, so summarization is a good thing. But if the number of documents reaches a million, the graph forms a big mesh even with summarization.
I am a bit perplexed about how to clear the mesh. Reading around and trying workarounds has not helped much, as the data is huge.
Could any knowledgeable members kindly help me out?
Are you talking about creating a graph or network of the documents? For example, you could have a network of documents linked by their citations, by having shared authors, by having the same terms appearing in them, etc. This isn't generally called a mesh problem; it is an automatic graph layout problem.
You need either better layout algorithms or to do some kind of clustering and reduction. There are many clustering algorithms you can use, for example Wakita & Tsurumi's:
Ken Wakita and Toshiyuki Tsurumi. 2007. Finding community structure in mega-scale social networks: [extended abstract]. Proc. 16th international conference on World Wide Web (WWW '07). 1275-1276. DOI=10.1145/1242572.1242805.
One that is particularly targeted at reducing complexity through "graph summarization" is Navlakha et al. 2008:
Saket Navlakha, Rajeev Rastogi, and Nisheeth Shrivastava. 2008. Graph summarization with bounded error. Proc. 2008 ACM SIGMOD international conference on Management of data (SIGMOD '08). 419-432. DOI=10.1145/1376616.1376661.
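As a rough illustration of the clustering-and-reduction idea, the sketch below collapses every cluster into a single supernode and counts the edges between clusters. It is not the algorithm from either paper above; the cluster assignment is simply assumed to come from whatever community-detection step you run, and the document ids and function name are hypothetical.

```python
from collections import Counter

def collapse(edges, cluster_of):
    """Collapse a graph into a summary graph with one node per cluster.

    edges      -- iterable of (u, v) pairs (undirected)
    cluster_of -- dict mapping each node to a cluster label, assumed to
                  be the output of some community-detection step
    Returns (sizes, super_edges): the node count per cluster, and the
    number of original edges between each pair of distinct clusters.
    """
    sizes = Counter(cluster_of.values())
    super_edges = Counter()
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu != cv:  # edges inside a cluster vanish into the supernode
            super_edges[tuple(sorted((cu, cv)))] += 1
    return dict(sizes), dict(super_edges)

# Hypothetical document graph: two dense groups joined by one edge.
edges = [("d1", "d2"), ("d2", "d3"), ("d1", "d3"),
         ("d4", "d5"), ("d5", "d6"), ("d4", "d6"),
         ("d3", "d4")]
cluster_of = {"d1": "A", "d2": "A", "d3": "A",
              "d4": "B", "d5": "B", "d6": "B"}

sizes, super_edges = collapse(edges, cluster_of)
print(sizes)        # nodes per supernode
print(super_edges)  # weighted edges between supernodes
```

Drawing the summary graph with node size proportional to `sizes` and edge width proportional to `super_edges` usually gives a far more readable picture than laying out every document.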
You could also check out my latest paper, which replaces common repeating patterns in the network with representative glyphs:
Dunne, C. & Shneiderman, B. 2013. Motif simplification: improving network visualization readability with fan, connector, and clique glyphs. Proc. 2013 SIGCHI Conference on Human Factors in Computing Systems (CHI '13).
Here's an example picture of the reduction possible:

How to generate recommendation with matrix factorization

I've read some papers on matrix factorization (latent factor models) in recommender systems, and I can implement the algorithm. On the MovieLens dataset I get an RMSE similar to the one reported in the papers.
However, I found that if I generate a top-K (e.g. K=10) list of recommended movies for every user by ranking the predicted ratings, the highest-ranked movies seem to be the same for all users.
Is that just how it works, or have I got something wrong?
This is a known problem in recommendation.
It is sometimes called the "Harry Potter" effect: (almost) everybody likes Harry Potter.
So most automated procedures will find out which items are generally popular, and recommend those to the users.
You can either filter out very popular items, or multiply the predicted rating by a factor that is lower the more globally popular an item is.
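The second option can be sketched as a re-ranking step applied after the factorization model has produced its predictions. The discount 1 / log(e + count) ** alpha used here is an arbitrary choice for illustration (any factor that shrinks with popularity would do), and the item names, counts and the `alpha` parameter are made up.

```python
from math import e, log

def rerank(predicted, popularity, k=10, alpha=1.0):
    """Return the top-k items after discounting by global popularity.

    predicted  -- dict item -> predicted rating for one user
    popularity -- dict item -> number of ratings the item has received
    alpha      -- strength of the discount; 0 disables it entirely
    """
    def score(item):
        # Factor is 1 for an unrated item and shrinks as count grows.
        factor = 1.0 / log(e + popularity.get(item, 0)) ** alpha
        return predicted[item] * factor
    return sorted(predicted, key=score, reverse=True)[:k]

# Hypothetical predictions for one user and global rating counts.
predicted = {"harry_potter": 4.8, "niche_film": 4.5, "obscure_doc": 4.0}
popularity = {"harry_potter": 50000, "niche_film": 300, "obscure_doc": 20}

print(rerank(predicted, popularity, k=2, alpha=0.0))  # plain ranking
print(rerank(predicted, popularity, k=2, alpha=1.0))  # popularity-discounted
```

With `alpha=1.0` the discount is quite aggressive on this toy data; in practice `alpha` would be tuned on held-out data so that the list becomes more personal without dropping items users actually want.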

Clustering or classification?

I am stuck on the decision of whether to apply classification or clustering to the data set I have. The more I think about it, the more confused I get. Here's what I am confronted with.
I have news documents (around 3000 and continuously increasing) containing news about companies, investment, stocks, the economy, quarterly income etc. My goal is to have the news sorted in such a way that I know which news corresponds to which company. E.g., for the news item "Apple launches new iphone", I need to associate the company Apple with it. A particular news item/document only contains a 'title' and a 'description', so I have to analyze the text in order to find out which company the news refers to. It could be multiple companies too.
To solve this, I turned to Mahout.
I started with clustering. I was hoping to get 'Apple', 'Google', 'Intel' etc. as the top terms in my clusters, so that I would know that the news in a cluster corresponds to its cluster label, but things turned out a bit differently. I got 'investment', 'stocks', 'correspondence', 'green energy', 'terminal', 'shares', 'street', 'olympics' and lots of other terms as the top ones (which makes sense, as clustering algorithms look for common terms). There were some 'Apple' clusters, but very few news items were associated with them. I thought maybe clustering is not suited to this kind of problem, as much of the company news goes into more general clusters (investment, profit) instead of the specific company cluster (Apple).
I started reading about classification, which requires training data. The name was convincing too, as I actually want to 'classify' my news items into 'company names'. As I read on, I got the impression that the name classification is a bit deceiving and that the technique is used more for prediction than for classification. My other confusion was: how can I prepare training data for news documents? Let's assume I have a list of companies that I am interested in, and I write a program to produce training data for the classifier. The program checks whether the news title or description contains the company name 'Apple', and if so, it's a news story about Apple. Is this how I can prepare training data? (Of course, I read that training data is actually a set of predictors and target variables.) If so, then why should I use Mahout classification in the first place? I should ditch Mahout and instead use this little program that I wrote for the training data (which actually does the classification).
You can see how confused I am about how to address this issue. Another thing that concerns me is whether it is possible to make the system intelligent enough that, if the news says 'iPhone sales at a record high' without using the word 'Apple', it can still classify it as news related to Apple.
Thank you in advance for pointing me in the right direction.
Copying my reply from the mailing list:
Classifiers are supervised learning algorithms, so you need to provide
a bunch of examples of positive and negative classes. In your example,
it would be fine to label a bunch of articles as "about Apple" or not,
then use feature vectors derived from TF-IDF as input, with these
labels, to train a classifier that can tell when an article is "about
Apple".
I don't think it will quite work to automatically generate the
training set by labeling according to the simple rule, that it is
about Apple if 'Apple' is in the title. Well, if you do that, then
there is no point in training a classifier. You can make a trivial
classifier that achieves 100% accuracy on your test set by just
checking if 'Apple' is in the title! Yes, you are right, this gains
you nothing.
Clearly you want to learn something subtler from the classifier, so
that an article titled "Apple juice shown to reduce risk of dementia"
isn't classified as about the company. You'd really need to feed it
hand-classified documents.
That's the bad news, but, sure you can certainly train N classifiers
for N topics this way.
Classifiers put items into a class or not. They are not the same as
regression techniques which predict a continuous value for an input.
They're related but distinct.
Clustering has the advantage of being unsupervised. You don't need
labels. However the resulting clusters are not guaranteed to match up
to your notion of article topics. You may see a cluster that has a lot
of Apple articles, some about the iPod, but also some about Samsung
and laptops in general. I don't think this is the best tool for your
problem.
First of all, you don't need Mahout. 3000 documents is close to nothing. Revisit Mahout when you hit a million. I've been processing 100,000 images on a single computer, so you really can skip the overhead of Mahout for now.
What you are trying to do sounds like classification to me, because you have predefined classes.
A clustering algorithm is unsupervised. It will (unless you overfit the parameters) likely break Apple into "iPad/iPhone" and "Macbook". Or on the other hand, it may merge Apple and Google, as they are closely related (much more than, say, Apple and Ford).
Yes, you need training data that reflects the structure you want to measure. There is other structure in the data (e.g. iPhones not being the same as Macbooks, and Google, Facebook and Apple being more similar companies than Kellogg's, Ford and Apple). If you want a company level of structure, you need training data at this level of detail.
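To show what "hand-labelled documents plus TF-IDF features" can look like at this scale, here is a small sketch in plain Python with no Mahout involved. It uses a nearest-centroid classifier, which is my own simplification and not something the answers above prescribe; the four training examples and all function names are hypothetical, and a real system would need far more labelled documents per company.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def fit_idf(docs):
    """Smoothed inverse document frequency for every training term."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}

def tfidf(doc, idf):
    """L2-normalised TF-IDF vector; terms unseen in training are dropped."""
    tf = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def train(labelled_docs):
    """Build one mean TF-IDF vector (centroid) per label."""
    idf = fit_idf([text for text, _ in labelled_docs])
    sums, counts = {}, Counter()
    for text, label in labelled_docs:
        counts[label] += 1
        acc = sums.setdefault(label, Counter())
        for t, v in tfidf(text, idf).items():
            acc[t] += v
    centroids = {label: {t: v / counts[label] for t, v in acc.items()}
                 for label, acc in sums.items()}
    return idf, centroids

def classify(text, idf, centroids):
    """Label whose centroid has the largest dot product with the text."""
    vec = tfidf(text, idf)
    def score(label):
        return sum(v * centroids[label].get(t, 0.0) for t, v in vec.items())
    return max(centroids, key=score)

# Hypothetical hand-labelled examples: "apple" means about the company.
training = [
    ("Apple launches new iPhone", "apple"),
    ("iPhone sales boost Apple quarterly income", "apple"),
    ("Apple juice shown to reduce risk of dementia", "other"),
    ("Google shares rise after earnings", "other"),
]
idf, centroids = train(training)
print(classify("iPhone sales at a record high", idf, centroids))
print(classify("Apple juice benefits", idf, centroids))
```

The point of the toy data is that the word 'iPhone' alone can pull an article towards the company label, while 'Apple' next to 'juice' pulls it away, which a title-contains-'Apple' rule cannot do. For several companies you would either add one label per company or train one such classifier per company.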

Online k-means clustering

Is there an online version of the k-Means clustering algorithm?
By online I mean that every data point is processed serially, one at a time as it enters the system, hence saving computing time when used in real time.
I have written one myself with good results, but I would really prefer to have something "standardized" to refer to, since it is to be used in my master's thesis.
Also, does anyone have advice for other online clustering algorithms?
(lmgtfy failed ;))
Yes there is. Google failed to find it because it's more commonly known as "sequential k-means".
You can find two pseudo-code implementations of sequential K-means in this section of some Princeton CS class notes by Richard Duda. I've reproduced one of the two implementations below:
Make initial guesses for the means m1, m2, ..., mk
Set the counts n1, n2, ..., nk to zero
Until interrupted
    Acquire the next example, x
    If mi is closest to x
        Increment ni
        Replace mi by mi + (1/ni)*(x - mi)
The beautiful thing about it is that you only need to remember the mean of each cluster and the count of the number of data points assigned to the cluster. Once you update those two variables, you can throw away the data point.
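For reference, here is one way the pseudocode above might look in Python. The class and method names are my own, squared Euclidean distance is assumed as the notion of "closest", and points are plain tuples of floats.

```python
def nearest(means, x):
    """Index of the mean closest to x (squared Euclidean distance)."""
    return min(
        range(len(means)),
        key=lambda i: sum((m - xi) ** 2 for m, xi in zip(means[i], x)),
    )

class SequentialKMeans:
    """Sequential (online) k-means: one pass, one point at a time."""

    def __init__(self, initial_means):
        self.means = [list(m) for m in initial_means]
        self.counts = [0] * len(self.means)

    def update(self, x):
        """Assign x to its closest mean, then move that mean towards x."""
        i = nearest(self.means, x)
        self.counts[i] += 1
        rate = 1.0 / self.counts[i]
        # m_i <- m_i + (1/n_i) * (x - m_i), i.e. a running average.
        self.means[i] = [m + rate * (xi - m)
                         for m, xi in zip(self.means[i], x)]
        return i

# Hypothetical stream of 2-D points and two initial guesses.
model = SequentialKMeans([(0.0, 0.0), (10.0, 10.0)])
for point in [(1, 1), (9, 9), (0, 2), (11, 10)]:
    model.update(point)
print(model.means, model.counts)
```

Note that with counts starting at zero, the first point assigned to a cluster replaces the initial guess entirely (the rate is 1/1), so the initial means only decide where the first points land.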
I'm not sure where you would be able to find a citation for it. I would start looking in Duda's classic text Pattern Classification and Scene Analysis or the newer edition Pattern Classification. If it's not there, you could try Chris Bishop's newest book or Daphne Koller and Nir Friedman's recent text.

Making predictions from a CV

I have a database with many CVs, including structured data of the gender, age, address, number of years of education, and many other parameters of each person.
For about 10% of the sample, I also have additional data about a certain action they've made at some point in time. For instance, that Jane took a home loan in July 1998 or that John started pilot training in Jan. 2007 and got his license in Dec. 2007.
I need an algorithm that will give, for each of the actions, the probability that it will happen for each person in future time increments. For instance, that the chance of Bill taking a home loan is 2% in 2011, 3.5% in 2012, etc.
How should I approach this? Regression analysis? SVM? Neural net? Something else?
Is there perhaps even some standard tool/library that I can use with just the obvious customizations?
The probability that X happens given that Y happened is right out of Bayesian inference, I think.
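As a toy illustration of that idea, the sketch below applies Bayes' rule to counts from the labelled part of the sample to estimate P(action | attribute value). The records, the `age_band` field and the function name are all hypothetical, it conditions on a single attribute only, and it ignores the time dimension of the question completely.

```python
def conditional_probability(records, feature, value, action):
    """Estimate P(action | feature == value) with Bayes' rule.

    records -- list of dicts, each with the feature and an "actions" set;
               only people whose actions are known should be included
    """
    n = len(records)
    with_action = [r for r in records if action in r["actions"]]
    n_value = sum(1 for r in records if r[feature] == value)
    if not with_action or n_value == 0:
        return 0.0
    prior = len(with_action) / n                      # P(action)
    likelihood = (sum(1 for r in with_action if r[feature] == value)
                  / len(with_action))                 # P(value | action)
    evidence = n_value / n                            # P(value)
    return likelihood * prior / evidence

# Hypothetical labelled subset of the CV database.
records = [
    {"age_band": "30-39", "actions": {"home_loan"}},
    {"age_band": "30-39", "actions": {"home_loan"}},
    {"age_band": "30-39", "actions": set()},
    {"age_band": "30-39", "actions": set()},
    {"age_band": "20-29", "actions": {"home_loan"}},
    {"age_band": "20-29", "actions": set()},
    {"age_band": "20-29", "actions": set()},
    {"age_band": "20-29", "actions": set()},
]

print(conditional_probability(records, "age_band", "30-39", "home_loan"))
print(conditional_probability(records, "age_band", "20-29", "home_loan"))
```

With many attributes and few labelled people, most attribute combinations will have no examples at all, which is where an independence assumption between attributes or a proper statistical model becomes necessary.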
Lou is right, this is the case for 'Bayesian Inference'.
The best tool/library to solve this is the R statistical programming language.
Take a look at the Bayesian Inference Libraries in R:
How many people are in the "10% of the sample"? If it's below 100 people or so, I would fear that the results of the analysis might not be significant. If it's 1000 or more people, the results will be quite good (rule of thumb).
I would first export the data to R (r-project) and do whatever data cleaning is necessary. Then find a person familiar with R and advanced statistics; they will be able to solve this very quickly. Or try it yourself, but R takes some time in the beginning.
Concerning the tool/library choice, I suggest you give Weka a try. It's an open source tool for experimenting with data mining and machine learning. Weka has several tools for reading, processing and filtering your data, as well as prediction and classification tools.
However, you must have a strong foundation in the above mentioned fields in order to strive for a useful result.