Integrating content information with factorization-based collaborative filtering - recommendation-engine

I'm reading some papers on CF and noticed that most state-of-the-art methods are based on various factorization techniques applied to the rating matrix only. I'd like to know whether there are representative works on incorporating content information (e.g. user features and item features) into the factorization. Any ideas?

I am a researcher in the field of recommender systems and have done some work on exactly that. Here are some papers on the topic:
1. Aditya Krishna Menon, Charles Elkan: A log-linear model with latent features for dyadic prediction, ICDM 2010
2. David Stern, Ralf Herbrich, and Thore Graepel: Matchbox: Large Scale Bayesian Recommendations, WWW 2009
3. Chong Wang, David Blei: Collaborative topic modeling for recommending scientific articles, KDD 2011
4. Zeno Gantner, Lucas Drumond, Christoph Freudenthaler, Steffen Rendle, Lars Schmidt-Thieme: Learning Attribute-to-Feature Mappings for Cold-Start Recommendations, ICDM 2010
5. D. Agarwal and B.-C. Chen: Regression-based latent factor models, KDD 2009
6. D. Agarwal and B.-C. Chen: fLDA: Matrix factorization through latent Dirichlet allocation, WSDM 2010
Please note that (4) is a paper by me, so this is also some kind of advertisement ;-)
Also, the KDD Cup 2011 involved an item taxonomy, and there has been some interesting work on combining such taxonomy information with latent factor models at the workshop: http://kddcup.yahoo.com/workshop.php
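As a rough, hedged illustration of the idea these papers share (not the method of any specific paper above), here is a minimal sketch of a content-augmented matrix factorization: each latent vector gets an additive term mapped linearly from the user's or item's features, so users and items with few or no ratings still receive a usable representation. All dimensions and the synthetic data are made up for illustration.

```python
# Generic sketch: r_ui ~ (p_u + A x_u) . (q_i + B y_i), trained by SGD.
# Not any specific published model; purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_ufeat, n_ifeat, k = 50, 40, 5, 8, 10
X = rng.normal(size=(n_users, n_ufeat))       # user features (e.g. demographics)
Y = rng.normal(size=(n_items, n_ifeat))       # item features (e.g. genres)
ratings = [(rng.integers(n_users), rng.integers(n_items), rng.uniform(1, 5))
           for _ in range(2000)]               # synthetic (user, item, rating) triples

P = rng.normal(scale=0.1, size=(n_users, k))  # user latent factors
Q = rng.normal(scale=0.1, size=(n_items, k))  # item latent factors
A = rng.normal(scale=0.1, size=(n_ufeat, k))  # user-feature-to-factor mapping
B = rng.normal(scale=0.1, size=(n_ifeat, k))  # item-feature-to-factor mapping
lr, reg = 0.01, 0.05

for epoch in range(20):
    for u, i, r in ratings:
        pu = P[u] + X[u] @ A                  # feature-augmented user vector
        qi = Q[i] + Y[i] @ B                  # feature-augmented item vector
        err = r - pu @ qi
        # SGD updates with L2 regularization
        P[u] += lr * (err * qi - reg * P[u])
        Q[i] += lr * (err * pu - reg * Q[i])
        A += lr * (err * np.outer(X[u], qi) - reg * A)
        B += lr * (err * np.outer(Y[i], pu) - reg * B)

# Cold-start flavour: a user with no ratings still contributes X[u] @ A.
print("example prediction:", (P[0] + X[0] @ A) @ (Q[1] + Y[1] @ B))
```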

See for example "5. Hybrid Collaborative Filtering Techniques" in
X. Su, T. M. Khoshgoftaar, A Survey of Collaborative Filtering Techniques,
Advances in Artificial Intelligence (2009).

Related

Difference between i-vector and d-vector

Could someone please explain the difference between an i-vector and a d-vector? All I know about them is that they are widely used in speaker/speech recognition systems and that they are a kind of template for representing speaker information, but I don't know the main differences.
An i-vector is a feature that represents the idiosyncratic characteristics of the distribution of the frame-level features. I-vector extraction is essentially a dimensionality reduction of the GMM supervector (although the GMM supervector is not explicitly built when computing the i-vector). It is extracted in a manner similar to the eigenvoice adaptation scheme or the JFA technique, but per utterance (or input speech sample).
A d-vector, on the other hand, is extracted using a DNN. To extract a d-vector, a DNN is trained that takes stacked filterbank features as input (similar to the DNN acoustic model used in ASR) and produces the one-hot speaker label (or the speaker posterior probabilities) at the output. The d-vector is the averaged activation of the last hidden layer of this DNN over the utterance's frames. So unlike the i-vector framework, this makes no assumptions about the features' distribution (the i-vector framework assumes that the i-vector, i.e. the latent variable, has a Gaussian distribution).
In conclusion, these are two distinct features extracted by totally different methods and under different assumptions. I recommend reading these papers:
N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.
E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. G-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in Proc. ICASSP, 2014, pp. 4080-4084.
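To make the d-vector description above concrete, here is a minimal sketch assuming a small PyTorch feed-forward speaker classifier. The layer sizes, frame dimensionality, and length normalization are illustrative assumptions, not the exact setup of the Variani et al. paper.

```python
# Illustrative d-vector extraction: average the last hidden layer's activations
# over all frames of an utterance. Architecture and sizes are made up.
import torch
import torch.nn as nn

FRAME_DIM, HIDDEN, N_SPEAKERS = 40 * 21, 256, 100   # e.g. stacked filterbank frames

class SpeakerDNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(FRAME_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),     # last hidden layer
        )
        self.out = nn.Linear(HIDDEN, N_SPEAKERS)      # speaker softmax, used only in training

    def forward(self, frames):                        # frames: (n_frames, FRAME_DIM)
        return self.out(self.hidden(frames))

def extract_dvector(model, frames):
    """Average last-hidden-layer activations over an utterance, then length-normalize."""
    with torch.no_grad():
        h = model.hidden(frames)                      # (n_frames, HIDDEN)
    d = h.mean(dim=0)                                 # (HIDDEN,)
    return d / d.norm()

# Usage: after training `model` with cross-entropy on speaker labels, enrollment
# and test utterances are compared via cosine similarity of their d-vectors.
model = SpeakerDNN().eval()
utt = torch.randn(300, FRAME_DIM)                     # 300 fake frames
print(extract_dvector(model, utt).shape)              # torch.Size([256])
```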
I don't know how to properly characterize the d-vector in plain language, but I can help a little.
The identity vector, or i-vector, is a spectral signature for a particular slice of speech, usually a sliver of a phoneme, rarely (as far as I can see) as large as the entire phoneme. Basically, it is a discrete spectrogram expressed in a form isomorphic to the Gaussian mixture of the time slice.
EDIT
Thanks to those who provided comments and a superior answer. I updated this only to replace the incorrect information from my original attempt.
A d-vector is extracted from a deep NN as the mean of the feature vectors in the DNN's final hidden layer. This becomes the model for the speaker, used to compare against other speech samples for identification.

Evaluation of user-based collaborative filtering K-Nearest Neighbor Algorithm

I was trying to find evaluation mechanisms for the user-based collaborative filtering k-nearest-neighbor algorithm, but I am confused about how to evaluate it. How can I be sure that the recommendations produced by this algorithm are correct or good? I have also developed an algorithm that I want to compare with it, but I am not sure how to compare and evaluate both of them. The data set I am using is MovieLens.
Your help in evaluating this recommender system will be highly appreciated.
Evaluating recommender systems is a major concern of the research and industry communities. Look at "Evaluating collaborative filtering recommender systems", a paper by Herlocker et al. The people who publish the MovieLens data (the GroupLens research lab at the University of Minnesota) also publish many papers on recsys topics, and the PDFs are often freely available at http://grouplens.org/publications/.
Check out https://scholar.google.com/scholar?hl=en&q=evaluating+recommender+systems.
In short, you should use a method that hides some data: train your model on a portion of the data (the "training data") and test on the remainder, which your model has never seen before. There is a formal way to do this called cross-validation, but the general concept of visible training data versus hidden test data is the most important part.
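As a minimal illustration of that hold-out idea on MovieLens-style data, here is a hedged sketch in Python. The file name and column layout assume the MovieLens 100K `u.data` format, and the trivial user-mean predictor is only a stand-in for your own KNN recommender.

```python
# Hold-out evaluation sketch for rating prediction on MovieLens-style data.
import numpy as np
import pandas as pd

# MovieLens 100K format: user_id \t item_id \t rating \t timestamp
ratings = pd.read_csv("u.data", sep="\t",
                      names=["user", "item", "rating", "ts"])

# Hide 20% of the ratings from the model ("test"), train on the rest.
test_mask = np.random.default_rng(42).random(len(ratings)) < 0.2
train, test = ratings[~test_mask], ratings[test_mask]

# Placeholder model: predict each user's training-set mean rating.
# Replace this block with the predictions of your user-based KNN recommender.
user_means = train.groupby("user")["rating"].mean()
global_mean = train["rating"].mean()
pred = test["user"].map(user_means).fillna(global_mean)

rmse = np.sqrt(((test["rating"] - pred) ** 2).mean())
mae = (test["rating"] - pred).abs().mean()
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}")
# For a fuller picture, repeat this with k-fold cross-validation and also report
# top-N metrics (precision/recall@N) rather than only rating-error metrics.
```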
I also recommend https://www.coursera.org/learn/recommender-systems, a Coursera course on recommender systems taught by GroupLens folks. In that course you'll learn to use LensKit, a recommender systems framework in Java that includes a large evaluation suite. Even if you don't take the course, LensKit may be just what you want.

Classification in real time without prior knowledge of the number of classes

Is there an implemented algorithm (preferably in Python, R, or Java) that can classify incoming data from an unknown generator with absolutely no prior knowledge or assumptions?
For example:
Let G be a generator of 2-D vectors that produces one vector each second.
What we know, and nothing else, is that these vectors are separable into clusters in space (Euclidean distance).
Question: How can I classify my data in real time so that, at each iteration, the algorithm proposes clusters?
I'm also in the process of searching for something related to Data stream clustering and I found some papers and code:
The survey by Charu C. Aggarwal, from the book mentioned in another answer below;
Density-Based Clustering over an Evolving Data Stream with Noise. by Feng Cao et al., proposes DenStream; here is a git repo for that (Matlab);
Density-Based Clustering for Real-Time Stream Data by Yixin Chen and Li Tu proposes the D-Stream framework (a 2008 version was called Stream data clustering based on grid density and attraction). There is also DD-Stream, proposed in A grid and density-based clustering algorithm for processing data stream by Jia, for which I cannot find a PDF;
A Fast and Stable Incremental Clustering Algorithm by Steven Young et al. focuses on clustering as an unsupervised learning process;
Self-Adaptive Anytime Stream Clustering by Philipp Kranen et al. has ClusTree and this git repo implements DClusTree;
Pre-clustering algorithm for anomaly detection and clustering that uses variable size buckets by Manish Sharma et al. is more recent and may be relevant (git repo of the author);
This paper about MOA (Massive Online Analysis) states that MOA implements some of the above (StreamKM++, CluStream, ClusTree, DenStream, D-Stream and CobWeb). I believe the D-Stream support is work in progress/wishful thinking (it is not part of the pre-release available from their website). MOA is written in Java; there is also the streamMOA package for R.
The code in this repository seems to be a Python implementation of D-Stream but, according to its author, it is slow.
Also, stream is a framework for data stream clustering research in R.
I think you are asking about "Stream mining" here.
Read this article:
Chapter 10: A Survey of Stream Clustering Algorithms, by Charu C. Aggarwal (IBM T. J. Watson Research Center, Yorktown Heights, NY).
It can be found in the 2014 book Data Clustering: Algorithms and Applications, edited by Charu C. Aggarwal and Chandan K. Reddy.
In that chapter the "CluStream" framework is described. This project is from 2002 and is based on the BIRCH algorithm from 1997, which is a "micro-clustering" approach. The algorithm creates an index structure on the fly.
Considering that there are few BIRCH implementations, there is probably no open-source CluStream algorithm/framework available.
Here's a Github repo with a BIRCH implementation in Java - although I haven't tried this code, and that repo is not for "stream mining".
All this just appeared on my radar because I just recently participated in the Coursera MOOC on cluster analysis.
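For a rough feel of the micro-clustering idea in Python, here is a hedged sketch using scikit-learn's Birch, which supports incremental updates via partial_fit. This is not CluStream (there is no time decay or online/offline split), and the toy data generator is made up for illustration.

```python
# Incremental clustering sketch: feed batches into Birch's CF-tree as they arrive.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
model = Birch(threshold=0.5, n_clusters=3)    # 3 final clusters via the global step

def next_batch(t, size=20):
    """Fake stream: 2-D points around three slowly drifting centers."""
    centers = np.array([[0, 0], [5, 5], [0, 6]]) + 0.01 * t
    idx = rng.integers(3, size=size)
    return centers[idx] + rng.normal(scale=0.4, size=(size, 2))

for t in range(100):                          # pretend one batch arrives per second
    X = next_batch(t)
    model.partial_fit(X)                      # update the CF-tree incrementally
    if t % 25 == 0:
        labels = model.predict(X)             # cluster assignments for this batch
        print(t, np.bincount(labels, minlength=3))

print("number of micro-clusters (subclusters):", len(model.subcluster_centers_))
```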
There are no assumption-free methods.
You are asking for magic to happen.
Never blindly use a clustering result. Do not use clustering on a stream. Instead, analyze and correct any clustering result before deployment.
Watch out for hidden assumptions: for example, that clusters are convex, that they are distance-based (why would Euclidean distance be the correct choice?), that they have the same size or extent, that they are separated (by what?), or that they have a particular shape. Whenever you design a method, you make assumptions about what is interesting.
Without assumptions, anything is a "clustering"!

Are there any implementations available online for filter based feature selection methods?

The selection methods I am looking for are the ones based on subset evaluation (i.e. do not simply rank individual features). I prefer implementations in Matlab or based on WEKA, but implementations in any other language will still be useful.
I am aware of the existence of CfsSubsetEval and ConsistencySubsetEval in WEKA, but they did not lead to good classification performance, probably because they suffer from the following limitations:
CfsSubsetEval is biased toward small feature subsets, which may prevent locally predictive features from being included in the selected subset, as noted in [1].
ConsistencySubsetEval uses the min-features bias [2], which, similarly to CfsSubsetEval, results in the selection of too few features.
I know the selected subsets are "too few" because I have built classification models with larger subsets and their classification performance was considerably better.
[1] M. A. Hall, Correlation-based Feature Subset Selection for Machine Learning, 1999.
[2] Liu, Huan, and Lei Yu, Toward integrating feature selection algorithms for classification and clustering, 2005.
Check out Python's scikit-learn: simple and efficient tools for data mining and data analysis. It provides various implemented methods for feature selection, classification, and evaluation, along with a lot of documentation and tutorials.
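As a concrete, hedged example of subset evaluation in scikit-learn, here is a minimal sketch using SequentialFeatureSelector. Note that this is a wrapper-style selector scored by cross-validated classifier performance rather than a filter method like CFS, and the dataset and classifier are illustrative choices.

```python
# Greedy forward selection of a feature *subset*, scored by 5-fold CV accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

selector = SequentialFeatureSelector(clf, n_features_to_select=10,
                                     direction="forward", cv=5)
selector.fit(X, y)
X_sub = selector.transform(X)                 # data restricted to the chosen subset

print("selected feature indices:", selector.get_support(indices=True))
print("CV accuracy on the subset:",
      cross_val_score(clf, X_sub, y, cv=5).mean().round(3))
```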
My search has led me to the following implementations:
FEAST toolbox: an interesting toolbox, developed at the University of Manchester, that provides implementations of Shannon information-theoretic functions. The implementations can be downloaded from THIS webpage and can be used to evaluate individual features as well as subsets of features.
I have also found THIS MATLAB code, which implements a selection algorithm based on Interaction Information.
PY_FS: A Python Package for Feature Selection
I came across this package [1], which was just released (2021) and contains many methods with references to their original papers.

Autoencoders: Papers and Books regarding algorithms for Training

What are some of the well-known research papers and/or books that concern autoencoders and the various training algorithms for them?
I'm talking about research papers and/or books that lay the foundation for the different training algorithms used to train autoencoders.
I first saw autoencoders in Fahlman's 1988 article, where he introduces quickpropagation for training them. The paper is here:
Fahlman, S. E. (1988) "Faster-Learning Variations on Back-Propagation: An Empirical Study" in Proceedings, 1988 Connectionist Models Summer School, Morgan-Kaufmann, Los Altos CA. (This paper introduced the Quickprop learning algorithm.)
I also wrote the following example around it, including quickprop.
https://github.com/encog/encog-java-examples/blob/master/src/main/java/org/encog/examples/neural/benchmark/FahlmanEncoder.java
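For readers who want a runnable toy in Python, here is a minimal sketch of the classic N-M-N "encoder" problem from Fahlman's paper: the network must reproduce a one-hot input through a narrow hidden layer. It uses plain batch gradient descent rather than quickprop, and the sizes, learning rate, and epoch count are arbitrary illustrative choices.

```python
# 8-3-8 encoder problem trained with ordinary backprop (not quickprop).
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 3                           # 8 one-hot patterns, 3 hidden units
X = np.eye(N)                         # inputs; the targets are the inputs themselves

W1 = rng.normal(scale=0.5, size=(N, M)); b1 = np.zeros(M)
W2 = rng.normal(scale=0.5, size=(M, N)); b2 = np.zeros(N)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for epoch in range(20000):            # may need more epochs or another seed to converge
    H = sigmoid(X @ W1 + b1)          # hidden code, shape (8, 3)
    Y = sigmoid(H @ W2 + b2)          # reconstruction, shape (8, 8)
    err = Y - X                       # gradient of squared error w.r.t. Y (up to a constant)
    dZ2 = err * Y * (1 - Y)           # backprop through the output sigmoid
    dZ1 = (dZ2 @ W2.T) * H * (1 - H)  # backprop through the hidden sigmoid
    W2 -= lr * H.T @ dZ2; b2 -= lr * dZ2.sum(0)
    W1 -= lr * X.T @ dZ1; b1 -= lr * dZ1.sum(0)

print("mean abs reconstruction error:", np.abs(Y - X).mean().round(3))
print("learned 3-bit hidden codes:\n", (sigmoid(X @ W1 + b1) > 0.5).astype(int))
```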