What is the type or family of recsys algorithms for recommending similar users based on their interests? - recommendation-engine

I am learning recommendation systems from a Coursera MOOC. I see there are mainly three types of filtering methods (in the introductory course).
a. Content-based filtering
b. Item-Item collaborative filtering
c. User-User collaborative filtering
Having understood this, I am not sure where similar-user recommendation based on interests/preferences belongs. For example, suppose I have a User->TopicsOfInterest0..n relation. I want to recommend similar users to a given user based on their respective TopicsOfInterest vectors.

I'm not sure that these three types are an exhaustive classification of all recommender systems.
In fact, any matrix-factorization-based algorithm (SVD, etc.) is both item-based and user-based at the same time, but the TopicsOfInterest (factors) are inferred automatically by the algorithm. For example, Apache Spark includes an implementation of the alternating least squares (ALS) algorithm. Spark's API has the userFeatures method, which returns (roughly) a matrix describing each user's affinity for each latent feature.
The only thing left to do is to compute the set of users most similar to a given one (e.g., find the vectors that are closest to the given user's vector by cosine similarity).
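For instance, here is a minimal sketch (made-up user-topic matrix, scikit-learn for the cosine similarity; the same code applies to the user-factor matrix returned by ALS or any other factorization):

    # Find the users most similar to a given user by cosine similarity
    # over their interest (or latent-factor) vectors.
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    # Rows are users, columns are TopicsOfInterest (binary or weighted).
    user_topics = np.array([
        [1, 0, 1, 0, 1],   # user 0
        [1, 0, 1, 0, 0],   # user 1
        [0, 1, 0, 1, 0],   # user 2
        [1, 1, 1, 0, 1],   # user 3
    ])

    def most_similar_users(user_id, k=2):
        """Return the ids of the k users with the highest cosine similarity."""
        sims = cosine_similarity(user_topics[user_id:user_id + 1], user_topics)[0]
        sims[user_id] = -1.0                    # exclude the user itself
        return np.argsort(sims)[::-1][:k]

    print(most_similar_users(0))   # users 3 and 1 share the most topics with user 0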

Related

Feature selection for one class classification

I am trying to apply a One-Class SVM, but my dataset contains too many features and I believe feature selection would improve my metrics. Are there any feature selection methods that do not need the class label?
If so, and you are aware of an existing implementation, please let me know.
You'd probably get better answers asking this on Cross Validated rather than here, but since you ask for implementations I will answer your question.
Unsupervised methods exist that allow you to eliminate features without looking at the target variable. This is called unsupervised data (dimensionality) reduction. They work by looking for features that convey similar information and then either eliminate some of those features or reduce them to fewer features whilst retaining as much information as possible.
Some examples of data reduction techniques include PCA, redundancy analysis, variable clustering, and random projections, amongst others.
You don't mention which language you're working in, but I am going to presume it's Python. sklearn has implementations of PCA and SparseRandomProjection. I know there is a module designed for variable clustering in Python, but I have not used it and don't know how convenient it is. I don't know whether there is an unsupervised implementation of redundancy analysis in Python, but you could consider writing your own; depending on what you decide to do, it might not be too tricky (especially if you keep it correlation-based).
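As a rough illustration (random, made-up data; note that neither method looks at any label):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.random_projection import SparseRandomProjection

    X = np.random.rand(200, 500)          # 200 samples, 500 features, no labels

    pca = PCA(n_components=0.95)          # keep enough components for 95% of the variance
    X_pca = pca.fit_transform(X)

    srp = SparseRandomProjection(n_components=50, random_state=0)
    X_srp = srp.fit_transform(X)

    print(X_pca.shape, X_srp.shape)       # reduced inputs for the One-Class SVM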
In case you're working in R, finding versions of data reduction using PCA will be no problem. For variable clustering and redundancy analysis, great packages like Hmisc and ClustOfVar exist.
You can also read about other unsupervised data reduction techniques; you might find other methods more suitable.

Fusion Classifier in Weka?

I have a dataset with 20 features: 10 for age and 10 for weight. I want to classify the data for each group separately, then use the results from these two classifiers as input to a third for the final result.
Is this possible with Weka?
Fusion of decisions is possible in WEKA (or with any two models), but not using the approach you describe.
Seeing as you're using classifiers, each model will only output a class. You could use the two labels produced as features for a third model, but the lack of diversity in your inputs would most likely prevent the third model from giving you anything interesting.
At the most basic level, you could implement a voting scheme: give each model a "vote" and assume that the correct class is the majority-voted class. While this gives a rudimentary form of fusion, if you're familiar with voting theory you know that majority rule somewhat falls apart when you have more than two classes.
I recommend that you use Combinatorial Fusion to fuse the output of the two classifiers. A good paper regarding the technique is available as a free PDF here. In essence, you use the Classifier::distributionForInstance() method provided by WEKA's classifiers and then use the sum of the distributions (called "scores") to rank the classes, choosing the class with the highest rank. The paper demonstrates that this method is superior to voting alone.
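For illustration, here is the same score-fusion idea sketched in Python with scikit-learn rather than WEKA (predict_proba plays the role of distributionForInstance(); the split into "age" and "weight" feature groups and the data are made up):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=300, n_features=20, random_state=0)
    X_age, X_weight = X[:, :10], X[:, 10:]      # two hypothetical feature groups

    clf_age = LogisticRegression(max_iter=1000).fit(X_age, y)
    clf_weight = RandomForestClassifier(random_state=0).fit(X_weight, y)

    # Score fusion: add the two class-probability distributions and pick the
    # class with the highest combined score.
    scores = clf_age.predict_proba(X_age) + clf_weight.predict_proba(X_weight)
    fused_pred = np.argmax(scores, axis=1)

    print("fused accuracy:", (fused_pred == y).mean())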

Are there any implementations available online for filter based feature selection methods?

The selection methods I am looking for are the ones based on subset evaluation (i.e. do not simply rank individual features). I prefer implementations in Matlab or based on WEKA, but implementations in any other language will still be useful.
I am aware of the existence of CfsSubsetEval and ConsistencySubsetEval in WEKA, but they did not lead to good classification performance, probably because they suffer from the following limitations:
CfsSubsetEval is biased toward small feature subsets, which may prevent locally predictive features from being included in the selected subset, as noted in [1].
ConsistencySubsetEval uses the min-features bias [2] which, similarly to CfsSubsetEval, results in the selection of too few features.
I know it is "too few" because I have built classification models with larger subsets and their classification performance was considerably better.
[1] M. A. Hall, Correlation-based Feature Subset Selection for Machine Learning, 1999.
[2] H. Liu and L. Yu, Toward Integrating Feature Selection Algorithms for Classification and Clustering, 2005.
Check out Python's scikit-learn: simple and efficient tools for data mining and data analysis. It has various implemented methods for feature selection, classification, and evaluation, plus a lot of documentation and tutorials.
My search has led me to the following implementations:
FEAST toolbox: an interesting toolbox, developed at the University of Manchester, that provides implementations of Shannon information-theoretic functions. The implementations can be downloaded from THIS webpage, and they can be used to evaluate individual features as well as subsets of features.
I have also found THIS MATLAB code, which implements a selection algorithm based on Interaction Information.
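To make the subset-evaluation idea concrete, here is a rough Python sketch (not FEAST itself): it scores a candidate subset by the mutual information between the discretized joint features and the class label, and grows the subset greedily. The dataset, bin count, and stopping threshold are arbitrary choices for illustration:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.metrics import mutual_info_score

    X, y = load_iris(return_X_y=True)

    def discretize(col, bins=5):
        """Map a continuous column to integer bin indices."""
        edges = np.quantile(col, np.linspace(0, 1, bins + 1)[1:-1])
        return np.digitize(col, edges)

    Xd = np.column_stack([discretize(X[:, j]) for j in range(X.shape[1])])

    def subset_score(features):
        """Mutual information between the joint state of the subset and the label."""
        # Encode each row's combination of bin values as a single discrete state.
        joint = np.unique(Xd[:, features], axis=0, return_inverse=True)[1].ravel()
        return mutual_info_score(joint, y)

    # Greedy forward search driven only by the filter criterion (no classifier).
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        best = max(remaining, key=lambda j: subset_score(selected + [j]))
        if selected and subset_score(selected + [best]) - subset_score(selected) <= 1e-3:
            break                               # no meaningful gain: stop
        selected.append(best)
        remaining.remove(best)

    print("selected features:", selected)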
PY_FS: A Python Package for Feature Selection
I came across this package [1], which was just released (2021) and contains many methods, with references to their original papers.

Results of two feature selection algorithms do not match

I am working with two feature selection algorithms for a real-world problem where the sample size is 30 and there are 80 features. The first algorithm is wrapper forward feature selection using an SVM classifier; the second is a filter feature selection algorithm using the Pearson product-moment correlation coefficient and Spearman's rank correlation coefficient. It turns out that the features selected by these two algorithms do not overlap at all. Is this reasonable? Does it mean I made mistakes in my implementation? Thank you.
FYI, I am using LIBSVM + MATLAB.
It can definitely happen, as the two strategies do not have the same expressive power.
Trust the wrapper if you want the best feature subset for prediction; trust the correlation if you want all features that are linked to the output/predicted variable. Those subsets can be quite different, especially if you have many redundant features.
Using the top correlated features is a strategy which assumes that the relationships between the features and the output/predicted variable are linear (or at least monotonic, in the case of Spearman's rank correlation), and that features are statistically independent of one another and do not 'interact' with one another. Those assumptions are most often violated in real-world problems.
Correlations, or other 'filters' such as mutual information, are better used to filter out features, i.e., to decide which features not to consider, rather than to decide which features to keep. Filters are necessary when the initial feature count is very large (hundreds or thousands) to reduce the workload of a subsequent wrapper algorithm.
Depending on the distribution of the data, you can use either Spearman or Pearson correlation: the latter is suited to normally distributed data, while the former handles non-normal data. Check the distribution and use the appropriate one.
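To see how different the two strategies can be on data of roughly this shape (30 samples, 80 features), here is a hedged sketch on synthetic data; scikit-learn's SequentialFeatureSelector stands in for the SVM wrapper and simple correlation ranking for the filter:

    import numpy as np
    from scipy.stats import pearsonr, spearmanr
    from sklearn.svm import SVC
    from sklearn.feature_selection import SequentialFeatureSelector

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 80))
    y = (X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=30) > 0).astype(int)

    # Filter: rank features by |correlation| with the target, keep the top 5.
    pearson_top = np.argsort([abs(pearsonr(X[:, j], y)[0]) for j in range(80)])[-5:]
    spearman_top = np.argsort([abs(spearmanr(X[:, j], y)[0]) for j in range(80)])[-5:]

    # Wrapper: greedy forward selection driven by an SVM's cross-validated accuracy.
    sfs = SequentialFeatureSelector(SVC(kernel="linear"), n_features_to_select=5,
                                    direction="forward", cv=3)
    wrapper_top = np.flatnonzero(sfs.fit(X, y).get_support())

    # With 30 samples and 80 features the three subsets frequently differ a lot.
    print(sorted(pearson_top), sorted(spearman_top), sorted(wrapper_top))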

How to find bridges (community connecting nodes) in large networks represented using the adjacency matrix

I have networks of roughly 10K to 100K nodes, all of which are connected. The nodes are typically grouped into communities that are strongly connected internally, with many edges between their members, and there are hubs, etc. Between the communities there are nodes with a few edges bridging/connecting the communities together. The datasets are stored as adjacency matrices.
I have tried spectral clustering (Ding et al., 2001), but it is really slow on large datasets and seems to stop working when there is a lot of ambiguity (bridges that are not the only route to another cluster; other communities can act as alternative proxy routes).
I have tried some of the methods from Martelot, such as the Newman algorithm for modularity optimisation, but have not incorporated the stability optimisation functions in that effort (could that be crucial?). On synthetic datasets where the clusters are created from random (ER) graphs the methods work, but on real ones with a nested hierarchy the results are scattered. Using a standalone visualization application/tool, the bridges are evident, though.
What methods would you recommend/advise to try? I am using MATLAB.
What do you want to do, exactly? Detect communities, or bridges between them? Those are two different problems. Once you have the communities, it's straightforward to identify the edges connecting nodes from two distinct communities. So, I guess you want to detect communities.
There are actually thousands of methods for this purpose, some of them implemented in Matlab, such as the one you cite, or the generalized Louvain algorithm (also based on modularity optimization). However, most of them are available rather as C or C++ programs, such as InfoMap (based on a data compression paradigm), WalkTrap (clustering using a random-walk-based distance), Markov Cluster (which simulates a propagation mechanism), and the list goes on...
Those tools formalize the notion of community structure more or less differently, potentially leading to different (estimated) community structures when applied to the same network. And of course, different communities mean different bridges, too. So the question is rather how to pick the appropriate method for your data. You seem to have a priori knowledge regarding the networks you are studying, so you should use that to make your choice (rather than the programming language). For instance, even if you don't state it explicitly, you seem to be looking for a hierarchical community structure: not all tools are able to detect this kind of structure. Similarly, if you think one node can belong to several communities at the same time, then you should consider looking for overlapping communities, for instance using CFinder (based on clique percolation).
I'd advise you to have a look at this excellent review of community detection; you might find some interesting information that helps you pick a method: Community Detection in Graphs. Also, from a programming point of view, I'd advise you to play with the igraph library (available for C, R, and Python): it contains several standard community detection tools. You can try them on your data and see what you get.
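As a starting point, here is a minimal python-igraph sketch (toy adjacency matrix, multilevel/Louvain as the detection method) that finds communities and then lists the bridge edges whose endpoints fall in different communities:

    import igraph as ig

    adjacency = [
        [0, 1, 1, 0, 0, 0],
        [1, 0, 1, 0, 0, 0],
        [1, 1, 0, 1, 0, 0],   # node 2 bridges the two communities
        [0, 0, 1, 0, 1, 1],
        [0, 0, 0, 1, 0, 1],
        [0, 0, 0, 1, 1, 0],
    ]
    g = ig.Graph.Adjacency(adjacency, mode="undirected")

    # Detect communities (multilevel/Louvain here; infomap, walktrap, etc. are
    # also available in igraph).
    communities = g.community_multilevel()

    # Bridges: edges whose two endpoints belong to different communities.
    bridges = [g.es[i].tuple
               for i, crosses in enumerate(communities.crossing()) if crosses]
    print(communities.membership, bridges)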