When to run KNN in user-based collaborative filtering? - recommendation-engine

I have seen code and examples of user-based collaborative filtering that run KNN before calculating similarities, and others that run it only afterwards, that is, on the similarity matrix.
So my question is: when should the KNN algorithm be run, on the data matrix before calculating similarities, or on the similarity matrix?
References:
https://github.com/mhahsler/recommenderlab/blob/master/R/RECOM_UBCF.R
Dietmar Jannach, Markus Zanker, Alexander Felfernig, Gerhard Friedrich - Recommender Systems: An Introduction - Cambridge University Press (2010)
Zhongya Wang, Ying Liu, and Pengshan Ma - A CUDA-enabled Parallel Implementation of Collaborative Filtering
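One way to read the two variants, offered as a hedged sketch rather than a description of recommenderlab or the other references: a nearest-neighbour search is itself defined by a similarity (or distance) measure, so code that appears to run KNN on the raw data matrix computes the similarities inside the KNN routine, while code that runs it on the similarity matrix just ranks precomputed values. The pure-Python example below takes the second route. The rating matrix, the choice of cosine similarity over co-rated items, and the names cosine, top_k_neighbors and predict are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity over the items both users rated (None = not rated)."""
    common = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if not common:
        return 0.0
    dot = sum(a * b for a, b in common)
    norm_u = math.sqrt(sum(a * a for a, _ in common))
    norm_v = math.sqrt(sum(b * b for _, b in common))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_neighbors(ratings, user, k):
    """Rank all other users by similarity to `user` and keep the k best.

    This is the "KNN" step: it operates on similarity values, whether they
    are computed here on the fly or read from a precomputed matrix.
    """
    sims = [(cosine(ratings[user], ratings[other]), other)
            for other in range(len(ratings)) if other != user]
    sims.sort(reverse=True)
    return sims[:k]


def predict(ratings, user, item, k):
    """Similarity-weighted average of the neighbours' ratings for `item`."""
    neighbors = [(s, o) for s, o in top_k_neighbors(ratings, user, k)
                 if s > 0 and ratings[o][item] is not None]
    total = sum(s for s, _ in neighbors)
    if total == 0:
        return None  # no usable neighbour rated this item
    return sum(s * ratings[o][item] for s, o in neighbors) / total


# Hypothetical user x item ratings; None marks a missing rating.
ratings = [
    [5, 4, None],
    [5, 4, 2],
    [1, 2, 5],
]
print(top_k_neighbors(ratings, user=0, k=1))
print(predict(ratings, user=0, item=2, k=1))
```

Swapping top_k_neighbors for a generic KNN routine fed with the rating rows and the same similarity measure would select the same neighbours, which is why both orderings show up in example code.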

Related

Clustering coefficient, EEG brain data, Graph theory analysis

I am a final-year master's student in biomedical engineering, and my research interest is brain research using EEG data. I am currently struggling to understand statistical analysis based on graph theory, in particular the brain nodes and clusters associated with the 10–20 system EEG electrode locations. If you are an expert on, or have knowledge of, the local/global clustering coefficient, please share it.
Case: if we stimulate the subject with theta, alpha, beta and gamma frequencies through acoustic binaural beat modulation, I have the following two questions:
Q1. When the clustering coefficient increases, what happens to cortical neuronal activity in the brain areas/regions? Does the brain functional network become well organized, hyperactive, or controlled/uncontrolled?
Q2. When the clustering coefficient decreases, what happens as a result? Is that good or bad for cortical activity?
Please share any resources you recommend, such as related articles, websites, or ebook links.

How to update a dataset without re-clustering the whole dataset after clustering has finished?

I used spectral clustering. What should I do to avoid re-clustering the whole dataset?
There are papers on how to infer the spectral embedding for new data points, e.g.,
Bengio, Y., Paiement, J. F., Vincent, P., Delalleau, O., Le Roux, N., & Ouimet, M. (2004). Out-of-sample extensions for LLE, Isomap, MDS, Eigenmaps, and spectral clustering. In Advances in Neural Information Processing Systems (pp. 177-184).
You can then assign them to the nearest k-means cluster and update the means.
But implementing this will require quite some coding work on your part, in particular to make it fast.
Clustering isn't really meant to be updatable, and neither is spectral embedding, so it is worth looking into alternative algorithms and reconsidering whether you really need this.
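The assign-and-update step mentioned above can be sketched as follows. This is only a minimal illustration under stated assumptions: the new points are taken to be already mapped into the spectral embedding by some out-of-sample extension (not implemented here), the per-cluster point counts from the original k-means run are available, and the function name assign_and_update is made up for this example.

```python
import math


def assign_and_update(centroids, counts, new_points):
    """Assign embedded points to the nearest centroid and update the means.

    centroids  : list of k-means centres in the embedding space
    counts     : number of points currently behind each centre
    new_points : new points, ASSUMED to be already embedded
    Returns (labels, updated_centroids, updated_counts); inputs are not modified.
    """
    centroids = [[float(c) for c in centre] for centre in centroids]
    counts = list(counts)
    labels = []
    for x in new_points:
        # Nearest centre by Euclidean distance in the embedding space.
        j = min(range(len(centroids)),
                key=lambda c: math.dist(centroids[c], x))
        counts[j] += 1
        # Running-mean update: new_mean = old_mean + (x - old_mean) / n
        centroids[j] = [m + (xi - m) / counts[j]
                        for m, xi in zip(centroids[j], x)]
        labels.append(j)
    return labels, centroids, counts


# Hypothetical two-cluster example in a 2-D embedding.
labels, centroids, counts = assign_and_update(
    centroids=[[0.0, 0.0], [10.0, 10.0]],
    counts=[1, 1],
    new_points=[[2.0, 0.0], [10.0, 12.0]],
)
print(labels, centroids, counts)
```

The running-mean form avoids storing the old points, but it only patches the existing partition: it never re-estimates the embedding itself, so drift in the data will eventually call for a full re-clustering.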

Which clustering algorithms can be used with Word Mover's Distance from M. Kusner's paper?

I am new to machine learning and am now interested in clustering documents (short texts of different lengths) according to their semantic similarity (I just want to go beyond the standard TF/IDF approach). I read the paper http://proceedings.mlr.press/v37/kusnerb15.pdf where the Word Mover's Distance for word embeddings is explained. In the paper they used it for classification. My question is now: can I use it for clustering? If so, is there a paper where this kind of usage is described?
P.S.: I am basically interested in clustering which takes into account the semantic similarity, so even a word2vec or doc2vec approach will do the job - I just couldn't find any papers where they are used in a clustering problem.
If you could afford to compute an entire distance matrix, then you could do hierarchical clustering, for example.
It's easy to find other clustering algorithms that accept any distance and use a threshold. These could even use the distance's lower bounds for performance. But it's not obvious that they will work on such data.
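As a rough illustration of the hierarchical option, the sketch below runs average-linkage agglomerative clustering directly on a precomputed distance matrix and stops merging at a distance threshold. The matrix values are made-up stand-ins for pairwise Word Mover's Distances, and the choice of average linkage, the threshold of 0.5 and the function name agglomerative are assumptions for the example, not something taken from the paper.

```python
def agglomerative(dist, threshold):
    """Average-linkage agglomerative clustering on a precomputed distance matrix.

    dist      : symmetric list-of-lists of pairwise distances
    threshold : stop when the two closest clusters are farther apart than this
    Returns one cluster label per item.
    """
    clusters = [[i] for i in range(len(dist))]

    def avg_dist(a, b):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = avg_dist(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        if d > threshold:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical distances between four short documents.
D = [
    [0.00, 0.20, 0.90, 0.80],
    [0.20, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.30],
    [0.80, 0.90, 0.30, 0.00],
]
print(agglomerative(D, threshold=0.5))
```

This naive loop is only meant to show that nothing beyond the distance matrix is needed. For real data, a library routine such as SciPy's scipy.cluster.hierarchy.linkage, which accepts a condensed distance matrix, would be the more practical choice.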

Classifier with a vector of features and a matrix of classes

I'm new to classification, so I'm asking for some advice on how to start.
I've created a MATLAB script which creates two matrices. The first, 100x1, is the class identifier: it contains the group each data point comes from, group one (1) or group two (2).
The second matrix, 100x40, contains the features, with 40 features for each point.
What's the best way to start? I'm really lost. Does MATLAB have some functions I can use?
I would really appreciate some help.
Thank you.
It depends on what version of MATLAB you are using, but the best starting point would be to look at the Statistics Toolbox documentation for supervised learning. Here are some starting tips for MATLAB 2013a:
http://www.mathworks.co.uk/help/stats/supervised-learning.html
Let's assume that your data is
classes: 100x1
features: 100x40
For each method, the first line shows how to fit your classification model and the second line shows how to classify the first row of data in features.
Statistics Toolbox
Naive Bayes Classification
Wikipedia: https://en.wikipedia.org/wiki/Naive_Bayes_classifier
myClassifier = NaiveBayes.fit(features, classes)
myClassifier.predict(features(1,:))
Nearest Neighbors
Wikipedia: https://en.wikipedia.org/wiki/Nearest_neighbour_classifiers
myClassifier = ClassificationKNN.fit(features, classes)
myClassifier.predict(features(1,:))
Classification Trees
Wikipedia: https://en.wikipedia.org/wiki/Classification_tree
myClassifier = ClassificationTree.fit(features, classes)
myClassifier.predict(features(1,:))
Support Vector Machines
Wikipedia: https://en.wikipedia.org/wiki/Support_vector_machine
Note that Support Vector Machines moved into the Statistics Toolbox in 2013a from the Bioinformatics Toolbox, and this implementation only supports classification into two groups.
myClassifier = svmtrain(features, classes)
svmclassify(myClassifier, features(1,:))
Discriminant Analysis
Wikipedia: https://en.wikipedia.org/wiki/Discriminant_analysis
myClassifier = ClassificationDiscriminant.fit(features, classes)
myClassifier.predict(features(1,:))
Neural Network Toolbox:
If you only have two classes, you could use Neural Network Toolbox for pattern recognition by typing nnstart

Feature Selection in MATLAB

I have a dataset for text classification ready to be used in MATLAB. Each document is a vector in this dataset, and the dimensionality of this vector is extremely high. In these cases people usually do some feature selection on the vectors, like the methods you can find in the WEKA toolkit. Is there anything like that in MATLAB? If not, can you suggest an algorithm for me to do it?
thanks
MATLAB (and its toolboxes) include a number of functions that deal with feature selection:
RANDFEATURES (Bioinformatics Toolbox): Generate randomized subset of features directed by a classifier
RANKFEATURES (Bioinformatics Toolbox): Rank features by class separability criteria
SEQUENTIALFS (Statistics Toolbox): Sequential feature selection
RELIEFF (Statistics Toolbox): Relief-F algorithm
TREEBAGGER.OOBPermutedVarDeltaError, predictorImportance (Statistics Toolbox): Using ensemble methods (bagged decision trees)
You can also find examples that demonstrate their usage on real datasets:
Identifying Significant Features and Classifying Protein Profiles
Genetic Algorithm Search for Features in Mass Spectrometry Data
In addition, there exist third-party toolboxes:
Matlab Toolbox for Dimensionality Reduction
LIBGS: A MATLAB Package for Gene Selection
Otherwise, you can always call your favorite functions from WEKA directly from MATLAB, since it includes a JVM.
Feature selection depends on the specific task you want to do on the text data.
One of the simplest and crudest methods is to use principal component analysis (PCA) to reduce the dimensionality of the data. The reduced-dimensional data can then be used directly as features for classification.
See the tutorial on using PCA here:
http://matlabdatamining.blogspot.com/2010/02/principal-components-analysis.html
Here is the link to Matlab PCA command help:
http://www.mathworks.com/help/toolbox/stats/princomp.html
Using the obtained features, the well known Support Vector Machines (SVM) can be used for classification.
http://www.mathworks.com/help/toolbox/bioinfo/ref/svmclassify.html
http://www.autonlab.org/tutorials/svm.html
You might consider using the independent features technique of Weiss and Kulikowski to quickly eliminate variables which are obviously uninformative:
http://matlabdatamining.blogspot.com/2006/12/feature-selection-phase-1-eliminate.html