Are there any implementations available online for filter-based feature selection methods? - matlab

The selection methods I am looking for are the ones based on subset evaluation (i.e. do not simply rank individual features). I prefer implementations in Matlab or based on WEKA, but implementations in any other language will still be useful.
I am aware of the existence of CfsSubsetEval and ConsistencySubsetEval in WEKA, but they did not lead to good classification performance, probably because they suffer from the following limitations:
CfsSubsetEval is biased toward small feature subsets, which may prevent locally predictive features from being included in the selected subset, as noted in [1].
ConsistencySubsetEval uses a min-features bias [2] which, similarly to CfsSubsetEval, results in the selection of too few features.
I know it is "too few" because I have built classification models with larger subsets and their classification performance was much better.
[1] M. A. Hall, Correlation-based Feature Subset Selection for Machine Learning, 1999.
[2] Liu, Huan, and Lei Yu, Toward integrating feature selection algorithms for classification and clustering, 2005.

Check out scikit-learn, a Python library of simple and efficient tools for data mining and data analysis. It implements various methods for feature selection, classification, and evaluation, and it has a lot of documentation and tutorials.

My search has led me to the following implementations:
FEAST toolbox: it is an interesting toolbox, developed by the University of Manchester, that provides implementations of Shannon's information theory functions. The implementations can be downloaded from THIS webpage, and they can be used to evaluate individual features as well as subsets of features.
I have also found THIS matlab code, which is an implementation of a selection algorithm based on Interaction Information.

PY_FS: A Python Package for Feature Selection
I came across this package [1] which was just released (2021) and contains many methods with reference to their original papers.


Feature selection for one class classification

I am trying to apply a One-Class SVM, but my dataset contains too many features and I believe feature selection would improve my metrics. Are there any methods for feature selection that do not need the class label?
If so, and you are aware of an existing implementation, please let me know.
You'd probably get better answers asking this on Cross Validated instead of Stack Exchange, although since you ask for implementations I will answer your question.
Unsupervised methods exist that allow you to eliminate features without looking at the target variable. This is called unsupervised data (dimensionality) reduction. They work by looking for features that convey similar information and then either eliminating some of those features or reducing them to fewer features whilst retaining as much information as possible.
Some examples of data reduction techniques include PCA, redundancy analysis, variable clustering, and random projections, amongst others.
You don't mention which program you're working in, but I am going to presume it's Python. sklearn has implementations of PCA and SparseRandomProjection. I know there is a module designed for variable clustering in Python, but I have not used it and don't know how convenient it is. I don't know if there's an unsupervised implementation of redundancy analysis in Python, but you could consider writing your own. Depending on what you decide to do, it might not be too tricky (especially if you just use a correlation-based approach).
In case you're working in R, finding versions of data reduction using PCA will be no problem. For variable clustering and redundancy analysis, great packages like Hmisc and ClustOfVar exist.
You can also read about other unsupervised data reduction techniques; you might find other methods more suitable.
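If you do roll your own correlation-based filter, a minimal sketch could look like the following. It is pure Python and never looks at a label; the function names, the greedy keep-first strategy and the 0.95 threshold are assumptions made for illustration, not taken from any particular library.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def drop_redundant(columns, threshold=0.95):
    """Greedy unsupervised filter: keep a feature only if its absolute
    correlation with every already-kept feature is below `threshold`.
    `columns` maps feature name -> list of values; no labels are used."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Toy data (made up): "b" is an exact multiple of "a", so it carries no new information.
data = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
selected = drop_redundant(data)  # keeps "a" and "c", drops "b"
```

Note that the result depends on the order of the features, since the first feature of a correlated group is the one that survives. Whether that matters depends on your data.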

GPy and GPflow mathematical background - references

Do GPy and GPflow share a common mathematical background? I'm asking because I'm using GPy but cannot find its references, whereas GPflow provides references in its examples.
Is it OK to keep using GPy, or would you suggest switching to GPflow immediately for Gaussian process work?
GPy and GPflow definitely share a common mathematical background: Gaussian processes (Rasmussen and Williams), and many of the concepts are very similar in both frameworks: kernels, likelihoods, mean functions, inducing points, etc. For me, the biggest difference between GPy and GPflow is the computational backend: AFAIK GPy uses plain Python and numpy to perform all its computations, whereas GPflow relies on TensorFlow. This gives GPflow multiple nice features for free: GPU acceleration, automatic gradients, compatibility with the TF ecosystem, etc. Depending on your use case, these features can be crucial or simply nice-to-have.
Here is more information on the technical details between the two frameworks:
That would depend on what you are actually doing.
The very basic GPs should be similar, just that GPflow relies on tensorflow for the gradients (if used) and probably some technical implementation differences.
For the more advanced models, both libraries provide references to the respective papers in the docs. In my opinion, GPflow's design is mainly centered around the SVGP framework from [1] and [2] (and many other extensions; I can really recommend [2] if you are interested in the theory).
But they still do provide some other implementations.
I use GPflow since it works on the GPU and offers a lot of state-of-the-art implementations. However, the disadvantage is that it is changing a lot.
If you want to use classic GPs and are not too concerned with performance or very up-to-date methods I'd say GPy should be sufficient and the more stable variant.
[1] Hensman, James, Alexander Matthews, and Zoubin Ghahramani. "Scalable variational Gaussian process classification." (2015).
[2] Matthews, Alexander Graeme de Garis. Scalable Gaussian process inference using variational methods. Diss. University of Cambridge, 2017.

Most important attributes in matlab

I have a dataset of 77 cancer patients and 12500+ attributes. I have applied Principal Component Analysis in order to filter the attributes and retain only the ones that explain the most variance.
My question is, are there techniques in Matlab, other than PCA, to identify the attributes with the most predictive power?
There are two main ways to cleverly "reduce the dimensionality" of your dataset. One is Feature Transformation (that includes, for example, PCA), and the other one is Feature Selection.
It seems that you are looking for a Feature Selection algorithm, which retains the most informative original attributes. In contrast, a Feature Transformation algorithm generates a new set of attributes!
As for your exact question, there are multiple choices you can make. Keep in mind that, naively, each Feature Selection algorithm will have to choose the best features according to "how well" those features alone can model the problem.
For a MATLAB built-in implementation, if you have the Statistics and Machine Learning Toolbox installed, you can use the "Sequential feature selection" function sequentialfs.
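To make the idea concrete, here is a rough Python sketch of greedy forward selection, the general strategy behind sequential feature selection. It is not MATLAB's sequentialfs implementation; the function name, the `criterion` callback and the toy loss are hypothetical. In practice the criterion would typically be something like the cross-validated error of your classifier on the candidate subset.

```python
def sequential_forward_selection(n_features, criterion, max_features=None):
    """Greedy forward search: repeatedly add the single feature that lowers
    `criterion(subset)` the most, and stop when no addition improves it.
    `criterion` takes a list of feature indices and returns a loss (lower is better)."""
    selected = []
    best_score = float("inf")
    limit = n_features if max_features is None else max_features
    while len(selected) < limit:
        candidates = [f for f in range(n_features) if f not in selected]
        # Score every one-feature extension of the current subset.
        score, feature = min((criterion(selected + [f]), f) for f in candidates)
        if score >= best_score:
            break  # no candidate improves the criterion
        selected.append(feature)
        best_score = score
    return selected, best_score

# Toy criterion (made up): features 0 and 2 are the informative ones,
# and every selected feature adds a small complexity penalty.
USEFUL = {0, 2}

def toy_loss(subset):
    missing = len(USEFUL - set(subset))
    return missing + 0.1 * len(subset)

chosen, loss = sequential_forward_selection(4, toy_loss)
```

With 77 samples and 12500+ attributes, any wrapper-style search like this can overfit quite easily, so the criterion should be evaluated on held-out data rather than on the data used for fitting.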

Evaluation of user-based collaborative filtering K-Nearest Neighbor Algorithm

I was trying to find evaluation mechanisms for the collaborative K-Nearest Neighbor algorithm, but I am confused about how to evaluate it. How can I be sure that the recommendations made by this algorithm are correct or good? I have also developed an algorithm of my own that I want to compare with it, but I am not sure how to compare and evaluate the two. The data set I am using is MovieLens.
Your help with evaluating this recommender system would be highly appreciated.
Evaluating recommender systems is a major concern of the recsys research and industry communities. Look at "Evaluating collaborative filtering recommender systems", a paper by Herlocker et al. The people who publish the MovieLens data (the GroupLens research lab at the University of Minnesota) also publish many papers on recsys topics, and the PDFs are often freely available.
In short, you should use a method that hides some data. You will train your model on a portion of the data (called "training data") and test on the remainder of the data that your model has never seen before. There's a formal way to do this called cross-validation, but the general concept of visible training data versus hidden test data is the most important.
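As a rough illustration of the hidden-data idea, here is a small Python sketch. The function names, the 20% test fraction, the item-mean baseline and the toy ratings are all assumptions; any predictor (a user-based kNN or your own algorithm) can be plugged in as a `predict(user, item)` function, and both are then scored on the same hidden ratings.

```python
import math
import random

def split_ratings(ratings, test_fraction=0.2, seed=0):
    """Randomly hide a fraction of (user, item, rating) triples as test data."""
    rng = random.Random(seed)
    shuffled = ratings[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def rmse(predict, test):
    """Root-mean-square error of `predict(user, item)` on the hidden ratings."""
    errors = [(predict(u, i) - r) ** 2 for u, i, r in test]
    return math.sqrt(sum(errors) / len(errors))

def make_mean_predictor(train):
    """Baseline: predict each item's mean training rating (global mean if unseen)."""
    global_mean = sum(r for _, _, r in train) / len(train)
    by_item = {}
    for _, item, r in train:
        by_item.setdefault(item, []).append(r)

    def predict(user, item):
        rs = by_item.get(item)
        return sum(rs) / len(rs) if rs else global_mean

    return predict

# Toy ratings (made up); with MovieLens you would load the real triples instead.
ratings = [
    ("ann", "m1", 5), ("ann", "m2", 3), ("ann", "m3", 4),
    ("bob", "m1", 4), ("bob", "m2", 2), ("bob", "m4", 5),
    ("cat", "m2", 3), ("cat", "m3", 5), ("cat", "m4", 4),
    ("dan", "m1", 5),
]
train, test = split_ratings(ratings)
baseline_rmse = rmse(make_mean_predictor(train), test)
```

To compare two algorithms, train each one on `train` only and compute `rmse` for both on the same `test` set; the lower error wins on this metric. Repeating the split with different seeds (or using cross-validation) gives a less noisy comparison.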
I also recommend a Coursera course on recommender systems taught by GroupLens folks. In that course you'll learn to use LensKit, a recommender systems framework in Java that includes a large evaluation suite. Even if you don't take the course, LensKit may be just what you want.

Project ideas for discrete mathematics course using MATLAB? [closed]

A professor asked me to help write a specification for a college project.
By then, the students should know the basics of programming.
The professor is a mathematician and has little experience in other programming languages, so it should really be in MATLAB.
I would like some project ideas. The project should
last about 1 to 2 months
be done individually
ideally have a web interface
not necessarily go deep into maths, though some would be great
use a database (or store data in files)
What kind of project would make the students excited?
If you have any other tips, I'd appreciate them.
UPDATE: The students are sophomores and have already studied vector calculus. This project is for a one-year Discrete Mathematics course.
UPDATE 2: The topics covered in the course are
Formal Logic
Proofs, Recursion, and Analysis of Algorithms
Sets and Combinatorics
Relations, Functions, and Matrices
Graphs and Trees
Graph Algorithms
Boolean Algebra and Computer Logic
Modeling Arithmetic, Computation, and Languages
And it'll be based on this book: Mathematical Structures for Computer Science: A Modern Approach to Discrete Mathematics, by Judith L. Gersting
General Suggestions:
There are many teaching resources at The MathWorks that may give you some ideas for course projects. Some sample links:
The MATLAB Central blogs, specifically some posts by Loren that include using LEGO Mindstorms in teaching and a webinar about MATLAB for teaching (note: you will have to sign up to see the webinar)
The Curriculum Exchange: a repository of course materials
Teaching with MATLAB and Simulink: a number of other links you may find useful
Specific Suggestions:
One of my grad school projects in non-linear dynamics that I found interesting dealt with Lorenz oscillators. A Lorenz oscillator is a non-linear system of three variables that can exhibit chaotic behavior. Such a system would provide an opportunity to introduce the students to numerical computation (iterative methods for simulating systems of differential equations, stability and convergence, etc.).
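For a feel of what the simulation part involves, here is a minimal sketch in Python (the same structure carries over to MATLAB). It integrates the standard Lorenz equations with a fixed-step fourth-order Runge-Kutta scheme. The parameter values (sigma = 10, rho = 28, beta = 8/3) are the classic textbook choice, and the step size, initial conditions and function names are assumptions for illustration.

```python
def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz system for state (x, y, z)."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step for state' = f(state)."""
    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = f(state)
    k2 = f(shifted(state, k1, dt / 2))
    k3 = f(shifted(state, k2, dt / 2))
    k4 = f(shifted(state, k3, dt))
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def simulate(initial, dt=0.01, steps=2000):
    """Return the list of states visited, starting from `initial`."""
    trajectory = [initial]
    for _ in range(steps):
        trajectory.append(rk4_step(lorenz, trajectory[-1], dt))
    return trajectory

# Two runs whose starting points differ by a tiny amount in z.
traj_a = simulate((1.0, 1.0, 1.0))
traj_b = simulate((1.0, 1.0, 1.0 + 1e-8))
```

Plotting the two trajectories against each other is a nice way for students to see sensitivity to initial conditions, and varying `dt` lets them explore the stability and convergence of the integrator.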
The most interesting thing about this project was that we were using Lorenz oscillators to encode and decode signals. This "encrypted communication" aspect was really cool, and was based on the following journal article:
Kevin M. Cuomo and Alan V. Oppenheim, "Circuit Implementation of Synchronized Chaos with Applications to Communications," Physical Review Letters 71(1), 65-68 (1993)
The article addresses hardware implementations of a chaotic communication system, but the equivalent software implementation should be simple enough to derive (and much easier for the students to implement!).
Some other useful aspects of such a project:
The behavior of the system can be visualized in 2-D and 3-D plots, thus exposing the students to a number of graphing utilities in MATLAB (PLOT, PLOT3, COMET, COMET3, etc.).
Audio signals can be read from files, encrypted using the Lorenz equations, written out to a new file, and then decrypted once again. You could even have the students each encrypt a signal with their Lorenz oscillator code and give it to another student to decrypt. This would introduce them to various file operations (FREAD, FWRITE, SAVE, LOAD, etc.), and you could even introduce them to working with audio data file formats.
You can introduce the students to the use of the PUBLISH command in MATLAB, which allows you to format M-files and publish them to various output types (like HTML or Word documents). This will teach them techniques for making useful help documentation for their MATLAB code.
I have found that implementing and visualizing dynamical systems is great for giving an introduction to programming and to an interesting branch of applied mathematics. Because one can see the 'life' in these systems, our students really enjoy this practical module.
We usually start off by visualizing a 1D attractor, so that we can overlay the evolution rule/rate of change with the current state of the system. That way you can teach computational aspects (integrating the system) and visualization, and the separation of both in implementation (on a simple level, refreshing graphics at every n-th computation step, but in C++ leading to threads, unsure about MATLAB capabilities here).
Next we add noise, and then add a sigmoidal nonlinearity to the linear attractor. We combine this extension with an introduction to version control (we use a sandbox SVN repository for this): the students first have to create branches, modify the evolution rule and then merge it back into HEAD.
When going 2D you can simply start with a rotation and modify it to become a Hopf oscillator, and visualize either by morphing a grid over time or by going 3D when starting with a distinct point. You can also visualize the bifurcation diagram in 3D. So you again combine generic MATLAB skills like 3D plotting with the maths.
To link in other topics, browse around Wikipedia: you can bring in predator-prey models, chaotic systems, physical systems, etc.
We usually do not teach object-oriented-programming from within MATLAB, although it is possible and you can easily make up your own use cases in the dynamical systems setting.
When introducing inheritance, we will already have moved on to C++, and I'm again unaware of MATLAB's capabilities here.
Coming back to your five points:
Duration is easily adjusted, because the simple 1D attractor can be done quickly and from then on, extensions are ample and modular.
We assign this as an individual task, but allow and encourage discussion among students.
About the web interface I'm at a loss: what exactly do you have in mind, why is it important, what would it add to the assignment, and how does it relate to learning MATLAB? I would recommend dropping this.
Complexity: A simple attractor is easily understood, but the sky's the limit :)
Using a database really is a lot different from using config files. As to the former, there is a database toolbox for accessing databases from MATLAB. Few institutes have the license though, and apart from that, this IMHO does not belong in such a course. I suggest introducing the concept of config files, e.g. for the location and strength of the attractor, and later for the system's respective properties.
All this said, I would at least also tell your professor (and your students!) that Python is rising up against MATLAB. We are in the process of moving to Python with our tutorials, but I understand if someone wants to stick with what's familiar.
Also, we actually need the scientific content later on, so the usefulness for you will probably depend on which department your course will be related to.
A lot of things are possible.
The first example that comes to mind is to model a public transportation network (the network of your city, with underground, buses, tramways, ...). It is represented by a weighted directed graph (you can use a sparse matrix to represent it, for example).
You may, for example, ask them to compute the shortest path from one station to another (the Moore-Dijkstra algorithm, for example) and display it.
So, for the students, the several steps to do are:
choose an appropriate representation for the network (it could be some objects to represent the properties of the stations and the lines, and a sparse matrix for the network)
load all the data (you can provide them the data in an XML file)
be able to draw the network (since you will put the coordinates of the stations)
calculate the shortest path from one point to another and display it in a pretty way
create a front end (with a GUI)
Of course, this could be complicated by adding connection times (when you change from one line to another) or asking for several options (shortest path with minimum connections, taking into consideration the time you lose by waiting for a train/bus, ...).
The level of detail will depend on the level of the students and the time they can spend on it (it could be very simple, or very realistic).
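The shortest-path step could be sketched roughly as follows, here in Python with a dictionary of dictionaries standing in for the sparse adjacency matrix. The station names and travel times are invented for the example, and the function name is an assumption; the algorithm itself is the usual Dijkstra with a priority queue, which requires non-negative weights.

```python
import heapq

def shortest_path(graph, source, target):
    """Dijkstra's algorithm on a weighted directed graph given as
    {station: {neighbour: travel_time}} with non-negative weights.
    Returns (total_time, [stations]) or (inf, []) if target is unreachable."""
    dist = {source: 0}
    prev = {}
    queue = [(0, source)]
    done = set()
    while queue:
        d, node = heapq.heappop(queue)
        if node in done:
            continue  # stale queue entry
        done.add(node)
        if node == target:
            break
        for nxt, w in graph.get(node, {}).items():
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(queue, (nd, nxt))
    if target not in dist:
        return float("inf"), []
    # Walk the predecessor links back from the target.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]

# Made-up network; weights are travel times in minutes.
network = {
    "Central": {"Museum": 4, "Harbour": 10},
    "Museum": {"Harbour": 3, "Airport": 12},
    "Harbour": {"Airport": 5},
    "Airport": {},
}
minutes, route = shortest_path(network, "Central", "Airport")
# minutes == 12, route == ["Central", "Museum", "Harbour", "Airport"]
```

Connection times can be modelled without changing the algorithm, for example by splitting a station into one node per line and adding edges between them weighted by the transfer time.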
You want to do a project with a web interface and a database, but not any serious math... and you're doing it in MATLAB? Do you understand that MATLAB is especially designed to be used for "deep math", and not for web interfaces or databases?
I think if this is an intro to a Discrete Mathematics course, you should probably do something involving Discrete Mathematics, and not waste the students' time as they learn a bunch of things in that language that they'll never actually use.
Why not do something involving audio? I did an undergraduate project in which we used MATLAB to automatically beat-match different tunes and DJ mix between them. The full program took all semester, but you could do a subset of it. wavread() and the like are built in and easy to use.
Or do some simple image processing like finding Waldo using cross-correlation.
Maybe do something involving cryptography, have them crack a simple encryption scheme and feel like hackers.
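As one possible version of the "crack a simple encryption scheme" idea, here is a small Python sketch that breaks a Caesar (shift) cipher by trying all 26 shifts and keeping the one whose output contains the most common English letters. The choice of cipher, the crude scoring rule and all names are assumptions for illustration; short or unusual messages can fool a score this simple.

```python
import string

COMMON = "etaoinshr"  # frequent English letters, used as a crude score

def shift_text(text, shift):
    """Shift every ASCII letter by `shift` positions, leaving other characters alone."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - 97 + shift) % 26 + 97))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - 65 + shift) % 26 + 65))
        else:
            out.append(ch)
    return "".join(out)

def crack_caesar(ciphertext):
    """Try all 26 shifts; return (shift, plaintext) for the best-scoring candidate."""
    def score(candidate):
        lowered = candidate.lower()
        return sum(lowered.count(c) for c in COMMON)

    best = max(range(26), key=lambda s: score(shift_text(ciphertext, -s)))
    return best, shift_text(ciphertext, -best)

message = ("Discrete mathematics is the study of countable structures "
           "such as graphs and trees")
secret = shift_text(message, 7)
found_shift, recovered = crack_caesar(secret)
```

This ties in with modular arithmetic and counting, and students can swap ciphertexts with each other in the same way as suggested for the audio signals.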
MATLAB started life as a MATrix LAB, so maybe concentrating on problems in linear algebra would be a natural fit.
Discrete math problems using matrices include:
Spanning trees and shortest paths
The marriage problem (bipartite graphs)
Matching algorithms
Maximal flow in a network
The transportation problem
See Gil Strang's "Intro to Applied Math" or Knuth's "Concrete Math" for ideas.
You might look here:
on the MathWorks website. The interactive tutorial (second link) is quite popular.
I always thought the one I was assigned in grad school was a good choice: a magnetic lens simulator. The math isn't completely overwhelming so you can focus more on learning the language, and it's a good intro to the graphical capabilities (e.g., animating the path of an off-axis electron going through the lens).
db I/O and fancy interfaces are out of place in a discrete math course.
my matlab labs were typically algorithm implementations, with charts as output, and simple file input.
how hard is the material? image processing is really easy in matlab, can you do some discrete 2D filtering? blurs and stuff.