Classification in real time without prior knowledge of the number of classes

Is there an implemented algorithm (preferably in Python, R, or Java) that can classify incoming data from an unknown generator with absolutely no prior knowledge or assumptions?
For example:
Let G be a generator of 2D vectors that generates one vector every second.
What we know, and nothing else, is that these vectors are separable into clusters in space (Euclidean distance).
Question: How can I classify my data in real time so that at each iteration the algorithm proposes clusters?

I'm also in the process of searching for something related to Data stream clustering and I found some papers and code:
Aforementioned survey by Charu C. Aggarwal from aforementioned book;
Density-Based Clustering over an Evolving Data Stream with Noise. by Feng Cao et al., proposes DenStream; here is a git repo for that (Matlab);
Density-Based Clustering for Real-Time Stream Data by Yixin Chen, Li Tu, proposes the D-Stream framework (2008 version called Stream data clustering based on grid density and attraction); There is a DD-Stream that I can't find a pdf for in A grid and density-based clustering algorithm for processing data stream by Jia.
A Fast and Stable Incremental Clustering Algorithm by Steven Young et al. focuses on clustering as an unsupervised learning process;
Self-Adaptive Anytime Stream Clustering by Philipp Kranen et al. has ClusTree and this git repo implements DClusTree;
Pre-clustering algorithm for anomaly detection and clustering that uses variable size buckets by Manish Sharma et al. is more recent and may be relevant (git repo of the author);
This paper about MOA (Massive Online Analysis) states that it implements some of the above (StreamKM++, CluStream, ClusTree, Den-Stream, D-Stream and CobWeb). I believe that D-Stream is a work in progress or wishful thinking (it is not part of the pre-release available from their website). MOA is written in Java; here is the streamMOA package.
The code in this repository seems to be a Python implementation of the D-Stream but, according to the author, it is slow.
Also, stream is a framework for data stream clustering research with R.
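To make the general idea concrete, here is a minimal sketch of threshold-based sequential ("leader") clustering, which proposes clusters after every incoming vector without fixing their number in advance. It is not an implementation of any of the algorithms listed above, and the class name `LeaderClusterer` and the `radius` parameter are hypothetical. Note that the distance threshold is itself an assumption about the data.

```python
import math


class LeaderClusterer:
    """Threshold-based sequential clustering; the number of clusters is not fixed."""

    def __init__(self, radius):
        self.radius = radius   # assumed distance threshold (this IS prior knowledge)
        self.centroids = []    # running mean of each cluster
        self.counts = []       # number of points absorbed by each cluster

    def update(self, point):
        """Assign one incoming point and return the index of its cluster."""
        best, best_dist = None, float("inf")
        for i, centroid in enumerate(self.centroids):
            d = math.dist(point, centroid)
            if d < best_dist:
                best, best_dist = i, d
        if best is None or best_dist > self.radius:
            # Too far from every known cluster: open a new one.
            self.centroids.append(list(point))
            self.counts.append(1)
            return len(self.centroids) - 1
        # Otherwise move the winning centroid toward the point (running mean).
        self.counts[best] += 1
        n = self.counts[best]
        self.centroids[best] = [
            c + (x - c) / n for c, x in zip(self.centroids[best], point)
        ]
        return best


# Hypothetical stream of 2D vectors, processed one at a time.
stream = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (0.1, 0.2), (5.2, 4.9)]
clusterer = LeaderClusterer(radius=1.0)
labels = [clusterer.update(p) for p in stream]
print(labels, len(clusterer.centroids))
```

The result depends on the order of arrival and on `radius`, which is one reason the density-based and micro-cluster methods above exist.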

I think you are asking about "Stream mining" here.
Read this article
Chapter 10: A Survey of Stream Clustering Algorithms.
Charu C. Aggarwal, IBM T. J. Watson Research Center, Yorktown Heights, NY
This can be found in the 2014 book
DATA CLUSTERING- Algorithms and Applications, Edited by Charu C. Aggarwal and Chandan K. Reddy.
In that chapter the "CluStream" framework is described. This project is from 2002, and it is based on the BIRCH algorithm from 1997 which is a "Micro-Clustering" approach. The algorithm creates an index structure on the fly.
Considering that there are few BIRCH implementations,
there is probably no open-source CluStream algorithm/framework available.
Here's a Github repo with a BIRCH implementation in Java - although I haven't tried this code, and that repo is not for "stream mining".
All this just appeared on my radar because I just recently participated in the Coursera MOOC on cluster analysis.
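For what it's worth, the micro-clustering idea can be sketched in a few lines. The summary below follows the usual BIRCH "cluster feature" triple (count, linear sum, sum of squares); the class and method names are my own, and CluStream's additional time statistics are left out, so treat it as an illustration rather than as either algorithm.

```python
class ClusterFeature:
    """BIRCH-style summary (N, LS, SS) of the points in one micro-cluster."""

    def __init__(self, dim):
        self.n = 0             # number of points absorbed
        self.ls = [0.0] * dim  # linear sum per dimension
        self.ss = 0.0          # sum of squared norms of the points

    def add(self, point):
        """Absorb one point; the raw point does not need to be stored."""
        self.n += 1
        self.ls = [a + x for a, x in zip(self.ls, point)]
        self.ss += sum(x * x for x in point)

    def merge(self, other):
        """Summaries are additive, so two micro-clusters merge in O(dim)."""
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self):
        return [a / self.n for a in self.ls]

    def radius(self):
        """Root mean squared distance of the absorbed points to the centroid."""
        c2 = sum(c * c for c in self.centroid())
        return max(self.ss / self.n - c2, 0.0) ** 0.5


cf = ClusterFeature(dim=2)
for p in [(0.0, 0.0), (2.0, 0.0)]:
    cf.add(p)
print(cf.centroid(), cf.radius())
```

Because only these sums are kept, memory use does not grow with the length of the stream.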

There are no assumption-free methods.
You are asking for magic to happen.
Never blindly use a clustering result. Do not use clustering on a stream. Instead, analyze and correct any clustering result before deployment.
Watch out for hidden assumptions. For example, assumptions that clusters are convex, are distance-based (why is Euclidean distance the correct choice?), have the same size, extent, or shape, or are separated (by what?). Whenever you design a method, you make assumptions about what is interesting.
Without assumptions, anything is a "clustering"!

Related

How to update a dataset without re-clustering the whole dataset after clustering has finished?

I used spectral clustering. What should I do to avoid re-clustering the whole dataset?
There are papers on how to infer the spectral embedding for new data points, e.g.,
Bengio, Y., Paiement, J. F., Vincent, P., Delalleau, O., Roux, N. L., & Ouimet, M. (2004). Out-of-sample extensions for lle, isomap, mds, eigenmaps, and spectral clustering. In Advances in neural information processing systems (pp. 177-184).
You can then assign them to the nearest k-means cluster and update the means.
But implementing this will require quite some coding work on your part, in particular to make it fast.
Clustering isn't really meant to be updatable, and neither is spectral embedding, so it is worth looking into alternative algorithms and reconsidering whether you really need this.
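The "assign to the nearest k-means cluster and update the means" step could look roughly like this. The sketch assumes the new point has already been mapped into the spectral embedding (the out-of-sample extension from the paper is not implemented here), and the function name and the example numbers are hypothetical.

```python
import math


def assign_and_update(point, centroids, counts):
    """Assign an embedded point to the nearest centroid and update that mean in place."""
    k = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
    counts[k] += 1
    # Incremental mean: new_mean = old_mean + (x - old_mean) / n
    centroids[k] = [c + (x - c) / counts[k] for c, x in zip(centroids[k], point)]
    return k


# Hypothetical state left over from the original k-means run on the embedding.
centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [4, 4]
label = assign_and_update((1.0, 1.0), centroids, counts)
print(label, centroids[label])
```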

Evaluation of user-based collaborative filtering K-Nearest Neighbor Algorithm

I was trying to find evaluation mechanisms for the collaborative k-nearest-neighbor algorithm, but I am confused about how to evaluate it. How can I be sure that the recommendations produced by this algorithm are correct or good? I have also developed an algorithm of my own that I want to compare with it, but I am not sure how to compare and evaluate the two. The dataset I am using is MovieLens.
Your help with evaluating this recommender system will be highly appreciated.
Evaluating recommender systems is a large concern of its research and industry communities. Look at "Evaluating collaborative filtering recommender systems", a Herlocker et al paper. The people who publish MovieLens data (the GroupLens research lab at the University of Minnesota) also publish many papers on recsys topics, and the PDFs are often free at http://grouplens.org/publications/.
Check out https://scholar.google.com/scholar?hl=en&q=evaluating+recommender+systems.
In short, you should use a method that hides some data. You will train your model on a portion of the data (called "training data") and test on the remainder of the data that your model has never seen before. There's a formal way to do this called cross-validation, but the general concept of visible training data versus hidden test data is the most important.
I also recommend https://www.coursera.org/learn/recommender-systems, a Coursera course on recommender systems taught by GroupLens folks. In that course you'll learn to use LensKit, a recommender systems framework in Java that includes a large evaluation suite. Even if you don't take the course, LensKit may be just what you want.
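To illustrate the hidden-data idea, here is a minimal hold-out sketch that scores a predictor by RMSE on ratings it never saw. The ratings are made up, the global-mean predictor only stands in for your kNN model, and all function names are hypothetical; it is not how LensKit does it.

```python
import math
import random


def split_ratings(ratings, test_fraction=0.2, seed=0):
    """Shuffle (user, item, rating) triples and hide a fraction as test data."""
    rng = random.Random(seed)
    shuffled = ratings[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def rmse(predict, test):
    """Root mean squared error of predict(user, item) on the hidden ratings."""
    errors = [(predict(user, item) - rating) ** 2 for user, item, rating in test]
    return math.sqrt(sum(errors) / len(errors))


def global_mean_predictor(train):
    """Baseline that ignores user and item; replace with the model under test."""
    mean = sum(rating for _, _, rating in train) / len(train)
    return lambda user, item: mean


# Made-up ratings in MovieLens style: (user, movie, stars).
ratings = [
    ("u1", "m1", 4.0), ("u1", "m2", 3.0), ("u1", "m3", 5.0),
    ("u2", "m1", 2.0), ("u2", "m3", 4.0), ("u2", "m4", 1.0),
    ("u3", "m2", 5.0), ("u3", "m3", 4.0), ("u3", "m4", 2.0),
    ("u4", "m1", 3.0),
]
train, test = split_ratings(ratings)
baseline = global_mean_predictor(train)
print("baseline RMSE:", rmse(baseline, test))
```

Running both algorithms through the same `rmse` call on the same split gives a like-for-like comparison; repeating this over several splits is what cross-validation amounts to.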

Are there any implementations available online for filter based feature selection methods?

The selection methods I am looking for are the ones based on subset evaluation (i.e. do not simply rank individual features). I prefer implementations in Matlab or based on WEKA, but implementations in any other language will still be useful.
I am aware of the existence of CfsSubsetEval and ConsistencySubsetEval in WEKA, but they did not lead to good classification performance, probably because they suffer from the following limitations:
CfsSubsetEval is biased toward small feature subsets, which may prevent locally predictive features from being included in the selected subset, as noted in [1].
ConsistencySubsetEval uses a min-features bias [2] which, similarly to CfsSubsetEval, results in the selection of too few features.
I know it is "too few" because I have built classification models with larger subsets and their classification performance was considerably better.
[1] M. A. Hall, Correlation-based Feature Subset Selection for Machine Learning, 1999.
[2] Liu, Huan, and Lei Yu, Toward integrating feature selection algorithms for classification and clustering, 2005.
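For context, this is the subset merit heuristic from [1] as I understand it: a subset scores well when its features correlate with the class but not with each other. The sketch below takes the correlations as given numbers; the function name and the example values are hypothetical, and WEKA's actual implementation may estimate and combine the correlations differently.

```python
import math


def cfs_merit(feature_class_corr, feature_feature_corr):
    """Merit of a subset: k * r_cf / sqrt(k + k * (k - 1) * r_ff).

    feature_class_corr: one feature-class correlation per selected feature.
    feature_feature_corr: one correlation per pair of selected features.
    """
    k = len(feature_class_corr)
    r_cf = sum(feature_class_corr) / k
    r_ff = (
        sum(feature_feature_corr) / len(feature_feature_corr)
        if feature_feature_corr
        else 0.0
    )
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Made-up correlations, purely for illustration.
print(cfs_merit([0.5], []))           # a single feature
print(cfs_merit([0.5, 0.5], [1.0]))   # a fully redundant second feature
print(cfs_merit([0.5, 0.5], [0.0]))   # an uncorrelated second feature
```

A weakly predictive feature lowers the average feature-class correlation, which is one way to see why such a criterion can prefer small subsets.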
Check out scikit-learn, a Python library of simple and efficient tools for data mining and data analysis. It implements various methods for feature selection, classification, and evaluation, and has a lot of documentation and tutorials.
My search has led me to the following implementations:
FEAST toolbox: an interesting toolbox, developed at the University of Manchester, that provides implementations of Shannon's information theory functions. The implementations can be downloaded from THIS webpage, and they can be used to evaluate individual features as well as subsets of features.
I have also found THIS matlab code, which is an implementation for a selection algorithm based on Interaction Information.
PY_FS: A Python Package for Feature Selection
I came across this package [1] which was just released (2021) and contains many methods with reference to their original papers.

hierarchical k-means clustering for SIFT vectors

All
I am looking to apply the same approach as David Nister and Henrik Stewenius in http://www.wisdom.weizmann.ac.il/~bagon/CVspring07/files/scalable.pdf
In this paper, they use a large number of SIFT vectors (128-D) as input to hierarchical k-means clustering to construct a hierarchical visual vocabulary tree.
Does anyone know a good library that I can use to do this clustering?
PS: the number of input SIFT descriptors is high (70,000,000) and I want the result to be a vocabulary tree with 1,000,000 leaf nodes.
thanks very much.
regards.
The ClusterQuantiser tool in OpenIMAJ should be able to do this if the data is in a supported format. If the tool can't work with your data out of the box, then you could write a driver for the org.openimaj.ml.clustering.kmeans.HierarchicalByteKMeans class (in the svn trunk version) or the org.openimaj.ml.clustering.kmeans.HByteKMeans class in the 1.0.5 release. Both versions of the class support streaming data from disk, so you don't need to hold all the features in memory!
For completeness, vlfeat also has a hierarchical k-means implementation, but I'm not sure how much it scales.
From practical experience, you might also consider sampling the features before clustering. I'm not sure that you'll get much benefit from clustering them all.
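If it helps to see the structure, hierarchical k-means is just k-means applied recursively to each cluster. Below is a small pure-Python sketch on toy 2D points; it is not the OpenIMAJ or vlfeat implementation, the function names are hypothetical, and it would be far too slow for 70,000,000 descriptors.

```python
import math
import random


def kmeans(points, k, iters=10, rng=None):
    """Plain Lloyd iterations; returns the centers and the final assignment groups."""
    rng = rng or random.Random(0)
    centers = rng.sample(points, k)
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[j].append(p)
        # Recompute each center; keep the old one if its group is empty.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers, groups


def build_tree(points, branch, depth, rng=None):
    """Recursively split the points into `branch` children, `depth` levels deep."""
    rng = rng or random.Random(0)
    node = {
        "center": [sum(col) / len(points) for col in zip(*points)],
        "children": [],
    }
    if depth == 0 or len(points) < branch:
        return node
    _, groups = kmeans(points, branch, rng=rng)
    node["children"] = [build_tree(g, branch, depth - 1, rng) for g in groups if g]
    return node


def count_leaves(node):
    if not node["children"]:
        return 1
    return sum(count_leaves(child) for child in node["children"])


# Toy stand-in for SIFT descriptors: 40 random 2D points.
data_rng = random.Random(1)
points = [(data_rng.random(), data_rng.random()) for _ in range(40)]
tree = build_tree(points, branch=2, depth=2)
print(count_leaves(tree))
```

The tree has at most `branch ** depth` leaves, so for example 10 branches over 6 levels would give up to 1,000,000 leaf nodes.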

Project ideas for discrete mathematics course using MATLAB? [closed]

A professor asked me to help write a specification for a college project.
By then the students should know the basics of programming.
The professor is a mathematician and has little experience in other programming languages, so it should really be in MATLAB.
I would like some projects ideas. The project should
last about 1 to 2 months
be done individually
ideally have a web interface
not necessarily go deep into maths, but some would be great
use a database (or store data in files)
What kind of project would make the students excited?
If you have any other tips, I'd appreciate them.
UPDATE: The students are sophomores and have already studied vector calculus. This project is for a one-year Discrete Mathematics course.
UPDATE 2: The topics covered in the course are
Formal Logic
Proofs, Recursion, and Analysis of Algorithms
Sets and Combinatorics
Relations, Functions, and Matrices
Graphs and Trees
Graph Algorithms
Boolean Algebra and Computer Logic
Modeling Arithmetic, Computation, and Languages
And it'll be based on this book Mathematical Structures for Computer Science: A Modern Approach to Discrete Mathematics by Judith L. Gersting
General Suggestions:
There are many teaching resources at The MathWorks that may give you some ideas for course projects. Some sample links:
The MATLAB Central blogs, specifically some posts by Loren that include using LEGO Mindstorms in teaching and a webinar about MATLAB for teaching (note: you will have to sign up to see the webinar)
The Curriculum Exchange: a repository of course materials
Teaching with MATLAB and Simulink: a number of other links you may find useful
Specific Suggestions:
One of my grad school projects in non-linear dynamics that I found interesting dealt with Lorenz oscillators. A Lorenz oscillator is a non-linear system of three variables that can exhibit chaotic behavior. Such a system would provide an opportunity to introduce the students to numerical computation (iterative methods for simulating systems of differential equations, stability and convergence, etc.).
The most interesting thing about this project was that we were using Lorenz oscillators to encode and decode signals. This "encrypted communication" aspect was really cool, and was based on the following journal article:
Kevin M. Cuomo and Alan V. Oppenheim, Circuit Implementation of Synchronized Chaos with Applications to Communications, Physical Review Letters 71(1), 65-68 (1993)
The article addresses hardware implementations of a chaotic communication system, but the equivalent software implementation should be simple enough to derive (and much easier for the students to implement!).
Some other useful aspects of such a project:
The behavior of the system can be visualized in 2-D and 3-D plots, thus exposing the students to a number of graphing utilities in MATLAB (PLOT, PLOT3, COMET, COMET3, etc.).
Audio signals can be read from files, encrypted using the Lorenz equations, written out to a new file, and then decrypted once again. You could even have the students each encrypt a signal with their Lorenz oscillator code and give it to another student to decrypt. This would introduce them to various file operations (FREAD, FWRITE, SAVE, LOAD, etc.), and you could even introduce them to working with audio data file formats.
You can introduce the students to the use of the PUBLISH command in MATLAB, which allows you to format M-files and publish them to various output types (like HTML or Word documents). This will teach them techniques for making useful help documentation for their MATLAB code.
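As a rough idea of the numerical core of such a project, here is a sketch that integrates the Lorenz equations with a fixed-step fourth-order Runge-Kutta scheme. It is written in Python rather than MATLAB purely for illustration, it uses the commonly quoted parameter values (sigma = 10, rho = 28, beta = 8/3) as an assumption, and it does not cover the signal encoding from the article.

```python
def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz system for state (x, y, z)."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step of size dt."""
    def shifted(s, k, h):
        return tuple(a + h * b for a, b in zip(s, k))

    k1 = f(state)
    k2 = f(shifted(state, k1, dt / 2))
    k3 = f(shifted(state, k2, dt / 2))
    k4 = f(shifted(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(initial, dt=0.01, steps=1000):
    """Return the list of states visited, starting with the initial state."""
    trajectory = [initial]
    for _ in range(steps):
        trajectory.append(rk4_step(lorenz, trajectory[-1], dt))
    return trajectory


trajectory = simulate((1.0, 1.0, 1.0))
print(len(trajectory), trajectory[-1])
```

Having students rerun this with a slightly perturbed initial state and plot both trajectories is a quick way to show the sensitivity to initial conditions.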
I have found that implementing and visualizing Dynamical systems is great
for giving an introduction to programming and to an interesting branch of
applied mathematics. Because one can see the 'life' in these systems,
our students really enjoy this practical module.
We usually start off by visualizing a 1D attractor, so that we can
overlay the evolution rule/rate of change with the current state of
the system. That way you can teach computational aspects (integrating the system) and
visualization, and the separation of both in implementation (on a simple level, refreshing
graphics at every n-th computation step, but in C++ leading to threads, unsure about MATLAB capabilities here).
Next we add noise, and then add a sigmoidal nonlinearity to the linear attractor. We combine this extension with an introduction to version control (we use a sandbox SVN repository for this): The
students first have to create branches, modify the evolution rule and then merge
it back into HEAD.
When going 2D you can simply start with a rotation and modify it to become a Hopf oscillator, and visualize either by morphing a grid over time or by going 3D when starting with a distinct point. You can also visualize the bifurcation diagram in 3D. So you again combine generic MATLAB skills like 3D plotting with the maths.
To link in other topics, browse around Wikipedia: you can bring in predator-prey models, chaotic systems, physical systems, etc.
We usually do not teach object-oriented-programming from within MATLAB, although it is possible and you can easily make up your own use cases in the dynamical systems setting.
When introducing inheritance, we will already have moved on to C++, and I'm again unaware of MATLAB's capabilities here.
Coming back to your five points:
Duration is easily adjusted, because the simple 1D attractor can be
done quickly and from then on, extensions are ample and modular.
We assign this as an individual task, but allow and encourage discussion among students.
About the web interface I'm at a loss: what exactly do you have in mind, why is it
important, what would it add to the assignment, how does it relate to learning MATLAB.
I would recommend dropping this.
Complexity: A simple attractor is easily understood, but the sky's the limit :)
Using a database really is a lot different from config files. As to the first, there
is a database toolbox for accessing databases from MATLAB. Few institutes have the license though, and apart from that, this IMHO does not belong in such a course. I suggest introducing the concept of config files, e.g. for the location and strength of the attractor, and later for the system's respective properties.
All this said, I would at least also tell your professor (and your students!) that Python is rising up against MATLAB. We are in the process of moving our tutorials to Python, but I understand if someone wants to stick with what's familiar.
Also, we actually need the scientific content later on, so the usefulness for you will probably depend on which department your course will be related to.
A lot of things are possible.
The first example that comes to mind is to model a public transportation network (the network of your city, with underground, buses, tramways, ...). It is represented by a weighted directed graph (you can use a sparse matrix to represent it, for example).
You may, for example, ask them to compute the shortest path from one station to another (with the Moore-Dijkstra algorithm, for example) and display it.
So, for the students, the several steps to do are:
choose an appropriate representation for the network (it could be some objects to represent the properties of the stations and the lines, and a sparse matrix for the network)
load all the data (you can provide them the data in an XML file)
be able to draw the network (since you will put the coordinates of the stations)
calculate the shortest path from one point to another and display it in a pretty way
create a front end (with a GUI)
Of course, this could be made more complicated by adding connection times (when you change from one line to another) or by asking for several options (shortest path with minimum connections, taking into consideration the time you lose waiting for a train/bus, ...).
The level of detail will depend on the level of the students and the time they can spend on it (it could be very simple, or very realistic).
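The shortest-path step could be sketched as follows. This is Dijkstra's algorithm over an adjacency dictionary, written in Python rather than MATLAB just to show the logic; the station names and travel times are made up, and the function name is hypothetical.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra on {station: {neighbor: travel_time}}; returns (path, total_time)."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a shorter route was already found
        for neighbor, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                prev[neighbor] = node
                heapq.heappush(heap, (nd, neighbor))
    if target not in dist:
        return None, float("inf")
    # Walk the predecessor links back from the target to the source.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]


# Made-up network: weights are travel times in minutes.
network = {
    "A": {"B": 4, "C": 2},
    "B": {"D": 5},
    "C": {"B": 1, "D": 8},
    "D": {},
}
path, minutes = shortest_path(network, "A", "D")
print(path, minutes)
```

Connection times can be modeled in the same framework by splitting a station into one node per line and adding edges between them weighted by the transfer time.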
You want to do a project with a web interface and a database, but not any serious math... and you're doing it in MATLAB? Do you understand that MATLAB is especially designed to be used for "deep math", and not for web interfaces or databases?
I think if this is an intro to a Discrete Mathematics course, you should probably do something involving Discrete Mathematics, and not waste the students' time as they learn a bunch of things in that language that they'll never actually use.
Why not do something involving audio? I did an undergraduate project in which we used MATLAB to automatically beat-match different tunes and DJ mix between them. The full program took all semester, but you could do a subset of it. wavread() and the like are built in and easy to use.
Or do some simple image processing like finding Waldo using cross-correlation.
Maybe do something involving cryptography, have them crack a simple encryption scheme and feel like hackers.
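For the cryptography idea, a Caesar cipher is about the simplest scheme students can crack by brute force. Here is a sketch in Python (not MATLAB); the tiny word list used for scoring and the function names are my own assumptions, and a real assignment would want a larger dictionary or letter-frequency scoring.

```python
import string

# Assumed tiny dictionary used to recognise English plaintext.
COMMON_WORDS = {"the", "and", "of", "to", "a", "in", "is"}


def caesar(text, shift):
    """Shift lowercase letters by `shift` positions; leave other characters alone."""
    letters = string.ascii_lowercase
    s = shift % 26
    table = str.maketrans(letters, letters[s:] + letters[:s])
    return text.translate(table)


def crack(ciphertext):
    """Try all 26 shifts and keep the one that yields the most common words."""
    def score(shift):
        return sum(w in COMMON_WORDS for w in caesar(ciphertext, -shift).split())

    best = max(range(26), key=score)
    return best, caesar(ciphertext, -best)


message = "the quick brown fox jumps over the lazy dog and the cat"
secret = caesar(message, 3)
print(secret)
print(crack(secret))
```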
MATLAB started life as a MATrix LAB, so maybe concentrating on problems in linear algebra would be a natural fit.
Discrete math problems using matrices include:
Spanning trees and shortest paths
The marriage problem (bipartite graphs)
Matching algorithms
Maximal flow in a network
The transportation problem
See Gil Strang's "Intro to Applied Math" or Knuth's "Concrete Math" for ideas.
You might look here: http://www.mathworks.com/academia/student_center/tutorials/launchpad.html
on the MathWorks website. The interactive tutorial (second link) is quite popular.
--Loren
I always thought the one I was assigned in grad school was a good choice-a magnetic lens simulator. The math isn't completely overwhelming so you can focus more on learning the language, and it's a good intro to the graphical capabilities (e.g., animating the path of an off-axis electron going through the lens).
db I/O and fancy interfaces are out of place in a discrete math course.
my matlab labs were typically algorithm implementations, with charts as output, and simple file input.
how hard is the material? image processing is really easy in matlab, can you do some discrete 2D filtering? blurs and stuff. http://homepages.inf.ed.ac.uk/rbf/HIPR2/filtops.htm
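a blur is just a small discrete 2D filter, e.g. a box (mean) filter. rough sketch in Python with plain nested lists, not MATLAB and not taken from the page linked above; the function name and the edge handling (average only over pixels inside the image) are my own choices.

```python
def box_blur(image, radius=1):
    """Replace each pixel by the mean of its (2*radius+1)^2 neighborhood.

    Near the border the window is clipped to the image, so fewer pixels are averaged.
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            window = [
                image[i][j]
                for i in range(max(0, r - radius), min(height, r + radius + 1))
                for j in range(max(0, c - radius), min(width, c + radius + 1))
            ]
            out[r][c] = sum(window) / len(window)
    return out


# Single bright pixel in the middle of a dark 3x3 image.
image = [
    [0, 0, 0],
    [0, 9, 0],
    [0, 0, 0],
]
blurred = box_blur(image)
for row in blurred:
    print(row)
```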