ELKI DBSCAN examples

I want to use the DBSCAN algorithm in ELKI (not through the GUI). Could you give me some examples that cover loading data, running the algorithm, and displaying the results? The ELKI docs don't have any examples.

Use the example at
https://elki-project.github.io/howto/java_api#PureJavaAPI
Rather than KMeansLloyd, you would use DBSCAN, of course.
This example does not set up an index, so the runtime will be O(n²). For larger data sets, indexes such as the cover tree give considerable performance benefits.
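For illustration, here is a rough sketch of that adaptation: the linked pure-Java-API example with DBSCAN substituted for KMeansLloyd. The class and package names are from ELKI 0.7.x and may differ in other releases, and the epsilon/minPts values are arbitrary, so treat this as a starting point rather than copy-paste code.
import de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN;
import de.lmu.ifi.dbs.elki.data.Cluster;
import de.lmu.ifi.dbs.elki.data.Clustering;
import de.lmu.ifi.dbs.elki.data.NumberVector;
import de.lmu.ifi.dbs.elki.data.model.Model;
import de.lmu.ifi.dbs.elki.database.Database;
import de.lmu.ifi.dbs.elki.database.StaticArrayDatabase;
import de.lmu.ifi.dbs.elki.datasource.ArrayAdapterDatabaseConnection;
import de.lmu.ifi.dbs.elki.datasource.DatabaseConnection;
import de.lmu.ifi.dbs.elki.distance.distancefunction.minkowski.EuclideanDistanceFunction;

public class DBSCANExample {
  public static void main(String[] args) {
    // Your data as a plain double array: one row per point.
    double[][] data = { { 1.0, 1.1 }, { 1.2, 0.9 }, { 8.0, 8.1 }, { 8.2, 7.9 }, { 50.0, 50.0 } };

    // Load the array into a database (no indexes, so O(n^2) neighbor search).
    DatabaseConnection dbc = new ArrayAdapterDatabaseConnection(data);
    Database db = new StaticArrayDatabase(dbc, null);
    db.initialize();

    // DBSCAN with Euclidean distance; epsilon and minPts are arbitrary here.
    DBSCAN<NumberVector> dbscan = new DBSCAN<>(EuclideanDistanceFunction.STATIC, 1.0, 2);
    Clustering<Model> result = dbscan.run(db);

    // Display the result: DBSCAN reports noise points as a separate "cluster".
    int i = 0;
    for (Cluster<Model> clu : result.getAllClusters()) {
      System.out.println("#" + i++ + " " + (clu.isNoise() ? "noise" : "cluster") + ", size " + clu.size());
    }
  }
}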
Setting up indexes and other options is often easier with the parameterization API.
Alternatively, you could drive the whole algorithm through the parameterization API, as done in the unit test:
https://github.com/elki-project/elki/blob/master/elki-clustering/src/test/java/de/lmu/ifi/dbs/elki/algorithm/clustering/DBSCANTest.java
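For the parameterization route, a minimal fragment (reusing the database db and the imports from the sketch above) might look roughly like the following. The option IDs live in DBSCAN.Parameterizer in 0.7.x, but check the linked test for the exact form used by your ELKI version.
import de.lmu.ifi.dbs.elki.utilities.ClassGenericsUtil;
import de.lmu.ifi.dbs.elki.utilities.optionhandling.parameterization.ListParameterization;

ListParameterization params = new ListParameterization();
params.addParameter(DBSCAN.Parameterizer.EPSILON_ID, 1.0); // neighborhood radius
params.addParameter(DBSCAN.Parameterizer.MINPTS_ID, 2);    // minimum neighbors for a core point
DBSCAN<NumberVector> dbscan = ClassGenericsUtil.parameterizeOrAbort(DBSCAN.class, params);
Clustering<Model> result = dbscan.run(db);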

Related

Feature selection for one class classification

I am trying to apply a one-class SVM, but my dataset contains too many features and I believe feature selection would improve my metrics. Are there any feature selection methods that do not need the class label?
If so, and you are aware of an existing implementation, please let me know.
You'd probably get better answers asking this on Cross Validated, but since you ask for implementations I will answer your question here.
Unsupervised methods exist that allow you to eliminate features without looking at the target variable. This is called unsupervised data (dimensionality) reduction. They work by looking for features that convey similar information and then either eliminating some of those features or combining them into fewer features while retaining as much information as possible.
Some examples of data reduction techniques include PCA, redundancy analysis, variable clustering, and random projections, amongst others.
You don't mention which language you're working in, but I am going to presume it's Python. sklearn has implementations of PCA and SparseRandomProjection. I know there is a module designed for variable clustering in Python, but I have not used it and don't know how convenient it is. I don't know whether there is an unsupervised implementation of redundancy analysis in Python, but you could consider making your own; depending on what you decide to do, it might not be too tricky (especially if you just do it correlation-based).
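Here is a minimal sketch of that correlation-based idea (written as plain Java just for concreteness; with numpy/pandas it is only a few lines): for every pair of features whose absolute Pearson correlation exceeds a threshold, drop one of the two. The threshold and the choice of which feature to drop are up to you.
import java.util.ArrayList;
import java.util.List;

class CorrelationFilter {
  /** Returns the indices of the features to keep; data is [samples][features]. */
  static List<Integer> selectFeatures(double[][] data, double threshold) {
    int d = data[0].length;
    boolean[] dropped = new boolean[d];
    for (int i = 0; i < d; i++) {
      if (dropped[i]) continue;
      for (int j = i + 1; j < d; j++) {
        if (!dropped[j] && Math.abs(pearson(data, i, j)) > threshold) {
          dropped[j] = true; // keep feature i, drop the redundant feature j
        }
      }
    }
    List<Integer> keep = new ArrayList<>();
    for (int i = 0; i < d; i++) if (!dropped[i]) keep.add(i);
    return keep;
  }

  /** Pearson correlation between feature columns a and b. */
  static double pearson(double[][] data, int a, int b) {
    int n = data.length;
    double meanA = 0, meanB = 0;
    for (double[] row : data) { meanA += row[a]; meanB += row[b]; }
    meanA /= n; meanB /= n;
    double cov = 0, varA = 0, varB = 0;
    for (double[] row : data) {
      double da = row[a] - meanA, db = row[b] - meanB;
      cov += da * db; varA += da * da; varB += db * db;
    }
    return cov / Math.sqrt(varA * varB);
  }
}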
In case you're working in R, finding versions of data reduction using PCA will be no problem. For variable clustering and redundancy analysis, great packages like Hmisc and ClustOfVar exist.
You can also read about other unsupervised data reduction techniques; you might find other methods more suitable.

How to cluster data using self-organising maps?

Suppose that we train a self-organising map (SOM) with a given dataset. Would it make sense to cluster the neurons of the SOM instead of the original datapoints? This question occurred to me after reading this paper, in which the following is stated:
The most important benefit of this procedure is that computational load decreases considerably, making it possible to cluster large data sets and to consider several different preprocessing strategies in a limited time. Naturally, the approach is valid only if the clusters found using the SOM are similar to those of the original data.
In this answer it is clearly stated that SOMs don't include clustering, but some clustering procedure can be applied to the SOM after it has been trained. I took this to mean that the clustering is done on the neurons of the SOM, which are in some sense a mapping of the original data, but I'm not sure about this. So, what I want to know is:
Is it correct to cluster the data by running the clustering algorithm on the trained neuron weights as data points? If not, how is clustering done using a SOM?
What characteristics should a dataset have, in general, for this approach to be useful?
Yes, the usual approach seems to be to run either hierarchical clustering or k-means on the neurons (you'll need to dig up how it was originally done; as seen in the paper you linked, many variants, including two-level approaches, have been explored since). If you consider SOMs to be a quantization and projection technique, all of these approaches are valid to use.
It's cheaper because the SOM output is just two-dimensional, Euclidean, and contains far fewer points, so that is well in line with the source that you have.
Note that a SOM neuron may be empty if it lies in between two extremely well-separated clusters.
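To make the two-level idea concrete, here is a minimal, self-contained sketch (plain Java, no particular SOM library assumed): run ordinary k-means on the trained neuron weight vectors, then label each original point with the cluster of its best-matching unit (BMU). somWeights is assumed to be an m x d array of trained prototype vectors; the k-means here is deliberately naive.
import java.util.Random;

class SomTwoLevel {
  /** Lloyd-style k-means on the given points; returns the cluster index of each point. */
  static int[] kmeans(double[][] points, int k, int iters, long seed) {
    Random rnd = new Random(seed);
    double[][] centers = new double[k][];
    for (int c = 0; c < k; c++) centers[c] = points[rnd.nextInt(points.length)].clone();
    int[] assign = new int[points.length];
    for (int it = 0; it < iters; it++) {
      for (int p = 0; p < points.length; p++) assign[p] = nearest(points[p], centers);
      double[][] sums = new double[k][points[0].length];
      int[] counts = new int[k];
      for (int p = 0; p < points.length; p++) {
        counts[assign[p]]++;
        for (int j = 0; j < points[p].length; j++) sums[assign[p]][j] += points[p][j];
      }
      for (int c = 0; c < k; c++)
        if (counts[c] > 0)
          for (int j = 0; j < sums[c].length; j++) centers[c][j] = sums[c][j] / counts[c];
    }
    return assign;
  }

  /** Index of the closest center (squared Euclidean distance). */
  static int nearest(double[] x, double[][] centers) {
    int best = 0;
    double bestD = Double.MAX_VALUE;
    for (int c = 0; c < centers.length; c++) {
      double d = 0;
      for (int j = 0; j < x.length; j++) { double diff = x[j] - centers[c][j]; d += diff * diff; }
      if (d < bestD) { bestD = d; best = c; }
    }
    return best;
  }

  /** Cluster the neurons, then label every original point via its BMU. */
  static int[] clusterViaSom(double[][] data, double[][] somWeights, int k) {
    int[] neuronCluster = kmeans(somWeights, k, 50, 42);
    int[] labels = new int[data.length];
    for (int i = 0; i < data.length; i++) labels[i] = neuronCluster[nearest(data[i], somWeights)];
    return labels;
  }
}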

Clustering classifier and clustering policy

I was going through the k-means algorithm in Mahout and, while debugging, I noticed that when creating the first clusters it runs the following code:
ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);
ClusterClassifier prior = new ClusterClassifier(clusters, policy);
prior.writeToSeqFiles(priorClustersPath);
I read the descriptions of these classes, but they were not clear to me.
What is the meaning of this cluster classifier and policy?
Are they related to hierarchical clustering, centroid-based clustering, distribution-based clustering, etc.?
I ask because I do not understand the benefit of, or reason for, using a cluster classifier and policy in the Mahout k-means implementation.
The implementation shares code with other variants of k-means and with similar algorithms such as Canopy pre-clustering and GMM.
These classes encode only the differences between these algorithms.
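To illustrate the idea only (this is a simplified, hypothetical sketch, not Mahout's actual API): the policy is essentially a strategy object supplying the few algorithm-specific steps, such as how points are assigned to clusters and when iteration may stop, while the ClusterClassifier holds the cluster models and delegates those decisions to the policy.
import java.util.List;

// Hypothetical sketch of the policy/classifier split.
interface ClusteringPolicySketch {
  /** Turn per-cluster distances (or likelihoods) into assignment weights, hard or soft. */
  double[] select(double[] distances);
  /** Decide whether training has converged. */
  boolean converged(double maxCenterShift);
}

class KMeansPolicySketch implements ClusteringPolicySketch {
  private final double convergenceDelta;
  KMeansPolicySketch(double convergenceDelta) { this.convergenceDelta = convergenceDelta; }

  @Override public double[] select(double[] distances) {
    // k-means: hard assignment, weight 1 for the closest cluster and 0 elsewhere.
    // A GMM-style policy would instead return soft responsibilities here.
    double[] w = new double[distances.length];
    int best = 0;
    for (int i = 1; i < distances.length; i++) if (distances[i] < distances[best]) best = i;
    w[best] = 1.0;
    return w;
  }

  @Override public boolean converged(double maxCenterShift) {
    return maxCenterShift < convergenceDelta;
  }
}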
Mahout is not a good place to study the k-means algorithm; the implementation is quite a mess. It's also slow, as in really, really slow. Most of the time, a single-CPU implementation will outright beat Mahout on anything that fits into memory, and maybe even on data on the disk of a single machine, because of all the MapReduce overhead.

hierarchical k-means clustering for SIFT vectors

All
I want to apply the same approach as David Nister and Henrik Stewenius in http://www.wisdom.weizmann.ac.il/~bagon/CVspring07/files/scalable.pdf
In this paper, they use a high number of SIFT vectors (128-D) as input to a hierarchical k-means clustering to construct a hierarchical visual vocabulary tree.
Does anyone know a good library that I can use to do this clustering?
PS: the number of input SIFT descriptors is high (70,000,000), and I want the result to be a vocabulary tree with 1,000,000 leaf nodes.
Thanks very much.
Regards.
The ClusterQuantiser tool in OpenIMAJ should be able to do this if the data is in a supported format. If the tool can't work with your data out of the box, then you could write a driver for the org.openimaj.ml.clustering.kmeans.HierarchicalByteKMeans class (in the svn trunk version) or the org.openimaj.ml.clustering.kmeans.HByteKMeans class in the 1.0.5 release. Both versions of the class support streaming data from disk, so you don't need to hold all the features in memory!
For completeness, vlfeat also has a hierarchical k-means implementation, but I'm not sure how much it scales.
From practical experience, you might also consider sampling the features before clustering. I'm not sure that you'll get much benefit from clustering them all.
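For intuition, the vocabulary-tree construction itself is conceptually simple: run k-means with a small branching factor, partition the descriptors among the resulting centers, and recurse until the desired depth. Below is a rough, generic sketch of that recursion; flatKMeans is a placeholder for whatever flat k-means implementation you plug in (OpenIMAJ, vlfeat, or your own), not a real API of either library.
import java.util.ArrayList;
import java.util.List;

class VocabNode {
  double[][] centers;   // k cluster centers at this node
  VocabNode[] children; // null for leaf nodes
}

class VocabTreeBuilder {
  /** Recursively build a Nister/Stewenius-style vocabulary tree. */
  static VocabNode build(List<double[]> descriptors, int branching, int depth) {
    VocabNode node = new VocabNode();
    if (depth == 0 || descriptors.size() <= branching) {
      return node; // leaf: descriptors quantize to this node
    }
    node.centers = flatKMeans(descriptors, branching); // placeholder helper, see below
    // Partition the descriptors among the nearest centers, then recurse.
    List<List<double[]>> parts = new ArrayList<>();
    for (int i = 0; i < branching; i++) parts.add(new ArrayList<>());
    for (double[] d : descriptors) parts.get(nearest(d, node.centers)).add(d);
    node.children = new VocabNode[branching];
    for (int i = 0; i < branching; i++) {
      node.children[i] = build(parts.get(i), branching, depth - 1);
    }
    return node;
  }

  /** Index of the closest center (squared Euclidean distance). */
  static int nearest(double[] x, double[][] centers) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int i = 0; i < centers.length; i++) {
      double s = 0;
      for (int j = 0; j < x.length; j++) { double diff = x[j] - centers[i][j]; s += diff * diff; }
      if (s < bestDist) { bestDist = s; best = i; }
    }
    return best;
  }

  /** Placeholder: any flat k-means returning k centers goes here. */
  static double[][] flatKMeans(List<double[]> data, int k) {
    throw new UnsupportedOperationException("plug in a k-means implementation");
  }
}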

Text classification, preprocessing included

Which is the best method for document classification if time is not a factor and we don't know how many classes there are?
To the best of my (incomplete) knowledge, Hierarchical Agglomerative Clustering is the best approach if you don't know how many classes there are. All of the other clustering algorithms either require prior knowledge of the number of buckets or some sort of cross-validation or other experimentation to determine a sensible number of buckets.
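To make the "no preset number of clusters" point concrete, here is a naive single-linkage sketch: keep merging the two closest clusters until the closest remaining pair is farther apart than a distance threshold, so the threshold (or a dendrogram cut) takes the place of the cluster count. For documents you would typically use cosine distance on tf-idf vectors rather than the Euclidean distance used here, and the brute-force loops are purely illustrative.
import java.util.ArrayList;
import java.util.List;

class NaiveAgglomerative {
  /** Merge clusters bottom-up until the closest pair exceeds the cutoff distance. */
  static List<List<double[]>> cluster(List<double[]> points, double cutoff) {
    List<List<double[]>> clusters = new ArrayList<>();
    for (double[] p : points) {
      List<double[]> c = new ArrayList<>();
      c.add(p);
      clusters.add(c);
    }
    while (clusters.size() > 1) {
      int bestA = -1, bestB = -1;
      double bestDist = Double.MAX_VALUE;
      for (int a = 0; a < clusters.size(); a++) {
        for (int b = a + 1; b < clusters.size(); b++) {
          double d = singleLinkage(clusters.get(a), clusters.get(b));
          if (d < bestDist) { bestDist = d; bestA = a; bestB = b; }
        }
      }
      if (bestDist > cutoff) break; // no sufficiently close pairs left: stop merging
      clusters.get(bestA).addAll(clusters.remove(bestB));
    }
    return clusters;
  }

  /** Single linkage: distance between the closest pair of members. */
  static double singleLinkage(List<double[]> c1, List<double[]> c2) {
    double min = Double.MAX_VALUE;
    for (double[] x : c1)
      for (double[] y : c2)
        min = Math.min(min, euclidean(x, y));
    return min;
  }

  static double euclidean(double[] x, double[] y) {
    double s = 0;
    for (int i = 0; i < x.length; i++) { double d = x[i] - y[i]; s += d * d; }
    return Math.sqrt(s);
  }
}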
A cross link: see how-do-i-determine-k-when-using-k-means-clustering on SO.