I have a dataset that is represented by this picture.
As you can see, there is a thin strip on top of the rest of the data points. How can I separate the strip from the rest, using cluster analysis or any other technique?
I have tried DBSCAN, KMeans, and Hierarchical Clustering, and all gave similar results, shown by the colors in the graph.
DBSCAN and OPTICS are your best candidates. If the data is not too big, you can also try meanshift. But they will not be able to do it perfectly - some points will be "noise" to them.
It's fairly obvious that k-means and most hierarchical clustering cannot solve this.
Keep minPts small (5 to 10), and focus on choosing epsilon. It must be small enough to not cover the gap. OPTICS will be easier to use, since you only need to give an upper bound on epsilon.
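To make the parameter advice concrete, here is a minimal sketch with scikit-learn. The data is a synthetic stand-in for your plot (a dense bulk with a thin strip above it), and the values of `eps` and `max_eps` are assumptions chosen for this toy data, not recommendations for yours.

```python
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS

rng = np.random.default_rng(0)

# Synthetic stand-in for the plot: a dense bulk and a thin strip above it.
bulk = np.column_stack([rng.uniform(0, 10, 400), rng.uniform(0, 4, 400)])
strip = np.column_stack([rng.uniform(0, 10, 100), rng.uniform(6.0, 6.2, 100)])
X = np.vstack([bulk, strip])

# The gap between bulk and strip is about 2 units, so epsilon stays well below it.
eps = 0.6
db_labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)

# OPTICS only needs an upper bound (max_eps) to compute the cluster ordering;
# here a DBSCAN-style cut at eps is then extracted from that ordering.
optics = OPTICS(min_samples=5, max_eps=2.0, cluster_method="dbscan", eps=eps)
op_labels = optics.fit_predict(X)

def cluster_ids(labels):
    """Return the set of cluster ids, ignoring noise (-1)."""
    return set(labels.tolist()) - {-1}

print("DBSCAN bulk:", cluster_ids(db_labels[:400]), "strip:", cluster_ids(db_labels[400:]))
print("OPTICS bulk:", cluster_ids(op_labels[:400]), "strip:", cluster_ids(op_labels[400:]))
```

Because epsilon is smaller than the gap, no cluster can contain points from both the bulk and the strip. The strip itself may still come out as more than one cluster, and some points may be labelled noise.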
Consider manually specifying a model. Tweaking parameters until you get your desired result is not any better. Draw a line on your plot with a ruler, turn that into a linear model by reading off the parameters...
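A hand-specified model of this kind is only a few lines. The slope and intercept below are hypothetical values standing in for whatever you read off your own plot.

```python
import numpy as np

# Hypothetical line read off the plot: y = slope * x + intercept.
slope, intercept = 0.0, 5.0

def above_line(points, slope, intercept):
    """Return a boolean mask: True where a point lies above the line."""
    points = np.asarray(points, dtype=float)
    return points[:, 1] > slope * points[:, 0] + intercept

pts = np.array([[1.0, 2.0], [3.0, 6.1], [8.0, 3.5], [9.0, 6.0]])
mask = above_line(pts, slope, intercept)
print(mask)  # True marks the points assigned to the strip
```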
I have a series (let's say 1000) of images of a biological sample...living cells. Over this series, the data for each pixel will describe a time variant "wave", if you will, giving the measure of light intensity vs time. After performing an FFT for this wave, I'll have the frequency content and phase for each pixel.
My goal is to be able to find all the pixels that are measuring a single cell, and was wondering if some sort of clustering technique would give me what I'm looking for. After some research (I know almost nothing of cluster analysis) looking at KMeans, DBSCAN, and a few others, I'm unsure how to proceed.
Here are my criteria:
a cluster should consist of connected pixels, with a maximum size of around 9-12 pixels (this is defined by the actual size of the cell in the field of view). Putting more pixels in a cluster likely means that the cluster contains more than one cell, and I'd prefer each cluster to represent a single cell.
the cells are signalling (glowing) with some frequency/phase. These are not necessarily in sync, so I think that this might be useful in segregating the cells/clusters.
there is an unknown number of cells in each image, so an unknown number of clusters.
the images are segmented into smaller, sub-images for analysis (the reason for this is not relevant here). These sub-images are to be analyzed separately for clusters. The sub-images are about 100 x 100 pixels.
Any suggestions would be greatly appreciated. I'm just looking for help getting pointed in the right direction.
Probably the most flexible is the classic old hierarchical agglomerative clustering (HAC). For some reason, people always overlook this powerful method, and prefer the much more limited kmeans.
HAC is very nice to parameterize. It needs a distance or similarity (few requirements here: it should probably be symmetric, but the triangle inequality is not necessary). With the linkage you can control the cluster shape or diameter. For example, with complete linkage you can control the maximum diameter of a cluster. This is probably useful here, and it is my suggestion.
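As a rough sketch of the complete-linkage idea, using SciPy: the pixel coordinates and the `max_diameter` value are made up for illustration, and the distance here is purely spatial. A distance that also mixes in frequency or phase differences could be passed as a precomputed condensed matrix instead.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
# Hypothetical (row, col) coordinates of "active" pixels in a 100 x 100 sub-image.
pixels = rng.integers(0, 100, size=(300, 2)).astype(float)

max_diameter = 4.0  # assumed upper bound on the extent of one cell, in pixels

# Complete linkage merges by the largest pairwise distance between clusters,
# so cutting the dendrogram at max_diameter bounds every cluster's diameter.
Z = linkage(pixels, method="complete", metric="euclidean")
labels = fcluster(Z, t=max_diameter, criterion="distance")

# Check the bound: largest pairwise distance inside each cluster.
D = squareform(pdist(pixels))
diameters = [D[np.ix_(labels == c, labels == c)].max() for c in np.unique(labels)]
print(len(diameters), "clusters, largest diameter:", max(diameters))
```

The number of clusters is not specified anywhere; it falls out of the diameter threshold.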
The main drawbacks of HAC are (1) scalability: at 50,000 instances it will be slow and use too much memory, and (2) you need to know what you want to do: you need to choose a distance, a linkage, and where to cut the dendrogram. With k-means, you only need to choose k to get a (bad) result.
DBSCAN is a great algorithm, but in your case it is likely to form clusters with multiple cells. So I'd rather try OPTICS instead which may be able to discover substructures where DBSCAN only sees a large blob.
What kind of knowledge/ inference can be made from k means clustering analysis of KDDcup99 dataset?
We plotted some graphs using MATLAB; they look like this:
Experiment 1: Plot of dst_host_count vs serror_rate
Experiment 2: Plot of srv_count vs srv_serror_rate
Experiment 3: Plot of count vs serror_rate
I just extracted some features from the KDD Cup data set and plotted them.
The main problem I am facing is that, due to a lack of domain knowledge, I can't determine what inference can be drawn from these graphs. Another is that, if I have chosen the wrong axes, which features should I have chosen instead?
I have very little time to complete this, so I don't understand the background very well.
Any help interpreting these graphs would be appreciated.
What kind of unsupervised learning can be made using this data and plots?
Just to give you some domain knowledge: the KDD cup data set contains information about different aspects of network connections. Each sample contains 'connection duration', 'protocol used', 'source/destination byte size' and many other features that describe one connection. Now, some of these connections are malicious. The malicious samples have their unique 'fingerprint' (unique combination of different feature values) that separates them from good ones.
What kind of knowledge/ inference can be made from k means clustering analysis of KDDcup99 dataset?
You can try k-means clustering to initially cluster the normal and bad connections. Also, the bad connections fall into 4 main categories themselves. So, you can try k = 5, where one cluster will capture the good ones and the other 4 will capture the 4 malicious categories. Look at the first section of the tasks page for details.
You can also check if some dimensions in your data set are highly correlated. If so, then you can use something like PCA to reduce the number of dimensions. Look at the full list of features. After PCA, your data will have a simpler representation (with fewer dimensions) and might give better performance.
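A minimal sketch of that pipeline with scikit-learn follows. It runs on synthetic numbers, not on the real KDD data, and the 95% variance cut-off is an assumption. Categorical columns such as the protocol would need encoding before any of this applies.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for numeric connection features (NOT the real KDD data):
# 5 groups, 10 features, with the last 5 features near-copies of the first 5.
centers = rng.normal(0, 5, size=(5, 5))
base = np.vstack([c + rng.normal(0, 1, size=(200, 5)) for c in centers])
X = np.hstack([base, base + rng.normal(0, 0.05, size=base.shape)])

# Put all features on the same scale before PCA and k-means.
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain 95% of the variance (assumed cut-off).
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

# k = 5: one cluster for normal traffic plus four attack categories.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_reduced)
print("dimensions kept:", X_reduced.shape[1], "cluster sizes:", np.bincount(labels))
```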
What should be the correct chosen feature?
This is hard to tell. The data is currently very high dimensional, so I don't think visualizing 2 or 3 of the dimensions in a graph will give you a good heuristic for which dimensions to choose. I would suggest the following:
Use all the dimensions for training and testing the model. This will give you a measure of the best performance.
Then try removing one dimension at a time to see how much the performance is affected. For example, you remove the dimension 'srv_serror_rate' from your data and the model performance comes out to be almost the same. Then you know this dimension is not giving you any important info about the problem at hand.
Repeat step two until you can't find any dimension that can be removed without hurting performance.
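The three steps above can be sketched as a simple backward-elimination loop. This sketch assumes you have labels and some classifier to measure "performance"; the logistic regression, the synthetic data, and the `tolerance` value are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 400
# Synthetic stand-in: features 0 and 1 carry the signal, features 2-4 are noise.
X = rng.normal(size=(n, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def score(features):
    """Mean 5-fold cross-validated accuracy using only the given feature columns."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, features], y, cv=5).mean()

kept = list(range(X.shape[1]))
baseline = score(kept)   # step 1: performance with all dimensions
tolerance = 0.01         # assumed acceptable loss in accuracy

removed_something = True
while removed_something and len(kept) > 1:   # step 3: repeat until nothing can go
    removed_something = False
    for f in list(kept):                     # step 2: try dropping one dimension
        candidate = [k for k in kept if k != f]
        if score(candidate) >= baseline - tolerance:
            kept = candidate
            removed_something = True
            break

print("baseline accuracy:", round(baseline, 3), "kept features:", kept)
```

On this toy data the two informative features survive, because dropping either one costs far more accuracy than the tolerance allows.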
I have a 3d box with some points in it (1800).
Like this:
Now I have to cluster these points, and it can't be done with k-means because you don't know the number of clusters. Another problem is that the box is periodic, so points at opposite sides (for example the top and bottom) can belong to each other. Like in this image:
The right and left belong to each other.
How can I define these clusters with a specific distance as threshold, and account for the box being periodic (so when you are at the end of one axis, look at the beginning to see whether the distances are below the threshold)?
Kind regards,
Glenn
The Wikipedia article on cluster analysis will answer your question.
Look for density based clustering algorithms, as your data looks very much like the design scenario of density based clustering to me.
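One way to combine a density-based method with the periodic box is to precompute distances under the minimum-image convention and hand them to DBSCAN. This is a sketch under assumptions: the box lengths, the threshold, and the three example points are made up, and coordinates are assumed to lie in [0, box) on every axis.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def periodic_distance_matrix(points, box):
    """Pairwise Euclidean distances under periodic boundaries (minimum image)."""
    points = np.asarray(points, dtype=float)
    box = np.asarray(box, dtype=float)
    delta = np.abs(points[:, None, :] - points[None, :, :])
    delta = np.minimum(delta, box - delta)  # take the shorter way around each axis
    return np.sqrt((delta ** 2).sum(axis=-1))

box = np.array([10.0, 10.0, 10.0])  # assumed box edge lengths
points = np.array([
    [0.2, 5.0, 5.0],  # near the left face
    [9.9, 5.0, 5.0],  # near the right face, 0.3 away through the boundary
    [5.0, 5.0, 5.0],  # isolated point in the middle
])

threshold = 0.5
D = periodic_distance_matrix(points, box)
# min_samples=1 makes every point a core point, so clusters are simply the
# groups of points connected by chains of steps no longer than the threshold.
labels = DBSCAN(eps=threshold, min_samples=1, metric="precomputed").fit_predict(D)
print(labels)
```

The full distance matrix is quadratic in the number of points, which is still manageable for 1800 points.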
Well, first things first, you can indeed use K-Means. Of course you will need to use a cluster validity index (google Silhouette width index, Calinski-Harabasz index, Dunn's index, etc.).
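For instance, a minimal sketch of choosing k by silhouette width with scikit-learn; the three synthetic blobs and the range of k are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in: three well-separated groups of points.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(0, 0.5, size=(100, 2)) for c in centers])

# Run k-means for a range of k and keep the silhouette width of each result.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("best k by silhouette width:", best_k)
```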
If you really don't want to use K-Means for some other reason, you may wish to use a hierarchical clustering algorithm such as the Ward Method (description in Wikipedia). You won't need to know the number of clusters a priori (however, can you truly claim that you are creating a taxonomy without being able to answer the most basic of questions: how many taxons are there?).
The fact that your box is periodic raises an interesting challenge. My first thought here is that the best way to approach the problem is not by changing the distance measure (which you could do), but by transforming the data (feature extraction).
Your box has 6 sides, but because it's periodic it's as if it had 3 sides. So, the left side and right side are "the same" (as are the top and bottom, and the front and back).
How about redefining each object over three features? Each feature is the distance between the object and one of the "three" sides.
Best of luck!
I've been looking around scipy and sklearn for clustering algorithms for a particular problem I have. I need some way of characterizing a population of N particles into k groups, where k is not necessarily known, and in addition to this, no a priori linking lengths are known (similar to this question).
I've tried kmeans, which works well if you know how many clusters you want. I've tried dbscan, which does poorly unless you tell it a characteristic length scale on which to stop looking (or start looking) for clusters. The problem is, I have potentially thousands of these clusters of particles, and I cannot spend the time to tell kmeans/dbscan algorithms what they should go off of.
Here is an example of what dbscan find:
You can see that there really are two separate populations here, but however I adjust the epsilon factor (the maximum distance between neighboring points), I simply cannot get it to see those two populations of particles.
Are there any other algorithms which would work here? I'm looking for minimal information upfront - in other words, I'd like the algorithm to be able to make "smart" decisions about what could constitute a separate cluster.
I've found one that requires NO a priori information/guesses and does very well for what I'm asking it to do. It's called Mean Shift and is located in SciKit-Learn. It's also relatively quick (compared to other algorithms like Affinity Propagation).
Here's an example of what it gives:
I also want to point out that the documentation states that it may not scale well.
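For reference, a minimal sketch of how this looks in scikit-learn. Mean shift does have a bandwidth parameter, but it can be estimated from the data; the `quantile` value and the synthetic populations below are assumptions.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(0)
# Synthetic stand-in for two particle populations.
group_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(150, 2))
group_b = rng.normal(loc=[20.0, 20.0], scale=1.0, size=(150, 2))
X = np.vstack([group_a, group_b])

# The bandwidth is estimated from the data instead of being supplied by hand.
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
labels = ms.fit_predict(X)
print("estimated bandwidth:", round(bandwidth, 2), "clusters found:", len(ms.cluster_centers_))
```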
When using DBSCAN it can be helpful to scale/normalize the data or distances beforehand, so that the estimate of epsilon is relative.
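A sketch of what that can look like with scikit-learn. The scaling step follows the advice above; the k-distance percentile used to suggest a starting epsilon is an assumed heuristic of mine, not the estimator function mentioned in this answer.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Synthetic stand-in: two features on very different scales.
X = np.column_stack([rng.normal(0, 1, 300), rng.normal(0, 1000, 300)])

# After standardisation each feature has zero mean and unit variance,
# so one epsilon means the same thing along every axis.
X_scaled = StandardScaler().fit_transform(X)

min_pts = 5
# Distance from each point to its min_pts-th neighbour (the query point itself
# counts as the first neighbour because it is part of the fitted data).
nn = NearestNeighbors(n_neighbors=min_pts).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_dist = np.sort(distances[:, -1])

# Assumed heuristic: take a high percentile of the sorted k-distances as a
# starting value for epsilon; looking for the "knee" of k_dist by eye also works.
eps_guess = float(np.percentile(k_dist, 90))
print("suggested starting epsilon:", round(eps_guess, 3))
```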
There is an implementation of DBSCAN (I think it's the one Anony-Mousse somewhere described as 'floating around') that comes with an epsilon estimator function. It works, as long as it's not fed large datasets.
There are several incomplete versions of OPTICS on GitHub. Maybe you can find one to adapt for your purpose. I am still trying to figure out for myself what effect minPts has when using one and the same extraction method.
You can try a minimum spanning tree (Zahn's algorithm) and then remove the longest edge, similar to alpha shapes. I used it with a Delaunay triangulation and a concave hull: http://www.phpdevpad.de/geofence. You can also try hierarchical clustering, for example with clusterfck.
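A small sketch of the spanning-tree idea using SciPy (not the PHP code linked above). The two synthetic groups are an assumption, and only a single edge is removed here.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
# Synthetic stand-in: two separate populations.
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[10.0, 0.0], scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

# Minimum spanning tree over the complete Euclidean distance graph.
mst = minimum_spanning_tree(squareform(pdist(X))).tocoo()

# Drop the single longest edge; repeating this k-1 times would give k clusters.
keep = np.ones(mst.nnz, dtype=bool)
keep[np.argmax(mst.data)] = False
pruned = csr_matrix((mst.data[keep], (mst.row[keep], mst.col[keep])), shape=mst.shape)

# The remaining connected components are the clusters.
n_clusters, labels = connected_components(pruned, directed=False)
print("clusters after removing the longest edge:", n_clusters)
```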
Your plot indicates that you chose the minPts parameter way too small.
Have a look at OPTICS, which no longer needs the epsilon parameter of DBSCAN.
I have started using ELKI for data analysis, but one seemingly simple thing I cannot seem to do is output the calculated convex hull of clusters to a file after running DBSCAN. I am able to visualize the convex hulls via the visualization gui, but I cannot generate the KML file. I am also able to write my clustering results to a folder (using the ResultWriter resulthandler), but no file is generated when I set the KMLOutputHandler. I receive no error message in the log window (even with verbose parameter set to true).
Is there a trick to generating a KML file in ELKI? Could anyone walk through the steps of doing this?
Any help would be appreciated.
(as an aside, is it possible to generate alpha shapes for DBSCAN results with ELKI? If so, which parameter must be adjusted?)
So that is actually a lot of questions in one...
Convex hulls: they are used in ELKI for visualization, but not considered part of the output result, so they are not saved to file. A trick you could employ is to save the visualization as SVG and extract them from this file, but they will then be in a different coordinate system.
One of the reasons for this is that the convex hulls are only implemented for 2D Euclidean space - I figure you want to use them for spatial data, where the result may not be the correct convex hull because of the curvature of the earth's surface. Furthermore, many data sets will be of higher dimensionality.
However, you can of course look at the source code and invoke the convex hull algorithm, then write the result to your favorite output format. In general, just as you will need to spend time on preprocessing, you will also need to customize the output.
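If you would rather post-process exported cluster labels outside ELKI, the same idea can be sketched in Python with SciPy. This is not ELKI code: `hulls_to_kml` is a hypothetical helper, the hull is the planar 2D hull of (longitude, latitude) pairs, so the curvature caveat above applies to it as well, and clusters whose points are all collinear would make `ConvexHull` raise an error.

```python
import numpy as np
from scipy.spatial import ConvexHull
from xml.sax.saxutils import escape

def hulls_to_kml(points, labels):
    """Build a KML string with one convex-hull polygon per cluster.

    points: array of (longitude, latitude) pairs; labels: cluster id per point,
    with -1 treated as noise and skipped.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    placemarks = []
    for cluster in sorted(set(labels.tolist()) - {-1}):
        member = points[labels == cluster]
        if len(member) < 3:
            continue  # a polygon needs at least three points
        hull = ConvexHull(member)
        ring = member[hull.vertices]
        ring = np.vstack([ring, ring[:1]])  # KML rings must be closed
        coords = " ".join(f"{lon},{lat},0" for lon, lat in ring)
        placemarks.append(
            f"<Placemark><name>{escape(str(cluster))}</name><Polygon><outerBoundaryIs>"
            f"<LinearRing><coordinates>{coords}</coordinates></LinearRing>"
            f"</outerBoundaryIs></Polygon></Placemark>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
        + "".join(placemarks)
        + "</Document></kml>"
    )

# Toy input: one square cluster with an interior point, plus one noise point.
pts = np.array([[0, 0], [1, 0], [1, 1], [0, 1], [0.5, 0.5], [5, 5]])
kml = hulls_to_kml(pts, [0, 0, 0, 0, 0, -1])
print(kml)
```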
Which brings me to the second question. The KMLResultHandler is closely tied to the publication of ELKI 0.4.0: Spatial Outlier Detection: Data, Algorithms, Visualizations.
That title pretty much summarizes what this class does: visualize spatial outlier detection. It does not (yet) include code to visualize clusters of spatial data, for example. Unfortunately, to get any output from the class you need to satisfy a number of restrictions. Essentially, if it finds a Polygon relation and an OutlierResult that it can map to each other, it will output this to KML.
It is not yet a class that could write arbitrary results to KML. It probably needs a lot more documentation, too. Contributions of a more general output tool would be appreciated, but a customizable, automatic, general output to KML is really hard to do. In particular, you may also end up having to include projection capabilities if someone is not processing latitude-longitude data, but e.g. UTM projected data.
As such, I recommend looking at the source code of the class and customizing it to your needs. In my opinion, visualization to KML will always require a lot of customization.
To generate alpha shapes, you just need to set the -hull.alpha parameter to the desired alpha value. This gives only the hull, not the extended alpha shape: the optimal visualization of DBSCAN would likely consist of the alpha shape of the core points only, extended by a radius of epsilon, which should then include the border points. That is on the wish list, but not implemented. Note that the alpha shape is computed in the visualization projection, not on the raw data, so if the axes are scaled differently, alpha shapes will look different. Again, you may be interested in using the class AlphaShape on the raw data vectors instead of exporting the projected visualization. Then you can easily write the resulting Polygons to your custom visualization.
If you implement such a KML visualization using alpha shapes (or convex hulls) for clusters, I would appreciate if you could contribute this to ELKI to make it available for others as well. Thank you.