Is there a maximum number of noise points/outliers in the DBSCAN algorithm?

I clustered a spatial dataset using the DBSCAN algorithm and got a lot of noise: 193,000 of 250,000 points. Is that a reasonable amount?

Depends on your data and problem.
If I generate random coordinates, 100% noise would be appropriate because the data is random noise.

First, to address the question in your title: by making eps
very large, it is easy to get no noise points, with all the points
in one big cluster. By making eps very small, you can easily
make all points noise points. In general, somewhere in between
is what you are looking for. Your job is to find a value that produces
a meaningful clustering. That is where the remark of
@Anony-Mousse comes into play:
"Depends on your data and problem."
As he suggested, if you have uniform random data, maybe all
noise is the best answer. If you have Gaussian random data,
maybe one big cluster with a few outliers is good. But this is
supposed to help you understand the structure of your data.
What happens as you change eps? From your current clustering
with many noise points, what happens as you gradually increase eps?
Does it gradually add a few noise points into the existing clusters?
Is there some place where two clusters get merged into one? Is there
some place where the number of clusters changes suddenly?
Also, can you interpret the clusters in terms of your variables?
Perhaps the difference between two clusters is that in one all the
values of some variable are low and in another cluster they are high.
Considering whatever problem you are trying to solve,
do the clusters divide the data into meaningful groups? Try to use
the clusterings to find meaning in your data.
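The eps sweep described above can be tried on a small sample before running it on the full data. The following is a minimal sketch, not a tuned implementation: `dbscan` is a naive O(n^2) version written for illustration, and the toy data, the eps grid and `min_pts=5` are assumptions that should be replaced by your own values.

```python
import math
import random

def dbscan(points, eps, min_pts):
    """Naive O(n^2) DBSCAN; returns one label per point, -1 meaning noise."""
    n = len(points)
    # eps-neighborhood of every point (a point counts as its own neighbor).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None = not visited yet
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels

def summarize(labels):
    """Return (number of clusters, number of noise points)."""
    n_noise = sum(1 for label in labels if label == -1)
    n_clusters = len({label for label in labels if label != -1})
    return n_clusters, n_noise

# Hypothetical toy data: two dense blobs plus uniform background points.
random.seed(0)
blob_a = [(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(60)]
blob_b = [(random.gauss(4, 0.3), random.gauss(4, 0.3)) for _ in range(60)]
background = [(random.uniform(-3, 7), random.uniform(-3, 7)) for _ in range(40)]
points = blob_a + blob_b + background

for eps in (0.05, 0.2, 0.5, 1.0, 2.0, 20.0):
    n_clusters, n_noise = summarize(dbscan(points, eps, min_pts=5))
    print(f"eps={eps:<5} clusters={n_clusters:<3} noise={n_noise}")
```

Watching where the cluster count jumps or where the noise count drops sharply is usually more informative than any single eps value.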

Related

Selecting the K value for Kmeans clustering [duplicate]

I am going to build a K-means clustering model for outlier detection. For that, I need to identify the best number of clusters to select.
For now, I have tried the Elbow Method. I plotted the sum of squared errors against the number of clusters (k), but the resulting graph makes it hard to identify the elbow point.
I need to know why I get a graph like this and how to identify the optimal number of clusters.
K-means is not suitable for outlier detection. This keeps popping up here all the time.
K-means is conceptualized for "pure" data, with no false points. All measurements are supposed to come from the data, and only vary by some Gaussian measurement error. Occasionally this may yield some more extreme values, but even these are real measurements, from the real clusters, and should be explained, not removed.
K-means itself is known not to work well on noisy data, where some data points do not belong to any cluster:
It tends to split large real clusters in two, and then points right in the middle of the real cluster will have a large distance to the k-means centers.
It tends to put outliers into their own clusters (because that reduces SSQ), and then the actual outliers will have a small distance, even 0.
Rather, use an actual outlier detection algorithm such as Local Outlier Factor, kNN, or LoOP, which were designed with noisy data in mind.
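As a rough illustration of the kNN approach mentioned above, the sketch below scores each point by the distance to its k-th nearest neighbor; larger scores mean more outlying. The function name `knn_outlier_scores`, the toy data and `k=3` are assumptions made for this example, and the brute-force search is only practical for small data sets.

```python
import math

def knn_outlier_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical toy data: a tight 5x5 grid plus one far-away point.
points = [(x * 0.1, y * 0.1) for x in range(5) for y in range(5)] + [(5.0, 5.0)]
scores = knn_outlier_scores(points, k=3)

# The point with the largest score is the strongest outlier candidate.
most_outlying = max(range(len(points)), key=scores.__getitem__)
print(points[most_outlying], round(scores[most_outlying], 3))
```

Unlike the distance to a k-means center, this score does not depend on where a cluster happens to be split.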
Remember that the Elbow Method doesn't just 'give' the best value of k, since the best value of k is up to interpretation.
The theory behind the Elbow Method is that we want to minimize some error function (e.g. the sum of squared errors) while also picking a low value of k.
The Elbow Method thus suggests that a good value of k lies at a point on the plot that resembles an elbow. That is, the error is small, and it no longer decreases drastically as k increases further.
In your plot you could argue that both k=3 and k=6 resemble elbows. By picking k=3 you'd have picked a small k, and we see that k=4 and k=5 don't do much better in minimizing the error. The same goes for k=6.
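To make the curve behind the Elbow Method concrete, here is a minimal sketch that computes the sum of squared errors for a range of k using a plain Lloyd's k-means. The helper names, the toy data (three blobs) and the single run per k are assumptions for illustration; in practice one would use several restarts per k, since a single run can land in a poor local optimum.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_sse(points, k, iters=50, seed=0):
    """Run plain Lloyd's k-means once and return the final sum of squared errors."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: group points by their nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group (keep it if empty).
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[c]
            for c, g in enumerate(groups)
        ]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Hypothetical toy data: three Gaussian blobs.
random.seed(1)
true_centers = [(0, 0), (6, 0), (3, 5)]
points = [
    (random.gauss(cx, 0.5), random.gauss(cy, 0.5))
    for cx, cy in true_centers
    for _ in range(40)
]

sse = {k: kmeans_sse(points, k) for k in range(1, 9)}
for k, value in sse.items():
    print(f"k={k}  SSE={value:.1f}")
```

The elbow is then the k after which the printed SSE stops dropping sharply; as noted above, more than one k can look like an elbow.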

Evolving k-modes clusters over time

After clustering a dataset using k-modes, I have to evolve the clusters over time. Is there a way to automatically adjust the centroids as the data points change their property values?
To explain: I am clustering a big dataset with categorical values. However, the categorical values of these data points change over time, so I want to know if there is any way to adjust the K centroids (or even the number K) as the data points slowly change. I can recalculate the distance from each data point to the centroids and move the data point to another cluster, but this would treat the centroids as fixed, and I guess they could also change as the data points change.
Re-clustering is very time-consuming, so these adjustments need to be made more efficiently. I have searched the literature but haven't found any information about it.
Does anybody know whether this is possible, or of any study related to this?
Rather than reclustering from scratch, start from the previously found centers.
They shouldn't change much, so the algorithm should converge quickly.
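A minimal sketch of this warm start, assuming a hand-written k-modes with simple matching dissimilarity (the function names and the toy records are made up for illustration):

```python
from collections import Counter

def mismatch(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def kmodes(rows, init_modes, max_iter=20):
    """k-modes started from the given modes; returns (modes, labels, iterations)."""
    modes = [tuple(m) for m in init_modes]
    labels = None
    for iteration in range(1, max_iter + 1):
        new_labels = [
            min(range(len(modes)), key=lambda c: mismatch(r, modes[c])) for r in rows
        ]
        if new_labels == labels:
            return modes, labels, iteration  # assignments stopped changing
        labels = new_labels
        for c in range(len(modes)):
            members = [r for r, label in zip(rows, labels) if label == c]
            if members:  # keep the old mode if a cluster became empty
                modes[c] = tuple(
                    Counter(col).most_common(1)[0][0] for col in zip(*members)
                )
    return modes, labels, max_iter

# Earlier snapshot of the (hypothetical) data.
old_rows = [("red", "small", "a")] * 5 + [("blue", "large", "b")] * 5
old_modes, _, _ = kmodes(old_rows, init_modes=[old_rows[0], old_rows[-1]])

# Later snapshot: one record changed one attribute.
new_rows = list(old_rows)
new_rows[0] = ("red", "small", "b")

# Warm start: reuse the previous modes instead of a fresh initialization.
new_modes, new_labels, n_iter = kmodes(new_rows, init_modes=old_modes)
print(new_modes, n_iter)
```

If the data drifts only slightly between snapshots, most assignments stay the same and the loop stops after very few passes.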

Looking for a suggested Clustering technique

I have a series (let's say 1000) of images of a biological sample...living cells. Over this series, the data for each pixel will describe a time variant "wave", if you will, giving the measure of light intensity vs time. After performing an FFT for this wave, I'll have the frequency content and phase for each pixel.
My goal is to be able to find all the pixels that are measuring a single cell, and was wondering if some sort of clustering technique would give me what I'm looking for. After some research (I know almost nothing of cluster analysis) looking at KMeans, DBSCAN, and a few others, I'm unsure how to proceed.
Here are my criteria:
a cluster should consist of connected pixels, with a maximum size of
around 9-12 pixels (this is defined by the actual size of the cell in
the field of view). Putting more pixels in a cluster likely means
that the cluster contains more than one cell, and I'd prefer each
cluster to represent a single cell.
the cells are signalling (glowing) with some frequency/phase. These are not necessarily in sync, so I think that this might be useful in segregating the cells/clusters.
there is an unknown number of cells in each image, so an unknown number of clusters.
the images are segmented into smaller, sub-images for analysis (the reason for this is not relevant here). These sub-images are to be analyzed separately for clusters. The sub-images are about 100 x 100 pixels.
Any suggestions would be greatly appreciated. I'm just looking for help getting pointed in the right direction.
Probably the most flexible is the classic old hierarchical agglomerative clustering (HAC). For some reason, people always overlook this powerful method, and prefer the much more limited kmeans.
HAC is very nice to parameterize. It needs a distance or similarity (few requirements here: it should probably be symmetric, but the triangle inequality is not necessary). And with the linkage you can control the cluster shape or diameter nicely. For example, with complete linkage you can control the maximum diameter of a cluster. This is probably useful here, and is my suggestion.
The main drawbacks of HAC are (1) scalability: at 50,000 instances it will be slow and use too much memory, and (2) that you need to know what you want to do: you need to choose a distance and a linkage, and cut the dendrogram. With k-means, you only need to choose k to get a (bad) result.
DBSCAN is a great algorithm, but in your case it is likely to form clusters with multiple cells. So I'd rather try OPTICS instead, which may be able to discover substructures where DBSCAN only sees a large blob.
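To illustrate how complete linkage bounds the cluster diameter, here is a naive sketch that keeps merging the two closest clusters and stops once any further merge would exceed a chosen maximum diameter. It uses plain pixel coordinates and Euclidean distance as a stand-in; for the cell data, the distance would presumably also involve the frequency/phase features, which is not shown here. The function name and the toy pixels are assumptions, and the brute-force search is only meant for small inputs.

```python
import math

def complete_linkage(points, max_diameter):
    """Merge clusters greedily; stop when any merge would exceed max_diameter."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Complete linkage: cluster distance = distance of the farthest pair.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        d, ia, ib = min(
            (linkage(clusters[ia], clusters[ib]), ia, ib)
            for ia in range(len(clusters))
            for ib in range(ia + 1, len(clusters))
        )
        if d > max_diameter:
            break  # equivalent to cutting the dendrogram at this height
        clusters[ia] += clusters[ib]
        del clusters[ib]
    return clusters

# Hypothetical toy data: two 3x3 pixel blocks, 10 pixels apart.
points = [(x, y) for x in range(3) for y in range(3)]
points += [(x + 10, y) for x in range(3) for y in range(3)]

clusters = complete_linkage(points, max_diameter=3.0)
for members in clusters:
    print(sorted(members))
```

Because the merge heights of complete linkage never decrease, no cluster returned here has two members farther apart than `max_diameter`.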

What is the importance of clustering?

During unsupervised learning we do cluster analysis (like K-Means) to bin the data into a number of clusters.
But what is the use of this clustered data in a practical scenario?
I think during clustering we are losing information about the data.
Are there some practical examples where clustering could be beneficial?
The information loss can be intentional. Here are three examples:
PCM signal quantization (Lloyd's k-means publication). You know that a certain number (say 10) of different signals are transmitted, but with distortion. Quantizing removes the distortion and re-extracts the original 10 different signals. Here, you lose the error and keep the signal.
Color quantization (see Wikipedia). To reduce the number of colors in an image, a quite nice method uses k-means (usually in HSV or Lab space). k is the number of desired output colors. Information loss here is intentional, to better compress the image. k-means attempts to find the least-squared-error approximation of the image with just k colors.
When searching motifs in time series, you can also use quantization such as k-means to transform your data into a symbolic representation. The bag-of-visual-words approach that was the state of the art for image recognition prior to deep learning also used this.
Explorative data mining (clustering; one may argue that the use cases above are not data mining or clustering, but quantization). If you have a data set of a million points, which points are you going to investigate? Clustering methods try to split the data into groups that are supposed to be more homogeneous within and more different from one another. Then you don't have to look at every object, but only at some from each cluster, to hopefully learn something about the whole cluster (and your whole data set). Centroid methods such as k-means can even provide a "prototype" for each cluster, although it is a good idea to also look at other points within the cluster. You may also want to do outlier detection and look at some of the unusual objects. This scenario is somewhere in between sampling representative objects and reducing the data set size to make it more manageable. The key difference from the points above is that the result is usually not "operationalized" automatically; because explorative clustering results are too unreliable (and thus require many iterations), they need to be analyzed manually.
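The quantization idea from the first example can be sketched in one dimension: noisy received samples are mapped back to a small codebook of levels, which discards the distortion and keeps the signal. This is an illustrative toy, not Lloyd's original formulation; the function name, the three levels and the fixed noise offsets are assumptions.

```python
def lloyd_1d(samples, k, iters=20):
    """1-D Lloyd's algorithm: returns k codebook levels."""
    lo, hi = min(samples), max(samples)
    # Start with levels spread evenly over the observed range.
    levels = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        cells = [[] for _ in range(k)]
        for s in samples:
            nearest = min(range(k), key=lambda i: abs(s - levels[i]))
            cells[nearest].append(s)
        # Move each level to the mean of its cell (keep it if the cell is empty).
        levels = [sum(c) / len(c) if c else levels[i] for i, c in enumerate(cells)]
    return levels

# Hypothetical transmission: three true levels, each received with small errors.
true_levels = [0.0, 5.0, 10.0]
noise = [-0.2, -0.1, 0.0, 0.1, 0.2]
received = [level + e for level in true_levels for e in noise]

levels = lloyd_1d(received, k=3)
# Quantize: replace every sample by its nearest codebook level.
quantized = [min(levels, key=lambda level: abs(s - level)) for s in received]
print([round(level, 3) for level in levels])
```

The information that is lost is exactly the per-sample error, which is the intended effect.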

Finding elongated clusters using MATLAB

Let me explain what I'm trying to do.
I have a plot of an image's pixels in RGB space.
What I am trying to do is find elongated clusters in this space. I'm fairly new to clustering techniques, and maybe I'm not doing things correctly. I'm trying to cluster using MATLAB's built-in k-means clustering, but it appears that this is not the best approach in this case.
What I need to do is find "color clusters".
This is what I get after applying K-means on an image.
This is how it should look:
for an image like this:
Can someone tell me where I'm going wrong, and what I can do to improve my results?
Note: Sorry for the low-res images, these are the best I have.
Are you trying to replicate the results of this paper? I would say just do what they did.
However, I will add an answer, since there are some issues with the current answers.
1) Yes, your clusters are not spherical, which is an assumption k-means makes. DBSCAN and MeanShift are two more common methods for handling such data, as they can handle non-spherical clusters. However, your data appears to have one large central clump that spreads outwards in a few finite directions.
For DBSCAN, this means it will either put everything into one cluster or make everything its own cluster, because DBSCAN assumes uniform density and requires that clusters be separated by some margin.
MeanShift will likely have difficulty because everything seems to be coming from one central lump - so that will be the area of highest density that the points will shift toward, and converge to one large cluster.
My advice would be to change color spaces. RGB has issues, and the assumptions most algorithms make will probably not hold up well under it. Which clustering algorithm you should use will then likely change in the different feature space, but hopefully it will make the problem easier to handle.
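As one possible illustration of changing color spaces, the sketch below converts 8-bit RGB pixels to HSV with Python's standard `colorsys` module. HSV is only an assumption here (the answer does not name a target space), and hue is circular, so plain Euclidean distance on it needs care.

```python
import colorsys

def rgb8_to_hsv(pixel):
    """Convert an 8-bit (r, g, b) pixel to (h, s, v), each in [0, 1]."""
    r, g, b = (channel / 255.0 for channel in pixel)
    return colorsys.rgb_to_hsv(r, g, b)

# Hypothetical pixels: bright red, dark red, and blue.
pixels = [(255, 0, 0), (128, 0, 0), (0, 0, 255)]
hsv_pixels = [rgb8_to_hsv(p) for p in pixels]

for rgb, hsv in zip(pixels, hsv_pixels):
    print(rgb, tuple(round(c, 3) for c in hsv))
```

Bright and dark red are far apart in RGB but share the same hue and saturation, so shades that form an elongated streak in RGB differ only in the value channel after the conversion.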
k-means basically assumes clusters are approximately spherical. In your case they are definitely NOT. Try fitting a Gaussian with a non-spherical covariance matrix to each cluster.
Basically, you will be following the same expectation-maximization (EM) steps as in k-means, with the only exception that you will be modeling and fitting the covariance matrix as well.
Here's an outline for the algorithm
1. Init: assign each point at random to one of k clusters.
2. For each cluster, estimate its mean and covariance.
3. For each point, estimate its likelihood of belonging to each cluster.
   Note that this likelihood is based not only on the distance to the center (mean) but also on the shape of the cluster, as encoded by the covariance matrix.
4. Repeat stages 2 and 3 until convergence or until a pre-defined number of iterations is exceeded.
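The outline above can be sketched for 2-D points as follows. This is a hard-assignment variant (each point is moved to its most likely cluster), which mirrors the k-means analogy; a full Gaussian mixture fit would use soft responsibilities and mixing weights instead. The function names, the small ridge added to the covariance, the re-seeding of empty clusters and the toy data are all assumptions made for this example.

```python
import math
import random

def fit_gaussian(pts, ridge=1e-6):
    """Mean and 2x2 covariance (sxx, sxy, syy) of 2-D points, with a small ridge."""
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mx) ** 2 for p in pts) / n + ridge
    syy = sum((p[1] - my) ** 2 for p in pts) / n + ridge
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / n
    return (mx, my), (sxx, sxy, syy)

def log_density(p, mean, cov):
    """Log of the 2-D Gaussian density at p."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    # Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    maha = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * maha - 0.5 * math.log(det) - math.log(2 * math.pi)

def hard_em(points, k, iters=50, seed=0):
    """Hard-assignment EM with full covariances; returns (labels, params)."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]  # step 1: random assignment
    params = []
    for _ in range(iters):
        params = []
        for c in range(k):  # step 2: mean and covariance per cluster
            members = [p for p, label in zip(points, labels) if label == c]
            # Assumption: an empty cluster is re-seeded from one random point.
            params.append(fit_gaussian(members or [rng.choice(points)]))
        new_labels = [  # step 3: most likely cluster for each point
            max(range(k), key=lambda c: log_density(p, *params[c])) for p in points
        ]
        if new_labels == labels:  # step 4: stop once nothing changes
            break
        labels = new_labels
    return labels, params

# Hypothetical toy data: two elongated, parallel clusters.
random.seed(0)
points = [(random.gauss(0, 3.0), random.gauss(0, 0.2)) for _ in range(100)]
points += [(random.gauss(0, 3.0), random.gauss(2, 0.2)) for _ in range(100)]

labels, params = hard_em(points, k=2)
for mean, cov in params:
    print("mean", tuple(round(m, 2) for m in mean), "cov", tuple(round(c, 2) for c in cov))
```

Like k-means, this can get stuck in a poor local optimum depending on the random start, so trying several seeds and comparing the results is advisable.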
Take a look at density-based clustering algorithms, such as DBSCAN and MeanShift. If you are doing this for segmentation, you might want to add pixel coordinates to your vectors.
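A minimal sketch of adding pixel coordinates to the feature vectors, assuming the image is given as rows of (r, g, b) tuples. The `spatial_weight` parameter is a hypothetical knob for balancing position against color and is not part of any particular library.

```python
def pixel_features(image, spatial_weight=1.0):
    """Turn rows of (r, g, b) pixels into (r, g, b, w*x, w*y) feature vectors."""
    return [
        (r, g, b, spatial_weight * x, spatial_weight * y)
        for y, row in enumerate(image)
        for x, (r, g, b) in enumerate(row)
    ]

# Hypothetical 2x2 image.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (10, 20, 30)],
]
features = pixel_features(image, spatial_weight=0.5)
for vector in features:
    print(vector)
```

With a larger weight, clusters become more spatially compact; with a weight of zero, the clustering depends on color alone.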