I have a dataset consisting of a large collection of points in three dimensional euclidian space. In this collection of points, i am trying to find the point that is nearest to the area with the highest density of points.
So my problem consists of two steps:
1: Determine where density of the distribution of points is at its highest
2: Determine which point is nearest to the point found in 1
Point 2 i can manage, but i'm not sure how to solve point 1. I know there are a lot of functions for density estimation in Matlab, but i'm not sure which one would be the most suitable, or straightforward to use.
Does anyone know?
My command of statistics is a little bit rusty, but as far as i can tell, this type of problem calls for multivariate analysis. Someone suggested i use multivariate kernel density estimation, but i'm not really sure if that's the best solution.
Density is a measure of mass per unit volume. On the assumption that your points all have the same mass then you are, I suppose, trying to measure the number of points per unit volume. So one approach is to divide your subset of Euclidean space into lots of little unit volumes (let's call them voxels like everyone does) and count how many points there are in each one. The voxel with the most points is where the density of points is at its highest. This is, of course, numerical integration of a sort. If your points were distributed according to some analytic function (and I guess they are not) you could solve the problem with pencil and paper.
You might make this approach as sophisticated as you like, perhaps initially dividing your space into 2 x 2 x 2 voxels, then choosing the voxel with most points and sub-dividing that in turn until your criteria are satisfied.
I hope this will get you started on your point 1; you seem to be OK with point 2 so I'll stop now.
EDIT
It looks as if triplequad might be what you are looking for.
Related
The dataset I want to cluster consists of ~1000 samples and 10 features, which have different scales and ranges (negative, positive, both). Using scipy.stats.normaltest() I found that none of the features are normally-distributed (all p-values < 1e-4, small enough to reject the null hypothesis that the data are taken from a normal distribution). But all of the distance measures that I'm aware of assume normally-distributed data (I was using Mahalanobis until I realized how non-uniform the data was). What distance measures would one use in this situation? Or is this where one simply has to normalize every feature and hope that that doesn't introduce bias?
Why do you think all distances would assume normal (which btw. is not the same as uniform) data?
Consider Euclidean distance. In many physical applications this distance makes perfect sense, because it is "as the crow flies". Manhattan distance makes a lot of sense when movement is constrained to two axes that cannot be used at the same time. These are completely appropriate for non-normal distributed data.
I have a greyscale image, represented by a histogram below (x and y axes are pixels, z axis is pixel intensity).
Each cluster of bars represents an object, with the local maxima fairly approximating the centroid of the object. My goal is to find the Full Width Half Max of each object – so I'm roughly approximating each object as a Gaussian distribution.
How can I detect each cluster individually? I understand how to mathematically calculate the FWHM, but I'm not sure how to detect each cluster based on its (roughly) Gaussian features. (e.g., in the example below I would want to detect 6 clusters. One can see a small cluster in the middle but its amplitude is so small that I am okay with missing it).
I appreciate any advice - and efficiency is not a major issue, so I can implement relatively expensive solutions.
To find the centers of each of these groupings you could use a type of A* search algorithm, or similar linear optimization algorithm.
It will find its way to the maxima of a grouping. The issue after that is you wont know if you are at a local maxima (which in your scenario is likely). After your current search has bottomed out at the highest point, and you have calculated the FWHM for that area, you could set all the nodes your A* has traversed to 0, (or mark each node as visited so as to not be visited again), and start the A* algorithm again, until all nodes have been seen, and all groupings found.
I need to divide data points into those that are similar to each other("good" points) and everyone else("bad" points).
It looks like some kind of clustering problem and what do I do:
I am assuming that there are at least two "good" points.
Find pairwise distance between all types of points.
Find minimum distance (minDist).
Do hierarchical clustering for all points.
Make a cut at the height of 5*minDist.
Say that all points that are in the same cluster as pair with minDist and under that cut belong to the desired "good" cluster.
And this works pretty well, but if there are two points that are very close to each other. minDist is very small and this 5*minDist cut is also small => only these 2 points are in the desired "good" cluster.
I would think that either I need to change this approach completely and here is question number 1:
[1] "What methods do exist to separate similar points from everyone else?"
Or I need to modify this 5*minDist to some other function of minDist. And question is:
[2] "What may I choose as reasonable alternative to 5*minDist?"
Vladimir
Instead of doing clustering, you want to do outlier detection.
There are dozens of algorithms for this (see ELKI for a large collection). Some very basic methods may solve your problem:
The number of neighbors in radius r. If +i < threshold, the point is an outlier.
Distance to the k nearest neighbor. Choose k>1 to avoid these 2 element clusters you are seeing.
Also, DBSCAN clustering could work for you. Consider all clusters to be good, and only noise to be bad!
I have a large dataset of around 20 million points (x,y,z) in a 3-dimensional space. I know these points are organized in dense regions, but that these regions vary in size. I think a standard unsupervised 3D clustering should solve my problem.
Since I can't estimate the number of clusters a priori, I tried using k-means with a wide range for k, but it is slow and also, I would have to estimate how significant each k-partition is.
Basically, my question is: how can I extract the most significant partition of my points into clusters?
k-means is probably not the best alhorithm for such data.
DBSCAN should be closer to your intuition of dense regions.
Try on a sample first, then figure out how to scale up.
It is not clear to me from the above if you're going to use k-means or not, but if you are, you should be following the responses from the post below which shows how to measure variance of the clusters.
Calculating the percentage of variance measure for k-means?
Additionally, you can get a good fit using 'the elbow method' by trying 2 to 15 k sized clusters. See the answer from Amro for the process on this.
One simple idea in this case is to use 3 different clusterings, along each dimension. That might speed things up.
So you find clusters along X axis (project all the points down to X axis) and then continue to form sub clusters along the Y axis and then along the Z axis.
I think 1-D k-means can be solved very efficiently using dynamic programming http://www.sciencedirect.com/science/article/pii/0025556473900072.
Suppose that I have already found the eps for all density. I applied the methodology from here http://ijiset.com/v1s4/IJISET_V1_I4_48.pdf
If you don't mind, please open page 5 and see at Proposed Algorithm section. At step 10.1, the paper tells us to calculate the number of objects in eps-neighborhood.
What does eps represent actually? It is a radius to draw a circle right? So, why the radius is so small, smaller than distances between two objects? If so, the MinPts will be 0 forever.
Yes, if used with Euclidean distance, then it is a radius.
It is not infinitely small (it does not tend to 0). It's just supposed to be small compared to the data set extends, but the authors could have named it "r" instead.
Use the original paper to understand the algorithm, not some indian journal variant of it.
In Euclidean distance, it is the radius. Selection of Eps is a little difficult.
This problem is related to model selection, i.e., the selection of a particular model and its corresponding parametrization. In the case of k-means (which requires from the user the number of clusters as input) there is a plethora of measures in the literature that can help in the selection of the best number of clusters, for instance: silhouette, c-index, dunn, davies-bouldin. These measures are the so-called relative validity criteria.
In the case of Density-based clustering algorithms, there are some measures too, for instance: CDbw and DBCV.