DBSCAN with R*-Tree - how it works

Can someone explain how the DBSCAN algorithm works with an R*-Tree? I think I understand how DBSCAN works, and I think I understand how the R*-Tree works, but I can't connect the two.
I start with data consisting of feature vectors with 8 features each, and I don't understand how I have to process them to construct the R*-Tree. I would be grateful if someone could list the main steps I have to go through.
I apologize if my question is obvious, but I am having difficulty with it.
Thanks in advance!

An R*-Tree indexes arbitrary geometric objects by their bounding box. In your case, as you have only points, the minimum and maximum values of each bounding box are the same. R*-Tree implementations typically provide a function along the lines of rtree.add_element(object, boundingbox) (the exact name varies by library). object would be the index of the data point and boundingbox would be as described above.
The connecting point is the regionQuery part of DBSCAN. regionQuery(p) of a data point p returns all the data points q for which euclideanDistance(p,q) ≤ ε (value of parameter ε is provided by the user).
Naïvely, you could compute the distance of all your data points to p, which takes O(n) time for one data point, hence querying all your n data points takes O(n²) time. Alternatively, you could precompute a matrix which holds the Euclidean distances of all your data points to each other. This takes O(n²) space, whereas the regionQuery of one point just has to be looked up in that matrix.
An R*-Tree enables you to look up data points within coordinate ranges in O(log n) time. However, an R*-Tree only allows queries of the form
"All points where: Coordinate 1 in [ 0.3 ; 0.5 ] AND Coordinate 2 in [ 0.8 ; 1.0 ]"
and not
"All points q where: euclideanDistance(p,q) ≤ ε"
Therefore, you query the R*-Tree for points where each coordinate lies within the respective coordinate of p ± ε, and then calculate the Euclidean distance from your query point p to each matching point. The difference is that these are far fewer points to check than if you calculated the Euclidean distance from p to all of your points. Your time complexity for one regionQuery is then O(log n * m), where m is the number of points returned by your R*-Tree. If you choose ε small, you will get few matching points from your R*-Tree and m will be small. So your time complexity approaches O(log n) for one regionQuery and therefore O(n * log n) for running regionQuery once for each of your data points. At the other extreme, if you choose ε so large that it encompasses most of your data points, m will approach n, and the time needed to run regionQuery once for each data point approaches O(n * log n * n) = O(n² * log n), so you gain nothing compared to the naïve approach.
It is therefore of crucial importance that you choose ε small enough so that every point has only a few other points within euclidean distance of ε.
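The two-step regionQuery described above (coarse box query, then exact distance check) can be sketched roughly as follows in Python. This is only an illustration: box_query is a hypothetical stand-in that scans all points linearly, where a real R*-Tree library would answer the same box query from its index, and all names here are made up for the example.

```python
import math

def box_query(points, low, high):
    """Stand-in for an R*-Tree range query: indices of points inside the box.

    A real R*-Tree answers this without scanning every point; this linear
    scan only mimics the result so the example stays self-contained.
    """
    return [
        i for i, p in enumerate(points)
        if all(lo <= c <= hi for c, lo, hi in zip(p, low, high))
    ]

def region_query(points, p_index, eps):
    """Indices of all points within Euclidean distance eps of points[p_index]."""
    p = points[p_index]
    low = [c - eps for c in p]    # lower corner of the box p - eps
    high = [c + eps for c in p]   # upper corner of the box p + eps
    candidates = box_query(points, low, high)   # step 1: coarse box filter
    # step 2: exact Euclidean check on the (hopefully few) candidates
    return [i for i in candidates if math.dist(p, points[i]) <= eps]

points = [(0.0, 0.0), (0.5, 0.5), (0.9, 0.9), (3.0, 3.0)]
neighbours = region_query(points, 0, 1.0)
print(neighbours)
```

The point (0.9, 0.9) lies inside the box around (0, 0) but is farther away than ε = 1, which is exactly the case the second step filters out. The same code works unchanged for 8-dimensional feature vectors.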

The R*-tree is a spatial index.
It can find the neighbors faster.

Related

Plotting Jaccard Index against spatial data (lat-long)

I am trying to find a way to plot a matrix (in this case a matrix of Jaccard indices) against spatial distances (I have latitude and longitude data). I have been told to use the "geosphere" package, but I haven't been able to fully understand how to use it.
So if anyone here is well versed in doing such things, please help me out.
Kind regards

Package geosphere is indeed a good choice, as it is based on an ellipsoid instead of a sphere and gives more accurate results than many other alternatives (and directly in metres). However, it is more tedious to use, as it only calculates the distance between two points or, given a matrix, the track lengths from each point to the next instead of the full distance matrix. The following is an easy way that does a lot of unnecessary calculation that you throw away, but it is much simpler than the alternatives (and sufficiently fast for any practical purpose).
Assume you have matrix x of dimensions N times 2, where the two columns are the longitude and latitude (in this order) in decimal degrees and N is the number of observations:
library(geosphere)
N <- NROW(x)
geodists <- matrix(0, N, N)
for (i in 1:N) for(j in 1:N) geodists[i,j] <- distGeo(x[i,], x[j,])
## alternative for only lower diagonal:
## for(j in 1:(N-1)) for(i in (j+1):N) geodists[i,j] <- distGeo(x[i,], x[j,])
geodists <- as.dist(geodists)
The geodists will then be arranged in the same way as the Jaccard distances (assuming you used vegan or another package that returns standard dist structures), and the two can be plotted directly against each other. If you used a package that gives you a matrix (which is symmetric and has a zero diagonal), it is best to convert the result to distances (as.dist()), which only keep the lower triangle without the zero diagonal, as these give nicer plots.
Package sp also uses the WGS84 ellipsoid, and its results differ only a little (less than 0.01% in my test for distances up to 2500 km) from those given by geosphere. It may be that geosphere is more accurate (and it also allows alternatives to the WGS84 ellipsoid). However, sp is much easier to use and directly gives you the symmetric matrix of distances (in kilometres, not in metres as I claimed in my comment), and you can do it with one command:
library(sp)
geodists <- as.dist(spDists(x, longlat=TRUE))*1000 # in metres
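For readers who want to see the arrangement of such a distance vector outside R, here is a rough Python sketch. It deliberately swaps in the simpler haversine (spherical) formula rather than the ellipsoidal calculation used by geosphere or sp, so the numbers will differ slightly from theirs. The Earth radius constant and the function names are assumptions made for this example only. The pairs are listed in column-wise lower-triangle order, the same order a dist object uses.

```python
import math

EARTH_RADIUS_M = 6371008.8  # assumed mean Earth radius in metres (sphere, not ellipsoid)

def haversine_m(p, q):
    """Great-circle distance in metres between two (longitude, latitude) points in degrees."""
    lon1, lat1 = map(math.radians, p)
    lon2, lat2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def lower_triangle(points, dist):
    """Pairwise distances in column-wise lower-triangle order: (2,1), (3,1), ..., (3,2), ..."""
    n = len(points)
    return [dist(points[i], points[j]) for j in range(n - 1) for i in range(j + 1, n)]

# Three observations as (longitude, latitude) in decimal degrees
x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
geodists = lower_triangle(x, haversine_m)
print(geodists)
```

As long as the Jaccard distances are flattened in the same order, element i of one vector corresponds to element i of the other, and the two can be plotted against each other as a scatter plot.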

Finding length between a lot of elements

I have an image of a cytoskeleton. There are a lot of small objects inside, and I want to calculate the distances between all of them along every axis and get a matrix with all this data. I am trying to do this in MATLAB.
My final aim is to figure out whether there is any axis along which the distance between the objects is constant.
I've tried bwdist and connected components without any luck.
Do you have any other ideas?

So, the end goal is that you want to globally stretch this image in a certain direction (linearly) so that the distances between nearest pairs end up as close to each other as possible, hopefully the same? Or are you allowed to do more complex stretching? (Note that with an arbitrarily complex one you can always make it work :) )
If it is a linear global stretch, the distance in x' and y' is simply the old distance in x and y multiplied by a factor, applied to every pair of points. So the final Euclidean distance ends up being sqrt((SX*X)^2 + (SY*Y)^2), with SX being the stretch in x and SY the stretch in y; X and Y are the distances in x and y between a pair of points.
If you are interested in just "the same" part, solution is not so difficult:
Find all objects of interest and put their X and Y coordinates in a N*2 matrix.
Calculate distances between all pairs of objects in X and Y. You will end up with 2 matrices sized N*N (with 0 on the diagonal, symmetric and real, not sure what is the name for that type of matrix).
Find the minimum distance (say this is between A and B).
You probably already have this. Now:
Take C. Make N-1 transformations, each of which makes C->nearestToC = A->B. It is a simple system of equations: you have X1^2*SX^2 + Y1^2*SY^2 = X2^2*SX^2 + Y2^2*SY^2.
So, first say A->B = C->A, then A->B = C->B, then A->B = C->D, and so on. Make sure the transformation is normalized => SX^2 + SY^2 = 1. If it cannot be found, the only valid transformation is SX = SY = 0, which means you don't have a solution here. Obviously, SX and SY need to be real.
Note that this solution is unique except in case where X1 = X2 and Y1 = Y2. In this case, grab some other point than C to find this transformation.
For each transformation check the remaining points and find all nearest neighbours of them. If distance is always the same as these 2 (to a given tolerance), great, you found your transformation. If not, this transformation does not work and you should continue with the next one.
If you want a transformation that minimizes variations between distances (but doesn't require them to be nearly equal), I would do some optimization method and search for a minimum - I don't know how to find an exact solution otherwise. I would pick this also in case you don't have linear or global stretch.
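The small system of equations from the steps above can be solved in closed form. Writing a = SX^2 and b = SY^2, the two conditions are X1^2*a + Y1^2*b = X2^2*a + Y2^2*b and a + b = 1. The following Python sketch solves them; the function name and the tolerance are made up for this illustration, and it only handles a single pair of offsets, not the full search over all candidate pairs.

```python
import math

def solve_stretch(d1, d2, tol=1e-12):
    """Solve for (SX, SY) that make two pair offsets equally long after stretching.

    d1 = (X1, Y1) and d2 = (X2, Y2) are the x/y distances of two point pairs.
    Solves X1^2*a + Y1^2*b = X2^2*a + Y2^2*b with a + b = 1, where
    a = SX^2 and b = SY^2. Returns None when there is no usable solution.
    """
    dx = d1[0] ** 2 - d2[0] ** 2
    dy = d1[1] ** 2 - d2[1] ** 2
    if abs(dx - dy) < tol:
        return None            # no unique solution (none, or infinitely many)
    a = -dy / (dx - dy)        # SX^2, from a*dx + (1 - a)*dy = 0
    b = 1.0 - a                # SY^2, from the normalization a + b = 1
    if a < 0 or b < 0:
        return None            # SX or SY would not be real
    return math.sqrt(a), math.sqrt(b)

# Pair 1 is 2 apart in x only, pair 2 is 1 apart in y only
result = solve_stretch((2.0, 0.0), (0.0, 1.0))
print(result)
```

For the example, SX^2 = 0.2 and SY^2 = 0.8, so both pairs end up with squared length 0.8 after stretching. Each candidate (SX, SY) found this way would then still have to be checked against the remaining nearest-neighbour pairs, as described above.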

If I understand your question correctly, the first step is to obtain the center of mass of every object in the image as (x,y) coordinates. Then you can easily compute the distances between all points. I suggest taking a look at a histogram of those distances, which may provide some information about the nature of the distance distribution (for example, whether it is uniformly random or whether any patterns appear).
Obtaining the center-of-mass points is not an easy task; consider transforming the image into a binary one, or some sort of background subtraction with blob detection and/or an edge detector.
For building a histogram you can use histogram.
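To illustrate the idea independently of MATLAB, here is a minimal Python sketch that computes all pairwise distances between object centres and bins them. The centre coordinates and the bin width are invented for the example; in practice they would come from the blob detection step.

```python
import math
from collections import Counter

def pairwise_distances(centers):
    """Euclidean distance for every unordered pair of (x, y) centres."""
    n = len(centers)
    return [math.dist(centers[i], centers[j])
            for i in range(n) for j in range(i + 1, n)]

def histogram(values, bin_width):
    """Count values per bin; key k covers [k*bin_width, (k+1)*bin_width)."""
    return Counter(int(v // bin_width) for v in values)

# Hypothetical centres, evenly spaced 10 units apart along the x axis
centers = [(0, 0), (10, 0), (20, 0), (30, 0)]
dists = pairwise_distances(centers)
counts = histogram(dists, 5.0)
print(sorted(counts.items()))
```

With regularly spaced objects the histogram shows isolated peaks at multiples of the spacing, whereas randomly placed objects give a smooth, spread-out distribution.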

How to understand the MATLAB built-in function "kmeans"?

Suppose I have a matrix A of size 2000*1000 (double). Then I apply the MATLAB built-in function "kmeans" to the matrix A:
k = 8;
[idx,C] = kmeans(A, k, 'Distance', 'cosine');
I get C = 8*1000 double; idx = 2000*1 double, with values from 1 to 8;
According to the documentation, C returns the k cluster centroid locations in the k-by-p (8 by 1000) matrix. And idx returns an n-by-1 vector containing cluster indices of each observation.
My question is:
1) I do not know how to understand the C, the centroid locations. Locations should be represented as (x,y), right? How to understand the matrix C correctly?
2) What are the final centers c1, c2,...,ck? Are they just values or locations?
3) For each cluster, if I only want to get the vector closest to the center of this cluster, how to calculate and get it?
Thanks!

Before I answer the three parts, I'll just explain the syntax that is used in MATLAB's explanation of k-means (http://www.mathworks.com/help/stats/kmeans.html).
A is your data matrix (it's represented as X in the link). There are n rows (in this case, 2000), which represent the number of observations/data points that you have. There are also p columns (in this case, 1000), which represent the number of "features" that each data point has. For example, if your data consisted of 2D points, then p would equal 2.
k is the number of clusters that you want to group the data into. Based on the dimensions of C that you gave, k must be 8.
Now I will answer the three parts:
1) The C matrix has dimensions k x p. Each row represents a centroid. Centroid locations DO NOT have to be (x, y) at all. The dimensions of the centroid locations are equal to p. In other words, if you have 2D points, you could graph the centroids as (x, y). If you have 3D points, you could graph the centroids as (x, y, z). Since each data point in A has 1000 features, your centroids therefore have 1000 dimensions.
2) This is sort of difficult to explain without knowing what your data is exactly. Centroids are certainly not just values, and they may not necessarily be locations. If your data A were coordinate points, you could certainly represent the centroids as locations. However, we can view it more generally. If you had a cluster centroid i and the data points v that are grouped with that centroid, the centroid would represent the data point that is most similar to those in its cluster. Hopefully, that makes sense, and I can give a clearer explanation if necessary.
3) The k-means method actually gives us a good way to accomplish this. The function actually has 4 possible outputs, but I will focus on the 4th, which I will call D:
[idx,C,sumd,D] = kmeans(A, k, 'Distance', 'cosine');
D has dimensions n x k. For a data point i, the row i in the D matrix gives the distance from that point to every centroid. Therefore, for each centroid, you simply need to find the data point closest to this, and return that corresponding data point. I can supply the short code for this if you need it.
Also, just a tip: you should probably use the k-means++ method of initializing the centroids. It's faster and generally better. You can call it using this:
[idx,C,sumd,D] = kmeans(A, k, 'Distance', 'cosine', 'Start', 'plus');
Edit:
Here is the code necessary for part 3:
[~, min_idxs] = min(D, [], 1);
closest_vecs = A(min_idxs, :);
Each row i of closest_vecs is the vector that is closest to centroid i.
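The column-wise minimum used above can be spelled out in plain Python for readers unfamiliar with MATLAB's min(D, [], 1). This is only a sketch with a made-up 3 x 2 distance matrix; note that Python indices start at 0, whereas MATLAB's start at 1.

```python
def closest_rows(D):
    """For each centroid (column of D), return the index of the nearest data point (row)."""
    n_points = len(D)
    n_centroids = len(D[0])
    return [min(range(n_points), key=lambda i: D[i][c]) for c in range(n_centroids)]

# Hypothetical distances from 3 data points (rows) to 2 centroids (columns)
D = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.4, 0.3],
]
min_idxs = closest_rows(D)
print(min_idxs)
```

Here data point 1 is closest to centroid 0 and data point 0 is closest to centroid 1; indexing the data matrix with these row indices gives the closest vectors.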

OK, before we actually get into the details, let's give a brief overview on what K-means clustering is first.
k-means clustering works such that for some data that you have, you want to group them into k groups. You initially choose k random points in your data, and these will have labels from 1,2,...,k. These are what we call the centroids. Then, you determine how close the rest of the data are to each of these points. You then group those points so that whichever points are closest to any of these k points, you assign those points to belong to that particular group (1,2,...,k). After, for all of the points for each group, you update the centroids, which actually is defined as the representative point for each group. For each group, you compute the average of all of the points in each of the k groups. These become the new centroids for the next iteration. In the next iteration, you determine how close each point in your data is to each of the centroids. You keep iterating and repeating this behaviour until the centroids don't move anymore, or they move very little.
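The iteration just described can be sketched in a few lines of Python. This is a bare-bones illustration using the Euclidean distance and plain random initial centroids, not MATLAB's kmeans implementation; the function name, the seed parameter and the handling of empty clusters are choices made for this example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means: returns (labels, centroids) for a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # k random data points as initial centroids
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the group of its nearest centroid
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the average of its group's points
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])   # keep centroid of an empty group
        if new_centroids == centroids:               # centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centroids = kmeans(points, 2)
print(labels, centroids)
```

Each returned centroid has as many coordinates as the data has columns, which is why a data matrix with 1000 columns yields centroids with 1000 values each.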
How you use the kmeans function in MATLAB is that assuming you have a data matrix (A in your case), it is arranged such that each row is a sample and each column is a feature / dimension of a sample. For example, we could have N x 2 or N x 3 arrays of Cartesian coordinates, either in 2D or 3D. In colour images, we could have N x 3 arrays where each column is a colour component in an image - red, green or blue.
How you invoke kmeans in MATLAB is the following way:
[IDX, C] = kmeans(X, K);
X is the data matrix we talked about, K is the total number of clusters / groups you would like to see, and the outputs IDX and C are respectively an index and a centroid matrix. IDX is an N x 1 array, where N is the total number of samples that you have put into the function. Each value in IDX tells you which centroid the corresponding sample / row in X best matched. You can also override the distance measure used to measure the distance between points. By default, this is the Euclidean distance, but you used the cosine distance in your invocation.
C has K rows where each row is a centroid. Therefore, for the case of Cartesian coordinates, this would be a K x 2 or K x 3 array. You would interpret IDX as telling you which group / centroid each point is closest to when computing k-means. As such, if we got a value of IDX=1 for a point, this means that the point best matched the first centroid, which is the first row of C. Similarly, if we got a value of IDX=3 for a point, this means that the point best matched the third centroid, which is the third row of C.
Now to answer your questions:
1) We just talked about C and IDX so this should be clear.
2) The final centres are stored in C. Each row gives you a centroid / centre that is representative of a group.
3) It sounds like you want to find the data point closest to each cluster centroid. That's easy to do with knnsearch, which performs a K-nearest-neighbour search: given a set of query points, it outputs the K points within your data that are closest to each query point. As such, you supply your data as the first input and the centroids as the query points, and use K=1. A centroid is the mean of the points in its cluster, so it is generally not itself a row of A; the nearest neighbour returned is therefore already the closest data point and nothing has to be skipped. Because you clustered with the cosine distance, it makes sense to search with the same distance.
You can do that by the following, assuming you already ran kmeans:
out = knnsearch(A, C, 'K', 1, 'Distance', 'cosine');
out tells you which row of your data matrix A is closest to each centroid. To get the actual points, do this:
pts = A(out,:);
Hope this helps!

3 points distance calculation strategy

Problem: I have (lat-long) co-ordinates of a lot of points a, b, c, d . . . in the database.
Now, when I choose point a, I need to calculate the distance from point a to each of the other points and get, for example, the closest one. This math requires cos and tan calculations on the points, so it seems to be quite expensive on the db side.
So I thought of a strategy to simplify this. Below is the explanation of the strategy.
I have 3 known points (x, y, z), and the distance from each point to the others is known. For this example let's assume it to be 10, i.e. distance from x to y = 10; y to z = 10; z to x = 10. (This forms an equilateral triangle, but in the real scenario that might not be the case.)
Now let's say we have two points a and b. We calculate the distances from point a to x, y and z and store them, and likewise for point b (say, in application logic).
so we have:
for point a: Ax, Ay and Az
for point b: Bx, By and Bz
As for the strategy, the question is: how can we calculate the distance from point a to point b?
As for the problem itself, if I apply the above strategy to it, the question is: am I simplifying or complicating the situation?
Thanks in advance for your answer.

You can calculate the distance between two points using the Pythagorean theorem assuming they are given in Cartesian coordinates, no expensive trigonometry involved.
But this will probably still be too expensive if you have thousands or millions of points. Depending on the database you use, it may offer spatial data types and be able to handle your query efficiently. See for example spatial data in SQL Server.
If your database does not support spatial data the problem becomes quite complex. You need a spatial index with efficient support for nearest neighbor queries. You can start at the Wikipedia article on spatial databases to learn how this problem is usually solved.
If your set of points is stable another option would be to just precompute and store the nearest neighbor for every point but this will get tricky when points are added, deleted or changed.
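One way to read the Pythagorean suggestion above for lat-long data is sketched below in Python. It assumes a spherical Earth (an approximation): each point is converted once to a 3D unit vector, which is where all the trigonometry happens, and nearest-neighbour comparisons afterwards use only the squared straight-line (chord) distance. Because the chord length grows with the great-circle distance, the ranking of neighbours is the same. The function names and the sample cities are chosen for this example only.

```python
import math

def to_unit_vector(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to a point on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def nearest(query_index, vectors):
    """Index of the point closest to vectors[query_index], excluding itself."""
    q = vectors[query_index]
    best, best_d2 = None, float("inf")
    for i, v in enumerate(vectors):
        if i == query_index:
            continue
        d2 = sum((a - b) ** 2 for a, b in zip(q, v))   # squared chord length
        if d2 < best_d2:
            best, best_d2 = i, d2
    return best

# (latitude, longitude): Berlin, Paris, Warsaw, Madrid
coords = [(52.52, 13.405), (48.8566, 2.3522), (52.2297, 21.0122), (40.4168, -3.7038)]
vectors = [to_unit_vector(lat, lon) for lat, lon in coords]   # precomputed once
print(nearest(0, vectors), nearest(3, vectors))
```

The three vector components could be stored in the database next to each point. This still scans every point per query, so for large tables a spatial index as mentioned above remains the better option.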

K means clustering find k farthest points in java

I'm trying to implement k-means clustering.
I have a set of points with coordinates (x,y) and I am using the Euclidean distance. I've computed the distances between all points in a matrix:
dist[i][j] - distance between points i and j
Suppose I choose a[1][3], i.e. the point farthest from pt 1 is pt 3.
Then when I search for the point farthest from pt 3 I may get a[3][j], but a[1][j] may be small.
[pt j is far from pt 3 but near to pt 1]
So how do I choose the k farthest points using the distance matrix?

Note that the k-farthest points do not necessarily yield the best result: they clearly aren't the best cluster center estimates.
Plus, since k-means heuristics may get stuck in a local minimum, you will want a randomized algorithm that allows you to restart the process multiple times and get potentially different results.
You may want to look at k-means++ which is a known good heuristic for k-means initialization.
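As a rough illustration of the k-means++ idea (sketched in Python here rather than Java): the first centre is picked uniformly at random, and each further centre is picked with probability proportional to the squared distance to the nearest centre chosen so far. Far-away points are therefore likely, but not guaranteed, to be chosen, which keeps the randomness needed for restarts. The function name and the seed parameter are made up for this example.

```python
import math
import random

def kmeans_pp_init(points, k, seed=None):
    """Pick k initial centres from points using k-means++ style seeding.

    Assumes points contains at least k distinct points; otherwise all
    weights become zero and random.choices raises a ValueError.
    """
    rng = random.Random(seed)
    centers = [rng.choice(points)]             # first centre: uniform at random
    while len(centers) < k:
        # Squared distance from each point to its nearest already-chosen centre
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Next centre: sampled with probability proportional to that squared distance
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
centers = kmeans_pp_init(points, 3, seed=1)
print(centers)
```

Already chosen points have weight zero, so they are not picked twice. With a precomputed distance matrix, the inner math.dist calls could be replaced by lookups such as dist[i][j].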
