Clustering with constraints on centers and sizes - cluster-analysis

My knowledge of clustering algorithms is quite limited. I am looking for a solution to the following problem, the simpler the better; I need a reasonable solution that is easy to implement, not necessarily the state of the art.
Given N points in 2D I need to find K clusters such that:
each of the K clusters contains a fixed number K(i) of points, and the sum of the K(i) for i = 1,...,K equals the total number of points N;
each cluster center C(i) dominates or is dominated by every other center, in the following sense: given the coordinates (x,y) of C(i) and (w,z) of C(j), either (x > w AND y > z) OR (x <= w AND y <= z).
Any idea?
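One possible heuristic, sketched below (a sketch under my own assumptions, not a guaranteed solution; P and sizes are placeholder names for the N-by-2 point matrix and the vector of prescribed cluster sizes K(i)): sort the points along the (1,1) direction, cut the sorted list into consecutive blocks of the prescribed sizes, and take each block's centroid as the cluster center. The centroids come out roughly ordered along the diagonal, but the pairwise dominance condition still has to be verified afterwards.
% Heuristic sketch only: sizes is a 1-by-K vector with sum(sizes) == N, P is N-by-2.
[~, order] = sort(P(:,1) + P(:,2));        % order the points along the (1,1) direction
stop    = cumsum(sizes);
start   = [1, stop(1:end-1) + 1];
labels  = zeros(size(P,1), 1);
centers = zeros(numel(sizes), 2);
for k = 1:numel(sizes)
    members = order(start(k):stop(k));     % consecutive block of sizes(k) points
    labels(members) = k;
    centers(k,:)    = mean(P(members,:), 1);
end
% Check the dominance condition on every pair of rows of 'centers' afterwards;
% swapping points between adjacent blocks is one possible repair step if it fails.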

Related

Finding the bounds of a region represented by data points in Matlab

For simplicity, I will consider the 2-D case, but my question applies equally to n dimensions. I have a region S in 2-D that is closed and bounded and has no holes. The only thing I know about S is a bunch of data points that lie in S. Essentially, these points fill up S; the more points I have, the more accurate my representation of S is. From these data points (x,y) in 2-D, I can easily approximate xL and xU such that all points (x,y) in S satisfy xL <= x <= xU.
I am wondering if there is a function or method in Matlab that enables me, for a particular point (x,y) in S, to give me the bounds of y approximated from these data points. In particular, I am looking for functions yL(x) and yU(x) such that yL(x) <= y <= yU(x) for any x such that xL <= x <= xU.
Now, I am not necessarily looking for a function in Matlab like boundary, which simply connects the points that are literally on the boundary of the data set. I am rather looking for the "best fit" boundary given the scatter plot that I have.
Suggestions appreciated!
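One simple approach, sketched below (a sketch rather than a built-in routine; nBins is a tuning choice): bin the x values, take the minimum and maximum of y within each bin (or low/high quantiles for a smoother "best fit"), and interpolate between the bin centers to get yL(x) and yU(x).
% Sketch: piecewise-linear estimates of yL(x) and yU(x) from scattered (x, y) data.
% x and y are column vectors of the sample points; nBins is a tuning parameter.
nBins   = 50;
edges   = linspace(min(x), max(x), nBins + 1);
centers = (edges(1:end-1) + edges(2:end)).' / 2;
bin     = discretize(x, edges);                      % bin index of every sample
yLow    = accumarray(bin, y, [nBins 1], @min, NaN);  % lower envelope per bin
yHigh   = accumarray(bin, y, [nBins 1], @max, NaN);  % upper envelope per bin
% Empty bins leave NaNs; drop those entries before interpolating if they occur.
yL = @(xq) interp1(centers, yLow,  xq, 'linear', 'extrap');
yU = @(xq) interp1(centers, yHigh, xq, 'linear', 'extrap');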

Find all neighbors within point-specific radius in Matlab

I am implementing Equation 8 from Kraskov et al. (2004)'s "Estimating Mutual Information" paper, and have the following problem:
Given vectors X = [X_1,...,X_N] and r = [r_1,...,r_N], I need to compute A = [A_1,...,A_N], where A_i is the number of points in X within a radius r_i of X_i.
If r were a fixed number, that is, if the radius around each point were the same for all points, I could easily use rangesearch. But because it is a vector (a different radius for each point), I am not sure how to do this fast. Exhaustive search (or building any N^2-sized distance array) is not feasible, because N is on the order of a million.
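One possible approach, sketched below under the assumption that the Statistics and Machine Learning Toolbox is available: build a k-d tree once, query it with the largest radius, and then trim each point's neighbour list down to that point's own radius. This avoids the N^2 distance matrix, but is only efficient when max(r) is not much larger than the typical r_i; otherwise, query in chunks of points grouped by similar radius.
% Sketch: per-point radius counts via one k-d tree query with the largest radius.
Mdl    = KDTreeSearcher(X);              % X is N-by-d (use X(:) if X is a plain vector)
[~, D] = rangesearch(Mdl, X, max(r));    % distances to all neighbours within max(r)
A = cellfun(@(d, ri) nnz(d <= ri) - 1, D, num2cell(r(:)));  % -1 excludes the point itself
% Use a strict inequality (d < ri) instead if the estimator requires it.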

PCA (Principal Component Analysis) on multiple datasets

I have a set of climate data (temperature, pressure and moisture, for example): X, Y, Z, which are matrices with dimensions (n x p), where n is the number of observations and p is the number of spatial points.
Previously, to investigate modes of variability in dataset X, I simply performed an empirical orthogonal function (EOF) analysis, or principal component analysis (PCA), on X. This involved decomposing the matrix X via SVD.
To investigate the coupling of the modes of variability of X and Y, I used maximum covariance analysis (MCA), which involved decomposing a covariance matrix proportional to XY^{T} (T is the transpose).
However, if I wish to look at all three datasets, how do I go about doing this? One idea I had was to form a fourth matrix, L, which is the 'feature' concatenation of the three datasets:
L = [X, Y, Z]
so that my matrix L will have dimensions (n x 3p).
I would then use standard PCA/EOF analysis, decomposing this matrix L with SVD to obtain modes of variability of size (3p x 1); the mode associated with X would be the first p values, the mode associated with Y the second set of p values, and the mode associated with Z the last p values.
Is this correct? Or can anyone suggest a better way of looking at the coupling of all three (or more) datasets?
Thank you so much!
I'd recommend treating the spatial points as an extra dimension, i.e. forming an f x n x p array, where 'f' is your number of features. At that point you should use a multilinear extension of PCA that can work on tensor data.
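For reference, here is a minimal sketch of the feature-concatenation approach described in the question, assuming X, Y, Z are n-by-p matrices; each block is centred first (scaling each block by its standard deviation is often advisable so that no variable dominates the variance).
% Sketch of the concatenation approach (implicit expansion needs R2016b+;
% use bsxfun(@minus, X, mean(X,1)) on older releases).
p  = size(X, 2);
Xc = X - mean(X, 1);
Yc = Y - mean(Y, 1);
Zc = Z - mean(Z, 1);
L  = [Xc, Yc, Zc];                 % n-by-3p
[U, S, V] = svd(L, 'econ');        % columns of V are the coupled modes (length 3p)
modeX = V(1:p,       1);           % leading mode: part associated with X
modeY = V(p+1:2*p,   1);           % leading mode: part associated with Y
modeZ = V(2*p+1:3*p, 1);           % leading mode: part associated with Z
pcs   = U * S;                     % expansion-coefficient (principal component) time series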

BRISK (Binary Robust Invariant Scalable Keypoints)

I am trying to implement BRISK in my own Matlab code.
Here is where I am stuck: I don't understand what this expression means.
Let us consider one of the N*(N − 1)/2 sampling-point pairs (pi, pj).
A = {(pi, pj) ∈ R2 × R2 | i < N ∧ j < i ∧ i, j ∈ N }
My other question: what is the difference between a local gradient and a global gradient?
The expression means that you are looking at a pair of pixels (pi, pj) such that the pair belongs to R2 x R2 (each pixel is a point in the plane), and the two pixels cannot be the same.
The gradient is a vector (Ix, Iy), where Ix is the first derivative in the x direction and Iy is the first derivative in the y direction. This vector is defined at a point, so a gradient is local by definition. I don't know what "global gradient" means; more context may help here.
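As a small illustration of the (local) gradient at each pixel, central differences are enough; imgradientxy in the Image Processing Toolbox does essentially the same job.
% Local image gradient at every pixel via central differences.
I = double(imread('cameraman.tif'));   % any grayscale image will do here
[Ix, Iy] = gradient(I);                % first derivatives in the x and y directions
gradMag  = hypot(Ix, Iy);              % gradient magnitude per pixel
gradDir  = atan2(Iy, Ix);              % gradient direction per pixel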
Given a set of points of size N, N*(N − 1)/2 is "N choose 2", which is the number of subsets of size 2 that can be taken from a set of size N (a concept known as combinations). Because you are working with pairs of points, you need the subset size to be 2.
R refers to the set of all real numbers (a single value). When it is squared it refers to the Cartesian plane, so pi is a pair of real numbers (x, y), i.e. a point in the Cartesian plane.
The character '∧' is the AND operation, so all of the following conditions have to be satisfied:
the index i of the first point, pi, must be less than N;
the index j of the second point must be less than the index i of the first point;
both i and j must be natural numbers, i.e. valid sampling-point indices.
These pairs can be enumerated directly, as in the sketch below.
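To make the index set concrete, here is a small sketch that enumerates the sampling-point pairs; P is assumed to be an N-by-2 matrix of sampling-point coordinates (a placeholder name).
% Enumerate all N*(N-1)/2 sampling-point pairs (p_i, p_j) with j < i.
N     = size(P, 1);
pairs = nchoosek(1:N, 2);          % each row is an index pair [j i] with j < i
pj    = P(pairs(:,1), :);          % points with the smaller index j
pi_   = P(pairs(:,2), :);          % points with the larger index i ('pi' would shadow the constant)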
The local gradient is computed from a single pair of sampling points, pi and pj, while the global gradient is estimated for the region surrounding the keypoint by accumulating the local gradients.

Selecting data based on the distance from a query point in Matlab

I have a data set that has four columns [X Y Z C]. I would like to find all the C values that lie inside a given sphere centered at a point [x, y, z] with radius r. What is the best approach to this problem? Should I use the clusterdata command?
Here is one solution that naively uses the Euclidean distance:
Say V = [X Y Z C] is your dataset and Center = [x,y,z] is the center of the sphere; then
dist = bsxfun(@minus, V(:,1:3), Center); % difference vectors between the points and the center
dist = sum(dist.^2, 2);                  % squared Euclidean distances (scalars)
idx = (dist < r^2);                      % logical index of the points inside the sphere
The good C values are
good = V(idx,4);                         % keep just the C column
This is not "cluster analysis": You do not attempt to discover structure in your data.
Instead, what you are doing, is commonly called a "range query" or "radius query". In classic database terms, a SELECT, with a distance selector.
You probably want to define your sphere using Euclidean distance. For computational purposes it is actually beneficial to use the squared Euclidean distance instead, and simply compare it against the square of your radius.
I don't use matlab, but there must be tons of examples of how to compute the distance of each instance in a data set from a query point, and then selecting those objects where the distance is small enough.
I don't know whether there is a good index-structure package for Matlab, but in general, in 3D, this can be accelerated well with index structures. Computing all distances is O(n); with an index structure each query is only O(log n).
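In Matlab, one such index structure is available through KDTreeSearcher and rangesearch in the Statistics and Machine Learning Toolbox; a sketch, reusing the V, Center and r names from the answer above:
Mdl  = KDTreeSearcher(V(:,1:3));     % build a k-d tree over the XYZ columns once
idx  = rangesearch(Mdl, Center, r);  % cell array with the indices of points within radius r
good = V(idx{1}, 4);                 % the corresponding C values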