Mahalanobis distance with 0 variance - cluster-analysis

I'm developing a clustering algorithm and I use the Mahalanobis distance to decide whether a point should be included in a cluster.
A cluster is represented by statistics of the points it contains:
the sum of the components of all the points in each dimension, named SUMi;
the sum of the squares of the components of all the points in each dimension, named SUMSQi.
With these statistics, the centroid's coordinate in the i-th dimension is SUMi/N, where N is the number of points in the cluster, and the variance in the i-th dimension is SUMSQi/N − (SUMi/N)^2.
To initialize the clusters I choose k random points from the dataset, where k is the number of clusters, known a priori. This gives k clusters, each represented by the statistics of a single point, so SUMSQi/N = (SUMi/N)^2 and the variance is 0. I therefore can't compute the Mahalanobis distance, because it divides by the variance of the cluster.
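The zero-variance case can be reproduced directly from the formulas above. The following is a minimal sketch; the function names `cluster_mean` and `cluster_variance` are hypothetical and only mirror the SUMi, SUMSQi and N statistics described in the question.

```c
#include <assert.h>

/* Centroid coordinate in one dimension: SUMi / N. */
double cluster_mean(double sum_i, double n) {
    assert(n > 0);
    return sum_i / n;
}

/* Variance in one dimension: SUMSQi / N - (SUMi / N)^2. */
double cluster_variance(double sum_i, double sumsq_i, double n) {
    assert(n > 0);
    double mean = sum_i / n;
    return sumsq_i / n - mean * mean;
}
```

For a cluster holding the single value 3, `cluster_variance(3.0, 9.0, 1.0)` is 9 − 9 = 0, which is exactly the situation right after initialization.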
What is the correct approach to this problem? And since I'm working with d-dimensional points, how should I choose the threshold that decides whether a point is added to a cluster? Is using the norm or the mean of the elements of the variance vector a good idea?
The clusters are represented by (2*d) + 1 elements: the first (index 0) is the number of points in the cluster, indices 1 to d hold the SUMi, and indices d+1 to 2*d hold the SUMSQi.
This is the function I implemented in C to compute variance of a cluster and Mahalanobis distance of a point from the cluster:
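#include <math.h>
#include <stdlib.h>

double mahalanobis(double *cluster, double *point, int dimension) {
    /* Holds the per-dimension variances (originally named std_dev). */
    double *variance = malloc(dimension * sizeof(double));
    if (variance == NULL) {
        return NAN; /* allocation failed */
    }
    /* Variance of the cluster: SUMSQi/N - (SUMi/N)^2 */
    for (int i = 0; i < dimension; ++i) {
        variance[i] = cluster[i + dimension + 1] / cluster[0] - pow(cluster[i + 1] / cluster[0], 2);
    }
    /* Mahalanobis distance between the point and the centroid the cluster
       would have after adding it. variance[i] is 0 for a single-point
       cluster, so the division below fails in that case. */
    double sum = 0;
    for (int i = 0; i < dimension; ++i) {
        sum += pow(point[i] - ((cluster[i + 1] + point[i]) / (cluster[0] + 1)), 2) / variance[i];
    }
    free(variance);
    return sqrt(sum);
}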

Related

MATLAB: Computing euclidean distance in an efficient way?

What I am currently doing is computing the euclidean distance between all elements in a vector (the elements are pixel locations in a 2D image) to see if the elements are close to each other. I create a reference vector that takes on the value of each index within the vector incrementally. The euclidean distance between the reference vector and all the elements in the pixel location vector is computed using the MATLAB function "pdist2" and the result is applied to some conditions; however, upon running the code, this function seems to be taking the longest to compute (i.e. for one run, the function was called upon 27,245 times and contributed to about 54% of the overall program's run time). Is there a more efficient method to do this and speed up the program?
[~, n] = size(xArray); %xArray and yArray are same size
%Pair the x and y coordinates of the interest pixels
pairLocations = [xArray; yArray].';
%Preallocate cells with the max amount (# of interest pixels)
p = cell(1,n);
for i = 1:n
    ref = [xArray(i), yArray(i)];
    d = pdist2(ref,pairLocations,'euclidean');
    d = d < dTh;
    d = find(d==1);
    [~,k] = size(d);
    if (k >= num)
        p{1,i} = d;
    end
end
For squared Euclidean distance, there is a trick using matrix dot product:
||a-b||² = <a-b, a-b> = ||a||² - 2<a,b> + ||b||²
Let C = [xArray; yArray]; a 2×n matrix of all locations, then
n2 = sum(C.^2); % sq norm of coordinates
D = bsxfun(@plus, n2, n2.') - 2 * C.' * C;
Now D(ii,jj) holds the square distance between point ii and point jj.
Should run quite quickly.
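To make the identity concrete outside MATLAB, here is a small sketch in C that fills the same matrix term by term. The function name `squared_distances` and the row-major layout of `D` are assumptions for illustration. It shows the algebra only, not the speed-up, which in MATLAB comes from the matrix product.

```c
#include <assert.h>
#include <stddef.h>

/* Fill D (row-major, n*n) with squared distances between 2D points,
   using ||a-b||^2 = ||a||^2 - 2<a,b> + ||b||^2. */
void squared_distances(const double *x, const double *y, size_t n, double *D) {
    assert(x != NULL && y != NULL && D != NULL);
    for (size_t i = 0; i < n; ++i) {
        double ni = x[i] * x[i] + y[i] * y[i];      /* ||a||^2 */
        for (size_t j = 0; j < n; ++j) {
            double nj = x[j] * x[j] + y[j] * y[j];  /* ||b||^2 */
            double dot = x[i] * x[j] + y[i] * y[j]; /* <a,b> */
            D[i * n + j] = ni - 2.0 * dot + nj;
        }
    }
}
```

Since the entries are squared distances, a threshold test such as `d < dTh` from the question would compare against `dTh * dTh` instead, which avoids the square root entirely.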

How to calculate normalized euclidean distance on two vectors?

Let's say I have the following two vectors:
x = [(10-1).*rand(7,1) + 1; randi(10,1,1)];
y = [(10-1).*rand(7,1) + 1; randi(10,1,1)];
The first seven elements are continuous values in the range [1,10]. The last element is an integer in the range [1,10].
Now I would like to compute the Euclidean distance between x and y. I think the integer element is a problem, because all other elements can get arbitrarily close while the integer element always changes in steps of one. So there is a bias towards the integer element.
How can I calculate something like a normalized euclidean distance on it?
According to Wolfram Alpha and the following answer from Cross Validated, the normalized Euclidean distance is defined by:
0.5 * Var(x - y) / (Var(x) + Var(y))
You can calculate it with MATLAB by using:
0.5*(std(x-y)^2) / (std(x)^2+std(y)^2)
Alternatively, you can use:
0.5*((norm((x-mean(x))-(y-mean(y)))^2)/(norm(x-mean(x))^2+norm(y-mean(y))^2))
I would rather normalise x and y before calculating the distance and then vanilla Euclidean would suffice.
In your example
x_norm = (x -1) / 9; % normalised x
y_norm = (y -1) / 9; % normalised y
dist = norm(x_norm - y_norm); % Euclidean distance between normalised x, y
However, I am not sure about whether having an integer element contributes to some sort of bias but we have already gotten kind of off-topic for stack overflow :)
From Euclidean Distance - raw, normalized and double‐scaled coefficients
SYSTAT, Primer 5, and SPSS provide Normalization options for the data so as to permit an investigator to compute a distance
coefficient which is essentially “scale free”. Systat 10.2’s
normalised Euclidean distance produces its “normalisation” by dividing
each squared discrepancy between attributes or persons by the total
number of squared discrepancies (or sample size).
Frankly, I can see little point in this standardization – as the final
coefficient still remains scale‐sensitive. That is, it is impossible
to know whether the value indicates high or low dissimilarity from the
coefficient value alone

Matlab Vectorization of Multivariate Gaussian Basis Functions

I have the following code for calculating the result of a linear combination of Gaussian functions. What I'd really like to do is to vectorize this somehow so that it's far more performant in Matlab.
Note that y is a column vector (output), x is a matrix where each column corresponds to a data point and each row corresponds to a dimension (i.e. 2 rows = 2D), variance is a double, gaussians is a matrix where each column is a vector corresponding to the mean point of the gaussian and weights is a row vector of the weights in front of each gaussian. Note that the length of weights is 1 bigger than gaussians as weights(1) is the 0th order weight.
function [ y ] = CalcPrediction( gaussians, variance, weights, x )
    basisFunctions = size(gaussians, 2);
    xvalues = size(x, 2);
    if length(weights) ~= basisFunctions + 1
        ME = MException('TRAIN:CALC', 'The number of weights should be equal to the number of basis functions plus one');
        throw(ME);
    end
    y = weights(1) * ones(xvalues, 1);
    for xIdx = 1:xvalues
        for i = 1:basisFunctions
            diff = x(:, xIdx) - gaussians(:, i);
            y(xIdx) = y(xIdx) + weights(i+1) * exp(-(diff')*diff/(2*variance));
        end
    end
end
You can see that at the moment I simply iterate over the x vectors and then the gaussians inside 2 for loops. I'm hoping that this can be improved - I've looked at meshgrid but that seems to only apply to vectors (and I have matrices)
Thanks.
Try this
diffx = bsxfun(@minus,x,permute(gaussians,[1,3,2])); % binary operation with singleton expansion
diffx2 = squeeze(sum(diffx.^2,1)); % squared distances, shape is now [XVALUES,BASISFUNCTIONS]
weight_col = weights(:); % make sure weights is a column vector
y = weight_col(1) + exp(-diffx2/(2*variance))*weight_col(2:end); % 0th order weight plus weighted gaussians, a column vector of length XVALUES
Note, I changed diff to diffx since diff is a builtin. I'm not sure this will improve performance, as the cost of allocating the intermediate arrays may offset the gain from vectorization.

Matlab, generate and plot a point cloud distributed within a triangle

I'm trying to generate a cloud of 2D points (uniformly) distributed within a triangle. So far, I've achieved the following:
The code I've used is this:
N = 1000;
X = -10:0.1:10;
for i=1:N
    j = ceil(rand() * length(X));
    x_i = X(j);
    y_i = (10 - abs(x_i)) * rand;
    E(:, i) = [x_i y_i];
end
However, the points are not uniformly distributed, as clearly seen in the left and right corners. How can I improve that result? I've been trying to search for the different shapes too, with no luck.
You should first ask yourself what would make the points within a triangle distributed uniformly.
To make a long story short, given all three vertices of the triangle, you need to transform two uniformly distributed random values like so:
N = 1000; % # Number of points
V = [-10, 0; 0, 10; 10, 0]; % # Triangle vertices, pairs of (x, y)
t = sqrt(rand(N, 1));
s = rand(N, 1);
P = (1 - t) * V(1, :) + bsxfun(@times, ((1 - s) * V(2, :) + s * V(3, :)), t);
This will produce a set of points which are uniformly distributed inside the specified triangle:
scatter(P(:, 1), P(:, 2), '.')
Note that this solution does not involve repeated conditional manipulation of random numbers, so it cannot potentially fall into an endless loop.
For further reading, have a look at this article.
That concentration of points is to be expected from the way you are building the points. Your points are equally distributed along the X axis. At the extremes of the triangle there are approximately as many points as at the center of the triangle, but they are spread over a much smaller region.
The first and best approach I can think of: brute force. Distribute the points equally around a bigger region, and then delete the ones that are outside the region you are interested in.
N = 1000;
points = zeros(N,2);
n = 0;
while (n < N)
    n = n + 1;
    x_i = 20*rand-10; % generate a number between -10 and 10
    y_i = 10*rand; % generate a number between 0 and 10
    if (y_i > 10 - abs(x_i)) % if the point is outside the triangle
        n = n - 1; % decrease the counter to try to generate one more point
    else % if the point is inside the triangle
        points(n,:) = [x_i y_i]; % add it to a list of points
    end
end
% plot the points generated
plot(points(:,1), points(:,2), '.');
title ('1000 points randomly distributed inside a triangle');
The result of the code I've posted:
One important disclaimer: randomly distributed does not mean "evenly" distributed! If you generate data randomly from a uniform distribution, that does not mean it will be evenly spread over the triangle. You will, in fact, see some clusters of points.
You can imagine that the triangle is split vertically into two halves, and move one half so that together with the other it makes a rectangle. Now you sample uniformly in the rectangle, which is easy, and then move the half triangle back.
Also, it's easier to work with unit lengths (the rectangle becomes a square) and then stretch the triangle to the desired dimensions.
x = [-10 10]; % //triangle base
y = [0 10]; % //triangle height
N = 1000; %// number of points
points = rand(N,2); %// sample uniformly in unit square
ind = points(:,2)>points(:,1); %// points to be unfolded
points(ind,:) = [2-points(ind,2) points(ind,1)]; %// unfold them
points(:,1) = x(1) + (x(2)-x(1))/2*points(:,1); %// stretch x as needed
points(:,2) = y(1) + (y(2)-y(1))*points(:,2); %// stretch y as needed
plot(points(:,1),points(:,2),'.')
We can generalize this case. If you want to sample points UNIFORMLY from an (n - 1)-dimensional simplex in Euclidean space (a triangle is the case n = 3), just sample a vector from a symmetric n-dimensional Dirichlet distribution with parameter 1: these are the barycentric (convex) coordinates relative to the vertices of the simplex. For a general convex polytope this construction is not uniform, so the polytope would first have to be split into simplices.

How to find the k-th nearest neighbor of a point in a set of points

I have a set of points (x,y) on a 2D plane. Given a point (x0,y0) and a number k, how do I find the k-th nearest neighbor of (x0,y0) in the point set? In detail, the point set is represented by two arrays, x and y. The point (x0,y0) is given by the index i0, meaning x0=x(i0) and y0=y(i0).
Is there a function in Matlab that helps with this problem? If Matlab doesn't have such a function, can you suggest any other efficient approach?
EDIT: I have to calculate this kind of distance for every point (x0,y0) in the set. The size of the set is about 1000. The value of k should be about sqrt(1500). The worst part is that I do this many times: at each iteration the set changes and I calculate the distances again. So running time is critical.
If you will do this check for many points, you might want to construct an inter-point distance table first:
squareform(pdist([x y]))
If you have the statistics toolbox, you can use the function knnsearch.
A brute force algorithm would be something like this:
array x[n] = ()
array y[n] = ()
array d[n] = ()
... populate x and y arrays with your n points ...
/* step over each point and calculate its distance from (x0, y0) */
for i = 1 to n
do
    d[i] = distance((x0, y0), (x[i], y[i]))
end
/* sort the distances in increasing order */
sort(d)
/* the k'th element of d is the distance to the k'th nearest point to (x0, y0) */
return d[k]
The free and opensource VLFeat toolbox contains a kd-tree implementation, amongst other useful things.
The brute force approach looks something like this:
%Create some points
n = 10;
x = randn(n,1);
y = randn(n,1);
%Choose x0
ix0 = randi(n);
%Get distances
d = sqrt(...
(x - x(ix0) ).^2 + ...
(y - y(ix0) ).^2 );
%Sort distances
[sorted_Distances, ixSort] = sort(d);
%Get kth point
k = 3;
kth = [x(ixSort(k+1)); y(ixSort(k+1))]; %+1 since the first element will always be the x0 element.