Defining an efficient distance function in MATLAB

I'm using the kNN search function in MATLAB, but I'm calculating the distance between two objects of my own defined class, so I've written a new distance function. This is it:
function d = allRepDistance(obj1, obj2)
% calculates the min dist. between repr.
% obj2 is a vector, to fit kNN function requirements
n = size(obj2,1);
d = zeros(n,1);
for i = 1:n
    M = dist(obj1.Repr, [obj2(i,:).Repr]');
    d(i) = min(min(M));
end
end
The difference is that obj.Repr may be a matrix, and I want to calculate the minimal distance between all the rows of each argument. But even if obj1.Repr is just a vector, which essentially gives the normal Euclidean distance between two vectors, the kNN function is slower by a factor of 200!
I've checked the performance of just the distance function (no kNN). I measured the time it takes to calculate the distance between a vector and the rows of a matrix (when they are wrapped in the object), and it runs slower by a factor of 3 than the normal distance function.
Does that make any sense? Is there a solution?

You are using dist(), which corresponds to the Euclidean distance weight function. However, you are not weighting your data, i.e. you don't consider that one dimension is more important than the others. Thus, you can directly use the Euclidean distance with pdist():
function d = allRepDistance(obj1, obj2)
% calculates the min dist. between repr.
% obj2 is a vector, to fit kNN function requirements
n = size(obj2,1);
d = zeros(n,1);
for i = 1:n
    X = [obj1.Repr, obj2(i,:).Repr'];
    M = pdist(X,'euclidean');
    d(i) = min(min(M));
end
end
BTW, I don't know your matrix dimensions, so you will need to deal with the concatenation of elements to create X correctly.
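If the Statistics Toolbox is available, here is a sketch of another option (assuming each object's Repr stores one representative per row): pdist2 computes the full matrix of distances between the rows of its two inputs, so the minimum of that matrix is exactly the quantity you want, without having to concatenate anything:
function d = allRepDistance(obj1, obj2)
% sketch: minimum pairwise Euclidean distance between rows of the two Repr matrices
n = size(obj2,1);
d = zeros(n,1);
for i = 1:n
    M = pdist2(obj1.Repr, obj2(i,:).Repr);  % rows of obj1.Repr vs rows of obj2(i).Repr
    d(i) = min(M(:));
end
end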

Related

MATLAB: Computing Euclidean distance in an efficient way?

What I am currently doing is computing the Euclidean distance between all elements in a vector (the elements are pixel locations in a 2D image) to see if the elements are close to each other. I create a reference vector that takes on the value of each index within the vector incrementally. The Euclidean distance between the reference vector and all the elements in the pixel location vector is computed using the MATLAB function pdist2, and the result is applied to some conditions; however, upon running the code, this function seems to be taking the longest to compute (i.e. for one run, the function was called 27,245 times and contributed about 54% of the overall program's run time). Is there a more efficient method to do this and speed up the program?
[~, n] = size(xArray); % xArray and yArray are the same size
% Pair the x and y coordinates of the interest pixels
pairLocations = [xArray; yArray].';
% Preallocate cells with the max amount (# of interest pixels)
p = cell(1,n);
for i = 1:n
    ref = [xArray(i), yArray(i)];
    d = pdist2(ref,pairLocations,'euclidean');
    d = d < dTh;
    d = find(d==1);
    [~,k] = size(d);
    if (k >= num)
        p{1,i} = d;
    end
end
For squared Euclidean distance, there is a trick using matrix dot product:
||a-b||² = <a-b, a-b> = ||a||² - 2<a,b> + ||b||²
Let C = [xArray; yArray]; a 2×n matrix of all locations, then
n2 = sum(C.^2); % sq norm of coordinates
D = bsxfun(@plus, n2, n2.') - 2 * C.' * C;
Now D(ii,jj) holds the square distance between point ii and point jj.
Should run quite quickly.
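As a sketch of how this could replace the pdist2 call inside the question's loop (dTh, num, xArray, yArray and p are the variables from the code above; note the threshold has to be squared because D holds squared distances):
C  = [xArray; yArray];                        % 2-by-n coordinates
n2 = sum(C.^2);                               % squared norms, 1-by-n
D  = bsxfun(@plus, n2, n2.') - 2 * (C.' * C); % n-by-n squared distances
p  = cell(1,n);
for i = 1:n
    d = find(D(i,:) < dTh^2);   % compare against the squared threshold
    if numel(d) >= num
        p{1,i} = d;
    end
end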

Mahalanobis distance in Matlab

I would like to calculate the mahalanobis distance of input feature vector Y (1x14) to all feature vectors in matrix X (18x14). Each 6 vectors of X represent one class (So I have 3 classes). Then based on mahalanobis distances I will choose the vector that is the nearest to the input and classify it to one of the three classes as well.
My problem is that when I use the following code I get only one value. How can I get the Mahalanobis distance between the input Y and every vector in X, so that at the end I have 18 values and can choose the smallest one? Any help will be appreciated. Thank you.
Note: I know that the Mahalanobis distance is a measure of the distance between a point P and a distribution D, but I don't know how this could be applied in my situation.
Y = test1; % Y: 1x14 vector
S = cov(X); % X: 18x14 matrix
mu = mean(X,1);
d = ((Y-mu)/S)*(Y-mu)'
I also tried to separate the matrix X into 3, so each part represents the feature vectors of one class. This is the code, but it doesn't work properly: I get 3 distances and some have negative values!
Y = test1;
X1 = Action1;
S1 = cov(X1);
mu1 = mean(X1,1);
d1 = ((Y-mu1)/S1)*(Y-mu1)'
X2 = Action2;
S2 = cov(X2);
mu2 = mean(X2,1);
d2 = ((Y-mu2)/S2)*(Y-mu2)'
X3= Action3;
S3 = cov(X3);
mu3 = mean(X3,1);
d3 = ((Y-mu3)/S3)*(Y-mu3)'
d= [d1,d2,d3];
MahalanobisDist= min(d)
One last thing: when I used the mahal function provided by MATLAB I got this warning:
Warning: Matrix is close to singular or badly scaled. Results may be inaccurate.
If you have to implement the distance yourself (a school assignment, for instance) this is of absolutely no use to you, but if you just need to calculate the distance as an intermediate step for other calculations, I highly recommend d = pdist2(a, b, distance_measure); the documentation is on MATLAB's site.
It computes the pairwise distance between a vector (or even a matrix) b and all elements in a, and stores them in a matrix d where the columns correspond to entries in b and the rows are entries from a. So d(i,j) is the distance between row i in a and row j in b (hope that made sense). If you want, it even takes parameters to find the k nearest neighbors; it's a great function.
In your case you would use the following code, and you'd end up with the distance between elements and the index as well:
%number of neighbors
K = 1;
% X = 18x14, Y = 1x14; dist and iidx are K x 1 (the K smallest distances and their row indices into X)
[dist, iidx] = pdist2(X,Y,'mahalanobis','smallest',K);
%to find the class, you can do something like this
num_samples_per_class = 6;
matching_class = ceil(iidx/ num_samples_per_class);

Matlab calculating nearest neighbour distance for all (u, v) vectors in an array

I am trying to calculate the distance between nearest neighbours within a nx2 matrix like the one shown below
point_coordinates =
11.4179 103.1400
16.7710 10.6691
16.6068 119.7024
25.1379 74.3382
30.3651 23.2635
31.7231 105.9109
31.8653 36.9388
% for loop going from the top of the vector column to the bottom
for counter = 1:size(point_coordinates,1)
    % current point selected
    current_point = point_coordinates(counter,:);
    % calculate the distance between the current point and all the points
    distance_search = point_coordinates - repmat(current_point,[size(point_coordinates,1) 1]);
    dist_from_current_point = sqrt(distance_search(:,1).^2 + distance_search(:,2).^2);
    % omit the self-subtraction that gives zero
    dist_from_current_point(dist_from_current_point <= 0) = [];
    % shortest distance calculated for a certain vector and current_point
    nearest_dist = min(dist_from_current_point);
end
%final line to plot the u,v vectors and the corresponding nearest neighbour
%distances
matnndist = [point_coordinates nearest_dist]
I am not sure how to structure the 'for' loop/nearest_neighbour line to be able to get the nearest neighbour distance for each u,v vector.
I would like to have, for example: for the first vector the coordinates and the corresponding shortest distance, for the second vector its own shortest distance, and so on up to n.
Hope someone can help.
Thanks
I understand you want to obtain the minimum distance between different points.
You can compute the distance for each pair of points with bsxfun; remove self-distances; minimize. It's more computationally efficient to work with squared distances, and take the square root only at the end.
n = size(point_coordinates,1);
dist = bsxfun(@minus, point_coordinates(:,1), point_coordinates(:,1).').^2 + ...
       bsxfun(@minus, point_coordinates(:,2), point_coordinates(:,2).').^2;
dist(1:n+1:end) = inf; %// remove self-distances
min_dist = sqrt(min(dist(:)));
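If you want one nearest-neighbour distance per point rather than the single overall minimum, a small variation of the above (a sketch) is to minimise along one dimension of the same matrix:
nearest_dist = sqrt(min(dist, [], 2));        % n-by-1: distance to each point's nearest neighbour
matnndist = [point_coordinates nearest_dist]; % coordinates with their nearest-neighbour distance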
Alternatively, you could use pdist. This avoids computing each distance twice, and also avoids self-distances:
dist = pdist(point_coordinates);
min_dist = min(dist(:));
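The same per-point result can be recovered from pdist by expanding its condensed output with squareform, for example:
D = squareform(pdist(point_coordinates)); % n-by-n symmetric distance matrix
D(1:n+1:end) = inf;                       % ignore self-distances
nearest_dist = min(D, [], 2);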
If I can suggest a built-in function, use knnsearch from the statistics toolbox. What you are essentially doing is a K-Nearest Neighbour (KNN) algorithm, but you are ignoring self-distances. The way you would call knnsearch is in the following way:
[idx,d] = knnsearch(X, Y, 'k', k);
In simple terms, the KNN algorithm returns the k closest points to your data set given a query point. Usually, the Euclidean distance is the distance metric that is used. For MATLAB's knnsearch, X is a 2D array that consists of your dataset where each row is an observation and each column is a variable. Y would be the query points. Y is also a 2D array where each row is a query point and you need to have the same number of columns as X. We would also specify the flag 'k' to denote how many closest points you want returned. By default, k = 1.
As such, idx would be a N x K matrix, where N is the total number of query points (number of rows of Y) and K would be those k closest points to the dataset for each query point we have. idx indicates the particular points in your dataset that were closest to each query. d is also a N x K matrix that returns the smallest distances for these corresponding closest points.
As such, what you want to do is find the closest point for your dataset to each of the other points, ignoring self-distances. Therefore, you would set both X and Y to be the same, and set k = 2, discarding the first column of both outputs to get the result you're looking for.
Therefore:
[idx,d] = knnsearch(point_coordinates, point_coordinates, 'k', 2)
idx = idx(:,2);
d = d(:,2);
We thus get for idx and d:
>> idx
idx =
3
5
1
1
7
3
5
>> d
d =
17.3562
18.5316
17.3562
31.9027
13.7573
20.4624
13.7573
As such, this tells us that for the first point in your data set, it matched with point #3 the best. This matched with the closest distance of 17.3562. For the second point in your data set, it matched with point #5 the best with the closest distance being 18.5316. You can continue on with the rest of the results in a similar pattern.
If you don't have access to the statistics toolbox, consider reading my StackOverflow post on how I compute KNN from first principles.
Finding K-nearest neighbors and its implementation
In fact, it is very similar to Luis Mendo's post to you earlier.
Good luck!

Numerical derivative of a vector

I have a problem with the numerical derivative of a vector x (Nx1) with respect to another vector t (time) that is the same size as x.
I do the following (x is chosen to be sine function as an example):
t=t0:ts:tf;
x=sin(t);
xd=diff(x)/ts;
but the answer xd is (N-1)x1, and I figured out that it does not compute the derivative corresponding to the first element of x.
Is there any other way to compute this derivative?
You are looking for the numerical gradient I assume.
t0 = 0;
ts = pi/10;
tf = 2*pi;
t = t0:ts:tf;
x = sin(t);
dx = gradient(x)/ts
The purpose of this function is a different one (vector fields), but it offers what diff doesn't: input and output vectors of equal length.
gradient calculates the central difference between data points. For an array, matrix, or vector with N values in each row, the ith value is defined by the central difference (x(i+1) - x(i-1))/2.
The gradient at the end points, where i=1 and i=N, is calculated with
a single-sided difference between the endpoint value and the next
adjacent value within the row. If two or more outputs are specified,
gradient also calculates central differences along other dimensions.
Unlike the diff function, gradient returns an array with the same
number of elements as the input.
I know I'm a little late to the game here, but you can also get an approximation of the numerical derivative by taking the derivative of the cubic spline that runs through your data:
function dy = splineDerivative(x,y)
% the spline has continuous first and second derivatives
pp = spline(x,y); % could also use pp = pchip(x,y);
[breaks,coefs,K,r,d] = unmkpp(pp);
% pre-allocate the coefficient vector
dCoeff = zeros(K,r-1);
% Columns are ordered from highest to lowest power. Both spline and pchip
% return 4xn matrices, ordered from 3rd to zeroth power. (Thanks to the
% anonymous person who suggested this edit).
dCoeff(:, 1) = 3 * coefs(:, 1); % d(ax^3)/dx = 3ax^2;
dCoeff(:, 2) = 2 * coefs(:, 2); % d(ax^2)/dx = 2ax;
dCoeff(:, 3) = 1 * coefs(:, 3); % d(ax^1)/dx = a;
dpp = mkpp(breaks,dCoeff,d);
dy = ppval(dpp,x);
The spline polynomial is always guaranteed to have continuous first and second derivatives at each point. I have not tested and compared this against using pchip instead of spline, but that might be another option as it too has continuous first derivatives (but not second derivatives) at every point.
The advantage of this is that there is no requirement that the step size be even.
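As a hypothetical usage example on the sine data from the question (assuming the function above is saved as splineDerivative.m):
t = 0:pi/10:2*pi;
x = sin(t);
xd = splineDerivative(t, x); % same length as x, approximates cos(t)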
There are some options to work around your issue.
First: you can make your domain larger. Instead of N, use N+1 gridpoints.
Second: depending on the endpoint of interest, you can use a one-sided difference:
Forward difference: (F(x + dx) - F(x)) / dx
Backward difference: (F(x) - F(x - dx)) / dx
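A minimal sketch of the second option applied to the example in the question: keep diff for the interior and pad the missing first value with a forward difference, so xd ends up the same length as x (this assumes the first element is the endpoint you care about):
xd = diff(x)/ts;           % backward differences for samples 2..N
xd = [(x(2)-x(1))/ts, xd]; % forward difference for the first sample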

Mixture of Gaussians (EM): how to calculate the responsibilities

I have an assignment to implement MoG with EM in MATLAB. The assignment:
My code at the moment:
clear
clc
load('data2')
%% INITIALIZE
K = 20
pi = 0.01:((1-0.01)/K):1;
for k = 1:20
    sigma{k} = eye(2);
    mu(k,:) = [rand(1),rand(1)];
end
%% Posterior over the latent variables
addition = 0;
for k = 1:20
    addition = addition + (pi(k)*mvnpdf(x, mu(k,:), sigma{k}));
end
test = 0;
for k = 1:20
    gamma{k} = (pi(k)*mvnpdf(x, mu(k,:), sigma{k})) ./ addition;
end
The data has 1000 rows and 2 columns (so 1000 datapoints). My question is now: how do I calculate the responsibilities? When I try to calculate the covariance matrix I get a 1x1000 matrix, while I believe the covariance matrix should be 2x2.
Unfortunately, I don't speak Matlab, so I can't really see where your code is incorrect, but I can answer generally (and maybe someone who knows Matlab can see if your code can be salvaged). Each datapoint has a gamma associated with it, which is the expectation of an indicator variable for each component in the mixture. Calculating them is pretty simple: for the i-th datapoint and the k-th component, gamma_ik is just the density of the k-th component at the i-th point, multiplied by the k-th mixture coefficient (the prior probability that the point came from the k-th component, which is pi in your assignment), normalised by this quantity computed over all k. Thus for each datapoint, you have a vector of responsibilities (of length k) with a sum of one.
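For reference, a minimal MATLAB sketch of that computation, reusing x, pi, mu and sigma as initialised in the question and storing the responsibilities as an N-by-K matrix instead of a cell array (the elementwise division relies on implicit expansion, R2016b and later; use bsxfun on older versions):
K = 20;
numer = zeros(size(x,1), K);
for k = 1:K
    % pi_k * N(x_i | mu_k, Sigma_k) for every datapoint i
    numer(:,k) = pi(k) * mvnpdf(x, mu(k,:), sigma{k});
end
gamma = numer ./ sum(numer, 2); % each row sums to one: the responsibilities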