Errors using distance metric in MATLAB - matlab

My test feature vector is 'testpg' and trained feature vector is 'trainpg' and both are of dimension 2000*1 .I am aiming to find the distance between the two histogram feature vectors and hence i do
distance = norm(trainpg-testpg)
Next I compare it to a scalar threshold value to check whether it satisfies my condition , The above code works well as i get a scalar value of this distance ie : for eg distance = 5.4 which is a scalar
But when I change code to use any other histogram based distance metric it doesn't work
i used the pdist2 function from http://www.mathworks.com/matlabcentral/fileexchange/29004-feature-points-in-image--keypoint-extraction/content/FPS_in_image/FPS%20in%20image/Help%20Functions/SearchingMatches/pdist2.m
The new code I used is
distance = pdist2(trainpg,testpg, 'chisq')
d = size(distance)
Here I am getting subscripted assignment dimension mismatch error as the dimensions of my distance are now 2000*2000 instead of 1*1
How could I get scalar value for the distances?

If the dimensions are 2000 x 1 for both of your feature vectors, then you should do:
distance = pdist2(trainpg.',testpg.', 'chisq');
pdist2 considers each row as a different sample. Thus you need to take the transpose, telling MATLAB that its a 2000-D feature of a single example rather than 1-D feature for 2000 examples. As a side note, you may want to check Histogram intersection distance, if your feature is a histogram. Look here for the code.

Related

Matlab ksdensity for bivariate data with more evaluation points

I'm running Matlab code for kernel density, i.e., [f,xi] = ksdensity(x), where x is a two column bivariate data. The resulting output f is the density vector, while xi is the meshgrid of evaluation points that is 30x30 in dimension. See the documentation here: Link.
I'm trying to increase number of evaluation points that I receive from this code. There is an option mentioned in the documentation called 'NumPoints' that is only applicable for univariate data. Is there an option or ways that I can increase the meshgrid points of evaluation points of bivariate data to, say, 100x100?
You need to use the optional second input argument pts to specify the range and number of the output points in your grid. See this example in the documentation. Depending on your input data, you could specify something like this:
pts = [linspace(min(x(:,1)),max(x(:,1)),1000).' linspace(min(x(:,2)),max(x(:,2)),1000).'];
The NumPoints is npoints in the ksdensity(). e.g., [f,xi] = ksdensity(x, 'npoints', 1000) will return 1000 points of xi and f.

How to order one dimensional matrices base on values

I want to determine a point in space by geometry and I have math computations that gives me several theta values. After evaluating the theta values, I could get N 1 x 3 dimension matrix where N is the number of theta evaluated. Since I have my targeted point, I only need to decide which of the matrices is closest to the target with adequate focus on the three coordinates (x,y,z).
Take a view of the analysis in the figure below:
Fig 1: Determining Closest Point with all points having minimal error
It can easily be seen that the third matrix is closest using sum(abs(Matrix[x,y,z])). However, if the method is applied on another figure given below, obviously, the result is wrong.
Fig 2: One Point has closest values with 2-axes of the reference point
Looking at point B, it is closer to the reference point on y-,z- axes but just that it strayed greatly on x-axis.
So how can I evaluate the matrices and select the closest one to point of reference and adequate emphasis will be on error differences in all coordinates (x,y,z)?
If your results is in terms of (x,y,z), why don't evaluate the euclidean distance of each matrix you have obtained from the reference point?
Sort of matlab code:
Ref_point = [48.98, 20.56, -1.44];
Curr_point = [x,y,z];
Xd = (x-Ref_point(1))^2 ;
Yd = (y-Ref_point(2))^2 ;
Zd = (z-Ref_point(3))^2 ;
distance = sqrt(Xd + Yd + Zd);
%find the minimum distance

Generate random samples from arbitrary discrete probability density function in Matlab

I've got an arbitrary probability density function discretized as a matrix in Matlab, that means that for every pair x,y the probability is stored in the matrix:
A(x,y) = probability
This is a 100x100 matrix, and I would like to be able to generate random samples of two dimensions (x,y) out of this matrix and also, if possible, to be able to calculate the mean and other moments of the PDF. I want to do this because after resampling, I want to fit the samples to an approximated Gaussian Mixture Model.
I've been looking everywhere but I haven't found anything as specific as this. I hope you may be able to help me.
Thank you.
If you really have a discrete probably density function defined by A (as opposed to a continuous probability density function that is merely described by A), you can "cheat" by turning your 2D problem into a 1D problem.
%define the possible values for the (x,y) pair
row_vals = [1:size(A,1)]'*ones(1,size(A,2)); %all x values
col_vals = ones(size(A,1),1)*[1:size(A,2)]; %all y values
%convert your 2D problem into a 1D problem
A = A(:);
row_vals = row_vals(:);
col_vals = col_vals(:);
%calculate your fake 1D CDF, assumes sum(A(:))==1
CDF = cumsum(A); %remember, first term out of of cumsum is not zero
%because of the operation we're doing below (interp1 followed by ceil)
%we need the CDF to start at zero
CDF = [0; CDF(:)];
%generate random values
N_vals = 1000; %give me 1000 values
rand_vals = rand(N_vals,1); %spans zero to one
%look into CDF to see which index the rand val corresponds to
out_val = interp1(CDF,[0:1/(length(CDF)-1):1],rand_vals); %spans zero to one
ind = ceil(out_val*length(A));
%using the inds, you can lookup each pair of values
xy_values = [row_vals(ind) col_vals(ind)];
I hope that this helps!
Chip
I don't believe matlab has built-in functionality for generating multivariate random variables with arbitrary distribution. As a matter of fact, the same is true for univariate random numbers. But while the latter can be easily generated based on the cumulative distribution function, the CDF does not exist for multivariate distributions, so generating such numbers is much more messy (the main problem is the fact that 2 or more variables have correlation). So this part of your question is far beyond the scope of this site.
Since half an answer is better than no answer, here's how you can compute the mean and higher moments numerically using matlab:
%generate some dummy input
xv=linspace(-50,50,101);
yv=linspace(-30,30,100);
[x y]=meshgrid(xv,yv);
%define a discretized two-hump Gaussian distribution
A=floor(15*exp(-((x-10).^2+y.^2)/100)+15*exp(-((x+25).^2+y.^2)/100));
A=A/sum(A(:)); %normalized to sum to 1
%plot it if you like
%figure;
%surf(x,y,A)
%actual half-answer starts here
%get normalized pdf
weight=trapz(xv,trapz(yv,A));
A=A/weight; %A normalized to 1 according to trapz^2
%mean
mean_x=trapz(xv,trapz(yv,A.*x));
mean_y=trapz(xv,trapz(yv,A.*y));
So, the point is that you can perform a double integral on a rectangular mesh using two consecutive calls to trapz. This allows you to compute the integral of any quantity that has the same shape as your mesh, but a drawback is that vector components have to be computed independently. If you only wish to compute things which can be parametrized with x and y (which are naturally the same size as you mesh), then you can get along without having to do any additional thinking.
You could also define a function for the integration:
function res=trapz2(xv,yv,A,arg)
if ~isscalar(arg) && any(size(arg)~=size(A))
error('Size of A and var must be the same!')
end
res=trapz(xv,trapz(yv,A.*arg));
end
This way you can compute stuff like
weight=trapz2(xv,yv,A,1);
mean_x=trapz2(xv,yv,A,x);
NOTE: the reason I used a 101x100 mesh in the example is that the double call to trapz should be performed in the proper order. If you interchange xv and yv in the calls, you get the wrong answer due to inconsistency with the definition of A, but this will not be evident if A is square. I suggest avoiding symmetric quantities during the development stage.

Mahalanobis distance in matlab: pdist2() vs. mahal() function

I have two matrices X and Y. Both represent a number of positions in 3D-space. X is a 50*3 matrix, Y is a 60*3 matrix.
My question: why does applying the mean-function over the output of pdist2() in combination with 'Mahalanobis' not give the result obtained with mahal()?
More details on what I'm trying to do below, as well as the code I used to test this.
Let's suppose the 60 observations in matrix Y are obtained after an experimental manipulation of some kind. I'm trying to assess whether this manipulation had a significant effect on the positions observed in Y. Therefore, I used pdist2(X,X,'Mahalanobis') to compare X to X to obtain a baseline, and later, X to Y (with X the reference matrix: pdist2(X,Y,'Mahalanobis')), and I plotted both distributions to have a look at the overlap.
Subsequently, I calculated the mean Mahalanobis distance for both distributions and the 95% CI and did a t-test and Kolmogorov-Smirnoff test to asses if the difference between the distributions was significant. This seemed very intuitive to me, however, when testing with mahal(), I get different values, although the reference matrix is the same. I don't get what the difference between both ways of calculating mahalanobis distance is exactly.
Comment that is too long #3lectrologos:
You mean this: d(I) = (Y(I,:)-mu)inv(SIGMA)(Y(I,:)-mu)'? This is just the formula for calculating mahalanobis, so should be the same for pdist2() and mahal() functions. I think mu is a scalar and SIGMA is a matrix based on the reference distribution as a whole in both pdist2() and mahal(). Only in mahal you are comparing each point of your sample set to the points of the reference distribution, while in pdist2 you are making pairwise comparisons based on a reference distribution. Actually, with my purpose in my mind, I think I should go for mahal() instead of pdist2(). I can interpret a pairwise distance based on a reference distribution, but I don't think it's what I need here.
% test pdist2 vs. mahal in matlab
% the purpose of this script is to see whether the average over the rows of E equals the values in d...
% data
X = []; % 50*3 matrix, data omitted
Y = []; % 60*3 matrix, data omitted
% calculations
S = nancov(X);
% mahal()
d = mahal(Y,X); % gives an 60*1 matrix with a value for each Cartesian element in Y (second matrix is always the reference matrix)
% pairwise mahalanobis distance with pdist2()
E = pdist2(X,Y,'mahalanobis',S); % outputs an 50*60 matrix with each ij-th element the pairwise distance between element X(i,:) and Y(j,:) based on the covariance matrix of X: nancov(X)
%{
so this is harder to interpret than mahal(), as elements of Y are not just compared to the "mahalanobis-centroid" based on X,
% but to each individual element of X
% so the purpose of this script is to see whether the average over the rows of E equals the values in d...
%}
F = mean(E); % now I averaged over the rows, which means, over all values of X, the reference matrix
mean(d)
mean(E(:)) % not equal to mean(d)
d-F' % not zero
% plot output
figure(1)
plot(d,'bo'), hold on
plot(mean(E),'ro')
legend('mahal()','avaraged over all x values pdist2()')
ylabel('Mahalanobis distance')
figure(2)
plot(d,'bo'), hold on
plot(E','ro')
plot(d,'bo','MarkerFaceColor','b')
xlabel('values in matrix Y (Yi) ... or ... pairwise comparison Yi. (Yi vs. all Xi values)')
ylabel('Mahalanobis distance')
legend('mahal()','pdist2()')
One immediate difference between the two is that mahal subtracts the sample mean of X from each point in Y before computing distances.
Try something like E = pdist2(X,Y-mean(X),'mahalanobis',S); to see if that gives you the same results as mahal.
Note that
mahal(X,Y)
is equivalent to
pdist2(X,mean(Y),'mahalanobis',cov(Y)).^2
Well, I guess there are two different ways to calculate mahalanobis distance between two clusters of data like you explain above:
1) you compare each data point from your sample set to mu and sigma matrices calculated from your reference distribution (although labeling one cluster sample set and the other reference distribution may be arbitrary), thereby calculating the distance from each point to this so called mahalanobis-centroid of the reference distribution.
2) you compare each datapoint from matrix Y to each datapoint of matrix X, with, X the reference distribution (mu and sigma are calculated from X only)
The values of the distances will be different, but I guess the ordinal order of dissimilarity between clusters is preserved when using either method 1 or 2? I actually wonder when comparing 10 different clusters to a reference matrix X, or to each other, if the order of the dissimilarities would differ using method 1 or method 2? Also, I can't imagine a situation where one method would be wrong and the other method not. Although method 1 seems more intuitive in some situations, like mine.

Matlab counting the distance

I have a problem in computing the distance between two different matrices. The first matrix is 5000x6, the second matrix is 5x80.
I want to use this syntax to calculate the distances:
pdist2(mCe(1,:),row);
But this gives me an error saying "columns in x have to be same in y".
Is there a way to compute the distances when the matrices have a different amount of columns?
The pdist2 function calculates the distance between a set of points based on a metric. A metric is a function of 2 vector arguments from the same metric space and as such they are required to have the same dimension. What you want to do is not possible based on the definition of a metric. Read this link for more details
http://en.wikipedia.org/wiki/Metric_space