How to calculate normalized euclidean distance on two vectors? - matlab

Let's say I have the following two vectors:
x = [(10-1).*rand(7,1) + 1; randi(10,1,1)];
y = [(10-1).*rand(7,1) + 1; randi(10,1,1)];
The first seven elements are continuous values in the range [1,10]. The last element is an integer in the range [1,10].
Now I would like to compute the euclidean distance between x and y. I think the integer element is a problem because all other elements can get very close but the integer element has always spacings of ones. So there is a bias towards the integer element.
How can I calculate something like a normalized euclidean distance on it?

According to Wolfram Alpha and the following answer from Cross Validated, the normalized Euclidean distance is defined by:
NED^2(x, y) = 0.5 * var(x - y) / (var(x) + var(y))
You can calculate it with MATLAB by using:
0.5*(std(x-y)^2) / (std(x)^2+std(y)^2)
Alternatively, you can use:
0.5*((norm((x-mean(x))-(y-mean(y)))^2)/(norm(x-mean(x))^2+norm(y-mean(y))^2))
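As a quick sanity check, the two expressions above can be compared numerically. This is only a minimal sketch; the variable names d_std, d_norm, xc and yc are my own and not part of the original answer.

```matlab
% Example vectors built as in the question.
x = [(10-1).*rand(7,1) + 1; randi(10,1,1)];
y = [(10-1).*rand(7,1) + 1; randi(10,1,1)];

% Variance-based form.
d_std = 0.5*(std(x-y)^2) / (std(x)^2 + std(y)^2);

% Norm-based form on mean-centred vectors.
xc = x - mean(x);
yc = y - mean(y);
d_norm = 0.5*(norm(xc - yc)^2) / (norm(xc)^2 + norm(yc)^2);

% The 1/(n-1) factors of the variances cancel, so both forms
% should agree up to rounding error.
fprintf('difference = %g\n', abs(d_std - d_norm));
```

Both forms give a value between 0 and 1, since var(x - y) can never exceed 2*(var(x) + var(y)).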

I would rather normalise x and y before calculating the distance and then vanilla Euclidean would suffice.
In your example
x_norm = (x -1) / 9; % normalised x
y_norm = (y -1) / 9; % normalised y
dist = norm(x_norm - y_norm); % Euclidean distance between normalised x, y
However, I am not sure about whether having an integer element contributes to some sort of bias but we have already gotten kind of off-topic for stack overflow :)

From Euclidean Distance - raw, normalized and double‐scaled coefficients
SYSTAT, Primer 5, and SPSS provide Normalization options for the data so as to permit an investigator to compute a distance
coefficient which is essentially “scale free”. Systat 10.2’s
normalised Euclidean distance produces its “normalisation” by dividing
each squared discrepancy between attributes or persons by the total
number of squared discrepancies (or sample size).
Frankly, I can see little point in this standardization – as the final
coefficient still remains scale‐sensitive. That is, it is impossible
to know whether the value indicates high or low dissimilarity from the
coefficient value alone.

Related

How to plot a sinus uniformly in MATLAB?

If I plot a sine like this
x=0:0.05:2*pi;
y=sin(x);
plot(x,y,'.-')
I obviously get a non-uniform density of points. Please see the attachment.
What I want is for the points to be equidistant from each other. So I need to define the x array somehow, or is there another way?
The point density is uniform in x. If you want the points to be uniform in y, you could use:
y=-1:.05:1;
plot(asin(y),y,'o')
But then the points aren't uniform in x.
EDIT: Just for fun or for any future readers, to get points uniform in overall distance, the distance between points is d=sqrt(h^2+(f(x+h)-f(x))^2) which is approximately d=h*sqrt(1+f'(x)^2), i.e. h=d/sqrt(1+cos(x)^2) in this case. The curve length is the integral of sqrt(1+f'(x)^2) which in this case is 4*sqrt(2)*ellipticE(1/2) = 7.6404:
N = 100;
d = 7.6404/N;
x = zeros(1,N);
for n = 2:N
x(n) = x(n-1) + d/sqrt(1+cos(x(n-1))^2);
end
y = sin(x);
plot(x,y,'x')
You can check that the distance between points is approximately constant by looking at sqrt(diff(y).^2+diff(x).^2). It's only approximate because of the use of the derivative (at the left endpoint of the interval at that) for the distance, but this gets better as N increases. To get the distance exact, we'd need to numerically solve a trig equation for each point. The curve length is also affected by the approximation and tends to miss the last point.
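One way to sidestep solving an equation for each point is to resample by cumulative arc length: sample the curve densely, accumulate the segment lengths, and interpolate x at equally spaced arc-length values. The sketch below is an alternative to the recurrence above, not the answerer's method; the dense grid size (1e4) and the names xf, yf, s, sq and seg are my own choices.

```matlab
N  = 100;                                    % number of output points
xf = linspace(0, 2*pi, 1e4);                 % dense helper grid
yf = sin(xf);
s  = [0, cumsum(hypot(diff(xf), diff(yf)))]; % cumulative arc length on the grid
sq = linspace(0, s(end), N);                 % equally spaced arc lengths
x  = interp1(s, xf, sq);                     % x positions at those arc lengths
y  = sin(x);
seg = hypot(diff(x), diff(y));               % chord length between neighbours
fprintf('curve length ~ %.4f, relative spacing spread = %.2e\n', ...
        s(end), (max(seg) - min(seg)) / mean(seg));
% plot(x, y, 'x') shows the resampled curve
```

The points are equally spaced along the curve, so the straight-line distances between neighbours differ only slightly (by the curvature of each short segment), and the last point lands on x = 2*pi.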

Matlab calculating nearest neighbour distance for all (u, v) vectors in an array

I am trying to calculate the distance between nearest neighbours within a nx2 matrix like the one shown below
point_coordinates =
11.4179 103.1400
16.7710 10.6691
16.6068 119.7024
25.1379 74.3382
30.3651 23.2635
31.7231 105.9109
31.8653 36.9388
%for loop going from the top of the vector column to the bottom
for counter = 1:size(point_coordinates,1)
%current point defined selected
current_point = point_coordinates(counter,:);
%math to calculate distance between the current point and all the points
distance_search= point_coordinates-repmat(current_point,[size(point_coordinates,1) 1]);
dist_from_current_point = sqrt(distance_search(:,1).^2+distance_search(:,2).^2);
%line to omit self subtraction that gives zero
dist_from_current_point (dist_from_current_point <= 0)=[];
%gives the shortest distance calculated for a certain vector and current_point
nearest_dist=min(dist_from_current_point);
end
%final line to plot the u,v vectors and the corresponding nearest neighbour
%distances
matnndist = [point_coordinates nearest_dist]
I am not sure how to structure the 'for' loop/nearest_neighbour line to be able to get the nearest neighbour distance for each u,v vector.
I would like to have, for example ;
for the first vector you could have the coordinates and the corresponding shortest distance, for the second vector another its shortest distance, and this goes on till n
Hope someone can help.
Thanks
I understand you want to obtain the minimum distance between different points.
You can compute the distance for each pair of points with bsxfun; remove self-distances; minimize. It's more computationally efficient to work with squared distances, and take the square root only at the end.
n = size(point_coordinates,1);
dist = bsxfun(@minus, point_coordinates(:,1), point_coordinates(:,1).').^2 + ...
bsxfun(@minus, point_coordinates(:,2), point_coordinates(:,2).').^2;
dist(1:n+1:end) = inf; %// remove self-distances
min_dist = sqrt(min(dist(:)));
Alternatively, you could use pdist. This avoids computing each distance twice, and also avoids self-distances:
dist = pdist(point_coordinates);
min_dist = min(dist(:));
If I can suggest a built-in function, use knnsearch from the statistics toolbox. What you are essentially doing is a K-Nearest Neighbour (KNN) algorithm, but you are ignoring self-distances. The way you would call knnsearch is in the following way:
[idx,d] = knnsearch(X, Y, 'k', k);
In simple terms, the KNN algorithm returns the k closest points to your data set given a query point. Usually, the Euclidean distance is the distance metric that is used. For MATLAB's knnsearch, X is a 2D array that consists of your dataset where each row is an observation and each column is a variable. Y would be the query points. Y is also a 2D array where each row is a query point and you need to have the same number of columns as X. We would also specify the flag 'k' to denote how many closest points you want returned. By default, k = 1.
As such, idx would be a N x K matrix, where N is the total number of query points (number of rows of Y) and K would be those k closest points to the dataset for each query point we have. idx indicates the particular points in your dataset that were closest to each query. d is also a N x K matrix that returns the smallest distances for these corresponding closest points.
As such, what you want to do is find the closest point for your dataset to each of the other points, ignoring self-distances. Therefore, you would set both X and Y to be the same, and set k = 2, discarding the first column of both outputs to get the result you're looking for.
Therefore:
[idx,d] = knnsearch(point_coordinates, point_coordinates, 'k', 2)
idx = idx(:,2);
d = d(:,2);
We thus get for idx and d:
>> idx
idx =
3
5
1
1
7
3
5
>> d
d =
17.3562
18.5316
17.3562
31.9027
13.7573
20.4624
13.7573
As such, this tells us that for the first point in your data set, it matched with point #3 the best. This matched with the closest distance of 17.3562. For the second point in your data set, it matched with point #5 the best with the closest distance being 18.5316. You can continue on with the rest of the results in a similar pattern.
If you don't have access to the statistics toolbox, consider reading my StackOverflow post on how I compute KNN from first principles.
Finding K-nearest neighbors and its implementation
In fact, it is very similar to Luis Mendo's post to you earlier.
Good luck!
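For readers without the statistics toolbox, the per-point result can also be sketched with bsxfun alone, by taking the minimum of each row of the pairwise distance matrix instead of the global minimum. This is my own combination of the two answers above; nearest_idx, nearest_dist, dist2 and min_dist2 are names I introduced. It should reproduce the idx and d columns shown in the knnsearch output.

```matlab
point_coordinates = [11.4179 103.1400
                     16.7710  10.6691
                     16.6068 119.7024
                     25.1379  74.3382
                     30.3651  23.2635
                     31.7231 105.9109
                     31.8653  36.9388];
n = size(point_coordinates, 1);

% Squared pairwise distances: row i holds the distances from point i.
dist2 = bsxfun(@minus, point_coordinates(:,1), point_coordinates(:,1).').^2 + ...
        bsxfun(@minus, point_coordinates(:,2), point_coordinates(:,2).').^2;
dist2(1:n+1:end) = inf;                        % ignore self-distances

[min_dist2, nearest_idx] = min(dist2, [], 2);  % minimum along each row
nearest_dist = sqrt(min_dist2);                % square root only at the end

% Coordinates next to their nearest-neighbour distance, as the question asked.
matnndist = [point_coordinates nearest_dist]
```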

How to find the frequency response of the Rosenberg Glottal Model

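Is there an easy way to calculate the frequency response of the following function?
g[n] = 0.5*(1 - cos(pi*n/N1)) for 0 <= n <= N1
g[n] = cos(pi*(n - N1)/(2*N2)) for N1 < n <= N1 + N2
g[n] = 0 otherwise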
I tried using heaviside function but with no luck.
Basically I want to write a function to return the frequency response based on input N1 and N2 and also the number of points (lets say x) between 0 and pi
The output would be a vector which returns x values for the frequency response for corresponding frequencies => 0:pi/x:pi
Assuming that N1 + N2 < num_points, where num_points is the length of the sequence, you can simply write the function like so:
function [gr] = rosenburg(N1, N2, num_points)
gr = zeros(num_points,1);
range1 = 0:N1;
range2 = N1+1:N1+N2;
gr(range1+1) = 0.5*(1 - cos(pi*range1/N1));
gr(range2+1) = cos(pi*(range2-N1) / (2*N2));
end
The function rosenburg takes in N1, N2 and the total number of points you want the sequence to have, num_points. The code first allocates an all-zero array of size num_points. We then compute two linear ranges: one for 0 <= n <= N1 and the other for N1 < n <= N1 + N2. Note that the second range starts at N1 + 1 because we have already computed the value at n = N1. Once we have these ranges, we simply apply the right relationship in the right range. Note that when I assign the values to the correct intervals in the array, I need to offset the indices by 1 because MATLAB begins indexing arrays at index 1. The rest of the values are zero due to the initialization at the beginning of the function.
Now, if you want to find the frequency response of this signal, just use fft which is the Fast Fourier Transform. It's the classic method to find the frequency domain version of a discrete input signal on a numerical basis. As such, once you create your signal using the rosenburg function, then throw this into the FFT function. How you call it is like so:
X = fft(gr);
This computes the N point FFT, where N is the length of the signal gr. Alternatively, you can provide the number of points you want to compute the FFT for. Specifically:
X = fft(gr, N);
Basically, the higher N is, the finer or granular the frequency components will be. Note that the frequency axis is normalized between 0 to 2*pi, and so the higher N is, the finer resolution you will have between neighbouring points on the axis. Specifically, each point on this axis has the following frequency:
w = i*(2*pi)/x;
i would be the index on the x-axis (0, 1, 2, ..., x-1) and x would be the total number of points for the FFT. Normally, people show the spectrum between -pi <= w <= pi, and so some people apply fftshift to shift the spectrum so that the DC component is located at the centre of the spectrum, which is how we naturally perceive the spectrum to be.
When you say "frequency response", I believe you are referring to the magnitude, and so use abs to calculate the complex magnitude of each value, as the fft is generally complex valued. Therefore, assuming that you wish to compute the FFT to be as many points as the length of your signal, and let's say we choose N1 = 4, N2 = 8 and we want 64 points, and we want to plot the spectrum. Simply do this:
gr = rosenburg(4, 8, 64);
X = fft(gr);
Xshift = fftshift(X);
plot(linspace(-pi,pi,64), abs(Xshift));
grid;
The above code will shift the spectrum, then plot its magnitude between -pi to pi. This is what I get:
As an illustration, this is what the spectrum looks like before we apply fftshift:
Here's the code to generate the above figure:
plot(linspace(0,2*pi,64), abs(X));
grid;
You can see that the magnitude spectrum is symmetric. Right at the frequency pi, you can see that it is mirror reflected, which makes sense as the range from pi to 2*pi maps precisely to -pi to 0. Because the signal is real, the spectrum is Hermitian (conjugate) symmetric, so its magnitude is symmetric. Obviously, the frequency components are a bit sparsely spaced. It may be better to increase the total number of points to something like 256. This is what I get when I change the number of points to 256:
Pretty smooth! Now, if you want to extract the frequency components from 0 to pi, you need to extract half of the frequency decomposition that is stored in X. Therefore, you would simply do:
f = X(1:numel(X)/2);
numel determines how many elements are in an array or matrix. However, remember that each frequency point was defined as:
w = i*(2*pi)/x
You specifically want:
w = i*pi/x
As such, you'll need to compute the FFT at twice the size of your signal first, then extract half of the spectrum in the same way. For example, for 64 points:
gr = rosenburg(4, 8, 64);
X = fft(gr, 128);
f = X(1:numel(X)/2);
This should hopefully get you started. Good luck!
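Putting the pieces together for the original request (x magnitude values on the grid 0, pi/x, 2*pi/x, ...), a minimal sketch could look like the following. The names num_freqs, H and w are hypothetical, and the signal is built inline so that the snippet runs on its own. Note that this grid stops one step short of pi, at pi - pi/x.

```matlab
N1 = 4; N2 = 8;
num_points = 64;                 % length of the time-domain sequence
num_freqs  = 64;                 % hypothetical: number of points x on [0, pi)

% Rosenberg pulse, same relationships as in the rosenburg function above.
gr = zeros(num_points, 1);
range1 = 0:N1;
range2 = N1+1:N1+N2;
gr(range1+1) = 0.5*(1 - cos(pi*range1/N1));
gr(range2+1) = cos(pi*(range2-N1) / (2*N2));

% A 2*num_freqs point FFT has bin spacing 2*pi/(2*num_freqs) = pi/num_freqs.
X = fft(gr, 2*num_freqs);
H = abs(X(1:num_freqs));                 % magnitude on the first half
w = (0:num_freqs-1).' * pi/num_freqs;    % matching frequency axis
% plot(w, H); grid;  shows the magnitude response
```

This assumes 2*num_freqs is at least num_points; fft truncates the signal otherwise.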

Cosine distance as vector distance function for k-means

I have a graph of N vertices where each vertex represents a place. Also I have vectors, one per user, each one of N coefficients where the coefficient's value is the duration in seconds spent at the corresponding place or 0 if that place was not visited.
E.g. for the graph:
the vector:
v1 = {100, 50, 0, 30, 0}
would mean that we spent:
100secs at vertex 1
50secs at vertex 2 and
30secs at vertex 4
(vertices 3 & 5 were not visited, thus the 0s).
I want to run a k-means clustering and I've chosen cosine_distance = 1 - cosine_similarity as the metric for the distances, where the formula for cosine_similarity is:
cosine_similarity(a, b) = sum(a .* b) / (norm(a) * norm(b))
as described here.
But I noticed the following. Assume k=2 and one of the vectors is:
v1 = {90,0,0,0,0}
In the process of solving the optimization problem of minimizing the total distance from candidate centroids, assume that at some point, 2 candidate centroids are:
c1 = {90,90,90,90,90}
c2 = {1000, 1000, 1000, 1000, 1000}
Running the cosine_distance formula for (v1, c1) and (v1, c2) we get exactly the same distance of 0.5527864045 for both.
I would assume that v1 is more similar (closer) to c1 than c2. Apparently this is not the case.
Q1. Why is this assumption wrong?
Q2. Is the cosine distance a correct distance function for this case?
Q3. What would be a better one given the nature of the problem?
Let's divide cosine similarity into parts and see how and why it works.
Cosine between 2 vectors - a and b - is defined as:
cos(a, b) = sum(a .* b) / (length(a) * length(b))
where .* is element-wise multiplication and length() is the Euclidean length (norm) of a vector. The denominator is here just for normalization, so let's simply call it L. With it our function turns into:
cos(a, b) = sum(a .* b) / L
which, in its turn, may be rewritten as:
cos(a, b) = (a[1]*b[1] + a[2]*b[2] + ... + a[k]*b[k]) / L =
= a[1]*b[1]/L + a[2]*b[2]/L + ... + a[k]*b[k]/L
Let's get a bit more abstract and replace x * y / L with function g(x, y) (L here is constant, so we don't put it as function argument). Our cosine function thus becomes:
cos(a, b) = g(a[1], b[1]) + g(a[2], b[2]) + ... + g(a[k], b[k])
That is, each pair of elements (a[i], b[i]) is treated separately, and result is simply sum of all treatments. And this is good for your case, because you don't want different pairs (different vertices) to mess with each other: if user1 visited only vertex2 and user2 - only vertex1, then they have nothing in common, and similarity between them should be zero. What you actually don't like is how similarity between individual pairs - i.e. function g() - is calculated.
With cosine function similarity between individual pairs looks like this:
g(x, y) = x * y / L
where x and y represent the time users spent at the vertex. And here's the main question: does multiplication represent similarity between individual pairs well? I don't think so. A user who spent 90 seconds at some vertex should be close to a user who spent, say, 70 or 110 seconds there, but much farther from users who spent 1000 or 0 seconds there. Multiplication (even normalized by L) is totally misleading here. What does it even mean to multiply 2 time periods?
The good news is that you are the one who designs the similarity function. We have already decided that we are satisfied with independent treatment of pairs (vertices), and we only want the individual similarity function g(x, y) to do something reasonable with its arguments. And what is a reasonable function to compare time periods? I'd say subtraction is a good candidate:
g(x, y) = abs(x - y)
This is not a similarity function but a distance function (the closer the values are to each other, the smaller the result of g()), but the idea is the same, so we can interchange them when we need to.
We may also want to increase impact of large mismatches by squaring the difference:
g(x, y) = (x - y)^2
Hey! We've just reinvented (mean) squared error! We can now stick to MSE to calculate distance, or we can proceed finding good g() function.
Sometimes we may want not to amplify the difference but to smooth it. In this case we can use log:
g(x, y) = log(abs(x - y))
We can use special treatment for zeros like this:
g(x, y) = sign(x)*sign(y)*abs(x - y) # sign(0) will turn whole expression to 0
Or we can go back from distance to similarity by inverting the difference:
g(x, y) = 1 / abs(x - y)
Note that in the last few options we haven't used a normalization factor. In fact, you can come up with some good normalization for each case, or just omit it; normalization is not always needed or good. For example, in the cosine similarity formula, if you change the normalization constant from L=length(a) * length(b) to L=1, you will get different but still reasonable results. E.g., for a fixed nonzero vector v of visit durations:
cos(v, [90, 90, 90]) == cos(v, [1000, 1000, 1000]) # measuring angle only
cos_no_norm(v, [90, 90, 90]) < cos_no_norm(v, [1000, 1000, 1000]) # measuring both angle and magnitude
Summarizing this long and mostly boring story, I would suggest rewriting cosine similarity/distance to use some kind of difference between variables in two vectors.
Cosine similarity is meant for the case where you do not want to take length into account, but the angle only.
If you want to also include length, choose a different distance function.
Cosine distance is closely related to squared Euclidean distance (the only distance for which k-means is really defined); which is why spherical k-means works.
The relationship is quite simple:
The squared Euclidean distance sum_i (x_i-y_i)^2 can be expanded into sum_i x_i^2 + sum_i y_i^2 - 2 * sum_i x_i*y_i. If both vectors are normalized to unit length, i.e. length does not matter, then the first two terms are each 1. In this case, squared Euclidean distance is 2 - 2 * cos(x,y)!
In other words: cosine distance is, up to a factor of 2, squared Euclidean distance with the data normalized to unit length.
If you don't want to normalize your data, don't use Cosine.
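The identity is easy to check numerically. A minimal sketch, using the example vector from the question and a second vector of my own choosing (the names an, bn, cos_sim and sq_euclid are mine):

```matlab
a = [100 50 0 30 0];
b = [90 0 0 0 0];

an = a / norm(a);                 % scale both vectors to unit length
bn = b / norm(b);

cos_sim   = dot(an, bn);          % cosine similarity of a and b
sq_euclid = sum((an - bn).^2);    % squared Euclidean distance after scaling

% Both printed numbers should agree up to rounding error.
fprintf('%.12f  %.12f\n', sq_euclid, 2 - 2*cos_sim);
```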
Q1. Why is this assumption wrong?
As we see from the definition, cosine similarity measures angle between 2 vectors.
In your case, vector v1 lies along the first axis, while c1 and c2 both make equal angles with all of the axes, and thus the cosine similarity has to be the same.
Note that the trouble lies with c1 and c2 pointing in the same direction. Any v1 will have the same cosine similarity with both of them.
Q2. Is the cosine distance a correct distance function for this case?
As we see from the example in hand, probably not.
Q3. What would be a better one given the nature of the problem?
Consider Euclidean Distance.
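To make the contrast concrete, here is a small sketch with the vectors from the question. The anonymous function cosine_distance and the names d_cos and d_euc are mine; the formula is the 1 - cosine_similarity definition the question uses.

```matlab
v1 = [90 0 0 0 0];
c1 = [90 90 90 90 90];
c2 = [1000 1000 1000 1000 1000];

cosine_distance = @(a, b) 1 - dot(a, b) / (norm(a) * norm(b));

% Identical for both centroids: 1 - 1/sqrt(5), about 0.5528.
d_cos = [cosine_distance(v1, c1), cosine_distance(v1, c2)]

% Euclidean distance does separate them: 180 versus about 2197.3.
d_euc = [norm(v1 - c1), norm(v1 - c2)]
```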

"Frequency" shift in discrete FFT in MATLAB

(Disclaimer: I thought about posting this on math.statsexchange, but found similar questions there that were moved to SO, so here I am)
The context:
I'm using fft/ifft to determine probability distributions for sums of random variables.
So e.g. I have two uniform probability distributions - in the simplest case two uniform distributions on the interval [0,1].
So to get the probability distribution for the sum of two random variables sampled from these two distributions, one can calculate the product of the Fourier transforms of the individual probability densities.
Doing the inverse fft on this product, you get back the probability density for the sum.
An example:
function usumdist_example()
x = linspace(-1, 2, 1e5);
dx = diff(x(1:2));
NFFT = 2^nextpow2(numel(x));
% take two uniform distributions on [0,0.5]
intervals = [0, 0.5;
0, 0.5];
figure();
hold all;
for i=1:size(intervals,1)
% construct the prob. dens. function
P_x = x >= intervals(i,1) & x <= intervals(i,2);
plot(x, P_x);
% for each pdf, get the characteristic function fft(pdf,NFFT)
% and form the product of all char. functions in Y
if i==1
Y = fft(P_x,NFFT) / NFFT;
else
Y = Y .* fft(P_x,NFFT) / NFFT;
end
end
y = ifft(Y, NFFT);
x_plot = x(1) + (0:dx:(NFFT-1)*dx);
plot(x_plot, y / max(y), '.');
end
My issue is, the shape of the resulting prob. dens. function is perfect.
However, the x-axis does not fit to the x I create in the beginning, but is shifted.
In the example, the peak is at 1.5, while it should be 0.5.
The shift changes if I e.g. add a third random variable or if I modify the range of x.
But I can't figure out how.
I'm afraid it might have to do with the fact that I'm having negative x values, while fourier transforms usually work in a time/frequency domain, where frequencies < 0 don't make sense.
I'm aware I could find e.g. the peak and shift it to its proper place, but seems nasty and error prone...
Glad about any ideas!
The problem is that your x origin is -1, not 0. You expect the center of the triangular pdf to be at .5, because that's twice the value of the center of the uniform pdf. However, the correct reasoning is: the center of the uniform pdf is 1.25 above your minimum x, and you get the center of the triangle at 2*1.25 = 2.5 above the minimum x (that is, at 1.5).
In other words: although your original x axis is (-1, 2), the convolution (or the FFT) behave as if it were (0, 3). In fact, the FFT knows nothing about your x axis; it only uses the y samples. Since your uniform is zero for the first samples, that zero interval of width 1 is amplified to twice its width when you do the convolution (or the FFT). I suggest drawing the convolution on paper to see this (draw original signal, reflected signal about y axis, displace the latter and see when both begin to overlap). So you need a correction in the x_plot line to compensate for this increased width of the zero interval: use
x_plot = 2*x(1) + (0:dx:(NFFT-1)*dx);
and then plot(x_plot, y / max(y), '.') will give the correct graph.
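The same reasoning suggests that with n summed variables sampled on the same x grid, the offset becomes n*x(1), since every factor contributes its own leading zero interval. The sketch below is my own generalization of the correction, not code from the answer; n_vars is a hypothetical parameter, and a coarser grid than in the question keeps it quick. For two uniform densities on [0, 0.5] the reported peak should be close to 0.5.

```matlab
n_vars = 2;                              % hypothetical: number of summed variables
x  = linspace(-1, 2, 3001);
dx = x(2) - x(1);
P_x = double(x >= 0 & x <= 0.5);         % shape of a uniform pdf on [0, 0.5]

% Pad so that the n-fold linear convolution does not wrap around.
NFFT = 2^nextpow2(n_vars * numel(x));
y = real(ifft(fft(P_x, NFFT).^n_vars));  % product of the transforms, then inverse

% Each factor starts at x(1), so the sum starts at n_vars*x(1).
x_plot = n_vars*x(1) + (0:NFFT-1)*dx;

[~, k] = max(y);
fprintf('peak at x = %.3f\n', x_plot(k));
% plot(x_plot, y / max(y), '.') shows the shape on the corrected axis
```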