A question about Milvus distance calculation

I have a question about distance calculation in Milvus. I use the L2 distance metric to query a vector for top1, and Milvus returns a distance of 9.340524, whereas the distance I compute between the query vector and the returned vector using the L2 formula is 2.156227.
Why is the distance calculated with the formula different from the result returned by Milvus?

The L2 distance returned by Milvus comes from FAISS, and it is the squared value. For example, with vector1 = [1, 2] and vector2 = [0, 4], the returned L2 distance is 5 (that is, (1-0)^2 + (2-4)^2), not sqrt(5).
You got an L2 distance of 9.340524, but that is not equal to 2.156227 * 2.156227, so there must be a mistake somewhere. You can verify with these steps:
create a new collection
insert the vector that you got from the previous query (the top1 vector)
query again
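For reference, a quick numeric check of the example above (a minimal MATLAB sketch) shows the difference between the squared distance Milvus/FAISS returns and the plain Euclidean distance:
v1 = [1 2];
v2 = [0 4];
squared_l2 = sum((v1 - v2).^2)   % 5      -> what Milvus/FAISS reports as the L2 distance
euclidean  = sqrt(squared_l2)    % 2.2361 -> the textbook (square-rooted) L2 distance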

Related

How to calculate normalized Euclidean distance on two vectors?

Let's say I have the following two vectors:
x = [(10-1).*rand(7,1) + 1; randi(10,1,1)];
y = [(10-1).*rand(7,1) + 1; randi(10,1,1)];
The first seven elements are continuous values in the range [1,10]. The last element is an integer in the range [1,10].
Now I would like to compute the Euclidean distance between x and y. I think the integer element is a problem because all the other elements can get very close, but the integer element always has spacings of one. So there is a bias towards the integer element.
How can I calculate something like a normalized euclidean distance on it?
According to Wolfram Alpha, and the following answer from Cross Validated, the normalized Euclidean distance is defined by
d(x, y) = 0.5 * var(x - y) / (var(x) + var(y))
You can calculate it with MATLAB by using:
0.5*(std(x-y)^2) / (std(x)^2+std(y)^2)
Alternatively, you can use:
0.5*((norm((x-mean(x))-(y-mean(y)))^2)/(norm(x-mean(x))^2+norm(y-mean(y))^2))
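Both expressions give the same value, since std and norm differ only by a 1/(n-1) factor that cancels between numerator and denominator. A quick check with the asker's vectors (a sketch; rng(0) is added only for reproducibility):
rng(0);                                            % reproducible random vectors
x = [(10-1).*rand(7,1) + 1; randi(10,1,1)];
y = [(10-1).*rand(7,1) + 1; randi(10,1,1)];
d1 = 0.5*(std(x-y)^2) / (std(x)^2+std(y)^2);
d2 = 0.5*((norm((x-mean(x))-(y-mean(y)))^2)/(norm(x-mean(x))^2+norm(y-mean(y))^2));
abs(d1 - d2) < 1e-12                               % true: both formulas agree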
I would rather normalise x and y before calculating the distance and then vanilla Euclidean would suffice.
In your example
x_norm = (x -1) / 9; % normalised x
y_norm = (y -1) / 9; % normalised y
dist = norm(x_norm - y_norm); % Euclidean distance between normalised x, y
However, I am not sure whether having an integer element contributes to some sort of bias, but we have already gotten kind of off-topic for Stack Overflow :)
From Euclidean Distance - raw, normalized and double-scaled coefficients:
SYSTAT, Primer 5, and SPSS provide Normalization options for the data so as to permit an investigator to compute a distance coefficient which is essentially "scale free". Systat 10.2's normalised Euclidean distance produces its "normalisation" by dividing each squared discrepancy between attributes or persons by the total number of squared discrepancies (or sample size).
Frankly, I can see little point in this standardization - as the final coefficient still remains scale-sensitive. That is, it is impossible to know whether the value indicates high or low dissimilarity from the coefficient value alone.
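A rough MATLAB sketch of that kind of normalisation, as I read the description above (the variables are illustrative; the point is only that the coefficient still rescales with the data):
% Systat-style "normalised" Euclidean distance: the sum of squared
% discrepancies divided by the sample size (an RMS-style distance)
d_norm = sqrt(mean((x - y).^2));
% Still scale-sensitive: rescaling the data rescales the coefficient
sqrt(mean((10*x - 10*y).^2))   % equals 10 * d_norm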

Matlab calculating nearest neighbour distance for all (u, v) vectors in an array

I am trying to calculate the distance between nearest neighbours within an n x 2 matrix like the one shown below
point_coordinates =
11.4179 103.1400
16.7710 10.6691
16.6068 119.7024
25.1379 74.3382
30.3651 23.2635
31.7231 105.9109
31.8653 36.9388
%for loop going from the top of the vector column to the bottom
for counter = 1:size(point_coordinates,1)
    %current point selected
    current_point = point_coordinates(counter,:);
    %math to calculate the distance between the current point and all the points
    distance_search = point_coordinates - repmat(current_point,[size(point_coordinates,1) 1]);
    dist_from_current_point = sqrt(distance_search(:,1).^2 + distance_search(:,2).^2);
    %line to omit the self-subtraction that gives zero
    dist_from_current_point(dist_from_current_point <= 0) = [];
    %gives the shortest distance calculated for a certain vector and current_point
    nearest_dist = min(dist_from_current_point);
end
%final line to plot the u,v vectors and the corresponding nearest neighbour
%distances
matnndist = [point_coordinates nearest_dist]
I am not sure how to structure the 'for' loop/nearest_neighbour line to be able to get the nearest neighbour distance for each u,v vector.
I would like to have, for example:
for the first vector, its coordinates and the corresponding shortest distance; for the second vector, its own shortest distance; and so on up to n.
Hope someone can help.
Thanks
I understand you want to obtain the minimum distance between different points.
You can compute the distance for each pair of points with bsxfun; remove self-distances; minimize. It's more computationally efficient to work with squared distances, and take the square root only at the end.
n = size(point_coordinates,1);
dist = bsxfun(@minus, point_coordinates(:,1), point_coordinates(:,1).').^2 + ...
       bsxfun(@minus, point_coordinates(:,2), point_coordinates(:,2).').^2;
dist(1:n+1:end) = inf; %// remove self-distances
min_dist = sqrt(min(dist(:)));
Alternatively, you could use pdist. This avoids computing each distance twice, and also avoids self-distances:
dist = pdist(point_coordinates);
min_dist = min(dist(:));
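If you also want the nearest-neighbour distance for each point, as asked in the question, one possibility (a small sketch building on pdist and squareform) is:
D = squareform(pdist(point_coordinates));   % full symmetric distance matrix
D(1:size(D,1)+1:end) = inf;                 % ignore self-distances
nearest_dist = min(D, [], 2);               % nearest-neighbour distance per point
matnndist = [point_coordinates nearest_dist]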
If I can suggest a built-in function, use knnsearch from the Statistics Toolbox. What you are essentially doing is a K-Nearest Neighbour (KNN) search, but ignoring self-distances. You would call knnsearch in the following way:
[idx,d] = knnsearch(X, Y, 'k', k);
In simple terms, the KNN algorithm returns the k closest points to your data set given a query point. Usually, the Euclidean distance is the distance metric that is used. For MATLAB's knnsearch, X is a 2D array that consists of your dataset where each row is an observation and each column is a variable. Y would be the query points. Y is also a 2D array where each row is a query point and you need to have the same number of columns as X. We would also specify the flag 'k' to denote how many closest points you want returned. By default, k = 1.
As such, idx will be an N x K matrix, where N is the total number of query points (the number of rows of Y) and K is the number of closest points requested for each query point. idx indicates which particular points in your dataset were closest to each query. d is also an N x K matrix that returns the smallest distances for these corresponding closest points.
As such, what you want to do is find the closest point in your dataset to each of the other points, ignoring self-distances. Therefore, you would set both X and Y to the same matrix and set k = 2; the first column of both outputs is each point matched with itself at distance 0, so discarding it gives the result you're looking for.
Therefore:
[idx,d] = knnsearch(point_coordinates, point_coordinates, 'k', 2)
idx = idx(:,2);
d = d(:,2);
We thus get for idx and d:
>> idx
idx =
3
5
1
1
7
3
5
>> d
d =
17.3562
18.5316
17.3562
31.9027
13.7573
20.4624
13.7573
As such, this tells us that for the first point in your data set, it matched with point #3 the best. This matched with the closest distance of 17.3562. For the second point in your data set, it matched with point #5 the best with the closest distance being 18.5316. You can continue on with the rest of the results in a similar pattern.
If you don't have access to the Statistics Toolbox, consider reading my Stack Overflow post on how I compute KNN from first principles:
Finding K-nearest neighbors and its implementation
In fact, it is very similar to Luis Mendo's post to you earlier.
Good luck!

Cosine distance as vector distance function for k-means

I have a graph of N vertices where each vertex represents a place. I also have vectors, one per user, each with N coefficients, where each coefficient's value is the duration in seconds spent at the corresponding place, or 0 if that place was not visited.
E.g. for a graph with 5 vertices, the vector:
v1 = {100, 50, 0, 30, 0}
would mean that we spent:
100 secs at vertex 1,
50 secs at vertex 2, and
30 secs at vertex 4
(vertices 3 & 5 where not visited, thus the 0s).
I want to run k-means clustering and I've chosen cosine_distance = 1 - cosine_similarity as the distance metric, where the formula for cosine_similarity is:
cosine_similarity(a, b) = dot(a, b) / (norm(a) * norm(b))
as described here.
But I noticed the following. Assume k=2 and one of the vectors is:
v1 = {90,0,0,0,0}
In the process of solving the optimization problem of minimizing the total distance from candidate centroids, assume that at some point, 2 candidate centroids are:
c1 = {90,90,90,90,90}
c2 = {1000, 1000, 1000, 1000, 1000}
Running the cosine_distance formula for (v1, c1) and (v1, c2) we get exactly the same distance of 0.5527864045 for both.
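A quick MATLAB check of those numbers, using the cosine_distance definition above:
v1 = [90 0 0 0 0];
c1 = [90 90 90 90 90];
c2 = [1000 1000 1000 1000 1000];
cosine_distance = @(a, b) 1 - dot(a, b) / (norm(a) * norm(b));
cosine_distance(v1, c1)    % 0.5528 (= 1 - 1/sqrt(5))
cosine_distance(v1, c2)    % 0.5528, exactly the same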
I would assume that v1 is more similar (closer) to c1 than c2. Apparently this is not the case.
Q1. Why is this assumption wrong?
Q2. Is the cosine distance a correct distance function for this case?
Q3. What would be a better one given the nature of the problem?
Let's divide cosine similarity into parts and see how and why it works.
Cosine between 2 vectors - a and b - is defined as:
cos(a, b) = sum(a .* b) / (length(a) * length(b))
where .* is element-wise multiplication, and length() here denotes the Euclidean norm (magnitude) of the vector, not the number of elements. The denominator is there just for normalization, so let's simply call it L. With it, our function turns into:
cos(a, b) = sum(a .* b) / L
which, in its turn, may be rewritten as:
cos(a, b) = (a[1]*b[1] + a[2]*b[2] + ... + a[k]*b[k]) / L =
= a[1]*b[1]/L + a[2]*b[2]/L + ... + a[k]*b[k]/L
Let's get a bit more abstract and replace x * y / L with a function g(x, y) (L is a constant here, so we don't pass it as a function argument). Our cosine function thus becomes:
cos(a, b) = g(a[1], b[1]) + g(a[2], b[2]) + ... + g(a[n], b[n])
That is, each pair of elements (a[i], b[i]) is treated separately, and the result is simply the sum over all pairs. And this is good for your case, because you don't want different pairs (different vertices) to interfere with each other: if user1 visited only vertex2 and user2 visited only vertex1, then they have nothing in common, and the similarity between them should be zero. What you actually don't like is how the similarity between individual pairs - i.e. the function g() - is calculated.
With the cosine function, the similarity between individual pairs looks like this:
g(x, y) = x * y / L
where x and y represent the time users spent on the vertex. And here's the main question: does multiplication represent the similarity between individual pairs well? I don't think so. A user who spent 90 seconds on some vertex should be close to a user who spent, say, 70 or 110 seconds there, but much farther from users who spent 1000 or 0 seconds there. Multiplication (even normalized by L) is totally misleading here. What does it even mean to multiply 2 time periods?
The good news is that it is you who designs the similarity function. We have already decided that we are satisfied with the independent treatment of pairs (vertices), and we only want the individual similarity function g(x, y) to do something reasonable with its arguments. And what is a reasonable function for comparing time periods? I'd say subtraction is a good candidate:
g(x, y) = abs(x - y)
This is not a similarity function but a distance function - the closer the values are to each other, the smaller the result of g() - but ultimately the idea is the same, so we can interchange them when needed.
We may also want to increase the impact of large mismatches by squaring the difference:
g(x, y) = (x - y)^2
Hey! We've just reinvented (mean) squared error! We can now stick with MSE to calculate distance, or we can keep looking for a good g() function.
Sometimes we may want not to amplify, but instead to smooth out, the difference. In this case we can use log:
g(x, y) = log(abs(x - y))
We can use special treatment for zeros like this:
g(x, y) = sign(x)*sign(y)*abs(x - y) # sign(0) will turn whole expression to 0
Or we can go back from distance to similarity by inverting the difference:
g(x, y) = 1 / abs(x - y)
Note that in the last few options we haven't used a normalization factor. In fact, you can come up with a good normalization for each case, or just omit it - normalization is not always needed or good. For example, in the cosine similarity formula, if you change the normalization constant L = length(a) * length(b) to L = 1, you will get different, but still reasonable, results. E.g.
cos([90, 90, 90]) == cos([1000, 1000, 1000]) # measuring angle only
cos_no_norm([90, 90, 90]) < cos_no_norm([1000, 1000, 1000]) # measuring both - angle and magnitude
Summarizing this long and mostly boring story, I would suggest rewriting cosine similarity/distance to use some kind of difference between variables in two vectors.
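As a minimal MATLAB sketch of that suggestion (the helper below is illustrative, not part of any library): a distance built from element-wise differences, here the squared difference summed over vertices, makes v1 clearly closer to c1 than to c2:
% squared-difference distance from one user u (row vector) to each row of V
g_dist = @(u, V) sum(bsxfun(@minus, V, u).^2, 2);
v1 = [90 0 0 0 0];
C  = [90 90 90 90 90; 1000 1000 1000 1000 1000];   % candidate centroids c1, c2
g_dist(v1, C)    % [32400; 4828100] -> v1 is much closer to c1 than to c2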
Cosine similarity is meant for the case where you do not want to take length into account, but only the angle.
If you want to also include length, choose a different distance function.
Cosine distance is closely related to squared Euclidean distance (the only distance for which k-means is really defined); which is why spherical k-means works.
The relationship is quite simple:
squared Euclidean distance sum_i (x_i-y_i)^2 can be factored into sum_i x_i^2 + sum_i y_i^2 - 2 * sum_i x_i*y_i. If both vectors are normalized, i.e. length does not matter, then the first two terms are 1. In this case, squared Euclidean distance is 2 - 2 * cos(x,y)!
In other words: Cosine distance is squared Euclidean distance with the data normalized to unit length.
If you don't want to normalize your data, don't use Cosine.
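A quick numeric check of that relationship in MATLAB, using v1 and c1 from the question:
x = [90 0 0 0 0];  y = [90 90 90 90 90];
xn = x / norm(x);  yn = y / norm(y);             % normalize to unit length
sum((xn - yn).^2)                                % squared Euclidean distance: 1.1056
2 - 2 * dot(x, y) / (norm(x) * norm(y))          % 2 - 2*cos(x,y): identical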
Q1. Why is this assumption wrong?
As we see from the definition, cosine similarity measures angle between 2 vectors.
In your case, the vector v1 lies entirely along the first dimension, while c1 and c2 are both equally aligned relative to the axes, and thus the cosine similarity has to be the same.
Note that the trouble lies with c1 and c2 pointing in the same direction; any v1 will have the same cosine similarity with both of them.
Q2. Is the cosine distance a correct distance function for this case?
As we can see from the example at hand, probably not.
Q3. What would be a better one given the nature of the problem?
Consider Euclidean Distance.

Euclidean distance between two columns of two vectors in MATLAB

I have two matrices A and B of size 250x4. The first column in each holds the X values and the second column the Y values. I want to calculate the Euclidean distance between the X, Y pair of each row in the two matrices and save the result in a new vector C of size 250x1. For example, if the first row in A is A1x, A1y, A1n, A1m and the first row in B is B1x, B1y, B1n, B1m, I want the Euclidean distance [(A1x-B1x)^2 + (A1y-B1y)^2]^0.5, with the result saved in C1, and the same for the rest of the 250 rows. Could anyone please advise how to do this in MATLAB?
Like this:
%// First extract the x-y data from A and B
Axy = A(:,1:2);
Bxy = B(:,1:2);
%// Find all euclidean distances (row-wise)
C1 = sqrt(sum((Axy-Bxy).^2,2));
Plus, it handles higher dimensions too.
use pdist2:
C1=diag(pdist2(A(:,1:2),B(:,1:2)));
Actually, pdist2 will give you a 250x250 matrix, because it calculates all the pairwise distances. You need only the main diagonal, so calling diag on the result (as in the code above) produces the wanted result.

Calculating distance of all the points in a region with each other

I have a region with about 144 points. What I want to achieve is to measure the distance of a point to all the others and store it in an array. I want to do this for all the points. If possible I would like to store this data in such a manner that there is no repetition. And I should be able to make queries like: all the distances between all the points without repetition, the sum of all the distances for point no. 56, etc.
I have a 3*144 array with two columns storing the coordinates of the points.
A possible solution (I am not really clear on what you mean by no repetition, though):
X contains your points, with coordinates x = X(:,1), y = X(:,2):
dist = sqrt(bsxfun(@minus,X(:,1),X(:,1)').^2 + bsxfun(@minus,X(:,2),X(:,2)').^2)
so
dist(i,j) is the Euclidean distance between points i and j.
Of course the matrix is symmetric, so you can easily reduce the complexity involved.
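For example, to avoid repetitions you can keep only the upper triangle of dist, and per-point queries can still use the full matrix (a small sketch):
pairwise_no_rep = dist(triu(true(size(dist)), 1));   % each pair counted once, no self-distances
sum_point56 = sum(dist(56, :));                       % sum of distances from point 56 to all others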
Let's say your array is A, where each column stores the coordinates of a single point. To obtain the combinations of all point pairs (without repetitions), use nchoosek:
pairs = nchoosek(1:size(A, 2), 2)
Then calculate the Euclidean distance like so:
dist = sqrt(sum((A(:, pairs(:, 1)) - A(:, pairs(:, 2))) .^ 2, 1))
If you have the Statistics Toolbox installed, you can use pdist(A.') instead for the same effect (pdist expects one point per row, hence the transpose).
If you have the Statistics Toolbox, and if you have all your data in the array X (one point per row), then
D = pdist(X)
gives all the pairwise distances between all points in X.