Hi,
I would like to create a correlation matrix between the two data sets presented above that will ignore any appearances of zeros (in the picture above, the green color). Does anyone know the most efficient way to do this that will produce a smooth result?
Is there any correlation method that can identify the similarity point by point, so that the results will have the "shape" of the original matrix?
thank u
Note: I do not have the Matlab Statistics Toolbox.
2. Is there any correlation method that can identify the similarity point by point, so that the results will have the "shape" of the original matrix?
Let's start with your second point because it is clearer what you want there. You want to do a point-by-point comparison of two images, say, A and B. This boils down to measuring the similarity of two scalars a and b. Let's assume that these scalars are from the interval [0, Q], where Q depends on your image format (Q == 1 or Q == 255 are common in Matlab).
Now, the simplest measure of distance is the difference d = |a - b|. You might want to normalize this to [0, 1] and also invert the values to measure similarity instead of distance. In Matlab:
S = 1 - abs(A - B) / Q;
You mentioned ignoring the zeros in the images. Well, you need to define what similarity value you expect for a zero. One possibility is to set the similarity to zero whenever one pixel is zero:
S(A == 0 | B == 0) = 0;
You could also say that the similarity there is undefined and set the similarity to NaN:
S(A == 0 | B == 0) = nan;
Of course, you can also say that the mismatch between 10 and 11 is as bad as the mismatch between 100 and 110. In this case, you could take the distance relative to the sum a + b (known as Bray-Curtis normalization or normalized Euclidean metric):
D = abs(A - B) ./ (A + B)
S = 1 - D / max(D(:));
You run into problems if both matrices have a zero-valued pixel at the same location. Again, there are several possibilities: you can augment the sum with a small positive value alpha (e.g. alpha = 1e-6), which prevents a division by zero: D = abs(A - B) ./ (alpha + A + B).
Another option is to ignore the infinite values in D and add your 'zero-processing' here, i.e.
D = abs(A - B) ./ (A + B)
D(A == 0 | B == 0) = nan;
S = 1 - D / max(D(:));
You see, there are plenty of possibilities.
1. I would like to create a correlation matrix [...]
You should definitely think more about this point and come up with a better description of what to compute. If your matrices are of size m x m, you have m^2 variables. From this you could compute a correlation matrix of size m^2 x m^2, which measures the correlation of every pixel to every other pixel. This matrix also has its largest values on the diagonal (for a correlation matrix these are all 1; for a covariance matrix they would be the variances). However, I would not suggest computing the correlation matrix if you only have two realisations.
Another option is to measure the similarity of rows or columns in the two images. Then you end up with a vector 1 x m of correlation coefficients.
However, I do not know how to compute a correlation matrix of size m x m from two inputs of size m x m, which has the largest values in the diagonal.
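A rough sketch of the row-by-row option mentioned above (my own illustration, not part of the original answer), which also skips the zero pixels from your question; it only uses corrcoef, which ships with base MATLAB:
m = size(A, 1);
r = nan(1, m);                                 % one correlation coefficient per row
for i = 1:m
    keep = (A(i, :) ~= 0) & (B(i, :) ~= 0);    % ignore zeros as requested
    if nnz(keep) > 1
        C = corrcoef(A(i, keep), B(i, keep));
        r(i) = C(1, 2);
    end
end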
To just get a general correlation coefficient I'd use corr2. From the docs:
r = corr2(A,B)
returns the correlation coefficient r between A and B, where A and B are matrices or vectors of the same size. r is a scalar double.
Roughly, I believe it's just calculating corr(A(:), B(:)).
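Since the question mentions not having the Statistics Toolbox, here is a minimal sketch (my own) of the same idea using base MATLAB's corrcoef, again dropping the zero pixels:
mask = (A ~= 0) & (B ~= 0);        % keep only pixels that are nonzero in both images
C = corrcoef(A(mask), B(mask));    % corrcoef ships with base MATLAB
r = C(1, 2);                       % single overall correlation coefficient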
Related
I want to create a time vector, from 1e-7 to 1e-5 with a higher resolution (smaller spacing) at the end.
The standard v = logspace(-7,-5) creates a vector with logarithmically increasing spacing. If I switch the order of the two limits (logspace(-5,-7)) and use flip(v), the spacing is still the same; only the order of the numbers changes.
You would need to specify an additional parameter besides the limits and the number of values: the base of the logarithm. This is equivalent to choosing where you sample the values on the logarithmic curve.
This code generates a sequence of logarithmically decreasing values in between your two limits:
lims = [1e-7,1e-5];
N = 10;
e = 10; % we'll generate linear values from 1 to e
% Generate logarithmic sequence (we need to flip for decreasing intervals)
d = flip(exp(linspace(1, e, N)));
% Map the sequence to our limits
d = (d - d(1)) / (d(end) - d(1));
d = d * (lims(2) - lims(1)) + lims(1);
d is:
1.0e-05 *
0.0100 0.6359 0.8661 0.9508 0.9820 0.9935 0.9977 0.9992 0.9998 1.0000
You could mirror the vector onto the half plane x<0 by multiplication with -1. Then the spacing is largest for the smaller numbers and decreasing, but v is in the interval -10^-5 to -10^-7.
Move v to the desired interval by adding 10^-5+10^-7.
Use the flip function so that v is ordered with the smallest element first and increasing.
v = logspace(-7,-5);
v = 1E-7+1E-5-v;
v = flip(v);
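A quick sanity check (my own addition, not part of the original answer) that the step sizes indeed shrink towards the upper limit:
diff(v(1:4))          % large steps near 1e-7
diff(v(end-3:end))    % much smaller steps near 1e-5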
I have the lists xA, yA, zA and the lists xB, yB, zB in Matlab. They contain the x, y and z coordinates of points of type A and type B. There may be different numbers of type A and type B points.
I would like to calculate a matrix containing the Euclidean distances between points of type A and type B. Of course, I only need to calculate one half of the matrix, since the other half contains duplicate data.
What is the most efficient way to do that?
When I'm done with that, I want to find the points of type B that are closest to one point of type A. How do I then find the coordinates the closest, second closest, third closest and so on points of type B?
Given a matrix A of size [N,3] and a matrix B of size [M,3], you can use the pdist2 function to get a matrix of size [N,M] containing all the pairwise distances.
If you want to order the points from B by their distance to the rth point in A then you can sort the rth row of the pairwise distance matrix.
% generate some example data
N = 4
M = 7
A = randn(N,3)
B = randn(M,3)
% compute N x M matrix containing pairwise distances
D = pdist2(A, B, 'euclidean')
% sort points in B by their distance to the rth point in A
r = 3
[~, b_idx] = sort(D(r,:))
Then b_idx will contain the indices of the points in B sorted by their distance to the rth point in A. So the actual points in B ordered by b_idx can be obtained with B(b_idx,:), which has the same size as B.
If you want to do this for all r you could use
[~, B_idx] = sort(D, 2)
to sort all the rows of D at the same time. Then the rth row of B_idx will contain b_idx.
If your only objective is to find the closest k points in B for each point in A (for some positive integer k which is less than M), then we would not generally want to compute all the pairwise distances. This is because space-partitioning data structures like K-D-trees can be used to improve the efficiency of searching without explicitly computing all the pairwise distances.
Matlab provides a knnsearch function that uses K-D-trees for this exact purpose. For example, if we do
k = 2
B_kidx = knnsearch(B, A, 'K', k)
then B_kidx will be the first two columns of B_idx, i.e. for each point in A the indices of the nearest two points in B. Also, note that this is only going to be more efficient than the pdist2 method when k is relatively small. If k is too large then knnsearch will automatically use the explicit method from before instead of the K-D-tree approach.
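If you also need the distances themselves, knnsearch has a second output; a small follow-up sketch (the variable names here are mine):
[B_kidx, B_kdist] = knnsearch(B, A, 'K', k);   % indices and distances of the k nearest points in B
closest_to_A1 = B(B_kidx(1, :), :);            % coordinates of the k points in B closest to A(1,:)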
I'm trying to reconstruct a 3d image from two calibrated cameras. One of the steps involved is to calculate the 3x3 essential matrix E, from two sets of corresponding (homogeneous) points (more than the 8 required) P_a_orig and P_b_orig and the two camera's 3x3 internal calibration matrices K_a and K_b.
We start off by normalizing our points with
P_a = inv(K_a) * p_a_orig
and
P_b = inv(K_b) * p_b_orig
We also know the constraint
P_b' * E * P_a = 0
I'm following it this far, but how do you actually solve that last problem, e.g. finding the nine values of the E matrix? I've read several different lecture notes on this subject, but they all leave out that crucial last step. Likely because it is supposedly trivial math, but I can't remember when I last did this and I haven't been able to find a solution yet.
This equation is actually pretty common in geometry algorithms; essentially, you are trying to calculate the matrix X from the equation AXB = 0. To solve this, you vectorise the equation, which means using the identity vec(AXB) = (B^T ⊗ A) vec(X) = 0.
Here vec() means the vectorised form of a matrix, i.e. simply stack the columns of the matrix one over the other to produce a single column vector. If you don't know the meaning of the scary-looking ⊗ symbol, it's called the Kronecker product; it's easy, trust me :-)
Now, say I call the matrix obtained by Kronecker product of B^T and A as C.
Then, vec(X) is the null vector of the matrix C, and the way to obtain it is by doing the SVD decomposition of C^TC (C transpose multiplied by C) and taking the last column of the matrix V. This last column is nothing but your vec(X). Reshape X to a 3 by 3 matrix. This is your Essential matrix.
In case you find this maths too daunting to code, simply use the following code by Y. Ma et al.:
% p are homogeneous coordinates of the first image, of size 3 by n
% q are homogeneous coordinates of the second image, of size 3 by n
function [E] = essentialDiscrete(p,q)
n = size(p);
NPOINTS = n(2);
% set up matrix A such that A*[v1,v2,v3,s1,s2,s3,s4,s5,s6]' = 0
A = zeros(NPOINTS, 9);
if NPOINTS < 9
    error('Too few measurements')
end
for i = 1:NPOINTS
    A(i,:) = kron(p(:,i), q(:,i))';
end
r = rank(A);
if r < 8
    warning('Measurement matrix rank deficient')
    T0 = 0; R = [];
end
[U,S,V] = svd(A);
% pick the right singular vector corresponding to the smallest singular value
e = V(:,9);
e = (round(1.0e+10*e))*(1.0e-10);
% essential matrix
E = reshape(e, 3, 3);
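A hedged usage sketch (my own, following the normalization step from the question): feed the function the calibrated points, with the first-image points as p and the second-image points as q.
% p_a_orig, p_b_orig are 3 x n homogeneous pixel coordinates; K_a, K_b as above
p = K_a \ p_a_orig;           % same as inv(K_a) * p_a_orig, but numerically nicer
q = K_b \ p_b_orig;
E = essentialDiscrete(p, q);  % q(:,i)' * E * p(:,i) should then be close to 0 for all i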
You can do several things:
The Essential matrix can be estimated using the 8-point algorithm, which you can implement yourself.
You can use the estimateFundamentalMatrix function from the Computer Vision System Toolbox, and then get the Essential matrix from the Fundamental matrix (a sketch of this follows below).
Alternatively, you can calibrate your stereo camera system using the estimateCameraParameters function in the Computer Vision System Toolbox, which will compute the Essential matrix for you.
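For the second option, a rough sketch (this assumes the Computer Vision System Toolbox; matchedPoints_a and matchedPoints_b are hypothetical N x 2 arrays of corresponding pixel coordinates from the two images):
F = estimateFundamentalMatrix(matchedPoints_a, matchedPoints_b);
% Since F = inv(K_b)' * E * inv(K_a) (up to scale), the Essential matrix follows as
E = K_b' * F * K_a;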
I have a graph of N vertices where each vertex represents a place. Also I have vectors, one per user, each one of N coefficients where the coefficient's value is the duration in seconds spent at the corresponding place or 0 if that place was not visited.
E.g. for the graph:
the vector:
v1 = {100, 50, 0, 30, 0}
would mean that we spent:
100 secs at vertex 1
50 secs at vertex 2 and
30 secs at vertex 4
(vertices 3 & 5 were not visited, thus the 0s).
I want to run a k-means clustering and I've chosen cosine_distance = 1 - cosine_similarity as the metric for the distances, where the formula for cosine_similarity is cosine_similarity(a, b) = dot(a, b) / (norm(a) * norm(b)), as described here.
But I noticed the following. Assume k=2 and one of the vectors is:
v1 = {90,0,0,0,0}
In the process of solving the optimization problem of minimizing the total distance from candidate centroids, assume that at some point, 2 candidate centroids are:
c1 = {90,90,90,90,90}
c2 = {1000, 1000, 1000, 1000, 1000}
Running the cosine_distance formula for (v1, c1) and (v1, c2) we get exactly the same distance of 0.5527864045 for both.
I would assume that v1 is more similar (closer) to c1 than c2. Apparently this is not the case.
Q1. Why is this assumption wrong?
Q2. Is the cosine distance a correct distance function for this case?
Q3. What would be a better one given the nature of the problem?
Let's divide cosine similarity into parts and see how and why it works.
Cosine between 2 vectors - a and b - is defined as:
cos(a, b) = sum(a .* b) / (norm(a) * norm(b))
where .* is an element-wise multiplication and norm() is the Euclidean length of a vector. The denominator is there just for normalization, so let's simply call it L. With it, our function turns into:
cos(a, b) = sum(a .* b) / L
which, in turn, may be rewritten as:
cos(a, b) = (a[1]*b[1] + a[2]*b[2] + ... + a[n]*b[n]) / L =
          = a[1]*b[1]/L + a[2]*b[2]/L + ... + a[n]*b[n]/L
Let's get a bit more abstract and replace x * y / L with a function g(x, y) (L here is a constant, so we don't pass it as a function argument). Our cosine function thus becomes:
cos(a, b) = g(a[1], b[1]) + g(a[2], b[2]) + ... + g(a[n], b[n])
That is, each pair of elements (a[i], b[i]) is treated separately, and the result is simply the sum over all pairs. And this is good for your case, because you don't want different pairs (different vertices) to mess with each other: if user1 visited only vertex2 and user2 only vertex1, then they have nothing in common, and the similarity between them should be zero. What you actually don't like is how the similarity between individual pairs - i.e. the function g() - is calculated.
With cosine function similarity between individual pairs looks like this:
g(x, y) = x * y / L
where x and y represent the time users spent on the vertex. And here's the main question: does multiplication represent the similarity between individual pairs well? I don't think so. A user who spent 90 seconds on some vertex should be close to a user who spent, say, 70 or 110 seconds there, but much farther from users who spent 1000 or 0 seconds there. Multiplication (even normalized by L) is totally misleading here. What does it even mean to multiply 2 time periods?
The good news is that it is you who designs the similarity function. We have already decided that we are satisfied with the independent treatment of pairs (vertices), and we only want the individual similarity function g(x, y) to do something reasonable with its arguments. And what is a reasonable function for comparing time periods? I'd say subtraction is a good candidate:
g(x, y) = abs(x - y)
This is not a similarity function but a distance function - the closer the values are to each other, the smaller the result of g() - but ultimately the idea is the same, so we can interchange them whenever we need.
We may also want to increase the impact of large mismatches by squaring the difference:
g(x, y) = (x - y)^2
Hey! We've just reinvented (mean) squared error! We can now stick to MSE to calculate distances, or we can keep looking for a good g() function.
Sometimes we may want not to increase, but rather to smooth out the difference. In this case we can use log:
g(x, y) = log(abs(x - y))
We can use special treatment for zeros like this:
g(x, y) = sign(x)*sign(y)*abs(x - y)   % sign(0) will turn the whole expression to 0
Or we can go back from distance to similarity by inverting the difference:
g(x, y) = 1 / abs(x - y)
Note that in the last few options we haven't used a normalization factor. In fact, you can come up with a good normalization for each case, or just omit it - normalization is not always needed or helpful. For example, in the cosine similarity formula, if you change the normalization constant L = norm(a) * norm(b) to L = 1, you will get different, but still reasonable results. E.g.
cos([90, 90, 90]) == cos([1000, 1000, 1000])                 % measuring angle only
cos_no_norm([90, 90, 90]) < cos_no_norm([1000, 1000, 1000])  % measuring both - angle and magnitude
Summarizing this long and mostly boring story, I would suggest rewriting cosine similarity/distance to use some kind of difference between variables in two vectors.
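As a concrete sketch of that suggestion (my own, reusing v1, c1 and c2 from the question): build the distance from a per-vertex difference g and note that, unlike cosine distance, it now tells the two candidate centroids apart.
g = @(x, y) (x - y).^2;                        % the squared-difference choice from above
d = @(a, b) sum(g(a, b));                      % total distance = sum over vertices
d([90 0 0 0 0], [90 90 90 90 90])              % 32400
d([90 0 0 0 0], [1000 1000 1000 1000 1000])    % 4828100, i.e. much farther away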
Cosine similarity is meant for the case where you do not want to take length into account, but only the angle.
If you want to also include length, choose a different distance function.
Cosine distance is closely related to squared Euclidean distance (the only distance for which k-means is really defined), which is why spherical k-means works.
The relationship is quite simple:
squared Euclidean distance sum_i (x_i-y_i)^2 can be factored into sum_i x_i^2 + sum_i y_i^2 - 2 * sum_i x_i*y_i. If both vectors are normalized, i.e. length does not matter, then the first two terms are 1. In this case, squared Euclidean distance is 2 - 2 * cos(x,y)!
In other words: Cosine distance is squared Euclidean distance with the data normalized to unit length.
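A quick numeric check of that identity (my own, reusing v1 and c1 from the question, both scaled to unit length):
x = [90 0 0 0 0];           x = x / norm(x);
y = [90 90 90 90 90];       y = y / norm(y);
sum((x - y).^2)             % squared Euclidean distance, ~1.1056
2 - 2 * dot(x, y)           % 2 - 2*cos(x,y): the same value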
If you don't want to normalize your data, don't use Cosine.
Q1. Why is this assumption wrong?
As we see from the definition, cosine similarity measures angle between 2 vectors.
In your case, vector v1 lies flat along the first dimension, while c1 and c2 are both equally aligned relative to the axes, and thus the cosine similarity has to be the same.
Note that the trouble lies with c1 and c2 pointing in the same direction: any v1 will have the same cosine similarity with both of them.
Q2. Is the cosine distance a correct distance function for this case?
As we see from the example at hand, probably not.
Q3. What would be a better one given the nature of the problem?
Consider Euclidean Distance.
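A small numeric illustration (my own) of why Euclidean distance behaves better here: it distinguishes the two centroids from the question, while cosine distance does not.
v1 = [90 0 0 0 0];  c1 = 90*ones(1,5);  c2 = 1000*ones(1,5);
1 - dot(v1, c1) / (norm(v1) * norm(c1))   % cosine distance to c1: ~0.5528
1 - dot(v1, c2) / (norm(v1) * norm(c2))   % cosine distance to c2: ~0.5528 (identical)
norm(v1 - c1)                             % Euclidean distance to c1: 180
norm(v1 - c2)                             % Euclidean distance to c2: ~2197, much larger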
I have a high dimensional Gaussian with mean M and covariance matrix V. I would like to calculate the distance from point p to M, taking V into consideration (I guess it's the distance in standard deviations of p from M?).
Phrased differently, I take an ellipse one sigma away from M, and would like to check whether p is inside that ellipse.
If V is a valid covariance matrix of a Gaussian, it is symmetric positive definite and therefore defines a valid scalar product. By the way, so does inv(V).
Therefore, assuming that M and p are column vectors, you could define distances as:
d1 = sqrt((M-p)'*V*(M-p));
d2 = sqrt((M-p)'*inv(V)*(M-p));
In Matlab style, one would rewrite d2 as (possibly with some unnecessary parentheses):
d2 = sqrt((M-p)'*(V\(M-p)));
The nice thing is that when V is the unit matrix, then d1 == d2 and it corresponds to the classical Euclidean distance. Finding out whether you have to use d1 or d2 is left as an exercise (sorry, part of my job is teaching). Write down the multi-dimensional Gaussian formula and compare it to the 1D case, since the 1D case is only a particular case of the multidimensional one (or perform some numerical experiment).
NB: in very high dimensional spaces or for very many points to test, you might find a clever / faster way from the eigenvectors and eigenvalues of V (i.e. the principal axes of the ellipsoid and their corresponding variance).
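A hedged sketch of that NB (my own interpretation): diagonalize V once, then every inv(V)-based distance is just a rotation plus a per-axis scaling.
[Q, L] = eig(V);                   % columns of Q: principal axes; diag(L): variances
y  = Q' * (p - M);                 % coordinates of p - M along the principal axes
d2 = sqrt(sum(y.^2 ./ diag(L)));   % equals sqrt((M-p)'*inv(V)*(M-p))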
Hope this helps.
A.
Consider computing the probability of the point given the normal distribution:
M = [1 -1]; %# mean vector
V = [.9 .4; .4 .3]; %# covariance matrix
p = [0.5 -1.5]; %# 2d-point
prob = mvnpdf(p,M,V); %# probability P(p|mu,cov)
The function MVNPDF is provided by the Statistics Toolbox
Maybe I'm totally off, but isn't this the same as just asking for each dimension: Am I inside the sigma?
PSEUDOCODE:
foreach(dimension d)
(M(d) - sigma(d) < p(d) < M(d) + sigma(d)) ?
Because you want to know if p is inside every dimension of your Gaussian. So actually, this is just a space problem and your Gaussian doesn't really have anything to do with it (except for M and sigma, which are just distances).
In MATLAB you could try something like:
all(M - sigma < p & p < M + sigma)
A distance to that point could then be computed, although I don't know the exact function for the Euclidean distance. Maybe dist works:
dist(M, p)
Because M is just a point in space and p as well. Just 2 vectors.
And now the final one. You want to know the distance in terms of sigmas:
% create a distance vector and divide it element-wise by sigma
(M - p) ./ sigma
I think that will do the trick.