some questions on cosine similarity - cluster-analysis

Yesterday I learnt that the cosine similarity, defined as
sim(A,B) = (A * B) / (||A||2 * ||B||2),
can effectively measure how similar two vectors are.
I see that this definition uses the L2-norm to normalize the dot product of A and B. What I am interested in is: why not use the L1-norms of A and B in the denominator instead?
My teacher told me that if I used the L1-norm in the denominator, the cosine similarity would no longer be 1 when A = B. I then asked him: if I modify the cosine similarity definition as follows, what are the advantages and disadvantages of the modified measure compared with the original one?
sim(A,B) = (A * B) / (||A||1 * ||B||1) if A!=B
sim(A,B) = 1 if A==B
I would appreciate it if someone could give me some further explanation.

If you use the L1-norm, you are not computing the cosine anymore.
The cosine is a geometric concept, not an arbitrary definition; there is a whole body of mathematics attached to it. If you use the L1-norm, you are no longer measuring angles.
See also: Wikipedia: Trigonometric functions - Cosine
Note that cosine is monotone to Euclidean distance on L2 normalized vectors.
Euclidean(x,y)^2 = sum( (x-y)^2 ) = sum(x^2) + sum(y^2) - 2 sum(x*y)
if x and y are L2 normalized, then sum(x^2)=sum(y^2)=1, and then
Euclidean(x_norm,y_norm)^2 = 2 * (1 - sum(x_norm*y_norm)) = 2 * (1 - cossim(x,y))
So using cosine similarity essentially means standardizing your data to unit length. But there are also computational benefits associated with this, as sum(x*y) is cheaper to compute for sparse data.
If you L2 normalized your data, then
Euclidean(x_norm, y_norm) = sqrt(2) * sqrt(1-cossim(x,y))
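For a quick sanity check, here is a small MATLAB sketch (variable names are just illustrative) that verifies this relationship numerically on random vectors:
% Verify Euclidean(x_norm, y_norm) = sqrt(2) * sqrt(1 - cossim(x,y)) numerically
x = randn(5,1);  y = randn(5,1);
xn = x / norm(x);                       % L2-normalize both vectors
yn = y / norm(y);
cossim = dot(xn, yn);                   % cosine similarity of x and y
euclid = norm(xn - yn);                 % Euclidean distance of the normalized vectors
abs(euclid - sqrt(2)*sqrt(1 - cossim))  % essentially zero (rounding error only)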
For the second part of your question: fixing L1 norm isn't that easy. Consider the vectors (1,1) and (2,2). Obviously, these two vectors have the same angle, and thus should have cosine similarity 1.
Using your equation, they would have similarity (2+2)/(2*4) = 0.5
Looking at the vectors (0,1) and (0,2) - which most people would agree should be just as similar as the pair above (and where cosine indeed gives the same similarity) - your equation yields (0+2)/(1*2) = 1. So two equally parallel pairs of vectors get different similarities (0.5 vs. 1); your similarity does not match any intuition, does it?
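To make this concrete, here is a quick MATLAB sketch (any language would do) comparing the two measures on exactly these vectors:
% Cosine similarity vs. the proposed L1-normalized similarity
cossim = @(a,b) dot(a,b) / (norm(a,2) * norm(b,2));
l1sim  = @(a,b) dot(a,b) / (norm(a,1) * norm(b,1));
cossim([1 1], [2 2])   % 1.0  - same direction
cossim([0 1], [0 2])   % 1.0  - same direction
l1sim([1 1], [2 2])    % 0.5
l1sim([0 1], [0 2])    % 1.0  - the L1 variant rates the two cases differently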

Related

Symmetric Regression In Stan

I have two vectors of data points (gene expression in tissues A and B) and I want to see if there is any systematic bias along their magnitude (same expression of gene X in A and B).
The idea was to build a simple regression model in Stan and see how much the posterior for the slope (beta) overlaps with 1.
model {
  for (n in 1:N) {
    y[n] ~ normal(alpha[i[n]] + beta[i[n]] * x[n], sigma[i[n]]);
  }
}
However, depending on which vector is x and which is y, I get different results, where one slope is about 1 and the other is not (see the image, where x and y are swapped; the colored lines represent the regressions I get from the model and gray is slope 1). As I found out, this is typical for regression methods like ordinary least squares, which makes sense when one variable depends on the other. Here, however, there is no such dependency and both vectors are "equal".
Now the question is: what would be an appropriate model to perform a symmetric regression in Stan?
Following the suggestion from LukasNeugebauer to standardize the data first and work without an intercept does not solve the problem.
I cheated a bit and found a solution:
When you rotate the coordinate system by 45 degrees, the new y-axis (y') represents the information of x and y in equal amounts. Therefore, assuming variance only along the new y-axis involves both x and y.
x' = x*cos((pi/180)*45) + y*sin((pi/180)*45)
y' = -x*sin((pi/180)*45) + y*cos((pi/180)*45)
Fitted to the rotated data, the model above now gives symmetric results, where a slope of 0 in the rotated system corresponds to a slope of 1 in the original system.
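For illustration, a minimal sketch of the rotation (shown here in MATLAB with placeholder data; the same transformation can be applied in the transformed data block of a Stan program):
% Rotate the (x, y) pairs by 45 degrees so both variables are treated symmetrically
theta = pi/4;                           % 45 degrees in radians
x = randn(100,1);                       % placeholder: expression in tissue A
y = x + 0.1*randn(100,1);               % placeholder: expression in tissue B
x_rot =  x*cos(theta) + y*sin(theta);   % x': carries x and y in equal amounts
y_rot = -x*sin(theta) + y*cos(theta);   % y': symmetric residual direction
% Regressing y_rot on x_rot, a slope of 0 corresponds to a slope of 1
% in the original x-y system.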

MATLAB: How to compute the similarity of two signals and get the correct consistency or coherence metric

I was wondering about the consistency metric. Generally, it lets us assess the similarity between two signals, right? If so, does a higher value (from 0.5 to 1) mean that there is a strong similarity between the signals? If the value is low (0.1-0.43), does that indicate poor coherence (poor similarity, i.e. the signals are probably different)? And if the metric is < 0, does that prove the signals are totally different? I am getting negative numbers, so is this interpretation possible?
Could someone give me a clear explanation of the consistency metric for signals? Here is my small code and figure. Thanks in advance.
s1 = signal3;
s2 = signal4;
if s1 ~= s2
    C1 = xcorr(s1);
    C2 = xcorr(s2);
    signal_mix = C1.*C2;   % mixing vector
    signal_mix1 = signal_mix;
else
    s1(1,:) == s2(1,:)
    s3 = s1;
    s3 = s2;
    signal_mix = s2;
end
n = 2;
for i = 1:length(signal_mix1)
    signal_mix1(i) = min(C1(i), C2(i)) / max(C1(i), C2(i));   % consistency score
    signal_mix2 = sum(signal_mix1(i));
end
Depending on your use case, you might want to consider a dynamic time warping (DTW) distance as a similarity metric (MATLAB has a built-in function for that). One problem with using correlation as a metric is that it always compares the same time step of both signals, so two identical signals where one is time-delayed could show low correlation. The DTW distance addresses this by also comparing values at adjacent time steps.
The downside of the DTW distance is that the distance itself can't be interpreted on its own, only relative to other distances. You can tell that two signals A & B with a distance of 150 are more similar than A & C with a distance of 250, but the distance of 150 on its own doesn't tell you a lot.
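As a rough sketch of the idea (this assumes the dtw function from the Signal Processing Toolbox and made-up signals):
% Two identical pulses, one delayed: sample-wise comparison at lag 0 is poor,
% but the DTW distance stays small because it aligns adjacent time steps.
t  = 0:0.01:1;
sA = sin(2*pi*5*t);
sB = [zeros(1,20), sA(1:end-20)];   % same signal, delayed by 20 samples
sC = randn(size(sA));               % unrelated signal
dAB = dtw(sA, sB)                   % small distance: similar shapes
dAC = dtw(sA, sC)                   % larger distance: dissimilar
% Only the relative ordering of dAB and dAC is meaningful, not their absolute values.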
First of all, you could use the xcorr function to calculate the cross-correlation between the two signals.
from Matlab help:
r = xcorr(x,y) returns the cross-correlation of two discrete-time
sequences. Cross-correlation measures the similarity between a vector
x and shifted (lagged) copies of a vector y as a function of the lag.
If x and y have different lengths, the function appends zeros to the
end of the shorter vector so it has the same length as the other.
additionally you could use xcov:
xcov computes the mean of its inputs, subtracts the mean, and then
calls xcorr.
The result of xcov can be interpreted as an estimate of the covariance
between two random sequences or as the deterministic covariance
between two deterministic signals.
In your example you are calling xcorr with a single signal, so it computes the auto-correlation of the signal with lagged copies of itself.
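A minimal sketch of the two-signal form (illustrative signals, not your data): pass both signals to xcorr and look at the lag where the cross-correlation peaks.
% Cross-correlate two signals and find the lag of maximum correlation
t = 0:0.01:1;
x = sin(2*pi*5*t);
y = [zeros(1,20), x(1:end-20)];     % x delayed by 20 samples
[r, lags] = xcorr(x, y, 'coeff');   % 'coeff' normalizes the peak to at most 1
[~, idx] = max(r);
best_lag = lags(idx)                % about -20: y lags x by 20 samples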
update:
based on the comment, it seems you need linear correlation; it can be calculated with the corr function:
p=corr(x,y)
The value of p is 1 when x and y behave exactly like each other, and -1 when x and y behave exactly opposite to each other.
When p is 0, there is no linear correlation between the two signals.
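For illustration, a small sketch (corr needs the Statistics and Machine Learning Toolbox; column vectors assumed):
x = (1:100)';
y =  2*x + randn(100,1);   % strongly positively related to x
z = -2*x + randn(100,1);   % strongly negatively related to x
corr(x, y)                 % close to  1
corr(x, z)                 % close to -1
corr(x, randn(100,1))      % close to  0: no linear relationship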

Why does treating the index as a continuous variable not work when performing an inverse discrete Fourier transform?

I have a set of points describing a closed curve in the complex plane, call it Z = [z_1, ..., z_N]. I'd like to interpolate this curve, and since it's periodic, trigonometric interpolation seemed a natural choice (especially because of its increased accuracy). By performing the FFT, we obtain the Fourier coefficients:
F = fft(Z);
At this point, we could get Z back by the formula (where 1i is the imaginary unit, and we use (k-1)*(n-1) because MATLAB indexing starts at 1)
Z(n) = (1/N) * sum_{k=1}^{N} F(k)*exp( 1i*2*pi*(k-1)*(n-1)/N ),   1 <= n <= N.
My question
Is there any reason why n must be an integer? Presumably, if we treat n as any real number between 1 and N, we will just get more points on the interpolated curve. Is this true? For example, if we wanted to double the number of points, could we not set
Z_new(n) = (1/N) * sum_{k=1}^{N} F(k)*exp( 1i*2*pi*(k-1)*(n-1)/N ),   with n = 1, 1.5, 2, 2.5, ..., N-1, N-0.5, N?
The new points are of course just subject to some interpolation error, but they'll be fairly accurate, right? The reason I'm asking this question is because this method is not working for me. When I try to do this, I get a garbled mess of points that makes no sense.
(By the way, I know that I could use the interpft() command, but I'd like to add points only in certain areas of the curve, for example between z_a and z_b)
The point is that when n is an integer, the exponential functions in the formula are orthogonal and can serve as a basis for the space. When n is not an integer, the exponentials are no longer orthogonal, and expressing a function in terms of such a non-orthogonal set is not as meaningful as you expected.
You can check the orthogonality yourself: if you pick two indices n_1 and n_2 that are not integers, the corresponding inner-product integrals are no longer zero, so the functions are not orthogonal anymore.
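A small numerical sketch of that orthogonality argument (discrete inner products over one period; the example values are my own):
% Inner product of two DFT basis exponentials over k = 0..N-1
N  = 16;
k  = (0:N-1)';
ip = @(n1,n2) abs( sum( exp(1i*2*pi*k*n1/N) .* conj(exp(1i*2*pi*k*n2/N)) ) / N );
ip(3, 5)     % distinct integers: essentially 0 (orthogonal)
ip(3, 3)     % same integer: 1
ip(3, 3.5)   % non-integer index: clearly nonzero, so no longer orthogonal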

Delta coefficients from mfcc

Can somebody explain to me how to calculate the delta coefficients from the MFCCs of a frame? I didn't understand the explanation in the Practical Cryptography tutorial.
The delta coefficients are the approximate derivatives, so a simple way is to calculate:
delta: v(t) = ( c(t+1) - c(t-1) ) / 2
delta-delta: a(t) = c(t-1) - 2 * c(t) + c(t+1)
But I have read that in practice, "it is more common to make more sophisticated approximations to the slope, using a wider context of frames" (Jurafsky et al., 2007, Speech and Language Processing) to determine the delta and delta-delta. For example, we might consult a finite differences table (we can see that the two values above are the lowest order estimates from those tables, but higher order estimates use more points in the calculations).
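A minimal sketch of the two simple estimates above, applied frame-by-frame to an MFCC matrix (the matrix layout and edge padding are my assumptions; real toolkits usually use the wider regression window mentioned above):
% C: coefficients x frames (e.g. 13 x numFrames), placeholder values here
C  = randn(13, 100);
Cp = [C(:,1), C, C(:,end)];                              % repeat edge frames as padding
delta  = ( Cp(:,3:end) - Cp(:,1:end-2) ) / 2;            % v(t) = ( c(t+1) - c(t-1) ) / 2
ddelta = Cp(:,1:end-2) - 2*Cp(:,2:end-1) + Cp(:,3:end);  % a(t) = c(t-1) - 2*c(t) + c(t+1)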

Find the inverse of a Matrix in MATLAB, is inv(A) or A\eye(size(A)) more precise? [duplicate]

This question already has answers here:
Why is Matlab's inv slow and inaccurate?
(3 answers)
Closed 7 years ago.
The title explains it already. If I need to find an inverse of a matrix, is there any reason I should use A\eye(size(A)) instead of inv(A)?
And before you ask: Yes, I really need the inverse, not only for calculations.
PS:
isequal(inv(A), A\eye(size(A)))
ans =
0
So which one is more precise?
UPDATE: This question was closed as it appeared to be a duplicate of the question "Why is Matlab's inv slow and inaccurate?". This question differs significantly by addressing neither the speed nor the accuracy of the function inv, but the difference between inv and A\eye(size(A)) for calculating the true inverse of a matrix.
Let's disregard performance (speed) and best practice for a bit.
eps(n) is a command that returns the distance from n to the next larger double-precision number. So, eps(1) = 2.2204e-16 means that the first representable number after 1 is 1 + 2.2204e-16. Similarly, eps(3000) = 4.5475e-13. Now, let's look at the precision of your calculations:
n = 100;
A = rand(n);
inv_A_1 = inv(A);
inv_A_2 = A \ eye(n);
max(max(abs(inv_A_1-inv_A_2)))
ans =
1.6431e-14
eps(127) = 1.4211e-14
eps(128) = 2.8422e-14
In other words, 127 is the largest integer whose eps is still smaller than the maximum difference between your two matrices.
Now, let's check the accuracy when we try to recreate the identity matrix from the two inverse matrices.
error_1 = max(max(abs((A\eye(size(A))*A) - eye(size(A)))))
error_1 =
3.1114e-14
error_2 = max(max(abs((inv(A)*A) - eye(size(A)))))
error_2 =
2.3176e-14
Here, 255 is the largest integer whose eps is still smaller than the maximum error of the two approaches.
In summary, inv(A) is slightly more accurate, but once you start using the inverse matrices, they are for all intents and purposes identical.
Now, let's have a look at the performance of the two approaches:
n = fix(logspace(1,3,40));
for i = 1:numel(n)
    A = rand(round(n(i)));
    t1(i) = timeit(@() inv(A));
    t2(i) = timeit(@() A\eye(n(i)));
end
loglog(n, [t1; t2])
It appears that which of the two approaches is fastest is dependent on the matrix size. For instance, using inv is slower for n = 255, but faster for n = 256.
In summary, choose approach based on what's important to you. For most intended purposes, the two approaches are identical.
Note that svd and pinv may be of interest if you're working with badly scaled matrices. If precision is really important, you should consider the Symbolic Math Toolbox.
I know you said that you "actually need the inverse", but I can't let this go unsaid: Using inv(A)*b is never the best approach for solving a linear equation! I won't explain further as I think you know this already.
If you need the inverse, you should use inv.
The inverse is calculated via LU decomposition, whereas the backslash operator mldivide calculates the solution to your linear system using different methods depending on the properties of your matrix A (see https://scicomp.stackexchange.com/a/1004), which can yield less accurate results for the inverse.
It should be noted that if you want to solve a linear system, the calculation is likely going to be much faster and more accurate using mldivide(\). The MATLAB documentation of inv is basically one big warning not to use inv to solve linear systems.
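As a rough illustration of that advice (arbitrary test matrix; exact numbers will vary):
% Solving A*x = b: backslash vs. forming the inverse first
A = rand(500) + 500*eye(500);   % a reasonably well-conditioned test matrix
b = rand(500,1);
x1 = A \ b;                     % mldivide: solve the system directly
x2 = inv(A) * b;                % form the inverse, then multiply
norm(A*x1 - b)                  % residual of mldivide
norm(A*x2 - b)                  % residual via inv, typically no smaller and often larger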
Just one way of trying to check this, not sure if it's completely helpful though: multiply your computed inverse back with the original matrix and check the deviation from the identity matrix:
A  = rand(111);
E1 = abs( (A\eye(size(A)) * A) - eye(size(A)) );
E2 = abs( (inv(A) * A) - eye(size(A)) );
mean(E1(:))
mean(E2(:))
inv seems to be more accurate as I would have expected. Maybe somebody can re-evaluate this. ;)