I need to find the cosine similarity between two frequency vectors in MATLAB.
Example vectors:
a = [2,3,4,4,6,1]
b = [1,3,2,4,6,3]
How do I measure the cosine similarity between these vectors in MATLAB?
Take a quick look at the mathematical definition of Cosine similarity.
From the definition, you just need the dot product of the vectors divided by the product of the Euclidean norms of those vectors.
% MATLAB 2018b
a = [2,3,4,4,6,1];
b = [1,3,2,4,6,3];
cosSim = sum(a.*b)/sqrt(sum(a.^2)*sum(b.^2)); % 0.9436
Alternatively, you could use
cosSim = (a(:).'*b(:))/sqrt(sum(a.^2)*sum(b.^2)); % 0.9436
which gives the same result.
After reading this correct answer, to avoid sending you to another castle I've added another approach using MATLAB's built-in linear algebra functions, dot() and norm().
cosSim = dot(a,b)/(norm(a)*norm(b)); % 0.9436
See also the tag-wiki for cosine-similarity.
Performance by Approach:
sum(a.*b)/sqrt(sum(a.^2)*sum(b.^2))
(a(:).'*b(:))/sqrt(sum(a.^2)*sum(b.^2))
dot(a,b)/(norm(a)*norm(b))
Each point represents the geometric mean of the computation times for 10 randomly generated vectors.
If you have the Statistics toolbox, you can use the pdist2 function with the 'cosine' input flag, which gives 1 minus the cosine similarity:
a = [2,3,4,4,6,1];
b = [1,3,2,4,6,3];
result = 1-pdist2(a, b, 'cosine');
Related
What are the difference between the following two functions?
prepTransform.m
function [mu trmx] = prepTransform(tvec, comp_count)
% Computes transformation matrix to PCA space
% tvec - training set (one row represents one sample)
% comp_count - count of principal components in the final space
% mu - mean value of the training set
% trmx - transformation matrix to comp_count-dimensional PCA space
% this is memory-hungry version
% commented out is the version proper for Win32 environment
tic;
mu = mean(tvec);
cmx = cov(tvec);
%cmx = zeros(size(tvec,2));
%f1 = zeros(size(tvec,1), 1);
%f2 = zeros(size(tvec,1), 1);
%for i=1:size(tvec,2)
% f1(:,1) = tvec(:,i) - repmat(mu(i), size(tvec,1), 1);
% cmx(i, i) = f1' * f1;
% for j=i+1:size(tvec,2)
% f2(:,1) = tvec(:,j) - repmat(mu(j), size(tvec,1), 1);
% cmx(i, j) = f1' * f2;
% cmx(j, i) = cmx(i, j);
% end
%end
%cmx = cmx / (size(tvec,1)-1);
toc
[evec eval] = eig(cmx);
eval = sum(eval);
[eval evid] = sort(eval, 'descend');
evec = evec(:, evid(1:size(eval,2)));
% save 'nist_mu.mat' mu
% save 'nist_cov.mat' evec
trmx = evec(:, 1:comp_count);
pcaTransform.m
function [pcaSet] = pcaTransform(tvec, mu, trmx)
% tvec - matrix containing vectors to be transformed
% mu - mean value of the training set
% trmx - pca transformation matrix
% pcaSet - output set transforrmed to PCA space
pcaSet = tvec - repmat(mu, size(tvec,1), 1);
%pcaSet = zeros(size(tvec));
%for i=1:size(tvec,1)
% pcaSet(i,:) = tvec(i,:) - mu;
%end
pcaSet = pcaSet * trmx;
Which one is actually doing PCA?
If one is doing PCA, what is the other one doing?
The first function prepTransform is actually doing the PCA on your training data where you are determining the new axes to represent your data onto a lower dimensional space. What it does is that it finds the eigenvectors of the covariance matrix of your data and then orders the eigenvectors such that the eigenvector with the largest eigenvalue appears in the first column of the eigenvector matrix evec and the eigenvector with the smallest eigenvalue appears in the last column. What's important with this function is that you can define how many dimensions you want to reduce the data down to by keeping the first N columns of evec which will allow you to reduce your data down to N dimensions. The discarding of the other columns and keeping only the first N is what is set as trmx in the code. The variable N is defined by the prep_count variable in prepTransform function.
The second function pcaTransform finally transforms data that is defined within the same domain as your training data but not necessarily the training data itself (it could be if you wish) onto the lower dimensional space that is defined by the eigenvectors of the covariance matrix. To finally perform the reduction of dimensions, or dimensionality reduction as it is popularly known, you simply take your training data where each feature is subtracted from its mean and you multiply your training data by the matrix trmx. Note that prepTransform outputting the mean of each feature in the vector mu is important in order to mean subtract your data when you finally call pcaTransform.
How to use these functions
To use these functions effectively, first determine the trmx matrix, which contain the principal components of your data by first defining how many dimensions you want to reduce your data down to as well as the mean of each feature stored in mu:
N = 2; % Reduce down to two dimensions for example
[mu, trmx] = prepTransform(tvec, N);
Next you can finally perform dimensionality reduction on your data that is defined within the same domain as tvec (or even tvec if you wish, but it doesn't have to be) by:
pcaSet = pcaTransform(tvec, mu, trmx);
In terms of vocabulary, pcaSet contain what are known as the principal scores of your data, which is the term used for the transformation of your data to the lower dimensional space.
If I can recommend something...
Finding PCA through the eigenvector approach is known to be unstable. I highly recommend you use the Singular Value Decomposition via svd on the covariance matrix where the V matrix of the result already gives you the eigenvectors sorted which correspond to your principal components:
mu = mean(tvec, 1);
[~,~,V] = svd(cov(tvec));
Then perform the transformation by taking the mean subtracted data per feature and multiplying by the V matrix, once you subset and grab the first N columns of V:
N = 2;
X = bsxfun(#minus, tvec, mu);
pcaSet = X*V(:, 1:N);
X is the mean subtracted data which performs the same thing as doing pcaSet = tvec - repmat(mu, size(tvec,1), 1);, but you are not explicitly replicating the mean vector over each training example but letting bsxfun do that for you internally. However, taking advantage of MATLAB R2016b, this repeating can be done without the explicit call to bsxfun:
X = tvec - mu;
Further Reading
If you fully want to understand the code that was written and the theory behind what it's doing, I recommend the following two Stack Overflow posts that I have written that talk about the topic:
What does selecting the largest eigenvalues and eigenvectors in the covariance matrix mean in data analysis?
How to use eigenvectors obtained through PCA to reproject my data?
The first post brings the code you presented into light which performs PCA using the eigenvector approach. The second post touches base on how you'd do it using the SVD towards the end of the answer. This answer I've written here is a mix between the two posts above.
In a MATLAB code I am using the kullback_leibler_divergence dissimilarity function that can be found here.
I have a matrix A and I compute the dissimilarity matrix using the downloaded function.
In theory, if I calculate
clear
A = rand(132,6); % input matrix
diss_mat = pdist(A,'#kullback_leibler_divergence'); % calculate the dissimilarity
square_diss_mat = squareform(diss_mat); % I put the dissimilarities in a square matrix
one_dist = pdist2(A(1,:),A,#kullback_leibler_divergence);
I should get the first row of square_diss_mat equal to one_dist, but I am not.
If I use the Euclidean distance I get it:
diss_mat = pdist(A);
square_diss_mat = squareform(diss_mat);
one_dist = pdist2(A(1,:),A);
Could you please tell me why?
The kullback_leibler_divergence is not symmetric, thus the order matters:
one_dist = pdist2(A, A(1,:), #kullback_leibler_divergence);
I don't see any practical application using a non-symmetric function with pdist or pdist2.
I'm trying to get wavelet decomposition of arcsin(x) using, say, Haar wavelets
When using both Matlab's dwt or wavedec functions, I get strange values for approximating coefficients. Since applying low-pass Haar wavelets's filter equals to performing half-sum and the maximum of arcsin is pi/2, I assume that approximating coefficients can't surpass pi/2, yet this code:
x = linspace(0,1,128);
y = asin(x);
[cA, cD] = dwt(y, 'haar'); %//cA for approximating coefficients
returns values more than pi/2 in cA. Why is that?
I believe what makes you confused here is thinking that Haar's filter just averages two adjacent numbers when computing 1-level approximation coefficients. Due to the energy preservation feature of the scaling function, each pair of numbers gets divided by sqrt(2) instead of 2. In fact, you could see what a particular wavelet filter does by typing in the following command (for the Haar filter in this case):
[F1,F2] = wfilters('haar','d')
F1 =
0.7071 0.7071
F2 =
-0.7071 0.7071
You can then check the validity of what you have gotten above by constructing a simple loop:
CA_compare = zeros(1,64);
for k = 1 : 64
CA_compare(k) = dot( y(2*k-1 : 2*k), F1 );
end
You will then see that "CA_compare" contains exactly the same values as your "cA" does.
Hope this helps.
I was aquainted in using the fft in matlab with the code
fft(signal,[],n)
where n tells the dimension on which to apply the fft as from Matlab documentation:
http://www.mathworks.it/it/help/matlab/ref/fft.html
I would like to do the same with dct.
Is this possible? I could not find any useful information around.
Thanks for the help.
Luigi
dct does not have the option to pick a dimension like fft. You will have to either transpose your input to operate on rows or pick one vector from your signal and operate on that.
yep!
try it:
matrix = dctmtx(n);
signal_dct = matrix * signal;
Edit
Discrete cosine transform.
Y = dct(X) returns the discrete cosine transform of X.
The vector Y is the same size as X and contains the discrete cosine transform coefficients.
Y = dct(X,N) pads or truncates the vector X to length N before transforming.
If X is a matrix, the dct operation is applied to each column. This transform can be inverted using IDCT.
% Example:
% Find how many dct coefficients represent 99% of the energy
% in a sequence.
x = (1:100) + 50*cos((1:100)*2*pi/40); % Input Signal
X = dct(x); % Discrete cosine transform
[XX,ind] = sort(abs(X)); ind = fliplr(ind);
num_coeff = 1;
while (norm([X(ind(1:num_coeff)) zeros(1,100-num_coeff)])/norm(X)<.99)
num_coeff = num_coeff + 1;
end;
num_coeff
I've read some explanations of how autocorrelation can be more efficiently calculated using the fft of a signal, multiplying the real part by the complex conjugate (Fourier domain), then using the inverse fft, but I'm having trouble realizing this in Matlab because at a detailed level.
Just like you stated, take the fft and multiply pointwise by its complex conjugate, then use the inverse fft (or in the case of cross-correlation of two signals: Corr(x,y) <=> FFT(x)FFT(y)*)
x = rand(100,1);
len = length(x);
%# autocorrelation
nfft = 2^nextpow2(2*len-1);
r = ifft( fft(x,nfft) .* conj(fft(x,nfft)) );
%# rearrange and keep values corresponding to lags: -(len-1):+(len-1)
r = [r(end-len+2:end) ; r(1:len)];
%# compare with MATLAB's XCORR output
all( (xcorr(x)-r) < 1e-10 )
In fact, if you look at the code of xcorr.m, that's exactly what it's doing (only it has to deal with all the cases of padding, normalizing, vector/matrix input, etc...)
By the Wiener–Khinchin theorem, the power-spectral density (PSD) of a function is the Fourier transform of the autocorrelation. For deterministic signals, the PSD is simply the magnitude-squared of the Fourier transform. See also the convolution theorem.
When it comes to discrete Fourier transforms (i.e. using FFTs), you actually get the cyclic autocorrelation. In order to get proper (linear) autocorrelation, you must zero-pad the original data to twice its original length before taking the Fourier transform. So something like:
x = [ ... ];
x_pad = [x zeros(size(x))];
X = fft(x_pad);
X_psd = abs(X).^2;
r_xx = ifft(X_psd);