MATLAB: How to compute the similarity of two signals and get the correct consistency or coherence metric

I was wondering about the consistency metric. Generally, it lets us deduce the similarity between two signals, right? If so, does a higher value (from 0.5 to 1) mean there is a strong similarity between the signals? If the value is low (0.1-0.43), does that indicate poor coherence between the signals (poor similarity, i.e. the signals are probably different)? And if the metric is < 0, does that prove the signals are totally different? I ask because I'm getting negative numbers. Is this interpretation valid?
Can someone give me a clear understanding of the consistency metric of a signal? Here is my small code and figure. Thanks in advance.
s1 = signal3;
s2 = signal4;
C1 = xcorr(s1);                 % auto-correlation of s1
C2 = xcorr(s2);                 % auto-correlation of s2
if ~isequal(s1, s2)
    signal_mix = C1 .* C2;      % mixing vector
    signal_mix1 = signal_mix;
else
    % identical signals: the mixing vector is the same product
    signal_mix = s2;
    signal_mix1 = C1 .* C2;
end
signal_mix2 = 0;
for i = 1:length(signal_mix1)
    % per-lag consistency score; note C1 and C2 can be negative,
    % so this ratio can also be negative
    signal_mix1(i) = min(C1(i), C2(i)) / max(C1(i), C2(i));
    signal_mix2 = signal_mix2 + signal_mix1(i);   % accumulated score
end

Depending on your use case you might want to consider the dynamic time warping (DTW) distance as a similarity metric (MATLAB has a built-in function for it). One problem with using correlation as a metric is that it always compares the same time step of the two signals, so two identical signals where one is time-delayed can show low correlation. The DTW distance addresses this by also comparing values of adjacent time steps.
The downside of the DTW distance is that the distance itself can't be interpreted on its own, only relative to other distances. So you can tell that two signals A & B with a distance of 150 are more similar than A & C with a distance of 250, but the distance of 150 on its own doesn't tell you a lot.
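A minimal sketch of the difference (assumes the Signal Processing Toolbox, which provides dtw and square):
t = 0:0.01:2*pi;
A = sin(t);          % reference signal
B = sin(t - 0.5);    % same shape, time-delayed
C = square(t);       % different shape
dAB = dtw(A, B);     % small distance: similar shapes despite the delay
dAC = dtw(A, C);     % larger distance: dissimilar shapes
fprintf('dtw(A,B) = %.2f, dtw(A,C) = %.2f\n', dAB, dAC)
As noted above, only the relative sizes of dAB and dAC are meaningful.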

First of all, you could use the xcorr function to calculate the cross-correlation between two signals.
From the MATLAB help:
r = xcorr(x,y) returns the cross-correlation of two discrete-time
sequences. Cross-correlation measures the similarity between a vector
x and shifted (lagged) copies of a vector y as a function of the lag.
If x and y have different lengths, the function appends zeros to the
end of the shorter vector so it has the same length as the other.
Additionally, you could use xcov:
xcov computes the mean of its inputs, subtracts the mean, and then
calls xcorr.
The result of xcov can be interpreted as an estimate of the covariance
between two random sequences or as the deterministic covariance
between two deterministic signals.
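A quick sanity check of that relationship (a sketch; both functions are in the Signal Processing Toolbox):
x = randn(1,100); y = randn(1,100);
c1 = xcov(x, y);
c2 = xcorr(x - mean(x), y - mean(y));
max(abs(c1 - c2))   % ~0, up to floating-point rounding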
In your example you are using xcorr with one signal, so it computes the auto-correlation between the signal and a lagged copy of itself.
Update:
Based on the comment, it seems you need linear correlation, which can be calculated with the corr function:
p=corr(x,y)
The value of p is 1 when x and y behave exactly like each other, and -1 when x and y behave exactly opposite to each other.
When p is 0, it means there is no linear correlation between the two signals.
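A minimal illustration of those three cases (with made-up vectors; corr is in the Statistics and Machine Learning Toolbox):
x = (1:100)';
p_same = corr(x, x)              %  1: identical behaviour
p_opp  = corr(x, -x)             % -1: exactly opposite behaviour
p_none = corr(x, randn(100,1))   % near 0: no linear relationship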

Related

Approximate continuous probability distribution in Matlab

Suppose I have a continuous probability distribution, e.g., Normal, on a support A. Suppose that there is a Matlab code that allows me to draw random numbers from such a distribution, e.g., this.
I want to build a Matlab code to "approximate" this continuous probability distribution with a probability mass function spanning over r points.
This means that I want to write a Matlab code to:
(1) Select r points from A. Let us call these points a1,a2,...,ar. These points will constitute the new discretised support.
(2) Construct a probability mass function over a1,a2,...,ar. This probability mass function should "well" approximate the original continuous probability distribution.
Could you help by also providing an example? This is a similar question asked for Julia.
Here are some of my thoughts. Suppose that the continuous probability distribution of interest is one-dimensional. One way to go could be:
(1) Draw 10^6 random numbers from the continuous probability distribution of interest and store them in a column vector D.
(2) Suppose that r=10. Compute the 10-th, 20-th,..., 90-th quantiles of D. Find the median point falling in each of the 10 bins obtained. Call these median points a1,...,ar.
How can I construct the probability mass function from here?
Also, how can I generalise this procedure to more than one dimension?
Update using histcounts: I thought about using histcounts. Do you think it is a valid option? For many dimensions I can use this.
clear
rng default
%(1) Draw P random numbers for standard normal distribution
P=10^6;
X = randn(P,1);
%(2) Apply histcounts
[N,edges] = histcounts(X);
%(3) Construct the new discrete random variable
%(3.1) The support of the discrete random variable is the collection of the mean values of each bin
supp=zeros(size(N,2),1);
for j=2:size(N,2)+1
supp(j-1)=(edges(j)-edges(j-1))/2+edges(j-1);
end
%(3.2) The probability mass function of the discrete random variable is the
%number of X within each bin divided by P
pmass=N/P;
%(4) Check if the approximation is OK
%(4.1) Find the CDF of the discrete random variable
CDF_discrete=zeros(size(N,2),1);
for h=2:size(N,2)+1
CDF_discrete(h-1)=sum(X<=edges(h))/P;
end
%(4.2) Plot empirical CDF of the original random variable and CDF_discrete
ecdf(X)
hold on
scatter(supp, CDF_discrete)
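As a possible simplification, histcounts can also return the probabilities directly, which should be equivalent to N/P above:
[pmass, edges] = histcounts(X, 'Normalization', 'probability');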
I don't know if this is what you're after, but maybe it can help you. You know, P(X = x) = 0 for any point x in a continuous probability distribution; that is, the pointwise probability of X mapping to x is infinitesimally small, and thus regarded as 0.
What you could do instead, in order to approximate it with a discrete probability space, is to define some points (x_1, x_2, ..., x_n) and let their discrete probabilities be the integral of the PDF (of your continuous probability distribution) over a range assigned to each point, that is
P(x_1) = P(X ∈ (-∞, e_1]), P(x_2) = P(X ∈ (e_1, e_2]), ..., P(x_n) = P(X ∈ (e_{n-1}, +∞)), where e_i is the right edge of the range assigned to x_i.
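In MATLAB, that bin-integral idea might look like the following minimal sketch for a standard normal (normcdf is in the Statistics and Machine Learning Toolbox; the ±3 range for the interior edges is an arbitrary assumption):
r = 10;
inner = linspace(-3, 3, r-1);   % interior bin edges
edges = [-inf, inner, inf];     % r bins covering the whole support
pmass = diff(normcdf(edges));   % P(x_i) = integral of the PDF over bin i
sum(pmass)                      % sanity check: exactly 1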
:-)

MATLAB: FFT with specified modes

Using MATLAB I want to implement a kind of spectral method. The idea is as follows (described for an example that works):
Dirichlet (and Neumann, and periodic) boundaries lead to eigenvalues in the Fourier space of k = n*pi/L.
Project all the linear operators in the Fourier space onto the discretized k-values,
e.g. L = -D*(k.*k) (for diffusion only).
Define the propagator in time as P = exp( dt * L ).
Calculate the evolution in time iteratively by uh_{n+1} = uh_n .* P.
Return the calculated value to real space whenever I want to save it, via ifft( uh ). (A sketch of these steps follows below.)
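A minimal sketch of those steps for the periodic case (all parameter values here are arbitrary assumptions):
N = 128; L = 2*pi; D = 0.1; dt = 1e-3;
x = L*(0:N-1)'/N;
k = (2*pi/L) * [0:N/2-1, -N/2:-1]';   % wavenumbers in fft ordering
Lop = -D * (k.*k);                    % linear diffusion operator in Fourier space
P = exp(dt * Lop);                    % time propagator
u = exp(-10*(x - L/2).^2);            % some initial condition
uh = fft(u);
for n = 1:1000
    uh = uh .* P;                     % one time step in Fourier space
end
u = real(ifft(uh));                   % back to real space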
My question concerns other boundary conditions.
In my case I have Robin boundary conditions, so the eigenvalues are defined through some transcendental equation of the form tan( x ) = x or the like. The problem of computing them is solved.
Once I have the values, steps 2 and 3 are simple too, but:
to apply P to the Fourier-transformed vector uh, I have to ensure that my uh = fft(u) uses the same eigenvalues, which is not the case by default.
By default MATLAB uses equidistant modes for the fft.
Is there any simple trick for this?
Or, maybe, do I have any mistake in my thoughts?
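For what it's worth, one generic way to express the idea: with non-equidistant eigenvalues there is no fft shortcut, but fft/ifft can be replaced by an explicit change of basis. In the following hedged sketch the eigenvalues and eigenfunctions are placeholders, not the actual Robin solutions:
N = 64; Lx = 1; D = 0.1; dt = 1e-3;
x = linspace(0, Lx, N)';
k = (1:N)' * pi / Lx + 0.1;       % placeholder eigenvalues (yours come from tan(x) = x)
V = cos(x * k');                  % columns: eigenfunctions sampled on the grid (placeholder)
u = exp(-50*(x - Lx/2).^2);       % initial condition
uh = V \ u;                       % "forward transform": expansion coefficients
P = exp(dt * (-D * k.^2));        % propagator on the given eigenvalues
uh = uh .* P;                     % one time step
u = V * uh;                       % "inverse transform" back to the grid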

implementation of Lomb-Scargle periodogram

From the official MATLAB site, the Lomb-Scargle periodogram is defined as
http://www.mathworks.com/help/signal/ref/plomb.html#lomb
Suppose we have some random signal, say
x=rand(1,1000);
The average of this signal can easily be computed as
average=mean(x);
and the variance as
average_vector=repmat(average,1,1000);
diff=x-average_vector;   % note: this shadows the built-in diff function
variance= sum(diff.*diff)/(length(x)-1);
How should I continue? I mean, how do I choose the frequencies? Calculating the time offset is not a problem; let us suppose we have the time vector
t=0:0.1:99.9;
so that the total length of the time vector is 1000. Generally, for the DFT, the frequency bins are represented as multiples of 2*pi/N, where N is the length of the signal, but what about this case? Thanks in advance.
As can be seen from the provided link to the MATLAB documentation, the algorithm does not depend on a specific selection of the sampling times tk. Note that for equally spaced sampling times (as you have selected), the same link indicates:
The offset depends only on the measurement times and vanishes when the times are equally spaced.
So, as you put it, "calculation of time offset is not a problem".
Similar to the DFT, which can be obtained from the Discrete-Time Fourier Transform (DTFT) by selecting a discrete set of frequencies, we can also choose f[n] = n * sampling_rate/N (where sampling_rate = 10 for your selection of tk). We disregard the value of PLS(f[n]) for n = mN, where m is any integer, since its value is ill-formed, at least in the formula posted in the link.
Thus, for real-valued data samples the periodogram reduces, up to normalization, to PLS(f[n]) = |Y[n]|^2 / (N * variance), where Y can be expressed in terms of the diff vector you defined as:
Y = fft(diff);
That said, as indicated on Wikipedia, the Lomb–Scargle method is primarily intended for use with unequally spaced data.
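A sketch of that frequency selection in code (plomb is in the Signal Processing Toolbox; the normalization of the fft-based estimate is an assumption and may differ from plomb's default by a constant factor):
x = rand(1,1000);
t = 0:0.1:99.9;
N = length(x);
d = x - mean(x);
Y = fft(d);
f = (1:N/2-1) * (10/N);                    % f[n] = n * sampling_rate / N, skipping n = 0
P_fft = abs(Y(2:N/2)).^2 / (N * var(x));   % periodogram from the fft
[P_lomb, f_lomb] = plomb(x, t);            % built-in Lomb-Scargle for comparison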

Similarity between two signals: looking for simple measure

I have 20 signals (time courses) in group A and 20 signals in group B. I want to find a measure showing that group A is different from group B. For example, I ran xcorr on the signals within each group, but then I need to compare the results somehow. I tried taking the maximal amplitude of each xcorr pair, which is sort of a measure of maximal similarity, and compared all these values between the two groups, but there was no difference. What else can I do? I could also compare frequency spectra, but then I again do not know which frequency bin to take.
Any suggestions / references are highly appreciated!
I have about 20 signals in each group; those are my samples. I do not know a priori what the difference might be. Here are 9 sample signals from each group, their auto-correlations, and cross-correlations for a subset of signals (group 1 vs. group 1, group 2 vs. group 2, group 1 vs. group 2). I do not see any evident difference. I also do not understand how you propose to compare the cross-correlations: which peaks should I take? All the signals were detrended and z-scored.
Well, this may be too simplistic an answer, and too complex a measure, but maybe it's worth something.
In order to compare signals, we really have to establish some criterion by which we compare them. This could be so many things. If we want signals that look visually similar, we perform time domain analysis. If we are talking about audio signals that sound similar, we care about frequency or time-frequency analysis. If the signals are supposed to represent noise, then signal variance should be a good measure. In general we may want to use a combination of all sorts of measures. We can do this with a weighted index.
First let's establish what we have: there are two sets of signals: set A and set B. We want some measure that shows set A is different from set B. The signals are detrended.
We take signal a in A and signal b in B. The list of things we can compare:
Similarity in time domain (static): multiply in place and sum.
Similarity in time domain (with shift*): take the fft of each signal, multiply, and ifft. (I believe this is equivalent to MATLAB's xcorr.)
Similarity in frequency domain (static**): take the fft of each signal, multiply, and sum.
Similarity in frequency domain (with shift*): multiply the two signals and take the fft. This will show whether the signals share similar spectral shapes.
Similarity in energy (or power if different lengths): square the two signals and sum each (and divide by signal length for power). (Since the signals were detrended, this should be signal variance.) Then subtract and take the absolute value for a measure of signal-variance similarity.
* (with shift) -- You could choose to sum over the entire correlation vector to measure total overall correlation, to sum only the values in the correlation vector that surpass a certain threshold (as if you expect echoes of one signal in the other), or to just take the maximum value of the correlation vector (whose index is the shift of the second signal that yields maximal correlation with the first). Also, if the amount of shift needed to reach maximal correlation is important (i.e. if signals are similar only when a relatively small shift reaches the point of maximal correlation), you can incorporate a measure of that index displacement.
** (frequency domain similarity) -- You may want to mask the part of the spectrum that you're not concerned with. For instance, if you only care about the higher-frequency structures (fs/4 and up), you could do:
mask = zeros(1,n); mask(n/4:end) = 1;   % keep only the upper bins
freq_static = mean(fft(a) .* fft(b) .* mask);
Also, we may want to implement a circular correlation like so:
function c = circular_xcorr(a,b)
full = xcorr(a,b);                        % length 2*n-1 for length-n inputs
mid = (length(full) + 1) / 2;             % index of the zero-lag term
c = full(mid:end);                        % zero and positive lags
c(2:end) = c(2:end) + full(mid-1:-1:1);   % fold the negative lags on top
end
Finally, we choose the characteristics that are important or relevant, and create a weighted index. Example:
n = 100;
a = rand(1,n); b = rand(1,n);
time_corr_thresh = .8 * n; freq_corr_thresh = .6 * n;
time_static = max(a .* b);
time_shifted = circular_xcorr(a,b); time_shifted = sum(time_shifted(time_shifted > time_corr_thresh));
freq_static = max(abs(fft(a) .* fft(b)));
freq_shifted = abs(fft(a .* b)); freq_shifted = sum(freq_shifted(freq_shifted > freq_corr_thresh));
w1 = 0; w2 = 1; w3 = .7; w4 = 0;   % one weight per measure
index = w1 * time_static + w2 * time_shifted + w3 * freq_static + w4 * freq_shifted;
We compute this index for each pair of signals.
I hope that this outline of signal characterization helps. Comment if anything is unclear.
With reference to Brian's answer above, I've written a Python function to compute the similarity of time-series signals as below:
import numpy as np

def compute_similarity(ref_rec, input_rec, weightage=[0.33, 0.33, 0.33]):
    ## Time domain similarity
    ref_time = np.correlate(ref_rec, ref_rec)
    inp_time = np.correlate(ref_rec, input_rec)
    diff_time = abs(ref_time - inp_time)
    ## Freq domain similarity
    ref_freq = np.correlate(np.fft.fft(ref_rec), np.fft.fft(ref_rec))
    inp_freq = np.correlate(np.fft.fft(ref_rec), np.fft.fft(input_rec))
    diff_freq = abs(ref_freq - inp_freq)
    ## Power similarity
    ref_power = np.sum(ref_rec ** 2)
    inp_power = np.sum(input_rec ** 2)
    diff_power = abs(ref_power - inp_power)
    return float(weightage[0] * diff_time
                 + weightage[1] * diff_freq
                 + weightage[2] * diff_power)

determining "how good" a correlation is in matlab?

I'm working with a set of data and I've obtained certain correlations (using Pearson's correlation coefficient). I've been asked to determine the "quality of the correlation": by that, my supervisor means he wants to see what the correlations would be if I permuted the y values of my ordered pairs, and compared the obtained correlation coefficients. Does anyone know a nice way of doing this? Is there a MATLAB function that determines how good a correlation is compared to correlations between random permutations of the data?
First, you have to check whether the correlation coefficient you got is significantly different from zero. The corr function can do this (see pval).
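For instance, with made-up column vectors x and y:
x = randn(30,1); y = 0.5*x + randn(30,1);
[r, pval] = corr(x, y)   % small pval: r is significantly different from zero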
Second, if it's significantly different from zero, then you would like to decide whether this difference is also significant from a practical point of view. In practice, the square of the correlation coefficient (the coefficient of determination) is considered practically significant if it's larger than 0.5, which means that the variation of one of the correlated parameters "explains" at least 50% of the variation of the other.
Third, there are cases where the coefficient of determination is close to one, but this is not enough to establish the "goodness of correlation". For example, if you measure the same variable using two different methods, you will usually get very similar values, so the correlation coefficient will be almost 1. In such cases you should apply the Bland-Altman analysis, which is very easy to implement in MATLAB (see the sketch below), and has its own "goodness" parameters (the bias and the so-called limits of agreement).
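A minimal Bland-Altman sketch (the two measurement vectors here are simulated stand-ins; yline needs R2018b or newer):
m1 = randn(50,1);                   % method 1 measurements (simulated)
m2 = m1 + 0.1 + 0.2*randn(50,1);    % method 2: same quantity, small bias and noise
avg = (m1 + m2)/2;
dif = m1 - m2;
bias = mean(dif);                   % systematic difference between the methods
loa = bias + 1.96*std(dif)*[-1 1];  % limits of agreement
scatter(avg, dif); hold on
yline(bias); yline(loa(1),'--'); yline(loa(2),'--')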
You can permute one vector's values N times and calculate the correlation coefficient (cc) for each iteration. Then you can compare the distribution of those values with the real correlation.
Something like this:
%# random data
n = 20;
x = (1:n)';
y = x + randn(n,1)*3;
%# real correlation
cc = corr(x,y);
%# do permutations
n_iter = 100; %# number of permutations
cc_iter = zeros(n_iter,1); %# preallocate the vector
for k = 1:n_iter
ind = randperm(n); %# vector of random permutations
cc_iter(k) = corr(x,y(ind));
end
%# calculate statistics
cc_mean = mean(cc_iter);
cc_std = std(cc_iter);
zval = (cc - cc_mean) ./ cc_std; %# z-score of the real cc
%# probability that the real cc belongs to the same distribution as the cc values from permuted data
pv = 2 * normcdf(-abs(zval)); %# two-tailed p-value from the standard normal
%# plot
hist(cc_iter,20)
line([cc cc],ylim,'color','r') %# real value
In addition, if you compute the correlation with [cc pv] = corr(x,y), you get the p-value for how different your correlation is from no correlation. This p-value is calculated under the assumption that your data are normally distributed. However, if you calculate not the Pearson but the Spearman or Kendall correlation (non-parametric), those p-values come from permutation distributions:
[cc pv] = corr(x,y,'type','Spearman')