JS divergence between two discrete probability distributions of unequal length

I am implementing online topic modeling as outlined in the paper "On-line Trend Analysis with Topic Models: #twitter trends detection topic model online".
I need to find the Jensen-Shannon divergence between the word distribution of each topic t before and after an update, and classify a topic as novel if the measure exceeds a threshold.
At each update the vocabulary is updated, so the word distribution over the vocabulary has a different length after each update. How can I calculate the JS divergence between two distributions of unequal length?

The Jensen-Shannon divergence is a symmetrised, smoothed version of the Kullback-Leibler (KL) divergence (relative entropy) between two probability distributions: instead of comparing the two distributions to each other directly, it compares each of them to their average.
You will need a good understanding of KL divergence before you can proceed. Here is a good starting point:
Given two probability distributions P and Q over the same support,
P = (p_1, ..., p_n), Q = (q_1, ..., q_n)
KL(P||Q) = sum_i p_i * log(p_i / q_i)
KL is not symmetric, hence it is not a metric. The Jensen-Shannon divergence symmetrises it by comparing each distribution to the average (mixture) of the two:
JSD(P||Q) = (KL(P||M) + KL(Q||M)) / 2
where M = (P + Q)/2.
In simple terms, the Jensen-Shannon divergence is the average of the KL divergences from each distribution to their mixture.
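For reference, a minimal NumPy/SciPy sketch of exactly this formula (it assumes P and Q are already expressed over the same support; scipy.stats.entropy(p, m) computes KL(p||m)):

import numpy as np
from scipy.stats import entropy  # entropy(p, q) is KL(p || q)

def js_divergence(p, q, base=np.e):
    # normalize to proper probability distributions
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)                # the mixture M = (P + Q) / 2
    return 0.5 * entropy(p, m, base=base) + 0.5 * entropy(q, m, base=base)

# hypothetical word distributions over the same 4-word vocabulary
print(js_divergence([0.1, 0.2, 0.3, 0.4], [0.25, 0.25, 0.25, 0.25]))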
I hope this helps.

One way is to use np.random.choice to subsample the longer distribution so that p and q end up with the same length:
import numpy as np
import scipy.stats

def jsd(p, q, base=np.e):
    '''
    Implementation of pairwise `jsd` based on
    https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence
    '''
    ## subsample the longer input with random.choice so p and q have the same length
    if len(p) > len(q):
        p = np.random.choice(p, len(q))
    elif len(q) > len(p):
        q = np.random.choice(q, len(p))
    ## convert to np.array
    p, q = np.asarray(p), np.asarray(q)
    ## normalize p, q to probabilities
    p, q = p / p.sum(), q / q.sum()
    m = 1. / 2 * (p + q)
    return scipy.stats.entropy(p, m, base=base) / 2. + scipy.stats.entropy(q, m, base=base) / 2.
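For example (with made-up distributions of different lengths; because of the random subsampling the value will vary from run to run):

p = [0.10, 0.25, 0.35, 0.30]          # word distribution over the old vocabulary (4 words)
q = [0.08, 0.20, 0.30, 0.27, 0.15]    # word distribution over the new vocabulary (5 words)
print(jsd(p, q))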

Related

MATLAB: How to compute the similarity of two signals and get the correct consistency or coherence metric

I was wondering about the consistency metric. Generally, it allows us to deduce the parity or similarity between two signals, right? If so, when the value is high (from 0.5 to 1), does it mean there is a strong similarity between the signals? And if the value is low (0.1-0.43), does that indicate poor coherence between the signals (poor similarity, i.e. the signals are probably different)? And if the metric is < 0, does that mean the signals are totally different? I am getting negative numbers, so is this interpretation possible?
Can I get a clear understanding of the consistency metric of a signal? Here is my small code and figure. Thanks in advance.
s1 = signal3;
s2 = signal4;
if s1 ~= s2
    [C1] = xcorr(s1);
    [C2] = xcorr(s2);
    signal_mix = C1.*C2;    % mixing vector
    signal_mix1 = signal_mix;
else
    s1(1,:) == s2(1,:)
    s3 = s1;
    s3 = s2;
    signal_mix = s2;
end
n = 2;
for i = 1:length(signal_mix1)
    signal_mix1(i) = min(C1(i),C2(i)) / max(C1(i),C2(i));   % consistency score
    signal_mix2 = sum(signal_mix1(i));
end
Depending on your use case you might want to consider a dynamic time warping (DTW) distance as a similarity metric (Matlab has a built-in function for that). One problem with using correlation as a metric is that it always compares the same time step of the two signals, so two identical signals, where one is time-delayed, can end up with low correlation. The DTW distance addresses this by also allowing comparisons with values at adjacent time steps.
The downside of the DTW distance is that the distance itself can't be interpreted on its own, only relative to other distances. So you can tell that two signals A & B with a distance of 150 are more similar than A & C with a distance of 250, but the distance of 150 on its own doesn't tell you a lot.
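For illustration, here is a minimal NumPy sketch of the textbook DTW recursion for two 1-D signals (not Matlab's built-in dtw, just the plain dynamic-programming definition):

import numpy as np

def dtw_distance(x, y):
    # D[i, j] = DTW distance between the first i samples of x and the first j samples of y
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # extend the cheapest of the three possible warping steps
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

a = np.sin(np.linspace(0, 2 * np.pi, 50))
b = np.sin(np.linspace(0, 2 * np.pi, 50) - 0.5)   # same shape, time-shifted
print(dtw_distance(a, b))                          # small, despite the shift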
First of all, you could use the xcorr function to calculate the cross-correlation between the two signals.
from Matlab help:
r = xcorr(x,y) returns the cross-correlation of two discrete-time
sequences. Cross-correlation measures the similarity between a vector
x and shifted (lagged) copies of a vector y as a function of the lag.
If x and y have different lengths, the function appends zeros to the
end of the shorter vector so it has the same length as the other.
additionally you could use xcov:
xcov computes the mean of its inputs, subtracts the mean, and then
calls xcorr.
The result of xcov can be interpreted as an estimate of the covariance
between two random sequences or as the deterministic covariance
between two deterministic signals.
In your example you are using xcorr with one signal, so it computes the auto-correlation between the signal and its lagged copies.
Update:
Based on the comment, it seems you need linear correlation; it can be calculated with the corr function:
p = corr(x,y)
The value of p is 1 when x and y behave exactly like each other, and -1 when x and y behave exactly opposite to each other.
When p is 0 it means there is no linear correlation between the two signals.
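A quick NumPy illustration of the same idea (np.corrcoef plays the role of Matlab's corr here):

import numpy as np

t = np.linspace(0, 1, 200)
x = np.sin(2 * np.pi * 5 * t)
print(np.corrcoef(x,  2 * x)[0, 1])                    # ~  1.0, same behaviour (different scale)
print(np.corrcoef(x, -x)[0, 1])                        # ~ -1.0, opposite behaviour
print(np.corrcoef(x, np.random.randn(t.size))[0, 1])   # close to 0, unrelated signals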

Is there a Matlab function for calculating std of a binomial distribution?

I have a binary vector V, in which each entry describes success (1) or failure (0) in the relevant trial out of a whole session.
(the length of the vector denotes the number of trials in the session).
I can easily calculate the success rate of the session (by taking the mean of the vector i.e. (sum(V)/length(V))).
However I also need to know the variance or std of each session.
In order to calculate that, is it OK to use the Matlab std function (i.e. to take std(V)/length(V))?
Or, should I use something which is specifically suited for the binomial distribution?
Is there a Matlab std (or variance) function which is specific for a "success/failure" distribution?
Thanks
If you satisfy the assumptions of the Binomial distribution,
a fixed number n of independent Bernoulli trials,
each with constant success probability p,
then a dedicated function isn't really necessary, since the parameters n and p can be obtained directly from your data.
Note that we model number of successes (in n trials) as a random variable distributed with the Binomial(n,p) distribution.
n = length(V);
p = mean(V); % equivalently, sum(V)/length(V)
% the mean is the maximum likelihood estimator (MLE) for p
% note: need large n or replication to get true p
Then the standard deviation of the number of successes in n independent Bernoulli trials with constant success probability p is just sqrt(n*p*(1-p)).
Of course you can also assess this from your data if you have multiple samples. Note this is different from std(V). In your data format, it would require having multiple vectors V1, V2, V3, etc. (replication); the sample standard deviation of the number of successes would then be obtained as follows.
% Given V1, V2, V3 sets of Bernoulli trials
std([sum(V1) sum(V2) sum(V3)])
If you already know your parameters: n, p
You can obtain it easily enough.
n = 10;
p = 0.65;
pd = makedist('Binomial','N',n,'p',p)
std(pd) % 1.5083
or
sqrt(n*p*(1-p)) % 1.5083
as discussed earlier.
Does the standard deviation increase with n ?
The OP has asked:
Something is bothering me... if std = sqrt(n*p*(1-p)), then it increases with n. Shouldn't the std decrease when n increases?
Confirmation & Derivation:
Write the number of successes as X = X_1 + X_2 + ... + X_n, where each X_i is an independent Bernoulli(p) variable with E[X_i] = p and Var(X_i) = p*(1-p).
Then, just from the definitions of expectation and variance and the independence of the trials,
Var(X) = Var(X_1) + ... + Var(X_n) = n*p*(1-p)
which increases with n. Since the square root is a non-decreasing function, the same relationship holds for the standard deviation: std(X) = sqrt(n*p*(1-p)) also increases with n.
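As a quick sanity check (a Python/NumPy simulation rather than Matlab, purely for illustration), the standard deviation of the number of successes does grow with n and matches sqrt(n*p*(1-p)):

import numpy as np

rng = np.random.default_rng(0)
p = 0.65
for n in (10, 100, 1000):
    # simulate 100000 sessions of n Bernoulli(p) trials and count the successes per session
    successes = rng.binomial(n, p, size=100_000)
    print(n, successes.std(), np.sqrt(n * p * (1 - p)))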

Minimum Description Length for Clustering

I would like to know how to calculate Minimum Description Length (MDL) to evaluate the clustering result.
I was looking at some papers on clustering algorithms, and one of them refers to MDL as a measurement to check if the clusters which are given by K-means follow Gaussian distribution.
According to that paper, MDL is given by:
MDL(K) = -log[p_y(y|K)] + 1/2 * L * log(n)
L = K(1 + n + (n + 1)n / 2) - 1
, where K is the number of clusters, n is the total number of data values, and y is an n-dimensional vector.
I am aware that the above explanation might be insufficient to answer this question, but it is all the information I have right now, and I have no idea how to reproduce the calculation introduced in the paper.
I would appreciate explanations on how to calculate MDL to evaluate clustering results.
MDL calculations always require some assumptions about how to encode the data, and that is where MDL papers often go wrong: they compare their new encoding to a sub-quality baseline encoding to get massive gains... Anyway, this value may be legitimate, but without context and proper definitions it's hard to tell.
When you approximate data with k-means, you have to store the following (a rough bit-count sketch is given after this list):
k itself
log k bits for each of n points to map points to centers
k vectors of d dimensions
the deviation of each point from the mean. If you assume small deviations are more frequent (Gaussian), use fewer bits for this, more bits for large deviations
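A rough Python/NumPy sketch of that bit-count view (this is just one possible encoding to illustrate the idea, not the formula from the paper; the 32 bits per float and the Gaussian coding of the residuals are assumptions):

import numpy as np

def kmeans_description_length(X, centers, labels, float_bits=32):
    """Approximate description length, in bits, of X encoded via a k-means result."""
    n, d = X.shape
    k = len(centers)
    bits_assignments = n * np.log2(k)        # log k bits per point to map it to a center
    bits_centers = k * d * float_bits        # the k center vectors of d dimensions
    # deviations of each point from its center, coded with a Gaussian model:
    # cost = negative log2-likelihood, so small (frequent) deviations cost fewer bits
    residuals = X - centers[labels]
    sigma2 = residuals.var() + 1e-12
    nll_nats = 0.5 * (residuals ** 2 / sigma2 + np.log(2 * np.pi * sigma2)).sum()
    bits_residuals = nll_nats / np.log(2)    # convert nats to bits
    # (plus a constant number of bits to store k itself, negligible here)
    return bits_assignments + bits_centers + bits_residuals

Evaluating such a description length for several values of K and picking the minimum is the usual MDL model-selection recipe.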

some questions on cosine similarity

Yesterday I learnt that the cosine similarity, defined as
sim(A,B) = (A * B) / (||A||2 * ||B||2),
can effectively measure how similar two vectors are.
I find that this definition uses the L2-norm to normalize the dot product of A and B; what I am interested in is why the L1-norm of A and B is not used in the denominator instead.
My teacher told me that if I use the L1-norm in the denominator, then the cosine similarity would not be 1 when A = B. I then asked him: if I modify the cosine similarity definition as follows, what are the advantages and disadvantages of the modified model compared with the original one?
sim(A,B) = (A * B) / (||A||1 * ||B||1) if A!=B
sim(A,B) = 1 if A==B
I would appreciate if someone could give me some more explanations.
If you use the L1-norm, you are not computing the cosine anymore.
Cosine is a geometrical concept, not an arbitrary definition. There is a whole string of mathematics attached to it. If you use the L1-norm, you are not measuring angles anymore.
See also: Wikipedia: Trigonometric functions - Cosine
Note that cosine is monotone to Euclidean distance on L2 normalized vectors.
Euclidean(x,y)^2 = sum( (x-y)^2 ) = sum(x^2) + sum(y^2) - 2 sum(x*y)
if x and y are L2 normalized, then sum(x^2)=sum(y^2)=1, and then
Euclidean(x_norm,y_norm)^2 = 2 * (1 - sum(x_norm*y_norm)) = 2 * (1 - cossim(x,y))
So using cosine similarity essentially means standardizing your data to unit length. But there are also computational benefits associated with this, as sum(x*y) is cheaper to compute for sparse data.
If you L2 normalized your data, then
Euclidean(x_norm, y_norm) = sqrt(2) * sqrt(1-cossim(x,y))
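A quick NumPy check of this relationship on random vectors (purely illustrative):

import numpy as np

rng = np.random.default_rng(1)
x, y = rng.random(5), rng.random(5)
x_norm, y_norm = x / np.linalg.norm(x), y / np.linalg.norm(y)   # L2-normalize

cossim = np.dot(x_norm, y_norm)
euclid = np.linalg.norm(x_norm - y_norm)
print(euclid, np.sqrt(2 * (1 - cossim)))   # the two printed values agree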
For the second part of your question: fixing the L1 norm isn't that easy. Consider the vectors (1,1) and (2,2). Obviously, these two vectors have the same angle, and thus should have cosine similarity 1.
Using your equation, they would have similarity (2+2)/(2*4) = 0.5.
Now look at the vectors (0,1) and (0,2): these also point in exactly the same direction (and cosine again gives 1), but your equation yields (0*0+1*2)/(1*2) = 1. So two pairs of vectors with exactly the same angle end up with different similarities (0.5 vs 1); the value depends on the lengths of the vectors, not only on the angle between them. So your similarity does not match the intuition behind cosine, does it?

sequence prediction using HMM Matlab

I'm currently learning murphyk's toolbox for Hidden Markov Models; however, I have a problem determining my model's coefficients and also choosing the algorithm for sequence prediction by log-likelihood.
My Scenario:
I have a flying bird's trajectory in 3D space, i.e. its X, Y and Z coordinates, which falls in the continuous-HMM category. I have 200 observations of the flying bird, i.e. 500 rows of trajectory data, and I want to predict the sequence. I want to sample that in 20 data points, i.e. after 10 points. So my first question is: are the following parameters valid for my case?
O = 3; %Number of coefficients in a vector
T = 20; %Number of vectors in a sequence
nex = 50; %Number of sequences
M = 2; %Number of mixtures
Q = 20; %Number of states
And the second question is: what algorithm is appropriate for sequence prediction, and is training compulsory for that?
From what I understand, I'm assuming you're training 200 different classes (HMMs) and each class has 500 training examples (observation sequences).
O is the dimensionality of the observation vectors; that seems to be correct.
There is no need to have a fixed T, it depends on the observation sequences you have.
M is the number of multivariate Gaussians (or mixtures) in the GMM of a state. More will fit to your data better and give you better accuracy, but at the cost of performance. Choose a suitable value.
N does not need to be equal to T. For the best number of states N, you'll have to benchmark and see for yourself:
Determining the number of hidden states in a Hidden Markov Model
Yes, you have to train your classes using the Baum-Welch algorithm, optionally preceded by something like the segmental k-means procedure. After that you can easily perform isolated unit recognition using Forward/Backward probability or Viterbi probability by simply selecting the class with the highest probability.
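If a Python alternative to murphyk's toolbox is acceptable, here is a minimal sketch of that train-then-score workflow with hmmlearn (the classes, shapes and parameter values are made up, and a plain GaussianHMM is used instead of a mixture-of-Gaussians emission for simplicity):

import numpy as np
from hmmlearn.hmm import GaussianHMM

# toy data: two "classes", each with a few 3-D (X, Y, Z) observation sequences
rng = np.random.default_rng(0)
def make_sequences(offset, n_seq=5, T=20):
    return [offset + rng.normal(size=(T, 3)).cumsum(axis=0) for _ in range(n_seq)]

train = {"class_A": make_sequences(0.0), "class_B": make_sequences(5.0)}

# train one HMM per class with Baum-Welch (EM), as described above
models = {}
for name, seqs in train.items():
    X = np.vstack(seqs)                  # stack the sequences row-wise
    lengths = [len(s) for s in seqs]     # tell hmmlearn where each sequence ends
    models[name] = GaussianHMM(n_components=4, covariance_type="diag",
                               n_iter=50).fit(X, lengths)

# isolated unit recognition: pick the class whose model gives the highest log-likelihood
test_seq = make_sequences(0.0, n_seq=1)[0]
scores = {name: model.score(test_seq) for name, model in models.items()}
print(max(scores, key=scores.get), scores)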