How to find similarity between two signals with xcorr in MATLAB

I'm writing code for speech recognition. I have n databases; each database contains the same set of words recorded by different people.
I want to run xcorr between, for example, the reference word "hello" and all the words in the "hello" database and all the words in the "door" database, and then the code has to tell me which word it is. I need some mathematical comparison in order to make a decision.
Now, I know that the autocorrelation of the same word gives a symmetric graph. But if I compare the word "hello" said by a male speaker with the same word said by a female speaker, it is not symmetric, and I obtain the same result if I compare the word "hello" with the word "door".
My question is: how can I find the similarity between two words using the xcorr function? Do I need to look at the lag or the maximum of the xcorr output?
Thanks for the help.

My question is: how can I find the similarity between two words using the xcorr function? Do I need to look at the lag or the maximum of the xcorr output?
To measure similarity to a single word recording, take the maximum of the cross-correlation; that peak value is your similarity measure.
But if I compare the word "hello" said by a male speaker with the same word said by a female speaker, it is not symmetric, and I obtain the same result if I compare the word "hello" with the word "door".
To make an optimal decision on which class the sample belongs to, you need to compare the similarity measures for both classes:
max(xcorr(sample, hello)) <> max(xcorr(sample, door))
The theory behind this is called "Bayes optimal classification".
If you have more word samples per class, you can make a better decision:
max_i(max_lag(xcorr(sample, hello_sample_i))) <> max_i(max_lag(xcorr(sample, door_sample_i)))
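As a minimal MATLAB sketch of that decision rule, assuming sample, hello_db and door_db are cell arrays of audio vectors recorded at the same sampling rate (these names are illustrative, not from the original post):

function class = classify_by_xcorr(sample, hello_db, door_db)
% Pick the class whose best normalized cross-correlation peak is highest.
    if best_peak(sample, hello_db) > best_peak(sample, door_db)
        class = 'hello';
    else
        class = 'door';
    end
end

function s = best_peak(x, db)
% Best normalized cross-correlation peak of x against every recording in db.
    s = -inf;
    for i = 1:numel(db)
        w = db{i};
        c = xcorr(x, w);                          % cross-correlation over all lags
        s = max(s, max(c) / (norm(x) * norm(w))); % 1 means identical up to a shift
    end
end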
I'm writing code for speech recognition.
Speech samples are time-varying, and xcorr is not invariant to that variation. A better measure for speech would be the DTW (dynamic time warping) distance between spectra. You can find a DTW implementation here:
http://www.ee.columbia.edu/ln/rosa/matlab/dtw/
DTW aligns sequences that are locally stretched or compressed in time, so you will be able to make more reliable decisions.
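For reference, here is a minimal DTW sketch (not the linked implementation, just an illustration of the idea) for two feature sequences A and B whose columns are frames, e.g. spectrogram or MFCC frames:

function d = simple_dtw(A, B)
% Basic dynamic time warping distance between two frame sequences.
    n = size(A, 2); m = size(B, 2);
    D = inf(n + 1, m + 1);
    D(1, 1) = 0;
    for i = 1:n
        for j = 1:m
            cost = norm(A(:, i) - B(:, j));   % local distance between frames
            D(i + 1, j + 1) = cost + min([D(i, j + 1), D(i + 1, j), D(i, j)]);
        end
    end
    d = D(n + 1, m + 1);                      % accumulated cost of the best alignment
end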


ELKI DBSCAN epsilon value issue

I am trying to cluster word vectors using ELKI DBSCAN. I wish to use cosine distance to cluster word vectors of 300 dimensions. The dataset has 19,000 words (a 19000*300 matrix). These are word vectors computed with gensim word2vec, and the output is saved as a CSV.
Below is the command I passed in the UI:
KDDCLIApplication -dbc.in "D:\w2v\vectors.csv" -parser.colsep '","' -algorithm clustering.DBSCAN -algorithm.distancefunction CosineDistanceFunction -dbscan.epsilon 1.02 -dbscan.minpts 5 -vis.window.single
I played around with the epsilon value, and while doing so I tried three values: 0.8, 0.9, and 1.0.
For 0.8 and 0.9 I got "There are very few neighbors found. Epsilon may be too small."
while for 1.0 I got "There are very many neighbors found. Epsilon may be too large."
What am I doing wrong here? I am quite new to ELKI, so any help is appreciated.
At 300 dimensions, you will be seeing the curse of dimensionality.
Contrary to popular claims, the curse of dimensionality does exist for cosine (as cosine is equivalent to Euclidean on normalized vectors, it can be at best one dimension "better" than Euclidean). What often makes cosine applications still work on text is that the intrinsic dimensionality is much lower than the representation dimensionality (i.e., while your vocabulary may have thousands of words, only a few occur in the intersection of two documents).
Word vectors are usually not sparse, so your intrinsic dimension may be quite high, and you will see the curse of dimensionality.
So it is not surprising to see the cosine distances concentrate, and you may then need to choose a threshold with a few digits of precision.
For obvious reasons, 1.0 is a nonsense threshold for cosine distance. The maximum cosine distance is 1.0! So yes, you will need to try 0.95 and 0.99, for example.
You can use the KNNDistancesSampler to help you choose DBSCAN parameters, or you can use for example OPTICS (which will allow you to find clusters with different thresholds, not just one single threshold).
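If you want a quick look at the k-distance curve outside ELKI, here is a rough MATLAB sketch of the same idea (the file name is illustrative; pdist/squareform need the Statistics Toolbox, and for 19,000 points you would sample a subset first rather than build the full distance matrix):

X = csvread('vectors.csv');            % illustrative path to the word vectors
k = 5;                                 % match dbscan.minpts
D = squareform(pdist(X, 'cosine'));    % pairwise cosine distances
Dsorted = sort(D, 2);                  % sort each row ascending
kdist = Dsorted(:, k + 1);             % distance to the k-th neighbor (column 1 is the point itself)
plot(sort(kdist, 'descend'));          % look for a "knee" to choose epsilon
xlabel('points sorted by k-distance');
ylabel('cosine distance to 5th nearest neighbor');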
Beware that word vectors are trained for a very specific scenario: substitutability. They are by far not as universal as popularly interpreted based on the "king-man+woman=queen" example. Just try "king-man+boy", which often also returns "queen" (or "kings")... the result is mostly because the nearest neighbors of "king" are "queen" and "kings". And the "capital" example is similarly overfitted to the training data. It's trained on news articles, which often begin the text with "capital, country, blah blah". If you omit "capital", or if you omit "country", you get almost exactly the same context, so the word2vec model learns that they are "substitutable". This works as long as the capital is also where the major newspapers are based (e.g., Berlin, Paris). It often fails for countries like Canada, the U.S., or Australia, where the main reporting hubs are located in, e.g., Toronto, New York, or Sydney. And it does not really prove that the vectors have learned what a capital is; the reason it worked in the first place is overfitting to the news training data.

Pattern recognition techniques that allow input as sequences of different lengths

I am trying to classify water end-use events, expressed as time-series sequences, into appropriate categories (e.g. toilet, tap, shower, etc.). My first attempt using an HMM shows quite promising results, with an average accuracy of 80%. I just wonder if there are any other techniques that, like an HMM, allow the training input to be time-series sequences of different lengths, rather than an extracted feature vector for each sequence. I have tried Conditional Random Fields (CRF) and SVMs; however, as far as I know, these two techniques require the input to be a pre-computed feature vector, and all input vectors must have the same length for training purposes. I am not sure if I am right or wrong on this point. Any help would be appreciated.
Thanks, Will

Clustering words into groups

This is a homework question. I have a huge document full of words. My challenge is to classify these words into different groups/clusters that adequately represent the words. My strategy for dealing with this is to use the k-means algorithm, which as you know takes the following steps:
Generate k random means for the entire group
Create K clusters by associating each word with the nearest mean
Compute centroid of each cluster, which becomes the new mean
Repeat Step 2 and Step 3 until a certain benchmark/convergence has been reached.
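For reference, a minimal MATLAB sketch of those four steps on plain numeric vectors (illustrative random data; words would first need a numeric representation):

X = rand(100, 2);                           % illustrative data, one row per point
k = 5;
mu = X(randperm(size(X, 1), k), :);         % step 1: k random means (picked from the data)
prevIdx = zeros(size(X, 1), 1);
for iter = 1:100
    [~, idx] = min(pdist2(X, mu), [], 2);   % step 2: assign each point to the nearest mean
    if isequal(idx, prevIdx)                % step 4: stop once assignments no longer change
        break;
    end
    prevIdx = idx;
    for c = 1:k                             % step 3: recompute each centroid
        if any(idx == c)                    % (empty clusters are left unchanged in this sketch)
            mu(c, :) = mean(X(idx == c, :), 1);
        end
    end
end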
Theoretically, I kind of get it, but not quite. I have questions corresponding to each step:
How do I decide on the k random means? Technically I could say 5, but that may not necessarily be a good number. So is this k purely a random number, or is it actually driven by heuristics such as the size of the dataset, the number of words involved, etc.?
How do I associate each word with the nearest mean? Theoretically I can conclude that each word is associated by its distance to the nearest mean, hence if there are 3 means, the cluster a word belongs to depends on which mean it has the shortest distance to. However, how is this actually computed? Between two words "group" and "textword", and say a mean word "pencil", how do I create a similarity matrix?
How do you calculate the centroid?
When I repeat step 2 and step 3, do I treat each previous cluster as a new data set?
Lots of questions, and I am obviously not clear on this. If there are any resources I can read, that would be great. Wikipedia did not suffice :(
As you don't know the exact number of clusters, I'd suggest using a kind of hierarchical clustering:
Imagine that all your words are just points in a non-Euclidean space. Use the Levenshtein distance to calculate the distance between words (it works great if you want to detect clusters of lexicographically similar words).
Build a minimum spanning tree that contains all of your words.
Remove links whose length is greater than some threshold.
The linked groups of words that remain are your clusters of similar words (see the sketch below).
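A rough MATLAB sketch of those steps, assuming a cell array of words (the word list and threshold are illustrative; graph, minspantree and conncomp need a reasonably recent MATLAB):

words = {'group', 'groups', 'textword', 'textwords', 'pencil'};
n = numel(words);
D = zeros(n);
for i = 1:n
    for j = i+1:n
        D(i, j) = levenshtein(words{i}, words{j});  % pairwise edit distances
        D(j, i) = D(i, j);
    end
end
G = graph(D);                                 % weighted complete graph over the words
T = minspantree(G);                           % minimum spanning tree
threshold = 3;                                % cut edges longer than this
T = rmedge(T, find(T.Edges.Weight > threshold));
labels = conncomp(T);                         % connected components = clusters

function d = levenshtein(a, b)
% Classic dynamic-programming edit distance between two char vectors.
    m = numel(a); n = numel(b);
    dp = zeros(m + 1, n + 1);
    dp(:, 1) = (0:m)';
    dp(1, :) = 0:n;
    for i = 1:m
        for j = 1:n
            cost = double(a(i) ~= b(j));
            dp(i + 1, j + 1) = min([dp(i, j + 1) + 1, dp(i + 1, j) + 1, dp(i, j) + cost]);
        end
    end
    d = dp(m + 1, n + 1);
end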
P.S. You can find many papers on the web that describe clustering based on building a minimal spanning tree.
P.P.S. If you want to detect clusters of semantically similar words, you need algorithms for automatic thesaurus construction.
That you have to choose "k" for k-means is one of the biggest drawbacks of k-means.
However, if you use the search function here, you will find a number of questions that deal with the known heuristic approaches to choosing k, mostly by comparing the results of running the algorithm multiple times.
As for "nearest". K-means acutally does not use distances. Some people believe it uses euclidean, other say it is squared euclidean. Technically, what k-means is interested in, is the variance. It minimizes the overall variance, by assigning each object to the cluster such that the variance is minimized. Coincidentially, the sum of squared deviations - one objects contribution to the total variance - over all dimensions is exactly the definition of squared euclidean distance. And since the square root is monotone, you can also use euclidean distance instead.
Anyway, if you want to use k-means with words, you first need to represent the words as vectors for which the squared Euclidean distance is meaningful. I don't think this will be easy, and it may not even be possible.
About the distance: in fact, the Levenshtein (or edit) distance satisfies the triangle inequality. It also satisfies the rest of the properties necessary for a metric (not all distance functions are metric functions). Therefore you can implement a clustering algorithm using this metric, and this is the function you could use to compute your similarity matrix S:
S_{i,j} = d(x_i, x_j) = d(x_j, x_i) = S_{j,i}
It's worth mentioning that the Damerau-Levenshtein distance does not satisfy the triangle inequality, so be careful with this.
About the k-means algorithm: yes, in the basic version you must define the K parameter by hand. The rest of the algorithm is the same for any given metric.

Distance to nearest palindrome

I'd like an algorithm to provide some kind of measure of how symmetrical a string is. In looking through previous questions, I found one on finding the number of letters that need to be added to a string to turn it into a palindrome. This is close to what I'm looking for, but too restrictive in the set of allowable editing operations.
My motivation for this is that I'd like to make an improved version of a video I put on YouTube called "Numbers are Colorful". The video shows golden ratio bases and a couple of other related systems using irrational bases. Surprisingly, one system is completely symmetrical to begin with, but the others exhibit partial symmetry, which I would like to highlight.
Are you looking for repetition or symmetry? So far I have seen no example that points to symmetry, only repetition. 1001010.0010101 is not symmetrical; the two halves are related by a circular shift, i.e. take the first set of digits [1001010], shift it to the left by 1 [0010101], and now you have the right side.
Unless you make it clear what you are trying to identify, this question is too poorly defined to give a sensible answer. If you really mean symmetrical, show me an example of symmetry. You might as well mean "I can see some interesting pattern here", which is so poorly defined that it's difficult to quantify.
That said, digital signal processing is the sort of area you might look into for identifying interesting patterns. For example, if you are looking for repetition then I suggest you attempt to use an algorithm designed for detecting repeating patterns.
Consider the digits in your number to be an input signal. Perform frequency analysis on this signal to detect repeating sections of numbers. If you have a strong repeating component in your series of digits, this should show up as a strong frequency component in your analysis. You can measure the strength of this pattern by performing the Fourier transform, identifying the fundamental frequency (the most significant frequency bin), and summing the energy in all of its harmonics. Divide this by the total energy of the signal and this will give you a measure between 0 and 1 for how "repetitive" the signal is, and it will also identify the periodicity of the signal. You may be better off using time-domain algorithms like autocorrelation, AMDF, or the YIN estimator (particularly AMDF).
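As a small MATLAB sketch of the time-domain route, you can score how "repetitive" a digit string is via its normalized autocorrelation peak away from zero lag (the digit string is illustrative):

digits = '100101000101010010100';
x = double(digits) - double('0');    % digits as a numeric signal
x = x - mean(x);                     % remove the DC offset
[c, lags] = xcorr(x, 'coeff');       % normalized autocorrelation
c = c(lags > 0);                     % keep positive lags only
[score, period] = max(c);            % peak height ~ repetitiveness, position ~ period
fprintf('repetitiveness = %.2f at period %d\n', score, period);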
A similar approach can be adopted if you were to consider actual symmetry (i.e. the number is still very similar when you reverse it). Take your input number, create a new signal by reversing it, and then measure their "sameness" at each discrete phase. If you have a digit string of length N, you could consider padding it with zeros to length 2N before comparing the signal with its reversed self, to allow for digits lying outside the length of the number.
The time-domain techniques are more likely to work because they are not affected as much by discontinuities. They literally compare the "sameness" of a signal, either by computing the difference of all the points at each phase or by multiplying the numbers together at each phase. In the subtraction case you hope to get 0 when they are similar. In the multiplication case you hope to get a peak in the function when the numbers are back in phase. They are, however, more prone to noise (which in this context means digits that aren't quite right).
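And a matching sketch of the reversal idea: cross-correlate the digit signal with its reverse and take the normalized peak as a symmetry score, which is 1.0 for a perfect palindrome at some shift (the input below is illustrative):

digits = '1001010.0010101';
x = double(digits(digits ~= '.')) - double('0');  % drop the radix point
x = x - mean(x);
c = xcorr(x, flip(x));                            % compare with the reversed self at every phase
score = max(c) / sum(x.^2);                       % bounded by 1 via Cauchy-Schwarz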

Combining different similarities to build one final similarity

I'm pretty new to data mining and recommendation systems, and I'm now trying to build a kind of recommendation system for users that have the following parameters:
city
education
interest
To calculate the similarity between them, I am going to apply cosine similarity and a discrete similarity.
For example:
city: if x = y then d(x,y) = 0. Otherwise, d(x,y) = 1.
education: here I will use cosine similarity over the words that appear in the name of the department or bachelor's degree
interest: there will be a hardcoded number of interests a user can choose, and cosine similarity will be calculated based on two vectors like this:
1 0 0 1 0 0 ... n
1 1 1 0 1 0 ... n
where 1 means the presence of the interest and n is the total number of all interests.
My question is:
How do I combine those 3 similarities in an appropriate way? Just summing them doesn't sound very smart, does it? I would also like to hear comments on my "newbie similarity system", hah.
There are no hard-and-fast answers, since the right answers here depend greatly on your input and problem domain. A lot of the work of machine learning is the art (not science) of preparing your input, for this reason. I can give you some general ideas to think about. You have two issues: making a meaningful similarity out of each of these items, and then combining them.
The city similarity sounds reasonable but really depends on your domain. Is it really the case that being in the same city means everything, and being in neighboring cities means nothing? For example, does being in similarly sized cities count for anything? In the same state? If it does, your similarity should reflect that.
Education: I understand why you might use cosine similarity but that is not going to address the real problem here, which is handling different tokens that mean the same thing. You need "eng" and "engineering" to match, and "ba" and "bachelors", things like that. Once you prepare the tokens that way it might give good results.
Interest: I don't think cosine will be the best choice here; try a simple Tanimoto coefficient similarity (just the size of the intersection over the size of the union).
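A tiny MATLAB sketch of the Tanimoto (Jaccard) coefficient on binary interest vectors (the example vectors are taken from the question):

a = logical([1 0 0 1 0 0]);
b = logical([1 1 1 0 1 0]);
tanimoto = sum(a & b) / sum(a | b);   % size of intersection over size of union (1/5 here)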
You can't just sum them, as I assume you still want a value in the range [0,1]. You could average them. That makes the assumption that the outputs of each of these are directly comparable, that they're in the same "units", if you will. They aren't here; for example, it's not as if they are probabilities.
It might still work OK in practice to average them, perhaps with weights. With a plain average, for example, being in the same city counts as much as having exactly the same interests. Is that true, or should it be less important?
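A hedged MATLAB sketch of such a weighted average, assuming each similarity is already in [0,1] (the weights and values are purely illustrative):

simCity = 1;          % same city
simEducation = 0.7;   % e.g. cosine similarity of degree-name tokens
simInterest = 0.4;    % e.g. Tanimoto coefficient of interest vectors
w = [0.2 0.3 0.5];                                 % weights, chosen to sum to 1
simTotal = w * [simCity; simEducation; simInterest];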
You can try and test different variations and weights as hopefully you have some scheme for testing against historical data. I would point you at our project, Mahout, as it has a complete framework for recommenders and evaluation.
However all these sorts of solutions are hacky and heuristic. I think you might want to take a more formal approach to feature encoding and similarities. If you're willing to buy a book and like Mahout, Mahout in Action has good coverage in the clustering chapters on how to select and encode features and then how to make one similarity out of them.
Here's the usual trick in machine learning.
city: if x = y then d(x,y) = 0. Otherwise, d(x,y) = 1.
I take this to mean you use a one-of-K coding. That's good.
education: here I will use cosine similarity over the words that appear in the name of the department or bachelor's degree
You can also use a one-of-K coding here, to produce a vector of size |V| where V is the vocabulary, i.e. all words in your training data.
If you now normalize the interest number so that it always falls in the range [0,1], then you can use ordinary L1 (Manhattan) or L2 (Euclidean) distance metrics between your final vectors. The latter corresponds to the cosine similarity metric of information retrieval.
Experiment with L1 and L2 to decide which is best.
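A final MATLAB sketch of that encoding, with illustrative one-of-K blocks for city, education and interests concatenated into one vector per user:

userA = [1 0 0, 0 1 0 0, 1 0 0 1 0 0];      % [city | education | interests]
userB = [1 0 0, 0 0 1 0, 1 1 1 0 1 0];
dL1 = sum(abs(userA - userB));              % L1 (Manhattan) distance
dL2 = sqrt(sum((userA - userB).^2));        % L2 (Euclidean) distance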