Help me find the algorithm name - quantifying the difference between two words

I know there is an algorithm for measuring how "close" two words are. The idea is that the algorithm adds 1 point to the score for every single-letter addition or subtraction needed to transform one word into the other. The lower the score, the "closer" the two words are.
For example, if we take the words "word" and "sword", their distance is 1. To go from "word" to "sword", all you have to do is add an "s" at the beginning.
For "week" and "welk" the distance is 2: you need to subtract an "e" and add an "l".
I remember this algorithm being used to sort the suggestion list in spell-checkers, but I cannot recall its name.
What is this algorithm called?

Levenshtein Distance
Is it just me or is this simple algorithm great?

This sounds a lot like the Levenshtein distance algorithm

Levenshtein distance.

Levenshtein distance
http://en.wikipedia.org/wiki/Levenshtein_distance

Do you mean Levenshtein distance?
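For completeness, a minimal MATLAB sketch of the classic dynamic-programming recurrence (a plain, unoptimized implementation):

    function d = levenshtein(a, b)
    % D(i+1,j+1) holds the edit distance between a(1:i) and b(1:j)
    m = numel(a); n = numel(b);
    D = zeros(m+1, n+1);
    D(:,1) = 0:m;                          % cost of deleting all of a
    D(1,:) = 0:n;                          % cost of inserting all of b
    for i = 1:m
        for j = 1:n
            cost = double(a(i) ~= b(j));   % 0 if the characters match
            D(i+1,j+1) = min([D(i,j+1) + 1, ...   % deletion
                              D(i+1,j) + 1, ...   % insertion
                              D(i,j) + cost]);    % substitution
        end
    end
    d = D(m+1, n+1);
    end

For example, levenshtein('word', 'sword') returns 1. Note that the standard definition also counts a substitution as a single edit, so "week" vs "welk" comes out as 1 rather than 2.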

Related

Find the nearest positions

Each day I record 10000 positions in the form of (x, y) pairs; so far I have collected 30 days of data. I want to select one position from each day so that all the selected positions have similar (x, y) values. By similar I mean that the Euclidean distance between any two selected positions is minimized. How can I do this efficiently in MATLAB? With brute force it is practically impossible:
there are 10000^30 possible selections, so even if each operation took only 10^-9 seconds,
it would run forever.
One idea would be to use the k-means algorithm or one of its variations. It is relatively easy to implement (an implementation also ships with the Statistics Toolbox) and has a runtime of about O(nkl).
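As a rough illustration of how the k-means idea could be turned into a per-day selection (the variable names and the post-processing are my own assumptions, not part of the original answer): pool all points, cluster them, find the tightest cluster, then pick from each day the point nearest that cluster's centroid.

    % P is a 300000x2 matrix of all points, day a 300000x1 day index (1..30)
    k = 50;                                  % number of clusters, to be tuned
    [idx, C] = kmeans(P, k);                 % Statistics Toolbox
    % find the cluster with the smallest mean squared distance to its centroid
    spread = accumarray(idx, sum((P - C(idx,:)).^2, 2), [k 1], @mean);
    [~, best] = min(spread);
    sel = zeros(30, 1);
    for d = 1:30
        rows = find(day == d);
        dist2 = sum((P(rows,:) - C(best,:)).^2, 2);   % implicit expansion, R2016b+
        [~, j] = min(dist2);
        sel(d) = rows(j);                    % chosen point index for day d
    end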
Analysing all the possibilities would certainly give you the best result, but that is infeasible here.
If an approximate result is acceptable, consider the first two days only and analyse all the possible pairs across those two days, keeping the best one. Then, for the third day, keep the pair obtained previously and add the point from the third day that is closest to the previous two.
Repeating this day by day gives an approximate solution at a much lower computational cost.
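A sketch of this greedy idea, assuming the data sits in a cell array days{1..30} of 10000x2 matrices, and reading "closest to the previous two" as closest to the centroid of the points picked so far (uses implicit expansion, R2016b+):

    best = inf;
    for i = 1:size(days{1}, 1)                % exhaustive search over days 1 and 2
        d2 = sum((days{2} - days{1}(i,:)).^2, 2);
        [v, j] = min(d2);
        if v < best
            best = v;
            sel = [days{1}(i,:); days{2}(j,:)];
        end
    end
    for d = 3:30                              % greedily extend, one day at a time
        c = mean(sel, 1);                     % centroid of the points chosen so far
        [~, j] = min(sum((days{d} - c).^2, 2));
        sel(end+1, :) = days{d}(j, :);
    end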

Is it possible to calculate Euclidean Distance between two varying length matrices?

I have started working on an online signature dataset for verification purposes. I have two matrices containing the digitized data of two signatures of varying length (the numbers of rows differ), e.g. one is 177×7 and the other is 170×7.
I want to treat each column as one time series, and I'd like to compare each time series of one signature with the corresponding time series of the second signature.
How should I align the two time series?
I think this question really belongs on Math.StackExchange, but I will do my best to answer it here. The short answer is that the Euclidean distance cannot be applied in this case and you will need to define your own notion of distance. This may or may not actually be feasible.
The notion of distance relies on the existence of a "metric" defined on the space of interest. If your vectors are of different lengths then traditional metrics (including the Euclidean distance) are ill-defined and you need to define a new metric that works for you.
There are two things you'll need to do here:
Define the space you're working with. This seems to be the set of vectors of length 177 or length 170. This is a very unusual set.
Define your metric (and ensure that it actually meets all the properties of a metric).
The most obvious solution is to project the vectors of length 177 into the space of vectors of length 170 and then compute the Euclidean distance as usual. For example, you could simply ignore the last 7 elements of the longer vector. Note that this is not a metric on your original set, since it violates the condition d(x,y) = 0 iff x = y, but it is a metric on the projected vectors. There may be a cleverer solution on the original set, but there is no obvious one. Again, the people on Math.StackExchange may be able to help you more.
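A minimal sketch of that projection idea in MATLAB, with A and B standing for the 177×7 and 170×7 signature matrices:

    n = min(size(A,1), size(B,1));                     % 170 in this example
    colDist = sqrt(sum((A(1:n,:) - B(1:n,:)).^2, 1));  % 1x7, one distance per column
    totalDist = norm(A(1:n,:) - B(1:n,:), 'fro');      % or a single overall number

Keep the caveat above in mind: this is a metric on the truncated data, not on the original set.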

How to find similarity between two signals with xcorr in matlab

I'm writing code for speech recognition. I have n databases; each database contains the same set of words, recorded by different people.
I want to compute the cross-correlation (xcorr) between, for example, the reference word "hello" and all the words in the "hello" database and all the words in the "door" database, and then the code has to tell me which word it is. I need some mathematical comparison in order to make a decision.
Now, I know that the autocorrelation of a word with itself has a symmetric graph. But if I compare the word "hello" said by a male with the same word said by a female, it is not symmetric, and I obtain the same result if I compare the word "hello" with the word "door".
My question is: how can I find similarities between two words using the xcorr function? Do I need to look at the lag or at the maximum of the xcorr?
Thanks for the help.
My question is: how can I find similarities between two words using the xcorr function? Do I need to look at the lag or at the maximum of the xcorr?
To measure similarity to a single word recording, take the maximum of the cross-correlation; that peak value is your measure of similarity.
But if I compare the word "hello" said by a male with the same word said by a female, it is not symmetric, and I obtain the same result if I compare the word "hello" with the word "door".
To make an optimal decision on which class the sample belongs to, compare the similarity measures for both classes:
max(xcorr(sample, hello))  <>  max(xcorr(sample, door))
The theory behind this is called "Bayes optimal classification".
If you have more word samples, you can make a better decision:
max_sample(max_lag(xcorr(sample, hello_sample_i)))  <>  max_sample(max_lag(xcorr(sample, door_sample_i)))
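A sketch of that decision rule, assuming helloDb and doorDb are cell arrays of recorded signals and sample is the recording to classify (all names are made up; xcorr is in the Signal Processing Toolbox, and you may want to normalize each signal first, e.g. w = w / norm(w), so louder recordings do not dominate):

    scoreHello = max(cellfun(@(w) max(xcorr(sample, w)), helloDb));
    scoreDoor  = max(cellfun(@(w) max(xcorr(sample, w)), doorDb));
    if scoreHello > scoreDoor
        disp('classified as "hello"');
    else
        disp('classified as "door"');
    end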
I'm writing code for speech recognition.
Speech samples are time-varying, and xcorr is not invariant to time variation. For speech, a better measure would be the DTW (dynamic time warping) distance between spectra. You can find a DTW implementation here:
http://www.ee.columbia.edu/ln/rosa/matlab/dtw/
DTW is invariant to time warping, so you would be able to make more reliable decisions.
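If your MATLAB release is recent enough (R2016a or later), the Signal Processing Toolbox also ships a built-in dtw function, so the same comparison could be sketched as below. Note that smaller is better here, so take the minimum; and as the answer suggests, ideally you would run this on spectral features (e.g. a spectrogram or MFCCs) rather than raw waveforms:

    distHello = min(cellfun(@(w) dtw(sample, w), helloDb));
    distDoor  = min(cellfun(@(w) dtw(sample, w), doorDb));
    % classify as the word with the smaller DTW distance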

Clustering words into groups

This is a Homework question. I have a huge document full of words. My challenge is to classify these words into different groups/clusters that adequately represent the words. My strategy to deal with it is using the K-Means algorithm, which as you know takes the following steps.
Generate k random means for the entire group
Create K clusters by associating each word with the nearest mean
Compute centroid of each cluster, which becomes the new mean
Repeat Step 2 and Step 3 until a certain benchmark/convergence has been reached.
Theoretically, I kind of get it, but not quite. I think at each step, I have questions that correspond to it, these are:
How do I decide on the k random means? Technically I could say 5, but that may not necessarily be a good number. Is this k purely arbitrary, or is it driven by heuristics such as the size of the dataset, the number of words involved, etc.?
How do I associate each word with the nearest mean? Theoretically I can conclude that each word is associated by its distance to the nearest mean, so if there are 3 means, any word belongs to the cluster whose mean it is closest to. However, how is this actually computed? Between two words "group" and "textword", and an assumed mean word "pencil", how do I create a similarity matrix?
How do you calculate the centroid?
When you repeat steps 2 and 3, are you treating each previous cluster as a new data set?
Lots of questions, and I am obviously not clear. If there are any resources I can read, that would be great. Wikipedia did not suffice :(
As you don't know the exact number of clusters, I'd suggest using a kind of hierarchical clustering:
Imagine that all your words are just points in a non-Euclidean space. Use the Levenshtein distance to calculate the distance between words (it works great if you want to detect clusters of lexicographically similar words)
Build a minimum spanning tree that contains all of your words
Remove the links that are longer than some threshold
The groups of words that remain linked are your clusters of similar words (a sketch follows below)
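A sketch of those steps in MATLAB, assuming a levenshtein() function like the one earlier in this thread, the words stored in a cell array of char vectors, and a hand-picked threshold thr (graph, minspantree, rmedge and conncomp require R2015b or later):

    n = numel(words);
    D = zeros(n);
    for i = 1:n
        for j = i+1:n
            D(i,j) = levenshtein(words{i}, words{j});
            D(j,i) = D(i,j);
        end
    end
    G = graph(D);                                % complete weighted graph
    T = minspantree(G);                          % minimum spanning tree
    T = rmedge(T, find(T.Edges.Weight > thr));   % cut links longer than thr
    clusterIdx = conncomp(T);                    % components = word clusters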
P.S. You can find many papers on the web describing clustering based on building a minimum spanning tree.
P.P.S. If you want to detect clusters of semantically similar words, you will need algorithms for automatic thesaurus construction.
Having to choose "k" up front is one of the biggest drawbacks of k-means.
However, if you use the search function here, you will find a number of questions that deal with the known heuristic approaches to choosing k, mostly by comparing the results of running the algorithm multiple times.
As for "nearest": k-means actually does not use distances. Some people believe it uses the Euclidean distance, others say the squared Euclidean distance. Technically, what k-means is interested in is the variance: it minimizes the overall variance by assigning each object to the cluster where its contribution to the variance is smallest. Coincidentally, the sum of squared deviations (one object's contribution to the total variance) over all dimensions is exactly the definition of squared Euclidean distance. And since the square root is monotone, you can also use the Euclidean distance instead.
Anyway, if you want to use k-means with words, you first need to represent the words as vectors for which the squared Euclidean distance is meaningful. I don't think this will be easy, and it may not even be possible.
About the distance: in fact, the Levenshtein (or edit) distance satisfies the triangle inequality. It also satisfies the rest of the properties required of a metric (not all distance functions are metrics). Therefore you can implement a clustering algorithm using this metric function, and this is the function you could use to compute your similarity matrix S:
S_{i,j} = d(x_i, x_j) = d(x_j, x_i) = S_{j,i}
It's worth mentioning that the Damerau-Levenshtein distance does not satisfy the triangle inequality, so be careful with that one.
About the k-means algorithm: yes, in the basic version you must choose the parameter K by hand. The rest of the algorithm is the same for any given metric.

Find K-farthest neighbors in a weighted graph in matlab

I want to find the K farthest neighbors in a given undirected weighted graph (the graph is given as a sparse weight matrix, but I can use any representation advised).
Just to make sure the problem is well-defined: I want to find k nodes which have maximal distance from one another.
Solutions that are close to the optimal set are also OK - I just need to find some far-apart points in a mesh :)
Assuming you are just looking for a decent solution, I would recommend a simple approach similar to the "furthest insertion" starting position for the travelling salesman problem:
Add one point to the empty set, preferably one in a corner or on an edge (of course, you can also just try all of them)
Add the furthest point to the set (the point that increases the distance from the current set the most)
Keep repeating the previous step until there are k points in the set
It will not be optimal but probably not very bad.
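A sketch of this in MATLAB, working from a full n-by-n pairwise distance matrix D; for a weighted graph, D could come from all-pairs shortest paths, e.g. D = distances(graph(W)) (R2015b or later):

    function idx = farthestK(D, k)
    [~, start] = max(sum(D, 2));          % start from the most peripheral node
    idx = start;
    while numel(idx) < k
        dToSet = min(D(idx, :), [], 1);   % each node's distance to the chosen set
        [~, next] = max(dToSet);          % the node furthest from the set
        idx(end+1) = next;                % already-chosen nodes have distance 0
    end
    end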
If you want to improve on this you could use a heuristic to improve on the result, for example:
Consider the set with points 1 to j left out
Try all possible points as substitutes for these j points
Record the best solution found
Consider the set with points 2 to j+1 left out
Etcetera
Furthermore, if k is not too large (say, less than 5) and the total number of points is not too large (say, less than 100), it will probably be easier to just evaluate all possible combinations. This assumes the distance calculations can be done efficiently.
EDIT:
Once you know you want to implement this, the usual way to proceed is to find something similar and adapt it to your needs. If you scroll down on the page below you should find an example of furthest insertion; editing it to use your measure of "far" should be manageable.
http://snipplr.com/view/4064/shortest-path-heuristics-nearest-neighborhood-2-opt-farthest-and-arbitrary-insertion-for-travelling-salesman-problem/