Is it possible to calculate Euclidean Distance between two varying length matrices? - matlab

I have started working on online signature data-set for verification purpose. I have two matrices containing digitized data of two signatures of varying length (the number of rows differ). e.g. one is 177×7 and second is 170×7.
I want to treat each column as one time series and I'd like to compare one time series of a signature with the corresponding time series of second signature.
How should I align the two time series?

I think this question really belongs on Math.StackExchange, but I will do my best to answer it here. The short answer is that the Euclidean distance cannot be applied in this case and you will need to define your own notion of distance. This may or may not actually be feasible.
The notion of distance relies on the existence of a "metric" defined on the space of interest. If your vectors are of different lengths then traditional metrics (including the Euclidean distance) are ill-defined and you need to define a new metric that works for you.
There are two things you'll need to do here:
Define the space you're working with. This seems to be the set of vectors of length 177 or length 170. This is a very unusual set.
Define your metric (and ensure that it actually meets all the properties of a metric).
The most obvious solution is to project vectors of length 177 into the space of vectors of length 170 and then compute the Euclidean distance as usual. For example, you could just ignore the last 7 elements of the vector. Note that this is not a metric on your original set as it violates the condition ( d(x,y)=0 iff x=y ), but it is a metric on the projected vectors. There may be a clever solution on the original set, but there is not an obvious one. Again, the people on Math.StackExchange may be able to help you more.

Related

Principal component analysis and feature reductions

I have a matrix composed of 35 features, I need to reduce those
feature because I think many variable are dependent. I undertsood PCA
could help me to do that, so using matlab, I calculated:
[coeff,score,latent] = pca(list_of_features)
I notice "coeff" contains matrix which I understood (correct me if I'm wrong) have column with high importance on the left, and second column with less importance and so on. However, it's not clear for me which column on "coeff" relate to which column on my original "list_of_features" so that I could know which variable is more important.
PCA doesn't give you an order relation on your original features (which feature is more 'important' then others), rather it gives you directions in feature space, ordered according to the variance, from high variance (1st direction, or principle component) to low variance. A direction is generally a linear combination of your original features, so you can't expect to get information about a single feature.
What you can do is to throw away a direction (one or more), or in other words project you data into the sub-space spanned by a subset of the principle components. Usually you want to throw the directions with low variance, but that's really a choice which depends on what is your application.
Let's say you want to leave only the first k principle components:
x = score(:,1:k) * coeff(:,1:k)';
Note however that pca centers the data, so you actually get the projection of the centered version of your data.

calculate distance between curves

I need to calculate some kind of distance between to curves.
Those are general curves, and may not be functions - that is, some values of x may be mapped to more then one value.
EDIT
The curves are given as a list of X,Y pairs and the logical curve is the line passing through all the points in the order given. a typical data set will include about 1000 points
as noted, the curve may not be a function, but is usually similar to a function
This issue us what prevents using interp1 or the curve fitting toolbox (in Matlab)
The distance measure I was thinking of the the area of the region between the curves - but any reasonable alternative is ok.
EDIT
a sample illustration of to curves, and the area I want to compute
A Matlab solution is preferred, but other languages are also fine.
If you have functions that are of the type y = f(x) and they are defined over the same domain, then a common way to find the "distance" is to use the L2 norm as explained here http://en.wikipedia.org/wiki/L2_norm#p-norm. This is simply the integral of the absolute value of the difference between the functions squared. If you have parametric curves then you cannot directly employ this approach. If the L2 norm is not good enough for your requirements then you will need to provide a more concrete definition of what you mean by "distance". If you are unclear as to what you need try taking a look at different types of mathematical norm and see if any of the commonly used ones are what you need (ie L1 norm, uniform norm). The wikepedia link above is a good start point. If the L2 is good enough then you need a way to calculate the integral that you have - there are many many numerical integration techniques out there and I suggest google is your friend here (or a good numerical analysis text book).
If you do have parametric type curves then this is very nontrivial. Using the "area" between curves is not a good idea as there is no clear way to define this area and would become even more complicated in the general case where you could have self-intersecting curves. If your curves are parametrized in the same way you could try some very crude measurement where you evaluate points on each curve at equally spaced values over the parameter range, then calculate the length of the distance between each, sum and take the average as a notion of "closeness". i.e. partition your parameter range into a set {u_0, ... , u_n} and evaluate curve1(u_i) and curve2(u_i) for each i to generate a set of n paired points. Then sum the euclidean distances between each pair of points.
This is very very crude though and if the parametrizations are different then it wont be much use.
You need to define what you mean by distance between the curves. If it is the closest approach between two general curves, then it becomes quite difficult to solve the problem.
If the "curves" are not even representable as single valued functions of x, then it becomes more complex yet.
Merely telling us that you need to define "some kind of distance" is too broad of a statement to be on-topic here, and it says that you have not yet thought out the problem you wish to solve.
If all you are willing to tell us is that the curves are two totally general parametric curves, which may be closed or not, or they may not even lie over the same domain, then the question becomes so totally ill-posed as to be impossible to answer. What is the area between two curves in that case?
If the curves are defined over the SAME support, then subtracting them and integration of the absolute value or square of the difference will be adequate. But you have already told us that these "curves" may be multi-valued. In that case, it is essentially impossible to do what you are asking.

Clustering words into groups

This is a Homework question. I have a huge document full of words. My challenge is to classify these words into different groups/clusters that adequately represent the words. My strategy to deal with it is using the K-Means algorithm, which as you know takes the following steps.
Generate k random means for the entire group
Create K clusters by associating each word with the nearest mean
Compute centroid of each cluster, which becomes the new mean
Repeat Step 2 and Step 3 until a certain benchmark/convergence has been reached.
Theoretically, I kind of get it, but not quite. I think at each step, I have questions that correspond to it, these are:
How do I decide on k random means, technically I could say 5, but that may not necessarily be a good random number. So is this k purely a random number or is it actually driven by heuristics such as size of the dataset, number of words involved etc
How do you associate each word with the nearest mean? Theoretically I can conclude that each word is associated by its distance to the nearest mean, hence if there are 3 means, any word that belongs to a specific cluster is dependent on which mean it has the shortest distance to. However, how is this actually computed? Between two words "group", "textword" and assume a mean word "pencil", how do I create a similarity matrix.
How do you calculate the centroid?
When you repeat step 2 and step 3, you are assuming each previous cluster as a new data set?
Lots of questions, and I am obviously not clear. If there are any resources that I can read from, it would be great. Wikipedia did not suffice :(
As you don't know exact number of clusters - I'd suggest you to use a kind of hierarchical clustering:
Imagine that all your words just a points in non-euclidean space. Use Levenshtein distance to calculate distance between words (it works great, in case, if you want to detect clusters of lexicographically similar words)
Build minimum spanning tree which contains all of your words
Remove links, which have length greater than some threshold
Linked groups of words are clusters of similar words
Here is small illustration:
P.S. you can find many papers in web, where described clustering based on building of minimal spanning tree
P.P.S. If you want to detect clusters of semantically similar words, you need some algorithms of automatic thesaurus construction
That you have to choose "k" for k-means is one of the biggest drawbacks of k-means.
However, if you use the search function here, you will find a number of questions that deal with the known heuristical approaches to choosing k. Mostly by comparing the results of running the algorithm multiple times.
As for "nearest". K-means acutally does not use distances. Some people believe it uses euclidean, other say it is squared euclidean. Technically, what k-means is interested in, is the variance. It minimizes the overall variance, by assigning each object to the cluster such that the variance is minimized. Coincidentially, the sum of squared deviations - one objects contribution to the total variance - over all dimensions is exactly the definition of squared euclidean distance. And since the square root is monotone, you can also use euclidean distance instead.
Anyway, if you want to use k-means with words, you first need to represent the words as vectors where the squared euclidean distance is meaningful. I don't think this will be easy or maybe not even possible.
About the distance: In fact, Levenshtein (or edit) distance satisfies triangle inequality. It also satisfies the rest of the necessary properties to become a metric (not all distance functions are metric functions). Therefore you can implement a clustering algorithm using this metric function, and this is the function you could use to compute your similarity matrix S:
-> S_{i,j} = d(x_i, x_j) = S_{j,i} = d(x_j, x_i)
It's worth to mention that the Damerau-Levenshtein distance doesn't satisfy the triangle inequality, so be careful with this.
About the k-means algorithm: Yes, in the basic version you must define by hand the K parameter. And the rest of the algorithm is the same for a given metric.

KNN classification with categorical data

I'm busy working on a project involving k-nearest neighbor (KNN) classification. I have mixed numerical and categorical fields. The categorical values are ordinal (e.g. bank name, account type). Numerical types are, for e.g. salary and age. There are also some binary types (e.g., male, female).
How do I go about incorporating categorical values into the KNN analysis?
As far as I'm aware, one cannot simply map each categorical field to number keys (e.g. bank 1 = 1; bank 2 = 2, etc.), so I need a better approach for using the categorical fields. I have heard that one can use binary numbers. Is this a feasible method?
You need to find a distance function that works for your data. The use of binary indicator variables solves this problem implicitly. This has the benefit of allowing you to continue your probably matrix based implementation with this kind of data, but a much simpler way - and appropriate for most distance based methods - is to just use a modified distance function.
There is an infinite number of such combinations. You need to experiment which works best for you. Essentially, you might want to use some classic metric on the numeric values (usually with normalization applied; but it may make sense to also move this normalization into the distance function), plus a distance on the other attributes, scaled appropriately.
In most real application domains of distance based algorithms, this is the most difficult part, optimizing your domain specific distance function. You can see this as part of preprocessing: defining similarity.
There is much more than just Euclidean distance. There are various set theoretic measures which may be much more appropriate in your case. For example, Tanimoto coefficient, Jaccard similarity, Dice's coefficient and so on. Cosine might be an option, too.
There are whole conferences dedicated to the topics of similarity search - nobody claimed this is trivial in anything but Euclidean vector spaces (and actually, not even there): http://www.sisap.org/2012
The most straight forward way to convert categorical data into numeric is by using indicator vectors. See the reference I posted at my previous comment.
Can we use Locality Sensitive Hashing (LSH) + edit distance and assume that every bin represents a different category? I understand that categorical data does not show any order and the bins in LSH are arranged according to a hash function. Finding the hash function that gives a meaningful number of bins sounds to me like learning a metric space.

Process for comparing two datasets

I have two datasets at the time (in the form of vectors) and I plot them on the same axis to see how they relate with each other, and I specifically note and look for places where both graphs have a similar shape (i.e places where both have seemingly positive/negative gradient at approximately the same intervals). Example:
So far I have been working through the data graphically but realize that since the amount of the data is so large plotting each time I want to check how two sets correlate graphically it will take far too much time.
Are there any ideas, scripts or functions that might be useful in order to automize this process somewhat?
The first thing you have to think about is the nature of the criteria you want to apply to establish the similarity. There is a wide variety of ways to measure similarity and the more precisely you can describe what you want for "similar" to mean in your problem the easiest it will be to implement it regardless of the programming language.
Having said that, here is some of the thing you could look at :
correlation of the two datasets
difference of the derivative of the datasets (but I don't think it would be robust enough)
spectral analysis as mentionned by #thron of three
etc. ...
Knowing the origin of the datasets and their variability can also help a lot in formulating robust enough algorithms.
Sure. Call your two vectors A and B.
1) (Optional) Smooth your data either with a simple averaging filter (Matlab 'smooth'), or the 'filter' command. This will get rid of local changes in velocity ("gradient") that appear to be essentially noise (as in the ascending component of the red trace.
2) Differentiate both A and B. Now you are directly representing the velocity of each vector (Matlab 'diff').
3) Add the two differentiated vectors together (element-wise). Call this C.
4) Look for all points in C whose absolute value is above a certain threshold (you'll have to eyeball the data to get a good idea of what this should be). Points above this threshold indicate highly similar velocity.
5) Now look for where a high positive value in C is followed by a high negative value, or vice versa. In between these two points you will have similar curves in A and B.
Note: a) You could do the smoothing after step 3 rather than after step 1. b) Re 5), you could have a situation in which a 'hill' in your data is at the edge of the vector and so is 'cut in half', and the vectors descend to baseline before ascending in the next hill. Then 5) would misidentify the hill as coming between the initial descent and subsequent ascent. To avoid this, you could also require that the points in A and B in between the two points of velocity similarity have high absolute values.