Find the nearest positions - matlab

On each day, I have 10000 position pairs of the form (x, y); so far I have collected 30 days. I want to select one position pair from each day so that all the selected pairs have similar (x, y) coordinates. By similar I mean that the Euclidean distance between any two pairs is minimized. How can I do this efficiently in MATLAB? With brute force it is practically impossible.
In the brute-force case, there are 10000^30 = 10^120 possibilities; even if each check took only 10^-9 seconds, that is on the order of 10^111 seconds, so it would run forever.

One idea would be to use the k-means algorithm or one of its variations. It is relatively easy to implement (it is also part of the Statistics Toolbox) and has a runtime of about O(nkl).
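As a hedged sketch of how this could look (stacking all days into one matrix, the variable names, and the choice of the largest cluster are all my assumptions, not part of the suggestion itself):

    % P is a 300000x2 matrix of all points from all 30 days (assumed),
    % day is a 300000x1 vector with the day index (1..30) of each point.
    k = 50;                                % number of clusters, a tuning choice
    [idx, C] = kmeans(P, k);               % Statistics Toolbox

    [~, c] = max(accumarray(idx, 1));      % pick the most populated cluster
    sel = zeros(30, 2);                    % one chosen pair per day
    for d = 1:30
        Pd = P(day == d, :);               % that day's points
        [~, j] = min(sum((Pd - C(c,:)).^2, 2));  % nearest to that centroid
        sel(d, :) = Pd(j, :);
    end

The cluster count k and the rule for choosing which cluster to pull the daily picks from would both need tuning on real data.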

Analysing all the possibilities will certainly give you the best result you are looking for.
If you want an approximate result, you can consider the first two days, analyse all the possibilities for those two days, and pick the best pair. Then, when analysing the next day, keep the result obtained previously and find the point of the third day closest to the previous two.
In this way you will obtain an approximate solution, but with much lower computational complexity.
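A minimal sketch of that greedy scheme, assuming the days are stored as pos{1}, ..., pos{30} (10000x2 each; names assumed) and using knnsearch from the Statistics Toolbox to avoid forming a full 10000x10000 distance matrix; taking the point nearest the centroid of the previous picks is one of several reasonable readings of "closest to the previous two":

    % Days 1 and 2: exact closest pair across the two days.
    [j, dmin] = knnsearch(pos{2}, pos{1}); % nearest day-2 point per day-1 point
    [~, i] = min(dmin);                    % best day-1 point
    sel = [pos{1}(i, :); pos{2}(j(i), :)];

    % Days 3..30: greedily add the point closest to the centroid of the picks.
    for d = 3:30
        c = mean(sel, 1);
        jd = knnsearch(pos{d}, c);
        sel(end+1, :) = pos{d}(jd, :);     %#ok<AGROW>
    end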

Related

better building of kd-trees

Has anyone ever tried improving kd-trees using the following method?
1) Dividing each numeric dimension via some 1-d clustering method (e.g. Jenks Natural Breaks optimization, or Fayyad-Irani, or xyz...)
2) Sorting the dimensions by the expected value of the variance reduction within each division of that dimension
3) Building the KD-tree top-down, selecting attributes in the order found in (2)
4) Breaking dimensions at each level of the KD-tree using the divisions found in (1)
And just to state the obvious: if (3) terminates when #rows is (say) less than 30, then a nearest-neighbor query would require 30 distance measures, not N.
You want the tree to be balanced, so there is not much leeway in terms of where to split.
Also, you want the construction to be fast.
If you use an O(n^2) method during construction, construction will likely become the new bottleneck.
In many cases, the very simple (original) k-d-tree is just as fast as any of the "optimized" techniques that try to determine the "best" splitting axis.
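For reference, a minimal example of MATLAB's stock k-d tree (KDTreeSearcher in the Statistics Toolbox), which follows the plain construction discussed above; the data, dimensionality and bucket size here are made up:

    X = rand(10000, 3);                          % made-up data
    tree = KDTreeSearcher(X, 'BucketSize', 30);  % stop splitting below ~30 rows
    [idx, d] = knnsearch(tree, rand(5, 3));      % nearest neighbours of 5 queries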

Automatically truncating a curve to discard outliers in matlab

I am generating some data whose plots are as shown below.
In all the plots I get some outliers at the beginning and at the end. Currently I am truncating the first and the last 10 values. Is there a better way to handle this?
I am basically trying to automatically identify the two points shown below.
This is a fairly general problem with lots of approaches; usually you will use some a priori knowledge of the underlying system to make it tractable.
So, for instance, if you expect to see the pattern above (a fast drop, a linear section going up or down, and a fast rise), you could try taking the derivative of the curve and looking for large values and/or sign reversals, as sketched below. Perhaps it would help to bin the data first.
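A rough sketch of that derivative idea (y is assumed to be a column vector; the 3-sigma factor and the early/late split are assumptions to be tuned):

    dy = diff(y);                          % finite-difference derivative
    thr = 3 * std(dy);                     % "large value" threshold
    spikes = find(abs(dy) > thr);          % samples with a fast drop/rise
    % keep the section between the last early spike and the first late spike:
    n  = numel(y);
    lo = max([1; spikes(spikes < n/2) + 1]);
    hi = min([n; spikes(spikes > n/2)]);
    yTrunc = y(lo:hi);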
If your pattern is not so easy to define but you are expecting a linear trend, you might fit the data to an appropriate class of curve using fit, and then flag as outliers those points whose error from the fit exceeds a given threshold (also sketched below).
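A sketch of the fit-and-threshold variant for a linear trend (fit is in the Curve Fitting Toolbox; polyfit/polyval would do the same in base MATLAB, and the 3-sigma cutoff is again a tuning choice):

    f    = fit(x(:), y(:), 'poly1');       % linear fit
    res  = y(:) - f(x(:));                 % residuals from the fit
    keep = abs(res) < 3 * std(res);        % flag outliers by residual size
    xClean = x(keep);  yClean = y(keep);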
In either case you still have to choose thresholds; the mean, variance and higher-order moments can help here, but you would probably have to analyse existing data (your training set) to determine the values empirically.
And perhaps, after all that, as Shai points out, you may find that lopping off the first and last ten points gives the best results for the time you spent (cf. Pareto principle).

Is it possible to calculate Euclidean Distance between two varying length matrices?

I have started working on an online-signature dataset for verification purposes. I have two matrices containing the digitized data of two signatures of varying length (the number of rows differs), e.g. one is 177×7 and the second is 170×7.
I want to treat each column as one time series and I'd like to compare one time series of a signature with the corresponding time series of second signature.
How should I align the two time series?
I think this question really belongs on Math.StackExchange, but I will do my best to answer it here. The short answer is that the Euclidean distance cannot be applied in this case and you will need to define your own notion of distance. This may or may not actually be feasible.
The notion of distance relies on the existence of a "metric" defined on the space of interest. If your vectors are of different lengths then traditional metrics (including the Euclidean distance) are ill-defined and you need to define a new metric that works for you.
There are two things you'll need to do here:
Define the space you're working with. This seems to be the set of vectors of length 177 or length 170. This is a very unusual set.
Define your metric (and ensure that it actually meets all the properties of a metric).
The most obvious solution is to project vectors of length 177 into the space of vectors of length 170 and then compute the Euclidean distance as usual. For example, you could just ignore the last 7 elements of the vector. Note that this is not a metric on your original set as it violates the condition ( d(x,y)=0 iff x=y ), but it is a metric on the projected vectors. There may be a clever solution on the original set, but there is not an obvious one. Again, the people on Math.StackExchange may be able to help you more.
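A minimal sketch of that projection in MATLAB, with A as the 177×7 matrix and B as the 170×7 one (names assumed), treating each column as one time series as the question intends:

    n = min(size(A,1), size(B,1));               % common length, 170 here
    D = sqrt(sum((A(1:n,:) - B(1:n,:)).^2, 1));  % 1x7: one distance per column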

Find K-farthest neighbors in a weighted graph in matlab

I want to find the K farthest neighbors in a given undirected weighted graph (the graph is given as a sparse weight matrix, but I can use any representation you advise).
Just to make sure the problem is well-defined: I want to find k nodes which have maximal distance from one another.
Solutions that are close to the optimal set are also ok - I just need it to find some farthest points in a mesh :)
Assuming you are just looking for a decent solution, I would recommend a simple approach similar to the "furthest insertion" starting position for the travelling salesman problem:
Add one point to the empty set, preferably one in a corner or on an edge (of course, you can also just try all of them)
Add the furthest point to the set (the one that increases the distance from the current set the most)
Keep repeating the previous step until there are k points in the set
It will not be optimal, but probably not very bad; a minimal sketch is given below.
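A hedged MATLAB sketch of that greedy scheme, using graph/distances to turn the sparse weight matrix W into shortest-path distances (names assumed; starting from one endpoint of the graph's diameter stands in for the "corner" point):

    G = graph(W);                          % undirected weighted graph
    D = distances(G);                      % all-pairs shortest-path distances
    [~, s] = max(D(:));                    % longest shortest path (diameter)
    [i, ~] = ind2sub(size(D), s);
    S = i;                                 % the growing set
    for t = 2:k
        dmin = min(D(S, :), [], 1);        % distance of every node to the set
        dmin(S) = -inf;                    % never re-pick chosen nodes
        [~, j] = max(dmin);                % furthest node from the current set
        S(end+1) = j;                      %#ok<AGROW>
    end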
If you want to improve on this you could use a heuristic to refine the result, for example:
Consider the set with points 1 to j left out (for some j < k)
Try all possible points as substitutes for these j points
Record the best solution found
Consider the set with points 2 to j+1 left out
and so on
Furthermore, if k is not too large (say, less than 5) and the total number of points is not too large (say, less than 100), it will probably be easier to just evaluate all possible combinations, as sketched below. This assumes that the distance calculation can be done efficiently.
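A brute-force sketch over all size-k subsets, reusing the distance matrix D from above and scoring a subset by its smallest pairwise distance (the maximin objective is one reading of "maximal distance from one another"); nchoosek materializes every combination, so this is only viable while nchoosek(n, k) stays small:

    n = size(D, 1);
    combos = nchoosek(1:n, k);             % all subsets of size k (memory-hungry)
    best = -inf;  bestSet = [];
    for r = 1:size(combos, 1)
        S = combos(r, :);
        Dsub = D(S, S);
        score = min(Dsub(~eye(k)));        % smallest pairwise distance in subset
        if score > best
            best = score;  bestSet = S;
        end
    end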
EDIT:
Once you know you want to implement this, the usual way to continue is to find something similar and edit it a bit to suit your needs. If you scroll down on the page below you should find an example of furthest insertion. Editing it to follow your measure of "far" should be manageable.
http://snipplr.com/view/4064/shortest-path-heuristics-nearest-neighborhood-2-opt-farthest-and-arbitrary-insertion-for-travelling-salesman-problem/

Process for comparing two datasets

I have two datasets at a time (in the form of vectors), and I plot them on the same axes to see how they relate to each other. I specifically note and look for places where both graphs have a similar shape (i.e. places where both have a seemingly positive/negative gradient over approximately the same intervals). Example:
So far I have been working through the data graphically, but since the amount of data is so large, plotting it every time I want to check how two sets correlate takes far too much time.
Are there any ideas, scripts or functions that might be useful for automating this process somewhat?
The first thing you have to think about is the nature of the criteria you want to apply to establish similarity. There is a wide variety of ways to measure similarity, and the more precisely you can describe what "similar" means in your problem, the easier it will be to implement, regardless of the programming language.
Having said that, here are some of the things you could look at:
correlation of the two datasets (e.g. a sliding-window correlation, sketched after this list)
difference of the derivatives of the datasets (but I don't think it would be robust enough)
spectral analysis, as mentioned by @thron of three
etc.
Knowing the origin of the datasets and their variability can also help a lot in formulating sufficiently robust algorithms.
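As a hedged illustration of the first item, a sliding-window correlation (A and B are assumed to be equal-length vectors; window length and threshold are tuning choices):

    w = 100;                               % window length
    n = numel(A) - w + 1;
    r = zeros(n, 1);
    for i = 1:n
        c = corrcoef(A(i:i+w-1), B(i:i+w-1));
        r(i) = c(1, 2);                    % local correlation coefficient
    end
    similar = find(r > 0.8);               % window starts where shapes agree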
Sure. Call your two vectors A and B.
1) (Optional) Smooth your data, either with a simple averaging filter (Matlab 'smooth') or the 'filter' command. This will get rid of local changes in velocity ("gradient") that appear to be essentially noise (as in the ascending component of the red trace).
2) Differentiate both A and B. Now you are directly representing the velocity of each vector (Matlab 'diff').
3) Add the two differentiated vectors together (element-wise). Call this C.
4) Look for all points in C whose absolute value is above a certain threshold (you'll have to eyeball the data to get a good idea of what this should be). Points above this threshold indicate highly similar velocity.
5) Now look for where a high positive value in C is followed by a high negative value, or vice versa. In between these two points you will have similar curves in A and B.
Note: a) You could do the smoothing after step 3 rather than after step 1. b) Re 5), you could have a situation in which a 'hill' in your data is at the edge of the vector and so is 'cut in half', with the vectors descending to baseline before ascending in the next hill. Then 5) would misidentify the hill as coming between the initial descent and the subsequent ascent. To avoid this, you could also require that the points in A and B between the two points of velocity similarity have high absolute values.
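A minimal end-to-end sketch of steps 1-5, assuming A and B are equal-length column vectors and that the threshold factor is eyeballed as suggested above:

    As = smooth(A);  Bs = smooth(B);   % 1) smoothing (Curve Fitting Toolbox)
    dA = diff(As);   dB = diff(Bs);    % 2) differentiate: velocity of each
    C  = dA + dB;                      % 3) combined velocity
    thr = 2 * std(C);                  % 4) threshold, to be tuned by eye
    sgn = sign(C) .* (abs(C) > thr);   %    +1/-1 at strong points, 0 elsewhere
    % 5) a strong positive followed by a strong negative, or vice versa:
    ev    = sgn(sgn ~= 0);             % sequence of strong-point signs
    evIdx = find(sgn ~= 0);            % their positions
    flips = find(diff(ev) ~= 0);       % consecutive events of opposite sign
    segments = [evIdx(flips), evIdx(flips + 1)];  % [start, end] of matches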