I want to find the K farthest neighbors in a given undirected weighted graph (the graph is given as a sparse weight matrix, but I can use any representation advised).
Just to make sure the problem is well-defined: I want to find k nodes which have maximal distance from one another.
Solutions that are close to the optimal set are also ok - I just need it to find some farthest points in a mesh :)
Assuming you are just looking for a decent solution, I would recommend a simple approach similar to the "furthest insertion" starting position for the travelling salesman problem:
Add 1 point to the empty set, preferably one in a corner or on the edge (of course you can also just try all of them)
Add the point furthest from the set, i.e. the one that increases the distance from the current points in the set the most
Keep repeating the previous step until there are k points in the set
It will not be optimal but probably not very bad.
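For what it's worth, here is a rough Python/NumPy sketch of this greedy loop (my own illustration; it assumes you have already turned the sparse weight matrix into a full pairwise distance matrix D, e.g. with a shortest-path routine, and it takes "distance from the set" to mean the distance to the closest already-chosen point):

    import numpy as np

    def greedy_farthest_subset(D, k, start=0):
        # D     : (n, n) symmetric matrix of pairwise (graph) distances
        # k     : number of points to select
        # start : seed point (e.g. a corner/edge point, or simply try all of them)
        chosen = [start]
        # dist_to_set[i] = distance from point i to the closest chosen point
        dist_to_set = D[start].copy()
        while len(chosen) < k:
            nxt = int(np.argmax(dist_to_set))   # farthest from the current set
            chosen.append(nxt)
            dist_to_set = np.minimum(dist_to_set, D[nxt])
        return chosen

    # Example with Euclidean points standing in for the mesh distances:
    # pts = np.random.rand(100, 3)
    # D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    # print(greedy_farthest_subset(D, 5))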
If you want to improve on this you could use a heuristic to improve on the result, for example:
Consider the set with points 1 to j left out
Try all possible points as substitutes for these j points
Record the best solution found
Consider the set with points 2 to j+1 left out
Et cetera
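A rough sketch of the simplest (j = 1) version of this substitution idea, again assuming a precomputed distance matrix D; the scoring function (sum of pairwise distances within the set) is my own assumption, so swap in whatever measure of "far apart" you actually use:

    from itertools import combinations

    def score(D, subset):
        # One possible measure of how "far apart" a subset is.
        return sum(D[i][j] for i, j in combinations(subset, 2))

    def improve_by_substitution(D, subset):
        # Leave one chosen point out, try every replacement, repeat until no gain.
        subset = list(subset)
        n = len(D)
        improved = True
        while improved:
            improved = False
            for pos in range(len(subset)):
                for cand in range(n):
                    if cand in subset:
                        continue
                    trial = subset[:pos] + [cand] + subset[pos + 1:]
                    if score(D, trial) > score(D, subset):
                        subset, improved = trial, True
        return subset

Calling this on the output of the greedy step above can only improve the result; the j > 1 variants simply replace the single left-out position with every combination of j positions.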
Furthermore, if k is not too large (say less than 5) and the total number of points is not too large (say less than 100), it will probably be easier to just evaluate all possible combinations. This assumes that the norm calculation can be done efficiently.
EDIT:
Once you know you want to implement this, the usual way to continue is to find something similar and edit it a bit to suit your needs. If you scroll down on the page linked below you should find an example of furthest insertion. Editing it to follow your measure of 'far' should be manageable.
http://snipplr.com/view/4064/shortest-path-heuristics-nearest-neighborhood-2-opt-farthest-and-arbitrary-insertion-for-travelling-salesman-problem/
I need to divide data points into those that are similar to each other ("good" points) and everyone else ("bad" points).
It looks like some kind of clustering problem, and here is what I do:
I am assuming that there are at least two "good" points.
Find the pairwise distances between all points.
Find minimum distance (minDist).
Do hierarchical clustering for all points.
Make a cut at the height of 5*minDist.
Say that all points that end up in the same cluster as the minDist pair under that cut belong to the desired "good" cluster.
And this works pretty well, but it breaks when there are two points that are very close to each other: minDist is then very small, the 5*minDist cut is also small, and only these 2 points end up in the desired "good" cluster.
I would think that either I need to change this approach completely and here is question number 1:
[1] "What methods do exist to separate similar points from everyone else?"
Or I need to modify this 5*minDist to some other function of minDist. And the question is:
[2] "What may I choose as reasonable alternative to 5*minDist?"
Vladimir
Instead of doing clustering, you want to do outlier detection.
There are dozens of algorithms for this (see ELKI for a large collection). Some very basic methods may solve your problem:
The number of neighbors within radius r. If this count is below a threshold, the point is an outlier.
Distance to the k-th nearest neighbor. Choose k > 1 to avoid the 2-element clusters you are seeing.
Also, DBSCAN clustering could work for you. Consider all clusters to be good, and only noise to be bad!
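To make the three suggestions concrete, here is a quick Python/scikit-learn sketch (my own illustration; the radius r, the neighbour count k and the thresholds are placeholders you would need to tune for your data):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.cluster import DBSCAN

    X = np.random.rand(200, 2)          # stand-in for your points
    r, k, count_threshold = 0.1, 3, 5   # placeholder parameters

    nn = NearestNeighbors().fit(X)

    # 1) number of neighbours within radius r (minus the point itself)
    counts = np.array([len(ind) - 1 for ind in nn.radius_neighbors(X, radius=r)[1]])
    radius_outliers = counts < count_threshold

    # 2) distance to the k-th nearest neighbour; k > 1 avoids 2-element clusters
    dists, _ = nn.kneighbors(X, n_neighbors=k + 1)   # +1: each point is its own neighbour
    knn_outliers = dists[:, -1] > np.percentile(dists[:, -1], 95)  # cut-off is arbitrary

    # 3) DBSCAN: everything labelled -1 is noise ("bad"), the rest is "good"
    labels = DBSCAN(eps=r, min_samples=k).fit_predict(X)
    good = labels != -1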
On each day, I have 10000 position pairs in the form (x,y); so far I have collected 30 days. I want to select one position pair from each day so that all the selected pairs have similar (x,y) values. By similar I mean that the Euclidean distance between any two selected pairs is minimized. How can I do this efficiently in MATLAB? With brute force it is almost impossible.
In the brute force case there are 10000^30 possibilities; even if each operation took only 10^-9 seconds, it would run forever.
One idea would be to use the k-means algorithm or one of its variations. It is relatively easy to implement (it is also part of the Statistics Toolbox) and has a runtime of about O(nkl).
Analysing all the possibilities will certainly give you the best result you are looking for.
If you want an approximate result, you can consider the first two days, analyse all the possibilities for these 2 days, and pick the best result. Then, when analysing the next day, keep the result obtained previously and find the point of the third day closest to the previous two.
In this way you will obtain an approximate solution, but with much lower computational complexity.
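A rough sketch of that greedy idea in Python/NumPy (the question asks for MATLAB, but the structure is the same; the KD-tree for the first two days and the running centroid for the following days are my own choices, not part of the original suggestion):

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    days = [rng.random((10000, 2)) for _ in range(30)]   # stand-in data

    # Best pair across the first two days via a nearest-neighbour query
    # (equivalent to checking all 10000 x 10000 combinations, but much cheaper).
    d, j = cKDTree(days[1]).query(days[0])
    i = int(np.argmin(d))
    chosen = [days[0][i], days[1][j[i]]]

    # For every following day, take the point closest to what was chosen so far
    # (here: closest to the centroid of the chosen points).
    for day in days[2:]:
        centre = np.mean(chosen, axis=0)
        idx = int(np.argmin(np.linalg.norm(day - centre, axis=1)))
        chosen.append(day[idx])

    chosen = np.array(chosen)   # one (x, y) per day, 30 rows in total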
This is a Homework question. I have a huge document full of words. My challenge is to classify these words into different groups/clusters that adequately represent the words. My strategy to deal with it is using the K-Means algorithm, which as you know takes the following steps.
Generate k random means for the entire group
Create K clusters by associating each word with the nearest mean
Compute centroid of each cluster, which becomes the new mean
Repeat Step 2 and Step 3 until a certain benchmark/convergence has been reached.
Theoretically, I kind of get it, but not quite. I have a question corresponding to each step; these are:
How do I decide on the k random means? Technically I could say 5, but that may not necessarily be a good number. So is this k purely a random number, or is it actually driven by heuristics such as the size of the dataset, the number of words involved, etc.?
How do you associate each word with the nearest mean? Theoretically I can conclude that each word is associated by its distance to the nearest mean, hence if there are 3 means, any word that belongs to a specific cluster is dependent on which mean it has the shortest distance to. However, how is this actually computed? Given two words, "group" and "textword", and an assumed mean word "pencil", how do I create a similarity matrix?
How do you calculate the centroid?
When you repeat step 2 and step 3, are you treating each previous cluster as a new data set?
Lots of questions, and I am obviously not clear. If there are any resources that I can read from, it would be great. Wikipedia did not suffice :(
As you don't know the exact number of clusters, I'd suggest using a kind of hierarchical clustering:
Imagine that all your words are just points in a non-Euclidean space. Use the Levenshtein distance to calculate the distance between words (it works great if you want to detect clusters of lexicographically similar words)
Build a minimum spanning tree which contains all of your words
Remove links which are longer than some threshold
The linked groups of words that remain are clusters of similar words
P.S. You can find many papers on the web that describe clustering based on building a minimum spanning tree.
P.P.S. If you want to detect clusters of semantically similar words, you will need algorithms for automatic thesaurus construction.
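A small Python sketch of the MST-plus-threshold idea above (my own illustration; the threshold of 2 edits and the toy word list are placeholders, and the dense distance matrix is only feasible for at most a few thousand words):

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    words = ["apple", "apples", "applet", "orange", "oranges", "banana"]
    n = len(words)

    # Dense pairwise distance matrix.
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = levenshtein(words[i], words[j])

    # Minimum spanning tree, then cut edges longer than a threshold
    # (the threshold is something you would tune for your vocabulary).
    threshold = 2
    mst = minimum_spanning_tree(D).toarray()
    mst[mst > threshold] = 0
    _, labels = connected_components(mst, directed=False)

    for lab in set(labels):
        print(lab, [w for w, l in zip(words, labels) if l == lab])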
That you have to choose "k" for k-means is one of the biggest drawbacks of k-means.
However, if you use the search function here, you will find a number of questions that deal with the known heuristic approaches to choosing k, mostly by comparing the results of running the algorithm multiple times.
As for "nearest". K-means acutally does not use distances. Some people believe it uses euclidean, other say it is squared euclidean. Technically, what k-means is interested in, is the variance. It minimizes the overall variance, by assigning each object to the cluster such that the variance is minimized. Coincidentially, the sum of squared deviations - one objects contribution to the total variance - over all dimensions is exactly the definition of squared euclidean distance. And since the square root is monotone, you can also use euclidean distance instead.
Anyway, if you want to use k-means with words, you first need to represent the words as vectors for which the squared Euclidean distance is meaningful. I don't think this will be easy, or it may not even be possible.
About the distance: in fact, the Levenshtein (or edit) distance satisfies the triangle inequality. It also satisfies the rest of the properties necessary to be a metric (not all distance functions are metric functions). Therefore you can implement a clustering algorithm using this metric function, and this is the function you could use to compute your similarity matrix S:
S_{i,j} = d(x_i, x_j) = d(x_j, x_i) = S_{j,i}
It's worth mentioning that the Damerau-Levenshtein distance doesn't satisfy the triangle inequality, so be careful with it.
About the k-means algorithm: Yes, in the basic version you must define by hand the K parameter. And the rest of the algorithm is the same for a given metric.
I am given a set of points x_1, x_2, ..., x_n \in R^d. I wish to find a subset of k points such that the sum of the distances between these k points is minimal. Naively this is an O(n choose k) problem, but I am looking for a faster algorithm.
I can think of two alternative equivalent formulations:
The minimal edge weight clique problem: think of the points as a graph, with edge weights given by the distances, and find the clique of minimal weight. This is equivalent to the maximum edge weight clique problem, which is known to be NP-complete. However, I have the benefit of knowing that my graph is embedded in R^d, and that all the weights are positive, so perhaps that might help?
The minimal unconstrained sub-matrix problem: I am given the symmetric distance matrix, and I want to find a k x k principal submatrix with minimal sum.
I'd appreciate any help in this.
The most obvious optimization doesn't really require any different formula.
Just greedily find a near-optimal candidate first. Try to refine it in linear time by swapping members. Then do an exhaustive search but stop whenever the new candidates are worse than the greedy-candidate to prune the search space.
E.g.
Compute the mean
Order objects by squared distance from mean
Test all n-k+1 windows of length k in this order, and choose the best
For any non-chosen object, try to swap it with one of the chosen objects, if it improves the score
Now you should have a reasonably good candidate for pruning.
Then do an exhaustive search, and stop whenever it is worse than this candidate.
Note: steps 1-3 are an inspiration taken from fast convex hull algorithms.
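A rough Python/NumPy sketch of steps 1-4, under my own reading of them (the score is the sum of pairwise distances inside the subset, and the dense distance matrix limits this to a few thousand points); the score it returns can then serve as the pruning bound for the exhaustive search:

    import numpy as np
    from itertools import combinations

    def subset_score(D, idx):
        # Sum of pairwise distances inside the subset.
        return sum(D[i, j] for i, j in combinations(idx, 2))

    def greedy_candidate(X, k):
        # Steps 1-4: mean, sort by squared distance to the mean,
        # best contiguous window, then pairwise swaps.
        n = len(X)
        D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)

        order = np.argsort(((X - X.mean(axis=0)) ** 2).sum(axis=1))
        windows = [list(order[s:s + k]) for s in range(n - k + 1)]
        best = min(windows, key=lambda w: subset_score(D, w))

        improved = True
        while improved:
            improved = False
            for pos in range(k):
                for cand in range(n):
                    if cand in best:
                        continue
                    trial = best[:pos] + [cand] + best[pos + 1:]
                    if subset_score(D, trial) < subset_score(D, best):
                        best, improved = trial, True
        return best, subset_score(D, best)

    # Example; the returned score is the bound used to prune the exhaustive search:
    # X = np.random.rand(200, 2)
    # subset, bound = greedy_candidate(X, 5)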
I have a dataset consisting of a large collection of points in three-dimensional Euclidean space. In this collection of points, I am trying to find the point that is nearest to the area with the highest density of points.
So my problem consists of two steps:
1: Determine where the density of the distribution of points is at its highest
2: Determine which point is nearest to the point found in 1
Point 2 I can manage, but I'm not sure how to solve point 1. I know there are a lot of functions for density estimation in Matlab, but I'm not sure which one would be the most suitable, or the most straightforward to use.
Does anyone know?
My command of statistics is a little bit rusty, but as far as I can tell, this type of problem calls for multivariate analysis. Someone suggested I use multivariate kernel density estimation, but I'm not really sure if that's the best solution.
Density is a measure of mass per unit volume. On the assumption that your points all have the same mass, you are, I suppose, trying to measure the number of points per unit volume. So one approach is to divide your subset of Euclidean space into lots of little volumes (let's call them voxels like everyone does) and count how many points there are in each one. The voxel with the most points is where the density of points is at its highest. This is, of course, numerical integration of a sort. If your points were distributed according to some analytic function (and I guess they are not) you could solve the problem with pencil and paper.
You might make this approach as sophisticated as you like, perhaps initially dividing your space into 2 x 2 x 2 voxels, then choosing the voxel with most points and sub-dividing that in turn until your criteria are satisfied.
I hope this will get you started on your point 1; you seem to be OK with point 2 so I'll stop now.
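If it helps, here is a small Python/NumPy sketch of the voxel-counting idea (my own illustration, not MATLAB; the number of bins per axis is a placeholder you would refine, e.g. by the recursive subdivision described above):

    import numpy as np

    def densest_point(points, bins=10):
        # Count points per voxel, find the fullest voxel, then return the
        # data point closest to that voxel's centre.
        counts, edges = np.histogramdd(points, bins=bins)
        fullest = np.unravel_index(np.argmax(counts), counts.shape)
        centre = np.array([(e[i] + e[i + 1]) / 2.0 for e, i in zip(edges, fullest)])
        nearest = int(np.argmin(np.linalg.norm(points - centre, axis=1)))
        return points[nearest], centre

    # Example:
    # pts = np.random.randn(10000, 3)
    # point, voxel_centre = densest_point(pts)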
EDIT
It looks as if triplequad might be what you are looking for.