Better building of kd-trees

Has anyone ever tried improving k-d trees using the following method?

1. Divide each numeric dimension via some 1-d clustering method (e.g. Jenks natural breaks optimization, Fayyad-Irani discretization, etc.).
2. Sort the dimensions by the expected variance reduction within each division of that dimension.
3. Build the k-d tree top-down, selecting attributes in the order found in (2).
4. Split each dimension, at each level of the tree, using the divisions found in (1).

And just to state the obvious: if (3) terminates when the number of rows is (say) less than 30, then nearest-neighbor search requires only about 30 distance measures, not N. (A minimal baseline sketch follows below.)
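For reference, a minimal MATLAB sketch of the plain baseline such a scheme would modify: a median-split k-d tree that stops when a node holds fewer than 30 rows. Function and field names are illustrative; the proposal above would replace the median with the divisions from (1) and the axis cycling with the ordering from (2). Usage: tree = buildKD(X, 0).

function node = buildKD(X, depth)
% X is an n-by-d matrix of points; split on the median of one axis per level.
if size(X, 1) < 30                 % termination rule from the question: a leaf
    node.isLeaf = true;            % holds its rows, so a nearest-neighbor query
    node.points = X;               % there costs ~30 distance measures
    return;
end
ax = mod(depth, size(X, 2)) + 1;   % plain version: cycle through the dimensions
m = median(X(:, ax));
left = X(:, ax) <= m;
if all(left) || ~any(left)         % degenerate split (many ties): stop here
    node.isLeaf = true;
    node.points = X;
    return;
end
node.isLeaf = false;
node.axis = ax;
node.split = m;
node.left = buildKD(X(left, :), depth + 1);
node.right = buildKD(X(~left, :), depth + 1);
end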

You want the tree to be balanced, so there is not much leeway in terms of where to split.
Also, you want the construction to be fast.
If you put an O(n^2) method into the construction, construction will likely become the new bottleneck.
In many cases, the very simple (original) k-d tree is just as fast as any of the "optimized" techniques that try to determine the "best" splitting axis.

Related

Forcing fmincon to consider candidate solutions only from a specific set

Is it possible to force MATLAB's fmincon to generate candidate points from a specific set of points that I supply? I ask because if this were possible, it would greatly reduce computational complexity and knock out a couple of constraints from the optimization.
Edited to add more info: Essentially, I'm looking for a set of points in a high-dimensional space which satisfy a particular function (belonging to that space), along with a couple of other constraints. The objective function minimizes the total length of the path formed by these points. Think generalized geodesic problem, attempted via non-linear optimization. While I manage to get near a solution using fmincon, it is painfully slow and prone to getting stuck in local minima. I already have a sizeable set of data points which satisfy this high-dimensional function, and there is no requirement that the new set of points not come from this pre-existing set.
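fmincon itself optimizes over a continuous domain, so one hedged illustration of the idea (not from the thread; pathLen, candidateSet, and the projection step are illustrative choices) is to optimize the path continuously and then project each waypoint onto the nearest supplied point with dsearchn:

rng(0);
candidateSet = rand(500, 3);                  % pre-existing admissible points (K-by-d)
a = [0 0 0]; b = [1 1 1];                     % fixed path endpoints
n = 4;                                        % number of free interior waypoints
pathLen = @(P) sum(sqrt(sum(diff(P).^2, 2))); % total polyline length
obj = @(x) pathLen([a; reshape(x, n, 3); b]); % objective over stacked coordinates
x0 = rand(n * 3, 1);
xOpt = fmincon(obj, x0, [], [], [], [], zeros(n * 3, 1), ones(n * 3, 1));
waypoints = reshape(xOpt, n, 3);
snapIdx = dsearchn(candidateSet, waypoints);  % nearest supplied point to each waypoint
snapped = candidateSet(snapIdx, :);           % path restricted to the given set

The projection is crude (fmincon never sees the discrete set), so the snapped path may violate the other constraints; it is only meant to show where a nearest-point lookup such as dsearchn fits.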

What exactly is the nobalance option in Matlab eig function?

In Matlab you can call the eig function with the 'nobalance' option. What exactly does it do differently from the default?
From mathworks documentation:
Balance option, specified as one of two strings: 'balance', which enables a preliminary balancing step, or 'nobalance', which disables it. In most cases, the balancing step improves the conditioning of A to produce more accurate results. However, there are cases in which balancing produces incorrect results. Specify 'nobalance' when A contains values whose scale differs dramatically. For example, if A contains nonzero integers, as well as very small (near zero) values, then the balancing step might scale the small values to make them as significant as the integers and produce inaccurate results.
EDIT: A related function balance is said to be the default preceding step in eig.
Note a few lines in the documentation - "The ill conditioning is concentrated in the scaling matrix" .... "If a matrix contains small elements that are due to roundoff error, balancing might scale them up to make them as significant as the other elements of the original matrix."
So, my answer to @Isopycnal's question is "nobalance suppresses amplification of round-off errors when dealing with ill-conditioned matrices". Here are a few points that may help:
"Balancing" a matrix A essentially performs a similarity transformation B = T\A*T, where B is called a "balanced matrix".
Balancing a well-conditioned matrix (one of reasonable scale) concentrates the "asymmetry" into the scaling matrix T. According to the documentation of eig, "In most cases, the balancing step improves the conditioning of A to produce more accurate results."
However, balancing an ill-conditioned matrix (one whose entries differ dramatically in scale) will scale up the round-off errors, because Matlab tries to make the small values (such as 1e-9) as significant as the large ones (say 1e10). It follows that the result will be less precise.
I know it has something to do with the matrix decomposition algorithms Matlab picks when performing eig, e.g. pencil decomposition, LU factorization, etc., as @EJG89 has pointed out. But it's too deeply buried in my memory to recall :( Anyone who knows how Matlab performs commands like eig, please consider expanding this answer! Thanks!
Just for completeness, the balancing method is along the lines of LAPACK's ?GEBAL and ?GEBAK routines but some testing suggests that there are some modifications as the results differ occasionally.
The balancing helps to improve the conditioning via similarity transformations. However, in some cases balancing actually makes the problem worse. The documented cases include Hessenberg matrices and matrices containing numerical noise that the scaling amplifies to the level of the actual data. Depending on the problem, the matrix is also permuted to bring it as close to upper triangular form as possible.
The balancing algorithm can also be used via balance.m
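For instance (a small sketch; the matrix is only meant to exhibit dramatically different scales):

A = [1 1e10; 1e-8 1];           % entries spanning ~18 orders of magnitude
[T, B] = balance(A);            % similarity transformation: B = T\A*T
e1 = eig(A);                    % default, balancing enabled
e2 = eig(A, 'nobalance');       % same matrix, balancing disabled

Comparing e1 and e2 on matrices like this is a quick way to see whether balancing helps or hurts for your data.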
Other relevant balancing routines buried deep in the toolboxes are mscale.m and arescale.m from the Control System Toolbox, which offer more refined control (excuse the pun).

Different results for Fundamental Matrix in Matlab

I am implementing stereo matching, and as preprocessing I am trying to rectify the images without camera calibration.
I use a SURF detector to detect and match features on the images and try to align them. After I find all matches, I remove those that don't lie on the epipolar lines, using this function:
[fMatrix, epipolarInliers, status] = estimateFundamentalMatrix(...
matchedPoints1, matchedPoints2, 'Method', 'RANSAC', ...
'NumTrials', 10000, 'DistanceThreshold', 0.1, 'Confidence', 99.99);
inlierPoints1 = matchedPoints1(epipolarInliers, :);
inlierPoints2 = matchedPoints2(epipolarInliers, :);
figure; showMatchedFeatures(I1, I2, inlierPoints1, inlierPoints2);
legend('Inlier points in I1', 'Inlier points in I2');
The problem is that if I run this function on the same data, I still get different results, causing differences in the resulting disparity map on each run.
The putatively matched points stay the same, but the inlier points differ on each run.
Here you can see that some matches are different in the result:
UPDATE: I thought the differences were caused by the RANSAC method, but using LMedS or MSAC I still get different results on the same data.
EDIT: Admittedly, this is only a partial answer, since I am only explaining why this is even possible with these fitting methods, not how to improve the input keypoints to avoid this problem from the start. There are problems with the distribution of your keypoint matches, as noted in the other answers, and there are ways to address that at the stage of keypoint detection. But the reason the same input can yield different results for repeated executions of estimateFundamentalMatrix with the same pairs of keypoints is the following. (Again, this does not provide sound advice for improving the keypoints so as to solve this problem.)
The different results on repeated executions are related to the RANSAC method (and LMedS and MSAC). They all utilize stochastic (random) sampling and are thus non-deterministic. All methods except Norm8Point operate by randomly sampling 8 pairs of points at a time, for up to NumTrials iterations.
Note first that the different results you get for the same inputs are not equally suitable (they will not have the same residuals), but the search can easily land in any such minimum because the sampling is not deterministic. As the other answers rightly suggest, improve your keypoints and this won't be a problem; but here is why the robust fitting methods can do this, and some ways to modify their behavior.
Notice the documentation for the 'NumTrials' option (ADDED NOTE: changing this is not the solution, but this does explain the behavior):
'NumTrials' — Number of random trials for finding the outliers
500 (default) | integer
Number of random trials for finding the outliers, specified as the comma-separated pair consisting of 'NumTrials' and an integer value. This parameter applies when you set the Method parameter to LMedS, RANSAC, MSAC, or LTS.
MSAC (M-estimator SAmple Consensus) is a modified RANSAC (RANdom SAmple Consensus). Deterministic algorithms for LMedS have exponential complexity and thus stochastic sampling is practically required.
Before you decide to use Norm8Point (again, not the solution), keep in mind that this method assumes NO outliers and is thus not robust to erroneous matches. Try using more trials to stabilize the other methods (EDIT: rather than switching to Norm8Point; if you are able to back up in your algorithm, then address the inputs, the keypoints, as a first line of attack). Also, to reset the random number generator, you could call rng('default') before each call to estimateFundamentalMatrix. But again, note that while this will force the same answer on each run, improving your keypoint distribution is the better solution in general.
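Concretely, using the call from the question:

rng('default');   % reset the global random number generator before each call
[fMatrix, epipolarInliers, status] = estimateFundamentalMatrix( ...
    matchedPoints1, matchedPoints2, 'Method', 'RANSAC', ...
    'NumTrials', 10000, 'DistanceThreshold', 0.1, 'Confidence', 99.99);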
I know it's too late for your answer, but I guess it will be useful for someone in the future. Actually, the problem in your case is twofold:
1. Degenerate location of features: the features are mostly localized (on you :P) and not well spread throughout the image.
2. These matches are roughly on the same plane. I know you would argue that your body is not planar, but compared to the depth of the room, it sort of is.
Mathematically, this means you are more or less extracting E (or F) from a planar surface, which always has infinitely many solutions. To sort this out, I would suggest putting a constraint on the distance between any two extracted SURF features, i.e., any two SURF features used for matching should be at least 40 or 100 pixels apart (depending on the resolution of your image); a sketch follows below.
Another way to get better SURF features is to use larger values of 'NumOctaves', e.g. detectSURFFeatures(rgb2gray(I1), 'NumOctaves', 5).
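A minimal sketch of the distance constraint (greedy filtering; the 40-pixel threshold and the selectStrongest count are illustrative choices):

points = detectSURFFeatures(rgb2gray(I1), 'NumOctaves', 5);
points = points.selectStrongest(500);   % begin with the strongest responses
locs = double(points.Location);         % M-by-2 [x y] pixel positions
minDist = 40;                           % required spacing in pixels
keep = false(size(locs, 1), 1);
keep(1) = true;
for i = 2:size(locs, 1)
    kept = locs(keep, :);
    % accept point i only if it is far enough from every point kept so far
    if all(sqrt(sum((kept - locs(i, :)).^2, 2)) >= minDist)
        keep(i) = true;
    end
end
filteredPoints = points(keep);          % use these for matching instead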
I am facing the same problem and this has helped (a little bit).

How to find the "optimal" cut-off point (threshold)

I have a set of weighted features for machine learning. I'd like to reduce the feature set and just use those with a very large or very small weight.
So given below image of sorted weights, I'd only like to use the features that have weights above the higher or below the lower yellow line.
What I'm looking for is some kind of slope change detection so I can discard all the features until the first/last slope coefficient increase/decrease.
While I (think I) know how to code this myself (with first and second numerical derivatives), I'm interested in any established methods. Perhaps there's some statistic or index that computes something like that, or anything I can use from SciPy?
Edit:
At the moment, I'm using 1.8*positive.std() as positive and 1.8*negative.std() as negative threshold (fast and simple), but I'm not mathematician enough to determine how robust this is. I don't think it is, though. ⍨
If the data are (approximately) Gaussian distributed, then just using a multiple of the standard deviation is sensible. If you are worried about heavier tails, you may want to base your analysis on order statistics.
Since you've plotted the data, I'll assume you're willing to sort all of them. Let N be the number of data points in your sample, and let x[i] be the i-th value in the sorted list of values. Then 0.5*(x[int(0.8413*N)] - x[int(0.1587*N)]) is an estimate of the standard deviation which is more robust against outliers. This estimate of the std can be used as you indicated above. (The magic numbers are the fractions of the data that are less than [mean + 1 sigma] and [mean - 1 sigma], respectively.)
There are also conditions where just keeping the highest 10% and lowest 10% would be sensible as well; these cutoffs are easily computed if you have the sorted data on hand.
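In MATLAB form (the 1.8 multiplier is the one from the question; weights, the median centering, and the max(1, ...) index guard are illustrative choices):

x = sort(weights(:));                   % sorted feature weights
N = numel(x);
% half the spread between the ~84.13th and ~15.87th percentiles:
% a robust order-statistic estimate of the standard deviation
sigmaEst = 0.5 * (x(max(1, round(0.8413 * N))) - x(max(1, round(0.1587 * N))));
lo = median(x) - 1.8 * sigmaEst;        % lower yellow line
hi = median(x) + 1.8 * sigmaEst;        % upper yellow line
selected = x(x < lo | x > hi);          % keep only the extreme weights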
These are somewhat ad hoc approaches based on the content of your question. The general sense of what you're trying to do is (a form of) anomaly detection, and you can probably do a better job of it if you're careful in defining/estimating the shape of the distribution near the middle, so that you can tell when the features are getting anomalous.

Data clustering algorithm

What is the most popular text clustering algorithm that deals with large dimensions and huge datasets and is fast?
I am getting confused after reading so many papers and so many approaches. Now I just want to know which one is used most, to have a good starting point for writing a clustering application for documents.
To deal with the curse of dimensionality, you can try to determine the blind sources (i.e. topics) that generated your dataset. You could use Principal Component Analysis or Factor Analysis to reduce the dimensionality of your feature set and to compute useful indexes.
PCA is what is used in Latent Semantic Indexing, since SVD can be shown to be PCA :)
Remember that you can lose interpretability when you obtain the principal components of your dataset or its factors, so you may want to go the Non-Negative Matrix Factorization route. (And here is the punch: k-means is a particular case of NNMF!) In NNMF the dataset can be explained just by its additive, non-negative components.
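As a small MATLAB sketch of that route (assuming tfidf is a documents-by-terms matrix you have already built; k and the cluster count are illustrative):

k = 100;                      % number of latent topics to keep
[U, S, V] = svds(tfidf, k);   % truncated SVD, the LSI-style reduction
docTopic = U * S;             % documents represented in the reduced topic space
idx = kmeans(docTopic, 20);   % cluster in the reduced space instead of raw terms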
There is no one-size-fits-all approach. Hierarchical clustering is always an option. If you want distinct groups formed out of the data, you can go with k-means clustering (it is also supposedly less computationally intensive).
The two most popular document clustering approaches are hierarchical clustering and k-means. k-means is faster, as it is linear in the number of documents, whereas hierarchical clustering is quadratic; hierarchical clustering, however, is generally believed to give better results. Each document in the dataset is usually represented as an n-dimensional vector (n is the number of words), with the magnitude of the dimension corresponding to each word equal to its term frequency-inverse document frequency score. The tf-idf score reduces the importance of high-frequency words in the similarity calculation. Cosine similarity is often used as the similarity measure. A sketch of this pipeline is given below.
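A compact sketch of that pipeline (counts is an assumed documents-by-terms matrix of raw term counts; the cluster count is arbitrary):

df = sum(counts > 0, 1);                    % number of documents containing each word
idf = log(size(counts, 1) ./ max(df, 1));   % inverse document frequency
tfidf = counts .* idf;                      % tf-idf weighting (idf expands across rows)
% k-means with cosine distance, the usual similarity measure for documents;
% all-zero rows (empty documents) should be removed first
idx = kmeans(tfidf, 10, 'Distance', 'cosine', 'Replicates', 5);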
A paper comparing experimental results between hierarchical and bisecting k-means, a cousin algorithm to k-means, can be found here.
The simplest approaches to dimensionality reduction in document clustering are: a) throwing out all rare and highly frequent words (say, those occurring in less than 1% or more than 60% of documents; this is somewhat arbitrary, and you need to try different ranges for each dataset to see the impact on results); b) stopping: throwing out all words that appear in a stop list of common English words (lists can be found online); and c) stemming, or removing suffixes to leave only word roots. The most common stemmer is the one designed by Martin Porter; implementations in many languages can be found here. Usually this will reduce the number of unique words in a dataset to a few hundred or low thousands, and further dimensionality reduction may not be required. Otherwise, techniques like PCA can be used.
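For instance, the frequency-based pruning in (a) is a two-liner on a counts matrix (thresholds as suggested above; counts as in the earlier sketch):

docFrac = mean(counts > 0, 1);                           % fraction of documents containing each word
counts = counts(:, docFrac >= 0.01 & docFrac <= 0.60);   % drop rare and ubiquitous words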
I would stick with k-medoids, since you can compute the distance from any point to any point at the beginning of the algorithm. You only need to do this once, and it saves you time, especially if there are many dimensions. This algorithm works by choosing as the center of a cluster the point that is nearest to it, not a centroid calculated from the averages of the points belonging to that cluster. Therefore you have all possible distance calculations already done for you in this algorithm.
In case you aren't looking for semantic text clustering (I can't tell whether this is a requirement from your original question), try using Levenshtein distance and building a similarity matrix with it. From this, you can use k-medoids to cluster and subsequently validate your clustering using silhouette coefficients. Unfortunately, Levenshtein can be quite slow, but there are ways to speed it up through thresholds and other methods.
Another way to deal with the curse of dimensionality would be to find "contrasting sets": conjunctions of attribute-value pairs that are more prominent in one group than in the rest. You can then use those contrasting sets as dimensions, either in lieu of the original attributes or with a restricted number of attributes.