When I run some cepstral coefficient data generated from .wav
files in ELKI with the k-means algorithm (k = 32, max iter = 100), it gives
negative values for the following pair-counting measures:
Jaccard = -3.3627, Recall = -3.3627, Rand = -3.3627, and F1-Measure = 2.8465. I
looked up the range of these measures and it was given as
(0, 1). I ran this data with several other algorithms and have the
same problem. Can anyone please interpret this?
The values should be in the range [0, 1], but only if:
you have complete labels (missing labels can be skipped, but I'm not sure whether our implementation handles this case yet), and
the clustering is a complete, non-overlapping, crisp partitioning.
Furthermore, when clusters degenerate (depending on your data and seeding, this can happen with k-means) there may be empty clusters, and these again may yield undesired results with a literal implementation of these measures.
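For reference, here is a minimal MATLAB sketch (toy labels, purely illustrative; this is not ELKI's code) of how pair-counting measures are typically computed from the contingency table. With complete, crisp, non-overlapping labelings every pair count is nonnegative, so Rand and Jaccard necessarily land in [0, 1]:
labels = [1 1 2 2 3 3];                    % toy ground-truth classes
clus   = [1 1 1 2 2 2];                    % toy k-means cluster assignments
n   = numel(labels);
C   = accumarray([labels(:) clus(:)], 1);  % contingency table
nC2 = @(x) x .* (x - 1) / 2;               % "n choose 2", elementwise
a = sum(nC2(C(:)));                        % pairs together in both partitions
b = sum(nC2(sum(C, 2))) - a;               % together in the labels only
c = sum(nC2(sum(C, 1))) - a;               % together in the clustering only
d = nC2(n) - a - b - c;                    % separated in both
rand_index = (a + d) / nC2(n)              % in [0, 1] for crisp, complete inputs
jaccard    = a / (a + b + c)               % likewise in [0, 1]
If labels are incomplete or clusters overlap, the counts no longer partition the n*(n-1)/2 pairs consistently, which is how out-of-range values can appear.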
How did you label your data?
We try our best to handle corner cases correctly, too; but we can only diagnose and fix what we have observed and can reproduce.
I am going to build a k-means clustering model for outlier detection. For that, I need to identify the best number of clusters to select.
For now, I have tried to do this using the elbow method. I plotted the sum of squared errors vs. the number of clusters (k), but I got a graph like the one below, which makes it confusing to identify the elbow point.
I need to know why I get a graph like this and how I can identify the optimal number of clusters.
K-means is not suitable for outlier detection. This keeps popping up here all the time.
K-means is conceptualized for "pure" data, with no false points. All measurements are supposed to come from the clusters and to vary only by some Gaussian measurement error. Occasionally this may yield some more extreme values, but even these are real measurements, from the real clusters, and should be explained, not removed.
K-means itself is known not to work well on noisy data where data points do not belong to the clusters:
It tends to split large real clusters in two, and then points right in the middle of the real cluster will have a large distance to the k-means centers.
It tends to put outliers into their own clusters (because that reduces SSQ), and then the actual outliers will have a small distance, even 0.
Instead, use an actual outlier detection algorithm such as Local Outlier Factor, kNN outlier detection, or LoOP, which were conceptualized with noisy data in mind.
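As a hedged illustration (not any particular library's implementation; it assumes MATLAB's Statistics and Machine Learning Toolbox and an n-by-d data matrix X), a simple kNN-distance outlier score can be computed like this:
k = 10;                                    % number of neighbours, to be tuned
[~, D] = knnsearch(X, X, 'K', k + 1);      % +1 because each point finds itself first
score  = D(:, end);                        % distance to the k-th nearest neighbour
outliers = score > quantile(score, 0.99);  % e.g. flag the top 1% as outliers
Unlike k-means, this scores each point directly by how isolated it is, rather than by distance to a center that the outliers themselves may have pulled or created.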
Remember that the Elbow Method doesn't just 'give' the best value of k, since the best value of k is up to interpretation.
The theory behind the Elbow Method is that we want to simultaneously minimize some error function (e.g., the sum of squared errors) while also picking a low value of k.
The Elbow Method thus suggests that a good value of k lies at a point on the plot that resembles an elbow: the error is small, but no longer decreases drastically as k increases.
In your plot you could argue that both k=3 and k=6 resemble elbows. By picking k=3 you'd have picked a small k, and we see that k=4 and k=5 don't do much better at minimizing the error. The same goes for k=6.
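A minimal sketch of how such an elbow plot is typically produced in MATLAB (Statistics and Machine Learning Toolbox assumed; X stands in for your n-by-d data):
ks  = 1:10;
sse = zeros(size(ks));
for i = 1:numel(ks)
    [~, ~, sumd] = kmeans(X, ks(i), 'Replicates', 5);  % sumd: within-cluster sums of squares
    sse(i) = sum(sumd);                                % total sum of squared errors for this k
end
plot(ks, sse, '-o'); xlabel('k'); ylabel('SSE');       % look for the bend
The 'Replicates' option reruns k-means from several seedings per k, which smooths out the plot and makes the elbow (if any) easier to judge.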
I am generating some data whose plots are shown below.
In all the plots I get some outliers at the beginning and at the end. Currently I am truncating the first and the last 10 values. Is there a better way to handle this?
I am basically trying to automatically identify the two points shown below.
This is a fairly general problem with lots of approaches; usually you will use some a priori knowledge of the underlying system to make it tractable.
So for instance if you expect to see the pattern above - a fast drop, a linear section (up or down) and a fast rise - you could try taking the derivative of the curve and looking for large values and/or sign reversals. Perhaps it would help to bin the data first.
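As a rough sketch of that derivative idea (y is your 1-D signal; the smoothing window and the 3-sigma threshold are illustrative, and it assumes the large-slope samples form exactly two runs, one at each end of the record):
dy  = diff(movmean(y, 5));        % smoothed first difference (slope estimate)
tol = 3 * std(dy);                % example threshold on the slope magnitude
idx = find(abs(dy) > tol);        % samples belonging to the fast drop or fast rise
gap = find(diff(idx) > 1, 1);     % position in idx where the first steep run ends
startPt = idx(gap);               % approximate end of the initial fast drop
endPt   = idx(gap + 1);           % approximate start of the final fast rise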
If your pattern is not so easy to define but you are expecting a linear trend you might fit the data to an appropriate class of curve using fit and then detect outliers as those whose error from the fit exceeds a given threshold.
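Along the same lines, a hedged sketch of the fit-and-threshold approach using a plain linear fit (polyfit stands in here for the Curve Fitting Toolbox's fit; x and y are your data, and the 3-sigma cut-off is only an example):
p        = polyfit(x, y, 1);         % fit a straight line to the overall trend
res      = y - polyval(p, x);        % residuals from the fitted line
outliers = abs(res) > 3 * std(res);  % flag points far from the trend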
In either case you still have to choose thresholds - mean, variance and higher order moments can help here but you would probably have to analyse existing data (your training set) to determine the values empirically.
And perhaps, after all that, as Shai points out, you may find that lopping off the first and last ten points gives the best results for the time you spent (cf. Pareto principle).
In MATLAB you can call the eig function with the 'nobalance' option. What exactly does it do differently from the default?
From mathworks documentation:
Balance option, specified as one of two strings: 'balance', which enables a preliminary balancing step, or 'nobalance', which disables it. In most cases, the balancing step improves the conditioning of A to produce more accurate results. However, there are cases in which balancing produces incorrect results. Specify 'nobalance' when A contains values whose scale differs dramatically. For example, if A contains nonzero integers, as well as very small (near zero) values, then the balancing step might scale the small values to make them as significant as the integers and produce inaccurate results.
EDIT: A related function balance is said to be the default preceding step in eig.
Note a few lines in the documentation - "The ill conditioning is concentrated in the scaling matrix" .... "If a matrix contains small elements that are due to roundoff error, balancing might scale them up to make them as significant as the other elements of the original matrix."
So, my answer to @Isopycnal's question is "nobalance suppresses amplification of round-off errors when dealing with ill-conditioned matrices". Here are a few points that may help -
"balancing" a matrix A essentially means performing a similarity transformation B = T\A*T, where B is called a "balanced matrix".
by balancing a well-conditioned matrix (which means it has reasonable scale), the "asymmetry" is concentrated into the scaling matrix T. According to the documentation of eig, "In most cases, the balancing step improves the conditioning of A to produce more accurate results."
however, balancing an ill-conditioned matrix (here meaning one whose values differ dramatically in scale) will scale up the round-off errors, because MATLAB tries to make the small values (such as 1e-9) as significant as the large ones (say 1e10). Even without careful analysis it is clear that the result will be less precise.
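To make this concrete, here is a small illustrative experiment (the matrix is made up, not taken from the documentation): compare the balanced and unbalanced computations on a matrix that mixes large entries with near-roundoff noise.
A = [ 1      100    1e4;
      1e-9   2      1e3;
      0      1e-9   3  ];
[T, B] = balance(A);       % similarity transformation: B = T\A*T, the balanced matrix
eig(A)                     % default: the balancing step is applied first
eig(A, 'nobalance')        % balancing skipped: the tiny entries are not scaled up
Inspecting T indicates how the rows and columns get rescaled; when that scaling pulls round-off-sized entries up to the level of the real data, the default call can come out less accurate, which is the effect described above.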
I know it has something to do with the matrix decomposition algorithms which MATLAB picks when performing eig, e.g. "pencil decomposition, LU factorization, etc.", as @EJG89 has pointed out. But it's too deeply buried in my memory to recall :( Anyone who knows how MATLAB performs commands like eig, please consider expanding this answer! Thanks!
Just for completeness, the balancing method is along the lines of LAPACK's ?GEBAL and ?GEBAK routines but some testing suggests that there are some modifications as the results differ occasionally.
The balancing helps to improve the conditioning via similarity transformations. However, in some cases balancing actually makes the problem worse. The documented cases include Hessenberg matrices and matrices with numerical noise that is amplified by the scaling which the algorithm tries to balance with the actual data. Depending on the problem the data matrix is also permuted to bring the matrix to upper triangular form as much as possible.
The balancing algorithm can also be used via balance.m
Other relevant balancing routines deep in the toolboxes are the mscale.m and arescale.m routines from the Control System Toolbox, which offer more refined control (excuse the pun).
I am implementing stereo matching and as preprocessing I am trying to rectify images without camera calibration.
I am using the SURF detector to detect and match features in the images and try to align them. After I find all matches, I remove all that do not lie on the epipolar lines, using this function:
[fMatrix, epipolarInliers, status] = estimateFundamentalMatrix(...
matchedPoints1, matchedPoints2, 'Method', 'RANSAC', ...
'NumTrials', 10000, 'DistanceThreshold', 0.1, 'Confidence', 99.99);
inlierPoints1 = matchedPoints1(epipolarInliers, :);
inlierPoints2 = matchedPoints2(epipolarInliers, :);
figure; showMatchedFeatures(I1, I2, inlierPoints1, inlierPoints2);
legend('Inlier points in I1', 'Inlier points in I2');
The problem is that if I run this function on the same data, I still get different results, causing differences in the resulting disparity map in each run on the same data.
Putatively matched points are still the same, but the inlier points differ in each run.
Here you can see that some matches are different in result:
UPDATE: I thought the differences were caused by the RANSAC method, but using LMedS or MSAC I still get different results on the same data.
EDIT: Admittedly, this is only a partial answer, since I am only explaining why this is even possible with these fitting methods, and not how to improve the input keypoints to avoid this problem from the start. There are problems with the distribution of your keypoint matches, as noted in the other answers, and there are ways to address that at the stage of keypoint detection. But the reason the same input can yield different results for repeated executions of estimateFundamentalMatrix with the same pairs of keypoints is the following. (Again, this does not provide sound advice for improving the keypoints so as to solve this problem.)
The reason for different results on repeated executions is related to the RANSAC method (and LMedS and MSAC). They all utilize stochastic (random) sampling and are thus non-deterministic. All methods except Norm8Point operate by randomly sampling 8 pairs of points at a time for (up to) NumTrials trials.
But first, note that the different results you get for the same inputs are not equally suitable (they will not have the same residuals), but the search can easily end up in any such minimum because the optimization algorithms are not deterministic. As the other answers rightly suggest, improve your keypoints and this won't be a problem, but here is why the robust fitting methods can behave this way and some ways to modify their behavior.
Notice the documentation for the 'NumTrials' option (ADDED NOTE: changing this is not the solution, but this does explain the behavior):
'NumTrials' — Number of random trials for finding the outliers
500 (default) | integer
Number of random trials for finding the outliers, specified as the comma-separated pair consisting of 'NumTrials' and an integer value. This parameter applies when you set the Method parameter to LMedS, RANSAC, MSAC, or LTS.
MSAC (M-estimator SAmple Consensus) is a modified RANSAC (RANdom SAmple Consensus). Deterministic algorithms for LMedS have exponential complexity and thus stochastic sampling is practically required.
Before you decide to use Norm8Point (again, not the solution), keep in mind that this method assumes NO outliers and is thus not robust to erroneous matches. Try using more trials to stabilize the other methods (EDIT: I mean, rather than switching to Norm8Point; but if you are able to back up in your algorithms, then address the inputs -- the keypoints -- as a first line of attack). Also, to reset the random number generator, you could call rng('default') before each call to estimateFundamentalMatrix. But again, note that while this will force the same answer on each run, improving your keypoint distribution is the better solution in general.
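A minimal sketch of that last suggestion, reusing the call from the question (rng('default') resets the generator; any fixed seed such as rng(42) works the same way):
rng('default');            % reset the random number generator before the stochastic fit
[fMatrix, epipolarInliers, status] = estimateFundamentalMatrix( ...
    matchedPoints1, matchedPoints2, 'Method', 'RANSAC', ...
    'NumTrials', 10000, 'DistanceThreshold', 0.1, 'Confidence', 99.99);
This forces identical inliers on every run, but it only hides the run-to-run variance; better-distributed keypoints remain the real fix.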
I know it's too late for your answer, but I guess it will be useful for someone in the future. Actually, the problem in your case is twofold:
Degenerate location of features, i.e., the features are mostly localized (on you :P) and not well spread throughout the image.
These matches are sort of on the same plane. I know you would argue that your body is not planar, but comparing it to the depth of the room, it sort of is.
Mathematically, this means you are essentially extracting E (or F) from a planar surface, which always has infinitely many solutions. To sort this out, I would suggest using some constraint on the distance between any two extracted SURF features, i.e., any two SURF features used for matching should be at least 40 or 100 pixels apart (depending on the resolution of your image); a sketch of such a filter follows.
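A hypothetical sketch of such a distance constraint (variable names mirror the question; the greedy filter and the 40-pixel spacing are only illustrations, and it assumes matchedPoints1/2 are point objects as in the code above):
minDist = 40;                                   % minimum spacing in pixels, tune to your resolution
pts     = matchedPoints1.Location;              % N-by-2 [x y] keypoint locations
keep    = true(size(pts, 1), 1);
for i = 2:size(pts, 1)
    prev = find(keep(1:i-1));                             % matches kept so far
    d = sqrt(sum((pts(prev, :) - pts(i, :)).^2, 2));      % distances to them (implicit expansion)
    if any(d < minDist)
        keep(i) = false;                                  % too close to a kept match: drop it
    end
end
filtered1 = matchedPoints1(keep, :);
filtered2 = matchedPoints2(keep, :);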
Another way to get better SURF features is to set 'NumOctaves' in detectSURFFeatures(rgb2gray(I1),'NumOctaves',5); to larger values.
I am facing the same problem and this has helped (a little bit).
I have discrete empirical data which forms a histogram with gaps, i.e. no observations were made of certain values. However, in reality those values may well occur.
This is a figure of the scatter graph.
So my question is: SHOULD I interpolate between x-axis values to make bins for the histogram? If so, what would you suggest as best practice?
Regards,
Don't do it.
With that many sample points, the probability (p-value) of getting empty bins if the distribution is smooth is quite low. There's some underlying reason they're empty, which you may want to investigate. I can think of two possibilities:
Your data actually is discrete (perhaps someone rounded off to 1 significant figure during data collection, or quantization error was significant in an ADC) and then unit conversion caused irregular gaps. Even conversion from .12 and .13 to 12, 13 as shown could cause this issue, if .12 is actually represented as .11111111198 inside the computer. But this would tend to double up in a neighboring bin and the gaps would tend to be regularly spaced, so I doubt this is the cause. (For example, if 128 trials of a Bernoulli coin-flip experiment were done for each data point, and someone recorded the percentage of heads in each series to the nearest 1%, you could multiply the percentage by 1.28 to try to recover the actual number of heads, but there'd be 28 empty bins; a quick numerical check of this follows after these points.)
Your distribution has real lobes. Because the frequency is significantly reduced following each empty bin, I favor this explanation.
But these are just starting suggestions for your own investigation.
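For what it's worth, a quick MATLAB check of the arithmetic in the first possibility above (purely illustrative):
heads     = 0:128;                        % possible head counts in 128 trials
recorded  = round(100 * heads / 128);     % what gets written down: nearest 1%
recovered = round(1.28 * recorded);       % try to map the percentages back to counts
numel(setdiff(heads, recovered))          % counts that can never be recovered: 28 empty bins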