Automatically truncating a curve to discard outliers in MATLAB

I am generating some data whose plots are shown below.
In all the plots I get some outliers at the beginning and at the end. Currently I am truncating the first and the last 10 values. Is there a better way to handle this?
I am basically trying to automatically identify the two points shown below.

This is a fairly general problem with lots of approaches; usually you will use some a priori knowledge of the underlying system to make it tractable.
So for instance if you expect to see the pattern above - a fast drop, a linear section (up or down) and a fast rise - you could try taking the derivative of the curve and looking for large values and/or sign reversals. Perhaps it would help to bin the data first.
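A minimal sketch of the derivative idea, assuming a single fast drop at the start and a single fast rise at the end (the threshold factor and the use of the median are assumptions you would tune on your own data):
% Sketch: find the steep sections at the ends via the first derivative
% and keep only the data between them. The factor 3 is an assumption.
dy  = diff(y);
thr = 3 * median(abs(dy));                      % robust scale for a "large" slope
big = find(abs(dy) > thr);                      % indices with a steep slope
n   = numel(y);
iStart = big(find(big < n/2, 1, 'last')) + 1;   % just after the initial drop
iEnd   = big(find(big > n/2, 1, 'first'));      % just before the final rise
yTrim  = y(iStart:iEnd);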
If your pattern is not so easy to define but you are expecting a linear trend you might fit the data to an appropriate class of curve using fit and then detect outliers as those whose error from the fit exceeds a given threshold.
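A minimal sketch of the fit-and-threshold idea, using polyfit for a linear trend rather than the Curve Fitting Toolbox fit function mentioned above (the linear model and the 3-sigma cutoff are assumptions):
% Sketch: fit the expected linear trend and drop points that deviate too much
x = (1:numel(y)).';
p = polyfit(x, y(:), 1);             % linear trend
res = y(:) - polyval(p, x);          % residuals from the fit
keep = abs(res) < 3 * std(res);      % 3-sigma cutoff (tune on your training set)
yClean = y(keep);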
In either case you still have to choose thresholds - mean, variance and higher order moments can help here but you would probably have to analyse existing data (your training set) to determine the values empirically.
And perhaps, after all that, as Shai points out, you may find that lopping off the first and last ten points gives the best results for the time you spent (cf. Pareto principle).

Related

Predictive curve fitting matlab

I have a question about curve fitting. I have many curves like the one in the picture.
X axis : time
Y axis : temperature
Each sample comes out every 30s.
GOAL : predict the value at the end of the transient
What would you do in this situation?
What I am doing is this :
For every new sample I start a new fit (so each fit is independent of the previous one) and check the value of the fitted curve 2 hours after the start of the measurement (all the curves I have settle before 2 h). If, for a number of subsequent fits (let's say 5), that future value stays more or less the same (±0.2 °C), I assume that the estimate is the right one.
This approach seems far too simple to me, and I think I am not exploiting all the information. For example, the error I make at each point in time (e.g. at minute 4:00 I make a prediction, and at 4:30 I can see how wrong it was).
In the picture the red part of the curve is excluded from the fit (but the real data in the future passes through it); the estimate is the blue curve. You can see that in this case I don't have a good prediction. In general I also have flatter curves.
Based on the comments above, I tried to formulate an answer, as no one else is giving any input.
I think you are using a good basic procedure. Better results may be obtained by using a more appropriate fitting curve, one which includes all the dominant dynamics but avoids overfitting the data. Based on your figure, the simplest form I could think of is:
s + a*(1 - e^(-t/tau))
with parameters s (the initial temperature), a (the amplitude, i.e. the steady-state value minus the initial temperature) and tau (the dominant time constant). As you mentioned yourself, limiting the allowed range of the parameters may avoid overfitting and increase the quality of your estimation.
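A minimal sketch of fitting this model with the Curve Fitting Toolbox; the variable names t and T (time and temperature vectors) and the starting values are assumptions:
% Sketch: fit s + a*(1 - exp(-t/tau)) and extrapolate to 2 hours (7200 s)
model = fittype('s + a*(1 - exp(-x/tau))', 'independent', 'x');
opts  = fitoptions(model);
opts.StartPoint = [T(end) - T(1), T(1), 600];   % rough guesses for a, s, tau
opts.Lower      = [0, -Inf, 0];                 % keep amplitude and tau physical
f = fit(t(:), T(:), model, opts);
Tfinal = f(7200);                               % predicted value at the end of the transient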
Using an arbitrary high-order function, like you are using now, may give good interpolation results, but is dangerous to use for extrapolation, because strange effects may occur outside the fitting region.
Alternatives
Using the error (e.g. correcting for the extrapolated error) may be possible, but is tricky and may not always give good results.
Training a neural network to perform the estimation is probably overkill, but may give better results if applied correctly. Note that you need a lot of training data, which should be representative of the data you will use the neural network on later.

plotting histogram in matlab with highly unequal distribution

So I am plotting two years' worth of data to see the change in the distribution of values, but in one of the histograms one of the years is heavily dominated by the first bin, because there are many zeros in the data set. Would you recommend creating a bar strictly for the zeros? How would I create this in MATLAB? Or how can I better manipulate the histogram so that it reflects the actual data set and makes it clear that the zeros account for the sharp initial rise?
Thanks.
This is more of a statistical question. If you have good reason to ignore the zeros, for example because your data acquisition system produced them due to a malfunction, you can simply get rid of them with
hist(data(data~=0))
But then you might not even need to look at the histograms: you can use the variance, or even the standard deviation, to see how much your data has shifted.
Furthermore, to compare data populations, boxplots are much better and easier to handle:
doc boxplot
If, on the other hand, your zeros are genuine to your data, you have to keep them! I am sorry, but also here the boxplot function might help you, because the zeros may show up as outliers (little red crosses) or the box will simply start at the zero line.
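For a side-by-side comparison of the two years, a minimal boxplot sketch (the variable names year1 and year2 are assumptions):
% Sketch: compare the two distributions with grouped boxplots instead of histograms
group = [ones(numel(year1), 1); 2 * ones(numel(year2), 1)];
boxplot([year1(:); year2(:)], group, 'Labels', {'Year 1', 'Year 2'});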

Different results for Fundamental Matrix in Matlab

I am implementing stereo matching and, as preprocessing, I am trying to rectify the images without camera calibration.
I am using the SURF detector to detect and match features in the images and try to align them. After I find all the matches, I remove all those that don't lie on the epipolar lines, using this function:
[fMatrix, epipolarInliers, status] = estimateFundamentalMatrix(...
matchedPoints1, matchedPoints2, 'Method', 'RANSAC', ...
'NumTrials', 10000, 'DistanceThreshold', 0.1, 'Confidence', 99.99);
inlierPoints1 = matchedPoints1(epipolarInliers, :);
inlierPoints2 = matchedPoints2(epipolarInliers, :);
figure; showMatchedFeatures(I1, I2, inlierPoints1, inlierPoints2);
legend('Inlier points in I1', 'Inlier points in I2');
The problem is that if I run this function on the same data, I still get different results, causing differences in the resulting disparity map on each run over the same data.
The putatively matched points are still the same, but the inlier points differ in each run.
Here you can see that some matches are different in the result:
UPDATE: I thought that the differences were caused by the RANSAC method, but using LMedS or MSAC I still get different results on the same data.
EDIT: Admittedly, this is only a partial answer, since I am only explaining why this is even possible with these fitting methods, not how to improve the input keypoints to avoid the problem from the start. There are problems with the distribution of your keypoint matches, as noted in the other answers, and there are ways to address that at the stage of keypoint detection. But the reason the same input can yield different results for repeated executions of estimateFundamentalMatrix with the same pairs of keypoints is the following.
The reason for different results on repeated executions is related to the RANSAC method (and LMedS and MSAC). They all utilize stochastic (random) sampling and are thus non-deterministic. All methods except Norm8Point operate by randomly sampling 8 pairs of points at a time, for (up to) NumTrials trials.
But first, note that the different results you get for the same inputs are not all equally suitable (they will not have the same residuals); the search can easily land in any such minimum because the optimization algorithms are not deterministic. As the other answers rightly suggest, improve your keypoints and this won't be a problem, but here is why the robust fitting methods can behave this way and some ways to modify their behavior.
Notice the documentation for the 'NumTrials' option (ADDED NOTE: changing this is not the solution, but this does explain the behavior):
'NumTrials' — Number of random trials for finding the outliers
500 (default) | integer
Number of random trials for finding the outliers, specified as the comma-separated pair consisting of 'NumTrials' and an integer value. This parameter applies when you set the Method parameter to LMedS, RANSAC, MSAC, or LTS.
MSAC (M-estimator SAmple Consensus) is a modified RANSAC (RANdom SAmple Consensus). Deterministic algorithms for LMedS have exponential complexity and thus stochastic sampling is practically required.
Before you decide to use Norm8Point (again, not the solution), keep in mind that this method assumes NO outliers and is thus not robust to erroneous matches. Try using more trials to stabilize the other methods (EDIT: I mean rather than switching to Norm8Point; but if you are able to back up in your algorithm, then address the inputs -- the keypoints -- as a first line of attack). Also, to reset the random number generator, you could do rng('default') before each call to estimateFundamentalMatrix. But again, note that while this will force the same answer on each run, improving your keypoint distribution is the better solution in general.
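A minimal sketch of the reproducibility workaround mentioned above, reusing the call from the question:
% Sketch: reset the random number generator so the stochastic sampling
% draws the same samples on every run (this does not improve the keypoints)
rng('default');
[fMatrix, epipolarInliers, status] = estimateFundamentalMatrix( ...
    matchedPoints1, matchedPoints2, 'Method', 'RANSAC', ...
    'NumTrials', 10000, 'DistanceThreshold', 0.1, 'Confidence', 99.99);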
I know it's too late for your answer, but I guess it will be useful for someone in the future. Actually, the problem in your case is twofold:
Degenerate location of features, i.e. the features are mostly localized (on you :P) and not well spread throughout the image.
These matches are more or less on the same plane. I know you would argue that your body is not planar, but compared to the depth of the room, it sort of is.
Mathematically, this means you are essentially extracting E (or F) from a planar surface, which always has infinitely many solutions. To sort this out, I would suggest putting some constraint on the distance between any two extracted SURF features, i.e. any two SURF features used for matching should be at least 40 or 100 pixels apart (depending on the resolution of your image).
Another way to get better SURF features is to set 'NumOctaves' in detectSURFFeatures(rgb2gray(I1),'NumOctaves',5); to larger values.
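A minimal sketch of both suggestions; the 40-pixel spacing and the greedy filtering loop are illustrative assumptions, not a built-in function:
% Sketch: detect SURF features with more octaves, then greedily drop
% keypoints that are closer than minDist pixels to an already-kept one
minDist = 40;
pts1 = detectSURFFeatures(rgb2gray(I1), 'NumOctaves', 5);
loc  = pts1.Location;
keep = true(size(loc, 1), 1);
for i = 2:size(loc, 1)
    d = sqrt(sum(bsxfun(@minus, loc(1:i-1, :), loc(i, :)).^2, 2));
    if any(keep(1:i-1) & d < minDist)
        keep(i) = false;
    end
end
pts1 = pts1(keep);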
I am facing the same problem and this has helped (a little bit).

Resampling data with minimal loss of information in time-domain

I am trying to resample/recreate already-recorded data for plotting purposes. I thought this is the best place to ask the question (besides dsp.se).
The data is sampled at a high frequency, contains too many data points and is not suitable for plotting in the time domain (not enough memory). I want to downsample it with minimal loss. The sampling interval of the resulting data doesn't need to be uniform (again, this is for plotting purposes, not analysis), although the input data is equally sampled.
When we use the regular resample command from MATLAB/Octave, it can distort stiff pieces of the curve.
What is the best approach here?
For reference, I put here two pictures (found on tex.se):
The first image is a regular resample.
The second image is better-resampled data that behaves well around the peaks.
You should try this set of files from the File Exchange. It computes an optimal lookup table based on either a maximum number of points or a given error. You can choose natural, linear, or spline interpolation. Spline will have the smallest table size but is slower than linear. I don't use natural unless I have a really good reason.
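If the File Exchange submission is not an option, a common alternative for plotting very long records is min/max decimation per bin, which keeps peaks that a plain resample tends to smear. A minimal sketch (the bin size nBin and the vector names t and y are assumptions):
% Sketch: keep the minimum and maximum of each bin so peaks survive downsampling
nBin = 1000;                                    % samples per bin
n = floor(numel(y) / nBin) * nBin;              % drop the ragged tail
Y = reshape(y(1:n), nBin, []);
T = reshape(t(1:n), nBin, []);
[yMin, iMin] = min(Y);
[yMax, iMax] = max(Y);
cols = 1:size(T, 2);
tMin = T(sub2ind(size(T), iMin, cols));
tMax = T(sub2ind(size(T), iMax, cols));
[tPlot, order] = sort([tMin, tMax]);            % interleave in time order
yPlot = [yMin, yMax];
plot(tPlot, yPlot(order));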

Process for comparing two datasets

I have two datasets at a time (in the form of vectors) and I plot them on the same axes to see how they relate to each other, specifically noting and looking for places where both graphs have a similar shape (i.e. places where both have a seemingly positive/negative gradient over approximately the same intervals). Example:
So far I have been working through the data graphically, but I realize that since the amount of data is so large, plotting it every time I want to check how two sets correlate will take far too much time.
Are there any ideas, scripts or functions that might be useful in order to automate this process somewhat?
The first thing you have to think about is the nature of the criteria you want to apply to establish the similarity. There is a wide variety of ways to measure similarity, and the more precisely you can describe what "similar" means in your problem, the easier it will be to implement, regardless of the programming language.
Having said that, here are some of the things you could look at:
correlation of the two datasets (see the sketch at the end of this answer)
difference of the derivatives of the datasets (but I don't think it would be robust enough)
spectral analysis, as mentioned by @thron of three
etc. ...
Knowing the origin of the datasets and their variability can also help a lot in formulating robust enough algorithms.
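A minimal sketch of the correlation idea from the list above (the window length w and the 0.8 cutoff are assumptions):
% Sketch: sliding-window correlation highlights intervals where A and B move together
w = 50;                                     % window length in samples
n = numel(A) - w + 1;
c = zeros(1, n);
for k = 1:n
    r = corrcoef(A(k:k+w-1), B(k:k+w-1));   % 2x2 correlation matrix
    c(k) = r(1, 2);
end
similar = find(c > 0.8);                    % windows where the shapes are strongly similar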
Sure. Call your two vectors A and B.
1) (Optional) Smooth your data, either with a simple averaging filter (MATLAB 'smooth') or the 'filter' command. This will get rid of local changes in velocity ("gradient") that appear to be essentially noise (as in the ascending component of the red trace).
2) Differentiate both A and B. Now you are directly representing the velocity of each vector (Matlab 'diff').
3) Add the two differentiated vectors together (element-wise). Call this C.
4) Look for all points in C whose absolute value is above a certain threshold (you'll have to eyeball the data to get a good idea of what this should be). Points above this threshold indicate highly similar velocity.
5) Now look for where a high positive value in C is followed by a high negative value, or vice versa. In between these two points you will have similar curves in A and B (a sketch of the whole procedure follows the notes below).
Notes: a) You could do the smoothing after step 3 rather than after step 1. b) Regarding 5), you could have a situation in which a 'hill' in your data is at the edge of the vector and so is 'cut in half', and the vectors descend to baseline before ascending into the next hill. Then 5) would misidentify the hill as coming between the initial descent and the subsequent ascent. To avoid this, you could also require that the points in A and B between the two points of velocity similarity have high absolute values.
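A minimal sketch of steps 1)-5), with the smoothing span, the threshold choice and the sign-change search as assumptions:
% Sketch of the procedure above: smooth, differentiate, sum, threshold,
% then look for sign changes among the above-threshold points
As = smooth(A, 11);   Bs = smooth(B, 11);    % 1) optional smoothing
dA = diff(As);        dB = diff(Bs);         % 2) differentiate both vectors
C  = dA + dB;                                % 3) element-wise sum of the velocities
thr = 2 * std(C);                            % 4) threshold (eyeball/tune this)
idx = find(abs(C) > thr);                    %    points of highly similar velocity
flips = idx(diff(sign(C(idx))) ~= 0);        % 5) high + followed by high - (or vice versa)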