I have posted a question about an Algorithm to make a polynomial fit of a part of a data set some time ago and received some propositions to do what I wanted. But I face another problem now I try to apply the ideas suggested in the answers.
My goal was to find the best linear fit of a data set, in which only a part of it was linear.
Here is an example of what I must do :
We have these two data sets, and I must make a linear trend of the linear part of the data that is at the left of the dashed line. In red, we have the ideal data set, that has a linear part from the beginning until the dashed line. In blue, we have the 'problematic' data set, that has a plateau. The bold part is the part that I have to use to do the linear fit of the data.
My problem is that I tried to do as mentionned in the question linked above : I found the second order derivative of the smoothed data and looked when it was not 'close enough' of 0. But here are my results for the problematic data set (first image) and for the ideal data set (second image) :
(Sorry for quality, I don't know why it is so blurred)
On both images, I plotted the first order derivative and in red, the second order derivative. On the first image, we see peaks of second derivative values. But the problem is that the peaks are not very 'high', making it difficult to establish a threshold that would tell if the set is linear or not... On the contrary, the peak of the first derivative is quite high, making it easy to see visually.
I thought that calculate the mean of the values of the first derivative and look when the value differ too much from the mean value would be enough... But when I take the mean of the values of the first derivative in order to see where the values differ from the mean value, there is a sort of offset due to the peak.
How to remove this offset in order to take only the mean value of the data at the right (the data at the left of the discontinuity that is seen on Image 1 could be non linear or be linear but have a different value from the values at the right!) of the peak efficiently ?
The mean operator (as you have noticed) is very sensitive to outliers (peaks). You may wish to use more robust estimators, such as the median or the x-percentile of the values (which should be more appropriate for your case).
Related
I have a question about curve fitting, I have many curves like the one in the picture.
X axis : time
Y axis : temperature
Each sample comes out every 30s.
GOAL : predict the value at the end of the transient
What would you do in this situation?
What I am doing is this :
for every new sample I start a new fitting (and so each fitting is independent from the previous one) and check the value of the fitted curve 2 hours (all curves I have set before 2h) after the start of the measurement. If for a number (let's say 5) of subsequent fitting the value in the future stays more or less the same(+-0.2°C) I so assume that the estimation is the right one.
This approach seems to me far too simple and I think I am not exploiting all information. For example the info of the error I am making punctually (e.g. at minute 4:00 I predict and at 4:30 I see that I am doing an error).
In the picture the red part of the curve is excluded (but the real data in the future passes through it). the estimation is the blue one. You see in this case I don't have a good prediction... In general I have also more flat curves.
Based on the comments above, I tried to formulate an answer as no one else is giving some input.
I think your are using a good basic procedure. Better results may be obtained by using a more appropriate fitting curve, which includes all the dominant dynamics, but avoids overfitting of the data. Based on your figure, the simplest form I could think of is:
s + a(1-e^(-t/tau))
with parameters s (the initial temperature), a (amplitude = steady state value) and tau (dominant time constant). As you mentioned yourself, limiting the allowed range for the parameters may avoid overfitting and increase the quality of your estimation.
Using a random high order function, like you are using now, may give good interpolation results, but are dangerous to use for extrapolation, because strange effects may occur outside the fitting region.
Alternatives
Using the error (eg. correcting for the extrapolated error) may be possible, but is tricky and may not always give good results.
Training a neural network to perform the estimation is probably overkill, but may give better results if applied correctly. Note that you need a lot of training data which should be representative for the data for which you will use the neural network later on.
I am generation some data whose plots are as shown below
In all the plots i get some outliers at the beginning and at the end. Currently i am truncating the first and the last 10 values. Is there a better way to handle this?
I am basically trying to automatically identify the two points shown below.
This is a fairly general problem with lots of approaches, usually you will use some a priori knowledge of the underlying system to make it tractable.
So for instance if you expect to see the pattern above - a fast drop, a linear section (up or down) and a fast rise - you could try taking the derivative of the curve and looking for large values and/or sign reversals. Perhaps it would help to bin the data first.
If your pattern is not so easy to define but you are expecting a linear trend you might fit the data to an appropriate class of curve using fit and then detect outliers as those whose error from the fit exceeds a given threshold.
In either case you still have to choose thresholds - mean, variance and higher order moments can help here but you would probably have to analyse existing data (your training set) to determine the values empirically.
And perhaps, after all that, as Shai points out, you may find that lopping off the first and last ten points gives the best results for the time you spent (cf. Pareto principle).
This is a follow-up question on the one below:
Second moments question
MATLAB's regionprops function estimates an ellipse from a given set of 2d-points. This is done by using the image moments, they claim to use normalized second central moments, the formulas also follow what is suggested by the wikipedia link on image moments.
Effectively the covariance matrix of the region is calculated (in a slightly more efficient way) and then the square root of the eigenvalues of this matrix are calculated and put out as the major and minor axes - with one change: They are multiplied by a factor of 4.
Why?
Essentially, covariance estimation assumes a multivariate normal distribution. However, an arbitrary image region is most likely not normally distributed, I would rather expect a factor based on the assumption that data is uniformly distributed. So what is the justification for choosing 4?
Meanwhile I found the answer: Factor 4 yields correct results for regions with an elliptical shape. For e.g. rectangular or non-solid regions, the estimated axis lengths are incorrect, and the error varies nonlinearly with changes in the region.
I am registering multi-modal MRI slices that are 512x512 greyscale (each normalised to 0..1 range). The slices are of the same object but taken with different sequences and have very different intensities. I am currently finding the translation-only transformation between the two slices using imregister(moving,fixed,'translation',optimizer,metric) where optimizer and metric are from imregconfig('multimodal').
However, the transformation it finds (inspecting tform) is like '2.283' in the x and '-0.019' in the y, and actually I only wish for whole value translations i.e. '2' and '0' in this case.
How to modify imregister (or a similar function) to check only whole-pixel translations? This would save a lot of computation and it suits my needs better.
Without modifying imregister I assume the easiest solution to just round the x and y translations?
I'm not sure how imregister is implemented for the 'multimodal' case, but pure translation estimation for conventional image registration is done using image gradients and taylor apporximation and gives sub-pixel accuracy at the same cost as pixel-level accuracy.
So, in that case limiting yourself to pixel-wise translation does not seems to benefit you in any way.
If you do not want to bother with sib-pixel shifts, I suppose rounding would be the simplest approach.
I have two datasets at the time (in the form of vectors) and I plot them on the same axis to see how they relate with each other, and I specifically note and look for places where both graphs have a similar shape (i.e places where both have seemingly positive/negative gradient at approximately the same intervals). Example:
So far I have been working through the data graphically but realize that since the amount of the data is so large plotting each time I want to check how two sets correlate graphically it will take far too much time.
Are there any ideas, scripts or functions that might be useful in order to automize this process somewhat?
The first thing you have to think about is the nature of the criteria you want to apply to establish the similarity. There is a wide variety of ways to measure similarity and the more precisely you can describe what you want for "similar" to mean in your problem the easiest it will be to implement it regardless of the programming language.
Having said that, here is some of the thing you could look at :
correlation of the two datasets
difference of the derivative of the datasets (but I don't think it would be robust enough)
spectral analysis as mentionned by #thron of three
etc. ...
Knowing the origin of the datasets and their variability can also help a lot in formulating robust enough algorithms.
Sure. Call your two vectors A and B.
1) (Optional) Smooth your data either with a simple averaging filter (Matlab 'smooth'), or the 'filter' command. This will get rid of local changes in velocity ("gradient") that appear to be essentially noise (as in the ascending component of the red trace.
2) Differentiate both A and B. Now you are directly representing the velocity of each vector (Matlab 'diff').
3) Add the two differentiated vectors together (element-wise). Call this C.
4) Look for all points in C whose absolute value is above a certain threshold (you'll have to eyeball the data to get a good idea of what this should be). Points above this threshold indicate highly similar velocity.
5) Now look for where a high positive value in C is followed by a high negative value, or vice versa. In between these two points you will have similar curves in A and B.
Note: a) You could do the smoothing after step 3 rather than after step 1. b) Re 5), you could have a situation in which a 'hill' in your data is at the edge of the vector and so is 'cut in half', and the vectors descend to baseline before ascending in the next hill. Then 5) would misidentify the hill as coming between the initial descent and subsequent ascent. To avoid this, you could also require that the points in A and B in between the two points of velocity similarity have high absolute values.