Measuring curve “closeness” with unequal data ranges - classification

Provided that I have a similar example:
where the blue data is my calculated/measured data and my red data is the given groundtruth data. The task is to get the similarity/closeness between the data and each of the given curves so that a classification can be done, it could also be possible to choose multiple classes if the results seem to be very close.
I can divide the problem in my mind to several subproblems:
The data range is no the same
The resolution of the calculated/measured data is higher than the ground-truth data
The calculated data has some bias/shift
The following questions come to my mind when trying to solve those problems
Is it better to fit the calculated/measured data first then attempting to solve the problem?
Would it be fine to use the data points as is and calculate the mean squared error of each curve assuming it is a fitting attempt and thus getting the best fit? What would be the effect of the bias/shift in this case?
What is a good approach to dealing with the data/range mismatch, by decreasing the number of samples for the higher sampled version or increasing the number of samples for the lower sampled data in the given range?

Related

Linear regression returning bad fit with large x values

I'm looking to do a linear regression to determine the estimated date of depletion for a particular resource. I have a dataset containing a column of dates, and several columns of data, always decreasing. A linear regression using scikit learn's LinearRegression() function yields a bad fit.
I converted the date column to ordinal, which resulted in values ~700,000. Relative to the y axis of values between 0-200, this is rather large. I imagine that the regression function is starting at low values and working its way up, eventually giving up before it finds a good enough fit. If i could assign starting values to the parameters, large intercept and small slope, perhaps it would fix the problem. I don't know how to do this, and i am very curious as to other solutions.
Here is a link to some data-
https://pastebin.com/BKpeZGmN
And here is my current code
model=LinearRegression().fit(dates,y)
model.score(dates,y)
y_pred=model.predict(dates)
plt.scatter(dates,y)
plt.plot(dates,y_pred,color='red')
plt.show()
print(model.intercept_)
print(model.coef_)
This code plots the linear model over the data, yielding stunning inaccuracy. I would share in this post, but i am not sure how to post an image from my desktop.
My original data is dates, and i convert to ordinal in code i have not shared here. If there is an easier way to do this that would be more accurate, i would appreciate a suggestion.
Thanks,
Will

Comparison of set of signals

I have certain movement data acquired from motion capture system which I want to automatically choose which 5 signals are more alike.
Picture shows example of the particular data, all normalized to 100 samples due to the difference in speed.
Data set for knee flexion/extension
What I am looking for is some idea to actually compare the shapes of the curves.
The easiest solution is just to substract the "raw" curves and check which one has the smallest RMSE.
But looking at your data (which are smooth curves), another option is to use PLS or GMM to describe them. Then you can use RMSE to compute the error between your input curve and your database of curves and take the one with lowest error.

Do you have to normalize the data for a neural net if it is already scaled?

I'm currently trying to preprocess my training data ready for a multi-layered perceptron. The data I downloaded consists of 20,000 instances and 16 attributes, all of which are coordinate values of pixels as part of letter recognition. The data itself has already been scaled from its original form into values between 0 - 15 before being published.
However since it's already been scaled, is it still necessary to perform normalization on it? I've tried to read around and look at previous examples but have come up with conflicting points. In some papers, it has stated that scaling is a form of normalization, where as others have said that normalization would be bringing that values to a range of 0-1.
Since I'm using WEKA I've attempted their normalize filter during a pre-processing stage and it caused the accuracy to decrease by around 2% which makes me think it could be unnecessary. But again, I've read that it may only have a positive effect later in training.
So my question is:
What is the difference between scaling to a range such as 0 - 15 and normalizing it? Should I still normalize it on top of this scaling thats already done?
In your case you do not need to. Normalizing data is done so that an attribute with a different scale will not decide outcome of distance operations, ultimately decide clustering or classification results.
An example you have two attributes weight and income. Weight will be 10 and 200kg at most. While income can be 10,000$ and 20,000,000$. But most of the people's income will be 10,000 and 120,000, while above this values will be outliers. If you do not normalize your data before using Multi Layer Perceptron, outcome of your neural network will be decided by these outliers.
In your case this situation is already mitigated due to your scaling therefore you do not need normalizing.

Automatically truncating a curve to discard outliers in matlab

I am generation some data whose plots are as shown below
In all the plots i get some outliers at the beginning and at the end. Currently i am truncating the first and the last 10 values. Is there a better way to handle this?
I am basically trying to automatically identify the two points shown below.
This is a fairly general problem with lots of approaches, usually you will use some a priori knowledge of the underlying system to make it tractable.
So for instance if you expect to see the pattern above - a fast drop, a linear section (up or down) and a fast rise - you could try taking the derivative of the curve and looking for large values and/or sign reversals. Perhaps it would help to bin the data first.
If your pattern is not so easy to define but you are expecting a linear trend you might fit the data to an appropriate class of curve using fit and then detect outliers as those whose error from the fit exceeds a given threshold.
In either case you still have to choose thresholds - mean, variance and higher order moments can help here but you would probably have to analyse existing data (your training set) to determine the values empirically.
And perhaps, after all that, as Shai points out, you may find that lopping off the first and last ten points gives the best results for the time you spent (cf. Pareto principle).

Resampling data with minimal loss of information in time-domain

I am trying to resample/recreate already recorded data for plotting purposes. I thought this is best place to ask the question (besides dsp.se).
The data is sampled at high frequency, contains to much data points and not suitable for plotting in time domain (not enough memory). i want to sample it with minimal loss. The sampling interval of the resulting data doesn't need to be same (well it is again for plotting purposes, not analysis) although input data in equally sampled.
When we use the regular resample command from matlab/octave, it can distort stiff pieces of the curve.
What is the best approach here?
For reference I put two pictures found in tex.se)
First image is regular resample
Second image is a better resampled data that can well behave around peaks.
You should try this set of files from the File Exchange. It computes optimal lookup table based on either the maximum set of points or a given error. You can choose from natural, linear, or spline for the interpolation methods. Spline will have the smallest table size but is slower than linear. I don't use natural unless I have a really good reason.
Sincerely,
Jason