Linear regression returning bad fit with large x values

I'm looking to do a linear regression to determine the estimated date of depletion for a particular resource. I have a dataset containing a column of dates and several columns of data, always decreasing. A linear regression using scikit-learn's LinearRegression() yields a bad fit.
I converted the date column to ordinal, which resulted in values of roughly 700,000. Relative to the y-axis values between 0 and 200, this is rather large. I imagine that the regression function is starting at low values and working its way up, eventually giving up before it finds a good enough fit. If I could assign starting values to the parameters, a large intercept and a small slope, perhaps it would fix the problem. I don't know how to do this, and I am very curious about other solutions.
Here is a link to some data:
https://pastebin.com/BKpeZGmN
And here is my current code:
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# dates must be a 2-D array of shape (n_samples, 1); y is 1-D
model = LinearRegression().fit(dates, y)
print(model.score(dates, y))        # R^2 of the fit
y_pred = model.predict(dates)
plt.scatter(dates, y)
plt.plot(dates, y_pred, color='red')
plt.show()
print(model.intercept_)
print(model.coef_)
This code plots the linear model over the data, yielding stunning inaccuracy. I would share it in this post, but I am not sure how to post an image from my desktop.
My original data is dates, which I convert to ordinals in code I have not shared here. If there is an easier or more accurate way to do this, I would appreciate a suggestion.
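For what it's worth, here is an untested sketch of the kind of fix I have in mind: shift the ordinals so they start near zero before fitting, then solve the fitted line for the date at which the resource reaches zero. The names dates_ordinal and y are placeholders, not code from this post.

import numpy as np
from datetime import date
from sklearn.linear_model import LinearRegression

# dates_ordinal: 1-D array of ordinal dates (~700,000); y: levels between 0 and 200
x = dates_ordinal - dates_ordinal.min()              # days since the first observation
model = LinearRegression().fit(x.reshape(-1, 1), y)

# the fitted line y = intercept + slope*x reaches zero at x = -intercept / slope
x_zero = -model.intercept_ / model.coef_[0]
print(date.fromordinal(int(round(x_zero + dates_ordinal.min()))))   # estimated depletion date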
Thanks,
Will

Related

Is pointwise multiple linear regression possible in Matlab

I am attempting to run a pointwise multiple linear regression in Matlab, i.e., to obtain a regression coefficient for each point in my dataset.
I have three independent variables and one dependent variable. Each variable is a column vector with ~1.6 million records. Each data point represents a geographic location; my point in doing all this is to try and see the effects of the predictor variables on the response variable on a pixel-per-pixel basis.
I have already successfully run fitlm, regress, and mldivide; these functions get me the three regression coefficients for my data. However, I want to run a multiple regression through all my points independently, so that ultimately I will get three columns of regression coefficients of 1.6 million records each.
My data contains some NaN values. These rows cannot be ignored; the final column vectors must be the same size as the original vectors, since each data point's location corresponds to real-world coordinates.
I've looked into the code for bsxfun but don't believe it can help me. I also tried using dot notation, but that didn't work. My thinking now is to create a for loop and use mldivide one row at a time. However, when I tried using regress on scalars (mocking one row of data), I got the error "X is rank deficient to within machine precision." I didn't get this error when I used mldivide.
Is doing a pointwise multiple linear regression even possible? It seems to me that my sample size is way too small. Any feedback on the feasibility of this, and whether a for loop is a good direction to pursue, would be greatly appreciated.
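A note on the rank-deficiency: a single row gives one equation in three unknowns, so a truly per-point fit is underdetermined; regress warns about this, while mldivide silently returns one of the infinitely many solutions. One common workaround is to fit within a sliding window of neighbouring records, so each point borrows information from its neighbours. Here is a rough sketch of that idea, written in Python/NumPy rather than MATLAB (the MATLAB version would use mldivide the same way); all names and the window size are hypothetical, and it assumes array neighbours are geographic neighbours.

import numpy as np

def pointwise_regression(X, y, half_window=50):
    # X: (n, 3) predictors; y: (n,) response.
    # Rows with NaN are skipped inside each window; outputs stay
    # aligned with the inputs because unfit points remain NaN.
    n, p = X.shape
    coefs = np.full((n, p), np.nan)
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        Xw, yw = X[lo:hi], y[lo:hi]
        ok = ~(np.isnan(Xw).any(axis=1) | np.isnan(yw))
        if ok.sum() < p:                  # not enough usable neighbours
            continue
        coefs[i], _, _, _ = np.linalg.lstsq(Xw[ok], yw[ok], rcond=None)
    return coefs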

Measuring curve “closeness” with unequal data ranges

Suppose I have an example like the one shown, where the blue data is my calculated/measured data and the red data is the given ground-truth data. The task is to measure the similarity/closeness between the data and each of the given curves so that a classification can be done; it could also be acceptable to choose multiple classes if the results are very close.
In my mind, I can divide the problem into several subproblems:
The data ranges are not the same
The resolution of the calculated/measured data is higher than the ground-truth data
The calculated data has some bias/shift
The following questions come to mind when trying to solve these problems:
Is it better to fit the calculated/measured data first and then attempt to solve the problem?
Would it be fine to use the data points as-is and calculate the mean squared error against each curve, treating it as a fitting attempt and taking the best fit (see the sketch after these questions)? What would be the effect of the bias/shift in this case?
What is a good approach to the resolution/range mismatch: decreasing the number of samples of the more densely sampled curve, or increasing the number of samples of the less densely sampled one over the shared range?
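As a hedged illustration of the MSE idea above, here is a minimal Python sketch (all names are hypothetical; it assumes 1-D NumPy arrays with x values sorted ascending). It resamples each ground-truth curve onto the measured grid over the overlapping range, optionally subtracts the mean difference to absorb a constant bias/shift, and scores with mean squared error:

import numpy as np

def curve_mse(x_meas, y_meas, x_ref, y_ref, remove_offset=True):
    # compare only over the overlapping x-range, resampling the
    # lower-resolution reference onto the measured x-grid
    lo, hi = max(x_meas.min(), x_ref.min()), min(x_meas.max(), x_ref.max())
    mask = (x_meas >= lo) & (x_meas <= hi)
    y_ref_i = np.interp(x_meas[mask], x_ref, y_ref)
    diff = y_meas[mask] - y_ref_i
    if remove_offset:
        diff = diff - diff.mean()   # crude correction for a constant bias/shift
    return np.mean(diff ** 2)

# classify by picking the reference curve with the smallest score:
# scores = [curve_mse(x, y, xr, yr) for (xr, yr) in references]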

Matlab: Eliminating freak values in dataset

I am searching for a method to eliminate freak values from a given dataset. For example:
All these peaks should be eliminated. I've tried different filters like medfilt, but the peaks are still there. I've also tried a lowpass filter, but it didn't work either. I am a beginner in filtering signals, so I probably did it wrong.
You can download data sets for the x array here and y array here.
I could also imagine a loop that compares neighbouring values, but I am sure there has to be a built-in function?
Here is the result using medfilt1(input,15):
The peaks are vanishing, but then I get these ugly steps, which I don't want.
Just use a median filter! medfilt1(data,3) will do if this is a 1-pixel "cosmic" spike. If the peaks remain, increase the window size to 5 or more...
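For anyone working in Python rather than MATLAB, a rough equivalent of the same idea using SciPy (kernel sizes must be odd; 3 and 15 mirror the values tried above):

import numpy as np
from scipy.signal import medfilt

# y: 1-D signal containing isolated spikes
y_filtered = medfilt(y, kernel_size=3)    # removes 1-sample spikes
# if spikes remain, widen the window (keep it odd):
y_filtered = medfilt(y, kernel_size=5)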
EDIT:
So this is what the OP's data looks like:
So we see that the data is not exactly uniform or ordered, and there are a lot of data points in the spikes, unlike what one might first understand from the question (please plot your data correctly!). The question now is: is the data in the spikes or in the baseline?

Automatically truncating a curve to discard outliers in matlab

I am generating some data whose plots are as shown below.
In all the plots I get some outliers at the beginning and at the end. Currently I am truncating the first and the last 10 values. Is there a better way to handle this?
I am basically trying to automatically identify the two points shown below.
This is a fairly general problem with lots of approaches; usually you will use some a priori knowledge of the underlying system to make it tractable.
So for instance if you expect to see the pattern above - a fast drop, a linear section (up or down) and a fast rise - you could try taking the derivative of the curve and looking for large values and/or sign reversals. Perhaps it would help to bin the data first.
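As a loose sketch of the derivative idea, assuming a 1-D NumPy array y and a threshold you would tune on your own data:

import numpy as np

# y: 1-D signal with a fast drop, a roughly linear middle, and a fast rise
dy = np.diff(y)
threshold = 5 * np.median(np.abs(dy))          # "large" relative to typical steps; tune this
steep = np.where(np.abs(dy) > threshold)[0]    # indices of steep segments
first_half = steep[steep < len(y) // 2]
second_half = steep[steep >= len(y) // 2]
start = first_half.max() + 1 if first_half.size else 0
end = second_half.min() if second_half.size else len(y)
y_trimmed = y[start:end]                       # middle section without the edge outliers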
If your pattern is not so easy to define but you are expecting a linear trend, you might fit the data to an appropriate class of curve using fit and then detect outliers as those whose error from the fit exceeds a given threshold.
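And a corresponding sketch of the fit-and-threshold variant, using NumPy's polyfit as a stand-in for MATLAB's fit (the 3-sigma multiplier is an assumption to tune):

import numpy as np

x = np.arange(len(y))
slope, intercept = np.polyfit(x, y, deg=1)       # fit the expected linear trend
residuals = y - (slope * x + intercept)
keep = np.abs(residuals) < 3 * residuals.std()   # threshold on error from the fit
y_clean = y[keep]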
In either case you still have to choose thresholds - mean, variance and higher order moments can help here but you would probably have to analyse existing data (your training set) to determine the values empirically.
And perhaps, after all that, as Shai points out, you may find that lopping off the first and last ten points gives the best results for the time you spent (cf. Pareto principle).

Resampling data with minimal loss of information in time-domain

I am trying to resample/recreate already-recorded data for plotting purposes. I thought this is the best place to ask the question (besides dsp.se).
The data is sampled at a high frequency and contains too many data points to plot in the time domain (not enough memory). I want to resample it with minimal loss. The sampling interval of the resulting data doesn't need to be uniform (again, this is for plotting purposes, not analysis), although the input data is equally sampled.
When we use the regular resample command from MATLAB/Octave, it can distort stiff pieces of the curve.
What is the best approach here?
For reference, I put two pictures found on tex.se:
The first image is the regular resample.
The second image is better-resampled data that behaves well around the peaks.
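One common plotting-oriented technique, separate from the File Exchange tool recommended below and offered only as a hedged sketch, is min/max decimation: keep each bucket's minimum and maximum so peaks survive the downsampling, at the cost of non-uniform output spacing (which the question says is acceptable). All names and the bucket count are hypothetical.

import numpy as np

def minmax_decimate(t, y, n_buckets=2000):
    # downsample for plotting by keeping each bucket's min and max
    idx_out = []
    for bucket in np.array_split(np.arange(len(y)), n_buckets):
        if bucket.size == 0:
            continue
        lo = bucket[np.argmin(y[bucket])]
        hi = bucket[np.argmax(y[bucket])]
        idx_out.extend(sorted((lo, hi)))
    idx = np.unique(idx_out)               # output spacing is non-uniform
    return t[idx], y[idx]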
You should try this set of files from the File Exchange. It computes an optimal lookup table based on either a maximum number of points or a given error. You can choose natural, linear, or spline interpolation. Spline will give the smallest table size but is slower than linear. I don't use natural unless I have a really good reason.
Sincerely,
Jason