Selecting the model with maximum R-squared when curve fitting in MATLAB

I am modeling a time-series data set (x and y) with several candidate models (a cubic polynomial, a 4th-degree polynomial, and an exponential).
Is there a way to program MATLAB so that it selects the model with the maximum R-squared value and then uses that model to predict a future outcome? I understand this can be done manually with the Curve Fitting Toolbox by inspecting the results, but even then I think I would still need to write the equation out and solve for the value of interest by hand.
I would like to automate the process via code: select the optimal model and use it to predict future results. Below is the main part of my code. Any help would be appreciated.
[f3, gof] = fit(x,y,'poly3','Normalize','on');
plot(f3,x,y);
[f4, gof] = fit(x,y,'poly4','Normalize','on');
[fexp, gof] = fit(x,y,'exp1');
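A minimal sketch of the kind of automation being asked about, assuming x and y are defined as above (the future query point is illustrative): keep a separate goodness-of-fit struct for each candidate, compare the R-squared values, and evaluate the winning cfit object at the new x.
% Fit the candidate models, keeping a separate gof struct for each
[f3,   gof3]   = fit(x, y, 'poly3', 'Normalize', 'on');
[f4,   gof4]   = fit(x, y, 'poly4', 'Normalize', 'on');
[fexp, gofexp] = fit(x, y, 'exp1');
models = {f3, f4, fexp};
rsq    = [gof3.rsquare, gof4.rsquare, gofexp.rsquare];
[~, idx]  = max(rsq);          % index of the model with the largest R-squared
bestModel = models{idx};
xFuture = max(x) + 1;          % illustrative future time point
yFuture = bestModel(xFuture)   % cfit objects can be evaluated directly (or via feval)
Note that raw R-squared tends to favour the higher-degree polynomial simply because it has more coefficients; the gof struct also contains adjrsquare, which penalises extra terms and may be a fairer selection criterion.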

See the math on p. 12 of this working paper, which can be downloaded from:
http://www.nber.org/papers/w5027
The authors also point to some older literature.

Related

Advice on Speeding up SciPy Custom Distribution Sampling & Fitting

I am trying to fit a custom distribution to a large dataset (~500,000 measurements) using SciPy. I have derived a theoretical PDF based on some other factors, but neither by hand nor with symbolic integration software can I find an exact form of the CDF.
Currently, simply drawing 1000 random samples from my custom distribution is expensive, which I believe is due to the need to numerically invert an unknown CDF. If I cannot find an explicit form of the CDF and its inverse, is there anything else I can do to speed up use of this distribution?
I've used Maple, MATLAB, and SymPy to try to determine a CDF, yet none gives a result. I also tried down-sampling my data while still retaining the tail behaviour, but this still required so much data that doing anything with the distribution was slow.
My distribution is a subclass of SciPy's rv_continuous class.
Thanks for any advice.
This sounds like you want to sample from a kernel density estimate (KDE) of the probability distribution. While SciPy does offer a Gaussian KDE (scipy.stats.gaussian_kde), for that many measurements you would be much better off using scikit-learn's implementation. A good resource with code examples can be found on Jake VanderPlas's blog.

One-class learning to make predictions using MATLAB

I am using MATLAB to build a prediction model in which the target is binary.
The problem is that the negative observations in my training data may in fact be positives that were simply not detected.
I started with a logistic regression model, assuming the data were accurate, and the results were less than satisfactory. After some research, I moved to one-class learning, hoping to focus only on the part of the data (the positives) that I am certain about.
I looked up the related material in the MATLAB documentation and found that I can use fitcsvm to proceed.
My current questions are:
Am I on the right path? Can one-class learning solve my problem?
I tried to use fitcsvm to create a ClassificationSVM using all the positive observations that I have:
model = fitcsvm(Instance,Label,'KernelScale','auto','Standardize',true)
However, when I try to use the model to predict
[label,score] = predict(model,Test)
all the labels predicted for my Test cases are 1. I think I did something wrong. Should I feed the SVM only the positive observations that I have?
If not, what should I do?
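For what it's worth, here is a hedged sketch of how one-class fitcsvm is typically set up (the 'OutlierFraction' value and the RBF kernel are illustrative choices, not from the question). With only one class in the training labels, predict can only ever return that class, so it is the sign of the score, not the label, that separates "looks like a positive" from "does not".
% Train a one-class SVM on the confirmed positives only.
% OutlierFraction is the expected share of outliers in the training set
% (the value here is illustrative).
oneClassModel = fitcsvm(Instance, ones(size(Instance,1),1), ...
    'KernelFunction', 'rbf', ...
    'KernelScale', 'auto', ...
    'Standardize', true, ...
    'OutlierFraction', 0.05);
% Score the test set; negative scores flag observations that do not
% resemble the training (positive) class.
[~, score]    = predict(oneClassModel, Test);
looksPositive = score(:,1) > 0;
Whether this solves the underlying problem depends on how representative the confirmed positives are; if some negatives can be trusted, a cost-sensitive two-class model may still be worth comparing against.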

Why do I get disjoint data when I try to extrapolate after using polynomial regression?

I wanted to extrapolate some of the data I had, as shown in the plot below. The blue line is the original data and the red line is the extrapolation that I wanted.
To use regression analysis, I used the function polyfit:
sizespecial = size(i_C);
endgoal = sizespecial(2);          % number of data points
plothelp = 1:endgoal;              % sample index used as the x variable
reg1 = polyfit(plothelp, i_C, 2);  % 2nd-order fit to i_C
reg2 = polyfit(plothelp, i_D, 2);  % 2nd-order fit to i_D
Where i_C and i_D are the vectors that represent the original data. I extended the data by using this code:
plothelp = 1:endgoal+11;
for in = endgoal+1:endgoal+11
    i_C(in) = reg1(1)*in^2 + reg1(2)*in + reg1(3);   % evaluate the quadratic fits at the new indices
    i_D(in) = reg2(1)*in^2 + reg2(2)*in + reg2(3);
end
However, the graph I output now is:
I do not understand why the extra notch (circled in red) is introduced. Do not hesitate to ask me to clarify any details of this question, and thank you for your answers.
What I imagine is happening is that you are trying to fit a second-order polynomial over all of your data. My guess is that this polynomial will look a lot like the curve I have drawn in orange. If you follow Matt's advice from his comment and plot your regressed polynomial over your original data as well (not just the extrapolated part), you should be able to confirm this.
You might get better results by fitting a higher-order polynomial. Your data have two points of inflection, so a 3rd-order polynomial will probably work quite well. One danger of extrapolating with higher-order polynomials, however, is that they can have fairly dramatic inflections outside the domain of your data and produce unexpected and wild results.
One way to mitigate this is to instead perform a linear regression over only the final x data points of your series. These are the points highlighted in yellow in the figure. You can tune x as a parameter so that it covers as much of the approximately linear final portion of your curve as makes sense. The red line I have drawn will then be the result of a linear regression performed on only those data (as opposed to the entire data set).
Another option is to fit a spline curve and extrapolate on that; you can use the interp1 function, specifying 'spline' or 'pchip', for that.
Which choice is best will depend largely on the nature of the problem you are trying to solve.
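A minimal sketch of the last two suggestions, assuming i_C is a row vector as in the question (the 20-point window is illustrative and should be tuned to the linear tail; the 11 extra points mirror the question):
endgoal   = numel(i_C);
plothelp  = 1:endgoal;
xExtended = 1:endgoal+11;
% Option 1: linear fit over only the final, approximately linear portion
window  = 20;                                   % tune to cover the linear tail
tailIdx = endgoal-window+1:endgoal;
regTail = polyfit(tailIdx, i_C(tailIdx), 1);
i_C_lin = [i_C, polyval(regTail, endgoal+1:endgoal+11)];
% Option 2: shape-preserving spline extrapolation
i_C_pchip = interp1(plothelp, i_C, xExtended, 'pchip', 'extrap');
plot(plothelp, i_C, 'b', xExtended, i_C_lin, 'r--', xExtended, i_C_pchip, 'g--');
legend('original', 'linear tail fit', 'pchip extrapolation');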

Beginner's issue in polynomial curve fitting [Part 1]

I have just started learning about modeling techniques based on regression models and have been going through the MATLAB Curve Fitting Toolbox documentation and SO. I have some fundamental doubts and am unable to proceed further. I have a single vector with k = 100 data points which I want to fit with an AR model, an MA model, and an ARMA model in turn, to see which is better suited. Starting with an AR(p) model of the form y(k+1) = a*y(k) + b*y(k-1), the command
coeff = polyfit(x,y,d)
will fit a polynomial of degree, say, d = 1, with p coefficients indicating the order of the model (AR(p)). But I just have one set of data, which is the recording of the angular moment. So what goes in as the first parameter (x) of the function signature, i.e. what should x and y be? And what if the linear models are not good enough, so that I have to move to nonlinear models? Can somebody please guide me, with code snippets, through the steps of fitting, checking for overfitting, calculating residuals, etc.?
x is likely to be k (the index of y). The whole call:
c = polyfit(1:length(y), y, d)
MATLAB has a Curve Fitting Toolbox. You could use it to try different nonlinear fits in the GUI to get some intuition.
If you want a step-by-step treatment, there's a great Coursera Machine Learning course. The beginning of that course covers linear regression, and I recommend spending at least a few hours on it.
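For the AR(p) part of the question, it is worth noting that polyfit against the sample index fits a polynomial trend in time rather than an autoregressive model. A minimal sketch of fitting the AR(2) form y(k+1) = a*y(k) + b*y(k-1) by ordinary least squares, assuming y is the column vector of k = 100 recorded values:
y   = y(:);                       % ensure a column vector
Phi = [y(2:end-1), y(1:end-2)];   % regressor columns: y(k) and y(k-1)
t   = y(3:end);                   % targets: y(k+1)
theta = Phi \ t;                  % least-squares estimates [a; b]
resid = t - Phi*theta;            % residuals for checking the fit
rss   = sum(resid.^2);
fprintf('a = %.4f, b = %.4f, RSS = %.4f\n', theta(1), theta(2), rss);
Holding out the last portion of the series and comparing out-of-sample residuals there is a simple guard against overfitting when moving on to higher-order or nonlinear models.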

Linear least-squares fit with constraint - any ideas?

I have a problem where I am fitting a high-order polynomial to (not very) noisy data using linear least squares. Currently I'm using polynomial orders around 15-25, which work surprisingly well: the dependence is very nearly linear, but the accuracy of modelling the 'very nearly' is critical. I'm using MATLAB's polyfit() function and (obviously) normalising the x-data. This generally works fine, but I have come across an issue with some recent datasets: the fitted polynomial has extrema within the x-data interval. For the application I'm working on this is a no-no: the polynomial model must have no stationary points over the x-interval.
So I need to add a constraint to the least-squares problem: the derivative of the fitted polynomial must be strictly positive over a known x-range (or strictly negative; this depends on the data, but a simple linear fit will quickly tell me which it is). I have had a quick look at the available Optimization Toolbox functions, but I admit I'm at a loss as to how to go about this. Does anyone have any suggestions?
[I appreciate there are probably better models than polynomials for this data, but in the short term it isn't feasible to change the form of the model]
[A closing note: I have finally got the go-ahead to replace this awful polynomial model! I am going to adopt a nonparametric approach, spline smoothing, using the excellent SPLINEFIT code by Jonas Lundgren. This has the advantage that I'm already using a spline model in the end-user application, so I already have C# code available to evaluate a spline model]
You could use cftool with its option to exclude data points.
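One way to impose the monotonicity constraint described in the question, assuming the Optimization Toolbox is available, is to pose the fit as a constrained linear least-squares problem for lsqlin, requiring the derivative to be positive on a grid over the x-interval. A hedged sketch (the degree, grid density, and margin are illustrative):
n  = 15;                                   % polynomial degree (illustrative)
xs = (x - mean(x)) / std(x);               % normalise x, as polyfit's three-output form does
C  = xs(:).^(n:-1:0);                      % Vandermonde matrix: columns x^n ... x^0
d  = y(:);
xg = linspace(min(xs), max(xs), 200).';    % grid on which to enforce p'(x) > 0
D  = [(n:-1:1) .* xg.^(n-1:-1:0), zeros(numel(xg),1)];   % rows evaluate p'(xg)
A  = -D;                                   % -p'(xg) <= -margin  <=>  p'(xg) >= margin
b  = -1e-6 * ones(size(xg));
coeffs = lsqlin(C, d, A, b);               % constrained linear least squares
yfit   = polyval(coeffs, xs);              % remember to normalise any new x the same way
Because the derivative is only constrained on a finite grid, it is worth checking the fitted p'(x) between grid points (or densifying the grid) before trusting the result; for a strictly negative slope, flip the sign of the constraint.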