Beginners issue in polynomial curve fitting [Part 1] - matlab

I have just started learning modeling techniques based on regression models and was going through the MATLAB Curve Fitting Toolbox and SO. I have some fundamental doubts and am unable to proceed further. I have a single vector with k=100 data points which I want to fit to an AR model, an MA model, and an ARMA model successively to see which is better suited. Starting with an AR(p) model of the form y(k+1)=a*y(k)+b*y(k-1), the command
coeff = polyfit(x,y,d)
will fit a polynomial of degree, say, d=1 with p coefficients indicating the order of the model (AR(p)). But I have just one set of data, which is the recording of the angular moment. So what goes in as the first parameter (x) of the function signature, i.e. what will x and y be? Then, what if the linear models are not good enough, so that I may have to select nonlinear models? Can somebody please guide me, with code snippets, through the steps in fitting, checking for overfitting, residual calculation, etc.?

x is likely to be k (the index of y). The whole call is then:
c = polyfit(1:length(y), y, d);
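To make the call above concrete, here is a minimal sketch of the fitting, residual and overfitting steps the question asks about. The signal is synthetic (an assumption, not the asker's recording), and the hold-out split and degree range are arbitrary illustrative choices. Only base MATLAB functions (polyfit, polyval) are used.

```matlab
% Synthetic stand-in for the recorded signal (assumption: not the real data)
rng(0);                               % reproducible noise
k = (1:100)';                         % sample index, used as x
y = 0.05*k + sin(k/10) + 0.1*randn(size(k));

% Hold out every 4th sample to check for overfitting
testIdx  = (4:4:100)';
trainIdx = setdiff(k, testIdx);

degrees  = 1:6;
rmsTrain = zeros(size(degrees));
rmsTest  = zeros(size(degrees));
for i = 1:numel(degrees)
    d = degrees(i);
    % The third output mu centers and scales x, which keeps the fit well conditioned
    [c, ~, mu] = polyfit(k(trainIdx), y(trainIdx), d);
    resTrain = y(trainIdx) - polyval(c, k(trainIdx), [], mu);   % residuals on fitted data
    resTest  = y(testIdx)  - polyval(c, k(testIdx),  [], mu);   % residuals on unseen data
    rmsTrain(i) = sqrt(mean(resTrain.^2));
    rmsTest(i)  = sqrt(mean(resTest.^2));
    fprintf('degree %d: train RMS %.4f, hold-out RMS %.4f\n', d, rmsTrain(i), rmsTest(i));
end
```

The training RMS can only go down as the degree grows. A hold-out RMS that stops improving, or gets worse, while the training RMS keeps falling is the usual sign of overfitting.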
MATLAB has a Curve Fitting Toolbox. You can use its GUI to try different nonlinear fits and get some intuition.
If you want the steps, there's a great Coursera Machine Learning course. The beginning of this course covers linear regression, and I recommend you spend at least a few hours on that part.
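Note that polyfit on the index fits a trend against time, which is not the same as the AR form y(k+1)=a*y(k)+b*y(k-1) from the question. For that form, the "x" is a matrix of lagged copies of y and the "y" is the signal shifted by one step. The sketch below uses plain least squares with the backslash operator; it is not part of the answer above. The simulated data and the values aTrue and bTrue are hypothetical.

```matlab
% Simulate a hypothetical AR(2) series (assumption: stands in for the recording)
rng(1);
N = 100;
aTrue = 0.6;  bTrue = -0.3;
y = zeros(N, 1);
y(1:2) = randn(2, 1);
for k = 2:N-1
    y(k+1) = aTrue*y(k) + bTrue*y(k-1) + 0.05*randn;
end

% Regressors are the lagged values; targets are the one-step-ahead values
X = [y(2:N-1), y(1:N-2)];   % columns: y(k), y(k-1) for k = 2..N-1
t = y(3:N);                 % y(k+1)

theta = X \ t;              % least-squares estimate of [a; b]
res   = t - X*theta;        % one-step-ahead prediction residuals

fprintf('a = %.3f, b = %.3f, residual RMS = %.4f\n', ...
        theta(1), theta(2), sqrt(mean(res.^2)));
```

With only 100 samples the estimates will scatter around the true values, so the residuals are worth inspecting before trusting the model.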

Related

Hyper-parameters of Gaussian Processes for Regression

I know a Gaussian Process Regression model is mainly specified by its covariance matrix, and the free hyper-parameters act as the 'weights' of the model. But could anyone explain what the 2 hyper-parameters (length-scale & amplitude) in the covariance matrix represent (since they are not 'real' parameters)? I'm a little confused about the 'actual' meaning of these 2 parameters.
Thank you for your help in advance. :)
First off, I would like to point out that there is an infinite number of kernels that could be used in a Gaussian process. One of the most common, however, is the RBF (also referred to as the squared exponential, the exponentiated quadratic, etc.). This kernel is of the following form:
k(x, x') = sigma^2 * exp(-(x - x')^2 / (2*l^2))
The above equation is of course for the simple 1D case. Here l is the length scale and sigma^2 is the variance parameter (note they go under different names depending on the source). Effectively, the length scale controls how similar two points appear to be, as it simply rescales the distance between x and x': the larger l is, the more slowly the function varies. The variance parameter controls the amplitude, i.e. how far the function values typically stray from the mean. These are related but not the same.
The Kernel Cookbook gives a nice little description and compares RBF kernels to other commonly used kernels.
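A small numerical sketch may help separate the two roles. It evaluates the standard 1D RBF kernel on a few points for different settings; the grid and the hyper-parameter values are arbitrary illustrative choices. The subtraction x1 - x2' relies on implicit expansion, so it assumes MATLAB R2016b or later.

```matlab
% Standard 1D RBF kernel: ell is the length scale, sigma the amplitude (std. dev.)
rbf = @(x1, x2, ell, sigma) sigma^2 * exp(-(x1 - x2').^2 / (2*ell^2));

x = (0:5)';                      % six input locations, one unit apart

Kshort = rbf(x, x, 0.5, 1);      % short length scale
Klong  = rbf(x, x, 2.0, 1);      % long length scale, same amplitude
Kamp   = rbf(x, x, 2.0, 3);      % same length scale, larger amplitude

% Covariance between neighbouring points x = 0 and x = 1
fprintf('ell = 0.5: k(0,1) = %.4f\n', Kshort(1, 2));
fprintf('ell = 2.0: k(0,1) = %.4f\n', Klong(1, 2));

% Variance of the function value at a single point
fprintf('sigma = 1: k(0,0) = %.1f\n', Klong(1, 1));
fprintf('sigma = 3: k(0,0) = %.1f\n', Kamp(1, 1));
```

Increasing ell raises the off-diagonal entries (neighbouring points become more strongly correlated), while increasing sigma scales the whole matrix, diagonal included.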

Fitting sigmoid to data

There are many curve fitting and interpolation tools like polyfit (or even this nice logfit toolbox I found here), but I can't seem to find anything that will fit a sigmoid function to my x-y data.
Does such a tool exist or do I need to make my own?
If you have the Statistics Toolbox installed, you can use nonlinear regression with nlinfit:
sigfunc = @(A, x)(A(1) ./ (A(2) + exp(-x)));
A0 = ones(1, 2); %// Initial values fed into the iterative algorithm
A_fit = nlinfit(x, y, sigfunc, A0);
Here sigfunc is just an example for a sigmoid function, and A is the vector of the fitting coefficients.
nlinfit, and especially gatool, are big hammers for this problem. A sigmoid is not a specific function. Most commonly it is taken to be the same as the logistic function (also often the most efficient to calculate):
y = 1./(1+exp(-x));
or a generalized logistic. But all manner of curves can have sigmoidal shapes. If you know that your data corresponds to one in particular, fitting can be improved and more efficient methods can be applied. For example, the error function (erf) has a sigmoidal shape and shows up in the CDF of the normal distribution. If you know that your data is the result of a Gaussian process (i.e., the data is the CDF) and you have the Stats toolbox, you can use the normfit function. This function is based on maximum likelihood estimation (MLE). If you end up needing to write a custom fitting function - say, for performance reasons - I'd investigate MLE techniques for the particular form of sigmoid that you'd like to fit.
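If no toolbox is available, a custom fit can be sketched with base MATLAB alone. The example below fits a three-parameter logistic by least squares using fminsearch; this is a simpler stand-in for the MLE approach mentioned above, not an implementation of it. The parameterisation (height, slope, midpoint), the synthetic data and the starting guess are all assumptions made for illustration.

```matlab
% Three-parameter logistic: p = [height, slope, midpoint] (hypothetical parameterisation)
logistic = @(p, x) p(1) ./ (1 + exp(-p(2) * (x - p(3))));

% Synthetic x-y data (assumption: replaces the user's measurements)
rng(0);
pTrue = [2, 1.5, 0.5];
x = linspace(-5, 5, 50)';
y = logistic(pTrue, x) + 0.01*randn(size(x));

% Sum of squared residuals as the objective
sse = @(p) sum((y - logistic(p, x)).^2);

% Rough starting guess taken from the data
p0 = [max(y), 1, mean(x)];

pFit = fminsearch(sse, p0);
fprintf('height %.3f, slope %.3f, midpoint %.3f\n', pFit(1), pFit(2), pFit(3));
```

fminsearch is derivative-free and can stall on poor starting guesses, so deriving p0 from the data, as done here, is usually worthwhile.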
I would suggest you use MATLAB's Global Optimization Toolbox, and in particular the Genetic Algorithm Solver, which you can use for your problem by optimizing the sigmoid function's parameters (= finding the best fit for your data) through a genetic algorithm. It has a GUI that is easy to use.
You can open the Genetic Algorithm Solver's GUI with gatool.

Generate bifurcation diagram for 2D system

Drawing a bifurcation diagram for a 1D system is clear, but suppose I have a 2D system of the following form:
dx/dt=f(x,y,r),
dy/dt=g(x,y,r)
and I want to generate a bifurcation diagram in MATLAB for x versus r.
What is the main idea for doing that, or are there any hints which could help me?
You first have to do some math:
Setting each of the functions to zero gives you two curves y(x) (called the nullclines), which you can plot in a phase diagram. The points where the two curves intersect are the fixed points (equilibria) of your system.
Now, you have to take the Jacobian of your system and plug each of those fixed points in, which will give you the linear stability analysis of the system.
The location of the fixed points and the stability of each point can now be computed as you vary r (the bifurcation parameter).
For the programming:
-use a Newton-type root finder (fsolve in MATLAB) to find where the equations are zero
-eig will help you find the eigenvalues of the Jacobian at each fixed point.
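The steps above can be sketched for a toy system. The system below (dx/dt = r - x^2, dy/dt = -y, a saddle-node example) is a hypothetical choice, not the asker's. A hand-written Newton iteration is used in place of fsolve so that the sketch runs without the Optimization Toolbox; the iteration count and the initial guesses are also assumptions.

```matlab
% Hypothetical 2D system with bifurcation parameter r
f = @(v, r) [r - v(1)^2; -v(2)];          % right-hand side [f; g]
J = @(v, r) [-2*v(1), 0; 0, -1];          % Jacobian of the right-hand side

rVals   = linspace(0.05, 2, 40);          % parameter sweep
guesses = [1, -1; 0, 0];                  % columns: initial guesses for two branches
xEq     = nan(2, numel(rVals));           % x-coordinate of each fixed point
stable  = false(2, numel(rVals));         % linear stability of each fixed point

for i = 1:numel(rVals)
    r = rVals(i);
    for g = 1:2
        v = guesses(:, g);
        for it = 1:50                     % plain Newton iteration
            v = v - J(v, r) \ f(v, r);
        end
        xEq(g, i)    = v(1);
        % Stable if all eigenvalues of the Jacobian have negative real part
        stable(g, i) = all(real(eig(J(v, r))) < 0);
    end
end

% Bifurcation diagram: solid = stable branch, dashed = unstable branch
plot(rVals, xEq(1, :), 'b-', rVals, xEq(2, :), 'r--');
xlabel('r'); ylabel('x at equilibrium');
legend('stable', 'unstable');
```

For a real system the branches would generally be followed by reusing the previous solution as the next initial guess, and the solid/dashed split would be taken from the computed stability flags rather than assumed per branch.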
However
It depends on your system.
If you're supposed to be looking for limit cycles or chaos or something, you'll have to use one of the ODE solvers, and then the analysis becomes more tricky. I suppose you could develop a Poincaré-Bendixson algorithm, but that would be involved and the details would depend on your system.
I don't think MATLAB has anything built in that would give you a bifurcation diagram. There is this third-party solution:
http://www.mathworks.com/matlabcentral/fileexchange/8382

Linear least-squares fit with constraint - any ideas?

I have a problem where I am fitting a high-order polynomial to (not very) noisy data using linear least squares. Currently I'm using polynomial orders around 15 - 25, which work surprisingly well: the dependence is very nearly linear, but the accuracy of modelling the 'very nearly' is critical. I'm using MATLAB's polyfit() function, and (obviously) normalising the x-data. This generally works fine, but I have come across an issue with some recent datasets. The fitted polynomial has extrema within the x-data interval. For the application I'm working on this is a no-no. The polynomial model must have no stationary points over the x-interval.
So I need to add a constraint to the least-squares problem: the derivative of the fitted polynomial must be strictly positive over a known x-range (or strictly negative - this depends on the data but a simple linear fit will quickly tell me which it is.) I have had a quick look at the available optimisation toolbox functions, but I admit I'm at a loss to know how to go about this. Does anyone have any suggestions?
[I appreciate there are probably better models than polynomials for this data, but in the short term it isn't feasible to change the form of the model]
[A closing note: I have finally got the go-ahead to replace this awful polynomial model! I am going to adopt a nonparametric approach, spline smoothing, using the excellent SPLINEFIT code by Jonas Lundgren. This has the advantage that I'm already using a spline model in the end-user application, so I already have C# code available to evaluate a spline model]
You could use cftool and use the exclude data points option.
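A different route, closer to the constraint described in the question, is to impose the sign of the derivative directly as linear inequality constraints on a grid of x values. The sketch below does this with lsqlin, which requires the Optimization Toolbox. The data, the polynomial degree, the grid density and the minimum slope epsSlope are all hypothetical choices, and the constraint only holds at the grid points, not strictly everywhere between them.

```matlab
% Synthetic, nearly linear data on a normalised x range (assumption)
rng(2);
x = linspace(-1, 1, 60)';
y = x + 0.05*sin(6*x) + 0.01*randn(size(x));

n      = 7;                               % illustrative degree, lower than in the question
powers = n:-1:0;                          % same coefficient order as polyfit/polyval
C      = x .^ powers;                     % Vandermonde matrix (implicit expansion, R2016b+)

% Derivative of each basis term, evaluated on a dense grid
xg = linspace(-1, 1, 200)';
D  = (xg .^ max(powers - 1, 0)) .* powers;  % d/dx x^p = p*x^(p-1); constant term gives 0

% Require slope >= epsSlope on the grid, written as A*c <= b
epsSlope = 1e-3;
A = -D;
b = -epsSlope * ones(size(xg));

opts = optimoptions('lsqlin', 'Display', 'off');
c = lsqlin(C, y, A, b, [], [], [], [], [], opts);

slope = polyval(polyder(c'), xg);         % check the fitted derivative on the grid
fprintf('minimum slope on grid: %.4g\n', min(slope));
```

For a decreasing relationship the inequality would be flipped (A = D, b = -epsSlope). High degrees such as 15 to 25 may need an orthogonal basis to stay well conditioned.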

Matlab Question - Principal Component Analysis

I have a set of 100 observations where each observation has 45 characteristics. Each one of those observations has a label attached, which I want to predict based on those 45 characteristics. So it's an input matrix with the dimension 45 x 100 and a target matrix with the dimension 1 x 100.
The thing is that I want to know how many of those 45 characteristics are relevant in my set of data, basically the principal component analysis, and I understand that I can do this with Matlab function processpca.
Could you please tell me how can I do this? Suppose that the input matrix is x with 45 rows and 100 columns and y is a vector with 100 elements.
Assuming that you want to construct a model of the 1x100 vector, based on the 45x100 matrix, I am not convinced that PCA will do what you think. PCA can be used to select variables for model estimation, but this is a somewhat indirect way to gather a set of model features. Anyway, I suggest reading both:
Principal Components Analysis
and...
Putting PCA to Work
...both of which provide code in MATLAB not requiring any Toolboxes.
Have you tried COEFF = princomp(x)?
COEFF = princomp(X) performs principal components analysis (PCA) on the n-by-p data matrix X, and returns the principal component coefficients, also known as loadings. Rows of X correspond to observations, columns to variables. COEFF is a p-by-p matrix, each column containing coefficients for one principal component. The columns are in order of decreasing component variance.
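Since the quoted documentation expects observations in rows, the 45-by-100 matrix from the question would need to be transposed first. In newer releases of the Statistics Toolbox, pca replaces princomp. The sketch below performs the same computation with base MATLAB's svd so that it runs without any toolbox; the random data and the 95% variance threshold are placeholder assumptions.

```matlab
% Placeholder data: 45 characteristics (rows) x 100 observations (columns)
rng(3);
x = randn(45, 100);

X  = x';                          % rows = observations, columns = variables
Xc = X - mean(X, 1);              % center each variable (implicit expansion, R2016b+)

% PCA via the singular value decomposition of the centered data
[~, S, V] = svd(Xc, 'econ');
coeff     = V;                                  % loadings, one component per column
latent    = diag(S).^2 / (size(X, 1) - 1);      % variance of each component
explained = 100 * latent / sum(latent);         % percentage of total variance

% Number of components needed to reach 95% of the variance (arbitrary threshold)
nKeep = find(cumsum(explained) >= 95, 1);
fprintf('%d of %d components explain 95%% of the variance\n', nKeep, size(X, 2));

% With the Statistics Toolbox, a similar result should come from:
% [coeff, score, latent, ~, explained] = pca(x');
```

Note that the components are combinations of all 45 characteristics, so the count of retained components is not the same thing as a count of relevant original characteristics.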
From your question I gather that you don't need to do it in MATLAB; you just want to analyze your dataset. In my opinion, the key is visualization of the dependencies.
If you're not forced to do the analysis in MATLAB, I'd suggest you try more specialized software such as WEKA (www.cs.waikato.ac.nz/ml/weka/) or RapidMiner (rapid-i.com). Both tools provide PCA and other dimension reduction algorithms, and they contain nice visualization tools.
Your use case sounds like a combination of Classification and Feature Selection.
Statistics Toolbox offers a lot of good capabilities in this area. The toolbox provides access to a number of classification algorithms including
Naive Bayes Classifiers
Bagged Decision Trees (aka Random Forests)
Binomial and Multinomial logistic regression
Linear Discriminant analysis
You also have a variety of options available for feature selection, including
sequentialfs (forwards and backwards feature selection)
relieff
TreeBagger also supports options for feature selection and estimating variable importance.
Alternatively, you can use some of Optimization Toolbox's capabilities to write your own custom equations to estimate variable importance.
A couple of months back, I did a webinar for The MathWorks titled "Computational Statistics: Getting Started with Classification using MATLAB". You can watch the webinar at
http://www.mathworks.com/company/events/webinars/wbnr51468.html?id=51468&p1=772996255&p2=772996273
The code and the data set for the examples are available at MATLAB Central
http://www.mathworks.com/matlabcentral/fileexchange/28770
With all this said and done, many people use Principal Component Analysis as a pre-processing step before applying classification algorithms. PCA gets used a lot
When you need to extract features from images
When you're worried about multicollinearity
You should find the correlation matrix. In the following example, MATLAB finds the correlation matrix with the 'corr' function:
http://www.mathworks.com/help/stats/feature-transformation.html#f75476
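As a rough sketch of that idea without the Statistics Toolbox, base MATLAB's corrcoef can stand in for corr. The data are random placeholders in which one characteristic is deliberately made a near copy of another, and the 0.9 threshold is an arbitrary assumption.

```matlab
% Placeholder data: 45 characteristics (rows) x 100 observations (columns)
rng(4);
x = randn(45, 100);
x(2, :) = x(1, :) + 0.1*randn(1, 100);   % make characteristic 2 nearly redundant with 1

% corrcoef treats columns as variables, so transpose first
R = corrcoef(x');                         % 45-by-45 correlation matrix

% List pairs of characteristics that are strongly correlated
[i, j] = find(triu(abs(R) > 0.9, 1));     % upper triangle only, diagonal excluded
for p = 1:numel(i)
    fprintf('characteristics %d and %d: r = %.3f\n', i(p), j(p), R(i(p), j(p)));
end
```

Strongly correlated pairs carry overlapping information, which is one informal way to see how many characteristics are effectively redundant.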