I'm trying to perform logistic regression to do classification using MATLAB. There seem to be two different functions in MATLAB's Statistics Toolbox for building a generalized linear model: 'glmfit' and 'fitglm'. I can't figure out what the difference is between the two. Is one preferable over the other?
Here are the links for the function descriptions.
http://uk.mathworks.com/help/stats/glmfit.html
http://uk.mathworks.com/help/stats/fitglm.html
The difference is what the functions output. glmfit just outputs a vector of the regression coefficients (and some other stuff if you ask for it). fitglm outputs a regression object that packs all sorts of information and functionality inside (See the docs on GeneralizedLinearModel class). I would assume the fitglm is intended to replace glmfit.
In addition to Dan's answer, I would like to add the following.
The function fitglm, like newer functions from the statistics toolbox, accepts more flexible inputs than glmfit. For example, you can use a table as the data source, specify a formula of the form Y ~ X1 + X2 + ..., and use categorical variables.
As a side note, the function lassoglm uses (depends on) glmfit.
Related
I want to minimize some function J using gradient information. I found two functions in scipy that may do the job, scipy.optimize.fmin_tnc (here) and scipy.optimize.minimize (here), and I implemented them. However, now I need the stepwise output of the function evaluations at each step of the (e.g. Newton) algorithm to plot its convergence. Is it possible to get this vector somehow out of these functions? It is not part of default return values as it seems.
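One possible way to get this, assuming scipy.optimize.minimize is used: it accepts a callback argument that is called once per iteration with the current parameter vector, so the function values can be recorded there. The objective J, its gradient grad_J and the list history below are made up for illustration, and BFGS is just one gradient-based method picked for the sketch.

```python
import numpy as np
from scipy.optimize import minimize


def J(x):
    # Toy objective (an assumption for this sketch): a shifted quadratic bowl.
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2


def grad_J(x):
    # Analytic gradient of the toy objective.
    return np.array([2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)])


history = []  # one entry of J per iteration, filled by the callback


def record(xk):
    # Called by minimize after each iteration with the current iterate.
    history.append(J(xk))


result = minimize(J, x0=np.zeros(2), jac=grad_J, method="BFGS", callback=record)
```

After the call, history holds one value of J per iteration and can be plotted directly to show the convergence. Note that the callback re-evaluates J, which costs one extra function call per iteration.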
There are many curve fitting and interpolation tools like polyfit (or even this nice logfit toolbox I found here), but I can't seem to find anything that will fit a sigmoid function to my x-y data.
Does such a tool exist or do I need to make my own?
If you have the Statistics Toolbox installed, you can use nonlinear regression with nlinfit:
sigfunc = @(A, x)(A(1) ./ (A(2) + exp(-x)));
A0 = ones(1, 2); % Initial values fed into the iterative algorithm (one per coefficient in A)
A_fit = nlinfit(x, y, sigfunc, A0);
Here sigfunc is just an example for a sigmoid function, and A is the vector of the fitting coefficients.
nlinfit, and especially gatool, are big hammers for this problem. A sigmoid is not a specific function. Most commonly it is taken to be the same as the logistic function (also often the most efficient to calculate):
y = 1./(1+exp(-x));
or a generalized logistic. But all manner of curves can have sigmoidal shapes. If you know that your data corresponds to one in particular, fitting can be improved and more efficient methods can be applied. For example, the error function (erf) has a sigmoidal shape and shows up in the CDF of the normal distribution. If you know that your data is the result of a Gaussian process (i.e., the data is the CDF) and you have the Stats toolbox, you can use the normfit function. This function is based on maximum likelihood estimation (MLE). If you end up needing to write a custom fitting function - say, for performance reasons - I'd investigate MLE techniques for the particular form of sigmoid that you'd like to fit.
I would suggest you use MATLAB's Global Optimization Toolbox, and in particular the Genetic Algorithm Solver, which you can use for your problem by optimizing (= finding the best fit for your data) the sigmoid function's parameters through genetic algorithm. It has a GUI that is easy to use.
You can open the Genetic Algorithm Solver's GUI by calling gatool.
I have a 40x3249 noisy dataset and a 40x1 result set. I want to perform simple sequential feature selection on it in MATLAB. The MATLAB example is complicated and I can't follow it. Even a few examples on SoF didn't help. I want to use a decision tree as the classifier to perform feature selection. Can someone please explain in simple terms?
Also is it a problem that my dataset has very low number of observations compared to the number of features?
I am following this example: Sequential feature selection Matlab and I am getting error like this:
The pooled covariance matrix of TRAINING must be positive definite.
I've explained the error message you're getting in answers to your previous questions.
In general, it is a problem that you have many more variables than samples. This will prevent you using some techniques, such as the discriminant analysis you were attempting, but it's a problem anyway. The fact is that if you have that high a ratio of variables to samples, it is very likely that some combination of variables would perfectly classify your dataset even if they were all random numbers. That's true if you build a single decision tree model, and even more true if you are using a feature selection method to explicitly search through combinations of variables.
I would suggest you try some sort of dimensionality reduction method. If all of your variables are continuous, you could try PCA as suggested by @user1207217. Alternatively you could use a latent variable method for model-building, such as PLS (plsregress in MATLAB).
If you're still intent on using sequential feature selection with a decision tree on this dataset, then you should be able to modify the example in the question you linked to, replacing the call to classify with one to classregtree.
This error comes from the use of the classify function in that question, which is performing LDA. This error occurs when the data is rank deficient (or in other words, some features are almost exactly correlated). In order to overcome this, you should project the data down to a lower dimensional subspace. Principal component analysis can do this for you. See here for more details on how to use pca function within statistics toolbox of Matlab.
[basis, ~, latent] = pca(X); % Find the basis vectors and the variance of each component, X is row vectors
indices = find(latent > eps(2*max(latent))); % Keep components whose variance is above machine precision of the biggest component, with a little extra tolerance (2x)
new_basis = basis(:, indices); % This gets us the relevant components, which are stored in variable "basis" as column vectors
X_new = X*new_basis; % inner products between the new basis functions spanning some subspace of the original, and the original feature vectors
This should get you automatic projections down into a relevant subspace. Note that your features won't have the same meaning as before, because they will be weighted combinations of the old features.
Extra note: If you don't want to change your feature representation, then instead of classify, you need to use something which works with rank deficient data. You could roll your own version of penalised discriminant analysis (which is quite simple), use support vector machines, or other classification functions which don't break with correlated features as LDA does (by virtue of requiring matrix inversion of the covariance estimate).
EDIT: P.S I haven't tested this, because I have rolled my own version of PCA in Matlab.
I need to fit data in quite an indirect way. The original data to be recovered in the fit is some linear function with small oscillations and drifts on it, which I would like to identify. Let's call this f(t). We cannot record this parameter in the experiment directly, but only indirectly, let's say as g(f) = sin(a f(t)). (The real transfer function is more complex, but that should not play a role here.)
If f(t) changes direction near the turning points of the sin function, it is difficult to identify, so I tried an alternative approach to recovering f(t), rather than just applying the inverse function of g together with some guesses about how the data continues:
I create a model function fm(t) which undergoes the same and known transfer function g() and fit g(fm(t)) to the data. As the dataset is huge, I do this piecewise for successive chunks of data guaranteeing the continuity of fm across the whole set.
A first try was to use linear functions using the optimize.leastsq, where the error estimate is derived from g(fm). It is not completely satisfactory, and I think it would be far better to fit a spline to the data to get fspline(t) as a model for f(t), guaranteeing the continuity of the data and of its derivative.
The problem is that spline fitting from the interpolate package works on the data directly, so I cannot wrap the spline using g(fspline) and do the spline interpolation on this. Is there a way this can be done in scipy?
Any other ideas?
I tried quadratic functions, fixing the offset and slope to match those of the preceding fitted chunk of data so that there is only one fitting parameter, the curvature, but the fit very quickly starts to deviate.
Thanks
What you would need is a matrix of spline basis functions, b(t), so you can approximate f(t) as a linear combination of spline basis functions
f(t) = np.dot(b(t), coefs)
and then estimate the coefficients, coefs, by optimize.leastsq.
However, spline basis functions are not readily available in python, as far as I know (unless you borrow experimental scripts or search through the code of some packages).
Instead you could also use polynomials, for example
b(t) = np.polynomial.chebyshev.chebvander(t, order)
and use a polynomial approximation instead of the splines.
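A minimal sketch of the polynomial variant, under some loud assumptions: the link is taken as g(f) = sin(a f) with a = 1.0, the "true" f(t) is an invented low-order polynomial, and the data is noise-free and stays on a monotonic branch of the sine. None of these come from the question; they only keep the toy problem well behaved.

```python
import numpy as np
from numpy.polynomial import chebyshev
from scipy.optimize import leastsq

a = 1.0  # assumed known constant of the link function


def g(f):
    # Known transfer (link) function, taken from the question's example form.
    return np.sin(a * f)


# Synthetic data (assumption): f_true is what we pretend not to know.
t = np.linspace(-1.0, 1.0, 200)
f_true = 0.5 * t + 0.1 * t ** 2
data = g(f_true)

order = 3
B = chebyshev.chebvander(t, order)  # basis matrix, shape (len(t), order + 1)


def residuals(coefs):
    # Model f(t) = B @ coefs, pushed through g before comparing to the data.
    return g(B @ coefs) - data


coefs, ier = leastsq(residuals, np.zeros(order + 1))
f_fit = B @ coefs  # recovered estimate of f(t)
```

The only change compared with an ordinary linear least-squares fit is that the residual is computed after applying g, which is why a nonlinear solver such as leastsq is needed.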
The structure of this problem is very similar to generalized linear models where g is your known link function and similar to index problems in econometrics.
It would be possible to use the scipy splines in an indirect way if you create artificial data
y_i = f(t_i)
where f(t_i) are scipy.interpolate splines, and the y_i are the parameters to be estimated in the least squares optimization. (Loosely based on a script that I saw some time ago that used this for creating a different kind of smoothing splines than the scipy version. I don't remember where I saw this.)
Thank you for these comments. I tried out the polynomial basis suggested above, but polynomials are no option for my needs, as they tend to create ringing, which is difficult to condition.
The solution on using splines I now found is quite simple and straightforward, and I think it is what you meant by "using the splines in an indirect way".
The fitting function f(t) is obtained by the interpolate.splev(x, (t,c,k)) function, but providing the spline coefficients c by the optimize.leastsq function. In this way, f(t) is no direct spline fit (as one would usually obtain with the splrep(x, y) function) but indirectly optimized in the fit, and therefore it is possible to use the link function g on it. The initial guess for c might be obtained by one evaluation of splrep(xinit, yinit, t=knots) on model data.
One trick is to restrict the number of knots for the spline to below the number of datapoints by explicitly specifying them during the function call of splrep() and giving this reduced set during the evaluation using splev().
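Roughly, this approach could look like the sketch below. It is not the exact code I used: the link g(f) = sin(f), the synthetic f_true, the knot positions and the helper name f_model are all assumptions chosen for illustration. Only the leading len(t) - k - 1 spline coefficients are optimized, because the trailing k + 1 entries of the coefficient array returned by splrep are unused padding.

```python
import numpy as np
from scipy.interpolate import splrep, splev
from scipy.optimize import leastsq


def g(f):
    # Known link function (assumed form for this sketch).
    return np.sin(f)


# Synthetic data (assumption): a line with a small oscillation on top.
x = np.linspace(0.0, 10.0, 300)
f_true = 0.1 * x + 0.05 * np.sin(x)
data = g(f_true)

k = 3                                             # cubic spline
interior_knots = np.linspace(0.0, 10.0, 12)[1:-1]  # far fewer knots than data points

# Initial guess: spline representation of a plain straight line on the same knots.
t, c0, _ = splrep(x, 0.1 * x, k=k, t=interior_knots)
n_coef = len(t) - k - 1                           # number of coefficients actually used


def f_model(c_active, x_eval):
    # Evaluate the spline with the optimizer-provided coefficients.
    c_full = np.concatenate([c_active, np.zeros(k + 1)])
    return splev(x_eval, (t, c_full, k))


def residuals(c_active):
    # Compare g(spline) with the recorded data, not the spline itself.
    return g(f_model(c_active, x)) - data


c_fit, ier = leastsq(residuals, c0[:n_coef])
f_fit = f_model(c_fit, x)
```

Because the knot vector t is fixed and shared between splrep and splev, the spline and its derivative stay continuous by construction, while leastsq is free to adjust the coefficients.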
Upon some research I found two functions in MATLAB to do the task:
cvpartition function in the Statistics Toolbox
crossvalind function in the Bioinformatics Toolbox
Now I've used the cvpartition to create n-fold cross validation subsets before, along with the Dataset/Nominal classes from the Statistics toolbox. So I'm just wondering what are the differences between the two and the pros/cons of each?
Expanding on @Mr Fooz's answer
They look to be pretty similar based on the official docs of cvpartition and crossvalind, but crossvalind looks slightly more flexible (it allows for leave M out for arbitrary M, whereas cvpartition only allows for leave 1 out).
... isn't it true that you can always simulate a leave-M-out using kfold cross validation with an appropriate k value (split data into k fold, test on one, train on all others, and do this for all folds and take average) since leave-one-out is a special case of kfold where k=number of observations?
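To make the k-fold idea concrete, here is a small language-neutral sketch written in Python with NumPy rather than MATLAB. The function name kfold_indices and the toy sizes are made up for illustration. With k equal to the number of observations it reduces to leave-one-out, and with k = n/M each test fold holds M observations. This only covers disjoint groups of size M, not every possible subset of M observations.

```python
import numpy as np


def kfold_indices(n_samples, k, seed=0):
    # Shuffle the sample indices once, then cut them into k nearly equal folds.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    folds = np.array_split(perm, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx


# 12 observations in 4 folds: every test fold leaves out M = 3 observations.
splits = list(kfold_indices(12, 4))

# k equal to the number of observations gives leave-one-out.
loo_splits = list(kfold_indices(12, 12))
```

Each observation appears in exactly one test fold, so averaging the per-fold test errors uses every observation once for testing.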
Amro, this is not directly an answer to your cvpartition vs crossvalind question, but there is a contribution at the Mathworks File Exchange called MulticlassGentleAdaboosting by user Sebastian Paris that includes a nice set of functions for enumerating array indices for computing training, testing and validation sets for the following sampling and cross-validation strategies:
Hold out
Bootstrap
K Cross-validation
Leave One Out
Stratified Cross Validation
Balanced Stratified Cross Validation
Stratified Hold out
Stratified Boot Strap
For details, see the demo files included in the package, and more specifically the functions sampling.m and sampling_set.m.
They look to be pretty similar based on the official docs of cvpartition and crossvalind, but crossvalind looks slightly more flexible (it allows for leave M out for arbitrary M, whereas cvpartition only allows for leave 1 out).
I know your question is not directly referring to the Neural Network Toolbox, but perhaps someone else might find this useful. To get your ANN input data separated into test/validation/train data, use the 'net.divideFcn' variable.
net.divideFcn = 'divideind';
net.divideParam.trainInd=1:94; % The first 94 inputs are for training.
net.divideParam.valInd=1:94; % The first 94 inputs are also used for validation (these overlap with the training indices).
net.divideParam.testInd=95:100; % The last 6 inputs (95 to 100) are for testing the network.