Performing analysis of covariance with python/scipy/statsmodel - scipy

Could anyone please help in providing an example showing how ANCOVA (analysis of covariance) can be done in scipy/statsmodel, with python?
I am not sure if I am asking too much, but a quick search showed me this which is not informative enough for me.
Thanks!

Statsmodels uses the linear model, OLS, to estimate ANOVA. So, having additional continuous regressors as in ANCOVA does not change the analysis.
Here are a few links to the relevant documentation
Anova helper functions and examples for ANCOVA interactions
http://statsmodels.sourceforge.net/devel/examples/generated/example_interactions.html
using formulas to create the design matrix
http://statsmodels.sourceforge.net/devel/example_formulas.html
the core OLS model
http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.OLS.html

Related

Multivariate regression in Matlab

I have been all over Google trying to find a good function/package to perform multivariate regression (i.e. predict multiple continuous variables given another set of multiple continuous variables).
I wish to use something like fitlm(), since that also gives me p-value statistics and R squared statistics. Does anything like that exist?
Matlab has a bundle of tools for this, see this page.
I believe that mvregress is the most rounded and mainstream tool. See this page for setting up an analysis with it.
Also, a comment in this post may be useful for alternatives, if needed: it is possible to approach this via separate regression analyses, one for each response variable.

MATLAB - How to set sigma in templateSVM

I have a dataset with multiple labeled vectors and I wanted to perform a multi-class SVM with RBF Kernel with the integrated function in MATLAB called 'templateSVM'.
To do so, I use the templateSVM function with the following command:
t = templateSVM('BoxConstraint', 1, 'KernelFunction', 'rbf')
The problem is that I cannot find how to set the 'sigma' parameter.
Thanks to previous computations, I know that C=1 and sigma=8 are the best parameters to get the best results. Not knowing how to set sigma leads me to awful results.
Would you know how to set this parameter?
Thanks a lot in advance.
Unfortunately the options available with templateSVM seem to be quite limited (I had this problem myself and couldn't find a solution). There are some crucial options (such as the RBF sigma parameter) that do not seem to be available with templateSVM but are available with svmtrain.
I know that this isn't a real answer to your question, but I suggest that you look into using libsvm instead - it is very configurable and integrates well with Matlab.
I know it's an old question, but the answer would be useful for new users.
Link below can answer the question:
https://www.mathworks.com/matlabcentral/answers/336748-support-vector-machine-parameters-matlab
"setting SIGMA": Use the 'KernelScale' name-value pair.

Adding Interaction Terms to MATLAB Multiple Regression

I am currently running a multiple linear regression using MATLAB's LinearModel.fit function, and I am bit confused in regards to how to properly add interaction terms to the model by hand. As I am aware, LinearModel.fit does not standardize variables on its own, so I have been doing so manually.
So far, the way I have done it has been to
Standardize the observations for each variables
Multiply corresponding standardized values from specific variables to create the interaction terms and then add these new variables to the set of regression data
Run the regression
Is this the correct way to go about doing this? Should I standardize the interaction term variables also after calculating the 'raw' terms? Any help would be greatly appreciated!
Whether or not to standardize interaction terms probably depends on what you intend to do with the model. Standardization typically does not affect model performance as much as it allows for more straightforward model interpretation as your learned coefficients will be on similar scales. I suspect whether to do this or not is largely a matter of opinion. Here is a relevant stats.stackexchange post that may help.
My intuition would be the same as how you have described your process so far.

Feature Selection methods in MATLAB?

I am trying to do some text classification with SVMs in MATLAB and really would to know if MATLAB has any methods for feature selection(Chi Sq.,MI,....), For the reason that I wan to try various methods and keeping the best method, I don't have time to implement all of them. That's why I am looking for such methods in MATLAB.Does any one know?
svmtrain
MATLAB has other utilities for classification like cluster analysis, random forests, etc.
If you don't have the required toolbox for svmtrain, I recommend LIBSVM. It's free and I've used it a lot with good results.
The Statistics Toolbox has sequentialfs. See also the documentation on feature selection.
A similar approach is dimensionality reduction. In MATLAB you can easily perform PCA or Factor analysis.
Alternatively you can take a wrapper approach to feature selection. You would search through the space of features by taking a subset of features each time, and evaluating that subset using any classification algorithm you decide (LDA, Decision tree, SVM, ..). You can do this as an exhaustively or using some kind of heuristic to guide the search (greedy, GA, SA, ..)
If you have access to the Bioinformatics Toolbox, it has a randfeatures function that does a similar thing. There's even a couple of cool demos of actual use cases.
May be this might help:
There are two ways of selecting the features in the classification:
Using fselect.py from libsvm tool directory (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#feature_selection_tool)
Using sequentialfs from statistics toolbox.
I would recommend using fselect.py as it provides more options - like automatic grid search for optimum parameters (using grid.py). It also provides an F-score based on the discrimination ability of the features (see http://www.csie.ntu.edu.tw/~cjlin/papers/features.pdf for details of F-score).
Since fselect.py is written in python, either you can use python interface or as I prefer, use matlab to perform a system call to python:
system('python fselect.py <training file name>')
Its important that you have python installed, libsvm compiled (and you are in the tools directory of libsvm which has grid.py and other files).
It is necessary to have the training file in libsvm format (sparse format). You can do that by using sparse function in matlab and then libsvmwrite.
xtrain_sparse = sparse(xtrain)
libsvmwrite('filename.txt',ytrain,xtrain_sparse)
Hope this helps.
For sequentialfs with libsvm, you can see this post:
Features selection with sequentialfs with libsvm

How to optimize neural network by using genetic algorithm?

I'm quite new with this topic so any help would be great. What I need is to optimize a neural network in MATLAB by using GA. My network has [2x98] input and [1x98] target, I've tried consulting MATLAB help but I'm still kind of clueless about what to do :( so, any help would be appreciated. Thanks in advance.
Edit: I guess I didn't say what is there to be optimized as Dan said in the 1st answer. I guess most important thing is number of hidden neurons. And maybe number of hidden layers and training parameters like number of epochs or so. Sorry for not providing enough info, I'm still learning about this.
If this is a homework assignment, do whatever you were taught in class.
Otherwise, ditch the MLP entirely. Support vector regression ( http://www.csie.ntu.edu.tw/~cjlin/libsvm/ ) is much more reliably trainable across a broad swath of problems, and pretty much never runs into the stuck-in-a-local-minima problem often hit with back-propagation trained MLP which forces you to solve a network topography optimization problem just to find a network which will actually train.
well, you need to be more specific about what you are trying to optimize. Is it the size of the hidden layer? Do you have a hidden layer? Is it parameter optimization (learning rate, kernel parameters)?
I assume you have a set of parameters (# of hidden layers, # of neurons per layer...) that needs to be tuned, instead of brute-force searching all combinations to pick a good one, GA can help you "jump" from this combination to another one. So, you can "explore" the search space for potential candidates.
GA can help in selecting "helpful" features. Some features might appear redundant and you want to prune them. However, say, data has too many features to search for the best set of features by some approaches such as forward selection. Again, GA can "jump" from this set candidate to another one.
You will need to find away to encode the data (input parameters, features...) fed to GA. For finding a set of input paras or a good set of features, I think binary encoding should work. In addition, choosing operators for GA to reproduce offsprings is also important. Yet GA needs to be tuned, too (early stopping which can also be applied to ANN).
Here are just some ideas. You might want to search for more info about GA, feature selection, ANN pruning...
Since you're using MATLAB already I suggest you look into the Genetic Algorithms solver (known as GATool, part of the Global Optimization Toolbox) and the Neural Network Toolbox. Between those two you should be able to save quite a bit of figuring out.
You'll basically have to do 2 main tasks:
Come up with a representation (or encoding) for your candidate solutions
Code your fitness function (which basically tests candidate solutions) and pass it as a parameter to the GA solver.
If you need help in terms of coming up with a fitness function, or encoding of candidate solutions then you'll have to be more specific.
Hope it helps.
Matlab has a simple but great explanation for this problem here. It explains both the ANN and GA part.
For more info on using ANN in command line see this.
There is also plenty of litterature on the subject if you google it. It is however not related to MATLAB, but simply the results and the method.
Look up Matthew Settles on Google Scholar. He did some work in this area at the University of Idaho in the last 5-6 years. He should have citations relevant to your work.