Are there simple in-built SPSS functions for linear regression?
How do I get the residual series?
I want to perform a White test, Durbin-Watson (or inspect a correlogram of residuals), F-test for redundant variables and look at the Variance Inflation Factors.
I would have assumed that these would be standard functions for any stats package but I can't find them...
thanks!
I am puzzled why you can't find these. They are available in the REGRESSION procedure (Analyze > Regression > Linear), GENLIN (Analyze > Generalized Linear Models), or ACF (Analyze > Forecasting > Autocorrelations).
Related
I have a dataset of videos with x1,..,x24 features, and y as a score. I want to perform feature selection to reduce the error rate. I've found sequentialfs in Matlab but it's for classification.
Is there any feature selection function for regression? If so, could you provide an example?
It seems sequentialfs handles regression as well as classification:
Typical loss measures include sum of squared errors for regression models (sequentialfs computes the mean-squared error in this case), and the number of misclassified observations for classification models (sequentialfs computes the misclassification rate in this case).
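So the same function works here; a minimal sketch of forward selection for regression, assuming X is an observations-by-features matrix of the 24 video features and y is the score vector (the data below is random placeholder data):
% Hypothetical data: 100 videos, 24 features each, one score per video.
X = randn(100, 24);
y = randn(100, 1);
% Criterion: fit a linear regression on the training fold and return the sum
% of squared errors on the test fold (intercept omitted for brevity).
critfun = @(Xtr, ytr, Xte, yte) sum((yte - Xte * (Xtr \ ytr)).^2);
% Forward selection with 10-fold cross-validation.
opts = statset('Display', 'iter');
[selected, history] = sequentialfs(critfun, X, y, 'cv', 10, 'options', opts);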
I have a large dataset of multidimensional data (240 dimensions).
I am a beginner at performing data mining and I want to apply Linear Discriminant Analysis using MATLAB. However, I have seen a lot of functions explained on the web, and I do not understand how they should be applied.
Basically, I want to apply LDA.
After this step I want to be able to do a reconstruction for my data.
I can do this manually, but I was wondering if there are any predefined functions which can do this because they should already be optimized.
My initial data is something like: size(x) = [2000 240]. So basically I have 240 features (dimensions) and 2000 data points. And I want to perform LDA on this data set.
The function classify from Statistics Toolbox does Linear (and, if you set some options, Quadratic) Discriminant Analysis. There are a couple of worked examples in the documentation that explain how it should be used: type doc classify or showdemo classdemo to see them.
240 features is quite a lot given that you only have 2000 observations, even if you have only two classes. You might want to apply a dimension reduction method before LDA, such as PCA (see doc princomp) or use a feature selection method (see doc sequentialfs for one such method).
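As a rough sketch of that pipeline, assuming x is your 2000-by-240 matrix and labels is a 2000-by-1 vector of class labels (the variable names, the 20 retained components, and the train/test split are placeholders):
% Reduce dimensionality with PCA before LDA.
[coeff, score] = princomp(x);          % principal component scores
xReduced = score(:, 1:20);             % keep, say, the first 20 components
% Hold out some rows for testing and classify them with LDA.
testIdx  = 1:500;
trainIdx = 501:2000;
predicted = classify(xReduced(testIdx,:), xReduced(trainIdx,:), ...
                     labels(trainIdx), 'linear');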
You can use fitcdiscr for classification using LDA in MATLAB 2014 and later.
I have painstakingly gathered data for a proof-of-concept study I am performing. The data consists of 40 different subjects, each with 12 parameters measured at 60 time intervals and 1 output parameter being 0 or 1. So I am building a binary classifier.
I knew beforehand that there is a non-linear relation between the input parameters and the output, so a simple perceptron or Bayes classifier would be unable to classify the sample. This assumption proved correct after initial tests.
Therefore I went to neural networks, and as I hoped the results were pretty good. An error of about 1-5% is generally the result. The training is done using 70% of the data for training and 30% for evaluation. Running the complete dataset (100%) through the model again, I was very happy with the results. The following is a typical confusion matrix (P = positive, N = negative):
        P    N
P      13    2
N       3   42
So I am happy, and given that I used 30% of the data for evaluation, I am confident that I am not fitting noise.
Therefore I turned to SVMs for a double check, but the SVM was unable to converge to a good solution. Most of the time the solutions are terrible (say 90% error...). Maybe I do not fully understand SVMs or the implementations are not correct, but it troubles me because I thought that when a NN provides a good solution, SVMs are usually better at separating the data due to their maximum-margin hyperplane.
What does this say of my result? Am I fitting noise? And how do I know if this is a correct result?
I am using Encog for the calculations but the NN results are comparable to home-grown NN models I made.
If it is your first time using SVMs, I strongly recommend you take a look at A Practical Guide to Support Vector Classification, by the authors of the well-known SVM package libsvm. It gives a list of suggestions for training your SVM classifier:
Transform data to the format of an SVM package
Conduct simple scaling on the data
Consider the RBF kernel
Use cross-validation to find the best parameters C and γ
Use the best parameters C and γ to train the whole training set
Test
In short, try scaling your data and carefully choosing the kernel plus the parameters.
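The guide is written around libsvm, but the same recipe can be sketched with the Statistics Toolbox function fitcsvm (available in newer MATLAB releases); here X and y are placeholders for your flattened feature matrix and binary labels, and the grids for C and the kernel scale are just examples:
% Scale features, use an RBF kernel, and cross-validate over a small grid.
bestLoss = Inf;
for C = [0.1 1 10 100]
    for sigma = [0.1 1 10]
        mdl = fitcsvm(X, y, 'Standardize', true, ...
                      'KernelFunction', 'rbf', ...
                      'BoxConstraint', C, 'KernelScale', sigma);
        cvmdl = crossval(mdl, 'KFold', 5);     % 5-fold cross-validation
        loss = kfoldLoss(cvmdl);               % cross-validated error rate
        if loss < bestLoss
            bestLoss = loss; bestC = C; bestSigma = sigma;
        end
    end
end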
I have a set of 100 observations where each observation has 45 characteristics. And each one of those observations have a label attached which I want to predict based on those 45 characteristics. So it's an input matrix with the dimension 45 x 100 and a target matrix with the dimension 1 x 100.
The thing is that I want to know how many of those 45 characteristics are relevant in my set of data, basically via principal component analysis, and I understand that I can do this with the MATLAB function processpca.
Could you please tell me how I can do this? Suppose that the input matrix is x with 45 rows and 100 columns and y is a vector with 100 elements.
Assuming that you want to construct a model of the 1x100 vector, based on the 45x100 matrix, I am not convinced that PCA will do what you think. PCA can be used to select variables for model estimation, but this is a somewhat indirect way to gather a set of model features. Anyway, I suggest reading both:
Principal Components Analysis
and...
Putting PCA to Work
...both of which provide code in MATLAB not requiring any Toolboxes.
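If you want to see the core idea without any Toolbox, PCA boils down to an SVD of the centered data; a minimal sketch (assuming x is the 45-by-100 matrix with variables in rows):
X = x';                                    % observations in rows, variables in columns
Xc = bsxfun(@minus, X, mean(X, 1));        % center each variable
[U, S, V] = svd(Xc, 'econ');               % SVD of the centered data
coeff = V;                                 % loadings (principal directions)
score = Xc * V;                            % principal component scores
explained = diag(S).^2 / sum(diag(S).^2);  % fraction of variance per component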
Have you tried COEFF = princomp(x)?
COEFF = princomp(X) performs principal components analysis (PCA) on the n-by-p data matrix X, and returns the principal component coefficients, also known as loadings. Rows of X correspond to observations, columns to variables. COEFF is a p-by-p matrix, each column containing coefficients for one principal component. The columns are in order of decreasing component variance.
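Given the quoted convention (rows of X are observations), the 45-by-100 input would need to be transposed first. A short sketch of how the returned component variances can be used to judge how many dimensions carry most of the information (the 95% threshold is just an example):
[coeff, score, latent] = princomp(x');     % x' is 100 observations by 45 variables
explained = 100 * latent / sum(latent);    % percent of variance per component
nKeep = find(cumsum(explained) >= 95, 1);  % components needed to reach 95%
xReduced = score(:, 1:nKeep);              % 100-by-nKeep reduced representation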
From your question I deduce that you don't need to do it in MATLAB; you just want to analyze your dataset. In my opinion, the key is visualization of the dependencies.
If you're not forced to do the analysis in MATLAB, I'd suggest you try more specialized software such as WEKA (www.cs.waikato.ac.nz/ml/weka/) or RapidMiner (rapid-i.com). Both tools provide PCA and other dimension reduction algorithms, and they contain nice visualization tools.
Your use case sounds like a combination of Classification and Feature Selection.
Statistics Toolbox offers a lot of good capabilities in this area. The toolbox provides access to a number of classification algorithms including
Naive Bayes Classifiers
Bagged Decision Trees (aka Random Forests)
Binomial and Multinomial logistic regression
Linear Discriminant Analysis
You also have a variety of options available for feature selection, including
sequentialfs (forward and backward feature selection)
relieff
"treebagger" also supports options for feature selection and estimating variable importance.
Alternatively, you can use some of Optimization Toolbox's capabilities to write your own custom equations to estimate variable importance.
A couple of months back, I did a webinar for The MathWorks titled "Computational Statistics: Getting Started with Classification using MATLAB". You can watch the webinar at
http://www.mathworks.com/company/events/webinars/wbnr51468.html?id=51468&p1=772996255&p2=772996273
The code and the data set for the examples is available at MATLAB Central
http://www.mathworks.com/matlabcentral/fileexchange/28770
With all this said and done, many people use Principal Component Analysis as a pre-processing step before applying classification algorithms. PCA gets used a lot
When you need to extract features from images
When you're worried about multicollinearity
You should compute the correlation matrix. In the following example, MATLAB computes the correlation matrix with the corr function:
http://www.mathworks.com/help/stats/feature-transformation.html#f75476
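A minimal sketch of that idea (corr expects variables in columns, so the 45-by-100 matrix is transposed; the visualization line is optional):
R = corr(x');                  % 45-by-45 matrix of pairwise feature correlations
imagesc(abs(R)); colorbar;     % visualize which features are strongly related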
I would like to implement similarity search in MATLAB. I want to know: is it possible?
My plan is to use two popular similarity measures, Euclidean distance and Dynamic Time Warping (DTW). Both of these will be applied to a time series dataset. My question at this point is: how can I evaluate the performance and accuracy of these two measures? I have seen some literature saying I should use the k-NN algorithm.
Then, I plan to apply dimensionality reduction to the time series dataset. After reducing the dimensionality of the dataset, I will need to index it using an R-tree or any other available indexing technique.
However, my problem is that to do this I need R-tree MATLAB code, which I have hardly been able to find anywhere on the internet...
I do realise that most implementations of similarity search are in C++, C and Java... but I'm not familiar with those. I'm hoping I can implement these in MATLAB... Could any guru help me with this?
Also, what kind of evaluation can I make to assess the performance of each algorithm?
Thanks
Recently (R2010a I believe), MATLAB added new functions for k-Nearest Neighbor (kNN) searching using KD-tree (a spatial indexing method similar to R-tree) to the Statistics Toolbox. Example:
load fisheriris % Iris dataset
Q = [6 3 4 1 ; 5 4 3 2]; % query points
% build kd-tree
knnObj = createns(meas, 'NSMethod','kdtree', 'Distance','euclidean');
% find k=5 Nearest Neighbors to Q
[idx Dist] = knnsearch(knnObj, Q, 'K',5);
Refer to this page for a nice description.
Also if you have the Image Processing Toolbox, it contains (for a long time now) an implementation of the kd-tree and kNN searching. They are private functions though:
[matlabroot '\images\images\private\kdtree.m']
[matlabroot '\images\images\private\nnsearch.m']
To compare your two approaches (Dynamic Time Warping and Euclidean distance), you can design a classic classification problem: given a set of labeled training/testing time series, the task is to predict the label of each test sequence by finding the most similar training sequences using kNN, then predicting the majority class. To evaluate performance, use any of the standard classification measures such as accuracy/error, etc.
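A minimal sketch of that evaluation for the Euclidean case (trainX, trainLabels, testX and testLabels are placeholder names for your labeled series; for DTW you would swap in a DTW distance when finding the nearest neighbour):
idx = knnsearch(trainX, testX, 'K', 1);     % nearest training series for each test series
predicted = trainLabels(idx);               % 1-NN label prediction
accuracy = mean(predicted == testLabels);   % assumes numeric labels; fraction correct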
It turns out that it is MUCH faster, for both ED and DTW, to do a sequential scan, using the UCR suite.
See this video
https://www.youtube.com/watch?v=d_qLzMMuVQg
or this
https://www.youtube.com/watch?v=c7xz9pVr05Q
The code is freely available: http://www.cs.ucr.edu/~eamonn/UCRsuite.html