Calculate cross validation for Generalized Linear Model in Matlab

I am doing a regression using a Generalized Linear Model. I was caught off guard by the crossval function. My implementation so far:
x = 'Some dataset, containing the input and the output'
X = x(:,1:7);
Y = x(:,8);
cvpart = cvpartition(Y,'holdout',0.3);
Xtrain = X(training(cvpart),:);
Ytrain = Y(training(cvpart),:);
Xtest = X(test(cvpart),:);
Ytest = Y(test(cvpart),:);
mdl = GeneralizedLinearModel.fit(Xtrain,Ytrain,'linear','distr','poisson');
Ypred = predict(mdl,Xtest);
res = (Ypred - Ytest);
RMSE_test = sqrt(mean(res.^2));
The code below calculates cross-validation for multiple regression, as obtained from this link. I want something similar for a Generalized Linear Model:
c = cvpartition(Y,'k',10);
regf = @(Xtrain,Ytrain,Xtest)(Xtest*regress(Ytrain,Xtrain));
cvMse = crossval('mse',X,Y,'predfun',regf)

You can either perform the cross-validation process manually (training a model for each fold, predict outcome, compute error, then report the average across all folds), or you can use the CROSSVAL function which wraps this whole procedure in a single call.
To give an example, I will first load and prepare a dataset (a subset of the cars dataset which ships with the Statistics Toolbox):
% load regression dataset
load carsmall
X = [Acceleration Cylinders Displacement Horsepower Weight];
Y = MPG;
% remove instances with missing values
missIdx = isnan(Y) | any(isnan(X),2);
X(missIdx,:) = [];
Y(missIdx) = [];
clearvars -except X Y
Option 1
Here we manually partition the data into k folds using cvpartition (non-stratified). For each fold, we train a GLM on the training data, then use the model to predict the output for the test data. Next we compute and store the mean squared error for this fold. At the end, we report the average RMSE across all partitions.
% partition data into 10 folds
K = 10;
cv = cvpartition(numel(Y), 'kfold',K);
mse = zeros(K,1);
for k = 1:K
    % training/testing indices for this fold
    trainIdx = cv.training(k);
    testIdx = cv.test(k);
    % train GLM model
    mdl = GeneralizedLinearModel.fit(X(trainIdx,:), Y(trainIdx), ...
        'linear', 'Distribution','poisson');
    % predict regression output
    Y_hat = predict(mdl, X(testIdx,:));
    % compute mean squared error
    mse(k) = mean((Y(testIdx) - Y_hat).^2);
end
% average RMSE across k-folds
avrg_rmse = mean(sqrt(mse))
Option 2
Here we can simply call CROSSVAL with an appropriate function handle which computes the regression output given a set of train/test instances. See the doc page to understand the parameters.
% prediction function given training/testing instances
fcn = @(Xtr, Ytr, Xte) predict(...
    GeneralizedLinearModel.fit(Xtr,Ytr,'linear','distr','poisson'), ...
    Xte);
% perform cross-validation, and return average MSE across folds
mse = crossval('mse', X, Y, 'Predfun',fcn, 'kfold',10);
% compute root mean squared error
avrg_rmse = sqrt(mse)
You should get a similar result compared to before (slightly different of course, on account of the randomness involved in the cross-validation).
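As a side note, both options draw their folds at random, so if you want reproducible numbers you can seed the random number generator before creating the partition; a minimal sketch:
% fix the seed so cvpartition/crossval draw the same folds every run
rng(1234)
mse = crossval('mse', X, Y, 'Predfun',fcn, 'kfold',10);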

Related

Function to implement SVM Matlab

I have to implement an SVM classifier that recognizes labels. The code is this:
function [Y_SVM_test] = getSVM(x,y,z,labels)
    % matrix that contains x,y,z
    X = [];
    % vector of labels
    Y = [];
    X = [X; x y z];
    Y = [Y; labels];
    cv = cvpartition(length(X),'holdout',0.2);
    % Training set
    Xtrain = X(training(cv),:);
    Ytrain = Y(training(cv));
    % Test set
    Xtest = X(test(cv),:);
    Ytest = Y(test(cv));
    tic
    mySVM = fitcecoc(Xtrain,Ytrain);
    toc
    Y_SVM_test = predict(mySVM,Xtest);
end
With the function fitcecoc the execution never ends. Am I using it incorrectly? I also tried the function fitcsvm, which seems more specific from the documentation, but the error I get is the following: Error using ClassificationSVM.prepareData (line 686) You can not train an SVM model for more than 2 classes.
In general I have not understood what the best way to run an SVM in MATLAB is. Can someone help me?
Your code looks good to me. When you say it never ends, I would guess you just haven't waited long enough. If your dataset is fairly large, fitting an ECOC SVM model can take a long time.
Using fitcecoc is the right way to fit a multiclass SVM model. An SVM by itself is only a two-class model, which is fitted by fitcsvm. To fit a multiclass model, a wrapper is needed. ECOC is such a wrapper: it reduces the multiclass problem to a collection of two-class problems according to a coding design (by default in fitcecoc, one binary SVM for each pair of classes) and fits a two-class model for each of them. That's why it can take so long: it needs to fit multiple binary models.
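If training time is a problem, you can control the binary learners and the coding design yourself; a minimal, untested sketch (the 'UseParallel' option additionally requires the Parallel Computing Toolbox):
% standardizing the predictors often helps the binary SVM solvers converge
t = templateSVM('Standardize',true);
% 'onevsall' fits one binary learner per class instead of one per pair
mySVM = fitcecoc(Xtrain, Ytrain, 'Learners',t, 'Coding','onevsall', ...
    'Options',statset('UseParallel',true)); % needs Parallel Computing Toolbox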
PS: you don't need X = []; followed by X = [X; x y z];. Just write X = [x y z]; it has the same effect. Similarly, just write Y = labels;.

K-fold cross validation modification to generated ANN code?

My data set is basically a matrix of 3 variables (input) and a matrix of 1 variable (target). There are 50 data sets in total for each of these (basically 50 samples of f(x,y,z) = t).
I have only done the ANN training using the GUI. Never really with the script/code.
My simplest objective now is to split the data manually for each train-test run, so I can painstakingly run the neural network 5 times, but I'm not even sure how to manually select one range of the data set for training and another for testing.
Here's the full exported script from MATLAB. The point of focus is shown below the wall of code.
% Solve an Input-Output Fitting problem with a Neural Network
% Script generated by NFTOOL
% Created Mon Jul 17 02:39:31 SGT 2017
%
% This script assumes these variables are defined:
%
% DEinp - input data.
% DEcgl - target data.
inputs = DEinp;
targets = DEcgl;
% Create a Fitting Network
hiddenLayerSize = 10;
net = fitnet(hiddenLayerSize);
% Choose Input and Output Pre/Post-Processing Functions
% For a list of all processing functions type: help nnprocess
net.inputs{1}.processFcns = {'removeconstantrows','mapminmax'};
net.outputs{2}.processFcns = {'removeconstantrows','mapminmax'};
% Setup Division of Data for Training, Validation, Testing
% For a list of all data division functions type: help nndivide
net.divideMode = 'sample'; % Divide up every sample
net.divideParam.trainRatio = 70/100;
net.divideParam.valRatio = 15/100;
net.divideParam.testRatio = 15/100;
% For help on training function 'trainlm' type: help trainlm
% For a list of all training functions type: help nntrain
net.trainFcn = 'trainlm'; % Levenberg-Marquardt
% Choose a Performance Function
% For a list of all performance functions type: help nnperformance
net.performFcn = 'mse'; % Mean squared error
% Choose Plot Functions
% For a list of all plot functions type: help nnplot
net.plotFcns = {'plotperform','plottrainstate','ploterrhist', ...
'plotregression', 'plotfit'};
% Train the Network
[net,tr] = train(net,inputs,targets);
% Test the Network
outputs = net(inputs);
errors = gsubtract(targets,outputs);
performance = perform(net,targets,outputs)
% Recalculate Training, Validation and Test Performance
trainTargets = targets .* tr.trainMask{1};
valTargets = targets .* tr.valMask{1};
testTargets = targets .* tr.testMask{1};
trainPerformance = perform(net,trainTargets,outputs)
valPerformance = perform(net,valTargets,outputs)
testPerformance = perform(net,testTargets,outputs)
% View the Network
view(net)
% Plots
% Uncomment these lines to enable various plots.
%figure, plotperform(tr)
%figure, plottrainstate(tr)
%figure, plotfit(net,inputs,targets)
%figure, plotregression(targets,outputs)
%figure, ploterrhist(errors)
I figured that all I needed to do was mess with the net.divideMode section, but I really have no idea how to change the syntax to complete my objective.
Network Parameters
The process of splitting the data into training, validation and test sets happens in the section that you identified. I'm just going to break down each of the lines. Starting with:
% Setup Division of Data for Training, Validation, Testing
% For a list of all data division functions type: help nndivide
net.divideMode = 'sample'; % Divide up every sample
The divideMode is well documented in Neural Network Object Properties
net.divideMode
This property defines the target data dimensions which
to divide up when the data division function is called. Its default
value is 'sample' for static networks and 'time' for dynamic networks.
It may also be set to 'sampletime' to divide targets by both sample
and timestep, 'all' to divide up targets by every scalar value, or
'none' to not divide up data at all (in which case all data is used
for training, none for validation or testing).
So your network is a static network that divides the data up by sample. This will remain the same for your cross-validation. What you are interested in manipulating is the training, validation, and test splits.
net.divideParam.trainRatio = 70/100;
net.divideParam.valRatio = 15/100;
net.divideParam.testRatio = 15/100;
Okay, the variable names here seem promising, but you want a little more control than just choosing the ratio size.
Again, the Neural Network Object Properties documentation points us towards more information:
net.divideParam
This property defines the parameters and values of the current
data-division function. To get a description of what each field means,
type the following command:
help(net.divideFcn)
This will print out information about how your dataset is partitioned into training, validation, and test splits. In your current configuration, the message reads
dividerand Partition indices into three sets using random indices.
[trainInd,valInd,testInd] = dividerand(Q,trainRatio,valRatio,testRatio) takes a number of
samples Q and divides up the sample indices 1:Q between training,
validation and test indices.
dividerand randomly assigns sample indices to the three sets according to the three ratios.
(...)
See also divideblock, divideind, divideint, dividetrain.
Since you want more control of the partitions, you should check out these additional options.
I think the most promising is divideind. This option allows you to specify the indices for each partition. You can calculate the indices for each fold in your k-fold cross validation and reassign the partitions in each iteration using this option.
To set this parameter, replace the net.divideParam lines above with something like,
net.divideFcn = 'divideind';
% divideind takes exactly three parameters: the index vectors for each split
net.divideParam.trainInd = your_train_ind;
net.divideParam.valInd = your_val_ind;
net.divideParam.testInd = your_test_ind;
K-folds
One last detail: how do we select the indices? First, a quick review of k-fold cross-validation.
The data is split into k equally sized subsamples.
In each iteration of cross-validation, we train on k-1 of the subsamples and test on the remaining subsample, rotating to a new testing subsample each time.
An implementation sketch might look like this
k = 5; % as an example, let k = 5
sample_size = length(targets)/k; % assumes length(targets) is divisible by k
% make a vector of all the indices of your data, from 1 to the total number of instances
indices = 1:length(targets);
% optional: randomize the order of the samples
indices = randperm(length(targets));
% iterate in steps of sample_size
for ii = 1:sample_size:(length(targets) - sample_size + 1)
    % grab one subsample of indices for testing
    your_test_ind = indices(ii : ii + sample_size - 1);
    % everything else goes into training
    your_train_ind = indices([1:ii-1, ii + sample_size:end]);
    % train and test your network here!
end
This is just an implementation sketch: it assumes the number of instances divides evenly into k subsamples and doesn't handle the remainder otherwise, but it should be enough to get you started.
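Alternatively, if you have the Statistics Toolbox, cvpartition can generate the index sets for you; a minimal sketch, assuming targets holds one sample per column as in the generated script:
k = 5;
cv = cvpartition(size(targets,2), 'KFold', k);
for fold = 1:k
    net.divideFcn = 'divideind';
    net.divideParam.trainInd = find(training(cv, fold));
    net.divideParam.valInd = []; % no validation split in this sketch
    net.divideParam.testInd = find(test(cv, fold));
    % train and test your network here!
end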

Example of 10-fold cross-validation with Neural network classification in MATLAB

I am looking for an example of applying 10-fold cross-validation to a neural network. I need something like the answer to this question: Example of 10-fold SVM classification in MATLAB
I would like to classify all 3 classes while in the example only two classes were considered.
Edit: here is the code I wrote for the iris example:
load fisheriris %# load iris dataset
k = 10;
cvFolds = crossvalind('Kfold', species, k); %# get indices of 10-fold CV
net = feedforwardnet(10);
for i = 1:k %# for each fold
    testIdx = (cvFolds == i); %# get indices of test instances
    trainIdx = ~testIdx; %# get indices of training instances
    %# train
    net = train(net,meas(trainIdx,:)',species(trainIdx)');
    %# test
    outputs = net(meas(trainIdx,:)');
    errors = gsubtract(species(trainIdx)',outputs);
    performance = perform(net,species(trainIdx)',outputs)
    figure, plotconfusion(species(trainIdx)',outputs)
end
Error given by MATLAB:
Error using nntraining.setup>setupPerWorker (line 62)
Targets T{1,1} is not numeric or logical.
Error in nntraining.setup (line 43)
[net,data,tr,err] = setupPerWorker(net,trainFcn,X,Xi,Ai,T,EW,enableConfigure);
Error in network/train (line 335)
[net,data,tr,err] = nntraining.setup(net,net.trainFcn,X,Xi,Ai,T,EW,enableConfigure,isComposite);
Error in Untitled (line 17)
net = train(net,meas(trainIdx,:)',species(trainIdx)');
It's a lot simpler to just use MATLAB's crossval function than to do it manually with crossvalind. Since you are just asking how to get the test "score" from cross-validation, as opposed to using it to choose an optimal parameter such as the number of hidden nodes, your code will be as simple as this:
load fisheriris;
% // Split up species into 3 binary dummy variables
S = unique(species);
O = [];
for s = 1:numel(S)
    O(:,end+1) = strcmp(species, S{s});
end
% // Crossvalidation
vals = crossval(@(XTRAIN, YTRAIN, XTEST, YTEST)fun(XTRAIN, YTRAIN, XTEST, YTEST), meas, O);
All that remains is to write that function fun which takes in input and output training and test sets (all provided to it by the crossval function so you don't need to worry about splitting your data yourself), trains a neural net on the training set, tests it on the test set and then output a score using your preferred metric. So something like this:
function testval = fun(XTRAIN, YTRAIN, XTEST, YTEST)
    net = feedforwardnet(10);
    net = train(net, XTRAIN', YTRAIN');
    yNet = net(XTEST');
    %// find which output (of the three dummy variables) has the highest probability
    [~,classNet] = max(yNet',[],2);
    %// convert YTEST into a format that can be compared with classNet
    [~,classTest] = max(YTEST,[],2);
    %// check the success of the classifier
    cp = classperf(classTest, classNet);
    testval = cp.CorrectRate; %// replace this with your preferred metric
end
I don't have the neural network toolbox so I am unable to test this I'm afraid. But it should demonstrate the principle.
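Note that classperf comes from the Bioinformatics Toolbox; if you don't have it, a plain accuracy computation should work just as well as the score:
testval = mean(classNet == classTest); %// fraction of test instances classified correctly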

10-fold cross validation for polynomial regressions

I want to use a 10-fold cross-validation method which tests which polynomial form (first, second, or third order) gives a better fit. I want to divide my data set into 10 subsets and remove one subset at a time: derive a regression model without this subset, predict the output values for this subset using the derived regression model, and compute the residuals. Finally, repeat the calculation routine for each subset and sum the squares of the resulting residuals.
I have already coded the following in MATLAB 2013b, which samples the data and tests the regression on the training data. I am stuck on how to repeat this for every subset and how to compare which polynomial form gives a better fit.
% Sample the data
parm = [AT];
n = length(parm);
k = 10; % how many parts to use
allix = randperm(n); % all data indices, randomly ordered
numineach = ceil(n/k); % at least one part must have this many data points
allix = reshape([allix NaN(1,k*numineach-n)],k,numineach);
for p = 1:k
    testix = allix(p,:); % indices to use for testing
    testix(isnan(testix)) = []; % remove NaNs if necessary
    trainix = setdiff(1:n,testix); % indices to use for training
    %train = parm(trainix); %gives the training data
    %test = parm(testix); %gives the testing data
end
% Derive regression on the training data
Sal = Salinity(trainix);
Temp = Temperature(trainix);
At = parm(trainix);
xyz =[Sal Temp At];
% Fit a Polynomial Surface
surffit = fit([xyz(:,1), xyz(:,2)],xyz(:,3), 'poly11');
% Shows equation, rsquare, rmse
[b,bint,r] = fit([xyz(:,1), xyz(:,2)],xyz(:,3), 'poly11');
Regarding executing your code for every subset, you can put the fit inside the loop and store the results, e.g.
% Sample the data
parm = [AT];
n = length(parm);
k = 10; % how many parts to use
allix = randperm(n); % all data indices, randomly ordered
numineach = ceil(n/k); % at least one part must have this many data points
allix = reshape([allix NaN(1,k*numineach-n)],k,numineach);
bAll = []; gofAll = []; outAll = [];
for p = 1:k
    testix = allix(p,:); % indices to use for testing
    testix(isnan(testix)) = []; % remove NaNs if necessary
    trainix = setdiff(1:n,testix); % indices to use for training
    %train = parm(trainix); %gives the training data
    %test = parm(testix); %gives the testing data
    % Derive regression on the training data
    Sal = Salinity(trainix);
    Temp = Temperature(trainix);
    At = parm(trainix);
    xyz = [Sal Temp At];
    % Fit a polynomial surface; the second output (the goodness-of-fit
    % structure) holds rsquare and rmse for this fold
    [surffit, gof, output] = fit([xyz(:,1), xyz(:,2)], xyz(:,3), 'poly11');
    bAll = [bAll, coeffvalues(surffit)]; gofAll = [gofAll, gof]; outAll = [outAll, output];
end
Regarding the best fit, you can probably pick the fit with the lowest RMSE on the held-out subsets.
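To compare the three polynomial forms directly, you can repeat the loop once per form and average the test-set RMSE; a minimal, untested sketch, reusing n, k, and allix from above and assuming Salinity, Temperature, and parm are column vectors:
forms = {'poly11','poly22','poly33'}; % first-, second-, and third-order surfaces
rmseAll = zeros(k, numel(forms));
for p = 1:k
    testix = allix(p,:); testix(isnan(testix)) = [];
    trainix = setdiff(1:n, testix);
    for f = 1:numel(forms)
        % fit on the training subset, evaluate residuals on the held-out subset
        sf = fit([Salinity(trainix), Temperature(trainix)], parm(trainix), forms{f});
        res = parm(testix) - sf(Salinity(testix), Temperature(testix));
        rmseAll(p,f) = sqrt(mean(res.^2));
    end
end
mean(rmseAll, 1) % the form with the lowest average test RMSE wins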

PCA using princomp in MATLAB (for face recognition)

I'm trying to do dimensionality reduction using MATLAB's princomp, but I'm not sure I'm doing it right.
Here is my code, just for testing, but I'm not sure I'm doing the projection right:
A = rand(4,3)
AMean = mean(A)
[n m] = size(A)
Ac = (A - repmat(AMean,[n 1]))
pc = princomp(A)
k = 2; %Number of first principal components
A_pca = Ac * pc(1:k,:)' %Not sure I'm doing projection right
reconstructedA = A_pca * pc(1:k,:)
error = reconstructedA- Ac
And my code for face recognition using ORL dataset:
%load orl_data 400x768 double matrix (400 images 768 features)
%make labels
orl_label = [];
for i = 1:40
    orl_label = [orl_label; ones(10,1)*i];
end
n = size(orl_data,1);
k = randperm(n);
s = round(0.25*n); %Take 25% for train
%Raw pixels
%Split on test and train sets
data_tr = orl_data(k(1:s),:);
label_tr = orl_label(k(1:s),:);
data_te = orl_data(k(s+1:end),:);
label_te = orl_label(k(s+1:end),:);
tic
[nn_ind, estimated_label] = EuclDistClassifier(data_tr,label_tr,data_te);
toc
rate = sum(estimated_label == label_te)/size(label_te,1)
%Using PCA
tic
pc = princomp(data_tr);
toc
mean_face = mean(data_tr);
pc_n = 100;
f_pc = pc(1:pc_n,:)';
data_pca_tr = (data_tr - repmat(mean_face, [s,1])) * f_pc;
data_pca_te = (data_te - repmat(mean_face, [n-s,1])) * f_pc;
tic
[nn_ind, estimated_label] = EuclDistClassifier(data_pca_tr,label_tr,data_pca_te);
toc
rate = sum(estimated_label == label_te)/size(label_te,1)
If I choose enough principal components it gives me equal recognition rates. If I use a small number of principal components (PCA) then the rate using PCA is poorer.
Here are some questions:
Is the princomp function the best way to calculate the first k principal components using MATLAB?
Do PCA-projected features give no extra accuracy over raw features, only a smaller feature vector size (faster to compare feature vectors)?
How can I automatically choose the minimum k (number of principal components) that gives the same accuracy as the raw feature vector?
What if I have a very big set of samples, can I use only a subset of them with comparable accuracy? Or can I compute PCA on one set and later "add" another set (I don't want to recompute PCA for set1+set2, but somehow iteratively add the information from set2 to the existing PCA of set1)?
I also tried the GPU version simply using gpuArray:
%Test using GPU
tic
A_cpu = rand(30000,32*24);
A = gpuArray(A_cpu);
AMean = mean(A);
[n m] = size(A)
pc = princomp(A);
k = 100;
A_pca = (A - repmat(AMean,[n 1])) * pc(1:k,:)';
A_pca_cpu = gather(A_pca);
toc
clear;
tic
A = rand(30000,32*24);
AMean = mean(A);
[n m] = size(A)
pc = princomp(A);
k = 100;
A_pca = (A - repmat(AMean,[n 1])) * pc(1:k,:)';
toc
clear;
It works faster, but it's not suitable for big matrices. Maybe I'm wrong?
If I use a big matrix, it gives me:
Error using gpuArray Out of memory on device.
"Is princomp function the best way to calculate first k principal components using MATLAB?"
It computes a full SVD, so it will be slow on large datasets. You can speed this up significantly by specifying the number of dimensions you need at the start and computing a partial SVD. The MATLAB function for a partial SVD is svds.
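For example, a minimal, untested sketch of swapping princomp for svds on the face data from the question (assuming data_tr and data_te as defined there):
pc_n = 100; % number of principal components to keep
mean_face = mean(data_tr,1);
Xc = bsxfun(@minus, data_tr, mean_face); % center the training data
[~,~,V] = svds(Xc, pc_n); % columns of V are the leading principal components
data_pca_tr = Xc * V; % project the training data
data_pca_te = bsxfun(@minus, data_te, mean_face) * V; % same mean and basis for test data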
If svds isn't fast enough for you, there's a more modern implementation here:
http://cims.nyu.edu/~tygert/software.html (matlab version: http://code.google.com/p/framelet-mri/source/browse/pca.m )
(cf. the paper describing the algorithm: http://cims.nyu.edu/~tygert/blanczos.pdf)
You can control the error of your approximation by increasing the number of singular vectors computed; there are precise bounds on that in the linked paper. Here's an example:
>> A = rand(40,30); %random rank-30 matrix
>> [U,S,V] = pca(A,2); %compute a rank-2 approximation to A
>> norm(A-U*S*V',2)/norm(A,2) %relative error
ans =
0.1636
>> [U,S,V] = pca(A,25); %compute a rank-25 approximation to A
>> norm(A-U*S*V',2)/norm(A,2) %relative error
ans =
0.0410
When you have large data and a sparse matrix, computing a full SVD is often impossible since the factors will never be sparse. In this case you must compute a partial SVD to fit within memory. Example:
>> A = sprandn(5000,5000,10000);
>> tic;[U,S,V]=pca(A,2);toc;
no pivots
Elapsed time is 124.282113 seconds.
>> tic;[U,S,V]=svd(A);toc;
??? Error using ==> svd
Use svds for sparse singular values and vectors.
>> tic;[U,S,V]=princomp(A);toc;
??? Error using ==> svd
Use svds for sparse singular values and vectors.
Error in ==> princomp at 86
[U,sigma,coeff] = svd(x0,econFlag); % put in 1/sqrt(n-1) later
>> tic;pc=princomp(A);toc;
??? Error using ==> eig
Use eigs for sparse eigenvalues and vectors.
Error in ==> princomp at 69
[coeff,~] = eig(x0'*x0);
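As the error messages suggest, the built-in svds also handles the sparse case directly; a quick sketch:
% rank-2 partial SVD of the same sparse matrix using the built-in svds
A = sprandn(5000,5000,10000);
tic; [U,S,V] = svds(A,2); toc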