MATLAB: random forest classifier 10-fold cross-validation accuracy

I have a dataset of 20000 instances with 4421 features. For scientific reasons (publication), I need to perform 10-fold cross-validation on this dataset and report both the per-fold and the average accuracy of a random forest classifier in MATLAB. Could you please tell me how to perform 10-fold CV on my dataset and obtain the classification accuracy?
Here is my code so far:
data = load ('HCTSA_N.mat');
% This makes sure we get the same results every time we run the code.
rng default
traindata = data.TS_DataMat;
trainlabels = {data.TimeSeries.Keywords};
% How many trees do you want in the forest?
nTrees = 20;
% Train the TreeBagger (Decision Forest).
B = TreeBagger(nTrees,traindata,trainlabels, 'Method', 'classification');
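One possible way to get per-fold and average accuracy is to partition the data with cvpartition and train one TreeBagger per fold. The following is only a sketch under assumptions: the fisheriris sample data (shipped with the Statistics and Machine Learning Toolbox) stands in for HCTSA_N.mat, and the variable names X, Y, foldAcc and meanAcc are illustrative.

```matlab
% Sketch: 10-fold cross-validated accuracy for a TreeBagger classifier.
% fisheriris is a stand-in dataset (assumption); substitute your own
% feature matrix and label cell array.
rng default                              % reproducible partition and forest
load fisheriris                          % meas: 150x4 features, species: 150x1 cellstr
X = meas;
Y = species;

K = 10;
nTrees = 20;
cv = cvpartition(Y, 'KFold', K);         % folds stratified by class label
foldAcc = zeros(K, 1);

for k = 1:K
    trainIdx = training(cv, k);          % logical mask of training rows
    testIdx  = test(cv, k);              % logical mask of held-out rows
    B = TreeBagger(nTrees, X(trainIdx,:), Y(trainIdx), ...
        'Method', 'classification');
    pred = predict(B, X(testIdx,:));     % cell array of predicted class names
    foldAcc(k) = mean(strcmp(pred, Y(testIdx)));
end

meanAcc = mean(foldAcc);
fprintf('Per-fold accuracy: %s\n', mat2str(foldAcc', 3));
fprintf('Mean accuracy: %.3f\n', meanAcc);
```

With the data from the question, X would be data.TS_DataMat and Y would be the label cell array reshaped to a column, for example trainlabels(:).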

Related

K-fold cross validation modification to generated ANN code?

My data set is basically a matrix of 3 variables (input), and a matrix of 1 variable (target). There are 50 total data sets for each of these (basically 50 samples of f(x,y,z) = t)
I have only done the ANN training using the GUI. Never really with the script/code.
My most simple objective now is to split the data manually for each train-test run, so I can just painstakingly run the neural network 5 times, but I'm not even sure how to manually select a range of the data set for use in training, and which one for testing.
Here's the full exported script from MATLAB. The point of focus is shown below the wall of code.
% Solve an Input-Output Fitting problem with a Neural Network
% Script generated by NFTOOL
% Created Mon Jul 17 02:39:31 SGT 2017
%
% This script assumes these variables are defined:
%
% DEinp - input data.
% DEcgl - target data.
inputs = DEinp;
targets = DEcgl;
% Create a Fitting Network
hiddenLayerSize = 10;
net = fitnet(hiddenLayerSize);
% Choose Input and Output Pre/Post-Processing Functions
% For a list of all processing functions type: help nnprocess
net.inputs{1}.processFcns = {'removeconstantrows','mapminmax'};
net.outputs{2}.processFcns = {'removeconstantrows','mapminmax'};
% Setup Division of Data for Training, Validation, Testing
% For a list of all data division functions type: help nndivide
net.divideMode = 'sample'; % Divide up every sample
net.divideParam.trainRatio = 70/100;
net.divideParam.valRatio = 15/100;
net.divideParam.testRatio = 15/100;
% For help on training function 'trainlm' type: help trainlm
% For a list of all training functions type: help nntrain
net.trainFcn = 'trainlm'; % Levenberg-Marquardt
% Choose a Performance Function
% For a list of all performance functions type: help nnperformance
net.performFcn = 'mse'; % Mean squared error
% Choose Plot Functions
% For a list of all plot functions type: help nnplot
net.plotFcns = {'plotperform','plottrainstate','ploterrhist', ...
'plotregression', 'plotfit'};
% Train the Network
[net,tr] = train(net,inputs,targets);
% Test the Network
outputs = net(inputs);
errors = gsubtract(targets,outputs);
performance = perform(net,targets,outputs)
% Recalculate Training, Validation and Test Performance
trainTargets = targets .* tr.trainMask{1};
valTargets = targets .* tr.valMask{1};
testTargets = targets .* tr.testMask{1};
trainPerformance = perform(net,trainTargets,outputs)
valPerformance = perform(net,valTargets,outputs)
testPerformance = perform(net,testTargets,outputs)
% View the Network
view(net)
% Plots
% Uncomment these lines to enable various plots.
%figure, plotperform(tr)
%figure, plottrainstate(tr)
%figure, plotfit(net,inputs,targets)
%figure, plotregression(targets,outputs)
%figure, ploterrhist(errors)
I figured that all I needed to do was mess with the net.divideMode section, but I really have no idea how to change the syntax to complete my objective.
Network Parameters
The process of splitting the data into training, validation and test sets happens in the section that you identified. I'm just going to break down each of the lines. Starting with:
% Setup Division of Data for Training, Validation, Testing
% For a list of all data division functions type: help nndivide
net.divideMode = 'sample'; % Divide up every sample
The divideMode is well documented in Neural Network Object Properties
net.divideMode
This property defines the target data dimensions which
to divide up when the data division function is called. Its default
value is 'sample' for static networks and 'time' for dynamic networks.
It may also be set to 'sampletime' to divide targets by both sample
and timestep, 'all' to divide up targets by every scalar value, or
'none' to not divide up data at all (in which case all data is used
for training, none for validation or testing).
So your network is a static network which divides up every sample into a training example. This will remain the same for your cross-validation. What you are interested in manipulating is the training, test, and validation splits.
net.divideParam.trainRatio = 70/100;
net.divideParam.valRatio = 15/100;
net.divideParam.testRatio = 15/100;
Okay, the variable names here seem promising, but you want a little more control than just choosing the ratio size.
Again the Neural Network Object Properties point us towards more information
net.divideParam
This property defines the parameters and values of the current
data-division function. To get a description of what each field means,
type the following command:
help(net.divideFcn)
This will print out information about how your dataset is partitioned into training, validation, and test splits. In your current configuration, the message reads
dividerand Partition indices into three sets using random indices.
[trainInd,valInd,testInd] = dividerand(Q,trainRatio,valRatio,testRatio) takes a number of
samples Q and divides up the sample indices 1:Q between training,
validation and test indices.
dividerand randomly assigns sample indices to the three sets according to the three ratios.
(...)
See also divideblock, divideind, divideint, dividetrain.
Since you want more control of the partitions, you should check out these additional options.
I think the most promising is divideind. This option allows you to specify the indices for each partition. You can calculate the indices for each fold in your k-fold cross validation and reassign the partitions in each iteration using this option.
To set this parameter, replace the net.divideParam lines above with something like,
net.divideFcn = 'divideind';
net.divideParam.Q = length(targets); %This is the total number of instances in your data
net.divideParam.trainInd = your_train_ind;
net.divideParam.valInd = your_val_ind;
net.divideParam.testInd = your_test_ind;
K-folds
Last detail, how to select the indices? First, a quick review on k-fold cross-validation.
The data is split into k equally sized subsamples.
In each iteration of cross-validation, we train on k-1 of the subsamples and test on the remaining subsample, rotating to a new test subsample each time.
An implementation sketch might look like this
k = 5; % As an example, let's let k = 5
n = length(targets); % Total number of instances
sample_size = floor(n/k); % Number of samples in each test fold
% Randomize the order of the sample indices (use indices = 1:n to keep the original order)
indices = randperm(n);
% Iterate over the k folds
for fold = 1:k
% Position of the first test sample of this fold
ii = (fold - 1)*sample_size + 1;
% Grab one subsample of indices for testing
your_test_ind = indices(ii:ii + sample_size - 1);
% Everything else
your_train_ind = indices([1:ii - 1, ii + sample_size:end]);
%Train and test your network here!
end
This is just an implementation sketch and doesn't handle every edge case. For example, when the number of samples is not divisible by k, the leftover samples are always used for training and never for testing, but it should be enough to get you started.
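Putting the two parts of this answer together, a cross-validation loop around the generated script might look roughly like the sketch below. It rests on several assumptions: synthetic data stands in for DEinp and DEcgl, no validation set is used, and a fresh network is created for every fold so that no weights carry over. The names foldId, testMse and meanTestMse are illustrative.

```matlab
% Sketch: 5-fold cross-validation of a fitting network using divideind.
% Synthetic data (assumption) stands in for DEinp / DEcgl.
rng default
inputs  = rand(3, 50);                  % 3 variables x 50 samples
targets = sum(inputs.^2, 1);            % 1 target x 50 samples

k = 5;
n = size(targets, 2);
foldId = mod(randperm(n) - 1, k) + 1;   % random, balanced fold label per sample
testMse = zeros(k, 1);

for fold = 1:k
    testInd  = find(foldId == fold);
    trainInd = find(foldId ~= fold);

    net = fitnet(10);                   % fresh network for every fold
    net.divideFcn = 'divideind';
    net.divideParam.trainInd = trainInd;
    net.divideParam.valInd   = [];      % no validation set in this sketch
    net.divideParam.testInd  = testInd;
    net.trainParam.showWindow = false;  % suppress the training GUI

    net = train(net, inputs, targets);
    outputs = net(inputs(:, testInd));  % predictions on the held-out fold only
    testMse(fold) = perform(net, targets(:, testInd), outputs);
end

meanTestMse = mean(testMse)
```

Because validation-based early stopping is disabled here, each network trains until another stopping criterion is met; carving a validation subset out of trainInd would be a reasonable variation.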

How to increase accuracy in SVM training and classification in Matlab?

I am training an SVM on several images. This is my first project with SVM. I extract features with HOG and label each location 1 if it is on the horizon line and 0 if it is background. I have 74 images for training and 7 images for testing. Unfortunately, I can't get above 50 percent accuracy. I have changed the image sizes and played with the cell sizes in feature extraction, but it does not change much. What can I try? And what is the ideal dataset size: how many images for training and testing? For example, in one image it predicts everything correctly and in the next image everything wrong.
This is how I am calculating accuracy:
%%%%% Evaluation
% Testing Data
hfsTest = vertcat(dataset.HorizonFeatsTest{:});
bfsTest = vertcat(dataset.BgFeatsTest{:});
test_data = [hfsTest;bfsTest];
% Labels
hlabelTest = ones(size(hfsTest,1),1);
blabelTest = zeros(size(bfsTest,1),1);
test_label = [hlabelTest;blabelTest];
Predict_label = vertcat(results.predicted_label{:});
acc = numel(find(Predict_label==test_label))/length(test_label);
disp(['Accuracy ', num2str(acc)]);
%done
% Training Data
hfs = vertcat(dataset.HorizonFeats{:});
bfs = vertcat(dataset.BgFeats{:});
train_data = [hfs;bfs];
% Labels
hlabel = ones(size(hfs,1),1);
blabel = zeros(size(bfs,1),1);
train_label = [hlabel;blabel];
%%%
% do training ...
svmModel = svmtrain(train_data, train_label,'BoxConstraint',2e-1);
and I have used Predict_label_image = svmclassify (svmModel, image_feats); for testing.
You need to do a lot of tuning. Here in the documentation you have all the hyperparameters you can play with. I'd start with an RBF kernel and try [0.01, 0.1, 1, 10] for BoxConstraint.
I'm afraid you can't expect an SVM to work if you don't try different hyperparameter configurations.
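A small grid over BoxConstraint could be sketched as follows. Note the substitutions: this uses fitcsvm rather than the svmtrain call from the question, and the ionosphere sample data (shipped with the Statistics and Machine Learning Toolbox) stands in for the HOG features, so treat it as an illustration rather than a drop-in fix. The names boxValues and cvAcc are illustrative.

```matlab
% Sketch: compare BoxConstraint values for an RBF-kernel SVM using
% 5-fold cross-validated accuracy. ionosphere is a stand-in dataset.
rng default
load ionosphere                          % X: 351x34 features, Y: 351x1 cellstr ('b'/'g')

boxValues = [0.01 0.1 1 10];
cvAcc = zeros(size(boxValues));

for i = 1:numel(boxValues)
    mdl = fitcsvm(X, Y, 'KernelFunction', 'rbf', 'KernelScale', 'auto', ...
        'BoxConstraint', boxValues(i), 'Standardize', true);
    cvmdl = crossval(mdl, 'KFold', 5);   % 5-fold partitioned model
    cvAcc(i) = 1 - kfoldLoss(cvmdl);     % kfoldLoss returns the misclassification rate
end

[bestAcc, bestIdx] = max(cvAcc);
fprintf('Best BoxConstraint: %g (CV accuracy %.3f)\n', boxValues(bestIdx), bestAcc);
```

Selecting the value by cross-validation on the training images, rather than on the 7 test images, keeps the test set usable for a final accuracy estimate.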

Calculate cross validation for Generalized Linear Model in Matlab

I am doing a regression using a Generalized Linear Model. I am caught off guard by the crossval function. My implementation so far:
x = 'Some dataset, containing the input and the output'
X = x(:,1:7);
Y = x(:,8);
cvpart = cvpartition(Y,'holdout',0.3);
Xtrain = X(training(cvpart),:);
Ytrain = Y(training(cvpart),:);
Xtest = X(test(cvpart),:);
Ytest = Y(test(cvpart),:);
mdl = GeneralizedLinearModel.fit(Xtrain,Ytrain,'linear','distr','poisson');
Ypred = predict(mdl,Xtest);
res = (Ypred - Ytest);
RMSE_test = sqrt(mean(res.^2));
The code below calculates cross-validation for multiple regression, as obtained from this link. I want something similar for a Generalized Linear Model.
c = cvpartition(Y,'k',10);
regf=@(Xtrain,Ytrain,Xtest)(Xtest*regress(Ytrain,Xtrain));
cvMse = crossval('mse',X,Y,'predfun',regf)
You can either perform the cross-validation process manually (training a model for each fold, predict outcome, compute error, then report the average across all folds), or you can use the CROSSVAL function which wraps this whole procedure in a single call.
To give an example, I will first load and prepare a dataset (a subset of the cars dataset which ships with the Statistics Toolbox):
% load regression dataset
load carsmall
X = [Acceleration Cylinders Displacement Horsepower Weight];
Y = MPG;
% remove instances with missing values
missIdx = isnan(Y) | any(isnan(X),2);
X(missIdx,:) = [];
Y(missIdx) = [];
clearvars -except X Y
Option 1
Here we will manually partition the data using k-fold cross-validation using cvpartition (non-stratified). For each fold, we train a GLM model using the training data, then use the model to predict output of testing data. Next we compute and store the regression mean squared error for this fold. At the end, we report the average RMSE across all partitions.
% partition data into 10 folds
K = 10;
cv = cvpartition(numel(Y), 'kfold',K);
mse = zeros(K,1);
for k=1:K
% training/testing indices for this fold
trainIdx = cv.training(k);
testIdx = cv.test(k);
% train GLM model
mdl = GeneralizedLinearModel.fit(X(trainIdx,:), Y(trainIdx), ...
'linear', 'Distribution','poisson');
% predict regression output
Y_hat = predict(mdl, X(testIdx,:));
% compute mean squared error
mse(k) = mean((Y(testIdx) - Y_hat).^2);
end
% average RMSE across k-folds
avrg_rmse = mean(sqrt(mse))
Option 2
Here we can simply call CROSSVAL with an appropriate function handle which computes the regression output given a set of train/test instances. See the doc page to understand the parameters.
% prediction function given training/testing instances
fcn = @(Xtr, Ytr, Xte) predict(...
GeneralizedLinearModel.fit(Xtr,Ytr,'linear','distr','poisson'), ...
Xte);
% perform cross-validation, and return average MSE across folds
mse = crossval('mse', X, Y, 'Predfun',fcn, 'kfold',10);
% compute root mean squared error
avrg_rmse = sqrt(mse)
You should get a similar result compared to before (slightly different of course, on account of the randomness involved in the cross-validation).

What are the Inputs, Outputs and Target in ANN

I am confused about the input data set, outputs, and targets. I am studying artificial neural networks in MATLAB. My purpose is to use historical data (I have rainfall and water levels for the past 20 years) to predict the water level in the future (for example, 2014). So what are my inputs, targets, and outputs? For example, I have an Excel sheet with data as [Column1-Date | Column2-Rainfall | Column3-Water level].
I am using this code for prediction, but it cannot predict the future. Can anyone help me fix it? Thank you.
%% 1. Importing data
Data_Inputs=xlsread('demo.xls'); % Import file
Training_Set=Data_Inputs(1:end,2);%specific training set
Target_Set=Data_Inputs(1:end,3); %specific target set
Input=Training_Set'; %Convert to row
Target=Target_Set'; %Convert to row
X = con2seq(Input); %Convert to cell
T = con2seq(Target); %Convert to cell
%% 2. Data preparation
N = 365; % Multi-step ahead prediction
% Input and target series are divided in two groups of data:
% 1st group: used to train the network
inputSeries = X(1:end-N);
targetSeries = T(1:end-N);
inputSeriesVal = X(end-N+1:end);
targetSeriesVal = T(end-N+1:end);
% Create a Nonlinear Autoregressive Network with External Input
delay = 2;
inputDelays = 1:2;
feedbackDelays = 1:2;
hiddenLayerSize = 10;
net = narxnet(inputDelays,feedbackDelays,hiddenLayerSize);
% Prepare the Data for Training and Simulation
% The function PREPARETS prepares timeseries data for a particular network,
% shifting time by the minimum amount to fill input states and layer states.
% Using PREPARETS allows you to keep your original time series data unchanged, while
% easily customizing it for networks with differing numbers of delays, with
% open loop or closed loop feedback modes.
[inputs,inputStates,layerStates,targets] = preparets(net,inputSeries,{},targetSeries);
% Setup Division of Data for Training, Validation, Testing
net.divideParam.trainRatio = 70/100;
net.divideParam.valRatio = 15/100;
net.divideParam.testRatio = 15/100;
% Train the Network
[net,tr] = train(net,inputs,targets,inputStates,layerStates);
% Test the Network
outputs = net(inputs,inputStates,layerStates);
errors = gsubtract(targets,outputs);
performance = perform(net,targets,outputs)
% View the Network
view(net)
% Plots
% Uncomment these lines to enable various plots.
%figure, plotperform(tr)
%figure, plottrainstate(tr)
%figure, plotregression(targets,outputs)
%figure, plotresponse(targets,outputs)
%figure, ploterrcorr(errors)
%figure, plotinerrcorr(inputs,errors)
% Closed Loop Network
% Use this network to do multi-step prediction.
% The function CLOSELOOP replaces the feedback input with a direct
% connection from the output layer.
netc = closeloop(net);
netc.name = [net.name ' - Closed Loop'];
view(netc)
[xc,xic,aic,tc] = preparets(netc,inputSeries,{},targetSeries);
yc = netc(xc,xic,aic);
closedLoopPerformance = perform(netc,tc,yc)
% Early Prediction Network
% For some applications it helps to get the prediction a timestep early.
% The original network returns predicted y(t+1) at the same time it is given y(t+1).
% For some applications such as decision making, it would help to have predicted
% y(t+1) once y(t) is available, but before the actual y(t+1) occurs.
% The network can be made to return its output a timestep early by removing one delay
% so that its minimal tap delay is now 0 instead of 1. The new network returns the
% same outputs as the original network, but outputs are shifted left one timestep.
nets = removedelay(net);
nets.name = [net.name ' - Predict One Step Ahead'];
view(nets)
[xs,xis,ais,ts] = preparets(nets,inputSeries,{},targetSeries);
ys = nets(xs,xis,ais);
earlyPredictPerformance = perform(nets,ts,ys)
%% 5. Multi-step ahead prediction
inputSeriesPred = [inputSeries(end-delay+1:end),inputSeriesVal];
targetSeriesPred = [targetSeries(end-delay+1:end), con2seq(nan(1,N))];
[Xs,Xi,Ai,Ts] = preparets(netc,inputSeriesPred,{},targetSeriesPred);
yPred = netc(Xs,Xi,Ai);
perf = perform(net,yPred,targetSeriesVal);
figure;
plot([cell2mat(targetSeries),nan(1,N);
nan(1,length(targetSeries)),cell2mat(yPred);
nan(1,length(targetSeries)),cell2mat(targetSeriesVal)]')
legend('Original Targets','Network Predictions','Expected Outputs');
Inputs and targets are the data you use to train the net.
Both are known, correct data. After you have trained the net, you send it only inputs, and the output is the prediction based on the inputs and targets you supplied during training. So your targets are the correct outputs for data you already know.
As I understand it, you are trying to predict the future, and for the future you only have the date? Correct me if I am wrong. In that case:
Before training:
input1 = date; input2 = rainFall;
input = [input1; input2];
target = waterLevel;
Because you want to get back the result of water level from the net, your targets should be also water level.
Now you train net;
..train(net, input, target..
After training
You said you want to predict the water level but only have a date, for example 2015-11-11. That is impossible with this network, because it also needs the rainfall for that date. If you still want to predict the water level from the date alone, you either need to predict the rainfall too, or drop it as an input, because an input you no longer know at prediction time does not help.
I'd say your inputs are both the rainfall and the water level, the target is the water level for the next year and the output is the predicted water level.
In other words, when training, your inputs should be rainfall(k-2:k-1) (direct input) and waterlevel(k-2:k-1) (as feedback). Your target is waterlevel(k). That should output an estimation of the water level for year k (waterlevel_hat(k)). You can compute the error e = waterlevel_hat(k) - waterlevel(k) and use it to train the network. You should repeat the same process for all k > 2 (the reason is that you have 2 input delays and 2 feedback delays).
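To make the delayed input/target pairing concrete, here is a small sketch that builds the training pairs by hand for 2 input delays and 2 feedback delays. The rainfall and waterlevel values are made-up placeholders, and the names inputsMat and targetVec are illustrative; in practice preparets does this bookkeeping for a narxnet.

```matlab
% Sketch: build (input, target) pairs for delays 1:2 by hand.
% rainfall and waterlevel are made-up placeholder series (assumption).
rainfall   = (1:10) * 10;       % 10 time steps
waterlevel = (1:10) / 10;

d = 2;                          % number of input and feedback delays
n = numel(waterlevel);
inputsMat = zeros(2*d, n - d);  % each column: [rainfall(k-2:k-1); waterlevel(k-2:k-1)]
targetVec = zeros(1, n - d);    % each entry:  waterlevel(k)

for k = d+1:n
    col = k - d;
    inputsMat(:, col) = [rainfall(k-d:k-1)'; waterlevel(k-d:k-1)'];
    targetVec(col)    = waterlevel(k);
end

disp(inputsMat(:, 1)')          % inputs used to predict time step 3
disp(targetVec(1))              % target for time step 3
```

The first usable target is at k = 3 because the two earlier time steps are consumed as delayed inputs, which is why the pairing only exists for k > 2.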

How can I get predicted values in SVM using MATLAB?

I am trying to get a prediction column matrix in MATLAB but I don't quite know how to go about coding it. My current code is -
load DataWorkspace.mat
groups = ismember(Num,'Yes');
k=10;
%# number of cross-validation folds:
%# If you have 50 samples, divide them into 10 groups of 5 samples each,
%# then train with 9 groups (45 samples) and test with 1 group (5 samples).
%# This is repeated ten times, with each group used exactly once as a test set.
%# Finally the 10 results from the folds are averaged to produce a single
%# performance estimation.
cvFolds = crossvalind('Kfold', groups, k);
cp = classperf(groups);
for i = 1:k
testIdx = (cvFolds == i);
trainIdx = ~testIdx;
svmModel = svmtrain(Data(trainIdx,:), groups(trainIdx), ...
'Autoscale',true, 'Showplot',false, 'Method','SMO', ...
'Kernel_Function','rbf');
pred = svmclassify(svmModel, Data(testIdx,:), 'Showplot',false);
%# evaluate and update performance object
cp = classperf(cp, pred, testIdx);
end
cp.CorrectRate
cp.CountingMatrix
The issue is that it's actually calculating the accuracy 11 times in total: once for each of the 10 folds and one final time as an average. But if I take the individual predictions of each fold and print pred for each loop, the accuracy understandably drops greatly.
However, I need a column matrix of the predicted values for each row of the data. Any ideas on how I can go about modifying the code?
The whole idea of cross-validation is to get an unbiased estimate of the performance of a classifier.
Once that is done, you usually just train a model over the entire data. This model will be used to predict future instances.
So just do:
svmModel = svmtrain(Data, groups, ...);
pred = svmclassify(svmModel, otherData, ...);
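If a prediction for every row of the original data is still wanted, one option is to store each fold's held-out predictions at the matching row positions. The sketch below does this under assumptions: it uses cvpartition and fitcsvm in place of crossvalind, svmtrain and svmclassify from the question, and the ionosphere sample data stands in for DataWorkspace.mat. The name allPred is illustrative.

```matlab
% Sketch: collect out-of-fold predictions into one column vector.
% ionosphere is a stand-in dataset; fitcsvm replaces svmtrain (assumption).
rng default
load ionosphere                          % X: 351x34 features, Y: 351x1 cellstr
groups = strcmp(Y, 'g');                 % logical labels, like ismember(Num,'Yes')

k = 10;
cv = cvpartition(groups, 'KFold', k);
allPred = false(size(groups));           % one predicted label per data row

for i = 1:k
    trainIdx = training(cv, i);
    testIdx  = test(cv, i);
    mdl = fitcsvm(X(trainIdx,:), groups(trainIdx), ...
        'KernelFunction', 'rbf', 'KernelScale', 'auto', 'Standardize', true);
    % each row is predicted by the model that did not see it during training
    allPred(testIdx) = predict(mdl, X(testIdx,:));
end

accuracy = mean(allPred == groups)
```

Each row appears in exactly one test fold, so allPred ends up fully populated and has the same size as groups.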