Make predictions on new data using an SVM in MATLAB

I trained an SVM classification model using the "fitcsvm" function and tested it with the test data set. Now I want to use this model to predict the classes of new (previously unseen) data. What should be done?
Following is the code I used.
load FeatureLabelsNum.csv
load FeatureOne.csv
X = FeatureOne(1:42,:);
y = FeatureLabelsNum(1:42,:);
%dividing the dataset into training and testing
rand_num = randperm(42);
%training Set
X_train = X(rand_num(1:34),:);
y_train = y(rand_num(1:34),:);
%testing Set
X_test = X(rand_num(35:end),:);
y_test = y(rand_num(35:end),:);
%preparing validation set out of training set
c = cvpartition(y_train,'k',5);
SVMModel = fitcsvm(X_train,y_train,'Standardize',true,'KernelFunction','RBF',...
    'KernelScale','auto','OutlierFraction',0.05);
CVSVMModel = crossval(SVMModel);
classLoss = kfoldLoss(CVSVMModel)
classOrder = SVMModel.ClassNames
sv = SVMModel.SupportVectors;
figure
gscatter(X_train(:,1),X_train(:,2),y_train)
hold on
plot(sv(:,1),sv(:,2),'ko','MarkerSize',10)
legend('Resampled','Non','Support Vector')
hold off
X_test_w_best_feature = X_test(:,:);
bp = (predict(SVMModel,X_test) == y_test);

You already use the predict function in your script; just pass the new data in. The first output contains the predicted labels and the second output contains the classification scores.
[labels,scores] = predict(SVMModel,X_new_data);
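For example, a minimal sketch of scoring a batch of unseen observations; NewFeatures.csv is a hypothetical file assumed to contain the same feature columns, in the same order, as FeatureOne.csv:
load NewFeatures.csv                                % hypothetical file of unseen observations
X_new_data = NewFeatures;                           % one row per observation, same feature order as training
[labels, scores] = predict(SVMModel, X_new_data);   % labels(i) = predicted class of row i, scores(i,:) = per-class scores
Because the model was trained with 'Standardize',true, predict standardizes the new data using the training means and standard deviations automatically.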

Related

Correct practice and approach for reporting the training and generalization performance

I am trying to learn the correct procedure for training a neural network for classification. Many tutorials exist, but they never explain how to report the generalization performance. Can somebody please tell me whether the following is the correct method or not? I am using the first 100 examples from the fisheriris data set, which have labels 1 and 2, and call them X and Y respectively. Then I split X into trainData and Xtest with a 90/10 split ratio. Using trainData I trained the NN model. The NN internally further splits trainData into train, validation, and test subsets. My confusion is which one is usually used as the generalization measure when reporting the performance of the model on unseen data in conferences/journals.
The dataset can be found in the link: https://www.mathworks.com/matlabcentral/fileexchange/71468-simple-neural-networks-with-k-fold-cross-validation-manner
rng('default')
load iris.mat;
X = [f(1:100,:) l(1:100)];
numExamples = size(X,1);
indx = randperm(numExamples);
X = X(indx,:);
Y = X(:,end);
split1 = cvpartition(Y,'Holdout',0.1,'Stratify',true); %90% trainval 10% test
istrainval = training(split1); % index for fitting
istest = test(split1); % indices for quality assessment
trainData = X(istrainval,:);
Xtest = X(istest,:);
Ytest = Y(istest);
numExamplesXtrainval = size(trainData,1);
indxXtrainval = randperm(numExamplesXtrainval);
trainData = trainData(indxXtrainval,:);
Ytrain = trainData(:,end);
hiddenLayerSize = 10;
% data format = rows = number of dim, column = number of examples
net = patternnet(hiddenLayerSize);
net = init(net);
net.performFcn = 'crossentropy';
net.trainFcn = 'trainscg';
net.trainParam.epochs=50;
[net, tr] = train(net, trainData', Ytrain');
Trained = sim(net, trainData'); %outputs predicted labels
train_predict = net(trainData');
performanceTrain = perform(net,Ytrain',train_predict)
lbl_train=grp2idx(Ytrain);
Yhat_train = (train_predict >= 0.5);
Lbl_Yhat_Train = grp2idx(Yhat_train);
[cmMatrixTrain]= confusionmat(lbl_train,Lbl_Yhat_Train )
accTrain=sum(lbl_train ==Lbl_Yhat_Train)/size(lbl_train,1);
disp(['Training Set: Total Train Accuracy by MLP = ', num2str(100*accTrain), '%'])
[confTest] = confusionmat(lbl_train(tr.testInd),Lbl_Yhat_Train(tr.testInd) )
%unknown test
test_predict = net(Xtest');
performanceTest = perform(net,Ytest',test_predict);
Yhat_test = (test_predict >= 0.5);
test_lbl=grp2idx(Ytest);
Lbl_Yhat_Test = grp2idx(Yhat_test);
[cmMatrix_Test]= confusionmat(test_lbl,Lbl_Yhat_Test )
This is the output:
performanceTrain =
   2.2204e-16
cmMatrixTrain =
    45     0
    45     0
Training Set: Total Train Accuracy by MLP = 50%
confTest =
     9     0
     5     0
cmMatrix_Test =
     5     0
     5     0
Problem 1: There seems to be no prediction for the other class. Why?
Problem 2: Do I need a separate data set like the Xtest I created for reporting the generalization error, or is it the practice to use trainData(tr.testInd,:) as the generalization test set? Did I create an unnecessary subset?
There are a few issues with the code. Let's deal with them before answering your question. First, you set a threshold of 0.5 for making decisions (Yhat_train = (train_predict >= 0.5);), while all of the network's predictions are above 0.5. As a result every example is assigned to the same class, which is why one column of your confusion matrices is all zeros. You can plot the scores to choose a better threshold:
figure;
plot(train_predict(Ytrain == 1),'.b')
hold on
plot(train_predict(Ytrain == 2),'.r')
legend('label 1','label 2')
cvpartition gave me an error; it ran successfully as split1 = cvpartition(Y,'Holdout',0.1);. In any case, artificial neural networks usually manage the partitioning within the training process, so you feed in X and Y plus some parameters describing how to split. See here for example: link, where you set
net.divideParam.trainRatio = .4;
net.divideParam.valRatio = .3;
net.divideParam.testRatio = .3;
So how should you report the results? Only on the test data. The training data will suffer from overfitting and show falsely good results. If you use validation data (you haven't), you cannot report results on it either, because it also influences training (through early stopping and model selection) and therefore overfits. If you let the training routine handle validation for you, your held-out test results remain free of overfitting.
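Putting that together, a minimal sketch of the workflow (reusing trainData, Ytrain, Xtest and Ytest from the question; the split ratios are assumptions, and thr stands for a decision threshold read off the score plot above):
net = patternnet(10);
net.divideParam.trainRatio = 0.7;    % weights are fitted on this portion
net.divideParam.valRatio = 0.15;     % used internally for early stopping
net.divideParam.testRatio = 0.15;    % internal test estimate (tr.testInd)
[net, tr] = train(net, trainData', Ytrain');
% Report generalization only on data the network never saw during training.
test_predict = net(Xtest');
Yhat_test = 1 + (test_predict >= thr);              % map scores to the labels 1/2
accTest = mean(Yhat_test(:) == Ytest(:));
disp(['Held-out test accuracy = ', num2str(100*accTest), '%'])
Whichever figure you quote (accTest on the external hold-out, or the error on tr.testInd), the key point is that it comes from data that played no role in fitting the weights or in early stopping.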

How to use a trained SVM model to predict whether an image contains a car or not

I want to use an SVM to classify whether an image contains a car or not.
I trained the SVM classifier using HOG features. Then I tried to use the classifier, so I looked up several MathWorks tutorials, but I could not find any useful tutorial on applying an SVM classifier.
I use the data set from http://cogcomp.org/Data/Car/
This is my code for the SVM classifier.
imgPos = imread(strrep(file, '*', int2str(0))); % 'file', 'posFile', 'negFile' and 'fileNum' are defined earlier; '*' is the image-index placeholder
[hog_4x4, vis4x4] = extractHOGFeatures(imgPos,'CellSize',[4 4]);
cellSize = [4 4];
hogFeatureSize = length(hog_4x4);
temp(1:500) = 1;
temp(501:1000) = 0;
trainingLabels = categorical(temp);
trainingFeatures = zeros(fileNum*2, hogFeatureSize, 'single');
for n = 1:500
posfile = strrep(posFile, "*", int2str(n-1));
imgPos = imread(posfile);
trainingFeatures(n, :) = extractHOGFeatures(imgPos, 'CellSize', cellSize);
negfile = strrep(negFile, "*", int2str(n-1));
imgNeg = imread(negfile);
trainingFeatures(n+500, :) = extractHOGFeatures(imgNeg, 'CellSize', cellSize);
end
classifier = fitcecoc(trainingFeatures, trainingLabels);
I want to use the classifier to detect car objects.
If possible, I want to surround each detected car with a frame.
Any help is appreciated.
You're looking for the predict method. Extract the HOG features of your test images (using the same cell size as in training) and run the following:
predictions = predict(classifier, testFeatures);
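For example, a minimal sketch for classifying a single unseen image; the file name here is hypothetical, and the image is assumed to have the same dimensions as the training images so that the HOG feature length matches:
imgTest = imread('testImage.pgm');                                % hypothetical test image
testFeatures = extractHOGFeatures(imgTest, 'CellSize', cellSize);
prediction = predict(classifier, testFeatures);                   % categorical label: 1 = car, 0 = no car
Note that this only classifies whole images. To surround each car in a larger scene with a frame you would also need a localization step, for example a sliding window that classifies image patches, or a detector trained with trainCascadeObjectDetector from the Computer Vision Toolbox.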

Which Function in MATLAB Should I Use to Validate a Model: forecast() or predict()?

I have used two types of models to model a SISO system from time-series data. The first is ARMAX and the second is Output-Error (OE). Now I need to know which of the two performs best at forecasting the output, given the input, over a certain horizon (15 days in my case), using only the observed outputs needed to initialize the model properly. MATLAB provides two functions that seem suited to validating models: forecast() and predict(). I have been reading about the difference between predicting and forecasting, and apparently people often confuse the two terms. I would like to know which of the two I should use to validate a model and choose the best one. The main point is that I have to test the model's performance for many horizons; in other words, how the model performs when forecasting one day ahead, two days ahead, and so on up to 15 days ahead. I wrote the following code as an example:
close all
clear all
tic;
uhe = {'furnas'};
% Set the structures to be evaluated in ARMAx model
na = 10;
nb = 2;
nc = 1;
nk = 2;
% Set the structures to be evaluated in OE model
nbb = 10;
nf = 6;
nkk = 0;
u = 1;
% Read training dataset file and set iddata definitions
data_train = importdata(strcat('train_',uhe{u},'.dat'));
data_test = importdata(strcat('test_',uhe{u},'.txt'));
data_valid = importdata(strcat('valid_',uhe{u},'.txt'));
data_complet = vertcat(data_train, data_valid, data_test);
data_complet = iddata(data_complet(:,2),data_complet(:,1));
data_complet.TimeUnit = 'days';
data_complet.InputName = 'Chuva';
data_complet.OutputName = 'Vazão';
data_complet.InputUnit = 'm³/s';
data_complet.OutputUnit = 'm³/s';
data_complet.Name = 'Sistema Chuva-Vazão';
data_train = iddata(data_train(:,2),data_train(:,1));
data_train.TimeUnit = 'days';
data_train.InputName = 'Chuva';
data_train.OutputName = 'Vazão';
data_train.InputUnit = 'm³/s';
data_train.OutputUnit = 'm³/s';
data_train.Name = 'Sistema Chuva-Vazão';
data_valid = iddata(data_valid(:,2),data_valid(:,1));
data_valid.TimeUnit = 'days';
data_valid.InputName = 'Chuva';
data_valid.OutputName = 'Vazão';
data_valid.InputUnit = 'm³/s';
data_valid.OutputUnit = 'm³/s';
data_valid.Name = 'Sistema Chuva-Vazão';
data_test = iddata(data_test(:,2),data_test(:,1));
data_test.TimeUnit = 'days';
data_test.InputName = 'Chuva';
data_test.OutputName = 'Vazão';
data_test.InputUnit = 'm³/s';
data_test.OutputUnit = 'm³/s';
data_test.Name = 'Sistema Chuva-Vazão';
% Modeling training dataset with ARMAx
models_train_armax = armax(data_train,[na nb nc nk]);
% Modeling training dataset with OE
models_train_oe = oe(data_train,[nbb nf nkk]);
% Evalutaing the validation dataset ARMAX
x0 = findstates(models_train_armax,data_valid);
OPT = simOptions('InitialCondition',x0);
ssmodel_armax=idss(models_train_armax);
models_valid_armax = sim(ssmodel_armax,data_valid,OPT);
% Evaluating the validation dataset OE
x0 = findstates(models_train_oe,data_valid);
OPT = simOptions('InitialCondition',x0);
ssmodel_oe=idss(models_train_oe);
models_valid_oe = sim(ssmodel_oe,data_valid,OPT);
% Predicting Horizon
hz = 20;
% Applying predict function
opt = predictOptions('InitialCondition','e');
[y_armax_pred] = predict(ssmodel_armax,data_valid(1:end),hz,opt);
[y_oe_pred] = predict(ssmodel_oe,data_valid(1:end),hz,opt);
% Applying forecast function
opt = forecastOptions('InitialCondition','e');
[y_armax_fc] = forecast(ssmodel_armax,data_train((end-max([na nb nc nk])):end),hz,data_test.u(1:hz),opt);
[y_oe_fc] = forecast(ssmodel_oe,data_train((end-max([nbb nf nkk])):end),hz,data_test.u(1:hz),opt);
It depends on how you are trying to validate the model. Generally you would use the predict command, since you want to backtest against previously measured data.
Alternatively, you could use forecast if you have a cross-validation/holdout sample and you would like to test against that.
MATLAB's help has an interesting passage regarding the difference between forecast and predict:
forecast performs prediction into the future, in a time range beyond the last instant of measured data. In contrast, the predict command predicts the response of an identified model over the time span of measured data. Use predict to determine if the predicted result matches the observed response of an estimated model. If sys is a good prediction model, consider using it with forecast.
Note that MATLAB's help for predict also says that careful model validation should not use the default value of the prediction horizon:
For careful model validation, a one-step-ahead prediction (K = 1) is usually not a good test for validating the model sys over the time span of measured data. Even the trivial one-step-ahead predictor, ŷ(t) = y(t−1), can give good predictions. So a poor model may look fine for one-step-ahead prediction of data that has a small sample time. Prediction with K = Inf, which is the same as performing simulation with the sim command, can lead to diverging outputs because low-frequency disturbances in the data are emphasized, especially for models with integration. Use a K value between 1 and Inf to capture the mid-frequency behavior of the measured data.
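Putting this together for your case, one way to compare the two models at every horizon from 1 to 15 days is to loop predict over the horizon K on the validation set and record an error measure for each K. A minimal sketch, reusing the variables from the question, with RMSE as an assumed error metric:
horizons = 1:15;
rmse_armax = zeros(size(horizons));
rmse_oe = zeros(size(horizons));
for k = horizons
    yp_armax = predict(models_train_armax, data_valid, k);   % k-step-ahead prediction
    yp_oe = predict(models_train_oe, data_valid, k);
    rmse_armax(k) = sqrt(mean((data_valid.y - yp_armax.y).^2));
    rmse_oe(k) = sqrt(mean((data_valid.y - yp_oe.y).^2));
end
figure
plot(horizons, rmse_armax, '-o', horizons, rmse_oe, '-s')
legend('ARMAX','OE'), xlabel('Prediction horizon (days)'), ylabel('RMSE')
You would then reserve forecast for the final check on the test window, where only the past data and the future inputs are supplied to the model.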

DNN Regression for multiple inputs

I am trying to use a DNN for multivariate regression to estimate an output from 2 input features. The following is my code (note that the data is clean, and absolutely no NaNs exist).
train_set, test_set = train_test_split(Ha_Noi, test_size=0.2, random_state = random.randint(20, 200))
# Training Data
train_X = np.array(train_set['longwait_percent2'])
train_X2 = np.array(train_set['accept_rate_timeT'])
train_Y = np.array(train_set['accept_rate'])
n_samples = train_X.shape[0]
#Testing Data
Xtest = np.array(test_set['longwait_percent2'])
Xtest2 = np.array(test_set['accept_rate_timeT'])
Ytest = np.array(test_set['accept_rate'])
#Deep Neural Network Regressor
feature_column1 = learn.infer_real_valued_columns_from_input(train_X)
feature_column2 = learn.infer_real_valued_columns_from_input(train_X2)
regressor = learn.DNNRegressor(feature_columns=[feature_column1, feature_column2], hidden_units=[20, 10])
regressor.fit(x = [feature_column1, feature_column2],y = train_Y, steps= STEPS, batch_size= BATCH_SIZE)
When I execute this code, it keeps giving me the error message "AttributeError: 'list' object has no attribute 'dtype'". I also noticed that the code works perfectly if my x-variable is a 1D array rather than 2D. Does anyone know how to fix this?

Naive Bayes Classifier for Multiclass: Getting the Same Error Rate

I have implemented a Naive Bayes classifier for multiclass data, but the problem is that my error rate stays the same as I increase the training data set. I was debugging this over and over but wasn't able to figure out why it's happening, so I thought I'd post it here to find out whether I am doing anything wrong.
%Naive Bayes Classifier
%This function splits the data 80:20 into training and test sets, then from
%the 80% training portion uses incremental 5,10,15,20,30 percent subsets to
%understand how the error rate changes.
%Goal is to compare the plots in the Stanford paper
%http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
function[tPercent] = naivebayes(file, iter, percent)
dm = load(file);
for i=1:iter
%Getting the index common to test and train data
idx = randperm(size(dm.data,1));
%Using same idx for data and labels
shuffledMatrix_data = dm.data(idx,:);
shuffledMatrix_label = dm.labels(idx,:);
percent_data_80 = round((0.8) * length(shuffledMatrix_data));
%Doing 80-20 split
train = shuffledMatrix_data(1:percent_data_80,:);
test = shuffledMatrix_data(percent_data_80+1:length(shuffledMatrix_data),:);
%Getting the label data from the 80:20 split
train_labels = shuffledMatrix_label(1:percent_data_80,:);
test_labels = shuffledMatrix_label(percent_data_80+1:length(shuffledMatrix_data),:);
%Getting the array of percents [5 10 15..]
percent_tracker = zeros(length(percent), 2);
for pRows = 1:length(percent)
percentOfRows = round((percent(pRows)/100) * length(train));
new_train = train(1:percentOfRows,:);
new_train_label = train_labels(1:percentOfRows);
%get unique labels in training
numClasses = size(unique(new_train_label),1);
classMean = zeros(numClasses,size(new_train,2));
classStd = zeros(numClasses, size(new_train,2));
priorClass = zeros(numClasses, 1);
% Doing the K class mean and std with prior
for kclass=1:numClasses
classMean(kclass,:) = mean(new_train(new_train_label == kclass,:));
classStd(kclass, :) = std(new_train(new_train_label == kclass,:));
priorClass(kclass, :) = length(new_train(new_train_label == kclass))/length(new_train);
end
error = 0;
p = zeros(numClasses,1);
% Calculating the posterior for each test row for each k class
for testRow=1:length(test)
c=0; k=0;
for class=1:numClasses
temp_p = normpdf(test(testRow,:),classMean(class,:), classStd(class,:));
p(class, 1) = sum(log(temp_p)) + (log(priorClass(class)));
end
%Take the max of posterior
[c,k] = max(p(1,:));
if test_labels(testRow) ~= k
error = error + 1;
end
end
avgError = error/length(test);
percent_tracker(pRows,:) = [avgError percent(pRows)];
tPercent = percent_tracker;
plot(percent_tracker)
end
end
end
Here is the dimensionality of my data:
x =
data: [768x8 double]
labels: [768x1 double]
I am using the Pima data set from UCI.
What are the results of your implementation on the training data itself? Does it fit the training data at all?
It's hard to be sure, but there are a couple of things that I noticed:
It is important for every class to have training data. You can't really train a classifier to recognize a class if there is no training data for it.
If possible, the number of training examples shouldn't be skewed towards some of the classes. For example, if in 2-class classification the training and cross-validation examples for class 1 constitute only 5% of the data, then a function that always returns class 2 will have an error of only 5%. Did you try checking precision and recall separately (see the sketch at the end of this answer)?
You're trying to fit a normal distribution to each feature in a class and then use it for the posterior probabilities. I'm not sure how that plays out in terms of smoothing. Could you try re-implementing it with simple counting and see if it gives different results?
It could also be that the features are highly redundant and the naive Bayes method overcounts the evidence.
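Regarding the precision/recall check above, here is a minimal sketch that computes both per class from a confusion matrix; 'predicted' is an assumed vector holding the predicted class index k collected for each test row:
cm = confusionmat(test_labels, predicted);   % rows = true class, columns = predicted class
precision = diag(cm) ./ sum(cm, 1)';         % per class: correct / everything predicted as that class
recall = diag(cm) ./ sum(cm, 2);             % per class: correct / everything truly in that class
disp(table((1:size(cm,1))', precision, recall, 'VariableNames', {'Class','Precision','Recall'}))
If the classifier always predicts the same class, the recall of every other class drops to zero even though the overall accuracy can still look acceptable, which is exactly the kind of failure a single error rate hides.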