MATLAB: decision tree shows invalid output values

I'm building a decision tree with the classregtree(X,Y) function. I pass X as a 70x9 matrix (70 data objects, each with 9 attributes) and Y as a 70x1 matrix. Every one of my Y values is either 2 or 4. However, in the resulting decision tree, some leaf nodes have values of 2.5 or 3.5.
Any ideas what might be causing this?

You are using classregtree in regression mode (which is the default mode). In regression mode each leaf predicts the mean of the training responses that fall into it, so a leaf containing a mix of 2s and 4s can yield values like 2.5 or 3.5.
Change the mode to classification mode.
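For the call in the question, that would be (a minimal sketch, assuming X and Y are your 70x9 and 70x1 matrices):
t = classregtree(X, Y, 'method','classification');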

Here is an example using CLASSREGTREE for classification:
%# load dataset
load fisheriris
%# split training/testing
cv = cvpartition(species, 'holdout',1/3);
trainIdx = cv.training;
testIdx = cv.test;
%# train
t = classregtree(meas(trainIdx,:), species(trainIdx), 'method','classification', ...
'names',{'SL' 'SW' 'PL' 'PW'});
%# predict
pred = t.eval(meas(testIdx,:));
%# evaluate
cm = confusionmat(species(testIdx),pred)
acc = sum(diag(cm))./sum(testIdx)
The output (confusion matrix and accuracy):
cm =
    17     0     0
     0    13     3
     0     2    15
acc =
    0.9
Now if your target classes are encoded as numbers, the returned predictions will still be a cell array of strings, so you have to convert them back to numbers:
%# load dataset
load fisheriris
[species,GN] = grp2idx(species);
%# ...
%# evaluate
cm = confusionmat(species(testIdx),str2double(pred))
acc = sum(diag(cm))./sum(testIdx)
Note that classification mode always returns strings, so I think you might have mistakenly used the method=regression option, which performs regression (numeric target) rather than classification (discrete target).
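If you need the original class names back, the GN output of grp2idx provides the mapping, e.g.:
predNames = GN(str2double(pred));  %# numeric predictions back to the original class names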

Related

Confusion matrix error with less test data than training data

I have an issue with my model accuracy calculation. I used the code below:
y_train = [1 1 1 4 4 3 3 5 5 5];  % true labels for x_train (10 values)
% x_test has no true labels
predictedLabel = [1 2 3 4 5];     % predicted labels for the 5 test samples
group = y_train;                  % 10 training labels
grouphat = predictedLabel;        % 5 predicted test labels
C = confusionmat(group, grouphat);
Accuracy = sum(diag(C)) / sum(C(:)) * 100;
but I get the error:
Error using confusionmat (line 75)
G and GHAT need to have same number of rows
Do I get this error because the test set has a different number of samples than the training set? There are no true labels for the test data (semi-supervised learning).
Your training labels and predicted labels are based on different inputs, so it doesn't make sense to compare them in a confusion matrix. From the confusionmat docs:
returns the confusion matrix C determined by the known and predicted groups
i.e. the known and predicted results for the same data.
Take this example (I've used fitcknn as a stand-in classifier; any train/predict pair would work the same way), see the comments for details:
% Split your input data
trainData = data(1:100, :);    % Training data
testData  = data(101:120, :); % Testing data (mutually exclusive from training)
% Train a classifier (let's assume that the labels are in column 1)
model = fitcknn(trainData(:,2:end), trainData(:,1));
% Test your model on the test inputs, excluding the actual labels in column 1
predictedLabels = predict(model, testData(:,2:end));
% Get the actual labels from column 1
actualLabels = testData(:,1);
% Note that size(predictedLabels) == size(actualLabels)
% Now we can do a confusion matrix
C = confusionmat( actualLabels, predictedLabels )
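With label vectors of matching length, the accuracy formula from the question then works as intended:
Accuracy = 100 * sum(diag(C)) / sum(C(:));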

Two errors in LIBSVM for MATLAB: "Model does not support probabiliy estimates" and "Subscripted assignment dimension mismatch"

I want to classify a list of 5 test images using the LIBSVM library with a one-against-all strategy, in order to obtain probabilities for each class. The code I used is below:
load('D:\xapp.mat');
load('D:\xtest.mat');
load('D:\yapp.mat');  %% contains the true classes of the training images: yapp=[641;645;1001;1010;1100]
load('D:\ytest.mat'); %% contains placeholder labels for the unlabeled test set: ytest=[1;2;3;4;5]
numLabels=max(yapp);
numTest=size(ytest,1);
%# train one-against-all models
model = cell(numLabels,1);
for k=1:numLabels
model{k} = svmtrain(double(yapp==k),xapp, ['-c 1000 -g 10 -b 1 ']);
end
%# get probability estimates of test instances using each model
prob = zeros(numTest,numLabels);
for k=1:numLabels
[~,~,p] = svmpredict(double(ytest==k), xtest, model{k}, '-b 1');
prob(:,k) = p(:,model{k}.Label==1); %# probability of class==k
end
%# predict the class with the highest probability
[~,pred] = max(prob,[],2);
acc = sum(pred == ytest) ./ numel(ytest) %# accuracy
I obtain these errors:
Model does not support probabiliy estimates
Subscripted assignment dimension mismatch.
Error in comp (line 98)
prob(:,k) = p(:,model{k}.Label==1); %# probability of class==k
Please help me solve these errors. Thanks in advance.
What you're trying to do is use a code snippet that evaluates an SVM classifier's performance, whereas your goal is to properly estimate the labels for your test set.
I assume your five labels are [641;645;1001;1010;1100] (as in yapp). The first thing to do is delete ytest, because you don't know any labels for the test set. It is pointless to fill ytest with dummy values: the SVMs will return the predicted labels.
The first error, as already pointed out in the comments, is in
numLabels = max(yapp);
You must replace max() with length() in order to obtain the number of classes.
The training stage is almost correct.
Since k goes from 1 to 5 whereas yapp has the range above, you should change double(yapp==k) into double(yapp==yapp(k)): in this manner we mark the k-th value in yapp as the positive class, and yapp(k) goes from 641 to 1100 as k goes from 1 to 5.
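Putting this together, a sketch of the corrected training loop:
for k=1:numLabels
    model{k} = svmtrain(double(yapp==yapp(k)), xapp, '-c 1000 -g 10 -b 1');
end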
And now the prediction stage.
The first input to svmpredict() should be the test labels, but here we don't know them, so we can pass a vector of zeros (one zero per pattern in the test set). svmpredict() automatically returns the accuracy as well when the test labels are known, but that's not the case here. So you must change the second for-loop to
for k=1:numLabels
[~,~,p] = svmpredict(zeros(size(xtest,1),1), xtest, model{k}, '-b 1');
prob(:,k) = p(:,model{k}.Label==1); %# probability of class==k
end
and finally predict the labels with
[~,pred] = max(prob,[],2);
and pred contains the predicted labels.
Note 1: with this method, however, you cannot measure accuracy or other performance metrics, because what we called the test set is not actually a test set. A test set is a labelled set: we pretend we don't know its labels, let the SVM predict them, and then compare the predicted labels with the actual labels in order to measure accuracy.
Note 2: predicted labels in pred will most likely have values in range 1 to 5 due to the second for-loop. However, since your labels have different values, you can map back taking into account that 1 is 641, 2 is 645, 3 is 1001, 4 is 1010, 5 is 1100.
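A short sketch of that mapping, given the label set above:
labelSet = [641; 645; 1001; 1010; 1100];
predictedLabels = labelSet(pred);  %# map indices 1..5 back to the original class labels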

Implementing convolution as a matrix multiplication

In short: I need help with a MATLAB script that takes an image-data array and convolution weights from Caffe and returns the convolution.
I am trying to recreate a convolution generated by Caffe in Matlab.
Let's make the following definitions:
W^2 = size of the input
F^2 = size of the filter
P = amount of padding
S = stride
K = number of filters
The following text describes how to generalize the convolution as a matrix multiplication:
The local regions in the input image are stretched out into columns in an operation commonly called im2col. For example, if the input is [227x227x3] and it is to be convolved with 11x11x3 filters at stride 4, then we would take [11x11x3] blocks of pixels in the input and stretch each block into a column vector of size 11*11*3 = 363. Iterating this process in the input at stride of 4 gives (227-11)/4+1 = 55 locations along both width and height, leading to an output matrix X_col of im2col of size [363 x 3025], where every column is a stretched out receptive field and there are 55*55 = 3025 of them in total. Note that since the receptive fields overlap, every number in the input volume may be duplicated in multiple distinct columns.
From this, one could draw the conclusion that the im2col function call would look something like this:
input = im2col(input, [3*F*F, ((W-F)/S+1)^2])
However, if I use the following parameter values
W = 5
F = 3
P = 1
S = 2
K = 2
I get the following dimensions
>> size(input)
ans =
1 3 5 5
>> size(output)
ans =
1 2 3 3
>> size(filter)
ans =
2 3 3 3
And if I use the im2col function call from above, I end up with an empty matrix.
If I change the stride to 1 in the above example, the sizes of the input, the output, and the filter remain the same. And if I use MATLAB's convn command, the size does not match the actual output from Caffe:
>> size(convn(input,filter))
ans =
2 5 7 7
What would be the general way to resize your array for matrix multiplication?
You are using the second argument to im2col incorrectly; see the documentation.
You should give it the size of the filter window that you want to slide over the image, i.e.:
cols = im2col(input, [F F])
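For completeness, here is one way to carry out the full convolution-as-matrix-multiplication for a single channel. This is a sketch that assumes the Image Processing Toolbox im2col; the padding (padarray) and the stride subsampling are my additions, since im2col itself has neither option:
W = 5; F = 3; P = 1; S = 2;   %# input size, filter size, padding, stride
I = rand(W, W);               %# one input channel
w = rand(F, F);               %# one filter
Ip = padarray(I, [P P]);      %# zero-pad the input to (W+2P)x(W+2P)
cols = im2col(Ip, [F F]);     %# every FxF patch as a column ('sliding' is the default)
%# im2col has no stride parameter, so keep only the patches whose top-left
%# corner lies on the stride-S grid (patches are enumerated column-major)
nPos = W + 2*P - F + 1;       %# patch positions per dimension
[r, c] = ndgrid(1:S:nPos);    %# top-left corners of the kept patches
keep = sub2ind([nPos nPos], r(:), c(:));
n = (W - F + 2*P)/S + 1;      %# output width/height (here 3, matching the Caffe output)
out = reshape(w(:)' * cols(:, keep), n, n)  %# the convolution (cross-correlation, as in Caffe) as one matrix product
For multiple input channels you would stack the per-channel columns vertically (giving 3*F*F rows, as in the quote above), and for K filters you would replace w(:)' with a K-by-(3*F*F) weight matrix.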

How to see resampled data after BOOTSTRAP

I was trying to resample (with replacement) my dataset using bootstrp in MATLAB, as follows:
D = load('Data.txt');
lead = D(:,1);
depth = D(:,2);
X = D(:,3);
Y = D(:,4);
% Bootstrapping to resample 100 times
[resampling100,bootsam] = bootstrp(100,'corr',lead,depth);
% Plotting the bootstrap results as a histogram
hist(resampling100,10);
... ... ...
... ... ...
Though the script above works, how can I see/load the 100 resampled datasets created through the bootstrap? bootsam(:) displays the indices of the data values selected for the bootstrap samples, but not the new sample values themselves! Isn't it funny that I'm creating fake data from my original data and I can't even see what was created behind the scenes?
My second question: is it possible to resample the whole matrix (in this case, D) at once without using any toolbox function? I do know how to draw random values from a vector using unidrnd.
Thanks in advance for your help.
The answer to question 1 is that bootsam provides the indices of the resampled data. Specifically, the nth column of bootsam provides the indices of the nth resampled dataset. In your case, to obtain the nth resampled dataset you would use:
lead_resample_n = lead(bootsam(:, n));
depth_resample_n = depth(bootsam(:, n));
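As a sanity check, recomputing the statistic on the nth resample should reproduce the nth bootstrap value:
n = 1;  %# any resample index between 1 and 100
corr(lead(bootsam(:,n)), depth(bootsam(:,n)))  %# equals resampling100(n)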
Regarding the second question, I'm guessing you mean: how would you get a resampled dataset without worrying about applying a function to the resampled data? Personally, I would use randi, but in this situation it makes no difference whether you use randi or unidrnd. An example follows that assumes a data matrix D with 4 columns (as in your question):
%# Build an example dataset
T = 10;
D = randn(T, 4);
%# Obtain a set of random indices, ie indices of draws with replacement
Ind = randi(T, T, 1);
%# Obtain the resampled data
DResampled = D(Ind, :);
To create multiple resampled datasets, you can simply loop over the creation of random indices (a loop-based sketch follows the code below), or you can do it in one step by creating a matrix of random indices and using that to index D. With careful use of reshape and permute, you can turn the result into a T-by-4-by-M array, where indexing m = 1, ..., M along the third dimension yields the mth resampled dataset. Example code follows:
%# Build an example dataset
T = 10;
M = 3;
D = randn(T, 4);
%# Obtain a set of random indices, ie indices of draws with replacement
Ind = randi(T, T, M);
%# Obtain the resampled data
DResampled = permute(reshape(D(Ind, :)', 4, T, []), [2 1 3]);
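For comparison, the loop-based version mentioned above would look like this:
DResampledLoop = zeros(T, 4, M);
for m = 1:M
    Ind_m = randi(T, T, 1);               %# indices for the mth resample
    DResampledLoop(:,:,m) = D(Ind_m, :);  %# mth resampled dataset
end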

MATLAB - usage of knnclassify

When doing:
load training.mat
training = G
load testing.mat
test = G
and then:
>> knnclassify(test.Inp, training.Inp, training.Ltr)
??? Error using ==> knnclassify at 91
The length of GROUP must equal the number of rows in TRAINING.
Since:
>> size(training.Inp)
ans =
40 40 2016
And:
>> length(training.Ltr)
ans =
2016
How can I pass the 3-D matrix training.Inp as the second argument of knnclassify (TRAINING) so that the number of rows is 2016 (the third dimension)?
Assuming that your 3D data represents a 40-by-40 matrix of features for each of the 2016 instances (the third dimension), we have to rearrange it into a matrix of size 2016-by-1600 (rows are instances, columns are features):
%# random data instead of the `load data.mat`
testing = rand(40,40,200);
training = rand(40,40,2016);
labels = randi(3, [2016 1]); %# a class label for each training instance
%# (out of 3 possible classes)
%# arrange data as a matrix whose rows are the instances,
%# and columns are the features
training = reshape(training, [40*40 2016])';
testing = reshape(testing, [40*40 200])';
%# k-nearest neighbor classification
prediction = knnclassify(testing, training, labels);
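Applied to the variables from the question (a sketch, assuming training.Inp is 40x40x2016, training.Ltr holds the 2016 labels, and test.Inp is 40-by-40-by-N):
training2d = reshape(training.Inp, [40*40 2016])';
test2d = reshape(test.Inp, [40*40 size(test.Inp,3)])';
prediction = knnclassify(test2d, training2d, training.Ltr);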