K-NN classification using Fisher Iris dataset - matlab

I am working on a Pattern Recognition project and I face some problems. I have loaded the Fisher's Iris data set on my project and I want to run the k-NN classifier(for k = 1,3,5) on the above data set. But I want the following division: 80% training set and 20% test set. I want the partition to be repeated 5 times. How can it be done?
I have some code about that, but I do not even know whether I am in the right way or not.
% Regarding the random permutation that I want
[Xtrain,Xval,Xtest] = dividerand(150,0.8,0,0.2);
% Regarding the k-NN classification
X = meas;
Y = species;
z1 = fitcknn(X,Y,'NumNeighbors',5,'Standardize',1);
I do not know if my code is in the right way, what is missing or even whether there is a better way to do my job than mine or not.
Could anyone help me to complete my task?

Related

Data augmentation techniques for general datasets?

I am working in a machine learning problem and want to build neural network based classifiers on it in matlab. One problem is that the data is given in the form of features and number of samples is considerably lower. I know about data augmentation techniques for images, by rotating, translating, affine translation, etc.
I would like to know whether there are data augmentation techniques available for general datasets ? Like is it possible to use randomness to generate more data ? I read the answer here but I did not understand it.
Kindly please provide answers with the working details if possible.
Any help will be appreciated.
You need to look into autoencoders. Effectively you pass your data into a low level neural network, it applies a PCA-like analysis, and you can subsequently use it to generate more data.
Matlab has an autoencoder class as well as a function, that will do all of this for you. From the matlab help files
Generate the training data.
rng(0,'twister'); % For reproducibility
n = 1000;
r = linspace(-10,10,n)';
x = 1 + r*5e-2 + sin(r)./r + 0.2*randn(n,1);
Train autoencoder using the training data.
hiddenSize = 25;
autoenc = trainAutoencoder(x',hiddenSize,...
'EncoderTransferFunction','satlin',...
'DecoderTransferFunction','purelin',...
'L2WeightRegularization',0.01,...
'SparsityRegularization',4,...
'SparsityProportion',0.10);
Generate the test data.
n = 1000;
r = sort(-10 + 20*rand(n,1));
xtest = 1 + r*5e-2 + sin(r)./r + 0.4*randn(n,1);
Predict the test data using the trained autoencoder, autoenc .
xReconstructed = predict(autoenc,xtest');
Plot the actual test data and the predictions.
figure;
plot(xtest,'r.');
hold on
plot(xReconstructed,'go');
You can see the green cicrles which represent additional data generated with the auto-encoder.

Implementing Naive Bayes Nearest Neighbor (NBNN) in MATLAB

I posted this question on the CV SO a few days ago but it has gone basically unobserved by the forum. I'm trying to implement NBNN in MATLAB to do image classification on the CIFAR-10 image dataset. The algorithm is pretty simple, and I'm confident in it's correctness, however, I'm receiving terrible accuracy rates with 22-28%. I'm unsure why and I'm hoping someone who has experience with image classification or the algorithm could point me in the right direction. The algorithm learns off of SIFT image descriptors, which could be one of the reasons why it's under-performing. Matlab only has a SURF feature detector. From what I've read, SURF/SIFT are basically equivalent. I've been using features, (nDescriptros x 64) obtained from the function below to train my model.
points = detectSURFFeatures(rgb2gray(image));
[features,valid_points] = extractFeatures(image,points);
Another possible issue with the CIFAR dataset and this approach is the small size of the images. Each image is 32 x 32 images which, I beleive, makes feature detection very difficult. I've played around with the different octave settings in the detectSURFFeatures() but nothing has brought my accuracy above 28%.
The annotated code for the approach is below. It's difficult to understand but still might be helpful.
Hopefully someone can help me out.
for i = 3001:4000 %Train on the first 3000 instances and test on the remaining 1000.
closeness = [];
for j = 1:10 %loop over the 10 catergories
class = train(trainLabel==j,:); % Pull out all descriptors that belong to class j
descriptor = test(test_id==i,:); % Pull out all descriptors for image i
[idx,dist] = knnsearch(class,descriptor,'K',1); % Find the distance between the descriptors and the closest labeled descriptor in class j.
total = sum(dist); % sum up distances
closeness = [closeness,sum(total)]; % append a vector of the image-to-class distances.
end
[val,cat] = min(closeness); % Find choose the class that resulted in the lowest, summed distance.
predLabel = [predLabel;cat]; % Append to a vector of labels.
i
end
If your images are 32x32 pixels, then trying to detect interest points is not a good idea. As you have observed, you would get very few features, if any. Upsampling the images is one option. Another option is to use a global descriptor like HOG (extractHOGFeatures).

Naïve Bayes Classifier -- is normalization necessary?

We recently studied the Naïve Bayesian Classifier in our Machine Learning class and now I'm trying to implement it on the Fisher Iris dataset as a self-exercise. The concept is easy and straightforward, with some trickiness involved for continuous attributes. I read up several literature resources which recommended using a Gaussian approximation to compute probability of test data values, so I'm going with it in my code.
Now I'm trying to run it initially for 50% training and 50% test data samples, but something is missing. The current code is always predicting class 1 (I used integers to represent the classes) for all test samples, which is obviously wrong.
My guess is that the problem may be due to normalization being omitted by the code? Though I think adding normalization would still yield proportionate results, and so far my attempts to normalize have produced the same classification results.
Can someone please suggest if there is anything obvious missing here? Or if I'm not approaching this right? Since most of the code is 'mechanics', I have made prominent (****************) the 2 lines that are responsible for the calculations. Any help is appreciated, thanks!
nsamples=75; % 50% samples
% acquire training set and test set
[trainingSample,idx] = datasample(data,nsamples,'Replace',false);
testData = data(setdiff(1:150,idx),:);
% define Gaussian function
%***********************************************************%
Phi=#(mu,sig2,x) (1/sqrt(2*pi*sig2))*exp(-((x-mu)^2)/2*sig2);
%***********************************************************%
for c=1:3 % for 3 classes in training set
clear y x mu sig2;
index=1;
for i=1 : length(trainingSample)
if trainingSample(i,5)==c
y(index,:)=trainingSample(i,:); % filter current class samples
index=index+1; % for conditional probabilities
end
end
for j=1:size(testData,1) % iterate over test samples
clear pf p;
for i=1:4 % iterate over columns
x=testData(j,i); % representing attributes
mu=mean(y(:,i));
sig2=var(y(:,i));
pf(i) = Phi(mu,sig2,x); % calc conditional probability
end
% calc class likelihood; prior * posterior
%*****************************************************%
pc(j,c) = size(y,1)/nsamples * pf(1)*pf(2)*pf(3)*pf(4);
%*****************************************************%
end
end
% find the predicted class for each test sample
% by taking the max probability calculated
for i=1:size(pc,1)
[~,q]=max(pc(i,:));
predicted(i)=q;
actual(i)=testData(i,5);
end
Normalization shouldn't be necessary since the features are only compared to each other.
p(class|thing) = p(class)p(thing|class) =
= p(class)p(feature_1|class)p(feature_2|class)...p(feature_N|class)
So when fitting the parameters for the distribution feature_i|class it will just rescale the parameters (for the new "scale") in this case (mu, sigma2), but the probabilities will remain the same.
It's hard to read the matlab code due to alot of indexing and splitting of training/testing etc. Which is a possible problem source.
You should try something with a lot less non-necessary stuff around it (I would recommend python with scikit-learn for example, alot of helpers for splitting data and such http://scikit-learn.org/).
It's really important that you separate the training and test data, and only train the model with training data and test the trained model with the test data. (Is this done?)
Next step is to check the parameters which is easiest done with either printing them out (sanity check) or..
for each feature render the gaussian bells fitted next to a histogram of the data to see that they match (remember that each histogram bar must be of height number_of_samples_within_range/total_number_of_samples.
Visualising the data and the model is really important to know what is happening.

How to use SVM in Matlab?

I am new to Matlab. Is there any sample code for classifying some data (with 41 features) with a SVM and then visualize the result? I want to classify a data set (which has five classes) using the SVM method.
I read the "A Practical Guide to Support Vector Classication" article and I saw some examples. My dataset is kdd99. I wrote the following code:
%% Load Data
[data,colNames] = xlsread('TarainingDataset.xls');
groups = ismember(colNames(:,42),'normal.');
TrainInputs = data;
TrainTargets = groups;
%% Design SVM
C = 100;
svmstruct = svmtrain(TrainInputs,TrainTargets,...
'boxconstraint',C,...
'kernel_function','rbf',...
'rbf_sigma',0.5,...
'showplot','false');
%% Test SVM
[dataTset,colNamesTest] = xlsread('TestDataset.xls');
TestInputs = dataTset;
groups = ismember(colNamesTest(:,42),'normal.');
TestOutputs = svmclassify(svmstruct,TestInputs,'showplot','false');
but I don't know that how to get accuracy or mse of my classification, and I use showplot in my svmclassify but when is true, I get this warning:
The display option can only plot 2D training data
Could anyone please help me?
I recommend you to use another SVM toolbox,libsvm. The link is as follow:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
After adding it to the path of matlab, you can train and use you model like this:
model=svmtrain(train_label,train_feature,'-c 1 -g 0.07 -h 0');
% the parameters can be modified
[label, accuracy, probablity]=svmpredict(test_label,test_feaure,model);
train_label must be a vector,if there are more than two kinds of input(0/1),it will be an nSVM automatically.
train_feature is n*L matrix for n samples. You'd better preprocess the feature before using it. In the test part, they should be preprocess in the same way.
The accuracy you want will be showed when test is finished, but it's only for the whole dataset.
If you need the accuracy for positive and negative samples separately, you still should calculate by yourself using the label predicted.
Hope this will help you!
Your feature space has 41 dimensions, plotting more that 3 dimensions is impossible.
In order to better understand your data and the way SVM works is to begin with a linear SVM. This tybe of SVM is interpretable, which means that each of your 41 features has a weight (or 'importance') associated with it after training. You can then use plot3() with your data on 3 of the 'best' features from the linear svm. Note how well your data is separated with those features and choose a basis function and other parameters accordingly.

what "target" do i put in iris dataset nntool matlab?

I am new in using matlab so this might be easy. I am trying to make an iris dataset neural network in matlab using nntool(feed-forward back propagation network). but i cant find out what the target matrix should be. I also am trying to find (tried to create but still did nothing) a code for programming the same thing instead of using nntools.
Can anyone help me out?
The targets are the correct class labels. However, the Fisher iris dataset in Matlab has its target data in an cell array of strings (species), while nntool wants a numerical vector. So you'll have to convert it.
clear all;
load('fisheriris');
classnames = unique(species);
targets = zeros(1, numel(species));
for i = 1:3
class(strcmp(species, classnames{i})) = i;
end
You now have a vector targets that can be loaded in nntool.