Data augmentation techniques for general datasets? - matlab

I am working on a machine learning problem and want to build neural-network-based classifiers for it in MATLAB. One problem is that the data is given in the form of features and the number of samples is quite small. I know about data augmentation techniques for images, such as rotation, translation, and affine transformation.
I would like to know whether there are data augmentation techniques for general datasets. For example, is it possible to use randomness to generate more data? I read the answer here but I did not understand it.
Please provide answers with working details if possible.
Any help will be appreciated.

You need to look into autoencoders. Effectively, you pass your data into a low-level neural network, which applies a PCA-like analysis, and you can subsequently use it to generate more data.
MATLAB has an autoencoder class as well as a function that will do all of this for you. From the MATLAB help files:
Generate the training data.
rng(0,'twister'); % For reproducibility
n = 1000;
r = linspace(-10,10,n)';
x = 1 + r*5e-2 + sin(r)./r + 0.2*randn(n,1);
Train autoencoder using the training data.
hiddenSize = 25;
autoenc = trainAutoencoder(x',hiddenSize,...
'EncoderTransferFunction','satlin',...
'DecoderTransferFunction','purelin',...
'L2WeightRegularization',0.01,...
'SparsityRegularization',4,...
'SparsityProportion',0.10);
Generate the test data.
n = 1000;
r = sort(-10 + 20*rand(n,1));
xtest = 1 + r*5e-2 + sin(r)./r + 0.4*randn(n,1);
Predict the test data using the trained autoencoder, autoenc .
xReconstructed = predict(autoenc,xtest');
Plot the actual test data and the predictions.
figure;
plot(xtest,'r.');
hold on
plot(xReconstructed,'go');
You can see the green circles, which represent additional data generated with the autoencoder.
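As a simpler alternative to the autoencoder route, and closer to the "use randomness" idea in the question, one common approach for feature data is noise jittering: each sample is copied a few times with small Gaussian noise added to every feature. The sketch below is not the method from the answer above. The data, `copiesPerSample`, and `noiseLevel` are hypothetical values chosen for illustration, and whether jittering helps depends on the dataset.

```matlab
rng(0, 'twister');                 % for reproducibility

% Hypothetical small feature dataset: 20 samples (rows), 5 features (columns)
X = randn(20, 5);
y = [zeros(10, 1); ones(10, 1)];   % hypothetical class labels

copiesPerSample = 4;               % assumed: number of noisy copies per sample
noiseLevel      = 0.05;            % assumed: noise std as a fraction of each feature's std

featureStd = std(X, 0, 1);         % 1-by-5 vector with the spread of each feature
Xrep = repmat(X, copiesPerSample, 1);
yrep = repmat(y, copiesPerSample, 1);

% Add zero-mean Gaussian noise, scaled separately for each feature
noise = bsxfun(@times, randn(size(Xrep)), noiseLevel * featureStd);
Xaug  = [X; Xrep + noise];         % original samples followed by the noisy copies
yaug  = [y; yrep];                 % each copy keeps the label of its source sample

fprintf('Original: %d samples, augmented: %d samples\n', size(X, 1), size(Xaug, 1));
```

The noise level should stay small relative to the distance between classes, otherwise the copies may no longer deserve the label they inherit.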

Related

Training Neural network to predict sin(x) matlab

I have been trying for 3 days to train various neural networks to predict the sin(x) function. I'm using MATLAB 2016b (I have to work with it for my assignment).
What I did:
change the layers
duplicate the dataset (big, small)
add/subtract periods
shuffle the data
change the number of neurons per layer
change the learning function
change the transfer function and map the target
None of this gave a good prediction. Can anyone explain what I'm doing wrong?
It would also be very helpful if you could recommend a good book (on "preparing a dataset before training", "choosing the best NN structure for your project", ...
or any other book that seems helpful).
My actual code (I'm using nntool for the training):
%% input and target
input = 0:pi/100:8*pi;
target = sin(input) ;
plot(input,sin(input)),
hold on,
inputA = input;
targetA = target;
plot(inputA,targetA),
hold on,
%simulate input
output=sim(network2,inputA);
plot(inputA,output,'or')
hold off

Feedforward neural network classification in Matlab

I have samples from two Gaussian distributions, 10,000 samples from each. I would like to train a feed-forward neural network with these samples, but I don't know how many samples I have to take in order to get an optimal decision boundary.
Here is the code, but I don't know the exact solution and the output is weird.
x1 = -49:1:50;
x2 = -49:1:50;
[X1, X2] = meshgrid(x1, x2);
Gaussian1 = mvnpdf([X1(:) X2(:)], mean1, var1); % for class A
Gaussian2 = mvnpdf([X1(:) X2(:)], mean2, var2); % for class B
net = feedforwardnet(10);
G1 = reshape(Gaussian1, 10000,1);
G2 = reshape(Gaussian2, 10000,1);
input = [G1, G2];
output = [0, 1];
net = train(net, input, output);
When I ran the code, it gave me weird results.
If the code is not correct, can someone please suggest a fix so that I can get a decision boundary for these two distributions?
I'm pretty sure that the input must be the Gaussian distribution values (and not the x coordinates). The NN has to learn the relationship between the phenomena you are interested in (the Gaussian distributions) and the output labels, not between the space that contains those phenomena and the labels. Moreover, if you choose the x coordinates, the NN will try to learn a relationship between them and the output labels, but the x coordinates are potentially constant (i.e., the input data might even be all the same, because you can have very different Gaussian distributions over the same range of x coordinates just by varying the mean and the variance). The NN will then end up confused, because the same input might map to more than one output label (and you don't want that to happen!).
I hope this was helpful.
P.S.: Just in case, I should mention that an NN doesn't fit the data very well if you have a small training set. Also, don't forget to validate your model using cross-validation (a good rule of thumb is to use 20% of your data for the cross-validation set, another 20% for the test set, and only the remaining 60% to train your model).
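The 60/20/20 rule of thumb above can be sketched as a random hold-out split. This is a minimal illustration in base MATLAB. The data `X`, `y` and all variable names are hypothetical, and it is a single split rather than k-fold cross-validation.

```matlab
rng(0, 'twister');                     % for reproducibility

% Hypothetical dataset: 100 samples (rows), 4 features, binary labels
X = randn(100, 4);
y = randi([0 1], 100, 1);

n      = size(X, 1);
perm   = randperm(n);                  % random ordering of the sample indices
nTrain = round(0.6 * n);               % 60% for training
nVal   = round(0.2 * n);               % 20% for validation, the rest for testing

trainIdx = perm(1:nTrain);
valIdx   = perm(nTrain + 1 : nTrain + nVal);
testIdx  = perm(nTrain + nVal + 1 : end);

Xtrain = X(trainIdx, :);  ytrain = y(trainIdx);
Xval   = X(valIdx, :);    yval   = y(valIdx);
Xtest  = X(testIdx, :);   ytest  = y(testIdx);

fprintf('train/val/test sizes: %d/%d/%d\n', numel(trainIdx), numel(valIdx), numel(testIdx));
```

The validation part is used for choosing the network size and other settings, and the test part is touched only once, for the final estimate.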

Multi-Data of K-means and SVM

I generate multivariate data with mvnrnd. I would like to use k-means to cluster the data into 2 groups. I also want to know the accuracy of k-means, but I don't know how to calculate it. How can I know the correct clustering to compare the result against and get the accuracy?
I have the multivariate data and the classes, so I know I can use an SVM. However, the accuracy of the SVM was too low, about 72% to 83%. I might have made some mistakes. I would like to hear some feedback. Thanks.
n = 1;
mu1 = [0,0,0]; mu2 = [1,1,1]; mu3 = [2,2,2]; mu4 = [3,3,3];
m = 0.9;
s = [1 m m; m 1 m; m m 1];
data1 = mvnrnd(mu1,s,1000);
data2 = mvnrnd(mu2,s,1000);
data3 = mvnrnd(mu3,s,1000);
data4 = mvnrnd(mu4,s,1000);
all_data = [data1;data2;data3;data4];
[idx,ctrs,sumD,D] = kmeans(all_data,2,'distance','sqE','start','sample');
model = svmtrain(idx,all_data);
mu7 = [0,0,0]; mu8 = [1,1,1]; mu9 = [2,2,2]; mu10 = [3,3,3];
data7 = mvnrnd(mu7,s,1000); data8 = mvnrnd(mu8,s,1000);
data9 = mvnrnd(mu9,s,1000); data10 = mvnrnd(mu10,s,1000);
test_data = [data7;data8;data9;data10];
value = svmpredict(idx,test_data,model);
I want to know where the mistakes in my code are. I don't know why my accuracy is so low. I really want to improve my code. Thanks!
To calculate the accuracy of the k-means algorithm, you need a priori knowledge, i.e. a reference class vector, like the one you have for the SVM.
Although k-means is an unsupervised learning algorithm and you don't need a class vector to cluster the data, you do need one to calculate the accuracy.
I also have doubts about the way you build the SVM model. You use the indices calculated by the k-means algorithm, which has its own accuracy (not known at this point). You are using generated data for classification, so why not create your own vector of classes?
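To make the point about the reference class vector concrete, here is a minimal sketch of scoring a two-cluster result. Cluster numbers are arbitrary, so cluster 1 may correspond to either class. The sketch therefore tries both assignments and keeps the better one. The vectors `trueLabels` and `idx` are small hypothetical examples. In practice `idx` would come from `kmeans` and `trueLabels` would be created when the data is generated.

```matlab
% Hypothetical reference classes and cluster indices for six samples
trueLabels = [1 1 1 2 2 2]';   % known classes (1 or 2)
idx        = [2 2 1 1 1 1]';   % cluster numbers (1 or 2), as kmeans(data, 2) would return

% Cluster numbering is arbitrary: evaluate both possible label assignments
accDirect  = mean(idx == trueLabels);         % cluster k is taken as class k
accSwapped = mean((3 - idx) == trueLabels);   % cluster 1 <-> class 2, cluster 2 <-> class 1

accuracy = max(accDirect, accSwapped);
fprintf('k-means accuracy: %.1f%%\n', 100 * accuracy);
```

With more than two clusters, every permutation of the labels (or a majority vote inside each cluster) would have to be considered instead of the single swap used here.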

Image Classification using gist and SVM training

I'd like to begin by saying that I'm really new to CV, and there may be some obvious things I didn't think about, so don't hesitate to mention anything of that category.
I am trying to achieve scene classification, currently between indoor and outdoor images for simplicity.
My idea to achieve this is to use a gist descriptor, which creates a vector with certain parameters of the scene.
In order to obtain reliable classification, I used indoor and outdoor images, 100 samples each, used a gist descriptor, created a training matrix out of them, and used 'svmtrain' on it. Here's a pretty simple code that shows how I trained the gist vectors:
train_label = zeros(200,1); % pre-allocate one label per training image
train_label(1:100,1) = 0; % 0 = indoor
train_label(101:200,1) = 1; % 1 = outdoor
training_mat(1:100,:) = gist_indoor1;
training_mat(101:200,:) = gist_outdoor1;
test_mat = gist_test;
SVMStruct = svmtrain(training_mat ,train_label, 'kernel_function', 'rbf', 'rbf_sigma', 0.6);
Group = svmclassify(SVMStruct, test_mat);
The problem is that the results are pretty bad.
I read that optimizing the box constraint and gamma parameters of the 'rbf' kernel should improve the classification, but:
I'm not sure how to optimize with multidimensional data vectors (the optimization example on the MathWorks site is in 2D while mine has 512 dimensions). Any suggestions on how to begin?
I might be going in completely the wrong direction; please tell me if so.
Edit:
Thanks Darkmoor! I'll try calibrating using this toolbox, and maybe try to improve my feature extraction.
Hopefully when I have a working classification, I'll post it here.
Edit 2: Forgot to update, by obtaining gist descriptors of indoor and urban outdoor images from the SUN database, and training with optimized parameters by using the libsvm toolbox, I managed to achieve a classification rate of 95% when testing the model on pictures from my apartment and the street outside.
I did the same with urban outdoor scenes and natural scenes from the database, and achieved similar accuracy when testing on various scenes from my country.
The code I used to create the data matrices is taken from here, with very minor modifications:
% GIST Parameters:
clear param
param.imageSize = [256 256]; % set a normalized image size
param.orientationsPerScale = [8 8 8 8]; % number of orientations per scale (from HF to LF)
param.numberBlocks = 4;
param.fc_prefilt = 4;
%Obtain images from folders
sdirectory = 'C:\Documents and Settings\yotam\My Documents\Scene_Recognition\test_set\indoor&outdoor_test';
jpegfiles = dir([sdirectory '/*.jpg']);
% Pre-allocate gist:
Nfeatures = sum(param.orientationsPerScale)*param.numberBlocks^2;
gist = zeros([length(jpegfiles) Nfeatures]);
% Load first image and compute gist:
filename = [sdirectory '/' jpegfiles(1).name];
img = imresize(imread(filename),param.imageSize);
[gist(1, :), param] = LMgist(img, '', param); % first call
% Loop:
for i = 2:length(jpegfiles)
    filename = [sdirectory '/' jpegfiles(i).name];
    img = imresize(imread(filename),param.imageSize);
    gist(i, :) = LMgist(img, '', param); % the next calls will be faster
end
I suggest you use libsvm; it is very efficient. There is a relevant post on cross-validation with libsvm. The same logic can be used for the MATLAB library you mention.
Your logic is correct: extract features and try to classify them. In any case, do not expect that calibrating your classifier will make a huge difference. Feature extraction is what makes the big difference in your results, in combination, of course, with classifier calibration ;).
Good luck.
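On the question of how to tune with 512-dimensional vectors: the search runs over the two hyperparameters (C and gamma), not over the features, so the dimensionality of the data does not change the procedure. The sketch below assumes the libsvm MATLAB interface is on the path, so that `svmtrain` is libsvm's function and not the toolbox function of the same name. When libsvm's `svmtrain` is called with the `-v` option, it returns the cross-validation accuracy instead of a model. The data here is a hypothetical stand-in for the gist matrix, and the exponent ranges follow the coarse grid suggested in the libsvm guide.

```matlab
% Assumption: libsvm's MATLAB interface (svmtrain/svmpredict) is on the path.
rng(0, 'twister');

% Hypothetical stand-in for the 200-by-512 gist training matrix and its labels
training_mat = [randn(100, 512); randn(100, 512) + 0.2];
train_label  = [zeros(100, 1); ones(100, 1)];

bestAcc = -Inf;
for log2c = -5:2:15                  % coarse grid over C = 2^-5 ... 2^15
    for log2g = -15:2:3              % coarse grid over gamma = 2^-15 ... 2^3
        opts = sprintf('-s 0 -t 2 -c %g -g %g -v 5 -q', 2^log2c, 2^log2g);
        acc  = svmtrain(train_label, training_mat, opts);  % 5-fold CV accuracy in percent
        if acc > bestAcc
            bestAcc = acc;
            bestC   = 2^log2c;
            bestG   = 2^log2g;
        end
    end
end

% Retrain on the whole training set with the best pair found
model = svmtrain(train_label, training_mat, sprintf('-s 0 -t 2 -c %g -g %g -q', bestC, bestG));
fprintf('best C = %g, best gamma = %g, CV accuracy = %.1f%%\n', bestC, bestG, bestAcc);
```

A finer grid around the best pair can follow the coarse pass if the accuracy surface looks promising there.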

How to use SVM in Matlab?

I am new to Matlab. Is there any sample code for classifying some data (with 41 features) with a SVM and then visualize the result? I want to classify a data set (which has five classes) using the SVM method.
I read the "A Practical Guide to Support Vector Classification" article and saw some examples. My dataset is kdd99. I wrote the following code:
%% Load Data
[data,colNames] = xlsread('TarainingDataset.xls');
groups = ismember(colNames(:,42),'normal.');
TrainInputs = data;
TrainTargets = groups;
%% Design SVM
C = 100;
svmstruct = svmtrain(TrainInputs,TrainTargets,...
'boxconstraint',C,...
'kernel_function','rbf',...
'rbf_sigma',0.5,...
'showplot','false');
%% Test SVM
[dataTset,colNamesTest] = xlsread('TestDataset.xls');
TestInputs = dataTset;
groups = ismember(colNamesTest(:,42),'normal.');
TestOutputs = svmclassify(svmstruct,TestInputs,'showplot','false');
But I don't know how to get the accuracy or MSE of my classification. I also use showplot in svmclassify, but when it is true, I get this warning:
The display option can only plot 2D training data
Could anyone please help me?
I recommend using another SVM toolbox, libsvm. The link is as follows:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
After adding it to the MATLAB path, you can train and use your model like this:
model=svmtrain(train_label,train_feature,'-c 1 -g 0.07 -h 0');
% the parameters can be modified
[label, accuracy, probability]=svmpredict(test_label,test_feature,model);
train_label must be a vector; if it contains more than two classes (not just 0/1), a multi-class SVM is trained automatically.
train_feature is an n*L matrix for n samples. You should preprocess the features before using them. The test features should be preprocessed in the same way.
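"Preprocessed in the same way" means the scaling parameters are computed on the training features only and then reused for the test features. The sketch below uses z-score standardization, which is an assumption, since the answer does not say which preprocessing to use. The data and the names `train_scaled` and `test_scaled` are hypothetical.

```matlab
rng(0, 'twister');

% Hypothetical feature matrices: rows are samples, columns are features
train_feature = 10 * randn(50, 3) + 5;
test_feature  = 10 * randn(20, 3) + 5;

% Scaling parameters come from the training set only
mu    = mean(train_feature, 1);
sigma = std(train_feature, 0, 1);
sigma(sigma == 0) = 1;             % guard against constant features

% Apply the same shift and scale to both sets
train_scaled = bsxfun(@rdivide, bsxfun(@minus, train_feature, mu), sigma);
test_scaled  = bsxfun(@rdivide, bsxfun(@minus, test_feature,  mu), sigma);
```

Computing separate statistics on the test set would put the two sets on different scales and make the test accuracy misleading.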
The accuracy you want will be shown when testing is finished, but it is only for the whole dataset.
If you need the accuracy for positive and negative samples separately, you have to calculate it yourself from the predicted labels.
Hope this will help you!
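Computing the accuracy by hand from the predicted labels takes only a few lines. In this sketch, `test_label` and `label` are small hypothetical vectors standing in for the true labels and the output of the classifier, with 1 as the positive class and 0 as the negative class.

```matlab
% Hypothetical true and predicted labels (1 = positive, 0 = negative)
test_label = [1 1 1 1 0 0 0 0 0 0]';
label      = [1 1 1 0 0 0 0 0 1 1]';

overallAcc = mean(label == test_label);          % fraction of all samples classified correctly
posAcc     = mean(label(test_label == 1) == 1);  % accuracy on the positive samples only
negAcc     = mean(label(test_label == 0) == 0);  % accuracy on the negative samples only

fprintf('overall %.2f, positive %.2f, negative %.2f\n', overallAcc, posAcc, negAcc);
```

The same comparison works for the output of svmclassify, as long as the predicted and true labels use the same coding.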
Your feature space has 41 dimensions; plotting more than 3 dimensions is impossible.
A good way to better understand your data and the way an SVM works is to begin with a linear SVM. This type of SVM is interpretable, which means that each of your 41 features has a weight (or 'importance') associated with it after training. You can then use plot3() with your data on 3 of the 'best' features from the linear SVM. Note how well your data is separated with those features and choose a basis function and other parameters accordingly.
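A sketch of that idea follows. It assumes the Statistics and Machine Learning Toolbox, where `fitcsvm` with a linear kernel exposes the per-feature weights in the `Beta` property. The data is hypothetical (two classes, 41 features, three of them made informative on purpose), whereas the dataset in the question has five classes, so this only illustrates the binary case.

```matlab
% Assumption: Statistics and Machine Learning Toolbox is available (fitcsvm).
rng(0, 'twister');

% Hypothetical data: 200 samples, 41 features, two classes
X = randn(200, 41);
y = [zeros(100, 1); ones(100, 1)];
X(y == 1, [3 17 30]) = X(y == 1, [3 17 30]) + 3;   % make three features informative

% Linear SVM; standardizing puts the feature weights on a comparable scale
Mdl = fitcsvm(X, y, 'KernelFunction', 'linear', 'Standardize', true);

% Rank features by the magnitude of their weight and keep the top three
[~, order] = sort(abs(Mdl.Beta), 'descend');
top3 = order(1:3);

% Plot the two classes in the space of the three highest-weighted features
figure;
plot3(X(y == 0, top3(1)), X(y == 0, top3(2)), X(y == 0, top3(3)), 'b.');
hold on
plot3(X(y == 1, top3(1)), X(y == 1, top3(2)), X(y == 1, top3(3)), 'r.');
grid on
xlabel(sprintf('feature %d', top3(1)));
ylabel(sprintf('feature %d', top3(2)));
zlabel(sprintf('feature %d', top3(3)));
legend('class 0', 'class 1');
```

If the classes already look well separated in this plot, a linear kernel may be enough. Heavy overlap suggests trying a nonlinear kernel such as RBF.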