I'm actually training an ANN on MATLAB to optimize a pump. I've got 2000 samples as input of the design of the pump and as an output the efficiency. I've got some good results, but now I want to retrain the model. I want to rearrange the weight per sample such that the samples with better efficiency have a higher weight than the small efficiency.
How can I weigh my samples by efficiency?
Here is a piece of my code:
Mdl_NN1 = fitnet([6 4],training);
Mdl_NN1.layers{2}.transferFcn = 'purelin';
Mdl_NN1.divideParam.trainRatio = 70/100;
Mdl_NN1.divideParam.valRatio = 15/100;
Mdl_NN1.divideParam.testRatio = 15/100;
Mdl_NN1.trainParam.showWindow = true;
[Mdl_NN1,TR] = train(Mdl_NN1,XtrainSet',YtrainSet(:,2)')
The Xtrainset is the design of the pump in 6 parameters and the YtrainSet is just the efficiency.
Its all about curating your data (this applies in particular for NNs). So if you want to weight certain samples more than others, duplicate them for training and remove others, which are wrong/false. This is better than fiddling with the weights or changing the portions of train/test/validation.
One word of warning: if you duplicate data, make sure that it only appears in the training set. You will get unreliable accuracy if you consider the same examples for training and testing. So you might need to set the training/testing/validation data explicitly. Have a look on divide data for optimal neural network training in the docs.
Mdl_NN1.divideFcn = 'divideind';
Mdl_NN1.divideParam.trainInd = % vector with indices
Mdl_NN1.divideParam.testInd = % vector with indices
Mdl_NN1.divideParam.valInd = % vector with indices
or may be sufficient to set if the data is ordered properly
Mdl_NN1.divideFcn = 'divideblock';
Related
I am trying to use bag of words and fitcecoc() (multiclass SVM) to reproduce similar results to those obtained by using Image Category classifier
as seen in the documentation.
% Code from documentation
bag = bagOfFeatures(trainingSet); % create bag of features from trainingSet (an image datastore)
categoryClassifier = trainImageCategoryClassifier(trainingSet, bag);
confMatrix = evaluate(categoryClassifier, validationSet);
This returns accuracy of ~98% on the validation set.
However when I pass the histogram of visual word occurrences into the multiclass SVM classifier it has ~2.5% accuracy.
SVM_SURF = fitcecoc(trainFeatures,trainingSet.Labels);
bag = bagOfFeatures(validationSet);
featureMatrix = encode(bag, validationSet); % histogram of visual word occurrences
[pred score cost] = predict(SVM_SURF, featureMatrix)
accuracy = sum(validationSet.Labels == pred)/size(validationSet.Labels,1);
accuracy
Is there an obvious reason as to why the accuracy is so much lower when bag of words is passed into fitcecoc() rather than trainImageCategoryClassifier()?
The fitcecoc classifier is a multi-purpose (image, financial data, ...`) classifier. By configuring the kernel you can get better accuracy rates. However, traditionally, the fitcecoc function gives much better results if you increase the training data.
I'm trying to classify a testset using GMM. I have a trainset (n*4 matrix) with labels {1,2,3}, n means the number of training examples, which have 4 properties. And I also have a testset (m*4) to be classified.
My goal is to have a probability matrix (m*3) for each testing example giving each label P(x_test|labels). Just like soft clustering.
first, I create a GMM with k=9 components over the whole trainset. I know in some papers, the author create a GMM for each label in trainset. But I want to deal with the data from all of the classes.
GMModel = fitgmdist(trainset,k_component,'RegularizationValue',0.1,'Start','plus');
My problem is, I want to confirm the relationship P(component|labels)between components and labels. So I write a code as below, but not sure if it's right,
idx_ex_of_c1 = find(trainset_label==1);
idx_ex_of_c2 = find(trainset_label==2);
idx_ex_of_c3 = find(trainset_label==3);
[~,~,post] = cluster(GMModel,trainset);
cita_c_k = zeros(3,k_component);
for id_k = 1:k_component
cita_c_k(1,id_k) = sum(post(idx_ex_of_c1,id_k))/numel(idx_ex_of_c1);
cita_c_k(2,id_k) = sum(post(idx_ex_of_c2,id_k))/numel(idx_ex_of_c2);
cita_c_k(3,id_k) = sum(post(idx_ex_of_c3,id_k))/numel(idx_ex_of_c3);
end
cita_c_k is a (3*9) matrix to store the relationships. idx_ex_of_c1 is the index of examples, whose label is '1' in the trainset.
For the testing process. I first apply the GMModel to testset
[P,~] = posterior(GMModel,testset); % P is a m*9 matrix
And then, sum all components,
P_testset = P*cita_c_k';
[a,b] = max(P_testset,3);
imagesc(b);
The result is ok, But not good enough. Can anyone give me some tips?
Thanks!
You can take following steps:
Increase target error and/or use optimal network size in training, but over-training and network size increase usually won't help
Most important, shuffle training data while training and use only important data points for a label to train (ignore data points that may belong to more than one labels)
SEPARABILITY
Verify separability of data using properties using correlation.
Correlation of all data in a label (X) should be high (near to one)
Cross-correlation of all data in label (X) with data in label (!=X) should be low (near zero).
If you observe that data points in a label have low correlation and data points across labels have high correlation - It puts a question on selection of properties (there could be properties which actually won't make data separable). Being so do follows:
Add more relevant properties to data points and remove less relevant properties (technique to use this is PCA)
Use derived parameters like top frequency component etc. from data points to train rather than direct points
Use a time delay network to train time series (always)
I have two gaussian distribution samples, one guassian contains 10,000 samples and the other gaussian also contains 10,000 samples, I would like to train a feed-forward neural network with these samples but I dont know how many samples I have to take in order to get an optimal decision boundary.
Here is the code but I dont know exactly the solution and the output are weirds.
x1 = -49:1:50;
x2 = -49:1:50;
[X1, X2] = meshgrid(x1, x2);
Gaussian1 = mvnpdf([X1(:) X2(:)], mean1, var1);// for class A
Gaussian2 = mvnpdf([X1(:) X2(:)], mean2, var2);// for Class B
net = feedforwardnet(10);
G1 = reshape(Gaussian1, 10000,1);
G2 = reshape(Gaussian2, 10000,1);
input = [G1, G2];
output = [0, 1];
net = train(net, input, output);
When I ran the code it give me weird results.
If the code is not correct, can someone please suggest me so that I can get a decision boundary for these two distributions.
I'm pretty sure that the input must be the Gaussian distribution (and not the x coordinates). In fact the NN has to understand the relationship between the phenomenons themselves that you are interested (the Gaussian distributions) and the output labels, and not between the space in which are contained the phenomenons and the labels. Moreover, If you choose the x coordinates, the NN will try to understand some relationship between the latter and the output labels, but the x are something of potentially constant (i.e., the input data might be even all the same, because you can have very different Gaussian distribution in the same range of the x coordinates only varying the mean and the variance). Thus the NN will end up being confused, because the same input data might have more output labels (and you don't want that this thing happens!!!).
I hope I was helpful.
P.S.: for doubt's sake I have to tell you that the NN doesn't fit very well the data if you have a small training set. Moreover don't forget to validate your data model using the cross-validation technique (a good rule of thumb is to use a 20% of your training set for the cross-validation set and another 20% of the same training set for the test set and thus to use only the remaining 60% of your training set to train your model).
I have a dataset of 43 examples (data points) and 70'000 features, that means my dataset matrix is (43 x 70'000). The labels contains 4 different values (1-4), i.e. there are 4 classes.
Now, I have done classification with a Deep Belief Network / Neural Network but I'm getting only accuracy of around 25% (chance level) with leave-one-out cross-validation. If I'm using kNN, SVM etc. I'm getting >80% accuracy.
I have used the DeepLearnToolbox for Matlab (https://github.com/rasmusbergpalm/DeepLearnToolbox) and just adapted the Deep Belief Network example from the readme of the toolbox. I have tried different number of hidden layers (1-3) and different number of hidden nodes (100, 500,...) as well as different learning rates, momentum etc but accuracy is still very bad. The feature vectors are scaled to the range [0,1] because this is needed by the toolbox.
In detail I have done the following code (only showing one run of cross-validation):
% Indices of training and test set
train = training(c,m);
test = ~train;
% Train DBN
opts = [];
dbn = [];
dbn.sizes = [500 500 500];
opts.numepochs = 50;
opts.batchsize = 1;
opts.momentum = 0.001;
opts.alpha = 0.15;
dbn = dbnsetup(dbn, feature_vectors_std(train,:), opts);
dbn = dbntrain(dbn, feature_vectors_std(train,:), opts);
%unfold dbn to nn
nn = dbnunfoldtonn(dbn, 4);
nn.activation_function = 'sigm';
nn.learningRate = 0.15;
nn.momentum = 0.001;
%train nn
opts.numepochs = 50;
opts.batchsize = 1;
train_labels = labels(train);
nClass = length(unique(train_labels));
L = zeros(length(train_labels),nClass);
for i = 1:nClass
L(train_labels == i,i) = 1;
end
nn = nntrain(nn, feature_vectors_std(train,:), L, opts);
class = nnpredict(nn, feature_vectors_std(test,:));
feature_vectors_std is the (43 x 70'000) matrix with values scaled to [0,1].
Can somebody infer why I'm getting such bad accuracy?
Because you have much more features than examples in the dataset. In other words: you have big number of weights, and you need to estimate all of them, but you can't, because NN with such huge structure cannot generalize well on so small dataset, you need more data to learn such big number of hidden weights (In fact NN may memorize your training set, but cannot infer it's "knowledge" to test set). At the same time 80% accuracy with such simple methods as SVM and kNN indicates that you can describe your data with much simpler rules, because for example SVM will have only 70k weights (Instead of 70kfirst_layer_size + first_layer_sizesecond_layer_size + ... in NN), kNN will not use weights at all.
Complex model is not silver bullet, the more complex model you trying to fit - the more data you need.
Obviousely your dataset is too small than the complexcity of your network. reference from there
The complexity of a neural network can be expressed through the number
of parameters. In the case of deep neural networks, this number can be
in the range of millions, tens of millions and in some cases even
hundreds of millions. Let’s call this number P. Since you want to be
sure of the model’s ability to generalize, a good rule of a thumb for
the number of data points is at least P*P.
While the KNN and SVM is simpler,they don't need that much of data.
So they can work better.
I am working on people detecting using two different features HOG and LBP. I used SVM to train the positive and negative samples. Here, I wanna ask how to improve the accuracy of SVM itself? Because, everytime I added up more positives and negatives sample, the accuracy is always decreasing. Currently my positive samples are 1500 and negative samples are 700.
%extract features
[fpos,fneg] = features(pathPos, pathNeg);
%train SVM
HOG_featV = loadingV(fpos,fneg); % loading and labeling each training example
fprintf('Training SVM..\n');
%L = ones(length(SV),1);
T = cell2mat(HOG_featV(2,:));
HOGtP = HOG_featV(3,:)';
C = cell2mat(HOGtP); % each row of P correspond to a training example
%extract features from LBP
[LBPpos,LBPneg] = LBPfeatures(pathPos, pathNeg);
LBP_featV = loadingV(LBPpos, LBPneg);
LBPlabel = cell2mat(LBP_featV(2,:));
LBPtP = LBP_featV(3,:);
M = cell2mat(LBPtP)'; % each row of P correspond to a training example
featureVector = [C M];
model = svmlearn(featureVector, T','-t 2 -g 0.3 -c 0.5');
Anyone knows how to find best C and Gamma value for improving SVM accuracy?
Thank you,
To find best C and Gamma value for improving SVM accuracy you typically perform cross-validation. In sum you can leave-one-out (1 sample) and test the VBM for that sample using the different parameters (2 parameters define a 2d grid). Typically you would test each decade of the parameters for a certain range. For example: C = [0.01, 0.1, 1, ..., 10^9]; G= [1^-5, 1^-4, ..., 1000]. This should also improve your SVM accuracy by optimizing the hyper-parameters.
By looking again to your question it seems you are using the svmlearn of the machine learning toolbox (statistics toolbox) of Matlab. Therefore you have already built-in functions for cross-validation. Give a look at: http://www.mathworks.co.uk/help/stats/support-vector-machines-svm.html
I followed ASantosRibeiro's method to optimize the parameters before and it works well.
In addition, you could try to add more negative samples until the proportion of the negative and positive reach 2:1. The reason is that when you implement real-time application, you should scan the whole image step by step and commonly the negative extracted samples would be much more than the people-contained samples.
Thus, add more negative training samples is a quite straightforward but effective way to improve overall accuracy(Both false positive and true negative).