Feedforward neural network classification in Matlab

I have two Gaussian distributions, each with 10,000 samples. I would like to train a feed-forward neural network with these samples, but I don't know how many samples I have to take in order to get an optimal decision boundary.
Here is the code, but I don't know exactly what the solution should be, and the outputs are weird.
x1 = -49:1:50;
x2 = -49:1:50;
[X1, X2] = meshgrid(x1, x2);
Gaussian1 = mvnpdf([X1(:) X2(:)], mean1, var1); % for class A
Gaussian2 = mvnpdf([X1(:) X2(:)], mean2, var2); % for class B
net = feedforwardnet(10);
G1 = reshape(Gaussian1, 10000,1);
G2 = reshape(Gaussian2, 10000,1);
input = [G1, G2];
output = [0, 1];
net = train(net, input, output);
When I ran the code it gave me weird results.
If the code is not correct, can someone please suggest how I can get a decision boundary for these two distributions?

I'm pretty sure that the input must be the Gaussian distribution values (and not the x coordinates). The NN has to learn the relationship between the phenomena you are interested in (the Gaussian distributions) and the output labels, not between the space that contains the phenomena and the labels. Moreover, if you choose the x coordinates, the NN will try to learn some relationship between them and the output labels, but the x values are potentially constant: the input data might even be all the same, because you can have very different Gaussian distributions over the same range of x coordinates just by varying the mean and the variance. The NN would then end up confused, because the same input data could map to more than one output label (and you don't want that to happen!).
I hope I was helpful.
P.S.: for completeness, I should tell you that an NN doesn't fit the data very well if you have a small training set. Also, don't forget to validate your model using cross-validation (a good rule of thumb is to use 20% of your data as a validation set and another 20% as a test set, and thus to use only the remaining 60% to train your model).
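As a minimal sketch of that 60/20/20 split with feedforwardnet (the divide properties below are part of MATLAB's Neural Network Toolbox; inputs and targets are placeholder names for your features-by-samples matrix and label row vector):
net = feedforwardnet(10);
net.divideFcn = 'dividerand';        % split the samples at random
net.divideParam.trainRatio = 0.60;   % 60% for training
net.divideParam.valRatio   = 0.20;   % 20% for validation (early stopping)
net.divideParam.testRatio  = 0.20;   % 20% held out for testing
net = train(net, inputs, targets);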

Related

In Matlab, what does it mean to use GMM as a posterior distribution to make a supervised classifier inspired by GMM? Suggested by podludek and lejlot

I understand that GMM is not a classifier itself, but I am trying to follow the instructions of some users in the Stack Exchange posts quoted below to create a GMM-inspired classifier.
lejlot: Multiclass classification using Gaussian Mixture Models with scikit learn
"construct your own classifier where you fit one GMM per label and then use assigned probability to do actual classification. Then it is a proper classifier"
What is meant by "assigned probability" for GMM Matlab objects in the above quote and how can we input a new point to get our desired assigned probability? For a new point that we are trying to classify, my understanding is that we need to get the posterior probabilities that the new point belongs to either Gaussian and then compare these two probabilities.
From the documentation (https://www.mathworks.com/help/stats/gmdistribution.html) it looks like we only have access to the cluster centers (mu) and covariance matrices (Sigma), but not to an actual probability distribution that would take in a point and spit out a probability.
podludek: Multiclass classification using Gaussian Mixture Models with scikit learn
"GMM is not a classifier, but generative model. You can use it to a classification problem by applying Bayes theorem.....You should use GMM as a posterior distribution, one GMM per each class." -
In the Matlab documentation for posterior(gm,X), the tutorial shows us inputting X, which is already the data we used to create ("train") our GMM. But how can we get the posterior probability of being in a cluster for a new point?
https://www.mathworks.com/help/stats/gmdistribution.posterior.html
"P = posterior(gm,X) returns the posterior probability of each Gaussian mixture component in gm given each observation in X"
--> But the X used in the link above is the 'training' data used to create the GMM itself, not a new point. Also we have two gm objects, not one. How can we grab the probability a point belongs to a Gaussian?
The pseudocode below is how I envisioned a GMM-inspired classifier would work for a two-class example: I would fit a GMM to each class's cluster, as described by podludek. Then I would compute the posterior probabilities of a point being in each cluster and pick the bigger probability.
I'm aware there are issues with this conceptually (such as the two GMM objects having conflicting covariance matrices) but I've been assured by my mentor that there is a way to make a supervised version of GMM, and he wants me to make one, so here we go:
Pseudocode:
X % The training data matrix
% each new row is a new data point
% each column is new feature
% Ex: if you had 10,000 data points and 100 features for each, your matrix
% would be 10000 by 100
% Let's say we had 200 points of each class in our training data
% Grab subsets of X that corresponds to classes 1 and 2
X_only_class_2 = X(1:200,:)
X_only_class_1 = X(201:end,:)
gmfit_class_1 = fitgmdist(X_only_class_1,1,'RegularizationValue',0.1);
cov_matrix_1=gmfit_class_1.Sigma;
gmfit_class_2 = fitgmdist(X_only_class_2,1,'RegularizationValue',0.1);
cov_matrix_2=gmfit_class_2.Sigma;
% Now do some tests on data we already know the classification of to check if this is working as we would expect:
a = posterior(gmfit_class_1,X_only_class_1)
b = posterior(gmfit_class_1,X_only_class_2)
c = posterior(gmfit_class_2,X_only_class_1)
d = posterior(gmfit_class_2,X_only_class_2)
But unfortunately, computing these posteriors a, b, c, and d just results in column vectors of 1's. I'm aware these are degenerate cases (and pointless for actual classification, since we already know the classifications of our training data), but I still wanted to test them to make sure the posterior method works as I would expect.
Expected:
a = posterior(gmfit_class_1,X_only_class_1)
% ^ This produces a column vector of 1's, which I thought was fine. After all, the gmfit object was trained on those points
b = posterior(gmfit_class_1,X_only_class_2)
% ^ This one also produces a vector of 1's, which I thought was wrong. It should be a vector of low, but nonzero numbers
c = posterior(gmfit_class_2,X_only_class_1)
% ^ This one also produces a vector of 1's, which I thought was wrong. It should be a vector of low, but nonzero numbers
d = posterior(gmfit_class_2,X_only_class_2)
% ^ This produces a column vector of 1's, which I thought was fine. After all, the gmfit object was trained on those points
I have to think that Matlab is somehow being confused by the fact that each of the two fitted models contains only one cluster. Either that, or I am not interpreting the posterior method correctly.
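For reference, here is a minimal sketch of the Bayes-rule comparison the quoted answers describe, under the setup above. Note that posterior returns component memberships within a single mixture, so with one component per model it is always 1; the class comparison instead uses each model's density, which the gmdistribution pdf method provides. x_new is a hypothetical query point:
x_new = X(1,:);                    % any 1-by-n_features observation works here
prior_1 = 200/400;                 % class priors from the training counts
prior_2 = 200/400;
lik_1 = pdf(gmfit_class_1, x_new); % p(x | class 1) under the class-1 model
lik_2 = pdf(gmfit_class_2, x_new); % p(x | class 2) under the class-2 model
% Bayes rule: p(class | x) is proportional to prior * likelihood
post_1 = prior_1*lik_1 / (prior_1*lik_1 + prior_2*lik_2);
if post_1 >= 0.5
    predicted_class = 1;
else
    predicted_class = 2;
end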

Neural network y=f(x) regression

Encouraged by some success in MNIST classification I wanted to solve a "real" problem with some neural networks.
The task seems quite easy:
We have:
some x-value (e.g. 1:1:100)
some y-values (e.g. x^2)
I want to train a network with 1 input (for 1 x-value) and one output (for 1 y-value). One hidden layer.
Here is my basic procedure:
Slicing my x-values into different batches (e.g. 10 elements per batch)
In each batch calculating the outputs of the net, then applying backpropagation, calculating weight and bias updates
After each batch, averaging the calculated weight and bias updates and actually updating the weights and biases
Repeating step 1. - 3. multiple times
This procedure worked fine for MNIST, but for the regression it totally fails.
I am wondering if I do something fundamentally wrong.
I tried different batch sizes, up to averaging over ALL x values.
Basically the network does not train well. After manually tweaking the weights and biases (with 2 hidden neurons) I could approximate my y=f(x) quite well, but when the network is supposed to learn the parameters itself, it fails.
When I have just one element for x and one for y and I train the network, it trains well for this one specific pair.
Maybe somebody has a hint for me. Am I misunderstanding regression with neural networks?
So far I assume the code itself is okay, as it worked for MNIST and it works for the "one x/y pair" example. I rather think my overall approach (see above) may not be suitable for regression.
Thanks,
Jim
ps: I will post some code tomorrow...
Here comes the code (MATLAB). As I said, it's one hidden layer, with two hidden neurons:
% init hyper-parameters
hidden_neurons=2;
input_neurons=1;
output_neurons=1;
learning_rate=0.5;
batchsize=50;
sigmoid=@(z) 1./(1+exp(-z)); % logistic activation (defined here in case no sigmoid() is on the path)
% load data (d and v_start hold the raw inputs/targets, defined elsewhere)
training_data=d(1:100)/100; % scale inputs to [0,1]
training_labels=v_start(1:100)/255; % scale targets to [0,1]
% init weights
init_randomly=1;
if init_randomly
% initialize weights and bias with random numbers between -0.5 and +0.5
w1=rand(hidden_neurons,input_neurons)-0.5;
b1=rand(hidden_neurons,1)-0.5;
w2=rand(output_neurons,hidden_neurons)-0.5;
b2=rand(output_neurons,1)-0.5;
else
% initialize with manually determined values
w1=[10;-10];
b1=[-3;-0.5];
w2=[0.2 0.2];
b2=0;
end
for epochs =1:2000 % looping over some epochs
for i = 1:batchsize:length(training_data) % slice training data into batches
batch_data=training_data(i:min(i+batchsize-1,length(training_data))); % generating training batch (note the -1; without it consecutive batches overlap by one sample)
batch_labels=training_labels(i:min(i+batchsize-1,length(training_data))); % generating training label batch
% initialize weight updates for next batch
w2_update=0;
b2_update =0;
w1_update =0;
b1_update =0;
for k = 1: length(batch_data) % looping over one single batch
% extract training sample
x=batch_data(k); % extracting one single training sample
y=batch_labels(k); % extracting expected output of training sample
% forward pass
z1 = w1*x+b1; % sum of first layer
a1 = sigmoid(z1); % activation of first layer (sigmoid)
z2 = w2*a1+b2; % sum of second layer
a2=z2; %activation of second layer (linear)
% backward pass
delta_2=(a2-y); %calculating delta of second layer assuming quadratic cost; derivative of linear unit is equal to 1 for all x.
delta_1=(w2'*delta_2).* (a1.*(1-a1)); % calculating delta of first layer
% calculating the weight and bias updates averaging over one
% batch
w2_update = w2_update +(delta_2*a1') * (1/length(batch_data));
b2_update = b2_update + delta_2 * (1/length(batch_data));
w1_update = w1_update + (delta_1*x') * (1/length(batch_data));
b1_update = b1_update + delta_1 * (1/length(batch_data));
end
% actually updating the weights. Updated weights will be used in
% next batch
w2 = w2 - learning_rate * w2_update;
b2 = b2 - learning_rate * b2_update;
w1 = w1 - learning_rate * w1_update;
b1 = b1 - learning_rate * b1_update;
end
end
Here is the outcome with random initialization, showing the expected output, the output before training, and the output after training:
[figure: training with random init]
One can argue that the blue line is already closer than the black one; in that sense the network has already optimized the results. But I am not satisfied.
Here is the result with my manually tweaked values:
[figure: training with manually pre-initialized weights]
The black line is not bad for just two hidden neurons, but my expectation was rather that such a black line would be the outcome of training starting from random initialization.
Any suggestions what I am doing wrong?
Thanks!
Ok, after some research I found some interesting points:
The function I tried to learn seems particularly hard to learn (I am not sure why)
With the same setup I tried to learn some 3rd-degree polynomials, which was successful (cost < 1e-6)
Randomizing training samples seems to improve learning (for the polynomial and my initial function). I know this is well known in the literature, but I always skipped that part in implementations. So I learned for myself how important it is.
For learning "curvy/wiggly" functions I found that sigmoid works better than ReLU (the output layer is still "linear", as suggested for regression)
A learning rate of 0.1 worked fine for the curve fitting I finally wanted to perform
A larger batch size smoothens the cost-vs-epochs plot (surprise...)
Initializing weights between -5 and +5 worked better than -0.5 and +0.5 for my application
In the end I got quite convincing results for what I intended the network to learn :)
Have you tried a much smaller learning rate? Generally, learning rates of 0.001 are a good starting point; 0.5 is in most cases way too large.
Also note that your predefined weights lie in an extremely flat region of the sigmoid function (sigmoid(10) ≈ 1, sigmoid(-10) ≈ 0), with the derivative at both positions close to 0. That means that backpropagating from such a position (or getting to such a position) is extremely difficult. For exactly that reason, some people prefer to use ReLUs instead of sigmoids, since the ReLU has only a "dead" region for negative activations.
Also, am I correct in seeing that you have only 100 training samples? You could maybe try a smaller batch size, or increase the number of samples you take. Also, don't forget to shuffle your samples after each epoch; plenty of reasons are given elsewhere, for example here.
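A minimal sketch of that per-epoch shuffling, using the same variable names as the training loop in the question:
for epochs = 1:2000
    perm = randperm(length(training_data)); % new random order each epoch
    training_data = training_data(perm);    % shuffle the samples...
    training_labels = training_labels(perm);% ...and their labels identically
    % ... batch loop as before ...
end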

Why bad accuracy with neural network?

I have a dataset of 43 examples (data points) and 70'000 features, which means my dataset matrix is 43 x 70'000. The labels contain 4 different values (1-4), i.e. there are 4 classes.
Now, I have done classification with a Deep Belief Network / Neural Network, but I'm only getting an accuracy of around 25% (chance level) with leave-one-out cross-validation. If I use kNN, SVM, etc., I get >80% accuracy.
I have used the DeepLearnToolbox for Matlab (https://github.com/rasmusbergpalm/DeepLearnToolbox) and just adapted the Deep Belief Network example from the readme of the toolbox. I have tried different numbers of hidden layers (1-3) and hidden nodes (100, 500, ...), as well as different learning rates, momentum values, etc., but the accuracy is still very bad. The feature vectors are scaled to the range [0,1] because this is required by the toolbox.
In detail, I used the following code (showing only one run of the cross-validation):
% Indices of training and test set (c is assumed to be a cvpartition object, m the fold index)
train = training(c,m);
test = ~train;
% Train DBN
opts = [];
dbn = [];
dbn.sizes = [500 500 500];
opts.numepochs = 50;
opts.batchsize = 1;
opts.momentum = 0.001;
opts.alpha = 0.15;
dbn = dbnsetup(dbn, feature_vectors_std(train,:), opts);
dbn = dbntrain(dbn, feature_vectors_std(train,:), opts);
%unfold dbn to nn
nn = dbnunfoldtonn(dbn, 4);
nn.activation_function = 'sigm';
nn.learningRate = 0.15;
nn.momentum = 0.001;
%train nn
opts.numepochs = 50;
opts.batchsize = 1;
train_labels = labels(train);
nClass = length(unique(train_labels));
% one-hot encode the labels: L(i,c) = 1 iff sample i belongs to class c
L = zeros(length(train_labels),nClass);
for i = 1:nClass
L(train_labels == i,i) = 1;
end
nn = nntrain(nn, feature_vectors_std(train,:), L, opts);
class = nnpredict(nn, feature_vectors_std(test,:));
feature_vectors_std is the (43 x 70'000) matrix with values scaled to [0,1].
Can somebody infer why I'm getting such bad accuracy?
Because you have many more features than examples in the dataset. In other words: you have a big number of weights, and you need to estimate all of them, but you can't, because an NN with such a huge structure cannot generalize well on so small a dataset; you need more data to learn such a big number of hidden weights (in fact the NN may memorize your training set, but cannot transfer its "knowledge" to the test set). At the same time, 80% accuracy with such simple methods as SVM and kNN indicates that you can describe your data with much simpler rules: an SVM will have only 70k weights (instead of 70k*first_layer_size + first_layer_size*second_layer_size + ... in the NN), and kNN will not use weights at all.
A complex model is not a silver bullet: the more complex the model you are trying to fit, the more data you need.
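A quick back-of-the-envelope count for the 500-500-500 architecture from the question illustrates the point (a sketch; bias terms included):
layer_sizes = [70000 500 500 500 4];   % input, three hidden layers, output
P = 0;
for k = 1:length(layer_sizes)-1
    P = P + layer_sizes(k)*layer_sizes(k+1) + layer_sizes(k+1); % weights + biases
end
fprintf('%d parameters vs. 43 training examples\n', P); % about 35.5 million parameters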
Obviously, your dataset is too small for the complexity of your network. Quoting from the reference:
The complexity of a neural network can be expressed through the number of parameters. In the case of deep neural networks, this number can be in the range of millions, tens of millions and in some cases even hundreds of millions. Let's call this number P. Since you want to be sure of the model's ability to generalize, a good rule of thumb for the number of data points is at least P*P.
Since kNN and SVM are simpler, they don't need that much data, so they can work better.

Naïve Bayes Classifier -- is normalization necessary?

We recently studied the Naïve Bayes classifier in our Machine Learning class, and now I'm trying to implement it on the Fisher Iris dataset as a self-exercise. The concept is easy and straightforward, with some trickiness involved for continuous attributes. I read several literature resources which recommended using a Gaussian approximation to compute the probabilities of test data values, so I'm going with that in my code.
Now I'm trying to run it initially for 50% training and 50% test data samples, but something is missing. The current code is always predicting class 1 (I used integers to represent the classes) for all test samples, which is obviously wrong.
My guess is that the problem may be due to normalization being omitted by the code? Though I think adding normalization would still yield proportionate results, and so far my attempts to normalize have produced the same classification results.
Can someone please suggest if there is anything obvious missing here? Or if I'm not approaching this right? Since most of the code is 'mechanics', I have made prominent (****************) the 2 lines that are responsible for the calculations. Any help is appreciated, thanks!
nsamples=75; % 50% samples
% acquire training set and test set
[trainingSample,idx] = datasample(data,nsamples,'Replace',false);
testData = data(setdiff(1:150,idx),:);
% define Gaussian function
%***********************************************************%
Phi=@(mu,sig2,x) (1/sqrt(2*pi*sig2))*exp(-((x-mu)^2)/(2*sig2)); % anonymous function uses @; the parentheses around 2*sig2 matter (/2*sig2 would multiply by sig2)
%***********************************************************%
for c=1:3 % for 3 classes in training set
clear y x mu sig2;
index=1;
for i=1 : length(trainingSample)
if trainingSample(i,5)==c
y(index,:)=trainingSample(i,:); % filter current class samples
index=index+1; % for conditional probabilities
end
end
for j=1:size(testData,1) % iterate over test samples
clear pf p;
for i=1:4 % iterate over columns
x=testData(j,i); % representing attributes
mu=mean(y(:,i));
sig2=var(y(:,i));
pf(i) = Phi(mu,sig2,x); % calc conditional probability
end
% calc class score: prior * likelihood
%*****************************************************%
pc(j,c) = size(y,1)/nsamples * pf(1)*pf(2)*pf(3)*pf(4);
%*****************************************************%
end
end
% find the predicted class for each test sample
% by taking the max probability calculated
for i=1:size(pc,1)
[~,q]=max(pc(i,:));
predicted(i)=q;
actual(i)=testData(i,5);
end
Normalization shouldn't be necessary since the features are only compared to each other.
p(class|thing) ∝ p(class)p(thing|class) =
= p(class)p(feature_1|class)p(feature_2|class)...p(feature_N|class)
So when fitting the parameters for the distribution feature_i|class, rescaling the data will just rescale the fitted parameters (mu, sigma2) to the new "scale", but the comparison between classes will remain the same.
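A quick numeric sketch of that claim with hypothetical one-feature data (normpdf is from the Statistics Toolbox); both class likelihoods shrink by the same factor under rescaling, so the ratio used for classification is unchanged:
x1 = randn(100,1)*2 + 3;   % feature values for class 1
x2 = randn(100,1)*2 + 8;   % feature values for class 2
xq = 5;                    % query value
c = 10;                    % rescaling factor ("normalization")
r_raw = normpdf(xq, mean(x1), std(x1)) / normpdf(xq, mean(x2), std(x2));
r_scaled = normpdf(c*xq, mean(c*x1), std(c*x1)) / normpdf(c*xq, mean(c*x2), std(c*x2));
% r_raw and r_scaled are equal (up to floating point), so the argmax is unchanged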
It's hard to read the Matlab code due to a lot of indexing and splitting of training/testing data, etc., which is a possible source of the problem.
You should try something with a lot less unnecessary stuff around it (I would recommend Python with scikit-learn, for example; it has a lot of helpers for splitting data and such: http://scikit-learn.org/).
It's really important that you separate the training and test data, and only train the model with training data and test the trained model with the test data. (Is this done?)
The next step is to check the parameters, which is most easily done by either printing them out (sanity check) or,
for each feature, rendering the fitted Gaussian bells next to a histogram of the data to see that they match (remember that each histogram bar must have height number_of_samples_within_range/total_number_of_samples).
Visualising the data and the model is really important to know what is happening.
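A minimal sketch of that visual check for one feature column of one class (it uses histogram's built-in pdf normalization, available in recent MATLAB versions, and normpdf from the Statistics Toolbox):
y_i = y(:,1);                           % one feature, one class
histogram(y_i, 'Normalization', 'pdf'); % bar heights scaled so the bars integrate to 1
hold on
xs = linspace(min(y_i), max(y_i), 200);
plot(xs, normpdf(xs, mean(y_i), std(y_i)), 'LineWidth', 2); % fitted Gaussian bell
hold off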

Matlab Multilayer Perceptron Question

I need to classify a dataset using Matlab MLP and show classification.
The dataset looks like this:
[image: the dataset]
What I have done so far is:
I have created a neural network containing one hidden layer (two neurons; maybe someone could give me some suggestions on how many neurons are suitable for my example?) and an output layer (one neuron).
I have used several different learning methods, such as Delta-bar-Delta and backpropagation (both of these with and without momentum), as well as Levenberg-Marquardt.
This is the code I used in Matlab (Levenberg-Marquardt example):
net = newff(minmax(Input),[2 1],{'logsig' 'logsig'},'trainlm');
net.trainParam.epochs = 10000;
net.trainParam.goal = 0;
net.trainParam.lr = 0.1;
[net tr outputs] = train(net,Input,Target);
The following shows the hidden-neuron classification boundaries generated by Matlab on the data. I am a little bit confused, because the network should produce a nonlinear result, but in the plot below the two boundary lines seem to be linear:
[image: plot of the hidden-neuron boundaries]
The code for generating above plot is:
figure(1)
plotpv(Input,Target);
hold on
plotpc(net.IW{1},net.b{1});
hold off
I also need to plot the output function of the output neuron, but I am stuck at this step. Can anyone give me some suggestions?
Thanks in advance.
Regarding the number of neurons in the hidden layer, for such a small example two are more than enough. The only way to know the optimum for sure is to test with different numbers. In this FAQ you can find a rule of thumb that may be useful: http://www.faqs.org/faqs/ai-faq/neural-nets/
For the output function, it is often useful to divide it in two steps:
First, given the input vector x, compute the weighted sum of the neurons in the hidden layer: y = x^T W + b, where W is the weight matrix from the input neurons to the hidden layer and b is the bias vector.
Second, apply the network's activation function g to the result of the previous step: z = g(y).
Finally, the output is the dot product h(z) = z . v + n, where v is the weight vector from the hidden layer to the output neuron and n its bias. In the case of more than one output neuron, you repeat this for each one.
I've never used the matlab mlp functions, so I don't know how to get the weights in this case, but I'm sure the network stores them somewhere. Edit: Searching the documentation I found the properties:
net.IW numLayers-by-numInputs cell array of input weight values
net.LW numLayers-by-numLayers cell array of layer weight values
net.b numLayers-by-1 cell array of bias values
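Putting that together for the two-layer logsig network from the question, here is a sketch of evaluating the output neuron manually from the stored weights and plotting it over the input plane (it assumes Input is the 2-by-N input matrix from the question, and x is one sample as a column vector):
% single sample
z1 = logsig(net.IW{1,1}*x + net.b{1});  % hidden layer: weighted sum, then logsig
y  = logsig(net.LW{2,1}*z1 + net.b{2}); % output neuron, also logsig in this net
% output surface over a grid covering the data
[xg, yg] = meshgrid(linspace(min(Input(1,:)), max(Input(1,:)), 100), ...
                    linspace(min(Input(2,:)), max(Input(2,:)), 100));
P  = [xg(:)'; yg(:)'];
Z1 = logsig(bsxfun(@plus, net.IW{1,1}*P, net.b{1}));
Y  = logsig(bsxfun(@plus, net.LW{2,1}*Z1, net.b{2}));
surf(xg, yg, reshape(Y, size(xg)), 'EdgeColor', 'none'); % the output neuron's function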