Encouraged by some success in MNIST classification I wanted to solve a "real" problem with some neural networks.
The task seems quite easy:
We have:
some x-values (e.g. 1:1:100)
some y-values (e.g. x^2)
I want to train a network with 1 input (for 1 x-value) and one output (for 1 y-value). One hidden layer.
Here is my basic procedure:
1. Slicing my x-values into different batches (e.g. 10 elements per batch)
2. In each batch, calculating the outputs of the net, then applying backpropagation and calculating the weight and bias updates
3. After each batch, averaging the calculated weight and bias updates and actually updating the weights and biases
4. Repeating steps 1 to 3 multiple times
This procedure worked fine for MNIST, but for the regression it totally fails.
I am wondering if I am doing something fundamentally wrong.
I tried different batch sizes, up to averaging over ALL x values.
Basically the network does not train well. After manually tweaking the weights and biases (with 2 hidden neurons) I could approximate my y=f(x) quite well, but when the network has to learn the parameters itself, it fails.
When I have just one element for x and one for y and I train the network, it trains well for this one specific pair.
Maybe somebody has a hint for me. Am I misunderstanding regression with neural networks?
So far I assume the code itself is okay, as it worked for MNIST and it works for the "one x/y pair" example. I rather think my overall approach (see above) may not be suitable for regression.
Thanks,
Jim
ps: I will post some code tomorrow...
Here comes the code (MATLAB). As I said, it's one hidden layer with two hidden neurons:
% init hyper-parameters
hidden_neurons=2;
input_neurons=1;
output_neurons=1;
learning_rate=0.5;
batchsize=50;
% load data
training_data=d(1:100)/100;
training_labels=v_start(1:100)/255;
% init weights
init_randomly=1;
if init_randomly
    % initialize weights and bias with random numbers between -0.5 and +0.5
    w1=rand(hidden_neurons,input_neurons)-0.5;
    b1=rand(hidden_neurons,1)-0.5;
    w2=rand(output_neurons,hidden_neurons)-0.5;
    b2=rand(output_neurons,1)-0.5;
else
    % initialize with manually determined values
    w1=[10;-10];
    b1=[-3;-0.5];
    w2=[0.2 0.2];
    b2=0;
end
for epochs =1:2000 % looping over some epochs
    for i = 1:batchsize:length(training_data) % slice training data into batches
        % the "-1" keeps consecutive batches from sharing one sample
        batch_data=training_data(i:min(i+batchsize-1,length(training_data))); % generating training batch
        batch_labels=training_labels(i:min(i+batchsize-1,length(training_data))); % generating training label batch
        % initialize weight updates for next batch
        w2_update=0;
        b2_update =0;
        w1_update =0;
        b1_update =0;
        for k = 1: length(batch_data) % looping over one single batch
            % extract training sample
            x=batch_data(k); % extracting one single training sample
            y=batch_labels(k); % extracting expected output of training sample
            % forward pass
            z1 = w1*x+b1; % sum of first layer
            a1 = sigmoid(z1); % activation of first layer (sigmoid)
            z2 = w2*a1+b2; % sum of second layer
            a2=z2; %activation of second layer (linear)
            % backward pass
            delta_2=(a2-y); %calculating delta of second layer assuming quadratic cost; derivative of linear unit is equal to 1 for all x.
            delta_1=(w2'*delta_2).* (a1.*(1-a1)); % calculating delta of first layer
            % calculating the weight and bias updates averaging over one
            % batch
            w2_update = w2_update +(delta_2*a1') * (1/length(batch_data));
            b2_update = b2_update + delta_2 * (1/length(batch_data));
            w1_update = w1_update + (delta_1*x') * (1/length(batch_data));
            b1_update = b1_update + delta_1 * (1/length(batch_data));
        end
        % actually updating the weights. Updated weights will be used in
        % next batch
        w2 = w2 - learning_rate * w2_update;
        b2 = b2 - learning_rate * b2_update;
        w1 = w1 - learning_rate * w1_update;
        b1 = b1 - learning_rate * b1_update;
    end
end
Here is the outcome with random initialization, showing the expected output, the output before training, and the output after training:
[Figure: training with random init]
One can argue that the blue line is already closer than the black one, in that sense the network has optimized the results already. But I am not satisfied.
Here is the result with my manually tweaked values:
[Figure: training with pre-init]
The black line is not bad for just two hidden neurons, but my expectation was rather, that such a black line would be the outcome of training starting with random init.
Any suggestions what I am doing wrong?
Thanks!
Ok, after some research I found some interesting points:
The function I tried to learn seems particularly hard to learn (not sure why)
With the same setup I tried to learn some 3rd degree polynomials which was successful (cost <1e-6)
Randomizing training samples seems to improve learning (for the polynomial and my initial function). I know this is well known in literature but I always skipped that part in implementation. So I learned for myself how important it is.
For learning "curvy/wiggly" functions, I found that sigmoid works better than ReLU (the output layer is still "linear", as suggested for regression)
A learning rate of 0.1 worked fine for the curve fitting I finally wanted to perform
A larger batch size smooths the cost vs. epochs plot (surprise...)
Initializing weights between -5 and +5 worked better than between -0.5 and +0.5 for my application
In the end I got quite convincing results for what I intended to learn with the network :)
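As a rough sketch of the shuffling point in the list above: the loop below draws a new random sample order every epoch with randperm and processes each mini-batch in vectorized form. The target y = x.^2, the 500 epochs and the initialization between -0.5 and +0.5 are assumptions picked for illustration; this is not the setup that produced the results described above.

```matlab
% Sketch: reshuffle the samples every epoch, then run vectorized mini-batch updates.
% Assumption: synthetic data y = x.^2 on [0, 1] stands in for the real curve.
x = linspace(0, 1, 100);                 % 1 x N inputs
y = x.^2;                                % 1 x N targets
sigmoid = @(z) 1 ./ (1 + exp(-z));

hidden_neurons = 2;
learning_rate  = 0.1;
batchsize      = 10;
w1 = rand(hidden_neurons, 1) - 0.5;  b1 = rand(hidden_neurons, 1) - 0.5;
w2 = rand(1, hidden_neurons) - 0.5;  b2 = rand - 0.5;

% network output and half mean squared error on the full data set
forward = @(w1, b1, w2, b2, x) w2 * sigmoid(w1 * x + repmat(b1, 1, numel(x))) + b2;
cost    = @(w1, b1, w2, b2) mean((forward(w1, b1, w2, b2, x) - y).^2) / 2;
cost_before = cost(w1, b1, w2, b2);

for epoch = 1:500
    order = randperm(numel(x));          % new random sample order each epoch
    for i = 1:batchsize:numel(x)
        idx = order(i:min(i + batchsize - 1, numel(x)));
        xb = x(idx);  yb = y(idx);  n = numel(idx);

        % forward pass for the whole batch at once
        a1 = sigmoid(w1 * xb + repmat(b1, 1, n));     % H x n
        a2 = w2 * a1 + b2;                            % 1 x n (linear output)

        % backward pass (quadratic cost)
        delta_2 = a2 - yb;                            % 1 x n
        delta_1 = (w2' * delta_2) .* a1 .* (1 - a1);  % H x n

        % gradient step with gradients averaged over the batch
        w2 = w2 - learning_rate * (delta_2 * a1') / n;
        b2 = b2 - learning_rate * mean(delta_2);
        w1 = w1 - learning_rate * (delta_1 * xb') / n;
        b1 = b1 - learning_rate * mean(delta_1, 2);
    end
end
cost_after = cost(w1, b1, w2, b2);
fprintf('cost before: %.4g, after: %.4g\n', cost_before, cost_after);
```

Because the order changes every epoch, each batch mixes samples from the whole x range instead of always covering the same neighbouring values.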
Have you tried with a much smaller learning rate? Generally, learning rates of 0.001 are a good starting point, 0.5 is in most cases way too large.
Also note that your predefined weights are in an extremely flat region of the sigmoid function (sigmoid(10) ≈ 1, sigmoid(-10) ≈ 0), with the derivative at both positions close to 0. That means that backpropagating from such a position (or getting to such a position) is extremely difficult. For exactly that reason, some people prefer ReLUs to sigmoids, since a ReLU only has a "dead" region for negative inputs.
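To put numbers on the flat region, here is a small sketch that evaluates the sigmoid derivative a.*(1-a) (the same factor that appears in delta_1 in the question) at a few inputs. The chosen inputs 0, 3 and 10 are just illustrative.

```matlab
% Sigmoid and its derivative, evaluated at a few pre-activations
sigmoid  = @(z) 1 ./ (1 + exp(-z));
dsigmoid = @(z) sigmoid(z) .* (1 - sigmoid(z));

z = [0 3 10];
d = dsigmoid(z);
fprintf('z = %4.1f   sigmoid''(z) = %.3g\n', [z; d]);
```

At z = 0 the derivative is 0.25, its maximum; at z = 10 it is below 1e-4, so the gradient reaching the first layer is scaled down by more than three orders of magnitude.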
Also, am I correct in seeing that you only have 100 training samples? You could maybe try a smaller batch size, or increase the number of samples you take. Also don't forget to shuffle your samples after each epoch; plenty of reasons for this have been given elsewhere.
I want to get the offset in samples between two datasets in Matlab (to sync them in time), which is quite a common issue. To do so I use the cross-correlation function xcorr or the cross-covariance function xcov (both give similar results for this purpose in most cases). With artificial data it works fine, but I struggle with "real" data, even though it should be pretty much the same: Matlab always says the offset is zero. I'm using this simple piece of code:
[crossCorr] = xcov(b, c);
[~, peakIndex] = max(crossCorr())
offset = peakIndex - length(b)
I've posted a fully runnable example m-file with a downsampled data excerpt on pastebin:
Code with data on pastebin
EDIT: The downsampled excerpt turns out not to be fully suitable for evaluating the effect. Here's a much larger sample at the original frequency; please use this one instead. Unfortunately it was too big for pastebin.
As the plot shows it should be no problem at all to get the offset via cross covariance. I also tried to scale the data nicer in order to avoid numerical problems, but that didn't change anything at all.
Would be great if someone could tell me my mistake.
There's nothing wrong with your method in principle, I used exactly the same approach successfully for temporally aligning different audio recordings of the same signal.
However, it appears that for your time series, correlation (or covariance) is simply not the right measure to compare shifted versions – possibly because they contain components of a time scale comparable to the total length. An alternative is to use residual variance, i.e. the variance of the difference between shifted versions. Here is a (not particularly elegant) implementation of this idea:
lags = -1000 : 1000;
v = nan(size(lags));
for i = 1 : numel(lags)
    lag = lags(i);
    if lag >= 0
        v(i) = var(b(1 + lag : end) - c(1 : end - lag));
    else
        v(i) = var(b(1 : end + lag) - c(1 - lag : end));
    end
end
[~, ind] = min(v);
minlag = lags(ind);
For your (longer) data set, this results in minlag = 169. Plotting residual variance over lags gives:
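To check the sign convention of the loop above, it can be run on data with a known shift. The sketch below is such a check under stated assumptions: the test signal s is made up, and c is simply s advanced by 7 samples relative to b, so the loop should report a lag of 7.

```matlab
% Synthetic check of the residual-variance alignment (assumed test signal)
n = 1:507;
s = sin(0.05 * n.^1.3) + 0.001 * n;   % non-periodic signal with a slow trend
b = s(1:500);                         % reference
c = s(8:507);                         % same signal, advanced by 7 samples

lags = -20:20;
v = nan(size(lags));
for i = 1:numel(lags)
    lag = lags(i);
    if lag >= 0
        v(i) = var(b(1 + lag:end) - c(1:end - lag));
    else
        v(i) = var(b(1:end + lag) - c(1 - lag:end));
    end
end
[~, ind] = min(v);
minlag = lags(ind);
fprintf('recovered lag: %d\n', minlag);
```

With this convention a positive minlag means that b(k + minlag) lines up with c(k).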
Your data has a minor peak around 5 and a major peak around 101.
If I knew something about my data, I might window around an acceptable range of offsets as shown below.
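A minimal sketch of such windowing, under assumptions: the crossCorr curve here is synthetic (two bumps at lags 5 and 101, mimicking the description above, not the real data), and the acceptable range of 0 to 20 samples is an assumed prior.

```matlab
% Restrict the peak search to an acceptable range of lags
N    = 200;                                    % assumed signal length
lags = -(N - 1):(N - 1);                       % lag axis of a full cross-correlation
% synthetic stand-in curve: major peak at lag 101, minor peak at lag 5
crossCorr = exp(-(lags - 101).^2 / 50) + 0.6 * exp(-(lags - 5).^2 / 8);

[~, kGlobal] = max(crossCorr);
globalOffset = lags(kGlobal);                  % unrestricted search finds the major peak

win = lags >= 0 & lags <= 20;                  % assumed acceptable offsets
cc  = crossCorr;
cc(~win) = -inf;                               % exclude everything outside the window
[~, kWin] = max(cc);
windowedOffset = lags(kWin);

fprintf('global peak at lag %d, windowed peak at lag %d\n', globalOffset, windowedOffset);
```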
Code for initial exploration:
figure; clc;
subplot(2,1,1)
plot(1:numel(b), b);
hold on
plot(1:numel(c), c, 'r');
legend('b','c')
subplot(2,1,2)
plot(crossCorr,'.b-')
hold on
plot(peakIndex,crossCorr(peakIndex),'or')
legend('crossCorr','peak')
Initial Image:
If you zoom into the first peak you can see that it is not only high around 5, but it is polynomial "enough" to allow sub-element offsets. That is convenient.
Image showing :
Here is what the curve-fitting tool gives as the analytic for a cubic:
Linear model Poly3:
f(x) = p1*x^3 + p2*x^2 + p3*x + p4
Coefficients (with 95% confidence bounds):
p1 = 8.515e-013 (8.214e-013, 8.816e-013)
p2 = -3.319e-011 (-3.369e-011, -3.269e-011)
p3 = 2.253e-010 (2.229e-010, 2.277e-010)
p4 = -4.226e-012 (-7.47e-012, -9.82e-013)
Goodness of fit:
SSE: 2.799e-024
R-square: 1
Adjusted R-square: 1
RMSE: 6.831e-013
Note that the SSE is down at round-off level.
To locate the peak, i.e. the root of the derivative near n=4, you can use the following MATLAB code:
% Coefficients
p1 = 8.515e-013
p2 = -3.319e-011
p3 = 2.253e-010
p4 = -4.226e-012
% Linear model Poly3:
syms x
f = p1*x^3 + p2*x^2 + p3*x + p4
% root of the derivative of f near x = 4; fzero needs a numeric return value
xz1=fzero(@(y) double(subs(diff(f),x,y)), 4)
and you get the analytic root at 4.01420240431444.
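The same number can also be obtained without the Symbolic Math Toolbox. The sketch below uses polyder and roots on the fitted coefficients quoted above and then picks the stationary point closest to 4; treating "closest to 4" as the selection rule is an assumption based on the zoomed plot.

```matlab
% Peak of the fitted cubic via the roots of its derivative
p  = [8.515e-13, -3.319e-11, 2.253e-10, -4.226e-12];  % [p1 p2 p3 p4] from the fit
dp = polyder(p);                % coefficients of f'(x)
r  = roots(dp);                 % both stationary points of the cubic
[~, k] = min(abs(r - 4));       % keep the one near the observed peak
peak = r(k);
fprintf('sub-sample peak location: %.4f\n', peak);
```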
EDIT:
Hmmm. How about fitting a gaussian mixture model to the convolution? You sweep through a good range of component count, you do between 10 and 30 repeats, and you find which component count has the best/lowest BIC. So you fit a gmdistribution to the lower subplot of the first figure, then test the covariance at the means of the components in decreasing order.
I would try the offset at the means, and just look at sum squared error. I would then pick the offset that has the lowest error.
Procedure:
compute cross correlation
fit cross correlation to Gaussian Mixture model
sweep a reasonable range of components (start with 1-10)
use a reasonable number of repeats (10 to 30 depending on run-to-run variation)
compute Bayes Information Criterion (BIC) for each level, pick the lowest because it indicates a reasonable balance of error and parameter count
each component is going to have a mean, evaluate that mean as a candidate offset and compute sum-squared error (sse) when you offset like that.
pick the offset of the component that gives best SSE
Let me know how well that works.
If the two signals misalign by non-integer number of samples, e.g. 3.7 samples, then the xcorr method may find the max value at 4 samples, it won't be able to find the accurate time shift. In this case, you should try a method called "unified change detection". The web-link for the paper is:
[http://www.phmsociety.org/node/1404/]
Good Luck.
I have samples from two Gaussian distributions, 10,000 samples each. I would like to train a feed-forward neural network with these samples, but I don't know how many samples I have to take in order to get an optimal decision boundary.
Here is the code, but I don't know exactly what the solution is, and the outputs are weird.
x1 = -49:1:50;
x2 = -49:1:50;
[X1, X2] = meshgrid(x1, x2);
Gaussian1 = mvnpdf([X1(:) X2(:)], mean1, var1); % for class A
Gaussian2 = mvnpdf([X1(:) X2(:)], mean2, var2); % for class B
net = feedforwardnet(10);
G1 = reshape(Gaussian1, 10000,1);
G2 = reshape(Gaussian2, 10000,1);
input = [G1, G2];
output = [0, 1];
net = train(net, input, output);
When I ran the code it gave me weird results.
If the code is not correct, can someone please suggest how I can get a decision boundary for these two distributions?
I'm pretty sure that the input must be the Gaussian distribution (and not the x coordinates). The NN has to learn the relationship between the phenomena you are interested in (the Gaussian distributions) and the output labels, not between the space that contains those phenomena and the labels. Moreover, if you choose the x coordinates, the NN will try to learn some relationship between them and the output labels, but the x values are potentially constant (i.e., the input data might even be all the same, because you can have very different Gaussian distributions over the same range of x coordinates just by varying the mean and the variance). The NN will then end up confused, because the same input data might map to more than one output label (and you don't want that to happen!).
I hope I was helpful.
P.S.: To be safe, I should mention that an NN doesn't fit the data very well if you have a small training set. Also don't forget to validate your model using cross-validation (a good rule of thumb is to hold out 20% of your data as the cross-validation set and another 20% as the test set, and to train your model only on the remaining 60%).
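The 60/20/20 rule of thumb can be sketched as follows. This is only an illustration under assumptions: the class means and covariances are hypothetical (the question does not state mean1, var1, mean2, var2), and it assumes the goal is to classify individual points drawn from the two Gaussians, stored one sample per column with one label per column.

```matlab
% Hypothetical class parameters (not given in the question)
mean1 = [-10 0];  var1 = [25 0; 0 25];    % class A
mean2 = [ 10 0];  var2 = [25 0; 0 25];    % class B
N = 10000;                                % samples per class

% Draw N points per class: randn(N,2) * chol(Sigma) has covariance Sigma
A = randn(N, 2) * chol(var1) + repmat(mean1, N, 1);
B = randn(N, 2) * chol(var2) + repmat(mean2, N, 1);

% One column per sample, one label per column (0 = class A, 1 = class B)
inputs  = [A; B]';                        % 2 x 2N
targets = [zeros(1, N), ones(1, N)];      % 1 x 2N

% 60/20/20 split into training, validation and test indices
idx    = randperm(2 * N);
nTrain = round(0.6 * 2 * N);
nVal   = round(0.2 * 2 * N);
trainIdx = idx(1:nTrain);
valIdx   = idx(nTrain + 1 : nTrain + nVal);
testIdx  = idx(nTrain + nVal + 1 : end);
```

Under these assumptions, inputs(:, trainIdx) and targets(trainIdx) would be the pair handed to train, with the other two index sets kept aside for validation and testing.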
I have been following Andrew Ng's course on Machine Learning, and I currently have some doubts about the implementation of a handwriting recognition tool.
First, he says that he uses a subset of the MNIST dataset, which contains 5000 training examples, each of which is an image in 20x20 grayscale format. He then says that each example is a vector of 400 elements, the "unrolled" version of the image described above. Does it mean that the training set has something like the following format?
Training example 1 v[1,2,...,400]
Training example 2 v[1,2,...,400]
...
Training example 5000 v[1,2,...,400]
For the coding part the author gives the following complete code in Matlab:
%% Machine Learning Online Class - Exercise 3 | Part 2: Neural Networks
% Instructions
% ------------
%
% This file contains code that helps you get started on the
% linear exercise. You will need to complete the following functions
% in this exercise:
%
% lrCostFunction.m (logistic regression cost function)
% oneVsAll.m
% predictOneVsAll.m
% predict.m
%
% For this exercise, you will not need to change any code in this file,
% or any other files other than those mentioned above.
%
%% Initialization
clear ; close all; clc
%% Setup the parameters you will use for this exercise
input_layer_size = 400; % 20x20 Input Images of Digits
hidden_layer_size = 25; % 25 hidden units
num_labels = 10; % 10 labels, from 1 to 10
% (note that we have mapped "0" to label 10)
%% =========== Part 1: Loading and Visualizing Data =============
% We start the exercise by first loading and visualizing the dataset.
% You will be working with a dataset that contains handwritten digits.
%
% Load Training Data
fprintf('Loading and Visualizing Data ...\n')
load('ex3data1.mat');
m = size(X, 1);
% Randomly select 100 data points to display
sel = randperm(size(X, 1));
sel = sel(1:100);
displayData(X(sel, :));
fprintf('Program paused. Press enter to continue.\n');
pause;
%% ================ Part 2: Loading Parameters ================
% In this part of the exercise, we load some pre-initialized
% neural network parameters.
fprintf('\nLoading Saved Neural Network Parameters ...\n')
% Load the weights into variables Theta1 and Theta2
load('ex3weights.mat');
%% ================= Part 3: Implement Predict =================
% After training the neural network, we would like to use it to predict
% the labels. You will now implement the "predict" function to use the
% neural network to predict the labels of the training set. This lets
% you compute the training set accuracy.
pred = predict(Theta1, Theta2, X);
fprintf('\nTraining Set Accuracy: %f\n', mean(double(pred == y)) * 100);
fprintf('Program paused. Press enter to continue.\n');
pause;
% To give you an idea of the network's output, you can also run
% through the examples one at a time to see what it is predicting.
% Randomly permute examples
rp = randperm(m);
for i = 1:m
% Display
fprintf('\nDisplaying Example Image\n');
displayData(X(rp(i), :));
pred = predict(Theta1, Theta2, X(rp(i),:));
fprintf('\nNeural Network Prediction: %d (digit %d)\n', pred, mod(pred, 10));
% Pause
fprintf('Program paused. Press enter to continue.\n');
pause;
end
and the predict function is to be completed by the students. I have done the following:
function p = predict(Theta1, Theta2, X)
%PREDICT Predict the label of an input given a trained neural network
% p = PREDICT(Theta1, Theta2, X) outputs the predicted label of X given the
% trained weights of a neural network (Theta1, Theta2)
% Useful values
m = size(X, 1);
num_labels = size(Theta2, 1);
% You need to return the following variables correctly
p = zeros(size(X, 1), 1);
X = [ones(m , 1) X];
% ====================== YOUR CODE HERE ======================
% Instructions: Complete the following code to make predictions using
% your learned neural network. You should set p to a
% vector containing labels between 1 to num_labels.
%
% Hint: The max function might come in useful. In particular, the max
% function can also return the index of the max element, for more
% information see 'help max'. If your examples are in rows, then, you
% can use max(A, [], 2) to obtain the max for each row.
%
a1 = X;
a2 = sigmoid(a1*Theta1');
a2 = [ones(m , 1) a2];
a3 = sigmoid(a2*Theta2');
[M , p] = max(a3 , [] , 2);
Even though it runs, I don't completely understand how it really works (I have just followed the step-by-step instructions on the author's website). I have the following doubts:
The author considers X (the input) to be an array of 5000 x 400 elements, i.e. the network has 400 neurons as input, 10 neurons as output, and a hidden layer. Does it mean these 5000 x 400 values are the training set?
The author gives us the values of Theta1 and Theta2, which I believe serve as weights for the calculations on the inner layer, but how are these values obtained? Why does he use 25 neurons in the hidden layer and not 24 or 30?
Any help will be appreciated.
Thanks
Let's break your question in parts:
First he says that he uses a subset of the MNIST dataset, which
contains 5000 training examples and each training example is an image
in a 20x20 gray scale format. With that he says that we have a vector
of 400 elements of length that is the "unrolled" of the data
previously described. Does it mean that the train set has something
like the following format? (...)
You're on the right track. Each training example is a 20x20 image. The simplest neural network model, introduced in the course, treats each image just as a simple 1x400 vector ("unrolled" means exactly this transformation). The dataset is stored in a matrix because this way you can perform computations faster by exploiting the efficient linear algebra libraries used by Octave/Matlab. You don't necessarily need to store all training examples as a 5000x400 matrix, but this way your code will run faster.
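A small sketch of what "unrolling" means in code. The image here is a hypothetical stand-in (magic(20)), not one of the course digits; the point is only the shape change from 20x20 to 1x400 and back.

```matlab
% Unroll a 20x20 image into a 1x400 row and restore it again
img  = magic(20);               % hypothetical 20x20 "image"
row  = img(:)';                 % column-major unrolling, 1 x 400
X    = [row; row];              % stacking rows gives one example per row
back = reshape(X(1, :), 20, 20);% undo the unrolling for display

fprintf('row is %d x %d, X is %d x %d\n', size(row, 1), size(row, 2), size(X, 1), size(X, 2));
```

With 5000 such rows stacked, X becomes the 5000x400 matrix discussed above.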
The author considers that X(input) is an array of 5000 x 400 elements,
or it has 400 neurons as input, with 10 neurons as output and a hidden
layer. Does it mean this 5000 x 400 values are the training set?
The "input layer" is nothing but the input image itself. You can think of it as neurons whose output values were already calculated, or as values coming from outside the network (think about your retina: it is like the input layer of your visual system). Thus this network has 400 input units (the "unrolled" 20x20 image). But of course your training set doesn't consist of a single image, so you put all your 5000 images together in a single 5000x400 matrix to form your training set.
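The shapes involved can be traced with a short sketch. The weights below are random placeholders with the sizes implied by the question's predict code (25 hidden units, 10 labels, one extra bias column each); they are not the course's trained Theta1 and Theta2, and m = 5 examples is an arbitrary small number.

```matlab
% Shape walk-through of the forward pass with placeholder weights
m = 5;                                   % number of examples (arbitrary)
X = rand(m, 400);                        % m unrolled 20x20 images
Theta1 = rand(25, 401) - 0.5;            % hypothetical weights: input (+bias) -> hidden
Theta2 = rand(10, 26) - 0.5;             % hypothetical weights: hidden (+bias) -> output
sigmoid = @(z) 1 ./ (1 + exp(-z));

a1 = [ones(m, 1) X];                     % m x 401, bias column added
a2 = [ones(m, 1) sigmoid(a1 * Theta1')]; % m x 26
a3 = sigmoid(a2 * Theta2');              % m x 10, one score per label
[~, p] = max(a3, [], 2);                 % m x 1, index of the largest score per row
```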
The author gives us the values of theta 1 and theta 2, which I believe
serve as weights for the calculations on the inner layer, but how does
values are obtained?
These theta values were found using an algorithm called backpropagation. If you haven't had to implement it in the course yet, just be patient. It might be in the exercises soon! Btw, yes, they are the weights.
Why does he uses 25 neurons of hidden layer and not 24 or 30?
He probably chose an arbitrary value that neither runs too slowly nor performs too poorly. You can probably find much better values for these hyper-parameters. But if you increase the number too much, the training procedure will probably take much longer. Also, since you are just using a small portion of the whole training set (the original MNIST has 60000 training examples and 28x28 images), you need to use a "small" number of hidden units to prevent overfitting. If you use too many units, your network will "learn by heart" the training examples and will not be able to generalize to new unseen data. Finding the hyper-parameters, such as the number of hidden units, is a kind of art that you will master with experience (and maybe with Bayesian optimization and more advanced methods, but that's another story xD).
I did the same course some time ago.
X is the input data. Therefore X is the matrix consisting of the 5000 vectors of 400 elements each. There is no training set, because the network is pre-trained.
Normally the values for theta 1 and 2 are trained. How this is done is a subject for the next few lectures. (Backpropagation algorithm)
I'm not entirely sure, why he used 25 neurons as hidden layer. However my guess is, that this number of neurons simply works, without making the training step take forever.
I'm currently working with neural networks and I'm still a beginner. My aim is to use an MLP to predict flow time series (I know that NARX networks may be more suitable for time series prediction, but the requirement is an MLP).
For example I want to predict the flow Q(t+x) with current and historical flow Q(t...t-n) and precipitation P(t...t-m) etc.
The results of my network training (training, validation and test of the network) and an additional validation period show relatively good quality (correlation and RMSE). But when I look closer at the output for the training and validation periods, there is a lag relative to the targets of the respective period. And my problem is that I don't know why.
The lag exactly corresponds to my forecast period x, no matter how large x is.
I use a standard MLP from the Matlab toolbox with default settings (random data division, trainlm, etc.), as you get from the graphical NN tool (but I also tested other settings with my own code).
With a simple Q(t) NAR net the problem is the same. If I try it with regular data, like predicting sin(t+x) from sin(t..t-n), or the same with a rectangular function, there is no time shift; it's all fine.
Only if I use real-world data or irregular (but mostly constant) data like [0.12 0.14 0.13 0.1 0.1 0.1 ... (n times) 0.1 ... 0.1 0.1 0.14 0.15 0.12 ...] is there a shift between the target and the output. Although I train the network with the target Q(t+x), the actual training output is Q(t). I also tried other combinations of input variables, from less to more information. My time series covers more than 7 years at hourly resolution, but the shift also occurs with other resolutions.
Is there something wrong in my approach, or something I can try? I've read that others have this problem too, but found no solutions. I don't think it is a fault in my implementation, because I also tried the Matlab tool and the sine function, and the outcomes are the same. And if I ignore the shift, the accuracy of the values is not bad (that's obviously why the correlation and RMSE look good).
I use Matlab 2012.
Here's also a minimal code example with only the most important points. It still shows the problem very well.
%% minimalistic example
% but there is the same problem with more input variables
load Q
%% create net inputs and targets
% start point of t
t = 100;
% history data of Q -> Q(t-1), Q(t-2), Q(t-3)
inputs = [Q(t-1:end-1,1) Q(t-2:end-2,1) Q(t-3:end-3,1)]';
% timestep t that want to be predicted
targets = Q(t:end,1)';
%% create fitting net (MLP)
% but it is the same problem for NARnet
% and from here, you can also use the NN graphical tool
% number of hidden neurons
numHiddenNeurons = 6; % the described problem does not depend on this
% value, therefore it is freely chosen
net = fitnet(numHiddenNeurons); % same problem if choosing the old version newfit
% default MLP settings, no changes, but the problem even exist with other
% combinations of settings
% train net
[trained_net,tr] = train(net,inputs,targets);
% apply trained net with given data (create net outputs)
outputs = sim(trained_net,inputs);
figure(1)
hold on
bar(targets',0.6,'FaceColor','r','EdgeColor','none')
bar(outputs',0.2,'FaceColor','b','EdgeColor','none')
legend('observation','prediction')
% please zoom very far to see single bars!! the bar plot shows very good
% the time shift
% if you choose a bigger forecasting time, the shift will also be better to
% see
%% the result: targets(1,1)=Q(t), outputs(1,1)=Q(t-1)
%% now try the sine function, the problem will not be there
x = 1:1:1152;
SIN = sin(x);
inputs = [SIN(1,t-1:end-1);SIN(1,t-2:end-2);SIN(1,t-3:end-3)];
targets = SIN(1,t:end);
% start again from above, creating the net
I do not have enough reputation to upload two excerpts of the results for the one-step-ahead prediction.
Consider predicting not the absolute value of the flow, but the change of flow from the previous period, using the recent changes from the previous periods as inputs. As pointed out by Diphtong above, it very well may be the case that the previous flow values are not predictive of (contain no useful information about) the next flow value.
Conceptually, this is similar to predicting the next value of a random walk. Imagine you had a situation where the next value of a function was equal to the current value plus some random number between -1.0 and +1.0. If you tried to predict the next value from the previous values, the best that any function approximator/regressor could do to minimize the prediction error would be to use the current value as the best predictor of the next value.
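The random-walk argument can be tried out numerically. The sketch below generates a synthetic walk with steps drawn uniformly from [-1, 1] (as in the description above; the length of 5000 is an arbitrary choice) and compares the "previous value" predictor with a predictor that always outputs the series mean.

```matlab
% Naive predictor on a synthetic random walk
Q = cumsum(2 * rand(5000, 1) - 1);                 % steps uniform in [-1, 1]

naive_err = mean((Q(2:end) - Q(1:end-1)).^2);      % predict Q(t) by Q(t-1)
mean_err  = mean((Q(2:end) - mean(Q)).^2);         % predict Q(t) by the overall mean

fprintf('MSE naive: %.3f   MSE mean predictor: %.3f\n', naive_err, mean_err);
```

The naive error is just the mean squared step size, so any regressor that latches onto Q(t-1) already looks good by RMSE and correlation while producing output that trails the target by one step.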
However, in your case, it is still possible that there is some information in the previous flow values. To prevent the current value from overwhelming the error term, stop the network from using the current value as the predictor by feeding it the derivative (the period-to-period change) of the absolute flow values. If there is no useful information in those either, it should minimize the error by always predicting 0.
In summary, try:
Inputs: change in flow at [t-1], at [t-2], ... , [t-w]
Output: change in flow at [t]
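A minimal sketch of how such inputs and targets could be built, in the same column-per-sample layout as the question's code. The series Q here is a synthetic placeholder and the window w = 3 is an assumption matching the three lags used in the question.

```matlab
% Build lagged-difference inputs and targets from a flow series
Q  = cumsum(2 * rand(1000, 1) - 1);    % placeholder for the real flow series
dQ = diff(Q);                          % dQ(k) = Q(k+1) - Q(k)
n  = numel(dQ);
w  = 3;                                % number of lagged changes used as inputs

% column j: inputs are the changes at t-1, t-2, t-3; target is the change at t
inputs  = [dQ(w:n-1)'; dQ(w-1:n-2)'; dQ(w-2:n-3)'];
targets = dQ(w+1:n)';

fprintf('inputs: %d x %d, targets: %d x %d\n', size(inputs, 1), size(inputs, 2), ...
        size(targets, 1), size(targets, 2));
```

A predicted change can be turned back into a flow forecast by adding it to the last observed flow value.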
This "time-shift" you are observing is exactly what @Diphtong mentions: your neural network cannot resolve the relationship between the inputs and the output, so it behaves like a "naive predictor" (look it up), where (in the stock market world) the best prediction for tomorrow's stock price is today's price.
It may help, but I've seen deltas of the input time series, LOG() and SQRT() perform the same...
We recently studied the Naïve Bayesian Classifier in our Machine Learning class and now I'm trying to implement it on the Fisher Iris dataset as a self-exercise. The concept is easy and straightforward, with some trickiness involved for continuous attributes. I read several sources which recommended using a Gaussian approximation to compute the probability of test data values, so I'm going with that in my code.
Now I'm trying to run it initially for 50% training and 50% test data samples, but something is missing. The current code is always predicting class 1 (I used integers to represent the classes) for all test samples, which is obviously wrong.
My guess is that the problem may be due to normalization being omitted by the code? Though I think adding normalization would still yield proportionate results, and so far my attempts to normalize have produced the same classification results.
Can someone please suggest if there is anything obvious missing here? Or if I'm not approaching this right? Since most of the code is 'mechanics', I have made prominent (****************) the 2 lines that are responsible for the calculations. Any help is appreciated, thanks!
nsamples=75; % 50% samples
% acquire training set and test set
[trainingSample,idx] = datasample(data,nsamples,'Replace',false);
testData = data(setdiff(1:150,idx),:);
% define Gaussian function
%***********************************************************%
Phi=@(mu,sig2,x) (1/sqrt(2*pi*sig2))*exp(-((x-mu)^2)/2*sig2);
%***********************************************************%
for c=1:3 % for 3 classes in training set
    clear y x mu sig2;
    index=1;
    for i=1 : length(trainingSample)
        if trainingSample(i,5)==c
            y(index,:)=trainingSample(i,:); % filter current class samples
            index=index+1; % for conditional probabilities
        end
    end
    for j=1:size(testData,1) % iterate over test samples
        clear pf p;
        for i=1:4 % iterate over columns
            x=testData(j,i); % representing attributes
            mu=mean(y(:,i));
            sig2=var(y(:,i));
            pf(i) = Phi(mu,sig2,x); % calc conditional probability
        end
        % calc unnormalized class posterior; prior * likelihood
        %*****************************************************%
        pc(j,c) = size(y,1)/nsamples * pf(1)*pf(2)*pf(3)*pf(4);
        %*****************************************************%
    end
end
% find the predicted class for each test sample
% by taking the max probability calculated
for i=1:size(pc,1)
    [~,q]=max(pc(i,:));
    predicted(i)=q;
    actual(i)=testData(i,5);
end
Normalization shouldn't be necessary since the features are only compared to each other.
p(class|thing) ∝ p(class)p(thing|class) =
= p(class)p(feature_1|class)p(feature_2|class)...p(feature_N|class)
So when fitting the parameters of the distribution feature_i|class, normalization will just rescale those parameters (here mu and sigma2) to the new scale. The densities of all classes change by the same factor, so the comparison between classes remains the same.
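A quick numerical illustration of this point, with hypothetical numbers: scaling a feature by a factor k scales mu by k and sigma2 by k^2, and the ratio of two class densities at the same test value stays unchanged.

```matlab
% Rescaling a feature leaves the ratio of class densities unchanged
gauss = @(mu, sig2, x) exp(-(x - mu).^2 ./ (2 * sig2)) ./ sqrt(2 * pi * sig2);

k = 10;                                   % hypothetical scale factor
x = 5.4;                                  % hypothetical test value
mu1 = 5.0;  s1 = 0.12;                    % hypothetical class 1 parameters
mu2 = 5.9;  s2 = 0.27;                    % hypothetical class 2 parameters

ratio_raw    = gauss(mu1, s1, x) / gauss(mu2, s2, x);
ratio_scaled = gauss(k*mu1, k^2*s1, k*x) / gauss(k*mu2, k^2*s2, k*x);

fprintf('ratio raw: %.6f   ratio scaled: %.6f\n', ratio_raw, ratio_scaled);
```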
It's hard to read the Matlab code because of all the indexing and the splitting into training/test data etc., which is a possible source of problems.
You should try something with a lot less unnecessary code around it (I would recommend Python with scikit-learn, for example, which has a lot of helpers for splitting data and such: http://scikit-learn.org/).
It's really important that you separate the training and test data, and only train the model with training data and test the trained model with the test data. (Is this done?)
The next step is to check the parameters, which is most easily done by either printing them out (sanity check) or,
for each feature, plotting the fitted Gaussian bell next to a histogram of the data to see that they match (remember that each histogram bar must have height number_of_samples_within_range/total_number_of_samples).
Visualising the data and the model is really important to know what is happening.
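A sanity check of this kind can be as small as integrating the density numerically and checking that the area is 1. The sketch below does that for the expression as written in the question's Phi and for a version with the denominator in parentheses; mu and sig2 are hypothetical values. In MATLAB, a/2*b is evaluated as (a/2)*b, so the first form multiplies the exponent by sig2 instead of dividing by it. Whether that is the whole story behind the class-1 predictions is not something this check can confirm.

```matlab
% Numerical area check of two ways of writing the Gaussian density
mu = 5.8;  sig2 = 0.7;                    % hypothetical parameters

phi_as_written = @(x) (1/sqrt(2*pi*sig2)) * exp(-((x - mu).^2) / 2 * sig2);
phi_with_paren = @(x) (1/sqrt(2*pi*sig2)) * exp(-((x - mu).^2) / (2 * sig2));

xs = linspace(mu - 20, mu + 20, 20001);   % wide grid around the mean
area_as_written = trapz(xs, phi_as_written(xs));
area_with_paren = trapz(xs, phi_with_paren(xs));

fprintf('area as written: %.4f   area with parentheses: %.4f\n', ...
        area_as_written, area_with_paren);
```

With parentheses the area comes out as 1; as written it comes out as 1/sig2, so the value depends on the class variance rather than being a properly normalized density.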