Naïve Bayes Classifier -- is normalization necessary? - matlab

We recently studied the Naïve Bayesian Classifier in our Machine Learning class and now I'm trying to implement it on the Fisher Iris dataset as a self-exercise. The concept is easy and straightforward, with some trickiness involved for continuous attributes. I read up several literature resources which recommended using a Gaussian approximation to compute probability of test data values, so I'm going with it in my code.
Now I'm trying to run it initially for 50% training and 50% test data samples, but something is missing. The current code is always predicting class 1 (I used integers to represent the classes) for all test samples, which is obviously wrong.
My guess is that the problem may be due to normalization being omitted by the code? Though I think adding normalization would still yield proportionate results, and so far my attempts to normalize have produced the same classification results.
Can someone please suggest if there is anything obvious missing here? Or if I'm not approaching this right? Since most of the code is 'mechanics', I have made prominent (****************) the 2 lines that are responsible for the calculations. Any help is appreciated, thanks!
nsamples=75; % 50% samples
% acquire training set and test set
[trainingSample,idx] = datasample(data,nsamples,'Replace',false);
testData = data(setdiff(1:150,idx),:);
% define Gaussian function
%***********************************************************%
Phi=#(mu,sig2,x) (1/sqrt(2*pi*sig2))*exp(-((x-mu)^2)/2*sig2);
%***********************************************************%
for c=1:3 % for 3 classes in training set
clear y x mu sig2;
index=1;
for i=1 : length(trainingSample)
if trainingSample(i,5)==c
y(index,:)=trainingSample(i,:); % filter current class samples
index=index+1; % for conditional probabilities
end
end
for j=1:size(testData,1) % iterate over test samples
clear pf p;
for i=1:4 % iterate over columns
x=testData(j,i); % representing attributes
mu=mean(y(:,i));
sig2=var(y(:,i));
pf(i) = Phi(mu,sig2,x); % calc conditional probability
end
% calc class likelihood; prior * posterior
%*****************************************************%
pc(j,c) = size(y,1)/nsamples * pf(1)*pf(2)*pf(3)*pf(4);
%*****************************************************%
end
end
% find the predicted class for each test sample
% by taking the max probability calculated
for i=1:size(pc,1)
[~,q]=max(pc(i,:));
predicted(i)=q;
actual(i)=testData(i,5);
end

Normalization shouldn't be necessary since the features are only compared to each other.
p(class|thing) = p(class)p(thing|class) =
= p(class)p(feature_1|class)p(feature_2|class)...p(feature_N|class)
So when fitting the parameters for the distribution feature_i|class it will just rescale the parameters (for the new "scale") in this case (mu, sigma2), but the probabilities will remain the same.
It's hard to read the matlab code due to alot of indexing and splitting of training/testing etc. Which is a possible problem source.
You should try something with a lot less non-necessary stuff around it (I would recommend python with scikit-learn for example, alot of helpers for splitting data and such http://scikit-learn.org/).
It's really important that you separate the training and test data, and only train the model with training data and test the trained model with the test data. (Is this done?)
Next step is to check the parameters which is easiest done with either printing them out (sanity check) or..
for each feature render the gaussian bells fitted next to a histogram of the data to see that they match (remember that each histogram bar must be of height number_of_samples_within_range/total_number_of_samples.
Visualising the data and the model is really important to know what is happening.

Related

Neural network y=f(x) regression

Encouraged by some success in MNIST classification I wanted to solve a "real" problem with some neural networks.
The task seems quite easy:
We have:
some x-value (e.g. 1:1:100)
some y-values (e.g. x^2)
I want to train a network with 1 input (for 1 x-value) and one output (for 1 y-value). One hidden layer.
Here is my basic procedure:
Slicing my x-values into different batches (e.g. 10 elements per batch)
In each batch calculating the outputs of the net, then applying backpropagation, calculating weight and bias updates
After each batch averaging the calculated weight and bias updates and actually update the weights and biases
Repeating step 1. - 3. multiple times
This procedure worked fine for MNIST, but for the regression it totally fails.
I am wondering if I do something fundamentally wrong.
I tried different batchsizes, up to averaging over ALL x values.
Basically the network does not train well. After manually tweaking the weights and biases (with 2 hidden neurons) I could approximate my y=f(x) quite well, but when the network shall learn the parameters, it fails.
When I have just one element for x and one for y and I train the network, it trains well for this one specific pair.
Maybe somebody has a hint for me. Am I misunderstanding regression with neural networks?
So far I assume, the code itself is okay, as it worked for MNIST and it works for the "one x/y pair example". I rather think my overall approach (see above) may be not suitable for regression.
Thanks,
Jim
ps: I will post some code tomorrow...
Here comes the code (MATLAB). As I said, its one hidden layer, with two hidden neurons:
% init hyper-parameters
hidden_neurons=2;
input_neurons=1;
output_neurons=1;
learning_rate=0.5;
batchsize=50;
% load data
training_data=d(1:100)/100;
training_labels=v_start(1:100)/255;
% init weights
init_randomly=1;
if init_randomly
% initialize weights and bias with random numbers between -0.5 and +0.5
w1=rand(hidden_neurons,input_neurons)-0.5;
b1=rand(hidden_neurons,1)-0.5;
w2=rand(output_neurons,hidden_neurons)-0.5;
b2=rand(output_neurons,1)-0.5;
else
% initialize with manually determined values
w1=[10;-10];
b1=[-3;-0.5];
w2=[0.2 0.2];
b2=0;
end
for epochs =1:2000 % looping over some epochs
for i = 1:batchsize:length(training_data) % slice training data into batches
batch_data=training_data(i:min(i+batchsize,length(training_data))); % generating training batch
batch_labels=training_labels(i:min(i+batchsize,length(training_data))); % generating training label batch
% initialize weight updates for next batch
w2_update=0;
b2_update =0;
w1_update =0;
b1_update =0;
for k = 1: length(batch_data) % looping over one single batch
% extract trainig sample
x=batch_data(k); % extracting one single training sample
y=batch_labels(k); % extracting expected output of training sample
% forward pass
z1 = w1*x+b1; % sum of first layer
a1 = sigmoid(z1); % activation of first layer (sigmoid)
z2 = w2*a1+b2; % sum of second layer
a2=z2; %activation of second layer (linear)
% backward pass
delta_2=(a2-y); %calculating delta of second layer assuming quadratic cost; derivative of linear unit is equal to 1 for all x.
delta_1=(w2'*delta_2).* (a1.*(1-a1)); % calculating delta of first layer
% calculating the weight and bias updates averaging over one
% batch
w2_update = w2_update +(delta_2*a1') * (1/length(batch_data));
b2_update = b2_update + delta_2 * (1/length(batch_data));
w1_update = w1_update + (delta_1*x') * (1/length(batch_data));
b1_update = b1_update + delta_1 * (1/length(batch_data));
end
% actually updating the weights. Updated weights will be used in
% next batch
w2 = w2 - learning_rate * w2_update;
b2 = b2 - learning_rate * b2_update;
w1 = w1 - learning_rate * w1_update;
b1 = b1 - learning_rate * b1_update;
end
end
Here is the outcome with random initialization, showing the expected output, the output before training, and the output after training:
training with random init
One can argue that the blue line is already closer than the black one, in that sense the network has optimized the results already. But I am not satisfied.
Here is the result with my manually tweaked values:
training with pre-init
The black line is not bad for just two hidden neurons, but my expectation was rather, that such a black line would be the outcome of training starting with random init.
Any suggestions what I am doing wrong?
Thanks!
Ok, after some research I found some interesting points:
The function I tried to learn seems particularly hard to learn (not sure why)
With the same setup I tried to learn some 3rd degree polynomials which was successful (cost <1e-6)
Randomizing training samples seems to improve learning (for the polynomial and my initial function). I know this is well known in literature but I always skipped that part in implementation. So I learned for myself how important it is.
For learning "curvy/wiggly" functions, I found sigmoid works better than ReLu. (output layer is still "linear" as suggested for regression)
a learning rate of 0.1 worked fine for the curve fitting I finally wanted to perform
A larger batchsize would smoothen the cost vs. epochs plot (surprise...)
Initializing weigths between -5 and +5 worked better than -0.5 and 0.5 for my application
In the end I got quite convincing results for what I intendet to learn with the network :)
Have you tried with a much smaller learning rate? Generally, learning rates of 0.001 are a good starting point, 0.5 is in most cases way too large.
Also note that your predefined weights are in an extremely flat region of the sigmoid function (sigmoid(10) = 1, sigmoid(-10) = 0), with the derivative at both positions close to 0. That means that backpropagating from such a position (or getting to such a position) is extremely difficult; For exactly that reason, some people prefer to use ReLUs instead of sigmoid, since it has only a "dead" region for negative activations.
Also, am I correct in seeing that you only have 100 training samples? You could maybe try a smaller batch size, or increase the number of samples you take. Also don't forget to shuffle your samples after each epoch. Reasons are given plenty, for example here.

Using Linear Prediction Over Time Series to Determine Next K Points

I have a time series of N data points of sunspots and would like to predict based on a subset of these points the remaining points in the series and then compare the correctness.
I'm just getting introduced to linear prediction using Matlab and so have decided that I would go the route of using the following code segment within a loop so that every point outside of the training set until the end of the given data has a prediction:
%x is the data, training set is some subset of x starting from beginning
%'unknown' is the number of points to extend the prediction over starting from the
%end of the training set (i.e. difference in length of training set and data vectors)
%x_pred is set to x initially
p = length(training_set);
coeffs = lpc(training_set, p);
for i=1:unknown
nextValue = -coeffs(2:end) * x_pred(end-unknown-1+i:-1:end-unknown-1+i-p+1)';
x_pred(end-unknown+i) = nextValue;
end
error = norm(x - x_pred)
I have three questions regarding this:
1) Does this appropriately do what I have described? I ask because my error seems rather large (>100) when predicting over only the last 20 points of a dataset that has hundreds of points.
2) Am I interpreting the second argument of lpc correctly? Namely, that it means the 'order' or rather number of points that you want to use in predicting the next point?
3) If this is there a more efficient, single line function in Matlab that I can call to replace the looping and just compute all necessary predictions for me given some subset of my overall data as a training set?
I tried looking through the lpc Matlab tutorial but it didn't seem to do the prediction as I have described my needs require. I have also been using How to use aryule() in Matlab to extend a number series? as a reference.
So after much deliberation and experimentation I have found the above approach to be correct and there does not appear to be any single Matlab function to do the above work. The large errors experienced are reasonable since I am using a linear prediction algorithm for a problem (i.e. sunspot prediction) that has inherent nonlinear behavior.
Hope this helps anyone else out there working on something similar.

weighted correlation for case of matrix

i have question how to calculate weighted correlations for matrices,from wikipedia i have created three following codes
1.weighted mean calculation
function [y]= weighted_mean(x,w);
n=length(x);
%assume that weight vector and input vector have same length
sum=0.0;
sum_weight=0.0;
for i=1:n
sum=sum+ x(i)*w(i);
sum_weight=sum_weight+w(i);
end
y=sum/sum_weight;
end
2.weighted covariance
function result=cov_weighted(x,y,w)
n=length(x);
sum_covar=0.0;
sum_weight=0;
for i=1:n
sum_covar=sum_covar+w(i)*(x(i)-weighted_mean(x,w))*(y(i)-weighted_mean(y,w));
sum_weight=sum_weight+w(i);
end
result=sum_covar/sum_weight;
end
and finally weighted correlation
3.
function corr_weight=weighted_correlation(x,y,w);
corr_weight=cov_weighted(x,y,w)/sqrt(cov_weighted(x,x,w)*cov_weighted(y,y,w));
end
now i want to apply weighted correlation method for matrices,related to this link
http://www.mathworks.com/matlabcentral/fileexchange/20846-weighted-correlation-matrix/content/weightedcorrs.m
i did not understand anything how to apply,that why i have created my self,but need in case of input are matrices,thanks very much
#dato-datuashvili Maybe I am providing too much information...
1) I would like to stress that the evaluation of Weighted Correlation matrices are very uncommon. This happens because you have to provide beforehand the weights. Unless you have a clear reason to choose the weights, there is no clear way to provide them.
How can you tell that a measurement of your sample is more or less important than another measurement?
Having said that, the weights are up to you! Yo have to choose them!
So, people usually consider just the correlation matrix (no weights or all weights are the same e.g w_i=1).
If you have a clear way to choose good weights, just do not consider this part.
2) I understand that you want to test your code. So, in order to that, you have to have correlated random variables. How to generate them?
Multivariate normal distributions are the simplest case. See the wikipedia page about them: Multivariate Normal Distribution (see the item "Drawing values from the distribution". Wikipedia shows you how to generate the random numbers from this distribution using Choleski Decomposition). The 2-variate case is much simpler. See for instance Generate Correlated Normal Random Variables
The good news is that if you are using Matlab there is a function for you. See Matlab: Random numbers from the multivariate normal distribution.]
In order to use this function you have to provide the desired means and covariances. [Note that you are making the role of nature here. You are generating the data! In real life, you are going to apply your function to the real data. What I am trying to say is that this step is only useful for tests. Furthermore, pay attencion to the fact that in the Matlab function you are providing the variances and evaluating the correlations (covariances normalized by standard errors). In the 2-dimensional case (that is the case of your function it is possible to provide directly the correlation. See the page above that I provided to you of Math.Stackexchange]
3) Finally, you can apply them to your function. Generate X and Y from a normal multivarite distribution and provide the vector of weights w to your function corr_weight_correlation and you are done!
I hope I provide what you need!
Daniel
Update:
% From the matlab page
mu = [2 3];
SIGMA = [1 1.5; 1.5 3];
n=100;
[x,y] = mvnrnd(mu,SIGMA,n);
% Using your code
w=ones(n,1);
corr_weight=weighted_correlation(x,y,w); % Remember that Sigma is covariance and Corr_weight is correlation. In order to calculate the same thing, just use result=cov_weighted instead.

Fast fourier transform for deasonalizing data in MATLAB

I'm very much a novice at signal processing techniques, but I am trying to apply the fast fourier transform to a daily time series to remove the seasonality present in the data. The example I am working with is from here:
http://www.mathworks.com/help/signal/ug/frequency-domain-linear-regression.html
While I understand how to implement the code as it is written in the example, I am having a hard time adapting it to my specific application. What I am trying to do is create a preprocessing function which deseasonalizes the training data using similar code to the above example. Then, using the same estimated coefficients from the in-sample data, deseasonalize the out-of-sample data to preserve its independence from the in-sample data. Basically, once the coefficients are estimated, I will normalize each new data point using the same coefficients. I suspect this is akin to estimating a linear trend, then removing it from the in-sample data, and then using the same linear model on unseen data to detrend it i the same manner.
Obviously, when I estimate the fourier coefficients, the vector I get out is equal to the length of the in-sample data. The out-of-sample data is comprised of much fewer observations, so directly applying them is impossible.
Is this sort of analysis possible using this technique or am I going down a dead end road? How should I approach that using the code in the example above?
What you want to do is certainly possible, you are on the right track, but you seem to misunderstand a few points in the example. First, it is shown in the example that the technique is the equivalent of linear regression in the time domain, exploiting the FFT to perform in the frequency domain an operation with the same effect. Second, the trend that is removed is not linear, it is equal to a sum of sinusoids, which is why FFT is used to identify particular frequency components in a relatively tidy way.
In your case it seems you are interested in the residuals. The initial approach is therefore to proceed as in the example as follows:
(1) Perform a rough "detrending" by removing the DC component (the mean of the time-domain data)
(2) FFT and inspect the data, choose frequency channels that contain most of the signal.
You can then use those channels to generate a trend in the time domain and subtract that from the original data to obtain the residuals. You need not proceed by using IFFT, however. Instead you can explicitly sum over the cosine and sine components. You do this in a way similar to the last step of the example, which explains how to find the amplitudes via time-domain regression, but substituting the amplitudes obtained from the FFT.
The following code shows how you can do this:
tim = (time - time0)/timestep; % <-- acquisition times for your *new* data, normalized
NFpick = [2 7 13]; % <-- channels you picked to build the detrending baseline
% Compute the trend
mu = mean(ts);
tsdft = fft(ts-mu);
Nchannels = length(ts); % <-- size of time domain data
Mpick = 2*length(NFpick);
X(:,1:2:Mpick) = cos(2*pi*(NFpick-1)'/Nchannels*tim)';
X(:,2:2:Mpick) = sin(-2*pi*(NFpick-1)'/Nchannels*tim)';
% Generate beta vector "bet" containing scaled amplitudes from the spectrum
bet = 2*tsdft(NFpick)/Nchannels;
bet = reshape([real(bet) imag(bet)].', numel(bet)*2,1)
trend = X*bet + mu;
To remove the trend just do
detrended = dat - trend;
where dat is your new data acquired at times tim. Make sure you define the time origin consistently. In addition this assumes the data is real (not complex), as in the example linked to. You'll have to examine the code to make it work for complex data.

how to calculate roc curves?

I write a classifier (Gaussian Mixture Model) to classify five human actions. For every observation the classifier compute the posterior probability to belong to a cluster.
I want to valutate the performance of my system parameterized with a threshold, with values from 0 to 100. For every threshold values, for every observation, if the probability of belonging to one of cluster is greater than threshold I accept the result of the classifier otherwise I discard it.
For every threshold values I compute the number of true-positive, true-negative, false-positive, false-negative.
Than I compute the two function: sensitivity and specificity as
sensitivity = TP/(TP+FN);
specificity=TN/(TN+FP);
In matlab:
plot(1-specificity,sensitivity);
to have the ROC curve. But the result isn't what I expect.
This is the plot of the functions of discards, errors, corrects, sensitivity and specificity varying the threshold of one action.
This is the plot of ROC curve of one action
This is the stem of ROC curve for the same action
I am wrong, but i don't know where. Perhaps I do wrong the calculating of FP, FN, TP, TN especially when the result of the classifier is minor of the threshold, so I have a discard. What I have to incremente when there is a discard?
Background
I am answering this because I need to work through the content, and a question like this is a great excuse. Thank you for the good opportunity.
I use data from the built-in fisher iris data:
http://archive.ics.uci.edu/ml/datasets/Iris
I also use code snippets from the Mathworks tutorial on the classification, and for plotroc
http://www.mathworks.com/products/demos/statistics/classdemo.html
http://www.mathworks.com/help/nnet/ref/plotroc.html?searchHighlight=plotroc
Problem Description
There is clearer boundary within the domain to classify "setosa" but there is overlap for "versicoloir" vs. "virginica". This is a two dimensional plot, and some of the other information has been discarded to produce it. The ambiguity in the classification boundaries is a useful thing in this case.
%load data
load fisheriris
%show raw data
figure(1); clf
gscatter(meas(:,1), meas(:,2), species,'rgb','osd');
xlabel('Sepal length');
ylabel('Sepal width');
axis equal
axis tight
title('Raw Data')
Analysis
Lets say that we want to determine the bounds for a linear classifier that defines "virginica" versus "non-virginica". We could look at "self vs. not-self" for other classes, but they would have their own
So now we make some linear discriminants and plot the ROC for them:
%load data
load fisheriris
load iris_dataset
irisInputs=meas(:,1:2)';
irisTargets=irisTargets(3,:);
ldaClass1 = classify(meas(:,1:2),meas(:,1:2),irisTargets,'linear')';
ldaClass2 = classify(meas(:,1:2),meas(:,1:2),irisTargets,'diaglinear')';
ldaClass3 = classify(meas(:,1:2),meas(:,1:2),irisTargets,'quadratic')';
ldaClass4 = classify(meas(:,1:2),meas(:,1:2),irisTargets,'diagquadratic')';
ldaClass5 = classify(meas(:,1:2),meas(:,1:2),irisTargets,'mahalanobis')';
myinput=repmat(irisTargets,5,1);
myoutput=[ldaClass1;ldaClass2;ldaClass3;ldaClass4;ldaClass5];
whos
plotroc(myinput,myoutput)
The result is shown in the following, though it took deleting repeat copies of the diagonal:
You can note in the code that I stack "myinput" and "myoutput" and feed them as inputs into the "plotroc" function. You should take the results of your classifier as targets and actuals and you can get similar results. This compares the actual output of your classifier versus the ideal output of your target values. Those are the input to plotroc.
So this will give you "built-in" ROC, which is useful for quick work, but does not make you learn every step in detail.
Questions you can ask at this point include:
which classifier is best? How do I determine what best is in this case?
What is the convex hull of the classifiers? Is there some mixture of classifiers that is more informative than any pure method? Bagging perhaps?
You are trying to draw the curves of precision vs recall, depending on the classifier threshold parameter. The definition of precision and recall are:
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
You can check the definition of these parameters in:
http://en.wikipedia.org/wiki/Precision_and_recall
There are some curves here:
http://www.cs.cornell.edu/courses/cs578/2003fa/performance_measures.pdf
Are you dividing your dataset in training set, cross validation set and test set? (if you do not divide the data, it is normal that your precision-recall curve seems weird)
EDITED: I think that there are two possible sources for your problem:
When you train a classifier for 5 classes, usually you have to train 5 distinctive classifiers. One classifier for (class A = class 1, class B = class 2, 3, 4 or 5), then a second classfier for (class A = class 2, class B = class 1, 3, 4 or 5), ... and the fifth for class A = class 5, class B = class 1, 2, 3 or 4).
As you said to select the output for your "compound" classifier, you have to pass your new (test) datapoint through the five classifiers, and you choose the one with the biggest probability.
Then, you should have 5 thresholds to define weighting values that my prioritize selecting one classifier over the others. You should check how the matlab implementations uses the thresholds, but their effect is that you don't choose the class with more probability, but the class with better weighted probability.
As you say, maybe you are not calculating well TP, TN, FP, FN. Your test data should have datapoints belonging to all the classes. Then you have testdata(i,:) and classtestdata(i) are the feature vector and "ground truth" class of datapoint i. When you evaluate the classifier you obtain classifierOutput(i) = 1 or 2 or 3 or 4 or 5. Then you should calculate the "confusion matrix", which is the way to calculate TP, TN, FP, FN when you have multiple classes (> 2):
http://en.wikipedia.org/wiki/Confusion_matrix
http://www.mathworks.com/help/stats/confusionmat.html
(note the relation between TP, TN, FP, FN that you are calculating for the multiclass problem)
I think that you can obtain the TP, TN, FP, FN data of each subclassifier (remember that you are calculating 5 separate classifiers, even if you do not realize it) from the confusion matrix. I am not sure but you can draw the precision recall curve for each subclassifier.
Also check these slides: http://www.slideserve.com/MikeCarlo/multi-class-and-structured-classification
I don't know what the ROC curve is, I will check it because machine learning is a really interesting subject for me.
Hope this helps,