I am training a neural network to do voice emotion recognition with:
100 input layer size.
25 hidden layer size.
6 labels (output layer).
I have divided the data set into a training set and a test set, then extracted features from the audio with melfcc (Mel-Frequency Cepstral Coefficients), which returns a matrix whose size differs from file to file. So I took 100 of those feature values for each recording.
The accuracy on the training set is 100%, but when it comes to the test set it is about 30-40%.
I still don't know a lot about this domain, but this looks like obvious overfitting (maybe not, but that is what I have learned). I have made some adjustments to avoid this problem:
increasing lambda, decreasing the number of features, and adding an extra hidden layer. The accuracy got better, but never above 40%.
What would be the problem?
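For reference, here is a minimal sketch (not from the original code) of one common way to get a fixed-length feature vector from the variable-sized melfcc output: summarize each cepstral coefficient over time instead of truncating. Variable names are hypothetical.
cepstra = melfcc(wavArray, 16000);                       % numcep x nframes, nframes varies per file
featureVector = [mean(cepstra, 2); std(cepstra, 0, 2)];  % fixed length: 2*numcep x 1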
Here is the implementation of melfcc:
function [cepstra,aspectrum,pspectrum] = melfcc(samples, sr, varargin)
if nargin < 2; sr = 16000; end
% Parse out the optional arguments
[wintime, hoptime, numcep, lifterexp, sumpower, preemph, dither, ...
minfreq, maxfreq, nbands, bwidth, dcttype, fbtype, usecmp, modelorder, ...
broaden, useenergy] = ...
process_options(varargin, 'wintime', 0.025, 'hoptime', 0.010, ...
'numcep', 13, 'lifterexp', 0.6, 'sumpower', 1, 'preemph', 0.97, ...
'dither', 0, 'minfreq', 0, 'maxfreq', 4000, ...
'nbands', 40, 'bwidth', 1.0, 'dcttype', 2, ...
'fbtype', 'mel', 'usecmp', 0, 'modelorder', 0, ...
'broaden', 0, 'useenergy', 0);
if preemph ~= 0
samples = filter([1 -preemph], 1, samples);
end
% Compute FFT power spectrum
[pspectrum,logE] = powspec(samples, sr, wintime, hoptime, dither);
aspectrum = audspec(pspectrum, sr, nbands, fbtype, minfreq, maxfreq, sumpower, bwidth);
if (usecmp)
% PLP-like weighting/compression
aspectrum = postaud(aspectrum, maxfreq, fbtype, broaden);
end
if modelorder > 0
if (dcttype ~= 1)
disp(['warning: plp cepstra are implicitly dcttype 1 (not ', num2str(dcttype), ')']);
end
% LPC analysis
lpcas = dolpc(aspectrum, modelorder);
% convert lpc to cepstra
cepstra = lpc2cep(lpcas, numcep);
% Return the auditory spectrum corresponding to the cepstra?
% aspectrum = lpc2spec(lpcas, nbands);
% else return the aspectrum that the cepstra are based on, prior to PLP
else
% Convert to cepstra via DCT
cepstra = spec2cep(aspectrum, numcep, dcttype);
end
cepstra = lifter(cepstra, lifterexp);
if useenergy
cepstra(1,:) = logE;
end
And here is my implementation:
clear ; close all; clc
[input,output]=gettingPatterns;
input_layer_size = 70;
hidden_layer_size = 100;
hidden2_layer_size = 25;
num_labels = 6;
fu = [input output];size(fu)
fu=fu(randperm(size(fu,1)),:);
input = fu(:,1:70);
output = fu (:,71:76);
crossIn = input(201:240,:);
crossOut=output(201:240,:);
trainIn = input(1:200,:);
trainOut=output(1:200,:);
Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);
Theta2 = randInitializeWeights(hidden_layer_size,hidden2_layer_size);
Theta3 =randInitializeWeights(hidden2_layer_size,num_labels);
initial_nn_params = [Theta1(:) ; Theta2(:);Theta3(:)];
size(initial_nn_params)
options = optimset('MaxIter',1000);
% You should also try different values of lambda
lambda=1;
costFunction = @(p) nnCostFunction(p, ...
input_layer_size, ...
hidden_layer_size, ...
hidden2_layer_size,num_labels, trainIn, trainOut, lambda);
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):( (hidden_layer_size * (input_layer_size + 1)))+(hidden2_layer_size*(hidden_layer_size+1))), ...
hidden2_layer_size, (hidden_layer_size + 1));
Theta3 = reshape(nn_params(((1 + (hidden_layer_size * (input_layer_size + 1)))+(hidden2_layer_size*(hidden_layer_size+1))):end), ...
num_labels, (hidden2_layer_size + 1));
%[error_train, error_val] = learningCurve(trainIn, trainOut, crossIn, crossOut, lambda,input_layer_size,hidden_layer_size,num_labels);
pred = predict(Theta1, Theta2,Theta3,trainIn);
[dummy, p] = max(trainOut, [], 2);
[pred trainOut]
fprintf('\nTraining Set Accuracy: %f\n', mean(double(pred == p)) * 100);
pred = predict(Theta1, Theta2,Theta3,crossIn);
[pred crossOut]
[dummy, p] = max(crossOut, [], 2);
fprintf('\nCross Validation Set Accuracy: %f\n', mean(double(pred == p)) * 100);
And here is the code for getting the patterns:
function [ input,output ] = gettingPatterns()
myFolder='C:\Users\ahmed\Documents\MATLAB\New Folder (3)\homeWork\speech';
filePattern=fullfile(myFolder,'*.wav');
wavFiles=dir(filePattern);
output=[];
input=[];
for k = 1:length(wavFiles)
sampleOutput=zeros(1,6);
baseFileName = wavFiles(k).name;
if strcmp(baseFileName(3:5), 'ang'), sampleOutput(1) = 1; end
if strcmp(baseFileName(3:5), 'fea'), sampleOutput(2) = 1; end
if strcmp(baseFileName(3:5), 'bor'), sampleOutput(3) = 1; end
if strcmp(baseFileName(3:5), 'sad'), sampleOutput(4) = 1; end
if strcmp(baseFileName(3:5), 'joy'), sampleOutput(5) = 1; end
if strcmp(baseFileName(3:5), 'neu'), sampleOutput(6) = 1; end
output(k,:)=sampleOutput;
fullFileName = fullfile(myFolder, baseFileName);
wavArray = wavread(fullFileName);
[cepstra,xxx]=melfcc(wavArray);
[m,n]=size(cepstra);
reshapedArray=reshape(cepstra,m*n,1);
smalledArray=small(reshapedArray,70);
% Normalize features: x(i) = (x(i) - mean) / std
mu = mean(smalledArray);
sg = std(smalledArray);
normalizedFeatures = zeros(1, length(smalledArray));
for i = 1:length(smalledArray)
    normalizedFeatures(i) = (smalledArray(i) - mu) / sg;
end
input(k,:)=normalizedFeatures;
end
Please note the following:
I have run the same test with the Neural Network Toolbox and got the same result. The only reason for implementing it myself is to be able to add an extra hidden layer.
The implementations of the cost function, forward propagation and backpropagation are 100% correct, so I did not include them in this question.
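For reference, a minimal sketch (not from the original code) of standardizing the cross-validation features with statistics computed on the training split only; variable names follow the listing above, and repmat is used for compatibility with older MATLAB versions:
mu = mean(trainIn);                    % 1 x 70 feature means from the training split
sg = std(trainIn);  sg(sg == 0) = 1;   % guard against zero-variance features
trainInStd = (trainIn - repmat(mu, size(trainIn, 1), 1)) ./ repmat(sg, size(trainIn, 1), 1);
crossInStd = (crossIn - repmat(mu, size(crossIn, 1), 1)) ./ repmat(sg, size(crossIn, 1), 1);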
I have 2 features, which I expand to contain all possible polynomial combinations of the two features up to order 6. When I run MATLAB's fminunc, it returns a weight vector in which all elements are 0.
The dataset is here
clear all;
clc;
data = load("P2-data1.txt");
m = length(data);
para = 0; % regularization parameter
%% Augment Feature
y = data(:,3);
new_data = newfeature(data(:,1), data(:,2), 3);
[~, n] = size(new_data);
betas1 = zeros(n,1); % initial weights
options = optimset('GradObj', 'on', 'MaxIter', 400);
[beta_new, cost] = fminunc(@(t)(regucostfunction(t, new_data, y, para)), betas1, options);
fprintf('Cost at theta found by fminunc: %f\n', cost);
fprintf('theta: \n');
fprintf(' %f \n', beta_new); % get all 0 here
% Compute accuracy on our training set
p_new = predict(beta_new, new_data);
fprintf('Train Accuracy after feature augmentation: %f\n', mean(double(p_new == y)) * 100);
fprintf('\n');
%% the functions are defined below
function g = sigmoid(z) % running properly
g = zeros(size(z));
g=ones(size(z))./(ones(size(z))+exp(-z));
end
function [J,grad] = regucostfunction(theta,x,y,para) % CalculateCost(x1,betas1,y);
m = length(y); % number of training examples
J = 0;
grad = zeros(size(theta));
hyp = sigmoid(x*theta);
err = (hyp - y)';
grad = (1/m)*(err)*x;
sum = 0;
for k = 2:length(theta)
sum = sum+theta(k)^2;
end
J = (1/m)*((-y' * log(hyp) - (1 - y)' * log(1 - hyp)) + para*(sum) );
end
function p = predict(theta, X)
m = size(X, 1); % Number of training examples
p = zeros(m, 1);
index = find(sigmoid(theta'*X') >= 0.5);
p(index,1) = 1;
end
function out = newfeature(X1, X2, degree)
out = ones(size(X1(:,1)));
for i = 1:degree
for j = 0:i
out(:, end+1) = (X1.^(i-j)).*(X2.^j);
end
end
end
data contains two columns of feature values followed by a third column of 0/1 labels.
The functions used are newfeature, which returns the expanded features, and regucostfunction, which computes the cost. When I took the same approach with the default (unexpanded) features it worked, so I think the problem here has to do with some coding issue.
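For reference, a quick check of what newfeature produces for a small degree (hypothetical toy inputs, not from the original data); with degree 2 the columns are [1, x1, x2, x1^2, x1*x2, x2^2]:
x1 = [1; 2];  x2 = [3; 4];   % hypothetical toy inputs
newfeature(x1, x2, 2)
% ans =
%      1     1     3     1     3     9
%      1     2     4     4     8    16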
What I'm trying to do is verify that if I add one more layer to a CNN, the accuracy goes higher.
The code is from https://github.com/lhoang29/DigitRecognition/blob/master/cnnload.m
I'm at the beginner stage with CNNs and am trying to add one more layer, including a convolution and a pooling stage. I have tried several ways, but it does not seem to work. Could someone show me how to add one more layer?
Thank you. Below is the code.
Code for main function:
clear all; close all; clc;
maxtrain = 10000;
iter = 10;
eta = 0.01;
%% Data Load
trlblid = fopen('train-labels-idx1-ubyte');
trimgid = fopen('train-images-idx3-ubyte');
tslblid = fopen('t10k-labels-idx1-ubyte');
tsimgid = fopen('t10k-images-idx3-ubyte');
% read train labels
fread(trlblid, 4);
numtrlbls = toint(fread(trlblid, 4));
trainlabels = fread(trlblid, numtrlbls);
% read train data
fread(trimgid, 4);
numtrimg = toint(fread(trimgid, 4));
trimgh = toint(fread(trimgid, 4));
trimgw = toint(fread(trimgid, 4));
trainimages = permute(reshape(fread(trimgid,trimgh*trimgw*numtrimg),trimgh,trimgw,numtrimg), [2 1 3]);
% read test labels
fread(tslblid, 4);
numtslbls = toint(fread(tslblid, 4));
testlabels = fread(tslblid, numtslbls);
% read test data
fread(tsimgid, 4);
numtsimg = toint(fread(tsimgid, 4));
tsimgh = toint(fread(tsimgid, 4));
tsimgw = toint(fread(tsimgid, 4));
testimages = permute(reshape(fread(tsimgid, tsimgh*tsimgw*numtsimg),tsimgh,tsimgw,numtsimg), [2 1 3]);
%% CNN Training
[missimages, misslabels] = cnntrain(trainlabels,trainimages,testlabels,testimages,maxtrain,iter,eta);
%% CNN Testing
showmiss(missimages,misslabels,testimages,testlabels,25,2);
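Assuming the standard MNIST files, the loaded arrays end up with these sizes (for reference, not from the original post):
% size(trainimages) -> [28 28 60000]    size(trainlabels) -> [60000 1]
% size(testimages)  -> [28 28 10000]    size(testlabels)  -> [10000 1]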
Code for training:
function [missimages, misslabels] = cnntrain(trainlabels,trainimages,testlabels,testimages,maxtrain,iter,eta)
fn = 5; % number of kernels for layer 1
ks = 5; % size of kernel
[h,w,n] = size(trainimages);
n = min(n,maxtrain);
% normalize data to [-1,1] range
nitrain = (trainimages / 255) * 2 - 1;
nitest = (testimages / 255) * 2 - 1;
% train with backprop
h1 = h-ks+1;
w1 = w-ks+1;
A1 = zeros(h1,w1,fn);
h2 = h1/2;
w2 = w1/2;
I2 = zeros(h2,w2,fn);
A2 = zeros(h2,w2,fn);
A3 = zeros(10,1);
% kernels for layer 1
W1 = randn(ks,ks,fn) * .01;
B1 = ones(1,fn);
% scale parameter and bias for layer 2
S2 = randn(1,fn) * .01;
B2 = ones(1,fn);
% weights and bias parameters for fully-connected output layer
W3 = randn(h2,w2,fn,10) * .01;
B3 = ones(10,1);
% true outputs
Y = eye(10)*2-1;
for it=1:iter
err = 0;
for im=1:n
%------------ FORWARD PROP ------------%
% Layer 1: convolution with bias followed by sigmoidal squashing
for fm=1:fn
A1(:,:,fm) = convn(nitrain(:,:,im),W1(end:-1:1,end:-1:1,fm),'valid') + B1(fm);
end
Z1 = tanh(A1);
% Layer 2: average/subsample with scaling and bias
for fm=1:fn
I2(:,:,fm) = avgpool(Z1(:,:,fm));
A2(:,:,fm) = I2(:,:,fm) * S2(fm) + B2(fm);
end
Z2 = tanh(A2);
% Layer 3: fully connected
for cl=1:10
A3(cl) = convn(Z2,W3(end:-1:1,end:-1:1,end:-1:1,cl),'valid') + B3(cl);
end
Z3 = tanh(A3); % Final output
err = err + .5 * norm(Z3 - Y(:,trainlabels(im)+1),2)^2;
%------------ BACK PROP ------------%
% Compute error at output layer
Del3 = (1 - Z3.^2) .* (Z3 - Y(:,trainlabels(im)+1));
% Compute error at layer 2
Del2 = zeros(size(Z2));
for cl=1:10
Del2 = Del2 + Del3(cl) * W3(:,:,:,cl);
end
Del2 = Del2 .* (1 - Z2.^2);
% Compute error at layer 1
Del1 = zeros(size(Z1));
for fm=1:fn
Del1(:,:,fm) = (S2(fm)/4)*(1 - Z1(:,:,fm).^2);
for ih=1:h1
for iw=1:w1
Del1(ih,iw,fm) = Del1(ih,iw,fm) * Del2(floor((ih+1)/2),floor((iw+1)/2),fm);
end
end
end
% Update bias at layer 3
DB3 = Del3; % gradient w.r.t bias
B3 = B3 - eta*DB3;
% Update weights at layer 3
for cl=1:10
DW3 = DB3(cl) * Z2; % gradient w.r.t weights
W3(:,:,:,cl) = W3(:,:,:,cl) - eta * DW3;
end
% Update scale and bias parameters at layer 2
for fm=1:fn
DS2 = convn(Del2(:,:,fm),I2(end:-1:1,end:-1:1,fm),'valid');
S2(fm) = S2(fm) - eta * DS2;
DB2 = sum(sum(Del2(:,:,fm)));
B2(fm) = B2(fm) - eta * DB2;
end
% Update kernel weights and bias parameters at layer 1
for fm=1:fn
DW1 = convn(nitrain(:,:,im),Del1(end:-1:1,end:-1:1,fm),'valid');
W1(:,:,fm) = W1(:,:,fm) - eta * DW1;
DB1 = sum(sum(Del1(:,:,fm)));
B1(fm) = B1(fm) - eta * DB1;
end
end
disp(['Error: ' num2str(err) ' at iteration ' num2str(it)]);
end
miss = 0;
numtest=size(testimages,3);
missimages = zeros(1,numtest);
misslabels = zeros(1,numtest);
for im=1:numtest
for fm=1:fn
A1(:,:,fm) = convn(nitest(:,:,im),W1(end:-1:1,end:-1:1,fm),'valid') + B1(fm);
end
Z1 = tanh(A1);
% Layer 2: average/subsample with scaling and bias
for fm=1:fn
I2(:,:,fm) = avgpool(Z1(:,:,fm));
A2(:,:,fm) = I2(:,:,fm) * S2(fm) + B2(fm);
end
Z2 = tanh(A2);
% Layer 3: fully connected
for cl=1:10
A3(cl) = convn(Z2,W3(end:-1:1,end:-1:1,end:-1:1,cl),'valid') + B3(cl);
end
Z3 = tanh(A3); % Final output
[pm,pl] = max(Z3);
if pl ~= testlabels(im)+1
miss = miss + 1;
missimages(miss) = im;
misslabels(miss) = pl - 1;
end
end
disp(['Miss: ' num2str(miss) ' out of ' num2str(numtest)]);
end
function [pr] = avgpool(img)
pr = zeros(size(img)/2);
for r=1:2:size(img,1)
for c=1:2:size(img,2)
pr((r+1)/2,(c+1)/2) = (img(r,c)+img(r+1,c)+img(r,c+1)+img(r+1,c+1))/4;
end
end
end
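A quick illustration of what avgpool does (hypothetical input, not from the original post): each non-overlapping 2x2 block is replaced by its average, so a 4x4 image becomes 2x2.
img = [ 1  2  3  4;
        5  6  7  8;
        9 10 11 12;
       13 14 15 16];
avgpool(img)
% ans =
%     3.5000    5.5000
%    11.5000   13.5000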
Code for showing accuracy:
function [] = showmiss(missim,misslab,testimages,testlabels,numshow,numpages)
nummiss = nnz(missim);
page = 1;
showsize = floor(sqrt(numshow));
for f=1:numshow:nummiss
figure(floor(f/numshow) + 1);
for m=f:min(nummiss,f+numshow-1)
subplot(showsize,showsize,m-f+1);
imshow(testimages(:,:,missim(m)));
title(strcat(num2str(testlabels(missim(m))), ':', num2str(misslab(m))));
end
page = page + 1;
if page > numpages
break;
end
end
end
Function toint:
function [x] = toint(b)
x = b(1)*16777216 + b(2)*65536 + b(3)*256 + b(4);
end
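For example (not from the original post), the first four bytes of train-images-idx3-ubyte, which the loader skips with fread(trimgid, 4), are the big-endian magic number 0x00000803, so:
toint([0; 0; 8; 3])   % returns 2051, the IDX magic number for the image files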
I have put together a 3-layer classifying artificial neural network that appears to work on other datasets. Playing around with some artificial datasets that I made, I was unable to correctly predict between two classes when one class is positive in either one feature or another feature.
Clearly class 1 can be identified by asking whether either feature 1 or feature 2 is equal to 1, but I can't get the algorithm to predict the dataset correctly (there are 20 examples following this pattern in the dataset).
Can ANN/MLPs recognize this type of pattern? If so, what am I missing? If not, are there other methods that can predict this type of pattern (maybe SVM)?
I used Octave, as that is what was used in the online course offered through Coursera. I have listed most of the code here, although it is structured slightly differently when I run it. As you can see, I do use bias units on the first and second layers, and I have also varied the number of hidden units in the second layer from 1 to 5 with no improvement over random guessing.
% Load dataset
y = [1; 1; 2; 2]
X = [1, 0; 0, 1; 0, 0; 0, 0]
m = size(X, 1);
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), num_labels, (hidden_layer_size + 1));
% Randomly initialize weight parameters
initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);
initial_Theta2 = randInitializeWeights(hidden_layer_size, num_labels);
initial_nn_params = [initial_Theta1(:) ; initial_Theta2(:)];
% Add bias units to layers and feedforward
Xbias = [ones(m,1), X];
L2bias = [ones(m,1), sigmoid(Xbias*Theta1')];
L3 = sigmoid(L2bias * Theta2');
% Create class matrix Y
Y = zeros(m, num_labels);
for r = 1:m;
Y(r, y(r)) = 1;
end
% Set cost function
J = (sum(sum(Y.*log(L3) + (1-Y).*log(1-L3))))/-m + lambda*(sum(sum((Theta1(:,2:columns(Theta1))).^2)) + sum(sum((Theta2(:,2:columns(Theta2))).^2)))/2/m;
% Initialize weight gradient matrices
D2 = zeros(rows(Theta2),columns(Theta2));
D1 = zeros(rows(Theta1),columns(Theta1));
% Calculate gradient with backpropagation
for t = 1:m;
a1 = [1 X(t,:)]';
z2 = Theta1*a1;
a2 = [1; sigmoid(z2)];
z3 = Theta2*a2;
a3 = sigmoid(z3);
d3 = a3 - Y(t,:)';
d2 = (Theta2'*d3)(2:end).*sigmoidGradient(z2);
D2 = D2 + d3*a2';
D1 = D1 + d2*a1';
end
Theta2_grad = D2/m;
Theta1_grad = D1/m;
Theta2_grad(:,2:end) = Theta2_grad(:,2:end) + lambda*Theta2(:,2:end)/m;
Theta1_grad(:,2:end) = Theta1_grad(:,2:end) + lambda*Theta1(:,2:end)/m;
% Unroll gradients
grad = [Theta1_grad(:) ; Theta2_grad(:)];
% Compute cost (Feed forward)
[J,grad] = nnCostFunction(initial_nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lambda);
% Create "short hand" for the cost function to be minimized using fmincg
costFunction = @(p) nnCostFunction(p, input_layer_size, hidden_layer_size, num_labels, X, y, lambda);
% Train the neural network using fmincg
options = optimset('MaxIter', 1000);
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);
% Obtain Theta1 and Theta2 back from nn_params
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), num_labels, (hidden_layer_size + 1));
A NN can recognize this kind of pattern; the Universal Approximation Theorem (among other results) proves that.
The most obvious reason I can think of is the lack of a bias neuron. Although, for a more useful answer, you would have to include your code.
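To illustrate the bias point, here is a hand-built example (not from the posted code): a single sigmoid unit with a bias weight can represent the OR pattern above. The weights are hypothetical.
w = [-10; 20; 20];                        % bias, weight for feature 1, weight for feature 2
X = [1, 0; 0, 1; 0, 0; 0, 0];             % the four examples from the question
h = 1 ./ (1 + exp(-[ones(4,1), X] * w));  % sigmoid(w0 + w1*x1 + w2*x2)
% h is approximately [1; 1; 0; 0], i.e. class 1 whenever either feature is 1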
Okay, so I am in the middle of Andrew Ng's machine learning course on Coursera and would like to adapt the neural network that was completed as part of assignment 4.
In particular, the neural network which I had completed correctly as part of the assignment was as follows:
Sigmoid activation function: g(z) = 1/(1+e^(-z))
10 output units, each which could take 0 or 1
1 hidden layer
Back-propagation method used to minimize cost function
Cost function:
J(Theta) = -(1/m) * sum_{i=1..m} sum_{k=1..K} [ y_k^(i)*log(h_Theta(x^(i))_k) + (1 - y_k^(i))*log(1 - h_Theta(x^(i))_k) ] + (lambda/(2m)) * sum_{l=1..L-1} sum_{i=1..s_l} sum_{j=1..s_(l+1)} (Theta_ji^(l))^2
where L = number of layers, s_l = number of units in layer l, m = number of training examples, K = number of output units
Now I want to adjust the exercise so that there is one continuous output unit that takes any value in [0,1], and I am trying to work out what needs to change. So far I have:
Replaced the data with my own, i.e. such that the output is a continuous variable between 0 and 1
Updated references to the number of output units
Updated the cost function in the back-propagation algorithm to:
J = 1/(2m) * sum_{i=1..m} (a_3^(i) - y^(i))^2
where a_3 is the value of the output unit determined from forward propagation.
I am certain that something else must change, because gradient checking shows that the gradient determined by back-propagation and the one from the numerical approximation no longer match. I did not change the sigmoid gradient; it is left at f(z)*(1-f(z)), where f(z) is the sigmoid function 1/(1+e^(-z)). Nor did I update the numerical approximation of the derivative; it is simply (J(theta+e) - J(theta-e))/(2e).
Can anyone advise of what other steps would be required?
Coded in Matlab as follows:
% FORWARD PROPAGATION
% input layer
a1 = [ones(m,1),X];
% hidden layer
z2 = a1*Theta1';
a2 = sigmoid(z2);
a2 = [ones(m,1),a2];
% output layer
z3 = a2*Theta2';
a3 = sigmoid(z3);
% BACKWARD PROPAGATION
delta3 = a3 - y;
delta2 = delta3*Theta2(:,2:end).*sigmoidGradient(z2);
Theta1_grad = (delta2'*a1)/m;
Theta2_grad = (delta3'*a2)/m;
% COST FUNCTION
J = 1/(2 * m) * sum( (a3-y).^2 );
% Implement regularization with the cost function and gradients.
Theta1_grad(:,2:end) = Theta1_grad(:,2:end) + Theta1(:,2:end)*lambda/m;
Theta2_grad(:,2:end) = Theta2_grad(:,2:end) + Theta2(:,2:end)*lambda/m;
J = J + lambda/(2*m)*( sum(sum(Theta1(:,2:end).^2)) + sum(sum(Theta2(:,2:end).^2)));
I have since realised that this question is similar to one asked by @Mikhail Erofeev on Stack Overflow; however, in this case I want the continuous variable to be between 0 and 1 and therefore use a sigmoid function.
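For reference, a minimal sketch of the numerical check described above, assuming costFunc is a function handle that returns J for a parameter vector theta:
e = 1e-4;
numgrad = zeros(size(theta));
for i = 1:numel(theta)
    ep = zeros(size(theta));
    ep(i) = e;
    numgrad(i) = (costFunc(theta + ep) - costFunc(theta - ep)) / (2*e);
end
% Compare numgrad element-wise with the gradient returned by back-propagation.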
First, your cost function should be:
J = 1/m * sum( (a3-y).^2 );
I think your Theta2_grad = (delta3'*a2)/m; is then expected to match the numerical approximation once delta3 is changed to delta3 = 1/2 * (a3 - y);.
Check this slide for more details.
EDIT:
In case there is some minor discrepancy between our code, I pasted my code below for your reference. The code has already been compared with the numerical-approximation function checkNNGradients(lambda); the relative difference is less than 1e-4 (though it does not meet the 1e-11 requirement from Dr. Andrew Ng).
function [J grad] = nnCostFunctionRegression(nn_params, ...
input_layer_size, ...
hidden_layer_size, ...
num_labels, ...
X, y, lambda)
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
num_labels, (hidden_layer_size + 1));
m = size(X, 1);
J = 0;
Theta1_grad = zeros(size(Theta1));
Theta2_grad = zeros(size(Theta2));
X = [ones(m, 1) X];
z1 = sigmoid(X * Theta1');
zs = z1;
z1 = [ones(m, 1) z1];
z2 = z1 * Theta2';
ht = sigmoid(z2);
y_recode = zeros(length(y),num_labels);
for i=1:length(y)
y_recode(i,y(i))=1;
end
y = y_recode;
regularization=lambda/2/m*(sum(sum(Theta1(:,2:end).^2))+sum(sum(Theta2(:,2:end).^2)));
J=1/(m)*sum(sum((ht - y).^2))+regularization;
delta_3 = 1/2*(ht - y);
delta_2 = delta_3 * Theta2(:,2:end) .* sigmoidGradient(X * Theta1');
delta_cap2 = delta_3' * z1;
delta_cap1 = delta_2' * X;
Theta1_grad = ((1/m) * delta_cap1)+ ((lambda/m) * (Theta1));
Theta2_grad = ((1/m) * delta_cap2)+ ((lambda/m) * (Theta2));
Theta1_grad(:,1) = Theta1_grad(:,1)-((lambda/m) * (Theta1(:,1)));
Theta2_grad(:,1) = Theta2_grad(:,1)-((lambda/m) * (Theta2(:,1)));
grad = [Theta1_grad(:) ; Theta2_grad(:)];
end
If you want to have a continuous output, try not to use the sigmoid activation when computing the output value.
a1 = [ones(m, 1) X];
a2 = sigmoid(a1 * Theta1');
a2 = [ones(m, 1) a2];
a3 = a2 * Theta2';
ht = a3;
Normalize the input before using it in nnCostFunction. Everything else remains the same.
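A minimal sketch of one way to do that (assuming X is the m-by-n feature matrix; min-max scaling to [0,1]):
Xmin = min(X);
Xrange = max(X) - Xmin;
Xrange(Xrange == 0) = 1;   % avoid division by zero for constant features
Xnorm = (X - repmat(Xmin, size(X, 1), 1)) ./ repmat(Xrange, size(X, 1), 1);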
I tried to port this Python implementation of a continuous RBM to MATLAB:
http://imonad.com/rbm/restricted-boltzmann-machine/
I generated 2-dimensional training data in the shape of a (noisy) circle and trained the RBM with 2 visible and 8 hidden units. To test the implementation, I fed uniformly distributed random data to the RBM and plotted the reconstructed data (same procedure as used in the link above).
Now the confusing part: with training data in the range (0,1)x(0,1) I get very satisfying results, but with training data in the range (-0.5,0.5)x(-0.5,0.5) or (-1,0)x(-1,0) the RBM reconstructs only data in the very top right of the circle. I don't understand what causes this. Is it just a bug in my implementation that I don't see?
Some plots (blue dots are the training data, red dots are the reconstructions).
Here is my implementation of the RBM:
Training:
maxepoch = 300;
ksteps = 10;
sigma = 0.2; % cd standard deviation
learnW = 0.5; % learning rate W
learnA = 0.5; % learning rate A
nVis = 2; % number of visible units
nHid = 8; % number of hidden units
nDat = size(dat, 1);% number of training data points
cost = 0.00001; % cost
moment = 0.9; % momentum
W = randn(nVis+1, nHid+1) / 10; % weights
dW = randn(nVis+1, nHid+1) / 1000; % change of weights
sVis = zeros(1, nVis+1); % state of visible neurons
sVis(1, end) = 1.0; % bias
sVis0 = zeros(1, nVis+1); % initial state of visible neurons
sVis0(1, end) = 1.0; % bias
sHid = zeros(1, nHid+1); % state of hidden neurons
sHid(1, end) = 1.0; % bias
aVis = 0.1*ones(1, nVis+1);% A visible
aHid = ones(1, nHid+1); % A hidden
err = zeros(1, maxepoch);
e = zeros(1, maxepoch);
for epoch = 1:maxepoch
wPos = zeros(nVis+1, nHid+1);
wNeg = zeros(nVis+1, nHid+1);
aPos = zeros(1, nHid+1);
aNeg = zeros(1, nHid+1);
for point = 1:nDat
sVis(1:nVis) = dat(point, :);
sVis0(1:nVis) = sVis(1:nVis); % initial sVis
% positive phase
activHid;
wPos = wPos + sVis' * sHid;
aPos = aPos + sHid .* sHid;
% negative phase
activVis;
activHid;
for k = 1:ksteps
activVis;
activHid;
end
tmp = sVis' * sHid;
wNeg = wNeg + tmp;
aNeg = aNeg + sHid .* sHid;
delta = sVis0(1:nVis) - sVis(1:nVis);
err(epoch) = err(epoch) + sum(delta .* delta);
e(epoch) = e(epoch) - sum(sum(W' * tmp));
end
dW = dW*moment + learnW * ((wPos - wNeg) / numel(dat)) - cost * W;
W = W + dW;
aHid = aHid + learnA * (aPos - aNeg) / (numel(dat) * (aHid .* aHid));
% error
err(epoch) = err(epoch) / (nVis * numel(dat));
e(epoch) = e(epoch) / numel(dat);
disp(['epoch: ' num2str(epoch) ' err: ' num2str(err(epoch)) ...
' ksteps: ' num2str(ksteps)]);
end
save(['rbm_' filename '.mat'], 'W', 'err', 'aVis', 'aHid');
activHid.m:
sHid = (sVis * W) + randn(1, nHid+1);
sHid = sigFun(aHid .* sHid, datRange);
sHid(end) = 1.; % bias
activVis.m:
sVis = (W * sHid')' + randn(1, nVis+1);
sVis = sigFun(aVis .* sVis, datRange);
sVis(end) = 1.; % bias
sigFun.m:
function [sig] = sigFun(X, datRange)
a = ones(size(X)) * datRange(1);
b = ones(size(X)) * (datRange(2) - datRange(1));
c = ones(size(X)) + exp(-X);
sig = a + (b ./ c);
end
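To illustrate (not from the original post): sigFun is a logistic sigmoid rescaled onto datRange, so with datRange = [0 1] it is the ordinary sigmoid.
sigFun(0, [0 1])    % returns 0.5, the ordinary logistic sigmoid at 0
sigFun(0, [-1 1])   % returns 0, the same curve rescaled onto (-1, 1)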
Reconstruction:
nSamples = 2000;
ksteps = 10;
nVis = 2;
nHid = 8;
sVis = zeros(1, nVis+1); % state of visible neurons
sVis(1, end) = 1.0; % bias
sHid = zeros(1, nHid+1); % state of hidden neurons
sHid(1, end) = 1.0; % bias
input = rand(nSamples, 2);
output = zeros(nSamples, 2);
for sample = 1:nSamples
sVis(1:nVis) = input(sample, :);
for k = 1:ksteps
activHid;
activVis;
end
output(sample, :) = sVis(1:nVis);
end
RBMs were originally designed to work only with binary data, but they also work with data between 0 and 1; it's part of the algorithm. Further reading
Since the input is in the range [0,1] for both x and y, that is why the reconstructions stay in that area. Changing the input to input = (rand(nSamples, 2)*2) - 1; results in input sampled from the range [-1,1], and therefore the red dots will be more spread out around the circle.