Why does MATLAB want such a giant amount of memory? - matlab

I'm trying to train a neural net that is 289x300x1, i.e. the input vector has 289 elements, there are 300 hidden neurons, and there is 1 class output.
So
net = feedforwardnet(300);
net = train(net,X,y,'useParallel','yes','showResources','yes');
gives error
Error using nn7/perfsJEJJ>calc_Y_trainPerfJeJJ (line 37) Error
detected on worker 2. Requested 87301x87301 (56.8GB) array exceeds
maximum array size preference.
X is an array of size 289x2040 with elements of type double.
y is an array of size 1x2040 with elements of type double.
I don't understand why MATLAB wants so much memory for such a small task. The weights that need to be stored amount to 289 * 300 * 8 bytes (a double is 8 bytes), which is roughly 0.7 MB.
And how can I solve it?

It is probably due to a combination of a few things:
1. The number of neurons in your hidden layer is rather large. Are you sure 300 hidden neurons is what you need? Consider breaking the problem down to fewer features; a dimensionality reduction may be fruitful (see the PCA sketch after the chunking code below), but I'm just speculating. From experience, a neural network with 300 hidden neurons should otherwise be fine; I only bring this up because that hidden-layer size is rather large.
2. You have too many inputs going in for training. You have 2040 samples going in at once, and that's perhaps why it's breaking. Try splitting the dataset into chunks of a given size, then incrementally training the network on each chunk.
Let's assume that you can't fix point #1, but you can address point #2. Something like this comes to mind:
chunk_size = 200;                          %// Declare chunk size
num_chunks = ceil(size(X,2)/chunk_size);   %// Get total number of chunks
net = feedforwardnet(300);                 %// Initialize NN

%// For each chunk, extract out a section of the data, then train the
%// network. Retrain the same network until we run out of data to train on.
for ii = 1 : num_chunks
    %// Cap off the last chunk in case the data isn't evenly divisible
    %// by the chunk size
    if ii*chunk_size > size(X,2)
        max_val = size(X,2);
    else
        max_val = ii*chunk_size;
    end

    %// Specify portion of data to extract
    interval = (ii-1)*chunk_size + 1 : max_val;

    %// Train the NN on this chunk
    net = train(net, X(:,interval), y(interval), 'useParallel','yes', 'showResources','yes');
end
As such, break up your data into chunks, train your neural network on each chunk separately, and update the network as you go. You can do this because neural networks are essentially trained with stochastic gradient descent, where the parameters are updated each time a new input sample (or small batch of samples) is provided.
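Regarding point #1, here is a minimal sketch of a PCA-based dimensionality reduction, assuming the Statistics and Machine Learning Toolbox (for pca) is available; the 95% variance threshold and the variable names are illustrative choices of mine, not something prescribed by the problem:
%// Reduce the 289 input features before training (illustrative sketch)
Xobs = X';                                   %// pca expects observations in rows: 2040x289
[coeff, score, ~, ~, explained] = pca(Xobs); %// principal components of the inputs
num_comp = find(cumsum(explained) >= 95, 1); %// components covering ~95% of the variance
X_reduced = score(:, 1:num_comp)';           %// back to columns-as-samples: num_comp x 2040

net = feedforwardnet(300);
net = train(net, X_reduced, y, 'useParallel','yes', 'showResources','yes');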

Related

Neural network y=f(x) regression

Encouraged by some success in MNIST classification, I wanted to solve a "real" problem with some neural networks.
The task seems quite easy:
We have:
some x-values (e.g. 1:1:100)
some y-values (e.g. x^2)
I want to train a network with 1 input (for 1 x-value) and one output (for 1 y-value). One hidden layer.
Here is my basic procedure:
1. Slice my x-values into batches (e.g. 10 elements per batch).
2. For each batch, calculate the outputs of the net, then apply backpropagation and calculate the weight and bias updates.
3. After each batch, average the calculated weight and bias updates and actually update the weights and biases.
4. Repeat steps 1-3 multiple times.
This procedure worked fine for MNIST, but for the regression it totally fails.
I am wondering if I do something fundamentally wrong.
I tried different batch sizes, up to averaging over ALL x-values.
Basically the network does not train well. After manually tweaking the weights and biases (with 2 hidden neurons) I could approximate my y=f(x) quite well, but when the network is supposed to learn the parameters itself, it fails.
When I have just one element for x and one for y and I train the network, it trains well for this one specific pair.
Maybe somebody has a hint for me. Am I misunderstanding regression with neural networks?
So far I assume the code itself is okay, as it worked for MNIST and it works for the "one x/y pair" example. I rather think my overall approach (see above) may not be suitable for regression.
Thanks,
Jim
ps: I will post some code tomorrow...
Here comes the code (MATLAB). As I said, it's one hidden layer with two hidden neurons:
% init hyper-parameters
hidden_neurons = 2;
input_neurons  = 1;
output_neurons = 1;
learning_rate  = 0.5;
batchsize      = 50;

% load data
training_data   = d(1:100)/100;
training_labels = v_start(1:100)/255;

% init weights
init_randomly = 1;
if init_randomly
    % initialize weights and biases with random numbers between -0.5 and +0.5
    w1 = rand(hidden_neurons,input_neurons)  - 0.5;
    b1 = rand(hidden_neurons,1)              - 0.5;
    w2 = rand(output_neurons,hidden_neurons) - 0.5;
    b2 = rand(output_neurons,1)              - 0.5;
else
    % initialize with manually determined values
    w1 = [10;-10];
    b1 = [-3;-0.5];
    w2 = [0.2 0.2];
    b2 = 0;
end

for epochs = 1:2000 % loop over some epochs
    for i = 1:batchsize:length(training_data) % slice training data into batches
        batch_data   = training_data(i:min(i+batchsize-1,length(training_data)));   % generate training batch
        batch_labels = training_labels(i:min(i+batchsize-1,length(training_data))); % generate training label batch

        % initialize weight and bias updates for the next batch
        w2_update = 0;
        b2_update = 0;
        w1_update = 0;
        b1_update = 0;

        for k = 1:length(batch_data) % loop over one single batch
            % extract training sample
            x = batch_data(k);   % one single training sample
            y = batch_labels(k); % expected output of the training sample

            % forward pass
            z1 = w1*x + b1;   % weighted sum of the first layer
            a1 = sigmoid(z1); % activation of the first layer (sigmoid)
            z2 = w2*a1 + b2;  % weighted sum of the second layer
            a2 = z2;          % activation of the second layer (linear)

            % backward pass
            delta_2 = (a2 - y);                      % delta of the second layer, assuming quadratic cost; the derivative of the linear unit is 1 for all x
            delta_1 = (w2'*delta_2) .* (a1.*(1-a1)); % delta of the first layer

            % accumulate the weight and bias updates, averaging over one batch
            w2_update = w2_update + (delta_2*a1') * (1/length(batch_data));
            b2_update = b2_update + delta_2       * (1/length(batch_data));
            w1_update = w1_update + (delta_1*x')  * (1/length(batch_data));
            b1_update = b1_update + delta_1       * (1/length(batch_data));
        end

        % actually update the weights; the updated weights will be used in the next batch
        w2 = w2 - learning_rate * w2_update;
        b2 = b2 - learning_rate * b2_update;
        w1 = w1 - learning_rate * w1_update;
        b1 = b1 - learning_rate * b1_update;
    end
end
Here is the outcome with random initialization, showing the expected output, the output before training, and the output after training:
[Figure: training with random init]
One can argue that the blue line is already closer than the black one; in that sense the network has already optimized the results. But I am not satisfied.
Here is the result with my manually tweaked values:
[Figure: training with pre-init]
The black line is not bad for just two hidden neurons, but my expectation was rather that such a black line would be the outcome of training starting from random initialization.
Any suggestions what I am doing wrong?
Thanks!
Ok, after some research I found some interesting points:
The function I tried to learn seems particularly hard to learn (I am not sure why).
With the same setup I tried to learn some 3rd-degree polynomials, which was successful (cost < 1e-6).
Randomizing the order of the training samples seems to improve learning (for the polynomial and for my initial function). I know this is well known in the literature, but I had always skipped that part in my implementations. So I learned for myself how important it is.
For learning "curvy/wiggly" functions, I found that sigmoid works better than ReLU (the output layer is still "linear", as suggested for regression).
A learning rate of 0.1 worked fine for the curve fitting I finally wanted to perform.
A larger batch size smooths the cost-vs-epochs plot (surprise...).
Initializing the weights between -5 and +5 worked better than between -0.5 and +0.5 for my application.
In the end I got quite convincing results for what I intended to learn with the network :)
Have you tried a much smaller learning rate? Generally, learning rates around 0.001 are a good starting point; 0.5 is in most cases way too large.
Also note that your predefined weights are in an extremely flat region of the sigmoid function (sigmoid(10) = 1, sigmoid(-10) = 0), with the derivative at both positions close to 0. That means that backpropagating from such a position (or getting to such a position) is extremely difficult; for exactly that reason, some people prefer to use ReLUs instead of sigmoids, since the ReLU only has a "dead" region for negative activations.
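To see how flat that region is, a quick throwaway check in MATLAB (not part of the original answer) shows the gradient at +/-10 is practically zero:
sig  = @(z) 1./(1 + exp(-z));      % logistic sigmoid
dsig = @(z) sig(z).*(1 - sig(z));  % its derivative
sig(10)    % ~0.99995
sig(-10)   % ~4.5e-5
dsig(10)   % ~4.5e-5, so gradients backpropagated through this unit nearly vanish
dsig(-10)  % ~4.5e-5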
Also, am I correct in seeing that you only have 100 training samples? You could maybe try a smaller batch size, or increase the number of samples you take. Also, don't forget to shuffle your samples after each epoch; plenty of reasons for this are given in the literature, for example here.
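For the shuffling, here is a minimal sketch of how the epoch loop from the question's code could draw the samples in a random order each epoch; the randperm call is the only change, and everything else is assumed to stay as in the original loop:
for epochs = 1:2000
    % draw a fresh random ordering of the training samples each epoch
    perm = randperm(length(training_data));
    shuffled_data   = training_data(perm);
    shuffled_labels = training_labels(perm);
    for i = 1:batchsize:length(shuffled_data)
        % ... build the batch from shuffled_data/shuffled_labels and
        % train exactly as in the original inner loops ...
    end
end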

Matlab Neural Network training doesn't yield good results

I'm trying to use a neural network for a classification problem, but the result of the training produces very bad performance. The classification problem:
I have more than 300,000 training samples
Each input is a vector of 32 values (real values)
Each output is a vector of 32 values (0 or 1)
This is how I train the network:
DNN_SIZE = [1000, 1000];

% Initialize DNN
net = feedforwardnet(DNN_SIZE, 'traingda');
net.performParam.regularization = 0.2;

% Set activation functions
for i = 1:length(DNN_SIZE)
    net.layers{i}.transferFcn = 'poslin';
end
net.layers{end}.transferFcn = 'logsig';

net = train(net, train_inputs, train_outputs);
Note: I have tried different values for DNN_SIZE, including larger and smaller layer sizes and more or fewer hidden layers, but it didn't make a difference.
Note 2: I have tried training the same network using a data set from Matlab's examples (simpleclass_dataset) and I still got bad performance.
The performance of the trained network is very bad: its output is basically 0.5 in every output element for every input vector (while the target outputs during training are always 0 or 1). What am I doing wrong, and how can I fix it?
Thanks.

Continuous nodes in Bayes net toolbox for Matlab

I have a node representing a random variable with 3314 realizations of 49 dimensions each. Can it be treated as a discrete variable? Each realization is a binary vector of 49 dimensions. The other nodes are observable, therefore the network training is performed with learn_params. This function transforms the data as follows:
local_data = data(fam, :);
if iscell(local_data)
    local_data = cell2num(local_data);
end
When this transformation takes place, the data lose their structure and become just an array of 162,386 values (49 * 3314), so each 49-dimensional vector can no longer be seen as a single realization of the variable; there will now be 162,386 realizations and the training of the network node will not be correct.
I want to know whether it is really correct to treat this variable as discrete, and what training alternatives I have that do not modify the data structure.

neural network for handwritten recognition?

I have been following Andrew Ng's Machine Learning course, and I currently have some doubts about the implementation of a handwriting recognition tool.
First he says that he uses a subset of the MNIST dataset, which contains 5000 training examples, and each training example is an image in 20x20 grayscale format. With that he says that we have a vector of 400 elements that is the "unrolled" version of the data previously described. Does it mean that the training set has something like the following format?
Training example 1 v[1,2,...,400]
Training example 2 v[1,2,...,400]
...
Training example 5000 v[1,2,...,400]
For the coding part the author gives the following complete code in Matlab:
%% Machine Learning Online Class - Exercise 3 | Part 2: Neural Networks
% Instructions
% ------------
%
% This file contains code that helps you get started on the
% linear exercise. You will need to complete the following functions
% in this exercise:
%
% lrCostFunction.m (logistic regression cost function)
% oneVsAll.m
% predictOneVsAll.m
% predict.m
%
% For this exercise, you will not need to change any code in this file,
% or any other files other than those mentioned above.
%
%% Initialization
clear ; close all; clc
%% Setup the parameters you will use for this exercise
input_layer_size = 400; % 20x20 Input Images of Digits
hidden_layer_size = 25; % 25 hidden units
num_labels = 10; % 10 labels, from 1 to 10
% (note that we have mapped "0" to label 10)
%% =========== Part 1: Loading and Visualizing Data =============
% We start the exercise by first loading and visualizing the dataset.
% You will be working with a dataset that contains handwritten digits.
%
% Load Training Data
fprintf('Loading and Visualizing Data ...\n')
load('ex3data1.mat');
m = size(X, 1);
% Randomly select 100 data points to display
sel = randperm(size(X, 1));
sel = sel(1:100);
displayData(X(sel, :));
fprintf('Program paused. Press enter to continue.\n');
pause;
%% ================ Part 2: Loading Parameters ================
% In this part of the exercise, we load some pre-initialized
% neural network parameters.
fprintf('\nLoading Saved Neural Network Parameters ...\n')
% Load the weights into variables Theta1 and Theta2
load('ex3weights.mat');
%% ================= Part 3: Implement Predict =================
% After training the neural network, we would like to use it to predict
% the labels. You will now implement the "predict" function to use the
% neural network to predict the labels of the training set. This lets
% you compute the training set accuracy.
pred = predict(Theta1, Theta2, X);
fprintf('\nTraining Set Accuracy: %f\n', mean(double(pred == y)) * 100);
fprintf('Program paused. Press enter to continue.\n');
pause;
% To give you an idea of the network's output, you can also run
% through the examples one at a time to see what it is predicting.
% Randomly permute examples
rp = randperm(m);
for i = 1:m
    % Display
    fprintf('\nDisplaying Example Image\n');
    displayData(X(rp(i), :));

    pred = predict(Theta1, Theta2, X(rp(i),:));
    fprintf('\nNeural Network Prediction: %d (digit %d)\n', pred, mod(pred, 10));

    % Pause
    fprintf('Program paused. Press enter to continue.\n');
    pause;
end
and the predict function is to be completed by the students; I have done the following:
function p = predict(Theta1, Theta2, X)
%PREDICT Predict the label of an input given a trained neural network
%   p = PREDICT(Theta1, Theta2, X) outputs the predicted label of X given the
%   trained weights of a neural network (Theta1, Theta2)

% Useful values
m = size(X, 1);
num_labels = size(Theta2, 1);

% You need to return the following variables correctly
p = zeros(size(X, 1), 1);
X = [ones(m, 1) X];

% ====================== YOUR CODE HERE ======================
% Instructions: Complete the following code to make predictions using
%               your learned neural network. You should set p to a
%               vector containing labels between 1 to num_labels.
%
% Hint: The max function might come in useful. In particular, the max
%       function can also return the index of the max element, for more
%       information see 'help max'. If your examples are in rows, then, you
%       can use max(A, [], 2) to obtain the max for each row.
%

a1 = X;                   % input layer activations (with bias column)
a2 = sigmoid(a1*Theta1'); % hidden layer activations
a2 = [ones(m, 1) a2];     % add bias column to the hidden layer
a3 = sigmoid(a2*Theta2'); % output layer activations
[~, p] = max(a3, [], 2);  % predicted label = index of the largest output
Even though it runs, I am not completely aware of how it really works (I have just followed the step-by-step instructions on the author's website). I have doubts about the following:
The author considers that X (the input) is an array of 5000 x 400 elements, i.e. the network has 400 input neurons, 10 output neurons and a hidden layer. Does it mean these 5000 x 400 values are the training set?
The author gives us the values of Theta1 and Theta2, which I believe serve as weights for the calculations in the inner layer, but how are those values obtained? Why does he use 25 neurons in the hidden layer and not 24 or 30?
Any help will be appreciated.
Thanks
Let's break your question into parts:
First he says that he uses a subset of the MNIST dataset, which
contains 5000 training examples, and each training example is an image
in 20x20 grayscale format. With that he says that we have a vector
of 400 elements that is the "unrolled" version of the data
previously described. Does it mean that the training set has something
like the following format? (...)
You're on the right track. Each training example is a 20x20 image. The simplest neural network model, introduced in the course, treats each image just as a simple 1x400 vector ("unrolled" means exactly this transformation). The dataset is stored in a matrix because this way you can perform computations faster, exploiting the efficient linear algebra libraries used by Octave/MATLAB. You don't necessarily need to store all training examples as a 5000x400 matrix, but this way your code will run faster.
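As an illustration of that unrolling (a throwaway sketch of mine, not taken from the course material), a single 20x20 image becomes one row of the data matrix like this:
img = rand(20, 20);          % stand-in for one 20x20 grayscale digit image
x   = reshape(img, 1, 400);  % "unroll" it into a single 1x400 row vector
% stacking 5000 such rows gives the 5000x400 training matrix X,
% so X(i, :) is the unrolled i-th training image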
The author considers that X (the input) is an array of 5000 x 400 elements,
i.e. the network has 400 input neurons, 10 output neurons and a hidden
layer. Does it mean these 5000 x 400 values are the training set?
The "input layer" is nothing but the very input image. You can think of it as neurons whose output values were already calculated or as the values were coming from outside the network (think about your retina. It is like the input layer of you visual system). Thus this network has 400 input units (the "unrolled" 20x20 image). But of course, your training set doesn't consist of a single image, thus you put all your 5000 images together in a single 5000x400 matrix to form your training set.
The author gives us the values of Theta1 and Theta2, which I believe
serve as weights for the calculations in the inner layer, but how are
those values obtained?
These theta values were found using an algorithm called backpropagation. If you haven't had to implement it in the course yet, just be patient; it might be in the exercises soon! By the way, yes, they are the weights.
Why does he use 25 neurons in the hidden layer and not 24 or 30?
He probably chose an arbitrary value that doesn't run too slowly and doesn't perform too poorly. You can probably find much better values for this hyperparameter, but if you increase it too much, the training procedure will probably take much longer. Also, since you are using only a small portion of the whole training set (the original MNIST has 60000 training examples and 28x28 images), you need to use a "small" number of hidden units to prevent overfitting. If you use too many units, your neurons will "learn by heart" the training examples and will not be able to generalize to new, unseen data. Finding the hyperparameters, such as the number of hidden units, is a kind of art that you will master with experience (and maybe with Bayesian optimization and more advanced methods, but that's another story xD).
I did the same course some time ago.
X is the input data. Therefore X is the matrix consisting of the 5000 vectors of 400 elements each. There is no training in this exercise, because the network is pre-trained.
Normally the values for Theta1 and Theta2 are trained. How this is done is a subject of the next few lectures (the backpropagation algorithm).
I'm not entirely sure why he used 25 neurons in the hidden layer. However, my guess is that this number of neurons simply works without making the training step take forever.

Implementing a neural network for smile detection in images

Let me explain the background before I explain the problem. My task is to take in an image that is labelled as containing a smile or not. The files are labelled, for example, 100a.jpg and 100b.jpg, where 'a' represents an image without a smile and 'b' represents an image with a smile. As such I'm looking to make a 3-layered network, i.e. layer 1 = input nodes, layer 2 = hidden layer and layer 3 = output node.
The general algorithm is to:
Take in an image and resize it to 24x20.
Apply a forward propagation from the input nodes to the hidden layer.
Apply a forward propagation from the hidden layer to the output node.
Then apply a backward propagation from the output node to the hidden layer. (Formula1)
Then apply a backward propagation from the hidden layer to the input nodes. (Formula2)
Formula 1: (image of the output-to-hidden weight-update formula, not reproduced here)
Formula 2: (image of the hidden-to-input weight-update formula, not reproduced here)
Now the problem quite simply is... my code never converges, and as such I don't have weight vectors that can be used to test the network. The problem is I HAVE NO CLUE WHY THIS IS HAPPENING... Here is the output I display, clearly not converging:
Training done full cycle
0.5015
Training done full cycle
0.5015
Training done full cycle
0.5015
Training done full cycle
0.5038
Training done full cycle
0.5038
Training done full cycle
0.5038
Training done full cycle
0.5038
Training done full cycle
0.5038
Here is my MATLAB code:
function [thetaLayer12, thetaLayer23] = trainSystem()
    %This is just the directory where I read the images from
    files = dir('train1/*.jpg');
    filelength = length(files);

    %Here I create my weights between the input layer and the hidden layer and
    %then from the hidden layer to the output node. The value 481 is used
    %because there will be 480 input nodes + 1 bias node. The value 200 is
    %the number of hidden layer nodes.
    thetaLayer12 = unifrnd(-1, 1, [481,200]);
    thetaLayer23 = unifrnd(-1, 1, [201,1]);

    %Learning rate value
    alpha = 0.00125;

    %Initialize convergence error
    globalError = 100;

    while(globalError > 0.001)
        globalError = 0;
        %Run through all the files in my training set. 400 files to be exact.
        for i = 1 : filelength
            %Here we find out if the image has a smile in it or not. The
            %images are labelled 1a.jpg, 1b.jpg, where images with an 'a' in them
            %have no smile and images with a 'b' in them have a smile.
            y = isempty(strfind(files(i).name,'a'));

            %We read in the image
            imageBig = imread(strcat('train1/',files(i).name));
            %We resize the image to 24x20
            image = imresize(imageBig,[24 20]);
            %I then take the 2D image and map it to a 1D vector
            inputNodes = reshape(image,480,1);
            %A bias value of 1 is added to the top of the vector
            inputNodes = [1;inputNodes];

            %Forward propagation is applied between the input layer and the
            %hidden layer
            outputLayer2 = logsig(double(inputNodes')*thetaLayer12);
            %Here we then add a bias value to the hidden layer nodes
            inputNodes2 = [1;outputLayer2'];
            %Here we then do a forward propagation from the hidden layer to the
            %output node to obtain a single value.
            finalResult = logsig(double(inputNodes2')*thetaLayer23);

            %Backward propagation is then applied to the weights between the
            %output node and the hidden layer.
            thetaLayer23 = thetaLayer23 - alpha*(finalResult - y)*inputNodes2;
            %Backward propagation is then applied to the weights between the
            %hidden layer and the input nodes.
            thetaLayer12 = thetaLayer12 - (((alpha*(finalResult-y)*thetaLayer23(2:end))'*inputNodes2(2:end))*(1-inputNodes2(2:end))*double(inputNodes'))';

            %I sum the error across each iteration over all the images in the
            %folder
            globalError = globalError + abs(finalResult-y);

            if(i == 400)
                disp('Training done full cycle');
            end
        end
        %I take the average error
        globalError = globalError / filelength;
        disp(globalError);
    end
end
Any help would seriously be appreciated!!!!
The success of training any machine learning algorithm depends heavily on the number of training examples you use to train it. You never said exactly how many training examples you have, but in the case of face detection a huge number of examples would probably be needed (if it would work at all).
Think of it this way: a computer scientist shows you two arrays of pixel intensity values. He tells you which one has a smile in it and which does not. Then he shows you two more and asks you to tell him which one has a smile in it.
Fortunately we can work around this to some extent. You can use an autoencoder or a dictionary learner like sparse coding to find higher-level structure in the data. Instead of the computer scientist showing you pixel intensities, he could show you edges or even body parts. You could then use this as input to your neural network, but a significant number of training examples would probably still be needed (though fewer than before); see the sketch below.
That analogy was inspired by a talk given by Professor Ng of Stanford on unsupervised feature learning.
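As a minimal sketch of that autoencoder idea, assuming the Neural Network Toolbox functions trainAutoencoder and encode are available, and assuming the images have already been loaded into a 480xN matrix of unrolled pixel columns with a 1xN vector of 0/1 smile labels (both of these are my assumptions, not part of the original answer):
% images: 480xN matrix, one unrolled 24x20 image per column (assumed to exist)
% labels: 1xN vector of 0/1 smile labels (assumed to exist)
hiddenSize = 100;                              % size of the learned feature layer (arbitrary choice)
autoenc = trainAutoencoder(images, hiddenSize, ...
    'MaxEpochs', 200, ...
    'SparsityProportion', 0.1);                % encourage sparse, edge-like features
features = encode(autoenc, images);            % 100xN higher-level representation of the images

% train a small classifier on the learned features instead of raw pixels
net = feedforwardnet(20);
net = train(net, features, labels);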