I am experimenting with Matlab, set up a Narx Neural Network with the input vector consisting of 2 values, each of them is delayed 30 times, than I have a hidden sigmoid layer with 40 neurons, another one with 15 and the output layer consisting of one value with a purelin function.
I try to transfer the network to the c/c++ lib fann, so I try to understand which layer does what.
netc.b{3} = 0.2302
and netc.LW{6} gives me a vector with 15 values. When I set the values to zero by
netc.LW{6} = zeros(1,15)
And feed the network with zeros by
out = netc(con2seq([zeros(1,40);zeros(1,40)]))
I would expect only the bias to show up at the output, but I get 40 times the value 311.7813. Setting the bias on the output layer to zero I get 40 times 255.5 as output. What do I get wrong?
Related
I'm working on a neural network with Keras using TensorFlow as the backend right now, and my model takes 5 inputs, all normalized to 0 to 1. The inputs' units vary from m/s to meters to m/s/s. So, for example, one input could vary from 0 m/s to 30 m/s, while another input could vary from 5 m to 200 m in the training dataset.
Is it better to individually and independently normalize all inputs so that I have different scales for each unit/input? Or would normalizing all inputs to one scale (mapping 0-200 to 0-1 for the example above) be better for accuracy?
Normalize individualy each input. Because if you normalize everything by dividing 200 some inputs will affect your network less than others. If one input vary between 0-30, after dividing by 200 you get 0-0.15 scale and scale for input which vary 0-200 will be 0-1 after division. So 0-30 input will have less numbers and you tell your network that input is not so relevant as one whith 0-200.
I'm trying to learn neural net that is 289x300x1. E.g. input vector is 289 elements, 300 hidden neurons, 1 class-output.
So
net = feedforwardnet(300);
net = train(net,X,y,'useParallel','yes','showResources','yes');
gives error
Error using nn7/perfsJEJJ>calc_Y_trainPerfJeJJ (line 37) Error
detected on worker 2. Requested 87301x87301 (56.8GB) array exceeds
maximum array size preference.
X is an array of size 289x2040, type of elements is double.
y is an array of size 1x2040, type of elemetns is double.
I dont understand why matlab wants so much of memory for such small task. Weights need to be stored = 289 * 300 * 64 bytes which is ~5.5 MB.
And how to solve it.
It is probably due to a combination of a few things:
The number of neurons into your hidden layer is rather large.... are you sure 300 features / neurons is what you need? Consider breaking down the problem to fewer features... a dimensionality reduction may be fruitful, but I'm just speculating. However, from what I know, a neural network of 300 hidden neurons should be fine from experience, but I just brought this point up because that hidden neuron size is rather large.
You have too many inputs going in for training. You have 2040 points going in and that's perhaps why it's breaking. Try breaking up the dataset into chunks of a given size, then incrementally train the network for each chunk.
Let's assume that point #1 you can't fix, but you can address point #2, something like this comes to mind:
chunk_size = 200; %// Declare chunk size
num_chunks = ceil(size(X,2)/chunk_size); %// Get total number of chunks
net = feedforwardnet(300); %// Initialize NN
%// For each chunk, extract out a section of the data, then train the
%// network. Retrain on original network until we run out of data to train
for ii = 1 : num_chunks
%// Ensure cap off if we get to the chunk at the end that isn't
%// evenly divisible by the chunk size
if ii*chunk_size > size(X,2)
max_val = size(X,2);
else
max_val = ii*chunk_size;
end
%// Specify portion of data to extract
interval = (ii-1)*chunk_size + 1 : max_val;
%// Train the NN on this data
net = train(net, X(:,interval), y(interval),'useParallel','yes','showResources','yes'));
end
As such, break up your data into chunks, train your neural network on each chunk separately and update the neural network as you go. You can do this because neural networks basically implement Stochastic Gradient Descent where the parameters are updated each time a new input sample is provided.
In multinomial classification, I'm using soft-max activation function for all non-linear units and ANN has 'k' number of output nodes for 'k' number of classes. Each of the 'k' output nodes present in output layer is connected to all the weights in preceding layer, kind of like the one shown below.
So, if the first output node intends to pull the weights in it's favor, it will change all the weights that precede this layer and the other output nodes will also pull which usually contradicts to the direction in which the first one was pulling. It seems more like a tug of war with single set of weights. So, do we need a separate set of weights(,which includes weights for every node of every layer) for each of the output classes or is there a different form of architecture present? Please, correct me if I'm wrong.
Each node has its set of weights. Implementations and formulas usually use matrix multiplications, which can make you forget the fact that, conceptually, each node has its own set of weights, but they do.
Each node returns a single value that gets sent to every node in the next layer. So a node on layer h receives num(h - 1) inputs, where num(h - 1) is the number of nodes in layer h - 1. Let these inputs be x1, x2, ..., xk. Then the neuron returns:
x1*w1 + x2*w2 + ... + xk*wk
Or a function of this. So each neuron maintains its own set of weights.
Let's consider the network in your image. Assume that we have some training instance for which the topmost neuron should output 1 and the others 0.
So our target is:
y = [1 0 0 0]
And our actual output is (ignoring the softmax for simplicity):
y^ = [0.88 0.12 0.04 0.5]
So it's already doing pretty well, but we must still do backpropagation to make it even better.
Now, our output delta is:
y^ - y = [-0.12 0.12 0.04 0.5]
You will update the weights of the topmost neuron using the delta -0.12, of the second neuron using 0.12 and so on.
Notice that each output neuron's weights get updated using these values: these weights will all increase or decrease in order to approach the correct values (0 or 1).
Now, notice that each output neuron's output depends on the outputs of hidden neurons. So you must also update those. Those will get updated using each output neuron's delta (see page 7 here for the update formulas). This is like applying the chain rule when taking derivatives.
You're right that, for a given hidden neuron, there is a "tug of war" going on, with each output neuron's errors pulling their own way. But this is normal, because the hidden layer must learn to satisfy all output neurons. This is a reason for initializing the weights randomly and for using multiple hidden neurons.
It is the output layer that adapts to give the final answers, which it can do since the weights of the output nodes are independent of each other. The hidden layer has to be influenced by all output nodes, and it must learn to accommodate them all.
I have a node representing a random variable whith 3314 realizations and 49 dimensions each, can it be treated as a discrete variable? Each realization is a binary vector of 49 dimensions, the other nodes are observable therefore the network training is performed with learn_params. This function transforms data as follows:
local_data = data(fam, :);
if iscell(local_data)
local_data = cell2num(local_data);
end
When this transformation takes place, the data lose their structure and become just an array of 162386 = (49 * 3314) so every vector of 49 dimensions cannot be seen as an embodiment of the variable, now will be 162386 realizations and the training on network node will not be correct.
I want to know if it is really possible (right?) take this variable as discrete and what training alternative I have for not modifying the data structure.
I'm trying to implement gradient checking for a simple feedforward neural network with 2 unit input layer, 2 unit hidden layer and 1 unit output layer. What I do is the following:
Take each weight w of the network weights between all layers and perform forward propagation using w + EPSILON and then w - EPSILON.
Compute the numerical gradient using the results of the two feedforward propagations.
What I don't understand is how exactly to perform the backpropagation. Normally, I compare the output of the network to the target data (in case of classification) and then backpropagate the error derivative across the network. However, I think in this case some other value have to be backpropagated, since in the results of the numerical gradient computation are not dependent of the target data (but only of the input), while the error backpropagation depends on the target data. So, what is the value that should be used in the backpropagation part of gradient check?
Backpropagation is performed after computing the gradients analytically and then using those formulas while training. A neural network is essentially a multivariate function, where the coefficients or the parameters of the functions needs to be found or trained.
The definition of a gradient with respect to a specific variable is the rate of change of the function value. Therefore, as you mentioned, and from the definition of the first derivative we can approximate the gradient of a function, including a neural network.
To check if your analytical gradient for your neural network is correct or not, it is good to check it using the numerical method.
For each weight layer w_l from all layers W = [w_0, w_1, ..., w_l, ..., w_k]
For i in 0 to number of rows in w_l
For j in 0 to number of columns in w_l
w_l_minus = w_l; # Copy all the weights
w_l_minus[i,j] = w_l_minus[i,j] - eps; # Change only this parameter
w_l_plus = w_l; # Copy all the weights
w_l_plus[i,j] = w_l_plus[i,j] + eps; # Change only this parameter
cost_minus = cost of neural net by replacing w_l by w_l_minus
cost_plus = cost of neural net by replacing w_l by w_l_plus
w_l_grad[i,j] = (cost_plus - cost_minus)/(2*eps)
This process changes only one parameter at a time and computes the numerical gradient. In this case I have used the (f(x+h) - f(x-h))/2h, which seems to work better for me.
Note that, you mentiond: "since in the results of the numerical gradient computation are not dependent of the target data", this is not true. As when you find the cost_minus and cost_plus above, the cost is being computed on the basis of
The weights
The target classes
Therefore, the process of backpropagation should be independent of the gradient checking. Compute the numerical gradients before backpropagation update. Compute the gradients using backpropagation in one epoch (using something similar to above). Then compare each gradient component of the vectors/matrices and check if they are close enough.
Whether you want to do some classification or have your network calculate a certain numerical function, you always have some target data. For example, let's say you wanted to train a network to calculate the function f(a, b) = a + b. In that case, this is the input and target data you want to train your network on:
a b Target
1 1 2
3 4 7
21 0 21
5 2 7
...
Just as with "normal" classification problems, the more input-target pairs, the better.