I'm full newbie in neural networks. I generated NN in matlab. Further I need to know exact structure of this NN, because I need to implement it in Java(static connections and weights, no learning). Can you explain how to connect neurons and what math operations perform in each element?
NN params are next(taken from Matlab):
iw{1,1} - Weight to layer 1 from intput 1
[2.8574 -1.9207;
1.7582 -1.2549;
-4.5925 0.23236;
12.0861 12.3701;
2.503 -1.9321;
-2.1422 2.6928]
lw{2,1} - Weight to layer
[-0.51977 5.3993 3.4349 5.2863 3.1976 -0.67102]
b{1} - Bias to layer 1
[-3.2811;
-6.956;
-3.0943;
11.1103;
0.14842;
-3.3705]
b{2} - Bias to layer 2
[1.4657]
Transfer function TANSIG
Greatly appreciate your help.
You have a NN that has 2 inputs, then a hidden layer of 6 neurons and an output layer of 1 neuron.
Each of the neuron in each layer, will take all the outputs from the previous one and multiply them by a number and offset the result by another.
The numbers you show are the numbers I mentioned.
For example, the neuron 1 from hidden layer will output hidden1=2.8574*in1 -1.9207*in2-3.2811. Then take whatever sigma function you are using and apply hidden1=sigma(hidden1).
As another example, the output will be out=-hidden1*0.51977+hidden2*5.3993+...-hidden6*0.67102+1.4657
Related
I wrote this script (Matlab) for classification using Softmax. Now I want to use same script for regression by replacing the Softmax output layer with a Sigmoid or ReLU activation function. But I wasn't able to do that.
X=houseInputs ;
T=houseTargets;
%Train an autoencoder with a hidden layer of size 10 and a linear transfer function for the decoder. Set the L2 weight regularizer to 0.001, sparsity regularizer to 4 and sparsity proportion to 0.05.
hiddenSize = 10;
autoenc1 = trainAutoencoder(X,hiddenSize,...
'L2WeightRegularization',0.001,...
'SparsityRegularization',4,...
'SparsityProportion',0.05,...
'DecoderTransferFunction','purelin');
%%
%Extract the features in the hidden layer.
features1 = encode(autoenc1,X);
%Train a second autoencoder using the features from the first autoencoder. Do not scale the data.
hiddenSize = 10;
autoenc2 = trainAutoencoder(features1,hiddenSize,...
'L2WeightRegularization',0.001,...
'SparsityRegularization',4,...
'SparsityProportion',0.05,...
'DecoderTransferFunction','purelin',...
'ScaleData',false);
features2 = encode(autoenc2,features1);
%%
softnet = trainSoftmaxLayer(features2,T,'LossFunction','crossentropy');
%Stack the encoders and the softmax layer to form a deep network.
deepnet = stack(autoenc1,autoenc2,softnet);
%Train the deep network on the wine data.
deepnet = train(deepnet,X,T);
%Estimate the deep network, deepnet.
y = deepnet(X);
Regression is a different problem from classification. You have to change your loss function to something that fits with a regression e.g. mean square error and of course change the number of neuron to one (you will only ouput 1 value on your last layer).
It is possible to use a Neural Network to perform a regression task but it might be an overkill for many tasks. True regression means to perform a mapping of one set of continuous inputs to another set of continuous outputs:
f: x -> ý
Changing the architecture of a neural network to make it perform a regression task is usually fairly simple. Instead of mapping the continuous input data to a specific class as it is done using the Softmax function as in your case, you have to make the network use only a single output node.
This node will just sum the outputs of the the previous layer (last hidden layer) and multiply the summed activations by 1. During the training process this output ý will be compared to the correct ground-truth value y that comes with your dataset. As a loss function you may use the Root-means-squared-error (RMSE).
Training such a network will result in a model that maps an arbitrary number of independent variables x to a dependent variable ý, which basically is a regression task.
To come back to your Matlab implementation, it would be incorrect to change the current Softmax output layer to be an activation function such as a Sigmoid or ReLU. Instead your would have to implement a custom RMSE output layer for your network, which is fed with the sum of activations coming from the last hidden layer of your network.
I'm using Lasagne to create a CNN for the MNIST dataset. I'm following closely to this example: Convolutional Neural Networks and Feature Extraction with Python.
The CNN architecture I have at the moment, which doesn't include any dropout layers, is:
NeuralNet(
layers=[('input', layers.InputLayer), # Input Layer
('conv2d1', layers.Conv2DLayer), # Convolutional Layer
('maxpool1', layers.MaxPool2DLayer), # 2D Max Pooling Layer
('conv2d2', layers.Conv2DLayer), # Convolutional Layer
('maxpool2', layers.MaxPool2DLayer), # 2D Max Pooling Layer
('dense', layers.DenseLayer), # Fully connected layer
('output', layers.DenseLayer), # Output Layer
],
# input layer
input_shape=(None, 1, 28, 28),
# layer conv2d1
conv2d1_num_filters=32,
conv2d1_filter_size=(5, 5),
conv2d1_nonlinearity=lasagne.nonlinearities.rectify,
# layer maxpool1
maxpool1_pool_size=(2, 2),
# layer conv2d2
conv2d2_num_filters=32,
conv2d2_filter_size=(3, 3),
conv2d2_nonlinearity=lasagne.nonlinearities.rectify,
# layer maxpool2
maxpool2_pool_size=(2, 2),
# Fully Connected Layer
dense_num_units=256,
dense_nonlinearity=lasagne.nonlinearities.rectify,
# output Layer
output_nonlinearity=lasagne.nonlinearities.softmax,
output_num_units=10,
# optimization method params
update= momentum,
update_learning_rate=0.01,
update_momentum=0.9,
max_epochs=10,
verbose=1,
)
This outputs the following Layer Information:
# name size
--- -------- --------
0 input 1x28x28
1 conv2d1 32x24x24
2 maxpool1 32x12x12
3 conv2d2 32x10x10
4 maxpool2 32x5x5
5 dense 256
6 output 10
and outputs the number of learnable parameters as 217,706
I'm wondering how this number is calculated? I've read a number of resources, including this StackOverflow's question, but none clearly generalizes the calculation.
If possible, can the calculation of the learnable parameters per layer be generalised?
For example, convolutional layer: number of filters x filter width x filter height.
Let's first look at how the number of learnable parameters is calculated for each individual type of layer you have, and then calculate the number of parameters in your example.
Input layer: All the input layer does is read the input image, so there are no parameters you could learn here.
Convolutional layers: Consider a convolutional layer which takes l feature maps at the input, and has k feature maps as output. The filter size is n x m. For example, this will look like this:
Here, the input has l=32 feature maps as input, k=64 feature maps as output, and the filter size is n=3 x m=3. It is important to understand, that we don't simply have a 3x3 filter, but actually a 3x3x32 filter, as our input has 32 dimensions. And we learn 64 different 3x3x32 filters.
Thus, the total number of weights is n*m*k*l.
Then, there is also a bias term for each feature map, so we have a total number of parameters of (n*m*l+1)*k.
Pooling layers: The pooling layers e.g. do the following: "replace a 2x2 neighborhood by its maximum value". So there is no parameter you could learn in a pooling layer.
Fully-connected layers: In a fully-connected layer, all input units have a separate weight to each output unit. For n inputs and m outputs, the number of weights is n*m. Additionally, you have a bias for each output node, so you are at (n+1)*m parameters.
Output layer: The output layer is a normal fully-connected layer, so (n+1)*m parameters, where n is the number of inputs and m is the number of outputs.
The final difficulty is the first fully-connected layer: we do not know the dimensionality of the input to that layer, as it is a convolutional layer. To calculate it, we have to start with the size of the input image, and calculate the size of each convolutional layer. In your case, Lasagne already calculates this for you and reports the sizes - which makes it easy for us. If you have to calculate the size of each layer yourself, it's a bit more complicated:
In the simplest case (like your example), the size of the output of a convolutional layer is input_size - (filter_size - 1), in your case: 28 - 4 = 24. This is due to the nature of the convolution: we use e.g. a 5x5 neighborhood to calculate a point - but the two outermost rows and columns don't have a 5x5 neighborhood, so we can't calculate any output for those points. This is why our output is 2*2=4 rows/columns smaller than the input.
If one doesn't want the output to be smaller than the input, one can zero-pad the image (with the pad parameter of the convolutional layer in Lasagne). E.g. if you add 2 rows/cols of zeros around the image, the output size will be (28+4)-4=28. So in case of padding, the output size is input_size + 2*padding - (filter_size -1).
If you explicitly want to downsample your image during the convolution, you can define a stride, e.g. stride=2, which means that you move the filter in steps of 2 pixels. Then, the expression becomes ((input_size + 2*padding - filter_size)/stride) +1.
In your case, the full calculations are:
# name size parameters
--- -------- ------------------------- ------------------------
0 input 1x28x28 0
1 conv2d1 (28-(5-1))=24 -> 32x24x24 (5*5*1+1)*32 = 832
2 maxpool1 32x12x12 0
3 conv2d2 (12-(3-1))=10 -> 32x10x10 (3*3*32+1)*32 = 9'248
4 maxpool2 32x5x5 0
5 dense 256 (32*5*5+1)*256 = 205'056
6 output 10 (256+1)*10 = 2'570
So in your network, you have a total of 832 + 9'248 + 205'056 + 2'570 = 217'706 learnable parameters, which is exactly what Lasagne reports.
building on top of #hbaderts's excellent reply, just came up with some formula for a I-C-P-C-P-H-O network (since i was working on a similar problem), sharing it in the figure below, may be helpful.
Also, (1) convolution layer with 2x2 stride and (2) convolution layer 1x1 stride + (max/avg) pooling with 2x2 stride, each contributes same numbers of parameters with 'same' padding, as can be seen below:
convolutional layers size is calculated=((n+2p-k)/s)+1
Here,
n is input p is padding k is kernel or filter s is stride
here in the above case
n=28 p=0 k=5 s=1
In multinomial classification, I'm using soft-max activation function for all non-linear units and ANN has 'k' number of output nodes for 'k' number of classes. Each of the 'k' output nodes present in output layer is connected to all the weights in preceding layer, kind of like the one shown below.
So, if the first output node intends to pull the weights in it's favor, it will change all the weights that precede this layer and the other output nodes will also pull which usually contradicts to the direction in which the first one was pulling. It seems more like a tug of war with single set of weights. So, do we need a separate set of weights(,which includes weights for every node of every layer) for each of the output classes or is there a different form of architecture present? Please, correct me if I'm wrong.
Each node has its set of weights. Implementations and formulas usually use matrix multiplications, which can make you forget the fact that, conceptually, each node has its own set of weights, but they do.
Each node returns a single value that gets sent to every node in the next layer. So a node on layer h receives num(h - 1) inputs, where num(h - 1) is the number of nodes in layer h - 1. Let these inputs be x1, x2, ..., xk. Then the neuron returns:
x1*w1 + x2*w2 + ... + xk*wk
Or a function of this. So each neuron maintains its own set of weights.
Let's consider the network in your image. Assume that we have some training instance for which the topmost neuron should output 1 and the others 0.
So our target is:
y = [1 0 0 0]
And our actual output is (ignoring the softmax for simplicity):
y^ = [0.88 0.12 0.04 0.5]
So it's already doing pretty well, but we must still do backpropagation to make it even better.
Now, our output delta is:
y^ - y = [-0.12 0.12 0.04 0.5]
You will update the weights of the topmost neuron using the delta -0.12, of the second neuron using 0.12 and so on.
Notice that each output neuron's weights get updated using these values: these weights will all increase or decrease in order to approach the correct values (0 or 1).
Now, notice that each output neuron's output depends on the outputs of hidden neurons. So you must also update those. Those will get updated using each output neuron's delta (see page 7 here for the update formulas). This is like applying the chain rule when taking derivatives.
You're right that, for a given hidden neuron, there is a "tug of war" going on, with each output neuron's errors pulling their own way. But this is normal, because the hidden layer must learn to satisfy all output neurons. This is a reason for initializing the weights randomly and for using multiple hidden neurons.
It is the output layer that adapts to give the final answers, which it can do since the weights of the output nodes are independent of each other. The hidden layer has to be influenced by all output nodes, and it must learn to accommodate them all.
I'm trying to implement gradient checking for a simple feedforward neural network with 2 unit input layer, 2 unit hidden layer and 1 unit output layer. What I do is the following:
Take each weight w of the network weights between all layers and perform forward propagation using w + EPSILON and then w - EPSILON.
Compute the numerical gradient using the results of the two feedforward propagations.
What I don't understand is how exactly to perform the backpropagation. Normally, I compare the output of the network to the target data (in case of classification) and then backpropagate the error derivative across the network. However, I think in this case some other value have to be backpropagated, since in the results of the numerical gradient computation are not dependent of the target data (but only of the input), while the error backpropagation depends on the target data. So, what is the value that should be used in the backpropagation part of gradient check?
Backpropagation is performed after computing the gradients analytically and then using those formulas while training. A neural network is essentially a multivariate function, where the coefficients or the parameters of the functions needs to be found or trained.
The definition of a gradient with respect to a specific variable is the rate of change of the function value. Therefore, as you mentioned, and from the definition of the first derivative we can approximate the gradient of a function, including a neural network.
To check if your analytical gradient for your neural network is correct or not, it is good to check it using the numerical method.
For each weight layer w_l from all layers W = [w_0, w_1, ..., w_l, ..., w_k]
For i in 0 to number of rows in w_l
For j in 0 to number of columns in w_l
w_l_minus = w_l; # Copy all the weights
w_l_minus[i,j] = w_l_minus[i,j] - eps; # Change only this parameter
w_l_plus = w_l; # Copy all the weights
w_l_plus[i,j] = w_l_plus[i,j] + eps; # Change only this parameter
cost_minus = cost of neural net by replacing w_l by w_l_minus
cost_plus = cost of neural net by replacing w_l by w_l_plus
w_l_grad[i,j] = (cost_plus - cost_minus)/(2*eps)
This process changes only one parameter at a time and computes the numerical gradient. In this case I have used the (f(x+h) - f(x-h))/2h, which seems to work better for me.
Note that, you mentiond: "since in the results of the numerical gradient computation are not dependent of the target data", this is not true. As when you find the cost_minus and cost_plus above, the cost is being computed on the basis of
The weights
The target classes
Therefore, the process of backpropagation should be independent of the gradient checking. Compute the numerical gradients before backpropagation update. Compute the gradients using backpropagation in one epoch (using something similar to above). Then compare each gradient component of the vectors/matrices and check if they are close enough.
Whether you want to do some classification or have your network calculate a certain numerical function, you always have some target data. For example, let's say you wanted to train a network to calculate the function f(a, b) = a + b. In that case, this is the input and target data you want to train your network on:
a b Target
1 1 2
3 4 7
21 0 21
5 2 7
...
Just as with "normal" classification problems, the more input-target pairs, the better.
I need to classify a dataset using Matlab MLP and show classification.
The dataset looks like
Click to view
What I have done so far is:
I have create an neural network contains a hidden layer (two neurons
?? maybe someone could give me some suggestions on how many
neurons are suitable for my example) and a output layer (one
neuron).
I have used several different learning methods such as Delta bar
Delta, backpropagation (both of these methods are used with or -out
momentum and Levenberg-Marquardt.)
This is the code I used in Matlab(Levenberg-Marquardt example)
net = newff(minmax(Input),[2 1],{'logsig' 'logsig'},'trainlm');
net.trainParam.epochs = 10000;
net.trainParam.goal = 0;
net.trainParam.lr = 0.1;
[net tr outputs] = train(net,Input,Target);
The following shows hidden neuron classification boundaries generated by Matlab on the data, I am little bit confused, beacause network should produce nonlinear result, but the result below seems that two boundary lines are linear..
Click to view
The code for generating above plot is:
figure(1)
plotpv(Input,Target);
hold on
plotpc(net.IW{1},net.b{1});
hold off
I also need to plot the output function of the output neuron, but I am stucking on this step. Can anyone give me some suggestions?
Thanks in advance.
Regarding the number of neurons in the hidden layer, for such an small example two are more than enough. The only way to know for sure the optimum is to test with different numbers. In this faq you can find a rule of thumb that may be useful: http://www.faqs.org/faqs/ai-faq/neural-nets/
For the output function, it is often useful to divide it in two steps:
First, given the input vector x, the output of the neurons in the hidden layer is y = f(x) = x^T w + b where w is the weight matrix from the input neurons to the hidden layer and b is the bias vector.
Second, you will have to apply the activation function g of the network to the resulting vector of the previous step z = g(y)
Finally, the output is the dot product h(z) = z . v + n, where v is the weight vector from the hidden layer to the output neuron and n the bias. In the case of more than one output neurons, you will repeat this for each one.
I've never used the matlab mlp functions, so I don't know how to get the weights in this case, but I'm sure the network stores them somewhere. Edit: Searching the documentation I found the properties:
net.IW numLayers-by-numInputs cell array of input weight values
net.LW numLayers-by-numLayers cell array of layer weight values
net.b numLayers-by-1 cell array of bias values