Neural Network bias - neural-network

I'm building a feed forward neural network, and I'm trying to decide how to implement the bias. I'm not sure about two things:
1) Is there any downside to implementing the bias as a trait of the node as opposed to a dummy input+weight?
2) If I implement it as a dummy input, would it be input just in the first layer (from the input to the hidden layer), or would I need a dummy input in every layer?
Thanks!
P.S. I'm currently using 2d arrays to represent weights between layers. Any ideas for other implementation structures? This isn't my main question, just looking for food for thought.

Implementation doesn't matter as long as the behaviour is right.
Yes, the bias is needed in every layer.
A 2D array is a fine way to go.
I'd suggest including the bias as another neuron with a constant input of 1. That makes it easier to implement - you don't need a special variable for it or anything like that.
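As a rough illustration of the dummy-input approach, here is a minimal NumPy sketch (the layer sizes and tanh activation are just assumptions for the example); note that the constant 1 is appended to the inputs of every layer, not only the first:
import numpy as np

def forward(x, weight_matrices):
    # each weight matrix has one extra row that acts as the bias weights
    a = x
    for W in weight_matrices:      # W has shape (n_inputs + 1, n_outputs)
        a = np.append(a, 1.0)      # dummy bias input with constant value 1
        a = np.tanh(a @ W)         # any activation works here
    return a

# example: 3 inputs -> 4 hidden units -> 2 outputs
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3 + 1, 4))   # the extra row holds the bias weights
W2 = rng.normal(size=(4 + 1, 2))
print(forward(np.array([0.5, -0.2, 0.1]), [W1, W2]))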

Related

create deep network in matlab with logsig layer instead of softmax layer

I want to create a deep classification net, but my classes aren't mutually exclusive (which is what a softmax layer assumes).
Is it possible to define a non-mutually-exclusive classification layer (i.e., a sample can belong to more than one class)?
One way to do it would be with a logsig function in the classification layer instead of a softmax, but I have no idea how to accomplish that.
In a CNN you can have multiple classes in the last layer, as you know. But if I understand correctly, you need the last layer to output a value in a range of numbers for each class instead of a hard 1 or 0. That means you need regression. If your labels support this task, that's fine, and you can do it with regression, just as happens in bounding-box regression for localization. Then you don't need a softmax in the last layer; just use another activation function that produces suitable outputs for your task.
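The question is about MATLAB, but as a language-agnostic illustration of the idea, here is a hedged Keras sketch (the input dimension and layer sizes are invented) of a multi-label head that uses sigmoid (logsig) outputs with a binary cross-entropy loss instead of a softmax:
from tensorflow import keras
from tensorflow.keras import layers

n_classes = 5                                      # assumption: 5 non-exclusive classes
model = keras.Sequential([
    keras.Input(shape=(64,)),                      # assumption: 64 input features
    layers.Dense(32, activation="relu"),
    # sigmoid gives an independent score in [0, 1] per class,
    # so a sample can belong to several classes at once
    layers.Dense(n_classes, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")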

Error function and ReLu in a CNN

I'm trying to get a better understanding of neural networks by programming a convolutional neural network myself.
So far, I'm keeping it pretty simple by not using max-pooling and only using a simple ReLU activation. I'm aware of the disadvantages of this setup, but the point isn't to make the best image detector in the world.
Now, I'm stuck on the details of the error calculation, how it is propagated back, and how it interplays with the activation function used when calculating the new weights.
I read this document (A Beginner's Guide To Understand CNN), but it doesn't help me understand much. The formula for calculating the error already confuses me.
This sum doesn't have defined start and end points, so I basically can't read it. Maybe you can simply provide me with the correct one?
After that, the author assumes a variable L that is just "that value" (I assume he means E_total?) and gives an example of how to define the new weights:
where W denotes the weights of a particular layer.
This confuses me, as I was always under the impression that the activation function (ReLU in my case) played a role in how the new weights are calculated. Also, this seems to imply that I simply use the same error for all layers. Doesn't the error value I propagate back into the next layer somehow depend on what I calculated in the previous one?
Maybe all of this is just incomplete, and you can point me in the direction that helps me best in my case.
Thanks in advance.
You do not backpropagate errors, but gradients. Whether the activation function plays a role in calculating a new weight depends on whether the weight in question sits before or after that activation, and whether they are connected. If a weight w comes after your non-linearity layer f, then the gradient dL/dw won't depend on f. But if w comes before f and they are connected, then dL/dw will depend on f. For example, suppose w is the weight vector of a fully connected layer, and assume that f directly follows this layer. Then,
dL/dw = (dL/df) * (df/dw)   // notation might change according to the shape
                            // of the tensors/matrices/vectors you chose, but
                            // this is just the chain rule
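As a concrete, hedged illustration of that chain rule, here is a small NumPy sketch (the shapes and the squared-error cost are invented) of one fully connected layer followed by a ReLU, showing where the activation's derivative enters the weight gradient:
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))        # batch of 4 inputs with 3 features
W = rng.normal(size=(3, 2))        # weights of the fully connected layer
y = rng.normal(size=(4, 2))        # continuous (regression-style) targets

# forward pass: dense layer, then the non-linearity f = ReLU
z = x @ W                          # pre-activation
a = np.maximum(z, 0.0)             # f(z)
L = 0.5 * np.sum((a - y) ** 2)     # simple squared-error cost

# backward pass: because W sits *before* f, the ReLU derivative
# (1 where z > 0, else 0) appears in dL/dW
dL_da = a - y
dL_dz = dL_da * (z > 0.0)
dL_dW = x.T @ dL_dz
print(dL_dW.shape)                 # (3, 2), same shape as W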
As for your cost function, it is correct. Many people write these formulas in this informal style so that you get the idea and can adapt them to your own tensor shapes. By the way, this sort of MSE function is better suited to continuous label spaces; you might want to use a softmax or an SVM loss for image classification (I'll come back to that). Anyway, since you requested a correct form for this function, here is an example. Imagine you have a neural network that predicts a vector field of some kind (like surface normals). Assume that it takes a 2D pixel x_i and predicts a 3D vector v_i for that pixel. Now, in your training data, x_i will already have a ground-truth 3D vector (i.e. a label), which we'll call y_i. Then your cost function will be (the index i runs over all data samples):
sum_i { (y_i - v_i)^T (y_i - v_i) } = sum_i { ||y_i - v_i||^2 }
But as I said, this cost function works if the labels form a continuous space (here, R^3). This is also called a regression problem.
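For instance, a direct NumPy transcription of that sum, with made-up predictions and labels, could look like:
import numpy as np

y = np.array([[0.0, 1.0, 0.0], [0.5, 0.5, 0.0]])   # ground-truth 3D vectors y_i
v = np.array([[0.1, 0.9, 0.0], [0.4, 0.6, 0.1]])   # predicted 3D vectors v_i
cost = np.sum(np.sum((y - v) ** 2, axis=1))        # sum_i ||y_i - v_i||^2
print(cost)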
Here's an example if you are interested in (image) classification. I'll explain it with a softmax loss; the intuition for other losses is more or less similar. Assume we have n classes, and imagine that in your training set, for each data point x_i, you have a label c_i that indicates the correct class. Now, your neural network should produce a score for each possible label, which we'll denote s_1, ..., s_n. Let's denote the score of the correct class of a training sample x_i as s_{c_i}. If we use a softmax function, the intuition is to transform the scores into a probability distribution and maximise the probability of the correct classes. That is, we maximise the function
sum_i { exp(s_{c_i}) / sum_j(exp(s_j))}
where i runs over all training samples and j = 1, ..., n runs over all class labels.
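As a small, hedged NumPy illustration of that softmax objective, with invented scores for three classes of a single sample:
import numpy as np

scores = np.array([2.0, 0.5, -1.0])               # s_1, s_2, s_3
correct_class = 0                                 # c_i: index of the true class
probs = np.exp(scores) / np.sum(np.exp(scores))   # softmax probability distribution
print(probs[correct_class])                       # the term we want to maximise
# in practice one usually minimises -log(probs[correct_class]) summed over samples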
Finally, I don't think the guide you are reading is a good starting point. I recommend this excellent course instead (essentially the Andrej Karpathy parts, at least).

How to add a custom layer and loss function into a pretrained CNN model by matconvnet?

I'm new to matconvnet. I'd like to try a new loss function instead of the existing one in a pretrained model, e.g. VGG-16, which usually uses a softmax loss layer. What's more, I want to use a new feature-extractor layer instead of a pooling or max layer. I know there are two CNN wrappers in matconvnet, SimpleNN and DagNN; since I'm using VGG-16, a linear model with a linear sequence of building blocks, I'm working with the SimpleNN wrapper. How do I create a custom layer in detail, especially the procedure and the relevant concepts? For example, do I need to remove the layers behind the new feature-extractor layer, or can I just leave them? I know how to compute the derivative of the loss function, so the details of the computation inside the layer are not that important in this question; I just want to know the procedure, represented in code. Could someone help me? I'd appreciate it a lot!
You can remove the old error or objective layer; in the SimpleNN wrapper the layers live in the net.layers cell array, so dropping the final (loss) layer looks like
net.layers(end) = [];
and you can then add your new error code in the vl_nnloss() file.

Can I use next layer's output as current layer's input by Keras?

In text generation tasks, we usually use the model's last output as the current input to generate the next word. More generally, I want to build a neural network that treats the next layer's final hidden state as the current layer's input, like the following (what confuses me is the decoder part):
But I have read the Keras documentation and haven't found any function to achieve it.
Can I achieve this structure with Keras? How?
What you are asking about is an autoencoder; you can find similar structures in Keras.
But there are certain details that you should figure out on your own, including the padding strategy and the preprocessing of your input and output data. Your input cannot have a dynamic size, so you need a fixed length for inputs and outputs. I don't know what you mean by the arrows that join in one circle, but I guess you can take a look at the Merge layers in Keras (basically adding, concatenating, etc.).
You probably need four Sequential models and one final model that represents the combined structure.
One more thing: the decoder setup of the LSTM (the language model) is not dynamic in design. In your model definition, you basically introduce fixed inputs and outputs for it, and you prepare the training data accordingly, so you don't need anything dynamic. Then, at test time, you can predict each decoded word in a loop: run the model once to predict the next output step, then run it again for the next time step, and so on.
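As a rough sketch of that test-time loop, here is a minimal, hypothetical Keras example (the vocabulary size, sequence length, start token, and layer sizes are all assumptions):
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, seq_len = 1000, 20
model = keras.Sequential([
    layers.Embedding(vocab_size, 64),
    layers.LSTM(128),
    layers.Dense(vocab_size, activation="softmax"),   # next-word distribution
])

# greedy decoding: feed each predicted word back in as the next input
tokens = [1]                                          # assumed start-of-sequence id
for _ in range(10):
    padded = np.zeros((1, seq_len), dtype="int32")    # left-pad to the fixed length
    padded[0, -len(tokens):] = tokens[-seq_len:]
    probs = model.predict(padded, verbose=0)[0]
    tokens.append(int(np.argmax(probs)))
print(tokens)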
The structure you have shown is a custom one, so Keras doesn't provide any class or wrapper to build it directly. But yes, you can build this kind of structure in Keras.
It looks like you need an LSTM model running in the backward direction. I didn't fully understand the other part, which probably amounts to incorporating the previous sentence embedding as input to the next time step of the LSTM unit.
I would rather encourage you to work with simple language modelling with an LSTM first. Then you can tweak it later to build the architecture depicted in the figure.
Example:
Text generation with LSTM in Keras

Theanets: Removing individual connections

How do you remove connections in Theanets? I'd like to create custom connectivity between an input layer, a single hidden layer, and an output layer. But the only defaults are feedforward all-to-all architectures or recurrent architectures. I'd like to remove specific connections from the all-to-all connectivity and then train the network.
Thanks in advance.
(Developer of theanets here.)
This is currently not directly possible with theanets. For computational efficiency the underlying computations in feedforward networks are implemented as simple matrix operations, which are fast and can be executed on a GPU for sometimes dramatic speedups.
You can, however, initialize the weights in a layer so that some (or many) of the weights are zero. To do this, just pass a dictionary in the layers list, and include a sparsity key:
import theanets

net = theanets.Autoencoder(
    layers=(784, dict(size=1000, sparsity=0.9), 784))
This initializes the weights for the layer so that the given fraction of weights are zeros. The weights are, however, eligible for change during the training process, so this is only an initialization trick.
Alternatively, you can implement a custom Layer subclass that does whatever you like, as long as you stay within the Theano boundaries. You could, for instance, implement a type of feedforward layer that uses a mask to ensure that some weights remain zero during the feedforward computation.
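A rough sketch of that masking idea in plain Theano (this is not the actual theanets Layer API, whose subclass hooks you would need to look up; the sizes and mask here are invented for illustration):
import numpy as np
import theano
import theano.tensor as T

n_in, n_out = 4, 3
mask_values = np.ones((n_in, n_out), dtype=theano.config.floatX)
mask_values[0, 2] = 0.0                              # connection to remove

W = theano.shared(np.random.randn(n_in, n_out).astype(theano.config.floatX))
mask = theano.shared(mask_values)                    # fixed, never trained

x = T.matrix('x')
# the masked connections contribute nothing to the output, and their
# gradients are zero as well, so training never re-activates them
y = T.nnet.sigmoid(T.dot(x, W * mask))
f = theano.function([x], y)
print(f(np.ones((2, n_in), dtype=theano.config.floatX)).shape)   # (2, 3)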
For more details you might want to ask on the theanets mailing list.