Adding softmax significantly changes weight updates - neural-network

I have a neural network of the form N = W1 * Tanh(W2 * I), where I is the input vector/matrix. When I learn these weights, the output has a certain form. However, when I add a normalization layer, for example N' = Softmax( W1 * Tanh(W2 * I) ), a single element of the output vector of N' is close to 1 while the rest are almost zero. This happens not only with Softmax() but with any normalizing layer. Is there a standard solution to this problem?

That is the expected behavior of the softmax function: it exponentiates its inputs and normalizes by the sum, so the largest element is pushed toward 1 and the rest toward 0. Perhaps what you need is a sigmoid function applied element-wise.
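A quick numeric illustration (a minimal sketch with made-up logits): softmax couples all outputs through a shared normalizer, so the largest logit dominates and the rest collapse toward zero, while an element-wise sigmoid squashes each output independently:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # subtract max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([4.0, 1.0, 0.5, -2.0])   # hypothetical pre-normalization outputs
print(softmax(logits))   # ~[0.92, 0.05, 0.03, 0.00] -- nearly one-hot
print(sigmoid(logits))   # ~[0.98, 0.73, 0.62, 0.12] -- each element squashed independently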

Related

Why does modifying the weights of a recurrent neural network in MATLAB not change the output when predicting on the same data?

I consider the following recurrent neural network (RNN):
[Figure: RNN under consideration]
where x is the input (a vector of reals), h is the hidden state vector, and y is the output vector. I trained the network in MATLAB on some data x and obtained W, V, and U.
However, after changing the matrix W to W' in MATLAB, and keeping U and V the same, the output y of the RNN that uses W is identical to the output y' of the RNN that uses W' when both predict on the same data x. Judging from the equation above, those two outputs should be different, but in MATLAB they are not (when I modify V or U, the outputs do change). How can I fix the code so that the outputs y and y' differ as they should?
The relevant code is shown below:
[x,t] = simplefit_dataset; % x: input data ; t: targets
net = newelm(x,t,5); % Recurrent neural net with 1 hidden layer (5 nodes) and 1 output layer (1 node)
net.layers{1}.transferFcn = 'tansig'; % 'tansig': equivalent to tanh and also is the activation function used for hidden layer
net.biasConnect = [0;0]; % biases set to zero for easier experimenting
net.derivFcn ='defaultderiv'; % defaultderiv: tells Matlab to pick whatever derivative scheme works best for this net
view(net) % displays the network topology
net = train(net,x,t); % trains the network
W = net.LW{1,1}; U = net.IW{1,1}; V = net.LW{2,1}; % network matrices
Y = net(x); % Y: output when predicting on data x using W
net.LW{1,1} = rand(5,5); % This is the modified matrix W, W'
Y_prime = net(x); % Y_prime: output when predicting on data x using W'
max(abs(Y-Y_prime)) % The difference between the two outputs is 0 when it probably shouldn't be.
This is the recursion in your first layer (from the docs):
"The weight matrix for the weight going to the ith layer from the jth layer (or a null matrix []) is located at net.LW{i,j} if net.layerConnect(i,j) is 1 (or 0)."
So net.LW{1,1} holds the weights to the first layer from the first layer (i.e. the recursion), whereas net.LW{2,1} stores the weights to the second layer from the first layer. Now, what does it mean that one can change the recursion weights randomly without any effect? In fact, you can even set them to zero, net.LW{1,1} = zeros(size(W));, without any effect. Note that this is essentially the same as dropping the recursion and creating a simple feed-forward network:
Hypothesis: The recursion has no effect.
You will note that if you change the weights to the second layer (1 neuron) from the first layer (5 neurons), e.g. net.LW{2,1} = zeros(size(V));, it will affect your prediction (the same is of course true if you change the input weights net.IW).
Why does the recursion have no effect?
Well, that beats me. I have no idea where this particular glitch comes from or what the theory behind the newelm network is.
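For contrast, here is a minimal NumPy sketch (my own code, not MATLAB and not newelm's internals) of an Elman-style RNN; in a straightforward implementation, changing the recurrent matrix W clearly changes the output:

import numpy as np

rng = np.random.default_rng(0)

# hypothetical sizes: 1 input, 5 hidden nodes, 1 output (as in the question)
U = rng.normal(size=(5, 1))   # input -> hidden weights
W = rng.normal(size=(5, 5))   # hidden -> hidden (recurrent) weights
V = rng.normal(size=(1, 5))   # hidden -> output weights

def run(x_seq, W):
    # Elman RNN: h_t = tanh(W h_{t-1} + U x_t), y_t = V h_t
    h = np.zeros((5, 1))
    ys = []
    for x in x_seq:
        h = np.tanh(W @ h + U * x)
        ys.append((V @ h).item())
    return np.array(ys)

x_seq = rng.normal(size=10)
Y = run(x_seq, W)
Y_prime = run(x_seq, rng.normal(size=(5, 5)))  # modified recurrent weights W'
print(np.max(np.abs(Y - Y_prime)))             # clearly non-zero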

Assuming the order Conv2d->ReLU->BN, should the Conv2d layer have a bias parameter?

Should we include the bias parameter in Conv2d if we are going for Conv2d followed by ReLU followed by batch norm (bn)?
There is no need if we go for Conv2d followed by bn followed by ReLU, since the shift parameter of bn does the bias's job.
Yes, if the order is conv2d -> ReLU -> BatchNorm, then having a bias parameter in the convolution can help. To show that, let's assume that there is a bias in the convolution layer, and let's compare what happens with both of the orders you mention in the question. The idea is to see whether the bias is useful for each case.
Let's consider a single pixel from one of the convolution's output channels, and assume that x_1, ..., x_k are the corresponding inputs (in vectorised form) from the batch (batch size == k). We can write the convolution as
Wx + b # with W the convolution weights, b the bias
As you said in the question, when the order is conv2d -> BN -> ReLU, the bias is not useful because all it does to the distribution of the Wx_i is shift it by b, and this is cancelled out by the immediately following BN layer:
(Wx_i + b - (mu + b))/sigma = (Wx_i - mu)/sigma, i.e. no change.
However, if you use the other order, i.e.
BN(ReLU(Wx+b))
then ReLU will map some of the Wx_i + b to 0. As a consequence, the mean will look like this:
(1/k)(0 + ... + 0 + SUM_s (Wx_s + b)) = some_term + m*b/k
(where m is the number of terms that survive the ReLU), and the std will look like
const*((0 - some_term - m*b/k)^2 + ... + (Wx_i + b - some_term - m*b/k)^2 + ...)
and as you can see from expanding those terms that depend on a non-zero Wx_i + b:
(Wx_i + b - some_term - m*b/k)^2 = some_other_term + some_factor * b * Wx_i
which means that the result depends on b in a multiplicative manner. As a consequence, its effect cannot simply be reproduced by the shift component of the BN layer (denoted beta in most implementations and papers). That is why having a bias term is not useless when using this order.
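A small PyTorch sketch (illustrative, with made-up shapes) that checks this numerically: with conv -> BN -> ReLU, changing the conv bias leaves the output untouched, while with conv -> ReLU -> BN it does not:

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 3, 16, 16)          # hypothetical input batch
conv = nn.Conv2d(3, 4, 3, padding=1)   # has a bias parameter by default
bn = nn.BatchNorm2d(4).train()         # training mode: uses batch statistics

def output(order, bias_val):
    with torch.no_grad():
        conv.bias.fill_(bias_val)
        z = conv(x)
        if order == "conv-bn-relu":
            return torch.relu(bn(z))
        return bn(torch.relu(z))       # conv-relu-bn

for order in ("conv-bn-relu", "conv-relu-bn"):
    diff = (output(order, 0.0) - output(order, 5.0)).abs().max()
    print(order, float(diff))  # ~0 for conv-bn-relu, clearly non-zero for conv-relu-bn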

Is it possible to use different L1 / L2 regularization parameters for different sets of weights in chainer or pytorch?

(As an example) When implementing a simple linear model for noutput target values as a neural network in chainer:
l1 = L.Linear(ninput, noutput)
and in the model's __call__ method:
y = self.l1(x)
return y
Adding this hook will do L2 regularization on all weights, imposing the same alpha=0.01 everywhere:
optimizer.add_hook(optimizer.WeightDecay(rate=0.01))
Is it possible to use a different alpha for each set of weights leading from all ninput input units to one of the noutput output units?
Since we are working in pytorch, it is possible to add extra scalar terms to the loss function yourself. So assume the loss from your classifier is L (say, a cross-entropy loss) and you have a linear layer defined as:
l1 = nn.Linear(n_in, n_out)
Now, if you want a different regularization strength for each set of weights, all you have to do is gather the weights you want (i.e. select them by index) and add their norms to the final loss:
loss = L + sum(alpha[k] * norm(l1.weight[k]))
Here alpha holds the hyper-parameters and norm is usually the L2 norm; in pytorch it is just torch.norm(l1.weight[k]), where the index k selects the rows of the weight matrix you want. Finally, you don't need the global regularization hook as in the code above.
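A runnable sketch of this idea (the names alphas, n_in, n_out are my own, I use one coefficient per output unit, i.e. per row of the weight matrix, and the data loss here is MSE just for illustration):

import torch
import torch.nn as nn

torch.manual_seed(0)
n_in, n_out = 4, 3
l1 = nn.Linear(n_in, n_out)

# one L2 coefficient per output unit (per row of l1.weight)
alphas = torch.tensor([0.01, 0.10, 1.00])

x = torch.randn(16, n_in)
t = torch.randn(16, n_out)

data_loss = nn.functional.mse_loss(l1(x), t)

# per-row L2 penalty: ||l1.weight[k]||_2 weighted by its own alpha
reg = sum(alphas[k] * torch.norm(l1.weight[k]) for k in range(n_out))

loss = data_loss + reg
loss.backward()  # gradients now include the per-row regularization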

Connecting perceptrons with output of previous ones?

Thanks to the help I received and the research I did here, I was able to create a simple perceptron in C#, the code of which goes like:
int Input1 = A;
int Input2 = B;
//weighted sum
double WSum = A * W1 + B * W2 + Bias;
//get the sign: -1 for negative, +1 for positive
int Sign=Math.Sign(WSum);
double error = Desired - Sign;
//updating weights
W1 += error * Input1 * 0.1; //0.1 being a learning rate
W2 += error * Input2 * 0.1;
return Sign;
I do not use Sigmoid here and just get -1 or 1.
I would have two questions:
1) Is it correct that my weights get values like -5 etc.? When the input is e.g. 100, 50 the update goes like: W1 += error*100*0.1
2) I want to proceed deeper and create more connected neurons - I guess I would need at least two to provide inputs to a third. Is it correct that the third will be fed only values in -1..1? I am aiming at simple pattern recognition, but so far I do not understand how it should work.
It is perfectly valid for the values of your weights to range from -Infinity to +Infinity. You should always use real numbers instead of integers (so, as mentioned above, double will work; 32-bit float precision is perfectly sufficient for neural networks).
Moreover, you should decay your learning rate with every learning step, e.g. reduce it by a factor of 0.99 after each update. Otherwise, your algorithm will oscillate when approaching an optimum.
If you want to go "deeper", you will need to implement a Multilayer Perceptron (MLP). There is a proof that a multi-layer network of purely linear neurons is always equivalent to a network with only one layer, and there was no effective way to train multi-layer threshold networks; this is why, several decades ago, the research community temporarily abandoned the idea of artificial neural networks. In 1986, the backpropagation algorithm was popularized (by Rumelhart, Hinton, and Williams), and with it you can train MLPs with multiple hidden layers.
To solve non-linear problems like XOR or other complex problems like pattern recognition, you need to apply a non-linear activation function. Have a look at the logistic sigmoid activation function for a start: f(x) = 1. / (1. + exp(-x)). When doing this, you should normalize your input as well as your output values to the range [0.0, 1.0]. This is especially important for the output neurons, since the output of the logistic sigmoid activation function is defined on exactly this range.
A simple Python implementation of feed-forward MLPs using arrays can be found in this answer.
Edit: You also need at least 1 hidden layer to solve e.g. XOR.
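To make the suggestions above concrete, here is a minimal NumPy sketch (my own variable names) of a 2-3-1 MLP with logistic sigmoid activations, trained with backpropagation and a decaying learning rate, on XOR:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(42)

# XOR data; inputs and targets are already in [0.0, 1.0]
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

# 2 inputs -> 3 hidden -> 1 output
W1 = rng.normal(size=(2, 3)); b1 = np.zeros(3)
W2 = rng.normal(size=(3, 1)); b2 = np.zeros(1)

lr = 1.0
for epoch in range(20000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # backward pass (squared-error loss; sigmoid derivative is s*(1-s))
    dy = (y - T) * y * (1 - y)
    dh = (dy @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ dy; b2 -= lr * dy.sum(axis=0)
    W1 -= lr * X.T @ dh; b1 -= lr * dh.sum(axis=0)
    lr *= 0.9999  # slowly decay the learning rate, as suggested above

print(y.round(3))  # should approach [[0], [1], [1], [0]]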
Try declaring your weights as double.
Also, I think it is much better to work with arrays, especially in neural networks; for a perceptron they are practically the only way.
And you will need some for or while loops to achieve what you want.

local inverse of a neural network

I have a neural network with N input nodes and N output nodes, and possibly multiple hidden layers and recurrences in it but let's forget about those first. The goal of the neural network is to learn an N-dimensional variable Y*, given N-dimensional value X. Let's say the output of the neural network is Y, which should be close to Y* after learning. My question is: is it possible to get the inverse of the neural network for the output Y*? That is, how do I get the value X* that would yield Y* when put in the neural network? (or something close to it)
A major part of the problem is that N is very large, typically in the order of 10000 or 100000, but if anyone knows how to solve this for small networks with no recurrences or hidden layers that might already be helpful. Thank you.
If you can choose the neural network such that the number of nodes in each layer is the same, each weight matrix is non-singular, and the transfer function is invertible (e.g. leaky ReLU), then the function will be invertible.
This kind of neural network is simply a composition of matrix multiplication, addition of a bias, and the transfer function. To invert, you just apply the inverse of each operation in the reverse order: take the output, apply the inverse transfer function, subtract the bias, and multiply by the inverse of the last weight matrix; then apply the inverse transfer function again, subtract the next bias, multiply by the inverse of the second-to-last weight matrix, and so on and so forth.
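A minimal NumPy sketch of this recipe (square layers, leaky-ReLU transfer function, randomly drawn weight matrices, which are almost surely non-singular; all names are my own):

import numpy as np

rng = np.random.default_rng(1)
n = 4                                              # same width in every layer
Ws = [rng.normal(size=(n, n)) for _ in range(3)]   # assumed non-singular
bs = [rng.normal(size=n) for _ in range(3)]

leaky     = lambda z: np.where(z > 0, z, 0.1 * z)
leaky_inv = lambda z: np.where(z > 0, z, z / 0.1)

def forward(x):
    for W, b in zip(Ws, bs):
        x = leaky(W @ x + b)
    return x

def inverse(y):
    # undo each layer in reverse order: inverse activation, subtract bias, solve W
    for W, b in zip(reversed(Ws), reversed(bs)):
        y = np.linalg.solve(W, leaky_inv(y) - b)
    return y

x = rng.normal(size=n)
print(np.allclose(inverse(forward(x)), x))  # True (up to floating-point error)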
This is a task that maybe can be solved with autoencoders. You also might be interested in generative models like Restricted Boltzmann Machines (RBMs) that can be stacked to form Deep Belief Networks (DBNs). RBMs build an internal model h of the data v that can be used to reconstruct v. In DBNs, h of the first layer will be v of the second layer and so on.
zenna is right.
If you are using bijective (invertible) activation functions, you can invert layer by layer: apply the inverse activation, subtract the bias, and take the pseudoinverse (if you have the same number of neurons in every layer, this is also the exact inverse, under some mild regularity conditions).
To repeat the conditions: dim(X) == dim(Y) == dim(layer_i), and det(W_i) != 0.
An example:
Y = tanh( W2*tanh( W1*X + b1 ) + b2 )
X = W1p*( tanh^-1( W2p*(tanh^-1(Y) - b2) ) -b1 ), where W2p and W1p represent the pseudoinverse matrices of W2 and W1 respectively.
The following paper is a case study in inverting a function learned by a neural network. It comes from industry and looks like a good starting point for understanding how to set up the problem.
An alternate way of approaching the task of finding an x that yields the desired y is to start with a random x (or a seed input) and then repeatedly adjust x through gradient descent until it yields a y close to the desired one. The algorithm is similar to backpropagation, the difference being that instead of computing derivatives with respect to the weights and biases, you compute derivatives with respect to x (and no mini-batching is needed). This approach has the advantage of allowing a seed input (a starting x, if not randomly selected). I also have a hypothesis that the final x will retain some similarity to the initial x (the seed), which would imply that this algorithm has the ability to transpose, depending on the context of the neural network application.
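A sketch of this approach in PyTorch (the network here is a stand-in for your trained model, and all names are placeholders; autograd computes the derivative of the loss with respect to x directly):

import torch
import torch.nn as nn

torch.manual_seed(0)

# stand-in for the trained network; its weights stay frozen during inversion
net = nn.Sequential(nn.Linear(8, 16), nn.Tanh(), nn.Linear(16, 8))
for p in net.parameters():
    p.requires_grad_(False)

y_star = torch.randn(8)                  # the desired output Y*
x = torch.randn(8, requires_grad=True)   # seed input (random, or user-supplied)

opt = torch.optim.Adam([x], lr=0.05)
for step in range(2000):
    opt.zero_grad()
    loss = ((net(x) - y_star) ** 2).mean()
    loss.backward()                      # gradient flows into x, not the weights
    opt.step()

print(float(loss))  # small if y_star is (close to) something the network can produce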