Assume that we have a Convolutional Neural Network trained to classify (w.l.o.g. grayscale) images in TensorFlow.
Given the trained net and a test image, one can trace which of its pixels are salient or, equivalently, which pixels are most responsible for the image's output classification. A nice explanation, with implementation details in Theano, is given in this article.
Assume that for the first layer of convolutions, the one directly linked to the input image, we have the gradient of the classification function w.r.t. the parameters of every convolutional kernel.
How can one propagate the gradient back to the input layer, so as to compute a partial derivative for every pixel of the image?
Propagating and accumulating the gradient backwards would give us the salient pixels (those with a large-magnitude derivative).
To find the gradient w.r.t. the kernels of the first layer, so far I have:
Replaced the usual loss operator with the output layer operator.
Used the compute_gradients function.
All in all, it looks like:
opt = tf.train.GradientDescentOptimizer(1)
grads = opt.compute_gradients(output)
grad_var = [grad for grad, var in grads]
g1 = sess.run(grad_var[0])
Where, the "output" is the max of the output layer of the NN.
And g1, is a (k, k, 1, M) tensor, since I used M: k x k convolutional kernels on the first layer.
Now, I need to find the correct way to propagate g1 on every input pixel, as to compute their derivative wrt. the output.
To compute the gradients, you don't need to use an optimizer; you can directly use tf.gradients.
With this function, you can compute the gradient of output with respect to the image input directly, whereas the optimizer's compute_gradients method can only compute gradients with respect to Variables.
The other advantage of tf.gradients is that you can specify the gradients of the output you want to backpropagate.
So here is how to get the gradients of an input image with respect to output[1, 1]:
we have to set the output gradients to 0 everywhere except at index [1, 1]:
import numpy as np
import tensorflow as tf

input = tf.ones([1, 4, 4, 1])
filter = tf.ones([3, 3, 1, 1])
output = tf.nn.conv2d(input, filter, [1, 1, 1, 1], 'SAME')

grad_output = np.zeros((1, 4, 4, 1), dtype=np.float32)
grad_output[0, 1, 1, 0] = 1.
grads = tf.gradients(output, input, grad_output)

sess = tf.Session()
print(sess.run(grads[0]).reshape((4, 4)))
# prints [[ 1. 1. 1. 0.]
# [ 1. 1. 1. 0.]
# [ 1. 1. 1. 0.]
# [ 0. 0. 0. 0.]]
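For the original saliency question, the same pattern applies with the network's top-class score in place of output[1, 1]. A minimal self-contained sketch (the tiny conv/dense net below is a hypothetical stand-in for the trained model):

import numpy as np
import tensorflow as tf

image = tf.placeholder(tf.float32, [1, 8, 8, 1])   # grayscale input
conv = tf.layers.conv2d(image, filters=4, kernel_size=3, activation=tf.nn.relu)
logits = tf.layers.dense(tf.layers.flatten(conv), 10)

score = tf.reduce_max(logits, axis=1)        # score of the predicted class
saliency = tf.gradients(score, image)[0]     # d(score)/d(pixel), same shape as image

sess = tf.Session()
sess.run(tf.global_variables_initializer())
sal = sess.run(saliency, {image: np.random.rand(1, 8, 8, 1)})
print(np.abs(sal).reshape(8, 8))             # large magnitude = salient pixel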
I am randomly rotating 3D images in PyTorch using torch.rot90, but this rotates all the images in the batch in the same way. I would like to find a differentiable way to randomly rotate each image about a different axis.
Here is the code, which rotates every image in the batch the same way:
import torch

# x = next batch of 3D volumes, shape (N, C, D, H, W)
k = torch.randint(0, 4, (1,)).item()    # number of quarter turns
dims = [0, 0]
dims[0] = dims[1] = torch.randint(2, 5, (1,)).item()
while dims[0] == dims[1]:  # make sure the two rotation axes aren't the same
    dims[1] = torch.randint(2, 5, (1,)).item()
x = torch.rot90(x, k, dims)
# x is now a batch of 3D images that have all been rotated in the same random orientation
You could randomly split the batch into 3 subsets and apply a different pair of rotation axes to each; a sketch follows below.
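A minimal sketch of that idea, assuming cubic volumes of shape (N, C, D, H, W) with D == H == W so every rotated slice keeps the same shape:

import torch

def random_axis_rot90(x):
    # x: batch of 3D volumes, shape (N, C, D, H, W) with D == H == W
    subsets = torch.randperm(x.shape[0]).chunk(3)   # three random subsets
    axis_pairs = [(2, 3), (2, 4), (3, 4)]           # the three spatial axis pairs
    out = x.clone()
    for idx, dims in zip(subsets, axis_pairs):
        k = int(torch.randint(0, 4, (1,)))          # random number of quarter turns
        out[idx] = torch.rot90(x[idx], k, dims)
    return out

x = torch.rand(9, 1, 10, 10, 10)
print(random_axis_rot90(x).shape)  # torch.Size([9, 1, 10, 10, 10])

Since rot90 only permutes and flips indices, gradients still flow through it.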
Let me expand on iacob's answer. First, let me go over the parameters of the rot90 function. Other than the input tensor, it expects k and dims, where k is the number of rotations to perform and dims is a list or tuple containing the two dimensions in which the tensor is to be rotated. If a tensor is 4D, for example, dims could be [0, 3], (1, 2), [2, 3], etc. They have to be valid axes, and it has to contain exactly two numbers. You don't really need to create tensors for this parameter or for k. It is important to note that, depending on the given dims, the output shape can change drastically:
import torch

x = torch.rand(15, 3, 4, 6)
y1 = torch.rot90(x[0:5], 1, [1,3])
y2 = torch.rot90(x[5:10], 1, [1,2])
y3 = torch.rot90(x[10:15], 1, [2,3])
print(y1.shape) # torch.Size([5, 6, 4, 3])
print(y2.shape) # torch.Size([5, 4, 3, 6])
print(y3.shape) # torch.Size([5, 3, 6, 4])
Similar to iacob's answer, here we apply 3 different rotations to slices of the input. Note how the output dimensions all differ, due to the nature of rotations over different dimension pairs. You can't really join these results into one tensor, unless you have a very specific input size, for example Batch x 10 x 10 x 10, where rotating over any combination of the 1, 2, 3 axes will always return the same dimensions. You can, however, use each of these different-sized outputs separately as inputs to different modules, layers, etc.
I personally can't think of a use case for this kind of random-axis rotation. If you can elaborate on why you are trying to do this, I can try to give a better solution.
I'm using regression to predict an array of 0-1 values (an array of bits). The neural network specification is the following (MATLAB):
layers = [
    imageInputLayer([1 16 2],'Normalization','none')
    fullyConnectedLayer(512)
    batchNormalizationLayer
    reluLayer
    fullyConnectedLayer(64)
    batchNormalizationLayer
    reluLayer
    % sigmoidLayer
    tanhLayer
    regressionLayer
];
I've used the following code to implement Sigmoid Layer:
classdef sigmoidLayer < nnet.layer.Layer
    methods
        function layer = sigmoidLayer(name)
            % Set layer name
            if nargin == 1
                layer.Name = name;
            end
            % Set layer description
            layer.Description = 'sigmoidLayer';
        end
        function Z = predict(layer, X)
            % Forward input data through the layer and output the result
            Z = exp(X)./(exp(X)+1);
        end
        function dLdX = backward(layer, X, Z, dLdZ, memory)
            % Backward propagate the derivative of the loss function
            % through the layer
            dLdX = Z.*(1-Z) .* dLdZ;
        end
    end
end
The output is only 0 or 1. So why is sigmoid worse than tanh, rather than equal or better?
It depends on what you call "worse". Without more details it's hard to answer clearly.
However, one of the key differences is the function's derivative. As the magnitude of the gradient update depends on the derivative of the function, the update can become close to 0 (and the network can't learn anymore) when the derivative saturates.
The sigmoid saturates at 1 and 0: as x -> +/- inf, sigmoid(x) -> 1 or 0 and d(sigmoid)/dx -> 0, so depending on your data this might cause slower or "worse" learning. By contrast, although tanh does saturate as it approaches +/- 1, it does not saturate around 0 (in fact, its derivative is maximal there), so learning in this region is not problematic.
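A quick numerical illustration of the two derivatives (plain numpy, just to show the magnitudes):

import numpy as np

x = np.linspace(-4, 4, 9)
sig = 1 / (1 + np.exp(-x))
d_sig = sig * (1 - sig)        # peaks at 0.25 at x = 0, vanishes in the tails
d_tanh = 1 - np.tanh(x) ** 2   # peaks at 1.0 at x = 0, vanishes in the tails

print(np.round(d_sig, 3))
print(np.round(d_tanh, 3))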
You might also want to look into label smoothing.
I need to plot the separating line together with the graph below:
The code I used to train the MLP neural network is here:
circles = [1 1; 2 1; 2 2; 2 3; 2 4; 3 2; 3 3; 4 1; 4 2; 4 3];
crosses = [1 2; 1 3; 1 4; 2 5; 3 4; 3 5; 4 4; 5 1; 5 2; 5 3];
net = feedforwardnet(3);
net = train(net, circles, crosses);
plot(circles(:, 1), circles(:, 2), 'ro');
hold on
plot(crosses(:, 1), crosses(:, 2), 'b+');
hold off;
But I'd like to show the line separating the groups in the chart too. How do I proceed? Thanks in advance.
First off, you're not training your neural network properly. You'd have to use both circles and crosses as input samples into your neural network, and the output will have to be a two-neuron output, where [1 0] denotes that the classification should be the circles class and [0 1] denotes the crosses class.
In addition, each column is an input sample while each row is a feature. Therefore, you have to transpose both of these and make a larger input matrix. You'll also need to make your output labels in accordance with what we just talked about:
X = [circles.' crosses.'];
Y = [[ones(1, size(circles,1)); zeros(1, size(circles,1))] ...
     [zeros(1, size(crosses,1)); ones(1, size(crosses,1))]];
Now train your network:
net = feedforwardnet(3);
net = train(net, X, Y);
Now, if you want to figure out which class each point belongs to, you simply take the largest neuron output; whichever neuron gave you the largest response is the class the point belongs to.
Now, to answer your actual question, there's no direct way to show the "lines" of separation with Neural Networks if you use the MATLAB toolbox. However, you can show regions of separation and maybe throw in some transparency so that you can overlay this on top of the figure.
To do this, define a 2D grid of coordinates that span your two classes but with a finer grain... say... 0.01. Run this through the neural network, see what the maximum output neuron is, then mark this accordingly on your figure.
Something like this comes to mind:
%// Generate test data
[ptX,ptY] = meshgrid(1:0.01:5, 1:0.01:5);
Xtest = [ptX(:).'; ptY(:).'];
%// See what the output labels are
out = sim(net, Xtest);
[~,classes] = max(out,[],1);
%// Now plot the regions
figure;
hold on;
%// Plot the first class region
plot(Xtest(1, classes == 1), Xtest(2, classes == 1), 'y.');
%// Add transparency
alpha(0.1);
%// Plot the second class region
plot(Xtest(1, classes == 2), Xtest(2, classes == 2), 'g.');
%// Add transparency
alpha(0.1);
%// Now add the points
plot(circles(:, 1), circles(:, 2), 'ro');
plot(crosses(:, 1), crosses(:, 2), 'b+');
The first two lines of code generate a bunch of test (x,y) points and ensure that they're in a two-row input matrix, as that is what the network inputs require. I use meshgrid to generate these points. Next, we use sim to simulate, or put inputs into, the neural network. Once we do this, we will have two output neuron responses per input point, and we look at which output neuron gave the largest response. If the first output gave the largest response, we consider the input as belonging to the first class; otherwise, it's the second class. This is facilitated by using max and looking at each column independently, one column per input sample, to see which location gave the maximum.
Once we do this, we create a new figure and plot the points that belong to class 1 (the circles) in yellow and those of the second class (the crosses) in green. I throw in some transparency to make sure we can see the regions beneath the points. Afterwards, I plot the points as normal using your code.
With the above code, I get this figure:
As you can see, your model has some classification inaccuracies. Specifically, there are three crosses that would be misclassified as circles. You'll have to play around with the number of neurons in the hidden layer, and perhaps use a different activation function, but this is certainly enough to get you started.
Good luck!
I am trying to implement a neural network with ReLU.
input layer -> 1 hidden layer -> relu -> output layer -> softmax layer
Above is the architecture of my neural network.
I am confused about the backpropagation of this ReLU.
For the derivative of ReLU: if x <= 0, the output is 0; if x > 0, the output is 1.
So when you calculate the gradient, does that mean I kill gradient descent if x <= 0?
Can someone explain the backpropagation of my neural network architecture 'step by step'?
if x <= 0, output is 0. if x > 0, output is 1
The ReLU function is defined as: for x > 0 the output is x, i.e. f(x) = max(0,x).
So for the derivative f'(x) it's actually:
if x < 0, the output is 0; if x > 0, the output is 1.
The derivative f'(0) is not defined, so it's usually set to 0, or you modify the activation function to be f(x) = max(e,x) for a small e.
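In code, that convention looks like this (a numpy sketch, using the "set it to 0 at x = 0" choice):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 for x > 0, 0 for x < 0; x == 0 is set to 0 by convention
    return (x > 0).astype(float)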
Generally: a ReLU is a unit that uses the rectifier activation function. That means it works exactly like any other hidden layer, except that in place of tanh(x), sigmoid(x), or whatever activation you use, you'll use f(x) = max(0,x).
If you have written code for a working multilayer network with sigmoid activation, it's literally one line of change. Nothing about forward or backward propagation changes algorithmically. If you haven't got the simpler model working yet, go back and start with that first. Otherwise your question isn't really about ReLUs, but about implementing a NN as a whole.
If you have a layer made out of a single ReLU, as your architecture suggests, then yes, you kill the gradient at 0. During training, the ReLU will return 0 to your output layer, which will either return 0 or 0.5 if you're using logistic units, and the softmax will squash those. So a value of 0 under your current architecture doesn't make much sense for the forward propagation part either.
See for example this. What you can do instead is use a "leaky ReLU", which replaces the zero slope on the negative side with a small one, such as 0.01.
I would reconsider this architecture, however; it doesn't make much sense to me to feed a single ReLU into a bunch of other units and then apply a softmax.
Here is a good example: using ReLU to implement XOR.
Reference: http://pytorch.org/tutorials/beginner/pytorch_with_examples.html
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
# N is batch size(sample size); D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 4, 2, 30, 1
# Create random input and output data
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)
learning_rate = 0.002
loss_col = []
for t in range(200):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)  # using ReLU as the activation function
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()  # squared-error loss
    loss_col.append(loss)
    print(t, loss, y_pred)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)     # the last layer's error
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)  # the second layer's error
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0                    # the derivative of ReLU
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
plt.plot(loss_col)
plt.show()
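After the loop, a quick sanity check of the learned mapping, reusing the trained w1 and w2 from above (as the loss decreases, the predictions should move toward [[0], [1], [1], [0]]):

h_relu = np.maximum(x.dot(w1), 0)
print(h_relu.dot(w2).round(2))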
More about the derivative of ReLU can be found here: http://kawahara.ca/what-is-the-derivative-of-relu/
So when you calculate the gradient, does that mean I kill gradient descent if x <= 0?
Yes! If the weighted sum of the inputs and bias of the neuron (the activation function's input) is less than or equal to zero and the neuron uses the ReLU activation function, the value of the derivative is zero during backpropagation, and the input weights to this neuron are not updated.
Can someone explain the backpropagation of my neural network architecture 'step by step'?
A simple example can show one step of backpropagation. This example covers the complete process of one step, but you can also check only the part related to ReLU. It is similar to the architecture introduced in the question and uses one neuron in each layer for simplicity. The architecture is as follows:

x -> (w1, b1) -> z1 -> f -> a1 -> (w2, b2) -> z2 -> g -> h

f and g represent ReLU and sigmoid, respectively, and b represents bias.
Step 1:
First, the output is calculated:

z1 = w1*x + b1,  a1 = f(z1) = max(0, z1)
z2 = w2*a1 + b2,  h = a2 = g(z2) = 1/(1 + e^(-z2))

This merely represents the output calculation. "z" and "a" represent the sum of the inputs to the neuron and the output of the neuron's activation function, respectively.
So h is the estimated value. Suppose the real value is y.
Weights are now updated with backpropagation.
The new weight is obtained by calculating the gradient of the error function with respect to the weight, and subtracting this gradient from the previous weight, i.e.:

W_new = W_old - learning_rate * dE/dW

In backpropagation, the gradient of the last neuron(s) of the last layer is calculated first. The chain rule is used for the calculation; taking the squared error E = 0.5*(y - h)^2:

dE/dw2 = dE/dh * dh/dz2 * dz2/dw2 = (h - y) * g'(z2) * a1
The three general terms used above are:
The difference between the actual value and the estimated value: (h - y)
The output of the previous neuron: a1
And the derivative of the activation function; given that the activation function in the last layer is sigmoid, we have g'(z2) = g(z2)*(1 - g(z2)).
And the above expression does not necessarily become zero.
Now we go to the second layer. In the second layer we will have:

dE/dw1 = dE/dh * dh/dz2 * dz2/da1 * da1/dz1 * dz1/dw1 = (h - y) * g'(z2) * w2 * f'(z1) * x

It consists of 4 main terms:
The difference between the actual value and the estimated value: (h - y)
The output of the previous neuron, which here is the network input: x
The loss derivative of the connected neuron in the next layer: g'(z2) * w2 (with several connected neurons this would be a sum)
And the derivative of the activation function; since the activation function is ReLU we will have:
f'(z1) = 0 if z1 <= 0 (z1 is the input of the ReLU function), and then dE/dw1 = 0.
Otherwise f'(z1) = 1, and the gradient is not necessarily zero:

dE/dw1 = (h - y) * g'(z2) * w2 * x
To repeat: the weighted sum of the neuron's inputs must be less than or equal to zero to kill gradient descent.
The example given is a very simple one, intended only to illustrate the backpropagation process.
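Here is the same single step in numpy, with made-up numbers for x, y, the weights and the biases (all values are illustrative only):

import numpy as np

# Toy network: x -> f (ReLU) -> g (sigmoid) -> h, one neuron per layer
x, y, lr = 1.5, 1.0, 0.1
w1, b1, w2, b2 = 0.4, -0.2, 0.7, 0.1

z1 = w1 * x + b1                    # ReLU pre-activation
a1 = max(0.0, z1)                   # f(z1)
z2 = w2 * a1 + b2                   # sigmoid pre-activation
h = 1 / (1 + np.exp(-z2))           # g(z2), the estimate

dE_dh = h - y                       # from E = 0.5 * (y - h)**2
dh_dz2 = h * (1 - h)                # sigmoid derivative g'(z2)
dE_dw2 = dE_dh * dh_dz2 * a1

da1_dz1 = 1.0 if z1 > 0 else 0.0    # ReLU derivative: kills the gradient if z1 <= 0
dE_dw1 = dE_dh * dh_dz2 * w2 * da1_dz1 * x

w2 -= lr * dE_dw2                   # weight updates
w1 -= lr * dE_dw1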
Yes, the original ReLU function has the problem you describe.
So a change to the formula was later made, called the leaky ReLU.
In essence, leaky ReLU tilts the horizontal part of the function slightly, by a very small amount. For more information, watch this:
An explanation of activation methods, and an improved ReLU, on YouTube
Additionally, here you can find an implementation in the Caffe framework: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/relu_layer.cpp
The negative_slope parameter specifies whether to "leak" the negative part by multiplying it by the slope value rather than setting it to 0. Of course, you should set this parameter to zero to get the classical version.
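As a rough numpy sketch of the same idea (the 0.01 here is just a common default slope, not taken from the Caffe source):

import numpy as np

def leaky_relu(x, negative_slope=0.01):
    # negative_slope = 0 recovers the classical ReLU
    return np.where(x > 0, x, negative_slope * x)

print(leaky_relu(np.array([-2.0, -0.5, 0.0, 3.0])))
# [-0.02  -0.005  0.     3.   ]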
In a given application, I apply an averaging mask to input images to reduce noise, and then a Laplacian mask to enhance small details. Does anyone know if I would get the same results if I reversed the order of these operations in MATLAB?
Convolving with a Laplacian kernel is similar to using second derivative information about the intensity changes. Since this derivative is sensitive to noise, we often smooth the image with a Gaussian before applying the Laplacian filter.
Here's a MATLAB example similar to what @belisarius posted:
f='http://upload.wikimedia.org/wikipedia/commons/f/f4/Noise_salt_and_pepper.png';
I = imread(f);
kAvg = fspecial('average',[5 5]);
kLap = fspecial('laplacian',0.2);
lapMask = @(I) imsubtract(I,imfilter(I,kLap));
subplot(131), imshow(I)
subplot(132), imshow( imfilter(lapMask(I),kAvg) )
subplot(133), imshow( lapMask(imfilter(I,kAvg)) )
Let's say you have two filters F1 and F2, and an image I. If you pass your image through the two filters, you would get a response defined as
X = ((I * F1) * F2)
where I am using * to represent convolution.
By the associative rule of convolution, this is the same as
X = (I * (F1 * F2))
Using commutativity, we can say that
X = (I * (F2 * F1)) = ((I * F2) * F1)
Of course, this is in the nice continuous domain of math; doing these things on a machine means there will be rounding errors and some data may be lost. You should also think about whether your filters are FIR; otherwise the whole concept of thinking of digital filtering as convolution starts to break down, as your filter can't really behave the way you wanted it to.
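One can check this numerically with full (untruncated) discrete convolutions, e.g. in numpy; the kernels below are arbitrary stand-ins for an averaging mask and a 1-D Laplacian:

import numpy as np

avg = np.ones(3) / 3            # stand-in averaging kernel (F1)
lap = np.array([1., -2., 1.])   # stand-in 1-D Laplacian kernel (F2)
img = np.random.rand(32)        # stand-in 1-D "image" (I)

a = np.convolve(np.convolve(img, avg), lap)   # (I * F1) * F2
b = np.convolve(img, np.convolve(avg, lap))   # I * (F1 * F2)
c = np.convolve(np.convolve(img, lap), avg)   # (I * F2) * F1

print(np.allclose(a, b), np.allclose(a, c))   # True True, up to float rounding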
EDIT
The discrete convolution is defined as

(f * g)[n] = sum over m of f[m] * g[n - m]

so adding zeros at the edges of your data doesn't change anything in a mathematical sense.
As some people have pointed out, you will get different answers numerically, but this is expected whenever we deal with computing on actual data. These variations should be small and limited to the low-energy components of the output of the convolution (i.e. the edges).
It is also important to consider how the convolution operation works. Convolving two sets of data of length X and length Y will result in an answer that is X+Y-1 in length. There is some behind-the-scenes magic going on in programs like MATLAB and Mathematica to give you an answer of length X or Y.
So in regards to @belisarius' post, it would seem we are really saying the same thing.
Numerically the results are not the same, but the images look pretty similar.
Example in Mathematica:
Edit
As an answer to @thron's comment on his answer about commutation of linear filters and padding, just consider the following operations.
While commutation of a Gaussian and Laplacian filter without padding holds:
list = {1, 3, 5, 7, 5, 3, 1};
gauss[x_] := GaussianFilter[ x, 1]
lapl[x_] := LaplacianFilter[x, 1]
Print[gauss[lapl[list]], lapl[gauss[list]]]
(*
->{5.15139,0.568439,-1.13688,-9.16589,-1.13688,0.568439,5.15139}
{5.15139,0.568439,-1.13688,-9.16589,-1.13688,0.568439,5.15139}
*)
Doing the same with padding results in a difference at the edges:
gauss[x_] := GaussianFilter[ x, 1, Padding -> 1]
lapl[x_] := LaplacianFilter[x, 1, Padding -> 1]
Print[gauss[lapl[list]], lapl[gauss[list]]]
(*
->{4.68233,0.568439,-1.13688,-9.16589,-1.13688,0.568439,4.68233}
{4.58295,0.568439,-1.13688,-9.16589,-1.13688,0.568439,4.58295}
*)
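For readers working in Python, a roughly analogous sketch with scipy.ndimage (padding with a constant value of 1, in the spirit of the Mathematica example; the exact numbers will differ since the filters are not identical):

import numpy as np
from scipy.ndimage import gaussian_filter1d, laplace

data = np.array([1., 3., 5., 7., 5., 3., 1.])

def gauss(a):
    return gaussian_filter1d(a, 1, mode='constant', cval=1.0)

def lapl(a):
    return laplace(a, mode='constant', cval=1.0)

print(gauss(lapl(data)))
print(lapl(gauss(data)))  # interior values agree; the edges differ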