What does the parameter retain_graph mean in the Variable's backward() method?

I'm going through the neural transfer PyTorch tutorial and am confused about the use of retain_variables (deprecated, now referred to as retain_graph). The code example shows:
class ContentLoss(nn.Module):
    def __init__(self, target, weight):
        super(ContentLoss, self).__init__()
        self.target = target.detach() * weight
        self.weight = weight
        self.criterion = nn.MSELoss()
    def forward(self, input):
        self.loss = self.criterion(input * self.weight, self.target)
        self.output = input
        return self.output
    def backward(self, retain_variables=True):
        # Why is retain_variables True??
        self.loss.backward(retain_variables=retain_variables)
        return self.loss
From the documentation
retain_graph (bool, optional) – If False, the graph used to compute
the grad will be freed. Note that in nearly all cases setting this
option to True is not needed and often can be worked around in a much
more efficient way. Defaults to the value of create_graph.
So by setting retain_graph=True, we're not freeing the memory allocated for the graph on the backward pass. What is the advantage of keeping this memory around? Why do we need it?

@cleros is pretty much on point about the use of retain_graph=True. In essence, it retains the intermediate information needed to backpropagate from a given variable, so that we can run a backward pass on it again later.
An illustrative example
Suppose that we have the following computation graph. The variables d and e are the outputs, and a is the input. For example,
import torch
from torch.autograd import Variable
a = Variable(torch.rand(1, 4), requires_grad=True)
b = a**2
c = b*2
d = c.mean()
e = c.sum()
When we do d.backward(), that is fine. After this computation, the parts of the graph that calculate d will be freed by default to save memory. So if we then do e.backward(), an error will be raised. In order to do e.backward(), we have to set the parameter retain_graph to True in d.backward(), i.e.,
d.backward(retain_graph=True)
As long as you use retain_graph=True in your backward method, you can do backward any time you want:
d.backward(retain_graph=True) # fine
e.backward(retain_graph=True) # fine
d.backward() # also fine
e.backward() # error will occur!
More useful discussion can be found here.
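The behaviour described above can be checked with a short script. This is a minimal sketch that uses plain tensors with requires_grad=True instead of the deprecated Variable wrapper; the names grad_d, grad_e and second_pass_failed are only illustrative.

```python
import torch

a = torch.rand(1, 4, requires_grad=True)
b = a ** 2
c = b * 2
d = c.mean()
e = c.sum()

# First pass keeps the graph alive, so a second pass is still possible.
d.backward(retain_graph=True)
grad_d = a.grad.clone()      # d(mean(2*a^2))/da = 4*a/4 = a

a.grad.zero_()               # gradients accumulate, so reset before the next pass
e.backward()                 # no retain_graph: the shared graph is freed here
grad_e = a.grad.clone()      # d(sum(2*a^2))/da = 4*a

# The shared part of the graph (a**2, b*2) is gone now, so this fails.
try:
    d.backward()
    second_pass_failed = False
except RuntimeError:
    second_pass_failed = True
```

After running this, second_pass_failed should be True, because the last d.backward() needs buffers that e.backward() has already released.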
A real use case
Right now, a real use case is multi-task learning, where you have multiple losses that may be attached to different layers. Suppose that you have 2 losses, loss1 and loss2, and they reside in different layers. To backprop the gradients of loss1 and loss2 w.r.t. the learnable weights of your network independently, you have to use retain_graph=True in the backward() call of the loss that is back-propagated first.
# suppose you first back-propagate loss1, then loss2 (you can also do the reverse)
loss1.backward(retain_graph=True)
loss2.backward() # now the graph is freed, and next process of batch gradient descent is ready
optimizer.step() # update the network parameters
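To make the two-loss pattern concrete, here is a small sketch with a shared layer and two heads. The modules trunk, head1 and head2 and the random data are hypothetical stand-ins, not part of any particular network. It also compares the result against simply summing the two losses, which gives the same gradient on the shared weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
trunk = nn.Linear(8, 4)   # shared layers (hypothetical)
head1 = nn.Linear(4, 1)   # produces the prediction for loss1
head2 = nn.Linear(4, 1)   # produces the prediction for loss2
x = torch.randn(16, 8)
y1 = torch.randn(16, 1)
y2 = torch.randn(16, 1)

def compute_losses():
    h = torch.tanh(trunk(x))  # both losses depend on this shared part of the graph
    loss1 = nn.functional.mse_loss(head1(h), y1)
    loss2 = nn.functional.mse_loss(head2(h), y2)
    return loss1, loss2

# Sequential backward passes: the first one must retain the shared graph.
loss1, loss2 = compute_losses()
loss1.backward(retain_graph=True)
loss2.backward()
grad_sequential = trunk.weight.grad.clone()

# Single backward pass on the summed loss, starting from clean gradients.
for module in (trunk, head1, head2):
    module.zero_grad()
loss1, loss2 = compute_losses()
(loss1 + loss2).backward()
grad_summed = trunk.weight.grad.clone()

same_gradient = torch.allclose(grad_sequential, grad_summed, atol=1e-6)
```

If you do not need the two gradients separately, summing the losses avoids retain_graph altogether.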

This is a very useful feature when you have more than one output of a network. Here's a completely made up example: imagine you want to build some random convolutional network that you can ask two questions of: Does the input image contain a cat, and does the image contain a car?
One way of doing this is to have a network that shares the convolutional layers, but that has two parallel classification layers following (forgive my terrible ASCII graph, but this is supposed to be three convlayers, followed by three fully connected layers, one for cats and one for cars):
                    -- FC - FC - FC - cat?
Conv - Conv - Conv -|
                    -- FC - FC - FC - car?
Given a picture that we want to run both branches on, when training the network, we can do so in several ways. First (which would probably be the best thing here, illustrating how bad the example is), we simply compute a loss on both assessments and sum the loss, and then backpropagate.
However, there's another scenario, in which we want to do this sequentially. First we want to backprop through one branch, and then through the other (I have had this use case before, so it is not completely made up). In that case, running .backward() on one branch will free the graph it traversed, including the part for the convolutional layers, so the second branch's convolutional computations (since these are the only ones shared with the other branch) will not have a graph anymore! That means that when we try to backprop through the second branch, PyTorch will throw an error, since it cannot find a graph connecting the input to the output!
In these cases, we can solve the problem by simply retaining the graph on the first backward pass. The graph will then not be consumed by that pass; it will only be consumed by the first backward pass that does not ask to retain it.
EDIT: If you retain the graph at all backward passes, the implicit graph definitions attached to the output variables will never be freed. There might be a usecase here as well, but I cannot think of one. So in general, you should make sure that the last backwards pass frees the memory by not retaining the graph information.
As for what happens for multiple backward passes: As you guessed, PyTorch accumulates gradients by adding them in-place (to a variable's/parameter's .grad property).
This can be very useful, since it means that looping over a batch and processing it one sample at a time, accumulating the gradients along the way, will do the same optimization step as doing a full batched update (which only sums up all the gradients as well). While a fully batched update can be parallelized more, and is thus generally preferable, there are cases where batched computation is either very, very difficult to implement or simply not possible. Using this accumulation, however, we can still rely on some of the nice stabilizing properties that batching brings (if not on the performance gain).
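The accumulation property can be verified on a toy model. The sketch below assumes a summed (not averaged) loss, so that the per-sample gradients add up exactly to the full-batch gradient; the model and data are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)   # toy model (hypothetical)
x = torch.randn(8, 3)
y = torch.randn(8, 1)

# Full-batch gradient of the summed squared error.
model.zero_grad()
nn.functional.mse_loss(model(x), y, reduction="sum").backward()
full_grad = model.weight.grad.clone()

# Same gradient, built up one sample at a time; .grad accumulates in-place.
model.zero_grad()
for xi, yi in zip(x, y):
    nn.functional.mse_loss(model(xi), yi, reduction="sum").backward()
accumulated_grad = model.weight.grad.clone()

gradients_match = torch.allclose(full_grad, accumulated_grad, atol=1e-5)
```

With an averaged loss (the default reduction), each per-sample loss would have to be divided by the number of samples to get the same result.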

Related

Impact of using data shuffling in Pytorch dataloader

I implemented an image classification network to classify a dataset of 100 classes by using Alexnet as a pretrained model and changing the final output layers.
I noticed that when I loaded my data like this:
trainloader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=False)
I was getting around 2-3% accuracy on the validation dataset for around 10 epochs, but when I just changed to shuffle=True and retrained the network, the accuracy jumped to 70% in the first epoch itself.
I was wondering whether this happened because, in the first case, the network was shown many consecutive examples of a single class, leading it to generalize poorly during training, or whether there is some other reason behind it.
But, I did not expect that to have such a drastic impact.
P.S: All the code and parameters were exactly the same for both the cases except changing the shuffle option.

Yes it totally can affect the result! Shuffling the order of the data that we use to fit the classifier is so important, as the batches between epochs do not look alike.
Checking the Data Loader Documentation it says:
"shuffle (bool, optional) – set to True to have the data reshuffled at every epoch"
In general, it tends to make the model more robust and helps against over/underfitting.
In your case, this heavy increase in accuracy probably comes down to how the dataset is "organised" (something we cannot see from here). For example, if the samples are sorted by category, each batch contains a single category, and in every epoch a given batch contains the same category, which leads to very bad accuracy when you are testing.
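To see how a class-sorted dataset interacts with shuffle=False, here is a small sketch. The dataset is synthetic (4 classes, 8 samples each, stored class by class), which is only an assumption about how the real data might be laid out.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic, class-sorted dataset: labels 0,0,...,1,1,...,2,2,...,3,3,...
labels = torch.arange(4).repeat_interleave(8)
features = torch.randn(32, 2)
dataset = TensorDataset(features, labels)

ordered = DataLoader(dataset, batch_size=8, shuffle=False)
shuffled = DataLoader(dataset, batch_size=8, shuffle=True)

# Number of distinct classes seen in each batch.
classes_per_ordered_batch = [len(y.unique()) for _, y in ordered]
classes_per_shuffled_batch = [len(y.unique()) for _, y in shuffled]
```

Without shuffling, every batch here contains exactly one class, so each gradient step pushes the network towards predicting only that class. With shuffling, batches typically mix several classes.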

PyTorch did many great things, and one of them is the DataLoader class.
The DataLoader class takes the dataset (data), sets the batch_size (how many samples per batch to load), and uses a sampler from the following list of classes:
DistributedSampler
SequentialSampler
RandomSampler
SubsetRandomSampler
WeightedRandomSampler
BatchSampler
The key thing about samplers is how they implement the __iter__() method.
In the case of SequentialSampler it looks like this:
def __iter__(self):
    return iter(range(len(self.data_source)))
This returns an iterator over the indices of every item in the data_source.
When you set shuffle=True, the DataLoader does not use SequentialSampler, but RandomSampler instead.
And this may improve the learning process.
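Which sampler a DataLoader ends up with can be inspected through its sampler attribute. A minimal sketch, using a plain range as a stand-in data source:

```python
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

data_source = range(5)  # stand-in for a dataset; only its length matters to the samplers

ordered_loader = DataLoader(data_source, batch_size=2, shuffle=False)
shuffled_loader = DataLoader(data_source, batch_size=2, shuffle=True)

uses_sequential = isinstance(ordered_loader.sampler, SequentialSampler)
uses_random = isinstance(shuffled_loader.sampler, RandomSampler)

# Samplers yield indices, not data: in order, or as a random permutation.
sequential_indices = list(SequentialSampler(data_source))
random_indices = list(RandomSampler(data_source))
```

Both samplers visit every index exactly once per pass; only the order differs.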

Using torch.nn.DataParallel with a custom CUDA extension

To my understanding, the built-in PyTorch operations all automatically handle batches through implicit vectorization, allowing parallelism across multiple GPUs.
However, when writing a custom operation in CUDA as per the Documentation, the LLTM example given performs operations that are batch invariant, for example computing the gradient of the Sigmoid function elementwise.
However, I have a use case that is not batch element invariant and not vectorizable. Running on a single GPU, I currently (inefficiently) loop over each element in the batch, performing a kernel launch for each, like so (written in the browser, just to demonstrate):
std::vector<at::Tensor> op_cuda_forward(at::Tensor input,
                                        at::Tensor elementSpecificParam) {
    auto output = at::zeros(torch::CUDA(/* TYPE */), {/* DIMENSIONS */});
    const size_t blockDim = //
    const size_t gridDim = //
    const size_t numBatches = //
    // One kernel launch per batch element
    for (size_t i = 0; i < numBatches; i++) {
        op_cuda_forward_kernel<T><<<gridDim, blockDim>>>(input[i],
                                                         elementSpecificParam[i],
                                                         output[i]);
    }
    return {output};
}
However, I wish to split this operation over multiple GPUs by batch element.
How would the allocation of the output Tensor work in a multi-GPU scenario?
Of course, one may create intermediate Tensors on each GPU before launching the appropriate kernel, however, the overhead of copying the input data to each GPU and back again would be problematic.
Is there a simpler way to launch the kernels without first probing the environment for GPU information (# GPU's etc)?
The end goal is to have a CUDA operation that works with torch.nn.DataParallel.

This is kind of unusual, as commonly "Batch" is exactly defined as all operations of the network being invariant along that dimension.
So you could, for example, just introduce another dimension. So you have the "former batch dimension" in which your operation is not invariant. For this keep your current implementation. Then, parallelize over the new dimension of multiple "actual batches" of data.
But, to stay closer to the question you asked, I see two options:
As you said, inside your implementation figure out which original batch you are operating on (depending on total number of parallel splits, etc). This can become hairy.
Consider your parameter as part of the input! In your outside call, pass the parameter along with your input data to the forward of your model.
So (Pythonlike-Pseudocode):
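import torch
import torch.nn as nn

class Network(nn.Module):
    ...
    def forward(self, x, parameter):
        x = self.pre_modules(x)
        x = self.custom_module(x, parameter)
        return x

model = Network()
# assuming 16 is the batch size: DataParallel splits every tensor argument along dim 0
parameter = torch.zeros(16, requires_grad=True)
net = nn.DataParallel(model)
net(input, parameter)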
If you are willing to accept that this will be a leaky abstraction of the network and are mainly interested in getting things to work, I would try out the latter approach first.
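A runnable sketch of the second option follows. The PerSampleScale module is a hypothetical stand-in for the custom CUDA operation, and the layer sizes are arbitrary. The point is only that a per-element parameter passed to forward is split along dimension 0 together with the input, so each replica sees matching slices.

```python
import torch
import torch.nn as nn

class PerSampleScale(nn.Module):
    """Hypothetical stand-in for the custom op: one scalar per batch element."""
    def forward(self, x, parameter):
        return x * parameter.unsqueeze(1)

class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.pre_modules = nn.Linear(4, 4)
        self.custom_module = PerSampleScale()

    def forward(self, x, parameter):
        x = self.pre_modules(x)
        return self.custom_module(x, parameter)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Network().to(device)
net = nn.DataParallel(model)  # without GPUs this simply calls the wrapped module

batch = torch.randn(16, 4, device=device)
parameter = torch.ones(16, device=device, requires_grad=True)  # one entry per batch element

out = net(batch, parameter)
out.sum().backward()          # gradients flow back to the per-element parameter
```

The first dimension of parameter has to match the batch dimension of the input, otherwise the slices handed to each replica will not line up.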

Is there a simpler way to launch the kernels without first probing the environment for GPU information (# GPU's etc)?
Using environment information, such as rank and local_rank, is a pretty common practice in distributed training (both DP and DDP).
This information is also used for sharding datasets, mapping workers to devices, etc.

Evaluating neural networks built with Comp Graph dl4j

I am trying to build a complex neural network using Computation Graph implementation in Deeplearning4J. I need to have multiple outputs so that's why I can't go with the generic MultiLayerConfiguration.
However, my problem is that in this case I do not know how to do the evaluation of my model and I would like at least to know the accuracy.
Has anybody worked with Comp Graphs in dl4j?

First of all yes: tons of people use computation graph. They usually start from our existing examples though and tend to mainly use it for things like seq2seq.
As for your question on evaluation, it's conceptually the same as for a multi layer network. How you evaluate is likely going to be task specific though. If you think about where evaluation happens, it's always tied to a task (classification, regression, binary classification, ...) with an output layer. In the most common case you only have 1 output, which outputs a classification. In that case you can just use the first array it outputs.
Otherwise, for multiple outputs, you'd have to define what you're evaluating. Usually tasks merge to 1 path.
If they don't, you'd have multiple output layers where you want to do an evaluation object per output.
Computation graphs and multi layer network both use a .output method to give you raw arrays. That is typically what you pass to eval.eval.

What is the policy gradient when multiple actions are possible?

I am trying to program a reinforcement learning algorithm using policy gradients, as inspired by Karpathy's blog article. Karpathy's example has only two actions UP or DOWN, so a single output neuron is sufficient (high activation=UP, low activation=DOWN). I want to extend this to multiple actions, so I believe I need a softmax activation function on the output layer. However, I am not certain about what the gradient for the output layer should be.
If I was using a cross-entropy loss function with the softmax activation in a supervised learning context, the gradient for output neuron i is simply:
g[i] = a[i] - target[i]
where target[i] = 1 for the desired action and 0 for all others.
To use this for reinforcement learning I would multiply g[i] by the discounted reward before back-propagating.
However, it seems that reinforcement learning uses negative log-likelihood as the loss instead of cross-entropy. How does that change the gradient?

Note: something that I think will get you on the right track:
The negative log likelihood is also known as the multiclass cross-entropy (Pattern Recognition and Machine Learning).
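Since the two losses coincide for a one-hot target, the gradient with respect to the softmax inputs keeps the form from the question, g[i] = a[i] - target[i], scaled by the reward. The sketch below writes that out in plain Python; the function names and the example numbers are illustrative, and the reward is treated as a constant.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_loss(logits, action, reward):
    """Negative log-likelihood of the taken action, scaled by the reward."""
    return -reward * math.log(softmax(logits)[action])

def policy_gradient_logits(logits, action, reward):
    """Gradient of policy_loss w.r.t. the logits: reward * (a[i] - target[i])."""
    probs = softmax(logits)
    return [reward * (p - (1.0 if i == action else 0.0))
            for i, p in enumerate(probs)]

example_logits = [0.5, -1.0, 2.0]
example_grad = policy_gradient_logits(example_logits, action=2, reward=1.5)
```

The entry for the taken action is negative when the reward is positive, so a gradient descent step raises that action's logit and lowers the others.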
EDIT: misread the question. I thought this was talking about Deep Deterministic Policy Gradients
It would depend on your domain, but with a softmax, you are getting a probability across all output nodes. To me that doesn't really make sense in most domains when you think about DDPG. For example, if you are controlling the extension of robotic arms and legs, it wouldn't make sense to have limb extension measured as [.25, .25, .25, .25], if you wanted to have all limbs extended. In this case, .25 could mean fully extended, but what happens if the vector of outputs is [.75,.25,0,0]? So in this way, you could have a separate sigmoid function from 0 to 1 for all action nodes, where then you could represent it as [1,1,1,1] for all arms being extended. I hope that makes sense.
Since the actor network is what determines the actions in DDPG, we could then represent our network like this for our robot (rough keras example):
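from keras.layers import Input, Dense
from keras.models import Model

state = Input(shape=[your_state_shape])  # your_state_shape: size of your state vector
hidden_layer = Dense(30, activation='relu')(state)
all_limbs = Dense(4, activation='sigmoid')(hidden_layer)  # one independent sigmoid per limb
model = Model(inputs=state, outputs=all_limbs)  # Keras 2+ keyword names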
Then, your critic network will have to account for the action dimensions.
from keras.layers import Input, Dense, Add

state = Input(shape=[your_state_shape])
action = Input(shape=[4])
state_hidden = Dense(30, activation='relu')(state)
state_hidden_2 = Dense(30, activation='linear')(state_hidden)
action_hidden = Dense(30, activation='linear')(action)
combined = Add()([state_hidden_2, action_hidden])  # Keras 2+ replacement for merge(..., mode='sum')
squasher = Dense(30, activation='relu')(combined)
output = Dense(4, activation='linear')(squasher)  # number of actions
Then you can use your target functions from there. Note, I don't know if this is working code, as I haven't tested it, but hopefully you get the idea.
Source: https://arxiv.org/pdf/1509.02971.pdf
Awesome blog on this with TORCS (not created by me): https://yanpanlau.github.io/2016/10/11/Torcs-Keras.html
In the above blog, they also show how to use different output functions, such as one tanh and two sigmoid functions for actions.

Backpropagation through time in stateful RNNs

If I use a stateful RNN in Keras for processing a sequence of length N divided into N parts (each time step is processed individually),
how is backpropagation handled? Does it only affect the last time step, or does it backpropagate through the entire sequence?
If it does not propagate through the entire sequence, is there a way to do this?

The backpropagation horizon is limited to the second dimension of the input sequence, i.e. if your data has shape (num_sequences, num_time_steps_per_seq, data_dim), then backprop is done over a time horizon of num_time_steps_per_seq. Take a look at
https://github.com/fchollet/keras/issues/3669

There are a couple of things you need to know about RNNs in Keras. By default, the parameter return_sequences=False in all recurrent layers. This means that by default only the activations of the RNN after processing the entire input sequence are returned as output. If you want to have the activations at every time step and optimize every time step separately, you need to pass return_sequences=True as a parameter (https://keras.io/layers/recurrent/#recurrent).
The next thing that is important to know is that all a stateful RNN does is remember the last activation. So if you have a large input sequence and break it up into smaller sequences (which I believe you are doing), the activation is retained in the network after processing the first sequence and therefore affects the activations in the network when processing the second sequence. This has nothing to do with how the network is optimized; the network simply minimizes the difference between the output and the targets you give.
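The difference between carrying a state forward and backpropagating through it can be illustrated outside Keras. The following is a PyTorch analogy, not Keras code: the state from the first chunk is passed on as a plain value (detach()), which is roughly what a stateful layer does between batches. All names and sizes are made up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=2, hidden_size=3, batch_first=True)
chunk1 = torch.randn(1, 5, 2, requires_grad=True)  # first part of a long sequence
chunk2 = torch.randn(1, 5, 2, requires_grad=True)  # second part

# State is carried over, but its history is cut off.
_, h = rnn(chunk1)
out, _ = rnn(chunk2, h.detach())
out.sum().backward()
truncated_reaches_chunk1 = chunk1.grad is not None
truncated_reaches_chunk2 = chunk2.grad is not None

# For contrast: keep the history, and gradients flow into the first chunk too.
chunk2.grad = None
_, h = rnn(chunk1)
out, _ = rnn(chunk2, h)
out.sum().backward()
full_reaches_chunk1 = chunk1.grad is not None
```

In the truncated case the second chunk still depends on the first through the value of the state, but no gradient flows back past the chunk boundary.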

Regarding the first question, how is backpropagation handled? (An RNN is not only connected vertically between layers, as in a basic NN; it is also deep in time, with horizontal backprop connections in the hidden layer.)
Suppose batch_input_shape=(num_seq, 1, data_dim). Then: "Backprop will be truncated to 1 timestep, as the second dimension is 1. No gradient updates will be performed further back in time than the second dimension's value." - see here
Thus, with more than 1 time step in the second dimension of the input shape, gradients will propagate back through that many time steps.
Set return_sequences=True for all recurrent layers except the last one (whose output is used directly or fed to a Dense layer). True is needed so that each layer passes a full sequence on to the next recurrent layer, which is what allows backprop through those time steps.
return_state=True is used to get the states returned: 2 state tensors for an LSTM [output, state_h, state_c = layers.LSTM(64, return_state=True, name="encoder")] or 1 state tensor for a GRU [incl. in shapes]. These "can be used in the encoder-decoder sequence-to-sequence model, where the encoder final state is used as the initial state of the decoder."...
But remember (in any case): stateful training does not allow shuffling, and is more time-consuming than stateless training.
P.S.
As you can see here, the states are ordered (c, h) in tf or (h, c) in keras; both h and c are elements of the output, so both become important in batched or multi-threaded training.
