Using torch.nn.DataParallel with a custom CUDA extension - neural-network

To my understanding, the built-in PyTorch operations all automatically handle batches through implicit vectorization, allowing parallelism across multiple GPUs.
However, when writing a custom operation in CUDA as per the Documentation, the LLTM example given performs operations that are batch invariant, for example computing the gradient of the Sigmoid function elementwise.
However, I have a use case that is not batch element invariant and not vectorizable. Running on a single GPU, I currently (inefficiently) loop over each element in the batch, performing a kernel launch for each, like so (written in the browser, just to demonstrate):
std::vector<at::Tensor> op_cuda_forward(at::Tensor input,
at::Tensor elementSpecificParam) {
auto output = at::zeros(torch::CUDA(/* TYPE */), {/* DIMENSIONS */});
const size_t blockDim = //
const size_t gridDim = //
const size_t = numBatches = //
for (size_t i = 0; i < numBatches; i++) {
op_cuda_forward_kernel<T><<<gridDim, blockDim>>>(input[i],
elementSpecificParam[i],
output[i]);
}
return {output};
}
However, I wish to split this operation over multiple GPUs by batch element.
How would the allocation of the output Tensor work in a multi-GPU scenario?
Of course, one may create intermediate Tensors on each GPU before launching the appropriate kernel, however, the overhead of copying the input data to each GPU and back again would be problematic.
Is there a simpler way to launch the kernels without first probing the environment for GPU information (# GPU's etc)?
The end goal is to have a CUDA operation that works with torch.nn.DataParallel.

This is kind of unusual, as commonly "Batch" is exactly defined as all operations of the network being invariant along that dimension.
So you could, for example, just introduce another dimension. So you have the "former batch dimension" in which your operation is not invariant. For this keep your current implementation. Then, parallelize over the new dimension of multiple "actual batches" of data.
But, to stay closer to the question you asked, I see two options:
As you said, inside your implementation figure out which original batch you are operating on (depending on total number of parallel splits, etc). This can become hairy.
Consider your parameter as Part of Input! In your outside call, pass the parameter along your input data to the forward of your model.
So (Pythonlike-Pseudocode):
Network(nn.Module):
...
def forward(x, parameter):
x=self.pre_modules(x)
x=self.custom_module(x,parameter)
return x
parameter=torch.zeros(16,requires_grad=True)
net=nn.DataParallel(model)
net(input,parameter)
If your are willing to accept that this will be a leaky abstraction of the network and are mainly interested in getting things to work, I would try out the latter approach first.

Is there a simpler way to launch the kernels without first probing the environment for GPU information (# GPU's etc)?
Using environmental information, like ranks, local_ranks and local_rank, is a pretty common practice in distributed training (both DP and DDP)
These information are also used in sharding dataset, mapping workers to devices and etc.

Related

How to 'copy' matrix without creating a temporary matrix in memory that caused memory overflow?

By assigning a matrix into a much bigger allocated memory, matlab somehow will duplicate it while 'copying' it, and if the matrix to be copied is large enough, there will be memory overflow. This is the sample code:
main_mat=zeros(500,500,2000);
n=500;
slice_matrix=zeros(500,500,n);
for k=1:4
parfor i=1:n
slice_matrix(:,:,i)=gather(gpuArray(rand(500,500)));
end
main_mat(:,:,1+(k-1)*n:1+(k-1)*n+n-1)=slice_matrix; %This is where the memory will likely overflow
end
Any way to just 'smash' the slice_matrix onto the main_mat without the overhead? Thanks in advance.
EDIT:
The overflow occurred when main_mat is allocated beforehand. If main_mat is initialized with main_mat=zeros(500,500,1); (smaller size), the overflow will not occur, but it will slowed down as allocation is not done before matrix is assigned into it. This will significantly reduce the performance as the range of k increases.
The main issue is that numbers take more space than zeros.
main_mat=zeros(500,500,2000); takes little RAM while main_mat = rand(500,500,2000); take a lot, no matter if you use GPU or parfor (in fact, parfor will make you use more RAM). So This is not an unnatural swelling of memory. Following Daniel's link below, it seems that the assignment of zeros only creates pointers to memory, and the physical memory is filled only when you use the matrix for "numbers". This is managed by the operating system. And it is expected for Windows, Mac and Linux, either you do it with Matlab or other languages such as C.
Removing parfor will likely fix your problem.
parfor is not useful there. MATLAB's parfor does not use shared memory parallelism (i.e. it doesn't start new threads) but rather distributed memory parallelism (it starts new processes). It is designed to distribute work over a set or worker nodes. And though it also works within one node (or a single desktop computer) to distribute work over multiple cores, it is not an optimal way of doing parallelism within one node.
This means that each of the processes started by parfor needs to have its own copy of slice_matrix, which is the cause of the large amount of memory used by your program.
See "Decide When to Use parfor" in the MATLAB documentation to learn more about parfor and when to use it.
I assume that your code is just a sample code and that rand() represents a custom in your MVE. So there are a few hints and tricks for the memory usage in matlab.
There is a snippet from The MathWorks training handbooks:
When assigning one variable to another in MATLAB, as occurs when passing parameters into a function, MATLAB transparently creates a reference to that variable. MATLAB breaks the reference, and creates a copy of that variable, only when code modifies one or more of teh values. This behavior, known as copy-on-write, or lazy-copying, defers the cost of copying large data sets until the code modifies a values. Therefore, if the code performs no modifications, there is no need for extra memory space and execution time to copy variables.
The first thing to do would be to check the (memory) efficiency of your code. Even the code of excellent programmers can be futher optimized with (a little) brain power. Here are a few hints regarding memory efficiency
make use of the nativ vectorization of matlab, e.g. sum(X,2), mean(X,2), std(X,[],2)
make sure that matlab does not have to expand matrices (implicit expanding was changed recently). It might be more efficient to use the bsxfun
use in-place-operations, e.g. x = 2*x+3 rather than x = 2*x+3
...
Be aware that optimum regarding memory usage is not the same as if you would want to reduce computation time. Therefore, you might want to consider reducing the number of workers or refrain from using the parfor-loop. (As parfor cannot use shared memory, there is no copy-on-write feature with using the Parallel Toolbox.
If you want to have a closer look at your memory, what is available and that can be used by Matlab, check out feature('memstats'). What is interesting for you is the Virtual Memory that is
Total and available memory associated with the whole MATLAB process. It is limited by processor architecture and operating system.
or use this command [user,sys] = memory.
Quick side node: Matlab stores matrices consistently in memory. You need to have a large block of free RAM for large matrices. That is also the reason why you want to allocate variables, because changing them dynamically forces Matlab to copy the entire matrix to a larger spot in the RAM every time it outgrows the current spot.
If you really have memory issues, you might just want to dig into the art of data types -- as is required in lower level languages. E.g. you can cut your memory usage in half by using single-precision directly from the start main_mat=zeros(500,500,2000,'single'); -- btw, this also works with rand(...,'single') and more native functions -- although a few of the more sophisticated matlab functions require input of type double, which you can upcast again.
If I understand correctly your main issue is that parfor does not allow to share memory. Think of every parfor worker as almost a separate matlab instance.
There is basically just one workaround for this that I know (that I have never tried), that is 'shared matrix' on Fileexchange: https://ch.mathworks.com/matlabcentral/fileexchange/28572-sharedmatrix
More solutions: as others suggested: remove parfor is certainly one solution, get more ram, use tall arrays (that use harddrives when ram runs full, read here), divide operations in smaller chunks, last but not least, consider an alternative other than Matlab.
You may use following code. You actually don't need the slice_matrix
main_mat=zeros(500,500,2000);
n=500;
slice_matrix=zeros(500,500,n);
for k=1:4
parfor i=1:n
main_mat(:,:,1+(k-1)*n + i - 1) = gather(gpuArray(rand(500,500)));
end
%% now you don't need this main_mat(:,:,1+(k-1)*n:1+(k-1)*n+n-1)=slice_matrix; %This is where the memory will likely overflow
end

What does the parameter retain_graph mean in the Variable's backward() method?

I'm going through the neural transfer pytorch tutorial and am confused about the use of retain_variable(deprecated, now referred to as retain_graph). The code example show:
class ContentLoss(nn.Module):
def __init__(self, target, weight):
super(ContentLoss, self).__init__()
self.target = target.detach() * weight
self.weight = weight
self.criterion = nn.MSELoss()
def forward(self, input):
self.loss = self.criterion(input * self.weight, self.target)
self.output = input
return self.output
def backward(self, retain_variables=True):
#Why is retain_variables True??
self.loss.backward(retain_variables=retain_variables)
return self.loss
From the documentation
retain_graph (bool, optional) – If False, the graph used to compute
the grad will be freed. Note that in nearly all cases setting this
option to True is not needed and often can be worked around in a much
more efficient way. Defaults to the value of create_graph.
So by setting retain_graph= True, we're not freeing the memory allocated for the graph on the backward pass. What is the advantage of keeping this memory around, why do we need it?
#cleros is pretty on the point about the use of retain_graph=True. In essence, it will retain any necessary information to calculate a certain variable, so that we can do backward pass on it.
An illustrative example
Suppose that we have a computation graph shown above. The variable d and e is the output, and a is the input. For example,
import torch
from torch.autograd import Variable
a = Variable(torch.rand(1, 4), requires_grad=True)
b = a**2
c = b*2
d = c.mean()
e = c.sum()
when we do d.backward(), that is fine. After this computation, the parts of the graph that calculate d will be freed by default to save memory. So if we do e.backward(), the error message will pop up. In order to do e.backward(), we have to set the parameter retain_graph to True in d.backward(), i.e.,
d.backward(retain_graph=True)
As long as you use retain_graph=True in your backward method, you can do backward any time you want:
d.backward(retain_graph=True) # fine
e.backward(retain_graph=True) # fine
d.backward() # also fine
e.backward() # error will occur!
More useful discussion can be found here.
A real use case
Right now, a real use case is multi-task learning where you have multiple losses that maybe be at different layers. Suppose that you have 2 losses: loss1 and loss2 and they reside in different layers. In order to backprop the gradient of loss1 and loss2 w.r.t to the learnable weight of your network independently. You have to use retain_graph=True in backward() method in the first back-propagated loss.
# suppose you first back-propagate loss1, then loss2 (you can also do the reverse)
loss1.backward(retain_graph=True)
loss2.backward() # now the graph is freed, and next process of batch gradient descent is ready
optimizer.step() # update the network parameters
This is a very useful feature when you have more than one output of a network. Here's a completely made up example: imagine you want to build some random convolutional network that you can ask two questions of: Does the input image contain a cat, and does the image contain a car?
One way of doing this is to have a network that shares the convolutional layers, but that has two parallel classification layers following (forgive my terrible ASCII graph, but this is supposed to be three convlayers, followed by three fully connected layers, one for cats and one for cars):
-- FC - FC - FC - cat?
Conv - Conv - Conv -|
-- FC - FC - FC - car?
Given a picture that we want to run both branches on, when training the network, we can do so in several ways. First (which would probably be the best thing here, illustrating how bad the example is), we simply compute a loss on both assessments and sum the loss, and then backpropagate.
However, there's another scenario - in which we want to do this sequentially. First we want to backprop through one branch, and then through the other (I have had this use-case before, so it is not completely made up). In that case, running .backward() on one graph will destroy any gradient information in the convolutional layers, too, and the second branch's convolutional computations (since these are the only ones shared with the other branch) will not contain a graph anymore! That means, that when we try to backprop through the second branch, Pytorch will throw an error since it cannot find a graph connecting the input to the output!
In these cases, we can solve the problem by simple retaining the graph on the first backward pass. The graph will then not be consumed, but only be consumed by the first backward pass that does not require to retain it.
EDIT: If you retain the graph at all backward passes, the implicit graph definitions attached to the output variables will never be freed. There might be a usecase here as well, but I cannot think of one. So in general, you should make sure that the last backwards pass frees the memory by not retaining the graph information.
As for what happens for multiple backward passes: As you guessed, pytorch accumulates gradients by adding them in-place (to a variable's/parameters .grad property).
This can be very useful, since it means that looping over a batch and processing it once at a time, accumulating the gradients at the end, will do the same optimization step as doing a full batched update (which only sums up all the gradients as well). While a fully batched update can be parallelized more, and is thus generally preferable, there are cases where batched computation is either very, very difficult to implement or simply not possible. Using this accumulation, however, we can still rely on some of the nice stabilizing properties that batching brings. (If not on the performance gain)

Evaluating neural networks built with Comp Graph dl4j

I am trying to build a complex neural network using Computation Graph implementation in Deeplearning4J. I need to have multiple outputs so that's why I can't go with the generic MultiLayerConfiguration.
However, my problem is that in this case I do not know how to do the evaluation of my model and I would like at least to know the accuracy.
Has anybody worked with Comp Graphs in dl4j?
First of all yes: tons of people use computation graph. They usually start from our existing examples though and tend to mainly use it for things like seq2seq.
As for your question on evaluation, it's conceptually the same as multi layer network. How you evaluate is likely going to be task specific though. If you think about where evaluation happens, it's always tied to a task (classification,regression,binary classification,..) with an output layer . In the most common case usually you only have 1 output which outputs a classification. In that case you can just use the first array it outputs.
Otherwise for multiple outputs..you'd have to define what you're evaluating. Usually tasks merge to 1 path.
If they don't, you'd have multiple output layers where you want to do an evaluation object per output.
Computation graphs and multi layer network both use a .output method to give you raw arrays. That is typically what you pass to eval.eval.

tensorflow store training data on GPU memory

I am pretty new to tensorflow. I used to use theano for deep learning development. I notice a difference between these two, that is where input data can be stored.
In Theano, it supports shared variable to store input data on GPU memory to reduce the data transfer between CPU and GPU.
In tensorflow, we need to feed data into placeholder, and the data can come from CPU memory or files.
My question is: is it possible to store input data on GPU memory for tensorflow? or does it already do it in some magic way?
Thanks.
If your data fits on the GPU, you can load it into a constant on GPU from e.g. a numpy array:
with tf.device('/gpu:0'):
tensorflow_dataset = tf.constant(numpy_dataset)
One way to extract minibatches would be to slice that array at each step instead of feeding it using tf.slice:
batch = tf.slice(tensorflow_dataset, [index, 0], [batch_size, -1])
There are many possible variations around that theme, including using queues to prefetch the data to GPU dynamically.
It is possible, as has been indicated, but make sure that it is actually useful before devoting too much effort to it. At least at present, not every operation has GPU support, and the list of operations without such support includes some common batching and shuffling operations. There may be no advantage to putting your data on GPU if the first stage of processing is to move it to CPU.
Before trying to refactor code to use on-GPU storage, try at least one of the following:
1) Start your session with device placement logging to log which ops are executed on which devices:
config = tf.ConfigProto(log_device_placement=True)
sess = tf.Session(config=config)
2) Try to manually place your graph on GPU by putting its definition in a with tf.device('/gpu:0'): block. This will throw exceptions if ops are not GPU-supported.

Signal processing or algorithmic programming for a PLC

I have an application that takes voltages and temperatures as analog inputs and does some processing using an algorithm which involves signal processing such as low-pass filtering,
exponential smoothing, and other steps which might typically be done in a high-level programming language such as C or C++.
I'm curious how I could perform these same steps using a PLC, and in particular, the Allen-Bradley Control-Logix system? It seems to me that the instruction set with ladder logic is too limited for this. Could I perform this using structured text?
Ladder logic can do the computation just fine, although it isn't the nicest programming language in the world. It has a full complement of conditionals, arithmetic, arrays, etc.
Your real problem is fitting your computation into the cyclic execution model that most ladder logic engines (and Control Logix) run: a repeated execution of the program in the control from top to bottom, with each rung or computation being executed just once per scan.
If you need to loop over a set of values repeatedly before producing a result, you will likely have difficulty resolving the ladder engines' desire to execute everything "just once" per scan, and your need to execute a loop to produce a result. I believe in fact that there are FOR loop operators that can repeat a block of ladder code just as conventional loop; you need to ensure that the amount of time spent in your loops/algorithm don't materially affect the scan rate.
What may work well is for you to let the the scan rate act as one of your loops; typically you compute a filter by accepting a new value into an array and then computing a result over that array. Since you basically can't accept values faster than one-per-scan-cycle anyway, you can compute at-most-one-filter-result per scan cycle without losing any precision. If your array is of modest size (e.g., 10 values), you can in effect simply code a polynomial over the array as an equation to produce your filter result, and then code that polynomial (klunkily but straightforwardly) as ladder logic.
Control Logix PLCs do not have to execute on a cyclic sweep. I don't have RSLogix 5000 in front of me right now, but when defining the project, you are required to create one Program that executes on a cyclic sweep. But you can create other programs that do not. You can also run them off a trigger (not useful for regular input scanning) or off a fixed timer (very useful for input scanning). Keep in mind that there is no point in setting the input scan timer faster than your instrumentation updates-modern PLCs can frequently execute a scan much faster than a meter can update the data.
One good technique I have used for this is to create a program called one-second or something similar. This program will scan all your inputs, and perform all your signal processing, and then write to buffered memory locations. The rest of your program looks at those buffered memory locations, and never monitors the inputs directly. You can set the input-buffering program to execute as fast as needed for your process, up to whatever the PLC can handle before it faults.
It would also be a good idea to write your signal processing functions them selves as Add On Instructions, and then call them, with whatever parameters you need.
So you could have an AOI with a call interface like this:
input-1_buffered := input_smooth (low_pass, input-1);
This would call your input_smooth function, using input-1 as the value and input-1_buffered as the final location. low_pass would be used within the input_smooth function to jump to the appropriate logic.
Then you can write your actual smoothing logic in structured text, without anyone needing to understand it, because it will only exist in that one AOI.