Does a TensorFlow convnet only duplicate the model across multiple GPUs?

I am currently running a TensorFlow convnet for image recognition and I am considering buying new GPUs to enable more complex graphs, larger batch sizes, and larger input dimensions. I have read posts like this that do not recommend using AWS GPU instances to train convnets, but more opinions are always welcome.
I've read TensorFlow's guide 'Training a Model Using Multiple GPU Cards', and it seems that the graph is duplicated across the GPUs. Is this the only way to use parallel GPUs in a TensorFlow convnet?
I am asking because if TensorFlow can only duplicate graphs across multiple GPUs, each GPU must have at least as much memory as my model requires for one batch. (For example, if the minimum memory required is 5 GB, two cards of 4 GB each would not do the job.)
Thank you in advance!

No, it is definitely possible to place different variables on different GPUs.
For every variable and every layer that you declare, you can choose the device on which it is declared.
And in the specific case where you want to use multiple GPUs to duplicate your model only to increase the batch size and train faster, you would still need to build your model explicitly around shared parameters and manage how those parameters are communicated between devices.
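A minimal sketch of the shared-parameter idea, written in plain Python rather than TensorFlow. Two simulated replicas each compute the gradient of a toy linear model on half of a batch, and the averaged gradient is applied once to the single shared weight. The model, the data, and the names grad_mse and data_parallel_step are hypothetical and only illustrate the bookkeeping; they are not part of TensorFlow's API.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_step(w, xs, ys, n_replicas=2, lr=0.01):
    """One SGD step where each replica sees an equal slice of the batch."""
    shard = len(xs) // n_replicas
    grads = []
    for r in range(n_replicas):
        lo, hi = r * shard, (r + 1) * shard
        # Every replica reads the same shared weight w.
        grads.append(grad_mse(w, xs[lo:hi], ys[lo:hi]))
    # Average the per-replica gradients, then update the shared weight once.
    avg_grad = sum(grads) / n_replicas
    return w - lr * avg_grad, avg_grad


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w_new, avg_grad = data_parallel_step(0.0, xs, ys)
full_grad = grad_mse(0.0, xs, ys)  # gradient on the undivided batch
```

With equal-sized shards the averaged gradient matches the full-batch gradient. This is why duplicating the model raises throughput but still requires every replica to hold the whole model.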


Do I need a GPU even to deploy a deep learning model?

I know I need a GPU to train a model, but once the model is trained, do I also need a GPU to deploy it?
For example, I have a model for a car with auto-pilot that predicts and makes decisions. Do I need a GPU for the prediction too?
Especially in the case of reinforcement learning?
Strictly speaking, you usually don't need a GPU for training either, depending on the platform; training on the CPU would just be much slower than on the GPU.
For deploying the model you do not need a GPU. Most models are simply an organized list of weights that the model applies to its inputs. Since this usually isn't particularly computationally expensive, except for very large models, a GPU isn't necessary for deployment either, but it may provide some performance benefit for larger models.
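To make the "organized list of weights" point concrete, here is a minimal sketch of CPU-only inference for a tiny two-layer network in plain Python. The weights W1, b1, W2, b2 and the function names are made up for illustration; a real deployment would load trained weights from a file and use a numerical library.

```python
import math


def dense(inputs, weights, biases):
    """Fully connected layer: one output per row of weights."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


# Hypothetical "trained" weights; not taken from any real model.
W1 = [[0.5, -0.2], [0.1, 0.8]]
b1 = [0.0, -0.1]
W2 = [[1.0, -1.5]]
b2 = [0.2]


def predict(features):
    """Forward pass only: no gradients and no GPU involved."""
    hidden = relu(dense(features, W1, b1))
    return sigmoid(dense(hidden, W2, b2)[0])


score = predict([1.0, 2.0])
```

Prediction here is a handful of multiply-adds per layer. The cost grows with the number of weights, which is where a GPU starts to help for larger models.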

Is it possible to simultaneously use and train a neural network?

Is it possible to use TensorFlow or a similar library to build a model that you can efficiently train and use at the same time?
An example/use case for this would be a chat bot that you give feedback to. Somewhat like how pets learn (i.e. replicating what they just did for a reward). Or being able to add new entries or new responses they can use.
I think what you are asking is whether a model can be trained continuously without having to retrain it from scratch each time new labelled data comes in.
The answer to that is online models.
There are models that can be trained continuously on incoming data without having to be retrained from scratch. As per the Wikipedia definition:
Online machine learning is a method of machine learning in which data becomes available in sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once.
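The definition above can be sketched with a toy model that serves a prediction and then immediately learns from the feedback, one example at a time. This is a minimal illustration using per-example stochastic gradient descent on a 1-D linear regressor; the class OnlineLinearModel, the learning rate, and the synthetic data stream are assumptions made for the example, not anything from a specific library.

```python
class OnlineLinearModel:
    """1-D linear regressor y = w * x + b updated one example at a time."""

    def __init__(self, lr=0.05):
        self.w = 0.0
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return self.w * x + self.b

    def update(self, x, y):
        """Single SGD step on the squared error of one new example."""
        error = self.predict(x) - y
        self.w -= self.lr * error * x
        self.b -= self.lr * error


model = OnlineLinearModel()
# Synthetic stream generated from y = 3x + 1, repeated to mimic ongoing use.
stream = [(x, 3.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 1.5, 0.5)] * 200

first_error = abs(model.predict(2.0) - 7.0)
for x, y in stream:
    _ = model.predict(x)   # the model is used for a prediction...
    model.update(x, y)     # ...and then updated from the feedback
last_error = abs(model.predict(2.0) - 7.0)
```

The model is never retrained from scratch: each new labelled example nudges the existing parameters, so prediction and training interleave.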
Some examples of such algorithms are

How to improve a TensorFlow object detection model?

I need to recognize captchas for a project. I did this using the object_detection provided by TensorFlow.
I also added 500 captcha samples by annotating the images as XML with LabelImg and then converting them to TFRecord.
Besides, I used "faster_rcnn_inception_v2_coco_2018_01_28".
The problem is that the accuracy of the model is very low.
My questions are:
Can the problem be solved by increasing the number of training data?
Should I change my algorithm?
How effective is using YOLOv3 instead of the object detection provided by TensorFlow?
Q. Can the problem be solved by increasing the number of training data?
A. It depends on how much more data you can get. I don't think that increasing the amount of training data alone is a good approach.
Consider fine-tuning an existing trained model to detect your object class. If you fine-tune, be careful with the class label assignment, because existing trained models such as YOLOv3 and Faster R-CNN have no "captcha" label in their training datasets.
I recommend referring to this website, which can help you fine-tune the model.
Q. Should I change my algorithm?
A. Do as you wish.
Q. How effective is using YOLOv3 instead of the object detection provided by TensorFlow?
A. In my opinion, the two models perform much the same if you don't need to consider inference time.

Using Multiple GPUs outside of training in PyTorch

I'm calculating the accumulated distance between each pair of kernels inside an nn.Conv2d layer. However, for large layers it runs out of memory on a Titan X with 12 GB of memory. I'd like to know if it is possible to divide such calculations across two GPUs.
The code follows:
    def ac_distance(layer):
        total = 0
        for p in layer.weight:
            for q in layer.weight:
                total += distance(p, q)
        return total
Here layer is an instance of nn.Conv2d and distance returns the sum of the differences between p and q. I can't detach the graph, however, because I need it later on. I tried wrapping my model in nn.DataParallel, but all calculations in ac_distance are done on only one GPU, even though training uses both.
Parallelism while training neural networks can be achieved in two ways.
Data Parallelism - Split a large batch into two and run the same set of operations on each half on a separate GPU
Model Parallelism - Split the computations and run them on different GPUs
As you have asked in the question, you would like to split the calculation which falls into the second category. There are no out-of-the-box ways to achieve model parallelism. PyTorch provides primitives for parallel processing using the torch.distributed package. This tutorial comprehensively goes through the details of the package and you can cook up an approach to achieve model parallelism that you need.
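One way to think about splitting this particular calculation is to partition the outer loop: each worker handles a contiguous chunk of kernels against the full set, and only the scalar partial sums are combined. The sketch below simulates this in plain Python with lists instead of tensors. The absolute-difference definition of distance, the helper names, and the toy data are assumptions for illustration; it does not use torch.distributed and does not address keeping the autograd graph.

```python
def distance(p, q):
    """Sum of absolute element-wise differences (an assumed definition)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def partial_ac_distance(rows, all_kernels):
    """Accumulated distance from a subset of kernels to every kernel."""
    return sum(distance(p, q) for p in rows for q in all_kernels)


def ac_distance_split(kernels, n_workers=2):
    """Split the outer loop into contiguous chunks, one per worker."""
    chunk = (len(kernels) + n_workers - 1) // n_workers
    partials = []
    for i in range(n_workers):
        rows = kernels[i * chunk:(i + 1) * chunk]
        # On real hardware, `rows` and a copy of `kernels` would live on
        # device i, and only the scalar partial sum would be moved back.
        partials.append(partial_ac_distance(rows, kernels))
    return sum(partials)


# Flattened toy kernels standing in for layer.weight.
kernels = [[0.0, 1.0], [2.0, 3.0], [4.0, 6.0]]
split_total = ac_distance_split(kernels)
full_total = partial_ac_distance(kernels, kernels)
```

The chunked total equals the single-pass total, so the split changes only where the work happens, not the result.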
However, model parallelism can be very complex to achieve. The general way is to do data parallelism with either torch.nn.DataParallel or torch.nn.parallel.DistributedDataParallel. In both methods you run the same model on two different GPUs, but one large batch is split into two smaller chunks. With DataParallel, the gradients are accumulated on a single GPU and the optimization step takes place there; with DistributedDataParallel, optimization happens in parallel across GPUs by using multiprocessing.
In your case, if you use DataParallel, the computation would still take place on two different GPUs. If you notice imbalance in GPU usage it could be because of the way DataParallel has been designed. You can try using DistributedDataParallel which is the fastest way to train on multiple GPUs according to the docs.
There are other ways to process very large batches too. This article goes through them in detail and I'm sure it would be helpful. A few important points:
Do gradient accumulation for larger batches
Use DataParallel
If that doesn't suffice, go with DistributedDataParallel
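The first point, gradient accumulation, can be sketched without any framework: gradients from several small micro-batches are summed before a single weight update, which mimics one larger batch on limited memory. The toy model, the data, and the names grad_mse and train_step_accumulated are hypothetical; in PyTorch the same idea is usually expressed by calling backward() several times before one optimizer step.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for y = w * x over one micro-batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train_step_accumulated(w, micro_batches, lr=0.01):
    """Accumulate gradients over micro-batches, then update once."""
    accumulated = 0.0
    for batch in micro_batches:
        # Scale so that the sum equals the gradient of the combined batch
        # (valid here because all micro-batches have the same size).
        accumulated += grad_mse(w, batch) / len(micro_batches)
    return w - lr * accumulated


data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
w_accum = train_step_accumulated(0.0, [data[:2], data[2:]])
w_full = 0.0 - 0.01 * grad_mse(0.0, data)  # one step on the whole batch
```

Only one micro-batch needs to be in memory at a time, at the price of more sequential passes per update.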

Siamese networks: Why does the network need to be duplicated?

The DeepFace paper from Facebook uses a Siamese network to learn a metric. They say that the DNN that extracts the 4096 dimensional face embedding has to be duplicated in a Siamese network, but both duplicates share weights. But if they share weights, every update to one of them will also change the other. So why do we need to duplicate them?
Why can't we just apply one DNN to two faces and then do backpropagation using the metric loss? Do they maybe mean this and just talk about duplicated networks for "better" understanding?
Quote from the paper:
We have also tested an end-to-end metric learning approach, known as Siamese network [8]: once learned, the face recognition network (without the top layer) is replicated twice (one for each input image) and the features are used to directly predict whether the two input images belong to the same person. This is accomplished by: a) taking the absolute difference between the features, followed by b) a top fully connected layer that maps into a single logistic unit (same/not same). The network has roughly the same number of parameters as the original one, since much of it is shared between the two replicas, but requires twice the computation. Notice that in order to prevent overfitting on the face verification task, we enable training for only the two topmost layers.
The short answer is that yes, I think that looking at the architecture of the network will help you understand what is going on. You have two networks that are "joined at the hip" i.e. sharing weights. That's what makes it a "Siamese network". The trick is that you want the two images you feed into the network to pass through the same embedding function. So to ensure that this happens both branches of the network need to share weights.
Then we combine the two embeddings into a metric loss (called "contrastive loss" in the image below). We can back-propagate as normal; we just have two input branches available so that we can feed in two images at a time.
I think a picture is worth a thousand words. So check out how a Siamese network is constructed (at least conceptually) below.
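The gradients depend on the activation values, so each branch contributes a different gradient. Because the weights are shared, the contributions from both branches are combined (summed, or averaged up to a constant factor) into a single update of the shared weights.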
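A small numeric sketch of this point, assuming a deliberately tiny "network" with one shared weight and a squared-distance loss (both chosen for illustration, not taken from the DeepFace paper). The gradient through each branch is different, and their sum matches the gradient obtained by treating the setup as one network applied to two inputs.

```python
def embed(w, x):
    """Shared embedding function: a single weight applied to the input."""
    return w * x


def loss(w, x1, x2):
    """Squared distance between the two embeddings."""
    return (embed(w, x1) - embed(w, x2)) ** 2


def branch_gradients(w, x1, x2):
    """Gradient of the loss w.r.t. w through each branch separately."""
    diff = embed(w, x1) - embed(w, x2)
    grad_branch1 = 2.0 * diff * x1      # path through embed(w, x1)
    grad_branch2 = 2.0 * diff * (-x2)   # path through embed(w, x2)
    return grad_branch1, grad_branch2


w, x1, x2 = 0.5, 3.0, 1.0
g1, g2 = branch_gradients(w, x1, x2)
shared_grad = g1 + g2  # one update for the single shared weight

# Finite-difference check on the one-network, two-inputs formulation.
eps = 1e-6
numeric_grad = (loss(w + eps, x1, x2) - loss(w - eps, x1, x2)) / (2 * eps)
```

In this sketch "two replicas with shared weights" and "one network applied to two inputs" give the same update, which suggests the duplication is a way of describing the computation rather than a second set of parameters.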