Using Multiple GPUs outside of training in PyTorch - neural-network

I'm calculating the accumulated distance between each pair of kernel inside a nn.Conv2d layer. However for large layers it runs out of memory using a Titan X with 12gb of memory. I'd like to know if it is possible to divide such calculations across two gpus.
The code follows:
def ac_distance(layer):
total = 0
for p in layer.weight:
for q in layer.weight:
total += distance(p,q)
return total
Where layer is instance of nn.Conv2d and distance returns the sum of the differences between p and q. I can't detach the graph, however, for I need it later on. I tried wrapping my model around a nn.DataParallel, but all calculations in ac_distance are done using only 1 gpu, however it trains using both.

Parallelism while training neural networks can be achieved in two ways.
Data Parallelism - Split a large batch into two and do the same set of operations but individually on two different GPUs respectively
Model Parallelism - Split the computations and run them on different GPUs
As you have asked in the question, you would like to split the calculation which falls into the second category. There are no out-of-the-box ways to achieve model parallelism. PyTorch provides primitives for parallel processing using the torch.distributed package. This tutorial comprehensively goes through the details of the package and you can cook up an approach to achieve model parallelism that you need.
However, model parallelism can be very complex to achieve. The general way is to do data parallelism with either torch.nn.DataParallel or torch.nn.DistributedDataParallel. In both the methods, you would run the same model on two different GPUs, however one huge batch would be split into two smaller chunks. The gradients will be accumulated on a single GPU and optimization happens. Optimization takes place on a single GPU in Dataparallel and parallely across GPUs in DistributedDataParallel by using multiprocessing.
In your case, if you use DataParallel, the computation would still take place on two different GPUs. If you notice imbalance in GPU usage it could be because of the way DataParallel has been designed. You can try using DistributedDataParallel which is the fastest way to train on multiple GPUs according to the docs.
There are other ways to process very large batches too. This article goes through them in detail and I'm sure it would be helpful. Few important points:
Do gradient accumulation for larger batches
Use DataParallel
If that doesn't suffice, go with DistributedDataParallel

Related

Does tensorflow convnet only duplicate model across multiple GPUs?

I am currently running a Tensorflow convnet for image recognition and I am considering of buying new GPUs to enable more complex graphs, batch size, and input dimensions. I have read posts like this that do not recommend using AWS GPU instances to train convnets, but more opinions are always welcomed.
I've read Tensorflow's guide 'Training a Model Using Multiple GPU Cards', and it seems that the graph is duplicated across the GPUs. I would like to know is this the only way to use parallel GPUs in Tensorflow convnet?
The reason I am asking this is because if Tensorflow can only duplicate graphs across multiple GPUs, it would mean each GPU must have at least the memory size that my model requires for one batch. (Example if the minimum memory size required is 5GB, two card of 4GB each would not do the job)
Thank you in advance!
No, it is definitely possible to use different variables on different GPUs.
For every variable and every layer that you declare, you have the choice of where do you declare the variable.
And in the specific case, you would want to use multiple GPUs for duplicating your model only to increase its batch_size training parameter to train faster, you would still need to explicitly build your model using the concept of shared parameters and manage how do those parameters communicate.

'ufactor' parameter in metis for imbalance clustering

I have been using METIS for clustering social media users.
By default, it was outputting clusters with same number of vertices in each side, which is not ideal in real world scenario. So, I was trying to find way to loosen the constraint of "same number of vertices" and get possible imbalance partition with minimized cut value.
I find a parameter ufactor in the manual which is suitable(I think) for my case but I did not grasp what it is really doing. I have large graph and tried with some value of ufactor. For one data set ufactor=1000 works very well but for another dataset it could not even partition the graph. I can not interpret this result as i did not understand what it's really doing. Here is what i find in the manual about this:
Specifies the maximum allowed load imbalance among the partitions. A value of x indicates that the
allowed load imbalance is (1 + x)/1000. The load imbalance for the jth constraint is defined to be
max_i(w[j, i])/t[j, i]), where w[j, i] is the fraction of the overall weight of the jth constraint that
is assigned to the ith partition and t[j, i] is the desired target weight of the jth constraint for the
ith partition (i.e., that specified via -tpwgts). For -ptype=rb, the default value is 1 (i.e., load
imbalance of 1.001) and for -ptype=kway, the default value is 30 (i.e., load imbalance of 1.03).
Can anybody help me to interpret this? Here, what is jth constraints? what is -ptype=rb/kway?
First of all, let me mention that I think METIS is the wrong tool here, because it is used for graph partitioning, where the emphasis is on minimizing the number of edges between partitions while keeping the partitions balanced (more or less equal sizes)
What you probably want to do is community detection within social networks, i.e. the search for clusters which maximize internal connectivity (large number of edges between nodes from the same cluster) and minimize external connectivity (small number of edges between different clusters).
This can be achieved by maximizing the so-called Modularity of the clustering
There are several approaches to tackle this problem, a popular heuristic being Label propagation.
If you don't want to implement the algorithm yourself, I would recommend using a framework like NetworKit (unfortunately, I don't know any other such frameworks yet), which implements Label propagation, some modularity-based algorithms and many helpful tools.
But back to your original question:
What is -ptype=rb/kway?
There are multiple ways how you can approach the graph partitioning problem: You can either try to partition the graph into your desired number of partitions directly (k-way partitioning) or you can split the graph in half repeatedly until you have the desired number of partitions (recursive bisection, rb)
What is the jth constraint?
METIS allows you to try and optimize multiple balance constraints at the same time, i.e. if you have multiple types of calculations on the graph that should all be more or less balanced among the compute nodes.
See the manual:
Many important types of multi-phase and multi-
physics computations require that multiple quantities be load balanced simultaneously.
[...]
METIS includes partitioning routines that can be used to partition a graph in the presence of such multiple balancing
constraints. Each vertex is now assigned a vector of m weights and the objective of the partitioning routines is
to minimize the edge-cut subject to the constraints that each one of the m weights is equally distributed among the
domains.
EDIT: Since you clarified that you wanted to look at a fixed number of clusters, I see how graph partitioning could be helpful here. Let me illustrate what ufactor means:
The imbalance of a partitioned graph is (in this simple case) computed as the maximum of the imbalance for each partition, which is roughly the quotient partition size / average partition size. So if we allow a maximum imbalance of 2, this means that the largest partition is twice as big as the average partition. Note however that ufactor doesn't specify the imbalance directly, it specifies how many permille away from 1 the imbalance is allowed to be.
So ufactor=143 actually means that your maximal allowed imbalance is 1.143, which makes sense since your clusters are not that far from each other. So in your case, you will probably use larger values for ufactor to allow the groups to be of quite different sizes.
Consequences of large imbalance
If your imbalance is too large, it might happen that all the strongly-connected parts land in the same partition while only isolated nodes are put in the other partitions. This is due to the fact that the algorithm tries to minimize the number of cut edges between different partitions, which will be lower if we put all the high-degree nodes in the same partition.
What about spectral partitioning, ...?
The general approach of METIS works as follows:
Most input graphs are too large to partition directly, which is why so-called multilevel methods are used:
The graph is first coarsened (nodes are combined while trying to preserve the graph structure) until its size becomes feasible to partition directly
The coarsest graph is partitioned using an initial partitioning technique, where we could use a variety of approaches (combinatorial bisection, spectral bisection, exact solutions using ILPs, ...).
The graph is then uncoarsened, where in each step a small number of nodes are moved from partition to partition in a local search to improve the overall edge cut.
My personal recommendation
I should however note that while graph partitioning might be a valid model for your case, METIS itself might not be the ideal implementation for you:
As you can read on the METIS homepage, it is mostly used for rather sparse graphs ('finite element methods, linear programming, VLSI, and transportation'), whereas social networks are much denser and have a different structure (degrees follow a power-law distribution)
The coarsening approach of METIS uses heavy edge matching to combine nodes which are somehow close together, which works great for the intended applications, for social networks however, clustering-based coarsening techniques might prove more efficient.
Another library that is a bit slower in general, but implements some presets especially for social networks is KaHIP, see the manual for details.
(I should mention however that I am biased in this regard, since I worked extensively with this library ;-) )

Shouldn't we take average of n models in cross validation in linear regression?

I have a question regarding cross validation in Linear regression model.
From my understanding, in cross validation, we split the data into (say) 10 folds and train the data from 9 folds and the remaining folds we use for testing. We repeat this process until we test all of the folds, so that every folds are tested exactly once.
When we are training the model from 9 folds, should we not get a different model (may be slightly different from the model that we have created when using the whole dataset)? I know that we take an average of all the "n" performances.
But, what about the model? Shouldn't the resulting model also be taken as the average of all the "n" models? I see that the resulting model is same as the model which we created using whole of the dataset before cross-validation. If we are considering the overall model even after cross-validation (and not taking avg of all the models), then what's the point of calculating average performance from n different models (because they are trained from different folds of data and are supposed to be different, right?)
I apologize if my question is not clear or too funny.
Thanks for reading, though!
I think that there is some confusion in some of the answers proposed because of the use of the word "model" in the question asked. If I am guessing correctly, you are referring to the fact that in K-fold cross-validation we learn K-different predictors (or decision functions), which you call "model" (this is a bad idea because in machine learning we also do model selection which is choosing between families of predictors and this is something which can be done using cross-validation). Cross-validation is typically used for hyperparameter selection or to choose between different algorithms or different families of predictors. Once these chosen, the most common approach is to relearn a predictor with the selected hyperparameter and algorithm from all the data.
However, if the loss function which is optimized is convex with respect to the predictor, than it is possible to simply average the different predictors obtained from each fold.
This is because for a convex risk, the risk of the average of the predictor is always smaller than the average of the individual risks.
The PROs and CONs of averaging (vs retraining) are as follows
PROs: (1) In each fold, the evaluation that you made on the held out set gives you an unbiased estimate of the risk for those very predictors that you have obtained, and for these estimates the only source of uncertainty is due to the estimate of the empirical risk (the average of the loss function) on the held out data.
This should be contrasted with the logic which is used when you are retraining and which is that the cross-validation risk is an estimate of the "expected value of the risk of a given learning algorithm" (and not of a given predictor) so that if you relearn from data from the same distribution, you should have in average the same level of performance. But note that this is in average and when retraining from the whole data this could go up or down. In other words, there is an additional source of uncertainty due to the fact that you will retrain.
(2) The hyperparameters have been selected exactly for the number of datapoints that you used in each fold to learn. If you relearn from the whole dataset, the optimal value of the hyperparameter is in theory and in practice not the same anymore, and so in the idea of retraining, you really cross your fingers and hope that the hyperparameters that you have chosen are still fine for your larger dataset.
If you used leave-one-out, there is obviously no concern there, and if the number of data point is large with 10 fold-CV you should be fine. But if you are learning from 25 data points with 5 fold CV, the hyperparameters for 20 points are not really the same as for 25 points...
CONs: Well, intuitively you don't benefit from training with all the data at once
There are unfortunately very little thorough theory on this but the following two papers especially the second paper consider precisely the averaging or aggregation of the predictors from K-fold CV.
Jung, Y. (2016). Efficient Tuning Parameter Selection by Cross-Validated Score in High Dimensional Models. International Journal of Mathematical and Computational Sciences, 10(1), 19-25.
Maillard, G., Arlot, S., & Lerasle, M. (2019). Aggregated Hold-Out. arXiv preprint arXiv:1909.04890.
The answer is simple: you use the process of (repeated) cross validation (CV) to obtain a relatively stable performance estimate for a model instead of improving it.
Think of trying out different model types and parametrizations which are differently well suited for your problem. Using CV you obtain many different estimates on how each model type and parametrization would perform on unseen data. From those results you usually choose one well suited model type + parametrization which you will use, then train it again on all (training) data. The reason for doing this many times (different partitions with repeats, each using different partition splits) is to get a stable estimation of the performance - which will enable you to e.g. look at the mean/median performance and its spread (would give you information about how well the model usually performs and how likely it is to be lucky/unlucky and get better/worse results instead).
Two more things:
Usually, using CV will improve your results in the end - simply because you take a model that is better suited for the job.
You mentioned taking the "average" model. This actually exists as "model averaging", where you average the results of multiple, possibly differently trained models to obtain a single result. Its one way to use an ensemble of models instead of a single one. But also for those you want to use CV in the end for choosing reasonable model.
I like your thinking. I think you have just accidentally discovered Random Forest:
https://en.wikipedia.org/wiki/Random_forest
Without repeated cv your seemingly best model is likely to be only a mediocre model when you score it on new data...

How can I add concurrency to neural network processing?

The basics of neural networks, as I understand them, is there are several inputs, weights and outputs. There can be hidden layers that add to the complexity of the whole thing.
If I have 100 inputs, 5 hidden layers and one output (yes or no), presumably, there will be a LOT of connections. Somewhere on the order of 100^5. To do back propagation via gradient descent seems like it will take a VERY long time.
How can I set up the back propagation in a way that is parallel (concurrent) to take advantage of multicore processors (or multiple processors).
This is a language agnostic question because I am simply trying to understand structure.
If you have 5 hidden layers (assuming with 100 nodes each) you have 5 * 100^2 weights (assuming the bias node is included in the 100 nodes), not 100^5 (because there are 100^2 weights between two consecutive layers).
If you use gradient descent, you'll have to calculate the contribution of each training sample to the gradient, so a natural way of distributing this across cores would be to spread the training sample across the cores and sum the contributions to the gradient in the end.
With backpropagation, you can use batch backpropagation (accumulate weight changes from several training samples before updating the weights, see e.g. https://stackoverflow.com/a/11415434/288875 ).
I would think that the first option is much more cache friendly (updates need to be merged only once between processors in each step).

Application performance vs Peak performance

I have questions about real application performance running on a cluster vs cluster peak performance.
Let's say one HPC cluster report that it has peak performance of 1 Petaflops. How is this calculated?
To me, it seems that there are two measuring matrixes. One is the performance calculated based on the hardware. The other one is from running HPL? Is my understanding correct?
When I am reading one real application running on the system at full scale, the developer mentions that it could achieve 10% of the peak performance. How is this measured and why it can't achieve peak performance?
Thanks
Peak performance is what the system is theoretically able to deliver. It is the product of the total number of CPU cores, the core clock frequency, and the number of FLOPs one core makes per clock tick. That performance can never be reached in practice because no real application consists of 100% fully vectorised tight loops that only operate on data held in the L1 data cache. In many cases data doesn't even fit in the last-level cache and the memory interface is usually not fast enough to deliver data at the same rate at which the CPU is able to process it. One ubiquitous example from HPC is the multiplication of a sparse matrix with a vector. It is so memory intensive (i.e. many loads and stores per arithmetic operation) that on many platforms it only achieves a fraction of the peak performance.
Things get even worse when multiple nodes are networked together on a massive scale as data transfers could introduce huge additional delays. Performance in those cases is determined mainly by the ratio of local data processing and data transfer. HPL is a particularly good in that aspect - it does a lot of vectorised local processing and does not move much data across the CPUs/nodes. That's not the case with many real-world parallel programs and also the reason why many are questioning the applicability of HPL in assessing cluster performance nowadays. Alternative benchmarks are already emerging, for example the HPCG benchmark (from the people who brought you HPL).
The theoretical (peak) value is based on the capability of each individual core in the cluster, which depends on clock frequency, number of floating point units, parallel instruction issuing capacity, vector register sizes, etc. which are design characteristics of the core. The flops/s count for each core in the cluster is then aggregated to get the cluster flops/s count.
For a car the equivalent theoretical performance would be the maximum speed it can reach given the specification of its engine.
For a program to reach the theoretical count, it has to perform specific operations in a specific order so that the instruction-level parallelism is maximum and all floating-point units are working constantly without delay due to synchronization or memory access, etc. (See this SO question for more insights)
For a car, it is equivalent to measuring top speed on a straight line with no wind.
But of course, chances that such a program computes something of interest are small. So benchmarks like HPL use actual problems in linear algebra, with a highly optimized and tuned implementation, but which is still imperfect due to IO operations and the fact that the order of operations is not optimal.
For a car, it could be compared to measuring the top average speed on a race track with straight lines, curves, etc.
If the program requires a lot of network, or disk communications, which are operations that require a lot of clock cycle, then the CPU has often to stay idle waiting for data before it can perform arithmetic operations, effectively wasting away a lot of computing power. Then, the actual performance is estimated by dividing the number of floating points operations (addition and multiplications) the program is performing by the time it takes to perform them.
For a car, this would correspond to measuring the top average speed in town with red lights, etc. by calculating the length of the trip divided by the time needed to accomplish it.