Why is it faster to transfer data from CPU to GPU than from GPU to CPU in MATLAB?

I've noticed that transferring data to recent high-end GPUs is faster than gathering it back to the CPU. Here are the results from a benchmarking function provided to me by MathWorks tech support, run on an older Nvidia K20 and a recent Nvidia P100 with PCIe:
Using a Tesla P100-PCIE-12GB GPU.
Achieved peak send speed of 11.042 GB/s
Achieved peak gather speed of 4.20609 GB/s
Using a Tesla K20m GPU.
Achieved peak send speed of 2.5269 GB/s
Achieved peak gather speed of 2.52399 GB/s
I've attached the benchmark function below for reference. What is the reason for the asymmetry on the P100? Is this system dependent or is it the norm on recent high end GPUs? Can the gather speed be increased?
gpu = gpuDevice();
fprintf('Using a %s GPU.\n', gpu.Name)
sizeOfDouble = 8; % Each double-precision number needs 8 bytes of storage
sizes = power(2, 14:28);
sendTimes = inf(size(sizes));
gatherTimes = inf(size(sizes));
for ii = 1:numel(sizes)
    numElements = sizes(ii)/sizeOfDouble;
    hostData = randi([0 9], numElements, 1);
    gpuData = randi([0 9], numElements, 1, 'gpuArray');
    % Time sending to GPU
    sendFcn = @() gpuArray(hostData);
    sendTimes(ii) = gputimeit(sendFcn);
    % Time gathering back from GPU
    gatherFcn = @() gather(gpuData);
    gatherTimes(ii) = gputimeit(gatherFcn);
end
sendBandwidth = (sizes./sendTimes)/1e9;
[maxSendBandwidth,maxSendIdx] = max(sendBandwidth);
fprintf('Achieved peak send speed of %g GB/s\n',maxSendBandwidth)
gatherBandwidth = (sizes./gatherTimes)/1e9;
[maxGatherBandwidth,maxGatherIdx] = max(gatherBandwidth);
fprintf('Achieved peak gather speed of %g GB/s\n',max(gatherBandwidth))
Edit: we now know it is not system dependent (see comments). I still want to know the reason for the asymmetry, and whether it can be changed.

This is a community wiki (CW) answer for anybody interested in posting benchmarks from their machine. Contributors are encouraged to leave their details in case a future question arises regarding their results.
System: Win10, 32GB DDR4-2400Mhz RAM, i7 6700K. MATLAB: R2018a.
Using a GeForce GTX 660 GPU.
Achieved peak send speed of 7.04747 GB/s
Achieved peak gather speed of 3.11048 GB/s
Warning: The measured time for F may be inaccurate because it is running too fast. Try measuring something that takes longer.
Contributor: Dev-iL
System: Win7, 32GB RAM, i7 4790K. MATLAB: R2018a.
Using a Quadro P6000 GPU.
Achieved peak send speed of 1.43346 GB/s
Achieved peak gather speed of 1.32355 GB/s
Contributor: Dev-iL

I am not familiar with Matlab GPU toolboxes, but I suspect that the second transfer (that gets data back from GPU) starts before the first has ended.
% Time sending to GPU
sendFcn = @() gpuArray(hostData);
sendTimes(ii) = gputimeit(sendFcn);
%
% No synchronization here
%
% Time gathering back from GPU
gatherFcn = @() gather(gpuData);
gatherTimes(ii) = gputimeit(gatherFcn);
A similar question, for a C program, was posted here:
copy from GPU to CPU is slower than copying CPU to GPU
In that case, there is no explicit sync between launching a kernel on the GPU and copying the result data back.
So the function that copies the data back, cudaMemcpy() in C, has to wait for the GPU to finish the previously launched kernel before transferring anything, which inflates the measured transfer time.
With the CUDA C API it is possible to force the CPU to wait for the GPU to finish the previously launched kernel(s) with:
cudaDeviceSynchronize();
and only then start measuring the time to transfer data back.
Maybe in Matlab there is also some synchronization primitive.
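In MATLAB, wait(gpuDevice) provides exactly this: it blocks until the selected device has finished all pending operations, roughly the equivalent of cudaDeviceSynchronize(). Below is a minimal sketch of a manually timed gather with explicit synchronization, reusing the variable names from the benchmark above:
gpu = gpuDevice();
hostData = randi([0 9], 1e7, 1);   % example payload, ~80 MB of doubles
gpuData = gpuArray(hostData);      % send to the device
wait(gpu);                         % make sure all queued GPU work has finished
tic;
hostCopy = gather(gpuData);        % gather blocks until the data is back on the host
gatherTime = toc;
fprintf('Gather bandwidth: %.2f GB/s\n', numel(hostData)*8/gatherTime/1e9);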
Also in the same answer, it is recommended to measure time with (Cuda) Events.
In this post on optimizing data transfers (also in C, sorry), CUDA events are used to measure data transfer times:
https://devblogs.nvidia.com/how-optimize-data-transfers-cuda-cc/
There, the measured transfer time is the same in both directions.

Related

Using a cluster of Raspberry Pi 4 boards for number crunching?

So I am currently developing an algorithm in MATLAB that is computationally expensive but parallel-processing friendly. Given that, I have been using the parallel processing library, but I am still falling short of my computation time goals.
I am currently running my algorithm on an Intel i7 8086K CPU (6 cores, 12 logical, @4.00GHz, turbo is 5GHz).
Here are my questions:
If I were to purchase, let's say, 10 Raspberry Pi 4 SBCs (4 cores @1.5GHz), could I use my main desktop as the host and the Pis as the clients? (Let us assume I migrate my algorithm to C++ and run it in Ubuntu for now.)
1a. If I were to go through with the build in question 1, would there be a significant upgrade in computation for the ~$500 spent?
1b. If I am not able to use my desktop as the host (I believe this shouldn't be an issue), how many Raspberry Pis would I need to equal my current CPU, or how many would I need to make it advantageous to work on a Pi cluster vs. my computer?
Is it possible to run Windows on the host computer and Linux on the clients (Pis) so that I can continue using MATLAB?
Thanks for your help; any other advice and recommendations are welcome.
Does your algorithm bottleneck on raw FMA / FLOPS throughput? If so then a cluster of weak ARM cores is more trouble than it's worth. I'd expect a used Zen2 machine, or maybe Haswell or Broadwell, could be good if you can find one cheaply. (You'd have to look at core counts, clocks, and FLOPS/$. And whether the problem would still not be memory bottlenecked on an older system with less memory bandwidth.)
If you bottleneck instead on cache misses from memory bandwidth or latency (e.g. cache-unfriendly data layout), there might possibly be something to gain from having more weaker CPUs each with their own memory controller and cache, even if those caches are smaller than your Intel.
Does Matlab use your GPU at all (e.g. via OpenCL)? Your current CPU's peak double (FP64) throughput from the IA cores is 96 GFLOPS, but its integrated GPU is capable of 115.2 GFLOPS. Or for single-precision, 460.8 GFLOPS GPU vs. 192 GFLOPS from your x86 cores. Again, theoretical max throughput, running 2x 256-bit SIMD FMA instructions per clock cycle per core on the CPU.
Upgrading to a beefy GPU could be vastly more effective than a cluster of RPi4. e.g. https://en.wikipedia.org/wiki/FLOPS#Hardware_costs shows that cost per single-precision GFLOP in 2017 was about 5 cents, adding big GPUs to a cheapo CPU. Or 79 cents per double-precision GFLOP.
If your problem is GPU-friendly but Matlab hasn't been using your GPU, look into that. Maybe Matlab has options, or you could use OpenCL from C++.
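For what it's worth, here is a minimal sketch of how to check from MATLAB whether a usable GPU is present and push work onto it. This assumes Parallel Computing Toolbox, which only drives CUDA-capable NVIDIA GPUs, so an Intel iGPU will not show up here; OpenCL would have to go through MEX/C++.
if gpuDeviceCount > 0
    d = gpuDevice;
    fprintf('Using %s with %.1f GB of memory\n', d.Name, d.TotalMemory/2^30);
    A = rand(4096, 'single', 'gpuArray');   % data created directly on the GPU
    B = A * A.';                             % the multiply runs on the GPU
    result = gather(B);                      % copy back only when needed
else
    disp('No CUDA-capable GPU visible to MATLAB.');
end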
will there be a significant upgrade in computation for the ~$500 spent?
RPi4 Model B has a Broadcom BCM2711 SoC; the CPU cores are Cortex-A72.
Their cache hierarchy is 32 KB data + 48 KB instruction L1 cache per core, with a 1 MB shared L2 cache. That's weaker than your 4GHz i7 with 32k L1d + 256k L2 private per core, and a shared 12 MiB L3 cache. But faster cores waste more cycles for the same absolute time waiting for a cache miss, and the ARM chips run their DRAM at a competitive DDR4-2400.
RPi CPUs are not FP powerhouses. There's a large gap in the raw numbers, but with enough of them the throughput does add up.
https://en.wikipedia.org/wiki/FLOPS#FLOPs_per_cycle_for_various_processors shows that Cortex-A72 has peak FPU throughput of 2 double FLOPS per core per cycle, vs. 16 for Intel since Haswell, AMD since Zen2.
Dropping to single precision float improves x86 by a factor of 2, but A72 by a factor of 4. Apparently their SIMD units have lower throughput for FP64 instructions, as well as half the work per SIMD vector. (Some other ARM cores aren't extra slow for double, just the expected 2:1, like Cortex-A57 and A76.)
But all this is peak FLOPS throughput; coming close to that in real code is only achieved with well-tuned code with good computational intensity (lots of work each time the data is loaded into cache, and/or into registers). e.g. a dense matrix multiply is the classic example: O(n^3) FPU work over O(n^2) data, in a way that makes cache-blocking possible. Or Prime95 is another example.
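If you want a quick empirical number for your own machine, timing a large dense matrix multiply (MATLAB calls an optimized BLAS for this) and converting to GFLOPS gives a rough idea of achievable, as opposed to theoretical, throughput. A small sketch:
N = 4096;
A = rand(N);  B = rand(N);      % double-precision operands
t = timeit(@() A*B);            % timeit runs the multiply several times and takes the median
fprintf('~%.0f GFLOPS achieved\n', 2*N^3/t/1e9);   % dense matmul costs about 2*N^3 FLOPs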
Still, here is a rough back-of-the-envelope calculation of peak throughput, being generous and assuming sustained non-turbo clocks for the Coffee Lake. (All 6 cores busy running 2x 256-bit FMA instructions per clock makes a lot of heat; that's literally what Prime95 does, so expect that level of power consumption if your code is that efficient.)
6 * 4GHz * 4 elements/vec * 2 vec/cycle = 48G FMAs / sec = 96 GFLOP/sec on the CFL
4 * 1.5GHz * 2 DP flops / clock = 12 GFLOP / sec per RPi.
With 5x RPi systems, that's 60 GFLOPS added to your existing 96 GFLOP.
Doesn't sound worth the trouble of managing 5 RPi systems for less than your existing total FP throughput. But again, if your problem has the right kind of parallelism, a GPU can run it much more efficiently: 60 GFLOPS for $500 is not a good deal compared to ~$50 per 60 double-precision GFLOP from a high-end (in 2017) video card.
The GPU in an RPi might have some compute capability, but it is almost certainly not worth it compared to slapping a $500 discrete GPU into your existing machine if your code is GPU-friendly.
Or your problem might not scale with theoretical max FLOPS, but instead perhaps with cache bandwidth or some other factor.
Is it possible to run Windows on the host computer and Linux on the clients (Pis) so that I can continue using MATLAB?
Zero clue; I'm only considering theoretical best case for efficient machine code running on these CPUs.

Matlab R2017a memory profiler gives a ridiculous number for allocated memory

My code is:
function eigs_mem_test
    N = 20000;
    density = 0.2;
    numOfModes = 250;
    A = sprand(N, N, density);
    profile -memory on
    eigs(A, numOfModes, 0.0)
    profile off
    profsave(profile('info'), 'eigs_test')
    profview
end
And this returns a profiler report saying that MATLAB allocated 18014398508117708.00 Kb, i.e. about 1.8e10 Gb, which is completely impossible. How did this happen? The code finishes with correct output, and in htop I can see the memory usage vary quite a bit, but it stays under 16G.
For N = 2000, I get sensible results (i.e. 0.2 GB allocated).
How can I profile this case effectively, if I want to obtain an upper bound on memory used for large sparse matrices?
I use MATLAB R2017a.
I cannot reproduce your issue in R2017b, with 128GB of RAM on my machine. Here is the result after running your example code:
Notably, the profiler reports that the function peaked at 14726148Kb, which is ~1.8GB if that really means kilobits. I'm mostly confused by the units MATLAB has used here: I saw nearer 14GB of usage in the task manager, which matches your large observed usage (1.4e7 KB is about 14 GB), so I can only think the profiler is meant to state KB (kilobytes) instead of Kb (kilobits).
Ridiculously large, unexpected values like this are often the result of overflow, so this could be an internal overflow bug.
You could use whos to get the size in memory of a variable:
w = whos('A'); % get details of variable A
sizeInMemory = w.bytes; % size in bytes
This doesn't necessarily tell you how much memory a function like eigs in your example uses though. You could poll memory within your function to get the current usage.
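For example, the memory function (Windows only) reports MATLAB's current usage and can be polled around the call of interest. A rough sketch using the variables from the question:
[userMem, ~] = memory;                 % Windows-only
before = userMem.MemUsedMATLAB;        % bytes currently used by MATLAB
eigs(A, numOfModes, 0.0);
[userMem, ~] = memory;
fprintf('Memory grew by ~%.2f GB around eigs\n', (userMem.MemUsedMATLAB - before)/2^30);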
I'll resist exploring this further, since the question of how to profile for memory usage has already been asked and answered.
N.B. I'm not sure why my machine was ~100x slower than yours; I assume the image of your memory usage didn't come from actually running your example code? Or my RAM is awful...

How to train neural networks on big sample sets in Matlab?

I am trying to train a neural network on a big training set.
The inputs consist of approximately 4 million columns and 128 rows, and the targets of 62 rows.
hiddenLayerSize is 128.
The script is as follows:
net = patternnet(hiddenLayerSize);
net.inputs{1}.processFcns = {'removeconstantrows','mapminmax'};
net.outputs{2}.processFcns = {'removeconstantrows','mapminmax'};
net.divideFcn = 'dividerand'; % Divide data randomly
net.divideMode = 'sample'; % Divide up every sample
net.divideParam.trainRatio = 70/100;
net.divideParam.valRatio = 15/100;
net.divideParam.testRatio = 15/100;
net.trainFcn = 'trainbfg';
net.performFcn = 'mse'; % Mean squared error
net.plotFcns = {'plotperform','plottrainstate','ploterrhist', ...
'plotregression', 'plotfit'};
net.trainParam.show = 1;
net.trainParam.showCommandLine = 1;
[net,tr] = train(net,inputs,targets, 'showResources', 'yes', 'reduction', 10);
When train starts to execute, MATLAB hangs, Windows hangs or becomes slow, the disk thrashes with swapping, and nothing else happens for dozens of minutes.
The computer has 12 GB of RAM and runs 64-bit Windows; MATLAB is also 64-bit. Memory usage in the process manager varies during the operation.
What else can be done besides reducing the training set?
If I reduce the training set, to what size? How can I estimate that size other than by trial and error?
Why doesn't the function display anything?
It is fairly hard to diagnose such problems remotely, to the point that I am not even sure that any answer will actually help. Moreover, you are asking several questions in one, so I will take it step by step. Ultimately I will try to give you a better understanding of the memory consumption of your script.
Memory consumption
Dataset Size and Copies
Starting from the size of the dataset you are loading into memory, and assuming that each entry is a double-precision floating-point number, your training input set requires 4e6 * 128 * 8 bytes of memory, which works out to roughly 3.81 GB. If I understand correctly, your array of targets contains 4e6 * 62 entries, which is 4e6 * 62 * 8 bytes, roughly 1.85 GB. So even before running the network training you are consuming roughly 5.7 GB of memory.
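The same estimate in code, assuming plain double arrays and no additional copies:
bytesInputs  = 4e6 * 128 * 8;   % inputs:  128 rows x 4e6 columns of doubles
bytesTargets = 4e6 *  62 * 8;   % targets:  62 rows x 4e6 columns of doubles
fprintf('inputs  ~ %.2f GB\n', bytesInputs/2^30);                   % ~3.81 GB
fprintf('targets ~ %.2f GB\n', bytesTargets/2^30);                  % ~1.85 GB
fprintf('total   ~ %.2f GB\n', (bytesInputs + bytesTargets)/2^30);  % ~5.66 GB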
Now, yes, MATLAB uses lazy copying, so an assignment such as:
training = zeros(4e6, 128);
copy1 = training;
copy2 = training;
will not require new memory. However, any slicing operation:
training = zeros(4e6, 128);
part1 = training(1:1000, :);
part2 = training(1001:2000, :);
will indeed allocate more memory. Hence when selecting your training, validation and testing subsets:
net.divideParam.trainRatio = 70/100;
net.divideParam.valRatio = 15/100;
net.divideParam.testRatio = 15/100;
internally the train() function could potentially be re-allocating the same amount of memory a second time. Your grand total would now be around 11 GB. If you now consider that your operating system is running along with a bunch of other applications, it is easy to understand why everything suddenly slows down. I might be telling you something obvious here, but: your dataset is very large.
Profiling Helps
Now, whilst I am pretty sure about my ~5.7 GB consumption calculation, I am not sure whether the doubling is a valid assumption; bottom line, I don't know the inner workings of the train() function that well.
This is why I urge you to test it out with MATLAB's very own profiler. This will indeed give you a much better understanding of function calls and memory consumption.
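A sketch of what that could look like here, profiling a run on a random subset first so the numbers stay manageable (inputs and targets are assumed to be the full arrays from the question):
idx = randperm(size(inputs, 2), 1e5);    % pick e.g. 100k random samples
profile -memory on
[netSub, trSub] = train(net, inputs(:, idx), targets(:, idx));
profile off
profview                                 % inspect allocated memory per function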
Reducing Memory Usage
What can be done to reduce memory consumption? This is probably the question that has been haunting programmers since the dawn of time. :) Once again, it is hard to provide a single answer, as the solution often depends on the task, the problem and the tools at hand. MATLAB has a (let's give it the benefit of the doubt) informative page on how to reduce memory usage. Very often, though, the problem lies in the size of the data to be loaded into memory.
I, for one, would of course start by reducing the size of your dataset. Do you really need 4e6 * 128 data points? If you do, then you might consider investing in dedicated solutions such as high-performance servers to perform your computation. If not, then you, and only you, can look at your dataset and start analysing which features might be unnecessary (to cut down the columns) and, most importantly, which samples might be unnecessary (to cut down the rows).
Being optimistic
On a side note, you did not complain about any OutOfMemory errors from MATLAB, which could be a good sign. Maybe your machine is simply hanging because the computation is THAT intensive. And this too is a reasonable assumption, as you are creating a network with a 128-unit hidden layer and 62 outputs, and running several epochs of training, as you should be.
Kill The JVM
What you can do to put less load on the machine is to run MATLAB without the Java Virtual Machine (JVM). This ensures that MATLAB itself requires less memory to run. The JVM can be disabled by launching MATLAB with:
matlab -nojvm
This works if you do not need to display any graphics, as MATLAB will run in a console-like environment.
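A possible invocation for a non-interactive training run (the script name is just a placeholder):
matlab -nojvm -r "run('trainScript.m'); exit"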

Why does a complex MATLAB gpuArray take twice as much memory as it should?

I noticed that a large complex array takes twice as much memory on the GPU as on the CPU.
Here is a minimal example:
%-- First Try: Complex Single
gpu = gpuDevice(1);
m1 = gpu.FreeMemory;
test = complex(single(zeros(600000/8,1000))); % 600 MByte complex single
whos('test')
test = gpuArray(test);
fprintf(' Used memory on GPU: %e\n', m1-gpu.FreeMemory);
Now I do the same with an array that has twice as many elements but is not complex:
%-- Second Try:, Single
gpu = gpuDevice(1);
m1 = gpu.FreeMemory;
test = single(zeros(600000/4,1000)); % 600 MByte real single
whos('test')
test = gpuArray(test);
fprintf(' Used memory on GPU: %e\n', m1-gpu.FreeMemory);
The output is:
  Name        Size              Bytes  Class     Attributes
  test      75000x1000      600000000  single    complex
Used memory on GPU: 1.200095e+09

  Name        Size              Bytes  Class     Attributes
  test     150000x1000      600000000  single
Used memory on GPU: 6.000476e+08
On the CPU both arrays are 600MB - on the GPU the complex array uses 1.2 GByte.
I tested this on two graphics cards: GeForce GTX 680 and Tesla K20 using Matlab 2013a.
How can I avoid this? Is this a bug in Matlab?
This was answered on MATLAB central. To summarize MathWorks developer Edric Ellis's answer:
gpu.FreeMemory may not be an accurate measure of the available GPU memory because MATLAB does not immediately free up memory when it's done using it. gpu.AvailableMemory is a more accurate measure of available memory.
Transferring complex data to/from the GPU still requires 2x the memory because there is a format conversion that is done on the GPU. Specifically, complex arrays in CPU host memory are stored with the real/imaginary parts split into 2 separate vectors, whereas complex arrays on the GPU device are stored in interleaved format.
Testing on R2017a, I confirmed that:
Switching from gpu.FreeMemory to gpu.AvailableMemory indeed addresses this discrepancy in reported memory usage that prompted the original question.
With 8 GB of GPU memory, copying...
6 GB real array: success
6 GB complex array: "Out of memory on device" error
6 separate 1 GB complex arrays: success
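Building on that last data point, one possible workaround (a sketch, not an official recommendation, and the block size is arbitrary) is to transfer a large complex array in blocks into a preallocated complex gpuArray, so the temporary conversion overhead applies only to one block at a time; gpu.AvailableMemory can be checked along the way:
c = complex(single(rand(75000, 1000)), single(rand(75000, 1000)));      % ~600 MB complex single on the host
g = complex(zeros(size(c,1), size(c,2), 'single', 'gpuArray'));          % preallocate complex storage on the device
blockCols = 100;                                                         % columns per transfer block
for k = 1:blockCols:size(c, 2)
    cols = k:min(k + blockCols - 1, size(c, 2));
    g(:, cols) = gpuArray(c(:, cols));                                   % per-block copy; conversion overhead is per block
end
gpu = gpuDevice;
fprintf('Available GPU memory after transfer: %.2f GB\n', gpu.AvailableMemory/2^30);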

Using more than one GPU in matlab

This is the output of ginfo using Jacket/MATLAB:
Detected CUDA-capable GPUs:
CUDA driver 270.81, CUDA toolkit 4.0
GPU0 Tesla C1060, 4096 MB, Compute 1.3 (single,double) (in use)
GPU1 Tesla C1060, 4096 MB, Compute 1.3 (single,double)
GPU2 Quadro FX 1800, 742 MB, Compute 1.1 (single)
Display Device: GPU2 Quadro FX 1800
The problem is:
Can I use the two Teslas at the same time (with parfor)? How?
How can I know how many cores are currently running/executing the program?
After running the following code with the Quadro set as (in use), I found it takes less time than the Tesla, even though the Tesla has 240 cores and the Quadro only 64. Maybe it's because it is the display device? Maybe because it runs single precision while the Tesla runs double precision?
clc; clear all; close all;
addpath('C:/Program Files/AccelerEyes/Jacket/engine');
i = im2double(imread('cameraman.tif'));
i_gpu = gdouble(i);
h = fspecial('motion', 50, 45); % Create predefined 2-D filter
h_gpu = gdouble(h);
tic;
for j = 1:500
    x_gpu = imfilter(i_gpu, h_gpu);
end
i2 = double(x_gpu); % memory transfer back to the host
t = toc
figure(2), imshow(i2);
Any help with the code will be appreciated. As you can see, it's a very trivial example used to demonstrate the power of the GPU, no more.
Using two Teslas at the same time: write a MEX file and call cudaSetDevice(0), launch one kernel, then call cudaSetDevice(1) and launch another kernel. Kernel launches and the asynchronous memory copies (i.e. cudaMemcpyAsync and cudaMemcpyPeerAsync) do not block the host. I've given an example of how to write a MEX file (i.e. a DLL) in one of my other answers; just add a second kernel to that example. FYI, you don't need Jacket if you can do some C/C++ programming. On the other hand, if you don't want to spend your time learning the CUDA SDK, or you don't have a C/C++ compiler, then you're stuck with Jacket, gp-you or GPUlib until Matlab changes the way parfor works.
An alternative is to call OpenCL from Matlab (again through a MEX file). Then you could launch kernels on all the GPUs and CPUs. Again, this requires some C/C++ programming.
Since MATLAB 2012, gpuArray and GPU-related functions have been fully integrated into MATLAB, so you might not need Jacket to achieve what you are trying to do.
In short, put gpuDevice(deviceID); before running GPU commands, and the following code will run on the deviceID-th GPU. For instance:
gpuDevice(1);
a = gpuArray(rand(3)); % a is on the first GPU's memory
gpuDevice(2);
b = gpuArray(rand(4)); % b is on the second GPU's memory
To run on multiple GPUs, simply put:
c = cell(1, num_device);
parfor i = 1:num_device
    gpuDevice(i);
    a = gpuArray(magic(3));
    b = gpuArray(rand(3));
    c{i} = gather(a*b);
end
You can see the GPU memory usage by typing nvidia-smi on the system command line.
This way of setting the GPU id may seem strange, but it is the conventional way to select a GPU. In CUDA, if you want to use a specific GPU, you call cudaSetDevice(gpuId) and the following code runs on the gpuId-th GPU (0-based indexing).
----------------------EDIT----------------
Tested and confirmed on MATLAB 2012b and MATLAB 2013b.
I checked with nvidia-smi that the code really is using different GPUs' memory. You might have to scale the arrays up (e.g. rand(5000)) and check very quickly, since the temporary variables a and b disappear once the parfor loop ends.