Why does a complex Matlab gpuArray take twice as much memory as it should?

I noticed that a large complex array takes twice as much memory on the GPU as on the CPU.
Here is a minimal example:
%-- First Try: Complex Single
gpu = gpuDevice(1);
m1 = gpu.FreeMemory;
test = complex(single(zeros(600000/8,1000))); % 600 MByte complex single
whos('test')
test = gpuArray(test);
fprintf(' Used memory on GPU: %e\n', m1-gpu.FreeMemory);
Now I do the same with a twice as big array which is not complex:
%-- Second Try: Single
gpu = gpuDevice(1);
m1 = gpu.FreeMemory;
test = single(zeros(600000/4,1000)); % 600 MByte real single
whos('test')
test = gpuArray(test);
fprintf(' Used memory on GPU: %e\n', m1-gpu.FreeMemory);
The output is:
  Name       Size            Bytes        Class     Attributes
  test       75000x1000      600000000    single    complex
 Used memory on GPU: 1.200095e+09

  Name       Size            Bytes        Class     Attributes
  test       150000x1000     600000000    single
 Used memory on GPU: 6.000476e+08
On the CPU both arrays occupy 600 MByte; on the GPU the complex array uses 1.2 GByte.
I tested this on two graphics cards: GeForce GTX 680 and Tesla K20 using Matlab 2013a.
How can I avoid this? Is this a bug in Matlab?

This was answered on MATLAB central. To summarize MathWorks developer Edric Ellis's answer:
gpu.FreeMemory may not be an accurate measure of the available GPU memory because MATLAB does not immediately free up memory when it's done using it. gpu.AvailableMemory is a more accurate measure of available memory.
Transferring complex data to/from the GPU still requires 2x the memory because there is a format conversion that is done on the GPU. Specifically, complex arrays in CPU host memory are stored with the real/imaginary parts split into 2 separate vectors, whereas complex arrays on the GPU device are stored in interleaved format.
Testing on R2017a, I confirmed that:
Switching from gpu.FreeMemory to gpu.AvailableMemory indeed addresses this discrepancy in reported memory usage that prompted the original question.
With 8 GB of GPU memory, copying...
6 GB real array: success
6 GB complex array: "Out of memory on device" error
6 separate 1 GB complex arrays: success
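Building on that last data point, one workaround is to move large complex arrays in pieces so the transient staging copy stays small. Below is a minimal sketch along those lines; hostData, the chunk count, and the use of column blocks are illustrative assumptions, not part of the original answer.
% Hypothetical sketch: copy a large complex single matrix to the GPU in
% column blocks, so the temporary 2x staging buffer only ever covers one block.
gpu     = gpuDevice;
nChunks = 6;                                     % arbitrary choice
cols    = size(hostData, 2);
edges   = round(linspace(0, cols, nChunks + 1));
gpuData = complex(zeros(size(hostData), 'single', 'gpuArray'));  % preallocate on the device
for k = 1:nChunks
    idx = edges(k)+1 : edges(k+1);
    gpuData(:, idx) = gpuArray(hostData(:, idx));  % only this block pays the 2x cost
end
fprintf('Available GPU memory: %e bytes\n', gpu.AvailableMemory);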

Related

Why is it faster to transfer data from CPU to GPU rather than GPU to CPU?

I've noticed that transferring data to recent high-end GPUs is faster than gathering it back to the CPU. Here are the results from a benchmarking function provided to me by MathWorks tech support, run on an older Nvidia K20 and a recent Nvidia P100 with PCIe:
Using a Tesla P100-PCIE-12GB GPU.
Achieved peak send speed of 11.042 GB/s
Achieved peak gather speed of 4.20609 GB/s
Using a Tesla K20m GPU.
Achieved peak send speed of 2.5269 GB/s
Achieved peak gather speed of 2.52399 GB/s
I've attached the benchmark function below for reference. What is the reason for the asymmetry on the P100? Is this system dependent or is it the norm on recent high end GPUs? Can the gather speed be increased?
gpu = gpuDevice();
fprintf('Using a %s GPU.\n', gpu.Name)
sizeOfDouble = 8; % Each double-precision number needs 8 bytes of storage
sizes = power(2, 14:28);
sendTimes = inf(size(sizes));
gatherTimes = inf(size(sizes));
for ii = 1:numel(sizes)
    numElements = sizes(ii)/sizeOfDouble;
    hostData = randi([0 9], numElements, 1);
    gpuData = randi([0 9], numElements, 1, 'gpuArray');
    % Time sending to GPU
    sendFcn = @() gpuArray(hostData);
    sendTimes(ii) = gputimeit(sendFcn);
    % Time gathering back from GPU
    gatherFcn = @() gather(gpuData);
    gatherTimes(ii) = gputimeit(gatherFcn);
end
sendBandwidth = (sizes./sendTimes)/1e9;
[maxSendBandwidth,maxSendIdx] = max(sendBandwidth);
fprintf('Achieved peak send speed of %g GB/s\n',maxSendBandwidth)
gatherBandwidth = (sizes./gatherTimes)/1e9;
[maxGatherBandwidth,maxGatherIdx] = max(gatherBandwidth);
fprintf('Achieved peak gather speed of %g GB/s\n',max(gatherBandwidth))
Edit: we now know it is not system dependent (see comments). I still want to know the reason for the asymmetry and whether it can be changed.
This is a community wiki (CW) answer for anybody interested in posting benchmarks from their machine. Contributors are encouraged to leave their details in case a future question arises regarding their results.
System: Win10, 32GB DDR4-2400MHz RAM, i7 6700K. MATLAB: R2018a.
Using a GeForce GTX 660 GPU.
Achieved peak send speed of 7.04747 GB/s
Achieved peak gather speed of 3.11048 GB/s
Warning: The measured time for F may be inaccurate because it is running too fast. Try measuring something that takes longer.
Contributor: Dev-iL
System: Win7, 32GB RAM, i7 4790K. MATLAB: R2018a.
Using a Quadro P6000 GPU.
Achieved peak send speed of 1.43346 GB/s
Achieved peak gather speed of 1.32355 GB/s
Contributor: Dev-iL
I am not familiar with the Matlab GPU toolboxes, but I suspect that the second transfer (the one that gets data back from the GPU) starts before the first one has ended.
% Time sending to GPU
sendFcn = @() gpuArray(hostData);
sendTimes(ii) = gputimeit(sendFcn);
%
% No synchronization here
%
% Time gathering back from GPU
gatherFcn = @() gather(gpuData);
gatherTimes(ii) = gputimeit(gatherFcn);
A similar question, for a C program, was posted here:
copy from GPU to CPU is slower than copying CPU to GPU
In that case, there is no explicit synchronization between launching a kernel on the GPU and copying the result data back from the GPU.
So the function that gets the data back, cudaMemcpy() in C, has to wait for the GPU to finish the previously launched kernel before transferring the data, which inflates the time measured for the transfer.
With the CUDA C API, it is possible to force the CPU to wait for the GPU to finish the previously launched kernel(s) with:
cudaDeviceSynchronize();
And only then start measuring the time to transfer data back.
Maybe in Matlab there is also some synchronization primitive.
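For reference, MATLAB does have such a primitive: wait(gpuDevice) blocks until all work queued on the device has finished. gputimeit already accounts for GPU asynchrony, so explicit synchronization mainly matters for manual tic/toc timing. A minimal sketch:
% Sketch: explicit synchronization when timing GPU transfers by hand.
gd = gpuDevice;
gpuData = rand(1e7, 1, 'gpuArray');
wait(gd);                      % make sure no earlier GPU work is still pending
tic
hostData = gather(gpuData);    % gather itself blocks until the copy is done
elapsed = toc;
fprintf('Gather took %.4f s\n', elapsed);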
The same answer also recommends measuring time with (CUDA) events.
In this post on optimizing data transfers (also in C, sorry), events are used to measure the data transfer times:
https://devblogs.nvidia.com/how-optimize-data-transfers-cuda-cc/
In that post, the measured transfer time is the same in both directions.

Error using zeros Out of memory

When I try running
Adj = zeros(x*y);
I am receiving the following error:
Error using zeros
Out of memory. Type HELP MEMORY for your options.
where x*y = 37901. The occupancy of my PC storage is shown in a screenshot (omitted here). I know the C drive doesn't have much space, but 34.2 GB should be more than enough for creating a 37901*37901 matrix.
When I run the memory command this is what I got:
>> memory
Maximum possible array: 4825 MB (5.059e+09 bytes) *
Memory available for all arrays: 4825 MB (5.059e+09 bytes) *
Memory used by MATLAB: 12369 MB (1.297e+10 bytes)
Physical Memory (RAM): 12218 MB (1.281e+10 bytes)
* Limited by System Memory (physical + swap file) available.
How can I solve this issue? (I am using MATLAB 2017b)
On the coding side, variables are normally stored in memory (your computer's RAM) rather than on the hard disk. That's what your error is complaining about: you don't have enough memory to store the variable you want to allocate.
The default numerical type in Matlab is double, which represents double-precision floating-point values and takes up 8 bytes of memory. Hence, you are trying to allocate:
37901 * 37901 * 8 = 11491886408 bytes
~= 10.7 gigabytes
But you only have something like 11.9 gigabytes of physical memory, and Matlab is telling you that you can't allocate an array larger than about 4.7 gigabytes. As a workaround, I suggest you take a look at Tall Arrays, a Matlab feature tailored to handling very big data flows:
Tall arrays are used to work with out-of-memory data that is backed by a datastore. Datastores enable you to work with large data sets in small chunks that individually fit in memory, instead of loading the entire data set into memory at once. Tall arrays extend this capability to enable you to work with out-of-memory data using common functions.

What is a Tall Array?

Since the data is not loaded into memory all at once, tall arrays can be arbitrarily large in the first dimension (that is, they can have any number of rows). Instead of writing special code that takes into account the huge size of the data, such as with techniques like MapReduce, tall arrays let you work with large data sets in an intuitive manner that is similar to the way you would work with in-memory MATLAB® arrays. Many core operators and functions work the same with tall arrays as they do with in-memory arrays. MATLAB works with small chunks of the data at a time, handling all of the data chunking and processing in the background, so that common expressions, such as A+B, work with big data sets.

Benefits of Tall Arrays

Unlike in-memory arrays, tall arrays typically remain unevaluated until you request that the calculations be performed using the gather function. This deferred evaluation allows you to work quickly with large data sets. When you eventually request output using gather, MATLAB combines the queued calculations where possible and takes the minimum number of passes through the data. The number of passes through the data greatly affects execution time, so it is recommended that you request output only when necessary.
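A minimal sketch of that workflow (the file pattern and variable name below are placeholders, not from your problem):
% Hypothetical example of the tall-array workflow.
ds = datastore('hugeData*.csv');     % datastore over files too large for memory
t  = tall(ds);                       % tall table backed by the datastore
m  = mean(t.SomeColumn, 'omitnan');  % queued, not evaluated yet
m  = gather(m);                      % triggers the actual passes over the data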

Matlab R2017a memory profiler gives a ridiculous number for allocated memory

My code is:
function eigs_mem_test
    N = 20000;
    density = 0.2;
    numOfModes = 250;
    A = sprand(N, N, density);
    profile -memory on
    eigs(A, numOfModes, 0.0)
    profile off
    profsave(profile('info'), 'eigs_test')
    profview
end
And this returns the following (profiler screenshot omitted): it says that MATLAB allocated 18014398508117708.00 Kb, or roughly 1.8e10 Gb, which is completely impossible. How did this happen? The code finishes with correct output, and in htop I can see the memory usage vary quite a bit but stay under 16G.
For N = 2000, I get sensible results (i.e. 0.2G allocated.)
How can I profile this case effectively, if I want to obtain an upper bound on memory used for large sparse matrices?
I use MATLAB R2017a.
I cannot reproduce your issue in R2017b, with 128GB of RAM on my machine. Here is the result after running your example code (profiler screenshot omitted): the function peaked at 14726148 Kb, which would be ~1.8 GB if that really were kilobits. I'm confused by the units MATLAB has used here, because I saw nearer 14 GB of usage in the task manager, which matches your large observed usage (1.4e7 KB is roughly 14 GB); I can only think the profiler means KB (kilobytes) where it prints Kb (kilobits).
Ridiculously large, unexpected values like this are often the result of overflow, so this could be an internal overflow bug.
You could use whos to get the in-memory size of a variable:
w = whos('A'); % get details of variable A
sizeInMemory = w.bytes; % size of A in bytes
This doesn't necessarily tell you how much memory a function like eigs in your example uses though. You could poll memory within your function to get the current usage.
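To illustrate the polling idea, here is a rough sketch using the variables from your eigs_mem_test function; note that memory is Windows-only and that before/after snapshots will miss the transient peak inside eigs.
% Rough sketch: snapshot MATLAB's memory use around the expensive call (Windows only).
m0 = memory;                           % struct with MemUsedMATLAB, among other fields
d  = eigs(A, numOfModes, 0.0);         % the call under investigation
m1 = memory;
fprintf('MATLAB memory grew by %.2f GB\n', ...
        (m1.MemUsedMATLAB - m0.MemUsedMATLAB) / 2^30);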
I'll resist exploring this further, since the question of how to profile for memory usage has already been asked and answered.
N.B. I'm not sure why my machine was ~100x slower than yours; I assume the image of your memory usage didn't come from actually running your example code? Or my RAM is awful...

Largest Matrix Matlab Linprog can Support

I want to use MATLAB's linprog to solve a problem, and I have checked my formulation with a much smaller, much simpler example.
But I wonder if MATLAB can support my real problem, there may be a 300*300*300*300 matrix...
Maybe I should give the exact problem. There is a directed graph of network nodes, and I want to get the lowest utilization of the edge capacity under some constraints. Let m be the number of edges, and n be the number of nodes. There are mn² variables and nm² constraints. Unfortunately, n may reach 300...
I want to use MATLAB linprog to solve it, but as described above I am afraid MATLAB cannot support it... Lastly, the matrix must be sparse; can it be simplified in some way?
First: a 300*300*300*300 array is not called a matrix, but a tensor (or simply an array). Therefore you cannot use matrix/vector algebra on it, because that is not defined for arrays with dimensionality greater than 2, and you certainly cannot pass it to linprog without some kind of interpretation step.
Second: if I interpret that 300⁴ as the number of elements in the matrix (and not the size along each dimension), it really depends whether MATLAB (or any other software) can support it.
As already answered by ben, if your matrix is full, then the answer is likely to be no. 300^4 doubles would consume almost 65GB of memory, so it's quite unlikely that any software package is going to be capable of handling that all from memory (unless you actually have > 65 GB of RAM). You could use a blockproc-type scheme, where you only load parts of the matrix in memory and leave the rest on harddisk, but that is insanely slow. Moreover, if you have matrices that huge, it's entirely possible you're overlooking some ways in which your problem can be simplified.
If your matrix is sparse (i.e., contains lots of zeros), then maybe. Have a look at MATLAB's sparse command.
So, what exactly is your problem? Where does that enormous matrix come from? Perhaps I or someone else sees a way in which to reduce that matrix to something more manageable.
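To illustrate the sparse suggestion: linprog can take a sparse constraint matrix, so you only ever store the nonzero coefficients. A toy sketch with made-up sizes and coefficients:
% Toy sketch (sizes and coefficients are made up): build the constraint
% matrix directly in sparse triplet form instead of allocating a full one.
nVars = 1e6;                          % number of variables
nCons = 5e5;                          % number of inequality constraints
i = [1  1  2  3];                     % row indices of the nonzero entries
j = [1  2  2  3];                     % column indices
v = [1 -1  2  1];                     % coefficient values
A = sparse(i, j, v, nCons, nVars);    % only the nonzeros are stored
b = zeros(nCons, 1);
f = ones(nVars, 1);
% x = linprog(f, A, b);               % pass the sparse A straight to linprog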
On my system, with 24GByte RAM installed, running Matlab R2013a, memory gives me:
Maximum possible array: 44031 MB (4.617e+10 bytes) *
Memory available for all arrays: 44031 MB (4.617e+10 bytes) *
Memory used by MATLAB: 1029 MB (1.079e+09 bytes)
Physical Memory (RAM): 24574 MB (2.577e+10 bytes)
* Limited by System Memory (physical + swap file) available.
On a 64-bit version of Matlab, if you have enough RAM, it should be possible to at least create a full matrix as big as the one you suggest, but whether linprog can do anything useful with it in a realistic time is another question entirely.
As well as investigating the use of sparse matrices, you might consider working in single precision: that halves your memory usage for a start.
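A quick illustration of the single-precision point:
% The same array at half the memory when stored in single precision.
xd = zeros(1000, 1000);              % double: 8 bytes per element, ~8 MB
xs = zeros(1000, 1000, 'single');    % single: 4 bytes per element, ~4 MB
whos xd xs                           % compare the reported byte counts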
Well, you could simply try: X = zeros(300*300*300*300)
On my system it gives a very clear statement:
>> X=zeros( 300*300*300*300 )
Error using zeros
Maximum variable size allowed by the program is exceeded.
Since zeros is a built-in function, which only fills an array of the given size with zeros, you can assume that handling such an array will not be possible.
You can also use the memory command:
>> memory
Maximum possible array: 21549 MB (2.260e+10 bytes) *
Memory available for all arrays: 21549 MB (2.260e+10 bytes) *
Memory used by MATLAB: 685 MB (7.180e+08 bytes)
Physical Memory (RAM): 12279 MB (1.288e+10 bytes)
* Limited by System Memory (physical + swap file) available.
>> 2.278e+10 /8
%max bytes avail for arrays divided by 8 bytes for double-precision real values
ans =
2.8475e+09
>> 300*300*300*300
ans =
8.1000e+09
which means I don't even have the memory to store such an array.
While this may not answer your question directly, it might still give you some insight.

Matlab: your opinion about a small memory issue working with matrix

I have a small question regarding MATLAB memory consumption.
My Architecture:
- Linux OpenSuse 12.3 64bit
- 16 GB of RAM
- Matlab 2013a 64 bit
I handle a matrix of doubles of size 62 x 11969100 (called y).
When I try the following:
a = bsxfun(@minus, y, -1)
or simply
a = minus(y, -1)
I get an OUT OF MEMORY error (in both cases).
I've just computed the RAM space allocated for the matrix:
62 x 11969100 x 8 = 5.53 GB
Where am I wrong?!
Thanks a lot!
I'm running on Win64, with 16GB RAM.
Starting with a fresh MATLAB, with only a couple of other inconsequential applications open, my baseline memory usage is about 3.8GB. When I create y, that increases to 9.3GB (9.3-3.8 = 5.5GB, about what you calculate). When I then run a = minus(y, -1), I don't run out of memory, but it goes up to about 14.4GB.
You wouldn't need much extra memory to have been taken away (1.6GB at most) for that to cause an out of memory error.
In addition, when MATLAB stores an array, it requires a contiguous block of memory to do so. If your memory was a little fragmented - perhaps you had a couple of other tiny variables that happened to be stored right in the middle of one of those 5.5GB blocks - you would also get an out of memory error (you can sometimes avoid that issue with the use of pack).
The output of memory on the Windows platform:
>> memory
Maximum possible array: 2046 MB (2.145e+009 bytes) *
Memory available for all arrays: 3226 MB (3.382e+009 bytes) **
Memory used by MATLAB: 598 MB (6.272e+008 bytes)
Physical Memory (RAM): 3561 MB (3.734e+009 bytes)
* Limited by contiguous virtual address space available.
** Limited by virtual address space available.
The output of computer on Linux/Mac:
>> [~,maxSize] = computer
maxSize =
2.814749767106550e+14 % Max. number of elements in a single array
with some hacks (found here):
>> java.lang.Runtime.getRuntime.maxMemory
ans =
188416000
>> java.lang.Runtime.getRuntime.totalMemory
ans =
65011712
>> java.lang.Runtime.getRuntime.freeMemory
ans =
57532968
As you can see, aside from memory limitations per variable, there are also limitations on total storage for all variables. This is not different for Windows or Linux.
The important thing to note is that, for example, on my Windows machine it is impossible to create two 1.7GB variables, even though I have enough RAM and neither exceeds the maximum variable size.
Since carrying out the minus operation will assign a result of equal size to a new variable (a in your case, or ans when not assigning to anything), there need to be at least two of these humongous things in memory.
My guess is you run into the second limit of total memory space available for all variables.
bsxfun is vectorized for efficiency. Typically vectorized solutions require more than just minimal memory.
You could try using repmat, or if that does not work a simple for loop.
In general I believe the for loop will require the least memory.
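A minimal sketch of that loop idea (the block size is an arbitrary choice): update y a block of columns at a time and overwrite it in place, so a second 5.5 GB array is never created.
% Add 1 to y (i.e. minus(y, -1)) one block of columns at a time; the
% temporary only ever covers one block, not the whole 5.5 GB matrix.
blockSize = 100000;                           % columns per block (arbitrary)
nCols = size(y, 2);
for c = 1:blockSize:nCols
    idx = c:min(c + blockSize - 1, nCols);
    y(:, idx) = y(:, idx) + 1;                % in-place update of this slice
end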