CUDA: Reasons for using preprocessing variables to specify the problem size

CUDA: Reasons for using preprocessing variables to specify the problem size - matlab

I'm coding CUDA in Matlab mex-Files. When you look at CUDA examples on the internet or even manuals from nvidia, you often see the use of preprocessing variables to specify the problem size, e.g. the vector length for a vector addition or something like this. I coded my program also like this: Preprocessing Variables for specifying the problem size. And I have to admit it: I like it since you can access those everywhere in your code, e.g. as limits in a loop or something like this, without having to explicitly pass them via argument to the function.
But I ran into the following problem: I wanted to bench the program for several different problem sizes and thus I need to compile the code everytime again by passing the preprocessing-variable to the compiler. It's not a problem, I already coded the benchmark and it works. But I just wonder afterwards now, why I chose this version and did not simply specify it by a user input on runtime. And thus I'm looking for reasons one might want to use preprocessing variables instead of simply passing the problem size to the program.
Thanks!

When you compile-in problem-size constants in the kernel, then the compiler can make certain classes of optimizations that it can't if the sizes are only known at runtime. Full loop unrolling is an obvious example.
In other cases, for instance shared memory array sizes, it is a lot clearer if the sizes are compiled-in; otherwise you have to pass in the total shared memory size at kernel launch time and break that memory up into the number of shared arrays you need. That works fine, but the code is much clearer if you can just have static declarations, for which you need the compile-time sizes.

The main reason is that in general the problem size will be intimately linked to the GPU architecture, e.g. number of threads per block, number of blocks, amount of shared memory per thread, number of registers per thread, etc. In general these numbers are all carefully hand tuned to get the maximum usage of available resources and you can't easily change the problem size dynamically while still maintaining optimum performance.

Related

How to 'copy' matrix without creating a temporary matrix in memory that caused memory overflow?

By assigning a matrix into a much bigger allocated memory, matlab somehow will duplicate it while 'copying' it, and if the matrix to be copied is large enough, there will be memory overflow. This is the sample code:
main_mat=zeros(500,500,2000);
n=500;
slice_matrix=zeros(500,500,n);
for k=1:4
parfor i=1:n
slice_matrix(:,:,i)=gather(gpuArray(rand(500,500)));
end
main_mat(:,:,1+(k-1)*n:1+(k-1)*n+n-1)=slice_matrix; %This is where the memory will likely overflow
end
Any way to just 'smash' the slice_matrix onto the main_mat without the overhead? Thanks in advance.
EDIT:
The overflow occurred when main_mat is allocated beforehand. If main_mat is initialized with main_mat=zeros(500,500,1); (smaller size), the overflow will not occur, but it will slowed down as allocation is not done before matrix is assigned into it. This will significantly reduce the performance as the range of k increases.

The main issue is that numbers take more space than zeros.
main_mat=zeros(500,500,2000); takes little RAM while main_mat = rand(500,500,2000); take a lot, no matter if you use GPU or parfor (in fact, parfor will make you use more RAM). So This is not an unnatural swelling of memory. Following Daniel's link below, it seems that the assignment of zeros only creates pointers to memory, and the physical memory is filled only when you use the matrix for "numbers". This is managed by the operating system. And it is expected for Windows, Mac and Linux, either you do it with Matlab or other languages such as C.

Removing parfor will likely fix your problem.
parfor is not useful there. MATLAB's parfor does not use shared memory parallelism (i.e. it doesn't start new threads) but rather distributed memory parallelism (it starts new processes). It is designed to distribute work over a set or worker nodes. And though it also works within one node (or a single desktop computer) to distribute work over multiple cores, it is not an optimal way of doing parallelism within one node.
This means that each of the processes started by parfor needs to have its own copy of slice_matrix, which is the cause of the large amount of memory used by your program.
See "Decide When to Use parfor" in the MATLAB documentation to learn more about parfor and when to use it.

I assume that your code is just a sample code and that rand() represents a custom in your MVE. So there are a few hints and tricks for the memory usage in matlab.
There is a snippet from The MathWorks training handbooks:
When assigning one variable to another in MATLAB, as occurs when passing parameters into a function, MATLAB transparently creates a reference to that variable. MATLAB breaks the reference, and creates a copy of that variable, only when code modifies one or more of teh values. This behavior, known as copy-on-write, or lazy-copying, defers the cost of copying large data sets until the code modifies a values. Therefore, if the code performs no modifications, there is no need for extra memory space and execution time to copy variables.
The first thing to do would be to check the (memory) efficiency of your code. Even the code of excellent programmers can be futher optimized with (a little) brain power. Here are a few hints regarding memory efficiency
make use of the nativ vectorization of matlab, e.g. sum(X,2), mean(X,2), std(X,[],2)
make sure that matlab does not have to expand matrices (implicit expanding was changed recently). It might be more efficient to use the bsxfun
use in-place-operations, e.g. x = 2*x+3 rather than x = 2*x+3
...
Be aware that optimum regarding memory usage is not the same as if you would want to reduce computation time. Therefore, you might want to consider reducing the number of workers or refrain from using the parfor-loop. (As parfor cannot use shared memory, there is no copy-on-write feature with using the Parallel Toolbox.
If you want to have a closer look at your memory, what is available and that can be used by Matlab, check out feature('memstats'). What is interesting for you is the Virtual Memory that is
Total and available memory associated with the whole MATLAB process. It is limited by processor architecture and operating system.
or use this command [user,sys] = memory.
Quick side node: Matlab stores matrices consistently in memory. You need to have a large block of free RAM for large matrices. That is also the reason why you want to allocate variables, because changing them dynamically forces Matlab to copy the entire matrix to a larger spot in the RAM every time it outgrows the current spot.
If you really have memory issues, you might just want to dig into the art of data types -- as is required in lower level languages. E.g. you can cut your memory usage in half by using single-precision directly from the start main_mat=zeros(500,500,2000,'single'); -- btw, this also works with rand(...,'single') and more native functions -- although a few of the more sophisticated matlab functions require input of type double, which you can upcast again.

If I understand correctly your main issue is that parfor does not allow to share memory. Think of every parfor worker as almost a separate matlab instance.
There is basically just one workaround for this that I know (that I have never tried), that is 'shared matrix' on Fileexchange: https://ch.mathworks.com/matlabcentral/fileexchange/28572-sharedmatrix
More solutions: as others suggested: remove parfor is certainly one solution, get more ram, use tall arrays (that use harddrives when ram runs full, read here), divide operations in smaller chunks, last but not least, consider an alternative other than Matlab.

You may use following code. You actually don't need the slice_matrix
main_mat=zeros(500,500,2000);
n=500;
slice_matrix=zeros(500,500,n);
for k=1:4
parfor i=1:n
main_mat(:,:,1+(k-1)*n + i - 1) = gather(gpuArray(rand(500,500)));
end
%% now you don't need this main_mat(:,:,1+(k-1)*n:1+(k-1)*n+n-1)=slice_matrix; %This is where the memory will likely overflow
end

Labview - Managing large numbers of constants

This is more of a formatting problem than code logic and probably seems silly (considering I've seen far more dense block diagrams). I'm working with a lot of numeric constants and they're starting to clutter my Block Diagram. Is there something I can use to group them nice and compactly?
Preferably I would like to avoid clustering them because I would need to bundle and unbundle every time I needed access.
EDIT: Picture of code in question (code segment is used repeatedly, so would be nice to have a more compact case structure)

I think you should rethink how much of your block diagram you expect to devote to constants :-)
Using numbers directly in code, the equivalent of unlabelled constants on the LabVIEW block diagram, is a recognised anti-pattern. Unless the reason for the constant value is both obvious and fundamental to the operation being carried out, anyone looking at your code (including you, any time after a couple of weeks since you wrote it) will not understand why the value was chosen. Therefore, you should make this clear by labelling the constant somehow (equivalent to assigning it to a name in a text language) and also make it easy to change the value if necessary.
It's usually clear what a 0 or 1 constant is doing there but in the code image you've posted you have two constants of 1000 and one of 999. Why is it 1000, and if I decide that it should be (say) 2000 instead, do I need to update the other two values as well? If so you should define it once, label it with a suitable name describing what it is (in your example it might be chunk size or something) and wire that value to wherever you need to use it. Where you have a constant 999 you could get that value with a Decrement function, or you could also change your Greater Than function to a Greater or Equal and compare directly with the 1000 value. In this way your initial constant definition will take up more space because of the label, but you'll save space and improve maintainability by wiring that value to wherever you need it rather than placing additional constants.
If you need to refer to the same constants in multiple places on your block diagram, you can place the constants (and just the constants, not any other program logic) in a subVI, with each constant wired to an indicator with a suitable label, and each indicator wired to a different output on the connector pane. When you hover the wiring tool over the SubVI's terminals you'll see the label in the tip-strip. Alternatively, especially if you need loads of different constant values, you can do the same thing but in your SubVI bundle the different constants into a named cluster (which you save as a typedef), and then use Unbundle by Name to access specific constant values from the cluster where you need them. Again this doesn't necessarily save block diagram space, but it does make your code more readable and maintainable.

Simple answer was to reorganize my block diagram making more space for the constants. Dave_St suggested creating subvi's for the case structures for anyone looking for alternatives. Wanted to mark this as resolved regardless.

MATLAB and clearing the swap space

In the debugging mode I stop at some breakpoint and do some matrix manipulation in order to test the program. These manipulations are computationally expensive so MATLAB uses the swap space on my linux system. Then, after continuing the program running, the swap space is almost full so MATLAB crushes. Is there a way I could clean the swap at the debugging node? Doing clear all and clear classes makes effect only on RAM memory, but do not affect the swap.

You can't. Swap isn't special, so just work through this as as a regular out-of-memory issue. If you free up memory, you'll indirectly free up the swap that's being used to back it (or avoid having to use swap to supplement it).
Swap space is just an OS-managed backing store for virtual memory. From a normal program's point of view, swap is RAM (just slow RAM) and you don't manage it separately. (Well... you can "wire" pages to prevent them from being swapped out and so on, or use OS APIs to directly manipulate swap, but those are low-level platform-specific details, (like, below malloc), and not exposed to you as a Matlab M-code programmer, and not what you want to do here.) If your Matlab program runs out of memory, that means it's used up or fragmented its process's virtual memory, not something in particular about your swap space. (Unless there's a low-level bug somewhere.)
When this happens, you may need to look elsewhere in your Matlab program (e.g. in global variables, figure handle properties, or other levels of the function call stack) to find additional data that hasn't been cleared yet, or just restart the Matlab process to fix memory fragmentation (which can happen if your code fills up the memory with lots of small arrays).
Like #siliconwafer suggests, memory, whos, and feature memstats are good tools for debugging this. And if you're stopped inside the debugger, realize you can't actually clear everything until you dbquit out of it.
Doing large matrix operations inside the debugger is not necessarily a recoverable operation: if you've modified arrays held in local variables in the stack frame(s) you're working on, but there are still copies of them held in other variables or frames, Matlab's copy-on-write mechanism needs to hold on to both copies of the arrays, and you might be out of luck for that run of the program if you hit your RAM limits.
If clear all and clear classes after exiting the debugger are not recovering enough memory for you, that smells like either memory fragmentation or a C-level memory leak (like in a MEX file). In either case, you need to restart Matlab to resolve it. Avoid the use of large cellstr arrays or other arrays-of-small-arrays to reduce fragmentation. And take a good hard look at your C code if you're using any custom MEX functions.
Or you just might not have enough memory to do the operations you're doing.

'Out of memory' in Matlab. A slow but a permanent solution?

I am wondering if my suggestion to 'Out of Memory' problem is impossible. Here is my suggestion:
The idea is seamlessly saving huge matrices (say BIG = rand(10^6)) to HDD as a .mat(-v7.3) file when it is not possible to keep it in memory and call it seamlessly whenever required. Then, when you want to use it like:
a = BIG(3678,2222);
s = size(BIG);
, it seamlessly does this behind the scene:
m = matfile('BIG.m');
a = m.BIG(3678,2222);
s = size(m,'BIG');
I know that speed is important but suppose that I have enough time but not enough memory. And also its better to write an memory efficient program but again suppose that I need to use someone else's function which cant be optimized. I do actually have some more related questions: Can this be implemented using objects? Or does it require a infrastructural change in Matlab?

Seems to me like this is certainly possible, as this essentially what many operating systems do in the form of paging.
Moreover, something similar is provided by MATLAB Distributed Computing Server. This allows you (among other things) to store the data for a single matrix on multiple machines, and access it seamlessly in the way that you propose.
IMHO, allowing for data to be paged to file/swap should be a setting in MATLAB. Unfortunately, that is not how MATLAB's memory model works, and I suspect it is very difficult to implement this on their side. Plus, when this setting is enabled, users will not be protected anymore against making silly mistakes like zeros(1e7) instead of zeros(1e7,1); it will simply seem to hang the system, as MATLAB is busy filling your entire drive with zeros.
Anyway, I think it is possible using MATLAB classes. But I wouldn't recommend it. Note that implementing a proper subsref and subsasgn is *ahum* challenging, plus, it is likely that you'll have to re-implement many algorithms (like mldivide). This will most likely mean you'll lose a great deal of performance; think factors of in the thousands.
Here's an interesting random relevant paper I found while googling around a bit.

What could probably be a solution to your problem is memory mapped io (which matlab supports).
There a file is mapped to the memory and all read/writes to that memory address are actually read/writes to the file.
This only reserves/blocks memory addresses, it does not consume physical memory.
However, I would only advise it with 64 bit matlab, since with 32 bit matlab the address space is simply not large enough to use ram for data, code for matlab and dlls, and memory mapped io.
Check out the examples for the documentation page of memmapfile(), e.g.,
m = memmapfile('records.dat', ...
'Offset', 1024, ...
'Format', {'uint32' [4 10 18] 'x'});
A = m.Data(1).x;
whos A
Name Size Bytes Class
A 4x10x18 2880 uint32 array
Note that, accesses to m.Data(1).x redirect to file IO, i.e. no memory is consumed. So it provides efficient random access to parts of possibly very large data files residing on disk. Also note, that more complex data structures as in the example can be realized.
It seems to me that this provides the "implementation with objects" you had in mind.
Unfortunately, this does not allow to memmap MATfiles directly, which would be really useful. Probably this is difficult because of the compression.

Write a function:
function a=BIG(x,y)
m = matfile('BIG.mat');
a = m.BIG(x,y);
end
Every time you write BIG(a,b) the function is called.

Workaround For No Dynamic Memory Support In Embedded MATLAB Function Blocks

Background:
I have inherited a Discrete-Event Simulation MATLAB model and wish to automate and speed it's execution. Rather than calling sim(modelName) and having MATLAB run interpreted code, I would like a solution akin to calling system('modelName.exe ...'). My motivation for this comes from initial tests which suggest that a speed increase of nearly 1000%. I have managed to use the Real-Time Workshop with the Rapid Simulation target to produce an exe with static memory allocation. The problem is that there are Embedded MATLAB Function Blocks in the model for which the parameters will vary in size and shape in each run. And there will be hundreds if not thousands of runs.
According to the MathWorks documentation:
Dynamic Memory Allocation Not Supported for Embedded MATLAB Function Blocks:
"You cannot use dynamic memory allocation for variable-size data in Embedded MATLAB Function blocks. Use bounded instead of unbounded variable-size data."
Question:
What would be a potential workaround for this limitation?
Thoughts:
Use a static variable sizes which are sufficiently large, and additionally pass int variables / tunable parameters to explicitly window the portion of the data to range over.
S-Functions?
What I'm implementing today: Programmatically recompile the simulation each time it's called to generate static code, dynamically.
Port everything to a real/modern programming language such as python or c++.
Keywords:
MATLAB dynamic memory allocation embedded Discrete Event Simulation Real-Time Workshop Simulink SimEvents Tunable Parameters

Following up on this years later... We went with the dynamic static recompilation I had implemented that day for a year or so, then another stats developer rewrote it in c++. Using the maximum possible memory every run simply wasn't a feasible waste of computing resources.

You should view this webinar : http://www.mathworks.com/company/events/webinars/wbnr43180.html . It explains an automatic solution similar to your first thought.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse