Getting the imageDenoising CUDA example to work using a MATLAB CUDAKernel - matlab

TL;DR
I'm looking for a way to extract a part of an existing CUDA Toolkit example and turn it into a CUDAKernel executable in MATLAB.
The Story
In an attempt to obtain a short-runtime implementation of the non-local means (NLM) 2D filter, I stumbled upon the imageDenoising example provided with the CUDA Toolkit which implements two variants of this filter, called NLM & NLM2 (or "quick NLM").
Having no previous experience with CUDA coding, I initially attempted to follow MATLAB's documentation on the subject, which resulted in several strange errors including: ptx compilation, multiple entry points and wrong number of inputs in the C prototype. At this point I realized that this isn't going to be a "just works" case and that some tinkering is required.
So I decided to eliminate the multiple entry point issue by simply deleting parts of imageDenoising.cu file and consolidating the relevant .cuh (either ..._nlm_kernel.cuh or ..._nlm2_kernel.cuh) into the .cu so as to obtain a single entry point at any given time.
To my surprise this actually managed to compile and I was finally able to create a CUDAKernel without an error (using the command k = parallel.gpu.CUDAKernel('imageDenoising.ptx', 'uint8_T *, int, int, float, float');).
This however was not enough, because I mistakenly concluded that the 1st argument is the unprocessed image in the form of an RGB matrix (i.e. X*Y*3 uint8), and so the result I was getting back was exactly the input but with 0 in the 1st 4 elements.
After searching a bit more I realized that there are additional, and critical, aspects I'm entirely unaware of (like the need to initialize __device__ variables) to such a conversion process, at which stage I decided to ask for help.
The Problem
I'm currently wondering how to efficiently continue from here. While I'd love to hear if this kind of approach can generally bear fruit (and whether a complete example of this process is available somewhere), which other pitfalls I should look out for, and what alternative courses of action I can take (considering my very limited knowledge in CUDA and the fact I won't hire anybody else to do this for me), I keep in mind that this is SO and so I must have a specific programming problem, so here goes:
How do I modify imageDenoising.cu such that the MATLAB CUDAKernel constructed from it will also accept the unprocessed image as an input?
Note: in my application, the input matrix is a 2d, grayscale, double matrix.
Related: How CudaMalloc work?
P.S.
A working piece of code would obviously be welcomed, but I'd really rather "learn to fish".

I ended up taking an alternative approach to CUDAKernel, using .MEX, by doing the following:
Setting up the external libraries OpenCV v2.4.10 (not v3!) and mexopencv.
Writing a small wrapper function for OpenCV's fastNlMeansDenoising using the guidelines of mexopencv for unimplemented functions, as seen below (excluding the documentation):
#include "mexopencv.hpp"
using namespace cv;
void mexFunction(int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[])
{
// Check arguments
if (nlhs != 1 || nrhs<1 || ((nrhs % 2) != 1) )
mexErrMsgIdAndTxt("fastNLM:invalidArgs", "Wrong number of arguments");
// Argument vector
vector<MxArray> rhs(prhs, prhs + nrhs);
// Option processing
// Defaults:
double h = 3;
int templateWindowSize = 7;
int searchWindowSize = 21;
// Parsing input name-value pairs:
for (int i = 1; i<nrhs; i += 2) {
string key = rhs[i].toString();
if (key == "h")
h = rhs[i + 1].toDouble();
else if (key == "templateWindowSize")
templateWindowSize = rhs[i + 1].toInt();
else if (key == "searchWindowSize")
searchWindowSize = rhs[i + 1].toInt();
else
mexErrMsgIdAndTxt("mexopencv:error", "Unrecognized option");
}
// Process
Mat src(rhs[0].toMat()), dst;
fastNlMeansDenoising(src, dst, h, templateWindowSize, searchWindowSize);
// Convert cv::Mat back to mxArray*
plhs[0] = MxArray(dst);
}
Compiling it..... and viola - a working CUDA-accelerated NLM filter.
The answer to my question itself can be found by comparing opencv\sources\modules\photo\src\cuda\nlm.cu (this is the opencv2 path) with imageDenoising_nlm2_kernel.cuh.
This solution worked well for me because it was more important for me to get an NLM filter running, rather than using CUDAKernel.
The main lesson I learned from this (and I'd like to pass on to others) is:
Running CUDA code in MATLAB can also be done in ways other than CUDAKernel, such as using .mex wrappers as shown above.

Related

Speed-critical section in gcc?

I'm already using -O3, so I don't think I can do much better in general, but the generated assembly is still a tight fit for the timing requirements. Is there a way to, for example, tell avr-gcc to keep non-speed-critical code out of a speed-critical section?
if (READY)
{
ACKNOWLEDGE();
SETUP_PART1();
SETUP_PART2();
//start speed-critical section
SYNC_TO_TIMER();
MINIMAL_SPEED_CRITICAL_CODE();
//end speed-critical section
CLEANUP();
}
A tedious reading of the -O3-optimized assembly listing (.lss file) shows that SETUP_PART2(); has been reordered to come between SYNC_TO_TIMER(); and MINIMAL_SPEED_CRITICAL_CODE();. That's a problem because it adds time spent in the speed-critical section that doesn't need to exist.
I don't want to make the critical section itself any longer than it needs to be, so I'm hesitant to relax the optimization. (the same action that causes this problem may also be what makes the final version fit inside the time available) So how can I tell it that "this behavior" needs to stay pure and uncluttered, but still optimize it fully?
I've used inline assembly before, so I'm sure I could figure out again how to communicate between that and C without completely tying up or clobbering registers, etc. But I'd much rather stick with 100% C if I can, and let the compiler figure out how not to run over itself.
Not quite the direct answer that I as looking for, but I found something that works. So it's technically an XY problem, but I'll still take a direct answer to the question as stated.
Here's what I ended up doing:
1. Separate the speed-critical section into its own function, and prevent it from inlining:
void __attribute__((noinline)) _fast_function(uint8_t arg1, uint8_t arg2)
{
//start speed-critical section
SYNC_TO_TIMER();
MINIMAL_SPEED_CRITICAL_CODE(arg1, arg2);
//end speed-critical section
}
void normal_function(void)
{
if (READY)
{
ACKNOWLEDGE();
SETUP_PART1();
SETUP_PART2(); //determines arg1, arg2
_fast_function(arg1, arg2);
CLEANUP();
}
}
That keeps the speed-critical stuff together, and by itself. It also adds some overhead, but that happens OUTSIDE the fast part, where I can afford it.
2. Manually unroll some loops:
The original code was something like this:
uint8_t data[3];
uint8_t byte_num = 3;
while(byte_num)
{
byte_num--;
uint8_t bit_mask = 0b10000000;
while(bit_mask)
{
if(data[byte_num] & bit_mask) WEIRD_OUTPUT_1();
else WEIRD_OUTPUT_0();
bit_mask >>= 1;
}
}
The new code is like this:
#define WEIRD_OUTPUT(byte,bit) \
do \
{ \
if((byte) & (bit)) WEIRD_OUTPUT_1(); \
else WEIRD_OUTPUT_0(); \
} while(0)
uint8_t data[3];
WEIRD_OUTPUT(data[2],0b10000000);
WEIRD_OUTPUT(data[2],0b01000000);
WEIRD_OUTPUT(data[2],0b00100000);
WEIRD_OUTPUT(data[2],0b00010000);
WEIRD_OUTPUT(data[2],0b00001000);
WEIRD_OUTPUT(data[2],0b00000100);
WEIRD_OUTPUT(data[2],0b00000010);
WEIRD_OUTPUT(data[2],0b00000001);
WEIRD_OUTPUT(data[1],0b10000000);
WEIRD_OUTPUT(data[1],0b01000000);
WEIRD_OUTPUT(data[1],0b00100000);
WEIRD_OUTPUT(data[1],0b00010000);
WEIRD_OUTPUT(data[1],0b00001000);
WEIRD_OUTPUT(data[1],0b00000100);
WEIRD_OUTPUT(data[1],0b00000010);
WEIRD_OUTPUT(data[1],0b00000001);
WEIRD_OUTPUT(data[0],0b10000000);
WEIRD_OUTPUT(data[0],0b01000000);
WEIRD_OUTPUT(data[0],0b00100000);
WEIRD_OUTPUT(data[0],0b00010000);
WEIRD_OUTPUT(data[0],0b00001000);
WEIRD_OUTPUT(data[0],0b00000100);
WEIRD_OUTPUT(data[0],0b00000010);
WEIRD_OUTPUT(data[0],0b00000001);
In addition to dropping the loop code itself, this also replaces an array index with a direct access, and a variable with a constant, each of which offers its own speedup.
And it still barely fits in the time available. But it does fit! :)
You can try to use a memory barrier to prevent ordering code across it:
__asm volatile ("" ::: "memory");
The compiler will not move volatile accesses or memory accesses across such a barrier. It might however move other stuff across it like arithmetic that does not involve memory access.
If the compiler is reordering code, that it's a valid compilation w.r.t. the C language standard, which means that you'll have some hard time avoiding it by pure C code hacks.
If you are after speed, then consider writing WEIRD_OUTPUT(data[2],0b10000000); etc. as inline asm, or writing the whole function in assembly. This gives you definite control over the timing (not considering IRQs), whereas the C language standard does not make any statements on program timing whatsoever.

Swift: macro for __attribute__((section))

This is kind of a weird and un-Swift-thonic question, so bear with me.
I want to do in Swift something like the same thing I'm currently doing in Objective-C/C++, so I'll start by describing that.
I have some existing C++ code that defines a macro that, when used in an expression anywhere in the code, will insert an entry into a table in the binary at compile time. In other words, the user writes something like this:
#include "magic.h"
void foo(bool b) {
if (b) {
printf("%d\n", MAGIC(xyzzy));
}
}
and thanks to the definition
#define MAGIC(Name) \
[]{ static int __attribute__((used, section("DATA,magical"))) Name; return Name; }()
what actually happens at compile time is that a static variable named xyzzy (modulo name-mangling) is created and allocated into the special magical section of my Mach-O binary, so that running nm -m foo.o to dump the symbols shows something a lot like this:
0000000000000098 (__TEXT,__eh_frame) non-external EH_frame0
0000000000000050 (__TEXT,__cstring) non-external L_.str
0000000000000000 (__TEXT,__text) external __Z3foob
00000000000000b0 (__TEXT,__eh_frame) external __Z3foob.eh
0000000000000040 (__TEXT,__text) non-external __ZZ3foobENK3$_0clEv
00000000000000d8 (__TEXT,__eh_frame) non-external __ZZ3foobENK3$_0clEv.eh
0000000000000054 (__DATA,magical) non-external [no dead strip] __ZZZ3foobENK3$_0clEvE5xyzzy
(undefined) external _printf
Through the magic of getsectbynamefromheader(), I can then load the symbol table for the magical section, scan through it, and find out (by demangling every symbol I find) that at some point in the user's code, he calls MAGIC(xyzzy). Eureka!
I can replicate the whole second half of that workflow just fine in Swift — starting with the getsectbynamefromheader() part. However, the first part has me stumped.
Swift has no preprocessor, so spelling the magic as elegantly as MAGIC(someidentifier) is impossible. I don't want it to be too ugly, though.
As far as I know, Swift has no way to insert symbols into a given section — no equivalent of __attribute__((section)). This is okay, though, since nothing in my plan requires a dedicated section; that part just makes the second half easier.
As far as I know, the only way to get a symbol into the symbol table in Swift is via a local struct definition. Something like this:
func foo(b: Bool) -> Void {
struct Local { static var xyzzy = 0; };
println(Local.xyzzy);
}
That works, but it's a bit of extra typing, and can't be done inline in an expression (not that that'll matter if we can't make a MAGIC macro in Swift anyway), and I'm worried that the Swift compiler might optimize it away.
So, there are three questions here, all about how to make Swift do things that Swift doesn't want to do: Macros, attributes, and creating symbols that are resistant to compiler optimization.
I'm aware of #asmname but I don't think it helps me since I can already deal with demangling on my own.
I'm aware that Swift has "generics", but they seem to be closer to Java generics than to C++ templates; I don't think they can be used as a substitute for macros in this particular case.
I'm aware that the code for the Swift compiler is now open-source; I've skimmed bits of it in vain; but I can't read through all of it looking for tricks that might not even be there.
Here is the answer to your question about preprocessor (and macros).
Swift has no preprocessor, so spelling the magic as elegantly as MAGIC(someidentifier) is impossible. I don't want it to be too ugly, though.
Swift project has a preprocessor (but, AFAIK, it is not distributed with Swift's binary).
From swift-users mailing list:
What are .swift.gyb files?
It’s a preprocessor the Swift
team wrote so that when they needed to build, say, ten nearly-identical
variants of Int, they wouldn’t have to literally copy and paste the same
code ten times. If you open one of those files, you’ll see that they’re
mainly Swift code, but with some lines of code intermixed that are written in Python.
It is not as beautiful as C macros, but, IMHO, is more powerful.
You can see the available commands with ./swift/utils/gyb --help command after cloning the Swift's git repo.
$ swift/utils/gyb --help
usage, etc (TL;DR)...
Example template:
- Hello -
%{
x = 42
def succ(a):
return a+1
}%
I can assure you that ${x} < ${succ(x)}
% if int(y) > 7:
% for i in range(3):
y is greater than seven!
% end
% else:
y is less than or equal to seven
% end
- The End. -
When run with "gyb -Dy=9", the output is
- Hello -
I can assure you that 42 < 43
y is greater than seven!
y is greater than seven!
y is greater than seven!
- The End. -
My example of GYB usage is available on GitHub.Gist.
For more complex examples look for *.swift.gyb files in #apple/swift/stdlib/public/core.

Simulink generate code with unsupported istruction (ceil)

Generating code by Simulink (Matlab R2011A on MacOS 64bit)
I got a problem: it uses ceil function inside the code, but it isn't supported on my target platform.
I'm generating using ERT, for Arm Cortex processor (on a Cypress PSoC).
Is it possible to solve this problem?
I tried the solution without success.
Also in Code Generation - Interface, I tried to disable floating-point or not-finite numbers... but in this way every signal of my project raises some errors (same behaviour also changing it data-type).
Really thanks to anyone suggests me what I can try to do
You could write your own ceil function and include it in whatever your output code is for your target device. Assuming you are generating C-code, the function would look something like:
int ceil (double number) {
if (number == 0)
return 0;
if (number > 0) {
if (number - (int) number > 0)
return (int) number + 1;
else
return (int) number;
}
else {
if (number - (int) number < 0)
return (int) number - 1;
else
return (int) number;
}
}
With a prototype in your header file like:
int ceil (double);
Now your C-code can call integerValuedNumber = ceil(doubleValuedNumber) and it should work. You can also do this with macros in the C-file (see nintendo's answer).
EDIT: I corrected my code to use proper type casting syntax for C. Basically what you are doing with the (int) number syntax is taking the double-valued number variable and forcing it to be an integer. You can find more information about data types in C here, or Google "type casting C" or "data types C" for more information.
Also, some additional parenthesis might be needed, like return ((int) number) + 1; and similar. I am a little rusty on my C-programming, but hopefully this gets you going toward a viable solution.
EDIT 2: I corrected the return data type of our self-defined ceil function. You would want this to return an int, or maybe long. Again, check out documentation on data types in C if you are not sure what data type is appropriate for your application. If the values you are applying ceil to are not very large (less than +/- 2^15 for instance) then int is probably fine.
Ok... I solved.
The problem was in the destination environment (PSoC Creator).
As explained here http://www.cypress.com/?id=4&rID=42838 :
Go to Project -> Build Settings -> Linker -> General -> Additional Libraries. Type m in the Additional Libraries field.
If you are not adding this Additional Library then you will get the following Build error "undefined reference to `sqrt'" where sqrt is a math function.
Nothing changes if the problem is with sqrt() or ceil(), because they are in the same library (math.h).
PS: thank you Engineero... your solution is very useful, and can be apprecied from other people with my problem (but in other environments).

Using std::complex with iPhone's vDSP functions

I've been working on some vDSP code and I have come up against an annoying problem. My code is cross platform and hence uses std::complex to store its complex values.
Now I assumed that I would be able to set up an FFT as follows:
DSPSplitComplex dspsc;
dspsc.realp = &complexVector.front().real();
dspsc.imagp = &complexVector.front().imag();
And then use a stride of 2 in the appropriate vDSP_fft_* call.
However this just doesn't seem to work. I can solve the issue by doing a vDSP_ztoc but this requires temporary buffers that I really don't want hanging around. Is there any way to use the vDSP_fft_* functions directly on interleaved complex data? Also can anyone explain why I can't do as I do above with a stride of 2?
Thanks
Edit: As pointed out by Bo Persson the real and imag functions don't actually return a reference.
However it still doesn't work if I do the following instead
DSPSplitComplex dspsc;
dspsc.realp = ((float*)&complexVector.front()) + 0;
dspsc.imagp = ((float*)&complexVector.front()) + 1;
So my original question still does stand :(
The std::complex functions real() and imag() return by value, they do not return a reference to the members of complex.
This means that you cannot get their addresses this way.
This is how you do it.
const COMPLEX *in = reinterpret_cast<const COMPLEX*>(std::complex);
Source: http://www.fftw.org/doc/Complex-numbers.html
EDIT:
To clarify the source; COMPLEX and fftw_complex use the same data layout (although fftw_complex uses double and COMPLEX float)

Constants in MATLAB

I've come into ownership of a bunch of MATLAB code and have noticed a bunch of "magic numbers" scattered about the code. Typically, I like to make those constants in languages like C, Ruby, PHP, etc. When Googling this problem, I found that the "official" way of having constants is to define functions that return the constant value. Seems kludgey, especially because MATLAB can be finicky when allowing more than one function per file.
Is this really the best option?
I'm tempted to use / make something like the C Preprocessor to do this for me. (I found that something called mpp was made by someone else in a similar predicament, but it looks abandoned. The code doesn't compile, and I'm not sure if it would meet my needs.)
Matlab has constants now. The newer (R2008a+) "classdef" style of Matlab OOP lets you define constant class properties. This is probably the best option if you don't require back-compatibility to old Matlabs. (Or, conversely, is a good reason to abandon back-compatibility.)
Define them in a class.
classdef MyConstants
properties (Constant = true)
SECONDS_PER_HOUR = 60*60;
DISTANCE_TO_MOON_KM = 384403;
end
end
Then reference them from any other code using dot-qualification.
>> disp(MyConstants.SECONDS_PER_HOUR)
3600
See the Matlab documentation for "Object-Oriented Programming" under "User Guide" for all the details.
There are a couple minor gotchas. If code accidentally tries to write to a constant, instead of getting an error, it will create a local struct that masks the constants class.
>> MyConstants.SECONDS_PER_HOUR
ans =
3600
>> MyConstants.SECONDS_PER_HOUR = 42
MyConstants =
SECONDS_PER_HOUR: 42
>> whos
Name Size Bytes Class Attributes
MyConstants 1x1 132 struct
ans 1x1 8 double
But the damage is local. And if you want to be thorough, you can protect against it by calling the MyConstants() constructor at the beginning of a function, which forces Matlab to parse it as a class name in that scope. (IMHO this is overkill, but it's there if you want it.)
function broken_constant_use
MyConstants(); % "import" to protect assignment
MyConstants.SECONDS_PER_HOUR = 42 % this bug is a syntax error now
The other gotcha is that classdef properties and methods, especially statics like this, are slow. On my machine, reading this constant is about 100x slower than calling a plain function (22 usec vs. 0.2 usec, see this question). If you're using a constant inside a loop, copy it to a local variable before entering the loop. If for some reason you must use direct access of constants, go with a plain function that returns the value.
For the sake of your sanity, stay away from the preprocessor stuff. Getting that to work inside the Matlab IDE and debugger (which are very useful) would require deep and terrible hacks.
I usually just define a variable with UPPER_CASE and place near the top of the file. But you have to take the responsibly of not changing its value.
Otherwise you can use MATLAB classes to define named constants.
MATLAB doesn't have an exact const equivalent. I recommend NOT using global for constants - for one thing, you need to make sure they are declared everywhere you want to use them. I would create a function that returns the value(s) you want. You might check out this blog post for some ideas.
You might some of these answers How do I create enumerated types in MATLAB? useful. But in short, no there is not a "one-line" way of specifying variables whose value shouldn't change after initial setting in MATLAB.
Any way you do it, it will still be somewhat of a kludge. In past projects, my approach to this was to define all the constants as global variables in one script file, invoke the script at the beginning of program execution to initialize the variables, and include "global MYCONST;" statements at the beginning of any function that needed to use MYCONST. Whether or not this approach is superior to the "official" way of defining a function to return a constant value is a matter of opinion that one could argue either way. Neither way is ideal.
My way of dealing with constants that I want to pass to other functions is to use a struct:
% Define constants
params.PI = 3.1416;
params.SQRT2 = 1.414;
% Call a function which needs one or more of the constants
myFunction( params );
It's not as clean as C header files, but it does the job and avoids MATLAB globals. If you wanted the constants all defined in a separate file (e.g., getConstants.m), that would also be easy:
params = getConstants();
Don't call a constant using myClass.myconst without creating an instance first! Unless speed is not an issue. I was under the impression that the first call to a constant property would create an instance and then all future calls would reference that instance, (Properties with Constant Values), but I no longer believe that to be the case. I created a very basic test function of the form:
tic;
for n = 1:N
a = myObj.field;
end
t = toc;
With classes defined like:
classdef TestObj
properties
field = 10;
end
end
or:
classdef TestHandleObj < handle
properties
field = 10;
end
end
or:
classdef TestConstant
properties (Constant)
field = 10;
end
end
For different cases of objects, handle-objects, nested objects etc (as well as assignment operations). Note that these were all scalars; I didn't investigate arrays, cells or chars. For N = 1,000,000 my results (for total elapsed time) were:
Access(s) Assign(s) Type of object/call
0.0034 0.0042 'myObj.field'
0.0033 0.0042 'myStruct.field'
0.0034 0.0033 'myVar' //Plain old workspace evaluation
0.0033 0.0042 'myNestedObj.obj.field'
0.1581 0.3066 'myHandleObj.field'
0.1694 0.3124 'myNestedHandleObj.handleObj.field'
29.2161 - 'TestConstant.const' //Call directly to class(supposed to be faster)
0.0034 - 'myTestConstant.const' //Create an instance of TestConstant
0.0051 0.0078 'TestObj > methods' //This calls get and set methods that loop internally
0.1574 0.3053 'TestHandleObj > methods' //get and set methods (internal loop)
I also created a Java class and ran a similar test:
12.18 17.53 'jObj.field > in matlab for loop'
0.0043 0.0039 'jObj.get and jObj.set loop N times internally'
The overhead in calling the Java object is high, but within the object, simple access and assign operations happen as fast as regular matlab objects. If you want reference behavior to boot, Java may be the way to go. I did not investigate object calls within nested functions, but I've seen some weird things. Also, the profiler is garbage when it comes to a lot of this stuff, which is why I switched to manually saving the times.
For reference, the Java class used:
public class JtestObj {
public double field = 10;
public double getMe() {
double N = 1000000;
double val = 0;
for (int i = 1; i < N; i++) {
val = this.field;
}
return val;
}
public void setMe(double val) {
double N = 1000000;
for (int i = 1; i < N; i++){
this.field = val;
}
}
}
On a related note, here's a link to a table of NIST constants: ascii table and a matlab function that returns a struct with those listed values: Matlab FileExchange
I use a script with simple constants in capitals and include teh script in other scripts tr=that beed them.
LEFT = 1;
DOWN = 2;
RIGHT = 3; etc.
I do not mind about these being not constant. If I write "LEFT=3" then I wupold be plain stupid and there is no cure against stupidity anyway, so I do not bother.
But I really hate the fact that this method clutters up my workspace with variables that I would never have to inspect. And I also do not like to use sothing like "turn(MyConstants.LEFT)" because this makes longer statements like a zillion chars wide, making my code unreadible.
What I would need is not a variable but a possibility to have real pre-compiler constants. That is: strings that are replaced by values just before executing the code. That is how it should be. A constant should not have to be a variable. It is only meant to make your code more readible and maintainable. MathWorks: PLEASE, PLEASE, PLEASE. It can't be that hard to implement this. . .