GPU acceleration expectation - MATLAB

I have code that does some heavy computing on 3D arrays. The code is optimized to work on the GPU. The 3D array is basically a set of 2D arrays, each stored in a page of the 3D array.
For simplicity, let us consider the whole code is:
A = rand(512,512,100,'gpuArray');
B = fftshift(fftshift(fft2(fftshift(fftshift(A,1),2)),1),2);
where 512*512 is the size of each 2D array and 100 is the number of those 2D arrays.
On an NVIDIA GTX 1060, this takes x milliseconds to compute.
I tried changing the 2D array size to 256*256 or 128*128, but hardly any performance improvement was noticeable; the new time is around 0.9*x milliseconds.
This behavior is expected since, AFAIK, a smaller array will not utilize the GPU fully and many cores will sit idle, so there is no real gain here.
My question:
If I bought a better GPU such as a 1080 Ti or the new 2080 Ti (which does not seem to be available yet), would I get a real performance improvement? Since I am not even utilizing the full capability of the 1060, would a better GPU make a real difference?
I imagine the clock speed may make a difference, but I am not sure how significant that improvement would be.
One benefit of a better GPU is the larger memory, which would let me run the process on 1024*1024 arrays (which I cannot do on the 1060). However, this is not my main concern.
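For measuring such timings, gputimeit from the Parallel Computing Toolbox is convenient because it synchronises the GPU around the measurement; a minimal sketch, using the same batched FFT as above:

% Time the batched 2-D FFT at a few sizes (results in milliseconds).
for n = [128 256 512]
    A = rand(n, n, 100, 'gpuArray');
    f = @() fftshift(fftshift(fft2(fftshift(fftshift(A,1),2)),1),2);
    fprintf('%4d x %4d x 100: %.3f ms\n', n, n, 1e3 * gputimeit(f));
end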

Metal compute - pixels dependent on other pixels' colour

I have recently started to learn the Metal framework so I can write some filters for my Swift app.
I am about to write a Metal kernel that dithers a picture using error diffusion dithering.
Each pixel is given a colour and then values are distributed to neighbouring pixels based on the original pixel's colour. The values are spread out over the whole image as each pixel is calculated, so all the pixels are dependent on each other.
The example will be a Floyd-Steinberg dither.
With the way Metal deals with threading, this dithering method won't work: when dithering an image, the pixels can only be computed in order from first to last.
Is it possible to have a kernel that doesn’t involve threading, or a way to select the whole image array to be computed by a single thread?
In short, no, you cannot do that with GPU computing, as the GPU approach is implicitly parallel; one result cannot depend on all the other results. What you could try is breaking the computation down into stages, so that one stage at a time can be done in parallel. It depends on what your computation logic does, though. If you only want to use "one thread" on the GPU, then it would likely be faster to just do the computations on the CPU instead. If you are interested, I wrote up an approach you might want to read about: Rice decompression with Metal. That approach does block-by-block segmentation of a decompression task.
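To make the serial dependency concrete, here is a plain, serial CPU implementation of Floyd-Steinberg error diffusion (sketched in MATLAB rather than Metal, purely to show the data flow; the test image is a synthetic gradient so the snippet is self-contained). Every pixel's output depends on error pushed onto it by previously processed pixels, which is exactly what a one-thread-per-pixel kernel cannot express:

% Serial Floyd-Steinberg dithering of a grayscale image with values in [0,1].
% Each pixel's quantisation error is pushed to pixels that have not been
% visited yet, so a pixel cannot be processed before its left/upper neighbours.
img = repmat(linspace(0, 1, 256), 256, 1);   % synthetic gradient test image
[h, w] = size(img);
out = img;
for y = 1:h
    for x = 1:w
        old = out(y, x);
        new = double(old >= 0.5);            % quantise to 0 or 1
        err = old - new;
        out(y, x) = new;
        if x < w,          out(y,   x+1) = out(y,   x+1) + err * 7/16; end
        if y < h && x > 1, out(y+1, x-1) = out(y+1, x-1) + err * 3/16; end
        if y < h,          out(y+1, x  ) = out(y+1, x  ) + err * 5/16; end
        if y < h && x < w, out(y+1, x+1) = out(y+1, x+1) + err * 1/16; end
    end
end
imagesc(out); axis image; colormap(gray)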

Converting all variables into gpuArrays doesn't speed up computation

I'm writing a simulation in MATLAB where I use CUDA acceleration.
Suppose we have vectors x and y, a matrix A, and scalar variables dt, dx, a, b, c.
What I found was that by putting x, y, and A into gpuArray() before running the iteration with built-in functions, the iteration could be accelerated significantly.
However, when I also put variables like dt, dx, a, b, c into gpuArray(), the program slowed down significantly, by more than 30% (the time increased from 7 s to 11 s).
Why is it not a good idea to put all the variables into gpuArray()?
(A short comment: those scalars were multiplied together with x, y, and A, and were never used on their own during the iteration.)
GPU hardware is optimised for working on relatively large amounts of data. You only really see the benefit of GPU computing when you can feed the many processing cores lots of data to keep them busy. Typically this means you need operations working on thousands or millions of elements.
The overheads of launching operations on the GPU dwarf the computation time when you're dealing with scalar quantities, so it is no surprise that they are slower than on the CPU. (This is not peculiar to MATLAB & gpuArray).
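To illustrate the point, a small sketch (the update rule and sizes here are made up, not from the question): keep the large arrays on the GPU and leave the scalars as plain host-side doubles.

% Arrays on the GPU, scalars left on the host. Mixing a host double into a
% gpuArray expression is cheap; wrapping each scalar in gpuArray only adds
% per-object overhead with nothing for the GPU cores to chew on.
x  = rand(1e6, 1, 'gpuArray');
y  = rand(1e6, 1, 'gpuArray');
dt = 1e-3;  a = 0.5;  b = 2.0;              % plain CPU doubles

tic
for k = 1:1000
    x = x + dt * (a * y - b * x);           % scalars stay on the host
end
wait(gpuDevice)                             % let the GPU finish before timing
toc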

Hardware accelerated image comparison/search?

I need to find the position of a smaller image inside a bigger image. The smaller image is a subset of the bigger image. The requirement is also that pixel values can differ slightly, for example if the images were produced by different JPEG compression.
I've implemented a solution by comparing bytes on the CPU, but I'm now looking into any possibility of speeding up the process.
Could I somehow utilize OpenGL ES, and thus the iPhone GPU, for this?
Note: images are grayscale.
@Ivan, this is a pretty standard problem in video compression (finding the position of the current macroblock in the previous frame). You can use a metric for the difference in pixels such as the sum of absolute differences (SAD), the sum of squared differences (SSD), or the sum of Hadamard-transformed differences (SATD). I assume you are not trying to compress video but rather looking for something like a watermark.
In many cases you can use a gradient-descent type of search to find a local minimum (best match), based on the empirical observation that comparing an image (your small image) to a slightly offset version of itself (a match whose position hasn't been found exactly) produces a closer metric than comparing it to a random part of another image. So you can start by sampling the space of all possible offsets/positions (motion vectors in video encoding) rather coarsely, and then do local optimization around the best result. The local optimization works by comparing a match to some number of neighboring matches and moving to the best of those if any is better than your current match, then repeating. This is very much faster than brute force (checking every possible position), but it may not work in all cases (it depends on the nature of what is being matched).
Unfortunately, this type of algorithm does not translate very well to the GPU, because each step depends on previous steps. It may still be worth it: if you check e.g. 16 neighbors of the current position for a 256x256 image, that is enough parallel computation to send to the GPU, and yes, it absolutely can be done in OpenGL ES. However, the answer really depends on whether you're doing a brute-force or a local-minimization type of search, and whether local minimization would work for you.
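As a rough illustration of that coarse-then-local search (sketched in MATLAB; the images are synthetic so the snippet is self-contained, and the 8-pixel coarse step is an arbitrary choice):

% Find a small patch inside a big grayscale image by sum of absolute
% differences (SAD): coarse scan over a sparse grid of offsets, then a
% local hill-descent around the best coarse match.
big   = conv2(rand(600), ones(15)/225, 'valid');   % smooth random test image
small = big(101:164, 201:264);                     % known 64x64 sub-image
[bh, bw] = size(big);  [sh, sw] = size(small);
sad = @(y, x) sum(sum(abs(big(y:y+sh-1, x:x+sw-1) - small)));

best = inf;  bestPos = [1 1];
for y = 1:8:(bh - sh + 1)                  % coarse pass
    for x = 1:8:(bw - sw + 1)
        s = sad(y, x);
        if s < best, best = s; bestPos = [y x]; end
    end
end

moved = true;                              % local refinement
while moved
    moved = false;
    for d = [-1 0; 1 0; 0 -1; 0 1]'        % the four neighboring offsets
        y = bestPos(1) + d(1);  x = bestPos(2) + d(2);
        if y >= 1 && x >= 1 && y <= bh - sh + 1 && x <= bw - sw + 1
            s = sad(y, x);
            if s < best, best = s; bestPos = [y x]; moved = true; end
        end
    end
end
fprintf('best match at row %d, col %d (SAD = %g)\n', bestPos(1), bestPos(2), best);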

Image cross-correlation with Matlab GPGPU, indexing into 3d array

The problem I'm encountering is writing code such that the built-in features of MATLAB's GPU programming will correctly divide data for parallel execution. Specifically, I'm sending N 'particle' images to the GPU's memory, organized in a 3-D array with the third dimension representing each image, and attempting to compare each of the N images with one single image that represents the target, which is also in GPU memory.
My current implementation, really more or less how I'd like to see it implemented, is with one line of code:
particle_ifft = ifft2(particle_fft.*target_fft);
Note this is after taking the fft of each of the uploaded images. Herein lies the indexing problem: This statement requires equally sized "particle_fft" and "target_fft" matrices to use the '.*' operator. It would be inefficient in terms of memory usage to have multiple copies of the same target image for the sake of comparing with each particle image. I have used this inefficient method to get good performance results but it significantly affects the number of particle images I can upload to the GPU.
Is there a way I can tell MATLAB to compare each 2D element (each image) of the particle images' 3D array with only the single target image?
I have tried using a for loop to index into the 3D array and access each of the particle images individually for comparison with the single target, but MATLAB does not parallelize this type of operation on the GPU, i.e. it runs nearly 1000 times slower than equivalent code using the memory-inefficient target array.
I realize I could write my own kernel that would solve this indexing problem, but I'm interested in finding a way to leverage MATLAB's existing capabilities (specifically, to avoid rewriting the fft2 and ifft2 functions). Ideas?
In Parallel Computing Toolbox release R2012a, bsxfun was added - I think that's what you need, i.e.
bsxfun(@times, particle_fft, target_fft);
See: http://www.mathworks.co.uk/help/toolbox/distcomp/bsxfun.html
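For context, a sketch of how that fits into the pipeline from the question (the array sizes here are made up; on R2016b or later, implicit expansion also lets you write particle_fft .* target_fft directly, as long as target_fft is M-by-N-by-1):

% N particle images and one target image, all on the GPU.
particles = rand(256, 256, 1000, 'gpuArray');    % example sizes
target    = rand(256, 256, 'gpuArray');

particle_fft = fft2(particles);                  % fft2 transforms each 2-D page
target_fft   = fft2(target);

% Singleton expansion along the third dimension: the target is never
% replicated into a 256x256x1000 array.
particle_ifft = ifft2(bsxfun(@times, particle_fft, target_fft));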

Buddhabrot Fractal

I am trying to implement the Buddhabrot fractal. I can't understand one thing: all the implementations I inspected pick random points on the image to calculate the escape path of the particle. Why do they do this? Why not go over all pixels?
What purpose do the random points serve? More points make better pictures, so I would think going over all pixels makes the best picture - am I wrong here?
From my test data:
Working on a 400x400 picture, so 160,000 pixels to iterate over if I cover them all.
Using random sampling, the picture only starts to take shape after 1 million points. Good results show up at around 1 billion random points, which takes hours to compute.
Random sampling is better than grid sampling for two main reasons. First, because grid sampling will introduce grid-like artifacts in the resulting image. Second, because grid sampling may not give you enough samples for a converged image. If, after completing a grid pass, you wanted more samples, you would need to make another pass with a slightly offset grid (so as not to resample the same points) or switch to a finer grid, which may end up doing more work than is needed. Random sampling gives very smooth results, and you can stop the process as soon as the image has converged or you are satisfied with the results.
I'm the inventor of the technique so you can trust me on this. :-)
The same holds for flame fractals: the Buddhabrot is about finding the "attractors", so even if you start with a random point, it is assumed to converge quite quickly to these attracting curves. You typically avoid painting the first 10 or so points of each orbit anyway, so the starting point is not really relevant. BUT, to avoid doing the same computation twice, random sampling is much better; as mentioned, it eliminates the risk of artefacts.
But the most important feature of random sampling is that it covers all levels of precision (in theory, at least). This is VERY important for fractals: they have details at all levels of precision, and hence require input from all levels as well.
While I am not 100% sure of the exact reason, I would assume it has more to do with efficiency. If you are going to iterate through every single point multiple times, it's going to waste a lot of processing cycles to get a picture that may not look a whole lot better. By doing random sampling you can reduce the amount of work that needs to be done - and, given a large enough sample size, still get a result that is difficult to "differentiate" visually from iterating over all the pixels.
This is possibly some kind of Monte-Carlo method so yes, going over all pixels would produce the perfect result but would be horribly time consuming.
Why don't you just try it out and see what happens?
Random sampling is used to get as close as possible to the exact solution, which in cases like this cannot be calculated exactly due to the statistical nature of the problem.
You can 'go over all pixels', but since every pixel is in fact some square region with dimensions dx * dy, you would only use num_x_pixels * num_y_pixels points for your calculation and get very grainy results.
Another way would be to use a very large resolution and scale down the render after the calculation. This would give some kind of 'systematic' render where every pixel of the final render is divided into an equal number of sub-pixels.
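For what it's worth, the random-sampling version is only a handful of lines. Here is a minimal MATLAB sketch (the parameters are arbitrary and the plain loop is slow; it is only meant to show the structure). Replacing the random starting points with a regular grid over the same region gives the 'go over all pixels' variant being discussed.

% Minimal Buddhabrot: pick random points c, iterate z = z^2 + c, and for
% every c whose orbit escapes, increment each pixel that the orbit visited.
W = 400; H = 400; maxIter = 500; nSamples = 1e6;
counts = zeros(H, W);
for s = 1:nSamples
    c = (4*rand - 2) + 1i*(4*rand - 2);        % random start in [-2,2] x [-2,2]
    z = 0;  orbit = complex(zeros(maxIter, 1));
    n = 0;  escaped = false;
    while n < maxIter
        z = z*z + c;  n = n + 1;  orbit(n) = z;
        if abs(z) > 2, escaped = true; break; end
    end
    if escaped                                 % only escaping orbits are drawn
        for k = 1:n
            px = floor((real(orbit(k)) + 2) / 4 * W) + 1;
            py = floor((imag(orbit(k)) + 2) / 4 * H) + 1;
            if px >= 1 && px <= W && py >= 1 && py <= H
                counts(py, px) = counts(py, px) + 1;
            end
        end
    end
end
imagesc(counts .^ 0.3); axis image; colormap(gray)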
I realize this is an old post, but wanted to add my thoughts based on a current project.
The problem with tying your samples to pixels, like others said:
Couples your sample grid to your view plane, making it difficult to do projections, zooms, etc
Not enough fidelity: random sampling is more efficient, as everyone else said, so you need even more samples if you sample on a uniform grid
You're much more likely to see grid artifacts at equivalent sample counts, whereas random sampling tends to just look grainy at low counts
However, I'm working on a GPU-accelerated version of Buddhabrot, and ran into a couple of issues specific to GPU code with random sampling:
I've had issues with overhead/quality of PRNGs in GPU code where I need to generate thousands of samples in parallel
Random sampling produces highly scattered traces, and the GPU really, really wants threads in a given block/warp to move together as much as possible. The performance difference for clustered vs scattered traces was extreme in my testing
Hardware support in modern GPUs for efficient atomicAdd means writes to global memory don't bottleneck GPU buddhabrot nearly as much now
Grid sampling makes it very easy to do a two-pass render, skipping blocks of sample points based on a low-res pass to find points that don't escape or don't ever touch a pixel in the view plane
Long story short, while the GPU technically has to do more work this way for equivalent quality, it's actually faster in practice AFAICT, and GPUs are so fast that re-rendering is often a matter of seconds or minutes instead of hours (or even milliseconds at lower resolution/quality levels; real-time Buddhabrot is very cool)