Is there a way to pin arrays in Numba for fast data transfer to/from the device?

In PyTorch, there is an option to pin CPU arrays for fast transfer to the GPU (it does not seem to help for GPU -> CPU, though).
I am wondering if there is a way to pin Numba arrays in memory, or any alternative technique for fast transfer from CPU to GPU. I do not see a direct option for this in the documentation, so my guess is that we need to specify a pinned allocation at array-creation time.
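For concreteness, this is the kind of pattern I am hoping for. The sketch below assumes Numba's CUDA API exposes cuda.pinned_array (page-locked host allocation) and a cuda.pinned context manager, as in recent releases; I have not confirmed this is the intended usage:

import numpy as np
from numba import cuda

n = 1024
stream = cuda.stream()

# Option 1: allocate the host buffer as pinned (page-locked) memory up front.
host_pinned = cuda.pinned_array((n, n), dtype=np.float32)
host_pinned[:] = np.random.rand(n, n)

dev = cuda.to_device(host_pinned, stream=stream)   # host -> device copy
dev.copy_to_host(host_pinned, stream=stream)       # device -> host copy into the pinned buffer
stream.synchronize()

# Option 2: temporarily pin an existing NumPy array for the duration of a block.
host = np.random.rand(n, n).astype(np.float32)
with cuda.pinned(host):
    dev = cuda.to_device(host, stream=stream)
    stream.synchronize()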

Related

GPU Acceleration expectation

I have code that does some heavy computing on 3D arrays. The code is optimized to run on the GPU. The 3D array is basically a set of 2D arrays, each stored in a page of the 3D array.
For simplicity, let us consider the whole code is:
A = rand(512,512,100,'gpuArray');
B = fftshift(fftshift(fft2(fftshift(fftshift(A,1),2)),1),2);
where 512*512 is the size of each 2D array and 100 is the number of those 2D arrays.
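For reference, a rough NumPy equivalent of that line (CPU-only, shown purely as a sketch, with the 2D pages along the third dimension) would be:

import numpy as np

A = np.random.rand(512, 512, 100)   # 100 pages of 512x512
# fftshift along the first two dimensions, 2D FFT of each page, then fftshift again
shifted = np.fft.fftshift(A, axes=(0, 1))
B = np.fft.fftshift(np.fft.fft2(shifted, axes=(0, 1)), axes=(0, 1))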
On an NVIDIA GTX 1060 it takes x milliseconds to compute.
I tried changing the 2D array size to 256*256 or 128*128, and hardly any performance improvement was noticeable: the new time is around 0.9*x milliseconds.
This behavior is expected since, AFAIK, a smaller array will not utilize the GPU enough and many cores will be idle. So, no real gain here.
My question:
If I bought a better GPU such as a 1080 Ti or the new 2080 Ti (which does not seem to be available yet), would I get a real performance improvement? I mean, since I am not even utilizing the full capability of the 1060, would a better GPU make a real difference?
I imagine the clock speed may make a difference, but I am not sure how significant that improvement would be.
One benefit of a better GPU is the larger memory, which would let me run the process on 1024*1024 arrays (which I cannot do on the 1060). However, this is not my main concern.

Best way to store information permanently from an Artificial Neural Network

I am asking myself what the best way is to store and access information/data quickly on a PC. I am asking this in the context of artificial intelligence (especially artificial neural networks -> LSTMs, etc.) because I want to know how to store the information from an artificial neural network (ANN) that has a huge number of neurons and therefore a lot of synaptic weights to hold. By saving the data from the neurons I want to reduce the use of hardware resources, because the ANN lives entirely in RAM and I am afraid of overloading my RAM / Java Virtual Machine (my ANN is written in Java). I know that I could simply save the weights to a file and read it back when needed, but is there a better way (like a data structure or anything else) to save the information?
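As one illustration of the save-to-file idea, the weights could be written to disk once and then memory-mapped on load, so that only the parts actually accessed are pulled into RAM. This is sketched in Python with NumPy purely for brevity (the file name and matrix size are made up); in Java the analogous mechanism would be a memory-mapped file via java.nio:

import numpy as np

# Hypothetical weight matrix of one large layer (illustrative size only).
weights = np.random.rand(8192, 8192).astype(np.float32)

# Persist the weights to disk once...
np.save("layer1_weights.npy", weights)

# ...and later load them memory-mapped, so pages are read on demand
# instead of keeping the whole matrix in RAM.
w = np.load("layer1_weights.npy", mmap_mode="r")
print(w[1234, :10])   # touches only a small part of the file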
To reduce RAM usage you can also simulate the ANN on your GPU (and speed it up). You would need to learn OpenCL and JNI, but I think this is a good idea. I want to do that in my own ANN libraries.

How does Matlab implement GPU computation in CPU parallel loops?

Can we improve performance by computing some parts of the CPU's parfor or spmd blocks with gpuArray or other GPU functions? Is this a reasonable way to improve performance, or are there limitations to this approach? I read somewhere that we can use this approach when we have several GPU units. Is this the only way to combine GPU computing with CPU parallel loops?
It is possible that using gpuArray within a parfor loop or spmd block can give you a performance benefit, but really it depends on several factors:
How many GPUs you have on your system
What type of GPUs you have (some are better than others at dealing with being "oversubscribed" - i.e. where there are multiple processes using the same GPU)
How many workers you run
How much GPU memory you need for your algorithm
How well suited the problem is to the GPU in the first place.
So, if you had two high-powered GPUs in your machine and ran two workers in a parallel pool on a problem that could keep a single GPU fully occupied, you'd expect to see a good speedup. You might still get a decent speedup if you ran 4 workers.
One thing that I would recommend is: if possible, try to avoid transferring gpuArray data from client to workers, as this is slower than usual data transfers (the gpuArray is first gathered to the CPU and then reconstituted on the worker).

Ways to accelerate reduce operation on Xeon CPU, GPU and Xeon Phi

I have an application where reduce operations (like sum, max) on a large matrix are the bottleneck. I need to make this as fast as possible. Are there vector instructions in MKL to do that?
Is there a special hardware unit to deal with it on a Xeon CPU, a GPU, or a MIC (Xeon Phi)?
How are reduce operations generally implemented on that hardware?
You can implement your own simple reductions using the KNC vpermd and vpermf32x4 instructions as well as the swizzle modifiers to do cross lane operations inside the vector units.
The C intrinsic equivalents of these would be the _mm512_{mask_}permute* and _mm512_{mask_}swizzle* families.
However, I recommend that you first look at the array notation reduce operations, that already have high performance implementations on the MIC.
Look at the reduction operations available here and also check out this video by Taylor Kidd from Intel talking about array notation reductions on the Xeon Phi starting at 20mins 30s.
EDIT: I noticed you are also looking for CPU based solutions. The array notation reductions will work very well on the Xeon also.
It turns out that none of this hardware has a built-in reduce-operation circuit.
I had imagined sixteen 17-bit adders attached to a 128-bit vector register for a reduce-sum operation. Maybe this is because no one has hit a significant bottleneck with reduce operations. Well, the best solution I found is #pragma omp parallel for reduction in OpenMP. I have yet to test its performance, though.
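For comparison, the same kind of parallel sum reduction can be sketched in Python with Numba's prange, as an analogue of the OpenMP reduction clause mentioned above (this assumes Numba's parallel target is available; it is a sketch, not a tuned implementation):

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def parallel_sum(a):
    flat = a.ravel()              # flatten to 1D (a view when possible)
    total = 0.0
    for i in prange(flat.size):   # Numba treats the accumulation as a parallel reduction
        total += flat[i]
    return total

x = np.random.rand(512, 512, 100)
print(parallel_sum(x), x.sum())   # should agree up to floating-point rounding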
This operation is going to be bandwidth-limited and thus vectorization almost certainly doesn't matter. You want the hardware with the most memory bandwidth. An Intel Xeon Phi processor has more aggregate bandwidth (but not bandwidth-per-core) than a Xeon processor.

GPU performance request: what's the best solution?

I work on an audio processing project that needs to do a lot of basic computations (+, -, *), such as an FFT (Fast Fourier Transform) calculation.
We're considering using a graphics card to accelerate these computations, but we don't know if this is the best solution. We want a good computation system costing less than $500.
We program in MATLAB, and we have a sound-card acquisition device that has to be plugged into the system.
Do you know of a solution other than a graphics card + motherboard to do a lot of computation?
You can use the free MATLAB CUDA library to perform the computations on the GPU. $500 will get you a very decent NVIDIA GPU. Beware that GPUs have limited video memory and will run out of memory with large data volumes even faster than MATLAB.
I have benchmarked an 8-core Intel CPU against an NVIDIA 8800 GPU (128 stream processors) with GPUmat; for 512 KB datasets the GPU came out at the same speed as the 8-core Intel at 2 GHz, including transfer times to GPU memory. For serious GPU work I recommend a dedicated card rather than the one you use to drive the monitor: use the cheap onboard Intel video for the monitor and pass the array computations to the NVIDIA card.
Parallel Computing Toolbox from MathWorks now includes GPU support. In particular, elementwise operations and arithmetic are supported, as well as 1- and 2-dimensional FFTs (along with a whole bunch of other stuff to support hand-written CUDA code if you have that). If you're interested in performing calculations in double-precision, the recent Tesla and Quadro branded cards will give you the best performance.
Here's a trivial example showing how you might use the GPU in MATLAB using Parallel Computing Toolbox:
gA = gpuArray( rand(1000) );
gB = fft( 1 + gA * 3 );