How do I feed a 2-dimensional array into a kernel with PyCUDA?

I have created a numpy array of float32s with shape (64, 128), and I want to send it to the GPU. How do I do that, and what arguments should my kernel function accept? float** myArray?
I have tried sending the array directly to the GPU as it is, but PyCUDA complains that objects are being accessed...

Two-dimensional arrays in numpy/PyCUDA are stored in contiguous linear memory in row-major order by default. So you only need a kernel something like this:
__global__
void kernel(float* a, int lda, ...)
{
    int r0 = threadIdx.y + blockDim.y * blockIdx.y; // row index
    int r1 = threadIdx.x + blockDim.x * blockIdx.x; // column index

    float val = a[r0 * lda + r1]; // lda = number of columns (128 here)
    ....
}
to access an element of a numpy ndarray or PyCUDA GPUArray passed to the kernel from Python.
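For illustration, here is a minimal host-side sketch along those lines (the doublify kernel name and the 16×16 launch shape are made-up examples; only the flat float* argument, the lda column count, and the row-major indexing come from the answer above):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void doublify(float *a, int lda)
{
    int r0 = threadIdx.y + blockDim.y * blockIdx.y; // row
    int r1 = threadIdx.x + blockDim.x * blockIdx.x; // column
    a[r0 * lda + r1] *= 2.0f;
}
""")
doublify = mod.get_function("doublify")

a = np.random.randn(64, 128).astype(np.float32)
a_gpu = gpuarray.to_gpu(a)  # uploaded as a plain float*, no float** needed

# one thread per element: 128 columns in x, 64 rows in y
doublify(a_gpu, np.int32(a.shape[1]), block=(16, 16, 1), grid=(8, 4))

assert np.allclose(a_gpu.get(), 2 * a)

Passing the GPUArray itself supplies the device pointer, and the array's column count doubles as lda because the array is C-contiguous.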

Related

Reduce size of vector in MEX-file

Given a typed vector like this:
matlab::data::ArrayFactory Factory;
matlab::data::TypedArray<double> BigArray = Factory.createArray<double>({420, 1});
how can I shrink BigArray without (re)allocation? All I want is to set its internal length dimension to a value smaller than 420.
Well, supposing you spring for the C API instead of the C++ API, you can use mxSetN or mxSetM on the mxArray object to reduce it. These calls only rewrite the dimension metadata; the underlying data buffer is not reallocated.
int M = 420;
int N = 1;
mxArray *BigArray = mxCreateNumericMatrix(M, N, mxDOUBLE_CLASS, mxREAL);
/* Shrink to 416 rows; no reallocation takes place. */
mxSetM(BigArray, M - 4);

MATLAB GPU computation

I am learning MATLAB GPU functions. My function myfun takes two input parameters, delta and p. Eventually, I will apply myfun to many combinations of delta and p. For each combination, myfun counts how many elements of V satisfy the condition delta*V - p > 0, where V = 0:0.001:1. Ideally, I want V to be a global variable, but it seems that MATLAB's GPU support has some restrictions on global variables, so I do it another way. The code is as follows:
function result = gpueg2()
    dd = 0.1;
    DELTA = dd:dd:1;
    dp = 0.001;
    P = 0:dp:1;
    [p, delta] = meshgrid(P, DELTA);
    p = gpuArray(p(:));
    delta = gpuArray(delta(:));
    V = 0:0.001:1;

    function O = myfun(delta, p)
        O = sum((delta*V - p) > 0);
    end

    result = arrayfun(@myfun, delta, p);
end
However, it throws an error message:
Function passed as first input argument contains unsupported or unknown function 'sum'.
But I believe sum is supported on the GPU.
Any advice and suggestions are highly appreciated.
The problem with sum is not with the GPU, it's with using arrayfun on a GPU. The list of functions that arrayfun accepts on a GPU is given here: https://www.mathworks.com/help/distcomp/run-element-wise-matlab-code-on-a-gpu.html. sum is not on that list.
Your vectors are not so big (though I accept this may be a toy example of your real problem). I suggest the following alternative implementation:
function result = gpueg2()
    dd = 0.1;
    DELTA = dd:dd:1;
    dp = 0.001;
    P = 0:dp:1;
    V = 0:0.001:1;
    [p, delta, v] = meshgrid(P, DELTA, V);
    p = gpuArray(p);
    delta = gpuArray(delta);
    v = gpuArray(v);
    result = sum(delta.*v - p > 0, 3);
end
Note the following differences:
I make 3D arrays of p, delta, v, rather than 2D. At 10×1001×1001 doubles each, these three total roughly 240MB, which still fits comfortably on a GPU.
I do the calculation delta.*v - p > 0 on the whole 3D array: this parallelises well on the GPU.
I do the sum on the 3rd index, i.e. over V.
I have checked that your routine on the CPU and mine on the GPU give the same results.

Reimplement vDSP_deq22 for Biquad IIR Filter by hand

I'm porting a filterbank that currently uses the Apple-specific (Accelerate) vDSP function vDSP_deq22 to Android (where Accelerate is not available). The filterbank is a set of bandpass filters that each return the RMS magnitude for their respective band. Currently the code (Objective-C++, adapted from NVDSP) looks like this:
- (float)filterContiguousData:(float *)data numFrames:(UInt32)numFrames channel:(UInt32)channel {
    // Init float to store RMS volume
    float rmsVolume = 0.0f;

    // Provide buffers for processing
    float tInputBuffer[numFrames + 2];
    float tOutputBuffer[numFrames + 2];

    // Copy the two frames we stored into the start of the input buffer, filling the rest with the current buffer data
    memcpy(tInputBuffer, gInputKeepBuffer[channel], 2 * sizeof(float));
    memcpy(tOutputBuffer, gOutputKeepBuffer[channel], 2 * sizeof(float));
    memcpy(&(tInputBuffer[2]), data, numFrames * sizeof(float));

    // Do the processing
    vDSP_deq22(tInputBuffer, 1, coefficients, tOutputBuffer, 1, numFrames);
    vDSP_rmsqv(tOutputBuffer, 1, &rmsVolume, numFrames);

    // Copy the last two data points of each array to the start of the next buffer.
    memcpy(gInputKeepBuffer[channel], &(tInputBuffer[numFrames]), 2 * sizeof(float));
    memcpy(gOutputKeepBuffer[channel], &(tOutputBuffer[numFrames]), 2 * sizeof(float));

    return rmsVolume;
}
As seen here, deq22 implements a biquad filter on a given input vector via a recursive difference equation. This is the description of the arguments from the docs:
A: single-precision real input vector.
IA: stride for A.
B: 5 single-precision inputs (filter coefficients), with stride 1.
C: single-precision real output vector.
IC: stride for C.
N: number of new output elements to produce.
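Concretely, per the documentation, deq22 evaluates the standard biquad difference equation, where the first two entries of A and C hold the two history samples from the previous call:

C[n] = B[0]*A[n] + B[1]*A[n-1] + B[2]*A[n-2] - B[3]*C[n-1] - B[4]*C[n-2], for n = 2 ... N+1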
This is what I have so far (it's in Swift, like the rest of the codebase that I already have running on Android):
// N is fixed on init to be the same size as buffer.count, below
// 'input' and 'output' are initialised with (N+2) length and filled with 0s
func getFilteredRMSMagnitudeFromBuffer(var buffer: [Float]) -> Float {
    let inputStride = 1 // hardcoded for now
    let outputStride = 1

    input[0] = input[N]
    input[1] = input[N+1]
    output[0] = output[N]
    output[1] = output[N+1]

    // copy the current buffer into input
    input[2 ... N+1] = buffer[0 ..< N]

    // Not sure if this is necessary, just here to duplicate NVDSP behaviour:
    output[2 ... N+1] = [Float](count: N, repeatedValue: 0)[0 ..< N]

    // Again duplicating NVDSP behaviour, can probably just start at 0:
    var sumOfSquares = (input[0] * input[0]) + (input[1] * input[1])

    for n in (2 ... N+1) {
        let sumG = (0...2).reduce(Float(0)) { total, p in
            return total + input[(n - p) * inputStride] * coefficients[p]
        }
        let sumH = (3...4).reduce(Float(0)) { total, p in
            return total + output[(n - p + 2) * outputStride] * coefficients[p]
        }
        let filteredFrame = sumG - sumH
        output[n] = filteredFrame
        sumOfSquares = filteredFrame * filteredFrame
    }

    let meanSquare = sumOfSquares / Float(N + 2) // we added 2 values by hand, before the loop
    let rootMeanSquare = sqrt(meanSquare)
    return rootMeanSquare
}
The filter gives a different magnitude output to deq22 though, and seems to have a cyclic 'warbling' noise in it (with a constant input tone, that frequency's magnitude pumps up and down).
I've checked to ensure the coefficients arrays are identical between the two implementations. Each filter actually seems to "work", in that it picks up the correct frequency (and only that frequency); it's just this pumping, and the RMS magnitude output being a lot quieter than vDSP's, often by orders of magnitude:
Naive       | vDSP
3.24305e-06 | 0.000108608
1.57104e-06 | 5.53645e-05
1.96445e-06 | 4.33506e-05
2.05422e-06 | 2.09781e-05
1.44778e-06 | 1.8729e-05
4.28997e-07 | 2.72648e-05
Can anybody see an issue with my logic?
Edit: here is a GIF of the result with a constant 440Hz tone. The various green bars are the individual filter bands. The 3rd band (shown here) is the one tuned to 440Hz.
The NVDSP version just shows a constant (non-fluctuating) magnitude reading proportional to the input volume, as expected.
OK, the line sumOfSquares = filteredFrame * filteredFrame should be a +=, not an assignment. So only the last frame was being included in the magnitude, which explains a lot ;)
Feel free to use this if you want to do some biquad filtering in Swift. MIT Licence like NVDSP before it.
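For reference, below is a minimal self-contained sketch with that fix applied; the function signature is an assumption (the original keeps input, output, coefficients and N as instance state), but the loop body matches the code above:

// Biquad filter pass + RMS; the only functional change is the '+='
func filteredRMS(input: [Float], output: inout [Float],
                 coefficients: [Float], N: Int) -> Float {
    // the first two entries of input/output are history from the previous buffer
    var sumOfSquares = (input[0] * input[0]) + (input[1] * input[1])
    for n in 2 ... N + 1 {
        // feed-forward taps: B[0]*A[n] + B[1]*A[n-1] + B[2]*A[n-2]
        let sumG = (0...2).reduce(Float(0)) { $0 + input[n - $1] * coefficients[$1] }
        // feedback taps: B[3]*C[n-1] + B[4]*C[n-2]
        let sumH = (3...4).reduce(Float(0)) { $0 + output[n - $1 + 2] * coefficients[$1] }
        let filteredFrame = sumG - sumH
        output[n] = filteredFrame
        sumOfSquares += filteredFrame * filteredFrame // '+=', not '='
    }
    return (sumOfSquares / Float(N + 2)).squareRoot()
}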

How to do Weighted Averaging of n consecutive values in an Array

I have a 900×1 vector of values (in MATLAB). Every 9 consecutive values should be averaged (without overlap), resulting in a 100×1 vector of values. The problem is that the averaging should be weighted, based on a weighting vector of [1 2 1; 2 4 2; 1 2 1]. Is there any efficient way to do that averaging? I've heard about the conv function in MATLAB; is it helpful?
conv works by sliding a kernel through your data. But in your case, you need the kernel to jump through your data in non-overlapping blocks, so I don't think conv will work for you.
If you want to use existing MATLAB functions, you can do this (I have to assume your 3×3 weighting matrix is meant to be used as a single 9×1 vector):
kernel = [1; 2; 1; 2; 4; 2; 1; 2; 1];
in_matrix = reshape(in_matrix, 9, 100);
base = sum(kernel);
out_matrix = bsxfun(@times, in_matrix, kernel);
result = sum(out_matrix, 1)/base;
I don't know if there is any clever way to speed this up. bsxfun allows singleton expansion, but maybe not dimension reduction.
A faster way would be to use MEX. Open a new file in the editor, paste the following code, and save the file as weighted_average.c.
#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
double *in_matrix, *kernel, *out_matrix, base;
int niter;
size_t nrows_data, nrows_kernel;
/* Get number of element along first dimension of input matrix. */
nrows_kernel = mxGetM(prhs[1]);
nrows_data = mxGetM(prhs[0]);
/* Create output matrix*/
plhs[0] = mxCreateDoubleMatrix((mwSize)nrows_data/nrows_kernel,1,mxREAL);
/* Get a pointer to the real data */
in_matrix = mxGetPr(prhs[0]);
kernel = mxGetPr(prhs[1]);
out_matrix = mxGetPr(plhs[0]);
/* Sum the elements in weighting array */
base = 0;
for (int i = 0; i < nrows_kernel; i +=1)
{
base += kernel[i];
}
/* Perform calculation */
niter = nrows_data/nrows_kernel;
for (int i = 0; i < niter ; i += 1)
{
for (int j = 0; j < nrows_kernel; j += 1)
{
out_matrix[i] += in_matrix[i*nrows_kernel+j]*kernel[j];
}
out_matrix[i] /= base;
}
}
Then, in the command window, type:
mex weighted_average.c
To use it:
result = weighted_average(input, kernel);
Note that both input and kernel have to be M×1 matrices. On my computer, the first method took 0.0012 seconds and the second took 0.00007 seconds. That's more than an order of magnitude faster than the first method.

Operations on a 2D array in a CUDA kernel for MATLAB

suppose I have the following serial C:
int add(int **a, int **b, int n)
{
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < n; j++)
        {
            a[i][j] += b[i][j];
        }
    }
    return 0;
}
I think the best way to parallelise it is to realise it is a 2D problem and use 2D thread blocks, as per "CUDA kernel - nested for loop".
With that in mind, I started writing my CUDA kernel like this:
__global__ void calc(int **A, int **B, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    if (i >= n || j >= n)
        return;

    A[i][j] += B[i][j];
}
nvcc tells me that:
./addm.cu(13): Warning: Cannot tell what pointer points to, assuming global memory space
./addm.cu(13): Warning: Cannot tell what pointer points to, assuming global memory space
./addm.cu(13): Warning: Cannot tell what pointer points to, assuming global memory space
1) Am I correct in my philosophy?
2) I think I understand blocks, threads, etc., but I don't understand what
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
does.
3) Is this the most efficient/fastest way of performing operations on a 2D array in general? i.e. not just matrix addition, it could be any "element by element" operation.
4) Will I be able to call it from MATLAB? Normally it freaks out when the prototype is of the form type** var.
Thanks guys
The compiler warnings you are getting come from the fact that on older GPUs, the memory structure is not "flat". The compiler can't know which memory space the addresses held by your kernel's pointer arrays lie in, so it warns you that it is assuming the operation is being performed in global memory. If you compile the code for a Fermi card (sm_20 or sm_21 architecture), you won't see that warning, because the memory model on those cards is "flat" and pointers are correctly interpreted by the hardware at runtime, so the compiler doesn't need to handle it at compile time.
To answer each of your questions:
Yes. And no. The overall idea is about 90% right, but there are several implementation issues which will become apparent from the answers which follow.
CUDA C has built-in variables to allow each thread to determine its "coordinates" in the execution grid in which it is running, and the dimensions of each block and of the grid itself. threadIdx.{xyz} provides the thread coordinates within a block, and blockIdx.{xyz} the block coordinates within the grid. blockDim.{xyz} and gridDim.{xyz} provide the dimensions of the block and the grid, respectively (note that not all hardware supports 3D grids). CUDA uses column-major order for numbering threads within each block and blocks within each grid. The calculation you are querying computes the equivalent {i,j} coordinate in a 2D grid from the thread and block coordinates and the block size. This is discussed in some detail in the first few pages of the "Programming model" chapter of the CUDA programming guide.
No, and I say that for two reasons.
Firstly, using arrays of pointers for memory access is not a good idea in CUDA. Two levels of pointer indirection hugely increase the latency penalty of getting to your data. The key difference between a typical GPU architecture and a modern CPU architecture is the memory system: GPUs have stunningly high peak memory bandwidth but very high access latency, whereas CPUs are designed for minimal latency. So having to read and dereference two pointers to fetch a value from memory is a very big performance penalty. Store your 2D array or matrix in linear memory instead. This is what BLAS, LAPACK, and Matlab do anyway.
Secondly, every thread in your code is performing four integer arithmetic operations of setup overhead (the index calculations) for every one "productive" integer operation (the addition). There are strategies to reduce that, usually involving having each thread process more than one array element.
If I were to write a kernel for that operation, I would do it something like the code at the bottom of my answer. This uses linear memory and a 1D grid: a suitable number of threads to properly occupy the GPU processes the whole input array, with each thread processing many inputs.
No. As I mentioned earlier in my answer, Matlab uses linear memory to store matrices, not an array of pointers. This doesn't match the layout your kernel code is expecting.
Sample code:
__global__ void calc(int *A, int *B, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int s = blockDim.x * gridDim.x; // total threads in the grid

    // grid-stride loop: each thread handles many elements
    for (; i < N; i += s) {
        A[i] += B[i];
    }
}
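A host-side launch for this might look like the following sketch (the 128×256 launch shape and the A_d/B_d device-pointer names are illustrative assumptions, not part of the answer above):

int N = 1 << 20;   // number of elements
int threads = 256; // threads per block
int blocks = 128;  // enough blocks to occupy the GPU; the stride loop covers the rest
calc<<<blocks, threads>>>(A_d, B_d, N); // A_d, B_d: device pointers to N ints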
I am assuming you are working with an n-by-n array in row-major order. Try the following:
__global__ void calc(int *A, int *B, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    if (i < n && j < n) {
        A[i*n + j] += B[i*n + j];
    }
}
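For completeness, a sketch of how this 2D version might be launched (the 16×16 block shape and the A_d/B_d names are assumptions for illustration):

dim3 block(16, 16);
dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y); // round up to cover all n
calc<<<grid, block>>>(A_d, B_d, n); // A_d, B_d: device pointers to n*n ints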