What about this OpenCL kernel is causing the error CL_INVALID_COMMAND_QUEUE


I'm having a problem implementing a Feed-Forward MultiLayer Perceptron, with back-prop learning in OpenCL in Java, using JOCL. Here is the kernel code for the calculation phase:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void Neuron(__global const double *inputPatterns,
                     __global double *weights,
                     __global const int *numInputs,
                     __global const int *activation,
                     __global const double *bias,
                     __global const int *usingBias,
                     __global double *values,
                     __global const int *maxNumFloats,
                     __global const int *patternIndex,
                     __global const int *inputPatternSize,
                     __global const int *indexOffset,
                     __global const int *isInputNeuron,
                     __global const int *inputs)
{
    int gid = get_global_id(0);
    double sum = 0.0;
    for(int i = 0; i < numInputs[gid+indexOffset[0]]; i++)
    {
        sum += values[inputs[(gid+indexOffset[0]) * maxNumFloats[0] + i]] *
               weights[(gid+indexOffset[0]) * maxNumFloats[0] + i];
    }
    if(usingBias[gid+indexOffset[0]])
        sum += bias[gid+indexOffset[0]];
    if(isInputNeuron[gid+indexOffset[0]])
        sum += inputPatterns[gid+indexOffset[0]+(patternIndex[0] * inputPatternSize[0])];
    if(activation[gid+indexOffset[0]] == 1)
        sum = 1.0 / (1.0 + exp(-sum));
    values[gid + indexOffset[0]] = sum;
}
Basically, I run this kernel for each layer in the network. For the first layer, there are no "inputs", so the loop does not execute. As the first layer is an input node layer however, it does add the relevant value from the input pattern. This executes fine, and I can read back the values at this point.
When I try to run the SECOND layer, however (which does have inputs: every node from the first layer), a call to clFinish() returns the error CL_INVALID_COMMAND_QUEUE. Sometimes this error is accompanied by a driver crash and recovery. I have read (here, for example) that this might be a problem with TDR timeouts, and I have tried raising the limit, but I'm unsure whether this makes any difference.
I'm going through the calls to clSetKernelArg() to check for anything stupid, but can anyone spot anything obviously off in the code? It would seem that the error is introduced in the second layer by the for loop. I can clarify any of the parameters if needed, but it seemed a bit overkill for an initial post.
Also, I'm fully aware this code will probably be an affront to competent coders everywhere, but feel free to flame :P
EDIT: Host code:
//Calc
for(int k = 0; k < GPUTickList.length; k++)
{
    clFlush(clCommandQueue);
    clFinish(clCommandQueue);
    //If input nodes
    if(k == 0)
        //Set index offset to 0
        GPUMapIndexOffset.asIntBuffer().put(0, 0);
    else
        //Update index offset
        GPUMapIndexOffset.asIntBuffer().put(0,
            GPUMapIndexOffset.asIntBuffer().get(0) + GPUTickList[k-1]);
    //Write index offset to GPU buffer
    ret = clEnqueueWriteBuffer(clCommandQueue, memObjects[12], CL_TRUE, 0,
        Sizeof.cl_int, Pointer.to(GPUMapIndexOffset.position(0)), 0, null, null);
    //Set work size (width of layer)
    global_work_size[0] = GPUTickList[k];
    ret = clEnqueueNDRangeKernel(clCommandQueue, kernel_iterate, 1,
        global_work_offset, global_work_size, local_work_size,
        0, null, null);
}
EDIT 2: I've uploaded the full code to pastebin.

Solved. Fixed the error by making everything indexed with [0] a straight kernel parameter, rather than a buffer. Clearly the hardware doesn't like lots of stuff accessing one particular element of a buffer at once.

I'm not sure what you have above the loop. Do you use the queue anywhere other than in this loop? Below is something you may want to try: wait on the event of each kernel launch instead of flushing and finishing the queue on every iteration.
//flush + finish if you need to before the loop, otherwise remove these lines
clFlush(clCommandQueue);
clFinish(clCommandQueue);
//Calc
for(int k = 0; k < GPUTickList.length; k++)
{
    //If input nodes
    if(k == 0)
        //Set index offset to 0
        GPUMapIndexOffset.asIntBuffer().put(0, 0);
    else
        //Update index offset
        GPUMapIndexOffset.asIntBuffer().put(0,
            GPUMapIndexOffset.asIntBuffer().get(0) + GPUTickList[k-1]);
    //Write index offset to GPU buffer
    ret = clEnqueueWriteBuffer(clCommandQueue, memObjects[12], CL_TRUE, 0,
        Sizeof.cl_int, Pointer.to(GPUMapIndexOffset.position(0)), 0, null, null);
    //Set work size (width of layer)
    global_work_size[0] = GPUTickList[k];
    //In JOCL the event is an object created on the host (there is no & operator in Java)
    cl_event latestEvent = new cl_event();
    ret = clEnqueueNDRangeKernel(clCommandQueue, kernel_iterate, 1,
        global_work_offset, global_work_size, local_work_size,
        0, null, latestEvent);
    //Block until this layer has finished before starting the next one
    clWaitForEvents(1, new cl_event[]{ latestEvent });
    clReleaseEvent(latestEvent);
}

Related

DPDK implementation of MPSC ring buffer

While going through the implementation of the DPDK MPSC (multi-producer, single-consumer) ring buffer API, I found the code that moves the producer's head when new elements are inserted into the ring buffer. The function is as follows:
static __rte_always_inline unsigned int
__rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
        unsigned int n, enum rte_ring_queue_behavior behavior,
        uint32_t *old_head, uint32_t *new_head,
        uint32_t *free_entries)
{
    const uint32_t capacity = r->capacity;
    uint32_t cons_tail;
    unsigned int max = n;
    int success;
    *old_head = __atomic_load_n(&r->prod.head, __ATOMIC_RELAXED);
    do {
        /* Reset n to the initial burst count */
        n = max;
        /* Ensure the head is read before tail */
        __atomic_thread_fence(__ATOMIC_ACQUIRE);
        /* load-acquire synchronize with store-release of ht->tail
         * in update_tail.
         */
        cons_tail = __atomic_load_n(&r->cons.tail,
                __ATOMIC_ACQUIRE);
        /* The subtraction is done between two unsigned 32bits value
         * (the result is always modulo 32 bits even if we have
         * *old_head > cons_tail). So 'free_entries' is always between 0
         * and capacity (which is < size).
         */
        *free_entries = (capacity + cons_tail - *old_head);
        /* check that we have enough room in ring */
        if (unlikely(n > *free_entries))
            n = (behavior == RTE_RING_QUEUE_FIXED) ?
                    0 : *free_entries;
        if (n == 0)
            return 0;
        *new_head = *old_head + n;
        if (is_sp)
            r->prod.head = *new_head, success = 1;
        else
            /* on failure, *old_head is updated */
            success = __atomic_compare_exchange_n(&r->prod.head,
                    old_head, *new_head,
                    0, __ATOMIC_RELAXED,
                    __ATOMIC_RELAXED);
    } while (unlikely(success == 0));
    return n;
}
The load and the compare-exchange of the producer's head both use __ATOMIC_RELAXED memory ordering. Isn't this a problem when multiple producers on different threads enqueue to the ring? Or am I missing something?
https://doc.dpdk.org/guides/prog_guide/ring_lib.html describes the basic mechanism that DPDK uses for implementing the Ring buffer.
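The comment about unsigned subtraction inside the function can be checked in isolation. The following is only a sketch of that arithmetic, with no atomics and no ring structure; the helper names ring_free_entries and ring_reservable are invented here and are not DPDK API, and the sketch says nothing about the memory-ordering question itself.

```c
#include <stdint.h>

/* Hypothetical helper isolating the free-space arithmetic from
 * __rte_ring_move_prod_head. */
static uint32_t ring_free_entries(uint32_t capacity, uint32_t cons_tail,
                                  uint32_t prod_head)
{
    /* The result is reduced modulo 2^32, so it stays correct even after
     * prod_head has wrapped past zero while cons_tail has not. */
    return (uint32_t)(capacity + cons_tail - prod_head);
}

/* How many of the n requested slots can be reserved. A non-zero `fixed`
 * mimics RTE_RING_QUEUE_FIXED (all or nothing); zero mimics the
 * variable behaviour (take whatever fits). */
static uint32_t ring_reservable(uint32_t n, uint32_t free_entries, int fixed)
{
    if (n > free_entries)
        return fixed ? 0 : free_entries;
    return n;
}
```

For example, with capacity 7, a consumer tail of 0xFFFFFFFE and a producer head of 2 (the head has wrapped, and 4 entries are in use), ring_free_entries returns 3.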

Cant rewrite variable after assigned as infinity Ansi C

I have a double array ypos[], and in some cases -1.#INF00 is written into it instead of a finite double. I need to replace these values with 0.0, but I can't simply assign zero: the application crashes every time. Is there a way to solve this problem?

Please see http://codepad.org/KKYLhbkh for a simple example of how things "should work". Since you have not provided the code that doesn't work, we're guessing here.
#include <stdio.h>
#include <math.h>   /* isinf() and the INFINITY macro (C99) */
int main(void) {
    double a[5] = {INFINITY, 1, 2, INFINITY, 4};
    int ii;
    for(ii = 0; ii < 5; ii++) {
        printf("before: a[%d] is %f\n", ii, a[ii]);
        /* overwrite +Inf or -Inf with zero */
        if(isinf(a[ii])) a[ii] = 0.0;
        printf("after: a[%d] is %f\n", ii, a[ii]);
    }
    return 0;
}
As was pointed out in @doynax's comment, you might want to disable floating-point exceptions to stop them from making your program keel over.
Edit: if your problem is caused by taking the logarithm of a number outside log's domain (i.e. log(x) with x <= 0), the following code might help:
#include <stdio.h>
#include <signal.h>
#include <math.h>   /* log(), isinf(), isnan() and the INFINITY macro (C99) */
int main(void) {
    double a[5] = {INFINITY, 1, 2, INFINITY, 4};
    int ii;
    /* ignore floating-point exception signals instead of terminating */
    signal(SIGFPE, SIG_IGN);
    a[2] = log(-1.0);
    for(ii = 0; ii < 5; ii++) {
        printf("before: a[%d] is %f\n", ii, a[ii]);
        if(isinf(a[ii]) || isnan(a[ii])) a[ii] = 0.0;
        printf("after: a[%d] is %f\n", ii, a[ii]);
    }
    return 0;
}
You get a different value for log(0.0) (namely -Inf) and for log(-1.0) (namely nan). The above code shows how to deal with either of these.
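Both cases can also be covered by a single test. A minimal sketch, assuming a C99 compiler (isfinite() is a standard <math.h> macro); the helper name zero_non_finite is made up for this example:

```c
#include <math.h>
#include <stddef.h>

/* Replaces every non-finite element (+Inf, -Inf or NaN) with 0.0 and
 * returns how many elements were replaced. Hypothetical helper name. */
static size_t zero_non_finite(double *v, size_t n)
{
    size_t replaced = 0;
    for (size_t i = 0; i < n; i++) {
        if (!isfinite(v[i])) {
            v[i] = 0.0;
            replaced++;
        }
    }
    return replaced;
}
```

Calling zero_non_finite(ypos, count) once after the values have been computed would then leave only finite numbers in the array.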

My guess is that your problem is somewhere else. Can you write 0.0 to the element before INF is written to it?

Sorry guys, I had been trying to solve this problem for about two hours before asking here, but I have since found a solution: comparing the value of the variable with -DBL_MAX. Thank you all for trying.
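The code for that comparison was not posted, so the following is only a guess at what it might look like. It needs nothing beyond <float.h>, so it also works with a C89 compiler that lacks isinf(); the function name clamp_infinite_to_zero is hypothetical.

```c
#include <float.h>

/* Returns 0.0 for -Inf and +Inf, and the input otherwise. NaN compares
 * false against everything, so a NaN is returned unchanged. */
static double clamp_infinite_to_zero(double x)
{
    if (x < -DBL_MAX || x > DBL_MAX)
        return 0.0;
    return x;
}
```

Only an infinity can be smaller than -DBL_MAX or larger than DBL_MAX, which is why the comparison works as an infinity test.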

Renderscript, forEach_root, and porting from OpenCL

1) How can I access in forEach_root() other elements except for the current one?
In OpenCL we have pointer to the first element and then can use get_global_id(0) to get current index. But we can still access all other elements. In Renderscript, do we only have pointer to the current element?
2) How can I loop through an Allocation in forEach_root()?
I have a code that uses nested (double) loop in java. Renderscript automates the outer loop, but I can't find any information on implementing the inner loop. Below is my best effort:
void root(const float3 *v_in, float3 *v_out) {
rs_allocation alloc = rsGetAllocation(v_in);
uint32_t cnt = rsAllocationGetDimX(alloc);
*v_out = 0;
for(int i=0; i<cnt; i++)
*v_out += v_in[i];
}
But here rsGetAllocation() fails when called from forEach_root().
05-11 21:31:29.639: E/RenderScript(17032): ScriptC::ptrToAllocation, failed to find 0x5beb1a40
Just in case I add my OpenCL code that works great under Windows. I'm trying to port it to Renderscript
typedef float4 wType;
__kernel void gravity_kernel(__global wType *src,__global wType *dst)
{
int id = get_global_id(0);
int count = get_global_size(0);
double4 tmp = 0;
for(int i=0;i<count;i++) {
float4 diff = src[i] - src[id];
float sq_dist = dot(diff, diff);
float4 norm = normalize(diff);
if (sq_dist<0.5/60)
tmp += convert_double4(norm*sq_dist);
else
tmp += convert_double4(norm/sq_dist);
}
dst[id] = convert_float4(tmp);
}

You can pass data to the script separately from the arguments of your root function. In the current Android version (4.2) you could do the following (an example from an image-processing scenario):
Renderscript snippet:
#pragma version(1)
#pragma rs java_package_name(com.example.renderscripttests)
//Define global variables in your renderscript:
rs_allocation pixels;
int width;
int height;
// And access these in your root function via rsGetElementAt(pixels, px, py)
void root(uchar4 *v_out, uint32_t x, uint32_t y)
{
    for(int px = 0; px < width; ++px)
        for(int py = 0; py < height; ++py)
        {
            // unpack a color to a float4; each element of 'pixels' is a uchar4
            float4 f4 = rsUnpackColor8888(*(const uchar4*)rsGetElementAt(pixels, px, py));
            ...
Java file snippet
// In your java file, create a renderscript:
RenderScript renderscript = RenderScript.create(this);
ScriptC_myscript script = new ScriptC_myscript(renderscript);
// Create Allocations for in- and output (As input the bitmap 'bitmapIn' should be used):
Allocation pixelsIn = Allocation.createFromBitmap(renderscript, bitmapIn,
Allocation.MipmapControl.MIPMAP_NONE, Allocation.USAGE_SCRIPT);
Allocation pixelsOut = Allocation.createTyped(renderscript, pixelsIn.getType());
// Set width, height and pixels in the script:
script.set_width(640);
script.set_height(480);
script.set_pixels(pixelsIn);
// Call the for each loop:
script.forEach_root(pixelsOut);
// Copy Allocation to the bitmap 'bitmapOut':
pixelsOut.copyTo(bitmapOut);
You can see, the input 'pixelsIn' is previously set and used inside the renderscript when calling the forEach_root function to calculate values for 'pixelsOut'. Also width and height are previously set.

Expression result unused

I've inherited some code and I'm trying to fix some compiler errors and warnings:
StkFrames& PRCRev :: tick( StkFrames& frames, unsigned int channel )
{
#if defined(_STK_DEBUG_)
if ( channel >= frames.channels() - 1 ) {
errorString_ << "PRCRev::tick(): channel and StkFrames arguments are incompatible!";
handleError( StkError::FUNCTION_ARGUMENT );
}
#endif
StkFloat *samples = &frames[channel];
unsigned int hop = frames.channels();
for ( unsigned int i=0; i<frames.frames(); i++, samples += hop ) {
*samples = tick( *samples );
*samples++; <<<<<<<<<--------- Expression result unused.
*samples = lastFrame_[1];
}
return frames;
}
I don't understand what the code is trying to do. The code base is huge and I have already fixed quite a few problems, but googling didn't help with this one.
Any ideas?

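The warning is about this line:
*samples++;
Postfix ++ binds tighter than unary *, so it parses as *(samples++): the pointer is advanced by one element, and the value it pointed to is read and immediately discarded. That discarded read is the unused expression result; writing samples++; instead has the same effect without the warning.
The next line then writes through the advanced pointer, i.e. into the following channel:
*samples = lastFrame_[1];
I recommend reading the code inside the 'for' loop more carefully. Together with the samples += hop in the loop header, the pointer moves frames.channels() + 1 elements per iteration, which doesn't look very logical.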
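A small self-contained C sketch of the same pattern, with made-up function names, showing that the warned-about statement and a plain increment behave identically:

```c
/* Mirrors the original statement. The (void) cast only silences the
 * warning; the expression still parses as *(p++). */
static int *store_in_next_original(int *p, int value)
{
    (void)*p++;   /* p advances; the value read through it is discarded */
    *p = value;   /* written to the element after the one p started at */
    return p;
}

/* Warning-free equivalent: increment only, no pointless dereference. */
static int *store_in_next_fixed(int *p, int value)
{
    p++;
    *p = value;
    return p;
}
```

Given int a[3] = {1, 2, 3}, either call with a and 9 leaves a as {1, 9, 3} and returns a + 1.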

Fastest way of bitwise AND between two arrays on iPhone?

I have two image blocks stored as 1D arrays and have to do the following bitwise AND operations on their elements.
int compare(unsigned char *a, int a_pitch,
unsigned char *b, int b_pitch, int a_lenx, int a_leny)
{
int overlap =0 ;
for(int y=0; y<a_leny; y++)
for(int x=0; x<a_lenx; x++)
{
if(a[x + y * a_pitch] & b[x+y*b_pitch])
overlap++ ;
}
return overlap ;
}
Actually, I have to do this job about 220,000 times, so it becomes very slow on iphone devices.
How could I accelerate this job on iPhone ?
I heard that NEON could be useful, but I'm not really familiar with it. In addition it seems that NEON doesn't have bitwise AND...

Option 1 - Work in the native width of your platform (it's faster to fetch 32-bits into a register and then do operations on that register than it is to fetch and compare data one byte at a time):
#include <stdint.h>
int compare(unsigned char *a, int a_pitch,
            unsigned char *b, int b_pitch, int a_lenx, int a_leny)
{
    int overlap = 0;
    /* assumes a and b are 4-byte aligned and that a_pitch, b_pitch and
       a_lenx are all multiples of 4 */
    uint32_t* a_int = (uint32_t*)a;
    uint32_t* b_int = (uint32_t*)b;
    /* widths and pitches are now counted in 32-bit words;
       the number of rows (a_leny) stays the same */
    int a_lenx_int = a_lenx / 4;
    int a_pitch_int = a_pitch / 4;
    int b_pitch_int = b_pitch / 4;
    for(int y=0; y<a_leny; y++)
        for(int x=0; x<a_lenx_int; x++)
        {
            uint32_t aVal = a_int[x + y * a_pitch_int];
            uint32_t bVal = b_int[x + y * b_pitch_int];
            /* test each of the four bytes in the word separately */
            if ((aVal & 0xFF) & (bVal & 0xFF))
                overlap++;
            if (((aVal >> 8) & 0xFF) & ((bVal >> 8) & 0xFF))
                overlap++;
            if (((aVal >> 16) & 0xFF) & ((bVal >> 16) & 0xFF))
                overlap++;
            if (((aVal >> 24) & 0xFF) & ((bVal >> 24) & 0xFF))
                overlap++;
        }
    return overlap;
}
Option 2 - Use a heuristic to get an approximate result using fewer calculations (a good approach if the absolute difference between 101 overlaps and 100 overlaps is not important to your application):
int compare(unsigned char *a, int a_pitch,
unsigned char *b, int b_pitch, int a_lenx, int a_leny)
{
int overlap =0 ;
for(int y=0; y<a_leny; y+= 10)
for(int x=0; x<a_lenx; x+= 10)
{
//we compare 1% of all the pixels, and use that as the result
if(a[x + y * a_pitch] & b[x+y*b_pitch])
overlap++ ;
}
return overlap * 100;
}
Option 3 - Rewrite your function in inline assembly code. You're on your own for this one.

Your code is Rambo for the CPU - its worst nightmare :
byte access. Like aroth mentioned, ARM is VERY slow reading bytes from memory
random access. Two absolutely unnecessary multiply/add operations in addition to the already steep performance penalty by its nature.
Simply put, everything is wrong that can be wrong.
Don't call me rude. Let me be your angel instead.
First, I'll provide you a working NEON version. Then an optimized C version showing you exactly what you did wrong.
Just give me some time. I have to go to bed right now, and I have an important meeting tomorrow.
Why don't you learn ARM assembly? It's much easier and useful than x86 assembly.
It will also improve your C programming capabilities by a huge step.
Strongly recommended
cya
==============================================================================
Ok, here is an optimized version written in C with ARM assembly in mind.
Please note that both the pitches AND a_lenx have to be multiples of 4. Otherwise, it won't work properly.
There isn't much room left for optimizations with ARM assembly upon this version. (NEON is a different story - coming soon)
Take a careful look at how to handle variable declarations, loop, memory access, and AND operations.
And make sure that this function runs in ARM mode and not Thumb for best results.
unsigned int compare(unsigned int *a, unsigned int a_pitch,
unsigned int *b, unsigned int b_pitch, unsigned int a_lenx, unsigned int a_leny)
{
unsigned int overlap =0;
unsigned int a_gap = (a_pitch - a_lenx)>>2;
unsigned int b_gap = (b_pitch - a_lenx)>>2;
unsigned int aval, bval, xcount;
do
{
xcount = (a_lenx>>2);
do
{
aval = *a++;
// ldr aval, [a], #4
bval = *b++;
// ldr bval, [b], #4
aval &= bval;
// and aval, aval, bval
if (aval & 0x000000ff) overlap += 1;
// tst aval, #0x000000ff
// addne overlap, overlap, #1
if (aval & 0x0000ff00) overlap += 1;
// tst aval, #0x0000ff00
// addne overlap, overlap, #1
if (aval & 0x00ff0000) overlap += 1;
// tst aval, #0x00ff0000
// addne overlap, overlap, #1
if (aval & 0xff000000) overlap += 1;
// tst aval, #0xff000000
// addne overlap, overlap, #1
} while (--xcount);
a += a_gap;
b += b_gap;
} while (--a_leny);
return overlap;
}

First of all, why the double loop? You can do it with a single loop and a couple of pointers.
Also, you don't need to calculate x+y*pitch for every single pixel; just increment two pointers by one. Incrementing by one is a lot faster than x+y*pitch.
Why exactly do you need to perform this operation? I would make sure there are no high-level optimizations/changes available before looking into a low-level solution like NEON.
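A minimal sketch of the pointer-walking idea, kept byte-wise so that it returns exactly what the original compare() returns. The name compare_rows is made up; the row loop is kept because the two pitches can differ from the row length, but the index arithmetic is done once per row rather than once per pixel.

```c
#include <stddef.h>

/* Counts the positions where a and b have at least one set bit in common. */
static int compare_rows(const unsigned char *a, int a_pitch,
                        const unsigned char *b, int b_pitch,
                        int a_lenx, int a_leny)
{
    int overlap = 0;
    for (int y = 0; y < a_leny; y++) {
        /* one multiply per row to find the row start */
        const unsigned char *pa = a + (size_t)y * (size_t)a_pitch;
        const unsigned char *pb = b + (size_t)y * (size_t)b_pitch;
        const unsigned char *row_end = pa + a_lenx;
        /* inside the row both pointers simply step by one byte */
        while (pa < row_end) {
            if (*pa++ & *pb++)
                overlap++;
        }
    }
    return overlap;
}
```

Whether this alone is fast enough would have to be measured on the device; it can be combined with the word-wise reads from the other answers.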
