Renderscript, forEach_root, and porting from OpenCL - renderscript

1) How can I access in forEach_root() other elements except for the current one?
In OpenCL we have pointer to the first element and then can use get_global_id(0) to get current index. But we can still access all other elements. In Renderscript, do we only have pointer to the current element?
2) How can I loop through an Allocation in forEach_root()?
I have a code that uses nested (double) loop in java. Renderscript automates the outer loop, but I can't find any information on implementing the inner loop. Below is my best effort:
void root(const float3 *v_in, float3 *v_out) {
rs_allocation alloc = rsGetAllocation(v_in);
uint32_t cnt = rsAllocationGetDimX(alloc);
*v_out = 0;
for(int i=0; i<cnt; i++)
*v_out += v_in[i];
}
But here rsGetAllocation() fails when called from forEach_root().
05-11 21:31:29.639: E/RenderScript(17032): ScriptC::ptrToAllocation, failed to find 0x5beb1a40
Just in case I add my OpenCL code that works great under Windows. I'm trying to port it to Renderscript
typedef float4 wType;
__kernel void gravity_kernel(__global wType *src,__global wType *dst)
{
int id = get_global_id(0);
int count = get_global_size(0);
double4 tmp = 0;
for(int i=0;i<count;i++) {
float4 diff = src[i] - src[id];
float sq_dist = dot(diff, diff);
float4 norm = normalize(diff);
if (sq_dist<0.5/60)
tmp += convert_double4(norm*sq_dist);
else
tmp += convert_double4(norm/sq_dist);
}
dst[id] = convert_float4(tmp);
}

You can provide data apart from your root function. In the current android version (4.2) you could do the following (It is an example from an image processing scenario):
Renderscript snippet:
#pragma version(1)
#pragma rs java_package_name(com.example.renderscripttests)
//Define global variables in your renderscript:
rs_allocation pixels;
int width;
int height;
// And access these in your root function via rsGetElementAt(pixels, px, py)
void root(uchar4 *v_out, uint32_t x, uint32_t y)
{
for(int px = 0; px < width; ++px)
for(int py = 0; py < height; ++py)
{
// unpack a color to a float4
float4 f4 = rsUnpackColor8888(*(uchar*)rsGetElementAt(pixels, px, py));
...
Java file snippet
// In your java file, create a renderscript:
RenderScript renderscript = RenderScript.create(this);
ScriptC_myscript script = new ScriptC_myscript(renderscript);
// Create Allocations for in- and output (As input the bitmap 'bitmapIn' should be used):
Allocation pixelsIn = Allocation.createFromBitmap(renderscript, bitmapIn,
Allocation.MipmapControl.MIPMAP_NONE, Allocation.USAGE_SCRIPT);
Allocation pixelsOut = Allocation.createTyped(renderscript, pixelsIn.getType());
// Set width, height and pixels in the script:
script.set_width(640);
script.set_height(480);
script.set_pixels(pixelsIn);
// Call the for each loop:
script.forEach_root(pixelsOut);
// Copy Allocation to the bitmap 'bitmapOut':
pixelsOut.copyTo(bitmapOut);
You can see, the input 'pixelsIn' is previously set and used inside the renderscript when calling the forEach_root function to calculate values for 'pixelsOut'. Also width and height are previously set.

Related

Visualize PCD containing custom double point structure

I have created a custom double point-type for storing the point position in the PCD file. I required the double data type since my points are in global coordinates and have very large values (of order 10^6 to 10^7) and require good precision. Since the values are large and the default FLOAT32 precision is limited, there is considerable data approximation which is also visible during visualization.
I created this PCD by transforming the raw pointcloud with the initial global reference coordinate from GPS in the data bag that I have. I am using a 15 point precision.
I created a separate script for visualizing this custom point-type PCD. But by visually comparing, I cannot see any considerable difference between the FLOAT32 and double data-type PCD's.
Raw_float_pcd_visualization
Transformed_float_pcd_visualization
Transformed_double_pcd_visualization
You can see that the transformed_double and transformed_float PCD's are quite similar and approximated. While the raw_float PCD is quite good as compared to these two.
I am attaching the PCD files for reference:
raw_float
transformed_float
transformed_double
I think that I am skipping some things while loading the pointcloud and there are some more changes that need to be done in order to visualize the points with double point precision.
I used "pcl_viewer" from pcl_tools for visualizing FLOAT type PCD's.
Code for visualizaing custom DOUBLE point-structure PCD:
#define PCL_NO_PRECOMPILE
#include <iostream>
// #include "double_viz/pcl_double.h"
#include <pcl-1.7/pcl/common/common.h>
#include <pcl-1.7/pcl/io/pcd_io.h>
#include <pcl-1.7/pcl/visualization/pcl_visualizer.h>
#include <pcl-1.7/pcl/console/parse.h>
#include <pcl-1.7/pcl/point_cloud.h>
#include <pcl-1.7/pcl/point_types.h>
namespace pcl
{
#define PCL_ADD_UNION_POINT4D_DOUBLE \
union EIGEN_ALIGN16 { \
double data[4]; \
struct { \
double x; \
double y; \
double z; \
}; \
};
struct _PointXYZDouble
{
PCL_ADD_UNION_POINT4D_DOUBLE; // This adds the members x,y,z which can also be accessed using the point (which is float[4])
EIGEN_MAKE_ALIGNED_OPERATOR_NEW
};
struct EIGEN_ALIGN16 PointXYZDouble : public _PointXYZDouble
{
inline PointXYZDouble (const _PointXYZDouble &p)
{
x = p.x; y = p.y; z = p.z; data[3] = 1.0;
}
inline PointXYZDouble ()
{
x = y = z = 0.0;
data[3] = 1.0;
}
inline PointXYZDouble (double _x, double _y, double _z)
{
x = _x; y = _y; z = _z;
data[3] = 1.0;
}
EIGEN_MAKE_ALIGNED_OPERATOR_NEW
};
}
POINT_CLOUD_REGISTER_POINT_STRUCT (pcl::_PointXYZDouble,
(double, x, x)
(double, y, y)
(double, z, z)
)
POINT_CLOUD_REGISTER_POINT_WRAPPER(pcl::PointXYZDouble, pcl::_PointXYZDouble)
// This function displays the help
void
showHelp(char * program_name)
{
std::cout << std::endl;
std::cout << "Usage: " << program_name << " cloud_filename.[pcd]" << std::endl;
std::cout << "-h: Show this help." << std::endl;
}
// This is the main function
int
main (int argc, char** argv)
{
// Show help
if (pcl::console::find_switch (argc, argv, "-h") || pcl::console::find_switch (argc, argv, "--help"))
{
showHelp (argv[0]);
return 0;
}
// Fetch point cloud filename in arguments | Works with PCD
std::vector<int> filenames;
if (filenames.size () != 1)
{
filenames = pcl::console::parse_file_extension_argument (argc, argv, ".pcd");
if (filenames.size () != 1)
{
showHelp (argv[0]);
return -1;
}
}
// Load file | Works with PCD and PLY files
pcl::PointCloud<pcl::PointXYZDouble>::Ptr source_cloud (new pcl::PointCloud<pcl::PointXYZDouble> ());
if (pcl::io::loadPCDFile (argv[filenames[0]], *source_cloud) < 0)
{
std::cout << "Error loading point cloud " << argv[filenames[0]] << std::endl << std::endl;
showHelp (argv[0]);
return -1;
}
// Visualization
// printf( "\nPoint cloud colors : white = original point cloud\n"
// " red = transformed point cloud\n");
pcl::visualization::PCLVisualizer viewer ("Visualize double PCL");
// Define R,G,B colors for the point cloud
pcl::visualization::PointCloudColorHandlerCustom<pcl::PointXYZDouble> source_cloud_color_handler (source_cloud, 100, 100, 100);
// We add the point cloud to the viewer and pass the color handler
viewer.addPointCloud (source_cloud, source_cloud_color_handler, "original_cloud");
viewer.addCoordinateSystem (1.0, "cloud", 0);
viewer.setBackgroundColor(0.05, 0.05, 0.05, 0); // Setting background to a dark grey
viewer.setPointCloudRenderingProperties (pcl::visualization::PCL_VISUALIZER_OPACITY, 1, "original_cloud");
viewer.setPointCloudRenderingProperties (pcl::visualization::PCL_VISUALIZER_POINT_SIZE, 1, "original_cloud");
viewer.setPointCloudRenderingProperties (pcl::visualization::PCL_VISUALIZER_LINE_WIDTH, 1, "original_cloud");
//viewer.setPosition(800, 400); // Setting visualiser window position
while (!viewer.wasStopped ()) // Display the visualiser until 'q' key is pressed
{
viewer.spinOnce ();
}
return 0;
}
In the raw_float file, the size field has been defined as 4 bytes each: SIZE 4 4 4 4,
to be read as double it should be SIZE 8 8 8 8.
With your current implementation each field is being read as Float32

Extracting data from a matlab struct in mex

I'm following this example but I'm not sure what I missed. Specifically, I have this struct in MATLAB:
a = struct; a.one = 1.0; a.two = 2.0; a.three = 3.0; a.four = 4.0;
And this is my test code in MEX ---
First, I wanted to make sure that I'm passing in the right thing, so I did this check:
int nfields = mxGetNumberOfFields(prhs[0]);
mexPrintf("nfields =%i \n\n", nfields);
And it does yield 4, since I have four fields.
However, when I tried to extract the value in field three:
tmp = mxGetField(prhs[0], 0, "three");
mexPrintf("data =%f \n\n", (double *)mxGetData(tmp) );
It returns data =1.000000. I'm not sure what I did wrong. My logic is that I want to get the first element (hence index is 0) of the field three, so I expected data =3.00000.
Can I get a pointer or a hint?
EDITED
Ok, since you didn't provide your full code but you are working on a test, let's try to make a new one from scratch.
On Matlab side, use the following code:
a.one = 1;
a.two = 2;
a.three = 3;
a.four = 4;
read_struct(a);
Now, create and compile the MEX read_struct function as follows:
#include "mex.h"
void read_struct(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
if (nrhs != 1)
mexErrMsgTxt("One input argument required.");
/* Let's check if the input is a struct... */
if (!mxIsStruct(prhs[0]))
mexErrMsgTxt("The input must be a structure.");
int ne = mxGetNumberOfElements(prhs[0]);
int nf = mxGetNumberOfFields(prhs[0]);
mexPrintf("The structure contains %i elements and %i fields.\n", ne, nf);
mwIndex i;
mwIndex j;
mxArray *mxValue;
double *value;
for (i = 0; i < nf; ++i)
{
for (j = 0; j < ne; ++j)
{
mxValue = mxGetFieldByNumber(prhs[0], j, i);
value = mxGetPr(mxValue);
mexPrintf("Field %s(%d) = %.1f\n", mxGetFieldNameByNumber(prhs[0],i), j, value[0]);
}
}
return;
}
Does this correctly prints your structure?

Cairo: Draw an array of pixels

I'm just starting to explore Cairo, but right now I really want to use it for something very simple.
I have a very low-tech bitmap, i.e., a 3*X*Y array of numbers. I'd like to use Cairo to make this into a bitmap and write to a file. I'm looking through tutorials and I'm not seeing a way to use it for comparatively low-level functions like this.
I don't think I need guidance on how to use the tool once I know what the tool is.
I didn't actually test this, but the following should give you lots of useful hints:
#include <cairo.h>
#include <stdint.h>
#define WIDTH 42
#define HEIGHT 42
uint8_t data[WIDTH][HEIGHT][3];
cairo_surface_t* convert()
{
cairo_surface_t *result;
unsigned char *current_row;
int stride;
result = cairo_image_surface_create(CAIRO_FORMAT_RGB24, WIDTH, HEIGHT);
if (cairo_surface_status(result) != CAIRO_STATUS_SUCCESS)
return result;
cairo_surface_flush(result);
current_row = cairo_image_surface_get_data(result);
stride = cairo_image_surface_get_stride(result);
for (int y = 0; y < HEIGHT; y++) {
uint32_t *row = (void *) current_row;
for (int x = 0; x < WIDTH; x++) {
uint32_t r = data[x][y][0];
uint32_t g = data[x][y][1];
uint32_t b = data[x][y][2];
row[x] = (r << 16) | (g << 8) | b;
}
current_row += stride;
}
cairo_surface_mark_dirty(result);
return result;
}

FFTW with MEX and MATLAB argument issues

I wrote the following C/MEX code using the FFTW library to control the number of threads used for a FFT computation from MATLAB. The code works great (complex FFT forward and backward) with the FFTW_ESTIMATE argument in the planner although it is slower than MATLAB. But, when I switch to the FFTW_MEASURE argument to tune up the FFTW planner, it turns out that applying one FFT forward and then one FFT backward does not return the initial image. Instead, the image is scaled by a factor. Using FFTW_PATIENT gives me an even worse result with null matrices.
My code is as follows:
Matlab functions:
FFT forward:
function Y = fftNmx(X,NumCPU)
if nargin < 2
NumCPU = maxNumCompThreads;
disp('Warning: Use the max maxNumCompThreads');
end
Y = FFTN_mx(X,NumCPU)./numel(X);
FFT backward:
function Y = ifftNmx(X,NumCPU)
if nargin < 2
NumCPU = maxNumCompThreads;
disp('Warning: Use the max maxNumCompThreads');
end
Y = iFFTN_mx(X,NumCPU);
Mex functions:
FFT forward:
# include <string.h>
# include <stdlib.h>
# include <stdio.h>
# include <mex.h>
# include <matrix.h>
# include <math.h>
# include </home/nicolas/Code/C/lib/include/fftw3.h>
char *Wisfile = NULL;
char *Wistemplate = "%s/.fftwis";
#define WISLEN 8
void set_wisfile(void)
{
char *home;
if (Wisfile) return;
home = getenv("HOME");
Wisfile = (char *)malloc(strlen(home) + WISLEN + 1);
sprintf(Wisfile, Wistemplate, home);
}
fftw_plan CreatePlan(int NumDims, int N[], double *XReal, double *XImag, double *YReal, double *YImag)
{
fftw_plan Plan;
fftw_iodim Dim[NumDims];
int k, NumEl;
FILE *wisdom;
for(k = 0, NumEl = 1; k < NumDims; k++)
{
Dim[NumDims - k - 1].n = N[k];
Dim[NumDims - k - 1].is = Dim[NumDims - k - 1].os = (k == 0) ? 1 : (N[k-1] * Dim[NumDims-k].is);
NumEl *= N[k];
}
/* Import the wisdom. */
set_wisfile();
wisdom = fopen(Wisfile, "r");
if (wisdom) {
fftw_import_wisdom_from_file(wisdom);
fclose(wisdom);
}
if(!(Plan = fftw_plan_guru_split_dft(NumDims, Dim, 0, NULL, XReal, XImag, YReal, YImag, FFTW_MEASURE *(or FFTW_ESTIMATE respectively)* )))
mexErrMsgTxt("FFTW3 failed to create plan.");
/* Save the wisdom. */
wisdom = fopen(Wisfile, "w");
if (wisdom) {
fftw_export_wisdom_to_file(wisdom);
fclose(wisdom);
}
return Plan;
}
void mexFunction( int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[] )
{
#define B_OUT plhs[0]
int k, numCPU, NumDims;
const mwSize *N;
double *pr, *pi, *pr2, *pi2;
static long MatLeng = 0;
fftw_iodim Dim[NumDims];
fftw_plan PlanForward;
int NumEl = 1;
int *N2;
if (nrhs != 2) {
mexErrMsgIdAndTxt( "MATLAB:FFT2mx:invalidNumInputs",
"Two input argument required.");
}
if (!mxIsDouble(prhs[0])) {
mexErrMsgIdAndTxt( "MATLAB:FFT2mx:invalidNumInputs",
"Array must be double");
}
numCPU = (int) mxGetScalar(prhs[1]);
if (numCPU > 8) {
mexErrMsgIdAndTxt( "MATLAB:FFT2mx:invalidNumInputs",
"NumOfThreads < 8 requested");
}
if (!mxIsComplex(prhs[0])) {
mexErrMsgIdAndTxt( "MATLAB:FFT2mx:invalidNumInputs",
"Array must be complex");
}
NumDims = mxGetNumberOfDimensions(prhs[0]);
N = mxGetDimensions(prhs[0]);
N2 = (int*) mxMalloc( sizeof(int) * NumDims);
for(k=0;k<NumDims;k++) {
NumEl *= NumEl * N[k];
N2[k] = N[k];
}
pr = (double *) mxGetPr(prhs[0]);
pi = (double *) mxGetPi(prhs[0]);
//B_OUT = mxCreateNumericArray(NumDims, N, mxDOUBLE_CLASS, mxCOMPLEX);
B_OUT = mxCreateNumericMatrix(0, 0, mxDOUBLE_CLASS, mxCOMPLEX);
mxSetDimensions(B_OUT , N, NumDims);
mxSetData(B_OUT , (double* ) mxMalloc( sizeof(double) * mxGetNumberOfElements(prhs[0]) ));
mxSetImagData(B_OUT , (double* ) mxMalloc( sizeof(double) * mxGetNumberOfElements(prhs[0]) ));
pr2 = (double* ) mxGetPr(B_OUT);
pi2 = (double* ) mxGetPi(B_OUT);
fftw_init_threads();
fftw_plan_with_nthreads(numCPU);
PlanForward = CreatePlan(NumDims, N2, pr, pi, pr2, pi2);
fftw_execute_split_dft(PlanForward, pr, pi, pr2, pi2);
fftw_destroy_plan(PlanForward);
fftw_cleanup_threads();
}
FFT backward:
This MEX function differs from the above only in switching pointers pr <-> pi, and pr2 <-> pi2 in the CreatePlan function and in the execution of the plan, as suggested in the FFTW documentation.
If I run
A = imread('cameraman.tif');
>> A = double(A) + i*double(A);
>> B = fftNmx(A,8);
>> C = ifftNmx(B,8);
>> figure,imagesc(real(C))
with the FFTW_MEASURE and FFTW_ESTIMATE arguments respectively I get this result.
I wonder if this is due to an error in my code or in the library. I tried different thing around the wisdom, saving not saving. Using the wisdom produce by the FFTW standalone tool to produce wisdom. I haven't seen any improvement. Can anyone suggest why this is happening?
Additional information:
I compile the MEX code using static libraries:
mex FFTN_Meas_mx.cpp /home/nicolas/Code/C/lib/lib/libfftw3.a /home/nicolas/Code/C/lib/lib/libfftw3_threads.a -lm
The FFTW library hasn't been compiled with:
./configure CFLAGS="-fPIC" --prefix=/home/nicolas/Code/C/lib --enable-sse2 --enable-threads --&& make && make install
I tried different flags without success. I am using MATLAB 2011b on a Linux 64-bit station (AMD opteron quad core).
FFTW computes not normalized transform, see here:
http://www.fftw.org/doc/What-FFTW-Really-Computes.html
Roughly speaking, when you perform direct transform followed by inverse one, you get
back the input (plus round-off errors) multiplied by the length of your data.
When you create a plan using flags other than FFTW_ESTIMATE, your input is overwritten:
http://www.fftw.org/doc/Planner-Flags.html

What about this OpenCL kernel is causing the error CL_INVALID_COMMAND_QUEUE

I'm having a problem implementing a Feed-Forward MultiLayer Perceptron, with back-prop learning in OpenCL in Java, using JOCL. Here is the kernel code for the calculation phase:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void Neuron(__global const double *inputPatterns,
__global double *weights,
__global const int *numInputs,
__global const int *activation,
__global const double *bias,
__global const int *usingBias,
__global double *values,
__global const int *maxNumFloats,
__global const int *patternIndex,
__global const int *inputPatternSize,
__global const int *indexOffset,
__global const int *isInputNeuron,
__global const int *inputs)
{
int gid = get_global_id(0);
double sum = 0.0;
for(int i = 0; i < numInputs[gid+indexOffset[0]]; i++)
{
sum += values[inputs[(gid+indexOffset[0]) * maxNumFloats[0] + i]] *
weights[(gid+indexOffset[0]) * maxNumFloats[0] + i];
}
if(usingBias[gid+indexOffset[0]])
sum += bias[gid+indexOffset[0]];
if(isInputNeuron[gid+indexOffset[0]])
sum += inputPatterns[gid+indexOffset[0]+(patternIndex[0] * inputPatternSize[0])];
if(activation[gid+indexOffset[0]] == 1)
sum = 1.0 / (1.0 + exp(-sum));
values[gid + indexOffset[0]] = sum;
}
Basically, I run this kernel for each layer in the network. For the first layer, there are no "inputs", so the loop does not execute. As the first layer is an input node layer however, it does add the relevant value from the input pattern. This executes fine, and I can read back the values at this point.
When I try and run the SECOND layer however (which does have inputs, every node from the first layer), a call to clFinish() returns the error CL_INVALID_COMMAND_QUEUE. Sometimes this error is coupled with a driver crash and recovery. I have read around (here for example) that this might be a problem with TDR timeouts, and have made an attempt to raise the limit but unsure if this is making any difference.
I'm going through the calls to clSetKernelArg() to check for anything stupid, but can anyone spot anything obviously off in the code? It would seem that the error is introduced in the second layer due to the inclusion of the for loop... I can clarify any of the parameters if its needed but it seemed a bit overkill for an initial post.
Also, I'm fully aware this code will probably be an affront to competent coders everywhere, but feel free to flame :P
EDIT: Host code:
//Calc
for(int k = 0; k < GPUTickList.length; k++)
{
clFlush(clCommandQueue);
clFinish(clCommandQueue);
//If input nodes
if(k == 0)
//Set index offset to 0
GPUMapIndexOffset.asIntBuffer().put(0, 0);
else
//Update index offset
GPUMapIndexOffset.asIntBuffer().put(0,
GPUMapIndexOffset.asIntBuffer().get(0) + GPUTickList[k-1]);
//Write index offset to GPU buffer
ret = clEnqueueWriteBuffer(clCommandQueue, memObjects[12], CL_TRUE, 0,
Sizeof.cl_int, Pointer.to(GPUMapIndexOffset.position(0)), 0, null, null);
//Set work size (width of layer)
global_work_size[0] = GPUTickList[k];
ret = clEnqueueNDRangeKernel(clCommandQueue, kernel_iterate, 1,
global_work_offset, global_work_size, local_work_size,
0, null, null);
}
EDIT 2: I've uploaded the full code to pastebin.
Solved. Fixed the error by making everything indexed with [0] a straight kernel parameter, rather than a buffer. Clearly the hardware doesn't like lots of stuff accessing one particular element of a buffer at once.
I'm not sure about what you have above the loop.. do you use the queue other than in this loop? Below is something you may want to try out.
//flush + finish if you need to before the loop, otherwise remove these lines
clFlush(clCommandQueue);
clFinish(clCommandQueue);
cl_event latestEvent;
//Calc
for(int k = 0; k < GPUTickList.length; k++)
{
//If input nodes
if(k == 0)
//Set index offset to 0
GPUMapIndexOffset.asIntBuffer().put(0, 0);
else
//Update index offset
GPUMapIndexOffset.asIntBuffer().put(0,
GPUMapIndexOffset.asIntBuffer().get(0) + GPUTickList[k-1]);
//Write index offset to GPU buffer
ret = clEnqueueWriteBuffer(clCommandQueue, memObjects[12], CL_TRUE, 0,
Sizeof.cl_int, Pointer.to(GPUMapIndexOffset.position(0)), 0, null, null);
//Set work size (width of layer)
global_work_size[0] = GPUTickList[k];
ret = clEnqueueNDRangeKernel(clCommandQueue, kernel_iterate, 1,
global_work_offset, global_work_size, local_work_size,
0, null, &latestEvent);
clWaitForEvents(1, &latestEvent);
}