Question on the dimension of cuda block indexing

Question on the dimension of cuda block indexing - matlab

In the following cuda code taken from book "Accelerating MATLAB with GPU computing: a primer with examples", I think
int row = blockIdx.x * blockDim.x + threadIdx.x;
if (row < 1 || row > numRows - 1)
return;
int col = blockIdx.y * blockDim.y + threadIdx.y;
if (col < 1 || col > numCols - 1)
return;
should actually be
int row = blockIdx.x * blockDim.x + threadIdx.x;
if (row < 0 || row > numRows - 1)
return;
int col = blockIdx.y * blockDim.y + threadIdx.y;
if (col < 0 || col > numCols - 1)
return;
Am I right?
The following is the whole code that does image convolution using cuda code called from MATLAB.
#include "conv2Mex.h"
__global__ void conv2MexCuda(float* src,
float* dst,
int numRows,
int numCols,
float* mask)
{
int row = blockIdx.x * blockDim.x + threadIdx.x;
if (row < 1 || row > numRows - 1)
return;
int col = blockIdx.y * blockDim.y + threadIdx.y;
if (col < 1 || col > numCols - 1)
return;
int dstIndex = col * numRows + row;
dst[dstIndex] = 0;
int mskIndex = 3 * 3 - 1;
for (int kc = -1; kc < 2; kc++)
{
int srcIndex = (col + kc) * numRows + row;
for (int kr = -1; kr < 2; kr++)
{
dst[dstIndex] += mask[mskIndex--] * src[srcIndex + kr];
}
}
}
void conv2Mex(float* src, float* dst, int numRows, int numCols, float* msk)
{
...
conv2MexCuda<<<gridSize, blockSize>>>...
...
}

Am I right?
I don't think you are right.
The construction of the row and col indices in the kernel code is such that they will vary (across threads in the grid) from 0 to numRows-1 and 0 to numCols-1 (and perhaps larger, depending on actual grid sizing, which you haven't shown).
Based on the code you have shown, the mask is evidently a 3x3 mask, which means that it acts as a stencil over the current (row, col) position, and extends plus and minus one row, and plus and minus one column. Let's take a careful look at the indexing here for the case where (row, col) = (0,0); this is one of the positions you have allowed to execute based on your proposed change:
for (int kc = -1; kc < 2; kc++)
{
int srcIndex = (col + kc) * numRows + row;
for (int kr = -1; kr < 2; kr++)
{
dst[dstIndex] += mask[mskIndex--] * src[srcIndex + kr];
At the first iteration of the outer for loop, kc will be -1, therefore srcIndex is (0-1)*numRows+0. Let's assume numRows is reasonably large, like 256. So srcIndex is -1*256 or -256. At the first iteration of the inner for-loop, kr is -1, so the computed index for the access to src is -256-1 = -257. That is almost never sensible.
If anything, the upper bounds look incorrect to me. If we assume that the valid image index ranges are 0..numRows-1 and 0..numCols-1, then I think the restrictions should be as follows:
int row = blockIdx.x * blockDim.x + threadIdx.x;
if (row < 1 || row > numRows - 2)
return;
int col = blockIdx.y * blockDim.y + threadIdx.y;
if (col < 1 || col > numCols - 2)
return;
That appears to be the classic computer science off-by-1 error.

Related

finding local mean in an image using mex-cuda

I have an image named HSIImage, of size is 565x585, in which I have find the local mean and standard deviation at every pixel. For this I am using a window W of size 9x9, if we a re finding the mean of x(i,j) we need values in the W where x(i,j) is at its center.
For working on the corner and edge pixels, I am padding the HSIImage and naming it as HSIImage2.
MATLAB code
[m,n,~] = size(HSIImage);
HSIImage2=padarray(HSIImage,[4,4],'symmetric');
mean1 = zeros(m,n);
sd = zeros(m,n);
phi_x=zeros(m,n);
for i=5:m+4
for j=5:n+4
mean1(i-4,j-4) = mean( mean(HSIImage2(i-4:i+4, j-4:j+4, 3) )); %sum / (4*4);
sd(i-4,j-4) = std( std(HSIImage2(i-4:i+4, j-4:j+4, 3), 1));
end
end
[phi_x2,mean2,sd2] = getPhi(HSIImage(:,:,3)',HSIImage2(:,:,3)',m,n);
Serial mean displayed as image.
My cuda code for finding mean and sd is
__global__ void phi(double *d_HSIImage,double *d_HSIImage2, int row, int col, double *d_phi_x, double *d_mean, double *d_std)
{
int X = blockDim.x * blockIdx.x + threadIdx.x;
int Y = blockDim.y * blockIdx.y + threadIdx.y;
int i,j;
double sum = 0;
if(Y>3 && X>3 && Y<row+4 && X<col+4)
{
for(i=Y-4;i<=Y+4;i++){
for(j=X-4;j<=X+4;j++){
sum= sum + d_HSIImage2[i*col+j];
}
}
d_mean[(Y-4)*col+X-4] = sum/81;
double mean = sum/81;
sum = 0;
for(i=Y-4;i<=Y+4;i++){
for(j=X-4;j<=X+4;j++){
int index = i*col+j;
sum= sum + (d_HSIImage2[index]-mean) * (d_HSIImage2[index]-mean);
}
}
d_std[(Y-4)*col+X-4] = sqrt(sum/81);
}
void mexFunction( int nlhs, mxArray *plhs[],int nrhs, const mxArray *prhs[])
{
double* HSIImage;
double* d_HSIImage;
double* HSIImage2;
double* d_HSIImage2;
double row;
double col;
double* phi_x;
double* d_phi_x;
double* mean2;
double* d_mean;
double* d_std;
double* sd2;
HSIImage = (double*)mxGetPr(prhs[0]);
HSIImage2 = (double*)mxGetPr(prhs[1]);
row = mxGetScalar(prhs[2]);
col = mxGetScalar(prhs[3]);
plhs[0] = mxCreateDoubleMatrix(row,col,mxREAL);
phi_x = mxGetPr(plhs[0]);
plhs[1] = mxCreateDoubleMatrix(row,col,mxREAL);
mean2 = mxGetPr(plhs[1]);
plhs[2] = mxCreateDoubleMatrix(row,col,mxREAL);
sd2 = mxGetPr(plhs[2]);
dim3 grid(((col+8)/TILE_WIDTH)+1,((row+8)/TILE_WIDTH)+1,1);
dim3 block(TILE_WIDTH,TILE_WIDTH,1);
if ( cudaMalloc(&d_HSIImage,sizeof(double)*row*col)!= cudaSuccess )
mexErrMsgTxt("Memory allocating failure on the GPU");
if ( cudaMalloc(&d_HSIImage2,sizeof(double)*(row+8)*(col+8))!= cudaSuccess )
mexErrMsgTxt("Memory allocating failure on the GPU");
if ( cudaMalloc(&d_phi_x,sizeof(double)*row*col)!= cudaSuccess )
mexErrMsgTxt("Memory allocating failure on the GPU");
if ( cudaMalloc(&d_mean,sizeof(double)*row*col)!= cudaSuccess )
mexErrMsgTxt("Memory allocating failure on the GPU");
if ( cudaMalloc(&d_std,sizeof(double)*row*col)!= cudaSuccess )
mexErrMsgTxt("Memory allocating failure on the GPU");
cudaMemcpy(d_HSIImage,HSIImage,sizeof(double)*row*col,cudaMemcpyHostToDevice);
cudaMemcpy(d_HSIImage2,HSIImage2,sizeof(double)*(row+8)*(col+8),cudaMemcpyHostToDevice);
phi <<< grid,block >>> (d_HSIImage,d_HSIImage2,row,col,d_phi_x,d_mean,d_std);
cudaMemcpy(phi_x,d_phi_x,sizeof(double)*row*col,cudaMemcpyDeviceToHost);
cudaMemcpy(mean2,d_mean,sizeof(double)*row*col,cudaMemcpyDeviceToHost);
cudaMemcpy(sd2,d_std,sizeof(double)*row*col,cudaMemcpyDeviceToHost);
cudaFree(d_HSIImage);
cudaFree(d_HSIImage2);
cudaFree(d_phi_x);
}
its working fine when image is full of ones. but when I give regular image, there is lot of difference in serial(MATLAB) and parallel(CUDA) outputs(When mean1 and mean2 are compared). Please tell me the error.
I am launching with
dim3 grid(((col+8)/TILE_WIDTH)+1,((row+8)/TILE_WIDTH)+1,1);
dim3 block(TILE_WIDTH,TILE_WIDTH,1);
TILEWIDTH is 32. row=565, col=584.
Parallel mean displayed as image

It is important to note Matlab's c api is column-major ordered, however as mentioned in the comments OP has made sure of the consistency. The problem is that the stride used to access the data did not include the pads of the image. Going from one row to another requires a stride of col+8 (8 being padding of 4 on each side.
changing
__global__ void phi(double *d_HSIImage,double *d_HSIImage2, int row, int col, double *d_phi_x, double *d_mean, double *d_std)
{
int X = blockDim.x * blockIdx.x + threadIdx.x;
int Y = blockDim.y * blockIdx.y + threadIdx.y;
int i,j;
double sum = 0;
if(Y>3 && X>3 && Y<row+4 && X<col+4)
{
for(i=Y-4;i<=Y+4;i++){
for(j=X-4;j<=X+4;j++){
sum= sum + d_HSIImage2[i*col+j];
}
}
d_mean[(Y-4)*col+X-4] = sum/81;
double mean = sum/81;
sum = 0;
for(i=Y-4;i<=Y+4;i++){
for(j=X-4;j<=X+4;j++){
int index = i*col+j;
sum= sum + (d_HSIImage2[index]-mean) * (d_HSIImage2[index]-mean);
}
}
d_std[(Y-4)*col+X-4] = sqrt(sum/81);
}
to
__global__ void phi(double *d_HSIImage,double *d_HSIImage2, int row, int col, double *d_phi_x, double *d_mean, double *d_std)
{
int X = blockDim.x * blockIdx.x + threadIdx.x;
int Y = blockDim.y * blockIdx.y + threadIdx.y;
int i,j;
double sum = 0;
if(Y>3 && X>3 && Y<row+4 && X<col+4)
{
for(i=Y-4;i<=Y+4;i++){
for(j=X-4;j<=X+4;j++){
sum= sum + d_HSIImage2[i*(col+8)+j];
}
}
d_mean[(Y-4)*col+X-4] = sum/81;
double mean = sum/81;
sum = 0;
for(i=Y-4;i<=Y+4;i++){
for(j=X-4;j<=X+4;j++){
int index = i*(col+8)+j;
sum= sum + (d_HSIImage2[index]-mean) * (d_HSIImage2[index]-mean);
}
}
d_std[(Y-4)*col+X-4] = sqrt(sum/81);
}
Should work, however, I have included a compilable example that I validated on a small sample, that should be easy to expand.
It is not optimized, but that wasn't part of your question. Optimization using shared memory would give a large performance boost.
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <cuda.h>
using namespace std;
__global__ void phi(double *img, int row, int col, double *d_mean){
int X=blockDim.x*blockIdx.x+threadIdx.x+4;
int Y=blockDim.y*blockIdx.y+threadIdx.y+4;
double sum = 0;
if(Y<row+4 && X<col+4){
for(int i=-4; i<=4; ++i){
for(int j=-4; j<=4; ++j){
sum+=img[ (Y+j)*(col+8)+X+i];
}
}
sum/=81;
d_mean[(Y-4)*col+X-4]=sum;
}
}
int main(int argc, char * argv[]) {
int width=10, height=10;
double *h_img=new double[(width+8)*(height+8)];
for(int i=0; i<height+8; i++){
for(int j=0; j<width+8; j++){
h_img[i*(width+8)+j]=0.0;
}
}
for(int i=0; i<height; i++){
for(int j=0; j<width; j++){
int index = (i+4)*(width+8)+j+4;
h_img[index]=i*width+j;
}
}
for(int i=0; i<height+8; i++){
for(int j=0; j<width+8; j++){
cout<<h_img[i*(width+8)+j]<<" ";
}cout<<endl;
}
double *d_img;
size_t size=sizeof(double)*(height+8)*(width*8);
cudaMalloc(&d_img, size);
cudaMemcpy(d_img, h_img, size, cudaMemcpyHostToDevice);
size = sizeof(double)*height*width;
double *d_avg;
cudaMalloc(&d_avg, size);
dim3 block(32, 32, 1);
dim3 grid(width/32+1, height/32+1, 1);
phi<<<grid, block>>>(d_img, height, width, d_avg);
cudaDeviceSynchronize();
double *h_avg=new double[width*height];
cudaMemcpy(h_avg, d_avg, size, cudaMemcpyDeviceToHost);
for(int i=0; i<height; i++){
for(int j=0; j<width; j++){
cout<<h_avg[i*width+j]<<" ";
}cout<<endl;
}
return 0;
}

Here's my 2 cents regarding local mean and local std.
You should check whether using matlab's optimized built-in functions (conv2 and stdfilt , with their gpu support) gives you better performance than a "simple" mex version. For example, to take the local mean, the fastest will be to use conv2 as follows:
local_mean_image=conv2(image,normalized_window,'same');
where in your case normalized_window=ones(9)./9^2;
For local std use stdfilt :
local_std_image = stdfilt(image, ones(9));
Both options are available for faster GPU performance, I use conv2 with Jacket routinely, and I saw the stdfilt supports gpuarray variables.

By observing the answers of #Christian Sarofeen and of #bla, I made some changes to my code and now I am able to find the mean exactly same as MATLAB. I posting this thinking that some one may use it in future(I am sending the image as is from MATLAB). Still finding standard deviation is little problem.
__global__ void phi(double *d_HSIImage,double *d_HSIImage2, int row, int col, double *d_phi_x, double *d_mean, double *d_std)
{
int X = blockDim.x * blockIdx.x + threadIdx.x;
int Y = blockDim.y * blockIdx.y + threadIdx.y;
int i,j;
double sum = 0;
if(Y>3 && X>3 && Y<row+4 && X<col+4)
{
int index = (X-4)*row+Y-4;
for(i=-4;i<=4;i++){
for(j=-4;j<=4;j++){
sum= sum + d_HSIImage2[(X+j)*(row+8)+(Y+i)];
}
}
d_mean[index] = sum/81;
double mean = 0;
double temp_std[9] = {0} ;
for(j=-4;j<=4;j++){
sum = 0;
for(i=-4;i<=4;i++){
sum = sum + d_HSIImage2[(X+j)*(row+8)+(Y+i)];//vector mean
}
mean = sum/9;
sum =0 ;
for(i=-4;i<=4;i++){
int index = (X+j)*(row+8)+(Y+i);
sum= sum + (d_HSIImage2[index]-mean) * (d_HSIImage2[index]-mean);
}
temp_std[j+4] = (sqrt(sum/9));//vector std
}
sum =0 ;
for(j=-4;j<=4;j++){
sum = sum + temp_std[j+4];//mean of vectors
}
mean = sum/9;
sum = 0 ;
for(j=-4;j<=4;j++){
sum = sum + (temp_std[j+4]-mean) * (temp_std[j+4]-mean);
}
d_std[index] = sqrt(sum);//std of vectors
d_phi_x[index] = 1.0/(1.0+exp((d_mean[index]-d_HSIImage[index])/d_std[index]));
}
}

Cuda matrix multiplication results differs from MATLAB

Its been two days and I am still cant figure it out why my implementation of CUDA matrix multiplication differs from the results produced in MATLAB.
CUDA kernel: A(200x60000) = W(200x784) * Data(784x6000)
__global__ void CalculateA(Matrix W, Matrix Data, Matrix A)
{
int Row = blockIdx.y * blockDim.y + threadIdx.y;
int Col = blockIdx.x * blockDim.x + threadIdx.x;
if ((Row < W.row) && (Col < Data.col)){
float Cvalue = 0.0;
for (int i = 0; i < W.col; ++i){
Cvalue += W.elements[Row*W.col+i] * Data.elements[i*Data.col+Col];
}
A.elements[Row*A.col+Col] = Cvalue;
}
}
And calling the kernel:
void myFunc(Matrix W1, Matrix data){
Matrix d_W1, d_data, d_a2, a2;
size_t size;
a2.row = W1.row; d_a2.row = a2.row;
a2.col = data.col; d_a2.col = a2.col;
size = a2.col*a2.row*sizeof(float);
cudaMalloc(&d_a2.elements,size);
d_W1.row = W1.row; d_W1.col = W1.col;
size = W1.col*W1.row*sizeof(float);
cudaMalloc(&d_W1.elements,size);
cudaMemcpy(d_W1.elements,W1.elements,size,cudaMemcpyHostToDevice);
d_data.col = data.col; d_data.row = data.row;
size = data.row*data.col*sizeof(float);
cudaMalloc(&d_data.elements,size);
cudaMemcpy(d_data.elements,data.elements,size,cudaMemcpyHostToDevice);
dim3 dimGrid(data.col/32 + 1, W1.row/32 + 1, 1);
dim3 dimBlock(32, 32, 1);
CalculateA<<<dimGrid, dimBlock>>>(d_W1, d_data, d_a2);
a2.elements = new float [a2.row*a2.col];
cudaMemcpy(a2.elements,d_a2.elements,sizeof(float)*a2.row*a2.col,cudaMemcpyDeviceToHost);
printf("\nA2 first and last member %f - %f\n",a2.elements[0],a2.elements[a2.row*a2.col-1]);
}
Results difference is not low for example first and last elements of CUDA code is 0.011322 and -0.179534 but multiplying in MATLAB results in 0.4280 and 0.0056.
this is how I do it in MATLAB:
>> size(W1) ans = 200 784
>> size(data) ans = 784 60000
>> z2=W1*data;
>> size(z2) ans = 200 60000
>> z2 = z2(:);
>> z2(1) ans = 0.4280
>> z2(200*60000)ans = 0.0056

There is nothing wrong with the code you posted. If I expand your kernel and function into a complete running example like this:
#include <iostream>
struct Matrix
{
int row;
int col;
float *elements;
__device__ __host__
float& operator()(int r, int c) { return elements[r*col + c]; };
};
__global__ void CalculateA(Matrix W, Matrix Data, Matrix A)
{
int Row = blockIdx.y * blockDim.y + threadIdx.y;
int Col = blockIdx.x * blockDim.x + threadIdx.x;
if ((Row < W.row) && (Col < Data.col)){
float Cvalue = 0.0;
for (int i = 0; i < W.col; ++i){
Cvalue += W.elements[Row*W.col+i] * Data.elements[i*Data.col+Col];
}
A.elements[Row*A.col+Col] = Cvalue;
}
}
void myFunc(Matrix W1, Matrix data)
{
Matrix d_W1, d_data, d_a2, a2;
size_t size;
a2.row = W1.row; d_a2.row = a2.row;
a2.col = data.col; d_a2.col = a2.col;
size = a2.col*a2.row*sizeof(float);
cudaMalloc(&d_a2.elements,size);
d_W1.row = W1.row; d_W1.col = W1.col;
size = W1.col*W1.row*sizeof(float);
cudaMalloc(&d_W1.elements,size);
cudaMemcpy(d_W1.elements,W1.elements,size,cudaMemcpyHostToDevice);
d_data.col = data.col; d_data.row = data.row;
size = data.row*data.col*sizeof(float);
cudaMalloc(&d_data.elements,size);
cudaMemcpy(d_data.elements,data.elements,size,cudaMemcpyHostToDevice);
dim3 dimGrid(data.col/32 + 1, W1.row/32 + 1, 1);
dim3 dimBlock(32, 32, 1);
CalculateA<<<dimGrid, dimBlock>>>(d_W1, d_data, d_a2);
a2.elements = new float [a2.row*a2.col];
cudaMemcpy(a2.elements,d_a2.elements,sizeof(float)*a2.row*a2.col,cudaMemcpyDeviceToHost);
for(int j=0; j<a2.col; ++j) {
for(int i=0; i<a2.row; ++i) {
std::cout << a2(i,j) << " ";
}
std::cout << std::endl;
}
}
int main(void)
{
float a[6] = { 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f };
float b[6] = { 0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f};
Matrix W1; W1.row=2; W1.col=3; W1.elements = &a[0];
Matrix Data; Data.row=3; Data.col=2; Data.elements = &b[0];
myFunc(W1, Data);
return 0;
}
and run it, I get this:
>nvcc -arch=sm_21 -Xptxas="-v" -m32 matrix.cu
matrix.cu
tmpxft_000014f4_00000000-5_matrix.cudafe1.gpu
tmpxft_000014f4_00000000-10_matrix.cudafe2.gpu
matrix.cu
ptxas : info : 132 bytes gmem, 28 bytes cmem[14]
ptxas : info : Compiling entry function '_Z10CalculateA6MatrixS_S_' for 'sm_21'
ptxas : info : Function properties for _Z10CalculateA6MatrixS_S_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 14 registers, 68 bytes cmem[0]
tmpxft_000014f4_00000000-5_matrix.cudafe1.cpp
tmpxft_000014f4_00000000-15_matrix.ii
>cuda-memcheck a.exe
========= CUDA-MEMCHECK
2.2 4.9
2.8 6.4
========= ERROR SUMMARY: 0 errors
which is the correct answer for the dot product assuming column major ordering (which is the Matlab convention).
So if your results don't agree, it is because of something you haven't shown us. One likelihood is that your test problem is so large (and kernel so inefficient) that if you are running this on a display GPU, your program is hitting the display driver watchdog timer limit and being killed before the kernel finishes running. Also note that you have no CUDA API error checking whatsoever, so it is possible that you are getting runtime errors which is either stopping your kernel from finishing or even running at all, but you simply don't notice because of the lack of error checking.

imregionalmax matlab function's equivalent in opencv

I have an image of connected components(circles filled).If i want to segment them i can use watershed algorithm.I prefer writing my own function for watershed instead of using the inbuilt function in OPENCV.I have successfu How do i find the regionalmax of objects using opencv?

I wrote a function myself. My results were quite similar to MATLAB, although not exact. This function is implemented for CV_32F but it can easily be modified for other types.
I mark all the points that are not part of a minimum region by checking all the neighbors. The remaining regions are either minima, maxima or areas of inflection.
I use connected components to label each region.
I check each region for any point belonging to a maxima, if yes then I push that label into a vector.
Finally I sort the bad labels, erase all duplicates and then mark all the points in the output as not minima.
All that remains are the regions of minima.
Here is the code:
// output is a binary image
// 1: not a min region
// 0: part of a min region
// 2: not sure if min or not
// 3: uninitialized
void imregionalmin(cv::Mat& img, cv::Mat& out_img)
{
// pad the border of img with 1 and copy to img_pad
cv::Mat img_pad;
cv::copyMakeBorder(img, img_pad, 1, 1, 1, 1, IPL_BORDER_CONSTANT, 1);
// initialize binary output to 2, unknown if min
out_img = cv::Mat::ones(img.rows, img.cols, CV_8U)+2;
// initialize pointers to matrices
float* in = (float *)(img_pad.data);
uchar* out = (uchar *)(out_img.data);
// size of matrix
int in_size = img_pad.cols*img_pad.rows;
int out_size = img.cols*img.rows;
int x, y;
for (int i = 0; i < out_size; i++) {
// find x, y indexes
y = i % img.cols;
x = i / img.cols;
neighborCheck(in, out, i, x, y, img_pad.cols); // all regions are either min or max
}
cv::Mat label;
cv::connectedComponents(out_img, label);
int* lab = (int *)(label.data);
in = (float *)(img.data);
in_size = img.cols*img.rows;
std::vector<int> bad_labels;
for (int i = 0; i < out_size; i++) {
// find x, y indexes
y = i % img.cols;
x = i / img.cols;
if (lab[i] != 0) {
if (neighborCleanup(in, out, i, x, y, img.rows, img.cols) == 1) {
bad_labels.push_back(lab[i]);
}
}
}
std::sort(bad_labels.begin(), bad_labels.end());
bad_labels.erase(std::unique(bad_labels.begin(), bad_labels.end()), bad_labels.end());
for (int i = 0; i < out_size; ++i) {
if (lab[i] != 0) {
if (std::find(bad_labels.begin(), bad_labels.end(), lab[i]) != bad_labels.end()) {
out[i] = 0;
}
}
}
}
int inline neighborCleanup(float* in, uchar* out, int i, int x, int y, int x_lim, int y_lim)
{
int index;
for (int xx = x - 1; xx < x + 2; ++xx) {
for (int yy = y - 1; yy < y + 2; ++yy) {
if (((xx == x) && (yy==y)) || xx < 0 || yy < 0 || xx >= x_lim || yy >= y_lim)
continue;
index = xx*y_lim + yy;
if ((in[i] == in[index]) && (out[index] == 0))
return 1;
}
}
return 0;
}
void inline neighborCheck(float* in, uchar* out, int i, int x, int y, int x_lim)
{
int indexes[8], cur_index;
indexes[0] = x*x_lim + y;
indexes[1] = x*x_lim + y+1;
indexes[2] = x*x_lim + y+2;
indexes[3] = (x+1)*x_lim + y+2;
indexes[4] = (x + 2)*x_lim + y+2;
indexes[5] = (x + 2)*x_lim + y + 1;
indexes[6] = (x + 2)*x_lim + y;
indexes[7] = (x + 1)*x_lim + y;
cur_index = (x + 1)*x_lim + y+1;
for (int t = 0; t < 8; t++) {
if (in[indexes[t]] < in[cur_index]) {
out[i] = 0;
break;
}
}
if (out[i] == 3)
out[i] = 1;
}

The following listing is a function similar to Matlab's "imregionalmax". It looks for at most nLocMax local maxima above threshold, where the found local maxima are at least minDistBtwLocMax pixels apart. It returns the actual number of local maxima found. Notice that it uses OpenCV's minMaxLoc to find global maxima. It is "opencv-self-contained" except for the (easy to implement) function vdist, which computes the (euclidian) distance between points (r,c) and (row,col).
input is one-channel CV_32F matrix, and locations is nLocMax (rows) by 2 (columns) CV_32S matrix.
int imregionalmax(Mat input, int nLocMax, float threshold, float minDistBtwLocMax, Mat locations)
{
Mat scratch = input.clone();
int nFoundLocMax = 0;
for (int i = 0; i < nLocMax; i++) {
Point location;
double maxVal;
minMaxLoc(scratch, NULL, &maxVal, NULL, &location);
if (maxVal > threshold) {
nFoundLocMax += 1;
int row = location.y;
int col = location.x;
locations.at<int>(i,0) = row;
locations.at<int>(i,1) = col;
int r0 = (row-minDistBtwLocMax > -1 ? row-minDistBtwLocMax : 0);
int r1 = (row+minDistBtwLocMax < scratch.rows ? row+minDistBtwLocMax : scratch.rows-1);
int c0 = (col-minDistBtwLocMax > -1 ? col-minDistBtwLocMax : 0);
int c1 = (col+minDistBtwLocMax < scratch.cols ? col+minDistBtwLocMax : scratch.cols-1);
for (int r = r0; r <= r1; r++) {
for (int c = c0; c <= c1; c++) {
if (vdist(Point2DMake(r, c),Point2DMake(row, col)) <= minDistBtwLocMax) {
scratch.at<float>(r,c) = 0.0;
}
}
}
} else {
break;
}
}
return nFoundLocMax;
}

I do not know if it is what you want, but in my answer to this post, I gave some code to find local maxima (peaks) in a grayscale image (resulting from distance transform).
The approach relies on subtracting the original image from the dilated image and finding the zero pixels).
I hope it helps,
Good luck

I had the same problem some time ago, and the solution was to reimplement the imregionalmax algorithm in OpenCV/Cpp. It is not that complicated, because you can find the C++ source code of the function in the Matlab distribution. (somewhere in toolbox). All you have to do is to read carefully and understand the algorithm described there. Then rewrite it or remove the matlab-specific checks and you'll have it.

Looking for SLAB6 implementation

I'm looking to implement SLAB6 into my raycaster, especially the kv6 support for voxelmodels. However the SLAB6 source by Ken Silverman is totally unreadably (mostly ASM) so I was hoping someone could point me to a proper C / Java source to load kv6 models or maybe to explain me the workings in some pseudocode preferably (since I want to know how to support the kv6, I know how it works). Thanks, Kaj
EDIT: the implementation would be in Java.

I found some code in an application called VoxelGL (author not mentioned in sourcecode):
void CVoxelWorld::generateSlabFromData(unsigned char *data, VoxelData *vdata, Slab *slab)
{
int currentpattern = 1;
int i = 0;
int n, totalcount, v, count;
n = 0;
v = 0;
while (1)
{
while (data[i] == currentpattern)
{
if (currentpattern == 1)
v++;
i++;
if (i == 256)
break;
}
n++;
if (i == 256)
{
if (currentpattern == 0)
n--;
break;
}
currentpattern ^= 1;
}
slab->nentries = n;
if (slab->description != 0)delete [] slab->description;
if (slab->data != 0)delete [] slab->data;
slab->description = new int[n];
slab->data = new VoxelData[v];
totalcount = 0;
v = 0;
currentpattern = 1;
for (i = 0; i < n; i++)
{
count = 0;
while (data[totalcount] == currentpattern)
{
count++;
totalcount++;
if (totalcount == 256)
break;
}
slab->description[i] = count-1;
if (i % 2 == 0)
{
memcpy(slab->data + v, vdata + totalcount - count, 3 * count);
v += count;
}
currentpattern ^= 1;
}
}
And:
#define clustersize 8
Slab *CVoxelWorld::getSlab(int x, int z)
{
int xgrid = x / clustersize;
int ygrid = z / clustersize;
int clusteroffset = xgrid * 1024 * clustersize + ygrid * clustersize * clustersize;
return &m_data[clusteroffset + (x & (clustersize - 1)) + (z & (clustersize - 1)) * clustersize];
}
And:
int CVoxelWorld::isSolid(int x, int y, int z)
{
Slab *slab;
if (y < 0 || y > 256)
return 0;
slab = getSlab(x, z);
int counter = 0;
for (int i = 0; i < slab->nentries; i++)
{
int height = slab->description[i] + 1;
if (i % 2 == 0)
{
if (y >= counter && y < counter + height)
return 1;
}
counter += height;
}
return 0;
}

Teaching a Neural Net: Bipolar XOR

I'm trying to to teach a neural net of 2 inputs, 4 hidden nodes (all in same layer) and 1 output node. The binary representation works fine, but I have problems with the Bipolar. I can't figure out why, but the total error will sometimes converge to the same number around 2.xx. My sigmoid is 2/(1+ exp(-x)) - 1. Perhaps I'm sigmoiding in the wrong place. For example to calculate the output error should I be comparing the sigmoided output with the expected value or with the sigmoided expected value?
I was following this website here: http://galaxy.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html , but they use different functions then I was instructed to use. Even when I did try to implement their functions I still ran into the same problem. Either way I get stuck about half the time at the same number (a different number for different implementations). Please tell me if I have made a mistake in my code somewhere or if this is normal (I don't see how it could be). Momentum is set to 0. Is this a common 0 momentum problem? The error functions we are supposed to be using are:
if ui is an output unit
Error(i) = (Ci - ui ) * f'(Si )
if ui is a hidden unit
Error(i) = Error(Output) * weight(i to output) * f'(Si)
public double sigmoid( double x ) {
double fBipolar, fBinary, temp;
temp = (1 + Math.exp(-x));
fBipolar = (2 / temp) - 1;
fBinary = 1 / temp;
if(bipolar){
return fBipolar;
}else{
return fBinary;
}
}
// Initialize the weights to random values.
private void initializeWeights(double neg, double pos) {
for(int i = 0; i < numInputs + 1; i++){
for(int j = 0; j < numHiddenNeurons; j++){
inputWeights[i][j] = Math.random() - pos;
if(inputWeights[i][j] < neg || inputWeights[i][j] > pos){
print("ERROR ");
print(inputWeights[i][j]);
}
}
}
for(int i = 0; i < numHiddenNeurons + 1; i++){
hiddenWeights[i] = Math.random() - pos;
if(hiddenWeights[i] < neg || hiddenWeights[i] > pos){
print("ERROR ");
print(hiddenWeights[i]);
}
}
}
// Computes output of the NN without training. I.e. a forward pass
public double outputFor ( double[] argInputVector ) {
for(int i = 0; i < numInputs; i++){
inputs[i] = argInputVector[i];
}
double weightedSum = 0;
for(int i = 0; i < numHiddenNeurons; i++){
weightedSum = 0;
for(int j = 0; j < numInputs + 1; j++){
weightedSum += inputWeights[j][i] * inputs[j];
}
hiddenActivation[i] = sigmoid(weightedSum);
}
weightedSum = 0;
for(int j = 0; j < numHiddenNeurons + 1; j++){
weightedSum += (hiddenActivation[j] * hiddenWeights[j]);
}
return sigmoid(weightedSum);
}
//Computes the derivative of f
public static double fPrime(double u){
double fBipolar, fBinary;
fBipolar = 0.5 * (1 - Math.pow(u,2));
fBinary = u * (1 - u);
if(bipolar){
return fBipolar;
}else{
return fBinary;
}
}
// This method is used to update the weights of the neural net.
public double train ( double [] argInputVector, double argTargetOutput ){
double output = outputFor(argInputVector);
double lastDelta;
double outputError = (argTargetOutput - output) * fPrime(output);
if(outputError != 0){
for(int i = 0; i < numHiddenNeurons + 1; i++){
hiddenError[i] = hiddenWeights[i] * outputError * fPrime(hiddenActivation[i]);
deltaHiddenWeights[i] = learningRate * outputError * hiddenActivation[i] + (momentum * lastDelta);
hiddenWeights[i] += deltaHiddenWeights[i];
}
for(int in = 0; in < numInputs + 1; in++){
for(int hid = 0; hid < numHiddenNeurons; hid++){
lastDelta = deltaInputWeights[in][hid];
deltaInputWeights[in][hid] = learningRate * hiddenError[hid] * inputs[in] + (momentum * lastDelta);
inputWeights[in][hid] += deltaInputWeights[in][hid];
}
}
}
return 0.5 * (argTargetOutput - output) * (argTargetOutput - output);
}

General coding comments:
initializeWeights(-1.0, 1.0);
may not actually get the initial values you were expecting.
initializeWeights should probably have:
inputWeights[i][j] = Math.random() * (pos - neg) + neg;
// ...
hiddenWeights[i] = (Math.random() * (pos - neg)) + neg;
instead of:
Math.random() - pos;
so that this works:
initializeWeights(0.0, 1.0);
and gives you initial values between 0.0 and 1.0 rather than between -1.0 and 0.0.
lastDelta is used before it is declared:
deltaHiddenWeights[i] = learningRate * outputError * hiddenActivation[i] + (momentum * lastDelta);
I'm not sure if the + 1 on numInputs + 1 and numHiddenNeurons + 1 are necessary.
Remember to watch out for rounding of ints: 5/2 = 2, not 2.5!
Use 5.0/2.0 instead. In general, add the .0 in your code when the output should be a double.
Most importantly, have you trained the NeuralNet long enough?
Try running it with numInputs = 2, numHiddenNeurons = 4, learningRate = 0.9, and train for 1,000 or 10,000 times.
Using numHiddenNeurons = 2 it sometimes get "stuck" when trying to solve the XOR problem.
See also XOR problem - simulation

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Question on the dimension of cuda block indexing - matlab

Related

finding local mean in an image using mex-cuda

Cuda matrix multiplication results differs from MATLAB

imregionalmax matlab function's equivalent in opencv

Looking for SLAB6 implementation

Teaching a Neural Net: Bipolar XOR

Categories

Resources