Is there an efficient way to append values in an SSB in a compute shader with GLSL?

I have an OpenGL compute shader that generates an undefined number of vertices and stores them in a shader storage buffer (SSB). The SSB is large enough that the compute shader never generates more vertices than it can hold. I need the generated values to fill the buffer from the beginning with no gaps (just like using push_back on a C++ vector). To do that, I use an atomic counter to obtain the index at which to place each vertex in the SSB when it is generated. This method seems to work but makes the compute shader run much slower. Here is what the GLSL function looks like:
void createVertex(/*some parameters*/){
    uint index = atomicCounterIncrement(numberOfVertices);
    Vector vertex;
    // some processing that calculates the coordinates of the vertex
    vertices[index] = vertex;
}
Where vertices is an SSB holding an array of three-float structs, defined by:
struct Vector
{
    float x, y, z;
};
layout (std430, binding = 1) buffer vertexBuffer
{
    Vector vertices[];
};
And numberOfVertices is an atomic counter whose value is initialized to 0 before running the shader.
Once the shader has finished running, I can read the numberOfVertices variable back on the CPU side to know how many vertices were created; they are stored in the buffer in the byte range [0; numberOfVertices*3*sizeof(float)].
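For reference, the read-back step can be sketched in Python. This is only an illustration under stated assumptions: the raw bytes are assumed to be little-endian and to have already been copied out of the SSB by whatever OpenGL binding is in use, and unpack_vertices is a hypothetical helper name. With std430, a struct of three floats has a 12-byte stride, so only the first count*12 bytes are meaningful.

```python
import struct

FLOATS_PER_VERTEX = 3
# Assumption: struct Vector { float x, y, z; } under std430 has a 12-byte stride.
VERTEX_STRIDE = FLOATS_PER_VERTEX * 4


def unpack_vertices(raw, count):
    """Decode the first `count` vertices from bytes read back from the SSB."""
    used_bytes = count * VERTEX_STRIDE
    if used_bytes > len(raw):
        raise ValueError("counter exceeds buffer capacity")
    # "<" assumes little-endian data, which is an assumption about the platform.
    floats = struct.unpack("<%df" % (count * FLOATS_PER_VERTEX), raw[:used_bytes])
    return [floats[i:i + 3] for i in range(0, len(floats), 3)]


# Stand-in for a read-back buffer: two valid vertices followed by unused capacity.
raw = struct.pack("<6f", 1.0, 2.0, 3.0, 0.5, 0.25, 0.125) + bytes(24)
print(unpack_vertices(raw, 2))  # [(1.0, 2.0, 3.0), (0.5, 0.25, 0.125)]
```

Everything past count*12 bytes is unused capacity and is ignored.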
When measuring the time the shader takes to run (with glBeginQuery/glEndQuery(GL_TIME_ELAPSED)), I get about 50 ms. However, when I remove the atomicCounterIncrement line (and therefore also the assignment of the vertex into the array), the measured time drops to a few milliseconds. The gap grows as I increase the number of workgroups.
I think the problem may be caused by the atomic operation. Is there a better way to append values to an SSB, one that would also give me the total number of added values once the shader has finished running?
EDIT: After some refactoring and tests I noticed that it is actually the assignment of values into the buffer (vertices[index] = vertex;) that slows everything down (about 40 ms less when this line is removed). I should mention that the createVertex() function is called inside a for loop whose iteration count differs between shader invocations.

Related

Is there any way to set length of integer array with variable in hlsl?

First, please bear with my grammar; English is not my first language.
I'm currently using Unity. What I want to do is send a number of rows to a shader, so that the shader can draw that many rows of stripes on a GameObject's mesh.
I managed to send a number variable to the shader, but when I try to create an int array in the CG program part (which is HLSL) whose size is that number, Unity gives me this error message: "array dimensions must be literal scalar expressions".
This is the integer property that I set in my Unity shader script. It gets its integer value from a C# script function (this part doesn't have any issue).
_LowCount ("LowCount", int) = 0
And this is the CG program part that I'm struggling with.
The variable below is declared at global scope. It receives the number value from the properties.
int _LowCount;
And this is the fragment shader function; it declares a local integer array whose size is the integer variable _LowCount:
fixed4 frag(v2f i):COLOR{
    fixed4 c=0;
    int ColorsArray[_LowCount];
    for(int aa=0;aa<_LowCount;aa++){
        ColorsArray[aa]=0;
    }
    return c;
}
The following line of the fragment shader function is the one that gives me the error I mentioned above.
int ColorsArray[_LowCount];
I searched for this issue on Google and realized that I have to set the array size with a literal number (not a variable). But I need an integer array whose size is a variable that I can set to any integer value at any time. Is there any solution?
*P.S. I started learning CG graphics just 2 weeks ago, so my understanding and knowledge may be wrong. Thank you.
There is no way to define an HLSL array with a variable size. From the docs:
Literal scalar expressions declared containing most other types
Either preallocate the array with the maximum size it could ever need, int ColorsArray[maximum possible _LowCount]; and loop only over the first _LowCount elements.
It's not super clear what your end goal is, but another solution may be to execute a different shader for each object instead. See if you can update your question a little and I'll update the answer.

Why can't I use tex2D inside a loop in Unity ShaderLab?

I am trying to do something like
while (currentLayerDepth < currentDepth)
{
currentUV -= step;
currentDepth = tex2D(_HeightTex,currentUV).a;
currentLayerDepth += eachLayer;
}
It logged an error: Shader error in 'Unlit/CustomParallax': unable to unroll loop, loop does not appear to terminate in a timely manner (1024 iterations) at line 76 (on metal)
So now I have two choices: one is to add [unroll(100)] to limit the number of iterations, and the other is to use tex2Dlod instead of tex2D.
I'm curious why this happens.
Besides, why can tex2Dlod be used in a loop?
tex2D has to compute a local derivative to determine the correct LOD for the sample. Because of how derivatives are usually computed (as difference between neighbouring computation units), they can only be computed for predictable control flow.
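Your loop doesn't predictably make the same number of calls to tex2D for neighbouring fragments, so the derivative can't be predictably computed. tex2Dlod takes the LOD as an explicit argument, so it needs no derivative and can be called from non-uniform control flow.
For more details have a look at the GLSL specs. Search for "derivative" and "uniform control flow".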
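To make the derivative argument concrete, here is a rough Python sketch of how a LOD could be derived from the UVs of neighbouring fragments. The formula (log2 of the longest screen-space gradient, measured in texels) is a common textbook approximation and an assumption here; real hardware is implementation-specific, and mip_lod is a hypothetical name.

```python
import math


def mip_lod(uv, uv_right, uv_below, tex_size):
    """Approximate the mip LOD from a fragment's UV and its neighbours' UVs."""
    # Finite differences against the neighbouring fragments, scaled to texels.
    ddx = ((uv_right[0] - uv[0]) * tex_size, (uv_right[1] - uv[1]) * tex_size)
    ddy = ((uv_below[0] - uv[0]) * tex_size, (uv_below[1] - uv[1]) * tex_size)
    rho = max(math.hypot(*ddx), math.hypot(*ddy))
    return max(0.0, math.log2(rho)) if rho > 0 else 0.0


# One texel per pixel on a 256-texel texture: full-resolution mip.
print(mip_lod((0.0, 0.0), (1 / 256, 0.0), (0.0, 1 / 256), 256))  # 0.0
# Two texels per pixel: one mip level down.
print(mip_lod((0.0, 0.0), (2 / 256, 0.0), (0.0, 2 / 256), 256))  # 1.0
```

The function needs the neighbours' UVs at the same point of the program. If a neighbouring fragment has already left the loop, or is on a different iteration, there is no meaningful value to subtract.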

How to do a median projection of a large image stack in Matlab

I have a large stack of 800 16-bit grayscale images of 2048x2048 px. They are read from a single BigTIFF file, and the whole stack barely fits into my RAM (8GB).
Now I need to do a median projection, i.e. compute the median of each pixel across all 800 frames. The MATLAB median function fails because there is not enough memory left to make a copy of the whole array for the function call. What would be an efficient way to compute the median?
I have tried using a for loop to compute the median one pixel at a time, but this is still terribly slow.
Iterating over blocks, as @Shai suggests, may be the most straightforward solution. If you have this problem frequently, you may want to consider converting the image to a MAT-file, so that you can access the pixels as an n-d array directly from disk.
%# convert to mat file
matObj = matfile('dest.mat','Writable',true);
%# preallocate as uint16 to match the 16-bit image data
matObj.data(2048,2048,numSlices) = uint16(0);
for t = 1:numSlices
    matObj.data(:,:,t) = imread(tiffFile,'Index',t);
end
%# load a block of the matfile to take median (run as part of a loop)
medianOfBlock = median(matObj.data(1:128,1:128,:),3);
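The "run as part of a loop" step can be sketched as follows. This is a Python illustration of the block-wise idea rather than MATLAB code; read_block is a hypothetical callback standing in for whatever reads one region of one frame from disk, and the tiny in-memory stack is made up for the demo.

```python
from statistics import median


def median_projection(read_block, n_frames, height, width, block=128):
    """Median over frames, computed one spatial block at a time."""
    out = [[0] * width for _ in range(height)]
    for r0 in range(0, height, block):
        for c0 in range(0, width, block):
            r1, c1 = min(r0 + block, height), min(c0 + block, width)
            # Only this block of every frame is held in memory at once.
            tiles = [read_block(t, r0, r1, c0, c1) for t in range(n_frames)]
            for r in range(r1 - r0):
                for c in range(c1 - c0):
                    out[r0 + r][c0 + c] = median(tile[r][c] for tile in tiles)
    return out


# Toy stand-in for the image stack: 3 frames of 2x2 pixels.
frames = [[[1, 9], [4, 0]], [[3, 7], [4, 5]], [[2, 8], [6, 1]]]


def read_block(t, r0, r1, c0, c1):
    return [row[c0:c1] for row in frames[t][r0:r1]]


print(median_projection(read_block, 3, 2, 2, block=1))  # [[2, 8], [4, 1]]
```

Peak memory is one block times the number of frames, so the block size can be tuned to whatever RAM is free.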
I bet that the distributions of the individual pixel values over the stack (i.e. the histograms of the pixel jets) are sparse.
If that's the case, the amount of memory needed to keep all the pixel histograms is much less than 2K x 2K x 64k: you can use a compact hash map to represent each histogram, and update them loading the images one at a time. When all updates are done, you go through your histograms and compute the median of each.
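A minimal sketch of this histogram idea, written in Python for illustration. It assumes frames arrive one at a time as 2-D lists of integers; the function names are hypothetical, and collections.Counter plays the role of the compact hash map.

```python
from collections import Counter


def update_histograms(hists, frame):
    """Add one frame's pixel values to the per-pixel sparse histograms."""
    for r, row in enumerate(frame):
        for c, value in enumerate(row):
            hists[r][c][value] += 1


def median_from_histogram(hist):
    """Median of the values counted in `hist` (mean of the two middle values if even)."""
    n = sum(hist.values())
    lo_rank, hi_rank = (n - 1) // 2, n // 2  # 0-based ranks of the middle value(s)
    lo = hi = None
    seen = 0
    for value in sorted(hist):
        seen += hist[value]
        if lo is None and seen > lo_rank:
            lo = value
        if seen > hi_rank:
            hi = value
            break
    return (lo + hi) / 2


# Toy stack: 3 frames of 2x2 pixels, loaded one at a time.
frames = [[[1, 9], [4, 0]], [[3, 7], [4, 5]], [[2, 8], [6, 1]]]
hists = [[Counter() for _ in range(2)] for _ in range(2)]
for frame in frames:
    update_histograms(hists, frame)
print([[median_from_histogram(h) for h in row] for row in hists])
# [[2.0, 8.0], [4.0, 1.0]]
```

Each histogram stores only the values that actually occur at that pixel, so the memory saving depends on the sparsity assumption holding.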
If you have access to the Image Processing Toolbox, MATLAB has a tool for handling large images called blockproc.
From the docs:
To avoid these problems, you can process large images incrementally: reading, processing, and finally writing the results back to disk, one region at a time. The blockproc function helps you with this process.
I will try my best to help, though I have neither an 800-frame TIFF stack nor an 8GB computer; I want to see whether my thinking can lead to a solution.
First, 800*2048*2048*8bit = 3.2GB, not including the headers (that is at 8 bits per pixel; the 16-bit images in the question need twice that). With your 8GB RAM it should not be too difficult to store it all at once; there might be too many programs running and fragmenting the contiguous memory. Anyway, let's treat the problem as if MATLAB can't load the stack into memory as a whole.
As Jonas suggests, imread supports loading a TIFF image by index. It also supports a PixelRegion parameter, so you can use it to read only parts of each image if you want to follow Shai's idea.
I came up with a median algorithm that doesn't use all the data at the same time; it merely scans through a sequence of unordered data, one value at a time, while keeping 256 counters in memory.
data = randi([0,255], 1, 800);
bins = num2cell(zeros(256,1,'uint16'));
for ii = 1:800
    bins{data(ii)+1} = bins{data(ii)+1} + 1;
end
% clearvars data
s = cumsum(cell2mat(bins));
if find(s==400)
    med = ( find(s==400, 1, 'first') + ...
            find(s>400, 1, 'first') ) /2 - 1;
else
    med = find(s>400, 1, 'first') - 1;
end
It's not very efficient, if only because it uses a for loop. But the benefit is that instead of keeping 800 raw values in memory, only 256 counters are kept; the counters need uint16, though, so they are roughly equivalent to 512 raw values. If you are confident that for any pixel no single grayscale level occurs more than 255 times among the 800 samples, you can use uint8 and halve the memory.
The above code is for one pixel. I'm still thinking how to expand it to a 2048x2048 version, such as
for ii = 1:800
    img_data = randi([0,255], 2048, 2048);
    % (do stats stuff)
end
By doing so, for each iteration, you only need these kept in memory:
One frame of image;
A set of counters;
A few supplemental variables, with size comparable to one frame of image.
I use a cell array to store the counters. According to this post, a cell array can be pre-allocated while its elements are still stored non-contiguously in memory. That means the 256 counters (512*2048*2048 bytes) can be stored separately, which is quite reasonable for your 8GB RAM. But obviously my sample code does not make use of this, since bins = num2cell(zeros(....

Matlab for loop vectorization and memory

X, Y and z are coordinates representing a surface. To calculate some quantity, let's call it flow, at point i,j of the surface, I need to calculate the contribution from all other points (i0,j0). To do so I need, for example, the cosines of the angles between point i0,j0 and all other points (alpha). Then all contributions from i0,j0 must be multiplied by some constants and added. zv0 at every point i,j is the final result.
I came up with the code below, and it seems to be extremely inappropriate. First of all, it slows down the rest of the program and seems to use all of the available memory. My system has 4GB of physical memory and a 12GB swap file, and it always runs out of memory, even though none of the variables is bigger than 10KB. Please help with the speed-up/vectorization and the memory problems.
parfor i0=2:1:length(x00)
    for j0=2:1:length(y00)
        % zv is the fourth output of red3dfunc
        [~,~,~,zv]=red3dfunc(X0,Y0,f,z0,i0,j0,st,ang,nx,ny,nz);
        zv0=zv0+zv;
    end
end
function[X,Y,z,zv]=red3dfunc(X,Y,f,z,i0,j0,st,ang,Nx,Ny,Nz)
% coordinates of the source point (i0,j0)
x1=X(i0,j0);
y1=Y(i0,j0);
z1=z(i0,j0);
alpha=zeros(size(X));
betha=zeros(size(X));
r=zeros(size(X));
% vectors from the source point to every other point
XXa=X-x1;
YYa=Y-y1;
ZZa=z-z1;
VEC=((XXa).^2+(YYa).^2+(ZZa).^2).^(1/2);
% avoid division by zero at the source point itself
VEC(i0,j0)=VEC(i0-1,j0-1);
% normalize to unit direction vectors
XXa=XXa./VEC;
YYa=YYa./VEC;
ZZa=ZZa./VEC;
alpha=-(Nx(i0,j0).*XXa+Ny(i0,j0).*YYa+Nz(i0,j0).*ZZa);
% uses the normalized z-differences ZZa
betha=Nx.*XXa+Ny.*YYa+Nz.*ZZa;
r=VEC;
zv=(1/pi)*st^2*ang.*f.*(alpha).*betha./r.^2;
The obvious way to do this is to use the Kronecker product. The MATLAB function is kron(A,B) for matrices of dimensions nAxmA and nBxmB. This function returns a matrix of dimension (nA*nB)x(mA*mB), which looks something like
[a11*B a12*B ... a1mA*B;
.......................;
anA1*B ........ anAmA*B]
So your problem may be solved by introducing the matrix of ones I=ones(size(X)). You can then define your XXa, YYa, ZZa and VEC matrices without any loop as
XXa = kron(I,X)-kron(X,I);
YYa = kron(I,Y)-kron(Y,I);
ZZa = kron(I,z)-kron(z,I);
VEC=((XXa).^2+(YYa).^2+(ZZa).^2).^(1/2);
You will then find VEC for any i0,j0 as (if you define n and m as size components of X)
VEC((1+n*(i0-1)):(n*i0),(1+m*(j0-1)):(m*j0))
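As a sanity check of the block indexing, here is a small pure-Python sketch. The kron and block_of helpers are written out by hand for illustration (they are not MATLAB's implementation), indices are 0-based, and the 2x2 matrix is made up. Block (i0,j0) of kron(I,X)-kron(X,I) should equal X - X(i0,j0), which is the XXa of the original loop.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a_ij * b_kl for a_ij in a_row for b_kl in b_row]
            for a_row in a for b_row in b]


def block_of(m, i0, j0, n_rows, n_cols):
    """Extract block (i0, j0), 0-based, of size n_rows x n_cols from m."""
    return [row[n_cols * j0:n_cols * (j0 + 1)]
            for row in m[n_rows * i0:n_rows * (i0 + 1)]]


X = [[1, 2], [3, 5]]
ones = [[1, 1], [1, 1]]  # plays the role of I = ones(size(X))

left, right = kron(ones, X), kron(X, ones)
XXa = [[p - q for p, q in zip(lr, rr)] for lr, rr in zip(left, right)]

# Block (1, 0) holds X - X[1][0] = X - 3.
print(block_of(XXa, 1, 0, 2, 2))  # [[-2, -1], [0, 2]]
```

Note that the full matrix has n^2 x m^2 entries, so this trades the loop for memory.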

For iterator (loop)

I am trying to simulate throwing a ball at various angles using Simulink. I'm able to simulate it for one angle, but I would like to simulate it for several angles using a loop. This is what I want to do in Simulink using a for loop:
for i=-5:10:85
Here is a picture of my Simulink model:
If I understand your question correctly, you essentially want to rerun your simulation multiple times for different values of the constant Degrees. Instead of using a For Iterator, you may be able to achieve effectively the same result by using vector operations. That is to say, change the value of the constant Degrees from being a scalar value to instead being a vector (in this particular case just set its value to be [5:10:85]). The outputs of your Simulink model (ie the x and y results) should now be vectors corresponding to the various Degree values.
Put all the blocks into the for-iterator subsystem. The For Iterator block will output the current iteration, you can use that index (which starts at 0/1) to cycle the angle from -5 to 85 (try to hook the For Iterator block up to a Gain and Sum block). At each iteration, all the blocks in the for-iterator subsystem will run, and the output of the For Iterator block will increment by one.
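The Gain and Sum wiring amounts to a linear map from the iteration index to the angle. A small Python sketch of that arithmetic, assuming a 0-based index, a gain of 10 and an offset of -5 (these values are inferred from the -5:10:85 range in the question, and angle_for_iteration is a hypothetical name):

```python
def angle_for_iteration(index, gain=10, offset=-5):
    """Angle in degrees for a given For Iterator index (Gain block, then Sum block)."""
    return gain * index + offset


# 0-based index, 10 iterations: reproduces -5:10:85.
print([angle_for_iteration(i) for i in range(10)])
# [-5, 5, 15, 25, 35, 45, 55, 65, 75, 85]

# With a 1-based index the offset has to be -15 to get the same angles.
print([angle_for_iteration(i, offset=-15) for i in range(1, 11)])
# [-5, 5, 15, 25, 35, 45, 55, 65, 75, 85]
```

The iteration limit of the For Iterator block would be 10 in either case.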
The previous solution to make the angles a vector will also work.
Using MATLAB's for reference page, I'd rewrite your line as:
for i=5:10:85
...
end