Fragment Shader - Average Luminosity - iPhone

Does anybody know how to find the average luminosity for a texture in a fragment shader? I have access to both RGB and YUV textures; the Y component in YUV is an array, and I want to get an average number from this array.

I recently had to do this myself for input images and video frames that I had as OpenGL ES textures. I didn't go with generating mipmaps for these due to the fact that I was working with non-power-of-two textures, and you can't generate mipmaps for NPOT textures in OpenGL ES 2.0 on iOS.
Instead, I did a multistage reduction similar to mipmap generation, but with some slight tweaks. Each step down reduced the size of the image by a factor of four in both width and height, rather than the normal factor of two used for mipmaps. I did this by sampling from four texture locations placed at the centers of the four 2x2-pixel squares that make up each 4x4 area in the higher-level image. Hardware texture interpolation averages each 2x2 block into a single sample, so averaging those four samples yields a 16x reduction in pixel count in a single step.
I converted the image to luminance at the very first stage using a dot product of the RGB values with a vec3 of (0.2125, 0.7154, 0.0721). This allowed me to just read the red channel for each subsequent reduction stage, which really helps on iOS hardware. Note that you don't need this if you are starting with a Y channel luminance texture already, but I was dealing with RGB images.
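A minimal sketch of one such reduction stage, assuming hypothetical varying names that a vertex shader has placed at the centers of the four 2x2 blocks inside each 4x4 region (the very first stage would instead sample RGB and apply dot(color.rgb, vec3(0.2125, 0.7154, 0.0721)) before writing out the red channel):
precision mediump float;

// Hypothetical names; the real framework uses its own varyings computed in the vertex shader.
uniform sampler2D inputTexture;
varying mediump vec2 upperLeftTexCoord;
varying mediump vec2 upperRightTexCoord;
varying mediump vec2 lowerLeftTexCoord;
varying mediump vec2 lowerRightTexCoord;

void main()
{
    // Each bilinear fetch lands in the middle of a 2x2 block, so the hardware
    // has already averaged those four pixels for us.
    float luminance = texture2D(inputTexture, upperLeftTexCoord).r;
    luminance += texture2D(inputTexture, upperRightTexCoord).r;
    luminance += texture2D(inputTexture, lowerLeftTexCoord).r;
    luminance += texture2D(inputTexture, lowerRightTexCoord).r;
    // Averaging the four block averages gives a 16x reduction per pass.
    gl_FragColor = vec4(luminance * 0.25);
}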
Once the image had been reduced to a sufficiently small size, I read the pixels from that back onto the CPU and did a last quick iteration over the remaining few to arrive at the final luminosity value.
For a 640x480 video frame, this process yields a luminosity value in ~6 ms on an iPhone 4, and I think I can squeeze out a 1-2 ms reduction in that processing time with a little tuning. In my experience, that seems faster than the iOS devices normally generate mipmaps for power-of-two images at around that size, but I don't have solid numbers to back that up.
If you wish to see this in action, check out the code for the GPUImageLuminosity class in my open source GPUImage framework (and the GPUImageAverageColor superclass). The FilterShowcase example demonstrates this luminosity extractor in action.

You generally don't do this just with a shader.
One of the more common methods is to create a buffer texture with a full mipmap chain (down to 1x1, this is important). When you want to find luminosity, you copy the backbuffer to this buffer, then regenerate the mips (standard box filtering averages as it goes). The bottom 1x1 level will then hold the average color of the entire surface and can be used to find the average luminosity through something like (c.r * 0.6) + (c.g * 0.3) + (c.b * 0.1). (Edit: if you have YUV, do the same and use the Y; the trick is just averaging the texture down to a single value, which is what mips do.)
This isn't a precise technique, but is reasonably fast, especially on hardware that can generate mipmaps internally.
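As a rough sketch of the lookup side of this, assuming the mip chain has already been regenerated down to 1x1 and that a hypothetical uniform maxMipLevel holds log2 of the texture size so the bias lands on the smallest level:
precision mediump float;

uniform sampler2D sceneTexture;   // hypothetical name for the copied backbuffer
uniform float maxMipLevel;        // log2(texture size), e.g. 8.0 for 256x256

void main()
{
    // Biasing the lookup by the full mip count samples the 1x1 level,
    // which holds the average color of the whole surface.
    vec4 c = texture2D(sceneTexture, vec2(0.5, 0.5), maxMipLevel);
    float lum = (c.r * 0.6) + (c.g * 0.3) + (c.b * 0.1);
    gl_FragColor = vec4(vec3(lum), 1.0);
}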

I'm presenting a solution for the RGB texture here, as I'm not sure mipmap generation would work with a YUV texture.
The first step is to create mipmaps for the texture, if not already present:
glGenerateMipmapOES(GL_TEXTURE_2D);
Now we can access the RGB value of the smallest mipmap level from the fragment shader by using the optional third argument of the sampler function texture2D, the "bias":
vec4 color = texture2D(sampler, vec2(0.5, 0.5), 8.0);
This will shift the mipmap level up eight levels, resulting in sampling a far smaller level.
If you have a 256x256 texture and render it with a scale of 1, a bias of 8.0 will effectively reduce the picked mipmap to the smallest 1x1 level (256 / 2^8 == 1). Of course you have to adjust the bias for your conditions to sample the smallest level.
OK, now we have the average RGB value of the whole image. The third step is to reduce RGB to a luminosity:
float lum = dot(vec3(0.30, 0.59, 0.11), color.xyz);
The dot product is just a fancy (and fast) way of calculating a weighted sum.


I need help understanding the GGX normal distribution function

After reading the book "Real-Time Rendering, 4th Edition", I've decided to give PBR a try, and I chose the GGX algorithm for the normal distribution function. The equation shown in the book looks like the one in this image:
Now, h is the half vector created from the light and view directions L and V respectively.
χ⁺(nDotH) is 1 if nDotH is greater than 0, else 0.
alpha-g is the GGX roughness value between 0 and 1.
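(The image itself isn't reproduced here. For reference, the GGX / Trowbridge-Reitz NDF is commonly written, using the symbols above, as
D(h) = χ⁺(n·h) * alpha_g^2 / (π * (1 + (n·h)^2 * (alpha_g^2 - 1))^2)
which is also the form plugged into numerically in the answer below.)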
My question now is the following: as far as I understand the concept of the NDF and roughness, a high alpha value means the (micro)surface is very rough, and a low value means it is smooth. So, if I want to render a smooth metallic surface such as the body of a car, I would set my alpha to a low value such as 0.1. By doing so, the result of my D(h) is so low that the object can't even be seen.
Am I missing something or did I not fully understand the value of alpha?
I implemented the NDF in MATLAB to analyse my results. I tried it with the coordinates of a cube placed at origin without transformations.
Given 2 coordinates (world space):
N = [0.0 1.0 0.0; 0.0 0.0 1.0]
P = [-1.0 1.0 1.0; -1.0 1.0 1.0]
L-Direction = [0.0 1.0 1.0]
C-Position = [0.0 3.0 4.0]
alpha = 0.1
Results:
D(h) for N1 = 8.6212e-03
D(h) for N2 = 1.7998e-02
As you can see, the values are so low that they aren't visible, especially for the first coordinate, whose normal vector points straight up.
The root problem is that simple point lights often don't suffice for full PBR rendering. Consider the following two renderings of a smooth metallic sphere:
This is the top-left sphere from a glTF sample model rendered in Babylon Sandbox.
On the left side, the sphere is placed in a dark environment against a gray background, and a single point light illuminates the scene. The light is quite bright, but because the sphere is so smooth, and because the "point" nature of the light gives it essentially no radius, the reflection of this light is barely a few pixels, regardless of how bright it may be. The remainder of the sphere has the low D(h) values you mentioned, and is almost black.
On the right side, the same sphere again in the same rendering engine, but this time the engine is using its default environment, which comes from an HDR image. In the case of smooth metal, the resulting render is mostly a mirror reflection of the environment, but rougher and non-metallic surfaces can also have their appearance greatly influenced by colors and intensities in the surrounding environment. With a good quality environment, there's often no need to add point lights at all, and indeed there are no point lights in the right image.
In general, PBR, and particularly metallic PBR, looks best with a full HDRI environment, not just point lights. For some sample code and shaders showing some of this math in action, the Khronos glTF Sample Viewer might be a good place to start. [Disclaimer, I'm a contributor.]
As far as I understood the concept of NDF and roughness, a high alpha value would mean that the (micro)surface is very rough, and a low value smooth. So, if I want to render a smooth metallic surface such as the body of a car, I would set my alpha to a low value such as 0.1. By doing so, the result of my D(h) is so low that the object can't even be seen. Am I missing something or did I not fully understand the value of alpha?
It's true that the numerator of the equation goes to zero.
But so does the denominator, and it does so more rapidly.
Take, as an example, n = h, so dot(n, h) will be one. If alpha is 0.1:
0.1^2 / (3.141593 * (1 + (0.1^2 - 1))^2)
If you plug that into your calculator you will get ~31.83.
So, as you can see, the whole equation doesn't go to zero.
Actually, if you calculate the limit of the equation as alpha goes to zero, the equation goes to infinity. Which makes sense, because when roughness is zero, all the normals are concentrated in a single direction.
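For the special case n = h (so dot(n, h) = 1), the denominator collapses and the limit is easy to see:
D(h) = alpha^2 / (π * (alpha^2)^2) = 1 / (π * alpha^2) → infinity as alpha → 0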

How does an image's alpha transfer so much information to other nodes in Unity ShaderGraph?

I have an image:
The upper part of this image has an alpha value of 1 (or 255 in RGBA).
The lower part has an alpha value of 0.3; I use it for a shadow in the game.
So when I import it into Unity ShaderGraph as _MainTex and split out its alpha, it looks like this:
imported alpha
My first question is:
"alpha" is actually a Vector1 type in the Unity documentation, but as I can see from the preview there are three shades: black indicates an alpha of 0, solid white an alpha of 1, and soft white an alpha of 0.3. How can one single value carry so much information?
My first understanding is:
Each pixel's alpha value is already stored in the image, and the "alpha" in ShaderGraph is just like a global parameter that controls them on a per-pixel basis. (I don't know if this is correct.)
But when I feed the alpha into a Smoothstep node, intending to set every pixel's alpha under 0.3 to 0, I find it works like this:
smoothstep added to the alpha. As you can see, 0.3 < 0.99, so the translucency of the image is removed!
So here comes my second question:
Since "alpha" in the input works like a global parameter, how does it affect a picture separately?
My second understanding is:
"alpha" is just like an one-dimensional array, it stores transparency likes this:
{1,1,1,0.3,0.3,0.3}
and when it calculated by smoothstep,its value will be changed like this:
{1,1,1,0,0,0}
But that brings me back to my first question: alpha is a Vector1 type, it only has one value to edit in the node, so it cannot be an array!
So, how does an image's alpha transfer so much information to other nodes in Unity ShaderGraph?
https://docs.unity3d.com/Packages/com.unity.shadergraph@6.9/manual/Data-Types.html
https://docs.unity3d.com/Packages/com.unity.shadergraph@6.9/manual/Smoothstep-Node.html
I'd really appreciate anyone who can help!
Shaders work in parallel: for any given vertex or pixel you only get data local to that element. Also, critically, 'pixel' (or 'fragment') here means a screen pixel, not a texel, which is a texture's pixel.
In this context, the output of the texture node is a single rgba Vector4 (4 scalar values) at the provided coordinate. This is disconnected from how textures are stored: filtering, compression and mipmapping will come into play (and the control over this comes from the sampler, which you can also provide to the node even though it's most of the time implicit).
Smoothstep is a function that can remap a value - a vector (like the rgba output of the tex node) or a scalar (like the alpha) - into another range. More specifically, it smooths both ends of the range so that the slope is 0 at the min and max. The linear equivalent is an inverse lerp (which doesn't have a built-in instruction in HLSL). You can read about the breakdown on the Wikipedia page: https://www.wikiwand.com/en/Smoothstep
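To make the per-fragment nature concrete, here is a rough GLSL equivalent of what that part of the graph does (the uv varying and the 0.99 smoothstep edge are assumptions taken from the question's screenshots, not the actual generated shader). Every screen pixel runs this independently, so the single alpha value it works with is just that pixel's sample, not a global array:
precision mediump float;

uniform sampler2D _MainTex;
varying vec2 uv;   // hypothetical texture coordinate varying

void main()
{
    vec4 texSample = texture2D(_MainTex, uv); // rgba for THIS fragment only
    float alpha = texSample.a;                // the "Vector1" alpha: one value per fragment
    // smoothstep(0.99, 1.0, 0.3) -> 0.0, smoothstep(0.99, 1.0, 1.0) -> 1.0
    float remapped = smoothstep(0.99, 1.0, alpha);
    gl_FragColor = vec4(texSample.rgb, remapped);
}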

Does Unity pad textures into rectangles?

I already know that if Unity is using a texture with the dimensions 250x250 it will pad the texture to 256x256 so as to make the dimensions a power of 2. If I were to have a texture of size 512x256, would it pad to 512x512 to make the texture a square, or would it stay at 512x256, as each side is already a power of 2?
It should keep the 512x256 resolution. Each side needs to be a power of two, they don't have to be equal.
Note that it's not exactly true that Unity will change the dimensions of every one of your textures. You can change the texture's import settings to Legacy GUI and then you can have a pixel-perfect texture. It will be a bit wasteful or slower (depending on GPU drivers), but it will work quite well.

Why are UIView frames made up of floats?

Technically the x, y, width, and height represent a set of dimensions that relate to pixels. I can't have 200.23422 pixels, so why do they use floats instead of ints?
The reason for the floats is that modern CPUs and GPUs are optimized to work with many floating point numbers in parallel. This is true for iOS as well as the Mac.
With Quartz you don't address individual pixels; everything you draw is antialiased. When you draw at the coordinate (1.0, 1.0), that point sits on a corner between pixels, so the color actually lands across the 2x2 block of pixels around that coordinate.
This is why you might get blurry lines if you draw at integer coordinates. On non-Retina displays you have to draw offset by 0.5. Technically you would need to offset by 0.25 to hit exact pixels on Retina displays, though there it doesn't really matter much, because you can't really see it any more at that pixel size.
Long story short: you don't address pixels directly; the graphics engine maps between floating point coordinates and pixels for you.
Resolution independence.
You want to keep your mathematical representation of your UI as accurate as practicable, only translating to pixel int values when you actually need to draw to the output device (and even then, not really). That's so that you can apply any number of transformations to your views and still get an accurate result.
Moreover it is possible to render lines, for example, at half-pixel widths and even less with a visible result - the system uses intelligent antialiasing to display a fine line.
It's the same principle that vector drawing has used for decades (Adobe's PostScript, SVG, etc.). In fact Quartz is based on PDF, which is the modern version of PostScript. NeXT used Display PostScript in its time, and back then it was considered pretty revolutionary.
The dimensions are actually points that on non-retina screens have a 1 to 1 relation to pixels, but for retina screens 1 point = 2 pixels. So on a retina screen you can actually increment by half a point.

How can I improve the performance of my custom OpenGL ES 2.0 depth texture generation?

I have an open source iOS application that uses custom OpenGL ES 2.0 shaders to display 3-D representations of molecular structures. It does this by using procedurally generated sphere and cylinder impostors drawn over rectangles, instead of these same shapes built using lots of vertices. The downside to this approach is that the depth values for each fragment of these impostor objects need to be calculated in a fragment shader, to be used when objects overlap.
Unfortunately, OpenGL ES 2.0 does not let you write to gl_FragDepth, so I've needed to output these values to a custom depth texture. I do a pass over my scene using a framebuffer object (FBO), only rendering out a color that corresponds to a depth value, with the results being stored into a texture. This texture is then loaded into the second half of my rendering process, where the actual screen image is generated. If a fragment at that stage is at the depth level stored in the depth texture for that point on the screen, it is displayed. If not, it is tossed. More about the process, including diagrams, can be found in my post here.
The generation of this depth texture is a bottleneck in my rendering process and I'm looking for a way to make it faster. It seems slower than it should be, but I can't figure out why. In order to achieve the proper generation of this depth texture, GL_DEPTH_TEST is disabled, GL_BLEND is enabled with glBlendFunc(GL_ONE, GL_ONE), and glBlendEquation() is set to GL_MIN_EXT. I know that a scene output in this manner isn't the fastest on a tile-based deferred renderer like the PowerVR series in iOS devices, but I can't think of a better way to do this.
My depth fragment shader for spheres (the most common display element) looks to be at the heart of this bottleneck (Renderer Utilization in Instruments is pegged at 99%, indicating that I'm limited by fragment processing). It currently looks like the following:
precision mediump float;
varying mediump vec2 impostorSpaceCoordinate;
varying mediump float normalizedDepth;
varying mediump float adjustedSphereRadius;
const vec3 stepValues = vec3(2.0, 1.0, 0.0);
const float scaleDownFactor = 1.0 / 255.0;
void main()
{
    float distanceFromCenter = length(impostorSpaceCoordinate);
    if (distanceFromCenter > 1.0)
    {
        gl_FragColor = vec4(1.0);
    }
    else
    {
        float calculatedDepth = sqrt(1.0 - distanceFromCenter * distanceFromCenter);
        mediump float currentDepthValue = normalizedDepth - adjustedSphereRadius * calculatedDepth;
        // Inlined color encoding for the depth values
        float ceiledValue = ceil(currentDepthValue * 765.0);
        vec3 intDepthValue = (vec3(ceiledValue) * scaleDownFactor) - stepValues;
        gl_FragColor = vec4(intDepthValue, 1.0);
    }
}
On an iPad 1, this takes 35 - 68 ms to render a frame of a DNA spacefilling model using a passthrough shader for display (18 to 35 ms on iPhone 4). According to the PowerVR PVRUniSCo compiler (part of their SDK), this shader uses 11 GPU cycles at best, 16 cycles at worst. I'm aware that you're advised not to use branching in a shader, but in this case that led to better performance than otherwise.
When I simplify it to
precision mediump float;
varying mediump vec2 impostorSpaceCoordinate;
varying mediump float normalizedDepth;
varying mediump float adjustedSphereRadius;
void main()
{
    gl_FragColor = vec4(adjustedSphereRadius * normalizedDepth * (impostorSpaceCoordinate + 1.0) / 2.0, normalizedDepth, 1.0);
}
it takes 18 - 35 ms on iPad 1, but only 1.7 - 2.4 ms on iPhone 4. The estimated GPU cycle count for this shader is 8 cycles. The change in render time based on cycle count doesn't seem linear.
Finally, if I just output a constant color:
precision mediump float;
void main()
{
    gl_FragColor = vec4(0.5, 0.5, 0.5, 1.0);
}
the rendering time drops to 1.1 - 2.3 ms on iPad 1 (1.3 ms on iPhone 4).
The nonlinear scaling in rendering time and sudden change between iPad and iPhone 4 for the second shader makes me think that there's something I'm missing here. A full source project containing these three shader variants (look in the SphereDepth.fsh file and comment out the appropriate sections) and a test model can be downloaded from here, if you wish to try this out yourself.
If you've read this far, my question is: based on this profiling information, how can I improve the rendering performance of my custom depth shader on iOS devices?
Based on the recommendations by Tommy, Pivot, and rotoglup, I've implemented some optimizations which have led to a doubling of the rendering speed for both the depth texture generation and the overall rendering pipeline in the application.
First, I re-enabled the precalculated sphere depth and lighting texture that I'd used before with little effect, only now I use proper lowp precision values when handling the colors and other values from that texture. This combination, along with proper mipmapping for the texture, seems to yield a ~10% performance boost.
More importantly, I now do a pass before rendering both my depth texture and the final raytraced impostors where I lay down some opaque geometry to block pixels that would never be rendered. To do this, I enable depth testing and then draw out the squares that make up the objects in my scene, shrunken by sqrt(2) / 2, with a simple opaque shader. This creates inset squares covering the area known to be opaque in the represented sphere.
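For reference, the sqrt(2) / 2 factor is just the geometry of the inscribed square: a square inscribed in a circle of radius r has side r * sqrt(2), while the circumscribed square (the full impostor quad) has side 2 * r, so
inscribed side / circumscribed side = (r * sqrt(2)) / (2 * r) = sqrt(2) / 2 ≈ 0.707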
I then disable depth writes using glDepthMask(GL_FALSE) and render the square sphere impostor at a location closer to the user by one radius. This allows the tile-based deferred rendering hardware in the iOS devices to efficiently strip out fragments that would never appear onscreen under any conditions, yet still give smooth intersections between the visible sphere impostors based on per-pixel depth values. This is depicted in my crude illustration below:
In this example, the opaque blocking squares for the top two impostors do not prevent any of the fragments from those visible objects from being rendered, yet they block a chunk of the fragments from the lowest impostor. The frontmost impostors can then use per-pixel tests to generate a smooth intersection, while many of the pixels from the rear impostor don't waste GPU cycles by being rendered.
I hadn't thought to disable depth writes, yet leave on depth testing when doing the last rendering stage. This is the key to preventing the impostors from simply stacking on one another, yet still using some of the hardware optimizations within the PowerVR GPUs.
In my benchmarks, rendering the test model I used above yields times of 18 - 35 ms per frame, as compared to the 35 - 68 ms I was getting previously, a near doubling in rendering speed. Applying this same opaque geometry pre-rendering to the raytracing pass yields a doubling in overall rendering performance.
Oddly, when I tried to refine this further by using inset and circumscribed octagons, which should cover ~17% fewer pixels when drawn, and be more efficient with blocking fragments, performance was actually worse than when using simple squares for this. Tiler utilization was still less than 60% in the worst case, so maybe the larger geometry was resulting in more cache misses.
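The ~17% figure checks out from the areas of regular polygons circumscribed about the same circle of radius r:
area(octagon) / area(square) = (8 * r^2 * tan(22.5°)) / (4 * r^2 * tan(45°)) ≈ 3.314 / 4 ≈ 0.83
so the octagon should rasterize roughly 17% fewer pixels than the square.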
EDIT (5/31/2011):
Based on Pivot's suggestion, I created inscribed and circumscribed octagons to use instead of my rectangles, only I followed the recommendations here for optimizing triangles for rasterization. In previous testing, octagons yielded worse performance than squares, despite removing many unnecessary fragments and letting you block covered fragments more efficiently. By adjusting the triangle drawing as follows:
I was able to reduce overall rendering time by an average of 14% on top of the above-described optimizations by switching to octagons from squares. The depth texture is now generated in 19 ms, with occasional dips to 2 ms and spikes to 35 ms.
EDIT 2 (5/31/2011):
I've revisited Tommy's idea of using the step function, now that I have fewer fragments to discard due to the octagons. This, combined with a depth lookup texture for the sphere, now leads to a 2 ms average rendering time on the iPad 1 for the depth texture generation for my test model. I consider that to be about as good as I could hope for in this rendering case, and a giant improvement from where I started. For posterity, here is the depth shader I'm now using:
precision mediump float;
varying mediump vec2 impostorSpaceCoordinate;
varying mediump float normalizedDepth;
varying mediump float adjustedSphereRadius;
varying mediump vec2 depthLookupCoordinate;
uniform lowp sampler2D sphereDepthMap;
const lowp vec3 stepValues = vec3(2.0, 1.0, 0.0);
void main()
{
    lowp vec2 precalculatedDepthAndAlpha = texture2D(sphereDepthMap, depthLookupCoordinate).ra;
    float inCircleMultiplier = step(0.5, precalculatedDepthAndAlpha.g);
    float currentDepthValue = normalizedDepth + adjustedSphereRadius - adjustedSphereRadius * precalculatedDepthAndAlpha.r;
    // Inlined color encoding for the depth values
    currentDepthValue = currentDepthValue * 3.0;
    lowp vec3 intDepthValue = vec3(currentDepthValue) - stepValues;
    gl_FragColor = vec4(1.0 - inCircleMultiplier) + vec4(intDepthValue, inCircleMultiplier);
}
I've updated the testing sample here, if you wish to see this new approach in action as compared to what I was doing initially.
I'm still open to other suggestions, but this is a huge step forward for this application.
On the desktop, it was the case on many early programmable devices that while they could process 8 or 16 or whatever fragments simultaneously, they effectively had only one program counter for the lot of them (since that also implies only one fetch/decode unit and one of everything else, as long as they work in units of 8 or 16 pixels). Hence the initial prohibition on conditionals and, for a while after that, the situation where if the conditional evaluations for pixels that would be processed together returned different values, those pixels would be processed in smaller groups in some arrangement.
Although PowerVR aren't explicit, their application development recommendations have a section on flow control and make a lot of recommendations about dynamic branches usually being a good idea only where the result is reasonably predictable, which makes me think they're getting at the same sort of thing. I'd therefore suggest that the speed disparity may be because you've included a conditional.
As a first test, what happens if you try the following?
void main()
{
    float distanceFromCenter = length(impostorSpaceCoordinate);
    // the step function doesn't count as a conditional
    float inCircleMultiplier = step(distanceFromCenter, 1.0);
    float calculatedDepth = sqrt(1.0 - distanceFromCenter * distanceFromCenter * inCircleMultiplier);
    mediump float currentDepthValue = normalizedDepth - adjustedSphereRadius * calculatedDepth;
    // Inlined color encoding for the depth values
    float ceiledValue = ceil(currentDepthValue * 765.0) * inCircleMultiplier;
    vec3 intDepthValue = (vec3(ceiledValue) * scaleDownFactor) - (stepValues * inCircleMultiplier);
    // use the result of the step to combine results
    gl_FragColor = vec4(1.0 - inCircleMultiplier) + vec4(intDepthValue, inCircleMultiplier);
}
Many of these points have been covered by others who have posted answers, but the overarching theme here is that your rendering does a lot of work that will be thrown away:
1. The shader itself does some potentially redundant work. The length of a vector is likely to be calculated as sqrt(dot(vector, vector)). You don't need the sqrt to reject fragments outside of the circle, and you're squaring the length to calculate the depth anyway. Additionally, have you looked at whether or not explicit quantization of the depth values is actually necessary, or can you get away with just using the hardware's conversion from floating-point to integer for the framebuffer (potentially with an additional bias to make sure your quasi-depth tests come out right later)? (A sketch of this appears after this list.)
2. Many fragments are trivially outside the circle. Only π/4 of the area of the quads you're drawing produce useful depth values. At this point, I imagine your app is heavily skewed towards fragment processing, so you may want to consider increasing the number of vertices you draw in exchange for a reduction in the area that you have to shade. Since you're drawing spheres through an orthographic projection, any circumscribing regular polygon will do, although you may need a little extra size depending on zoom level to make sure you rasterize enough pixels.
3. Many fragments are trivially occluded by other fragments. As others have pointed out, you're not using hardware depth test, and therefore not taking full advantage of a TBDR's ability to kill shading work early. If you've already implemented something for 2), all you need to do is draw an inscribed regular polygon at the maximum depth that you can generate (a plane through the middle of the sphere), and draw your real polygon at the minimum depth (the front of the sphere). Both Tommy's and rotoglup's posts already contain the state vector specifics.
Note that 2) and 3) apply to your raytracing shaders as well.
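A minimal sketch of point 1, assuming the same varyings as the original shader: compare the squared distance against 1.0 so the circle test needs no sqrt, reuse the squared distance for the depth term, and skip the explicit 765-level encoding in favor of the framebuffer's own float-to-int conversion (note this changes the output encoding, so the later depth comparison would need to match):
precision mediump float;

varying mediump vec2 impostorSpaceCoordinate;
varying mediump float normalizedDepth;
varying mediump float adjustedSphereRadius;

void main()
{
    // Squared distance is enough for the in/out test; no sqrt needed here.
    float distanceSquared = dot(impostorSpaceCoordinate, impostorSpaceCoordinate);
    float inCircleMultiplier = step(distanceSquared, 1.0);
    // The depth term still needs a sqrt, but it reuses the already-squared distance.
    float calculatedDepth = sqrt(max(1.0 - distanceSquared, 0.0));
    float currentDepthValue = normalizedDepth - adjustedSphereRadius * calculatedDepth;
    // No explicit quantization; rely on the hardware conversion when writing the framebuffer.
    gl_FragColor = vec4(vec3(currentDepthValue), 1.0) * inCircleMultiplier + vec4(1.0 - inCircleMultiplier);
}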
I'm no mobile platform expert at all, but I think that what bites you is that:
your depth shader is quite expensive
you experience massive overdraw in your depth pass because GL_DEPTH_TEST is disabled
Wouldn't an additional pass, drawn before the depth computation pass, be helpful?
This pass could do a GL_DEPTH prefill, for example by drawing each sphere as a quad facing the camera (or a cube, which may be easier to set up) contained within the associated sphere. This pass could be drawn with the color mask off and no fragment shader, just with GL_DEPTH_TEST and depth writes (glDepthMask) enabled. On desktop platforms, these kinds of passes get drawn faster than color + depth passes.
Then in your depth computation pass, you could enable GL_DEPTH_TEST and disable depth writes with glDepthMask; this way your shader would not be executed on pixels that are hidden by nearer geometry.
This solution would involve issuing another set of draw calls, so this may not be beneficial.