I am working on drawing large directed acyclic graphs in WebGL using the gwt-g3d library as per the technique shown here: http://www-graphics.stanford.edu/papers/h3/
At this point, I have a simple two-level graph rendering:
Performance is terrible -- it takes about 1.5-2 seconds to render this thing. I'm not an OpenGL expert, so here is the general approach I am taking. Maybe somebody can point out some optimizations that will get this rendering quicker.
I am astonished how long it takes to push the MODELVIEW matrix and buffers to the graphics card. This is where the lion's share of the time is wasted. Should I instead be doing MODELVIEW transformations in the vertex shader?
This leads me to believe that manipulating the MODELVIEW matrix and pushing it once for each node shouldn't be a bad practice, but the timings don't lie:
https://gamedev.stackexchange.com/questions/27042/translate-the-modelview-matrix-or-change-vertex-coordinates
Group nodes in larger chunks instead of rendering them separately. Do background caching of all geometry with applied transformations that most likely will not be modified and store it in one buffer and render in one call.
Another solution: Store nodes(box + line) in one buffer(You can store more than you need at current time) and their transformations in texture. apply transformations in vertex shader based on node index(texture coordinates) It should be faster drastically faster.
To test support use this site. I have MAX_VERTEX_TEXTURE_IMAGE_UNITS | 4
Best solution will be Geometry Instancing but it currently isn't supported in WebGL.
Related
I'm trying to implement a state preserving particle system on the iPhone using OpenGL ES 2.0. By state-preserving, I mean that each particle is integrated forward in time, having a unique velocity and position vector that changes with time and can not be calculated from the initial conditions at every rendering call.
Here's one possible way I can think of.
Setup particle initial conditions in VBO.
Integrate particles in vertex shader, write result to texture in fragment shader. (1st rendering call)
Copy data from texture to VBO.
Render particles from data in VBO. (2nd rendering call)
Repeat 2.-4.
The only thing I don't know how to do efficiently is step 3. Do I have to go through the CPU? I wonder if is possible to do this entirely on the GPU with OpenGL ES 2.0. Any hints are greatly appreciated!
I don't think this is possible without simply using glReadPixels -- ES2 doesn't have the same flexible buffer management that OpenGL has to allow you to copy buffer contents using the GPU (where, for example, you could copy data between the texture and vbo, or use simply use transform feedback which is basically designed to do exactly what you want).
I think your only option if you need to use the GPU is to use glReadPixels to copy the framebuffer contents back out after rendering. You probably also want to check and use EXT_color_buffer_float or related if available to make sure you have high precision values (RGBA8 is probably not going to be sufficient for your particles). If you're intermixing this with normal rendering, you probably want to build in a bunch of buffering (wait a frame or two) so you don't stall the CPU waiting for the GPU (this would be especially bad on PowerVR since it buffers a whole frame before rendering).
ES3.0 will have support for transform feedback, which doesn't help but hopefully gives you some hope for the future.
Also, if you are running on an ARM cpu, it seems like it'd be faster to use NEON to quickly update all your particles. It can be quite fast and will skip all the overhead you'll incur from the CPU+GPU method.
I am working on making a sprite class in OpenGL ES 2.0 and have succeeded to a point. Currently I have a render method for the sprite and it's called by the render method in my EAGL layer at intervals. I was creating new vertex buffer and index buffer every time render was called but it isn't efficient so I called glremovebuffer. Unfortunately when I do that the frame-rate is slowed down significantly.
So currently I have the vbo and ibo created at initialization which works fine in terms of frame-rate and memory consumption but is unable to update position.
I'm at a bit of a loss as I'm just beggining with OpenGL, any help is appreciated.
Typically you want to create your sprite with VBOs and IBOs once, located at the model origin. To translate, rotate, and scale, you would then use the model matrix to transform your sprite into a desired location.
I'm fairly certain that iphone sdk provides some nice functions to do that, but I don't know any of them :) Basically, in your shader, you take your position coordinates and you multiply it by one or more matrices, one of those matrices is the model matrix, which you can change to be a translate, rotate, scale, or any combination of those matrices (in fact, it can be any matrix you want and it will produce different results).
There's a lot of resources out there that explain these transformation matrices. Here's one for instance:
http://db-in.com/blog/2011/04/cameras-on-opengl-es-2-x/
My advise is to find a tutorial that speaks on the same level as your understand and learn from there...
I just finished implementing VBO's in my 3D app and saw a roughly 5-10x speed increase in rendering. What used to render at 1-2 frames per second now renders at 10-11 frames per second.
My question is, are there any further improvements I can make to increase rendering speed? Will triangle strips make a big difference? Currently vertices are not being shared between faces, each faces vertices are unique but overlapping.
My Device Utilization is 100%, Tiler Utilization is 100%, Renderer Utilization is 11%, and resource bytes is 114819072. This is rendering 912,120 faces on a CAD model.
Any suggestions?
A Tiler Utilization of 100% indicates that your bottleneck is in the size of the geometry being sent to the GPU. Whatever you can do to shrink the geometry size can lead to an almost linear reduction in rendering time, in my experience. These tuning steps have worked for me in the past:
If you're not already, you could look at using indexing, which might cut down on geometry by eliminating some redundant vertices. The PowerVR GPUs in the iOS devices are optimized for using indexed geometry, as well.
Try using a smaller data type for your vertex information. I found that I could use GLshort instead of GLfloat for my vertices and normals without losing much precision in the rendering. This will significantly compact your geometry and lead to a nice speed boost in rendering.
Bin similarly colored vertices and render them as one group at a set color, rather than supplying per-vertex color information. The overhead from the few extra draw calls this requires will be vastly outweighed by the speedup you get from not having to send all that color information. I saw a ~18% reduction in rendering time by binning the colors in one of my larger models.
You're already using VBOs, so you've taken advantage of that optimization.
Don't halt the rendering pipeline at any point. Cut out anything that reads the current state, like all glGet* calls, because they really mess with the flow of the PowerVR GPUs.
There are other things you can do that will lead to smaller performance improvements, like using interleaved vertex, normal, texture data in your VBOs, aligning your data to 4 byte boundaries, etc., but the ones above are what I've found to have the largest impact in the tuning of my own OpenGL ES 1.1 application.
Most of these points are covered well in the "Best Practices for Working with Vertex Data" section of Apple's OpenGL ES Programming Guide for iOS.
I'm developing a game for iPhone in OpenGL ES 1.1; I have a lot of textured quads in a data structure where each node has a list of children nodes. So I traverse the structure from the root, and do the render of each quad, then its childs and so on.
The thing is, for each quad I'm calling glVertexPointer to set the vertices.
Should I avoid calling it for each quad? Will improve performance calling just once for example?
glVertexPointer copies the vertices to GPU memory or just saves the pointer?
Trying to minimize the number of calls will not be easy since each node may have a different quad. I have a lot of equal sprites with the same vertex data, but I'm not necessarily rendering one after another since I may be drawing a different sprite between them.
Thanks.
glVertexPointer keeps just the pointer, but incurs a state change in the OpenGL driver and an explicit synchronisation, so costs quite a lot. Normally when you say 'here's my data, please draw', the GPU starts drawing and continues to do so in parallel to whatever is going on on the CPU for as long as it can. When you change rendering state, it needs to finish whatever it was doing in the old state. So by changing once per quad, you're effectively forcing what could be concurrent processing to be consecutive. Hence, avoiding glVertexPointer (and, presumably, a glDrawArrays or glDrawElements?) per quad should give you a significant benefit.
An immediate optimisation is simply to keep a count of the number of quads in total in the data structure, allocate a single target buffer for vertices that is at least that size and have all quads copy their geometry into the target buffer rather than calling glVertexPointer each time. Then call glVertexPointer and your drawing calls (condensed to just one call also, hopefully) with the one big array at the end. It's a bit more costly on the CPU side but the parallelism and lack of repeated GPU/CPU synchronisations should save you a lot.
While tiptoeing around topics currently under NDA, I strongly suggest you look at the Xcode 4 beta. Amongst other features Apple have stated publicly to be present is an OpenGL ES profiler. So you can easily compare approaches.
To copy data to the GPU, you need to use a vertex buffer object. That means creating a buffer with glGenBuffers, pushing data to it with glBufferData and then posting a glVertexPointer with an address of e.g. 0 if the first byte in the data you uploaded is the first byte of your vertices. In ES 1.x, you can upload data as GL_DYNAMIC_DRAW to flag that you intend to update it quite often and draw from it quite often. It's probably worth doing if you can get into a position where you're drawing more often than you're uploading.
If you ever switch to ES 2.x there's also GL_STREAM_DRAW, which may be worth investigating but isn't directly relevant to your question. I mention it as it'll likely come up if you Google for vertex buffer objects, being available on desktop OpenGL. Options for ES 1.x are only GL_STATIC_DRAW and GL_DYNAMIC_DRAW.
I've just recently worked on an iPad ES 1.x application with objects that change every frame but are drawn twice per the rendering pipeline in use. There are only five such objects on screen, each 40 vertices, but switching from the initial implementation to the VBO implementation cut 20% off my total processing time.
I'd like to hear what people think the optimal draw calls are for Open GL ES (on the iphone).
Specifically I've read in many places that it is best to minimise the number of calls to glDrawArrays/glDrawElements - I think Apple say 10 should be the max in their recent WWDC presentation. As I understand it to do this you need to put all the vertices into one array if possible, so you only need to make the drawArrays call once.
But I am confused because this surely means you can't use the translate, rotate, scale functions, because it would apply across the whole geometry. Which is fine except doesn't that mean you need to pre-calculate every vertex position yourself, rather than getting open gl to do it?
Also, doesn't it mean you can't use any of the fan/strip settings unless you just have a continuous shape?
These drawbacks make me think I'm not understanding something correctly, so I guess I'm looking for confirmation that I should:
Be trying to make an uber array of all triangles to draw.
Resign myself to the fact I'll have to work out all the vertex positions myself.
Forget about push'ing and pop'ing each thing to draw into it's desired location
Is that what others do?
Thanks
Vast question, batching is always a matter of compromise.
The ideal structure for performance would be, as you mention, to one single array containing all triangles to draw.
Starting from here, we can start adding constraints :
One additional constraint is that
having vertex indices in 16bits saves
bandwidth and memory, and probably
the fast path for your platform. So
you could consider grouping triangles
in chunks of 65536 vertices.
Then, if you want to switch the
shader/material/glState used to draw
geometry, you have no choice (*) but
to emit one draw call per
shader/material/glState. So grouping
triangles could consider grouping by
shaderID/materialID/glStateID.
Next, if you want to animate things,
you have no choice (*) but to
transmit your transform matrix to GL,
and then issue a draw call. So
grouping triangles could consider
grouping triangles by 'transform
groups', for example, all static
geometry together, animated geometry
that have common transforms can be
grouped too.
In these cases, you'd have to transform the vertices yourself (using CPU) before merging the meshes together.
Regarding triangle strips, you can transform any mesh in strips, even if it has discontinuities in its topology, by introducing degenerate triangles. So this is a technique that always apply.
All in all, reducing draw calls is a game of compromises, some techniques might work well for a 3d model, while others may be more suited for other 3d models. IMHO, the key is to be creative and to carefully benchmark your application to see if your changes actually improve performance on your target platform.
HTH, cheers,
(*) actually there are techniques that allow to reduce the number of draw calls in these cases, such as :
texture atlases to group different textures in a single one, to prevent
switching textures in GL, thus
allowing to limit draw calls
(pseudo) hardware instancing that allow shaders to fetch transforms
from various sources to transform
mesh instances in different ways.
...