Advice on speeding up OpenGL ES 1.1 on the iPhone

I'm working on an iPhone App that relies heavily on OpenGL. Right now it runs a bit slow on the iPhone 3G, but looks snappy on the new 32G iPod Touch. I assume this is hardware related. Anyway, I want to get the iPhone performance to resemble the iPod Touch performance. I believe I'm doing a lot of things sub-optimally in OpenGL and I'd like advice on what improvements will give me the most bang for the buck.
My scene rendering goes something like this:
Repeat 35 times
    glPushMatrix
    glLoadIdentity
    glTranslate
    Repeat 7 times
        glBindTexture
        glVertexPointer
        glNormalPointer
        glTexCoordPointer
        glDrawArrays(GL_TRIANGLES, ...)
    glPopMatrix
My Vertex, Normal and Texture Coords are already interleaved.
So, what steps should I take to speed this up? What step would you try first?
My first thought is to eliminate all those glBindTexture() calls by using a Texture Atlas.
What about some more efficient matrix operations? I understand the gl*() versions aren't too efficient.
What about VBOs?
Update
There are 8260 triangles.
Texture sizes are 64x64 pngs. There are 58 different textures.
I have not run instruments.
Update 2
After running the OpenGL ES Instrument on the iPhone 3G I found that my Tiler Utilization is in the 90-100% range, and my Render Utilization is in the 30% range.
Update 3
Texture atlasing had no noticeable effect on the problem. Utilization ranges are still as noted above.
Update 4
Converting my Vertex and Normal pointers to GL_SHORT seemed to improve FPS, but the Tiler Utilization is still in the 90% range a lot of the time. I'm still using GL_FLOAT for my texture coordinates. I suppose I could knock those down to GL_SHORT and save four more bytes per vertex.
Update 5
Converting my texture coordinates to GL_SHORT yielded another performance increase. I'm now consistently getting >30 FPS. Tiler Utilization is still around 90%, but frequently drops down into the 70-80% range. The Renderer Utilization is hovering around 50%. I suppose this might have something to do with scaling the texture coordinates from GL_TEXTURE Matrix Mode.
I'm still seeking additional improvements. I'd like to get closer to 40 FPS, as that's what my iPod Touch gets and it's silky smooth there. If anyone is still paying attention, what other low-hanging fruit can I pick?

With a tiler utilization still above 90%, you’re likely still vertex throughput-bound. Your renderer utilization is higher because the GPU is rendering more frames. If your primary focus is improving performance on older devices, then the key is still to cut down on the amount of vertex data needed per triangle. There are two sides to this:
Reducing the amount of data per vertex: Now that all of your vertex attributes are already GL_SHORTs, the next thing to pursue is finding a way to do what you want using fewer attributes or components. For example, if you can live without specular highlights, using DOT3 lighting instead of OpenGL ES fixed-function lighting would replace your 3 shorts (+ 1 short of padding) for normals with 2 shorts for an extra texture coordinate. As an additional bonus, you’d be able to light your models per-pixel.
Reducing the number of vertices needed per triangle: When drawing with indexed triangles, you should make sure that your indices are sorted for maximum reuse. Running your geometry through Imagination Technologies’ PVRTTriStrip tool would probably be your best bet here.
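As a rough illustration of the second point, the sketch below turns an unindexed triangle list into a set of unique vertices plus an index array, so shared corners are stored once. It is not what PVRTTriStrip does (it does not reorder indices for reuse), and the `Vertex` layout and the `build_index_buffer` name are assumptions made for this example only. The search is a plain O(n^2) comparison, which is fine for a one-off offline pass but not for per-frame use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical interleaved vertex made only of shorts (no hidden padding). */
typedef struct {
    short position[4];  /* x, y, z, w */
    short normal[4];    /* x, y, z, padding */
    short texcoord[2];  /* s, t */
} Vertex;

/*
 * Converts an unindexed triangle list ("soup") into unique vertices plus
 * indices. unique_out and indices_out must each have room for `count`
 * entries. Returns the number of unique vertices written.
 */
size_t build_index_buffer(const Vertex *soup, size_t count,
                          Vertex *unique_out, unsigned short *indices_out)
{
    size_t unique_count = 0;

    /* Indices are stored as unsigned short (GL_UNSIGNED_SHORT). */
    assert(count <= 65536);

    for (size_t i = 0; i < count; ++i) {
        size_t j;
        /* Look for an identical vertex that was already emitted. */
        for (j = 0; j < unique_count; ++j) {
            if (memcmp(&soup[i], &unique_out[j], sizeof(Vertex)) == 0)
                break;
        }
        if (j == unique_count)
            unique_out[unique_count++] = soup[i];
        indices_out[i] = (unsigned short)j;
    }
    return unique_count;
}
```

The resulting arrays would be drawn with glDrawElements(GL_TRIANGLES, count, GL_UNSIGNED_SHORT, ...) instead of glDrawArrays. A quad built from two triangles drops from six stored vertices to four.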

If you only have 58 different 64x64 textures, a texture atlas seems like a good idea, since they'd all fit in a single 512x512 texture... if you don't rely on texture wrap modes, I'd certainly at least try this.
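To make the arithmetic concrete: a 512x512 atlas holds 8 x 8 = 64 tiles of 64x64, so 58 textures fit with room to spare. The sketch below computes the texture-coordinate rectangle for a tile, assuming tiles are packed row by row in a square atlas. The `AtlasRect` type and `atlas_rect` function are hypothetical names for this example, and the code ignores filtering bleed between neighbouring tiles.

```c
#include <assert.h>

/* Texture-coordinate rectangle of one tile inside the atlas. */
typedef struct {
    float u0, v0;  /* corner with the smallest coordinates */
    float u1, v1;  /* corner with the largest coordinates */
} AtlasRect;

/*
 * Returns the UV rectangle of tile `index` in a square atlas whose tiles
 * are stored row by row. atlas_size and tile_size are in texels.
 */
AtlasRect atlas_rect(int index, int atlas_size, int tile_size)
{
    int tiles_per_row = atlas_size / tile_size;
    assert(index >= 0 && index < tiles_per_row * tiles_per_row);

    int col = index % tiles_per_row;
    int row = index / tiles_per_row;
    float step = (float)tile_size / (float)atlas_size;

    AtlasRect r = { col * step, row * step,
                    (col + 1) * step, (row + 1) * step };
    return r;
}
```

Each model's original 0..1 texture coordinates would be remapped into its tile's rectangle once, at load time, which is why this approach does not work if the textures rely on wrap modes.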
What format are your textures in? You might try using a compressed PVRTC texture; I think that's less load on the Tiler, and I've been pleasantly surprised by the image quality even for 2-bit-per-pixel textures. (Good for natural images, not good if you're doing something that looks like an 8-bit video game)

The first thing I would do is run Instruments profiling on the hardware device that is slow. It should show you pretty quickly where the bottlenecks are for your particular case.
Update after instruments results:
This question shows Instruments results similar to yours, so perhaps the advice there also applies to your case (basically, reduce the amount of vertex data).

The biggest win in graphics programming comes down to this:
Batch, Batch, Batch
Texture atlasing will make a bigger difference than almost anything else you can do. Switching textures is like stopping a speeding train to let on new passengers every time.
Combine all those textures into an atlas and cut your draw calls down a lot.
This web-based tool may be helpful: http://zwoptex.zwopple.com/

Have you looked over the "OpenGL ES Programming Guide for iPhone OS" in the dev center? There are sections on Best Practices for Vertex Data and Texture Data.
Is your data formatted to be able to use triangle strips?
In terms of least effort, the modification sequence for you would probably be:
Reducing vertex attribute size
VBOs
Note that when you do these, you need to make sure that components are aligned on their native alignment, i.e. the floats or full ints are on 4-byte boundaries, the shorts are on 2-byte boundaries. If you don't do this it will tank your performance. It might be helpful to mentally map it by typing out your attribute ordering as a struct definition so you can sanity check your layout and alignment.
making sure your data is organized into triangle strips so that vertices are shared
using a texture atlas to reduce texture swaps
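The struct-definition trick mentioned in the alignment note can be sketched as follows. The attribute choice (float positions, short normals with one short of padding, short texture coordinates) and the names `InterleavedVertex` and `check_vertex_layout` are assumptions for illustration, and the byte counts in the comments assume a 4-byte float and a 2-byte short, as on the iPhone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interleaved layout written out as a struct for sanity checks. */
typedef struct {
    float position[3];  /* 12 bytes at offset 0 */
    short normal[3];    /*  6 bytes at offset 12 */
    short normal_pad;   /*  2 bytes so the next attribute starts at offset 20 */
    short texcoord[2];  /*  4 bytes at offset 20 */
} InterleavedVertex;

/* Asserts that every attribute sits on its native alignment. */
void check_vertex_layout(void)
{
    assert(offsetof(InterleavedVertex, position) % sizeof(float) == 0);
    assert(offsetof(InterleavedVertex, normal) % sizeof(short) == 0);
    assert(offsetof(InterleavedVertex, texcoord) % sizeof(short) == 0);
    /* Stride must keep the float attribute aligned in every array element. */
    assert(sizeof(InterleavedVertex) % sizeof(float) == 0);
}
```

sizeof(InterleavedVertex) then serves as the stride argument and the offsetof values as the per-attribute offsets for glVertexPointer, glNormalPointer and glTexCoordPointer.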

To try converting your textures to 16-bit RGB565 format, see this code in Apple's venerable Texture2D.m, search for kTexture2DPixelFormat_RGB565
http://code.google.com/p/cocos2d-iphone/source/browse/branches/branch-0.1/OpenGLSupport/Texture2D.m
(this code loads PNGs and converts them to RGB565 at texture creation time; I don't know if there's an RGB565 file format as such)
For more information on PVRTC compressed textures (which looked way better than I expected when I used them, even at 2 bits per pixel) see Apple's PVRTextureLoader sample:
http://developer.apple.com/iPhone/library/samplecode/PVRTextureLoader/index.html
it has both the code for loading PVRTC textures in your app and also instructions for using the texturetool to convert your .png files into .pvr files.

Related

Why are mipmaps not boosting sprite performance in this 2.5D scenario?

I have a situation where I have loads of high-resolution bushes.
The issue is, these bushes are far too high detail and thus cause performance issues (partly because of shadows). A smooth solution to this would possibly be mipmaps, allowing the bushes to become lower resolution when further away from the camera. However, this did not work as anticipated.
Scene with mipmaps (as you can see the sprites further away are blurry):
Scene without mipmaps:
This is the performance difference.
With mipmaps
Without mipmaps
Why is there no performance increase?
Your mipmaps are not the solution for your performance issue. They only reduce the texture resolution of the objects further away. The performance difference is there, but it is not what you expected.
737 batches is a lot of draw calls batched together.
To reduce the draw calls and gain the desired performance boost, you have several options:
Reduce the Triangle Count. You did not show the wireframe as suggested, but what you want to do for your grass is have a simple square with 2 triangles and use the alpha mask based texture like you already did.
Make your material more efficient: for example do not use 2 sided rendering, if you use a shader graph, reduce the amount of processing.
Make the grass objects static and bake your lighting. Because the renderer needs 1 draw call for the mesh and 1 draw call for the light, this will drastically reduce your draw calls.
Additionally, you could group several grass objects into clusters. Texture them with one material, as each individual material also affects your performance.
To understand the difference between draw calls and batches, read this:
https://support.unity.com/hc/en-us/articles/207061413-Why-are-my-batches-draw-calls-so-high-What-does-that-mean-
Very informative forum post on polycount where Joe Wilson shares some knowledge. Worth a read:
https://polycount.com/discussion/206507/the-cost-of-a-texture-draw-call-quantity-vs-resolution

How many maximum triangles can be drawn on ipad using opengl es in 1 frame?

What is the maximum number of triangles that can be drawn on an iPad in a single frame? Also, is there a limit to the number of gl calls used to draw those triangles?
The only limit on total triangles that you'll run into on the iPad is in terms of memory size and how quickly you wish for this to render. The more vertices you send, the more memory your application will use, and the slower it will render.
For example, in my benchmarks I was able to push over 1,800,000 triangles per second on an iPad 1 using OpenGL ES 1.1 smooth shading, a single light source, geometry stored in vertex buffer objects (VBOs), and vertices represented by GLshorts in order to minimize total size. The iPad 2 is significantly faster than that, especially when you start doing more complex operations in your fragment shaders. From that number, I can estimate that I'd want to have fewer than 30,000 triangles in my scene geometry if I wanted to render at 60 FPS on the iPad 1.
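The per-frame budget in that estimate is just throughput divided by frame rate: 1,800,000 / 60 = 30,000. A trivial helper, with a name invented for this example, makes it easy to redo the estimate for other targets:

```c
#include <assert.h>

/* Triangles that fit into one frame at a given throughput and frame rate. */
long triangles_per_frame(long triangles_per_second, int frames_per_second)
{
    assert(frames_per_second > 0);
    return triangles_per_second / frames_per_second;
}
```

With the iPad 1 figure above, triangles_per_frame(1800000, 30) gives 60,000 for a 30 FPS target. This is an upper bound under benchmark conditions, not a guarantee for a real scene.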
OpenGL ES 2.0 shaders make things more complicated because of their varying complexity, but they enable new effects and may allow you to use fewer triangles to achieve the same image quality as the fixed function pipeline.
For another example, in this question Davido has a model with about 900,000 triangles that he's able to render at nearly 10 FPS on an iPad 2. I also present some geometry optimization techniques in my answer there that I've found to have a significant impact on OpenGL ES 1.1 rendering when you are maxing out tiler utilization on the device.

How can I optimize the rendering of a large model in OpenGL ES 1.1?

I just finished implementing VBOs in my 3D app and saw a roughly 5-10x speed increase in rendering. What used to render at 1-2 frames per second now renders at 10-11 frames per second.
My question is, are there any further improvements I can make to increase rendering speed? Will triangle strips make a big difference? Currently vertices are not being shared between faces; each face's vertices are unique but overlapping.
My Device Utilization is 100%, Tiler Utilization is 100%, Renderer Utilization is 11%, and resource bytes is 114819072. This is rendering 912,120 faces on a CAD model.
Any suggestions?
A Tiler Utilization of 100% indicates that your bottleneck is in the size of the geometry being sent to the GPU. Whatever you can do to shrink the geometry size can lead to an almost linear reduction in rendering time, in my experience. These tuning steps have worked for me in the past:
If you're not already, you could look at using indexing, which might cut down on geometry by eliminating some redundant vertices. The PowerVR GPUs in the iOS devices are optimized for using indexed geometry, as well.
Try using a smaller data type for your vertex information. I found that I could use GLshort instead of GLfloat for my vertices and normals without losing much precision in the rendering. This will significantly compact your geometry and lead to a nice speed boost in rendering.
Bin similarly colored vertices and render them as one group at a set color, rather than supplying per-vertex color information. The overhead from the few extra draw calls this requires will be vastly outweighed by the speedup you get from not having to send all that color information. I saw a ~18% reduction in rendering time by binning the colors in one of my larger models.
You're already using VBOs, so you've taken advantage of that optimization.
Don't halt the rendering pipeline at any point. Cut out anything that reads the current state, like all glGet* calls, because they really mess with the flow of the PowerVR GPUs.
There are other things you can do that will lead to smaller performance improvements, like using interleaved vertex, normal, texture data in your VBOs, aligning your data to 4 byte boundaries, etc., but the ones above are what I've found to have the largest impact in the tuning of my own OpenGL ES 1.1 application.
Most of these points are covered well in the "Best Practices for Working with Vertex Data" section of Apple's OpenGL ES Programming Guide for iOS.
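The switch from GLfloat to GLshort mentioned above needs a scale factor so the model's coordinate range maps onto the 16-bit integer range. Here is a minimal sketch, assuming all coordinates lie within [-max_abs, max_abs]; the function names are made up for this example and the rounding is done by hand to avoid depending on the math library.

```c
#include <assert.h>

/*
 * Maps a coordinate in [-max_abs, max_abs] to the range [-32767, 32767],
 * rounding to nearest and clamping values that fall outside the range.
 */
short quantize_to_short(float value, float max_abs)
{
    assert(max_abs > 0.0f);
    float scaled = value / max_abs * 32767.0f;
    if (scaled > 32767.0f)  scaled = 32767.0f;
    if (scaled < -32767.0f) scaled = -32767.0f;
    /* Round half away from zero; the cast truncates toward zero. */
    return (short)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* Factor that converts quantized values back to model units. */
float dequantize_scale(float max_abs)
{
    return max_abs / 32767.0f;
}
```

The factor from dequantize_scale can be folded into the modelview matrix (for example with glScalef) so that the GPU undoes the quantization at no extra per-vertex cost, much as one of the questions on this page does for texture coordinates in the texture matrix.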

Optimizing OpenGL ES application. Should I avoid calling glVertexPointer when possible?

I'm developing a game for iPhone in OpenGL ES 1.1; I have a lot of textured quads in a data structure where each node has a list of child nodes. So I traverse the structure from the root, rendering each quad, then its children, and so on.
The thing is, for each quad I'm calling glVertexPointer to set the vertices.
Should I avoid calling it for each quad? Would calling it just once, for example, improve performance?
Does glVertexPointer copy the vertices to GPU memory, or does it just save the pointer?
Trying to minimize the number of calls will not be easy since each node may have a different quad. I have a lot of equal sprites with the same vertex data, but I'm not necessarily rendering one after another since I may be drawing a different sprite between them.
Thanks.
glVertexPointer keeps just the pointer, but incurs a state change in the OpenGL driver and an explicit synchronisation, so costs quite a lot. Normally when you say 'here's my data, please draw', the GPU starts drawing and continues to do so in parallel to whatever is going on on the CPU for as long as it can. When you change rendering state, it needs to finish whatever it was doing in the old state. So by changing once per quad, you're effectively forcing what could be concurrent processing to be consecutive. Hence, avoiding glVertexPointer (and, presumably, a glDrawArrays or glDrawElements?) per quad should give you a significant benefit.
An immediate optimisation is simply to keep a count of the number of quads in total in the data structure, allocate a single target buffer for vertices that is at least that size and have all quads copy their geometry into the target buffer rather than calling glVertexPointer each time. Then call glVertexPointer and your drawing calls (condensed to just one call also, hopefully) with the one big array at the end. It's a bit more costly on the CPU side but the parallelism and lack of repeated GPU/CPU synchronisations should save you a lot.
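The single-target-buffer idea can be sketched like this. The `QuadBatch` type and its functions are hypothetical, the quads are assumed to be axis-aligned with 2D positions only, and any per-node transform would have to be applied on the CPU before the corners are written. Each quad is expanded into two triangles so that the whole buffer can go out in one GL_TRIANGLES draw.

```c
#include <assert.h>
#include <stddef.h>

#define FLOATS_PER_VERTEX 2
#define VERTICES_PER_QUAD 6  /* two triangles per quad */

typedef struct {
    float *vertices;        /* caller-allocated: capacity_quads * 12 floats */
    size_t capacity_quads;
    size_t quad_count;
} QuadBatch;

/* Appends one quad as two triangles. Returns 0 if the batch is full. */
int quad_batch_add(QuadBatch *batch, float x, float y, float w, float h)
{
    assert(batch != NULL && batch->vertices != NULL);
    if (batch->quad_count >= batch->capacity_quads)
        return 0;

    const float corners[VERTICES_PER_QUAD][FLOATS_PER_VERTEX] = {
        { x, y }, { x + w, y }, { x + w, y + h },   /* first triangle */
        { x, y }, { x + w, y + h }, { x, y + h },   /* second triangle */
    };
    float *out = batch->vertices
               + batch->quad_count * VERTICES_PER_QUAD * FLOATS_PER_VERTEX;
    for (size_t i = 0; i < VERTICES_PER_QUAD; ++i) {
        out[i * FLOATS_PER_VERTEX + 0] = corners[i][0];
        out[i * FLOATS_PER_VERTEX + 1] = corners[i][1];
    }
    batch->quad_count++;
    return 1;
}

/* Number of vertices to pass to the single draw call. */
size_t quad_batch_vertex_count(const QuadBatch *batch)
{
    return batch->quad_count * VERTICES_PER_QUAD;
}
```

After traversing the tree, the frame would end with one glVertexPointer(2, GL_FLOAT, 0, batch.vertices) and one glDrawArrays(GL_TRIANGLES, 0, count), then quad_count is reset to zero. This only works as a single draw if all quads in the batch share a texture, for instance through an atlas; texture coordinates would be batched the same way.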
While tiptoeing around topics currently under NDA, I strongly suggest you look at the Xcode 4 beta. Amongst other features Apple have stated publicly to be present is an OpenGL ES profiler. So you can easily compare approaches.
To copy data to the GPU, you need to use a vertex buffer object. That means creating a buffer with glGenBuffers, pushing data to it with glBufferData and then posting a glVertexPointer with an address of e.g. 0 if the first byte in the data you uploaded is the first byte of your vertices. In ES 1.x, you can upload data as GL_DYNAMIC_DRAW to flag that you intend to update it quite often and draw from it quite often. It's probably worth doing if you can get into a position where you're drawing more often than you're uploading.
If you ever switch to ES 2.x there's also GL_STREAM_DRAW, which may be worth investigating but isn't directly relevant to your question. I mention it as it'll likely come up if you Google for vertex buffer objects, being available on desktop OpenGL. Options for ES 1.x are only GL_STATIC_DRAW and GL_DYNAMIC_DRAW.
I've just recently worked on an iPad ES 1.x application with objects that change every frame but are drawn twice per the rendering pipeline in use. There are only five such objects on screen, each 40 vertices, but switching from the initial implementation to the VBO implementation cut 20% off my total processing time.

What does the Tiler Utilization statistic mean in the iPhone OpenGL ES instrument?

I have been trying to perform some OpenGL ES performance optimizations in an attempt to boost the number of triangles per second that I'm able to render in my iPhone application, but I've hit a brick wall. I've tried converting my OpenGL ES data types from fixed to floating point (per Apple's recommendation), interleaving my vertex buffer objects, and minimizing changes in drawing state, but none of these changes have made a difference in rendering speed. No matter what, I can't seem to push my application above 320,000 triangles / s on an iPhone 3G running the 3.0 OS. According to this benchmark, I should be able to hit 687,000 triangles/s on this hardware with the smooth shading I'm using.
In my testing, when I run the OpenGL ES performance tool in Instruments against the running device, I'm seeing the statistic "Tiler Utilization" reaching nearly 100% when rendering my benchmark, yet the "Renderer Utilization" is only getting to about 30%. This may be providing a clue as to what the bottleneck is in the display process, but I don't know what these values mean, and I've not found any documentation on them. Does someone have a good description of what this and the other statistics in the iPhone OpenGL ES instrument stand for? I know that the PowerVR MBX Lite in the iPhone 3G is a tile-based deferred renderer, but I'm not sure what the difference would be between the Renderer and Tiler in that architecture.
If it helps in any way, the (BSD-licensed) source code to this application is available if you want to download and test it yourself. In the current configuration, it starts a little benchmark every time you load a new molecular structure and outputs the triangles / s to the console.
The Tiler Utilization and Renderer Utilization percentages measure the duty cycle of the vertex and fragment processing hardware, respectively. On the MBX, Tiler Utilization typically scales with the amount of vertex data being sent to the GPU (in terms of both the number of vertices and the size of the attributes sent per-vertex), and Renderer Utilization generally increases with overdraw and texture sampling.
In your case, the best thing would be to reduce the size of each vertex you’re sending. For starters, I’d try binning your atoms and bonds by color, and sending each of these bins using a constant color instead of an array. I’d also suggest investigating if shorts are suitable for your positions and normals, given appropriate scaling. You might also have to bin by position in this case, if shorts scaled to provide sufficient precision aren’t covering the range you need. These sorts of techniques might require additional draw calls, but I suspect the improvement in vertex throughput will outweigh the extra per-draw call CPU overhead.
Note that it’s generally beneficial (on MBX and elsewhere) to ensure that each vertex attribute begins on a 32-bit boundary, which implies that you should pad your positions and normals out to 4 components if you switch them to shorts. The peculiarities of the MBX platform also make it such that you want to actually include the W component of the position in the call to glVertexPointer in this case.
You might also consider pursuing alternate lighting methods like DOT3 for your polygon data, particularly the spheres, but this requires special care to make sure that you aren’t making your rendering fragment-bound, or inadvertently sending more vertex data than before.
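The color binning suggested above amounts to a counting sort on a color index. The sketch below assumes each atom or bond has already been assigned one of at most MAX_COLORS palette entries; the palette size and the `bin_by_color` name are assumptions for this example, not something taken from the application's source.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_COLORS 16  /* hypothetical palette size */

/*
 * Groups items by color index using a counting sort.
 * order_out receives the item numbers sorted by color (count entries).
 * bin_start_out[c] .. bin_start_out[c + 1] is the range of order_out
 * that holds the items of color c.
 */
void bin_by_color(const unsigned char *color_index, size_t count,
                  size_t *order_out, size_t bin_start_out[MAX_COLORS + 1])
{
    size_t counts[MAX_COLORS] = { 0 };
    size_t cursor[MAX_COLORS];

    for (size_t i = 0; i < count; ++i) {
        assert(color_index[i] < MAX_COLORS);
        counts[color_index[i]]++;
    }

    /* Prefix sums give the start of each bin. */
    bin_start_out[0] = 0;
    for (size_t c = 0; c < MAX_COLORS; ++c) {
        bin_start_out[c + 1] = bin_start_out[c] + counts[c];
        cursor[c] = bin_start_out[c];
    }

    /* Place every item into the next free slot of its bin. */
    for (size_t i = 0; i < count; ++i)
        order_out[cursor[color_index[i]]++] = i;
}
```

The geometry would then be written to the vertex buffer in order_out order, and each non-empty bin drawn with a single constant color (glColor4f or similar) followed by one draw call over that bin's range, with the color array disabled.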
Great answer, @Pivot! For reference, this Apple doc defines these terms:
Renderer Utilization %. The percentage of time the GPU spent performing fragment processing.
Tiler Utilization %. The percentage of time the GPU spent performing vertex processing and tiling.
Device Utilization %. The percentage of time the GPU spent doing any tiling or rendering work.