I've slightly modified the iPhone SDK's GLSprite example while learning OpenGL ES, and it turns out to be quite slow. Even in the simulator (it's worse on the hardware), so I must be doing something wrong, since it's only 400 textured triangles.
const GLfloat spriteVertices[] = {
    0.0f,   0.0f,
    100.0f, 0.0f,
    0.0f,   100.0f,
    100.0f, 100.0f
};

const GLshort spriteTexcoords[] = {
    0, 0,
    1, 0,
    0, 1,
    1, 1
};
- (void)setupView {
    glViewport(0, 0, backingWidth, backingHeight);
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glOrthof(0.0f, backingWidth, backingHeight, 0.0f, -10.0f, 10.0f);
    glMatrixMode(GL_MODELVIEW);
    glClearColor(0.3f, 0.0f, 0.0f, 1.0f);

    glVertexPointer(2, GL_FLOAT, 0, spriteVertices);
    glEnableClientState(GL_VERTEX_ARRAY);
    glTexCoordPointer(2, GL_SHORT, 0, spriteTexcoords);
    glEnableClientState(GL_TEXTURE_COORD_ARRAY);

    // sprite data is preloaded. 512x512 rgba8888
    glGenTextures(1, &spriteTexture);
    glBindTexture(GL_TEXTURE_2D, spriteTexture);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, spriteData);
    free(spriteData);

    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glEnable(GL_TEXTURE_2D);

    glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA);
    glEnable(GL_BLEND);
}
- (void)drawView {
    ..
    glClear(GL_COLOR_BUFFER_BIT);
    glLoadIdentity();
    glTranslatef(tx - 100, ty - 100, 10);
    for (int i = 0; i < 200; i++) {
        glTranslatef(1, 1, 0);
        glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);
    }
    ..
}
drawView is called every time the screen is touched or a finger moves on the screen; tx and ty are set to the x, y coordinates where that touch happened.
I've also tried using a GL buffer object, where the translations were pre-generated and there was only one glDrawArrays call, but that gave the same performance (~4 FPS).
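For reference, a minimal sketch of what such a batched attempt can look like: bake the per-quad translation into one big vertex array and issue a single glDrawArrays with GL_TRIANGLES. All names here are illustrative, and the texture coordinates would need the same duplication (six per quad instead of four).

// Hypothetical batching sketch: bake the per-quad translation into the
// vertex data, then draw everything with one call.
#define NUM_QUADS 200
static GLfloat batchedVertices[NUM_QUADS * 6 * 2]; // 6 vertices (2 triangles) per quad

static void buildBatch(float tx, float ty) {
    for (int i = 0; i < NUM_QUADS; i++) {
        // Same walk as the loop above: each quad shifted by (1, 1) from the last.
        float x = tx - 100 + (i + 1);
        float y = ty - 100 + (i + 1);
        GLfloat *v = &batchedVertices[i * 12];
        v[0]  = x;       v[1]  = y;        // triangle 1
        v[2]  = x + 100; v[3]  = y;
        v[4]  = x;       v[5]  = y + 100;
        v[6]  = x + 100; v[7]  = y;        // triangle 2
        v[8]  = x;       v[9]  = y + 100;
        v[10] = x + 100; v[11] = y + 100;
    }
}

// In drawView:
//   buildBatch(tx, ty);
//   glVertexPointer(2, GL_FLOAT, 0, batchedVertices);
//   glDrawArrays(GL_TRIANGLES, 0, NUM_QUADS * 6);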
===EDIT===
Meanwhile I've modified this so that much smaller quads are used (sized 34x20) and there is much less overlap. There are ~400 quads, i.e. ~800 triangles, spread over the whole screen. The texture is a 512x512 RGBA_8888 atlas, and the texture coordinates are floats.
The code is very ugly in terms of API efficiency: per quad there are two glMatrixMode changes along with two matrix loads and two translations, then a glDrawArrays for a triangle strip (the quad).
Now this produces ~45 FPS.
(I know this is very late, but I couldn't resist. I'll post anyway, in case other people come here looking for advice.)
This has nothing to do with the texture size. I don't know why people rated up Nils. He seems to have a fundamental misunderstanding of the OpenGL pipeline. He seems to think that for a given triangle, the entire texture is loaded and mapped onto that triangle. The opposite is true.
Once the triangle has been mapped into the viewport, it is rasterized. For every on-screen pixel your triangle covers, the fragment shader is called. The default fragment shader (OpenGL ES 1.1, which you are using) will look up the texel that most closely maps (GL_NEAREST) to the pixel you are drawing. It might look up 4 texels, since you are using the higher-quality GL_LINEAR method to average the best texels. Still, if the pixel count in your triangle is, say, 100, then the most texture bytes you will have to read is 4 (lookups) * 100 (pixels) * 4 (bytes per color) = 1600 bytes. Far, far less than what Nils was saying. It's amazing that he can make it sound like he actually knows what he's talking about.
WRT the tiled architecture: this is common in embedded OpenGL devices to preserve locality of reference. I believe that each tile gets exposed to each drawing operation, quickly culling most of them. Then each tile decides what to draw on itself. This is going to be much slower when you have blending turned on, as you do. Because you are using large triangles that may overlap and blend with other tiles, the GPU has to do a lot of extra work. If, instead of rendering the example square with alpha edges, you were to render an actual shape (instead of a square picture of the shape), then you could turn off blending for that part of the scene, and I bet that would speed things up tremendously.
If you want to try it, just turn off blending and see how much things speed up, even if they don't look right: glDisable(GL_BLEND);
Your texture is 512*512 with 4 bytes per pixel. That's a megabyte of data. If you render it 200 times per frame, you generate a bandwidth load of 200 megabytes per frame.
With roughly 4 fps you consume 800 MB/second for texture reads alone. Frame- and Z-buffer writes need bandwidth as well. Then there is the CPU, and don't underestimate the bandwidth requirements of the display either.
RAM on embedded systems (e.g. your iPhone) is not as fast as on a desktop PC. What you see here is a bandwidth starvation effect. The RAM simply can't deliver the data any faster.
How to cure this problem:
Pick a sane texture size. On average you should have 1 texel per pixel; this gives crisp-looking textures. I know it's not always possible - use common sense.
Use mipmaps. They take 33% of extra space but allow the graphics chip to pick a lower-resolution mipmap where possible (see the sketch after this list).
Try smaller texture formats. Maybe you can use a 16-bit format such as RGBA4444; this would double the rendering speed. Also take a look at the compressed texture formats. Decompression does not cause a performance drop, as it's done in hardware. In fact the opposite is true: due to the smaller size in memory, the graphics chip can read the texture data faster.
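As a hedged illustration of the last two points (assuming an OpenGL ES 1.1 context, and that spriteData4444 is the sprite image already converted to packed 4/4/4/4 shorts):

// Ask the driver to generate mipmaps at upload time (set before glTexImage2D),
// and use a mipmapped minification filter.
glTexParameteri(GL_TEXTURE_2D, GL_GENERATE_MIPMAP, GL_TRUE);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);

// Upload as 16-bit RGBA4444 instead of RGBA8888 - half the texture bandwidth.
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 512, 512, 0,
             GL_RGBA, GL_UNSIGNED_SHORT_4_4_4_4, spriteData4444);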
I guess my first try was just a bad (or very good) test.
The iPhone has a PowerVR MBX Lite, which is a tile-based graphics processor. It subdivides the screen into smaller tiles and renders them in parallel. In the first case above, the subdivision might have gotten a bit exhausted because of the very heavy overlapping. Moreover, the quads couldn't be culled because they all sat at the same depth, so all the texture coordinates had to be calculated anyway. (This could easily be tested by changing the translation in the loop.)
Also, because of the overlapping, the parallelism couldn't be exploited: some tiles sat doing nothing while the rest (about a third) did all the work.
So I think that while memory bandwidth can be a bottleneck, it wasn't in this example. The problem lies more in how the graphics hardware works and in the setup of the test.
I'm not familiar with the iPhone, but if it doesn't have dedicated hardware for handling floating point numbers (I suspect it doesn't) then it'd be faster to use integers whenever possible.
I'm currently developing for Android (which uses OpenGL ES as well), and there, for instance, my vertex array is int instead of float. I can't say how much of a difference it makes, but I guess it's worth a try.
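In OpenGL ES 1.x the usual integer path is GL_FIXED (16.16 fixed point) or GL_SHORT rather than plain int. A minimal sketch of the original quad in fixed point:

// 16.16 fixed-point quad: each coordinate is an integer shifted left by 16.
const GLfixed fixedVertices[] = {
    0 << 16,   0 << 16,
    100 << 16, 0 << 16,
    0 << 16,   100 << 16,
    100 << 16, 100 << 16
};
glVertexPointer(2, GL_FIXED, 0, fixedVertices);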
Apple is very tight-lipped about the specific hardware specs of the iPhone, which seems very strange to those of us coming from a console background. But people have been able to determine that the CPU is a 32-bit RISC ARM1176JZF. The good news is that it has a full floating-point unit, so we can continue writing math and physics code the way we do on most platforms.
http://gamesfromwithin.com/?p=239
Is there a way to detect if the alpha of a pixel after drawing is not 0 when using OpenGLES on the iphone?
I would like to test multiple points to see if they are inside the area of a random polygon drawn by the user. If you know Flash, something equivalent to BitmapData::getPixel32 is what I'm looking for.
The framebuffer is kept by the GPU and is not immediately CPU accessible. I think the thing you'd most likely want from full OpenGL is the occlusion query; you can request geometry be drawn and be told how many pixels were actually plotted. Sadly that isn't available on the iPhone.
I think what you probably want is glReadPixels, which can be used to read a single pixel if you prefer, e.g. (written here, as I type, not tested)
GLubyte pixelValue[4];
glReadPixels(x, y, 1, 1, GL_RGBA, GL_UNSIGNED_BYTE, pixelValue);
NSLog(@"alpha was %d", pixelValue[3]);
Using glReadPixels causes a pipeline flush, so it is generally a bad idea from a GL performance point of view, but it'll do what you want. Unlike iOS, OpenGL uses graph-paper order for pixel coordinates, so (0, 0) is the lower-left corner.
I want to make a 2D tiled background system on the iPhone. Something that takes a tilemap and tileset image(s) and converts it into the full map on the screen.
Just doing some messing around, my first approach was to create a polygon for each tile. This worked fine until I started testing it with 400 polygons or so; at that point it started running very slowly. I'm just wondering - is this method of several polygons just not the way to go, or am I doing something wrong with it? I'll post code later if needed, but my main question is: "Would 400 small polygons run slowly on the iPhone, or am I just doing something wrong?"
I also considered another approach: during initialization, build the map texture in code out of the tilemap/tilesets, and then stick that on ONE large polygon. So yeah... any feedback on how I should go about something like this?
I know someone will mention this - I gave consideration to trying cocos2d, but I've got my reasons for not going that route.
Your problem is almost certainly that you're binding textures 400 times, and not anything else. You should have all your tiles in one big texture atlas / sprite sheet and instead of rebinding your textures you should just bind your atlas once and then draw small parts of it. If you do this, you should be able to draw thousands of tiles with no real slowdown.
You can draw your sprite like this:
//Push the matrix so we can keep it as it was previously.
glPushMatrix();

//Store the coordinates/dimensions from a rectangle.
float x = CGRectGetMinX(rect);
float y = CGRectGetMinY(rect);
float w = CGRectGetWidth(rect);
float h = CGRectGetHeight(rect);

float xOffset = x;
float yOffset = y;

if (rotation != 0.0f)
{
    //Translate the OpenGL context to the center of the sprite for rotation.
    glTranslatef(x + w/2, y + h/2, 0.0f);

    //Apply the rotation over the Z axis.
    glRotatef(rotation, 0.0f, 0.0f, 1.0f);

    //Have an offset for the top left corner.
    xOffset = -w/2;
    yOffset = -h/2;
}

// Set up an array of values to use as the sprite vertices.
GLfloat vertices[] =
{
    xOffset,     yOffset,
    xOffset,     yOffset + h,
    xOffset + w, yOffset + h,
    xOffset + w, yOffset,
};

// Set up an array of values for the texture coordinates.
// (Using CGRectGetMaxX/CGRectGetMaxY so the clipping rect also works
// when its origin is not (0, 0).)
GLfloat texcoords[] =
{
    CGRectGetMinX(clippingRect), CGRectGetMinY(clippingRect),
    CGRectGetMinX(clippingRect), CGRectGetMaxY(clippingRect),
    CGRectGetMaxX(clippingRect), CGRectGetMaxY(clippingRect),
    CGRectGetMaxX(clippingRect), CGRectGetMinY(clippingRect),
};

//If the image is flipped, flip the texture coordinates horizontally.
if (flipped)
{
    texcoords[0] = CGRectGetMaxX(clippingRect);
    texcoords[2] = CGRectGetMaxX(clippingRect);
    texcoords[4] = CGRectGetMinX(clippingRect);
    texcoords[6] = CGRectGetMinX(clippingRect);
}

//Render the vertices by pointing to the arrays.
glVertexPointer(2, GL_FLOAT, 0, vertices);
glTexCoordPointer(2, GL_FLOAT, 0, texcoords);

// Set the texture parameters to use a linear filter when minifying.
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);

//Allow transparency and blending.
glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);

//Enable 2D textures.
glEnable(GL_TEXTURE_2D);

//Bind this texture, unless it is already bound.
if ([Globals getLastTextureBound] != texture)
{
    glBindTexture(GL_TEXTURE_2D, texture);
}

//Finally draw the arrays.
glDrawArrays(GL_TRIANGLE_FAN, 0, 4);

//Restore the model view matrix to prevent contamination.
glPopMatrix();
The two CGRects I used are just for ease's sake. You specify the X, Y, width, and height at which to draw the image, and you specify which part of the texture to sample using the clippingRect. With the clipping rect, (0, 0, 1, 1) is the entire image, whereas (0, 0, 0.25, 0.25) would only draw the top-left corner. By changing the clipping rect you can put all sorts of different tiles in the same texture, and then you only need to bind once. Way cheaper.
Scott, the TexParameter setup only needs to be done once per texture. However, that is not the source of your slowdown.
You'll be much better off building up a list of indices and calling glDrawElements once for the entire set of tiles. The goal of vertex arrays is to let you draw as much as possible in one step.
glDrawTex should be avoided, because it forces you into the very inefficient one-at-a-time mindset.
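To make that concrete, here is a hedged sketch of the index-list approach; tileVertices, tileTexcoords, and the tile count are placeholders for your own batched arrays:

// Hypothetical index builder: 6 indices (2 triangles) per tile quad,
// assuming 4 vertices per tile packed into one big vertex array.
#define MAX_TILES 512
static GLushort tileIndices[MAX_TILES * 6];

static void buildIndices(int numTiles) {
    for (int i = 0; i < numTiles; i++) {
        GLushort base = (GLushort)(i * 4);
        GLushort *idx = &tileIndices[i * 6];
        idx[0] = base + 0; idx[1] = base + 1; idx[2] = base + 2;
        idx[3] = base + 0; idx[4] = base + 2; idx[5] = base + 3;
    }
}

// glVertexPointer(2, GL_FLOAT, 0, tileVertices);
// glTexCoordPointer(2, GL_FLOAT, 0, tileTexcoords);
// glDrawElements(GL_TRIANGLES, numTiles * 6, GL_UNSIGNED_SHORT, tileIndices);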
Using the glDrawTex extension may also be a possibility.
Stanford's iTunes U course has a podcast on optimizing OpenGL for iPhone.
But the basic ideas are these:
Batch geometry: combine the various vertex arrays into a single big vertex array. This reduces many gl*Pointer calls to a single call.
Texture atlases: use a single texture for all the different tiles, the only difference being the region used for each tile. Just bind the texture once for all tile drawing.
Interleaved arrays: combine the various attributes of a vertex (e.g. position, texture coordinates, color) into a single array (see the sketch after this list), keeping all the data for one vertex together in memory.
Indexed triangles, allowing you to reuse geometry information.
Use short instead of float where possible for geometry information, as it is smaller.
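As a hedged sketch of the interleaved-array point (MAX_SPRITES and the field layout are illustrative, not mandated by the API):

// One interleaved vertex: position, texture coordinates, and color together.
// A single buffer then feeds all three gl*Pointer calls via the stride.
typedef struct {
    GLshort x, y;        // position
    GLshort tx, ty;      // texture coordinates
    GLubyte r, g, b, a;  // color
} Vertex;

#define MAX_SPRITES 256  // hypothetical cap
static Vertex vertices[MAX_SPRITES * 4];

glVertexPointer(2, GL_SHORT, sizeof(Vertex), &vertices[0].x);
glTexCoordPointer(2, GL_SHORT, sizeof(Vertex), &vertices[0].tx);
glColorPointer(4, GL_UNSIGNED_BYTE, sizeof(Vertex), &vertices[0].r);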
Those are just general OpenGL optimization guidelines. As for the tile engine, well...
Do your own culling before sending the data to OpenGL - what you don't draw, you save (a sketch follows below).
I think that's what I can think of so far.
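For the culling point above, a minimal sketch, assuming an axis-aligned tile grid; cameraX/cameraY, TILE_W/TILE_H, SCREEN_W/SCREEN_H, and submitTile are hypothetical names:

// Compute the range of tiles that intersect the camera rectangle and
// submit only those. Clamp the range to the map bounds in real code.
int firstCol = (int)(cameraX / TILE_W);
int lastCol  = (int)((cameraX + SCREEN_W) / TILE_W);
int firstRow = (int)(cameraY / TILE_H);
int lastRow  = (int)((cameraY + SCREEN_H) / TILE_H);

for (int row = firstRow; row <= lastRow; row++)
    for (int col = firstCol; col <= lastCol; col++)
        submitTile(row, col); // placeholder for your batching code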
What I did to speed my app up: after I load my level, I create an atlas out of the tiles on the map. Then every frame I check whether the camera moved. If it did, I just issue one glTranslatef and move the entire map at once. If only dynamic objects move on the map, I just update those objects in the vertex array atlas. This system is very efficient, as I am able to draw tons of tiles with no framerate drop.
Client states should be enabled only at initialization, and the glTexParameteri calls should be made when creating the texture object.
glEnable calls are not cached, meaning the state is set even if it is already set to that value.
All these small things can add up and slow you down.
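One cheap way around that is to cache the state yourself; a minimal sketch for GL_BLEND only (the wrapper name and scope are illustrative):

// Skip redundant glEnable/glDisable calls by remembering the last value.
static GLboolean blendEnabled = GL_FALSE;

static void setBlendEnabled(GLboolean enable) {
    if (enable != blendEnabled) {
        if (enable) glEnable(GL_BLEND);
        else        glDisable(GL_BLEND);
        blendEnabled = enable;
    }
}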
In my spare time I like to play around with game development on the iPhone using OpenGL ES. I'm throwing together a small 2D side-scroller demo for fun; I'm relatively new to OpenGL, and I wanted to get some more experienced developers' input on this.
So here is my question: does it make sense to specify the vertices of each 2D element in model space, and then translate each element to its final view-space position each time a frame is drawn?
For example, say I have a set of blocks (squares) that make up the ground in my side-scroller. Each square is defined as:
const GLfloat squareVertices[] = {
    -1.0,  1.0, -6.0, // Top left
    -1.0, -1.0, -6.0, // Bottom left
     1.0, -1.0, -6.0, // Bottom right
     1.0,  1.0, -6.0  // Top right
};
Say I have 10 of these squares that I need to draw together as the ground for the next frame. Should I do something like this, for each square visible in the current scene?
glPushMatrix();
{
    glTranslatef(currentSquareX, currentSquareY, 0.0);
    glVertexPointer(3, GL_FLOAT, 0, squareVertices);
    glEnableClientState(GL_VERTEX_ARRAY);
    // Do the drawing
}
glPopMatrix();
It seems to me that doing this for every 2D element in the scene, for every frame, gets a bit intense and I would imagine the smarter people who use OpenGL much more than I do may have a better way of doing this.
That all being said, I'm expecting to hear that I should profile the code and see where any bottlenecks may be. To those people I say: I haven't written any of this code yet; I'm simply wrapping my mind around it so that when I do write it, things go smoother.
On the subject of profiling and optimization, I'm really not trying to prematurely optimize here, I'm just trying to wrap my mind around how one would set up a 2D scene and render it. Like I said, I'm relatively new to OpenGL and I'm just trying to get a feel for how things are done. If anyone has any suggestions on a better way to do this, I'd love to hear your thoughts.
Please keep in mind that I'm not interested in 3D, just 2D for now. Thanks!
You are concerned with the overhead it takes to transform a model (in this case a square) from model coordinates to world coordinates when you have a lot of models. This seems like an obvious optimization for static models.
If you build your square's vertices in world coordinates, then of course it is going to be faster, as each square avoids the cost of three extra calls (glPushMatrix, glPopMatrix, and glTranslatef): there is no need to translate from model to world coordinates at render time. I have no idea how much faster this will be; I suspect it won't be a humongous optimization, and you lose the modularity of keeping the squares in model coordinates. What if in the future you decide you want these squares to be moveable? That will be a lot harder if you keep their vertices in world coordinates. (A sketch of this baking approach follows the list below.)
In short, it's a tradeoff:

World coordinates
More memory - each square needs its own set of vertices.
Less computation - no need to perform glPushMatrix, glPopMatrix, or glTranslatef for each square at render time.
Less flexible - lacks support for (or complicates) dynamically moving these squares.

Model coordinates
Less memory - the squares can share the same vertex data.
More computation - each square must perform three extra calls at render time.
More flexible - squares can easily be moved by manipulating the glTranslatef call.
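To make the world-coordinates option concrete, a minimal sketch that bakes each square's translation into its vertices once at load time, reusing squareVertices from the question (bakeSquare is a hypothetical helper):

// Copy the model-space square into a world-space array, offset by the
// square's position. Done once at load time, not per frame.
static void bakeSquare(GLfloat *dst, float worldX, float worldY) {
    for (int i = 0; i < 4; i++) {
        dst[i * 3 + 0] = squareVertices[i * 3 + 0] + worldX;
        dst[i * 3 + 1] = squareVertices[i * 3 + 1] + worldY;
        dst[i * 3 + 2] = squareVertices[i * 3 + 2];
    }
}

// At render time: one glVertexPointer on the baked array and one draw call,
// with no per-square glPushMatrix/glTranslatef/glPopMatrix.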
I guess the only way to know what the right decision is, is by doing it and profiling. I know you said you haven't written this yet, but I suspect that whether your squares are in model or world coordinates won't make much of a difference - and if it does, I can't imagine an architecture in which it would be hard to switch from one to the other.
Good luck to you and your adventures in iPhone game development!
If you are only using screen-aligned quads, it might be easier to use the OES draw-texture extension. Then you can use a single texture to hold all your game "sprites". First specify the crop rectangle by setting the GL_TEXTURE_CROP_RECT_OES texture parameter. This is the boundary of the sprite within the larger texture. To render, call glDrawTexiOES, passing in the desired position and size in viewport coordinates.
int rect[4] = {0, 0, 16, 16};
glBindTexture(GL_TEXTURE_2D, sprites);
glTexParameteriv(GL_TEXTURE_2D, GL_TEXTURE_CROP_RECT_OES, rect);
glDrawTexiOES(x, y, z, width, height);
This extension isn't available on all devices, but it works great on the iPhone.
You might also consider using a static image and just scrolling that instead of drawing each individual block of the floor, and translating its position, etc.
Normally, you'd use something like:
glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
glEnable(GL_BLEND);
glEnable(GL_LINE_SMOOTH);
glLineWidth(2.0f);
glVertexPointer(2, GL_FLOAT, 0, points);
glEnableClientState(GL_VERTEX_ARRAY);
glDrawArrays(GL_LINE_STRIP, 0, num_points);
glDisableClientState(GL_VERTEX_ARRAY);
It looks good in the iPhone simulator, but on the iPhone the lines get extremely thin and have no anti-aliasing.
How do you get AA on iPhone?
One can achieve the effect of anti-aliasing very cheaply by using vertices with opacity 0.
(Example images - one showing the technique, and a comparison with AA - are omitted here.)
You can read a paper about this here:
http://research.microsoft.com/en-us/um/people/hoppe/overdraw.pdf
You could do it something like this:
// Colors is a pointer to unsigned bytes (4 per color).
// Should alternate in opacity.
glColorPointer(4, GL_UNSIGNED_BYTE, 0, colors);
glEnableClientState(GL_COLOR_ARRAY);
// points is a pointer to floats (2 per vertex)
glVertexPointer(2, GL_FLOAT, 0, points);
glEnableClientState(GL_VERTEX_ARRAY);
glDrawArrays(GL_TRIANGLE_STRIP, 0, points_count);
glDisableClientState(GL_VERTEX_ARRAY);
glDisableClientState(GL_COLOR_ARRAY);
Starting with iOS 4.0 there is an easy solution: it's now possible to use antialiasing for the whole OpenGL ES scene with just a few lines of added code (and nearly no performance loss, at least on the SGX GPU).
For the code please read the following Apple Dev-Forum Thread.
There are also some sample pictures on my blog showing how it looks for me.
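For orientation, a hedged sketch of the setup half (ES 2.0-style names, using the APPLE_framebuffer_multisample extension; width and height come from your CAEAGLLayer's backing size):

// Create a 4x multisampled framebuffer to render into each frame.
GLuint msaaFramebuffer, msaaColorbuffer;
glGenFramebuffers(1, &msaaFramebuffer);
glBindFramebuffer(GL_FRAMEBUFFER, msaaFramebuffer);

glGenRenderbuffers(1, &msaaColorbuffer);
glBindRenderbuffer(GL_RENDERBUFFER, msaaColorbuffer);
glRenderbufferStorageMultisampleAPPLE(GL_RENDERBUFFER, 4, GL_RGBA8_OES, width, height);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, msaaColorbuffer);

// Each frame: render into msaaFramebuffer, then resolve into the view
// framebuffer with glResolveMultisampleFramebufferAPPLE() and present
// (see the resolve-and-present code further down).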
Using http://answers.oreilly.com/topic/1669-how-to-render-anti-aliased-lines-with-textures-in-ios-4/ as a starting point, I was able to get anti-aliased lines like these:
They aren't perfect, nor are they as nice as the ones I had been drawing with Core Graphics, but they are pretty good. I am actually drawing the same lines (vertices) twice - once with a bigger texture and the line color, then with a smaller texture and translucent white.
There are artifacts when lines overlap too tightly and alphas start to accumulate.
One approach around this limitation is tessellating your lines into textured triangle strips (as seen here).
The problem is that on the iPhone OpenGL renders to a framebuffer object rather than the main framebuffer, and as I understand it, FBOs don't support multisampling.
There are various tricks that can be done, such as rendering to another FBO at twice the display size and then relying on texture filtering to smooth things out. It's not something I've tried, though, so I can't comment on how well it works.
I remember very specifically that I tried this and there is no simple way to do this using OpenGL on the iPhone. You can draw using CGPaths and a CGContextRef, but that will be significantly slower.
Put this in your render method (with the MSAA framebuffer created in your framebuffer-setup code) and you will get an anti-aliased appearance:
/*added*/
//[_context presentRenderbuffer:GL_RENDERBUFFER];
//Bind both MSAA and View FrameBuffers.
glBindFramebuffer(GL_READ_FRAMEBUFFER_APPLE, msaaFramebuffer);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER_APPLE, framebuffer );
// Call a resolve to combine both buffers
glResolveMultisampleFramebufferAPPLE();
// Present final image to screen
glBindRenderbuffer(GL_RENDERBUFFER, _colorRenderBuffer);
[_context presentRenderbuffer:GL_RENDERBUFFER];
/*added*/
I'm testing my simple OpenGL ES implementation (a 2D game) on the iPhone and I notice a high render utilization while using the profiler. These are the facts:
I'm displaying only one preloaded large texture (512x512 pixels) at 60fps and the render utilization is around 40%.
My texture is blended using GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA, the only GL function I'm using.
I've tried to make the texture smaller and tiling it, which made no difference.
I'm using a PNG texture atlas of 1024x1024 pixels
I find it very strange that this one texture is causing such an intense GPU usage.
Is this to be expected? What am I doing wrong?
EDIT: My code:
// OpenGL setup is identical to OpenGL ES template
// initState is called to setup
// timer is initialized, drawView is called by the timer
- (void) initState
{
    //usual init declarations have been omitted here
    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA);

    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(2, GL_FLOAT, sizeof(Vertex), &allVertices[0].x);
    glEnableClientState(GL_TEXTURE_COORD_ARRAY);
    glTexCoordPointer(2, GL_FLOAT, sizeof(Vertex), &allVertices[0].tx);
    glEnableClientState(GL_COLOR_ARRAY);
    glColorPointer(4, GL_UNSIGNED_BYTE, sizeof(Vertex), &allVertices[0].r);
}

- (void) drawView
{
    [EAGLContext setCurrentContext:context];
    glBindFramebufferOES(GL_FRAMEBUFFER_OES, viewFramebuffer);

    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    GLfloat width  = backingWidth  / 2.f;
    GLfloat height = backingHeight / 2.f;
    glOrthof(-width, width, -height, height, -1.f, 1.f);
    glMatrixMode(GL_MODELVIEW);

    glClearColor(0.f, 0.f, 0.f, 1.f);
    glClear(GL_COLOR_BUFFER_BIT);

    glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);

    glBindRenderbufferOES(GL_RENDERBUFFER_OES, viewRenderbuffer);
    [context presentRenderbuffer:GL_RENDERBUFFER_OES];

    [self checkGLError];
}
EDIT: I've made a couple of improvements, but none managed to lower the render utilization. I've divided the texture into parts of 32x32, changed the type of the coordinates and texture coordinates from GLfloat to GLshort, and added extra vertices for degenerate triangles.
The updates are:
initState:
(the vertex and texture coordinate pointers now use GL_SHORT)
glMatrixMode(GL_TEXTURE);
glScalef(1.f / 1024.f, 1.f / 1024.f, 1.f / 1024.f);
glMatrixMode(GL_MODELVIEW);
glScalef(1.f / 16.f, 1.f/ 16.f, 1.f/ 16.f);
drawView:
glDrawArrays(GL_TRIANGLE_STRIP, 0, 1536); //(16*16 parts * 6 vertices)
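For anyone wondering what the degenerate-triangle stitching looks like, here is a hedged sketch of the same idea expressed with indices instead of duplicated vertex data:

// Stitch per-quad strips into one long GL_TRIANGLE_STRIP by repeating the
// last index of quad i-1 and the first index of quad i; the repeats produce
// zero-area triangles that the GPU rejects cheaply.
GLushort indices[16 * 16 * 6];
int n = 0;
for (int i = 0; i < 16 * 16; i++) {
    GLushort base = (GLushort)(i * 4); // 4 corners per part, in strip order
    if (i > 0) {
        indices[n++] = base - 1; // repeat previous quad's last vertex
        indices[n++] = base;     // repeat this quad's first vertex
    }
    indices[n++] = base + 0;
    indices[n++] = base + 1;
    indices[n++] = base + 2;
    indices[n++] = base + 3;
}
// glDrawElements(GL_TRIANGLE_STRIP, n, GL_UNSIGNED_SHORT, indices);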
I'm writing an app which displays five 512x512 textures on top of each other in a 2D environment using GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA, and I can get about 14 fps. Do you really need 60 fps? For a game, I'd think 24-30 would be fine. Also, use PVR texture compression if at all possible; there's an example that does it included with the SDK.
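Loading a PVRTC texture comes down to a single glCompressedTexImage2D call via the IMG_texture_compression_pvrtc extension; a minimal sketch, assuming pvrtcData already holds the compressed payload:

// 4 bpp PVRTC of a 512x512 image: 512 * 512 / 2 = 131072 bytes.
GLsizei size = 512 * 512 / 2;
glCompressedTexImage2D(GL_TEXTURE_2D, 0,
                       GL_COMPRESSED_RGBA_PVRTC_4BPPV1_IMG,
                       512, 512, 0, size, pvrtcData);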
I hope you didn't forget to disable GL_BLEND when you don't need it.
You can attempt some memory bandwidth optimization - use 16 bpp formats or PVRTC. IMHO, with your texture size, the texture cache doesn't help at all.
Don't forget that your framebuffer is used as a texture by the iPhone UI. If it is created as 32-bit RGBA, it will be alpha-blended one more time. For optimal performance, 16-bit 565 framebuffers are best (though graphics quality suffers).
I don't know all the details, such as cache size, but I suppose texture pixels are already swizzled when uploaded into video memory and triangles are split by the PVR tile engine. Therefore your own splitting appears to be redundant.
And finally: this is only a low-power mobile GPU, not designed for huge screens and high fillrates. Alpha-blending is costly, maybe a 3-4x difference on PowerVR chips.
Read this post.
512x512 is probably a little over optimistic for the iPhone to deal with.
EDIT:
I assume you have already read this, but if not, check Apple's guide to optimal OpenGL ES performance on iPhone.
What exactly is the problem?
You're getting your 60fps, which is silky smooth.
Who cares if render utilization is 40%?
The issue could be because of the iPhone's texture cache size. It may simply come down to how much of the texture is on each individual triangle, quad, or tristrip, depending on how you're setting state.
Try this: subdivide your quad and repeat your tests. So if you have 1 quad, make it 4, then 16, and so on, and see if that helps. The key is to reduce the actual number of pixels that each primitive references.
When the texture cache gets blown, then the hardware will thrash texture lookups from main memory into whatever vram is set aside for the texture buffer for each pixel. This can kill performance mighty quick.
OR - I am completely wrong because I really don't know the iPhone hardware, and I also know that the PVR chip is a strange beast in comparison to what I'm used to (PS2, PSP). Still it's an easy test to try and I'm curious if it helps.