I'm having some performance issues calling CGContextDrawLayerAtPoint in iOS 4 that didn't seem to exist in previous versions of the OS.
I'm copying a layer obtained from a bitmap context created with CGBitmapContextCreate to my view's context during a drawRect call. The view and the bitmap are the same size.
The bitmap was created with:
CGBitmapContextCreate(NULL, width, height, 8, width * 4, genericRGBSpace, kCGBitmapByteOrder32Host | kCGImageAlphaNoneSkipFirst);
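For reference, here's a minimal sketch of the flow (identifiers simplified; cachedLayer stands in for a CGLayerRef ivar created earlier with CGLayerCreateWithContext from the bitmap context above):
- (void)drawRect:(CGRect)rect
{
    CGContextRef ctx = UIGraphicsGetCurrentContext();
    // Blit the prerendered layer; the layer, bitmap, and view are all the same size.
    CGContextDrawLayerAtPoint(ctx, CGPointZero, cachedLayer);
}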
Instruments is indicating that I'm spending more time in CGContextDrawLayerAtPoint than I was on devices running OS 3.2. In fact, the following stack trace accounts for a higher percentage of time:
argb32_sample_argb32
argb32_image_mark
argb32_image
ripl_Mark
ripl_BltImage
RIPLayerBltImage
ripc_RenderImage
ripc_DrawLayer
CGContextDelegateDrawLayer
CGContextDrawLayerAtPoint
whereas the same code running under 3.2 shows
argb32_image_mark_rgb32
argb32_image
ripl_Mark
ripl_BltImage
ripc_RenderImage
ripc_DrawLayer
CGContextDelegateDrawLayer
CGContextDrawLayerAtPoint
with a much lower percentage of time. I'm not sure exactly why argb32_sample_argb32 is being called on iOS 4 or what it's doing - resampling the existing data into a new buffer? I don't see why that would be necessary: the view and the bitmap are the same size, and no scaling has been performed.
Any insight into this would be appreciated.
Well, no one had any insight into this, so I'll recount our theories and what we ended up doing. The theory is that since iOS 4 now has to deal with different screen resolutions, it's taking a more "generalized" approach to blitting and is a bit more inefficient than it was.
In the end, it doesn't really matter - it's slower and we have to deal with it. We discovered that there were places where we were copying redundantly, so we were able to reduce the amount of copying we had to do. This bought us the speed we needed.
I have a CATextLayer with a size of 3000 x 3000 containing big text.
The text is, say, "Hello".
I add this CATextLayer to my superlayer.
I have set shouldRasterize to false.
When I move the superlayer with a translation, I observe huge memory usage until the app crashes.
Why does this take so much memory? How can I avoid that?
I assume a bitmap is being stored in memory? But why?
My sample is an extreme case, which is not really my production app, so please don't ask why I'm doing this. It's only an extreme case for trying to understand what is going on.
The reason it is consuming so much memory is obvious: its dimensions are huge. To quote the documentation:
In iOS 2.x, the maximum size of a UIView object is 1024 x 1024 points.
In iOS 3.0 and later, views are no longer restricted to this maximum
size but are still limited by the amount of memory they consume. It is
in your best interests to keep view sizes as small as possible.
Regardless of which version of iOS is running, you should consider
tiling any content that is significantly larger than the dimensions of
the screen.
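To put numbers on it (assuming the usual 32-bit RGBA backing store; the exact format is an implementation detail):
// Rough backing-store cost of rasterizing a 3000 x 3000 layer:
size_t bytes = 3000 * 3000 * 4;      // 4 bytes per pixel, ~34 MB at 1x
size_t retinaBytes = bytes * 2 * 2;  // ~137 MB at a 2x contentsScale
Moving the superlayer forces the text layer's contents to be rendered to a bitmap of that size, which is why the documentation recommends tiling (e.g. CATiledLayer) for content much larger than the screen.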
I'm developing a game for iPhone in OpenGL ES 1.1; I have a lot of textured quads in a data structure where each node has a list of child nodes. So I traverse the structure from the root and render each quad, then its children, and so on.
The thing is, for each quad I'm calling glVertexPointer to set the vertices.
Should I avoid calling it for each quad? Would calling it just once, for example, improve performance?
Does glVertexPointer copy the vertices to GPU memory, or does it just save the pointer?
Minimizing the number of calls will not be easy, since each node may have a different quad. I have a lot of identical sprites with the same vertex data, but I'm not necessarily rendering them one after another, since I may draw a different sprite in between.
Thanks.
glVertexPointer keeps just the pointer, but it incurs a state change in the OpenGL driver and an explicit synchronisation, so it costs quite a lot. Normally when you say 'here's my data, please draw', the GPU starts drawing and continues in parallel with whatever is going on on the CPU for as long as it can. When you change rendering state, it needs to finish whatever it was doing in the old state. So by changing state once per quad, you're effectively forcing what could be concurrent processing to be consecutive. Hence, avoiding a glVertexPointer (and, presumably, a glDrawArrays or glDrawElements?) per quad should give you a significant benefit.
An immediate optimisation is simply to keep a count of the number of quads in total in the data structure, allocate a single target buffer for vertices that is at least that size and have all quads copy their geometry into the target buffer rather than calling glVertexPointer each time. Then call glVertexPointer and your drawing calls (condensed to just one call also, hopefully) with the one big array at the end. It's a bit more costly on the CPU side but the parallelism and lack of repeated GPU/CPU synchronisations should save you a lot.
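A minimal sketch of that idea, assuming a hypothetical Quad type holding four corners; instead of one glVertexPointer/glDrawArrays pair per quad, every quad appends two triangles to one CPU-side array that is submitted once:
#include <OpenGLES/ES1/gl.h>

#define MAX_QUADS 1024

typedef struct { GLfloat x, y; } Vertex2;
typedef struct { Vertex2 corner[4]; } Quad;  /* hypothetical quad type */

static Vertex2 batch[MAX_QUADS * 6];         /* 6 vertices = 2 triangles per quad */
static int batchCount = 0;

static void appendQuad(const Quad *q)
{
    /* two triangles: 0-1-2 and 0-2-3 */
    batch[batchCount++] = q->corner[0];
    batch[batchCount++] = q->corner[1];
    batch[batchCount++] = q->corner[2];
    batch[batchCount++] = q->corner[0];
    batch[batchCount++] = q->corner[2];
    batch[batchCount++] = q->corner[3];
}

static void flushBatch(void)
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(2, GL_FLOAT, 0, batch);    /* one pointer set per frame... */
    glDrawArrays(GL_TRIANGLES, 0, batchCount); /* ...and one draw call */
    batchCount = 0;
}
Call appendQuad during your traversal and flushBatch once at the end.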
While tiptoeing around topics currently under NDA, I strongly suggest you look at the Xcode 4 beta. Amongst the features Apple has stated publicly to be present is an OpenGL ES profiler, so you can easily compare approaches.
To copy data to the GPU, you need to use a vertex buffer object. That means creating a buffer with glGenBuffers, pushing data to it with glBufferData, and then calling glVertexPointer with a byte offset into the buffer rather than an address - e.g. 0 if the first byte you uploaded is the first byte of your vertices. In ES 1.x, you can upload data as GL_DYNAMIC_DRAW to flag that you intend to update it quite often and draw from it quite often. It's probably worth doing if you can get into a position where you're drawing more often than you're uploading.
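A minimal sketch of that VBO path in ES 1.1 ("vertices" stands in for whatever vertex data you batch up on the CPU):
#include <OpenGLES/ES1/gl.h>

GLfloat vertices[] = { 0, 0,  1, 0,  0, 1 };   /* one triangle, 2D */
GLuint vbo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices, GL_DYNAMIC_DRAW);

/* With a buffer bound, the last parameter of glVertexPointer is an
 * offset into the VBO rather than a CPU pointer; 0 = start of buffer. */
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(2, GL_FLOAT, 0, (const GLvoid *)0);
glDrawArrays(GL_TRIANGLES, 0, 3);
glBindBuffer(GL_ARRAY_BUFFER, 0);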
If you ever switch to ES 2.x there's also GL_STREAM_DRAW, which may be worth investigating but isn't directly relevant to your question. I mention it as it'll likely come up if you Google for vertex buffer objects, being available on desktop OpenGL. Options for ES 1.x are only GL_STATIC_DRAW and GL_DYNAMIC_DRAW.
I've just recently worked on an iPad ES 1.x application with objects that change every frame but are drawn twice per frame by the rendering pipeline in use. There are only five such objects on screen, each of 40 vertices, but switching from the initial implementation to the VBO implementation cut 20% off my total processing time.
I have created a whole heap of overlays using MKPolygon, each displayed via an MKPolygonView. This works fine, but one of the overlays has a huge number of points (about 800), which causes memory and performance issues. I tried shouldRasterize on the MKPolygonView, but that had the opposite effect, which didn't surprise me.
Is there anything else I can do to improve its performance besides lowering the number of points (which I am already in the process of doing)?
This is an issue that is known to Apple but unlikely to change. Basically, with anything more than a couple of MKOverlayViews you will have performance issues no matter what your hardware is. What you basically have to do is subclass MKPolygonView and merge all the MKPolygons into one MKPolygonView.
Code is available on the Apple Developer Forums, but as I didn't write it I don't think I should post it here.
I would look at reducing the number of points in the polygon, depending on where you got it from. Most geospatial manipulation libraries have functions that allow you to reduce the number of points in a polygon (all you need to do is supply an accuracy measurement) - a Douglas-Peucker pass like the sketch below, for example.
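If you don't have a library handy, a Douglas-Peucker pass is easy to sketch; this is illustrative C, not production code, and epsilon is the accuracy measurement mentioned above (in the same units as your coordinates):
#include <math.h>

typedef struct { double x, y; } Pt;

/* perpendicular distance from p to the line through a and b */
static double perpDistance(Pt p, Pt a, Pt b)
{
    double dx = b.x - a.x, dy = b.y - a.y;
    double len = sqrt(dx * dx + dy * dy);
    if (len == 0) return hypot(p.x - a.x, p.y - a.y);
    return fabs(dx * (a.y - p.y) - (a.x - p.x) * dy) / len;
}

/* Douglas-Peucker: initialize keep[] to all 1s, then call with the full
 * index range (0, count - 1); points whose keep[] entry is still 1 form
 * the simplified polygon. */
static void simplify(const Pt *pts, int lo, int hi, double epsilon, char *keep)
{
    double maxDist = 0;
    int maxIdx = -1;
    for (int i = lo + 1; i < hi; i++) {
        double d = perpDistance(pts[i], pts[lo], pts[hi]);
        if (d > maxDist) { maxDist = d; maxIdx = i; }
    }
    if (maxIdx >= 0 && maxDist > epsilon) {
        simplify(pts, lo, maxIdx, epsilon, keep);   /* keep the farthest point */
        simplify(pts, maxIdx, hi, epsilon, keep);
    } else {
        for (int i = lo + 1; i < hi; i++) keep[i] = 0;  /* drop the interior */
    }
}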
I'm working on an iPhone App that relies heavily on OpenGL. Right now it runs a bit slow on the iPhone 3G, but looks snappy on the new 32G iPod Touch. I assume this is hardware related. Anyway, I want to get the iPhone performance to resemble the iPod Touch performance. I believe I'm doing a lot of things sub-optimally in OpenGL and I'd like advice on what improvements will give me the most bang for the buck.
My scene rendering goes something like this:
Repeat 35 times
    glPushMatrix
    glLoadIdentity
    glTranslate
    Repeat 7 times
        glBindTexture
        glVertexPointer
        glNormalPointer
        glTexCoordPointer
        glDrawArrays(GL_TRIANGLES, ...)
    glPopMatrix
My Vertex, Normal and Texture Coords are already interleaved.
So, what steps should I take to speed this up? What step would you try first?
My first thought is to eliminate all those glBindTexture() calls by using a Texture Atlas.
What about some more efficient matrix operations? I understand the gl*() versions aren't too efficient.
What about VBOs?
Update
There are 8260 triangles.
Texture sizes are 64x64 PNGs. There are 58 different textures.
I have not run Instruments.
Update 2
After running the OpenGL ES Instrument on the iPhone 3G I found that my Tiler Utilization is in the 90-100% range, and my Render Utilization is in the 30% range.
Update 3
Texture atlasing had no noticeable effect on the problem. Utilization ranges are still as noted above.
Update 4
Converting my Vertex and Normal pointers to GL_SHORT seemed to improve FPS, but the Tiler Utilization is still in the 90% range a lot of the time. I'm still using GL_FLOAT for my texture coordinates. I suppose I could knock those down to GL_SHORT and save four more bytes per vertex.
Update 5
Converting my texture coordinates to GL_SHORT yielded another performance increase. I'm now consistently getting >30 FPS. Tiler Utilization is still around 90%, but it frequently drops into the 70-80% range. The Renderer Utilization is hovering around 50%. I suppose this might have something to do with scaling the texture coordinates via the GL_TEXTURE matrix mode.
I'm still seeking additional improvements. I'd like to get closer to 40 FPS, as that's what my iPod Touch gets and it's silky smooth there. If anyone is still paying attention, what other low-hanging fruit can I pick?
With a tiler utilization still above 90%, you’re likely still vertex throughput-bound. Your renderer utilization is higher because the GPU is rendering more frames. If your primary focus is improving performance on older devices, then the key is still to cut down on the amount of vertex data needed per triangle. There are two sides to this:
Reducing the amount of data per vertex: Now that all of your vertex attributes are already GL_SHORTs, the next thing to pursue is finding a way to do what you want using fewer attributes or components. For example, if you can live without specular highlights, using DOT3 lighting instead of OpenGL ES fixed-function lighting would replace your 3 shorts (+ 1 short of padding) for normals with 2 shorts for an extra texture coordinate. As an additional bonus, you’d be able to light your models per-pixel.
Reducing the number of vertices needed per triangle: When drawing with indexed triangles, you should make sure that your indices are sorted for maximum reuse. Running your geometry through Imagination Technologies’ PVRTTriStrip tool would probably be your best bet here.
If you only have 58 different 64x64 textures, a texture atlas seems like a good idea, since they'd all fit in a single 512x512 texture... if you don't rely on texture wrap modes, I'd certainly at least try this.
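To sketch the arithmetic for that layout (col and row are hypothetical tile positions within the atlas):
const GLfloat cell = 64.0f / 512.0f;   /* each 64x64 tile covers 1/8 of a 512x512 atlas */
int col = 3, row = 2;                  /* hypothetical tile position */
GLfloat u0 = col * cell, v0 = row * cell;
GLfloat texCoords[8] = {
    u0,        v0,
    u0 + cell, v0,
    u0,        v0 + cell,
    u0 + cell, v0 + cell,              /* triangle-strip order */
};
/* One glBindTexture for the atlas then replaces 58 per-texture binds. */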
What format are your textures in? You might try using a compressed PVRTC texture; I think that's less load on the Tiler, and I've been pleasantly surprised by the image quality even for 2-bit-per-pixel textures. (Good for natural images, not good if you're doing something that looks like an 8-bit video game)
The first thing I would do is run Instruments profiling on the hardware device that is slow. It should show you pretty quickly where the bottlenecks are for your particular case.
Update after instruments results:
This question got a similar result in Instruments to yours; perhaps the advice there is also applicable in your case (basically, reducing the amount of vertex data).
The biggest win in graphics programming comes down to this:
Batch, Batch, Batch
Texture atlasing will make a bigger difference than almost anything else you can do. Switching textures is like stopping a speeding train to let on new passengers every time.
Combine all those textures into an atlas and cut your draw calls down a lot.
This web-based tool may be helpful: http://zwoptex.zwopple.com/
Have you looked over the "OpenGL ES Programming Guide for iPhone OS" in the dev center? There are sections on Best Practices for Vertex Data and Texture Data.
Is your data formatted to be able to use triangle strips?
In terms of least effort, the modification sequence for you would probably be:
Reducing vertex attribute size
VBOs
Note that when you do these, you need to make sure that components are aligned on their native alignment, i.e. floats and full ints on 4-byte boundaries, shorts on 2-byte boundaries. If you don't do this, it will tank your performance. It can be helpful to write out your attribute ordering as a struct definition so you can sanity-check your layout and alignment - see the sketch after this list.
making sure your data is stripped to share vertices
using a texture atlas to reduce texture swaps
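Here's a minimal sketch of that struct-definition trick for the all-GL_SHORT layout discussed above (the field split is illustrative; match it to your actual attributes). Each field starts on its natural 2-byte boundary, and the padding keeps the total stride a multiple of 4:
#include <stddef.h>   /* offsetof */

typedef struct {
    GLshort position[3];  /* 6 bytes */
    GLshort pad0;         /* 2 bytes of padding */
    GLshort normal[3];    /* 6 bytes */
    GLshort pad1;         /* 2 bytes of padding */
    GLshort texCoord[2];  /* 4 bytes */
} Vertex;                 /* sizeof(Vertex) == 20, a multiple of 4 */

/* With a VBO bound, the "pointer" arguments are byte offsets: */
glVertexPointer(3, GL_SHORT, sizeof(Vertex), (const GLvoid *)offsetof(Vertex, position));
glNormalPointer(GL_SHORT, sizeof(Vertex), (const GLvoid *)offsetof(Vertex, normal));
glTexCoordPointer(2, GL_SHORT, sizeof(Vertex), (const GLvoid *)offsetof(Vertex, texCoord));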
To try converting your textures to 16-bit RGB565 format, see the code in Apple's venerable Texture2D.m; search for kTexture2DPixelFormat_RGB565:
http://code.google.com/p/cocos2d-iphone/source/browse/branches/branch-0.1/OpenGLSupport/Texture2D.m
(this code loads PNGs and converts them to RGB565 at texture creation time; I don't know if there's an RGB565 file format as such)
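For the curious, the repacking that code does boils down to keeping the top 5/6/5 bits of each channel, which halves the memory and bandwidth per texel; a sketch, assuming tightly packed RGBA8888 input:
#include <stdint.h>
#include <stddef.h>

static void ConvertToRGB565(const uint8_t *rgba, uint16_t *out, size_t pixelCount)
{
    for (size_t i = 0; i < pixelCount; i++) {
        uint8_t r = rgba[i * 4 + 0];
        uint8_t g = rgba[i * 4 + 1];
        uint8_t b = rgba[i * 4 + 2];   /* the alpha byte is dropped */
        out[i] = (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
    }
}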
For more information on PVRTC compressed textures (which looked way better than I expected when I used them, even at 2 bits per pixel) see Apple's PVRTextureLoader sample:
http://developer.apple.com/iPhone/library/samplecode/PVRTextureLoader/index.html
It has both the code for loading PVRTC textures in your app and instructions for using texturetool to convert your .png files into .pvr files.
Resizing a camera UIImage returned by the UIImagePickerController takes a ridiculously long time if you do it the usual way, as in this post.
[update: last call for creative ideas here! my next option is to go ask Apple, I guess.]
Yes, it's a lot of pixels, but the graphics hardware on the iPhone is perfectly capable of drawing lots of 1024x1024 textured quads onto the screen in 1/60th of a second, so there really should be a way of resizing a 2048x1536 image down to 640x480 in a lot less than 1.5 seconds.
So why is it so slow? Is the underlying image data the OS returns from the picker somehow not ready to be drawn, so that it has to be swizzled in some fashion that the GPU can't help with?
My best guess is that it needs to be converted from RGBA to ABGR or something like that; can anybody think of a way that it might be possible to convince the system to give me the data quickly, even if it's in the wrong format, and I'll deal with it myself later?
As far as I know, the iPhone doesn't have any dedicated "graphics" memory, so there shouldn't be a question of moving the image data from one place to another.
So, the question: is there some alternative drawing method besides just using CGBitmapContextCreate and CGContextDrawImage that takes more advantage of the GPU?
Something to investigate: if I start with a UIImage of the same size that's not from the image picker, is it just as slow? Apparently not...
Update: Matt Long found that it only takes 30ms to resize the image you get back from the picker in [info objectForKey:@"UIImagePickerControllerEditedImage"], if you've enabled cropping with the manual camera controls. That isn't helpful for the case I care about, where I'm using takePicture to take pictures programmatically. I see that the edited image is kCGImageAlphaPremultipliedFirst but the original image is kCGImageAlphaNoneSkipFirst.
Further update: Jason Crawford suggested CGContextSetInterpolationQuality(context, kCGInterpolationLow), which does in fact cut the time from about 1.5 sec to 1.3 sec, at a cost in image quality--but that's still far from the speed the GPU should be capable of!
Last update before the week runs out: user refulgentis did some profiling which seems to indicate that the 1.5 seconds is spent writing the captured camera image out to disk as a JPEG and then reading it back in. If true, very bizarre.
It seems that you have made several assumptions here that may or may not be true. My experience is different from yours. This method seems to take only 20-30ms on my 3GS when scaling a photo snapped from the camera to 0.31 of the original size with a call to:
CGImageRef scaled = CreateScaledCGImageFromCGImage([image CGImage], 0.31);
(I get 0.31 by taking the width scale, 640.0/2048.0, by the way)
I've checked to make sure the image is the same size you're working with. Here's my NSLog output:
2009-12-07 16:32:12.941 ImagePickerThing[8709:207] Info: {
UIImagePickerControllerCropRect = NSRect: {{0, 0}, {2048, 1536}};
UIImagePickerControllerEditedImage = <UIImage: 0x16c1e0>;
UIImagePickerControllerMediaType = "public.image";
UIImagePickerControllerOriginalImage = <UIImage: 0x184ca0>;
}
I'm not sure why there's a difference, and I can't answer your question as it relates to the GPU; however, I would consider 1.5 seconds and 30ms a very significant difference. Maybe compare the code in that blog post to what you are using?
Best Regards.
Use Shark, profile it, figure out what's taking so long.
I have to work a lot with MediaPlayer.framework, and when you get properties for songs on the iPod, the first property request is insanely slow compared to subsequent requests, because in the first property request MobileMediaPlayer packages up a dictionary with all the properties and passes it to my app.
I'd be willing to bet that there is a similar situation occurring here.
EDIT: I was able to do a time profile in Shark of both Matt Long's UIImagePickerControllerEditedImage situation and the generic UIImagePickerControllerOriginalImage situation.
In both cases, a majority of the time is taken up by CGContextDrawImage. In Matt Long's case, the UIImagePickerController takes care of this in between the user capturing the image and the image entering 'edit' mode.
If we normalize so that the time taken by CGContextDrawImage = 100%, then CGContextDelegateDrawImage takes 100%, then ripc_DrawImage (from libRIP.A.dylib) takes 100%, and then ripc_AcquireImage (which, by the looks of it, decompresses the JPEG, spending most of its time in _cg_jpeg_idct_islow, vec_ycc_bgrx_convert, decompress_onepass, and sep_upsample) takes 93% of the time. Only 7% of the time is actually spent in ripc_RenderImage, which I assume is the actual drawing.
I have had the same problem and banged my head against it for a long time. As far as I can tell, the first time you access the UIImage returned by the image picker, it's just slow. As an experiment, try timing any two operations with the UIImage--e.g., your scale-down, and then UIImageJPEGRepresentation or something. Then switch the order. When I've done this in the past, the first operation gets a time penalty. My best hypothesis is that the memory is still on the CCD somehow, and transferring it into main memory to do anything with it is slow.
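A sketch of that experiment (ScaleDownImage is a hypothetical stand-in for your own scale-down routine); run it once in each order and compare the timings - if the first operation eats a one-time penalty, swapping the order should move the penalty with it:
// CFAbsoluteTimeGetCurrent is in CoreFoundation.
CFAbsoluteTime t0 = CFAbsoluteTimeGetCurrent();
UIImage *scaled = ScaleDownImage(image);   // hypothetical helper
CFAbsoluteTime t1 = CFAbsoluteTimeGetCurrent();
NSData *jpeg = UIImageJPEGRepresentation(image, 0.8);
CFAbsoluteTime t2 = CFAbsoluteTimeGetCurrent();
NSLog(@"scale: %.0f ms, JPEG: %.0f ms", (t1 - t0) * 1000.0, (t2 - t1) * 1000.0);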
When you set allowsImageEditing=YES, the image you get back is resized and cropped down to about 320x320. That makes it faster, but it's probably not what you want.
The best speedup I've found is:
CGContextSetInterpolationQuality(context, kCGInterpolationLow)
on the context you get back from CGBitmapContextCreate, before you do CGContextDrawImage.
The problem is that your scaled-down images might not look as good. However, if you're scaling down by an integer factor--e.g., 1600x1200 to 800x600--then it looks OK.
Here's a git project that I've used and it seems to work well. The usage is pretty clean as well - one line of code.
https://github.com/AliSoftware/UIImage-Resize
DO NOT USE CGBitmapContextCreate in this case! I spent almost a week in the same situation you are in. Performance will be absolutely terrible and it will eat up memory like crazy. Use UIGraphicsBeginImageContext instead:
// Create a new image context of the desired size.
UIGraphicsBeginImageContext(desiredImageSize);
CGContextRef c = UIGraphicsGetCurrentContext();
// Clear the new context.
CGContextClearRect(c, CGRectMake(0, 0, desiredImageSize.width, desiredImageSize.height));
// Draw the source image scaled into "rect", the destination rectangle.
// (Note: CGContextDrawImage uses Core Graphics coordinates, so the result
// may come out vertically flipped unless you flip the CTM first.)
CGContextDrawImage(c, rect, [image CGImage]);
// Capture the result as a UIImage and clean up.
UIImage *result = UIGraphicsGetImageFromCurrentImageContext();
UIGraphicsEndImageContext();
In the example above (from my own image-resize code), "rect" is significantly smaller than the image. This code runs very fast and should do exactly what you need.
I'm not entirely sure why UIGraphicsBeginImageContext is so much faster, but I believe it has something to do with memory allocation. I've noticed that this approach requires significantly less memory, implying that the OS has already allocated space for an image context somewhere.