Multithreading with Metal

Multithreading with Metal - swift

I'm new to Apple's Metal API and graphics programming in general. I'm slowly building a game engine of sorts, starting with UI. My UI is based on nodes each with their own list of child nodes. So if I have a 'menu' node with three 'buttons' as children, calling render(:MTLDrawable:CommandQueue) the menu will render itself to the drawable by committing a command buffer to the queue, and then call the same method for all of its children with the same drawable and queue, until the entire node tree has been rendered from top to bottom. I want a separate subthread to be spawned for he rendering of every node in the tree--can I just wrap each render function in a dispatch-async call? Is the command queue inherently thread-safe? What is the accepted solution for concurrently rendering multiple objects to a single texture before presenting it using Metal? All I've seen in any Metal tutorial so far is a single thread that renders everything in order using a single command buffer per frame, calling presentDrawable() and then commit() at the end of each frame.
Edit
When I say I want to use multithreading, it applies only to command encoding, not execution itself. I don't want to end up with the buttons in my theoretical menu being drawn and then covered up with the menu background as a result of bad execution order. I just want each object's render operation to be encoded on a separate thread, before being handed to the command queue.

Using a separate command buffer and render pass for each UI element is extreme overkill, even if you want to use CPU-side concurrency to do your rendering. I would contend that you should start out by writing the simplest thing that could possibly work, then optimize from there. You can set a lot of state and issue a lot of draw calls before the CPU becomes your bottleneck, which is why people start with a simple, single-threaded approach.
Dispatching work to threads isn't free. It introduces overhead, and that overhead will likely dominate the work you're doing to issue the commands for drawing any given element, especially once you factor in the bandwidth required to repeatedly load and store your render targets.
Once you've determined you're CPU-bound (probably once you're issuing thousands of draw calls per frame), you can look at splitting the encoding up across threads with an MTLParallelRenderCommandEncoder, or multipass solutions. But well before you reach that point, you should probably introduce some kind of batching system that removes the responsibility of issuing draw calls from your UI elements, because although that seems tidy from an OOP perspective, it's likely to be a large architectural misstep if you care about performance at scale.
For one example, you could take a look at this Metal implementation of a backend renderer for the popular dear imgui project to see how to architect a system that does draw call batching in the context of UI rendering.

Related

In the dev tools timeline, what are the empty green rectangles?

In the Chrome dev tools timeline, what is the difference between the filled, green rectangles (which represent paint operations) and the empty, green rectangles (which apparently also represent something about paint operations...)?

Painting is really two tasks: draw calls and rasterization.
Draw calls. This is a list of things you'd like to draw, and its derived from the CSS applied to your elements. Ultimately there is a list of draw calls not dissimilar to the Canvas element's: moveTo, lineTo, fillRect (though they have slightly different names in Skia, Chrome's painting backend, it's a similar concept.)
Rasterization. The process of stepping through those draw calls and filling out actual pixels into buffers that can be uploaded to the GPU for compositing.
So, with that background, here we go:
The solid green blocks are the draw calls being recorded by Chrome. These are done on the main thread alongside JavaScript, style calculations, and layout. These draw calls are grouped together as a data structure and passed to the compositor thread.
The empty green blocks are the rasterization. These are handled by a worker thread spawned by the compositor.
Essentially, then, both are paint, they just represent different sub-tasks of the overall job. If you're having performance issues (and from the grab you provided you appear to be paint bound), then you may need to examine what properties you're changing via CSS or JavaScript and see if there is a compositor-only way to achieve the same ends. CSS Triggers can probably help here.

to drawRect or not to drawRect (when should one use drawRect/Core Graphics vs subviews/images and why?)

To clarify the purpose of this question: I know HOW to create complicated views with both subviews and using drawRect. I'm trying to fully understand the when's and why's to use one over the other.
I also understand that it doesn't make sense to optimize that much ahead of time, and do something the more difficult way before doing any profiling. Consider that I'm comfortable with both methods, and now really want a deeper understanding.
A lot of my confusion comes from learning how to make table view scroll performance really smooth and fast. Of course the original source of this method is from the author behind twitter for iPhone (formerly tweetie). Basically it says that to make table scrolling buttery smooth, the secret is to NOT use subviews, but instead do all the drawing in one custom uiview. Essentially it seems that using lots of subviews slows rendering down because they have lots of overhead, and are constantly re-composited over their parent views.
To be fair, this was written when the 3GS was pretty brand spankin new, and iDevices have gotten much faster since then. Still this method is regularly suggested on the interwebs and elsewhere for high performance tables. In fact it's a suggested method in Apple's Table Sample Code, has been suggested in several WWDC videos (Practical Drawing for iOS Developers), and many iOS programming books.
There are even awesome looking tools to design graphics and generate Core Graphics code for them.
So at first I'm lead to believe "there’s a reason why Core Graphics exists. It’s FAST!"
But as soon as I think I get the idea "Favor Core Graphics when possible", I start seeing that drawRect is often responsible for poor responsiveness in an app, is extremely expensive memory wise, and really taxes the CPU. Basically, that I should "Avoid overriding drawRect" (WWDC 2012 iOS App Performance: Graphics and Animations)
So I guess, like everything, it's complicated. Maybe you can help myself and others understand the When's and Why's for using drawRect?
I see a couple obvious situations to use Core Graphics:
You have dynamic data (Apple's Stock Chart example)
You have a flexible UI element that can't be executed with a simple resizable image
You are creating a dynamic graphic, that once rendered is used in multiple places
I see situations to avoid Core Graphics:
Properties of your view need to be animated separately
You have a relatively small view hierarchy, so any perceived extra effort using CG isn't worth the gain
You want to update pieces of the view without redrawing the whole thing
The layout of your subviews needs to update when the parent view size changes
So bestow your knowledge. In what situations do you reach for drawRect/Core Graphics (that could also be accomplished with subviews)? What factors lead you to that decision? How/Why is drawing in one custom view recommended for buttery smooth table cell scrolling, yet Apple advises drawRect against for performance reasons in general? What about simple background images (when do you create them with CG vs using a resizable png image)?
A deep understanding of this subject may not be needed to make worthwhile apps, but I don't love choosing between techniques without being able to explain why. My brain gets mad at me.
Question Update
Thanks for the information everyone. Some clarifying questions here:
If you are drawing something with core graphics, but can accomplish the same thing with UIImageViews and a pre-rendered png, should you always go that route?
A similar question: Especially with badass tools like this, when should you consider drawing interface elements in core graphics? (Probably when the display of your element is variable. e.g. a button with 20 different color variations. Any other cases?)
Given my understanding in my answer below, could the same performance gains for a table cell possibly be gained by effectively capturing a snapshot bitmap of your cell after your complex UIView render's itself, and displaying that while scrolling and hiding your complex view? Obviously some pieces would have to be worked out. Just an interesting thought I had.

Stick to UIKit and subviews whenever you can. You can be more productive, and take advantage of all the OO mechanisms that should things easier to maintain. Use Core Graphics when you can't get the performance you need out of UIKit, or you know trying to hack together drawing effects in UIKit would be more complicated.
The general workflow should be to build the tableviews with subviews. Use Instruments to measure the frame rate on the oldest hardware your app will support. If you can't get 60fps, drop down to CoreGraphics. When you've done this for a while, you get a sense for when UIKit is probably a waste of time.
So, why is Core Graphics fast?
CoreGraphics isn't really fast. If it's being used all the time, you're probably going slow. It's a rich drawing API, which requires its work be done on the CPU, as opposed to a lot of UIKit work that is offloaded to the GPU. If you had to animate a ball moving across the screen, it would be a terrible idea to call setNeedsDisplay on a view 60 times per second. So, if you have sub-components of your view that need to be individually animated, each component should be a separate layer.
The other problem is that when you don't do custom drawing with drawRect, UIKit can optimize stock views so drawRect is a no-op, or it can take shortcuts with compositing. When you override drawRect, UIKit has to take the slow path because it has no idea what you're doing.
These two problems can be outweighed by benefits in the case of table view cells. After drawRect is called when a view first appears on screen, the contents are cached, and the scrolling is a simple translation performed by the GPU. Because you're dealing with a single view, rather than a complex hierarchy, UIKit's drawRect optimizations become less important. So the bottleneck becomes how much you can optimize your Core Graphics drawing.
Whenever you can, use UIKit. Do the simplest implementation that works. Profile. When there's an incentive, optimize.

The difference is that UIView and CALayer essentially deal in fixed images. These images are uploaded to the graphics card (if you know OpenGL, think of an image as a texture, and a UIView/CALayer as a polygon showing such a texture). Once an image is on the GPU, it can be drawn very quickly, and even several times, and (with a slight performance penalty) even with varying levels of alpha transparency on top of other images.
CoreGraphics/Quartz is an API for generating images. It takes a pixel buffer (again, think OpenGL texture) and changes individual pixels inside it. This all happens in RAM and on the CPU, and only once Quartz is done, does the image get "flushed" back to the GPU. This round-trip of getting an image from the GPU, changing it, then uploading the whole image (or at least a comparatively large chunk of it) back to the GPU is rather slow. Also, the actual drawing that Quartz does, while really fast for what you are doing, is way slower than what the GPU does.
That's obvious, considering the GPU is mostly moving around unchanged pixels in big chunks. Quartz does random-access of pixels and shares the CPU with networking, audio etc. Also, if you have several elements that you draw using Quartz at the same time, you have to re-draw all of them when one changes, then upload the whole chunk, while if you change one image and then let UIViews or CALayers paste it onto your other images, you can get away with uploading much smaller amounts of data to the GPU.
When you don't implement -drawRect:, most views can just be optimized away. They don't contain any pixels, so can't draw anything. Other views, like UIImageView, only draw a UIImage (which, again, is essentially a reference to a texture, which has probably already been loaded onto the GPU). So if you draw the same UIImage 5 times using a UIImageView, it is only uploaded to the GPU once, and then drawn to the display in 5 different locations, saving us time and CPU.
When you implement -drawRect:, this causes a new image to be created. You then draw into that on the CPU using Quartz. If you draw a UIImage in your drawRect, it likely downloads the image from the GPU, copies it into the image you're drawing to, and once you're done, uploads this second copy of the image back to the graphics card. So you're using twice the GPU memory on the device.
So the fastest way to draw is usually to keep static content separated from changing content (in separate UIViews/UIView subclasses/CALayers). Load static content as a UIImage and draw it using a UIImageView and put content generated dynamically at runtime in a drawRect. If you have content that gets drawn repeatedly, but by itself doesn't change (I.e. 3 icons that get shown in the same slot to indicate some status) use UIImageView as well.
One caveat: There is such a thing as having too many UIViews. Particularly transparent areas take a bigger toll on the GPU to draw, because they need to be mixed with other pixels behind them when displayed. This is why you can mark a UIView as "opaque", to indicate to the GPU that it can just obliterate everything behind that image.
If you have content that is generated dynamically at runtime but stays the same for the duration of the application's lifetime (e.g. a label containing the user name) it may actually make sense to just draw the whole thing once using Quartz, with the text, the button border etc., as part of the background. But that's usually an optimization that's not needed unless the Instruments app tells you differently.

I'm going to try and keep a summary of what I'm extrapolating from other's answers here, and ask clarifying questions in an update to the original question. But I encourage others to keep answers coming and vote up those who have provided good information.
General Approach
It's quite clear that the general approach, as Ben Sandofsky mentioned in his answer, should be "Whenever you can, use UIKit. Do the simplest implementation that works. Profile. When there's an incentive, optimize."
The Why
There are two main possible bottlenecks in an iDevice, the CPU and GPU
CPU is responsible for the initial drawing/rendering of a view
GPU is responsible for a majority of animation (Core Animation), layer effects, compositing, etc.
UIView has a lot of optimizations, caching, etc, built in for handling complex view hierarchies
When overriding drawRect you miss out on a lot of the benefits UIView's provide, and it's generally slower than letting UIView handle the rendering.
Drawing cells contents in one flat UIView can greatly improve your FPS on scrolling tables.
Like I said above, CPU and GPU are two possible bottlenecks. Since they generally handle different things, you have to pay attention to which bottleneck you are running up against. In the case of scrolling tables, it's not that Core Graphics is drawing faster, and that's why it can greatly improve your FPS.
In fact, Core Graphics may very well be slower than a nested UIView hierarchy for the initial render. However, it seems the typical reason for choppy scrolling is you are bottlenecking the GPU, so you need to address that.
Why overriding drawRect (using core graphics) can help table scrolling:
From what I understand, the GPU is not responsible for the initial rendering of the views, but is instead handed textures, or bitmaps, sometimes with some layer properties, after they have been rendered. It is then responsible for compositing the bitmaps, rendering all those layer affects, and the majority of animation (Core Animation).
In the case of table view cells, the GPU can be bottlenecked with complex view hierarchies, because instead of animating one bitmap, it is animating the parent view, and doing subview layout calculations, rendering layer effects, and compositing all the subviews. So instead of animating one bitmap, it is responsible for the relationship of bunch of bitmaps, and how they interact, for the same pixel area.
So in summary, the reason drawing your cell in one view with core graphics can speed up your table scrolling is NOT because it's drawing faster, but because it is reducing the load on the GPU, which is the bottleneck giving you trouble in that particular scenario.

I am a game developer, and I was asking the same questions when my friend told me that my UIImageView based view hierarchy was going to slow down my game and make it terrible. I then proceeded to research everything I could find about whether to use UIViews, CoreGraphics, OpenGL or something 3rd party like Cocos2D. The consistent answer I got from friends, teachers, and Apple engineers at WWDC was that there won't be much of a difference in the end because at some level they are all doing the same thing. Higher-level options like UIViews rely on the lower level options like CoreGraphics and OpenGL, just they are wrapped in code to make it easier for you to use.
Don't use CoreGraphics if you are just going to end up re-writing the UIView. However, you can gain some speed from using CoreGraphics, as long as you do all your drawing in one view, but is it really worth it? The answer I have found is usually no. When I first started my game, I was working with the iPhone 3G. As my game grew in complexity, I began to see some lag, but with the newer devices it was completely unnoticeable. Now I have plenty of action going on, and the only lag seems to be a drop in 1-3 fps when playing in the most complex level on an iPhone 4.
Still I decided to use Instruments to find the functions that were taking up the most time. I found that the problems were not related to my use of UIViews. Instead, it was repeatedly calling CGRectMake for certain collision sensing calculations and loading image and audio files separately for certain classes that use the same images, rather than having them draw from one central storage class.
So in the end, you might be able to achieve a slight gain from using CoreGraphics, but usually it will not be worth it or may not have any effect at all. The only time I use CoreGraphics is when drawing geometric shapes rather than text and images.

Is OpenGL threadsafe for multiple threads with distinct contexts?

I know that sharing a single context between threads is bad news. I know that I can safely create and use a context with an offscreen framebuffer on a secondary thread when nothing is happening with GL on the main thread.
I haven't yet been able to find a definitive answer to the question of whether I can safely create two contexts on two different threads (say, a main thread drawing to the screen, and a secondary thread doing offscreen drawing work) and have them both making GL function calls simultaneously.
In other words, as long as the contexts are different, can two threads "share" the C API and thus the GPU? Or is that inherently something that is unshareable? Or is this implementation-specific?
Asking specifically for OpenGL ES on iOS, but it's probably a general GL question.

Yes, you need to use one context for each thread you want to use OpenGL with, also you can share objects between the contexts. This is the way to go :)

Option 1: If you don't use the context by two threads simultaneously, one context is enough.
Option 2: If you need to use OpenGL by several threads simultaneously, you need more than one context. Then, if the contexts share their Sharegroup, they share their OpenGL content like textures. This way you can load textures or do heavy framebuffer processing on a background thread.
Have a look at the last section about Sharegroups here: http://developer.apple.com/library/ios/#documentation/3DDrawing/Conceptual/OpenGLES_ProgrammingGuide/WorkingwithOpenGLESContexts/WorkingwithOpenGLESContexts.html
Option 3: GLKit provides some built-in background processing, for example asynchronous texture loading via GLKTextureLoaders - textureWithContentsOfFile. I don't know all the options, but it definitely simplifies some of uses cases of asynchronous OpenGL.

Draw calls take WAY longer when targeting an offscreen renderbuffer (iPhone GL ES)

I'm using OpenGL ES 1.1 to render a large view in an iPhone app. I have a "screenshot"/"save" function, which basically creates a new GL context offscreen, and then takes exactly the same geometry and renders it to the offscreen context. This produces the expected result.
Yet for reasons I don't understand, the amount of time (measured with CFAbsoluteTimeGetCurrent before and after) that the actual draw calls take when sending to the offscreen buffer is more than an order of magnitude longer than when drawing to the main framebuffer that backs an actual UIView. All of the GL state is the same for both, and the geometry list is the same, and the sequence of calls to draw is the same.
Note that there happens to be a LOT of geometry here-- the order of magnitude is clearly measurable and repeatable. Also note that I'm not timing the glReadPixels call, which is the thing that I believe actually pulls data back from the GPU. This is just a mesaure of the time spent in e.g. glDrawArrays.
I've tried:
Render that geometry to the screen again just after doing the offscreen render: takes the same quick time for the screen draw.
Render the offscreen thing twice in a row-- both times show the same slow draw speed.
Is this an inherent limitation of offscreen buffers? Or might I be missing something fundamental here?
Thanks for your insight/explanation!

Your best bet is probably to sample both your offscreen rendering and window system rendering each running in a tight loop with the CPU Sampler in Instruments and compare the results to see what differences there are.
Also, could you be a bit more clear about what exactly you mean by “render the offscreen thing twice in a row?” You mentioned at the beginning of the question that you “create a new GL context offscreen”—do you mean a new framebuffer and renderbuffer, or a completely new EAGLContext? Depending on how many new resources and objects you’re creating in order to do your offscreen rendering, the driver may need to do a lot of work to set up these resources the first time you use them in a draw call. If you’re just screenshotting the same content you were putting onscreen, you shouldn’t even need to do any of this—it should be sufficient to call glReadPixels before -[EAGLContext presentRenderbuffer:], since the backbuffer contents will still be defined at that point.

Could offscreen rendering be forcing the GPU to flush all its normal state, then do your render, flush the offscreen context, and have to reload all the normal stuff back in from CPU memory? That could take a lot longer than any rendering using data and frame buffers that stays completely on the GPU.

I'm not an expert on the issue but from what I understand graphics accelerators are used for sending data off to the screen so normally the path is Code ---vertices---> Accelerator ---rendered-image---> Screen. In your case you are flushing the framebuffer back into main memory which might be hitting some kind of bottleneck in bandwidth in the memory controller or something or other.

How does glClear() improve performance?

Apple's Technical Q&A on addressing flickering (QA1650) includes the following paragraph. (Emphasis mine.)
You must provide a color to every pixel on the screen. At the beginning of your drawing code, it is a good idea to use glClear() to initialize the color buffer. A full-screen clear of each of your color, depth, and stencil buffers (if you're using them) at the start of a frame can also generally improve your application's performance.
On other platforms, I've always found it to be an optimization to not clear the color buffer if you're going to draw to every pixel. (Why waste time filling the color buffer if you're just going to overwrite that clear color?)
How can a call to glClear() improve performance?

It is most likely related to tile-based rendering, which divides the whole viewport to tiles (smaller windows, can be typically of size 32x32) and these tiles are kept in faster memories. The copy operations between this smaller memory and the real framebuffer can take some time (memory operations are a lot slower than arithmetic operations). By issuing a glClear command, you are telling the hardware that you do not need previous buffer content, thus it does not need to copy the color/depth/whatever from the framebuffer to the smaller tile memory.

With the long perspective of official comments on iOS 4, which postdates the accepted answer...
I think that should be read in conjunction with Apple's comments about the GL_EXT_discard_framebuffer extension, which should always be used at the end of a frame if possible (and indeed elsewhere). When you discard a frame buffer you put its contents into an undefined state. The benefit of that is that when you next bind some other frame buffer, there's never any need to store the current contents of your buffer out to somewhere, and similarly when you next restore your buffer there's no need to retrieve them. Those should all be GPU memory copies and so quite cheap, but they're far from free and the iPhone's shared memory architecture presumably means that even more complicated considerations can crop up.
Based on the compositing model of iOS it's reasonable to assume that that even if your app doesn't bind and unbind frame buffers in its context, the GPU has to do those tasks implicitly at least once for each of your frames.
I would dare guess that the driver is smart enough that if the first thing you do is a clear, you get half the benefit of the discard extension without actually using it.

Try decreasing the size of your framebuffer in the glSurface attributes that you pass to ChooseConfig.
For example, set attributes to 0 for minimum or omit them completely to use defaults unless you have a specific requirement.

I can definitely confirm that you have to provide a color to every pixel on the screen.
I've verified this in a bare bones test app built on XCode's iPhone OpenGL template:
[context presentRenderbuffer:GL_RENDERBUFFER];
glClear( GL_COLOR_BUFFER_BIT );
If I leave out the glClear line (or move it further down in the loop, after some other OpenGL calls), the thread (running via CADisplayLink) is hardly getting any updates anymore. It seems as if CPU/GPU synchronization goes haywire and the thread gets blocked.
Pretty scary stuff if you ask me, and totally not in line with my expectations.
BTW, you don't neccessarily have to use glClear(), just drawing a fullscreen quad seems to have the same effect (obviously, a textured quad is more expensive). It seems you just have to invalidate all the tiles.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse