I just started working with OpenAL and everything seems fine so far: I've tried some tutorials and managed to load and play some sounds. But before I start implementing something more complex, I'd like to make sure I understand how OpenAL works.
Basically, my goal is to make a simple system where I can give commands such as "play this and this, stop playing that" without having to care about anything else. Let's assume the program has to work with 150 sounds, which total 250 MB when decompressed to PCM, all of which is available from the start.
In OpenAL there are sources and buffers. I understand that I'm supposed to have a pool of sources and reuse them. What I don't understand, and haven't been able to find anywhere, is what the buffers actually represent. Are they a limited resource, or just regular storage, perhaps converted to a format that can be played easily but still held in regular memory?
In the situation I described, should I:
A) create 150 buffers at the beginning, fill them with the full 250 MB of data (some of which might be hour-long sounds), keep them for the whole duration of the program, and play them when needed, or
B) keep the 250 MB loaded in my own memory, load each sound into buffers in small chunks right before playback, and release them right after?
If A, what is the purpose of streaming data into small buffers and then queueing them?
If B, how many buffers, and how much data in them, is safe? Should I also make a pool of buffers and reuse them?
Also, I have an additional question about buffer queueing. I understand that I can either set a source's buffer (as a property) or queue buffers on it (doing both would be an error, right?). I was surprised, though, that apparently I also have to unqueue them. Why doesn't this happen automatically as soon as a buffer has been played? And do I have to unqueue them in the same order they were queued?
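For reference, this is the streaming pattern I've seen in tutorials, and the part I'm asking about (a rough sketch only; it assumes a few buffers were already filled, queued, and the source started playing, and the chunk size, format, and sample rate are placeholders I made up):

```c
#include <AL/al.h>

#define CHUNK_SIZE (64 * 1024)   /* arbitrary chunk size in bytes */

/* Refill step: reclaim finished buffers, fill them with the next
   chunk of PCM, and queue them again on the same source. */
void stream_update(ALuint source, const char *pcm, size_t pcm_size, size_t *offset)
{
    ALint processed = 0;
    alGetSourcei(source, AL_BUFFERS_PROCESSED, &processed);

    while (processed-- > 0 && *offset < pcm_size) {
        ALuint buf;
        size_t n = pcm_size - *offset;
        if (n > CHUNK_SIZE) n = CHUNK_SIZE;

        /* Unqueue one played buffer so it can be refilled and reused. */
        alSourceUnqueueBuffers(source, 1, &buf);
        alBufferData(buf, AL_FORMAT_STEREO16, pcm + *offset, (ALsizei)n, 44100);
        alSourceQueueBuffers(source, 1, &buf);
        *offset += n;
    }
}
```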
Thank you for helping me understand this.
Is there a way to create a custom DispatchQueue quality of service with its own custom "speed"? For example, I want a QoS that's twice as slow as .utility.
Ideas on how to solve it
Somehow telling the CPU/GPU that we want to run the task only every X operation cycles? Not sure if that's directly possible on iOS.
Introducing a wait after every line of code. This is a really bad hack that produces messy code, and it doesn't solve the issue if a single line of code runs for several seconds.
In SpriteKit/SceneKit it's possible to slow down time. Is there a way to use that to slow down an arbitrary piece of code?
Blocking the thread every X seconds so that it slows down. Not sure if that's possible without sacrificing app speed.
There is no mechanism in iOS or any other Cocoa platform to control the "speed" (for any meaning of that word) of a work item. The only tool offered to us is some control over scheduling. Once your work item is scheduled, it will get 100% (*) of the CPU core until it ends or is preempted. There is no way to ask to be preempted more often (and it would be expensive to allow that, since context switches are expensive).
The way to manage how much work is done is to directly manage the work, not preemption. The best way is to split the work into small pieces, schedule those pieces over time, and combine the results at the end. If your algorithm doesn't support that kind of input segmentation, then the algorithm's main "loop" needs to limit the number of iterations it performs (or the amount of time it spends iterating) and return at that point, to be scheduled again later.
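As a sketch of that idea using the C-level libdispatch calls (the chunk count, the 50 ms pause, and do_one_chunk() are placeholders you'd tune to the real work):

```c
#include <dispatch/dispatch.h>

typedef struct {
    int next_chunk;
    int total_chunks;
} job_t;

static void run_chunk(void *ctx);

/* Schedule the next chunk after a small pause instead of hogging the core. */
static void schedule_next(job_t *job)
{
    dispatch_queue_t q = dispatch_get_global_queue(QOS_CLASS_UTILITY, 0);
    dispatch_time_t when = dispatch_time(DISPATCH_TIME_NOW, 50 * NSEC_PER_MSEC);
    dispatch_after_f(when, q, job, run_chunk);
}

static void run_chunk(void *ctx)
{
    job_t *job = ctx;

    /* do_one_chunk(job->next_chunk);  -- one bounded slice of the real work */
    job->next_chunk++;

    if (job->next_chunk < job->total_chunks)
        schedule_next(job);   /* come back later for the rest */
    /* else: combine the partial results and report completion */
}
```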
If you don't control the algorithm code, and you cannot work with whoever does, and you cannot slice your data into smaller pieces, this may be an unsolvable problem.
(*) With the rise of "performance" cores and other such CPU advances, this isn't completely true, but for this question it's close enough.
Technically, you cannot alter the speed of a QoS such as .background or .utility, or of any other QoS.
The way to handle this is to choose the right QoS based on the task you want to perform.
The higher the QoS, the more resources the OS will devote to the work; a lower QoS gets correspondingly fewer.
Do we have any benchmarks for a small number of descriptors, say 1 to 50 or so? Most benchmarks I see are for large numbers of descriptors, in the hundreds or thousands.
I am currently using poll with 16 descriptors and am thinking of switching to epoll if that will improve the speed of the app.
Please advise on 3 scenarios, each with 16 socket descriptors in the poll/epoll set:
1. Most of the sockets are active: should both give the same performance?
2. Half active, half idle: which is better here?
3. Mostly idle: is epoll clearly better?
I would very much suspect that switching from poll() to epoll() will not make any difference in the performance of your application. The main advantage of epoll() crops up when you have many file descriptors (hundreds or thousands) where a standard poll() requires a little more work to be done on every call, whereas epoll() does the setup in advance - as long as you don't change the set of file descriptors you're watching, each call is very slightly quicker. But generally this difference is only noticeable for many, many file descriptors.
Bear in mind that if the set of file descriptors you're watching changes very frequently, epoll()'s main advantage is lost because you still need to do the work of passing new file descriptors into the kernel. So, if you're handling lots of short-lived connections then it's even less compelling to switch to it.
Another difference is that epoll() can be edge-triggered, where the call only reports new activity on a descriptor, or level-triggered, where the call returns whenever the descriptor is read/write-ready. The standard poll() call is always level-triggered. For most people, however, level-triggered is what they want - edge-triggered interfaces are occasionally useful, but they make it easy to introduce subtle bugs where data left unread on a socket never triggers another notification. My advice is to stay well away from edge-triggered code unless you really, really know what you're doing.
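To make that concrete, registering a descriptor looks roughly like this; the only difference between the two modes is the EPOLLET flag (a sketch only, error handling omitted):

```c
#include <sys/epoll.h>

/* Register one socket with an existing epoll instance. */
int watch_socket(int epfd, int sockfd, int edge_triggered)
{
    struct epoll_event ev;
    ev.data.fd = sockfd;
    ev.events  = EPOLLIN;            /* level-triggered by default       */
    if (edge_triggered)
        ev.events |= EPOLLET;        /* edge-triggered: only new activity */

    /* Registration is done once up front; poll() instead re-passes and
       re-scans its whole array on every call. */
    return epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &ev);
}
```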
The price you pay for epoll() is the lack of portability - both poll() and select() are standard POSIX interfaces, so your code will be much more portable by using them. The epoll() call, on the other hand, is only available on Linux. Some other Unix variants also have their own equivalent mechanisms, such as kqueue on FreeBSD, but you have to write different code for each platform in that case.
My advice is until you reach a point where you're using many file descriptors, don't even worry about epoll() - seriously, there are almost certainly many other places in your code to make far bigger performance improvements and it's entirely possible that epoll() may not be faster for your use-case anyway.
If you do reach a stage where you're handling many connections and the rest of your code is already pretty optimal, then you should first consider something like libev, a cross-platform library that uses the best-performing call on each particular platform. It performs very well, and it's probably rather less hassle overall than using epoll() directly, even if you only want to support Linux.
I haven't referred to the three scenarios you mention so far because I don't believe any of them will perform any differently for a low number of file descriptors such as 16. For a large number of file descriptors, epoll() should outperform poll() particularly where there are mostly idle file descriptors. If all file descriptors are always active, both calls require iterating through every connection to handle it. However, as the proportion of idle connections increases, epoll() gives better performance as it only returns the active connections - with poll() you still have to iterate through everything and most of them will be skipped, but epoll() returns you only the ones you need to handle (up to a maximum limit you can specify).
To spell that out explicitly (and this is only relevant for large numbers of connections, as I mentioned above):
Most of the sockets are active: Both calls broadly comparable, perhaps epoll() still slightly ahead.
Half active half idle: Would expect epoll() to be somewhat better here.
Mostly idle: Would expect epoll() to definitely be better here.
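To make the iteration difference concrete, here's roughly the shape of each loop (a sketch only; handle_fd() stands in for whatever you do per ready descriptor, and error handling is omitted):

```c
#include <poll.h>
#include <sys/epoll.h>

#define NFDS 16

void poll_loop(struct pollfd fds[NFDS])
{
    /* poll(): the whole array goes in, and the whole array is scanned
       afterwards, even if only one descriptor is ready. */
    int n = poll(fds, NFDS, -1);
    for (int i = 0; i < NFDS && n > 0; i++) {
        if (fds[i].revents & POLLIN) {
            /* handle_fd(fds[i].fd); */
            n--;
        }
    }
}

void epoll_loop(int epfd)
{
    /* epoll_wait(): only the ready descriptors come back, so the loop
       length matches the amount of actual activity. */
    struct epoll_event events[NFDS];
    int n = epoll_wait(epfd, events, NFDS, -1);
    for (int i = 0; i < n; i++) {
        /* handle_fd(events[i].data.fd); */
    }
}
```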
EDIT:
You might like to see this graph which is from the libevent author and shows the relative overhead of handling an event as the number of file descriptors changes. Note how all the lines are converging around the origin, demonstrating that all the mechanisms achieve comparable performance for a small number of descriptors.
I'm using select() to listen for data on multiple sockets. When I'm notified that there is data available, how much should I read()?
I could loop over read() until there is no more data, process the data, and then return to the select loop. However, I can imagine that a socket receives so much data so fast that it temporarily 'starves' the other sockets. Especially since I am thinking of also using select for inter-thread communication (message-passing style), I'd like to keep latency low. Is this an issue in reality?
The alternative would be to always read a fixed size of bytes, and then return to the loop. The downside here would be added overhead when there is more data available than fits into my buffer.
What's the best practice here?
Not sure how this is implemented on other platforms, but on Windows the ioctlsocket(FIONREAD) call tells you how many bytes can be read by a single call to recv(). More bytes could be in the socket's queue by the time you actually call recv(). The next call to select() will report the socket is still readable, though.
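On POSIX platforms the rough equivalent is an ioctl() with FIONREAD (a sketch; the value is only a snapshot of what is queued right now):

```c
#include <sys/ioctl.h>

/* Ask how many bytes are currently queued on the socket. More data may
   arrive between this call and the actual recv(). */
int readable_bytes(int sockfd)
{
    int avail = 0;
    if (ioctl(sockfd, FIONREAD, &avail) < 0)
        return -1;
    return avail;
}
```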
The too-common approach here is to read everything that's pending on a given socket, especially if one moves to platform-specific advanced polling APIs like kqueue(2) and epoll(7) enabling edge-triggered events. But, you certainly don't have to! Flip a bit associated with that socket somewhere once you think you got enough data (but not everything), and do more recv(2)'s later, say at the very end of the file-descriptor checking loop, without calling select(2) again.
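A rough sketch of that idea, assuming non-blocking sockets and a made-up per-socket struct (the names pending_more and MAX_READ_PER_WAKEUP are mine, purely for illustration):

```c
#include <sys/socket.h>

#define MAX_READ_PER_WAKEUP (16 * 1024)   /* arbitrary fairness cap */

struct conn {
    int  fd;
    int  pending_more;    /* set when we stopped reading early */
    char buf[4096];
};

/* Read at most MAX_READ_PER_WAKEUP bytes, then yield to the other sockets. */
void service_readable(struct conn *c)
{
    size_t total = 0;
    while (total < MAX_READ_PER_WAKEUP) {
        ssize_t n = recv(c->fd, c->buf, sizeof c->buf, 0);
        if (n <= 0)
            break;                /* would-block, EOF, or error */
        /* consume(c->buf, n); */
        total += (size_t)n;
    }
    /* If we hit the cap, remember to come back to this socket later in
       the same pass, before calling select() again. */
    c->pending_more = (total >= MAX_READ_PER_WAKEUP);
}
```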
Beyond that, the question is too general. What are your goals? Low latency? High throughput? Scalability? There's no single answer to everything (well, except for 42 :)
I'm working on a game sim and want to speed up the match simulation bit. On a given date there may be 50+ matches that need simulating. Currently I loop through each and tell them to simulate themselves, but this can take forever. I was hoping to:
1) Overlay a 'busy' screen
2) Launch a thread for each
3) When the last thread exits, remove the overlay.
Now I can do 1 & 2, but I cannot figure out how to tell when the last thread is finished, because the last thread I detach may not be the last thread finished. What's the best way to do that?
Also, threads are usually used so that work can be done in the background while the user does other stuff, but I'm using them slightly differently. My app is a Core Data app, and I want to prevent the user from touching the store in other ways while I'm simulating the matches. So I want single-threading most of the time, but multithreading for this situation because of how long the sim engine takes. If someone has other ideas for this approach, I'm open.
Rob
Likely you want to use NSOperation and NOT 50 threads - 50 threads is not healthy on an iPhone, and NSOperations are easier to boot. It may be that you are killing performance (that would be my guess) by trying to run 50 at once. NSOperation is designed for exactly this problem. Plus it's easy to code.
My only problem with NSOperations is that they don't have a standard way to tell the caller that they are done.
You could periodically poll the NSOperationQueue - when its count is 0, there are none left. You could also make each operation increment some counter - when the count reaches 50, you are done. Or each operation could post a notification on the main thread, via performSelectorOnMainThread, when it's done.
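If you're open to GCD rather than NSOperation, the C-level dispatch group API expresses the "run something when the last one finishes" part directly (a sketch only; simulate_match() and all_matches_done() stand in for your sim call and your overlay removal):

```c
#include <dispatch/dispatch.h>

static void simulate_match(void *match)   { /* run one match sim */ (void)match; }
static void all_matches_done(void *ctx)   { /* remove the busy overlay */ (void)ctx; }

void simulate_all(void *matches[], int count)
{
    dispatch_queue_t q = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_group_t group = dispatch_group_create();

    for (int i = 0; i < count; i++)
        dispatch_group_async_f(group, q, matches[i], simulate_match);

    /* Runs exactly once, on the main queue, after the last match finishes. */
    dispatch_group_notify_f(group, dispatch_get_main_queue(), NULL,
                            all_matches_done);
    dispatch_release(group);
}
```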
You should see a boost in performance with even a single core - there are lots of times that the main thread is blocked waiting for user input/graphics drawing/etc. Plus multicore phones and iPads will likely be out within a year (total guess - but they are coming).
Also make sure you look at the operation with Instruments. It may be that you can speed the calculations up by a factor of 2 or even 10x!
You're on a single core, so threading probably won't help much, and the overhead may even slow things down.
The first thing to do is use Instruments to profile your code and see what can be sped up. Once you've done that, you can look at some specific optimizations for the bottlenecks.
One simple approach (if you can use GCD) is dispatch_apply(), which lets you loop over your matches, automatically threads them in the best manner for your hardware, and doesn't return until all are complete.
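A minimal sketch of that, using the function-pointer variant so it reads as plain C (simulate_one() just wraps whatever your per-match sim call is):

```c
#include <dispatch/dispatch.h>

/* Called once per match, possibly on several threads at the same time. */
static void simulate_one(void *ctx, size_t i)
{
    void **matches = ctx;
    /* simulate_match(matches[i]); */
    (void)matches; (void)i;
}

void simulate_all_matches(void *matches[], size_t count)
{
    dispatch_queue_t q = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

    /* dispatch_apply blocks until every iteration has finished, so the
       busy overlay can be removed right after it returns. */
    dispatch_apply_f(count, q, matches, simulate_one);
}
```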
The most straightforward solution would be to have all your threads call performSelectorOnMainThread with a particular method that decrements a counter before they exit, and have that method remove the overlay screen when the counter reaches zero.
Simulating all the matches concurrently may not necessarily improve performance though.
You may get the solution to your specific question from #drowntoge, but I'd like to give you some general advice about multithreading:
1/ It does not always speed up your program, as Graham said. Your iPhone only has a single core.
2/ If your program does big I/O, database, or networking work that takes time, then you may consider multithreading, because the CPU isn't busy the whole time - it has to wait for data to load. In that case, multithreading can significantly boost your performance. But you still need to be careful, because thread switching has overhead.
Maybe you only need one thread for I/O processing, plus a cache layer to share the images/data. Then you only need the main thread to loop and do the simulation.
3/ If you want the 50 simulations to appear to run at the same time so the user can watch them, multithreading is also required :)
If you use threading, you won't know in what order the CPU is doing your tasks, and you could potentially be consuming a lot of thread-scheduling resources. Better to use an NSOperationQueue and signal completion of each task using performSelectorOnMainThread. Decrementing a counter has already been mentioned, and may be useful for displaying a progress bar. But you could also maintain an array of 50 busy flags and clear them on completion, which might help you debug whether any particular task is slow or stuck, especially if you mark each completion with a timestamp.
I'm writing my first multithreaded iPhone app using NSOperationQueue because I've heard it's loads better and potentially faster than managing my own thread dispatching.
I'm calculating the outcome of a Game of Life board by splitting the board into separate pieces, having separate threads calculate each piece, and then splicing them back together. To me this seems like a faster way, even with the tremendous overhead of splitting and splicing. I'm creating an NSInvocationOperation object for each piece and then sending them to the operation queue. After I've sent all the pieces of the board, I sit and wait for them all to finish calculating, using the waitUntilAllOperationsAreFinished call on the NSOperationQueue.
This seems like it should work, and it does - it works just fine - BUT the operations get dispatched very slowly, so the multithreaded version actually ends up taking longer to calculate than the single-threaded version! OH NOES! I monitored the creation and termination of the NSOperations sent to the NSOperationQueue and found that some just sit in the operation queue doing nothing for a while before they get called much later on. At first I thought, "Hey, maybe the queue can only process so many threads at a time," and bumped the queue's maxConcurrentOperationCount up to some arbitrarily high number (well above the number of board pieces), but I experienced the same thing!
I was wondering if someone could tell me how to kick NSOperationQueue into "overdrive", so to speak, so that it dispatches its operations as quickly as possible - either that, or tell me what's going on!
Threads do not magically make your processor run faster.
On a single-processor machine, if your algorithm takes a million instructions to execute, splitting it up into 10 chunks of 100,000 instructions each and running it on 10 threads is still going to take just as long. Actually, it will take longer, because you've added the overhead of splitting, merging, and context switching among the threads.
The queue is still fundamentally limited by the processing power of the phone. If the phone can only run two processes simultaneously, you will get (at most) close to a two-fold increase in speed by splitting up the task. Anything more than that, and you are just adding overhead for no gain.
This is especially true if you are running a processor- and memory-intensive routine like your board calculation. NSOperationQueue makes sense if you have several operations that need to wait for extended periods of time. User interface loops and network downloads would be excellent examples. In those cases, other operations can complete while the inactive ones are waiting for input.
For something like your board, the operation for each piece of the grid never has any wait condition. It is always churning away at full speed until done.
See also: iPhone Maximum thread limit? and concurrency application design