NSOperationQueue dispatching threads slowly? - iphone

I'm writing my first multithreaded iPhone app using NSOperationQueue because I've heard it's loads better, and potentially faster, than managing my own thread dispatching.
I'm calculating the outcome of a Game of Life board by splitting the board into separate pieces, having separate threads calculate each piece, and then splicing them back together. Even with the considerable overhead of splitting and splicing, this seems like it should be faster. I'm creating an NSInvocationOperation object for each piece and then sending them to the operation queue. After I've sent all the pieces of the board, I sit and wait for them all to finish calculating with waitUntilAllOperationsAreFinished on the NSOperationQueue.
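In outline, the setup looks roughly like this (a Swift sketch of what I'm doing; computePiece(_:) stands in for the real per-slice calculation, and BlockOperation stands in for my NSInvocationOperations):

import Foundation

// Sketch only: computePiece stands in for the real Game of Life step
// for one slice of the board.
func computePiece(_ index: Int) {
    // ... compute the next generation for board slice `index` ...
}

let queue = OperationQueue()
let pieceCount = 8   // illustrative number of board slices

for i in 0..<pieceCount {
    queue.addOperation(BlockOperation {
        computePiece(i)
    })
}

// Block until every slice has been computed, then splice the results back together.
queue.waitUntilAllOperationsAreFinished()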
This seems like it should work, and it does work just fine, BUT the threads get dispatched very slooooowwwlllyyyyyy, so the multithreaded version actually ends up taking longer to calculate than the single-threaded version! OH NOES! I monitored the creation and termination of the NSOperations sent to the NSOperationQueue and found that some just sit in the operation queue do-diddly-daddlin' for a while before they get called much later on. At first I thought, "Hey, maybe the queue can only process so many threads at a time," and bumped the queue's maxConcurrentOperationCount up to some arbitrarily high number (well above the number of board pieces), but I experienced the same thing!
I was wondering if someone can tell me how to kick NSOperationQueue into "overdrive", so to speak, so that it dispatches its operations as quickly as possible, or else tell me what's going on!

Threads do not magically make your processor run faster.
On a single-processor machine, if your algorithm takes a million instructions to execute, splitting it up into 10 chunks of 100,000 instructions each and running it on 10 threads is still going to take just as long. Actually, it will take longer, because you've added the overhead of splitting, merging, and context switching among the threads.

The queue is still fundamentally limited by the processing power of the phone. If the phone can only run two processes simultaneously, you will get (at most) close to a two-fold increase in speed by splitting up the task. Anything more than that, and you are just adding overhead for no gain.
This is especially true if you are running a processor and memory intensive routine like your board calculation. NSOperationQueue makes sense if you have several operations that need to wait for extended periods of time. User interface loops and network downloads would be excellent examples. In those cases, other operations can complete while the inactive ones are waiting for input.
For something like your board, the operation for each piece of the grid never has any wait condition. It is always churning away at full speed until done.
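If you do keep an operation queue for this kind of CPU-bound work, capping its width at the core count is about the most that helps; a minimal sketch (activeProcessorCount comes from Foundation's ProcessInfo):

import Foundation

let queue = OperationQueue()

// For CPU-bound work, allowing more concurrent operations than there are
// cores just adds scheduling and context-switch overhead.
queue.maxConcurrentOperationCount = ProcessInfo.processInfo.activeProcessorCount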
See also: iPhone Maximum thread limit? and concurrency application design

Custom DispatchQueue quality of service

Is there a way to create a custom DispatchQueue quality of service with its own custom "speed"? For example, I want a QoS that's twice as slow as .utility.
Ideas on how to solve it
Somehow telling the CPU/GPU that we want to run the task every X operation cycles? Not sure if that's directly possible with iOS.
We could introduce a wait after every line of code, but that's a really bad hack: it produces messy code and doesn't really solve the issue if a single line runs for several seconds.
In SpriteKit/SceneKit, it's possible to slow down time. Is there a way to utilize that somehow to slow down an arbitrary piece of code?
Blocking the thread every X seconds so that it slows down - not sure if possible without sacrificing app speed
There is no mechanism in iOS, or any other Cocoa platform, to control the "speed" (for any meaning of that word) of a work item. The only tool offered to us is some control over scheduling. Once your work item is scheduled, it will get 100% (*) of the CPU core until it ends or is preempted. There is no way to ask to be preempted more often (and it would be expensive to allow that, since context switches are expensive).
The way to manage how much work is done is to directly manage the work, not preemption. The best way is to split up the work into small pieces, and schedule them over time and combine them at the end. If your algorithm doesn't support that kind of input segmentation, then the algorithm's main "loop" needs to limit the number of iterations it performs (or the amount of time it spends iterating), and return at that point to be scheduled later.
If you don't control the algorithm code, and you cannot work with whoever does, and you cannot slice your data into smaller pieces, this may be an unsolvable problem.
(*) With the rise of "performance" cores and other such CPU advances, this isn't completely true, but for this question it's close enough.
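A minimal sketch of that slicing approach, assuming the work can be broken into chunks (processChunk and runSlowly are illustrative names, not SDK API):

import Foundation

// Illustrative only: processChunk does one small, bounded slice of the total work.
func processChunk(_ index: Int) {
    // ... a bounded amount of work ...
}

func runSlowly(totalChunks: Int, index: Int = 0, gap: TimeInterval = 0.05) {
    guard index < totalChunks else { return }
    DispatchQueue.global(qos: .utility).async {
        processChunk(index)
        // Re-schedule the next chunk after a pause instead of looping,
        // so the work gives up the CPU at a rate we choose.
        DispatchQueue.global(qos: .utility).asyncAfter(deadline: .now() + gap) {
            runSlowly(totalChunks: totalChunks, index: index + 1, gap: gap)
        }
    }
}

runSlowly(totalChunks: 100)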
Technically you cannot alter the speed of a QoS such as .background or .utility or any other QoS level.
The way to handle this is to choose the right QoS based on the task you want to perform.
The higher the QoS, the more resources the OS will spend on the work; a lower QoS gets correspondingly fewer.
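For example, submitting the same closure at different QoS levels only changes how the scheduler prioritizes it, not how fast the code runs once it is on a core (a minimal sketch):

import Dispatch

// The closure runs at full speed either way; QoS only affects how the
// scheduler prioritizes it relative to other work on the system.
DispatchQueue.global(qos: .userInitiated).async {
    // latency-sensitive work
}
DispatchQueue.global(qos: .utility).async {
    // long-running, lower-priority work
}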

Figure out when a context switch is happening in Swift

To be honest, I don't know whether there is a solution to my question, but I'd like to detect, in Swift, when a context switch is happening.
I was imagining a function that takes a long time to complete, such as a write operation to a remote server, and wondering whether there is a way to tell when (at which line, at least) the thread executing that task performs a context switch because another task that has been waiting a long time has to be executed.
Sorry if this seems like a stupid question, or if I made mistakes while trying to explain the above.
EDIT:
I'm talking about context switches that happen automatically, requested by the scheduler. So imagine again that we are in the middle of this long function that does tons of operations, and the scheduler gave the task a certain amount of time, say 10 seconds, to complete. If the process runs out of time and doesn't finish the task, it will be suspended and the thread will, for example, execute a task from another process. When that ends, the scheduler might give the suspended job another try, and execution will resume where it was suspended (so it will read the value from the PC register and keep going).
You can absolutely get what you've asked for, but I have a feeling it's going to be much less useful to you than you may believe. The vast (vast, vast!) majority of context switches are going to occur at points that you probably would think of as "uninteresting." lstat64(), mach_vm_map_trap(), mach_msg_trap(), mach_port_insert_member_trap(), kevent_id(), the list goes on. Most threads spend most of their time deep in the OS stack. "A write operation on a remote server" isn't going to block after it takes some long period of time. It's going to proactively block itself because it knows it's going to take a (mind-bogglingly) long period of time.
Even so, you can certainly explore this with Instruments. Just choose the System Trace profile and it'll show you all the threads and the system cores, how your threads and all the other threads on the device are getting scheduled, every system call, etc. It's a huge amount of information, so you usually only profile for a few seconds at a time.
This is useful information if you're at the point where context switches are a major bottleneck. This might happen if you're dealing with excessive lock contention, or if you're thrashing your L1 cache because you keep getting interrupted by some other thread. So if you have some thread that you expect to stay running pretty continuously, and it's getting blocked, this is really valuable information. Or if you have two threads that you think should work back and forth smoothly, but they seem to be fighting (switching rapidly), then that's something you could work on. (But this is rarely one of the first places you'd look for performance tuning unless you're working on quite low-level code.)
From your description, I think you may have the wrong idea about the scheduler. Nothing in the scheduler is going to be on the order of 10 seconds. In the scheduler world, milliseconds are a lot of time. You should be thinking about things that take microseconds and even nanoseconds. If you're working on code that assumes fetching data from RAM is free, then you're on the wrong time scale. A network call is so ludicrously slow that you can basically estimate it as "forever." At the context-switch level you're looking at events like:
00:00.770.247 Virtual memory Zero Fill took 10.50 µs small_free_list_add_ptr ← (31 other frames)
I think it's kind of a cool question, but it's not entirely clear, so this answer is based on my understanding of it. Normally, if you implement a C or C++ program around context switching, you write code with mutexes or semaphores. In those sections, processes or threads work inside a critical section, and the context switch is sometimes performed there, either manually or via an interrupt. In iOS there are equivalent concurrency primitives, such as DispatchSemaphore (a mutex is essentially a semaphore that works as a lock). You can read the documentation from here.
To start with, this is the DispatchSemaphore class declaration in Swift:
class DispatchSemaphore : DispatchObject
You can initialize it with an integer value, such as:
let semaphore = DispatchSemaphore(value: intValue)
And if you want the mutex variant, you can use a lock, such as Foundation's NSLock:
let lock = NSLock()
When you take the lock with a correct implementation, you are in the critical section, and you leave it by unlocking.
It obviously looks similar to the POSIX pthread_mutex_t.
You can handle the context switching within the lock's or semaphore's critical section like this:
lock.lock()
// critical section
// handle the calculation; if it takes too long, unlock early, otherwise let it do its job
lock.unlock()
semaphore.wait(timeout: DispatchTime.distantFuture)
// critical section
// handle the calculation; if it takes too long, signal here to exit, otherwise let it do its job
// a normal job will already have signalled for exit.
semaphore.signal()
The accepted answer already covers context switching between threads.
To add to this: when you ask about context switching, what it naturally means is switching between threads or processes. So when you fetch data on a background thread and then apply it to a UITableView, you will probably call reloadData inside DispatchQueue.main.async. That's a common usage of context switching.
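For example (a sketch only; ItemsViewController and fetchItems are made-up placeholders):

import UIKit

final class ItemsViewController: UITableViewController {
    private var items: [String] = []

    // Placeholder for the real long-running fetch.
    private func fetchItems() -> [String] { ["a", "b", "c"] }

    func reload() {
        DispatchQueue.global(qos: .userInitiated).async {
            let items = self.fetchItems()      // slow work off the main thread
            DispatchQueue.main.async {
                self.items = items
                self.tableView.reloadData()    // UI update hops back to the main thread
            }
        }
    }
}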

Is there any advantage, with RIO, of IOCP over events?

RIO here stands for the Windows 8 'Registered I/O' networking extensions. From looking at example code, it seems that regardless of whether you are using RIONotify with events or with IO Completion Ports, you basically end up writing the same loop and would have nearly identical performance characteristics. The loop body is:
RIONotify() [event or IOCP]
Wait [on the event, or using GetQueuedCompletionStatus()]
RIODequeueCompletion()
// Process the dequeued events
Basically it seems like the use of an IO Completion Port provides no additional functionality over 'event' notify/waiting, since the actual dequeuing of messages is done with RIODequeueCompletion. So it doesn't matter whether you use events or IOCP. My question is, am I overlooking any interesting or important difference between the models?
RIO is about registering buffers with the kernel to save overhead and about more efficient queue management. It's not a fundamental shift. Just a lot less overhead.
IOCP is not about increasing the performance of individual actions. It is about using less threads and about context switching less. RIO takes it a little further.
If you're using RIO with IOCP you could scale across multiple threads by calling RIODequeueCompletion() to dequeue multiple completions at a time, then immediately calling RIONotify() to allow more notifications and then letting your thread process the completions that you retrieved with RIODequeueCompletion(). If there are more completions available another of your threads will become active and can do the same thing. This may or may not give you improved performance over a single threaded polling loop.
I have some example code for this here, and some thoughts on the relative performance of various RIO designs and 'standard' UDP server designs here.

NSThreading for speed

I'm working on a game sim and want to speed up the match simulation bit. On a given date there may be 50+ matches that need simulating. Currently I loop through each one and tell it to simulate itself, but this can take forever. I was hoping to:
1) Overlay a 'busy' screen
2) Launch a thread for each
3) When the last thread exits, remove the overlay.
Now I can do 1 & 2, but I cannot figure out how to tell when the last thread is finished, because the last thread I detach may not be the last thread finished. What's the best way to do that?
Also, usually threads are used so that work can be done in the background while the user does other stuff, I'm using it slightly different. My app is a core-data app and I want to avoid the user touching the store in other ways while i'm simulating the matches. So I want single-threading most of the time, but then multithreading for this situation because of how long the sim engine takes. If someone has other ideas for this approach I'm open.
Rob
Likely you want to use NSOperation and NOT 50 threads; 50 threads is not healthy on an iPhone, and NSOperations are easier to work with, to boot. It may be that you are killing performance (that would be my guess) by trying to run 50 at once. NSOperation is designed for exactly this problem. Plus it's easy to code.
My only problem with NSOperations is that they don't have a standard way to tell the caller that they are done.
You could periodically poll the NSOperationQueue; when its count is 0 there are none left. You could also make each operation increment some counter: when the count reaches 50 you are done. Or each operation could post a notification on the main thread, using performSelectorOnMainThread, to say that it's done.
You should see a boost in performance with even a single core - there are lots of times that the main thread is blocked waiting for user input/graphics drawing/etc. Plus multicore phones and iPads will likely be out within a year (total guess - but they are coming).
Also make sure you look at the operation with Instruments. It may be that you can speed the calculations up by a factor of 2 or even 10x!
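A sketch of the counter idea in today's Swift terms, with a main-queue hop standing in for performSelectorOnMainThread (simulateMatch and the counts are illustrative):

import Foundation

// Illustrative: simulateMatch stands in for the per-match work.
func simulateMatch(_ index: Int) { /* ... */ }

let queue = OperationQueue()
let matchCount = 50
var remaining = matchCount   // only ever touched on the main queue

for i in 0..<matchCount {
    queue.addOperation {
        simulateMatch(i)
        OperationQueue.main.addOperation {
            remaining -= 1
            if remaining == 0 {
                // The last match just finished: remove the busy overlay here.
            }
        }
    }
}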
You're on a single core, so threading probably won't help much, and the overhead may even slow things down.
The first thing to do is use Instruments to profile your code and see what can be sped up. Once you've done that you can look at some specific optimizations for the bottlenecks.
One simple approach (if you can use GCD) is dispatch_apply(), which'll let you loop over your matches, automatically thread them in the best manner for your hardware, and doesn't return until all are complete.
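In Swift, the same call is spelled DispatchQueue.concurrentPerform; a minimal sketch (simulateMatch is a placeholder for the real per-match work):

import Dispatch

// Placeholder for the real per-match simulation.
func simulateMatch(_ index: Int) { /* ... */ }

// Swift spelling of dispatch_apply: runs the iterations concurrently,
// sized to the hardware, and returns only when all of them have finished.
DispatchQueue.concurrentPerform(iterations: 50) { index in
    simulateMatch(index)
}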
The most straightforward solution would be to have each of your threads call performSelectorOnMainThread on a particular method that decrements a counter before the thread exits, and let that method remove the overlay screen when the counter it decrements reaches zero.
Simulating all the matches concurrently may not necessarily improve performance though.
You may get the solution to your specific question from #drowntoge, but more generally, I want to give you some advice about multithreading:
1/ As Graham said, it does not always speed up your program. Your iPhone has only a single core.
2/ If your program has heavy I/O, database, or networking work that takes time, then you may consider multithreading, because the CPU is not busy the whole time; it has to wait for data to load. In this case, multithreading will significantly boost your performance. But you still need to be careful, because thread switching has overhead.
Maybe you only need one thread for I/O processing, plus a cache layer to share the images/data. Then you only need the main thread to loop and run the simulation.
3/ If you want the 50 simulations to appear to run at the same time for the user to watch, multithreading is also required. :)
If you use threading, you won't know in what order the CPU is doing your tasks, and you could potentially be consuming a lot of thread scheduling resources. Better to use an NSOperationQueue and signal completion of each task using performSelectorOnMainThread. Decrementing a counter has already been mentioned, which may be useful for displaying a progress bar. But you could also maintain an array of 50 busy flags and clear them on completion, which might help debugging whether any particular task is slow or stuck if you mark completion with a time stamp.

Does it make sense to use a pool of Actors?

I'm just learning, and really liking, the Actor pattern. I'm using Scala right now, but I'm interested in the architectural style in general, as it's used in Scala, Erlang, Groovy, etc.
The case I'm thinking of is where I need to do things concurrently, such as, let's say "run a job".
With threading, I would create a thread pool and a blocking queue, and have each thread poll the blocking queue, and process jobs as they came in and out of the queue.
With actors, what's the best way to handle this? Does it make sense to create a pool of actors and somehow send them messages containing the jobs? Maybe with a "coordinator" actor?
Note: An aspect of the case which I forgot to mention was: what if I want to constrain the number of jobs my app will process concurrently? Maybe with a config setting? I was thinking that a pool might make it easy to do this.
Thanks!
A pool is a mechanism you use when the cost of creating and tearing down a resource is high. In Erlang this is not the case so you should not maintain a pool.
You should spawn processes as you need them and destroy them when you have finished with them.
Sometimes it makes sense to limit how many worker processes you have operating concurrently on a large task list, because the task each process is spawned to complete involves resource allocation. At the very least, processes use up memory, but they could also hold open files and/or sockets, which tend to be limited to only thousands and fail miserably and unpredictably once you run out.
To have a pull-driven task pool, one can spawn N linked processes that ask for a task, and hand each of them a function they can spawn_monitor. As soon as the monitored process has ended, they come back for the next task. Specific needs drive the details, but that is the outline of one approach.
The reason I would let each task spawn a new process is that processes do have some state, and it is nice to start off with a clean slate. It is a common fine-tuning to set the minimum heap size of processes, adjusted to minimize the number of GCs needed during their lifetime. It is also a very efficient form of garbage collection to free all of a process's memory at once and start a new process for the next task.
Does it feel weird to use twice the number of processes like that? It's a feeling you need to overcome in Erlang programming.
There is no best way for all cases. The decision depends on the number, duration, arrival, and required completion time of the jobs.
The most obvious difference between just spawning off actors and using pools is that in the former case your jobs will all finish at nearly the same time, while in the latter case completion times will be spread out in time. The average completion time will be the same, though.
The advantage of using actors is the simplicity of the coding, as it requires no extra handling. The trade-off is that your actors will be competing for your CPU cores. You will not be able to run more parallel jobs than you have CPU cores (or hardware threads), no matter what programming paradigm you use.
As an example, imagine that you need to execute 100,000 jobs, each taking one minute, and the results are due next month. You have four cores. Would you spawn off 100,000 actors, each competing over the resources for a month, or would you just queue your jobs up and execute four at a time?
As a counterexample, imagine a web server running on the same machine. If you have five requests, would you prefer to serve four users in T time and one in 2T, or serve all five in 1.2T time?