NSOperationQueue worse performance than single thread on computation task - swift

My first question!
I am doing CPU-intensive image processing on a video feed, and I wanted to use OperationQueue. However, the results are absolutely horrible. Here's an example. Let's say I have a CPU-intensive operation:
var data = [Int].init(repeating: 0, count: 1_000_000)
func run() {
    let startTime = DispatchTime.now().uptimeNanoseconds
    for i in data.indices { data[i] = data[i] &+ 1 }
    NSLog("\(DispatchTime.now().uptimeNanoseconds - startTime)")
}
It takes about 40ms on my laptop to execute. I time a hundred runs:
(1...100).forEach { i in run(i) }
They average about 42ms each, for about 4200ms total. I have 4 physical cores, so I try to run it on an OperationQueue:
var q = OperationQueue()
(1...100).forEach { i in
    q.addOperation {
        run(i)
    }
}
q.waitUntilAllOperationsAreFinished()
Interesting things happen depending on q.maxConcurrentOperationCount:
concurrency    single operation    total
1              45ms                4500ms
2              100-250ms           8000ms
3              100-300ms           7200ms
4              250-450ms           9000ms
5              250-650ms           9800ms
6              600-800ms           11300ms
I use the default QoS of .background and can see that the thread priority is the default (0.5). Looking at the CPU utilization in Instruments, I see a lot of wasted cycles (the first part of the trace is running on the main thread, the second is running with the OperationQueue).
I wrote a simple thread queue in C, used it from Swift, and it scales linearly with the cores, so I'm able to get my 4x speed increase. But what am I doing wrong in Swift?
Update: I think we have concluded that this is a legitimate bug in DispatchQueue. The question, then, is: what is the correct channel for reporting issues in the DispatchQueue code?

You seem to measure the wall-clock time of each run execution. This does not seem to be the right metric. Parallelizing the problem does not mean that each run will execute faster; it just means that you can do several runs at once. With 4 cores, 100 runs of roughly 40 ms each can finish in roughly a quarter of the serial wall-clock time, even though each individual run still takes about 40 ms (or a bit more).
Anyhow, let me verify your results.
Your function run seems to take a parameter some of the time only. Let me define a similar function for clarity:
func increment(_ offset : Int) {
    for i in data.indices { data[i] = data[i] &+ offset }
}
On my test machine, in release mode, this code takes 0.68 ns per entry or about 2.3 cycles (at 3.4 GHz) per addition. Disabling bound checking helps a bit (down to 0.5 ns per entry).
Anyhow. So next let us parallelize the problem as you seem to suggest:
var q = OperationQueue()
for i in 1...queues {   // `queues` here is the number of operations to launch (e.g. 100)
    q.addOperation {
        increment(i)
    }
}
q.waitUntilAllOperationsAreFinished()
That does not seem particularly safe, but is it fast?
Well, it is faster... I hit 0.3 ns per entry.
Source code : https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/extra/swift/opqueue

.background will run the threads at the lowest priority. If you are looking for fast execution, consider .userInitiated, and make sure you are measuring the performance with compiler optimizations turned on.
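For example (a sketch, not the asker's exact code), raising the queue's quality of service and capping the concurrency might look like this, with run() being the CPU-bound function from the question:
import Foundation

let queue = OperationQueue()
queue.qualityOfService = .userInitiated   // raised from the default of .background
queue.maxConcurrentOperationCount = 4     // e.g. the number of physical cores

(1...100).forEach { _ in
    queue.addOperation {
        run()   // the CPU-bound work from the question
    }
}
queue.waitUntilAllOperationsAreFinished()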
Also consider using DispatchQueue instead of OperationQueue. It might have less overhead and better performance.
Update based on your comments: try this. It goes from 38 s on my laptop to 14 s or so.
Notable changes:
I made the queue explicitly concurrent
I run the thing in release mode
Replaced the inner-loop calculation with a random number; the original got optimized out
QoS set to a higher level: with the higher QoS it works as expected, whereas .background runs forever
var data = [Int].init(repeating: 0, count: 1_000_000)
func run() {
    let startTime = DispatchTime.now().uptimeNanoseconds
    for i in data.indices { data[i] = Int(arc4random_uniform(1000)) }
    print("\((DispatchTime.now().uptimeNanoseconds - startTime)/1_000_000)")
}

let startTime = DispatchTime.now().uptimeNanoseconds
let g = DispatchGroup()
let q = DispatchQueue(label: "myQueue", qos: .userInitiated, attributes: [.concurrent])

(1...100).forEach { _ in
    q.async(group: g) {
        run()
    }
}

g.wait()
print("\((DispatchTime.now().uptimeNanoseconds - startTime)/1_000_000)")
Something is still wrong, though: the serial queue runs 3x faster even though it does not use all the cores.

For the sake of future readers, two observations on multithreaded performance:
There is a modest overhead introduced by multithreading. You need to make sure that there is enough work on each thread to offset this overhead. As the old Concurrency Programming Guide says:
You should make sure that your task code does a reasonable amount of work through each iteration. As with any block or function you dispatch to a queue, there is overhead to scheduling that code for execution. If each iteration of your loop performs only a small amount of work, the overhead of scheduling the code may outweigh the performance benefits you might achieve from dispatching it to a queue. If you find this is true during your testing, you can use striding to increase the amount of work performed during each loop iteration. With striding, you group together multiple iterations of your original loop into a single block and reduce the iteration count proportionately. For example, if you perform 100 iterations initially but decide to use a stride of 4, you now perform 4 loop iterations from each block and your iteration count is 25.
It goes on to say:
Although dispatch queues have very low overhead, there are still costs to scheduling each loop iteration on a thread. Therefore, you should make sure your loop code does enough work to warrant the costs. Exactly how much work you need to do is something you have to measure using the performance tools.
A simple way to increase the amount of work in each loop iteration is to use striding. With striding, you rewrite your block code to perform more than one iteration of the original loop.
You should be wary of using either operations or GCD dispatches to achieve multithreaded algorithms. This can lead to “thread explosion”. You should use DispatchQueue.concurrentPerform (previously known as dispatch_apply). This is a mechanism for performing loops in parallel, while ensuring that the degree of concurrency will not exceed the capabilities of the device.
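For illustration (a sketch, not code from the question), a strided concurrentPerform over the array from the question might look like the following. The stride of 100_000 is arbitrary, and the writes go through a mutable buffer pointer so that the iterations touch disjoint ranges of the array:
import Foundation

var data = [Int](repeating: 0, count: 1_000_000)
let stride = 100_000                   // illustrative; tune empirically
let iterations = data.count / stride   // 10 iterations of 100_000 elements each

data.withUnsafeMutableBufferPointer { buffer in
    DispatchQueue.concurrentPerform(iterations: iterations) { iteration in
        let start = iteration * stride
        for i in start ..< start + stride {
            buffer[i] &+= 1   // same work as the question's run(), done in strides
        }
    }
}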

Related

Parallelism within concurrentPerform closure

I am looking to implement concurrency inside part of my app in order to speed up processing. The input array can be large, and I need to check multiple things related to it. Below is some sample code.
EDITED:
The answers are helpful for looking at striding through the array, which was something else I was looking at doing, but I think they are sliding away from the original question, since I already have a DispatchQueue.concurrentPerform present in the code.
Within the outer for loop, I was looking to run other for loops, because I have to look at the same data multiple times. The inputArray is an array of structs, so in the outer loop I am looking at one value in the struct, and in the inner loops I am looking at a different value. In the change below I made the two inner for loops into function calls to make the code a bit clearer. In general, I want to make the two funcA and funcB calls and wait until they are both done before continuing in the main loop.
//assume the startValues and stop values will be within the bounds of the
//array and won't under/overflow
private func funcA(inputArray: [Int], startValue: Int, endValue: Int) -> Bool {
    for index in startValue...endValue {
        let dataValue = inputArray[index]
        if dataValue == 1_000_000 {
            return true
        }
    }
    return false
}

private func funcB(inputArray: [Int], startValue: Int, endValue: Int) -> Bool {
    for index in startValue...endValue {
        let dataValue = inputArray[index]
        if dataValue == 10 {
            return true
        }
    }
    return false
}

private func testFunc(inputArray: [Int]) {
    let dataIterationArray = Array(Set(inputArray))
    let syncQueue = DispatchQueue(label: "syncQueue")

    DispatchQueue.concurrentPerform(iterations: dataIterationArray.count) { index in
        //I want to do these two function calls starting roughly one after another,
        //to work them in parallel, but I want to wait until both are complete before
        //moving on. funcA is going to take much longer than funcB in this case,
        //just because there are more values to check.
        let funcAResult = funcA(inputArray: dataIterationArray, startValue: 10, endValue: 2_000_000)
        let funcBResult = funcB(inputArray: dataIterationArray, startValue: 5, endValue: 9)

        //Wait for both above to finish before continuing
        if funcAResult && funcBResult {
            print("Yup we are good!")
        } else {
            print("Nope")
        }

        //And then wait here until all of the loops are done before processing
    }
}
In your revised question, you contemplated a concurrentPerform loop where each iteration called funcA and then funcB and suggested that you wanted them “to work them in parallel”.
Unfortunately, that is not how concurrentPerform works. It runs the separate iterations in parallel, but the code within the closure should be synchronous and run sequentially. If the closure introduces additional parallelism, that will adversely affect how concurrentPerform can reason about how many worker threads it should use.
Before we consider some alternatives, let us see what will happen if funcA and funcB remain synchronous. In short, you will still enjoy parallel execution benefits.
Below, I logged this with “Points of Interest” intervals in Instruments, and you will see that funcA (in green) never runs concurrently with funcB (in purple) for the same iteration (i.e., for the same range of start and end indices). In this example, I am processing an array with 180 items, striding 10 items at a time, ending up with 18 iterations running on an iPhone 12 Pro Max with six cores:
But, as you can see, although funcB for a given range of indices will not start until funcA finishes for the same range of indices, it does not really matter, because we are still enjoying full parallelism on the device, taking advantage of all the CPU cores.
I contend that, given that we are already enjoying parallelism, there is little benefit in making funcA and funcB run concurrently with respect to each other, too. Just let the individual iterations run parallel to each other, let A and B run sequentially, and call it a day.
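As an aside, the "Points of Interest" intervals mentioned above can be produced with os_signpost. A minimal sketch of a hypothetical wrapper (not from the original answer; the subsystem string and wrapper name are placeholders):
import os.signpost

let poiLog = OSLog(subsystem: "com.example.concurrentperform", category: .pointsOfInterest)

// Hypothetical wrapper around the question's funcA so its execution shows up
// as an interval in Instruments' "Points of Interest" track.
func timedFuncA(inputArray: [Int], startValue: Int, endValue: Int) -> Bool {
    os_signpost(.begin, log: poiLog, name: "funcA")
    defer { os_signpost(.end, log: poiLog, name: "funcA") }
    return funcA(inputArray: inputArray, startValue: startValue, endValue: endValue)
}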
If you really want to have funcA and funcB run parallel with each other, as well, you will need to consider a different pattern. The concurrentPerform simply is not designed for launching parallel tasks that, themselves, are asynchronous. You could consider:
Have concurrentPerform launch, using my example, 36 iterations, half of which do funcA and half of which do funcB.
Or you might consider using OperationQueue with a reasonable maxConcurrentOperationCount (but you do not enjoy the dynamic limiting of the degree of concurrency to the device's CPU cores).
Or you might use an async-await task group, which will limit itself to the cooperative thread pool (see the sketch after this list).
But you will not want to have concurrentPerform have a closure that launches asynchronous tasks or introduces additional parallel execution.
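A minimal sketch of that task-group option, assuming an async context and that calling the synchronous funcA/funcB from child tasks is acceptable (this code is not from the original answer):
func checkAll(_ array: [Int]) async -> Bool {
    await withTaskGroup(of: Bool.self) { group in
        // run funcA and funcB as two child tasks on the cooperative pool
        group.addTask { funcA(inputArray: array, startValue: 10, endValue: 2_000_000) }
        group.addTask { funcB(inputArray: array, startValue: 5, endValue: 9) }

        var allFound = true
        for await result in group {
            allFound = allFound && result
        }
        return allFound
    }
}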
And, as I discuss below, the example provided in the question is not a good candidate for parallel execution. Mere tests of equality are not computationally intensive enough to enjoy parallelism benefits. It will undoubtedly just be slower than the serial pattern.
My original answer, below, outlines the basic concurrentPerform considerations.
The basic idea is to “stride” through the values. So calculate how many “iterations” are needed and calculate the “start” and “end” index for each iteration:
private func testFunc(inputArray: [Int]) {
    DispatchQueue.global().async {
        let array = Array(Set(inputArray))
        let syncQueue = DispatchQueue(label: "syncQueue")

        // calculate how many iterations will be needed

        let count = array.count
        let stride = 10
        let (quotient, remainder) = count.quotientAndRemainder(dividingBy: stride)
        let iterations = remainder == 0 ? quotient : quotient + 1

        // now iterate

        DispatchQueue.concurrentPerform(iterations: iterations) { iteration in
            // calculate the `start` and `end` indices

            let start = stride * iteration
            let end = min(start + stride, count)

            // now loop through that range

            for index in start ..< end {
                let value = array[index]
                print("iteration =", iteration, "index =", index, "value =", value)
            }
        }

        // you won't get here until they're all done; obviously, if you
        // want to now update your UI or model, you may want to dispatch
        // back to the main queue, e.g.,
        //
        // DispatchQueue.main.async {
        //     ...
        // }
    }
}
Note, if something is so slow that it merits concurrentPerform, you probably want to dispatch the whole thing to a background queue, too. Hence the DispatchQueue.global().async {…} shown above. You would probably want to add a completion handler to this method, now that it runs asynchronously, but I will leave that to the reader.
Needless to say, there are quite a few additional considerations:
The stride should be large enough to ensure there is enough work on each iteration to offset the modest overhead introduced by multithreading. Some experimentation is often required to empirically determine the best striding value.
The work done in each thread must be significant (again, to justify the multithreading overhead). I.e., simply printing values is obviously not enough. (Worse, print statements compound the problem by introducing a hidden synchronization.) Even building a new array with some simple calculation will not be sufficient. This pattern really only works if you are doing something very computationally intensive.
You have a “sync” queue, which suggests that you understand that you need to synchronize the combination of the results of the various iterations. That is good. I will point out, though, that you will want to minimize the total number of synchronizations you do. E.g., let's say you have 1000 values and you end up doing 10 iterations, each striding through 100 values. You generally want to have each iteration build a local result and do a single synchronization per iteration, as sketched below. Using my example, you would strive to end up with only 10 total synchronizations, not 1000 of them; excessive synchronization can easily negate any performance gains.
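For instance, a minimal sketch of that pattern, reusing the names from the example above (array, count, stride, iterations, syncQueue); the doubling is just a placeholder computation:
var results = [Int]()

DispatchQueue.concurrentPerform(iterations: iterations) { iteration in
    let start = stride * iteration
    let end = min(start + stride, count)

    // build a local result for this iteration's slice
    var localResults = [Int]()
    for index in start ..< end {
        localResults.append(array[index] * 2)
    }

    // one synchronization per iteration, not one per element
    syncQueue.sync { results.append(contentsOf: localResults) }
}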
Bottom line, making a routine execute in parallel is complicated and you can easily find that the process is actually slower than the serial rendition. Some processes simply don’t lend themselves to parallel execution. We obviously cannot comment further without understanding what your processes entail. Sometimes other technologies, such as Accelerate or Metal can achieve better results.
I will explain it here, since a comment is too small, but I will delete this later if it doesn't answer the question.
Instead of looping over iterations: dataIterationArray.count, base the number of iterations on the number of desired parallel streams of work, not on the array size. For example, since you mentioned you want three streams of work, you should have three iterations, each iteration processing an independent part of the work:
DispatchQueue.concurrentPerform(iterations: 3) { iteration in
    switch iteration {
    case 0:
        for i in 1...10 {
            print("i \(i)")
        }
    case 1:
        for j in 11...20 {
            print("j \(j)")
        }
    case 2:
        for k in 21...30 {
            print("k \(k)")
        }
    default:
        break
    }
}
And the "And then wait here until all of the loops are done before processing" will happen automatically, this is what concurrentPerform guarantees.

Buffering slow subscribers in Swift Combine

I'm currently struggling to get a desired behaviour when using Combine. I've previously used the Rx framework and believe (from what I remember) that the described scenario was possible there by specifying backpressure strategies for buffering.
So the issue I have is that I have a publisher that publishes values very rapidly. I have two subscribers to it: one which can react just as fast as the values are published (cool beans), but then a second subscriber that runs some CPU-expensive processing.
I know that in order to support the second, slower subscriber I need to allow buffering of values, but I don't seem to be able to make this happen. Here is what I have so far:
let subject = PassthroughSubject<Int, Never>()
// publish some values
Task {
    for i in 0... {
        subject.send(i)
    }
}

subject
    .print("fast")
    .sink { _ in }

subject
    .map { n -> Int in
        sleep(1) // CPU intensive work here
        return n
    }
    .print("slow")
    .sink { _ in }
Originally I thought I could use .buffer(...) on the slow subscriber, but this doesn't appear to be the use case. What seems to happen is that the subject dispatches to each subscriber and only demands more from the publisher after the subscriber finishes, and in this case that seems to block the .send(...) call of the publishing loop.
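For reference, the kind of thing I was attempting with .buffer looked roughly like this (the size, prefetch and whenFull values here are arbitrary placeholders):
subject
    .buffer(size: 100, prefetch: .byRequest, whenFull: .dropOldest)
    .map { n -> Int in
        sleep(1) // CPU intensive work here
        return n
    }
    .print("slow")
    .sink { _ in }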
Any advice would be greatly appreciated 👍

How to achieve single threaded concurrency with blocking functions in Rust? [duplicate]

I read the tokio documentation and I wonder what is the best approach for encapsulating costly synchronous I/O in a future.
With the reactor framework, we get the advantage of a green threading model: a few OS threads handle a lot of concurrent tasks through an executor.
The future model of tokio is demand driven, which means the future itself will poll its internal state to provide information about its completion, allowing backpressure and cancellation capabilities. As far as I understand, the polling phase of the future must be non-blocking to work well.
The I/O I want to encapsulate can be seen as a long atomic and costly operation. Ideally, an independent task would perform the I/O and the associated future would poll the I/O thread for the completion status.
The only two options I see are:
Include the blocking I/O in the poll function of the future.
Spawn an OS thread to perform the I/O and use the future mechanism to poll its state, as shown in the documentation.
As I understand it, neither solution is optimal; neither gets the full advantage of the green-threading model (the first is not advised in the documentation, and the second doesn't pass through the executor provided by the reactor framework). Is there another solution?
Ideally, an independent task would perform the I/O and the associated future would poll the I/O thread for the completion status.
Yes, this is the recommended approach for asynchronous execution. Note that this is not restricted to I/O, but is valid for any long-running synchronous task!
Futures crate
The ThreadPool type was created for this.[1]
In this case, you spawn work to run in the pool. The pool itself performs the work of checking whether the work is completed yet, and it returns a type that fulfills the Future trait.
use futures::{
    executor::{self, ThreadPool},
    future,
    task::{SpawnError, SpawnExt},
}; // 0.3.1, features = ["thread-pool"]
use std::{thread, time::Duration};

async fn delay_for(pool: &ThreadPool, seconds: u64) -> Result<u64, SpawnError> {
    pool.spawn_with_handle(async move {
        thread::sleep(Duration::from_secs(seconds));
        seconds
    })?
    .await;

    Ok(seconds)
}
fn main() -> Result<(), SpawnError> {
    let pool = ThreadPool::new().expect("Unable to create threadpool");

    let a = delay_for(&pool, 3);
    let b = delay_for(&pool, 1);

    let c = executor::block_on(async {
        let (a, b) = future::join(a, b).await;
        Ok(a? + b?)
    });

    println!("{}", c?);
    Ok(())
}
You can see that the total time is only 3 seconds:
% time ./target/debug/example
4
real 3.010
user 0.002
sys 0.003
[1] There's some discussion that the current implementation may not be the best for blocking operations, but it suffices for now.
Tokio
Here, we use task::spawn_blocking
use futures::future; // 0.3.15
use std::{thread, time::Duration};
use tokio::task; // 1.7.1, features = ["full"]

async fn delay_for(seconds: u64) -> Result<u64, task::JoinError> {
    task::spawn_blocking(move || {
        thread::sleep(Duration::from_secs(seconds));
        seconds
    })
    .await?;

    Ok(seconds)
}

#[tokio::main]
async fn main() -> Result<(), task::JoinError> {
    let a = delay_for(3);
    let b = delay_for(1);

    let (a, b) = future::join(a, b).await;
    let c = a? + b?;
    println!("{}", c);

    Ok(())
}
See also CPU-bound tasks and blocking code in the Tokio documentation.
Additional points
Note that this is not an efficient way of sleeping; it's just a placeholder for some blocking operation. If you actually need to sleep, use something like futures-timer or tokio::time::sleep. See Why does Future::select choose the future with a longer sleep period first? for more details.
neither solution is optimal and don't get the full advantage of the green-threading model
That's correct - because you don't have something that is asynchronous! You are trying to combine two different methodologies and there has to be an ugly bit somewhere to translate between them.
second don't pass through the executor provided by reactor framework
I'm not sure what you mean here. There's an executor implicitly created by block_on or tokio::main. The thread pool has some internal logic that checks to see if a thread is done, but that should only be triggered when the user's executor polls it.

How to use collect(.byTime) or collect(.byTimeOrCount) in Combine

publisher.collect(<#T##strategy: Publishers.TimeGroupingStrategy<Scheduler>##Publishers.TimeGroupingStrategy<Scheduler>#>)
I couldn't find any example anywhere, and the documentation is bland... the free Using Combine book has nothing interesting on this either.
In Xcode 11.3, the code completion doesn't work extremely well, but the format isn't too complex. There are two options (as of iOS 13.3) that are published in the TimeGroupingStrategy enumeration:
byTime
byTimeOrCount
When you specify either of the strategies, you will also need to specify a scheduler that it runs upon, which is one of the parameters to those enumeration cases.
For example, to collect by time, with a 1.0 second interval of collection, using a DispatchQueue, you might use:
let q = DispatchQueue(label: self.debugDescription)
let cancellable = publisher
    .collect(.byTime(q, 1.0))
The byTime version will use an unbounded amount of memory during the specified interval while buffering the values provided by upstream publishers.
The byTimeOrCount strategy takes an additional count parameter, which puts an upper bound on the number of items collected before the buffered collection is sent to subscribers.
For example, the same code with a 10 item max buffer size:
let q = DispatchQueue(label: self.debugDescription)
let cancellable = publisher
    .collect(.byTimeOrCount(q, 1.0, 10))
You can see more specific examples of the code being used, and unit tests verifying how they're operating, in the Using Combine project:
testCollectByTime
testCollectByTimeOrCount

Slow Performance When Using Scalaz Task

I'm looking to improve the performance of a Monte Carlo simulation I am developing.
I first did an implementation which does the simulation of each paths sequentially as follows:
def simulate() = {
  for (path <- 0 to 30000) {
    (0 to 100).foreach(
      x => // do some computation
    )
  }
}
This basically is simulating 30,000 paths and each path has 100 discretised random steps.
The above function runs very quickly on my machine (about 1s) for the calculation I am doing.
I then thought about speeding it up even further by making the code run in a multithreaded fashion.
I decided to use Task for this and I coded the following:
val simulation = (1 |-> 30000).map(n => Task {
  (1 |-> 100).map(x => /* do some computation */)
})
I then use this as follows:
Task.gatherUnordered(simulation).run
When I kick this off, I know my machine is doing a lot of work, as I can see that in the Activity Monitor, and the machine's fan goes ballistic. After about two minutes of heavy activity, the work it seems to be doing finishes, but I don't get any value returned (I am expecting a collection of Doubles from each task that was processed).
My questions are:
Why does this take longer than the sequential example? I am more than likely doing something wrong but I can't see it.
Why don't I get any returned collection of values from the tasks that are apparently being processed?
I'm not sure why Task.gatherUnordered is so slow, but if you change Task.gatherUnordered to Nondeterminism.gatherUnordered everything will be fine:
import scalaz.Nondeterminism
Nondeterminism[Task].gatherUnordered(simulation).run
I'm going to create an issue on GitHub about Task.gatherUnordered. This definitely should be fixed.