I have 2 concurrent queues:
let concurrentQueue = DispatchQueue(label: "test.concurrent", attributes: .concurrent)
let syncQueue = DispatchQueue(label: "test.sync", attributes: .concurrent)
And code:
for index in 1...65 {
    concurrentQueue.async {
        self.syncQueue.async(flags: .barrier) {
            print("WRITE \(index)")
        }
        self.syncQueue.sync {
            print("READ \(index)")
        }
    }
}
Outputs:
WRITE 1
READ 1
Why, where, and how does it deadlock?
With an iteration count below 65, everything works fine.
This pattern (async writes with barrier, concurrent reads) is known as the “reader-writer” pattern. This particular multithreaded synchronization mechanism can deadlock in thread explosion scenarios.
In short, it deadlocks because:
You have “thread explosion”;
You have exhausted the worker thread pool, which only has 64 threads;
Your dispatched item has two potentially blocking calls, not only the sync, which obviously can block, but also the concurrent async (see next point); and
When you hit a dispatch, if there is not an available worker thread in the pool, it will wait until one is made available (even if dispatching asynchronously).
The key observation is that one should simply avoid unbridled thread explosion. Generally we reach for tools such as GCD's concurrentPerform (a parallel for loop that is constrained to the maximum number of CPU cores), operation queues (which can be controlled through a judicious maxConcurrentOperationCount setting), or Swift concurrency (use its cooperative thread pool to control the degree of concurrency, actors for synchronization, etc.).
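For illustration, here is a rough sketch of the question's loop rewritten with concurrentPerform (reusing the queues from the question; a sketch, not a drop-in fix):

// concurrentPerform constrains parallelism to the number of CPU cores,
// so the 65 iterations cannot exhaust the 64-thread worker pool.
// It blocks the calling thread, hence the outer async.
DispatchQueue.global().async {
    DispatchQueue.concurrentPerform(iterations: 65) { index in
        syncQueue.async(flags: .barrier) {
            print("WRITE \(index)")
        }
        syncQueue.sync {
            print("READ \(index)")
        }
    }
}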
While the reader-writer pattern has intuitive appeal, in practice it simply introduces complexities (synchronizing a multithreaded environment with yet another multithreaded mechanism, both of which are constrained by surprisingly small GCD worker thread pools) without many practical benefits. Benchmark it and you will see that it is negligibly faster than a simple serial GCD queue, and much slower than lock-based approaches.
The < 65 observation immediately makes me think you're hitting the queue width limit, which is undocumented (for good reason), but widely understood to be 64. A while back, I wrote a very detailed answer about this queue width limit over here. You should check it out.
I do have some relevant ideas I can share:
The first thing I would recommend would be replacing the print() calls with something that does not trigger I/O. You could create a numReads variable and a numWrites variable, and then use something like an atomic compare and swap operation to increment them, then read them after the loop completes and make sure they're what you expected. See Swift Atomics over here. You could also just dispatch the print operations (async) to the main queue. That would also eliminate any I/O problems.
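For example, a minimal sketch of that counter idea, assuming the swift-atomics package and the two queues from the question:

import Atomics

let numWrites = ManagedAtomic<Int>(0)
let numReads = ManagedAtomic<Int>(0)

for _ in 1...65 {
    concurrentQueue.async {
        syncQueue.async(flags: .barrier) {
            numWrites.wrappingIncrement(ordering: .relaxed)  // was print("WRITE ...")
        }
        syncQueue.sync {
            numReads.wrappingIncrement(ordering: .relaxed)   // was print("READ ...")
        }
    }
}
// Once the work settles, both counters should read 65.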
I'll also note here that by introducing a barrier block on every single iteration of the outer for loop, you're effectively making that queue serial, but with a non-deterministic order. At the very least, you're creating a ton of contention for it, which is suboptimal. Reader/Writer locks tend to make the most sense when there are many more reads than writes, but your code here has a 1:1 ratio of reads to writes. If that is what your real use case looks like, you should just use a serial queue (or a mutex) since it would achieve the same effect, but without the contention.
At the risk of being "that guy", can you maybe elaborate on what you're trying to accomplish here? If you're just trying to prove to yourself that reader/writer locks emulated using barrier blocks work, I can assure you that they do.
There is a list of parameters, each of which is an input for a MongoDB query.
Queries might be different, but for simplicity let's keep only one encapsulated in callMongoReactive.
fun callMongoReactive(p: Param): Mono<Result> {
    // ...
}
fun queryInParallel(parameters: List<Param>): List<Result> =
    parameters
        .map { async { mongo.findSomething(it).awaitSingle() } }
        .awaitAll()
Suppose the parameters list size is not greater than 20.
What is the optimal strategy for using coroutine dispatchers so that these async requests run in parallel?
Should a custom dispatcher (with a dedicated thread pool) be created, or is there a standard approach for such situations?
If you want a very general answer, I'd say you may or may not want to care at this level. You don't have to.
In this specific case, the original API is reactive (it immediately returns a Mono), which means the actual work is not going to be performed on the thread calling callMongoReactive/findSomething, but on whatever thread pool MongoDB decided to use to back this Mono. This means the choice of dispatcher really doesn't matter here.
So especially in this case, I'd go for the simplest option: don't choose. Use coroutineScope and expose a suspend function so the caller decides on the coroutine context (including the dispatcher):
suspend fun queryInParallel(parameters: List<Param>): List<Result> = coroutineScope {
    parameters
        .map { async { mongo.findSomething(it).awaitSingle() } }
        .awaitAll()
}
This is the usual idiom for "parallel decomposition of work". I'd say it's the most common.
What is the optimal strategy for using coroutine dispatchers so that these async requests run in parallel?
It's worth noting that the idiom above expresses concurrency, but whether the bodies of the asyncs will be run in parallel or not depends on the dispatcher chosen by the caller.
(To reiterate, the dispatcher only affects the body of those asyncs, and in this specific case they don't use the thread that much because they call a non-blocking method anyway, so it really doesn't matter.)
Now in cases where it does matter, any dispatcher backed by more than 1 thread would allow parallelism here. There are several existing dispatchers that may be useful, without needing to create your own thread pool. Dispatchers.Default has a number of threads adapted to the number of cores of the machine it's running on, so it's a good fit for CPU-bound work. Dispatchers.IO scales the number of threads as needed, which is useful if you have a lot of blocking IO and want to avoid starvation.
You can also use limitedParallelism on any dispatcher to get a view of it with only a limited number of threads, which may be useful in some cases where you don't want to create an extra thread pool, but you do want to limit the number of available threads more than what the built-in dispatchers offer.
Creating a custom thread pool can be interesting if you want to isolate parts of your system in case one subsystem misbehaves and starves threads. It does have a memory overhead, though, since you're creating more threads.
After reading about Concurrent and Serial queues, sync and async, I think I have an idea about how to create queues and the order they are executed in. My problem is that in any of the tutorials I have seen, none of them actually tell you many use cases. For example:
I have a network manager that uses URLSession and serializes JSON to send a request to my API. Does it make sense to wrap it in a .utility queue or a .userInitiated queue, or do I just not wrap it in a queue at all?
let task = LoginTask(username: username, password: password)

let networkQueue = DispatchQueue(label: "com.messenger.network",
                                 qos: DispatchQoS.userInitiated)
networkQueue.async {
    task.dataTask(in: dispatcher) { (user, httpCode, error) in
        self.presenter?.loginUserResponse(user: user, httpCode: httpCode, error: error)
    }
}
My question is: Are there any guidelines I can follow to know when there is a need to use queues or not? I can't find this information anywhere. I realise Apple provides example usage, however it is very vague.
Dispatch queues are used in a multitude of use cases, so it's hard to enumerate them, but two very common use cases are as follows:
You have some expensive and/or time-consuming process that you want to run on some thread other than the current thread. Often this is used when you're on the main thread and you want to run something on a background thread.
A good example of this would be image manipulation, which is a notoriously computationally (and memory) intensive process. So, you'd create a queue for image manipulation and then you'd dispatch each image manipulation task to that queue. You might also dispatch the UI update when it's done back to the main queue (because all UI updates must happen on the main thread). A common pattern would be:
imageQueue.async {
    // manipulate the image here

    // when done, update the UI:
    DispatchQueue.main.async {
        // update the UI and/or model objects on the main thread
    }
}
You have some shared resource (it could be a simple variable, or some interaction with another shared resource like a file or database) that you want to synchronize regardless of which thread accesses it. This is often part of a broader strategy of making something that is not inherently thread-safe behave in a thread-safe manner.
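For instance, here is a minimal sketch of this use case (the type and names are illustrative), where a private serial queue makes a dictionary thread-safe:

final class ScoreStore {
    private var scores: [String: Int] = [:]
    private let queue = DispatchQueue(label: "com.example.scores") // serial by default

    func set(_ value: Int, for key: String) {
        queue.async { self.scores[key] = value }   // writes are serialized
    }

    func value(for key: String) -> Int? {
        queue.sync { scores[key] }                 // reads wait their turn
    }
}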
The virtue of dispatch queues is that they greatly simplify writing multi-threaded code, an otherwise very complicated technology.
The thing is that your example, initiating a network request, already runs the request on a background thread and URLSession manages all of this for you, so there's little value in using queues for that.
In the interest of full disclosure, there is a surprising variety of different tools using GCD directly (e.g. dispatch groups or dispatch sources) or indirectly (e.g. operation queues), above and beyond the basic dispatch queues discussed above:
Dispatch groups: Sometimes you will initiate a series of asynchronous tasks and you want to be notified when they're all done. You can use a dispatch group (see https://stackoverflow.com/a/28101212/1271826 for a random example). This saves you from having to keep track of when all of these tasks are done yourself.
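A minimal sketch of the idea (urls and fetch(_:completion:) are illustrative stand-ins for your own asynchronous tasks):

let group = DispatchGroup()

for url in urls {
    group.enter()                      // register one outstanding task
    fetch(url) { result in
        // handle the result...
        group.leave()                  // mark this task complete
    }
}

group.notify(queue: .main) {
    print("All requests finished")     // runs once every enter() has a matching leave()
}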
Dispatch "apply" (now called concurrentPerform): Sometimes when you're running some massively parallel task, you want to use as many threads as you reasonably can. So concurrentPerform lets you effectively perform a for loop in parallel, and Apple has optimized it for the number of cores and CPUs your particular device, while not flooding it with too many concurrent tasks at any one time, exhausting the limited number of worker threads. See the https://stackoverflow.com/a/39949292/1271826 for an example of running a for loop in parallel.
Dispatch sources:
For example, if you have some background task that is doing a lot of work and you want to update the UI with the progress, sometimes those UI updates can come more quickly than the UI can handle them. So you can use a dispatch source (a DispatchSourceUserDataAdd) to decouple the background process from the UI updates. See aforementioned https://stackoverflow.com/a/39949292/1271826 for an example.
Traditionally, a Timer runs on the main run loop. Sometimes, though, you want to run it on a background thread, and doing that with a Timer is complicated. With a DispatchSourceTimer (a GCD timer) you can run a timer on a queue other than the main queue. See https://stackoverflow.com/a/38164203/1271826 for an example of how to create and use a dispatch timer. Dispatch timers also can be used to avoid some of the strong reference cycles that are easily introduced with target-based Timer objects.
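A minimal GCD timer sketch along those lines:

let timerQueue = DispatchQueue(label: "com.example.timer")
let timer = DispatchSource.makeTimerSource(queue: timerQueue)
timer.schedule(deadline: .now(), repeating: 1.0)
timer.setEventHandler {
    print("tick")                      // runs on timerQueue, not the main thread
}
timer.resume()
// Keep a strong reference to `timer`; it stops when deallocated.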
Barriers: Sometimes when using a concurrent queue, you want most things to run concurrently, but for other things to run serially with respect to everything else on the queue. A barrier is a way to say "add this task to the queue, but make sure it doesn't run concurrently with respect to anything else on that queue."
An example of a barrier is the reader-writer pattern, where reading from some memory resource can happen concurrently with respect to all other reads, but any writes must not happen concurrently with respect to anything else on the queue. See https://stackoverflow.com/a/28784770/1271826 or https://stackoverflow.com/a/45628393/1271826.
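For instance, a minimal sketch of the reader-writer pattern (illustrative names):

final class SynchronizedDictionary {
    private var storage: [String: Int] = [:]
    private let queue = DispatchQueue(label: "com.example.rw", attributes: .concurrent)

    func value(for key: String) -> Int? {
        queue.sync { storage[key] }            // reads run concurrently with each other
    }

    func set(_ value: Int, for key: String) {
        queue.async(flags: .barrier) {         // a write waits for in-flight reads,
            self.storage[key] = value          // then runs exclusively
        }
    }
}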
Dispatch semaphores: Sometimes you need to let two tasks running on separate threads communicate to each other. You can use semaphores for one thread to "wait" for the "signal" from another.
One common application of semaphores is to make an inherently asynchronous task behave in a more synchronous manner.
networkQueue.async {
    let semaphore = DispatchSemaphore(value: 0)
    let task = session.dataTask(with: url) { data, _, error in
        // process the response
        // when done, signal that we're done
        semaphore.signal()
    }
    task.resume()
    _ = semaphore.wait(timeout: .distantFuture)
}
The virtue of this approach is that the dispatched task won't finish until the asynchronous network request is done. So if you needed to issue a series of network requests, but not have them run concurrently, semaphores can accomplish that.
Semaphores should be used sparingly, though, because they're inherently inefficient (generally blocking one thread while it waits for another). Also, make sure you never wait for a semaphore on the main thread (because that defeats the purpose of having the asynchronous task). That's why in the above example, I'm waiting on the networkQueue, not the main queue. All of this having been said, there are often better techniques than semaphores, but they are sometimes useful.
Operation queues: Operation queues are built on top of GCD dispatch queues, but offer some interesting advantages including:
The ability to wrap an inherently asynchronous task in a custom Operation subclass. (This avoids the disadvantages of the semaphore technique I discussed earlier.) Dispatch queues are generally used when running inherently synchronous tasks on a background thread, but sometimes you want to manage a bunch of tasks that are, themselves, asynchronous. A common example is the wrapping of asynchronous network requests in an Operation subclass.
The ability to easily control the degree of concurrency. Dispatch queues can be either serial or concurrent, but it's cumbersome to design a control mechanism that says, for example, "run the queued tasks concurrently with respect to each other, but no more than four at any given time." Operation queues make this much easier with the use of maxConcurrentOperationCount. (See https://stackoverflow.com/a/27022598/1271826 for an example, and the sketch after this list.)
The ability to establish dependencies between various tasks (e.g. you might have a queue for downloading images and another queue for manipulating the images). With operation queues you can have one operation for the downloading of an image and another for the processing of the image, and you can make the latter dependent upon the completion of the former.
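Here is a minimal sketch combining those last two points (imageURLs and the operation bodies are illustrative):

let downloadQueue = OperationQueue()
downloadQueue.maxConcurrentOperationCount = 4   // at most four downloads at a time

let processQueue = OperationQueue()

for url in imageURLs {
    let download = BlockOperation {
        // download the image at `url`
    }
    let process = BlockOperation {
        // manipulate the downloaded image
    }
    process.addDependency(download)             // process only after its download finishes
    downloadQueue.addOperation(download)
    processQueue.addOperation(process)
}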
There are lots of other GCD related applications and technologies, but these are a few that I use with some frequency.
I usually use serial queues as a mechanism of locking to make sure that one resource can be accessed by many different threads without having problems. But, I have seen cases where other devs use concurrent queues with or even without semaphores (saw IBM/Swift on Linux using concurrent queue with semaphore).
Are there any advantages/disadvantages? I would believe that just using serial queues would correctly block the resource without wasting time for semaphores.
On the other hand, what happens when the CPU is busy? If I remember correctly, a serial queue is not necessarily executed on the same thread/same CPU, right?
That would be the only explanation I can think of; a concurrent queue would be able to spread the workload across all available threads/CPUs, ensuring thread-safe access through the semaphore.
Using a concurrent queue without a semaphore would not be safe, right?
Concurrent queues with semaphores give you more granularity as to what conditions require locking. You can have most of the functions be executed in parallel, with only the mutually exclusive regions (the critical regions) requiring locking.
However, the same effect can be achieved with a concurrent queue whose critical regions are dispatched to a serial queue, ensuring mutual exclusion.
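For instance, a minimal sketch of that arrangement (illustrative names):

let workQueue = DispatchQueue(label: "com.example.work", attributes: .concurrent)
let criticalQueue = DispatchQueue(label: "com.example.critical") // serial

var results = [Int]()
for item in 0..<100 {
    workQueue.async {
        let value = item * item        // the expensive part runs in parallel
        criticalQueue.sync {           // mutual exclusion only around the shared state
            results.append(value)
        }
    }
}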
I would believe that just using serial queues would correctly block the resource without wasting time for semaphores.
Serial queues also rely on semaphores under the hood, since mutations to the queue itself must be synchronized. However, this is tucked under the rug, and it protects you from the many easy-to-make mistakes associated with manual semaphore use.
Using a concurrent queue without a semaphore would not be safe, right?
Nope, it would not be safe.
With reference to the third point in this accepted answer, are there any cases for which it would be pointless or bad to use blocking for a long-running computation, whether CPU- or IO-bound, that is being executed 'within' a Future?
It depends on the ExecutionContext your Future is being executed in.
Pointless:
If the ExecutionContext is not a BlockContext, then using blocking will be pointless. That is, it would use the DefaultBlockContext, which simply executes the code without any special handling. It probably wouldn't add that much overhead, but pointless nonetheless.
Bad:
Scala's ExecutionContext.Implicits.global is made to spawn new threads in a ForkJoinPool when the thread pool is about to be exhausted; that is, when it knows that is going to happen, which blocking is what tells it. This can be bad if you're spawning lots of threads: if you're queuing up a lot of work in a short span of time, the global context will happily expand until gridlock. @dk14's answer explains this in more depth, but the gist is that it can be a performance killer, as managed blocking can quickly become unmanageable.
The main purpose of blocking is to avoid deadlocks within thread pools, so it is tangentially related to performance in the sense that reaching a deadlock would be worse than spawning a few more threads. However, it is definitely not a magical performance enhancer.
I've written more about blocking in particular in this answer.
In my experience, blocking + ForkJoinPool may lead to continuous and uncontrollable creation of threads if you have a lot of messages to process and each one requires long blocking (which also means each holds some memory while blocked). ForkJoinPool creates a new thread to compensate for the "managed blocked" one, regardless of MaxThreadCount; say hello to hundreds of threads in VisualVM. And it almost kills backpressure, as there is always a place for a task in the pool's queue (if your backpressure is based on ThreadPoolExecutor's policies). Performance is killed by both new-thread allocation and garbage collection.
So:
It's good when the message rate is not much higher than 1/blocking_time, as it allows you to use the full power of your threads. Some smart backpressure might help to slow down incoming messages.
It's pointless if a task actually uses your CPU during blocking {} (no locks), as it will just push the thread count beyond the number of real cores in the system.
And it's bad for any other case; you should use a separate fixed thread pool (and maybe polling) instead.
P.S. blocking is hidden inside Await.result, so it's not always obvious. In our project, someone simply did such an Await inside some underlying worker actor.
I have been reading about MCS locks, which I feel are pretty cool. Now that I know how they're implemented, the next question is when to use them. Below are my thoughts. Please feel free to add items to the list:
1) Ideally, they should be used when there are more than two threads to synchronise.
2) MCS locks reduce the number of cache lines that have to be invalidated. In the worst case, the cache lines of only two CPUs are invalidated.
Is there anything else I'm missing?
Also, can MCS be used to implement a mutex instead of a spinlock?
Code will benefit from using an MCS lock when there is high lock contention, i.e., many threads attempting to acquire the lock simultaneously. With a simple spin lock, all threads poll a single shared variable, whereas MCS forms a queue of waiting threads in which each thread spins on a flag in its own queue node, a flag its predecessor flips when handing the lock over. Hence cache-coherency traffic is much lighter, since the waiting is performed "locally".
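To make the mechanics concrete, here is a rough MCS spinlock sketch in Swift (assuming the swift-atomics package; illustrative, not production code):

import Atomics

final class MCSNode: AtomicReference {
    let locked = ManagedAtomic<Bool>(true)   // true while this waiter must spin
    let next = ManagedAtomic<MCSNode?>(nil)  // our successor, once it links in
}

final class MCSLock {
    private let tail = ManagedAtomic<MCSNode?>(nil)

    // Pass a fresh node per acquisition (or reset it before reuse).
    func lock(_ node: MCSNode) {
        // Swap ourselves in as the new tail of the waiter queue.
        if let predecessor = tail.exchange(node, ordering: .acquiringAndReleasing) {
            // Lock is held: link in behind the predecessor and spin on
            // our own flag until the predecessor hands the lock over.
            predecessor.next.store(node, ordering: .releasing)
            while node.locked.load(ordering: .acquiring) { /* spin locally */ }
        }
        // No predecessor means the lock was free; we hold it immediately.
    }

    func unlock(_ node: MCSNode) {
        if node.next.load(ordering: .acquiring) == nil {
            // No visible successor: try to mark the lock free.
            if tail.compareExchange(expected: node, desired: nil,
                                    ordering: .acquiringAndReleasing).exchanged {
                return
            }
            // A successor is mid-linking; wait for the link to appear.
            while node.next.load(ordering: .acquiring) == nil { /* spin */ }
        }
        // Hand the lock over by flipping the successor's local flag.
        node.next.load(ordering: .acquiring)!.locked.store(false, ordering: .releasing)
    }
}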
Using MCS to implement a mutex doesn't really make sense.
In a mutex, waiting threads are usually queued by the OS and descheduled, so there's no polling whatsoever. For example, check out pthread's mutex implementation.
I think the other answer by @CodeMoneky1 doesn't really explain "Also, can MCS be used to implement a mutex instead of a spinlock?"
A mutex is implemented using a spinlock + counter + wait queue. Here the spinlock is usually a test-and-set primitive, or Peterson's solution. I would actually agree that MCS could be an alternative; the reason it is not used is probably that the gain is limited. After all, the scope of the spinlock used inside a mutex is much smaller.