Doing thousands of operations in parallel then waiting at the end - swift

I want to speed up a process, so I wrote a Swift CLI script that processes thousands of files in parallel and writes the result for each file into a single output file. (The order of the results does not really matter.)
So I wrote the code below, and it works in the Xcode unit tests (even with a list of approximately 1200 files!). However, when I execute the program from the command line without Xcode, with the same list of files, it never ends. It looks like it gets stuck near the end.
I read that spawning too many threads can sometimes cause a program to stall because it runs out of resources, but I thought DispatchQueue.concurrentPerform would take care of that... I have no clue why this works in XCTests and does not work in the terminal.
I have tried the DispatchGroup and semaphore approaches, and both have the same problem...
Any help is highly appreciated.
let filePaths: [String] = [/* thousands of file paths to process */]
let group = DispatchGroup()
let concurrentQueue = DispatchQueue(label: "my.concurrent.queue", qos: .userInitiated, attributes: .concurrent)
let serialQueue = DispatchQueue(label: "my.serial.queue", qos: .userInitiated)
group.enter()
concurrentQueue.async {
    DispatchQueue.concurrentPerform(iterations: filePaths.count) { (fileIndex) in
        let filePath = filePaths[fileIndex]
        let result = self.processFile(path: filePath)
        group.enter()
        serialQueue.async {
            self.writeResult(result)
            group.leave()
        }
    }
    group.leave()
}
group.wait()

First, a few simplifications:
You have code that is
group.enter()
serialQueue.async {
    self.writeResult(result)
    group.leave()
}
That can be simplified to:
serialQueue.async(group: group) {
    self.writeResult(result)
}
Consider:
group.enter()
concurrentQueue.async {
    DispatchQueue.concurrentPerform(iterations: filePaths.count) { (fileIndex) in
        ...
    }
    group.leave()
}
That concurrentQueue is redundant. This can be simplified to:
DispatchQueue.concurrentPerform(iterations: filePaths.count) { (fileIndex) in
    ...
}
That reduces your code to:
let group = DispatchGroup()
let writeQueue = DispatchQueue(label: "serial.write.queue", qos: .userInitiated)
DispatchQueue.concurrentPerform(iterations: filePaths.count) { [self] index in
    let result = processFile(path: filePaths[index])
    writeQueue.async(group: group) {
        writeResult(result)
    }
}
group.wait()
That raises the question of why you are dispatching asynchronously to a serial queue for the write operations. That can introduce problems (e.g. if it gets backlogged, you will be holding all of the unwritten result values in memory at the same time).
One option is to write synchronously (you have to wait for the write operations in the end, anyway):
let writeQueue = DispatchQueue(label: "serial.write.queue", qos: .userInitiated)
DispatchQueue.concurrentPerform(iterations: filePaths.count) { [self] index in
    let result = processFile(path: filePaths[index])
    writeQueue.sync {
        writeResult(result)
    }
}
Or, if writeResult is itself thread-safe, you can probably just write from the various concurrent threads themselves:
DispatchQueue.concurrentPerform(iterations: filePaths.count) { [self] index in
    let result = processFile(path: filePaths[index])
    writeResult(result)
}
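For reference, here is a minimal, self-contained sketch of the synchronous-write variant. The ResultWriter class and the body of processFile are hypothetical stand-ins (the question does not show the real ones); the writer collects lines in memory rather than touching a real file.

```swift
import Foundation

// Hypothetical stand-in for the real output file: collects lines in memory.
final class ResultWriter: @unchecked Sendable {
    private var lines: [String] = []
    private let writeQueue = DispatchQueue(label: "serial.write.queue")

    // sync: the calling worker waits until its result is stored, so no backlog builds up.
    func write(_ line: String) {
        writeQueue.sync { lines.append(line) }
    }

    var allLines: [String] {
        writeQueue.sync { lines }
    }
}

// Hypothetical per-file work; the real processing is not shown in the question.
func processFile(path: String) -> String {
    "\(path): \(path.count) characters"
}

func processAll(_ filePaths: [String]) -> [String] {
    let writer = ResultWriter()
    DispatchQueue.concurrentPerform(iterations: filePaths.count) { index in
        writer.write(processFile(path: filePaths[index]))
    }
    // concurrentPerform returns only after every iteration has finished,
    // so no group or semaphore is needed here.
    return writer.allLines
}

let paths = (1...1_000).map { "file\($0).txt" }
let lines = processAll(paths)
print(lines.count)
```

Because concurrentPerform is itself blocking, the caller needs no DispatchGroup at all in this arrangement.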

Related

iOS concurrent: how to use OperationQueue instead of barrier?

While learning OperationQueue, I wondered: is it possible to use OperationQueue instead of a GCD barrier?
Here is the case:
upload 3 images, then upload another 3 images.
With a barrier, GCD works perfectly:
func workFlow() {
    let queue = DispatchQueue(label: "test.concurrent.queue", qos: .background, attributes: .concurrent, autoreleaseFrequency: .workItem)
    queue.async {
        self.uploadImg(idx: "A_0")
    }
    queue.async {
        Thread.sleep(forTimeInterval: 2)
        self.uploadImg(idx: "A_1")
    }
    queue.async {
        self.uploadImg(idx: "A_2")
    }
    queue.async(qos: queue.qos, flags: .barrier) {
        print("group A done")
    }
    print("A: should not be hanged")
    queue.async {
        self.uploadImg(idx: "B_0")
    }
    queue.async {
        self.uploadImg(idx: "B_1")
    }
    queue.async {
        self.uploadImg(idx: "B_2")
    }
    queue.async(qos: queue.qos, flags: .barrier) {
        print("group B done")
    }
    print("B: should not be hanged")
}
func uploadImg(idx info: String) {
    Thread.sleep(forTimeInterval: 1)
    print("img \(info) uploaded")
}
With OperationQueue, however, there is a small flaw:
the main thread gets blocked. Just check when these lines are printed:
"A/B: should not be hanged"
lazy var uploadQueue: OperationQueue = {
    var queue = OperationQueue()
    queue.name = "upload queue"
    queue.maxConcurrentOperationCount = 5
    return queue
}()
func workFlow() {
    let one = BlockOperation {
        self.uploadImg(idx: "A_0")
    }
    let two = BlockOperation {
        Thread.sleep(forTimeInterval: 3)
        self.uploadImg(idx: "A_1")
    }
    let three = BlockOperation {
        self.uploadImg(idx: "A_2")
    }
    uploadQueue.addOperations([one, two, three], waitUntilFinished: true)
    print("A: should not be hanged")
    uploadQueue.addOperation {
        print("group A done")
    }
    let four = BlockOperation {
        self.uploadImg(idx: "B_0")
    }
    let five = BlockOperation {
        self.uploadImg(idx: "B_1")
    }
    let six = BlockOperation {
        self.uploadImg(idx: "B_2")
    }
    uploadQueue.addOperations([four, five, six], waitUntilFinished: true)
    print("B: should not be hanged")
    uploadQueue.addOperation {
        print("group B done")
    }
}
How to do it better with OperationQueue?
If you do not want the operations added to the queue to block the current thread, waitUntilFinished must be false. If you set it to true, it blocks the current thread until the added operations finish.
Obviously, if you do not wait, it will not block the main thread, but you will also lose the barrier behavior. Fortunately, iOS 13 and macOS 10.15 introduced addBarrierBlock. If you really need barriers and must support earlier OS versions, then you will have to use dependencies. But if you were previously using GCD barriers simply to constrain the degree of concurrency, then maxConcurrentOperationCount might render the barrier moot. It all depends upon why you were using barriers with these uploads/downloads. (It is a little unusual to see barriers with upload/download queues, as they reduce efficiency.)
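As a rough illustration of the addBarrierBlock approach, here is a sketch that requires iOS 13 / macOS 10.15 or later. The EventLog helper is hypothetical and stands in for print so that the ordering can be inspected; the uploads are reduced to log entries.

```swift
import Foundation

// Hypothetical thread-safe recorder, used instead of print so the order can be checked.
final class EventLog: @unchecked Sendable {
    private var events: [String] = []
    private let lock = NSLock()

    func add(_ event: String) {
        lock.lock()
        defer { lock.unlock() }
        events.append(event)
    }

    var all: [String] {
        lock.lock()
        defer { lock.unlock() }
        return events
    }
}

func runUploads() -> [String] {
    let log = EventLog()
    let finished = DispatchSemaphore(value: 0)
    let uploadQueue = OperationQueue()
    uploadQueue.maxConcurrentOperationCount = 5

    for name in ["A_0", "A_1", "A_2"] {
        uploadQueue.addOperation { log.add("img \(name) uploaded") }
    }
    // Runs only after everything queued so far has finished; later operations wait for it.
    uploadQueue.addBarrierBlock { log.add("group A done") }

    for name in ["B_0", "B_1", "B_2"] {
        uploadQueue.addOperation { log.add("img \(name) uploaded") }
    }
    uploadQueue.addBarrierBlock {
        log.add("group B done")
        finished.signal()
    }

    // Nothing above blocked the calling thread. The wait below exists only so that
    // a command-line run does not exit before the queue drains.
    finished.wait()
    return log.all
}

let events = runUploads()
print(events)
```

In an app you would drop the semaphore entirely and put whatever should happen after group B inside the last barrier block.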
How to do it better with OperationQueue?
I assume that uploadImg performs the upload synchronously. I would refactor it into its own Operation subclass that does the necessary Operation KVO (for isExecuting and isFinished). The same pattern can wrap a download task, an upload task, or a data task (though the memory impact with data tasks is much greater).
But it is always advisable to avoid synchronous network requests, both (a) to make sure you do not tie up worker threads and (b) to make the requests cancelable.

How to make for-in loop wait for data fetch function to complete

I am trying to fetch a bunch of data with a for-in loop, but the data does not come back in the correct order. It looks like some requests take longer than others, so the results get mixed up in an array where I need all of the data in the correct order. So I used DispatchGroup, but it is not working. Can you please let me know what I am doing wrong here? I have spent the past 10 hours searching for a solution... My code is below.
@IBAction func parseXMLTapped(_ sender: Any) {
    let codeArray = codes[0]
    for code in codeArray {
        self.fetchData(code)
    }
    dispatchGroup.notify(queue: .main) {
        print(self.dataToAddArray)
        print("Complete.")
    }
}
private func fetchData(_ code: String) {
    dispatchGroup.enter()
    print("count: \(count)")
    let dataParser = DataParser()
    dataParser.parseData(url: url) { (dataItems) in
        self.dataItems = dataItems
        print("Index #\(self.count): \(self.dataItems)")
        self.dataToAddArray.append(self.dataItems)
    }
    self.dispatchGroup.leave()
    dispatchGroup.enter()
    self.count += 1
    dispatchGroup.leave()
}
The problem with asynchronous functions is that you can never know in which order the blocks return.
If you need to preserve the order, use indices like so:
let dispatchGroup = DispatchGroup()
var dataToAddArray = [String](repeating: "", count: codeArray.count)
for (index, code) in codeArray.enumerated() {
    dispatchGroup.enter()
    DataParser().parseData(url: url) { dataItems in
        dataToAddArray[index] = dataItems
        dispatchGroup.leave()
    }
}
dispatchGroup.notify(queue: .main) {
    print("Complete")
}
Also in your example you are calling dispatchGroup.leave() before the asynchronous block has even finished. That would also yield wrong results.
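To make the index-based idea concrete, here is a self-contained sketch. The parseData function below is a hypothetical stand-in for DataParser (it simply calls back after a random delay), and OrderedResults is an illustrative helper that guards the array with a lock, on the assumption that the callbacks may arrive on different threads.

```swift
import Foundation

// Hypothetical stand-in for DataParser.parseData: calls back on a background
// queue after a random delay, so completions arrive out of order.
func parseData(code: String, completion: @escaping @Sendable (String) -> Void) {
    let delay = Double.random(in: 0.01...0.05)
    DispatchQueue.global().asyncAfter(deadline: .now() + delay) {
        completion("data for \(code)")
    }
}

// One slot per request; a lock guards the array because callbacks run concurrently.
final class OrderedResults: @unchecked Sendable {
    private var slots: [String?]
    private let lock = NSLock()

    init(count: Int) {
        slots = Array(repeating: nil, count: count)
    }

    func set(_ value: String, at index: Int) {
        lock.lock()
        defer { lock.unlock() }
        slots[index] = value
    }

    var values: [String] {
        lock.lock()
        defer { lock.unlock() }
        return slots.compactMap { $0 }
    }
}

func fetchAll(codes: [String]) -> [String] {
    let group = DispatchGroup()
    let results = OrderedResults(count: codes.count)

    for (index, code) in codes.enumerated() {
        group.enter()
        parseData(code: code) { data in
            results.set(data, at: index)   // position is fixed by the request index
            group.leave()                  // leave only once the callback has run
        }
    }

    // A blocking wait is acceptable in a command-line tool;
    // in an app, use group.notify(queue: .main) instead.
    group.wait()
    return results.values
}

print(fetchAll(codes: ["A", "B", "C", "D"]))
// prints ["data for A", "data for B", "data for C", "data for D"]
```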
Using semaphores to eliminate all concurrency solves the order issue, but with a large performance penalty. Dennis has the right idea, namely, rather than sacrificing concurrency, instead, just sort the results.
That having been said, I would probably use a dictionary:
let group = DispatchGroup()
var results: [String: [DataItem]] = [:] // you didn't say what `dataItems` was, so I'll assume it's an array of `DataItem` objects; but this detail isn't material to the broader question
for code in codes {
    group.enter()
    DataParser().parseData(url: url) { dataItems in
        results[code] = dataItems // if parseData doesn't already use the main queue for its completion handler, then dispatch these two lines to the main queue
        group.leave()
    }
}
group.notify(queue: .main) {
    let sortedResults = codes.compactMap { results[$0] } // this very efficiently gets the results in the right order
    // do something with sortedResults
}
Now, I might advise constraining the degree of concurrency (e.g. maybe you want to constrain this to the number of CPUs or to some reasonable fixed number, such as 4 or 6). That is a separate question. But I would advise against sacrificing concurrency just to get the results in the right order.
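One common way to constrain the degree of concurrency is a counting semaphore, sketched below under a few assumptions: simulatedRequest is a hypothetical placeholder for the real asynchronous call, and ConcurrencyTracker exists only to observe how many requests were in flight at once. Because semaphore.wait() blocks, a loop like this belongs on a background queue in an app, never on the main thread.

```swift
import Foundation

// Illustrative helper that records the highest number of simultaneous requests.
final class ConcurrencyTracker: @unchecked Sendable {
    private let lock = NSLock()
    private var current = 0
    private var highest = 0

    func begin() {
        lock.lock()
        defer { lock.unlock() }
        current += 1
        highest = max(highest, current)
    }

    func end() {
        lock.lock()
        defer { lock.unlock() }
        current -= 1
    }

    var peak: Int {
        lock.lock()
        defer { lock.unlock() }
        return highest
    }
}

// Hypothetical asynchronous request: completes on a background queue after a short delay.
func simulatedRequest(completion: @escaping @Sendable () -> Void) {
    DispatchQueue.global().asyncAfter(deadline: .now() + 0.02) {
        completion()
    }
}

func runLimited(requestCount: Int, maxConcurrent: Int) -> Int {
    let semaphore = DispatchSemaphore(value: maxConcurrent)
    let group = DispatchGroup()
    let tracker = ConcurrencyTracker()

    for _ in 0..<requestCount {
        semaphore.wait()        // blocks once maxConcurrent requests are in flight
        group.enter()
        tracker.begin()
        simulatedRequest {
            tracker.end()
            semaphore.signal()  // frees a slot for the next request
            group.leave()
        }
    }

    group.wait()                // wait for the last few requests to complete
    return tracker.peak
}

let peak = runLimited(requestCount: 20, maxConcurrent: 4)
print("peak concurrency: \(peak)")
```

The requests still run concurrently, just never more than maxConcurrent at a time, so this is quite different from using a semaphore to serialize them completely.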
In this case, using DispatchSemaphore:
let semaphore = DispatchSemaphore(value: 0) // must be visible to fetchData (e.g. a property) so that it can be signaled there
DispatchQueue.global().async {
    for code in codeArray {
        self.fetchData(code)
        semaphore.wait()
    }
}
private func fetchData(_ code: String) {
    print("count: \(count)")
    let dataParser = DataParser()
    dataParser.parseData(url: url) { (dataItems) in
        self.dataItems = dataItems
        print("Index #\(self.count): \(self.dataItems)")
        self.dataToAddArray.append(self.dataItems)
        semaphore.signal()
    }
}

Swift: Simple DispatchQueue does not run & notify correctly

What am I doing wrong? In a playground it runs as it should, but as soon as I deploy it on the iOS simulator it prints in the wrong sequence.
@objc func buttonTapped() {
    let group = DispatchGroup()
    let dispatchQueue = DispatchQueue.global(qos: .default)
    for i in 1...4 {
        group.enter()
        dispatchQueue.async {
            print("🔹 \(i)")
        }
        group.leave()
    }
    for i in 1...4 {
        group.enter()
        dispatchQueue.async {
            print("❌ \(i)")
        }
        group.leave()
    }
    group.notify(queue: DispatchQueue.main) {
        print("jobs done by group")
    }
}
Console Output:
I don't get it. 😅
You should put the group.leave() statement in the dispatchQueue.async block as well, otherwise it will be executed synchronously before the async block would finish execution.
@objc func buttonTapped() {
    let group = DispatchGroup()
    let dispatchQueue = DispatchQueue.global(qos: .default)
    for i in 1...4 {
        group.enter()
        dispatchQueue.async {
            print("🔹 \(i)")
            group.leave()
        }
    }
    for i in 1...4 {
        group.enter()
        dispatchQueue.async {
            print("❌ \(i)")
            group.leave()
        }
    }
    group.notify(queue: DispatchQueue.main) {
        print("jobs done by group")
    }
}
As Dávid said, properly employed dispatch groups only ensure that the notification takes place after all of the tasks finish, which you can achieve by calling leave from within the dispatched blocks, as he showed you. Alternatively, since your dispatched tasks are themselves synchronous, you don't have to manually enter and leave the group, but can use the group parameter of the async method:
let group = DispatchGroup()
let queue = DispatchQueue.global(qos: .default)
for i in 1...4 {
    queue.async(group: group) {
        print("🔹 \(i)")
    }
}
for i in 1...4 {
    queue.async(group: group) {
        print("❌ \(i)")
    }
}
group.notify(queue: .main) {
    print("jobs done by group")
}
Use group.enter() and group.leave() when calling some asynchronous method, but in the case of these print statements, you can just use async(group:execute:) as shown above.
Now, we've solved the problem where the "jobs done by group" block didn't wait for all of the dispatched tasks. But, because you're doing all of this dispatching to a concurrent queue (all the global queues are concurrent queues), you have no assurances that your tasks will be performed in the order that you requested. They're queued up in a strict FIFO manner, but because they're concurrent, you have no assurances when you'll hit the respective print statements.
If you need it to print the messages in order, you will have to use a serial queue. For example, if you create your own queue, in the absence of the .concurrent attribute, the following will create a serial queue:
// create serial queue
let queue = DispatchQueue(label: "...")
// but not your own concurrent queue:
//
// let queue = DispatchQueue(label: "...", attributes: .concurrent)
//
// nor one of the global concurrent queues:
//
// let queue = DispatchQueue.global(qos: .default)
//
And if you run the above code with this serial queue, you'll see what you were looking for:
🔹 1
🔹 2
🔹 3
🔹 4
❌ 1
❌ 2
❌ 3
❌ 4
jobs done by group
But, then, again, if you were using a serial queue, the group would be completely unnecessary (you could just add the "completion" task as yet another dispatched task at the end of the serial queue). I only show the use of serial queues as a way to avoid the race condition of dispatching eight tasks to a concurrent queue.
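A minimal sketch of that last point, with no group at all: the "completion" work is simply the last block submitted to the serial queue. The EventLog helper and the runInOrder function are hypothetical names used here so that the order can be checked; the semaphore is only there to keep a command-line run alive until the queue drains.

```swift
import Foundation

// Hypothetical thread-safe recorder, used instead of print so the order can be checked.
final class EventLog: @unchecked Sendable {
    private var events: [String] = []
    private let lock = NSLock()

    func add(_ event: String) {
        lock.lock()
        defer { lock.unlock() }
        events.append(event)
    }

    var all: [String] {
        lock.lock()
        defer { lock.unlock() }
        return events
    }
}

func runInOrder() -> [String] {
    let queue = DispatchQueue(label: "serial.print.queue")  // serial: no .concurrent attribute
    let log = EventLog()
    let finished = DispatchSemaphore(value: 0)

    for i in 1...4 {
        queue.async { log.add("🔹 \(i)") }
    }
    for i in 1...4 {
        queue.async { log.add("❌ \(i)") }
    }
    // The "completion" task is just one more block at the end of the serial queue.
    queue.async {
        log.add("jobs done")
        finished.signal()
    }

    finished.wait()
    return log.all
}

print(runInOrder())
// prints ["🔹 1", "🔹 2", "🔹 3", "🔹 4", "❌ 1", "❌ 2", "❌ 3", "❌ 4", "jobs done"]
```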

How to stop DispatchGroup or OperationQueue waiting?

DispatchGroup and OperationQueue have methods wait() and waitUntilAllOperationsAreFinished() which wait for all operations in respective queues to complete.
But even when I call cancelAllOperations, it just sets the isCancelled flag in every running operation and stops the queue from executing new operations. It still waits for the running operations to complete. Therefore running operations must be stopped from the inside. But that is possible only if the operation is incremental or has an inner loop of some kind. When it is just one long external request (a web request, for example), the isCancelled variable is of no use.
Is there any way of stopping the OperationQueue or DispatchGroup from waiting for the operations to complete if one of the operations decides that the whole queue is now outdated?
The practical case is: mapping a request to a list of responders, where it is known that only one may answer. If that happens, the queue should stop waiting for the other operations to finish and unblock the thread.
Edit: DispatchGroup and OperationQueue usage is not obligatory, these are just tools I thought would fit.
OK, so I think I came up with something. Results are stable, I've just tested. The answer is just one semaphore :)
let semaphore = DispatchSemaphore(value: 0)
let group = DispatchGroup()
let queue = DispatchQueue(label: "map-reduce", qos: .userInitiated, attributes: .concurrent)
let resultQueue = DispatchQueue(label: "map-reduce.result") // serial queue that guards `result`
let stopAtFirst = true // false for all results to be appended into one array
let values: [U] = <some input values>
let mapper: (U) throws -> T? = <closure>
var result: [T?] = []
for value in values {
    queue.async(group: group) {
        do {
            let res = try mapper(value)
            // appending must always be thread-safe, otherwise you end up with a race condition and unstable results;
            // the global queues are concurrent, so a dedicated serial queue is used here
            resultQueue.sync {
                result.append(res)
            }
            if stopAtFirst && res != nil {
                semaphore.signal()
            }
        } catch let error {
            print("Could not map value \"\(value)\" to mapper \(mapper): \(error)")
        }
    }
}
group.notify(queue: queue) { // this must be declared right after submitting all tasks, otherwise the notification fires instantly
    semaphore.signal()
}
if semaphore.wait(timeout: .now() + 5) == .timedOut {
    print("MapReduce timed out on values \(values)")
}

Swift - Semaphore within semaphore [duplicate]

I'm getting into concurrent programming and have run into some semaphore issues.
My function first loads data from the server, analyzes the received info, and then, if necessary, makes a second request to the server.
I tried different ways to make it run, but none of them worked well.
My current code seems correct to me, but on the second request it just locks up (maybe a deadlock), and the last log is "<__NSCFLocalDataTask: 0x7ff470c58c90>{ taskIdentifier: 2 } { suspended }"
Please tell me what I am missing. Maybe there is a more elegant way to work with completions for this purpose?
Thank you in advance!
var users = [Int]()
let linkURL = URL.init(string: "https://bla bla")
let session = URLSession.shared
let semaphore = DispatchSemaphore.init(value: 0)
let dataRequest = session.dataTask(with: linkURL!) { (data, response, error) in
    let json = JSON(data: data!)
    if (json["queue"]["numbers"].intValue > 999) {
        for i in 0...999 {
            users.append(json["queue"]["values"][i].intValue)
        }
        for i in 1...lround(json["queue"]["numbers"].doubleValue / 1000) {
            let session2 = URLSession.shared
            let semaphore2 = DispatchSemaphore.init(value: 0)
            let linkURL = URL.init(string: "https://bla bla")
            let dataRequest2 = session2.dataTask(with: linkURL!) { (data, response, error) in
                let json = JSON(data: data!)
                print(i)
                semaphore2.signal()
            }
            dataRequest2.resume()
            semaphore2.wait(timeout: DispatchTime.distantFuture)
        }
    }
    semaphore.signal()
}
dataRequest.resume()
semaphore.wait(timeout: DispatchTime.distantFuture)
P.S. Why am I doing this? The server returns a limited amount of data per request. To get more, I have to use an offset.
This is deadlocking because you are waiting for a semaphore on the URLSession's delegateQueue. The default delegate queue is not the main queue, but it is a serial background queue (i.e. an OperationQueue with a maxConcurrentOperationCount of 1). So your code is waiting for a semaphore on the same serial queue that is supposed to be signaling the semaphore.
The tactical fix is to make sure you're not calling wait on the same serial queue that the session's completion handlers are running on. There are two obvious fixes:
Do not use the shared session (whose delegateQueue is a serial queue), but rather instantiate your own URLSession and specify its delegateQueue to be a concurrent OperationQueue that you create:
let queue = OperationQueue()
queue.name = "com.domain.app.networkqueue"
let configuration = URLSessionConfiguration.default
let session = URLSession(configuration: configuration, delegate: nil, delegateQueue: queue)
Alternatively, you can solve this by dispatching the code with the semaphore off to some other queue, e.g.
let mainRequest = session.dataTask(with: mainUrl) { data, response, error in
    // ...
    DispatchQueue.global(qos: .userInitiated).async {
        let semaphore = DispatchSemaphore(value: 0)
        for i in 1 ... n {
            let childUrl = URL(string: "https://blabla/\(i)")!
            let childRequest = session.dataTask(with: childUrl) { data, response, error in
                // ...
                semaphore.signal()
            }
            childRequest.resume()
            _ = semaphore.wait(timeout: .distantFuture)
        }
    }
}
mainRequest.resume()
For the sake of completeness, I'll note that you probably shouldn't be using semaphores to issue these requests at all, because you'll end up paying a material performance penalty for issuing a series of consecutive requests (plus you're blocking a thread, which is generally discouraged).
Refactoring the code to do that is a little more involved. It basically entails issuing a series of concurrent requests, perhaps using "download" tasks rather than "data" tasks to minimize memory impact, and then, when all of the requests are done, piecing it all together as needed at the end (triggered by either an Operation "completion" operation or a dispatch group notification).
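A rough sketch of that shape, with the networking replaced by a simulation: fetchPage is a hypothetical stand-in for a paged request (it returns the integers in the requested range after a random delay), and PageStore and fetchAllUsers are illustrative names, not part of any API. The pages are requested concurrently, stored by offset, and stitched together in the dispatch group notification.

```swift
import Foundation

// Hypothetical paged endpoint: returns `limit` consecutive values starting at
// `offset`, calling back on a background queue after a random delay.
func fetchPage(offset: Int, limit: Int, completion: @escaping @Sendable ([Int]) -> Void) {
    let delay = Double.random(in: 0.01...0.05)
    DispatchQueue.global().asyncAfter(deadline: .now() + delay) {
        completion(Array(offset..<(offset + limit)))
    }
}

// Pages keyed by offset; a lock guards the dictionary because callbacks run concurrently.
final class PageStore: @unchecked Sendable {
    private var pages: [Int: [Int]] = [:]
    private let lock = NSLock()

    func save(_ page: [Int], forOffset offset: Int) {
        lock.lock()
        defer { lock.unlock() }
        pages[offset] = page
    }

    func assembled() -> [Int] {
        lock.lock()
        defer { lock.unlock() }
        return pages.keys.sorted().flatMap { pages[$0] ?? [] }
    }
}

func fetchAllUsers(total: Int, pageSize: Int, completion: @escaping @Sendable ([Int]) -> Void) {
    let group = DispatchGroup()
    let store = PageStore()

    for offset in stride(from: 0, to: total, by: pageSize) {
        group.enter()
        let limit = min(pageSize, total - offset)
        fetchPage(offset: offset, limit: limit) { page in
            store.save(page, forOffset: offset)
            group.leave()
        }
    }

    // Fires once every page has arrived; in an app you would typically notify on .main.
    group.notify(queue: .global()) {
        completion(store.assembled())
    }
}

func demo() {
    let finished = DispatchSemaphore(value: 0)
    fetchAllUsers(total: 3_000, pageSize: 1_000) { users in
        print("received \(users.count) users")
        finished.signal()
    }
    finished.wait()   // only needed so a command-line run stays alive
}
demo()
```

No thread is blocked while the requests are in flight, and the total time is roughly that of the slowest page rather than the sum of all of them.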