Tail recursion when loading lots of items - Swift

I need to load a lot of small files from an API that allows me to load only one file at a time. As they are very small, I start several downloads at a time. Depending on the result, I start the next batch load.
For each request I use an observable and then combine several with combineLatest. After combineLatest I do a flatMap and concat a new call to the same function.
As an abstraction I do this (pseudocode, not compiling):
func loadRecursively(_ items: [Item]) -> Observable<XY> {
    // one observable per request in the current batch
    Observable.combineLatest(requestObservables)
        .flatMap { loadedItems in
            // recurse with whatever is still left to load
            loadRecursively(items - loadedItems)
        }
}
This works perfectly in general.
The problem: this leads to a growing recursive tail which, it seems, is not eliminated by compiler optimisation. So when loading some thousands of files, the stack grows until the app finally crashes.
How would I avoid the growing tail? Or, in general, how would I approach this problem with Rx?

RxSwift has the concatMap operator (added because people have faced this same problem), which allows you to loop through your Observables sequentially.
Simple example:
Observable.from([1, 2, 3, 4])
    .concatMap(Observable.just)
    .subscribe(onNext: {
        print($0)
    })
    .disposed(by: bag)
Prints:
1
2
3
4
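Applied to the batch-loading question above, the same operator removes the recursion entirely: split the items into batches up front and let concatMap drive them one after another, so the stack stays flat. A sketch only; Item, XY, and load(_:) are placeholders for the asker's own types:
import RxSwift

// Sketch: iterate over precomputed batches instead of recursing.
// `Item`, `XY`, and `load(_:)` stand in for the asker's types.
func loadAll(_ items: [Item], batchSize: Int = 5) -> Observable<XY> {
    // split the work into fixed-size batches up front
    let batches = stride(from: 0, to: items.count, by: batchSize).map {
        Array(items[$0 ..< min($0 + batchSize, items.count)])
    }
    return Observable.from(batches)
        .concatMap { batch in
            // each batch downloads in parallel; concatMap waits for the
            // whole batch to complete before starting the next one
            Observable.merge(batch.map(load))
        }
}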

Related

Parallelism within concurrentPerform closure

I am looking to implement concurrency inside part of my app in order to speed up processing. The input array can be large, and I need to check multiple things related to it. This is some sample code.
EDITED:
So this is helpful for looking at striding through the array, which was something else I was looking at doing, but I think the helpful answers are sliding away from the original question, because I already have a DispatchQueue.concurrentPerform present in the code.
Within that loop I was looking to run further for loops, because I have to re-examine the same data multiple times. The inputArray is an array of structs: in the outer loop I look at one value in the struct, and in the inner loops I look at a different value. In the change below I made the two inner for loops into function calls to make the code a bit clearer. In general, I want to make the funcA and funcB calls and wait until they are both done before continuing in the main loop.
// assume the start and end values will be within the bounds of the
// array and won't under/overflow
private func funcA(inputArray: [Int], startValue: Int, endValue: Int) -> Bool {
    for index in startValue...endValue {
        let dataValue = inputArray[index]
        if dataValue == 1_000_000 {
            return true
        }
    }
    return false
}

private func funcB(inputArray: [Int], startValue: Int, endValue: Int) -> Bool {
    for index in startValue...endValue {
        let dataValue = inputArray[index]
        if dataValue == 10 {
            return true
        }
    }
    return false
}

private func testFunc(inputArray: [Int]) {
    let dataIterationArray = Array(Set(inputArray))
    let syncQueue = DispatchQueue(label: "syncQueue")
    DispatchQueue.concurrentPerform(iterations: dataIterationArray.count) { index in
        // I want to start these two function calls roughly one after another,
        // working them in parallel, but wait until both are complete before
        // moving on. funcA is going to take much longer than funcB here,
        // just because there are more values to check.
        let funcAResult = funcA(inputArray: dataIterationArray, startValue: 10, endValue: 2_000_000)
        let funcBResult = funcB(inputArray: dataIterationArray, startValue: 5, endValue: 9)
        // wait for both of the above to finish before continuing
        if funcAResult && funcBResult {
            print("Yup we are good!")
        } else {
            print("Nope")
        }
        // and then wait here until all of the loops are done before processing
    }
}
In your revised question, you contemplated a concurrentPerform loop where each iteration called funcA and then funcB, and suggested that you wanted “to work them in parallel”.
Unfortunately, that is not how concurrentPerform works. It runs the separate iterations in parallel, but the code within the closure should be synchronous and run sequentially. If the closure introduces additional parallelism, that will adversely affect how concurrentPerform reasons about how many worker threads it should use.
Before we consider some alternatives, let us see what will happen if funcA and funcB remain synchronous. In short, you will still enjoy parallel execution benefits.
Below, I logged this with “Points of Interest” intervals in Instruments, and funcA (in green) never runs concurrently with funcB (in purple) for the same iteration (i.e., for the same range of start and end indices). In this example, I am processing an array with 180 items, striding 10 items at a time, ending up with 18 iterations running on an iPhone 12 Pro Max with six cores.
But, as you can see, although funcB for a given range of indices will not start until funcA finishes for the same range of indices, it does not really matter, because we are still enjoying full parallelism on the device, taking advantage of all the CPU cores.
I contend that, given that we are already enjoying parallelism, there is little benefit in making funcA and funcB run concurrently with respect to each other, too. Just let the individual iterations run parallel to each other, let A and B run sequentially, and call it a day.
If you really want to have funcA and funcB run parallel with each other, as well, you will need to consider a different pattern. The concurrentPerform simply is not designed for launching parallel tasks that, themselves, are asynchronous. You could consider:
Have concurrentPerform launch, using my example, 36 iterations, half of which do funcA and half of which do funcB.
Or you might consider using OperationQueue with a reasonable maxConcurrentOperationCount (but you do not enjoy the dynamic limiting of the degree of concurrency to the device’s CPU cores).
Or you might use an async-await task group, which will limit itself to the cooperative thread pool (see the sketch after the next paragraph).
But you will not want to have concurrentPerform have a closure that launches asynchronous tasks or introduces additional parallel execution.
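For the task-group option, here is a minimal sketch, assuming funcA and funcB as defined in the question (treated here as free functions) and a hypothetical precomputed list of (start, end) ranges:
// Sketch: one child task per range; within each child task, funcA and
// funcB run as parallel grandchild tasks via `async let`.
func checkAll(inputArray: [Int], ranges: [(Int, Int)]) async -> [Bool] {
    await withTaskGroup(of: Bool.self) { group -> [Bool] in
        for (start, end) in ranges {
            group.addTask {
                async let a = funcA(inputArray: inputArray, startValue: start, endValue: end)
                async let b = funcB(inputArray: inputArray, startValue: start, endValue: end)
                let (resultA, resultB) = await (a, b)
                return resultA && resultB
            }
        }
        var results = [Bool]()
        for await result in group {
            results.append(result)
        }
        return results
    }
}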
And, as I discuss below, the example provided in the question is not a good candidate for parallel execution. Mere tests of equality are not computationally intensive enough to enjoy parallelism benefits. It will undoubtedly just be slower than the serial pattern.
My original answer, below, outlines the basic concurrentPerform considerations.
The basic idea is to “stride” through the values. So calculate how many “iterations” are needed and calculate the “start” and “end” index for each iteration:
private func testFunc(inputArray: [Int]) {
    DispatchQueue.global().async {
        let array = Array(Set(inputArray))
        let syncQueue = DispatchQueue(label: "syncQueue")

        // calculate how many iterations will be needed
        let count = array.count
        let stride = 10
        let (quotient, remainder) = count.quotientAndRemainder(dividingBy: stride)
        let iterations = remainder == 0 ? quotient : quotient + 1

        // now iterate
        DispatchQueue.concurrentPerform(iterations: iterations) { iteration in
            // calculate the `start` and `end` indices
            let start = stride * iteration
            let end = min(start + stride, count)

            // now loop through that range
            for index in start ..< end {
                let value = array[index]
                print("iteration =", iteration, "index =", index, "value =", value)
            }
        }

        // you won't get here until they're all done; obviously, if you
        // want to now update your UI or model, you may want to dispatch
        // back to the main queue, e.g.,
        //
        // DispatchQueue.main.async {
        //     ...
        // }
    }
}
Note, if something is so slow that it merits concurrentPerform, you probably want to dispatch the whole thing to a background queue, too. Hence the DispatchQueue.global().async {…} shown above. You would probably want to add a completion handler to this method, now that it runs asynchronously, but I will leave that to the reader.
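A minimal sketch of that completion-handler variant; the [Int] result payload is just a hypothetical placeholder:
private func testFunc(inputArray: [Int], completion: @escaping ([Int]) -> Void) {
    DispatchQueue.global().async {
        var results = [Int]()
        // ... the concurrentPerform work shown above, accumulating into `results` ...
        DispatchQueue.main.async {
            completion(results)   // report back on the main queue
        }
    }
}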
Needless to say, there are quite a few additional considerations:
The stride should be large enough to ensure there is enough work on each iteration to offset the modest overhead introduced by multithreading. Some experimentation is often required to empirically determine the best striding value.
The work done in each thread must be significant (again, to justify the multithreading overhead). I.e., simply printing values is obviously not enough. (Worse, print statements compound the problem by introducing a hidden synchronization.) Even building a new array with some simple calculation will not be sufficient. This pattern really only works if you are doing something very computationally intensive.
You have a “sync” queue, which suggests you understand that you need to synchronize combining the results of the various iterations. That is good. I will point out, though, that you will want to minimize the total number of synchronizations you do. E.g., let’s say you have 1,000 values and end up doing 10 iterations, each striding through 100 values. You generally want each iteration to build a local result and perform a single synchronization, as sketched below. With those numbers, you should strive to end up with only 10 total synchronizations, not 1,000 of them; otherwise, excessive synchronization can easily negate any performance gains.
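Here is a sketch of that “one synchronization per iteration” shape, using the 1,000 values and 10 iterations from the example above (the doubling is just hypothetical work):
import Foundation

let array = Array(1...1_000)                    // sample data
let syncQueue = DispatchQueue(label: "syncQueue")
let stride = 100
let iterations = (array.count + stride - 1) / stride   // 10 iterations
var results = [Int]()

DispatchQueue.concurrentPerform(iterations: iterations) { iteration in
    let start = stride * iteration
    let end = min(start + stride, array.count)
    var localResults = [Int]()                  // private to this iteration
    for index in start ..< end {
        localResults.append(array[index] * 2)   // hypothetical work
    }
    // one synchronization per iteration, not one per element:
    // 10 total calls to `sync`, rather than 1,000
    syncQueue.sync {
        results.append(contentsOf: localResults)
    }
}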
Bottom line, making a routine execute in parallel is complicated and you can easily find that the process is actually slower than the serial rendition. Some processes simply don’t lend themselves to parallel execution. We obviously cannot comment further without understanding what your processes entail. Sometimes other technologies, such as Accelerate or Metal can achieve better results.
I will explain it here, since a comment is too small, but I will delete this later if it doesn't answer the question.
Instead of looping over iterations: dataIterationArray.count, base the number of iterations on the number of desired parallel streams of work, not on the array size. For example, as you mentioned you want three streams of work, you should have three iterations, each processing an independent part of the work:
DispatchQueue.concurrentPerform(iterations: 3) { iteration in
    switch iteration {
    case 0:
        for i in 1...10 {
            print("i \(i)")
        }
    case 1:
        for j in 11...20 {
            print("j \(j)")
        }
    case 2:
        for k in 21...30 {
            print("k \(k)")
        }
    default:
        break
    }
}
And the "And then wait here until all of the loops are done before processing" will happen automatically, this is what concurrentPerform guarantees.

RxSwift - How does merge handle simultaneous events?

So I have code that looks something like this:
let a = Observable.just("A").delay(.seconds(3), scheduler: MainScheduler.instance)
let b = Observable.just("B").delay(.seconds(3), scheduler: MainScheduler.instance)
Observable.merge([a, b]).subscribe(onNext: { print($0) })
I thought that the printed order should be random, as the delay that finishes marginally earlier would be emitted first by merge. But the order seems to be determined solely by the order of the variables in the array passed to merge.
That means .merge([a, b]) prints
A
B
but .merge([b, a]) prints
B
A
Why is this the case? Does it mean that merge somehow handles the concurrent events, and can I rely on the fact that events with the same delay will always be emitted in their order in the array?
I know I could easily solve it by using .from(["A", "B"]), but I am now curious to know how exactly merge works.
You put both delays on the same thread (and I suspect the subscriptions are also happening on that thread), so they will execute on that thread in the order they are subscribed to. Since the current implementation of merge subscribes to the observables in the order it receives them, you get the behavior you are seeing. However, it's important to note that you cannot rely on this: there is no guarantee that it will hold for every version of the library.
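For instance, in this sketch (equally not something to rely on), moving each delay onto its own concurrent scheduler removes the shared thread, and with it any apparent ordering guarantee:
import RxSwift

let bag = DisposeBag()
let a = Observable.just("A")
    .delay(.seconds(3), scheduler: ConcurrentDispatchQueueScheduler(qos: .default))
let b = Observable.just("B")
    .delay(.seconds(3), scheduler: ConcurrentDispatchQueueScheduler(qos: .default))
// "A" and "B" can now arrive in either order, because the two
// timers fire on independent background queues
Observable.merge([a, b])
    .subscribe(onNext: { print($0) })
    .disposed(by: bag)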

Is there a simpler (to remember) structure to do breaks in Scala than util.control.Breaks._?

The following is a skeleton of break in Scala using util.control.Breaks._:
import util.control.Breaks._

breakable {
  for (oct <- 1 to 4) {
    if (...) {
    } else {
      break
    }
  }
}
This structure requires remembering several non-intuitive names. I do not use break statements every other or third day, but part of that is the difficulty of remembering how to write them. In a number of other popular languages it is dead simple: you break out of a loop and you're good.
Is there any mechanism closer to that simple keyword/structure in scala?
No, there isn't. for is not a loop but syntactic sugar for a chain of function calls with closures passed into them. Since the local stack changes, it cannot be translated into some goto instruction the way a "normal" break is. To break out of arbitrarily nested closures, you have to throw an exception, which is what both break and return do to exit early.
For particular situations you can often use something like .filter, .takeWhile, .dropWhile (sometimes with .sliding to look ahead), etc. For situations where you have no better idea, a tail-recursive function is always an option.
In your current example I would probably do:
(1 to 4).takeWhile(...).foreach { oct => ... }
Basically, I would consider every single case individually and write it in a declarative way.

Swift reduce on array is orders of magnitude slower than unsafe reduce

In my program I heavily rely on the following pattern, which I think is very common. There is a structure that holds data (let's say a Double):
struct Container {
    var data: Double
}
Next, an array of that struct is created, potentially big:
let containers = (0 ..< 10_000_000).map { _ in
    Container(data: .random(in: -10...10))
}
Then, a (map)reduce operation is applied to the array to extract a value, for instance:
let sum = containers.reduce(0.0) { $0 + $1.data }
I first thought that this particular operation would be fast because it is simple enough for the compiler to optimize away bounds checks and other slow operations: there are no weird index computations, no captures in the closure, everything is immutable, etc.
But then, I compared performance against this unsafe implementation of reduce on arrays:
extension Array {
    func unsafeReduce<Result>(
        _ initialResult: Result,
        _ nextPartialResult: (Result, Element) throws -> Result
    ) rethrows -> Result {
        try self.withUnsafeBufferPointer { buffer -> Result in
            try buffer.reduce(initialResult, nextPartialResult)
        }
    }
}
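For reference, this drop-in variant is called exactly like the standard reduce:
let sum = containers.unsafeReduce(0.0) { $0 + $1.data }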
I compiled both programs with full optimizations (-Ounchecked), and I also used the Xcode profiler to time them; it appears that the unsafe reduce is about 16x faster than its safe counterpart: roughly 200 ms vs 4 s.
My questions are:
First, is my unsafe implementation right? What are the potential issues in using this version?
What is the cause of the slowdown? I tried to look at the assembly with Compiler Explorer and at the Xcode profiler's call tree, but I can't figure out the exact cause. Can someone explain step by step what exactly the program does?
I'm using Xcode 11.5 with Swift 5 on a MacBook Air mid-2015.
Edit
I ported the code to Python and, without using NumPy for any computation, it is still 2 times faster than the Swift reduce operation. I did not check the memory consumption, though.

Efficiency/scalability of parallel collections in Scala (graphs)

So I've been working with parallel collections in Scala for a graph project I'm working on. I've got the basics of the graph class defined; it is currently using a scala.collection.mutable.HashMap where the key is Int and the value is ListBuffer[Int] (an adjacency list). (EDIT: This has since been changed to ArrayBuffer[Int].)
I had done a similar thing a few months ago in C++, with a std::map<int, std::vector<int>>.
What I'm trying to do now is run a metric between all pairs of vertices in the graph, so in C++ I did something like this:
// myVec = std::vector<int> of vertices
for (std::vector<int>::iterator iter = myVec.begin(); iter != myVec.end(); ++iter) {
for (std::vector<int>::iterator iter2 = myVec.begin();
iter2 != myVec.end(); ++iter2) {
/* Run algorithm between *iter and *iter2 */
}
}
I did the same thing in Scala, parallelized, (or tried to) by doing this:
// vertexList is a List[Int] (NOW CHANGED TO Array[Int] - see below)
// vertexList is a List[Int] (NOW CHANGED TO Array[Int] - see below)
vertexList.par.foreach(u =>
  vertexList.foreach(v =>
    /* Run algorithm between u and v */
  )
)
The C++ version is clearly single-threaded; the Scala version has .par, so it is using parallel collections and is multi-threaded on 8 cores (same machine). However, the C++ version processed 305,570 pairs in roughly 3 days, whereas the Scala version has so far processed only 23,573 in 17 hours.
Assuming I did my math correctly, the single-threaded C++ version is roughly 3x faster than the Scala version. Is Scala really that much slower than C++, or am I completely misusing Scala? (I only recently started - I'm about 300 pages into Programming in Scala.)
Thanks!
-kstruct
EDIT: To use a while loop, do I do something like..
// Where vertexList is an Array[Int]
vertexList.par.foreach(u =>
  while (i <- 0 until vertexList.length) {
    /* Run algorithm between u and vertexList(i) */
  }
}
If you guys mean using a while loop for the entire thing, is there an equivalent of .par.foreach for while loops?
EDIT2: Wait a second, that code isn't even right - my bad. How would I parallelize this using while loops? If I have some var i that keeps track of the iteration, wouldn't all threads be sharing that i?
From your comments, I see that you're updating a shared mutable HashMap at the end of each algorithm run. And if you're randomizing your walks, a shared Random is also a contention point.
I recommend two changes:
Use .map and .flatMap to return an immutable collection instead of modifying a shared collection.
Use a ThreadLocalRandom (from either Akka or Java 7) to reduce contention on the random number generator
Check the rest of your algorithm for further possible contention points.
You may try running the inner loop in parallel, too. But without knowing your algorithm, it's hard to know if that will help or hurt. Fortunately, running all combinations of parallel and sequential collections is very simple; just switch out pVertexList and vertexList in the code below.
Something like this:
val pVertexList = vertexList.par
val allResult = for {
  u <- pVertexList
  v <- pVertexList
} yield {
  /* Run algorithm between u and v */
  ((u -> v) -> result)
}
The value allResult will be a ParVector[((Int, Int), Int)]. You may call .toMap on it to convert that into a Map.
Why mutable? I don't think there's a good parallel mutable map in Scala 2.9.x -- particularly because just such a data structure was added for the upcoming Scala 2.10.
On the other hand... you have a List[Int]? Don't use that; use a Vector[Int]. Also, are you sure you aren't wasting time elsewhere, doing the conversions from your mutable maps and buffers into immutable lists? Scala data structures are different from C++'s, so you might well be incurring complexity problems elsewhere in the code.
Finally, I think dave might be onto something when he asks about contention. If you have contention, parallelism might well make things slower. How faster/slower does it run if you do not make it parallel? If making it not parallel makes it faster, then you most likely do have contention issues.
I'm not completely sure about it, but I think foreach loops nested inside foreach loops are rather slow, because lots of objects get created. See: http://scala-programming-language.1934581.n4.nabble.com/for-loop-vs-while-loop-performance-td1935856.html
Try rewriting it using a while loop.
Also, Lists are only efficient for head access; Arrays are probably faster.