ZStream ignores parallel operation and executes it sequentially instead - scala

The following code is supposed to execute the putStrLn effect in parallel because of mapMPar:
val runtime = zio.Runtime.default
val foo = ZIO.sleep(5.seconds) *> ZIO("foo")
val bar = ZIO("bar")
val k = ZStream.fromEffect(foo) ++ ZStream.fromEffect(bar)
val r = k.mapMPar(3)(x => console.putStrLn(s"Processing `${x}`"))
runtime.unsafeRun(r.runDrain)
But in fact it always processes foo before bar, no matter what. Did I miss something, or is it a bug?

I think your example just doesn't do what you expect it to do. fromEffect creates a stream which basically says "I have an effect that will eventually produce a single item"; the first stream therefore waits 5 seconds before producing its item. By the nature of streams, the ++ (concat) operator is lazy: it can't start processing the second stream until all items have been consumed from the first one (which can't happen for 5 seconds). As a result your stream really looks like this:
--5s--(foo)(bar)|
instead of what I imagine you think it should look like:
(bar)--5s--(foo)|
Perhaps the best way to think about it is that for most of the stream you have a single-lane highway: only one item can move at a time, and all subsequent items are blocked by the item at the head of the line. Once you hit that mapMPar block you open up multiple lanes of traffic, meaning that faster-moving items can overtake.
You can thus achieve the desired behavior by doing something like this instead:
val k = ZStream("foo", "bar")
val r = k.mapMPar(3)(x => putStrLn(s"$x:enter") *> (ZIO.sleep(5.seconds) *> putStrLn(s"Processing `${x}`")) <* putStrLn(s"$x:exit"))
r.runDrain
Or written slightly more compact:
ZStream("foo", "bar").mapMPar(3)(x => for {
_ <- putStrLn(s"$x:enter")
_ <- ZIO.sleep(5.seconds) *> putStrLn(s"Processing `$x`")
_ <- putStrLn(s"$x:exit")
} yield ()).runDrain
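If you really do have two separate streams, another option (a sketch, assuming ZIO 1.x, where ZStream#merge is available) is to merge them instead of concatenating: merge pulls from both streams concurrently, so bar is emitted immediately while foo is still sleeping:
import zio._
import zio.console._
import zio.duration._
import zio.stream.ZStream

val foo = ZIO.sleep(5.seconds) *> ZIO("foo")
val bar = ZIO("bar")
// merge subscribes to both streams at once instead of draining the first one first
val merged = ZStream.fromEffect(foo).merge(ZStream.fromEffect(bar))
val r = merged.mapMPar(3)(x => putStrLn(s"Processing `${x}`"))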

Related

Scala/functional way of doing things

I am using Scala to write a Spark application that reads data from CSV files using dataframes (none of these details really matter; my question can be answered by anyone who is good at functional programming).
I'm used to sequential programming, and it's taking a while to think of things in the functional way.
I basically want to read two columns (a, b) from a CSV file and keep track of those rows where b < 0.
I implemented this, but it's pretty much how I would do it in Java, and I would like to utilize Scala's features instead:
val ValueDF = fileDataFrame.select("colA", "colB")
val ValueArr = ValueDF.collect()
for (index <- 0 until ValueArr.length) {
  var row = ValueArr(index)
  var A = row(0).toString()
  var B = row(1).toString().toDouble
  if (B < 0) {
    // write A and B somewhere
  }
}
Converting the dataframe to an array defeats the purpose of distributed computation.
So how could I get the same results, but instead of forming an array and traversing through it, perform some transformations on the data frame itself (such as map/filter/flatMap etc.)?
I should get going soon hopefully, just need some examples to wrap my head around it.
You are basically doing a filtering operation (ignore the row if not B < 0) and a mapping (from each row, get A and B / do something with A and B).
You could write it like this:
val valueDF = fileDataFrame.select("colA", "colB")
val valueArr = valueDF.collect()
val result = valueArr.filter(_(1).toString().toDouble < 0).map{row => (row(0).toString(), row(1).toString().toDouble)}
// do something with result
You can also do the mapping first and then the filtering:
val result = valueArr.map{row => (row(0).toString(), row(1).toString().toDouble)}.filter(_._2 < 0)
Scala also offers more convenient versions of this kind of operation (thanks Sascha Kolberg), called withFilter and collect. withFilter has the advantage over filter that it doesn't create an intermediate collection, saving you one pass; see this answer for more details. With collect you also map and filter in one pass, passing a partial function which allows you to do pattern matching; see e.g. this answer. Both are shown below.
In your case collect would look like this:
val valueDF = fileDataFrame.select("colA", "colB")
val valueArr = valueDF.collect()
val result = valueArr.collect {
  case row if row(1).toString().toDouble < 0 => (row(0).toString(), row(1).toString().toDouble)
}
// do something with result
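For comparison, the withFilter version mentioned above fuses the predicate into the subsequent map, avoiding the intermediate collection that filter would build:
val result = valueArr
  .withFilter(_(1).toString().toDouble < 0)
  .map(row => (row(0).toString(), row(1).toString().toDouble))
// do something with result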
(I think there's a more elegant way to express this but that's left as an exercise ;))
Also, there's a lightweight notation called "sequence comprehensions". With this you could write:
val result = for (row <- valueArr if row(1).toString().toDouble < 0) yield (row(0).toString(), row(1).toString().toDouble)
Or a more flexible variant, binding the parsed double to a name so it is only computed once (note that an if without an else inside the yield would produce Unit for non-matching rows, so the guard belongs in the generator):
val result = for {
  row <- valueArr
  b = row(1).toString().toDouble
  if b < 0
} yield (row(0).toString(), b)
Alternatively, you can use foldLeft:
val valueDF = fileDataFrame.select("colA", "colB")
val valueArr = valueDF.collect()
val result = valueArr.foldLeft(Seq[(String, Double)]()) { (s, row) =>
  val a = row(0).toString()
  val b = row(1).toString().toDouble
  if (b < 0) {
    s :+ ((a, b)) // append a tuple with A and B to the results sequence
  } else {
    s // leave the results sequence unmodified
  }
}
// do something with result
All of these are considered functional... which one you prefer is for the most part a matter of taste. The first 2 examples (filter/map, map/filter) do have a performance disadvantage compared to the rest because they iterate through the sequence twice.
Note that in FP it's very important to minimize side effects and isolate them from the main logic. I/O ("write A and B somewhere") is a side effect. So you normally write your functions such that they have no side effects: just input -> output logic, without affecting or reading the surroundings. Once you have a final result, you can perform the side effects. In this concrete case, once you have result (which is a sequence of A and B tuples), you can loop through it and print it. This way you can, for example, easily change how the output is written (to the console, to a remote place, etc.) without touching the main logic.
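For example, once result is computed (a sketch; printing to the console stands in for "write A and B somewhere"):
// all the pure logic happened above; the side effect lives at the edge
result.foreach { case (a, b) => println(s"$a, $b") }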
Also you should prefer immutable values (val) wherever possible, which is safer. Even in your loop, row, A and B are not modified so there's no reason to use var.
(Btw, I corrected the value names to start with lower case; see conventions.)
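As an aside, since the question asks about transforming the data frame itself: the same filter can be pushed into Spark before collecting anything, which keeps the work distributed (a sketch, assuming colB is, or can be compared as, a numeric column):
val result = fileDataFrame
  .select("colA", "colB")
  .where(fileDataFrame("colB") < 0) // filtering happens on the executors, not the driver
  .collect()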

Composing Operations on Streams in Scala

Let's say you have a program which manipulates a stream Stream[Foo] in some manner to produce a computation of interest, e.g.
myFooStream.map(toBar).groupBy(identity).mapValues(_.size)
Lovely, except now you've got to do some other kind of computation on myFooStream like
myFooStream.map(toBar).sum
And you'd like to compose these computations somehow so that you do not need to iterate twice over the stream (let's say that iterating over the stream is expensive for some reason).
Is there some Scala-ish way of dealing with this problem? My problem, put more abstractly, is that I'd like to somehow abstract computation over these streams from the iteration over these streams. That is, what would be best is if I could somehow write two methods f: Stream[Foo] => Bar and g: Stream[Foo] => Baz and compose f and g in a way such that they operate on a single iteration of the stream.
Is there some abstraction which allows this?
UPDATED QUESTION: I've done a little digging around. Would scalaz arrows be helpful with this problem?
Streams naturally try to avoid generating their elements multiple times if possible, by memoizing results. From the docs:
The Stream class also employs memoization such that previously computed values are converted from Stream elements to concrete values of type A.
We can see that by constructing a Stream that prints every time an element is produced, and running multiple operations on it:
val stream = Stream.from(0).map(x => { println(x); x }).take(10) //prints 0
val double = stream.map(_ * 2).take(5).toList //prints 1 through 4
val sum = stream.sum //prints 5 through 9
val sum2 = stream.sum //doesn't print any more
This works as long as you use a val and not a def:
So long as something is holding on to the head, the head holds on to the tail, and so it continues recursively. If, on the other hand, there is nothing holding on to the head (e.g. we used def to define the Stream) then once it is no longer being used directly, it disappears.
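To see the def case in action: each reference rebuilds the stream from scratch, so nothing is memoized and the elements are recomputed every time:
def streamDef = Stream.from(0).map(x => { println(x); x }).take(10)
streamDef.sum // prints 0 through 9
streamDef.sum // prints 0 through 9 again, since each call constructs a fresh stream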
This memoization means one must be cautious with Streams:
One must be cautious of memoization; you can very quickly eat up large amounts of memory if you're not careful. The reason for this is that the memoization of the Stream creates a structure much like scala.collection.immutable.List.
Of course, if generating the items isn't what's expensive, but the actual traversal of the Stream is, or memoization isn't an option because it would cost too much memory, one can always use foldLeft with a tuple, keeping track of multiple values:
// Only prints 0-9 once, even if stream is a def
val (sum, double) = stream.foldLeft(0 -> List.empty[Int]) {
  case ((sum, list), next) => (sum + next, list :+ (next * 2))
}
If this is a common enough operation, you might even enrich Stream to make some of the more common operations like foldLeft, reduceLeft, and others available in this format:
implicit class RichStream[T](val stream: Stream[T]) extends AnyVal {
  def doubleFoldLeft[A, B](start1: A, start2: B)(f: (A, T) => A, g: (B, T) => B) =
    stream.foldLeft(start1 -> start2) {
      case ((aAcc, bAcc), next) => (f(aAcc, next), g(bAcc, next))
    }
}
Which would allow you to do things like:
val (sum, double) = stream.doubleFoldLeft(0, List.empty[Int])(_ + _, (list, next) => list :+ next * 2)
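Applied to the two computations from the question in a single pass (a sketch: it assumes toBar: Foo => Bar exists and, so that the sum is meaningful, treats Bar as Int):
val bars: Stream[Int] = myFooStream.map(toBar)
val (counts, total) = bars.doubleFoldLeft(Map.empty[Int, Int], 0)(
  (m, b) => m.updated(b, m.getOrElse(b, 0) + 1), // groupBy(identity).mapValues(_.size)
  (s, b) => s + b                                // sum
)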
The stream will not iterate twice:
Stream.continually{println("bob"); 1}.take(4).map(v => v).sum
bob
bob
bob
bob
4
and
val bobs = Stream.continually{println("bob"); 1}.take(4)
val alices = Stream.continually{println("alice"); 2}.take(4)
bobs.zip(alices).map{ case (b, a) => a + b}.sum
bob
bob
bob
bob
alice
alice
alice
alice
12

Scala: what is the right way to fold a collection of futures?

I have a piece of code that sends broadcast message to a set of actors and collects their responses. Please look at the simplified code:
{
  val responses: Set[Future[T]] = // ask a set of actors
  val zeroResult: T
  val foldResults: (T, T) => T
  //1. Future.fold(responses)(zeroResult)(foldResults)
  //2. (future(zeroResult) /: responses) { (acc, f) => for { x <- f; xs <- acc } yield foldResults(x, xs) }
} foreach {
  client ! resp(_)
}
Then I noticed that the behaviors of code lines 1 and 2 differ. E.g. there are 4 actors, each sending Traversable(1) as its response, and
zeroResult = Traversable.empty[Int]
foldResults = { _ ++ _ }
The results of the first line vary: usually I get List(1, 1, 1, 1), but sometimes List(1, 1, 1) or even List(1, 1). That wasn't surprising to me, because Future.fold is non-blocking, so it seems some responses could be missed.
But the second line always yields a List of four ones.
Could anyone explain the reason why these kinds of fold differ and which of them is preferable?
The surprising thing for me in your question is the fact that your first fold (which I find preferable by reason of its conciseness) sometimes seems to operate on a list of three (or two) successfully completed futures.
In the normal scheme of things, you have 3 possible outcomes for a future:
Completed normally
Failed (completed with exception)
Uncompleted
Both of the folds that you give will yield a single future that is uncompleted as long as any of its component futures is uncompleted, failed if any future (or the fold operation) fails, or completed if all goes well. (Your sentence "Future.fold is non-blocking, so it seems some responses could be missed" is incorrect.)
I can only think that you have some other code somewhere that is completing futures with no result if a certain timeout is breached.
Apart from that, you have swapped the order of the operands in the fold operation in line 2 (it should be yield foldResults(xs, x)), so the final order is reversed with respect to line 1.
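For reference, a minimal self-contained sketch (using the Scala 2.11/2.12-era API, where Future.fold still exists) showing that the folded future completes only after every component future has:
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val responses: Seq[Future[Traversable[Int]]] = (1 to 4).map(_ => Future(Traversable(1)))
val folded = Future.fold(responses)(Traversable.empty[Int])(_ ++ _)
// waits for all four component futures; always yields four ones
println(Await.result(folded, 1.second))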

Efficiently iterate over one Set, and then another, in one for loop in Scala

I want to iterate over all the elements of one Set, and then all the elements of another Set, using a single loop. (I don't care about duplicates, because I happen to know the two Sets are disjoint.)
The reason I want to do it in a single loop is because I have some additional code to measure progress, which requires it to be in a single loop.
This doesn't work in general, because it might intermix the two Sets arbitrarily:
for(x <- firstSet ++ secondSet) {
...
}
This works, but builds 3 intermediate Seqs in memory, so it's far too inefficient in terms of both time and space usage:
for(x <- firstSet.toSeq ++ secondSet.toSeq) {
...
}
for(x <- firstSet.toIterator ++ secondSet.toIterator) {
...
}
This doesn't build any intermediate data structures, so I think it's the most efficient way.
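Since the single loop exists to support progress measurement, that bookkeeping drops straight into the iterator version (a sketch; the reporting interval is made up):
var done = 0
val total = firstSet.size + secondSet.size
for (x <- firstSet.toIterator ++ secondSet.toIterator) {
  done += 1
  if (done % 1000 == 0) println(s"processed $done of $total")
  // ... process x ...
}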
If you just want a traversal, and you want maximum performance, this is the best way even though it is ugly:
val s1 = Set(1,2,3)
val s2 = Set(4,5,6)
val block: Int => Unit = x => println(x)
s1.foreach(block)
s2.foreach(block)
Since this is pretty ugly, you can wrap it in a small helper:
def traverse[T](a: Traversable[T], b: Traversable[T]): Traversable[T] =
  new Traversable[T] {
    def foreach[U](f: T => U) { a.foreach(f); b.foreach(f) }
  }
And then use it like this:
for (x <- traverse(s1, s2)) println(x)
However, unless this is extremely performance-critical, the solution posted by Robin Green is better. Its overhead is the creation of two iterators and their concatenation. If you have deeply nested data structures, though, concatenating iterators can be quite expensive: for example, a tree iterator defined by concatenating the iterators of the subtrees will be painfully slow, whereas a tree traversable where you just call foreach on each subtree will be close to optimal.
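To make that last point concrete, here is a rough sketch of the two approaches on a binary tree (the Tree type is made up for illustration):
sealed trait Tree[T]
case class Leaf[T](value: T) extends Tree[T]
case class Node[T](left: Tree[T], right: Tree[T]) extends Tree[T]

// iterator-based: every node adds a layer of concatenation, so deep trees pay
// per-element overhead proportional to their depth
def iter[T](t: Tree[T]): Iterator[T] = t match {
  case Leaf(v)    => Iterator.single(v)
  case Node(l, r) => iter(l) ++ iter(r)
}

// foreach-based: plain recursion with no wrapper per level, close to optimal
def foreachTree[T](t: Tree[T])(f: T => Unit): Unit = t match {
  case Leaf(v)    => f(v)
  case Node(l, r) => foreachTree(l)(f); foreachTree(r)(f)
}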

foldLeft early termination in a Stream[Boolean]?

I have a:
val a : Stream[Boolean] = ...
When I foldLeft it as follows
val b = a.foldLeft(false)(_ || _)
Will it terminate when it finds the first true value in the stream? If not, how do I make it do so?
It would not terminate on the first true. You can use exists instead:
val b = a.exists(identity)
No, it won't terminate early. This is easy to illustrate:
val a: Stream[Boolean] = Stream.continually(true)
// won't terminate, because the stream is infinite and foldLeft traverses all of it
val b = a.foldLeft(false)(_ || _)
stew showed that a simple solution to terminate early, in your specific case, is
val b = a.exists(identity)
Even simpler, this is equivalent to:
val b = a.contains(true)
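Both of these short-circuit, which is easy to check against an infinite stream:
val a: Stream[Boolean] = Stream.continually(true)
a.exists(identity) // true, returns after inspecting the first element
a.contains(true)   // true as well, also immediately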
A more general solution which unlike the above is also applicable if you actually need a fold, is to use recursion (note that here I am assuming the stream is non-empty, for simplicity):
def myReduce(s: Stream[Boolean]): Boolean = s.head || myReduce(s.tail)
val b = myReduce(a)
Now the interesting thing about the recursive solution is how it can be used in a more general use case where you actually need to accumulate the values in some way (which is what fold is for) and still terminate early. Say that you want to add the values of a stream of ints using an add method that will "terminate" early in a way similar to || (in this case, it does not evaluate its right-hand side if the left-hand side is at least 100):
def add(x: Int, y: => Int) = if (x >= 100) x else x + y
val a : Stream[Int] = Stream.range(0, Int.MaxValue)
val b = a.foldLeft(0)(add(_, _))
The last line won't terminate, much like in your example. But you can fix it like this:
def myReduce(s: Stream[Int]): Int = add(s.head, myReduce(s.tail))
val b = myReduce(a)
WARNING: there is a significant downside to this approach though: myReduce here is not tail recursive, meaning that it will blow your stack if iterating over too many elements of the stream.
Yet another solution, which does not blow the stack, is this:
val b = a.takeWhile(_ <= 100).foldLeft(0)(_ + _)
But I fear I have strayed too far off topic, so I'd better stop now.
You could use takeWhile to extract the prefix of the Stream on which you want to operate and then apply foldLeft to that.