scalaz stream structure for growing lists - scala

I have a hunch that I can (should?) be using scalaz-streams to solve my problem, which is as follows.
I have a starting item A. I have a function that takes an A and returns a list of A.
def doSomething(a : A) : List[A]
I have a work queue that starts with 1 item (the starting item). When we process (doSomething) each item, it may add many items to the end of the same work queue. At some point, however (after many millions of items), each subsequent item that we doSomething on will start adding fewer and fewer items to the work queue, and eventually no new items will be added (doSomething will return Nil for these items). This is how we know the computation will eventually terminate.
Assuming scalaz-streams is appropriate for this, could someone please give me some tips as to which overall structure or types I should be looking at to implement this?
Once a simple implementation with a single "worker" is done, I would also like to use multiple workers to process queue items in parallel, e.g. a pool of 5 workers (each worker farming its task out to an agent to calculate doSomething), so I would also need to handle effects (such as worker failure) in this algorithm.

So the answer to the "how?" is:
import scalaz.stream._
import scalaz.stream.async._
import Process._
def doSomething(i: Int) = if (i == 0) Nil else List(i - 1)
val q = unboundedQueue[Int]
val out = unboundedQueue[Int]
q.dequeue
.flatMap(e => emitAll(doSomething(e)))
.observe(out.enqueue)
.to(q.enqueue).run.runAsync(_ => ()) //runAsync can process failures, there is `.onFailure` as well
q.enqueueAll(List(3,5,7)).run
q.size.continuous
.filter(0==)
.map(_ => -1)
.to(out.enqueue).once.run.runAsync(_ => ()) //call it only after enqueueAll
import scalaz._, Scalaz._
val result = out
.dequeue
.takeWhile(_ != -1)
.map(_.point[List])
.foldMonoid.runLast.run.get //run synchronously
Result:
result: List[Int] = List(2, 4, 6, 1, 3, 5, 0, 2, 4, 1, 3, 0, 2, 1, 0)
However, you might notice that:
1) I had to solve the termination problem. The same problem exists for akka-stream, and it's much harder to resolve there, as you don't have access to the Queue and there is no natural back-pressure to guarantee that the queue won't look empty just because of fast readers.
2) I had to introduce another queue for the output (and convert it to a List), as the working queue becomes empty at the end of the computation.
So, both libraries are not well adapted to such requirements (a finite stream); however, scalaz-stream (which is going to become "fs2" after removing the scalaz dependency) is flexible enough to implement your idea. The big "but" is that it will run sequentially by default. There are (at least) two ways to make it faster:
1) split your doSomething into stages, like .flatMap(doSomething1).flatMap(doSomething2).map(doSomething3), and then put additional queues between them (roughly 3 times faster if the stages take equal time).
2) parallelize queue processing. Akka has mapAsync for that - it can do maps in parallel automatically (see the hedged sketch below). Scalaz-stream has chunks - you can group your q into chunks of, say, 5 and then process the elements inside each chunk in parallel. Anyway, both solutions (akka vs scalaz) aren't well adapted to using one queue as both input and output.
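For illustration, a rough sketch of the Akka mapAsync approach from 2). It's hedged: it assumes a recent Akka version where an implicit ActorSystem provides the materializer, and it only parallelizes a single pass over a fixed input rather than feeding results back into the same queue:
import akka.actor.ActorSystem
import akka.stream.scaladsl.Source
import scala.concurrent.Future

implicit val system: ActorSystem = ActorSystem("demo")
import system.dispatcher // ExecutionContext for the Futures below

def doSomething(i: Int): List[Int] = if (i == 0) Nil else List(i - 1)

// expand up to 5 elements concurrently, then flatten the produced lists
Source(List(3, 5, 7))
  .mapAsync(parallelism = 5)(i => Future(doSomething(i)))
  .mapConcat(identity)
  .runForeach(println)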
But, again, it's all too complex and pointless as there is a classic simple way:
@annotation.tailrec
def calculate(l: List[Int], acc: List[Int]): List[Int] =
  if (l.isEmpty) acc else {
    val processed = l.flatMap(doSomething)
    calculate(processed, acc ++ processed)
  }
scala> calculate(List(3,5,7), Nil)
res5: List[Int] = List(2, 4, 6, 1, 3, 5, 0, 2, 4, 1, 3, 0, 2, 1, 0)
And here is the parallelized one:
@annotation.tailrec
def calculate(l: List[Int], acc: List[Int]): List[Int] =
  if (l.isEmpty) acc else {
    val processed = l.par.flatMap(doSomething).toList
    calculate(processed, acc ++ processed)
  }
scala> calculate(List(3,5,7), Nil)
res6: List[Int] = List(2, 4, 6, 1, 3, 5, 0, 2, 4, 1, 3, 0, 2, 1, 0)
So, yes, I would say that neither scalaz-stream nor akka-streams fits your requirements; classic Scala parallel collections, however, fit perfectly.
If you need distributed calculations across multiple JVMs, take a look at Apache Spark; its Scala DSL uses the same map/flatMap/fold style. It allows you to work with big collections (by scaling them across JVMs) that don't fit into a single JVM's memory, so you can improve @tailrec def calculate by using an RDD instead of a List. It will also give you instruments to process failures inside doSomething.
P.S. So here is why I don't like the idea of using streaming libraries for such tasks: streaming is more about infinite streams coming from some external system (like HttpRequests), not about computation over predefined (even if big) data.
P.S.2 If you need reactive-like behavior (without blocking), you might use Future (or scalaz.concurrent.Task) + Future.sequence, as sketched below.
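Here is a minimal non-blocking sketch of that idea with plain scala.concurrent Futures (the name expand is just illustrative, not from any library): each "level" of the work list is processed in parallel, and the recursion stops once doSomething produces nothing new.
import scala.concurrent.{ExecutionContext, Future}
import ExecutionContext.Implicits.global

def doSomething(i: Int): List[Int] = if (i == 0) Nil else List(i - 1)

def expand(level: List[Int], acc: List[Int]): Future[List[Int]] =
  if (level.isEmpty) Future.successful(acc)
  else
    Future.traverse(level)(i => Future(doSomething(i))) // run the whole level in parallel
      .map(_.flatten)
      .flatMap(next => expand(next, acc ++ next))

// expand(List(3, 5, 7), Nil) eventually yields the same List as `calculate` above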

Related

Can I use function composition to avoid the "temporary list" in scala?

On page 64 of FPIS (Functional Programming in Scala), it is said that for
List(1,2,3,4).map(_ + 10).filter(_ % 2 == 0).map(_ * 3)
"each transformation
will produce a temporary list that only ever gets used as input to the next transformation
and is then immediately discarded"
So can't the compiler or the library help to avoid this?
If so, does this Haskell code also produce a temporary list?
map (*2) (map (+1) [1,2,3])
If it does, can I use function composition to avoid this?
map ((*2).(+1)) [1,2,3]
If I can use function composition to avoid the temporary list in Haskell, can I also use function composition to avoid the temporary list in Scala?
I know Scala uses the "compose" function to compose functions: https://www.geeksforgeeks.org/scala-function-composition/
So can I write this to avoid the temporary lists in Scala?
((map(x:Int=>x+10)) compose (filter(x=>x%2==0)) compose (map(x=>x*3)) (List(1,2,3,4))
(IDEA told me I can't)
Thanks!
The compiler is not supposed to. If you consider map fusion, it nicely works with pure functions:
List(1, 2, 3).map(_ + 1).map(_ * 10)
// can be fused to
List(1, 2, 3).map(x => (x + 1) * 10)
However, Scala is not a purely functional language, nor does it have any notion of purity that the compiler could track. For example, with side effects there's a difference in behavior:
List(1, 2, 3).map { i => println(i); i + 1 }.map { i => println(i); i * 10 }
// prints 1, 2, 3, 2, 3, 4
List(1, 2, 3).map { i =>
  println(i)
  val j = i + 1
  println(j)
  j * 10
}
// prints 1, 2, 2, 3, 3, 4
Another thing to note is that a Scala List is a strict collection - if you have a reference to a list, all of its elements are already allocated in memory. A Haskell list, on the contrary, is lazy (like most things in Haskell), so even if a temporary "list shell" is created, its elements are kept unevaluated until needed. That also allows Haskell lists to be infinite (you can write [1..] for the increasing numbers).
The closest Scala counterpart to the Haskell list is LazyList, which doesn't evaluate its elements until requested, and then caches them. So doing
LazyList(1,2,3,4).map(_ + 10).filter(_ % 2 == 0).map(_ * 3)
would allocate intermediate LazyList instances, but not calculate/allocate any elements in them until they are requested from the final list. LazyList is also suitable for infinite collections (LazyList.from(1) is analogous to the Haskell example above, except it's Int).
Here, actually, doing map with side effects twice or fusing it by hand will make no difference.
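To check that claim, a small sketch (assuming Scala 2.13's LazyList): because an element is only computed when forced, the two maps interleave per element, which is exactly the order the hand-fused version produces.
LazyList(1, 2, 3).map { i => println(i); i + 1 }.map { i => println(i); i * 10 }.toList
// prints 1, 2, 2, 3, 3, 4 - the same interleaving as the fused single map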
You can switch any collection to be "lazy" by doing .view, or just work with iterators by doing .iterator - they have largely the same API as any collection - and then go back to a concrete collection by doing .to(Collection), so something like:
List(1,2,3,4).view.map(_ + 10).filter(_ % 2 == 0).map(_ * 3).to(List)
would make a List without any intermediaries. The catch is that it's not necessarily faster (though it usually is more memory efficient).
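For example, a small sketch of the .iterator variant mentioned above - the same pipeline with no intermediate collections, materialized back into a List at the end:
List(1, 2, 3, 4).iterator.map(_ + 10).filter(_ % 2 == 0).map(_ * 3).toList
// List(36, 42) - no temporary List is built between the stages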
You can avoid these temporary lists by using views:
https://docs.scala-lang.org/overviews/collections-2.13/views.html
It's also possible to use function composition to express the function that you asked about:
((_: List[Int]).map(_ + 10) andThen (_: List[Int]).filter(_ % 2 == 0) andThen (_: List[Int]).map(_ * 3))(List(1, 2, 3, 4))
But this will not avoid the creation of temporary lists, and due to Scala's limited type inference, it's usually more trouble than it's worth, because you often end up having to annotate types explicitly.

what is the difference between Scala Stream vs Scala List vs Scala Sequence

I have a scenario where I get DB data in the form of a Stream of objects, and transforming it into a sequence of objects is taking time. I am looking for an alternative which takes less time.
Quick answer: a Scala stream is already a Scala sequence and does not need to be converted at all. Further explanation below...
A Scala sequence (scala.collection.Seq) is simply any collection that stores a sequence of elements in a specific order (the ordering is arbitrary, but element order doesn't change once defined).
A Scala list (scala.collection.immutable.List) is a subclass of Seq and is also the default implementation of a scala.collection.Seq. That is, Seq(1, 2, 3) is implemented as a List(1, 2, 3). Lists are strict, so any operation on a list processes all elements, one after the other, before another operation can be performed.
For example, consider this example in the Scala REPL:
$ scala
Welcome to Scala 2.12.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171).
Type in expressions for evaluation. Or try :help.
scala> val xs = List(1, 2, 3)
xs: List[Int] = List(1, 2, 3)
scala> xs.map {x =>
| val newX = 2 * x
| println(s"Mapping value $x to $newX...")
| newX
| }.foreach {x =>
| println(s"Printing value $x")
| }
Mapping value 1 to 2...
Mapping value 2 to 4...
Mapping value 3 to 6...
Printing value 2
Printing value 4
Printing value 6
Note how each value is mapped, creating a new list (List(2, 4, 6)), before any of the values of that new list are printed out?
A Scala stream (scala.collection.immutable.Stream) is also a subclass of Seq, but it is lazy (or non-strict), meaning that the next value from the stream is only taken when required. It is often referred to as a lazy list.
To illustrate the difference between a Stream and a List, let's redo that example:
scala> val xs = Stream(1, 2, 3)
xs: scala.collection.immutable.Stream[Int] = Stream(1, ?)
scala> xs.map {x =>
| val newX = 2 * x
| println(s"Mapping value $x to $newX...")
| newX
| }.foreach {x =>
| println(s"Printing value $x")
| }
Mapping value 1 to 2...
Printing value 2
Mapping value 2 to 4...
Printing value 4
Mapping value 3 to 6...
Printing value 6
Note how, for a Stream, we only process the next element after all of the operations for the previous element have been completed? The map operation still returns a new stream (Stream(2, 4, 6)), but values are only taken when needed.
Whether a Stream performs better than a List in any particular situation will depend upon what you're trying to do. If performance is your primary goal, I suggest that you benchmark your code (using a tool such as ScalaMeter) to determine which type works best.
BTW, since both Stream and List are subclasses of Seq, it is common practice to write code that requires a sequence to utilize Seq. That way, you can supply a List or a Stream or any other Seq subclass, without having to change your code, and without having to convert lists, streams, etc. to sequences. For example:
def doSomethingWithSeq[T](seq: Seq[T]) = {
  // ...
}
// This works!
val list = List(1, 2, 3)
doSomethingWithSeq(list)
// This works too!
val stream = Stream(4, 5, 6)
doSomethingWithSeq(stream)
UPDATED
The performance of List vs. Stream for a groupBy operation is going to be very similar. Depending upon how it's used, a Stream can require less memory than a List, but might require a little extra CPU time. If collection performance is definitely the issue, benchmark both types of collection (see above) and measure precisely to determine the trade-offs between the two. I cannot make that determination for you. It's possible that the slowness you refer to is down to the transmission of data between the database and your application, and has nothing to do with the collection type.
For general information on Scala collection performance, refer to Collections: Performance Characteristics.
UPDATED 2
Also note that any type of Scala sequence will typically be processed sequentially (hence the name), by a single thread at a time. Neither List nor Stream lends itself to parallel processing of its elements. If you need to process a collection in parallel, you'll need a parallel collection type (one of the collections in scala.collection.parallel). A scala.collection.parallel.ParSeq should process groupBy faster than a List or a Stream, but only if you have multiple cores/hyperthreads available. However, ParSeq operations do not guarantee to preserve the order of the grouped-by elements.
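As a rough sketch of that idea (Scala 2.12 syntax, matching the REPL session above; on Scala 2.13+ the separate scala-parallel-collections module and an extra import are required):
val records = (1 to 1000000).toList
// groupBy runs on a parallel sequence and can use all available cores
val grouped = records.par.groupBy(_ % 10)
// grouped is a parallel Map from key to a parallel sequence of values;
// ordering within each group is not guaranteed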

Why does Scala Immutable Vector not provide an insertAt method?

Scala's immutable Vector is implemented using a Relaxed Radix Balanced Tree, which provides single-element append in log(n) complexity like a HAMT, but also log(n) insertAt and concatenation.
Why does the API not expose insertAt?
You can create a custom insertAt method (neglecting performance issues) operating on immutable vectors. Here is just a rough method sketch:
def insertAt[T](v: Vector[T], elem: T, pos: Int): Vector[T] = {
  val n = v.size
  val front = v.take(pos)
  val end = v.takeRight(n - pos)
  front ++ Vector(elem) ++ end
}
Call:
val x = Vector(1,2,3,5)
println( insertAt( x, 7, 0) )
println( insertAt( x, 7, 1) )
println( insertAt( x, 7, 2) )
Output:
Vector(7, 1, 2, 3, 5)
Vector(1, 7, 2, 3, 5)
Vector(1, 2, 7, 3, 5)
Not handled properly in this sketch: types, index checking.
Use the pimp-my-library pattern to add that to the Vector class.
Edit: Updated version of insertAt
def insertAt[T](v: Vector[T], elem: T, pos: Int): Vector[T] =
  v.take(pos) ++ Vector(elem) ++ v.drop(pos)
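For completeness, a minimal sketch of the pimp-my-library (extension method) approach mentioned above, wrapping the same logic (names are illustrative):
implicit class VectorInsertOps[T](v: Vector[T]) {
  def insertAt(elem: T, pos: Int): Vector[T] =
    v.take(pos) ++ Vector(elem) ++ v.drop(pos)
}

Vector(1, 2, 3, 5).insertAt(7, 2) // Vector(1, 2, 7, 3, 5)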
Having an efficient insertAt is typically not an operation I would expect from a general Vector, immutable or not. That's more the purview of (mutable) linked lists.
Putting an efficient insertAt into the public API of Vector would severely constrain the implementation choices for that API. While at the moment, there is only one implementation of the Scala standard library APIs (which I personally find rather unfortunate, a bit of competition wouldn't hurt, see C++, C, Java, Ruby, Python for examples of how multiple implementations can foster an environment of friendly coopetition), there is no way to know that this will forever be the case. So, you should be very careful what guarantees you add to the public API of the Scala standard library, otherwise you might constrain both future versions of the current single implementation as well as potential alternative implementations in undue ways.
Again, see Ruby for an example, where exposing implementation details of one implementation in the API has led to severe pains for other implementors.

Scala Seq.sliding() violating the docs rationale?

When writing tests for some part of my system I found some weird behavior, which upon closer inspection boils down to the following:
scala> List(0, 1, 2, 3).sliding(2).toList
res36: List[List[Int]] = List(List(0, 1), List(1, 2), List(2, 3))
scala> List(0, 1, 2).sliding(2).toList
res37: List[List[Int]] = List(List(0, 1), List(1, 2))
scala> List(0, 1).sliding(2).toList
res38: List[List[Int]] = List(List(0, 1))
scala> List(0).sliding(2).toList //I mean the result of this line
res39: List[List[Int]] = List(List(0))
To me it seems like List.sliding(), and the sliding() implementations for a number of other types, are violating the guarantees given in the docs:
def sliding(size: Int): Iterator[List[A]]
Groups elements in fixed size blocks by passing a "sliding window"
over them (as opposed to partitioning them, as is done in grouped.)
size: the number of elements per group
returns: An iterator producing lists of size size, except the last and the only element will be truncated if there are fewer
elements than size.
From what I understand there is a guarantee that all the lists that can be iterated over using the iterator returned by sliding(2) will be of length 2. I find it hard to believe that this is a bug that got all the way to the current version of scala, so perhaps there's an explanation for this or I'm misunderstanding the docs?
I'm using "Scala version 2.10.3 (OpenJDK 64-Bit Server VM, Java 1.7.0_25)."
No, there is no such guarantee, and you pretty much emphasized the doc line that explicitly says so. Here it is again, with a different emphasis:
returns: An iterator producing lists of size size, except the last and the only element will be truncated if there are fewer elements than size.
So if you have a list of length n and call .sliding(m), where m > n, the last and only element of the result will have length n.
In the case of:
List(0).sliding(2)
there is only one element (n = 1) and you call sliding(2), i.e. m = 2; since 2 > 1, the last and only element of the result is truncated to length 1.
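As a small aside (an illustrative sketch, not part of the documented behaviour): if only full-size windows are wanted, the truncated trailing window can simply be filtered out:
List(0).sliding(2).filter(_.size == 2).toList       // List() - the truncated window is dropped
List(0, 1, 2).sliding(2).filter(_.size == 2).toList // List(List(0, 1), List(1, 2))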

Stream#filter Runs out of Memory for 1,000,000 items

Let's say I have a Stream of length 1,000,000 with all 1's.
scala> val million = Stream.fill(100000000)(1)
million: scala.collection.immutable.Stream[Int] = Stream(1, ?)
scala> million filter (x => x % 2 == 0)
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
I get an Out of Memory exception.
Then, I tried the same filter call with List.
scala> val y = List.fill(1000000)(1)
y: List[Int] = List(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ...
scala> y.filter(x => x % 2 == 0)
res2: List[Int] = List()
Yet it succeeds.
Why does the Stream#filter run out of memory here, but the List#filter completes just fine?
Lastly, with a large stream, will filter result in the non-lazy evaluation of the entire stream?
Overhead of List - single object (instance of ::) with 2 fields (2 pointers) per element.
Overhead of Stream - instance of Cons (with 3 pointers) plus an instance of Function (tl: => Stream[A]) for lazy evaluation of Stream#tail per element.
So you'll spend ~2 times more memory on Stream.
You have defined your Stream as a val. Alternatively, you could define million as a def - in that case, after filter, the GC will delete all the created elements and you'll get your memory back.
Note that only the tail of a Stream is lazy; the head is strict. So filter evaluates strictly until it gets the first element that satisfies the given predicate, and since there are no such elements in your Stream, filter iterates over your entire million-element stream and keeps all of its elements in memory.
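A small sketch of the def variant suggested above: since no val retains the head of the stream, the already-traversed cons cells can be garbage collected, so the filter still walks all the elements but should no longer exhaust the heap.
def million: Stream[Int] = Stream.fill(100000000)(1)
million.filter(_ % 2 == 0) // traverses every element (none match) but the prefix is not retained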