Scala 2.13 views vs LazyList

I'm migrating a project from Scala 2.12.1 to 2.13.6, and I've found that SeqView#flatMap now returns a View, which doesn't have a distinct method. I thus have one bit of code that no longer compiles:
val nodes = debts.view
.flatMap { case Debt(from, to, _) => List(from, to) }
.distinct
.map(name => (name, new Node(name)))
.toMap
There's a dumb way to fix it, by converting the view to a seq and then back to a view:
val nodes = debts.view
.flatMap { case Debt(from, to, _) => List(from, to) }.toSeq.view
.distinct
.map(name => (name, new Node(name)))
.toMap
However, this is obviously not great because it forces the view to be collected, and it's just super inelegant to have to go back and forth between types. I found another way to fix it, which is to use a LazyList:
val nodes = debts.to(LazyList)
.flatMap { case Debt(from, to, _) => List(from, to) }
.distinct
.map(name => (name, new Node(name)))
.toMap
Now that's what I want: it basically behaves like a Java Stream. Sure, some operations, like distinct, have O(n) memory usage, but at least all operations after them get to be streamed, without reconstructing the data structure.
With this, it gets me thinking about why we should ever need a view, given that they're much less powerful than before (even if I can believe 2.13 has fixed some other issues this power was introducing). I looked for the answer and found hints, but nothing that I found comprehensive enough. Below is my research:
Description of views in 2.13
StackOverflow: What is the difference between List.view and LazyList?
Another comparison on an external website
It might be me, but even after reading these references, I don't find a clear upside in using views, for most if not all use cases. Anyone more enlightened than me?

There are actually 3 basic possibilities for lazy sequences in Scala 2.13: View, Iterator and LazyList.
View is the simplest lazy sequence, with very little additional cost. It's a good default in the general case, to avoid allocations for intermediate results when working with large sequences.
It's possible to traverse the View several times (using foreach, foldLeft, toMap, etc.). Transformations (map, flatMap, filter, etc.) will be executed separately for every traversal. So care has to be applied either to avoid time-consuming transformations, or to traverse the View only once.
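For instance, here is a quick sketch that makes the re-execution visible (Scala 2.13; the println is only there to show when the transformation actually runs):
val v = Seq(1, 2, 3).view.map { x => println(s"mapping $x"); x * 2 }
v.toList // prints "mapping 1", "mapping 2", "mapping 3"
v.toList // prints all three again: the map is re-run on every traversal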
Iterator can be traversed only once. It's similar to Java Streams or Python generators. Most transformation methods on Iterator require that you only use the returned Iterator and discard the original object.
It's also fast like a View and supports more operations, including distinct.
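A minimal sketch of the one-shot behavior:
val it = Iterator(1, 2, 3).map(_ * 2)
println(it.toList) // List(2, 4, 6)
println(it.toList) // List() -- the iterator was exhausted by the first traversal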
LazyList is essentially a real, concrete data structure that is expanded lazily on the fly. LazyList memoizes all the generated elements: if you hold a LazyList in a val, memory stays allocated for every element generated so far. But if you traverse it on the fly and don't store it in a val, the garbage collector can clean up the already-traversed elements.
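A small sketch of that memoization (again, the println only marks evaluation):
val xs = LazyList.from(1).map { x => println(s"computing $x"); x * x }
xs.take(3).toList // prints "computing 1" .. "computing 3"
xs.take(3).toList // prints nothing: the first three elements are memoized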
Stream in Scala 2.12 was considerably slower than Views or Iterators. I'm not sure if this applies to LazyList in Scala 2.13.
So every lazy sequence has some caveat:
View can execute transformations multiple times.
Iterator can be consumed only once.
LazyList can allocate the memory for all the sequence elements.
In your use case, I believe Iterator is the most appropriate:
val nodes = debts.iterator
.flatMap { case Debt(from, to, _) => List(from, to) }
.distinct
.map(name => (name, new Node(name)))
.toMap
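For reference, here is a self-contained version of that recommendation; the Debt and Node definitions are just assumed minimal stand-ins for the ones in the question:
case class Debt(from: String, to: String, amount: Int)
class Node(val name: String)

val debts = Seq(Debt("alice", "bob", 10), Debt("bob", "carol", 5))
val nodes: Map[String, Node] = debts.iterator
  .flatMap { case Debt(from, to, _) => List(from, to) }
  .distinct // Iterator#distinct exists since 2.13
  .map(name => (name, new Node(name)))
  .toMap
// nodes.keySet == Set("alice", "bob", "carol")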

Related

How can I convert a list of Either to a list of Right values?

I do some data conversion on a list of string and I get a list of Either where Left represents an error and Right represents a successfully converted item.
val results: Seq[Either[String, T]] = ...
I partition my results with:
val (errors, items) = results.partition(_.isLeft)
After doing some error processing I want to return a Seq[T] of valid items. That means returning the value of all Right elements. Because of the partitioning I already know that all elements of items are Right. I have come up with five possibilities for how to do it. But which is best in readability and performance? Is there an idiomatic way to do it in Scala?
// which variant is most scala like and still understandable?
items.map(_.right.get)
items.map(_.right.getOrElse(null))
items.map(_.asInstanceOf[Right[String, T]].value)
items.flatMap(_.toOption)
items.collect{case Right(item) => item}
Using .get is considered a "code smell": it will work in this case, but it makes the reader of the code pause and spend a few extra "cycles" in order to "prove" that it is ok. It is better to avoid using things like .get on Either and Option, or .apply on a Map or an IndexedSeq.
.getOrElse is ok ... but null is not something you often see in scala code. Again, makes the reader stop and think "why is this here? what will happen if it ends up returning null?" etc. Better to avoid as well.
.asInstanceOf is ... just bad. It breaks type safety, and is just ... not scala.
That leaves .flatMap(_.toOption) or .collect. Both are fine. I would personally prefer the latter as it is a bit more explicit (and does not make the reader stop to remember which way Either is biased).
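For illustration, a small sketch of those two (assuming Scala 2.12+, where Either is right-biased and has toOption directly):
val items: Seq[Either[String, Int]] = Seq(Right(1), Right(2))
items.collect { case Right(v) => v } // Seq(1, 2)
items.flatMap(_.toOption)            // Seq(1, 2)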
You could also use foldRight to do both partition and extract in one "go":
val (errors, items) = results.foldRight[(List[String], List[T])]((Nil, Nil)) {
  case (Left(error), (e, i)) => (error :: e, i)
  case (Right(result), (e, i)) => (e, result :: i)
}
Starting in Scala 2.13, you'll probably prefer partitionMap to partition.
It partitions elements based on a function which returns either Right or Left, which in your case is simply the identity:
val (lefts, rights) = List(Right(1), Left("2"), Left("3")).partitionMap(identity)
// val lefts: List[String] = List(2, 3)
// val rights: List[Int] = List(1)
which lets you use lefts and rights independently, each with the right type.
Going through them one by one:
items.map(_.right.get)
You already know that these are all Rights. This will be absolutely fine.
items.map(_.right.getOrElse(null))
The .getOrElse is unnecessary here, as you already know the Left case should never happen. Rather than meddling with null values, I would recommend throwing an exception if you do (somehow) find a Left, something like this: items.map(x => x.right.getOrElse(throw new Exception(s"Unexpected Left: [$x]"))) (or whatever exception you think is suitable).
items.map(_.asInstanceOf[Right[String, T]].value)
This is unnecessarily complicated. I also can't get this to compile, but I may be doing something wrong. Either way there's no need to use asInstanceOf here.
items.flatMap(_.toOption)
I also can't get this to compile. items.flatMap(_.right.toOption) compiles for me, but the Option detour is redundant when you already know every element is a Right.
items.collect{case Right(item) => item}
This is another case of "it works, but why be so complicated?". The pattern also isn't exhaustive: a Left element would just be silently discarded. But since that should never happen here, there's no need to use .collect.
Another way you could get the right values out is with pattern matching:
items.map {
  case Right(value) => value
  case other => throw new Exception(s"Unexpected Left: $other")
}
But again, this is probably unnecessary as you already know that all values will be Right.
If you are going to partition results like this, I recommend the first option, items.map(_.right.get). The other options either contain unreachable code (code you'd never be able to hit through unit tests or real-life operation) or are unnecessarily complicated for the sake of "looking functional".

scala.collection.breakOut vs views

This SO answer describes how scala.collection.breakOut can be used to prevent creating wasteful intermediate collections. For example, here we create an intermediate Seq[(String,String)]:
val m = List("A", "B", "C").map(x => x -> x).toMap
By using breakOut we can prevent the creation of this intermediate Seq:
val m: Map[String,String] = List("A", "B", "C").map(x => x -> x)(breakOut)
Views solve the same problem and in addition access elements lazily:
val m = (List("A", "B", "C").view map (x => x -> x)).toMap
I am assuming the creation of the View wrappers is fairly cheap, so my question is: Is there any real reason to use breakOut over Views?
You're going to make a trip from England to France.
With view: you're taking a set of notes in your notebook, and boom, once you've called .force you start acting on all of them: buy a ticket, board the plane, ....
With breakOut: you're departing, and boom, you're in Paris looking at the Eiffel Tower. You don't remember exactly how you got there, but you did make the trip; you just didn't make any memories (intermediate collections) along the way.
It's a crude analogy, but I hope it gives you a taste of the difference between them.
I don't think views and breakOut are identical.
A breakOut is a CanBuildFrom implementation used to simplify transformation operations by eliminating intermediary steps, e.g. getting from A to B without the intermediary collection. Using breakOut means letting Scala choose the appropriate builder object for maximum efficiency of producing new items in a given scenario.
views deal with a different type of efficiency, the main sales pitch being: "No more new objects". Views store lightweight references to objects in order to tackle different usage scenarios: lazy access, etc.
Bottom line:
If you map on a view you may still get an intermediary collection of references created before the expected result can be produced. You could still get superior performance from:
collection.view.map(someFn)(breakOut)
than from:
collection.view.map(someFn)
As of Scala 2.13, this is no longer a concern: breakOut has been removed, and views are the recommended replacement.
Scala 2.13 Collections Rework
Views are also the recommended replacement for collection.breakOut.
For example,
val s: Seq[Int] = ...
val set: Set[String] = s.map(_.toString)(collection.breakOut)
can be expressed with the same performance characteristics as:
val s: Seq[Int] = ...
val set = s.view.map(_.toString).to(Set)
What flavian said.
One use case for views is to conserve memory. For example, if you had a million-character-long string original, and needed to use, one by one, all of the million suffixes of that string, you might use a collection of
val v = original.view
val suffixes = v.tails
views on the original string. Then you might loop over the suffixes one by one, using suffix.force to convert each back to a string inside the loop, thus holding only one in memory at a time. Of course, you could do the same thing by iterating with your own loop over the indices of the original string, rather than creating any kind of collection of the suffixes.
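A rough sketch of the same idea in 2.13 terms (where tails on a view yields the suffixes lazily, and mkString plays the role that force played in 2.12):
val original = "abcdefgh" * 100000
original.view.tails.foreach { suffix =>
  val s = suffix.mkString // materializes only this one suffix
  // ... process s, then let it be garbage-collected ...
}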
Another use-case is when creation of the derived objects is expensive, you need them in a collection (say, as values in a map), but you will only access a few, and you don't know which ones.
If you really have a case where picking between them makes sense, prefer breakOut unless there's a good argument for using view (like those above).
Views require more code changes and care than breakOut, in that you need to add force where needed; depending on context, failure to do so is often only detected at run-time. With breakOut, generally if it compiles, it's right.
In cases where view does not apply, breakOut will be faster, since view generation and forcing is skipped.
If you use a debugger, you can inspect the collection contents, which you can't meaningfully do with a collection of views.

Scala Iterable Memory Leaks

I recently started playing with Scala and ran across the following. Below are 4 different ways to iterate through the lines of a file, do some stuff, and write the result to another file. Some of these methods work as I would think (though using a lot of memory to do so) and some eat memory to no end.
The idea was to wrap Scala's getLines Iterator as an Iterable. I don't care if it reads the file multiple times - that's what I expect it to do.
Here's my repro code:
class FileIterable(file: java.io.File) extends Iterable[String] {
  override def iterator = io.Source.fromFile(file).getLines
}
// Iterator
// Option 1: Direct iterator - holds at 100MB
def lines = io.Source.fromFile(file).getLines
// Option 2: Get iterator via method - holds at 100MB
def lines = new FileIterable(file).iterator
// Iterable
// Option 3: TraversableOnce wrapper - holds at 2GB
def lines = io.Source.fromFile(file).getLines.toIterable
// Option 4: Iterable wrapper - leaks like a sieve
def lines = new FileIterable(file)
def values = lines
  .drop(1)
  //.map(l => l.split("\t")).map(l => l.reduceLeft(_ + "|" + _))
  //.filter(l => l.startsWith("*"))

val writer = new java.io.PrintWriter(new java.io.File("out.tsv"))
values.foreach(v => writer.println(v))
writer.close()
The file it's reading is ~10GB with 1MB lines.
The first two options iterate the file using a constant amount of memory (~100MB). This is what I would expect. The downside here is that an iterator can only be used once, and it relies on Scala's call-by-name convention as a pseudo-Iterable. (For reference, the equivalent C# code uses ~14MB.)
The third method calls toIterable, defined in TraversableOnce. This one works, but it uses about 2GB to do the same work. No idea where the memory is going, since it can't be caching the entire Iterable.
The fourth is the most alarming - it immediately uses all available memory and throws an OOM exception. Even weirder is that it does this for all of the operations I've tested: drop, map, and filter. Looking at the implementations, none of them seem to maintain much state (though the drop looks a little suspect - why does it not just count the items?). If I do no operations, it works fine.
My guess is that somewhere it's maintaining references to each of the lines read, though I can't imagine how. I've seen the same memory usage when passing Iterables around in Scala. For example if I take case 3 (.toIterable) and pass that to a method that writes an Iterable[String] to a file, I see the same explosion.
Any ideas?
Note how the ScalaDoc of Iterable says:
Implementations of this trait need to provide a concrete method with
signature:
def iterator: Iterator[A]
They also need to provide a method newBuilder which creates a builder
for collections of the same kind.
Since you don't provide an implementation for newBuilder, you get the default implementation, which uses a ListBuffer and thus tries to fit everything into memory.
You might want to implement Iterable.drop as
def drop(n: Int) = iterator.drop(n).toIterable
but that would break with the representation invariance of the collection library (i.e. iterator.toIterable returns a Stream, while you want List.drop to return a List etc - thus the need for the Builder concept).
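Given that, a minimal sketch of the usual workaround: stay on the Iterator itself for the whole pipeline, so no newBuilder (and hence no ListBuffer) is ever involved:
def values(file: java.io.File): Iterator[String] =
  io.Source.fromFile(file).getLines.drop(1).filter(_.startsWith("*"))
// Each matching line is read, filtered and handed to the caller one at a time.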

why use foldLeft instead of procedural version?

So in reading this question it was pointed out that instead of the procedural code:
def expand(exp: String, replacements: Traversable[(String, String)]): String = {
  var result = exp
  for ((oldS, newS) <- replacements)
    result = result.replace(oldS, newS)
  result
}
You could write the following functional code:
def expand(exp: String, replacements: Traversable[(String, String)]): String = {
  replacements.foldLeft(exp) {
    case (result, (oldS, newS)) => result.replace(oldS, newS)
  }
}
I would almost certainly write the first version because coders familiar with either procedural or functional styles can easily read and understand it, while only coders familiar with functional style can easily read and understand the second version.
But setting readability aside for the moment, is there something that makes foldLeft a better choice than the procedural version? I might have thought it would be more efficient, but it turns out that the implementation of foldLeft is actually just the procedural code above. So is it just a style choice, or is there a good reason to use one version or the other?
Edit: Just to be clear, I'm not asking about other functions, just foldLeft. I'm perfectly happy with the use of foreach, map, filter, etc. which all map nicely onto for-comprehensions.
Answer: There are really two good answers here (provided by delnan and Dave Griffith) even though I could only accept one:
Use foldLeft because there are additional optimizations, e.g. using a while loop which will be faster than a for loop.
Use fold if it ever gets added to regular collections, because that will make the transition to parallel collections trivial.
It's shorter and clearer - yes, you need to know what a fold is to understand it, but when you're programming in a language that's 50% functional, you should know these basic building blocks anyway. A fold is exactly what the procedural code does (repeatedly applying an operation), but it's given a name and generalized. And while it's only a small wheel you're reinventing, it's still a reinvented wheel.
And in case the implementation of foldLeft should ever get some special perk - say, extra optimizations - you get that for free, without updating countless methods.
Other than a distaste for mutable variables (even mutable locals), the basic reason to use fold in this case is clarity, with occasional brevity. Most of the wordiness of the fold version comes from having to use an explicit function definition with a destructuring bind. If each element in the list is used precisely once in the fold operation (a common case), this can be simplified to the short form. Thus the classic definition of the sum of a collection of numbers
collection.foldLeft(0)(_+_)
is much simpler and shorter than any equivalent imperative construct.
One additional meta-reason to use functional collection operations, although not directly applicable in this case, is to enable a move to using parallel collection operations if needed for performance. Fold can't be parallelized, but often fold operations can be turned into commutative-associative reduce operations, and those can be parallelized. With Scala 2.9, changing something from non-parallel functional to parallel functional utilizing multiple processing cores can sometimes be as easy as dropping a .par onto the collection you want to execute parallel operations on.
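For instance, a sketch of that transition (note that in modern Scala the .par syntax needs the separate scala-parallel-collections module):
val xs = (1 to 10000).toVector
val seqSum = xs.foldLeft(0)(_ + _) // a left fold: inherently sequential
val parSum = xs.par.reduce(_ + _)  // associative and commutative, so it can run in parallel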
One word I haven't seen mentioned here yet is declarative:
Declarative programming is often defined as any style of programming that is not imperative. A number of other common definitions exist that attempt to give the term a definition other than simply contrasting it with imperative programming. For example:
A program that describes what computation should be performed and not how to compute it
Any programming language that lacks side effects (or more specifically, is referentially transparent)
A language with a clear correspondence to mathematical logic.
These definitions overlap substantially.
Higher-order functions (HOFs) are a key enabler of declarativity, since we only specify the what (e.g. "using this collection of values, multiply each value by 2, sum the result") and not the how (e.g. initialize an accumulator, iterate with a for loop, extract values from the collection, add to the accumulator...).
Compare the following:
// Sugar-free Scala (still better than Java<5)
def sumDoubled1(xs: List[Int]) = {
  var sum = 0                  // Initialized correctly?
  for (i <- 0 until xs.size) { // Fenceposts?
    sum = sum + (xs(i) * 2)    // Correct value being extracted?
                               // Value extraction and +/* smashed together
  }
  sum                          // Correct value returned?
}
// Iteration sugar (similar to Java 5)
def sumDoubled2(xs: List[Int]) = {
  var sum = 0
  for (x <- xs)         // We don't need to worry about fenceposts or
    sum = sum + (x * 2) // value extraction anymore; that's progress
  sum
}
// Verbose Scala
def sumDoubled3(xs: List[Int]) =
  xs.map((x: Int) => x * 2)                // the doubling
    .reduceLeft((x: Int, y: Int) => x + y) // the addition

// Idiomatic Scala
def sumDoubled4(xs: List[Int]) =
  xs.map(_ * 2).reduceLeft(_ + _) // map doubles, reduceLeft adds
Note that our first example, sumDoubled1, is already more declarative than (most would say superior to) C/C++/Java<5 for loops, because we haven't had to micromanage the iteration state and termination logic, but we're still vulnerable to off-by-one errors.
Next, in sumDoubled2, we're basically at the level of Java>=5. There are still a couple things that can go wrong, but we're getting pretty good at reading this code-shape, so errors are quite unlikely. However, don't forget that a pattern that's trivial in a toy example isn't always so readable when scaled up to production code!
With sumDoubled3, desugared for didactic purposes, and sumDoubled4, the idiomatic Scala version, the iteration, initialization, value extraction and choice of return value are all gone.
Sure, it takes time to learn to read the functional versions, but we've drastically foreclosed our options for making mistakes. The "business logic" is clearly marked, and the plumbing is chosen from the same menu that everyone else is reading from.
It is worth pointing out that there is another way of calling foldLeft which takes advantage of:
The ability to use (almost) any Unicode symbol in an identifier
The feature that if a method name ends with a colon :, and is called infix, then the target and parameter are switched
For me this version is much clearer, because I can see that I am folding the expr value into the replacements collection:
def expand(expr: String, replacements: Traversable[(String, String)]): String = {
  (expr /: replacements) { case (r, (o, n)) => r.replace(o, n) }
}
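For comparison, the two spellings are equivalent; note that /: was deprecated in Scala 2.13 in favour of the named method:
(expr /: replacements) { case (r, (o, n)) => r.replace(o, n) }
replacements.foldLeft(expr) { case (r, (o, n)) => r.replace(o, n) }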

Use-cases for Streams in Scala

In Scala there is a Stream class that is very much like an iterator. The topic Difference between Iterator and Stream in Scala? offers some insights into the similarities and differences between the two.
Seeing how to use a stream is pretty simple but I don't have very many common use-cases where I would use a stream instead of other artifacts.
The ideas I have right now:
If you need to make use of an infinite series. But this does not seem like a common use-case to me so it doesn't fit my criteria. (Please correct me if it is common and I just have a blind spot)
If you have a series of data where each element needs to be computed but that you may want to reuse several times. This is weak because I could just load it into a list which is conceptually easier to follow for a large subset of the developer population.
Perhaps there is a large set of data, or a computationally expensive series, and there is a high probability that the items you need will not require visiting all of the elements. In that case an Iterator would be a good match, unless you need to do several searches, in which case you could use a list as well, even if it would be slightly less efficient.
There is a complex series of data that needs to be reused. Again a list could be used here. Although in this case both cases would be equally difficult to use and a Stream would be a better fit since not all elements need to be loaded. But again not that common... or is it?
So have I missed any big uses? Or is it a developer preference for the most part?
Thanks
The main difference between a Stream and an Iterator is that the latter is mutable and "one-shot", so to speak, while the former is not. Iterator has a better memory footprint than Stream, but the fact that it is mutable can be inconvenient.
Take this classic prime number generator, for instance:
def primeStream(s: Stream[Int]): Stream[Int] =
  Stream.cons(s.head, primeStream(s.tail filter { _ % s.head != 0 }))
val primes = primeStream(Stream.from(2))
It can easily be written with an Iterator as well, but an Iterator won't keep the primes computed so far.
So, one important aspect of a Stream is that you can pass it to other functions without having it duplicated first, or having to generate it again and again.
As for expensive computations/infinite lists, these things can be done with Iterator as well. Infinite lists are actually quite useful -- you just don't know it because you've never had them, so you have seen algorithms that are more complex than strictly necessary, just to deal with enforced finite sizes.
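A classic illustration of such an infinite, memoized sequence is the Fibonacci Stream (in 2.13 you'd write LazyList instead):
lazy val fibs: Stream[BigInt] =
  BigInt(0) #:: BigInt(1) #:: fibs.zip(fibs.tail).map { case (a, b) => a + b }
fibs.take(10).toList // List(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)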
In addition to Daniel's answer, keep in mind that Stream is useful for short-circuiting evaluations. For example, suppose I have a huge set of functions that take String and return Option[String], and I want to keep executing them until one of them works:
val stringOps = List(
  (s: String) => if (s.length > 10) Some(s.length.toString) else None,
  (s: String) => if (s.length == 0) Some("empty") else None,
  (s: String) => if (s.indexOf(" ") >= 0) Some(s.trim) else None
)
Well, I certainly don't want to execute the entire list, and there isn't any handy method on List that says, "treat these as functions and execute them until one of them returns something other than None". What to do? Perhaps this:
def transform(input: String, ops: List[String => Option[String]]) = {
  ops.toStream.map(_(input)).find(_.isDefined).getOrElse(None)
}
This takes a list and treats it as a Stream (which doesn't actually evaluate anything), then defines a new Stream that is a result of applying the functions (but that doesn't evaluate anything either yet), then searches for the first one which is defined--and here, magically, it looks back and realizes it has to apply the map, and get the right data from the original list--and then unwraps it from Option[Option[String]] to Option[String] using getOrElse.
Here's an example:
scala> transform("This is a really long string",stringOps)
res0: Option[String] = Some(28)
scala> transform("",stringOps)
res1: Option[String] = Some(empty)
scala> transform(" hi ",stringOps)
res2: Option[String] = Some(hi)
scala> transform("no-match",stringOps)
res3: Option[String] = None
But does it work? If we put a println into our functions so we can tell if they're called, we get
val stringOps = List(
  (s: String) => { println("1"); if (s.length > 10) Some(s.length.toString) else None },
  (s: String) => { println("2"); if (s.length == 0) Some("empty") else None },
  (s: String) => { println("3"); if (s.indexOf(" ") >= 0) Some(s.trim) else None }
)
// (transform is the same)
scala> transform("This is a really long string",stringOps)
1
res0: Option[String] = Some(28)
scala> transform("no-match",stringOps)
1
2
3
res1: Option[String] = None
(This is with Scala 2.8; 2.7's implementation will sometimes overshoot by one, unfortunately. And note that you do accumulate a long list of None as your failures accrue, but presumably this is inexpensive compared to your true computation here.)
I could imagine that if you poll some device in real time, a Stream is more convenient.
Think of a GPS tracker, which returns the current position when you ask it. You can't precompute the location where you will be in 5 minutes. You might use it for a few minutes only, to update a path in OpenStreetMap, or you might use it for a six-month expedition in a desert or the rain forest.
Or a digital thermometer or other kinds of sensors which repeatedly return new data, as long as the hardware is alive and turned on - a log file filter could be another example.
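A hedged sketch of that polling idea; sensor here stands in for whatever device call you actually have:
def readings(sensor: () => Double): Stream[Double] =
  sensor() #:: readings(sensor) // head is read now, the rest only on demand
// e.g. readings(readGps).take(100).foreach(updatePath) -- hypothetical names
// Caveat: holding a reference to the head memoizes (retains) every reading.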
Stream is to Iterator as immutable.List is to mutable.List. Favouring immutability prevents a class of bugs, occasionally at the cost of performance.
scalac itself isn't immune to these problems: http://article.gmane.org/gmane.comp.lang.scala.internals/2831
As Daniel points out, favouring laziness over strictness can simplify algorithms and make it easier to compose them.