What is the difference between Scala Stream vs Scala List vs Scala Sequence? - scala

I have a scenario where I get DB data in the form of a Stream of objects,
and transforming it into a sequence of objects is taking time.
I am looking for an alternative that takes less time.

Quick answer: a Scala stream is already a Scala sequence and does not need to be converted at all. Further explanation below...
A Scala sequence (scala.collection.Seq) is simply any collection that stores a sequence of elements in a specific order (the ordering is arbitrary, but element order doesn't change once defined).
A Scala list (scala.collection.immutable.List) is a subclass of Seq and is also the default implementation of a scala.collection.Seq. That is, Seq(1, 2, 3) is implemented as a List(1, 2, 3). Lists are strict, so any operation on a list processes all elements, one after the other, before another operation can be performed.
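You can see this default in the REPL:
scala> Seq(1, 2, 3)
res0: Seq[Int] = List(1, 2, 3)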
To see the strictness of List in action, consider this session in the Scala REPL:
$ scala
Welcome to Scala 2.12.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171).
Type in expressions for evaluation. Or try :help.
scala> val xs = List(1, 2, 3)
xs: List[Int] = List(1, 2, 3)
scala> xs.map {x =>
| val newX = 2 * x
| println(s"Mapping value $x to $newX...")
| newX
| }.foreach {x =>
| println(s"Printing value $x")
| }
Mapping value 1 to 2...
Mapping value 2 to 4...
Mapping value 3 to 6...
Printing value 2
Printing value 4
Printing value 6
Note how each value is mapped, creating a new list (List(2, 4, 6)), before any of the values of that new list are printed out.
A Scala stream (scala.collection.immutable.Stream) is also a subclass of Seq, but it is lazy (or non-strict), meaning that the next value from the stream is only taken when required. It is often referred to as a lazy list.
To illustrate the difference between a Stream and a List, let's redo that example:
scala> val xs = Stream(1, 2, 3)
xs: scala.collection.immutable.Stream[Int] = Stream(1, ?)
scala> xs.map {x =>
| val newX = 2 * x
| println(s"Mapping value $x to $newX...")
| newX
| }.foreach {x =>
| println(s"Printing value $x")
| }
Mapping value 1 to 2...
Printing value 2
Mapping value 2 to 4...
Printing value 4
Mapping value 3 to 6...
Printing value 6
Note how, for a Stream, we only move on to the next element after all of the operations for the previous element have been completed. The map operation still returns a new stream (Stream(2, 4, 6)), but values are only taken when needed.
Whether a Stream performs better than a List in any particular situation will depend upon what you're trying to do. If performance is your primary goal, I suggest that you benchmark your code (using a tool such as ScalaMeter) to determine which type works best.
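For example, a minimal sketch along the lines of ScalaMeter's quick-start API (the collection size and the groupBy workload here are placeholders for your own code):
import org.scalameter._

val listTime   = measure { (1 to 100000).toList.groupBy(_ % 10) }   // strict: builds the whole list first
val streamTime = measure { (1 to 100000).toStream.groupBy(_ % 10) } // lazy: elements taken on demand
println(s"List: $listTime, Stream: $streamTime")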
BTW, since both Stream and List are subclasses of Seq, it is common practice to write code that requires a sequence to utilize Seq. That way, you can supply a List or a Stream or any other Seq subclass, without having to change your code, and without having to convert lists, streams, etc. to sequences. For example:
def doSomethingWithSeq[T](seq: Seq[T]) = {
//
}
// This works!
val list = List(1, 2, 3)
doSomethingWithSeq(list)
// This works too!
val stream = Stream(4, 5, 6)
doSomethingWithSeq(stream)
UPDATED
The performance of List vs. Stream for a groupBy operation is going to be very similar. Depending upon how it's used, a Stream can require less memory than a List, but might require a little extra CPU time. If collection performance is definitely the issue, benchmark both types of collection (see above) and measure precisely to determine the trade-offs between the two. I cannot make that determination for you. It's possible that the slowness you refer to is down to the transmission of data between the database and your application, and has nothing to do with the collection type.
For general information on Scala collection performance, refer to Collections: Performance Characteristics.
UPDATED 2
Also note that any type of Scala sequence will typically be processed sequentially (hence the name), by a single thread at a time. Neither List nor Stream lend themselves to parallel processing of their elements. If you need to process a collection in parallel, you'll need a parallel collection type (one of the collections in scala.collection.parallel). A scala.collection.parallel.ParSeq should process groupBy faster than a List or a Stream, but only if you have multiple cores/hyperthreads available. However, ParSeq operations do not guarantee to preserve the order of the grouped-by elements.
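For example (a minimal sketch; .par is available directly on standard collections in Scala 2.12, while in 2.13+ it requires the separate scala-parallel-collections module):
val data = (1 to 1000000).toList
val grouped = data.par.groupBy(_ % 10) // a parallel map; the work is spread across cores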

Related

Can I use function composition to avoid the "temporary list" in scala?

On page 64 of FPiS (Functional Programming in Scala), it says about
List(1,2,3,4).map(_ + 10).filter(_ % 2 == 0).map(_ * 3)
"each transformation
will produce a temporary list that only ever gets used as input to the next transformation
and is then immediately discarded"
So can't the compiler or the library help avoid this?
If so, does this Haskell code also produce a temporary list?
map (*2) (map (+1) [1,2,3])
If it does, can I use function composition to avoid it?
map ((*2).(+1)) [1,2,3]
If I can use function composition to avoid the temporary list in Haskell, can I also use it to avoid the temporary list in Scala?
I know Scala uses the function "compose" to compose functions: https://www.geeksforgeeks.org/scala-function-composition/
So can I write this to avoid the temporary list in Scala?
((map(x:Int=>x+10)) compose (filter(x=>x%2==0)) compose (map(x=>x*3)) (List(1,2,3,4))
(IDEA told me I can't)
Thanks!
The compiler is not supposed to. If you consider map fusion, it works nicely with pure functions:
List(1, 2, 3).map(_ + 1).map(_ * 10)
// can be fused to
List(1, 2, 3).map(x => (x + 1) * 10)
However, Scala is not a purely functional language, nor does it have any notion of purity in it that compiler could track. For example, with side-effects there's a difference in behavior:
List(1, 2, 3).map { i => println(i); i + 1 }.map { i => println(i); i * 10 }
// prints 1, 2, 3, 2, 3, 4
List(1, 2, 3).map { i =>
println(i)
val j = i + 1
println(j)
j * 10
}
// prints 1, 2, 2, 3, 3, 4
Another thing to note is that Scala's List is a strict collection: if you have a reference to a list, all of its elements are already allocated in memory. A Haskell list, on the contrary, is lazy (like most things in Haskell), so even if a temporary "list shell" is created, its elements are kept unevaluated until needed. That also allows Haskell lists to be infinite (you can write [1..] for the increasing numbers).
The closest Scala counterpart to a Haskell list is LazyList, which doesn't evaluate its elements until requested, and then caches them. So doing
LazyList(1,2,3,4).map(_ + 10).filter(_ % 2 == 0).map(_ * 3)
would allocate intermediate LazyList instances, but not calculate/allocate any elements in them until they are requested from the final list. LazyList is also suitable for infinite collections (LazyList.from(1) is analogous to the Haskell example above, except that its elements are Ints).
Here, in fact, doing map with side effects twice or fusing them by hand makes no difference.
You can switch any collection to be "lazy" by calling .view, or just work with iterators by calling .iterator (they have largely the same API as any collection), and then go back to a concrete collection with .to(Collection), so something like:
List(1,2,3,4).view.map(_ + 10).filter(_ % 2 == 0).map(_ * 3).to(List)
would make a List without any intermediaries. The catch is that it's not necessarily faster (though it is usually more memory-efficient).
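An equivalent sketch using an iterator instead of a view:
List(1,2,3,4).iterator.map(_ + 10).filter(_ % 2 == 0).map(_ * 3).toList
// List(36, 42), with no intermediate List built along the way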
You can avoid these temporary lists by using views:
https://docs.scala-lang.org/overviews/collections-2.13/views.html
It's also possible to use function composition to express the function that you asked about:
((_: List[Int]).map(_ + 10) andThen (_: List[Int]).filter(_ % 2 == 0) andThen (_: List[Int]).map(_ * 3))(List(1, 2, 3, 4))
But this will not avoid the creation of temporary lists, and due to Scala's limited type inference, it's usually more trouble than it's worth, because you often end up having to annotate types explicitly.

What is the difference between partition and groupBy?

I am reading through Twitter's Scala School right now and was looking at the groupBy and partition methods for collections. And I am not exactly sure what the difference between the two methods is.
I did some testing on my own:
scala> List(1, 2, 3, 4, 5, 6).partition(_ % 2 == 0)
res8: (List[Int], List[Int]) = (List(2, 4, 6),List(1, 3, 5))
scala> List(1, 2, 3, 4, 5, 6).groupBy(_ % 2 == 0)
res9: scala.collection.immutable.Map[Boolean,List[Int]] = Map(false -> List(1, 3, 5), true -> List(2, 4, 6))
So does this mean that partition returns a list of two lists and groupBy returns a Map with boolean keys and list values? Both have the same "effect" of splitting a list into two different parts based on a condition. I am not sure why I would use one over the other. So, when would I use partition over groupBy and vice-versa?
groupBy is better suited for lists of more complex objects.
Say, you have a class:
case class Beer(name: String, cityOfBrewery: String)
and a List of beers:
val beers = List(Beer("Bitburger", "Bitburg"), Beer("Frueh", "Cologne") ...)
you can then group beers by cityOfBrewery:
val beersByCity = beers.groupBy(_.cityOfBrewery)
Now you can get yourself a list of all beers brewed in any city you have in your data:
beersByCity("Cologne") // List(Beer("Frueh", "Cologne"), ...)
Neat, isn't it?
And I am not exactly sure what the difference between the two methods is.
The difference is in their signature. partition expects a function A => Boolean while groupBy expects a function A => K.
It appears that in your case the function you apply with groupBy is A => Boolean too, but you won't always want to do this; sometimes you want to group by a function that doesn't return a Boolean.
For example, if you want to group a List of strings by their length, you need to do it with groupBy.
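In the REPL:
scala> List("a", "bb", "cc", "ddd").groupBy(_.length)
res0: scala.collection.immutable.Map[Int,List[String]] = Map(1 -> List(a), 2 -> List(bb, cc), 3 -> List(ddd))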
So, when would I use partition over groupBy and vice-versa?
Use groupBy if the image of the function you apply is not the Boolean set (i.e. f(x) for an input x yields something other than a Boolean). Otherwise you can use either; it's up to you whether you prefer a Map or a (List, List) as output.
partition is for when you need to split a collection in two based on yes/no logic (even/odd numbers, uppercase/lowercase letters, you name it). groupBy has a more general usage: producing many groups based on some function. Say you want to split a corpus of words into bins depending on their first letter (resulting in up to 26 groups); that is simply not possible with partition.
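A sketch of that first-letter binning (the word list is just an illustration):
val words = List("apple", "avocado", "banana", "cherry")
val bins = words.groupBy(_.head)
// Map(a -> List(apple, avocado), b -> List(banana), c -> List(cherry))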

How to remove duplicates from collection (without creating new ones in-between)?

So first up, I'm fully aware mutation is a bad idea, but I need to keep object creation down to a minimum as I have an incredibly huge amount of data to process (keeps GC hang time down and speeds up my code).
What I want is a scala collection that has a method like distinct or similar, or possibly a library or code snippet (but native scala preferred) such that the method is side effecting / mutating the collection rather than creating a new collection.
I've explored the usual suspects like ArrayBuffer, mutable.List, Array, MutableList, Vector, and they all "create a new sequence" from the original rather than mutate the original in place. Am I trying to find something that does not exist? Will I just have to write my own?
I think this exists in C++ http://www.cplusplus.com/reference/algorithm/unique/
Also, mega mega bonus points if there is some kind of awesome tail-recursive way of doing this, so that any bookkeeping structures created are kept in a single stack frame and thus deallocated from memory once the method exits. The reason this would be uber cool is that even if the method creates some instances of things in order to perform the removal of duplicates, those instances would not need to be garbage collected and therefore would not contribute to massive GC hangs. It doesn't have to be recursion, as long as it's likely to cause the objects to go on the stack (see escape analysis here: http://www.ibm.com/developerworks/java/library/j-jtp09275/index.html).
(Also if I can specify and fix the capacity (size in memory) of the collection that would also be great)
The algorithm you mentioned (for C++) is for consecutive duplicates. If consecutive duplicates are all you need to remove, you could use some LinkedList, but mutable lists have been deprecated. On the other hand, if you want something memory-efficient and can live with linear access, you can wrap your collection (mutable or immutable) in a "distinct" iterator (O(N)):
def toConsDist[T](c: Traversable[T]) = new Iterator[T] {
  val i = c.toIterator
  var prev: Option[T] = None  // last element returned
  var _nxt: Option[T] = None  // cached lookahead
  def nxt = {
    // advance to the next element that differs from the previous one
    if (_nxt.isEmpty) _nxt = i.find(x => !prev.toList.contains(x))
    prev = _nxt
    _nxt
  }
  def hasNext = nxt.nonEmpty
  def next = {
    val next = nxt.get
    _nxt = None  // invalidate the cache so nxt advances next time
    next
  }
}
scala> toConsDist(List(1,1,1,2,2,3,3,3,2,2)).toList
res44: List[Int] = List(1, 2, 3, 2)
If you need to remove all duplicates, it will be O(N*N), but you can't use Scala collections for that, because of https://github.com/scala/scala/commit/3cc99d7b4aa43b1b06cc837a55665896993235fc (see the LinkedList part) and https://stackoverflow.com/a/27645224/1809978.
But you may use Java's LinkedList:
import scala.collection.JavaConverters._
scala> val mlist = new java.util.LinkedList[Integer]
mlist: java.util.LinkedList[Integer] = []
scala> mlist.asScala ++= List(1,1,1,2,2,3,3,3,2,2)
res74: scala.collection.mutable.Buffer[Integer] = Buffer(1, 1, 1, 2, 2, 3, 3, 3, 2, 2)
scala> var i = 0
i: Int = 0
scala> for(x <- mlist.asScala){ if (mlist.indexOf(x) != i) mlist.set(i, null); i+=1} //O(N*N)
scala> while(mlist.remove(null)){} //O(N*N)
scala> mlist
res77: java.util.LinkedList[Integer] = [1, 2, 3]
mlist.asScala just creates a wrapper without any copying. You can't modify Java's LinkedList during iteration, which is why I used nulls. You may try Java's ConcurrentLinkedQueue, but it doesn't support indexOf, so you would have to implement that yourself (Scala maps it to an Iterator, so asScala.indexOf won't work).
By definition, immutability forces you to create new objects whenever you want to change your collection.
What Scala provides for some collections are buffers, which allow you to build a collection through a mutable interface and finally return an immutable version. But once you have your immutable collection, you can't change its references in any way, and that includes filtering operations such as distinct. The furthest you can go concerning mutability in an immutable collection is changing the state of its elements, when those are mutable objects.
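A minimal sketch of that buffer-then-freeze pattern:
import scala.collection.mutable.ListBuffer

val buf = ListBuffer.empty[Int] // mutable while building
buf += 1
buf += 2
buf += 2
val frozen: List[Int] = buf.toList // immutable from here on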
On the other hand, some collections, such as Vector, are implemented as trees (in this case, as a trie), and insert or delete operations are implemented not by copying the entire tree but only the required branches.
From Martin Odersky's Programming in Scala:
Updating an element in the middle of a vector can be done by copying the node that contains the element, and every node that points to it, starting from the root of the tree. This means that a functional update creates between one and five nodes that each contain up to 32 elements or subtrees. This is certainly more expensive than an in-place update in a mutable array, but still a lot cheaper than copying the whole vector.
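A small illustration of that structural sharing (updated returns a new vector, leaving the original untouched and sharing most of the trie):
val v1 = Vector(1, 2, 3, 4, 5)
val v2 = v1.updated(2, 99) // copies only the nodes on the path to index 2
// v1 == Vector(1, 2, 3, 4, 5), v2 == Vector(1, 2, 99, 4, 5)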

The easiest way to write {1, 2, 4, 8, 16} in Scala

I was advertising Scala to a friend (who uses Java most of the time) and he posed me a challenge: what's the way to write the array {1, 2, 4, 8, 16} in Scala?
I don't know functional programming that well, but I really like Scala. However, this is an iterative array formed by (n*(n-1)), but how do I keep track of the previous step? Is there a way to do it easily in Scala, or do I have to write more than one line of code to achieve this?
Array.iterate(1, 5)(2 * _)
or
Array.iterate(1, 5)(n => 2 * n)
Elaborating on this, as asked for in a comment. I don't know exactly what you want me to elaborate on; I hope you will find what you need below.
This is the function iterate(start, len)(f) on the Array object (scaladoc). That would be a static method in Java.
The point is to fill an array of len elements, starting from the first value start and always computing the next element by passing the previous one to the function f.
A basic implementation would be
import scala.reflect.ClassTag

def iterate[A: ClassTag](start: A, len: Int)(f: A => A): Array[A] = {
  val result = new Array[A](len)
  if (len > 0) {
    var current = start
    result(0) = current
    for (i <- 1 until len) {
      current = f(current)
      result(i) = current
    }
  }
  result
}
(The actual implementation, not much different, can be found here. It is a little different mostly because the same code is used for different data structures, e.g. List.iterate.)
Besides that, the implementation is very straightforward. The syntax may need some explanation:
def iterate[A](...): Array[A] makes it a generic method, usable for any type A. That would be public <A> A[] iterate(...) in Java.
ClassTag is just a technicality. In Scala, as in Java, you normally cannot create an array of a generic type (Java's new E[]), and the : ClassTag asks the compiler to add some magic. It is very similar to declaring, and passing at the call site, a Class<A> clazz parameter in Java, which can then be used to create the array by reflection. If you use e.g. List.iterate rather than Array.iterate, it is not needed.
Maybe more surprising are the two parameter lists: one with start and len, and then, in separate parentheses, the one with f. Scala allows a method to have several parameter lists. Here the reason is the peculiar way Scala does type inference: looking at the first parameter list, it determines what A is, based on the type of start. Only afterwards does it look at the second list, and by then it knows what type A is. Otherwise, it would need to be told, so if there had been only one parameter list, def iterate[A: ClassTag](start: A, len: Int, f: A => A),
then the call would have to be one of
Array.iterate(1, 5, (n: Int) => 2 * n)
Array.iterate[Int](1, 5, n => 2 * n)
Array.iterate(1, 5, 2 * (_: Int))
Array.iterate[Int](1, 5, 2 * _)
making Int explicit one way or another. So it is common in Scala to put a function argument in a separate argument list; the function's type might be much longer to write than just Int.
A => A is just syntactic sugar for the type Function1[A, A]. Obviously a functional language has functions as (first-class) values, and a typed functional language has types for functions.
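For instance, these two values are equivalent:
val double: Int => Int = n => 2 * n
val explicit: Function1[Int, Int] = new Function1[Int, Int] {
  def apply(n: Int): Int = 2 * n
}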
In the call iterate(1, 5)(n => 2 * n), n => 2 * n is the function value. A more complete declaration would be {n: Int => 2 * n}, but one may dispense with the Int for the reason stated above. Scala syntax is rather flexible; one may also dispense with either the parentheses or the braces, so it could be iterate(1, 5){n => 2 * n}. The braces allow a full block with several instructions, not needed here.
As for immutability: Array is inherently mutable; there is no way to put a value into an array except by changing the array at some point. My implementation (like the one in the library) also uses a mutable local var (current) and a side-effecting for loop, which is not strictly necessary; a (tail-)recursive implementation would be only a little longer to write and just as efficient. But a mutable local does not hurt much, and we are already dealing with a mutable array anyway.
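For the curious, a sketch of that tail-recursive variant (iterateRec is a made-up name, not a library method):
import scala.annotation.tailrec
import scala.reflect.ClassTag

def iterateRec[A: ClassTag](start: A, len: Int)(f: A => A): Array[A] = {
  val result = new Array[A](len)
  @tailrec def loop(i: Int, current: A): Unit =
    if (i < len) {
      result(i) = current
      loop(i + 1, f(current)) // tail call: the loop state stays in one stack frame
    }
  loop(0, start)
  result
}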
There is always more than one way to do it in Scala:
scala> (0 until 5).map(1<<_).toArray
res48: Array[Int] = Array(1, 2, 4, 8, 16)
or
scala> (for (i <- 0 to 4) yield 1<<i).toArray
res49: Array[Int] = Array(1, 2, 4, 8, 16)
or even
scala> List.fill(4)(1).scanLeft(1)(2*_+0*_).toArray
res61: Array[Int] = Array(1, 2, 4, 8, 16)
The other answers are fine if you happen to know in advance how many entries will be in the resulting list. But if you want to take all of the entries up to some limit, you should create an Iterator, use takeWhile to get the prefix you want, and create an array from that, like so:
scala> Iterator.iterate(1)(2*_).takeWhile(_<=16).toArray
res21: Array[Int] = Array(1, 2, 4, 8, 16)
It all boils down to whether what you really want is more correctly stated as
the first 5 powers of 2 starting at 1, or
the powers of 2 from 1 to 16
For non-trivial functions you almost always want to specify the end condition and let the program figure out how many entries there are. Of course your example was simple, and in fact the easiest way to create that particular array is just to write it out literally:
scala> Array(1,2,4,8,16)
res22: Array[Int] = Array(1, 2, 4, 8, 16)
But presumably you were asking for a general technique you could use for arbitrarily complex problems. For that, Iterator and takeWhile are generally the tools you need.
You don't have to keep track of the previous step. Also, each element is not formed by n * (n - 1). You probably meant f(n) = f(n - 1) * 2.
Anyway, to answer your question, here's how you do it:
(0 until 5).map(math.pow(2, _).toInt).toArray

Should Scala's map() behave differently when mapping to the same type?

In the Scala Collections framework, I think there are some behaviors that are counterintuitive when using map().
We can distinguish two kinds of transformations on (immutable) collections: those whose implementation calls newBuilder to recreate the resulting collection, and those that go through an implicit CanBuildFrom to obtain the builder.
The first category contains all transformations where the type of the contained elements does not change. They are, for example, filter, partition, drop, take, span, etc. These transformations are free to call newBuilder and to recreate the same collection type as the one they are called on, no matter how specific: filtering a List[Int] can always return a List[Int]; filtering a BitSet (or the RNA example structure described in this article on the architecture of the collections framework) can always return another BitSet (or RNA). Let's call them the filtering transformations.
The second category of transformations needs CanBuildFroms to be more flexible, as the type of the contained elements may change, and as a result the collection type itself may not be reusable: a BitSet cannot contain Strings; an RNA contains only Bases. Examples of such transformations are map, flatMap, collect, scanLeft, ++, etc. Let's call them the mapping transformations.
Now here's the main issue to discuss. No matter what the static type of the collection is, all filtering transformations will return the same collection type, while the collection type returned by a mapping operation can vary depending on the static type.
scala> import collection.immutable.TreeSet
import collection.immutable.TreeSet
scala> val treeset = TreeSet(1,2,3,4,5) // static type == dynamic type
treeset: scala.collection.immutable.TreeSet[Int] = TreeSet(1, 2, 3, 4, 5)
scala> val set: Set[Int] = TreeSet(1,2,3,4,5) // static type != dynamic type
set: Set[Int] = TreeSet(1, 2, 3, 4, 5)
scala> treeset.filter(_ % 2 == 0)
res0: scala.collection.immutable.TreeSet[Int] = TreeSet(2, 4) // fine, a TreeSet again
scala> set.filter(_ % 2 == 0)
res1: scala.collection.immutable.Set[Int] = TreeSet(2, 4) // fine
scala> treeset.map(_ + 1)
res2: scala.collection.immutable.SortedSet[Int] = TreeSet(2, 3, 4, 5, 6) // still fine
scala> set.map(_ + 1)
res3: scala.collection.immutable.Set[Int] = Set(4, 5, 6, 2, 3) // uh?!
Now, I understand why this works like this. It is explained there and there. In short: the implicit CanBuildFrom is inserted based on the static type, and, depending on the implementation of its def apply(from: Coll) method, may or may not be able to recreate the same collection type.
Now my only point is, when we know that we are using a mapping operation yielding a collection with the same element type (which the compiler can statically determine), we could mimic the way the filtering transformations work and use the collection's native builder. We can reuse BitSet when mapping to Ints, create a new TreeSet with the same ordering, etc.
Then we would avoid cases where
for (i <- set) {
val x = i + 1
println(x)
}
does not print the incremented elements of the TreeSet in the same order as
for (i <- set; x = i + 1)
println(x)
So:
Do you think this would be a good idea to change the behavior of the mapping transformations as described?
What are the inevitable caveats I have grossly overlooked?
How could it be implemented?
I was thinking about something like an implicit sameTypeEvidence: A =:= B parameter, maybe with a default value of null (or rather an implicit canReuseCalleeBuilderEvidence: B <:< A = null), which could be used at runtime to give more information to the CanBuildFrom, which in turn could be used to determine the type of builder to return.
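A hypothetical sketch of that evidence idea (mapReusing is a made-up name, purely illustrative; this is not how the standard library works):
def mapReusing[A, B](xs: List[A])(f: A => B)
                    (implicit ev: B <:< A = null): Either[List[A], List[B]] =
  if (ev != null) Left(xs.map(x => ev(f(x)))) // same element type: the callee's native builder could be reused
  else Right(xs.map(f))                       // different type: fall back to the usual CanBuildFrom path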
I looked again at it, and I think your problem doesn't arise from a particular deficiency of Scala collections, but rather a missing builder for TreeSet. Because the following does work as intended:
val list = List(1,2,3,4,5)
val seq1: Seq[Int] = list
seq1.map( _ + 1 ) // yields List
val vector = Vector(1,2,3,4,5)
val seq2: Seq[Int] = vector
seq2.map( _ + 1 ) // yields Vector
So the reason is that TreeSet is missing a specialised companion object/builder:
seq1.companion.newBuilder[Int] // ListBuffer
seq2.companion.newBuilder[Int] // VectorBuilder
treeset.companion.newBuilder[Int] // Set (oops!)
So my guess is, if you take proper provision for such a companion for your RNA class, you may find that both map and filter work as you wish...?