Scala Stream transformation and evaluation model

Consider the following list transformation:
List(1,2,3,4) map (_ + 10) filter (_ % 2 == 0) map (_ * 3)
It is evaluated in the following way:
List(1, 2, 3, 4) map (_ + 10) filter (_ % 2 == 0) map (_ * 3)
List(11, 12, 13, 14) filter (_ % 2 == 0) map (_ * 3)
List(12, 14) map (_ * 3)
List(36, 42)
So there are three passes, and each one creates a new list structure.
So, the first question: can Stream help to avoid this, and if so, how? Can all evaluations be made in a single pass, without creating additional structures?
Is the following Stream evaluation model correct?
Stream(1, ?) map (_ + 10) filter (_ % 2 == 0) map (_ * 3)
Stream(11, ?) filter (_ % 2 == 0) map (_ * 3)
// filter condition fails, evaluate the next element
Stream(2, ?) map (_ + 10) filter (_ % 2 == 0) map (_ * 3)
Stream(12, ?) filter (_ % 2 == 0) map (_ * 3)
Stream(12, ?) map (_ * 3)
Stream(36, ?)
// finish
If it is, then there are the same number of passes and the same number of new Stream structures created as in the case of a List. If it is not, then the second question: what is the Stream evaluation model for this particular type of transformation chain?

One way to avoid intermediate collections is to use view.
List(1,2,3,4).view map (_ + 10) filter (_ % 2 == 0) map (_ * 3)
It doesn't avoid every intermediate, but it can be useful. This page has lots of info and is well worth the time.

No, you can't avoid it by using Stream.
But you can avoid it by using the collect method, and it is worth keeping in mind that every time you use a map after a filter, a collect may be what you need.
Here is the code:
scala> def time(n: Int)(call : => Unit): Long = {
| val start = System.currentTimeMillis
| var cnt = n
| while(cnt > 0) {
| cnt -= 1
| call
| }
| System.currentTimeMillis - start
| }
time: (n: Int)(call: => Unit)Long
scala> val xs = List.fill(10000)((math.random * 100).toInt)
xs: List[Int] = List(37, 86, 74, 1, ...)
scala> val ys = Stream(xs :_*)
ys: scala.collection.immutable.Stream[Int] = Stream(37, ?)
scala> time(10000){ xs map (_+10) filter (_%2 == 0) map (_*3) }
res0: Long = 7182
// Note the call to force, which evaluates the whole stream.
scala> time(10000){ ys map (_+10) filter (_%2 == 0) map (_*3) force }
res1: Long = 17408
scala> time(10000){ xs.view map (_+10) filter (_%2 == 0) map (_*3) force }
res2: Long = 6322
scala> time(10000){ xs collect { case x if (x+10)%2 == 0 => (x+10)*3 } }
res3: Long = 2339
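As a sanity check on the benchmark above, here is a minimal sketch showing that the chained, view-based, and collect-based spellings produce the same list, so the timings compare like with like. The object and value names are illustrative and not from the answer, and the input is the small list from the question rather than the 10000-element random list.

```scala
object CollectEquivalence {
  val xs: List[Int] = List(1, 2, 3, 4)

  // Three passes, two intermediate lists.
  val chained: List[Int] = xs.map(_ + 10).filter(_ % 2 == 0).map(_ * 3)

  // Lazy view: nothing is computed until toList builds the final list.
  val viaView: List[Int] = xs.view.map(_ + 10).filter(_ % 2 == 0).map(_ * 3).toList

  // One pass: the guard plays the role of filter, the right-hand side the role of both maps.
  val collected: List[Int] = xs.collect { case x if (x + 10) % 2 == 0 => (x + 10) * 3 }
}
```

The price of collect is that x + 10 is written twice, once in the guard and once in the result expression.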

As far as I know, if you always iterate through the whole collection, Stream does not help you.
It will create the same number of new Streams as with the List.
Correct me if I am wrong, but I understand it as follows:
Stream is a lazy structure, so when you do:
val result = Stream(1, ?) map (_ + 10) filter (_ % 2 == 0) map (_ * 3)
the result is another stream that links to the results of the previous transformations. So if you force evaluation with a foreach (or e.g. mkString)
result.foreach(println)
for each iteration the above chain is evaluated to get the current item.
However, you can reduce the number of passes by one if you replace filter with withFilter. withFilter does not build an intermediate list; the predicate is applied as part of the following map.
List(1,2,3,4) map (_ + 10) withFilter (_ % 2 == 0) map (_ * 3)
You can reduce it to one pass with flatMap:
List(1,2,3,4) flatMap { x =>
val y = x + 10
if (y % 2 == 0) Some(y * 3) else None
}
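To make the claim above concrete, that a Stream chain is evaluated element by element rather than pass by pass, here is a small sketch that records the order in which the three functions are called. The helper logged and the object name are hypothetical, introduced only for illustration. Note that Stream is deprecated in favour of LazyList from Scala 2.13 on; the sketch sticks with Stream to match the question.

```scala
import scala.collection.mutable.ListBuffer

object StreamTrace {
  // Records every function call in the order it happens.
  val trace = ListBuffer.empty[String]

  // Hypothetical helper: wraps f so that each call is logged under the given label.
  def logged[A, B](label: String)(f: A => B): A => B = { a =>
    trace += s"$label($a)"
    f(a)
  }

  val result: List[Int] =
    Stream(1, 2, 3, 4)
      .map(logged("add10")((x: Int) => x + 10))
      .filter(logged("isEven")((x: Int) => x % 2 == 0))
      .map(logged("times3")((x: Int) => x * 3))
      .toList // forces the whole stream
}
```

Inspecting StreamTrace.trace after the run shows the calls interleaved per element (add10, isEven, and times3 for one element before add10 is called on a later one), which is the model sketched in the question. A cons cell is still allocated at each stage, so the number of structures is not reduced.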

Scala can filter and transform a collection in a variety of ways.
First your example:
List(1,2,3,4) map (_ + 10) filter (_ % 2 == 0) map (_ * 3)
Could be optimized:
List(1,2,3,4) filter (_ % 2 == 0) map (v => (v+10)*3)
Or, folds could be used:
List(1,2,3,4).foldLeft(List[Int]()){ case (a,b) if b % 2 == 0 => a ++ List((b+10)*3) case (a,b) => a }
Or, perhaps a for-expression:
for( v <- List(1,2,3,4); w=v+10 if w % 2 == 0 ) yield w*3
Or, maybe the clearest to understand, a collect:
List(1,2,3,4).collect{ case v if v % 2 == 0 => (v+10)*3 }
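One caveat on the foldLeft variant above: appending with ++ copies the accumulated list on every hit. A possible alternative, sketched here with hypothetical names, is to fold from the right and prepend, which still visits each element once and builds the result in order.

```scala
object FoldSinglePass {
  // Single traversal: add 10, keep even values, multiply by 3.
  def transform(xs: List[Int]): List[Int] =
    xs.foldRight(List.empty[Int]) { (x, acc) =>
      val y = x + 10
      if (y % 2 == 0) (y * 3) :: acc else acc
    }
}
```

Unlike the shortcut of filtering on the original value, this tests the parity after adding 10, so it stays correct if the offset is changed to an odd number.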
But to address your questions about Streams: yes, Streams can be used, and for large collections where what is wanted is often found early, a Stream is a good choice:
def myStream( s:Stream[Int] ): Stream[Int] =
((s.head+10)*3) #:: myStream(s.tail.filter( _ % 2 == 0 ))
myStream(Stream.from(2)).take(2).toList // An infinite Stream yields
// 36 and 42; the 3rd element
// has not been processed yet
With this Stream example the filter is only applied to the next element as it is needed, not to the entire list -- good thing, or it would never stop :)
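The same point can be checked with the original map/filter/map chain. The sketch below (names are illustrative) counts how many source elements the first map touches when only two results are requested from an infinite Stream.

```scala
object LazyPrefix {
  // Counts how many source elements the first map actually touched.
  var evaluated = 0

  val firstTwo: List[Int] =
    Stream.from(1)
      .map { x => evaluated += 1; x + 10 }
      .filter(_ % 2 == 0)
      .map(_ * 3)
      .take(2)
      .toList
}
```

After the run, LazyPrefix.evaluated is a small number: only the prefix needed to find two even values is processed, while the rest of the infinite Stream is never evaluated.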

Related

How to pass the initial value to foldLeft from a filtered List with function chaining?

Say I have a List. I filter it first on some condition. Now I want to pass the initial value from this filtered list to foldLeft, all while chaining both together. Is there a way to do that?
For example:
scala> val numbers = List(5, 4, 8, 6, 2)
val numbers: List[Int] = List(5, 4, 8, 6, 2)
scala> numbers.filter(_ % 2 == 0).foldLeft(numbers(0)) { // this is obviously incorrect, since numbers(0) is the value at index 0 of the original list, not of the filtered list
| (z, i) => z + i
| }
val res88: Int = 25
You could just pattern match on the result of filtering to get the first element of the list (head) and the rest (tail):
val numbers = List(5, 4, 8, 6, 2)
val result = numbers.filter(_ % 2 == 0) match {
case head :: tail => tail.foldLeft(head) {
(z, i) => z + i
}
// handle the case where no elements remain after filtering; here I just return 0
case Nil => 0
}
You could also just use reduce:
numbers.filter(_ % 100 == 0).reduce {
(z, i) => z + i
}
but it will throw an UnsupportedOperationException if the list is empty after filtering (as it is here, since no element of numbers is divisible by 100).
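If the empty case should not throw, one possible middle ground between the pattern match and reduce is reduceOption, which returns None for an empty list. A minimal sketch, with a hypothetical helper name:

```scala
object FilteredFold {
  // Sums the even numbers; None when nothing survives the filter.
  def sumOfEvens(numbers: List[Int]): Option[Int] =
    numbers.filter(_ % 2 == 0).reduceOption(_ + _)
}
```

Calling FilteredFold.sumOfEvens(List(5, 4, 8, 6, 2)) gives Some(20), and getOrElse(0) recovers the default used in the pattern-matching version.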

Recursive Function To Create Permutations

I have the following function which I have checked about a dozen times, and should work exactly as I want, but it ends up with the wrong result. Can anyone point out what is wrong with this function?
Note: I'm printing out the list that is being passed in recursive calls; and the list is exactly as I expect it to be. But the variable called result that accumulates the result does not contain the correct permutations at the end. Also, I synchronized the access to result variable, but that did NOT fix the problem; so, I don't think synchronization is a problem. The code can be copied and run as is.
import collection.mutable._
def permute(list:List[Int], result:StringBuilder):Unit =
{
val len = list.size
if (len == 0) (result.append("|"))
else
{
for (i <- 0 until len )
{
println("========" + list + "===========")
result.append( list(i) )
if (i != len -1)
{
//println("Adding comma since i is: " + i)
result.append(", ")
}
//println("******** Reslut is:" + result + "***********")
permute( (sublist(list, i) ), result)
}
}
// This function removes just the ith item, and returns the new list.
def sublist (list:List[Int], i:Int): List[Int] =
{
var sub:ListBuffer[Int] = (list.map(x => x)).to[ListBuffer]
sub.remove(i)
return sub.toList
}
}
var res = new StringBuilder("")
permute(List(1,2,3), res)
println(res)
The output is:
========List(1, 2, 3)===========
========List(2, 3)===========
========List(3)===========
========List(2, 3)===========
========List(2)===========
========List(1, 2, 3)===========
========List(1, 3)===========
========List(3)===========
========List(1, 3)===========
========List(1)===========
========List(1, 2, 3)===========
========List(1, 2)===========
========List(2)===========
========List(1, 2)===========
========List(1)===========
**1, 2, 3|32|2, 1, 3|31|31, 2|21|**
I think Dici's solution is good, but kind of cryptic. I think the following code is much more clear:
def permutations(list: List[Int]): List[List[Int]] = list match
{
case Nil | _::Nil => List(list)
case _ =>(
for (i <- list.indices.toList) yield
{
val (beforeElem, afterElem) = list.splitAt(i)
val element = afterElem.head
val subperm = permutations (beforeElem ++ afterElem.tail)
subperm.map(element:: _)
}
).flatten
}
val result = permutations(List(1, 2, 3))
println(result.mkString("\n") )
The output will be:
List(1, 2, 3)
List(1, 3, 2)
List(2, 1, 3)
List(2, 3, 1)
List(3, 1, 2)
List(3, 2, 1)
There are various problems with your approach. The main one is that you don't actually implement the recurrence relation between the permutations of n elements and the permutations of n + 1 elements: take all permutations of n elements and insert the (n + 1)th element at every position of each of them to get all the permutations of n + 1 elements.
One way to do it, more Scalatically, is:
def sortedPermutations(list: List[Int]): List[List[Int]] = list match {
case Nil | _ :: Nil => List(list)
case _ => list.indices.flatMap(i => list.splitAt(i) match {
case (head, t :: tail) => sortedPermutations(head ::: tail).map(t :: _)
}).toList
}
println(sortedPermutations(List(1, 2, 3)).map(_.mkString(",")).mkString("|"))
Output:
1,2,3|1,3,2|2,1,3|2,3,1|3,1,2|3,2,1
Note that this is very inefficient though, because of all the list concatenations. An efficient solution would be tail-recursive or iterative. I'll post that a bit later for you.
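For comparison, the recurrence described above (insert the new element at every position of every smaller permutation) can also be written down directly. This is only a sketch with hypothetical names, not the efficient version promised above, and it produces the permutations in a different order than sortedPermutations.

```scala
object InsertionPermutations {
  // All ways to insert x into xs, from the front to the back.
  def insertEverywhere(x: Int, xs: List[Int]): List[List[Int]] =
    (0 to xs.length).toList.map { i =>
      val (front, back) = xs.splitAt(i)
      front ::: x :: back
    }

  // Start from the single empty permutation and insert each element in turn.
  def permutations(list: List[Int]): List[List[Int]] =
    list.foldRight(List(List.empty[Int])) { (x, perms) =>
      perms.flatMap(insertEverywhere(x, _))
    }
}
```

The standard library also provides a permutations method on sequences, for example List(1, 2, 3).permutations, which returns an Iterator and is usually the simplest choice when the order of results does not matter.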

Count occurrences of each item in a Scala parallel collection

My question is very similar to Count occurrences of each element in a List[List[T]] in Scala, except that I would like to have an efficient solution involving parallel collections.
Specifically, I have a large (~10^7) vector vec of short (~10) lists of Ints, and I would like to get for each Int x the number of times x occurs, for example as a Map[Int,Int]. The number of distinct integers is of the order 10^6.
Since the machine this needs to be done on has a fair amount of memory (150GB) and number of cores (>100) it seems like parallel collections would be a good choice for this. Is the code below a good approach?
val flatpvec = vec.par.flatten
val flatvec = flatpvec.seq
val unique = flatpvec.distinct
val counts = unique map (x => (x -> flatvec.count(_ == x)))
counts.toMap
Or are there better solutions? In case you are wondering about the .seq conversion: for some reason the following code doesn't seem to terminate, even for small examples:
val flatpvec = vec.par.flatten
val unique = flatpvec.distinct
val counts = unique map (x => (x -> flatpvec.count(_ == x)))
counts.toMap
One option is aggregate. aggregate is like fold, except that you also supply a function that combines the results of the sequential folds.
Update: It's not surprising that there is overhead in .par.groupBy, but I was surprised by the constant factor. By these numbers, you would never count that way. Also, I had to bump the memory way up.
The interesting technique used to build the result map is described in this paper linked from the overview. (It cleverly saves the intermediate results and then coalesces them in parallel at the end.)
But copying around the intermediate results of the groupBy turns out to be expensive, if all you really want is a count.
The numbers are comparing sequential groupBy, parallel, and finally aggregate.
apm@mara:~/tmp$ scalacm countints.scala ; scalam -J-Xms8g -J-Xmx8g -J-Xss1m countints.Test
GroupBy: Starting...
Finished in 12695
GroupBy: List((233,10078), (237,20041), (268,9939), (279,9958), (315,10141), (387,9917), (462,9937), (680,9932), (848,10139), (858,10000))
Par GroupBy: Starting...
Finished in 51481
Par GroupBy: List((233,10078), (237,20041), (268,9939), (279,9958), (315,10141), (387,9917), (462,9937), (680,9932), (848,10139), (858,10000))
Aggregate: Starting...
Finished in 2672
Aggregate: List((233,10078), (237,20041), (268,9939), (279,9958), (315,10141), (387,9917), (462,9937), (680,9932), (848,10139), (858,10000))
Nothing magical in the test code.
import collection.GenTraversableOnce
import collection.concurrent.TrieMap
import collection.mutable
import concurrent.duration._
trait Timed {
def now = System.nanoTime
def timed[A](op: =>A): A = {
val start = now
val res = op
val end = now
val lapsed = (end - start).nanos.toMillis
Console println s"Finished in $lapsed"
res
}
def showtime(title: String, op: =>GenTraversableOnce[(Int,Int)]): Unit = {
Console println s"$title: Starting..."
val res = timed(op)
//val showable = res.toIterator.min //(res.toIterator take 10).toList
val showable = res.toList.sorted take 10
Console println s"$title: $showable"
}
}
It generates some random data for interest.
object Test extends App with Timed {
val upto = math.pow(10,6).toInt
val ran = new java.util.Random
val ten = (1 to 10).toList
val maxSamples = 1000
// samples of ten random numbers in the desired range
val samples = (1 to maxSamples).toList map (_ => ten map (_ => ran nextInt upto))
// pick a sample at random
def anyten = samples(ran nextInt maxSamples)
def mag = 7
val data: Vector[List[Int]] = Vector.fill(math.pow(10,mag).toInt)(anyten)
The sequential operation and the combining operation of aggregate are invoked from a task, and the result is assigned to a volatile var.
def z: mutable.Map[Int,Int] = mutable.Map.empty[Int,Int]
def so(m: mutable.Map[Int,Int], is: List[Int]) = {
for (i <- is) {
val v = m.getOrElse(i, 0)
m(i) = v + 1
}
m
}
def co(m: mutable.Map[Int,Int], n: mutable.Map[Int,Int]) = {
for ((i, count) <- n) {
val v = m.getOrElse(i, 0)
m(i) = v + count
}
m
}
showtime("GroupBy", data.flatten groupBy identity map { case (k, vs) => (k, vs.size) })
showtime("Par GroupBy", data.flatten.par groupBy identity map { case (k, vs) => (k, vs.size) })
showtime("Aggregate", data.par.aggregate(z)(so, co))
}
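To see what the two functions passed to aggregate are responsible for, here is a sequential sketch of the same idea: each chunk is folded with a sequential operation and the partial maps are then merged with a combining operation. It deliberately avoids .par (a separate module since Scala 2.13) and uses immutable maps, so it illustrates the shape of the computation, not its parallel performance. All names are hypothetical.

```scala
object AggregateSketch {
  type Counts = Map[Int, Int]

  // Sequential operation: fold one short list into a partial count map.
  def seqOp(acc: Counts, xs: List[Int]): Counts =
    xs.foldLeft(acc)((m, x) => m.updated(x, m.getOrElse(x, 0) + 1))

  // Combining operation: merge two partial count maps.
  def combOp(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (m, (k, n)) => m.updated(k, m.getOrElse(k, 0) + n) }

  // Chunks stand in for the partitions a parallel collection would create.
  def countAll(data: Vector[List[Int]], chunkSize: Int): Counts =
    data.grouped(chunkSize)
      .map(chunk => chunk.foldLeft(Map.empty[Int, Int])(seqOp))
      .foldLeft(Map.empty[Int, Int])(combOp)
}
```

The result does not depend on chunkSize, which is the property aggregate relies on when it splits the work across tasks.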
If you want to make use of parallel collections and Scala standard tools, you could do it like this. Group your collection by identity and then map it to (Value, Count):
scala> val longList = List(1, 5, 2, 3, 7, 4, 2, 3, 7, 3, 2, 1, 7)
longList: List[Int] = List(1, 5, 2, 3, 7, 4, 2, 3, 7, 3, 2, 1, 7)
scala> longList.par.groupBy(x => x)
res0: scala.collection.parallel.immutable.ParMap[Int,scala.collection.parallel.immutable.ParSeq[Int]] = ParMap(5 -> ParVector(5), 1 -> ParVector(1, 1), 2 -> ParVector(2, 2, 2), 7 -> ParVector(7, 7, 7), 3 -> ParVector(3, 3, 3), 4 -> ParVector(4))
scala> longList.par.groupBy(x => x).map(x => (x._1, x._2.size))
res1: scala.collection.parallel.immutable.ParMap[Int,Int] = ParMap(5 -> 1, 1 -> 2, 2 -> 3, 7 -> 3, 3 -> 3, 4 -> 1)
Or, even nicer, as pagoda_5b suggested in the comments:
scala> longList.par.groupBy(identity).mapValues(_.size)
res1: scala.collection.parallel.ParMap[Int,Int] = ParMap(5 -> 1, 1 -> 2, 2 -> 3, 7 -> 3, 3 -> 3, 4 -> 1)

Comparing items in two lists

I have two lists: List(1,1,1) and List(1,0,1).
I want to get the following:
1. A count of every position where the first list contains a 1 and the second list contains a 0, and vice versa.
In the above example this would be 1, 0, since the first list contains a 1 at the middle position and the second list contains a 0 at the same (middle) position.
2. A count of every position where 1 is in the first list and 1 is also in the second list.
In the above example this is two, since both lists contain a 1 at two positions. I can get this using the intersect method of class List.
I am just looking for an answer to point 1 above. I could use an iterative approach to count the items, but is there a more functional method?
Here is the entire code :
class Similarity {
def getSimilarity(number1: List[Int], number2: List[Int]) = {
val num: List[Int] = number1.intersect(number2)
println("P is " + num.length)
}
}
object HelloWorld {
def main(args: Array[String]) {
val s = new Similarity
s.getSimilarity(List(1, 1, 1), List(1, 0, 1))
}
}
For the first one:
scala> val a = List(1,1,1)
a: List[Int] = List(1, 1, 1)
scala> val b = List(1,0,1)
b: List[Int] = List(1, 0, 1)
scala> a.zip(b).filter(x => x._1==1 && x._2==0).size
res7: Int = 1
For the second:
scala> a.zip(b).filter(x => x._1==1 && x._2==1).size
res7: Int = 2
You can count all combinations easily and have it in a map with
def getSimilarity(number1 : List[Int] , number2 : List[Int]) = {
//sorry for the 1-liner, explanation follows
val countMap = (number1 zip number2) groupBy (identity) mapValues {_.length}
countMap // return the map; a block that ends with a val definition would return Unit
}
/*
* Example
* number1 = List(1,1,0,1,0,0,1)
* number2 = List(0,1,1,1,0,1,1)
*
* countMap = Map((1,0) -> 1, (1,1) -> 3, (0,1) -> 2, (0,0) -> 1)
*/
The trick is a common one
// zip the elements pairwise
(number1 zip number2)
/* List((1,0), (1,1), (0,1), (1,1), (0,0), (0,1), (1,1))
*
* then group together with the identity function, so pairs
* with the same elements are grouped together and the key is the pair itself
*/
.groupBy(identity)
/* Map( (1,0) -> List((1,0)),
* (1,1) -> List((1,1), (1,1), (1,1)),
* (0,1) -> List((0,1), (0,1)),
* (0,0) -> List((0,0))
* )
*
* finally you count the pairs mapping the values to the length of each list
*/
.mapValues(_.length)
/* Map( (1,0) -> 1,
* (1,1) -> 3,
* (0,1) -> 2,
* (0,0) -> 1
* )
*/
Then all you need to do is look up the pair you are interested in on the map.
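The lookup itself is not shown above; it might look like the following sketch. The names are illustrative, and map is used instead of mapValues as a substitution on my part, because mapValues returns a lazy view in Scala 2.13.

```scala
object PairCounts {
  // Counts how often each (first, second) pair occurs at the same position.
  def countPairs(number1: List[Int], number2: List[Int]): Map[(Int, Int), Int] =
    number1.zip(number2).groupBy(identity).map { case (pair, hits) => pair -> hits.length }

  val countMap: Map[(Int, Int), Int] = countPairs(List(1, 1, 1), List(1, 0, 1))

  // A pair that never occurs is absent from the map, so default to 0.
  val oneZero: Int = countMap.getOrElse((1, 0), 0)
  val zeroOne: Int = countMap.getOrElse((0, 1), 0)
  val bothOnes: Int = countMap.getOrElse((1, 1), 0)
}
```

For the lists in the question this gives 1, 0, and 2 respectively, matching the counts the question asks for.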
a.zip(b).filter(x => x._1 != x._2).size
Almost the same solution as the one proposed by Jatin, except that you can use List.count for better readability:
def getSimilarity(l1: List[Int], l2: List[Int]) =
l1.zip(l2).count({case (x,y) => x != y})
You can also use foldLeft. Assuming there are no negative numbers:
a.zip(b).foldLeft(0)( (x,y) => if (y._1 + y._2 == 1) x + 1 else x )
1) You could zip the two lists to get a list of (Int, Int), collect only the pairs (1, 0) and (0, 1), replace (1, 0) with 1 and (0, 1) with -1, and take the sum. If the count of (1, 0) and the count of (0, 1) are the same, the sum equals 0:
val (l1, l2) = (List(1,1,1) , List(1,0,1))
(l1 zip l2).collect{
case (1, 0) => 1
case (0, 1) => -1
}.sum == 0
You could use the view method to avoid creating intermediate collections.
2) You could use filter and length to count the elements that satisfy some condition:
(l1 zip l2).filter{ _ == (1, 1) }.length
(l1 zip l2).collect{ case (1, 1) => () }.length
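The view suggestion above can be combined with count, so that neither the zipped list nor a filtered list is materialised. A minimal sketch, with hypothetical value names:

```scala
object ViewCounts {
  val l1 = List(1, 1, 1)
  val l2 = List(1, 0, 1)

  // Pairs are produced lazily by the view and consumed by count.
  val mismatches: Int = l1.view.zip(l2).count { case (x, y) => x != y }
  val bothOnes: Int = l1.view.zip(l2).count(_ == ((1, 1)))
}
```

For lists this short the difference is negligible; the view only pays off when the zipped collection would be large.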

Collection type generated by for with yield

When I evaluate a for in Scala, I get an immutable IndexedSeq (a collection with array-like performance characteristics, such as efficient random access):
scala> val s = for (i <- 0 to 9) yield math.random + i
s: scala.collection.immutable.IndexedSeq[Double] = Vector(0.6127056766832756, 1.7137598183155291, ...
Does a for with a yield always return an IndexedSeq, or can it also return some other type of collection class (a LinearSeq, for example)? If it can also return something else, then what determines the return type, and how can I influence it?
I'm using Scala 2.8.0.RC3.
Thanks michael.kebe for your comment.
This explains how for is translated to operations with map, flatMap, filter and foreach. So my example:
val s = for (i <- 0 to 9) yield math.random + i
is translated to something like this (I'm not sure if it's translated to map or flatMap in this case):
val s = (0 to 9) map { math.random + _ }
The result type of operations like map on collections depends on the collection you call it on. The type of 0 to 9 is a Range.Inclusive:
scala> val d = 0 to 9
d: scala.collection.immutable.Range.Inclusive with scala.collection.immutable.Range.ByOne = Range(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
The result of the map operation on that is an IndexedSeq (because of the builder stuff inside the collections library).
So, to answer my question: the result of a for (...) yield ... depends on the type of the collection inside the parentheses. If I want a List as the result, I could do this:
scala> val s = for (i <- List.range(0, 9)) yield math.random + i
s: List[Double] = List(0.05778968639862214, 1.6758775042995566, ...
You can always transform a range to a list using toList:
> val s = for (i <- (0 to 9).toList) yield math.random + i
> s : List[Double]
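To illustrate the rule above, here is a small sketch (names are illustrative) in which the same yield expression produces different collection types depending on the first generator. The explicit type annotations act as compile-time checks of the result type.

```scala
object ForYieldTypes {
  // Generator is a Range, so the result is an immutable IndexedSeq.
  val fromRange: scala.collection.immutable.IndexedSeq[Int] =
    for (i <- 0 to 3) yield i * 2

  // Generator is a List, so the result is a List.
  val fromList: List[Int] = for (i <- List(0, 1, 2, 3)) yield i * 2

  // What the compiler produces for a single generator with yield: a call to map.
  val desugared: List[Int] = List(0, 1, 2, 3).map(i => i * 2)

  // With two generators the outer one becomes flatMap and decides the result type.
  val nested: List[Int] = for (i <- List(1, 2); j <- Vector(10, 20)) yield i + j
}
```

A single generator translates to map; flatMap only appears once there is more than one generator.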