Scala: Side effects with collection transformations

I am trying to understand views in Scala via this link: http://docs.scala-lang.org/overviews/collections/views.html.
I don't understand what it means for collection transformations to have (or not have) side effects.
Thank you

Having side effects means a situation where the code executed in a collection transformation closes over some external state, or affects anything other than the result of the transformation. Example:
val l = List(1, 2, 3, 4).view.map(x => {println(x); x + 1})
When you execute this code it prints nothing, because the view delays the execution of map. Furthermore, each time you iterate over this list, map is executed again, so the values are printed more times than you probably intended.
var counter = 0
val ll = for (i <- List(1, 2, 3, 4).view)
yield { counter += 1; i + 1}
println(counter) // 0
println(ll.toList) // this executes .force internally
println(counter) // 4
The second snippet behaves in the same way, but it is even more surprising: counter increases only when the iteration actually happens, and since ll is lazy, that iteration may happen much deeper in the code, so counter remains 0 until then.
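A minimal sketch of both points (the value names below are mine, not from the question): nothing runs when the view is defined, forcing it runs the side effect, and forcing it twice runs the side effect twice.
val v = List(1, 2, 3).view.map { x => println("mapping " + x); x + 1 }
// nothing has been printed so far
val first = v.toList  // prints "mapping 1", "mapping 2", "mapping 3"
val second = v.toList // prints all three lines again: the side effect is repeated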

Scala has immutable collections (everything in scala.collection.immutable). These collection types have no operations to modify them, only operations that return modified copies.
So for example this
Set(1) + 2
will give you a new Set containing 1 and 2, not modify the first set. The same holds for transformations such as map, flatMap, filter etc.
Views do not change anything about that. The only difference between a view and the collection it is based on is that (most) operations on views are lazy, i.e. intermediate results are not computed.
val l1 = List(1,2)
val l2 = List(1,2).map(x => x + 1) // a new List(2,3) is computed here
l2.foreach(println) // the elements of l2 are just printed
With views:
val v2 = l1.view.map(x => x + 1) // nothing is computed here
v2.foreach(println) // the values are computed step by step
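If you need a strict collection back from the view, convert it explicitly; a small sketch (toList works across Scala versions, and pre-2.13 views also offer a force method):
val l3 = v2.toList // List(2, 3) is computed here, all at once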

Related

Unable to update a mutable Scala collection inside a loop (or map)

I have a mutable scala Set:
val valueSet = scala.collection.mutable.Set[Int](0, 1, 2)
when I perform
valueSet -= 1
the result is Set(0,2)
But when I perform the same thing inside a loop or a map:
Range(0, 10).map(entry => valueSet -= 1)
valueSet
res130: scala.collection.mutable.Set[Int] = Set(0, 1, 2)
The content of valueSet remains Set(0, 1, 2).
I need to run a loop and, based on some condition, remove elements from the Set until either the loop ends or the Set becomes empty. I tried printing the valueSet inside the loop and it works correctly, but when the loop ends the valueSet comes back to the original set.
Using the immutable version would seriously affect the performance of the code, which is why I am using the mutable version.
Please help!
EDIT:
I am using spark-shell REPL. (spark 1.6.1)
I tried a few more things and figured out that if I perform the loop or map on an RDD then it doesn't work, but for collections that are not distributed it does. I am guessing this has something to do with the fact that map is a transformation on the RDD and doesn't perform any action. But that is just my guess.
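A small sketch of that guess (hedged: assumes a spark-shell session where sc is predefined): RDD transformations are lazy, and when they do run, their closures execute on the executors against a serialized copy of any captured variable, so a driver-side mutable Set is never updated.
val valueSet = scala.collection.mutable.Set[Int](0, 1, 2)
val rdd = sc.parallelize(0 until 10)
rdd.map(entry => valueSet -= entry)         // lazy: nothing has run yet
rdd.map(entry => valueSet -= entry).count() // runs, but mutates only executor-side copies
println(valueSet)                           // still Set(0, 1, 2) on the driver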
This works and removes entries according to whether they exist:
val valueSet = scala.collection.mutable.Set[Int](0, 1, 2)
Range(0, 10).foreach(entry => valueSet -= entry)
println(valueSet.size) //size = 0
Maybe a for comprehension, since I'm guessing your actual predicate is more involved than just removing the value 1 from the set. It will return one new mutable Set, but you will not generate an intermediate set for each value in the range.
scala> for {
| x <- valueSet
| if(x != 1) // or whatever
| } yield x
res1: scala.collection.mutable.Set[Int] = Set(0, 2)

How to write an efficient groupBy-size filter in Scala, can be approximate

Given a List[Int] in Scala, I wish to get the Set[Int] of all Ints which appear at least thresh times. I can do this using groupBy or foldLeft, then filter. For example:
val thresh = 3
val myList = List(1,2,3,2,1,4,3,2,1)
myList.foldLeft(Map[Int,Int]()){case(m, i) => m + (i -> (m.getOrElse(i, 0) + 1))}.filter(_._2 >= thresh).keys
will give Set(1,2).
Now suppose the List[Int] is very large. How large is hard to say, but in any case this seems wasteful, as I don't care about each Int's exact frequency, only whether it reaches thresh. Once it passes thresh there's no need to keep counting, just add the Int to the Set[Int].
The question is: can I do this more efficiently for a very large List[Int],
a) if I need a true, accurate result (no room for mistakes)
b) if the result can be approximate, e.g. by using some hashing trick or Bloom filters, where the Set[Int] might include some false positives, or where {the frequency of an Int > thresh} isn't really a Boolean but a Double in [0, 1].
First of all, you can't do better than O(N), as you need to check each element of your initial list at least once. Your current approach is O(N), presuming that operations with IntMap are effectively constant.
Now what you can try in order to increase efficiency:
update the map only when the current counter value is less than or equal to the threshold. This will eliminate a huge number of the most expensive operations: map updates
try a faster map instead of IntMap. If you know that the values of the initial List are in a fixed range, you can use an Array instead of IntMap (index as the key). Another possible option is a mutable HashMap with sufficient initial capacity. As my benchmark shows, it actually makes a significant difference
As #ixx proposed, after incrementing a value in the map, check whether it equals the threshold and in that case add the element immediately to the result list. This will save you one linear traversal (which appears not to be that significant for large input)
I don't see how any approximate solution can be faster (unless you ignore some elements at random); otherwise it will still be O(N).
Update
I created a microbenchmark to measure the actual performance of the different implementations. For sufficiently large input and output, Ixx's suggestion of immediately adding elements to the result list doesn't produce a significant improvement. However, a similar approach can be used to eliminate unnecessary map updates (which appear to be the most expensive operation).
Results of benchmarks (avg run times on 1000000 elems with pre-warming):
Author's solution:
447 ms
Ixx solution:
412 ms
Ixx solution2 (eliminated excessive map writes):
150 ms
My solution:
57 ms
My solution involves using a mutable HashMap instead of an immutable IntMap and includes all the other possible optimizations.
Ixx's updated solution:
val tuple = (Map[Int, Int](), List[Int]())
val res = myList.foldLeft(tuple) {
  case ((m, s), i) =>
    val count = m.getOrElse(i, 0) + 1
    (if (count <= 3) m + (i -> count) else m, if (count == thresh) i :: s else s)
}
My solution:
import scala.collection.mutable
import scala.collection.mutable.ListBuffer

val map = new mutable.HashMap[Int, Int]()
val res = new ListBuffer[Int]
myList.foreach { i =>
  val c = map.getOrElse(i, 0) + 1
  if (c == thresh) {
    res += i
  }
  if (c <= thresh) {
    map(i) = c
  }
}
The full microbenchmark source is available here.
You could use foldLeft to collect the matching items, like this:
val tuple = (Map[Int,Int](), List[Int]())
myList.foldLeft(tuple) {
  case ((m, s), i) =>
    val count = m.getOrElse(i, 0) + 1
    (m + (i -> count), if (count == thresh) i :: s else s)
}
I could measure a performance improvement of about 40% with a small list, so it's definitely an improvement...
Edited to use List and prepend, which takes constant time (see comments).
If by "more efficiently" you mean the space efficiency (in extreme case when the list is infinite), there's a probabilistic data structure called Count Min Sketch to estimate the frequency of items inside it. Then you can discard those with frequency below your threshold.
There's a Scala implementation from Algebird library.
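For the flavour of it, here is a minimal hand-rolled Count-Min Sketch in plain Scala (illustrative only, not the Algebird API; the width, depth and MurmurHash3-based bucketing are my own choices). Estimates can only over-count, so filtering on estimate(x) >= thresh may give false positives but no false negatives:
import scala.util.hashing.MurmurHash3

class CountMinSketch(width: Int, depth: Int) {
  private val table = Array.ofDim[Long](depth, width)

  // hash the value together with the row index to get one bucket per row
  private def bucket(x: Int, row: Int): Int = {
    val h = MurmurHash3.productHash((x, row))
    ((h % width) + width) % width
  }

  def add(x: Int): Unit =
    (0 until depth).foreach(row => table(row)(bucket(x, row)) += 1)

  def estimate(x: Int): Long =
    (0 until depth).map(row => table(row)(bucket(x, row))).min
}

val cms = new CountMinSketch(width = 64, depth = 4)
myList.foreach(cms.add)
val approx = myList.toSet.filter(x => cms.estimate(x) >= thresh) // ~ Set(1, 2)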
You can change your foldLeft example a bit, using a mutable.Set that is built incrementally and at the same time used as a filter for iterating over your Seq via withFilter. However, because I'm using withFilter I cannot use foldLeft and have to make do with foreach and a mutable map:
import scala.collection.mutable

def getItems[A](in: Seq[A], threshold: Int): Set[A] = {
  val counts: mutable.Map[A, Int] = mutable.Map.empty
  val result: mutable.Set[A] = mutable.Set.empty
  in.withFilter(!result(_)).foreach { x =>
    counts.update(x, counts.getOrElse(x, 0) + 1)
    if (counts(x) >= threshold) {
      result += x
    }
  }
  result.toSet
}
So this discards items that have already been added to the result set while running through the Seq for the first time, because withFilter filters the Seq inside the appended function (map, flatMap, foreach) rather than returning a filtered Seq.
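A quick usage check of getItems above, reusing the list and threshold from the question:
getItems(List(1, 2, 3, 2, 1, 4, 3, 2, 1), 3) // == Set(1, 2)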
EDIT:
I changed my solution to not use Seq.count, which was stupid, as Aivean correctly pointed out.
Using Aivean's microbenchmark I can see that it is still slightly slower than his approach, but still better than the author's first approach (times in ms):
Author's solution:
377
Ixx solution:
399
Ixx solution2 (eliminated excessive map writes):
110
Sascha Kolberg's solution:
72
Aivean solution:
54

Seq with maximal elements

I have a Seq and a function Int => Int. What I need to achieve is to take from the original Seq only those elements that map to the maximum of the resulting sequence (the one I'll have after applying the given function):
def mapper: Int => Int = x => x * x
val s = Seq(-2, -2, 2, 2)
val themax = s.map(mapper).max
s.filter(mapper(_) == themax)
But this seems wasteful, since it has to map twice (once for the maximum, once more inside the filter).
Is there a better way to do this? (without an explicit loop, hopefully)
EDIT
The code has since been edited; in the original, the filter line was s.filter( mapper(_)==s.map(mapper).max). As om-nom-nom pointed out, this evaluates s.map(mapper).max on each filter iteration, leading to quadratic complexity.
Here is a solution that does the mapping only once, using the foldLeft function:
The principle is to go through the seq and, for each mapped element: if it is greater than every mapped value seen so far, begin a new sequence with it; if it is equal, append it to the sequence of maximums collected so far; if it is less, keep the previously computed Seq of maximums.
def getMaxElems1(s: Seq[Int])(mapper: Int => Int): Seq[Int] =
  s.foldLeft(Seq[(Int, Int)]()) { (res, elem) =>
    val e2 = mapper(elem)
    if (res.isEmpty || e2 > res.head._2)
      Seq((elem, e2))
    else if (e2 == res.head._2)
      res ++ Seq((elem, e2))
    else res
  }.map(_._1) // keep only original elements
// test with your list
scala> getMaxElems1(s)(mapper)
res14: Seq[Int] = List(-2, -2, 2, 2)
//test with a list containing also non maximal elements
scala> getMaxElems1(Seq(-1, 2,0, -2, 1,-2))(mapper)
res15: Seq[Int] = List(2, -2, -2)
Remark: About complexity
The algorithm I present above has a complexity of O(N) for a list with N elements. However:
the operation of mapping all elements is of complexity O(N)
the operation of computing the max is of complexity O(N)
the operation of zipping is of complexity O(N)
the operation of filtering the list according to the max is also of complexity O(N)
the final operation of mapping back to the original elements is of complexity O(M), with M the number of maximal elements
So, finally, the algorithm you presented in your question has the same complexity (quality) as my answer's, and moreover the solution you present is clearer than mine. So, even if foldLeft is more powerful, for this operation I would recommend your idea, but zipping the original list and computing the map only once (especially if your map is more complicated than a simple square). Here is the solution worked out with the help of scala_newbie in the question/chat/comments.
def getMaxElems2(s: Seq[Int])(mapper: Int => Int): Seq[Int] = {
  val mappedS = s.map(mapper) // map done only once
  val m = mappedS.max         // find the max
  s.zip(mappedS).filter(_._2 == m).unzip._1
}
// test with your list
scala> getMaxElems2(s)(mapper)
res16: Seq[Int] = List(-2, -2, 2, 2)
//test with a list containing also non maximal elements
scala> getMaxElems2(Seq(-1, 2,0, -2, 1,-2))(mapper)
res17: Seq[Int] = List(2, -2, -2)

Scala: what is the most appropriate data structure for sorted subsets?

Given a large collection (let's call it 'a') of elements of type T (say, a Vector or List) and an evaluation function 'f' (say, (T) => Double) I would like to derive from 'a' a result collection 'b' that contains the N elements of 'a' that result in the highest value under f. The collection 'a' may contain duplicates. It is not sorted.
Maybe leaving the question of parallelizability (map/reduce etc.) aside for a moment, what would be the appropriate Scala data structure for compiling the result collection 'b'? Thanks for any pointers / ideas.
Notes:
(1) I guess my use case can be most concisely expressed as
val a = Vector( 9,2,6,1,7,5,2,6,9 ) // just an example
val f : (Int)=>Double = (n)=>n // evaluation function
val b = a.sortBy( f ).takeRight( N ) // sort ascending, then keep the N largest
except that I do not want to sort the entire set.
(2) one option might be an iteration over 'a' that fills a TreeSet with 'manual' size bounding (reject anything worse than the worst item in the set, don't let the set grow beyond N). However, I would like to retain duplicates present in the original set in the result set, and so this may not work.
(3) if a sorted multi-set is the right data structure, is there a Scala implementation of this? Or a binary-sorted Vector or Array, if the result set is reasonably small?
You can use a priority queue:
def firstK[A](xs: Seq[A], k: Int)(implicit ord: Ordering[A]) = {
  val q = new scala.collection.mutable.PriorityQueue[A]()(ord.reverse)
  val (before, after) = xs.splitAt(k)
  q ++= before
  after.foreach(x => q += ord.max(x, q.dequeue))
  q.dequeueAll
}
We fill the queue with the first k elements and then compare each additional element to the head of the queue, swapping as necessary. This works as expected and retains duplicates:
scala> firstK(Vector(9, 2, 6, 1, 7, 5, 2, 6, 9), 4)
res14: scala.collection.mutable.Buffer[Int] = ArrayBuffer(6, 7, 9, 9)
And it doesn't sort the complete list. I've got an Ordering in this implementation, but adapting it to use an evaluation function would be pretty trivial.
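For instance, a small sketch of that adaptation (reusing the names from the question; Ordering.by turns the evaluation function into the required Ordering, so firstK keeps the N elements with the highest values under f, duplicates included):
val a = Vector(9, 2, 6, 1, 7, 5, 2, 6, 9)
val f: Int => Double = n => n.toDouble
val b = firstK(a, 4)(Ordering.by(f)) // ArrayBuffer(6, 7, 9, 9): the 4 largest under f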

Incrementing the for loop (loop variable) in scala by power of 5

I had asked this question on Javaranch, but couldn't get a response there. So posting it here as well:
I have this particular requirement where the increment in the loop variable is to be done by multiplying it by 5 after each iteration. In Java we could implement it this way:
for(int i=1;i<100;i=i*5){}
In Scala I was trying the following code:
var j = 1
for (i <- 1.to(100).by(scala.math.pow(5, j).toInt)) {
  println(i + " " + j)
  j = j + 1
}
But it's printing the following output:
1 1
6 2
11 3
16 4
21 5
26 6
31 7
36 8
....
....
It's always incrementing by 5. So how do I go about actually multiplying by 5 at each step instead of adding 5?
Let's first explain the problem. This code:
var j = 1
for (i <- 1.to(100).by(scala.math.pow(5, j).toInt)) {
  println(i + " " + j)
  j = j + 1
}
is equivalent to this:
var j = 1
val range: Range = Predef.intWrapper(1).to(100)
val increment: Int = scala.math.pow(5, j).toInt
val byRange: Range = range.by(increment)
byRange.foreach { i =>
  println(i + " " + j)
  j = j + 1
}
So, by the time you get to mutate j, increment and byRange have already been computed. And Range is an immutable object -- you can't change it. Even if you produced new ranges while you did the foreach, the object doing the foreach would still be the same.
Now, to the solution. Simply put, Range is not adequate for your needs. You want a geometric progression, not an arithmetic one. To me (and pretty much everyone else answering, it seems), the natural solution would be to use a Stream or Iterator created with iterate, which computes the next value based on the previous one.
for (i <- Iterator.iterate(1)(_ * 5) takeWhile (_ < 100)) {
  println(i)
}
EDIT: About Stream vs Iterator
Stream and Iterator are very different data structures that share the property of being non-strict. This property is what enables iterate to even exist, since this method creates an infinite collection (1), from which takeWhile will create a new (2) collection which is finite. Let's see here:
val s1 = Stream.iterate(1)(_ * 5) // s1 is infinite
val s2 = s1.takeWhile(_ < 100) // s2 is finite
val i1 = Iterator.iterate(1)(_ * 5) // i1 is infinite
val i2 = i1.takeWhile(_ < 100) // i2 is finite
These infinite collections are possible because the collection is not pre-computed. On a List, all elements inside the list are actually stored somewhere by the time the list has been created. In the above examples, however, only the first element of each collection is known in advance. All others will only be computed if and when required.
As I mentioned, though, these are very different collections in other respects. Stream is an immutable data structure. For instance, you can print the contents of s2 as many times as you wish, and it will show the same output every time. On the other hand, Iterator is a mutable data structure. Once you have used a value, it is gone forever. Print the contents of i2 twice, and it will be empty the second time around:
scala> s2 foreach println
1
5
25
scala> s2 foreach println
1
5
25
scala> i2 foreach println
1
5
25
scala> i2 foreach println
scala>
Stream, on the other hand, is a lazy collection. Once a value has been computed, it will stay computed, instead of being discarded or recomputed every time. See one example of that behavior in action below:
scala> val s2 = s1.takeWhile(_ < 100) // s2 is finite
s2: scala.collection.immutable.Stream[Int] = Stream(1, ?)
scala> println(s2)
Stream(1, ?)
scala> s2 foreach println
1
5
25
scala> println(s2)
Stream(1, 5, 25)
So Stream can actually fill up the memory if one is not careful, whereas Iterator occupies constant space. On the other hand, one can be surprised by Iterator, because of its side effects.
(1) As a matter of fact, Iterator is not a collection at all, even though it shares a lot of the methods provided by collections. On the other hand, from the problem description you gave, you are not really interested in having a collection of numbers, just in iterating through them.
(2) Actually, though takeWhile will create a new Iterator on Scala 2.8.0, this new iterator will still be linked to the old one, and changes in one have side effects on the other. This is subject to discussion, and they might end up being truly independent in the future.
In a more functional style:
scala> Stream.iterate(1)(i => i * 5).takeWhile(i => i < 100).toList
res0: List[Int] = List(1, 5, 25)
And with more syntactic sugar:
scala> Stream.iterate(1)(_ * 5).takeWhile(_ < 100).toList
res1: List[Int] = List(1, 5, 25)
Maybe a simple while-loop would do?
var i = 1
while (i < 100) {
  println(i)
  i *= 5
}
or if you want to also print the number of iterations
var i = 1
var j = 1
while (i < 100) {
  println(j + " : " + i)
  i *= 5
  j += 1
}
It seems you guys like functional style, so how about a recursive solution?
import scala.annotation.tailrec

@tailrec def quints(n: Int): Unit = {
  println(n)
  if (n * 5 < 100) quints(n * 5)
}
Update: Thanks for spotting the error... it should of course be power, not multiply:
Annoyingly, there doesn't seem to be an integer pow function in the standard library!
Try this:
def pow5(i:Int) = math.pow(5,i).toInt
Iterator from 1 map pow5 takeWhile (100>=) toList
Or if you want to use it in-place:
Iterator from 1 map pow5 takeWhile (100>=) foreach {
j => println("number:" + j)
}
and with the indices:
val iter = Iterator from 1 map pow5 takeWhile (100>=)
iter.zipWithIndex foreach { case (j, i) => println(i + " = " + j) }
(0 to 2).map (math.pow (5, _).toInt).zipWithIndex
res25: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector((1,0), (5,1), (25,2))
produces a Vector, with i,j in reversed order.
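If you want the pairs in (index, value) order, matching i and j from the question, a small follow-up sketch:
(0 to 2).map(math.pow(5, _).toInt).zipWithIndex.map(_.swap)
// Vector((0,1), (1,5), (2,25))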