Terminal function for collections in scala - scala

Suppose we have a list of some int values(positive and negative) and we have a task to double only positive values. Here is a snippet that produces the desired result:
val list: List[Int] = ....
list.filter(_ > 0).map(_ * 2)
So far so good, but what if the list is very large of size N. Does the program iterates N times on filter function and then N times on map function?
How does scala know when it's time to go through the list in cycle and apply all the filtering and mapping stuff? What will be result(in terms of list traversing) if we group the original list by identity function for instance(to get rid of duplicates) and the apply map function?

Does the program iterates N times on filter function and then N times on map function?
Yes. Use a view to make operations on collections lazy. For example:
val list: List[Int] = ...
list.view.filter(_ > 0).map(_ * 2)
How does scala know when it's time to go through the list in cycle and apply all the filtering and mapping stuff?
When using a view, it will calculate the value when you actually go to use a materialized value:
val x = list.view.filter(_ > 0).map(_ * 2)
println(x.head * 2)
This only applies the filter and the map when head is called.

Does the program iterates N times on filter function and then N times
on map function?
Yes, for a List you should use withFilter instead. From withFilter doc:
Note: the difference between `c filter p` and `c withFilter p` is that
the former creates a new collection, whereas the latter only
restricts the domain of subsequent `map`, `flatMap`, `foreach`,
and `withFilter` operations.

If there is a map after filter, you can always using collect instead:
list.filter(0<).map(2*)
to
list.collect{ case x if x > 0 => 2*x }

Related

how do I chain the custom logic using scala?

I am trying to achieve the calculation of area of polygon using scala
Reference for Area of Polygon https://www.mathopenref.com/coordpolygonarea.html
I was able to write the code successfully but my requirement is I want the last for loop to be chained within points variable like other map logics. Map is not working because output of list is (N-1)
var lines=io.Source.stdin.getLines()
val nPoints= lines.next.toInt
var s=0
var points = lines.take(nPoints).toList map(_.split(" ")) map{case Array(e1,e2)=>(e1.toInt,e2.toInt)}
for(i <- 0 until points.length-1){ // want this for loop to be chained within points variable
s=s+(points(i)._1 * points(i+1)._2) - (points(i)._2 * points(i+1)._1)
}
println(scala.math.abs(s/2.0))
You're looking for a sliding and a foldLeft, I think. sliding(2) will give you lists List(i, i+1), List(i+1, i+2),... and then you can use foldLeft to do the computation:
val s = points.sliding(2).foldLeft(0){
case (acc, (p1, p2)::(q1, q2)::Nil) => acc + p1*q2 - p2*q1 }
If I read the formula correctly, there's one additional term coming from the first and last vertices (which looks like it's missing from your implementation, by the way). You could either add that separately or repeat the first vertex at the end of your list. As a result, combining into the same line that defines points doesn't work so easily. In any case, I think it is more readable split into separate statements--one defining points, another repeating the first element of points at the end of the list and then the fold.
Edited to fix the fact that you have tuples for your points, not lists. I'm also tacitly assuming points is a list, I believe.
how about something like (untested)
val points2 = points.tail :+ points.head // shift list one to the left
val area = ((points zip points2)
.map{case (p1, p2) => (p1._1 * p2._2) - (p2._1 * p1._2)}
.fold(0)(_+_))/2

Tail recursion with List + .toVector or Vector?

val dimensionality = 10
val zeros = DenseVector.zeros[Double](dimensionality)
#tailrec private def specials(list: List[DenseVector[Int]], i: Int): List[DenseVector[Int]] = {
if(i >= dimensionality) list
else {
val vec = zeros.copy
vec(i to i) := 1
specials(vec :: list, i + 1)
}
}
val specialList = specials(Nil, 0).toVector
specialList.map(...doing my thing...)
Should I write my tail recursive function using a List as accumulator above and then write
specials(Nil, 0).toVector
or should I write my trail recursion with a Vector in the first place? What is computationally more efficient?
By the way: specialList is a list that contains DenseVectors where every entry is 0 with the exception of one entry, which is 1. There are as many DenseVectors as they are long.
I'm not sur what you're trying to do here but you could rewrite your code like so:
type Mat = List[Vector[Int]]
#tailrec
private def specials(mat: Mat, i: Int): Mat = i match {
case `dimensionality` => mat
case _ =>
val v = zeros.copy.updated(i,1)
specials(v :: mat, i + 1)
}
As you are dealing with a matrix, Vector is probably a better choice.
Let's compare the performance characteristics of both variants:
List: prepending takes constant time, conversion to Vector takes linear time.
Vector: prepending takes "effectively" constant time (eC), no subsequent conversion needed.
If you compare the implementations of List and Vector, then you'll find out that prepending to a List is a simpler and cheaper operation than prepending to a Vector. Instead of just adding another element at the front as it is done by List, Vector potentially has to replace a whole branch/subtree internally. On average, this still happens in constant time ("effectively" constant, because the subtrees can differ in their size), but is more expensive than prepending to List. On the plus side, you can avoid the call to toVector.
Eventually, the crucial point of interest is the size of the collection you want to create (or in other words, the amount of recursive prepend-steps you are doing). It's totally possible that there is no clear winner and one of the two variants is faster for <= n steps, whereas the other variant is faster for > n steps. In my naive toy benchmark, List/toVecor seemed to be faster for less than 8k elements, but you should perform a set of well-chosen benchmarks that represent your scenario adequately.

How to write an efficient groupBy-size filter in Scala, can be approximate

Given a List[Int] in Scala, I wish to get the Set[Int] of all Ints which appear at least thresh times. I can do this using groupBy or foldLeft, then filter. For example:
val thresh = 3
val myList = List(1,2,3,2,1,4,3,2,1)
myList.foldLeft(Map[Int,Int]()){case(m, i) => m + (i -> (m.getOrElse(i, 0) + 1))}.filter(_._2 >= thresh).keys
will give Set(1,2).
Now suppose the List[Int] is very large. How large it's hard to say but in any case this seems wasteful as I don't care about each of the Ints frequencies, and I only care if they're at least thresh. Once it passed thresh there's no need to check anymore, just add the Int to the Set[Int].
The question is: can I do this more efficiently for a very large List[Int],
a) if I need a true, accurate result (no room for mistakes)
b) if the result can be approximate, e.g. by using some Hashing trick or Bloom Filters, where Set[Int] might include some false-positives, or whether {the frequency of an Int > thresh} isn't really a Boolean but a Double in [0-1].
First of all, you can't do better than O(N), as you need to check each element of your initial array at least once. You current approach is O(N), presuming that operations with IntMap are effectively constant.
Now what you can try in order to increase efficiency:
update map only when current counter value is less or equal to threshold. This will eliminate huge number of most expensive operations — map updates
try faster map instead of IntMap. If you know that values of the initial List are in fixed range, you can use Array instead of IntMap (index as the key). Another possible option will be mutable HashMap with sufficient initail capacity. As my benchmark shows it actually makes significant difference
As #ixx proposed, after incrementing value in the map, check whether it's equal to 3 and in this case add it immediately to result list. This will save you one linear traversing (appears to be not that significant for large input)
I don't see how any approximate solution can be faster (only if you ignore some elements at random). Otherwise it will still be O(N).
Update
I created microbenchmark to measure the actual performance of different implementations. For sufficiently large input and output Ixx's suggestion regarding immediately adding elements to result list doesn't produce significant improvement. However similar approach could be used to eliminate unnecessary Map updates (which appears to be the most expensive operation).
Results of benchmarks (avg run times on 1000000 elems with pre-warming):
Authors solution:
447 ms
Ixx solution:
412 ms
Ixx solution2 (eliminated excessive map writes):
150 ms
My solution:
57 ms
My solution involves using mutable HashMap instead of immutable IntMap and includes all other possible optimizations.
Ixx's updated solution:
val tuple = (Map[Int, Int](), List[Int]())
val res = myList.foldLeft(tuple) {
case ((m, s), i) =>
val count = m.getOrElse(i, 0) + 1
(if (count <= 3) m + (i -> count) else m, if (count == thresh) i :: s else s)
}
My solution:
val map = new mutable.HashMap[Int, Int]()
val res = new ListBuffer[Int]
myList.foreach {
i =>
val c = map.getOrElse(i, 0) + 1
if (c == thresh) {
res += i
}
if (c <= thresh) {
map(i) = c
}
}
The full microbenchmark source is available here.
You could use the foldleft to collect the matching items, like this:
val tuple = (Map[Int,Int](), List[Int]())
myList.foldLeft(tuple) {
case((m, s), i) => {
val count = (m.getOrElse(i, 0) + 1)
(m + (i -> count), if (count == thresh) i :: s else s)
}
}
I could measure a performance improvement of about 40% with a small list, so it's definitely an improvement...
Edited to use List and prepend, which takes constant time (see comments).
If by "more efficiently" you mean the space efficiency (in extreme case when the list is infinite), there's a probabilistic data structure called Count Min Sketch to estimate the frequency of items inside it. Then you can discard those with frequency below your threshold.
There's a Scala implementation from Algebird library.
You can change your foldLeft example a bit using a mutable.Set that is build incrementally and at the same time used as filter for iterating over your Seq by using withFilter. However, because I'm using withFilteri cannot use foldLeft and have to make do with foreach and a mutable map:
import scala.collection.mutable
def getItems[A](in: Seq[A], threshold: Int): Set[A] = {
val counts: mutable.Map[A, Int] = mutable.Map.empty
val result: mutable.Set[A] = mutable.Set.empty
in.withFilter(!result(_)).foreach { x =>
counts.update(x, counts.getOrElse(x, 0) + 1)
if (counts(x) >= threshold) {
result += x
}
}
result.toSet
}
So, this would discard items that have already been added to the result set while running through the Seq the first time, because withFilterfilters the Seqin the appended function (map, flatMap, foreach) rather than returning a filtered Seq.
EDIT:
I changed my solution to not use Seq.count, which was stupid, as Aivean correctly pointed out.
Using Aiveans microbench I can see that it is still slightly slower than his approach, but still better than the authors first approach.
Authors solution
377
Ixx solution:
399
Ixx solution2 (eliminated excessive map writes):
110
Sascha Kolbergs solution:
72
Aivean solution:
54

Spark: How to 'scan' RDD collections?

Does Spark have any analog of Scala scan operation to work on RDD collections?
(for details please see Reduce, fold or scan (Left/Right)?)
For example:
val abc = List("A", "B", "C")
def add(res: String, x: String) = {
println(s"op: $res + $x = ${res + x}")
res + x
}
So to get:
abc.scanLeft("z")(add)
// op: z + A = zA // same operations as foldLeft above...
// op: zA + B = zAB
// op: zAB + C = zABC
// res: List[String] = List(z, zA, zAB, zABC) // maps intermediate results
Any other means to achieve the same result?
Update
What is "Spark" way to solve, for example, the following problem:
Compute elements of the vector as (in pseudocode):
x(i) = SomeFun(for k from 0 to i-1)(y(k))
Should I collect RDD for this? No other way?
Update 2
Ok, I understand the general problem. Yet maybe you could advise me on the particular case I have to deal with.
I have a list of ints as input RDD and I have to build an outptut RDD, where the following should hold:
1) input.length == output.length // output list is of the same length as input
2) output(i) = sum( range (0..i), input(i)) / q^i // i-th element of output list equals sum of input elements from 0 to i divided by i-th power of some constant q
In fact I need a combination of map and fold function to solve this.
Another idea is to write a recursive fold on diminishing tails of the input list. But this is super inefficient and AFAIK Spark does not have tail or init function for RDD.
How would you solve this problem in Sparck?
You are correct that there does not exist the analog of scan() in the generic RDD.
A potential explantion: Such a method would require access to all elements of the distributed collection to process each element of the generated output collection. before continuing on to the next output element.
So if your input list were say 1 million plus one entries there would be 1 million shuffle operations on the cluster (even though the sorting is not required here - spark gives it for "free" when doing a cluster collect step).
UPDATE OP has expanded the question. Here is response to the expanded question.
from updated OP:
x(i) = SomeFun(for k from 0 to i-1)(y(k))
You need to distinguish whether x(i) computation - specifically the y(k) function - were going to either:
require access to the entire dataset x(0 .. i -1)
change the structure of the dataset
on each iteration. That is the case for scan - and given your description it seems to be your purpose. AFAIK this is not supported in Spark. Once again - think if you were developing the distributed framework. How would you achieve same? It does not seem to be a scalable means to achieve - so yes you would need to do that computation in an
collect()
invocation against the original RDD and perform it on the Driver.

Carry on information about previous computations

I'm new to functional programming, so some problems seems harder to solve using functional approach.
Let's say I have a list of numbers, like 1 to 10.000, and I want to get the items of the list which sums up to at most a number n (let's say 100). So, it would get the numbers until their sum is greater than 100.
In imperative programming, it's trivial to solve this problem, because I can keep a variable in each interaction, and stop once the objective is met.
But how can I do the same in functional programming? Since the sum function operates on completed lists, and I still don't have the completed list, how can I 'carry on' the computation?
If sum was lazily computed, I could write something like that:
(1 to 10000).sum.takeWhile(_ < 100)
P.S.:Even though any answer will be appreciated, I'd like one that doesn't compute the sum each time, since obviously the imperative version will be much more optimal regarding speed.
Edit:
I know that I can "convert" the imperative loop approach to a functional recursive function. I'm more interested in finding if one of the existing library functions can provide a way for me not to write one each time I need something.
Use Stream.
scala> val ss = Stream.from(1).take(10000)
ss: scala.collection.immutable.Stream[Int] = Stream(1, ?)
scala> ss.scanLeft(0)(_ + _)
res60: scala.collection.immutable.Stream[Int] = Stream(0, ?)
scala> res60.takeWhile(_ < 100).last
res61: Int = 91
EDIT:
Obtaining components is not very tricky either. This is how you can do it:
scala> ss.scanLeft((0, Vector.empty[Int])) { case ((sum, compo), cur) => (sum + cur, compo :+ cur) }
res62: scala.collection.immutable.Stream[(Int, scala.collection.immutable.Vector[Int])] = Stream((0,Vector()), ?)
scala> res62.takeWhile(_._1 < 100).last
res63: (Int, scala.collection.immutable.Vector[Int]) = (91,Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13))
The second part of the tuple is your desired result.
As should be obvious, in this case, building a vector is wasteful. Instead we can only store the last number from the stream that contributed to sum.
scala> ss.scanLeft(0)(_ + _).zipWithIndex
res64: scala.collection.immutable.Stream[(Int, Int)] = Stream((0,0), ?)
scala> res64.takeWhile(_._1 < 100).last._2
res65: Int = 13
The way I would do this is with recursion. On each call, add the next number. Your base case is when the sum is greater than 100, at which point you return all the way up the stack. You'll need a helper function to do the actual recursion, but that's no big deal.
This isn't hard using "functional" methods either.
Using recursion, rather than maintaining your state in a local variable that you mutate, you keep it in parameters and return values.
So, to return the longest initial part of a list whose sum is at most N:
If the list is empty, you're done; return the empty list.
If the head of the list is greater than N, you're done; return the empty list.
Otherwise, let H be the head of the list.
All we need now is the initial part of the tail of the list whose sum is at most N - H, then we can "cons" H onto that list, and we're done.
We can compute this recursively using the same procedure as we have used this far, so it's an easy step.
A simple pseudocode solution:
sum_to (n, ls) = if isEmpty ls or n < (head ls)
then Nil
else (head ls) :: sum_to (n - head ls, tail ls)
sum_to(100, some_list)
All sequence operations which require only one pass through the sequence can be implemented using folds our reduce like it is sometimes called.
I find myself using folds very often since I became used to functional programming
so here odd one possible approach
Use an empty collection as initial value and fold according to this strategy
Given the processed collection and the new value check if their sum is low enough and if then spend the value to the collection else do nothing
that solution is not very efficient but I want to emphasize the following
map fold filter zip etc are the way to get accustomed to functional programming try to use them as much as possible instead of loping constructs or recursive functions your code will be more declarative and functional