Scala: Best way to filter & map in one iteration - scala

I'm new to Scala and trying to figure out the best way to filter & map a collection. Here's a toy example to explain my problem.
Approach 1: This is pretty bad since I'm iterating through the list twice and calculating the same value in each iteration.
val N = 5
val nums = 0 until 10
val sqNumsLargerThanN = nums filter { x: Int => (x * x) > N } map { x: Int => (x * x).toString }
Approach 2: This is slightly better but I still need to calculate (x * x) twice.
val N = 5
val nums = 0 until 10
val sqNumsLargerThanN = nums collect { case x: Int if (x * x) > N => (x * x).toString }
So, is it possible to calculate this without iterating through the collection twice and avoid repeating the same calculations?

Could use a foldRight
nums.foldRight(List.empty[Int]) {
case (i, is) =>
val s = i * i
if (s > N) s :: is else is
}
A foldLeft would also achieve a similar goal, but the resulting list would be in reverse order (due to the associativity of foldLeft.
Alternatively if you'd like to play with Scalaz
import scalaz.std.list._
import scalaz.syntax.foldable._
nums.foldMap { i =>
val s = i * i
if (s > N) List(s) else List()
}

The typical approach is to use an iterator (if possible) or view (if iterator won't work). This doesn't exactly avoid two traversals, but it does avoid creation of a full-sized intermediate collection. You then map first and filter afterwards and then map again if needed:
xs.iterator.map(x => x*x).filter(_ > N).map(_.toString)
The advantage of this approach is that it's really easy to read and, since there are no intermediate collections, it's reasonably efficient.
If you are asking because this is a performance bottleneck, then the answer is usually to write a tail-recursive function or use the old-style while loop method. For instance, in your case
def sumSqBigN(xs: Array[Int], N: Int): Array[String] = {
val ysb = Array.newBuilder[String]
def inner(start: Int): Array[String] = {
if (start >= xs.length) ysb.result
else {
val sq = xs(start) * xs(start)
if (sq > N) ysb += sq.toString
inner(start + 1)
}
}
inner(0)
}
You can also pass a parameter forward in inner instead of using an external builder (especially useful for sums).

I have yet to confirm that this is truly a single pass, but:
val sqNumsLargerThanN = nums flatMap { x =>
val square = x * x
if (square > N) Some(x) else None
}

You can use collect which applies a partial function to every value of the collection that it's defined at. Your example could be rewritten as follows:
val sqNumsLargerThanN = nums collect {
case (x: Int) if (x * x) > N => (x * x).toString
}

A very simple approach that only does the multiplication operation once. It's also lazy, so it will be executing code only when needed.
nums.view.map(x=>x*x).withFilter(x => x> N).map(_.toString)
Take a look here for differences between filter and withFilter.

Consider this for comprehension,
for (x <- 0 until 10; v = x*x if v > N) yield v.toString
which unfolds to a flatMap over the range and a (lazy) withFilter onto the once only calculated square, and yields a collection with filtered results. To note one iteration and one calculation of square is required (in addition to creating the range).

You can use flatMap.
val sqNumsLargerThanN = nums flatMap { x =>
val square = x * x
if (square > N) Some(square.toString) else None
}
Or with Scalaz,
import scalaz.Scalaz._
val sqNumsLargerThanN = nums flatMap { x =>
val square = x * x
(square > N).option(square.toString)
}
The solves the asked question of how to do this with one iteration. This can be useful when streaming data, like with an Iterator.
However...if you are instead wanting the absolute fastest implementation, this is not it. In fact, I suspect you would use a mutable ArrayList and a while loop. But only after profiling would you know for sure. In any case, that's for another question.

Using a for comprehension would work:
val sqNumsLargerThanN = for {x <- nums if x*x > N } yield (x*x).toString
Also, I'm not sure but I think the scala compiler is smart about a filter before a map and will only do 1 pass if possible.

I am also beginner did it as follows
for(y<-(num.map(x=>x*x)) if y>5 ) { println(y)}

Related

How to add to a list you're returning?

Sorry if this is a stupid question as I am a total beginner. I have a function factors which looks like this:
def factors (n:Int):List[Int] = {
var xs = List[Int]()
for(i <- 2 to (n-1)) {
if(n%i==0) {xs :+ i}
}
return xs
}
However if I do println(factors(10)) I always get List().
What am I doing wrong?
The :+ operation returns a new List, you never assign it to xs.
def factors (n:Int):List[Int] = {
var xs = List[Int]()
for (i <- 2 to (n - 1)) {
if(n%i==0) {xs = xs :+ i}
}
return xs
}
But, you really shouldn't be using var. We don't like them very much in Scala.
Also don't don't don't use return in Scala. It is a much more loaded keyword than you might think. Read about it here
Here is a better way of doing this.
def factors (n:Int): List[Int] =
for {
i <- (2 to (n - 1)).toList
if (n % i) == 0
} yield i
factors(10)
You don't need .toList either but didn't want to mess with your return types. You are welcome to adjust
Working link: https://scastie.scala-lang.org/haGESfhKRxqDdDIpaHXfpw
You can think of this problem as a filtering operation. You start with all the possible factors and you keep the ones where the remainder when dividing the input by that number is 0. The operation that does this in Scala is filter, which keeps values where a particular test is true and removes the others:
def factors(n: Int): List[Int] =
(2 until n).filter(n % _ == 0).toList
To keep the code short I have also used the short form of a function where _ stands for the argument to the function, so n % _ means n divided by the current number that is being tested.

Calculating the mean of elements in a list in Scala

I'm trying to write a method that calculates the mean of the elements in a given List in Scala. Here's my code:
def meanElements(list: List[Float]): Float = {
list match {
case x :: tail => (x + meanElements(tail))/(list.length)
case Nil => 0
}
}
When I call meanElements(List(10,12,14))), the result I get is different than 12. Can someone help?
You can simply do it using inbuilt functions:
scala> def mean(list:List[Int]):Int =
| if(list.isEmpty) 0 else list.sum/list.size
mean: (list: List[Int])Int
scala> mean(List(10,12,14))
res1: Int = 12
scala>
The formula is not correct, it should be:
case x :: tail => (x + meanElements(tail) * tail.length) / list.length
But this implementation is performing a lot of divisions and multiplications.
It would be better to split the computation of the mean to two steps,
calculating the sum first,
and then dividing by list.length.
That is, something more like this:
def meanElements(list: List[Float]): Float = sum(list) / list.length
Where sum is a helper function you have to implement.
If you don't want to expose its implementation,
then you can define it in the body of meanElements.
(Or as #ph88 pointed out,
it could be as simple as list.reduce(_ + _).)

How to pick a random value from a collection in Scala

I need a method to pick uniformly a random value from a collection.
Here is my current impl.
implicit class TraversableOnceOps[A, Repr](val elements: TraversableOnce[A]) extends AnyVal {
def pickRandomly : A = elements.toSeq(Random.nextInt(elements.size))
}
But this code instantiate a new collection, so not ideal in term of memory.
Any way to improve ?
[update] make it work with Iterator
implicit class TraversableOnceOps[A, Repr](val elements: TraversableOnce[A]) extends AnyVal {
def pickRandomly : A = {
val seq = elements.toSeq
seq(Random.nextInt(seq.size))
}
}
It may seem at first glance that you can't do this without counting the elements first, but you can!
Iterate through the sequence f and take each element fi with probability 1/i:
def choose[A](it: Iterator[A], r: util.Random): A =
it.zip(Iterator.iterate(1)(_ + 1)).reduceLeft((x, y) =>
if (r.nextInt(y._2) == 0) y else x
)._1
A quick demonstration of uniformity:
scala> ((1 to 1000000)
| .map(_ => choose("abcdef".iterator, r))
| .groupBy(identity).values.map(_.length))
res45: Iterable[Int] = List(166971, 166126, 166987, 166257, 166698, 166961)
Here's a discussion of the math I wrote a while back, though I'm afraid it's a bit unnecessarily long-winded. It also generalizes to choosing any fixed number of elements instead of just one.
Simplest way is just to think of the problem as zipping the collection with an equal-sized list of random numbers, and then just extract the maximum element. You can do this without actually realizing the zipped sequence. This does require traversing the entire iterator, though
val maxElement = s.maxBy(_=>Random.nextInt)
Or, for the implicit version
implicit class TraversableOnceOps[A, Repr](val elements: TraversableOnce[A]) extends AnyVal {
def pickRandomly : A = elements.maxBy(_=>Random.nextInt)
}
It's possible to select an element uniformly at random from a collection, traversing it once without copying the collection.
The following algorithm will do the trick:
def choose[A](elements: TraversableOnce[A]): A = {
var x: A = null.asInstanceOf[A]
var i = 1
for (e <- elements) {
if (Random.nextDouble <= 1.0 / i) {
x = e
}
i += 1
}
x
}
The algorithm works by at each iteration makes a choice: take the new element with probability 1 / i, or keep the previous one.
To understand why the algorithm choose the element uniformly at random, consider this: Start by considering an element in the collection, for example the first one (in this example the collection only has three elements).
At iteration:
Chosen with probability: 1.
Chosen with probability:
(probability of keeping the element at previous iteration) * (keeping at current iteration)
probability => 1 * 1/2 = 1/2
Chosen with probability: 1/2 * 2/3=1/3 (in other words, uniformly)
If we take another element, for example the second one:
0 (not possible to choose the element at this iteration).
1/2.
1/2*2/3=1/3.
Finally for the third one:
0.
0.
1/3.
This shows that the algorithm selects an element uniformly at random. This can be proved formally using induction.
If the collection is large enough that you care about about instantiations, here is the constant memory solution (I assume, it contains ints' but that only matters for passing initial param to fold):
collection.fold((0, 0)) {
case ((0, _), x) => (1, x)
case ((n, x), _) if (random.nextDouble() > 1.0/n) => (n+1, x)
case ((n, _), x) => (n+1, x)
}._2
I am not sure if this requires a further explanation ... Basically, it does the same thing that #svenslaggare suggested above, but in a functional way, since this is tagged as a scala question.

Convert normal recursion to tail recursion

I was wondering if there is some general method to convert a "normal" recursion with foo(...) + foo(...) as the last call to a tail-recursion.
For example (scala):
def pascal(c: Int, r: Int): Int = {
if (c == 0 || c == r) 1
else pascal(c - 1, r - 1) + pascal(c, r - 1)
}
A general solution for functional languages to convert recursive function to a tail-call equivalent:
A simple way is to wrap the non tail-recursive function in the Trampoline monad.
def pascalM(c: Int, r: Int): Trampoline[Int] = {
if (c == 0 || c == r) Trampoline.done(1)
else for {
a <- Trampoline.suspend(pascal(c - 1, r - 1))
b <- Trampoline.suspend(pascal(c, r - 1))
} yield a + b
}
val pascal = pascalM(10, 5).run
So the pascal function is not a recursive function anymore. However, the Trampoline monad is a nested structure of the computation that need to be done. Finally, run is a tail-recursive function that walks through the tree-like structure, interpreting it, and finally at the base case returns the value.
A paper from RĂșnar Bjanarson on the subject of Trampolines: Stackless Scala With Free Monads
In cases where there is a simple modification to the value of a recursive call, that operation can be moved to the front of the recursive function. The classic example of this is Tail recursion modulo cons, where a simple recursive function in this form:
def recur[A](...):List[A] = {
...
x :: recur(...)
}
which is not tail recursive, is transformed into
def recur[A]{...): List[A] = {
def consRecur(..., consA: A): List[A] = {
consA :: ...
...
consrecur(..., ...)
}
...
consrecur(...,...)
}
Alexlv's example is a variant of this.
This is such a well known situation that some compilers (I know of Prolog and Scheme examples but Scalac does not do this) can detect simple cases and perform this optimisation automatically.
Problems combining multiple calls to recursive functions have no such simple solution. TMRC optimisatin is useless, as you are simply moving the first recursive call to another non-tail position. The only way to reach a tail-recursive solution is remove all but one of the recursive calls; how to do this is entirely context dependent but requires finding an entirely different approach to solving the problem.
As it happens, in some ways your example is similar to the classic Fibonnaci sequence problem; in that case the naive but elegant doubly-recursive solution can be replaced by one which loops forward from the 0th number.
def fib (n: Long): Long = n match {
case 0 | 1 => n
case _ => fib( n - 2) + fib( n - 1 )
}
def fib (n: Long): Long = {
def loop(current: Long, next: => Long, iteration: Long): Long = {
if (n == iteration)
current
else
loop(next, current + next, iteration + 1)
}
loop(0, 1, 0)
}
For the Fibonnaci sequence, this is the most efficient approach (a streams based solution is just a different expression of this solution that can cache results for subsequent calls). Now,
you can also solve your problem by looping forward from c0/r0 (well, c0/r2) and calculating each row in sequence - the difference being that you need to cache the entire previous row. So while this has a similarity to fib, it differs dramatically in the specifics and is also significantly less efficient than your original, doubly-recursive solution.
Here's an approach for your pascal triangle example which can calculate pascal(30,60) efficiently:
def pascal(column: Long, row: Long):Long = {
type Point = (Long, Long)
type Points = List[Point]
type Triangle = Map[Point,Long]
def above(p: Point) = (p._1, p._2 - 1)
def aboveLeft(p: Point) = (p._1 - 1, p._2 - 1)
def find(ps: Points, t: Triangle): Long = ps match {
// Found the ultimate goal
case (p :: Nil) if t contains p => t(p)
// Found an intermediate point: pop the stack and carry on
case (p :: rest) if t contains p => find(rest, t)
// Hit a triangle edge, add it to the triangle
case ((c, r) :: _) if (c == 0) || (c == r) => find(ps, t + ((c,r) -> 1))
// Triangle contains (c - 1, r - 1)...
case (p :: _) if t contains aboveLeft(p) => if (t contains above(p))
// And it contains (c, r - 1)! Add to the triangle
find(ps, t + (p -> (t(aboveLeft(p)) + t(above(p)))))
else
// Does not contain(c, r -1). So find that
find(above(p) :: ps, t)
// If we get here, we don't have (c - 1, r - 1). Find that.
case (p :: _) => find(aboveLeft(p) :: ps, t)
}
require(column >= 0 && row >= 0 && column <= row)
(column, row) match {
case (c, r) if (c == 0) || (c == r) => 1
case p => find(List(p), Map())
}
}
It's efficient, but I think it shows how ugly complex recursive solutions can become as you deform them to become tail recursive. At this point, it may be worth moving to a different model entirely. Continuations or monadic gymnastics might be better.
You want a generic way to transform your function. There isn't one. There are helpful approaches, that's all.
I don't know how theoretical this question is, but a recursive implementation won't be efficient even with tail-recursion. Try computing pascal(30, 60), for example. I don't think you'll get a stack overflow, but be prepared to take a long coffee break.
Instead, consider using a Stream or memoization:
val pascal: Stream[Stream[Long]] =
(Stream(1L)
#:: (Stream from 1 map { i =>
// compute row i
(1L
#:: (pascal(i-1) // take the previous row
sliding 2 // and add adjacent values pairwise
collect { case Stream(a,b) => a + b }).toStream
++ Stream(1L))
}))
The accumulator approach
def pascal(c: Int, r: Int): Int = {
def pascalAcc(acc:Int, leftover: List[(Int, Int)]):Int = {
if (leftover.isEmpty) acc
else {
val (c1, r1) = leftover.head
// Edge.
if (c1 == 0 || c1 == r1) pascalAcc(acc + 1, leftover.tail)
// Safe checks.
else if (c1 < 0 || r1 < 0 || c1 > r1) pascalAcc(acc, leftover.tail)
// Add 2 other points to accumulator.
else pascalAcc(acc, (c1 , r1 - 1) :: ((c1 - 1, r1 - 1) :: leftover.tail ))
}
}
pascalAcc(0, List ((c,r) ))
}
It does not overflow the stack but as on big row and column but Aaron mentioned it's not fast.
Yes it's possible. Usually it's done with accumulator pattern through some internally defined function, which has one additional argument with so called accumulator logic, example with counting length of a list.
For example normal recursive version would look like this:
def length[A](xs: List[A]): Int = if (xs.isEmpty) 0 else 1 + length(xs.tail)
that's not a tail recursive version, in order to eliminate last addition operation we have to accumulate values while somehow, for example with accumulator pattern:
def length[A](xs: List[A]) = {
def inner(ys: List[A], acc: Int): Int = {
if (ys.isEmpty) acc else inner(ys.tail, acc + 1)
}
inner(xs, 0)
}
a bit longer to code, but i think the idea i clear. Of cause you can do it without inner function, but in such case you should provide acc initial value manually.
I'm pretty sure it's not possible in the simple way you're looking for the general case, but it would depend on how elaborate you permit the changes to be.
A tail-recursive function must be re-writable as a while-loop, but try implementing for example a Fractal Tree using while-loops. It's possble, but you need to use an array or collection to store the state for each point, which susbstitutes for the data otherwise stored in the call-stack.
It's also possible to use trampolining.
It is indeed possible. The way I'd do this is to
begin with List(1) and keep recursing till you get to the
row you want.
Worth noticing that you can optimize it: if c==0 or c==r the value is one, and to calculate let's say column 3 of the 100th row you still only need to calculate the first three elements of the previous rows.
A working tail recursive solution would be this:
def pascal(c: Int, r: Int): Int = {
#tailrec
def pascalAcc(c: Int, r: Int, acc: List[Int]): List[Int] = {
if (r == 0) acc
else pascalAcc(c, r - 1,
// from let's say 1 3 3 1 builds 0 1 3 3 1 0 , takes only the
// subset that matters (if asking for col c, no cols after c are
// used) and uses sliding to build (0 1) (1 3) (3 3) etc.
(0 +: acc :+ 0).take(c + 2)
.sliding(2, 1).map { x => x.reduce(_ + _) }.toList)
}
if (c == 0 || c == r) 1
else pascalAcc(c, r, List(1))(c)
}
The annotation #tailrec actually makes the compiler check the function
is actually tail recursive.
It could be probably be further optimized since given that the rows are symmetric, if c > r/2, pascal(c,r) == pascal ( r-c,r).. but left to the reader ;)

Scala performance - Sieve

Right now, I am trying to learn Scala . I've started small, writing some simple algorithms . I've encountered some problems when I wanted to implement the Sieve algorithm from finding all all prime numbers lower than a certain threshold .
My implementation is:
import scala.math
object Sieve {
// Returns all prime numbers until maxNum
def getPrimes(maxNum : Int) = {
def sieve(list: List[Int], stop : Int) : List[Int] = {
list match {
case Nil => Nil
case h :: list if h <= stop => h :: sieve(list.filterNot(_ % h == 0), stop)
case _ => list
}
}
val stop : Int = math.sqrt(maxNum).toInt
sieve((2 to maxNum).toList, stop)
}
def main(args: Array[String]) = {
val ap = printf("%d ", (_:Int));
// works
getPrimes(1000).foreach(ap(_))
// works
getPrimes(100000).foreach(ap(_))
// out of memory
getPrimes(1000000).foreach(ap(_))
}
}
Unfortunately it fails when I want to computer all the prime numbers smaller than 1000000 (1 million) . I am receiving OutOfMemory .
Do you have any idea on how to optimize the code, or how can I implement this algorithm in a more elegant fashion .
PS: I've done something very similar in Haskell, and there I didn't encountered any issues .
I would go with an infinite Stream. Using a lazy data structure allows to code pretty much like in Haskell. It reads automatically more "declarative" than the code you wrote.
import Stream._
val primes = 2 #:: sieve(3)
def sieve(n: Int) : Stream[Int] =
if (primes.takeWhile(p => p*p <= n).exists(n % _ == 0)) sieve(n + 2)
else n #:: sieve(n + 2)
def getPrimes(maxNum : Int) = primes.takeWhile(_ < maxNum)
Obviously, this isn't the most performant approach. Read The Genuine Sieve of Eratosthenes for a good explanation (it's Haskell, but not too difficult). For real big ranges you should consider the Sieve of Atkin.
The code in question is not tail recursive, so Scala cannot optimize the recursion away. Also, Haskell is non-strict by default, so you can't hardly compare it to Scala. For instance, whereas Haskell benefits from foldRight, Scala benefits from foldLeft.
There are many Scala implementations of Sieve of Eratosthenes, including some in Stack Overflow. For instance:
(n: Int) => (2 to n) |> (r => r.foldLeft(r.toSet)((ps, x) => if (ps(x)) ps -- (x * x to n by x) else ps))
The following answer is about a 100 times faster than the "one-liner" answer using a Set (and the results don't need sorting to ascending order) and is more of a functional form than the other answer using an array although it uses a mutable BitSet as a sieving array:
object SoE {
def makeSoE_Primes(top: Int): Iterator[Int] = {
val topndx = (top - 3) / 2
val nonprms = new scala.collection.mutable.BitSet(topndx + 1)
def cullp(i: Int) = {
import scala.annotation.tailrec; val p = i + i + 3
#tailrec def cull(c: Int): Unit = if (c <= topndx) { nonprms += c; cull(c + p) }
cull((p * p - 3) >>> 1)
}
(0 to (Math.sqrt(top).toInt - 3) >>> 1).filterNot { nonprms }.foreach { cullp }
Iterator.single(2) ++ (0 to topndx).filterNot { nonprms }.map { i: Int => i + i + 3 }
}
}
It can be tested by the following code:
object Main extends App {
import SoE._
val top_num = 10000000
val strt = System.nanoTime()
val count = makeSoE_Primes(top_num).size
val end = System.nanoTime()
println(s"Successfully completed without errors. [total ${(end - strt) / 1000000} ms]")
println(f"Found $count primes up to $top_num" + ".")
println("Using one large mutable1 BitSet and functional code.")
}
With the results from the the above as follows:
Successfully completed without errors. [total 294 ms]
Found 664579 primes up to 10000000.
Using one large mutable BitSet and functional code.
There is an overhead of about 40 milliseconds for even small sieve ranges, and there are various non-linear responses with increasing range as the size of the BitSet grows beyond the different CPU caches.
It looks like List isn't very effecient space wise. You can get an out of memory exception by doing something like this
1 to 2000000 toList
I "cheated" and used a mutable array. Didn't feel dirty at all.
def primesSmallerThan(n: Int): List[Int] = {
val nonprimes = Array.tabulate(n + 1)(i => i == 0 || i == 1)
val primes = new collection.mutable.ListBuffer[Int]
for (x <- nonprimes.indices if !nonprimes(x)) {
primes += x
for (y <- x * x until nonprimes.length by x if (x * x) > 0) {
nonprimes(y) = true
}
}
primes.toList
}