Scala MapReduce Filter

Is there any way of doing the following in Scala?
Say I have an array of Double of size 15:
[10,20,30,40,50,60,70,80,Double.NaN,Double.NaN,110,120,130,140,150]
I would like to replace all the Double.NaN (from left to right) by the average of the last four values in the array using map reduce. So the first Double.NaN gets replaced by 60, and the next Double.NaN is replaced by 64 (i.e., the previously calculated 60 at the index 8 is used in this calculation).
So far I have used function type parameters to get the positions of the Double.NaN.

I'm not sure what exactly you mean by "map-reduce" in this case. It looks rather like a use-case for scanLeft:
import scala.collection.immutable.Queue

val input = List[Double](
  10, 20, 30, 40, 50, 60, 70, 80, Double.NaN,
  Double.NaN, 110, 120, 130, 140, 150
)

val patched = input.
  scanLeft((Queue.fill(5)(0d), 0d)) {
    case ((q, _), x) =>
      val y = if (x.isNaN) q.sum / 5 else x
      (q.dequeue._2.enqueue(y), y)
  }.unzip._2.tail
This produces:
List(10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 60.0, 64.0, 110.0, 120.0, 130.0, 140.0, 150.0)
In general, unless the gaps are rare, this would not work with a typical map-reduce workflow, because:
Every value in the resulting list can depend on arbitrarily many values to the left of it, so you cannot cut the dataset into independent blocks and map them independently.
You are not reducing anything; you want a patched list back.
If you are not mapping and not reducing, I wouldn't call it "map-reduce".
By the way: the above code works for any positive integer in place of the 5.
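To make that last point concrete, here is the same logic with the window size pulled out as a parameter (patchNaN is a made-up name, not from the question):

```scala
import scala.collection.immutable.Queue

// Hypothetical helper: replace each NaN with the mean of the k values
// immediately before it (already-patched values included).
def patchNaN(input: List[Double], k: Int): List[Double] =
  input.scanLeft((Queue.fill(k)(0d), 0d)) {
    case ((q, _), x) =>
      val y = if (x.isNaN) q.sum / k else x
      (q.dequeue._2.enqueue(y), y)   // slide the window forward
  }.unzip._2.tail
```

With k = 5 this reproduces the result above: the first NaN becomes 60.0 and the second, which sees the freshly patched 60.0 inside its window, becomes 64.0.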

Note that averaging the four values before the first NaN in the given example (50, 60, 70, 80) gives 65, not 60. Averaging the last five gives 60.
Does it have to be a map-reduce? How about a fold?
(List[Double]() /: listOfDoubles) { (acc: List[Double], double: Double) =>
  (if (double.isNaN)
    acc match {
      case Nil => 0.0 // first double in the list
      case _ =>
        val last5 = acc.take(5)
        (0.0 /: last5)(_ + _) / last5.size // in case there are only 1, 2, 3, or 4 previous values
    }
  else double) :: acc
}.reverse

Related

Efficient way to select a subset of elements from a Seq in Scala

I have a sequence
val input = Seq(1,3,4,5,9,11...)
I want to randomly select a subset of it. What is the fastest way.
I currently implement it like this:
// ratio is the percentage of the subgroup relative to the whole group
def randomSelect(ratio: Double): Boolean = {
  val rr = scala.util.Random
  if (rr.nextFloat() < ratio) true else false
}

val ratio = 0.3
val result = input.map(x => (x, randomSelect(ratio))).filter(_._2).map(_._1)
So I first attach a true/false label for each element, and filter out those false elements, and get back the subset of the sequence.
Is there any faster or better way?
So there are basically two approaches to this:
select n elements at random
include or exclude each element with probability p
Your solution is the latter and can be simplified to:
l.filter(_ => r.nextFloat < p)
(I'm calling the list, l, the instance of Random r and your ratio p from here on out.)
If you wanted to sample exactly n elements you could do:
r.shuffle(l).take(n)
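A minimal, seeded sketch contrasting the two approaches (names l, r, p as above; the seed is my addition, for reproducibility):

```scala
import scala.util.Random

val r = new Random(42)   // fixed seed so the run is reproducible
val l = (1 to 1000).toList
val p = 0.2

// Bernoulli sampling: each element kept independently with probability p,
// so the sample size varies around p * l.size
val bernoulli = l.filter(_ => r.nextFloat < p)

// Exact-size sampling: shuffle, then take the first n elements
val exact = r.shuffle(l).take(200)
```

Both produce subsets of l; only the second guarantees the sample size.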
I compared these selecting 200 elements from a 1000 element list:
scala> val first = time {
     |   l.map(x => (x, r.nextFloat < p)).filter(_._2).map(_._1)
     | }
Elapsed time: 3249507ns
scala> val second = time {
     |   r.shuffle(l).take(200)
     | }
Elapsed time: 10640432ns
scala> val third = time {
     |   l.filter(_ => r.nextFloat < p)
     | }
Elapsed time: 1689009ns
Dropping your extra two maps appears to roughly halve the time (which makes complete sense). The shuffle-and-take method is significantly slower, but does guarantee you a fixed number of elements.
I borrowed the timing function from here if you want to do a more rigorous investigation (i.e. average over many trials, rather than 1).
If your list isn't big, a simple filter as suggested by others should suffice:
list.filter(_ => Random.nextDouble < p)
In case you have a big list, the per-element call of Random could become the bottleneck. One approach to minimize the calls is to generate random gaps (0, 1, 2, ...) by which the data sampling will hop over the elements. Below is a simple implementation in Scala:
import scala.util.Random
import scala.math._
def gapSampling(list: List[Double], p: Double): List[Double] = {
  def randomGap(p: Double): Double = {
    val epsilon: Double = 1e-10
    val u = max(Random.nextDouble, epsilon)
    floor(log(u) / log(1 - p))
  }

  @scala.annotation.tailrec
  def samplingFcn(acc: List[Double], list: List[Double], p: Double): List[Double] = list match {
    case Nil => acc
    case _ =>
      val gap = randomGap(p).toInt
      val l = list.drop(gap + 1)
      val accNew = l.headOption match {
        case Some(e) => e :: acc
        case None => acc
      }
      samplingFcn(accNew, l, p)
  }

  samplingFcn(List[Double](), list, p).reverse
}
val list = (1 to 100).toList.map(_.toDouble)
gapSampling(list, 0.3)
// res1: List[Double] = List(
// 2.0, 5.0, 7.0, 14.0, 15.0, 18.0, 20.0, 25.0, 26.0, 28.0, 33.0,
// 35.0, 42.0, 43.0, 47.0, 48.0, 50.0, 55.0, 56.0, 59.0, 62.0,
// 69.0, 72.0, 75.0, 76.0, 79.0, 82.0, 93.0, 96.0, 97.0, 98.0
// )
More details about such gap sampling can be found here.
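The gap trick works because the number of skipped elements before the next kept one follows a geometric distribution, and floor(log(u) / log(1 - p)) is the standard inverse-transform draw for it. A quick sanity check of that draw in isolation (the seeded rng parameter is my addition):

```scala
import scala.util.Random
import scala.math._

// Inverse-transform sampling of a geometric gap: P(gap = k) = p * (1 - p)^k
def randomGap(p: Double, rng: Random): Int = {
  val u = max(rng.nextDouble, 1e-10)   // avoid log(0)
  floor(log(u) / log(1 - p)).toInt
}

val rng = new Random(7)
val p = 0.3
val gaps = List.fill(100000)(randomGap(p, rng))
val meanGap = gaps.sum.toDouble / gaps.size
// the expected mean gap is (1 - p) / p, about 2.33 for p = 0.3
```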

How to sort the RDD and get top N elements using scala?

I have an RDD of a case class (TopNModel) and want to get the top N elements from a given RDD, sorted by tx + rx.
In case of two equal (tx + rx) values, sort by mac.
case class TopNModel(mac: Long, tx: Int, rx: Int)
For example:
RDD[TopNModel(10L, 200, 100), TopNModel(12L, 100, 100), TopNModel(1L, 200, 400), TopNModel(11L, 100, 200)]
sort by tx + rx and mac:
RDD[TopNModel(1L, 200, 400), TopNModel(10L, 200, 100), TopNModel(11L, 100, 200), TopNModel(12L, 100, 100)]
My Question:
How to sort if rx + tx values are the same then sort based on mac?
EDIT: per important comment below, if indeed the requirement is to "get top N" entities based on this order, sortBy is wasteful compared to takeOrdered. Use the second solution ("alternative") with takeOrdered.
You can use the fact that tuples are naturally ordered from "leftmost" argument to right, and create a tuple with the negative value of tx + rx (so that these are sorted in descending order) and the positive value of mac:
val result = rdd.sortBy { case TopNModel(mac, tx, rx) => (-(tx + rx), mac) }
Alternatively, if you want TopNModel to always be sorted this way (no matter the context), you can make it an Ordered and implement its compare method. Then, sorting by identity will use that compare to get the same result:
case class TopNModel(mac: Long, tx: Int, rx: Int) extends Ordered[TopNModel] {
  import scala.math.Ordered.orderingToOrdered

  def compare(that: TopNModel): Int =
    (-(tx + rx), mac) compare (-(that.tx + that.rx), that.mac)
}
val result = rdd.sortBy(identity)
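For the "top N" requirement mentioned in the edit, the same tuple ordering can be handed to takeOrdered, which avoids sorting the whole RDD. Here is the ordering exercised on a plain local collection (on an RDD the call would be rdd.takeOrdered(n)(ord)):

```scala
case class TopNModel(mac: Long, tx: Int, rx: Int)

val data = Seq(
  TopNModel(10L, 200, 100), TopNModel(12L, 100, 100),
  TopNModel(1L, 200, 400), TopNModel(11L, 100, 200)
)

// Descending by tx + rx, ties broken ascending by mac
val ord = Ordering.by { m: TopNModel => (-(m.tx + m.rx), m.mac) }
val top2 = data.sorted(ord).take(2)
```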

Functional way to map over a list with an accumulator in Scala

I would like to write succinct code to map over a list, accumulating a value as I go and using that value in the output list.
Using a recursive function and pattern matching this is straightforward (see below). But I was wondering if there is a way to do this using the functional programming family of combinators like map and fold, etc. Obviously map and fold are no good unless you use a mutable variable defined outside the call and modify that in the body.
Perhaps I could do this with a State Monad but was wondering if there is a way to do it that I'm missing, and that utilizes the Scala standard library.
// accumulate(List(10, 20, 20, 30, 20))
// => List(10, 30, 50, 80, 100)
def accumulate(weights: List[Int], sum: Int = 0, acc: List[Int] = List.empty): List[Int] = {
  weights match {
    case hd :: tl =>
      val total = hd + sum
      accumulate(tl, total, total :: acc)
    case Nil =>
      acc.reverse
  }
}
You may also use foldLeft:
def accumulate(seq: Seq[Int]) =
  seq.foldLeft(Vector.empty[Int]) { (result, e) =>
    result :+ (result.lastOption.getOrElse(0) + e)
  }
accumulate(List(10, 20, 20, 30, 20))
// => List(10, 30, 50, 80, 100)
This could be done with scan:
val result = list.scanLeft(0){case (acc, item) => acc+item}
Scan will include the initial value 0 into output so you have to drop it:
result.drop(1)
As pointed out in @Nyavro's answer, the operation you are looking for (the sum of the prefixes of the list) is called prefix-sum, and its generalization to any binary operation is called scan and is included in the Scala standard library:
val l = List(10, 20, 20, 30, 20)
l.scan(0) { _ + _ }
//=> List(0, 10, 30, 50, 80, 100)
l.scan(0)(_ + _).drop(1)
//=> List(10, 30, 50, 80, 100)
This has already been answered, but I wanted to address a misconception in your question:
Obviously map and fold are no good unless you use a mutable variable defined outside the call and modify that in the body.
That is not true. fold is a general method of iteration. Everything you can do by iterating over a collection, you can do with fold. If fold were the only method in your List class, you could still do everything you can do now. Here's how to solve your problem with fold:
l.foldLeft(List(0)) { (list, el) ⇒ list.head + el :: list }.reverse.drop(1)
And a general implementation of scan:
def scan[A](l: List[A])(z: A)(op: (A, A) ⇒ A) =
  l.foldLeft(List(z)) { (list, el) ⇒ op(list.head, el) :: list }.reverse
Think of it this way: a collection can be either empty or not. fold has two arguments, one which tells it what to do when the list is empty, and one which tells it what to do when the list is not empty. Those are the only two cases, so every possible case is handled. Therefore, fold can do anything! (More precisely in Scala, foldLeft and foldRight can do anything, while fold is restricted to associative operations.)
Or a different viewpoint: a collection is a stream of instructions, either the EMPTY instruction or the ELEMENT(value) instruction. foldLeft / foldRight are skeleton interpreters for that instruction set, and you as a programmer can supply the implementation for the interpretation of both those instructions, namely the two arguments to foldLeft / foldRight are the interpretation of those instructions.
Remember: while foldLeft / foldRight reduces a collection to a single value, that value can be arbitrarily complex, including being a collection itself!
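To make the "fold can do anything" point concrete, here are map and filter themselves written with foldRight (mapViaFold and filterViaFold are names of my own choosing):

```scala
// map and filter expressed as folds: the non-empty case decides what to
// prepend, the empty case supplies the terminal Nil.
def mapViaFold[A, B](l: List[A])(f: A => B): List[B] =
  l.foldRight(List.empty[B])((el, acc) => f(el) :: acc)

def filterViaFold[A](l: List[A])(p: A => Boolean): List[A] =
  l.foldRight(List.empty[A])((el, acc) => if (p(el)) el :: acc else acc)
```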

foldLeft on list of Doubles returns strange value

I'm running into a strange issue when trying to sum a list of doubles that are contained in different instances using foldLeft. Upon investigation, it seems that even when working with a list of simple doubles, the issue persists:
val listOfDoubles = List(4.0, 100.0, 1.0, 0.6, 8.58, 80.0, 22.33, 179.99, 8.3, 59.0, 0.6)
listOfDoubles.foldLeft(0.0)((acc, elem) => acc + elem) // gives 464.40000000000003 instead of 464.40
What am I doing wrong here?
NOTE: foldLeft here is necessary as what I'm trying to achieve is a sum of doubles contained in different instances of a case class SomeClass(value: Double), unless, of course, there is another method to go about this.
What am I doing wrong here?
Using doubles when you need this kind of precision. It's not a problem with foldLeft. You're suffering from floating point rounding error. The problem is that some numbers need very long (or infinite) representations in binary, which means that they need to be truncated in binary form. And when converted back to decimal, that binary rounding error surfaces.
Use BigDecimal instead, as it is designed for arbitrary-precision decimal arithmetic.
scala> val list: List[BigDecimal] = List(4.0, 100.0, 1.0, 0.6, 8.58, 80.0, 22.33, 179.99, 8.3, 59.0, 0.6)
list: List[BigDecimal] = List(4.0, 100.0, 1.0, 0.6, 8.58, 80.0, 22.33, 179.99, 8.3, 59.0, 0.6)
scala> list.foldLeft(BigDecimal(0.0))(_ + _)
res3: scala.math.BigDecimal = 464.40
Or simply:
scala> list.sum
res4: BigDecimal = 464.40
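For the case-class scenario in the note, the same idea applies; a sketch reusing the asker's SomeClass name, with the field declared as BigDecimal (constructing from strings keeps the decimal values exact):

```scala
case class SomeClass(value: BigDecimal)

val items = List(
  SomeClass(BigDecimal("4.0")),
  SomeClass(BigDecimal("0.6")),
  SomeClass(BigDecimal("8.58"))
)

// Sum the wrapped values exactly
val total = items.map(_.value).sum
// equivalently: items.foldLeft(BigDecimal(0))(_ + _.value)
```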

Scala foldLeft too many parameters

I have a list of tuples called item, each index in the list contains 2 x Doubles e.g.
val item = List((1.0, 2.0), (3.0, 4.0), (10.0, 100.0))
I want to perform a calculation on each index within the list item and I'm trying to do it with foldLeft. This is my code:
item.foldLeft(0.0)(_ + myMethod(_._2, _._1, item.size))
_._2 accesses the current item Tuple at index 1 and _._1 accesses the current item Tuple at index 0. e.g. for the first fold it should effectively be:
item.foldLeft(0.0)(_ + myMethod(2.0, 1.0, item.size))
The Second Fold:
item.foldLeft(0.0)(_ + myMethod(4.0, 3.0, item.size))
The Third Fold:
item.foldLeft(0.0)(_ + myMethod(100.0, 10.0, item.size))
where myMethod:
def myMethod(i: Double, j: Double, size: Integer): Double = {
  (j - i) / size
}
It is giving me an error which says that there are too many parameters for foldLeft as it requires 2 parameters.
myMethod returns a Double, and _ is a Double. So, where is this extra parameter the compiler is seeing?
If I do this:
item.foldLeft(0.0)(_ + _._1)
It sums up all the first Doubles in each index of item - replacing _._1 with _._2 sums up all the second Doubles in each index of item.
Any help is greatly appreciated!
Each _ is equivalent to a new argument, so (_ + myMethod(_._2, _._1, item.size)) is an anonymous function with 3 arguments: (x, y, z) => x + myMethod(y._2, z._1, item.size).
What you want is (acc, x) => acc + myMethod(x._2, x._1, item.size).
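Putting that together with the data and myMethod from the question (size declared as Int here for simplicity):

```scala
def myMethod(i: Double, j: Double, size: Int): Double = (j - i) / size

val item = List((1.0, 2.0), (3.0, 4.0), (10.0, 100.0))

// One explicit accumulator parameter, one element parameter
val result = item.foldLeft(0.0)((acc, x) => acc + myMethod(x._2, x._1, item.size))
// (1-2)/3 + (3-4)/3 + (10-100)/3 ≈ -30.67
```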