Is there a better way to do a reduce operation on RDD[Array[Double]]? - scala

I want to reduce an RDD[Array[Double]] so that each element of an array is added to the corresponding element of the next array.
For the moment I use this code:
val rdd1: RDD[Array[Double]] = ... // obtained elsewhere
var coord = rdd1.reduce( (x, y) => (x, y).zipped.map(_ + _) )
Is there a better, more efficient way to do this, because it is quite costly?

Using zipped.map is very inefficient, because it creates a lot of temporary objects and boxes the doubles.
If you use spire, you can just do this
> import spire.implicits._
> val rdd1 = sc.parallelize(Seq(Array(1.0, 2.0), Array(3.0, 4.0)))
> var coord = rdd1.reduce( _ + _)
res1: Array[Double] = Array(4.0, 6.0)
This is much nicer to look at, and should also be much more efficient.
Spire is a dependency of spark, so you should be able to do the above without any extra dependencies. At least it worked with a spark-shell for spark 1.3.1 here.
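If it is not already on your classpath (for example outside spark-shell), you would add Spire as an explicit dependency; the coordinates and version below are illustrative for the Spire releases of that era and may need adjusting:
// build.sbt (illustrative coordinates/version, not taken from the original answer)
libraryDependencies += "org.spire-math" %% "spire" % "0.9.1"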
This will work for any array where there is an AdditiveSemigroup typeclass instance available for the element type. In this case, the element type is Double. Spire typeclasses are @specialized for Double, so there will be no boxing going on anywhere.
If you really want to know what is going on to make this work, you have to use reify:
> import scala.reflect.runtime.{universe => u}
> val a = Array(1.0, 2.0)
> val b = Array(3.0, 4.0)
> u.reify { a + b }
res5: reflect.runtime.universe.Expr[Array[Double]] = Expr[scala.Array[Double]](
implicits.additiveSemigroupOps(a)(
implicits.ArrayNormedVectorSpace(
implicits.DoubleAlgebra,
implicits.DoubleAlgebra,
Predef.this.implicitly)).$plus(b))
So the addition works because there is an instance of AdditiveSemigroup for Array[Double].
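If you would rather not rely on Spire at all, the same no-boxing effect can be had with a plain while loop over the primitive arrays. A minimal sketch (not from the original answer), assuming rdd1 is the RDD[Array[Double]] from the question and all arrays have the same length:
val coord = rdd1.reduce { (x, y) =>
  // sum element-wise into a preallocated primitive array: no tuples, no boxed Doubles
  val out = new Array[Double](x.length)
  var i = 0
  while (i < out.length) {
    out(i) = x(i) + y(i)
    i += 1
  }
  out
}
It is less pretty than _ + _, but it keeps everything in primitive doubles without any extra dependency.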

I assume the concern is that you have very large Array[Double] and the transformation as written does not distribute the addition of them. If so, you could do something like (untested):
// map Array[Double] to (index, double)
val rdd2 = rdd1.flatMap(a => a.zipWithIndex.map(t => (t._2, t._1)))
// get the sum for each index
val reduced = rdd2.reduceByKey(_ + _)
// key everything the same to get a single iterable in groupByKey
val groupAll = reduced.map(t => ("constKey", (t._1, t._2)))
// get the doubles back together into an array
val coord = groupAll.groupByKey().map { case (_, vs) =>
  vs.toList.sortBy(_._1).map(_._2).toArray
}

Related

reduce a list in scala by value

How can I concisely reduce a list like the one below
Seq[Temp] = List(Temp(a,1), Temp(a,2), Temp(b,1))
to
List(Temp(a,2), Temp(b,1))
Only keep Temp objects with unique first param and max of second param.
My current solution uses a lot of groupBys and reduces, which makes for a lengthy answer.
You have to:
groupBy
sortBy values in ascending order
take the last one, which is the largest
Example,
scala> final case class Temp (a: String, value: Int)
defined class Temp
scala> val data : Seq[Temp] = List(Temp("a",1), Temp("a",2), Temp("b",1))
data: Seq[Temp] = List(Temp(a,1), Temp(a,2), Temp(b,1))
scala> data.groupBy(_.a).map { case (k, group) => group.sortBy(_.value).last }
res0: scala.collection.immutable.Iterable[Temp] = List(Temp(b,1), Temp(a,2))
or instead of sortBy(fn).last you can maxBy(fn)
scala> data.groupBy(_.a).map { case (k, group) => group.maxBy(_.value) }
res1: scala.collection.immutable.Iterable[Temp] = List(Temp(b,1), Temp(a,2))
You can generate a Map with groupBy, compute the max in mapValues and convert it back to the Temp classes as in the following example:
case class Temp(id: String, value: Int)
List(Temp("a", 1), Temp("a", 2), Temp("b", 1)).
groupBy(_.id).mapValues( _.map(_.value).max ).
map{ case (k, v) => Temp(k, v) }
// res1: scala.collection.immutable.Iterable[Temp] = List(Temp(b,1), Temp(a,2))
Worth noting that the solution using maxBy in the other answer is more efficient as it minimizes necessary transformations.
You can do this using foldLeft:
import scala.math.max

data.foldLeft(Map[String, Int]().withDefaultValue(0))((map, tmp) => {
  map.updated(tmp.id, max(map(tmp.id), tmp.value))
}).map{ case (i, v) => Temp(i, v) }
This is essentially combining the logic of groupBy with the max operation in a single pass.
Note: this may be less efficient than groupBy, because groupBy uses a mutable.Map internally, which avoids constantly re-creating a new map. If you care about performance and are prepared to use mutable data, this is another option:
val tmpMap = mutable.Map[String, Int]().withDefaultValue(0)
data.foreach(tmp => tmpMap(tmp.id) = max(tmp.value, tmpMap(tmp.id)))
tmpMap.map{case (i,v) => Temp(i, v)}.toList
Use a ListMap if you need to retain the data order, or sort at the end if you need a particular ordering.
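For example, sorting at the end might look like this (a small sketch reusing tmpMap from the snippet above):
// rebuild the Temp objects in key order instead of the map's arbitrary order
tmpMap.toList.sortBy(_._1).map { case (i, v) => Temp(i, v) }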

too many map keys causing out of memory exception in spark

I have an RDD 'inRDD' of the form RDD[(Vector[(Int, Byte)], Vector[(Int, Byte)])] which is a PairRDD(key,value) where key is Vector[(Int, Byte)] and value is Vector[(Int, Byte)].
For each element (Int, Byte) in the vector of the key field and each element (Int, Byte) in the vector of the value field, I would like to get a new (key, value) pair in the output RDD, of the form ((Int, Int), (Byte, Byte)).
That should give me an RDD of the form RDD[((Int, Int), (Byte, Byte))].
For example, inRDD contents could be like,
(Vector((3,2)),Vector((4,2))), (Vector((2,3), (3,3)),Vector((3,1))), (Vector((1,3)),Vector((2,1))), (Vector((1,2)),Vector((2,2), (1,2)))
which would become
((3,4),(2,2)), ((2,3),(3,1)), ((3,3),(3,1)), ((1,2),(3,1)), ((1,2),(2,2)), ((1,1),(2,2))
I have the following code for that.
val outRDD = inRDD.flatMap {
  case (left, right) =>
    for ((ll, li) <- left; (rl, ri) <- right) yield {
      (ll, rl) -> (li, ri)
    }
}
It works when the vectors in inRDD are small. But when the vectors contain a lot of elements, I get an out of memory exception. Increasing the memory available
to Spark only helps for slightly larger inputs, and the error appears again for inputs that are larger still.
It looks like I am trying to assemble a huge structure in memory, and I have not been able to rewrite this code in any other way.
I have implemented similar logic with Java in Hadoop as follows.
for (String fromValue : fromAssetVals) {
    fromEntity = fromValue.split(":")[0];
    fromAttr = fromValue.split(":")[1];
    for (String toValue : toAssetVals) {
        toEntity = toValue.split(":")[0];
        toAttr = toValue.split(":")[1];
        oKey = new Text(fromEntity.trim() + ":" + toEntity.trim());
        oValue = new Text(fromAttr + ":" + toAttr);
        outputCollector.collect(oKey, oValue);
    }
}
But when I try something similar in spark, I get nested rdd exceptions.
How do I do this efficiently with spark using scala?
Well, if Cartesian product is the only option you can at least make it a little bit more lazy:
inRDD.flatMap { case (xs, ys) =>
  xs.toIterator.flatMap(x => ys.toIterator.map(y => (x, y)))
}
You can also handle this at the Spark level
import org.apache.spark.RangePartitioner
val indexed = inRDD.zipWithUniqueId.map(_.swap)
val partitioner = new RangePartitioner(indexed.partitions.size, indexed)
val partitioned = indexed.partitionBy(partitioner)
val lefts = partitioned.flatMapValues(_._1)
val rights = partitioned.flatMapValues(_._2)
lefts.join(rights).values

How to extract elements from 4 lists in scala?

case class TargetClass(key: Any, value: Number, lowerBound: Double, upperBound: Double)
val keys: List[Any] = List("key1", "key2", "key3")
val values: List[Number] = List(1,2,3);
val lowerBounds: List[Double] = List(0.1, 0.2, 0.3)
val upperBounds: List[Double] = List(0.5, 0.6, 0.7)
Now I want to construct a List[TargetClass] to hold the 4 lists. Does anyone know how to do it efficiently? Is using a for-loop to add elements one by one very inefficient?
I tried to use zipped, but it seems that this only applies for combining up to 3 lists.
Thank you very much!
One approach:
keys.zipWithIndex.map {
  case (item, i) => TargetClass(item, values(i), lowerBounds(i), upperBounds(i))
}
You may want to consider using the lift method to deal with the case of the lists being of unequal lengths (and thereby provide a default if keys is longer than any of the other lists).
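A sketch of that idea (the fallback defaults here are made up purely for illustration):
val defaultValue: Number = 0   // illustrative defaults, pick whatever makes sense
val defaultBound = 0.0
keys.zipWithIndex.map { case (item, i) =>
  TargetClass(
    item,
    values.lift(i).getOrElse(defaultValue),       // lift returns an Option, so a missing
    lowerBounds.lift(i).getOrElse(defaultBound),  // index falls back to the default
    upperBounds.lift(i).getOrElse(defaultBound))
}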
I realise this doesn't address your question of efficiency. You could fairly easily run some tests on different approaches.
You can apply zipped to the first two lists, to the last two lists, then to the results of the previous zips, then map to your class, like so:
val z12 = (keys, values).zipped
val z34 = (lowerBounds, upperBounds).zipped
val z1234 = (z12.toList, z34.toList).zipped
val targs = z1234.map { case ((k,v),(l,u)) => TargetClass(k,v,l,u) }
// targs = List(TargetClass(key1,1,0.1,0.5), TargetClass(key2,2,0.2,0.6), TargetClass(key3,3,0.3,0.7))
How about:
keys zip values zip lowerBounds zip upperBounds map {
  case (((k, v), l), u) => TargetClass(k, v, l, u)
}
Example:
scala> val zipped = keys zip values zip lowerBounds zip upperBounds
zipped: List[(((Any, Number), Double), Double)] = List((((key1,1),0.1),0.5), (((key2,2),0.2),0.6), (((key3,3),0.3),0.7))
scala> zipped map { case (((k, v), l), u) => TargetClass(k, v, l, u) }
res6: List[TargetClass] = List(TargetClass(key1,1,0.1,0.5), TargetClass(key2,2,0.2,0.6), TargetClass(key3,3,0.3,0.7))
It would be nice if .transpose worked on a Tuple of Lists.
for (List(k, v:Number, l:Double, u:Double) <-
List(keys, values, lowerBounds, upperBounds).transpose)
yield TargetClass(k,v,l,u)
I think no matter what you use, from an efficiency point of view you will have to traverse the lists individually. The only question is: do you do it yourself, or, for the sake of readability, do you use Scala idioms and let Scala do the dirty work for you :)?
Other approaches are not necessarily more efficient. You can change the order of zipping and the order of assembling the return value of the map function as you like.
Here is a more functional way, but I am not sure it will be more efficient. See the comments on @wwkudu's (zip with index) answer.
val res1 = keys zip lowerBounds zip values zip upperBounds
res1.map {
  x => (x._1._1._1, x._1._1._2, x._1._2, x._2)
  // Of course, you can return an instance of TargetClass
  // here instead of the tuple I am returning.
}
I am curious, why do you need a "TargetClass"? Will a tuple work?

How to access/initialize and update values in a mutable map?

Consider the simple problem of using a mutable map to keep track of occurrences/counts, i.e. with:
val counts = collection.mutable.Map[SomeKeyType, Int]()
My current approach to incrementing a count is:
counts(key) = counts.getOrElse(key, 0) + 1
// or equivalently
counts.update(key, counts.getOrElse(key, 0) + 1)
This somehow feels a bit clumsy, because I have to specify the key twice. In terms of performance, I would also expect that the key has to be located twice in the map, which I would like to avoid. Interestingly, this access-and-update problem would not occur if Int provided some mechanism to modify itself. Changing from Int to a Counter class that provides an increment function would for instance allow:
// not possible with Int
counts.getOrElseUpdate(key, 0) += 1
// but with a modifiable counter
counts.getOrElseUpdate(key, new Counter).increment
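(Such a Counter is not part of the standard library; a minimal hypothetical version might look like this:)
class Counter {
  private var n = 0
  def increment: Unit = n += 1   // mutate in place, so no second map lookup is needed
  def value: Int = n
}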
Somehow I'm always expecting to have the following functionality with a mutable map (somewhat similar to transform but without returning a new collection and on a specific key with a default value):
// fictitious use
counts.updateOrElse(key, 0, _ + 1)
// or alternatively
counts.getOrElseUpdate(key, 0).modify(_ + 1)
However, as far as I can see, such functionality does not exist. Wouldn't it make sense in general (performance- and syntax-wise) to have such an f: A => A in-place modification possibility? Probably I'm just missing something here... I guess there must be some better solution to this problem that makes such functionality unnecessary?
Update:
I should have clarified that I'm aware of withDefaultValue, but the problem remains the same: performing two lookups is still twice as slow as one, no matter whether it is an O(1) operation or not. Frankly, in many situations I would be more than happy to achieve a speed-up of factor 2. And obviously the construction of the modification closure can often be moved outside of the loop, so imho this is not a big issue compared to running an operation unnecessarily twice.
You could create the map with a default value, which would allow you to do the following:
scala> val m = collection.mutable.Map[String, Int]().withDefaultValue(0)
m: scala.collection.mutable.Map[String,Int] = Map()
scala> m.update("a", m("a") + 1)
scala> m
res6: scala.collection.mutable.Map[String,Int] = Map(a -> 1)
As Impredicative mentioned, map lookups are fast so I wouldn't worry about 2 lookups.
Update:
As Debilski pointed out you can do this even more simply by doing the following:
scala> val m = collection.mutable.Map[String, Int]().withDefaultValue(0)
scala> m("a") += 1
scala> m
res6: scala.collection.mutable.Map[String,Int] = Map(a -> 1)
Starting Scala 2.13, Map#updateWith serves this exact purpose:
map.updateWith("a")({
case Some(count) => Some(count + 1)
case None => Some(1)
})
def updateWith(key: K)(remappingFunction: (Option[V]) => Option[V]): Option[V]
For instance, if the key doesn't exist:
val map = collection.mutable.Map[String, Int]()
// map: collection.mutable.Map[String, Int] = HashMap()
map.updateWith("a")({ case Some(count) => Some(count + 1) case None => Some(1) })
// Option[Int] = Some(1)
map
// collection.mutable.Map[String, Int] = HashMap("a" -> 1)
and if the key exists:
map.updateWith("a")({ case Some(count) => Some(count + 1) case None => Some(1) })
// Option[Int] = Some(2)
map
// collection.mutable.Map[String, Int] = HashMap("a" -> 2)
I wanted to lazy-initialise my mutable map instead of doing a fold (for memory efficiency). The collection.mutable.Map.getOrElseUpdate() method suited my purposes. My map contained a mutable object for summing values (again, for efficiency).
val accum = accums.getOrElseUpdate(key, new Accum)
accum.add(elem.getHours, elem.getCount)
collection.mutable.Map.withDefaultValue() does not keep the default value for a subsequent requested key.
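For example, reading a missing key through the default returns the value but does not insert it (a quick sketch of the point being made):
val m = collection.mutable.Map[String, Int]().withDefaultValue(0)
m("a")            // returns the default, 0 ...
m.contains("a")   // ... but false: nothing was stored for "a"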

Newbie Scala question about simple math array operations

Newbie Scala Question:
Say I want to do this [Java code] in Scala:
public static double[] abs(double[] r, double[] im) {
    double t[] = new double[r.length];
    for (int i = 0; i < t.length; ++i) {
        t[i] = Math.sqrt(r[i] * r[i] + im[i] * im[i]);
    }
    return t;
}
and also make it generic (since Scala handles generic primitives efficiently, I have read). Relying only on the core language (no library objects/classes, methods, etc.), how would one do this? Truthfully I don't see how to do it at all, so I guess that's just a pure bonus point question.
I ran into sooo many problems trying to do this simple thing that I have given up on Scala for the moment. Hopefully once I see the Scala way I will have an 'aha' moment.
UPDATE:
After discussing this with others, this is the best answer I have found so far.
def abs[T](r: Iterable[T], im: Iterable[T])(implicit n: Numeric[T]) = {
  import n.mkNumericOps
  r zip(im) map(t => math.sqrt((t._1 * t._1 + t._2 * t._2).toDouble))
}
Doing generic/performant primitives in scala actually involves two related mechanisms which scala uses to avoid boxing/unboxing (e.g. wrapping an int in a java.lang.Integer and vice versa):
The @specialized type annotation
Using Manifest with arrays
@specialized is an annotation that tells the Scala compiler to create "primitive" versions of code (akin to C++ templates, so I am told). Check out the type declaration of Tuple2 (which is specialized) compared with List (which isn't). It was added in 2.8 and means that, for example, code like CC[Int].map(f : Int => Int) is executed without ever boxing any ints (assuming CC is specialized, of course!).
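A tiny illustration of the annotation (the class name here is made up):
// With @specialized the compiler also emits variants of Cell backed by raw
// double and int fields, so new Cell(1.0) stores the value without boxing it.
class Cell[@specialized(Double, Int) T](val value: T)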
Manifests are a way of doing reified types in scala (which is limited by the JVM's type erasure). This is particularly useful when you want to have a method genericized on some type T and then create an array of T (i.e. T[]) within the method. In Java this is not possible because new T[] is illegal. In scala this is possible using Manifests. In particular, and in this case it allows us to construct a primitive T-array, like double[] or int[]. (This is awesome, in case you were wondering)
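For example, a minimal sketch of what the Manifest context bound buys you:
// The Manifest is what allows new Array[T](n) inside a generic method;
// for T = Double this allocates a primitive double[] rather than an Object[].
def zeros[T : Manifest](n: Int): Array[T] = new Array[T](n)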
Boxing is so important from a performance perspective because it creates garbage, unless all of your ints are < 127. It also, obviously, adds a level of indirection in terms of extra process steps/method calls etc. But consider that you probably don't give a hoot unless you are absolutely positively sure that you definitely do (i.e. most code does not need such micro-optimization)
So, back to the question: in order to do this with no boxing/unboxing, you must use Array (List is not specialized yet, and would be more object-hungry anyway, even if it were!). The zipped function on a pair of collections will return a collection of Tuple2s (which will not require boxing, as this is specialized).
In order to do this generically (i.e. across various numeric types) you must require a context bound on your generic parameter that it is Numeric and that a Manifest can be found (required for array creation). So I started along the lines of...
def abs[T : Numeric : Manifest](rs : Array[T], ims : Array[T]) : Array[T] = {
  import math._
  val num = implicitly[Numeric[T]]
  (rs, ims).zipped.map { (r, i) => sqrt(num.plus(num.times(r, r), num.times(i, i))) }
  // ^^^^ no sqrt function for Numeric
}
...but it doesn't quite work. The reason is that a "generic" Numeric value does not have an operation like sqrt, so you could only do this at the point where you know you have a Double. For example:
scala> def almostAbs[T : Manifest : Numeric](rs : Array[T], ims : Array[T]) : Array[T] = {
| import math._
| val num = implicitly[Numeric[T]]
| (rs, ims).zipped.map { (r, i) => num.plus(num.times(r,r), num.times(i,i)) }
| }
almostAbs: [T](rs: Array[T],ims: Array[T])(implicit evidence$1: Manifest[T],implicit evidence$2: Numeric[T])Array[T]
Excellent - now see this purely generic method do some stuff!
scala> val rs = Array(1.2, 3.4, 5.6); val is = Array(6.5, 4.3, 2.1)
rs: Array[Double] = Array(1.2, 3.4, 5.6)
is: Array[Double] = Array(6.5, 4.3, 2.1)
scala> almostAbs(rs, is)
res0: Array[Double] = Array(43.69, 30.049999999999997, 35.769999999999996)
Now we can sqrt the result, because we have an Array[Double]
scala> res0.map(math.sqrt(_))
res1: Array[Double] = Array(6.609841147864296, 5.481788029466298, 5.980802621722272)
And to prove that this would work even with another Numeric type:
scala> import math._
import math._
scala> val rs = Array(BigDecimal(1.2), BigDecimal(3.4), BigDecimal(5.6)); val is = Array(BigDecimal(6.5), BigDecimal(4.3), BigDecimal(2.1))
rs: Array[scala.math.BigDecimal] = Array(1.2, 3.4, 5.6)
is: Array[scala.math.BigDecimal] = Array(6.5, 4.3, 2.1)
scala> almostAbs(rs, is)
res6: Array[scala.math.BigDecimal] = Array(43.69, 30.05, 35.77)
scala> res6.map(d => math.sqrt(d.toDouble))
res7: Array[Double] = Array(6.609841147864296, 5.481788029466299, 5.9808026217222725)
Use zip and map:
scala> val reals = List(1.0, 2.0, 3.0)
reals: List[Double] = List(1.0, 2.0, 3.0)
scala> val imags = List(1.5, 2.5, 3.5)
imags: List[Double] = List(1.5, 2.5, 3.5)
scala> reals zip imags
res0: List[(Double, Double)] = List((1.0,1.5), (2.0,2.5), (3.0,3.5))
scala> (reals zip imags).map {z => math.sqrt(z._1*z._1 + z._2*z._2)}
res2: List[Double] = List(1.8027756377319946, 3.2015621187164243, 4.6097722286464435)
scala> def abs(reals: List[Double], imags: List[Double]): List[Double] =
| (reals zip imags).map {z => math.sqrt(z._1*z._1 + z._2*z._2)}
abs: (reals: List[Double],imags: List[Double])List[Double]
scala> abs(reals, imags)
res3: List[Double] = List(1.8027756377319946, 3.2015621187164243, 4.6097722286464435)
UPDATE
It is better to use zipped because it avoids creating a temporary collection:
scala> def abs(reals: List[Double], imags: List[Double]): List[Double] =
| (reals, imags).zipped.map {(x, y) => math.sqrt(x*x + y*y)}
abs: (reals: List[Double],imags: List[Double])List[Double]
scala> abs(reals, imags)
res7: List[Double] = List(1.8027756377319946, 3.2015621187164243, 4.6097722286464435)
There isn't an easy way in Java to create generic numeric computational code; the libraries aren't there, as you can see from oxbow's answer. Collections are also designed to take arbitrary types, which means that there's an overhead in working with primitives with them. So the fastest code (without careful bounds checking) is either:
def abs(re: Array[Double], im: Array[Double]) = {
  val a = new Array[Double](re.length)
  var i = 0
  while (i < a.length) {
    a(i) = math.sqrt(re(i)*re(i) + im(i)*im(i))
    i += 1
  }
  a
}
or, tail-recursively:
def abs(re: Array[Double], im: Array[Double]) = {
  def recurse(a: Array[Double], i: Int = 0): Array[Double] = {
    if (i < a.length) {
      a(i) = math.sqrt(re(i)*re(i) + im(i)*im(i))
      recurse(a, i+1)
    }
    else a
  }
  recurse(new Array[Double](re.length))
}
So, unfortunately, this code ends up not looking super-nice; the niceness comes once you package it in a handy complex number array library.
If it turns out that you don't actually need highly efficient code, then
def abs(re: Array[Double], im: Array[Double]) = {
  (re, im).zipped.map((i, j) => math.sqrt(i*i + j*j))
}
will do the trick compactly and conceptually clearly (once you understand how zipped works). The penalty in my hands is that this is about 2x slower. (Using List makes it 7x slower than while or tail recursion in my hands; List with zip makes it 20x slower; generics with arrays are 3x slower even without computing the square root.)
(Edit: fixed timings to reflect a more typical use case.)
After Edit:
OK, I have got what I wanted working. It takes two Lists of any type of number and returns an Array of Doubles.
def abs[A](r: List[A], im: List[A])(implicit numeric: Numeric[A]): Array[Double] = {
  val t = new Array[Double](r.length)
  for (i <- r.indices) {
    t(i) = math.sqrt(numeric.toDouble(r(i)) * numeric.toDouble(r(i)) +
                     numeric.toDouble(im(i)) * numeric.toDouble(im(i)))
  }
  t
}