How to merge and aggregate 2 Maps in Scala most efficiently?

I have the following 2 maps:
val map12: Map[(String,String),Double] = Map(("Sam","0203") -> 16216.0, ("Jam","0157") -> 50756.0, ("Pam","0129") -> 3052.0)
val map22: Map[(String,String),Double] = Map(("Jam","0157") -> 16145.0, ("Pam","0129") -> 15258.0, ("Sam","0203") -> -1638.0, ("Dam","0088") -> -8440.0, ("Ham","0104") -> 4130.0, ("Hari","0268") -> -108.0, ("Om","0169") -> 5486.0, ("Shiv","0181") -> 275.0, ("Brahma","0148") -> 18739.0)
In the first approach I am using foldLeft to achieve the merging and accumulation:
val t1 = System.nanoTime()
val merged1 = (map12 foldLeft map22)((acc, entry) => acc + (entry._1 -> (entry._2 + acc.getOrElse(entry._1, 0.0))))
val t2 = System.nanoTime()
println(" First Time taken :"+ (t2-t1))
In the second approach I am trying to use aggregate() function which supports parallel operation:
def merge(map12: Map[(String,String),Double], map22: Map[(String,String),Double]): Map[(String,String),Double] =
  map12 ++ map22.map { case (k, v) => k -> (v + map12.getOrElse(k, 0.0)) }
val inArr= Array(map12,map22)
val t5 = System.nanoTime()
val mergedNew12 = inArr.par.aggregate(Map[(String,String),Double]())(merge,merge)
val t6 = System.nanoTime()
println(" Second Time taken :"+ (t6-t5))
But I notice that foldLeft is much faster than aggregate.
I am looking for advice on how to make this operation as efficient as possible.

If you want aggregate to be more efficient when run with par, try Vector instead of Array; it is one of the best collections for parallel algorithms.
On the other hand, parallelism has some overhead, so with too little data it will not pay off.
With the data you gave us, Vector.par.aggregate is faster than Array.par.aggregate, but the sequential Vector.aggregate is faster than foldLeft.
val inVector= Vector(map12,map22)
val t7 = System.nanoTime()
val mergedNew12_2 = inVector.aggregate(Map[(String,String),Double]())(merge,merge)
val t8 = System.nanoTime()
println(" Third Time taken :"+ (t8-t7))
These are my times:
First Time taken : 6431723
Second Time taken : 147474028
Third Time taken : 4855489
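As an aside, single System.nanoTime() readings on the JVM are noisy (JIT warm-up, GC pauses), so treat these numbers as rough. A minimal sketch of a warmed-up, averaged measurement; timeAvg and the run counts are my own illustration, not part of the question:
def timeAvg[R](runs: Int, warmup: Int)(block: => R): Double = {
  (1 to warmup).foreach(_ => block) // let the JIT compile the hot path first
  val t0 = System.nanoTime()
  (1 to runs).foreach(_ => block)
  (System.nanoTime() - t0).toDouble / runs // mean nanoseconds per run
}

val avgFold = timeAvg(runs = 100, warmup = 10) {
  (map12 foldLeft map22)((acc, entry) =>
    acc + (entry._1 -> (entry._2 + acc.getOrElse(entry._1, 0.0))))
}
println("foldLeft average ns: " + avgFold)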

Related

How to properly measure elapsed time in Spark?

I have my code written in Spark and Scala. Now I need to measure the elapsed time of particular functions in the code.
Should I use spark.time like this? But then how can I properly assign the value of df?
val df = spark.time(myObject.retrieveData(spark, indices))
Or should I do it in this way?
def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block // call-by-name
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0) + "ns")
  result
}
val df = time{myObject.retrieveData(spark, indices)}
Update:
As recommended in the comments, I ran df.rdd.count inside myObject.retrieveData in order to materialise the DataFrame.
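For completeness, a minimal sketch of how the helper and the count-based materialisation fit together; retrieveData here is a simplified stand-in for myObject.retrieveData(spark, indices):
import org.apache.spark.sql.{DataFrame, SparkSession}

def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block // call-by-name
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0) + "ns")
  result
}

// Simplified stand-in for myObject.retrieveData(spark, indices)
def retrieveData(spark: SparkSession): DataFrame = {
  val df = spark.range(1000000).toDF("id")
  df.rdd.count() // force materialisation so the timing is meaningful
  df
}

val spark = SparkSession.builder().master("local[*]").appName("timing").getOrCreate()
val df = time { retrieveData(spark) } // df holds the DataFrame; elapsed time is printed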

How to filter a sorted RDD by taking top N rows

I have two key-value pair RDDs, A and B, that I work with. Let's say that B has 10000 rows and I have sorted B by its values:
B = B0.map(_.swap).sortByKey().map(_.swap)
I need to take the top 5000 from B and use that to join with A. I know I could do:
B1 = B.take(5000)
or
B1 = B.zipWithIndex().filter(_._2 < 5000).map(_._1)
It seems that both will trigger computation. Since B1 is just an intermediate result, I would like to have it not trigger real computation. Is there a better way to achieve that?
As far as I know, there is no other way to achieve that using RDDs alone. But you can leverage a DataFrame to achieve the same.
First, convert your RDD to a DataFrame.
Then limit the DataFrame to the top 5000 rows.
Then you can pick the new RDD from the DataFrame.
Up to this point no computation will be triggered by Spark.
Below is a sample proof of concept.
def main(arg: Array[String]): Unit = {
  import spark.implicits._

  val a =
    Array(
      Array("key_1", "value_1"),
      Array("key_2", "value_2"),
      Array("key_3", "value_3"),
      Array("key_4", "value_4"),
      Array("key_5", "value_5")
    )

  val rdd = spark.sparkContext.makeRDD(a)
  val df = rdd.map({
    case Array(key, value) => PairRdd(key, value)
  }).toDF()

  val dfWithTop = df.limit(3)
  val rddWithTop = dfWithTop.rdd
  // up to this point no computation has been triggered
  // rddWithTop.take(100) will trigger computation
}

case class PairRdd(key: String, value: String)

Word count (frequency) with Spark RDD in Scala

If I have an RDD across the cluster and I want to do a word count,
I don't want just the number of appearances;
I want the frequency, which is defined as count / total count.
What is the best and most efficient way to do this in Scala?
How can I do the reduction job and calculate the total count at the same time, within one workflow?
BTW, I know a pure word count can be done this way:
text_file = spark.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
But what is the difference if I use aggregate, in terms of the Spark job workflow?
val result = pairs
  .aggregate(Map[String, Int]())(
    (acc, pair) =>
      if (acc.contains(pair._1))
        acc ++ Map[String, Int]((pair._1, acc(pair._1) + 1))
      else
        acc ++ Map[String, Int]((pair._1, pair._2)),
    (a, b) =>
      (a.toSeq ++ b.toSeq)
        .groupBy(_._1)
        .mapValues(_.map(_._2).reduce(_ + _))
  )
You can use this:
val total = counts.map(x => x._2).sum()
val freq = counts.map(x => (x._1, x._2/total))
There also exists the concept of an Accumulator, which is a write-only variable from the executors' point of view; you could use it to avoid the separate sum() action, but your code would need a lot of changes.
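A minimal sketch of the accumulator variant, assuming a SparkContext sc is in scope; note that accumulator updates inside transformations can be over-counted on task retries, so treat this as illustrative:
val totalAcc = sc.longAccumulator("totalWords")

val counts = sc.textFile("hdfs://...")
  .flatMap(_.split(" "))
  .map { word => totalAcc.add(1); (word, 1) } // tally the total while mapping
  .reduceByKey(_ + _)
  .cache()

counts.count() // one action materialises counts and fills the accumulator
val total = totalAcc.value.toDouble
val freq = counts.mapValues(_ / total)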

Efficient countByValue of each column Spark Streaming

I want to find the countByValue of each column in my data. I can find countByValue() for each column (e.g. 2 columns for now) on a basic batch RDD as follows:
scala> val double = sc.textFile("double.csv")
scala> val counts = sc.parallelize((0 to 1).map(index => {
         double.map(x => {
           val token = x.split(",")
           math.round(token(index).toDouble)
         }).countByValue()
       }))
scala> counts.take(2)
res20: Array[scala.collection.Map[Long,Long]] = Array(Map(2 -> 5, 1 -> 5), Map(4 -> 5, 5 -> 5))
Now I want to perform the same with DStreams. I have a windowedDStream and want to countByValue on each column. My data has 50 columns. I have done it as follows:
val windowedDStream = myDStream.window(Seconds(2), Seconds(2)).cache()
ssc.sparkContext.parallelize((0 to 49).map(index => {
  val counts = windowedDStream.map(x => {
    val token = x.split(",")
    math.round(token(index).toDouble)
  }).countByValue()
  counts.print()
}))
val topCounts = counts.map . . . . will not work
I get correct results with this; the only issue is that I want to apply more operations on counts, and it is not available outside the map.
You misunderstand what parallelize does. You seem to think that when you give it a Seq of two elements, those two elements will be calculated in parallel. That is not the case, and it would be impossible for it to be the case.
What parallelize actually does is create an RDD from the Seq that you provide.
To try to illuminate this, consider that this:
val countsRDD = sc.parallelize((0 to 1).map { index =>
  double.map { x =>
    val token = x.split(",")
    math.round(token(index).toDouble)
  }.countByValue()
})
is equivalent to this:
val counts = (0 to 1).map { index =>
  double.map { x =>
    val token = x.split(",")
    math.round(token(index).toDouble)
  }.countByValue()
}
val countsRDD = sc.parallelize(counts)
By the time parallelize runs, the work has already been performed; parallelize cannot retroactively make the calculation happen in parallel.
The solution to your problem is to not use parallelize here. It is entirely pointless.
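A minimal sketch of that for the batch case; topCounts is a hypothetical follow-up, standing in for the extra operations the question wants to apply:
// counts is a plain local Seq of Maps; no parallelize involved
val counts = (0 to 1).map { index =>
  double.map { x =>
    val token = x.split(",")
    math.round(token(index).toDouble)
  }.countByValue()
}

// Further work is now ordinary Scala collection code, e.g. the
// top 2 values per column by count (hypothetical follow-up):
val topCounts = counts.map(_.toSeq.sortBy(-_._2).take(2))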

How to dynamically generate parallel futures with for-yield

I have the code below:
val f1 = Future(genA1)
val f2 = Future(genA2)
val f3 = Future(genA3)
val f4 = Future(genA4)
val results: Future[Seq[A]] = for {
  a1 <- f1
  a2 <- f2
  a3 <- f3
  a4 <- f4
} yield Seq(a1, a2, a3, a4)
Now I have a requirement to optionally exclude a2; how should I modify the code? (A solution with map or flatMap is also acceptable.)
Furthermore, say I have M possible futures to be aggregated as above, and N of the M could be optionally excluded based on some flag (business logic); how should I handle that?
thanks in advance!
Leon
In question 1, I understand that you want to exclude one entry (e.g. a2) from the sequence given some logic, and in question 2, you want to suppress N entries from a total of M and have the future computed on the remaining results. We can generalize both cases to something like this:
// Using a map as simple example, but 'generators' could be a function that creates the required computation
val generators = Map('a' -> genA1, 'b' -> genA2, 'c' -> genA3, 'd' -> genA4)
...
// shouldAccept(k) => Business logic to decide which computations should be executed.
val selectedGenerators = generators.filter{case (k,v) => shouldAccept(k)}
// Create Seq[Future] from the selected computations
val futures = selectedGenerators.map{case (k,v) => Future(v)}
// Create Future[Seq[_]] to have the result of computing all entries.
val result = Future.sequence(futures)
In general, what I think you are looking for is Future.sequence, which takes a Seq[Future[_]] and produces a Future[Seq[_]], which is basically what you are doing "by hand" with the for-comprehension.
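Putting it together, a self-contained sketch; genA1 through genA4 and shouldAccept are stand-ins for the question's generators and business logic:
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Stand-in generators, wrapped as thunks so nothing runs until a Future is created
def genA1 = 1; def genA2 = 2; def genA3 = 3; def genA4 = 4

// Hypothetical business logic: exclude entry 'b' (i.e. genA2)
def shouldAccept(k: Char): Boolean = k != 'b'

val generators: Map[Char, () => Int] =
  Map('a' -> (() => genA1), 'b' -> (() => genA2), 'c' -> (() => genA3), 'd' -> (() => genA4))

val futures: Seq[Future[Int]] =
  generators.collect { case (k, gen) if shouldAccept(k) => Future(gen()) }.toSeq

val results: Future[Seq[Int]] = Future.sequence(futures)
println(Await.result(results, 5.seconds)) // e.g. List(1, 3, 4)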