How to find the sum at each partition in Spark (Scala)

I have created a class and used that class to create an RDD. I want to calculate the sum of LoudnessRate (a member of the class) at each partition. This sum will later be used to calculate the mean LoudnessRate at each partition.
I have tried the following code but it does not calculate the sum and returns 0.0.
My code is:
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object sparkBAT {
  def main(args: Array[String]): Unit = {
    val numPartitions = 3
    val N = 50
    val d = 5
    val MinVal = -10
    val MaxVal = 10
    val conf = new SparkConf().setMaster("local").setAppName("spark Sum")
    val sc = new SparkContext(conf)
    val ba = List.fill(N)(new BAT(d, MinVal, MaxVal))
    val rdd = sc.parallelize(ba, numPartitions)
    var arrSum = Array.fill(numPartitions)(0.0) // Array that should hold the sum for each partition
    rdd.mapPartitionsWithIndex((k, iterator) => iterator.map(x => arrSum(k) += x.LoudnessRate)).collect()
    arrSum foreach println
  }
}
class BAT(dim: Int, min: Double, max: Double) extends Serializable {
  val random = new Random()
  var position: List[Double] = List.fill(dim)(random.nextDouble() * (max - min) + min)
  var velocity: List[Double] = List.fill(dim)(math.random)
  var PulseRate: Double = 0.1
  var LoudnessRate: Double = 0.95
  var frequency: Double = math.random
  var fitness: Double = math.random
  var BestPosition: List[Double] = List.fill(dim)(math.random)
  var BestFitness: Double = math.random
}

Changing my comment to an answer as requested. Original comment:
You are modifying arrSum in the executor JVMs and printing its values in the driver JVM. You can map the iterators to singleton iterators and use collect to move the values to the driver. Also, don't use iterator.map for side effects; iterator.foreach is meant for that.
And here is a sample snippet showing how to do it. First we create an RDD with two partitions, 0 -> 1,2,3 and 1 -> 4,5. Naturally you would not need this in actual code, but since the sc.parallelize behaviour changes depending on the environment, this will always create uniform RDDs to reproduce:
import org.apache.spark.Partitioner

object DemoPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int = key match {
    case num: Int => num
  }
}

val rdd = sc
  .parallelize(Seq((0, 1), (0, 2), (0, 3), (1, 4), (1, 5)))
  .partitionBy(DemoPartitioner)
  .map(_._2)
And then the actual trick:
val sumsByPartition = rdd.mapPartitionsWithIndex {
  case (partitionNum, it) => Iterator.single(partitionNum -> it.sum)
}.collect().toMap
println(sumsByPartition)
Outputs:
Map(0 -> 6, 1 -> 9)
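Applied to your BAT RDD, the same trick would give the per-partition sums directly; a rough sketch reusing your rdd and LoudnessRate (untested):

val loudnessSumsByPartition = rdd.mapPartitionsWithIndex {
  case (partitionNum, it) => Iterator.single(partitionNum -> it.map(_.LoudnessRate).sum)
}.collect().toMap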

The problem is that you're using arrSum (a regular collection) that is declared in your driver and updated in the executors. Whenever you do that you need to use accumulators.
This should help.
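A minimal sketch of the accumulator approach (untested), reusing sc, rdd and numPartitions from your code: one named DoubleAccumulator per partition, updated on the executors inside foreachPartition and read back on the driver.

import org.apache.spark.TaskContext

// One accumulator per partition; executor-side updates are merged back on the driver.
val accs = Array.tabulate(numPartitions)(i => sc.doubleAccumulator(s"partition-$i"))

rdd.foreachPartition { it =>
  val k = TaskContext.getPartitionId() // index of the partition this task processes
  it.foreach(x => accs(k).add(x.LoudnessRate))
}

accs.map(_.value) foreach println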


spark scala percentile_approx with weights

How can I compute the 15th and 50th percentiles of the column students, taking the occ column into consideration, without using array_repeat and avoiding the explosion? I have a huge input DataFrame and the explosion blows out the memory.
My DF is:
name | occ | students
aaa  | 1   | 1
aaa  | 3   | 7
aaa  | 6   | 11
...
For example, if I consider students and occ as both being arrays, then to compute the 50th percentile of the array students while taking occ into consideration, I would normally compute like this:
val students = Array(1,7,11)
val occ = Array(1,3,6)
it gives:
val student_repeated = Array(1,7,7,7,11,11,11,11,11,11)
then student_50th would be the 50th percentile of student_repeated => 11.
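For intuition, the same result can be computed without materialising the repeated array by walking cumulative weights; a plain-Scala sketch (not the Spark solution, the helper name is mine):

def weightedPercentile(values: Array[Int], weights: Array[Int], p: Double): Int = {
  // sort values with their weights, then find where the cumulative weight first reaches p * totalWeight
  val sorted = values.zip(weights).sortBy(_._1)
  val target = p * weights.map(_.toLong).sum
  val cumulative = sorted.scanLeft(0L) { case (acc, (_, w)) => acc + w }.tail
  sorted.zip(cumulative)
    .find { case (_, cum) => cum >= target }
    .map(_._1._1)
    .getOrElse(sorted.last._1)
}

weightedPercentile(Array(1, 7, 11), Array(1, 3, 6), 0.50) // 11
weightedPercentile(Array(1, 7, 11), Array(1, 3, 6), 0.15) // 7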
My current code:
import org.apache.spark.sql.functions._
import spark.implicits._

val inputDF = Seq(
  ("aaa", 1, 1),
  ("aaa", 3, 7),
  ("aaa", 6, 11),
).toDF("name", "occ", "student")

// Solution 1
inputDF
  .withColumn("student", array_repeat(col("student"), col("occ")))
  .withColumn("student", explode(col("student")))
  .groupBy("name")
  .agg(
    percentile_approx(col("student"), lit(0.5), lit(10000)).alias("student_50"),
    percentile_approx(col("student"), lit(0.15), lit(10000)).alias("student_15"),
  )
  .show(false)
which outputs:
+----+----------+----------+
|name|student_50|student_15|
+----+----------+----------+
|aaa |11 |7 |
+----+----------+----------+
EDIT:
I am looking for a Scala equivalent of this solution:
https://stackoverflow.com/a/58309977/4450090
EDIT 2:
I am proceeding with sketches-java:
https://github.com/DataDog/sketches-java
I have decided to use DDSketch, which has an accept method that allows the sketch to be updated.
"com.datadoghq" % "sketches-java" % "0.8.2"
First, I initialize an empty sketch. Then I accept pairs of values (value, weight). After that, I call the DDSketch method getValueAtQuantile.
I execute all of this as a Spark Scala Aggregator.
class DDSInitAgg(pct: Double, accuracy: Double) extends Aggregator[ValueWithWeigth, SketchData, Double] {
  private val precision: String = "%.6f"

  override def zero: SketchData = DDSUtils.sketchToTuple(DDSketches.unboundedDense(accuracy))

  override def reduce(b: SketchData, a: ValueWithWeigth): SketchData = {
    val s = DDSUtils.sketchFromTuple(b)
    s.accept(a.value, a.weight)
    DDSUtils.sketchToTuple(s)
  }

  override def merge(b1: SketchData, b2: SketchData): SketchData = {
    val s1: DDSketch = DDSUtils.sketchFromTuple(b1)
    val s2: DDSketch = DDSUtils.sketchFromTuple(b2)
    s1.mergeWith(s2)
    DDSUtils.sketchToTuple(s1)
  }

  override def finish(reduction: SketchData): Double = {
    val percentile: Double = DDSUtils.sketchFromTuple(reduction).getValueAtQuantile(pct)
    precision.format(percentile).toDouble
  }

  override def bufferEncoder: Encoder[SketchData] = ExpressionEncoder()
  override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
You can execute it as a udaf taking two columns as the input.
Additionally, I developed methods for encoding/decoding back and forth between DDSketch <---> Array[Byte]:
case class SketchData(backingArray: Array[Byte], numWrittenBytes: Int)

object DDSUtils {
  val emptySketch: DDSketch = DDSketches.unboundedDense(0.01)
  val supplierStore: Supplier[Store] = () => new UnboundedSizeDenseStore()

  def sketchToTuple(s: DDSketch): SketchData = {
    val o = GrowingByteArrayOutput.withDefaultInitialCapacity()
    s.encode(o, false)
    SketchData(o.backingArray(), o.numWrittenBytes())
  }

  def sketchFromTuple(sketchData: SketchData): DDSketch = {
    val i: ByteArrayInput = ByteArrayInput.wrap(sketchData.backingArray, 0, sketchData.numWrittenBytes)
    DDSketch.decode(i, supplierStore)
  }
}
This is how I call it as a udaf:
val ddsInitAgg50UDAF: UserDefinedFunction = udaf(new DDSInitAgg(0.50, 0.50), ExpressionEncoder[ValueWithWeigth])
and finally, in the aggregation:
ddsInitAgg50UDAF(col("weigthCol"), col("valueCol")).alias("value_pct_50")
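For completeness, wiring the udaf into an aggregation would look roughly like this (the DataFrame and column names here are illustrative, not from my actual job):

val result = df
  .groupBy("name")
  .agg(ddsInitAgg50UDAF(col("weigthCol"), col("valueCol")).alias("value_pct_50"))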

How to implement Levenshtein Distance Algorithm in Scala

I have a text file which contains information about senders and messages, in the format sender,messages. I want to use the Levenshtein distance algorithm with a threshold of 70% and store the similar messages in a Map. In the Map, my key is String and my value is List[String].
For example, my requirement is: if my messages are abc, bcd, cdf, then:
step 1: First I should add the message 'abc' to the List: map.put("Group1", List("abc"))
step 2: Next, I should compare 'bcd' (the 2nd message) with 'abc' (the 1st message). If they meet the 70% threshold, I should add 'bcd' to the List. Now 'abc' and 'bcd' are both under the same key, 'Group1'.
step 3: Now I should get all the elements from the Map. Currently Group1 has only 2 values (abc, bcd); next, compare the current message 'cdf' with 'abc' or 'bcd' (as 'abc' and 'bcd' are similar, comparing with any one of them is enough).
step 4: If it does not meet the threshold, I should create a new key "Group2", add that message to its List, and so on.
The 70% threshold means, for example:
message1: Dear customer! your mobile number 9032412236 has been successfully recharged with INR 500.00
message2: Dear customer! your mobile number 7999610201 has been successfully recharged with INR 500.00
Here, the Levenshtein distance between these two is 8. We can check this here: https://planetcalc.com/1721/
8 edits need to be done; 8 characters do not match out of (message1.length + message2.length) / 2.
If I assume the first message is 100 characters and the second message is 100 characters, then the average length is 100; out of 100, 8 characters do not match, which means the similarity level is 92%. So here, I should keep the threshold at 70%.
If the Levenshtein similarity is at least 70%, then take the messages as similar.
I'm using the below library:
libraryDependencies += "info.debatty" % "java-string-similarity" % "2.0.0"
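For illustration, the similarity check described above could be written with this library roughly like this (the helper name is mine):

import info.debatty.java.stringsimilarity.Levenshtein

val lev = new Levenshtein()

// similarity = 1 - distance relative to the longer string
def similarity(a: String, b: String): Double =
  1.0 - lev.distance(a, b) / math.max(a.length, b.length)

similarity(
  "Dear customer! your mobile number 9032412236 has been successfully recharged with INR 500.00",
  "Dear customer! your mobile number 7999610201 has been successfully recharged with INR 500.00"
) // roughly 0.91, which is above the 0.70 threshold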
My code:
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer
object Demo {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setMaster("local").setAppName("My App")
val sc = new SparkContext(conf)
val inputFile = "D:\\MyData.txt"
val data = sc.textFile(inputFile)
val data2 = data.map(line => {
val arr = line.split(","); (arr(0), arr(1))
})
val grpData = data2.groupByKey()
val myMap = scala.collection.mutable.Map.empty[String, List[String]]
for (values <- grpData.values.collect) {
val list = ListBuffer[String]()
for (value <- values) {
println(values)
if (myMap.isEmpty) {
list += value
myMap.put("G1", list.toList)
} else {
val currentMsg = value
val valuePartOnly = myMap.valuesIterator.toString()
for (messages <- valuePartOnly) {
def levenshteinDistance(currentMsg: String, messages: String) = {
???//TODO: Implement distance
}
}
}
}
}
}
}
After the else part, I'm not sure how to start with this algorithm.
I do not have any output sample, so I've explained it step by step.
Please check from step 1 to step 4.
Thanks.
I'm not really certain about the following code and I have not tried it, but I hope it demonstrates the idea:
import org.apache.spark.{SparkConf, SparkContext}
import info.debatty.java.stringsimilarity.Levenshtein

object Demo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("My App")
    val distance: Levenshtein = new Levenshtein() // object for calculating the distance

    def levenshteinDistance(left: String, right: String): Double = {
      // I'm not really certain about this: how would you like to calculate relative distance?
      // Relative to the string with the max size, the min size, left or right?
      distance.distance(left, right) / Math.max(left.size, right.size)
    }

    val sc = new SparkContext(conf)
    val inputFile = "D:\\MyData.txt"
    val data = sc.textFile(inputFile)
    val data2 = data.map(line => {
      val arr = line.split(","); (arr(0), arr(1))
    })
    val grpData = data2.groupByKey()
    val messages = scala.collection.mutable.Map.empty[String, List[String]]
    var group = 1
    for (values <- grpData.values.collect) {
      for (value <- values) {
        if (messages.isEmpty) {
          messages.put(s"G$group", List(value))
        } else {
          val currentMsg = value
          // at least 70% similar <=> relative distance of at most 0.3
          val groupKey = messages.find {
            case (_, groupMsgs) => groupMsgs.forall(message => levenshteinDistance(currentMsg, message) <= 0.3)
          }.map(_._1).getOrElse {
            group += 1
            s"G$group"
          }
          messages.put(groupKey, messages.getOrElse(groupKey, List.empty[String]) :+ currentMsg)
        }
      }
    }
  }
}

How to merge RDD tuples

I want to use reduceByKey to merge many tuples with the same key.
Here is the code:
import breeze.linalg.DenseMatrix
import scala.util.Random

val data = Array(DenseMatrix((2.0, 1.0, 5.0), (4.0, 3.0, 6.0)),
                 DenseMatrix((7.0, 8.0, 9.0), (10.0, 12.0, 11.0)))
val init = sc.parallelize(data, 2)

// getColumn
def getColumn(v: DenseMatrix[Double]): Map[Int, IndexedSeq[(Int, Double)]] = {
  val r = Random
  val index = 0 to v.size - 1
  def func(x: Int, y: DenseMatrix[Double]): (Int, (Int, Double)) = {
    (x, (r.nextInt(10), y.valueAt(x)))
  }
  val rest = index.map { x => func(x, v) }.groupBy(x => x._1).mapValues(x => x.map(_._2))
  rest
}

val out = init.flatMap { v => getColumn(v) }
val reduceOutput = out.reduceByKey(_ ++ _)
val out2 = out.map { case (k, v) => k }.collect() // keys here are not what I want
Here are two pictures: the first one shows the [key, value] pairs I thought I would get, and the second one shows the real keys. They are not what I want, so the output is not right.
What should I do?
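(For reference, reduceByKey(_ ++ _) concatenates the collection values that share a key; a minimal standalone sketch of that merge with made-up values:)

val pairs = sc.parallelize(Seq(
  (0, IndexedSeq((3, 2.0))),
  (1, IndexedSeq((7, 4.0))),
  (0, IndexedSeq((5, 5.0)))
))
pairs.reduceByKey(_ ++ _).collect()
// e.g. Array((0, Vector((3,2.0), (5,5.0))), (1, Vector((7,4.0)))) -- order may vary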

How to time scala map functions in an aggregative fashion?

I am having a hard time coming up with a solution to time individual functions in a Scala map operation. Let's say I have the following line in my code:
val foo = data.map(d => func1(d).func2())
where data is a Seq of N elements. How would I go about timing how long my program has executed func1 in total and func2 in total? Since it is a map operation, these functions will be called N times, so each time record should be added to a cumulative time record.
How can I do this without breaking the Scala syntax?
NB: I want to end up with totalTime_inFunc1 and totalTime_inFunc2.
Let's say func2() returns YourType. Then you need to return a tuple (YourType, Long, Long) from the function inside map, where the second tuple element is the execution time of func1 and the third element is the execution time of func2. After that, you can easily get the total execution times from the seq of tuples using sum:
val fooWithTime = {
  data.map(d => {
    def now = System.currentTimeMillis
    val beforeFunc1 = now
    val func1Result = func1(d)
    val func1Time = now - beforeFunc1
    val beforeFunc2 = now
    val result = func1Result.func2()
    (result, func1Time, now - beforeFunc2)
  })
}
val foo = fooWithTime.map(_._1)
val totalTimeFunc1 = fooWithTime.map(_._2).sum
val totalTimeFunc2 = fooWithTime.map(_._3).sum
Also, you can easily use your preferred method of calculating execution time instead of System.currentTimeMillis().
Look at closures. You declare your functions so that they refer to variables in the enclosing scope, pass them to map, and have them increment those variables.
Edit
code with closures
object closure {
  var time1 = 0L
  var time2 = 0L

  def time[R](block: => R)(time: Int): R = {
    val t0 = System.nanoTime()
    val result = block // call-by-name
    val t1 = System.nanoTime()
    if (time == 1)
      time1 += t1 - t0
    else
      time2 += t1 - t0
    result
  }

  def fun1(i: Int): Int = {
    time { i + 1 }(1)
  }

  def fun2(i: Int): Int = {
    time { i + 2 }(2)
  }
}
val data = List(1,2,3,4,5,6,7,1,2,3,4,5,6,7,1,2,3,4,5,6,7,1,2,3,4,5,6,7,1,2,3,4,5,6,7)
val foo = data.map(d => closure.fun2(closure.fun1(d)))
closure.time1 // res4: Long = 22976
closure.time2 // res5: Long = 25438
Edit 2

object closure {
  var time1 = 0L
  var time2 = 0L

  def time[R](block: => R)(time: Int): R = {
    val t0 = System.nanoTime()
    val result = block // call-by-name
    val t1 = System.nanoTime()
    if (time == 1)
      time1 += t1 - t0
    else
      time2 += t1 - t0
    result
  }
}

val data = List(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7)
// Wrap your own func1/func2 at the call site instead of inside their definitions:
val foo = data.map { d =>
  val r1 = closure.time { func1(d) }(1)
  closure.time { r1.func2() }(2)
}
val s = Seq(1, 2, 3)
val (mappedSeq, totalTime) = s.map(x => {
  // call your map methods here and time them
  // x is the mapped element
  val timing = 5.5
  // then return the tuple with the mapped element and the time taken for the map function
  (x, timing)
}).foldLeft((Seq.empty[Int], 0d))((accumulator, pair) => (accumulator._1 :+ pair._1, accumulator._2 + pair._2))

println(totalTime)
println(mappedSeq.mkString(", "))

Spark column wise word count

We are trying to generate column-wise statistics of our dataset in Spark. In addition to using the summary function from the statistics library, we use the following procedure:
1. We determine the columns with string values.
2. We generate a key-value pair for the whole dataset, using the column number as the key and the column's value as the value.
3. We generate a new map of the format (K, V) -> ((K, V), 1).
Then we use reduceByKey to find the count of each unique value in each column. We cache this output to reduce further computation time.
In the next step we cycle through the columns with a for loop to find the statistics for all the columns.
We are trying to replace the for loop by again using a map-reduce approach, but we are unable to find a way to achieve it. Doing so would allow us to generate the column statistics for all columns in one execution. The for-loop method runs sequentially, making it very slow.
Code:
// drops the header
def dropHeader(data: RDD[String]): RDD[String] = {
  data.mapPartitionsWithIndex((idx, lines) => {
    if (idx == 0) lines.drop(1) else lines
  })
}

def retAtrTuple(x: String) = {
  val newX = x.split(",")
  for (h <- 0 until newX.length)
    yield (h, newX(h))
}

val line = sc.textFile("hdfs://.../myfile.csv")
val withoutHeader: RDD[String] = dropHeader(line)
val kvPairs = withoutHeader.flatMap(retAtrTuple) // key-value pairs where the key is the column number and the value is the column's value

// column indexes as key and a boolean as value (true for numeric and false for string type); isNumeric is defined elsewhere in our code
var bool_numeric_col = kvPairs.map { case (x, y) => (x, isNumeric(y)) }.reduceByKey(_ && _).sortByKey()
var str_cols = bool_numeric_col.filter { case (x, y) => y == false }.map { case (x, y) => x }
var num_cols = bool_numeric_col.filter { case (x, y) => y == true }.map { case (x, y) => x }

var str_col = str_cols.toArray // array containing the string columns
var num_col = num_cols.toArray // array containing the numeric columns

val colCount = kvPairs.map((_, 1)).reduceByKey(_ + _)
val e1 = colCount.map { case ((x, y), z) => (x, (y, z)) }
var numPairs = e1.filter { case (x, (y, z)) => str_col.contains(x) }

// The for loop below needs to be parallelized/optimized, as it operates on each column sequentially.
// The idea is to find the top 10, bottom 10 and the number of distinct elements column-wise.
for (i <- str_col) {
  var total = numPairs.filter { case (x, (y, z)) => x == i }.sortBy(_._2._2)
  var leastOnes = total.take(10)
  println("leastOnes for Col" + i)
  leastOnes.foreach(println)
  var maxOnes = total.sortBy(-_._2._2).take(10)
  println("maxOnes for Col" + i)
  maxOnes.foreach(println)
  println("distinct for Col" + i + " is " + total.count)
}
Let me simplify your question a bit. (A lot actually.) We have an RDD[(Int, String)] and we want to find the top 10 most common Strings for each Int (which are all in the 0–100 range).
Instead of sorting, as in your example, it is more efficient to use the Spark built-in RDD.top(n) method. Its run-time is linear in the size of the data, and requires moving much less data around than a sort.
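For a single column, a rough sketch of what that looks like (words standing in for one column's RDD[String]):

// Hypothetical single-column version: count the words, then take the 10 most frequent.
val top10: Array[(Long, String)] = words
  .map(w => (w, 1L))
  .reduceByKey(_ + _)
  .map { case (w, c) => (c, w) } // put the count first so the default tuple ordering sorts by count
  .top(10)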
Consider the implementation of top in RDD.scala. You want to do the same, but with one priority queue (heap) per Int key. The code becomes fairly complex:
import org.apache.spark.util.BoundedPriorityQueue // Pretend it's not private.

def top(n: Int, rdd: RDD[(Int, String)]): Map[Int, Iterable[String]] = {
  // A heap that only keeps the top N values, so it has bounded size.
  type Heap = BoundedPriorityQueue[(Long, String)]
  // Get the word counts.
  val counts: RDD[[(Int, String), Long)] =
    rdd.map(_ -> 1L).reduceByKey(_ + _)
  // In each partition create a column -> heap map.
  val perPartition: RDD[Map[Int, Heap]] =
    counts.mapPartitions { items =>
      val heaps =
        collection.mutable.Map[Int, Heap].withDefault(i => new Heap(n))
      for (((k, v), count) <- items) {
        heaps(k) += count -> v
      }
      Iterator.single(heaps)
    }
  // Merge the per-partition heap maps into one.
  val merged: Map[Int, Heap] =
    perPartition.reduce { (heaps1, heaps2) =>
      val heaps =
        collection.mutable.Map[Int, Heap].withDefault(i => new Heap(n))
      for ((k, heap) <- heaps1.toSeq ++ heaps2.toSeq) {
        for (cv <- heap) {
          heaps(k) += cv
        }
      }
      heaps
    }
  // Discard counts, return just the top strings.
  merged.mapValues(_.map { case (count, value) => value })
}
This is efficient, but made painful because we need to work with multiple columns at the same time. It would be way easier to have one RDD per column and just call rdd.top(10) on each.
Unfortunately the naive way to split up the RDD into N smaller RDDs does N passes:
def split(together: RDD[(Int, String)], columns: Int): Seq[RDD[String]] = {
  together.cache() // We will make N passes over this RDD.
  (0 until columns).map {
    i => together.filter { case (key, value) => key == i }.values
  }
}
A more efficient solution could be to write out the data into separate files by key, then load it back into separate RDDs. This is discussed in Write to multiple outputs by key Spark - one Spark job.
Thanks for @Daniel Darabos's answer. But there are some mistakes:
mixed use of Map and collection.mutable.Map
withDefault((i: Int) => new Heap(n)) does not create a new Heap when you do heaps(k) += count -> v
mixed usage of parentheses
Here is the modified code:
// import org.apache.spark.util.BoundedPriorityQueue // Pretend it's not private: copy it to your own package and import it from there.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object BoundedPriorityQueueTest {

  // https://stackoverflow.com/questions/28166190/spark-column-wise-word-count
  def top(n: Int, rdd: RDD[(Int, String)]): Map[Int, Iterable[String]] = {
    // A heap that only keeps the top N values, so it has bounded size.
    type Heap = BoundedPriorityQueue[(Long, String)]

    // Get the word counts.
    val counts: RDD[((Int, String), Long)] =
      rdd.map(_ -> 1L).reduceByKey(_ + _)

    // In each partition create a column -> heap map.
    val perPartition: RDD[collection.mutable.Map[Int, Heap]] =
      counts.mapPartitions { items =>
        val heaps =
          collection.mutable.Map[Int, Heap]() // .withDefault((i: Int) => new Heap(n))
        for (((k, v), count) <- items) {
          println("\n---")
          println("before add " + ((k, v), count) + ", the map is: ")
          println(heaps)
          if (!heaps.contains(k)) {
            println("not contains key " + k)
            heaps(k) = new Heap(n)
            println(heaps)
          }
          heaps(k) += count -> v
          println("after add " + ((k, v), count) + ", the map is: ")
          println(heaps)
        }
        println(heaps)
        Iterator.single(heaps)
      }

    // Merge the per-partition heap maps into one.
    val merged: collection.mutable.Map[Int, Heap] =
      perPartition.reduce { (heaps1, heaps2) =>
        val heaps =
          collection.mutable.Map[Int, Heap]() // .withDefault((i: Int) => new Heap(n))
        println(heaps)
        for ((k, heap) <- heaps1.toSeq ++ heaps2.toSeq) {
          // create the heap for this key if it does not exist yet (same fix as in mapPartitions above)
          if (!heaps.contains(k)) {
            heaps(k) = new Heap(n)
          }
          for (cv <- heap) {
            heaps(k) += cv
          }
        }
        heaps
      }

    // Discard counts, return just the top strings.
    merged.mapValues(_.map { case (count, value) => value }).toMap
  }

  def main(args: Array[String]): Unit = {
    Logger.getRootLogger().setLevel(Level.FATAL) // http://stackoverflow.com/questions/27781187/how-to-stop-messages-displaying-on-spark-console
    val conf = new SparkConf().setAppName("word count").setMaster("local[1]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN") // http://stackoverflow.com/questions/27781187/how-to-stop-messages-displaying-on-spark-console
    val words = sc.parallelize(List((1, "s11"), (1, "s11"), (1, "s12"), (1, "s13"), (2, "s21"), (2, "s22"), (2, "s22"), (2, "s23")))
    println("# words:" + words.count())
    val result = top(1, words)
    println("\n--result:")
    println(result)
    sc.stop()
    print("DONE")
  }
}
}