Why is parallel aggregation not faster in spark? - scala

As the last question that SO suggests is related to mine is from 2011 I ask anew..
I was trying to prove that aggregating over a parallelized Spark array would be faster than over a normal array (all on a Dell XPS with 4 cores).
import org.apache.spark.{SparkConf, SparkContext}
object SparkStuffer extends App {
val appName: String = "My Spark Stuffer"
val master: String = "local"
val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)
// Returns '4'
println("Available processors: " + Runtime.getRuntime().availableProcessors())
val data = 1 to 100000000
val distData = sc.parallelize(data)
val sequData = data
val parallelIni = java.lang.System.currentTimeMillis();
distData.reduce((a, b) => a+b)
val parallelFin = java.lang.System.currentTimeMillis();
val seqIni = java.lang.System.currentTimeMillis();
sequData.reduce((a, b) => a+b)
val seqFin = java.lang.System.currentTimeMillis();
println("Par: " + (parallelFin - parallelIni))
println("Seq: " + (seqFin - seqIni))
// Par: 3262
// Seq: 1099
}
For reference I'm adding the build.sbt:
name := "spark_stuff"
version := "0.1"
scalaVersion := "2.12.12"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.0"
Why is the parallel aggregation not faster and if not here, what Example would show that it is faster?

There is a misconception here, first your step distData.reduce((a, b) => a+b) is doing two things here. One is distributing the data, and second is processing the data. Not only processing as you are expecting.
Spark framework has two steps during the execution of a block of code, the transformation and the action. The transformation step is when Spark is just preparing the backend with what need to be done, checking if the data exists, if does make sense what you are doing and etc. That is what happening here: sc.parallelize(data). In this moment your code is not parallelizing anything, is just preparing to parallelize, the parallelization happens when you run distData.reduce((a, b) => a+b) this is an action and after that it process the data.
I've run the same example in my cluster, and here are few results that you can use as reference:
Here we got the execution just like your code:
And here is with a small change, forcing the be parallelized before the reduce, to remove the overhead of the distribution using this code:
val data = 1 to 100000000
val distData = sc.parallelize(data)
distData.count()
distData.reduce((a, b) => a+b)
And here is the result of how fast it rans:
But, we need to know that not always a distributed algorithm will beat the iterative algorithm mostly for the overhead. Your dataset is pretty small and is mostly built in memory. So, a distributed code will beat the sequential only in a certain size of the data. What is the size? It depends. But the conclusion is, the parallel execution is slower in this case because the action step parallelize then executes the reduce.

Related

Spark, applying filters on DataFrame(or RDD) multiple times without redundant evaluations

I have a Spark DataFrame that needs heavy evaluations for the chaining of the parent RDDs.
val df: DataFrame[(String, Any)] = someMethodCalculatingDF()
val out1 = df.filter(_._1 == "Key1").map(_._2).collect()
val out2 = df.filter(_._1 == "Key2").map(_._2)
out1 is a very small data ( one or two Rows in each partition) and collected for further use.
out2 is a Dataframe and will be used to generate another RDD that will be materialized later.
So, df will be evaluated twice, which is heavy.
Caching could be a solution, but in my application, it wont be, because the data could be really really BIG. The memory would be overflowed.
Is there any genius :) who could suggest another way bypassing the redundant evaluations?
It's actually a scenario which occurs in our cluster on a daily basis. From our experience this methodology work for us the best.
When we need to use same calculated dataframe twice(on different branches) we do as follows:
Calculation phase is heavy and resulted in a rather small dataframe -> cache it.
Calculation phase is light resulted in a big dataframe -> let it calculate twice.
Calculation is heavy resulted in a big data frame -> write it to disk(HDFS or S3) split the job on splitting point to two different batch processing job. In this you don't repeat the heavy calculation and you don't shred your cache(which will either way probably use the disk).
Calculation phase is light resulting in a small Dataframe. Your life is good and you can go home :).
I'm not familiar with dataset API so will write a solution using RDD api.
val rdd: RDD[(String, Int)] = ???
//First way
val both: Map[String, Iterable[Int]] = rdd.filter(e => e._1 == "Key1" || e._1 == "Key2")
.groupByKey().collectAsMap()
//Second way
val smallCached = rdd.filter(e => e._1 == "Key1" || e._1 == "Key2").cache()
val out1 = smallCached.filter(_._1 == "Key1").map(_._2).collect()
val out2 = smallCached.filter(_._1 == "Key2").map(_._2).collect()

Calculating Requests Per Minute From Timetstamps in RDD during mapping

I am currently trying to enrich data for machine learning with requests per minute. The data is stored in a Kafka topic and on application start the whole content of the topic is loaded and processed - therefore it is not possible to use any window operations of spark streaming to my knowledge, as all data will arrive at the same time.
My approach was to try the following:
val kMeansFeatureRdd = kMeansInformationRdd.map(x => {
val begin = x._2 //Long - unix timestamp millis
val duration = x._3 //Long
val rpm = kMeansInformationRdd.filter(y => (x._2 - 60000 <= y._2 && x._2 >= y._2)).count()
(duration, rpm)
})
However on this approach I get the following exception:
org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
Is there a way to achieve what I want to do?
If you need any more information just drop me a comment and I will update what you need.
EDIT:
Broadcasting an RDD does not work. Broadcasting the collected Array does not result in an acceptable performance.
What will be executed but is horribly slow and therefore not really an option:
val collected = kMeansInformationRdd.collect()
val kMeansFeatureRdd = kMeansInformationRdd.map(x => {
val begin = x._2 //Long - unix timestamp millis
val duration = x._3 //Long
val rpm = collected.filter(y => (x._2 - 60000 <= y._2 && x._2 >= y._2)).size
(duration, rpm)
})
UPDATE:
This code is at least able to get the job done way faster - but as far as I see it still gets slower the higher the requests per minute are as the filtered array grows - interesting enough it gets slower towards the end what I cannot figure out why. If someone sees the issue - or performance issues that could be generally improved - I would be happy if you let me know.
kMeansInformationRdd = kMeansInformationRdd.cache()
kMeansInformationRdd.sortBy(_._2, true)
var kMeansFeatureArray: Array[(String, Long, Long)] = Array()
var buffer: collection.mutable.Map[String, Array[Long]] = collection.mutable.Map()
var counter = 0
kMeansInformationRdd.collect.foreach(x => {
val ts = x._2
val identifier = x._1 //make sure the identifier represents actually the entity that receives the traffic -e.g. machine (IP?) not only endpoint
var bufferInstance = buffer.get(identifier).getOrElse(Array[Long]())
bufferInstance = bufferInstance ++ Array(ts)
bufferInstance = bufferInstance.filter(p => p > ts-1000)
buffer.put(identifier, bufferInstance)
val rpm = bufferInstance.size.toLong
kMeansFeatureArray = kMeansFeatureArray ++ Array((identifier, x._3, rpm)) //identifier, duration, rpm
counter = counter +1
if(counter % 10000==0){
println(counter)
println((identifier, x._3, rpm))
println((instanceSizeBefore, instanceSizeAfter))
}
})
val kMeansFeatureRdd = sc.parallelize(kMeansFeatureArray)
The code that is given in the EDIT section is not correct. It is not the correct way a variable is broadcasted in Spark. The correct way is as follows:
val collected = sc.broadcast(kMeansInformationRdd.collect())
val kMeansFeatureRdd = kMeansInformationRdd.map(x => {
val begin = x._2 //Long - unix timestamp millis
val duration = x._3 //Long
val rpm = collected.value.filter(y => (x._2 - 60000 <= y._2 && x._2 >= y._2)).size
(duration, rpm)
})
Of course, you can use global variables as well instead of sc.broadcast, but that is not recommended. Why?
The reason is that the difference between using an external variable DIRECTLY(as my so called "global variable"), and BROADCASTING a variable using sc.broadcast() is:
When using the external variable directly, spark will send a copy of the serialized variable together with each TASK. Whereas by sc.broadcast, the variable is sent one copy per EXECUTOR. The number of Task is normally 10 times larger than the Executor. So when the variable (say an array) is large enough (more than 10-20K), the former operation may cost a lot time on network transformation and cause frequent GC, which slows the spark down. Hence large variable(>10-20K) is suggested to be broadcasted explicitly.
When using the external variable directly the variable is not persisted, it ends with the task and thus can not be reused. Whereas by sc.broadcast() the variable is auto-persisted in the executors' memory, it lasts until you explicitly unpersist it. Thus sc.broadcast variable is available across tasks and stages.
So if the variable is expected to be used multiple times, sc.broadcast() is suggested.

Get all possible combinations of 3 values from n possible elements

I'm trying to get the list of all the possible combinations of 3 elements from a list of 30 items. I tried to use the following code, but it fails throwing an OutOfMemoryError. Is there any alternative approach which is more efficient than this?
val items = sqlContext.table(SOURCE_DB + "." + SOURCE_TABLE).
select("item_id").distinct.cache
val items.take(1) // Compute cache before join
val itemCombinations = items.select($"item_id".alias("id_A")).
join(
items.select($"item_id".alias("id_B")), $"id_A".lt($"id_B")).
join(
items.select($"item_id".alias("id_C")), $"id_B".lt($"id_C"))
The approach seems OK but might generate quite some overhead at the query execution level. Give that n is a fairly small number, we could do it using the Scala implementation directly:
val localItems = items.collect
val combinations = localItems.combinations(3)
The result is an iterator that can be consumed one element at the time, without significant memory overhead.
Spark Version (edit)
Given the desire to make a Spark version of this, it could be possible to avoid the query planner (assuming that the issue is there), by dropping to RDD level. This is basically the same expression as the join in the question:
val items = sqlContext.table(SOURCE_DB + "." + SOURCE_TABLE).select("item_id").rdd
val combinations = items.cartesian(items).filter{case(x,y) => x<y}.cartesian(items).filter{case ((x,y),z) => y<z}
Running an equivalent code in my local machine:
val data = List.fill(1000)(scala.util.Random.nextInt(999))
val rdd = sparkContext.parallelize(data)
val combinations = rdd.cartesian(rdd).filter{case(x,y) => x<y}.cartesian(rdd).filter{case ((x,y),z) => y<z}
combinations.count
// res5: Long = 165623528

Spark program performance - GC & Task Deserialization & Concurrent execution

I have a cluster of 4 machines, 1 master and three workers, each with 128G memory and 64 cores. I'm using Spark 1.5.0 in stand alone mode. My program reads data from Oracle tables using JDBC, then does ETL, manipulating data, and does machine learning tasks like k-means.
I have a DataFrame (myDF.cache()) which is join results with two other DataFrames, and cached. The DataFrame contains 27 million rows and the size of data is around 1.5G. I need to filter the data and calculate 24 histogram as follows:
val h1 = myDF.filter("pmod(idx, 24) = 0").select("col1").histogram(arrBucket)
val h2 = myDF.filter("pmod(idx, 24) = 1").select("col1").histogram(arrBucket)
// ......
val h24 = myDF.filter("pmod(idx, 24) = 23").select("col1").histogram(arrBucket)
Problems:
Since my DataFrame is cached, I expect the filter, select, and histogram is very fast. However, the actual time is about 7 seconds for each calculation, which is not acceptable. From UI, it show the GC time takes 5 seconds and Task Deserialization Time 4 seconds. I've tried different JVM parameters but cannot improve further. Right now I'm using
-Xms25G -Xmx25G -XX:MaxPermSize=512m -XX:+UseG1GC -XX:MaxGCPauseMillis=200 \
-XX:ParallelGCThreads=32 \
-XX:ConcGCThreads=8 -XX:InitiatingHeapOccupancyPercent=70
What puzzles me is that the size of data is nothing compared with available memory. Why does GC kick in every time filter/select/histogram running? Is there any way to reduce the GC time and Task Deserialization Time?
I have to do parallel computing for h[1-24], instead of sequential. I tried Future, something like:
import scala.concurrent.{Await, Future, blocking}
import scala.concurrent.ExecutionContext.Implicits.global
val f1 = Future{myDF.filter("pmod(idx, 24) = 1").count}
val f2 = Future{myDF.filter("pmod(idx, 24) = 2").count}
val f3 = Future{myDF.filter("pmod(idx, 24) = 3").count}
val future = for {c1 <- f1; c2 <- f2; c3 <- f3} yield {
c1 + c2 + c3
}
val summ = Await.result(future, 180 second)
The problem is that here Future only means jobs are submitted to the scheduler near-simultaneously, not that they end up being scheduled and run simultaneously. Future used here doesn't improve performance at all.
How to make the 24 computation jobs run simultaneously?
A couple of things you can try:
Don't compute pmod(idx, 24) all over again. Instead you can simply compute it once:
import org.apache.spark.sql.functions.{pmod, lit}
val myDfWithBuckets = myDF.withColumn("bucket", pmod($"idx", lit(24)))
Use SQLContext.cacheTable instead of cache. It stores table using compressed columnar storage which can be used to access only required columns and as stated in the Spark SQL and DataFrame Guide "will automatically tune compression to minimize memory usage and GC pressure".
myDfWithBuckets.registerTempTable("myDfWithBuckets")
sqlContext.cacheTable("myDfWithBuckets")
If you can, cache only the columns you actually need instead of projecting each time.
It is not clear for me what is the source of a histogram method (do you convert to RDD[Double] and use DoubleRDDFunctions.histogram?) and what is the argument but if you want to compute all histograms at the same time you can try to groupBy bucket and apply histogram once for example using histogram_numeric UDF:
import org.apache.spark.sql.functions.callUDF
val n: Int = ???
myDfWithBuckets
.groupBy($"bucket")
.agg(callUDF("histogram_numeric", $"col1", lit(n)))
If you use predefined ranges you can obtain a similar effect using custom UDF.
Notes
how to extract values computed by histogram_numeric? First lets create a small helper
import org.apache.spark.sql.Row
def extractBuckets(xs: Seq[Row]): Seq[(Double, Double)] =
xs.map(x => (x.getDouble(0), x.getDouble(1)))
now we can map using pattern matching as follows:
import org.apache.spark.rdd.RDD
val histogramsRDD: RDD[(Int, Seq[(Double, Double)])] = histograms.map{
case Row(k: Int, hs: Seq[Row #unchecked]) => (k, extractBuckets(hs)) }

Joining process with broadcast variable ends up endless spilling

I am joining two RDDs from text files in standalone mode. One has 400 million (9 GB) rows, and the other has 4 million (110 KB).
3-grams doc1 3-grams doc2
ion - 100772C111 ion - 200772C222
on - 100772C111 gon - 200772C222
n - 100772C111 n - 200772C222
... - .... ... - ....
ion - 3332145654 on - 58898874
mju - 3332145654 mju - 58898874
... - .... ... - ....
In each file, doc numbers (doc1 or doc2) appear one under the other. And as a result of join I would like to get a number of common 3-grams between the docs.e.g.
(100772C111-200772C222,2) --> There two common 3-grams which are 'ion' and ' n'
The server on which I run my code has 128 GB RAM and 24 cores. I set my IntelliJ configurations - VM options part with -Xmx64G
Here is my code for this:
val conf = new SparkConf().setAppName("abdulhay").setMaster("local[4]").set("spark.shuffle.spill", "true")
.set("spark.shuffle.memoryFraction", "0.6").set("spark.storage.memoryFraction", "0.4")
.set("spark.executor.memory","40g")
.set("spark.driver.memory","40g")
val sc = new SparkContext(conf)
val emp = sc.textFile("\\doc1.txt").map(line => (line.split("\t")(3),line.split("\t")(1))).distinct()
val emp_new = sc.textFile("\\doc2.txt").map(line => (line.split("\t")(3),line.split("\t")(1))).distinct()
val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)
val joined = emp.mapPartitions(iter => for {
(k, v1) <- iter
v2 <- emp_newBC.value.getOrElse(k, Iterable())
} yield (s"$v1-$v2", 1))
val olsun = joined.reduceByKey((a,b) => a+b)
olsun.map(x => x._1 + "\t" + x._2).saveAsTextFile("...\\out.txt")
So as seen, during join process using broadcast variable my key values change. So it seems I need to repartition the joined values? And it is highly expensive. As a result, i ended up too much spilling issue, and it never ended. I think 128 GB memory must be sufficient. As far as I understood, when broadcast variable is used shuffling is being decreased significantly? So what is wrong with my application?
Thanks in advance.
EDIT:
I have also tried spark's join function as below:
var joinRDD = emp.join(emp_new);
val kkk = joinRDD.map(line => (line._2,1)).reduceByKey((a, b) => a + b)
again ending up too much spilling.
EDIT2:
val conf = new SparkConf().setAppName("abdulhay").setMaster("local[12]").set("spark.shuffle.spill", "true")
.set("spark.shuffle.memoryFraction", "0.4").set("spark.storage.memoryFraction", "0.6")
.set("spark.executor.memory","50g")
.set("spark.driver.memory","50g")
val sc = new SparkContext(conf)
val emp = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\wos14.txt").map{line => val s = line.split("\t"); (s(5),s(0))}//.distinct()
val emp_new = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\fwo_word.txt").map{line => val s = line.split("\t"); (s(3),s(1))}//.distinct()
val cog = emp_new.cogroup(emp)
val skk = cog.flatMap {
case (key: String, (l1: Iterable[String], l2: Iterable[String])) =>
(l1.toSeq ++ l2.toSeq).combinations(2).map { case Seq(x, y) => if (x < y) ((x, y),1) else ((y, x),1) }.toList
}
val com = skk.countByKey()
I would not use broadcast variables. When you say:
val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)
Spark is first moving the ENTIRE dataset into the master node, a huge bottleneck and prone to produce memory errors on the master node. Then this piece of memory is shuffled back to ALL nodes (lots of network overhead), bound to produce memory issues there too.
Instead, join the RDDs themselves using join (see description here)
Figure out also if you have too few keys. For joining Spark basically needs to load the entire key into memory, and if your keys are too few that might still be too big a partition for any given executor.
A separate note: reduceByKey will repartition anyway.
EDIT: ---------------------
Ok, given the clarifications, and assuming that the number of 3-grams per doc# is not too big, this is what I would do:
Key both files by 3-gram to get (3-gram, doc#) tuples.
cogroup both RDDs, that gets you the 3gram key and 2 lists of doc#
Process those in a single scala function, output a set of all unique permutations of (doc-pairs).
then do coutByKey or countByKeyAprox to get a count of the number of distinct 3-grams for each doc pair.
Note: you can skip the .distinct() calls with this one. Also, you should not split every line twice. Change line => (line.split("\t")(3),line.split("\t")(1))) for line => { val s = line.split("\t"); (s(3),s(1)))
EDIT 2:
You also seem to be tuning your memory badly. For instance, using .set("spark.shuffle.memoryFraction", "0.4").set("spark.storage.memoryFraction", "0.6") leaves basically no memory for task execution (since they add up to 1.0). I should have seen that sooner but was focused on the problem itself.
Check the tunning guides here and here.
Also, if you are running it on a single machine, you might try with a single, huge executor (or even ditch Spark completely), as you don't need overhead of a distributed processing platform (and distributed hardware failure tolerance, etc).