Spark program performance - GC & Task Deserialization & Concurrent execution - scala

I have a cluster of 4 machines, 1 master and 3 workers, each with 128G memory and 64 cores. I'm using Spark 1.5.0 in standalone mode. My program reads data from Oracle tables using JDBC, then does ETL and data manipulation, and runs machine learning tasks like k-means.
I have a cached DataFrame (myDF.cache()) that is the result of joining two other DataFrames. It contains 27 million rows and is around 1.5G in size. I need to filter the data and calculate 24 histograms as follows:
val h1 = myDF.filter("pmod(idx, 24) = 0").select("col1").histogram(arrBucket)
val h2 = myDF.filter("pmod(idx, 24) = 1").select("col1").histogram(arrBucket)
// ......
val h24 = myDF.filter("pmod(idx, 24) = 23").select("col1").histogram(arrBucket)
Problems:
Since my DataFrame is cached, I expected the filter, select, and histogram steps to be very fast. However, each calculation takes about 7 seconds, which is not acceptable. The UI shows that GC time takes 5 seconds and Task Deserialization Time 4 seconds. I've tried different JVM parameters but cannot improve it further. Right now I'm using
-Xms25G -Xmx25G -XX:MaxPermSize=512m -XX:+UseG1GC -XX:MaxGCPauseMillis=200 \
-XX:ParallelGCThreads=32 \
-XX:ConcGCThreads=8 -XX:InitiatingHeapOccupancyPercent=70
What puzzles me is that the data is tiny compared with the available memory. Why does GC kick in every time filter/select/histogram runs? Is there any way to reduce the GC time and Task Deserialization Time?
I have to compute h[1-24] in parallel instead of sequentially. I tried Future, something like:
import scala.concurrent.{Await, Future, blocking}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._ // needed for `180 second` below
val f1 = Future{myDF.filter("pmod(idx, 24) = 1").count}
val f2 = Future{myDF.filter("pmod(idx, 24) = 2").count}
val f3 = Future{myDF.filter("pmod(idx, 24) = 3").count}
val future = for {c1 <- f1; c2 <- f2; c3 <- f3} yield {
c1 + c2 + c3
}
val summ = Await.result(future, 180 second)
The problem is that Future here only means the jobs are submitted to the scheduler near-simultaneously, not that they are scheduled and run simultaneously. Using Future this way doesn't improve performance at all.
How can I make the 24 computation jobs run simultaneously?

A couple of things you can try:
Don't compute pmod(idx, 24) all over again. Instead you can simply compute it once:
import org.apache.spark.sql.functions.{pmod, lit}
val myDfWithBuckets = myDF.withColumn("bucket", pmod($"idx", lit(24)))
Use SQLContext.cacheTable instead of cache. It stores the table using compressed columnar storage, which allows reading only the required columns and, as stated in the Spark SQL and DataFrame Guide, "will automatically tune compression to minimize memory usage and GC pressure".
myDfWithBuckets.registerTempTable("myDfWithBuckets")
sqlContext.cacheTable("myDfWithBuckets")
If you can, cache only the columns you actually need instead of projecting each time.
It is not clear to me where the histogram method comes from (do you convert to RDD[Double] and use DoubleRDDFunctions.histogram?) or what its argument is, but if you want to compute all histograms at the same time you can try to groupBy bucket and apply histogram once, for example using the histogram_numeric UDF:
import org.apache.spark.sql.functions.callUDF
val n: Int = ???
myDfWithBuckets
.groupBy($"bucket")
.agg(callUDF("histogram_numeric", $"col1", lit(n)))
If you use predefined ranges you can obtain a similar effect using a custom UDF.
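As a rough illustration of the predefined-ranges idea, here is a minimal sketch in plain Scala collections (no Spark required) of the logic such a custom UDF or aggregation could apply: compute pmod(idx, 24) once per row, find the bin for the value, and count everything in a single pass. The names HistogramSketch, binIndex and histograms are hypothetical, and the bin convention (half-open bins, last bin closed) is an assumption that may not match your arrBucket semantics.

```scala
object HistogramSketch {
  // Index of the bin that x falls into, given sorted bin boundaries `edges`.
  // Bins are half-open [lo, hi), except the last one, which is closed [lo, hi].
  // Returns None when x is NaN or lies outside [edges.head, edges.last].
  def binIndex(edges: Array[Double], x: Double): Option[Int] =
    if (edges.length < 2 || x.isNaN || x < edges.head || x > edges.last) None
    else if (x == edges.last) Some(edges.length - 2)
    else Some(edges.lastIndexWhere(_ <= x))

  // Single pass over (idx, value) rows: returns bucket -> counts per bin,
  // where bucket is the non-negative remainder of idx modulo `mod`.
  def histograms(rows: Seq[(Long, Double)],
                 edges: Array[Double],
                 mod: Int): Map[Int, Array[Long]] = {
    val acc = scala.collection.mutable.Map.empty[Int, Array[Long]]
    for ((idx, value) <- rows; bin <- binIndex(edges, value)) {
      val bucket = (((idx % mod) + mod) % mod).toInt
      val counts = acc.getOrElseUpdate(bucket, new Array[Long](edges.length - 1))
      counts(bin) += 1
    }
    acc.toMap
  }
}
```

In Spark the same shape would be a groupBy on the precomputed bucket column followed by one aggregation, instead of 24 separate filter jobs.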
Notes
How to extract the values computed by histogram_numeric? First let's create a small helper:
import org.apache.spark.sql.Row
def extractBuckets(xs: Seq[Row]): Seq[(Double, Double)] =
xs.map(x => (x.getDouble(0), x.getDouble(1)))
Now we can map using pattern matching as follows:
import org.apache.spark.rdd.RDD
val histogramsRDD: RDD[(Int, Seq[(Double, Double)])] = histograms.map {
  case Row(k: Int, hs: Seq[Row @unchecked]) => (k, extractBuckets(hs))
}

Related

Why is parallel aggregation not faster in spark?

Since the last question SO suggests as related to mine is from 2011, I'm asking anew.
I was trying to show that aggregating over a parallelized Spark array would be faster than over a normal array (all on a Dell XPS with 4 cores).
import org.apache.spark.{SparkConf, SparkContext}
object SparkStuffer extends App {
val appName: String = "My Spark Stuffer"
val master: String = "local"
val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)
// Returns '4'
println("Available processors: " + Runtime.getRuntime().availableProcessors())
val data = 1 to 100000000
val distData = sc.parallelize(data)
val sequData = data
val parallelIni = java.lang.System.currentTimeMillis();
distData.reduce((a, b) => a+b)
val parallelFin = java.lang.System.currentTimeMillis();
val seqIni = java.lang.System.currentTimeMillis();
sequData.reduce((a, b) => a+b)
val seqFin = java.lang.System.currentTimeMillis();
println("Par: " + (parallelFin - parallelIni))
println("Seq: " + (seqFin - seqIni))
// Par: 3262
// Seq: 1099
}
For reference I'm adding the build.sbt:
name := "spark_stuff"
version := "0.1"
scalaVersion := "2.12.12"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.0"
Why is the parallel aggregation not faster, and if not here, what example would show that it is faster?
There is a misconception here. Your step distData.reduce((a, b) => a+b) is doing two things: first it distributes the data, then it processes it. It is not only processing, as you are expecting.
Spark has two kinds of steps when executing a block of code: transformations and actions. A transformation is where Spark only prepares what needs to be done, checking that the data exists, that what you are doing makes sense, and so on. That is what happens at sc.parallelize(data). At that moment your code is not parallelizing anything, it is just preparing to parallelize. The parallelization happens when you run distData.reduce((a, b) => a+b), which is an action, and only after that is the data processed.
I've run the same example on my cluster, and here are a few results that you can use as reference:
Here is the execution with your code as-is:
And here it is with a small change, forcing the data to be parallelized before the reduce to remove the distribution overhead, using this code:
val data = 1 to 100000000
val distData = sc.parallelize(data)
distData.count()
distData.reduce((a, b) => a+b)
And here is the result showing how fast it ran:
However, a distributed algorithm will not always beat a sequential one, mostly because of the overhead. Your dataset is pretty small and is mostly built in memory, so distributed code will beat sequential code only above a certain data size. What is that size? It depends. The conclusion is that the parallel execution is slower in this case because the action step first parallelizes the data and then executes the reduce.
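To see where the time goes, it can help to time each step on its own rather than one combined block. Below is a minimal sketch of a timing helper in plain Scala; TimingSketch and timed are hypothetical names, and in the Spark program you would wrap sc.parallelize(data), the count, and the reduce in separate timed calls. Two side notes that are assumptions about the question's setup rather than part of the answer above: setMaster("local") runs Spark with a single worker thread, whereas "local[*]" uses all available cores, and summing 1 to 100000000 as Int overflows, so the sketch widens to Long.

```scala
object TimingSketch {
  // Runs `body` once and returns its result together with the
  // elapsed wall-clock time in milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    val n = 1000000
    // Accumulate in a Long: 1 + 2 + ... + n no longer fits in an Int for large n.
    val (sum, ms) = timed((1 to n).foldLeft(0L)(_ + _))
    println(s"sum=$sum took $ms ms")
  }
}
```

A single run like this also includes JVM warm-up, so repeating each measurement a few times gives a fairer comparison.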

How to optimize below spark code (scala)?

I have some huge files (19GB, 40GB, etc.). I need to execute the following algorithm on these files:
1. Read the file
2. Sort it on the basis of 1 column
3. Take the first 70% of the data:
   a) Take all the distinct records of a subset of the columns
   b) Write them to the train file
4. Take the last 30% of the data:
   a) Take all the distinct records of a subset of the columns
   b) Write them to the test file
I tried running the following code in Spark (using Scala).
import scala.collection.mutable.ListBuffer
import java.io.FileWriter
import org.apache.spark.sql.functions.{max, year} // max is used in agg(max("utcDate")) below
val offers = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.option("delimiter", ",")
.load("input.txt")
val csvOuterJoin = offers.orderBy("utcDate")
val trainDF = csvOuterJoin.limit((csvOuterJoin.count*.7).toInt)
val maxTimeTrain = trainDF.agg(max("utcDate"))
val maxtimeStamp = maxTimeTrain.collect()(0).getTimestamp(0)
val testDF = csvOuterJoin.filter(csvOuterJoin("utcDate") > maxtimeStamp)
val inputTrain = trainDF.select("offerIdClicks","userIdClicks","userIdOffers","offerIdOffers").distinct
val inputTest = testDF.select("offerIdClicks","userIdClicks","userIdOffers","offerIdOffers").distinct
inputTrain.rdd.coalesce(1,false).saveAsTextFile("train.csv")
inputTest.rdd.coalesce(1,false).saveAsTextFile("test.csv")
This is how I initiate spark-shell:
./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.4.0 --total-executor-cores 70 --executor-memory 10G --driver-memory 20G
I execute this code on a distributed cluster with 1 master and many slaves each having sufficient amount of RAM. As of now, this code ends up taking a lot of memory and I get java heap space issues.
Is there a way to optimize the above code (preferably in spark)? I appreciate any kind of minimal help in optimizing the above code.
The problem is you don't distribute at all. And the source is here:
val csvOuterJoin = offers.orderBy("utcDate")
val trainDF = csvOuterJoin.limit((csvOuterJoin.count*.7).toInt)
The limit operation is not designed for large-scale use, and it moves all records to a single partition:
val df = spark.range(0, 10000, 1, 1000)
df.rdd.partitions.size
// Int = 1000
// Take all records by limit
df.orderBy($"id").limit(10000).rdd.partitions.size
// Int = 1
You can use RDD API:
val ordered = df.orderBy($"utcDate")
val cnt = df.count * 0.7
val train = spark.createDataFrame(ordered.rdd.zipWithIndex.filter {
case (_, i) => i <= cnt
}.map(_._1), ordered.schema)
val test = spark.createDataFrame(ordered.rdd.zipWithIndex.filter {
case (_, i) => i > cnt
}.map(_._1), ordered.schema)
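The index-based split can be illustrated on an ordinary collection. This is a minimal sketch with hypothetical names (SplitSketch, split, trainPercent), not the code from the answer. Both Seq.zipWithIndex and RDD.zipWithIndex start at index 0, so the sketch compares with a strict i < cutoff to put exactly the first 70% of rows in the train set, and it derives the cutoff with integer arithmetic to avoid floating-point rounding.

```scala
object SplitSketch {
  // Splits an already sorted sequence into the first trainPercent percent
  // of the rows and the remainder, using zero-based positions.
  def split[A](sorted: Seq[A], trainPercent: Int): (Seq[A], Seq[A]) = {
    val cutoff = sorted.size.toLong * trainPercent / 100
    val (train, test) =
      sorted.zipWithIndex.partition { case (_, i) => i < cutoff }
    (train.map(_._1), test.map(_._1))
  }
}
```

With the RDD version, calling zipWithIndex once and reusing (or caching) the result for both filters would avoid computing the index twice.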
coalesce(1,false) merges all data into one partition, i.e. it keeps 40GB of data in the memory of one node.
Never try to get all data into one file with coalesce(1,false).
Instead, just call saveAsTextFile (so the output looks like part-00000, part-00001, etc.) and then merge these partition files outside Spark.
The merge operation depends on your file system. In case of HDFS, you can use http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/FileSystemShell.html#getmerge

Spark UDF called more than once per record when DF has too many columns

I'm using Spark 1.6.1 and encountering strange behaviour: I'm running a UDF with some heavy computations (a physics simulation) on a DataFrame containing some input data, and building up a result DataFrame containing many columns (~40).
Strangely, in this case my UDF is called more than once per record of my input DataFrame (1.6 times more often), which I find unacceptable because it's very expensive. If I reduce the number of columns (e.g. to 20), this behaviour disappears.
I managed to write down a small script which demonstrates this:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.udf
object Demo {
case class Result(a: Double)
def main(args: Array[String]): Unit = {
val sc = new SparkContext(new SparkConf().setAppName("Demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val numRuns = sc.accumulator(0) // to count the number of udf calls
val myUdf = udf((i:Int) => {numRuns.add(1);Result(i.toDouble)})
val data = sc.parallelize((1 to 100), numSlices = 5).toDF("id")
// get results of UDF
var results = data
.withColumn("tmp", myUdf($"id"))
.withColumn("result", $"tmp.a")
// add many columns to dataframe (must depend on the UDF's result)
for (i <- 1 to 42) {
results=results.withColumn(s"col_$i",$"result")
}
// trigger action
val res = results.collect()
println(res.size) // prints 100
println(numRuns.value) // prints 160
}
}
Now, is there a way to solve this without reducing the number of columns?
I can't really explain this behavior - but obviously the query plan somehow chooses a path where some of the records are calculated twice. This means that if we cache the intermediate result (right after applying the UDF) we might be able to "force" Spark not to recompute the UDF. And indeed, once caching is added it behaves as expected - UDF is called exactly 100 times:
// get results of UDF
var results = data
.withColumn("tmp", myUdf($"id"))
.withColumn("result", $"tmp.a").cache()
Of course, caching has its own costs (memory...), but it might end up beneficial in your case if it saves many UDF calls.
We had this same problem about a year ago and spent a lot of time until we finally figured out what the problem was.
We also had a very expensive UDF to calculate, and we found out that it gets calculated again every time we refer to its column. It just happened to us again a few days ago, so I decided to open a bug on this:
SPARK-18748
We also opened a question here then, but now I see the title wasn't so good:
Trying to turn a blob into multiple columns in Spark
I agree with Tzach about somehow "forcing" the plan to calculate the UDF. Our solution was uglier, but we had no choice, because we couldn't cache() the data (it was too big):
val df = data.withColumn("tmp", myUdf($"id"))
val results = sqlContext.createDataFrame(df.rdd, df.schema)
.withColumn("result", $"tmp.a")
update:
Now I see that my JIRA ticket was linked to another one, SPARK-17728, which still doesn't really handle this issue the right way, but it gives one more optional workaround:
val results = data.withColumn("tmp", explode(array(myUdf($"id"))))
.withColumn("result", $"tmp.a")
In newer Spark versions (2.3+) we can mark UDFs as non-deterministic: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/expressions/UserDefinedFunction.html#asNondeterministic():org.apache.spark.sql.expressions.UserDefinedFunction
i.e. use
val myUdf = udf(...).asNondeterministic()
This makes sure the UDF is only called once.

Joining process with broadcast variable ends up endless spilling

I am joining two RDDs from text files in standalone mode. One has 400 million (9 GB) rows, and the other has 4 million (110 KB).
3-grams   doc1          3-grams   doc2
ion   -   100772C111    ion   -   200772C222
on    -   100772C111    gon   -   200772C222
n     -   100772C111    n     -   200772C222
...   -   ....          ...   -   ....
ion   -   3332145654    on    -   58898874
mju   -   3332145654    mju   -   58898874
...   -   ....          ...   -   ....
In each file, the doc numbers (doc1 or doc2) appear one under the other. As a result of the join I would like to get the number of common 3-grams between the docs, e.g.
(100772C111-200772C222,2) --> There are two common 3-grams, which are 'ion' and ' n'
The server on which I run my code has 128 GB RAM and 24 cores. I set my IntelliJ configurations - VM options part with -Xmx64G
Here is my code for this:
val conf = new SparkConf().setAppName("abdulhay").setMaster("local[4]").set("spark.shuffle.spill", "true")
.set("spark.shuffle.memoryFraction", "0.6").set("spark.storage.memoryFraction", "0.4")
.set("spark.executor.memory","40g")
.set("spark.driver.memory","40g")
val sc = new SparkContext(conf)
val emp = sc.textFile("\\doc1.txt").map(line => (line.split("\t")(3),line.split("\t")(1))).distinct()
val emp_new = sc.textFile("\\doc2.txt").map(line => (line.split("\t")(3),line.split("\t")(1))).distinct()
val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)
val joined = emp.mapPartitions(iter => for {
(k, v1) <- iter
v2 <- emp_newBC.value.getOrElse(k, Iterable())
} yield (s"$v1-$v2", 1))
val olsun = joined.reduceByKey((a,b) => a+b)
olsun.map(x => x._1 + "\t" + x._2).saveAsTextFile("...\\out.txt")
As you can see, my key values change during the join with the broadcast variable. So it seems I need to repartition the joined values? That is highly expensive. As a result, I ended up with far too much spilling, and the job never finished. I think 128 GB of memory should be sufficient. As far as I understand, shuffling should decrease significantly when a broadcast variable is used. So what is wrong with my application?
Thanks in advance.
EDIT:
I have also tried spark's join function as below:
var joinRDD = emp.join(emp_new);
val kkk = joinRDD.map(line => (line._2,1)).reduceByKey((a, b) => a + b)
This again ends up with too much spilling.
EDIT2:
val conf = new SparkConf().setAppName("abdulhay").setMaster("local[12]").set("spark.shuffle.spill", "true")
.set("spark.shuffle.memoryFraction", "0.4").set("spark.storage.memoryFraction", "0.6")
.set("spark.executor.memory","50g")
.set("spark.driver.memory","50g")
val sc = new SparkContext(conf)
val emp = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\wos14.txt").map{line => val s = line.split("\t"); (s(5),s(0))}//.distinct()
val emp_new = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\fwo_word.txt").map{line => val s = line.split("\t"); (s(3),s(1))}//.distinct()
val cog = emp_new.cogroup(emp)
val skk = cog.flatMap {
case (key: String, (l1: Iterable[String], l2: Iterable[String])) =>
(l1.toSeq ++ l2.toSeq).combinations(2).map { case Seq(x, y) => if (x < y) ((x, y),1) else ((y, x),1) }.toList
}
val com = skk.countByKey()
I would not use broadcast variables. When you say:
val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)
Spark first moves the ENTIRE dataset to the master node, which is a huge bottleneck and prone to producing memory errors there. Then this chunk of memory is shipped back to ALL nodes (lots of network overhead), which is bound to produce memory issues there too.
Instead, join the RDDs themselves using join (see description here)
Also figure out whether you have too few keys. To join, Spark basically needs to load all records for a given key into memory, and if you have too few keys that might still be too big a partition for any given executor.
A separate note: reduceByKey will repartition anyway.
EDIT: ---------------------
Ok, given the clarifications, and assuming that the number of 3-grams per doc# is not too big, this is what I would do:
Key both files by 3-gram to get (3-gram, doc#) tuples.
cogroup both RDDs, which gets you the 3-gram key and 2 lists of doc#.
Process those in a single Scala function and output the set of all unique doc pairs.
Then do countByKey or countByKeyApprox to get the number of distinct 3-grams for each doc pair.
Note: you can skip the .distinct() calls with this one. Also, you should not split every line twice. Change line => (line.split("\t")(3),line.split("\t")(1)) to line => { val s = line.split("\t"); (s(3),s(1)) }
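The steps above can be sketched on ordinary Scala collections, which makes the per-key logic easy to check before moving it to RDDs. The names CommonTrigrams and commonCounts are hypothetical, and groupBy here only stands in for what cogroup would deliver per key. The sketch pairs each doc# from the first file with each doc# from the second file, which is an assumption about the desired output. Taking combinations(2) over the concatenated lists, as in the question's EDIT2, would also pair documents from the same file with each other.

```scala
object CommonTrigrams {
  // Each input is a collection of (3-gram, docId) pairs.
  // Returns (docId from left, docId from right) -> number of shared 3-grams.
  def commonCounts(left: Seq[(String, String)],
                   right: Seq[(String, String)]): Map[(String, String), Int] = {
    // Distinct doc ids per 3-gram, similar to what cogroup yields for one key.
    val l = left.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).distinct }
    val r = right.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).distinct }
    val pairs = for {
      (gram, docs1) <- l.toSeq
      docs2 <- r.get(gram).toSeq // skip 3-grams missing from the right side
      d1 <- docs1
      d2 <- docs2
    } yield (d1, d2)
    // Each occurrence of a pair corresponds to one shared 3-gram.
    pairs.groupBy(identity).map { case (p, ps) => p -> ps.size }
  }
}
```

On RDDs, the final groupBy would be replaced by emitting ((d1, d2), 1) and calling reduceByKey or countByKey.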
EDIT 2:
You also seem to be tuning your memory badly. For instance, using .set("spark.shuffle.memoryFraction", "0.4").set("spark.storage.memoryFraction", "0.6") leaves basically no memory for task execution (since they add up to 1.0). I should have seen that sooner but was focused on the problem itself.
Check the tuning guides here and here.
Also, if you are running on a single machine, you might try a single, huge executor (or even ditch Spark completely), as you don't need the overhead of a distributed processing platform (distributed hardware failure tolerance, etc.).

GC overhead limit exceeded with large RDD[MatrixEntry] in Apache Spark

I have a CSV file storing user-item data of dimension 6,365x214, and I am computing user-user similarity using columnSimilarities() of org.apache.spark.mllib.linalg.distributed.CoordinateMatrix.
My code looks like this:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.{RowMatrix,
MatrixEntry, CoordinateMatrix}
import org.apache.spark.rdd.RDD
def rddToCoordinateMatrix(input_rdd: RDD[String]) : CoordinateMatrix = {
// Convert RDD[String] to RDD[Tuple3]
val coo_matrix_input: RDD[Tuple3[Long,Long,Double]] = input_rdd.map(
line => line.split(',').toList
).map{
e => (e(0).toLong, e(1).toLong, e(2).toDouble)
}
// Convert RDD[Tuple3] to RDD[MatrixEntry]
val coo_matrix_matrixEntry: RDD[MatrixEntry] = coo_matrix_input.map(e => MatrixEntry(e._1, e._2, e._3))
// Convert RDD[MatrixEntry] to CoordinateMatrix
val coo_matrix: CoordinateMatrix = new CoordinateMatrix(coo_matrix_matrixEntry)
return coo_matrix
}
// Read CSV File to RDD[String]
val input_rdd: RDD[String] = sc.textFile("user_item.csv")
// Read RDD[String] to CoordinateMatrix
val coo_matrix = rddToCoordinateMatrix(input_rdd)
// Transpose CoordinateMatrix
val coo_matrix_trans = coo_matrix.transpose()
// Convert CoordinateMatrix to RowMatrix
val mat: RowMatrix = coo_matrix_trans.toRowMatrix()
// Compute similar columns perfectly, with brute force
// Return CoordinateMatrix
val simsPerfect: CoordinateMatrix = mat.columnSimilarities()
// CoordinateMatrix to RDD[MatrixEntry]
val simsPerfect_entries = simsPerfect.entries
simsPerfect_entries.count()
// Write results to file
val results_rdd = simsPerfect_entries.map(line => line.i+","+line.j+","+line.value)
results_rdd.saveAsTextFile("similarity-output")
// Close the REPL terminal
System.exit(0)
When I run this script in spark-shell, I get the following error after the line simsPerfect_entries.count() runs:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Updated:
I tried many solutions already given by others, but had no success:
1. Increasing the amount of memory to use per executor process: spark.executor.memory=1g
2. Decreasing the number of cores to use for the driver process: spark.driver.cores=1
Please suggest some way to resolve this issue.
All Spark transformations are lazy until you actually materialize them. When you define RDD-to-RDD data manipulations, Spark just chains the operations together without performing the actual computation. So when you call simsPerfect_entries.count(), the whole chain of operations is executed and you get your number.
The error GC overhead limit exceeded means that JVM garbage collector activity was so high that execution of your code was stopped. GC activity can get that high for these reasons:
You produce too many small objects and immediately discard them. It looks like you don't.
Your data does not fit into your JVM heap, for example if you try to load a 2GB text file into RAM but have only 1GB of JVM heap. It looks like that is your case.
To fix this issue, try to increase the amount of JVM heap on:
your worker nodes, if you have a distributed Spark setup.
your spark-shell app.
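Before and after changing memory settings, it is worth checking how much heap the JVM actually received. The following is a small sketch (HeapCheck and maxHeapMb are hypothetical names) that can be pasted into spark-shell. It assumes the shell runs in local mode, where executors share the driver JVM, so the heap is governed by the driver memory setting (for example the --driver-memory option of spark-shell) rather than by spark.executor.memory.

```scala
object HeapCheck {
  // Maximum heap the current JVM may grow to, in megabytes.
  def maxHeapMb: Long = Runtime.getRuntime.maxMemory / (1024L * 1024L)

  def main(args: Array[String]): Unit =
    println(s"Max JVM heap: $maxHeapMb MB")
}
```

If the reported value does not change after adjusting the configuration, the setting is probably being applied to a different JVM than the one running the computation.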