How to improve join performance when a broadcast variable is used? - scala

I am new to Spark. I have two RDDs: one of 9 GB (400 million lines, RDD1) and one of 110 KB (4 million lines, RDD2). I use RDD2 as a broadcast variable to reduce shuffling. My code works, but the reduceByKey part is still too slow.
I have been experimenting with the number of partitions. If I set 10,000 partitions for both RDDs, it starts to spill. So I increased it to 20K, 30K and 100K. The spilling stopped, but the job became extremely slow. I also tried
set("spark.akka.frameSize","1000"), but it did not help. How could I improve this code?
Here is my code:
val conf = new SparkConf().setAppName("abdulhay").setMaster("local[*]").set("spark.shuffle.spill", "true")
  .set("spark.shuffle.memoryFraction", "0.4")
  .set("spark.executor.memory", "128g")
  .set("spark.driver.memory", "128g")
val sc = new SparkContext(conf)

val emp = sc.textFile("\\.txt", 30000)     // RDD1
val emp_new = sc.textFile("\\.txt", 10000) // RDD2

val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)

val joined = emp.mapPartitions(iter => for {
  (k, v1) <- iter
  v2 <- emp_newBC.value.getOrElse(k, Iterable())
} yield (s"$v1-$v2", 1))

val olsun = joined.reduceByKey((a, b) => a + b)
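One thing I have considered (but not verified) is pre-aggregating the counts inside each partition before reduceByKey, so that far fewer ("$v1-$v2", 1) records reach the shuffle. A rough sketch, which trades shuffle volume for per-partition memory:

import scala.collection.mutable

// Rough sketch (untested): build partial counts per partition before the shuffle.
val joined = emp.mapPartitions { iter =>
  val localCounts = mutable.HashMap.empty[String, Int]
  for {
    (k, v1) <- iter
    v2 <- emp_newBC.value.getOrElse(k, Iterable())
  } {
    val pair = s"$v1-$v2"
    localCounts(pair) = localCounts.getOrElse(pair, 0) + 1
  }
  localCounts.iterator
}
val olsun = joined.reduceByKey(_ + _)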

Related

Spark LSH approxSimilarityJoin taking too much time

Spark LSH approxSimilarityJoin is taking too much time:
val column = "name"
val new_df = df.select("id", "name", "duns_number", "country_id")      // 1.7 million records
val new_df_1 = df.select("index", "name", "duns_number", "country_id") // 0.7 million records
val n_gram = new NGram()
.setInputCol("_"+column)
.setN(4)
.setOutputCol("n_gram_column")
val n_gram_df = n_gram.transform(new_df)
val n_gram_df_1=n_gram.transform(new_df_1)
val validateEmptyVector = udf({ v: Vector => v.numNonzeros > 0 }, DataTypes.BooleanType)
val vectorModeler: CountVectorizerModel = new CountVectorizer()
.setInputCol("n_gram_column")
.setOutputCol("tokenize")
.setVocabSize(456976)
.setMinDF(1)
.fit(n_gram_df)
val vectorizedProductsDF = vectorModeler.transform(n_gram_df)
.filter(validateEmptyVector(col("tokenize")))
.select(col("id"), col(column), col("tokenize"),col("duns_number"),col("country_id"))
val vectorizedProductsDF_1 = vectorModeler.transform(n_gram_df_1)
.filter(validateEmptyVector(col("tokenize")))
.select(col("tokenize"),col(column),col("duns_number"),col("country_id"),col("index"))
val minLshConfig = new MinHashLSH().setNumHashTables(3)
.setInputCol("tokenize")
.setOutputCol("hash")
val lshModel = minLshConfig.fit(vectorizedProductsDF)
val transform_1=lshModel.transform(vectorizedProductsDF)
val transform_2=lshModel.transform(vectorizedProductsDF_1)
val result=lshModel.approxSimilarityJoin(transform_1,transform_2,0.42).toDF
The last line of code (approxSimilarityJoin) takes too much time, and in the Stages view the last few tasks get stuck.
I tried with 13 executors with 4 cores each and spark.sql.shuffle.partitions=600.
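One direction I intend to try (not verified) is repartitioning both transformed DataFrames right before the join, so the pairwise comparison work is spread across more tasks. A rough sketch; the distance column name is arbitrary and 600 simply mirrors spark.sql.shuffle.partitions above:

// Rough sketch, not a verified fix.
val left = transform_1.repartition(600)
val right = transform_2.repartition(600)
val result = lshModel.approxSimilarityJoin(left, right, 0.42, "jaccard_dist").toDF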

How to optimize the Spark code below (Scala)?

I have some huge files (19 GB, 40 GB, etc.). I need to execute the following algorithm on these files:
Read the file
Sort it on the basis of one column
Take the first 70% of the data:
a) Take all the distinct records of a subset of the columns
b) Write them to a train file
Take the last 30% of the data:
a) Take all the distinct records of a subset of the columns
b) Write them to a test file
I tried running the following code in Spark (using Scala).
import scala.collection.mutable.ListBuffer
import java.io.FileWriter
import org.apache.spark.sql.functions.max
val offers = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.option("delimiter", ",")
.load("input.txt")
val csvOuterJoin = offers.orderBy("utcDate")
val trainDF = csvOuterJoin.limit((csvOuterJoin.count*.7).toInt)
val maxTimeTrain = trainDF.agg(max("utcDate"))
val maxtimeStamp = maxTimeTrain.collect()(0).getTimestamp(0)
val testDF = csvOuterJoin.filter(csvOuterJoin("utcDate") > maxtimeStamp)
val inputTrain = trainDF.select("offerIdClicks","userIdClicks","userIdOffers","offerIdOffers").distinct
val inputTest = testDF.select("offerIdClicks","userIdClicks","userIdOffers","offerIdOffers").distinct
inputTrain.rdd.coalesce(1,false).saveAsTextFile("train.csv")
inputTest.rdd.coalesce(1,false).saveAsTextFile("test.csv")
This is how I initiate spark-shell:
./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.4.0 --total-executor-cores 70 --executor-memory 10G --driver-memory 20G
I execute this code on a distributed cluster with 1 master and many slaves, each with a sufficient amount of RAM. As of now, this code ends up taking a lot of memory and I get Java heap space issues.
Is there a way to optimize the above code (preferably in Spark)? Any help in optimizing it is appreciated.
The problem is that you don't distribute at all, and the source is here:
val csvOuterJoin = offers.orderBy("utcDate")
val trainDF = csvOuterJoin.limit((csvOuterJoin.count*.7).toInt)
The limit operation is not designed for large-scale use; it moves all records to a single partition:
val df = spark.range(0, 10000, 1, 1000)
df.rdd.partitions.size
// Int = 1000

// Take all records with limit
df.orderBy($"id").limit(10000).rdd.partitions.size
// Int = 1
You can use the RDD API instead:
val ordered = df.orderBy($"utcDate")
val cnt = df.count * 0.7

val train = spark.createDataFrame(ordered.rdd.zipWithIndex.filter {
  case (_, i) => i <= cnt
}.map(_._1), ordered.schema)

val test = spark.createDataFrame(ordered.rdd.zipWithIndex.filter {
  case (_, i) => i > cnt
}.map(_._1), ordered.schema)
coalesce(1, false) merges all the data into one partition, i.e. it keeps the 40 GB of data in the memory of a single node.
Never try to get all the data into one file with coalesce(1, false).
Instead, just call saveAsTextFile (so the output looks like part-00001, part-00002, etc.) and merge the partition files afterwards.
How you merge depends on your file system. In the case of HDFS, you can use http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/FileSystemShell.html#getmerge
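For example (a sketch only; the output paths are placeholders), the last two lines of the question could become:

// Write with the natural number of partitions instead of coalesce(1, false) ...
inputTrain.rdd.saveAsTextFile("train_parts")
inputTest.rdd.saveAsTextFile("test_parts")
// ... and then merge outside Spark, e.g. on HDFS:
//   hdfs dfs -getmerge train_parts train.csv
//   hdfs dfs -getmerge test_parts test.csv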

SparkError: Total size of serialized results of XXXX tasks (2.0 GB) is bigger than spark.driver.maxResultSize (2.0 GB)

Error:
ERROR TaskSetManager: Total size of serialized results of XXXX tasks (2.0 GB) is bigger than spark.driver.maxResultSize (2.0 GB)
Goal: obtain recommendations for all users using the model, overlap them with each user's test data, and generate the overlap ratio.
I have built a recommendation model using Spark MLlib. I evaluate the overlap between each user's test data and the recommended items per user, and generate the mean overlap ratio.
def overlapRatio(model: MatrixFactorizationModel, test_data: org.apache.spark.rdd.RDD[Rating]): Double = {
  val testData: RDD[(Int, Iterable[Int])] = test_data.map(r => (r.user, r.product)).groupByKey
  val n = testData.count

  val recommendations: RDD[(Int, Array[Int])] = model.recommendProductsForUsers(20)
    .mapValues(_.map(r => r.product))

  val overlaps = testData.join(recommendations).map(x => {
    val moviesPerUserInRecs = x._2._2.toSet
    val moviesPerUserInTest = x._2._1.toSet
    val localHitRatio = moviesPerUserInRecs.intersect(moviesPerUserInTest)
    if (localHitRatio.size > 0) 1 else 0
  }).filter(x => x != 0).count

  var r = 0.0
  if (overlaps != 0)
    r = overlaps.toDouble / n // toDouble avoids integer division
  return r
}
But the problem is that it ends up throwing the maxResultSize error above. In my Spark configuration I did the following to increase maxResultSize:
val conf = new SparkConf()
conf.set("spark.driver.maxResultSize", "6g")
But that didn't solve the problem. I went almost up to the amount of memory I allocated to the driver, yet the issue wasn't resolved. While the code was executing I kept an eye on my Spark job, and what I saw is a bit puzzling.
[Stage 281:==> (47807 + 100) / 1000000]15/12/01 12:27:03 ERROR TaskSetManager: Total size of serialized results of 47809 tasks (6.0 GB) is bigger than spark.driver.maxResultSize (6.0 GB)
At the above stage, the code is executing the MatrixFactorization code in spark-mllib, recommendForAll, around line 277 (not exactly sure of the line number).
private def recommendForAll(
    rank: Int,
    srcFeatures: RDD[(Int, Array[Double])],
    dstFeatures: RDD[(Int, Array[Double])],
    num: Int): RDD[(Int, Array[(Int, Double)])] = {
  val srcBlocks = blockify(rank, srcFeatures)
  val dstBlocks = blockify(rank, dstFeatures)
  val ratings = srcBlocks.cartesian(dstBlocks).flatMap {
    case ((srcIds, srcFactors), (dstIds, dstFactors)) =>
      val m = srcIds.length
      val n = dstIds.length
      val ratings = srcFactors.transpose.multiply(dstFactors)
      val output = new Array[(Int, (Int, Double))](m * n)
      var k = 0
      ratings.foreachActive { (i, j, r) =>
        output(k) = (srcIds(i), (dstIds(j), r))
        k += 1
      }
      output.toSeq
  }
  ratings.topByKey(num)(Ordering.by(_._2))
}
The recommendForAll method gets called from the recommendProductsForUsers method.
But it looks like the method is spinning off 1M tasks. The data being fed in comes from 2000 part files, so I am confused about how it starts to spawn 1M tasks, and I think that might be the problem.
My question is: how can I actually resolve this problem? Without this approach it is really hard to calculate the overlap ratio or recall@K. This is on Spark 1.5 (Cloudera 5.5).
The 2 GB problem is not new to the Spark community: https://issues.apache.org/jira/browse/SPARK-6235
Regarding a partition size greater than 2 GB: try to repartition (myRdd.repartition(parallelism)) your RDD to a greater number of partitions (with respect to your current level of parallelism), thus reducing each single partition's size.
Regarding the number of tasks spawned (and hence partitions created): my hypothesis is that it comes out of the srcBlocks.cartesian(dstBlocks) call, which produces an output RDD with (number of srcBlocks partitions × number of dstBlocks partitions) partitions.
In that case, you might consider using the myRdd.coalesce(parallelism) API instead of repartition to avoid a shuffle (and the partition serialization problems that come with it).
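As a rough sketch of that idea (untested, and the partition count of 100 is just a placeholder to tune): one way to shrink the srcBlocks × dstBlocks cartesian product is to rebuild the model from coalesced feature RDDs before calling recommendProductsForUsers.

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Fewer feature partitions -> fewer blocks -> far fewer cartesian tasks.
val compactModel = new MatrixFactorizationModel(
  model.rank,
  model.userFeatures.coalesce(100),     // 100 is a placeholder; tune for your data
  model.productFeatures.coalesce(100))

val recommendations = compactModel.recommendProductsForUsers(20)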

Joining process with broadcast variable ends up in endless spilling

I am joining two RDDs from text files in standalone mode. One has 400 million (9 GB) rows, and the other has 4 million (110 KB).
3-grams   doc1            3-grams   doc2
ion   -   100772C111      ion   -   200772C222
on    -   100772C111      gon   -   200772C222
n     -   100772C111      n     -   200772C222
...   -   ....            ...   -   ....
ion   -   3332145654      on    -   58898874
mju   -   3332145654      mju   -   58898874
...   -   ....            ...   -   ....
In each file, the doc numbers (doc1 or doc2) appear one under the other. As a result of the join I would like to get the number of common 3-grams between docs, e.g.:
(100772C111-200772C222, 2) --> there are two common 3-grams, 'ion' and 'n'
The server on which I run my code has 128 GB of RAM and 24 cores. In my IntelliJ run configuration I set the VM options to -Xmx64G.
Here is my code for this:
val conf = new SparkConf().setAppName("abdulhay").setMaster("local[4]").set("spark.shuffle.spill", "true")
  .set("spark.shuffle.memoryFraction", "0.6").set("spark.storage.memoryFraction", "0.4")
  .set("spark.executor.memory", "40g")
  .set("spark.driver.memory", "40g")
val sc = new SparkContext(conf)

val emp = sc.textFile("\\doc1.txt").map(line => (line.split("\t")(3), line.split("\t")(1))).distinct()
val emp_new = sc.textFile("\\doc2.txt").map(line => (line.split("\t")(3), line.split("\t")(1))).distinct()

val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)

val joined = emp.mapPartitions(iter => for {
  (k, v1) <- iter
  v2 <- emp_newBC.value.getOrElse(k, Iterable())
} yield (s"$v1-$v2", 1))

val olsun = joined.reduceByKey((a, b) => a + b)

olsun.map(x => x._1 + "\t" + x._2).saveAsTextFile("...\\out.txt")
So, as seen, during the join with the broadcast variable my key values change, and it seems I would need to repartition the joined values, which is highly expensive. As a result I ended up with too much spilling, and the job never finished. I think 128 GB of memory should be sufficient. As far as I understand, when a broadcast variable is used, shuffling should decrease significantly. So what is wrong with my application?
Thanks in advance.
EDIT:
I have also tried spark's join function as below:
var joinRDD = emp.join(emp_new);
val kkk = joinRDD.map(line => (line._2,1)).reduceByKey((a, b) => a + b)
again ending up with too much spilling.
EDIT2:
val conf = new SparkConf().setAppName("abdulhay").setMaster("local[12]").set("spark.shuffle.spill", "true")
  .set("spark.shuffle.memoryFraction", "0.4").set("spark.storage.memoryFraction", "0.6")
  .set("spark.executor.memory", "50g")
  .set("spark.driver.memory", "50g")
val sc = new SparkContext(conf)

val emp = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\wos14.txt")
  .map { line => val s = line.split("\t"); (s(5), s(0)) } //.distinct()
val emp_new = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\fwo_word.txt")
  .map { line => val s = line.split("\t"); (s(3), s(1)) } //.distinct()

val cog = emp_new.cogroup(emp)

val skk = cog.flatMap {
  case (key: String, (l1: Iterable[String], l2: Iterable[String])) =>
    (l1.toSeq ++ l2.toSeq).combinations(2).map { case Seq(x, y) => if (x < y) ((x, y), 1) else ((y, x), 1) }.toList
}

val com = skk.countByKey()
val com = skk.countByKey()
I would not use broadcast variables. When you say:
val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)
Spark first moves the ENTIRE dataset to the driver, a huge bottleneck that is prone to produce memory errors there. Then this whole map is sent back to ALL nodes (lots of network overhead), which is bound to produce memory issues there too.
Instead, join the RDDs themselves using join (see description here)
Also figure out whether you have too few keys. For a join, Spark essentially needs to load all the values for a key into memory, so if you have too few keys a single partition might still be too big for any given executor.
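A quick, hedged way to check that on the question's data (a sketch, sampling rate is a placeholder):

// Rough sketch: sample ~1% of the large RDD and inspect the heaviest keys.
val keyCounts = emp.keys.sample(withReplacement = false, fraction = 0.01).countByValue()
keyCounts.toSeq.sortBy(-_._2).take(10).foreach(println)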
A separate note: reduceByKey will repartition anyway.
EDIT: ---------------------
Ok, given the clarifications, and assuming that the number of 3-grams per doc# is not too big, this is what I would do (a sketch follows the note below):
Key both files by 3-gram to get (3-gram, doc#) tuples.
cogroup both RDDs; that gives you the 3-gram key and two lists of doc#s.
Process those in a single Scala function and output a set of all unique (doc1, doc2) pairs.
Then do countByKey or countByKeyApprox to get the count of distinct 3-grams for each doc pair.
Note: you can skip the .distinct() calls with this approach. Also, you should not split every line twice: change line => (line.split("\t")(3), line.split("\t")(1)) to line => { val s = line.split("\t"); (s(3), s(1)) }.
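A minimal sketch of those steps (untested; the file paths and tab-field indices are placeholders taken from the question):

// 1. Key both files by 3-gram: (3-gram, doc#)
val emp     = sc.textFile("doc1.txt").map { line => val s = line.split("\t"); (s(3), s(1)) }
val emp_new = sc.textFile("doc2.txt").map { line => val s = line.split("\t"); (s(3), s(1)) }

// 2. cogroup by 3-gram, 3. emit one record per (doc1, doc2) pair sharing that 3-gram
val pairs = emp.cogroup(emp_new).flatMap { case (_, (docs1, docs2)) =>
  for (d1 <- docs1; d2 <- docs2) yield ((d1, d2), 1)
}

// 4. count how many 3-grams each doc pair has in common
val commonCounts = pairs.countByKey()   // or pairs.reduceByKey(_ + _) to keep it distributed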
EDIT 2:
You also seem to be tuning your memory badly. For instance, using .set("spark.shuffle.memoryFraction", "0.4").set("spark.storage.memoryFraction", "0.6") leaves basically no memory for task execution (since they add up to 1.0). I should have seen that sooner but was focused on the problem itself.
Check the tuning guides here and here.
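A rough sketch of a less contradictory split (the values are placeholders, not recommendations):

val conf = new SparkConf().setAppName("abdulhay").setMaster("local[12]")
  .set("spark.shuffle.memoryFraction", "0.4")
  .set("spark.storage.memoryFraction", "0.4") // 0.4 + 0.4 leaves headroom for task execution
  .set("spark.driver.memory", "50g")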
Also, if you are running it on a single machine, you might try a single, huge executor (or even ditch Spark completely), as you don't need the overhead of a distributed processing platform (or its tolerance of distributed hardware failures, etc.).

How to design a Spark application so that shuffle data is automatically cleaned up after some iterations

In the Spark core "examples" directory (I am using Spark 1.2.0), there is an example called "SparkPageRank.scala":
val sparkConf = new SparkConf().setAppName("PageRank")
val iters = if (args.length > 0) args(1).toInt else 10
val ctx = new SparkContext(sparkConf)

val lines = ctx.textFile(args(0), 1)
val links = lines.map { s =>
  val parts = s.split("\\s+")
  (parts(0), parts(1))
}.distinct().groupByKey().cache()

var ranks = links.mapValues(v => 1.0)

for (i <- 1 to iters) {
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    val size = urls.size
    urls.map(url => (url, rank / size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}

val output = ranks.collect()
ctx.stop()
I realize that in this example the lineage keeps extending after each iteration. As a result, when I monitored the directory that holds the shuffle data, I saw the shuffle data keep growing after each iteration.
How should I structure the application code so that the ContextCleaner's doCleanupShuffle is activated after a certain interval (say, every few iterations), to prevent the shuffle data from growing without bound in computations that take many iterations?
Jun
Apparently the clean-up of the shuffle files happens when the objects used for the shuffle are GCed. Since your snippet is a simple PageRank example, I assume you are running it on a very small dataset, so your memory never exceeds the heap size and the objects are never GCed. Try with a bigger file, or trigger the GC manually (although it is usually not recommended).
More information: https://github.com/apache/spark/pull/5074/files
By the way, in your example it would be more efficient to partition the data once instead of shuffling it every time; see the sketch below.
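A minimal sketch of that last point (the partition count of 100 is just a placeholder): giving links an explicit partitioner and caching it lets the per-iteration join reuse the same partitioning instead of reshuffling the link table every time.

import org.apache.spark.HashPartitioner

val links = lines.map { s =>
  val parts = s.split("\\s+")
  (parts(0), parts(1))
}.distinct()
 .groupByKey(new HashPartitioner(100))  // 100 partitions is a placeholder; tune for your data
 .cache()

// ranks inherits links' partitioner via mapValues, so links.join(ranks) in the loop
// no longer reshuffles the (large) link table on every iteration.
var ranks = links.mapValues(v => 1.0)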