Joining process with broadcast variable ends up in endless spilling - Scala

I am joining two RDDs from text files in standalone mode. One has 400 million (9 GB) rows, and the other has 4 million (110 KB).
3-grams - doc1            3-grams - doc2
ion - 100772C111          ion - 200772C222
 on - 100772C111          gon - 200772C222
  n - 100772C111            n - 200772C222
... - ...                 ... - ...
ion - 3332145654           on - 58898874
mju - 3332145654          mju - 58898874
... - ...                 ... - ...
In each file, the doc numbers (doc1 or doc2) appear one under the other. As a result of the join I would like to get the number of common 3-grams between the docs, e.g.:
(100772C111-200772C222, 2) --> there are two common 3-grams, which are 'ion' and ' n'
The server on which I run my code has 128 GB of RAM and 24 cores. In my IntelliJ run configuration, I set the VM options to -Xmx64G.
Here is my code for this:
val conf = new SparkConf().setAppName("abdulhay").setMaster("local[4]").set("spark.shuffle.spill", "true")
  .set("spark.shuffle.memoryFraction", "0.6").set("spark.storage.memoryFraction", "0.4")
  .set("spark.executor.memory", "40g")
  .set("spark.driver.memory", "40g")
val sc = new SparkContext(conf)

val emp     = sc.textFile("\\doc1.txt").map(line => (line.split("\t")(3), line.split("\t")(1))).distinct()
val emp_new = sc.textFile("\\doc2.txt").map(line => (line.split("\t")(3), line.split("\t")(1))).distinct()

val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)

val joined = emp.mapPartitions(iter => for {
  (k, v1) <- iter
  v2 <- emp_newBC.value.getOrElse(k, Iterable())
} yield (s"$v1-$v2", 1))

val olsun = joined.reduceByKey((a, b) => a + b)

olsun.map(x => x._1 + "\t" + x._2).saveAsTextFile("...\\out.txt")
So as seen, during the join with the broadcast variable my key values change, so it seems I need to repartition the joined values, which is highly expensive. As a result, I ended up with too much spilling, and the job never finished. I think 128 GB of memory should be sufficient. As far as I understand, when a broadcast variable is used, shuffling should decrease significantly. So what is wrong with my application?
Thanks in advance.
EDIT:
I have also tried Spark's join function as below:
var joinRDD = emp.join(emp_new)
val kkk = joinRDD.map(line => (line._2, 1)).reduceByKey((a, b) => a + b)
again ending up with too much spilling.
EDIT2:
val conf = new SparkConf().setAppName("abdulhay").setMaster("local[12]").set("spark.shuffle.spill", "true")
  .set("spark.shuffle.memoryFraction", "0.4").set("spark.storage.memoryFraction", "0.6")
  .set("spark.executor.memory", "50g")
  .set("spark.driver.memory", "50g")
val sc = new SparkContext(conf)

val emp = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\wos14.txt")
  .map { line => val s = line.split("\t"); (s(5), s(0)) } //.distinct()
val emp_new = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\fwo_word.txt")
  .map { line => val s = line.split("\t"); (s(3), s(1)) } //.distinct()

val cog = emp_new.cogroup(emp)

val skk = cog.flatMap {
  case (key: String, (l1: Iterable[String], l2: Iterable[String])) =>
    (l1.toSeq ++ l2.toSeq).combinations(2).map { case Seq(x, y) => if (x < y) ((x, y), 1) else ((y, x), 1) }.toList
}

val com = skk.countByKey()

I would not use broadcast variables. When you say:
val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)
Spark first collects the ENTIRE dataset onto the master node, a huge bottleneck that is prone to produce memory errors on the master node. Then this whole map is shipped back to ALL nodes (lots of network overhead), bound to produce memory issues there too.
Instead, join the RDDs themselves using join (see description here)
Also figure out whether you have too few keys. For a join, Spark basically needs to load all the values of a key into memory, and if your keys are too few that might still be too big a partition for any given executor.
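A quick way to check the key distribution (a hedged sketch; it assumes the per-key counts fit comfortably in driver memory):
// Rough skew check on the (3-gram, doc#) pairs; `emp` is the RDD from the question
val keyCounts = emp.mapValues(_ => 1L).reduceByKey(_ + _)
keyCounts.top(10)(Ordering.by(_._2)).foreach(println) // the ten most frequent 3-grams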
A separate note: reduceByKey will repartition anyway.
EDIT: ---------------------
Ok, given the clarifications, and assuming that the number of 3-grams per doc# is not too big, this is what I would do:
Key both files by 3-gram to get (3-gram, doc#) tuples.
cogroup both RDDs; that gets you the 3-gram key and two lists of doc#s.
Process those in a single Scala function, outputting the set of all unique doc pairs.
Then do countByKey or countByKeyApprox to get the number of shared 3-grams for each doc pair.
Note: you can skip the .distinct() calls with this approach. Also, you should not split every line twice: change line => (line.split("\t")(3), line.split("\t")(1)) to line => { val s = line.split("\t"); (s(3), s(1)) }. A sketch of the full pipeline is below.
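A minimal sketch of those steps, reusing the file paths and column indices from the question (they may need adjusting):
// Key both files by 3-gram: (3-gram, doc#), splitting each line only once
val emp     = sc.textFile("\\doc1.txt").map { line => val s = line.split("\t"); (s(3), s(1)) }
val emp_new = sc.textFile("\\doc2.txt").map { line => val s = line.split("\t"); (s(3), s(1)) }

// cogroup by 3-gram and emit one record per (doc1, doc2) pair sharing that 3-gram
val pairs = emp.cogroup(emp_new).flatMap { case (_, (docs1, docs2)) =>
  for {
    d1 <- docs1.toSet // toSet dedupes within a 3-gram, which replaces the .distinct() calls
    d2 <- docs2.toSet
  } yield ((d1, d2), 1)
}

// number of shared 3-grams per doc pair
val counts = pairs.countByKey() // or countByKeyApprox(timeout) for an approximate answer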
EDIT 2:
You also seem to be tuning your memory badly. For instance, using .set("spark.shuffle.memoryFraction", "0.4").set("spark.storage.memoryFraction", "0.6") leaves basically no memory for task execution (since they add up to 1.0). I should have seen that sooner but was focused on the problem itself.
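For illustration only, settings that leave some headroom might look like this (the values are placeholders, not tuned for your workload):
val conf = new SparkConf()
  .setAppName("abdulhay")
  .setMaster("local[12]")
  .set("spark.shuffle.memoryFraction", "0.3") // placeholder value
  .set("spark.storage.memoryFraction", "0.3") // placeholder value, leaving part of the heap for task execution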
Check the tuning guides here and here.
Also, if you are running it on a single machine, you might try a single, huge executor (or even ditch Spark completely), as you don't need the overhead of a distributed processing platform (or its tolerance of distributed hardware failures, etc.).

Related

Calculating Requests Per Minute From Timestamps in RDD during mapping

I am currently trying to enrich data for machine learning with a requests-per-minute feature. The data is stored in a Kafka topic, and on application start the whole content of the topic is loaded and processed; therefore, to my knowledge, it is not possible to use any of Spark Streaming's window operations, as all the data arrives at the same time.
My approach was to try the following:
val kMeansFeatureRdd = kMeansInformationRdd.map(x => {
  val begin = x._2    // Long - unix timestamp millis
  val duration = x._3 // Long
  val rpm = kMeansInformationRdd.filter(y => (x._2 - 60000 <= y._2 && x._2 >= y._2)).count()
  (duration, rpm)
})
However, with this approach I get the following exception:
org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
Is there a way to achieve what I want to do?
If you need any more information just drop me a comment and I will update what you need.
EDIT:
Broadcasting an RDD does not work. Broadcasting the collected Array does not result in an acceptable performance.
The following does execute, but it is horribly slow and therefore not really an option:
val collected = kMeansInformationRdd.collect()
val kMeansFeatureRdd = kMeansInformationRdd.map(x => {
  val begin = x._2    // Long - unix timestamp millis
  val duration = x._3 // Long
  val rpm = collected.filter(y => (x._2 - 60000 <= y._2 && x._2 >= y._2)).size
  (duration, rpm)
})
UPDATE:
This code at least gets the job done much faster, but as far as I can see it still gets slower the higher the requests per minute are, since the filtered array grows. Interestingly enough, it also gets slower towards the end, which I cannot figure out. If someone sees the issue, or performance problems that could generally be improved, I would be happy if you let me know.
kMeansInformationRdd = kMeansInformationRdd.cache()
kMeansInformationRdd.sortBy(_._2, true) // note: sortBy returns a new RDD; the result is not used here

var kMeansFeatureArray: Array[(String, Long, Long)] = Array()
var buffer: collection.mutable.Map[String, Array[Long]] = collection.mutable.Map()
var counter = 0

kMeansInformationRdd.collect.foreach(x => {
  val ts = x._2
  val identifier = x._1 // make sure the identifier actually represents the entity that receives the traffic - e.g. the machine (IP?), not only the endpoint
  var bufferInstance = buffer.get(identifier).getOrElse(Array[Long]())
  bufferInstance = bufferInstance ++ Array(ts)
  bufferInstance = bufferInstance.filter(p => p > ts - 1000)
  buffer.put(identifier, bufferInstance)
  val rpm = bufferInstance.size.toLong
  kMeansFeatureArray = kMeansFeatureArray ++ Array((identifier, x._3, rpm)) // identifier, duration, rpm
  counter = counter + 1
  if (counter % 10000 == 0) {
    println(counter)
    println((identifier, x._3, rpm))
    // println((instanceSizeBefore, instanceSizeAfter)) // these variables are not defined in this snippet
  }
})

val kMeansFeatureRdd = sc.parallelize(kMeansFeatureArray)
The code given in the EDIT section is not correct; that is not the way a variable is broadcast in Spark. The correct way is as follows:
val collected = sc.broadcast(kMeansInformationRdd.collect())
val kMeansFeatureRdd = kMeansInformationRdd.map(x => {
  val begin = x._2    // Long - unix timestamp millis
  val duration = x._3 // Long
  val rpm = collected.value.filter(y => (x._2 - 60000 <= y._2 && x._2 >= y._2)).size
  (duration, rpm)
})
Of course, you can use global variables instead of sc.broadcast, but that is not recommended. Why?
The reason is the difference between using an external variable DIRECTLY (my so-called "global variable") and BROADCASTING a variable using sc.broadcast():
When using the external variable directly, Spark sends a copy of the serialized variable together with each TASK, whereas with sc.broadcast the variable is sent once per EXECUTOR. The number of tasks is normally around 10 times larger than the number of executors, so when the variable (say an array) is large enough (more than 10-20 KB), the former can cost a lot of time in network transmission and cause frequent GC, which slows Spark down. Hence a large variable (>10-20 KB) should be broadcast explicitly.
When using the external variable directly, the variable is not persisted; it ends with the task and thus cannot be reused. With sc.broadcast(), the variable is automatically persisted in the executors' memory and lasts until you explicitly unpersist it. Thus a sc.broadcast variable is available across tasks and stages.
So if the variable is expected to be used multiple times, sc.broadcast() is suggested, and it can be released explicitly once it is no longer needed, as sketched below.
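A small sketch of that lifecycle (unpersist and destroy are standard methods on Spark's Broadcast; the surrounding names come from the code above):
val collected = sc.broadcast(kMeansInformationRdd.collect())

// ... reuse collected.value in as many transformations and stages as needed ...

collected.unpersist() // drop the cached copies on the executors; it would be re-sent if used again
collected.destroy()   // release all data and metadata once it is definitely no longer needed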

Understanding scala collection execution

the following code takes almost forever:
val r = (1 to 10000)
  .map(_ => Seq.fill(10000)(0.0))
  .map(_.size)
  .sum
While this is very fast:
val r = (1 to 10000)
  .map(_ => Seq.fill(10000)(0.0).size)
  .sum
Why is that? I don't quite understand in which order the statements are executed. In the first case, are 10000 Seqs of size 10000 created first, and then all of them mapped to their size? Or is each Seq mapped to its size individually (and thus garbage-collected)?
Your assumption is correct. In the first snippet, you create 10,000 Seq instances and only after that, in a second iteration, are those instances mapped to their size. In the second snippet, there is not only no need to store each Seq (since you are only interested in its size), but also no need for an additional iteration.
For the sake of clarity, let's look at it without method chaining:
val range = (1 to 10000)
val a1 = range.map(_ => Seq.fill(10000)(0.0)) // all collections are maintained in memory
val a2 = a1.map(_.size)
val b = range.map(_ => Seq.fill(10000)(0.0).size) // each collection can be thrown away asap
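As an aside, if you want to keep the separate map steps but avoid materializing the intermediate collections, a lazy view behaves much like the second snippet (a sketch, not from the original question):
val r = (1 to 10000).view           // lazy: elements flow through the chained maps one at a time
  .map(_ => Seq.fill(10000)(0.0))   // each Seq becomes garbage-collectable right after...
  .map(_.size)                      // ...its size has been taken
  .sum                              // sum forces the evaluation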

Get all possible combinations of 3 values from n possible elements

I'm trying to get the list of all the possible combinations of 3 elements from a list of 30 items. I tried to use the following code, but it fails throwing an OutOfMemoryError. Is there any alternative approach which is more efficient than this?
val items = sqlContext.table(SOURCE_DB + "." + SOURCE_TABLE).
  select("item_id").distinct.cache

items.take(1) // Compute cache before join

val itemCombinations = items.select($"item_id".alias("id_A")).
  join(
    items.select($"item_id".alias("id_B")), $"id_A".lt($"id_B")).
  join(
    items.select($"item_id".alias("id_C")), $"id_B".lt($"id_C"))
The approach seems OK but might generate quite some overhead at the query execution level. Given that n is a fairly small number, we could do it using the Scala implementation directly:
val localItems = items.collect
val combinations = localItems.combinations(3)
The result is an iterator that can be consumed one element at the time, without significant memory overhead.
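For example, with 30 placeholder ids (standalone, not tied to the table above):
val ids = (1 to 30).toList
val triples = ids.combinations(3) // lazy Iterator[List[Int]], nothing is materialized up front
println(triples.size)             // 4060 = C(30, 3); note that calling size exhausts the iterator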
Spark Version (edit)
Given the desire to make a Spark version of this, it could be possible to avoid the query planner (assuming that the issue is there), by dropping to RDD level. This is basically the same expression as the join in the question:
val items = sqlContext.table(SOURCE_DB + "." + SOURCE_TABLE).select("item_id").rdd
val combinations = items.cartesian(items).filter{case(x,y) => x<y}.cartesian(items).filter{case ((x,y),z) => y<z}
Running an equivalent code in my local machine:
val data = List.fill(1000)(scala.util.Random.nextInt(999))
val rdd = sparkContext.parallelize(data)
val combinations = rdd.cartesian(rdd).filter{case(x,y) => x<y}.cartesian(rdd).filter{case ((x,y),z) => y<z}
combinations.count
// res5: Long = 165623528

how to design the Spark application so that Shuffle data will be automatically cleaned up after some iterations

In the Spark core "example" directory (I am using Spark 1.2.0), there is an example called "SparkPageRank.scala",
val sparkConf = new SparkConf().setAppName("PageRank")
val iters = if (args.length > 0) args(1).toInt else 10
val ctx = new SparkContext(sparkConf)
val lines = ctx.textFile(args(0), 1)

val links = lines.map { s =>
  val parts = s.split("\\s+")
  (parts(0), parts(1))
}.distinct().groupByKey().cache()

var ranks = links.mapValues(v => 1.0)

for (i <- 1 to iters) {
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    val size = urls.size
    urls.map(url => (url, rank / size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}

val output = ranks.collect()
ctx.stop()
I realize that in this example the lineage keeps extending after each iteration. As a result, when I monitor the directory that holds the shuffle data, I see the shuffle data storage keep increasing after each iteration.
How should I structure the application code so that the ContextCleaner's doCleanupShuffle is activated after a certain interval (say, every few iterations), so that I can prevent the ever-increasing shuffle data storage for computations that take many iterations?
Jun
Apparently the clean-up of the shuffle files happens when the objects used for the shuffle are GCed. Since your snippet is a simple PageRank example, I assume you are running it on a very small dataset, so your memory usage never comes close to the heap limit and the objects are never GCed. Try with a bigger file, or trigger the GC manually (although that is usually not recommended).
More information at: https://github.com/apache/spark/pull/5074/files
By the way, in your example it would be more efficient to partition the data instead of shuffling it every time; see the sketch below.
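A hedged version of that idea: hash-partition links once and keep ranks co-partitioned, so the per-iteration join does not reshuffle links (the partition count and partitioner name are placeholders):
import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(8) // placeholder partition count

val links = lines.map { s =>
  val parts = s.split("\\s+")
  (parts(0), parts(1))
}.distinct()
 .groupByKey(partitioner) // hash-partition links once, then keep it cached
 .cache()

var ranks = links.mapValues(v => 1.0) // mapValues preserves links' partitioner

for (i <- 1 to iters) {
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    urls.map(url => (url, rank / urls.size))
  }
  // using the same partitioner keeps ranks co-partitioned with links for the next join
  ranks = contribs.reduceByKey(partitioner, _ + _).mapValues(0.15 + 0.85 * _)
}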

Scala distributed execution of function objects

Given the following function objects,
val f : Int => Double = (i:Int) => i + 0.1
val g1 : Double => Double = (x:Double) => x*10
val g2 : Double => Double = (x:Double) => x/10
val h : (Double,Double) => Double = (x:Double,y:Double) => x+y
and for instance 3 remote servers or nodes (IP xxx.xxx.xxx.1, IP 2 and IP 3), how to distribute the execution of this program,
val fx = f(1)
val g1x = g1( fx )
val g2x = g2( fx )
val res = h ( g1x, g2x )
so that
fx is computed in IP 1,
g1x is computed in IP 2,
g2x is computed in IP 3,
res is computed in IP 1
May Scala Akka or Apache Spark provide a simple approach to this ?
Update
RPC (Remote Procedure Call) with Finagle, as suggested by @pkinsky, may be a feasible choice.
Consider load-balancing policies as a mechanism for selecting the node for execution, or at least an "any free available node" policy.
I can speak for Apache Spark. It can do what you are looking for with the code below. But it's not designed for this kind of parallel computation. It is designed for parallel computation where you also have a large amount of parallel data distributed on many machines. So the solution looks a bit silly, as we distribute a single integer across a single machine for example (for f(1)).
Also, Spark is designed to run the same computation on all the data. So running g1() and g2() in parallel goes a bit against the design. (It's possible, but not elegant, as you see.)
// Distribute the input (1) across 1 machine.
val rdd1 = sc.parallelize(Seq(1), numSlices = 1)
// Run f() on the input, collect the results and take the first (and only) result.
val fx = rdd1.map(f(_)).collect.head
// The next stage's input will be (1, fx), (2, fx) distributed across 2 machines.
val rdd2 = sc.parallelize(Seq((1, fx), (2, fx)), numSlices = 2)
// Run g1() on one machine, g2() on the other.
val gxs = rdd2.map {
  case (1, x) => g1(x)
  case (2, x) => g2(x)
}.collect
val g1x = gxs(0)
val g2x = gxs(1)
// Same deal for h() as for f(). The input is (g1x, g2x), distributed to 1 machine.
val rdd3 = sc.parallelize(Seq((g1x, g2x)), numSlices = 1)
val res = rdd3.map { case (g1x, g2x) => h(g1x, g2x) }.collect.head
You can see that Spark code is based around the concept of RDDs. An RDD is like an array, except it's partitioned across multiple machines. sc.parallelize() creates such a parallel collection from a local collection. For example rdd2 in the above code will be created from the local collection Seq((1, fx), (2, fx)) and split across two machines. One machine will have Seq((1, fx)), the other will have Seq((2, fx)).
Next we do a transformation on the RDD. map is a common transformation that creates a new RDD of the same length by applying a function to each element. (Same as Scala's map.) The map we run on rdd2 will replace (1, x) with g1(x) and (2, x) with g2(x). So on one machine it will cause g1() to run, while on the other g2() will run.
Transformations run lazily, only when you want to access the results. The methods that access the results are called actions. The most straightforward example is collect, which downloads the contents of the entire RDD from the cluster to the local machine. (It is exactly the opposite of sc.parallelize().)
You can try and see all this if you download Spark, start bin/spark-shell, and copy your function definitions and the above code into the shell.