I ran the following job in the spark-shell:
val d = sc.parallelize(0 until 1000000).map(i => (i%100000, i)).persist
d.join(d.reduceByKey(_ + _)).collect
The Spark UI shows three stages. Stages 4 and 5 correspond to the computation of d, and stage 6 corresponds to the computation of the collect action. Since d is persisted, I would expect only two stages. However, stage 5 is present and is not connected to any other stage.
So I tried running the same computation without persist, and the DAG looks identical, except without the green dots indicating that the RDD has been persisted.
I would expect the output of stage 11 to be connected to the input of stage 12, but it is not.
Looking at the stage descriptions, the stages seem to indicate that d is being persisted, because stage 5 has input, but I am still confused as to why stage 5 even exists.
The input RDD is cached, and the cached part is not recomputed.
This can be validated with a simple test:
import org.apache.spark.SparkContext
def f(sc: SparkContext) = {
val counter = sc.longAccumulator("counter")
val rdd = sc.parallelize(0 until 100).map(i => {
counter.add(1L)
(i%10, i)
}).persist
rdd.join(rdd.reduceByKey(_ + _)).foreach(_ => ())
counter.value
}
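// 100, not 200: each element's map closure ran only once, so the cached rdd was reused by the join instead of being recomputed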
assert(f(spark.sparkContext) == 100)
Caching doesn't remove stages from the DAG.
If data is cached, the corresponding stages can be marked as skipped, but they are still part of the DAG. Lineage can be truncated using checkpoints, but that is not the same thing, and it doesn't remove stages from the visualization.
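For illustration, here is a minimal checkpointing sketch (the checkpoint directory is a placeholder):
sc.setCheckpointDir("/tmp/checkpoints")
val dChk = sc.parallelize(0 until 1000000).map(i => (i % 100000, i))
dChk.checkpoint()  // mark for checkpointing; nothing happens yet
dChk.count()       // the action materializes the data and writes the checkpoint files
dChk.toDebugString // the lineage now stops at a ReliableCheckpointRDD instead of the parallelize/map parents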
Input stages contain more than the cached computations.
Spark stages group together operations which can be chained without performing a shuffle.
While part of the input stage is cached, the cache doesn't cover all the operations required to prepare the shuffle files. This is why you don't see skipped tasks.
The rest (the detachment) is just a limitation of the graph visualization.
If you repartition data first:
import org.apache.spark.HashPartitioner
val d = sc.parallelize(0 until 1000000)
.map(i => (i%100000, i))
.partitionBy(new HashPartitioner(20))
d.join(d.reduceByKey(_ + _)).collect
you'll get the DAG you're most likely looking for:
Adding to user6910411's detailed answer: due to the lazy evaluation of RDDs, the RDD is not persisted in memory until the first action runs and computes the whole DAG. So when you run collect() the first time, the RDD "d" gets persisted in memory for the first time, but nothing is read from memory. If you run collect() a second time, the cached RDD is read.
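One way to observe this from the shell (a sketch; getRDDStorageInfo is a DeveloperApi method, and d2 is just a fresh copy of d so the check isn't affected by earlier runs):
val d2 = sc.parallelize(0 until 1000000).map(i => (i % 100000, i)).persist
d2.getStorageLevel                          // MEMORY_ONLY: the storage level is recorded immediately
sc.getRDDStorageInfo.exists(_.id == d2.id)  // false: no blocks are cached before the first action
d2.count()                                  // the first action computes and caches the partitions
sc.getRDDStorageInfo.exists(_.id == d2.id)  // true: d2's partitions are now in memory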
Also, if you do a toDebugString on the final RDD, it shows the below output:
scala> d.join(d.reduceByKey(_ + _)).toDebugString
res5: String =
(4) MapPartitionsRDD[19] at join at <console>:27 []
| MapPartitionsRDD[18] at join at <console>:27 []
| CoGroupedRDD[17] at join at <console>:27 []
+-(4) MapPartitionsRDD[15] at map at <console>:24 []
| | ParallelCollectionRDD[14] at parallelize at <console>:24 []
| ShuffledRDD[16] at reduceByKey at <console>:27 []
+-(4) MapPartitionsRDD[15] at map at <console>:24 []
| ParallelCollectionRDD[14] at parallelize at <console>:24 []
A rough graphical representation of the above can be shown as: [RDD Stages diagram]
Related
Given the following code:
dataFrame
.withColumn("A", myUdf1($"x")) // withColumn1 from x
.withColumn("B", myUdf2($"y")) // withColumn2 from y
Is it guaranteed that withColumn1 will execute before withColumn2?
A better example:
dataFrame
.withColumn("A", myUdf1($"x")) // withColumn1 from x
.withColumn("B", myUdf2($"A")) // withColumn2 from A!!
Note that withColumn2 operates on A, which is calculated by withColumn1.
I'm asking because I'm having inconsistent results over multiple runs of the same code, and I started to think that this could be the source of the issue.
EDIT: Added a more detailed code sample.
val result = dataFrame
.groupBy("key")
.agg(
collect_list($"itemList").as("A"), // all items
collect_list(when($"click".isNotNull, $"itemList")).as("B") // subset of A
)
// create sparse item vector from all list of items A
.withColumn("vectorA", aggToSparseUdf($"A"))
// create sparse item vector from all list of items B (subset of A)
.withColumn("vectorB", aggToSparseUdf($"B"))
// calculate ratio vector B / A
.withColumn("ratio", divideVectors($"vectorB", $"vectorA"))
val keys: Seq[String] = result.head.getAs[Seq[String]]("key")
val values: Seq[SparseVector] = result.head.getAs[Seq[SparseVector]]("ratio")
It IS guaranteed that for each specific record in dataFrame, myUdf1 will be applied before myUdf2; however:
It is NOT guaranteed that myUdf1 will be applied to all records of dataFrame before myUdf2 is applied to any record - in other words, myUdf2 might be applied to some records before myUdf1 has been applied to other records.
This is true because Spark would likely combine both operations together into a single stage, and execute this stage (applying myUdf1 and myUdf2) on each record of each partition.
This shouldn't pose any problem if your UDFs are "purely functional", or "idempotent", or cause no side effects - and they should be, because Spark assumes all transformations are such. If they weren't, Spark wouldn't be able to optimize execution by "combining" transformations together, running them in parallel on different partitions, retrying transformations etc.
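To illustrate (a small sketch using the sample data and printing UDFs defined in the EDIT below): without a repartition between them, both UDFs run in the same stage, so the console output typically interleaves per record instead of printing all the UDF1 lines first.
dataFrame
  .withColumn("A", myUdf1($"x"))
  .withColumn("B", myUdf2($"A"))
  .show()
// typically prints interleaved output such as:
// In UDF1
// In UDF2
// In UDF1
// In UDF2
// ...
// i.e. myUdf2 runs on some records before myUdf1 has run on all of them
// (the exact number and grouping of invocations can vary by Spark version and plan)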
EDIT: if you want to force UDF1 to be completely applied before applying UDF2 to any record, you'd have to force them into separate stages - this can be done, for example, by repartitioning the DataFrame:
// sample data:
val dataFrame = Seq("A", "V", "D").toDF("x")
// two UDFs with "side effects" (printing to console):
val myUdf1 = udf[String, String](x => {
println("In UDF1")
x.toLowerCase
})
val myUdf2 = udf[String, String](x => {
println("In UDF2")
x.toUpperCase
})
// repartitioning between UDFs
dataFrame
.withColumn("A", myUdf1($"x"))
.repartition(dataFrame.rdd.partitions.length + 1)
.withColumn("B", myUdf2($"A"))
.show()
// prints:
// In UDF1
// In UDF1
// In UDF1
// In UDF2
// In UDF2
// In UDF2
NOTE that this isn't bullet-proof either - if, for example, there are failures and retries, the order can once again be non-deterministic.
In a cogroup transformation, e.g. RDD1.cogroup(RDD2, ...), I used to assume that Spark only shuffles/moves RDD2 and retains RDD1's partitioning and in-memory storage if:
RDD1 has an explicit partitioner
RDD1 is cached.
In my other projects most of the shuffling behaviour seems to be consistent with this assumption. So yesterday I wrote a short Scala program to prove it once and for all:
// sc is the SparkContext
val rdd1 = sc.parallelize(1 to 10, 4).map(v => v->v)
.partitionBy(new HashPartitioner(4))
rdd1.persist().count()
val rdd2 = sc.parallelize(1 to 10, 4).map(v => (11-v)->v)
val cogrouped = rdd1.cogroup(rdd2).map {
v =>
v._2._1.head -> v._2._2.head
}
val zipped = cogrouped.zipPartitions(rdd1, rdd2) {
(itr1, itr2, itr3) =>
itr1.zipAll(itr2.map(_._2), 0->0, 0).zipAll(itr3.map(_._2), (0->0)->0, 0)
.map {
v =>
(v._1._1._1, v._1._1._2, v._1._2, v._2)
}
}
zipped.collect().foreach(println)
If rdd1 doesn't move, the first column of zipped should have the same value as the third column. So I ran the program, and oops:
(4,7,4,1)
(8,3,8,2)
(1,10,1,3)
(9,2,5,4)
(5,6,9,5)
(6,5,2,6)
(10,1,6,7)
(2,9,10,0)
(3,8,3,8)
(7,4,7,9)
(0,0,0,10)
The assumption is not true. Spark probably did some internal optimisation and decided that regenerating rdd1's partitions is much faster than keeping them in cache.
So the question is: if my requirement not to move RDD1 (and to keep it cached) exists for reasons other than speed (e.g. resource locality), or if on some occasions Spark's internal optimisation is not preferable, is there a way to explicitly instruct the framework not to move an operand in all cogroup-like operations? This also includes join, outer join, and groupWith.
Thanks a lot for your help. So far I'm using a broadcast join as a not-so-scalable makeshift solution; it is not going to last long before crashing my cluster. I'm expecting a solution consistent with distributed computing principles.
If rdd1 doesn't move, the first column of zipped should have the same value as the third column
This assumption is just incorrect. Creating a CoGroupedRDD is not only about the shuffle, but also about generating the internal structures required for matching corresponding records. Internally, Spark uses its own ExternalAppendOnlyMap, which is backed by a custom open hash table implementation (AppendOnlyMap) that doesn't provide any ordering guarantees.
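As a quick sanity check (a sketch reusing rdd1 and rdd2 from your code; mapPartitionsWithIndex is used only for inspection), you can confirm that each partition still holds the same set of keys after the cogroup - only the order inside each partition changes:
import org.apache.spark.rdd.RDD

// collect the set of keys sitting in each partition, indexed by partition id
def keysPerPartition(rdd: RDD[(Int, _)]): Map[Int, Set[Int]] =
  rdd.mapPartitionsWithIndex((idx, it) => Iterator(idx -> it.map(_._1).toSet)).collect().toMap

// cogroup reuses rdd1's partitioner, so every partition keeps the same key set;
// only the within-partition order follows AppendOnlyMap's hash layout instead of rdd1's order
keysPerPartition(rdd1.cogroup(rdd2)) == keysPerPartition(rdd1)
// Boolean = true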
If you check debug string:
zipped.toDebugString
(4) ZippedPartitionsRDD3[8] at zipPartitions at <console>:36 []
| MapPartitionsRDD[7] at map at <console>:31 []
| MapPartitionsRDD[6] at cogroup at <console>:31 []
| CoGroupedRDD[5] at cogroup at <console>:31 []
| ShuffledRDD[2] at partitionBy at <console>:27 []
| CachedPartitions: 4; MemorySize: 512.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
+-(4) MapPartitionsRDD[1] at map at <console>:26 []
| ParallelCollectionRDD[0] at parallelize at <console>:26 []
+-(4) MapPartitionsRDD[4] at map at <console>:29 []
| ParallelCollectionRDD[3] at parallelize at <console>:29 []
| ShuffledRDD[2] at partitionBy at <console>:27 []
| CachedPartitions: 4; MemorySize: 512.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
+-(4) MapPartitionsRDD[1]...
you'll see that Spark indeed uses CachedPartitions to compute the zipped RDD. If you also skip the map transformations, which remove the partitioner, you'll see that cogroup reuses the partitioner provided by rdd1:
rdd1.cogroup(rdd2).partitioner == rdd1.partitioner
Boolean = true
Maybe I am missing something, but I expected the data to be sorted based on the key:
scala> val x=sc.parallelize(Array( "cat", "ant", "1"))
x: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[160] at parallelize at <console>:22
scala> val xxx=x.map(v=> (v,v.length))
xxx: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[161] at map at <console>:26
scala> xxx.sortByKey().foreach(println)
(1,1)
(cat,3)
(ant,3)
scala> xxx.sortByKey().foreach(println)
(cat,3)
(1,1)
(ant,3)
It works if I tell Spark to use only 1 partition as below, but how do I make this work in a cluster or with more than 1 worker?
scala> xxx.sortByKey(numPartitions=1).foreach(println)
(1,1)
(ant,3)
(cat,3)
UPDATE:
I think I got the answer. It is being sorted correctly, as it works when I use collect:
scala> xxx.sortByKey().collect
res170: Array[(String, Int)] = Array((1,1), (ant,3), (cat,3))
Keeping the question open to validate my understanding.
That makes sense. foreach runs in parallel across the partitions, which creates non-deterministic ordering of the output, so the order may be mixed. collect gives you an array of the partitions concatenated in their sorted order.
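If you want to see this directly (a small sketch using the xxx RDD from the question), tag each record with its partition index: every partition holds a sorted, non-overlapping range of keys, and only the order in which the foreach output from different partitions reaches the console varies.
xxx.sortByKey()
  .mapPartitionsWithIndex((i, it) => it.map(kv => (i, kv)))
  .collect()
  .foreach(println)
// partition 0's records print before partition 1's, each range internally sorted, e.g.
// (0,(1,1))
// (0,(ant,3))
// (1,(cat,3))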
Have a look at the Spark documentation to see why the collect() method fixed the issue for you.
e.g.
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
We could also use counts.sortByKey(), for example, to sort the pairs alphabetically, and finally counts.collect() to bring them back to the driver program as an array of objects.
Calling collect() on the resulting RDD will return an ordered list of records.
collect()
Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
Remember that doing a collect() action on a very large distributed RDD can cause your driver program to run out of memory and crash. So do not use collect(), except when you are prototyping your Spark program on a small dataset.
Have a look at this article for more details
EDIT:
sortByKey(): Sorts the RDD by key, so that each partition contains a sorted range of the elements. Since all partitions may not reside on the same executor node, you will not get a totally ordered set unless you call collect().
I am joining two RDDs from text files in standalone mode. One has 400 million (9 GB) rows, and the other has 4 million (110 KB).
3-grams doc1          3-grams doc2
ion - 100772C111      ion - 200772C222
on  - 100772C111      gon - 200772C222
n   - 100772C111      n   - 200772C222
... - ....            ... - ....
ion - 3332145654      on  - 58898874
mju - 3332145654      mju - 58898874
... - ....            ... - ....
In each file, the doc numbers (doc1 or doc2) appear one under the other. As a result of the join, I would like to get the number of common 3-grams between the docs, e.g.
(100772C111-200772C222,2) --> there are two common 3-grams, which are 'ion' and ' n'
The server on which I run my code has 128 GB RAM and 24 cores. In my IntelliJ run configuration, I set the VM options to -Xmx64G.
Here is my code for this:
val conf = new SparkConf().setAppName("abdulhay").setMaster("local[4]").set("spark.shuffle.spill", "true")
.set("spark.shuffle.memoryFraction", "0.6").set("spark.storage.memoryFraction", "0.4")
.set("spark.executor.memory","40g")
.set("spark.driver.memory","40g")
val sc = new SparkContext(conf)
val emp = sc.textFile("\\doc1.txt").map(line => (line.split("\t")(3),line.split("\t")(1))).distinct()
val emp_new = sc.textFile("\\doc2.txt").map(line => (line.split("\t")(3),line.split("\t")(1))).distinct()
val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)
val joined = emp.mapPartitions(iter => for {
(k, v1) <- iter
v2 <- emp_newBC.value.getOrElse(k, Iterable())
} yield (s"$v1-$v2", 1))
val olsun = joined.reduceByKey((a,b) => a+b)
olsun.map(x => x._1 + "\t" + x._2).saveAsTextFile("...\\out.txt")
So as seen, during the join process using the broadcast variable, my key values change. So it seems I need to repartition the joined values, and that is highly expensive. As a result, I ended up with too much spilling, and the job never finished. I think 128 GB of memory should be sufficient. As far as I understood, when a broadcast variable is used, shuffling is decreased significantly. So what is wrong with my application?
Thanks in advance.
EDIT:
I have also tried Spark's join function as below:
var joinRDD = emp.join(emp_new);
val kkk = joinRDD.map(line => (line._2,1)).reduceByKey((a, b) => a + b)
again ending up with too much spilling.
EDIT2:
val conf = new SparkConf().setAppName("abdulhay").setMaster("local[12]").set("spark.shuffle.spill", "true")
.set("spark.shuffle.memoryFraction", "0.4").set("spark.storage.memoryFraction", "0.6")
.set("spark.executor.memory","50g")
.set("spark.driver.memory","50g")
val sc = new SparkContext(conf)
val emp = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\wos14.txt").map{line => val s = line.split("\t"); (s(5),s(0))}//.distinct()
val emp_new = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\fwo_word.txt").map{line => val s = line.split("\t"); (s(3),s(1))}//.distinct()
val cog = emp_new.cogroup(emp)
val skk = cog.flatMap {
case (key: String, (l1: Iterable[String], l2: Iterable[String])) =>
(l1.toSeq ++ l2.toSeq).combinations(2).map { case Seq(x, y) => if (x < y) ((x, y),1) else ((y, x),1) }.toList
}
val com = skk.countByKey()
I would not use broadcast variables. When you say:
val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)
Spark first moves the ENTIRE emp_new dataset to the driver, a huge bottleneck that is prone to produce memory errors on the driver. Then this piece of memory is broadcast to ALL nodes (lots of network overhead), bound to produce memory issues there too.
Instead, join the RDDs themselves using join (see the description here).
Also figure out whether you have too few keys. For a join, Spark basically needs to load all the values for a key into memory, and if your keys are too few, that might still be too big a partition for any given executor.
A separate note: reduceByKey will repartition anyway.
EDIT: ---------------------
OK, given the clarifications, and assuming that the number of 3-grams per doc# is not too big, this is what I would do (a sketch follows below):
Key both files by 3-gram to get (3-gram, doc#) tuples.
cogroup both RDDs; that gets you the 3-gram key and 2 lists of doc#s.
Process those in a single Scala function, and output a set of all unique (doc1#, doc2#) pairs.
Then do countByKey or countByKeyApprox to get the number of common 3-grams for each doc pair.
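A rough sketch of those steps (assuming the same tab-separated layout as in your code, with the 3-gram at index 3 and the doc# at index 1; file paths are placeholders):
val grams1 = sc.textFile("doc1.txt").map { line => val s = line.split("\t"); (s(3), s(1)) } // (3-gram, doc#)
val grams2 = sc.textFile("doc2.txt").map { line => val s = line.split("\t"); (s(3), s(1)) }

val commonCounts = grams1.cogroup(grams2)      // (3-gram, (doc#s from doc1, doc#s from doc2))
  .flatMap { case (_, (docs1, docs2)) =>
    for {
      d1 <- docs1.toSet                        // de-duplicate within each file,
      d2 <- docs2.toSet                        // so no upstream .distinct() is needed
    } yield ((d1, d2), 1)                      // one hit per shared 3-gram for this doc pair
  }
  .countByKey()                                // Map[(doc1#, doc2#), Long]: number of common 3-grams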
Note: you can skip the .distinct() calls with this one. Also, you should not split every line twice. Change line => (line.split("\t")(3), line.split("\t")(1)) to line => { val s = line.split("\t"); (s(3), s(1)) }
EDIT 2:
You also seem to be tuning your memory badly. For instance, using .set("spark.shuffle.memoryFraction", "0.4").set("spark.storage.memoryFraction", "0.6") leaves basically no memory for task execution (since the two fractions add up to 1.0). I should have seen that sooner, but I was focused on the problem itself.
Check the tuning guides here and here.
Also, if you are running it on a single machine, you might try a single, huge executor (or even ditch Spark completely), as you don't need the overhead of a distributed processing platform (or distributed hardware failure tolerance, etc.).
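For example, a single-machine setup could be as simple as the sketch below: in local mode everything runs inside one driver JVM, so spark.executor.memory is ignored and the usable heap is whatever the JVM was launched with (the -Xmx64G you already set in the IntelliJ VM options).
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("abdulhay")
  .setMaster("local[24]") // one JVM, all 24 cores
val sc = new SparkContext(conf)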
I have an RDD that is too large to consistently perform a distinct statement without spurious errors (e.g. SparkException stage failed 4 times, ExecutorLostFailure, HDFS Filesystem closed, Max number of executor failures reached, Stage cancelled because SparkContext was shut down, etc.)
I am trying to count distinct IDs in a particular column, for example:
print(myRDD.map(a => a._2._1._2).distinct.count())
is there an easy, consistent, less-shuffle-intensive way to do the command above, possibly using mapPartitions, reduceByKey, flatMap, or other commands that use fewer shuffles than distinct?
See also What are the Spark transformations that causes a Shuffle?
It might be better to figure out if there is another underlying issue, but the below will do what you want. It's a rather roundabout way to do it, but it sounds like it will fit the bill:
myRDD.map(a => (a._2._1._2, a._2._1._2))
.aggregateByKey(Set[YourType]())((agg, value) => agg + value, (agg1, agg2) => agg1 ++ agg2)
.keys
.count
Or even this seems to work, but it isn't associative and commutative. It works due to how the internals of Spark work... but I might be missing a case... so while simpler, I'm not sure I trust it:
myRDD.map(a => (a._2._1._2, a._2._1._2))
.aggregateByKey(YourTypeDefault)((x,y)=>y, (x,y)=>x)
.keys.count
As I see it, there are 2 possible solutions for this matter:
With a reduceByKey
With a mapPartitions
Let's see both of them with an example.
I have a dataset of 100,000 movie ratings with the format (idUser, (idMovie, rating)). Let's say we would like to know how many different users have rated a movie.
Let's first take a look using distinct:
val numUsers = rddSplitted.keys.distinct()
println(s"numUsers is ${numUsers.count()}")
println("*******toDebugString of rddSplitted.keys.distinct*******")
println(numUsers.toDebugString)
We will get the following results:
numUsers is 943
*******toDebugString of rddSplitted.keys.distinct*******
(2) MapPartitionsRDD[6] at distinct at MovieSimilaritiesRicImproved.scala:98 []
| ShuffledRDD[5] at distinct at MovieSimilaritiesRicImproved.scala:98 []
+-(2) MapPartitionsRDD[4] at distinct at MovieSimilaritiesRicImproved.scala:98 []
| MapPartitionsRDD[3] at keys at MovieSimilaritiesRicImproved.scala:98 []
| MapPartitionsRDD[2] at map at MovieSimilaritiesRicImproved.scala:94 []
| C:/spark/ricardoExercises/ml-100k/u.data MapPartitionsRDD[1] at textFile at MovieSimilaritiesRicImproved.scala:90 []
| C:/spark/ricardoExercises/ml-100k/u.data HadoopRDD[0] at textFile at MovieSimilaritiesRicImproved.scala:90 []
With the toDebugString function, we can better analyze what is happening with our RDDs.
Now, let's use reduceByKey, for instance, counting how many times each user has voted and at the same time obtaining the number of different users:
val numUsers2 = rddSplitted.map(x => (x._1, 1)).reduceByKey({case (a, b) => a })
println(s"numUsers is ${numUsers2.count()}")
println("*******toDebugString of rddSplitted.map(x => (x._1, 1)).reduceByKey(_+_)*******")
println(numUsers2.toDebugString)
We will get now these results:
numUsers is 943
*******toDebugString of rddSplitted.map(x => (x._1, 1)).reduceByKey(_+_)*******
(2) ShuffledRDD[4] at reduceByKey at MovieSimilaritiesRicImproved.scala:104 []
+-(2) MapPartitionsRDD[3] at map at MovieSimilaritiesRicImproved.scala:104 []
| MapPartitionsRDD[2] at map at MovieSimilaritiesRicImproved.scala:94 []
| C:/spark/ricardoExercises/ml-100k/u.data MapPartitionsRDD[1] at textFile at MovieSimilaritiesRicImproved.scala:90 []
| C:/spark/ricardoExercises/ml-100k/u.data HadoopRDD[0] at textFile at MovieSimilaritiesRicImproved.scala:90 []
Analyzing the RDDs produced, we can see that reduceByKey performs the same action in a more efficient way than the distinct above.
Finally, let's use mapPartitions. The main goal is to first get the distinct users within each partition of our dataset, and then obtain the final set of different users.
val a1 = rddSplitted.map(x => (x._1))
println(s"Number of elements in a1: ${a1.count}")
val a2 = a1.mapPartitions(x => x.toList.distinct.toIterator)
println(s"Number of elements in a2: ${a2.count}")
val a3 = a2.distinct()
println("There are "+ a3.count()+" different users")
println("*******toDebugString of map(x => (x._1)).mapPartitions(x => x.toList.distinct.toIterator).distinct *******")
println(a3.toDebugString)
We will get the following:
Number of elements in a1: 100000
Number of elements in a2: 1709
There are 943 different users
*******toDebugString of map(x => (x._1)).mapPartitions(x => x.toList.distinct.toIterator).distinct *******
(2) MapPartitionsRDD[7] at distinct at MovieSimilaritiesRicImproved.scala:124 []
| ShuffledRDD[6] at distinct at MovieSimilaritiesRicImproved.scala:124 []
+-(2) MapPartitionsRDD[5] at distinct at MovieSimilaritiesRicImproved.scala:124 []
| MapPartitionsRDD[4] at mapPartitions at MovieSimilaritiesRicImproved.scala:122 []
| MapPartitionsRDD[3] at map at MovieSimilaritiesRicImproved.scala:120 []
| MapPartitionsRDD[2] at map at MovieSimilaritiesRicImproved.scala:94 []
| C:/spark/ricardoExercises/ml-100k/u.data MapPartitionsRDD[1] at textFile at MovieSimilaritiesRicImproved.scala:90 []
| C:/spark/ricardoExercises/ml-100k/u.data HadoopRDD[0] at textFile at MovieSimilaritiesRicImproved.scala:90 []
We can now see that mapPartitions first gets the distinct users in each partition of the dataset, shrinking the number of instances from 100,000 to 1,709 without performing any shuffle. Then, with this much smaller amount of data, a distinct over the whole RDD can be carried out without worrying about the shuffle, and the result is obtained much faster.
I would recommend using this last proposal with mapPartitions rather than the reduceByKey one, as it handles a smaller amount of data. Another solution could be to use both functions: first mapPartitions as described above, and then, instead of distinct, reduceByKey in the same way as also described above.
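A sketch of that combined approach (using rddSplitted from the examples above):
val distinctUsers = rddSplitted
  .keys
  .mapPartitions(_.toSet.iterator)  // per-partition de-duplication, no shuffle
  .map(user => (user, 1))
  .reduceByKey((a, _) => a)         // a single shuffle over the already-reduced data
  .count()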