In PySpark, I have some operations that need to run in a loop and that constantly modify a certain DataFrame:
df = spark.sql("select * from my_table")
for i in range(num_iter):
    for v in var_list:
        # ...
        # operations here that change df
        df = f1()
        # ...
        df = df.join(another_dataset, ...)
        df = f2(df)
        # ...
    # there is some action here that should conclude the DAG of this loop
I notice that as the loop progresses, performance drags down. Initially an iteration takes only seconds; after a couple of loops, each iteration takes minutes to hours. Checking in YARN, it seems that the DAG grows as we move along. At some point the executors just fail with an error showing a very long DAG.
The DAG grows like this (example only; it's not solely join operations that grow the DAG):
[screenshot of the growing DAG]
Is there a way to improve performance in this situation? What is causing the performance hit? If it's objects hogging memory, is there a way to flush them after each loop iteration?
Is cache() (with a follow-up action) at each loop iteration a good workaround to avoid DAG build-up? I did try caching and the performance still drags.
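A workaround that is often suggested (and that the related posts below arrive at as well) is to truncate the lineage every few iterations with a checkpoint rather than relying on cache() alone, since caching keeps the ever-growing plan while checkpointing cuts it. A minimal Scala sketch of the idea, assuming a checkpoint directory and treating numIter, f1, f2, the join key and the interval as placeholders (the same pattern works in PySpark via df.checkpoint()):

spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")  // example path

var df = spark.sql("select * from my_table")
for (i <- 0 until numIter) {
  // ... operations that change df (placeholders from the question) ...
  df = f1(df)
  df = df.join(anotherDataset, Seq("key"))  // hypothetical join key
  df = f2(df)

  if (i % 5 == 0) {
    // Eagerly materializes df and truncates its logical plan,
    // so the DAG does not keep growing across iterations.
    df = df.checkpoint()
  }
}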
Related
I am running a Spark structured streaming job which involves creating an empty DataFrame and updating it with each micro-batch, as below. With every micro-batch execution, the number of stages increases by 4. To avoid recomputation, I am persisting the updated staticDF into memory after each update inside the loop. This helps in skipping those additional stages which get created with every new micro-batch.
My questions -
1) Even though the total number of completed stages stays the same because the additional stages are always skipped, can this cause a performance issue, since there can be millions of skipped stages at some point?
2) What happens when somehow some part or all of the cached RDD is not available (node/executor failure)? The Spark documentation says that it doesn't materialise all the data received from multiple micro-batches so far, so does that mean it will need to read all events again from Kafka to regenerate staticDF?
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, max, row_number}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{LongType, StructType}

// one-time creation of an empty static (not streaming) DataFrame
val staticDF_schema = new StructType()
.add("product_id", LongType)
.add("created_at", LongType)
var staticDF = sparkSession
.createDataFrame(sparkSession.sparkContext.emptyRDD[Row], staticDF_schema)
// Note : streamingDF was created from Kafka source
streamingDF.writeStream
.trigger(Trigger.ProcessingTime(10000L))
.foreachBatch {
(micro_batch_DF: DataFrame, batchId: Long) => {
// fetching max created_at for each product_id in current micro-batch
val staging_df = micro_batch_DF.groupBy("product_id")
.agg(max("created_at").alias("created_at"))
// Updating staticDF using current micro batch
staticDF = staticDF.unionByName(staging_df)
staticDF = staticDF
.withColumn("rnk",
row_number().over(Window.partitionBy("product_id").orderBy(desc("created_at")))
).filter("rnk = 1")
.drop("rnk")
.cache()
  }
}.start()  // start the streaming query
Even though the skipped stages don't need any computation, my job started failing after a certain number of batches. This was because of DAG growth with every batch execution, which made it unmanageable and eventually threw a stack overflow exception.
To avoid this, I had to break the Spark lineage so that the number of stages doesn't increase with every run (even if they are skipped).
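One way to do that (a sketch, not necessarily what the author actually used) is to replace the cache() at the end of the update with a local checkpoint, which materializes the result and cuts the plan; the fault-tolerant checkpoint() together with sparkSession.sparkContext.setCheckpointDir(...) works the same way:

// inside foreachBatch, after building staging_df as above
staticDF = staticDF.unionByName(staging_df)
  .withColumn("rnk",
    row_number().over(Window.partitionBy("product_id").orderBy(desc("created_at"))))
  .filter("rnk = 1")
  .drop("rnk")
  .localCheckpoint()  // truncates the lineage so the plan no longer grows with every micro-batch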
I have an iterative application running on Spark that I simplified to the following code:
var anRDD: org.apache.spark.rdd.RDD[Int] = sc.parallelize((0 to 1000))
var c: Long = Int.MaxValue
var iteration: Int = 0
while (c > 0) {
iteration += 1
// Manipulate the RDD and cache the new RDD
anRDD = anRDD.zipWithIndex.filter(t => t._2 % 2 == 1).map(_._1).cache() //.localCheckpoint()
// Actually compute the RDD and spawn a new job
c = anRDD.count()
println(s"Iteration: $iteration, Values: $c")
}
What happens to the memory allocation within subsequent jobs?
Does the current anRDD "override" the previous ones, or are they all kept in memory? In the long run, this could throw a memory exception.
Do localCheckpoint and cache have different behaviors? If localCheckpoint is used in place of cache, since localCheckpoint truncates the RDD lineage, I would expect the previous RDDs to be overridden.
Unfortunately, it seems that Spark is not a good fit for things like this.
Your original implementation is not viable because on each iteration the newer RDD has an internal reference to the older one, so all the RDDs pile up in memory.
localCheckpoint is an approximation of what you are trying to achieve. It does truncate the RDD's lineage, but you lose fault tolerance; this is clearly stated in the documentation for that method.
checkpoint is also an option. It is safe, but it dumps the data to HDFS on each iteration.
Consider redesigning the approach. Such hacks could bite sooner or later.
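For concreteness, a sketch of what the reliable-checkpoint variant of the loop in the question could look like (the checkpoint directory is just an example path):

sc.setCheckpointDir("hdfs:///tmp/rdd-checkpoints")  // example path

var anRDD: org.apache.spark.rdd.RDD[Int] = sc.parallelize(0 to 1000)
var c: Long = Int.MaxValue
var iteration: Int = 0
while (c > 0) {
  iteration += 1
  // cache first so the checkpoint write does not recompute the RDD
  anRDD = anRDD.zipWithIndex.filter(t => t._2 % 2 == 1).map(_._1).cache()
  anRDD.checkpoint()  // marks the RDD for checkpointing to the directory above
  c = anRDD.count()   // the action materializes the RDD, writes the checkpoint
                      // and truncates its lineage
  println(s"Iteration: $iteration, Values: $c")
}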
RDDs are immutable so each transformation will return a new RDD.
All the anRDDs will be kept in memory. If you run two iterations of your code and inspect the RDDs, you will see that the id is different for each of them.
So yes, in the long run this can throw a memory exception, and you should unpersist an RDD after you are done processing it.
localCheckpoint has a different use case than cache. It is used to truncate the lineage of an RDD. It doesn't write the RDD out to reliable storage such as HDFS; it improves performance but decreases fault tolerance in turn.
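For example, keeping a handle on the previous RDD and unpersisting it once the new one is materialized keeps only one cached copy around (a sketch based on the loop in the question; this addresses the cached blocks, while the lineage itself still needs truncating as discussed above):

var anRDD: org.apache.spark.rdd.RDD[Int] = sc.parallelize(0 to 1000)
var c: Long = Int.MaxValue
var iteration: Int = 0
while (c > 0) {
  iteration += 1
  val previous = anRDD  // handle on the old RDD (unpersist is a no-op if it was never cached)
  anRDD = anRDD.zipWithIndex.filter(t => t._2 % 2 == 1).map(_._1).cache()
  c = anRDD.count()     // materialize the new RDD first
  previous.unpersist()  // then drop the old one from the cache
  println(s"Iteration: $iteration, Values: $c")
}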
I have gone through some videos on YouTube regarding the Spark architecture.
Even though lazy evaluation, resilient re-creation of data in case of failures, and good functional programming concepts are reasons for the success of Resilient Distributed Datasets, one worrying factor is the memory overhead caused by multiple transformations, due to data immutability.
If I understand the concept correctly, every transformation creates a new dataset, so the memory requirement multiplies accordingly. If I use 10 transformations in my code, 10 datasets will be created and my memory consumption will increase 10-fold.
e.g.
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
The above example has three transformations: flatMap, map and reduceByKey. Does that imply I need 3X the memory for data of size X?
Is my understanding correct? Is caching the RDD the only solution to address this issue?
Once I start caching, it may spill over to disk due to its large size, and performance would be impacted by disk IO operations. In that case, is the performance of Hadoop and Spark comparable?
EDIT:
From the answer and comments, I have understood lazy evaluation and the pipelined execution. My assumption of 3X memory, where X is the initial RDD size, is not accurate.
But is it possible to cache a 1X RDD in memory and update it over the pipeline? How does cache() work?
First off, the lazy execution means that functional composition can occur:
scala> val rdd = sc.makeRDD(List("This is a test", "This is another test",
"And yet another test"), 1)
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[70] at makeRDD at <console>:27
scala> val counts = rdd.flatMap(line => {println(line);line.split(" ")}).
| map(word => {println(word);(word,1)}).
| reduceByKey((x,y) => {println(s"$x+$y");x+y}).
| collect
This is a test
This
is
a
test
This is another test
This
1+1
is
1+1
another
test
1+1
And yet another test
And
yet
another
1+1
test
2+1
counts: Array[(String, Int)] = Array((And,1), (is,2), (another,2), (a,1), (This,2), (yet,1), (test,3))
First, note that I force the parallelism down to 1 so that we can see how this looks on a single worker. Then I add a println to each of the transformations so that we can see how the workflow moves. You can see that it processes the line, then it processes the output of that line, followed by the reduction. So there are not separate states stored for each transformation as you suggested. Instead, each piece of data is run through the entire chain of transformations until a shuffle is needed, which can also be seen in the DAG visualization from the UI.
That is the win from the laziness. As to Spark vs Hadoop, there is already a lot out there (just google it), but the gist is that Spark tends to utilize network bandwidth out of the box, giving it a boost right there. Then, there are a number of performance improvements gained by laziness, especially if a schema is known and you can utilize the DataFrames API.
So, overall, Spark beats MR hands down in just about every regard.
The memory requirement of Spark is not 10 times the data size just because you have 10 transformations in your Spark job. When you specify the steps of transformations in a job, Spark builds a DAG which allows it to execute all the steps in the job. After that it breaks the job down into stages. A stage is a sequence of transformations which Spark can execute on a dataset without shuffling.
When an action is triggered on the RDD, Spark evaluates the DAG. It simply applies all the transformations in a stage together until it hits the end of the stage, so it is unlikely for the memory pressure to be 10 times the data size unless each transformation leads to a shuffle (in which case it is probably a badly written job).
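If you want to see where the stage boundary falls, you can print the lineage of the word-count RDD from the example above; roughly, everything up to reduceByKey is pipelined into one stage, and the shuffle introduced by reduceByKey starts the next one:

val counts = sc.textFile("hdfs://...")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Prints the RDD lineage; the ShuffledRDD produced by reduceByKey sits on top
// of a pipelined chain of MapPartitionsRDDs, i.e. two stages in total.
println(counts.toDebugString)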
I would recommend watching this talk and going through the slides.
Here is a job I'm running from the Spark shell:
val l = sc.parallelize((1 to 5000000).toList)
val m = l.map(m => m*23)
m.take(5000000)
Workers appear to be in the "LOADING" state.
What is "LOADING" state ?
Update:
As I understand it, take will perform the job on the cluster and then return the results to the driver. So does the "LOADING" state equate to the data being loaded onto the driver?
I believe that if you do something like this,
(1 to 5000000).toList
You are bound to encounter java.lang.OutOfMemoryError: GC overhead limit exceeded.
This happens when the JVM realizes that it is spending too much time in garbage collection. By default, the JVM is configured to throw this error if you are spending more than 98% of the total time in GC and less than 2% of the heap is recovered after the GC.
In this particular case you are creating a new instance of List for every iteration (immutability, so each time a new instance of List is returned), which means each iteration leaves behind a useless instance of List. For a List with a size in the millions, this takes a lot of memory and triggers GC very frequently; each GC run also has to free a lot of memory and hence takes a lot of time.
This ultimately leads to the error java.lang.OutOfMemoryError: GC overhead limit exceeded.
What would happen if this safeguard were not there? The little amount of memory the GC was able to clean would quickly be filled again, forcing the GC to restart the cleaning process. This forms a vicious cycle where the CPU is 100% busy with GC and no actual work can be done. The application faces extreme slowdowns: operations which used to complete in milliseconds now take minutes to finish.
This is a pre-emptive fail-fast safeguard implemented in JVMs.
You can disable this safeguard by using the following Java option:
-XX:-UseGCOverheadLimit
But I will strongly recommend NOT doing this.
And even if you disable this feature (or if your Spark cluster avoids it to some extent by allocating a large heap), something like
(1 to 5000000).toList
will still take a long, long time.
Also, I have a strong feeling that systems like Spark, which are supposed to be running multiple jobs, are configured (by default; you may be able to override this) to pause or reject such jobs as soon as they detect extreme GC, which could lead to starvation of other jobs. This may be the main reason your job is always shown as loading.
You can get a lot of relief by using a mutable list and appending values to it with a for loop. You can then parallelize your mutable list.
val mutableList = scala.collection.mutable.MutableList.empty[Int]
for (i <- 1 to 5000000) {
  mutableList.append(i)
}
val l = sc.parallelize(mutableList)
But even this will lead to multiple (though much less severe) memory allocations (and hence GC executions) whenever the list's backing storage fills up, which results in relocating the whole list into a newly allocated block of double the previous size.
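If that reallocation is a concern, one option (purely an illustration, using an array-backed buffer rather than the MutableList above) is to pre-size the buffer so the backing array never has to grow:

// Pre-allocate room for all 5,000,000 elements up front,
// so appends never trigger a resize-and-copy of the backing array.
val buffer = new scala.collection.mutable.ArrayBuffer[Int](5000000)
for (i <- 1 to 5000000) {
  buffer += i
}
val l = sc.parallelize(buffer)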
When decreasing the number of partitions one can use coalesce, which is great because it doesn't cause a shuffle and seems to work instantly (doesn't require an additional job stage).
I would like to do the opposite sometimes, but repartition induces a shuffle. I think a few months ago I actually got this working by using CoalescedRDD with balanceSlack = 1.0 - so what would happen is it would split a partition so that the resulting partitions' locations were all on the same node (so small net IO).
This kind of functionality is automatic in Hadoop; one just tweaks the split size. It doesn't seem to work this way in Spark unless one is decreasing the number of partitions. I think the solution might be to write a custom partitioner along with a custom RDD where we define getPreferredLocations ... but I thought that is such a simple and common thing to do that surely there must be a straightforward way of doing it?
Things tried:
.set("spark.default.parallelism", partitions) on my SparkConf, and when in the context of reading parquet I've tried sqlContext.sql("set spark.sql.shuffle.partitions= ..., which on 1.0.0 causes an error AND is not really what I want: I want the partition number to change across all types of jobs, not just shuffles.
Watch this space
https://issues.apache.org/jira/browse/SPARK-5997
This kind of really simple obvious feature will eventually be implemented - I guess just after they finish all the unnecessary features in Datasets.
I do not exactly understand what your point is. Do you mean you now have 5 partitions, but after the next operation you want the data distributed to 10? Because having 10 but still using 5 does not make much sense… The process of sending data to new partitions has to happen at some point.
When doing coalesce, you can get rid of unused partitions, for example: if you initially had 100, but then after reduceByKey you got 10 (as there were only 10 keys), you can use coalesce.
If you want the process to go the other way, you could just force some kind of partitioning:
[RDD].partitionBy(new HashPartitioner(100))
I'm not sure that's what you're looking for, but hope so.
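A small self-contained sketch of that idea (the sizes are arbitrary): partitionBy applies to key-value RDDs and lets you control the partitioner, while repartition works on any RDD; both involve a shuffle when data has to move.

import org.apache.spark.HashPartitioner

val rdd = sc.parallelize(1 to 1000000, 10)  // starts with 10 partitions

// Key-value RDD: spread it over more partitions with an explicit partitioner
val byKey = rdd.map(x => (x % 100, x)).partitionBy(new HashPartitioner(100))
println(byKey.getNumPartitions)  // 100

// Plain RDD: repartition does the same but always shuffles
println(rdd.repartition(100).getNumPartitions)  // 100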
As you know, PySpark uses a "lazy" way of running: it only does the computation when there is some action to perform (for example a df.count() or a df.show()). So what you can do is set the shuffle partitions between those actions.
You can write:
from pyspark.sql import functions as F

sparkSession.sql("set spark.sql.shuffle.partitions=100")
# your Spark code here, with some transformations and at least one action
df = df.withColumn("sum", F.sum(df.A).over(your_window_function))
df.count()  # your action

df = df.filter(df.B < 10)
df.count()

sparkSession.sql("set spark.sql.shuffle.partitions=10")
# you reduce the number of partitions because you know you will have a lot
# less data
df = df.withColumn("max", F.max(df.A).over(your_other_window_function))
df.count()  # your action