Spark dataframe.map() processed each row more than once - scala

Running this code:
(1)
val resultDf = myDataFrame.map(row => { println(s"$row"); row })
I see exactly one printout per row (I use "yarn logs -applicationId xxxx" to get the log). However, when the processing code is more complex:
(2)
val resultDf = myDataFrame.map(row => { println(s"$row"); /* complex processing code */})
I see about 2 to 3 times as many printouts as there are rows. In both cases, though, myDataFrame.count == resultDf.count.
Question: in case (2) I see more printouts. Is that because Spark runs dataFrame.map() in more containers for redundancy and throws away the extra results when the redundant executions all return successfully? Thanks.
BTW, I run the Spark jobs on AWS EMR with Spark 3.1.2.

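The result DataFrame is passed on downstream, and at one point it is referenced 3 times without being cached. Because transformations are lazy, each of those references that triggers an action re-runs myDataFrame.map(), which explains the extra printouts.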
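The effect described above can be reproduced with a small sketch. It is only an illustration under stated assumptions: the object name CacheReuseSketch, the helper mapInvocations, the three-element input and the local master are all made up for the example, and a LongAccumulator stands in for the println calls. Accumulator updates made inside a transformation can be applied more than once if a task is retried, so the counts are indicative rather than exact.

```scala
import org.apache.spark.sql.SparkSession

object CacheReuseSketch {
  // Runs `actions` collect() calls on a mapped Dataset and returns how many
  // times the map function was invoked in total (hypothetical helper).
  def mapInvocations(spark: SparkSession, cached: Boolean, actions: Int): Long = {
    import spark.implicits._
    // Stand-in for println: counts every call of the map function.
    val calls = spark.sparkContext.longAccumulator("map calls")
    val mapped = Seq(1, 2, 3).toDS().map { x => calls.add(1); x * 2 }
    // With cache(), the first action materializes the result and later
    // actions read it back instead of re-running the map function.
    val result = if (cached) mapped.cache() else mapped
    (1 to actions).foreach(_ => result.collect())
    calls.value
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local run, for illustration only
      .appName("cache-reuse-sketch")
      .getOrCreate()
    println(s"uncached: ${mapInvocations(spark, cached = false, actions = 3)}")
    println(s"cached:   ${mapInvocations(spark, cached = true, actions = 3)}")
    spark.stop()
  }
}
```

In the uncached case every action re-runs the map function, so the invocation count should grow with the number of actions, while in the cached case it should stay close to the row count.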

Related

Unnecessary Extra forEachPartition causing extra time to complete the Job

I am fetching data from HBase within a time range. I divide the time range into chunks and scan the HBase columns within each chunk.
Suppose I have a time range from Jun to Aug. I divide it into weekly chunks, which gives a list of 8 weekly time ranges.
From that list, I scan the HBase columns via repartition and mapPartitions like this:
sparkSession.sparkContext.parallelize(chunkedTimeRange.toList).repartition(noOfCores).mapPartitions{
// Scan Cols of Hbase Logic
// This gives DF as output
}
I get a DF from the above and apply some filtering to it using mapPartitions and foreachPartition like this:
df.mapPartitions{
rows => {
rows.toList.par.foreach(
cols => {
json.filter(condition).foreach(//code)
anotherJson.filter(condition).foreach(//code)
}
)
}
// returns DF
}
This DF is used by other methods. Since mapPartitions is lazy, I call an action after the above like this:
df.persist(StorageLevel.MEMORY_AND_DISK)
df.foreachPartition((x: Iterator[org.apache.spark.sql.Row]) => x: Unit)
This foreachPartition is unnecessarily executed twice: one stage takes around 2.5 min (128 tasks) and the other around 40 s (200 tasks).
200 is the value set in the Spark config:
spark.sql.shuffle.partitions=200
How can I avoid this unnecessary foreachPartition? Is there any other way to improve performance?
I found a similar question. Unfortunately, I didn't get much information from it.
Screenshot of foreachPartitions happening twice for same DF
If any clarification is needed, please mention it in a comment.
You need to "reuse" the persisted Dataframe:
val df2 = df.persist(StorageLevel.MEMORY_AND_DISK)
df2.foreachPartition((x: Iterator[org.apache.spark.sql.Row]) => x: Unit)
Otherwise, foreachPartition runs on a DataFrame that has not been persisted, and every step of the DataFrame's computation is executed again.

How do I understand that caching is used in Spark?

In my Scala/Spark application, I create a DataFrame that I plan to use several times throughout the program. That's why I decided to call the .cache() method on it. As you can see, inside the loop I filter the DataFrame several times with different values. For some reason, the .count() method always returns the same result, although it should return two different count values. I also notice strange behavior in Mesos: it feels like the .cache() method is not being executed. After creating the DataFrame, the program reaches the line if (!df.head(1).isEmpty) and spends a very long time there. I assumed that the caching process would run for a long time, and that the other operations would then use the cache and run quickly. What do you think is the problem?
import org.apache.spark.sql.DataFrame
var df: DataFrame = spark
.read
.option("delimiter", "|")
.csv("/path_to_the_files/")
.filter(col("col5").isin("XXX", "YYY", "ZZZ"))
df.cache()
var array1 = Array("111", "222")
var array2 = Array("333")
var storage = Array(array1, array2)
if (!df.head(1).isEmpty) {
for (item <- storage) {
df.filter(
col("col1").isin(item:_*)
)
println("count: " + df.count())
}
}
"In fact, it must return two different count values."
Why? You are calling count on the same df; the result of df.filter is discarded. Maybe you meant something like
val df1 = df.filter(...)
println("count: " + df1.count())
"I assumed that the caching process would run for a long time, and the other processes would use this cache and run quickly."
It does, but only when the first action that depends on this DataFrame is executed, and head is that action. So this is exactly what you should expect:
"the program goes to this part of code if (!df.head(1).isEmpty) and performs it for a very long time"
Without caching, you'd also get the same time for both df.count() calls, unless Spark detects it and enables caching on its own.
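The fix suggested above (count the filtered DataFrame, not the original one) could look roughly like the following sketch. The object name FilterCountSketch, the helper countsPerGroup and the three-row sample data are invented for the illustration; only col1, col5 and the two value groups come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object FilterCountSketch {
  // Counts the rows matching each group of col1 values. The filtered
  // DataFrame is what gets counted, so each group has its own result.
  def countsPerGroup(df: DataFrame, groups: Seq[Seq[String]]): Seq[Long] =
    groups.map(values => df.filter(col("col1").isin(values: _*)).count())

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local run, for illustration only
      .appName("filter-count-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the CSV input of the question.
    val df = Seq(("111", "XXX"), ("222", "YYY"), ("333", "ZZZ"))
      .toDF("col1", "col5")
      .cache() // lazy: only filled when the first action runs

    val storage = Seq(Seq("111", "222"), Seq("333"))
    countsPerGroup(df, storage).foreach(c => println("count: " + c))
    spark.stop()
  }
}
```

The first count pays for reading the input and filling the cache; the second one can then be served from the cached data.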

Scala: Process dataframe while value in column meets condition

I have to process a huge dataframe and download files from a service by the id column of the dataframe. The download logic and all the changes are prepared, but I am not sure what the best way is to build a loop around this. I run this on Databricks, which is why I need to perform the processing in chunks.
The dataframe has a "status" column, which can hold the following values:
"todo", "processing", "failed", "succeeded"
In the while loop I want to perform the following tasks:
while (there are rows with status "todo") {
- get the first 10 rows if status is todo (DONE)
- start processing the dataframe, update status to processing (DONE)
- download files (call UDF), update status to succeeded or failed
(DONE, not in the code here)
}
I would like to run this until every row has a status other than todo. The problem is that this while loop never finishes, because the dataframe itself is not updated. The result needs to be assigned to another dataframe, but then how do I feed the new one back into the loop?
My code right now:
while(statusDoc.where("status == 'todo'").count > 0) {
val todoDF = test.filter("status == 'todo'")
val processingDF = todoDF.limit(10).withColumn("status", when(col("status") === "todo", "processing")
.otherwise(col("status")))
statusDoc.join(processingDF, Seq("id"), "outer")
.select($"id",
statusDoc("fileUrl"),
coalesce(processingDF("status"), statusDoc("status")).alias("status"))
}
The join should go like this:
val update = statusDoc.join(processingDF, Seq("id"), "outer")
.select($"id", statusDoc("fileUrl"),
coalesce(processingDF("status"), statusDoc("status")).alias("status"))
Then this new update dataframe should be used for the next round of the loop.
One thing to remember here is that Spark DataFrames are immutable because they are distributed. If you could modify one in place, you would have no guarantee that the modification is properly propagated across all the executors, and no guarantee that a given portion of the data has not already been used somewhere else (on another node, for example).
One thing you can do, though, is add another column with the updated values and remove the old column.
val update = statusDoc
.withColumnRenamed("status", "status_doc")
.join(processingDF, Seq("id"), "outer")
// prefer the new status from processingDF, fall back to the old one from statusDoc
.withColumn("updated_status", udf((stnew: String, stold: String) => if (stnew != null) stnew else stold).apply(col("status"), col("status_doc")))
.drop("status_doc", "status")
.withColumnRenamed("updated_status", "status")
.select("id", "fileUrl", "status")
Then make sure you replace "statusDoc" with the "update" DataFrame. Do not forget to make the DataFrame a "var" instead of a "val". I'm surprised your IDE has not yelled yet.
Also, I'm sure you can think of a way of distributing the problem so that you avoid the while loop. I can help you do that, but I need a clearer description of your issue. If you use a while loop, you won't use the full capabilities of your cluster, because the loop itself only runs on the driver (master) and you process only 10 rows per iteration. I'm sure you can append all the data you need to the whole DataFrame in a single map operation.
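The "replace statusDoc with the updated DataFrame" advice can be sketched as follows. This is a minimal illustration, not the asker's actual pipeline: StatusLoopSketch, markBatch, drain and the sample rows are hypothetical, the download step is left out (rows are only moved from "todo" to "processing"), and localCheckpoint() is an added assumption to materialize each round so that the lineage does not grow with every iteration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, lit}

object StatusLoopSketch {
  // One round: mark up to batchSize "todo" rows as "processing" and
  // return a NEW DataFrame; the input DataFrame is never modified.
  def markBatch(statusDoc: DataFrame, batchSize: Int): DataFrame = {
    val processing = statusDoc
      .filter(col("status") === "todo")
      .limit(batchSize)
      .select(col("id"), lit("processing").as("new_status"))
    statusDoc
      .join(processing, Seq("id"), "left")
      .select(col("id"), col("fileUrl"),
        coalesce(col("new_status"), col("status")).as("status"))
  }

  // Re-assigns the var each round, so the loop condition sees the update.
  def drain(initial: DataFrame, batchSize: Int): DataFrame = {
    var statusDoc = initial
    while (statusDoc.filter(col("status") === "todo").count() > 0) {
      // localCheckpoint() eagerly materializes the round's result.
      statusDoc = markBatch(statusDoc, batchSize).localCheckpoint()
    }
    statusDoc
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local run, for illustration only
      .appName("status-loop-sketch")
      .getOrCreate()
    import spark.implicits._
    val initial = (1 to 5).map(i => (i, s"url$i", "todo")).toDF("id", "fileUrl", "status")
    drain(initial, batchSize = 2).show()
    spark.stop()
  }
}
```

The loop terminates because every round turns at least one "todo" row into another status, and the condition is evaluated against the re-assigned DataFrame.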

Time gap between Two tasks in Spark

I am inserting data into a Hive table iteratively in Spark.
For example: let's say there are 10,000 items. First these items are split into 5 lists of 2,000 items each. After that I iterate over those 5 lists.
In each iteration, the 2,000 items map to many more rows, so at the end of an iteration 15M records are inserted into the Hive table. Each iteration completes in 40 minutes.
The issue appears after each iteration: Spark waits before starting the next 2,000 items. The waiting time is about 90 minutes! During that gap, there are no active tasks in the Spark web UI.
By the way, the iterations start directly with Spark processing. There is no Scala or Java code at the beginning or at the end of an iteration.
Any idea?
Thanks
val itemSeq = uniqueIDsDF.select("unique_id").map(r => r.getLong(0)).collect.toSeq // Get 10K items
val itemList = itemSeq.sliding(2000,2000).toList // Create 5 Lists
itemList.foreach(currItem => {
//starting code. (iteration start)
val currListDF = currItem.toDF("unique_id")
val currMetadataDF = hive_raw_metadata.join(broadcast(currListDF),Seq("unique_id"),"inner")
currMetadataDF.registerTempTable("metaTable")
// further logic here ....
})
I found the reason: even though the insert task appears completed in the Spark UI, the insert process still continues in the background. The new iteration only starts after the write to HDFS has completed. That is the reason for the gap in the web UI.
As far as I understand, you are trying to divide the DataFrame, pass the data on in batches, and do some processing as in your pseudo code, which was not so clear.
As you mentioned above in your answer, whenever an action happens, it
will take some time to insert into the sink.
But basically, I feel your sliding logic can be improved like this...
Based on the above assumption, I have 2 options for you. You can choose the most suitable one...
Option #1:(foreachPartitionAsync : AsyncRDDActions)
I would suggest using the grouping capabilities of the partition iterator
df.repartition(numofpartitionsyouwant) // numPartitions
.rdd.foreachPartitionAsync { partitionIterator => // partition-wise processing to the sink should be faster than the approach you are adopting
partitionIterator.grouped(2000).foreach { group =>
group.foreach { row =>
// do your insertions here, or whatever you want to do
}
}
}
Note: the RDD action is executed in the background. All of these executions are submitted to the Spark scheduler and run concurrently. Depending on your Spark cluster size, some of the jobs may wait until executors become available for processing.
Option #2 :
The second approach is randomSplit on the DataFrame, which I think you can use here to divide it into roughly equal-sized DataFrames. It returns an array of DataFrames whose sizes follow the weights (approximately, since the split is random), even if the weights sum to more than 1.
Note: the weights (the first argument of randomSplit) will be normalized if they don't sum to 1.
DataFrame[] randomSplit(double[] weights): randomly splits this DataFrame with the provided weights.
refer randomSplit code here
it will be like ..
val equalsizeddfArray = yourdf.randomSplit(Array(0.3, 0.3, 0.3, 0.3, 0.3)) // intentionally gave sum of weights > 1 (in your case, 10000 records split into an array of 5 dataframes of roughly 2000 records each)
and then...
for (i <- 0 until equalsizeddfArray.length) {
// your logic ....
}
Note :
Above logic is sequential...
If you want to execute them in parallel (if they are independent) you can use
import scala.concurrent._
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
// Now wait for the tasks to finish before exiting the app
Await.result(Future.sequence(Seq(yourtaskfuncOndf1(),yourtaskfuncOndf2()...,yourtaskfuncOndf10())), Duration(10, MINUTES))
Of the 2 options above, I would prefer approach #2, since randomSplit takes care of dividing the data into roughly equal-sized parts (by normalizing the weights).
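The Future-based parallel execution mentioned above can be sketched without Spark, using plain Scala collections in place of DataFrames. The names ParallelBatchesSketch, processBatch and runAll are hypothetical, and processBatch is only a placeholder for the real per-batch work (for example, one Hive insert per batch of 2,000 ids).

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object ParallelBatchesSketch {
  // Placeholder for the real per-batch work; it just reports the batch size.
  def processBatch(batch: Seq[Long]): Int = batch.size

  // Splits the items into fixed-size batches, starts one Future per batch,
  // and blocks until all of them have finished.
  def runAll(items: Seq[Long], batchSize: Int): Seq[Int] = {
    // grouped(n) yields the same chunks as sliding(n, n) in the question.
    val batches = items.grouped(batchSize).toList
    val futures = batches.map(batch => Future(processBatch(batch)))
    Await.result(Future.sequence(futures), 10.minutes)
  }

  def main(args: Array[String]): Unit = {
    val results = runAll(1L to 10000L, batchSize = 2000)
    println(results.mkString(", "))
  }
}
```

When the batches trigger Spark jobs, running them from separate Futures lets the scheduler overlap them, but they still compete for the same executors, so the speed-up depends on the cluster size.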

Caching Large Dataframes in Spark Effectively

I am currently working with 11,000 files. Each file generates a data frame, which is unioned with the previous one. Below is the code:
var df1 = sc.parallelize(Array(("temp",100 ))).toDF("key","value").withColumn("Filename", lit("Temp") )
files.foreach( filename => {
val a = filename.getPath.toString()
val m = a.split("/")
val name = m(6)
println("FILENAME: " + name)
if (name == "_SUCCESS") {
println("Cannot Process '_SUCCESS' Filename")
} else {
val freqs=doSomething(a).toDF("key","value").withColumn("Filename", lit(name) )
df1=df1.unionAll(freqs)
}
})
First, I got a java.lang.StackOverflowError on the 11,000 files. Then I added the following line after df1=df1.unionAll(freqs):
df1=df1.cache()
That resolves the problem, but it gets slower after each iteration. Can somebody please suggest what should be done to avoid the StackOverflowError without the slowdown?
Thanks!
The issue is that Spark manages a dataframe as a set of transformations. It begins with the "toDF" of the first dataframe, then performs the transformations on it (e.g. withColumn), then the unionAll with the previous dataframe, and so on.
The unionAll is just another such transformation, and the tree becomes very deep (with 11K unionAll calls you have an execution tree of depth 11K). Traversing a tree that deep while building the plan can overflow the stack.
The caching doesn't solve this. However, I imagine you are adding some action along the way (otherwise nothing would run besides building the transformations). When you perform caching, Spark might skip some of the steps, and therefore the stack overflow would simply arrive later.
You can go back to RDDs for iterative processes (your example is actually not iterative but purely parallel, so you can simply keep each separate dataframe along the way, then convert to RDDs and use the RDD union).
Since your case seems to be just unioning a bunch of dataframes without true iterations, you can also do the union in a tree manner (i.e. union pairs, then union pairs of pairs, etc.). This changes the depth from O(N) to O(log N), where N is the number of unions.
Lastly, you can write the dataframe to disk and read it back. The idea is that after every X (e.g. 20) unions, you do df1.write.parquet(filex) and then df1 = spark.read.parquet(filex). When you read it back, the lineage of the dataframe is just the file read itself. The cost, of course, is the writing and reading of the file.
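The tree-shaped union described above can be sketched as a generic pairwise reduction. The helper below is plain Scala and does not depend on Spark; TreeUnionSketch and treeReduce are hypothetical names. With DataFrames, the combine function would be something like (a, b) => a.union(b), applied to the list of per-file DataFrames.

```scala
object TreeUnionSketch {
  // Combines neighbouring pairs, then pairs of pairs, and so on, so the
  // nesting depth of the result grows with log2(N) instead of N.
  @annotation.tailrec
  def treeReduce[A](items: Seq[A])(combine: (A, A) => A): A = {
    require(items.nonEmpty, "need at least one item")
    if (items.size == 1) items.head
    else {
      // Each group holds one or two elements; a lone element passes through.
      val nextLevel = items.grouped(2).map(_.reduce(combine)).toList
      treeReduce(nextLevel)(combine)
    }
  }

  def main(args: Array[String]): Unit = {
    // Strings make the shape of the resulting tree visible.
    println(treeReduce(Seq("a", "b", "c", "d"))((x, y) => s"($x+$y)"))
    // prints "((a+b)+(c+d))"
  }
}
```

A left-to-right fold over the same four items would instead produce (((a+b)+c)+d), which is the linear shape that leads to the deep plan in the question.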