Unnecessary extra foreachPartition causing extra time to complete the job - Scala

I'll be getting data from HBase within a TimeRange, so I divided the time range into chunks and scan the columns from HBase within each chunked TimeRange.
Suppose I have a TimeRange from Jun to Aug; I divide it into weeks, which gives a list of 8 weekly TimeRanges.
From that, I scan the columns of HBase via repartition & mapPartitions like:
sparkSession.sparkContext
  .parallelize(chunkedTimeRange.toList)
  .repartition(noOfCores)
  .mapPartitions {
    // Scan cols of HBase logic
    // This gives a DF as output
  }
I get a DF from the above and do some filtering on that DF using mapPartitions and foreachPartition like:
df.mapPartitions { rows =>
  rows.toList.par.foreach { cols =>
    json.filter(condition).foreach( /* code */ )
    anotherJson.filter(condition).foreach( /* code */ )
  }
  // returns DF
}
This DF is used by other methods. Since mapPartitions is lazy, I called an action after the above, like:
df.persist(StorageLevel.MEMORY_AND_DISK)
df.foreachPartition((x: Iterator[org.apache.spark.sql.Row]) => x: Unit)
This foreachPartition is unnecessarily executing twice. One stage takes around 2.5 min (128 tasks) and the other 40 s (200 tasks), which is not necessary.
200 is the value set in the Spark config
spark.sql.shuffle.partitions=200.
How can I avoid this unnecessary foreachPartition? Is there any way I can still make this better in terms of performance?
I found a similar question. Unfortunately, I didn't get much information from it.
Screenshot of foreachPartition happening twice for the same DF
If any clarification is needed, please mention it in a comment.

You need to "reuse" the persisted Dataframe:
val df2 = df.persist(StorageLevel.MEMORY_AND_DISK)
df2.foreachPartition((x: Iterator[org.apache.spark.sql.Row]) => x: Unit)
Otherwise, when running the foreachPartition, it runs on a DF which has not been persisted, and it does every step of the DF computation again.
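For illustration, a minimal sketch of that pattern, persisting once, materializing with a cheap action, and then reusing the same reference; the count() call and otherMethod are illustrative placeholders, not the original code:
import org.apache.spark.storage.StorageLevel

val persisted = df.persist(StorageLevel.MEMORY_AND_DISK)
persisted.count() // forces the computation so the persisted data actually gets populated

// Every later use must go through `persisted`, not a freshly derived plan.
persisted.foreachPartition((x: Iterator[org.apache.spark.sql.Row]) => x: Unit)
// otherMethod(persisted) // hypothetical downstream consumer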

Related

Spark performance issues when iteratively adding multiple columns

I am seeing performance issues when iteratively adding columns (around 100) to a DataFrame.
I know that it is more efficient to use select to add multiple columns; however, I have to add the columns in order because column 2 may depend on column 1, etc.
The columns are being added following a join which resulted in skew, so I have explicitly repartitioned by a salt key to evenly distribute data on the cluster.
When I ran locally I was seeing OOM errors even for fairly small (100 row, 500 column) datasets.
I was able to get the job running locally by checkpointing after the addition of every x columns, so I suspect Spark lineage issues are causing my problems; however, I am still unable to run the job at scale on the cluster.
Any advice on where to look or on best practice in this scenario would be greatly appreciated.
At a high level my job looks like this:
val df1 = ??? // Millions of rows, ~500 cols, from parquet
val df2 = ??? // 1000 rows, from parquet
val newExpressions = ??? // 100 rows, from Oracle
val joined = df1.join(broadcast(df2), <join expr>)
val newColumns = newExpressions.collectAsList.map(<get columnExpr and columnName>)
val salted = joined.withColumn("salt", rand()).repartition(x, col("salt"))
newColumns.foldLeft(salted) {
  case (df, row) => df.withColumn(row.name, expr(row.expression))
} // Checkpointing after every x columns seems to help
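For reference, a rough sketch of the periodic checkpointing mentioned above, assuming a SparkSession named spark, that newColumns is a Scala collection whose elements expose name and expression, and placeholder values for the batch size and checkpoint directory:
import org.apache.spark.sql.functions.expr

spark.sparkContext.setCheckpointDir("/tmp/checkpoint_dir") // placeholder directory
val batchSize = 10 // placeholder for "x"

val result = newColumns.zipWithIndex.foldLeft(salted) {
  case (df, (row, idx)) =>
    val withNewCol = df.withColumn(row.name, expr(row.expression))
    // Cut the lineage every batchSize columns so the plan does not grow unbounded.
    if ((idx + 1) % batchSize == 0) withNewCol.checkpoint() else withNewCol
}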
Cheers
Terry

How do I understand that caching is used in Spark?

In my Scala/Spark application, I create a DataFrame. I plan to use this DataFrame several times throughout the program, which is why I decided to use the .cache() method on it. As you can see, inside the loop I filter the DataFrame several times with different values. For some reason the .count() method always returns the same result. In fact, it should return two different count values. Also, I notice strange behavior in Mesos: it feels like the .cache() method is not being executed. After creating the DataFrame, the program goes to this part of the code, if (!df.head(1).isEmpty), and performs it for a very long time. I assumed that the caching process would run for a long time and that the other processes would use this cache and run quickly. What do you think the problem is?
import org.apache.spark.sql.DataFrame

var df: DataFrame = spark
  .read
  .option("delimiter", "|")
  .csv("/path_to_the_files/")
  .filter(col("col5").isin("XXX", "YYY", "ZZZ"))
df.cache()

var array1 = Array("111", "222")
var array2 = Array("333")
var storage = Array(array1, array2)

if (!df.head(1).isEmpty) {
  for (item <- storage) {
    df.filter(
      col("col1").isin(item:_*)
    )
    println("count: " + df.count())
  }
}
In fact, it must return two different count values.
Why? You are calling it on the same df. Maybe you meant something like
val df1 = df.filter(...)
println("count: " + df1.count())
I assumed that the caching process would run for a long time, and the other processes would use this cache and run quickly.
It does, but only when the first action that depends on this DataFrame is executed, and head is that action. So you should expect exactly this:
the program goes to this part of code if (!df.head(1).isEmpty) and performs it for a very long time
Without caching, you'd also get the same time for both df.count() calls, unless Spark detects it and enables caching on its own.
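Putting it together, a minimal sketch of the corrected loop, reusing the df, storage, and columns defined in the question:
// df is already cached above; head(1) is the first action, so it populates the cache.
if (!df.head(1).isEmpty) {
  for (item <- storage) {
    // Assign the filtered DataFrame and count that, instead of counting df itself.
    val filtered = df.filter(col("col1").isin(item: _*))
    println("count: " + filtered.count())
  }
}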

Time gap between Two tasks in Spark

I am inserting data into a Hive table with iterations in Spark.
For example: let's say there are 10,000 items. First these items are separated into 5 lists, each list with 2,000 items. After that I iterate over those 5 lists.
In each iteration, 2,000 items map to many more rows, so at the end of an iteration 15M records are inserted into the Hive table. Each iteration completes in 40 minutes.
The issue is that after each iteration, Spark waits before starting on the next 2,000 items. The waiting time is about 90 minutes! In that time gap, there are no active tasks in the Spark web UI below.
By the way, the iterations start directly with the Spark process; there is no Scala or Java code at the beginning or end of the iterations.
Any idea?
Thanks
val itemSeq = uniqueIDsDF.select("unique_id").map(r => r.getLong(0)).collect.toSeq // Get 10K items
val itemList = itemSeq.sliding(2000, 2000).toList // Create 5 lists

itemList.foreach(currItem => {
  // starting code (iteration start)
  val currListDF = currItem.toDF("unique_id")
  val currMetadataDF = hive_raw_metadata.join(broadcast(currListDF), Seq("unique_id"), "inner")
  currMetadataDF.registerTempTable("metaTable")
  // further logic here ....
})
I found the reason: even if the insert task seems completed in the Spark UI, the insert process still continues in the background. Only after writing to HDFS is completed does the new iteration start. That is the reason for the gap in the web UI.
AFAIK, I understand that you are trying to divide the DataFrame and pass the data in batches to do some processing, as in your pseudo code, which was not so clear.
As you mentioned above in your answer, whenever an action happens it will take some time for the insertion into the sink.
But basically, I feel your sliding logic can be improved.
Based on that assumption, I have 2 options for you; you can choose the most suitable one...
Option #1: (foreachPartitionAsync: AsyncRDDActions)
I would suggest using the partition iterator's grouping capability:
// Repartition first; the result must be assigned, since repartition returns a new DataFrame.
val repartitioned = df.repartition(numofpartitionsyouwant)

// Partition-wise processing to the sink should be faster than the approach you are adopting.
repartitioned.rdd.foreachPartitionAsync { partitionIterator =>
  partitionIterator.grouped(2000).foreach { group =>
    group.foreach { record =>
      // do your insertions here, or whatever you wanted to ....
    }
  }
}
Note: the RDD action will be executed in the background. All of these executions are submitted to the Spark scheduler and run concurrently; depending on your Spark cluster size, some of the jobs may wait until executors become available for processing.
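As a usage note, foreachPartitionAsync returns a FutureAction, so if you need the inserts to finish before moving on you can block on it; a minimal sketch, with the sink call left as a placeholder:
import scala.concurrent.Await
import scala.concurrent.duration._

val insertAction = repartitioned.rdd.foreachPartitionAsync { partitionIterator =>
  partitionIterator.grouped(2000).foreach { group =>
    group.foreach(record => ()) // placeholder: write `record` to the sink here
  }
}
Await.result(insertAction, 2.hours) // FutureAction extends scala.concurrent.Future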
Option #2:
The second approach is DataFrame.randomSplit, which I think you can use in this case to divide the data into equally sized DataFrames. It returns an array of DataFrames split according to the provided weights, even if the sum of the weights is > 1.
Note: the weights (the argument of randomSplit) will be normalized if they don't sum to 1.
DataFrame[] randomSplit(double[] weights) — Randomly splits this DataFrame with the provided weights.
Refer to the randomSplit code here.
It will look like this:
// intentionally gave a sum of weights > 1; they will be normalized
// (in your case: the 10,000-record DataFrame becomes an array of 5 DataFrames of ~2,000 records each)
val equalsizeddfArray = yourdf.randomSplit(Array(0.3, 0.3, 0.3, 0.3, 0.3))
and then...
for (i <- 0 until equalsizeddfArray.length) {
  // your logic ....
}
Note:
The above logic is sequential.
If you want to execute them in parallel (and they are independent), you can use:
import scala.concurrent._
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
// Now wait for the tasks to finish before exiting the app
Await.result(
  Future.sequence(Seq(yourtaskfuncOndf1(), yourtaskfuncOndf2(), ..., yourtaskfuncOndf10())),
  Duration(10, MINUTES)
)
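For a fuller picture, a hedged sketch of that parallel variant, assuming each split can be processed independently; processSplit and the Hive table name are placeholders, not the original code:
import org.apache.spark.sql.DataFrame
import scala.concurrent._
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Placeholder for your per-split logic (e.g. join with metadata and insert into Hive).
def processSplit(split: DataFrame): Future[Unit] = Future {
  // split.write.insertInto("your_hive_table") // assumption: an existing Hive table
}

val futures = equalsizeddfArray.map(processSplit)
Await.result(Future.sequence(futures.toSeq), Duration(10, MINUTES))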
Out of the above 2 options, I would prefer approach #2, since the randomSplit function takes care of dividing the data into equal sizes (by normalizing the weights) for processing.

Caching Large Dataframes in Spark Effectively

I am currently working with 11,000 files. Each file generates a DataFrame which is unioned with the previous one. Below is the code:
var df1 = sc.parallelize(Array(("temp", 100))).toDF("key", "value").withColumn("Filename", lit("Temp"))

files.foreach(filename => {
  val a = filename.getPath.toString()
  val m = a.split("/")
  val name = m(6)
  println("FILENAME: " + name)
  if (name == "_SUCCESS") {
    println("Cannot process '_SUCCESS' filename")
  } else {
    val freqs = doSomething(a).toDF("key", "value").withColumn("Filename", lit(name))
    df1 = df1.unionAll(freqs)
  }
})
First, I got a java.lang.StackOverflowError on the 11,000 files. Then I added the following line after df1=df1.unionAll(freqs):
df1=df1.cache()
It resolves the problem, but after each iteration it gets slower. Can somebody please suggest what should be done to avoid the StackOverflowError without a decrease in speed?
Thanks!
The issue is that Spark manages a DataFrame as a set of transformations. It begins with the "toDF" of the first DataFrame, then performs the transformations on it (e.g. withColumn), then unionAll with the previous DataFrame, etc.
The unionAll is just another such transformation, and the tree becomes very long (with 11K unionAll calls you have an execution tree of depth 11K). The unionAll, when building this information, can get into a stack overflow situation.
The caching doesn't solve this. However, I imagine you are adding some action along the way (otherwise nothing would run besides building the transformations). When you perform caching, Spark might skip some of the steps and therefore the stack overflow would simply arrive later.
You can go back to RDDs for an iterative process (your example is actually not iterative but purely parallel): you can simply save each separate DataFrame along the way, then convert them to RDDs and use an RDD union.
Since your case seems to be just unioning a bunch of DataFrames without true iteration, you can also do the union in a tree manner (i.e. union pairs, then pairs of pairs, etc.); this changes the depth from O(N) to O(log N), where N is the number of unions.
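A minimal sketch of the tree-style union, assuming the per-file DataFrames have identical schemas and are first collected into a Scala Seq (the helper name is mine):
import org.apache.spark.sql.DataFrame

// Union DataFrames pairwise so the plan depth is O(log N) instead of O(N).
def treeUnion(dfs: Seq[DataFrame]): DataFrame = {
  if (dfs.size == 1) dfs.head
  else treeUnion(
    dfs.grouped(2).map {
      case Seq(a, b) => a.unionAll(b)
      case Seq(a)    => a
    }.toSeq
  )
}

// Usage sketch: build one small DataFrame per file, then union them as a tree.
// val perFileDfs: Seq[DataFrame] = ... // one DataFrame per input file
// val df1 = treeUnion(perFileDfs)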
Lastly, you can read and write the DataFrame to/from disk. The idea is that after every X (e.g. 20) unions, you would do df1.write.parquet(filex) and then df1 = spark.read.parquet(filex). When you read it back, the lineage of the single DataFrame is just the file read itself. The cost, of course, is writing and reading the file.
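A rough sketch of this last suggestion, adapted to the loop from the question; the path, the batch size of 20, and the spark session name are assumptions:
val checkpointEvery = 20               // the "X" from the suggestion above
val basePath = "/tmp/union_checkpoint" // placeholder location
var counter = 0

files.foreach { filename =>
  val name = filename.getPath.toString.split("/")(6)
  if (name != "_SUCCESS") {
    val freqs = doSomething(filename.getPath.toString).toDF("key", "value").withColumn("Filename", lit(name))
    df1 = df1.unionAll(freqs)
    counter += 1
    if (counter % checkpointEvery == 0) {
      val path = s"$basePath/$counter"
      df1.write.mode("overwrite").parquet(path)
      df1 = spark.read.parquet(path) // the lineage now starts from this file read
    }
  }
}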

Spark DataFrame row count is inconsistent between runs

When I run my Spark job (version 2.1.1) on EMR, each run counts a different number of rows in a DataFrame. I first read data from S3 into 4 different DataFrames; these counts are always consistent. Then, after joining the DataFrames, the result of the join has a different count on each run. Afterwards I also filter the result, and that too has a different count on each run. The variations are small, a 1-5 row difference, but it's still something I would like to understand.
This is the code for the join:
val impJoinKey = Seq("iid", "globalVisitorKey", "date")
val impressionsJoined: DataFrame = impressionDsNoDuplicates
.join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
.join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
.join(chartSiteInstance, impJoinKey, "left")
.withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
.withColumn("url", coalesce($"realUrl", $"url"))
and this is for the filter:
val impressionsJoined: Dataset[ImpressionJoined] = impressionsJoinedFullDay.where($"timestamp".geq(new Timestamp(start.getMillis))).cache()
I have also tried using the filter method instead of where, but with the same results.
Any thoughts?
Thanks
Nir
Is it possible that one of the data sources changes over time?
Since impressionsJoined is not cached, Spark will reevaluate it from scratch on every action, and that includes reading the data again from the source.
Try caching impressionsJoined after the join.
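For illustration, a minimal sketch of that suggestion, reusing the joins and column names from the question; the count() call is only there to materialize the cache once:
val impressionsJoined: DataFrame = impressionDsNoDuplicates
  .join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
  .join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
  .join(chartSiteInstance, impJoinKey, "left")
  .withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
  .withColumn("url", coalesce($"realUrl", $"url"))
  .cache()

impressionsJoined.count() // forces one evaluation; later filters and counts reuse the cached data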