I am inserting data into a Hive table in iterations with Spark.
For example: let's say there are 10,000 items. They are first split into 5 lists of 2,000 items each, and I then iterate over those 5 lists.
In each iteration, the 2,000 items map to many more rows, so by the end of an iteration about 15M records have been inserted into the Hive table. Each iteration completes in 40 minutes.
The issue is the gap after each iteration: Spark waits about 90 minutes before starting on the next 2,000 items! During that gap there are no active tasks in the Spark web UI.
By the way, the iterations start directly with the Spark process; there is no other Scala or Java code at the beginning or end of an iteration.
Any idea?
Thanks
import org.apache.spark.sql.functions.broadcast
import spark.implicits._ // or sqlContext.implicits._ on Spark 1.x

val itemSeq = uniqueIDsDF.select("unique_id").map(r => r.getLong(0)).collect.toSeq // get the 10K items
val itemList = itemSeq.sliding(2000, 2000).toList                                  // create 5 lists of 2000

itemList.foreach { currItems =>
  // starting code (iteration start)
  val currListDF = currItems.toDF("unique_id")
  val currMetadataDF = hive_raw_metadata.join(broadcast(currListDF), Seq("unique_id"), "inner")
  currMetadataDF.registerTempTable("metaTable")
  // further logic here ....
}
I found the reason: even though the insert job looks completed in the Spark UI, the insert process is still running in the background. Only after the write to HDFS has finished does the next iteration start. That is the reason for the gap in the web UI.
AFAIK, I understand that you are trying to divide the DataFrame and process the data in batches, as in your pseudo code (which was not entirely clear).
As you mentioned above in your answer, whenever an action happens it takes some time for the insertion into the sink.
But basically, I feel your sliding logic can be improved like this...
Based on the above assumption, I have 2 options for you; you can choose the most suitable one...
Option #1: (foreachPartitionAsync : AsyncRDDActions)
I would suggest you use the DataFrame/RDD iterator grouping capabilities:
// repartition returns a new DataFrame, so keep the result
val repartitionedDF = df.repartition(numPartitions) // numPartitions = number of partitions you want

// partition-wise processing to the sink, so it should be faster than the approach you are adopting...
repartitionedDF.rdd.foreachPartitionAsync { partitionIterator =>
  partitionIterator.grouped(2000).foreach { group =>
    group.foreach { row =>
      // do your insertions here, or whatever you wanted to ....
    }
  }
}
Note: the RDD will be executed in the background. All of these executions are submitted to the Spark scheduler and run concurrently. Depending on the size of your Spark cluster, some of the jobs may wait until executors become available for processing.
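If you need the rest of the job to wait for these background inserts (which is what caused the gap you observed), note that foreachPartitionAsync returns a FutureAction you can block on. A minimal sketch, where writeRow is a hypothetical stand-in for your per-row insert logic and the timeout is an assumption:

import scala.concurrent.Await
import scala.concurrent.duration._

// foreachPartitionAsync returns a FutureAction[Unit]; keep the handle if you need to wait on it
val asyncJob = repartitionedDF.rdd.foreachPartitionAsync { partitionIterator =>
  partitionIterator.grouped(2000).foreach { group =>
    group.foreach(row => writeRow(row)) // hypothetical per-row insert helper
  }
}

// block until the background job has finished
Await.result(asyncJob, 2.hours)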
Option #2: (randomSplit)
The second approach is the DataFrame's randomSplit, which I think you can use in this case to divide the DataFrame into roughly equal-sized pieces. It returns an array of DataFrames whose relative sizes follow the provided weights.
Note: the weights (the first argument of randomSplit) will be normalized if they don't sum to 1.
DataFrame[] randomSplit(double[] weights)
Randomly splits this DataFrame with the provided weights.
Refer to the randomSplit code here.
It will be like...
// weights intentionally sum to more than 1 to show they get normalized; with 5 equal weights,
// your 10,000-record DataFrame becomes an array of 5 DataFrames of roughly 2,000 records each
val equalsizeddfArray = yourdf.randomSplit(Array(0.3, 0.3, 0.3, 0.3, 0.3))
and then...
for (i <- 0 until equalsizeddfArray.length) {
  // your logic ....
}
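For instance, the body of that loop could reuse the per-batch logic from the question; a hedged sketch (the target table name my_hive_table is an assumption):

import org.apache.spark.sql.functions.broadcast

for (i <- 0 until equalsizeddfArray.length) {
  // join each ~2,000-id split against the raw metadata, as in the question
  val currMetadataDF = hive_raw_metadata.join(broadcast(equalsizeddfArray(i)), Seq("unique_id"), "inner")

  // append the result to the target Hive table
  currMetadataDF.write.mode("append").insertInto("my_hive_table")
}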
Note:
The above logic is sequential...
If you want to execute the splits in parallel (provided they are independent), you can use:
import scala.concurrent._
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
// Now wait for the tasks to finish before exiting the app
Await.result(Future.sequence(Seq(yourtaskfuncOndf1(), yourtaskfuncOndf2(), ..., yourtaskfuncOndf10())), Duration(10, MINUTES))
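Concretely, each task function could wrap the per-split processing in a Future; a minimal sketch where processSplit stands in for your own per-split logic (an assumption):

import scala.concurrent._
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// wrap the per-split work (e.g. the insert from the sequential sketch above) in a Future
def processSplit(splitDF: org.apache.spark.sql.DataFrame): Future[Unit] = Future {
  // your logic for one split ....
}

// launch one Future per split and wait for all of them before exiting the app
val allSplits = Future.sequence(equalsizeddfArray.toSeq.map(processSplit))
Await.result(allSplits, Duration(10, MINUTES))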
Out of the above 2 options, I would prefer approach #2, since the randomSplit function takes care of dividing the data into equal-sized pieces (by normalizing the weights) for processing.
Related
I'll be getting data from HBase within a TimeRange. So, I divided the time range into chunks and I scan the columns from HBase within each chunked TimeRange.
Suppose I have a TimeRange from Jun to Aug; I divide it into weekly chunks, which gives a list of 8 weekly TimeRanges.
From that, I scan the columns of HBase via repartition & mapPartitions, like:
sparkSession.sparkContext
  .parallelize(chunkedTimeRange.toList)
  .repartition(noOfCores)
  .mapPartitions {
    // Scan cols of HBase logic
    // This gives a DF as output
  }
I get a DF from the above and do some filtering on that DF using mapPartitions and foreachPartition, like:
df.mapPartitions { rows =>
  rows.toList.par.foreach { cols =>
    json.filter(condition).foreach( /* code */ )
    anotherJson.filter(condition).foreach( /* code */ )
  }
  // returns DF
}
This DF is used by other methods. Since mapPartitions is lazy, I called an action after the above, like:
df.persist(StorageLevel.MEMORY_AND_DISK)
df.foreachPartition((x: Iterator[org.apache.spark.sql.Row]) => x: Unit)
This foreachPartition is unnecessarily executed twice: one stage takes around 2.5 min (128 tasks) and the other 40s (200 tasks), which is not necessary.
200 is the value set in the Spark config
spark.sql.shuffle.partitions=200.
How can I avoid this unnecessary foreachPartition? Is there any way I can still make it better in terms of performance?
I found a similar question. Unfortunately, I didn't get much information from it.
Screenshot of foreachPartition happening twice for the same DF
If any clarification is needed, please mention it in a comment.
You need to "reuse" the persisted DataFrame:
val df2 = df.persist(StorageLevel.MEMORY_AND_DISK)
df2.foreachPartition((x: Iterator[org.apache.spark.sql.Row]) => x: Unit)
Otherwise, when running the foreachPartition, it runs on a DF which has not been persisted, and it does every step of the DF computation again.
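Roughly, the pattern looks like this; a minimal sketch where otherMethod is a hypothetical stand-in for whatever consumes the DataFrame afterwards:

import org.apache.spark.storage.StorageLevel

// persist once and keep using the persisted reference for every later action
val df2 = df.persist(StorageLevel.MEMORY_AND_DISK)

// materialize the cache with a cheap action
df2.foreachPartition((x: Iterator[org.apache.spark.sql.Row]) => x: Unit)

// downstream consumers now read from the cache instead of recomputing the lineage
otherMethod(df2) // hypothetical downstream consumer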
Related
I am currently working on 11,000 files. Each file generates a DataFrame, which is unioned with the previous one. Below is the code:
var df1 = sc.parallelize(Array(("temp", 100))).toDF("key", "value").withColumn("Filename", lit("Temp"))

files.foreach { filename =>
  val a = filename.getPath.toString()
  val m = a.split("/")
  val name = m(6)
  println("FILENAME: " + name)
  if (name == "_SUCCESS") {
    println("Cannot process '_SUCCESS' filename")
  } else {
    val freqs = doSomething(a).toDF("key", "value").withColumn("Filename", lit(name))
    df1 = df1.unionAll(freqs)
  }
}
First, I got a java.lang.StackOverflowError on the 11,000 files. Then I added the following line after df1 = df1.unionAll(freqs):
df1=df1.cache()
It resolves the problem, but after each iteration it gets slower. Can somebody please suggest what I should do to avoid the StackOverflowError without a decrease in speed?
Thanks!
The issue is that Spark manages a DataFrame as a set of transformations. It begins with the "toDF" of the first DataFrame, then performs the transformations on it (e.g. withColumn), then unionAll with the previous DataFrame, etc.
The unionAll is just another such transformation, and the tree becomes very long (with 11K unionAll calls you have an execution tree of depth 11K). Building up that plan is what can run into a stack overflow.
The caching doesn't solve this. However, I imagine you are adding some action along the way (otherwise nothing would run besides building the transformations). When you perform caching, Spark might skip some of the steps, and therefore the stack overflow simply arrives later.
You can go back to RDDs for an iterative process (your example is actually not iterative but purely parallel; you can simply save each separate DataFrame along the way, then convert them to RDDs and use an RDD union).
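A hedged sketch of that RDD-based variant, assuming the per-file DataFrames share one schema and have already been collected into a list called dfs (both assumptions):

// SparkContext.union takes the whole list at once,
// so the resulting lineage does not grow one level deeper per file
val unionRDD = sc.union(dfs.map(_.rdd))
val unionDF = sqlContext.createDataFrame(unionRDD, dfs.head.schema)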
Since your case seems to be unioning a bunch of DataFrames without true iterations, you can also do the union in a tree manner (i.e. union pairs, then pairs of pairs, etc.); this changes the depth from O(N) to O(log N), where N is the number of unions.
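For example, a minimal sketch of such a tree-style (pairwise) union, again assuming the per-file DataFrames are in a list called dfs and keeping the unionAll used in the question:

import org.apache.spark.sql.DataFrame

// union pairs, then pairs of pairs, etc.; the plan depth becomes O(log N) instead of O(N)
def treeUnion(dfs: Seq[DataFrame]): DataFrame = {
  var level = dfs
  while (level.size > 1) {
    level = level.grouped(2).map {
      case Seq(a, b) => a.unionAll(b)
      case Seq(a)    => a // odd one out carries over to the next round
    }.toSeq
  }
  level.head
}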
Lastly, you can write the DataFrame to disk and read it back. The idea is that after every X (e.g. 20) unions you would do df1.write.parquet(filex) and then df1 = spark.read.parquet(filex). When you read it back, the lineage of the single DataFrame is just the file read itself. The cost, of course, is the writing and reading of the file.
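And a sketch of that last (write/read) variant, assuming a scratch directory under /tmp you can write to and that spark is the SparkSession (use sqlContext.read on Spark 1.x):

var df1 = firstDF                 // firstDF / remainingDFs stand in for the per-file DataFrames
var checkpointId = 0
var unionsSinceCheckpoint = 0

remainingDFs.foreach { freqs =>
  df1 = df1.unionAll(freqs)
  unionsSinceCheckpoint += 1

  // every 20 unions, cut the lineage by round-tripping through parquet
  if (unionsSinceCheckpoint == 20) {
    checkpointId += 1
    val path = s"/tmp/union_checkpoint_$checkpointId" // scratch location is an assumption
    df1.write.parquet(path)
    df1 = spark.read.parquet(path)
    unionsSinceCheckpoint = 0
  }
}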
Related
Here is my dataframe:
The underlying RDD has 2 partitions
When I do a df.count, the DAG produced is
When I do a df.rdd.count, the DAG produced is:
Question: count is an action in Spark; the official definition is 'Returns the number of rows in the DataFrame.' Now, when I perform the count on the DataFrame, why does a shuffle occur? Besides, when I do the same on the underlying RDD, no shuffle occurs.
It makes no sense to me why a shuffle would occur at all. I tried to go through the source code of count here: spark github, but it doesn't fully make sense to me. Is the groupBy being supplied to the action the culprit?
P.S. df.coalesce(1).count does not cause any shuffle.
It seems that the DataFrame's count operation uses groupBy, resulting in a shuffle. Below is the code from https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:

/**
 * Returns the number of rows in the Dataset.
 * @group action
 * @since 1.6.0
 */
def count(): Long = withAction("count", groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0)
}
Whereas if you look at RDD's count function, it passes the aggregate function on to each of the partitions, which returns the count of each partition as an Array, and then uses .sum to sum the elements of that array.
Code snippet from this link:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
/**
* Return the number of elements in the RDD.
*/
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
When Spark runs this DataFrame operation, it first computes partial counts for every partition and then has another stage to sum those up. This is particularly good for large DataFrames, where distributing the counts over multiple executors actually improves performance.
The place to verify this is the SQL tab of the Spark UI, which will show a physical plan description of the following sort:
*HashAggregate(keys=[], functions=[count(1)], output=[count#202L])
+- Exchange SinglePartition
+- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#206L])
In the shuffle stage, the key is empty and the value is the count of the partition, and all these (key, value) pairs are shuffled to one single partition.
That is, the data moved in the shuffle stage is very little.
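You can see this plan for yourself, since df.count() is implemented as an empty groupBy followed by count; a minimal sketch:

// prints the physical plan with the partial_count HashAggregate,
// the Exchange SinglePartition, and the final HashAggregate shown above
df.groupBy().count().explain()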
Related
For example, we have a parquet file with 2000 stock symbols' closing prices for the past 3 years, and we want to calculate the 5-day moving average for each symbol.
So I create a Spark SQLContext and then:
val marketData = sqlcontext.sql("select DATE, SYMBOL, PRICE from stockdata order by DATE").cache()
To get the symbol list,
val symbols = marketData.select("SYMBOL").distinct().collect()
and here is the for loop:
for (symbol <- symbols) {
marketData.filter(symbol).rdd.sliding(5).map(...calculating the avg...).save()
}
Obviously, doing the for loop in Spark is slow, and save() for each small result also slows down the process (I tried defining a var result outside the for loop and unioning all the output to do the IO together, but I got a stack overflow exception), so how can I parallelize the for loop and optimize the IO?
The program you write runs in the driver ("master") Spark node. Expressions in this program can only be parallelized if you are operating on parallel structures (RDDs).
Try this:
marketdata.rdd
  .map(symbolize)                                        // (symbol, day)
  .groupByKey()                                          // collect all days per symbol
  .mapValues(days => days.toSeq.sliding(5).map(makeAvg)) // 5-day moving averages
  .foreach { case (symbol, averages) => averages.save() }
where symbolize takes a Row of symbol x day and returns a tuple (symbol, day).
For the first part of the answer I don't agree with Carlos. The program does not run in the driver ("master").
The loop does run sequentially, but for each symbol the execution of:
marketData.filter(symbol).rdd.sliding(5).map(...calculating the avg...).save()
is done in parallel since marketData is a Spark DataFrame and it is distributed.
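Not from either answer, but if the goal is simply to avoid the per-symbol driver loop, a window function over the DataFrame computes the same 5-day moving average in one distributed job; a hedged sketch (column names follow the question, and the result would still need to be saved once at the end):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

// 5-row window per symbol: the current row and the 4 preceding rows, ordered by date
val w = Window.partitionBy("SYMBOL").orderBy("DATE").rowsBetween(-4, 0)

// adds a 5-day moving average column, computed in parallel across symbols
val movingAvg = marketData.withColumn("MA5", avg("PRICE").over(w))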
Related
I have a dataset which is 1,000,000 rows by about 390,000 columns. The fields are all binary, either 0 or 1. The data is very sparse.
I've been using Spark to process this data. My current task is to filter the data--I only want data in 1000 columns that have been preselected. This is the current code that I'm using to achieve this task:
val result = bigdata.map(_.zipWithIndex.filter{case (value, index) => selectedColumns.contains(index)})
bigdata is just an RDD[Array[Int]]
However, this code takes quite a while to run. I'm sure there's a more efficient way to filter my dataset that doesn't involve going in and filtering every single row separately. Would loading my data into a DataFrame and manipulating it through the DataFrame API make things faster/easier? Should I be looking into column-store based databases?
You can start by making your filter a little bit more efficient. Please note that:
your RDD contains Array[Int]. This means you can access the nth element of each row in O(1) time
#selectedColumns << #columns
Considering these two facts, it should be obvious that it doesn't make sense to iterate over all elements for each row, not to mention the contains calls. Instead, you can simply map over selectedColumns:
// optional: sort so the selected columns keep a deterministic order
val orderedSelectedColumns = selectedColumns.toList.sorted.toArray
rdd.map(row => orderedSelectedColumns.map(row))
Comparing time complexity:
zipWithIndex + filter (assuming the best-case scenario where contains is O(1)): O(#rows * #columns)
map: O(#rows * #selectedColumns)
The easiest way to speed up execution is to parallelize it with partitionBy:
bigdata.partitionBy(new HashPartitioner(numPartitions)).foreachPartition(...)
foreachPartition receives an Iterator over which you can map and filter.
numPartitions is a val which you can set to the desired number of parallel partitions.
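Since partitionBy is only defined on key/value RDDs, one way to make this concrete for an RDD[Array[Int]] is to key the rows first; a hedged sketch that also reuses the column selection from the previous answer:

import org.apache.spark.HashPartitioner

val numPartitions = 100 // tune to your cluster; this value is an assumption

bigdata
  .zipWithIndex()                  // key each row so partitionBy can be applied
  .map(_.swap)                     // (rowIndex, row)
  .partitionBy(new HashPartitioner(numPartitions))
  .foreachPartition { partition =>
    partition.foreach { case (_, row) =>
      val selected = orderedSelectedColumns.map(row) // keep only the preselected columns
      // do something with the filtered row ...
    }
  }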