How can I parallelize a for loop in Spark with Scala?

For example, we have a parquet file with 2000 stock symbols' closing price in the past 3 years, and we want to calculate the 5-day moving average for each symbol.
So I create a spark SQLContext and then
val marketData = sqlcontext.sql("select DATE, SYMBOL, PRICE from stockdata order by DATE").cache()
To get the symbol list,
val symbols = marketData.select("SYMBOL").distinct().collect()
and here is the for loop:
for (symbol <- symbols) {
marketData.filter(symbol).rdd.sliding(5).map(...calculating the avg...).save()
}
Obviously, doing the for loop on Spark is slow, and calling save() for each small result slows the process down further. I tried defining a var result outside the for loop and unioning all the outputs so that the IO happens in one go, but I got a stack overflow exception. So how can I parallelize the for loop and optimize the IO operation?

The program you write runs in a driver ("master") spark node. Expressions in this program can only be parallelized if you are operating on parallel structures (RDDs).
Try this:
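marketData.rdd.map(symbolize).groupByKey().mapValues(days => days.toSeq.sliding(5).map(makeAvg).toSeq).foreach{ case (symbol, averages) => averages.save() }
where symbolize takes a Row of symbol x day and returns a tuple (symbol, day), and makeAvg averages one 5-day window. (groupByKey collects all days of a symbol together; reduceByKey would only combine two values at a time, so it cannot slide over them.)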
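The group-then-slide idea behind symbolize and makeAvg can be sketched on plain Scala collections, with no Spark required to follow the logic. This is only an illustration under assumptions: the Quote case class, the date format and the exact shapes of symbolize and makeAvg are hypothetical and not taken from the answer.

```scala
// Plain-Scala sketch of the per-symbol moving average.
// Quote, symbolize and makeAvg are hypothetical shapes chosen for illustration.
object MovingAverageSketch {
  final case class Quote(date: String, symbol: String, price: Double)

  // Turn one record into a (symbol, record) pair, as symbolize would for a Row.
  def symbolize(q: Quote): (String, Quote) = (q.symbol, q)

  // Average price of one window.
  def makeAvg(window: Seq[Quote]): Double = window.map(_.price).sum / window.size

  // Group by symbol, order each group by date, then average every full window.
  def movingAverages(quotes: Seq[Quote], days: Int = 5): Map[String, Seq[Double]] =
    quotes.map(symbolize)
      .groupBy(_._1)
      .map { case (symbol, pairs) =>
        val ordered = pairs.map(_._2).sortBy(_.date) // ISO dates sort as strings
        // sliding yields one short window when a group has fewer than `days` rows
        symbol -> ordered.sliding(days).filter(_.size == days).map(makeAvg).toList
      }
}
```

On an RDD the same shape would be a map to pairs, a grouping by key, and a per-key sort and slide, so that each symbol is processed on whichever executor holds its group.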

For the first part of the answer I don't agree with Carlos. The program does not run in the driver ("master").
The loop does run sequentially, but for each symbol the execution of:
marketData.filter(symbol).rdd.sliding(5).map(...calculating the avg...).save()
is done in parallel, since marketData is a Spark DataFrame and it is distributed.
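Neither answer mentions it, but one way to avoid the per-symbol loop altogether is a Spark SQL window function, which computes the average for every symbol in a single job. The sketch below uses that swapped-in technique. It assumes the DATE, SYMBOL and PRICE columns from the question; the output column name MA5 and the object name are made up for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col}

object WindowedAverage {
  // Adds a trailing 5-row average of PRICE per SYMBOL, ordered by DATE.
  // MA5 is a hypothetical column name, not from the original post.
  def withMovingAverage(marketData: DataFrame): DataFrame = {
    // Current row plus the 4 preceding rows within the same symbol.
    val w = Window.partitionBy("SYMBOL").orderBy("DATE").rowsBetween(-4, 0)
    marketData.withColumn("MA5", avg(col("PRICE")).over(w))
  }
}
```

Note that the first four rows of each symbol average fewer than five prices, so they may need to be filtered out. The result is a single DataFrame, so it can be written once instead of once per symbol.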

Related

Convert collect-map-foreach scala code block to spark/sql library functions

I have a spark dataframe (let's call it "records") like the following one:
id    name
a1    john
b"2   alice
c3'   joe
If you notice, the primary key column (id) values may have single/double quotes in them (like the second and third row in the dataframe).
I wrote the following Scala code to check for quotes in primary key column values:
import org.apache.spark.sql.DataFrame
import scala.util.control.Breaks.{break, breakable}

def checkForQuotesInPrimaryKeyColumn(primaryKey: String, records: DataFrame): Boolean = {
  // Extract primary key column values (this collects the whole column to the driver)
  val pkcValues = records.select(primaryKey).collect().map(_(0)).toList
  // Check for single and double quotes in the values
  var checkForQuotes = false // indicates no quotes
  breakable {
    pkcValues.foreach(pkcValue => {
      if (pkcValue.toString.contains("\"") || pkcValue.toString.contains("\'")) {
        checkForQuotes = true
        println("Value that has quotes: " + pkcValue.toString)
        break() // stop at the first value that contains a quote
      }
    })
  }
  checkForQuotes
}
This code works. But it doesn't take advantage of spark functionalities. I wish to make use of spark executors (and other features) that can complete this task faster.
The updated function looks like the following:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

def checkForQuotesInPrimaryKeyColumnsUpdated(primaryKey: String, records: DataFrame): Boolean = {
  // UDF returning true when the value contains a single or double quote
  val findQuotes = udf((s: String) => s.contains("\"") || s.contains("'"))
  records
    .select(findQuotes(col(primaryKey)) as "quotes")
    .filter(col("quotes") === true)
    .collect()
    .nonEmpty
}
The unit tests give similar runtimes on my machine for both the functions when run on a dataframe with 100 entries.
Is the updated function any faster (and/or better) than the original function? Is there any way the function can be improved?
Your first approach collects the entire dataframe to the driver. If your data does not fit into the driver's memory, it is going to break. Also you are right, you do not take advantage of spark.
The second approach uses spark to detect quotes. That's better. The problem is that you then collect a dataframe containing one boolean per record containing a quote to the driver just to see if there is at least one. This is a waste of time, especially if many records contain quotes. It is also a shame to use a UDF for this, since they are known to be slower than spark SQL primitives.
You could simply use Spark to count the number of records containing a quote, without collecting anything.
records.where(col(primaryKey).contains("\"") || col(primaryKey).contains("'"))
.count > 0
Since you do not actually care about the number of records and just want to check whether there is at least one, you could use limit(1). SparkSQL will be able to further optimize the query:
records.where(col(primaryKey).contains("\"") || col(primaryKey).contains("'"))
.limit(1).count > 0
NB: it makes sense that in unit tests, with little data, both of your queries take the same time. Spark is meant for big data and has some overhead. With real data, your second approach should be faster than the first, and the one I propose even more so. Also, your first approach will get an OOM on the driver as soon as you add more data.

How to avoid the need for nested calls to Spark dataframes - which don't work

Suppose I have a Spark dataframe called trades which has in its schema a few columns, some dimensions (let's say Product and Type) and some facts (let's say Price and Volume).
Rows in the dataframe which have the same dimension columns belong logically to the same group.
What I need is to map each dimension set (Product, Type) to a numeric value, so to obtain in the end a dataframe stats which has as many rows as the distinct number of dimensions and a value - this is the critical part - which is obtained from all the rows in trades of that (Product, Type) and which must be computed sequentially in order, because the function applied row by row is neither associative nor commutative, and it cannot be parallelized.
I managed to handle the sequential function I need to apply to each subset by repartitioning to 1 single chunk each dataframe and sorting the rows, so to get exactly what I need.
The thing I am struggling with is how to do the map from trades to stats as a Spark job: in my scenario master is remote and can leverage multiple executors, while the deploy mode is local and local machine is poorly equipped.
So I don't want to do looping over the driver, but push it down to the cluster.
If this was not Spark, I'd have done something like:
val dimensions = trades.select("Product", "Type").distinct()
val stats = dimensions.map { row =>
  val product = row.getAs[String]("Product")
  val tpe = row.getAs[String]("Type") // `type` is a reserved word in Scala
  val inScope = col("Product") === product and col("Type") === tpe
  val tradesInScope = trades.filter(inScope) // nested use of trades inside map
  Row(product, tpe, callSequentialFunction(tradesInScope))
}
This seemed fine to me, but it's absolutely not working: I am trying to do a nested call on trades, and it seems that is not supported. Indeed, the Spark job compiles, but when an action is actually performed I get a NullPointerException, because the dataframe trades is null within the map.
I am new to Spark, and I don't know any other way of achieving the same intent in a valid way. Could you help me?
You get a NullPointerException because you cannot use dataframes within executor-side code; they only live on the driver. Also, your code would not ensure that callSequentialFunction is called sequentially, because map on a dataframe runs in parallel (if you have more than 1 partition). What you can do is something like this:
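// .as[(String, String)] requires import spark.implicits._
val dimensions = trades.select("Product", "Type").distinct().as[(String, String)].collect()
// dimensions is now a local Array, so this map runs sequentially on the driver
val stats = dimensions.map { case (product, tpe) => // `type` is a reserved word in Scala
  val inScope = col("Product") === product and col("Type") === tpe
  val tradesInScope = trades.filter(inScope)
  (product, tpe, callSequentialFunction(tradesInScope))
}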
But note that the order in dimensions is somewhat arbitrary, so you should sort dimensions according to your needs.
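The constraint in the question (a function that must see each group's rows strictly in order) can be illustrated on plain Scala collections. The sketch below is an assumption-laden example, not the asker's code: the Trade case class, its timestamp field and the sequentialStat function are all hypothetical. On a typed Dataset the same per-group shape could be expressed with groupByKey and mapGroups, sorting inside each group, which would be an alternative to the driver-side loop above rather than what the answer proposes.

```scala
// Plain-Scala sketch: one order-dependent value per (product, type) group.
// Trade, timestamp and sequentialStat are hypothetical names for illustration.
object SequentialPerGroup {
  final case class Trade(product: String, tpe: String, timestamp: Long, price: Double)

  // Order-dependent: every step uses the previous result, so the rows of a
  // group cannot be combined in an arbitrary order.
  def sequentialStat(trades: Seq[Trade]): Double =
    trades.foldLeft(0.0)((acc, t) => 0.5 * acc + 0.5 * t.price)

  // Groups are independent of each other; only the rows inside a group
  // have to be sorted and folded sequentially.
  def stats(trades: Seq[Trade]): Map[(String, String), Double] =
    trades.groupBy(t => (t.product, t.tpe)).map { case (key, group) =>
      key -> sequentialStat(group.sortBy(_.timestamp))
    }
}
```

The point of the design is that parallelism is available across groups even though the function itself is neither associative nor commutative within a group.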

Time gap between Two tasks in Spark

I am inserting data into a Hive table iteratively in Spark.
For example: let's say there are 10,000 items. First these items are split into 5 lists of 2,000 items each. Then I iterate over those 5 lists.
In each iteration, the 2,000 items map to many more rows, so by the end of an iteration 15M records have been inserted into the Hive table. Each iteration completes in 40 minutes.
The issue comes after each iteration: Spark waits before starting on the next 2,000 items. The waiting time is about 90 minutes! During that gap there are no active tasks in the Spark web UI.
By the way, each iteration starts directly with Spark processing; there is no Scala or Java code at the beginning or at the end of an iteration.
Any idea?
Thanks
val itemSeq = uniqueIDsDF.select("unique_id").map(r => r.getLong(0)).collect.toSeq // Get 10K items
val itemList = itemSeq.sliding(2000, 2000).toList // Create 5 lists
itemList.foreach { currItem =>
  // starting code (iteration start)
  val currListDF = currItem.toDF("unique_id")
  val currMetadataDF = hive_raw_metadata.join(broadcast(currListDF), Seq("unique_id"), "inner")
  currMetadataDF.registerTempTable("metaTable")
  // further logic here ....
}
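A small aside on the batching step in the snippet: sliding with a window and a step of the same size produces consecutive, non-overlapping chunks, which is what grouped does directly. The sketch below shows that on plain collections; the object and method names are made up for illustration.

```scala
// Sketch: splitting ids into consecutive, non-overlapping batches.
// BatchingSketch and batches are hypothetical names.
object BatchingSketch {
  def batches(ids: Seq[Long], batchSize: Int): List[Seq[Long]] =
    ids.grouped(batchSize).toList // same chunks as ids.sliding(batchSize, batchSize)
}
```

With 10,000 ids and a batch size of 2,000 this yields the 5 lists described in the question.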
I found the reason: even though the insert task appears complete in the Spark UI, the insert process continues in the background. The next iteration starts only after the write to HDFS has completed. That is the reason for the gap in the web UI.
As far as I understand, you are trying to divide a DataFrame, pass the data through in batches and do some processing on each batch, as in your pseudocode (which was not very clear).
As you mentioned in your answer above, whenever an action runs, the insertion into the sink takes some time.
But I feel your sliding logic can be improved.
Based on that assumption, I have 2 options for you; you can choose the most suitable one.
Option #1 (foreachPartitionAsync, from AsyncRDDActions):
I would suggest using the DataFrame's iterator grouping capabilities:
val repartitioned = df.repartition(numofpartitionsyouwant) // numPartitions
// Since this processes the sink partition by partition, it should be faster than the approach you are using
repartitioned.rdd.foreachPartitionAsync { partitionIterator =>
  partitionIterator.grouped(2000).foreach { group =>
    group.foreach { row =>
      // do your insertions here, or whatever else you want
    }
  }
}
Note: the RDD action is executed in the background. All of these executions are submitted to the Spark scheduler and run concurrently. Depending on the size of your Spark cluster, some of the jobs may wait until executors become available for processing.
Option #2:
The second approach is DataFrame's randomSplit, which I think you can use in this case to divide the data into DataFrames of roughly equal size by passing equal weights. The split is random, so the sizes are approximate rather than exact.
Note: the weights (the first argument of randomSplit) will be normalized if they don't sum to 1.
DataFrame[] randomSplit(double[] weights): Randomly splits this DataFrame with the provided weights.
It will be like:
val equalsizeddfArray = yourdf.randomSplit(Array(1.0, 1.0, 1.0, 1.0, 1.0)) // weights intentionally sum to more than 1; they are normalized to 0.2 each (in your case a DataFrame of 10000 records becomes an array of 5 DataFrames of roughly 2000 records each)
and then...
for (i <- 0 until equalsizeddfArray.length) {
// your logic ....
}
Note: the logic above is sequential. If you want to execute the tasks in parallel (if they are independent), you can use:
import scala.concurrent._
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
// Now wait for the tasks to finish before exiting the app (each function is expected to return a Future)
Await.result(Future.sequence(Seq(yourtaskfuncOndf1(),yourtaskfuncOndf2()...,yourtaskfuncOndf10())), Duration(10, MINUTES))
Of the 2 options above, I would prefer approach #2, since randomSplit takes care of dividing the data into similarly sized pieces (by normalizing the weights) for processing.
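The Future-based pattern mentioned above can be sketched in a self-contained form. This is only an illustration: ParallelTasks and runAll are hypothetical names, and the tasks are plain functions standing in for the per-DataFrame work.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Sketch: run independent tasks concurrently and wait for all of them.
// ParallelTasks and runAll are hypothetical names for illustration.
object ParallelTasks {
  def runAll[A](tasks: Seq[() => A], timeout: Duration = 10.minutes): Seq[A] = {
    // Each Future is submitted to the global execution context right away.
    val futures = tasks.map(task => Future(task()))
    // Block until every task has finished (or the timeout expires);
    // results come back in the same order as the tasks.
    Await.result(Future.sequence(futures), timeout)
  }
}
```

When each task triggers a Spark action, the jobs are submitted from separate threads, and how many actually run at the same time depends on the executors available.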

Caching Large Dataframes in Spark Effectively

I am currently working with 11,000 files. Each file generates a data frame, which is unioned with the previous one. Below is the code:
var df1 = sc.parallelize(Array(("temp", 100))).toDF("key", "value").withColumn("Filename", lit("Temp"))
files.foreach(filename => {
  val a = filename.getPath.toString()
  val m = a.split("/")
  val name = m(6)
  println("FILENAME: " + name)
  if (name == "_SUCCESS") {
    println("Cannot Process '_SUCCESS' Filename")
  } else {
    val freqs = doSomething(a).toDF("key", "value").withColumn("Filename", lit(name))
    df1 = df1.unionAll(freqs) // every union adds one more level to the plan
  }
})
First, I got a java.lang.StackOverflowError on 11,000 files. Then I added the following line after df1=df1.unionAll(freqs):
df1=df1.cache()
It resolves the problem, but each iteration gets slower than the last. Can somebody please suggest what should be done to avoid the StackOverflowError without the slowdown?
Thanks!
The issue is that spark manages a dataframe as a set of transformations. It begins with the "toDF" of the first dataframe, then perform the transformations on it (e.g. withColumn), then unionAll with the previous dataframe etc.
The unionAll is just another such transformation, and the tree becomes very deep (with 11K unionAll calls you have an execution tree of depth 11K). Building and traversing a plan that deep can overflow the stack.
The caching doesn't solve this; however, I imagine you are adding some action along the way (otherwise nothing would run besides building the transformations). When you cache, Spark might skip some of the steps, and therefore the stack overflow would simply arrive later.
You can go back to RDD for iterative process (your example actually is not iterative but purely parallel, you can simply save each separate dataframe along the way and then convert to RDD and use RDD union).
Since your case seems to be just unioning a bunch of dataframes without true iterations, you can also do the union in a tree manner (i.e. union pairs, then union pairs of pairs, etc.). This would change the depth from O(N) to O(log N), where N is the number of unions.
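The tree-shaped union can be sketched as a generic pairwise reduction. The helper below is a hypothetical illustration (TreeUnion and treeReduce are made-up names) written against plain Scala values so that the nesting is easy to see.

```scala
import scala.annotation.tailrec

// Sketch: combine items pairwise, level by level, so that the nesting depth
// grows as O(log N) instead of O(N). TreeUnion and treeReduce are hypothetical.
object TreeUnion {
  @tailrec
  def treeReduce[A](items: Seq[A])(combine: (A, A) => A): A = {
    require(items.nonEmpty, "need at least one item")
    if (items.size == 1) items.head
    else {
      // One level of the tree: combine neighbours; an odd leftover is carried up.
      val nextLevel = items.grouped(2).map {
        case Seq(a, b) => combine(a, b)
        case Seq(a)    => a
      }.toVector
      treeReduce(nextLevel)(combine)
    }
  }
}
```

For example, reducing the strings a to e with a combiner that wraps each pair in parentheses yields (((ab)(cd))e), a depth of 3 rather than 4. With a collection of dataframes, the combiner would be the union itself, for instance treeReduce(frames)(_ unionAll _), assuming all frames share the same schema.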
Lastly, you can write the dataframe to disk and read it back. The idea is that after every X (e.g. 20) unions, you would do df1.write.parquet(filex) and then df1 = spark.read.parquet(filex). When you read it back, the lineage of the dataframe is just the file read itself. The cost, of course, is the writing and reading of the file.

Spark: How to combine 2 sorted RDDs so that order is preserved after union?

I have 2 sorted RDDs:
val rdd_a = some_pair_rdd.sortByKey().
zipWithIndex.filter(f => f._2 < n).
map(f => f._1)
val rdd_b = another_pair_rdd.sortByKey().
zipWithIndex.filter(f => f._2 < n).
map(f => f._1)
val all_rdd = rdd_a.union(rdd_b)
In all_rdd, I see that the order is not necessarily maintained as I'd imagined (that all elements of rdd_a come first, followed by all elements of rdd_b). Is my assumption incorrect (about the contract of union), and if so, what should I use to append multiple sorted RDDs into a single rdd?
I'm fairly new to Spark so I could be wrong, but from what I understand Union is a narrow transformation. That is, each executor joins only its local blocks of RDD a with its local blocks of RDD b and then returns that to the driver.
As an example, let's say that you have 2 executors and 2 RDDS.
RDD_A = ["a","b","c","d","e","f"]
and
RDD_B = ["1","2","3","4","5","6"]
Let Executor 1 contain the first half of both RDDs and Executor 2 contain the second half of both RDDs. When they perform the union on their local blocks, it would look something like:
Union_executor1 = ["a","b","c","1","2","3"]
and
Union_executor2 = ["d","e","f","4","5","6"]
So when the executors pass their parts back to the driver you would have ["a","b","c","1","2","3","d","e","f","4","5","6"]
Again, I'm new to Spark and I could be wrong. I'm just sharing my understanding of how it works with RDDs. Hopefully we can both learn something from this.
You can't. Spark does not have a merge sort, because you can't make assumptions about the way that the RDDs are actually stored on the nodes. If you want things in sort order after you take the union, you need to sort again.