Pyspark improving performance for multiple column operations - pyspark

I have written a class which performs standard scaling over grouped data.
class Scaler:
    ...

    def __transformOne__(self, df_with_stats, newName, colName):
        return df_with_stats\
            .withColumn(newName,
                        (F.col(colName)-F.col(f'avg({colName})'))/(F.col(f'stddev_samp({colName})')+self.tol))\
            .drop(colName)\
            .withColumnRenamed(newName, colName)

    def transform(self, df):
        df_with_stats = df.join(....)  # calculate stats here by doing a groupby and then do a join
        return reduce(lambda df_with_stats, kv: self.__transformOne__(df_with_stats, *kv),
                      self.__tempNames__(), df_with_stats)[df.columns]
The idea is to save the means and standard deviations in columns and simply do a column subtraction/division on the column I want to scale. This part is done in the function __transformOne__, so basically it's an arithmetic operation on one column.
If I want to scale multiple columns, I just call __transformOne__ multiple times, chained with functools.reduce (see the function transform). The class works fast enough for a single column, but with multiple columns it takes too much time.
I have no idea about the internals of Spark, so I'm a complete newbie. Is there a way I can improve this computation over multiple columns?

My solution makes a lot of calls to withColumn, so I changed it to use select instead. There is a substantial difference between the physical plans of the two approaches. For my application, the runtime improved from 15 minutes to 2 minutes using select. More information about this in this SO post.
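As a rough sketch of the select-based approach (not the author's actual code), all scaling expressions can be built up front and applied in one projection. The helper name scaling_exprs and the default tol are hypothetical; the stats column names avg(<col>) and stddev_samp(<col>) follow the question. The helper only builds Spark SQL expression strings, so it runs without Spark; the strings are meant to be passed to DataFrame.selectExpr.

```python
def scaling_exprs(all_columns, scale_columns, tol=1e-6):
    """Build one Spark SQL expression string per output column.

    Columns listed in scale_columns are standardized using the stats
    columns `avg(<col>)` and `stddev_samp(<col>)` (names assumed from the
    question); every other column is passed through unchanged.
    Assumes column names contain no backticks.
    """
    to_scale = set(scale_columns)
    exprs = []
    for c in all_columns:
        if c in to_scale:
            exprs.append(
                f"(`{c}` - `avg({c})`) / (`stddev_samp({c})` + {tol}) AS `{c}`"
            )
        else:
            exprs.append(f"`{c}`")
    return exprs


exprs = scaling_exprs(["group", "height", "weight"], ["height", "weight"])
for e in exprs:
    print(e)

# With a Spark session available, this would replace the chain of
# withColumn calls with a single projection, for example:
# scaled = df_with_stats.selectExpr(*exprs)
```

Because the whole list goes into a single selectExpr call, the plan contains one projection rather than one per scaled column.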

Related

Spark Scala - Comparing Datasets Column by Column

I'm just getting started with using spark, I've previously used python with pandas. One of the common things I do very regularly is compare datasets to see which columns have differences. In python/pandas this looks something like this:
merged = df1.merge(df2,on="by_col")
for col in cols:
    diff = merged[col+"_x"] != merged[col+"_y"]
    if diff.sum() > 0:
        print(f"{col} has {diff.sum()} diffs")
I'm simplifying this a bit but this is the gist of it, and of course after this I'd drill down and look at for example:
col = "col_to_compare"
diff = merged[col+"_x"] != merged[col+"_y"]
print(merged[diff][[col+"_x",col+"_y"]])
Now in Spark/Scala this is turning out to be extremely inefficient. The same logic works, but this dataset is roughly 300 columns wide, and the following code takes about 45 minutes to run for a 20 MB dataset, because it submits 300 different Spark jobs in sequence, not in parallel, so I seem to be paying the startup cost of Spark 300 times. For reference, the pandas version takes something like 300 ms.
for (col <- cols) {
  val cnt = merged.filter(merged("dev_" + col) <=> merged("prod_" + col)).count
  if (cnt != merged.count) {
    println(col + " = " + cnt + "/ " + merged.count)
  }
}
What's the faster, more Spark-idiomatic way of doing this type of thing? My understanding is that I want this to be a single Spark job that creates one plan. I was looking at transposing to a very tall dataset, and while that could potentially work, it ends up being complicated and the code is not straightforward at all. Also, although this example fits in memory, I'd like to be able to use this function across datasets, and we have a few that are multiple terabytes, so it needs to scale to large datasets as well, whereas with python/pandas that would be a pain.
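One possible way to get a single job, sketched here under assumptions rather than taken from the thread, is to build one aggregate expression per column and evaluate them all in one global aggregation. The helper name diff_count_exprs is hypothetical; the dev_ and prod_ prefixes and the null-safe <=> comparison follow the question. The helper only builds Spark SQL strings, so it runs without Spark.

```python
def diff_count_exprs(cols, left_prefix="dev_", right_prefix="prod_"):
    """Build one aggregate expression per column that counts the rows where
    the two sides differ, using the null-safe equality operator <=>.
    Assumes column names contain no backticks.
    """
    return [
        f"sum(CASE WHEN `{left_prefix}{c}` <=> `{right_prefix}{c}` "
        f"THEN 0 ELSE 1 END) AS `{c}`"
        for c in cols
    ]


exprs = diff_count_exprs(["price", "qty"])
for e in exprs:
    print(e)

# With a Spark session, all counts would come back in one row from one job:
# counts = merged.selectExpr(*exprs).first()
```

The resulting row holds one difference count per column, so the columns with a non-zero count can then be inspected individually.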

Caching Large Dataframes in Spark Effectively

I am currently working on 11,000 files. Each file generates a DataFrame, which is unioned with the previous one. Below is the code:
var df1 = sc.parallelize(Array(("temp", 100))).toDF("key", "value").withColumn("Filename", lit("Temp"))
files.foreach(filename => {
  val a = filename.getPath.toString()
  val m = a.split("/")
  val name = m(6)
  println("FILENAME: " + name)
  if (name == "_SUCCESS") {
    println("Cannot Process '_SUCCESS' Filename")
  } else {
    val freqs = doSomething(a).toDF("key", "value").withColumn("Filename", lit(name))
    df1=df1.unionAll(freqs)
  }
})
First, I got a java.lang.StackOverflowError on 11,000 files. Then I added the following line after df1=df1.unionAll(freqs):
df1=df1.cache()
This resolves the problem, but each iteration gets slower than the last. Can somebody please suggest what should be done to avoid the StackOverflowError without the slowdown?
Thanks!
The issue is that Spark manages a DataFrame as a chain of transformations. It begins with the toDF of the first DataFrame, then applies the transformations on it (e.g. withColumn), then the unionAll with the previous DataFrame, and so on.
The unionAll is just another such transformation, so the tree becomes very deep (with 11K unionAll calls you have an execution tree of depth 11K). Building and traversing a plan that deep can overflow the stack.
Caching doesn't solve this. However, I imagine you are running some action along the way (otherwise nothing would run besides building the transformations). When you cache, Spark might skip some of the steps, and therefore the stack overflow would simply arrive later.
You can go back to RDDs for an iterative process (your example is actually not iterative but purely parallel: you can simply keep each separate DataFrame along the way, then convert to RDDs and use RDD union).
Since your case seems to be just unioning a bunch of DataFrames without true iteration, you can also do the union in a tree manner (i.e. union pairs, then union pairs of pairs, etc.). This changes the depth from O(N) to O(log N), where N is the number of unions.
Lastly, you can write the DataFrame to disk and read it back. The idea is that after every X (e.g. 20) unions, you would do df1.write.parquet(filex) and then df1 = spark.read.parquet(filex). After the read, the lineage of the DataFrame is just the file read itself. The cost, of course, is the writing and reading of the file.
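The tree-style union can be sketched in plain Python as a pairwise reduction. This is a minimal illustration, not Spark code: tree_reduce and union are hypothetical names, and each "DataFrame" is modeled as a (row_count, plan_depth) pair so the depth of the resulting union tree can be compared with that of a sequential left fold.

```python
from functools import reduce


def tree_reduce(items, combine):
    """Combine items pairwise, level by level, so that the nesting depth
    of the result grows as O(log N) instead of O(N)."""
    level = list(items)
    if not level:
        raise ValueError("tree_reduce needs at least one item")
    while len(level) > 1:
        nxt = [combine(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd item moves up a level unchanged
        level = nxt
    return level[0]


def union(a, b):
    """Stand-in for DataFrame.union: add row counts, deepen the plan by one."""
    return (a[0] + b[0], max(a[1], b[1]) + 1)


frames = [(1, 0) for _ in range(11000)]   # 11,000 single-row "frames"
seq_rows, seq_depth = reduce(union, frames)        # sequential fold
tree_rows, tree_depth = tree_reduce(frames, union) # pairwise tree
print(seq_rows, seq_depth)
print(tree_rows, tree_depth)
```

With real DataFrames, the same helper could be called as tree_reduce(dfs, lambda a, b: a.union(b)), assuming dfs is a list of DataFrames with identical schemas.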

What is the fastest way to group values based on multiple key columns in RDD using Scala? [duplicate]

My data is a file containing over 2 million rows of employee records. Each row has 15 fields of employee features, including name, DOB, SSN, etc. Example:
ID|name|DOB|address|SSN|...
1|James Bond|10/01/1990|1000 Stanford Ave|123456789|...
2|Jason Bourne|05/17/1987|2000 Yale Rd|987654321|...
3|James Bond|10/01/1990|5000 Berkeley Dr|123456789|...
I need to group the data by a number of columns and collect the IDs (first column) of the employees that share the same key. The number and names of the key columns are passed into the function as parameters.
For example, if the key columns include "name, DOB, SSN", the data will be grouped as
(James Bond, 10/01/1990, 123456789), List(1,3)
(Jason Bourne, 05/17/1987, 987654321), List(2)
And the final output is
List(1,3)
List(2)
I am new to Scala and Spark. What I did to solve this problem is: read the data as RDD, and tried using groupBy, reduceByKey, and foldByKey to implement the function based on my research on StackOverflow. Among them, I found groupBy was the slowest, and foldByKey was the fastest. My implementation with foldByKey is:
val buckets = data.map(row => (idx.map(i => row(i)) -> (row(0) :: Nil)))
  .foldByKey(List[String]())((acc, e) => acc ::: e).values
My question is: Is there faster implementation than mine using foldByKey on RDD?
Update: I've read posts on StackOverflow and understand that groupByKey may be very slow on large datasets. This is why I avoided groupByKey and ended up with foldByKey. However, this is not the question I asked. I am looking for an even faster implementation, or the optimal implementation in terms of processing time with a fixed hardware setting. (Processing 2 million records currently takes ~15 minutes.) I was told that converting the RDD to a DataFrame and calling groupBy can be faster.
Here are some details on each of these, to understand how they work.
groupByKey runs slowly because all the key-value pairs are shuffled around. That is a lot of unnecessary data being transferred over the network.
reduceByKey works much better on a large dataset. That's because Spark knows it can combine output with a common key on each partition before shuffling the data.
combineByKey can be used when you are combining elements but your return type differs from your input value type.
foldByKey merges the values for each key using an associative function and a neutral "zero value".
So avoid groupByKey. Hope this helps.
Cheers !
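The difference between shuffling every pair and combining per partition first can be illustrated with a small plain-Python model. This is only a sketch of the idea, not Spark's implementation; the function names and the sample partitions are made up for illustration.

```python
def shuffle_without_combine(partitions):
    """groupByKey-style: every (key, value) pair crosses the shuffle."""
    return [pair for part in partitions for pair in part]


def shuffle_with_combine(partitions, func):
    """reduceByKey-style: merge values per key inside each partition
    before anything is sent over the network."""
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        shuffled.extend(local.items())
    return shuffled


def reduce_side(pairs, func):
    """Final merge of the shuffled pairs on the receiving side."""
    out = {}
    for key, value in pairs:
        out[key] = func(out[key], value) if key in out else value
    return out


def add(x, y):
    return x + y


partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]
plain = shuffle_without_combine(partitions)
combined = shuffle_with_combine(partitions, add)
print(len(plain), len(combined))  # number of records shuffled in each case
print(reduce_side(plain, add) == reduce_side(combined, add))
```

Both paths give the same final result; only the number of shuffled records differs. When the combined value is a list of IDs, as in the question, map-side combining reduces the number of records but not the total number of IDs that must be moved, so the gain is likely to be smaller than for a numeric sum.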

Filter columns in large dataset with Spark

I have a dataset which is 1,000,000 rows by about 390,000 columns. The fields are all binary, either 0 or 1. The data is very sparse.
I've been using Spark to process this data. My current task is to filter the data--I only want data in 1000 columns that have been preselected. This is the current code that I'm using to achieve this task:
val result = bigdata.map(_.zipWithIndex.filter{case (value, index) => selectedColumns.contains(index)})
bigdata is just an RDD[Array[Int]]
However, this code takes quite a while to run. I'm sure there's a more efficient way to filter my dataset that doesn't involve going in and filtering every single row separately. Would loading my data into a DataFrame and manipulating it through the DataFrame API make things faster/easier? Should I be looking into column-store based databases?
You can start with making your filter a little bit more efficient. Please note that:
your RDD contains Array[Int]. It means you can access nth element of each row in O(1) time
#selectedColumns << #columns
Considering these two facts it should be obvious that it doesn't make sense to iterate over all elements for each row not to mention contains calls. Instead you can simply map over selectedColumns
// Optional if selectedColumns are not ordered
val orderedSelectedColumns = selectedColumns.toList.sorted.toArray
bigdata.map(row => orderedSelectedColumns.map(i => row(i)))
Comparing time complexity:
zipWithIndex + filter (assuming the best-case scenario where contains is O(1)) - O(#rows * #columns)
map - O(#rows * #selectedColumns)
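The two per-row strategies can be compared on a single row in plain Python. This is a small sketch with made-up data and hypothetical function names; it mirrors the Scala logic above rather than calling Spark.

```python
def select_by_scan(row, selected):
    """zipWithIndex + filter: visits every element of the row."""
    return [value for index, value in enumerate(row) if index in selected]


def select_by_index(row, ordered_selected):
    """Direct indexing: visits only the selected positions."""
    return [row[i] for i in ordered_selected]


row = [0, 1, 0, 0, 1, 1, 0, 1]
selected = {6, 1, 4}              # preselected column indices
ordered = sorted(selected)        # keep output in column order

by_scan = select_by_scan(row, selected)
by_index = select_by_index(row, ordered)
print(by_scan, by_index)
```

Both return the same values, but the second does work proportional to the number of selected columns (1,000 in the question) rather than to the full row width (about 390,000).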
The easiest way to speed up execution is to increase parallelism by repartitioning. Note that partitionBy is only defined on key-value RDDs; since bigdata is an RDD[Array[Int]], repartition is the applicable call:
bigdata.repartition(numPartitions).foreachPartition(...)
foreachPartition receives an Iterator over which you can map and filter.
numPartitions is a val that you set to the desired number of parallel partitions.

Scala: wrapper for Breeze DenseMatrix for column and row referencing

I am new to Scala. Looking at it as an alternative to MATLAB for some applications.
I would like to program in Scala a wrapping class in order to be able to assign column names ("QuantityQ" && "QuantityP" -> Range) and row names (dates -> Range) to Breeze DenseMatrices (http://www.scalanlp.org/) in order to reference columns and rows.
The usage should resemble Python Pandas or Scala Saddle (http://saddle.github.io).
Saddle is very interesting but its usage is limited to 2D matrices. A huge limitation.
My Ideas:
Columns:
I thought a Map would do the job for columns, but that may not be the best implementation.
Rows:
For rows, I could maintain a separate Breeze vector with timestamps and provide methods that convert dates into timestamps, doing the number crunching through Breeze. This comes with a loss of generality, as a user may want to give arbitrary string names to rows.
For dates, would nscala-time (a Scala wrapper for Joda-Time) be a reasonable choice?
What are the drawbacks of my implementation?
Would you design the data structure differently?
Thank you for your help.