Spark grouped map UDF in Scala

I am trying to write some code that would allow me to compute some action on a group of rows of a dataframe. In PySpark, this is possible by defining a Pandas UDF of type GROUPED_MAP. However, in Scala, I only found a way to create custom aggregators (UDAFs) or classic UDFs.
My temporary solution is to generate a list of keys that encode my groups, which allows me to filter the dataframe and perform my action on each subset of the dataframe. However, this approach is not optimal and very slow.
The actions are performed sequentially, thus taking a lot of time. I could parallelize the loop, but I'm not sure this would show any improvement since Spark is already distributed.
Is there any better way to do what I want?
Edit: I tried parallelizing using Futures, but there was no speed improvement, as expected.
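For reference, a rough sketch of the workaround described above (made-up column names, and myAction is just a placeholder for the real per-group computation):
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("per-group-loop").getOrCreate()
import spark.implicits._

// Made-up input with a "group" column.
val df: DataFrame = spark.range(100).withColumn("group", col("id") % 3)

// Placeholder for the real per-group action.
def myAction(subset: DataFrame): Long = subset.count()

// Collect the distinct keys to the driver, then process each subset sequentially.
val keys = df.select($"group").distinct().as[Long].collect()
val results = keys.map(key => myAction(df.filter(col("group") === key)))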

To the best of my knowledge, this is something that's not possible in Scala. Depending on what you want, I think there could be other ways of applying a transformation to a group of rows in Spark / Scala:
Do a groupBy(...).agg(collect_list(<column_names>)), and use a UDF that operates on the array of values. If desired, you can use a select statement with explode(<array_column>) to revert to the original format
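For this first option, a rough sketch could look like the following (made-up column names, and the UDF body is just a stand-in for the real per-group computation):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, explode, udf}

val spark = SparkSession.builder().appName("grouped-map-sketch").getOrCreate()
import spark.implicits._

// Made-up data: (group, value) pairs.
val df = Seq((0, 1.0), (0, 2.0), (1, 3.0), (1, 4.0)).toDF("group", "value")

// A UDF that sees the whole group as a Seq; here it just scales each value by
// the group sum, as a stand-in for the real per-group computation.
val perGroup = udf { values: Seq[Double] =>
  val total = values.sum
  values.map(_ / total)
}

val result = df
  .groupBy($"group")
  .agg(collect_list($"value").as("values"))
  .withColumn("scaled", perGroup($"values"))
  .select($"group", explode($"scaled").as("scaled_value"))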
Try rewriting what you want to achieve using window functions. You can add a new column with an aggregate expression like so:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, pmod, sum}
import spark.implicits._  // for the 'group / 'id column syntax

val w = Window.partitionBy('group)
val result = spark.range(100)
  .withColumn("group", pmod('id, lit(3)))
  .withColumn("group_sum", sum('id).over(w))

Related

scala rapids using an opaque UDF for a single column dataframe that produces another column

I am trying to acquaint myself with RAPIDS Accelerator-based computation using Spark (3.3) with Scala. The primary obstacle to being able to use the GPU appears to be the black-box nature of UDFs. An automatic solution would be the Scala UDF compiler, but it won't work in cases where there are loops.
Doubt: Would I be able to get a GPU contribution if my dataframe has only one column and produces another column, as this is a trivial case? If so, at least in some cases, the GPU performance benefit could be attained even with no change in the Spark code, and even where the size of the data is much larger than GPU memory. This would be great, as sometimes it would be easy to simply merge all columns into one, making a single column of WrappedArray using concat_ws, which a UDF can simply convert into an Array. For all practical purposes the data would then already be in columnar fashion for the GPU, and only negligible row (on CPU) to column (on GPU) overhead would be needed. The case I am referring to would look like:
val newDf = df.withColumn(colB, opaqueUdf(col("colA")))
Resources: I tried to find good sources/examples to learn the Spark-based approach for using RAPIDS, but it seems to me that only Python-based examples are given. Is there any resource/tutorial that gives some sample examples of converting Spark UDFs to make them RAPIDS compatible?
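For concreteness, a rough sketch of the single-column case being described (made-up data; the loop-heavy UDF is a placeholder for the real opaque computation):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, udf}

val spark = SparkSession.builder().appName("opaque-udf-case").getOrCreate()
import spark.implicits._

val df = Seq(1.0, 2.0, 3.0).toDF("colA")

// Placeholder opaque UDF: the loop is what keeps the automatic
// Scala UDF -> Catalyst translation from applying.
val opaqueUdf = udf { x: Double =>
  var acc = x
  for (_ <- 0 until 1000) acc = math.sqrt(acc + 1.0)
  acc
}

val newDf = df.withColumn("colB", opaqueUdf(col("colA")))

// Merging several columns into one (note: array(...) yields an array column
// the UDF sees as a Seq, whereas concat_ws produces a delimited string).
val merged = Seq((1.0, 2.0), (3.0, 4.0)).toDF("c1", "c2")
  .withColumn("all", array(col("c1"), col("c2")))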
Yes @Quiescent, you are right. The Scala UDF -> Catalyst compiler can be used for simple UDFs that have a direct translation to Catalyst. Supported operations can be found here: https://nvidia.github.io/spark-rapids/docs/additional-functionality/udf-to-catalyst-expressions.html. Loops are definitely not supported in this automatic translation, because there isn't a direct expression that we can translate them to.
It all depends on how heavy opaqueUdf is, and how many rows are in your column. The GPU is going to be really good if there are many rows and the operation in the UDF is costly (say it's doing many arithmetic or string operations successively on that column). I am not sure why you want to "merge all columns into one", so can you clarify why you want to do that? On the conversion to Array, is that the purpose of the UDF, or are you wanting to take in N columns -> perform some operation likely involving loops -> produce an Array?
Another approach to accelerating UDFs with GPUs is to use our RAPIDS Accelerated UDFs. These are Java or Scala UDFs that you implement purposely, and they use the cuDF API directly. The Accelerated UDF document also links to our spark-rapids-examples repo, which has information on how to write Java or Scala UDFs in this way; please take a look there as well.

Dynamic IIF in Spark

I have a metadata-driven Spark transformation engine that performs a set of operations on data frames stored in a Scala Map[String, DataFrame].
I have a scenario where the user has an IIF (ternary if) to be implemented:
if(<condition>,<iftrue>,<iffalse>)
if(cola=1,to_date(colb),null)
My approach: I am making use of a where clause to evaluate the condition and a UDF to perform any functions if needed (like to_date, to_decimal, etc.). For the above code:
dfMap(source).where(<condn>).withColumn("<tarCol>",CustomUDF(<iftrue>))
I have read that UDFs impact performance in Spark, and I am facing the same. Please suggest any alternative.
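For reference, a rough sketch of the approach described above applied to the if(cola=1,to_date(colb),null) example (made-up data; customUdf is just a stand-in for CustomUDF that parses a date, where the real engine would dispatch to to_date, to_decimal, etc. based on metadata):
import java.sql.Date
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().appName("iif-sketch").getOrCreate()
import spark.implicits._

val source = "src"
val dfMap = Map(source -> Seq((1, "2020-01-01"), (2, "2020-02-02")).toDF("cola", "colb"))

// Stand-in for CustomUDF.
val customUdf = udf { s: String => Date.valueOf(s) }

// if(cola = 1, to_date(colb), null) expressed as where + withColumn + UDF:
val out = dfMap(source)
  .where(col("cola") === 1)
  .withColumn("tarCol", customUdf(col("colb")))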

Pre-cogrouping tables on HDFS and reading in Spark with zero shuffling

Context
I have two tables that I am joining/cogrouping as part of my spark jobs, which incurs a large shuffle each time I run a job. I want to amortise the cost across all jobs by storing cogrouped data once, and use the already cogrouped data as part of my regular Spark runs to avoid the shuffle.
To try and achieve this, I have some data in HDFS stored in parquet format. I am using Parquet repeated fields to achieve the following schema
(date, [aRecords], [bRecords])
Where [aRecords] indicates an array of aRecord. I am also partitioning the data by date on HDFS using the usual write.partitionBy($"date").
In this situation, aRecords and bRecords appear to be effectively cogrouped by date. I can perform operations like the following:
import java.sql.Date
import spark.implicits._  // for the Dataset encoder

case class CogroupedData(date: Date, aRecords: Array[Int], bRecords: Array[Int])

val cogroupedData = spark.read.parquet("path/to/data").as[CogroupedData]
// Dataset[(Date, Int)] where the Ints from the two sides are multiplied pairwise
val results = cogroupedData
  .flatMap(el => el.aRecords.zip(el.bRecords).map(pair => (el.date, pair._1 * pair._2)))
and get the results that I get from using the equivalent groupByKey operations on two separate tables for aRecords and bRecords keyed by date.
The difference between the two is that I avoid a shuffle with the already cogrouped data; the cogrouping cost is amortised by persisting it on HDFS.
Question
Now for the question. From the cogrouped dataset, I would like to derive the two grouped datasets so I can use standard Spark SQL operators (like cogroup, join etc) without incurring a shuffle. This seems possible since the first code example works, but Spark still insists on hashing/shuffling data when I join/groupByKey/cogroup etc.
Take the below code sample. I expect there is a way that we can run the below without incurring a shuffle when the join is performed.
val cogroupedData = spark.read.parquet("path/to/data").as[CogroupedData]
val aRecords = cogroupedData
.flatMap(cog => cog.aRecords.map(a => (cog.date,a)))
val bRecords = cogroupedData
.flatMap(cog => cog.bRecords.map(b => (cog.date,b)))
val joined = aRecords.join(bRecords,Seq("date"))
Looking at the literature, if cogroupedData has a known partitioner, then the operations that follow should not incur a shuffle since they can use the fact that the RDD is already partitioned and preserve the partitioner.
What I think I need to achieve this is to get a cogroupedData Dataset/rdd with a known partitioner without incurring a shuffle.
Other things I have tried already:
Hive metadata - Works fine for simple joins, but only optimises the initial join and not subsequent transformations. Hive also does not help with cogroups at all
Anyone have any ideas?
You've made two incorrect assumptions here.
Today (Spark 2.3) Spark doesn't use partitioning information for query optimization beyond partition pruning. Only bucketing is used. For details see Does Spark know the partitioning key of a DataFrame?.
Conclusion: To have any opportunity to optimize you have to use metastore and bucketing.
In general Spark cannot optimize operations on "strongly typed" datasets. For details see Spark 2.0 Dataset vs DataFrame and Why is predicate pushdown not used in typed Dataset API (vs untyped DataFrame API)?
The right way to do it is to:
Use bucketing.
val n: Int = 10  // number of buckets - placeholder value, tune for your data
someDF.write.bucketBy(n, "date").saveAsTable("df")  // someDF: your source DataFrame
Drop functional API in favor of SQL API:
import org.apache.spark.sql.functions.explode
import spark.implicits._  // for the $"..." column syntax

val df = spark.table("df")
val adf = df.select($"date", explode($"aRecords").alias("aRecords"))
val bdf = df.select($"date", explode($"bRecords").alias("bRecords"))
adf.join(bdf, Seq("date"))
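To check whether the bucketing is actually being picked up (assuming both sides are read from the bucketed table df above), you can inspect the physical plan:
adf.join(bdf, Seq("date")).explain()
// If bucketing is used, the join should not have Exchange (shuffle) nodes feeding it;
// if you still see Exchange on either side, the bucketing metadata was not picked up.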

Spark Dataframe performance for overwrite

Are there any performance differences or considerations between the following two PySpark statements:
df5 = df5.drop("Ratings")
and
df6 = df5.drop("Ratings")
I'm not specifically targeting the drop function; this applies to any operation. I was wondering what happens under the hood when you overwrite a variable compared to creating a new one.
Also, are the behavior and performance considerations the same if this were an RDD and not a DataFrame?
No, there won't be any difference in the operation.
In the case of NumPy, there is a flags attribute that shows whether an array owns its data or not:
variable_name.flags
In the case of PySpark, a DataFrame is immutable, and every change creates a new DataFrame. How does it do this? A DataFrame is stored in a distributed fashion, so moving data around in memory is costly. Therefore, rather than copying, the ownership of the data is handed from one DataFrame to another, in particular the information about where the data is stored.
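A tiny Scala sketch of the same point (the behaviour is analogous in PySpark): drop() returns a new, lazily evaluated DataFrame and never mutates the original, so overwriting the variable versus binding a new name makes no difference to the work Spark does.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("immutability-sketch").getOrCreate()

val original = spark.range(10).withColumn("Ratings", lit(5))
val dropped  = original.drop("Ratings")  // new plan without the column
// `original` still has "Ratings"; only the name you keep around differs.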
Also, a DataFrame is generally a better choice than an RDD. Here is a good blog post on the topic:
Dataframe RDD and dataset

Group data based on multiple columns in Spark using Scala's API

I have an RDD and want to group data based on multiple columns. For a large dataset, Spark cannot cope using combineByKey, groupByKey, reduceByKey or aggregateByKey; these give heap space errors. Can you suggest another method for resolving this using Scala's API?
You may want to use treeReduce() for doing an incremental reduce in Spark. However, your hypothesis that Spark cannot work on a large dataset is not true; I suspect you just don't have enough partitions in your data, so maybe a repartition() is what you need.
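For example, a rough sketch of pushing the grouping columns into a composite key (made-up data and column meanings); instead of a separate repartition(), you can also pass the partition count directly to reduceByKey:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multi-column-grouping").getOrCreate()
val sc = spark.sparkContext

// Made-up records: (country, year, value).
val rdd = sc.parallelize(Seq(
  ("us", "2021", 1.0),
  ("us", "2021", 2.0),
  ("de", "2022", 3.0)
))

val result = rdd
  .map { case (country, year, value) => ((country, year), value) }  // composite key from the grouping columns
  .reduceByKey(_ + _, 200)  // request more output partitions to reduce per-task memory pressure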