Efficient way to join and group in Spark and minimize shuffling - Scala

I have two large DataFrames with around a couple of million records each.
val df1 = Seq(
("k1a","k2a", "g1x","g2x")
,("k1b","k2b", "g1x","g2x")
,("k1c","k2c", "g1x","g2y")
,("k1d","k2d", "g1y","g2y")
,("k1e","k2e", "g1y","g2y")
,("k1f","k2f", "g1z","g2y")
).toDF("key1", "key2", "grp1","grp2")
val df2 = Seq(
("k1a","k2a", "v4a")
,("k1b","k2b", "v4b")
,("k1c","k2c", "v4c")
,("k1d","k2d", "v4d")
,("k1e","k2e", "v4e")
,("k1f","k2f", "v4f")
).toDF("key1", "key2", "fld4")
I am trying to join and perform a groupBy as below, but it is taking forever to produce the result. There are around one million unique grp1+grp2 combinations in df1.
val df3 = df1.join(df2,Seq("key1","key2"))
val df4 = df3.groupBy("grp1","grp2").agg(collect_list(struct($"key1",$"key2")).as("dups")).filter("size(dups)>1")
Is there a way to reduce shuffling? Is mapPartitions the right approach for these two scenarios? Can anyone suggest an efficient way, with an example?
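One direction that often helps with this pattern (a sketch under assumptions, not a verified fix for this exact workload: the bucket count of 200 and the *_bucketed table names are made up) is to bucket both inputs by the join keys, so the sort-merge join reads co-located, pre-sorted buckets instead of shuffling; the groupBy on grp1/grp2 still needs one shuffle because the grouping keys differ from the join keys:
import org.apache.spark.sql.functions.{collect_list, size, struct}
// Write both inputs bucketed and sorted by the join keys (bucket count is a guess).
df1.write.bucketBy(200, "key1", "key2").sortBy("key1", "key2").saveAsTable("df1_bucketed")
df2.write.bucketBy(200, "key1", "key2").sortBy("key1", "key2").saveAsTable("df2_bucketed")
// The join on key1/key2 can now avoid a shuffle; only the groupBy shuffles.
val joined = spark.table("df1_bucketed").join(spark.table("df2_bucketed"), Seq("key1", "key2"))
val dups = joined
  .groupBy("grp1", "grp2")
  .agg(collect_list(struct($"key1", $"key2")).as("dups"))
  .filter(size($"dups") > 1)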

Related

Subquery vs DataFrame filter function in Spark

I am running the below Spark SQL with a subquery.
val df = spark.sql("""select * from employeesTableTempview where dep_id in (select dep_id from departmentTableTempview)""")
df.count()
I also ran the same query using the DataFrame API, as below. Let's assume we read the employee table and the department table as DataFrames named empDF and DepDF respectively:
val depidList = DepDF.map(x => x.getString(0)).collect().toList
val empdf2 = empDF.filter(col("dep_id").isin(depidList:_*))
empdf2.count
Of the above two scenarios, which one gives better performance, and why? Please help me understand these scenarios in Spark/Scala.
I can give you the classic answer: it depends :D
Let's take a look at the first case. I prepared a similar example:
import org.apache.spark.sql.functions._
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
val data = Seq(("test", "3"),("test", "3"), ("test2", "5"), ("test3", "7"), ("test55", "86"))
val data2 = Seq(("test", "3"),("test", "3"), ("test2", "5"), ("test3", "6"), ("test33", "76"))
val df1 = data.toDF("name", "dep_id")
val df2 = data2.toDF("name", "dep_id")
df1.createOrReplaceTempView("employeesTableTempview")
df2.createOrReplaceTempView("departmentTableTempview")
val result = spark.sql("select * from employeesTableTempview where dep_id in (select dep_id from departmentTableTempview)")
result.count
I am setting autoBroadcastJoinThreshold to -1 because I assume that your datasets are going to be bigger than this parameter's default of 10 MB.
This SQL query generates a plan with a sort-merge join (SMJ), which will be the case most of the time for datasets bigger than 10 MB. It requires the data to be shuffled and then sorted, so it is quite a heavy operation.
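Without a screenshot of the plan, the same check can be reproduced by printing the physical plan (the exact plan text varies by Spark version):
// With auto-broadcast disabled, the IN sub-query is planned as a
// left-semi SortMergeJoin on dep_id; explain() prints that plan.
result.explain()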
Now let's check option 2 (the first lines of code are the same as above):
val depidList = df1.map(x=>x.getString(1)).collect().toList
val empdf2 = df2.filter(col("dep_id").isin(depidList:_*))
empdf2.count
For this option the plan is different. Obviously there is no join, but there are two separate SQL queries: the first reads the DepDF dataset and collects one column as a list, and in the second that list is used to filter the data in the empDF dataset.
When DepDF is relatively small this should be fine, but if you need a more generic solution you may want to stick with the sub-query, which is going to resolve to a join. You can also use a join directly on your DataFrames with the Spark DataFrame API.
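A minimal sketch of that last suggestion, reusing the empDF/DepDF names from the question:
// Same semantics as the IN sub-query, expressed directly on the DataFrames.
// A left-semi join keeps the work distributed instead of collecting
// dep_id values to the driver.
val empdf3 = empDF.join(DepDF.select("dep_id"), Seq("dep_id"), "left_semi")
empdf3.count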

Unexpected caching behaviour for groupBy/join operations in spark

I have been trying to do multiple aggregations on a base DataFrame, let's say df1.
When I run the following code:
df1.cache()
val df2 = df1.groupBy(col("col1"),col("col2") as "col6").agg(sum("col3"))
val df3 = df1.groupBy(col("col1"),col("col4") as "col6").agg(sum("col5"))
val df4 = df2.join(df3,Seq("col1","col6"),"outer")
df4.count()
In the generated query plan, and on the SQL tab of the Spark UI, I see that df2 is an in-memory table scan of df1, while the complete DAG of df1 is executed to generate df3.
When I additionally alias col1 before doing the join:
df1.cache()
val df2 = df1.groupBy(col("col1") as "col1",col("col2") as "col6").agg(sum("col3"))
val df3 = df1.groupBy(col("col1") as "col1",col("col4") as "col6").agg(sum("col5"))
val df4 = df2.join(df3,Seq("col1","col6"),"outer")
df4.count()
Both DataFrames are now in-memory table scans.
I didn't think this would make a difference; can someone please explain why this could be happening?
PS: One more thing I noticed is that without the join, the query plans of both DataFrames are in-memory table scans.
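For anyone digging into this, one way to see what Spark matches against the cache is to compare the plans directly (a sketch; the output format differs across versions):
// Compare the analyzed plans (what the cache manager matches on) with the
// physical plans (where InMemoryTableScan does or does not appear).
println(df2.queryExecution.analyzed)
println(df3.queryExecution.analyzed)
df2.explain()
df3.explain()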

how to increase performance on Spark distinct() on multiple columns

Could you please suggest an alternative way of implementing distinct on a Spark DataFrame?
I tried both SQL and Spark distinct, but given the dataset size (>2 billion rows) it fails on the shuffle.
If I increase the nodes and memory to >250 GB, the process runs for a long time (more than 7 hours).
val df = spark.read.parquet(out)
val df1 = df
  .select($"ID", $"col2", $"suffix", $"date", $"year", $"codes")
  .distinct()
// pair each code with its position so both can be read back after the explode
val df2 = df1.withColumn("codes", expr("transform(codes, (c, s) -> struct(c AS c, s AS s))"))
df2.createOrReplaceTempView("df2")
val df3 = spark.sql(
  """SELECT
       ID, col2, suffix,
       d.s AS seq,
       d.c AS code,
       year, date
     FROM df2
     LATERAL VIEW explode(codes) exploded_table AS d
  """)
df3
  .repartition(600, col("year"), col("date"))
  .write
  .mode("overwrite")
  .partitionBy("year", "date")
  .save(OutDir)
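A couple of hedged alternatives worth trying (neither is guaranteed for this data; the column subset and the partition count below are assumptions):
// If a subset of columns already identifies a row uniquely, deduplicating on
// that subset lets map-side partial aggregation collapse more rows before the shuffle.
val deduped = df.dropDuplicates("ID", "date")
// Raising shuffle parallelism gives each reduce task less data to sort and
// spill during the distinct's exchange stage.
spark.conf.set("spark.sql.shuffle.partitions", "2000")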

Weird behavior of DataFrame operations

Consider the code:
val df1 = spark.table("t1").filter(col("c1")=== lit(127))
val df2 = spark.sql("select x,y,z from ORCtable")
val df3 = df1.join(df2.toDF(df2.columns.map(_ + "_R"): _*),
trim(upper(coalesce(col("y_R"), lit("")))) === trim(upper(coalesce(col("a"), lit("")))), "leftouter")
df3.select($"y_R",$"z_R").show(500,false)
This produces the warning WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again. The code then fails with java.lang.OutOfMemoryError: GC overhead limit exceeded.
But if I run the following code:
val df1 = spark.table("t1").filter(col("c1")=== lit(127))
val df2 = spark.sql("select x,y,z from ORCtable limit 2000000")//only difference here
//ORC table has 1651343 rows so doesn't exceed limit 2000000
val df3 = df1.join(df2.toDF(df2.columns.map(_ + "_R"): _*),
trim(upper(coalesce(col("y_R"), lit("")))) === trim(upper(coalesce(col("a"), lit("")))), "leftouter")
df3.select($"y_R",$"z_R").show(500,false)
This produces the correct output. I'm at a loss as to why this happens and what changes. Can someone help me make sense of this?
To answer my own question: the Spark physical execution plans are different for the two ways of generating the same DataFrame, which can be checked by calling the .explain() method.
The first way uses a broadcast-hash join, which causes the java.lang.OutOfMemoryError: GC overhead limit exceeded, whereas the latter runs a sort-merge join, which is typically slower but does not strain the garbage collector as much.
This difference in physical execution plans is introduced by the additional filter operation on the df2 DataFrame.
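A small illustration of that check (a sketch; only .explain() itself comes from the answer above):
// Inspect which join strategy the planner picked for each variant.
df3.explain()  // look for BroadcastHashJoin vs SortMergeJoin
// To steer the planner toward a sort-merge join without relying on the extra
// LIMIT, auto-broadcasting can be disabled before df3 is built:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)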

How to divide dataset in two parts based on filter in Spark-scala

Is it possible to divide a DataFrame into two parts using a single filter operation? For example,
let's say df has the below records:
UID Col
1 a
2 b
3 c
if I do
val df1 = df.filter($"UID" <=> 2)
can I save the filtered and non-filtered records into different RDDs in a single operation?
df1 can have records where uid = 2
df2 can have records with uid 1 and 3
If you're interested only in saving data you can add an indicator column to the DataFrame:
val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("uid", "col")
val dfWithInd = df.withColumn("ind", $"uid" <=> 2)
and use it as a partition column for the DataFrameWriter with one of the supported formats (as of 1.6: Parquet, text, and JSON):
dfWithInd.write.partitionBy("ind").parquet(...)
It will create two separate directories (ind=false, ind=true) on write.
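Reading one side back afterwards is then a partition-pruned scan (a sketch; the output path is a placeholder):
// Partition pruning means only the ind=true directory is actually read.
val matched = spark.read.parquet("/path/to/output").filter($"ind" === true)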
In general though, it is not possible to yield multiple RDDs or DataFrames from a single transformation. See How to split a RDD into two or more RDDs?
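If you really need both halves as DataFrames rather than as files, a common workaround (not part of the original answer) is to cache once and filter twice:
// The source is scanned from the cache instead of being recomputed,
// but these are still two separate filters, hence two jobs when acted on.
df.cache()
val matched    = df.filter($"uid" <=> 2)
val notMatched = df.filter(!($"uid" <=> 2))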