Unexpected caching behaviour for groupBy/join operations in spark - scala

I have been trying to do multiple aggregations on a base data frame lets say df1.
When I run the following code
df1.cache()
val df2 = df1.groupBy(col("col1"),col("col2") as "col6").agg(sum("col3"))
val df3 = df1.groupBy(col("col1"),col("col4") as "col6").agg(sum("col5"))
val df4 = df2.join(df3,Seq("col1","col6"),"outer")
df4.count()
In the query plan generated and on the SQL tab of the spark UI. I see that df2 is an in memory table scan of df1 while the complete DAG of d1 is executed for df3 generation.
When I rename the column1 while doing the join
df1.cache()
val df2 = df1.groupBy(col("col1") as "col1",col("col2") as "col6").agg(sum("col3"))
val df3 = df1.groupBy(col("col1") as "col1",col("col4") as "col6").agg(sum("col5"))
val df4 = df2.join(df3,Seq("col1","col6"),"outer")
df4.count()
Both the DFs are In memory table scans.
I didn't think this would make a difference, can someone please explain me why this could be happening.
PS: Also one more thing that i noticed is that without the join queryPlans of both the df's are inMemory table scan.

Related

Subquery vs Dataframe filter function in spark

I am running the below spark SQL with the subquery.
val df = spark.sql("""select * from employeesTableTempview where dep_id in (select dep_id from departmentTableTempview)""")
df.count()
I also run the same with the help of dataframe functional way like below, Let's assume we read the employee table and department table as a dataframes and their names should be empDF and DepDF respectively,
val depidList = DepDF.map(x=>x(0).string).collect().toList()
val empdf2 = empDF.filter(col("dep_id").isin(depidList:_*))
empdf2.count
In these above two scenarios, which one gives better performance and why? Please help me to understand this scenarios in spark scala.
I can give you classic answer: it depends :D
Lets take a look at first case. I prepared similar example:
import org.apache.spark.sql.functions._
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
val data = Seq(("test", "3"),("test", "3"), ("test2", "5"), ("test3", "7"), ("test55", "86"))
val data2 = Seq(("test", "3"),("test", "3"), ("test2", "5"), ("test3", "6"), ("test33", "76"))
val df1 = data.toDF("name", "dep_id")
val df2 = data2.toDF("name", "dep_id")
df1.createOrReplaceTempView("employeesTableTempview")
df2.createOrReplaceTempView("departmentTableTempview")
val result = spark.sql("select * from employeesTableTempview where dep_id in (select dep_id from departmentTableTempview)")
result.count
I am setting autoBroadcastJoinThreshold to -1 because i assume that your datasets are going to be bigger than default 10mb for this parameter
This Sql query generates this plan:
As you can see spark is performing a SMJ which will be a case most of the time for datasets bigger than 10mb. This requires data to be shuffled and then sorted so its quiet heavy operation
Now lets check option2 (first lines of codes are the same as previously):
val depidList = df1.map(x=>x.getString(1)).collect().toList
val empdf2 = df2.filter(col("dep_id").isin(depidList:_*))
empdf2.count
For this option plan is different. You dont have the join obviously but there are two separate sqls. First is for reading DepDF dataset and then collecting one column as a list. In second sql this list is used to filter the data in empDF dataset.
When DepDF is relatively small it should be fine, but if you need more generic solution you may stick to sub-query which is going to resolve to join. You can also use join directly on your dataframes with Spark df api

how to increase performance on Spark distinct() on multiple columns

Could you please suggest alternative way of implementing distinct in spark data frame.
I tried both SQL and spark distinct but since the dataset size (>2 Billion) it fails on the shuffle .
If I increase the node and memory to >250GB, process run for a longe time (more than 7 hours).
val df = spark.read.parquet(out)
val df1 = df.
select($"ID", $"col2", $"suffix",
$"date", $"year", $"codes").distinct()
val df2 = df1.withColumn("codes", expr("transform(codes, (c,s) -> (d,s) )"))
df2.createOrReplaceTempView("df2")
val df3 = spark.sql(
"""SELECT
ID, col2, suffix
d.s as seq,
d.c as code,
year,date
FROM
df2
LATERAL VIEW explode(codes) exploded_table as d
""")
df3.
repartition(
600,
List(col("year"), col("date")): _*).
write.
mode("overwrite").
partitionBy("year", "date").
save(OutDir)

Filter Spark Dataframe with list of values in Scala

I am trying to create a dataframe from hive table using SparkSession like below. Once created I am filtering the rows by a list of Ids.
val myDF = spark.sql("select * from myhivetable")
val someDF = mfiDF.where(mfiDF("id").isin(myList:_*))
Instead of this approach is there a way I can query the hive table as below:
val myDF = spark.sql("select * from myhivetable").where (("id").isin(myList:_*))
When I try like this I am getting a compilation error.
Could someone suggest a best approach for this. Thanks.
You could also do an inner join to remove unwanted ids, something like below may work.
val ids = sc.parallelize(myList).toDF("id")
someDF.join(ids, ids.id === someDF.id)

Efficient way to join and group in spark and minimize shuffling

I have two large dataframes with around couple of million records in each.
val df1 = Seq(
("k1a","k2a", "g1x","g2x")
,("k1b","k2b", "g1x","g2x")
,("k1c","k2c", "g1x","g2y")
,("k1d","k2d", "g1y","g2y")
,("k1e","k2e", "g1y","g2y")
,("k1f","k2f", "g1z","g2y")
).toDF("key1", "key2", "grp1","grp2")
val df2 = Seq(
("k1a","k2a", "v4a")
,("k1b","k2b", "v4b")
,("k1c","k2c", "v4c")
,("k1d","k2d", "v4d")
,("k1e","k2e", "v4e")
,("k1f","k2f", "v4f")
).toDF("key1", "key2", "fld4")
I am trying to join and perform a groupBy as below, but it is taking forever for the result. There are around one million unique instances of grp1+grp2 data in df1.
val df3 = df1.join(df2,Seq("key1","key2"))
val df4 = df3.groupBy("grp1","grp2").agg(collect_list(struct($"key1",$"key2")).as("dups")).filter("size(dups)>1")
Is there a way to reduce shuffling? Is mapPartitions right approach for these two scenarios? Can anyone suggest an efficient way with an example.

Rename column names when select from dataframe

I have 2 dataframes : df1 and df2 and I am left joining both of them on id column and saving it to another dataframe named df3. Below is the code that I am using, which works fine as expected.
val df3 = df1.alias("tab1").join(df2.alias("tab2"),Seq("id"),"left_outer").select("tab1.*","tab2.name","tab2.dept","tab2.descr");
I would like to rename the tab2.descr column to dept_full_description within the above statement.
I am aware that I could create a seq val like below and use toDF method
val columnsRenamed = Seq("id", "empl_name", "name","dept","dept_full_description") ;
df4 = df3.toDF(columnsRenamed: _*);
Is there any other way to to aliasing in the first statement itself. My end goal is not to list about 30-40 columns explicitly .
I'd rename before join:
df1.alias("tab1").join(
df2.withColumnRenamed("descr", "dept_full_description").alias("tab2"),
Seq("id"), "left_outer")