Subquery vs Dataframe filter function in spark - scala

I am running the below spark SQL with the subquery.
val df = spark.sql("""select * from employeesTableTempview where dep_id in (select dep_id from departmentTableTempview)""")
df.count()
I also ran the same query the DataFrame functional way, like below. Let's assume we read the employee and department tables as dataframes named empDF and DepDF respectively:
val depidList = DepDF.map(x => x.getString(0)).collect().toList
val empdf2 = empDF.filter(col("dep_id").isin(depidList:_*))
empdf2.count
In the above two scenarios, which one gives better performance and why? Please help me understand these scenarios in Spark Scala.

I can give you the classic answer: it depends :D
Let's take a look at the first case. I prepared a similar example:
import org.apache.spark.sql.functions._
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
val data = Seq(("test", "3"),("test", "3"), ("test2", "5"), ("test3", "7"), ("test55", "86"))
val data2 = Seq(("test", "3"),("test", "3"), ("test2", "5"), ("test3", "6"), ("test33", "76"))
val df1 = data.toDF("name", "dep_id")
val df2 = data2.toDF("name", "dep_id")
df1.createOrReplaceTempView("employeesTableTempview")
df2.createOrReplaceTempView("departmentTableTempview")
val result = spark.sql("select * from employeesTableTempview where dep_id in (select dep_id from departmentTableTempview)")
result.count
I am setting autoBroadcastJoinThreshold to -1 because I assume that your datasets are going to be bigger than the default 10 MB for this parameter.
This SQL query generates a plan with a sort-merge join (SMJ), which will be the case most of the time for datasets bigger than 10 MB. It requires the data to be shuffled and then sorted, so it's quite a heavy operation.
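The plan screenshot from the original post is not reproduced here, but you can inspect the same thing yourself by printing the physical plan (just a quick check, not part of the original snippet):
// With broadcasting disabled, the IN subquery is rewritten to a left-semi
// sort-merge join, so the plan should show SortMergeJoin and Exchange (shuffle) nodes.
result.explain()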
Now let's check option 2 (the first lines of code are the same as above):
val depidList = df2.map(x => x.getString(1)).collect().toList
val empdf2 = df1.filter(col("dep_id").isin(depidList:_*))
empdf2.count
For this option the plan is different. There is obviously no join, but there are two separate SQL jobs. The first reads the DepDF dataset and collects one column as a list. In the second, this list is used to filter the data in the empDF dataset.
When DepDF is relatively small this should be fine, but if you need a more generic solution you may want to stick to the sub-query, which resolves to a join. You can also use a join directly on your dataframes with the Spark DataFrame API.
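For example, a minimal sketch of the same filter expressed as a DataFrame-level semi join (using the empDF and DepDF names from the question; empdf3 is just an illustrative name):
// left_semi keeps only the employee rows whose dep_id exists in DepDF,
// without adding any department columns -- the same semantics as the IN subquery.
val empdf3 = empDF.join(DepDF, Seq("dep_id"), "left_semi")
empdf3.count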

Related

How to drop specific column and then select all columns from spark dataframe

I have a scenario here - I have 30 columns in one dataframe, need to drop a specific column, select the remaining columns and put them into another dataframe. How can I achieve this? I tried the below.
val df1: DataFrame = df2.as("a").join(df3.as("b"), col("a.key") === col("b.key"), "inner").drop("a.col1")
.select("a.*")
When I do a show on df1, it still shows col1. Any advice on resolving this?
drop requires a column name without the table alias, so you can try:
val df1 = df2.as("a")
.join(df3.as("b"), col("a.key") === col("b.key"), "inner")
.drop("col1")
.select("a.*")
Or instead of dropping the column, you can filter the columns to be selected:
val df1 = df2.as("a")
.join(df3.as("b"), col("a.key") === col("b.key"), "inner")
.select(df2.columns.filterNot(_ == "col1").map("a." + _): _*)
This really just seems like you need to use a "left_semi" join.
val df1 = df2.drop('col1).join(df3, df2("key") === df3("key"), "left_semi")
If key is an actual column you can simplify the syntax even further
val df1 = df2.drop('col1).join(df3, Seq("key"), "left_semi")
The best syntax depends on the details of what your real data looks like. If you need to refer to col1 in df2 specifically because there's ambiguity, then use df2("col1")
left_semi joins keep all the columns from the left table, for the rows that find a match in the right table.
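As a tiny standalone illustration of that behaviour (made-up data, run in a spark-shell where spark.implicits._ is in scope):
val left = Seq((1, "a"), (2, "b"), (3, "c")).toDF("key", "col1")
val right = Seq((1, "x"), (3, "y")).toDF("key", "other")
// The result keeps only left's columns (key, col1) and only the rows
// whose key also appears in right (1 and 3); no columns from right are added.
left.join(right, Seq("key"), "left_semi").show()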

Unexpected caching behaviour for groupBy/join operations in spark

I have been trying to do multiple aggregations on a base data frame, let's say df1.
When I run the following code
df1.cache()
val df2 = df1.groupBy(col("col1"),col("col2") as "col6").agg(sum("col3"))
val df3 = df1.groupBy(col("col1"),col("col4") as "col6").agg(sum("col5"))
val df4 = df2.join(df3,Seq("col1","col6"),"outer")
df4.count()
In the generated query plan, on the SQL tab of the Spark UI, I see that df2 is an in-memory table scan of df1, while the complete DAG of df1 is executed to generate df3.
When I alias col1 in the groupBy before doing the join:
df1.cache()
val df2 = df1.groupBy(col("col1") as "col1",col("col2") as "col6").agg(sum("col3"))
val df3 = df1.groupBy(col("col1") as "col1",col("col4") as "col6").agg(sum("col5"))
val df4 = df2.join(df3,Seq("col1","col6"),"outer")
df4.count()
Both DFs are in-memory table scans.
I didn't think this would make a difference; can someone please explain why this could be happening?
PS: One more thing that I noticed is that without the join, the query plans of both DFs are in-memory table scans.
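One way to check this outside the Spark UI is to print the physical plans directly (a small sketch using the same dataframe names; when the cache is picked up, the plan contains an InMemoryTableScan node instead of df1's full lineage):
df2.explain()
df3.explain()
df4.explain()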

Filter Spark Dataframe with list of values in Scala

I am trying to create a dataframe from a Hive table using SparkSession like below. Once created, I am filtering the rows by a list of ids.
val myDF = spark.sql("select * from myhivetable")
val someDF = myDF.where(myDF("id").isin(myList:_*))
Instead of this approach is there a way I can query the hive table as below:
val myDF = spark.sql("select * from myhivetable").where (("id").isin(myList:_*))
When I try it like this I get a compilation error.
Could someone suggest the best approach for this? Thanks.
You could also do an inner join to remove the unwanted ids; something like the below may work.
val ids = sc.parallelize(myList).toDF("id")
myDF.join(ids, ids("id") === myDF("id"))
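For completeness, the inline version from the question compiles once the column reference is wrapped in col (a sketch; myList is assumed to be the same sequence of ids):
import org.apache.spark.sql.functions.col
// "id" on its own is just a String; col("id") turns it into a Column, which has isin.
val filteredDF = spark.sql("select * from myhivetable").where(col("id").isin(myList:_*))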

Efficient way to join and group in spark and minimize shuffling

I have two large dataframes with around a couple of million records each.
val df1 = Seq(
("k1a","k2a", "g1x","g2x")
,("k1b","k2b", "g1x","g2x")
,("k1c","k2c", "g1x","g2y")
,("k1d","k2d", "g1y","g2y")
,("k1e","k2e", "g1y","g2y")
,("k1f","k2f", "g1z","g2y")
).toDF("key1", "key2", "grp1","grp2")
val df2 = Seq(
("k1a","k2a", "v4a")
,("k1b","k2b", "v4b")
,("k1c","k2c", "v4c")
,("k1d","k2d", "v4d")
,("k1e","k2e", "v4e")
,("k1f","k2f", "v4f")
).toDF("key1", "key2", "fld4")
I am trying to join and perform a groupBy as below, but it is taking forever to produce the result. There are around one million unique grp1+grp2 combinations in df1.
val df3 = df1.join(df2,Seq("key1","key2"))
val df4 = df3.groupBy("grp1","grp2").agg(collect_list(struct($"key1",$"key2")).as("dups")).filter("size(dups)>1")
Is there a way to reduce shuffling? Is mapPartitions the right approach for these two scenarios? Can anyone suggest an efficient way, with an example?
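One direction that avoids materializing the collect_list arrays (a sketch only, not a verified fix for this data size, and it produces one row per duplicated key pair rather than one row per group) is to flag duplicated grp1+grp2 groups with a window count:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// Count the rows per (grp1, grp2) group after the join and keep only rows
// belonging to groups with more than one record; this still shuffles for the
// window, but does not build large arrays of structs.
val w = Window.partitionBy("grp1", "grp2")
val dups = df1.join(df2, Seq("key1", "key2"))
  .withColumn("cnt", count(lit(1)).over(w))
  .filter(col("cnt") > 1)
  .select("grp1", "grp2", "key1", "key2")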

spark scala reduceByKey dataframe operation

I'm trying to do a count in Scala with a dataframe. My data has 3 columns and I've already loaded the data and split it by tab. So I want to do something like this:
val file1 = file.map(line => line.split("\t"))
val x = file1.map(line => (line(0), line(2).toInt)).reduceByKey(_+_, 1)
I want to put the data in a dataframe, and I'm having some trouble with the syntax:
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
Spark needs to know the schema of the DF.
There are many ways to specify the schema; here is one option:
import spark.implicits._ // needed for toDF

val df = file
  .map(line => line.split("\t"))
  .map(l => (l(0), l(1).toInt)) // at this point Spark knows the number of columns and their types
  .toDF("a", "b")               // give the columns names for ease of use

df
  .groupBy("a")
  .count()
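Another option, if the data sits in a tab-separated file, is to let the DataFrame reader handle the parsing and schema (a sketch; the path and column names are placeholders):
// Hypothetical path and column names -- adjust to the real data.
val df = spark.read
  .option("delimiter", "\t")
  .option("inferSchema", "true")
  .csv("/path/to/data.tsv")
  .toDF("a", "b", "c") // the question mentions 3 columns

df.groupBy("a").count()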