How to increase performance of Spark distinct() on multiple columns - scala

Could you please suggest an alternative way of implementing distinct on a Spark DataFrame?
I tried both SQL and Spark's distinct(), but because of the dataset size (>2 billion records) it fails on the shuffle.
If I increase the nodes and memory to >250 GB, the process runs for a long time (more than 7 hours).
val df = spark.read.parquet(out)
val df1 = df
  .select($"ID", $"col2", $"suffix", $"date", $"year", $"codes")
  .distinct()
// Pair each code (c) with its index (s); the query below reads them back as d.c and d.s.
val df2 = df1.withColumn("codes", expr("transform(codes, (c, s) -> (c, s))"))
df2.createOrReplaceTempView("df2")
val df3 = spark.sql(
  """SELECT
       ID, col2, suffix,
       d.s AS seq,
       d.c AS code,
       year, date
     FROM
       df2
     LATERAL VIEW explode(codes) exploded_table AS d
  """)
df3
  .repartition(600, List(col("year"), col("date")): _*)
  .write
  .mode("overwrite")
  .partitionBy("year", "date")
  .save(OutDir)

Related

Subquery vs Dataframe filter function in spark

I am running the Spark SQL below, which uses a subquery.
val df = spark.sql("""select * from employeesTableTempview where dep_id in (select dep_id from departmentTableTempview)""")
df.count()
I also ran the same logic with the DataFrame API, as shown below. Assume the employee and department tables have been read as DataFrames named empDF and DepDF respectively.
val depidList = DepDF.map(x => x.getString(0)).collect().toList
val empdf2 = empDF.filter(col("dep_id").isin(depidList: _*))
empdf2.count
Of these two scenarios, which one gives better performance, and why? Please help me understand these scenarios in Spark with Scala.
I can give you the classic answer: it depends :D
Let's take a look at the first case. I prepared a similar example:
import org.apache.spark.sql.functions._
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
val data = Seq(("test", "3"),("test", "3"), ("test2", "5"), ("test3", "7"), ("test55", "86"))
val data2 = Seq(("test", "3"),("test", "3"), ("test2", "5"), ("test3", "6"), ("test33", "76"))
val df1 = data.toDF("name", "dep_id")
val df2 = data2.toDF("name", "dep_id")
df1.createOrReplaceTempView("employeesTableTempview")
df2.createOrReplaceTempView("departmentTableTempview")
val result = spark.sql("select * from employeesTableTempview where dep_id in (select dep_id from departmentTableTempview)")
result.count
I am setting autoBroadcastJoinThreshold to -1 because I assume that your datasets are going to be bigger than the default 10 MB for this parameter.
In the plan generated for this SQL query, Spark performs a sort-merge join (SMJ), which will be the case most of the time for datasets bigger than 10 MB. This requires the data to be shuffled and then sorted, so it is quite a heavy operation.
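Because the plan itself is not reproduced here, the following is a minimal sketch for printing it yourself. It assumes a local SparkSession and reuses the view names and sample data from the example; SubqueryPlanSketch and inQuery are hypothetical names introduced only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SubqueryPlanSketch {
  // Registers the two sample views and returns the IN-subquery result.
  def inQuery(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("test", "3"), ("test", "3"), ("test2", "5"), ("test3", "7"), ("test55", "86"))
      .toDF("name", "dep_id")
      .createOrReplaceTempView("employeesTableTempview")
    Seq(("test", "3"), ("test", "3"), ("test2", "5"), ("test3", "6"), ("test33", "76"))
      .toDF("name", "dep_id")
      .createOrReplaceTempView("departmentTableTempview")
    spark.sql(
      """SELECT * FROM employeesTableTempview
        |WHERE dep_id IN (SELECT dep_id FROM departmentTableTempview)""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("subquery-plan-sketch")
      .getOrCreate()
    // Disable broadcast joins so the small sample data is planned like large tables.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)
    // Prints the physical plan to stdout; look for the join operator it names.
    inQuery(spark).explain()
    spark.stop()
  }
}
```

The exact text of the printed plan varies between Spark versions, so treat the output as something to read rather than to match literally.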
Now let's check option 2 (the first lines of code are the same as before):
// df2 backs the department view and df1 the employee view in this example.
val depidList = df2.map(x => x.getString(1)).collect().toList
val empdf2 = df1.filter(col("dep_id").isin(depidList: _*))
empdf2.count
For this option the plan is different. There is obviously no join, but there are two separate queries. The first reads the DepDF dataset and collects one column as a list on the driver. The second uses this list to filter the data in the empDF dataset.
When DepDF is relatively small this should be fine, but if you need a more generic solution you may want to stick with the subquery, which is going to resolve to a join. You can also join your DataFrames directly with the Spark DataFrame API.
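As a rough sketch of the direct DataFrame join mentioned above: a left semi join keeps each employee row that has at least one matching dep_id, which mirrors the IN subquery. SemiJoinSketch and employeesInDepartments are hypothetical names, and the local SparkSession is an assumption made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SemiJoinSketch {
  // Keeps the rows of emp whose dep_id also occurs in dep, without adding columns from dep.
  def employeesInDepartments(emp: DataFrame, dep: DataFrame): DataFrame =
    emp.join(dep.select("dep_id"), Seq("dep_id"), "left_semi")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("semi-join-sketch")
      .getOrCreate()
    import spark.implicits._

    val emp = Seq(("test", "3"), ("test", "3"), ("test2", "5"), ("test3", "7"), ("test55", "86"))
      .toDF("name", "dep_id")
    val dep = Seq(("test", "3"), ("test", "3"), ("test2", "5"), ("test3", "6"), ("test33", "76"))
      .toDF("name", "dep_id")

    // Employees with dep_id 3, 3 and 5 have a matching department in this sample.
    println(employeesInDepartments(emp, dep).count()) // 3
    spark.stop()
  }
}
```

Unlike the collect-and-isin variant, nothing is pulled to the driver here, so the size of the department table is not limited by driver memory.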

Unexpected caching behaviour for groupBy/join operations in spark

I have been trying to do multiple aggregations on a base DataFrame, let's say df1.
When I run the following code:
df1.cache()
val df2 = df1.groupBy(col("col1"),col("col2") as "col6").agg(sum("col3"))
val df3 = df1.groupBy(col("col1"),col("col4") as "col6").agg(sum("col5"))
val df4 = df2.join(df3,Seq("col1","col6"),"outer")
df4.count()
In the generated query plan and on the SQL tab of the Spark UI, I see that df2 is an in-memory table scan of df1, while the complete DAG of df1 is executed to generate df3.
When I explicitly alias col1 in both groupBy calls before the join:
df1.cache()
val df2 = df1.groupBy(col("col1") as "col1",col("col2") as "col6").agg(sum("col3"))
val df3 = df1.groupBy(col("col1") as "col1",col("col4") as "col6").agg(sum("col5"))
val df4 = df2.join(df3,Seq("col1","col6"),"outer")
df4.count()
both DataFrames are in-memory table scans.
I didn't think this would make a difference. Can someone please explain why this happens?
PS: One more thing I noticed is that, without the join, the query plans of both DataFrames are in-memory table scans.

how to replace distinct() with reducebykey

I have a scenario where the code below takes more than 10 hours overall for >2 billion records. I even tried with 35 instances of an i3 cluster, but the performance was still bad. I am looking for a way to replace distinct() with reduceByKey(), and for suggestions to improve the performance.
val df = spark.read.parquet(out)
val df1 = df
  .select($"ID", $"col2", $"suffix", $"date", $"year", $"codes")
val df2 = df1
  .repartition(
    List(col("ID"), col("col2"), col("suffix"), col("date"),
      col("year"), col("codes")): _*)
  .distinct()
// Pair each code (c) with its index (s); the query below reads them back as d.c and d.s.
val df3 = df2.withColumn("codes", expr("transform(codes, (c, s) -> (c, s))"))
df3.createOrReplaceTempView("df3")
val df4 = spark.sql(
  """SELECT
       ID, col2, suffix,
       d.s AS seq,
       d.c AS code,
       year, date
     FROM
       df3
     LATERAL VIEW explode(codes) exploded_table AS d
  """)
df4
  .repartition(600, List(col("year"), col("date")): _*)
  .write
  .mode("overwrite")
  .partitionBy("year", "date")
  .save(OutDir)
I think distinct() is implemented with reduceByKey, but if you want to implement it yourself, you could do something like this:
val array = List((1,2),(1,3),(1,5),(1,2),(2,2),(2,2),(3,2),(3,2),(4,1),(1,3))
// session is your SparkSession
val pairRDD = session.sparkContext.parallelize(array)
// Key by the whole element, keep one value per key, then drop the dummy value.
val distinctResult = pairRDD.map(x => (x, null)).reduceByKey((x, _) => x).map(_._1)
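To check that the hand-written version behaves like the built-in one, a small self-contained sketch follows. DistinctSketch and distinctByReduce are hypothetical names, and the local SparkSession is an assumption made for illustration; this only demonstrates equivalence on a tiny RDD and says nothing about performance at billions of rows.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.reflect.ClassTag

object DistinctSketch {
  // Hand-rolled distinct: key by the element, keep one value per key, drop the dummy value.
  def distinctByReduce[T: ClassTag](rdd: RDD[T]): RDD[T] =
    rdd.map(x => (x, null)).reduceByKey((x, _) => x).map(_._1)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("distinct-sketch")
      .getOrCreate()

    val array = List((1,2),(1,3),(1,5),(1,2),(2,2),(2,2),(3,2),(3,2),(4,1),(1,3))
    val rdd = spark.sparkContext.parallelize(array)

    val manual = distinctByReduce(rdd).collect().toSet
    val builtIn = rdd.distinct().collect().toSet

    println(manual == builtIn) // true
    println(manual.size)       // 6 unique pairs in the sample list
    spark.stop()
  }
}
```

Both versions shuffle the data once by key, so swapping one for the other is not expected to remove the shuffle on its own.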

Dynamically select multiple columns while joining different Dataframe in Scala Spark

I have two Spark DataFrames, df1 and df2. Is there a way to select the output columns dynamically while joining them? The definition below outputs all columns from df1 and df2 in the case of an inner join.
def joinDF(df1: DataFrame, df2: DataFrame, joinExprs: Column, joinType: String): DataFrame = {
  val dfJoinResult = df1.join(df2, joinExprs, joinType)
  dfJoinResult
  //.select()
}
Input data:
val df1 = List(("1","new","current"), ("2","closed","saving"), ("3","blocked","credit")).toDF("id","type","account")
val df2 = List(("1","7"), ("2","5"), ("5","8")).toDF("id","value")
Expected result:
val dfJoinResult = df1
.join(df2, df1("id") === df2("id"), "inner")
.select(df1("type"), df1("account"), df2("value"))
dfJoinResult.schema():
StructType(StructField(type,StringType,true),
StructField(account,StringType,true),
StructField(value,StringType,true))
I have looked at options like df.select(cols.head, cols.tail: _*) but it does not allow to select columns from both DF's.
Is there a way to pass selectExpr columns dynamically along with dataframe details that we want to select it from in my def? I'm using Spark 2.2.0.
It is possible to pass the select expression as a Seq[Column] to the method:
def joinDF(df1: DataFrame, df2: DataFrame, joinExpr: Column, joinType: String, selectExpr: Seq[Column]): DataFrame = {
  val dfJoinResult = df1.join(df2, joinExpr, joinType)
  dfJoinResult.select(selectExpr: _*)
}
To call the method use:
val joinExpr = df1.col("id") === df2.col("id")
val selectExpr = Seq(df1.col("type"), df1.col("account"), df2.col("value"))
val testDf = joinDF(df1, df2, joinExpr, "inner", selectExpr)
This will give the desired result:
+------+-------+-----+
| type|account|value|
+------+-------+-----+
| new|current| 7|
|closed| saving| 5|
+------+-------+-----+
In the selectExpr above, it is necessary to specify which dataframe the columns are coming from. However, this can be further simplified if the following assumptions are true:
The columns to join on have the same name in both dataframes
The columns to be selected have unique names (the other dataframe does not have a column with the same name)
In this case, the joinExpr: Column can be changed to joinExpr: Seq[String] and selectExpr: Seq[Column] to selectExpr: Seq[String]:
def joinDF(df1: DataFrame, df2: DataFrame, joinExpr: Seq[String], joinType: String, selectExpr: Seq[String]): DataFrame = {
  val dfJoinResult = df1.join(df2, joinExpr, joinType)
  dfJoinResult.select(selectExpr.head, selectExpr.tail: _*)
}
Calling the method now looks cleaner:
val joinExpr = Seq("id")
val selectExpr = Seq("type", "account", "value")
val testDf = joinDF(df1, df2, joinExpr, "inner", selectExpr)
Note: When the join is performed using a Seq[String], the columns of the resulting dataframe differ from those produced by a join expression: each join column appears only once. When both dataframes contain other columns with the same name, there is no way to tell them apart by name alone afterwards.
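The difference in result columns can be seen with the question's own input data. This is a minimal sketch; JoinColumnsSketch, joinByExpr and joinByName are hypothetical names, and the local SparkSession is an assumption made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinColumnsSketch {
  // Join on an explicit expression: both id columns are kept.
  def joinByExpr(df1: DataFrame, df2: DataFrame): DataFrame =
    df1.join(df2, df1("id") === df2("id"), "inner")

  // Join on column names: the shared id column appears once.
  def joinByName(df1: DataFrame, df2: DataFrame): DataFrame =
    df1.join(df2, Seq("id"), "inner")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("join-columns-sketch")
      .getOrCreate()
    import spark.implicits._

    val df1 = List(("1","new","current"), ("2","closed","saving"), ("3","blocked","credit"))
      .toDF("id","type","account")
    val df2 = List(("1","7"), ("2","5"), ("5","8")).toDF("id","value")

    println(joinByExpr(df1, df2).columns.mkString(", ")) // id, type, account, id, value
    println(joinByName(df1, df2).columns.mkString(", ")) // id, type, account, value
    spark.stop()
  }
}
```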
A slight modification of the solution above is to select the required columns from each DataFrame before performing the join; this may carry a little less overhead, since fewer columns take part in the join.
val dfJoinResult = df1.select("column1","column2").join(df2.select("col1"),joinExpr,joinType)
But remember to also select the columns you join on: the select runs first, and the join can only use the columns that remain.

Spark DataFrame's `except()` removes different items every time

var df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
var df2 = df.limit(3)
df2.show()
var df3 = df.except(df2)
df3.show()
Surprisingly, I found that except does not work the way I expected. Here is my output:
df2 is created correctly and contains 1, 2 and 3. But my df3 still has 1, 2 and/or 3 in it. It's kind of random: if I run it multiple times, I get different results. Can anyone please help me? Thanks in advance.
You need to run a Spark action that collects the data required for df2 before performing the except operation. This ensures that df2 is computed beforehand and has fixed contents, which are then subtracted from df.
The randomness comes from Spark's lazy evaluation: Spark puts all of your code in one stage, so the contents of df2 are not fixed at the time the except operation is performed on it. As per the Spark documentation for limit:
Returns a new Dataset by taking the first n rows. The difference between this function
and head is that head is an action and returns an array (by triggering query execution)
while limit returns a new Dataset.
Since limit returns a Dataset, it is evaluated lazily.
The code below will give you consistent output.
var df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
var df2 = df.head(3).map(_.getInt(0)).toList.toDF("num") // head is an action; keep num as an Int
df2.show()
var df3 = df.except(df2)
df3.show()
The best way to test this is to create a new DataFrame that holds the values you want to diff.
val df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
val df2 = List(1,2,3).toDF("num")
df2.show()
val df3 = df.except(df2)
df3.show()
Alternatively, just write a deterministic filter to select the rows you want:
val df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
val df2 = df.filter("num <= 3")
df2.show()
val df3 = df.except(df2)
df3.show()
You could also use a leftanti join for this, provided the column you are comparing on has unique values.
Example:
var df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
var df2 = df.limit(3)
df2.show()
var df3 = df.join(df2,Seq("num"),"leftanti")
df3.show()