Speed up spark dataframe groupBy - scala

I am fairly inexperienced in Spark, and need help with groupBy and aggregate functions on a dataframe. Consider the following dataframe:
val df = Seq(
  (1, "a", "1"),
  (1, "b", "3"),
  (1, "c", "6"),
  (2, "a", "9"),
  (2, "c", "10"),
  (1, "b", "8"),
  (2, "c", "3"),
  (3, "r", "19")
).toDF("col1", "col2", "col3")
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a| 1|
| 1| b| 3|
| 1| c| 6|
| 2| a| 9|
| 2| c| 10|
| 1| b| 8|
| 2| c| 3|
| 3| r| 19|
+----+----+----+
I need to group by col1 and by col2 separately and calculate the mean of col3 for each, which I can do using:
val col1df = df.groupBy("col1").agg(round(mean("col3"),2).alias("mean_col1"))
val col2df = df.groupBy("col2").agg(round(mean("col3"),2).alias("mean_col2"))
However, on a large dataframe with a few million rows and tens of thousands of unique values in the columns being grouped, this takes a very long time. Moreover, I have many more columns to group by, which makes it take insanely long, and I am looking to reduce that. Is there a better way to do the groupBy followed by the aggregation?

You could use ideas from Multiple Aggregations; it can do everything in one shuffle operation, and the shuffle is the most expensive operation.
Example:
val df = Seq(
  (1, "a", "1"),
  (1, "b", "3"),
  (1, "c", "6"),
  (2, "a", "9"),
  (2, "c", "10"),
  (1, "b", "8"),
  (2, "c", "3"),
  (3, "r", "19")
).toDF("col1", "col2", "col3")
df.createOrReplaceTempView("data")
val grpRes = spark.sql("""select grouping_id() as gid, col1, col2, round(mean(col3), 2) as res
from data group by col1, col2 grouping sets ((col1), (col2)) """)
grpRes.show(100, false)
Output:
+---+----+----+----+
|gid|col1|col2|res |
+---+----+----+----+
|1 |3 |null|19.0|
|2 |null|b |5.5 |
|2 |null|c |6.33|
|1 |1 |null|4.5 |
|2 |null|a |5.0 |
|1 |2 |null|7.33|
|2 |null|r |19.0|
+---+----+----+----+
gid is a bit funny to use, as it is computed from a bitmask underneath. But if your grouping columns cannot contain nulls, then you can use it to select the correct groups, as sketched below.
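For example, to split grpRes back into the two per-column results, you could filter on gid; a minimal sketch based on the output above (verify the gid values for your own grouping sets):
// gid = 1 -> rows grouped by col1 only; gid = 2 -> rows grouped by col2 only
// (the bitmask depends on the GROUP BY column order, so check it against your output).
val col1df = grpRes.filter($"gid" === 1).select($"col1", $"res".alias("mean_col1"))
val col2df = grpRes.filter($"gid" === 2).select($"col2", $"res".alias("mean_col2"))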
Execution Plan:
scala> grpRes.explain
== Physical Plan ==
*(2) HashAggregate(keys=[col1#111, col2#112, spark_grouping_id#108], functions=[avg(cast(col3#9 as double))])
+- Exchange hashpartitioning(col1#111, col2#112, spark_grouping_id#108, 200)
+- *(1) HashAggregate(keys=[col1#111, col2#112, spark_grouping_id#108], functions=[partial_avg(cast(col3#9 as double))])
+- *(1) Expand [List(col3#9, col1#109, null, 1), List(col3#9, null, col2#110, 2)], [col3#9, col1#111, col2#112, spark_grouping_id#108]
+- LocalTableScan [col3#9, col1#109, col2#110]
As you can see, there is a single Exchange operation, the expensive shuffle.
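If grouping sets do not fit your case (for example, you really do need separate output dataframes per grouping column), another mitigation worth trying is to cache the input once so the repeated groupBy calls do not recompute the source; each aggregation still performs its own shuffle, but the scan of the source happens only once. A minimal sketch:
import org.apache.spark.sql.functions.{round, mean}

df.cache()  // materialized on the first action; subsequent groupBys reuse it
val col1df = df.groupBy("col1").agg(round(mean("col3"), 2).alias("mean_col1"))
val col2df = df.groupBy("col2").agg(round(mean("col3"), 2).alias("mean_col2"))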

Related

How to filter columns in one table based on the same columns in another table using Spark

I need to filter columns in one table (fixTablehb004_p) based on the same columns in another table (filtredTable109_p)
I first wanted to use this code:
val filtredTablehb004_p = fixTablehb004_p
.where($"servizio_rap" === filtredTable109_p.col("servizio_rap"))
.where($"filiale_rap" === filtredTable109_p.col("filiale_rap"))
.where($"codice_rap" === filtredTable109_p.col("codice_rap"))
But it gave an error.
Then I tried the code below, based on another Stack Overflow question. But the problem is that there are extra columns. I know I can use drop(columnName), but I want to ask whether I'm doing it right and whether there is a better option.
val filtredTablehb004_p = sparkSession.sql("SELECT * FROM fixTablehb004_p " +
"JOIN filtredTable109_p " +
"ON fixTablehb004_p.servizio_rap = filtredTable109_p.servizio_rap AND " +
"fixTablehb004_p.filiale_rap = filtredTable109_p.filiale_rap AND " +
"fixTablehb004_p.codice_rap = filtredTable109_p.codice_rap ")
Let's take two sample dataframes and see how we can select the required columns or avoid duplicate key column names in the joined output dataframe.
USING DATAFRAME API:
val df1 = Seq(("A1", "A2", 1), ("A3", "A4", 2), ("A1", "A3", 3))
.toDF("c1", "c2", "c3")
val df2 = Seq(("A1", "A2", 10), ("A3", "A4", 11))
.toDF("c1", "c2", "c4")
df1.createOrReplaceTempView("tab1")
df2.createOrReplaceTempView("tab2")
If the column names used in the join condition are the same in both dataframes, the output dataframe will have duplicate columns. To avoid this, you can pass those columns as a Seq to join().
df1.join(df2, Seq("c1", "c2")).show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| A1| A2| 1| 10|
| A3| A4| 2| 11|
+---+---+---+---+
To select the required columns from a specific dataframe, you can use the syntax below:
df1.join(df2, Seq("c1", "c2")).select('c1, 'c2, df1("c3")).show()
// OR
df1.join(df2, df1("c1") === df2("c1") && df1("c2") === df2("c2"))
.select(df1("c1"), df1("c2"), df1("c3")).show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| A1| A2| 1|
| A3| A4| 2|
+---+---+---+
USING SQL API:
spark.sql(
"""
|SELECT t2.c1, t2.c2, t2.c4 FROM tab1 t1
|JOIN tab2 t2 ON t1.c1 = t2.c1 AND t1.c2 = t2.c2
|""".stripMargin).show()
//OR
spark.sql(
"""
|SELECT c1, c2, t2.c4 FROM tab1 t1
|JOIN tab2 t2 USING(c1, c2)
|""".stripMargin).show()
+---+---+---+
| c1| c2| c4|
+---+---+---+
| A1| A2| 10|
| A3| A4| 11|
+---+---+---+
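As an aside (not part of the original question or answers): if the goal is only to keep rows of the first table whose key columns also appear in the second table, without carrying over any of the second table's columns, a left semi join is another option; a minimal sketch using the same sample dataframes:
// Keeps only df1 rows whose (c1, c2) pair exists in df2; no df2 columns are returned.
df1.join(df2, Seq("c1", "c2"), "left_semi").show()
This avoids the duplicate-column and drop(columnName) issue entirely, since the right-hand columns never enter the result.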

isin throws stackoverflow error in withcolumn function in spark

I am using Spark 2.3 in my Scala application. I have a dataframe, created via Spark SQL, that is named sqlDF in the sample code I shared. I have a string list with the items below:
List[String] stringList items:
-9, -8, -7, -6
I want to replace, in every column of the dataframe, all values that match an item in this list with 0.
Initial dataframe:
column1 | column2 | column3
      1 |       1 |       1
      2 |      -5 |       1
      6 |      -6 |       1
     -7 |      -8 |      -7
It must return:
column1 | column2 | column3
      1 |       1 |       1
      2 |      -5 |       1
      6 |       0 |       1
      0 |       0 |       0
For this I am iterating the statement below over all the columns (more than 500) in sqlDF.
sqlDF = sqlDF.withColumn(currColumnName, when(col(currColumnName).isin(stringList:_*), 0).otherwise(col(currColumnName)))
But I get the error below. If I apply it to only one column it works, but when I run the code above over all 500 columns, it fails:
Exception in thread "streaming-job-executor-0"
java.lang.StackOverflowError at
scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57)
at
scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52)
at
scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229)
at
scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
at scala.collection.immutable.List.map(List.scala:285) at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
What am I missing?
Here is a different approach: apply a left anti join between each column and X, where X is your list of items converted into a dataframe. The left anti join returns all the items not present in X. We then concatenate the results through an outer join based on an id assigned with monotonically_increasing_id (the outer join can be replaced with a left join for better performance, although that would exclude records where every column was zeroed, i.e. id == 3):
import org.apache.spark.sql.functions.{monotonically_increasing_id, col}

val df = Seq(
    (1, 1, 1),
    (2, -5, 1),
    (6, -6, 1),
    (-7, -8, -7))
  .toDF("c1", "c2", "c3")
  .withColumn("id", monotonically_increasing_id())

val exdf = Seq(-9, -8, -7, -6).toDF("x")

df.columns.map { c =>
    df.select("id", c).join(exdf, col(c) === $"x", "left_anti")
  }
  .reduce((df1, df2) => df1.join(df2, Seq("id"), "outer"))
  .na.fill(0)
  .show
Output:
+---+---+---+---+
| id| c1| c2| c3|
+---+---+---+---+
| 0| 1| 1| 1|
| 1| 2| -5| 1|
| 3| 0| 0| 0|
| 2| 6| 0| 1|
+---+---+---+---+
foldLeft works perfectly for your case here, as below:
val df = spark.sparkContext.parallelize(Seq(
(1, 1, 1),
(2, -5, 1),
(6, -6, 1),
(-7, -8, -7)
)).toDF("a", "b", "c")
val list = Seq(-7, -8, -9)
import org.apache.spark.sql.functions.{col, when}

val resultDF = df.columns.foldLeft(df) { (acc, name) =>
  acc.withColumn(name, when(col(name).isin(list: _*), 0).otherwise(col(name)))
}
Output:
+---+---+---+
|a |b |c |
+---+---+---+
|1 |1 |1 |
|2 |-5 |1 |
|6 |-6 |1 |
|0 |0 |0 |
+---+---+---+
I would suggest broadcasting the list of Strings:
val stringList=sc.broadcast(<Your List of List[String]>)
After that, use this:
sqlDF = sqlDF.withColumn(currColumnName, when(col(currColumnName).isin(stringList.value:_*), 0).otherwise(col(currColumnName)))
Make sure your currColumnName values are also strings; the comparison should be string to string.
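One more note, as an assumption about the root cause rather than something stated in the question: a StackOverflowError in the Catalyst TreeNode code usually comes from the very deep logical plan created by chaining 500+ withColumn calls. Building all the replacement columns in a single select keeps the plan shallow; a minimal sketch:
import org.apache.spark.sql.functions.{col, when}

val stringList = Seq("-9", "-8", "-7", "-6")
// One projection over all columns instead of 500 nested withColumn calls.
val cleaned = sqlDF.select(
  sqlDF.columns.map { c =>
    when(col(c).isin(stringList: _*), 0).otherwise(col(c)).alias(c)
  }: _*
)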

Parse nested data using SCALA

I have a dataframe as follows:
ColA ColB    ColC
1    [2,3,4] [5,6,7]
I need to convert it to the below:
ColA ColB ColC
1    2    5
1    3    6
1    4    7
Can someone please help with the code in Scala?
You can zip the two array columns by means of a UDF and explode the zipped column as follows:
val df = Seq(
(1, Seq(2, 3, 4), Seq(5, 6, 7))
).toDF("ColA", "ColB", "ColC")
def zip = udf(
(x: Seq[Int], y: Seq[Int]) => x zip y
)
val df2 = df.select($"ColA", zip($"ColB", $"ColC").as("BzipC")).
withColumn("BzipC", explode($"BzipC"))
val df3 = df2.select($"ColA", $"BzipC._1".as("ColB"), $"BzipC._2".as("ColC"))
df3.show
+----+----+----+
|ColA|ColB|ColC|
+----+----+----+
| 1| 2| 5|
| 1| 3| 6|
| 1| 4| 7|
+----+----+----+
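As an aside, not part of the original answer: on Spark 2.4 or later the built-in arrays_zip function can replace the zip UDF. A minimal sketch, assuming the zipped struct fields take the input column names on your Spark version (adjust the final select if they do not):
import org.apache.spark.sql.functions.{arrays_zip, explode}

val dfZipped = df
  .select($"ColA", explode(arrays_zip($"ColB", $"ColC")).as("z"))
  .select($"ColA", $"z.ColB".as("ColB"), $"z.ColC".as("ColC"))
dfZipped.show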
The idea I am presenting here is a bit more involved: it uses map to combine the two arrays ColB and ColC, then the explode function to explode the combined array, and finally extracts the exploded elements into separate columns.
import scala.collection.mutable
import org.apache.spark.sql.functions._

val tempDF = df.map(row => {
    val colB = row(1).asInstanceOf[mutable.WrappedArray[Int]]
    val colC = row(2).asInstanceOf[mutable.WrappedArray[Int]]
    var array = Array.empty[(Int, Int)]
    for (loop <- 0 to colB.size - 1) {
      array = array :+ (colB(loop), colC(loop))
    }
    (row(0).asInstanceOf[Int], array)
  })
  .toDF("ColA", "ColB")
  .withColumn("ColD", explode($"ColB"))

tempDF.withColumn("ColB", $"ColD._1").withColumn("ColC", $"ColD._2").drop("ColD").show(false)
This would give you the result:
+----+----+----+
|ColA|ColB|ColC|
+----+----+----+
|1 |2 |5 |
|1 |3 |6 |
|1 |4 |7 |
+----+----+----+
You can also use a combination of posexplode and lateral view from HiveQL
sqlContext.sql("""
select 1 as colA, array(2,3,4) as colB, array(5,6,7) as colC
""").registerTempTable("test")
sqlContext.sql("""
select
colA , b as colB, c as colC
from
test
lateral view
posexplode(colB) columnB as seqB, b
lateral view
posexplode(colC) columnC as seqC, c
where
seqB = seqC
""" ).show
+----+----+----+
|colA|colB|colC|
+----+----+----+
| 1| 2| 5|
| 1| 3| 6|
| 1| 4| 7|
+----+----+----+
Credits: https://stackoverflow.com/a/40614822/7224597 ;)

Scala LEFT JOIN on dataframes using two columns (case insensitive)

I have created the method below, which takes two Dataframes, lhs and rhs, and their respective first and second columns as input. The method should return the result of a left join between these two frames using the two columns provided for each dataframe (ignoring their case sensitivity).
The problem I am facing is that it is doing more of an inner join. It is returning 3 times the number of rows in the lhs dataframe (due to duplicate values in rhs), but as it is a left join, the duplication and the number of rows in the rhs dataframe should not matter.
def leftJoinCaseInsensitive(lhs: DataFrame, rhs: DataFrame,
                            leftTableColumn: String, rightTableColumn: String,
                            leftTableColumn1: String, rightTableColumn1: String): DataFrame = {
  lhs.join(
    rhs,
    upper(lhs.col(leftTableColumn)) === upper(rhs.col(rightTableColumn)) &&
      upper(lhs.col(leftTableColumn1)) === upper(rhs.col(rightTableColumn1)),
    "left")
}
If there are duplicate values in rhs, then it is normal for lhs rows to get replicated. If the join-column values of an lhs row match multiple rhs rows, the joined dataframe will contain that lhs row once for each matching rhs row.
For example:
lhs dataframe
+--------+--------+--------+
|col1left|col2left|col3left|
+--------+--------+--------+
|a |1 |leftside|
+--------+--------+--------+
And
rhs dataframe
+---------+---------+---------+
|col1right|col2right|col3right|
+---------+---------+---------+
|a |1 |rightside|
|a |1 |rightside|
+---------+---------+---------+
Then it is normal to have left join as
left joined lhs with rhs
+--------+--------+--------+---------+---------+---------+
|col1left|col2left|col3left|col1right|col2right|col3right|
+--------+--------+--------+---------+---------+---------+
|a |1 |leftside|a |1 |rightside|
|a |1 |leftside|a |1 |rightside|
+--------+--------+--------+---------+---------+---------+
You can find more information here.
but as it is a left join the duplication and number of rows in rhs
dataframe should not matter
Not true. Your leftJoinCaseInsensitive method looks good to me. A left join will still produce more rows than the left table has if the right table has duplicate key column values, as shown below:
val dfR = Seq(
(1, "a", "x"),
(1, "a", "y"),
(2, "b", "z")
).toDF("k1", "k2", "val")
val dfL = Seq(
(1, "a", "u"),
(2, "b", "v"),
(3, "c", "w")
).toDF("k1", "k2", "val")
val res = leftJoinCaseInsensitive(dfL, dfR, "k1", "k1", "k2", "k2")
res.show
+---+---+---+----+----+----+
| k1| k2|val| k1| k2| val|
+---+---+---+----+----+----+
| 1| a| u| 1| a| y|
| 1| a| u| 1| a| x|
| 2| b| v| 2| b| z|
| 3| c| w|null|null|null|
+---+---+---+----+----+----+
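If the fan-out is not what you want (each lhs row should appear at most once), one option not covered above is to deduplicate the right-hand side on its join keys before joining. Note that which duplicate survives is arbitrary unless you aggregate or order explicitly, and if keys can differ only by case you should normalize them (for example with upper) before deduplicating:
// Deduplicate rhs on its join keys first so the left join cannot multiply lhs rows.
val rhsDedup = dfR.dropDuplicates("k1", "k2")
leftJoinCaseInsensitive(dfL, rhsDedup, "k1", "k1", "k2", "k2").show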

How to pivot dataset?

I use Spark 2.1.
I have some data in a Spark Dataframe, which looks like below:
**ID** **type** **val**
1      t1       v1
1      t11      v11
2      t2       v2
I want to pivot this data using either Spark Scala (preferably) or Spark SQL so that the final output looks like below:
**ID** **t1** **t11** **t2**
1      v1     v11
2                     v2
You can use groupBy.pivot:
import org.apache.spark.sql.functions.first
df.groupBy("ID").pivot("type").agg(first($"val")).na.fill("").show
+---+---+---+---+
| ID| t1|t11| t2|
+---+---+---+---+
| 1| v1|v11| |
| 2| | | v2|
+---+---+---+---+
Note: depending on the actual data, i.e. how many values there are for each combination of ID and type, you might choose a different aggregation function.
Here's one way to do it:
val df = Seq(
(1, "T1", "v1"),
(1, "T11", "v11"),
(2, "T2", "v2")
).toDF(
"id", "type", "val"
).as[(Int, String, String)]
import org.apache.spark.sql.functions.{concat_ws, collect_list}
val df2 = df.groupBy("id").pivot("type").agg(concat_ws(",", collect_list("val")))
df2.show
+---+---+---+---+
| id| T1|T11| T2|
+---+---+---+---+
| 1| v1|v11| |
| 2| | | v2|
+---+---+---+---+
Note that if there are different vals associated with a given type, they will be grouped (comma-delimited) under the type in df2.
This one should work
import org.apache.spark.sql.functions.first
val seq = Seq((1,"t1","v1"),(1,"t11","v11"),(2,"t2","v2"))
val df = seq.toDF("id","type","val")
val pivotedDF = df.groupBy("id").pivot("type").agg(first("val"))
pivotedDF.show
Output:
+---+----+----+----+
| id| t1| t11| t2|
+---+----+----+----+
| 1| v1| v11|null|
| 2|null|null| v2|
+---+----+----+----+
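One extra performance note, not from the original answers: if you already know the distinct values of type, passing them to pivot explicitly lets Spark skip the extra job it otherwise runs to collect them:
import org.apache.spark.sql.functions.first

// Supplying the pivot values up front avoids the distinct-value scan over the data.
val pivotedDF2 = df.groupBy("id").pivot("type", Seq("t1", "t11", "t2")).agg(first("val"))
pivotedDF2.show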