Spark join produces wrong results - scala

Presenting here before possibly filing a bug. I'm using Spark 1.6.0.
This is a simplified version of the problem I'm dealing with. I've filtered a table, and then I'm trying to do a left outer join with that subset and the main table, matching all the columns.
I've only got 2 rows in the main table and one in the filtered table. I'm expecting the resulting table to only have the single row from the subset.
scala> val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
b: org.apache.spark.sql.DataFrame = [a: string, b: string, c: int]
scala> val a = b.where("c = 1").withColumnRenamed("a", "filta").withColumnRenamed("b", "filtb")
a: org.apache.spark.sql.DataFrame = [filta: string, filtb: string, c: int]
scala> a.join(b, $"filta" <=> $"a" and $"filtb" <=> $"b" and a("c") <=> b("c"), "left_outer").show
+-----+-----+---+---+---+---+
|filta|filtb| c| a| b| c|
+-----+-----+---+---+---+---+
| a| b| 1| a| b| 1|
| a| b| 1| a| b| 2|
+-----+-----+---+---+---+---+
I didn't expect that result at all. I expected the first row, but not the second. I suspected it's the null-safe equality, so I tried it without.
scala> a.join(b, $"filta" === $"a" and $"filtb" === $"b" and a("c") === b("c"), "left_outer").show
16/03/21 12:50:00 WARN Column: Constructing trivially true equals predicate, 'c#18232 = c#18232'. Perhaps you need to use aliases.
+-----+-----+---+---+---+---+
|filta|filtb| c| a| b| c|
+-----+-----+---+---+---+---+
| a| b| 1| a| b| 1|
+-----+-----+---+---+---+---+
OK, that's the result I expected, but then I got suspicious of the warning. There is a separate StackOverflow question to deal with that warning here: Spark SQL performing carthesian join instead of inner join
So I create a new column that avoids the warning.
scala> a.withColumn("newc", $"c").join(b, $"filta" === $"a" and $"filtb" === $"b" and $"newc" === b("c"), "left_outer").show
+-----+-----+---+----+---+---+---+
|filta|filtb| c|newc| a| b| c|
+-----+-----+---+----+---+---+---+
| a| b| 1| 1| a| b| 1|
| a| b| 1| 1| a| b| 2|
+-----+-----+---+----+---+---+---+
But now the result is wrong again!
I have a lot of null-safe equality checks, and the warning isn't fatal, so I don't see a clear path to working with/around this.
Is the behaviour a bug, or is this expected behaviour? If expected, why?

If you want the expected behavior, either join on column names:
val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
val a = b.where("c = 1")
a.join(b, Seq("a", "b", "c")).show
// +---+---+---+
// | a| b| c|
// +---+---+---+
// | a| b| 1|
// +---+---+---+
or aliases:
val aa = a.alias("a")
val bb = b.alias("b")
aa.join(bb, $"a.a" === $"b.a" && $"a.b" === $"b.b" && $"a.c" === $"b.c")
You can use <=> as well:
aa.join(bb, $"a.a" <=> $"b.a" && $"a.b" <=> $"b.b" && $"a.c" <=> $"b.c")
As far as I remember there's been a special case for simple equality for a while. That's why you get correct results despite the warning.
The second behavior does look like a bug, related to the fact that you still have a.c in your data. It looks like it is picked downstream before b.c, so the condition actually evaluated is a.newc = a.c. Renaming c instead of copying it avoids the clash:
val expr = $"filta" === $"a" and $"filtb" === $"b" and $"newc" === $"c"
a.withColumnRenamed("c", "newc").join(b, expr, "left_outer")
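For reference, here is a minimal sketch of the alias approach applied to the original left outer join with null-safe equality. It assumes a throwaway local SparkSession, and the alias names l and r are only illustrative. With both sides aliased, each column reference names its side explicitly, so the single filtered row should only match the row with c = 1.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session created only for this illustration.
val spark = SparkSession.builder().master("local[1]").appName("self-join-aliases").getOrCreate()
import spark.implicits._

val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
val a = b.where("c = 1")

// Alias both sides so that every column reference is qualified by its side.
val left = a.alias("l")
val right = b.alias("r")

val joined = left.join(
  right,
  col("l.a") <=> col("r.a") && col("l.b") <=> col("r.b") && col("l.c") <=> col("r.c"),
  "left_outer"
)

// Right-hand "c" values that were matched to the single left-hand row.
val matchedC = joined.select(col("r.c")).collect().map(_.getInt(0)).toSeq
```

If matchedC contains anything other than the value 1, the join condition is not being resolved against the side you intended.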


How to apply a customized function with multiple parameters to each group of a dataframe and union the resulting dataframes in Scala Spark?

I have a customized function that looks like this that returns a different dataframe as the output
def customizedfun(data : DataFrame, param1 : Boolean, param2 : String) : DataFrame = {...}
and I want to apply this function to each group of
df.groupBy("type")
then append the output dataframes from each type into one dataframe.
This is a little different from other questions about applying customized functions to grouped dataframes, in that this function also takes other inputs in addition to the dataframe in question, df.groupBy("type").
What's the best way to do this?
You can filter down the original df to the different groups, call customizedfun for each group and then union the results.
I assume that customizedfun is a function that simply adds the two parameters as a new column, but it could be any function:
def customizedfun(data : DataFrame, param1 : Boolean, param2 : String) : DataFrame =
data.withColumn("newCol", lit(s"$param2 $param1"))
I need two helper functions that calculate the values of param1 and param2 depending on the value of type. In a real-world application, these functions could, for example, be lookups in a dictionary.
def calcParam1(typ: Integer): Boolean = typ % 2 == 0
def calcParam2(typ: Integer): String = s"type is $typ"
Now the original df is filtered into the different groups, customizedfun is called and the result is unioned:
//create some test data
val df = Seq((1, "A", "a"), (1, "B", "b"), (1, "C", "c"), (2, "D", "d"), (2, "E", "e"), (3, "F", "f"))
.toDF("type", "val1", "val2")
//+----+----+----+
//|type|val1|val2|
//+----+----+----+
//| 1| A| a|
//| 1| B| b|
//| 1| C| c|
//| 2| D| d|
//| 2| E| e|
//| 3| F| f|
//+----+----+----+
//get the distinct values of column type
val distinctTypes = df.select("type").distinct().as[Integer].collect()
//call customizedfun for each group
val resultPerGroup= for( typ <- distinctTypes)
yield customizedfun( df.filter(s"type = $typ"), calcParam1(typ), calcParam2(typ))
//the final union
val result = resultPerGroup.tail.foldLeft(resultPerGroup.head)(_ union _)
//+----+----+----+---------------+
//|type|val1|val2| newCol|
//+----+----+----+---------------+
//| 1| A| a|type is 1 false|
//| 1| B| b|type is 1 false|
//| 1| C| c|type is 1 false|
//| 3| F| f|type is 3 false|
//| 2| D| d| type is 2 true|
//| 2| E| e| type is 2 true|
//+----+----+----+---------------+
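The filter, apply, and union steps above can also be wrapped in a small helper. This is only a sketch: applyPerGroup is a hypothetical name, it assumes the grouping column holds integers, and it uses reduce instead of head/tail with foldLeft. Note that it collects the distinct keys to the driver and runs one filter per key, so it is only reasonable for a modest number of groups.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

// Assumption: a local session created only for this illustration.
val spark = SparkSession.builder().master("local[1]").appName("per-group-apply").getOrCreate()
import spark.implicits._

// Same stand-in as in the answer: any DataFrame => DataFrame function would do.
def customizedfun(data: DataFrame, param1: Boolean, param2: String): DataFrame =
  data.withColumn("newCol", lit(s"$param2 $param1"))

// Hypothetical helper: filter per distinct key, apply f, then union the pieces.
def applyPerGroup(df: DataFrame, groupCol: String)(f: (DataFrame, Int) => DataFrame): DataFrame = {
  val keys = df.select(groupCol).distinct().as[Int].collect()
  require(keys.nonEmpty, "input DataFrame has no groups")
  keys.map(k => f(df.filter(col(groupCol) === k), k)).reduce(_ union _)
}

val df = Seq((1, "A", "a"), (1, "B", "b"), (2, "D", "d"), (3, "F", "f"))
  .toDF("type", "val1", "val2")

// The extra parameters are derived from the group key, as in the answer.
val result = applyPerGroup(df, "type") { (group, typ) =>
  customizedfun(group, typ % 2 == 0, s"type is $typ")
}
```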

Sum columns of a Spark dataframe and create another dataframe

I have a dataframe like below -
I am trying to create another dataframe from this which has 2 columns - the column name and the sum of values in each column like this -
So far, I've tried this (in Spark 2.2.0), but it throws a stack trace -
val get_count: (String => Long) = (c: String) => {
df.groupBy("id")
.agg(sum(c) as "s")
.select("s")
.collect()(0)
.getLong(0)
}
val sqlfunc = udf(get_count)
summary = summary.withColumn("sum_of_column", sqlfunc(col("c")))
Are there any other alternatives of accomplishing this task?
I think that the most efficient way is to do an aggregation and then build a new dataframe. That way you avoid a costly explode.
First, let's create the dataframe. BTW, it's always nice to provide the code to do it when you ask a question. This way we can reproduce your problem in seconds.
val df = Seq((1, 1, 0, 0, 1), (1, 1, 5, 0, 0),
(0, 1, 0, 6, 0), (0, 1, 0, 4, 3))
.toDF("output_label", "ID", "C1", "C2", "C3")
Then we build the list of columns that we are interested in, the aggregations, and compute the result.
val cols = (1 to 3).map(i => s"C$i")
val aggs = cols.map(name => sum(col(name)).as(name))
val agg_df = df.agg(aggs.head, aggs.tail :_*) // See the note below
agg_df.show
+---+---+---+
| C1| C2| C3|
+---+---+---+
| 5| 10| 4|
+---+---+---+
We almost have what we need, we just need to collect the data and build a new dataframe:
val agg_row = agg_df.first
cols.map(name => name -> agg_row.getAs[Long](name))
.toDF("column", "sum")
.show
+------+---+
|column|sum|
+------+---+
| C1| 5|
| C2| 10|
| C3| 4|
+------+---+
EDIT:
NB: df.agg(aggs.head, aggs.tail :_*) may seem strange. The idea is simply to compute all the aggregations in aggs. One would expect something simpler, like df.agg(aggs : _*). Yet the signature of the agg method is as follows:
def agg(expr: org.apache.spark.sql.Column,exprs: org.apache.spark.sql.Column*)
probably to ensure that at least one column is passed, which is why you need to split aggs into aggs.head and aggs.tail.
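If the head/tail split shows up in several places, it can be hidden behind a small helper. The sketch below is only an illustration: aggAll is a hypothetical name, and it assumes a local SparkSession. It pattern-matches the list into its first element and the rest, which is exactly the shape agg expects, and fails loudly on an empty list.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

// Assumption: a local session created only for this illustration.
val spark = SparkSession.builder().master("local[1]").appName("agg-head-tail").getOrCreate()
import spark.implicits._

// Hypothetical helper: split a non-empty list into the (first, rest*) shape of agg.
def aggAll(df: DataFrame, aggs: Seq[Column]): DataFrame = aggs match {
  case head +: tail => df.agg(head, tail: _*)
  case _            => throw new IllegalArgumentException("at least one aggregation is required")
}

val df = Seq((1, 1, 0, 0, 1), (1, 1, 5, 0, 0), (0, 1, 0, 6, 0), (0, 1, 0, 4, 3))
  .toDF("output_label", "ID", "C1", "C2", "C3")

val cols = Seq("C1", "C2", "C3")

// One global aggregation producing a single row with one sum per column.
val sums = aggAll(df, cols.map(c => sum(col(c)).as(c))).first()

// Pair each column name with its sum (sum over an Int column yields a Long).
val totals = cols.map(c => c -> sums.getAs[Long](c))
```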
What I do is define a method that creates a struct from the desired values:
def kv (columnsToTranspose: Array[String]) = explode(array(columnsToTranspose.map {
c => struct(lit(c).alias("k"), col(c).alias("v"))
}: _*))
This function receives a list of columns to transpose (the last 3 columns in your case) and transforms them into structs, with the column name as the key and the column value as the value.
Then use that method to create the struct and process it as you want:
df.withColumn("kv", kv(df.columns.tail.tail))
.select( $"kv.k".as("column"), $"kv.v".alias("values"))
.groupBy("column")
.agg(sum("values").as("sum"))
First apply the previously defined function to turn the desired columns into the struct described above, then deconstruct the struct so that each row has a key column and a value column.
Then you can aggregate by the column name and sum the values.
INPUT
+------------+---+---+---+---+
|output_label| id| c1| c2| c3|
+------------+---+---+---+---+
| 1| 1| 0| 0| 1|
| 1| 1| 5| 0| 0|
| 0| 1| 0| 6| 0|
| 0| 1| 0| 4| 3|
+------------+---+---+---+---+
OUTPUT
+------+---+
|column|sum|
+------+---+
| c1| 5|
| c3| 4|
| c2| 10|
+------+---+

Group by and find count before doing pivot spark

I have a dataframe like below
A B C D
foo one small 1
foo one large 2
foo one large 2
foo two small 3
I need to group by A and B, pivot on column C, and sum column D.
I am able to do this using
df.groupBy("A", "B").pivot("C").sum("D")
However, I also need the count after the groupBy. If I try something like
df.groupBy("A", "B").pivot("C").agg(sum("D"), count)
I get an output like
A B large small large_count small_count
Is there a way to get only one count after the groupBy, before doing the pivot?
On the output, try:
output.withColumn("count", $"large_count"+$"small_count").show
You can drop the two count columns if you want to.
To do it before the pivot, try:
df.groupBy("A", "B").agg(count("C"))
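One way to carry that per-group count through the pivot, sketched here with a window function (this is an alternative I am suggesting, not something from the snippets above): attach the count to every row first, then include it in the grouping key. The column name cnt and the explicit pivot values are assumptions made for the illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, lit}

// Assumption: a local session created only for this illustration.
val spark = SparkSession.builder().master("local[1]").appName("count-before-pivot").getOrCreate()
import spark.implicits._

val df = Seq(("foo", "one", "small", 1),
             ("foo", "one", "large", 2),
             ("foo", "one", "large", 2),
             ("foo", "two", "small", 3)).toDF("A", "B", "C", "D")

// Attach the number of rows per (A, B) to every row before pivoting.
val withCount = df.withColumn("cnt", count(lit(1)).over(Window.partitionBy("A", "B")))

// "cnt" is constant within each (A, B), so grouping by it does not split groups.
// Listing the pivot values fixes the column order and avoids an extra pass over C.
val pivoted = withCount
  .groupBy("A", "B", "cnt")
  .pivot("C", Seq("large", "small"))
  .sum("D")
  .orderBy("A", "B")

val rows = pivoted.collect()
```

This yields a single count column next to the pivoted sums, without joining two aggregated dataframes.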
Is this what you are expecting?
val df = Seq(("foo", "one", "small", 1),
("foo", "one", "large", 2),
("foo", "one", "large", 2),
("foo", "two", "small", 3)).toDF("A","B","C","D")
scala> df.show
+---+---+-----+---+
| A| B| C| D|
+---+---+-----+---+
|foo|one|small| 1|
|foo|one|large| 2|
|foo|one|large| 2|
|foo|two|small| 3|
+---+---+-----+---+
scala> val df2 = df.groupBy('A,'B).pivot("C").sum("D")
df2: org.apache.spark.sql.DataFrame = [A: string, B: string ... 2 more fields]
scala> val df3 = df.groupBy('A as "A1",'B as "B1").agg(sum('D) as "sumd")
df3: org.apache.spark.sql.DataFrame = [A1: string, B1: string ... 1 more field]
scala> df3.join(df2,'A==='A1 and 'B==='B1,"inner").select("A","B","sumd","large","small").show
+---+---+----+-----+-----+
| A| B|sumd|large|small|
+---+---+----+-----+-----+
|foo|one| 5| 4| 1|
|foo|two| 3| null| 3|
+---+---+----+-----+-----+
This won't require a join. Is this what you are looking for?
val df = Seq(("foo", "one", "small", 1),
("foo", "one", "large", 2),
("foo", "one", "large", 2),
("foo", "two", "small", 3)).toDF("A","B","C","D")
scala> df.show
+---+---+-----+---+
| A| B| C| D|
+---+---+-----+---+
|foo|one|small| 1|
|foo|one|large| 2|
|foo|one|large| 2|
|foo|two|small| 3|
+---+---+-----+---+
df.registerTempTable("dummy")
spark.sql("""
  SELECT * FROM (
    SELECT A, B, C, sum(D) AS D
    FROM dummy
    GROUP BY A, B, C GROUPING SETS ((A, B, C), (A, B))
    ORDER BY A NULLS LAST, B NULLS LAST, C NULLS LAST
  ) dummy
  PIVOT (first(D) FOR C IN ('large' large, 'small' small, null total))
""").show
+---+---+-----+-----+-----+
| A| B|large|small|total|
+---+---+-----+-----+-----+
|foo|one| 4| 1| 5|
|foo|two| null| 3| 3|
+---+---+-----+-----+-----+

Apply same common header to distinct fields of dataframes in scala spark

I want to apply the same common header to all the dataframes I generate. The application must know which columns have to be changed, added, or removed, and in which position.
The different dataframes arrive with a different column order, with some columns missing, or with extra columns. What I want is:
If there are more columns than in the common header, the extra ones are removed.
If some columns are missing, they are added and their rows are filled with null values.
// df with common header to apply
val mainDF = Seq(("a","b","c","d","e")).toDF("first","second","third","fourth","fifth")
mainDF.show()
+-----+------+-----+------+-----+
|first|second|third|fourth|fifth|
+-----+------+-----+------+-----+
| a| b| c| d| e|
+-----+------+-----+------+-----+
// Case 1: distinct column order
val df1 = Seq(("a", "c","b","d","e")).toDF("first","third","second","fourth","fifth")
df1.show()
+-----+-----+------+------+-----+
|first|third|second|fourth|fifth|
+-----+-----+------+------+-----+
| a| c| b| d| e|
+-----+-----+------+------+-----+
// Result desired:
val df1_correct = Seq(("a","b","c","d","e")).toDF("first","second","third","fourth","fifth")
df1_correct.show()
+-----+------+-----+------+-----+
|first|second|third|fourth|fifth|
+-----+------+-----+------+-----+
| a| b| c| d| e|
+-----+------+-----+------+-----+
// Case 2: columns missing
val df2 = Seq(("a", "b","c","d")).toDF("first","second","third","fourth")
df2.show()
+-----+------+-----+------+
|first|second|third|fourth|
+-----+------+-----+------+
| a| b| c| d|
+-----+------+-----+------+
// Result desired:
val df2_correct = Seq(("a","b","c","d","null")).toDF("first","second","third","fourth","fifth")
df2_correct.show()
+-----+------+-----+------+-----+
|first|second|third|fourth|fifth|
+-----+------+-----+------+-----+
| a| b| c| d| null|
+-----+------+-----+------+-----+
// Case 3: columns added
val df3 = Seq(("a", "b","c","d","e","f")).toDF("first","second","third","fourth","fifth","sixth")
df3.show()
+-----+------+-----+------+-----+-----+
|first|second|third|fourth|fifth|sixth|
+-----+------+-----+------+-----+-----+
| a| b| c| d| e| f|
+-----+------+-----+------+-----+-----+
// Result desired:
val df3_correct = Seq(("a","b","c","d","e")).toDF("first","second","third","fourth","fifth")
df3_correct.show()
+-----+------+-----+------+-----+
|first|second|third|fourth|fifth|
+-----+------+-----+------+-----+
| a| b| c| d| e|
+-----+------+-----+------+-----+
// Case 4: different column order and, e.g., a column missing
val df4 =
Seq(("a", "c","b","d")).toDF("first","third","second","fourth")
df4.show()
+-----+-----+------+------+
|first|third|second|fourth|
+-----+-----+------+------+
| a| c| b| d|
+-----+-----+------+------+
// Result desired:
val df4_correct = Seq(("a","b","c","d","null")).toDF("first","second","third","fourth","fifth")
df4_correct.show()
+-----+------+-----+------+-----+
|first|second|third|fourth|fifth|
+-----+------+-----+------+-----+
| a| b| c| d| null|
+-----+------+-----+------+-----+
This should cover all your cases:
val selectExp = mainDF.columns.map( c => if(df4.columns.contains(c)) col(c)
else lit(null).as(c) )
You map over mainDF.columns which is an Array[String] of the column names of mainDF
Array[String] = Array(first, second, third, fourth, fifth)
Replace df4 with whichever dataframe you want to generate expression for. If the column in dfx matches with mainDF, then select it, otherwise generate a null with the column name fetched from mainDF
You will get an Array[org.apache.spark.sql.Column]
Array[org.apache.spark.sql.Column] = Array(first, second, third, fourth, NULL AS `fifth`)
which you can use on the df as
df4.select(selectExp : _*).show
//+-----+------+-----+------+-----+
//|first|second|third|fourth|fifth|
//+-----+------+-----+------+-----+
//| a| b| c| d| null|
//+-----+------+-----+------+-----+
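The same expression can be wrapped in a function so that it applies to any dataframe rather than a hard-coded df4. This is a sketch under a few assumptions: alignTo is a hypothetical name, a local SparkSession is used, and the null filler is cast to the type of the matching column in the reference dataframe so that the aligned results share one schema (the snippet above leaves the filler untyped).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

// Assumption: a local session created only for this illustration.
val spark = SparkSession.builder().master("local[1]").appName("align-header").getOrCreate()
import spark.implicits._

// Hypothetical helper: reorder, drop, and add columns so df matches reference's header.
def alignTo(df: DataFrame, reference: DataFrame): DataFrame = {
  val present = df.columns.toSet
  val exprs = reference.schema.fields.map { f =>
    if (present.contains(f.name)) col(f.name)    // keep the column, in the reference position
    else lit(null).cast(f.dataType).as(f.name)   // missing column: typed null filler
  }
  df.select(exprs: _*)                           // columns not in the reference are dropped
}

val mainDF = Seq(("a", "b", "c", "d", "e")).toDF("first", "second", "third", "fourth", "fifth")
val df3 = Seq(("a", "b", "c", "d", "e", "f")).toDF("first", "second", "third", "fourth", "fifth", "sixth")
val df4 = Seq(("a", "c", "b", "d")).toDF("first", "third", "second", "fourth")

val fixed3 = alignTo(df3, mainDF) // extra column "sixth" removed
val fixed4 = alignTo(df4, mainDF) // reordered, "fifth" added as null
```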

Scala: Any better way to join two DataFrames by the relationship from the third one

I have two DataFrames and want to outer join them, but the join mapping is in another DataFrame.
Right now I am doing it the way shown below. It works, but I hope there is a more efficient way, because I have >1,000,000 rows.
val ta = sc.parallelize(Array(
(1,1,1),
(2,2,2)
)).toDF("A", "B", "C")
scala> ta.show
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 1| 1|
| 2| 2| 2|
+---+---+---+
val tb = sc.parallelize(Array(
(2,1)
)).toDF("C", "D")
scala> tb.show
+---+---+
| C| D|
+---+---+
| 2| 1|
+---+---+
val tc = sc.parallelize(Array(
(1,1,1),
(2,2,2)
)).toDF("D", "E", "F")
scala> tc.show
+---+---+---+
| D| E| F|
+---+---+---+
| 1| 1| 1|
| 2| 2| 2|
+---+---+---+
scala> val tmp = ta.join(tb, Seq("C"), "left_outer")
tmp: org.apache.spark.sql.DataFrame = [C: int, A: int, B: int, D: int]
scala> tmp.show
+---+---+---+----+
| C| A| B| D|
+---+---+---+----+
| 1| 1| 1|null|
| 2| 2| 2| 1|
+---+---+---+----+
scala> tmp.join(tc, Seq("D"), "outer").show
+----+----+----+----+----+----+
| D| C| A| B| E| F|
+----+----+----+----+----+----+
|null| 1| 1| 1|null|null|
| 1| 2| 2| 2| 1| 1|
| 2|null|null|null| 2| 2|
+----+----+----+----+----+----+
As Umberto noted, a good reference on how to improve performance of your joins is Holden Karau and Rachel Warren's High Performance Spark > Chapter 4. Joins (SQL & Core).
From the standpoint of your code, running it as you noted or the SQL equivalent (as noted below) should result in about the same performance.
// Create initial tables
val ta = sc.parallelize(Array(
(1,1,1),
(2,2,2)
)).toDF("A", "B", "C")
val tb = sc.parallelize(Array(
(2,1)
)).toDF("C", "D")
val tc = sc.parallelize(Array(
(1,1,1),
(2,2,2)
)).toDF("D", "E", "F")
// _.createOrReplaceTempView
ta.createOrReplaceTempView("ta")
tb.createOrReplaceTempView("tb")
tc.createOrReplaceTempView("tc")
// SQL Query
spark.sql("""
select tc.D, ta.A, ta.B, ta.C, tc.E, tc.F
from ta
left outer join tb
on tb.C = ta.C
full outer join tc
on tc.D = tb.D
""")
The reason is that the Spark SQL Catalyst Optimizer (as noted in the diagram below) takes the DataFrame query and builds an optimized logical plan. A number of physical plans are then generated, and the Spark SQL engine's cost optimizer chooses the best physical plan and generates the code to produce the RDDs.
That said, the key concern when you are working with a lot of rows that use up a lot of memory is the partitioning. For example, if you can ensure that the mapping DataFrame (tc) has the same or a similar partitioning scheme as the other DataFrames (ta, tb), you can get a co-located join (this is Figure 4-3 within High Performance Spark > Chapter 4. Joins).
If your three DataFrames (ta, tb, tc) are all partitioned differently, their keys will not match 1-to-1 between partitions. The result is a shuffle join (this is Figure 4-2 within High Performance Spark > Chapter 4. Joins), which could potentially be more costly.
Basically, from the standpoint of your query, the concern is less about the query itself and more about the partitioning schemes for your DataFrames. But before experimenting too much with the partitioning schemes of your DataFrames, experiment with your queries to see if the default Spark SQL / DataFrame queries are able to take care of the partitioning by itself.
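As a starting point for that kind of experiment, here is a minimal sketch that repartitions both sides of the first join on the join key and prints the physical plan, so you can compare the shuffles (Exchange operators) with and without the explicit repartitioning. It assumes a local SparkSession and reuses the toy data from the question. Whether explicit repartitioning helps at all depends on your data and cluster, so treat it as something to measure rather than a recommendation.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session created only for this illustration.
val spark = SparkSession.builder().master("local[1]").appName("join-partitioning").getOrCreate()
import spark.implicits._

val ta = Seq((1, 1, 1), (2, 2, 2)).toDF("A", "B", "C")
val tb = Seq((2, 1)).toDF("C", "D")
val tc = Seq((1, 1, 1), (2, 2, 2)).toDF("D", "E", "F")

// Hash-partition both inputs of the first join on its key.
val taByC = ta.repartition(col("C"))
val tbByC = tb.repartition(col("C"))

// Same joins as in the question, only on the repartitioned inputs.
val joined = taByC
  .join(tbByC, Seq("C"), "left_outer")
  .join(tc, Seq("D"), "outer")

// Print the physical plan to see which join strategies and shuffles were chosen.
joined.explain()

val rowCount = joined.count()
```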