How to join Datasets on multiple columns? - scala

Given two Spark Datasets, A and B, I can do a join on a single column as follows:
a.joinWith(b, $"a.col" === $"b.col", "left")
My question is whether you can do a join using multiple columns. Essentially the equivalent of the following DataFrame API code:
a.join(b, a("col") === b("col") && a("col2") === b("col2"), "left")

You can do it exactly the same way as with a DataFrame:
val xs = Seq(("a", "foo", 2.0), ("x", "bar", -1.0)).toDS
val ys = Seq(("a", "foo", 2.0), ("y", "bar", 1.0)).toDS
xs.joinWith(ys, xs("_1") === ys("_1") && xs("_2") === ys("_2"), "left").show
// +------------+-----------+
// | _1| _2|
// +------------+-----------+
// | [a,foo,2.0]|[a,foo,2.0]|
// |[x,bar,-1.0]| null|
// +------------+-----------+
In Spark < 2.0.0 you can use something like this:
xs.as("xs").joinWith(
ys.as("ys"), ($"xs._1" === $"ys._1") && ($"xs._2" === $"ys._2"), "left")

There's another way of joining: chaining where operators one after another. You first specify a join (and optionally its type) followed by one or more where operators, i.e.
scala> case class A(id: Long, name: String)
defined class A
scala> case class B(id: Long, name: String)
defined class B
scala> val as = Seq(A(0, "zero"), A(1, "one")).toDS
as: org.apache.spark.sql.Dataset[A] = [id: bigint, name: string]
scala> val bs = Seq(B(0, "zero"), B(1, "jeden")).toDS
bs: org.apache.spark.sql.Dataset[B] = [id: bigint, name: string]
scala> as.join(bs).where(as("id") === bs("id")).show
+---+----+---+-----+
| id|name| id| name|
+---+----+---+-----+
| 0|zero| 0| zero|
| 1| one| 1|jeden|
+---+----+---+-----+
scala> as.join(bs).where(as("id") === bs("id")).where(as("name") === bs("name")).show
+---+----+---+----+
| id|name| id|name|
+---+----+---+----+
| 0|zero| 0|zero|
+---+----+---+----+
The reason this works is that the Spark optimizer will join (no pun intended) consecutive wheres into the join condition itself. Use the explain operator to see the underlying logical and physical plans.
scala> as.join(bs).where(as("id") === bs("id")).where(as("name") === bs("name")).explain(extended = true)
== Parsed Logical Plan ==
Filter (name#31 = name#36)
+- Filter (id#30L = id#35L)
+- Join Inner
:- LocalRelation [id#30L, name#31]
+- LocalRelation [id#35L, name#36]
== Analyzed Logical Plan ==
id: bigint, name: string, id: bigint, name: string
Filter (name#31 = name#36)
+- Filter (id#30L = id#35L)
+- Join Inner
:- LocalRelation [id#30L, name#31]
+- LocalRelation [id#35L, name#36]
== Optimized Logical Plan ==
Join Inner, ((name#31 = name#36) && (id#30L = id#35L))
:- Filter isnotnull(name#31)
: +- LocalRelation [id#30L, name#31]
+- Filter isnotnull(name#36)
+- LocalRelation [id#35L, name#36]
== Physical Plan ==
*BroadcastHashJoin [name#31, id#30L], [name#36, id#35L], Inner, BuildRight
:- *Filter isnotnull(name#31)
: +- LocalTableScan [id#30L, name#31]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, false], input[0, bigint, false]))
+- *Filter isnotnull(name#36)
+- LocalTableScan [id#35L, name#36]

In Java, the && operator does not work. The correct way to join on multiple columns with Spark's Java API is as follows:
Dataset<Row> datasetRf1 = joinedWithDays.join(
    datasetFreq,
    datasetFreq.col("userId").equalTo(joinedWithDays.col("userId"))
        .and(datasetFreq.col("artistId").equalTo(joinedWithDays.col("artistId"))),
    "inner"
);
The and function works like the && operator.
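For what it's worth, the same Column.and method also exists in the Scala API (where && simply delegates to it), so the condition can be written either way. A small sketch reusing the xs and ys Datasets from the first answer:
// equivalent condition written with Column.and instead of the && operator
xs.joinWith(ys, (xs("_1") === ys("_1")).and(xs("_2") === ys("_2")), "left")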

Related

Why does Spark cast udf-generated columns before calling another udf but not raw columns?

I am trying to use a udf defined as Seq[Double] => Seq[Double].
When I try to use it with a "raw" array<int> defined at the creation of the DataFrame, Spark does not cast it to array<double> before applying my udf.
However, when I generate an array<int> from another udf, Spark casts the column to array<double> before calling my udf.
What is the philosophy behind these casts? Which Analyzer rule is responsible for this cast?
Here is some code to illustrate/reproduce:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(Seq(1)))),
  StructType(
    StructField("array_int", ArrayType(IntegerType, true), false) ::
    Nil
  )
)
df.show
/**
+---------+
|array_int|
+---------+
| [1]|
+---------+
*/
val f = udf ((v: Seq[Double]) => v)
val generateIntArrays = udf(() => Array.fill(2)(1))
val df1 = df.withColumn("f", f(col("array_int"))) // df1.show fails at runtime; Spark does not cast array_int before calling f
val df2 = df.withColumn("b", generateIntArrays()).withColumn("f", f(col("b"))) // df2.show works at runtime; Spark explicitly casts the output of col("b") before calling f
df1.explain // no cast
/**
== Physical Plan ==
*(1) Project [array_int#778, UDF(array_int#778) AS f#781]
+- *(1) Scan ExistingRDD[array_int#778]
*/
df2.explain // cast to array<double> before calling `f`
/**
== Physical Plan ==
*(1) Project [array_int#778, UDF() AS b#804, UDF(cast(UDF() as array<double>)) AS f#807]
+- *(1) Scan ExistingRDD[array_int#778]
*/
It seems that if you set the array element to non-nullable, then Spark will cast it to array<double>.
val f = udf ((v: Seq[Double]) => v)
// your code: nullable array element
spark.createDataFrame(
  sc.parallelize(Seq(Row(Seq(1)))),
  StructType(List(StructField("array_int", ArrayType(IntegerType, true), false)))
).withColumn("f", f(col("array_int"))).explain
== Physical Plan ==
*(1) Project [array_int#313, UDF(array_int#313) AS f#315]
+- *(1) Scan ExistingRDD[array_int#313]
// non-nullable array element
spark.createDataFrame(
  sc.parallelize(Seq(Row(Seq(1)))),
  StructType(List(StructField("array_int", ArrayType(IntegerType, false), false)))
).withColumn("f", f(col("array_int"))).explain
== Physical Plan ==
*(1) Project [array_int#301, UDF(cast(array_int#301 as array<double>)) AS f#303]
+- *(1) Scan ExistingRDD[array_int#301]
There are also some interesting observations for UDFs that take doubles and are called on an integer column. Again, the query plan depends on the nullability of the column.
val f = udf ((v: Double) => v)
// nullable integer
spark.createDataFrame(
  sc.parallelize(Seq(Row(1))),
  StructType(List(StructField("int", IntegerType, true)))
).withColumn("F", f(col("int"))).explain
== Physical Plan ==
*(1) Project [int#359, if (isnull(cast(int#359 as double))) null else UDF(knownnotnull(cast(int#359 as double))) AS F#361]
+- *(1) Scan ExistingRDD[int#359]
// non-nullable integer
spark.createDataFrame(
  sc.parallelize(Seq(Row(1))),
  StructType(List(StructField("int", IntegerType, false)))
).withColumn("F", f(col("int"))).explain
== Physical Plan ==
*(1) Project [int#365, UDF(cast(int#365 as double)) AS F#367]
+- *(1) Scan ExistingRDD[int#365]
I suppose the reason behind this behaviour is null handling (because the UDF cannot accept null / array of nulls, and they need to be handled before the UDF is called). Perhaps Spark cannot figure out a way to handle nulls inside an array.
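If the goal is simply to make df1 work, one workaround I would expect to apply here (my own sketch, not from the original post) is to cast the column to array<double> explicitly before handing it to the udf, so the analyzer no longer has to insert a cast around the call itself:
// hypothetical fix: cast explicitly, then call the udf on the already-double array
val df1Fixed = df.withColumn("f", f(col("array_int").cast("array<double>")))
df1Fixed.explain // the Project should now show cast(array_int as array<double>) feeding the UDF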

How does the Spark DAG work when joining two DataFrames derived from the same parent?

Let's say I have some transformations like the snippet below, where I want to join two DataFrames derived from the very same parent one in Spark. How would the DAG be optimized (or not) for this computation, and is a persist on the initially read value of any use?
val dataFrame = readDataframe() // .persist() ?
val derived1 = dataFrame.transform(/* tranformation1 */)
val derived2 = dataFrame.transform(/* tranformation2 */)
val result = derived1.join(derived2, /* condition*/)
result.show()
persist is not useful here because, due to lazy evaluation, no actual work is done anywhere in your code before the final action. The physical plans below show that persisting on its own doesn't optimize the physical plan at all.
However, if you call an action such as .count() or .show() between the transformations, you force Spark to evaluate your query up to that point, and persist will be helpful in that case.
Without persist:
scala> val df = spark.range(10)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> val df1 = df.transform(x => x.select($"id", $"id" * 2))
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint, (id * 2): bigint]
scala> val df2 = df.transform(x => x.select($"id", $"id" + 2))
df2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint, (id + 2): bigint]
scala> val result = df1.join(df2, "id")
result: org.apache.spark.sql.DataFrame = [id: bigint, (id * 2): bigint ... 1 more field]
scala> result.explain()
== Physical Plan ==
*(2) Project [id#8L, (id * 2)#15L, (id + 2)#18L]
+- *(2) BroadcastHashJoin [id#8L], [id#21L], Inner, BuildRight
:- *(2) Project [id#8L, (id#8L * 2) AS (id * 2)#15L]
: +- *(2) Range (0, 10, step=1, splits=24)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#39]
+- *(1) Project [id#21L, (id#21L + 2) AS (id + 2)#18L]
+- *(1) Range (0, 10, step=1, splits=24)
With persist:
scala> val df0 = df.persist()
df0: df.type = [id: bigint]
scala> val df1 = df0.transform(x => x.select($"id", $"id" * 2))
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint, (id * 2): bigint]
scala> val df2 = df0.transform(x => x.select($"id", $"id" + 2))
df2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint, (id + 2): bigint]
scala> val result = df1.join(df2, "id")
result: org.apache.spark.sql.DataFrame = [id: bigint, (id * 2): bigint ... 1 more field]
scala> result.explain()
== Physical Plan ==
*(2) Project [id#8L, (id * 2)#50L, (id + 2)#53L]
+- *(2) BroadcastHashJoin [id#8L], [id#56L], Inner, BuildRight
:- *(2) Project [id#8L, (id#8L * 2) AS (id * 2)#50L]
: +- *(2) ColumnarToRow
: +- InMemoryTableScan [id#8L]
: +- InMemoryRelation [id#8L], StorageLevel(disk, memory, deserialized, 1 replicas)
: +- *(1) Range (0, 10, step=1, splits=24)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#100]
+- *(1) Project [id#56L, (id#56L + 2) AS (id + 2)#53L]
+- *(1) ColumnarToRow
+- InMemoryTableScan [id#56L]
+- InMemoryRelation [id#56L], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Range (0, 10, step=1, splits=24)
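For completeness, a minimal sketch (illustrative only, building on the df0, df1 and df2 defined above) of forcing the cache to materialize before the join:
df0.count()                // action: materializes the cached parent once
df1.join(df2, "id").show() // both branches now read from the InMemoryRelation instead of recomputing the parent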

Re-assigning a value to a column doesn't work, but creates a new column that cannot be selected

I have written a Scala function to join two DataFrames with the same schema, say df1 and df2. For every key in df1, if the key matches one in df2, we pick up the values from df2 for that key; if not, we keep df1's values. It is supposed to return a DataFrame with the same number of rows as df1 but different values, yet the function doesn't work and returns the same df as df1.
def joinDFwithConditions(df1: DataFrame, df2: DataFrame, key_seq: Seq[String]) = {
  var final_df = df1.as("a").join(df2.as("b"), key_seq, "left_outer")
  // set of non-key columns
  val col_str = df1.columns.toSet -- key_seq.toSet
  for (c <- col_str) { // for every matched record, check the values from both dataframes
    final_df = final_df
      .withColumn(s"$c",
        when(col(s"b.$c").isNull || col(s"b.$c").isNaN, col(s"a.$c"))
          .otherwise(col(s"b.$c")))
    // I used to re-assign the value with the reference "t.$c",
    // but it returns an error saying no t.col found in schema
  }
  final_df.show()
  final_df.select(df1.columns.map(x => df1(x)): _*)
}
def main(args: Array[String]) {
  val sparkSession = SparkSession.builder().appName(this.getClass.getName)
    .config("spark.hadoop.validateOutputSpecs", "false")
    .enableHiveSupport()
    .getOrCreate()
  import sparkSession.implicits._
  val df1 = List(("key1", 1), ("key2", 2), ("key3", 3)).toDF("x", "y")
  val df2 = List(("key1", 9), ("key2", 8)).toDF("x", "y")
  joinDFwithConditions(df1, df2, Seq("x")).show()
  sparkSession.stop()
}
df1 sample
+----+---+
|   x|  y|
+----+---+
|key1|  1|
|key2|  2|
|key3|  3|
+----+---+
df2 sample
+----+---+
|   x|  y|
+----+---+
|key1|  9|
|key2|  8|
+----+---+
expected results:
+----+---+
|   x|  y|
+----+---+
|key1|  9|
|key2|  8|
|key3|  3|
+----+---+
what actually shows:
+-------+---+---+
| x | y| y|
+-------+---+---+
| key1 | 9| 9|
| key2 | 8| 8|
| key3 | 3| 3|
+-------+---+---+
error message
ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: Resolved attribute(s) y#6 missing from x#5,y#21,y#22 in operator !Project [x#5, y#6]. Attribute(s) with the same name appear in the operation: y. Please check if the right attribute(s) are used.;;
!Project [x#5, y#6]
+- Project [x#5, CASE WHEN (isnull(y#15) || isnan(cast(y#15 as double))) THEN y#6 ELSE y#15 END AS y#21, CASE WHEN (isnull(y#15) || isnan(cast(y#15 as double))) THEN y#6 ELSE y#15 END AS y#22]
+- Project [x#5, y#6, y#15]
+- Join LeftOuter, (x#5 = x#14)
:- SubqueryAlias `a`
: +- Project [_1#2 AS x#5, _2#3 AS y#6]
: +- LocalRelation [_1#2, _2#3]
+- SubqueryAlias `b`
+- Project [_1#11 AS x#14, _2#12 AS y#15]
+- LocalRelation [_1#11, _2#12]
When you do df.as("a"), you do not rename the column of the dataframe. You simply allow to access them with a.columnName in order to lift an ambiguity. Therefore, your when goes well because you use aliases but you end up with multiple y columns. I am quite surprised by the way that it manages to replace one of the y columns...
When you try to access it with its name y however (without prefix), spark does know which one you want and throws an error.
To avoid errors, you could simply do everything you need with one select like this:
df1.as("a").join(df2.as("b"), key_cols, "left_outer")
.select(key_cols.map(col) ++
df1
.columns
.diff(key_cols)
.map(c => when(col(s"b.$c").isNull || col(s"b.$c").isNaN, col(s"a.$c"))
.otherwise(col(s"b.$c"))
.alias(c)
) : _*)
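Wrapped back into the question's original signature, this would look roughly like the sketch below (using key_seq as the parameter name, as in the question); calling joinDFwithConditions(df1, df2, Seq("x")) should then give the expected key1 -> 9, key2 -> 8, key3 -> 3:
def joinDFwithConditions(df1: DataFrame, df2: DataFrame, key_seq: Seq[String]): DataFrame =
  df1.as("a").join(df2.as("b"), key_seq, "left_outer")
    .select(key_seq.map(col) ++
      df1.columns.diff(key_seq).map(c =>
        when(col(s"b.$c").isNull || col(s"b.$c").isNaN, col(s"a.$c"))
          .otherwise(col(s"b.$c"))
          .alias(c)): _*)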

Why does filtering on a non-existing (non-selected) column work?

The following minimal example
val df1 = spark.createDataFrame(Seq((0, "a"), (1, "b"))).toDF("foo", "bar")
val df2 = df1.select($"foo")
val df3 = df2.filter($"bar" === lit("a"))
df1.printSchema
df1.show
df2.printSchema
df2.show
df3.printSchema
df3.show
Runs with no errors:
root
|-- foo: integer (nullable = false)
|-- bar: string (nullable = true)
+---+---+
|foo|bar|
+---+---+
| 0| a|
| 1| b|
+---+---+
root
|-- foo: integer (nullable = false)
+---+
|foo|
+---+
| 0|
| 1|
+---+
root
|-- foo: integer (nullable = false)
+---+
|foo|
+---+
| 0|
+---+
However, I expected something like
org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given input columns: [foo];
for the same reason that I get
org.apache.spark.sql.AnalysisException: cannot resolve '`asdasd`' given input columns: [foo];
when I do
val df4 = df2.filter($"asdasd" === lit("a"))
But it does not happen. Why?
I'm leaning towards calling it a bug. An explain plan tells a little more:
val df1 = Seq((0, "a"), (1, "b")).toDF("foo", "bar")
df1.select("foo").where($"bar" === "a").explain(true)
// == Parsed Logical Plan ==
// 'Filter ('bar = a)
// +- Project [foo#4]
// +- Project [_1#0 AS foo#4, _2#1 AS bar#5]
// +- LocalRelation [_1#0, _2#1]
//
// == Analyzed Logical Plan ==
// foo: int
// Project [foo#4]
// +- Filter (bar#5 = a)
// +- Project [foo#4, bar#5]
// +- Project [_1#0 AS foo#4, _2#1 AS bar#5]
// +- LocalRelation [_1#0, _2#1]
//
// == Optimized Logical Plan ==
// LocalRelation [foo#4]
//
// == Physical Plan ==
// LocalTableScan [foo#4]
Apparently, both the parsed logical plan and the analyzed (or resolved) logical plan still contain bar in their Project nodes (i.e. projections), and the filtering operation continues to honor the supposedly removed column.
On a related note, the logical plans for the following query also contain the dropped column, exhibiting a similar anomaly:
df1.drop("bar").where($"bar" === "a")

Left join operation runs forever

I have two DataFrames that I want to join with a left join.
df1 =
+----------+---------------+
|product_PK| rec_product_PK|
+----------+---------------+
| 560| 630|
| 710| 240|
| 610| 240|
df2 =
+----------+---------------+-----+
|product_PK| rec_product_PK| rank|
+----------+---------------+-----+
| 560| 610| 1|
| 560| 240| 1|
| 610| 240| 0|
The problem is that df1 contains only 500 rows, while df2 contains 600,000,000 rows and 24 partitions. My left join takes a long time to execute. I have been waiting for 5 hours and it has not finished.
val result = df1.join(df2,Seq("product_PK","rec_product_PK"),"left")
The result should contain 500 rows. I execute the code from spark-shell using the following parameters:
spark-shell --driver-memory 10G --driver-cores 4 --executor-memory 10G --num-executors 2 --executor-cores 4
How can I speed up the process?
UPDATE:
The output of df2.explain(true):
== Parsed Logical Plan ==
Repartition 5000, true
+- Project [product_PK#15L AS product_PK#195L, product_PK#189L AS reco_product_PK#196L, col2#190 AS rank#197]
+- Project [product_PK#15L, array_elem#184.product_PK AS product_PK#189L, array_elem#184.col2 AS col2#190]
+- Project [product_PK#15L, products#16, array_elem#184]
+- Generate explode(products#16), true, false, [array_elem#184]
+- Relation[product_PK#15L,products#16] parquet
== Analyzed Logical Plan ==
product_PK: bigint, rec_product_PK: bigint, rank: int
Repartition 5000, true
+- Project [product_PK#15L AS product_PK#195L, product_PK#189L AS reco_product_PK#196L, col2#190 AS rank_product_family#197]
+- Project [product_PK#15L, array_elem#184.product_PK AS product_PK#189L, array_elem#184.col2 AS col2#190]
+- Project [product_PK#15L, products#16, array_elem#184]
+- Generate explode(products#16), true, false, [array_elem#184]
+- Relation[product_PK#15L,products#16] parquet
== Optimized Logical Plan ==
Repartition 5000, true
+- Project [product_PK#15L, array_elem#184.product_PK AS rec_product_PK#196L, array_elem#184.col2 AS rank#197]
+- Generate explode(products#16), true, false, [array_elem#184]
+- Relation[product_PK#15L,products#16] parquet
== Physical Plan ==
Exchange RoundRobinPartitioning(5000)
+- *Project [product_PK#15L, array_elem#184.product_PK AS rec_PK#196L, array_elem#184.col2 AS rank#197]
+- Generate explode(products#16), true, false, [array_elem#184]
+- *FileScan parquet [product_PK#15L,products#16] Batched: false, Format: Parquet, Location: InMemoryFileIndex[s3://data/result/2017-11-27/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<product_PK:bigint,products:array<struct<product_PK:bigint,col2:int>>>
You should probably use a different type of join. By default, the join you are making assumes both dataframes are large, and therefore a lot of shuffling is done (generally each row is hashed, the data is shuffled based on that hash, and then a per-executor join is done). You can see this by calling explain on the result to inspect the execution plan.
Instead consider using the broadcast hint:
val result = df2.join(broadcast(df1),Seq("product_PK","rec_product_PK"),"right")
Note that I flipped the join order so the broadcasted dataframe appears in the join parameters. The broadcast function is part of org.apache.spark.sql.functions.
This would do a broadcast join instead: df1 would be copied to all executors and the joining would be done locally, avoiding the need to shuffle the large df2.
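You can verify that the hint takes effect by inspecting the physical plan of the result (a quick check, illustrative only):
result.explain()
// the plan should contain a BroadcastHashJoin with df1 on the build side,
// rather than a SortMergeJoin preceded by a full shuffle of df2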
Given the exceptionally small size of your df1, it might be worth considering first collecting it into a list, filtering the large df2 with that list down to a comparably small dataframe, and then using that for a left join with df1:
val df1 = Seq(
  (560L, 630L),
  (710L, 240L),
  (610L, 240L)
).toDF("product_PK", "rec_product_PK")
val df2 = Seq(
  (560L, 610L, 1),
  (560L, 240L, 1),
  (610L, 240L, 0)
).toDF("product_PK", "rec_product_PK", "rank")
import org.apache.spark.sql.Row
val pkList = df1.collect.map{
  case Row(pk1: Long, pk2: Long) => (pk1, pk2)
}.toList
// pkList: List[(Long, Long)] = List((560,630), (710,240), (610,240))
def inPkList(pkList: List[(Long, Long)]) = udf(
  (pk1: Long, pk2: Long) => pkList.contains( (pk1, pk2) )
)
val df2Filtered = df2.where( inPkList(pkList)($"product_PK", $"rec_product_PK") )
// +----------+--------------+----+
// |product_PK|rec_product_PK|rank|
// +----------+--------------+----+
// | 610| 240| 0|
// +----------+--------------+----+
df1.join(df2Filtered, Seq("product_PK", "rec_product_PK"), "left_outer")
// +----------+--------------+----+
// |product_PK|rec_product_PK|rank|
// +----------+--------------+----+
// | 560| 630|null|
// | 710| 240|null|
// | 610| 240| 0|
// +----------+--------------+----+