Left join operation runs forever - scala

I have two DataFrames that I want to join using a left join.
df1 =
+----------+---------------+
|product_PK| rec_product_PK|
+----------+---------------+
|       560|            630|
|       710|            240|
|       610|            240|
+----------+---------------+

df2 =
+----------+---------------+-----+
|product_PK| rec_product_PK| rank|
+----------+---------------+-----+
|       560|            610|    1|
|       560|            240|    1|
|       610|            240|    0|
+----------+---------------+-----+
The problem is that df1 contains only 500 rows, while df2 contains 600,000,000 rows across 24 partitions. My left join takes a very long time to execute: I have been waiting for 5 hours and it has not finished.
val result = df1.join(df2,Seq("product_PK","rec_product_PK"),"left")
The result should contain 500 rows. I execute the code from spark-shell using the following parameters:
spark-shell --driver-memory 10G --driver-cores 4 --executor-memory 10G --num-executors 2 --executor-cores 4
How can I speed up the process?
UPDATE:
The output of df2.explain(true):
== Parsed Logical Plan ==
Repartition 5000, true
+- Project [product_PK#15L AS product_PK#195L, product_PK#189L AS reco_product_PK#196L, col2#190 AS rank#197]
   +- Project [product_PK#15L, array_elem#184.product_PK AS product_PK#189L, array_elem#184.col2 AS col2#190]
      +- Project [product_PK#15L, products#16, array_elem#184]
         +- Generate explode(products#16), true, false, [array_elem#184]
            +- Relation[product_PK#15L,products#16] parquet

== Analyzed Logical Plan ==
product_PK: bigint, rec_product_PK: bigint, rank: int
Repartition 5000, true
+- Project [product_PK#15L AS product_PK#195L, product_PK#189L AS reco_product_PK#196L, col2#190 AS rank_product_family#197]
   +- Project [product_PK#15L, array_elem#184.product_PK AS product_PK#189L, array_elem#184.col2 AS col2#190]
      +- Project [product_PK#15L, products#16, array_elem#184]
         +- Generate explode(products#16), true, false, [array_elem#184]
            +- Relation[product_PK#15L,products#16] parquet

== Optimized Logical Plan ==
Repartition 5000, true
+- Project [product_PK#15L, array_elem#184.product_PK AS rec_product_PK#196L, array_elem#184.col2 AS rank#197]
   +- Generate explode(products#16), true, false, [array_elem#184]
      +- Relation[product_PK#15L,products#16] parquet

== Physical Plan ==
Exchange RoundRobinPartitioning(5000)
+- *Project [product_PK#15L, array_elem#184.product_PK AS rec_PK#196L, array_elem#184.col2 AS rank#197]
   +- Generate explode(products#16), true, false, [array_elem#184]
      +- *FileScan parquet [product_PK#15L,products#16] Batched: false, Format: Parquet, Location: InMemoryFileIndex[s3://data/result/2017-11-27/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<product_PK:bigint,products:array<struct<product_PK:bigint,col2:int>>>

You should probably use a different type of join. By default, the join you are doing assumes both DataFrames are large, so a lot of shuffling is done: generally each row is hashed, the data is shuffled based on the hash, and then a per-executor join is performed. You can see this by calling explain on the result to inspect the execution plan.
Instead, consider using the broadcast hint:
val result = df2.join(broadcast(df1),Seq("product_PK","rec_product_PK"),"right")
Note that I flipped the join order so that the small df1 is the one wrapped in broadcast, and changed the join type to "right" so the result still keeps all rows of df1. The broadcast function is part of org.apache.spark.sql.functions.
This performs a broadcast join instead: df1 is copied to all executors and the join is done locally, avoiding the need to shuffle the large df2.
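As a quick sanity check (a minimal sketch, assuming df1 and df2 are defined as in the question), you can verify that the hint takes effect by inspecting the plan of the result:

import org.apache.spark.sql.functions.broadcast

// Broadcast the small df1; the "right" outer join still keeps every row of df1.
val result = df2.join(broadcast(df1), Seq("product_PK", "rec_product_PK"), "right")

// The physical plan should now show a BroadcastHashJoin instead of a sort-merge join.
result.explain()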

Given the exceptionally small size of your df1, it might be worth first collecting it into a list, filtering the large df2 with that list down to a comparably small DataFrame, and then using the filtered result in a left join with df1:
val df1 = Seq(
  (560L, 630L),
  (710L, 240L),
  (610L, 240L)
).toDF("product_PK", "rec_product_PK")

val df2 = Seq(
  (560L, 610L, 1),
  (560L, 240L, 1),
  (610L, 240L, 0)
).toDF("product_PK", "rec_product_PK", "rank")

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

val pkList = df1.collect.map{
  case Row(pk1: Long, pk2: Long) => (pk1, pk2)
}.toList
// pkList: List[(Long, Long)] = List((560,630), (710,240), (610,240))

def inPkList(pkList: List[(Long, Long)]) = udf(
  (pk1: Long, pk2: Long) => pkList.contains((pk1, pk2))
)

val df2Filtered = df2.where( inPkList(pkList)($"product_PK", $"rec_product_PK") )
// +----------+--------------+----+
// |product_PK|rec_product_PK|rank|
// +----------+--------------+----+
// |       610|           240|   0|
// +----------+--------------+----+

df1.join(df2Filtered, Seq("product_PK", "rec_product_PK"), "left_outer")
// +----------+--------------+----+
// |product_PK|rec_product_PK|rank|
// +----------+--------------+----+
// |       560|           630|null|
// |       710|           240|null|
// |       610|           240|   0|
// +----------+--------------+----+
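As a variation on the same idea (a sketch only, not benchmarked against the UDF version), you could also pre-filter df2 with a broadcast left_semi join on the two key columns before the final left join:

import org.apache.spark.sql.functions.broadcast

// Keep only the (product_PK, rec_product_PK) pairs of df2 that also appear in df1,
// then left-join the now-small filtered result back to df1.
val df2Filtered = df2.join(broadcast(df1), Seq("product_PK", "rec_product_PK"), "left_semi")
df1.join(df2Filtered, Seq("product_PK", "rec_product_PK"), "left_outer")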

Related

re-assign value to column doesn't work, but create a new column that cannot be selected

I have written a Scala function to join two dataframes with the same schema, say df1 and df2. For every key in df1, if the key matches one in df2, we take the value from df2 for that key; if not, we keep df1's value. It is supposed to return a dataframe with the same number of rows as df1 but with different values, but the function doesn't work and returns the same df as df1.
def joinDFwithConditions(df1: DataFrame, df2: DataFrame, key_seq: Seq[String]) = {
  var final_df = df1.as("a").join(df2.as("b"), key_seq, "left_outer")
  // set of non-key columns
  val col_str = df1.columns.toSet -- key_seq.toSet
  for (c <- col_str) { // for every matched record, check values from both dataframes
    final_df = final_df
      .withColumn(s"$c",
        when(col(s"b.$c").isNull || col(s"b.$c").isNaN, col(s"a.$c"))
          .otherwise(col(s"b.$c")))
    // I used to re-assign the value with the reference "t.$c",
    // but it returned an error saying no t.col found in schema
  }
  final_df.show()
  final_df.select(df1.columns.map(x => df1(x)): _*)
}

def main(args: Array[String]) {
  val sparkSession = SparkSession.builder().appName(this.getClass.getName)
    .config("spark.hadoop.validateOutputSpecs", "false")
    .enableHiveSupport()
    .getOrCreate()
  import sparkSession.implicits._

  val df1 = List(("key1", 1), ("key2", 2), ("key3", 3)).toDF("x", "y")
  val df2 = List(("key1", 9), ("key2", 8)).toDF("x", "y")

  joinDFwithConditions(df1, df2, Seq("x")).show()
  sparkSession.stop()
}
df1 sample
+----+---+
|   x|  y|
+----+---+
|key1|  1|
|key2|  2|
|key3|  3|
+----+---+

df2 sample
+----+---+
|   x|  y|
+----+---+
|key1|  9|
|key2|  8|
+----+---+

expected results:
+----+---+
|   x|  y|
+----+---+
|key1|  9|
|key2|  8|
|key3|  3|
+----+---+

what really shows:
+----+---+---+
|   x|  y|  y|
+----+---+---+
|key1|  9|  9|
|key2|  8|  8|
|key3|  3|  3|
+----+---+---+
error message
ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: Resolved attribute(s) y#6 missing from x#5,y#21,y#22 in operator !Project [x#5, y#6]. Attribute(s) with the same name appear in the operation: y. Please check if the right attribute(s) are used.;;
!Project [x#5, y#6]
+- Project [x#5, CASE WHEN (isnull(y#15) || isnan(cast(y#15 as double))) THEN y#6 ELSE y#15 END AS y#21, CASE WHEN (isnull(y#15) || isnan(cast(y#15 as double))) THEN y#6 ELSE y#15 END AS y#22]
   +- Project [x#5, y#6, y#15]
      +- Join LeftOuter, (x#5 = x#14)
         :- SubqueryAlias `a`
         :  +- Project [_1#2 AS x#5, _2#3 AS y#6]
         :     +- LocalRelation [_1#2, _2#3]
         +- SubqueryAlias `b`
            +- Project [_1#11 AS x#14, _2#12 AS y#15]
               +- LocalRelation [_1#11, _2#12]
When you do df.as("a"), you do not rename the column of the dataframe. You simply allow to access them with a.columnName in order to lift an ambiguity. Therefore, your when goes well because you use aliases but you end up with multiple y columns. I am quite surprised by the way that it manages to replace one of the y columns...
When you try to access it with its name y however (without prefix), spark does know which one you want and throws an error.
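You can see this quickly in the shell (a small check using the df1 from the question):

df1.as("a").columns          // Array(x, y): the alias renames nothing
df1.as("a").select($"a.y")   // works, because the alias only disambiguates references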
To avoid errors, you could simply do everything you need with one select like this:
df1.as("a").join(df2.as("b"), key_cols, "left_outer")
.select(key_cols.map(col) ++
df1
.columns
.diff(key_cols)
.map(c => when(col(s"b.$c").isNull || col(s"b.$c").isNaN, col(s"a.$c"))
.otherwise(col(s"b.$c"))
.alias(c)
) : _*)
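If it helps, this is how the original joinDFwithConditions could be rewritten around that single select (a sketch that keeps the question's function name and null/NaN semantics):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}

def joinDFwithConditions(df1: DataFrame, df2: DataFrame, keyCols: Seq[String]): DataFrame = {
  // Join on the key columns, then build every non-key column exactly once:
  // take b's value unless it is null/NaN, in which case keep a's value.
  df1.as("a").join(df2.as("b"), keyCols, "left_outer")
    .select(keyCols.map(col) ++
      df1.columns
        .diff(keyCols)
        .map(c => when(col(s"b.$c").isNull || col(s"b.$c").isNaN, col(s"a.$c"))
          .otherwise(col(s"b.$c"))
          .alias(c)) : _*)
}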

Why does filtering on a non-existing (non-selected) column work?

The following minimal example
val df1 = spark.createDataFrame(Seq((0, "a"), (1, "b"))).toDF("foo", "bar")
val df2 = df1.select($"foo")
val df3 = df2.filter($"bar" === lit("a"))
df1.printSchema
df1.show
df2.printSchema
df2.show
df3.printSchema
df3.show
runs with no errors:
root
|-- foo: integer (nullable = false)
|-- bar: string (nullable = true)
+---+---+
|foo|bar|
+---+---+
| 0| a|
| 1| b|
+---+---+
root
|-- foo: integer (nullable = false)
+---+
|foo|
+---+
| 0|
| 1|
+---+
root
|-- foo: integer (nullable = false)
+---+
|foo|
+---+
| 0|
+---+
However, I expected something like
org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given input columns: [foo];
for the same reason that I get
org.apache.spark.sql.AnalysisException: cannot resolve '`asdasd`' given input columns: [foo];
when I do
val df4 = df2.filter($"asdasd" === lit("a"))
But it does not happen. Why?
I'm leaning towards calling it a bug. An explain plan tells a little more:
val df1 = Seq((0, "a"), (1, "b")).toDF("foo", "bar")
df1.select("foo").where($"bar" === "a").explain(true)
// == Parsed Logical Plan ==
// 'Filter ('bar = a)
// +- Project [foo#4]
//    +- Project [_1#0 AS foo#4, _2#1 AS bar#5]
//       +- LocalRelation [_1#0, _2#1]
//
// == Analyzed Logical Plan ==
// foo: int
// Project [foo#4]
// +- Filter (bar#5 = a)
//    +- Project [foo#4, bar#5]
//       +- Project [_1#0 AS foo#4, _2#1 AS bar#5]
//          +- LocalRelation [_1#0, _2#1]
//
// == Optimized Logical Plan ==
// LocalRelation [foo#4]
//
// == Physical Plan ==
// LocalTableScan [foo#4]
Apparently, both the parsed logical plan and the analyzed (or resolved) logical plan still contain bar in their Project nodes (i.e. projections), and the filtering operation continues to honor the supposedly removed column.
On a related note, the logical plans for the following query also contain the dropped column, exhibiting a similar anomaly:
df1.drop("bar").where($"bar" === "a")

How do I use a column I created in a Spark Join? - Ambiguous Error

I have been fighting with this for a while in Scala, and I cannot seem to find a clear solution for it.
I have 2 dataframes:
val Companies = Seq(
  (8, "Yahoo"),
  (-5, "Google"),
  (12, "Microsoft"),
  (-10, "Uber")
).toDF("movement", "Company")

val LookUpTable = Seq(
  ("B", "Buy"),
  ("S", "Sell")
).toDF("Code", "Description")
I need to create a column in Companies that allows me to join to the lookup table. It's a simple case statement that checks whether the movement is negative: if it is, then sell, else buy. I then need to join to the lookup table on this newly created column.
val joined = Companies.as("Companies")
  .withColumn("Code", expr("CASE WHEN movement > 0 THEN 'B' ELSE 'S' END"))
  .join(LookUpTable.as("LookUpTable"), $"LookUpTable.Code" === $"Code", "left_outer")
However, I keep getting the following error:
org.apache.spark.sql.AnalysisException: Reference 'Code' is ambiguous, could be: Code, LookUpTable.Code.;
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:259)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:101)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$40.apply(Analyzer.scala:888)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$40.apply(Analyzer.scala:890)
at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve(Analyzer.scala:887)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve$2.apply(Analyzer.scala:896)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve$2.apply(Analyzer.scala:896)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve(Analyzer.scala:896)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$35.apply(Analyzer.scala:956)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$35.apply(Analyzer.scala:956)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105
I have tried adding the alias for Code, but that does not work:
val joined = Companies.as("Companies")
  .withColumn("Code", expr("CASE WHEN movement > 0 THEN 'B' ELSE 'S' END"))
  .join(LookUpTable.as("LookUpTable"), $"LookUpTable.Code" === $"Companies.Code", "left_outer")
org.apache.spark.sql.AnalysisException: cannot resolve '`Companies.Code`' given input columns: [Code, LookUpTable.Code, LookUpTable.Description, Companies.Company, Companies.movement];;
'Join LeftOuter, (Code#102625 = 'Companies.Code)
:- Project [movement#102616, Company#102617, CASE WHEN (movement#102616 > 0) THEN B ELSE S END AS Code#102629]
:  +- SubqueryAlias `Companies`
:     +- Project [_1#102613 AS movement#102616, _2#102614 AS Company#102617]
:        +- LocalRelation [_1#102613, _2#102614]
+- SubqueryAlias `LookUpTable`
   +- Project [_1#102622 AS Code#102625, _2#102623 AS Description#102626]
      +- LocalRelation [_1#102622, _2#102623]
The only workaround I found was to alias the newly created column, but that then creates an additional column, which feels incorrect.
val joined = Companies.as("Companies")
  .withColumn("_Code", expr("CASE WHEN movement > 0 THEN 'B' ELSE 'S' END")).as("Code")
  .join(LookUpTable.as("LookUpTable"), $"LookUpTable.Code" === $"Code", "left_outer")
joined.show()
+--------+---------+-----+----+-----------+
|movement|  Company|_Code|Code|Description|
+--------+---------+-----+----+-----------+
|       8|    Yahoo|    B|   B|        Buy|
|       8|    Yahoo|    B|   S|       Sell|
|      -5|   Google|    S|   B|        Buy|
|      -5|   Google|    S|   S|       Sell|
|      12|Microsoft|    B|   B|        Buy|
|      12|Microsoft|    B|   S|       Sell|
|     -10|     Uber|    S|   B|        Buy|
|     -10|     Uber|    S|   S|       Sell|
+--------+---------+-----+----+-----------+
Is there a way to join on the newly created column without having to create a new dataframe or new column through an alias?
Have you tried using Seq in the Spark DataFrame join?
1. Using Seq, without a duplicate column:
val joined = Companies.as("Companies")
  .withColumn("Code", expr("CASE WHEN movement > 0 THEN 'B' ELSE 'S' END"))
  .join(LookUpTable.as("LookUpTable"), Seq("Code"), "left_outer")
2. Alias after withColumn, but it will generate a duplicate column:
val joined = Companies.withColumn("Code", expr("CASE WHEN movement > 0 THEN 'B' ELSE 'S' END")).as("Companies")
  .join(LookUpTable.as("LookUpTable"), $"LookUpTable.Code" === $"Companies.Code", "left_outer")
Aliasing is required if you need columns from two different dataframes that have the same name. This is because the Spark DataFrame API creates a schema for the dataframe, and in a given schema you can never have two or more columns with the same name.
This is also the reason that, in SQL, a SELECT query without aliasing works, but a CREATE TABLE AS SELECT of the same query would throw an error like "duplicate columns".
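For illustration (a small sketch using the joined value from variant 2 above, which ends up with a Code column from each side), an unqualified reference is ambiguous, while the aliases disambiguate it:

// joined.select($"Code")   // would fail: Reference 'Code' is ambiguous
joined.select($"Companies.Code", $"LookUpTable.Code")   // fine: qualified by the aliases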
3. An expression can be used for the join:
val codeExpression = expr("CASE WHEN movement > 0 THEN 'B' ELSE 'S' END")

val joined = Companies.as("Companies")
  .join(LookUpTable.as("LookUpTable"), $"LookUpTable.Code" === codeExpression, "left_outer")

spark aggregating column into Set efficiently

How can I aggregate a column into a Set (an array of unique elements) in Spark efficiently?
case class Foo(a: String, b: String, c: Int, d: Array[String])

val df = Seq(
  Foo("A", "A", 123, Array("A")),
  Foo("A", "A", 123, Array("B")),
  Foo("B", "B", 123, Array("C", "A")),
  Foo("B", "B", 123, Array("C", "E", "A")),
  Foo("B", "B", 123, Array("D"))
).toDS()
Will result in
+---+---+---+---------+
|  a|  b|  c|        d|
+---+---+---+---------+
|  A|  A|123|      [A]|
|  A|  A|123|      [B]|
|  B|  B|123|   [C, A]|
|  B|  B|123|[C, E, A]|
|  B|  B|123|      [D]|
+---+---+---+---------+
What I am looking for is (the ordering of the d column is not important):
+---+---+---+------------+
|  a|  b|  c|           d|
+---+---+---+------------+
|  A|  A|123|      [A, B]|
|  B|  B|123|[C, A, E, D]|
+---+---+---+------------+
This may be a bit similar to How to aggregate values into collection after groupBy? or the example from High Performance Spark at https://github.com/high-performance-spark/high-performance-spark-examples/blob/57a6267fb77fae5a90109bfd034ae9c18d2edf22/src/main/scala/com/high-performance-spark-examples/transformations/SmartAggregations.scala#L33-L43
Using the following code:
import org.apache.spark.sql.functions.{collect_list, udf}

val flatten = udf((xs: Seq[Seq[String]]) => xs.flatten.distinct)
val d = flatten(collect_list($"d")).alias("d")
df.groupBy($"a", $"b", $"c").agg(d).show
will produce the desired result, but I wonder whether there are any possibilities to improve performance using the RDD API as outlined in the book. I would also like to know how to formulate it using the Dataset API.
Details about the execution for this minimal sample follow below:
== Optimized Logical Plan ==
GlobalLimit 21
+- LocalLimit 21
   +- Aggregate [a#45, b#46, c#47], [a#45, b#46, c#47, UDF(collect_list(d#48, 0, 0)) AS d#82]
      +- LocalRelation [a#45, b#46, c#47, d#48]

== Physical Plan ==
CollectLimit 21
+- SortAggregate(key=[a#45, b#46, c#47], functions=[collect_list(d#48, 0, 0)], output=[a#45, b#46, c#47, d#82])
   +- *Sort [a#45 ASC NULLS FIRST, b#46 ASC NULLS FIRST, c#47 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(a#45, b#46, c#47, 200)
         +- LocalTableScan [a#45, b#46, c#47, d#48]
Edit:
The problems of this operation are outlined very well at https://github.com/awesome-spark/spark-gotchas/blob/master/04_rdd_actions_and_transformations_by_example.md#be-smart-about-groupbykey
Edit 2:
As you can see, the DAG for the Dataset query suggested below is more complicated, and instead of 0.4 seconds it seems to take 2 seconds.
Try this
df.groupByKey(foo => (foo.a, foo.b, foo.c))
  .reduceGroups {
    (foo1, foo2) =>
      foo1.copy(d = (foo1.d ++ foo2.d).distinct)
  }
  .map(_._2)
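If you prefer to stay in the DataFrame/SQL API (a sketch of an alternative, not benchmarked against the reduceGroups version), exploding the array and using collect_set avoids the flattening UDF entirely:

import org.apache.spark.sql.functions.{collect_set, explode}

// Explode each array into one row per element, then gather the distinct elements per group.
df.withColumn("d", explode($"d"))
  .groupBy($"a", $"b", $"c")
  .agg(collect_set($"d").alias("d"))
  .show()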

How to join Datasets on multiple columns?

Given two Spark Datasets, A and B, I can do a join on a single column as follows:
a.joinWith(b, $"a.col" === $"b.col", "left")
My question is whether you can do a join using multiple columns, essentially the equivalent of the following DataFrame API code:
a.join(b, a("col") === b("col") && a("col2") === b("col2"), "left")
You can do it exactly the same way as with a DataFrame:
val xs = Seq(("a", "foo", 2.0), ("x", "bar", -1.0)).toDS
val ys = Seq(("a", "foo", 2.0), ("y", "bar", 1.0)).toDS
xs.joinWith(ys, xs("_1") === ys("_1") && xs("_2") === ys("_2"), "left").show
// +------------+-----------+
// |          _1|         _2|
// +------------+-----------+
// | [a,foo,2.0]|[a,foo,2.0]|
// |[x,bar,-1.0]|       null|
// +------------+-----------+
In Spark < 2.0.0 you can use something like this:
xs.as("xs").joinWith(
ys.as("ys"), ($"xs._1" === $"ys._1") && ($"xs._2" === $"ys._2"), "left")
There's another way of joining: chaining one where after another. You first specify a join (and optionally its type), followed by where operator(s), i.e.
scala> case class A(id: Long, name: String)
defined class A
scala> case class B(id: Long, name: String)
defined class B
scala> val as = Seq(A(0, "zero"), A(1, "one")).toDS
as: org.apache.spark.sql.Dataset[A] = [id: bigint, name: string]
scala> val bs = Seq(B(0, "zero"), B(1, "jeden")).toDS
bs: org.apache.spark.sql.Dataset[B] = [id: bigint, name: string]
scala> as.join(bs).where(as("id") === bs("id")).show
+---+----+---+-----+
| id|name| id| name|
+---+----+---+-----+
| 0|zero| 0| zero|
| 1| one| 1|jeden|
+---+----+---+-----+
scala> as.join(bs).where(as("id") === bs("id")).where(as("name") === bs("name")).show
+---+----+---+----+
| id|name| id|name|
+---+----+---+----+
| 0|zero| 0|zero|
+---+----+---+----+
The reason for such a goodie is that the Spark optimizer will join (no pun intended) consecutive wheres into a single join condition. Use the explain operator to see the underlying logical and physical plans.
scala> as.join(bs).where(as("id") === bs("id")).where(as("name") === bs("name")).explain(extended = true)
== Parsed Logical Plan ==
Filter (name#31 = name#36)
+- Filter (id#30L = id#35L)
   +- Join Inner
      :- LocalRelation [id#30L, name#31]
      +- LocalRelation [id#35L, name#36]

== Analyzed Logical Plan ==
id: bigint, name: string, id: bigint, name: string
Filter (name#31 = name#36)
+- Filter (id#30L = id#35L)
   +- Join Inner
      :- LocalRelation [id#30L, name#31]
      +- LocalRelation [id#35L, name#36]

== Optimized Logical Plan ==
Join Inner, ((name#31 = name#36) && (id#30L = id#35L))
:- Filter isnotnull(name#31)
:  +- LocalRelation [id#30L, name#31]
+- Filter isnotnull(name#36)
   +- LocalRelation [id#35L, name#36]

== Physical Plan ==
*BroadcastHashJoin [name#31, id#30L], [name#36, id#35L], Inner, BuildRight
:- *Filter isnotnull(name#31)
:  +- LocalTableScan [id#30L, name#31]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, false], input[0, bigint, false]))
   +- *Filter isnotnull(name#36)
      +- LocalTableScan [id#35L, name#36]
In Java, the && operator does not work. The correct way to join based on multiple columns in Spark-Java is as below:
Dataset<Row> datasetRf1 = joinedWithDays.join(
    datasetFreq,
    datasetFreq.col("userId").equalTo(joinedWithDays.col("userId"))
        .and(datasetFreq.col("artistId").equalTo(joinedWithDays.col("artistId"))),
    "inner"
);
The and function works like the && operator.