.where((col('Country')==Country) & (col('Year')>startYear))
I can write the where conditions either way. I think the form below adds readability. Is there any other difference, and which one is best?
.where(col('Country')==Country)
.where(col('Year')>startYear)
If the question is about readability, I would suggest something like this:
.where(F.expr(f"Country <=> '{Country}' and Year > {startYear}"))
Here <=> is the null-safe equality operator. With a plain equality comparison, Spark evaluates the condition to null (unknown) when either side is null, so those rows are silently dropped by the filter.
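A minimal sketch of the null handling, shown in Scala syntax (PySpark exposes the same operator as Column.eqNullSafe); the two-row DataFrame is made up for illustration:
import spark.implicits._                         // available by default in spark-shell
import org.apache.spark.sql.functions.col

val countries = Seq(Some("india"), None).toDF("Country")

// With plain equality, the null row satisfies neither the condition nor its negation,
// because null == 'india' evaluates to null (unknown) and is filtered out either way:
countries.where(!(col("Country") === "india")).show()   // empty result

// With null-safe equality the comparison is never null, so the null row comes back
// from the negated filter as expected:
countries.where(!(col("Country") <=> "india")).show()   // keeps the null row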
I worked through a sample, and both forms give the same results, so there is no other difference.
data.show()
+---+---------+----+
| id| Country|year|
+---+---------+----+
| 1| india|2018|
| 2| usa|2018|
| 3| france|2019|
| 4| china|2019|
| 5| india|2020|
| 6|australia|2021|
| 7| india|2016|
| 8| usa|2019|
+---+---------+----+
# consider Country = 'india', startYear = 2017
data.where((col('Country')=='india') & (col('Year')>2017)).show()
+---+-------+----+
| id|Country|year|
+---+-------+----+
| 1| india|2018|
| 5| india|2020|
+---+-------+----+
data.where(col('Country')=='india')\
.where(col('Year')>2017).show()
+---+-------+----+
| id|Country|year|
+---+-------+----+
| 1| india|2018|
| 5| india|2020|
+---+-------+----+
The explain method is useful for understanding how a query is executed. It shows the execution plan with all the steps involved, and it can be used in this case to compare the two filtering strategies.
Given the following example:
from pyspark.sql.functions import col
df = spark.createDataFrame([("Spain", 2020),
("Italy", 2020),
("Andorra", 2021),
("Spain", 2021),
("Spain", 2022)], ("Country", "Year"))
df.show()
Country = "Spain"
startYear = 2020
The extended output of the AND strategy is:
df.where((col('Country') == Country) & (col('Year') > startYear)).explain(True)
== Parsed Logical Plan ==
'Filter (('Country = Spain) AND ('Year > 2020))
+- LogicalRDD [Country#80, Year#81L], false
== Analyzed Logical Plan ==
Country: string, Year: bigint
Filter ((Country#80 = Spain) AND (Year#81L > cast(2020 as bigint)))
+- LogicalRDD [Country#80, Year#81L], false
== Optimized Logical Plan ==
Filter (((isnotnull(Country#80) AND isnotnull(Year#81L)) AND (Country#80 = Spain)) AND (Year#81L > 2020))
+- LogicalRDD [Country#80, Year#81L], false
== Physical Plan ==
*(1) Filter (((isnotnull(Country#80) AND isnotnull(Year#81L)) AND (Country#80 = Spain)) AND (Year#81L > 2020))
+- *(1) Scan ExistingRDD[Country#80,Year#81L]
while the plan of the multiple where strategy is:
df.where(col('Country') == Country).where(col('Year') > startYear).explain(True)
== Parsed Logical Plan ==
'Filter ('Year > 2020)
+- Filter (Country#80 = Spain)
+- LogicalRDD [Country#80, Year#81L], false
== Analyzed Logical Plan ==
Country: string, Year: bigint
Filter (Year#81L > cast(2020 as bigint))
+- Filter (Country#80 = Spain)
+- LogicalRDD [Country#80, Year#81L], false
== Optimized Logical Plan ==
Filter (((isnotnull(Country#80) AND isnotnull(Year#81L)) AND (Country#80 = Spain)) AND (Year#81L > 2020))
+- LogicalRDD [Country#80, Year#81L], false
== Physical Plan ==
*(1) Filter (((isnotnull(Country#80) AND isnotnull(Year#81L)) AND (Country#80 = Spain)) AND (Year#81L > 2020))
+- *(1) Scan ExistingRDD[Country#80,Year#81L]
The query engine came up with the same physical plan regardless of the filtering strategy, so the two queries are equivalent. I agree that the second one is better for readability.
I am using spark-sql 2.4.1. In my use case I use a window spec with the rank() function to find the latest records.
I have to find the latest record for certain partitioning keys, ordered by insertion_date.
It is extremely slow. Can this window-spec rank() be used in production-grade code?
Or is there an alternative approach recommended, specifically to improve performance?
Please advise.
I'm currently using the below code:
val data = sqlContext.read.format("org.apache.spark.sql.cassandra")
  .option("spark.cassandra.connection.host", hosts)
  .options(Map("table" -> "source_table", "keyspace" -> "calc")).load()
  .where(col("error").equalTo(lit(200)))
  .filter(col("insert_date").gt(lit("2015-01-01")))
  .filter(col("insert_date").lt(lit("2016-01-01")))
  .where(col("id").equalTo(lit(mId)))
The explain plan:
== Physical Plan ==
*(1) Project [cast(cast(unix_timestamp(insert_date#199, yyyy-MM-dd, Some(America/New_York)) as timestamp) as date) AS insert_date#500, 3301 AS id#399, create_date#201, company_id#202, ... 76 more fields]
+- *(1) Filter (((((((cast(error#263 as int) = 200) && (cast(insert_date#199 as string) >= 2018-01-01)) && (cast(insert_date#199 as string) <= 2018-01-01)) && isnotnull(id#200)) && isnotnull(insert_date#199)) && isnotnull(error#263)) && (id#200 = 3301))
+- *(1) Scan org.apache.spark.sql.cassandra.CassandraSourceRelation#261d7ee2 [... 76 more fields] PushedFilters: [IsNotNull(id), IsNotNull(insert_date), IsNotNull(error), EqualTo(id,3301)], ReadSchema: struct<...
Per-partition row counts of the data being read:
+-----------+------+
|partitionId| count|
+-----------+------+
| 1829| 29|
| 1959| 16684|
| 496| 3795|
| 2659| 524|
| 1591| 87|
| 2811| 2436|
| 2235| 620|
| 2563| 252|
| 1721| 12|
| 737| 1695|
| 858| 182|
| 2580| 73106|
| 3179| 694|
| 1460| 13|
| 1990| 66|
| 1522| 951|
| 540| 11|
| 1127|823084|
| 2999| 9|
| 623| 6629|
+-----------+------+
only showing top 20 rows
20/05/19 06:32:53 WARN ReaderCassandra: Processed started : 1589864993863 ended : 1589869973496 timetaken : 4979 s
val ws = Window.partitionBy("id").orderBy(desc("insert_date"), desc("update_date"))
val ranked_data = data.withColumn("rank", rank().over(ws))
  .where($"rank" === lit(1))
  .select("*")
I have been fighting with this for a while in Scala and cannot seem to find a clear solution.
I have 2 dataframes:
val Companies = Seq(
(8, "Yahoo"),
(-5, "Google"),
(12, "Microsoft"),
(-10, "Uber")
).toDF("movement", "Company")
val LookUpTable = Seq(
("B", "Buy"),
("S", "Sell")
).toDF("Code", "Description")
I need to create a column in Companies that allows me to join to the lookup table. It's a simple case statement that checks whether the movement is negative (then sell, else buy). I then need to join to the lookup table on this newly created column.
val joined = Companies.as("Companies")
.withColumn("Code",expr("CASE WHEN movement > 0 THEN 'B' ELSE 'S' END"))
.join(LookUpTable.as("LookUpTable"), $"LookUpTable.Code" === $"Code", "left_outer")
However, I keep getting the following error:
org.apache.spark.sql.AnalysisException: Reference 'Code' is ambiguous, could be: Code, LookUpTable.Code.;
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:259)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:101)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$40.apply(Analyzer.scala:888)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$40.apply(Analyzer.scala:890)
at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve(Analyzer.scala:887)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve$2.apply(Analyzer.scala:896)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve$2.apply(Analyzer.scala:896)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve(Analyzer.scala:896)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$35.apply(Analyzer.scala:956)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$35.apply(Analyzer.scala:956)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105
I have tried adding the alias for Code, but that does not work:
val joined = Companies.as("Companies")
.withColumn("Code",expr("CASE WHEN movement > 0 THEN 'B' ELSE 'S' END"))
.join(LookUpTable.as("LookUpTable"), $"LookUpTable.Code" === $"Companies.Code", "left_outer")
org.apache.spark.sql.AnalysisException: cannot resolve '`Companies.Code`' given input columns: [Code, LookUpTable.Code, LookUpTable.Description, Companies.Company, Companies.movement];;
'Join LeftOuter, (Code#102625 = 'Companies.Code)
:- Project [movement#102616, Company#102617, CASE WHEN (movement#102616 > 0) THEN B ELSE S END AS Code#102629]
: +- SubqueryAlias `Companies`
: +- Project [_1#102613 AS movement#102616, _2#102614 AS Company#102617]
: +- LocalRelation [_1#102613, _2#102614]
+- SubqueryAlias `LookUpTable`
+- Project [_1#102622 AS Code#102625, _2#102623 AS Description#102626]
+- LocalRelation [_1#102622, _2#102623]
The only workaround I found was to alias the newly created column; however, that then creates an additional column, which feels incorrect.
val joined = Companies.as("Companies")
.withColumn("_Code",expr("CASE WHEN movement > 0 THEN 'B' ELSE 'S' END")).as("Code")
.join(LookUpTable.as("LookUpTable"), $"LookUpTable.Code" === $"Code", "left_outer")
joined.show()
+--------+---------+-----+----+-----------+
|movement| Company|_Code|Code|Description|
+--------+---------+-----+----+-----------+
| 8| Yahoo| B| B| Buy|
| 8| Yahoo| B| S| Sell|
| -5| Google| S| B| Buy|
| -5| Google| S| S| Sell|
| 12|Microsoft| B| B| Buy|
| 12|Microsoft| B| S| Sell|
| -10| Uber| S| B| Buy|
| -10| Uber| S| S| Sell|
+--------+---------+-----+----+-----------+
Is there a way to join on the newly created column without having to create a new dataframe or new column through an alias?
Have you tried using a Seq of column names in the Spark DataFrame join?
1. Using Seq (no duplicate column):
val joined = Companies.as("Companies")
.withColumn("Code",expr("CASE WHEN movement > 0 THEN 'B' ELSE 'S' END"))
.join(LookUpTable.as("LookUpTable"), Seq("Code"), "left_outer")
2. Aliasing after withColumn (but this will generate a duplicate Code column):
val joined = Companies.withColumn("Code",expr("CASE WHEN movement > 0 THEN 'B' ELSE 'S' END")).as("Companies")
.join(LookUpTable.as("LookUpTable"), $"LookUpTable.Code" === $"Companies.Code", "left_outer")
Aliasing is required when you need columns with the same name from two different DataFrames. This is because the Spark DataFrame API creates a schema for the DataFrame, and in a given schema you can never have two or more columns with the same name.
This is also the reason that, in SQL, a SELECT query that returns duplicate column names works, but a CREATE TABLE AS SELECT over the same query throws a duplicate-column error.
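A minimal sketch of that SQL behaviour, assuming the Companies and LookUpTable DataFrames from the question; the view and table names here are made up for illustration:
import org.apache.spark.sql.functions.expr

Companies.withColumn("Code", expr("CASE WHEN movement > 0 THEN 'B' ELSE 'S' END"))
  .createOrReplaceTempView("companies_v")
LookUpTable.createOrReplaceTempView("lookup_v")

// A plain SELECT may return two columns that are both named Code:
spark.sql("SELECT c.Code, l.Code FROM companies_v c JOIN lookup_v l ON c.Code = l.Code").show()

// Materializing the same query fails, because a table schema cannot hold
// duplicate column names:
// spark.sql("CREATE TABLE joined AS SELECT c.Code, l.Code FROM companies_v c JOIN lookup_v l ON c.Code = l.Code")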
An expression can also be used directly in the join condition:
val codeExpression = expr("CASE WHEN movement > 0 THEN 'B' ELSE 'S' END")
val joined = Companies.as("Companies")
.join(LookUpTable.as("LookUpTable"), $"LookUpTable.Code" === codeExpression, "left_outer")
I have two DataFrames that I want to join using a left join.
df1 =
+----------+---------------+
|product_PK| rec_product_PK|
+----------+---------------+
| 560| 630|
| 710| 240|
| 610| 240|
df2 =
+----------+---------------+-----+
|product_PK| rec_product_PK| rank|
+----------+---------------+-----+
| 560| 610| 1|
| 560| 240| 1|
| 610| 240| 0|
The problem is that df1 contains only 500 rows, while df2 contains 600,000,000 rows across 24 partitions. My left join takes a long time to execute; it has been running for 5 hours and is not finished.
val result = df1.join(df2,Seq("product_PK","rec_product_PK"),"left")
The result should contain 500 rows. I execute the code from spark-shell using the following parameters:
spark-shell --driver-memory 10G --driver-cores 4 --executor-memory 10G --num-executors 2 --executor-cores 4
How can I speed up the process?
UPDATE:
The output of df2.explain(true):
== Parsed Logical Plan ==
Repartition 5000, true
+- Project [product_PK#15L AS product_PK#195L, product_PK#189L AS reco_product_PK#196L, col2#190 AS rank#197]
+- Project [product_PK#15L, array_elem#184.product_PK AS product_PK#189L, array_elem#184.col2 AS col2#190]
+- Project [product_PK#15L, products#16, array_elem#184]
+- Generate explode(products#16), true, false, [array_elem#184]
+- Relation[product_PK#15L,products#16] parquet
== Analyzed Logical Plan ==
product_PK: bigint, rec_product_PK: bigint, rank: int
Repartition 5000, true
+- Project [product_PK#15L AS product_PK#195L, product_PK#189L AS reco_product_PK#196L, col2#190 AS rank_product_family#197]
+- Project [product_PK#15L, array_elem#184.product_PK AS product_PK#189L, array_elem#184.col2 AS col2#190]
+- Project [product_PK#15L, products#16, array_elem#184]
+- Generate explode(products#16), true, false, [array_elem#184]
+- Relation[product_PK#15L,products#16] parquet
== Optimized Logical Plan ==
Repartition 5000, true
+- Project [product_PK#15L, array_elem#184.product_PK AS rec_product_PK#196L, array_elem#184.col2 AS rank#197]
+- Generate explode(products#16), true, false, [array_elem#184]
+- Relation[product_PK#15L,products#16] parquet
== Physical Plan ==
Exchange RoundRobinPartitioning(5000)
+- *Project [product_PK#15L, array_elem#184.product_PK AS rec_PK#196L, array_elem#184.col2 AS rank#197]
+- Generate explode(products#16), true, false, [array_elem#184]
+- *FileScan parquet [product_PK#15L,products#16] Batched: false, Format: Parquet, Location: InMemoryFileIndex[s3://data/result/2017-11-27/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<product_PK:bigint,products:array<struct<product_PK:bigint,col2:int>>>
You should probably use a different type of join. By default, the join you are doing assumes both DataFrames are large, so a lot of shuffling is done (generally each row is hashed, the data is shuffled based on the hash, and then a per-executor join is performed). You can see this by calling explain on the result to inspect the execution plan.
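For example, a sketch assuming the df1 and df2 from the question (the exact plan depends on the input sizes and spark.sql.autoBroadcastJoinThreshold):
val slowJoin = df1.join(df2, Seq("product_PK", "rec_product_PK"), "left")
slowJoin.explain()
// Typically shows a SortMergeJoin preceded by an Exchange (shuffle) on both sides
// when neither input is small enough to be broadcast automatically.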
Instead consider using the broadcast hint:
val result = df2.join(broadcast(df1),Seq("product_PK","rec_product_PK"),"right")
Note that I flipped the join order so that the broadcast DataFrame appears in the join parameters. The broadcast function is part of org.apache.spark.sql.functions.
This does a broadcast join instead: df1 is copied to all executors and the join is done locally, avoiding the need to shuffle the large df2.
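A quick way to confirm the effect, under the same assumptions as above, is to look at the physical plan of the broadcast version:
import org.apache.spark.sql.functions.broadcast

val result = df2.join(broadcast(df1), Seq("product_PK", "rec_product_PK"), "right")
result.explain()
// The plan should now show a BroadcastHashJoin (with a BroadcastExchange for df1)
// instead of shuffling the large df2.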
Given the exceptionally small size of your df1, it might be worth first collecting it into a list, using that list to filter the large df2 down to a comparably small DataFrame, and then using that for a left join with df1:
val df1 = Seq(
(560L, 630L),
(710L, 240L),
(610L, 240L)
).toDF("product_PK", "rec_product_PK")
val df2 = Seq(
(560L, 610L, 1),
(560L, 240L, 1),
(610L, 240L, 0)
).toDF("product_PK", "rec_product_PK", "rank")
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
val pkList = df1.collect.map{
case Row(pk1: Long, pk2: Long) => (pk1, pk2)
}.toList
// pkList: List[(Long, Long)] = List((560,630), (710,240), (610,240))
def inPkList(pkList: List[(Long, Long)]) = udf(
(pk1: Long, pk2: Long) => pkList.contains( (pk1, pk2) )
)
val df2Filtered = df2.where( inPkList(pkList)($"product_PK", $"rec_product_PK") )
// +----------+--------------+----+
// |product_PK|rec_product_PK|rank|
// +----------+--------------+----+
// | 610| 240| 0|
// +----------+--------------+----+
df1.join(df2Filtered, Seq("product_PK", "rec_product_PK"), "left_outer")
// +----------+--------------+----+
// |product_PK|rec_product_PK|rank|
// +----------+--------------+----+
// | 560| 630|null|
// | 710| 240|null|
// | 610| 240| 0|
// +----------+--------------+----+
How can I efficiently aggregate a column into a set (an array of unique elements) in Spark?
case class Foo(a:String, b:String, c:Int, d:Array[String])
val df = Seq(Foo("A", "A", 123, Array("A")),
Foo("A", "A", 123, Array("B")),
Foo("B", "B", 123, Array("C", "A")),
Foo("B", "B", 123, Array("C", "E", "A")),
Foo("B", "B", 123, Array("D"))
).toDS()
Calling df.show() will result in:
+---+---+---+---------+
| a| b| c| d|
+---+---+---+---------+
| A| A|123| [A]|
| A| A|123| [B]|
| B| B|123| [C, A]|
| B| B|123|[C, E, A]|
| B| B|123| [D]|
+---+---+---+---------+
What I am looking for is this (the ordering of the d column is not important):
+---+---+---+------------+
| a| b| c| d|
+---+---+---+------------+
| A| A|123| [A, B]|
| B| B|123|[C, A, E, D]|
+---+---+---+------------+
This may be a bit similar to How to aggregate values into collection after groupBy? or the example from High Performance Spark at https://github.com/high-performance-spark/high-performance-spark-examples/blob/57a6267fb77fae5a90109bfd034ae9c18d2edf22/src/main/scala/com/high-performance-spark-examples/transformations/SmartAggregations.scala#L33-L43
Using the following code:
import org.apache.spark.sql.functions.{collect_list, udf}
val flatten = udf((xs: Seq[Seq[String]]) => xs.flatten.distinct)
val d = flatten(collect_list($"d")).alias("d")
df.groupBy($"a", $"b", $"c").agg(d).show
will produce the desired result, but I wonder whether there is any way to improve performance using the RDD API as outlined in the book. I would also like to know how to formulate it using the Dataset API.
Details about the execution for this minimal sample follow below:
== Optimized Logical Plan ==
GlobalLimit 21
+- LocalLimit 21
+- Aggregate [a#45, b#46, c#47], [a#45, b#46, c#47, UDF(collect_list(d#48, 0, 0)) AS d#82]
+- LocalRelation [a#45, b#46, c#47, d#48]
== Physical Plan ==
CollectLimit 21
+- SortAggregate(key=[a#45, b#46, c#47], functions=[collect_list(d#48, 0, 0)], output=[a#45, b#46, c#47, d#82])
+- *Sort [a#45 ASC NULLS FIRST, b#46 ASC NULLS FIRST, c#47 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(a#45, b#46, c#47, 200)
+- LocalTableScan [a#45, b#46, c#47, d#48]
Edit:
The problems with this operation are outlined very well at https://github.com/awesome-spark/spark-gotchas/blob/master/04_rdd_actions_and_transformations_by_example.md#be-smart-about-groupbykey
Edit 2:
As you can see, the DAG for the Dataset query suggested below is more complicated, and instead of 0.4 seconds it seems to take 2 seconds.
Try this:
df.groupByKey(foo => (foo.a, foo.b, foo.c))
  .reduceGroups { (foo1, foo2) =>
    foo1.copy(d = (foo1.d ++ foo2.d).distinct)
  }
  .map(_._2)
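A sketch of how the two formulations can be compared side by side, assuming the df, the flatten UDF, and the imports from the question are in scope, so the plans behind the DAGs mentioned in the edit can be inspected:
import org.apache.spark.sql.functions.collect_list

// Dataset API version, as suggested above:
val viaDataset = df
  .groupByKey(foo => (foo.a, foo.b, foo.c))
  .reduceGroups((foo1, foo2) => foo1.copy(d = (foo1.d ++ foo2.d).distinct))
  .map(_._2)

// DataFrame version from the question:
val viaDataFrame = df.groupBy($"a", $"b", $"c")
  .agg(flatten(collect_list($"d")).alias("d"))

viaDataset.explain()
viaDataFrame.explain()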
Given two Spark Datasets, A and B, I can do a join on a single column as follows:
a.joinWith(b, $"a.col" === $"b.col", "left")
My question is whether you can do a join using multiple columns; essentially the equivalent of the following DataFrame API code:
a.join(b, a("col") === b("col") && a("col2") === b("col2"), "left")
You can do it exactly the same way as with a DataFrame:
val xs = Seq(("a", "foo", 2.0), ("x", "bar", -1.0)).toDS
val ys = Seq(("a", "foo", 2.0), ("y", "bar", 1.0)).toDS
xs.joinWith(ys, xs("_1") === ys("_1") && xs("_2") === ys("_2"), "left").show
// +------------+-----------+
// | _1| _2|
// +------------+-----------+
// | [a,foo,2.0]|[a,foo,2.0]|
// |[x,bar,-1.0]| null|
// +------------+-----------+
In Spark < 2.0.0 you can use something like this:
xs.as("xs").joinWith(
ys.as("ys"), ($"xs._1" === $"ys._1") && ($"xs._2" === $"ys._2"), "left")
There's another way of joining: chaining one where after another. You first specify a join (and optionally its type) followed by where operator(s), i.e.:
scala> case class A(id: Long, name: String)
defined class A
scala> case class B(id: Long, name: String)
defined class B
scala> val as = Seq(A(0, "zero"), A(1, "one")).toDS
as: org.apache.spark.sql.Dataset[A] = [id: bigint, name: string]
scala> val bs = Seq(B(0, "zero"), B(1, "jeden")).toDS
bs: org.apache.spark.sql.Dataset[B] = [id: bigint, name: string]
scala> as.join(bs).where(as("id") === bs("id")).show
+---+----+---+-----+
| id|name| id| name|
+---+----+---+-----+
| 0|zero| 0| zero|
| 1| one| 1|jeden|
+---+----+---+-----+
scala> as.join(bs).where(as("id") === bs("id")).where(as("name") === bs("name")).show
+---+----+---+----+
| id|name| id|name|
+---+----+---+----+
| 0|zero| 0|zero|
+---+----+---+----+
The reason for such a goodie is that the Spark optimizer will join (no pun intended) the consecutive wheres into the join condition. Use the explain operator to see the underlying logical and physical plans.
scala> as.join(bs).where(as("id") === bs("id")).where(as("name") === bs("name")).explain(extended = true)
== Parsed Logical Plan ==
Filter (name#31 = name#36)
+- Filter (id#30L = id#35L)
+- Join Inner
:- LocalRelation [id#30L, name#31]
+- LocalRelation [id#35L, name#36]
== Analyzed Logical Plan ==
id: bigint, name: string, id: bigint, name: string
Filter (name#31 = name#36)
+- Filter (id#30L = id#35L)
+- Join Inner
:- LocalRelation [id#30L, name#31]
+- LocalRelation [id#35L, name#36]
== Optimized Logical Plan ==
Join Inner, ((name#31 = name#36) && (id#30L = id#35L))
:- Filter isnotnull(name#31)
: +- LocalRelation [id#30L, name#31]
+- Filter isnotnull(name#36)
+- LocalRelation [id#35L, name#36]
== Physical Plan ==
*BroadcastHashJoin [name#31, id#30L], [name#36, id#35L], Inner, BuildRight
:- *Filter isnotnull(name#31)
: +- LocalTableScan [id#30L, name#31]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, false], input[0, bigint, false]))
+- *Filter isnotnull(name#36)
+- LocalTableScan [id#35L, name#36]
In Java, the && operator does not work on Column objects. The correct way to join on multiple columns in Spark with Java is as follows:
Dataset<Row> datasetRf1 = joinedWithDays.join(
datasetFreq,
datasetFreq.col("userId").equalTo(joinedWithDays.col("userId"))
.and(datasetFreq.col("artistId").equalTo(joinedWithDays.col("artistId"))),
"inner"
);
The and function works like the && operator.