Why does my column exist in my pyspark dataframe after dropping it? - pyspark

I am using pyspark version 2.4.5 and Databricks runtime 6.5 and I have run into unexpected behavior. My code is as follows:
import pyspark.sql.functions as F
df_A = spark.table(...)
df_B = df_A.drop(
F.col("colA")
)
df_C = df_B.filter(
F.col("colA") > 0
)
When I assign df_C by filtering on df_B I expect an error to be thrown as "colA" has been dropped. But this code works fine when I run it. Is this expected or am I missing something?

Spark constructs an explain plan that makes sense and applies the drop after the filter. You can see that from the explain plan e.g.
spark.createDataFrame([('foo','bar')]).drop(col('_2')).filter(col('_2') == 'bar').explain()
Gives:
== Physical Plan ==
*(1) Project [_1#0]
+- *(1) Filter (isnotnull(_2#1) && (_2#1 = bar))
+- Scan ExistingRDD[_1#0,_2#1]
In the above explain plan, the projection of the dropped column happens after the filter.

Related

Spark is pushing down a filter even when the column is not in the dataframe

I have a DataFrame with the columns:
field1, field1_name, field3, field5, field4, field2, field6
I am selecting it so that I only keep field1, field2, field3, field4. Note that there is no field5 after the select.
After that, I have a filter that uses field5 and I would expect it to throw an analysis error since the column is not there, but instead it is filtering the original DataFrame (before the select) because it is pushing down the filter, as shown here:
== Parsed Logical Plan ==
'Filter ('field5 = 22)
+- Project [field1#43, field2#48, field3#45, field4#47]
+- Relation[field1#43,field1_name#44,field3#45,field5#46,field4#47,field2#48,field6#49] csv
== Analyzed Logical Plan ==
field1: string, field2: string, field3: string, field4: string
Project [field1#43, field2#48, field3#45, field4#47]
+- Filter (field5#46 = 22)
+- Project [field1#43, field2#48, field3#45, field4#47, field5#46]
+- Relation[field1#43,field1_name#44,field3#45,field5#46,field4#47,field2#48,field6#49] csv
== Optimized Logical Plan ==
Project [field1#43, field2#48, field3#45, field4#47]
+- Filter (isnotnull(field5#46) && (field5#46 = 22))
+- Relation[field1#43,field1_name#44,field3#45,field5#46,field4#47,field2#48,field6#49] csv
== Physical Plan ==
*Project [field1#43, field2#48, field3#45, field4#47]
+- *Filter (isnotnull(field5#46) && (field5#46 = 22))
+- *FileScan csv [field1#43,field3#45,field5#46,field4#47,field2#48] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/..., PartitionFilters: [], PushedFilters: [IsNotNull(field5), EqualTo(field5,22)], ReadSchema: struct<field1:string,field3:string,field5:string,field4:stri...
As you can see the physical plan has the filter before the project... Is this the expected behaviour? I would expect an analysis exception instead...
A reproducible example of the issue:
val df = Seq(
("", "", "")
).toDF("field1", "field2", "field3")
val selected = df.select("field1", "field2")
val shouldFail = selected.filter("field3 == 'dummy'") // I was expecting this filter to fail
shouldFail.show()
Output:
+------+------+
|field1|field2|
+------+------+
+------+------+
The documentation on the Dataset/Dataframe describes the reason for what you are observing quite well:
"Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. "
The important part is highlighted in bold. When applying select and filter statements it just gets added to a logical plan that gets only parsed by Spark when an action is applied. When parsing this full logical plan, the Catalyst Optimizer looks at the whole plan and one of the optimization rules is to push down filters, which is what you see in your example.
I think this is a great feature. Even though you are not interested in seeing this particular field in your final Dataframe, it understands that you are not interested in some of the original data.
That is the main benefit of Spark SQL engine as opposed to RDDs. It understands what you are trying to do without being told how to do it.

Join two large spark dataframes persisted in parquet using Scala

I'm trying to join two large Spark dataframes using Scala and I can't get it to perform well. I really hope someone can help me.
I have the following two text files:
dfPerson.txt (PersonId: String, GroupId: String) 2 million rows (100MB)
dfWorld.txt (PersonId: String, GroupId: String, PersonCharacteristic: String) 30 billion rows (1TB)
First I parse the text files to parquet and partition on GroupId, which has 50 distinct values and a rest group.
val dfPerson = spark.read.csv("input/dfPerson.txt")
dfPerson.write.partitionBy("GroupId").parquet("output/dfPerson")
val dfWorld = spark.read.csv("input/dfWorld.txt")
dfWorld.write.partitionBy("GroupId").parquet("output/dfWorld")
Note: a GroupId can contain 1 PersonId up to 6 billion PersonIds, so since it is skewed it might not be the best partition column but it is all I could think of.
Next I read the parquet files and join them, I took the following approaches:
Approach 1: Basic spark join operation
val dfPerson = spark.read.parquet("output/dfPerson")
val dfWorld = spark.read.parquet("output/dfWorld")
dfWorld.as("w").join(
dfPerson.as("p"),
$"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
"right"
)
.drop($"w.GroupId")
.drop($"w.PersonId")
This however didn't perform well and shuffled over 1 TB of data.
Approach 2: Broadcast hash join
Since dfPerson might be small enough to hold in memory I thought this approach might solve my problem
val dfPerson = spark.read.parquet("output/dfPerson")
val dfWorld = spark.read.parquet("output/dfWorld")
dfWorld.as("w").join(
broadcast(dfPerson).as("p"),
$"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
"right"
)
.drop($"w.GroupId")
.drop($"w.PersonId")
This also didn't perform well and also shuffled over 1 TB of data which makes me believe the broadcast didn't work?
Approach 3: Bucket and sort the dataframe
I first try to bucket and sort the dataframes before writing to parquet and then join:
val dfPersonInput = spark.read.csv("input/dfPerson.txt")
dfPersonInput
.write
.format("parquet")
.partitionBy("GroupId")
.bucketBy(4,"PersonId")
.sortBy("PersonId")
.mode("overwrite")
.option("path", "output/dfPerson")
.saveAsTable("dfPerson")
val dfPerson = spark.table("dfPerson")
val dfWorldInput = spark.read.csv("input/dfWorld.txt")
dfWorldInput
.write
.format("parquet")
.partitionBy("GroupId")
.bucketBy(4,"PersonId")
.sortBy("PersonId")
.mode("overwrite")
.option("path", "output/dfWorld")
.saveAsTable("dfWorld")
val dfWorld = spark.table("dfWorld")
dfWorld.as("w").join(
dfPerson.as("p"),
$"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
"right"
)
.drop($"w.GroupId")
.drop($"w.PersonId")
With the following execution plan:
== Physical Plan ==
*(5) Project [PersonId#743]
+- SortMergeJoin [GroupId#73, PersonId#71], [GroupId#745, PersonId#743], RightOuter
:- *(2) Sort [GroupId#73 ASC NULLS FIRST, PersonId#71 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(GroupId#73, PersonId#71, 200)
: +- *(1) Project [PersonId#71, PersonCharacteristic#72, GroupId#73]
: +- *(1) Filter isnotnull(PersonId#71)
: +- *(1) FileScan parquet default.dfWorld[PersonId#71,PersonCharacteristic#72,GroupId#73] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[file:/F:/Output/dfWorld..., PartitionCount: 52, PartitionFilters: [isnotnull(GroupId#73)], PushedFilters: [IsNotNull(PersonId)], ReadSchema: struct<PersonId:string,PersonCharacteristic:string>, SelectedBucketsCount: 4 out of 4
+- *(4) Sort [GroupId#745 ASC NULLS FIRST, PersonId#743 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(GroupId#745, PersonId#743, 200)
+- *(3) FileScan parquet default.dfPerson[PersonId#743,GroupId#745] Batched: true, Format: Parquet, Location: CatalogFileIndex[file:/F:/Output/dfPerson], PartitionCount: 45, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<PersonId:string,GroupId:string>, SelectedBucketsCount: 4 out of 4
Also this didn't perform well.
To conclude
All approaches take approximately 150-200 hours (based on the progress on stages and tasks in the spark jobs after 24 hours) and follow the following strategy:
DAG visualization
I guess there is something I'm missing with either the partitioning, bucketing, sorting parquet, or all of them.
Any help would be greatly appreciated.
What is the goal you're trying to achieve?
Why do you need to have it joined?
Join for a sake of join will take you nowhere, unless you have enough memory/disk space to collect 1TB x 100MB worth of data
Edited based on response
If you only need records related to persons that are presented in dfPerson then you don't need right/left join, inner join would be what you want.
Broadcast will only work if your DF is less than broadcast settings in your Spark (10 Mb by default), it's ignored otherwise.
dfPerson.as("p").join(
dfWorld.select(
$"GroupId", $"PersonId",
$"<feature1YouNeed>", $"<feature2YouNeed>"
).as("w"),
Seq("GroupId", "PersonId")
)
This should give you feature you're up to
NB: Replace < feature1YouNeed > and < feature2YouNeed > with actual column names.

Groupby regular expression Spark Scala

Let us suppose I have a dataframe that looks like this:
val df2 = Seq({"A:job_1, B:whatever1"}, {"A:job_1, B:whatever2"} , {"A:job_2, B:whatever3"}).toDF("values")
df2.show()
How can I group it by a regular expression like "job_" and then take the first element to end in something like :
|A:job_1, B:whatever1|
|A:job_2, B:whatever3|
Thank a lot and kind regards
You should probably just create a new column with regexp_extract and drop this column !
import org.apache.spark.sql.{functions => F}
df2.
withColumn("A", F.regexp_extract($"values", "job_[0-9]+", 0)). // Extract the key of the groupBy
groupBy("A").
agg(F.first("values").as("first value")). // Get the first value
drop("A").
show()
Here is the catalyst if you wish to go further in the comprehension !
As you can see in the optimised logical plan, the two following are stricly equivalent :
creating explicitly a new column with : .withColumn("A", F.regexp_extract($"values", "job_[0-9]+", 0))
grouping by a new column with : .groupBy(F.regexp_extract($"values", "job_[0-9]+", 0).alias("A"))
Here is the catalyst plan :
== Parsed Logical Plan ==
'Aggregate [A#198], [A#198, first('values, false) AS first value#206]
+- Project [values#3, regexp_extract(values#3, job_[0-9]+, 0) AS A#198]
+- Project [value#1 AS values#3]
+- LocalRelation [value#1]
== Analyzed Logical Plan ==
A: string, first value: string
Aggregate [A#198], [A#198, first(values#3, false) AS first value#206]
+- Project [values#3, regexp_extract(values#3, job_[0-9]+, 0) AS A#198]
+- Project [value#1 AS values#3]
+- LocalRelation [value#1]
== Optimized Logical Plan ==
Aggregate [A#198], [A#198, first(values#3, false) AS first value#206]
+- LocalRelation [values#3, A#198]
Transform your data to a Seq with two columns and operate it:
val aux = Seq({"A:job_1, B:whatever1"}, {"A:job_1, B:whatever2"} , {"A:job_2, B:whatever3"})
.map(x=>(x.split(",")(0).replace("A:","")
,x.split(",")(1).replace("B:","")))
.toDF("A","B")
.groupBy("A")
I removed A: and B:, but it is not necesary.
Or you can try:
df2.withColumn("A",col("value").substr(4,8))
.groupBy("A")

Why does Spark fail with "Detected cartesian product for INNER join between logical plans"?

I am using Spark 2.1.0.
When I execute the following code I'm getting an error from Spark. Why? How to fix it?
val i1 = Seq(("a", "string"), ("another", "string"), ("last", "one")).toDF("a", "b")
val i2 = Seq(("one", "string"), ("two", "strings")).toDF("a", "b")
val i1Idx = i1.withColumn("sourceId", lit(1))
val i2Idx = i2.withColumn("sourceId", lit(2))
val input = i1Idx.union(i2Idx)
val weights = Seq((1, 0.6), (2, 0.4)).toDF("sourceId", "weight")
weights.join(input, "sourceId").show
Error:
scala> weights.join(input, "sourceId").show
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
Project [_1#34 AS sourceId#39, _2#35 AS weight#40]
+- Filter (((1 <=> _1#34) || (2 <=> _1#34)) && (_1#34 = 1))
+- LocalRelation [_1#34, _2#35]
and
Union
:- Project [_1#0 AS a#5, _2#1 AS b#6]
: +- LocalRelation [_1#0, _2#1]
+- Project [_1#10 AS a#15, _2#11 AS b#16]
+- LocalRelation [_1#10, _2#11]
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$19.applyOrElse(Optimizer.scala:1011)
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$19.applyOrElse(Optimizer.scala:1008)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:287)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:293)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:293)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:277)
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts.apply(Optimizer.scala:1008)
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts.apply(Optimizer.scala:993)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:73)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:73)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2791)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
at org.apache.spark.sql.Dataset.show(Dataset.scala:636)
at org.apache.spark.sql.Dataset.show(Dataset.scala:595)
at org.apache.spark.sql.Dataset.show(Dataset.scala:604)
... 48 elided
You can triggers inner join after turning on the flag
spark.conf.set("spark.sql.crossJoin.enabled", "true")
You also could also use the cross join.
weights.crossJoin(input)
or set the Alias as
weights.join(input, input("sourceId")===weights("sourceId"), "cross")
You can find more about the issue SPARK-6459 which is said to be fixed in 2.1.1
As you have already used 2.1.1 the issue should have been fixed.
tl;dr Upgrade to Spark 2.1.1. It's an issue in Spark that was fixed.
(I really wished I could also show you the exact change that fixed that in 2.1.1)
For me:
Dataset<Row> ds1 = sparkSession.read().load("/tmp/data");
Dataset<Row> ds2 = ds1;
ds1.join(ds2, ds1.col("name").equalTo(ds2.col("name"))) // got "Detected cartesian product for INNER join between logical plans"
Dataset<Row> ds1 = sparkSession.read().load("/tmp/data");
Dataset<Row> ds2 = sparkSession.read().load("/tmp/data");
ds1.join(ds2, ds1.col("name").equalTo(ds2.col("name"))) // running properly without errors
I'm using Spark 2.1.0.
Got this error in SPARK version:
2.3.0.cloudera3
Solved by aliasing the dataframes.
e.g. re-assigning the failing dataframe to another dataframe and aliasing the name to that other dataframe.
val dataFrame = inDataFrame.alias("dataFrame")
Hope this helps.
you can use this on top of the command
SET spark.sql.crossJoin.enabled = TRUE;
If your query is complex, may be re structure the query to get better result
Just enable joins at the beginning of your code like this:
spark.conf.set("spark.sql.leftJoin.enabled", "true")
spark.conf.set("spark.sql.crossJoin.enabled", "true")
spark.conf.set("spark.sql.leftOuterJoin.enabled", "true")
add the join name which you are using. It worked for me.

Ensuring narrow dependency in Spark job when grouping on pre-partitioned data

I have a huge Spark Dataset with columns A, B, C, D, E. Question is if I initially repartition on column A, and subsequently do two 'within-partition' groupBy operations:
**groupBy("A", "C")**....map(....).**groupBy("A", "E")**....map(....)
is Spark 2.0 clever enough to by-pass shuffling since both groupBy operations are 'within-partition' with respect to the parent stage - i.e. column A is included in both groupBy column specs? If not, what can I do to ensure a narrow dependency throughout the chain of operations?
Spark indeed supports optimization like this. You can check that by analyzing execution plan:
val df = Seq(("a", 1, 2)).toDF("a", "b", "c")
df.groupBy("a").max().groupBy("a", "max(b)").sum().explain
== Physical Plan ==
*HashAggregate(keys=[a#42, max(b)#92], functions=[sum(cast(max(b)#92 as bigint)), sum(cast(max(c)#93 as bigint))])
+- *HashAggregate(keys=[a#42, max(b)#92], functions=[partial_sum(cast(max(b)#92 as bigint)), partial_sum(cast(max(c)#93 as bigint))])
+- *HashAggregate(keys=[a#42], functions=[max(b#43), max(c#44)])
+- Exchange hashpartitioning(a#42, 200)
+- *HashAggregate(keys=[a#42], functions=[partial_max(b#43), partial_max(c#44)])
+- LocalTableScan [a#42, b#43, c#44]
As you can see there is only one exchange but two hash aggregates.