Spark withColumn and where execution order - scala

I have a Spark query which reads a lot of parquet data from S3, filters it, and adds a column computed as regexp_extract(input_file_name, ...) which I assume is a relatively heavy operation (if applied before filtering rather than after it).
The whole query looks like this:
val df = spark
.read
.option("mergeSchema", "true")
.parquet("s3://bucket/path/date=2020-01-1{5,6}/clientType=EXTENSION_CHROME/type={ACCEPT,IGNORE*}/")
.where(...)
.withColumn("type", regexp_extract(input_file_name, "type=([^/]+)", 1))
.repartition(300)
.cache()
df.count()
Is withColumn executed after where or before where? Does it depend on the order in which I write them? What if my where statement used a column added by withColumn?

The withColumn and filter execute in the order they are called. The plan explains it. Please read the plan bottom up.
val employees = spark.createDataFrame(Seq(("E1",100.0), ("E2",200.0),("E3",300.0))).toDF("employee","salary")
employees.withColumn("column1", when(col("salary") > 200, lit("rich")).otherwise("poor")).filter(col("column1")==="poor").explain(true)
Plan - project happened 1st then filter.
== Parsed Logical Plan ==
'Filter ('column1 = poor)
+- Project [employee#4, salary#5, CASE WHEN (salary#5 > cast(200 as double)) THEN rich ELSE poor END AS column1#8]
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
+- LocalRelation [_1#0, _2#1]
== Analyzed Logical Plan ==
employee: string, salary: double, column1: string
Filter (column1#8 = poor)
+- Project [employee#4, salary#5, CASE WHEN (salary#5 > cast(200 as double)) THEN rich ELSE poor END AS column1#8]
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
+- LocalRelation [_1#0, _2#1]
Code 1st filters then adds new column
employees.filter(col("employee")==="E1").withColumn("column1", when(col("salary") > 200, lit("rich")).otherwise("poor")).explain(true)
Plan - 1st filters then projects
== Parsed Logical Plan ==
'Project [employee#4, salary#5, CASE WHEN ('salary > 200) THEN rich ELSE poor END AS column1#13]
+- Filter (employee#4 = E1)
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
+- LocalRelation [_1#0, _2#1]
== Analyzed Logical Plan ==
employee: string, salary: double, column1: string
Project [employee#4, salary#5, CASE WHEN (salary#5 > cast(200 as double)) THEN rich ELSE poor END AS column1#13]
+- Filter (employee#4 = E1)
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
+- LocalRelation [_1#0, _2#1]
Another evidence - it gives error when filter is called on a column before adding it (obviously)
employees.filter(col("column1")==="poor").withColumn("column1", when(col("salary") > 200, lit("rich")).otherwise("poor")).show()
org.apache.spark.sql.AnalysisException: cannot resolve '`column1`' given input columns: [employee, salary];;
'Filter ('column1 = poor)
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
+- LocalRelation [_1#0, _2#1]

Related

Aggregation after sort(), persist() and limit() in Spark

I'm trying to get the sum of a column of the top n rows in a persisted DataFrame. For some reason, the following doesn't work:
val df = df0.sort(col("colB").desc).persist()
df.limit(2).agg(sum("colB")).show()
It shows a random number which is clearly less than the sum of the top two. The number changes from run-to-run.
Calling show() on the limit()'ed DF does consistently show the correct top two values:
df.limit(2).show()
It is as if sort() doesn't apply before the aggregation. Is this a bug in Spark? I suppose it's kind of expected that persist() loses the sorting, but why does it work with show() and should this be documented somewhere?
See the query plans below. agg results in an exchange (4th line in physical plan) which removes the sorting, whereas show does not result in any exchange, so sorting is maintained.
scala> df.limit(2).agg(sum("colB")).explain()
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[sum(cast(colB#4 as bigint))])
+- *(2) HashAggregate(keys=[], functions=[partial_sum(cast(colB#4 as bigint))])
+- *(2) GlobalLimit 2
+- Exchange SinglePartition, true, [id=#95]
+- *(1) LocalLimit 2
+- *(1) ColumnarToRow
+- InMemoryTableScan [colB#4]
+- InMemoryRelation [colB#4], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Sort [colB#4 DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(colB#4 DESC NULLS LAST, 200), true, [id=#7]
+- LocalTableScan [colB#4]
scala> df.limit(2).explain()
== Physical Plan ==
CollectLimit 2
+- *(1) ColumnarToRow
+- InMemoryTableScan [colB#4]
+- InMemoryRelation [colB#4], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Sort [colB#4 DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(colB#4 DESC NULLS LAST, 200), true, [id=#7]
+- LocalTableScan [colB#4]
But if you persist the limited dataframe, there won't be any exchange for the aggregation, so that might do the trick:
val df1 = df.limit(2).persist()
scala> df1.agg(sum("colB")).explain()
== Physical Plan ==
*(1) HashAggregate(keys=[], functions=[sum(cast(colB#4 as bigint))])
+- *(1) HashAggregate(keys=[], functions=[partial_sum(cast(colB#4 as bigint))])
+- *(1) ColumnarToRow
+- InMemoryTableScan [colB#4]
+- InMemoryRelation [colB#4], StorageLevel(disk, memory, deserialized, 1 replicas)
+- CollectLimit 2
+- *(1) ColumnarToRow
+- InMemoryTableScan [colB#4]
+- InMemoryRelation [colB#4], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Sort [colB#4 DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(colB#4 DESC NULLS LAST, 200), true, [id=#7]
+- LocalTableScan [colB#4]
In any case, it's best to use window functions to assign row numbers and sum the rows if their row number meets a certain condition (e.g. row_number <= 2). This will result in a deterministic outcome. For example,
df0.withColumn(
"rn",
row_number().over(Window.orderBy($"colB".desc))
).filter("rn <= 2").agg(sum("colB"))

Scala spark: Sum all columns across all rows

I can do this quite easily with
df.groupBy().sum()
But I'm not sure if the groupBy() doesn't add additional performance impacts, or is just bad style. I've seen it done with
df.agg( ("col1", "sum"), ("col2", "sum"), ("col3", "sum"))
Which skips the (I think unnecessary groupBy), but has its own uglyness. What's the correct way to do this? Is there any under-the-hood difference between using .groupBy().<aggOp>() and using .agg?
If you check the Physical plan for both queries spark internally calls same plan so we can use either of them!
I think using df.groupBy().sum() will be handy as we don't need to specify all column names.
Example:
val df=Seq((1,2,3),(4,5,6)).toDF("id","j","k")
scala> df.groupBy().sum().explain
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[sum(cast(id#7 as bigint)), sum(cast(j#8 as bigint)), sum(cast(k#9 as bigint))])
+- Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_sum(cast(id#7 as bigint)), partial_sum(cast(j#8 as bigint)), partial_sum(cast(k#9 as bigint))])
+- LocalTableScan [id#7, j#8, k#9]
scala> df.agg(sum("id"),sum("j"),sum("k")).explain
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[sum(cast(id#7 as bigint)), sum(cast(j#8 as bigint)), sum(cast(k#9 as bigint))])
+- Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_sum(cast(id#7 as bigint)), partial_sum(cast(j#8 as bigint)), partial_sum(cast(k#9 as bigint))])
+- LocalTableScan [id#7, j#8, k#9]

Groupby regular expression Spark Scala

Let us suppose I have a dataframe that looks like this:
val df2 = Seq({"A:job_1, B:whatever1"}, {"A:job_1, B:whatever2"} , {"A:job_2, B:whatever3"}).toDF("values")
df2.show()
How can I group it by a regular expression like "job_" and then take the first element to end in something like :
|A:job_1, B:whatever1|
|A:job_2, B:whatever3|
Thank a lot and kind regards
You should probably just create a new column with regexp_extract and drop this column !
import org.apache.spark.sql.{functions => F}
df2.
withColumn("A", F.regexp_extract($"values", "job_[0-9]+", 0)). // Extract the key of the groupBy
groupBy("A").
agg(F.first("values").as("first value")). // Get the first value
drop("A").
show()
Here is the catalyst if you wish to go further in the comprehension !
As you can see in the optimised logical plan, the two following are stricly equivalent :
creating explicitly a new column with : .withColumn("A", F.regexp_extract($"values", "job_[0-9]+", 0))
grouping by a new column with : .groupBy(F.regexp_extract($"values", "job_[0-9]+", 0).alias("A"))
Here is the catalyst plan :
== Parsed Logical Plan ==
'Aggregate [A#198], [A#198, first('values, false) AS first value#206]
+- Project [values#3, regexp_extract(values#3, job_[0-9]+, 0) AS A#198]
+- Project [value#1 AS values#3]
+- LocalRelation [value#1]
== Analyzed Logical Plan ==
A: string, first value: string
Aggregate [A#198], [A#198, first(values#3, false) AS first value#206]
+- Project [values#3, regexp_extract(values#3, job_[0-9]+, 0) AS A#198]
+- Project [value#1 AS values#3]
+- LocalRelation [value#1]
== Optimized Logical Plan ==
Aggregate [A#198], [A#198, first(values#3, false) AS first value#206]
+- LocalRelation [values#3, A#198]
Transform your data to a Seq with two columns and operate it:
val aux = Seq({"A:job_1, B:whatever1"}, {"A:job_1, B:whatever2"} , {"A:job_2, B:whatever3"})
.map(x=>(x.split(",")(0).replace("A:","")
,x.split(",")(1).replace("B:","")))
.toDF("A","B")
.groupBy("A")
I removed A: and B:, but it is not necesary.
Or you can try:
df2.withColumn("A",col("value").substr(4,8))
.groupBy("A")

Apache Spark - Does dataset.dropDuplicates() preserve partitioning?

I know that there exist several transformations which preserve parent partitioning (if it was set before - e.g. mapValues) and some which do not preserve it (e.g. map).
I use Dataset API of Spark 2.2. My question is - does dropDuplicates transformation preserve partitioning? Imagine this code:
case class Item(one: Int, two: Int, three: Int)
import session.implicits._
val ds = session.createDataset(List(Item(1,2,3), Item(1,2,3)))
val repart = ds.repartition('one, 'two).cache()
repart.dropDuplicates(List("one", "two")) // will be partitioning preserved?
generally, dropDuplicates does a shuffle (and thus not preserve partitioning), but in your special case it does NOT do an additional shuffle because you have already partitioned the dataset in a suitable form which is taken into account by the optimizer:
repart.dropDuplicates(List("one","two")).explain()
== Physical Plan ==
*HashAggregate(keys=[one#3, two#4, three#5], functions=[])
+- *HashAggregate(keys=[one#3, two#4, three#5], functions=[])
+- InMemoryTableScan [one#3, two#4, three#5]
+- InMemoryRelation [one#3, two#4, three#5], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- Exchange hashpartitioning(one#3, two#4, 200)
+- LocalTableScan [one#3, two#4, three#5]
the keyword to look for here is : Exchange
But consider the following code where you first repartition the dataset using plain repartition():
val repart = ds.repartition(200).cache()
repart.dropDuplicates(List("one","two")).explain()
This will indeed trigger an additional shuffle ( now you have 2 Exchange steps):
== Physical Plan ==
*HashAggregate(keys=[one#3, two#4], functions=[first(three#5, false)])
+- Exchange hashpartitioning(one#3, two#4, 200)
+- *HashAggregate(keys=[one#3, two#4], functions=[partial_first(three#5, false)])
+- InMemoryTableScan [one#3, two#4, three#5]
+- InMemoryRelation [one#3, two#4, three#5], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- Exchange RoundRobinPartitioning(200)
+- LocalTableScan [one#3, two#4, three#5]
NOTE: I checked that with Spark 2.1, it may be different in Spark 2.2 because the optimizer changed in Spark 2.2 (Cost-Based Optimizer)
No, dropDuplicates doesn't preserve partitions since it has a shuffle boundary, which doesn't guarantee order.
dropDuplicates is approximately:
ds.groupBy(columnId).agg(/* take first column from any available partition */)

Unable to convert rdd to Dataframe using Case Class

I am trying to convert rdd into DataFrame using Case Class as follows
1.)Fetching Data from textfile having "id,name,country" saperated by "," but without header
val x = sc.textFile("file:///home/hdadmin/records.txt")
2.)Creating a case class "rec" with header definition as below:
case class rec(id:Int, name:String, country:String)
3.) Now I define the transformations
val y = x.map(x=>x.split(",")).map(x=>rec(x(0).toInt,x(1),x(2)))
4.) Then I imported the implicits library
import spark.implicits._
5.) Converting rdd to data Frame using toDF method:
val z = y.toDF()
6.) Now when I try to fetch the records with command below:
z.select("name").show()
I get the following error:
17/05/19 12:50:14 ERROR LiveListenerBus: SparkListenerBus has already
stopped! Dropping event SparkListenerSQLExecutionStart(9,show at
:49,org.apache.spark.sql.Dataset.show(Dataset.scala:495)
$line105.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:49)
$line105.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:54)
$line105.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:56)
$line105.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:58)
$line105.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:60)
$line105.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:62)
$line105.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:64) $line105.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:66)
$line105.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:68)
$line105.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:70)
$line105.$read$$iw$$iw$$iw$$iw$$iw$$iw.(:72)
$line105.$read$$iw$$iw$$iw$$iw$$iw.(:74)
$line105.$read$$iw$$iw$$iw$$iw.(:76)
$line105.$read$$iw$$iw$$iw.(:78)
$line105.$read$$iw$$iw.(:80)
$line105.$read$$iw.(:82)
$line105.$read.(:84)
$line105.$read$.(:88)
$line105.$read$.(),== Parsed Logical Plan ==
GlobalLimit 21
+- LocalLimit 21 +- Project [name#91]
+- LogicalRDD [id#90, name#91, country#92]
== Analyzed Logical Plan == name: string GlobalLimit 21
+- LocalLimit 21 +- Project [name#91]
+- LogicalRDD [id#90, name#91, country#92]
== Optimized Logical Plan == GlobalLimit 21
+- LocalLimit 21 +- Project [name#91]
+- LogicalRDD [id#90, name#91, country#92]
== Physical Plan == CollectLimit 21
+- *Project [name#91] +- Scan ExistingRDD[id#90,name#91,country#92],org.apache.spark.sql.execution.SparkPlanInfo#b807ee,1495223414636)
17/05/19 12:50:14 ERROR LiveListenerBus: SparkListenerBus has already
stopped! Dropping event SparkListenerSQLExecutionEnd(9,1495223414734)
java.lang.IllegalStateException: SparkContext has been shutdown at
org.apache.spark.SparkContext.runJob(SparkContext.scala:1863) at
org.apache.spark.SparkContext.runJob(SparkContext.scala:1884) at
org.apache.spark.SparkContext.runJob(SparkContext.scala:1897) at
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347)
at
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
at
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
at
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
at
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2189)
at
org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1925)
at
org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1924)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2562)
at org.apache.spark.sql.Dataset.head(Dataset.scala:1924) at
org.apache.spark.sql.Dataset.take(Dataset.scala:2139) at
org.apache.spark.sql.Dataset.showString(Dataset.scala:239) at
org.apache.spark.sql.Dataset.show(Dataset.scala:526) at
org.apache.spark.sql.Dataset.show(Dataset.scala:486) at
org.apache.spark.sql.Dataset.show(Dataset.scala:495) ... 56 elided
Where could be the problem?
After trying the same code for a couple of text files I actually rectified the text format in the text file for any discrepency.
The Column Separator in below code is "," and it was missing at 1 place inside the text file after I scanned it minutely.
val y = x.map(x=>x.split(",")).map(x=>rec(x(0).toInt,x(1),x(2)))
The code worked fine and gave me results in Structured table format after the changes.
Therefore its important to note that the separator(",", "\t", "|") given inside
x.split("")
should be same as in source file and throughout the source file.