Groupby regular expression Spark Scala

Let us suppose I have a dataframe that looks like this:
val df2 = Seq({"A:job_1, B:whatever1"}, {"A:job_1, B:whatever2"} , {"A:job_2, B:whatever3"}).toDF("values")
df2.show()
How can I group it by a regular expression like "job_" and then take the first element of each group, to end up with something like:
|A:job_1, B:whatever1|
|A:job_2, B:whatever3|
Thanks a lot and kind regards

You should probably just create a new column with regexp_extract, group by it, and then drop it:
import org.apache.spark.sql.{functions => F}

df2.
  withColumn("A", F.regexp_extract($"values", "job_[0-9]+", 0)). // Extract the key of the groupBy
  groupBy("A").
  agg(F.first("values").as("first value")). // Get the first value
  drop("A").
  show()
Here is what Catalyst does with it, if you want to go a bit further.
As you can see in the optimized logical plan below, the following two are strictly equivalent:
explicitly creating a new column with: .withColumn("A", F.regexp_extract($"values", "job_[0-9]+", 0))
grouping directly by the expression with: .groupBy(F.regexp_extract($"values", "job_[0-9]+", 0).alias("A"))
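Written out in full, the second form looks like this (a minimal sketch, reusing df2 and the F import from above):
df2
  .groupBy(F.regexp_extract($"values", "job_[0-9]+", 0).alias("A")) // group directly on the expression
  .agg(F.first("values").as("first value"))
  .drop("A")
  .show()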
Here is the Catalyst plan:
== Parsed Logical Plan ==
'Aggregate [A#198], [A#198, first('values, false) AS first value#206]
+- Project [values#3, regexp_extract(values#3, job_[0-9]+, 0) AS A#198]
   +- Project [value#1 AS values#3]
      +- LocalRelation [value#1]

== Analyzed Logical Plan ==
A: string, first value: string
Aggregate [A#198], [A#198, first(values#3, false) AS first value#206]
+- Project [values#3, regexp_extract(values#3, job_[0-9]+, 0) AS A#198]
   +- Project [value#1 AS values#3]
      +- LocalRelation [value#1]

== Optimized Logical Plan ==
Aggregate [A#198], [A#198, first(values#3, false) AS first value#206]
+- LocalRelation [values#3, A#198]

Transform your data into a Seq, split each element into two columns, and operate on it:
val aux = Seq({"A:job_1, B:whatever1"}, {"A:job_1, B:whatever2"}, {"A:job_2, B:whatever3"})
  .map(x => (x.split(",")(0).replace("A:", ""), x.split(",")(1).replace("B:", "")))
  .toDF("A", "B")
  .groupBy("A")
I removed the A: and B: prefixes, but it is not necessary.
Or you can try:
df2.withColumn("A",col("value").substr(4,8))
.groupBy("A")
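Note that groupBy on its own only returns a RelationalGroupedDataset; to reproduce the expected output you still need an aggregation. A minimal sketch completing the substr variant above (the positions assume the fixed-width "A:job_N" prefix from the question):
import org.apache.spark.sql.functions.{col, first}

df2.withColumn("A", col("values").substr(3, 5)) // "job_1", "job_2", ...
  .groupBy("A")
  .agg(first("values").as("first value"))
  .drop("A")
  .show()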

Related

Spark is pushing down a filter even when the column is not in the dataframe

I have a DataFrame with the columns:
field1, field1_name, field3, field5, field4, field2, field6
I am selecting it so that I only keep field1, field2, field3, field4. Note that there is no field5 after the select.
After that, I have a filter that uses field5 and I would expect it to throw an analysis error since the column is not there, but instead it is filtering the original DataFrame (before the select) because it is pushing down the filter, as shown here:
== Parsed Logical Plan ==
'Filter ('field5 = 22)
+- Project [field1#43, field2#48, field3#45, field4#47]
   +- Relation[field1#43,field1_name#44,field3#45,field5#46,field4#47,field2#48,field6#49] csv

== Analyzed Logical Plan ==
field1: string, field2: string, field3: string, field4: string
Project [field1#43, field2#48, field3#45, field4#47]
+- Filter (field5#46 = 22)
   +- Project [field1#43, field2#48, field3#45, field4#47, field5#46]
      +- Relation[field1#43,field1_name#44,field3#45,field5#46,field4#47,field2#48,field6#49] csv

== Optimized Logical Plan ==
Project [field1#43, field2#48, field3#45, field4#47]
+- Filter (isnotnull(field5#46) && (field5#46 = 22))
   +- Relation[field1#43,field1_name#44,field3#45,field5#46,field4#47,field2#48,field6#49] csv

== Physical Plan ==
*Project [field1#43, field2#48, field3#45, field4#47]
+- *Filter (isnotnull(field5#46) && (field5#46 = 22))
   +- *FileScan csv [field1#43,field3#45,field5#46,field4#47,field2#48] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/..., PartitionFilters: [], PushedFilters: [IsNotNull(field5), EqualTo(field5,22)], ReadSchema: struct<field1:string,field3:string,field5:string,field4:stri...
As you can see the physical plan has the filter before the project... Is this the expected behaviour? I would expect an analysis exception instead...
A reproducible example of the issue:
val df = Seq(
  ("", "", "")
).toDF("field1", "field2", "field3")
val selected = df.select("field1", "field2")
val shouldFail = selected.filter("field3 == 'dummy'") // I was expecting this filter to fail
shouldFail.show()
Output:
+------+------+
|field1|field2|
+------+------+
+------+------+
The documentation on the Dataset/Dataframe describes the reason for what you are observing quite well:
"Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. "
The important part is the last sentence of that quote: when you apply select and filter statements, they are merely added to a logical plan, which Spark only analyzes and optimizes once an action is invoked. At that point the Catalyst optimizer looks at the whole plan, and one of its optimization rules is to push down filters, which is what you see in your example.
I think this is a great feature: even though you are not interested in seeing this particular field in your final DataFrame, Spark understands that you only want a subset of the original data and filters out the rest as early as possible.
That is the main benefit of the Spark SQL engine as opposed to RDDs: it understands what you are trying to do without being told how to do it.
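You can watch the analyzer do exactly this on the reproducible example; a quick sketch (the same code as above, just asking for the plans instead of the result):
val df = Seq(("", "", "")).toDF("field1", "field2", "field3")

df.select("field1", "field2")
  .filter("field3 == 'dummy'")
  .explain(true) // the analyzed plan re-introduces field3 in a Project below the Filter,
                 // then projects it away again, as in the plans shown above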

Why does my column exist in my pyspark dataframe after dropping it?

I am using pyspark version 2.4.5 and Databricks runtime 6.5 and I have run into unexpected behavior. My code is as follows:
import pyspark.sql.functions as F
df_A = spark.table(...)
df_B = df_A.drop(
    F.col("colA")
)

df_C = df_B.filter(
    F.col("colA") > 0
)
When I assign df_C by filtering on df_B I expect an error to be thrown as "colA" has been dropped. But this code works fine when I run it. Is this expected or am I missing something?
Spark constructs a query plan that still makes sense: it applies the drop after the filter. You can see that from the explain plan, e.g.:
from pyspark.sql.functions import col

spark.createDataFrame([('foo','bar')]).drop(col('_2')).filter(col('_2') == 'bar').explain()
Gives:
== Physical Plan ==
*(1) Project [_1#0]
+- *(1) Filter (isnotnull(_2#1) && (_2#1 = bar))
   +- Scan ExistingRDD[_1#0,_2#1]
In the above explain plan, the projection of the dropped column happens after the filter.
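If you want the mistake to surface immediately instead of being silently repaired by the analyzer, reference the dropped column somewhere the analyzer cannot recover it, for example in a select, or simply assert on columns. A minimal sketch in Scala (the behaviour comes from the shared Catalyst analyzer, so the Python API behaves the same way):
val df = Seq(("foo", "bar")).toDF() // columns _1, _2
val dropped = df.drop("_2")

dropped.columns.contains("_2") // false: the column really is gone
// dropped.select("_2")        // would throw AnalysisException: cannot resolve '_2'

// Filter (and Sort) are special-cased: the analyzer re-adds the missing column from
// the child and projects it away again, which is why the original code runs fine.
dropped.filter($"_2" === "bar").explain(true)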

Spark withColumn and where execution order

I have a Spark query which reads a lot of parquet data from S3, filters it, and adds a column computed as regexp_extract(input_file_name, ...) which I assume is a relatively heavy operation (if applied before filtering rather than after it).
The whole query looks like this:
val df = spark
  .read
  .option("mergeSchema", "true")
  .parquet("s3://bucket/path/date=2020-01-1{5,6}/clientType=EXTENSION_CHROME/type={ACCEPT,IGNORE*}/")
  .where(...)
  .withColumn("type", regexp_extract(input_file_name, "type=([^/]+)", 1))
  .repartition(300)
  .cache()

df.count()
Is withColumn executed after where or before where? Does it depend on the order in which I write them? What if my where statement used a column added by withColumn?
withColumn and filter are applied in the order in which they are called, as the plan shows. Read the plans below from the bottom up.
val employees = spark.createDataFrame(Seq(("E1",100.0), ("E2",200.0),("E3",300.0))).toDF("employee","salary")
employees.withColumn("column1", when(col("salary") > 200, lit("rich")).otherwise("poor")).filter(col("column1")==="poor").explain(true)
Plan: the Project happens first, then the Filter.
== Parsed Logical Plan ==
'Filter ('column1 = poor)
+- Project [employee#4, salary#5, CASE WHEN (salary#5 > cast(200 as double)) THEN rich ELSE poor END AS column1#8]
   +- Project [_1#0 AS employee#4, _2#1 AS salary#5]
      +- LocalRelation [_1#0, _2#1]

== Analyzed Logical Plan ==
employee: string, salary: double, column1: string
Filter (column1#8 = poor)
+- Project [employee#4, salary#5, CASE WHEN (salary#5 > cast(200 as double)) THEN rich ELSE poor END AS column1#8]
   +- Project [_1#0 AS employee#4, _2#1 AS salary#5]
      +- LocalRelation [_1#0, _2#1]
Code that filters first and then adds the new column:
employees.filter(col("employee")==="E1").withColumn("column1", when(col("salary") > 200, lit("rich")).otherwise("poor")).explain(true)
Plan: the Filter happens first, then the Project.
== Parsed Logical Plan ==
'Project [employee#4, salary#5, CASE WHEN ('salary > 200) THEN rich ELSE poor END AS column1#13]
+- Filter (employee#4 = E1)
   +- Project [_1#0 AS employee#4, _2#1 AS salary#5]
      +- LocalRelation [_1#0, _2#1]

== Analyzed Logical Plan ==
employee: string, salary: double, column1: string
Project [employee#4, salary#5, CASE WHEN (salary#5 > cast(200 as double)) THEN rich ELSE poor END AS column1#13]
+- Filter (employee#4 = E1)
   +- Project [_1#0 AS employee#4, _2#1 AS salary#5]
      +- LocalRelation [_1#0, _2#1]
Further evidence: it throws an error when the filter references a column before it has been added (obviously):
employees.filter(col("column1")==="poor").withColumn("column1", when(col("salary") > 200, lit("rich")).otherwise("poor")).show()
org.apache.spark.sql.AnalysisException: cannot resolve '`column1`' given input columns: [employee, salary];;
'Filter ('column1 = poor)
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
   +- LocalRelation [_1#0, _2#1]
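Applied back to the original S3 query: because the where is written before the withColumn and only references columns that already exist in the parquet files, the Filter sits below the Project in the plan, so regexp_extract(input_file_name, ...) is only evaluated on rows that survive the filter (and the predicate can typically still be pushed into the scan). A rough sketch, with a hypothetical clientVersion column standing in for the elided where condition:
import org.apache.spark.sql.functions.{input_file_name, regexp_extract}

val df = spark.read
  .option("mergeSchema", "true")
  .parquet("s3://bucket/path/date=2020-01-1{5,6}/clientType=EXTENSION_CHROME/type={ACCEPT,IGNORE*}/")
  .where($"clientVersion" > 42) // hypothetical column that exists in the files
  .withColumn("type", regexp_extract(input_file_name(), "type=([^/]+)", 1))

df.explain(true) // the Filter appears below the Project that adds "type"
Conversely, a where on the type column would have to be called after the withColumn, and the regexp would then be evaluated for every row before any filtering happens.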

How to join 2 dataframes in Spark based on a wildcard/regex condition?

I have 2 dataframes df1 and df2.
Suppose there is a location column in df1 which may contain a regular URL or a URL with a wildcard, e.g.:
stackoverflow.com/questions/*
*.cnn.com
cnn.com/*/politics
The second dataframe, df2, has a url field which contains only valid URLs without wildcards.
I need to join these two dataframes, something like df1.join(df2, $"location" matches $"url"), if there were a magic matches operator for join conditions.
After some googling I still don't see a way to achieve this. How would you approach such a problem?
There exists a "magic" matches operator - it is called rlike:
val df1 = Seq("stackoverflow.com/questions/.*$","^*.cnn.com$", "nn.com/*/politics").toDF("location")
val df2 = Seq("stackoverflow.com/questions/47272330").toDF("url")
df2.join(df1, expr("url rlike location")).show
+--------------------+--------------------+
| url| location|
+--------------------+--------------------+
|stackoverflow.com...|stackoverflow.com...|
+--------------------+--------------------+
However, there are some caveats:
Patterns have to be proper regular expressions, anchored in case of leading / trailing wildcards (see the sketch after these caveats).
It is executed with a Cartesian product (see How can we JOIN two Spark SQL dataframes using a SQL-esque "LIKE" criterion?):
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, Inner, url#217 RLIKE location#211
:- *Project [value#215 AS url#217]
:  +- *Filter isnotnull(value#215)
:     +- LocalTableScan [value#215]
+- BroadcastExchange IdentityBroadcastMode
   +- *Project [value#209 AS location#211]
      +- *Filter isnotnull(value#209)
         +- LocalTableScan [value#209]
It is possible to pre-filter candidates using the method I described in Efficient string matching in Apache Spark.
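To address the first caveat: a hypothetical helper that turns the wildcard patterns from the question into anchored regular expressions before joining (a rough sketch, not a battle-tested implementation):
import java.util.regex.Pattern
import org.apache.spark.sql.functions.{expr, udf}

// Escape the literal parts, replace each * with .*, and anchor the result,
// e.g. "cnn.com/*/politics" becomes "^\Qcnn.com/\E.*\Q/politics\E$".
def wildcardToRegex(pattern: String): String =
  "^" + pattern.split("\\*", -1).map(Pattern.quote).mkString(".*") + "$"

val toRegex = udf(wildcardToRegex _)

val df1Wild = Seq("stackoverflow.com/questions/*", "*.cnn.com", "cnn.com/*/politics").toDF("location")
df2.join(df1Wild.withColumn("location_regex", toRegex($"location")),
         expr("url rlike location_regex")).show()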

Ensuring narrow dependency in Spark job when grouping on pre-partitioned data

I have a huge Spark Dataset with columns A, B, C, D, E. The question is: if I initially repartition on column A and subsequently do two 'within-partition' groupBy operations:
groupBy("A", "C")....map(....).groupBy("A", "E")....map(....)
is Spark 2.0 clever enough to bypass shuffling, since both groupBy operations are 'within-partition' with respect to the parent stage, i.e. column A is included in both groupBy column specs? If not, what can I do to ensure a narrow dependency throughout the chain of operations?
Spark indeed supports optimizations like this. You can check that by analyzing the execution plan:
val df = Seq(("a", 1, 2)).toDF("a", "b", "c")
df.groupBy("a").max().groupBy("a", "max(b)").sum().explain
== Physical Plan ==
*HashAggregate(keys=[a#42, max(b)#92], functions=[sum(cast(max(b)#92 as bigint)), sum(cast(max(c)#93 as bigint))])
+- *HashAggregate(keys=[a#42, max(b)#92], functions=[partial_sum(cast(max(b)#92 as bigint)), partial_sum(cast(max(c)#93 as bigint))])
   +- *HashAggregate(keys=[a#42], functions=[max(b#43), max(c#44)])
      +- Exchange hashpartitioning(a#42, 200)
         +- *HashAggregate(keys=[a#42], functions=[partial_max(b#43), partial_max(c#44)])
            +- LocalTableScan [a#42, b#43, c#44]
As you can see, there is only one Exchange (shuffle) even though there are two aggregations.
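To tie this back to the question: repartitioning by A up front should let subsequent groupBys whose keys contain A reuse that partitioning, since hash partitioning on A already satisfies the clustering requirement of any grouping key set that includes A. A rough sketch on a toy DataFrame; verify with explain() on your Spark version:
import org.apache.spark.sql.functions.sum

val ds = Seq(("a1", "c1", 1), ("a1", "c2", 2), ("a2", "c1", 3)).toDF("A", "C", "value")

ds.repartition($"A")                              // one shuffle, by A
  .groupBy("A", "C").agg(sum("value").as("cSum")) // keys contain A -> no new Exchange expected
  .groupBy("A").agg(sum("cSum"))                  // keys contain A -> no new Exchange expected
  .explain()
Note that the typed map(....) steps in the question may hide the partitioning information from the optimizer, so it is worth checking the plan for extra Exchange operators after each one.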