How to set a dynamic where clause using pyspark

I have a dataset within which there are multiple groups. I have a rank column which incrementally counts each entry per group. An example of this structure is shown below:
+---------+---------+---------+
|equipment|   run_id|run_order|
+---------+---------+---------+
|        1|430032589|        1|
|        1|430332632|        2|
|        1|430563033|        3|
|        1|430785715|        4|
|        1|431368577|        5|
|        1|431672148|        6|
|        2|435497596|        1|
|        1|435522469|        7|
+---------+---------+---------+
Each group (equipment) has a different number of runs. Shown above, equipment 1 has 7 runs whilst equipment 2 has 1 run. I would like to select the first and last n runs per equipment. Selecting the first n runs is straightforward:
df.select("equipment", "run_id").distinct().where(df.run_order <= n).orderBy("equipment").show()
The distinct is in the query because each row is equivalent to a timestep, and each row logs the sensor readings associated with that timestep. There will therefore be many rows with the same equipment, run_id and run_order, which should be preserved in the end result and not aggregated.
As the number of runs is unique to each equipment, I can't (I think) write an equivalent select query with a where clause to get the last n runs:
df.select("equipment", "run_id").distinct().where(df.rank >= total_runs - n).orderBy("equipment").show()
I can run a groupBy to get the highest run_order for each equipment.
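The aggregation is something along these lines (a sketch of the call that produces the summary below):
from pyspark.sql import functions as F

df.groupBy("equipment").agg(F.max("run_order")).orderBy("equipment").show()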
+---------+--------------+
|equipment|max(run_order)|
+---------+--------------+
|        1|             7|
|        2|             1|
+---------+--------------+
But I am unsure whether there is a way to construct a dynamic where clause that works like this, so that I get the last n runs (including all timestep data for each run).

You can add a column with the maximum run_order for each equipment and filter based on that column:
from pyspark.sql import functions as F, Window
n = 3
df2 = df.withColumn(
    'max_run',
    F.max('run_order').over(Window.partitionBy('equipment'))
).where(F.col('run_order') > F.col('max_run') - n)  # strict >, so exactly the last n runs are kept
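If you also want the first n runs in the same result, both conditions can go into one filter. A minimal sketch reusing the same window (first_and_last is just an illustrative name); as above, every timestep row belonging to a selected run is kept:
from pyspark.sql import functions as F, Window

n = 3
w = Window.partitionBy('equipment')
first_and_last = df.withColumn(
    'max_run', F.max('run_order').over(w)
).where(
    (F.col('run_order') <= n) | (F.col('run_order') > F.col('max_run') - n)
).drop('max_run')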

Related

Scala/Spark; Add column to DataFrame that increments by 1 when a value is repeated in another column

I have a dataframe called rawEDI that looks something like this:
+-----------+-------+
|Line_number|Segment|
+-----------+-------+
|          1|     ST|
|          2|    BPT|
|          3|     SE|
|          4|     ST|
|          5|    BPT|
|          6|     N1|
|          7|     SE|
|          8|     ST|
|          9|    PTD|
|         10|     SE|
+-----------+-------+
Each row represents a line in a file. Each line is called a segment and is denoted by a segment identifier, a short string. Segments are grouped together in chunks that start with an ST segment identifier and end with an SE segment identifier. There can be any number of ST chunks in a given file, and the size of each ST chunk is not fixed.
I want to create a new column on the dataframe that represents numerically what ST group a given segment belongs to. This will allow me to use groupBy to perform aggregate operations across all ST segments without having to loop over each individual ST segment, which is too slow.
The final DataFrame would look like this:
+-----------+-------+--------+
|Line_number|Segment|ST_Group|
+-----------+-------+--------+
|          1|     ST|       1|
|          2|    BPT|       1|
|          3|     SE|       1|
|          4|     ST|       2|
|          5|    BPT|       2|
|          6|     N1|       2|
|          7|     SE|       2|
|          8|     ST|       3|
|          9|    PTD|       3|
|         10|     SE|       3|
+-----------+-------+--------+
In short, I want to create and populate a DataFrame column with a number that increments by one whenever the value "ST" appears in the Segment column.
I am using spark 2.3.2 and scala 2.11.8
My initial thought was to use iteration. I collected another DataFrame, df, that contained the starting and ending line_number for each ST chunk, looking like this:
+-----+---+
|Start|End|
+-----+---+
|    1|  3|
|    4|  7|
|    8| 10|
+-----+---+
Then I iterate over the rows of that dataframe and use them to populate the new column, like this:
var st = 1
for (row <- df.collect()) {
  val start = row(0)
  val end = row(1)
  var labelSTs = rawEDI.filter("line_number > = ${start}").filter("line_number <= ${end}").withColumn("ST_Group", lit(st))
  st = st + 1
}
However, this yields an empty DataFrame. Additionally, the use of a for loop is time-prohibitive, taking over 20s on my machine for this. Achieving this result without the use of a loop would be huge, but a solution with a loop may also be acceptable if performant.
I have a hunch this can be accomplished using a udf or a Window, but I'm not certain how to attack that.
This
val func = udf((s:String) => if(s == "ST") 1 else 0)
var labelSTs = rawEDI.withColumn("ST_Group", func(col("segment")))
Only populates the column with 1 at each ST segment start.
And this
val w = Window.partitionBy("Segment").orderBy("line_number")
val labelSTs = rawEDI.withColumn("ST_Group", row_number().over(w))
Returns a nonsense dataframe.
One way is to create an intermediate dataframe of "groups" that would tell you on which line each group starts and ends (sort of what you've already done), and then join it to the original table using greater-than/less-than conditions.
Sample data
scala> val input = Seq((1,"ST"),(2,"BPT"),(3,"SE"),(4,"ST"),(5,"BPT"),
(6,"N1"),(7,"SE"),(8,"ST"),(9,"PTD"),(10,"SE"))
.toDF("linenumber","segment")
scala> input.show(false)
+----------+-------+
|linenumber|segment|
+----------+-------+
|1 |ST |
|2 |BPT |
|3 |SE |
|4 |ST |
|5 |BPT |
|6 |N1 |
|7 |SE |
|8 |ST |
|9 |PTD |
|10 |SE |
+----------+-------+
Create a dataframe for groups, using Window just as your hunch was telling you:
scala> val groups = input.where("segment='ST'")
.withColumn("endline",lead("linenumber",1) over Window.orderBy("linenumber"))
.withColumn("groupnumber",row_number() over Window.orderBy("linenumber"))
.withColumnRenamed("linenumber","startline")
.drop("segment")
scala> groups.show(false)
+---------+-----------+-------+
|startline|groupnumber|endline|
+---------+-----------+-------+
|1 |1 |4 |
|4 |2 |8 |
|8 |3 |null |
+---------+-----------+-------+
Join both to get the result
scala> input.join(groups,
input("linenumber") >= groups("startline") &&
(input("linenumber") < groups("endline") || groups("endline").isNull))
.select("linenumber","segment","groupnumber")
.show(false)
+----------+-------+-----------+
|linenumber|segment|groupnumber|
+----------+-------+-----------+
|1 |ST |1 |
|2 |BPT |1 |
|3 |SE |1 |
|4 |ST |2 |
|5 |BPT |2 |
|6 |N1 |2 |
|7 |SE |2 |
|8 |ST |3 |
|9 |PTD |3 |
|10 |SE |3 |
+----------+-------+-----------+
The only problem with this is Window.orderBy() on an unpartitioned dataframe, which would collect all data to a single partition and thus could be a killer.
If you just want to add a column with a number that increments by one whenever the value "ST" appears in the Segment column, you can filter the lines with the ST segment into a separate dataframe:
var labelSTs = rawEDI.filter("Segment == 'ST'");
// then group by ST and collect to list the linenumbers
var groupedDf = labelSTs.groupBy("Segment").agg(collect_list("Line_number").alias("Line_numbers"))
// now flatten the data frame back and keep the line number index
var flattenedDf = groupedDf.select($"Segment", explode($"Line_numbers").as("Line_number"))
// log the line_number index in your target column ST_Group
val withIndexDF = flattenedDf.withColumn("ST_Group", row_number().over(Window.partitionBy($"Segment").orderBy($"Line_number")))
and you get this as the result:
+-------+-----------+--------+
|Segment|Line_number|ST_Group|
+-------+-----------+--------+
|     ST|          1|       1|
|     ST|          4|       2|
|     ST|          8|       3|
+-------+-----------+--------+
Then you combine this with the other segments in the initial dataframe.
Found a simpler way: add a column which will have 1 when the segment column value is ST and 0 otherwise, then use a Window function to take the cumulative sum of that new column. This will give you the desired result.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val rawEDI=Seq((1,"ST"),(2,"BPT"),(3,"SE"),(4,"ST"),(5,"BPT"),(6,"N1"),(7,"SE"),(8,"ST"),(9,"PTD"),(10,"SE")).toDF("line_number","segment")
val newDf=rawEDI.withColumn("ST_Group", ($"segment" === "ST").cast("bigint"))
val windowSpec = Window.orderBy("line_number")
newDf.withColumn("ST_Group", sum("ST_Group").over(windowSpec))
.show
+-----------+-------+--------+
|line_number|segment|ST_Group|
+-----------+-------+--------+
| 1| ST| 1|
| 2| BPT| 1|
| 3| SE| 1|
| 4| ST| 2|
| 5| BPT| 2|
| 6| N1| 2|
| 7| SE| 2|
| 8| ST| 3|
| 9| PTD| 3|
| 10| SE| 3|
+-----------+-------+--------+

PySpark Get Count of True Values in Boolean Columns by Row

Please let me know if this is a duplicate. I've been searching all over to try to figure out how to do this without writing a user-defined function. I have a bunch of boolean columns, each a different quality assurance flag, in a PySpark data frame. All I need to do is create a new column with the number of these columns that have a True value, i.e. the count of QA checks each row is failing. However, I cannot, for the life of me, figure out an efficient way of doing this. Any ideas, references or links are greatly appreciated!
For instance, for one record with the above columns with the following values...
...I want to create a new column with a count of 2.
Have any good ideas?
Two methods come to mind without using user-defined functions.
I'm assuming you have a python list with the boolean column names.
qa_tests = ['qa_flg_xy_equal', 'qa_flg_out_of_bounds_x'] and so forth.
plan a - build, in local Python, a column expression that is the sum of all the boolean columns cast as integers, then hand it to Spark.
from pyspark.sql.functions import col, lit
from functools import reduce
sum_bools = reduce(lambda acc, v: acc + col(v).cast("integer"), qa_tests, lit(0))
sum_bools is just an automatic way of writing lit(0) + col("qa_flg_xy_equal").cast("integer") + col("qa_flg_out_of_bounds_x").cast("integer") + ...
Here is how sum_bools is defined:
>> Column<'((0 + CAST(qa_flg_xy_equal AS INT)) + CAST(qa_flg_out_of_bounds_x AS INT))'>
rest of code:
df.withColumn("tests_passed", sum_bools).show(truncate=0)
+---+---------------+----------------------+------------+
|id |qa_flg_xy_equal|qa_flg_out_of_bounds_x|tests_passed|
+---+---------------+----------------------+------------+
|1 |true |false |1 |
|2 |false |false |0 |
|3 |true |true |2 |
+---+---------------+----------------------+------------+
plan b - we can use an array column to collect all the booleans into one array value, filter it to keep only the true values, and check the size of the array after the filter.
No need to keep all the verbose steps below; you can of course write it in one withColumn (a condensed sketch follows the output).
from pyspark.sql.functions import array, filter, size
df \
    .withColumn("qa_results", array(*qa_tests)) \
    .withColumn("passed_results", filter(
        "qa_results",
        lambda test: test
    )) \
    .withColumn("passed_results_count", size("passed_results")) \
    .show()
+---+---------------+----------------------+--------------+--------------+--------------------+
| id|qa_flg_xy_equal|qa_flg_out_of_bounds_x| qa_results|passed_results|passed_results_count|
+---+---------------+----------------------+--------------+--------------+--------------------+
| 1| true| false| [true, false]| [true]| 1|
| 2| false| false|[false, false]| []| 0|
| 3| true| true| [true, true]| [true, true]| 2|
+---+---------------+----------------------+--------------+--------------+--------------------+
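The one-withColumn version mentioned above would look roughly like this (a sketch, assuming the same qa_tests list of column names):
from pyspark.sql.functions import array, filter, size

df.withColumn(
    "passed_results_count",
    size(filter(array(*qa_tests), lambda test: test))
).show()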

PySpark: Group by two columns, count the pairs, and divide the average of two different columns

I have a dataframe with several columns, some of which are labeled PULocationID, DOLocationID, total_amount, and trip_distance. I'm trying to group by both PULocationID and DOLocationID, then count each combination into a column called "count". I also need to take the average of total_amount and trip_distance and divide them into a column called "trip_rate". The end DF should be:
+------------+------------+-----+---------+
|PULocationID|DOLocationID|count|trip_rate|
+------------+------------+-----+---------+
|         123|         422|    1|   5.2435|
|           3|          27|    4|   6.6121|
+------------+------------+-----+---------+
Where (123,422) are paired together once for a trip rate of $5.24 and (3, 27) are paired together 4 times where the trip rate is $6.61.
Through reading some other threads, I'm able to group by the locations and count them using the below:
df.groupBy("PULocationID", 'DOLocationID').agg(count(lit(1)).alias("count")).show()
OR I can group by the locations and get the averages of the two columns I need using the below:
df.groupBy("PULocationID", 'DOLocationID').agg({'total_amount':'avg', 'trip_distance':'avg'}).show()
I tried a couple of things to get the trip_rate, but neither worked:
df.withColumn("trip_rate", (pyspark.sql.functions.col("total_amount") / pyspark.sql.functions.col("trip_distance")))
df.withColumn("trip_rate", df.total_amount/sum(df.trip_distance))
I also can't figure out how to combine the two queries that work (i.e. count of locations + averages).
Using this as an example input DataFrame:
+------------+------------+------------+-------------+
|PULocationID|DOLocationID|total_amount|trip_distance|
+------------+------------+------------+-------------+
| 123| 422| 10.487| 2|
| 3| 27| 19.8363| 3|
| 3| 27| 13.2242| 2|
| 3| 27| 6.6121| 1|
| 3| 27| 26.4484| 4|
+------------+------------+------------+-------------+
You can chain together the groupBy, agg, and select (you could also use withColumn and drop if you only need the 4 columns; a sketch of that variant follows the output).
import pyspark.sql.functions as F
new_df = df.groupBy(
    "PULocationID",
    "DOLocationID",
).agg(
    F.count(F.lit(1)).alias("count"),
    F.avg(F.col("total_amount")).alias("avg_amt"),
    F.avg(F.col("trip_distance")).alias("avg_distance"),
).select(
    "PULocationID",
    "DOLocationID",
    "count",
    (F.col("avg_amt") / F.col("avg_distance")).alias("trip_rate"),
)
new_df.show()
+------------+------------+-----+-----------------+
|PULocationID|DOLocationID|count| trip_rate|
+------------+------------+-----+-----------------+
| 123| 422| 1| 5.2435|
| 3| 27| 4|6.612100000000001|
+------------+------------+-----+-----------------+
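The withColumn/drop variant mentioned above would be roughly this (a sketch):
import pyspark.sql.functions as F

new_df = df.groupBy(
    "PULocationID",
    "DOLocationID",
).agg(
    F.count(F.lit(1)).alias("count"),
    F.avg("total_amount").alias("avg_amt"),
    F.avg("trip_distance").alias("avg_distance"),
).withColumn(
    "trip_rate", F.col("avg_amt") / F.col("avg_distance")
).drop("avg_amt", "avg_distance")

It yields the same four columns; the intermediate averages are dropped at the end.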

Spark Scala Cumulative Unique Count by Date

I have a dataframe that gives a set of id numbers and the dates at which they visited a certain location, and I'm trying to find a way in Spark Scala to get the number of unique people ("id") that have visited this location on or before each day, so that one id number won't be counted twice if they visit on 2019-01-01 and then again on 2019-01-07, for example.
df.show(5,false)
+----+----------+
|  id|      date|
+----+----------+
|3424|2019-01-02|
|8683|2019-01-01|
|7690|2019-01-02|
|3424|2019-01-07|
|9002|2019-01-02|
+----+----------+
I want the output to look like this, where I groupBy("date") and get the count of unique id's as a cumulative number. (So, for example, next to 2019-01-03 it would give the distinct count of id's on any day up to and including 2019-01-03.)
+----------+-------+
|date      |cum_ct |
+----------+-------+
|2019-01-01|xxxxx  |
|2019-01-02|xxxxx  |
|2019-01-03|xxxxx  |
|...       |...    |
|2019-01-08|xxxxx  |
|2019-01-09|xxxxx  |
+----------+-------+
What would be the best way to do this after df.groupBy("date")?
You will have to use the ROW_NUMBER() function in this scenario. I have created a dataframe
val df = Seq((1,"2019-05-03"),(1,"2018-05-03"),(2,"2019-05-03"),(2,"2018-05-03"),(3,"2019-05-03"),(3,"2018-05-03")).toDF("id","date")
df.show
+---+----------+
| id| date|
+---+----------+
| 1|2019-05-03|
| 1|2018-05-03|
| 2|2019-05-03|
| 2|2018-05-03|
| 3|2019-05-03|
| 3|2018-05-03|
+---+----------+
ID represents a person id in your case that can appear against multiple dates.
Here is the count against each date.
df.groupBy("date").count.show
+----------+-----+
| date|count|
+----------+-----+
|2018-05-03| 3|
|2019-05-03| 3|
+----------+-----+
This shows the repetitive count of id's against each date. I have used 3 id's in total, and each date has a count of 3, which means all id's are counted again for each date.
Now, to my understanding, you want an ID to be counted only once against any date (depending on whether you want the latest date or the oldest date).
I am going to use the latest date for every ID.
val newdf = df.withColumn("row_num",row_number().over(Window.partitionBy($"id").orderBy($"date".desc)))
The above line assigns row numbers to every entry of each ID ordered by date, so row number 1 refers to the latest date of each ID. Now you take the count against each ID where the row number is 1. That results in a single count for every ID (distinct).
Here is the output. I have applied a filter on the row number, and you can see in the output that the dates are the latest, i.e. in my case 2019.
newdf.select("id","date","row_num").where("row_num = 1").show()
+---+----------+-------+
| id| date|row_num|
+---+----------+-------+
| 1|2019-05-03| 1|
| 3|2019-05-03| 1|
| 2|2019-05-03| 1|
+---+----------+-------+
Now I will take the count on newdf with the same filter, which will return a date-wise count.
newdf.groupBy("date","row_num").count().filter("row_num = 1").select("date","count").show
+----------+-----+
| date|count|
+----------+-----+
|2019-05-03| 3|
+----------+-----+
Here the total count is 3, which excludes ID's on previous dates; previously it was 6 (because of the repetition of ids across multiple dates).
I hope it answers your questions.

How to merge duplicate rows using expressions in Spark Dataframes

How can I merge 2 data frames by removing duplicates by comparing columns?
I have two dataframes with the same column names:
a.show()
+-----+----------+--------+
| name| date|duration|
+-----+----------+--------+
| bob|2015-01-13| 4|
|alice|2015-04-23| 10|
+-----+----------+--------+
b.show()
+------+----------+--------+
| name| date|duration|
+------+----------+--------+
| bob|2015-01-12| 3|
|alice2|2015-04-13| 10|
+------+----------+--------+
What I am trying to do is merge the 2 dataframes to display only unique rows, by applying two conditions:
1. For the same name, the duration will be the sum of the durations.
2. For the same name, the final date will be the latest date.
Final output will be
final.show()
+-------+----------+--------+
|   name|      date|duration|
+-------+----------+--------+
|    bob|2015-01-13|       7|
|  alice|2015-04-23|      10|
| alice2|2015-04-13|      10|
+-------+----------+--------+
I followed this method:
//Take union of the 2 dataframes
val df = a.unionAll(b)
//group and take sum
val grouped = df.groupBy("name").agg($"name", sum("duration"))
//join
val j = df.join(grouped, "name").drop("duration").withColumnRenamed("sum(duration)", "duration")
and I got
+------+----------+--------+
| name| date|duration|
+------+----------+--------+
| bob|2015-01-13| 7|
| alice|2015-04-23| 10|
| bob|2015-01-12| 7|
|alice2|2015-04-23| 10|
+------+----------+--------+
How can I now remove duplicates by comparing dates? Would it be possible by running SQL queries after registering it as a table?
I am a beginner in Spark SQL and I feel like my way of approaching this problem is weird. Is there a better way to do this kind of data processing?
You can do max(date) in the groupBy(); there is no need to join grouped back to df.
// In 1.3.x, in order for the grouping column "name" to show up,
val grouped = df.groupBy("name").agg($"name",sum("duration"), max("date"))
// In 1.4+, grouping column "name" is included automatically.
val grouped = df.groupBy("name").agg(sum("duration"), max("date"))