Applying conditional counts (with reset) to grouped data in PySpark? - pyspark

I have a data frame that is grouped by Calendar Date and ID. I need to populate the Expected Output column based on Booked: within each group, if Booked equals 1, set the output to 0; otherwise count the consecutive not-booked days. In other words, I'm trying to find all the consecutive available (not booked) days within a group. Any ideas on how to do this?
Example (shown as an image in the original post): the data is grouped by Calendar Date and ID, and the Expected Output is derived from Booked.

This is a typical "gaps and islands" problem. You can solve it like this:
from datetime import date

from pyspark.sql.types import StructType, StructField, IntegerType, DateType

SampleData = [
    (date(2022, 1, 1), 1, date(2022, 1, 1), 0),
    (date(2022, 1, 1), 1, date(2022, 1, 2), 1),
    (date(2022, 1, 1), 1, date(2022, 1, 3), 1),
    (date(2022, 1, 1), 1, date(2022, 1, 4), 0),
    (date(2022, 1, 1), 2, date(2022, 1, 1), 0),
    (date(2022, 1, 1), 2, date(2022, 1, 2), 0),
    (date(2022, 1, 1), 2, date(2022, 1, 3), 0),
    (date(2022, 1, 1), 2, date(2022, 1, 4), 0),
    (date(2022, 1, 2), 1, date(2022, 1, 2), 1),
    (date(2022, 1, 2), 1, date(2022, 1, 3), 1),
    (date(2022, 1, 2), 1, date(2022, 1, 4), 0),
    (date(2022, 1, 2), 2, date(2022, 1, 2), 0),
    (date(2022, 1, 2), 2, date(2022, 1, 3), 0),
    (date(2022, 1, 2), 2, date(2022, 1, 4), 0),
    (date(2022, 1, 3), 1, date(2022, 1, 3), 1),
    (date(2022, 1, 3), 1, date(2022, 1, 4), 0),
    (date(2022, 1, 3), 2, date(2022, 1, 3), 0),
    (date(2022, 1, 3), 2, date(2022, 1, 4), 0),
    (date(2022, 1, 4), 1, date(2022, 1, 4), 0),
    (date(2022, 1, 4), 2, date(2022, 1, 4), 0)
]

ColumnSchema = StructType([
    StructField("CalendarDate", DateType()),
    StructField("ID", IntegerType()),
    StructField("Date2", DateType()),
    StructField("Booked", IntegerType()),
])

dfSampleData = spark.createDataFrame(SampleData, schema=ColumnSchema)

dfSampleDataWithRowNumbers = dfSampleData.selectExpr(
    "*",
    # row number within each (CalendarDate, ID) group ...
    "ROW_NUMBER() OVER (PARTITION BY CalendarDate, ID ORDER BY Date2) AS rn1",
    # ... and within each (CalendarDate, ID, Booked) group
    "ROW_NUMBER() OVER (PARTITION BY CalendarDate, ID, Booked ORDER BY Date2) AS rn2"
).selectExpr(
    "*",
    # the difference is constant inside each consecutive run of equal Booked values (an "island")
    "rn1 - rn2 AS GapsAndIslandsGroupNo"
).selectExpr(
    "*",
    # running count within each island
    "ROW_NUMBER() OVER (PARTITION BY CalendarDate, ID, GapsAndIslandsGroupNo ORDER BY Date2) AS ExpectedOutputRaw"
).selectExpr(
    "*",
    # keep the running count only for not-booked days
    "CASE Booked WHEN 0 THEN ExpectedOutputRaw ELSE 0 END AS `Expected Output`"
).orderBy("CalendarDate", "ID", "Date2")

dfSampleDataWithRowNumbers.show()
Output:
+------------+---+----------+------+---+---+---------------------+-----------------+---------------+
|CalendarDate| ID| Date2|Booked|rn1|rn2|GapsAndIslandsGroupNo|ExpectedOutputRaw|Expected Output|
+------------+---+----------+------+---+---+---------------------+-----------------+---------------+
| 2022-01-01| 1|2022-01-01| 0| 1| 1| 0| 1| 1|
| 2022-01-01| 1|2022-01-02| 1| 2| 1| 1| 1| 0|
| 2022-01-01| 1|2022-01-03| 1| 3| 2| 1| 2| 0|
| 2022-01-01| 1|2022-01-04| 0| 4| 2| 2| 1| 1|
| 2022-01-01| 2|2022-01-01| 0| 1| 1| 0| 1| 1|
| 2022-01-01| 2|2022-01-02| 0| 2| 2| 0| 2| 2|
| 2022-01-01| 2|2022-01-03| 0| 3| 3| 0| 3| 3|
| 2022-01-01| 2|2022-01-04| 0| 4| 4| 0| 4| 4|
| 2022-01-02| 1|2022-01-02| 1| 1| 1| 0| 1| 0|
| 2022-01-02| 1|2022-01-03| 1| 2| 2| 0| 2| 0|
| 2022-01-02| 1|2022-01-04| 0| 3| 1| 2| 1| 1|
| 2022-01-02| 2|2022-01-02| 0| 1| 1| 0| 1| 1|
| 2022-01-02| 2|2022-01-03| 0| 2| 2| 0| 2| 2|
| 2022-01-02| 2|2022-01-04| 0| 3| 3| 0| 3| 3|
| 2022-01-03| 1|2022-01-03| 1| 1| 1| 0| 1| 0|
| 2022-01-03| 1|2022-01-04| 0| 2| 1| 1| 1| 1|
| 2022-01-03| 2|2022-01-03| 0| 1| 1| 0| 1| 1|
| 2022-01-03| 2|2022-01-04| 0| 2| 2| 0| 2| 2|
| 2022-01-04| 1|2022-01-04| 0| 1| 1| 0| 1| 1|
| 2022-01-04| 2|2022-01-04| 0| 1| 1| 0| 1| 1|
+------------+---+----------+------+---+---+---------------------+-----------------+---------------+
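The same gaps-and-islands logic can also be written with the DataFrame API instead of SQL expressions. Below is a sketch of an equivalent version (it reuses the dfSampleData frame defined above):
from pyspark.sql import functions as F, Window as W

w_group = W.partitionBy("CalendarDate", "ID").orderBy("Date2")
w_booked = W.partitionBy("CalendarDate", "ID", "Booked").orderBy("Date2")

df_islands = dfSampleData.withColumn(
    # the difference of the two row numbers is constant within each
    # consecutive run of equal Booked values (an "island")
    "grp",
    F.row_number().over(w_group) - F.row_number().over(w_booked),
)

w_island = W.partitionBy("CalendarDate", "ID", "grp").orderBy("Date2")

result = (
    df_islands
    .withColumn(
        "Expected Output",
        # running count inside the island, zeroed out for booked days
        F.when(F.col("Booked") == 0, F.row_number().over(w_island)).otherwise(F.lit(0)),
    )
    .drop("grp")
    .orderBy("CalendarDate", "ID", "Date2")
)
result.show()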

Related

Recursive computation with pyspark

I want to be able, within a window, to calculate a cumulative sum over two columns and to modify the value of one column if this sum becomes odd. That modification will in turn change the sum, and so on.
However, I don't know how to scan my data "row by row" in an efficient way.
Do you have a tip for that?
I am attaching a few example rows and what I would like to achieve, for clarity.
My window will be based on the ID column:
my_data = spark.createDataFrame([
    (1, 1, 0),
    (1, 0, 0),
    (1, 0, 1),
    (1, 0, 0),
    (1, 0, 1),
    (1, 0, 0),
    (1, 1, 0),
    (1, 0, 0),
    (1, 0, 1),
    (1, 0, 0),
    (1, 1, 0),
    (1, 0, 0),
    (1, 1, 0),
    (1, 0, 1),
], ['ID', 'flag_1', 'flag_2'])
So my issue is to derive the sum and, at the same time, to modify flag_2 when the sum becomes odd. Here sum is the expected result and flag_2_results is the "cleaned" version of flag_2, as explained:
my_data = spark.createDataFrame([
    (1, 1, 0, 0, 1),
    (1, 0, 0, 0, 1),
    (1, 0, 1, 1, 2),
    (1, 0, 0, 0, 2),
    (1, 0, 1, 0, 2),
    (1, 0, 0, 0, 2),
    (1, 1, 0, 0, 3),
    (1, 0, 0, 0, 3),
    (1, 0, 1, 1, 4),
    (1, 0, 0, 0, 4),
    (1, 1, 0, 0, 5),
    (1, 0, 0, 0, 5),
    (1, 1, 0, 0, 6),
    (1, 0, 1, 0, 6),
], ['ID', 'flag_1', 'flag_2', 'flag_2_results', 'sum'])
Row n°3: we keep flag_2 = 1 because the sum was odd.
Row n°5: we do not keep flag_2 = 1 because the sum was even, so the sum does not change until flag_1 = 1.
Last row: we do not keep flag_2 = 1 (even though it's the first one after a flag_1 = 1) because it would lead to an odd cumulative sum.
Thank you for your help.
According to your last comment, you do not have that many lines to process. In that case, I'd advise you to use a UDF, only on the lines where "flag_1 + flag_2 > 0":
from pyspark.sql import functions as F, types as T

# Collect each ID's rows into one array so they can be processed sequentially.
# NB: this assumes a "posTime" column that defines the row order within each ID.
df = my_data.groupBy("ID").agg(
    F.collect_list(F.struct(F.col("posTime"), F.col("flag_1"), F.col("flag_2"))).alias(
        "data"
    )
)

schm = T.ArrayType(
    T.StructType(
        [
            T.StructField("posTime", T.IntegerType()),
            T.StructField("flag_1", T.IntegerType()),
            T.StructField("flag_2", T.IntegerType()),
            T.StructField("flag_2_result", T.IntegerType()),
            T.StructField("sum", T.IntegerType()),
        ]
    )
)

@F.udf(schm)
def process(data):
    accumulator = 0
    out = []
    data.sort(key=lambda x: x["posTime"])
    for l in data:
        flag_2_result = 0
        accumulator += l["flag_1"]
        if l["flag_2"] and accumulator % 2 == 1:
            accumulator += l["flag_2"]
            flag_2_result = 1
        out.append((l["posTime"], l["flag_1"], l["flag_2"], flag_2_result, accumulator))
    return out

df.select("ID", F.explode(process(F.col("data"))).alias("data")).select(
    "ID", "data.*"
).show()
And the result:
+---+-------+------+------+-------------+---+
| ID|posTime|flag_1|flag_2|flag_2_result|sum|
+---+-------+------+------+-------------+---+
| 1| 1| 1| 0| 0| 1|
| 1| 2| 0| 0| 0| 1|
| 1| 3| 0| 1| 1| 2|
| 1| 4| 0| 0| 0| 2|
| 1| 5| 0| 1| 0| 2|
| 1| 6| 0| 0| 0| 2|
| 1| 7| 1| 0| 0| 3|
| 1| 8| 0| 0| 0| 3|
| 1| 9| 0| 1| 1| 4|
| 1| 10| 0| 0| 0| 4|
| 1| 11| 1| 0| 0| 5|
| 1| 12| 0| 0| 0| 5|
| 1| 13| 1| 0| 0| 6|
| 1| 14| 0| 1| 0| 6|
+---+-------+------+------+-------------+---+
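If the groups get large, a grouped-map pandas UDF (applyInPandas, available in Spark 3.x) is another way to run the same sequential logic per ID without collecting each group into a single array column. A rough sketch; like the answer above, it assumes a posTime column that defines the row order within each ID (the sample my_data in the question does not have one):
import pandas as pd

def process_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # same accumulator logic as the UDF above, applied to one ID group at a time
    pdf = pdf.sort_values("posTime")
    accumulator = 0
    results, sums = [], []
    for _, row in pdf.iterrows():
        flag_2_result = 0
        accumulator += row["flag_1"]
        if row["flag_2"] and accumulator % 2 == 1:
            accumulator += row["flag_2"]
            flag_2_result = 1
        results.append(flag_2_result)
        sums.append(accumulator)
    pdf["flag_2_result"] = results
    pdf["sum"] = sums
    return pdf

out_schema = "ID int, posTime int, flag_1 int, flag_2 int, flag_2_result int, sum int"

my_data.groupBy("ID").applyInPandas(process_group, schema=out_schema).show()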

Concatenate PySpark Dataframe Column Names by Value and Sum

I have an example dataframe:
df = spark.createDataFrame([
    (1, 0, 1, 1, 1, 1, "something"),
    (2, 0, 1, 1, 1, 0, "something"),
    (3, 1, 0, 0, 0, 0, "something"),
    (4, 0, 1, 0, 0, 0, "something"),
    (5, 1, 0, 0, 0, 0, "something"),
    (6, 0, 0, 0, 0, 0, "something")
]).toDF("id", "a", "b", "c", "d", "e", "extra_column")
df.show()
+---+---+---+---+---+---+------------+
| id| a| b| c| d| e|extra_column|
+---+---+---+---+---+---+------------+
| 1| 0| 1| 1| 1| 1| something|
| 2| 0| 1| 1| 1| 0| something|
| 3| 1| 0| 0| 0| 0| something|
| 4| 0| 1| 0| 0| 0| something|
| 5| 1| 0| 0| 0| 0| something|
| 6| 0| 0| 0| 0| 0| something|
+---+---+---+---+---+---+------------+
I want to concatenate across the columns per row and produce a key from the names of the columns that equal 1. I don't need to show this result, but it is the intermediate step I need to solve:
df_row_concat = spark.createDataFrame([
    (1, 0, 1, 1, 1, 1, "something", "bcde"),
    (2, 0, 1, 1, 1, 0, "something", "bcd"),
    (3, 1, 0, 0, 0, 0, "something", "a"),
    (4, 0, 1, 0, 0, 0, "something", "b"),
    (5, 1, 0, 0, 0, 0, "something", "a"),
    (6, 0, 0, 0, 0, 0, "something", "")
]).toDF("id", "a", "b", "c", "d", "e", "extra_column", "key")
df_row_concat.show()
+---+---+---+---+---+---+------------+----+
| id| a| b| c| d| e|extra_column| key|
+---+---+---+---+---+---+------------+----+
| 1| 0| 1| 1| 1| 1| something|bcde|
| 2| 0| 1| 1| 1| 0| something| bcd|
| 3| 1| 0| 0| 0| 0| something| a|
| 4| 0| 1| 0| 0| 0| something| b|
| 5| 1| 0| 0| 0| 0| something| a|
| 6| 0| 0| 0| 0| 0| something| |
+---+---+---+---+---+---+------------+----+
This last part I can get on my own, but to complete the example, I want to count the rows for each key and output:
+----+-----+
| key|value|
+----+-----+
| a| 2|
| b| 1|
| bcd| 1|
|bcde| 1|
+----+-----+
My actual dataset is much longer and wider. I could hard-code every combination, but there must be a more efficient way to loop over the list of columns to consider (e.g. column_list = ["a", "b", "c", "d", "e"]). Maybe it's not necessary, but I included extra_column because there are additional columns in my dataset that won't be considered.
I don't see anything wrong with writing a for loop here:
from pyspark.sql import functions as F

cols = ['a', 'b', 'c', 'd', 'e']
temp = df.withColumn('key', F.concat(*[F.when(F.col(c) == 1, c).otherwise('') for c in cols]))
temp.show()
+---+---+---+---+---+---+------------+----+
| id| a| b| c| d| e|extra_column| key|
+---+---+---+---+---+---+------------+----+
| 1| 0| 1| 1| 1| 1| something|bcde|
| 2| 0| 1| 1| 1| 0| something| bcd|
| 3| 1| 0| 0| 0| 0| something| a|
| 4| 0| 1| 0| 0| 0| something| b|
| 5| 1| 0| 0| 0| 0| something| a|
| 6| 0| 0| 0| 0| 0| something| |
+---+---+---+---+---+---+------------+----+
(temp
    .groupBy('key')
    .agg(F.count('*').alias('value'))
    .where(F.col('key') != '')
    .show()
)
+----+-----+
| key|value|
+----+-----+
|bcde| 1|
| b| 1|
| a| 2|
| bcd| 1|
+----+-----+
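If the real dataframe is wide, the cols list doesn't have to be hard-coded; it can be derived from df.columns by excluding whatever shouldn't be part of the key (here id and extra_column are assumed to be the only columns to skip):
from pyspark.sql import functions as F

# build the flag-column list from the schema instead of typing it out
exclude = {"id", "extra_column"}
cols = [c for c in df.columns if c not in exclude]

(df.withColumn("key", F.concat(*[F.when(F.col(c) == 1, c).otherwise("") for c in cols]))
   .groupBy("key")
   .agg(F.count("*").alias("value"))
   .where(F.col("key") != "")
   .show())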

How to get value from previous group in spark?

I need to get the value of the previous group in Spark and set it on the current group.
How can I achieve that?
I must order by count instead of TEXT_NUM.
Ordering by TEXT_NUM is not possible because events repeat in time, as counts 10 and 11 show.
I'm trying with the following code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

val spark = SparkSession.builder()
  .master("spark://spark-master:7077")
  .getOrCreate()

val df = spark
  .createDataFrame(
    Seq[(Int, String, Int)](
      (0, "", 0),
      (1, "", 0),
      (2, "A", 1),
      (3, "A", 1),
      (4, "A", 1),
      (5, "B", 2),
      (6, "B", 2),
      (7, "B", 2),
      (8, "C", 3),
      (9, "C", 3),
      (10, "A", 1),
      (11, "A", 1)
    ))
  .toDF("count", "TEXT", "TEXT_NUM")

val w1 = Window
  .orderBy("count")
  .rangeBetween(Window.unboundedPreceding, -1)

df
  .withColumn("LAST_VALUE", last("TEXT_NUM").over(w1))
  .orderBy("count")
  .show()
Result:
+-----+----+--------+----------+
|count|TEXT|TEXT_NUM|LAST_VALUE|
+-----+----+--------+----------+
| 0| | 0| null|
| 1| | 0| 0|
| 2| A| 1| 0|
| 3| A| 1| 1|
| 4| A| 1| 1|
| 5| B| 2| 1|
| 6| B| 2| 2|
| 7| B| 2| 2|
| 8| C| 3| 2|
| 9| C| 3| 3|
| 10| A| 1| 3|
| 11| A| 1| 1|
+-----+----+--------+----------+
Desired result:
+-----+----+--------+----------+
|count|TEXT|TEXT_NUM|LAST_VALUE|
+-----+----+--------+----------+
| 0| | 0| null|
| 1| | 0| null|
| 2| A| 1| 0|
| 3| A| 1| 0|
| 4| A| 1| 0|
| 5| B| 2| 1|
| 6| B| 2| 1|
| 7| B| 2| 1|
| 8| C| 3| 2|
| 9| C| 3| 2|
| 10| A| 1| 3|
| 11| A| 1| 3|
+-----+----+--------+----------+
Consider using the Window function last(columnName, ignoreNulls) to backfill the nulls in a column that holds the previous "text_num" only at group boundaries, as shown below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, last, when}
import spark.implicits._

val df = Seq(
  (0, "", 0), (1, "", 0),
  (2, "A", 1), (3, "A", 1), (4, "A", 1),
  (5, "B", 2), (6, "B", 2), (7, "B", 2),
  (8, "C", 3), (9, "C", 3),
  (10, "A", 1), (11, "A", 1)
).toDF("count", "text", "text_num")

val w1 = Window.orderBy("count")
val w2 = w1.rowsBetween(Window.unboundedPreceding, 0)

df.
  withColumn("prev_num", lag("text_num", 1).over(w1)).
  withColumn("last_change", when($"text_num" =!= $"prev_num", $"prev_num")).
  withColumn("last_value", last("last_change", ignoreNulls = true).over(w2)).
  show
/*
+-----+----+--------+--------+-----------+----------+
|count|text|text_num|prev_num|last_change|last_value|
+-----+----+--------+--------+-----------+----------+
| 0| | 0| null| null| null|
| 1| | 0| 0| null| null|
| 2| A| 1| 0| 0| 0|
| 3| A| 1| 1| null| 0|
| 4| A| 1| 1| null| 0|
| 5| B| 2| 1| 1| 1|
| 6| B| 2| 2| null| 1|
| 7| B| 2| 2| null| 1|
| 8| C| 3| 2| 2| 2|
| 9| C| 3| 3| null| 2|
| 10| A| 1| 3| 3| 3|
| 11| A| 1| 1| null| 3|
+-----+----+--------+--------+-----------+----------+
*/
The intermediary columns are kept in the output for reference. Just drop them if they aren't needed.
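Since this page is about PySpark, here is a sketch of the same lag/last(ignoreNulls) idea in Python; it assumes a DataFrame df with the count, text and text_num columns shown above:
from pyspark.sql import functions as F, Window as W

w1 = W.orderBy("count")
w2 = w1.rowsBetween(W.unboundedPreceding, 0)

(df.withColumn("prev_num", F.lag("text_num", 1).over(w1))
   # keep prev_num only on the first row of each new group, null elsewhere
   .withColumn("last_change",
               F.when(F.col("text_num") != F.col("prev_num"), F.col("prev_num")))
   # carry the boundary value forward, skipping the nulls in between
   .withColumn("last_value", F.last("last_change", ignorenulls=True).over(w2))
   .show())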

Apache Spark - Scala API - Aggregate on sequentially increasing key

I have a data frame that looks something like this:
val df = sc.parallelize(Seq(
  (3, 1, "A"), (3, 2, "B"), (3, 3, "C"),
  (2, 1, "D"), (2, 2, "E"),
  (3, 1, "F"), (3, 2, "G"), (3, 3, "G"),
  (2, 1, "X"), (2, 2, "X")
)).toDF("TotalN", "N", "String")
+------+---+------+
|TotalN| N|String|
+------+---+------+
| 3| 1| A|
| 3| 2| B|
| 3| 3| C|
| 2| 1| D|
| 2| 2| E|
| 3| 1| F|
| 3| 2| G|
| 3| 3| G|
| 2| 1| X|
| 2| 2| X|
+------+---+------+
I need to aggregate the strings by concatenating them together based on the TotalN and the sequentially increasing ID (N). The problem is there is not a unique ID for each aggregation I can group by. So, I need to do something like "for each row look at the TotalN, loop through the next N rows and concatenate, then reset".
+------+------+
|TotalN|String|
+------+------+
| 3| ABC|
| 2| DE|
| 3| FGG|
| 2| XX|
+------+------+
Any pointers much appreciated.
Using Spark 2.3.1 and the Scala API.
Try this:
val df = spark.sparkContext.parallelize(Seq(
  (3, 1, "A"), (3, 2, "B"), (3, 3, "C"),
  (2, 1, "D"), (2, 2, "E"),
  (3, 1, "F"), (3, 2, "G"), (3, 3, "G"),
  (2, 1, "X"), (2, 2, "X")
)).toDF("TotalN", "N", "String")

df.createOrReplaceTempView("data")

val sqlDF = spark.sql(
  """
    | SELECT TotalN, N, String, ROW_NUMBER() OVER (ORDER BY TotalN) AS rowNum
    | FROM data
  """.stripMargin)

sqlDF.withColumn("key", $"N" - $"rowNum")
  .groupBy("key").agg(collect_list('String).as("texts")).show()
The solution is to calculate a grouping variable with the row_number function, which can then be used in a later groupBy.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val w = Window.orderBy("TotalN")
df.withColumn("GeneratedID", $"N" - row_number.over(w)).show
+------+---+------+-----------+
|TotalN| N|String|GeneratedID|
+------+---+------+-----------+
| 2| 1| D| 0|
| 2| 2| E| 0|
| 2| 1| X| -2|
| 2| 2| X| -2|
| 3| 1| A| -4|
| 3| 2| B| -4|
| 3| 3| C| -4|
| 3| 1| F| -7|
| 3| 2| G| -7|
| 3| 3| G| -7|
+------+---+------+-----------+
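To get from GeneratedID to the requested (TotalN, String) output, one more aggregation is needed. A rough sketch in PySpark (the Scala version is analogous); df_with_id is a hypothetical name for the frame that already has the GeneratedID column:
from pyspark.sql import functions as F

(df_with_id  # hypothetical: the frame shown above, with the GeneratedID column
   .groupBy("TotalN", "GeneratedID")
   # NB: collect_list does not guarantee element order; for a strict ordering,
   # collect (N, String) structs, sort_array them, and extract the strings instead
   .agg(F.concat_ws("", F.collect_list("String")).alias("String"))
   .drop("GeneratedID")
   .show())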

Spark dataset: return a HashMap of values having same key

+------+-----+
|userID|entID|
+------+-----+
| 0| 5|
| 0| 15|
| 1| 7|
| 1| 3|
| 2| 3|
| 2| 4|
| 2| 5|
| 2| 9|
| 3| 25|
+------+-----+
I want the result as {0->(5,15), 1->(7,3),..}
Any help would be appreciated.
Here is your table again:
val df = Seq(
  (0, 5),
  (0, 15),
  (1, 7),
  (1, 3),
  (2, 3),
  (2, 4),
  (2, 5),
  (2, 9),
  (3, 25)
).toDF("userId", "entId")
df.show()
Outputs:
+------+-----+
|userId|entId|
+------+-----+
| 0| 5|
| 0| 15|
| 1| 7|
| 1| 3|
| 2| 3|
| 2| 4|
| 2| 5|
| 2| 9|
| 3| 25|
+------+-----+
Now you can group by userId and then collect entId into lists, aliasing the resulting column of lists as entIds:
import org.apache.spark.sql.functions._
val entIdsForUserId = df.
  groupBy($"userId").
  agg(collect_list($"entId").alias("entIds"))
entIdsForUserId.show()
Output:
+------+------------+
|userId| entIds|
+------+------------+
| 1| [7, 3]|
| 3| [25]|
| 2|[3, 4, 5, 9]|
| 0| [5, 15]|
+------+------------+
The order after groupBy is not specified. Depending on what you want to do with it, you could additionally sort it.
You can collect it into a single map on the driver:
val m = entIdsForUserId.
  map(r => (r.getAs[Int](0), r.getAs[Seq[Int]](1))).
  collect.toMap
this will give you:
Map(1 -> List(7, 3), 3 -> List(25), 2 -> List(3, 4, 5, 9), 0 -> List(5, 15))
One approach would be to convert the Dataset to an RDD and perform a groupByKey. To obtain the result as a Map, you'll need to collect the grouped RDD, provided the dataset isn't too big:
val ds = Seq(
  (0, 5), (0, 15), (1, 7), (1, 3),
  (2, 3), (2, 4), (2, 5), (2, 9), (3, 25)
).toDF("userID", "entID").as[(Int, Int)]
// ds: org.apache.spark.sql.Dataset[(Int, Int)] = [userID: int, entID: int]
val map = ds.rdd.groupByKey.collectAsMap
// map: scala.collection.Map[Int,Iterable[Int]] = Map(
// 2 -> CompactBuffer(3, 4, 5, 9), 1 -> CompactBuffer(7, 3),
// 3 -> CompactBuffer(25), 0 -> CompactBuffer(5, 15)
// )
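For the PySpark side of the same problem, the collect-to-a-dict step can be sketched like this (again, only reasonable when the grouped result fits in driver memory); df is assumed to be a DataFrame with the userID and entID columns from the question:
# group at the RDD level and bring the result back as a plain Python dict
user_to_ents = (
    df.rdd
      .map(lambda r: (r["userID"], r["entID"]))
      .groupByKey()
      .mapValues(list)
      .collectAsMap()
)
# e.g. {0: [5, 15], 1: [7, 3], 2: [3, 4, 5, 9], 3: [25]}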