Within a window, I want to compute a running sum over two columns and modify the value of one column whenever this sum becomes odd. The modification of that column then changes the sum itself, and so on.
However, I don't know how to scan my data "row by row" in an efficient way.
Would you have a tip for that?
I am attaching a few example rows and what I would like to achieve for clarity:
My window will be based on the ID column
my_data = spark.createDataFrame([
(1, 1, 0),
(1, 0, 0),
(1, 0, 1),
(1, 0, 0),
(1, 0, 1),
(1, 0, 0),
(1, 1, 0),
(1, 0, 0),
(1, 0, 1),
(1, 0, 0),
(1, 1, 0),
(1, 0, 0),
(1, 1, 0),
(1, 0, 1),
],
['ID','flag_1','flag_2'])
So my issue is to derive the sum and, at the same time, to modify flag_2 whenever the sum becomes odd. Here sum is the expected result and flag_2_results is the "cleaned" version of flag_2, as explained:
my_data = spark.createDataFrame([
(1, 1, 0, 0, 1),
(1, 0, 0, 0, 1),
(1, 0, 1, 1, 2),
(1, 0, 0, 0, 2),
(1, 0, 1, 0, 2),
(1, 0, 0, 0, 2),
(1, 1, 0, 0, 3),
(1, 0, 0, 0, 3),
(1, 0, 1, 1, 4),
(1, 0, 0, 0, 4),
(1, 1, 0, 0, 5),
(1, 0, 0, 0, 5),
(1, 1, 0, 0, 6),
(1, 0, 1, 0, 6),],
['ID','flag_1','flag_2', 'flag_2_results', 'sum'])
Row 3: we keep flag_2 = 1 because the sum was odd.
Row 5: we do not keep flag_2 = 1 because the sum was even, so the sum does not change until the next flag_1 = 1.
Last row: we do not keep flag_2 = 1 (even though it is the first one after a flag_1 = 1) because it would lead to an odd cumulative sum.
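To make the rule fully explicit, here is a plain-Python restatement of the scan I have in mind, run on the example flags above (illustration only, not Spark code):
# Plain-Python restatement of the rule: keep flag_2 = 1 only when the running
# sum is currently odd, so that keeping it makes the sum even again.
rows = [(1, 0), (0, 0), (0, 1), (0, 0), (0, 1), (0, 0), (1, 0),
        (0, 0), (0, 1), (0, 0), (1, 0), (0, 0), (1, 0), (0, 1)]  # (flag_1, flag_2)
acc, out = 0, []
for flag_1, flag_2 in rows:
    acc += flag_1
    keep = 1 if flag_2 == 1 and acc % 2 == 1 else 0
    acc += keep
    out.append((flag_1, flag_2, keep, acc))
# out reproduces the (flag_1, flag_2, flag_2_results, sum) columns of the expected output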
Thank you for your help,
According to your last comment, you do not have that many rows to process, so I'd advise you to use a UDF, applied only to the rows where "flag_1 + flag_2 > 0":
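One note before the code: the snippet below orders rows by a posTime column, which your real data presumably has but the example my_data above does not. If you want to run it on the example as-is, you could first add a stand-in ordering column (a sketch; monotonically_increasing_id only reflects the current row order of this small, single-partition example and is not a general-purpose row number):
from pyspark.sql import functions as F

# Stand-in ordering column for the toy example only (assumption: my_data is already
# in the desired order); replace with your real timestamp/position column in practice.
my_data = my_data.withColumn("posTime", F.monotonically_increasing_id())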
from pyspark.sql import functions as F, Window as W, types as T
df = my_data.groupBy("ID").agg(
    F.collect_list(F.struct(F.col("posTime"), F.col("flag_1"), F.col("flag_2"))).alias("data")
)

schm = T.ArrayType(
    T.StructType(
        [
            T.StructField("posTime", T.IntegerType()),
            T.StructField("flag_1", T.IntegerType()),
            T.StructField("flag_2", T.IntegerType()),
            T.StructField("flag_2_result", T.IntegerType()),
            T.StructField("sum", T.IntegerType()),
        ]
    )
)

@F.udf(schm)
def process(data):
    accumulator = 0
    out = []
    data.sort(key=lambda x: x["posTime"])  # process the rows in chronological order
    for l in data:
        flag_2_result = 0
        accumulator += l["flag_1"]
        if l["flag_2"] and accumulator % 2 == 1:
            # keep flag_2 only when the running sum is odd, so that keeping it makes it even
            accumulator += l["flag_2"]
            flag_2_result = 1
        out.append((l["posTime"], l["flag_1"], l["flag_2"], flag_2_result, accumulator))
    return out

df.select("ID", F.explode(process(F.col("data"))).alias("data")).select(
    "ID", "data.*"
).show()
and the result:
+---+-------+------+------+-------------+---+
| ID|posTime|flag_1|flag_2|flag_2_result|sum|
+---+-------+------+------+-------------+---+
| 1| 1| 1| 0| 0| 1|
| 1| 2| 0| 0| 0| 1|
| 1| 3| 0| 1| 1| 2|
| 1| 4| 0| 0| 0| 2|
| 1| 5| 0| 1| 0| 2|
| 1| 6| 0| 0| 0| 2|
| 1| 7| 1| 0| 0| 3|
| 1| 8| 0| 0| 0| 3|
| 1| 9| 0| 1| 1| 4|
| 1| 10| 0| 0| 0| 4|
| 1| 11| 1| 0| 0| 5|
| 1| 12| 0| 0| 0| 5|
| 1| 13| 1| 0| 0| 6|
| 1| 14| 0| 1| 0| 6|
+---+-------+------+------+-------------+---+
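If your groups ever get large, a variation on the same idea is a grouped-map pandas UDF, which runs the identical scan per ID without building the array column explicitly (a sketch, assuming Spark 3.0+ with pyarrow installed and the same posTime ordering column as above):
import pandas as pd

# Same per-ID scan as the UDF above, expressed with applyInPandas (Spark 3.0+, requires pyarrow).
def scan(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf = pdf.sort_values("posTime")
    acc, results, sums = 0, [], []
    for f1, f2 in zip(pdf["flag_1"], pdf["flag_2"]):
        acc += f1
        keep = 1 if f2 == 1 and acc % 2 == 1 else 0
        acc += keep
        results.append(keep)
        sums.append(acc)
    return pdf.assign(flag_2_result=results, sum=sums)

out = my_data.groupBy("ID").applyInPandas(
    scan, schema="ID long, posTime long, flag_1 long, flag_2 long, flag_2_result long, sum long"
)
out.show()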
I have an example dataframe:
df = spark.createDataFrame([
(1, 0, 1, 1, 1, 1, "something"),
(2, 0, 1, 1, 1, 0, "something"),
(3, 1, 0, 0, 0, 0, "something"),
(4, 0, 1, 0, 0, 0, "something"),
(5, 1, 0, 0, 0, 0, "something"),
(6, 0, 0, 0, 0, 0, "something")
], ["int" * 6, "string"]) \
.toDF("id", "a", "b", "c", "d", "e", "extra_column")
df.show()
+---+---+---+---+---+---+------------+
| id| a| b| c| d| e|extra_column|
+---+---+---+---+---+---+------------+
| 1| 0| 1| 1| 1| 1| something|
| 2| 0| 1| 1| 1| 0| something|
| 3| 1| 0| 0| 0| 0| something|
| 4| 0| 1| 0| 0| 0| something|
| 5| 1| 0| 0| 0| 0| something|
| 6| 0| 0| 0| 0| 0| something|
I want to concatenate across the columns per row and produce a key made of the names of the columns that equal 1. I don't need to show this result, but it is the intermediate step I need to solve:
df_row_concat = spark.createDataFrame([
(1, 0, 1, 1, 1, 1, "something", "bcde"),
(2, 0, 1, 1, 1, 0, "something", "bcd"),
(3, 1, 0, 0, 0, 0, "something", "a"),
(4, 0, 1, 0, 0, 0, "something", "b"),
(5, 1, 0, 0, 0, 0, "something", "a"),
(6, 0, 0, 0, 0, 0, "something", "")
], ["int" * 6, "string" * 2]) \
.toDF("id", "a", "b", "c", "d", "e", "extra_column", "key")
df_row_concat.show()
+---+---+---+---+---+---+------------+----+
| id| a| b| c| d| e|extra_column| key|
+---+---+---+---+---+---+------------+----+
| 1| 0| 1| 1| 1| 1| something|bcde|
| 2| 0| 1| 1| 1| 0| something| bcd|
| 3| 1| 0| 0| 0| 0| something| a|
| 4| 0| 1| 0| 0| 0| something| b|
| 5| 1| 0| 0| 0| 0| something| a|
| 6| 0| 0| 0| 0| 0| something| |
+---+---+---+---+---+---+------------+----+
This last part I can get on my own, but to complete the example, I want to count the rows for each key and output:
+----+-----+
| key|value|
+----+-----+
| a| 2|
| b| 1|
| bcd| 1|
|bcde| 1|
+----+-----+
My actual dataset is much longer and wider. I could hard-code every combination, but there must be a more efficient way to loop over the list of columns to consider (e.g. column_list = ["a", "b", "c", "d", "e"]). Maybe not necessary, but I included extra_column because there are additional columns in my dataset which won't be considered.
I don't see anything wrong with writing a for loop here:
from pyspark.sql import functions as F

cols = ['a', 'b', 'c', 'd', 'e']
temp = df.withColumn('key', F.concat(*[F.when(F.col(c) == 1, c).otherwise('') for c in cols]))
temp.show()
+---+---+---+---+---+---+------------+----+
| id| a| b| c| d| e|extra_column| key|
+---+---+---+---+---+---+------------+----+
| 1| 0| 1| 1| 1| 1| something|bcde|
| 2| 0| 1| 1| 1| 0| something| bcd|
| 3| 1| 0| 0| 0| 0| something| a|
| 4| 0| 1| 0| 0| 0| something| b|
| 5| 1| 0| 0| 0| 0| something| a|
| 6| 0| 0| 0| 0| 0| something| |
+---+---+---+---+---+---+------------+----+
(temp
.groupBy('key')
.agg(F.count('*').alias('value'))
.where(F.col('key') != '')
.show()
)
+----+-----+
| key|value|
+----+-----+
|bcde| 1|
| b| 1|
| a| 2|
| bcd| 1|
+----+-----+
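As for not hard-coding the list when the real table is wide: one option (a sketch, assuming every column other than id and extra_column is a 0/1 flag) is to derive cols from df.columns instead of typing it out:
# Derive the flag columns dynamically instead of hard-coding them
# (assumption: everything except 'id' and 'extra_column' is a 0/1 flag column).
excluded = {"id", "extra_column"}
cols = [c for c in df.columns if c not in excluded]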
I have a Spark dataframe with the following data:
val df = sc.parallelize(Seq(
(1, "A", "2022-01-01", 30, 0),
(1, "A", "2022-01-02", 20, 30),
(1, "B", "2022-01-03", 50, 20),
(1, "A", "2022-01-04", 10, 70),
(1, "B", "2022-01-05", 30, 60),
(1, "A", "2022-01-06", 0, 40),
(1, "C", "2022-01-07", 100,30),
(2, "D", "2022-01-08", 5, 0)
)).toDF("id", "event", "eventTimestamp", "amount", "expected")
display(df)
+---+-----+--------------+------+--------+
| id|event|eventTimestamp|amount|expected|
+---+-----+--------------+------+--------+
|  1|    A|    2022-01-01|    30|       0|
|  1|    A|    2022-01-02|    20|      30|
|  1|    B|    2022-01-03|    50|      20|
|  1|    A|    2022-01-04|    10|      70|
|  1|    B|    2022-01-05|    30|      60|
|  1|    A|    2022-01-06|     0|      40|
|  1|    C|    2022-01-07|   100|      30|
|  2|    D|    2022-01-08|     5|       0|
+---+-----+--------------+------+--------+
I want to find the following for each row: the sum of the latest amount (taken from the rows above the current one) of each unique event within the same id. The desired outcome is in the column "expected".
E.g. for the event "C" I'd like to get the latest amounts of "A" and "B": 30 + 0 = 30.
I tried the following query; however, it sums up the amounts of all previous rows, including duplicates (I'm not sure if it's possible to apply a filter on the sum so that only the latest value per event is taken):
val days = (x: Int) => x * 86400

val idWindow = Window.partitionBy("id")
  .orderBy(col("eventTimestamp").cast("timestamp").cast("long"))
  .rangeBetween(Window.unboundedPreceding, -days(1))

val res = df.withColumn("totalAmount", sum($"amount").over(idWindow))
Please note that the rangeBetween functionality is important for my use-case and should be preserved.
The trick is to convert the amounts to diffs within (id, event) pairs, which lets you compute a moving sum in the next step. That moving sum maintains the latest amount of each unique event.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// windows used below (not shown in the original snippet): diffs per (id, event), running sum per id
val wIdEvent = Window.partitionBy("id", "event").orderBy("eventTimestamp")
val wId = Window.partitionBy("id").orderBy("eventTimestamp")
df
  .withColumn("diff", coalesce($"amount" - lag($"amount", 1).over(wIdEvent), $"amount"))
  .withColumn("sum", sum($"diff").over(wId))
  .withColumn("final", coalesce(lag($"sum", 1).over(wId), lit(0)))
  .orderBy($"eventTimestamp").show
+---+-----+--------------+------+--------+----+---+-----+
| id|event|eventTimestamp|amount|expected|diff|sum|final|
+---+-----+--------------+------+--------+----+---+-----+
| 1| A| 2022-01-01| 30| 0| 30| 30| 0|
| 1| A| 2022-01-02| 20| 30| -10| 20| 30|
| 1| B| 2022-01-03| 50| 20| 50| 70| 20|
| 1| A| 2022-01-04| 10| 70| -10| 60| 70|
| 1| B| 2022-01-05| 30| 60| -20| 40| 60|
| 1| A| 2022-01-06| 0| 40| -10| 30| 40|
| 1| C| 2022-01-07| 100| 30| 100|130| 30|
| 2| D| 2022-01-08| 5| 0| 5| 5| 0|
+---+-----+--------------+------+--------+----+---+-----+
I have a data frame that is grouped by Calendar Date and ID. I need to populate the Expected Output based on the Booked column: within each group, if Booked is equal to 1 then set the output to 0, otherwise count the consecutive not-booked days. In other words, I'm trying to find all the consecutive available (not booked) days within a group. Any ideas on how to do this?
Example (screenshot omitted): the data grouped by Calendar Date and ID, with the Expected Output derived from Booked.
This is a typical "gaps and islands" problem. The difference between a row number over the whole (CalendarDate, ID) partition and a row number over the same partition further split by Booked is constant within each run of identical Booked values, so that difference identifies the islands. You can solve it like this:
from datetime import date
from pyspark.sql.functions import expr
from pyspark.sql.types import StructType, StructField, IntegerType, DateType
SampleData = [
(date(2022, 1, 1), 1, date(2022, 1, 1), 0),
(date(2022, 1, 1), 1, date(2022, 1, 2), 1),
(date(2022, 1, 1), 1, date(2022, 1, 3), 1),
(date(2022, 1, 1), 1, date(2022, 1, 4), 0),
(date(2022, 1, 1), 2, date(2022, 1, 1), 0),
(date(2022, 1, 1), 2, date(2022, 1, 2), 0),
(date(2022, 1, 1), 2, date(2022, 1, 3), 0),
(date(2022, 1, 1), 2, date(2022, 1, 4), 0),
(date(2022, 1, 2), 1, date(2022, 1, 2), 1),
(date(2022, 1, 2), 1, date(2022, 1, 3), 1),
(date(2022, 1, 2), 1, date(2022, 1, 4), 0),
(date(2022, 1, 2), 2, date(2022, 1, 2), 0),
(date(2022, 1, 2), 2, date(2022, 1, 3), 0),
(date(2022, 1, 2), 2, date(2022, 1, 4), 0),
(date(2022, 1, 3), 1, date(2022, 1, 3), 1),
(date(2022, 1, 3), 1, date(2022, 1, 4), 0),
(date(2022, 1, 3), 2, date(2022, 1, 3), 0),
(date(2022, 1, 3), 2, date(2022, 1, 4), 0),
(date(2022, 1, 4), 1, date(2022, 1, 4), 0),
(date(2022, 1, 4), 2, date(2022, 1, 4), 0)
]
ColumnSchema = StructType([
StructField("CalendarDate", DateType()),
StructField("ID", IntegerType()),
StructField("Date2", DateType()),
StructField("Booked", IntegerType()),
])
dfSampleData = spark.createDataFrame(SampleData, schema=ColumnSchema)
dfSampleDataWithRowNumbers = dfSampleData.selectExpr(
"*",
"ROW_NUMBER() OVER (PARTITION BY CalendarDate, ID ORDER BY Date2) AS rn1",
"ROW_NUMBER() OVER (PARTITION BY CalendarDate, ID, Booked ORDER BY Date2) AS rn2"
).selectExpr(
"*",
"rn1 - rn2 AS GapsAndIslandsGroupNo"
).selectExpr(
"*",
"ROW_NUMBER() OVER (PARTITION BY CalendarDate, ID, GapsAndIslandsGroupNo ORDER BY Date2) AS ExpectedOutputRaw"
).selectExpr(
"*",
"CASE Booked WHEN 0 THEN ExpectedOutputRaw ELSE 0 END AS `Expected Output`"
).orderBy("CalendarDate", "ID", "Date2")
dfSampleDataWithRowNumbers.show()
Output:
+------------+---+----------+------+---+---+---------------------+-----------------+---------------+
|CalendarDate| ID| Date2|Booked|rn1|rn2|GapsAndIslandsGroupNo|ExpectedOutputRaw|Expected Output|
+------------+---+----------+------+---+---+---------------------+-----------------+---------------+
| 2022-01-01| 1|2022-01-01| 0| 1| 1| 0| 1| 1|
| 2022-01-01| 1|2022-01-02| 1| 2| 1| 1| 1| 0|
| 2022-01-01| 1|2022-01-03| 1| 3| 2| 1| 2| 0|
| 2022-01-01| 1|2022-01-04| 0| 4| 2| 2| 1| 1|
| 2022-01-01| 2|2022-01-01| 0| 1| 1| 0| 1| 1|
| 2022-01-01| 2|2022-01-02| 0| 2| 2| 0| 2| 2|
| 2022-01-01| 2|2022-01-03| 0| 3| 3| 0| 3| 3|
| 2022-01-01| 2|2022-01-04| 0| 4| 4| 0| 4| 4|
| 2022-01-02| 1|2022-01-02| 1| 1| 1| 0| 1| 0|
| 2022-01-02| 1|2022-01-03| 1| 2| 2| 0| 2| 0|
| 2022-01-02| 1|2022-01-04| 0| 3| 1| 2| 1| 1|
| 2022-01-02| 2|2022-01-02| 0| 1| 1| 0| 1| 1|
| 2022-01-02| 2|2022-01-03| 0| 2| 2| 0| 2| 2|
| 2022-01-02| 2|2022-01-04| 0| 3| 3| 0| 3| 3|
| 2022-01-03| 1|2022-01-03| 1| 1| 1| 0| 1| 0|
| 2022-01-03| 1|2022-01-04| 0| 2| 1| 1| 1| 1|
| 2022-01-03| 2|2022-01-03| 0| 1| 1| 0| 1| 1|
| 2022-01-03| 2|2022-01-04| 0| 2| 2| 0| 2| 2|
| 2022-01-04| 1|2022-01-04| 0| 1| 1| 0| 1| 1|
| 2022-01-04| 2|2022-01-04| 0| 1| 1| 0| 1| 1|
+------------+---+----------+------+---+---+---------------------+-----------------+---------------+
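For reference, the same logic can be written with the pyspark.sql Window API instead of SQL strings inside selectExpr (an equivalent sketch, reusing dfSampleData from above):
from pyspark.sql import functions as F, Window as W

# Same gaps-and-islands logic with column expressions instead of selectExpr.
w_all = W.partitionBy("CalendarDate", "ID").orderBy("Date2")
w_booked = W.partitionBy("CalendarDate", "ID", "Booked").orderBy("Date2")

df_out = (
    dfSampleData
    .withColumn("grp", F.row_number().over(w_all) - F.row_number().over(w_booked))
    .withColumn(
        "Expected Output",
        F.when(F.col("Booked") == 0,
               F.row_number().over(W.partitionBy("CalendarDate", "ID", "grp").orderBy("Date2")))
         .otherwise(F.lit(0)),
    )
    .orderBy("CalendarDate", "ID", "Date2")
)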
I need to get the value of the previous group in Spark and assign it to the current group.
How can I achieve that?
I must order by count instead of TEXT_NUM.
Ordering by TEXT_NUM is not possible because events repeat in time, as counts 10 and 11 show.
I'm trying with the following code:
val spark = SparkSession.builder()
.master("spark://spark-master:7077")
.getOrCreate()
val df = spark
.createDataFrame(
Seq[(Int, String, Int)](
(0, "", 0),
(1, "", 0),
(2, "A", 1),
(3, "A", 1),
(4, "A", 1),
(5, "B", 2),
(6, "B", 2),
(7, "B", 2),
(8, "C", 3),
(9, "C", 3),
(10, "A", 1),
(11, "A", 1)
))
.toDF("count", "TEXT", "TEXT_NUM")
val w1 = Window
.orderBy("count")
.rangeBetween(Window.unboundedPreceding, -1)
df
.withColumn("LAST_VALUE", last("TEXT_NUM").over(w1))
.orderBy("count")
.show()
Result:
+-----+----+--------+----------+
|count|TEXT|TEXT_NUM|LAST_VALUE|
+-----+----+--------+----------+
| 0| | 0| null|
| 1| | 0| 0|
| 2| A| 1| 0|
| 3| A| 1| 1|
| 4| A| 1| 1|
| 5| B| 2| 1|
| 6| B| 2| 2|
| 7| B| 2| 2|
| 8| C| 3| 2|
| 9| C| 3| 3|
| 10| A| 1| 3|
| 11| A| 1| 1|
+-----+----+--------+----------+
Desired result:
+-----+----+--------+----------+
|count|TEXT|TEXT_NUM|LAST_VALUE|
+-----+----+--------+----------+
| 0| | 0| null|
| 1| | 0| null|
| 2| A| 1| 0|
| 3| A| 1| 0|
| 4| A| 1| 0|
| 5| B| 2| 1|
| 6| B| 2| 1|
| 7| B| 2| 1|
| 8| C| 3| 2|
| 9| C| 3| 2|
| 10| A| 1| 3|
| 11| A| 1| 3|
+-----+----+--------+----------+
Consider using the window function last(columnName, ignoreNulls) to backfill nulls in a column that holds the previous "text_num" only at group boundaries, as shown below:
val df = Seq(
(0, "", 0), (1, "", 0),
(2, "A", 1), (3, "A", 1), (4, "A", 1),
(5, "B", 2), (6, "B", 2), (7, "B", 2),
(8, "C", 3), (9, "C", 3),
(10, "A", 1), (11, "A", 1)
).toDF("count", "text", "text_num")
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, last, when}
val w1 = Window.orderBy("count")
val w2 = w1.rowsBetween(Window.unboundedPreceding, 0)
df.
withColumn("prev_num", lag("text_num", 1).over(w1)).
withColumn("last_change", when($"text_num" =!= $"prev_num", $"prev_num")).
withColumn("last_value", last("last_change", ignoreNulls=true).over(w2)).
show
/*
+-----+----+--------+--------+-----------+----------+
|count|text|text_num|prev_num|last_change|last_value|
+-----+----+--------+--------+-----------+----------+
| 0| | 0| null| null| null|
| 1| | 0| 0| null| null|
| 2| A| 1| 0| 0| 0|
| 3| A| 1| 1| null| 0|
| 4| A| 1| 1| null| 0|
| 5| B| 2| 1| 1| 1|
| 6| B| 2| 2| null| 1|
| 7| B| 2| 2| null| 1|
| 8| C| 3| 2| 2| 2|
| 9| C| 3| 3| null| 2|
| 10| A| 1| 3| 3| 3|
| 11| A| 1| 1| null| 3|
+-----+----+--------+--------+-----------+----------+
*/
The intermediate columns are kept in the output for reference; just drop them if they aren't needed.
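For reference, the same backfill pattern looks like this in PySpark (a sketch, assuming an equivalent DataFrame df with the same count/text_num columns):
from pyspark.sql import functions as F, Window as W

# Mark the previous group's text_num at each group boundary, then carry it forward.
w1 = W.orderBy("count")
w2 = w1.rowsBetween(W.unboundedPreceding, 0)

result = (
    df.withColumn("prev_num", F.lag("text_num", 1).over(w1))
      .withColumn("last_change", F.when(F.col("text_num") != F.col("prev_num"), F.col("prev_num")))
      .withColumn("last_value", F.last("last_change", ignorenulls=True).over(w2))
)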
+------+-----+
|userID|entID|
+------+-----+
| 0| 5|
| 0| 15|
| 1| 7|
| 1| 3|
| 2| 3|
| 2| 4|
| 2| 5|
| 2| 9|
| 3| 25|
+------+-----+
I want the result as {0->(5,15), 1->(7,3),..}
Any help would be appreciated.
Here is your table again:
val df = Seq(
(0, 5),
(0, 15),
(1, 7),
(1, 3),
(2, 3),
(2, 4),
(2, 5),
(2, 9),
(3, 25)
).toDF("userId", "entId")
df.show()
Outputs:
+------+-----+
|userId|entId|
+------+-----+
| 0| 5|
| 0| 15|
| 1| 7|
| 1| 3|
| 2| 3|
| 2| 4|
| 2| 5|
| 2| 9|
| 3| 25|
+------+-----+
Now you can group by userId and then collect entId into lists, aliasing the resulting column of lists as entIds:
import org.apache.spark.sql.functions._
val entIdsForUserId = df.
groupBy($"userId").
agg(collect_list($"entId").alias("entIds"))
entIdsForUserId.show()
Output:
+------+------------+
|userId| entIds|
+------+------------+
| 1| [7, 3]|
| 3| [25]|
| 2|[3, 4, 5, 9]|
| 0| [5, 15]|
+------+------------+
The order of the elements in the collected lists after groupBy is not specified. Depending on what you want to do with them, you could additionally sort them.
You can collect the result into a single map on the driver:
val m = entIdsForUserId.
map(r => (r.getAs[Int](0), r.getAs[Seq[Int]](1))).
collect.toMap
this will give you:
Map(1 -> List(7, 3), 3 -> List(25), 2 -> List(3, 4, 5, 9), 0 -> List(5, 15))
One approach would be to convert the Dataset to an RDD and perform a groupByKey. To obtain the result as a Map, you'll need to collect the grouped RDD, provided the dataset isn't too big:
val ds = Seq(
(0, 5), (0, 15), (1, 7), (1, 3),
(2, 3), (2, 4), (2, 5), (2, 9), (3, 25)
).toDF("userID", "entID").as[(Int, Int)]
// ds: org.apache.spark.sql.Dataset[(Int, Int)] =[userID: int, entID: int]
val map = ds.rdd.groupByKey.collectAsMap
// map: scala.collection.Map[Int,Iterable[Int]] = Map(
// 2 -> CompactBuffer(3, 4, 5, 9), 1 -> CompactBuffer(7, 3),
// 3 -> CompactBuffer(25), 0 -> CompactBuffer(5, 15)
// )