spark aggregation with sorted rows that returns a row's value before a condition is met - pyspark

I have some data (invoice data). Assuming id ~ date and id is what I'm sorting by:
fid, id, due, overdue
0, 1, 5, 0
0, 3, 5, 5
0, 13, 5, 10
0, 14, 5, 0
1, 5, 5, 0
1, 26, 5, 5
1, 27, 5, 10
1, 38, 5, 0
I want to:
1. remove all rows under some arbitrary date-id (id = 20)
2. group_by fid and sort by id within the group
3. (major) aggregate a new column overdue_id that is the id of the row before the first row in the group that has a nonzero value for overdue
4. (minor) fill a row for every fid even if all its rows are filtered out by step 1
So the output would be (with a default value of null):
fid, overdue_id
0, 1
1, null
because for fid = 0, the first id with nonzero overdue is id = 3, and I'd like to output the id of the row before that in id/date order, which is id = 1.
I have group_by('fid').withColumn('overdue_id', ...) so far, and want to use functions like agg, min, and when, but I'm not sure what comes next, as I am very new to the docs.

You can use the following steps to solve this:
import pyspark.sql.functions as F
from pyspark.sql import Window

# added fid = 2 to cover the "all overdue = 0" condition
fid = [0, 1, 2] * 4
fid.sort()
dateId = [1, 3, 13, 14, 5, 26, 27, 28]
dateId.extend(range(90, 94))   # four extra ids for fid = 2, so all lists have 12 entries
due = [5] * 12
overdue = [0, 5, 10, 0] * 2
overdue.extend([0, 0, 0, 0])
data = list(zip(fid, dateId, due, overdue))
df = spark.createDataFrame(data, schema=["fid", "dateId", "due", "overdue"])

win = Window.partitionBy(df['fid']).orderBy(df['dateId'])
res = df \
    .filter(F.col("dateId") != 20) \
    .withColumn("lag_id", F.lag(F.col("dateId"), 1).over(win)) \
    .withColumn("overdue_id", F.when(F.col("overdue") != 0, F.col("lag_id")).otherwise(None)) \
    .groupBy("fid") \
    .agg(F.min("overdue_id").alias("min_overdue_id"))
res.show()
+---+--------------+
|fid|min_overdue_id|
+---+--------------+
| 0| 1|
| 1| 5|
| 2| null|
+---+--------------+

You need to use the lag and window functions. Before we begin: why is your example output showing null for fid 1? The first nonzero overdue is at id 26, so the id before that is 5; shouldn't the result be 5? Unless you need something else, you can try this.
import pyspark.sql.functions as F
from pyspark.sql import Window

tst = sqlContext.createDataFrame([(0, 1, 5, 0), (0, 20, 5, 0), (0, 30, 5, 5), (0, 13, 5, 10), (0, 14, 5, 0), (1, 5, 5, 0), (1, 26, 5, 5), (1, 27, 5, 10), (1, 38, 5, 0)], schema=["fid", "id", "due", "overdue"])
# To filter data
tst_f = tst.where('id!=20')
# Define window function
w = Window.partitionBy('fid').orderBy('id')
tst_lag = tst_f.withColumn('overdue_id', F.lag('id').over(w))
# Remove rows with 0 overdue
tst_od = tst_lag.where('overdue!=0')
# Find the row before the first non-zero overdue
tst_res = tst_od.groupby('fid').agg(F.first('overdue_id').alias('overdue_id'))
tst_res.show()
+---+----------+
|fid|overdue_id|
+---+----------+
| 0| 1|
| 1| 5|
+---+----------+
If you are wary about using the first function, or just want to be confident about avoiding ghost issues, you can try the more performance-expensive option below.
# Create a copy to avoid an ambiguous join and select the minimum id from the non-zero overdue rows
tst_min = tst_od.withColumn("dummy", F.lit('dummy')).groupby('fid').agg(F.min('id').alias('id_min'))
# Join this with the dataframe to get results
tst_join = tst_od.join(tst_min, on=tst_od.id == tst_min.id_min, how='right')
tst_join.show()
+---+---+---+-------+----------+---+------+
|fid| id|due|overdue|overdue_id|fid|id_min|
+---+---+---+-------+----------+---+------+
| 1| 26| 5| 5| 5| 1| 26|
| 0| 13| 5| 10| 1| 0| 13|
+---+---+---+-------+----------+---+------+
This way you can see all the information. You can then filter the relevant information out of this dataframe using the filter() or where() methods.
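For example, a quick follow-up (just an illustration, using the column references from the join above) to trim tst_join down to the requested shape:
# keep only the grouping key and the looked-up id
tst_join.select(tst_od.fid, 'overdue_id').show()
# -> fid 1 with overdue_id 5, fid 0 with overdue_id 1, matching the join output above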

Related

PySpark: How to create a column based on the previous value from the same column?

Dear PySpark community:
I would like to calculate the estimate_day_to_sustain before supply. The original code is written in SAS using the 'retain' statement; however, I cannot find a way to do the same in PySpark. Please help, thanks!
input data:
day, supply
1, 3
3, 1
9, 5
10, 9
11, 1
output data:
day, supply, estimate_day_to_sustain
1, 3, 1
3, 1, 4
9, 5, 9
10, 9, 14
11, 1, 23
Algorithm:
On day 1: current estimate_day_to_sustain = current day
On other days:
1> if previous estimate_day_to_sustain + previous supply <= current day,
then current estimate_day_to_sustain = current day
2> else current estimate_day_to_sustain = previous estimate_day_to_sustain + previous supply
Explanation of the algorithm:
on day 1: the estimate_day_to_sustain is 1; by the end of the day, 3 days of supply arrive
on day 3, we have 1+3=4 days of supply (from the previous row), and it's day 3, so the estimate_day_to_sustain is 4; by the end of the day, 1 day of supply arrives
on day 9, we have 4+1=5 days of supply (from the previous row), but it's already day 9, so the estimate_day_to_sustain is 9 (this is the tricky part); by the end of the day, 5 days of supply arrive
on day 10, we have 9+5=14 days of supply (from the previous row), and it's day 10, so the estimate_day_to_sustain is 14; by the end of the day, 9 days of supply arrive
on day 11, we have 14+9=23 days of supply (from the previous row), and it's day 11, so the estimate_day_to_sustain is 23; by the end of the day, 1 day of supply arrives
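To make the recurrence concrete, here is the algorithm above worked through in plain Python on the sample data (I can write it as a loop like this; I just don't know how to express it in PySpark):
days = [1, 3, 9, 10, 11]
supply = [3, 1, 5, 9, 1]
estimates = []
for i, day in enumerate(days):
    if i == 0:
        est = day                              # day 1 rule
    else:
        carry = estimates[-1] + supply[i - 1]  # previous estimate + previous supply
        est = day if carry <= day else carry
    estimates.append(est)
print(estimates)  # [1, 4, 9, 14, 23]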
Two ways to do it -
using arrays, aggregate() and lambda function, inspired by this answer (SPARK 3.1+)
using rdd, flatMapValues() and python function
Using your input data:
data_sdf.show()
# +---+------+
# |day|supply|
# +---+------+
# | 1| 3|
# | 3| 1|
# | 9| 5|
# | 10| 9|
# | 11| 1|
# +---+------+
Spark does not retain a sort order like SAS data steps do. So, we will have to sort the array or list wherever required.
Using arrays, aggregate() and lambda function
SPARK 3.1+
create day-supply structs and collect it to create an array of the said structs. The array_sort() is used to order the array of structs by the day field (first element within struct). The aggregate() takes in an initial value and applies a function to each element of the provided array. So, the initial value is the array's first struct, and the lambda function is applied to each of the remaining structs. The array_union() is used to append the newly created array, after applying the lambda function, to the initial value recursively. Finally, the inline() function is used to create separate columns from the newly created array of structs. Detailed explanation on its workings can be found in this answer.
from pyspark.sql import functions as func

data2_sdf = data_sdf. \
    withColumn('day_supply_struct', func.struct(func.col('day'), func.col('supply'))). \
    groupBy(func.lit(1).alias('group_field')). \
    agg(func.array_sort(func.collect_list('day_supply_struct')).alias('ds_struct_arr'))
# +-----------+------------------------------------------+
# |group_field|ds_struct_arr |
# +-----------+------------------------------------------+
# |1 |[{1, 3}, {3, 1}, {9, 5}, {10, 9}, {11, 1}]|
# +-----------+------------------------------------------+
# create new field within the struct
data3_sdf = data2_sdf. \
    withColumn(
        'arr_struct_w_est_days',
        func.aggregate(
            func.slice(func.col('ds_struct_arr'), 2, data_sdf.count()),
            func.array(func.col('ds_struct_arr')[0].withField('estimate_days', func.col('ds_struct_arr')[0]['day'])),
            lambda x, y: func.array_union(
                x,
                func.array(
                    y.withField(
                        'estimate_days',
                        func.when(func.element_at(x, -1)['estimate_days'] + func.element_at(x, -1)['supply'] <= y['day'], y['day']).
                        otherwise(func.element_at(x, -1)['estimate_days'] + func.element_at(x, -1)['supply'])
                    )
                )
            )
        )
    )
# +-----------+------------------------------------------+-----------------------------------------------------------+
# |group_field|ds_struct_arr |arr_struct_w_est_days |
# +-----------+------------------------------------------+-----------------------------------------------------------+
# |1 |[{1, 3}, {3, 1}, {9, 5}, {10, 9}, {11, 1}]|[{1, 3, 1}, {3, 1, 4}, {9, 5, 9}, {10, 9, 14}, {11, 1, 23}]|
# +-----------+------------------------------------------+-----------------------------------------------------------+
# create columns using the struct fields
data3_sdf. \
selectExpr('inline(arr_struct_w_est_days)'). \
show()
# +---+------+-------------+
# |day|supply|estimate_days|
# +---+------+-------------+
# | 1| 3| 1|
# | 3| 1| 4|
# | 9| 5| 9|
# | 10| 9| 14|
# | 11| 1| 23|
# +---+------+-------------+
using rdd, flatMapValues() and python function
create a python function to calculate the estimate days while keeping track of the previously calculated values. It takes in a group of row-lists (groupBy() is used to identify the grouping) and returns a list of row-lists.
# best to ship this function to all executors in case of huge datasets
def estimateDaysCalc(groupedRows):
    res = []
    frstRec = True
    for row in groupedRows:
        if frstRec:
            frstRec = False
            # the first day will have a static value
            estimate_days = row.day
        else:
            if prev_est_day + prev_supply <= row.day:
                estimate_days = row.day
            else:
                estimate_days = prev_est_day + prev_supply
        # keep track of the current calcs for next row calcs
        prev_est_day = estimate_days
        prev_supply = row.supply
        prev_day = row.day
        res.append([item for item in row] + [estimate_days])
    return res
run the function on the RDD using flatMapValues() and extract the values. A sort on the day field is required, and a sorted() is used to sort the group of row-lists by the field (ok.day).
data_vals = data_sdf.rdd. \
    groupBy(lambda gk: 1). \
    flatMapValues(lambda r: estimateDaysCalc(sorted(r, key=lambda ok: ok.day))). \
    values()
create schema for the new values
data_schema = data_sdf. \
    withColumn('dropme', func.lit(None).cast('int')). \
    drop('dropme'). \
    schema. \
    add('estimate_days', 'integer')
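If the withColumn/drop trick for copying the schema feels indirect, an equivalent way (assuming an integer type is fine for estimate_days) is to build the StructType directly:
from pyspark.sql.types import IntegerType, StructField, StructType

# same result: the original fields plus the new estimate_days field
data_schema = StructType(data_sdf.schema.fields + [StructField('estimate_days', IntegerType())])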
create dataframe using the newly created values and schema
new_data_sdf = spark.createDataFrame(data_vals, data_schema)
new_data_sdf.show()
# +---+------+-------------+
# |day|supply|estimate_days|
# +---+------+-------------+
# | 1| 3| 1|
# | 3| 1| 4|
# | 9| 5| 9|
# | 10| 9| 14|
# | 11| 1| 23|
# +---+------+-------------+

forward fill nulls with latest non null value over each column except first two

I have a dataframe that, after my pivot, contains rows with null values. I need to replace the null values with the latest non-null value, and I need to do this over each column in the df except the first two.
Sample:
columns = ['date', 'group', 'value', 'value2']
data = [\
('2020-1-1','b', 5, 20),\
('2020-2-1','a', None, 15),\
('2020-3-1','a', 20, None),\
('2020-3-1','b', 10, None),\
('2020-2-1','b', None, None),\
('2020-1-1','a', None, None),\
('2020-4-1','b', None, 100)]
sdf = spark.createDataFrame(data, columns)
Window function for ffill logic
from pyspark.sql import Window

# fill nulls with previous non-null value
plist = ['group']
ffill = Window.partitionBy(*plist).orderBy('date').rowsBetween(Window.unboundedPreceding, Window.currentRow)
Goal: I basically want to overwrite the value and value2 columns by replacing the nulls. This is a sample, but my actual df has over 30 columns. How can I loop through all of them, again, except columns 1 & 2?
Use the last function with ignorenulls set to True to get the last non-null value within a window (if all values are null, it returns null). To loop through all columns except the first two, you can use a list comprehension.
from pyspark.sql.functions import col, last
# all colums except the first two
cols = sdf.columns[2:]
sdf = sdf.select('date', 'group',
*[last(col(c), ignorenulls=True).over(ffill).alias(c) for c in cols])
sdf.show()
# +--------+-----+-----+------+
# | date|group|value|value2|
# +--------+-----+-----+------+
# |2020-1-1| b| 5| 20|
# |2020-2-1| b| 5| 20|
# |2020-3-1| b| 10| 20|
# |2020-4-1| b| 10| 100|
# |2020-1-1| a| null| null|
# |2020-2-1| a| null| 15|
# |2020-3-1| a| 20| 15|
# +--------+-----+-----+------+
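If you'd rather not hard-code 'date' and 'group', a small variation of the same select (just pulling the leading column names from sdf.columns) could be:
# keep the first two columns as-is, forward-fill the rest
keep_cols = sdf.columns[:2]
fill_cols = sdf.columns[2:]
sdf = sdf.select(*keep_cols,
                 *[last(col(c), ignorenulls=True).over(ffill).alias(c) for c in fill_cols])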

Looking to subtract every value in a row based on the value of a separate DF

As the title states, I would like to subtract each value of a specific column by the mean of that column.
Here is my code attempt:
val test = moviePairs.agg(avg(col("rating1")).alias("avgX"), avg(col("rating2")).alias("avgY"))
val subMean = moviePairs.withColumn("meanDeltaX", col("rating1") - test.select("avgX").collect())
.withColumn("meanDeltaY", col("rating2") - test.select("avgY").collect())
subMean.show()
You can either use Spark's DataFrame functions or a mere SQL query to a DataFrame to aggregate the values of the means for the columns you are focusing on (rating1, rating2).
import org.apache.spark.sql.functions.{avg, col}

val moviePairs = spark.createDataFrame(
  Seq(
    ("Moonlight", 7, 8),
    ("Lord Of The Drinks", 10, 1),
    ("The Disaster Artist", 3, 5),
    ("Airplane!", 7, 9),
    ("2001", 5, 1)
  )
).toDF("movie", "rating1", "rating2")
// find the means for each column and isolate the first (and only) row to get their values
val means = moviePairs.agg(avg("rating1"), avg("rating2")).head()
// alternatively, by using a simple SQL query:
// moviePairs.createOrReplaceTempView("movies")
// val means = spark.sql("select AVG(rating1), AVG(rating2) from movies").head()
val subMean = moviePairs.withColumn("meanDeltaX", col("rating1") - means.getDouble(0))
.withColumn("meanDeltaY", col("rating2") - means.getDouble(1))
subMean.show()
Output for the test input DataFrame moviePairs (with the good ol' double precision loss which you can manage as seen here):
+-------------------+-------+-------+-------------------+-------------------+
| movie|rating1|rating2| meanDeltaX| meanDeltaY|
+-------------------+-------+-------+-------------------+-------------------+
| Moonlight| 7| 8| 0.5999999999999996| 3.2|
| Lord Of The Drinks| 10| 1| 3.5999999999999996| -3.8|
|The Disaster Artist| 3| 5|-3.4000000000000004|0.20000000000000018|
| Airplane!| 7| 9| 0.5999999999999996| 4.2|
| 2001| 5| 1|-1.4000000000000004| -3.8|
+-------------------+-------+-------+-------------------+-------------------+

pyspark Transpose dataframe

I have a dataframe as given below
ID, Code_Num, Code, Code1, Code2, Code3
10, 1, A1005*B1003, A1005, B1003, null
12, 2, A1007*D1008*C1004, A1007, D1008, C1004
I need help on transposing the above dataset, and output should be displayed as below.
ID, Code_Num, Code, Code_T
10, 1, A1005*B1003, A1005
10, 1, A1005*B1003, B1003
12, 2, A1007*D1008*C1004, A1007
12, 2, A1007*D1008*C1004, D1008
12, 2, A1007*D1008*C1004, C1004
Step 1: Creating the DataFrame.
values = [(10, 'A1005*B1003', 'A1005', 'B1003', None),(12, 'A1007*D1008*C1004', 'A1007', 'D1008', 'C1004')]
df = sqlContext.createDataFrame(values,['ID','Code','Code1','Code2','Code3'])
df.show()
+---+-----------------+-----+-----+-----+
| ID| Code|Code1|Code2|Code3|
+---+-----------------+-----+-----+-----+
| 10| A1005*B1003|A1005|B1003| null|
| 12|A1007*D1008*C1004|A1007|D1008|C1004|
+---+-----------------+-----+-----+-----+
Step 2: Explode the DataFrame -
from pyspark.sql.functions import array, col, explode, lit, struct

def to_transpose(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"
    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])
df = to_transpose(df, ["ID","Code"]).drop('key').withColumnRenamed("val","Code_T")
df.show()
+---+-----------------+------+
| ID| Code|Code_T|
+---+-----------------+------+
| 10| A1005*B1003| A1005|
| 10| A1005*B1003| B1003|
| 10| A1005*B1003| null|
| 12|A1007*D1008*C1004| A1007|
| 12|A1007*D1008*C1004| D1008|
| 12|A1007*D1008*C1004| C1004|
+---+-----------------+------+
In case you only want non-null values in column Code_T, just run the statement below:
df = df.where(col('Code_T').isNotNull())
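As a side note, if the wide Code1/Code2/Code3 columns are only an intermediate step, a rough alternative sketch (splitting the original Code column and exploding it, so no null rows appear in the first place) could be:
from pyspark.sql.functions import explode, split

df2 = sqlContext.createDataFrame(values, ['ID', 'Code', 'Code1', 'Code2', 'Code3'])
df2 = df2.select('ID', 'Code', explode(split('Code', '\\*')).alias('Code_T'))
df2.show()
# should yield one row per code, e.g. (10, A1005*B1003, A1005), (10, A1005*B1003, B1003), ...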

pyspark/dataframe - creating a nested structure

I'm using PySpark with DataFrames and would like to create a nested structure as shown below.
Before:
Column 1 | Column 2 | Column 3
--------------------------------
A | B | 1
A | B | 2
A | C | 1
After:
Column 1 | Column 4
--------------------------------
A | [B : [1,2]]
A | [C : [1]]
Is this doable?
I don't think you can get that exact output, but you can come close. The problem is the key names in your column 4. In Spark, structs need to have a fixed set of fields known in advance. But let's leave that for later; first, the aggregation:
import pyspark
from pyspark.sql import functions as F
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
data = [('A', 'B', 1), ('A', 'B', 2), ('A', 'C', 1)]
columns = ['Column1', 'Column2', 'Column3']
data = spark.createDataFrame(data, columns)
data.createOrReplaceTempView("data")
data.show()
# Result
+-------+-------+-------+
|Column1|Column2|Column3|
+-------+-------+-------+
| A| B| 1|
| A| B| 2|
| A| C| 1|
+-------+-------+-------+
nested = spark.sql("SELECT Column1, Column2, STRUCT(COLLECT_LIST(Column3) AS data) AS Column4 FROM data GROUP BY Column1, Column2")
nested.toJSON().collect()
# Result
['{"Column1":"A","Column2":"C","Column4":{"data":[1]}}',
'{"Column1":"A","Column2":"B","Column4":{"data":[1,2]}}']
Which is almost what you want, right? The problem is that if you do not know your key names in advance (that is, the values in Column 2), Spark cannot determine the structure of your data. Also, I am not entirely sure how you can use the value of a column as key for a structure unless you use a UDF (maybe with a PIVOT?):
datatype = 'struct<B:array<bigint>,C:array<bigint>>'  # Add any other potential keys here.

@F.udf(datatype)
def replace_struct_name(column2_value, column4_value):
    return {column2_value: column4_value['data']}

nested.withColumn('Column4', replace_struct_name(F.col("Column2"), F.col("Column4"))).toJSON().collect()
# Output
['{"Column1":"A","Column2":"C","Column4":{"C":[1]}}',
'{"Column1":"A","Column2":"B","Column4":{"B":[1,2]}}']
This of course has the drawback that the number of keys must be discrete and known in advance, otherwise other key values will be silently ignored.
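If the set of keys truly is not known in advance, a rough alternative sketch (swapping the struct for a map column via create_map, which avoids the fixed-schema requirement; not what the UDF above does) could be:
nested_map = data.groupBy('Column1', 'Column2') \
    .agg(F.collect_list('Column3').alias('values')) \
    .withColumn('Column4', F.create_map(F.col('Column2'), F.col('values'))) \
    .select('Column1', 'Column4')
nested_map.toJSON().collect()
# e.g. ['{"Column1":"A","Column4":{"C":[1]}}', '{"Column1":"A","Column4":{"B":[1,2]}}']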
First, reproducible example of your dataframe.
js = [{"col1": "A", "col2":"B", "col3":1},{"col1": "A", "col2":"B", "col3":2},{"col1": "A", "col2":"C", "col3":1}]
jsrdd = sc.parallelize(js)
sqlContext = SQLContext(sc)
jsdf = sqlContext.read.json(jsrdd)
jsdf.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| B| 1|
| A| B| 2|
| A| C| 1|
+----+----+----+
Now, lists are not stored as key value pairs. You can either use a dictionary or simple collect_list() after doing a groupby on column2.
jsdf.groupby(['col1', 'col2']).agg(F.collect_list('col3')).show()
+----+----+------------------+
|col1|col2|collect_list(col3)|
+----+----+------------------+
| A| C| [1]|
| A| B| [1, 2]|
+----+----+------------------+