Forward Fill New Row to Account for Missing Dates - pyspark

I currently have a dataset grouped into hourly increments by a variable "aggregator". There are gaps in this hourly data, and what I would ideally like to do is forward fill the missing rows using the prior row for the same aggregator value.
I've seen some solutions to similar problems using pandas, but ideally I would like to understand how best to approach this with a PySpark UDF.
I'd initially thought about something like the following with pandas, but struggled to implement even a first pass that just fills while ignoring the aggregator:
df = df.set_index(keys=[df.timestamp]).resample('1H', fill_method='ffill')
But ideally I'd like to avoid using pandas.
In the example below I have two missing rows of hourly data (labeled as MISSING).
| timestamp | aggregator |
|----------------------|------------|
| 2018-12-27T09:00:00Z | A |
| 2018-12-27T10:00:00Z | A |
| MISSING | MISSING |
| 2018-12-27T12:00:00Z | A |
| 2018-12-27T13:00:00Z | A |
| 2018-12-27T09:00:00Z | B |
| 2018-12-27T10:00:00Z | B |
| 2018-12-27T11:00:00Z | B |
| MISSING | MISSING |
| 2018-12-27T13:00:00Z | B |
| 2018-12-27T14:00:00Z | B |
The expected output here would be the following:
| timestamp | aggregator |
|----------------------|------------|
| 2018-12-27T09:00:00Z | A |
| 2018-12-27T10:00:00Z | A |
| 2018-12-27T11:00:00Z | A |
| 2018-12-27T12:00:00Z | A |
| 2018-12-27T13:00:00Z | A |
| 2018-12-27T09:00:00Z | B |
| 2018-12-27T10:00:00Z | B |
| 2018-12-27T11:00:00Z | B |
| 2018-12-27T12:00:00Z | B |
| 2018-12-27T13:00:00Z | B |
| 2018-12-27T14:00:00Z | B |
Appreciate the help.
Thanks.

Here is a solution to fill in the missing hours, using a window, lag, and a UDF. With a little modification it can be extended to handle gaps spanning days as well.
from pyspark.sql.window import Window
from pyspark.sql.types import *
from pyspark.sql.functions import *
from dateutil.relativedelta import relativedelta
def missing_hours(t1, t2):
    # hours strictly between the previous row (t2) and the current row (t1);
    # note this only covers gaps within the same day
    return [t1 + relativedelta(hours=-x) for x in range(1, t1.hour - t2.hour)]
missing_hours_udf = udf(missing_hours, ArrayType(TimestampType()))
df = spark.read.csv('dates.csv',header=True,inferSchema=True)
window = Window.partitionBy("aggregator").orderBy("timestamp")
df_missing = df.withColumn("prev_timestamp", lag(col("timestamp"), 1, None).over(window))\
    .filter(col("prev_timestamp").isNotNull())\
    .withColumn("timestamp", explode(missing_hours_udf(col("timestamp"), col("prev_timestamp"))))\
    .drop("prev_timestamp")
df.union(df_missing).orderBy("aggregator", "timestamp").show()
which results in:
+-------------------+----------+
| timestamp|aggregator|
+-------------------+----------+
|2018-12-27 09:00:00| A|
|2018-12-27 10:00:00| A|
|2018-12-27 11:00:00| A|
|2018-12-27 12:00:00| A|
|2018-12-27 13:00:00| A|
|2018-12-27 09:00:00| B|
|2018-12-27 10:00:00| B|
|2018-12-27 11:00:00| B|
|2018-12-27 12:00:00| B|
|2018-12-27 13:00:00| B|
|2018-12-27 14:00:00| B|
+-------------------+----------+
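The UDF above only bridges gaps within the same day (it compares the .hour fields). On Spark 2.4+ you can get the same result, including gaps that cross midnight, without a UDF by generating the missing hours with the built-in sequence function. A sketch, assuming the same df with timestamp and aggregator columns:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

window = Window.partitionBy("aggregator").orderBy("timestamp")

gaps = (
    df.withColumn("prev_timestamp", F.lag("timestamp", 1).over(window))
      # keep only rows preceded by a gap of more than one hour
      # (a null prev_timestamp also fails this comparison and is dropped)
      .filter(F.col("timestamp") > F.col("prev_timestamp") + F.expr("interval 1 hour"))
      # generate every missing hour strictly between the two rows
      .withColumn(
          "timestamp",
          F.explode(F.expr(
              "sequence(prev_timestamp + interval 1 hour, "
              "timestamp - interval 1 hour, interval 1 hour)"
          )),
      )
      .drop("prev_timestamp")
)

df.union(gaps).orderBy("aggregator", "timestamp").show()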

Related

PySpark Relate multiple rows

I have a problem where I need to relate rows to each other. I have tried many things but I am now completely stuck. I have tried partitioning, lag, and groupBys, but nothing works.
The rows below an ID 26 row will relate to the MPAN of that 26 row.
ID | MPAN | Value
---------------------
26 | 12345678 | Hello
27 | 99900234 | Bye
30 | 77563820 | Help
33 | 89898937 | Stuck
26 | 54877273 | Need a genius
29 | 54645643 | So close
30 | 22222222 | Thanks
e.g.
ID | MPAN | Value | Relation
----------------------------------------
26 | 12345678 | Hello | NULL
27 | 99900234 | Bye | 12345678
30 | 77563820 | Help | 12345678
33 | 89898937 | Stuck | 12345678
26 | 54877273 | Genius | NULL
29 | 54645643 | So close | 54877273
30 | 22222222 | Thanks | 54877273
The code below only works with the immediately previous row and does not lag back to the most recent 26 record:
df = spark.read.load('abfss://Files/', format='parquet')
df = df.withColumn("identity", F.monotonically_increasing_id())
win = Window.orderBy("identity")
condition = F.col("Prop_0") != '026'
df = df.withColumn("FlagY", F.when(condition, mpanlookup))
df.show()
As I said in my comment, you need a column to maintain the order. In your example, you used monotonically_increasing_id to create that "ordering" column, but that is a bad idea because, as the docs state, "The function is non-deterministic because its result depends on partition IDs."
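If the order of the rows in the source file is the order you care about, one common way to materialize an explicit ordering column is zipWithIndex on the underlying RDD. A sketch, assuming that file order is meaningful (idx is a name introduced here for illustration):

# Attach a sequential row index based on the current RDD order
indexed = (
    df.rdd.zipWithIndex()
      .map(lambda row_idx: row_idx[0] + (row_idx[1],))  # Row is a tuple, so this appends the index
      .toDF(df.columns + ["idx"])
)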
Assuming you have a proper "ordering" column:
df.show()
+---+---+--------+-------------+
|idx| ID| MPAN| Value|
+---+---+--------+-------------+
| 1| 26|12345678|Hello |
| 2| 27|99900234|Bye |
| 3| 30|77563820|Help |
| 4| 33|89898937|Stuck |
| 5| 26|54877273|Need a genius|
| 6| 29|54645643|So close |
| 7| 30|22222222|Thanks |
+---+---+--------+-------------+
you can simply do that with the last function:
from pyspark.sql import functions as F, Window
df.withColumn(
    "Relation",
    F.last(F.when(F.col("ID") == 26, F.col("MPAN")), ignorenulls=True).over(
        Window.orderBy("idx")
    ),
).show()
+---+---+--------+-------------+--------+
|idx| ID| MPAN| Value|Relation|
+---+---+--------+-------------+--------+
| 1| 26|12345678|Hello |12345678|
| 2| 27|99900234|Bye |12345678|
| 3| 30|77563820|Help |12345678|
| 4| 33|89898937|Stuck |12345678|
| 5| 26|54877273|Need a genius|54877273|
| 6| 29|54645643|So close |54877273|
| 7| 30|22222222|Thanks |54877273|
+---+---+--------+-------------+--------+
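For completeness, a self-contained sketch that builds the toy data above and applies the same running last:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 26, "12345678", "Hello"),
     (2, 27, "99900234", "Bye"),
     (3, 30, "77563820", "Help"),
     (4, 33, "89898937", "Stuck"),
     (5, 26, "54877273", "Need a genius"),
     (6, 29, "54645643", "So close"),
     (7, 30, "22222222", "Thanks")],
    ["idx", "ID", "MPAN", "Value"],
)

# last(..., ignorenulls=True) carries the MPAN of the most recent ID == 26 row
# forward; note that an un-partitioned Window.orderBy pulls all rows into a
# single partition, which is fine for small data but won't scale
df.withColumn(
    "Relation",
    F.last(F.when(F.col("ID") == 26, F.col("MPAN")), ignorenulls=True).over(
        Window.orderBy("idx")
    ),
).show(truncate=False)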

how to update a cell of a spark data frame

I have the following DataFrame, on which I'm trying to update a cell depending on some conditions (like SQL UPDATE ... WHERE).
For example, let's say I have the following DataFrame:
+-------+-------+
|datas |isExist|
+-------+-------+
| AA | x |
| BB | x |
| CC | O |
| CC | O |
| DD | O |
| AA | x |
| AA | x |
| AA | O |
| AA | O |
+-------+-------+
How could I update the value to X when datas = AA and isExist is O? Here is the expected output:
+-------+-------+
|IPCOPE2|IPROPE2|
+-------+-------+
| AA | x |
| BB | x |
| CC | O |
| CC | O |
| DD | O |
| AA | x |
| AA | x |
| AA | X |
| AA | X |
+-------+-------+
I could do a filter then a union, but I don't think that's the best solution. I could also use when, but in that case I would have to create a new line containing the same values except for the isExist column; in this example that is acceptable, but what if I have 20 columns?!
You can create a new column using withColumn (putting either the original or the updated value) and then drop the isExist column.
I am not sure why you do not want to use when, as it seems to be exactly what you need. The withColumn method, when used with an existing column name, simply replaces that column with the new value:
df.withColumn("isExist",
when('datas === "AA" && 'isExist === "O", "X").otherwise('isExist))
.show()
+-----+-------+
|datas|isExist|
+-----+-------+
| AA| x|
| BB| x|
| CC| O|
| CC| O|
| DD| O|
| AA| x|
| AA| x|
| AA| X|
| AA| X|
+-----+-------+
Then you can use withColumnRenamed to change the names of your columns. (e.g. df.withColumnRenamed("datas", "IPCOPE2"))
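If you are working in PySpark rather than Scala, the same idea looks like this (a sketch, assuming the column names above):

from pyspark.sql import functions as F

updated = (
    df.withColumn(
        "isExist",
        F.when((F.col("datas") == "AA") & (F.col("isExist") == "O"), "X")
         .otherwise(F.col("isExist")),
    )
    # rename to the real column names shown in the expected output
    .withColumnRenamed("datas", "IPCOPE2")
    .withColumnRenamed("isExist", "IPROPE2")
)
updated.show()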

How to iterate over pairs in a column in Scala

I have a data frame like this, imported from a parquet file:
| Store_id | Date_d_id |
| 0 | 23-07-2017 |
| 0 | 26-07-2017 |
| 0 | 01-08-2017 |
| 0 | 25-08-2017 |
| 1 | 01-01-2016 |
| 1 | 04-01-2016 |
| 1 | 10-01-2016 |
What I am trying to achieve next is to loop through each customer's dates in pairs and get the day difference. Here is what it should look like:
| Store_id | Date_d_id | Day_diff |
| 0 | 23-07-2017 | null |
| 0 | 26-07-2017 | 3 |
| 0 | 01-08-2017 | 6 |
| 0 | 25-08-2017 | 24 |
| 1 | 01-01-2016 | null |
| 1 | 04-01-2016 | 3 |
| 1 | 10-01-2016 | 6 |
And finally, I would like to reduce the data frame to the average day difference by customer:
| Store_id | avg_diff |
| 0 | 7.75 |
| 1 | 3 |
I am very new to Scala and I don't even know where to start. Any help is highly appreciated! Thanks in advance.
Also, I am using a Zeppelin notebook.
One approach would be to use lag(Date) over a Window partition and a UDF to calculate the difference in days between consecutive rows, and then group the DataFrame to get the average difference in days. Note that Date_d_id is converted to yyyy-mm-dd format for proper String ordering within the Window partitions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
  (0, "23-07-2017"),
  (0, "26-07-2017"),
  (0, "01-08-2017"),
  (0, "25-08-2017"),
  (1, "01-01-2016"),
  (1, "04-01-2016"),
  (1, "10-01-2016")
).toDF("Store_id", "Date_d_id")

def daysDiff = udf(
  (d1: String, d2: String) => {
    import java.time.LocalDate
    import java.time.temporal.ChronoUnit.DAYS
    DAYS.between(LocalDate.parse(d1), LocalDate.parse(d2))
  }
)

val df2 = df.
  withColumn("Date_ymd",
    regexp_replace($"Date_d_id", """(\d+)-(\d+)-(\d+)""", "$3-$2-$1")).
  withColumn("Prior_date_ymd",
    lag("Date_ymd", 1).over(Window.partitionBy("Store_id").orderBy("Date_ymd"))).
  withColumn("Days_diff",
    when($"Prior_date_ymd".isNotNull, daysDiff($"Prior_date_ymd", $"Date_ymd")).
    otherwise(0L))
df2.show
// +--------+----------+----------+--------------+---------+
// |Store_id| Date_d_id| Date_ymd|Prior_date_ymd|Days_diff|
// +--------+----------+----------+--------------+---------+
// | 1|01-01-2016|2016-01-01| null| 0|
// | 1|04-01-2016|2016-01-04| 2016-01-01| 3|
// | 1|10-01-2016|2016-01-10| 2016-01-04| 6|
// | 0|23-07-2017|2017-07-23| null| 0|
// | 0|26-07-2017|2017-07-26| 2017-07-23| 3|
// | 0|01-08-2017|2017-08-01| 2017-07-26| 6|
// | 0|25-08-2017|2017-08-25| 2017-08-01| 24|
// +--------+----------+----------+--------------+---------+
val resultDF = df2.groupBy("Store_id").agg(avg("Days_diff").as("Avg_diff"))
resultDF.show
// +--------+--------+
// |Store_id|Avg_diff|
// +--------+--------+
// | 1| 3.0|
// | 0| 8.25|
// +--------+--------+
You can use the lag function over a Window to get the previous date, and then do some manipulation to get the final dataframe that you require.
First of all, the Date_d_id column needs to be converted to a timestamp so that sorting works correctly:
import org.apache.spark.sql.functions._
val timestapeddf = df.withColumn("Date_d_id", from_unixtime(unix_timestamp($"Date_d_id", "dd-MM-yyyy")))
which should give you the following dataframe:
+--------+-------------------+
|Store_id| Date_d_id|
+--------+-------------------+
| 0|2017-07-23 00:00:00|
| 0|2017-07-26 00:00:00|
| 0|2017-08-01 00:00:00|
| 0|2017-08-25 00:00:00|
| 1|2016-01-01 00:00:00|
| 1|2016-01-04 00:00:00|
| 1|2016-01-10 00:00:00|
+--------+-------------------+
Then you can apply the lag function over the window and finally get the date difference:
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("Store_id").orderBy("Date_d_id")
val laggeddf = timestapeddf.withColumn("Day_diff",
  when(lag("Date_d_id", 1).over(windowSpec).isNull, null)
    .otherwise(datediff($"Date_d_id", lag("Date_d_id", 1).over(windowSpec))))
laggeddf should be
+--------+-------------------+--------+
|Store_id|Date_d_id |Day_diff|
+--------+-------------------+--------+
|0 |2017-07-23 00:00:00|null |
|0 |2017-07-26 00:00:00|3 |
|0 |2017-08-01 00:00:00|6 |
|0 |2017-08-25 00:00:00|24 |
|1 |2016-01-01 00:00:00|null |
|1 |2016-01-04 00:00:00|3 |
|1 |2016-01-10 00:00:00|6 |
+--------+-------------------+--------+
now the final step is to use groupBy and aggregation to find the average
laggeddf.groupBy("Store_id")
.agg(avg("Day_diff").as("avg_diff"))
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 11.0|
| 1| 4.5|
+--------+--------+
Now, if you want the rows with a null Day_diff to be included in the count as well (effectively treating the first gap as 0), you can do
laggeddf.groupBy("Store_id")
.agg((sum("Day_diff")/count($"Day_diff".isNotNull)).as("avg_diff"))
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 8.25|
| 1| 3.0|
+--------+--------+
I hope the answer is helpful

Spark dataframe: Pivot and Group based on columns

I have an input dataframe as below, with id, app, and customer columns.
Input dataframe
+--------------------+-----+---------+
| id|app |customer |
+--------------------+-----+---------+
|id1 | fw| WM |
|id1 | fw| CS |
|id2 | fw| CS |
|id1 | fe| WM |
|id3 | bc| TR |
|id3 | bc| WM |
+--------------------+-----+---------+
Expected output
Using pivot and aggregate: make the app values the column names and put the aggregated customer names as a list in the dataframe.
Expected dataframe
+--------------------+----------+-------+----------+
| id| bc | fe| fw |
+--------------------+----------+-------+----------+
|id1 | 0 | WM| [WM,CS]|
|id2 | 0 | 0| [CS] |
|id3 | [TR,WM] | 0| 0 |
+--------------------+----------+-------+----------+
What have I tried?
val newDF =
df.groupBy("id").pivot("app").agg(expr("coalesce(first(customer),0)")).drop("app").show()
+--------------------+-----+-------+------+
| id|bc | fe| fw|
+--------------------+-----+-------+------+
|id1 | 0 | WM| WM|
|id2 | 0 | 0| CS|
|id3 | TR | 0| 0|
+--------------------+-----+-------+------+
Issue: In my query, I am not able to get the list of customers like [WM,CS] for "id1" under "fw" (as shown in the expected output); only "WM" is coming. Similarly, for "id3" only "TR" appears, whereas a list with the value [TR,WM] should appear under "bc".
I need your suggestions on how to get the list of customers under each app.
You can use collect_list if you can live with an empty list in the cells where it should be zero:
df.groupBy("id").pivot("app").agg(collect_list("customer")).show
+---+--------+----+--------+
| id| bc| fe| fw|
+---+--------+----+--------+
|id3|[TR, WM]| []| []|
|id1| []|[WM]|[CS, WM]|
|id2| []| []| [CS]|
+---+--------+----+--------+
Using concat_ws we can flatten the array into a comma-separated string and thereby remove the square brackets:
df.groupBy("id").pivot("app").agg(concat_ws(",",collect_list("customer")))

how to output multiple (key,value) in spark map function

The format of the input data looks like this:
+--------------------+-------------+--------------------+
| StudentID| Right | Wrong |
+--------------------+-------------+--------------------+
| studentNo01 | a,b,c | x,y,z |
+--------------------+-------------+--------------------+
| studentNo02 | c,d | v,w |
+--------------------+-------------+--------------------+
And the format of the output looks like this:
+--------------------+---------+
| key | value|
+--------------------+---------+
| studentNo01,a | 1 |
+--------------------+---------+
| studentNo01,b | 1 |
+--------------------+---------+
| studentNo01,c | 1 |
+--------------------+---------+
| studentNo01,x | 0 |
+--------------------+---------+
| studentNo01,y | 0 |
+--------------------+---------+
| studentNo01,z | 0 |
+--------------------+---------+
| studentNo02,c | 1 |
+--------------------+---------+
| studentNo02,d | 1 |
+--------------------+---------+
| studentNo02,v | 0 |
+--------------------+---------+
| studentNo02,w | 0 |
+--------------------+---------+
Right means 1 and Wrong means 0.
I want to process these data using the Spark map function or a UDF, but I don't know how to deal with it. Can you help me, please? Thank you.
Use split and explode twice, then do the union:
val df = List(
  ("studentNo01", "a,b,c", "x,y,z"),
  ("studentNo02", "c,d", "v,w")
).toDF("StudenID", "Right", "Wrong")
+-----------+-----+-----+
| StudenID|Right|Wrong|
+-----------+-----+-----+
|studentNo01|a,b,c|x,y,z|
|studentNo02| c,d| v,w|
+-----------+-----+-----+
val pair = (
  df.select('StudenID, explode(split('Right, ",")))
    .select(concat_ws(",", 'StudenID, 'col).as("key"))
    .withColumn("value", lit(1))
).unionAll(
  df.select('StudenID, explode(split('Wrong, ",")))
    .select(concat_ws(",", 'StudenID, 'col).as("key"))
    .withColumn("value", lit(0))
)
+-------------+-----+
| key|value|
+-------------+-----+
|studentNo01,a| 1|
|studentNo01,b| 1|
|studentNo01,c| 1|
|studentNo02,c| 1|
|studentNo02,d| 1|
|studentNo01,x| 0|
|studentNo01,y| 0|
|studentNo01,z| 0|
|studentNo02,v| 0|
|studentNo02,w| 0|
+-------------+-----+
You can convert it to an RDD of (key, value) pairs as follows:
val rdd = pair.rdd.map(r => (r.getString(0), r.getInt(1)))
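For anyone doing this from PySpark, a sketch of the same split / explode / union approach (assuming the columns are named StudentID, Right, and Wrong):

from pyspark.sql import functions as F

def tag(df, source_col, value):
    # explode the comma-separated answers and attach the 1/0 label
    return (
        df.select("StudentID", F.explode(F.split(source_col, ",")).alias("answer"))
          .select(F.concat_ws(",", "StudentID", "answer").alias("key"),
                  F.lit(value).alias("value"))
    )

pair = tag(df, "Right", 1).union(tag(df, "Wrong", 0))
pair.show()

# and, if an RDD of (key, value) tuples is needed:
rdd = pair.rdd.map(lambda r: (r["key"], r["value"]))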