to_date gives null on format yyyyww (202001 and 202053)

I have a dataframe with a yearweek column that I want to convert to a date. The code I wrote seems to work for every week except for weeks '202001' and '202053'. Example:
import pyspark.sql.functions as F

df = spark.createDataFrame([
(1, "202001"),
(2, "202002"),
(3, "202003"),
(4, "202052"),
(5, "202053")
], ['id', 'week_year'])
df.withColumn("date", F.to_date(F.col("week_year"), "yyyyw")).show()
I can't figure out what the error is or how to fix these weeks. How can I convert weeks 202001 and 202053 to a valid date?

Dealing with ISO week in Spark is indeed a headache - in fact this functionality was deprecated (removed?) in Spark 3. I think using Python datetime utilities within a UDF is a more flexible way to do this.
import datetime
import pyspark.sql.functions as F

@F.udf('date')
def week_year_to_date(week_year):
    # the '1' appended to the string specifies the first day of the week
    return datetime.datetime.strptime(week_year + '1', '%G%V%u')
df = spark.createDataFrame([
(1, "202001"),
(2, "202002"),
(3, "202003"),
(4, "202052"),
(5, "202053")
], ['id', 'week_year'])
df.withColumn("date", week_year_to_date('week_year')).show()
+---+---------+----------+
| id|week_year| date|
+---+---------+----------+
| 1| 202001|2019-12-30|
| 2| 202002|2020-01-06|
| 3| 202003|2020-01-13|
| 4| 202052|2020-12-21|
| 5| 202053|2020-12-28|
+---+---------+----------+

Based on mck's answer, this is the solution I ended up using for Python version 3.5.2:
import datetime
from dateutil.relativedelta import relativedelta
import pyspark.sql.functions as F
@F.udf('date')
def week_year_to_date(week_year):
    # the '1' appended to the string specifies the first day of the week
    return datetime.datetime.strptime(week_year + '1', '%Y%W%w') - relativedelta(weeks=1)
df = spark.createDataFrame([
(9, "201952"),
(1, "202001"),
(2, "202002"),
(3, "202003"),
(4, "202052"),
(5, "202053")
], ['id', 'week_year'])
df.withColumn("date", week_year_to_date('week_year')).show()
Without the '%G%V%u' directives added in Python 3.6, I had to subtract a week from the date to get the correct dates.
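For what it's worth, this workaround can be sanity-checked outside Spark with plain Python (a minimal sketch, assuming dateutil is installed; '%Y%W%w' behaves the same on newer Python versions):
import datetime
from dateutil.relativedelta import relativedelta

# '%Y%W%w' counts week 01 from the first Monday of the year, so '2020011'
# parses to 2020-01-06; subtracting one week gives 2019-12-30, matching the
# ISO '%G%V%u' result shown above.
parsed = datetime.datetime.strptime("202001" + "1", "%Y%W%w") - relativedelta(weeks=1)
print(parsed.date())  # 2019-12-30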

The following does not use a udf, but instead a more efficient, vectorized pandas_udf:
import pandas as pd
import pyspark.sql.functions as F

@F.pandas_udf('date')
def week_year_to_date(week_year: pd.Series) -> pd.Series:
    return pd.to_datetime(week_year + '1', format='%G%V%u')
df.withColumn('date', week_year_to_date('week_year')).show()
# +---+---------+----------+
# | id|week_year| date|
# +---+---------+----------+
# | 1| 202001|2019-12-30|
# | 2| 202002|2020-01-06|
# | 3| 202003|2020-01-13|
# | 4| 202052|2020-12-21|
# | 5| 202053|2020-12-28|
# +---+---------+----------+
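To sanity-check the conversion without Spark, the same call can be run on a plain pandas Series; this is exactly the batch-wise operation the pandas_udf applies (a minimal sketch, assuming a pandas version that accepts the ISO '%G%V%u' directives):
import pandas as pd

s = pd.Series(["202001", "202053"])
# Appending '1' selects Monday as the first day of each ISO week.
print(pd.to_datetime(s + "1", format="%G%V%u"))
# 0   2019-12-30
# 1   2020-12-28
# dtype: datetime64[ns]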

Related

Selecting subset spark dataframe by months

I have this dataset:
I want to take a 3-month subset of it (e.g. the months April, May and August) using pyspark.
I still haven't found anything that would get me near this result using pyspark.
You can extract the month using month() and then apply an isin filter to find the rows matching the criteria.
from pyspark.sql import functions as F
data = [(1, "2021-01-01", ), (2, "2021-04-01", ), (3, "2021-05-01", ), (4, "2021-06-01", ), (5, "2021-07-01", ), (6, "2021-08-01", ), ]
df = spark.createDataFrame(data, ("cod_item", "date_emissao", )).withColumn("date_emissao", F.to_date("date_emissao"))
df.filter(F.month("date_emissao").isin(4, 5, 8)).show()
"""
+--------+------------+
|cod_item|date_emissao|
+--------+------------+
| 2| 2021-04-01|
| 3| 2021-05-01|
| 6| 2021-08-01|
+--------+------------+
"""

Merge two tables with respect to dates (date & period) using pyspark

Is there a way to merge two tables in pyspark with respect to a date: one listing events linked to a date, and another one carrying additional information over a period with a start and an end date?
There are similar topics for python, but none for pyspark, such as the one presented (using numpy) in this answer. My idea is not to get only one piece of information but all of the information available in my right table.
In this example, I would get in df1, based on the id, all the information available in df2 for that id, where the event_date falls between the start_period and the end_period.
import datetime

df1 = spark.createDataFrame([
(1,'a', datetime.datetime(2021,1,1)),
(1,'b', datetime.datetime(2021,1,5)),
(1,'c', datetime.datetime(2021,1,24)),
(2,'d', datetime.datetime(2021,1,10)),
(2,'e' , datetime.datetime(2021,1,15))], ['id','event','event_date'])
df2 = spark.createDataFrame([
(1,'Xxz45','XX013', datetime.datetime(2021,1,1), datetime.datetime(2021,1,10)),
(1,'Xasz','XX014', datetime.datetime(2021,1,11), datetime.datetime(2021,1,22)),
(1,'Xbbd','XX015', datetime.datetime(2021,1,23), datetime.datetime(2021,1,26)),
(1,'Xaaq','XX016', datetime.datetime(2021,1,27), datetime.datetime(2021,1,31))], ['id','info1','info2','start_period', 'end_period'])
[EDIT] The expected output would be (merging on id and on the event_date included in the period):
df_results = spark.createDataFrame([
(1, 'a', datetime.datetime(2021,1,1),'Xxz45','XX013'),
(1, 'b', datetime.datetime(2021,1,5),'Xxz45','XX013'),
(1, 'c', datetime.datetime(2021,1,24),'Xbbd','XX015'),
(2,'d', datetime.datetime(2021,1,10), None, None),
(2,'e', datetime.datetime(2021,1,15), None, None)], ['id','event','event_date','info1','info2'])
You can left join df1 with df2 on the condition start_period <= event_date <= end_period:
from pyspark.sql import functions as F
(df1
.join(df2, on=[df1['id'] == df2['id'], (df1['event_date'] >= df2['start_period']) & (df1['event_date'] <= df2['end_period'])], how='left')
.drop(df2['id'])
.drop('start_period', 'end_period')
.show()
)
# Output
# +---+-----+-------------------+-----+-----+
# | id|event| event_date|info1|info2|
# +---+-----+-------------------+-----+-----+
# | 1| a|2021-01-01 00:00:00|Xxz45|XX013|
# | 1| b|2021-01-05 00:00:00|Xxz45|XX013|
# | 1| c|2021-01-24 00:00:00| Xbbd|XX015|
# | 2| d|2021-01-10 00:00:00| null| null|
# | 2| e|2021-01-15 00:00:00| null| null|
# +---+-----+-------------------+-----+-----+
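The same range condition can also be written with Column.between, which reads closer to the inequality above (a small sketch on the same dataframes):
# between() is inclusive on both ends, matching start_period <= event_date <= end_period.
cond = [df1['id'] == df2['id'], df1['event_date'].between(df2['start_period'], df2['end_period'])]
(df1
 .join(df2, on=cond, how='left')
 .drop(df2['id'])
 .drop('start_period', 'end_period')
 .show()
)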
What you can do is write a UDF that creates a new column in df2 from start_period and end_period, with values like
[
datetime.datetime(2021,1,1),
datetime.datetime(2021,1,2),
datetime.datetime(2021,1,3),
datetime.datetime(2021,1,4),
datetime.datetime(2021,1,5),
datetime.datetime(2021,1,6),
datetime.datetime(2021,1,7),
datetime.datetime(2021,1,8),
datetime.datetime(2021,1,9),
datetime.datetime(2021,1,10)
]
After that you can explode this column and get a row for every date in the list. Finally, you can do an ordinary join between df1 and df2.
I did not check whether there is any pushdown function to create the list of dates from the interval.
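For what it's worth, Spark 2.4+ does ship such a function: sequence can expand the period into an array of dates without a UDF. A sketch of the idea (not part of the original answer), assuming the same df1/df2 as above:
from pyspark.sql import functions as F

# Expand each period into one row per day, then join on the exact date.
df2_expanded = (df2
    .withColumn("event_date", F.explode(F.expr(
        "sequence(to_date(start_period), to_date(end_period), interval 1 day)")))
    .drop("start_period", "end_period"))
(df1
 .withColumn("event_date", F.to_date("event_date"))
 .join(df2_expanded, on=["id", "event_date"], how="left")
 .show())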

Spark dataframe filter a timestamp by just the date part

How can I filter a spark dataframe that has a column of type timestamp, but filter by just the date part? I tried the below, but it only matches if the time is 00:00:00.
Basically I want the filter to match all rows with the date 2020-01-01 (3 rows).
import java.sql.Timestamp
val df = Seq(
(1, Timestamp.valueOf("2020-01-01 23:00:01")),
(2, Timestamp.valueOf("2020-01-01 00:00:00")),
(3, Timestamp.valueOf("2020-01-01 12:54:00")),
(4, Timestamp.valueOf("2019-12-15 09:54:00")),
(5, Timestamp.valueOf("2019-12-09 10:12:43"))
).toDF("someCol","someTimeStamp")
df.filter(df("someTimeStamp") === "2020-01-01").show
+-------+-------------------+
|someCol| someTimeStamp|
+-------+-------------------+
| 2|2020-01-01 00:00:00| // ONLY MATCHED with time 00:00
+-------+-------------------+
Use the to_date function to extract the date from the timestamp:
scala> df.filter(to_date(df("someTimeStamp")) === "2020-01-01").show
+-------+-------------------+
|someCol| someTimeStamp|
+-------+-------------------+
| 1|2020-01-01 23:00:01|
| 2|2020-01-01 00:00:00|
| 3|2020-01-01 12:54:00|
+-------+-------------------+

Filling missing dates in spark dataframe column

I have a spark data frame with columns "date" of type timestamp and "quantity" of type long. For each date, I have some value for quantity. The dates are sorted in increasing order, but some dates are missing.
For eg -
Current df -
Date | Quantity
10-09-2016 | 1
11-09-2016 | 2
14-09-2016 | 0
16-09-2016 | 1
17-09-2016 | 0
20-09-2016 | 2
As you can see, the df has some missing dates like 12-09-2016, 13-09-2016 etc. I want to put 0 in the quantity field for those missing dates such that resultant df should look like -
Date | Quantity
10-09-2016 | 1
11-09-2016 | 2
12-09-2016 | 0
13-09-2016 | 0
14-09-2016 | 0
15-09-2016 | 0
16-09-2016 | 1
17-09-2016 | 0
18-09-2016 | 0
19-09-2016 | 0
20-09-2016 | 2
Any help/suggestion regarding this will be appreciated. Thanks in advance.
Note that I am coding in scala.
I have written this answer in a somewhat verbose way for easier understanding of the code. It can be optimized.
Needed imports
import java.time.format.DateTimeFormatter
import java.time.{LocalDate, LocalDateTime}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{LongType, TimestampType}
UDF for converting a String to a valid date format
val date_transform = udf((date: String) => {
  val dtFormatter = DateTimeFormatter.ofPattern("d-M-y")
  val dt = LocalDate.parse(date, dtFormatter)
  "%4d-%2d-%2d".format(dt.getYear, dt.getMonthValue, dt.getDayOfMonth)
    .replaceAll(" ", "0")
})
Below UDF code taken from Iterate over dates range
def fill_dates = udf((start: String, excludedDiff: Int) => {
  val dtFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  val fromDt = LocalDateTime.parse(start, dtFormatter)
  (1 to (excludedDiff - 1)).map(day => {
    val dt = fromDt.plusDays(day)
    "%4d-%2d-%2d".format(dt.getYear, dt.getMonthValue, dt.getDayOfMonth)
      .replaceAll(" ", "0")
  })
})
Setting up sample dataframe (df)
val df = Seq(
("10-09-2016", 1),
("11-09-2016", 2),
("14-09-2016", 0),
("16-09-2016", 1),
("17-09-2016", 0),
("20-09-2016", 2)).toDF("date", "quantity")
.withColumn("date", date_transform($"date").cast(TimestampType))
.withColumn("quantity", $"quantity".cast(LongType))
df.printSchema()
root
|-- date: timestamp (nullable = true)
|-- quantity: long (nullable = false)
df.show()
+-------------------+--------+
| date|quantity|
+-------------------+--------+
|2016-09-10 00:00:00| 1|
|2016-09-11 00:00:00| 2|
|2016-09-14 00:00:00| 0|
|2016-09-16 00:00:00| 1|
|2016-09-17 00:00:00| 0|
|2016-09-20 00:00:00| 2|
+-------------------+--------+
Create a temporary dataframe(tempDf) to union with df:
val w = Window.orderBy($"date")
val tempDf = df.withColumn("diff", datediff(lead($"date", 1).over(w), $"date"))
.filter($"diff" > 1) // Pick date diff more than one day to generate our date
.withColumn("next_dates", fill_dates($"date", $"diff"))
.withColumn("quantity", lit("0"))
.withColumn("date", explode($"next_dates"))
.withColumn("date", $"date".cast(TimestampType))
tempDf.show(false)
+-------------------+--------+----+------------------------+
|date |quantity|diff|next_dates |
+-------------------+--------+----+------------------------+
|2016-09-12 00:00:00|0 |3 |[2016-09-12, 2016-09-13]|
|2016-09-13 00:00:00|0 |3 |[2016-09-12, 2016-09-13]|
|2016-09-15 00:00:00|0 |2 |[2016-09-15] |
|2016-09-18 00:00:00|0 |3 |[2016-09-18, 2016-09-19]|
|2016-09-19 00:00:00|0 |3 |[2016-09-18, 2016-09-19]|
+-------------------+--------+----+------------------------+
Now union two dataframes
val result = df.union(tempDf.select("date", "quantity"))
.orderBy("date")
result.show()
+-------------------+--------+
| date|quantity|
+-------------------+--------+
|2016-09-10 00:00:00| 1|
|2016-09-11 00:00:00| 2|
|2016-09-12 00:00:00| 0|
|2016-09-13 00:00:00| 0|
|2016-09-14 00:00:00| 0|
|2016-09-15 00:00:00| 0|
|2016-09-16 00:00:00| 1|
|2016-09-17 00:00:00| 0|
|2016-09-18 00:00:00| 0|
|2016-09-19 00:00:00| 0|
|2016-09-20 00:00:00| 2|
+-------------------+--------+
Based on @mrsrinivas's excellent answer, here is the PySpark version.
Needed imports
from typing import List
import datetime
from pyspark.sql import DataFrame, Window
from pyspark.sql.functions import col, lit, udf, datediff, lead, explode
from pyspark.sql.types import DateType, ArrayType
UDF to create the range of next dates
def _get_next_dates(start_date: datetime.date, diff: int) -> List[datetime.date]:
    return [start_date + datetime.timedelta(days=days) for days in range(1, diff)]
Function to create the DataFrame filling the dates (supports "grouping" columns):
def _get_fill_dates_df(df: DataFrame, date_column: str, group_columns: List[str], fill_column: str) -> DataFrame:
    get_next_dates_udf = udf(_get_next_dates, ArrayType(DateType()))
    window = Window.orderBy(*group_columns, date_column)
    return df.withColumn("_diff", datediff(lead(date_column, 1).over(window), date_column)) \
        .filter(col("_diff") > 1).withColumn("_next_dates", get_next_dates_udf(date_column, "_diff")) \
        .withColumn(fill_column, lit("0")).withColumn(date_column, explode("_next_dates")) \
        .drop("_diff", "_next_dates")
The usage of the function:
fill_df = _get_fill_dates_df(df, "Date", [], "Quantity")
df = df.union(fill_df)
It assumes that the date column is already of date type.
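If the column is still a string (as in the dd-MM-yyyy sample above), it can be cast to DateType first; a minimal sketch, assuming the columns are named "Date" and "Quantity" as in the question:
from pyspark.sql.functions import to_date

df = df.withColumn("Date", to_date("Date", "dd-MM-yyyy"))
fill_df = _get_fill_dates_df(df, "Date", [], "Quantity")
df = df.union(fill_df).orderBy("Date")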
Here is a slight modification to use this function with months, passing measure columns (columns that should be set to zero) instead of group columns:
from typing import List
import datetime
from dateutil import relativedelta
import math
import pyspark.sql.functions as f
from pyspark.sql import DataFrame, Window
from pyspark.sql.types import DateType, ArrayType, IntegerType
def fill_time_gaps_date_diff_based(df: DataFrame, measure_columns: list, date_column: str):
    group_columns = [col for col in df.columns if col not in [date_column] + measure_columns]
    # save measure sums for qc
    qc = df.agg({col: 'sum' for col in measure_columns}).collect()
    # convert month to date
    convert_int_to_date = f.udf(lambda mth: datetime.datetime(year=math.floor(mth/100), month=mth%100, day=1), DateType())
    df = df.withColumn(date_column, convert_int_to_date(date_column))
    # sort values
    df = df.orderBy(group_columns)
    # get_fill_dates_df (instead of months_between also use date_diff for days)
    window = Window.orderBy(*group_columns, date_column)
    # calculate diff column
    fill_df = df.withColumn(
        "_diff",
        f.months_between(f.lead(date_column, 1).over(window), date_column).cast(IntegerType())
    ).filter(
        f.col("_diff") > 1
    )
    # generate next dates
    def _get_next_dates(start_date: datetime.date, diff: int) -> List[datetime.date]:
        return [
            start_date + relativedelta.relativedelta(months=months)
            for months in range(1, diff)
        ]
    get_next_dates_udf = f.udf(_get_next_dates, ArrayType(DateType()))
    fill_df = fill_df.withColumn(
        "_next_dates",
        get_next_dates_udf(date_column, "_diff")
    )
    # set measure columns to 0
    for col in measure_columns:
        fill_df = fill_df.withColumn(col, f.lit(0))
    # explode next_dates column
    fill_df = fill_df.withColumn(date_column, f.explode('_next_dates'))
    # drop unnecessary columns
    fill_df = fill_df.drop(
        "_diff",
        "_next_dates"
    )
    # union df with fill_df
    df = df.union(fill_df)
    # qc: should be removed for productive runs
    if qc != df.agg({col: 'sum' for col in measure_columns}).collect():
        raise ValueError('Sums before and after run do not fit.')
    return df
Please note that I assume the month is given as an Integer in the form YYYYMM. This could easily be adjusted by modifying the "convert month to date" part.
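A hypothetical usage example (the data, column and group names here are made up for illustration only):
# Monthly data with a gap between 202001 and 202004; "quantity" is the
# measure column that is zero-filled for the missing months 202002/202003.
data = [("A", 202001, 10), ("A", 202004, 5)]
df = spark.createDataFrame(data, ["group", "month", "quantity"])
filled = fill_time_gaps_date_diff_based(df, measure_columns=["quantity"], date_column="month")
filled.orderBy("group", "month").show()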

Extract week day number from string column (datetime stamp) in spark api

I am new to the Spark API. I am trying to extract the weekday number from a column, say col_date (holding a datetime stamp, e.g. '13AUG15:09:40:15'), which is a string, and add another column with the weekday (integer). I am not able to do this successfully.
The approach below worked for me, using a 'one line' udf, similar to but different from the one above:
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName('dayofweek').getOrCreate()
set up the dataframe:
df = spark.createDataFrame(
[(1, "2018-05-12")
,(2, "2018-05-13")
,(3, "2018-05-14")
,(4, "2018-05-15")
,(5, "2018-05-16")
,(6, "2018-05-17")
,(7, "2018-05-18")
,(8, "2018-05-19")
,(9, "2018-05-20")
], ("id", "date"))
set up the udf:
from pyspark.sql.functions import udf,desc
from datetime import datetime
weekDay = udf(lambda x: datetime.strptime(x, '%Y-%m-%d').strftime('%w'))
df = df.withColumn('weekDay', weekDay(df['date'])).sort(desc("date"))
results:
df.show()
+---+----------+-------+
| id| date|weekDay|
+---+----------+-------+
| 9|2018-05-20| 0|
| 8|2018-05-19| 6|
| 7|2018-05-18| 5|
| 6|2018-05-17| 4|
| 5|2018-05-16| 3|
| 4|2018-05-15| 2|
| 3|2018-05-14| 1|
| 2|2018-05-13| 0|
| 1|2018-05-12| 6|
+---+----------+-------+
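As a side note (not part of the original answer), Spark 2.3+ also has a built-in dayofweek function that avoids the udf entirely; be aware that its numbering differs from strftime('%w'):
from pyspark.sql import functions as F

# dayofweek returns 1 = Sunday ... 7 = Saturday, whereas '%w' above
# returns 0 = Sunday ... 6 = Saturday.
df.withColumn('weekDay', F.dayofweek(F.to_date('date'))).sort(F.desc('date')).show()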
Well, this is quite simple.
This simple function does all the work and returns weekdays as numbers (Monday = 1):
from datetime import datetime

# get the weekday number from a timestamp string like '13AUG15:09:40:15'
def toWeekDay(x):
    # v = datetime.strptime(datetime.fromtimestamp(int(x)).strftime("%Y %m %d %H"), "%Y %m %d %H").strftime('%w')  # from unix timestamp
    v = datetime.strptime(x, '%d%b%y:%H:%M:%S').strftime('%w')
    return v
days = ['13AUG15:09:40:15','27APR16:20:04:35'] # create example dates
days = sc.parallelize(days) # for example purposes - transform python list to RDD so we can do it in a 'Spark [parallel] way'
days.take(2) # to see whats in RDD
> ['13AUG15:09:40:15', '27APR16:20:04:35']
result = days.map(lambda x: toWeekDay(x)) # apply function toWeekDay to each element of the RDD
result.take(2) # lets see results
> ['4', '3']
Please see Python documentation for further details on datetime processing.