Fix Dates in Pyspark DataFrame - set to minimum value - pyspark

I have a data frame with a timestamp field: RECEIPTDATEREQUESTED:timestamp
For some reason, there are dates that are less than 1900-01-01. I don't want these; for every value in the column where RECEIPTDATEREQUESTED < '1900-01-01 00:00:00', I want to set the timestamp to either 1900-01-01 or null.
I've tried a few ways to do this, but it seems something simpler must exist. I thought something like this might work:
import datetime

from pyspark.sql.functions import col, udf
from pyspark.sql.types import TimestampType

def testdate(date_value):
    oldest = datetime.datetime.strptime('1900-01-01 00:00:00', '%Y-%m-%d %H:%M:%S')
    try:
        if date_value < oldest:
            return oldest
        else:
            return date_value
    except (ValueError, TypeError):  # bad or null values fall back to the floor date
        return oldest

udf_testdate = udf(lambda x: testdate(x), TimestampType())

bdf = olddf.withColumn("RECEIPTDATEREQUESTED", udf_testdate(col("RECEIPTDATEREQUESTED")))

You can use conditional evaluation with when and otherwise to set RECEIPTDATEREQUESTED to either null or 1900-01-01 00:00:00 whenever the value is < '1900-01-01 00:00:00'.
from pyspark.sql import functions as F

data = [("1000-01-01 00:00:00",),
        ("1899-12-31 23:59:59",),
        ("1900-01-01 00:00:00",),
        ("1901-01-01 00:00:00",)]

df = spark.createDataFrame(data, ("RECEIPTDATEREQUESTED",))\
          .withColumn("RECEIPTDATEREQUESTED", F.to_timestamp(F.col("RECEIPTDATEREQUESTED")))

# Fill null
df.withColumn("RECEIPTDATEREQUESTED",
              F.when(F.col("RECEIPTDATEREQUESTED") < "1900-01-01 00:00:00", F.lit(None))
               .otherwise(F.col("RECEIPTDATEREQUESTED")))\
  .show(200, False)

# Fill default value
df.withColumn("RECEIPTDATEREQUESTED",
              F.when(F.col("RECEIPTDATEREQUESTED") < "1900-01-01 00:00:00", F.lit("1900-01-01 00:00:00").cast("timestamp"))
               .otherwise(F.col("RECEIPTDATEREQUESTED")))\
  .show(200, False)
Output
Fill null
+--------------------+
|RECEIPTDATEREQUESTED|
+--------------------+
|null |
|null |
|1900-01-01 00:00:00 |
|1901-01-01 00:00:00 |
+--------------------+
Fill 1900-01-01 00:00:00
+--------------------+
|RECEIPTDATEREQUESTED|
+--------------------+
|1900-01-01 00:00:00 |
|1900-01-01 00:00:00 |
|1900-01-01 00:00:00 |
|1901-01-01 00:00:00 |
+--------------------+
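Applied to the original DataFrame from the question (olddf and the column name are taken from there; treat this as a sketch):
from pyspark.sql import functions as F

# replace pre-1900 timestamps with NULL (or swap lit(None) for a default timestamp)
bdf = olddf.withColumn(
    "RECEIPTDATEREQUESTED",
    F.when(F.col("RECEIPTDATEREQUESTED") < "1900-01-01 00:00:00", F.lit(None).cast("timestamp"))
     .otherwise(F.col("RECEIPTDATEREQUESTED")))
Unlike the UDF attempt, this stays inside Spark's built-in expressions, so it avoids the Python serialization overhead of a udf.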

Related

to_date in selectexpr of pyspark is truncating date time to year by default. how to avoid this?

I have a requirement where I derive the year to be loaded and then have to load the first day and last day of that year in date format to a table.
Here is what I'm doing:
boy = str(nxt_yr)+'01-01'
eoy = str(nxt_yr)+'12-31'
df_final = df_demo.selectExpr("to_date('{}','yyyy-MM-dd') as strt_dt".format(boy),"to_date('{}','yyyy-MM-dd') as end_dt".format(eoy))
spark.sql("set spark.sql.legacy.timeParserPolicy = LEGACY")
df_final.show(1)
This is giving me 2023-01-01 in both fields, in date datatype.
Is this expected behavior and if yes is there any workaround?
Note: I tried hardcoding the date as 2022-11-30 also in the code but still received the beginning of the year in the output.
It's working as expected; additionally, you are missing a - within the dates you create for the conversions:
nxt_yr = 2022
boy = str(nxt_yr)+'-01-01'   # <-- note the added '-'
eoy = str(nxt_yr)+'-12-31'

spark.sql("set spark.sql.legacy.timeParserPolicy = LEGACY")
spark.sql(f"""
SELECT
    to_date('{boy}','yyyy-MM-dd') as strt_dt
    ,to_date('{eoy}','yyyy-MM-dd') as end_dt
"""
).show()
+----------+----------+
| strt_dt| end_dt|
+----------+----------+
|2022-01-01|2022-12-31|
+----------+----------+
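If you prefer to keep your original selectExpr form, the same fix applies (df_demo and nxt_yr are assumed from the question; a sketch):
nxt_yr = 2022
boy = str(nxt_yr) + '-01-01'
eoy = str(nxt_yr) + '-12-31'

df_final = df_demo.selectExpr(
    "to_date('{}','yyyy-MM-dd') as strt_dt".format(boy),
    "to_date('{}','yyyy-MM-dd') as end_dt".format(eoy))
df_final.show(1)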

How to calculate standard deviation over a range of dates when there are dates missing in pyspark 2.2.0

I have a pyspark df wherein I am using a combination of window + udf functions to calculate the standard deviation over historical business dates. The challenge is that my df is missing dates when there is no transaction. How do I calculate the std dev including these missing dates, without adding them as additional rows to my df (which would risk the df growing out of memory)?
Sample Table & Current output
| ID | Date | Amount | Std_Dev|
|----|----------|--------|--------|
|1 |2021-03-24| 10000 | |
|1 |2021-03-26| 5000 | |
|1 |2021-03-29| 10000 |2886.751|
Current Code
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("ID").orderBy("date")

# UDF to calculate the difference between two dates in business days
workdaysUDF = F.udf(lambda date1, date2: int(np.busday_count(date2, date1)) if (date1 is not None and date2 is not None) else None, IntegerType())

# column holding the business-day difference from the first date in the window
df = df.withColumn("date_dif", workdaysUDF(F.col('Date'), F.first(F.col('Date')).over(windowSpec)))

windowval = lambda days: Window.partitionBy('id').orderBy('date_dif').rangeBetween(-days, 0)
df = df.withColumn("std_dev", F.stddev("amount").over(windowval(6)))\
       .drop("date_dif")
Desired output, where the dates missing between 24 and 29 March are treated as having an amount of 0:
| ID | Date | Amount | Std_Dev|
|----|----------|--------|--------|
|1 |2021-03-24| 10000 | |
|1 |2021-03-26| 5000 | |
|1 |2021-03-29| 10000 |4915.96 |
Please note that I am only showing the std dev for a single date for illustration; there would be a value for each row, since I am using a rolling window function.
Any help would be greatly appreciated.
PS: Pyspark version is 2.2.0 at enterprise so I do not have flexibility to change the version.
Thanks,
VSG
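One way to get this without adding rows (a sketch, not a tested solution): since the window spans a fixed number of days, the missing days can be treated as implicit zero-amount rows by computing the standard deviation from running sums instead of calling F.stddev directly. The column names and the 6-day calendar window below are assumptions taken from the sample; with those assumptions the formula should reproduce the 4915.96 shown above.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# integer day index so rangeBetween works on calendar days (missing days simply have no row)
df = df.withColumn("day_idx", F.datediff(F.col("Date"), F.lit("1970-01-01")))

n_days = 6  # assumed window length in days
w = Window.partitionBy("ID").orderBy("day_idx").rangeBetween(-(n_days - 1), 0)

# sample std dev over n_days values, counting absent days as amount 0:
# var = (sum(x^2) - sum(x)^2 / n) / (n - 1)
df = (df
      .withColumn("sum_x", F.sum("Amount").over(w))
      .withColumn("sum_x2", F.sum(F.col("Amount") * F.col("Amount")).over(w))
      .withColumn("std_dev",
                  F.sqrt((F.col("sum_x2") - F.col("sum_x") * F.col("sum_x") / n_days)
                         / (n_days - 1)))
      .drop("day_idx", "sum_x", "sum_x2"))
If the window really needs to be in business days rather than calendar days, the same idea works with your busday_count-based index in place of day_idx.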

Pyspark calculated field based off time difference

I have a table that looks like this:
trip_distance | tpep_pickup_datetime | tpep_dropoff_datetime|
+-------------+----------------------+----------------------+
1.5 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 |
In the end, I need to create a speed column for each row, so something like this:
trip_distance | tpep_pickup_datetime | tpep_dropoff_datetime| speed |
+-------------+----------------------+----------------------+-------+
1.5 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 | 13.5 |
So this is what I'm trying to do to get there. I figure I should add an interim column to help out, called trip_time, which is calculated as tpep_dropoff_datetime - tpep_pickup_datetime. Here is the code I'm using to get that:
df4 = df.withColumn('trip_time', df.tpep_dropoff_datetime - df.tpep_pickup_datetime)
which is producing a nice trip_time column:
trip_distance | tpep_pickup_datetime | tpep_dropoff_datetime| trip_time|
+-------------+----------------------+----------------------+-----------------------+
1.5 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 | 6 minutes 40 seconds|
But now I want to do the speed column, and this how I'm trying to do that:
df4 = df4.withColumn('speed', (F.col('trip_distance') / F.col('trip_time')))
But that is giving me this error:
AnalysisException: cannot resolve '(trip_distance/trip_time)' due to data type mismatch: differing types in '(trip_distance/trip_time)' (float and interval).;;
Is there a better way?
One option is to convert your times with unix_timestamp, which is in seconds; the subtraction then gives you the interval as an integer that can be used to calculate the speed:
import pyspark.sql.functions as f
df.withColumn('speed', f.col('trip_distance') * 3600 / (
    f.unix_timestamp('tpep_dropoff_datetime') - f.unix_timestamp('tpep_pickup_datetime'))
).show()
+-------------+--------------------+---------------------+-----+
|trip_distance|tpep_pickup_datetime|tpep_dropoff_datetime|speed|
+-------------+--------------------+---------------------+-----+
| 1.5| 2019-01-01 00:46:40| 2019-01-01 00:53:20| 13.5|
+-------------+--------------------+---------------------+-----+
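Equivalently (a sketch under the same column-name assumptions), you can cast the timestamps to long, which also yields epoch seconds, if you prefer to keep an explicit trip_time column:
import pyspark.sql.functions as F

df4 = (df
       .withColumn('trip_time',
                   F.col('tpep_dropoff_datetime').cast('long')
                   - F.col('tpep_pickup_datetime').cast('long'))  # duration in seconds
       .withColumn('speed', F.col('trip_distance') * 3600 / F.col('trip_time')))
df4.show()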

Scala: copying a dataframe column into array and preserving the original order

Suppose I have a dataframe df with one timestamp column and one integer column such that no timestamp appears in more than 1 record. It looks like this:
timestamp | value
------------------
2019-07-03 | 2100
2019-04-15 | 1828
2019-06-01 | 948
2019-07-12 | 2912
[etc.]
Using the following I can order this by timestamp:
df.createOrReplaceTempView("tmp")
var sql_cmd = """select
                   *
                 from
                   tmp
                 order by
                   timestamp asc"""
var new_df = spark.sql(sql_cmd)
and get new_df looking this way:
timestamp | value
------------------
2019-04-15 | 1828
2019-06-01 | 948
2019-07-03 | 2100
2019-07-12 | 2912
[etc.]
Is there a way to put the contents of value of new_df into an array new_df_array such that the ordering of the numbers of that column is preserved? (That is: new_df_array[0] == 1828, new_df_array[1] == 948 etc.)
This should do the trick:
val array = new_df.coalesce(1).sortWithinPartitions($"timestamp").collect()
Note that the result is not a DataFrame but a plain Scala Array[Row]; to get just the values (1828, 948, ...), select the value column before collecting and pull it out of each Row.

Why aggregation function pyspark.sql.functions.collect_list() adds local timezone offset on display?

I run the following code in a pyspark shell session. Running collect_list() after a groupBy changes how timestamps are displayed (a UTC+02:00 offset is added, probably because this is the local offset in Greece, where the code is run). Although the display is problematic, the timestamp under the hood remains unchanged. This can be observed either by adding a column with the actual unix timestamps or by reverting the dataframe to its initial shape using pyspark.sql.functions.explode(). Is this a bug?
import datetime
import os
import time

from pyspark.sql import functions, types

# configure UTC timezone
spark.conf.set("spark.sql.session.timeZone", "UTC")
os.environ['TZ'] = 'UTC'
time.tzset()

# create DataFrame
date_time = datetime.datetime(year=2019, month=1, day=1, hour=12)
data = [(1, date_time), (1, date_time)]
schema = types.StructType([types.StructField("id", types.IntegerType(), False),
                           types.StructField("time", types.TimestampType(), False)])
df_test = spark.createDataFrame(data, schema)
df_test.show()
+---+-------------------+
| id| time|
+---+-------------------+
| 1|2019-01-01 12:00:00|
| 1|2019-01-01 12:00:00|
+---+-------------------+
# GroupBy and collect_list
df_test1 = df_test.groupBy("id").agg(functions.collect_list("time"))
df_test1.show(1, False)
+---+----------------------------------------------+
|id |collect_list(time) |
+---+----------------------------------------------+
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|
+---+----------------------------------------------+
# add column with unix timestamps
to_timestamp = functions.udf(lambda x : [value.timestamp() for value in x], types.ArrayType(types.FloatType()))
df_test1 = df_test1.withColumn("unix_timestamp", to_timestamp(functions.col("collect_list(time)")))
df_test1.show(1, False)
+---+----------------------------------------------+----------------------------+
|id |collect_list(time) |unix_timestamp |
+---+----------------------------------------------+----------------------------+
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|[1.54634394E9, 1.54634394E9]|
+---+----------------------------------------------+----------------------------+
# explode list to distinct rows
df_test.groupBy("id").agg(functions.collect_list("time"))\
    .withColumn("test", functions.explode(functions.col("collect_list(time)")))\
    .show(2, False)
+---+----------------------------------------------+-------------------+
|id |collect_list(time) |test |
+---+----------------------------------------------+-------------------+
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|2019-01-01 12:00:00|
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|2019-01-01 12:00:00|
+---+----------------------------------------------+-------------------+
P.S. 1.54634394E9 corresponds to 2019-01-01 12:00:00, which is the correct UTC timestamp.
For me the code above works and does not convert the time as in your case.
Maybe check what your session time zone is (and, optionally, set it to some tz):
spark.conf.get('spark.sql.session.timeZone')
In general, TimestampType in pyspark is not tz-aware like in Pandas; rather, it stores long integers internally and displays them according to your machine's local time zone (by default).
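A minimal check along those lines (df_test1 is assumed from the question; whether this changes the display depends on the Spark version):
# inspect the current session time zone, then pin it to UTC and re-display
print(spark.conf.get("spark.sql.session.timeZone"))
spark.conf.set("spark.sql.session.timeZone", "UTC")
df_test1.show(1, False)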