How to round a datetime column in a PySpark DataFrame to the nearest quarter hour

I have a column that contains datetime values, for example 01/17/2020 15:55:00. I want to round the time to the nearest quarter hour (01/17/2020 16:00:00). Note: please don't answer this question using pandas; I want an answer that uses only PySpark.

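Try this; it rounds the timestamp to the nearest 15 minutes (900 seconds):
from pyspark.sql.functions import col, round, unix_timestamp
# assumes "date" is a timestamp column (or a string in the default yyyy-MM-dd HH:mm:ss format)
result = data.withColumn("rounded", (round(unix_timestamp(col("date")) / 900) * 900).cast("long").cast("timestamp"))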

Although Spark has no SQL function that directly rounds a datetime to a quarter hour, we can build the column from a few existing functions.
First, create the DataFrame:
from pyspark.sql.functions import current_timestamp
dateDF = spark.range(10)\
.withColumn("today", current_timestamp())
dateDF.show(10, False)
Then, round the minutes to the nearest quarter hour (storing the result in a mins column):
from pyspark.sql.functions import minute, hour, col, round, date_trunc, unix_timestamp, to_timestamp
dateDF2 = dateDF.select(col("today"),
(round(minute(col("today"))/15)*15).cast("int").alias("mins"))
Then, truncate the timestamp to the hour, convert it to a Unix timestamp, add the rounded minutes (as seconds), and convert the result back to the timestamp type:
dateDF2.select(col("today"), to_timestamp(unix_timestamp(date_trunc("hour", col("today"))) + col("mins")*60).alias("truncated_timestamp")).show(10, False)
Hope this helps
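The rounding arithmetic behind both answers can be checked outside Spark. The following is a minimal plain-Python sketch, not PySpark code; the function name round_to_quarter_hour is hypothetical, and it assumes that exact halves (7 min 30 s past a quarter) round up.

```python
from datetime import datetime, timedelta

QUARTER_SECONDS = 15 * 60  # 900 seconds in a quarter hour


def round_to_quarter_hour(ts: datetime) -> datetime:
    """Round ts to the nearest 15-minute boundary (exact halves round up)."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    # Whole seconds elapsed since midnight; sub-second precision is dropped.
    elapsed = int((ts - midnight).total_seconds())
    # Adding half a quarter before the floor division rounds to nearest.
    rounded = (elapsed + QUARTER_SECONDS // 2) // QUARTER_SECONDS * QUARTER_SECONDS
    return midnight + timedelta(seconds=rounded)


print(round_to_quarter_hour(datetime(2020, 1, 17, 15, 55, 0)))  # 2020-01-17 16:00:00
```

Because the result is built from midnight plus a number of seconds, a value such as 23:55 rolls over to 00:00 on the next day.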

Related

Unix timestamp granularity changed to hours instead of milliseconds

I have a Spark DataFrame with a timestamp column. I need to create event_hour in unix_timestamp format from this column. The issue is that timestamp is in unix_timestamp format with a granularity of milliseconds, while I need a granularity of hours.
Current values for timestamp:
1653192037
1653192026
1653192025
1653192024
1653192023
1653192022
Expected values:
1653192000
1653195600
1653199200
1653202800
How can I achieve that using Spark functions?
I've already tried to convert it to timestamp and then format it but I got null as the result:
inputDf
.withColumn("event_hour", unix_timestamp(date_format($"timestamp".cast(TimestampType), "MM-dd-yyyy HH")))
A (not very explicit but) efficient way is to use the modulus operation with 3600 (as 3600 seconds = 1 hour):
timestamp_hour = timestamp_second - (timestamp_second % 3600)
This assumes you are manipulating the data as numeric values expressed in seconds.
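As a quick check of the modulus approach, here is a minimal plain-Python sketch using values in the style of the question. The function names are hypothetical, and the millisecond variant assumes the input really is in milliseconds.

```python
SECONDS_PER_HOUR = 3600


def truncate_to_hour(epoch_seconds: int) -> int:
    """Drop the part of the timestamp that falls after the start of the hour."""
    return epoch_seconds - (epoch_seconds % SECONDS_PER_HOUR)


def truncate_ms_to_hour(epoch_millis: int) -> int:
    """Same truncation for millisecond input; the result is in seconds."""
    return truncate_to_hour(epoch_millis // 1000)


for ts in (1653192037, 1653192026, 1653195601):
    print(ts, "->", truncate_to_hour(ts))
```

In PySpark, the same arithmetic on a numeric column could probably be written as F.col("timestamp") - F.col("timestamp") % 3600, assuming the column holds seconds.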
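You can use the DateUtils API from Apache Commons Lang:
import java.util.Calendar;
import org.apache.commons.lang3.time.DateUtils;
// getTime() returns milliseconds since the epoch; divide by 1000 for seconds
Long epochTimestamp_hour = DateUtils.truncate(Timestamp_column, Calendar.HOUR).getTime();
create a new column of type timestamp
use that column to truncate the timestamp to epochTimestamp_hour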

Calculate previous year end date from current date

I have a bus_date column with multiple records holding different dates, e.g. 2021-03-15, 2021-05-12, 2021-01-15.
I want to calculate the previous year end for every date. My expected output is 2020-12-31 for all three dates.
I could use the function date_sub(start_date, num_days),
but I don't want to pass num_days manually, since there are millions of rows with different dates.
Can we write a view on a table, or create a DataFrame, that calculates the previous year end?
You can use date_add and date_trunc to achieve this.
import pyspark.sql.functions as F
......
data = [
('2021-03-15',),
('2021-05-12',),
('2021-01-15',)
]
df = spark.createDataFrame(data, ['bus_date'])
df = df.withColumn('pre_year_end', F.date_add(F.date_trunc('yyyy', 'bus_date'), -1))
df.show()
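The logic of "truncate to the start of the year, then step back one day" can be sketched in plain Python to confirm the expected output. This is an illustration only, not PySpark code, and previous_year_end is a hypothetical name.

```python
from datetime import date, timedelta


def previous_year_end(d: date) -> date:
    """Truncate d to January 1 of its year, then step back one day."""
    start_of_year = date(d.year, 1, 1)
    return start_of_year - timedelta(days=1)


for d in (date(2021, 3, 15), date(2021, 5, 12), date(2021, 1, 15)):
    print(d, "->", previous_year_end(d))
```

All three sample dates map to 2020-12-31, matching the expected output in the question.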

Convert using unixtimestamp to Date

I have a DataFrame with a column that holds dates as values like 1632838270314.
I want to convert it to a date in the format 'yyyy-MM-dd'. I have this so far, but it doesn't work:
date = df['createdOn'].cast(StringType())
df = df.withColumn('date_key',unix_timestamp(date),'yyyy-MM-dd').cast("date"))
createdOn is the field that derives the date_key
The method unix_timestamp() converts a timestamp or date string into the number of seconds since 01-01-1970 (the "epoch"). I understand that you want to do the opposite.
Your example value "1632838270314" seems to be milliseconds since epoch.
Here you can simply cast it after converting from milliseconds to seconds:
from pyspark.sql import Row, functions as F
df = sql_context.createDataFrame([
Row(unix_in_ms=1632838270314),
])
(
df
.withColumn('timestamp_type', (F.col('unix_in_ms')/1e3).cast('timestamp'))
.withColumn('date_type', F.to_date('timestamp_type'))
.withColumn('string_type', F.col('date_type').cast('string'))
.withColumn('date_to_unix_in_s', F.unix_timestamp('string_type', 'yyyy-MM-dd'))
.show(truncate=False)
)
# Output
+-------------+-----------------------+----------+-----------+-----------------+
|unix_in_ms   |timestamp_type         |date_type |string_type|date_to_unix_in_s|
+-------------+-----------------------+----------+-----------+-----------------+
|1632838270314|2021-09-28 16:11:10.314|2021-09-28|2021-09-28 |1632780000       |
+-------------+-----------------------+----------+-----------+-----------------+
You can combine the conversion into a single command:
df.withColumn('date_key', F.to_date((F.col('unix_in_ms')/1e3).cast('timestamp')).cast('string'))
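The milliseconds-to-date conversion can also be verified in plain Python. This sketch assumes UTC, whereas the Spark output above depends on the session time zone; millis_to_date_key is a hypothetical name.

```python
from datetime import datetime, timezone


def millis_to_date_key(epoch_millis: int) -> str:
    """Convert milliseconds since the epoch to a 'yyyy-MM-dd' string in UTC."""
    seconds = epoch_millis // 1000  # integer division drops the millisecond part
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d")


print(millis_to_date_key(1632838270314))  # 2021-09-28
```

For timestamps close to midnight, the resulting date can differ between time zones, so the time zone should be chosen deliberately.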

How to change the column type from String to Date in DataFrames?

I have a DataFrame in which two columns (C, D) are defined as string type, but the data in them are actually dates. For example, column C holds dates like "01-APR-2015" and column D like "20150401". I want to change these to date type, but I didn't find a good way of doing that. I looked on Stack Overflow for how to convert a string column to a date column in Spark SQL's DataFrame, where the date format can be "01-APR-2015", and found this post, but it didn't have information related to dates.
Spark >= 2.2
You can use to_date:
import org.apache.spark.sql.functions.{to_date, to_timestamp}
df.select(to_date($"ts", "dd-MMM-yyyy").alias("date"))
or to_timestamp:
df.select(to_timestamp($"ts", "dd-MMM-yyyy").alias("timestamp"))
without an intermediate unix_timestamp call.
Spark < 2.2
Since Spark 1.5 you can use the unix_timestamp function to parse the string to a long, cast it to timestamp, and truncate it with to_date:
import org.apache.spark.sql.functions.{unix_timestamp, to_date}
val df = Seq((1L, "01-APR-2015")).toDF("id", "ts")
df.select(to_date(unix_timestamp(
$"ts", "dd-MMM-yyyy"
).cast("timestamp")).alias("timestamp"))
Note:
Depending on the Spark version, this may require some adjustments due to SPARK-11724:
Casting from integer types to timestamp treats the source int as being in millis. Casting from timestamp to integer types creates the result in seconds.
If you use an unpatched version, the unix_timestamp output requires multiplication by 1000.
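To illustrate how the two string layouts from the question map to parse patterns, here is a minimal plain-Python sketch. Note that Python's strptime codes differ from Spark's pattern letters ("dd-MMM-yyyy" corresponds roughly to "%d-%b-%Y", and "yyyyMMdd" to "%Y%m%d"); the function names are hypothetical, and parsing the month abbreviation assumes an English or C locale.

```python
from datetime import date, datetime


def parse_column_c(value: str) -> date:
    """Parse strings such as '01-APR-2015' (day, abbreviated month name, year)."""
    return datetime.strptime(value, "%d-%b-%Y").date()


def parse_column_d(value: str) -> date:
    """Parse compact strings such as '20150401' (year, month, day)."""
    return datetime.strptime(value, "%Y%m%d").date()


print(parse_column_c("01-APR-2015"))
print(parse_column_d("20150401"))
```

Both calls should yield the same date, April 1, 2015.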

How to join a date to date ranges in Pandas?

I have one table with dates and another table with weekly data. My weeks start on Tuesday, and the date in the second table is supposed to determine the week (basically, the Tuesday before the date is the start of the week; in other words, that date is an example day in that week).
How can I join the dates to information about weeks?
Here is the setup:
from datetime import datetime as dt
import pandas as pd
df=pd.DataFrame([dt(2016,2,3), dt(2016,2,8), dt(2016,2,9), dt(2016,2,15)])
df_week=pd.DataFrame([(dt(2016,2,4),"a"), (dt(2016,2,11),"b")], columns=["week", "val"])
# note the actual start of the weeks are the Tuesdays: 2.2., 9.2.
# I expect a new column df["val"]=["a", "a", "b", "b"]
I've seen pandas date_range, but I cannot see how to do that from there.
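You're looking for DatetimeIndex.asof.
It gives you the closest index value at or before the given day in df (this assumes the date column in df is named day, as in the full example further down):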
df_week.set_index('week', inplace=True)
df_week.index.asof(df['day'][1])
You can now use it to select the corresponding value:
df_week.loc[df_week.index.asof(df['day'][1])]
Finally, apply it to the entire dataframe:
df = pd.DataFrame([dt(2016,2,8), dt(2016,2,9), dt(2016,2,15)], columns=['day'])
df['val'] = df.apply(lambda row: df_week.loc[df_week.index.asof(row['day'])]['val'], axis=1)
I removed the first value from df because I didn't want to deal with edge cases.
Result:
         day val
0 2016-02-08   a
1 2016-02-09   a
2 2016-02-15   b
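The "as of" lookup itself (take the latest week entry at or before each day) can be sketched without pandas using the standard library's bisect module. This is only an illustration of the lookup logic, not of how pandas implements it; asof_val is a hypothetical name, and the sketch returns None for days before the first week entry, the edge case the answer avoided.

```python
from bisect import bisect_right
from datetime import datetime as dt

# Week entries must be sorted by date for the binary search to work.
weeks = [(dt(2016, 2, 4), "a"), (dt(2016, 2, 11), "b")]
week_keys = [week for week, _ in weeks]


def asof_val(day):
    """Return the val of the latest week entry at or before day, else None."""
    pos = bisect_right(week_keys, day) - 1
    return weeks[pos][1] if pos >= 0 else None


days = [dt(2016, 2, 3), dt(2016, 2, 8), dt(2016, 2, 9), dt(2016, 2, 15)]
print([asof_val(d) for d in days])  # [None, 'a', 'a', 'b']
```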