PySpark: Getting the last day of the previous quarter based on today's date

In a Code Repository, using PySpark, I'm trying to take today's date and derive from it the last day of the prior quarter. This date would then be used to filter data in a DataFrame. I was trying to create a DataFrame in the Code Repository and that wasn't working, although my code works in Code Workbook. This is my Code Workbook code:
import datetime as dt
import pyspark.sql.functions as F
def unnamed():
    date_df = spark.createDataFrame([(dt.date.today(),)], ['date'])
    date_df = date_df \
        .withColumn('qtr_start_date', F.date_trunc('quarter', F.col('date'))) \
        .withColumn('qtr_date', F.date_sub(F.col('qtr_start_date'), 1))
    return date_df
Any help would be appreciated.

I got the following code to run successfully in a Code Repository:
from transforms.api import transform_df, Input, Output
import datetime as dt
import pyspark.sql.functions as F
@transform_df(
    Output("/my/output/dataset"),
)
def my_compute_function(ctx):
    date_df = ctx.spark_session.createDataFrame([(dt.date.today(),)], ['date'])
    date_df = date_df \
        .withColumn('qtr_start_date', F.date_trunc('quarter', F.col('date'))) \
        .withColumn('qtr_date', F.date_sub(F.col('qtr_start_date'), 1))
    return date_df
You'll need to pass the ctx argument into your transform, and you can make the pyspark.sql.DataFrame directly using the underlying spark_session variable.
If you already have the date column available in your input, you'll just need to make sure it's the Date type so that the F.date_trunc call works on the correct type.
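If you then want to use that prior-quarter end date to filter an input dataset (as described in the question), here is a minimal sketch following the same transform pattern; the input path and the business_date column name are hypothetical placeholders for your own dataset:
from transforms.api import transform_df, Input, Output
import pyspark.sql.functions as F

@transform_df(
    Output("/my/output/filtered_dataset"),  # hypothetical output path
    source_df=Input("/my/input/dataset"),   # hypothetical input with a Date-typed 'business_date' column
)
def filter_to_prior_quarter(source_df):
    # Last day of the prior quarter relative to today:
    # truncate today's date to the start of the quarter, then step back one day.
    cutoff = F.date_sub(F.date_trunc('quarter', F.current_date()), 1)
    return source_df.filter(F.col('business_date') <= cutoff)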

Related

Convert unix timestamp to Date

I have a field in a dataframe with values like 1632838270314.
I want to convert it to a date like 'yyyy-MM-dd'. I have this so far, but it doesn't work:
date = df['createdOn'].cast(StringType())
df = df.withColumn('date_key',unix_timestamp(date),'yyyy-MM-dd').cast("date"))
createdOn is the field that derives the date_key
The method unix_timestamp() converts a timestamp or date string into the number of seconds since 01-01-1970 ("epoch"). I understand that you want to do the opposite.
Your example value "1632838270314" seems to be milliseconds since epoch.
Here you can simply cast it after converting from milliseconds to seconds:
from pyspark.sql import Row
from pyspark.sql import functions as F

df = spark.createDataFrame([
    Row(unix_in_ms=1632838270314),
])
(
    df
    .withColumn('timestamp_type', (F.col('unix_in_ms')/1e3).cast('timestamp'))
    .withColumn('date_type', F.to_date('timestamp_type'))
    .withColumn('string_type', F.col('date_type').cast('string'))
    .withColumn('date_to_unix_in_s', F.unix_timestamp('string_type', 'yyyy-MM-dd'))
    .show(truncate=False)
)
# Output
+-------------+-----------------------+----------+-----------+-----------------+
|unix_in_ms |timestamp_type |date_type |string_type|date_to_unix_in_s|
+-------------+-----------------------+----------+-----------+-----------------+
|1632838270314|2021-09-28 16:11:10.314|2021-09-28|2021-09-28 |1632780000 |
+-------------+-----------------------+----------+-----------+-----------------+
You can combine the conversion into a single command:
df.withColumn('date_key', F.to_date((F.col('unix_in_ms')/1e3).cast('timestamp')).cast('string'))
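Applied to the createdOn column from the question, the same one-liner looks like this (a sketch assuming createdOn holds milliseconds since epoch, as in the example value):
df = df.withColumn(
    'date_key',
    F.to_date((F.col('createdOn') / 1e3).cast('timestamp')).cast('string')
)
# For the example value 1632838270314 this yields '2021-09-28' (session-timezone dependent).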

How to make timestamp date column into preferred format dd/MM/yyyy?

I have a column YDate in the form yyyy-MM-dd HH:mm:ss (timestamp type) but would like to convert it to dd/MM/yyyy.
I tried this:
df = df.withColumn('YDate',F.to_date(F.col('YDate'),'dd/MM/yyyy'))
but I get yyyy-MM-dd.
How can I effectively do this?
Use date_format instead:
df = df.withColumn('YDate',F.date_format(F.col('YDate'),'dd/MM/yyyy'))
to_date converts from the given format, while date_format converts into the given format.
You can use the date_format function from the pyspark library.
For more information about date formats, you can refer to the Date Format Documentation.
Below is a code snippet to solve your use case.
from pyspark.sql import functions as F
df = spark.createDataFrame([('2015-12-28 23:59:59',)], ['YDate'])
df = df.withColumn('YDate', F.date_format('YDate', 'dd/MM/yyyy'))
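With the example row above, this produces the string 28/12/2015:
df.show()  # the YDate column now contains '28/12/2015'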

Pyspark convert string type date into dd-mm-yyyy format

Using pyspark 2.4.0
I have a date column, dt, in the dataframe.
I need to convert it into DD-MM-YYYY format. I have tried a few solutions, including the following code, but it returns null values:
df_students_2 = df_students.withColumn(
    'new_date',
    F.to_date(
        F.unix_timestamp('dt', '%B %d, %Y').cast('timestamp')))
Note that there are different types of date formats in the dt column. It would be easier if I could put the whole column into one format just for ease of converting, but since the dataframe is big it is not possible to go through each value and change it to one format. I have also tried the following code (including it for future readers), looping over the two date formats, but did not succeed.
def to_date_(col, formats=(datetime.strptime(col, "%B %d, %Y"), \
                           datetime.strptime(col, "%d %B %Y"), "null")):
    return F.coalesce(*[F.to_date(col, f) for f in formats])
Any ideas?
Try this. It's implemented in Scala, but it can be done in PySpark with minimal changes.
// I've put the example formats, but just replace this list with expected formats in the dt column
val dt_formats= Seq("dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd","MM/dd/yy","dd-MM-yy","dd-MM-yyyy","yyyy/MM/dd","dd/MM/yyyy")
val newDF = df_students.withColumn("new_date", coalesce(dt_formats.map(fmt => to_date($"dt", fmt)):_*))
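For reference, a PySpark sketch of the same coalesce-over-formats approach (the format list below is illustrative; replace it with the formats actually present in the dt column):
import pyspark.sql.functions as F

# Illustrative candidate formats; adjust to match the dt column.
dt_formats = ["MMM d, yyyy", "d MMM yyyy", "dd-MM-yyyy", "yyyy-MM-dd", "dd/MM/yyyy"]

df_students_2 = df_students.withColumn(
    "new_date",
    F.coalesce(*[F.to_date(F.col("dt"), fmt) for fmt in dt_formats])
)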
Try this, it should work...
from pyspark.sql.functions import to_date
df = spark.createDataFrame([("Mar 25, 1991",), ("May 1, 2020",)],['date_str'])
df.select(to_date(df.date_str, 'MMM d, yyyy').alias('dt')).collect()
[Row(dt=datetime.date(1991, 3, 25)), Row(dt=datetime.date(2020, 5, 1))]
see also - Datetime Patterns for Formatting and Parsing

How to round off a datetime column in pyspark dataframe to nearest quarter

I have a column which has datetime values, for example 01/17/2020 15:55:00. I want to round the time to the nearest quarter-hour (01/17/2020 16:00:00). Note: please don't answer this question using pandas; I want an answer only using PySpark.
Try this, it will work for you:
from pyspark.sql.functions import hour, round, unix_timestamp
result = data.withColumn("hour", hour((round(unix_timestamp("date")/3600)*3600).cast("timestamp")))
Although Spark doesn't have a SQL function that truncates a datetime directly to the quarter-hour, we can build the column using a handful of functions.
First, create the DataFrame
from pyspark.sql.functions import current_timestamp
dateDF = spark.range(10)\
    .withColumn("today", current_timestamp())
dateDF.show(10, False)
Then, round the minutes to the nearest quarter-hour (storing the result in a mins column)
from pyspark.sql.functions import minute, hour, col, round, date_trunc, unix_timestamp, to_timestamp
dateDF2 = dateDF.select(col("today"),
                        (round(minute(col("today"))/15)*15).cast("int").alias("mins"))
Then, we truncate the timestamp to the hour, convert it to a unix timestamp, add the rounded minutes, and convert it back to the timestamp type:
dateDF2.select(col("today"), to_timestamp(unix_timestamp(date_trunc("hour", col("today"))) + col("mins")*60).alias("truncated_timestamp")).show(10, False)
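The same rounding can also be written as a single expression by working on the unix timestamp directly: divide by 900 seconds (15 minutes), round, multiply back, and cast to timestamp. A sketch using the today column from above:
from pyspark.sql.functions import col, round, unix_timestamp

rounded = dateDF.withColumn(
    "rounded_to_quarter_hour",
    (round(unix_timestamp(col("today")) / 900) * 900).cast("timestamp")
)
rounded.show(truncate=False)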
Hope this helps

Read JSON data from Blob where the files are stored inside date folders which auto increments everyday

HDFS blob storage stores the JSON data in the format below on a daily basis. I need to read the JSON data using spark.read.json() one day at a time. For example: today I want to read day=01's files and tomorrow I want to read day=02's files. Is there logic I can write in Scala which auto-increments the date, taking the month and year into account as well? Any help would be much appreciated.
/signals/year=2019/month=08/day=01
/signals/year=2019/month=08/day=01/*****.json
/signals/year=2019/month=08/day=01/*****.json
/signals/year=2019/month=08/day=02
/signals/year=2019/month=08/day=02/*****_.json
/signals/year=2019/month=08/day=02/*****_.json
Looks like the data is stored in partitioned format; to read only one date, a function like this can be used:
def readForDate(year: Int, month: Int, day: Int): DataFrame = {
  spark.read.json("/signals")
    .where($"year" === year && $"month" === month && $"day" === day)
}
To use this function, take the current date and split it into parts with regular Scala code, not related to Spark.
If there is any relation between the current date and the date of the JSON files you want to process, you can get the current date (and add or subtract any number of days) using the Scala code below, and use it in your Spark application as #pasha701 suggested.
scala> import java.time.format.DateTimeFormatter
scala> import java.time.LocalDateTime
scala> val dtf = DateTimeFormatter.ofPattern("dd") // you can get the Year and Month like this.
scala> val now = LocalDateTime.now()
scala> println(dtf.format(now))
02
scala> println(dtf.format(now.plusDays(2))) // Added two days on the current date
04
Just a thought: if you are using Azure Databricks, you can run a shell command in the notebook (using the "%sh" magic) to get the current day, again assuming there is some relation between the current date and the partition files you are trying to fetch.
Hope this may help someone in the future. The code below reads the data available in blob storage where the files are stored inside date folders that auto-increment every day. I wanted to read the previous day's data, so I added now.minusDays(1).
import java.time.format.DateTimeFormatter
import java.time.LocalDateTime

val dtf = DateTimeFormatter.ofPattern("yyyy-MM-dd")
val now = LocalDateTime.now()
val date = dtf.format(now.minusDays(1))
val currentDateHold = date.split("-").toList
val year = currentDateHold(0)
val month = currentDateHold(1)
val day = currentDateHold(2)
val path = "/signals/year=" + year + "/month=" + month + "/day=" + day

// Read JSON data from the Azure Blob
var initialDF = spark.read.format("json").load(path)