How to pull data by looping over dates in pyspark sql? - pyspark

I have a script where I'm pulling data into a pyspark DataFrame using spark sql. The script is shown below:
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df_query = """
select
*
from schema.table
where start_date between date '2019-03-01' and date '2019-03-07'
"""
df = spark.sql(df_query)
Currently, the script pulls data for a particular week. However, I want to iterate this script over all weeks. How can I do that?

You can use the timedelta class for that:
import datetime

startDate = datetime.datetime.strptime('2019-03-01', "%Y-%m-%d")
maxDate = datetime.datetime.strptime('2019-04-03', "%Y-%m-%d")

while startDate <= maxDate:
    # 6 days after the start gives an inclusive 7-day window, matching the original query
    endDate = startDate + datetime.timedelta(days=6)
    df_query = """
    select
    *
    from schema.table
    where start_date between date '{}' and date '{}'
    """.format(startDate.date(), endDate.date())
    print(df_query)
    startDate = endDate + datetime.timedelta(days=1)
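If you want to actually run each weekly query and collect the results into a single DataFrame rather than just print the SQL, here is a minimal sketch, assuming the spark session and schema.table from the question and that every weekly pull returns the same columns:

import datetime
from functools import reduce

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

startDate = datetime.datetime.strptime('2019-03-01', "%Y-%m-%d")
maxDate = datetime.datetime.strptime('2019-04-03', "%Y-%m-%d")

weekly_dfs = []
while startDate <= maxDate:
    endDate = startDate + datetime.timedelta(days=6)
    df_query = """
    select *
    from schema.table
    where start_date between date '{}' and date '{}'
    """.format(startDate.date(), endDate.date())
    weekly_dfs.append(spark.sql(df_query))
    startDate = endDate + datetime.timedelta(days=1)

# unionByName aligns columns by name across the weekly pulls
df = reduce(lambda a, b: a.unionByName(b), weekly_dfs)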

Related

Convert using unixtimestamp to Date

I have a dataframe with a column, createdOn, that holds values like 1632838270314.
I want to convert it to a date like 'yyyy-MM-dd'. I have this so far, but it doesn't work:
date = df['createdOn'].cast(StringType())
df = df.withColumn('date_key',unix_timestamp(date),'yyyy-MM-dd').cast("date"))
createdOn is the field that derives the date_key
The method unix_timestamp() converts a timestamp or date string into the number of seconds since 1970-01-01 ("epoch"). I understand that you want to do the opposite.
Your example value "1632838270314" seems to be milliseconds since epoch.
Here you can simply cast it after converting from milliseconds to seconds:
from pyspark.sql import Row
from pyspark.sql import functions as F

df = spark.createDataFrame([
    Row(unix_in_ms=1632838270314),
])
(
    df
    .withColumn('timestamp_type', (F.col('unix_in_ms') / 1e3).cast('timestamp'))
    .withColumn('date_type', F.to_date('timestamp_type'))
    .withColumn('string_type', F.col('date_type').cast('string'))
    .withColumn('date_to_unix_in_s', F.unix_timestamp('string_type', 'yyyy-MM-dd'))
    .show(truncate=False)
)
# Output
+-------------+-----------------------+----------+-----------+-----------------+
|unix_in_ms   |timestamp_type         |date_type |string_type|date_to_unix_in_s|
+-------------+-----------------------+----------+-----------+-----------------+
|1632838270314|2021-09-28 16:11:10.314|2021-09-28|2021-09-28 |1632780000       |
+-------------+-----------------------+----------+-----------+-----------------+
You can combine the conversion into a single command:
df.withColumn('date_key', F.to_date((F.col('unix_in_ms')/1e3).cast('timestamp')).cast('string'))
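If you are on Spark 3.1 or later, a sketch that avoids the manual division by using the built-in timestamp_millis SQL function (written via F.expr here; check that the function exists in your Spark version):

from pyspark.sql import functions as F

# timestamp_millis() interprets the value as milliseconds since the epoch
df = df.withColumn(
    'date_key',
    F.to_date(F.expr('timestamp_millis(unix_in_ms)')).cast('string')
)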

Add days to a datetime and convert it to a date? - Odoo V14

I want to add 30 days to a datetime field and have my date field take this value as default.
I tried this, but it doesn't work:
from odoo import models, fields, api
from datetime import datetime
from dateutil.relativedelta import relativedelta

class crm_lead(models.Model):
    _inherit = 'crm.lead'

    date_deadline = fields.Datetime(string='Fermeture prévue')

    @api.onchange('create_date')
    def _onchange_enddate(self):
        if self.create_date:
            date_end = (datetime.strptime(self.create_date, '%Y-%m-%d') + relativedelta(days=+30).strftime('%Y-%m-%d'))
            self.date_deadline = date_end.date()
Thanks in advance!
The create_date is a magic field set in the _create method. To default date_deadline to 30 days after creation, you can use the default attribute; at creation time, fields.Date.today() gives the creation date.
Example:
date_deadline = fields.Datetime(default=lambda record: fields.Date.today() + relativedelta(days=30))
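For completeness, a minimal sketch of that default with the import the lambda needs, using fields.Datetime.now() since the field is a Datetime (assuming the dateutil package that ships with Odoo):

from odoo import fields, models
from dateutil.relativedelta import relativedelta

class crm_lead(models.Model):
    _inherit = 'crm.lead'

    # Default the deadline to 30 days after the record is created
    date_deadline = fields.Datetime(
        string='Fermeture prévue',
        default=lambda self: fields.Datetime.now() + relativedelta(days=30),
    )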
Does this work?:
from odoo import models, fields, api
from datetime import datetime, date, timedelta

class crm_lead(models.Model):
    _inherit = 'crm.lead'

    date_deadline = fields.Datetime(string='Fermeture prévue')

    @api.onchange('create_date')
    def _onchange_enddate(self):
        if self.create_date:
            date_end = date.today() + timedelta(days=30)
            self.date_deadline = date_end
It gives the current date + 30 days.
It does not strip the datetime variable, so you might want to do that after.
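If the deadline should be based on create_date specifically, as in the original attempt, a minimal sketch, assuming Odoo 14 where create_date is already a datetime object and needs no strptime:

from odoo import models, fields, api
from datetime import timedelta

class crm_lead(models.Model):
    _inherit = 'crm.lead'

    date_deadline = fields.Datetime(string='Fermeture prévue')

    @api.onchange('create_date')
    def _onchange_enddate(self):
        if self.create_date:
            # create_date is a datetime, so a timedelta can be added directly
            self.date_deadline = self.create_date + timedelta(days=30)

Keep in mind that create_date is only set once the record is saved, so a default value (as in the first answer) may be the more reliable approach.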

pyspark How to filter rows based on HH:mm:ss portion in timestamp column

I have a dataframe in pyspark that has a timestamp string column in the following format:
"11/21/2018 07:21:49 PM"
The time is in 12-hour format with an AM/PM marker.
I want to filter the rows in the dataframe based on only the time portion of this string timestamp regardless of the date. For example I want to keep all rows that fall between the hours of 2:00pm and 4:00pm inclusive.
I tried the code below to extract the HH:mm:ss and use the between function, but it is not working.
# Grabbing only time portion from datetime column
import pyspark.sql.functions as F
time_format = "HH:mm:ss"
split_col = F.split(df['datetime'], ' ')
df = df.withColumn('Time', F.concat(split_col.getItem(1),F.lit(' '),split_col.getItem(2)))
df = df.withColumn('Timestamp', from_unixtime(unix_timestamp('Time', format=time_format)))
df.filter(F.col("Timestamp").between('14:00:00','16:00:00')).show()
Any ideas on how to filter rows based only on the HH:mm:ss portion of a timestamp column, regardless of the actual date, would be much appreciated.
Format your timestamp to HH:mm:ss, then filter using the between clause.
Example:
from pyspark.sql.functions import *

df = spark.createDataFrame([("11/21/2018 07:21:49 PM",), ("11/22/2018 04:21:49 PM",), ("11/23/2018 12:21:49 PM",)], ["ts"])

df.withColumn("tt", from_unixtime(unix_timestamp(col("ts"), "MM/dd/yyyy hh:mm:ss a"), "HH:mm:ss")).\
    filter(col("tt").between("12:00", "16:00")).\
    show(truncate=False)
#+----------------------+--------+
#|ts                    |tt      |
#+----------------------+--------+
#|11/23/2018 12:21:49 PM|12:21:49|
#+----------------------+--------+
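Applied to the 2:00pm to 4:00pm window from the question, a sketch of the same pattern (assuming the ts column from the example above), using full HH:mm:ss strings so the comparison is inclusive of exactly 16:00:00:

from pyspark.sql.functions import col, from_unixtime, unix_timestamp

filtered = df.withColumn(
    "tt", from_unixtime(unix_timestamp(col("ts"), "MM/dd/yyyy hh:mm:ss a"), "HH:mm:ss")
).filter(col("tt").between("14:00:00", "16:00:00"))
filtered.show(truncate=False)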

Pyspark Getting the last date of the previous quarter based on Today's Date

In a Code Repository, using pyspark, I'm trying to take today's date and, based on it, retrieve the last day of the prior quarter. That date would then be used to filter the data in a data frame. I was trying to create a dataframe in a Code Repository and that wasn't working. My code works in Code Workbook. This is my Code Workbook code.
import datetime as dt
import pyspark.sql.functions as F

def unnamed():
    date_df = spark.createDataFrame([(dt.date.today(),)], ['date'])
    date_df = date_df \
        .withColumn('qtr_start_date', F.date_trunc('quarter', F.col('date'))) \
        .withColumn('qtr_date', F.date_sub(F.col('qtr_start_date'), 1))
    return date_df
Any help would be appreciated.
I got the following code to run successfully in a Code Repository:
from transforms.api import transform_df, Input, Output
import datetime as dt
import pyspark.sql.functions as F

@transform_df(
    Output("/my/output/dataset"),
)
def my_compute_function(ctx):
    date_df = ctx.spark_session.createDataFrame([(dt.date.today(),)], ['date'])
    date_df = date_df \
        .withColumn('qtr_start_date', F.date_trunc('quarter', F.col('date'))) \
        .withColumn('qtr_date', F.date_sub(F.col('qtr_start_date'), 1))
    return date_df
You'll need to pass the ctx argument into your transform, and you can make the pyspark.sql.DataFrame directly using the underlying spark_session variable.
If you already have the date column available in your input, you'll just need to make sure it's the Date type so that the F.date_trunc call works on the correct type.
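For that case, a hedged sketch of a transform that reads an input dataset and keeps only rows on or before the prior quarter's last day (the dataset paths and the event_date column are hypothetical placeholders):

from transforms.api import transform_df, Input, Output
import pyspark.sql.functions as F

@transform_df(
    Output("/my/output/filtered_dataset"),
    source_df=Input("/my/input/dataset"),
)
def my_compute_function(source_df):
    # Last day of the quarter prior to today
    prior_qtr_end = F.date_sub(F.date_trunc('quarter', F.current_date()).cast('date'), 1)
    # Keep rows whose (hypothetical) event_date falls on or before that date
    return source_df.filter(F.col('event_date').cast('date') <= prior_qtr_end)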

How to round off a datetime column in pyspark dataframe to nearest quarter

I have a column which has datetime values, for example 01/17/2020 15:55:00. I want to round the time to the nearest quarter of an hour (01/17/2020 16:00:00). Note: please don't answer this question using pandas; I want an answer using only pyspark.
Try this, it will work for you.

from pyspark.sql.functions import hour, round, unix_timestamp

result = data.withColumn("hour", hour((round(unix_timestamp("date") / 3600) * 3600).cast("timestamp")))
Although Spark doesn't have a SQL function that rounds a datetime directly to the quarter of an hour, we can build the column using a handful of functions.
First, create the DataFrame
from pyspark.sql.functions import current_timestamp

dateDF = spark.range(10) \
    .withColumn("today", current_timestamp())
dateDF.show(10, False)
Then, round the minutes to the nearest quarter of an hour (storing the result in a mins column):
from pyspark.sql.functions import minute, hour, col, round, date_trunc, unix_timestamp, to_timestamp

dateDF2 = dateDF.select(col("today"),
                        (round(minute(col("today")) / 15) * 15).cast("int").alias("mins"))
Then, we truncate the timestamp to the hour, convert it to a unix timestamp, add the rounded minutes, and convert it back to the timestamp type:
dateDF2.select(col("today"), to_timestamp(unix_timestamp(date_trunc("hour", col("today"))) + col("mins")*60).alias("truncated_timestamp")).show(10, False)
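An alternative sketch that skips the intermediate mins column: round the unix timestamp to the nearest 900 seconds (15 minutes) and cast it back to a timestamp (assuming the dateDF built above):

from pyspark.sql.functions import col, round, unix_timestamp

dateDF.withColumn(
    "rounded_to_quarter_hour",
    (round(unix_timestamp(col("today")) / 900) * 900).cast("timestamp")
).show(10, False)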
Hope this helps