How to convert Date to timezone aware datetime in polars - python-polars

Let's say I have
import polars as pl

df = pl.DataFrame({
    "date": pl.Series(["2022-01-01", "2022-01-02"]).str.strptime(pl.Date, "%Y-%m-%d")
})
How do I localize that to a specific timezone and make it a datetime?
I tried:
df.select(pl.col('date').cast(pl.Datetime(time_zone='America/New_York')))
but that gives me
shape: (2, 1)
┌────────────────────────────────┐
│ date                           │
│ datetime[μs, America/New_York] │
╞════════════════════════════════╡
│ 2021-12-31 19:00:00 EST        │
│ 2022-01-01 19:00:00 EST        │
└────────────────────────────────┘
so it looks like it's starting from the presumption that the naïve datetimes are UTC and then applying the conversion. I set os.environ['TZ']='America/New_York' but I got the same result.
I looked through the polars config options in the API guide to see if there's something else to set but couldn't find anything about default timezone.

As of polars 0.16.3, you can do:
df.select(
    pl.col('date').cast(pl.Datetime).dt.replace_time_zone("America/New_York")
)
In previous versions (after 0.14.24), the syntax was
df.select(
    pl.col('date').cast(pl.Datetime).dt.tz_localize("America/New_York")
)
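Putting the current syntax together, a minimal sketch with the result you should see (the exact repr may differ slightly between polars versions; replace_time_zone keeps the wall-clock time and only attaches the zone):
import polars as pl

df = pl.DataFrame({
    "date": pl.Series(["2022-01-01", "2022-01-02"]).str.strptime(pl.Date, "%Y-%m-%d")
})
df.select(
    pl.col("date").cast(pl.Datetime).dt.replace_time_zone("America/New_York")
)
# shape: (2, 1)
# date: datetime[μs, America/New_York]
# 2022-01-01 00:00:00 EST
# 2022-01-02 00:00:00 EST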

Related

Spark - Strict validate Date against format

I want to validate dates in a file against a format specified by the user; if a date does not exactly match the format, I need to set a False flag.
I am using Spark 3.1.1.
sample code:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()
schema = StructType([
    StructField("date", StringType(), True),
    StructField("active", StringType(), True),
])
input_data = [
("Saturday November 2012 10:45:42.720+0100",'Y'),
("Friday April 2022 10:45:42.720-0800",'Y'),
("Friday April 20225 10:45:42.720-0800",'Y'),
("Friday April 202 10:45:42.720-0800",'Y'),
("Friday April 20 10:45:42.720-0800",'Y'),
("Friday April 1 10:45:42.720-0800",'Y'),
("Friday April 0 10:45:42.720-0800",'Y'),
]
date_format = "EEEE MMMM yyyy HH:mm:ss.SSSZ"
temp_df = spark.createDataFrame(data=input_data,schema=schema)
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
df = temp_df.select(
    '*',
    f.when(f.date_format(f.to_timestamp(f.col('date'), date_format), date_format).isNotNull(), True)
        .otherwise(False).alias('Date_validation'),
    f.date_format(f.to_timestamp(f.col('date'), date_format), date_format).alias('converted_date'),
)
df.show(truncate=False)
This gives output in which the years are not strictly validated.
When I instead try
spark.sql("set spark.sql.legacy.timeParserPolicy=CORRECTED")
I get this exception:
py4j.protocol.Py4JJavaError: An error occurred while calling o52.showString.
: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'EEEE MMMM yyyy HH:mm:ss.SSSZ' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
Caused by: java.lang.IllegalArgumentException: Illegal pattern character: E
As per the documentation: symbols of 'E', 'F', 'q' and 'Q' can only be used for datetime formatting, e.g. date_format. They are not allowed to be used for datetime parsing, e.g. to_timestamp.
Is there any way I could get the best of both versions to handle this scenario? (A possible workaround is sketched after the expectation list below.)
Expectation:
EEEE -> should handle the day of the week
YYYY -> strictly a 4-digit year
YY -> strictly a 2-digit year
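One possible workaround (a sketch under assumptions, not a definitive answer): keep LEGACY parsing so the textual day and month names still work, and enforce the year width separately with a regular-expression shape check. The regex below is my own guess at the expected shape based on the sample rows (it requires exactly four year digits); adjust it if, say, two-digit years must also pass.
from pyspark.sql import functions as f

# weekday, month name, exactly 4-digit year, HH:mm:ss.SSS and a +HHMM/-HHMM offset
strict_shape = r"^[A-Za-z]+ [A-Za-z]+ \d{4} \d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4}$"

df = temp_df.select(
    "*",
    (
        f.to_timestamp(f.col("date"), date_format).isNotNull()  # parsable under the LEGACY policy
        & f.col("date").rlike(strict_shape)                     # and the year is exactly 4 digits
    ).alias("Date_validation"),
)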

Apache Spark How to convert a datetime from Australia/Melbourne time to UTC?

How do I convert a datetime string like 21/10/2021 15:15:28 in Australia/Melbourne time to UTC in Apache Spark in Scala?
Try this (you can of course ignore the creation of the data; I added it so you can follow the flow, the column names and types):
Seq("21/10/2021 15:15:28").toDF("timeStr")
.withColumn("australiaMelbourneTime", to_timestamp(col("timeStr"), "dd/MM/yyyy HH:mm:ss"))
.withColumn("utcTime", to_utc_timestamp(col("australiaMelbourneTime"), "Australia/Melbourne"))
Output (tested):
+-------------------+----------------------+-------------------+
| timeStr|australiaMelbourneTime| utcTime|
+-------------------+----------------------+-------------------+
|21/10/2021 15:15:28| 2021-10-21 15:15:28|2021-10-21 04:15:28|
+-------------------+----------------------+-------------------+
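If anyone needs the same thing from PySpark, an equivalent sketch (assumptions: an active SparkSession named spark, and the same column names as above):
from pyspark.sql import functions as F

df = spark.createDataFrame([("21/10/2021 15:15:28",)], ["timeStr"])
df = (
    df.withColumn("australiaMelbourneTime", F.to_timestamp("timeStr", "dd/MM/yyyy HH:mm:ss"))
      .withColumn("utcTime", F.to_utc_timestamp("australiaMelbourneTime", "Australia/Melbourne"))
)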

date format function MMM YYYY in spark sql returning inaccurate values

I'm trying to get the month and year out of a date, but something is wrong in the output for the month-year December 2020 only: it returns December 2021 instead of December 2020.
In the cancelation_year column I get the year using year(last_order_date), and it returns the year correctly.
In the cancelation_month_year column I used date_format(last_order_date, 'MMMM YYYY'), and it only returns the wrong value for December 2020.
The problem is the uppercase 'YYYY', which is the week-based year: the last few days of December can fall into the first week of the following year, so December 2020 gets reported as 2021. Use lowercase 'yyyy' (the calendar year) instead:
from pyspark.sql import functions as F

data = [{"dt": "12/27/2020 5:11:53 AM"}]
df = spark.createDataFrame(data)
df.withColumn("ts_new", F.date_format(F.to_date("dt", "M/d/y h:m:s 'AM'"), "MMMM yyyy")).show()
+--------------------+-------------+
| dt| ts_new|
+--------------------+-------------+
|12/27/2020 5:11:5...|December 2020|
+--------------------+-------------+
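Applied to the column from the question, the fix is just the lowercase year letters (a minimal sketch; it assumes last_order_date is already a date or timestamp column):
from pyspark.sql import functions as F

df = df.withColumn(
    "cancelation_month_year",
    F.date_format(F.col("last_order_date"), "MMMM yyyy"),  # 'yyyy' is the calendar year, unlike week-based 'YYYY'
)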

spark scala how can I calculate days since 1970-01-01

Looking for scala code to replicate https://www.epochconverter.com/seconds-days-since-y0
I have a Spark streaming job reading Avro messages. The message has a column of type int that holds the number of days since 1970-01-01. I want to convert that to a date.
dataFrame.select(from_avro(col("Value"), valueRegistryConfig) as 'value)
.select("value.*")
.withColumn("start_date",'start_date)
start_date holds an integer value like 18022, i.e. days since 1970-01-01. I want to convert this value to a date:
18022 -> Sun May 05 2019
Use 1970-01-01 as the base date and pass the number of days to the date_add function.
This gives a date one day later than you expect (your mapping treats 1970-01-01 as day 1, while date_add counts 1970-01-01 as day 0), so subtract 1.
Something like this:
import org.apache.spark.sql.functions._
import spark.implicits._

val dataDF = Seq(("1970-01-01", 18091), ("1970-01-01", 18021), ("1970-01-01", 18022)).toDF("date", "num")
dataDF.select(
  col("date"),
  expr("date_add(date, num - 1)").as("date_add")
).show(10, false)
+----------+----------+
|date |date_add |
+----------+----------+
|1970-01-01|2019-07-13|
|1970-01-01|2019-05-04|
|1970-01-01|2019-05-05|
+----------+----------+
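For comparison, a PySpark sketch of the same idea (assumptions: the count is in a column named start_date and, as in the question, 1970-01-01 counts as day 1, hence the - 1; drop the - 1 if your data uses the standard epoch-day convention where 1970-01-01 is day 0):
from pyspark.sql import functions as F

df = df.withColumn(
    "start_date",
    F.expr("date_add('1970-01-01', start_date - 1)"),  # 18022 -> 2019-05-05 under the day-1 convention
)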

How to get week start date in scala

I wrote the code below to get the Monday date for a given date. Basically, I created a UDF that takes a date and returns its Monday date.
import java.text.SimpleDateFormat
import java.util.Calendar
import org.apache.spark.sql.functions.udf

def calculate_weekstartUDF = udf((pro_rtc: String) => {
  val parsed = new SimpleDateFormat("yyyy-MM-dd").parse(pro_rtc)
  val cal = Calendar.getInstance()
  cal.setTime(parsed)
  cal.set(Calendar.DAY_OF_WEEK, Calendar.MONDAY)
  // Get this Monday's date and return it as a java.sql.Date so Spark can handle it
  new java.sql.Date(cal.getTimeInMillis)
})
I use the above UDF in the code below:
flattendedJSON.withColumn("weekstartdate", calculate_weekstartUDF($"pro_rtc")).show()
Is there any better way to achieve this?
Try this approach using the date_sub and next_day functions in Spark.
Explanation:
date_sub(
  next_day('dt, "monday"), // get the next Monday date
  7)                       // subtract a week from that date
Example:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq("2019-08-06").toDF("dt")
df.withColumn("week_strt_day", date_sub(next_day('dt, "monday"), 7)).show()
Result:
+----------+-------------+
| dt|week_strt_day|
+----------+-------------+
|2019-08-06| 2019-08-05|
+----------+-------------+
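The same approach from PySpark, in case it's useful (a sketch assuming a DataFrame df with a date column dt):
from pyspark.sql import functions as F

df.withColumn("week_strt_day", F.date_sub(F.next_day("dt", "monday"), 7)).show()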
You could use the Java 8 Date API :
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.temporal.{TemporalField, WeekFields}
import java.util.Locale
def calculate_weekstartUDF =
  (pro_rtc: String) => {
    val localDate = LocalDate.parse(pro_rtc) // by default parses an ISO yyyy-MM-dd string
    val dayOfWeekField = WeekFields.of(Locale.getDefault).dayOfWeek()
    localDate.`with`(dayOfWeekField, 1)
  }
Of course, specify something other than Locale.getDefault if you want to use another locale.
tl;dr
LocalDate
.parse( "2019-01-23" )
.with(
TemporalAdjusters.previous( DayOfWeek.MONDAY )
)
.toString()
2019-01-21
Avoid legacy date-time classes
You are using terrible date-time classes that were supplanted years ago by the modern java.time classes defined in JSR 310.
java.time
Your input string is in standard ISO 8601 format. The java.time classes use these standard formats by default when parsing/generating strings. So no need to specify a formatting pattern.
Here is Java-syntax example code. (I don't know Scala)
LocalDate ld = LocalDate.parse( "2019-01-23" ) ;
To move from that date to another, use a TemporalAdjuster. You can find several in the TemporalAdjusters class.
Specify a day of the week using the DayOfWeek enum, which predefines seven objects, one for each day of the week.
TemporalAdjuster ta = TemporalAdjusters.previous( DayOfWeek.MONDAY ) ;
LocalDate previousMonday = ld.with( ta ) ;
See this code run live at IdeOne.com.
Monday, January 21, 2019
If the starting date happened to be a Monday, and you want to stay with that, use the alternate adjuster, previousOrSame.
Try this:
In my example, 'pro_rtc' is in seconds. Adjust if needed.
import org.apache.spark.sql.functions._
dataFrame
.withColumn("Date", to_date(from_unixtime(col("pro_rtc"))))
.withColumn("Monday", expr("date_sub(Date, dayofweek(Date) - 2)"))
That way you're also utilizing Spark's query engine and avoiding the latency of a UDF.
The spark-daria beginningOfWeek and endOfWeek functions are the easiest way to solve this problem. They're also the most flexible because they can easily be configured for different week end dates.
Suppose you have this dataset:
+----------+
| some_date|
+----------+
|2020-12-27|
|2020-12-28|
|2021-01-03|
|2020-12-12|
| null|
+----------+
Here's how to compute the beginning of the week and the end of the week, assuming the week ends on a Wednesday:
import com.github.mrpowers.spark.daria.sql.functions._
df
.withColumn("end_of_week", endOfWeek(col("some_date"), "Wed"))
.withColumn("beginning_of_week", beginningOfWeek(col("some_date"), "Wed"))
.show()
Here are the results:
+----------+-----------+-----------------+
| some_date|end_of_week|beginning_of_week|
+----------+-----------+-----------------+
|2020-12-27| 2020-12-30| 2020-12-24|
|2020-12-28| 2020-12-30| 2020-12-24|
|2021-01-03| 2021-01-06| 2020-12-31|
|2020-12-12| 2020-12-16| 2020-12-10|
| null| null| null|
+----------+-----------+-----------------+
See this file for the underlying implementations. This post explains these functions in greater detail.