Spark - Strictly validate date against format - Scala

I want to validate dates in a file against a format specified by the user; if a date does not exactly match the format, I need to set a False flag.
I am using Spark 3.1.1.
Sample code:

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("date", StringType(), True),
    StructField("active", StringType(), True),
])

input_data = [
    ("Saturday November 2012 10:45:42.720+0100", 'Y'),
    ("Friday April 2022 10:45:42.720-0800", 'Y'),
    ("Friday April 20225 10:45:42.720-0800", 'Y'),
    ("Friday April 202 10:45:42.720-0800", 'Y'),
    ("Friday April 20 10:45:42.720-0800", 'Y'),
    ("Friday April 1 10:45:42.720-0800", 'Y'),
    ("Friday April 0 10:45:42.720-0800", 'Y'),
]

date_format = "EEEE MMMM yyyy HH:mm:ss.SSSZ"

temp_df = spark.createDataFrame(data=input_data, schema=schema)
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")

df = temp_df.select(
    '*',
    f.when(f.date_format(f.to_timestamp(f.col('date'), date_format), date_format).isNotNull(), True)
        .otherwise(False)
        .alias('Date_validation'),
    f.date_format(f.to_timestamp(f.col('date'), date_format), date_format).alias('converted_date'),
)
df.show(truncate=False)
This gives output in which the years are not strictly validated.
When I instead try
spark.sql("set spark.sql.legacy.timeParserPolicy=CORRECTED")
I am faced with this exception:
py4j.protocol.Py4JJavaError: An error occurred while calling o52.showString.
: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'EEEE MMMM yyyy HH:mm:ss.SSSZ' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
Caused by: java.lang.IllegalArgumentException: Illegal pattern character: E
As per the documentation: symbols of 'E', 'F', 'q' and 'Q' can only be used for datetime formatting, e.g. date_format. They are not allowed for datetime parsing, e.g. to_timestamp.
Is there any way I could use the best of both versions to handle this scenario?
Expectation:
EEEE -> should be able to handle day-of-week names
YYYY -> strictly a 4-digit year
YY -> strictly a 2-digit year
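
One possible workaround (my own sketch, not an official API): stay on the LEGACY policy so that EEEE still parses, and enforce the strict field widths the legacy parser is lax about with an explicit regex check:

from pyspark.sql import functions as f

spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")

# Regex mirroring "EEEE MMMM yyyy HH:mm:ss.SSSZ" with strict widths:
# a day name, a month name, exactly four year digits, two-digit time
# fields, three fractional digits, and a numeric zone offset.
strict_re = r"^[A-Za-z]+ [A-Za-z]+ \d{4} \d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4}$"

df = temp_df.select(
    '*',
    (f.col('date').rlike(strict_re)
     & f.to_timestamp(f.col('date'), date_format).isNotNull()).alias('Date_validation'),
)
df.show(truncate=False)

The regex rejects the five- and three-digit years, while the legacy to_timestamp still verifies that the day and month names are real values.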

Related

PySpark - extract first Monday of week

Given this DataFrame:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data2 = [
    ("James", 202245),
    ("Michael", 202133),
    ("Robert", 202152),
    ("Maria", 202252),
    ("Jen", 202201),
]
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("Week", IntegerType(), True),
])
df = spark.createDataFrame(data=data2, schema=schema)
df.printSchema()
df.show(truncate=False)
The Week column denotes the week number and year, i.e. 202245 is the 45th week of 2022. I would like to extract the date on which Monday falls for the 45th week of 2022, which is 7 Nov 2022.
What I tried, using datetime and a UDF:
import datetime

def get_monday_from_week(x: int) -> datetime.date:
    """
    Converts a fiscal week like 202245 to the date of that week's Monday.

    Args:
        x (int): fiscal week

    Returns:
        datetime.date: date of the Monday of that week
    """
    # %W counts weeks from the first Monday of the year; "-1" selects Monday.
    return datetime.datetime.strptime(str(x) + "-1", "%Y%W-%w").date()
How can I implement this with Spark functions? I am trying to avoid the use of a UDF here.
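
The question is left unanswered here, but a UDF-free sketch is possible with built-in date functions, assuming the %W convention from the attempt above (week 1 starts at the first Monday of the year):

from pyspark.sql import functions as F

# Split 202245 into year (Week div 100) and week number (Week % 100),
# anchor on Jan 1 of that year, find the first Monday of the year
# (next_day is strictly-after, so step back one day to count Jan 1
# itself when it falls on a Monday), then jump (week - 1) * 7 days ahead.
result = df.withColumn(
    "monday",
    F.expr("""
        date_add(
            next_day(date_sub(make_date(Week div 100, 1, 1), 1), 'Mon'),
            (Week % 100 - 1) * 7
        )
    """),
)
result.show(truncate=False)

For 202245 this yields 2022-11-07, matching the expected 7 Nov 2022.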

date format function MMM YYYY in spark sql returning inaccurate values

I'm trying to get the month and year out of a date, but there's something wrong in the output only for December 2020: it returns December 2021 instead of December 2020.
In the cancelation_year column I got the year using year(last_order_date), and it returns the year correctly.
In the cancelation_month_year column I used date_format(last_order_date, 'MMMM YYYY'), and it returns the wrong value only for December 2020.
Uppercase YYYY is the week-based year: 12/27/2020 falls in the week containing January 1st 2021, which counts as week 1 of 2021, so it formats as 2021. Use lowercase yyyy instead:

from pyspark.sql import functions as F

data = [{"dt": "12/27/2020 5:11:53 AM"}]
df = spark.createDataFrame(data)
df.withColumn("ts_new", F.date_format(F.to_date("dt", "M/d/y h:m:s 'AM'"), "MMMM yyyy")).show()
+--------------------+-------------+
| dt| ts_new|
+--------------------+-------------+
|12/27/2020 5:11:5...|December 2020|
+--------------------+-------------+

Split date into day of the week, month, year using PySpark

I have very little experience in PySpark, and I am trying with no success to create 3 new columns from a column that contains the timestamp of each row.
The column containing the date has the following format: EEE MMM dd HH:mm:ss Z yyyy.
So it looks like this:
+--------------------+
| timestamp|
+--------------------+
|Fri Oct 18 17:07:...|
|Mon Oct 21 21:49:...|
|Thu Oct 31 18:03:...|
|Sun Oct 20 15:00:...|
|Mon Sep 30 23:35:...|
+--------------------+
The 3 columns have to contain: the day of the week as an integer (so 0 for Monday, 1 for Tuesday, ...), the number of the month, and the year.
What is the most effective way to create these 3 additional columns and append them to the PySpark dataframe? Thanks in advance!
Spark 1.5 and higher has many date processing functions. Here are some that may be useful for you:

from pyspark.sql.functions import col, dayofweek, month, year

df = df.withColumn('dayOfWeek', dayofweek(col('your_date_column')))
df = df.withColumn('month', month(col('your_date_column')))
df = df.withColumn('year', year(col('your_date_column')))
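
Note that the timestamp column in the question is still a string, and dayofweek numbers days 1 = Sunday through 7 = Saturday rather than 0 = Monday. A fuller sketch (assuming the LEGACY parser policy, since EEE is a formatting-only symbol for Spark 3's default parser):

from pyspark.sql import functions as F

spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")

# Parse the raw string into a proper timestamp first.
df = df.withColumn("ts", F.to_timestamp("timestamp", "EEE MMM dd HH:mm:ss Z yyyy"))

# dayofweek is 1 = Sunday .. 7 = Saturday; shift to 0 = Monday .. 6 = Sunday.
df = (df
      .withColumn("day_of_week", (F.dayofweek("ts") + 5) % 7)
      .withColumn("month", F.month("ts"))
      .withColumn("year", F.year("ts")))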

Spark date format issue

I have observed weird behavior in Spark date formatting. I need to convert a yy date to yyyy, so after conversion the year should be 20yy.
I tried as below; it fails after the year 2040.
import org.apache.spark.sql.functions._
val df = Seq("06/03/35", "07/24/40", "11/15/43", "12/15/12", "11/15/20", "12/12/22").toDF("Date")
df.withColumn("newdate", from_unixtime(unix_timestamp($"Date", "mm/dd/yy"), "mm/dd/yyyy")).show
+--------+----------+
| Date| newdate|
+--------+----------+
|06/03/35|06/03/2035|
|07/24/40|07/24/2040|
|11/15/43|11/15/1943| // here the year was prefixed with 19
|12/15/12|12/15/2012|
|11/15/20|11/15/2020|
|12/12/22|12/12/2022|
+--------+----------+
Why this behavior? Is there any date utility function that I can use directly, without prepending 20 to the string date?
Parsing 2-digit year strings is subject to some relative interpretation that is documented in the SimpleDateFormat docs:
For parsing with the abbreviated year pattern ("y" or "yy"), SimpleDateFormat must interpret the abbreviated year relative to some century. It does this by adjusting dates to be within 80 years before and 20 years after the time the SimpleDateFormat instance is created. For example, using a pattern of "MM/dd/yy" and a SimpleDateFormat instance created on Jan 1, 1997, the string "01/11/12" would be interpreted as Jan 11, 2012 while the string "05/04/64" would be interpreted as May 4, 1964.
So, 2043 being more than 20 years away, the parser uses 1943 as documented.
Here's one approach that uses a UDF that explicitly calls set2DigitYearStart on a SimpleDateFormat object before parsing the date (I picked 1980 just as an example):

import java.text.SimpleDateFormat
import java.util.Calendar
import java.sql.Date

def parseDate(date: String, pattern: String): Date = {
  val format = new SimpleDateFormat(pattern)
  val cal = Calendar.getInstance()
  cal.set(Calendar.YEAR, 1980)
  // All 2-digit years now resolve to the 100-year window starting in 1980.
  format.set2DigitYearStart(cal.getTime())
  new Date(format.parse(date).getTime)
}
And then (note that the pattern "mm/dd/yy" carried over from the question uses mm, minutes, where MM, month, was intended, which is why every month in the output below resolves to 01):

val custom_to_date = udf(parseDate _)
df.withColumn("newdate", custom_to_date($"Date", lit("mm/dd/yy"))).show(false)
+--------+----------+
|Date |newdate |
+--------+----------+
|06/03/35|2035-01-03|
|07/24/40|2040-01-24|
|11/15/43|2043-01-15|
|12/15/12|2012-01-15|
|11/15/20|2020-01-15|
|12/12/22|2022-01-12|
+--------+----------+
Knowing your data, you would know which value to pick for the parameter to set2DigitYearStart().
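
As a side note, Spark 3's CORRECTED parser resolves a two-digit yy pattern against a base year of 2000 (parsing always yields a year in 2000-2099), so on Spark 3 the UDF may be unnecessary. A quick PySpark sketch to verify, using the intended MM month pattern:

from pyspark.sql import functions as F

spark.sql("set spark.sql.legacy.timeParserPolicy=CORRECTED")

df = spark.createDataFrame([("06/03/35",), ("11/15/43",)], ["Date"])
# With the non-legacy parser, 43 resolves to 2043 without any UDF.
df.withColumn("newdate", F.to_date("Date", "MM/dd/yy")).show()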

converting specific string format to date in sparksql

I have a column that contains the following date as a string: Sat Sep 14 09:54:30 UTC 2019. I am not familiar with this format at all.
I need to convert it to a date or timestamp, just a unit that I can compare against. I only need a point of comparison with a precision of one day.
This can help you get the timestamp from your string; you can then get the days from it using Spark SQL (2.x):

spark.sql("""
    SELECT from_utc_timestamp(
        from_unixtime(unix_timestamp("Sat Sep 14 09:54:30 UTC 2019",
                                     "EEE MMM dd HH:mm:ss zzz yyyy")),
        "IST") AS timestamp
""").show()
+-------------------+
| timestamp|
+-------------------+
|2019-09-14 20:54:30|
+-------------------+
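
Since only day precision is needed, a possible follow-up (my sketch, same input format, Spark 2.x or the legacy parser assumed) is to wrap the result in to_date so rows compare as plain dates:

spark.sql("""
    SELECT to_date(
        from_unixtime(unix_timestamp("Sat Sep 14 09:54:30 UTC 2019",
                                     "EEE MMM dd HH:mm:ss zzz yyyy"))
    ) AS day
""").show()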