Converting a specific string format to a date in Spark SQL

I have a column that contains a date as a string, e.g. Sat Sep 14 09:54:30 UTC 2019. I'm not familiar with this format at all.
I need to convert it to a date or timestamp, just some value I can compare against; a precision of one day is enough.

This can help you get a timestamp from your string, and from that you can get the day, using Spark SQL (2.x):
spark.sql("""SELECT from_utc_timestamp(from_unixtime(unix_timestamp("Sat Sep 14 09:54:30 UTC 2019","EEE MMM dd HH:mm:ss zzz yyyy") ),"IST")as timestamp""").show()
+-------------------+
| timestamp|
+-------------------+
|2019-09-14 20:54:30|
+-------------------+
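If the value sits in a DataFrame column rather than a literal, the same pattern can be used with to_timestamp and then truncated to a date for day-level comparisons. A minimal PySpark sketch, assuming an illustrative column name event_time (on Spark 3.x this SimpleDateFormat-style pattern may require spark.sql.legacy.timeParserPolicy to be set to LEGACY):
from pyspark.sql.functions import col, to_date, to_timestamp

# Spark 3's stricter parser may reject the weekday/zone symbols; if so:
# spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df = spark.createDataFrame([("Sat Sep 14 09:54:30 UTC 2019",)], ["event_time"])

df = (df
      .withColumn("event_ts", to_timestamp(col("event_time"), "EEE MMM dd HH:mm:ss zzz yyyy"))
      .withColumn("event_day", to_date(col("event_ts"))))  # day-level precision

df.filter(col("event_day") >= "2019-09-01").show(truncate=False)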


PySpark: Convert Date

How can I convert this date column to a date type so that I can eventually format it as yyyy-MM-dd? Similar questions, such as Convert string of format MMM d yyyy hh:mm AM/PM to date using Pyspark, did not solve it.
df = spark.createDataFrame(sc.parallelize([
    ['Wed Sep 30 21:06:00 1998'],
    ['Fri Apr 1 08:37:00 2022'],
]), ['Date'])
+--------------------+
| Date|
+--------------------+
|Wed Sep 30 21:06:...|
|Fri Apr 1 08:37:...|
+--------------------+
# fail
df.withColumn('Date', F.to_date(F.col('Date'), "DDD MMM dd hh:mm:ss yyyy")).show()
I think you are using the wrong symbols for day-of-week and hour (DDD is day-of-year and hh is the 12-hour clock); try this one:
from pyspark.sql.functions import to_date
df = spark.createDataFrame([('Wed Sep 30 21:06:00 1998',), ('Fri Apr 1 08:37:00 2022',)], 'Date: string')
df.withColumn('Date', to_date('Date', "E MMM dd HH:mm:ss yyyy")).show()
+----------+
| Date|
+----------+
|1998-09-30|
|2022-04-01|
+----------+
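If a yyyy-MM-dd string (rather than a date column, which show already renders that way) is what you ultimately need, date_format can be chained on the parsed value. A brief sketch, assuming the same parser behaviour as the answer above:
from pyspark.sql.functions import date_format, to_date

# Parse with the corrected pattern, then render the date back as a yyyy-MM-dd string.
df.withColumn('Date', date_format(to_date('Date', 'E MMM dd HH:mm:ss yyyy'), 'yyyy-MM-dd')).show()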

date format function MMM YYYY in spark sql returning inaccurate values

I'm trying to get a month-year value out of a date, but something is wrong in the output for one month only: for December 2020 it returns December 2021 instead of December 2020.
In the cancelation_year column I got the year using year(last_order_date), and it returns the year correctly.
In the cancelation_month_year column I used date_format(last_order_date, 'MMMM YYYY'), and it only returns a wrong value for December 2020.
from pyspark.sql import functions as F
data = [{"dt": "12/27/2020 5:11:53 AM"}]
df = spark.createDataFrame(data)
df.withColumn("ts_new", F.date_format(F.to_date("dt", "M/d/y h:m:s 'AM'"), "MMMM yyyy")).show()
+--------------------+-------------+
| dt| ts_new|
+--------------------+-------------+
|12/27/2020 5:11:5...|December 2020|
+--------------------+-------------+
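The likely cause is the pattern letter itself: uppercase YYYY is the week-based year, and 2020-12-27 (a Sunday) starts the first week of week-year 2021 under the default US week rules, whereas lowercase yyyy is the calendar year. A small illustrative sketch contrasting the two on that date (assumption: the legacy parser policy reproduces the SimpleDateFormat behaviour described in the question, since Spark 3's default parser may reject week-based patterns such as Y outright):
from pyspark.sql import functions as F

# Assumption: legacy parser policy reproduces the behaviour described in the question.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

df = spark.createDataFrame([("2020-12-27",)], ["last_order_date"]) \
    .withColumn("last_order_date", F.to_date("last_order_date"))

df.select(
    F.date_format("last_order_date", "MMMM YYYY").alias("week_year"),      # December 2021
    F.date_format("last_order_date", "MMMM yyyy").alias("calendar_year"),  # December 2020
).show()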

Split date into day of the week, month, year using Pyspark

I have very little experience with Pyspark and I am trying, with no success, to create 3 new columns from a column that contains the timestamp of each row.
The column containing the date has the following format: EEE MMM dd HH:mm:ss Z yyyy.
So it looks like this:
+--------------------+
| timestamp|
+--------------------+
|Fri Oct 18 17:07:...|
|Mon Oct 21 21:49:...|
|Thu Oct 31 18:03:...|
|Sun Oct 20 15:00:...|
|Mon Sep 30 23:35:...|
+--------------------+
The 3 columns have to contain: the day of the week as an integer (so 0 for Monday, 1 for Tuesday, ...), the number of the month, and the year.
What is the most effective way to create these additional 3 columns and append them to the pyspark dataframe? Thanks in advance!!
Spark 1.5 and higher has many date-processing functions. Here are some that may be useful for you:
from pyspark.sql.functions import col, year, month, dayofweek

# These functions expect a date or timestamp column.
df = df.withColumn('dayOfWeek', dayofweek(col('your_date_column')))
df = df.withColumn('month', month(col('your_date_column')))
df = df.withColumn('year', year(col('your_date_column')))
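Because the column in this question is still a raw string in EEE MMM dd HH:mm:ss Z yyyy form, it has to be parsed before the functions above can be applied, and dayofweek numbers the days 1 = Sunday through 7 = Saturday rather than 0 = Monday. A sketch under those assumptions (column name timestamp as in the question; on Spark 3.x this pattern may need the legacy time parser policy):
from pyspark.sql.functions import col, to_timestamp, dayofweek, month, year

# Parse the raw string first; Spark 3 may require:
# spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df = df.withColumn("ts", to_timestamp(col("timestamp"), "EEE MMM dd HH:mm:ss Z yyyy"))

# dayofweek gives 1 = Sunday ... 7 = Saturday; remap so 0 = Monday ... 6 = Sunday.
df = (df
      .withColumn("dayOfWeek", (dayofweek(col("ts")) + 5) % 7)
      .withColumn("month", month(col("ts")))
      .withColumn("year", year(col("ts"))))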

How to load date with custom format in Spark

I have a scenario where I have a column with data like "Tuesday, 09-Aug-11 21:13:26 GMT" and I want to create a schema in Spark, but the datatypes TimestampType and DateType are not able to recognize this date format.
After loading the data to a dataframe using TimestampType or DateType I am seeing NULL values in that particular column.
Is there any alternative for this?
One option is to read "Tuesday, 09-Aug-11 21:13:26 GMT" as a string-type column and then transform it from string to timestamp, something like below.
df.show(truncate=false)
+-------------------------------+
|dt |
+-------------------------------+
|Tuesday, 09-Aug-11 21:13:26 GMT|
+-------------------------------+
df.withColumn("dt",to_timestamp(col("dt"),"E, d-MMM-y H:m:s z")).show(truncate=false) //Note - It is converted GMT to IST local timezone.
+-------------------+
|dt |
+-------------------+
|2011-08-10 02:43:26|
+-------------------+
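If the shift to the local (IST) session time zone is not wanted, the session time zone can be pinned to UTC before converting. A minimal PySpark sketch of the same idea, keeping the answer's pattern (on Spark 3.x it may need spark.sql.legacy.timeParserPolicy set to LEGACY):
from pyspark.sql.functions import col, to_timestamp

spark.conf.set("spark.sql.session.timeZone", "UTC")  # keep results in UTC instead of IST

df = spark.createDataFrame([("Tuesday, 09-Aug-11 21:13:26 GMT",)], ["dt"])
df.withColumn("dt", to_timestamp(col("dt"), "E, d-MMM-y H:m:s z")).show(truncate=False)
# should now show 2011-08-09 21:13:26 rather than the IST-shifted value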

spark scala how can I calculate days since 1970-01-01

Looking for Scala code to replicate https://www.epochconverter.com/seconds-days-since-y0
I have a Spark Streaming job reading Avro messages. The message has a column of type int that holds the number of days since 1970-01-01, and I want to convert it to a date.
dataFrame.select(from_avro(col("Value"), valueRegistryConfig) as 'value)
  .select("value.*")
  .withColumn("start_date", 'start_date)
start_date holds an integer value like 18022, i.e. days since 1970-01-01. I want to convert this value to a date:
18022 -> Sun May 05 2019
Use 1970-01-01 as the base date and pass the number of days to the date_add function.
This will give you a date, but it will be one day ahead, so subtract 1.
Something like this:
var dataDF = Seq(("1970-01-01", 18091), ("1970-01-01", 18021), ("1970-01-01", 18022)).toDF("date", "num")
dataDF.select(
  col("date"),
  expr("date_add(date, num - 1)").as("date_add")
).show(10, false)
+----------+----------+
|date |date_add |
+----------+----------+
|1970-01-01|2019-07-13|
|1970-01-01|2019-05-04|
|1970-01-01|2019-05-05|
+----------+----------+
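If the cluster is on Spark 3.1 or later, the dummy 1970-01-01 column is not needed at all: the date_from_unix_date SQL function builds a date directly from a days-since-epoch integer. A minimal sketch, shown in PySpark for brevity (the same expr works from Scala); subtract 1 from num first if you want to match the adjusted output above:
from pyspark.sql.functions import expr

df = spark.createDataFrame([(18091,), (18021,), (18022,)], ["num"])

# date_from_unix_date interprets the integer as days since 1970-01-01 (Spark 3.1+),
# so 18022 maps to 2019-05-06; use num - 1 to get 2019-05-05 as in the answer above.
df.withColumn("start_date", expr("date_from_unix_date(num)")).show()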