Converting string time to day timestamp - pyspark

I have just started working with PySpark, and need some help converting a column's datatype.
My dataframe has a string column, which stores the time of day in AM/PM, and I need to convert this into datetime for further processing/analysis.
fd = spark.createDataFrame([(['0143A'])], ['dt'])
fd.show()
+-----+
| dt|
+-----+
|0143A|
+-----+
from pyspark.sql.functions import date_format, to_timestamp
#fd.select(date_format('dt','hhmma')).show()
fd.select(to_timestamp('dt','hhmmaa')).show()
+----------------------------+
|to_timestamp(`dt`, 'hhmmaa')|
+----------------------------+
| null|
+----------------------------+
Expected output: 01:43
How can I get the proper datetime format in the above scenario?
Thanks for your help!

If we look at the doc for to_timestamp (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.to_timestamp) we see that the format must be specified as a SimpleDateFormat (https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html).
In order to retrieve the time of the day in AM/PM, we must use hhmma. But in SimpleDateFormat, a matches AM or PM, not A or P. So we need to change our string:
import pyspark.sql.functions as F
df = spark.createDataFrame([(['0143A'])], ['dt'])
df2 = df.withColumn('dt', F.concat(F.col('dt'), F.lit('M')))
df3 = df2.withColumn('ts', F.to_timestamp('dt','hhmma'))
df3.show()
+------+-------------------+
| dt| ts|
+------+-------------------+
|0143AM|1970-01-01 01:43:00|
+------+-------------------+
If you want to retrieve it as a string in the format you mentioned, you can use date_format:
df4 = df3.withColumn('time', F.date_format(F.col('ts'), format='HH:mm'))
df4.show()
+------+-------------------+-----+
| dt| ts| time|
+------+-------------------+-----+
|0143AM|1970-01-01 01:43:00|01:43|
+------+-------------------+-----+

Related

Convert date from "yyyy/mm/dd" format to "M/d/yyyy" format in pyspark dataframe

I am reading a table into a dataframe which has a column "day_dt" in the date format "2022/01/08". I want the format to be "1/8/2022" (M/d/yyyy). Is it possible in pyspark? I have tried using date_format() but it results in null.
Did you cast the day_dt column to timestamp before using date_format? The code below adds a null-valued column, as you described in your question, because the input is StringType. You can see it using df.printSchema().
from pyspark.sql.functions import *
from pyspark.sql.types import StringType
d = ['2022/01/08']
df = spark.createDataFrame(d, StringType())
df.show()
df2 = df.withColumn("newDate", date_format(df.value, "MM/dd/yyyy"))
df2.show()
+----------+
| value|
+----------+
|2022/01/08|
+----------+
+----------+-------+
| value|newDate|
+----------+-------+
|2022/01/08| null|
+----------+-------+
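As noted above, printSchema() makes the problem visible. A rough sketch of what it prints for the frame above (newDate is still just a string, since date_format returns StringType):
# Sketch: both columns are plain strings
df2.printSchema()
# root
#  |-- value: string (nullable = true)
#  |-- newDate: string (nullable = true)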
After casting the string type to timestamp, the date column is formatted properly:
from pyspark.sql.functions import *
from pyspark.sql.types import StringType
d = ['2022/01/08']
df = spark.createDataFrame(d, StringType())
df.show()
df2 = df.withColumn("newDate", date_format(unix_timestamp(df.value, "yyyy/MM/dd").cast("timestamp"), "MM/dd/yyyy"))
df2.show()
+----------+
| value|
+----------+
|2022/01/08|
+----------+
+----------+----------+
| value| newDate|
+----------+----------+
|2022/01/08|01/08/2022|
+----------+----------+
Hope it helps.
If you mean you have a date as a string in the format "yyyy/mm/dd" and you want to convert it to a string in the format "M/d/yyyy", then:
First, parse the string to a Date type using to_date().
Then, convert the Date type to a string using date_format().
import pyspark.sql.functions as F

df = spark.createDataFrame(data=[["2022/01/01",], ["2022/12/31",]], schema=["date_str_in"])
df = df.withColumn("date_dt", F.to_date("date_str_in", format="yyyy/MM/dd"))
df = df.withColumn("date_str_out", F.date_format("date_dt", format="M/d/yyyy"))
df.show()
+-----------+----------+------------+
|date_str_in| date_dt|date_str_out|
+-----------+----------+------------+
| 2022/01/01|2022-01-01| 1/1/2022|
| 2022/12/31|2022-12-31| 12/31/2022|
+-----------+----------+------------+

How to get 1st day of the year in pyspark

I have a date variable that I need to pass to various functions.
For example, if I have the date in a variable as 12/09/2021, it should return 01/01/2021.
How do I get the 1st day of the year in PySpark?
You can use the trunc function, which truncates parts of a date.
from pyspark.sql import functions as f

df = spark.createDataFrame([()], [])
(
    df
    .withColumn('current_date', f.current_date())
    .withColumn("year_start", f.trunc("current_date", "year"))
    .show()
)
# Output
+------------+----------+
|current_date|year_start|
+------------+----------+
| 2022-02-23|2022-01-01|
+------------+----------+
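If your date arrives as a string variable like 12/09/2021 (as in the question) rather than from current_date(), you can parse it first and then truncate. A minimal sketch, assuming the input string uses the dd/MM/yyyy format and hypothetical column names date_str, date, year_start:
from pyspark.sql import functions as f

df = spark.createDataFrame([('12/09/2021',)], ['date_str'])
(
    df
    .withColumn('date', f.to_date('date_str', 'dd/MM/yyyy'))        # parse the string first
    .withColumn('year_start', f.trunc('date', 'year'))              # truncate to the first day of the year
    .withColumn('year_start_str', f.date_format('year_start', 'dd/MM/yyyy'))  # back to the original format
    .show()
)
# Expected output (under the dd/MM/yyyy assumption):
# +----------+----------+----------+--------------+
# |  date_str|      date|year_start|year_start_str|
# +----------+----------+----------+--------------+
# |12/09/2021|2021-09-12|2021-01-01|    01/01/2021|
# +----------+----------+----------+--------------+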
x = '12/09/2021'
'01/01/' + x[-4:]
output: '01/01/2021'
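If the value lives in a DataFrame column rather than a plain Python variable, the same string-slicing idea can be written with built-in column functions. A minimal sketch (the column name dt is just an assumption):
from pyspark.sql import functions as F

df = spark.createDataFrame([('12/09/2021',)], ['dt'])
# Keep the last 4 characters (the year) and prepend '01/01/'
df.withColumn('year_start', F.concat(F.lit('01/01/'), F.substring('dt', -4, 4))).show()
# +----------+----------+
# |        dt|year_start|
# +----------+----------+
# |12/09/2021|01/01/2021|
# +----------+----------+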
You can achieve this with date_trunc combined with to_date, since date_trunc returns a Timestamp rather than a Date.
Data Preparation
import pandas as pd
import pyspark.sql.functions as F

df = pd.DataFrame({
    'Date': ['2021-01-23', '2002-02-09', '2009-09-19'],
})
sparkDF = spark.createDataFrame(df)
sparkDF.show()
+----------+
| Date|
+----------+
|2021-01-23|
|2002-02-09|
|2009-09-19|
+----------+
Date Trunc & To Date
sparkDF = sparkDF.withColumn('first_day_year_dt',F.to_date(F.date_trunc('year',F.col('Date')),'yyyy-MM-dd'))\
.withColumn('first_day_year_timestamp',F.date_trunc('year',F.col('Date')))
sparkDF.show()
+----------+-----------------+------------------------+
| Date|first_day_year_dt|first_day_year_timestamp|
+----------+-----------------+------------------------+
|2021-01-23| 2021-01-01| 2021-01-01 00:00:00|
|2002-02-09| 2002-01-01| 2002-01-01 00:00:00|
|2009-09-19| 2009-01-01| 2009-01-01 00:00:00|
+----------+-----------------+------------------------+

PySpark: String to timestamp transformation

I am working with time data and am trying to convert the string to timestamp format.
Here is what the 'Time' column looks like
+----------+
| Time |
+----------+
|1358380800|
|1380672000|
+----------+
Here is what I want
+---------------+
| Time |
+---------------+
|2013/1/17 8:0:0|
|2013/10/2 8:0:0|
+---------------+
I found some similar questions and answers and tried the code below, but it all ends with 'null':
df2 = df.withColumn("Time", df["Time"].cast(TimestampType()))
df2 = df.withColumn('Time', F.unix_timestamp('Time', 'yyyy-MM-dd').cast(TimestampType()))
Well, you are doing it the other way around. The SQL function unix_timestamp converts a string with the given format to a Unix timestamp. When you want to convert a Unix timestamp to the datetime format, you have to use the from_unixtime SQL function:
from pyspark.sql import functions as F
from pyspark.sql import types as T
l1 = [('1358380800',),('1380672000',)]
df = spark.createDataFrame(l1,['Time'])
df.withColumn('Time', F.from_unixtime(df.Time).cast(T.TimestampType())).show()
Output:
+-------------------+
| Time|
+-------------------+
|2013-01-17 01:00:00|
|2013-10-02 02:00:00|
+-------------------+
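If you specifically want the slash-separated layout from the question rather than a Timestamp column, from_unixtime also accepts an output pattern. A minimal sketch (Time_str is a made-up column name, and the displayed hour will depend on your session time zone, just like the output above):
# Format the Unix timestamp directly as a string; the result is a string, not a Timestamp
df.withColumn('Time_str', F.from_unixtime(df.Time, 'yyyy/M/d H:m:s')).show(truncate=False)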

How to retrieve the month from a date column's values in a Scala dataframe?

Given:
val df = Seq((1L, "04-04-2015")).toDF("id", "date")
val df2 = df.withColumn("month", from_unixtime(unix_timestamp($"date", "dd/MM/yy"), "MMMMM"))
df2.show()
I got this output:
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015| null|
+---+----------+-----+
However, I want the output to be as below:
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015|April|
+---+----------+-----+
How can I do that in Spark SQL using Scala?
This should do it:
val df2 = df.withColumn("month", date_format(to_date($"date", "dd-MM-yyyy"), "MMMM"))
df2.show
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015|April|
+---+----------+-----+
NOTE:
- The first string (to_date) must match the format of your existing date
- Be careful with "dd-MM-yyyy" vs "MM-dd-yyyy"
- The second string (date_format) is the format of the output
Docs: to_date, date_format
There is nothing wrong with your code; just keep the format string consistent with the format of your date column.
Not exactly related to this question, but for anyone who wants to get the month as an integer, there is a month function:
val df2 = df.withColumn("month", month(to_date($"date", "dd-MM-yyyy")))
df2.show
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015| 4|
+---+----------+-----+
In the same way, you can use the year function to get only the year.

Convert from timestamp to specific date in pyspark

I would like to convert the timestamp in a specific column into a specific date format.
Here is my input:
+----------+
| timestamp|
+----------+
|1532383202|
+----------+
What I would expect:
+------------------+
| date |
+------------------+
|24/7/2018 1:00:00 |
+------------------+
If possible, I would like to set the minutes and seconds to 0 even if they are not 0.
For example, if I have this:
+------------------+
| date |
+------------------+
|24/7/2018 1:06:32 |
+------------------+
I would like this:
+------------------+
| date |
+------------------+
|24/7/2018 1:00:00 |
+------------------+
What I tried is:
from pyspark.sql.functions import unix_timestamp
table = table.withColumn(
'timestamp',
unix_timestamp(date_format('timestamp', 'yyyy-MM-dd HH:MM:SS'))
)
But I get NULL.
Update
Inspired by @Tony Pellerin's answer, I realize you can go directly to the :00:00 without having to use regexp_replace():
table = table.withColumn("date", f.from_unixtime("timestamp", "dd/MM/yyyy HH:00:00"))
table.show()
#+----------+-------------------+
#| timestamp| date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:00|
#+----------+-------------------+
Your code doesn't work because pyspark.sql.functions.unix_timestamp() will:
Convert time string with given pattern (‘yyyy-MM-dd HH:mm:ss’, by default) to Unix time stamp (in seconds), using the default timezone and the default locale, return null if fail.
You actually want to do the inverse of this operation, which is convert from an integer timestamp to a string. For this you can use pyspark.sql.functions.from_unixtime():
import pyspark.sql.functions as f
table = table.withColumn("date", f.from_unixtime("timestamp", "dd/MM/yyyy HH:MM:SS"))
table.show()
#+----------+-------------------+
#| timestamp| date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:07:00|
#+----------+-------------------+
Now the date column is a string:
table.printSchema()
#root
# |-- timestamp: long (nullable = true)
# |-- date: string (nullable = true)
So you can use pyspark.sql.functions.regexp_replace() to make the minutes and seconds zero:
table.withColumn("date", f.regexp_replace("date", ":\d{2}:\d{2}", ":00:00")).show()
#+----------+-------------------+
#| timestamp| date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:00|
#+----------+-------------------+
The regex pattern ":\d{2}" matches a literal : followed by exactly 2 digits, so ":\d{2}:\d{2}" matches the minutes and seconds portion of the string.
Maybe you could use the datetime library to convert timestamps to your desired format. You should also use user-defined functions (UDFs) to work with Spark DataFrame columns. Here's what I would do:
# Import the libraries
from pyspark.sql.functions import udf
from datetime import datetime

# Create a function that returns the desired string from a timestamp
def format_timestamp(ts):
    return datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:00:00')

# Create the UDF
format_timestamp_udf = udf(lambda x: format_timestamp(x))

# Finally, apply the function to each element of the 'timestamp' column
table = table.withColumn('timestamp', format_timestamp_udf(table['timestamp']))
Hope this helps.