PySpark string column to timestamp conversion - date

I am currently learning PySpark and I need to convert a column of strings in the format 13/09/2021 20:45 into a timestamp, keeping just the hour part, 20:45.
I figured I could do this with q1.withColumn("timestamp", to_timestamp("ts")).show() (where q1 is my dataframe and ts is the column in question) to convert my dd/MM/yyyy HH:mm input, but the returned values are all null. I therefore realised that I need the input in PySpark's expected timestamp format (MM-dd-yyyy HH:mm:ss.SSSS) before I can convert it to a proper timestamp. Hence my question:
How can I convert a column of dd/MM/yyyy HH:mm strings into a format PySpark understands, so that I can convert it to a timestamp?

There are different ways you can do that:
from pyspark.sql import functions as F
# use substring: the hour part starts at position 12 and is 5 characters long
df.withColumn('hour', F.substring('A', 12, 5)).show()
# use a regex to pull out the HH:mm part
df.withColumn('hour', F.regexp_extract('A', r'\d{2}:\d{2}', 0)).show()
# parse the full datetime, then format it back to HH:mm
df.withColumn('hour', F.from_unixtime(F.unix_timestamp('A', 'dd/MM/yyyy HH:mm'), 'HH:mm')).show()
# Output
# +----------------+-----+
# |               A| hour|
# +----------------+-----+
# |13/09/2021 20:45|20:45|
# +----------------+-----+
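If you also need the value as a real timestamp column rather than just an HH:mm string, passing the pattern to to_timestamp directly should work too. A minimal sketch, assuming a Spark 3.x session available as spark:
from pyspark.sql import functions as F

# hypothetical data frame matching the question's format
df = spark.createDataFrame([("13/09/2021 20:45",)], ["A"])

(df
 .withColumn("ts", F.to_timestamp("A", "dd/MM/yyyy HH:mm"))  # proper timestamp column
 .withColumn("hour", F.date_format("ts", "HH:mm"))           # "20:45" as a string
 .show(truncate=False))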

unix_timestamp may help with your problem.
Just try the approach from this question:
Convert pyspark string to date format
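A minimal sketch of that unix_timestamp route, reusing the ts column and the dd/MM/yyyy HH:mm pattern from the question above (spark is assumed to be an active session):
from pyspark.sql import functions as F

# hypothetical data frame with the question's column layout
q1 = spark.createDataFrame([("13/09/2021 20:45",)], ["ts"])

# unix_timestamp parses the string into epoch seconds; casting that back
# to timestamp gives a proper timestamp column
q1.withColumn(
    "timestamp",
    F.unix_timestamp("ts", "dd/MM/yyyy HH:mm").cast("timestamp"),
).show(truncate=False)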

Related

Pyspark: Convert date from string format (20220124) to date format

In a Spark dataframe, I would like to convert the date column "Date", which is in string format (e.g. 20220124), to 2022-01-24 and then to a date type using Python. This is what I tried:
df_new = df.withColumn('Date', to_date(df.Date, 'yyyy-MM-dd'))
You can do it with the to_date function, which takes the input column and the format of your date:
from pyspark.sql import functions as F
df.withColumn('date', F.to_date('date', 'yyyyMMdd'))
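A self-contained sketch of that call, using the sample value 20220124 from the question (spark is assumed to be an active session):
from pyspark.sql import functions as F

df = spark.createDataFrame([("20220124",)], ["Date"])
df.withColumn("Date", F.to_date("Date", "yyyyMMdd")).show()
# +----------+
# |      Date|
# +----------+
# |2022-01-24|
# +----------+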

Convert timestamp to string without losing milliseconds in Spark (Scala)

I'm using the following code in order to convert a date/timestamp into a string with a specific format:
when(to_date($"timestamp", fmt).isNotNull, date_format(to_timestamp($"timestamp", fmt), outputFormat))
The "fmt" is coming from a list of possible formats because we have different formats in the source data.
The issue here is that when we apply the "to_timestamp" function, the milliseconds part is lost. Is there any other (and not overly complicated) way to do this without losing the millisecond detail?
I remember having to mess with this a while back. The following will work as well.
from pyspark.sql.functions import col, substring, unix_timestamp

df = (
    spark
    .createDataFrame(['2021-07-19 17:29:36.123',
                      '2021-07-18 17:29:36.123'], "string").toDF("ts")
    .withColumn('ts_with_mili',
                (unix_timestamp(col('ts'), "yyyy-MM-dd HH:mm:ss.SSS")  # whole seconds
                 + substring(col('ts'), -3, 3).cast('float') / 1000    # add the .SSS part back
                ).cast('timestamp'))
)
df.show(truncate=False)
# +-----------------------+-----------------------+
# |ts                     |ts_with_mili           |
# +-----------------------+-----------------------+
# |2021-07-19 17:29:36.123|2021-07-19 17:29:36.123|
# |2021-07-18 17:29:36.123|2021-07-18 17:29:36.123|
# +-----------------------+-----------------------+
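As a side note, on Spark 3.x to_timestamp with an explicit .SSS pattern keeps the fractional seconds, so parsing and then formatting back to a string may be enough. A sketch under that assumption:
from pyspark.sql import functions as F

df = spark.createDataFrame([("2021-07-19 17:29:36.123",)], ["ts"])

df.withColumn(
    "ts_with_mili",
    F.date_format(
        F.to_timestamp("ts", "yyyy-MM-dd HH:mm:ss.SSS"),  # fraction preserved on Spark 3.x
        "yyyy-MM-dd HH:mm:ss.SSS",
    ),
).show(truncate=False)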

Convert a string (MM/dd/yyyy hh:mm:ss AM/PM) to date format in PySpark?

I have a string in the format 05/26/2021 11:31:56 AM and I want to convert it to a date like 05-26-2021 in PySpark.
I have tried the approaches below, but each one converts the column type to date while turning the values null.
df = df.withColumn("columnname", F.to_date(df["columnname"], 'yyyy-MM-dd'))
Another one I tried is
df = df.withColumn("columnname", df["columnname"].cast(DateType()))
I have also tried the method below
df = df.withColumn(column.lower(), F.to_date(F.col(column.lower())).alias(column).cast("date"))
With every method the column type becomes date, but the values end up null.
Any suggestion is appreciated.
# Create a data frame like below
df = spark.createDataFrame(
    [("Test", "05/26/2021 11:31:56 AM")],
    ("user_name", "login_date"))
# Import functions
from pyspark.sql import functions as f
# Create a data frame with a new column new_date holding the data in the desired format
df1 = df.withColumn("new_date", f.from_unixtime(f.unix_timestamp("login_date", 'MM/dd/yyyy hh:mm:ss a'), 'yyyy-MM-dd').cast('date'))
The answer above posted by #User12345 works, and the method below works as well:
df = df.withColumn(column, F.unix_timestamp(column, "MM/dd/yyyy hh:mm:ss a").cast("double").cast("timestamp"))
df = df.withColumn(column, F.from_utc_timestamp(column, 'Z').cast(DateType()))
Use this:
df = data.withColumn("Date", to_date(to_timestamp("Date", "MM/dd/yyyy hh:mm:ss a")))
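If the goal is literally the 05-26-2021 rendering from the question (which is a string, not a date type), date_format can be layered on top of the parsed date. A minimal sketch, assuming the sample value from the question:
from pyspark.sql import functions as F

df = spark.createDataFrame([("Test", "05/26/2021 11:31:56 AM")],
                           ("user_name", "login_date"))

df.withColumn(
    "new_date",
    F.date_format(
        F.to_date("login_date", "MM/dd/yyyy hh:mm:ss a"),  # parse, truncating to the date
        "MM-dd-yyyy",                                      # render as 05-26-2021
    ),
).show(truncate=False)
# new_date -> 05-26-2021 (as a string column)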

PySpark SQL: add letters into a datetime value

I have epoch time values like 1569872588019 in a Spark dataframe and I'm using PySpark SQL in a Jupyter notebook.
I'm using the from_unixtime method to convert it to date.
Here is my code:
SELECT from_unixtime(dataepochvalues/1000,'yyyy-MM-dd%%HH:MM:ss') AS date FROM testdata
The result is like: 2019-04-30%%11:09:11
But what I want is like: 2019-04-30T11:04:48.366Z
I tried putting T and Z in place of the %% in the pattern, but that failed.
How can I insert the T and Z letters?
You can specify those letters using single quotes. For your desired output, use the following date and time pattern:
"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
Using your example:
spark.sql(
    """SELECT from_unixtime(1569872588019/1000, "yyyy-MM-dd'T'HH:mm:ss'Z'") AS date"""
).show()
#+--------------------+
#|                date|
#+--------------------+
#|2019-09-30T14:43:08Z|
#+--------------------+
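If the millisecond part of the desired output (the .366) matters, note that from_unixtime drops it, since it works on whole seconds. Casting the fractional seconds to a timestamp and then applying date_format is one workaround. A sketch, assuming the dataepochvalues column from the question and a UTC session time zone (the trailing 'Z' is written as a literal):
from pyspark.sql import functions as F

df = spark.createDataFrame([(1569872588019,)], ["dataepochvalues"])

df.withColumn(
    "date",
    F.date_format(
        (F.col("dataepochvalues") / 1000).cast("timestamp"),  # fractional seconds keep the millis
        "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'",
    ),
).show(truncate=False)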

How to convert timestamp column to epoch seconds?

How do you convert a timestamp column to epoch seconds?
var df = sc.parallelize(Seq("2018-07-01T00:00:00Z")).toDF("date_string")
df = df.withColumn("timestamp", $"date_string".cast("timestamp"))
df.show(false)
DataFrame:
+--------------------+---------------------+
|date_string         |timestamp            |
+--------------------+---------------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|
+--------------------+---------------------+
If you have a timestamp you can cast it to a long to get the epoch seconds
df = df.withColumn("epoch_seconds", $"timestamp".cast("long"))
df.show(false)
DataFrame
+--------------------+---------------------+-------------+
|date_string         |timestamp            |epoch_seconds|
+--------------------+---------------------+-------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|1530403200   |
+--------------------+---------------------+-------------+
Use unix_timestamp from org.apache.spark.sql.functions. It can take a timestamp column, or a string column for which a pattern can be specified. From the documentation:
public static Column unix_timestamp(Column s)
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return null if fail.
public static Column unix_timestamp(Column s, String p)
Convert time string with given pattern (see http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) to Unix time stamp (in seconds), return null if fail.
Use as follows:
import org.apache.spark.sql.functions._
df.withColumn("epoch_seconds", unix_timestamp($"timestamp"))
or, if the column is a string in another format:
df.withColumn("epoch_seconds", unix_timestamp($"date_string", "yyyy-MM-dd'T'HH:mm:ss'Z'"))
It can easily be done with the unix_timestamp function in Spark SQL, like this:
spark.sql("SELECT unix_timestamp(inv_time) AS time_as_long FROM agg_counts LIMIT 10").show()
Hope this helps.
You can use the unix_timestamp function and cast the result to any datatype.
Example:
val df1 = df.select(unix_timestamp($"date_string", "yyyy-MM-dd'T'HH:mm:ss'Z'").cast(LongType).as("epoch_seconds"))
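For completeness, the same cast-to-long idea in PySpark (a sketch; spark is assumed to be an active session and the sample value is the one from the question):
from pyspark.sql import functions as F

df = spark.createDataFrame([("2018-07-01T00:00:00Z",)], ["date_string"])

df = (
    df
    .withColumn("timestamp", F.col("date_string").cast("timestamp"))  # ISO-8601 string -> timestamp
    .withColumn("epoch_seconds", F.col("timestamp").cast("long"))     # seconds since 1970-01-01 UTC
)
df.show(truncate=False)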