PySpark: Convert date from string format (20220124) to date format

In a Spark dataframe, I would like to convert the column "Date", which is in string format (e.g. 20220124), to 2022-01-24 and then to a date type using Python.
df_new= df.withColumn('Date',to_date(df.Date, 'yyyy-MM-dd'))

You can do it with the to_date function, which takes the input column and the format of your date strings.
from pyspark.sql import functions as F
df.withColumn('Date', F.to_date('Date', 'yyyyMMdd'))
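To sanity-check the pattern without a Spark session, here is a minimal plain-Python sketch of the same conversion. It uses the standard library rather than PySpark, so `%Y%m%d` stands in for Spark's `yyyyMMdd`; the helper name `parse_compact_date` is made up for illustration.

```python
from datetime import date, datetime

def parse_compact_date(value: str) -> date:
    """Parse a compact yyyyMMdd string such as '20220124' into a date."""
    return datetime.strptime(value, "%Y%m%d").date()

parsed = parse_compact_date("20220124")
print(parsed.isoformat())  # 2022-01-24
```

The key point carries over to Spark: the pattern describes how the input string looks, not how you want the result displayed.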

Related

Converting mm/yyyy string to date pyspark

I am trying to convert a string column that has no day, just mm/yyyy, to a date column with format mm/1/yyyy.
The current code I have is
from pyspark.sql.functions import *
monthdf.select(col("PP Month"),to_date(col("PP Month"),"m/yyyy").alias("date")).show()
but this returns null. Example of how the string column is formatted is 6/2021.
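Try this. The lowercase m in your pattern means minute-of-hour; month is uppercase M. Prepending a day gives to_date a complete date to parse: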
import pyspark.sql.functions as f
df = (
df.withColumn('date', f.date_format(f.to_date(f.concat(f.lit('1/'), f.col('PP Month')), 'd/M/yyyy'), 'MM/d/yyyy'))
)
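The same idea (prepend a day, parse, then re-format) can be sketched in plain Python as a quick check. This is a standard-library illustration, not PySpark, and `month_year_to_first_day` is a hypothetical helper name.

```python
from datetime import datetime

def month_year_to_first_day(value: str) -> str:
    """Turn an 'M/yyyy' string such as '6/2021' into 'MM/d/yyyy' for day 1."""
    # Prepend day 1 so the string is a complete date, as the answer does with concat.
    first_day = datetime.strptime("1/" + value, "%d/%m/%Y")
    # Zero-pad the month but not the day, mirroring the 'MM/d/yyyy' output pattern.
    return f"{first_day.month:02d}/{first_day.day}/{first_day.year}"

print(month_year_to_first_day("6/2021"))  # 06/1/2021
```

Note that, as in the Spark answer, the final value is a formatted string rather than a date object.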

PySpark string column to timestamp conversion

I am currently learning pyspark and I need to convert a COLUMN of strings in format 13/09/2021 20:45 into a timestamp of just the hour 20:45.
I figured I could do this with q1.withColumn("timestamp", to_timestamp("ts")).show() (where q1 is my dataframe and ts is the column in question) to convert my input into DD/MM/YYYY HH:MM format, but the values returned are all null. I therefore realised that I need the input in a timestamp format PySpark understands (MM-dd-yyyy HH:mm:ss.SSSS) before I can convert it to a proper timestamp. Hence my question:
How can I convert the column of strings dd/mm/yyyy hh:mm into a format understandable for pyspark so that I can convert it to timestamp format?
There are different ways you can do that:
from pyspark.sql import functions as F
# use substring (1-based start position 12, length 5)
df.withColumn('hour', F.substring('A', 12, 5)).show()
# use regex (raw string so the backslashes reach the regex engine unchanged)
df.withColumn('hour', F.regexp_extract('A', r'\d{2}:\d{2}', 0)).show()
# use datetime
df.withColumn('hour', F.from_unixtime(F.unix_timestamp('A', 'dd/MM/yyyy HH:mm'), 'HH:mm')).show()
# Output
# +----------------+-----+
# | A| hour|
# +----------------+-----+
# |13/09/2021 20:45|20:45|
# +----------------+-----+
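For comparison, the three approaches above can be tried on a single string in plain Python. This is a standard-library sketch rather than PySpark; the variable names are illustrative only.

```python
import re
from datetime import datetime

sample = "13/09/2021 20:45"

# Slice: Spark's substring is 1-based (position 12), Python slices are 0-based (index 11).
by_slice = sample[11:16]

# Regex: the first HH:mm-looking group in the string.
by_regex = re.search(r"\d{2}:\d{2}", sample).group(0)

# Parse and re-format: also validates that the string really is a date and time.
by_parse = datetime.strptime(sample, "%d/%m/%Y %H:%M").strftime("%H:%M")

print(by_slice, by_regex, by_parse)
```

Slicing and the regex only work while the layout stays fixed; the parse-based version fails loudly on malformed input, which is often preferable.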
unix_timestamp may help with your problem.
See also: Convert pyspark string to date format

Convert a string type (MM/dd/YYYY hh:mm:ss AM/PM) to date format in PySpark?

I have a string in the format 05/26/2021 11:31:56 AM and I want to convert it to a date format like 05-26-2021 in PySpark.
I have tried the things below; they convert the column type to date but make the values null.
df = df.withColumn("columnname", F.to_date(df["columnname"], 'yyyy-MM-dd'))
Another one I have tried is
df = df.withColumn("columnname", df["columnname"].cast(DateType()))
I have also tried the method below
df = df.withColumn(column.lower(), F.to_date(F.col(column.lower())).alias(column).cast("date"))
but with every method the column type is converted to date while the values become null.
Any suggestions are appreciated.
# Create data frame like below
df = spark.createDataFrame(
[("Test", "05/26/2021 11:31:56 AM")],
("user_name", "login_date"))
# Import functions
from pyspark.sql import functions as f
# Create a data frame with a new column new_date holding the data in the desired format
df1 = df.withColumn("new_date", f.from_unixtime(f.unix_timestamp("login_date",'MM/dd/yyyy hh:mm:ss a'),'yyyy-MM-dd').cast('date'))
The answer above posted by @User12345 works, and the method below also works:
# Requires: from pyspark.sql.types import DateType
df = df.withColumn(column, F.unix_timestamp(column, "MM/dd/yyyy hh:mm:ss a").cast("double").cast("timestamp"))
df = df.withColumn(column, F.from_utc_timestamp(column, 'Z').cast(DateType()))
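Use this (the pattern should cover the whole string, including the time and the AM/PM marker):
df = data.withColumn("Date", to_date(to_timestamp("Date", "MM/dd/yyyy hh:mm:ss a")))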
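The nulls in the question come from patterns that do not match the input string. A minimal plain-Python sketch of a matching pattern is shown below; it uses the standard library instead of PySpark, assumes an English/C locale for the AM/PM marker, and `parse_login_date` is a hypothetical helper name.

```python
from datetime import date, datetime

def parse_login_date(value: str) -> date:
    """Parse 'MM/dd/yyyy hh:mm:ss AM/PM' and keep only the date part."""
    # %I is the 12-hour clock and %p the AM/PM marker (assumes an English/C locale).
    return datetime.strptime(value, "%m/%d/%Y %I:%M:%S %p").date()

login = parse_login_date("05/26/2021 11:31:56 AM")
print(login.isoformat())           # 2021-05-26
print(login.strftime("%m-%d-%Y"))  # 05-26-2021
```

A date value has no display format of its own; the 05-26-2021 form only exists once the date is rendered back to a string.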

Converting timestamp format in dataframe

I have read a csv file and made a dataframe where timestamp column is in format "11/12/2020 3:01".
How do I convert this into "yyyy-mm-dd hh:mm:ss.ssssss" format for the data of that particular timestamp column?
import org.apache.spark.sql.functions._
// Assumes day-first input on a 24-hour clock, e.g. "11/12/2020 3:01"
df.withColumn("timestamp_col",
  date_format(
    unix_timestamp($"timestamp_col", "dd/MM/yyyy H:mm").cast("timestamp"),
    "yyyy-MM-dd HH:mm:ss.SSSSSS"
  )
)
Look at .strftime in the Python documentation (this applies to Python datetime objects, not to Spark columns):
https://docs.python.org/3/library/datetime.html
timestamp.strftime("%d.%m.%Y")
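Building on the strftime suggestion, here is a minimal sketch of the full round trip for a single value in plain Python. It assumes the input is day-first (dd/MM/yyyy) on a 24-hour clock, which the question does not state, and `reformat_timestamp` is a hypothetical helper name.

```python
from datetime import datetime

def reformat_timestamp(value: str) -> str:
    """Convert 'dd/MM/yyyy H:mm' text to 'yyyy-MM-dd HH:mm:ss.ffffff' text."""
    # Assumption: day-first input and a 24-hour clock.
    parsed = datetime.strptime(value, "%d/%m/%Y %H:%M")
    return parsed.strftime("%Y-%m-%d %H:%M:%S.%f")

print(reformat_timestamp("11/12/2020 3:01"))  # 2020-12-11 03:01:00.000000
```

If the data is actually month-first, swapping `%d` and `%m` in the parse pattern is the only change needed.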

How to convert timestamp column to epoch seconds?

How do you convert a timestamp column to epoch seconds?
var df = sc.parallelize(Seq("2018-07-01T00:00:00Z")).toDF("date_string")
df = df.withColumn("timestamp", $"date_string".cast("timestamp"))
df.show(false)
DataFrame:
+--------------------+---------------------+
|date_string |timestamp |
+--------------------+---------------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|
+--------------------+---------------------+
If you have a timestamp, you can cast it to a long to get the epoch seconds:
df = df.withColumn("epoch_seconds", $"timestamp".cast("long"))
df.show(false)
DataFrame
+--------------------+---------------------+-------------+
|date_string |timestamp |epoch_seconds|
+--------------------+---------------------+-------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|1530403200 |
+--------------------+---------------------+-------------+
Use unix_timestamp from org.apache.spark.sql.functions. It accepts either a timestamp column or a string column, for which you can specify the format. From the documentation:
public static Column unix_timestamp(Column s)
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return null if fail.
public static Column unix_timestamp(Column s, String p)
Convert time string with given pattern (see http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) to Unix time stamp (in seconds), return null if fail.
Use as follows:
import org.apache.spark.sql.functions._
df.withColumn("epoch_seconds", unix_timestamp($"timestamp"))
or, if the column is a string in another format:
df.withColumn("epoch_seconds", unix_timestamp($"date_string", "yyyy-MM-dd'T'HH:mm:ss'Z'"))
It can easily be done with the unix_timestamp function in Spark SQL, like this:
spark.sql("SELECT unix_timestamp(inv_time) AS time_as_long FROM agg_counts LIMIT 10").show()
Hope this helps.
You can use the function unix_timestamp and cast it into any datatype.
Example:
import org.apache.spark.sql.types.LongType
val df1 = df.select(unix_timestamp($"date_string", "yyyy-MM-dd'T'HH:mm:ss'Z'").cast(LongType).as("epoch_seconds"))
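To cross-check the epoch value shown in the question, the conversion can be reproduced for one string in plain Python. This is a standard-library sketch, not Spark code; `to_epoch_seconds` is a hypothetical helper name, and it treats the trailing Z as UTC.

```python
from datetime import datetime, timezone

def to_epoch_seconds(value: str) -> int:
    """Convert an ISO-like 'yyyy-MM-ddTHH:mm:ssZ' string to epoch seconds."""
    # The trailing Z is matched literally, so UTC is attached explicitly.
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())

print(to_epoch_seconds("2018-07-01T00:00:00Z"))  # 1530403200
```

The result matches the epoch_seconds column in the table above. A pattern that quotes Z as a literal does not carry the UTC offset by itself, so the time zone has to be supplied some other way.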