How to change the column type from String to Date in DataFrames? - scala

I have a dataframe in which two columns (C, D) are defined as string column type, but the data in them are actually dates. For example, column C has the date as "01-APR-2015" and column D as "20150401". I want to change these to date column type, but I didn't find a good way of doing that. I looked on Stack Overflow for how to convert a string column to a Date column in Spark SQL's DataFrame, where the date format can be "01-APR-2015". I looked at this post, but it didn't have any information related to dates.

Spark >= 2.2
You can use to_date:
import org.apache.spark.sql.functions.{to_date, to_timestamp}
df.select(to_date($"ts", "dd-MMM-yyyy").alias("date"))
or to_timestamp:
df.select(to_timestamp($"ts", "dd-MMM-yyyy").alias("timestamp"))
without an intermediate unix_timestamp call.
Spark < 2.2
Since Spark 1.5 you can use the unix_timestamp function to parse the string to a long, cast it to timestamp, and truncate it with to_date:
import org.apache.spark.sql.functions.{unix_timestamp, to_date}
val df = Seq((1L, "01-APR-2015")).toDF("id", "ts")
df.select(to_date(unix_timestamp(
$"ts", "dd-MMM-yyyy"
).cast("timestamp")).alias("timestamp"))
Note:
Depending on the Spark version, this may require some adjustments due to SPARK-11724:
Casting from integer types to timestamp treats the source int as being in millis. Casting from timestamp to integer types creates the result in seconds.
If you use an unpatched version, the unix_timestamp output requires multiplication by 1000.
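As a quick sanity check outside Spark, the two string formats from the question can be parsed with Python's standard library. This is only a sketch of the pattern correspondence, not Spark code: `%d-%b-%Y` plays the role of the Java-style `dd-MMM-yyyy`, and `%Y%m%d` that of `yyyyMMdd`. The pattern for column D is an assumption, since the answer above only covers column C, and the helper name `parse_date` is hypothetical.

```python
from datetime import date, datetime

def parse_date(text: str, fmt: str) -> date:
    """Parse text with a strptime-style format and return a date."""
    return datetime.strptime(text, fmt).date()

# Month abbreviations are matched case-insensitively; English names assume
# the default C locale.
col_c = parse_date("01-APR-2015", "%d-%b-%Y")
col_d = parse_date("20150401", "%Y%m%d")
print(col_c, col_d)
```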

Related

Convert using unixtimestamp to Date

I have a dataframe column that holds dates as values like 1632838270314.
I want to convert it to a date like 'yyyy-MM-dd'. I have this so far, but it doesn't work:
date = df['createdOn'].cast(StringType())
df = df.withColumn('date_key',unix_timestamp(date),'yyyy-MM-dd').cast("date"))
createdOn is the field from which date_key is derived.
The method unix_timestamp() is for converting a timestamp or date string into the number of seconds since 01-01-1970 ("epoch"). I understand that you want to do the opposite.
Your example value "1632838270314" seems to be milliseconds since epoch.
Here you can simply cast it after converting from milliseconds to seconds:
from pyspark.sql import Row, functions as F
df = sql_context.createDataFrame([
Row(unix_in_ms=1632838270314),
])
(
df
.withColumn('timestamp_type', (F.col('unix_in_ms')/1e3).cast('timestamp'))
.withColumn('date_type', F.to_date('timestamp_type'))
.withColumn('string_type', F.col('date_type').cast('string'))
.withColumn('date_to_unix_in_s', F.unix_timestamp('string_type', 'yyyy-MM-dd'))
.show(truncate=False)
)
# Output (the values shown correspond to a session time zone of UTC+2)
+-------------+-----------------------+----------+-----------+-----------------+
|unix_in_ms |timestamp_type |date_type |string_type|date_to_unix_in_s|
+-------------+-----------------------+----------+-----------+-----------------+
|1632838270314|2021-09-28 16:11:10.314|2021-09-28|2021-09-28 |1632780000 |
+-------------+-----------------------+----------+-----------+-----------------+
You can combine the conversion into a single command:
df.withColumn('date_key', F.to_date((F.col('unix_in_ms')/1e3).cast('timestamp')).cast('string'))
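The millisecond-to-date arithmetic can be checked in plain Python. This is a stdlib sketch, not Spark, and the helper name `ms_to_utc` is hypothetical. It assumes UTC, whereas Spark renders timestamps in the session time zone, so the hour can differ from the table above.

```python
from datetime import datetime, timedelta, timezone

def ms_to_utc(ms: int) -> datetime:
    """Convert milliseconds since the epoch to a timezone-aware UTC datetime."""
    seconds, millis = divmod(ms, 1000)  # integer split avoids float rounding
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)

ts = ms_to_utc(1632838270314)
date_key = ts.date().isoformat()  # 'yyyy-MM-dd' style string
print(ts, date_key)
```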

Pyspark convert string type date into dd-mm-yyyy format

Using pyspark 2.4.0
I have the date column in the dataframe as follows:
I need to convert it into DD-MM-YYYY format. I have tried a few solutions, including the following code, but it returns null values:
df_students_2 = df_students.withColumn(
'new_date',
F.to_date(
F.unix_timestamp('dt', '%B %d, %Y').cast('timestamp')))
Note that the dt column contains different date formats. It would be easier if I could put the whole column into one format before converting, but since the dataframe is big, it is not possible to go through each value and change it to one format. For future readers, I am also including the following code, in which I tried to loop over the two date formats, but it did not succeed:
def to_date_(col, formats=(datetime.strptime(col,"%B %d, %Y"), \
datetime.strptime(col,"%d %B %Y"), "null")):
return F.coalesce(*[F.to_date(col, f) for f in formats])
Any ideas?
Try this.
It is implemented in Scala, but can be done in PySpark with minimal changes.
import org.apache.spark.sql.functions.{coalesce, to_date}
// I've put the example formats, but just replace this list with expected formats in the dt column
val dt_formats= Seq("dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd","MM/dd/yy","dd-MM-yy","dd-MM-yyyy","yyyy/MM/dd","dd/MM/yyyy")
val newDF = df_students.withColumn("new_date", coalesce(dt_formats.map(fmt => to_date($"dt", fmt)):_*))
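The idea behind the coalesce call, trying each format in turn and keeping the first one that parses, can be illustrated in plain Python. This is a stdlib sketch, not PySpark: the helper name `first_parse` and the two formats are assumptions for illustration. It uses strptime-style `%` patterns, whereas Spark's to_date expects Java-style patterns such as `MMMM d, yyyy`.

```python
from datetime import date, datetime
from typing import Optional, Sequence

def first_parse(text: str, formats: Sequence[str]) -> Optional[date]:
    """Return the date from the first format that parses text, else None."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # this format did not match; try the next one
    return None  # mirrors the null that coalesce yields when nothing matches

formats = ["%B %d, %Y", "%d %B %Y"]
parsed = [first_parse(s, formats) for s in ["March 25, 1991", "1 May 2020", "n/a"]]
print(parsed)
```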
Try this; it should work:
from pyspark.sql.functions import to_date
df = spark.createDataFrame([("Mar 25, 1991",), ("May 1, 2020",)],['date_str'])
df.select(to_date(df.date_str, 'MMM d, yyyy').alias('dt')).collect()
[Row(dt=datetime.date(1991, 3, 25)), Row(dt=datetime.date(2020, 5, 1))]
See also: Datetime Patterns for Formatting and Parsing

How to round off a datetime column in pyspark dataframe to nearest quarter

I have a column which has datetime values, for example 01/17/2020 15:55:00. I want to round the time to the nearest quarter hour (01/17/2020 16:00:00). Note: please don't answer this question using pandas; I want an answer using only PySpark.
Try this:
from pyspark.sql.functions import hour, round, unix_timestamp
# As written, this rounds to the nearest hour (3600 s) and extracts the hour;
# divide and multiply by 900 instead to round to the nearest quarter hour
result = data.withColumn("hour",hour((round(unix_timestamp("date")/3600)*3600).cast("timestamp")))
Although Spark has no SQL function that directly truncates a datetime to a quarter hour, we can build the column using several functions.
First, create the DataFrame:
from pyspark.sql.functions import current_timestamp
dateDF = spark.range(10)\
.withColumn("today", current_timestamp())
dateDF.show(10, False)
Then, round the minutes to the nearest quarter hour (storing the result in a mins column):
from pyspark.sql.functions import minute, hour, col, round, date_trunc, unix_timestamp, to_timestamp
dateDF2 = dateDF.select(col("today"),
(round(minute(col("today"))/15)*15).cast("int").alias("mins"))
Then, we truncate the timestamp to the hour, convert it to a Unix timestamp, add the rounded minutes, and convert it back to the timestamp type:
dateDF2.select(col("today"), to_timestamp(unix_timestamp(date_trunc("hour", col("today"))) + col("mins")*60).alias("truncated_timestamp")).show(10, False)
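The three steps above can be sketched in plain Python to check the arithmetic. This is a stdlib illustration, not Spark, and the function name is hypothetical. As in the answer, only the minute field is rounded, so seconds are dropped.

```python
from datetime import datetime, timedelta

def round_to_quarter_hour(ts: datetime) -> datetime:
    """Round the minute field to the nearest multiple of 15 and rebuild the timestamp."""
    mins = round(ts.minute / 15) * 15  # 0, 15, 30, 45 or 60
    # Equivalent of date_trunc("hour", ...): drop minutes, seconds, microseconds
    hour_start = ts.replace(minute=0, second=0, microsecond=0)
    return hour_start + timedelta(minutes=mins)  # 60 rolls over into the next hour

rounded = round_to_quarter_hour(datetime(2020, 1, 17, 15, 55, 0))
print(rounded)
```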
Hope this helps

create a timestamp from month and year string columns in PySpark

I want to create a timestamp column to create a line chart from two columns containing month and year respectively.
The df looks like this:
I know I can create a string concat and then convert it to a datetime column:
df.select('*',
concat('01', df['month'],
df['year']).alias('date')).withColumn("date",
df['date'].cast(TimestampType()))
But I wanted a cleaner approach using an inbuilt PySpark functionality that can also help me create other date parts, like week number, quarters, etc. Any suggestions?
You will have to concatenate the string once, make the timestamp type column and then you can easily extract week, quarter etc.
You can use this function (and edit it to create whatever other columns you need as well):
from pyspark.sql import functions as F
def spark_date_parsing(df, date_column, date_format):
    """
    Parses the date column given the date format in a spark dataframe
    NOTE: This is a Pyspark implementation
    Parameters
    ----------
    :param df: Spark dataframe having a date column
    :param date_column: Name of the date column
    :param date_format: Simple Date Format (Java-style) of the dates in the date column
    Returns
    -------
    :return: A spark dataframe with a parsed date column
    """
    df = df.withColumn(date_column, F.to_timestamp(F.col(date_column), date_format))
    # Spark returns 'null' if the parsing fails, so first check the count of null values
    # If parse_fail_count = 0, return parsed column else raise error
    parse_fail_count = df.select(
        ([F.count(F.when(F.col(date_column).isNull(), date_column))])
    ).collect()[0][0]
    if parse_fail_count == 0:
        return df
    else:
        raise ValueError(
            f"Incorrect date format '{date_format}' for date column '{date_column}'"
        )
Usage (with whatever is your resultant date format):
df = spark_date_parsing(df, "date", "dd/MM/yyyy")

Difference between Date column and other date

I would like to find the difference between a Date column in a Spark dataset and a date value which is not a column.
If both were columns, I would do:
datediff(col("dateStart"),col("dateEnd"))
As I want to find the difference between col("dateStart") and another date which is not a column, I currently do:
val dsWithRunDate = detailedRecordsDs.withColumn(RunDate,
lit(runDate.toString))
val dsWithTimeDiff = dsWithRunDate.withColumn(DateDiff,
datediff(to_date(col(RunDate)), col(DateCol)))
Is there a better way to do this than adding one more column and then finding the difference?