Convert DataFrame String column to Timestamp - scala

I am trying the following code to convert a string date column to a timestamp column:
val df = Seq(
("19-APR-2019 10:11:10"),
("19-MAR-2019 10:11:10"),
("19-FEB-2019 10:11:10")
).toDF("date")
.withColumn("new_date", to_utc_timestamp(to_date('date, "dd-MMM-yyyy hh:mm:ss"), "UTC"))
df.show
It almost works, but it loses the hours:
+--------------------+-------------------+
| date| new_date|
+--------------------+-------------------+
|19-APR-2019 10:11:10|2019-04-19 00:00:00|
|19-MAR-2019 10:11:10|2019-03-19 00:00:00|
|19-FEB-2019 10:11:10|2019-02-19 00:00:00|
+--------------------+-------------------+
Do you have any idea or any other solution?

As SMaz mentioned in a comment, the following lines do the trick:
import org.apache.spark.sql.functions.to_timestamp
df.withColumn("new_date", to_timestamp('date, "dd-MMM-yyyy hh:mm:ss"))
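For completeness, a minimal self-contained sketch of the fix (assuming a SparkSession named spark with spark.implicits._ in scope, which the original snippet already needs for toDF; the val name fixed is just for illustration):
import org.apache.spark.sql.functions.to_timestamp

val fixed = Seq(
  ("19-APR-2019 10:11:10"),
  ("19-MAR-2019 10:11:10")
).toDF("date")
  // to_timestamp parses the full pattern, so hours, minutes and seconds are preserved
  .withColumn("new_date", to_timestamp('date, "dd-MMM-yyyy hh:mm:ss"))
fixed.show(false)
// expected (roughly):
// +--------------------+-------------------+
// |date                |new_date           |
// +--------------------+-------------------+
// |19-APR-2019 10:11:10|2019-04-19 10:11:10|
// |19-MAR-2019 10:11:10|2019-03-19 10:11:10|
// +--------------------+-------------------+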

Related

Pyspark - Convert specific string to date format

I have a PySpark dataframe with a string date column in the format Mon-YY, e.g. 'Jan-17', and I am attempting to convert it into a date column.
I've tried to do it like this, but it does not work:
df.select(to_timestamp(df.t, 'MON-YY HH:mm:ss').alias('dt'))
Is it possible to do it like in SQL, or do I need to write a special function for the conversion?
You should use a valid Java date format. The following will work:
import pyspark.sql.functions as psf
df.select(psf.to_timestamp(psf.col('t'), 'MMM-yy HH:mm:ss').alias('dt'))
Jan-17 will become 2017-01-01 in that case. (Note the lowercase yy: uppercase YY is the week-based year and gives surprising results, e.g. Apr-19 parsing to 2018-12-30.)
Example
df = spark.createDataFrame([("Jan-17 00:00:00",'a'),("Apr-19 00:00:00",'b')], ['t','x'])
df.show(2)
+---------------+---+
| t| x|
+---------------+---+
|Jan-17 00:00:00| a|
|Apr-19 00:00:00| b|
+---------------+---+
Conversion to timestamp:
import pyspark.sql.functions as psf
df.select(psf.to_timestamp(psf.col('t'), 'MMM-yy HH:mm:ss').alias('dt')).show(2)
+-------------------+
| dt|
+-------------------+
|2017-01-01 00:00:00|
|2019-04-01 00:00:00|
+-------------------+
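If a plain date (day precision) is enough, to_date takes the same pattern; a small sketch under the same assumptions (a running SparkSession named spark and the df defined above):
import pyspark.sql.functions as psf
# to_date truncates to day precision; with MMM-yy the day defaults to the 1st of the month
df.select(psf.to_date(psf.col('t'), 'MMM-yy HH:mm:ss').alias('d')).show(2)
# roughly:
# +----------+
# |         d|
# +----------+
# |2017-01-01|
# |2019-04-01|
# +----------+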

Convert String column to Timestamp using Spark/Scala

I want to convert a String column to a timestamp column, but it always returns null values.
val t = unix_timestamp(col("tracking_time"),"MM/dd/yyyy").cast("timestamp")
val df= df2.withColumn("ts", t)
Any idea ?
Thank you .
Make sure your String column matches the specified format MM/dd/yyyy.
If it does not match, null will be returned.
Example:
val df2=Seq(("12/12/2020")).toDF("tracking_time")
val t = unix_timestamp(col("tracking_time"),"MM/dd/yyyy").cast("timestamp")
df2.withColumn("ts", t).show()
//+-------------+-------------------+
//|tracking_time| ts|
//+-------------+-------------------+
//| 12/12/2020|2020-12-12 00:00:00|
//+-------------+-------------------+
df2.withColumn("ts",unix_timestamp(col("tracking_time"),"MM/dd/yyyy").cast("timestamp")).show()
//+-------------+-------------------+
//|tracking_time| ts|
//+-------------+-------------------+
//| 12/12/2020|2020-12-12 00:00:00|
//+-------------+-------------------+
//(or) by using to_timestamp function.
df2.withColumn("ts",to_timestamp(col("tracking_time"),"MM/dd/yyyy")).show()
//+-------------+-------------------+
//|tracking_time| ts|
//+-------------+-------------------+
//| 12/12/2020|2020-12-12 00:00:00|
//+-------------+-------------------+
As @Shu mentioned, the cause might have been an invalid format in the tracking_time column. It is worth mentioning, though, that Spark looks for the pattern as a prefix of the column's value. Study these examples for better intuition:
Seq(
"03/29/2020 00:00",
"03/29/2020",
"00:00 03/29/2020",
"03/29/2020somethingsomething"
).toDF("tracking_time")
.withColumn("ts", unix_timestamp(col("tracking_time"), "MM/dd/yyyy").cast("timestamp"))
.show()
//+--------------------+-------------------+
//| tracking_time| ts|
//+--------------------+-------------------+
//| 03/29/2020 00:00|2020-03-29 00:00:00|
//| 03/29/2020|2020-03-29 00:00:00|
//| 00:00 03/29/2020| null|
//|03/29/2020somethi...|2020-03-29 00:00:00|
//+--------------------+-------------------+
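If the date is not guaranteed to be a prefix of the string (as in the 00:00 03/29/2020 row above), one option is to pull it out first with a regular expression before parsing. A sketch, assuming the same data is bound to a val df (not part of the original answer):
import org.apache.spark.sql.functions.{col, regexp_extract, to_timestamp}

df.withColumn("ts",
  // extract the first MM/dd/yyyy-looking substring, then parse it
  to_timestamp(regexp_extract(col("tracking_time"), "\\d{2}/\\d{2}/\\d{4}", 0), "MM/dd/yyyy")
).show()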

Converting string time to day timestamp

I have just started working with PySpark, and need some help converting a column datatype.
My dataframe has a string column, which stores the time of day in AM/PM, and I need to convert this into datetime for further processing/analysis.
fd = spark.createDataFrame([(['0143A'])], ['dt'])
fd.show()
+-----+
| dt|
+-----+
|0143A|
+-----+
from pyspark.sql.functions import date_format, to_timestamp
#fd.select(date_format('dt','hhmma')).show()
fd.select(to_timestamp('dt','hhmmaa')).show()
+----------------------------+
|to_timestamp(`dt`, 'hhmmaa')|
+----------------------------+
| null|
+----------------------------+
Expected output: 01:43
How can I get the proper datetime format in the above scenario?
Thanks for your help!
If we look at the doc for to_timestamp (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.to_timestamp) we see that the format must be specified as a SimpleDateFormat (https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html).
In order to retrieve the time of day in AM/PM, we must use hhmma. But in SimpleDateFormat, a matches AM or PM, not A or P. So we need to change our string:
import pyspark.sql.functions as F
df = spark.createDataFrame([(['0143A'])], ['dt'])
df2 = df.withColumn('dt', F.concat(F.col('dt'), F.lit('M')))
df3 = df2.withColumn('ts', F.to_timestamp('dt','hhmma'))
df3.show()
+------+-------------------+
| dt| ts|
+------+-------------------+
|0143AM|1970-01-01 01:43:00|
+------+-------------------+
If you want to retrieve it as a string in the format you mentioned, you can use date_format:
df4 = df3.withColumn('time', F.date_format(F.col('ts'), format='HH:mm'))
df4.show()
+------+-------------------+-----+
| dt| ts| time|
+------+-------------------+-----+
|0143AM|1970-01-01 01:43:00|01:43|
+------+-------------------+-----+
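The two steps can also be collapsed into one expression when the intermediate timestamp column is not needed; a sketch under the same assumptions (the df defined above):
import pyspark.sql.functions as F
# append the missing 'M', parse as a 12-hour time, then format back to HH:mm
df.withColumn('time', F.date_format(F.to_timestamp(F.concat(F.col('dt'), F.lit('M')), 'hhmma'), 'HH:mm')).show()
# roughly:
# +-----+-----+
# |   dt| time|
# +-----+-----+
# |0143A|01:43|
# +-----+-----+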

How to retrieve the month from a date column in a Scala dataframe?

Given:
val df = Seq((1L, "04-04-2015")).toDF("id", "date")
val df2 = df.withColumn("month", from_unixtime(unix_timestamp($"date", "dd/MM/yy"), "MMMMM"))
df2.show()
I got this output:
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015| null|
+---+----------+-----+
However, I want the output to be as below:
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015|April|
+---+----------+-----+
How can I do that in sparkSQL using Scala?
This should do it:
val df2 = df.withColumn("month", date_format(to_date($"date", "dd-MM-yyyy"), "MMMM"))
df2.show
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015|April|
+---+----------+-----+
NOTE:
The first string (to_date) must match the format of your existing date
Be careful with: "dd-MM-yyyy" vs "MM-dd-yyyy"
The second string (date_format) is the format of the output
Docs:
to_date
date_format
Nothing is wrong with your approach; just keep the date format matching your date column.
Happy Hadooping!
Not exactly related to this question, but for anyone who wants the month as an integer there is a month function (it takes a single column, so parse the date first):
val df2 = df.withColumn("month", month(to_date($"date", "dd-MM-yyyy")))
df2.show
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015| 4|
+---+----------+-----+
The same way you can use the year function to get only year.
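A small sketch of both on the parsed date (the df3 name is just for illustration):
import org.apache.spark.sql.functions.{to_date, month, year}

val df3 = df
  .withColumn("month", month(to_date($"date", "dd-MM-yyyy")))
  .withColumn("year", year(to_date($"date", "dd-MM-yyyy")))
df3.show
// +---+----------+-----+----+
// | id|      date|month|year|
// +---+----------+-----+----+
// |  1|04-04-2015|    4|2015|
// +---+----------+-----+----+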

How to obtain rows that have a datetime greater than a specific datetime?

I want to obtain only those rows in a Spark DataFrame df that have a datetime greater than 2017-Jul-10 08:35. How can I do it?
I know how to extract rows corresponding to a specific datetime, e.g. 2017-Jul-10, however I don't know how to make the comparison, i.e. greater than 2017-Jul-10 08:35.
df = df.filter(df("p_datetime") === "2017-Jul-10")
Your p_datetime is in a custom date format, so you need to convert it to a proper date format to compare.
Below is a simple example to illustrate your problem:
val df = Seq(
("2017-Jul-10", "0.26"),
("2017-Jul-9", "0.81"),
("2015-Jul-8", "0.24"),
("2015-Jul-11", "null"),
("2015-Jul-12", "null"),
("2015-Jul-15", "0.13")
).toDF("datetime", "value")
val df1 = df.withColumn("datetime", from_unixtime(unix_timestamp($"datetime", "yyyy-MMM-dd")))
df1.filter($"datetime".gt(lit("2017-07-10"))).show // greater than
df1.filter($"datetime" > (lit("2017-07-10"))).show
Output:
+-------------------+-----+
| datetime|value|
+-------------------+-----+
|2017-07-10 00:00:00| 0.26|
+-------------------+-----+
df1.filter($"datetime".lt(lit("2017-07-10"))).show //less than
df1.filter($"datetime" < (lit("2017-07-10"))).show
Output:
+-------------------+-----+
| datetime|value|
+-------------------+-----+
|2017-07-09 00:00:00| 0.81|
|2015-07-08 00:00:00| 0.24|
|2015-07-11 00:00:00| null|
|2015-07-12 00:00:00| null|
|2015-07-15 00:00:00| 0.13|
+-------------------+-----+
df1.filter($"datetime".geq(lit("2017-07-10"))).show // greater than or equal to
df1.filter($"datetime" >= (lit("2017-07-10"))).show
Output:
+-------------------+-----+
| datetime|value|
+-------------------+-----+
|2017-07-10 00:00:00| 0.26|
+-------------------+-----+
Edit: You can also compare timestamps directly:
import org.apache.spark.sql.types.TimestampType
val df1 = df.withColumn("datetime", unix_timestamp($"datetime", "yyyy-MMM-dd").cast(TimestampType)) // cast the column to timestamp
df1.filter($"datetime" >= (lit("2017-07-10").cast(TimestampType))).show()
// cast "2017-07-10" to a timestamp as well before comparing
Hope this helps!
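To compare against a specific time of day such as 2017-Jul-10 08:35 (as asked originally), the same approach works once both sides are proper timestamps. A sketch, assuming the column actually carries a time component in a yyyy-MMM-dd HH:mm format (the sample data above does not):
import org.apache.spark.sql.functions.{lit, to_timestamp}

df.withColumn("ts", to_timestamp($"datetime", "yyyy-MMM-dd HH:mm"))
  .filter($"ts" > to_timestamp(lit("2017-07-10 08:35"), "yyyy-MM-dd HH:mm"))
  .show()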