How to change date format in Spark? - scala

I have the following DataFrame:
+----------+-------------------+
| timestamp| created|
+----------+-------------------+
|1519858893|2018-03-01 00:01:33|
|1519858950|2018-03-01 00:02:30|
|1519859900|2018-03-01 00:18:20|
|1519859900|2018-03-01 00:18:20|
+----------+-------------------+
How do I create the timestamp correctly?
I was able to create a timestamp column holding the epoch timestamp, but the dates do not coincide:
df.withColumn("timestamp",unix_timestamp($"created"))
For example, 1519858893 points to 2018-02-28.

Just use the date_format and to_utc_timestamp built-in functions:
import org.apache.spark.sql.functions._
df.withColumn("timestamp", to_utc_timestamp(date_format(col("created"), "yyy-MM-dd"), "Asia/Kathmandu"))

Try the code below:
import org.apache.spark.sql.types.DateType
df.withColumn("dateColumn", df("timestamp").cast(DateType))

You can check one solution here: https://stackoverflow.com/a/46595413
To elaborate on that, if the dataframe holds timestamps/dates as strings in different formats, you can do this:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, to_timestamp}
import org.apache.spark.sql.types.DateType
import spark.implicits._
val df = spark.sparkContext.parallelize(Seq("2020-04-21 10:43:12.000Z", "20-04-2019 10:34:12", "11-30-2019 10:34:12", "2020-05-21 21:32:43", "20-04-2019", "2020-04-21")).toDF("ts")
def strToDate(col: Column): Column = {
  // try each format in turn; coalesce keeps the first one that parses
  val formats: Seq[String] = Seq("dd-MM-yyyy HH:mm:ss", "yyyy-MM-dd HH:mm:ss", "dd-MM-yyyy", "yyyy-MM-dd")
  coalesce(formats.map(f => to_timestamp(col, f).cast(DateType)): _*)
}
val formattedDF = df.withColumn("dt", strToDate(df.col("ts")))
formattedDF.show()
+--------------------+----------+
| ts| dt|
+--------------------+----------+
|2020-04-21 10:43:...|2020-04-21|
| 20-04-2019 10:34:12|2019-04-20|
| 2020-05-21 21:32:43|2020-05-21|
| 20-04-2019|2019-04-20|
| 2020-04-21|2020-04-21|
+--------------------+----------+
Note: this code assumes that the data does not contain any values in the formats MM-dd-yyyy or MM-dd-yyyy HH:mm:ss.
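As a quick follow-up sketch (using the formattedDF from above): rows that none of the listed patterns could parse come back as null in dt, so they are easy to spot separately:
import org.apache.spark.sql.functions.col
// rows whose format was not covered by any of the patterns
formattedDF.filter(col("dt").isNull).show(false)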

Related

How to convert one time zone to another in Spark Dataframe

I am reading from PostgreSQL into a Spark dataframe and have a date column in PostgreSQL like below:
last_upd_date
---------------------
"2021-04-21 22:33:06.308639-05"
But in the Spark dataframe it is adding an hour interval,
e.g. 2021-04-22 03:33:06.308639.
Here it is adding 5 hours to the last_upd_date column,
but I want the output as 2021-04-21 22:33:06.308639.
Can anyone help me fix this in the Spark dataframe?
You can create a UDF that formats the timestamp in the required time zone:
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.{udf, to_timestamp, lit}
import spark.implicits._
val formatTimestampWithTz = udf((i: Instant, zone: String) =>
  i.atZone(ZoneId.of(zone))
    .format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS")))
val df = Seq("2021-04-21 22:33:06.308639-05").toDF("dateString")
  .withColumn("date", to_timestamp('dateString, "yyyy-MM-dd HH:mm:ss.SSSSSSx"))
  .withColumn("date in Berlin", formatTimestampWithTz('date, lit("Europe/Berlin")))
  .withColumn("date in Anchorage", formatTimestampWithTz('date, lit("America/Anchorage")))
  .withColumn("date in GMT-5", formatTimestampWithTz('date, lit("-5")))
df.show(numRows = 10, truncate = 50, vertical = true)
Result:
-RECORD 0------------------------------------------
dateString | 2021-04-21 22:33:06.308639-05
date | 2021-04-22 05:33:06.308639
date in Berlin | 2021-04-22 05:33:06.308639
date in Anchorage | 2021-04-21 19:33:06.308639
date in GMT-5 | 2021-04-21 22:33:06.308639
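A hedged alternative without the UDF: Spark renders TimestampType values in the session time zone, so setting that zone to the one behind the original -05 offset makes the dataframe display the same wall-clock time as PostgreSQL (the zone name below is an assumption; use whichever zone your data really belongs to):
// display all timestamps in a UTC-5 zone instead of the cluster default
spark.conf.set("spark.sql.session.timeZone", "America/Chicago")
df.select("date").show(false)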

Converting string time to day timestamp

I have just started working with PySpark, and need some help converting a column's datatype.
My dataframe has a string column, which stores the time of day in AM/PM, and I need to convert this into datetime for further processing/analysis.
fd = spark.createDataFrame([(['0143A'])], ['dt'])
fd.show()
+-----+
| dt|
+-----+
|0143A|
+-----+
from pyspark.sql.functions import date_format, to_timestamp
#fd.select(date_format('dt','hhmma')).show()
fd.select(to_timestamp('dt','hhmmaa')).show()
+----------------------------+
|to_timestamp(`dt`, 'hhmmaa')|
+----------------------------+
| null|
+----------------------------+
Expected output: 01:43
How can I get the proper datetime format in the above scenario?
Thanks for your help!
If we look at the doc for to_timestamp (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.to_timestamp) we see that the format must be specified as a SimpleDateFormat (https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html).
In order to retrieve the time of day in AM/PM, we must use hhmma. But in SimpleDateFormat, a matches AM or PM, not A or P, so we need to change our string:
import pyspark.sql.functions as F
df = spark.createDataFrame([(['0143A'])], ['dt'])
df2 = df.withColumn('dt', F.concat(F.col('dt'), F.lit('M')))
df3 = df2.withColumn('ts', F.to_timestamp('dt','hhmma'))
df3.show()
+------+-------------------+
| dt| ts|
+------+-------------------+
|0143AM|1970-01-01 01:43:00|
+------+-------------------+
If you want to retrieve it as a string in the format you mentioned, you can use date_format:
df4 = df3.withColumn('time', F.date_format(F.col('ts'), format='HH:mm'))
df4.show()
+------+-------------------+-----+
| dt| ts| time|
+------+-------------------+-----+
|0143AM|1970-01-01 01:43:00|01:43|
+------+-------------------+-----+
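Since the rest of this page is Scala, here is a sketch of the same idea in Scala (it assumes the fd dataframe and dt column from the question; pattern handling can differ slightly between Spark 2's SimpleDateFormat parser and Spark 3's java.time parser):
import org.apache.spark.sql.functions.{concat, lit, to_timestamp, date_format, col}
// append "M" so "0143A" becomes "0143AM", then parse and reformat as HH:mm
val fixed = fd
  .withColumn("dt", concat(col("dt"), lit("M")))
  .withColumn("ts", to_timestamp(col("dt"), "hhmma"))
  .withColumn("time", date_format(col("ts"), "HH:mm"))
fixed.show()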

Spark SQL - DataFrame - How to read a different date format

The data set looks like this; I am stuck changing the HIRE_DATE field from string to a date format:
EMPLOYEE_ID,FIRST_NAME,LAST_NAME,EMAIL,PHONE_NUMBER,HIRE_DATE,JOB_ID,SALARY,COMMISSION_PCT,MANAGER_ID,DEPARTMENT_ID
100,Steven,King,SKING,515.123.4567,17-JUN-03,AD_PRES,24000, - , - ,90
101,Neena,Kochhar,NKOCHHAR,515.123.4568,21-SEP-05,AD_VP,17000, - ,100,90
And the code snippet:
val empData = sparkSession.read.option("header", "true").option("inferSchema", "true").csv(filePath)
empData.printSchema()
The printSchema output shows string for the HIRE_DATE field, but I am expecting a date type. How can I change it?
Here is the way I do it:
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.udf
val dateFormat = new SimpleDateFormat("dd-MMM-yy")
def convertStringToDate(stringDate: String) = {
  val parsed = dateFormat.parse(stringDate)
  new java.sql.Date(parsed.getTime())
}
val convertStringToDateUDF = udf(convertStringToDate _)
df.withColumn("HIRE_DATE", convertStringToDateUDF($"HIRE_DATE"))
Spark has its own date type. If you supply the date value as a string in the "yyyy-MM-dd" format, it can be converted to Spark's Date type, so all you have to do is bring the input date string into the "yyyy-MM-dd" format.
For time and date formatting it is generally better to use the java.time libraries.
See below:
import org.apache.spark.sql.functions.{udf, date_format}
import spark.implicits._
val df = spark.read.option("inferSchema", true).option("header", true).csv("in/emp2.txt")
def formatDate(x: String): String = {
  // "17-JUN-03" -> "17-Jun-03" so the month name can be parsed
  val y = x.toLowerCase.split('-').map(_.capitalize).mkString("-")
  val z = java.time.LocalDate.parse(y, java.time.format.DateTimeFormatter.ofPattern("dd-MMM-yy"))
  z.toString
}
val myudfDate = udf(formatDate(_: String): String)
val df2 = df.withColumn("HIRE_DATE2", date_format(myudfDate('HIRE_DATE), "yyyy-MM-dd"))
df2.show(false)
+-----------+----------+---------+--------+------------+---------+-------+------+--------------+----------+-------------+----------+
|EMPLOYEE_ID|FIRST_NAME|LAST_NAME|EMAIL |PHONE_NUMBER|HIRE_DATE|JOB_ID |SALARY|COMMISSION_PCT|MANAGER_ID|DEPARTMENT_ID|HIRE_DATE2|
+-----------+----------+---------+--------+------------+---------+-------+------+--------------+----------+-------------+----------+
|100 |Steven |King |SKING |515.123.4567|17-JUN-03|AD_PRES|24000 | - | - |90 |2003-06-17|
|101 |Neena |Kochhar |NKOCHHAR|515.123.4568|21-SEP-05|AD_VP |17000 | - |100 |90 |2005-09-21|
+-----------+----------+---------+--------+------------+---------+-------+------+--------------+----------+-------------+----------+
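Depending on your Spark version, the built-in to_date with an explicit pattern may do the same job without a UDF; a hedged sketch (the legacy SimpleDateFormat parser accepts uppercase month names like JUN, while the java.time parser used by Spark 3+ can be stricter about case):
import org.apache.spark.sql.functions.{to_date, col}
// parse "17-JUN-03" straight into a DateType column
val df3 = df.withColumn("HIRE_DATE2", to_date(col("HIRE_DATE"), "dd-MMM-yy"))
df3.printSchema()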

Change column value in a dataframe spark scala

This is how my dataframe looks at the moment:
+------------+
| DATE |
+------------+
| 19931001|
| 19930404|
| 19930603|
| 19930805|
+------------+
I am trying to reformat this string value to yyyy-mm-dd hh:mm:ss.fff and keep it as a string, not a date type or timestamp.
How would I do that using the withColumn method?
Here is a solution using a UDF and withColumn. I have assumed that you have a string date field in the dataframe:
//Create the dfList dataframe (imports added so the snippet is self-contained)
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.udf
import spark.implicits._
val dfList = spark.sparkContext
  .parallelize(Seq("19931001", "19930404", "19930603", "19930805")).toDF("DATE")
// define the UDF before using it: parse yyyyMMdd and re-emit as "yyyy-MM-dd HH:mm:ss"
val dateToTimeStamp = udf((date: String) => {
  val stringDate = date.substring(0, 4) + "/" + date.substring(4, 6) + "/" + date.substring(6, 8)
  val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  format.format(new SimpleDateFormat("yyyy/MM/dd").parse(stringDate))
})
dfList.withColumn("DATE", dateToTimeStamp($"DATE")).show()
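If the trailing millisecond part of the requested yyyy-mm-dd hh:mm:ss.fff format is needed, a fixed ".000" can simply be appended, since the source strings carry no sub-second information; a small sketch:
import org.apache.spark.sql.functions.{concat, lit}
// keep it a string and tack on the constant millisecond suffix
val withMillis = dfList.withColumn("DATE", concat(dateToTimeStamp($"DATE"), lit(".000")))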
withClumn("date",
from_unixtime(unix_timestamp($"date", "yyyyMMdd"), "yyyy-MM-dd hh:mm:ss.fff") as "date")
this should work.
Another notice is the that mm gives minutes and MM gives months, hope this help you.
First, I created this DF:
val df = sc.parallelize(Seq("19931001","19930404","19930603","19930805")).toDF("DATE")
For date management we are going to use the Joda-Time library (don't forget to add the joda-time.jar to your classpath):
import org.joda.time.format.DateTimeFormat
import org.joda.time.format.DateTimeFormatter
def func(s: String): String = {
  // note the case: MM is months; mm would be minutes
  val dateFormat = DateTimeFormat.forPattern("yyyyMMdd")
  val resultDate = dateFormat.parseDateTime(s)
  resultDate.toString()
}
Finally, apply the function to the dataframe:
val temp = df.map(l => func(l.get(0).toString()))
val df2 = temp.toDF("DATE")
df2.show()
This answer may still need some work, as I am new to Spark myself, but it gets the job done, I think!

Reading a full timestamp into a dataframe

I am trying to learn Spark and I am reading a dataframe with a timestamp column using the unix_timestamp function as below:
val columnName = "TIMESTAMPCOL"
val sequence = Seq("2016-01-20 12:05:06.999")
val dataframe = {
sequence.toDF(columnName)
}
val typeDataframe = dataframe.withColumn(columnName, org.apache.spark.sql.functions.unix_timestamp($"TIMESTAMPCOL"))
typeDataframe.show
This produces an output:
+------------+
|TIMESTAMPCOL|
+------------+
| 1453320306|
+------------+
How can I read it so that I don't lose the ms, i.e. the .999 part? I tried using unix_timestamp(col: Col, s: String) where s is the SimpleDateFormat, e.g. "yyyy-MM-dd hh:mm:ss", without any luck.
To retain the milliseconds, use the "yyyy-MM-dd HH:mm:ss.SSS" format. You can use date_format like below.
val typeDataframe = dataframe.withColumn(columnName, org.apache.spark.sql.functions.date_format($"TIMESTAMPCOL","yyyy-MM-dd HH:mm:ss.SSS"))
typeDataframe.show
This will give you
+-----------------------+
|TIMESTAMPCOL |
+-----------------------+
|2016-01-20 12:05:06.999|
+-----------------------+
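If you want an actual TimestampType column (rather than a formatted string) that keeps the milliseconds, recent Spark versions can parse it directly with to_timestamp; a hedged sketch reusing the dataframe from the question:
import org.apache.spark.sql.functions.to_timestamp
// parse the string into a timestamp, keeping the .SSS fraction
val tsTyped = dataframe.withColumn(columnName, to_timestamp($"TIMESTAMPCOL", "yyyy-MM-dd HH:mm:ss.SSS"))
tsTyped.printSchema()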