Spark SQL - DataFrame - How to read different date formats - Scala

The data set looks like this. I am stuck converting the HIRE_DATE field to a date-typed field:
EMPLOYEE_ID,FIRST_NAME,LAST_NAME,EMAIL,PHONE_NUMBER,HIRE_DATE,JOB_ID,SALARY,COMMISSION_PCT,MANAGER_ID,DEPARTMENT_ID
100,Steven,King,SKING,515.123.4567,17-JUN-03,AD_PRES,24000, - , - ,90
101,Neena,Kochhar,NKOCHHAR,515.123.4568,21-SEP-05,AD_VP,17000, - ,100,90
And the code snippet:
val empData = sparkSession.read.option("header", "true").option("inferSchema", "true").csv(filePath)
empData.printSchema()
The printSchema output gives string for the HIRE_DATE field, but I am expecting a date-typed field. How can I change this?

Here is the way I do it:
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.udf

val dateFormat = new SimpleDateFormat("dd-MMM-yy")

// Parse e.g. "17-JUN-03" and return it as a java.sql.Date (Spark's DateType)
def convertStringToDate(stringDate: String): java.sql.Date = {
  val parsed = dateFormat.parse(stringDate)
  new java.sql.Date(parsed.getTime)
}

val convertStringToDateUDF = udf(convertStringToDate _)
df.withColumn("HIRE_DATE", convertStringToDateUDF($"HIRE_DATE"))

Spark has its own date type. If you supply the date value as a string in the format "yyyy-MM-dd", it can be converted to Spark's Date type. So all you have to do is bring the input date string into the "yyyy-MM-dd" format.
For time and date handling it is generally better to use the java.time libraries.
See below:
import org.apache.spark.sql.functions.{date_format, udf}

val df = spark.read.option("inferSchema", true).option("header", true).csv("in/emp2.txt")

// Normalize e.g. "17-JUN-03" to "17-Jun-03" so it matches the "dd-MMM-yy" pattern,
// then re-emit it as an ISO "yyyy-MM-dd" string
def formatDate(x: String): String = {
  val y = x.toLowerCase.split('-').map(_.capitalize).mkString("-")
  val z = java.time.LocalDate.parse(y, java.time.format.DateTimeFormatter.ofPattern("dd-MMM-yy"))
  z.toString
}

val myudfDate = udf(formatDate _)
val df2 = df.withColumn("HIRE_DATE2", date_format(myudfDate('HIRE_DATE), "yyyy-MM-dd"))
df2.show(false)
+-----------+----------+---------+--------+------------+---------+-------+------+--------------+----------+-------------+----------+
|EMPLOYEE_ID|FIRST_NAME|LAST_NAME|EMAIL |PHONE_NUMBER|HIRE_DATE|JOB_ID |SALARY|COMMISSION_PCT|MANAGER_ID|DEPARTMENT_ID|HIRE_DATE2|
+-----------+----------+---------+--------+------------+---------+-------+------+--------------+----------+-------------+----------+
|100 |Steven |King |SKING |515.123.4567|17-JUN-03|AD_PRES|24000 | - | - |90 |2003-06-17|
|101 |Neena |Kochhar |NKOCHHAR|515.123.4568|21-SEP-05|AD_VP |17000 | - |100 |90 |2005-09-21|
+-----------+----------+---------+--------+------------+---------+-------+------+--------------+----------+-------------+----------+
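If you prefer to avoid a UDF altogether, a minimal sketch is to parse the column directly with to_date. The "dd-MMM-yy" pattern is an assumption matching the sample data, and depending on your Spark version and locale you may need to normalize the case of the month abbreviation first (as the UDF above does):
import org.apache.spark.sql.functions.to_date

// Sketch: parse "17-JUN-03" straight into a DateType column
val df3 = df.withColumn("HIRE_DATE", to_date(df("HIRE_DATE"), "dd-MMM-yy"))
df3.printSchema() // HIRE_DATE should now be of date type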

Related

How to convert excel date long type to timestamp in scala

Column1 = 43784.2892847338
I tried the option below:
val finalDF = df1.withColumn("Column1", expr("""(Column1 - 25569) * 86400.0"""))
print(finalDF)
Result: 1911120001
The expected result should be in "YYYY-MM-dd HH:mm:ss.SSS" format.
Could you please help me with the solution?
If you use from_unixtime, it will not give you the .SSS milliseconds, so in this case casting to timestamp is the better option. (25569 is the Excel serial-date value of 1970-01-01, so (Column1 - 25569) * 86400 converts the Excel serial date to Unix epoch seconds.)
import org.apache.spark.sql.functions.expr

val df1 = spark.createDataFrame(Seq(("1", 43784.2892847338))).toDF("id", "Column1")
val finalDF = df1.withColumn("Column1_timestamp", expr("(Column1 - 25569) * 86400.0").cast("timestamp"))
finalDF.show(false)
+---+----------------+-----------------------+
|id |Column1 |Column1_timestamp |
+---+----------------+-----------------------+
|1 |43784.2892847338|2019-11-15 06:56:34.201|
+---+----------------+-----------------------+
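To see the difference described above, here is a small sketch (reusing df1 from above) comparing from_unixtime with the timestamp cast; from_unixtime works on whole seconds and drops the fractional part:
import org.apache.spark.sql.functions.{expr, from_unixtime}

val compared = df1
  .withColumn("no_millis", from_unixtime(expr("(Column1 - 25569) * 86400.0").cast("long")))
  .withColumn("with_millis", expr("(Column1 - 25569) * 86400.0").cast("timestamp"))
compared.show(false)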

Validate time_stamp in input spark dataframe to generate correct output spark dataframe

I have a Spark data frame which contains multiple columns, one of which is the "t_s" column.
I want to generate a new data frame with the following conditions:
a. if the value of the "t_s" column is empty or not of the correct format, then generate current_timestamp.
b. if the value of the "t_s" column is not empty and of the correct format, then use the same value.
I have managed to write the following code, but I also want to plug in a check for whether "t_s" is of the correct format:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{current_timestamp, when}

def generateTimeStamp(df: DataFrame): DataFrame = {
  import spark.implicits._
  df.withColumn("t_s", when($"t_s".isNull, current_timestamp()).otherwise($"t_s"))
}

val fmt = "yyyy-MM-dd HH:mm:ss"
val formatter = java.time.format.DateTimeFormatter.ofPattern(fmt)

def isCompatible(s: String): Boolean = try {
  java.time.LocalDateTime.parse(s, formatter)
  true
} catch {
  case e: java.time.format.DateTimeParseException => false
}
I also want to check the value of the "t_s" column with the isCompatible() function.
How can I do this?
How about:
import org.apache.spark.sql.functions.{current_timestamp, date_format, to_timestamp, when}
import spark.implicits._

val fmt = "yyyy-MM-dd HH:mm:ss"
val df = Seq(
  "2019-10-21 14:45:23",
  "2019-10-22 14:45:23",
  null,
  "2019-10-41 14:45:23" // invalid day
).toDF("ts")

// to_timestamp returns null for both missing and unparseable values,
// so a single isNull check covers conditions (a) and (b)
df.withColumn("ts", to_timestamp($"ts", fmt))
  .withColumn("ts", when($"ts".isNull, date_format(current_timestamp(), fmt)).otherwise($"ts"))
  .show(false)
+-------------------+
|ts |
+-------------------+
|2019-10-21 14:45:23|
|2019-10-22 14:45:23|
|2019-08-20 13:54:23|
|2019-08-20 13:54:23|
+-------------------+
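If you specifically want to keep the isCompatible() check from the question as a UDF (rather than relying on to_timestamp returning null), a minimal sketch could look like the following. Here df stands for the original dataframe with the "t_s" column, fmt is the same pattern as above, and the formatter is built inside the UDF so nothing non-serializable is captured:
import java.time.LocalDateTime
import java.time.format.{DateTimeFormatter, DateTimeParseException}
import org.apache.spark.sql.functions.{col, current_timestamp, date_format, udf, when}

// True only when the string is non-null and parses with the expected pattern
val isCompatibleUDF = udf { (s: String) =>
  s != null && (try {
    LocalDateTime.parse(s, DateTimeFormatter.ofPattern(fmt))
    true
  } catch {
    case _: DateTimeParseException => false
  })
}

val updatedDF = df.withColumn(
  "t_s",
  when(isCompatibleUDF(col("t_s")), col("t_s"))
    .otherwise(date_format(current_timestamp(), fmt)))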

How to change date format in Spark?

I have the following DataFrame:
+----------+-------------------+
| timestamp| created|
+----------+-------------------+
|1519858893|2018-03-01 00:01:33|
|1519858950|2018-03-01 00:02:30|
|1519859900|2018-03-01 00:18:20|
|1519859900|2018-03-01 00:18:20|
+----------+-------------------+
How can I create the timestamp correctly?
I was able to create a timestamp column which is an epoch timestamp, but the dates do not coincide:
df.withColumn("timestamp",unix_timestamp($"created"))
For example, 1519858893 points to 2018-02-28.
Just use the date_format and to_utc_timestamp built-in functions:
import org.apache.spark.sql.functions._
df.withColumn("timestamp", to_utc_timestamp(date_format(col("created"), "yyyy-MM-dd"), "Asia/Kathmandu"))
Try the code below:
import org.apache.spark.sql.types.DateType

df.withColumn("dateColumn", df("timestamp").cast(DateType))
You can check one solution here https://stackoverflow.com/a/46595413
To elaborate more on that, when the dataframe has strings in several different timestamp/date formats, you can do this:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, to_timestamp}
import org.apache.spark.sql.types.DateType
import spark.implicits._

val df = spark.sparkContext.parallelize(Seq("2020-04-21 10:43:12.000Z", "20-04-2019 10:34:12", "11-30-2019 10:34:12", "2020-05-21 21:32:43", "20-04-2019", "2020-04-21")).toDF("ts")

// Try each candidate format; coalesce keeps the first successful parse
def strToDate(col: Column): Column = {
  val formats: Seq[String] = Seq("dd-MM-yyyy HH:mm:SS", "yyyy-MM-dd HH:mm:SS", "dd-MM-yyyy", "yyyy-MM-dd")
  coalesce(formats.map(f => to_timestamp(col, f).cast(DateType)): _*)
}
val formattedDF = df.withColumn("dt", strToDate(df.col("ts")))
formattedDF.show()
+--------------------+----------+
| ts| dt|
+--------------------+----------+
|2020-04-21 10:43:...|2020-04-21|
| 20-04-2019 10:34:12|2019-04-20|
| 2020-05-21 21:32:43|2020-05-21|
| 20-04-2019|2019-04-20|
| 2020-04-21|2020-04-21|
+--------------------+----------+
Note: this code assumes that the data does not contain any values in the formats MM-dd-yyyy or MM-dd-yyyy HH:mm:SS, since those are ambiguous with the dd-MM-yyyy patterns.

Change column value in a dataframe spark scala

This is how my dataframe looks at the moment:
+------------+
| DATE |
+------------+
| 19931001|
| 19930404|
| 19930603|
| 19930805|
+------------+
I am trying to reformat this string value to yyyy-mm-dd hh:mm:ss.fff and keep it as a string, not a date type or timestamp.
How would I do that using the withColumn method?
Here is a solution using a UDF and withColumn. I have assumed that you have a string date field in the Dataframe:
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.udf
import spark.implicits._

// Create the dfList dataframe
val dfList = spark.sparkContext
  .parallelize(Seq("19931001", "19930404", "19930603", "19930805")).toDF("DATE")

// Reformat "yyyyMMdd" strings to "yyyy-MM-dd HH:mm:ss"
val dateToTimeStamp = udf((date: String) => {
  val stringDate = date.substring(0, 4) + "/" + date.substring(4, 6) + "/" + date.substring(6, 8)
  val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  format.format(new SimpleDateFormat("yyyy/MM/dd").parse(stringDate))
})

dfList.withColumn("DATE", dateToTimeStamp($"DATE")).show()
df.withColumn("date",
  from_unixtime(unix_timestamp($"date", "yyyyMMdd"), "yyyy-MM-dd HH:mm:ss.SSS"))
this should work.
One more thing to notice is that mm gives minutes while MM gives months; likewise, use HH for the 24-hour clock and SSS for milliseconds ("fff" is not a valid pattern letter). Hope this helps.
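For completeness, a runnable sketch of that one-liner against the sample values from the question might look like this (column name "DATE" as in the question; the "yyyy-MM-dd HH:mm:ss.SSS" output pattern is my reading of the requested ".fff"):
import org.apache.spark.sql.functions.{col, from_unixtime, unix_timestamp}
import spark.implicits._

val dates = Seq("19931001", "19930404", "19930603", "19930805").toDF("DATE")
dates
  .withColumn("DATE", from_unixtime(unix_timestamp(col("DATE"), "yyyyMMdd"), "yyyy-MM-dd HH:mm:ss.SSS"))
  .show(false)
// e.g. 1993-10-01 00:00:00.000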
First, I created this DF:
val df = sc.parallelize(Seq("19931001","19930404","19930603","19930805")).toDF("DATE")
For date handling we are going to use the Joda-Time library (don't forget to add the joda-time jar to the classpath):
import org.joda.time.format.DateTimeFormat

// Parse a "yyyyMMdd" string (note: MM is months, mm would be minutes)
def func(s: String): String = {
  val dateFormat = DateTimeFormat.forPattern("yyyyMMdd")
  val resultDate = dateFormat.parseDateTime(s)
  resultDate.toString
}
Finally, apply the function to the dataframe:
val temp = df.map(l => func(l.get(0).toString()))
val df2 = temp.toDF("DATE")
df2.show()
This answer still needs some work (I am new to Spark myself), but it gets the job done, I think!
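If you want to keep the other columns and emit exactly the string format the question asks for, a hedged variation of the same Joda-based idea is to wrap it in a UDF inside withColumn (again, "yyyy-MM-dd HH:mm:ss.SSS" is my assumption for the requested ".fff"; the formatters are created inside the UDF body so the closure only captures plain strings):
import org.apache.spark.sql.functions.{col, udf}
import org.joda.time.format.DateTimeFormat

val inPattern = "yyyyMMdd"
val outPattern = "yyyy-MM-dd HH:mm:ss.SSS"
val reformatDate = udf { (s: String) =>
  DateTimeFormat.forPattern(outPattern).print(DateTimeFormat.forPattern(inPattern).parseDateTime(s))
}

val df3 = df.withColumn("DATE", reformatDate(col("DATE")))
df3.show(false)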

Reading a full timestamp into a dataframe

I am trying to learn Spark and I am reading a dataframe with a timestamp column using the unix_timestamp function as below:
val columnName = "TIMESTAMPCOL"
val sequence = Seq("2016-01-20 12:05:06.999")
val dataframe = sequence.toDF(columnName)
val typeDataframe = dataframe.withColumn(columnName, org.apache.spark.sql.functions.unix_timestamp($"TIMESTAMPCOL"))
typeDataframe.show
This produces an output:
+------------+
|TIMESTAMPCOL|
+------------+
| 1453320306|
+------------+
How can I read it so that I don't lose the milliseconds, i.e. the .999 part? I tried using unix_timestamp(col: Column, s: String) where s is the SimpleDateFormat pattern, e.g. "yyyy-MM-dd hh:mm:ss", without any luck.
To retain the milliseconds, use the "yyyy-MM-dd HH:mm:ss.SSS" format. You can use date_format as below:
val typeDataframe = dataframe.withColumn(columnName, org.apache.spark.sql.functions.date_format($"TIMESTAMPCOL","yyyy-MM-dd HH:mm:ss.SSS"))
typeDataframe.show
This will give you
+-----------------------+
|TIMESTAMPCOL |
+-----------------------+
|2016-01-20 12:05:06.999|
+-----------------------+
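As a side note, if you want an actual TimestampType column rather than a formatted string, a simple sketch is to cast the string directly; the cast preserves the fractional seconds as well:
val tsDataframe = dataframe.withColumn(columnName, $"TIMESTAMPCOL".cast("timestamp"))
tsDataframe.printSchema() // TIMESTAMPCOL: timestamp
tsDataframe.show(false)   // 2016-01-20 12:05:06.999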