Change column value in a dataframe spark scala

This is what my dataframe looks like at the moment:
+--------+
|    DATE|
+--------+
|19931001|
|19930404|
|19930603|
|19930805|
+--------+
I am trying to reformat this string value to yyyy-mm-dd hh:mm:ss.fff and keep it as a string, not a date or timestamp type.
How would I do that using the withColumn method?

Here is a solution using a UDF and withColumn. I have assumed that you have a string date field in the DataFrame.
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.udf
import spark.implicits._
// Create the dfList DataFrame
val dfList = spark.sparkContext
  .parallelize(Seq("19931001", "19930404", "19930603", "19930805")).toDF("DATE")
// Define the UDF before it is used: yyyyMMdd string -> formatted string
val dateToTimeStamp = udf((date: String) => {
  val stringDate = date.substring(0, 4) + "/" + date.substring(4, 6) + "/" + date.substring(6, 8)
  val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  format.format(new SimpleDateFormat("yyyy/MM/dd").parse(stringDate))
})
dfList.withColumn("DATE", dateToTimeStamp($"DATE")).show()

df.withColumn("DATE",
  from_unixtime(unix_timestamp($"DATE", "yyyyMMdd"), "yyyy-MM-dd HH:mm:ss.SSS"))
This should work.
Also note that mm gives minutes and MM gives months. Likewise, HH is the 24-hour clock (hh is the 12-hour clock) and SSS is milliseconds; fff is not a valid pattern here. Hope this helps.
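To double-check the pattern letters outside Spark, the same conversion can be sketched with plain java.time. This is only an illustration under assumptions: the names DateReformat and reformat are hypothetical, and the input is assumed to always be an 8-digit yyyyMMdd string.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object DateReformat {
  // Input pattern: MM is month-of-year (mm would mean minute-of-hour).
  private val inFmt = DateTimeFormatter.ofPattern("yyyyMMdd")
  // Output pattern: HH is the 24-hour clock, SSS is milliseconds.
  private val outFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")

  // Hypothetical helper: parse the date, take midnight, format as a string.
  def reformat(s: String): String =
    LocalDate.parse(s, inFmt).atStartOfDay().format(outFmt)
}
```

For example, DateReformat.reformat("19931001") should give "1993-10-01 00:00:00.000". Since the input carries no time of day, the time part is always midnight.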

First, I created this DF:
val df = sc.parallelize(Seq("19931001","19930404","19930603","19930805")).toDF("DATE")
For date handling we are going to use the Joda-Time library (don't forget to add the joda-time jar to the classpath):
import org.joda.time.format.DateTimeFormat
def func(s: String): String = {
  // MM is month-of-year; mm would be parsed as minutes
  val dateFormat = DateTimeFormat.forPattern("yyyyMMdd")
  val resultDate = dateFormat.parseDateTime(s)
  // Format explicitly; the default toString() returns ISO-8601 instead
  resultDate.toString("yyyy-MM-dd HH:mm:ss.SSS")
}
Finally, apply the function to dataframe:
val temp = df.map(l => func(l.get(0).toString()))
val df2 = temp.toDF("DATE")
df2.show()
This answer still needs some work, as I am new to Spark myself, but I think it gets the job done!

Related

How to convert excel date long type to timestamp in scala

Column1 = 43784.2892847338
I tried the option below:
val finalDF = df1.withColumn("Column1", expr("""(Column1 - 25569) * 86400.0"""))
print(finalDF)
Result: 1911120001
The expected result should be in the format "yyyy-MM-dd HH:mm:ss.SSS".
Could you please help me with a solution?
from_unixtime works in whole seconds, so it will not give you the .SSS milliseconds. In this case it is better to cast the computed value to timestamp.
val df1 = spark.createDataFrame(Seq(("1", 43784.2892847338))).toDF("id", "Column1")
val finalDF = df1.withColumn("Column1_timestamp", expr("""(Column1 -25569) * 86400.0""").cast("timestamp"))
finalDF.show(false)
+---+----------------+-----------------------+
|id |Column1         |Column1_timestamp      |
+---+----------------+-----------------------+
|1  |43784.2892847338|2019-11-15 06:56:34.201|
+---+----------------+-----------------------+
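The arithmetic behind that expression can be checked without Spark. The sketch below is an illustration only: the names ExcelSerial, toEpochMillis and format are hypothetical, and it assumes the serial number uses Excel's 1900 date system (where 25569 corresponds to 1970-01-01) and that the result should be rendered in UTC.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

object ExcelSerial {
  private val fmt = DateTimeFormatter
    .ofPattern("yyyy-MM-dd HH:mm:ss.SSS")
    .withZone(ZoneOffset.UTC) // assumption: render in UTC

  // Days since 1970-01-01, times 86,400,000 ms per day.
  // Rounding to the nearest millisecond absorbs floating-point noise.
  def toEpochMillis(serial: Double): Long =
    math.round((serial - 25569.0) * 86400000.0)

  def format(serial: Double): String =
    fmt.format(Instant.ofEpochMilli(toEpochMillis(serial)))
}
```

With the value from the question, ExcelSerial.format(43784.2892847338) should give "2019-11-15 06:56:34.201", matching the Spark output above when the session time zone is UTC.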

Validate time_stamp in input spark dataframe to generate correct output spark dataframe

I have a Spark data frame which contains multiple columns, one of which is the "t_s" column.
I want to generate a new data frame with the following conditions:
a. If the value of the "t_s" column is empty or not in the correct format, generate current_timestamp.
b. If the value of the "t_s" column is not empty and in the correct format, use the same value.
I have managed to write the following code, but I also want to plug in a check for whether "t_s" is in the correct format.
def generateTimeStamp(df: DataFrame): DataFrame = {
  import spark.implicits._
  // Replace null t_s values with the current timestamp
  df.withColumn("t_s", when($"t_s".isNull, current_timestamp()).otherwise($"t_s"))
}
val fmt = "yyyy-MM-dd HH:mm:ss"
val df = java.time.format.DateTimeFormatter.ofPattern(fmt)
def isCompatible(s: String) = try {
java.time.LocalDateTime.parse(s, df)
true
} catch {
case e: java.time.format.DateTimeParseException => false
}
I also want to check the value of the "t_s" column with a call to the isCompatible() function.
How can I do this?
How about:
val fmt = "yyyy-MM-dd HH:mm:ss"
val df = Seq(
"2019-10-21 14:45:23",
"2019-10-22 14:45:23",
null,
"2019-10-41 14:45:23", //invalid day
).toDF("ts")
df.withColumn("ts", to_timestamp($"ts", fmt))
.withColumn("ts", when($"ts".isNull, date_format(current_timestamp(), fmt)).otherwise($"ts"))
.show(false)
+-------------------+
|ts                 |
+-------------------+
|2019-10-21 14:45:23|
|2019-10-22 14:45:23|
|2019-08-20 13:54:23|
|2019-08-20 13:54:23|
+-------------------+
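The same keep-or-replace rule can be sketched in plain Scala, closer to the isCompatible() idea from the question. This is an illustration under assumptions: the names TimestampCheck, parse and orFallback are hypothetical, and the fallback is passed in explicitly instead of calling the clock so that the behaviour is easy to check.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

object TimestampCheck {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Some(parsed) for a well-formed value, None for null or a bad format.
  def parse(s: String): Option[LocalDateTime] =
    Option(s).flatMap(v => Try(LocalDateTime.parse(v, fmt)).toOption)

  // Keep a valid value as-is; otherwise substitute the supplied fallback.
  def orFallback(s: String, fallback: LocalDateTime): String =
    parse(s).map(_ => s).getOrElse(fallback.format(fmt))
}
```

A function like orFallback could be wrapped in a UDF, although the built-in to_timestamp approach above avoids the UDF overhead.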

Spark SQL - DataFrame - How to read different format date format

The data set looks like this. I am stuck on changing the HIRE_DATE field to a date type.
EMPLOYEE_ID,FIRST_NAME,LAST_NAME,EMAIL,PHONE_NUMBER,HIRE_DATE,JOB_ID,SALARY,COMMISSION_PCT,MANAGER_ID,DEPARTMENT_ID
100,Steven,King,SKING,515.123.4567,17-JUN-03,AD_PRES,24000, - , - ,90
101,Neena,Kochhar,NKOCHHAR,515.123.4568,21-SEP-05,AD_VP,17000, - ,100,90
And the code snippet:
val empData = sparkSession.read.option("header", "true").option("inferSchema", "true")
  .csv(filePath)
empData.printSchema()
The printSchema output shows string for the HIRE_DATE field, but I am expecting a date field. How can I change this?
Here is the way I do it:
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.udf
val dateFormat = new SimpleDateFormat("dd-MMM-yy")
// Parse e.g. "17-JUN-03" into a java.sql.Date, which Spark maps to DateType
def convertStringToDate(stringDate: String) = {
  val parsed = dateFormat.parse(stringDate)
  new java.sql.Date(parsed.getTime())
}
val convertStringToDateUDF = udf(convertStringToDate _)
empData.withColumn("HIRE_DATE", convertStringToDateUDF($"HIRE_DATE"))
Spark has its own date type. If you supply the date value as a string in the format "yyyy-MM-dd", it can be converted to Spark's Date type. So all you have to do is bring the input date string into the "yyyy-MM-dd" format.
For date and time formatting, it is generally better to use the java.time libraries.
See below:
val df = spark.read.option("inferSchema",true).option("header", true).csv("in/emp2.txt")
def formatDate(x:String):String =
{
val y = x.toLowerCase.split('-').map(_.capitalize).mkString("-")
val z= java.time.LocalDate.parse(y,java.time.format.DateTimeFormatter.ofPattern("dd-MMM-yy"))
z.toString
}
val myudfDate = udf ( formatDate(_:String):String )
val df2 = df.withColumn("HIRE_DATE2", date_format(myudfDate('HIRE_DATE),"yyyy-MM-dd") )
df2.show(false)
+-----------+----------+---------+--------+------------+---------+-------+------+--------------+----------+-------------+----------+
|EMPLOYEE_ID|FIRST_NAME|LAST_NAME|EMAIL   |PHONE_NUMBER|HIRE_DATE|JOB_ID |SALARY|COMMISSION_PCT|MANAGER_ID|DEPARTMENT_ID|HIRE_DATE2|
+-----------+----------+---------+--------+------------+---------+-------+------+--------------+----------+-------------+----------+
|100        |Steven    |King     |SKING   |515.123.4567|17-JUN-03|AD_PRES|24000 | -            | -        |90           |2003-06-17|
|101        |Neena     |Kochhar  |NKOCHHAR|515.123.4568|21-SEP-05|AD_VP  |17000 | -            |100       |90           |2005-09-21|
+-----------+----------+---------+--------+------------+---------+-------+------+--------------+----------+-------------+----------+
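As an alternative to re-capitalizing the month by hand, java.time can parse the upper-case month directly with a case-insensitive formatter. The sketch below is only an illustration: the names HireDate and toIso are hypothetical, and it assumes English month abbreviations and two-digit years in the 2000s.

```scala
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, DateTimeFormatterBuilder}
import java.util.Locale

object HireDate {
  // parseCaseInsensitive lets "JUN" match the month text "Jun".
  private val fmt: DateTimeFormatter = new DateTimeFormatterBuilder()
    .parseCaseInsensitive()
    .appendPattern("dd-MMM-yy")
    .toFormatter(Locale.ENGLISH)

  // LocalDate.toString yields the ISO form yyyy-MM-dd.
  def toIso(s: String): String = LocalDate.parse(s, fmt).toString
}
```

For example, HireDate.toIso("17-JUN-03") should give "2003-06-17", which Spark can then cast to its Date type.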

Reading a full timestamp into a dataframe

I am trying to learn Spark and I am reading a dataframe with a timestamp column using the unix_timestamp function as below:
val columnName = "TIMESTAMPCOL"
val sequence = Seq("2016-01-20 12:05:06.999")
val dataframe = {
sequence.toDF(columnName)
}
val typeDataframe = dataframe.withColumn(columnName, org.apache.spark.sql.functions.unix_timestamp($"TIMESTAMPCOL"))
typeDataframe.show
This produces an output:
+------------+
|TIMESTAMPCOL|
+------------+
|  1453320306|
+------------+
How can I read it so that I don't lose the ms, i.e. the .999 part? I tried using unix_timestamp(col: Column, s: String), where s is the SimpleDateFormat pattern, e.g. "yyyy-MM-dd hh:mm:ss", without any luck.
unix_timestamp returns whole seconds, so the fraction is always dropped. To retain the milliseconds, use the "yyyy-MM-dd HH:mm:ss.SSS" format. You can use date_format as shown below.
val typeDataframe = dataframe.withColumn(columnName, org.apache.spark.sql.functions.date_format($"TIMESTAMPCOL","yyyy-MM-dd HH:mm:ss.SSS"))
typeDataframe.show
This will give you
+-----------------------+
|TIMESTAMPCOL           |
+-----------------------+
|2016-01-20 12:05:06.999|
+-----------------------+
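The difference between a seconds-based value and a millisecond-preserving one can be seen in plain java.time. This is a sketch under assumptions: the object and method names are hypothetical, and the value is interpreted as UTC purely for illustration (the 1453320306 in the question reflects the asker's own time zone).

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

object MillisDemo {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")

  // Epoch seconds: the fractional part is discarded, as with unix_timestamp.
  def epochSeconds(s: String): Long =
    LocalDateTime.parse(s, fmt).toEpochSecond(ZoneOffset.UTC)

  // Epoch milliseconds: the .999 survives.
  def epochMillis(s: String): Long =
    LocalDateTime.parse(s, fmt).toInstant(ZoneOffset.UTC).toEpochMilli

  // Parsing and re-formatting with the .SSS pattern preserves the fraction.
  def roundTrip(s: String): String = LocalDateTime.parse(s, fmt).format(fmt)
}
```

For "2016-01-20 12:05:06.999", epochSeconds gives 1453291506 while epochMillis gives 1453291506999, so only the millisecond value keeps the fraction.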

How to add a new column with day of week based on another in dataframe?

I have a field in a data frame currently formatted as a string (mm/dd/yyyy) and I want to create a new column in that data frame with the day of week name (i.e. Thursday) for that field. I've imported
import com.github.nscala_time.time.Imports._
but am not sure where to go from here.
Create formatter:
val fmt = DateTimeFormat.forPattern("MM/dd/yyyy")
Parse date:
val dt = fmt.parseDateTime("09/11/2015")
Get a day of the week:
dt.toString("EEEEE")
Wrap it using org.apache.spark.sql.functions.udf and you have a complete solution. Still, there is no need for that, since HiveContext already provides all the required UDFs:
val df = sc.parallelize(Seq(
Tuple1("08/11/2015"), Tuple1("09/11/2015"), Tuple1("09/12/2015")
)).toDF("date_string")
df.registerTempTable("df")
sqlContext.sql(
  """SELECT date_string,
     from_unixtime(unix_timestamp(date_string,'MM/dd/yyyy'), 'EEEE') AS dow
     FROM df"""
).show
// +-----------+--------+
// |date_string|     dow|
// +-----------+--------+
// | 08/11/2015| Tuesday|
// | 09/11/2015|  Friday|
// | 09/12/2015|Saturday|
// +-----------+--------+
EDIT:
Since Spark 1.5 you can use the from_unixtime and unix_timestamp functions directly:
import org.apache.spark.sql.functions.{from_unixtime, unix_timestamp}
df.select(from_unixtime(
  unix_timestamp($"date_string", "MM/dd/yyyy"), "EEEE").alias("dow"))
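If a UDF is preferred after all, the day name can also be computed with java.time instead of Joda-Time. The sketch below is an illustration only: the names DayOfWeekName and of are hypothetical, and English day names are assumed.

```scala
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, TextStyle}
import java.util.Locale

object DayOfWeekName {
  private val fmt = DateTimeFormatter.ofPattern("MM/dd/yyyy")

  // Parse the MM/dd/yyyy string and return the full English day name.
  def of(s: String): String =
    LocalDate.parse(s, fmt).getDayOfWeek.getDisplayName(TextStyle.FULL, Locale.ENGLISH)
}
```

For example, DayOfWeekName.of("09/11/2015") should give "Friday", matching the dow column above.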