Change column value in a dataframe spark scala


This is what my dataframe looks like at the moment:
+------------+
| DATE |
+------------+
| 19931001|
| 19930404|
| 19930603|
| 19930805|
+------------+
I am trying to reformat this string value to yyyy-mm-dd hh:mm:ss.fff and keep it as a string, not a date or timestamp type.
How would I do that using the withColumn method?

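Here is a solution using a UDF and withColumn. I have assumed that the DataFrame has a string date field.
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.udf
import spark.implicits._
// Create the dfList DataFrame
val dfList = spark.sparkContext
  .parallelize(Seq("19931001", "19930404", "19930603", "19930805")).toDF("DATE")
// Define the UDF before it is used: rebuild the date as yyyy/MM/dd, parse it, then format it as a timestamp string
val dateToTimeStamp = udf((date: String) => {
  val stringDate = date.substring(0, 4) + "/" + date.substring(4, 6) + "/" + date.substring(6, 8)
  val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
  format.format(new SimpleDateFormat("yyyy/MM/dd").parse(stringDate))
})
// show(false) avoids truncating the 23-character result
dfList.withColumn("DATE", dateToTimeStamp($"DATE")).show(false)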

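withColumn("date",
  from_unixtime(unix_timestamp($"date", "yyyyMMdd"), "yyyy-MM-dd HH:mm:ss.SSS"))
This should work. Since unix_timestamp returns whole seconds, the milliseconds part is always 000 here.
Also note that mm gives minutes and MM gives months, and that HH is the 24-hour clock while hh is the 12-hour clock. Hope this helps.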
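The pattern letters mentioned above can be checked without a Spark session. The following is a minimal sketch in plain Scala that uses java.time instead of the Spark functions from the answers; the names DatePatternSketch, reformat and reformat12 are made up for illustration.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object DatePatternSketch {
  // Input pattern: MM is the month (mm would mean minutes)
  private val in = DateTimeFormatter.ofPattern("yyyyMMdd")
  // Output pattern: HH is the 24-hour clock, SSS is milliseconds
  private val out = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")
  // Same output with hh (12-hour clock), to show the difference at midnight
  private val out12 = DateTimeFormatter.ofPattern("yyyy-MM-dd hh:mm:ss.SSS")

  // "19931001" -> "1993-10-01 00:00:00.000"
  def reformat(s: String): String =
    LocalDate.parse(s, in).atStartOfDay().format(out)

  // "19931001" -> "1993-10-01 12:00:00.000" (midnight on a 12-hour clock)
  def reformat12(s: String): String =
    LocalDate.parse(s, in).atStartOfDay().format(out12)
}
```

With hh, midnight is rendered as 12, which is why HH is the safer choice when the result has no AM/PM marker.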

First, I created this DataFrame:
val df = sc.parallelize(Seq("19931001","19930404","19930603","19930805")).toDF("DATE")
For date handling we are going to use the Joda-Time library (don't forget to add the joda-time jar to the classpath):
import org.joda.time.format.DateTimeFormat
// Parse yyyyMMdd (MM = month; mm would mean minutes) and format the result as a timestamp string
def func(s: String): String = {
  val dateFormat = DateTimeFormat.forPattern("yyyyMMdd")
  val resultDate = dateFormat.parseDateTime(s)
  resultDate.toString("yyyy-MM-dd HH:mm:ss.SSS")
}
Finally, apply the function to the DataFrame:
val temp = df.map(l => func(l.get(0).toString()))
val df2 = temp.toDF("DATE")
df2.show(false)
I am new to Spark myself, so this answer may still need some work, but I think it gets the job done.

Related

How to convert excel date long type to timestamp in scala

Column1 = 43784.2892847338
I tried the option below:
val finalDF=df1.withColumn("Column1 ",expr("""(Column1 -25569) * 86400.0"""))
print(finalDF)
Result: 1911120001
The expected result should be in the format "yyyy-MM-dd HH:mm:ss.SSS".
Could you please help me with a solution?

from_unixtime works with whole seconds, so it will not give you the .SSS milliseconds. In this case it is better to cast the value to a timestamp. Note that the displayed value depends on the session time zone.
val df1 = spark.createDataFrame(Seq(("1", 43784.2892847338))).toDF("id", "Column1")
val finalDF = df1.withColumn("Column1_timestamp", expr("""(Column1 -25569) * 86400.0""").cast("timestamp"))
finalDF.show(false)
+---+----------------+-----------------------+
|id |Column1 |Column1_timestamp |
+---+----------------+-----------------------+
|1 |43784.2892847338|2019-11-15 06:56:34.201|
+---+----------------+-----------------------+
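The arithmetic behind the cast can be sketched in plain Scala: 25569 is the Excel serial number of 1970-01-01, and one day has 86400 seconds. This is only an illustration under the assumption that the result should be rendered in UTC; ExcelSerialSketch and excelSerialToString are hypothetical names, not part of Spark.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

object ExcelSerialSketch {
  // Assumption: format in UTC; Spark would use the session time zone instead
  private val fmt = DateTimeFormatter
    .ofPattern("yyyy-MM-dd HH:mm:ss.SSS")
    .withZone(ZoneOffset.UTC)

  def excelSerialToString(serial: Double): String = {
    // Days since 1970-01-01 -> seconds -> milliseconds, rounded to the nearest millisecond
    val epochMillis = math.round((serial - 25569.0) * 86400.0 * 1000.0)
    fmt.format(Instant.ofEpochMilli(epochMillis))
  }
}
```

For the value in the question, ExcelSerialSketch.excelSerialToString(43784.2892847338) gives 2019-11-15 06:56:34.201, matching the Spark output above.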

Validate time_stamp in input spark dataframe to generate correct output spark dataframe

I have a Spark DataFrame that contains multiple columns, one of which is the "t_s" column.
I want to generate a new DataFrame under the following conditions:
a. If the value of the "t_s" column is empty or not in the correct format, generate current_timestamp.
b. If the value of the "t_s" column is not empty and in the correct format, use the same value.
I have managed to write the following code, but I also want to check whether "t_s" is in the correct format:
def generateTimeStamp(df: DataFrame) = {
import spark.implicits._
var updatedDF = df
updatedDF = df.withColumn("t_s", when(($"t_s").isNull, current_timestamp()).otherwise($"t_s"))
updatedDF
}
val fmt = "yyyy-MM-dd HH:mm:ss"
val df = java.time.format.DateTimeFormatter.ofPattern(fmt)
def isCompatible(s: String) = try {
java.time.LocalDateTime.parse(s, df)
true
} catch {
case e: java.time.format.DateTimeParseException => false
}
I also want to check the value of the "t_s" column by calling the isCompatible() function.
How can I do this?

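How about the following? to_timestamp returns null for any value that cannot be parsed with fmt (unless ANSI mode is enabled, in which case it raises an error), so missing and malformed values can both be replaced in a single when/otherwise: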
val fmt = "yyyy-MM-dd HH:mm:ss"
val df = Seq(
"2019-10-21 14:45:23",
"2019-10-22 14:45:23",
null,
"2019-10-41 14:45:23", //invalid day
).toDF("ts")
df.withColumn("ts", to_timestamp($"ts", fmt))
.withColumn("ts", when($"ts".isNull, date_format(current_timestamp(), fmt)).otherwise($"ts"))
.show(false)
+-------------------+
|ts |
+-------------------+
|2019-10-21 14:45:23|
|2019-10-22 14:45:23|
|2019-08-20 13:54:23|
|2019-08-20 13:54:23|
+-------------------+
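If the check has to go through a function like isCompatible from the question, note that the version in the question would throw a NullPointerException on a null input, since LocalDateTime.parse does not accept null. Below is a minimal null-safe sketch in plain Scala; TsFallbackSketch and validOrElse are hypothetical names, and the strict resolver style is an assumption about how tight the validation should be.

```scala
import java.time.LocalDateTime
import java.time.format.{DateTimeFormatter, DateTimeParseException, ResolverStyle}

object TsFallbackSketch {
  // "uuuu" is used instead of "yyyy" because strict resolving needs an era for year-of-era
  private val formatter = DateTimeFormatter
    .ofPattern("uuuu-MM-dd HH:mm:ss")
    .withResolverStyle(ResolverStyle.STRICT)

  // True only for non-null strings that parse with the formatter
  def isCompatible(s: String): Boolean =
    s != null && (try {
      LocalDateTime.parse(s, formatter)
      true
    } catch {
      case _: DateTimeParseException => false
    })

  // Keep a valid value; otherwise use the supplied fallback timestamp
  def validOrElse(s: String, fallback: LocalDateTime): String =
    if (isCompatible(s)) s else fallback.format(formatter)
}
```

Such a function could be wrapped with org.apache.spark.sql.functions.udf and used inside when(...), although the built-in to_timestamp approach above avoids the UDF entirely.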

Spark SQL - DataFrame - How to read different format date format

The data set looks like this. I am stuck changing the HIRE_DATE field to a date type.
EMPLOYEE_ID,FIRST_NAME,LAST_NAME,EMAIL,PHONE_NUMBER,HIRE_DATE,JOB_ID,SALARY,COMMISSION_PCT,MANAGER_ID,DEPARTMENT_ID
100,Steven,King,SKING,515.123.4567,17-JUN-03,AD_PRES,24000, - , - ,90
101,Neena,Kochhar,NKOCHHAR,515.123.4568,21-SEP-05,AD_VP,17000, - ,100,90
And the code snippet:
val empData = sparkSession.read.option("header", "true").option("inferSchema", "true").
  csv(filePath)
empData.printSchema()
The printSchema output shows string for the HIRE_DATE field, but I am expecting a date type. How can I change it?

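Here is how I do it:
import java.text.SimpleDateFormat
import java.util.Locale
import org.apache.spark.sql.functions.udf
// Locale.ENGLISH so that month abbreviations such as JUN parse regardless of the JVM default locale
val dateFormat = new SimpleDateFormat("dd-MMM-yy", Locale.ENGLISH)
def convertStringToDate(stringDate: String) = {
  val parsed = dateFormat.parse(stringDate)
  new java.sql.Date(parsed.getTime())
}
val convertStringToDateUDF = udf(convertStringToDate _)
empData.withColumn("HIRE_DATE", convertStringToDateUDF($"HIRE_DATE"))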

Spark has its own Date type. A date string in the format "yyyy-MM-dd" can be converted to Spark's Date type, so all you have to do is bring the input date string into that format.
For date and time formatting, it is generally better to use the java.time libraries.
See below:
val df = spark.read.option("inferSchema",true).option("header", true).csv("in/emp2.txt")
def formatDate(x: String): String = {
  // "17-JUN-03" -> "17-Jun-03", which the MMM pattern can parse
  val y = x.toLowerCase.split('-').map(_.capitalize).mkString("-")
  val z = java.time.LocalDate.parse(y, java.time.format.DateTimeFormatter.ofPattern("dd-MMM-yy"))
  z.toString // ISO format: yyyy-MM-dd
}
val myudfDate = udf(formatDate(_: String): String)
// to_date turns the yyyy-MM-dd string into a DateType column
val df2 = df.withColumn("HIRE_DATE2", to_date(myudfDate($"HIRE_DATE")))
df2.show(false)
+-----------+----------+---------+--------+------------+---------+-------+------+--------------+----------+-------------+----------+
|EMPLOYEE_ID|FIRST_NAME|LAST_NAME|EMAIL |PHONE_NUMBER|HIRE_DATE|JOB_ID |SALARY|COMMISSION_PCT|MANAGER_ID|DEPARTMENT_ID|HIRE_DATE2|
+-----------+----------+---------+--------+------------+---------+-------+------+--------------+----------+-------------+----------+
|100 |Steven |King |SKING |515.123.4567|17-JUN-03|AD_PRES|24000 | - | - |90 |2003-06-17|
|101 |Neena |Kochhar |NKOCHHAR|515.123.4568|21-SEP-05|AD_VP |17000 | - |100 |90 |2005-09-21|
+-----------+----------+---------+--------+------------+---------+-------+------+--------------+----------+-------------+----------+
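As an alternative to lowercasing and capitalizing the month by hand, java.time can parse the text case-insensitively. The sketch below is plain Scala and assumes English month abbreviations; HireDateSketch and parseHireDate are hypothetical names used only for illustration.

```scala
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, DateTimeFormatterBuilder}
import java.util.Locale

object HireDateSketch {
  // parseCaseInsensitive lets "JUN", "Jun" and "jun" all match the MMM field
  private val formatter: DateTimeFormatter = new DateTimeFormatterBuilder()
    .parseCaseInsensitive()
    .appendPattern("dd-MMM-yy")
    .toFormatter(Locale.ENGLISH)

  // "17-JUN-03" -> LocalDate 2003-06-17 (two-digit years are read as 2000-2099)
  def parseHireDate(s: String): LocalDate = LocalDate.parse(s, formatter)
}
```

Calling toString on the result yields the yyyy-MM-dd form that Spark can convert to a date.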

Reading a full timestamp into a dataframe

I am trying to learn Spark and I am reading a dataframe with a timestamp column using the unix_timestamp function as below:
val columnName = "TIMESTAMPCOL"
val sequence = Seq("2016-01-20 12:05:06.999")
val dataframe = {
sequence.toDF(columnName)
}
val typeDataframe = dataframe.withColumn(columnName, org.apache.spark.sql.functions.unix_timestamp($"TIMESTAMPCOL"))
typeDataframe.show
This produces an output:
+------------+
|TIMESTAMPCOL|
+------------+
| 1453320306|
+------------+
How can I read it so that I don't lose the milliseconds, i.e. the .999 part? I tried using unix_timestamp(col: Column, s: String), where s is the SimpleDateFormat pattern, e.g. "yyyy-MM-dd hh:mm:ss", without any luck.

unix_timestamp returns whole seconds, so the milliseconds are always dropped. To retain them, use the "yyyy-MM-dd HH:mm:ss.SSS" format with date_format, as below:
val typeDataframe = dataframe.withColumn(columnName, org.apache.spark.sql.functions.date_format($"TIMESTAMPCOL","yyyy-MM-dd HH:mm:ss.SSS"))
typeDataframe.show
This will give you
+-----------------------+
|TIMESTAMPCOL |
+-----------------------+
|2016-01-20 12:05:06.999|
+-----------------------+
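The loss of precision can be seen in plain Scala as well. The sketch below uses java.time and UTC (an assumption; Spark uses the session time zone, which is why the number in the question differs by a whole number of hours). MillisSketch and its methods are hypothetical names.

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

object MillisSketch {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")

  // Whole seconds since the epoch: the .999 part cannot be represented
  def epochSeconds(s: String): Long =
    LocalDateTime.parse(s, fmt).toEpochSecond(ZoneOffset.UTC)

  // Milliseconds since the epoch: the .999 part is kept
  def epochMillis(s: String): Long =
    LocalDateTime.parse(s, fmt).toInstant(ZoneOffset.UTC).toEpochMilli

  // Parsing and formatting with the same .SSS pattern preserves the milliseconds
  def roundTrip(s: String): String =
    LocalDateTime.parse(s, fmt).format(fmt)
}
```

For "2016-01-20 12:05:06.999" this gives 1453291506 seconds and 1453291506999 milliseconds in UTC.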

How to add a new column with day of week based on another in dataframe?

I have a field in a DataFrame currently formatted as a string (mm/dd/yyyy), and I want to create a new column in that DataFrame with the day-of-week name (e.g. Thursday) for that field. I've imported
import com.github.nscala_time.time.Imports._
but I am not sure where to go from here.

Create a formatter:
val fmt = DateTimeFormat.forPattern("MM/dd/yyyy")
Parse the date:
val dt = fmt.parseDateTime("09/11/2015")
Get the day of the week (four or more E's give the full name):
dt.toString("EEEE")
Wrap it using org.apache.spark.sql.functions.udf and you have a complete solution. Still, there is no need for that, since HiveContext already provides all the required UDFs:
val df = sc.parallelize(Seq(
Tuple1("08/11/2015"), Tuple1("09/11/2015"), Tuple1("09/12/2015")
)).toDF("date_string")
df.registerTempTable("df")
sqlContext.sql(
"""SELECT date_string,
  from_unixtime(unix_timestamp(date_string,'MM/dd/yyyy'), 'EEEE') AS dow
FROM df"""
).show
// +-----------+--------+
// |date_string| dow|
// +-----------+--------+
// | 08/11/2015| Tuesday|
// | 09/11/2015| Friday|
// | 09/12/2015|Saturday|
// +-----------+--------+
EDIT:
Since Spark 1.5 you can use the from_unixtime and unix_timestamp functions directly:
import org.apache.spark.sql.functions.{from_unixtime, unix_timestamp}
df.select(from_unixtime(
  unix_timestamp($"date_string", "MM/dd/yyyy"), "EEEE").alias("dow"))
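The same parse-then-format idea can be sketched in plain Scala with java.time instead of Joda-Time. The names DayOfWeekSketch and dayName are made up for illustration, and Locale.ENGLISH is an assumption so that the day names come out in English regardless of the JVM default locale.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

object DayOfWeekSketch {
  // Input pattern: MM is the month, dd the day of the month
  private val in = DateTimeFormatter.ofPattern("MM/dd/yyyy")
  // EEEE is the full day-of-week name
  private val out = DateTimeFormatter.ofPattern("EEEE", Locale.ENGLISH)

  // "09/11/2015" -> "Friday"
  def dayName(s: String): String = LocalDate.parse(s, in).format(out)
}
```

The three sample dates from the answer map to Tuesday, Friday and Saturday, as in the table above.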
