Spark (Scala) change date in datetime column

pyspark change day in datetime column
I'm trying to do something similar to the answer above, but in Scala. I'm getting:
value replace is not a member of java.sql.Timestamp
val changeDay = udf((date:java.sql.Timestamp) => {
val day = 1
date.replace(day=day)
})
val df2 = df1.withColumn("newDateTime", changeDay($"datetime"))
What I can't figure out is which methods are available on this java.sql.Timestamp object. When I google it, the answers almost never seem to be about the same type.

You can convert Timestamp to java.time's LocalDateTime and change its day value via withDayOfMonth(day), as shown below:
import java.sql.Timestamp
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
(1, Timestamp.valueOf("2019-03-07 12:30:00")),
(2, Timestamp.valueOf("2019-04-08 09:00:00"))
).toDF("id", "ts")
def changeDay(day: Int) = udf{ (ts: Timestamp) =>
import java.time.LocalDateTime
val changedTS = ts.toLocalDateTime.withDayOfMonth(day)
Timestamp.valueOf(changedTS)
}
df.withColumn("newTS", changeDay(1)($"ts")).show
// +---+-------------------+-------------------+
// | id| ts| newTS|
// +---+-------------------+-------------------+
// | 1|2019-03-07 12:30:00|2019-03-01 12:30:00|
// | 2|2019-04-08 09:00:00|2019-04-01 09:00:00|
// +---+-------------------+-------------------+
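If the ts column can contain nulls, the UDF above will throw a NullPointerException. A minimal null-safe sketch of the same idea, wrapping the input in Option:
def changeDaySafe(day: Int) = udf { (ts: Timestamp) =>
  Option(ts)
    .map(t => Timestamp.valueOf(t.toLocalDateTime.withDayOfMonth(day)))
    .orNull
}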

So this probably isn't the best way to do it, but here is one way: convert the timestamp to a string, keep the time part, and rebuild the string with the new date part.
val DateTimeString = date.toString()          // e.g. "2019-03-07 12:30:00.0"
val DTtime = DateTimeString.split(" ")(1)     // the time part, e.g. "12:30:00.0"
val DTday = "2019-03-01"                      // the new date part you want (example value)
DTday + " " + DTtime
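Wrapped in a UDF, that string-splitting idea might look like this rough sketch (forcing the day to 01 purely as an example):
import org.apache.spark.sql.functions.udf

val replaceDayByString = udf { (ts: java.sql.Timestamp) =>
  // Timestamp.toString has the form "yyyy-MM-dd HH:mm:ss[.f...]"
  val Array(datePart, timePart) = ts.toString.split(" ")
  val newDatePart = datePart.take(8) + "01"   // keep year-month, force day 01
  java.sql.Timestamp.valueOf(newDatePart + " " + timePart)
}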

Related

Update date format in spark dataframe for multiple spark columns

I have a Spark DataFrame where a few columns have different date formats.
To handle this, I have written the code below to keep a consistent format across all the date columns.
Since the date format may change each time, I have defined a set of candidate formats in dt_formats.
def to_timestamp_multiple(s: Column, formats: Seq[String]): Column = {
coalesce(formats.map(fmt => to_timestamp(s, fmt)):_*)
}
val dt_formats= Seq("dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd","MM/dd/yy","dd-MM-yy","dd-MM-yyyy","yyyy/MM/dd","dd/MM/yyyy")
val newDF = df.withColumn("ETD1", date_format(to_timestamp_multiple($"ETD",Seq("dd-MMM-yyyy", dt_formats)).cast("date"), "yyyy-MM-dd")).drop("ETD").withColumnRenamed("ETD1","ETD")
But here I have to create a new column, then drop the old column, then rename the new one,
which makes the code unnecessarily clumsy, so I want to avoid that.
I am trying to implement similar functionality with the Scala function below, but it throws org.apache.spark.sql.catalyst.parser.ParseException, and I am unable to identify what change I should make to get it working.
val CleansedData= rawDF.selectExpr(rawDF.columns.map(
x => { x match {
case "ETA" => s"""date_format(to_timestamp_multiple($x, dt_formats).cast("date"), "yyyy-MM-dd") as ETA"""
case _ => x
} } ) : _*)
Hence seeking help.
Thanks in advance.
Create a UDF in order to use with select. The select method takes columns and produces another DataFrame.
Also, instead of using coalesce, it might be more straightforward simply to build a parser that handles all of the formats. You can use DateTimeFormatterBuilder for this.
import java.time.format.DateTimeFormatter
import java.time.format.DateTimeFormatterBuilder
import java.time.LocalDate
import java.sql.Date
import scala.util.Try
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._
val dtFormatStrings:Seq[String] = Seq("dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd","MM/dd/yy","dd-MM-yy","dd-MM-yyyy","yyyy/MM/dd","dd/MM/yyyy")
// use foldLeft with appendOptional method, which for each format,
// returns a new builder with that additional possible format
val initBuilder = new DateTimeFormatterBuilder()
val builder: DateTimeFormatterBuilder = dtFormatStrings.foldLeft(initBuilder)(
(b: DateTimeFormatterBuilder, s:String) => b.appendOptional(DateTimeFormatter.ofPattern(s)))
val formatter = builder.toFormatter()
// Create the UDF from any function that returns a
// SQL-compatible type (java.sql.Date, here)
def toTimeStamp2(dateString:String): Date = {
val dateTry: Try[Date] = Try(java.sql.Date.valueOf(LocalDate.parse(dateString, formatter)))
dateTry.toOption.getOrElse(null)
}
val timeConversionUdf = udf(toTimeStamp2 _)
// example DF and new DF
val df = Seq(("05/08/20"), ("2020-04-03"), ("unparseable")).toDF("ETD")
df.select(timeConversionUdf(col("ETD"))).toDF("ETD2").show
Output:
+----------+
| ETD2|
+----------+
|2020-05-08|
|2020-04-03|
| null|
+----------+
Note that unparseable values end up null, as shown.
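If you prefer the selectExpr style from the question, the same function can be registered under a SQL name first (the name to_date_multi below is just an example):
spark.udf.register("to_date_multi", toTimeStamp2 _)
val cleansedDF = df.selectExpr(df.columns.map {
  case "ETD" => "to_date_multi(ETD) as ETD"
  case other => other
}: _*)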
Try withColumn(...) with the same column name and coalesce, as below:
val dt_formats= Seq("dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd","MM/dd/yy","dd-MM-yy","dd-MM-yyyy","yyyy/MM/dd","dd/MM/yyyy")
val newDF = df.withColumn("ETD", coalesce(dt_formats.map(fmt => to_date($"ETD", fmt)):_*))

scala spark dataframe modify column with udf return value

I have a Spark DataFrame which has a timestamp field, and I want to convert it to the long datatype. I used a UDF and the standalone code works fine, but when I plug it into a generic piece of logic where any timestamp needs to be converted, I am not able to get it working. The issue is: how can I assign the return value from the UDF back to the DataFrame column?
Below is the code snippet
val spark: SparkSession = SparkSession.builder().master("local[*]").appName("Test3").getOrCreate();
import org.apache.spark.sql.functions._
val sqlContext = spark.sqlContext
val df2 = sqlContext.jsonRDD(spark.sparkContext.parallelize(Array(
  """{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": "","manufacture_ts":"2017-10-16 00:00:00"}""",
  """{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": "","manufacture_ts":"2017-10-16 00:00:00"}"""
)))
val convertTimeStamp = udf { (manTs: java.sql.Timestamp) =>
  manTs.getTime
}
df2.withColumn("manufacture_ts", convertTimeStamp(df2("manufacture_ts"))).show
+-----+----------+-----+--------------+-----+----+
|blank|   comment| make|manufacture_ts|model|year|
+-----+----------+-----+--------------+-----+----+
|     |No Comment|Tesla| 1508126400000|    S|2012|
|     |   Get one| Ford| 1508126400000| E350|1997|
|     |          |Chevy| 1508126400000| Volt|2015|
+-----+----------+-----+--------------+-----+----+
Now I want to invoke this from a DataFrame so that it is called on all columns which need to be converted to long.
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object Test4 extends App {
  val spark: SparkSession = SparkSession.builder().master("local[*]").appName("Test").getOrCreate()
  import spark.implicits._
  import scala.collection.JavaConversions._
  val long: Long = "1508299200000".toLong
  val data = Seq(Row("10000020_LUX_OTC", long, "2020-02-14"))
  val schema = List(StructField("rowkey", StringType, true),
    StructField("order_receipt_dt", LongType, true),
    StructField("maturity_dt", StringType, true))
  val dataDF = spark.createDataFrame(spark.sparkContext.parallelize(data), StructType(schema))
  val modifedDf2 = schema.foldLeft(dataDF) { case (newDF, StructField(name, dataType, flag, metadata)) =>
    newDF.withColumn(name, DataTypeUtil.transformLong(newDF, name, dataType.typeName))
  }
  modifedDf2.show
}
val convertTimeStamp = udf { (manTs :java.sql.Timestamp) =>
manTs.getTime
}
def transformLong(dataFrame: DataFrame,name:String, fieldType:String):Column = {
import org.apache.spark.sql.functions._
fieldType.toLowerCase match {
case "timestamp" => convertTimeStamp(dataFrame(name))
case _ => dataFrame.col(name)
}
}
Maybe your UDF crashes when the timestamp is null. You can do the following:
Use unix_timestamp instead of a UDF, or make your UDF null-safe.
Only apply it to the fields which actually need to be converted.
Given the data:
import java.sql.Timestamp
import java.time.LocalDateTime
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType
val df = Seq(
  (1L, Timestamp.valueOf(LocalDateTime.now()), Timestamp.valueOf(LocalDateTime.now()))
).toDF("id", "ts1", "ts2")
you can do:
val newDF = df.schema.fields.filter(_.dataType == TimestampType).map(_.name)
.foldLeft(df)((df,field) => df.withColumn(field,unix_timestamp(col(field))))
newDF.show()
which gives:
+---+----------+----------+
| id| ts1| ts2|
+---+----------+----------+
| 1|1589109282|1589109282|
+---+----------+----------+
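For completeness, a null-safe version of the original UDF (the second suggestion above) could look like this sketch, returning an Option so null timestamps come through as null:
val convertTimeStampSafe = udf { (manTs: java.sql.Timestamp) =>
  Option(manTs).map(_.getTime)   // null input -> null output instead of an NPE
}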

spark Scala RDD to DataFrame Date format

Would you be able to help with this Spark problem statement?
Data -
empno|ename|designation|manager|hire_date|sal|deptno
7369|SMITH|CLERK|9902|2010-12-17|800.00|20
7499|ALLEN|SALESMAN|9698|2011-02-20|1600.00|30
Code:
val rawrdd = spark.sparkContext.textFile("C:\\Users\\cmohamma\\data\\delta scenarios\\emp_20191010.txt")
val refinedRDD = rawrdd.map(lines => {
  val fields = lines.split("\\|")
  (fields(0).toInt, fields(1), fields(2), fields(3).toInt, fields(4).toDate, fields(5).toFloat, fields(6).toInt)
})
Problem statement: fields(4).toDate is not working. What is the alternative, or what is the correct usage?
What have I tried?
1. Tried replacing it with to_date(col(fields(4)), "yyy-MM-dd") - not working.
2.
Step 1.
val refinedRDD = rawrdd.map( lines => {
val fields = lines.split("\\|")
(fields(0),fields(1),fields(2),fields(3),fields(4),fields(5),fields(6))
})
Now these tuple elements are all strings.
Step 2.
val mySchema = StructType(Seq(
  StructField("empno", IntegerType, true), StructField("ename", StringType, true),
  StructField("designation", StringType, true), StructField("manager", IntegerType, true),
  StructField("hire_date", DateType, true), StructField("sal", DoubleType, true),
  StructField("deptno", IntegerType, true)))
Step 3. converting the string tuples to Rows
val rowRDD = refinedRDD.map(attributes => Row(attributes._1, attributes._2, attributes._3, attributes._4, attributes._5 , attributes._6, attributes._7))
Step 4.
val empDF = spark.createDataFrame(rowRDD, mySchema)
This is also not working and gives errors related to types. To solve this, I changed step 1 to:
(fields(0).toInt,fields(1),fields(2),fields(3).toInt,fields(4),fields(5).toFloat,fields(6).toInt)
Now this gives an error for the date-type column, and I am back at the main problem.
Use case: use the textFile API and convert the result to a DataFrame using a custom schema (StructType) on top of it.
This can be done using a case class, but with a case class I would also be stuck where I need to do fields(4).toDate (I know I can cast the string to a date later in the code, but I want to know if a solution to the above problem is possible).
You can use the following code snippet
import org.apache.spark.sql.functions.to_timestamp
scala> val df = spark.read.format("csv").option("header", "true").option("delimiter", "|").load("gs://otif-etl-input/test.csv")
df: org.apache.spark.sql.DataFrame = [empno: string, ename: string ... 5 more fields]
scala> val ts = to_timestamp($"hire_date", "yyyy-MM-dd")
ts: org.apache.spark.sql.Column = to_timestamp(`hire_date`, 'yyyy-MM-dd')
scala> val enriched_df = df.withColumn("ts", ts).show(2, false)
+-----+-----+-----------+-------+----------+-------+----------+-------------------+
|empno|ename|designation|manager|hire_date |sal |deptno |ts |
+-----+-----+-----------+-------+----------+-------+----------+-------------------+
|7369 |SMITH|CLERK |9902 |2010-12-17|800.00 |20 |2010-12-17 00:00:00|
|7499 |ALLEN|SALESMAN |9698 |2011-02-20|1600.00|30 |2011-02-20 00:00:00|
+-----+-----+-----------+-------+----------+-------+----------+-------------------+
enriched_df: Unit = ()
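Note that show returns Unit (hence enriched_df: Unit = () above), so if you want to keep working with the enriched DataFrame, assign it before showing:
val enriched_df = df.withColumn("ts", ts)
enriched_df.show(2, false)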
There are multiple ways to cast your data to proper data types.
First: use inferSchema
val df = spark.read.option("delimiter", "|").option("header", true).option("inferSchema", "true").csv(path)
df.printSchema
Sometimes it doesn't work as expected (see details here).
Second: provide your own datatype conversion template
val rawDF = Seq(("7369", "SMITH" , "2010-12-17", "800.00"), ("7499", "ALLEN","2011-02-20", "1600.00")).toDF("empno", "ename","hire_date", "sal")
//define schema in DF , hire_date as Date
val schemaDF = Seq(("empno", "INT"), ("ename", "STRING"), ("hire_date", "date"), ("sal", "double")).toDF("columnName", "columnType")
rawDF.printSchema
//fetch schema details
val dataTypes = schemaDF.select("columnName", "columnType")
val listOfElements = dataTypes.collect.map(_.toSeq.toList)
//creating a map friendly template
val validationTemplate = (c: Any, t: Any) => {
val column = c.asInstanceOf[String]
val typ = t.asInstanceOf[String]
col(column).cast(typ)
}
//Apply datatype conversion template on rawDF
val convertedDF = rawDF.select(listOfElements.map(element => validationTemplate(element(0), element(1))): _*)
println("Conversion done!")
convertedDF.show()
convertedDF.printSchema
Third: use a case class
Create schema from caseclass with ScalaReflection and provide this customized schema while loading DF.
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types._
import java.sql.Date
case class MySchema(empno: Int, ename: String, hire_date: Date, sal: Double)
val schema = ScalaReflection.schemaFor[MySchema].dataType.asInstanceOf[StructType]
val rawDF = spark.read.schema(schema).option("header", "true").option("delimiter", "|").csv(path)
rawDF.printSchema
Hope this will help.

How to convert a string column with milliseconds to a timestamp with milliseconds in Spark 2.1 using Scala?

I am using Spark 2.1 with Scala.
How to convert a string column with milliseconds to a timestamp with milliseconds?
I tried the following code from the question Better way to convert a string field into timestamp in Spark
import org.apache.spark.sql.functions.unix_timestamp
val tdf = Seq((1L, "05/26/2016 01:01:01.601"), (2L, "#$####")).toDF("id", "dts")
val tts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss.SSS").cast("timestamp")
tdf.withColumn("ts", tts).show(2, false)
But I get the result without milliseconds:
+---+-----------------------+---------------------+
|id |dts |ts |
+---+-----------------------+---------------------+
|1 |05/26/2016 01:01:01.601|2016-05-26 01:01:01.0|
|2 |#$#### |null |
+---+-----------------------+---------------------+
A UDF with SimpleDateFormat works. The idea is taken from Ram Ghadiyaram's link to a UDF-based approach.
import java.text.SimpleDateFormat
import java.sql.Timestamp
import org.apache.spark.sql.functions.udf
import scala.util.{Try, Success, Failure}
val getTimestamp: (String => Option[Timestamp]) = s => s match {
case "" => None
case _ => {
val format = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss.SSS")
Try(new Timestamp(format.parse(s).getTime)) match {
case Success(t) => Some(t)
case Failure(_) => None
}
}
}
val getTimestampUDF = udf(getTimestamp)
val tdf = Seq((1L, "05/26/2016 01:01:01.601"), (2L, "#$####")).toDF("id", "dts")
val tts = getTimestampUDF($"dts")
tdf.withColumn("ts", tts).show(2, false)
with output:
+---+-----------------------+-----------------------+
|id |dts |ts |
+---+-----------------------+-----------------------+
|1 |05/26/2016 01:01:01.601|2016-05-26 01:01:01.601|
|2 |#$#### |null |
+---+-----------------------+-----------------------+
There is an easier way than writing a UDF. Just parse the millisecond part and add it to the unix timestamp (the following code is PySpark, but should be very close to the Scala equivalent):
timeFmt = "yyyy/MM/dd HH:mm:ss.SSS"
df = df.withColumn('ux_t', unix_timestamp(df.t, format=timeFmt) + substring(df.t, -3, 3).cast('float')/1000)
Result:
'2017/03/05 14:02:41.865' is converted to 1488722561.865
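A rough Scala equivalent of that idea, assuming the string column is dts in the MM/dd/yyyy HH:mm:ss.SSS format used earlier in this thread:
import org.apache.spark.sql.functions.{col, substring, unix_timestamp}

val withMillis = tdf.withColumn(
  "ux_t",
  unix_timestamp(col("dts"), "MM/dd/yyyy HH:mm:ss.SSS") +
    substring(col("dts"), -3, 3).cast("float") / 1000
)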
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
dataFrame.withColumn(
"time_stamp",
dataFrame.col("milliseconds_in_string")
.cast(DataTypes.LongType)
.cast(DataTypes.TimestampType)
)
The code is in Java and is easy to convert to Scala.
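A direct Scala transliteration would be something like the sketch below; note that Spark's numeric-to-timestamp cast interprets the value as seconds, so a true milliseconds value needs dividing by 1000 first:
import org.apache.spark.sql.types.DataTypes

val withTimestamp = dataFrame.withColumn(
  "time_stamp",
  (dataFrame.col("milliseconds_in_string").cast(DataTypes.LongType) / 1000)
    .cast(DataTypes.TimestampType)
)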