How to convert a date to a Unix timestamp in milliseconds [duplicate] - scala

I am using Spark 2.1 with Scala.
How to convert a string column with milliseconds to a timestamp with milliseconds?
I tried the following code from the question Better way to convert a string field into timestamp in Spark
import org.apache.spark.sql.functions.unix_timestamp
val tdf = Seq((1L, "05/26/2016 01:01:01.601"), (2L, "#$####")).toDF("id", "dts")
val tts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss.SSS").cast("timestamp")
tdf.withColumn("ts", tts).show(2, false)
But I get the result without milliseconds:
+---+-----------------------+---------------------+
|id |dts |ts |
+---+-----------------------+---------------------+
|1 |05/26/2016 01:01:01.601|2016-05-26 01:01:01.0|
|2 |#$#### |null |
+---+-----------------------+---------------------+

A UDF with SimpleDateFormat works. The idea is taken from Ram Ghadiyaram's link to a UDF's logic.
import java.text.SimpleDateFormat
import java.sql.Timestamp
import org.apache.spark.sql.functions.udf
import scala.util.{Try, Success, Failure}
val getTimestamp: (String => Option[Timestamp]) = s => s match {
  case "" => None
  case _ => {
    val format = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss.SSS")
    Try(new Timestamp(format.parse(s).getTime)) match {
      case Success(t) => Some(t)
      case Failure(_) => None
    }
  }
}
val getTimestampUDF = udf(getTimestamp)
val tdf = Seq((1L, "05/26/2016 01:01:01.601"), (2L, "#$####")).toDF("id", "dts")
val tts = getTimestampUDF($"dts")
tdf.withColumn("ts", tts).show(2, false)
with output:
+---+-----------------------+-----------------------+
|id |dts |ts |
+---+-----------------------+-----------------------+
|1 |05/26/2016 01:01:01.601|2016-05-26 01:01:01.601|
|2 |#$#### |null |
+---+-----------------------+-----------------------+

There is an easier way than making a UDF. Just parse the millisecond data and add it to the unix timestamp (the following code works with pyspark and should be very close to the Scala equivalent):
timeFmt = "yyyy/MM/dd HH:mm:ss.SSS"
df = df.withColumn('ux_t', unix_timestamp(df.t, format=timeFmt) + substring(df.t, -3, 3).cast('float')/1000)
Result:
'2017/03/05 14:02:41.865' is converted to 1488722561.865
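A roughly equivalent Scala sketch (untested; it assumes the same DataFrame df with a string column t as in the PySpark snippet above) could be:
import org.apache.spark.sql.functions.{col, substring, unix_timestamp}

// Whole seconds from unix_timestamp, plus the last three characters of the
// string parsed as the fractional (millisecond) part.
val timeFmt = "yyyy/MM/dd HH:mm:ss.SSS"
val withUxT = df.withColumn(
  "ux_t",
  unix_timestamp(col("t"), timeFmt) + substring(col("t"), -3, 3).cast("double") / 1000
)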

import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
dataFrame.withColumn(
    "time_stamp",
    dataFrame.col("milliseconds_in_string")
        .cast(DataTypes.LongType)
        .cast(DataTypes.TimestampType)
)
The code is in Java and is easy to convert to Scala.
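A hedged Scala translation of the Java snippet might look like the sketch below. Note that Spark interprets a numeric cast to timestamp as seconds since the epoch, so the division by 1000 keeps the millisecond part as the fractional seconds:
import org.apache.spark.sql.functions.col

// Cast the string to long (milliseconds), convert to seconds, then to timestamp.
val withTimestamp = dataFrame.withColumn(
  "time_stamp",
  (col("milliseconds_in_string").cast("long") / 1000).cast("timestamp")
)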

Related

scala spark dataframe modify column with udf return value

I have a Spark DataFrame which has a timestamp field, and I want to convert it to the long data type. I used a UDF and the standalone code works fine, but when I plug it into a generic logic where any timestamp needs to be converted, I am not able to get it working. The issue is how I can assign the return value from the UDF back to the DataFrame column.
Below is the code snippet:
val spark: SparkSession = SparkSession.builder().master("local[*]").appName("Test3").getOrCreate();
import org.apache.spark.sql.functions._
val sqlContext = spark.sqlContext
val df2 = sqlContext.jsonRDD(spark.sparkContext.parallelize(Array(
  """{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": "","manufacture_ts":"2017-10-16 00:00:00"}""",
  """{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": "","manufacture_ts":"2017-10-16 00:00:00"}""",
  """{"year":2015, "make": "Chevy", "model": "Volt", "comment": "", "blank": "","manufacture_ts":"2017-10-16 00:00:00"}""")))
val convertTimeStamp = udf { (manTs: java.sql.Timestamp) =>
  manTs.getTime
}
df2.withColumn("manufacture_ts", convertTimeStamp(df2("manufacture_ts"))).show
+-----+----------+-----+--------------+-----+----+
|blank|   comment| make|manufacture_ts|model|year|
+-----+----------+-----+--------------+-----+----+
|     |No Comment|Tesla| 1508126400000|    S|2012|
|     |   Get one| Ford| 1508126400000| E350|1997|
|     |          |Chevy| 1508126400000| Volt|2015|
+-----+----------+-----+--------------+-----+----+
Now I want to invoke this from a DataFrame so it is called on all columns which are of type long.
object Test4 extends App {
  val spark: SparkSession = SparkSession.builder().master("local[*]").appName("Test").getOrCreate()
  import spark.implicits._
  import scala.collection.JavaConversions._

  val long: Long = "1508299200000".toLong
  val data = Seq(Row("10000020_LUX_OTC", long, "2020-02-14"))
  val schema = List(
    StructField("rowkey", StringType, true),
    StructField("order_receipt_dt", LongType, true),
    StructField("maturity_dt", StringType, true))
  val dataDF = spark.createDataFrame(spark.sparkContext.parallelize(data), StructType(schema))

  val modifedDf2 = schema.foldLeft(dataDF) { case (newDF, StructField(name, dataType, flag, metadata)) =>
    newDF.withColumn(name, DataTypeUtil.transformLong(newDF, name, dataType.typeName))
  }
  modifedDf2.show
}
val convertTimeStamp = udf { (manTs: java.sql.Timestamp) =>
  manTs.getTime
}
def transformLong(dataFrame: DataFrame, name: String, fieldType: String): Column = {
  import org.apache.spark.sql.functions._
  fieldType.toLowerCase match {
    case "timestamp" => convertTimeStamp(dataFrame(name))
    case _ => dataFrame.col(name)
  }
}
Maybe your UDF crashed because the timestamp is null. You can:
use unix_timestamp instead of a UDF, or make your UDF null-safe (see the sketch after the example below)
only apply it on fields which need to be converted.
Given the data:
import java.sql.Timestamp
import java.time.LocalDateTime
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType
val df = Seq(
  (1L, Timestamp.valueOf(LocalDateTime.now()), Timestamp.valueOf(LocalDateTime.now()))
).toDF("id", "ts1", "ts2")
you can do:
val newDF = df.schema.fields.filter(_.dataType == TimestampType).map(_.name)
  .foldLeft(df)((df, field) => df.withColumn(field, unix_timestamp(col(field))))
newDF.show()
which gives:
+---+----------+----------+
| id| ts1| ts2|
+---+----------+----------+
| 1|1589109282|1589109282|
+---+----------+----------+
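The other suggestion, a null-safe UDF, could look roughly like the sketch below (assuming the same convertTimeStamp use case as above): a null timestamp becomes None, which Spark writes back as null instead of throwing a NullPointerException.
import java.sql.Timestamp
import org.apache.spark.sql.functions.udf

// Null-safe variant: wrap the possibly-null input in Option before calling getTime.
val convertTimeStampSafe = udf { (manTs: Timestamp) =>
  Option(manTs).map(_.getTime)
}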

Spark (scala) change date in datetime column

pyspark change day in datetime column
I'm trying to do something similar to the answer above. I'm getting
value replace is not a member of java.sql.Timestamp
val changeDay = udf((date:java.sql.Timestamp) => {
val day = 1
date.replace(day=day)
})
val df2 = df1.withColumn("newDateTime", changeDay($"datetime"))
What I can't figure out is what functions are available for this java.sql.Timestamp object. When I google it, it almost seems like the answers are not related to the same type.
You can convert Timestamp to java.time's LocalDateTime and change its day value via withDayOfMonth(day), as shown below:
import java.sql.Timestamp
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
  (1, Timestamp.valueOf("2019-03-07 12:30:00")),
  (2, Timestamp.valueOf("2019-04-08 09:00:00"))
).toDF("id", "ts")
def changeDay(day: Int) = udf { (ts: Timestamp) =>
  import java.time.LocalDateTime
  val changedTS = ts.toLocalDateTime.withDayOfMonth(day)
  Timestamp.valueOf(changedTS)
}
df.withColumn("newTS", changeDay(1)($"ts")).show
// +---+-------------------+-------------------+
// | id| ts| newTS|
// +---+-------------------+-------------------+
// | 1|2019-03-07 12:30:00|2019-03-01 12:30:00|
// | 2|2019-04-08 09:00:00|2019-04-01 09:00:00|
// +---+-------------------+-------------------+
This probably isn't the best way to do it, but here is one way:
val DateTimeString = date.toString()
val DTday = "2019-03-01" // illustrative: the date part with the desired day substituted in
val DTtime = DateTimeString.split(" ")(1)
DTday + " " + DTtime

Conditional aggregation in scala based on data type

How do you aggregate dynamically in scala spark based on data types?
For example:
SELECT ID, SUM(when DOUBLE type)
, APPEND(when STRING), MAX(when BOOLEAN)
from tbl
GROUP BY ID
Sample data
You can do this by getting the runtime schema and matching on the data type. For example:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._
val df = Seq(
  (1, 1.0, true, "a"),
  (1, 2.0, false, "b")
).toDF("id", "d", "b", "s")
val dataTypes: Map[String, DataType] = df.schema.map(sf => (sf.name, sf.dataType)).toMap
def genericAgg(c: String) = {
  dataTypes(c) match {
    case DoubleType => sum(col(c))
    case StringType => concat_ws(",", collect_list(col(c))) // "append"
    case BooleanType => max(col(c))
  }
}
val aggExprs: Seq[Column] = df.columns.filterNot(_ == "id") // use all columns except the grouping key
  .map(c => genericAgg(c))
df
  .groupBy("id")
  .agg(aggExprs.head, aggExprs.tail: _*)
  .show()
gives
+---+------+------+-----------------------------+
| id|sum(d)|max(b)|concat_ws(,, collect_list(s))|
+---+------+------+-----------------------------+
| 1| 3.0| true| a,b|
+---+------+------+-----------------------------+
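One caveat: the match in genericAgg is not exhaustive, so a column of any other type would fail with a scala.MatchError. A hedged extension might add a default case; the first() fallback below is only an illustrative choice:
import org.apache.spark.sql.functions.first

// Same mapping as genericAgg, plus an illustrative fallback for unhandled types.
def genericAggSafe(c: String) = dataTypes(c) match {
  case DoubleType  => sum(col(c))
  case StringType  => concat_ws(",", collect_list(col(c)))
  case BooleanType => max(col(c))
  case _           => first(col(c)) // assumption: keep the first value for other types
}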

Extract columns from unordered data in scala spark

I am learning Scala-Spark and want to know how we can extract the required columns from unordered data, based on column name. Details below:
Input Data: RDD[Array[String]]
id=1,country=USA,age=20,name=abc
name=def,country=USA,id=2,age=30
name=ghi,id=3,age=40,country=USA
Required Output:
Name,id
abc,1
def,2
ghi,3
Any help would be much appreciated. Thanks in advance!
If you have an RDD of these lines (RDD[String], e.g. from textFile), you can get the desired data as follows.
Define a case class:
case class Data(Name: String, Id: Long)
Then parse each line into the case class:
val df = rdd.map { row =>
  // split the line and convert it to a map so you can extract the data by field name
  val data = row.split(",").map(x => (x.split("=")(0), x.split("=")(1))).toMap
  Data(data("name"), data("id").toLong)
}
Convert to a DataFrame and display it:
df.toDF().show(false)
Output:
+----+---+
|Name|Id |
+----+---+
|abc |1 |
|def |2 |
|ghi |3 |
+----+---+
Here is the full code to read the file:
import org.apache.spark.sql.SparkSession

case class Data(Name: String, Id: Long)

def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder().appName("xyz").master("local[*]").getOrCreate()
  import spark.implicits._
  val rdd = spark.sparkContext.textFile("path to file ")
  val df = rdd.map { row =>
    val data = row.split(",").map(x => (x.split("=")(0), x.split("=")(1))).toMap
    Data(data("name"), data("id").toLong)
  }
  df.toDF().show(false)
}
