Spark Scala: convert long to timestamp with milliseconds in a Parquet dataframe

Could someone please guide me on how to convert a long to a timestamp with milliseconds?
I know how to get the yyyy-MM-dd HH:mm:ss format,
but I would like the milliseconds as well: yyyy-MM-dd HH:mm:ss.SSS.
My Parquet structure is like this:
|-- header: struct (nullable = true)
| |-- time: long (nullable = true)
...
One sample value for time is 1600676073054:
Scala
scala> spark.sql("select from_unixtime(word) as ts, word from tmp_1").show(false)
+--------------------+-------------+
|ts |word |
+--------------------+-------------+
|52693-05-28 18:30:54|1600676073054|
+--------------------+-------------+
scala> spark.sql("select from_unixtime(word/1000) as ts, word from tmp_1").show(false)
+-------------------+-------------+
|ts |word |
+-------------------+-------------+
|2020-09-21 16:14:33|1600676073054|
+-------------------+-------------+
scala> spark.sql("select from_unixtime(word) as ts, word from tmp_1").show(false)
+--------------------+-------------+
|ts |word |
+--------------------+-------------+
|52693-05-28 18:30:54|1600676073054|
+--------------------+-------------+
SQL Server
declare @StartDate datetime2(3) = '1970-01-01 00:00:00.000'
    , @milliseconds bigint = 1600676073054
    , @MillisecondsPerDay int = 60 * 60 * 24 * 1000 -- = 86400000
SELECT DATEADD(MILLISECOND, TRY_CAST(@milliseconds % @MillisecondsPerDay AS INT), DATEADD(DAY, TRY_CAST(@milliseconds / @MillisecondsPerDay AS INT), @StartDate));
--2020-09-21 08:14:33.054
I would like to know how to get the 054 milliseconds into the converted timestamp in Spark.
Thanks.

Spark's from_unixtime does not work with epoch milliseconds, so you need to divide the value by 1000 and cast the result to a timestamp; the fractional part is kept as milliseconds.
import org.apache.spark.sql.functions.col

val df = spark.createDataFrame(
  Seq(
    (1, "1600676073054")
  )
).toDF("id", "long_timestamp")

df.withColumn(
  "timestamp_mili",
  (col("long_timestamp") / 1000).cast("timestamp")
).show(false)
//+---+--------------+-----------------------+
//|id |long_timestamp|timestamp_mili |
//+---+--------------+-----------------------+
//|1 |1600676073054 |2020-09-21 08:14:33.054|
//+---+--------------+-----------------------+
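The same division-and-cast also works directly in Spark SQL against the tmp_1 view from the question; a minimal sketch (the displayed hour depends on your session time zone):
spark.sql("select cast(word / 1000 as timestamp) as ts, word from tmp_1").show(false)
This should print the same 2020-09-21 08:14:33.054 value as the DataFrame version above.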

Related

Convert seconds to hhmmss Spark

I have a UDF that creates a timestamp out of 2 field values, a date and a time. However, the time field holds seconds since midnight.
So how can I merge the two fields (a date and seconds) into a single Unix timestamp (the event_hour column)?
My current implementation looks like this:
private val unix_epoch = udf[Long, String, String] { (date, time) =>
  deltaDateFormatter.parseDateTime(s"$date $formatted").getSeconds
}

def transform(inputDf: DataFrame): Unit = {
  inputDf
    .withColumn("event_hour", unix_epoch($"event_date", $"event_time"))
    .withColumn("event_ts", from_unixtime($"event_hour").cast(TimestampType))
}
Input data:
event_date,event_time
20170501,87721
20170501,87728
20170501,87721
20170501,87726
Desired output:
event_tmstp, event_hour
2017-05-01 00:22:01,1493598121
2017-05-01 00:22:08,1493598128
2017-05-01 00:22:01,1493598121
2017-05-01 00:22:06,1493598126
Update: data schema:
event_date: string (nullable = true)
event_time: integer (nullable = true)
Cast event_date to a unix timestamp, add the event_time column to get event_hour, and convert back to normal timestamp event_tmstp.
PS: I'm not sure why event_time is 86400 seconds (1 day) too high; I had to subtract that to get your expected output.
val df = Seq(
  ("20170501", 87721),
  ("20170501", 87728),
  ("20170501", 87721),
  ("20170501", 87726)
).toDF("event_date", "event_time")

val df2 = df.select(
  unix_timestamp(to_date($"event_date", "yyyyMMdd")) + $"event_time" - 86400
).toDF("event_hour").select(
  $"event_hour".cast("timestamp").as("event_tmstp"),
  $"event_hour"
)
df2.show
+-------------------+----------+
| event_tmstp|event_hour|
+-------------------+----------+
|2017-05-01 00:22:01|1493598121|
|2017-05-01 00:22:08|1493598128|
|2017-05-01 00:22:01|1493598121|
|2017-05-01 00:22:06|1493598126|
+-------------------+----------+
Check whether the code below helps; it avoids a UDF entirely.
val df = Seq(
  (20170501, 87721),
  (20170501, 87728),
  (20170501, 87721),
  (20170501, 87726)
).toDF("date", "time")

df
  .withColumn("date",
    to_date(
      unix_timestamp($"date".cast("string"), "yyyyMMdd").cast("timestamp")
    )
  )
  .withColumn(
    "event_hour",
    unix_timestamp(
      concat_ws(
        " ",
        $"date",
        from_unixtime($"time", "HH:mm:ss.S")
      ).cast("timestamp")
    )
  )
  .withColumn(
    "event_ts",
    from_unixtime($"event_hour")
  )
  .show(false)
+----------+-----+----------+-------------------+
|date |time |event_hour|event_ts |
+----------+-----+----------+-------------------+
|2017-05-01|87721|1493598121|2017-05-01 00:22:01|
|2017-05-01|87728|1493598128|2017-05-01 00:22:08|
|2017-05-01|87721|1493598121|2017-05-01 00:22:01|
|2017-05-01|87726|1493598126|2017-05-01 00:22:06|
+----------+-----+----------+-------------------+

Spark fails to convert String to TIMESTAMP

I have a Hive table that contains a String column; this is an example:
| DT |
|-------------------------------|
| 2019-05-07 00:03:53.837000000 |
When I try to import the table into a Spark Scala DataFrame, transforming the String to a timestamp, I only get null values:
val df = spark.sql(s"""select to_timestamp(dt_maj, 'yyyy-MM-dd HH:mm:ss.SSS') from ${use_database}.pz_send_demande_diffusion""").show()
| DT |
|------|
| null |
Doing
val df = spark.sql(s"""select dt from ${use_database}.pz_send_demande_diffusion""").show()
gives a good result (a column with the String values), so Spark is importing the column normally.
I also tried:
val df = spark.sql(s"""select to_timestamp('2005-05-04 11:12:54.297', 'yyyy-MM-dd HH:mm:ss.SSS') from ${use_database}.pz_send_demande_diffusion""").show()
And it worked! It returns a TIMESTAMP column.
What is the problem?
Trim your extra 0s. Then,
df.withColumn("new", to_timestamp($"date".substr(lit(1),length($"date") - 6), "yyyy-MM-dd HH:mm:ss.SSS")).show(false)
the result is:
+-----------------------------+-------------------+
|date |new |
+-----------------------------+-------------------+
|2019-05-07 00:03:53.837000000|2019-05-07 00:03:53|
+-----------------------------+-------------------+
The schema:
root
|-- date: string (nullable = true)
|-- new: timestamp (nullable = true)
I think you should use the format yyyy-MM-dd HH:mm:ss.SSSSSSSSS for this type of data: 2019-05-07 00:03:53.837000000.
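For example, on Spark 3.x (whose datetime parser accepts up to nine 'S' fraction digits) a minimal sketch along these lines should keep the fractional part, assuming the column is a string named dt; on Spark 2.x the legacy SimpleDateFormat parser treats the fraction differently, so this pattern may not parse correctly there:
import spark.implicits._
import org.apache.spark.sql.functions.to_timestamp

val sample = Seq("2019-05-07 00:03:53.837000000").toDF("dt")
sample.select(to_timestamp($"dt", "yyyy-MM-dd HH:mm:ss.SSSSSSSSS").as("ts")).show(false)
// expected: 2019-05-07 00:03:53.837 (Spark timestamps keep at most microsecond precision)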

Spark scala - calculating dynamic timestamp interval

I have a dataframe with a timestamp column (timestamp type) called "maxTmstmp" and another column with hours, represented as integers, called "WindowHours". I would like to dynamically subtract the hours column from the timestamp column to get the lower timestamp.
My data and desired effect ("minTmstmp" column):
+-----------+-------------------+-------------------+
|WindowHours| maxTmstmp| minTmstmp|
| | |(maxTmstmp - Hours)|
+-----------+-------------------+-------------------+
| 1|2016-01-01 23:00:00|2016-01-01 22:00:00|
| 2|2016-03-01 12:00:00|2016-03-01 10:00:00|
| 8|2016-03-05 20:00:00|2016-03-05 12:00:00|
| 24|2016-04-12 11:00:00|2016-04-11 11:00:00|
+-----------+-------------------+-------------------+
root
|-- WindowHours: integer (nullable = true)
|-- maxTmstmp: timestamp (nullable = true)
I have already found a solution using an hours interval expression, but it isn't dynamic. The code below doesn't work as intended:
standards
  .withColumn("minTmstmp", $"maxTmstmp" - expr("INTERVAL 10 HOURS"))
  .show()
I am operating on Spark 2.4 and Scala.
One simple way would be to convert maxTmstmp to unix time, subtract the value of WindowHours in seconds from it, and convert the result back to Spark Timestamp, as shown below:
import java.sql.Timestamp
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, Timestamp.valueOf("2016-01-01 23:00:00")),
  (2, Timestamp.valueOf("2016-03-01 12:00:00")),
  (8, Timestamp.valueOf("2016-03-05 20:00:00")),
  (24, Timestamp.valueOf("2016-04-12 11:00:00"))
).toDF("WindowHours", "maxTmstmp")

df.withColumn("minTmstmp",
  from_unixtime(unix_timestamp($"maxTmstmp") - ($"WindowHours" * 3600))
).show
// +-----------+-------------------+-------------------+
// |WindowHours| maxTmstmp| minTmstmp|
// +-----------+-------------------+-------------------+
// | 1|2016-01-01 23:00:00|2016-01-01 22:00:00|
// | 2|2016-03-01 12:00:00|2016-03-01 10:00:00|
// | 8|2016-03-05 20:00:00|2016-03-05 12:00:00|
// | 24|2016-04-12 11:00:00|2016-04-11 11:00:00|
// +-----------+-------------------+-------------------+
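As a side note, if you later move past Spark 2.4, Spark 3.0 adds the make_interval SQL function, which builds the interval directly from the WindowHours column and keeps the result a real timestamp; a rough sketch (not applicable to Spark 2.4):
df.withColumn(
  "minTmstmp",
  expr("maxTmstmp - make_interval(0, 0, 0, 0, WindowHours, 0, 0)")
).show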

Converting String to Date in Spark Dataframe

I have a dataframe (df1) with 2 StringType fields.
Field1 (StringType) Value-X
Field2 (StringType) value-20180101
All I am trying to do is create another dataframe (df2) from df1 with 2 fields-
Field1 (StringType) Value-X
Field2 (Date Type) Value-2018-01-01
I am using the code below:
df2=df1.select(
col("field1").alias("f1"),
unix_timestamp(col("field2"),"yyyyMMdd").alias("f2")
)
df2.show
df2.printSchema
For field 2, I tried multiple things (unix_timestamp, from_unixtimestamp, to_date, cast("date")) but nothing worked.
I need the following schema as output:
df2.printSchema
|-- f1: string (nullable = false)
|-- f2: date (nullable = false)
I'm using Spark 2.1
to_date seems to work fine for what you need:
import org.apache.spark.sql.functions._
val df1 = Seq( ("X", "20180101"), ("Y", "20180406") ).toDF("c1", "c2")
val df2 = df1.withColumn("c2", to_date($"c2", "yyyyMMdd"))
df2.show
// +---+----------+
// | c1| c2|
// +---+----------+
// | X|2018-01-01|
// | Y|2018-04-06|
// +---+----------+
df2.printSchema
// root
// |-- c1: string (nullable = true)
// |-- c2: date (nullable = true)
[UPDATE]
For Spark 2.1 or earlier, to_date doesn't take a format string as a parameter, hence explicit string formatting to the standard yyyy-MM-dd format (using, say, regexp_replace) is needed:
val df2 = df1.withColumn(
"c2", to_date(regexp_replace($"c2", "(\\d{4})(\\d{2})(\\d{2})", "$1-$2-$3"))
)
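An alternative that also works on Spark 2.1, since unix_timestamp does accept a format string there, would be a sketch like this (not part of the original answer):
val df2 = df1.withColumn(
  "c2", to_date(unix_timestamp($"c2", "yyyyMMdd").cast("timestamp"))
)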

Converting pattern of date in spark dataframe

I have a column of String datatype in a Spark dataframe (with dates in the yyyy-MM-dd pattern).
I want to display the column value in the MM/dd/yyyy pattern.
My data is
val df = sc.parallelize(Array(
  ("steak", "1990-01-01", "2000-01-01", 150),
  ("steak", "2000-01-02", "2001-01-13", 180),
  ("fish", "1990-01-01", "2001-01-01", 100)
)).toDF("name", "startDate", "endDate", "price")
df.show()
+-----+----------+----------+-----+
| name| startDate| endDate|price|
+-----+----------+----------+-----+
|steak|1990-01-01|2000-01-01| 150|
|steak|2000-01-02|2001-01-13| 180|
| fish|1990-01-01|2001-01-01| 100|
+-----+----------+----------+-----+
root
|-- name: string (nullable = true)
|-- startDate: string (nullable = true)
|-- endDate: string (nullable = true)
|-- price: integer (nullable = false)
I want to show endDate in the MM/dd/yyyy pattern. All I am able to do is convert the column from String to DateType:
val df2 = df.select($"endDate".cast(DateType).alias("endDate"))
df2.show()
+----------+
| endDate|
+----------+
|2000-01-01|
|2001-01-13|
|2001-01-01|
+----------+
df2.printSchema()
root
|-- endDate: date (nullable = true)
I want to show endDate in the MM/dd/yyyy pattern. The only reference I found is this, which doesn't solve the problem.
You can use the date_format function.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(
  ("steak", "1990-01-01", "2000-01-01", 150),
  ("steak", "2000-01-02", "2001-01-13", 180),
  ("fish", "1990-01-01", "2001-01-01", 100)
)).toDF("name", "startDate", "endDate", "price")
df.show()
df.select(date_format(col("endDate"), "MM/dd/yyyy")).show
Output:
+-------------------------------+
|date_format(endDate,MM/dd/yyyy)|
+-------------------------------+
| 01/01/2000|
| 01/13/2001|
| 01/01/2001|
+-------------------------------+
Use the date_format(date, format) function (org.apache.spark.sql.functions.date_format in Scala):
val df2 = df.select(date_format($"endDate", "MM/dd/yyyy").alias("endDate"))
Suppose a Dataframe/Dataset has a string column with a date value in it and we need to change the date format.
For the question asked, the date format can be changed as below:
val df1 = df.withColumn("startDate1", date_format(to_date(col("startDate"),"yyyy-MM-dd"),"MM/dd/yyyy" ))
In Spark, the default date format is "yyyy-MM-dd", hence it can be rewritten as:
val df1 = df.withColumn("startDate1", date_format(col("startDate"),"MM/dd/yyyy" ))
(i) By applying to_date, we change the datatype of this column from string to Date. We also tell to_date that the format in this string column is yyyy-MM-dd, so it reads the column accordingly.
(ii) Next, we apply date_format to get the date format we require, which is MM/dd/yyyy.
When a time component is involved, use to_timestamp instead of to_date (see the sketch below).
Note that 'MM' represents the month and 'mm' represents minutes.
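For instance, with a hypothetical endDateTime string column such as "2000-01-01 13:45:00" (not part of the original data), a minimal sketch on Spark 2.2+ would be:
val df3 = Seq(("steak", "2000-01-01 13:45:00")).toDF("name", "endDateTime")
df3.select(
  date_format(to_timestamp($"endDateTime", "yyyy-MM-dd HH:mm:ss"), "MM/dd/yyyy HH:mm").as("formatted")
).show
// 01/01/2000 13:45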