Spark 2.0 Timestamp Difference in Milliseconds using Scala - scala

I am using Spark 2.0 and am looking for a way to achieve the following in Scala:
I need the timestamp difference in milliseconds between two DataFrame column values.
Value_1 = 06/13/2017 16:44:20.044
Value_2 = 06/13/2017 16:44:21.067
The data type of both columns is timestamp.
Note: Applying the function unix_timestamp(Column s) to both values and subtracting works, but only to second precision; the requirement is millisecond precision.
The final query would look like this:
Select timestamp_diff(Value_2,Value_1) from table1
This should return the following output:
1023 milliseconds
where timestamp_diff is the function that calculates the difference in milliseconds.

One way is to use Unix epoch time: java.sql.Timestamp.getTime() returns the number of milliseconds since 1 January 1970 UTC. Below is an example using a UDF that takes two timestamps and returns the difference between them in milliseconds.
import java.sql.Timestamp
import org.apache.spark.sql.functions.{col, udf}
// Returns endTime - startTime in milliseconds
val timestamp_diff = udf((endTime: Timestamp, startTime: Timestamp) => {
  endTime.getTime() - startTime.getTime()
})
// df is a DataFrame with two timestamp columns (col1 and col2)
val df2 = df.withColumn("diff", timestamp_diff(col("col2"), col("col1")))
Alternatively, you can register the function to use with SQL commands:
val timestamp_diff = (endTime: Timestamp, startTime: Timestamp) => {
  endTime.getTime() - startTime.getTime()
}
spark.udf.register("timestamp_diff", timestamp_diff)
df.createOrReplaceTempView("table1")
val df2 = spark.sql("SELECT *, timestamp_diff(col2, col1) as diff from table1")

The same for PySpark:
import datetime
from pyspark.sql.types import LongType
def timestamp_diff(time1: datetime.datetime, time2: datetime.datetime) -> int:
    # Integer division by a 1 ms timedelta gives whole milliseconds without float rounding
    return (time1 - time2) // datetime.timedelta(milliseconds=1)
Example usage:
# Declare the return type; without it the registered UDF returns strings
spark.udf.register("timestamp_diff", timestamp_diff, LongType())
df.createOrReplaceTempView("table1")
df2 = spark.sql("SELECT *, timestamp_diff(col2, col1) as diff from table1")
This is not an optimal solution, since UDFs are usually slower than built-in column functions, so you might run into performance issues.
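As a quick sanity check of the arithmetic outside Spark, here is a minimal plain-Python sketch using the two values from the question. The helper name millis_between and the format string FMT are assumptions made for this illustration; they are not part of Spark or of the answers above.

```python
from datetime import datetime, timedelta

# Assumed input format, matching the values shown in the question.
FMT = "%m/%d/%Y %H:%M:%S.%f"

def millis_between(later: str, earlier: str) -> int:
    """Whole milliseconds from `earlier` to `later` (hypothetical helper)."""
    delta = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    # Dividing by a 1 ms timedelta keeps the arithmetic in integers.
    return delta // timedelta(milliseconds=1)

print(millis_between("06/13/2017 16:44:21.067", "06/13/2017 16:44:20.044"))  # 1023
```

The integer division by a timedelta is a deliberate choice here: multiplying total_seconds() by 1000 goes through floating point, and truncating the result with int() can lose a millisecond.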

Bit late to the party, but hope it's still useful.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
def getUnixTimestamp(col: Column): Column = (col.cast("double") * 1000).cast("long")
df.withColumn("diff", getUnixTimestamp(col("col2")) - getUnixTimestamp(col("col1")))
Of course you can define a separate method for the difference:
def timestampDiff(col1: Column, col2: Column): Column = getUnixTimestamp(col2) - getUnixTimestamp(col1)
df.withColumn("diff", timestampDiff(col("col1"), col("col2")))
To make life easier one can define an overloaded method for Strings with a default diff name:
def timestampDiff(col1: String, col2: String): Column = timestampDiff(col(col1), col(col2)).as("diff")
Now in action:
scala> df.show(false)
+-----------------------+-----------------------+
|min_time |max_time |
+-----------------------+-----------------------+
|1970-01-01 01:00:02.345|1970-01-01 01:00:04.786|
|1970-01-01 01:00:23.857|1970-01-01 01:00:23.999|
|1970-01-01 01:00:02.325|1970-01-01 01:01:07.688|
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.444|
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.454|
+-----------------------+-----------------------+
scala> df.withColumn("diff", timestampDiff("min_time", "max_time")).show(false)
+-----------------------+-----------------------+-----+
|min_time |max_time |diff |
+-----------------------+-----------------------+-----+
|1970-01-01 01:00:02.345|1970-01-01 01:00:04.786|2441 |
|1970-01-01 01:00:23.857|1970-01-01 01:00:23.999|142 |
|1970-01-01 01:00:02.325|1970-01-01 01:01:07.688|65363|
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.444|209 |
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.454|219 |
+-----------------------+-----------------------+-----+
scala> df.select(timestampDiff("min_time", "max_time")).show(false)
+-----+
|diff |
+-----+
|2441 |
|142 |
|65363|
|209 |
|219 |
+-----+

Related

Spark scala - calculating dynamic timestamp interval

I have a DataFrame with a timestamp column (timestamp type) called "maxTmstmp" and another column called "WindowHours" that holds a number of hours as integers. I would like to subtract the integer column from the timestamp column dynamically, row by row, to get the lower timestamp.
My data and the desired result (the "minTmstmp" column):
+-----------+-------------------+-------------------+
|WindowHours| maxTmstmp| minTmstmp|
| | |(maxTmstmp - Hours)|
+-----------+-------------------+-------------------+
| 1|2016-01-01 23:00:00|2016-01-01 22:00:00|
| 2|2016-03-01 12:00:00|2016-03-01 10:00:00|
| 8|2016-03-05 20:00:00|2016-03-05 12:00:00|
| 24|2016-04-12 11:00:00|2016-04-11 11:00:00|
+-----------+-------------------+-------------------+
root
|-- WindowHours: integer (nullable = true)
|-- maxTmstmp: timestamp (nullable = true)
I have already found a solution that uses an interval expression, but it isn't dynamic: the number of hours is hard-coded. The code below doesn't work as intended.
standards
  .withColumn("minTmstmp", $"maxTmstmp" - expr("INTERVAL 10 HOURS"))
  .show()
I am working with Spark 2.4 and Scala.
One simple way would be to convert maxTmstmp to unix time, subtract the value of WindowHours in seconds from it, and convert the result back to Spark Timestamp, as shown below:
import java.sql.Timestamp
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
(1, Timestamp.valueOf("2016-01-01 23:00:00")),
(2, Timestamp.valueOf("2016-03-01 12:00:00")),
(8, Timestamp.valueOf("2016-03-05 20:00:00")),
(24, Timestamp.valueOf("2016-04-12 11:00:00"))
).toDF("WindowHours", "maxTmstmp")
df.withColumn("minTmstmp",
  // from_unixtime returns a string, so cast the result back to a timestamp
  from_unixtime(unix_timestamp($"maxTmstmp") - ($"WindowHours" * 3600)).cast("timestamp")
).show
// +-----------+-------------------+-------------------+
// |WindowHours| maxTmstmp| minTmstmp|
// +-----------+-------------------+-------------------+
// | 1|2016-01-01 23:00:00|2016-01-01 22:00:00|
// | 2|2016-03-01 12:00:00|2016-03-01 10:00:00|
// | 8|2016-03-05 20:00:00|2016-03-05 12:00:00|
// | 24|2016-04-12 11:00:00|2016-04-11 11:00:00|
// +-----------+-------------------+-------------------+
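The per-row arithmetic behind this answer can be sketched in plain Python, assuming naive timestamps with no daylight-saving transitions in the window. The function name window_start is hypothetical and only used for this illustration; the rows are the sample data from the question.

```python
from datetime import datetime, timedelta

# Sample rows from the question: (WindowHours, maxTmstmp)
rows = [
    (1, datetime(2016, 1, 1, 23, 0, 0)),
    (2, datetime(2016, 3, 1, 12, 0, 0)),
    (8, datetime(2016, 3, 5, 20, 0, 0)),
    (24, datetime(2016, 4, 12, 11, 0, 0)),
]

def window_start(window_hours: int, max_ts: datetime) -> datetime:
    """Subtract a per-row number of hours from a timestamp (hypothetical helper)."""
    return max_ts - timedelta(hours=window_hours)

for hours, max_ts in rows:
    print(hours, max_ts, window_start(hours, max_ts))
```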

Spark Scala: Add 10 days to a date string (not a column)

I have two dates and want to subtract 10 days from one and add 10 days to the other. start_date and end_date are dynamic variables that come from one table and will be used to filter another table.
E.g.
val start_date = "2018-09-08"
val end_date = "2018-09-15"
I want to use the two dates above in the filter shown below:
myDF.filter($"timestamp".between(date_sub(start_date, 10),date_add(end_date, 10)))
The functions date_add and date_sub only take columns as input. How can I add/subtract 10 (this is an arbitrary number) days from my dates?
Thanks
Thank you Luis! Your solution worked. For anyone interested, the solution looks like this:
val start_date = lit("2018-09-08")
val end_date = lit("2018-09-15")
myDF.filter($"timestamp".between(date_sub(start_date, 10),date_add(end_date, 10)))
Another way: if you can create a temp view, you can insert the vals into the SQL text using Scala string interpolation ($name inside an s"..." string).
Make sure the strings are in one of the default date/timestamp formats (for example yyyy-MM-dd).
Check this out:
scala> val start_date = "2018-09-08"
start_date: String = 2018-09-08
scala> val end_date = "2018-09-15"
end_date: String = 2018-09-15
scala> val myDF=Seq(("2018-09-08"),("2018-09-15")).toDF("timestamp").withColumn("timestamp",to_timestamp('timestamp))
myDF: org.apache.spark.sql.DataFrame = [timestamp: timestamp]
scala> myDF.show(false)
+-------------------+
|timestamp |
+-------------------+
|2018-09-08 00:00:00|
|2018-09-15 00:00:00|
+-------------------+
scala> myDF.createOrReplaceTempView("ts_table")
scala> spark.sql(s""" select timestamp, date_sub('$start_date',10) as d_sub, date_add('$end_date',10) d_add from ts_table """).show(false)
+-------------------+----------+----------+
|timestamp |d_sub |d_add |
+-------------------+----------+----------+
|2018-09-08 00:00:00|2018-08-29|2018-09-25|
|2018-09-15 00:00:00|2018-08-29|2018-09-25|
+-------------------+----------+----------+
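If the bounds only need to be computed once, they can also be worked out on the driver before they go into the filter. The following plain-Python sketch shows that date arithmetic under the assumption that the strings are in ISO yyyy-MM-dd format; shift_days is a hypothetical helper name, not a Spark function.

```python
from datetime import date, timedelta

def shift_days(day: str, days: int) -> str:
    """Move an ISO date string by a number of days (hypothetical helper)."""
    return (date.fromisoformat(day) + timedelta(days=days)).isoformat()

start_date = "2018-09-08"
end_date = "2018-09-15"

lower = shift_days(start_date, -10)
upper = shift_days(end_date, 10)
print(lower, upper)  # 2018-08-29 2018-09-25
```

The two results match the d_sub and d_add columns in the output above.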

Convert from timestamp to specific date in pyspark

I would like to convert the Unix timestamp in a specific column to a formatted date.
Here is my input:
+----------+
| timestamp|
+----------+
|1532383202|
+----------+
What I would expect:
+------------------+
| date |
+------------------+
|24/7/2018 1:00:00 |
+------------------+
If possible, I would like to set the minutes and seconds to 0 even when they are not 0.
For example, if I have this:
+------------------+
| date |
+------------------+
|24/7/2018 1:06:32 |
+------------------+
I would like this:
+------------------+
| date |
+------------------+
|24/7/2018 1:00:00 |
+------------------+
What I tried is:
from pyspark.sql.functions import unix_timestamp
table = table.withColumn(
'timestamp',
unix_timestamp(date_format('timestamp', 'yyyy-MM-dd HH:MM:SS'))
)
But I get NULL.
Update
Inspired by @Tony Pellerin's answer, I realized you can go directly to :00:00 without having to use regexp_replace():
table = table.withColumn("date", f.from_unixtime("timestamp", "dd/MM/yyyy HH:00:00"))
table.show()
#+----------+-------------------+
#| timestamp| date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:00|
#+----------+-------------------+
Your code doesn't work because pyspark.sql.functions.unix_timestamp() will:
Convert time string with given pattern (‘yyyy-MM-dd HH:mm:ss’, by default) to Unix time stamp (in seconds), using the default timezone and the default locale, return null if fail.
You actually want the inverse of this operation, which is to convert an integer timestamp to a string. For this you can use pyspark.sql.functions.from_unixtime(). Note that in the format pattern mm is minutes and ss is seconds, while MM is the month and SS is fractions of a second. The hour shown depends on the session time zone:
import pyspark.sql.functions as f
table = table.withColumn("date", f.from_unixtime("timestamp", "dd/MM/yyyy HH:mm:ss"))
table.show()
#+----------+-------------------+
#| timestamp| date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:02|
#+----------+-------------------+
Now the date column is a string:
table.printSchema()
#root
# |-- timestamp: long (nullable = true)
# |-- date: string (nullable = true)
So you can use pyspark.sql.functions.regexp_replace() to make the minutes and seconds zero:
table.withColumn("date", f.regexp_replace("date", r":\d{2}:\d{2}", ":00:00")).show()
#+----------+-------------------+
#| timestamp| date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:00|
#+----------+-------------------+
The regex pattern ":\d{2}" means match a literal : followed by exactly 2 digits.
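Both ideas from this answer (truncating to the hour and zeroing the minutes and seconds with a regex) can be checked outside Spark. This is a minimal plain-Python sketch, assuming UTC as the time zone. floor_to_hour is a hypothetical helper name, and flooring by 3600 seconds only lines up with the clock hour in time zones whose offset is a whole number of hours.

```python
import re
from datetime import datetime, timezone

def floor_to_hour(epoch_seconds: int) -> str:
    """Format an epoch timestamp with minutes and seconds set to zero (UTC assumed)."""
    floored = epoch_seconds - epoch_seconds % 3600  # drop the part past the hour
    return datetime.fromtimestamp(floored, tz=timezone.utc).strftime("%d/%m/%Y %H:%M:%S")

print(floor_to_hour(1532383202))  # 23/07/2018 22:00:00 (in UTC)

# The same regex as above, applied to the example string from the question.
print(re.sub(r":\d{2}:\d{2}", ":00:00", "24/7/2018 1:06:32"))  # 24/7/2018 1:00:00
```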
Alternatively, you could use Python's datetime library to convert the timestamps to the format you want, wrapped in a user-defined function (UDF) so that it can be applied to a Spark DataFrame column. Here's what I would do:
# Import the libraries
from pyspark.sql.functions import udf
from datetime import datetime
# Create a function that returns the desired string from a timestamp
def format_timestamp(ts):
return datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:00:00')
# Create the UDF
format_timestamp_udf = udf(lambda x: format_timestamp(x))
# Finally, apply the function to each element of the 'timestamp' column
table = table.withColumn('timestamp', format_timestamp_udf(table['timestamp']))
Hope this helps.

How to Transform Spark Dataframe column for HH:MM:SS:Ms to value in seconds?

I would like to transform a Spark DataFrame column from its hours:minutes:seconds value to a value in seconds.
E.g. "01:12:17.8370000"
would become 4337 s (thanks for the comment),
or "00:00:39.0390000"
would become 39 s.
I have read this question, but I am lost on how to use its code to transform my Spark DataFrame column:
Convert HH:mm:ss in seconds
Something like this:
df.withColumn("duration",col("duration")....)
I am using Scala 2.10.5 and Spark 1.6.
Thank you
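Assuming the column "duration" contains the duration as a string, you can use the unix_timestamp function from the functions package and pass it the pattern to get the number of seconds. The string is parsed as a time of day on 1970-01-01 in the default time zone, so the result equals the duration only when that time zone is UTC: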
import org.apache.spark.sql.functions._
val df = Seq("01:12:17.8370000", "00:00:39.0390000").toDF("duration")
val newColumn = unix_timestamp(col("duration"), "HH:mm:ss")
val result = df.withColumn("duration", newColumn)
result.show
+--------+
|duration|
+--------+
| 4337|
| 39|
+--------+
If you have a string column, you can write a udf to calculate this manually:
val df = Seq("01:12:17.8370000", "00:00:39.0390000").toDF("duration")
val str_sec = udf((s: String) => {
val Array(hour, minute, second) = s.split(":")
hour.toInt * 3600 + minute.toInt * 60 + second.toDouble.toInt
})
df.withColumn("duration", str_sec($"duration")).show
+--------+
|duration|
+--------+
| 4337|
| 39|
+--------+
There are built-in functions you can take advantage of, which are faster and more efficient than UDFs.
Given the input DataFrame
+----------------+
|duration |
+----------------+
|01:12:17.8370000|
|00:00:39.0390000|
+----------------+
you can do something like this:
df.withColumn("seconds", hour($"duration")*3600+minute($"duration")*60+second($"duration"))
You should get this output:
+----------------+-------+
|duration |seconds|
+----------------+-------+
|01:12:17.8370000|4337 |
|00:00:39.0390000|39 |
+----------------+-------+
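The conversion itself is simple arithmetic, as this plain-Python sketch shows. It assumes the string always has the form HH:mm:ss with an optional fractional part, which is dropped; duration_to_seconds is a hypothetical name used only for this illustration.

```python
def duration_to_seconds(duration: str) -> int:
    """Convert 'HH:mm:ss[.fraction]' to whole seconds, dropping the fraction."""
    hours, minutes, seconds = duration.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + int(float(seconds))

print(duration_to_seconds("01:12:17.8370000"))  # 4337
print(duration_to_seconds("00:00:39.0390000"))  # 39
```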

how to find week difference between two dates

I have a DataFrame with two date columns in Unix time, and I want to find the difference in weeks between them. There is a weekofyear function in Spark SQL, but it is only useful when both dates fall in the same year. How can I find the week difference otherwise?
P.S. I'm using Scala Spark.
You can take the approach of creating a custom UDF for this:
scala> val df=sc.parallelize(Seq((1480401142453L,1480399932853L))).toDF("date1","date2")
df: org.apache.spark.sql.DataFrame = [date1: bigint, date2: bigint]
scala> df.show
+-------------+-------------+
| date1| date2|
+-------------+-------------+
|1480401142453|1480399932853|
+-------------+-------------+
scala> val udfDateDifference=udf((date1:Long,date2:Long)=>((date1-date2)/(60*60*24*7)).toInt
|
| )
udfDateDifference: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,IntegerType,Some(List(LongType, LongType)))
scala> val resultDF=df.withColumn("dateDifference",udfDateDifference(df("date1"),df("date2")))
resultDF: org.apache.spark.sql.DataFrame = [date1: bigint, date2: bigint ... 1 more field]
scala> resultDF.show
+-------------+-------------+--------------+
| date1| date2|dateDifference|
+-------------+-------------+--------------+
|1480401142453|1480399932853| 2|
+-------------+-------------+--------------+
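And hence you can get the difference!
As your dates are in Unix time, you can use this expression:
((date1-date2)/(60*60*24*7)).toInt
The divisor 60*60*24*7 is the number of seconds in a week, so this assumes the values are in seconds. For values in milliseconds (13-digit values such as 1480401142453), divide by 1000L*60*60*24*7 instead.
Edit:
Updating this answer with an example: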
// register returns a UserDefinedFunction that can also be used in the DataFrame API
val weekdiff = spark.udf.register("weekdiff", (from: Long, to: Long) => ((from - to) / 604800).toInt)
// 60*60*24*7 => 604800
df.withColumn("weekdiff", weekdiff(df("date1_col_name"), df("date2_col_name")))
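To see how the unit of the epoch values affects the result, here is a minimal plain-Python sketch of the same division. The name week_diff and its units_per_second parameter are assumptions made for this illustration, and the sketch assumes the first argument is the later date.

```python
from datetime import datetime, timezone

SECONDS_PER_WEEK = 60 * 60 * 24 * 7  # 604800

def week_diff(later: int, earlier: int, units_per_second: int = 1) -> int:
    """Whole weeks between two epoch values (hypothetical helper).

    units_per_second is 1 for epoch seconds and 1000 for epoch milliseconds.
    """
    return (later - earlier) // (SECONDS_PER_WEEK * units_per_second)

# Two dates in different years, exactly 14 days apart.
d1 = int(datetime(2017, 1, 8, tzinfo=timezone.utc).timestamp())
d2 = int(datetime(2016, 12, 25, tzinfo=timezone.utc).timestamp())

print(week_diff(d1, d2))                                      # 2
print(week_diff(d1 * 1000, d2 * 1000, units_per_second=1000)) # 2

# The 13-digit sample values from the first answer, read as milliseconds,
# are only 20 minutes apart, so the difference is 0 weeks.
print(week_diff(1480401142453, 1480399932853, units_per_second=1000))  # 0
```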