I have a timestamp like this in $"my_col":
2022-01-21 22:11:11
with date_trunc("minute",($"my_col"))
2022-01-21 22:11:00
with date_trunc("hour",($"my_col"))
2022-01-21 22:00:00
What is a Spark 3.0 way to get
2022-01-21 22:10:00
?
Convert the timestamp into seconds using the unix_timestamp function, then do the rounding: divide by 600 (10 minutes), round the result of the division, and multiply by 600 again:
import org.apache.spark.sql.functions._

val df = Seq(
  ("2022-01-21 22:11:11"),
  ("2022-01-21 22:04:04"),
  ("2022-01-21 22:19:34"),
  ("2022-01-21 22:57:14")
).toDF("my_col").withColumn("my_col", to_timestamp($"my_col"))

df.withColumn(
  "my_col_rounded",
  from_unixtime(round(unix_timestamp($"my_col") / 600) * 600)
).show
//+-------------------+-------------------+
//|my_col             |my_col_rounded     |
//+-------------------+-------------------+
//|2022-01-21 22:11:11|2022-01-21 22:10:00|
//|2022-01-21 22:04:04|2022-01-21 22:00:00|
//|2022-01-21 22:19:34|2022-01-21 22:20:00|
//|2022-01-21 22:57:14|2022-01-21 23:00:00|
//+-------------------+-------------------+
You can also truncate the original timestamp to the hour, take the minutes, round them to the nearest 10, and add them back to the truncated timestamp as an interval:
df.withColumn(
  "my_col_rounded",
  date_trunc("hour", $"my_col") + format_string(
    "interval %s minute",
    expr("cast(round(extract(MINUTE FROM my_col) / 10.0) * 10 as int)")
  ).cast("interval")
)
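If you prefer to skip the string-to-interval cast, the same 10-minute rounding can also be written as a single SQL expression (a sketch, assuming Spark 3.0+ with default settings, where casting a numeric value to timestamp interprets it as seconds since the epoch):
df.withColumn(
  "my_col_rounded",
  // round the epoch seconds to the nearest 600 s and cast back to a timestamp
  expr("cast(round(unix_timestamp(my_col) / 600) * 600 as timestamp)")
).show(false)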
I have a Spark DataFrame with a timestamp column in milliseconds since the epoch. The column is a string. I now want to transform the column into a human-readable time but keep the milliseconds.
For example:
1614088453671 -> 23-2-2021 13:54:13.671
Every example I found transforms the timestamp into a human-readable time without the milliseconds.
What I have:
+------------------+
|epoch_time_seconds|
+------------------+
|1614088453671     |
+------------------+
What I want to get:
+------------------+------------------------+
|epoch_time_seconds|human_date              |
+------------------+------------------------+
|1614088453671     |23-02-2021 13:54:13.671 |
+------------------+------------------------+
The part before the milliseconds can be obtained by applying date_format to from_unixtime, while the milliseconds can be obtained with a modulo. Combine the two using format_string.
import org.apache.spark.sql.functions._

val df2 = df.withColumn(
  "human_date",
  format_string(
    "%s.%03d",
    date_format(
      from_unixtime(col("epoch_time_seconds").cast("long") / 1000),
      "dd-MM-yyyy HH:mm:ss"
    ),
    // the column is a string, so cast before taking the millisecond remainder
    col("epoch_time_seconds").cast("long") % 1000
  )
)
df2.show(false)
+------------------+-----------------------+
|epoch_time_seconds|human_date             |
+------------------+-----------------------+
|1614088453671     |23-02-2021 13:54:13.671|
+------------------+-----------------------+
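If you are on Spark 3.1 or later, timestamp_millis is an alternative (a sketch, not part of the original answer; df3 is just an illustrative name). It keeps the millisecond part, so a single date_format call is enough:
val df3 = df.withColumn(
  "human_date",
  date_format(
    // timestamp_millis interprets the value as milliseconds since the epoch
    expr("timestamp_millis(cast(epoch_time_seconds as bigint))"),
    "dd-MM-yyyy HH:mm:ss.SSS"
  )
)
df3.show(false)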
I have a dataframe with two date columns. Now I need to get the difference, and the result should be in seconds.
UNIX_TIMESTAMP(SUBSTR(date1, 1, 19)) - UNIX_TIMESTAMP(SUBSTR(date2, 1, 19)) AS delta
I am trying to convert that Hive query into a DataFrame query using Scala:
df.select(col("date").substr(1,19)-col("poll_date").substr(1,19))
From here I am not able to convert it into seconds. Can anybody help with this? Thanks in advance.
Using the DataFrame API, you can calculate the difference in seconds simply by applying unix_timestamp to each column and subtracting:
val df = Seq(
("2018-03-05 09:00:00", "2018-03-05 09:01:30"),
("2018-03-06 08:30:00", "2018-03-08 15:00:15")
).toDF("date1", "date2")
df.withColumn("tsdiff", unix_timestamp($"date2") - unix_timestamp($"date1")).
show
// +-------------------+-------------------+------+
// | date1| date2|tsdiff|
// +-------------------+-------------------+------+
// |2018-03-05 09:00:00|2018-03-05 09:01:30| 90|
// |2018-03-06 08:30:00|2018-03-08 15:00:15|196215|
// +-------------------+-------------------+------+
You could perform the calculation in Spark SQL as well, if necessary:
df.createOrReplaceTempView("dfview")
spark.sql("""
select date1, date2, (unix_timestamp(date2) - unix_timestamp(date1)) as tsdiff
from dfview
""")
I have the input as a timestamp. Based on some condition, I need to subtract 1 second or subtract 3 months using Scala programming.
Input:
val date :String = "2017-10-31T23:59:59.000"
Output:
For minus 1 second:
val lessOneSec = "2017-10-31T23:59:58.000"
For minus 3 months:
val less3Mon = "2017-07-31T23:59:59.000"
How do I convert a string value to a timestamp and do operations like this subtraction in Scala?
I assume you are working with DataFrames, since you have the spark-dataframe tag.
You can use a SQL INTERVAL to subtract the time, but your column needs to be cast to timestamp for that:
df.show(false)
+-----------------------+
|ts                     |
+-----------------------+
|2017-10-31T23:59:59.000|
+-----------------------+
import org.apache.spark.sql.functions._
df.withColumn("minus1Sec" , date_format($"ts".cast("timestamp") - expr("interval 1 second") , "yyyy-MM-dd'T'HH:mm:ss.SSS") )
.withColumn("minus3Mon" , date_format($"ts".cast("timestamp") - expr("interval 3 month ") , "yyyy-MM-dd'T'HH:mm:ss.SSS") )
.show(false)
+-----------------------+-----------------------+-----------------------+
|ts                     |minus1Sec              |minus3Mon              |
+-----------------------+-----------------------+-----------------------+
|2017-10-31T23:59:59.000|2017-10-31T23:59:58.000|2017-07-31T23:59:59.000|
+-----------------------+-----------------------+-----------------------+
Try the code below:
import org.joda.time.LocalDateTime
import org.joda.time.format.DateTimeFormat

val yourDate = "2017-10-31T23:59:59.000"
val formater = DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss.SSS")
val date = LocalDateTime.parse(yourDate, formater)
println(date.minusSeconds(1).toString(formater))
println(date.minusMonths(3).toString(formater))
Output
2017-10-31T23:59:58.000
2017-07-31T23:59:59.000
Look at the joda-time library; it has all the APIs you need to subtract seconds or months from a timestamp:
http://www.joda.org/joda-time/
sbt dependency
"joda-time" % "joda-time" % "2.9.9"
I have a Spark SQL DataFrame with a date column, and what I'm trying to get is all the rows preceding the current row within a given date range. So, for example, I want to have all the rows from 7 days back preceding the given row. I figured out that I need to use a window function like:
Window \
.partitionBy('id') \
.orderBy('start')
I want to have a rangeBetween of 7 days, but there is nothing in the Spark docs I could find on this. Does Spark even provide such an option? For now I'm just getting all the preceding rows with:
.rowsBetween(-sys.maxsize, 0)
but would like to achieve something like:
.rangeBetween("7 days", 0)
Spark >= 2.3
Since Spark 2.3 it is possible to use interval objects via the SQL API, but DataFrame API support is still a work in progress.
df.createOrReplaceTempView("df")
spark.sql(
"""SELECT *, mean(some_value) OVER (
PARTITION BY id
ORDER BY CAST(start AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS mean FROM df""").show()
## +---+----------+----------+------------------+
## | id| start|some_value| mean|
## +---+----------+----------+------------------+
## | 1|2015-01-01| 20.0| 20.0|
## | 1|2015-01-06| 10.0| 15.0|
## | 1|2015-01-07| 25.0|18.333333333333332|
## | 1|2015-01-12| 30.0|21.666666666666668|
## | 2|2015-01-01| 5.0| 5.0|
## | 2|2015-01-03| 30.0| 17.5|
## | 2|2015-02-01| 20.0| 20.0|
## +---+----------+----------+------------------+
Spark < 2.3
As far as I know it is not possible directly, either in Spark or in Hive. Both require the ORDER BY clause used with RANGE to be numeric. The closest thing I found is conversion to a timestamp and operating on seconds. Assuming the start column contains a date type:
from pyspark.sql import Row
from pyspark.sql.functions import col
row = Row("id", "start", "some_value")
df = sc.parallelize([
row(1, "2015-01-01", 20.0),
row(1, "2015-01-06", 10.0),
row(1, "2015-01-07", 25.0),
row(1, "2015-01-12", 30.0),
row(2, "2015-01-01", 5.0),
row(2, "2015-01-03", 30.0),
row(2, "2015-02-01", 20.0)
]).toDF().withColumn("start", col("start").cast("date"))
A small helper and window definition:
from pyspark.sql.window import Window
from pyspark.sql.functions import mean, col
# Hive timestamp is interpreted as UNIX timestamp in seconds*
days = lambda i: i * 86400
Finally, the query:
w = (Window()
.partitionBy(col("id"))
.orderBy(col("start").cast("timestamp").cast("long"))
.rangeBetween(-days(7), 0))
df.select(col("*"), mean("some_value").over(w).alias("mean")).show()
## +---+----------+----------+------------------+
## | id| start|some_value| mean|
## +---+----------+----------+------------------+
## | 1|2015-01-01| 20.0| 20.0|
## | 1|2015-01-06| 10.0| 15.0|
## | 1|2015-01-07| 25.0|18.333333333333332|
## | 1|2015-01-12| 30.0|21.666666666666668|
## | 2|2015-01-01| 5.0| 5.0|
## | 2|2015-01-03| 30.0| 17.5|
## | 2|2015-02-01| 20.0| 20.0|
## +---+----------+----------+------------------+
Far from pretty but works.
* Hive Language Manual, Types
Spark 3.3 is released, but...
The answer may be as old as Spark 1.5.0:
datediff.
datediff(col_name, '1000') will return an integer difference of days from 1000-01-01 to col_name.
As the first argument, it accepts dates, timestamps and even strings.
As the second, it even accepts 1000.
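A quick way to sanity-check that claim (a sketch; F is pyspark.sql.functions and the dates are arbitrary):
from pyspark.sql import functions as F

# '1000' is interpreted as the date 1000-01-01, so this returns the (large,
# positive) number of days between 1000-01-01 and 2023-01-01.
spark.range(1).select(
    F.datediff(F.lit("2023-01-01"), F.lit("1000")).alias("days_since_1000_01_01")
).show()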
The answer
Date difference in days, depending on the data type of the order column (a complete end-to-end sketch follows this list):
date
Spark 3.1+
.orderBy(F.expr("unix_date(col_name)")).rangeBetween(-7, 0)
Spark 2.1+
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
timestamp
Spark 2.1+
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
long - UNIX time in microseconds (e.g. 1672534861000000)
Spark 2.1+
.orderBy(F.col("col_name") / 86400_000000).rangeBetween(-7, 0)
long - UNIX time in milliseconds (e.g. 1672534861000)
Spark 2.1+
.orderBy(F.col("col_name") / 86400_000).rangeBetween(-7, 0)
long - UNIX time in seconds (e.g. 1672534861)
Spark 2.1+
.orderBy(F.col("col_name") / 86400).rangeBetween(-7, 0)
long in format yyyyMMdd
Spark 3.3+
.orderBy(F.expr("unix_date(to_date(col_name, 'yyyyMMdd'))")).rangeBetween(-7, 0)
Spark 3.1+
.orderBy(F.expr("unix_date(to_date(cast(col_name as string), 'yyyyMMdd'))")).rangeBetween(-7, 0)
Spark 2.2+
.orderBy(F.expr("datediff(to_date(cast(col_name as string), 'yyyyMMdd'), '1000')")).rangeBetween(-7, 0)
Spark 2.1+
.orderBy(F.unix_timestamp(F.col("col_name").cast('string'), 'yyyyMMdd') / 86400).rangeBetween(-7, 0)
string in date format of 'yyyy-MM-dd'
Spark 3.1+
.orderBy(F.expr("unix_date(to_date(col_name))")).rangeBetween(-7, 0)
Spark 2.1+
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
string in other date format (e.g. 'MM-dd-yyyy')
Spark 3.1+
.orderBy(F.expr("unix_date(to_date(col_name, 'MM-dd-yyyy'))")).rangeBetween(-7, 0)
Spark 2.2+
.orderBy(F.expr("datediff(to_date(col_name, 'MM-dd-yyyy'), '1000')")).rangeBetween(-7, 0)
Spark 2.1+
.orderBy(F.unix_timestamp("col_name", 'MM-dd-yyyy') / 86400).rangeBetween(-7, 0)
string in timestamp format of 'yyyy-MM-dd HH:mm:ss'
Spark 2.1+
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
string in other timestamp format (e.g. 'MM-dd-yyyy HH:mm:ss')
Spark 2.2+
.orderBy(F.expr("datediff(to_date(col_name, 'MM-dd-yyyy HH:mm:ss'), '1000')")).rangeBetween(-7, 0)
Fantastic solution, @zero323. If you want to operate with minutes instead of days, as I had to, and you don't need to partition by id, you only have to modify a small part of the code, as shown here:
df.createOrReplaceTempView("df")
spark.sql(
"""SELECT *, sum(total) OVER (
ORDER BY CAST(reading_date AS timestamp)
RANGE BETWEEN INTERVAL 45 minutes PRECEDING AND CURRENT ROW
) AS sum_total FROM df""").show()
I have data from 1st Jan 2017 to 7th Jan 2017, which is one week, and I want a weekly aggregate. I used the window function in the following manner:
val df_v_3 = df_v_2.groupBy(window(col("DateTime"), "7 day"))
.agg(sum("Value") as "aggregate_sum")
.select("window.start", "window.end", "aggregate_sum")
The data in my dataframe looks like this:
DateTime,value
2017-01-01T00:00:00.000+05:30,1.2
2017-01-01T00:15:00.000+05:30,1.30
--
2017-01-07T23:30:00.000+05:30,1.43
2017-01-07T23:45:00.000+05:30,1.4
I am getting this output:
2016-12-29T05:30:00.000+05:30,2017-01-05T05:30:00.000+05:30,723.87
2017-01-05T05:30:00.000+05:30,2017-01-12T05:30:00.000+05:30,616.74
It shows my window starting from 29th Dec 2016, but the actual data starts from 1st Jan 2017. Why is this offset occurring?
For tumbling windows like this it is possible to set an offset to the starting time; more information can be found in the blog post linked here. A sliding window is used here, but by setting both the "window duration" and the "sliding duration" to the same value, it behaves the same as a tumbling window with a starting offset.
The syntax is as follows:
window(column, window duration, sliding duration, starting offset)
With your values I found that an offset of 64 hours would give a starting time of 2017-01-01 00:00:00.
val data = Seq(("2017-01-01 00:00:00",1.0),
("2017-01-01 00:15:00",2.0),
("2017-01-08 23:30:00",1.43))
val df = data.toDF("DateTime","value")
.withColumn("DateTime", to_timestamp($"DateTime", "yyyy-MM-dd HH:mm:ss"))
val df2 = df
.groupBy(window(col("DateTime"), "1 week", "1 week", "64 hours"))
.agg(sum("value") as "aggregate_sum")
.select("window.start", "window.end", "aggregate_sum")
This will give the resulting dataframe:
+-------------------+-------------------+-------------+
| start| end|aggregate_sum|
+-------------------+-------------------+-------------+
|2017-01-01 00:00:00|2017-01-08 00:00:00| 3.0|
|2017-01-08 00:00:00|2017-01-15 00:00:00| 1.43|
+-------------------+-------------------+-------------+
The solution with the Python API looks a bit more intuitive, since the window function takes the following options:
window(timeColumn, windowDuration, slideDuration=None, startTime=None)
see:
https://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/functions.html
The startTime is the offset with respect to 1970-01-01 00:00:00 UTC
with which to start window intervals. For example, in order to have
hourly tumbling windows that start 15 minutes past the hour, e.g.
12:15-13:15, 13:15-14:15... provide startTime as 15 minutes.
There is no need for a workaround with the sliding duration; I used a 3-day "delay" as startTime to match the desired tumbling window:
from datetime import datetime
from pyspark.sql.functions import sum, window
df_ex = spark.createDataFrame([(datetime(2017,1,1, 0,0) , 1.), \
(datetime(2017,1,1,0,15) , 2.), \
(datetime(2017,1,8,23,30) , 1.43)], \
["Datetime", "value"])
weekly_ex = df_ex \
.groupBy(window("Datetime", "1 week", startTime="3 day" )) \
.agg(sum("value").alias('aggregate_sum'))
weekly_ex.show(truncate=False)
For the same result:
+------------------------------------------+-------------+
|window                                    |aggregate_sum|
+------------------------------------------+-------------+
|[2017-01-01 00:00:00, 2017-01-08 00:00:00]|3.0          |
|[2017-01-08 00:00:00, 2017-01-15 00:00:00]|1.43         |
+------------------------------------------+-------------+