Add Hours, minutes and seconds to Spark dataframe - pyspark

Is there a Spark SQL function to add hours, minutes and seconds to an existing timestamp column?
For example:
+----------+-------------------+-------------------+
| dt| txn_dt| txn_dt_tm|
+----------+-------------------+-------------------+
|2008-08-15|2008-08-15 00:00:00|2008-08-15 05:00:00|
+----------+-------------------+-------------------+
I need to add 23 hours, 59 minutes and 59 seconds to the txn_dt column.
Output:
+----------+-------------------+-------------------+
| dt| txn_dt| txn_dt_tm|
+----------+-------------------+-------------------+
|2008-08-15|2008-08-15 23:59:59|2008-08-15 05:00:00|
+----------+-------------------+-------------------+
Update:
I was able to get it using INTERVAL, but I'm not sure this is an efficient way of doing it:
df.select((F.col("txn_dt") + F.expr("INTERVAL 23 HOURS") + F.expr("INTERVAL 59 MINUTES") + F.expr("INTERVAL 59 SECONDS")).alias("txn_dt_tm"))

You can define a custom UDF for this, for example (in Scala):
import org.apache.spark.sql.functions._
// add 24 hours minus 1 second (i.e. 23:59:59) to the timestamp
val timeUdf = udf{(time: java.sql.Timestamp) => new java.sql.Timestamp(time.getTime + 24*60*60*1000 - 1000)}
df.withColumn("dt", timeUdf(df("dt"))).show()
and the result:
+--------------------+---+
| dt| id|
+--------------------+---+
|2008-08-15 23:59:...| 1|
+--------------------+---+
I hope this helps.
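For completeness, since the question is tagged pyspark: the three separate interval additions from the update can be collapsed into a single multi-unit interval, which avoids both the UDF and the repeated expr calls. A minimal sketch, reusing df and the column names from the question:
from pyspark.sql import functions as F

# One multi-unit interval literal instead of three separate additions;
# no Python UDF is needed, so the expression stays in the JVM.
df = df.withColumn(
    "txn_dt_tm",
    F.col("txn_dt") + F.expr("INTERVAL 23 HOURS 59 MINUTES 59 SECONDS"),
)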

Related

Average of sales every 3 months in pyspark

I want a rolling 3-month average of sales in pyspark.
Input:
Product Date Sales
A 01/04/2020 50
A 02/04/2020 60
A 01/05/2020 70
A 05/05/2020 80
A 10/06/2020 100
A 13/06/2020 150
A 25/07/2020 160
Output:
Product Date Sales 3-month avg sales
A 01/04/2020 50 36.67
A 02/04/2020 60 36.67
A 01/05/2020 70 86.67
A 05/05/2020 80 86.67
A 10/06/2020 100 170
A 13/06/2020 150 170
A 25/07/2020 160 186.67
For example, the average for July is the total sales of May, June and July divided by 3: (150 + 250 + 160) / 3 = 560 / 3 = 186.67.
Sometimes dense_rank is quite expensive, so I compute a custom month index instead and then follow steps similar to @Cena's answer. The index (year('Date') - 2020) * 12 + month('Date') is a consecutive month number, so rangeBetween(-2, 0) covers the current month and the two preceding ones.
from pyspark.sql import Window
from pyspark.sql.functions import *
w = Window.partitionBy('Product').orderBy('index').rangeBetween(-2, 0)
df.withColumn('Date', to_date('Date', 'dd/MM/yyyy')) \
.withColumn('index', (year('Date') - 2020) * 12 + month('Date')) \
.withColumn('avg', sum('Sales').over(w) / 3) \
.show()
+-------+----------+-----+-----+------------------+
|Product| Date|Sales|index| avg|
+-------+----------+-----+-----+------------------+
| A|2020-04-01| 50| 4|36.666666666666664|
| A|2020-04-02| 60| 4|36.666666666666664|
| A|2020-05-01| 70| 5| 86.66666666666667|
| A|2020-05-05| 80| 5| 86.66666666666667|
| A|2020-06-10| 100| 6| 170.0|
| A|2020-06-13| 150| 6| 170.0|
| A|2020-07-25| 160| 7|186.66666666666666|
+-------+----------+-----+-----+------------------+
You can use dense_rank() over the month column to compute the moving average. Cast the date and extract the month from it; dense_rank() over the month then gives you consecutive ranks.
For the moving average, use rangeBetween(-2, 0) to look back 2 months from the current month. Sum the sales and divide by 3 for the output.
Your df:
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.functions import *
from pyspark.sql.window import Window
row = Row("Product", "Date", "Sales")
df = sc.parallelize([
    row("A", "01/04/2020", 50), row("A", "02/04/2020", 60),
    row("A", "01/05/2020", 70), row("A", "05/05/2020", 80),
    row("A", "10/06/2020", 100), row("A", "13/06/2020", 150),
    row("A", "25/07/2020", 160)
]).toDF()
df = df.withColumn('date_cast', from_unixtime(unix_timestamp('Date', 'dd/MM/yyyy')).cast(DateType()))
df = df.withColumn('month', month("date_cast"))
w=Window().partitionBy("Product").orderBy("month")
df = df.withColumn('rank', F.dense_rank().over(w))
w2 = (Window().partitionBy(col("Product")).orderBy("rank").rangeBetween(-2, 0))
df.select(col("*"), ((F.sum("Sales").over(w2))/3).alias("mean"))\
.drop("date_cast", "month", "rank").show()
Output:
+-------+----------+-----+------------------+
|Product| Date|Sales| mean|
+-------+----------+-----+------------------+
| A|01/04/2020| 50|36.666666666666664|
| A|02/04/2020| 60|36.666666666666664|
| A|01/05/2020| 70| 86.66666666666667|
| A|05/05/2020| 80| 86.66666666666667|
| A|10/06/2020| 100| 170.0|
| A|13/06/2020| 150| 170.0|
| A|25/07/2020| 160|186.66666666666666|
+-------+----------+-----+------------------+

Spark Scala - 7 Day Rolling Sum

I have some data on which I want to calculate a 7-day rolling sum. Every row for a specific date should be counted as one occurrence. My thought process here is to use something like:
val myWindow = Window.orderBy("Date").rangeBetween(currentRow,days(7))
val myData = df.withColumn("Count",df.count().over(myWindow))
But rangeBetween doesn't accept days(7) for looking 7 days ahead of the current date.
Any thoughts?
Input Data:
val df = Seq(
("08/04/2013",22),
("08/05/2013",24),
("08/06/2013",26),
("08/07/2013",29),
("08/08/2013",24),
("08/09/2013",24),
("08/10/2013",22),
("08/11/2013",24),
("08/11/2013",26)
).toDF("Date","Code")
+----------+----+
| Date|Code|
+----------+----+
|08/04/2013| 22|
|08/05/2013| 24|
|08/06/2013| 26|
|08/07/2013| 29|
|08/08/2013| 24|
|08/09/2013| 24|
|08/10/2013| 22|
|08/11/2013| 24|
|08/11/2013| 26|
+----------+----+
Expected output:
+----------+----------+-----+
|     Start|       End|Count|
+----------+----------+-----+
|08/04/2013|08/10/2013|    7|
|08/05/2013|08/11/2013|    8|
+----------+----------+-----+
From Spark 2.3 you have to use long values with rangeBetween. As one day has 86400 seconds, you can express your query as:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val myWindow = Window.orderBy("Date").rangeBetween(0, 7 * 86400)
val myData = df
  .withColumn("Date", to_date($"Date", "MM/dd/yyyy").cast("timestamp").cast("long"))
  .withColumn("Count", count($"*").over(myWindow))
  .withColumn("Date", $"Date".cast("timestamp").cast("date"))

Extract only Hour from Epochtime in scala

I have a dataframe with one of its columns holding an epoch time in milliseconds.
I want to extract only the hour from it and display it as a separate column.
Below is the sample dataframe:
+----------+-------------+
| NUM_ID| STIME|
+----------+-------------+
|xxxxxxxx01|1571634285000|
|xxxxxxxx01|1571634299000|
|xxxxxxxx01|1571634311000|
|xxxxxxxx01|1571634316000|
|xxxxxxxx02|1571634318000|
|xxxxxxxx02|1571398176000|
|xxxxxxxx02|1571627596000|
+----------+-------------+
Below is the expected output.
+----------+-------------+-----+
| NUM_ID| STIME| HOUR|
+----------+-------------+-----+
|xxxxxxxx01|1571634285000| 10 |
|xxxxxxxx01|1571634299000| 10 |
|xxxxxxxx01|1571634311000| 10 |
|xxxxxxxx01|1571634316000| 10 |
|xxxxxxxx02|1571634318000| 10 |
|xxxxxxxx02|1571398176000| 16 |
|xxxxxxxx02|1571627596000| 08 |
+----------+-------------+-----+
I have tried
val test = test1DF.withColumn("TIME", extract HOUR(from_unixtime($"STIME"/1000)))
which throws exception at
<console>:46: error: not found: value extract
I also tried the following to obtain a date format, and even that is not working:
val test = test1DF.withColumn("TIME", to_timestamp(from_unixtime(col("STIME"))))
The datatype of STIME in the dataframe is Long.
Any leads on extracting the hour from an epoch time stored as a Long?
Extracting the hours from a timestamp is as simple as using the hour() function:
import org.apache.spark.sql.functions._
val df_with_hour = df.withColumn("TIME", hour(from_unixtime($"STIME" / 1000)))
df_with_hour.show()
// +-------------+----+
// | STIME|TIME|
// +-------------+----+
// |1571634285000| 5|
// |1571398176000| 11|
// |1571627596000| 3|
// +-------------+----+
(Note: I'm in a different timezone)
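On the timezone point: the hours in the question's expected output (10, 16, 08) match UTC+5:30 (Asia/Kolkata), though that is only an assumption on my part. Pinning the session time zone makes hour()/from_unixtime independent of the cluster's default:
import org.apache.spark.sql.functions._

// Assumption: the expected hours correspond to Asia/Kolkata; change the zone as needed.
spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata")

val test = test1DF.withColumn("HOUR", hour(from_unixtime($"STIME" / 1000)))
test.show()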

adding time interval to a column in dataframe spark

Below is my dataframe.
import spark.implicits._
val lastRunDtDF = sc.parallelize(Seq(
(1, 2,"2019-07-18 13:34:24")
)).toDF("id", "cnt","run_date")
lastRunDtDF.show
+---+---+-------------------+
| id|cnt| run_date|
+---+---+-------------------+
| 1| 2|2019-07-18 13:34:24|
+---+---+-------------------+
I want to create a new dataframe with a new column new_run_date by adding 2 minutes to the existing run_date column. Sample output is below:
+---+---+-------------------+-------------------+
| id|cnt| run_date| new_run_date|
+---+---+-------------------+-------------------+
| 1| 2|2019-07-18 13:34:24|2019-07-18 13:36:24|
+---+---+-------------------+-------------------+
I am trying something like below
lastRunDtDF.withColumn("new_run_date",lastRunDtDF("run_date")+"INTERVAL 2 MINUTE")
Looks like it's not the right way. Thanks in advance for any help.
Try wrapping INTERVAL 2 MINUTE in the expr function:
import org.apache.spark.sql.functions.expr
lastRunDtDF.withColumn("new_run_date",lastRunDtDF("run_date") + expr("INTERVAL 2 MINUTE"))
.show()
Result:
+---+---+-------------------+-------------------+
| id|cnt| run_date| new_run_date|
+---+---+-------------------+-------------------+
| 1| 2|2019-07-18 13:34:24|2019-07-18 13:36:24|
+---+---+-------------------+-------------------+
Or, using the from_unixtime and unix_timestamp functions:
import org.apache.spark.sql.functions._
lastRunDtDF.selectExpr("*",
  "from_unixtime(unix_timestamp(run_date) + 2*60, 'yyyy-MM-dd HH:mm:ss') as new_run_date")
  .show()
Result:
+---+---+-------------------+-------------------+
| id|cnt| run_date| new_run_date|
+---+---+-------------------+-------------------+
| 1| 2|2019-07-18 13:34:24|2019-07-18 13:36:24|
+---+---+-------------------+-------------------+

Spark dataframe data aggregation

I have the below requirement to aggregate data in a Spark dataframe in Scala.
I have a spark dataframe with two columns.
mo_id sales
201601 11.01
201602 12.01
201603 13.01
201604 14.01
201605 15.01
201606 16.01
201607 17.01
201608 18.01
201609 19.01
201610 20.01
201611 21.01
201612 22.01
As shown above, the dataframe has two columns, 'mo_id' and 'sales'.
I want to add a new column (agg_sales) to the dataframe which should hold the cumulative sum of sales up to the current month, as shown below.
mo_id sales agg_sales
201601 10 10
201602 20 30
201603 30 60
201604 40 100
201605 50 150
201606 60 210
201607 70 280
201608 80 360
201609 90 450
201610 100 550
201611 110 660
201612 120 780
Description:
For the month 201603 agg_sales will be sum of sales from 201601 to 201603.
For the month 201604 agg_sales will be sum of sales from 201601 to 201604.
and so on.
Can anyone please help with this?
Versions used: Spark 1.6.2 and Scala 2.10.
You are looking for a cumulative sum which can be accomplished with a window function:
scala> val df = sc.parallelize(Seq((201601, 10), (201602, 20), (201603, 30), (201604, 40), (201605, 50), (201606, 60), (201607, 70), (201608, 80), (201609, 90), (201610, 100), (201611, 110), (201612, 120))).toDF("id","sales")
df: org.apache.spark.sql.DataFrame = [id: int, sales: int]
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val ordering = Window.orderBy("id")
ordering: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@75d454a4
scala> df.withColumn("agg_sales", sum($"sales").over(ordering)).show
16/12/27 21:11:35 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+------+-----+-------------+
| id|sales| agg_sales |
+------+-----+-------------+
|201601| 10| 10|
|201602| 20| 30|
|201603| 30| 60|
|201604| 40| 100|
|201605| 50| 150|
|201606| 60| 210|
|201607| 70| 280|
|201608| 80| 360|
|201609| 90| 450|
|201610| 100| 550|
|201611| 110| 660|
|201612| 120| 780|
+------+-----+-------------+
Note that I defined the ordering on the ids; you would probably want some sort of timestamp to order the summation.
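The WARN line above is the price of a global cumulative sum: with no partitionBy, Spark moves all rows into a single partition. If the real data has a natural grouping key, partitioning the window by it keeps the running sum per group and avoids that shuffle. A minimal sketch, assuming a hypothetical grouping column dept_id:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// "dept_id" is hypothetical; each department then gets its own cumulative sum.
val perGroup = Window.partitionBy("dept_id").orderBy("id")
val result = df.withColumn("agg_sales", sum($"sales").over(perGroup))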