Assume we have a DataFrame (df) defined below in PySpark. How can I use PySpark to get the duration between the first biking action and the last biking action within the same day, and save the results into a DataFrame with columns such as first_biking_timedetails, last_biking_timedetails, and duration_between_first_last? Notes: there can be other actions between the first and last biking action, and if there is only one biking action within a day we should not compute a duration, since the calculation is not possible (for example, date 3/3/18).
Below is the expected result for the date 3/01/2018:
duration_03_01 = 13:12 (last biking time) - 5:12 (first biking time) = 8 hours
Sample df below:
TimeDetails     Actions
3/1/18 5:12     Biking
3/1/18 6:12     Running
3/1/18 7:12     Swimming
3/1/18 8:12     Running
3/1/18 9:12     Swimming
3/1/18 10:12    Biking
3/1/18 11:12    Biking
3/1/18 12:12    Running
3/1/18 13:12    Biking
3/2/18 4:12     Biking
3/2/18 5:12     Swimming
3/2/18 6:12     Running
3/2/18 7:12     Biking
3/2/18 8:12     Running
3/3/18 4:16     Biking
3/4/18 5:13     Running
3/4/18 6:13     Biking
3/4/18 7:13     Running
3/4/18 8:13     Swimming
3/4/18 9:13     Running
3/4/18 10:13    Running
3/4/18 11:13    Biking
Some of my code:
df = spark.createDataFrame(
    [
        ('3/1/2018 5:12','Biking')
        ,('3/1/2018 6:12','Running')
        ,('3/1/2018 7:12','Swimming')
        ,('3/1/2018 8:12','Running')
        ,('3/1/2018 9:12','Swimming')
        ,('3/1/2018 10:12','Biking')
        ,('3/1/2018 11:12','Biking')
        ,('3/1/2018 12:12','Running')
        ,('3/1/2018 13:12','Biking')
        ,('3/2/2018 4:12','Biking')
        ,('3/2/2018 5:12','Swimming')
        ,('3/2/2018 6:12','Running')
        ,('3/2/2018 7:12','Biking')
        ,('3/2/2018 8:12','Running')
        ,('3/3/2018 4:16','Biking')
        ,('3/4/2018 5:13','Biking')
        ,('3/4/2018 6:13','Running')
        ,('3/4/2018 7:13','Running')
        ,('3/4/2018 8:13','Swimming')
        ,('3/4/2018 9:13','Running')
        ,('3/4/2018 10:13','Running')
        ,('3/4/2018 11:13','Biking')
    ], ['TimeDetails','Actions']
)
And sample output is below:
   First_Biking_time   action_1   Last_Biking_time   action_2   Durations_in_Hour
1  3/1/18 5:12         Biking     3/1/18 13:12       Biking     8
2  3/2/18 4:12         Biking     3/2/18 7:12        Biking     3
3  3/4/18 6:13         Biking     3/4/18 11:13       Biking     5
Can someone please provide me with some code in PySpark? Also, is there any way to solve the problem in Spark SQL as well?
Thank you.
Your df:
df = spark.createDataFrame(
[
('3/1/2018 5:12','Biking')
,('3/1/2018 6:12','Running')
,('3/1/2018 7:12','Swimming')
,('3/1/2018 8:12','Running')
,('3/1/2018 9:12','Swimming')
,('3/1/2018 10:12','Biking')
,('3/1/2018 11:12','Biking')
,('3/1/2018 12:12','Running')
,('3/1/2018 13:12','Biking')
,('3/2/2018 4:12','Biking')
,('3/2/2018 5:12','Swimming')
,('3/2/2018 6:12','Running')
,('3/2/2018 7:12','Biking')
,('3/2/2018 8:12','Running')
,('3/3/2018 4:16','Biking')
,('3/4/2018 5:13','Biking')
,('3/4/2018 6:13','Running')
,('3/4/2018 7:13','Running')
,('3/4/2018 8:13','Swimming')
,('3/4/2018 9:13','Running')
,('3/4/2018 10:13','Running')
,('3/4/2018 11:13','Biking')
], ['TimeDetails','Actions']
)
Using a window function; you can adapt this solution to other Actions as well:
from pyspark.sql import functions as F
from pyspark.sql import Window
df = df\
    .withColumn('TimeDetails', F.to_timestamp('TimeDetails', 'M/d/y H:m'))\
    .withColumn('date', F.to_date('TimeDetails'))

# One window per action and day; the unbounded frame lets first()/last()
# return the earliest and latest timestamp of that action on that day
w = Window.partitionBy('Actions', 'date').orderBy('TimeDetails')\
          .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
generic = df\
.withColumn('first_record', F.first(F.col('TimeDetails'), ignorenulls=True).over(w))\
.withColumn('last_record', F.last(F.col('TimeDetails'), ignorenulls=True).over(w))\
.withColumn('Durations_in_Hours',(F.unix_timestamp("last_record") - F.unix_timestamp('first_record'))/3600)\
.orderBy('TimeDetails')
biking = generic\
.filter(F.col('Actions') == 'Biking')\
.select(F.col('first_record').alias('First_Biking_time'),
F.col('Actions').alias('action_1'),
F.col('last_record').alias('Last_Biking_time'),
F.col('Actions').alias('action_2'),
F.col('Durations_in_Hours'))\
.dropDuplicates()\
.filter(F.col('Durations_in_Hours') != 0)\
.orderBy('First_Biking_time')
biking.show()
Output:
+-------------------+--------+-------------------+--------+------------------+
| First_Biking_time|action_1| Last_Biking_time|action_2|Durations_in_Hours|
+-------------------+--------+-------------------+--------+------------------+
|2018-03-01 05:12:00| Biking|2018-03-01 13:12:00| Biking| 8.0|
|2018-03-02 04:12:00| Biking|2018-03-02 07:12:00| Biking| 3.0|
|2018-03-04 05:13:00| Biking|2018-03-04 11:13:00| Biking| 6.0|
+-------------------+--------+-------------------+--------+------------------+
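The question also asks whether this can be done in Spark SQL. Below is a minimal sketch of an equivalent per-day aggregation, assuming the parsed DataFrame (with the date column added above) is registered as a temporary view; the view name activity and the HAVING-based filter for single-record days are illustrative choices, not part of the original answer:
df.createOrReplaceTempView('activity')

biking_sql = spark.sql('''
    SELECT MIN(TimeDetails) AS First_Biking_time,
           'Biking'         AS action_1,
           MAX(TimeDetails) AS Last_Biking_time,
           'Biking'         AS action_2,
           (UNIX_TIMESTAMP(MAX(TimeDetails)) - UNIX_TIMESTAMP(MIN(TimeDetails))) / 3600.0 AS Durations_in_Hours
    FROM activity
    WHERE Actions = 'Biking'
    GROUP BY date
    HAVING COUNT(*) > 1   -- days with only one biking record get no duration
    ORDER BY First_Biking_time
''')
biking_sql.show()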
Related
I created the following dataframe:
import pandas as pd
import databricks.koalas as ks
df = ks.DataFrame(
{'Date1': pd.date_range('20211101', '20211110', freq='1D'),
'Date2': pd.date_range('20201101', '20201110', freq='1D')})
df
Out[0]:
   Date1       Date2
0  2021-11-01  2020-11-01
1  2021-11-02  2020-11-02
2  2021-11-03  2020-11-03
3  2021-11-04  2020-11-04
4  2021-11-05  2020-11-05
5  2021-11-06  2020-11-06
6  2021-11-07  2020-11-07
7  2021-11-08  2020-11-08
8  2021-11-09  2020-11-09
9  2021-11-10  2020-11-10
When trying to get the minimum of Date1 I get the correct result:
df.Date1.min()
Out[1]:
Timestamp('2021-11-01 00:00:00')
Also, when trying to get the minimum values of each row the correct result is returned:
df.min(axis=1)
Out[2]:
0 2020-11-01
1 2020-11-02
2 2020-11-03
3 2020-11-04
4 2020-11-05
5 2020-11-06
6 2020-11-07
7 2020-11-08
8 2020-11-09
9 2020-11-10
dtype: datetime64[ns]
However, using the same functions on columns fails:
df.min(axis=0)
Out[3]:
Series([], dtype: float64)
Does anyone know why this is and if there's an elegant way around it?
Try this:
df.apply(min, axis=0)
Out[1]:
Date1 2021-11-01
Date2 2020-11-01
dtype: datetime64[ns]
This was indeed a bug in Koalas, but Koalas has since been merged into PySpark and the pandas-on-Spark API was born. More information here.
Using Spark 3.2.0 and above, one needs to replace
import databricks.koalas as ks
with
import pyspark.pandas as ps
and replace ks.DataFrame with ps.DataFrame. This completely eliminates the issue.
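A minimal sketch of the migrated example under that assumption (same data as in the question; only the import and the DataFrame constructor change):
import pandas as pd
import pyspark.pandas as ps  # replaces databricks.koalas on Spark 3.2+

psdf = ps.DataFrame(
    {'Date1': pd.date_range('20211101', '20211110', freq='1D'),
     'Date2': pd.date_range('20201101', '20201110', freq='1D')})

# Column-wise minimum now returns the expected datetime results
psdf.min(axis=0)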
I have the table below, where the streak increases when the activity_date is consecutive and resets to 1 when it is not.
Now I need to get the min and max activity_date of each group of streaks.
Using Spark and Scala or Spark SQL.
Input
floor activity_date streak
--------------------------------
floor1 2018-11-08 1
floor1 2019-01-24 1
floor1 2019-04-05 1
floor1 2019-04-08 1
floor1 2019-04-09 2
floor1 2019-04-14 1
floor1 2019-04-17 1
floor1 2019-04-20 1
floor2 2019-05-04 1
floor2 2019-05-05 2
floor2 2019-06-04 1
floor2 2019-07-28 1
floor2 2019-08-14 1
floor2 2019-08-22 1
Output
floor activity_date end_activity_date
----------------------------------------
floor1 2018-11-08 2018-11-08
floor1 2019-01-24 2019-01-24
floor1 2019-04-05 2019-04-05
floor1 2019-04-08 2019-04-09
floor1 2019-04-14 2019-04-14
floor1 2019-04-17 2019-04-17
floor1 2019-04-20 2019-04-20
floor2 2019-05-04 2019-05-05
floor2 2019-06-04 2019-06-04
floor2 2019-07-28 2019-07-28
floor2 2019-08-14 2019-08-14
floor2 2019-08-22 2019-08-22
You may use the following approach
Using Spark SQL
SELECT
floor,
activity_date,
MAX(activity_date) OVER (PARTITION BY gn,floor) as end_activity_date
FROM (
SELECT
*,
SUM(is_same_streak) OVER (
PARTITION BY floor ORDER BY activity_date
) as gn
FROM (
SELECT
*,
CASE
WHEN streak > LAG(streak,1,streak-1) OVER (
PARTITION BY floor
ORDER BY activity_date
) THEN 0
ELSE 1
END as is_same_streak
FROM
df
) t1
) t2
ORDER BY
"floor",
activity_date
View a working demo on db fiddle.
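To run the SQL above inside a Spark session rather than on db fiddle, register the input DataFrame as a temporary view whose name matches the table referenced in the query. A minimal PySpark sketch, assuming the input is held in a DataFrame variable named df and the query text is stored in a string named streak_query (both names are illustrative):
df.createOrReplaceTempView('df')              # 'df' is the table name used in the query above
spark.sql(streak_query).show(truncate=False)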
Using the Scala API
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val floorWindow = Window.partitionBy("floor").orderBy("activity_date")
val output = df.withColumn(
"is_same_streak",
when(
col("streak") > lag(col("streak"),1,col("streak")-1).over(floorWindow) , 0
).otherwise(1)
)
.withColumn(
"gn",
sum(col("is_same_streak")).over(floorWindow)
)
.select(
"floor",
"activity_date",
max(col("activity_date")).over(
Window.partitionBy("gn","floor")
).alias("end_activity_date")
)
Using the PySpark API
from pyspark.sql import functions as F
from pyspark.sql import Window
floorWindow = Window.partitionBy("floor").orderBy("activity_date")
output = (
df.withColumn(
"is_same_streak",
F.when(
F.col("streak") > F.lag(F.col("streak"),1,F.col("streak")-1).over(floorWindow) , 0
).otherwise(1)
)
.withColumn(
"gn",
F.sum(F.col("is_same_streak")).over(floorWindow)
)
.select(
"floor",
"activity_date",
F.max(F.col("activity_date")).over(
Window.partitionBy("gn","floor")
).alias("end_activity_date")
)
)
Let me know if this works for you.
I want to compare previous data with current data, month by month. I have data like below.
Data-set 1 : (Prev) Data-set 2 : (Latest)
Year-month Sum-count Year-Month Sum-count
-- -- 201808 48
201807 30 201807 22
201806 20 201806 20
201805 35 201805 20
201804 12 201804 9
201803 15 -- --
I have data sets as shown above. I want to compare both data sets based on the year-month column and the sum-count, and need to find the difference as a percentage.
I am using Spark 2.3.0 and Scala 2.11.
Here is my code:
import org.apache.spark.sql.functions.{lag, col, sum}
val mdf = spark.read.format("csv").
option("InferSchema","true").
option("header","true").
option("delimiter",",").
option("charset","utf-8").
load("c:\\test.csv")
mdf.createOrReplaceTempView("test")
val res = spark.sql("select `year-month`, SUM(`Sum-count`) as SUM_AMT from test d group by `year-month`")
val win = org.apache.spark.sql.expressions.Window.orderBy("year-month")
val res1 = res.withColumn("Prev_month", lag("SUM_AMT", 1, 0).over(win)).withColumn("percentage", col("Prev_month") / sum("SUM_AMT").over()).show()
I need output like this :
If the percentage is more than 10%, then I need to set the flag to F.
set1 cnt set2 cnt output(Percentage) Flag
201807 30 201807 22 7% T
201806 20 201806 20 0% T
201805 35 201805 20 57% F
Please help me on this.
It can be done in the following way:
import spark.implicits._
import org.apache.spark.sql.functions.{abs, when}

val data1 = List(
("201807", 30),
("201806", 20),
("201805", 35),
("201804", 12),
("201803", 15)
)
val data2 = List(
("201808", 48),
("201807", 22),
("201806", 20),
("201805", 20),
("201804", 9)
)
val df1 = data1.toDF("Year-month", "Sum-count")
val df2 = data2.toDF("Year-month", "Sum-count")
val joined = df1.alias("df1").join(df2.alias("df2"), "Year-month")
joined
.withColumn("output(Percentage)", abs($"df1.Sum-count" - $"df2.Sum-count").divide($"df1.Sum-count"))
.withColumn("Flag", when($"output(Percentage)" > 0.1, "F").otherwise("T"))
.show(false)
Output:
+----------+---------+---------+-------------------+----+
|Year-month|Sum-count|Sum-count|output(Percentage) |Flag|
+----------+---------+---------+-------------------+----+
|201807 |30 |22 |0.26666666666666666|F |
|201806 |20 |20 |0.0 |T |
|201805 |35 |20 |0.42857142857142855|F |
|201804 |12 |9 |0.25 |F |
+----------+---------+---------+-------------------+----+
Here's my solution:
val values1 = List(List("1201807", "30")
,List("1201806", "20") ,
List("1201805", "35"),
List("1201804","12"),
List("1201803","15")
).map(x =>(x(0), x(1)))
val values2 = List(List("201808", "48")
,List("1201807", "22") ,
List("1201806", "20"),
List("1201805","20"),
List("1201804","9")
).map(x =>(x(0), x(1)))
import spark.implicits._
import org.apache.spark.sql.functions._
val df1 = values1.toDF
val df2 = values2.toDF
df1.join(df2, Seq("_1"), "full").toDF("set", "cnt1", "cnt2")
.withColumn("percentage1", col("cnt1")/sum("cnt1").over() * 100)
.withColumn("percentage2", col("cnt2")/sum("cnt2").over() * 100)
.withColumn("percentage", abs(col("percentage2") - col("percentage1")))
.withColumn("flag", when(col("percentage") > 10, "F").otherwise("T")).na.drop().show()
Here's the result:
+-------+----+----+------------------+------------------+------------------+----+
| set|cnt1|cnt2| percentage1| percentage2| percentage|flag|
+-------+----+----+------------------+------------------+------------------+----+
|1201804| 12| 9|10.714285714285714| 7.563025210084033| 3.15126050420168| T|
|1201807| 30| 22|26.785714285714285|18.487394957983195| 8.29831932773109| T|
|1201806| 20| 20|17.857142857142858| 16.80672268907563|1.0504201680672267| T|
|1201805| 35| 20| 31.25| 16.80672268907563|14.443277310924369| F|
+-------+----+----+------------------+------------------+------------------+----+
I hope it helps :)
FileA has data like this, with start and end timestamps as the last two columns:
dataa, data1, 9:10, 9:15
datab, data2, 10:00, 10:10
datac, data3, 11:20, 11:30
datad, data4, 12:30, 12:40
FileB has data like this, with start and end timestamps as the last two columns:
dataaa, data11, 9:13, 9:17
databb, data22, 10:02, 10:08
datacc, data33, 6:20, 6:30
datadd, data44, 12:31, 12:35
A join between these two files should return the following rows from FileB:
databb, data22, 10:02, 10:08
datadd, data44, 12:31, 12:35
The criterion for the join is that the start time of FileB should be greater than the start time of FileA, and the end time of FileB should be less than the end time of FileA.
How do I write the code for this in spark-sql?
You can create a common schema for both files, as the structure of both files is the same:
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val schema = StructType(Array("col1", "col2", "start", "end").map(StructField(_, StringType, true)))
Then you can read the first file (fileA) into a dataframe:
val fileAdf = sqlContext.read.schema(schema).csv("path to fileA")
//+-----+------+------+------+
//|col1 |col2 |start |end |
//+-----+------+------+------+
//|dataa| data1| 9:10 | 9:15 |
//|datab| data2| 10:00| 10:10|
//|datac| data3| 11:20| 11:30|
//|datad| data4| 12:30| 12:40|
//+-----+------+------+------+
Similarly you can read the second file (fileB):
val fileBdf = sqlContext.read.schema(schema).csv("path to fileB")
//+------+-------+------+------+
//|col1 |col2 |start |end |
//+------+-------+------+------+
//|dataaa| data11| 9:13 | 9:17 |
//|databb| data22| 10:02| 10:08|
//|datacc| data33| 6:20 | 6:30 |
//|datadd| data44| 12:31| 12:35|
//+------+-------+------+------+
Just apply the same logic you explained in the question, using the DataFrame API:
import org.apache.spark.sql.functions._
fileBdf.as("fileB").join(fileAdf.as("fileA"), col("fileB.start") > col("fileA.start") && col("fileB.end") < col("fileA.end"))
.select(col("fileB.col1"), col("fileB.col2"), col("fileB.start"), col("fileB.end"))
which should give you
+------+-------+------+------+
|col1 |col2 |start |end |
+------+-------+------+------+
|databb| data22| 10:02| 10:08|
|datadd| data44| 12:31| 12:35|
+------+-------+------+------+
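Since the question explicitly asks for spark-sql, here is a minimal sketch of the same join expressed as a SQL query, assuming the two files have been read into DataFrames named fileAdf and fileBdf (the PySpark equivalents of the Scala vals above) and registered as temporary views; the view names fileA and fileB are illustrative:
fileAdf.createOrReplaceTempView('fileA')
fileBdf.createOrReplaceTempView('fileB')

result = spark.sql('''
    SELECT b.col1, b.col2, b.start, b.`end`
    FROM fileB b
    JOIN fileA a
      ON b.start > a.start
     AND b.`end` < a.`end`
''')
result.show()
# Note: like the DataFrame version above, this compares the raw time strings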
I hope the answer is helpful
I have data from 1st Jan 2017 to 7th Jan 2017 (one week) and I want a weekly aggregate. I used the window function in the following manner:
val df_v_3 = df_v_2.groupBy(window(col("DateTime"), "7 day"))
.agg(sum("Value") as "aggregate_sum")
.select("window.start", "window.end", "aggregate_sum")
I have data in the dataframe as:
DateTime,value
2017-01-01T00:00:00.000+05:30,1.2
2017-01-01T00:15:00.000+05:30,1.30
--
2017-01-07T23:30:00.000+05:30,1.43
2017-01-07T23:45:00.000+05:30,1.4
I am getting output as:
2016-12-29T05:30:00.000+05:30,2017-01-05T05:30:00.000+05:30,723.87
2017-01-05T05:30:00.000+05:30,2017-01-12T05:30:00.000+05:30,616.74
It shows that my window starts from 29th Dec 2016, but the actual data starts from 1st Jan 2017. Why is this offset occurring?
For tumbling windows like this it is possible to set an offset to the starting time; more information can be found in the blog here. A sliding window is used; however, by setting both the window duration and the sliding duration to the same value, it will behave like a tumbling window with a starting offset.
The syntax is as follows:
window(column, window duration, sliding duration, starting offset)
With your values I found that an offset of 64 hours would give a starting time of 2017-01-01 00:00:00.
import spark.implicits._
import org.apache.spark.sql.functions.{window, sum, to_timestamp, col}

val data = Seq(("2017-01-01 00:00:00",1.0),
("2017-01-01 00:15:00",2.0),
("2017-01-08 23:30:00",1.43))
val df = data.toDF("DateTime","value")
.withColumn("DateTime", to_timestamp($"DateTime", "yyyy-MM-dd HH:mm:ss"))
val df2 = df
.groupBy(window(col("DateTime"), "1 week", "1 week", "64 hours"))
.agg(sum("value") as "aggregate_sum")
.select("window.start", "window.end", "aggregate_sum")
Will give this resulting dataframe:
+-------------------+-------------------+-------------+
| start| end|aggregate_sum|
+-------------------+-------------------+-------------+
|2017-01-01 00:00:00|2017-01-08 00:00:00| 3.0|
|2017-01-08 00:00:00|2017-01-15 00:00:00| 1.43|
+-------------------+-------------------+-------------+
The solution with the Python API looks a bit more intuitive, since the window function takes the following options:
window(timeColumn, windowDuration, slideDuration=None, startTime=None)
see:
https://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/functions.html
The startTime is the offset with respect to 1970-01-01 00:00:00 UTC
with which to start window intervals. For example, in order to have
hourly tumbling windows that start 15 minutes past the hour, e.g.
12:15-13:15, 13:15-14:15... provide startTime as 15 minutes.
No need for a workaround with sliding duration; I used a 3-day "delay" as startTime to match the desired tumbling window (window boundaries are aligned to the Unix epoch, 1970-01-01, which was a Thursday, so a 3-day offset makes the weekly windows start on a Sunday such as 2017-01-01):
from datetime import datetime
from pyspark.sql.functions import sum, window
df_ex = spark.createDataFrame([(datetime(2017,1,1, 0,0) , 1.), \
(datetime(2017,1,1,0,15) , 2.), \
(datetime(2017,1,8,23,30) , 1.43)], \
["Datetime", "value"])
weekly_ex = df_ex \
.groupBy(window("Datetime", "1 week", startTime="3 day" )) \
.agg(sum("value").alias('aggregate_sum'))
weekly_ex.show(truncate=False)
For the same result:
+------------------------------------------+-------------+
|window |aggregate_sum|
+------------------------------------------+-------------+
|[2017-01-01 00:00:00, 2017-01-08 00:00:00]|3.0 |
|[2017-01-08 00:00:00, 2017-01-15 00:00:00]|1.43 |
+------------------------------------------+-------------+