dataframe spark scala take the (MAX-MIN) for each group - scala

I have a dataframe from a processing step that looks like:
+---------+------+-----------+
|Time |group |value |
+---------+------+-----------+
| 28371| 94| 906|
| 28372| 94| 864|
| 28373| 94| 682|
| 28374| 94| 574|
| 28383| 95| 630|
| 28384| 95| 716|
| 28385| 95| 913|
I would like to take (value at the max time - value at the min time) for each group, to get this result:
+------+-----------+
|group | value |
+------+-----------+
| 94| -332|
| 95| 283|
Thank you in advance for the help

df.groupBy("groupCol").agg(max("value")-min("value"))
Based on the question edit by the OP, here is a way to do this in PySpark. The idea is to compute the row numbers in ascending and descending order of time per group and use those values for subtraction.
from pyspark.sql import Window
from pyspark.sql import functions as func
w_asc = Window.partitionBy(df.groupCol).orderBy(df.time)
w_desc = Window.partitionBy(df.groupCol).orderBy(func.desc(df.time))
df = df.withColumn('rnum_asc', func.row_number().over(w_asc)) \
       .withColumn('rnum_desc', func.row_number().over(w_desc))
df.groupBy(df.groupCol) \
  .agg((func.max(func.when(df.rnum_desc == 1, df.value)) -
        func.max(func.when(df.rnum_asc == 1, df.value))).alias('diff')).show()
It would have been easier if the window function first_value were available in Spark SQL. A generic way to solve this in SQL is:
select distinct groupCol, diff
from (
    select t.*,
           first_value(value) over (partition by groupCol order by time desc) -
           first_value(value) over (partition by groupCol order by time) as diff
    from tbl t
) t
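As a shorter alternative (a sketch, not part of the original answer): in the DataFrame API, first and last can themselves be used as window functions over a frame that spans the whole group. Column names (groupCol, time, value) follow the answer above.
from pyspark.sql import Window
from pyspark.sql import functions as func

# frame covering every row of the group, ordered by time
w = (Window.partitionBy('groupCol')
           .orderBy('time')
           .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

# value at the max time minus value at the min time; every row of a group gets the
# same result, hence the distinct()
df.select(
    'groupCol',
    (func.last('value').over(w) - func.first('value').over(w)).alias('diff')
).distinct().show()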

Related

pyspark - aggregation

Say, I have a dataframe as below
mid | bid | m_date1    | m_date2 | m_date3    |
100 | ws  |            |         | 2022-02-01 |
200 | gs  | 2022-02-01 |         |            |
Now I have an SQL aggregation as below:
SELECT
    mid,
    bid,
    min(NEXT(m_date1, 'SAT')) as dat1,
    min(NEXT(m_date2, 'SAT')) as dat2,
    min(NEXT(m_date3, 'SAT')) as dat3
FROM df
GROUP BY 1, 2
I am looking to implement the above aggregation in PySpark, but I am wondering if I can use some form of iteration to produce dat1, dat2 and dat3, since the same 'min' function is applied to each of those columns. I could use the aggregation syntax below in PySpark for each column, but I want to avoid repeating the 'min' call for every aggregated column.
df.groupBy('mid','bid').agg(...)
Thank you
A sample output would have been helpful. If I understood you correctly, you are after:
from pyspark.sql.functions import min

df.groupBy('mid', 'bid').agg(*[min(i).alias(f"min{i}") for i in df.drop('mid', 'bid').columns]).show()
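If the NEXT(m_date, 'SAT') part has to be reproduced as well, Spark's next_day function is the likely equivalent (an assumption about what NEXT means in the original SQL), and it can be folded into the same comprehension:
from pyspark.sql import functions as F

# every column except the grouping keys is treated as a date column (assumption)
date_cols = [c for c in df.columns if c not in ('mid', 'bid')]

df.groupBy('mid', 'bid').agg(
    *[F.min(F.next_day(F.col(c), 'Sat')).alias(f"dat{i+1}") for i, c in enumerate(date_cols)]
).show()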

Return the rows of a dataframe that satisfy one condition while fixing the values of another column

I have a dataframe that looks like this:
Genres | Year | Number_Movies
Drama |2015 | 705
Romance|2015 | 203
Comedy |2015 | 586
Drama |2014 | 605
Romance|2014 | 293
Comedy |2014 | 786
I would like to return, for each year, the genre that has the maximum number of movies:
Genres | Year | Number_Movies
Drama |2015 | 705
Comedy |2014 | 786
Please help if possible. Thanks a lot.
Here are a few options that can solve this:
df = spark.createDataFrame([('Drama',2015,705),('Romance',2015,203),('Comedy',2015,586),('Drama',2014,605),('Romance',2014,293),('Comedy ',2014,786)],['Genres','Year','Number_Movies'])
First option: define a rank using a window function (partition by Year, order by Number_Movies descending). The highest Number_Movies in each year gets rank 1.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number,desc
w = Window.partitionBy("Year").orderBy(desc("Number_Movies"))
rank = row_number().over(w).alias('rank')
df.withColumn("rank", rank)\
.where("rank=1")\
.drop("rank")\
.show()
#+-------+----+-------------+
#| Genres|Year|Number_Movies|
#+-------+----+-------------+
#|Comedy |2014| 786|
#| Drama|2015| 705|
#+-------+----+-------------+
Second option: get the maximum of Number_Movies for each year and self-join with the dataframe to get the Genres.
from pyspark.sql.functions import max,col
joining_condition = [col('a.Year') == col('b.Year'), col('a.max_Number_Movies') == col('b.Number_Movies')]
df.groupBy("Year").\
agg(max("Number_Movies").alias("max_Number_Movies")).alias("a").\
join(df.alias("b"), joining_condition).\
selectExpr("b.Genres","b.Year","b.Number_Movies").\
show()
#+-------+----+-------------+
#| Genres|Year|Number_Movies|
#+-------+----+-------------+
#|Comedy |2014| 786|
#| Drama|2015| 705|
#+-------+----+-------------+
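A third option (a sketch, not part of the original answer) avoids both the window and the join: Spark orders struct columns field by field, so the max of struct(Number_Movies, Genres) per Year carries the winning genre along.
from pyspark.sql.functions import col, max as max_, struct

(df.groupBy("Year")
   .agg(max_(struct(col("Number_Movies"), col("Genres"))).alias("top"))  # argmax via struct ordering
   .select(col("top.Genres").alias("Genres"), "Year", col("top.Number_Movies").alias("Number_Movies"))
   .show())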

Calculating and aggregating data by date/time

I am working with a dataframe like this:
Id | TimeStamp | Event | DeviceId
1 | 5.2.2019 8:00:00 | connect | 1
2 | 5.2.2019 8:00:05 | disconnect| 1
I am using Databricks and PySpark for the ETL process. How can I compute and build a dataframe like the one shown below? I have already tried a UDF but could not get it to work, and iterating over the whole dataframe is extremely slow.
I want to aggregate this dataframe into a new one that tells me, for each device, when and for how long it was connected or disconnected:
Id | StartDateTime | EndDateTime | EventDuration |State | DeviceId
1 | 5.2.19 8:00:00 | 5.2.19 8:00:05| 0.00:00:05 |connected| 1
I think you can make this work with a window function and a few additional columns created with withColumn. The code below builds, per device, a table with the duration of each state; the only requirement is that connect and disconnect events alternate. First, create some test data:
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql.window import Window
import datetime
# create a dataframe with 4 events for 1 device
test_df = sqlContext.createDataFrame(
    [(1, datetime.datetime(2019, 2, 5, 8), "connect", 1),
     (2, datetime.datetime(2019, 2, 5, 8, 0, 5), "disconnect", 1),
     (3, datetime.datetime(2019, 2, 5, 8, 10), "connect", 1),
     (4, datetime.datetime(2019, 2, 5, 8, 20), "disconnect", 1)],
    ["Id", "TimeStamp", "Event", "DeviceId"])
test_df.show()
Output:
+---+-------------------+----------+--------+
| Id| TimeStamp| Event|DeviceId|
+---+-------------------+----------+--------+
| 1|2019-02-05 08:00:00| connect| 1|
| 2|2019-02-05 08:00:05|disconnect| 1|
| 3|2019-02-05 08:10:00| connect| 1|
| 4|2019-02-05 08:20:00|disconnect| 1|
+---+-------------------+----------+--------+
Then you can create the helper column expressions and the window:
my_window = Window.partitionBy("DeviceId").orderBy(col("TimeStamp").desc())  # create the window
get_prev_time = lag(col("TimeStamp"), 1).over(my_window)  # previous row in the descending window, i.e. the next timestamp in time
time_diff = get_prev_time.cast("long") - col("TimeStamp").cast("long")  # duration in seconds

result_df = (test_df
    .withColumn("EventDuration", time_diff)           # apply the helper expressions
    .withColumn("EndDateTime", get_prev_time)
    .withColumnRenamed("TimeStamp", "StartDateTime")  # rename according to your schema
    .withColumn("State", when(col("Event") == "connect", "connected").otherwise("disconnected"))  # create the state column
    .filter(col("EventDuration").isNotNull())         # drop the last event per device, which has no following timestamp
    .select("Id", "StartDateTime", "EndDateTime", "EventDuration", "State", "DeviceId"))

result_df.show()
Output:
+---+-------------------+-------------------+-------------+------------+--------+
| Id| StartDateTime| EndDateTime|EventDuration| State|DeviceId|
+---+-------------------+-------------------+-------------+------------+--------+
| 3|2019-02-05 08:10:00|2019-02-05 08:20:00| 600| connected| 1|
| 2|2019-02-05 08:00:05|2019-02-05 08:10:00| 595|disconnected| 1|
| 1|2019-02-05 08:00:00|2019-02-05 08:00:05| 5| connected| 1|
+---+-------------------+-------------------+-------------+------------+--------+
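The desired output formats the duration as 0.00:00:05 (days.HH:MM:SS) rather than raw seconds. A possible sketch for that last formatting step, building on result_df from the snippet above and assuming EventDuration holds whole seconds:
from pyspark.sql.functions import col, floor, format_string

formatted_df = result_df.withColumn(
    "EventDuration",
    format_string(
        "%d.%02d:%02d:%02d",
        floor(col("EventDuration") / 86400),           # days
        floor((col("EventDuration") % 86400) / 3600),  # hours
        floor((col("EventDuration") % 3600) / 60),     # minutes
        col("EventDuration") % 60                      # seconds
    )
)
formatted_df.show()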

trim dataframe in spark using last appearance of value in the column

I have a dataframe that I want to trim at the last appearance of the value Good in column PDP. That means keeping rows up to and including row 5 (the last Good); anything after row 5 does not matter.
+------+----+
|custId| PDP|
| 1001| New|
| 1002|Good|
| 1003| New|
| 1004| New|
| 1005|Good|
| 1006| New|
| 1007| New|
| 1008| New|
| 1009| New|
+------+----+
What I need is this dataframe, since the last Good appears in row 5:
+------+----+
|custId| PDP|
| 1001| New|
| 1002|Good|
| 1003| New|
| 1004| New|
| 1005|Good|
+------+----+
You can try:
import org.apache.spark.sql.functions.max
import spark.implicits._

df
  .filter($"PDP" === "Good")             // filter Good rows
  .select(max("custId").alias("maxId"))  // find the max (last) custId among them
  .crossJoin(df)
  .where($"custId" <= $"maxId")          // keep records with id <= lastGoodId
  .drop("maxId")                         // remove the obsolete column
You have to find the last row index with Good in the PDP column, and then keep only the rows up to and including that index.
custId
If your custId column contains increasing ids in sorted order, then you can do the following:
import org.apache.spark.sql.functions._
val maxIdToFilter = df.filter(lower(col("PDP")) === "good").select(max(col("custId").cast("long"))).first().getLong(0)
df.filter(col("custId") <= maxIdToFilter).show(false)
monotonically_increasing_id
If your custId is not sorted in increasing order, then you can use the following logic:
import org.apache.spark.sql.functions._
val dfWithRow = df.withColumn("rowNo", monotonically_increasing_id())
val maxIdToFilter = dfWithRow.filter(lower(col("PDP")) === "good").select(max("rowNo")).first().getLong(0)
dfWithRow.filter(col("rowNo") <= maxIdToFilter).drop("rowNo").show(false)
I hope the answer is helpful

How to drop duplicates using conditions [duplicate]

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 5 years ago.
I have the following DataFrame df:
How can I delete the duplicates, while keeping the minimum value of level for each duplicated pair of item_id and country_id?
+-----------+----------+---------------+
|item_id |country_id|level |
+-----------+----------+---------------+
| 312330| 13535670| 82|
| 312330| 13535670| 369|
| 312330| 13535670| 376|
| 319840| 69731210| 127|
| 319840| 69730600| 526|
| 311480| 69628930| 150|
| 311480| 69628930| 138|
| 311480| 69628930| 405|
+-----------+----------+---------------+
The expected output:
+-----------+----------+---------------+
|item_id |country_id|level |
+-----------+----------+---------------+
| 312330| 13535670| 82|
| 319840| 69731210| 127|
| 319840| 69730600| 526|
| 311480| 69628930| 138|
+-----------+----------+---------------+
I know how to delete duplicates without conditions using dropDuplicates, but I don't know how to do it for my particular case.
One of the methods is to use orderBy (the default is ascending order), groupBy and the aggregation function first:
import org.apache.spark.sql.functions.first
df.orderBy("level").groupBy("item_id", "country_id").agg(first("level").as("level")).show(false)
You can define the order explicitly by using .asc for ascending and .desc for descending, as below:
df.orderBy($"level".asc).groupBy("item_id", "country_id").agg(first("level").as("level")).show(false)
And you can do the operation using a window and the row_number function too, as below:
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy("item_id", "country_id").orderBy($"level".asc)
import org.apache.spark.sql.functions.row_number
df.withColumn("rank", row_number().over(windowSpec)).filter($"rank" === 1).drop("rank").show()