Average function in PySpark dataframe

I have a dataframe shown below
A user supplies a value, and I want to calculate the average of the second number in the tuple over all the rows at or above that particular value.
Example: let's say the value is 10. I want to take all the rows whose value in the "value" column is greater than or equal to 10 and calculate the average over those rows. In this case it picks up the first two rows, and the output will be as shown below.
Can someone help me with this please?

Another option: You can filter the data frame first and then calculate the average; Use getItem method to access the value1 field in the struct column:
import pyspark.sql.functions as f

(df.filter(df.value >= 10)
   .agg(f.avg(df.tuple.getItem('value1')).alias('Avg'),
        f.lit(10).alias('value'))
   .show())
+------+-----+
| Avg|value|
+------+-----+
|2200.0| 10|
+------+-----+

Related

Group by and save the max value with overlapping columns in scala spark

I have data that looks like this:
id,start,expiration,customerid,content
1,13494,17358,0001,whateveriwanthere
2,14830,28432,0001,somethingelsewoo
3,11943,19435,0001,yes
4,39271,40231,0002,makingfakedata
5,01321,02143,0002,morefakedata
In the data above, I want to group by customerid for overlapping start and expiration (essentially just merge intervals). I am doing this successfully by grouping by the customer id, then aggregating with first("start") and max("expiration").
df.groupBy("customerid").agg(first("start"), max("expiration"))
However, this drops the id column entirely. I want to save the id of the row that had the max expiration. For instance, I want my output to look like this:
id,start,expiration,customerid
2,11943,28432,0001
4,39271,40231,0002
5,01321,02143,0002
I am not sure how to add that id column for whichever row had the maximum expiration.
You can use a cumulative conditional sum along with the lag function to define a group column that flags rows that overlap. Then, simply group by customerid + group and get the min start and max expiration. To get the id value associated with the max expiration date, you can use a trick with struct ordering:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("customerid").orderBy("start")

val result = df
  .withColumn(
    "group",
    sum(
      when(
        col("start").between(lag("start", 1).over(w), lag("expiration", 1).over(w)),
        0
      ).otherwise(1)
    ).over(w)
  )
  .groupBy("customerid", "group")
  .agg(
    min(col("start")).as("start"),
    max(struct(col("expiration"), col("id"))).as("max")
  )
  .select("max.id", "customerid", "start", "max.expiration")

result.show
//+---+----------+-----+----------+
//| id|customerid|start|expiration|
//+---+----------+-----+----------+
//| 5| 0002|01321| 02143|
//| 4| 0002|39271| 40231|
//| 2| 0001|11943| 28432|
//+---+----------+-----+----------+
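For context on the struct-ordering trick used above: Spark compares structs field by field, left to right, so max(struct(col("expiration"), col("id"))) returns the struct of the row with the largest expiration, and the id can then be read out of that struct. A minimal, self-contained sketch (the toy values are made up purely for illustration, and a SparkSession with implicits in scope is assumed):

import org.apache.spark.sql.functions.{col, max, struct}
import spark.implicits._

// hypothetical toy data: three rows with an id and an expiration
val demo = Seq((1, 10), (2, 30), (3, 20)).toDF("id", "expiration")

// max over the struct compares "expiration" first, so the row with expiration = 30 wins
demo.agg(max(struct(col("expiration"), col("id"))).as("m"))
    .select("m.id", "m.expiration")
    .show()
// +---+----------+
// | id|expiration|
// +---+----------+
// |  2|        30|
// +---+----------+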

Check the minimum by iterating one row in a dataframe over all the rows in another dataframe

Let's say I have the following two dataframes:
DF1:
+----------+----------+----------+
| Place|Population| IndexA|
+----------+----------+----------+
| A| Int| X_A|
| B| Int| X_B|
| C| Int| X_C|
+----------+----------+----------+
DF2:
+----------+----------+
| City| IndexB|
+----------+----------+
| D| X_D|
| E| X_E|
| F| X_F|
| ....| ....|
| ZZ| X_ZZ|
+----------+----------+
The dataframes above are normally of much larger size.
I want to determine, for every Place in DF1, the City in DF2 at the shortest distance. The distance can be calculated based on the index. So for every row in DF1, I have to iterate over every row in DF2 and look for the shortest distance based on a calculation with the indexes. For the distance calculation there is a function defined:
val distance = udf(
  (indexA: Long, indexB: Long) => {
    h3.instance.h3Distance(indexA, indexB)
  })
I tried the following:
val output = DF1.agg(functions.min(distance(col("IndexA"), DF2.col("IndexB"))))
The code compiles, but I get the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s)
H3Index#220L missing from Places#316,Population#330,IndexAx#338L in operator !Aggregate
[min(if ((isnull(IndexA#338L) OR isnull(IndexB#220L))) null else
UDF(knownnotnull(IndexA#338L), knownnotnull(IndexB#220L))) AS min(UDF(IndexA, IndexB))#346].
So I suppose I am doing something wrong when iterating over each row in DF2 for a given row of DF1, but I couldn't find a solution.
What am I doing wrong? Am I heading in the right direction?
You are getting this error because the index column you are using only exists in DF2, not in DF1 where you are attempting to perform the aggregation.
In order to make this field accessible and determine the distance from all points, you would need to:
1) Cross join DF1 and DF2 so that every index of DF1 is matched with every index of DF2
2) Determine the distance using your udf
3) Find the min on this new cross-joined dataframe with the distances
This may look like:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, min, udf}

val distance = udf(
  (indexA: Long, indexB: Long) => {
    h3.instance.h3Distance(indexA, indexB)
  })

val resultDF = DF1.crossJoin(DF2)
  .withColumn("distance", distance(col("IndexA"), col("IndexB")))
  // Instead of using a groupBy and then matching the min distance of the aggregation with
  // the initial df, a window min determines the min_distance of each group (determined by
  // Place), and we filter for the city with the min distance to each place.
  .withColumn("min_distance", min("distance").over(Window.partitionBy("Place")))
  .where(col("distance") === col("min_distance"))
  .drop("min_distance")
This will result in a dataframe with the columns from both dataframes and an additional distance column.
NB: Comparing every item in one df to every item in another df is an expensive operation. If you have the opportunity to filter early (e.g. joining on heuristic columns, i.e. other columns which may indicate that a place is likely to be close to a city), this is recommended.
Let me know if this works for you.
If you have only a few cities (fewer than or around 1000), you can avoid the crossJoin and the Window shuffle by collecting the cities into an array and then performing the distance computation for each place against this collected array:
import org.apache.spark.sql.functions.{array_min, col, struct, transform, typedLit, udf}

val citiesIndexes = df2.select("City", "IndexB")
  .collect()
  .map(row => (row.getString(0), row.getLong(1)))

val result = df1.withColumn(
  "City",
  array_min(
    transform(
      typedLit(citiesIndexes),
      x => struct(distance(col("IndexA"), x.getItem("_2")), x.getItem("_1"))
    )
  ).getItem("col2")
)
This piece of code works for Spark 3.0 and above. If you are on a Spark version earlier than 3.0, you should replace the array_min(...).getItem("col2") part with a user-defined function.
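A minimal sketch of such a user-defined function for Spark versions earlier than 3.0, reusing the citiesIndexes array and the h3 distance call from above (closestCity is an illustrative name, not part of the original code):

// Sketch for Spark < 3.0: scan the collected cities inside a UDF and return the closest one.
// Assumes citiesIndexes: Array[(String, Long)] as built above.
import org.apache.spark.sql.functions.{col, udf}

val closestCity = udf((indexA: Long) =>
  citiesIndexes.minBy { case (_, indexB) => h3.instance.h3Distance(indexA, indexB) }._1
)

val resultPre3 = df1.withColumn("City", closestCity(col("IndexA")))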

Implementing retain functionality of SAS in pyspark code

I am trying to convert a piece of SAS code containing the retain functionality and multiple if-else statements to PySpark. I had no luck when I tried to search for similar answers.
Input Dataset:
+---------+---------+----+
|Prod_Code|     Rate|Rank|
+---------+---------+----+
|ADAMAJ091|1234.0091|   1|
|ADAMAJ091|1222.0001|   2|
|ADAMAJ091|1222.0000|   3|
|BASSDE012|5221.0123|   1|
|BASSDE012|5111.0022|   2|
|BASSDE012|5110.0000|   3|
+---------+---------+----+
I have calculated the Rank column using df.withColumn("Rank", row_number().over(Window.partitionBy('Prod_Code').orderBy('Rate')))
The Rate value at Rank 1 must be replicated to all the other rows in the partition (Ranks 1 to N).
Expected Output Dataset:
+---------+---------+----+
|Prod_Code|     Rate|Rank|
+---------+---------+----+
|ADAMAJ091|1234.0091|   1|
|ADAMAJ091|1234.0091|   2|
|ADAMAJ091|1234.0091|   3|
|BASSDE012|5221.0123|   1|
|BASSDE012|5221.0123|   2|
|BASSDE012|5221.0123|   3|
+---------+---------+----+
The Rate value present at Rank = 1 must be replicated to all other rows in the same partition. This is the retain functionality, and I need help replicating it in PySpark.
I tried using a df.withColumn() approach for individual rows, but I was not able to achieve this in PySpark.
Since you already have the Rank column, you can use the first function to get the first Rate value in a window ordered by Rank.
from pyspark.sql.functions import first
from pyspark.sql.window import Window
df = df.withColumn('Rate', first('Rate').over(Window.partitionBy('prod_code').orderBy('rank')))
df.show()
# +---------+---------+----+
# |Prod_Code| Rate|Rank|
# +---------+---------+----+
# |BASSDE012|5221.0123| 1|
# |BASSDE012|5221.0123| 2|
# |BASSDE012|5221.0123| 3|
# |ADAMAJ091|1234.0091| 1|
# |ADAMAJ091|1234.0091| 2|
# |ADAMAJ091|1234.0091| 3|
# +---------+---------+----+

Dropping entries of close timestamps

I would like to drop all records which are duplicate entries apart from a difference in the timestamp. The offset could be any amount of time, but for simplicity I will use 2 minutes.
+-------------------+-----+----+
|Date |ColA |ColB|
+-------------------+-----+----+
|2017-07-04 18:50:21|ABC |DEF |
|2017-07-04 18:50:26|ABC |DEF |
|2017-07-04 18:50:21|ABC |KLM |
+-------------------+-----+----+
I would like my dataframe to have only these rows:
+-------------------+-----+----+
|Date |ColA |ColB|
+-------------------+-----+----+
|2017-07-04 18:50:26|ABC |DEF |
|2017-07-04 18:50:21|ABC |KLM |
+-------------------+-----+----+
I tried something like this, but it does not remove the duplicates:
val joinedDfNoDuplicates = joinedDFTransformed.as("df1")
  .join(joinedDFTransformed.as("df2"),
    col("df1.ColA") === col("df2.ColA") &&
    col("df1.ColB") === col("df2.ColB") &&
    abs(unix_timestamp(col("Date")) - unix_timestamp(col("Date"))) > offset
  )
For now, I am just selecting distinct or doing a group-by min on the data based on certain columns (as in Find minimum for a timestamp through Spark groupBy dataframe), but I would like a more robust solution. The reason is that data outside of that interval may still be valid data. Also, the offset could be changed, so it might be within 5 seconds or 5 minutes depending on requirements.
Somebody mentioned creating a UDF that compares the dates when all other columns are the same, but I am not sure exactly how to do that so that I could either filter out rows, or add a flag and then remove those rows. Any help would be greatly appreciated.
A similar SQL question is here: Duplicate entries with different timestamp
Thanks!
I would do it like this:
1) Define a Window ordering the dates, partitioned by a dummy column.
2) Add the dummy column, with a constant value in it.
3) Add a new column containing the date of the previous record.
4) Calculate the difference between the date and the previous date.
5) Filter your records based on the value of the difference.
The code can be something like the following:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

val w = Window.partitionBy("dummy").orderBy("Date") // step 1

df.withColumn("dummy", lit(1)) // step 2
  .withColumn("previousDate", lag($"Date", 1) over w) // step 3
  .withColumn("difference", unix_timestamp($"Date") - unix_timestamp($"previousDate")) // step 4
The solution above is valid if you have pairs of records that might be close in time. If more than two records can be close to each other, you can compare each record to the first record (not the previous one) in the window, so instead of using lag($"Date", 1) you use first($"Date"). In that case the difference column contains the difference in time between the current record and the first record in the window.
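A sketch of that variant, reusing the window w and the dummy column from above:

df.withColumn("dummy", lit(1))
  .withColumn("firstDate", first($"Date") over w)
  .withColumn("difference", unix_timestamp($"Date") - unix_timestamp($"firstDate"))
// the first record of each window now has difference = 0, and later records hold their
// offset from that first record, so a filter like the one in step 5 can then be applied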

How to get few rows from Spark data frame on basis of some condition

I have a data frame like below with three columns:
+---+-----------+-----------+
| id|visit_class|    in_date|
+---+-----------+-----------+
|  1|     Non Hf|24-SEP-2017|
|  1|     Non Hf|23-SEP-2017|
|  1|         Hf|27-SEP-2017|
|  1|     Non Hf|28-SEP-2017|
|  2|     Non Hf|24-SEP-2017|
|  2|         Hf|25-SEP-2017|
+---+-----------+-----------+
I want to group this data frame on id, sort the grouped data on the in_date column, and keep only those rows starting from the first occurrence of Hf. The output will be like below: the first 2 rows are dropped for id = 1 and the first row is dropped for id = 2.
+---+-----------+-----------+
| id|visit_class|    in_date|
+---+-----------+-----------+
|  1|         Hf|27-SEP-2017|
|  1|     Non Hf|28-SEP-2017|
|  2|         Hf|25-SEP-2017|
+---+-----------+-----------+
How can I achieve this in Spark and Scala?
Steps:
1) Create the WindowSpec, partitioned by id and ordered by date.
2) Create a cumulative sum indicating whether Hf has appeared yet, and then filter based on that condition:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("id").orderBy(to_date($"in_date", "dd-MMM-yyyy"))

(df.withColumn("rn", sum(when($"visit_class" === "Hf", 1).otherwise(0)).over(w))
   .filter($"rn" >= 1)
   .drop("rn")
   .show)
+---+-----------+-----------+
| id|visit_class| in_date|
+---+-----------+-----------+
| 1| Hf|27-SEP-2017|
| 1| Non Hf|28-SEP-2017|
| 2| Hf|25-SEP-2017|
+---+-----------+-----------+
This uses Spark 2.2.0; to_date with the format signature is a new function in 2.2.0.
If you are using spark < 2.2.0, you can use unix_timestamp in place of to_date:
val w = Window.partitionBy("id").orderBy(unix_timestamp($"in_date", "dd-MMM-yyyy"))