Implementing the retain functionality of SAS in PySpark

I am trying to convert a piece of SAS code that uses the retain functionality and multiple if-else statements to PySpark. I had no luck searching for similar answers.
Input Dataset:
Prod_Code  Rate       Rank
ADAMAJ091  1234.0091  1
ADAMAJ091  1222.0001  2
ADAMAJ091  1222.0000  3
BASSDE012  5221.0123  1
BASSDE012  5111.0022  2
BASSDE012  5110.0000  3
I calculated the rank using df.withColumn("rank", row_number().over(Window.partitionBy('Prod_Code').orderBy('Rate')))
The Rate value at rank 1 must be replicated to all other rows in the same partition (ranks 1 through N).
Expected Output Dataset:
Prod_Code  Rate       Rank
ADAMAJ091  1234.0091  1
ADAMAJ091  1234.0091  2
ADAMAJ091  1234.0091  3
BASSDE012  5221.0123  1
BASSDE012  5221.0123  2
BASSDE012  5221.0123  3
The Rate value present at rank = 1 must be replicated to all other rows in the same partition. This is the retain functionality, and I need help replicating it in PySpark.
I tried a df.withColumn() approach on individual rows, but I was not able to achieve this in PySpark.

Since you already have the Rank column, you can use the first function to get the first Rate value in a window ordered by Rank.
from pyspark.sql.functions import first
from pyspark.sql.window import Window
df = df.withColumn('Rate', first('Rate').over(Window.partitionBy('Prod_Code').orderBy('Rank')))
df.show()
# +---------+---------+----+
# |Prod_Code| Rate|Rank|
# +---------+---------+----+
# |BASSDE012|5221.0123| 1|
# |BASSDE012|5221.0123| 2|
# |BASSDE012|5221.0123| 3|
# |ADAMAJ091|1234.0091| 1|
# |ADAMAJ091|1234.0091| 2|
# |ADAMAJ091|1234.0091| 3|
# +---------+---------+----+
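For completeness, here is a self-contained sketch that builds the sample data from the question and applies the same first-over-window approach. It computes the rank by descending Rate because that is what the sample output implies; adjust the ordering if yours differs.

from pyspark.sql import SparkSession
from pyspark.sql.functions import first, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# sample data from the question
df = spark.createDataFrame(
    [("ADAMAJ091", 1234.0091), ("ADAMAJ091", 1222.0001), ("ADAMAJ091", 1222.0000),
     ("BASSDE012", 5221.0123), ("BASSDE012", 5111.0022), ("BASSDE012", 5110.0000)],
    ["Prod_Code", "Rate"],
)

# rank rows within each product; descending Rate matches the sample (rank 1 = highest Rate)
rank_w = Window.partitionBy("Prod_Code").orderBy(df["Rate"].desc())
df = df.withColumn("Rank", row_number().over(rank_w))

# "retain": copy the rank-1 Rate to every row in the partition
retain_w = Window.partitionBy("Prod_Code").orderBy("Rank")
df.withColumn("Rate", first("Rate").over(retain_w)).show()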

Related

How to apply custom logic inside an aggregate function

I'm currently learning Spark and let's say we have the following DataFrame
user_id  activity
1        liked
2        comment
1        liked
1        liked
1        comment
2        liked
Each type of activity has its own weight, which is used to calculate the score:
activity  weight
liked     1
comment   3
And this is the desired output:
user_id  score
1        6
2        4
The score is calculated by counting how many times each activity occurred and multiplying by its weight. For instance, user 1 performed 3 likes and 1 comment, so the score is given by
(3 * 1) + (1 * 3)
How do we do this calculation in Spark?
My initial attempt is below
val df1 = evidenceDF
  .groupBy("user_id")
  .agg(collect_set("event") as "event_ids")
but I got stuck on the mapping portion. What I want to achieve is: after I aggregate the events into the event_ids field, I want to split them and do the calculation in a map function, but I'm having difficulty moving further.
I looked into using a custom aggregator function, but it sounds complicated. Is there a straightforward way to do this?
You can join with the weights DataFrame, then group by and sum the weights:
val df1 = evidenceDF.join(df_weight, Seq("activity"))
  .groupBy("user_id")
  .agg(
    sum(col("weight")).as("score")
  )
df1.show
//+-------+-----+
//|user_id|score|
//+-------+-----+
//| 1| 6|
//| 2| 4|
//+-------+-----+
Or, if you actually have only 2 categories, use a when expression directly in the sum:
val df1 = evidenceDF.groupBy("user_id")
  .agg(
    sum(
      when(col("activity") === "liked", 1)
        .when(col("activity") === "comment", 3)
    ).as("score")
  )
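For anyone needing the same thing in PySpark, here is a minimal sketch of the join-and-sum version; the names evidence_df and weight_df are stand-ins for your DataFrames with the column layout described above.

from pyspark.sql import functions as F

# evidence_df: (user_id, activity), weight_df: (activity, weight) -- names assumed
df1 = (evidence_df.join(weight_df, on="activity")
       .groupBy("user_id")
       .agg(F.sum("weight").alias("score")))
df1.show()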

How to select the N highest values for each category in Spark Scala

Say I have this dataset:
val main_df = Seq(("yankees-mets",8,20),("yankees-redsox",4,14),("yankees-mets",6,17),
  ("yankees-redsox",2,10),("yankees-mets",5,17),("yankees-redsox",5,10)).toDF("teams","homeruns","hits")
which looks like this:
teams           homeruns  hits
yankees-mets    8         20
yankees-redsox  4         14
yankees-mets    6         17
yankees-redsox  2         10
yankees-mets    5         17
yankees-redsox  5         10
I want to pivot on the teams column, and for all the other columns return the 2 (or N) highest values for that column. So for yankees-mets and homeruns, it would return 8 and 6, since those were the 2 highest homerun totals for that team.
How would I do this in the general case?
Thanks
Your problem is not really a good fit for pivot, since a pivot means:
A pivot is an aggregation where one (or more in the general case) of the grouping columns has its distinct values transposed into individual columns.
You could create an additional rank column with a window function and then select only rows with rank 1 or 2:
import org.apache.spark.sql.expressions.Window

main_df.withColumn(
    "rank",
    rank().over(
      Window.partitionBy("teams")
        .orderBy($"homeruns".desc)
    )
  )
  .where($"teams" === "yankees-mets" and ($"rank" === 1 or $"rank" === 2))
  .show
+------------+--------+----+----+
| teams|homeruns|hits|rank|
+------------+--------+----+----+
|yankees-mets| 8| 20| 1|
|yankees-mets| 6| 17| 2|
+------------+--------+----+----+
Then, if you no longer need the rank column, you can just drop it.
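For the general case (top N per team rather than one hard-coded team), the same window can simply be filtered on rank <= N. Here is a PySpark sketch of that, built from the same sample data; N = 2 is just the example value.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

main_df = spark.createDataFrame(
    [("yankees-mets", 8, 20), ("yankees-redsox", 4, 14), ("yankees-mets", 6, 17),
     ("yankees-redsox", 2, 10), ("yankees-mets", 5, 17), ("yankees-redsox", 5, 10)],
    ["teams", "homeruns", "hits"],
)

N = 2
w = Window.partitionBy("teams").orderBy(F.col("homeruns").desc())

# rank() can return more than N rows when values tie; swap in row_number() for exactly N rows
top_n = (main_df.withColumn("rank", F.rank().over(w))
         .where(F.col("rank") <= N)
         .drop("rank"))
top_n.show()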

Average function in PySpark DataFrame

I have a DataFrame with a numeric "value" column and a tuple (struct) column.
A user supplies a value, and I want to calculate the average of the second number in the tuple over all rows at or above that particular value.
For example, let's say the supplied value is 10. I want to take all the rows whose value in the "value" column is greater than or equal to 10 and calculate the average over those rows. In this case, it takes the first two rows and the output is as shown below.
Can someone help me with this please?
Another option: you can filter the DataFrame first and then calculate the average. Use the getItem method to access the value1 field in the struct column:
import pyspark.sql.functions as f

(df.filter(df.value >= 10)
   .agg(f.avg(df.tuple.getItem('value1')).alias('Avg'),
        f.lit(10).alias('value'))
   .show())
+------+-----+
| Avg|value|
+------+-----+
|2200.0| 10|
+------+-----+
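Since the threshold is user-supplied, you may want to wrap the same logic in a small helper. Below is a minimal sketch assuming the layout described above (a numeric value column and a struct column named tuple with a value1 field); avg_above is a hypothetical name.

import pyspark.sql.functions as f

def avg_above(df, threshold):
    # average tuple.value1 over the rows whose value is >= the user-supplied threshold
    return (df.filter(df.value >= threshold)
              .agg(f.avg(df.tuple.getItem('value1')).alias('Avg'),
                   f.lit(threshold).alias('value')))

avg_above(df, 10).show()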

How to take values answered more than once for the same ID and create one column per value

I have data like below. I want to take the data for the same ID from one column and put each answer in a different new column.
Actual:
ID Brandid
1 234
1 122
1 134
2 122
3 234
3 122
Expected:
ID BRANDID_1 BRANDID_2 BRANDID_3
1 234 122 134
2 122 - -
3 234 122 -
You can use pivot after a groupBy, but first you can create a column with the future column name, using row_number to get a monotonically increasing number per ID over a Window. Here is one way:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# create the window on ID; an orderBy is required, so order by the constant
# F.lit(1) to keep the original row order
w = Window.partitionBy('ID').orderBy(F.lit(1))

# create the column with the future column names to pivot on
pv_df = (df.withColumn('pv', F.concat(F.lit('Brandid_'), F.row_number().over(w).cast('string')))
         # group by the ID and pivot on the created column
         .groupBy('ID').pivot('pv')
         # the aggregation needs a function, so we use first
         .agg(F.first('Brandid')))
and you get:
pv_df.show()
+---+---------+---------+---------+
| ID|Brandid_1|Brandid_2|Brandid_3|
+---+---------+---------+---------+
| 1| 234| 122| 134|
| 3| 234| 122| null|
| 2| 122| null| null|
+---+---------+---------+---------+
EDIT: to get the columns in order as the OP requested, you can use lpad. First define the padded length you want for the number:
nb_pad = 3
and in the method above replace F.concat(F.lit('Brandid_'), F.row_number().over(w).cast('string')) with
F.concat(F.lit('Brandid_'), F.lpad(F.row_number().over(w).cast('string'), nb_pad, "0"))
and if you don't know how many zeros to pad with (here the overall length was 3), you can compute the value with
nb_pad = len(str(df.groupBy('ID').count().select(F.max('count')).collect()[0][0]))
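Putting the pieces together, here is a self-contained sketch that builds the sample IDs from the question and applies the pivot with the lpad padding; nb_pad = 3 is just the example value.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# sample data from the question
df = spark.createDataFrame(
    [(1, 234), (1, 122), (1, 134), (2, 122), (3, 234), (3, 122)],
    ["ID", "Brandid"],
)

nb_pad = 3
w = Window.partitionBy('ID').orderBy(F.lit(1))

# pad the row number so the pivoted columns sort correctly (Brandid_001, Brandid_002, ...)
pv_df = (df.withColumn('pv', F.concat(F.lit('Brandid_'),
                                      F.lpad(F.row_number().over(w).cast('string'), nb_pad, '0')))
         .groupBy('ID').pivot('pv')
         .agg(F.first('Brandid')))
pv_df.show()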

How to get the name of the group with maximum value of parameter? [duplicate]

This question already has answers here:
How to select the first row of each group?
I have a DataFrame df like this one:
df =
name group influence
A 1 2
B 1 3
C 1 0
A 2 5
D 2 1
For each distinct value of group, I want to extract the value of name that has the maximum value of influence.
The expected result is this one:
group max_name max_influence
1 B 3
2 A 5
I know how to get the max value, but I don't know how to get max_name.
df.groupBy("group").agg(max("influence").as("max_influence"))
There is a good alternative to groupBy with structs: window functions, which are sometimes much faster.
For your example I would try the following:
import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy('group)
val res = df.withColumn("max_influence", max('influence).over(w))
  .filter('influence === 'max_influence)
res.show
+----+-----+---------+-------------+
|name|group|influence|max_influence|
+----+-----+---------+-------------+
| A| 2| 5| 5|
| B| 1| 3| 3|
+----+-----+---------+-------------+
Now all you need is to drop the unneeded columns and rename the remaining ones. Hope it helps.
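If you need the same approach in PySpark, including the final drop/rename step, here is a sketch assuming the df and column names from the question.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("group")
res = (df.withColumn("max_influence", F.max("influence").over(w))
       .filter(F.col("influence") == F.col("max_influence"))
       # keep only the requested columns and rename name to max_name
       .select("group",
               F.col("name").alias("max_name"),
               "max_influence"))
res.show()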