I'm currently learning Spark. Let's say we have the following DataFrame:
+-------+--------+
|user_id|activity|
+-------+--------+
|      1|   liked|
|      2| comment|
|      1|   liked|
|      1|   liked|
|      1| comment|
|      2|   liked|
+-------+--------+
Each type of activity has its own weight which is used to calculate the score
+--------+------+
|activity|weight|
+--------+------+
|   liked|     1|
| comment|     3|
+--------+------+
And this is the desired output
+-------+-----+
|user_id|score|
+-------+-----+
|      1|    6|
|      2|    4|
+-------+-----+
The score is calculated by counting how many times each activity occurred and multiplying by its weight. For instance, user 1 performed 3 likes and 1 comment, so the score is
(3 * 1) + (1 * 3) = 6
How do we do this calculation in Spark?
My initial attempt is below
val df1 = evidenceDF
  .groupBy("user_id")
  .agg(collect_set("event") as "event_ids")
but I got stuck on the mapping portion. What I want to achieve is, after aggregating the events into the event_ids field, to split them and do the calculation in a map function, but I'm having difficulty moving further.
I read about writing a custom aggregator function, but it sounds complicated. Is there a straightforward way to do this?
You can join with the weights dataframe, then group by user_id and sum the weights:
val df1 = evidenceDF.join(df_weight, Seq("activity"))
  .groupBy("user_id")
  .agg(
    sum(col("weight")).as("score")
  )
df1.show
//+-------+-----+
//|user_id|score|
//+-------+-----+
//| 1| 6|
//| 2| 4|
//+-------+-----+
Or, if you actually have only 2 categories, use a when expression directly inside the sum:
val df1 = evidenceDF.groupBy("user_id")
  .agg(
    sum(
      when(col("activity") === "liked", 1)
        .when(col("activity") === "comment", 3)
    ).as("score")
  )
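If the number of activity types grows but the weights are still known up front, another option is to inline them as a map literal instead of joining. This is only a minimal sketch, assuming the same evidenceDF with user_id and activity columns:
import org.apache.spark.sql.functions._

// Hypothetical inline weight map; element_at (Spark 2.4+) looks up each activity's weight.
val weights = typedLit(Map("liked" -> 1, "comment" -> 3))

val df1 = evidenceDF
  .groupBy("user_id")
  .agg(sum(element_at(weights, col("activity"))).as("score"))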
I have a dataframe containing the id of some person and the date on which he performed a certain action:
+----+----------+
| id| date|
+----+----------+
| 1|2022-09-01|
| 1|2022-10-01|
| 1|2022-11-01|
| 2|2022-07-01|
| 2|2022-10-01|
| 2|2022-11-01|
| 3|2022-09-01|
| 3|2022-10-01|
| 3|2022-11-01|
+----+----------+
I need to determine whether this person performed some action over a certain period of time (say, the last 3 months). In this example, person 2 missed months 08 and 09, so the condition is not met. I expect to get the following result:
+----+------------------------------------+------+
| id| dates|3month|
+----+------------------------------------+------+
| 1|[2022-09-01, 2022-10-01, 2022-11-01]| true|
| 2|[2022-07-01, 2022-10-01, 2022-11-01]| false|
| 3|[2022-09-01, 2022-10-01, 2022-11-01]| true|
+----+------------------------------------+------+
I understand that I should group by person ID and get an array of dates that correspond to it.
data.groupBy(col("id")).agg(collect_list("date") as "dates").withColumn("3month", ???)
However, I'm at a loss writing a function that would check this requirement. I have an option using recursion, but it does not suit me due to low performance (there may be more than a thousand dates). I would be very grateful if someone could help me with my problem.
A simple trick is to use a set instead of a list in your aggregation, in order to have distinct values, and then check the size of that set.
Here are some possible solutions:
Solution 1
Assuming you have a list of months of interest to check against, you can perform a preliminary filter on the required months, then aggregate and validate.
import org.apache.spark.sql.{functions => F}
import java.time.{LocalDate, Duration}

val requiredMonths = Seq(
  LocalDate.parse("2022-09-01"),
  LocalDate.parse("2022-10-01"),
  LocalDate.parse("2022-11-01")
);

df
  .filter(F.date_trunc("month", $"date").isInCollection(requiredMonths))
  .groupBy($"id")
  .agg(F.collect_set(F.date_trunc("month", $"date")) as "months")
  .withColumn("is_valid", F.size($"months") === requiredMonths.size)
date_trunc is used to truncate the date column to month.
Solution 2
Similar to the previous one, with a preliminary filter, but here assuming you have a range of months:
import java.time.temporal.ChronoUnit

val firstMonth = LocalDate.parse("2022-09-01");
val lastMonth = LocalDate.parse("2022-11-01");
val requiredNumberOfMonths = ChronoUnit.MONTHS.between(firstMonth, lastMonth) + 1;

df
  .withColumn("month", F.date_trunc("month", $"date"))
  .filter($"month" >= firstMonth && $"month" <= lastMonth)
  .groupBy($"id")
  .agg(F.collect_set($"month") as "months")
  .withColumn("is_valid", F.size($"months") === requiredNumberOfMonths)
Solution 3
Both solutions 1 and 2 have a problem: ids whose dates have no intersection with the months of interest are excluded from the final result entirely.
This is caused by the filter applied before grouping.
Here is a solution based on solution 2 that does not filter and solves this problem.
df
  .withColumn("month", F.date_trunc("month", $"date"))
  .groupBy($"id")
  .agg(F.collect_set(F.when($"month" >= firstMonth && $"month" <= lastMonth, $"month")) as "months")
  .withColumn("is_valid", F.size($"months") === requiredNumberOfMonths)
Now the filter is performed using a conditional collect_set.
Solutions 1 and 2 are still worth considering, because the preliminary filter can have advantages, and in some cases excluding those ids could be the expected result.
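Since the question asks about "the last 3 months", the bounds in solutions 2 and 3 do not have to be hard-coded. A small sketch, assuming "last 3 months" means the current month plus the two before it:
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Derive the month range from the current date instead of hard-coding it.
val lastMonth  = LocalDate.now().withDayOfMonth(1)
val firstMonth = lastMonth.minusMonths(2)
val requiredNumberOfMonths = ChronoUnit.MONTHS.between(firstMonth, lastMonth) + 1  // = 3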
I have data that looks like this:
id,start,expiration,customerid,content
1,13494,17358,0001,whateveriwanthere
2,14830,28432,0001,somethingelsewoo
3,11943,19435,0001,yes
4,39271,40231,0002,makingfakedata
5,01321,02143,0002,morefakedata
In the data above, I want to group by customerid for overlapping start and expiration (essentially just merge intervals). I am doing this successfully by grouping by the customer id, then aggregating on a first("start") and max("expiration").
df.groupBy("customerid").agg(first("start"), max("expiration"))
However, this drops the id column entirely. I want to save the id of the row that had the max expiration. For instance, I want my output to look like this:
id,start,expiration,customerid
2,11943,28432,0001
4,39271,40231,0002
5,01321,02143,0002
I am not sure how to add that id column for whichever row had the maximum expiration.
You can use a cumulative conditional sum along with the lag function to define a group column that flags rows that overlap. Then simply group by customerid + group and get the min start and max expiration. To get the id value associated with the max expiration date, you can use the struct ordering trick:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("customerid").orderBy("start")

val result = df.withColumn(
  "group",
  sum(
    when(
      col("start").between(lag("start", 1).over(w), lag("expiration", 1).over(w)),
      0
    ).otherwise(1)
  ).over(w)
).groupBy("customerid", "group").agg(
  min(col("start")).as("start"),
  max(struct(col("expiration"), col("id"))).as("max")
).select("max.id", "customerid", "start", "max.expiration")
result.show
//+---+----------+-----+----------+
//| id|customerid|start|expiration|
//+---+----------+-----+----------+
//| 5| 0002|01321| 02143|
//| 4| 0002|39271| 40231|
//| 2| 0001|11943| 28432|
//+---+----------+-----+----------+
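As a side note, the struct ordering works because structs compare field by field, so max(struct(expiration, id)) belongs to the row with the largest expiration. On Spark 3.3+ you could instead use max_by, which picks the id directly; a sketch under that assumption, reusing the same group column:
// Requires Spark 3.3+ for max_by; the grouping logic is identical to the above.
val alt = df.withColumn(
  "group",
  sum(
    when(
      col("start").between(lag("start", 1).over(w), lag("expiration", 1).over(w)),
      0
    ).otherwise(1)
  ).over(w)
).groupBy("customerid", "group").agg(
  min(col("start")).as("start"),
  max(col("expiration")).as("expiration"),
  max_by(col("id"), col("expiration")).as("id")
).select("id", "customerid", "start", "expiration")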
If I have key,value pairs that comprise an item (key) and the sales (value):
bolt 45
bolt 5
drill 1
drill 1
screw 1
screw 2
screw 3
So I want to obtain an RDD where each element is the sum of the values for every unique key:
bolt 50
drill 2
screw 6
My current code is like that:
val salesRDD = sc.textFile("/user/bigdata/sales.txt")
val pairs = salesRDD.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
counts.collect().foreach(println)
But my results get this:
(bolt 5,1)
(drill 1,2)
(bolt 45,1)
(screw 2,1)
(screw 3,1)
(screw 1,1)
How should I edit my code to get the above result?
Java way, hope you can convert this to Scala. If you read the file into a DataFrame (say salesDF) with columns name and sales, you just need a groupBy and a sum (count would only give you the number of rows per key, not the totals):
salesDF.groupBy(salesDF.col("name")).sum("sales");
+-----+----------+
| name|sum(sales)|
+-----+----------+
| bolt|        50|
|drill|         2|
|screw|         6|
+-----+----------+
Also, please prefer Datasets and DataFrames over RDDs; you will find them a lot handier.
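That said, if you want to stay with the RDD API from the question, the fix is to split each line into an (item, sales) pair instead of mapping the whole line to (s, 1). A minimal sketch, assuming whitespace-separated lines:
val salesRDD = sc.textFile("/user/bigdata/sales.txt")

// Split "bolt 45" into the key "bolt" and the numeric value 45.
val pairs = salesRDD.map { line =>
  val fields = line.split("\\s+")
  (fields(0), fields(1).toInt)
}

// Sum the values per key: (bolt,50), (drill,2), (screw,6).
val totals = pairs.reduceByKey(_ + _)
totals.collect().foreach(println)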
I am trying to convert a piece of code containing the retain functionality and multiple if-else statements from SAS to PySpark. I had no luck when I searched for similar answers.
Input Dataset:
+---------+---------+----+
|Prod_Code|     Rate|Rank|
+---------+---------+----+
|ADAMAJ091|1234.0091|   1|
|ADAMAJ091|1222.0001|   2|
|ADAMAJ091|1222.0000|   3|
|BASSDE012|5221.0123|   1|
|BASSDE012|5111.0022|   2|
|BASSDE012|5110.0000|   3|
+---------+---------+----+
I have calculated the Rank column using df.withColumn("Rank", row_number().over(Window.partitionBy('Prod_Code').orderBy('Rate')))
The Rate value at rank 1 must be replicated to all other rows in the same partition (ranks 1 to N).
Expected Output Dataset:
+---------+---------+----+
|Prod_Code|     Rate|Rank|
+---------+---------+----+
|ADAMAJ091|1234.0091|   1|
|ADAMAJ091|1234.0091|   2|
|ADAMAJ091|1234.0091|   3|
|BASSDE012|5221.0123|   1|
|BASSDE012|5221.0123|   2|
|BASSDE012|5221.0123|   3|
+---------+---------+----+
The Rate column's value at rank=1 must be replicated to all other rows in the same partition. This is the retain functionality from SAS, and I need help replicating it in PySpark.
I tried a df.withColumn() approach for individual rows, but I was not able to achieve this in PySpark.
Since you already have the Rank column, you can use the first function to get the first Rate value in a window ordered by Rank.
from pyspark.sql.functions import first
from pyspark.sql.window import Window
df = df.withColumn('Rate', first('Rate').over(Window.partitionBy('prod_code').orderBy('rank')))
df.show()
# +---------+---------+----+
# |Prod_Code| Rate|Rank|
# +---------+---------+----+
# |BASSDE012|5221.0123| 1|
# |BASSDE012|5221.0123| 2|
# |BASSDE012|5221.0123| 3|
# |ADAMAJ091|1234.0091| 1|
# |ADAMAJ091|1234.0091| 2|
# |ADAMAJ091|1234.0091| 3|
# +---------+---------+----+
Say I have this dataset:
val main_df = Seq(("yankees-mets",8,20),("yankees-redsox",4,14),("yankees-mets",6,17),
("yankees-redsox",2,10),("yankees-mets",5,17),("yankees-redsox",5,10)).toDF("teams","homeruns","hits")
which looks like this:
+--------------+--------+----+
|         teams|homeruns|hits|
+--------------+--------+----+
|  yankees-mets|       8|  20|
|yankees-redsox|       4|  14|
|  yankees-mets|       6|  17|
|yankees-redsox|       2|  10|
|  yankees-mets|       5|  17|
|yankees-redsox|       5|  10|
+--------------+--------+----+
I want to pivot on the teams column, and for all the other columns return the 2 (or N) highest values for that column. So for yankees-mets and homeruns it would return 8 and 6, since those were the 2 highest homerun totals for them.
How would I do this in the general case?
Thanks
Your problem is not really a good fit for pivot, since a pivot means:
A pivot is an aggregation where one (or more in the general case) of the grouping columns has its distinct values transposed into individual columns.
You could create an additional rank column with a window function and then select only rows with rank 1 or 2:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

main_df.withColumn(
    "rank",
    rank().over(
      Window.partitionBy("teams")
        .orderBy($"homeruns".desc)
    )
  )
  .where($"teams" === "yankees-mets" and ($"rank" === 1 or $"rank" === 2))
  .show
+------------+--------+----+----+
| teams|homeruns|hits|rank|
+------------+--------+----+----+
|yankees-mets| 8| 20| 1|
|yankees-mets| 6| 17| 2|
+------------+--------+----+----+
Then, if you no longer need the rank column, you can just drop it.
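For the general case (all teams, top N rows per team, no helper column in the output), the same idea can be sketched as follows, assuming N = 2:
val topN = 2

main_df
  .withColumn(
    "rank",
    rank().over(Window.partitionBy("teams").orderBy($"homeruns".desc))
  )
  .where($"rank" <= topN)
  .drop("rank")
  .show
Note that rank leaves gaps and can return more than N rows per team on ties; use row_number instead if you want exactly N per team.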