Bring the max value from a table based on the values of another table with PySpark

I have two Spark DataFrames. The first one (Events) contains event information as follows:
Event_id  Date        User_id
1         2019-04-19  1
2         2019-05-30  2
3         2020-01-20  1
The second one (User) contains user information as below:
Id  User_id  Date        Weight-kg
1   1        2019-04-05  78
2   1        2019-04-17  75
3   2        2019-10-10  50
4   1        2020-02-10  76
What I want to know is: how can I bring in the latest weight value from User recorded before each event's Date, using PySpark?
The result should be the following table:
Event_id  Date        User_id  Weight-kg
1         2019-04-19  1        75
2         2019-05-30  2        null
3         2020-01-20  1        75
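For reference, the two input DataFrames can be built like this (a minimal sketch; the variable names event and user are assumed to match the answer code below, and the hyphenated Weight-kg column is shortened to Weight to match the answer's output):

from pyspark.sql import functions as F

event = spark.createDataFrame(
    [(1, '2019-04-19', 1), (2, '2019-05-30', 2), (3, '2020-01-20', 1)],
    ['Event_id', 'Date', 'User_id'],
).withColumn('Date', F.to_date('Date'))

user = spark.createDataFrame(
    [(1, 1, '2019-04-05', 78), (2, 1, '2019-04-17', 75),
     (3, 2, '2019-10-10', 50), (4, 1, '2020-02-10', 76)],
    ['Id', 'User_id', 'Date', 'Weight'],
).withColumn('Date', F.to_date('Date'))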

The idea is to left join events and users, then rank the weights by date to keep the latest one:
from pyspark.sql import functions as F
from pyspark.sql import Window as W
(event
    # left join to keep all events;
    # note the join condition: event's date >= user's date
    .join(
        user,
        on=[
            event['User_id'] == user['User_id'],
            event['Date'] >= user['Date'],
        ],
        how='left'
    )
    # rank the user's weights within each event, so the latest
    # date (already filtered by the event's date) gets rank 1
    .withColumn('rank_weight', F.rank().over(W.partitionBy(event['Event_id']).orderBy(user['Date'].desc())))
    .where(F.col('rank_weight') == 1)
    .drop('rank_weight')
    # drop unnecessary columns
    .drop(user['User_id'])
    .drop(user['Date'])
    .drop('Id')
    .orderBy('Event_id')
    .show()
)
# Output
# +--------+----------+-------+------+
# |Event_id| Date|User_id|Weight|
# +--------+----------+-------+------+
# | 1|2019-04-19| 1| 75|
# | 2|2019-05-30| 2| null|
# | 3|2020-01-20| 1| 75|
# +--------+----------+-------+------+

Related

Count unique ids between two consecutive dates that are values of a column in PySpark

I have a PySpark DataFrame with ID and Date columns, looking like this:
ID  Date
1   2021-10-01
2   2021-10-01
1   2021-10-02
3   2021-10-02
I want to count the number of unique IDs that did not exist on the previous day. So here the result would be 1, as there is only one new unique ID on 2021-10-02.
ID  Date        Count
1   2021-10-01  -
2   2021-10-01  -
1   2021-10-02  1
3   2021-10-02  1
I tried following this solution but it does not work on date-type values. Any help would be highly appreciated.
Thank you!
If you want to avoid a self-join (e.g. for performance reasons), you could work with Window functions:
from pyspark.sql import functions as F
from pyspark.sql import Row, Window
import datetime

df = spark.createDataFrame([
    Row(ID=1, date=datetime.date(2021, 10, 1)),
    Row(ID=2, date=datetime.date(2021, 10, 1)),
    Row(ID=1, date=datetime.date(2021, 10, 2)),
    Row(ID=2, date=datetime.date(2021, 10, 2)),
    Row(ID=1, date=datetime.date(2021, 10, 3)),
    Row(ID=3, date=datetime.date(2021, 10, 3)),
])
First, add the number of days since an ID was last seen (this will be null if it never appeared before):
df = df.withColumn('days_since_last_occurrence', F.datediff('date', F.lag('date').over(Window.partitionBy('ID').orderBy('date'))))
Second, we add a column marking rows where this number of days is not 1. We put a 1 into this column so that we can later sum over it to count the rows:
df = df.withColumn('is_new', F.when(F.col('days_since_last_occurrence') == 1, None).otherwise(1))
Now we sum over all rows with the same date and then remove the columns we no longer require:
(
    df
    .withColumn('count', F.sum('is_new').over(Window.partitionBy('date')))  # sum over all rows with the same date
    .drop('is_new', 'days_since_last_occurrence')
    .sort('date', 'ID')
    .show()
)
# Output:
+---+----------+-----+
| ID| date|count|
+---+----------+-----+
| 1|2021-10-01| 2|
| 2|2021-10-01| 2|
| 1|2021-10-02| null|
| 2|2021-10-02| null|
| 1|2021-10-03| 1|
| 3|2021-10-03| 1|
+---+----------+-----+
Collect the ID list of the current day and the previous day, then take the size of the difference between the two to get the final result.
Update: a solution that eliminates the join.
df = df.select('date', F.expr('collect_set(id) over (partition by date) as id_arr')).dropDuplicates() \
    .select('*', F.expr('size(array_except(id_arr, lag(id_arr, 1, id_arr) over (order by date))) as count')) \
    .select(F.explode('id_arr').alias('id'), 'date', 'count')
df.show(truncate=False)
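The same idea can also be written with the DataFrame API instead of SQL expressions; a sketch, reusing the df from the first answer (coalesce is used here in place of lag's default argument):

from pyspark.sql import functions as F, Window

w = Window.orderBy('date')  # single global window; fine for small data
result = (df.groupBy('date').agg(F.collect_set('ID').alias('id_arr'))
            .withColumn('prev_arr', F.coalesce(F.lag('id_arr').over(w), F.col('id_arr')))
            .withColumn('count', F.size(F.array_except('id_arr', 'prev_arr')))
            .select(F.explode('id_arr').alias('ID'), 'date', 'count'))
result.show(truncate=False)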

How to take values for an ID that answered more than once and create a column for each value

I have data like below; I want to take the data for the same ID from one column and put each answer into a different new column, respectively.
Actual:
ID  Brandid
1   234
1   122
1   134
2   122
3   234
3   122

Expected:
ID  BRANDID_1  BRANDID_2  BRANDID_3
1   234        122        134
2   122        -          -
3   234        122        -
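For reference, the input can be reproduced like this (a minimal sketch; the variable name df is assumed to match the answer below):

df = spark.createDataFrame(
    [(1, 234), (1, 122), (1, 134), (2, 122), (3, 234), (3, 122)],
    ['ID', 'Brandid'],
)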
You can use pivot after a groupBy, but first you need to create a column holding the future column names, using row_number over a Window to get an increasing number per ID. Here is one way:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
# create the window partitioned by ID; an orderBy is required,
# so use a constant F.lit(1) to keep the original order
w = Window.partitionBy('ID').orderBy(F.lit(1))

# create the column with the future column names to pivot on
pv_df = (df.withColumn('pv', F.concat(F.lit('Brandid_'), F.row_number().over(w).cast('string')))
         # group by ID and pivot on the created column
         .groupBy('ID').pivot('pv')
         # the aggregation needs a function, so we use first
         .agg(F.first('Brandid')))
and you get
pv_df.show()
+---+---------+---------+---------+
| ID|Brandid_1|Brandid_2|Brandid_3|
+---+---------+---------+---------+
| 1| 234| 122| 134|
| 3| 234| 122| null|
| 2| 122| null| null|
+---+---------+---------+---------+
EDIT: to get the columns in order as the OP requested, you can use lpad. First define the padded length you want for the numbers:
nb_pad = 3
and in the method above replace F.concat(F.lit('Brandid_'), F.row_number().over(w).cast('string')) with
F.concat(F.lit('Brandid_'), F.lpad(F.row_number().over(w).cast('string'), nb_pad, "0"))
and if you don't know how many "0"s you need to add (here the overall length was 3), you can get this value with
nb_val = len(str(df.groupBy('ID').count().select(F.max('count')).collect()[0][0]))
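Putting the EDIT together, a sketch of the padded pipeline (reusing the df and w defined above):

nb_pad = 3
pv_df = (df.withColumn('pv', F.concat(F.lit('Brandid_'),
                                      F.lpad(F.row_number().over(w).cast('string'), nb_pad, '0')))
           .groupBy('ID').pivot('pv')
           .agg(F.first('Brandid')))
pv_df.show()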

How to associate dates with a count or an integer [duplicate]

This question already has answers here: Spark SQL window function with complex condition (2 answers). Closed 4 years ago.
I have a DataFrame with columns "id" and "date". date is in the format yyyy-MM-dd. Here is an example:
+---------+----------+
| item_id| ds|
+---------+----------+
| 25867869|2018-05-01|
| 17190474|2018-01-02|
| 19870756|2018-01-02|
|172248680|2018-07-29|
| 41148162|2018-03-01|
+---------+----------+
I want to create a new column in which each date is associated with an integer starting from 1, such that the smallest (earliest) date gets integer 1, the next (2nd earliest) date gets 2, and so on.
I want my DataFrame to look like this:
+---------+----------+---------+
| item_id| ds| number|
+---------+----------+---------+
| 25867869|2018-05-01| 3|
| 17190474|2018-01-02| 1|
| 19870756|2018-01-02| 1|
|172248680|2018-07-29| 4|
| 41148162|2018-03-01| 2|
+---------+----------+---------+
Explanation:
2018-01-02 is the earliest date, hence its number is 1. Since there are 2 rows with the same date, 1 appears twice. After 2018-01-02, the next date is 2018-03-01, hence its number is 2, and so on. How can I create such a column?
This can be achieved by dense_rank in Window functions.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val win = Window.orderBy(to_date(col("ds"),"yyyy-MM-dd").asc)
val df1 = df.withColumn("number", dense_rank() over win)
df1 will have the column number as you required.
Note: to_date(col("ds"), "yyyy-MM-dd") is mandatory; otherwise the values are treated as Strings and the ordering does not serve the purpose.
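For readers working in PySpark (the tag on this page), the same approach would look roughly like this; a sketch, assuming df is the question's DataFrame with the ds column:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

win = Window.orderBy(F.to_date(F.col('ds'), 'yyyy-MM-dd').asc())
df1 = df.withColumn('number', F.dense_rank().over(win))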
You should make a function to get the oldest row without a number, something like:
SELECT * FROM tablename WHERE number IS NULL ORDER BY ds ASC
then make another query where you get the greatest number:
SELECT * FROM tablename ORDER BY number DESC
then if both queries have the same date then update the table with the same number:
UPDATE tablename SET number = 'greatest number from first query' WHERE ds = 'the date from first query'
or, if the dates are different, do the same but add 1 to the number:
UPDATE tablename SET number= 'greatest number from first query' + 1 WHERE ds = 'the date from first query'
To make this work, you should first assign the number 1 to the oldest entry.
You should do this in a loop until the first query (which checks whether any number is not set) returns nothing.
The first query supposes that the empty column is all NULL; if that is not the case, you should change the WHERE condition to check when the column is empty.

How to find earliest and latest dates in a DataFrame in a grouping operation?

I have the following DataFrame df:
url user date followers
www.test1.com A 2017-01-04 05:46:00 45
www.test1.com B 2017-01-03 10:46:00 10
www.test1.com C 2017-01-05 05:46:00 11
www.test2.com B 2017-01-03 17:00:00 10
www.test2.com A 2017-01-04 15:05:00 45
For each distinct url I need to find the total sum of followers, the user who has the earliest date, the number of unique user values, the earliest date and the latest date.
This is what I did so far:
val wFirstUser = Window.partitionBy($"url",$"user").orderBy($"date".asc)
val result = df
  .groupBy("url")
  .agg(sum("followers"), countDistinct("user"), min("date"), max("date"))
  .withColumn("rn", row_number.over(wFirstUser)).where($"rn" === 1).drop("rn")
Expected output:
url            first_user  earliest_date        latest_date          sum_followers  distinct_users
www.test1.com  B           2017-01-03 10:46:00  2017-01-05 05:46:00  66             3
www.test2.com  B           2017-01-03 17:00:00  2017-01-04 15:05:00  55             2
But I cannot find the user who has the earliest date (i.e. first_user). Can someone please help me?
You don't need a window function. All you need is to create a struct column of (date, user); taking its minimum gives the earliest date and the corresponding user, and the rest is as you have done:
import org.apache.spark.sql.functions._
val result = df.withColumn("struct", struct("date", "user"))
.groupBy("url")
.agg(sum("followers").as("sum_followers"), countDistinct("user").as("distinct_users"), max("date").as("latest_date"), min("struct").as("struct"))
.select(col("url"), col("struct.user").as("first_user"), col("struct.date").as("earliest_date"), col("latest_date"), col("sum_followers"), col("distinct_users"))
which should give you
+-------------+----------+-------------------+-------------------+-------------+--------------+
|url |first_user|earliest_date |latest_date |sum_followers|distinct_users|
+-------------+----------+-------------------+-------------------+-------------+--------------+
|www.test1.com|B |2017-01-03 10:46:00|2017-01-05 05:46:00|66.0 |3 |
|www.test2.com|B |2017-01-03 17:00:00|2017-01-04 15:05:00|55.0 |2 |
+-------------+----------+-------------------+-------------------+-------------+--------------+
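For readers following the PySpark tag, the same struct trick can be written like this; a minimal sketch that also rebuilds the sample data, since the original answer is in Scala:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("www.test1.com", "A", "2017-01-04 05:46:00", 45),
     ("www.test1.com", "B", "2017-01-03 10:46:00", 10),
     ("www.test1.com", "C", "2017-01-05 05:46:00", 11),
     ("www.test2.com", "B", "2017-01-03 17:00:00", 10),
     ("www.test2.com", "A", "2017-01-04 15:05:00", 45)],
    ["url", "user", "date", "followers"],
)

result = (df.withColumn("struct", F.struct("date", "user"))
            .groupBy("url")
            .agg(F.sum("followers").alias("sum_followers"),
                 F.countDistinct("user").alias("distinct_users"),
                 F.max("date").alias("latest_date"),
                 F.min("struct").alias("struct"))
            .select("url",
                    F.col("struct.user").alias("first_user"),
                    F.col("struct.date").alias("earliest_date"),
                    "latest_date", "sum_followers", "distinct_users"))
result.show(truncate=False)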

How to remove records with their count per group below a threshold?

Here's the DataFrame:
id | sector | balance
---------------------------
1 | restaurant | 20000
2 | restaurant | 20000
3 | auto | 10000
4 | auto | 10000
5 | auto | 10000
How to find the count of each sector type and remove the records with sector type count below a specific LIMIT?
The following:
dataFrame.groupBy(columnName).count()
gives me the number of times a value appears in that column.
How to do it in Spark and Scala using DataFrame API?
You can use SQL Window to do so.
import org.apache.spark.sql.expressions.Window
yourDf.withColumn("count", count("*")
.over(Window.partitionBy($"colName")))
.where($"count">2)
// .drop($"count") // if you don't want to keep count column
.show()
For your given DataFrame:
import org.apache.spark.sql.expressions.Window
dataFrame.withColumn("count", count("*")
.over(Window.partitionBy($"sector")))
.where($"count">2)
.show()
You should see results like this:
id | sector | balance | count
------------------------------
3 | auto | 10000 | 3
4 | auto | 10000 | 3
5 | auto | 10000 | 3
Don't know if it is the best way. But this worked for me.
def getRecordsWithColumnFrequencyLessThanLimit(dataFrame: DataFrame, columnName: String, limit: Integer): DataFrame = {
  val g = dataFrame.groupBy(columnName)
    .count()
    .filter("count<" + limit)
    .select(columnName)
    .rdd
    .map(r => r(0)).collect()
  dataFrame.filter(dataFrame(columnName) isin (g: _*))
}
Since it's a DataFrame, you can use an SQL query like:
select sector, count(1)
from TABLE
group by sector
having count(1) >= LIMIT
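Note that the HAVING query returns only the qualifying sectors, not the records themselves; to keep the matching records you still need to join them back. A PySpark sketch (the names df and records, and the threshold 3, are illustrative assumptions):

df.createOrReplaceTempView("records")
keep = spark.sql("""
    SELECT sector
    FROM records
    GROUP BY sector
    HAVING count(1) >= 3
""")
result = df.join(keep, on="sector", how="left_semi")
result.show()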