pyspark - left join with random row matching the key

I am looking for a way to join 2 dataframes, but with random rows matching the key. This unusual request comes from a very long calculation used to generate positions.
I would like to do a kind of "random left join" in pyspark.
I have a dataframe with an areaID (string) and a count (int). The areaID values are unique (around 7k rows).
+--------+-------+
| areaID | count |
+--------+-------+
| A      | 10    |
| B      | 30    |
| C      | 1     |
| D      | 25    |
| E      | 18    |
+--------+-------+
I have a second dataframe with around 1000 precomputed rows for each areaID, with 2 position columns x (float) and y (float). This dataframe is around 7 million rows.
+--------+------+------+
| areaID | x    | y    |
+--------+------+------+
| A      | 0.0  | 0    |
| A      | 0.1  | 0.7  |
| A      | 0.3  | 1    |
| A      | 0.1  | 0.3  |
| ...    |      |      |
| E      | 3.15 | 4.17 |
| E      | 3.14 | 4.22 |
+--------+------+------+
I would like to end up with a dataframe like:
+--------+------+------+
| areaID | x    | y    |
+--------+------+------+
| A      | 0.1  | 0.32 | < row 1/10 - randomly picked where areaID are the same
| A      | 0.0  | 0.18 | < row 2/10
| A      | 0.09 | 0.22 | < row 3/10
| ...    |      |      |
| E      | 3.14 | 4.22 | < row 1/18
| ...    |      |      |
+--------+------+------+
My first idea is to iterate over each areaID of the first dataframe, filter the second dataframe by that areaID and sample count rows from it. The problem is that this is quite slow with 7k load/filter/sample passes.
The second approach is to do an outer join on areaID, then shuffle the dataframe (which seems quite complex), apply a rank and keep rows where rank <= count, but I don't like the idea of loading a lot of data only to filter it afterward.
I am wondering if there is a way to do it using a "random" left join? In that case, I would duplicate each row count times and apply it.
Many thanks in advance,
Nicolas

One can interpret the question as stratified sampling of the second dataframe where the number of samples to be taken from each subpopulation is given by the first dataframe.
There is a Spark function for stratified sampling: sampleBy.
from pyspark.sql import functions as F

df1 = ...  # areaID, count
df2 = ...  # areaID, x, y

# first calculate the fraction for each areaID based on the required number
# given in df1 and the number of rows for that areaID in df2
fractionRows = df2.groupBy("areaID").agg(F.count("areaID").alias("count2")) \
    .join(df1, "areaID") \
    .withColumn("fraction", F.col("count") / F.col("count2")) \
    .select("areaID", "fraction") \
    .collect()
fractions = {f[0]: f[1] for f in fractionRows}

# now run the stratified sampling
df2.stat.sampleBy("areaID", fractions).show()
There is a caveat with this approach: as the sampling done by Spark is a random process, the exact number of rows given in the first dataframe will not always be met.
Edit: fractions > 1.0 are not supported by sampleBy. Looking at the Scala code of sampleBy shows why: the function is implemented as a filter with a random variable indicating whether to keep the row or not. Returning multiple copies of a single row will therefore not work.
A similar idea can be used to support fractions > 1.0: instead of using a filter, a udf is created that returns an array. The array contains one entry per copy of the row that should appear in the result. After applying the udf, the array column is exploded and then dropped:
from pyspark.sql import functions as F
from pyspark.sql import types as T

fractions = {'A': 1.5, 'C': 0.5}

def ff(stratum, x):
    # return one array entry per copy of the row that should be kept
    fraction = fractions.get(stratum, 0.0)
    ret = []
    while fraction >= 1.0:   # whole copies for the integer part of the fraction
        ret.append("x")
        fraction = fraction - 1
    if x < fraction:         # one more copy with probability equal to the remainder
        ret.append("x")
    return ret

f = F.udf(ff, T.ArrayType(T.StringType())).asNondeterministic()

seed = 42
# explode drops rows whose array is empty and duplicates rows with several entries
df2.withColumn("r", F.rand(seed)) \
    .withColumn("r", f("areaID", F.col("r"))) \
    .withColumn("r", F.explode("r")) \
    .drop("r") \
    .show()
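For completeness: if the exact counts from the first dataframe must be respected, the rank-based idea mentioned in the question can be written with a window over a random ordering. This is only a sketch (column names taken from the tables above, df1 = areaID/count, df2 = areaID/x/y); it scans all of df2 once, which is the cost the asker wanted to avoid, but it returns exactly count rows per areaID as long as df2 has at least that many:
from pyspark.sql import functions as F
from pyspark.sql import Window

# rank the rows of df2 randomly within each areaID
w = Window.partitionBy("areaID").orderBy(F.rand(42))

exact_sample = (df2.withColumn("rn", F.row_number().over(w))
                   .join(df1, "areaID")           # bring in the required count per areaID
                   .where(F.col("rn") <= F.col("count"))
                   .drop("rn", "count"))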

Related

PySpark: groupby() count('*') not working as expected, or I'm misunderstanding

I'm trying to get:
the number of rows per thing within each category
the number of rows across all things within each category
Below is what I've tried.
# This is PySpark
# df has variables 'id' 'category' 'thing'
# 'category' one : many 'id'
#
# sample data:
# id    | category | thing
# alpha | A        | X
# alpha | A        | X
# alpha | A        | Y
# beta  | A        | X
# beta  | A        | Z
# beta  | A        | Z
# gamma | B        | X
# gamma | B        | Y
# gamma | B        | Z
df_count_per_category = df.\
    select('category', 'thing').\
    groupby('category', 'thing').\
    agg(F.count('*').alias('thing_count'))

# Proposition total, to join with df_turnover_segmented
df_total = df.\
    select('category').\
    groupby('category').\
    agg(F.count('*').alias('thing_total'))

df_merge = df.\
    join(df_count_per_category,\
         (df_count_per_category.thing == df_count_per_category.thing) & \
         (df_count_per_category.category == df_count_per_category.category), \
         'inner').\
    drop(df_count_per_category.thing).\
    drop(df_count_per_category.category).\
    join(df_total,\
         (df.category == df_total.category), \
         'inner').\
    drop(df_total.category)

df_rate = df_merge.\
    withColumn('thing_rate', F.round(F.col('thing_count') / F.col('thing_total'), 3))
I'm expecting thing_count, thing_total, and thing_rate to be the same for the same thing, since each thing is category-exclusive. However, although thing_count is the same value across rows, thing_rate is not. Why is that?
This is the R equivalent I would like to achieve:
# This is R
library(tidytable)
df_total = df |>
  mutate(.by = c(category, thing),
         thing_count = n()) |>
  mutate(.by = category,
         thing_total = n()) |>
  mutate(thing_rate = thing_count / thing_total)
This is the expected result (+/- some columns):
# This is a table
category | thing | thing_count | thing_total | thing_rate
A        | X     | 3           | 6           | 0.5
A        | Y     | 1           | 6           | 0.1667
A        | Z     | 2           | 6           | 0.3333
B        | X     | 1           | 3           | 0.3333
B        | Y     | 1           | 3           | 0.3333
B        | Z     | 1           | 3           | 0.3333
I think your 2nd join is not doing what you intend.
You are referencing the original df in the 2nd join condition, which results in a wrong association. Instead, you want to join df_total to the result of the first join.
df_merge = df.\
    join(df_count_per_category,\
         (df.thing == df_count_per_category.thing) & \
         (df.category == df_count_per_category.category), \
         'inner').\
    drop(df_count_per_category.thing).\
    drop(df_count_per_category.category)

# Reference df_merge.category in the second join, not df.category.
df_merge = df_merge.join(df_total,\
                         (df_merge.category == df_total.category), \
                         'inner').\
    drop(df_total.category)
Alternatively, you can achieve your expected dataframe with window functions without multiple joins.
from pyspark.sql import Window
from pyspark.sql import functions as F

df = (df.select('category', 'thing',
                F.count('*').over(Window.partitionBy('category', 'thing')).alias('thing_count'),
                F.count('*').over(Window.partitionBy('category')).alias('thing_total'))
        .withColumn('thing_rate', F.round(F.col('thing_count') / F.col('thing_total'), 3)))
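If, as in the expected table, only one row per (category, thing) pair is wanted, a groupBy variant (again just a sketch using the same column names) avoids carrying the duplicated rows:
from pyspark.sql import Window
from pyspark.sql import functions as F

df_rate = (df.groupBy('category', 'thing')
             .agg(F.count('*').alias('thing_count'))
             # per-category total as a window sum over the grouped counts
             .withColumn('thing_total',
                         F.sum('thing_count').over(Window.partitionBy('category')))
             .withColumn('thing_rate',
                         F.round(F.col('thing_count') / F.col('thing_total'), 3)))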

Average element wise List of Dense vectors in each row of a pyspark dataframe

I have a column in a pyspark dataframe that contains Lists of DenseVectors. Different rows might have Lists of different sizes but each vector in the list is of the same size. I want to calculate the element-wise average of each of those lists.
To be more concrete, let's say I have the following df:
| ID | Column                                       |
| -- | -------------------------------------------- |
| 0  | List(DenseVector(1,2,3), DenseVector(2,4,5)) |
| 1  | List(DenseVector(1,2,3))                     |
| 2  | List(DenseVector(2,2,3), DenseVector(2,4,5)) |
What I would like to obtain is
| ID | Column               |
| -- | -------------------- |
| 0  | DenseVector(1.5,3,4) |
| 1  | DenseVector(1,2,3)   |
| 2  | DenseVector(2,3,4)   |
Many thanks!
I don't think there is a direct pyspark function to do this. There is an ElementwiseProduct (which works differently from what is expected here) and others, so you could try to achieve this with a udf.
from pyspark.sql import functions as F
from pyspark.ml.linalg import Vectors, VectorUDT
def elementwise_avg(vector_list):
    # sum each component over all vectors in the list, then divide by the list length
    x = y = z = 0
    no_of_v = len(vector_list)
    for elem in vector_list:
        x += elem[0]
        y += elem[1]
        z += elem[2]
    return Vectors.dense(x / no_of_v, y / no_of_v, z / no_of_v)

elementwise_avg_udf = F.udf(elementwise_avg, VectorUDT())
df = df.withColumn("Elementwise Avg", elementwise_avg_udf("Column"))
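The udf above assumes 3-dimensional vectors (x, y, z). Since the question only guarantees that all vectors within one list share a size, a slightly more general sketch (same imports, hypothetical names) sums each component in a loop:
from pyspark.sql import functions as F
from pyspark.ml.linalg import Vectors, VectorUDT

def elementwise_avg_any_dim(vector_list):
    # assumes a non-empty list whose vectors all have the same length
    n = len(vector_list)
    dim = len(vector_list[0])
    sums = [0.0] * dim
    for vec in vector_list:
        for i in range(dim):
            sums[i] += float(vec[i])
    return Vectors.dense([s / n for s in sums])

elementwise_avg_any_dim_udf = F.udf(elementwise_avg_any_dim, VectorUDT())
df = df.withColumn("Elementwise Avg", elementwise_avg_any_dim_udf("Column"))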

Pyspark calculated field based off time difference

I have a table that looks like this:
trip_distance | tpep_pickup_datetime | tpep_dropoff_datetime |
--------------+----------------------+-----------------------+
1.5           | 2019-01-01 00:46:40  | 2019-01-01 00:53:20   |
In the end, I need to create a speed column for each row, so something like this:
trip_distance | tpep_pickup_datetime | tpep_dropoff_datetime | speed |
--------------+----------------------+-----------------------+-------+
1.5           | 2019-01-01 00:46:40  | 2019-01-01 00:53:20   | 13.5  |
So this is what I'm trying to do to get there. I figure I should add an interim column to help out, called trip_time, which is a calculation of tpep_dropoff_datetime - tpep_pickup_datetime. Here is the code I'm using to get that:
df4 = df.withColumn('trip_time', df.tpep_dropoff_datetime - df.tpep_pickup_datetime)
which is producing a nice trip_time column:
trip_distance | tpep_pickup_datetime | tpep_dropoff_datetime | trip_time            |
--------------+----------------------+-----------------------+----------------------+
1.5           | 2019-01-01 00:46:40  | 2019-01-01 00:53:20   | 6 minutes 40 seconds |
But now I want to do the speed column, and this is how I'm trying to do that:
df4 = df4.withColumn('speed', (F.col('trip_distance') / F.col('trip_time')))
But that is giving me this error:
AnalysisException: cannot resolve '(trip_distance/trip_time)' due to data type mismatch: differing types in '(trip_distance/trip_time)' (float and interval).;;
Is there a better way?
One option is to convert your times with unix_timestamp, which gives seconds, and then do the subtraction; that gives you the interval as an integer that can be further used to calculate speed:
import pyspark.sql.functions as f
df.withColumn('speed', f.col('trip_distance') * 3600 / (
    f.unix_timestamp('tpep_dropoff_datetime') - f.unix_timestamp('tpep_pickup_datetime'))
).show()
+-------------+--------------------+---------------------+-----+
|trip_distance|tpep_pickup_datetime|tpep_dropoff_datetime|speed|
+-------------+--------------------+---------------------+-----+
| 1.5| 2019-01-01 00:46:40| 2019-01-01 00:53:20| 13.5|
+-------------+--------------------+---------------------+-----+
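If you prefer to keep the intermediate trip_time column from your original attempt, a variation of the same idea (just a sketch) stores it as a number of seconds, so the later division type-checks:
import pyspark.sql.functions as F

# trip_time as whole seconds instead of an interval
df4 = df.withColumn(
    'trip_time',
    F.unix_timestamp('tpep_dropoff_datetime') - F.unix_timestamp('tpep_pickup_datetime')
)
# seconds / 3600 = hours, so this yields distance per hour
df4 = df4.withColumn('speed', F.col('trip_distance') / (F.col('trip_time') / 3600))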

complex logic on pyspark dataframe including previous row existing value as well as previous row value generated on the fly

I have to apply a logic on a spark dataframe or rdd (preferably dataframe) which requires generating two extra columns. The first generated column depends on other columns of the same row, and the second generated column depends on the first generated column of the previous row.
Below is a representation of the problem statement in tabular format. A and B columns are available in the dataframe. C and D columns are to be generated.
A | B | C | D
------------------------------------
1 | 100 | default val | C1-B1
2 | 200 | D1-C1 | C2-B2
3 | 300 | D2-C2 | C3-B3
4 | 400 | D3-C3 | C4-B4
5 | 500 | D4-C4 | C5-B5
Here is the sample data
A | B | C | D
------------------------
1 | 100 | 1000 | 900
2 | 200 | -100 | -300
3 | 300 | -200 | -500
4 | 400 | -300 | -700
5 | 500 | -400 | -900
The only solution I can think of is to coalesce the input dataframe to 1 partition, convert it to an rdd and then apply a python function (holding all the calculation logic) via the mapPartitions API.
However, this approach would put all the load on one executor.
Mathematically speaking, D1-C1 where D1 = C1-B1; so D1-C1 becomes C1-B1-C1 => -B1.
In pyspark, the lag window function has a parameter called default. This should simplify your problem. Try this:
import pyspark.sql.functions as F
from pyspark.sql import Window

df = spark.createDataFrame([(1, 100), (2, 200), (3, 300), (4, 400), (5, 500)], ['a', 'b'])

w = Window.orderBy('a')
# c of each row is -b of the previous row (which, per the note above, equals d_prev - c_prev);
# the first row gets the default value 1000
df_lag = df.withColumn('c', F.lag(F.col('b') * -1, default=1000).over(w))
df_final = df_lag.withColumn('d', F.col('c') - F.col('b'))
Results:
df_final.show()
+---+---+----+----+
| a| b| c| d|
+---+---+----+----+
| 1|100|1000| 900|
| 2|200|-100|-300|
| 3|300|-200|-500|
| 4|400|-300|-700|
| 5|500|-400|-900|
+---+---+----+----+
If the operation is something more complex than subtraction, then the same logic applies: fill column C with your default value, calculate D, then use lag to calculate C and recalculate D.
The lag() function may help you with that:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window.orderBy("A")
df1 = df1.withColumn("C", F.lit(1000))
df2 = (
    df1
    .withColumn("D", F.col("C") - F.col("B"))
    .withColumn("C",
                F.when(F.lag("C").over(w).isNotNull(),
                       F.lag("D").over(w) - F.lag("C").over(w))
                 .otherwise(F.col("C")))
    .withColumn("D", F.col("C") - F.col("B"))
)

How to create a column of row id in Spark dataframe for each distinct column value using Scala

I have a data frame in scala spark as
category | score |
A | 0.2
A | 0.3
A | 0.3
B | 0.9
B | 0.8
B | 1
I would like to add a row id column as
category | score | row-id
A | 0.2 | 0
A | 0.3 | 1
A | 0.3 | 2
B | 0.9 | 0
B | 0.8 | 1
B | 1 | 2
Basically I want the row id to be monotonically increasing for each distinct value in the category column. I already have a sorted dataframe, so all the rows with the same category are grouped together. However, I still don't know how to generate a row_id that restarts when a new category appears. Please help!
This is a good use case for window functions:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import df.sparkSession.implicits._
val window = Window.partitionBy('category).orderBy('score)
df.withColumn("row-id", row_number.over(window))
Window functions work kind of like groupBy, except that instead of each group returning a single value, each row in each group returns a value. In this case the value is the row's position within the group of rows with the same category. Also, if this is the effect you are trying to achieve, then you don't need to pre-sort the dataframe by category beforehand.
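For comparison with the PySpark snippets elsewhere on this page, the same idea in Python would look roughly like this (a sketch assuming a dataframe df with category and score columns; note that row_number starts at 1, so subtract 1 to get the 0-based ids shown in the question):
from pyspark.sql import Window
from pyspark.sql import functions as F

window = Window.partitionBy('category').orderBy('score')
# row_number() is 1-based; subtracting 1 matches the 0-based row-id in the question
df = df.withColumn('row-id', F.row_number().over(window) - 1)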