PySpark: divide one row by another in groupBy

I have a pyspark data frame and I'd like to divide one row by another within groups. Within groups there will be two rows: one with a count value where removal == 1 and the other with a count value where removal == 0.
How do I divide one count by the other to get the ratio for each group in a new column? The groupBy is on limit and test_id.
columns = ['removal', 'limit', 'test_id', 'count']
vals = [
(1, 'UL', 'AB', 141),
(0, 'UL', 'AB', 140),
(1, 'LL', 'AB', 21),
(0, 'LL', 'AB',12),
(0, 'UL', 'EF', 200),
(1, 'UL', 'EF',12)
]
What I want (or something in a similar layout):
columns = ['limit', 'test_id', 'ratio', 'count_1', 'count_0']
vals = [
('UL', 'AB', 1.007, 141, 140),
('LL', 'AB', 1.75, 21, 12),
('UL', 'EF', 0.06, 12, 200)
]
I know ways to do it by splitting and then merging the data again, but I'd rather have a nicer agg function.

Since there is only one row per value of removal, the straightforward way is to use where to filter for each distinct value and join:
from pyspark.sql.functions import col
df.where("removal = 1").alias("a")\
.join(df.where("removal = 0").alias("b"), on=["limit", "test_id"])\
.select(
"limit",
"test_id",
(col("a.count") / col("b.count")).alias("ratio"),
col("a.count").alias("count_1"),
col("b.count").alias("count_0")
).show()
#+-----+-------+------------------+-------+-------+
#|limit|test_id|             ratio|count_1|count_0|
#+-----+-------+------------------+-------+-------+
#|   UL|     AB|1.0071428571428571|    141|    140|
#|   LL|     AB|              1.75|     21|     12|
#|   UL|     EF|              0.06|     12|    200|
#+-----+-------+------------------+-------+-------+
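The join above pairs the removal == 1 row with the removal == 0 row of each (limit, test_id) group and divides their counts. As a rough illustration of that pairing logic, here is a minimal sketch in plain Python (no Spark session needed). The names counts and result are made up for this example, and it assumes every group has exactly one row per removal value and a non-zero count for removal == 0.

```python
from collections import defaultdict

# Same sample rows as in the question: (removal, limit, test_id, count)
rows = [
    (1, "UL", "AB", 141),
    (0, "UL", "AB", 140),
    (1, "LL", "AB", 21),
    (0, "LL", "AB", 12),
    (0, "UL", "EF", 200),
    (1, "UL", "EF", 12),
]

# Group by (limit, test_id) and index each group's counts by the removal flag.
counts = defaultdict(dict)
for removal, limit, test_id, count in rows:
    counts[(limit, test_id)][removal] = count

# One output row per group: (limit, test_id, ratio, count_1, count_0).
result = [
    (limit, test_id, by_removal[1] / by_removal[0], by_removal[1], by_removal[0])
    for (limit, test_id), by_removal in counts.items()
]

for row in result:
    print(row)
```

This is only useful for checking expected values on a small sample; on a real DataFrame the join shown above does the same pairing in a distributed way.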

Related

How to create a column with the maximum number in each row of another column in PySpark?

I have a PySpark dataframe, each row of the column 'TAGID_LIST' is a set of numbers such as {426,427,428,430,432,433,434,437,439,447,448,450,453,460,469,469,469,469}, but I only want to keep the maximum number in each set, 469 for this row. I tried to create a new column with:
wechat_userinfo.withColumn('TAG', f.when(wechat_userinfo['TAGID_LIST'] != 'null', max(wechat_userinfo['TAGID_LIST'])).otherwise('null'))
but got TypeError: Column is not iterable.
How do I correct it?
If the column for which you want to retrieve the max value is an array, you can use the array_max function:
import pyspark.sql.functions as F
new_df = wechat_userinfo.withColumn("TAG", F.array_max(F.col("TAGID_LIST")))
To illustrate with an example,
df = spark.createDataFrame( [(1, [1, 772, 3, 4]), (2, [5, 6, 44, 8, 9])], ('a','d'))
df2 = df.withColumn("maxd", F.array_max(F.col("d")))
df2.show()
+---+----------------+----+
|  a|               d|maxd|
+---+----------------+----+
|  1|  [1, 772, 3, 4]| 772|
|  2|[5, 6, 44, 8, 9]|  44|
+---+----------------+----+
In your particular case, the column in question is not an array of numbers but a string, formatted as comma-separated numbers surrounded by { and }. What I'd suggest is turning your string into an array and then operating on that array as described above. You can use the regexp_replace function to remove the braces, and then split() the comma-separated string into an array. It would look like this:
df = spark.createDataFrame( [(1, "{1,2,3,4}"), (2, "{5,6,7,8}")], ('a','d'))
# Parentheses let the method chain span several lines
df2 = (
    df
    # strip the leading "{" and the trailing "}"
    .withColumn("as_str", F.regexp_replace(F.col("d"), r'^\{|\}$', ''))
    .withColumn("as_arr", F.split(F.col("as_str"), ",").cast("array<long>"))
    .withColumn("maxd", F.array_max(F.col("as_arr")))
    .drop("as_str")
)
df2.show()
+---+---------+------------+----+
|  a|        d|      as_arr|maxd|
+---+---------+------------+----+
|  1|{1,2,3,4}|[1, 2, 3, 4]|   4|
|  2|{5,6,7,8}|[5, 6, 7, 8]|   8|
+---+---------+------------+----+
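To make the strip, split, and max steps concrete for a single value, here is a small plain-Python sketch. The helper name max_tag is hypothetical and not part of PySpark; it only mirrors what the column expressions above do row by row, and it assumes the string contains nothing but integers, commas, and the surrounding braces.

```python
def max_tag(tag_list):
    """Return the largest integer in a string like '{1,2,3}', or None if there is none."""
    if tag_list is None:
        return None
    inner = tag_list.strip().strip("{}")  # drop the surrounding braces
    numbers = [int(part) for part in inner.split(",") if part.strip()]
    return max(numbers) if numbers else None


sample = "{426,427,428,430,432,433,434,437,439,447,448,450,453,460,469,469,469,469}"
print(max_tag(sample))  # 469
```

A function like this is handy for checking expected results on a few sample strings; for the DataFrame itself, the built-in column functions shown above are the more natural fit.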

Selecting subset spark dataframe by months

I have this dataset:
I want to take a three-month subset of it (e.g. the months April, May and August) using PySpark.
I still haven't found anything that would let me do this with a PySpark dataframe.
You can extract the month using month() and then apply the isin function to find the rows matching the filter criteria.
from pyspark.sql import functions as F
data = [(1, "2021-01-01", ), (2, "2021-04-01", ), (3, "2021-05-01", ), (4, "2021-06-01", ), (5, "2021-07-01", ), (6, "2021-08-01", ), ]
df = spark.createDataFrame(data, ("cod_item", "date_emissao", )).withColumn("date_emissao", F.to_date("date_emissao"))
df.filter(F.month("date_emissao").isin(4, 5, 8)).show()
"""
+--------+------------+
|cod_item|date_emissao|
+--------+------------+
|       2|  2021-04-01|
|       3|  2021-05-01|
|       6|  2021-08-01|
+--------+------------+
"""

Weighted mean median quartiles in Spark

I have a Spark SQL dataframe:
id  Value  Weights
1   2      4
1   5      2
2   1      4
2   6      2
2   9      4
3   2      4
I need to groupBy by 'id' and aggregate to get the weighted mean, median, and quartiles of the values per 'id'. What is the best way to do this?
Before the calculation, apply a small transformation to your Value column:
F.explode(F.array_repeat('Value', F.col('Weights').cast('int')))
array_repeat creates an array in which the value is repeated as many times as the 'Weights' column specifies (the cast to int is necessary because array_repeat expects the count to be of int type). With this, the first row's value of 2 with weight 4 becomes [2,2,2,2].
explode then creates one row for every element of the array, so [2,2,2,2] becomes 4 rows, each containing the integer 2.
After that you can calculate the statistics as usual. The weights are already applied, because each value now appears as many times as its weight.
Full example:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
[(1, 2, 4),
(1, 5, 2),
(2, 1, 4),
(2, 6, 2),
(2, 9, 4),
(3, 2, 4)],
['id', 'Value', 'Weights']
)
df = df.select('id', F.explode(F.array_repeat('Value', F.col('Weights').cast('int'))))
df = (df
.groupBy('id')
.agg(F.mean('col').alias('weighted_mean'),
F.expr('percentile(col, 0.5)').alias('weighted_median'),
F.expr('percentile(col, 0.25)').alias('weighted_lower_quartile'),
F.expr('percentile(col, 0.75)').alias('weighted_upper_quartile')))
df.show()
#+---+-------------+---------------+-----------------------+-----------------------+
#| id|weighted_mean|weighted_median|weighted_lower_quartile|weighted_upper_quartile|
#+---+-------------+---------------+-----------------------+-----------------------+
#|  1|          3.0|            2.0|                    2.0|                   4.25|
#|  2|          5.2|            6.0|                    1.0|                    9.0|
#|  3|          2.0|            2.0|                    2.0|                    2.0|
#+---+-------------+---------------+-----------------------+-----------------------+
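The numbers for id 1 can be checked by hand. The sketch below repeats the expansion in plain Python and computes the statistics on the expanded list. The percentile helper is written for this example only and uses linear interpolation between neighbouring ranks, a rule chosen here because it reproduces the values in the table above. Note that the expansion only works for non-negative integer weights, and the number of rows grows with the sum of the weights.

```python
def percentile(sorted_values, q):
    """Percentile with linear interpolation between neighbouring ranks."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


# (Value, Weights) pairs for id == 1 from the example
pairs = [(2, 4), (5, 2)]

# Repeat each value `weight` times, mirroring array_repeat + explode.
expanded = sorted(value for value, weight in pairs for _ in range(weight))

weighted_mean = sum(expanded) / len(expanded)
weighted_median = percentile(expanded, 0.5)
lower_quartile = percentile(expanded, 0.25)
upper_quartile = percentile(expanded, 0.75)

print(expanded)  # [2, 2, 2, 2, 5, 5]
print(weighted_mean, weighted_median, lower_quartile, upper_quartile)  # 3.0 2.0 2.0 4.25
```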

Calculate value based on value from same column of the previous row in spark

I have an issue where I have to calculate a column using a formula that uses the value from the calculation done in the previous row.
I am unable to figure it out using withColumn API.
I need to calculate a new column, using the formula:
MovingRate = MonthlyRate + (0.7 * MovingRatePrevious)
... where the MovingRatePrevious is the MovingRate of the prior row.
For month 1, I have the value so I do not need to re-calculate that but I need that value to be able to calculate the subsequent rows. I need to partition by Type.
This is my original dataset:
Desired results in MovingRate column:
Although it's possible to do this with window functions (see Leo C's answer), I bet it's more performant to aggregate once per Type using a groupBy, then explode the result of the UDF to get all rows back:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

val df = Seq(
  (1, "blue", 0.4, Some(0.33)),
  (2, "blue", 0.3, None),
  (3, "blue", 0.7, None),
  (4, "blue", 0.9, None)
)
  .toDF("Month", "Type", "MonthlyRate", "MovingRate")
// this UDF produces a Seq of Tuple3 (Month, MonthlyRate, MovingRate)
val calcMovingRate = udf((startRate: Double, rates: Seq[Row]) => {
  // collect_list does not guarantee any order, so sort by Month first
  val sorted = rates.sortBy(_.getInt(0))
  sorted.tail
    .scanLeft((sorted.head.getInt(0), sorted.head.getDouble(1), startRate))((acc, curr) =>
      // MovingRate = MonthlyRate + 0.7 * previous MovingRate
      (curr.getInt(0), curr.getDouble(1), curr.getDouble(1) + 0.7 * acc._3))
})
df
  .groupBy($"Type")
  .agg(
    first($"MovingRate", ignoreNulls = true).as("startRate"),
    collect_list(struct($"Month", $"MonthlyRate")).as("rates")
  )
  .select($"Type", explode(calcMovingRate($"startRate", $"rates")).as("movingRates"))
  .select($"Type", $"movingRates._1".as("Month"), $"movingRates._2".as("MonthlyRate"), $"movingRates._3".as("MovingRate"))
  .show()
gives:
+----+-----+-----------+------------------+
|Type|Month|MonthlyRate|        MovingRate|
+----+-----+-----------+------------------+
|blue|    1|        0.4|              0.33|
|blue|    2|        0.3|0.5309999999999999|
|blue|    3|        0.7|1.0716999999999999|
|blue|    4|        0.9|1.6501899999999998|
+----+-----+-----------+------------------+
Given that each moving rate is computed recursively from the previous one, the column-oriented DataFrame API won't shine, especially if the dataset is huge.
That said, if the dataset isn't large, one approach would be to make Spark recalculate the moving rates row-wise via a UDF, with a Window-partitioned rate list as its input:
import org.apache.spark.sql.expressions.Window
val df = Seq(
(1, "blue", 0.4, Some(0.33)),
(2, "blue", 0.3, None),
(3, "blue", 0.7, None),
(4, "blue", 0.9, None),
(1, "red", 0.5, Some(0.2)),
(2, "red", 0.6, None),
(3, "red", 0.8, None)
).toDF("Month", "Type", "MonthlyRate", "MovingRate")
val win = Window.partitionBy("Type").orderBy("Month").
rowsBetween(Window.unboundedPreceding, 0)
def movingRate(factor: Double) = udf( (initRate: Double, monthlyRates: Seq[Double]) =>
monthlyRates.tail.foldLeft(initRate)( _ * factor + _ )
)
df.
withColumn("MovingRate", when($"Month" === 1, $"MovingRate").otherwise(
movingRate(0.7)(last($"MovingRate", ignoreNulls=true).over(win), collect_list($"MonthlyRate").over(win))
)).
show
// +-----+----+-----------+------------------+
// |Month|Type|MonthlyRate|        MovingRate|
// +-----+----+-----------+------------------+
// |    1| red|        0.5|               0.2|
// |    2| red|        0.6|              0.74|
// |    3| red|        0.8|             1.318|
// |    1|blue|        0.4|              0.33|
// |    2|blue|        0.3|0.5309999999999999|
// |    3|blue|        0.7|1.0716999999999999|
// |    4|blue|        0.9|1.6501899999999998|
// +-----+----+-----------+------------------+
What you are trying to do is compute a recursive formula that looks like:
x[i] = y[i] + 0.7 * x[i-1]
where x[i] is your MovingRate at row i and y[i] your MonthlyRate at row i.
The problem is that this is a purely sequential formula. Each row needs the result of the previous one which in turn needs the result of the one before. Spark is a parallel computation engine and it is going to be hard to use it to speed up a calculation that cannot really be parallelized.
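For reference, the recurrence itself is a short loop once the rows of a single Type are available in order. Below is a minimal sketch in plain Python, using the "blue" values from the question. The function name moving_rates is made up for this example, and it assumes the rates are already sorted by month and that month 1 comes with its MovingRate.

```python
def moving_rates(start_rate, monthly_rates, factor=0.7):
    """Apply x[i] = y[i] + factor * x[i-1], keeping the given first value."""
    rates = [start_rate]
    for monthly in monthly_rates[1:]:  # month 1 already has its MovingRate
        rates.append(monthly + factor * rates[-1])
    return rates


blue = moving_rates(0.33, [0.4, 0.3, 0.7, 0.9])
print(blue)  # roughly [0.33, 0.531, 1.0717, 1.65019], up to floating-point rounding
```

Each iteration reads the value produced by the previous one, which is exactly the dependency that prevents the rows of one Type from being processed in parallel. Different Types are independent of each other, so they can still be handled separately.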

pyspark join two rdds and flatten the results

Environment is pyspark, Spark Version 2.2.
We have two RDDs, test1 and test2. Below is some sample data:
test1 = [('a', 20), ('b', 10), ('c', 2)]
test2 = [('a', 2), ('b', 3)]
Now we want to generate output1 as below. Any help is appreciated.
[('a', 20, 2), ('b', 10, 3)]
You can accomplish this with a simple join followed by a call to map to flatten the values.
test1.join(test2).map(lambda kv: (kv[0],) + kv[1]).collect()
#[('a', 20, 2), ('b', 10, 3)]
To explain, the result of the join is the following:
test1.join(test2).collect()
#[('a', (20, 2)), ('b', (10, 3))]
This is almost the desired output, but you want to flatten the results. We can accomplish this by calling map and returning a new tuple in the desired format. Each element kv is a (key, values) pair. The syntax (kv[0],) creates a one-element tuple containing just the key, which is then concatenated with the values tuple kv[1]. (The Python 2 form lambda (key, values): ... is a syntax error in Python 3.)
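The flattening step relies only on ordinary tuple concatenation, so it can be tried without Spark. In this small sketch a plain list stands in for the collected join result; the names joined and flattened are illustrative.

```python
# Stand-in for the result of test1.join(test2).collect()
joined = [("a", (20, 2)), ("b", (10, 3))]

# (key,) is a one-element tuple; "+" concatenates it with the values tuple.
flattened = [(key,) + values for key, values in joined]
print(flattened)  # [('a', 20, 2), ('b', 10, 3)]
```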
You can also use the DataFrame API, by using pyspark.sql.DataFrame.toDF() to convert your RDDs to DataFrames:
test1.toDF(["key", "value1"]).join(test2.toDF(["key", "value2"]), on="key").show()
#+---+------+------+
#|key|value1|value2|
#+---+------+------+
#|  b|    10|     3|
#|  a|    20|     2|
#+---+------+------+