How do I get deterministic random ordering in pyspark? - pyspark

I would like to randomly order a dataframe, but in a deterministic way. I thought that the way to do this was to use orderBy with a seeded rand function. However, I found that this is non-deterministic across different machines. For example, consider the following code:
from pyspark.sql import types as T, functions as F
df = spark.createDataFrame(range(10), T.IntegerType())
df = df.orderBy(F.rand(seed=123))
df.show()
When I run this on my local machine, it prints
+-----+
|value|
+-----+
| 3|
| 4|
| 9|
| 7|
| 8|
| 0|
| 5|
| 6|
| 2|
| 1|
+-----+
but on an EC2 instance, it prints
+-----+
|value|
+-----+
| 9|
| 5|
| 6|
| 7|
| 0|
| 1|
| 4|
| 8|
| 3|
| 2|
+-----+
How can I get a random ordering that is deterministic, even when running on different machines?
My pyspark version is 2.4.1
EDIT: By the way, I should add that just doing df.select(F.rand(seed=123)).show() produces the same output across both machines, so this is specifically a problem with the combination of orderBy and rand.

Thank you for the additional information from your edit! That turned out to be a pretty important clue.
Problem
I think the problem here is that you are attaching a pseudorandomly-generated column to an already-randomly-ordered data set, and the existing randomness is not deterministic, so attaching another source of randomness that is deterministic doesn't help.
You can verify this by rephrasing your orderBy call like:
df.withColumn('order', F.rand(seed=123)).orderBy(F.col('order').asc())
If I'm right, you'll see the same random values on both machines, but they'll be attached to different rows: the order in which the random values attach to rows is random!
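For a concrete check, here is the same expression with show() added (a minimal sketch using the df and imports from the question); run it on both machines and compare which value ends up next to which random number:
# Display the seeded random column attached to each row on this machine.
df.withColumn('order', F.rand(seed=123)) \
    .orderBy(F.col('order').asc()) \
    .show()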
Solution
And if that's true, the solution should be pretty straightforward: apply deterministic, non-random ordering over "real" values, before applying a random (but still deterministic) order on top.
df.orderBy(F.col('value').asc()).withColumn('order', F.rand(seed=123)).orderBy(F.col('order').asc())
should produce similar output on both machines.
My result:
+-----+-------------------+
|value| order|
+-----+-------------------+
| 4|0.13617504799810343|
| 5|0.13778573503201175|
| 6|0.15367835411103337|
| 9|0.43774287147238644|
| 0| 0.5029534413816527|
| 1| 0.5230701153994686|
| 7| 0.572063607751534|
| 8| 0.7689696831405166|
| 3| 0.82540915099773|
| 2| 0.8535692890157796|
+-----+-------------------+

Related

Get last n items in pyspark

For a dataset like -
+---+------+----------+
| id| item| timestamp|
+---+------+----------+
| 1| apple|2022-08-15|
| 1| peach|2022-08-15|
| 1| apple|2022-08-15|
| 1|banana|2022-08-14|
| 2| apple|2022-08-15|
| 2|banana|2022-08-14|
| 2|banana|2022-08-14|
| 2| water|2022-08-14|
| 3| water|2022-08-15|
| 3| water|2022-08-14|
+---+------+----------+
Can I use pyspark functions directly to get the last three items the user purchased in the past 5 days? I know a udf can do that, but I am wondering if any existing function can achieve this.
My expected output is like below, but anything similar is okay too.
id last_three_item
1 [apple, peach, apple]
2 [water, banana, apple]
3 [water, water]
Thanks!
You can use pandas_udf for this.
import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, StringType

# Grouped-aggregate pandas UDF that returns the first three items of each group.
@f.pandas_udf(returnType=ArrayType(StringType()), functionType=f.PandasUDFType.GROUPED_AGG)
def pudf_get_top_3(x):
    return x.head(3).to_list()

# Sort most recent first so head(3) inside the UDF picks the latest purchases.
sdf \
    .orderBy(f.col("timestamp").desc()) \
    .groupby("id") \
    .agg(pudf_get_top_3("item").alias("last_three_item")) \
    .show()
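If you would rather avoid a UDF entirely, a possible alternative with only built-in functions is collect_list plus slice (both available as pyspark functions since Spark 2.4). This is just a sketch under the same assumption the pandas_udf version makes, namely that sorting before the groupBy carries over into the collected list; that usually holds in practice but is not a hard guarantee:
# Collect each user's items with the most recent purchases first,
# then keep only the first three entries of the collected list.
sdf \
    .orderBy(f.col("timestamp").desc()) \
    .groupBy("id") \
    .agg(f.slice(f.collect_list("item"), 1, 3).alias("last_three_item")) \
    .show()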

Spark monotonically_increasing_id() gives consecutive ids for all the partitions

I have a dataframe df in Spark which looks something like this:
val df = (1 to 10).toList.toDF()
When I check the number of partitions, I see that I have 10 partitions:
df.rdd.getNumPartitions
res0: Int = 10
Now I generate an ID column:
val dfWithID = df.withColumn("id", monotonically_increasing_id())
dfWithID.show()
+-----+---+
|value| id|
+-----+---+
| 1| 0|
| 2| 1|
| 3| 2|
| 4| 3|
| 5| 4|
| 6| 5|
| 7| 6|
| 8| 7|
| 9| 8|
| 10| 9|
+-----+---+
So all the generated ids are consecutive though I have 10 partitions. Then I repartition the dataframe:
val dfp = df.repartition(10)
val dfpWithID = dfp.withColumn("id", monotonically_increasing_id())
dfpWithID.show()
+-----+-----------+
|value| id|
+-----+-----------+
| 10| 0|
| 1| 8589934592|
| 7|17179869184|
| 5|25769803776|
| 4|42949672960|
| 9|42949672961|
| 2|51539607552|
| 8|60129542144|
| 6|68719476736|
| 3|77309411328|
+-----+-----------+
Now the ids I get are not consecutive anymore. Based on the Spark documentation, it should put the partition ID in the upper 31 bits, and in both cases I have 10 partitions. Why does it only add the partition ID after calling repartition()?
I assume this is because all the data in your initial dataframe resides in a single partition, with the other 9 being empty.
To verify this, use the answers given here: Apache Spark: Get number of records per partition
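For example, a quick way to look at the per-partition record counts (a pyspark sketch, assuming df refers to the dataframe in question; the Scala answers in the linked question do the same thing):
# Number of rows in each partition; per the explanation above, you would expect
# something like [10, 0, 0, ...] before repartition(10) and a more even spread after it.
print(df.rdd.glom().map(len).collect())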

Show all pyspark columns after group and agg

I wish to group by a column and then find the max of another column. Lastly, show all the columns based on this condition. However, when I use my code, it only shows 2 columns and not all of them.
# Normal way of creating dataframe in pyspark
sdataframe_temp = spark.createDataFrame([
    (2, 2, '0-2'),
    (2, 23, '22-24')],
    ['a', 'b', 'c']
)
sdataframe_temp2 = spark.createDataFrame([
    (4, 6, '4-6'),
    (5, 7, '6-8')],
    ['a', 'b', 'c']
)
# Concatenate two different pyspark dataframes
sdataframe_union_1_2 = sdataframe_temp.union(sdataframe_temp2)
sdataframe_union_1_2_g = sdataframe_union_1_2.groupby('a').agg({'b':'max'})
sdataframe_union_1_2_g.show()
output:
+---+------+
| a|max(b)|
+---+------+
| 5| 7|
| 2| 23|
| 4| 6|
+---+------+
Expected output:
+---+------+-----+
|  a|max(b)|    c|
+---+------+-----+
|  5|     7|  6-8|
|  2|    23|22-24|
|  4|     6|  4-6|
+---+------+-----+
You can use a Window function to make it work:
Method 1: Using Window function
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window().partitionBy("a").orderBy(F.desc("b"))
(sdataframe_union_1_2
.withColumn('max_val', F.row_number().over(w) == 1)
.where("max_val == True")
.drop("max_val")
.show())
+---+---+-----+
| a| b| c|
+---+---+-----+
| 5| 7| 6-8|
| 2| 23|22-24|
| 4| 6| 4-6|
+---+---+-----+
Explanation
Window functions are useful when we want to attach a new column to the existing set of columns.
In this case, I tell the Window function to partition by column a with partitionBy('a') and sort column b in descending order with F.desc('b'). This makes the first value of b in each group its max value.
Then we use F.row_number() to keep the max values by filtering for the rows where the row number equals 1.
Finally, we drop the new column since it is not being used after filtering the data frame.
Method 2: Using groupby + inner join
f = sdataframe_union_1_2.groupby('a').agg(F.max('b').alias('b'))
sdataframe_union_1_2.join(f, on=['a','b'], how='inner').show()
+---+---+-----+
| a| b| c|
+---+---+-----+
| 2| 23|22-24|
| 5| 7| 6-8|
| 4| 6| 4-6|
+---+---+-----+

Spark Dataframe maximum on Several Columns of a Group

How can I get the maximum value for different (string and numerical) types of columns in a DataFrame in Scala using Spark?
Let's say this is my data:
+----+------+------+------+
|name|value1|value2|string|
+----+------+------+------+
|   A|     7|     9|   "a"|
|   A|     1|    10|  null|
|   B|     4|     4|   "b"|
|   B|     3|     6|  null|
+----+------+------+------+
and the desired outcome is:
+----+------+------+------+
|name|value1|value2|string|
+----+------+------+------+
|   A|     7|    10|   "a"|
|   B|     4|     6|   "b"|
+----+------+------+------+
Is there a function like in pandas with apply(max,axis=0) or do I have to write a UDF?
What I can do is df.groupBy("name").max("value1"), but I cannot perform two max calls in a row, nor does a Sequence work in the max() function.
Any ideas to solve the problem quickly?
Use this
df.groupBy("name").agg(max("value1"), max("value2"))

Spark SQL sum rows with the same key and appending the sum value

Suppose I have the following DataFrame.
+----+-----+
|lang|count|
+----+-----+
| en| 4|
| en| 5|
| de| 2|
| en| 2|
| nl| 4|
| nl| 5|
+----+-----+
How do I sum the values of “count” for each unique language and append this value as a new column (thus, without reducing the number of rows)?
In my example, this would result in:
+----+-----+----------------+
|lang|count|totalCountInLang|
+----+-----+----------------+
| en| 4| 11|
| en| 5| 11|
| de| 2| 2|
| en| 2| 11|
| nl| 4| 9|
| nl| 5| 9|
+----+-----+----------------+
The DataFrames are constructed through a map operation on a DStream.
Any suggestions on what would be the best way to achieve this? Is there a more efficient way than using DataFrames?
Thanks in advance!
You can use one of the following:
sum over a window:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val df = Seq(
  ("en", 4), ("en", 5), ("de", 2),
  ("en", 2), ("nl", 4), ("nl", 5)
).toDF("lang", "count")

val w = Window.partitionBy("lang").rowsBetween(
  Window.unboundedPreceding, Window.unboundedFollowing
)

df.withColumn("totalCountInLang", sum("count").over(w))
aggregation and join:
df.join(df.groupBy("lang").sum(), Seq("lang"))
With small groups the former solution should behave slightly better. For larger ones, the latter, optionally combined with the broadcast function, is usually preferred.
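For the broadcast variant mentioned above, the idea is to mark the small aggregated side as broadcastable so the join avoids shuffling the full dataframe. A pyspark sketch of that option (the Scala broadcast function is used the same way):
from pyspark.sql import functions as F

# Compute the small per-language totals, broadcast them, and join back onto
# the original rows so every row keeps its totalCountInLang.
totals = df.groupBy("lang").agg(F.sum("count").alias("totalCountInLang"))
df.join(F.broadcast(totals), "lang").show()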