Grouping data without calling aggregation function in pyspark - group-by

This is my data.
CouponNbr,ItemNbr,TypeCode,DeptNbr,MPQ
10,2,1,10,1
10,3,4,50,2
11,2,1,10,1
11,3,4,50,2
I want to group it in spark in such a way such that it looks like this:
CouponNbr,ItemsInfo
10,[[2,1,10,1],[3,4,50,2]]
11,[[2,1,10,1],[3,4,50,2]]
I tried to group it and convert each group to a dictionary with the following code:
df.groupby("CouponNbr").apply(lambda x:x[["ItemNbr","TypeCode","DeptNbr","MPQ"]].to_dict("r"))
But this is in pandas and it returns the following
CouponNbr,ItemsInfo
10,[{[ItemNbr:2,TypeCode:1,DeptNbr:10,MPQ:1],[ItemNbr:3,TypeCode:4,DeptNbr:50,MPQ:2]}]
11,[{[ItemNbr:2,TypeCode:1,DeptNbr:10,MPQ:1],[ItemNbr:3,TypeCode:4,DeptNbr:50,MPQ:2]}]
Is there a way I could achieve the format I need in pyspark? Thanks.

You can first collect the columns into a single array column using the array function, and then do groupBy.agg with collect_list:
import pyspark.sql.functions as F

df.groupBy('CouponNbr').agg(
    F.collect_list(
        F.array('ItemNbr', 'TypeCode', 'DeptNbr', 'MPQ')
    ).alias('ItemsInfo')
).show(2, False)
+---------+------------------------------+
|CouponNbr|ItemsInfo |
+---------+------------------------------+
|10 |[[2, 1, 10, 1], [3, 4, 50, 2]]|
|11 |[[2, 1, 10, 1], [3, 4, 50, 2]]|
+---------+------------------------------+
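If you would rather keep the field names (closer to the dict-of-records output you built in pandas), here is a sketch of a variant that swaps F.array for F.struct, so each element of ItemsInfo is a named struct:

import pyspark.sql.functions as F

# collect_list of structs keeps ItemNbr/TypeCode/DeptNbr/MPQ as named fields
df.groupBy('CouponNbr').agg(
    F.collect_list(
        F.struct('ItemNbr', 'TypeCode', 'DeptNbr', 'MPQ')
    ).alias('ItemsInfo')
).show(2, False)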

Related

How to create a column with the maximum number in each row of another column in PySpark?

I have a PySpark dataframe, each row of the column 'TAGID_LIST' is a set of numbers such as {426,427,428,430,432,433,434,437,439,447,448,450,453,460,469,469,469,469}, but I only want to keep the maximum number in each set, 469 for this row. I tried to create a new column with:
wechat_userinfo.withColumn('TAG', f.when(wechat_userinfo['TAGID_LIST'] != 'null', max(wechat_userinfo['TAGID_LIST'])).otherwise('null'))
but got TypeError: Column is not iterable.
How do I correct it?
If the column for which you want to retrieve the max value is an array, you can use the array_max function:
import pyspark.sql.functions as F
new_df = wechat_userinfo.withColumn("TAG", F.array_max(F.col("TAGID_LIST")))
To illustrate with an example,
df = spark.createDataFrame( [(1, [1, 772, 3, 4]), (2, [5, 6, 44, 8, 9])], ('a','d'))
df2 = df.withColumn("maxd", F.array_max(F.col("d")))
df2.show()
+---+----------------+----+
| a| d|maxd|
+---+----------------+----+
| 1| [1, 772, 3, 4]| 772|
| 2|[5, 6, 44, 8, 9]| 44|
+---+----------------+----+
In your particular case, the column in question is not an array of numbers but a string, formatted as comma-separated numbers surrounded by { and }. What I'd suggest is turning your string into an array and then operating on that array as described above. You can use the regexp_replace function to quickly remove the brackets, and then split() the comma-separated string into an array. It would look like this:
df = spark.createDataFrame([(1, "{1,2,3,4}"), (2, "{5,6,7,8}")], ('a', 'd'))
df2 = (df
    .withColumn("as_str", F.regexp_replace(F.col("d"), r'^\{|\}$', ''))
    .withColumn("as_arr", F.split(F.col("as_str"), ",").cast("array<long>"))
    .withColumn("maxd", F.array_max(F.col("as_arr")))
    .drop("as_str"))
df2.show()
+---+---------+------------+----+
| a| d| as_arr|maxd|
+---+---------+------------+----+
| 1|{1,2,3,4}|[1, 2, 3, 4]| 4|
| 2|{5,6,7,8}|[5, 6, 7, 8]| 8|
+---+---------+------------+----+
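Putting the two steps together on the dataframe from the question would look roughly like this (a sketch, assuming TAGID_LIST is a string of the form "{426,427,...}"):

import pyspark.sql.functions as F

# strip the braces, split into an array of longs, then take the max
wechat_userinfo = wechat_userinfo.withColumn(
    "TAG",
    F.array_max(
        F.split(F.regexp_replace(F.col("TAGID_LIST"), r'^\{|\}$', ''), ",").cast("array<long>")
    )
)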

Spark get all rows with same values in array in column

I have a Spark Dataframe with columns id and hashes, where the column hashes contains a Seq of integer values of length n. Example:
+---+---------------+
|id |hashes         |
+---+---------------+
|0  |[1, 2, 3, 4, 5]|
|1  |[1, 5, 3, 7, 9]|
|2  |[9, 3, 6, 8, 0]|
+---+---------------+
I want to get a dataframe with all the rows for which the arrays in hashes match in at least one position. More formally, I want a dataframe with an additional column matches that, for each row r, contains a Seq of the ids of every other row k such that hashes[r][i] == hashes[k][i] for at least one value of i.
For my example data, the result would be:
+---+---------------+-------+
|id |hashes |matches|
+---+---------------+-------+
|0 |[1, 2, 3, 4, 5]|[1] |
|1 |[1, 5, 3, 7, 9]|[0] |
|2 |[9, 3, 6, 8, 0]|[] |
+---+---------------+-------+
In Spark 3, the following code compares arrays between rows, keeping only rows where the two arrays share at least one element at the same position. df is your input dataframe:
df.join(
    df.withColumnRenamed("id", "id2").withColumnRenamed("hashes", "hashes2"),
    exists(arrays_zip(col("hashes"), col("hashes2")), x => x("hashes") === x("hashes2"))
  )
  .groupBy("id")
  .agg(first(col("hashes")).as("hashes"), collect_list("id2").as("matches"))
  .withColumn("matches", filter(col("matches"), x => x.notEqual(col("id"))))
Detailed description
First, we perform a self cross join, filtered by your condition that the two hashes arrays share at least one element at the same position.
To build the condition, we zip the two hashes arrays: one from the first dataframe, and one from the joined dataframe, which is just the first dataframe with its columns renamed. Zipping gives us an array of {"hashes": x, "hashes2": y} structs, and we then just need to check that this array contains an element where x = y. The complete condition is written as follows:
exists(arrays_zip(col("hashes"), col("hashes2")), x => x("hashes") === x("hashes2"))
Then we aggregate by column id to collect all the id2 values of the rows that were kept, i.e. the rows matching your condition.
To keep the "hashes" column, note that for two rows with the same "id" the "hashes" values are equal, so we take the first occurrence of "hashes" for each "id". And we collect all the "id2" values using collect_list:
.agg(first(col("hashes")).as("hashes"), collect_list("id2").as("matches"))
And finally, we filter the current row's own id out of the "matches" column:
.withColumn("matches", filter(col("matches"), x => x.notEqual(col("id"))))
If you need the "id" values to be in order, you can add an orderBy clause:
.orderBy("id")
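For readers working in PySpark rather than Scala, a roughly equivalent sketch (assuming Spark >= 3.1, where exists and filter accept Python lambdas) could look like this:

import pyspark.sql.functions as F

# second copy of the dataframe with renamed columns for the self join
renamed = df.withColumnRenamed("id", "id2").withColumnRenamed("hashes", "hashes2")

result = (
    df.join(
        renamed,
        F.exists(
            F.arrays_zip("hashes", "hashes2"),
            lambda x: x["hashes"] == x["hashes2"],
        ),
    )
    .groupBy("id")
    .agg(F.first("hashes").alias("hashes"), F.collect_list("id2").alias("matches"))
    .withColumn("matches", F.filter("matches", lambda x: x != F.col("id")))
    .orderBy("id")
)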
Run
With a dataframe df containing the following values:
+---+---------------+
|id |hashes |
+---+---------------+
|0 |[1, 2, 3, 4, 5]|
|1 |[1, 5, 3, 7, 9]|
|2 |[9, 3, 6, 8, 0]|
+---+---------------+
You get the following output:
+---+---------------+-------+
|id |hashes |matches|
+---+---------------+-------+
|0 |[1, 2, 3, 4, 5]|[1] |
|1 |[1, 5, 3, 7, 9]|[0] |
|2 |[9, 3, 6, 8, 0]|[] |
+---+---------------+-------+
Limits
The join is a cartesian product, which is very expensive. Although the condition filters results, it can lead to a huge amount of computation/shuffle on big datasets, and may have very poor performance.
If you are using a Spark version before 3.0, you have to replace some of the built-in Spark functions used above (such as exists and filter on arrays) with user-defined functions.
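A minimal PySpark sketch of such a UDF-based join condition (the names here are illustrative, not part of the original answer):

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# true if the two arrays share at least one element at the same position
same_position_match = udf(
    lambda a, b: any(x == y for x, y in zip(a, b)), BooleanType()
)

# used as an explicit cross join followed by a filter, e.g.
# df.crossJoin(renamed).filter(same_position_match("hashes", "hashes2"))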

Dropping rows from a spark dataframe based on a condition

I want to drop rows from a spark dataframe of lists based on a condition: the length of the list.
I have tried converting it into a list of lists and then using a for loop (demonstrated below), but I'm hoping to do it in one statement within Spark, creating a new immutable df from the original df based on this condition.
newList = df2.values.tolist()
finalList = []

for subList in newList:
    if len(subList) < 4:
        finalList.append(subList)
So for instance, if the dataframe is a one column dataframe and the column is named sequences, it looks like:
sequences
____________
[1, 2, 4]
[1, 6, 3]
[9, 1, 4, 6]
I want to drop all rows where the length of the list is more than 3, resulting in:
sequences
____________
[1, 2, 4]
[1, 6, 3]
Here is one approach for Spark >= 1.5, using the built-in size function:
from pyspark.sql import Row
from pyspark.sql.functions import size
df = spark.createDataFrame([
    Row(a=[9, 3, 4], b=[8, 9, 10]),
    Row(a=[7, 2, 6, 4], b=[2, 1, 5]),
    Row(a=[7, 2, 4], b=[8, 2, 1, 5]),
    Row(a=[2, 4], b=[8, 2, 10, 12, 20]),
])
df.where(size(df['a']) <= 3).show()
Output:
+---------+------------------+
| a| b|
+---------+------------------+
|[9, 3, 4]| [8, 9, 10]|
|[7, 2, 4]| [8, 2, 1, 5]|
| [2, 4]|[8, 2, 10, 12, 20]|
+---------+------------------+
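Applied to the one-column dataframe from the question, the same filter would look like this (assuming the column is named sequences as described above):

df.where(size(df['sequences']) <= 3).show()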

Multiplying two columns from different data frames in spark

I have two dataframes representing the following csv data:
Store Date Weekly_Sales
1 05/02/2010 249
2 12/02/2010 455
3 19/02/2010 415
4 26/02/2010 194
Store Date Weekly_Sales
5 05/02/2010 400
6 12/02/2010 460
7 19/02/2010 477
8 26/02/2010 345
What I'm attempting to do is, for each date, read the associated weekly sales from both dataframes and find the average of the two numbers. I'm not sure how to accomplish this.
Assuming that you want to keep the individual store data in the result data set, one approach would be to union the two dataframes and use a Window function to calculate the average weekly sales (along with the corresponding list of stores, if wanted), as follows:
val df1 = Seq(
  (1, "05/02/2010", 249),
  (2, "12/02/2010", 455),
  (3, "19/02/2010", 415),
  (4, "26/02/2010", 194)
).toDF("Store", "Date", "Weekly_Sales")

val df2 = Seq(
  (5, "05/02/2010", 400),
  (6, "12/02/2010", 460),
  (7, "19/02/2010", 477),
  (8, "26/02/2010", 345)
).toDF("Store", "Date", "Weekly_Sales")

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val window = Window.partitionBy($"Date")

df1.union(df2).
  withColumn("Avg_Sales", avg($"Weekly_Sales").over(window)).
  withColumn("Store_List", collect_list($"Store").over(window)).
  orderBy($"Date", $"Store").
  show
// +-----+----------+------------+---------+----------+
// |Store| Date|Weekly_Sales|Avg_Sales|Store_List|
// +-----+----------+------------+---------+----------+
// | 1|05/02/2010| 249| 324.5| [1, 5]|
// | 5|05/02/2010| 400| 324.5| [1, 5]|
// | 2|12/02/2010| 455| 457.5| [2, 6]|
// | 6|12/02/2010| 460| 457.5| [2, 6]|
// | 3|19/02/2010| 415| 446.0| [3, 7]|
// | 7|19/02/2010| 477| 446.0| [3, 7]|
// | 4|26/02/2010| 194| 269.5| [4, 8]|
// | 8|26/02/2010| 345| 269.5| [4, 8]|
// +-----+----------+------------+---------+----------+
You should first merge them using the union function. Then, grouping on the Date column, find the average (using the avg built-in function) as follows:
import org.apache.spark.sql.functions._

df1.union(df2)
  .groupBy("Date")
  .agg(collect_list("Store").as("Stores"), avg("Weekly_Sales").as("average_weekly_sales"))
  .show(false)
which should give you
+----------+------+--------------------+
|Date |Stores|average_weekly_sales|
+----------+------+--------------------+
|26/02/2010|[4, 8]|269.5 |
|12/02/2010|[2, 6]|457.5 |
|19/02/2010|[3, 7]|446.0 |
|05/02/2010|[1, 5]|324.5 |
+----------+------+--------------------+
I hope the answer is helpful
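For readers working in PySpark, the same union + groupBy approach would look roughly like this (a sketch, assuming df1 and df2 are PySpark DataFrames with the same Store, Date and Weekly_Sales columns):

import pyspark.sql.functions as F

(df1.union(df2)
    .groupBy("Date")
    .agg(F.collect_list("Store").alias("Stores"),
         F.avg("Weekly_Sales").alias("average_weekly_sales"))
    .show(truncate=False))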

Spark Dataframe Arraytype columns

I would like to create a new column on a dataframe, which is the result of applying a function to an arraytype column.
Something like this:
df = df.withColumn("max_$colname", max(col(colname)))
where each row of the column holds an array of values?
The functions in spark.sql.functions appear to work on a column basis only.
You can apply a user-defined function (UDF) to the array column.
1. DataFrame
+------------------+
| arr|
+------------------+
| [1, 2, 3, 4, 5]|
|[4, 5, 6, 7, 8, 9]|
+------------------+
2. Creating UDF
import org.apache.spark.sql.functions._

def max(arr: TraversableOnce[Int]) = arr.toList.max
val maxUDF = udf(max(_: Traversable[Int]))
3. Applying UDF in query
df.withColumn("arrMax",maxUDF(df("arr"))).show
4. Result
+------------------+------+
| arr|arrMax|
+------------------+------+
| [1, 2, 3, 4, 5]| 5|
|[4, 5, 6, 7, 8, 9]| 9|
+------------------+------+
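For completeness, the same idea in PySpark would look roughly like this (a sketch; from Spark 2.4 onwards the built-in array_max shown earlier on this page avoids the UDF entirely):

import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

# UDF that returns the max of an array column (None for empty/null arrays)
max_udf = F.udf(lambda arr: max(arr) if arr else None, IntegerType())
df.withColumn("arrMax", max_udf(F.col("arr"))).show()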