PySpark - GroupBy and aggregation with multiple conditions

I want to group and aggregate data with several conditions. The dataframe contains a product id, fault codes, date and a fault type. Here, I prepared a sample dataframe:
from pyspark.sql.types import StructType, StructField, StringType, DateType
from datetime import date

data = [
    ("prod_001", "fault_01", date(2020, 6, 4), "minor"),
    ("prod_001", "fault_03", date(2020, 7, 2), "minor"),
    ("prod_001", "fault_09", date(2020, 7, 14), "minor"),
    ("prod_001", "fault_01", date(2020, 7, 14), "minor"),
    ("prod_001", None, date(2021, 4, 6), "major"),
    ("prod_001", "fault_02", date(2021, 6, 22), "minor"),
    ("prod_001", "fault_09", date(2021, 8, 1), "minor"),
    ("prod_002", "fault_01", date(2020, 6, 13), "minor"),
    ("prod_002", "fault_05", date(2020, 7, 11), "minor"),
    ("prod_002", None, date(2020, 8, 1), "major"),
    ("prod_002", "fault_01", date(2021, 4, 15), "minor"),
    ("prod_002", "fault_02", date(2021, 5, 11), "minor"),
    ("prod_002", "fault_03", date(2021, 5, 13), "minor"),
]
schema = StructType([
    StructField("product_id", StringType(), True),
    StructField("fault_code", StringType(), True),
    StructField("date", DateType(), True),
    StructField("fault_type", StringType(), True),
])
df = spark.createDataFrame(data=data, schema=schema)
display(df)
In general, I would like to group by product_id and then aggregate the fault_codes into lists, ordered by date. The special requirement is that the list keeps accumulating until the fault_type changes from minor to major; the major-tagged row adopts the last state of the aggregation (see screenshot). Within one product_id, the list aggregation should then start fresh with the next fault_code that is flagged as minor.
In some other posts I found the following code snippet, which I already tried. Unfortunately, I have not managed the full aggregation with all conditions yet.
df.sort("product_id", "date").groupby("product_id", "date").agg(F.collect_list("fault_code"))
Edit:
I got a little closer with Window.partitionBy(), but with the following code I am still not able to restart the collect_list() once the fault_type changes to major:
df_test = (df.sort("product_id", "date")
           .groupby("product_id", "date", "fault_type")
           .agg(F.collect_list("fault_code").alias("fault_code_list")))

window_function = (Window.partitionBy("product_id")
                   .orderBy("date")
                   .rangeBetween(Window.unboundedPreceding, Window.currentRow))

df_test = (df_test
           .withColumn("new_version_v2",
                       F.collect_list("fault_code_list")
                        .over(Window.partitionBy("product_id").orderBy("date")))
           .withColumn("new_version_v2", F.flatten("new_version_v2")))
Does someone know how to do that?

Your edit is close. This is not as simple as it looks, and I can only come up with a solution that works but is not particularly neat.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

lagw = Window.partitionBy('product_id').orderBy('date')
grpw = (Window.partitionBy('product_id', 'grp')
        .orderBy('date')
        .rowsBetween(Window.unboundedPreceding, 0))

df = (df.withColumn('grp', F.sum(
          (F.lag('fault_type').over(lagw).isNull()
           | (F.lag('fault_type').over(lagw) == 'major')
          ).cast('int')).over(lagw))
      .withColumn('fault_code', F.collect_list('fault_code').over(grpw)))

df.orderBy('product_id', 'grp').show()
# +----------+----------------------------------------+----------+----------+---+
# |product_id| fault_code| date|fault_type|grp|
# +----------+----------------------------------------+----------+----------+---+
# | prod_001|[fault_01] |2020-06-04| minor| 1|
# | prod_001|[fault_01, fault_03] |2020-07-02| minor| 1|
# | prod_001|[fault_01, fault_03, fault_09] |2020-07-14| minor| 1|
# | prod_001|[fault_01, fault_03, fault_09, fault_01]|2020-07-14| minor| 1|
# | prod_001|[fault_01, fault_03, fault_09, fault_01]|2021-04-06| major| 1|
# | prod_001|[fault_02] |2021-06-22| minor| 2|
# | prod_001|[fault_02, fault_09] |2021-08-01| minor| 2|
# | prod_002|[fault_01] |2020-06-13| minor| 1|
# | prod_002|[fault_01, fault_02] |2020-07-11| minor| 1|
...
Explanation:
First, I create the grp column to categorize each run of consecutive "minor" rows together with the following "major" row. I use sum and lag to check whether the previous row was "major": if it was (or if there is no previous row), I increment the counter; otherwise I keep the same value as the previous row.
# If cond is True, sum 1, if False, sum 0.
F.sum((cond).cast('int'))
df.orderBy(['product_id', 'date']).select('product_id', 'date', 'fault_type', 'grp').show()
+----------+----------+----------+---+
|product_id| date|fault_type|grp|
+----------+----------+----------+---+
| prod_001|2020-06-04| minor| 1|
| prod_001|2020-07-02| minor| 1|
| prod_001|2020-07-14| minor| 1|
| prod_001|2020-07-14| minor| 1|
| prod_001|2021-04-06| major| 1|
| prod_001|2021-06-22| minor| 2|
| prod_001|2021-08-01| minor| 2|
| prod_002|2020-06-13| minor| 1|
| prod_002|2020-07-11| minor| 1|
...
Once this grp is generated, I can partition by product_id and grp to apply collect_list.
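To make the boundary handling concrete, the same group-numbering and running-list logic can be sketched in plain Python (a toy simulation, not Spark code; rows are assumed pre-sorted by date within one product_id):

```python
# Simulate the lag/sum group numbering: a new group starts on the first row
# of a partition, or on the row right after a "major" row.
def number_groups(fault_types):
    grp, groups, prev = 0, [], None
    for ft in fault_types:
        if prev is None or prev == 'major':  # the same condition as the lag() check
            grp += 1                         # F.sum over the cast boolean
        groups.append(grp)
        prev = ft
    return groups

# Within each group, accumulate the running list; None values are skipped,
# matching collect_list's behaviour of ignoring nulls.
def collect_lists(fault_codes, groups):
    out, acc, cur = [], [], None
    for code, g in zip(fault_codes, groups):
        if g != cur:
            acc, cur = [], g
        if code is not None:
            acc.append(code)
        out.append(list(acc))
    return out

types = ['minor', 'minor', 'minor', 'minor', 'major', 'minor', 'minor']
codes = ['fault_01', 'fault_03', 'fault_09', 'fault_01', None, 'fault_02', 'fault_09']
grps = number_groups(types)
print(grps)                           # [1, 1, 1, 1, 1, 2, 2]
print(collect_lists(codes, grps)[4])  # ['fault_01', 'fault_03', 'fault_09', 'fault_01']
```

Note how the "major" row stays in group 1 (so it adopts the accumulated list), while the next "minor" row opens group 2, exactly as in the table above.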

One possible approach is using a Pandas UDF with applyInPandas.
First, define a "normal" Python function. Its input is a Pandas dataframe and its output is another Pandas dataframe; the dataframe's size doesn't matter.
def grp(df):
    df['a'] = 'AAA'
    df = df[df['fault_code'] == 'fault_01']
    return df[['product_id', 'a']]
Test this function with an actual Pandas dataframe. The only thing to remember is that this dataframe is just a subset (one group) of your actual dataframe:
grp(df.where('product_id == "prod_001"').toPandas())
product_id a
0 prod_001 AAA
3 prod_001 AAA
Apply this function to the Spark dataframe with applyInPandas. Note that the schema argument must describe the function's output columns, not the input dataframe:
(df
 .groupBy('product_id')
 .applyInPandas(grp, 'product_id string, a string')
 .show()
)
+----------+---+
|product_id| a|
+----------+---+
| prod_001|AAA|
| prod_001|AAA|
| prod_002|AAA|
| prod_002|AAA|
| prod_002|AAA|
| prod_002|AAA|
| prod_002|AAA|
+----------+---+
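The split-apply-combine idea behind applyInPandas can be sketched without Spark or pandas: split the rows by the grouping key, run the "normal" function on each slice, and concatenate the results (a toy model using plain dicts, not the real API):

```python
from itertools import groupby
from operator import itemgetter

rows = [
    {'product_id': 'prod_001', 'fault_code': 'fault_01'},
    {'product_id': 'prod_001', 'fault_code': 'fault_03'},
    {'product_id': 'prod_002', 'fault_code': 'fault_01'},
    {'product_id': 'prod_002', 'fault_code': 'fault_05'},
]

def grp(group_rows):
    # Same shape as the pandas function above: filter rows and add a column.
    return [{'product_id': r['product_id'], 'a': 'AAA'}
            for r in group_rows if r['fault_code'] == 'fault_01']

# "applyInPandas": apply grp to each product_id slice, then combine the pieces.
rows.sort(key=itemgetter('product_id'))
result = [out for _, g in groupby(rows, key=itemgetter('product_id'))
          for out in grp(list(g))]
print(result)
# [{'product_id': 'prod_001', 'a': 'AAA'}, {'product_id': 'prod_002', 'a': 'AAA'}]
```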

Related

Spark Dataframe Combine 2 Columns into Single Column, with Additional Identifying Column

I'm trying to split and then combine 2 DataFrame columns into 1, with another column identifying which column it originated from. Here is the code to generate the sample DF
val data = Seq(("1", "in1,in2,in3", null), ("2","in4,in5","ex1,ex2,ex3"), ("3", null, "ex4,ex5"), ("4", null, null))
val df = spark.sparkContext.parallelize(data).toDF("id", "include", "exclude")
This is the sample DF
+---+-----------+-----------+
| id| include| exclude|
+---+-----------+-----------+
| 1|in1,in2,in3| null|
| 2| in4,in5|ex1,ex2,ex3|
| 3| null| ex4,ex5|
| 4| null| null|
+---+-----------+-----------+
which I'm trying to transform into
+---+----+---+
| id|type|col|
+---+----+---+
| 1|incl|in1|
| 1|incl|in2|
| 1|incl|in3|
| 2|incl|in4|
| 2|incl|in5|
| 2|excl|ex1|
| 2|excl|ex2|
| 2|excl|ex3|
| 3|excl|ex4|
| 3|excl|ex5|
+---+----+---+
EDIT: Should mention that the data inside each of the cells in the example DF is just for visualization, and doesn't need to have the form in1,ex1, etc.
I can get it to work with union, as so:
df.select($"id", lit("incl").as("type"), explode(split(col("include"), ",")))
.union(
df.select($"id", lit("excl").as("type"), explode(split(col("exclude"), ",")))
)
but I was wondering if this was possible to do without using union.
The approach I am thinking of is to club both the include and exclude columns together, then apply the explode function, then fetch only the values which are not null, and finally use a case statement. This might be a long process.
WITH cte AS (
    SELECT id, concat_ws(',', include, exclude) AS outputcol FROM SQL
),
ctes AS (
    SELECT id, explode(split(outputcol, ',')) AS finalcol FROM cte
)
SELECT id,
       CASE WHEN finalcol LIKE 'in%' THEN 'incl' ELSE 'excl' END AS type,
       finalcol
FROM ctes
WHERE finalcol IS NOT NULL AND finalcol != ''
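The intent of that sketch can be checked in plain Python: concatenate both columns, explode on commas, and label each piece. This toy version labels each value by the column it came from rather than by an 'in%' prefix, which also covers the edit's caveat that the cell contents need not start with in/ex:

```python
data = [('1', 'in1,in2,in3', None), ('2', 'in4,in5', 'ex1,ex2,ex3'),
        ('3', None, 'ex4,ex5'), ('4', None, None)]

result = []
for id_, include, exclude in data:
    for typ, cell in (('incl', include), ('excl', exclude)):
        # "explode(split(...))", null-safe: a null cell yields no rows
        for piece in (cell.split(',') if cell else []):
            result.append((id_, typ, piece))

print(result[:3])   # [('1', 'incl', 'in1'), ('1', 'incl', 'in2'), ('1', 'incl', 'in3')]
print(len(result))  # 10 -- id 4 contributes no rows, as in the target output
```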

Use PySpark Dataframe column in another spark sql query

I have a situation where I'm trying to query a table and use the result (dataframe) from that query as IN clause of another query.
From the first query I have the dataframe below:
+-----------------+
|key |
+-----------------+
| 10000000000004|
| 10000000000003|
| 10000000000008|
| 10000000000009|
| 10000000000007|
| 10000000000006|
| 10000000000010|
| 10000000000002|
+-----------------+
And now I want to run a query like the one below using the values of that dataframe dynamically instead of hard coding the values:
spark.sql("""select country from table1 where key in (10000000000004, 10000000000003, 10000000000008, 10000000000009, 10000000000007, 10000000000006, 10000000000010, 10000000000002)""").show()
I tried the following, however it didn't work:
df = spark.sql("""select key from table0 """)
a = df.select("key").collect()
spark.sql("""select country from table1 where key in ({0})""".format(a)).show()
Can somebody help me?
You should use an (inner) join between two data frames to get the countries you would like. See my example:
# Create a list of countries with Id's
countries = [('Netherlands', 1), ('France', 2), ('Germany', 3), ('Belgium', 4)]
# Create a list of Ids
numbers = [(1,), (2,)]
# Create two data frames
df_countries = spark.createDataFrame(countries, ['CountryName', 'Id'])
df_numbers = spark.createDataFrame(numbers, ['Id'])
The data frames look like the following:
df_countries:
+-----------+---+
|CountryName| Id|
+-----------+---+
|Netherlands| 1|
| France| 2|
| Germany| 3|
| Belgium| 4|
+-----------+---+
df_numbers:
+---+
| Id|
+---+
| 1|
| 2|
+---+
You can join them as follows:
df_countries.join(df_numbers, on='Id', how='inner').show()
Resulting in:
+---+-----------+
| Id|CountryName|
+---+-----------+
| 1|Netherlands|
| 2| France|
+---+-----------+
Hope that clears things up!
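For completeness, the string-building approach from the question can also be made to work: the missing piece is unpacking the Row objects that collect() returns before formatting them, since formatting the raw list produces "[Row(key=...), ...]" rather than a comma-separated list. The formatting step is plain Python (simulated here with dicts standing in for Row objects, which support the same ['key'] item access):

```python
# Stand-ins for the rows returned by df.select("key").collect()
collected = [{'key': 10000000000004}, {'key': 10000000000003}]

# Pull out the values and build the IN (...) list explicitly.
in_clause = ', '.join(str(r['key']) for r in collected)
query = "select country from table1 where key in ({0})".format(in_clause)
print(query)
# select country from table1 where key in (10000000000004, 10000000000003)
```

A join is still preferable for large key sets, since a very long IN list can blow up the query string.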

How to convert numerical values to a categorical variable using pyspark

I have a pyspark dataframe with a range of numerical variables.
For example, my dataframe has a column with values from 1 to 100:
1-10 - group1 <== the column values 1 to 10 should contain group1 as the value
11-20 - group2
.
.
.
91-100 - group10
How can I achieve this using a pyspark dataframe?
# Creating an arbitrary DataFrame
df = spark.createDataFrame([(1,54),(2,7),(3,72),(4,99)], ['ID','Var'])
df.show()
+---+---+
| ID|Var|
+---+---+
| 1| 54|
| 2| 7|
| 3| 72|
| 4| 99|
+---+---+
Once the DataFrame has been created, we use the floor() function to find the integral part of a number; for example, floor(15.5) is 15. We first subtract 1 from Var so that exact multiples of 10 land in the right bucket (otherwise 10 would fall into group2), take the integral part of (Var - 1)/10, and add 1 to it because the group indexing starts from 1, as opposed to 0. Finally, we need to prepend group to the value. Concatenation can be achieved with the concat() function, but keep in mind that the prepended word group is not a column, so we need to put it inside lit(), which creates a column of a literal value.
# Requisite packages needed
from pyspark.sql.functions import col, floor, lit, concat
df = df.withColumn('Var', concat(lit('group'), (1 + floor((col('Var') - 1) / 10))))
df.show()
+---+-------+
| ID| Var|
+---+-------+
| 1| group6|
| 2| group1|
| 3| group8|
| 4|group10|
+---+-------+
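The bucketing formula can be verified in plain Python, including the boundary cases that motivate subtracting 1 before dividing:

```python
from math import floor

def bucket(var):
    # 1-10 -> group1, 11-20 -> group2, ..., 91-100 -> group10
    return 'group{}'.format(1 + floor((var - 1) / 10))

print([bucket(v) for v in (54, 7, 72, 99)])  # ['group6', 'group1', 'group8', 'group10']
print(bucket(10), bucket(100))               # group1 group10 -- boundaries land correctly
```

Without the "- 1", bucket(10) would be group2 and bucket(100) would be group11.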

pyspark/dataframe - creating a nested structure

I'm using pyspark with dataframes and would like to create a nested structure as below
Before:
Column 1 | Column 2 | Column 3
--------------------------------
A | B | 1
A | B | 2
A | C | 1
After:
Column 1 | Column 4
--------------------------------
A | [B : [1,2]]
A | [C : [1]]
Is this doable?
I don't think you can get that exact output, but you can come close. The problem is your key names for column 4: in Spark, structs need to have a fixed set of columns known in advance. But let's leave that for later; first, the aggregation:
import pyspark
from pyspark.sql import functions as F
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
data = [('A', 'B', 1), ('A', 'B', 2), ('A', 'C', 1)]
columns = ['Column1', 'Column2', 'Column3']
data = spark.createDataFrame(data, columns)
data.createOrReplaceTempView("data")
data.show()
# Result
+-------+-------+-------+
|Column1|Column2|Column3|
+-------+-------+-------+
| A| B| 1|
| A| B| 2|
| A| C| 1|
+-------+-------+-------+
nested = spark.sql("SELECT Column1, Column2, STRUCT(COLLECT_LIST(Column3) AS data) AS Column4 FROM data GROUP BY Column1, Column2")
nested.toJSON().collect()
# Result
['{"Column1":"A","Column2":"C","Column4":{"data":[1]}}',
'{"Column1":"A","Column2":"B","Column4":{"data":[1,2]}}']
Which is almost what you want, right? The problem is that if you do not know your key names in advance (that is, the values in Column 2), Spark cannot determine the structure of your data. Also, I am not entirely sure how you can use the value of a column as key for a structure unless you use a UDF (maybe with a PIVOT?):
datatype = 'struct<B:array<bigint>,C:array<bigint>>'  # Add any other potential keys here.

@F.udf(datatype)
def replace_struct_name(column2_value, column4_value):
    return {column2_value: column4_value['data']}

nested.withColumn('Column4', replace_struct_name(F.col("Column2"), F.col("Column4"))).toJSON().collect()
# Output
['{"Column1":"A","Column2":"C","Column4":{"C":[1]}}',
'{"Column1":"A","Column2":"B","Column4":{"B":[1,2]}}']
This of course has the drawback that the number of keys must be discrete and known in advance, otherwise other key values will be silently ignored.
First, a reproducible example of your dataframe.
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

js = [{"col1": "A", "col2": "B", "col3": 1}, {"col1": "A", "col2": "B", "col3": 2}, {"col1": "A", "col2": "C", "col3": 1}]
jsrdd = sc.parallelize(js)
sqlContext = SQLContext(sc)
jsdf = sqlContext.read.json(jsrdd)
jsdf.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| B| 1|
| A| B| 2|
| A| C| 1|
+----+----+----+
Now, lists are not stored as key-value pairs. You can either use a dictionary or simply use collect_list() after doing a groupby on col2.
jsdf.groupby(['col1', 'col2']).agg(F.collect_list('col3')).show()
+----+----+------------------+
|col1|col2|collect_list(col3)|
+----+----+------------------+
| A| C| [1]|
| A| B| [1, 2]|
+----+----+------------------+
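The grouping behind collect_list can be sketched in plain Python, building the nested {Column2: [Column3 values]} shape the question asks for:

```python
data = [('A', 'B', 1), ('A', 'B', 2), ('A', 'C', 1)]

# Group by Column1, then by Column2, collecting Column3 values into lists.
nested = {}
for c1, c2, c3 in data:
    nested.setdefault(c1, {}).setdefault(c2, []).append(c3)

print(nested)  # {'A': {'B': [1, 2], 'C': [1]}}
```

In Spark itself, a shape whose keys vary by row needs a MapType column (or a UDF like the one in the first answer), since struct field names must be fixed in advance.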

How to merge duplicate rows using expressions in Spark Dataframes

How can I merge two data frames, removing duplicates by comparing columns?
I have two dataframes with the same column names.
a.show()
+-----+----------+--------+
| name| date|duration|
+-----+----------+--------+
| bob|2015-01-13| 4|
|alice|2015-04-23| 10|
+-----+----------+--------+
b.show()
+------+----------+--------+
| name| date|duration|
+------+----------+--------+
| bob|2015-01-12| 3|
|alice2|2015-04-13| 10|
+------+----------+--------+
What I am trying to do is merge the 2 dataframes to display only unique rows, by applying two conditions:
1. For the same name, the duration will be the sum of the durations.
2. For the same name, the final date will be the latest date.
Final output will be
final.show()
+------+----------+--------+
|  name|      date|duration|
+------+----------+--------+
|   bob|2015-01-13|       7|
| alice|2015-04-23|      10|
|alice2|2015-04-13|      10|
+------+----------+--------+
I followed this method:
// Take the union of the 2 dataframes
val df = a.unionAll(b)
// group and take the sum
val grouped = df.groupBy("name").agg($"name", sum("duration"))
// join
val j = df.join(grouped, "name").drop("duration").withColumnRenamed("sum(duration)", "duration")
and I got
+------+----------+--------+
| name| date|duration|
+------+----------+--------+
| bob|2015-01-13| 7|
| alice|2015-04-23| 10|
| bob|2015-01-12| 7|
|alice2|2015-04-23| 10|
+------+----------+--------+
How can I now remove duplicates by comparing dates? Would it be possible by running SQL queries after registering it as a table?
I am a beginner in SparkSQL and I feel like my way of approaching this problem is weird. Is there a better way to do this kind of data processing?
You can do max("date") inside the groupBy() aggregation; there is no need to join grouped with df.
// In 1.3.x, in order for the grouping column "name" to show up,
val grouped = df.groupBy("name").agg($"name",sum("duration"), max("date"))
// In 1.4+, grouping column "name" is included automatically.
val grouped = df.groupBy("name").agg(sum("duration"), max("date"))
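The whole merge (union, then sum of durations and max of dates per name) can be sketched in plain Python. ISO-formatted date strings compare correctly with max() because they sort lexicographically:

```python
a = [('bob', '2015-01-13', 4), ('alice', '2015-04-23', 10)]
b = [('bob', '2015-01-12', 3), ('alice2', '2015-04-13', 10)]

merged = {}
for name, date, duration in a + b:  # unionAll
    total, latest = merged.get(name, (0, ''))
    # sum("duration") and max("date") per name
    merged[name] = (total + duration, max(latest, date))

print(merged['bob'])     # (7, '2015-01-13')
print(merged['alice2'])  # (10, '2015-04-13')
```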