I have a table looking like this:
id         country  count  count_1
A36992434  MX       1      2
A36992434  ES       1      2
A00749707  ES       1      2
A00749707  MX       1      2
A10352704  PE       1      2
A10352704  ES       1      2
I would like to keep the IDs whose country column takes both the values ES and MX. So, in this case I would like to get an output showing the following:
id         country  count  count_1
A36992434  MX       1      2
A36992434  ES       1      2
A00749707  ES       1      2
A00749707  MX       1      2
Thank you very much!
You can create a countryAgg dataframe containing flags for both MX and ES by aggregating at the id level and marking it with arrays_overlap to check against each of the two countries.
Then use filter to keep only the ids containing both ES and MX, as below -
Data Preparation
from io import StringIO

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

s = StringIO("""
id country count count_1
A36992434 MX 1 2
A36992434 ES 1 2
A00749707 ES 1 2
A00749707 MX 1 2
A10352704 PE 1 2
A10352704 ES 1 2
""")

# Read the sample data into pandas, then convert it to a Spark DataFrame
df = pd.read_csv(s, sep=r'\s+')
sparkDF = spark.createDataFrame(df)
sparkDF.show()
+---------+-------+-----+-------+
| id|country|count|count_1|
+---------+-------+-----+-------+
|A36992434| MX| 1| 2|
|A36992434| ES| 1| 2|
|A00749707| ES| 1| 2|
|A00749707| MX| 1| 2|
|A10352704| PE| 1| 2|
|A10352704| ES| 1| 2|
+---------+-------+-----+-------+
Array Overlap Marking
# Collect the distinct set of countries seen for each id
countryAgg = sparkDF.groupBy(F.col('id')).agg(F.collect_set(F.col('country')).alias('country_set'))

# Flag whether the country set overlaps with ['MX'] and ['ES'] respectively
countryAgg = countryAgg.withColumn('country_check_mx', F.array(F.lit('MX')))\
                       .withColumn('country_check_es', F.array(F.lit('ES')))\
                       .withColumn('overlap_flag_mx',
                                   F.arrays_overlap(F.col('country_set'), F.col('country_check_mx')))\
                       .withColumn('overlap_flag_es',
                                   F.arrays_overlap(F.col('country_set'), F.col('country_check_es')))

countryAgg.show()
+---------+-----------+----------------+----------------+---------------+---------------+
| id|country_set|country_check_mx|country_check_es|overlap_flag_mx|overlap_flag_es|
+---------+-----------+----------------+----------------+---------------+---------------+
|A36992434| [MX, ES]| [MX]| [ES]| true| true|
|A00749707| [ES, MX]| [MX]| [ES]| true| true|
|A10352704| [ES, PE]| [MX]| [ES]| false| true|
+---------+-----------+----------------+----------------+---------------+---------------+
Joining
# Keep only the ids flagged for both countries, then join back to the original rows
countryAgg = countryAgg.filter(F.col('overlap_flag_mx') & F.col('overlap_flag_es'))

sparkDF.join(countryAgg,
             sparkDF['id'] == countryAgg['id'],
             'inner')\
       .select(sparkDF['*'])\
       .show()
+---------+-------+-----+-------+
| id|country|count|count_1|
+---------+-------+-----+-------+
|A36992434| MX| 1| 2|
|A36992434| ES| 1| 2|
|A00749707| ES| 1| 2|
|A00749707| MX| 1| 2|
+---------+-------+-----+-------+
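If you prefer to skip the explicit join, a window-based variant of the same idea works too. This is only a sketch built on the sparkDF above (not part of the original answer): collect the set of countries per id over a window and keep rows whose id has seen both MX and ES.
from pyspark.sql import functions as F
from pyspark.sql import Window as W

# Attach the per-id set of countries to every row, then filter on it directly
w = W.partitionBy('id')
result = (sparkDF
          .withColumn('country_set', F.collect_set('country').over(w))
          .filter(F.array_contains('country_set', 'MX') & F.array_contains('country_set', 'ES'))
          .drop('country_set'))
result.show()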
I have a PySpark Dataframe and my goal is to create a Flag column whose value depends on the value of the Amount column.
Basically, for each Group, I want to know whether, in any of the first three months, there is an amount greater than 0; if that is the case, the value of the Flag column will be 1 for the whole group, otherwise it will be 0.
I will include an example to clarify a bit better.
Initial PySpark Dataframe:
Group  Month  Amount
A      1      0
A      2      0
A      3      35
A      4      0
A      5      0
B      1      0
B      2      0
C      1      0
C      2      0
C      3      0
C      4      13
D      1      0
D      2      24
D      3      0
Final PySpark Dataframe:
Group  Month  Amount  Flag
A      1      0       1
A      2      0       1
A      3      35      1
A      4      0       1
A      5      0       1
B      1      0       0
B      2      0       0
C      1      0       0
C      2      0       0
C      3      0       0
C      4      13      0
D      1      0       1
D      2      24      1
D      3      0       1
Basically, what I want is, for each group, to sum the amounts of the first 3 months. If that sum is greater than 0, the flag is 1 for all the elements of the group; otherwise it is 0.
You can create the flag column by applying a Window function. Create a pseudo-column which becomes 1 if the criteria are met, then sum over the pseudo-column; if that sum is greater than 0, there was at least one row that met the criteria, so set the flag to 1.
from pyspark.sql import functions as F
from pyspark.sql import Window as W
data = [("A", 1, 0, ),
("A", 2, 0, ),
("A", 3, 35, ),
("A", 4, 0, ),
("A", 5, 0, ),
("B", 1, 0, ),
("B", 2, 0, ),
("C", 1, 0, ),
("C", 2, 0, ),
("C", 3, 0, ),
("C", 4, 13, ),
("D", 1, 0, ),
("D", 2, 24, ),
("D", 3, 0, ), ]
df = spark.createDataFrame(data, ("Group", "Month", "Amount", ))
ws = W.partitionBy("Group").orderBy("Month").rowsBetween(W.unboundedPreceding, W.unboundedFollowing)
criteria = F.when((F.col("Month") < 4) & (F.col("Amount") > 0), F.lit(1)).otherwise(F.lit(0))
(df.withColumn("flag", F.when(F.sum(criteria).over(ws) > 0, F.lit(1)).otherwise(F.lit(0)))
).show()
"""
+-----+-----+------+----+
|Group|Month|Amount|flag|
+-----+-----+------+----+
| A| 1| 0| 1|
| A| 2| 0| 1|
| A| 3| 35| 1|
| A| 4| 0| 1|
| A| 5| 0| 1|
| B| 1| 0| 0|
| B| 2| 0| 0|
| C| 1| 0| 0|
| C| 2| 0| 0|
| C| 3| 0| 0|
| C| 4| 13| 0|
| D| 1| 0| 1|
| D| 2| 24| 1|
| D| 3| 0| 1|
+-----+-----+------+----+
"""
You can use a Window function with count and when.
from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy('Group')
df = (df.withColumn('Flag', F.count(
          F.when((F.col('Month') < 4) & (F.col('Amount') > 0), True)).over(w))
        .withColumn('Flag', F.when(F.col('Flag') > 0, 1).otherwise(0)))
I have the data below, and final_column is the exact output I am trying to get. I am trying to do a cumulative sum of flag and want it to reset whenever flag is 0, i.e. set the value back to 0, as in the data below:
cola date flag final_column
a 2021-10-01 0 0
a 2021-10-02 1 1
a 2021-10-03 1 2
a 2021-10-04 0 0
a 2021-10-05 0 0
a 2021-10-06 0 0
a 2021-10-07 1 1
a 2021-10-08 1 2
a 2021-10-09 1 3
a 2021-10-10 0 0
b 2021-10-01 0 0
b 2021-10-02 1 1
b 2021-10-03 1 2
b 2021-10-04 0 0
b 2021-10-05 0 0
b 2021-10-06 1 1
b 2021-10-07 1 2
b 2021-10-08 1 3
b 2021-10-09 1 4
b 2021-10-10 0 0
I have tried:
import org.apache.spark.sql.functions._
df.withColumn("final_column", expr("sum(flag) over(partition by cola order by date asc)"))
I have tried to add a condition like case when flag = 0 then 0 else 1 end inside the sum function, but it is not working.
You can define a group column using a conditional sum on flag; then using row_number with a Window partitioned by cola and group gives the result you want:
import org.apache.spark.sql.expressions.Window
val result = df.withColumn(
"group",
sum(when(col("flag") === 0, 1).otherwise(0)).over(Window.partitionBy("cola").orderBy("date"))
).withColumn(
"final_column",
row_number().over(Window.partitionBy("cola", "group").orderBy("date")) - 1
).drop("group")
result.show
//+----+----------+----+------------+
//|cola|      date|flag|final_column|
//+----+----------+----+------------+
//|   b|2021-10-01|   0|           0|
//|   b|2021-10-02|   1|           1|
//|   b|2021-10-03|   1|           2|
//|   b|2021-10-04|   0|           0|
//|   b|2021-10-05|   0|           0|
//|   b|2021-10-06|   1|           1|
//|   b|2021-10-07|   1|           2|
//|   b|2021-10-08|   1|           3|
//|   b|2021-10-09|   1|           4|
//|   b|2021-10-10|   0|           0|
//|   a|2021-10-01|   0|           0|
//|   a|2021-10-02|   1|           1|
//|   a|2021-10-03|   1|           2|
//|   a|2021-10-04|   0|           0|
//|   a|2021-10-05|   0|           0|
//|   a|2021-10-06|   0|           0|
//|   a|2021-10-07|   1|           1|
//|   a|2021-10-08|   1|           2|
//|   a|2021-10-09|   1|           3|
//|   a|2021-10-10|   0|           0|
//+----+----------+----+------------+
row_number() - 1 in this case is just equivalent to sum(col("flag")) as flag values are always 0 or 1. So the above final_column can also be written as:
.withColumn(
"final_column",
sum(col("flag")).over(Window.partitionBy("cola", "group").orderBy("date"))
)
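For readers following the rest of this page in PySpark rather than Scala, here is a rough translation of the same group-and-sum idea. It is only a sketch, assuming a DataFrame df with cola, date and flag columns as in the question:
from pyspark.sql import functions as F
from pyspark.sql import Window as W

# A new group starts every time flag is 0; the running sum of those markers labels each group
order_w = W.partitionBy("cola").orderBy("date")
result = (df
          .withColumn("group", F.sum(F.when(F.col("flag") == 0, 1).otherwise(0)).over(order_w))
          .withColumn("final_column", F.sum("flag").over(W.partitionBy("cola", "group").orderBy("date")))
          .drop("group"))
result.show()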
I have two pyspark dataframes df1 and df2
df1
id1 id2 id3 x y
0 1 2 0.5 0.4
2 1 0 0.3 0.2
3 0 2 0.8 0.9
2 1 3 0.2 0.1
df2
id name
0 A
1 B
2 C
3 D
I would like to join the two dataframes and have
df3
id1 id2 id3 n1 n2 n3 x y
0 1 2 A B C 0.5 0.4
2 1 0 C B A 0.3 0.2
3 0 2 D A C 0.8 0.9
2 1 3 C B D 0.2 0.1
Here are the multiple joins.
df1.join(df2, df1['id1'] == df2['id'], 'left').drop('id').withColumnRenamed('name', 'n1') \
.join(df2, df1['id2'] == df2['id'], 'left').drop('id').withColumnRenamed('name', 'n2') \
.join(df2, df1['id3'] == df2['id'], 'left').drop('id').withColumnRenamed('name', 'n3') \
.show()
+---+---+---+---+---+---+---+---+
|id1|id2|id3| x| y| n1| n2| n3|
+---+---+---+---+---+---+---+---+
| 0| 1| 2|0.5|0.4| A| B| C|
| 2| 1| 0|0.3|0.2| C| B| A|
| 3| 0| 2|0.8|0.9| D| A| C|
| 2| 1| 3|0.2|0.1| C| B| D|
+---+---+---+---+---+---+---+---+
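If there were more id columns, the same pattern could be generated in a loop rather than spelled out join by join. This is only a sketch (the id_cols list and the _key aliases are made up for illustration), aliasing the lookup columns on each pass to avoid name clashes:
from functools import reduce
from pyspark.sql import functions as F

id_cols = ['id1', 'id2', 'id3']

# Fold over the id columns, joining the lookup table df2 once per column
df3 = reduce(
    lambda acc, ic: acc.join(
        df2.select(F.col('id').alias(ic + '_key'), F.col('name').alias('n' + ic[-1])),
        acc[ic] == F.col(ic + '_key'),
        'left'
    ).drop(ic + '_key'),
    id_cols,
    df1
)
df3.show()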
I have a dataframe, and I need to count the number of non-zero columns per row in PySpark.
ID COL1 COL2 COL3
1 0 1 -1
2 0 0 0
3 -17 20 15
4 23 1 0
Expected Output:
ID COL1 COL2 COL3 Count
1 0 1 -1 2
2 0 0 0 0
3 -17 20 15 3
4 23 1 0 2
There are various approaches to achieve this; below I am listing one of the simple ones -
df = sqlContext.createDataFrame([
[1, 0, 1, -1],
[2, 0, 0, 0],
[3, -17, 20, 15],
[4, 23, 1, 0]],
["ID", "COL1", "COL2", "COL3"]
)
# Check the column list, excluding the ID column
df.columns[1:]
['COL1', 'COL2', 'COL3']
# Import functions
from pyspark.sql import functions as F
# Add a count column: the sum of (1 if column != 0 else 0) across the value columns
df.withColumn(
"count",
sum([
F.when(F.col(cl) != 0, 1).otherwise(0) for cl in df.columns[1:]
])
).show()
+---+----+----+----+-----+
| ID|COL1|COL2|COL3|count|
+---+----+----+----+-----+
| 1| 0| 1| -1| 2|
| 2| 0| 0| 0| 0|
| 3| -17| 20| 15| 3|
| 4| 23| 1| 0| 2|
+---+----+----+----+-----+
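On Spark 3.1+ the same row-wise count can also be written with array higher-order functions instead of a Python-side sum of when expressions. A sketch over the same df (note that filter here is pyspark.sql.functions.filter, not DataFrame.filter):
from pyspark.sql import functions as F

value_cols = df.columns[1:]  # COL1, COL2, COL3

# Pack the value columns into an array, keep the non-zero entries, and count them
df.withColumn(
    "count",
    F.size(F.filter(F.array(*[F.col(c) for c in value_cols]), lambda x: x != 0))
).show()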