How to label encode a column in Spark Scala?

I have a very big dataframe like:
A B
a_1 b_1
a_2 b_2
a_3 b_3
a_1 b_4
a_2 b_4
a_2 b_2
I want to create a column for each unique value of B and set it to 1 if that B value occurs for the row's value of A. The expected result should look like this:
A B C_b_1 C_b_2 C_b_3 C_b_4
a_1 b_1 1 0 0 1
a_2 b_2 0 1 0 1
a_3 b_3 0 0 1 0
a_1 b_4 1 0 0 1
a_2 b_4 0 1 0 1
a_2 b_2 0 1 0 1
Explanation: For a_1 the distinct values of B are {b_1, b_4} and hence the columns corresponding to them are set to 1. For a_2 the distinct values of B are {b_2, b_4} and hence those columns are 1. Similarly for a_3.
The data is pretty huge: expect 'A' to have about 37,000 distinct values while 'B' has about 370. There are over 17 million records.

You can use df.stat.crosstab, and join it back to the original dataframe using the A column:
df.join(df.stat.crosstab("A","B").withColumnRenamed("A_B", "A"), "A").show
+---+---+---+---+---+---+
| A| B|b_1|b_2|b_3|b_4|
+---+---+---+---+---+---+
|a_3|b_3| 0| 0| 1| 0|
|a_2|b_2| 0| 2| 0| 1|
|a_2|b_4| 0| 2| 0| 1|
|a_2|b_2| 0| 2| 0| 1|
|a_1|b_4| 1| 0| 0| 1|
|a_1|b_1| 1| 0| 0| 1|
+---+---+---+---+---+---+
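Note that crosstab returns counts rather than 0/1 flags (hence the 2 under b_2 for the a_2 rows above). If you need the exact C_b_* layout from the expected output, a minimal sketch along these lines should work (ct and flagged are illustrative names, not part of the answer above):
import org.apache.spark.sql.functions._

// Cap each crosstab count at 1 and add the C_ prefix used in the expected output
val ct = df.stat.crosstab("A", "B").withColumnRenamed("A_B", "A")
val flagged = ct.columns.filter(_ != "A").foldLeft(ct) { (acc, c) =>
  acc.withColumn("C_" + c, when(col(c) > 0, 1).otherwise(0)).drop(c)
}
df.join(flagged, "A").show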

Related

How to filter IDs which meet two conditions over another column in pyspark?

I have a table looking like this:
id        country count count_1
A36992434 MX      1     2
A36992434 ES      1     2
A00749707 ES      1     2
A00749707 MX      1     2
A10352704 PE      1     2
A10352704 ES      1     2
I would like to keep the IDs whose country column takes both the values ES and MX. So, in this case I would like to get an output showing the following:
id        country count count_1
A36992434 MX      1     2
A36992434 ES      1     2
A00749707 ES      1     2
A00749707 MX      1     2
Thank you very much!
You can create a countryAgg dataframe which will contain flags for both MX and ES by aggregating at the id level and marking it with arrays_overlap to check against each of the two countries.
Then use filter to keep only the ids containing both ES and MX, as below -
Data Preparation
from io import StringIO
import pandas as pd
from pyspark.sql import functions as F

s = StringIO("""
id country count count_1
A36992434 MX 1 2
A36992434 ES 1 2
A00749707 ES 1 2
A00749707 MX 1 2
A10352704 PE 1 2
A10352704 ES 1 2
""")
df = pd.read_csv(s, delimiter='\t')
sparkDF = sql.createDataFrame(df)  # 'sql' is the SparkSession (or SQLContext)
sparkDF.show()
+---------+-------+-----+-------+
| id|country|count|count_1|
+---------+-------+-----+-------+
|A36992434| MX| 1| 2|
|A36992434| ES| 1| 2|
|A00749707| ES| 1| 2|
|A00749707| MX| 1| 2|
|A10352704| PE| 1| 2|
|A10352704| ES| 1| 2|
+---------+-------+-----+-------+
Array Overlap Marking
countryAgg = sparkDF.groupBy(F.col('id')).agg(F.collect_set(F.col('country')).alias('country_set'))
countryAgg = countryAgg.withColumn('country_check_mx', F.array(F.lit('MX')))\
    .withColumn('country_check_es', F.array(F.lit('ES')))\
    .withColumn('overlap_flag_mx',
                F.arrays_overlap(F.col('country_set'), F.col('country_check_mx')))\
    .withColumn('overlap_flag_es',
                F.arrays_overlap(F.col('country_set'), F.col('country_check_es')))
countryAgg.show()
+---------+-----------+----------------+----------------+---------------+---------------+
| id|country_set|country_check_mx|country_check_es|overlap_flag_mx|overlap_flag_es|
+---------+-----------+----------------+----------------+---------------+---------------+
|A36992434| [MX, ES]| [MX]| [ES]| true| true|
|A00749707| [ES, MX]| [MX]| [ES]| true| true|
|A10352704| [ES, PE]| [MX]| [ES]| false| true|
+---------+-----------+----------------+----------------+---------------+---------------+
Joining
countryAgg = countryAgg.filter((F.col('overlap_flag_mx') & F.col('overlap_flag_es')))
sparkDF.join(countryAgg,
             sparkDF['id'] == countryAgg['id'],
             'inner')\
       .select(sparkDF['*'])\
       .show()
+---------+-------+-----+-------+
| id|country|count|count_1|
+---------+-------+-----+-------+
|A36992434| MX| 1| 2|
|A36992434| ES| 1| 2|
|A00749707| ES| 1| 2|
|A00749707| MX| 1| 2|
+---------+-------+-----+-------+

Window function with PySpark

I have a PySpark Dataframe and my goal is to create a Flag column whose value depends on the value of the Amount column.
Basically, for each Group, I want to know whether any of the first three months has an amount greater than 0; if so, the value of the Flag column will be 1 for the whole group, otherwise it will be 0.
I will include an example to clarify a bit better.
Initial PySpark Dataframe:
Group Month Amount
A     1     0
A     2     0
A     3     35
A     4     0
A     5     0
B     1     0
B     2     0
C     1     0
C     2     0
C     3     0
C     4     13
D     1     0
D     2     24
D     3     0
Final PySpark Dataframe:
Group Month Amount Flag
A     1     0      1
A     2     0      1
A     3     35     1
A     4     0      1
A     5     0      1
B     1     0      0
B     2     0      0
C     1     0      0
C     2     0      0
C     3     0      0
C     4     13     0
D     1     0      1
D     2     24     1
D     3     0      1
Basically, what I want is, for each group, to sum the amount of the first 3 months. If that sum is greater than 0, the flag is 1 for all the elements of the group, otherwise it is 0.
You can create the flag column by applying a Window function. Create a pseudo-column that becomes 1 if the criteria are met, then sum over the pseudo-column; if the sum is greater than 0, at least one row met the criteria, so set the flag to 1.
from pyspark.sql import functions as F
from pyspark.sql import Window as W
data = [("A", 1, 0, ),
("A", 2, 0, ),
("A", 3, 35, ),
("A", 4, 0, ),
("A", 5, 0, ),
("B", 1, 0, ),
("B", 2, 0, ),
("C", 1, 0, ),
("C", 2, 0, ),
("C", 3, 0, ),
("C", 4, 13, ),
("D", 1, 0, ),
("D", 2, 24, ),
("D", 3, 0, ), ]
df = spark.createDataFrame(data, ("Group", "Month", "Amount", ))
ws = W.partitionBy("Group").orderBy("Month").rowsBetween(W.unboundedPreceding, W.unboundedFollowing)
criteria = F.when((F.col("Month") < 4) & (F.col("Amount") > 0), F.lit(1)).otherwise(F.lit(0))
(df.withColumn("flag", F.when(F.sum(criteria).over(ws) > 0, F.lit(1)).otherwise(F.lit(0)))
).show()
"""
+-----+-----+------+----+
|Group|Month|Amount|flag|
+-----+-----+------+----+
| A| 1| 0| 1|
| A| 2| 0| 1|
| A| 3| 35| 1|
| A| 4| 0| 1|
| A| 5| 0| 1|
| B| 1| 0| 0|
| B| 2| 0| 0|
| C| 1| 0| 0|
| C| 2| 0| 0|
| C| 3| 0| 0|
| C| 4| 13| 0|
| D| 1| 0| 1|
| D| 2| 24| 1|
| D| 3| 0| 1|
+-----+-----+------+----+
"""
You can also use a Window function with count and when.
from pyspark.sql import Window, functions as F

w = Window.partitionBy('Group')
df = (df.withColumn('Flag', F.count(
          F.when((F.col('Month') < 4) & (F.col('Amount') > 0), True)).over(w))
        .withColumn('Flag', F.when(F.col('Flag') > 0, 1).otherwise(0)))

How to do cumulative sum based on conditions in spark scala

I have the data below, and final_column is the exact output I am trying to get. I am trying to do a cumulative sum of flag and want it to reset: if flag is 0, the value is set back to 0, as in the data below.
cola date flag final_column
a 2021-10-01 0 0
a 2021-10-02 1 1
a 2021-10-03 1 2
a 2021-10-04 0 0
a 2021-10-05 0 0
a 2021-10-06 0 0
a 2021-10-07 1 1
a 2021-10-08 1 2
a 2021-10-09 1 3
a 2021-10-10 0 0
b 2021-10-01 0 0
b 2021-10-02 1 1
b 2021-10-03 1 2
b 2021-10-04 0 0
b 2021-10-05 0 0
b 2021-10-06 1 1
b 2021-10-07 1 2
b 2021-10-08 1 3
b 2021-10-09 1 4
b 2021-10-10 0 0
I have tried:
import org.apache.spark.sql.functions._
df.withColumn("final_column", expr("sum(flag) over(partition by cola order by date asc)"))
I have tried to add a condition like case when flag = 0 then 0 else 1 end inside the sum function, but it is not working.
You can define a group column using a conditional sum on flag; then row_number with a Window partitioned by cola and group gives the result you want:
import org.apache.spark.sql.expressions.Window
val result = df.withColumn(
  "group",
  sum(when(col("flag") === 0, 1).otherwise(0)).over(Window.partitionBy("cola").orderBy("date"))
).withColumn(
  "final_column",
  row_number().over(Window.partitionBy("cola", "group").orderBy("date")) - 1
).drop("group")
result.show
//+----+-----+----+------------+
//|cola| date|flag|final_column|
//+----+-----+----+------------+
//| b|44201| 0| 0|
//| b|44202| 1| 1|
//| b|44203| 1| 2|
//| b|44204| 0| 0|
//| b|44205| 0| 0|
//| b|44206| 1| 1|
//| b|44207| 1| 2|
//| b|44208| 1| 3|
//| b|44209| 1| 4|
//| b|44210| 0| 0|
//| a|44201| 0| 0|
//| a|44202| 1| 1|
//| a|44203| 1| 2|
//| a|44204| 0| 0|
//| a|44205| 0| 0|
//| a|44206| 0| 0|
//| a|44207| 1| 1|
//| a|44208| 1| 2|
//| a|44209| 1| 3|
//| a|44210| 0| 0|
//+----+-----+----+------------+
row_number() - 1 in this case is just equivalent to sum(col("flag")) as flag values are always 0 or 1. So the above final_column can also be written as:
.withColumn(
  "final_column",
  sum(col("flag")).over(Window.partitionBy("cola", "group").orderBy("date"))
)

Pyspark: how to join two dataframes over multiple columns?

I have two pyspark dataframes df1 and df2
df1
id1 id2 id3 x y
0 1 2 0.5 0.4
2 1 0 0.3 0.2
3 0 2 0.8 0.9
2 1 3 0.2 0.1
df2
id name
0 A
1 B
2 C
3 D
I would like to join the two dataframes and have
df3
id1 id2 id3 n1 n2 n3 x y
0 1 2 A B C 0.5 0.4
2 1 0 C B A 0.3 0.2
3 0 2 D A C 0.8 0.9
2 1 3 C B D 0.2 0.1
Here are the multiple joins:
df1.join(df2, df1['id1'] == df2['id'], 'left').drop('id').withColumnRenamed('name', 'n1') \
.join(df2, df1['id2'] == df2['id'], 'left').drop('id').withColumnRenamed('name', 'n2') \
.join(df2, df1['id3'] == df2['id'], 'left').drop('id').withColumnRenamed('name', 'n3') \
.show()
+---+---+---+---+---+---+---+---+
|id1|id2|id3| x| y| n1| n2| n3|
+---+---+---+---+---+---+---+---+
| 0| 1| 2|0.5|0.4| A| B| C|
| 2| 1| 0|0.3|0.2| C| B| A|
| 3| 0| 2|0.8|0.9| D| A| C|
| 2| 1| 3|0.2|0.1| C| B| D|
+---+---+---+---+---+---+---+---+

Pyspark - Count non zero columns in a spark data frame for each row

I have a dataframe and I need to count the number of non-zero columns per row in PySpark.
ID COL1 COL2 COL3
1 0 1 -1
2 0 0 0
3 -17 20 15
4 23 1 0
Expected Output:
ID COL1 COL2 COL3 Count
1 0 1 -1 2
2 0 0 0 0
3 -17 20 15 3
4 23 1 0 2
There are various approaches to achieve this; below is one of the simpler ones -
df = sqlContext.createDataFrame([
    [1, 0, 1, -1],
    [2, 0, 0, 0],
    [3, -17, 20, 15],
    [4, 23, 1, 0]],
    ["ID", "COL1", "COL2", "COL3"]
)

# Check the column list, leaving out the ID column
df.columns[1:]
['COL1', 'COL2', 'COL3']

# import functions
from pyspark.sql import functions as F

# Add a new column 'count': the sum of (1 if column != 0 else 0) over the non-ID columns
df.withColumn(
    "count",
    sum([
        F.when(F.col(cl) != 0, 1).otherwise(0) for cl in df.columns[1:]
    ])
).show()
+---+----+----+----+-----+
| ID|COL1|COL2|COL3|count|
+---+----+----+----+-----+
| 1| 0| 1| -1| 2|
| 2| 0| 0| 0| 0|
| 3| -17| 20| 15| 3|
| 4| 23| 1| 0| 2|
+---+----+----+----+-----+