Pyspark: how to join two dataframes over multiple columns?

I have two pyspark dataframes df1 and df2
df1
id1 id2 id3 x y
0 1 2 0.5 0.4
2 1 0 0.3 0.2
3 0 2 0.8 0.9
2 1 3 0.2 0.1
df2
id name
0 A
1 B
2 C
3 D
I would like to join the two dataframes and have
df3
id1 id2 id3 n1 n2 n3 x y
0 1 2 A B C 0.5 0.4
2 1 0 C B A 0.3 0.2
3 0 2 D A C 0.8 0.9
2 1 3 C B D 0.2 0.1

Here is how to do the multiple joins:
df1.join(df2, df1['id1'] == df2['id'], 'left').drop('id').withColumnRenamed('name', 'n1') \
.join(df2, df1['id2'] == df2['id'], 'left').drop('id').withColumnRenamed('name', 'n2') \
.join(df2, df1['id3'] == df2['id'], 'left').drop('id').withColumnRenamed('name', 'n3') \
.show()
+---+---+---+---+---+---+---+---+
|id1|id2|id3| x| y| n1| n2| n3|
+---+---+---+---+---+---+---+---+
| 0| 1| 2|0.5|0.4| A| B| C|
| 2| 1| 0|0.3|0.2| C| B| A|
| 3| 0| 2|0.8|0.9| D| A| C|
| 2| 1| 3|0.2|0.1| C| B| D|
+---+---+---+---+---+---+---+---+
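If there are more id columns, the same pattern can be folded over the column names instead of repeating the join by hand. This is only a sketch using the same df1/df2 as above; the lookup and _idN names are illustrative, and a final select puts the columns in the order shown in the expected df3:
from functools import reduce
from pyspark.sql import functions as F

id_cols = ["id1", "id2", "id3"]

def add_name(acc, pair):
    i, id_col = pair
    # Rename df2's columns on each pass so repeated joins don't clash on 'id'/'name'
    lookup = df2.select(F.col("id").alias(f"_id{i}"), F.col("name").alias(f"n{i}"))
    return acc.join(lookup, acc[id_col] == lookup[f"_id{i}"], "left").drop(f"_id{i}")

df3 = reduce(add_name, enumerate(id_cols, start=1), df1)
df3.select("id1", "id2", "id3", "n1", "n2", "n3", "x", "y").show()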

Related

Window function with PySpark

I have a PySpark Dataframe and my goal is to create a Flag column whose value depends on the value of the Amount column.
Basically, for each Group, I want to know if in any of the first three months there is an amount greater than 0. If that is the case, the value of the Flag column will be 1 for the whole group; otherwise it will be 0.
I will include an example to clarify a bit better.
Initial PySpark Dataframe:
Group Month Amount
A     1     0
A     2     0
A     3     35
A     4     0
A     5     0
B     1     0
B     2     0
C     1     0
C     2     0
C     3     0
C     4     13
D     1     0
D     2     24
D     3     0
Final PySpark Dataframe:
Group Month Amount Flag
A     1     0      1
A     2     0      1
A     3     35     1
A     4     0      1
A     5     0      1
B     1     0      0
B     2     0      0
C     1     0      0
C     2     0      0
C     3     0      0
C     4     13     0
D     1     0      1
D     2     24     1
D     3     0      1
Basically, what I want is for each group, to sum the amount of the first 3 months. If that sum is greater than 0, the flag is 1 for all the elements of the group, and otherwise is 0.
You can create the flag column by applying a Window function. Create a pseudo-column that becomes 1 when the criteria are met, then sum over the pseudo-column; if the sum is greater than 0, at least one row in the group met the criteria, so set the flag to 1.
from pyspark.sql import functions as F
from pyspark.sql import Window as W

data = [("A", 1, 0), ("A", 2, 0), ("A", 3, 35), ("A", 4, 0), ("A", 5, 0),
        ("B", 1, 0), ("B", 2, 0),
        ("C", 1, 0), ("C", 2, 0), ("C", 3, 0), ("C", 4, 13),
        ("D", 1, 0), ("D", 2, 24), ("D", 3, 0)]
df = spark.createDataFrame(data, ("Group", "Month", "Amount"))

# Window spanning the whole group, so every row sees the group-level sum
ws = W.partitionBy("Group").orderBy("Month").rowsBetween(W.unboundedPreceding, W.unboundedFollowing)
# Pseudo-column: 1 when the row is within the first three months and has a positive amount
criteria = F.when((F.col("Month") < 4) & (F.col("Amount") > 0), F.lit(1)).otherwise(F.lit(0))
(df.withColumn("flag", F.when(F.sum(criteria).over(ws) > 0, F.lit(1)).otherwise(F.lit(0)))
).show()
"""
+-----+-----+------+----+
|Group|Month|Amount|flag|
+-----+-----+------+----+
| A| 1| 0| 1|
| A| 2| 0| 1|
| A| 3| 35| 1|
| A| 4| 0| 1|
| A| 5| 0| 1|
| B| 1| 0| 0|
| B| 2| 0| 0|
| C| 1| 0| 0|
| C| 2| 0| 0|
| C| 3| 0| 0|
| C| 4| 13| 0|
| D| 1| 0| 1|
| D| 2| 24| 1|
| D| 3| 0| 1|
+-----+-----+------+----+
"""
You can also use a Window function with count and when. count only counts non-null values, and when without an otherwise returns null for non-matching rows, so the count is the number of rows in the group that meet the condition.
from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy('Group')
df = (df.withColumn('Flag', F.count(
          F.when((F.col('Month') < 4) & (F.col('Amount') > 0), True)).over(w))
        .withColumn('Flag', F.when(F.col('Flag') > 0, 1).otherwise(0)))

How to do cumulative sum based on conditions in spark scala

I have the data below, and final_column is the exact output I am trying to get. I am trying to do a cumulative sum of flag and want it to reset: whenever flag is 0, the value should be set back to 0, as in the data below.
cola date flag final_column
a 2021-10-01 0 0
a 2021-10-02 1 1
a 2021-10-03 1 2
a 2021-10-04 0 0
a 2021-10-05 0 0
a 2021-10-06 0 0
a 2021-10-07 1 1
a 2021-10-08 1 2
a 2021-10-09 1 3
a 2021-10-10 0 0
b 2021-10-01 0 0
b 2021-10-02 1 1
b 2021-10-03 1 2
b 2021-10-04 0 0
b 2021-10-05 0 0
b 2021-10-06 1 1
b 2021-10-07 1 2
b 2021-10-08 1 3
b 2021-10-09 1 4
b 2021-10-10 0 0
I have tried:
import org.apache.spark.sql.functions._
df.withColumn("final_column", expr("sum(flag) over(partition by cola order by date asc)"))
I have tried to add a condition like case when flag = 0 then 0 else 1 end inside the sum function, but it does not work.
You can define a group column using a conditional sum on flag; then row_number over a Window partitioned by cola and group gives the result you want:
import org.apache.spark.sql.expressions.Window
val result = df.withColumn(
"group",
sum(when(col("flag") === 0, 1).otherwise(0)).over(Window.partitionBy("cola").orderBy("date"))
).withColumn(
"final_column",
row_number().over(Window.partitionBy("cola", "group").orderBy("date")) - 1
).drop("group")
result.show
//+----+----------+----+------------+
//|cola|      date|flag|final_column|
//+----+----------+----+------------+
//|   b|2021-10-01|   0|           0|
//|   b|2021-10-02|   1|           1|
//|   b|2021-10-03|   1|           2|
//|   b|2021-10-04|   0|           0|
//|   b|2021-10-05|   0|           0|
//|   b|2021-10-06|   1|           1|
//|   b|2021-10-07|   1|           2|
//|   b|2021-10-08|   1|           3|
//|   b|2021-10-09|   1|           4|
//|   b|2021-10-10|   0|           0|
//|   a|2021-10-01|   0|           0|
//|   a|2021-10-02|   1|           1|
//|   a|2021-10-03|   1|           2|
//|   a|2021-10-04|   0|           0|
//|   a|2021-10-05|   0|           0|
//|   a|2021-10-06|   0|           0|
//|   a|2021-10-07|   1|           1|
//|   a|2021-10-08|   1|           2|
//|   a|2021-10-09|   1|           3|
//|   a|2021-10-10|   0|           0|
//+----+----------+----+------------+
row_number() - 1 in this case is just equivalent to sum(col("flag")) as flag values are always 0 or 1. So the above final_column can also be written as:
.withColumn(
"final_column",
sum(col("flag")).over(Window.partitionBy("cola", "group").orderBy("date"))
)

How to label encode for a column in spark scala?

I have a very big dataframe like:
A B
a_1 b_1
a_2 b_2
a_3 b_3
a_1 b_4
a_2 b_4
a_2 b_2
I want to create a column corresponding to each unique value of B and set it to 1 if that value of B occurs for the given value of A. The expected result should look like this:
A B C_b_1 C_b_2 C_b_3 C_b_4
a_1 b_1 1 0 0 1
a_2 b_2 0 1 0 1
a_3 b_3 0 0 1 0
a_1 b_4 1 0 0 1
a_2 b_4 0 1 0 1
a_2 b_2 0 1 0 1
Explanation: For a_1 the distinct values of B are {b_1, b_4} and hence the columns corresponding to them are set to 1. For a_2 the distinct values of B are {b_2, b_4} and hence those columns are 1. Similarly for a_3.
The data is pretty huge: expect 'A' to have about 37000 distinct values while 'B' has about 370. The number of records is over 17 million.
You can use df.stat.crosstab, and join it back to the original dataframe using the A column:
df.join(df.stat.crosstab("A","B").withColumnRenamed("A_B", "A"), "A").show
+---+---+---+---+---+---+
| A| B|b_1|b_2|b_3|b_4|
+---+---+---+---+---+---+
|a_3|b_3| 0| 0| 1| 0|
|a_2|b_2| 0| 2| 0| 1|
|a_2|b_4| 0| 2| 0| 1|
|a_2|b_2| 0| 2| 0| 1|
|a_1|b_4| 1| 0| 0| 1|
|a_1|b_1| 1| 0| 0| 1|
+---+---+---+---+---+---+

Is there an easier way to combine 100+ PySpark dataframes with different columns (not merge, but append)?

Suppose I have a lot of dataframes with similar structure but different columns. I want to combine all of them together; how can I do that in an easier way?
for example, df1, df2, df3 are as follows:
df1
id base1 base2 col1 col2 col3 col4
1 1 100 30 1 2 3
2 2 200 40 2 3 4
3 3 300 20 4 4 5
df2
id base1 base2 col1
5 4 100 15
6 1 99 18
7 2 89 9
df3
id base1 base2 col1 col2
9 2 77 12 3
10 1 89 16 5
11 2 88 10 7
to be:
id base1 base2 col1 col2 col3 col4
1 1 100 30 1 2 3
2 2 200 40 2 3 4
3 3 300 20 4 4 5
5 4 100 15 NaN NaN NaN
6 1 99 18 NaN NaN NaN
7 2 89 9 NaN NaN NaN
9 2 77 12 3 NaN NaN
10 1 89 16 5 NaN NaN
11 2 88 10 7 NaN NaN
currently I use this code:
from pyspark.sql import SparkSession, HiveContext
from pyspark.sql.functions import lit
from pyspark.sql import Row
def customUnion(df1, df2):
    cols1 = df1.columns
    cols2 = df2.columns
    total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))
    def expr(mycols, allcols):
        def processCols(colname):
            if colname in mycols:
                return colname
            else:
                return lit(None).alias(colname)
        cols = map(processCols, allcols)
        return list(cols)
    appended = df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))
    return appended
df_comb1=customUnion(df1,df2)
df_comb2=customUnion(df_comb1,df3)
However, if I keep creating new dataframes like df4, df5, etc. (100+),
my code becomes messy.
Is there a way to code this in an easier way?
Thanks in advance
You can manage this with a list of data frames and a function, without necessarily needing to statically name each data frame...
dataframes = [df1,df2,df3] # load data frames
Compute the set of all possible columns:
all_cols = {i for lst in [df.columns for df in dataframes] for i in lst}
#{'base1', 'base2', 'col1', 'col2', 'col3', 'col4', 'id'}
A function to add missing columns to a DF:
from pyspark.sql import functions as f

def add_missing_cols(df, cols):
    v = df
    for col in [c for c in cols if c not in df.columns]:
        v = v.withColumn(col, f.lit(None))
    return v
completed_dfs = [add_missing_cols(df, all_cols) for df in dataframes]
res = completed_dfs[0]
for df in completed_dfs[1:]:
res = res.unionAll(df)
res.show()
+---+-----+-----+----+----+----+----+
| id|base1|base2|col1|col2|col3|col4|
+---+-----+-----+----+----+----+----+
| 1| 1| 100| 30| 1| 2| 3|
| 2| 2| 200| 40| 2| 3| 4|
| 3| 3| 300| 20| 4| 4| 5|
| 5| 4| 100| 15|null|null|null|
| 6| 1| 99| 18|null|null|null|
| 7| 2| 89| 9|null|null|null|
| 9| 2| 77| 12| 3|null|null|
| 10| 1| 89| 16| 5|null|null|
| 11| 2| 88| 10| 7|null|null|
+---+-----+-----+----+----+----+----+
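On Spark 3.1+ the column-completion step can usually be skipped, because DataFrame.unionByName accepts allowMissingColumns=True and fills in missing columns with nulls. A minimal sketch of that variant, assuming the same dataframes list as above:
from functools import reduce

# unionByName matches columns by name; with allowMissingColumns=True (Spark 3.1+)
# any column absent from one side is added as null
res = reduce(lambda left, right: left.unionByName(right, allowMissingColumns=True), dataframes)
res.show()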

Pyspark - Count non zero columns in a spark data frame for each row

I have a dataframe, and I need to count the number of non-zero columns per row in PySpark.
ID COL1 COL2 COL3
1 0 1 -1
2 0 0 0
3 -17 20 15
4 23 1 0
Expected Output:
ID COL1 COL2 COL3 Count
1 0 1 -1 2
2 0 0 0 0
3 -17 20 15 3
4 23 1 0 2
There are various approaches to achieve this; below is one of the simpler ones -
df = sqlContext.createDataFrame([
    [1, 0, 1, -1],
    [2, 0, 0, 0],
    [3, -17, 20, 15],
    [4, 23, 1, 0]],
    ["ID", "COL1", "COL2", "COL3"]
)
# Check the column list, excluding the ID column
df.columns[1:]
# ['COL1', 'COL2', 'COL3']
# import functions
from pyspark.sql import functions as F
# Add a new column "count" that sums (1 if column != 0 else 0) over the value columns
df.withColumn(
    "count",
    sum([
        F.when(F.col(cl) != 0, 1).otherwise(0) for cl in df.columns[1:]
    ])
).show()
+---+----+----+----+-----+
| ID|COL1|COL2|COL3|count|
+---+----+----+----+-----+
| 1| 0| 1| -1| 2|
| 2| 0| 0| 0| 0|
| 3| -17| 20| 15| 3|
| 4| 23| 1| 0| 2|
+---+----+----+----+-----+
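An alternative to summing the when expressions is to pack the value columns into an array and strip out the zeros; F.array_remove is available from Spark 2.4. A sketch under that assumption, reusing the df defined above:
from pyspark.sql import functions as F

value_cols = df.columns[1:]  # everything except ID

# Build an array of the values, remove the zeros, and count what is left
df.withColumn(
    "count",
    F.size(F.array_remove(F.array(*[F.col(c) for c in value_cols]), 0))
).show()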