Is there any easier way to combine 100+ PySpark dataframes with different columns (not merge, but append)? - pyspark

Suppose I have a lot of dataframes with a similar structure but different columns. I want to combine all of them together; how can I do this in an easier way?
For example, df1, df2, df3 are as follows:
df1
id base1 base2 col1 col2 col3 col4
1 1 100 30 1 2 3
2 2 200 40 2 3 4
3 3 300 20 4 4 5
df2
id base1 base2 col1
5 4 100 15
6 1 99 18
7 2 89 9
df3
id base1 base2 col1 col2
9 2 77 12 3
10 1 89 16 5
11 2 88 10 7
to be:
id base1 base2 col1 col2 col3 col4
1 1 100 30 1 2 3
2 2 200 40 2 3 4
3 3 300 20 4 4 5
5 4 100 15 NaN NaN NaN
6 1 99 18 NaN NaN NaN
7 2 89 9 NaN NaN NaN
9 2 77 12 3 NaN NaN
10 1 89 16 5 NaN NaN
11 2 88 10 7 NaN NaN
Currently I use this code:
from pyspark.sql import SparkSession, HiveContext
from pyspark.sql.functions import lit
from pyspark.sql import Row

def customUnion(df1, df2):
    cols1 = df1.columns
    cols2 = df2.columns
    total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))
    def expr(mycols, allcols):
        def processCols(colname):
            if colname in mycols:
                return colname
            else:
                return lit(None).alias(colname)
        cols = map(processCols, allcols)
        return list(cols)
    appended = df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))
    return appended

df_comb1 = customUnion(df1, df2)
df_comb2 = customUnion(df_comb1, df3)
However, if I keep creating new dataframes like df4, df5, etc. (100+), my code becomes messy.
Is there a way to code this in an easier way?
Thanks in advance.

You can manage this with a list of data frames and a function, without necessarily needing to statically name each data frame...
dataframes = [df1,df2,df3] # load data frames
Compute the set of all possible columns:
all_cols = {i for lst in [df.columns for df in dataframes] for i in lst}
#{'base1', 'base2', 'col1', 'col2', 'col3', 'col4', 'id'}
A function to add missing columns to a DF:
import pyspark.sql.functions as f  # needed for f.lit below

def add_missing_cols(df, cols):
    v = df
    for col in [c for c in cols if c not in df.columns]:
        v = v.withColumn(col, f.lit(None))
    return v
completed_dfs = [add_missing_cols(df, all_cols) for df in dataframes]
res = completed_dfs[0]
for df in completed_dfs[1:]:
    res = res.unionAll(df)
res.show()
+---+-----+-----+----+----+----+----+
| id|base1|base2|col1|col2|col3|col4|
+---+-----+-----+----+----+----+----+
| 1| 1| 100| 30| 1| 2| 3|
| 2| 2| 200| 40| 2| 3| 4|
| 3| 3| 300| 20| 4| 4| 5|
| 5| 4| 100| 15|null|null|null|
| 6| 1| 99| 18|null|null|null|
| 7| 2| 89| 9|null|null|null|
| 9| 2| 77| 12| 3|null|null|
| 10| 1| 89| 16| 5|null|null|
| 11| 2| 88| 10| 7|null|null|
+---+-----+-----+----+----+----+----+
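A side note, not part of the original answer: unionAll is just a legacy alias of union, and the manual loop can be collapsed with functools.reduce. On Spark 3.1+ you can also skip the column-padding helper entirely, because unionByName can fill in missing columns with nulls itself. A minimal sketch, assuming the same dataframes list as above:

from functools import reduce

# Option 1: fold the asker's customUnion helper over the whole list (any Spark version).
combined = reduce(customUnion, dataframes)

# Option 2: Spark 3.1+ only - unionByName pads missing columns with nulls by itself.
combined = reduce(
    lambda left, right: left.unionByName(right, allowMissingColumns=True),
    dataframes)
combined.show()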

Related

How to generate a serial number column for the duplicated values of one column in PySpark

How do I create a column dup_DUNS_NUMBER holding an index number for the duplicated values, increasing by 1 for each new duplicated value?
DUNS_NUMBER dup_DUNS_NUMBER
0 0 1.0
1 0 1.0
2 0 1.0
3 0 1.0
4 1000231 NaN
5 1000236 NaN
6 1000363 2.0
7 1000363 2.0
8 1000368 NaN
9 1000467 NaN
10 1000470 3.0
11 1000470 3.0
12 1000470 3.0
13 1000553 4.0
14 1000553 4.0
15 1000574 NaN
16 1000657 5.0
17 1000657 5.0
18 1000694 NaN
19 1000744 6.0
20 1000744 6.0
I suggest you do it in two steps: first filter the duplicated values and rank them to get the increasing index number, then union the non-duplicated values back in, like so:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, count, lit, dense_rank

df = spark.createDataFrame(
    [(0,), (0,), (0,), (0,), (1000231,), (1000236,), (1000363,), (1000363,)],
    "DUNS_NUMBER: int")

count_window = Window.partitionBy("DUNS_NUMBER")
order_window = Window.partitionBy().orderBy("DUNS_NUMBER")

df_with_count = df.withColumn("dup_COUNT", count(col("DUNS_NUMBER")).over(count_window))

df_with_count \
    .filter("dup_COUNT > 1") \
    .withColumn("dup_DUNS_NUMBER", dense_rank().over(order_window)) \
    .union(
        df_with_count.filter("dup_COUNT = 1").withColumn("dup_DUNS_NUMBER", lit("NaN"))) \
    .select(["DUNS_NUMBER", "dup_DUNS_NUMBER"]) \
    .orderBy("DUNS_NUMBER") \
    .show()
+-----------+---------------+
|DUNS_NUMBER|dup_DUNS_NUMBER|
+-----------+---------------+
| 0| 1|
| 0| 1|
| 0| 1|
| 0| 1|
| 1000231| NaN|
| 1000236| NaN|
| 1000363| 2|
| 1000363| 2|
+-----------+---------------+
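Note that the union coerces dup_DUNS_NUMBER to a string column, because one branch produces an integer rank and the other the literal string "NaN". If you would rather keep the column numeric, a small variant (my sketch, not part of the original answer) is to use a typed NULL for the non-duplicated rows:

# Same two-step approach, but with a NULL integer instead of the string "NaN",
# so dup_DUNS_NUMBER stays an integer column after the union.
df_with_count \
    .filter("dup_COUNT > 1") \
    .withColumn("dup_DUNS_NUMBER", dense_rank().over(order_window)) \
    .union(
        df_with_count.filter("dup_COUNT = 1")
                     .withColumn("dup_DUNS_NUMBER", lit(None).cast("int"))) \
    .select("DUNS_NUMBER", "dup_DUNS_NUMBER") \
    .orderBy("DUNS_NUMBER") \
    .show()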

PySpark: pivot a table and create new columns based on the values of another column

I am trying to create new versions of existing columns based on the values of another column. E.g. in the input below, I create new columns for 'var1', 'var2', 'var3' for each value the 'split' column can take.
Input:
time  student   split  var1  var2  var3
t1    Student1  A      1     3     7
t1    Student1  B      2     5     6
t1    Student1  C      3     1     9
t2    Student1  A      5     3     7
t2    Student1  B      9     6     3
t2    Student1  C      3     5     3
t1    Student2  A      1     2     8
t1    Student2  C      7     4     0
Output:
time  student   splitA_var1  splitA_var2  splitA_var3  splitB_var1  splitB_var2  splitB_var3  splitC_var1  splitC_var2  splitC_var3
t1    Student1  1            3            7            2            5            6            3            1            9
t2    Student1  5            3            7            9            6            3            3            5            3
t1    Student2  1            2            8                                                    7            4            0
This is a straightforward pivot with multiple aggregations (within agg()); see the example below.
import pyspark.sql.functions as func

data_sdf. \
    withColumn('pivot_col', func.concat(func.lit('split'), 'split')). \
    groupBy('time', 'student'). \
    pivot('pivot_col'). \
    agg(func.first('var1').alias('var1'),
        func.first('var2').alias('var2'),
        func.first('var3').alias('var3')). \
    show()
# +----+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
# |time| student|splitA_var1|splitA_var2|splitA_var3|splitB_var1|splitB_var2|splitB_var3|splitC_var1|splitC_var2|splitC_var3|
# +----+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
# | t1|Student2| 1| 2| 8| null| null| null| 7| 4| 0|
# | t2|Student1| 5| 3| 7| 9| 6| 3| 3| 5| 3|
# | t1|Student1| 1| 3| 7| 2| 5| 6| 3| 1| 9|
# +----+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
Spark names the new columns with the following nomenclature: <pivot column value>_<aggregation alias>.
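For reference, data_sdf is assumed to already exist in the answer above; a minimal reconstruction of it from the question's input (my assumption, not part of the original answer) could look like the sketch below. Listing the pivot values explicitly is an optional tweak that spares Spark a pass to discover the distinct values.

# Hypothetical reconstruction of data_sdf from the question's input table.
data_sdf = spark.createDataFrame(
    [('t1', 'Student1', 'A', 1, 3, 7),
     ('t1', 'Student1', 'B', 2, 5, 6),
     ('t1', 'Student1', 'C', 3, 1, 9),
     ('t2', 'Student1', 'A', 5, 3, 7),
     ('t2', 'Student1', 'B', 9, 6, 3),
     ('t2', 'Student1', 'C', 3, 5, 3),
     ('t1', 'Student2', 'A', 1, 2, 8),
     ('t1', 'Student2', 'C', 7, 4, 0)],
    ['time', 'student', 'split', 'var1', 'var2', 'var3'])

# Optional: pass the expected pivot values up front, e.g.
#   .pivot('pivot_col', ['splitA', 'splitB', 'splitC'])
# so Spark does not have to scan the data for the distinct values first.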

Calculate variance across columns in pyspark

How can I calculate variance across numerous columns in PySpark?
For example, if the pyspark.sql.DataFrame table is:
ID A B C
1 12 15 7
2 6 15 2
3 56 25 25
4 36 12 5
and output needed is
ID A B C Variance
1 12 15 7 10.9
2 6 15 2 29.6
3 56 25 25 213.6
4 36 12 5 176.2
There is a variance function in PySpark, but it works only column-wise.
Just concatenate the columns you need using the concat_ws function and use a UDF to calculate the variance, like below:
from pyspark.sql.functions import *
from pyspark.sql.types import *
from statistics import pvariance

def calculateVar(row):
    data = [float(x.strip()) for x in row.split(",")]
    return pvariance(data)

varUDF = udf(calculateVar, FloatType())

df.withColumn('Variance', varUDF(concat_ws(",", df.a, df.b, df.c))).show()
Output:
+---+---+---+---+---------+
| id| a| b| c| Variance|
+---+---+---+---+---------+
| 1| 12| 15| 7|10.888889|
| 2| 6| 15| 2|29.555555|
| 3| 56| 25| 25|213.55556|
| 4| 36| 12| 5|176.22223|
+---+---+---+---+---------+
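A UDF-free alternative (a sketch of mine, assuming the same A, B and C columns) is to build the population variance from plain column expressions, which avoids both the string round-trip and the Python UDF overhead:

from pyspark.sql.functions import col

# Row-wise population variance of A, B and C using only column arithmetic.
cols = ["A", "B", "C"]
n = len(cols)
row_mean = sum(col(c) for c in cols) / n
row_var = sum((col(c) - row_mean) ** 2 for c in cols) / n

df.withColumn("Variance", row_var).show()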

Spark not isin alternative [duplicate]

I have two dataframes. I want to delete some records in Data Frame-A based on some common column values in Data Frame-B.
For Example:
Data Frame-A:
A B C D
1 2 3 4
3 4 5 7
4 7 9 6
2 5 7 9
Data Frame-B:
A B C D
1 2 3 7
2 5 7 4
2 9 8 7
Keys: A,B,C columns
Desired Output:
A B C D
3 4 5 7
4 7 9 6
Any solution for this?
You are looking for a left anti-join:
df_a.join(df_b, Seq("A","B","C"), "leftanti").show()
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 3| 4| 5| 7|
| 4| 7| 9| 6|
+---+---+---+---+
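The snippet above uses the Scala API (Seq(...)). In PySpark, the equivalent (a minimal sketch, assuming the same df_a and df_b) passes a plain list of join columns:

# Left anti-join in PySpark: keep rows of df_a that have no match in df_b on A, B, C.
df_a.join(df_b, ["A", "B", "C"], "left_anti").show()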
