Custom order a pyspark dataframe using a column - pyspark

I have a pyspark dataframe df:
I want to prioritize items based on the Type column in this order: AAIC > AAFC > TBIC > TBFC, and among them using the RANK column, i.e. items with a lower rank are prioritized within the above order groups.
Any values in the Type column other than AAIC, AAFC, TBIC or TBFC I want to relabel as NON.
ITEM  Type  RANK
1     AAIC  11
2     AAFC  8
3     TBIC  2
4     TBFC  1
5     XYZ   5
6     AAIC  7
7     JHK   10
8     SWE   3
9     TBIC  4
10    AAFC  9
11    AAFC  6
Desired pyspark dataframe df:
ITEM  Type  RANK  NEW_RANK
6     AAIC  7     1
1     AAIC  11    2
11    AAFC  6     3
2     AAFC  8     4
10    AAFC  9     5
3     TBIC  2     6
9     TBIC  4     7
4     TBFC  1     8
8     NON   3     9
5     NON   5     10
7     NON   10    11

You may check this code:
import pyspark.sql.functions as F
from pyspark.sql import Window

inputData = [
    (1, "AAIC", 11),
    (2, "AAFC", 8),
    (3, "TBIC", 2),
    (4, "TBFC", 1),
    (5, "XYZ", 5),
    (6, "AAIC", 7),
    (7, "JHK", 10),
    (8, "SWE", 3),
    (9, "TBIC", 4),
    (10, "AAFC", 9),
    (11, "AAFC", 6),
]
inputDf = spark.createDataFrame(inputData, schema=["item", "type", "rank"])

preprocessedDf = inputDf.withColumn(
    "type",
    F.when(
        F.col("type").isin(["AAIC", "AAFC", "TBIC", "TBFC"]), F.col("type")
    ).otherwise(F.lit("NON")),
).withColumn(
    "priority",
    F.when(F.col("type") == F.lit("AAIC"), 1).otherwise(
        F.when(F.col("type") == F.lit("AAFC"), 2).otherwise(
            F.when(F.col("type") == F.lit("TBIC"), 3).otherwise(
                F.when(F.col("type") == F.lit("TBFC"), 4).otherwise(F.lit(5))
            )
        )
    ),
)

windowSpec = Window.partitionBy().orderBy("priority", "rank")

preprocessedDf.withColumn("NEW_RANK", F.row_number().over(windowSpec)).drop(
    "priority"
).show()
Priorities for the codes are hardcoded, which may be hard to maintain if more values come along. You may want to adjust this part if it needs to be more flexible.
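For example, a minimal sketch of a more data-driven variant (the dict name and its ordering here are only illustrative, not part of the original answer): keep the ordering in a plain Python dict and derive both the relabelling and the priority column from it, so adding a new code only means extending the dict.

import pyspark.sql.functions as F

# Illustrative mapping; extend it instead of editing nested when/otherwise calls.
type_priority = {"AAIC": 1, "AAFC": 2, "TBIC": 3, "TBFC": 4}
default_priority = len(type_priority) + 1  # everything else becomes "NON"

# Fold the dict into a when-chain; conditions are mutually exclusive,
# so the fold order does not matter.
priority_col = F.lit(default_priority)
for code, prio in type_priority.items():
    priority_col = F.when(F.col("type") == code, prio).otherwise(priority_col)

preprocessedDf = inputDf.withColumn(
    "type",
    F.when(F.col("type").isin(list(type_priority)), F.col("type")).otherwise(F.lit("NON")),
).withColumn("priority", priority_col)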
I am moving all records to one partition to calculate the correct row_number. This is a common problem: it is hard to calculate consistent ids with a given order in a distributed manner. If your dataset is big, you may need to consider something else, probably more complicated.
output:
+----+----+----+--------+
|item|type|rank|NEW_RANK|
+----+----+----+--------+
| 6|AAIC| 7| 1|
| 1|AAIC| 11| 2|
| 11|AAFC| 6| 3|
| 2|AAFC| 8| 4|
| 10|AAFC| 9| 5|
| 3|TBIC| 2| 6|
| 9|TBIC| 4| 7|
| 4|TBFC| 1| 8|
| 8| NON| 3| 9|
| 5| NON| 5| 10|
| 7| NON| 10| 11|
+----+----+----+--------+
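Regarding the single-partition note above: if that shuffle ever becomes a bottleneck, one possible workaround (a sketch of an alternative, not part of the original answer) is to rank within each priority group in parallel and then shift each group by the total size of all higher-priority groups, so only the tiny per-group counts need a global ordering:

import pyspark.sql.functions as F
from pyspark.sql import Window

# Rank within each priority group; groups are processed in parallel.
perGroup = Window.partitionBy("priority").orderBy("rank")
ranked = preprocessedDf.withColumn("group_rank", F.row_number().over(perGroup))

# Offsets: cumulative size of all higher-priority groups. The un-partitioned
# window here only sees one row per priority value, so it stays cheap.
offsets = (
    preprocessedDf.groupBy("priority").count()
    .withColumn(
        "offset",
        F.coalesce(
            F.sum("count").over(
                Window.orderBy("priority").rowsBetween(Window.unboundedPreceding, -1)
            ),
            F.lit(0),
        ),
    )
)

result = (
    ranked.join(offsets.select("priority", "offset"), "priority")
    .withColumn("NEW_RANK", F.col("group_rank") + F.col("offset"))
    .drop("priority", "group_rank", "offset")
)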

Related

Pyspark pivot table and create new columns based on values of another column

I am trying to create new versions of existing columns based on the values of another column. E.g. from the input below, I create new versions of 'var1', 'var2', 'var3' for each value the 'split' column can take.
Input:
time  student   split  var1  var2  var3
t1    Student1  A      1     3     7
t1    Student1  B      2     5     6
t1    Student1  C      3     1     9
t2    Student1  A      5     3     7
t2    Student1  B      9     6     3
t2    Student1  C      3     5     3
t1    Student2  A      1     2     8
t1    Student2  C      7     4     0
Output:
time  student   splitA_var1  splitA_var2  splitA_var3  splitB_var1  splitB_var2  splitB_var3  splitC_var1  splitC_var2  splitC_var3
t1    Student1  1            3            7            2            5            6            3            1            9
t2    Student1  5            3            7            9            6            3            3            5            3
t1    Student2  1            2            8            null         null         null         7            4            0
This is a straightforward pivot with multiple aggregations (within agg()).
See the example below.
import pyspark.sql.functions as func

data_sdf. \
    withColumn('pivot_col', func.concat(func.lit('split'), 'split')). \
    groupBy('time', 'student'). \
    pivot('pivot_col'). \
    agg(func.first('var1').alias('var1'),
        func.first('var2').alias('var2'),
        func.first('var3').alias('var3')
        ). \
    show()
# +----+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
# |time| student|splitA_var1|splitA_var2|splitA_var3|splitB_var1|splitB_var2|splitB_var3|splitC_var1|splitC_var2|splitC_var3|
# +----+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
# | t1|Student2| 1| 2| 8| null| null| null| 7| 4| 0|
# | t2|Student1| 5| 3| 7| 9| 6| 3| 3| 5| 3|
# | t1|Student1| 1| 3| 7| 2| 5| 6| 3| 1| 9|
# +----+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
Spark will create the new columns with the following nomenclature: <pivot column value>_<aggregation alias>.
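As a side note (an optional tweak, assuming the split values are known up front), passing the pivot values explicitly to pivot() saves Spark an extra pass over the data to discover them:

import pyspark.sql.functions as func

known_splits = ['splitA', 'splitB', 'splitC']  # assumed to be known in advance

data_sdf. \
    withColumn('pivot_col', func.concat(func.lit('split'), 'split')). \
    groupBy('time', 'student'). \
    pivot('pivot_col', known_splits). \
    agg(func.first('var1').alias('var1'),
        func.first('var2').alias('var2'),
        func.first('var3').alias('var3')
        ). \
    show()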

How to filter a column and validate another column using that filtered data frame while using Foundry Data Expectations?

I'm using Data Expectations to validate whether a specific column satisfies some required condition. I was able to write the code for checking whether a column is unique, but I'm not able to write the code for the case where a column is filtered first and then, on that resulting dataframe, another column is checked for uniqueness.
For instance, see the two scenarios below; in both of them we need to check whether department_id = "CSE" has unique roll_no values:
Scenario 1:
reg_no  department_id  roll_no
1       CSE            1
2       ECE            1
3       ECE            2
4       CSE            2
5       ME             1
6       EEE            1
7       CSE            2
In this case, it should fail since CSE has duplicate roll_no values.
Scenario 2:
reg_no  department_id  roll_no
1       CSE            8
2       ECE            2
3       ECE            5
4       CSE            4
5       ME             3
6       EEE            2
7       CSE            1
In this case, the job should pass since department_id = "CSE" has unique roll_no values.
Please let me know how to handle the above two scenarios, where the dataframe is filtered first and then a column is checked for uniqueness, using Foundry Data Expectations.
You can simply build two dataframes and check if they have the same size:
- for the first one, just filter department_id = 'CSE' and select roll_no
- for the second one, filter department_id = 'CSE', select roll_no and call distinct()
If they are the same size, your dataframe was unique with respect to department_id (see the sketch below).
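A minimal sketch of that count-vs-distinct idea in plain PySpark (variable names are illustrative; how the resulting boolean gets wired into a Foundry expectation depends on your setup, so that part is left out):

import pyspark.sql.functions as F

cse_roll_nos = df.filter(F.col("department_id") == "CSE").select("roll_no")

# Unique iff dropping duplicates does not change the row count.
is_unique = cse_roll_nos.count() == cse_roll_nos.distinct().count()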
IIUC, there are two ways to answer this problem.
First, a generic piece of code that shows only the duplicate values:
df = spark.createDataFrame([(1, "CSE", 1),(2, "ECE", 1),(3, "ECE", 2),(4, "CSE", 2),(5, "ME", 1),(6, "EEE", 1),(7, "CSE", 2)],["reg_no","department_id","roll_no"])
df.show()
df \
    .groupby(['department_id', 'roll_no']) \
    .count() \
    .where('count > 1') \
    .sort('count', ascending=False) \
    .show()
Second, this will help you identify whether a given department_id is unique or not:
from pyspark.sql import Window as W
import pyspark.sql.functions as F

_w = W.partitionBy("department_id").orderBy("department_id")

df = df.withColumn("roll_no_list", F.collect_list("roll_no").over(_w)) \
       .withColumn("roll_no_set", F.collect_set("roll_no").over(_w))
df = df.withColumn(
    "cond_col",
    F.when(F.size(F.col("roll_no_list")) == F.size(F.col("roll_no_set")), "Unique")
     .otherwise("Not Unique"),
)
df.show()
+------+-------------+-------+------------+-----------+----------+
|reg_no|department_id|roll_no|roll_no_list|roll_no_set| cond_col|
+------+-------------+-------+------------+-----------+----------+
| 1| CSE| 1| [1, 2, 2]| [1, 2]|Not Unique|
| 4| CSE| 2| [1, 2, 2]| [1, 2]|Not Unique|
| 7| CSE| 2| [1, 2, 2]| [1, 2]|Not Unique|
| 2| ECE| 1| [1, 2]| [1, 2]| Unique|
| 3| ECE| 2| [1, 2]| [1, 2]| Unique|
| 6| EEE| 1| [1]| [1]| Unique|
| 5| ME| 1| [1]| [1]| Unique|
+------+-------------+-------+------------+-----------+----------+
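To reduce this to a single pass/fail signal for the CSE case from the question (a small sketch built on the dataframe above, again not Foundry-specific):

cse_is_unique = (
    df.filter(
        (F.col("department_id") == "CSE") & (F.col("cond_col") == "Not Unique")
    ).count()
    == 0
)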

is there any easier way to combine 100+ PySpark dataframes with different columns together (not merge, but append)

Suppose I have a lot of dataframes with similar structure but different columns. I want to combine all of them together; how can I do this in an easier way?
for example, df1, df2, df3 are as follows:
df1
id base1 base2 col1 col2 col3 col4
1 1 100 30 1 2 3
2 2 200 40 2 3 4
3 3 300 20 4 4 5
df2
id base1 base2 col1
5 4 100 15
6 1 99 18
7 2 89 9
df3
id base1 base2 col1 col2
9 2 77 12 3
10 1 89 16 5
11 2 88 10 7
to be:
id base1 base2 col1 col2 col3 col4
1 1 100 30 1 2 3
2 2 200 40 2 3 4
3 3 300 20 4 4 5
5 4 100 15 NaN NaN NaN
6 1 99 18 NaN NaN NaN
7 2 89 9 NaN NaN NaN
9 2 77 12 3 NaN NaN
10 1 89 16 5 NaN NaN
11 2 88 10 7 NaN NaN
currently I use this code:
from pyspark.sql import SparkSession, HiveContext
from pyspark.sql.functions import lit
from pyspark.sql import Row

def customUnion(df1, df2):
    cols1 = df1.columns
    cols2 = df2.columns
    total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))

    def expr(mycols, allcols):
        def processCols(colname):
            if colname in mycols:
                return colname
            else:
                return lit(None).alias(colname)
        cols = map(processCols, allcols)
        return list(cols)

    appended = df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))
    return appended

df_comb1 = customUnion(df1, df2)
df_comb2 = customUnion(df_comb1, df3)
However, if I keep creating new dataframes like df4, df5, etc. (100+),
my code becomes messy.
Is there a way to code this in an easier way?
Thanks in advance
You can manage this with a list of data frames and a function, without necessarily needing to statically name each data frame...
dataframes = [df1,df2,df3] # load data frames
Compute the set of all possible columns:
all_cols = {i for lst in [df.columns for df in dataframes] for i in lst}
#{'base1', 'base2', 'col1', 'col2', 'col3', 'col4', 'id'}
A function to add missing columns to a DF:
import pyspark.sql.functions as f

def add_missing_cols(df, cols):
    v = df
    for col in [c for c in cols if c not in df.columns]:
        v = v.withColumn(col, f.lit(None))
    return v
completed_dfs = [add_missing_cols(df, all_cols) for df in dataframes]
res = completed_dfs[0]
for df in completed_dfs[1:]:
    res = res.unionAll(df)
res.show()
+---+-----+-----+----+----+----+----+
| id|base1|base2|col1|col2|col3|col4|
+---+-----+-----+----+----+----+----+
| 1| 1| 100| 30| 1| 2| 3|
| 2| 2| 200| 40| 2| 3| 4|
| 3| 3| 300| 20| 4| 4| 5|
| 5| 4| 100| 15|null|null|null|
| 6| 1| 99| 18|null|null|null|
| 7| 2| 89| 9|null|null|null|
| 9| 2| 77| 12| 3|null|null|
| 10| 1| 89| 16| 5|null|null|
| 11| 2| 88| 10| 7|null|null|
+---+-----+-----+----+----+----+----+
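If you are on Spark 3.1 or later, a simpler alternative (a sketch, assuming all frames are already collected in the dataframes list) is unionByName with allowMissingColumns=True, which pads missing columns with nulls for you:

from functools import reduce

# Spark >= 3.1: columns missing from either side are filled with nulls.
res = reduce(
    lambda left, right: left.unionByName(right, allowMissingColumns=True),
    dataframes,
)
res.show()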

Spark not isin alternative [duplicate]

I have two dataframes. I want to delete some records in Data Frame-A based on some common column values in Data Frame-B.
For Example:
Data Frame-A:
A B C D
1 2 3 4
3 4 5 7
4 7 9 6
2 5 7 9
Data Frame-B:
A B C D
1 2 3 7
2 5 7 4
2 9 8 7
Keys: A,B,C columns
Desired Output:
A B C D
3 4 5 7
4 7 9 6
Any solution for this?
You are looking for a left anti-join:
df_a.join(df_b, Seq("A","B","C"), "leftanti").show()
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 3| 4| 5| 7|
| 4| 7| 9| 6|
+---+---+---+---+
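The snippet above uses the Scala API; the PySpark equivalent (a sketch assuming the same df_a/df_b names) looks like:

df_a.join(df_b, ["A", "B", "C"], "left_anti").show()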
