How to write nested if else in pyspark? - pyspark

I have a pyspark dataframe and I want to achieve the following conditions:
if col1 is not none:
if col1 > 17:
return False
else:
return True
return None
I have implemented it in the following way:
out = out.withColumn('col2', out.withColumn(
'col2', when(col('col1').isNull(), None).otherwise(
when(col('col1') > 17, False).otherwise(True)
)))
However when I run this I achieve the following error :
assert isinstance(col, Column), "col should be Column"
AssertionError: col should be Column
Any ideas what I might be doing wrong.

I think the problem comes from the typo you made, writting twice the out.withColumn.
here is my code :
from pyspark.sql import functions as F
a = [
(None,),
(16,),
(18,),
]
b = [
"col1",
]
df = spark.createDataFrame(a, b)
df.withColumn(
"col2",
F.when(F.col("col1").isNull(), None).otherwise(
F.when(F.col("col1") > 17, False).otherwise(True)
),
).show()
+----+-----+
|col1| col2|
+----+-----+
|null| null|
| 16| true|
| 18|false|
+----+-----+
You can also do it a bit differently because you do not need the first otherwise or no need to evaluate explicitly the NULL:
df.withColumn(
"col2",
F.when(F.col("col1").isNull(), None)
.when(F.col("col1") > 17, False)
.otherwise(True),
).show()
# OR
df.withColumn(
"col2", F.when(F.col("col1") > 17, False).when(F.col("col1") <= 17, True)
).show()

Related

Filter df by date using pyspark

everyone!!
I have tried to filter a dataset in pyspark. I had to filter the column date (date type) and I have written this code, but there is somwthing wrong: the dataset is empty. Someone could tell me how to fix it?
df = df.filter((F.col("date") > "2018-12-12") & (F.col("date") < "2019-12-12"))
Tanks
You need first to make sure date column is in date format then use lit for your filter:
df exemple:
df = spark.createDataFrame(
[
('20/12/2018', '1', 50),
('18/01/2021', '2', 23),
('31/01/2022', '3', -10)
], ['date', 'id', 'value']
)
df.show()
+----------+---+-----+
| date| id|value|
+----------+---+-----+
|20/12/2018| 1| 50|
|18/01/2021| 2| 23|
|31/01/2022| 3| -10|
+----------+---+-----+
from pyspark.sql import functions as F
df\
.withColumn('date', F.to_date('date', 'd/M/y'))\
.filter((F.col('date') > F.lit('2018-12-12')) & (F.col("date") < F.lit('2019-12-12')))\
.show()
+----------+---+-----+
| date| id|value|
+----------+---+-----+
|2018-12-20| 1| 50|
+----------+---+-----+

New column creation based on if and else condition using pyspark

I have 2 spark dataframes, and I want to add new column named "seg" to dataframe df2 based on below condition
if df2.colx value is present in df1.colx.
I tried below operation in pyspark but its throwing exception.
cc002 = df2.withColumn('seg',F.when(df2.colx == df1.colx,"True").otherwise("FALSE"))
df1 :
id colx coly
1 678 56789
2 900 67890
3 789 67854
df2
Name colx
seema 900
yash 678
deep 800
harsh 900
My expected Output is
Name colx seg
seema 900 True
harsh 900 True
yash 678 True
deep 800 False
Please help me correcting the given pyspark code or suggest the better way of doing it.
If I understand your question correctly what you want to do is this
res = df2.join(
df1,
on="colx",
how = "left"
).select(
"Name",
"colx"
).withColumn(
"seg",
F.when(F.col(colx).isNull(),F.lit(True)).otherwise(F.lit(False))
)
let me know if this is the solution you want.
my bad i did write the incorrect code in hurry below is the corrected one
import pyspark.sql.functions as F
df1 = sqlContext.createDataFrame([[1,678,56789],[2,900,67890],[3,789,67854]],['id', 'colx', 'coly'])
df2 = sqlContext.createDataFrame([["seema",900],["yash",678],["deep",800],["harsh",900]],['Name', 'colx'])
res = df2.join(
df1.withColumn(
"check",
F.lit(1)
),
on="colx",
how = "left"
).withColumn(
"seg",
F.when(F.col("check").isNotNull(),F.lit(True)).otherwise(F.lit(False))
).select(
"Name",
"colx",
"seg"
)
res.show()
+-----+----+-----+
| Name|colx| seg|
+-----+----+-----+
| yash| 678| true|
|seema| 900| true|
|harsh| 900| true|
| deep| 800|false|
+-----+----+-----+
You can join on colx and fill null values with False:
result = (df2.join(df1.select(df1['colx'], F.lit(True).alias('seg')),
on='colx',
how='left')
.fillna(False, subset='seg'))
result.show()
Output:
+----+-----+-----+
|colx| Name| seg|
+----+-----+-----+
| 900|seema| true|
| 900|harsh| true|
| 800| deep|false|
| 678| yash| true|
+----+-----+-----+

Scala - Fill "null" column with another column

I want to replicate the problem mentioned here in Scala DataFrames. I have tried using the following approaches, to no success so far.
Input
Col1 Col2
A M
B K
null S
Expected Output
Col1 Col2
A M
B K
S <---- S
Approach 1
val output = df.na.fill("A", Seq("col1"))
The fill method does not take a column as the (first) input.
Approach 2
val output = df.where(df.col("col1").isNull)
I cannot find a suitable method to call after I have identified the null values.
Approach 3
val output = df.dtypes.map(column =>
column._2 match {
case "null" => (column._2 -> 0)
}).toMap
I get a StringType error.
I'd use when/otherwise, as shown below:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = Seq(
("A", "M"), ("B", "K"), (null, "S")
).toDF("Col1", "Col2")
df.withColumn("Col1", when($"Col1".isNull, $"Col2").otherwise($"Col1")).show
// +----+----+
// |Col1|Col2|
// +----+----+
// | A| M|
// | B| K|
// | S| S|
// +----+----+

Transpose Dataframe in Scala

I have dataframe like below.
+---+------+------+
| ID|Field1|Field2|
+---+------+------+
| 1| x| n|
| 2| a| b|
+---+------+------+
And I need the output like below
+---+-------------+------+
| ID| Fields|values|
+---+-------------+------+
| 1|Field1,Field2| x,n|
| 2|Field1,Field2| a,b|
+---+-------------+------+
I am pretty new to scala.. I just need an approach to do this. I already researched on internet regarding transpose, but couldn't get the solution.
Since Fields column is going to be the same in every row, you can add it later.
In this example class Thing has 3 fields: id, Field1, Field2.
val sqlContext = new org.apache.spark.sql.SQLContext( sc )
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df =
sc
.parallelize( List( Thing( 1, "a", "b" ), Thing( 2, "x", "y" ) ) )
.toDF( "id", "Field1", "Field2" )
Column names are returned in the same order so we can just take last two for field names
val fieldNames =
df
.columns
.takeRight( 2 )
The org.apache.spark.sql.functions does all the job combining data from given columns.
val res =
df
.select( $"id", array( $"Field1", $"Field2" ) as "values" )
.withColumn( "Fields", lit( fieldNames ) )
res.show()
Result:
+---+------+----------------+
| id|values| Fields|
+---+------+----------------+
| 1|[a, b]|[Field1, Field2]|
| 2|[x, y]|[Field1, Field2]|
+---+------+----------------+

UnionAll for dataframes with different columns from list in spark scala [duplicate]

I have 2 DataFrames:
I need union like this:
The unionAll function doesn't work because the number and the name of columns are different.
How can I do this?
In Scala you just have to append all missing columns as nulls.
import org.apache.spark.sql.functions._
// let df1 and df2 the Dataframes to merge
val df1 = sc.parallelize(List(
(50, 2),
(34, 4)
)).toDF("age", "children")
val df2 = sc.parallelize(List(
(26, true, 60000.00),
(32, false, 35000.00)
)).toDF("age", "education", "income")
val cols1 = df1.columns.toSet
val cols2 = df2.columns.toSet
val total = cols1 ++ cols2 // union
def expr(myCols: Set[String], allCols: Set[String]) = {
allCols.toList.map(x => x match {
case x if myCols.contains(x) => col(x)
case _ => lit(null).as(x)
})
}
df1.select(expr(cols1, total):_*).unionAll(df2.select(expr(cols2, total):_*)).show()
+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 50| 2| null| null|
| 34| 4| null| null|
| 26| null| true|60000.0|
| 32| null| false|35000.0|
+---+--------+---------+-------+
Update
Both temporal DataFrames will have the same order of columns, because we are mapping through total in both cases.
df1.select(expr(cols1, total):_*).show()
df2.select(expr(cols2, total):_*).show()
+---+--------+---------+------+
|age|children|education|income|
+---+--------+---------+------+
| 50| 2| null| null|
| 34| 4| null| null|
+---+--------+---------+------+
+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 26| null| true|60000.0|
| 32| null| false|35000.0|
+---+--------+---------+-------+
Spark 3.1+
df = df1.unionByName(df2, allowMissingColumns=True)
Test results:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data1=[
(1 , '2016-08-29', 1 , 2, 3),
(2 , '2016-08-29', 1 , 2, 3),
(3 , '2016-08-29', 1 , 2, 3)]
df1 = spark.createDataFrame(data1, ['code' , 'date' , 'A' , 'B', 'C'])
data2=[
(5 , '2016-08-29', 1, 2, 3, 4),
(6 , '2016-08-29', 1, 2, 3, 4),
(7 , '2016-08-29', 1, 2, 3, 4)]
df2 = spark.createDataFrame(data2, ['code' , 'date' , 'B', 'C', 'D', 'E'])
df = df1.unionByName(df2, allowMissingColumns=True)
df.show()
# +----+----------+----+---+---+----+----+
# |code| date| A| B| C| D| E|
# +----+----------+----+---+---+----+----+
# | 1|2016-08-29| 1| 2| 3|null|null|
# | 2|2016-08-29| 1| 2| 3|null|null|
# | 3|2016-08-29| 1| 2| 3|null|null|
# | 5|2016-08-29|null| 1| 2| 3| 4|
# | 6|2016-08-29|null| 1| 2| 3| 4|
# | 7|2016-08-29|null| 1| 2| 3| 4|
# +----+----------+----+---+---+----+----+
Spark 2.3+
diff1 = [c for c in df2.columns if c not in df1.columns]
diff2 = [c for c in df1.columns if c not in df2.columns]
df = df1.select('*', *[F.lit(None).alias(c) for c in diff1]) \
.unionByName(df2.select('*', *[F.lit(None).alias(c) for c in diff2]))
Test results:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
data1=[
(1 , '2016-08-29', 1 , 2, 3),
(2 , '2016-08-29', 1 , 2, 3),
(3 , '2016-08-29', 1 , 2, 3)]
df1 = spark.createDataFrame(data1, ['code' , 'date' , 'A' , 'B', 'C'])
data2=[
(5 , '2016-08-29', 1, 2, 3, 4),
(6 , '2016-08-29', 1, 2, 3, 4),
(7 , '2016-08-29', 1, 2, 3, 4)]
df2 = spark.createDataFrame(data2, ['code' , 'date' , 'B', 'C', 'D', 'E'])
diff1 = [c for c in df2.columns if c not in df1.columns]
diff2 = [c for c in df1.columns if c not in df2.columns]
df = df1.select('*', *[F.lit(None).alias(c) for c in diff1]) \
.unionByName(df2.select('*', *[F.lit(None).alias(c) for c in diff2]))
df.show()
# +----+----------+----+---+---+----+----+
# |code| date| A| B| C| D| E|
# +----+----------+----+---+---+----+----+
# | 1|2016-08-29| 1| 2| 3|null|null|
# | 2|2016-08-29| 1| 2| 3|null|null|
# | 3|2016-08-29| 1| 2| 3|null|null|
# | 5|2016-08-29|null| 1| 2| 3| 4|
# | 6|2016-08-29|null| 1| 2| 3| 4|
# | 7|2016-08-29|null| 1| 2| 3| 4|
# +----+----------+----+---+---+----+----+
Here is my Python version:
from pyspark.sql import SparkSession, HiveContext
from pyspark.sql.functions import lit
from pyspark.sql import Row
def customUnion(df1, df2):
cols1 = df1.columns
cols2 = df2.columns
total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))
def expr(mycols, allcols):
def processCols(colname):
if colname in mycols:
return colname
else:
return lit(None).alias(colname)
cols = map(processCols, allcols)
return list(cols)
appended = df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))
return appended
Here is sample usage:
data = [
Row(zip_code=58542, dma='MIN'),
Row(zip_code=58701, dma='MIN'),
Row(zip_code=57632, dma='MIN'),
Row(zip_code=58734, dma='MIN')
]
firstDF = spark.createDataFrame(data)
data = [
Row(zip_code='534', name='MIN'),
Row(zip_code='353', name='MIN'),
Row(zip_code='134', name='MIN'),
Row(zip_code='245', name='MIN')
]
secondDF = spark.createDataFrame(data)
customUnion(firstDF,secondDF).show()
Here is the code for Python 3.0 using pyspark:
from pyspark.sql.functions import lit
def __order_df_and_add_missing_cols(df, columns_order_list, df_missing_fields):
""" return ordered dataFrame by the columns order list with null in missing columns """
if not df_missing_fields: # no missing fields for the df
return df.select(columns_order_list)
else:
columns = []
for colName in columns_order_list:
if colName not in df_missing_fields:
columns.append(colName)
else:
columns.append(lit(None).alias(colName))
return df.select(columns)
def __add_missing_columns(df, missing_column_names):
""" Add missing columns as null in the end of the columns list """
list_missing_columns = []
for col in missing_column_names:
list_missing_columns.append(lit(None).alias(col))
return df.select(df.schema.names + list_missing_columns)
def __order_and_union_d_fs(left_df, right_df, left_list_miss_cols, right_list_miss_cols):
""" return union of data frames with ordered columns by left_df. """
left_df_all_cols = __add_missing_columns(left_df, left_list_miss_cols)
right_df_all_cols = __order_df_and_add_missing_cols(right_df, left_df_all_cols.schema.names,
right_list_miss_cols)
return left_df_all_cols.union(right_df_all_cols)
def union_d_fs(left_df, right_df):
""" Union between two dataFrames, if there is a gap of column fields,
it will append all missing columns as nulls """
# Check for None input
if left_df is None:
raise ValueError('left_df parameter should not be None')
if right_df is None:
raise ValueError('right_df parameter should not be None')
# For data frames with equal columns and order- regular union
if left_df.schema.names == right_df.schema.names:
return left_df.union(right_df)
else: # Different columns
# Save dataFrame columns name list as set
left_df_col_list = set(left_df.schema.names)
right_df_col_list = set(right_df.schema.names)
# Diff columns between left_df and right_df
right_list_miss_cols = list(left_df_col_list - right_df_col_list)
left_list_miss_cols = list(right_df_col_list - left_df_col_list)
return __order_and_union_d_fs(left_df, right_df, left_list_miss_cols, right_list_miss_cols)
A very simple way to do this - select the columns in the same order from both the dataframes and use unionAll
df1.select('code', 'date', 'A', 'B', 'C', lit(None).alias('D'), lit(None).alias('E'))\
.unionAll(df2.select('code', 'date', lit(None).alias('A'), 'B', 'C', 'D', 'E'))
Here's a pyspark solution.
It assumes that if a field in df1 is missing from df2, then you add that missing field to df2 with null values. However it also assumes that if the field exists in both dataframes, but the type or nullability of the field is different, then the two dataframes conflict and cannot be combined. In that case I raise a TypeError.
from pyspark.sql.functions import lit
def harmonize_schemas_and_combine(df_left, df_right):
left_types = {f.name: f.dataType for f in df_left.schema}
right_types = {f.name: f.dataType for f in df_right.schema}
left_fields = set((f.name, f.dataType, f.nullable) for f in df_left.schema)
right_fields = set((f.name, f.dataType, f.nullable) for f in df_right.schema)
# First go over left-unique fields
for l_name, l_type, l_nullable in left_fields.difference(right_fields):
if l_name in right_types:
r_type = right_types[l_name]
if l_type != r_type:
raise TypeError, "Union failed. Type conflict on field %s. left type %s, right type %s" % (l_name, l_type, r_type)
else:
raise TypeError, "Union failed. Nullability conflict on field %s. left nullable %s, right nullable %s" % (l_name, l_nullable, not(l_nullable))
df_right = df_right.withColumn(l_name, lit(None).cast(l_type))
# Now go over right-unique fields
for r_name, r_type, r_nullable in right_fields.difference(left_fields):
if r_name in left_types:
l_type = left_types[r_name]
if r_type != l_type:
raise TypeError, "Union failed. Type conflict on field %s. right type %s, left type %s" % (r_name, r_type, l_type)
else:
raise TypeError, "Union failed. Nullability conflict on field %s. right nullable %s, left nullable %s" % (r_name, r_nullable, not(r_nullable))
df_left = df_left.withColumn(r_name, lit(None).cast(r_type))
# Make sure columns are in the same order
df_left = df_left.select(df_right.columns)
return df_left.union(df_right)
I somehow find most of the python-answers here a bit too clunky in their writing if you're just going with the simple lit(None)-workaround (which is also the only way I know). As alternative this might be useful:
# df1 and df2 are assumed to be the given dataFrames from the question
# Get the lacking columns for each dataframe and set them to null in the respective dataFrame.
# First do so for df1...
for column in [column for column in df1.columns if column not in df2.columns]:
df1 = df1.withColumn(column, lit(None))
# ... and then for df2
for column in [column for column in df2.columns if column not in df1.columns]:
df2 = df2.withColumn(column, lit(None))
Afterwards just do the union() you wanted to do.
Caution: If your column-order differs between df1 and df2 use unionByName()!
result = df1.unionByName(df2)
Modified Alberto Bonsanto's version to preserve the original column order (OP implied the order should match the original tables). Also, the match part caused an Intellij warning.
Here's my version:
def unionDifferentTables(df1: DataFrame, df2: DataFrame): DataFrame = {
val cols1 = df1.columns.toSet
val cols2 = df2.columns.toSet
val total = cols1 ++ cols2 // union
val order = df1.columns ++ df2.columns
val sorted = total.toList.sortWith((a,b)=> order.indexOf(a) < order.indexOf(b))
def expr(myCols: Set[String], allCols: List[String]) = {
allCols.map( {
case x if myCols.contains(x) => col(x)
case y => lit(null).as(y)
})
}
df1.select(expr(cols1, sorted): _*).unionAll(df2.select(expr(cols2, sorted): _*))
}
in pyspark:
df = df1.join(df2, ['each', 'shared', 'col'], how='full')
I had the same issue and using join instead of union solved my problem.
So, for example with python , instead of this line of code:
result = left.union(right), which will fail to execute for different number of columns,
you should use this one:
result = left.join(right, left.columns if (len(left.columns) < len(right.columns)) else right.columns, "outer")
Note that the second argument contains the common columns between the two DataFrames. If you don't use it, the result will have duplicate columns with one of them being null and the other not.
Hope it helps.
There is much concise way to handle this issue with a moderate sacrifice of performance.
def unionWithDifferentSchema(a: DataFrame, b: DataFrame): DataFrame = {
sparkSession.read.json(a.toJSON.union(b.toJSON).rdd)
}
This is the function which does the trick. Using toJSON to each dataframe makes a json Union. This preserves the ordering and the datatype.
Only catch is toJSON is relatively expensive (however not much you probably get 10-15% slowdown). However this keeps the code clean.
My version for Java:
private static Dataset<Row> unionDatasets(Dataset<Row> one, Dataset<Row> another) {
StructType firstSchema = one.schema();
List<String> anotherFields = Arrays.asList(another.schema().fieldNames());
another = balanceDataset(another, firstSchema, anotherFields);
StructType secondSchema = another.schema();
List<String> oneFields = Arrays.asList(one.schema().fieldNames());
one = balanceDataset(one, secondSchema, oneFields);
return another.unionByName(one);
}
private static Dataset<Row> balanceDataset(Dataset<Row> dataset, StructType schema, List<String> fields) {
for (StructField e : schema.fields()) {
if (!fields.contains(e.name())) {
dataset = dataset
.withColumn(e.name(),
lit(null));
dataset = dataset.withColumn(e.name(),
dataset.col(e.name()).cast(Optional.ofNullable(e.dataType()).orElse(StringType)));
}
}
return dataset;
}
Here's the version in Scala also answered here, Also a Pyspark version..
( Spark - Merge / Union DataFrame with Different Schema (column names and sequence) to a DataFrame with Master common schema ) -
It takes List of dataframe to be unioned .. Provided same named columns in all the dataframe should have same datatype..
def unionPro(DFList: List[DataFrame], spark: org.apache.spark.sql.SparkSession): DataFrame = {
/**
* This Function Accepts DataFrame with same or Different Schema/Column Order.With some or none common columns
* Creates a Unioned DataFrame
*/
import spark.implicits._
val MasterColList: Array[String] = DFList.map(_.columns).reduce((x, y) => (x.union(y))).distinct
def unionExpr(myCols: Seq[String], allCols: Seq[String]): Seq[org.apache.spark.sql.Column] = {
allCols.toList.map(x => x match {
case x if myCols.contains(x) => col(x)
case _ => lit(null).as(x)
})
}
// Create EmptyDF , ignoring different Datatype in StructField and treating them same based on Name ignoring cases
val masterSchema = StructType(DFList.map(_.schema.fields).reduce((x, y) => (x.union(y))).groupBy(_.name.toUpperCase).map(_._2.head).toArray)
val masterEmptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], masterSchema).select(MasterColList.head, MasterColList.tail: _*)
DFList.map(df => df.select(unionExpr(df.columns, MasterColList): _*)).foldLeft(masterEmptyDF)((x, y) => x.union(y))
}
Here is the sample test for it -
val aDF = Seq(("A", 1), ("B", 2)).toDF("Name", "ID")
val bDF = Seq(("C", 1, "D1"), ("D", 2, "D2")).toDF("Name", "Sal", "Deptt")
unionPro(List(aDF, bDF), spark).show
Which gives output as -
+----+----+----+-----+
|Name| ID| Sal|Deptt|
+----+----+----+-----+
| A| 1|null| null|
| B| 2|null| null|
| C|null| 1| D1|
| D|null| 2| D2|
+----+----+----+-----+
This function takes in two dataframes (df1 and df2) with different schemas and unions them.
First we need to bring them to the same schema by adding all (missing) columns from df1 to df2 and vice versa. To add a new empty column to a df we need to specify the datatype.
import pyspark.sql.functions as F
def union_different_schemas(df1, df2):
# Get a list of all column names in both dfs
columns_df1 = df1.columns
columns_df2 = df2.columns
# Get a list of datatypes of the columns
data_types_df1 = [i.dataType for i in df1.schema.fields]
data_types_df2 = [i.dataType for i in df2.schema.fields]
# We go through all columns in df1 and if they are not in df2, we add
# them (and specify the correct datatype too)
for col, typ in zip(columns_df1, data_types_df1):
if col not in df2.columns:
df2 = df2\
.withColumn(col, F.lit(None).cast(typ))
# Now df2 has all missing columns from df1, let's do the same for df1
for col, typ in zip(columns_df2, data_types_df2):
if col not in df1.columns:
df1 = df1\
.withColumn(col, F.lit(None).cast(typ))
# Now df1 and df2 have the same columns, not necessarily in the same
# order, therefore we use unionByName
combined_df = df1\
.unionByName(df2)
return combined_df
PYSPARK
Scala version from Alberto works great. However, if you want to make a for-loop or some dynamic assignment of variables you can face some problems.
Solution comes with Pyspark - clean code:
from pyspark.sql.functions import *
#defining dataframes
df1 = spark.createDataFrame(
[
(1, 'foo','ok'),
(2, 'pro','ok')
],
['id', 'txt','check']
)
df2 = spark.createDataFrame(
[
(3, 'yep',13,'mo'),
(4, 'bro',11,'re')
],
['id', 'txt','value','more']
)
#retrieving columns
cols1 = df1.columns
cols2 = df2.columns
#getting columns from df1 and df2
total = list(set(cols2) | set(cols1))
#defining function for adding nulls (None in case of pyspark)
def addnulls(yourDF):
for x in total:
if not x in yourDF.columns:
yourDF = yourDF.withColumn(x,lit(None))
return yourDF
df1 = addnulls(df1)
df2 = addnulls(df2)
#additional sorting for correct unionAll (it concatenates DFs by column number)
df1.select(sorted(df1.columns)).unionAll(df2.select(sorted(df2.columns))).show()
+-----+---+----+---+-----+
|check| id|more|txt|value|
+-----+---+----+---+-----+
| ok| 1|null|foo| null|
| ok| 2|null|pro| null|
| null| 3| mo|yep| 13|
| null| 4| re|bro| 11|
+-----+---+----+---+-----+
from functools import reduce
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
def unionAll(*dfs, fill_by=None):
clmns = {clm.name.lower(): (clm.dataType, clm.name) for df in dfs for clm in df.schema.fields}
dfs = list(dfs)
for i, df in enumerate(dfs):
df_clmns = [clm.lower() for clm in df.columns]
for clm, (dataType, name) in clmns.items():
if clm not in df_clmns:
# Add the missing column
dfs[i] = dfs[i].withColumn(name, F.lit(fill_by).cast(dataType))
return reduce(DataFrame.unionByName, dfs)
unionAll(df1, df2).show()
Case insenstive columns
Will returns the actual column case
Support the existing datatypes
Default value can be customizable
Pass multiple dataframes at once (e.g unionAll(df1, df2, df3, ..., df10))
here's another one:
def unite(df1: DataFrame, df2: DataFrame): DataFrame = {
val cols1 = df1.columns.toSet
val cols2 = df2.columns.toSet
val total = (cols1 ++ cols2).toSeq.sorted
val expr1 = total.map(c => {
if (cols1.contains(c)) c else "NULL as " + c
})
val expr2 = total.map(c => {
if (cols2.contains(c)) c else "NULL as " + c
})
df1.selectExpr(expr1:_*).union(
df2.selectExpr(expr2:_*)
)
}
Union and outer union for Pyspark DataFrame concatenation. This works for multiple data frames with different columns.
def union_all(*dfs):
return reduce(ps.sql.DataFrame.unionAll, dfs)
def outer_union_all(*dfs):
all_cols = set([])
for df in dfs:
all_cols |= set(df.columns)
all_cols = list(all_cols)
print(all_cols)
def expr(cols, all_cols):
def append_cols(col):
if col in cols:
return col
else:
return sqlfunc.lit(None).alias(col)
cols_ = map(append_cols, all_cols)
return list(cols_)
union_df = union_all(*[df.select(expr(df.columns, all_cols)) for df in dfs])
return union_df
One more generic method to union list of DataFrame.
def unionFrames(dfs: Seq[DataFrame]): DataFrame = {
dfs match {
case Nil => session.emptyDataFrame // or throw an exception?
case x :: Nil => x
case _ =>
//Preserving Column order from left to right DF's column order
val allColumns = dfs.foldLeft(collection.mutable.ArrayBuffer.empty[String])((a, b) => a ++ b.columns).distinct
val appendMissingColumns = (df: DataFrame) => {
val columns = df.columns.toSet
df.select(allColumns.map(c => if (columns.contains(c)) col(c) else lit(null).as(c)): _*)
}
dfs.tail.foldLeft(appendMissingColumns(dfs.head))((a, b) => a.union(appendMissingColumns(b)))
}
This is my pyspark version:
from functools import reduce
from pyspark.sql.functions import lit
def concat(dfs):
# when the dataframes to combine do not have the same order of columns
# https://datascience.stackexchange.com/a/27231/15325
return reduce(lambda df1, df2: df1.union(df2.select(df1.columns)), dfs)
def union_all(dfs):
columns = reduce(lambda x, y : set(x).union(set(y)), [ i.columns for i in dfs ] )
for i in range(len(dfs)):
d = dfs[i]
for c in columns:
if c not in d.columns:
d = d.withColumn(c, lit(None))
dfs[i] = d
return concat(dfs)
Alternate you could use full join.
list_of_files = ['test1.parquet', 'test2.parquet']
def merged_frames():
if list_of_files:
frames = [spark.read.parquet(df.path) for df in list_of_files]
if frames:
df = frames[0]
if frames[1]:
var = 1
for element in range(len(frames)-1):
result_df = df.join(frames[var], 'primary_key', how='full')
var += 1
display(result_df)
If you are loading from files, I guess you could just use the read function with a list of files.
# file_paths is list of files with different schema
df = spark.read.option("mergeSchema", "true").json(file_paths)
The resulting dataframe will have merged columns.