PySpark dataframe: working with duplicated column names after self join - pyspark

I have a dataframe like this (inspired by this question with slightly different setup):
df3 = hive_context.createDataFrame([
Row(a=107831, f=3),
Row(a=107531, f=2),
Row(a=125231, f=2)
])
Based on this, i create two new objects. Each is a subset of the original dataframe:
from pyspark.sql.functions import col
df1 = (df3
.filter(((col('a') == 107831) & (col('f') == 3))|
((col('a') == 125231) & (col('f') == 2))))
df2 = (df3
.filter(((col('a') == 107831) & (col('f') == 3))|
((col('a') == 107531) & (col('f') == 2))))
Then, I would like to join those two datasets and obtain the f columns from each of them as in:
a = (df1
.join(df2, (df1['a'] == df2['a']), how = 'full')
.select(df1['f'], df2['f']).collect())
However, I get [Row(f=None, f=None), Row(f=2, f=2), Row(f=3, f=3)]
Instead of the desired [Row(f=3, f=4), Row(f=None, f=2), Row(f=2, f=None)] or expressed as a table:
+------+----+------+----+
| a| f| a| f|
+------+----+------+----+
|107831| 3|107831| 4|
| null|null|107531| 2|
|125231| 2| null|null|
+------+----+------+----+
Does anyone know how to resolve this? Do i have to store df1 and df2 somewhere?
When I run the scenario as in the above linked question, i get the expected results:
df1 = hive_context.createDataFrame([
Row(a=107831, f=3),
Row(a=125231, f=2),
])
df2 = hive_context.createDataFrame([
Row(a=107831, f=4),
Row(a=107531, f=2),
])
a = df1.join(df2, (df1['a'] == df2['a']), how = 'full').select(df1['f'], df2['f']).collect()
a
I run it on python 3.6 and spark 2.3

In cases where there are duplicated column names, use aliases on your DataFrames to avoid the ambiguity:
a = df1.alias('l').join(df2.alias('r'), on='a', how = 'full').select('l.f', 'r.f').collect()
print(a)
#[Row(f=3, f=3), Row(f=None, f=2), Row(f=2, f=None)]

Related

Maximum of some specific columns in a spark scala dataframe

I have a dataframe like this.
+---+---+---+---+
| M| c2| c3| d1|
+---+---+---+---+
| 1|2_1|4_3|1_2|
| 2|3_4|4_5|1_2|
+---+---+---+---+
I have to transform this df should look like below. Here, c_max = max(c2,c3) after splitting with _.ie, all the columns (c2 and c3) have to be splitted with _ and then getting the max.
In the actual scenario, I have 50 columns ie, c2,c3....c50 and need to take the max from this.
+---+---+---+---+------+
| M| c2| c3| d1|c_Max |
+---+---+---+---+------+
| 1|2_1|4_3|1_2| 4 |
| 2|3_4|4_5|1_2| 5 |
+---+---+---+---+------+
Here is one way using expr and build-in array functions for Spark >= 2.4.0:
import org.apache.spark.sql.functions.{expr, array_max, array}
val df = Seq(
(1, "2_1", "3_4", "1_2"),
(2, "3_4", "4_5", "1_2")
).toDF("M", "c2", "c3", "d1")
// get max c for each c column
val c_cols = df.columns.filter(_.startsWith("c")).map{ c =>
expr(s"array_max(cast(split(${c}, '_') as array<int>))")
}
df.withColumn("max_c", array_max(array(c_cols:_*))).show
Output:
+---+---+---+---+-----+
| M| c2| c3| d1|max_c|
+---+---+---+---+-----+
| 1|2_1|3_4|1_2| 4|
| 2|3_4|4_5|1_2| 5|
+---+---+---+---+-----+
For older versions use the next code:
val c_cols = df.columns.filter(_.startsWith("c")).map{ c =>
val c_ar = split(col(c), "_").cast("array<int>")
when(c_ar.getItem(0) > c_ar.getItem(1), c_ar.getItem(0)).otherwise(c_ar.getItem(1))
}
df.withColumn("max_c", greatest(c_cols:_*)).show
Use greatest function:
val df = Seq((1, "2_1", "3_4", "1_2"),(2, "3_4", "4_5", "1_2"),
).toDF("M", "c2", "c3", "d1")
// get all `c` columns and split by `_` to get the values after the underscore
val c_cols = df.columns.filter(_.startsWith("c"))
.flatMap{
c => Seq(split(col(c), "_").getItem(0).cast("int"),
split(col(c), "_").getItem(1).cast("int")
)
}
// apply greatest func
val c_max = greatest(c_cols: _*)
// add new column
df.withColumn("c_Max", c_max).show()
Gives:
+---+---+---+---+-----+
| M| c2| c3| d1|c_Max|
+---+---+---+---+-----+
| 1|2_1|3_4|1_2| 4|
| 2|3_4|4_5|1_2| 5|
+---+---+---+---+-----+
In spark >= 2.4.0, you can use the array_max function and get some code that would work even with columns containing more than 2 values. The idea is to start by concatenating all the columns (concat column). For that, I use concat_ws on an array of all the columns I want to concat, that I obtain with array(cols.map(col) :_*). Then I split the resulting string to get a big array of strings containing all the values of all the columns. I cast it to an array of ints and I call array_max on it.
val cols = (2 to 50).map("c"+_)
val result = df
.withColumn("concat", concat_ws("_", array(cols.map(col) :_*)))
.withColumn("array_of_ints", split('concat, "_").cast(ArrayType(IntegerType)))
.withColumn("c_max", array_max('array_of_ints))
.drop("concat", "array_of_ints")
In spark < 2.4, you can define array_max yourself like this:
val array_max = udf((s : Seq[Int]) => s.max)
The previous code does not need to be modified. Note however that UDFs can be slower than predefined spark SQL functions.

Is there any way to subtract two dataframes with alphanumeric datatype. I tried using except but the count of records is not coming correctly

I am trying to subtract two data frames in scala and my datatypes are alphanumeric like I have a string as the data type for id column. I tried using except
df1.merge(
df2, how='outer', indicator=True
).query('_merge == "left_only"').drop('_merge', 1)
val df1 = Seq(("1","2019-04-03 14:45:00","1"),("2","2019-04-03 14:45:00","1"),("3","2019-04-03 14:45:00","1")).toDF("ID","Timestamp","RowNum")
val df2 = Seq(("2","2019-04-03 13:45:00","2"),("3","2019-04-03 13:45:00","2")).toDF("ID","Timestamp","RowNum")
val idDiff = df1.select("ID").except(df2.select("ID"))
val outputDF = df1.join(idDiff, "ID")
But nothing helps. I was not getting the correct count. Any help will be appreciated.
So the outputDF should contains only one record "1","2019-04-03 14:45:00","1" ?
I've ran your code and looks like it works, you can get same result with left_anti join.
val idDiff = df1.select("ID").except(df2.select("ID"))
val outputDF = df1.join(idDiff, "ID")
outputDF.show()
df1.join(df2,Seq("ID"),"left_anti").show()
+---+-------------------+------+
| ID| Timestamp|RowNum|
+---+-------------------+------+
| 1|2019-04-03 14:45:00| 1|
+---+-------------------+------+
+---+-------------------+------+
| ID| Timestamp|RowNum|
+---+-------------------+------+
| 1|2019-04-03 14:45:00| 1|
+---+-------------------+------+

Filter Dataframe on columns defined in a list

I have a dataframe
df
+----------+----+----+----+---+---+----+---+---+-------+-------+
| WEEK|DIM1|DIM2| T1| T2| T3| T1| T2| T3|T1_diff|T2_diff|
+----------+----+----+----+---+---+----+---+---+-------+-------+
|2016-04-02| 14|NULL|9874|880| 23|9879|820| 45| -5| 60|
|2016-04-30| 14| FR|9875| 13| 34|9785| 9| 67| 90| 4|
+----------+----+----+----+---+---+----+---+---+-------+-------+
I want to do two things on this data frame:
Select only WEEK, DIM1, DIM2, T1_diff, T2_diff
Filter T1_diff or T2_diff > 3
I am currently doing it like this -
val selectColumns = Seq("WEEK", "DIM1", "DIM2","T1_diff","T2_diff")
df.select(selectColumns.head, selectColumns.tail: _*).filter($"T1_diff" > 3 or $"T2_diff" > 3).show()
I have a use case, where i have my targetColumns defined like this -
val targetColumns = Seq("T1_diff", "T2_diff")
I need to use the above sequence to apply it in the filter. This is in sequence, because more columns can be added in targetColumns list.
I tried something like this -
df.filter(r => !targetColumns.map(x => col(x) > 3).isEmpty).show()
This doesnt seem to work. Can anyone tell me what is the best way to do this ?
You can use reduce on the sequence of target columns after you've mapped each of them into a condition (col(name) > 3), using or to "merge" them together into one condition:
import org.apache.spark.sql.functions._
val selectColumns = Seq("id", "type", "DIM2","T1_diff","T2_diff")
val targetColumns = Seq("T1_diff", "T2_diff")
df.select(selectColumns.head, selectColumns.tail: _*)
.filter(targetColumns.map(name => col(name) > 3).reduce(_ or _))
.show()
You can create a String using targetColumns List and then pass that String to where function.
val targetColumns = List("T1_diff", "T2_diff")
val selectColumns = Seq("WEEK", "DIM1", "DIM2", "T1_diff", "T2_diff")
//create the where condition to filter the columns
val condition = targetColumns.map(c => s"$c>3").mkString(" OR ")
//select the columns and apply filter using where function.
df.select(selectColumns.head, selectColumns.tail: _*).where(condition).show(false)
you can simply do the following making a string query
val targetColumns = Seq("T1_diff", "T2_diff")
df.filter(targetColumns.map(x => s"$x > 3").mkString(" or ")).show()
and you can add as many columns in targetColumns as you want

UnionAll for dataframes with different columns from list in spark scala [duplicate]

I have 2 DataFrames:
I need union like this:
The unionAll function doesn't work because the number and the name of columns are different.
How can I do this?
In Scala you just have to append all missing columns as nulls.
import org.apache.spark.sql.functions._
// let df1 and df2 the Dataframes to merge
val df1 = sc.parallelize(List(
(50, 2),
(34, 4)
)).toDF("age", "children")
val df2 = sc.parallelize(List(
(26, true, 60000.00),
(32, false, 35000.00)
)).toDF("age", "education", "income")
val cols1 = df1.columns.toSet
val cols2 = df2.columns.toSet
val total = cols1 ++ cols2 // union
def expr(myCols: Set[String], allCols: Set[String]) = {
allCols.toList.map(x => x match {
case x if myCols.contains(x) => col(x)
case _ => lit(null).as(x)
})
}
df1.select(expr(cols1, total):_*).unionAll(df2.select(expr(cols2, total):_*)).show()
+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 50| 2| null| null|
| 34| 4| null| null|
| 26| null| true|60000.0|
| 32| null| false|35000.0|
+---+--------+---------+-------+
Update
Both temporal DataFrames will have the same order of columns, because we are mapping through total in both cases.
df1.select(expr(cols1, total):_*).show()
df2.select(expr(cols2, total):_*).show()
+---+--------+---------+------+
|age|children|education|income|
+---+--------+---------+------+
| 50| 2| null| null|
| 34| 4| null| null|
+---+--------+---------+------+
+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 26| null| true|60000.0|
| 32| null| false|35000.0|
+---+--------+---------+-------+
Spark 3.1+
df = df1.unionByName(df2, allowMissingColumns=True)
Test results:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data1=[
(1 , '2016-08-29', 1 , 2, 3),
(2 , '2016-08-29', 1 , 2, 3),
(3 , '2016-08-29', 1 , 2, 3)]
df1 = spark.createDataFrame(data1, ['code' , 'date' , 'A' , 'B', 'C'])
data2=[
(5 , '2016-08-29', 1, 2, 3, 4),
(6 , '2016-08-29', 1, 2, 3, 4),
(7 , '2016-08-29', 1, 2, 3, 4)]
df2 = spark.createDataFrame(data2, ['code' , 'date' , 'B', 'C', 'D', 'E'])
df = df1.unionByName(df2, allowMissingColumns=True)
df.show()
# +----+----------+----+---+---+----+----+
# |code| date| A| B| C| D| E|
# +----+----------+----+---+---+----+----+
# | 1|2016-08-29| 1| 2| 3|null|null|
# | 2|2016-08-29| 1| 2| 3|null|null|
# | 3|2016-08-29| 1| 2| 3|null|null|
# | 5|2016-08-29|null| 1| 2| 3| 4|
# | 6|2016-08-29|null| 1| 2| 3| 4|
# | 7|2016-08-29|null| 1| 2| 3| 4|
# +----+----------+----+---+---+----+----+
Spark 2.3+
diff1 = [c for c in df2.columns if c not in df1.columns]
diff2 = [c for c in df1.columns if c not in df2.columns]
df = df1.select('*', *[F.lit(None).alias(c) for c in diff1]) \
.unionByName(df2.select('*', *[F.lit(None).alias(c) for c in diff2]))
Test results:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
data1=[
(1 , '2016-08-29', 1 , 2, 3),
(2 , '2016-08-29', 1 , 2, 3),
(3 , '2016-08-29', 1 , 2, 3)]
df1 = spark.createDataFrame(data1, ['code' , 'date' , 'A' , 'B', 'C'])
data2=[
(5 , '2016-08-29', 1, 2, 3, 4),
(6 , '2016-08-29', 1, 2, 3, 4),
(7 , '2016-08-29', 1, 2, 3, 4)]
df2 = spark.createDataFrame(data2, ['code' , 'date' , 'B', 'C', 'D', 'E'])
diff1 = [c for c in df2.columns if c not in df1.columns]
diff2 = [c for c in df1.columns if c not in df2.columns]
df = df1.select('*', *[F.lit(None).alias(c) for c in diff1]) \
.unionByName(df2.select('*', *[F.lit(None).alias(c) for c in diff2]))
df.show()
# +----+----------+----+---+---+----+----+
# |code| date| A| B| C| D| E|
# +----+----------+----+---+---+----+----+
# | 1|2016-08-29| 1| 2| 3|null|null|
# | 2|2016-08-29| 1| 2| 3|null|null|
# | 3|2016-08-29| 1| 2| 3|null|null|
# | 5|2016-08-29|null| 1| 2| 3| 4|
# | 6|2016-08-29|null| 1| 2| 3| 4|
# | 7|2016-08-29|null| 1| 2| 3| 4|
# +----+----------+----+---+---+----+----+
Here is my Python version:
from pyspark.sql import SparkSession, HiveContext
from pyspark.sql.functions import lit
from pyspark.sql import Row
def customUnion(df1, df2):
cols1 = df1.columns
cols2 = df2.columns
total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))
def expr(mycols, allcols):
def processCols(colname):
if colname in mycols:
return colname
else:
return lit(None).alias(colname)
cols = map(processCols, allcols)
return list(cols)
appended = df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))
return appended
Here is sample usage:
data = [
Row(zip_code=58542, dma='MIN'),
Row(zip_code=58701, dma='MIN'),
Row(zip_code=57632, dma='MIN'),
Row(zip_code=58734, dma='MIN')
]
firstDF = spark.createDataFrame(data)
data = [
Row(zip_code='534', name='MIN'),
Row(zip_code='353', name='MIN'),
Row(zip_code='134', name='MIN'),
Row(zip_code='245', name='MIN')
]
secondDF = spark.createDataFrame(data)
customUnion(firstDF,secondDF).show()
Here is the code for Python 3.0 using pyspark:
from pyspark.sql.functions import lit
def __order_df_and_add_missing_cols(df, columns_order_list, df_missing_fields):
""" return ordered dataFrame by the columns order list with null in missing columns """
if not df_missing_fields: # no missing fields for the df
return df.select(columns_order_list)
else:
columns = []
for colName in columns_order_list:
if colName not in df_missing_fields:
columns.append(colName)
else:
columns.append(lit(None).alias(colName))
return df.select(columns)
def __add_missing_columns(df, missing_column_names):
""" Add missing columns as null in the end of the columns list """
list_missing_columns = []
for col in missing_column_names:
list_missing_columns.append(lit(None).alias(col))
return df.select(df.schema.names + list_missing_columns)
def __order_and_union_d_fs(left_df, right_df, left_list_miss_cols, right_list_miss_cols):
""" return union of data frames with ordered columns by left_df. """
left_df_all_cols = __add_missing_columns(left_df, left_list_miss_cols)
right_df_all_cols = __order_df_and_add_missing_cols(right_df, left_df_all_cols.schema.names,
right_list_miss_cols)
return left_df_all_cols.union(right_df_all_cols)
def union_d_fs(left_df, right_df):
""" Union between two dataFrames, if there is a gap of column fields,
it will append all missing columns as nulls """
# Check for None input
if left_df is None:
raise ValueError('left_df parameter should not be None')
if right_df is None:
raise ValueError('right_df parameter should not be None')
# For data frames with equal columns and order- regular union
if left_df.schema.names == right_df.schema.names:
return left_df.union(right_df)
else: # Different columns
# Save dataFrame columns name list as set
left_df_col_list = set(left_df.schema.names)
right_df_col_list = set(right_df.schema.names)
# Diff columns between left_df and right_df
right_list_miss_cols = list(left_df_col_list - right_df_col_list)
left_list_miss_cols = list(right_df_col_list - left_df_col_list)
return __order_and_union_d_fs(left_df, right_df, left_list_miss_cols, right_list_miss_cols)
A very simple way to do this - select the columns in the same order from both the dataframes and use unionAll
df1.select('code', 'date', 'A', 'B', 'C', lit(None).alias('D'), lit(None).alias('E'))\
.unionAll(df2.select('code', 'date', lit(None).alias('A'), 'B', 'C', 'D', 'E'))
Here's a pyspark solution.
It assumes that if a field in df1 is missing from df2, then you add that missing field to df2 with null values. However it also assumes that if the field exists in both dataframes, but the type or nullability of the field is different, then the two dataframes conflict and cannot be combined. In that case I raise a TypeError.
from pyspark.sql.functions import lit
def harmonize_schemas_and_combine(df_left, df_right):
left_types = {f.name: f.dataType for f in df_left.schema}
right_types = {f.name: f.dataType for f in df_right.schema}
left_fields = set((f.name, f.dataType, f.nullable) for f in df_left.schema)
right_fields = set((f.name, f.dataType, f.nullable) for f in df_right.schema)
# First go over left-unique fields
for l_name, l_type, l_nullable in left_fields.difference(right_fields):
if l_name in right_types:
r_type = right_types[l_name]
if l_type != r_type:
raise TypeError, "Union failed. Type conflict on field %s. left type %s, right type %s" % (l_name, l_type, r_type)
else:
raise TypeError, "Union failed. Nullability conflict on field %s. left nullable %s, right nullable %s" % (l_name, l_nullable, not(l_nullable))
df_right = df_right.withColumn(l_name, lit(None).cast(l_type))
# Now go over right-unique fields
for r_name, r_type, r_nullable in right_fields.difference(left_fields):
if r_name in left_types:
l_type = left_types[r_name]
if r_type != l_type:
raise TypeError, "Union failed. Type conflict on field %s. right type %s, left type %s" % (r_name, r_type, l_type)
else:
raise TypeError, "Union failed. Nullability conflict on field %s. right nullable %s, left nullable %s" % (r_name, r_nullable, not(r_nullable))
df_left = df_left.withColumn(r_name, lit(None).cast(r_type))
# Make sure columns are in the same order
df_left = df_left.select(df_right.columns)
return df_left.union(df_right)
I somehow find most of the python-answers here a bit too clunky in their writing if you're just going with the simple lit(None)-workaround (which is also the only way I know). As alternative this might be useful:
# df1 and df2 are assumed to be the given dataFrames from the question
# Get the lacking columns for each dataframe and set them to null in the respective dataFrame.
# First do so for df1...
for column in [column for column in df1.columns if column not in df2.columns]:
df1 = df1.withColumn(column, lit(None))
# ... and then for df2
for column in [column for column in df2.columns if column not in df1.columns]:
df2 = df2.withColumn(column, lit(None))
Afterwards just do the union() you wanted to do.
Caution: If your column-order differs between df1 and df2 use unionByName()!
result = df1.unionByName(df2)
Modified Alberto Bonsanto's version to preserve the original column order (OP implied the order should match the original tables). Also, the match part caused an Intellij warning.
Here's my version:
def unionDifferentTables(df1: DataFrame, df2: DataFrame): DataFrame = {
val cols1 = df1.columns.toSet
val cols2 = df2.columns.toSet
val total = cols1 ++ cols2 // union
val order = df1.columns ++ df2.columns
val sorted = total.toList.sortWith((a,b)=> order.indexOf(a) < order.indexOf(b))
def expr(myCols: Set[String], allCols: List[String]) = {
allCols.map( {
case x if myCols.contains(x) => col(x)
case y => lit(null).as(y)
})
}
df1.select(expr(cols1, sorted): _*).unionAll(df2.select(expr(cols2, sorted): _*))
}
in pyspark:
df = df1.join(df2, ['each', 'shared', 'col'], how='full')
I had the same issue and using join instead of union solved my problem.
So, for example with python , instead of this line of code:
result = left.union(right), which will fail to execute for different number of columns,
you should use this one:
result = left.join(right, left.columns if (len(left.columns) < len(right.columns)) else right.columns, "outer")
Note that the second argument contains the common columns between the two DataFrames. If you don't use it, the result will have duplicate columns with one of them being null and the other not.
Hope it helps.
There is much concise way to handle this issue with a moderate sacrifice of performance.
def unionWithDifferentSchema(a: DataFrame, b: DataFrame): DataFrame = {
sparkSession.read.json(a.toJSON.union(b.toJSON).rdd)
}
This is the function which does the trick. Using toJSON to each dataframe makes a json Union. This preserves the ordering and the datatype.
Only catch is toJSON is relatively expensive (however not much you probably get 10-15% slowdown). However this keeps the code clean.
My version for Java:
private static Dataset<Row> unionDatasets(Dataset<Row> one, Dataset<Row> another) {
StructType firstSchema = one.schema();
List<String> anotherFields = Arrays.asList(another.schema().fieldNames());
another = balanceDataset(another, firstSchema, anotherFields);
StructType secondSchema = another.schema();
List<String> oneFields = Arrays.asList(one.schema().fieldNames());
one = balanceDataset(one, secondSchema, oneFields);
return another.unionByName(one);
}
private static Dataset<Row> balanceDataset(Dataset<Row> dataset, StructType schema, List<String> fields) {
for (StructField e : schema.fields()) {
if (!fields.contains(e.name())) {
dataset = dataset
.withColumn(e.name(),
lit(null));
dataset = dataset.withColumn(e.name(),
dataset.col(e.name()).cast(Optional.ofNullable(e.dataType()).orElse(StringType)));
}
}
return dataset;
}
Here's the version in Scala also answered here, Also a Pyspark version..
( Spark - Merge / Union DataFrame with Different Schema (column names and sequence) to a DataFrame with Master common schema ) -
It takes List of dataframe to be unioned .. Provided same named columns in all the dataframe should have same datatype..
def unionPro(DFList: List[DataFrame], spark: org.apache.spark.sql.SparkSession): DataFrame = {
/**
* This Function Accepts DataFrame with same or Different Schema/Column Order.With some or none common columns
* Creates a Unioned DataFrame
*/
import spark.implicits._
val MasterColList: Array[String] = DFList.map(_.columns).reduce((x, y) => (x.union(y))).distinct
def unionExpr(myCols: Seq[String], allCols: Seq[String]): Seq[org.apache.spark.sql.Column] = {
allCols.toList.map(x => x match {
case x if myCols.contains(x) => col(x)
case _ => lit(null).as(x)
})
}
// Create EmptyDF , ignoring different Datatype in StructField and treating them same based on Name ignoring cases
val masterSchema = StructType(DFList.map(_.schema.fields).reduce((x, y) => (x.union(y))).groupBy(_.name.toUpperCase).map(_._2.head).toArray)
val masterEmptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], masterSchema).select(MasterColList.head, MasterColList.tail: _*)
DFList.map(df => df.select(unionExpr(df.columns, MasterColList): _*)).foldLeft(masterEmptyDF)((x, y) => x.union(y))
}
Here is the sample test for it -
val aDF = Seq(("A", 1), ("B", 2)).toDF("Name", "ID")
val bDF = Seq(("C", 1, "D1"), ("D", 2, "D2")).toDF("Name", "Sal", "Deptt")
unionPro(List(aDF, bDF), spark).show
Which gives output as -
+----+----+----+-----+
|Name| ID| Sal|Deptt|
+----+----+----+-----+
| A| 1|null| null|
| B| 2|null| null|
| C|null| 1| D1|
| D|null| 2| D2|
+----+----+----+-----+
This function takes in two dataframes (df1 and df2) with different schemas and unions them.
First we need to bring them to the same schema by adding all (missing) columns from df1 to df2 and vice versa. To add a new empty column to a df we need to specify the datatype.
import pyspark.sql.functions as F
def union_different_schemas(df1, df2):
# Get a list of all column names in both dfs
columns_df1 = df1.columns
columns_df2 = df2.columns
# Get a list of datatypes of the columns
data_types_df1 = [i.dataType for i in df1.schema.fields]
data_types_df2 = [i.dataType for i in df2.schema.fields]
# We go through all columns in df1 and if they are not in df2, we add
# them (and specify the correct datatype too)
for col, typ in zip(columns_df1, data_types_df1):
if col not in df2.columns:
df2 = df2\
.withColumn(col, F.lit(None).cast(typ))
# Now df2 has all missing columns from df1, let's do the same for df1
for col, typ in zip(columns_df2, data_types_df2):
if col not in df1.columns:
df1 = df1\
.withColumn(col, F.lit(None).cast(typ))
# Now df1 and df2 have the same columns, not necessarily in the same
# order, therefore we use unionByName
combined_df = df1\
.unionByName(df2)
return combined_df
PYSPARK
Scala version from Alberto works great. However, if you want to make a for-loop or some dynamic assignment of variables you can face some problems.
Solution comes with Pyspark - clean code:
from pyspark.sql.functions import *
#defining dataframes
df1 = spark.createDataFrame(
[
(1, 'foo','ok'),
(2, 'pro','ok')
],
['id', 'txt','check']
)
df2 = spark.createDataFrame(
[
(3, 'yep',13,'mo'),
(4, 'bro',11,'re')
],
['id', 'txt','value','more']
)
#retrieving columns
cols1 = df1.columns
cols2 = df2.columns
#getting columns from df1 and df2
total = list(set(cols2) | set(cols1))
#defining function for adding nulls (None in case of pyspark)
def addnulls(yourDF):
for x in total:
if not x in yourDF.columns:
yourDF = yourDF.withColumn(x,lit(None))
return yourDF
df1 = addnulls(df1)
df2 = addnulls(df2)
#additional sorting for correct unionAll (it concatenates DFs by column number)
df1.select(sorted(df1.columns)).unionAll(df2.select(sorted(df2.columns))).show()
+-----+---+----+---+-----+
|check| id|more|txt|value|
+-----+---+----+---+-----+
| ok| 1|null|foo| null|
| ok| 2|null|pro| null|
| null| 3| mo|yep| 13|
| null| 4| re|bro| 11|
+-----+---+----+---+-----+
from functools import reduce
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
def unionAll(*dfs, fill_by=None):
clmns = {clm.name.lower(): (clm.dataType, clm.name) for df in dfs for clm in df.schema.fields}
dfs = list(dfs)
for i, df in enumerate(dfs):
df_clmns = [clm.lower() for clm in df.columns]
for clm, (dataType, name) in clmns.items():
if clm not in df_clmns:
# Add the missing column
dfs[i] = dfs[i].withColumn(name, F.lit(fill_by).cast(dataType))
return reduce(DataFrame.unionByName, dfs)
unionAll(df1, df2).show()
Case insenstive columns
Will returns the actual column case
Support the existing datatypes
Default value can be customizable
Pass multiple dataframes at once (e.g unionAll(df1, df2, df3, ..., df10))
here's another one:
def unite(df1: DataFrame, df2: DataFrame): DataFrame = {
val cols1 = df1.columns.toSet
val cols2 = df2.columns.toSet
val total = (cols1 ++ cols2).toSeq.sorted
val expr1 = total.map(c => {
if (cols1.contains(c)) c else "NULL as " + c
})
val expr2 = total.map(c => {
if (cols2.contains(c)) c else "NULL as " + c
})
df1.selectExpr(expr1:_*).union(
df2.selectExpr(expr2:_*)
)
}
Union and outer union for Pyspark DataFrame concatenation. This works for multiple data frames with different columns.
def union_all(*dfs):
return reduce(ps.sql.DataFrame.unionAll, dfs)
def outer_union_all(*dfs):
all_cols = set([])
for df in dfs:
all_cols |= set(df.columns)
all_cols = list(all_cols)
print(all_cols)
def expr(cols, all_cols):
def append_cols(col):
if col in cols:
return col
else:
return sqlfunc.lit(None).alias(col)
cols_ = map(append_cols, all_cols)
return list(cols_)
union_df = union_all(*[df.select(expr(df.columns, all_cols)) for df in dfs])
return union_df
One more generic method to union list of DataFrame.
def unionFrames(dfs: Seq[DataFrame]): DataFrame = {
dfs match {
case Nil => session.emptyDataFrame // or throw an exception?
case x :: Nil => x
case _ =>
//Preserving Column order from left to right DF's column order
val allColumns = dfs.foldLeft(collection.mutable.ArrayBuffer.empty[String])((a, b) => a ++ b.columns).distinct
val appendMissingColumns = (df: DataFrame) => {
val columns = df.columns.toSet
df.select(allColumns.map(c => if (columns.contains(c)) col(c) else lit(null).as(c)): _*)
}
dfs.tail.foldLeft(appendMissingColumns(dfs.head))((a, b) => a.union(appendMissingColumns(b)))
}
This is my pyspark version:
from functools import reduce
from pyspark.sql.functions import lit
def concat(dfs):
# when the dataframes to combine do not have the same order of columns
# https://datascience.stackexchange.com/a/27231/15325
return reduce(lambda df1, df2: df1.union(df2.select(df1.columns)), dfs)
def union_all(dfs):
columns = reduce(lambda x, y : set(x).union(set(y)), [ i.columns for i in dfs ] )
for i in range(len(dfs)):
d = dfs[i]
for c in columns:
if c not in d.columns:
d = d.withColumn(c, lit(None))
dfs[i] = d
return concat(dfs)
Alternate you could use full join.
list_of_files = ['test1.parquet', 'test2.parquet']
def merged_frames():
if list_of_files:
frames = [spark.read.parquet(df.path) for df in list_of_files]
if frames:
df = frames[0]
if frames[1]:
var = 1
for element in range(len(frames)-1):
result_df = df.join(frames[var], 'primary_key', how='full')
var += 1
display(result_df)
If you are loading from files, I guess you could just use the read function with a list of files.
# file_paths is list of files with different schema
df = spark.read.option("mergeSchema", "true").json(file_paths)
The resulting dataframe will have merged columns.

Replace one dataframe column value with another's value

I have two dataframes (Scala Spark) A and B. When A("id") == B("a_id") I want to update A("value") to B("value"). Since DataFrames have to be recreated I'm assuming I have to do some joins and withColumn calls but I'm not sure how to do this. In SQL it would be a simple update call on a natural join but for some reason this seems difficult in Spark?
Indeed, a left join and a select call would do the trick:
// assuming "spark" is an active SparkSession:
import org.apache.spark.sql.functions._
import spark.implicits._
// some sample data; Notice it's convenient to NAME the dataframes using .as(...)
val A = Seq((1, "a1"), (2, "a2"), (3, "a3")).toDF("id", "value").as("A")
val B = Seq((1, "b1"), (2, "b2")).toDF("a_id", "value").as("B")
// left join + coalesce to "choose" the original value if no match found:
val result = A.join(B, $"A.id" === $"B.a_id", "left")
.select($"id", coalesce($"B.value", $"A.value") as "value")
// result:
// +---+-----+
// | id|value|
// +---+-----+
// | 1| b1|
// | 2| b2|
// | 3| a3|
// +---+-----+
Notice that there's no real "update" here - result is a new DataFrame which you can use (write / count / ...) but the original DataFrames remain unchanged.