Calculating edit distance on successive rows of a Spark DataFrame - Scala

I have a data frame as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._
// some data...
val df = Seq(
(1, "AA", "BB", ("AA", "BB")),
(2, "AA", "BB", ("AA", "BB")),
(3, "AB", "BB", ("AB", "BB"))
).toDF("id","name", "surname", "array")
df.show()
I want to calculate the edit distance between the 'array' column values of successive rows. For example, I want the edit distance between the 'array' entry in row 1 ("AA", "BB") and the 'array' entry in row 2 ("AA", "BB"). Here is the edit distance function I am using:
def editDist2[A](a: Iterable[A], b: Iterable[A]): Int = {
val startRow = (0 to b.size).toList
a.foldLeft(startRow) { (prevRow, aElem) =>
(prevRow.zip(prevRow.tail).zip(b)).scanLeft(prevRow.head + 1) {
case (left, ((diag, up), bElem)) => {
val aGapScore = up + 1
val bGapScore = left + 1
val matchScore = diag + (if (aElem == bElem) 0 else 1)
List(aGapScore, bGapScore, matchScore).min
}
}
}.last
}
I know I need to create a UDF for this function, but I can't seem to get it working. If I use the function as is, with a Spark window to get at the previous row:
// creating window - ordered by ID
val window = Window.orderBy("id")
// using the window with lag function to compare to previous value in each column
df.withColumn("edit-d", editDist2(($"array"), lag("array", 1).over(window))).show()
I get the following error:
<console>:245: error: type mismatch;
found : org.apache.spark.sql.ColumnName
required: Iterable[?]
df.withColumn("edit-d", editDist2(($"array"), lag("array", 1).over(window))).show()

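I figured out that you can use Spark's built-in levenshtein function for this. It takes two strings to compare, so it can't be applied to the array column directly; instead, the distances for the name and surname columns are computed separately and added: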
// creating window - ordered by ID
val window = Window.orderBy("id")
// using the window with lag function to compare to previous value in each column
df.withColumn("edit-d", levenshtein(($"name"), lag("name", 1).over(window)) + levenshtein(($"surname"), lag("surname", 1).over(window))).show()
giving the desired output:
+---+----+-------+--------+------+
| id|name|surname| array|edit-d|
+---+----+-------+--------+------+
| 1| AA| BB|[AA, BB]| null|
| 2| AA| BB|[AA, BB]| 0|
| 3| AB| BB|[AB, BB]| 1|
+---+----+-------+--------+------+
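The built-in levenshtein compares strings character by character, whereas the question asked for a distance between the array values of successive rows. The two agree on this sample but can differ in general. As a minimal sketch for checking the element-wise logic outside Spark, the following plain Python mirrors editDist2 and the lag-by-one comparison; `edit_distance` and `rows` are illustrative names, not part of any Spark API.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, compared element by element."""
    prev_row = list(range(len(b) + 1))
    for a_elem in a:
        row = [prev_row[0] + 1]
        for j, b_elem in enumerate(b):
            row.append(min(
                prev_row[j + 1] + 1,               # drop a_elem
                row[j] + 1,                        # drop b_elem
                prev_row[j] + (a_elem != b_elem),  # match (0) or substitute (1)
            ))
        prev_row = row
    return prev_row[-1]

# (id, array) pairs standing in for the DataFrame rows, already ordered by id
rows = [(1, ["AA", "BB"]), (2, ["AA", "BB"]), (3, ["AB", "BB"])]

# Compare each row with its predecessor; the first row has none, hence None.
distances = [None] + [
    edit_distance(cur, prev)
    for (_, prev), (_, cur) in zip(rows, rows[1:])
]
print(distances)  # [None, 0, 1]
```

In Spark, the same function would have to be wrapped in a UDF and applied to the current and lagged column, which is the step the question was stuck on.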

Related

How to extract efficiently multiple columns from a single string column RDD?

I have a file with 20+ columns of which I would like to extract a few. Until now, I have the following code. I'm sure there is a smart way to do it, but not able to get it working successfully. Any ideas?
mvnmdata is of type RDD[String]
val strpcols = mvnmdata.map(x => x.split('|')).map(x => (x(0),x(1),x(5),x(6),x(7),x(8),x(9),x(10),x(11),x(12),x(13),x(14),x(15),x(16),x(17),x(18),x(19),x(20),x(21),x(22),x(23) ))
The following solution provides an easy and scalable way to manage your column names and indices. It is based on a map that defines the column name/index relation, so the same map gives us both the index of each extracted column and its name.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructType, StructField}
val rdd = spark.sparkContext.parallelize(Seq(
"1|500|400|300",
"1|34|67|89",
"2|10|20|56",
"3|2|5|56",
"3|1|8|22"))
val dictColums = Map("c0" -> 0, "c2" -> 2)
// create schema from map keys
val schema = StructType(dictColums.keys.toSeq.map(StructField(_, StringType, true)))
val mappedRDD = rdd.map{line => line.split('|')}
.map{
cols => Row.fromSeq(dictColums.values.toSeq.map{cols(_)})
}
val df = spark.createDataFrame(mappedRDD, schema)
df.show()
//output
+---+---+
| c0| c2|
+---+---+
| 1|400|
| 1| 67|
| 2| 20|
| 3| 5|
| 3| 8|
+---+---+
First we declare dictColums; in this example we extract the columns "c0" -> 0 and "c2" -> 2.
Next we create the schema from the keys of the map.
The first map (which you already have) splits each line by |; the second one creates a Row containing the values at the indices given by dictColums.values.
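The three steps above can be checked on a handful of lines without a cluster. This is a minimal plain-Python sketch of the same name/index mapping, not the Spark code itself; `extract` and `col_index` are illustrative names.

```python
lines = ["1|500|400|300", "1|34|67|89", "2|10|20|56", "3|2|5|56", "3|1|8|22"]

# column name -> position of that column in the split line
col_index = {"c0": 0, "c2": 2}

def extract(line, mapping, sep="|"):
    """Split one line and keep only the mapped columns, keyed by name."""
    fields = line.split(sep)
    return {name: fields[idx] for name, idx in mapping.items()}

records = [extract(line, col_index) for line in lines]
print(records[0])  # {'c0': '1', 'c2': '400'}
```

Keeping names and indices in one mapping means adding a column is a one-entry change, which is the point of the dictColums approach.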
UPDATE:
You could also create a function from the above functionality in order to be able to reuse it multiple times:
import org.apache.spark.sql.DataFrame
import org.apache.spark.rdd.RDD
def stringRddToDataFrame(colsMapping: Map[String, Int], rdd: RDD[String]) : DataFrame = {
val schema = StructType(colsMapping.keys.toSeq.map(StructField(_, StringType, true)))
val mappedRDD = rdd.map{line => line.split('|')}
.map{
cols => Row.fromSeq(colsMapping.values.toSeq.map{cols(_)})
}
spark.createDataFrame(mappedRDD, schema)
}
And then use it for your case:
val cols = Map("c0" -> 0, "c1" -> 1, "c5" -> 5, ... "c23" -> 23)
val df = stringRddToDataFrame(cols, rdd)
If you don't want to write x(i) repeatedly, you can collect the fields in a loop. Example 1:
import scala.collection.mutable.ArrayBuffer
val strpcols = mvnmdata.map(x => x.split('|'))
.map(x =>{
val xbuffer = new ArrayBuffer[String]()
for (i <- Array(0,1,5,6...)){
xbuffer.append(x(i))
}
xbuffer
})
If you only want to define the index list by a start, an end, and the indices to exclude, see Example 2 below:
scala> (1 to 10).toSet
res8: scala.collection.immutable.Set[Int] = Set(5, 10, 1, 6, 9, 2, 7, 3, 8, 4)
scala> ((1 to 10).toSet -- Set(2,9)).toArray.sortBy(row=>row)
res9: Array[Int] = Array(1, 3, 4, 5, 6, 7, 8, 10)
The final code you want:
//define the function to process indexes
def getSpecIndexes(start:Int, end:Int, removedValueSet:Set[Int]):Array[Int] = {
((start to end).toSet -- removedValueSet).toArray.sortBy(row=>row)
}
val strpcols = mvnmdata.map(x => x.split('|'))
.map(x =>{
val xbuffer = new ArrayBuffer[String]()
//call the function
for (i <- getSpecIndexes(0,100,Set(3,4,5,6))){
xbuffer.append(x(i))
}
xbuffer
})
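The index selection can be tried in isolation before wiring it into the RDD transformation. This is a minimal plain-Python sketch of the same idea, with hypothetical helper names `get_spec_indexes` and `pick_fields`. Here the index list is computed once and reused for every line rather than rebuilt per row.

```python
def get_spec_indexes(start, end, removed):
    """Sorted indices in [start, end] (inclusive) minus the excluded ones."""
    return sorted(set(range(start, end + 1)) - set(removed))

def pick_fields(line, indexes, sep="|"):
    """Split a line and keep only the fields at the given positions."""
    fields = line.split(sep)
    return [fields[i] for i in indexes]

indexes = get_spec_indexes(0, 5, {1, 4})   # computed once, reused for all lines
print(indexes)                              # [0, 2, 3, 5]
print(pick_fields("a|b|c|d|e|f", indexes))  # ['a', 'c', 'd', 'f']
```

Note that the end index must not exceed the number of fields in a line, otherwise the lookup fails.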

Finding size of distinct array column

I am using Scala and Spark to create a dataframe. Here's my code so far:
val df = transformedFlattenDF
.groupBy($"market", $"city", $"carrier").agg(count("*").alias("count"), min($"bandwidth").alias("bandwidth"), first($"network").alias("network"), concat_ws(",", collect_list($"carrierCode")).alias("carrierCode")).withColumn("carrierCode", split(($"carrierCode"), ",").cast("array<string>")).withColumn("Carrier Count", collect_set("carrierCode"))
The column carrierCode becomes an array column. The data is present as follows:
CarrierCode
1: [12,2,12]
2: [5,2,8]
3: [1,1,3]
I'd like to create a column that counts the number of distinct values in each array. I tried using collect_set, but it gives me an error saying "grouping expressions sequence is empty". Is it possible to find the number of distinct values in each row's array? In the example above, the new column would look like this:
Carrier Count
1: 2
2: 3
3: 2
collect_set is for aggregation hence should be applied within your groupBy-agg step:
val df = transformedFlattenDF.groupBy($"market", $"city", $"carrier").agg(
count("*").alias("count"), min($"bandwidth").alias("bandwidth"),
first($"network").alias("network"),
concat_ws(",", collect_list($"carrierCode")).alias("carrierCode"),
size(collect_set($"carrierCode")).as("carrier_count") // <-- ADDED `collect_set`
).
withColumn("carrierCode", split(($"carrierCode"), ",").cast("array<string>"))
If you don't want to change the existing groupBy-agg code, you can create a UDF like in the following example:
import org.apache.spark.sql.functions._
val codeDF = Seq(
Array("12", "2", "12"),
Array("5", "2", "8"),
Array("1", "1", "3")
).toDF("carrier_code")
def distinctElemCount = udf( (a: Seq[String]) => a.toSet.size )
codeDF.withColumn("carrier_count", distinctElemCount($"carrier_code")).
show
// +------------+-------------+
// |carrier_code|carrier_count|
// +------------+-------------+
// | [12, 2, 12]| 2|
// | [5, 2, 8]| 3|
// | [1, 1, 3]| 2|
// +------------+-------------+
For posterity, here is a version without a UDF that converts to an RDD and back to a DataFrame:
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(
("A", 2, 100, 2), ("F", 7, 100, 1), ("B", 10, 100, 100)
)).toDF("c1", "c2", "c3", "c4")
val x = df.select("c1", "c2", "c3", "c4").rdd.map(x => (x.get(0), List(x.get(1), x.get(2), x.get(3))) )
val y = x.map {case (k, vL) => (k, vL.toSet.size) }
// Manipulate back to your DF, via conversion, join, what not.
y.collect returns:
res15: Array[(Any, Int)] = Array((A,2), (F,3), (B,2))
The solution above is better; as stated, this one is mostly for posterity.
You can also do this with a UDF on the raw string column:
//Input
df.show
+-----------+
|CarrierCode|
+-----------+
|1:[12,2,12]|
| 2:[5,2,8]|
| 3:[1,1,3]|
+-----------+
//udf: strip the surrounding brackets before splitting, otherwise "[12" and "12]" count as different values
val countUDF=udf{(str:String)=>val strArr=str.split(":"); strArr(0)+":"+strArr(1).stripPrefix("[").stripSuffix("]").split(",").distinct.length.toString}
df.withColumn("Carrier Count",countUDF(col("CarrierCode"))).show
//Sample Output:
+-----------+-------------+
|CarrierCode|Carrier Count|
+-----------+-------------+
|1:[12,2,12]| 1:2|
| 2:[5,2,8]| 2:3|
| 3:[1,1,3]| 3:2|
+-----------+-------------+

Scala: How to add a column with the value of a changed field that was changed between two tables

I have two tables with the same schema (A and B) where every unique ID in table A also exists in table B in a 1 to 1 way. I want to add a column to table B with the name of the column whose value is different between the tables for each row. There is only one difference per row.
For example:
Table A:
{ "id1": 1,"id2": "a","name": "bob","state": "nj"}
{"id1": 2,"id2": "b","name": "sue","state": "ma"}
Table B:
{"id1": 1,"id2": "a","name": "bob","state": "fl"}
{"id1": 2,"id2": "b","name": "susan","state": "ma"}
After comparing them, I want Table B to look like this:
{"id1": 1,"id2": "a","name": "bob","state": "fl", "changed_field": "state"}
{"id1": 2,"id2": "b","name": "susan","state": "ma", "changed_field": "name"}
I can't find any functions that do this in Spark Scala's data frames. Is there something that I missed?
EDIT: I am working with hundreds to thousands of columns
Here's a way to achieve this without having to "spell-out" the columns, and without a UDF (only using built-in functions):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._
// list of columns to compare
val comparableColumns = A.columns.tail // without id
// create Column that would result in the name of the first differing column:
val changedFieldCol: Column = comparableColumns.foldLeft(lit("")) {
case (result, col) => when(
result === "", when($"A.$col" =!= $"B.$col", lit(col)).otherwise(lit(""))
).otherwise(result)
}
// join by id1, add changedFieldCol, and then select only B's columns:
val result = A.as("A").join(B.as("B"), "id1")
.withColumn("changed_field", changedFieldCol)
.select("id1", comparableColumns.map(c => s"B.$c") :+ "changed_field": _*)
result.show(false)
// +---+---+-----+-----+-------------+
// |id1|id2|name |state|changed_field|
// +---+---+-----+-----+-------------+
// |1 |a |bob |fl |state |
// |2 |b |susan|ma |name |
// +---+---+-----+-----+-------------+
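The foldLeft above keeps the first column name for which the two sides differ and leaves the result empty when nothing differs. As a minimal sketch of that logic outside Spark, the following plain Python uses dictionaries in place of the joined rows; `first_changed_field` and the sample tables are illustrative only.

```python
def first_changed_field(row_a, row_b, columns):
    """Name of the first column whose value differs, or "" if none does."""
    for name in columns:
        if row_a[name] != row_b[name]:
            return name
    return ""

# Tables keyed by id1, standing in for the join on id1
table_a = {
    1: {"id2": "a", "name": "bob", "state": "nj"},
    2: {"id2": "b", "name": "sue", "state": "ma"},
}
table_b = {
    1: {"id2": "a", "name": "bob", "state": "fl"},
    2: {"id2": "b", "name": "susan", "state": "ma"},
}
comparable = ["id2", "name", "state"]  # every column except the join key

result = {
    key: {**row_b, "changed_field": first_changed_field(table_a[key], row_b, comparable)}
    for key, row_b in table_b.items()
}
print(result[1]["changed_field"], result[2]["changed_field"])  # state name
```

One difference to keep in mind: in Spark, a comparison involving null evaluates to null rather than true, so a change to or from null would not be reported by `=!=` without extra handling.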
You can compare the fields in a UDF which generates the appropriate string:
import org.apache.spark.sql.functions.udf
import spark.implicits._
val df_a = Seq(
(1, "a", "bob", "nj"),
(2, "b", "sue", "ma")
).toDF("id1", "id2", "name", "state")
val df_b = Seq(
(1, "a", "bob", "fl"),
(2, "b", "susane", "ma")
).toDF("id1", "id2", "name", "state")
val compareFields = udf((aName:String,aState:String,bName:String,bState:String) => {
val changedState = if (aState != bState) Some("state") else None
val changedName = if (aName != bName) Some("name") else None
Seq(changedName, changedState).flatten.mkString(",")
}
)
df_b.as("b")
.join(
df_a.as("a"), Seq("id1", "id2")
)
.withColumn("changed_fields",compareFields($"a.name",$"a.state",$"b.name",$"b.state"))
.select($"id1",$"id2",$"b.name",$"b.state",$"changed_fields")
.show()
gives
+---+---+------+-----+--------------+
|id1|id2| name|state|changed_fields|
+---+---+------+-----+--------------+
| 1| a| bob| fl| state|
| 2| b|susane| ma| name|
+---+---+------+-----+--------------+
EDIT:
Here is a more generic version which compares all fields at once:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.struct
val compareFields = udf((a:Row,b:Row) => {
assert(a.schema==b.schema)
a.schema
.indices
.map(i => if(a.get(i)!=b.get(i)) Some(a.schema(i).name) else None)
.flatten
.mkString(",")
}
)
df_b.as("b")
.join(df_a.as("a"), $"a.id1" === $"b.id1" and $"a.id2" === $"b.id2")
.withColumn("changed_fields",compareFields(struct($"a.*"),struct($"b.*")))
.select($"b.id1",$"b.id2",$"b.name",$"b.state",$"changed_fields")
.show()

UnionAll for dataframes with different columns from list in spark scala [duplicate]

I have 2 DataFrames:
I need union like this:
The unionAll function doesn't work because the number and the name of columns are different.
How can I do this?
In Scala you just have to append all missing columns as nulls.
import org.apache.spark.sql.functions._
// let df1 and df2 be the DataFrames to merge
val df1 = sc.parallelize(List(
(50, 2),
(34, 4)
)).toDF("age", "children")
val df2 = sc.parallelize(List(
(26, true, 60000.00),
(32, false, 35000.00)
)).toDF("age", "education", "income")
val cols1 = df1.columns.toSet
val cols2 = df2.columns.toSet
val total = cols1 ++ cols2 // union
def expr(myCols: Set[String], allCols: Set[String]) = {
allCols.toList.map(x => x match {
case x if myCols.contains(x) => col(x)
case _ => lit(null).as(x)
})
}
df1.select(expr(cols1, total):_*).unionAll(df2.select(expr(cols2, total):_*)).show()
+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 50| 2| null| null|
| 34| 4| null| null|
| 26| null| true|60000.0|
| 32| null| false|35000.0|
+---+--------+---------+-------+
Update
Both intermediate DataFrames will have the same column order, because we map over total in both cases.
df1.select(expr(cols1, total):_*).show()
df2.select(expr(cols2, total):_*).show()
+---+--------+---------+------+
|age|children|education|income|
+---+--------+---------+------+
| 50| 2| null| null|
| 34| 4| null| null|
+---+--------+---------+------+
+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 26| null| true|60000.0|
| 32| null| false|35000.0|
+---+--------+---------+-------+
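The technique reduces to three steps: take the union of the column names, pad each side with nulls for the columns it lacks, and emit both sides in the same column order before concatenating. This is a minimal plain-Python sketch of those steps on the sample data, with tuples standing in for rows; `pad_rows` is an illustrative helper, not a Spark function.

```python
def pad_rows(rows, columns, all_columns):
    """Reorder each row to all_columns, filling absent columns with None."""
    position = {name: i for i, name in enumerate(columns)}
    return [
        tuple(row[position[c]] if c in position else None for c in all_columns)
        for row in rows
    ]

cols1, rows1 = ["age", "children"], [(50, 2), (34, 4)]
cols2, rows2 = ["age", "education", "income"], [(26, True, 60000.0), (32, False, 35000.0)]

# Union of the column names, keeping first-seen order
all_columns = list(dict.fromkeys(cols1 + cols2))

unioned = pad_rows(rows1, cols1, all_columns) + pad_rows(rows2, cols2, all_columns)
for row in unioned:
    print(row)
```

Because both sides are projected onto the same ordered column list, a positional union is safe afterwards. Unlike the Set-based version above, this sketch keeps the columns in first-seen order.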
Spark 3.1+
df = df1.unionByName(df2, allowMissingColumns=True)
Test results:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data1=[
(1 , '2016-08-29', 1 , 2, 3),
(2 , '2016-08-29', 1 , 2, 3),
(3 , '2016-08-29', 1 , 2, 3)]
df1 = spark.createDataFrame(data1, ['code' , 'date' , 'A' , 'B', 'C'])
data2=[
(5 , '2016-08-29', 1, 2, 3, 4),
(6 , '2016-08-29', 1, 2, 3, 4),
(7 , '2016-08-29', 1, 2, 3, 4)]
df2 = spark.createDataFrame(data2, ['code' , 'date' , 'B', 'C', 'D', 'E'])
df = df1.unionByName(df2, allowMissingColumns=True)
df.show()
# +----+----------+----+---+---+----+----+
# |code| date| A| B| C| D| E|
# +----+----------+----+---+---+----+----+
# | 1|2016-08-29| 1| 2| 3|null|null|
# | 2|2016-08-29| 1| 2| 3|null|null|
# | 3|2016-08-29| 1| 2| 3|null|null|
# | 5|2016-08-29|null| 1| 2| 3| 4|
# | 6|2016-08-29|null| 1| 2| 3| 4|
# | 7|2016-08-29|null| 1| 2| 3| 4|
# +----+----------+----+---+---+----+----+
Spark 2.3+
diff1 = [c for c in df2.columns if c not in df1.columns]
diff2 = [c for c in df1.columns if c not in df2.columns]
df = df1.select('*', *[F.lit(None).alias(c) for c in diff1]) \
.unionByName(df2.select('*', *[F.lit(None).alias(c) for c in diff2]))
Test results:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
data1=[
(1 , '2016-08-29', 1 , 2, 3),
(2 , '2016-08-29', 1 , 2, 3),
(3 , '2016-08-29', 1 , 2, 3)]
df1 = spark.createDataFrame(data1, ['code' , 'date' , 'A' , 'B', 'C'])
data2=[
(5 , '2016-08-29', 1, 2, 3, 4),
(6 , '2016-08-29', 1, 2, 3, 4),
(7 , '2016-08-29', 1, 2, 3, 4)]
df2 = spark.createDataFrame(data2, ['code' , 'date' , 'B', 'C', 'D', 'E'])
diff1 = [c for c in df2.columns if c not in df1.columns]
diff2 = [c for c in df1.columns if c not in df2.columns]
df = df1.select('*', *[F.lit(None).alias(c) for c in diff1]) \
.unionByName(df2.select('*', *[F.lit(None).alias(c) for c in diff2]))
df.show()
# +----+----------+----+---+---+----+----+
# |code| date| A| B| C| D| E|
# +----+----------+----+---+---+----+----+
# | 1|2016-08-29| 1| 2| 3|null|null|
# | 2|2016-08-29| 1| 2| 3|null|null|
# | 3|2016-08-29| 1| 2| 3|null|null|
# | 5|2016-08-29|null| 1| 2| 3| 4|
# | 6|2016-08-29|null| 1| 2| 3| 4|
# | 7|2016-08-29|null| 1| 2| 3| 4|
# +----+----------+----+---+---+----+----+
Here is my Python version:
from pyspark.sql import SparkSession, HiveContext
from pyspark.sql.functions import lit
from pyspark.sql import Row
def customUnion(df1, df2):
    cols1 = df1.columns
    cols2 = df2.columns
    total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))
    def expr(mycols, allcols):
        def processCols(colname):
            if colname in mycols:
                return colname
            else:
                return lit(None).alias(colname)
        cols = map(processCols, allcols)
        return list(cols)
    appended = df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))
    return appended
Here is sample usage:
data = [
Row(zip_code=58542, dma='MIN'),
Row(zip_code=58701, dma='MIN'),
Row(zip_code=57632, dma='MIN'),
Row(zip_code=58734, dma='MIN')
]
firstDF = spark.createDataFrame(data)
data = [
Row(zip_code='534', name='MIN'),
Row(zip_code='353', name='MIN'),
Row(zip_code='134', name='MIN'),
Row(zip_code='245', name='MIN')
]
secondDF = spark.createDataFrame(data)
customUnion(firstDF,secondDF).show()
Here is the code for Python 3.0 using pyspark:
from pyspark.sql.functions import lit
def __order_df_and_add_missing_cols(df, columns_order_list, df_missing_fields):
    """ return ordered dataFrame by the columns order list with null in missing columns """
    if not df_missing_fields:  # no missing fields for the df
        return df.select(columns_order_list)
    else:
        columns = []
        for colName in columns_order_list:
            if colName not in df_missing_fields:
                columns.append(colName)
            else:
                columns.append(lit(None).alias(colName))
        return df.select(columns)
def __add_missing_columns(df, missing_column_names):
    """ Add missing columns as null in the end of the columns list """
    list_missing_columns = []
    for col in missing_column_names:
        list_missing_columns.append(lit(None).alias(col))
    return df.select(df.schema.names + list_missing_columns)
def __order_and_union_d_fs(left_df, right_df, left_list_miss_cols, right_list_miss_cols):
    """ return union of data frames with ordered columns by left_df. """
    left_df_all_cols = __add_missing_columns(left_df, left_list_miss_cols)
    right_df_all_cols = __order_df_and_add_missing_cols(right_df, left_df_all_cols.schema.names,
                                                        right_list_miss_cols)
    return left_df_all_cols.union(right_df_all_cols)
def union_d_fs(left_df, right_df):
    """ Union between two dataFrames, if there is a gap of column fields,
    it will append all missing columns as nulls """
    # Check for None input
    if left_df is None:
        raise ValueError('left_df parameter should not be None')
    if right_df is None:
        raise ValueError('right_df parameter should not be None')
    # For data frames with equal columns and order- regular union
    if left_df.schema.names == right_df.schema.names:
        return left_df.union(right_df)
    else:  # Different columns
        # Save dataFrame columns name list as set
        left_df_col_list = set(left_df.schema.names)
        right_df_col_list = set(right_df.schema.names)
        # Diff columns between left_df and right_df
        right_list_miss_cols = list(left_df_col_list - right_df_col_list)
        left_list_miss_cols = list(right_df_col_list - left_df_col_list)
        return __order_and_union_d_fs(left_df, right_df, left_list_miss_cols, right_list_miss_cols)
A very simple way to do this - select the columns in the same order from both the dataframes and use unionAll
df1.select('code', 'date', 'A', 'B', 'C', lit(None).alias('D'), lit(None).alias('E'))\
.unionAll(df2.select('code', 'date', lit(None).alias('A'), 'B', 'C', 'D', 'E'))
Here's a pyspark solution.
It assumes that if a field in df1 is missing from df2, then you add that missing field to df2 with null values. However it also assumes that if the field exists in both dataframes, but the type or nullability of the field is different, then the two dataframes conflict and cannot be combined. In that case I raise a TypeError.
from pyspark.sql.functions import lit
def harmonize_schemas_and_combine(df_left, df_right):
    left_types = {f.name: f.dataType for f in df_left.schema}
    right_types = {f.name: f.dataType for f in df_right.schema}
    left_fields = set((f.name, f.dataType, f.nullable) for f in df_left.schema)
    right_fields = set((f.name, f.dataType, f.nullable) for f in df_right.schema)
    # First go over left-unique fields
    for l_name, l_type, l_nullable in left_fields.difference(right_fields):
        if l_name in right_types:
            r_type = right_types[l_name]
            if l_type != r_type:
                raise TypeError("Union failed. Type conflict on field %s. left type %s, right type %s" % (l_name, l_type, r_type))
            else:
                raise TypeError("Union failed. Nullability conflict on field %s. left nullable %s, right nullable %s" % (l_name, l_nullable, not(l_nullable)))
        df_right = df_right.withColumn(l_name, lit(None).cast(l_type))
    # Now go over right-unique fields
    for r_name, r_type, r_nullable in right_fields.difference(left_fields):
        if r_name in left_types:
            l_type = left_types[r_name]
            if r_type != l_type:
                raise TypeError("Union failed. Type conflict on field %s. right type %s, left type %s" % (r_name, r_type, l_type))
            else:
                raise TypeError("Union failed. Nullability conflict on field %s. right nullable %s, left nullable %s" % (r_name, r_nullable, not(r_nullable)))
        df_left = df_left.withColumn(r_name, lit(None).cast(r_type))
    # Make sure columns are in the same order
    df_left = df_left.select(df_right.columns)
    return df_left.union(df_right)
I find most of the Python answers here a bit too clunky if you're just going with the simple lit(None) workaround (which is also the only way I know). As an alternative, this might be useful:
from pyspark.sql.functions import lit
# df1 and df2 are assumed to be the given dataFrames from the question
# Get the lacking columns for each dataframe and set them to null in the respective dataFrame.
# First do so for df1 (add the columns that only df2 has)...
for column in [column for column in df2.columns if column not in df1.columns]:
    df1 = df1.withColumn(column, lit(None))
# ... and then for df2 (add the columns that only df1 has)
for column in [column for column in df1.columns if column not in df2.columns]:
    df2 = df2.withColumn(column, lit(None))
Afterwards just do the union() you wanted to do.
Caution: If your column-order differs between df1 and df2 use unionByName()!
result = df1.unionByName(df2)
Modified Alberto Bonsanto's version to preserve the original column order (OP implied the order should match the original tables). Also, the match part caused an Intellij warning.
Here's my version:
def unionDifferentTables(df1: DataFrame, df2: DataFrame): DataFrame = {
val cols1 = df1.columns.toSet
val cols2 = df2.columns.toSet
val total = cols1 ++ cols2 // union
val order = df1.columns ++ df2.columns
val sorted = total.toList.sortWith((a,b)=> order.indexOf(a) < order.indexOf(b))
def expr(myCols: Set[String], allCols: List[String]) = {
allCols.map( {
case x if myCols.contains(x) => col(x)
case y => lit(null).as(y)
})
}
df1.select(expr(cols1, sorted): _*).unionAll(df2.select(expr(cols2, sorted): _*))
}
in pyspark:
df = df1.join(df2, ['each', 'shared', 'col'], how='full')
I had the same issue and using join instead of union solved my problem.
So, for example with python , instead of this line of code:
result = left.union(right), which will fail to execute for different number of columns,
you should use this one:
result = left.join(right, left.columns if (len(left.columns) < len(right.columns)) else right.columns, "outer")
Note that the second argument contains the common columns between the two DataFrames. If you don't use it, the result will have duplicate columns with one of them being null and the other not.
Hope it helps.
There is a much more concise way to handle this issue, with a moderate sacrifice of performance.
def unionWithDifferentSchema(a: DataFrame, b: DataFrame): DataFrame = {
sparkSession.read.json(a.toJSON.union(b.toJSON).rdd)
}
This function does the trick. Calling toJSON on each DataFrame turns the operation into a union of JSON strings, which is then read back. This preserves the ordering and the datatype.
The only catch is that toJSON is relatively expensive (though not by much; you will probably see a 10-15% slowdown). However, it keeps the code clean.
My version for Java:
private static Dataset<Row> unionDatasets(Dataset<Row> one, Dataset<Row> another) {
StructType firstSchema = one.schema();
List<String> anotherFields = Arrays.asList(another.schema().fieldNames());
another = balanceDataset(another, firstSchema, anotherFields);
StructType secondSchema = another.schema();
List<String> oneFields = Arrays.asList(one.schema().fieldNames());
one = balanceDataset(one, secondSchema, oneFields);
return another.unionByName(one);
}
private static Dataset<Row> balanceDataset(Dataset<Row> dataset, StructType schema, List<String> fields) {
for (StructField e : schema.fields()) {
if (!fields.contains(e.name())) {
dataset = dataset
.withColumn(e.name(),
lit(null));
dataset = dataset.withColumn(e.name(),
dataset.col(e.name()).cast(Optional.ofNullable(e.dataType()).orElse(StringType)));
}
}
return dataset;
}
Here's a version in Scala, also posted as an answer to the following question, where there is a PySpark version as well:
( Spark - Merge / Union DataFrame with Different Schema (column names and sequence) to a DataFrame with Master common schema )
It takes a List of DataFrames to be unioned, provided that same-named columns in all the DataFrames have the same datatype.
def unionPro(DFList: List[DataFrame], spark: org.apache.spark.sql.SparkSession): DataFrame = {
/**
* This Function Accepts DataFrame with same or Different Schema/Column Order.With some or none common columns
* Creates a Unioned DataFrame
*/
import spark.implicits._
val MasterColList: Array[String] = DFList.map(_.columns).reduce((x, y) => (x.union(y))).distinct
def unionExpr(myCols: Seq[String], allCols: Seq[String]): Seq[org.apache.spark.sql.Column] = {
allCols.toList.map(x => x match {
case x if myCols.contains(x) => col(x)
case _ => lit(null).as(x)
})
}
// Create EmptyDF , ignoring different Datatype in StructField and treating them same based on Name ignoring cases
val masterSchema = StructType(DFList.map(_.schema.fields).reduce((x, y) => (x.union(y))).groupBy(_.name.toUpperCase).map(_._2.head).toArray)
val masterEmptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], masterSchema).select(MasterColList.head, MasterColList.tail: _*)
DFList.map(df => df.select(unionExpr(df.columns, MasterColList): _*)).foldLeft(masterEmptyDF)((x, y) => x.union(y))
}
Here is the sample test for it -
val aDF = Seq(("A", 1), ("B", 2)).toDF("Name", "ID")
val bDF = Seq(("C", 1, "D1"), ("D", 2, "D2")).toDF("Name", "Sal", "Deptt")
unionPro(List(aDF, bDF), spark).show
Which gives output as -
+----+----+----+-----+
|Name| ID| Sal|Deptt|
+----+----+----+-----+
| A| 1|null| null|
| B| 2|null| null|
| C|null| 1| D1|
| D|null| 2| D2|
+----+----+----+-----+
This function takes in two dataframes (df1 and df2) with different schemas and unions them.
First we need to bring them to the same schema by adding all (missing) columns from df1 to df2 and vice versa. To add a new empty column to a df we need to specify the datatype.
import pyspark.sql.functions as F
def union_different_schemas(df1, df2):
    # Get a list of all column names in both dfs
    columns_df1 = df1.columns
    columns_df2 = df2.columns
    # Get a list of datatypes of the columns
    data_types_df1 = [i.dataType for i in df1.schema.fields]
    data_types_df2 = [i.dataType for i in df2.schema.fields]
    # We go through all columns in df1 and if they are not in df2, we add
    # them (and specify the correct datatype too)
    for col, typ in zip(columns_df1, data_types_df1):
        if col not in df2.columns:
            df2 = df2\
                .withColumn(col, F.lit(None).cast(typ))
    # Now df2 has all missing columns from df1, let's do the same for df1
    for col, typ in zip(columns_df2, data_types_df2):
        if col not in df1.columns:
            df1 = df1\
                .withColumn(col, F.lit(None).cast(typ))
    # Now df1 and df2 have the same columns, not necessarily in the same
    # order, therefore we use unionByName
    combined_df = df1\
        .unionByName(df2)
    return combined_df
PYSPARK
Scala version from Alberto works great. However, if you want to make a for-loop or some dynamic assignment of variables you can face some problems.
Solution comes with Pyspark - clean code:
from pyspark.sql.functions import *
#defining dataframes
df1 = spark.createDataFrame(
[
(1, 'foo','ok'),
(2, 'pro','ok')
],
['id', 'txt','check']
)
df2 = spark.createDataFrame(
[
(3, 'yep',13,'mo'),
(4, 'bro',11,'re')
],
['id', 'txt','value','more']
)
#retrieving columns
cols1 = df1.columns
cols2 = df2.columns
#getting columns from df1 and df2
total = list(set(cols2) | set(cols1))
#defining function for adding nulls (None in case of pyspark)
def addnulls(yourDF):
    for x in total:
        if not x in yourDF.columns:
            yourDF = yourDF.withColumn(x,lit(None))
    return yourDF
df1 = addnulls(df1)
df2 = addnulls(df2)
#additional sorting for correct unionAll (it concatenates DFs by column number)
df1.select(sorted(df1.columns)).unionAll(df2.select(sorted(df2.columns))).show()
+-----+---+----+---+-----+
|check| id|more|txt|value|
+-----+---+----+---+-----+
| ok| 1|null|foo| null|
| ok| 2|null|pro| null|
| null| 3| mo|yep| 13|
| null| 4| re|bro| 11|
+-----+---+----+---+-----+
from functools import reduce
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
def unionAll(*dfs, fill_by=None):
    clmns = {clm.name.lower(): (clm.dataType, clm.name) for df in dfs for clm in df.schema.fields}
    dfs = list(dfs)
    for i, df in enumerate(dfs):
        df_clmns = [clm.lower() for clm in df.columns]
        for clm, (dataType, name) in clmns.items():
            if clm not in df_clmns:
                # Add the missing column
                dfs[i] = dfs[i].withColumn(name, F.lit(fill_by).cast(dataType))
    return reduce(DataFrame.unionByName, dfs)
unionAll(df1, df2).show()
Case-insensitive column matching
Returns the actual column case
Supports the existing datatypes
The default fill value is customizable
Accepts multiple DataFrames at once (e.g. unionAll(df1, df2, df3, ..., df10))
here's another one:
def unite(df1: DataFrame, df2: DataFrame): DataFrame = {
val cols1 = df1.columns.toSet
val cols2 = df2.columns.toSet
val total = (cols1 ++ cols2).toSeq.sorted
val expr1 = total.map(c => {
if (cols1.contains(c)) c else "NULL as " + c
})
val expr2 = total.map(c => {
if (cols2.contains(c)) c else "NULL as " + c
})
df1.selectExpr(expr1:_*).union(
df2.selectExpr(expr2:_*)
)
}
Union and outer union for Pyspark DataFrame concatenation. This works for multiple data frames with different columns.
# imports assumed from the aliases used below (reduce, ps, sqlfunc)
from functools import reduce
import pyspark as ps
import pyspark.sql.functions as sqlfunc
def union_all(*dfs):
    return reduce(ps.sql.DataFrame.unionAll, dfs)
def outer_union_all(*dfs):
    all_cols = set([])
    for df in dfs:
        all_cols |= set(df.columns)
    all_cols = list(all_cols)
    print(all_cols)
    def expr(cols, all_cols):
        def append_cols(col):
            if col in cols:
                return col
            else:
                return sqlfunc.lit(None).alias(col)
        cols_ = map(append_cols, all_cols)
        return list(cols_)
    union_df = union_all(*[df.select(expr(df.columns, all_cols)) for df in dfs])
    return union_df
One more generic method to union list of DataFrame.
def unionFrames(dfs: Seq[DataFrame]): DataFrame = {
dfs match {
case Nil => session.emptyDataFrame // or throw an exception?
case x :: Nil => x
case _ =>
//Preserving Column order from left to right DF's column order
val allColumns = dfs.foldLeft(collection.mutable.ArrayBuffer.empty[String])((a, b) => a ++ b.columns).distinct
val appendMissingColumns = (df: DataFrame) => {
val columns = df.columns.toSet
df.select(allColumns.map(c => if (columns.contains(c)) col(c) else lit(null).as(c)): _*)
}
dfs.tail.foldLeft(appendMissingColumns(dfs.head))((a, b) => a.union(appendMissingColumns(b)))
}
}
This is my pyspark version:
from functools import reduce
from pyspark.sql.functions import lit
def concat(dfs):
    # when the dataframes to combine do not have the same order of columns
    # https://datascience.stackexchange.com/a/27231/15325
    return reduce(lambda df1, df2: df1.union(df2.select(df1.columns)), dfs)
def union_all(dfs):
    columns = reduce(lambda x, y : set(x).union(set(y)), [ i.columns for i in dfs ] )
    for i in range(len(dfs)):
        d = dfs[i]
        for c in columns:
            if c not in d.columns:
                d = d.withColumn(c, lit(None))
        dfs[i] = d
    return concat(dfs)
Alternatively, you could use a full join.
list_of_files = ['test1.parquet', 'test2.parquet']

def merged_frames():
    # Read each file into its own DataFrame
    frames = [spark.read.parquet(path) for path in list_of_files]
    if not frames:
        return None
    # Full-outer-join the frames one by one on the shared key column
    result_df = frames[0]
    for frame in frames[1:]:
        result_df = result_df.join(frame, 'primary_key', how='full')
    return result_df

# display() is the Databricks notebook helper; use .show() elsewhere
display(merged_frames())
If you are loading from files, I guess you could just use the read function with a list of files.
# file_paths is list of files with different schema
df = spark.read.option("mergeSchema", "true").json(file_paths)
The resulting dataframe will have merged columns.

SumProduct in Spark DataFrame

I want to create essentially a sumproduct across columns in a Spark DataFrame. I have a DataFrame that looks like this:
id val1 val2 val3 val4
123 10 5 7 5
I also have a Map that looks like:
val coefficients = Map("val1" -> 1, "val2" -> 2, "val3" -> 3, "val4" -> 4)
I want to take the value in each column of the DataFrame, multiply it by the corresponding value from the map, and return the result in a new column so essentially:
(10*1) + (5*2) + (7*3) + (5*4) = 61
I tried this:
val myDF1 = myDF.withColumn("mySum", {var a:Double = 0.0; for ((k,v) <- coefficients) a + (col(k).cast(DoubleType)*coefficients(k));a})
but got an error saying that the "+" method is overloaded. Even if I solved that, I'm not sure this would work. I could always build a SQL query dynamically as a text string and do it that way, but I was hoping for something a little more elegant.
Any ideas are appreciated.
The problem with your code is that you try to add a Column to a Double. cast(DoubleType) affects only the type of the stored value, not the type of the column itself, so the right-hand side is still a Column. Since Double doesn't provide a +(x: org.apache.spark.sql.Column): org.apache.spark.sql.Column method, everything fails.
To make it work you can for example do something like this:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit}
val df = sc.parallelize(Seq(
(123, 10, 5, 7, 5), (456, 1, 1, 1, 1)
)).toDF("k", "val1", "val2", "val3", "val4")
val coefficients = Map("val1" -> 1, "val2" -> 2, "val3" -> 3, "val4" -> 4)
val dotProduct: Column = coefficients
// To be explicit you can replace
// col(k) * v with col(k) * lit(v)
// but it is not required here
// since we use the Column.* method, not Int.*
.map{ case (k, v) => col(k) * v } // * -> Column.*
.reduce(_ + _) // + -> Column.+
df.withColumn("mySum", dotProduct).show
// +---+----+----+----+----+-----+
// | k|val1|val2|val3|val4|mySum|
// +---+----+----+----+----+-----+
// |123| 10| 5| 7| 5| 61|
// |456| 1| 1| 1| 1| 10|
// +---+----+----+----+----+-----+
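As a quick sanity check on the mySum values in the table above, the same sum of products can be worked out in plain Python, outside Spark. This is only a sketch of the arithmetic; the dict-based rows and the helper name sum_product are stand-ins invented for the example.

```python
# Same weights as the coefficients map in the question.
coefficients = {"val1": 1, "val2": 2, "val3": 3, "val4": 4}

# The two sample rows used in the answer, modelled as plain dicts.
rows = [
    {"k": 123, "val1": 10, "val2": 5, "val3": 7, "val4": 5},
    {"k": 456, "val1": 1, "val2": 1, "val3": 1, "val4": 1},
]


def sum_product(row, coeffs):
    # Multiply each weighted column by its coefficient and add the results.
    return sum(row[name] * weight for name, weight in coeffs.items())


totals = [sum_product(row, coefficients) for row in rows]
print(totals)  # [61, 10]
```

The first total is (10*1) + (5*2) + (7*3) + (5*4) = 61 and the second is 1 + 2 + 3 + 4 = 10, matching the mySum column.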
It looks like the issue is that you aren't actually doing anything with a:
for((k, v) <- coefficients) a + ...
You probably meant a += ...
Also, some advice for cleaning up the block of code inside the withColumn call:
You don't need to call coefficients(k) because you've already got its value in v from for((k,v) <- coefficients)
Scala is pretty good at making one-liners, but it's kinda cheating if you have to put semicolons in that one line :P I'd suggest breaking up the sum calculation section into one line per expression.
The sum expression could be rewritten as a fold which avoids using a var (idiomatic Scala usually avoids vars), e.g.
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.DoubleType
coefficients.foldLeft(lit(0.0)){
case (sumSoFar, (k,v)) => col(k).cast(DoubleType) * v + sumSoFar
}
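For comparison, the question mentions building the expression as a SQL text string as a fallback. A rough sketch of that idea in Python is below. The helper name sumproduct_expr is made up here, and the sketch assumes the column names are plain identifiers that need no quoting.

```python
coefficients = {"val1": 1, "val2": 2, "val3": 3, "val4": 4}


def sumproduct_expr(coeffs):
    # Join "column * weight" terms with " + " to form one SQL expression.
    return " + ".join(f"{name} * {weight}" for name, weight in coeffs.items())


expr_text = sumproduct_expr(coefficients)
print(expr_text)  # val1 * 1 + val2 * 2 + val3 * 3 + val4 * 4
```

With PySpark, a string like this could then be passed along the lines of df.selectExpr("*", expr_text + " AS mySum"). The fold over Column objects avoids the string handling altogether, which is probably why it reads more cleanly.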
I'm not sure if this is possible through the DataFrame API since you are only able to work with columns and not any predefined closures (e.g. your parameter map).
I've outlined a way below using the underlying RDD of the DataFrame:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
// Initializing your input example.
val df1 = sc.parallelize(Seq((123, 10, 5, 7, 5))).toDF("id", "val1", "val2", "val3", "val4")
// Return column names as an array
val names = df1.columns
// Coefficient map from the question
val coefficients = Map("val1" -> 1, "val2" -> 2, "val3" -> 3, "val4" -> 4)
// Grab underlying RDD and zip elements with column names
val rdd1 = df1.rdd.map(row => (0 until row.length).map(row.getInt(_)).zip(names))
// Tack on accumulated total to the existing row
val rdd2 = rdd1.map { seq => Row.fromSeq(seq.map(_._1) :+ seq.map { case (value: Int, name: String) => value * coefficients.getOrElse(name, 0) }.sum) }
// Create output schema (with total)
val totalSchema = StructType(df1.schema.fields :+ StructField("total", IntegerType))
// Apply schema to create output dataframe
val df2 = sqlContext.createDataFrame(rdd2, totalSchema)
// Show output:
df2.show()
...
+---+----+----+----+----+-----+
| id|val1|val2|val3|val4|total|
+---+----+----+----+----+-----+
|123| 10| 5| 7| 5| 61|
+---+----+----+----+----+-----+