Spark DeltaLake Upsert (merge) is throwing "org.apache.spark.sql.AnalysisException" - scala

In the code below I'm trying to merge a DataFrame into a Delta table.
I join the new DataFrame with the Delta table, transform the joined data to match the Delta table's schema, and then merge it into the Delta table.
But I'm getting an AnalysisException:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#514 missing from _file_name_#872,age#516,id#879,name#636,age#881,name#880,city#882,id#631,_row_id_#866L,city#641 in operator !Join Inner, (id#514 = id#631). Attribute(s) with the same name appear in the operation: id. Please check if the right attribute(s) are used.;;
!Join Inner, (id#514 = id#631)
:- SubqueryAlias deltaData
:  +- Project [id#631, name#636, age#516, city#641]
:     +- Project [age#516, id#631, name#636, new_city#510 AS city#641]
:        +- Project [age#516, id#631, new_name#509 AS name#636, new_city#510]
:           +- Project [age#516, new_id#508 AS id#631, new_name#509, new_city#510]
:              +- Project [age#516, new_id#508, new_name#509, new_city#510]
:                 +- Join Inner, (id#514 = new_id#508)
:                    :- Relation[id#514,name#515,age#516,city#517] parquet
:                    +- LocalRelation [new_id#508, new_name#509, new_city#510]
+- Project [id#879, name#880, age#881, city#882, _row_id_#866L, input_file_name() AS _file_name_#872]
   +- Project [id#879, name#880, age#881, city#882, monotonically_increasing_id() AS _row_id_#866L]
      +- Project [id#854 AS id#879, name#855 AS name#880, age#856 AS age#881, city#857 AS city#882]
         +- Relation[id#854,name#855,age#856,city#857] parquet
My setup is Spark 3.0.0, Delta Lake 0.7.0 and Hadoop 2.7.4.
However, the same code runs fine on the Databricks 7.4 runtime, and the new DataFrame is merged into the Delta table as expected.
Code snippet:
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.{SaveMode, SparkSession}
object CodePen extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()

  val deltaPath = "<delta-path>"
  val oldEmployee = Seq(
    Employee(10, "Django", 22, "Bangalore"),
    Employee(11, "Stephen", 30, "Bangalore"),
    Employee(12, "Calvin", 25, "Hyderabad"))
  val newEmployee = Seq(EmployeeNew(10, "Django", "Bangkok"))

  spark.createDataFrame(oldEmployee).write.format("delta").mode(SaveMode.Overwrite).save(deltaPath) // Saving the data to a delta table
  val newDf = spark.createDataFrame(newEmployee)

  val deltaTable = DeltaTable.forPath(deltaPath)
  val joinedDf = deltaTable.toDF.join(newDf, col("id") === col("new_id"), "inner")
  joinedDf.show()

  val cols = newDf.columns
  // Transforming the joined Dataframe to match the schema of the delta table
  var intDf = joinedDf.drop(cols.map(removePrefix): _*)
  for (column <- newDf.columns)
    intDf = intDf.withColumnRenamed(column, removePrefix(column))
  intDf = intDf.select(deltaTable.toDF.columns.map(col): _*)

  deltaTable.toDF.show()
  intDf.show()

  deltaTable.as("oldData")
    .merge(
      intDf.as("deltaData"),
      col("oldData.id") === col("deltaData.id"))
    .whenMatched()
    .updateAll()
    .execute()

  deltaTable.toDF.show()

  def removePrefix(column: String) = {
    column.replace("new_", "")
  }
}
case class Employee(id: Int, name: String, age: Int, city: String)
case class EmployeeNew(new_id: Int, new_name: String, new_city: String)
Below is the output of the dataframes.
New Dataframe:
+---+------+-------+
| id| name| city|
+---+------+-------+
| 10|Django|Bangkok|
+---+------+-------+
Joined Dataframe:
+---+------+---+---------+------+--------+--------+
| id| name|age| city|new_id|new_name|new_city|
+---+------+---+---------+------+--------+--------+
| 10|Django| 22|Bangalore| 10| Django| Bangkok|
+---+------+---+---------+------+--------+--------+
Delta Table Data:
+---+-------+---+---------+
| id| name|age| city|
+---+-------+---+---------+
| 11|Stephen| 30|Bangalore|
| 12| Calvin| 25|Hyderabad|
| 10| Django| 22|Bangalore|
+---+-------+---+---------+
Transformed New Dataframe:
+---+------+---+-------+
| id| name|age| city|
+---+------+---+-------+
| 10|Django| 22|Bangkok|
+---+------+---+-------+

You are getting this AnalysisException because the schemas of deltaTable and intDf are slightly different:
deltaTable.toDF.printSchema()
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- city: string (nullable = true)
intDf.printSchema()
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- city: string (nullable = true)
Because intDf is the result of a join in which the column "id" is used as the join key, Spark marks that column as non-nullable.
If you change the nullable property as explained here (see also the sketch below), you will get the desired output:
+---+-------+---+---------+
| id| name|age| city|
+---+-------+---+---------+
| 11|Stephen| 30|Bangalore|
| 12| Calvin| 25|Hyderabad|
| 10| Django| 22| Bangkok|
+---+-------+---+---------+
Tested with Spark 3.0.1 and Delta 0.7.0.
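One way to align the nullability (a minimal sketch, not taken verbatim from the linked answer) is to rebuild intDf against the Delta table's schema before the merge; this assumes intDf's columns are already in the same order as the table's:
// Rebuild intDf with the Delta table's schema so the nullable flags match.
val alignedDf = spark.createDataFrame(intDf.rdd, deltaTable.toDF.schema)

deltaTable.as("oldData")
  .merge(
    alignedDf.as("deltaData"),
    col("oldData.id") === col("deltaData.id"))
  .whenMatched()
  .updateAll()
  .execute()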

Related

How to incorporate spark scala map field to BQ?

I am writing Spark Scala code to write the output to BigQuery (BQ). The following code forms the output table, which has two columns (id and keywords):
val df1 = Seq("tamil", "telugu", "hindi").toDF("language")
val df2 = Seq(
  (101, Seq("tamildiary", "tamilkeyboard", "telugumovie")),
  (102, Seq("tamilmovie")),
  (103, Seq("hindirhymes", "hindimovie"))
).toDF("id", "keywords")

val pattern = concat(lit("^"), df1("language"), lit(".*"))

import org.apache.spark.sql.Row
val arrayToMap = udf { (arr: Seq[Row]) =>
  arr.map { case Row(k: String, v: Int) => (k, v) }.toMap
}

val final_df = df2.
  withColumn("keyword", explode($"keywords")).as("df2").
  join(df1.as("df1"), regexp_replace($"df2.keyword", pattern, lit("")) =!= $"df2.keyword").
  groupBy("id", "language").agg(size(collect_list($"language")).as("count")).
  groupBy("id").agg(arrayToMap(collect_list(struct($"language", $"count"))).as("keywords"))
The output of final_df is:
+---+--------------------+
| id| app_language|
+---+--------------------+
|101|Map(tamil -> 2, t...|
|103| Map(hindi -> 2)|
|102| Map(tamil -> 1)|
+---+--------------------+
I am defining the function below to pass the schema for this output table. (Since BQ doesn't support a map field, I am using an array of structs, but this is not working either.)
def createTableIfNotExists(outputTable: String) = {
  spark.createBigQueryTable(
    s"""
       |CREATE TABLE IF NOT EXISTS $outputTable(
       |ds date,
       |id int64,
       |keywords ARRAY<STRUCT<key STRING, value INT64>>
       |)
       |PARTITION BY ds
       |CLUSTER BY user_id
     """.stripMargin)
}
Could anyone please help me write a correct schema for this so that it's compatible with BQ?
You can collect an array of struct as below:
val final_df = df2
  .withColumn("keyword", explode($"keywords")).as("df2")
  .join(df1.as("df1"), regexp_replace($"df2.keyword", pattern, lit("")) =!= $"df2.keyword")
  .groupBy("id", "language")
  .agg(size(collect_list($"language")).as("count"))
  .groupBy("id")
  .agg(collect_list(struct($"language", $"count")).as("app_language"))
final_df.show(false)
+---+-------------------------+
|id |app_language |
+---+-------------------------+
|101|[[tamil, 2], [telugu, 1]]|
|103|[[hindi, 2]] |
|102|[[tamil, 1]] |
+---+-------------------------+
final_df.printSchema
root
|-- id: integer (nullable = false)
|-- app_language: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- language: string (nullable = true)
| | |-- count: integer (nullable = false)
And then you can use a schema like:
def createTableIfNotExists(outputTable: String) = {
  spark.createBigQueryTable(
    s"""
       |CREATE TABLE IF NOT EXISTS $outputTable(
       |ds date,
       |id int64,
       |keywords ARRAY<STRUCT<language STRING, count INT64>>
       |)
       |PARTITION BY ds
       |CLUSTER BY user_id
     """.stripMargin)
}
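If you then write final_df with the open-source spark-bigquery-connector (an assumption on my part, since your spark.createBigQueryTable helper isn't shown), a rough sketch of the write could look like this, adding the ds partition column expected by the DDL:
import org.apache.spark.sql.functions.{col, current_date}

// Hypothetical write path via the spark-bigquery-connector; the table and
// bucket names below are placeholders, adjust them to your project.
final_df
  .withColumn("ds", current_date()) // partition column from the DDL
  .select(col("ds"), col("id"), col("app_language").as("keywords"))
  .write
  .format("bigquery")
  .option("table", "my_dataset.my_output_table")
  .option("temporaryGcsBucket", "my-temp-gcs-bucket") // needed for indirect writes
  .mode("append")
  .save()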

Read csv into Dataframe with nested column

I have a csv file like this:
weight,animal_type,animal_interpretation
20,dog,"{is_large_animal=true, is_mammal=true}"
3.5,cat,"{is_large_animal=false, is_mammal=true}"
6.00E-04,ant,"{is_large_animal=false, is_mammal=false}"
And I created a case class schema as follows:
package types

case class AnimalsType (
  weight: Option[Double],
  animal_type: Option[String],
  animal_interpretation: Option[AnimalInterpretation]
)

case class AnimalInterpretation (
  is_large_animal: Option[Boolean],
  is_mammal: Option[Boolean]
)
I tried to load the csv into a dataframe with:
var df = spark.read.format("csv").option("header", "true").load("src/main/resources/animals.csv").as[AnimalsType]
But got the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Can't extract value from animal_interpretation#12: need struct type but got string;
Am I doing something wrong? What would be the proper way of doing this?
You cannot assign that schema to the CSV directly. You need to transform the CSV string column (animal_interpretation) into JSON format, as I have done in the code below using a UDF. If your input data already arrives in the shape of df1, then the UDF is not needed and you can continue from df1 to get the final dataframe df2.
There is no need for any case class, since the CSV header already provides the column names; for the JSON data you only need to declare the schema AnimalInterpretationSch as below:
scala> import org.apache.spark.sql.types._
scala> import org.apache.spark.sql.expressions.UserDefinedFunction
//Input CSV DataFrame
scala> df.show(false)
+--------+-----------+---------------------------------------+
|weight |animal_type|animal_interpretation |
+--------+-----------+---------------------------------------+
|20 |dog |{is_large_animal=true, is_mammal=true} |
|3.5 |cat |{is_large_animal=false, is_mammal=true}|
|6.00E-04|ant |{is_large_animal=false,is_mammal=false}|
+--------+-----------+---------------------------------------+
//UDF to convert "animal_interpretation" column to Json Format
scala> def StringToJson:UserDefinedFunction = udf((data:String,JsonColumn:String) => {
| var out = data
| val JsonColList = JsonColumn.trim.split(",").toList
| JsonColList.foreach{ rr =>
| out = out.replaceAll(rr, "'"+rr+"'")
| }
| out = out.replaceAll("=", ":")
| out
| })
//All keys present in the Json string
scala> val JsonCol = "is_large_animal,is_mammal"
//New dataframe with Json format
scala> val df1 = df.withColumn("animal_interpretation", StringToJson(col("animal_interpretation"), lit(JsonCol)))
scala> df1.show(false)
+--------+-----------+-------------------------------------------+
|weight |animal_type|animal_interpretation |
+--------+-----------+-------------------------------------------+
|20 |dog |{'is_large_animal':true, 'is_mammal':true} |
|3.5 |cat |{'is_large_animal':false, 'is_mammal':true}|
|6.00E-04|ant |{'is_large_animal':false,'is_mammal':false}|
+--------+-----------+-------------------------------------------+
//Schema declaration for the Json format
scala> val AnimalInterpretationSch = new StructType().add("is_large_animal", BooleanType).add("is_mammal", BooleanType)
//Accessing Json columns
scala> val df2 = df1.select(col("weight"), col("animal_type"),from_json(col("animal_interpretation"), AnimalInterpretationSch).as("jsondata")).select("weight", "animal_type", "jsondata.*")
scala> df2.printSchema
root
|-- weight: string (nullable = true)
|-- animal_type: string (nullable = true)
|-- is_large_animal: boolean (nullable = true)
|-- is_mammal: boolean (nullable = true)
scala> df2.show()
+--------+-----------+---------------+---------+
| weight|animal_type|is_large_animal|is_mammal|
+--------+-----------+---------------+---------+
| 20| dog| true| true|
| 3.5| cat| false| true|
|6.00E-04| ant| false| false|
+--------+-----------+---------------+---------+
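As a possible follow-up (not part of the answer above): if you still want the typed Dataset[AnimalsType] from your question, a minimal sketch would cast weight and fold the two boolean columns back into a struct. It assumes spark.implicits._ is in scope (as in the shell) and that the case classes from the question are defined:
import org.apache.spark.sql.functions.{col, struct}

// Cast weight to double and rebuild the nested struct so its field names
// line up with AnimalInterpretation, then map to the case class by name.
val typedDs = df2
  .select(
    col("weight").cast("double").as("weight"),
    col("animal_type"),
    struct(col("is_large_animal"), col("is_mammal")).as("animal_interpretation"))
  .as[AnimalsType]

typedDs.show(false)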

create empty array-column of given schema in Spark

Because Parquet cannot persist empty arrays, I replaced empty arrays with null before writing a table. Now, as I read the table, I want to do the opposite:
I have a DataFrame with the following schema :
|-- id: long (nullable = false)
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
and the following content:
+---+-----------+
| id| arr|
+---+-----------+
| 1|[[1.0,2.0]]|
| 2| null|
+---+-----------+
I'd like to replace the null-array (id=2) with an empty array, i.e.
+---+-----------+
| id| arr|
+---+-----------+
| 1|[[1.0,2.0]]|
| 2| []|
+---+-----------+
I've tried:
val arrSchema = df.schema(1).dataType

df
  .withColumn("arr", when($"arr".isNull, array().cast(arrSchema)).otherwise($"arr"))
  .show()
which gives:
java.lang.ClassCastException: org.apache.spark.sql.types.NullType$ cannot be cast to org.apache.spark.sql.types.StructType
Edit: I don't want to "hardcode" the schema of my array column (at least not the schema of the struct), because it can vary from case to case; I can only use the schema information from df at runtime.
By the way, I'm using Spark 2.1, therefore I cannot use typedLit.
Spark 2.2+ with known external type
In general you can use typedLit to provide empty arrays.
import org.apache.spark.sql.functions.typedLit
typedLit(Seq.empty[(Double, Double)])
To use specific names for nested objects you can use case classes:
case class Item(x: Double, y: Double)
typedLit(Seq.empty[Item])
or rename by cast:
typedLit(Seq.empty[(Double, Double)])
.cast("array<struct<x: Double, y: Double>>")
Spark 2.1+ with schema only
With schema only you can try:
val schema = StructType(Seq(
  StructField("arr", StructType(Seq(
    StructField("x", DoubleType),
    StructField("y", DoubleType)
  )))
))

def arrayOfSchema(schema: StructType) =
  from_json(lit("""{"arr": []}"""), schema)("arr")

arrayOfSchema(schema).alias("arr")
where schema can be extracted from the existing DataFrame and wrapped with additional StructType:
StructType(Seq(
  StructField("arr", df.schema("arr").dataType)
))
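Putting the two pieces together, a minimal sketch of the schema-only approach (assuming spark.implicits._, org.apache.spark.sql.functions._ and org.apache.spark.sql.types._ are imported):
// Wrap the existing array type in a single-field StructType, build an empty
// array of that type with from_json, and use it wherever "arr" is null.
val wrappedSchema = StructType(Seq(StructField("arr", df.schema("arr").dataType)))

df
  .withColumn("arr", coalesce($"arr", arrayOfSchema(wrappedSchema)))
  .show()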
One way is to use a UDF:
val arrSchema = df.schema(1).dataType // ArrayType(StructType(StructField(x,DoubleType,true), StructField(y,DoubleType,true)),true)
val emptyArr = udf(() => Seq.empty[Any],arrSchema)
df
.withColumn("arr",when($"arr".isNull,emptyArr()).otherwise($"arr"))
.show()
+---+-----------+
| id| arr|
+---+-----------+
| 1|[[1.0,2.0]]|
| 2| []|
+---+-----------+
Another approach would be to use coalesce:
val df = Seq(
  (Some(1), Some(Array((1.0, 2.0)))),
  (Some(2), None)
).toDF("id", "arr")

df.withColumn("arr", coalesce($"arr", typedLit(Array.empty[(Double, Double)]))).
  show
// +---+-----------+
// | id| arr|
// +---+-----------+
// | 1|[[1.0,2.0]]|
// | 2| []|
// +---+-----------+
UDF with case class could also be interesting:
case class Item(x: Double, y: Double)
val udf_emptyArr = udf(() => Seq[Item]())
df
.withColumn("arr",coalesce($"arr",udf_emptyArr()))
.show()

Strange behavior when casting array of structs in spark scala

I'm encountering a strange behavior using spark 2.1.1 and scala 2.11.8:
import spark.implicits._

val df = Seq(
  (1, Seq(("a", "b"))),
  (2, Seq(("c", "d")))
).toDF("id", "data")

df.show(false)
df.printSchema()
+---+-------+
|id |data |
+---+-------+
|1 |[[a,b]]|
|2 |[[c,d]]|
+---+-------+
root
|-- id: integer (nullable = false)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: string (nullable = true)
| | |-- _2: string (nullable = true)
Now I want to rename my struct fields as suggested in https://stackoverflow.com/a/39781382/1138523
df
.select($"id",$"data".cast("array<struct<k:string,v:string>>"))
.show()
Which results in the correct schema, but the content of the dataframe is now:
+---+-------+
| id| data|
+---+-------+
| 1|[[c,d]]|
| 2|[[c,d]]|
+---+-------+
Both rows now show the same array. What am I doing wrong?
EDIT: In spark 2.1.2 (and also spark 2.3.0) I get the expected output. I also get the expected output if I cache the dataframe:
val df = Seq(
(1,Seq(("a","b"))),
(2,Seq(("c","d")))
).toDF("id","data")
.cache
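For completeness, a minimal sketch of the cache workaround mentioned in the edit (caching before the cast that renames the struct fields):
// Cache the DataFrame first, then apply the renaming cast as before.
val cached = Seq(
  (1, Seq(("a", "b"))),
  (2, Seq(("c", "d")))
).toDF("id", "data").cache()

cached
  .select($"id", $"data".cast("array<struct<k:string,v:string>>"))
  .show(false)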

Cannot resolve column (numeric column name) in Spark Dataframe

This is my data:
scala> data.printSchema
root
|-- 1.0: string (nullable = true)
|-- 2.0: string (nullable = true)
|-- 3.0: string (nullable = true)
This doesn't work :(
scala> data.select("2.0").show
Exception:
org.apache.spark.sql.AnalysisException: cannot resolve '`2.0`' given input columns: [1.0, 2.0, 3.0];;
'Project ['2.0]
+- Project [_1#5608 AS 1.0#5615, _2#5609 AS 2.0#5616, _3#5610 AS 3.0#5617]
+- LocalRelation [_1#5608, _2#5609, _3#5610]
...
Try this at home (I'm running on the shell v_2.1.0.5)!
val data = spark.createDataFrame(Seq(
("Hello", ", ", "World!")
)).toDF("1.0", "2.0", "3.0")
data.select("2.0").show
You can use backticks to escape the dot, which is otherwise reserved for accessing fields of struct-type columns:
data.select("`2.0`").show
+---+
|2.0|
+---+
| , |
+---+
The problem is that you cannot reference a column name containing a dot directly when selecting from the DataFrame. You can have a look at this question, which is similar.
val data = spark.createDataFrame(Seq(
("Hello", ", ", "World!")
)).toDF("1.0", "2.0", "3.0")
data.select(sanitize("2.0")).show
def sanitize(input: String): String = s"`$input`"
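Another option (not from the answers above) is to rename the columns up front so the dots never need escaping; a small sketch:
// Hypothetical alternative: replace dots in every column name once,
// then reference the columns without backticks.
val renamed = data.toDF(data.columns.map(_.replace(".", "_")): _*)

renamed.select("2_0").show()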