Create a new column in a dataset using a case class structure - scala

Assuming the following schema for a table - Places:
root
|-- place_id: string (nullable = true)
|-- street_address: string (nullable = true)
|-- city: string (nullable = true)
|-- state_province: string (nullable = true)
|-- postal_code: string (nullable = true)
|-- country: string (nullable = true)
|-- neighborhood: string (nullable = true)
val places is of type Dataset[Row]
and I have the following case class:
case class csm(
city: Option[String] = None,
stateProvince: Option[String] = None,
country: Option[String] = None
)
How would I go about altering it, or creating a new dataset, with the following schema?
root
|-- place_id: string (nullable = true)
|-- street_address: string (nullable = true)
|-- subpremise: string (nullable = true)
|-- city: string (nullable = true)
|-- state_province: string (nullable = true)
|-- postal_code: string (nullable = true)
|-- country: string (nullable = true)
|-- neighborhood: string (nullable = true)
|-- csm: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state_province: string (nullable = true)
| |-- country: string (nullable = true)
I've been looking into withColumn methods, and they seem to require UDFs. The challenge is that I have to manually specify the columns, which is easy for this use case, but as my problem scales it will be difficult to maintain them by hand.
Used this as a reference: https://intellipaat.com/community/16433/how-to-add-a-new-struct-column-to-a-dataframe

In your case class declaration you have a stateProvince parameter, but in your dataframe the column is called state_province instead.
I'm not sure whether that's a typo, so first, a quick-and-dirty, not-thoroughly-tested camelCase to snake_case converter, just in case:
def normalize(x: String): String =
"([a-z])([A-Z])".r replaceAllIn(x, m => s"${m.group(1)}_${m.group(2).toLowerCase}")
Next, let's get the parameters of a case class:
import spark.implicits._

val case_class_params = Seq[csm]().toDF.columns
And with this, we can now get columns for our case class struct:
import org.apache.spark.sql.functions.{col, struct}

val csm_cols = case_class_params.map(x => col(normalize(x)))
val df2 = df.withColumn("csm", struct(csm_cols: _*))
+--------+--------------+---------+--------------+-----------+------------+------------+----------------------------------------+
|place_id|street_address|city     |state_province|postal_code|country     |neighborhood|csm                                     |
+--------+--------------+---------+--------------+-----------+------------+------------+----------------------------------------+
|123     |str_addr      |some_city|some_province |some_zip   |some_country|NA          |{some_city, some_province, some_country}|
+--------+--------------+---------+--------------+-----------+------------+------------+----------------------------------------+
root
|-- place_id: string (nullable = true)
|-- street_address: string (nullable = true)
|-- city: string (nullable = true)
|-- state_province: string (nullable = true)
|-- postal_code: string (nullable = true)
|-- country: string (nullable = true)
|-- neighborhood: string (nullable = true)
|-- csm: struct (nullable = false)
| |-- city: string (nullable = true)
| |-- state_province: string (nullable = true)
| |-- country: string (nullable = true)
Nested case classes
Seq[csm]().toDF.columns won't give you nested columns. For that some basic schema traversal is required. E.g., one way to do it, adapted from here:
import org.apache.spark.sql.types.StructType

def flatten(schema: StructType): Seq[String] =
  schema.fields.flatMap { field =>
    field.dataType match {
      case structType: StructType => flatten(structType)
      case _ => field.name :: Nil
    }
  }
case class StateProvince(
stateProvince: Option[String] = None,
country: Option[String] = None)
case class csm(
city: Option[String] = None,
state: StateProvince
)
val case_class_params = flatten(Seq[csm]().toDF.schema)
// case_class_params: Seq[String] = ArraySeq(city, stateProvince, country)
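The flattened leaf names can then feed the same struct construction as before. A sketch (it assumes every leaf exists as a top-level snake_case column in the source dataframe):
val csm_cols = flatten(Seq[csm]().toDF.schema).map(x => col(normalize(x)))
val df2 = df.withColumn("csm", struct(csm_cols: _*))
Note this builds a flat struct of the leaf fields; rebuilding the nested StateProvince layout would need the full schema rather than just the leaf names.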

case class Source(
place_id: Option[String],
street_address: Option[String],
city: Option[String],
state_province: Option[String],
postal_code: Option[String],
country: Option[String],
neighborhood: Option[String]
)
case class Csm(
city: Option[String] = None,
stateProvince: Option[String] = None,
country: Option[String] = None
)
case class Result(
place_id: Option[String],
street_address: Option[String],
subpremise: Option[String],
city: Option[String],
state_province: Option[String],
postal_code: Option[String],
country: Option[String],
neighborhood: Option[String],
csm: Csm
)
import spark.implicits._
val sourceDF = Seq(
Source(
Some("s-1-1"),
Some("s-1-2"),
Some("s-1-3"),
Some("s-1-4"),
Some("s-1-5"),
Some("s-1-6"),
Some("s-1-7")
),
Source(
Some("s-2-1"),
Some("s-2-2"),
Some("s-2-3"),
Some("s-2-4"),
Some("s-2-5"),
Some("s-2-6"),
Some("s-2-7")
)
).toDF()
// Option(...) rather than Some(...) so that null source values become None
val resultDF = sourceDF
  .map(r => {
    Result(
      Option(r.getAs[String]("place_id")),
      Option(r.getAs[String]("street_address")),
      Some("set your value"),
      Option(r.getAs[String]("city")),
      Option(r.getAs[String]("state_province")),
      Option(r.getAs[String]("postal_code")),
      Option(r.getAs[String]("country")),
      Option(r.getAs[String]("neighborhood")),
      Csm(
        Option(r.getAs[String]("city")),
        Option(r.getAs[String]("state_province")),
        Option(r.getAs[String]("country"))
      )
    )
  })
  .toDF()
resultDF.printSchema()
// root
// |-- place_id: string (nullable = true)
// |-- street_address: string (nullable = true)
// |-- subpremise: string (nullable = true)
// |-- city: string (nullable = true)
// |-- state_province: string (nullable = true)
// |-- postal_code: string (nullable = true)
// |-- country: string (nullable = true)
// |-- neighborhood: string (nullable = true)
// |-- csm: struct (nullable = true)
// | |-- city: string (nullable = true)
// | |-- stateProvince: string (nullable = true)
// | |-- country: string (nullable = true)
resultDF.show(false)
// +--------+--------------+--------------+-----+--------------+-----------+-------+------------+---------------------+
// |place_id|street_address|subpremise |city |state_province|postal_code|country|neighborhood|csm |
// +--------+--------------+--------------+-----+--------------+-----------+-------+------------+---------------------+
// |s-1-1 |s-1-2 |set your value|s-1-3|s-1-4 |s-1-5 |s-1-6 |s-1-7 |[s-1-3, s-1-4, s-1-6]|
// |s-2-1 |s-2-2 |set your value|s-2-3|s-2-4 |s-2-5 |s-2-6 |s-2-7 |[s-2-3, s-2-4, s-2-6]|
// +--------+--------------+--------------+-----+--------------+-----------+-------+------------+---------------------+

Exception in thread “main” org.apache.spark.sql.AnalysisException: No such struct field startId in _1, _2;

I have 2 case classes as below:
case class IcijAddressRaw(
node_id: Option[Long],
name: Option[String],
address: Option[String],
country_codes: Option[String],
countries: Option[String],
sourceID: Option[String],
valid_until: Option[String],
note: Option[String]
)
case class IcijEdgesRaw(
START_ID: Option[Long],
TYPES: Option[String],
END_ID: Option[Long],
link: Option[String],
start_date: Option[java.sql.Date],
end_date: Option[java.sql.Date],
sourceID: Option[String],
valid_until: Option[String]
)
I join both data sets with the above case class as below:
val addressWithEdgesDS = addressRawDS
.joinWith(edgesRawDS, edgesRawDS("END_ID") === addressRawDS("node_id"), "inner")
val addressGroupDS = addressWithEdgesDS.groupByKey { fullAddress => fullAddress}.mapGroups {
case (startId, fullAddress) =>
(startId, fullAddress.toSeq)
}
def thirdPartyWithAddressDS = thirdPartyDS
.joinWith(addressGroupDS, thirdPartyDS("thirdPartyId") === addressGroupDS("_2.startId"), "left_outer")
.map{
case (thirdParty,null) => thirdParty
case (thirdParty, (thirdPartyId, addressSeq)) => thirdParty.copy(addresses = addressSeq.map(addressCaseClassMap[ThirdPartyCC]))
}
I can build the jar with Gradle without a problem. However, when I run the below script:
./runQSS.sh -s com.quantexa.academytask.etl.projects.icij.CreateCaseClass -e local -r etl.icij -c /home/training-admin/academy-task/application.conf
Error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: No such struct field startId in _1, _2;
I have tried the column name "START_ID" as well as "_2.START_ID", but I get the same error. My understanding is that this joined dataset's struct needs to use _1 or _2 to refer to the left or right dataset's columns. But in this case, it seems to fail to point to the correct column.
My joined schema is as below:
scala> addressWithEdgesDS.printSchema()
root
|-- _1: struct (nullable = false)
| |-- node_id: integer (nullable = true)
| |-- name: string (nullable = true)
| |-- address: string (nullable = true)
| |-- country_codes: string (nullable = true)
| |-- countries: string (nullable = true)
| |-- sourceID: string (nullable = true)
| |-- valid_until: string (nullable = true)
| |-- note: string (nullable = true)
|-- _2: struct (nullable = false)
| |-- START_ID: integer (nullable = true)
| |-- TYPES: string (nullable = true)
| |-- END_ID: integer (nullable = true)
| |-- link: string (nullable = true)
| |-- start_date: string (nullable = true)
| |-- end_date: string (nullable = true)
| |-- sourceID: string (nullable = true)
| |-- valid_until: string (nullable = true)

Reordering fields in nested dataframe

How do I reorder fields in a nested dataframe in Scala?
For example, below are the current and expected schemas:
currently->
root
|-- domain: struct (nullable = false)
| |-- assigned: string (nullable = true)
| |-- core: string (nullable = true)
| |-- createdBy: long (nullable = true)
|-- Event: struct (nullable = false)
| |-- action: string (nullable = true)
| |-- eventid: string (nullable = true)
| |-- dqid: string (nullable = true)
expected->
root
|-- domain: struct (nullable = false)
| |-- core: string (nullable = true)
| |-- assigned: string (nullable = true)
| |-- createdBy: long (nullable = true)
|-- Event: struct (nullable = false)
| |-- dqid: string (nullable = true)
| |-- eventid: string (nullable = true)
| |-- action: string (nullable = true)
You need to define the schema before you read the dataframe.
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("domain", StructType(Array(
    StructField("core", StringType, true),
    StructField("assigned", StringType, true),
    StructField("createdBy", LongType, true))), true),
  StructField("Event", StructType(Array(
    StructField("dqid", StringType, true),
    StructField("eventid", StringType, true),
    StructField("action", StringType, true))), true)))
Now, you can apply this schema while reading your file.
val df = spark.read.schema(schema).json("path/to/json")
Should work with any nested data.
Hope this helps!
The most efficient approach might be to just select the nested elements and wrap them in a couple of structs, as shown below:
import org.apache.spark.sql.functions.struct
import spark.implicits._

case class Domain(assigned: String, core: String, createdBy: Long)
case class Event(action: String, eventid: String, dqid: String)

val df = Seq(
  (Domain("a", "b", 1L), Event("c", "d", "e")),
  (Domain("f", "g", 2L), Event("h", "i", "j"))
).toDF("domain", "event")
val df2 = df.select(
  struct($"domain.core", $"domain.assigned", $"domain.createdBy").as("domain"),
  struct($"event.dqid", $"event.eventid", $"event.action").as("event")
)
df2.printSchema
// root
// |-- domain: struct (nullable = false)
// | |-- core: string (nullable = true)
// | |-- assigned: string (nullable = true)
// | |-- createdBy: long (nullable = true)
// |-- event: struct (nullable = false)
// | |-- dqid: string (nullable = true)
// | |-- eventid: string (nullable = true)
// | |-- action: string (nullable = true)
An alternative would be to apply row-wise map:
import org.apache.spark.sql.Row
val df2 = df.map { case Row(Row(as: String, co: String, cr: Long), Row(ac: String, ev: String, dq: String)) =>
  ((co, as, cr), (dq, ev, ac))
}.toDF("domain", "event")
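If the desired field order is known as data rather than hard-coded, the same select can also be built programmatically. A small sketch (my own generalisation, not part of the original answer; struct column and field names are assumed from the schemas above):
import org.apache.spark.sql.functions.{col, struct}

// Desired field order per top-level struct column
val desiredOrder = Seq(
  "domain" -> Seq("core", "assigned", "createdBy"),
  "event" -> Seq("dqid", "eventid", "action")
)

val reordered = df.select(
  desiredOrder.map { case (parent, fields) =>
    struct(fields.map(f => col(s"$parent.$f")): _*).as(parent)
  }: _*
)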

UnFlatten Dataframe to a specific structure

I have a flat dataframe (df) with the structure as below:
root
|-- first_name: string (nullable = true)
|-- middle_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- title: string (nullable = true)
|-- start_date: string (nullable = true)
|-- end_Date: string (nullable = true)
|-- city: string (nullable = true)
|-- zip_code: string (nullable = true)
|-- state: string (nullable = true)
|-- country: string (nullable = true)
|-- email_name: string (nullable = true)
|-- company: struct (nullable = true)
| |-- org_name: string (nullable = true)
| |-- company_phone: string (nullable = true)
|-- partition_column: string (nullable = true)
And I need to convert this dataframe into a structure like (as my next data will be in this format):
root
|-- firstName: string (nullable = true)
|-- middleName: string (nullable = true)
|-- lastName: string (nullable = true)
|-- currentPosition: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- title: string (nullable = true)
| | |-- startDate: string (nullable = true)
| | |-- endDate: string (nullable = true)
| | |-- address: struct (nullable = true)
| | | |-- city: string (nullable = true)
| | | |-- zipCode: string (nullable = true)
| | | |-- state: string (nullable = true)
| | | |-- country: string (nullable = true)
| | |-- emailName: string (nullable = true)
| | |-- company: struct (nullable = true)
| | | |-- orgName: string (nullable = true)
| | | |-- companyPhone: string (nullable = true)
|-- partitionColumn: string (nullable = true)
So far I have implemented this:
case class IndividualCompany(orgName: String,
companyPhone: String)
case class IndividualAddress(city: String,
zipCode: String,
state: String,
country: String)
case class IndividualPosition(title: String,
startDate: String,
endDate: String,
address: IndividualAddress,
emailName: String,
company: IndividualCompany)
case class Individual(firstName: String,
middleName: String,
lastName: String,
currentPosition: Seq[IndividualPosition],
partitionColumn: String)
val makeCompany = udf((orgName: String, companyPhone: String) => IndividualCompany(orgName, companyPhone))
val makeAddress = udf((city: String, zipCode: String, state: String, country: String) => IndividualAddress(city, zipCode, state, country))
val makePosition = udf((title: String, startDate: String, endDate: String, address: IndividualAddress, emailName: String, company: IndividualCompany)
=> List(IndividualPosition(title, startDate, endDate, address, emailName, company)))
val selectData = df.select(
  col("first_name").as("firstName"),
  col("middle_name").as("middleName"),
  col("last_name").as("lastName"),
  makePosition(
    col("job_title"),
    col("start_date"),
    col("end_Date"),
    makeAddress(
      col("city"),
      col("zip_code"),
      col("state"),
      col("country")),
    col("email_name"),
    makeCompany(
      col("org_name"),
      col("company_phone"))).as("currentPosition"),
  col("partition_column").as("partitionColumn")
).as[Individual]
selectData.printSchema()
selectData.show(10)
I can see a proper schema generated for selectData, but it gives an error on the last line, where I am trying to get some actual data. I am getting an error saying it failed to execute a user defined function:
org.apache.spark.SparkException: Failed to execute user defined function(anonfun$4: (string, string, string, struct<city:string,zipCode:string,state:string,country:string>, string, struct<orgName:string,companyPhone:string>) => array<struct<title:string,startDate:string,endDate:string,address:struct<city:string,zipCode:string,state:string,country:string>,emailName:string,company:struct<orgName:string,companyPhone:string>>>)
Is there any better way to achieve this?
The problem here is that a UDF can't take IndividualAddress and IndividualCompany directly as input. These are represented as structs in Spark, and to use them in a UDF the correct input type is Row. That means you need to change the declaration of makePosition to:
val makePosition = udf((title: String,
startDate: String,
endDate: String,
address: Row,
emailName: String,
company: Row)
Inside the UDF you now need to use e.g. address.getAs[String]("city") to access the struct's elements, and to use the case class as a whole you need to create it again, for example:
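Here is a sketch of that Row-based variant (field names taken from the IndividualAddress and IndividualCompany case classes above; untested against the original data):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

val makePosition = udf((title: String, startDate: String, endDate: String,
                        address: Row, emailName: String, company: Row) =>
  Seq(
    IndividualPosition(
      title,
      startDate,
      endDate,
      // Rebuild the nested case classes from the incoming struct Rows
      IndividualAddress(
        address.getAs[String]("city"),
        address.getAs[String]("zipCode"),
        address.getAs[String]("state"),
        address.getAs[String]("country")),
      emailName,
      IndividualCompany(
        company.getAs[String]("orgName"),
        company.getAs[String]("companyPhone"))
    )
  )
)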
The easier and better alternative would be to do everything in a single UDF as follows:
val makePosition = udf((title: String,
startDate: String,
endDate: String,
city: String,
zipCode: String,
state: String,
country: String,
emailName: String,
orgName: String,
companyPhone: String) =>
Seq(
IndividualPosition(
title,
startDate,
endDate,
IndividualAddress(city, zipCode, state, country),
emailName,
IndividualCompany(orgName, companyPhone)
)
)
)
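Called from the question's select, the nested makeAddress/makeCompany calls then disappear. A sketch (the column references are copied from the question's own select; if org_name and company_phone actually live inside the company struct, reference them as company.org_name and company.company_phone instead):
val selectData = df.select(
  col("first_name").as("firstName"),
  col("middle_name").as("middleName"),
  col("last_name").as("lastName"),
  makePosition(
    col("job_title"), col("start_date"), col("end_Date"),
    col("city"), col("zip_code"), col("state"), col("country"),
    col("email_name"), col("org_name"), col("company_phone")
  ).as("currentPosition"),
  col("partition_column").as("partitionColumn")
).as[Individual]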
I had a similar requirement.
What I did was create a typed user-defined aggregation that produces a List of elements.
import org.apache.spark.sql.{Encoder, TypedColumn}
import org.apache.spark.sql.expressions.Aggregator
import scala.collection.mutable
object ListAggregator {
  private type Buffer[T] = mutable.ListBuffer[T]

  /** Returns a column that aggregates all elements of type T in a List. */
  def create[T](columnName: String)
               (implicit listEncoder: Encoder[List[T]], listBufferEncoder: Encoder[Buffer[T]]): TypedColumn[T, List[T]] =
    new Aggregator[T, Buffer[T], List[T]] {
      override def zero: Buffer[T] =
        mutable.ListBuffer.empty[T]

      override def reduce(buffer: Buffer[T], elem: T): Buffer[T] =
        buffer += elem

      override def merge(b1: Buffer[T], b2: Buffer[T]): Buffer[T] =
        if (b1.length >= b2.length) b1 ++= b2 else b2 ++= b1

      override def finish(reduction: Buffer[T]): List[T] =
        reduction.toList

      override def bufferEncoder: Encoder[Buffer[T]] =
        listBufferEncoder

      override def outputEncoder: Encoder[List[T]] =
        listEncoder
    }.toColumn.name(columnName)
}
Now you can use it like this.
import org.apache.spark.sql.SparkSession
val spark =
SparkSession
.builder
.master("local[*]")
.getOrCreate()
import spark.implicits._
final case class Flat(id: Int, name: String, age: Int)
final case class Grouped(age: Int, users: List[(Int, String)])
val data =
List(
(1, "Luis", 21),
(2, "Miguel", 21),
(3, "Sebastian", 16)
).toDF("id", "name", "age").as[Flat]
val grouped =
data
.groupByKey(flat => flat.age)
.mapValues(flat => (flat.id, flat.name))
.agg(ListAggregator.create(columnName = "users"))
.map(tuple => Grouped(age = tuple._1, users = tuple._2))
// grouped: org.apache.spark.sql.Dataset[Grouped] = [age: int, users: array<struct<_1:int,_2:string>>]
grouped.show(truncate = false)
// +---+------------------------+
// |age|users |
// +---+------------------------+
// |16 |[[3, Sebastian]] |
// |21 |[[1, Luis], [2, Miguel]]|
// +---+------------------------+

Array of String to Array of Struct in Scala + Spark

I am currently using Spark and Scala 2.11.8
I have the following schema:
root
|-- partnumber: string (nullable = true)
|-- brandlabel: string (nullable = true)
|-- availabledate: string (nullable = true)
|-- descriptions: array (nullable = true)
| |-- element: string (containsNull = true)
I am trying to use a UDF to convert it to the following:
root
|-- partnumber: string (nullable = true)
|-- brandlabel: string (nullable = true)
|-- availabledate: string (nullable = true)
|-- description: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- value: string (nullable = true)
| | |-- code: string (nullable = true)
| | |-- cost: integer (nullable = true)
So source data looks like this:
[WrappedArray(a abc 100,b abc 300)]
[WrappedArray(c abc 400)]
I need to use " " (space) as a delimiter, but don't know how to do this in Scala.
def convert(product: Seq[String]): Seq[Row] = {
  ???
}
I am fairly new to Scala, so can someone guide me on how to construct this type of function?
Thanks.
I do not know if I understand your problem right, but map could be your friend.
case class Row(a: String, b: String, c: Int)
val value = List(List("a", "abc", 123), List("b", "bcd", 321))
value map {
  case List(a: String, b: String, c: Int) => Row(a, b, c)
}
If you have to parse it first:
val value2 = List("a b 123", "c d 345")
value2 map { s =>
  val split = s.split(" ")
  Row(split(0), split(1), split(2).toInt)
}
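To apply the same parsing inside Spark itself, the logic can be wrapped in a UDF that returns a Seq of a case class, which Spark encodes as an array of structs. A sketch (it assumes each description string is "value code cost" separated by single spaces; the Description name is my own):
import org.apache.spark.sql.functions.{col, udf}

case class Description(value: String, code: String, cost: Int)

val parseDescriptions = udf { (descriptions: Seq[String]) =>
  descriptions.map { s =>
    val parts = s.split(" ")
    Description(parts(0), parts(1), parts(2).toInt)
  }
}

val converted = df.withColumn("descriptions", parseDescriptions(col("descriptions")))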

scala.MatchError during Spark 2.0.2 DataFrame union

I'm attempting to merge 2 DataFrames, one with old data and one with new data, using the union function. This used to work until I tried to dynamically add a new field to the old DataFrame because my schema is evolving.
This means that my old data will be missing a field and the new data will have it. In order for the union to work, I'm adding the field using the evolveSchema function below.
This resulted in the output/exception I pasted below the code, including my debug prints.
The column ordering and making fields nullable are attempts to fix this issue by making the DataFrames as identical as possible, but it persists. The schema prints show that they are both seemingly identical after these manipulations.
Any help to further debug this would be appreciated.
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{StructField, StructType}
import org.apache.spark.sql.{DataFrame, SQLContext}
object Merger {
def apply(sqlContext: SQLContext, oldDataSet: Option[DataFrame], newEnrichments: Option[DataFrame]): Option[DataFrame] = {
(oldDataSet, newEnrichments) match {
case (None, None) => None
case (None, _) => newEnrichments
case (Some(existing), None) => Some(existing)
case (Some(existing), Some(news)) => Some {
val evolvedOldDataSet = evolveSchema(existing)
println("EVOLVED OLD SCHEMA FIELD NAMES:" + evolvedOldDataSet.schema.fieldNames.mkString(","))
println("NEW SCHEMA FIELD NAMES:" + news.schema.fieldNames.mkString(","))
println("EVOLVED OLD SCHEMA FIELD TYPES:" + evolvedOldDataSet.schema.fields.map(_.dataType).mkString(","))
println("NEW SCHEMA FIELD TYPES:" + news.schema.fields.map(_.dataType).mkString(","))
println("OLD SCHEMA")
existing.printSchema();
println("PRINT EVOLVED OLD SCHEMA")
evolvedOldDataSet.printSchema()
println("PRINT NEW SCHEMA")
news.printSchema()
val nullableEvolvedOldDataSet = setNullableTrue(evolvedOldDataSet)
val nullableNews = setNullableTrue(news)
println("NULLABLE EVOLVED OLD")
nullableEvolvedOldDataSet.printSchema()
println("NULLABLE NEW")
nullableNews.printSchema()
val unionData = nullableEvolvedOldDataSet.union(nullableNews)
val result = unionData.sort(
unionData("timestamp").desc
).dropDuplicates(
Seq("id")
)
result.cache()
}
}
}
def GENRE_FIELD : String = "station_genre"
// Handle missing fields in old data
def evolveSchema(oldDataSet: DataFrame): DataFrame = {
if (!oldDataSet.schema.fieldNames.contains(GENRE_FIELD)) {
val columnAdded = oldDataSet.withColumn(GENRE_FIELD, lit("N/A"))
// Columns should be in the same order for union
val columnNamesInOrder = Seq("id", "station_id", "station_name", "station_timezone", "station_genre", "publisher_id", "publisher_name", "group_id", "group_name", "timestamp")
val reorderedColumns = columnAdded.select(columnNamesInOrder.head, columnNamesInOrder.tail: _*)
reorderedColumns
}
else
oldDataSet
}
def setNullableTrue(df: DataFrame) : DataFrame = {
// get schema
val schema = df.schema
// create new schema with all fields nullable
val newSchema = StructType(schema.map {
case StructField(columnName, dataType, _, metaData) => StructField( columnName, dataType, nullable = true, metaData)
})
// apply new schema
df.sqlContext.createDataFrame( df.rdd, newSchema )
}
}
EVOLVED OLD SCHEMA FIELD NAMES:
id,station_id,station_name,station_timezone,station_genre,publisher_id,publisher_name,group_id,group_name,timestamp
NEW SCHEMA FIELD NAMES:
id,station_id,station_name,station_timezone,station_genre,publisher_id,publisher_name,group_id,group_name,timestamp
EVOLVED OLD SCHEMA FIELD TYPES:
StringType,LongType,StringType,StringType,StringType,LongType,StringType,LongType,StringType,LongType
NEW SCHEMA FIELD TYPES:
StringType,LongType,StringType,StringType,StringType,LongType,StringType,LongType,StringType,LongType
OLD SCHEMA
root
|-- id: string (nullable = true)
|-- station_id: long (nullable = true)
|-- station_name: string (nullable = true)
|-- station_timezone: string (nullable = true)
|-- publisher_id: long (nullable = true)
|-- publisher_name: string (nullable = true)
|-- group_id: long (nullable = true)
|-- group_name: string (nullable = true)
|-- timestamp: long (nullable = true)
PRINT EVOLVED OLD SCHEMA
root
|-- id: string (nullable = true)
|-- station_id: long (nullable = true)
|-- station_name: string (nullable = true)
|-- station_timezone: string (nullable = true)
|-- station_genre: string (nullable = false)
|-- publisher_id: long (nullable = true)
|-- publisher_name: string (nullable = true)
|-- group_id: long (nullable = true)
|-- group_name: string (nullable = true)
|-- timestamp: long (nullable = true)
PRINT NEW SCHEMA
root
|-- id: string (nullable = true)
|-- station_id: long (nullable = true)
|-- station_name: string (nullable = true)
|-- station_timezone: string (nullable = true)
|-- station_genre: string (nullable = true)
|-- publisher_id: long (nullable = true)
|-- publisher_name: string (nullable = true)
|-- group_id: long (nullable = true)
|-- group_name: string (nullable = true)
|-- timestamp: long (nullable = true)
NULLABLE EVOLVED OLD
root
|-- id: string (nullable = true)
|-- station_id: long (nullable = true)
|-- station_name: string (nullable = true)
|-- station_timezone: string (nullable = true)
|-- station_genre: string (nullable = true)
|-- publisher_id: long (nullable = true)
|-- publisher_name: string (nullable = true)
|-- group_id: long (nullable = true)
|-- group_name: string (nullable = true)
|-- timestamp: long (nullable = true)
NULLABLE NEW
root
|-- id: string (nullable = true)
|-- station_id: long (nullable = true)
|-- station_name: string (nullable = true)
|-- station_timezone: string (nullable = true)
|-- station_genre: string (nullable = true)
|-- publisher_id: long (nullable = true)
|-- publisher_name: string (nullable = true)
|-- group_id: long (nullable = true)
|-- group_name: string (nullable = true)
|-- timestamp: long (nullable = true)
2017-01-18 15:59:32 ERROR org.apache.spark.internal.Logging$class
Executor:91 - Exception in task 1.0 in stage 2.0 (TID 4)
scala.MatchError: false (of class java.lang.Boolean) at
org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:296)
at
...
com.companystuff.meta.uploader.Merger$.apply(Merger.scala:49)
...
Caused by: scala.MatchError: false (of class java.lang.Boolean) at
org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:296)
...
It's because of the ordering of columns in the actual data, even though the schema is the same.
So simply select all required columns and then do the union.
Something like this:
val columns: Seq[String] = ....
val df = oldDf.select(columns.head, columns.tail: _*).union(newDf.select(columns.head, columns.tail: _*))
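For reference, one way to build that shared column list in the question's Merger is to take it from the evolved old DataFrame. A sketch (it assumes the new DataFrame contains every one of those columns):
// Hypothetical wiring inside Merger.apply, after evolveSchema has run
val columns: Seq[String] = evolvedOldDataSet.columns.toSeq
val unionData = evolvedOldDataSet
  .select(columns.head, columns.tail: _*)
  .union(news.select(columns.head, columns.tail: _*))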
Hope it helps you