How to extract values from JSON-encoded column? [duplicate] - scala

This question already has answers here:
Querying Spark SQL DataFrame with complex types
(3 answers)
Closed 4 years ago.
I have a Spark DataFrame with the following JSON content.
[{
  "map1": {
    "completed-stages": 1,
    "total-stages": 1
  },
  "rec": "test-plan",
  "status": {
    "state": "SUCCESS"
  }
},
{
  "map2": {
    "completed-stages": 1,
    "total-stages": 1
  },
  "rec": "test-proc",
  "status": {
    "state": "FAILED"
  }
}]
I want to transform it into another DF having the following schema
[{"rec": "test-plan", "status": "SUCCESS"}, {"rec": "test-pROC", "status": "FAILED"}]
I have written the following code, but it doesn't compile and complains of wrong encoding.
val fdf = DF.map(f => {
  val listCommands = f.get(0).asInstanceOf[WrappedArray[Map[String, Any]]]
  val m = listCommands.map(h => {
    var rec = "none"
    var status = "none"
    if (h.exists("status" == "state" -> _)) {
      status = h.get("status") match {
        case Some(x) => x.asInstanceOf[HashMap[String, String]].getOrElse("state", "none")
        case _ => "none"
      }
      if (h.contains("rec")) {
        rec = h.get("rec") match {
          case Some(x: String) => x
          case _ => "none"
        }
      }
    }
    Map("status" -> status, "rec" -> rec)
  })
  val rm = m.flatten
  rm
})
Please suggest the right way.

That's going to be tricky since the top-level elements of the JSONs are not the same, i.e. you have map1 and map2, and hence the schema is inconsistent. I'd speak to the "data producer" and request a change so the name of the command is described by a separate element.
Given the schema of the DataFrame is as follows:
scala> commands.printSchema
root
|-- commands: array (nullable = true)
| |-- element: string (containsNull = true)
and the number of elements (rows) in it:
scala> commands.count
res1: Long = 1
You have to explode the commands array of elements first followed by accessing the fields of interest.
// 1. Explode the array
val commandsExploded = commands.select(explode($"commands") as "command")
scala> commandsExploded.count
res2: Long = 2
Let's create the schema of the JSON-encoded records. One could be as follows.
// Note that it accepts map1 and map2 fields
import org.apache.spark.sql.types._
val schema = StructType(
  StructField("map1",
    StructType(
      StructField("completed-stages", LongType, true) ::
      StructField("total-stages", LongType, true) :: Nil), true) ::
  StructField("map2",
    StructType(
      StructField("completed-stages", LongType, true) ::
      StructField("total-stages", LongType, true) :: Nil), true) ::
  StructField("rec", StringType, true) ::
  StructField("status",
    StructType(
      StructField("state", StringType, true) :: Nil), true) :: Nil)
With that, you can use the from_json standard function, which takes a column of JSON-encoded strings and a schema.
val commands = commandsExploded.select(from_json($"command", schema) as "command")
scala> commands.show(truncate = false)
+-------------------------------+
|command |
+-------------------------------+
|[[1, 1],, test-plan, [SUCCESS]]|
|[, [1, 1], test-proc, [FAILED]]|
+-------------------------------+
Let's have a look at the schema of the commands dataset.
scala> commands.printSchema
root
|-- command: struct (nullable = true)
| |-- map1: struct (nullable = true)
| | |-- completed-stages: long (nullable = true)
| | |-- total-stages: long (nullable = true)
| |-- map2: struct (nullable = true)
| | |-- completed-stages: long (nullable = true)
| | |-- total-stages: long (nullable = true)
| |-- rec: string (nullable = true)
| |-- status: struct (nullable = true)
| | |-- state: string (nullable = true)
The complex fields like rec and status are structs that are .-accessible.
val recs = commands.select(
$"command.rec" as "rec",
$"command.status.state" as "status")
scala> recs.show
+---------+-------+
| rec| status|
+---------+-------+
|test-plan|SUCCESS|
|test-proc| FAILED|
+---------+-------+
Converting it to a single-record JSON-encoded dataset requires Dataset.toJSON followed by the collect_list standard function.
val result = recs.toJSON.agg(collect_list("value"))
scala> result.show(truncate = false)
+-------------------------------------------------------------------------------+
|collect_list(value) |
+-------------------------------------------------------------------------------+
|[{"rec":"test-plan","status":"SUCCESS"}, {"rec":"test-proc","status":"FAILED"}]|
+-------------------------------------------------------------------------------+

You didn't provide the schema for the df, so the below might not work for you.
I saved the JSON sample in a test.json file and read it with val df = spark.read.option("multiLine", true).json("test.json"), in which case, to get the JSON you want, you just run df.select($"rec", $"status.state").write.json("test1.json").
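For completeness, here is a minimal sketch of that approach (the file names test.json and test1.json are just placeholders, and the alias on state is my addition so the output field is called status):
import spark.implicits._  // for the $"..." column syntax outside spark-shell
// Read the multi-line JSON sample and keep only the two fields of interest.
val df = spark.read.option("multiLine", true).json("test.json")
df.select($"rec", $"status.state" as "status").write.json("test1.json")
// test1.json then contains records like {"rec":"test-plan","status":"SUCCESS"}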

Related

Convert a json string to array of key-value pairs in Spark scala

I have a JSON string that I load into a Spark DataFrame. The JSON string can have between 0 and 3 key-value pairs.
When more than one kv pair is sent, product_facets is correctly formatted as an array, like below:
{"id":1,
"productData":{
"product":{
"product_name":"xyz",
"product_facets":{"entry":[{"key":"test","value":"success"}, {"key": "test2","value" : "fail"}]}
}}}
I can now use the explode function:
sourceDF.filter($"someKey".contains("some_string"))
.select($"id", explode($"productData.product.product_facets.entry") as "kvPairs")
However, when only one key-value pair is sent, the source JSON string for entry is not formatted as an array with square brackets:
{"id":1,
"productData":{
"product":{
"product_name":"xyz",
"product_facets":{"entry":{"key":"test","value":"success"}}
}}}
The schema for product tag looks like:
| |-- product: struct (nullable = true)
| | |-- product_facets: struct (nullable = true)
| | | |-- entry: string (nullable = true)
| | |-- product_name: string (nullable = true)
How can I change entry to an array of key-value pairs that is compatible with the explode function? My end goal is to pivot the keys into individual columns, and I want to use group by after exploding the kv pairs. I tried using from_json but could not get it to work:
val schema =
  StructType(Seq(
    StructField("entry", ArrayType(
      StructType(Seq(
        StructField("key", StringType),
        StructField("value", StringType)
      ))
    ))
  ))
sourceDF.filter($"someKey".contains("some_string"))
.select($"id", from_json($"productData.product.product_facets.entry", schema) as "kvPairsFromJson")
But the above creates a new column kvPairsFromJson that looks like "[]", and using explode on it does nothing.
Any pointers on what's going on, or if there is a better way to do this?
I think one approach could be:
1. Create a udf which takes the entry value as a JSON string and converts it to List(Tuple(K, V)).
2. In the udf, check whether the entry value is an array or not and do the conversion accordingly.
The code below illustrates this approach:
// one row where entry is an array and the other is not
val ds = Seq("""{"id":1,"productData":{"product":{"product_name":"xyz","product_facets":{"entry":[{"key":"test","value":"success"},{"key":"test2","value":"fail"}]}}}}""", """{"id":2,"productData":{"product":{"product_name":"xyz","product_facets":{"entry":{"key":"test","value":"success"}}}}}""").toDS
val df = spark.read.json(ds)
// Schema used by udf to generate output column
import org.apache.spark.sql.types._
val outputSchema = ArrayType(StructType(Seq(
StructField("key", StringType, false),
StructField("value", StringType, false)
)))
// Converts non-array entry value to array
import org.apache.spark.sql.functions.{col, explode, udf}
val toArray = udf((json: String) => {
  import com.fasterxml.jackson.databind._
  import com.fasterxml.jackson.module.scala.DefaultScalaModule
  val jsonMapper = new ObjectMapper()
  jsonMapper.registerModule(DefaultScalaModule)
  if (!json.startsWith("[")) {
    val jsonMap = jsonMapper.readValue(json, classOf[Map[String, String]])
    List((jsonMap("key"), jsonMap("value")))
  } else {
    jsonMapper.readValue(json, classOf[List[Map[String, String]]]).map(f => (f("key"), f("value")))
  }
}, outputSchema)
val arrayResult = df.select(col("id").as("id"), toArray(col("productData.product.product_facets.entry")).as("entry"))
val arrayExploded = df.select(col("id").as("id"), explode(toArray(col("productData.product.product_facets.entry"))).as("entry"))
val explodedToCols = df.select(col("id").as("id"), explode(toArray(col("productData.product.product_facets.entry"))).as("entry")).select(col("id"), col("entry.key").as("key"), col("entry.value").as("value"))
Results in:
scala> arrayResult.printSchema
root
|-- id: long (nullable = true)
|-- entry: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = false)
| | |-- value: string (nullable = false)
scala> arrayExploded.printSchema
root
|-- id: long (nullable = true)
|-- entry: struct (nullable = true)
| |-- key: string (nullable = false)
| |-- value: string (nullable = false)
scala> arrayResult.show(false)
+---+--------------------------------+
|id |entry |
+---+--------------------------------+
|1 |[[test, success], [test2, fail]]|
|2 |[[test, success]] |
+---+--------------------------------+
scala> arrayExploded.show(false)
+---+---------------+
|id |entry |
+---+---------------+
|1 |[test, success]|
|1 |[test2, fail] |
|2 |[test, success]|
+---+---------------+
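Since the stated end goal was to pivot the keys into individual columns, one possible follow-up on top of explodedToCols could look like this (a sketch, not part of the original answer):
import org.apache.spark.sql.functions.first
// One row per id, one column per distinct key, taking the first value seen for each key.
val pivoted = explodedToCols.groupBy("id").pivot("key").agg(first("value"))
pivoted.show(false)
// expected: id=1 -> test=success, test2=fail; id=2 -> test=success, test2=null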

How to change the datatype of a column in StructField of a StructType?

I am trying to change the datatype of a column present in a dataframe that I am reading from an RDBMS database.
To do that, I got the schema of the dataframe in the below way:
val dataSchema = dataDF.schema
To see the schema of the dataframe, I used the below statement:
println(dataSchema)
Output: StructType(StructField(je_header_id,LongType,true), StructField(je_line_num,LongType,true), StructField(last_update_date,TimestampType,true), StructField(last_updated_by,DecimalType(15,0),true), StructField(creation_date,TimestampType,true), StructField(created_by,DecimalType(15,0),true), StructField(created_by_name,StringType,true), StructField(entered_dr,DecimalType(38,30),true), StructField(entered_cr,DecimalType(38,30),true))
My requirement is to find the DecimalType columns in the above schema and change them to DoubleType.
I can get the column names and datatypes using dataDF.dtypes, but it gives me the datatypes in the format ((columnName1, column datatype), (columnName2, column datatype), ..., (columnNameN, column datatype)).
I have been trying to find a way to parse the StructType and change the schema in dataSchema, but in vain.
Could anyone let me know if there is a way to parse the StructType so that I can change the datatype to my requirement and get in the below format
StructType(StructField(je_header_id,LongType,true), StructField(je_line_num,LongType,true), StructField(last_update_date,TimestampType,true), StructField(last_updated_by,DoubleType,true), StructField(creation_date,TimestampType,true), StructField(created_by,DoubleType,true), StructField(created_by_name,StringType,true), StructField(entered_dr,DoubleType,true), StructField(entered_cr,DoubleType,true))
To modify a DataFrame Schema specific to a given data type, you can pattern-match against StructField's dataType, as shown below:
import org.apache.spark.sql.types._
val df = Seq(
(1L, BigDecimal(12.34), "a", BigDecimal(10.001)),
(2L, BigDecimal(56.78), "b", BigDecimal(20.002))
).toDF("c1", "c2", "c3", "c4")
val newSchema = df.schema.fields.map {
  case StructField(name, _: DecimalType, nullable, _) =>
    StructField(name, DoubleType, nullable)
  case field => field
}
// newSchema: Array[org.apache.spark.sql.types.StructField] = Array(
// StructField(c1,LongType,false), StructField(c2,DoubleType,true),
// StructField(c3,StringType,true), StructField(c4,DoubleType,true)
// )
However, assuming your end-goal is to transform the dataset with the column type change, it would be easier to just traverse the columns for the targeted data type to iteratively cast them, like below:
import org.apache.spark.sql.functions._
val df2 = df.dtypes.
collect{ case (dn, dt) if dt.startsWith("DecimalType") => dn }.
foldLeft(df)((accDF, c) => accDF.withColumn(c, col(c).cast("Double")))
df2.printSchema
// root
// |-- c1: long (nullable = false)
// |-- c2: double (nullable = true)
// |-- c3: string (nullable = true)
// |-- c4: double (nullable = true)
[UPDATE]
Per additional requirement from comments, if you want to change schema only for DecimalType with positive scale, just apply a Regex pattern-match as the guard condition in method collect:
val pattern = """DecimalType\(\d+,(\d+)\)""".r
val df2 = df.dtypes.
collect{ case (dn, dt) if pattern.findFirstMatchIn(dt).map(_.group(1)).getOrElse("0") != "0" => dn }.
foldLeft(df)((accDF, c) => accDF.withColumn(c, col(c).cast("Double")))
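As a quick sanity check of that guard (a hypothetical REPL snippet, not part of the original answer):
// DecimalType(38,18) has a positive scale, so its column gets cast;
// DecimalType(15,0) has scale 0, so it is left as-is.
pattern.findFirstMatchIn("DecimalType(38,18)").map(_.group(1))  // Some(18)
pattern.findFirstMatchIn("DecimalType(15,0)").map(_.group(1))   // Some(0)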
Here is another way:
data.show(false)
data.printSchema
+----+------------------------+----+----------------------+
|col1|col2 |col3|col4 |
+----+------------------------+----+----------------------+
|1 |0.003200000000000000 |a |23.320000000000000000 |
|2 |78787.990030000000000000|c |343.320000000000000000|
+----+------------------------+----+----------------------+
root
|-- col1: integer (nullable = false)
|-- col2: decimal(38,18) (nullable = true)
|-- col3: string (nullable = true)
|-- col4: decimal(38,18) (nullable = true)
Create a schema that you want:
Example:
val newSchema = StructType(
Seq(
StructField("col1", StringType, true),
StructField("col2", DoubleType, true),
StructField("col3", StringType, true),
StructField("col4", DoubleType, true)
)
)
Cast the columns to the required datatype.
val newDF = data.selectExpr(newSchema.map(
col => s"CAST ( ${col.name} As ${col.dataType.sql}) ${col.name}"
): _*)
newDF.printSchema
root
|-- col1: string (nullable = false)
|-- col2: double (nullable = true)
|-- col3: string (nullable = true)
|-- col4: double (nullable = true)
newDF.show(false)
+----+-----------+----+------+
|col1|col2 |col3|col4 |
+----+-----------+----+------+
|1 |0.0032 |a |23.32 |
|2 |78787.99003|c |343.32|
+----+-----------+----+------+
The accepted solution works, but it is very costly because of the overhead of withColumn: the analyzer has to re-analyze the whole DataFrame for each withColumn call, which gets expensive with a large number of columns. I would rather suggest doing this:
// Collect the names of all DecimalType columns, paired with the target type.
val transformedColumns = inputDataDF.dtypes
  .collect {
    case (dn, dt) if dt.startsWith("DecimalType") => (dn, DoubleType)
  }
// Note: this select keeps only the transformed columns; include the remaining
// columns in the select if you need the full DataFrame.
val transformedDF = inputDataDF.select(transformedColumns.map { fieldType =>
  inputDataDF(fieldType._1).cast(fieldType._2)
}: _*)
For a very small dataset, the withColumn approach took over a minute on my machine, while the select approach took about 100 ms.
You can read more about the cost of withColumn here: https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015

Spark - copy a field using df.schema.copy functions for another dataframe

I need to create a schema using existing df field.
Consider this example dataframe
scala> case class prd (a:Int, b:Int)
defined class prd
scala> val df = Seq((Array(prd(10,20),prd(15,30),prd(20,25)))).toDF("items")
df: org.apache.spark.sql.DataFrame = [items: array<struct<a:int,b:int>>]
scala> df.printSchema
root
|-- items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = false)
| | |-- b: integer (nullable = false)
I need one more field "items_day1" similar to "items" for df2. Right now, I'm doing it like below which is a workaround
scala> val df2=df.select('items,'items.as("item_day1"))
df2: org.apache.spark.sql.DataFrame = [items: array<struct<a:int,b:int>>, item_day1: array<struct<a:int,b:int>>]
scala> df2.printSchema
root
|-- items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = false)
| | |-- b: integer (nullable = false)
|-- item_day1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = false)
| | |-- b: integer (nullable = false)
scala>
But how can I get that using the df.schema.add() or df.schema.copy() functions?
EDIT1:
I'm trying it like below:
val (a,b) = (df.schema,df.schema) // works
a("items") //works
b.add(a("items").as("items_day1")) //Error..
To add a new field to your DataFrame schema (which is a StructType) that has the same structure as an existing field but a different top-level name, you can copy the existing StructField with a modified name member, as shown below:
import org.apache.spark.sql.types._
case class prd (a:Int, b:Int)
val df = Seq((Array(prd(10,20), prd(15,30), prd(20,25)))).toDF("items")
val schema = df.schema
// schema: org.apache.spark.sql.types.StructType = StructType(
// StructField(items, ArrayType(
// StructType(StructField(a,IntegerType,false), StructField(b,IntegerType,false)
// ), true), true)
// )
val newSchema = schema.find(_.name == "items") match {
case Some(field) => schema.add(field.copy(name = "items_day1"))
case None => schema
}
// newSchema: org.apache.spark.sql.types.StructType = StructType(
// StructField(items, ArrayType(
// StructType(StructField(a,IntegerType,false), StructField(b,IntegerType,false)
// ), true), true),
// StructField(items_day1, ArrayType(
// StructType(StructField(a,IntegerType,false), StructField(b,IntegerType,false)
// ), true), true)
// )
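To actually materialize a DataFrame with the duplicated column from that newSchema, one option is the following sketch (it assumes items_day1 should simply mirror items, just like the select-based workaround above):
import org.apache.spark.sql.Row
// Append a copy of the "items" value to each row and apply the extended schema.
val df2 = spark.createDataFrame(
  df.rdd.map(row => Row.fromSeq(row.toSeq :+ row.get(0))),
  newSchema
)
df2.printSchema  // shows both items and items_day1 with identical element types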

Get element of Struct Type for a row in scala using any function

I am doing some calculations at the row level in Scala/Spark. I have a dataframe created using the JSON below:
{"available":false,"createTime":"2016-01-08","dataValue":{"names_source":{"first_names":["abc", "def"],"last_names_id":[123,456]},"another_source_array":[{"first":"1.1","last":"ONE"}],"another_source":"TableSources","location":"GMP", "timestamp":"2018-02-11"},"deleteTime":"2016-01-08"}
You can create a dataframe using this JSON directly. My schema looks like below:
root
|-- available: boolean (nullable = true)
|-- createTime: string (nullable = true)
|-- dataValue: struct (nullable = true)
| |-- another_source: string (nullable = true)
| |-- another_source_array: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- first: string (nullable = true)
| | | |-- last: string (nullable = true)
| |-- location: string (nullable = true)
| |-- names_source: struct (nullable = true)
| | |-- first_names: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- last_names_id: array (nullable = true)
| | | |-- element: long (containsNull = true)
| |-- timestamp: string (nullable = true)
|-- deleteTime: string (nullable = true)
I am reading all columns separately with readSchema and writing with writeSchema. Out of two complex columns, I am able to process one but not the other.
Below is part of my read schema:
.add("names_source", StructType(
StructField("first_names", ArrayType.apply(StringType)) ::
StructField("last_names_id", ArrayType.apply(DoubleType)) ::
Nil
))
.add("another_source_array", ArrayType(StructType(
StructField("first", StringType) ::
StructField("last", StringType) ::
Nil
)))
Here is part of my write schema:
.add("names_source", StructType.apply(Seq(
StructField("first_names", StringType),
StructField("last_names_id", DoubleType))
))
.add("another_source_array", ArrayType(StructType.apply(Seq(
StructField("first", StringType),
StructField("last", StringType))
)))
In processing, I am using a method to index all columns. Below is my code for the function:
def myMapRedFunction(df: DataFrame, spark: SparkSession): DataFrame = {
  val columnIndex = dataIndexingSchema.fieldNames.zipWithIndex.toMap
  val myRDD = df.rdd
    .map(row => {
      Row(
        row.getAs[Boolean](columnIndex("available")),
        parseDate(row.getAs[String](columnIndex("create_time"))),
        ??I Need help here??
        row.getAs[String](columnIndex("another_source")),
        anotherSourceArrayFunction(row.getSeq[Row](columnIndex("another_source_array"))),
        row.getAs[String](columnIndex("location")),
        row.getAs[String](columnIndex("timestamp")),
        parseDate(row.getAs[String](columnIndex("delete_time")))
      )
    }).distinct
  spark.createDataFrame(myRDD, dataWriteSchema)
}
The another_source_array column is being processed by the anotherSourceArrayFunction method to make sure we get the schema as per the requirements. I need a similar function for the names_source column. Below is the function that I am using for the another_source_array column.
def anotherSourceArrayFunction(data: Seq[Row]): Seq[Row] = {
  if (data == null) {
    data
  } else {
    data.map(r => {
      val first = r.getAs[String]("first").toUpperCase
      val last = r.getAs[String]("last")
      // requires org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
      new GenericRowWithSchema(Array(first, last), StructType(
        StructField("first", StringType) ::
        StructField("last", StringType) ::
        Nil
      ))
    })
  }
}
In short, I need something like this, where I get my names_source column structure as a struct:
names_source:struct<first_names:array<string>,last_names_id:array<bigint>>
another_source_array:array<struct<first:string,last:string>>
Above is the column schema required finally. I am able to get another_source_array properly and need help with names_source. I think my write schema for this column is correct, but I am not sure. In the end I need names_source:struct<first_names:array<string>,last_names_id:array<bigint>> as the column schema.
Note: I am able to process the another_source_array column without any problem; I kept that function here for better understanding.
From what I see in the code you've tried, you are trying to flatten the struct dataValue column into separate columns.
If my assumption is correct, then you don't have to go through such complexity. You can simply do the following:
val myRDD = df.rdd
  .map(row => {
    Row(
      row.getAs[Boolean]("available"),
      parseDate(row.getAs[String]("createTime")),
      row.getAs[Row]("dataValue").getAs[Row]("names_source"),
      row.getAs[Row]("dataValue").getAs[String]("another_source"),
      row.getAs[Row]("dataValue").getAs[Seq[Row]]("another_source_array"),
      row.getAs[Row]("dataValue").getAs[String]("location"),
      row.getAs[Row]("dataValue").getAs[String]("timestamp"),
      parseDate(row.getAs[String]("deleteTime"))
    )
  }).distinct
import org.apache.spark.sql.types._
val dataWriteSchema = StructType(Seq(
  StructField("available", BooleanType, true),
  StructField("createTime", DateType, true),
  StructField("names_source", StructType(Seq(
    StructField("first_names", ArrayType(StringType), true),
    StructField("last_names_id", ArrayType(LongType), true))), true),
  StructField("another_source", StringType, true),
  StructField("another_source_array", ArrayType(StructType.apply(Seq(
    StructField("first", StringType),
    StructField("last", StringType)))), true),
  StructField("location", StringType, true),
  StructField("timestamp", StringType, true),
  StructField("deleteTime", DateType, true)
))
spark.createDataFrame(myRDD, dataWriteSchema).show(false)
Using * to flatten the struct column
You can simply use .* on the struct column to get its elements as separate columns:
import org.apache.spark.sql.functions._
df.select(col("available"), col("createTime"), col("dataValue.*"), col("deleteTime")).show(false)
With this method you will still have to cast the string date columns to DateType yourself.
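A minimal sketch of that cast (assuming the dates are in yyyy-MM-dd form as in the sample, a plain cast to date is enough):
import org.apache.spark.sql.functions.col
val flattened = df
  .select(col("available"), col("createTime"), col("dataValue.*"), col("deleteTime"))
  .withColumn("createTime", col("createTime").cast("date"))
  .withColumn("deleteTime", col("deleteTime").cast("date"))
flattened.printSchema  // createTime and deleteTime are now of date type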
In both cases you should get output like:
+---------+----------+-----------------------------------------------+--------------+--------------------+--------+----------+----------+
|available|createTime|names_source |another_source|another_source_array|location|timestamp |deleteTime|
+---------+----------+-----------------------------------------------+--------------+--------------------+--------+----------+----------+
|false |2016-01-08|[WrappedArray(abc, def),WrappedArray(123, 456)]|TableSources |[[1.1,ONE]] |GMP |2018-02-11|2016-01-08|
+---------+----------+-----------------------------------------------+--------------+--------------------+--------+----------+----------+
I hope the answer is helpful

Appending Complex Column to Spark Dataframe

I am attempting to add a column containing List[Annotation] to a Spark DataFrame using the below code (I've reformatted everything so this can be reproduced by directly copying and pasting).
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
case class Annotation(
field1: String,
field2: String,
field3: Int,
field4: Float,
field5: Int,
field6: List[Mapping]
)
case class Mapping(
fieldA: String,
fieldB: String,
fieldC: String,
fieldD: String,
fieldE: String
)
object StructTest {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
val annotationStruct =
StructType(
Array(
StructField("field1", StringType, nullable = true),
StructField("field2", StringType, nullable = true),
StructField("field3", IntegerType, nullable = false),
StructField("field4", FloatType, nullable = false),
StructField("field5", IntegerType, nullable = false),
StructField(
"field6",
ArrayType(
StructType(Array(
StructField("fieldA", StringType, nullable = true),
StructField("fieldB", StringType, nullable = true),
StructField("fieldC", StringType, nullable = true),
StructField("fieldD", StringType, nullable = true),
StructField("fieldE", StringType, nullable = true)
))),
nullable = true
)
)
)
val df = List(1).toDF
val annotation = Annotation("1", "2", 1, .5f, 1, List(Mapping("a", "b", "c", "d", "e")))
val schema = df.schema
val newSchema = schema.add("annotations", ArrayType(annotationStruct), false)
val rdd = df.rdd.map(x => Row.fromSeq(x.toSeq :+ List(annotation)))
val newDF = spark.createDataFrame(rdd, newSchema)
newDF.printSchema
newDF.show
}
}
However, I'm getting an error when running this code.
Caused by: java.lang.RuntimeException: Annotation is not a valid external type for schema of struct<field1:string,field2:string,field3:int,field4:float,field5:int,field6:array<struct<fieldA:string,fieldB:string,fieldC:string,fieldD:string,fieldE:string>>>
The schema I am passing in (ArrayType(annotationStruct)) appears to be of the incorrect form when creating a dataFrame using createDataFrame, but it seems to match schemas for DataFrames that contain only List[Annotation].
Edit: Example of modifying a DF schema in this fashion with a simple type instead of a case class.
val df = List(1).toDF
spark.createDataFrame(df.rdd.map(x => Row.fromSeq(x.toSeq :+ "moose")), df.schema.add("moose", StringType, false)).show
+-----+-----+
|value|moose|
+-----+-----+
| 1|moose|
+-----+-----+
Edit 2: I've parsed this down a bit more. Sadly, I don't have the option of creating a DataFrame directly from the case class, which is why I am trying to mirror it as a Struct using ScalaReflection. In this case, I am not altering a previous schema, just attempting to create a DataFrame from an RDD of Rows which contain lists of my case class. Spark had an issue in 1.6 which impacted parsing arrays of structs that may be empty or null; I'm wondering if these are linked.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
val annotationSchema = ScalaReflection.schemaFor[Annotation].dataType.asInstanceOf[StructType]
val annotation = Annotation("1", "2", 1, .5f, 1, List(Mapping("a", "b", "c", "d", "e")))
val testRDD = spark.sparkContext.parallelize(List(List(annotation))).map(x => Row(x))
val testSchema = StructType(
Array(StructField("annotations", ArrayType(annotationSchema), false)
))
spark.createDataFrame(testRDD, testSchema).show
If you are concerned with adding a complex column to an existing dataframe, then the following solution should work for you.
val df = List(1).toDF
val annotation = sc.parallelize(List(Annotation("1", "2", 1, .5f, 1, List(Mapping("a", "b", "c", "d", "e")))))
val newDF = df.rdd.zip(annotation).map(x => Merged(x._1.get(0).asInstanceOf[Int], x._2)).toDF
newDF.printSchema
newDF.show(false)
which should give you
root
|-- value: integer (nullable = false)
|-- annotations: struct (nullable = true)
| |-- field1: string (nullable = true)
| |-- field2: string (nullable = true)
| |-- field3: integer (nullable = false)
| |-- field4: float (nullable = false)
| |-- field5: integer (nullable = false)
| |-- field6: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- fieldA: string (nullable = true)
| | | |-- fieldB: string (nullable = true)
| | | |-- fieldC: string (nullable = true)
| | | |-- fieldD: string (nullable = true)
| | | |-- fieldE: string (nullable = true)
+-----+---------------------------------------+
|value|annotations |
+-----+---------------------------------------+
|1 |[1,2,1,0.5,1,WrappedArray([a,b,c,d,e])]|
+-----+---------------------------------------+
The case classes used are the same as yours, with an additional Merged case class:
case class Merged(value : Int, annotations: Annotation)
case class Annotation(field1: String, field2: String, field3: Int, field4: Float, field5: Int, field6: List[Mapping])
case class Mapping(fieldA: String, fieldB: String, fieldC: String, fieldD: String, fieldE: String)
When case classes are used, we don't need to define a schema, and the way column names are created with case classes differs from creating them via sqlContext.createDataFrame with an explicit schema.
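A minimal sketch of that last point, assuming the same session setup as in the question (a SparkSession named spark with import spark.implicits._ in scope):
// With case classes, toDF derives the whole schema, including the nested
// array of structs, from the Encoder; no StructType needs to be written by hand.
val derived = Seq(Merged(1, Annotation("1", "2", 1, .5f, 1, List(Mapping("a", "b", "c", "d", "e"))))).toDF
derived.printSchema  // same structure as the schema shown above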