Nested JSON in Spark

Nested JSON in Spark - scala

I have the following JSON loaded as a DataFrame:
root
|-- data: struct (nullable = true)
| |-- field1: string (nullable = true)
| |-- field2: string (nullable = true)
|-- moreData: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- more1: string (nullable = true)
| | |-- more2: string (nullable = true)
| | |-- more3: string (nullable = true)
I want to get the following RDD from this DataFrame:
RDD[(more1, more2, more3, field1, field2)]
How can I do this? I think I have to use flatMap for the nested JSON?

A combination of explode and dot syntax should do the trick:
import org.apache.spark.sql.functions.explode
case class Data(field1: String, field2: String)
case class MoreData(more1: String, more2: String, more3: String)
val df = sc.parallelize(Seq(
(Data("foo", "bar"), Array(MoreData("a", "b", "c"), MoreData("d", "e", "f")))
)).toDF("data", "moreData")
df.printSchema
// root
// |-- data: struct (nullable = true)
// | |-- field1: string (nullable = true)
// | |-- field2: string (nullable = true)
// |-- moreData: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- more1: string (nullable = true)
// | | |-- more2: string (nullable = true)
// | | |-- more3: string (nullable = true)
val columns = Seq(
$"moreData.more1", $"moreData.more2", $"moreData.more3",
$"data.field1", $"data.field2")
val aRDD = df.withColumn("moreData", explode($"moreData"))
.select(columns: _*)
.rdd
aRDD.collect
// Array[org.apache.spark.sql.Row] = Array([a,b,c,foo,bar], [d,e,f,foo,bar])
Depending on your requirements you can follow this with map to extract values from the rows:
import org.apache.spark.sql.Row
aRDD.map{case Row(m1: String, m2: String, m3: String, f1: String, f2: String) =>
(m1, m2, m3, f1, f2)}
See also Querying Spark SQL DataFrame with complex types

Related

Convert flattened data frame to struct in Spark

I had a deep nested JSON files which I had to process, and in order to do that I had to flatten them because couldn't find a way to hash some deep nested fields. This is how my dataframe looks like (after flattening):
scala> flattendedJSON.printSchema
root
|-- header_appID: string (nullable = true)
|-- header_appVersion: string (nullable = true)
|-- header_userID: string (nullable = true)
|-- body_cardId: string (nullable = true)
|-- body_cardStatus: string (nullable = true)
|-- body_cardType: string (nullable = true)
|-- header_userAgent_browser: string (nullable = true)
|-- header_userAgent_browserVersion: string (nullable = true)
|-- header_userAgent_deviceName: string (nullable = true)
|-- body_beneficiary_beneficiaryAccounts_beneficiaryAccountOwner: string (nullable = true)
|-- body_beneficiary_beneficiaryPhoneNumbers_beneficiaryPhoneNumber: string (nullable = true)
And I need to convert it back to original structure (before flattening):
scala> nestedJson.printSchema
root
|-- header: struct (nullable = true)
| |-- appID: string (nullable = true)
| |-- appVersion: string (nullable = true)
| |-- userAgent: struct (nullable = true)
| | |-- browser: string (nullable = true)
| | |-- browserVersion: string (nullable = true)
| | |-- deviceName: string (nullable = true)
|-- body: struct (nullable = true)
| |-- beneficiary: struct (nullable = true)
| | |-- beneficiaryAccounts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryAccountOwner: string (nullable = true)
| | |-- beneficiaryPhoneNumbers: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryPhoneNumber: string (nullable = true)
| |-- cardId: string (nullable = true)
| |-- cardStatus: string (nullable = true)
| |-- cardType: string (nullable = true)
I've managed to do it with single nested field, but if it's more, it can't work and I can't find a way to do it properly. Here's what I tried:
val structColumns = flattendedJSON.columns.filter(_.contains("_"))
val structColumnsMap = structColumns.map(_.split("\\_")).
groupBy(_(0)).mapValues(_.map(_(1)))
val dfExpanded = structColumnsMap.foldLeft(flattendedJSON){ (accDF, kv) =>
val cols = kv._2.map(v => col("`" + kv._1 + "_" + v + "`").as(v))
accDF.withColumn(kv._1, struct(cols: _*))
}
val dfResult = structColumns.foldLeft(flattendedJSON)(_ drop _)
And it's working if I have one nested object (e.g. header_appID), but in case of header_userAgent_browser, I get an exception:
org.apache.spark.sql.AnalysisException: cannot resolve
'header_userAgent' given input columns: ..
Using Spark 2.3 and Scala 2.11.8

I would recommend use case classes to work with a Dataset instead of flatten the DF and then again try to convert to the old json format. Even if it has nested objects you can define a set of case classes to cast it. It allows you to work with an object notation making the things easier than DF.
There are tools where you can provide a sample of the json and it generates the classes for you (I use this: https://json2caseclass.cleverapps.io).
If you anyways want to convert it from the DF, an alternative could be, create a Dataset using map on your DF. Something like this:
case class NestedNode(fieldC: String, fieldD: String) // for JSON
case class MainNode(fieldA: String, fieldB: NestedNode) // for JSON
case class FlattenData(fa: String, fc: String, fd: String)
Seq(
FlattenData("A1", "B1", "C1"),
FlattenData("A2", "B2", "C2"),
FlattenData("A3", "B3", "C3")
).toDF
.as[FlattenData] // Cast it to access with object notation
.map(flattenItem=>{
MainNode(flattenItem.fa, NestedNode(flattenItem.fc, flattenItem.fd) ) // Creating output format
})
At the end, that schema defined with the classes will be used by yourDS.write.mode(your_save_mode).json(your_target_path)

Reordering fields in nested dataframe

How do I reorder fields in a nested dataframe in scala?
for e.g below is the expected and desired schemas
currently->
root
|-- domain: struct (nullable = false)
| |-- assigned: string (nullable = true)
| |-- core: string (nullable = true)
| |-- createdBy: long (nullable = true)
|-- Event: struct (nullable = false)
| |-- action: string (nullable = true)
| |-- eventid: string (nullable = true)
| |-- dqid: string (nullable = true)
expected->
root
|-- domain: struct (nullable = false)
| |-- core: string (nullable = true)
| |-- assigned: string (nullable = true)
| |-- createdBy: long (nullable = true)
|-- Event: struct (nullable = false)
| |-- dqid: string (nullable = true)
| |-- eventid: string (nullable = true)
| |-- action: string (nullable = true)
```

You need to define schema before you read the dataframe.
val schema = val schema = StructType(Array(StructField("root",StructType(Array(StructField("domain",StructType(Array(StructField("core",StringType,true), StructField("assigned",StringType,true), StructField("createdBy",StringType,true))),true), StructField("Event",StructType(Array(StructField("dqid",StringType,true), StructField("eventid",StringType,true), StructField("action",StringType,true))),true))),true)))
Now, you can apply this schema while reading your file.
val df = spark.read.schema(schema).json("path/to/json")
Should work with any nested data.
Hope this helps!

Most efficient approach might be to just select the nested elements and wrap in a couple of structs, as shown below:
case class Domain(assigned: String, core: String, createdBy: Long)
case class Event(action: String, eventid: String, dqid: String)
val df = Seq(
(Domain("a", "b", 1L), Event("c", "d", "e")),
(Domain("f", "g", 2L), Event("h", "i", "j"))
).toDF("domain", "event")
val df2 = df.select(
struct($"domain.core", $"domain.assigned", $"domain.createdBy").as("domain"),
struct($"event.dqid", $"event.action", $"event.eventid").as("event")
)
df2.printSchema
// root
// |-- domain: struct (nullable = false)
// | |-- core: string (nullable = true)
// | |-- assigned: string (nullable = true)
// | |-- createdBy: long (nullable = true)
// |-- event: struct (nullable = false)
// | |-- dqid: string (nullable = true)
// | |-- action: string (nullable = true)
// | |-- eventid: string (nullable = true)
An alternative would be to apply row-wise map:
import org.apache.spark.sql.Row
val df2 = df.map{ case Row(Row(as: String, co: String, cr: Long), Row(ac: String, ev: String, dq: String)) =>
((co, as, cr), (dq, ac, ev))
}.toDF("domain", "event")

Changing the type of a nested JSON attribute

scala> val df = spark.read.json("data.json")
scala> df.printSchema
root
|-- a: struct (nullable = true)
| |-- b: struct (nullable = true)
| | |-- c: long (nullable = true)
|-- **TimeStamp: string (nullable = true)**
|-- id: string (nullable = true)
scala> val df1 = df.withColumn("TimeStamp", $"TimeStamp".cast(TimestampType))
scala> df1.printSchema
root
|-- a: struct (nullable = true)
| |-- b: struct (nullable = true)
| | |-- c: long (nullable = true)
|-- **TimeStamp: timestamp (nullable = true)** // WORKING AS EXPECTED
|-- id: string (nullable = true)
scala> val df2 = df.withColumn("a.b.c", $"a.b.c".cast(DoubleType))
scala> df2.printSchema
root
|-- a: struct (nullable = true)
| |-- b: struct (nullable = true)
| | |-- c: long (nullable = true)
|-- TimeStamp: string (nullable = true)
|-- id: string (nullable = true)
|-- **a.b.c: double (nullable = true)** // DUPLICATE COLUMN ADDED
I'm trying to change the type of a nested JSON attribute within a data frame column. the change in a nested attribute has been treated as a new column which resulting a duplicate column. the change is working fine for the top level attributes (Timestamp) but not for the nested ones (a.b.c). Any thoughts on this problem ?.

since your column is of struct type & you need to build it again in the same hierarchy. Because it doesn't go by assumption, it thinks that you are rewriting the structure.
Input:
{"a": {"b": {"c": "1.31", "d": "1.11"}}, "TimeStamp": "2017-02-18", "id":1}
{"a": {"b": {"c": "2.31", "d": "2.22"}}, "TimeStamp": "2017-02-18", "id":1}
val lines2 = spark.read.json("/home/kiran/km/km_hadoop/data/data_nested_struct_col2.json")
lines2.printSchema()
val df2 = lines2.withColumn("a", struct(
struct(
lines2("a.b.c").cast(DoubleType).as("c"),
lines2("a.b.d").as("d")
).as("b")))
.withColumn("TimeStamp", lines2("TimeStamp").cast(DateType))
df2.printSchema()
This is output of both schemas before & after:
root
|-- TimeStamp: string (nullable = true)
|-- a: struct (nullable = true)
| |-- b: struct (nullable = true)
| | |-- c: string (nullable = true)
| | |-- d: string (nullable = true)
|-- id: long (nullable = true)
root
|-- TimeStamp: date (nullable = true)
|-- a: struct (nullable = false)
| |-- b: struct (nullable = false)
| | |-- c: double (nullable = true)
| | |-- d: string (nullable = true)
|-- id: long (nullable = true)
I hope it is clear.

Array of String to Array of Struct in Scala + Spark

I am currently using Spark and Scala 2.11.8
I have the following schema:
root
|-- partnumber: string (nullable = true)
|-- brandlabel: string (nullable = true)
|-- availabledate: string (nullable = true)
|-- descriptions: array (nullable = true)
|-- |-- element: string (containsNull = true)
I am trying to use UDF to convert it to the following:
root
|-- partnumber: string (nullable = true)
|-- brandlabel: string (nullable = true)
|-- availabledate: string (nullable = true)
|-- description: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- value: string (nullable = true)
| | |-- code: string (nullable = true)
| | |-- cost: int(nullable = true)
So source data looks like this:
[WrappedArray(a abc 100,b abc 300)]
[WrappedArray(c abc 400)]
I need to use " " (space) as a delimiter, but don't know how to do this in scala.
def convert(product: Seq[String]): Seq[Row] = {
??/
}
I am fairly new in Scala, so can someone guide me how to construct this type of function?
Thanks.

I do not know if I understand your problem right, but map could be your friend.
case class Row(a: String, b: String, c: Int)
val value = List(List("a", "abc", 123), List("b", "bcd", 321))
value map {
case List(a: String, b: String, c: Int) => Row(a,b,c);
}
if you have to parse it first:
val value2 = List("a b 123", "c d 345")
value2 map {
case s => {
val split = s.toString.split(" ")
Row(split(0), split(1), split(2).toInt)
}
}

How can I create a Spark DataFrame from a nested array of struct element?

I have read a JSON file into Spark. This file has the following structure:
scala> tweetBlob.printSchema
root
|-- related: struct (nullable = true)
| |-- next: struct (nullable = true)
| | |-- href: string (nullable = true)
|-- search: struct (nullable = true)
| |-- current: long (nullable = true)
| |-- results: long (nullable = true)
|-- tweets: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- cde: struct (nullable = true)
...
...
| | |-- cdeInternal: struct (nullable = true)
...
...
| | |-- message: struct (nullable = true)
...
...
What I would ideally want is a DataFrame with columns "cde", "cdeInternal", "message"... as shown below
root
|-- cde: struct (nullable = true)
...
...
|-- cdeInternal: struct (nullable = true)
...
...
|-- message: struct (nullable = true)
...
...
I have managed to use "explode" to extract elements from the "tweets" array into a column called "tweets"
scala> val tweets = tweetBlob.select(explode($"tweets").as("tweets"))
tweets: org.apache.spark.sql.DataFrame = [tweets: struct<cde:struct<author:struct<gender:string,location:struct<city:string,country:string,state:string>,maritalStatus:struct<evidence:string,isMarried:string>,parenthood:struct<evidence:string,isParent:string>>,content:struct<sentiment:struct<evidence:array<struct<polarity:string,sentimentTerm:string>>,polarity:string>>>,cdeInternal:struct<compliance:struct<isActive:boolean,userProtected:boolean>,tracks:array<struct<id:string>>>,message:struct<actor:struct<displayName:string,favoritesCount:bigint,followersCount:bigint,friendsCount:bigint,id:string,image:string,languages:array<string>,link:string,links:array<struct<href:string,rel:string>>,listedCount:bigint,location:struct<displayName:string,objectType:string>,objectType:string,postedTime...
scala> tweets.printSchema
root
|-- tweets: struct (nullable = true)
| |-- cde: struct (nullable = true)
...
...
| |-- cdeInternal: struct (nullable = true)
...
...
| |-- message: struct (nullable = true)
...
...
How can I select all columns inside the struct and create a DataFrame out of it? Explode does not work on a struct if my understanding is correct.
Any help is appreciated.

One possible way to handle this is to extract required information from the schema. Lets start with some dummy data:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._
case class Bar(x: Int, y: String)
case class Foo(bar: Bar)
val df = sc.parallelize(Seq(Foo(Bar(1, "first")), Foo(Bar(2, "second")))).toDF
df.printSchema
// root
// |-- bar: struct (nullable = true)
// | |-- x: integer (nullable = false)
// | |-- y: string (nullable = true)
and a helper function:
def children(colname: String, df: DataFrame) = {
val parent = df.schema.fields.filter(_.name == colname).head
val fields = parent.dataType match {
case x: StructType => x.fields
case _ => Array.empty[StructField]
}
fields.map(x => col(s"$colname.${x.name}"))
}
Finally the results:
df.select(children("bar", df): _*).printSchema
// root
// |-- x: integer (nullable = true)
// |-- y: string (nullable = true)

You can use
df
.select(explode(col("path_to_collection")).as("collection"))
.select(col("collection.*"))`:
Example:
scala> val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
scala> val inline = sqlContext.read.json(sc.parallelize(json :: Nil)).select(explode(col("schools")).as("collection")).select(col("collection.*"))
scala> inline.printSchema
root
|-- sname: string (nullable = true)
|-- year: long (nullable = true)
scala> inline.show
+--------+----+
| sname|year|
+--------+----+
|stanford|2010|
|berkeley|2012|
+--------+----+
Or, you can also use SQL function inline:
scala> val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
scala> sqlContext.read.json(sc.parallelize(json :: Nil)).registerTempTable("tmp")
scala> val inline = sqlContext.sql("SELECT inline(schools) FROM tmp")
scala> inline.printSchema
root
|-- sname: string (nullable = true)
|-- year: long (nullable = true)
scala> inline.show
+--------+----+
| sname|year|
+--------+----+
|stanford|2010|
|berkeley|2012|
+--------+----+

scala> import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.DataFrame
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> case class Bar(x: Int, y: String)
defined class Bar
scala> case class Foo(bar: Bar)
defined class Foo
scala> val df = sc.parallelize(Seq(Foo(Bar(1, "first")), Foo(Bar(2, "second")))).toDF
df: org.apache.spark.sql.DataFrame = [bar: struct<x: int, y: string>]
scala> df.printSchema
root
|-- bar: struct (nullable = true)
| |-- x: integer (nullable = false)
| |-- y: string (nullable = true)
scala> df.select("bar.*").printSchema
root
|-- x: integer (nullable = true)
|-- y: string (nullable = true)
scala>