Use Spark Scala to transform flat data into nested object

I need help converting a flat dataset into a nested format using Apache Spark / Scala.
Is it possible to automatically create a nested structure derived from input column namespaces
[level 1].[level 2]? In my example, the nesting level is determined by the period symbol '.' within the column headers.
I assume this is possible to achieve using a map function. I am open to alternative solutions, particularly if there is a more elegant way of achieving the same outcome.
package org.acme.au

import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SQLContext
import scala.collection.Seq

object testNestedObject extends App {

  // Configure Spark
  val spark = SparkSession.builder()
    .appName("Spark batch demo")
    .master("local[*]")
    .config("spark.driver.host", "localhost")
    .getOrCreate()

  // Start Spark
  val sc = spark.sparkContext
  sc.setLogLevel("ERROR")
  val sqlContext = new SQLContext(sc)

  // Define schema for the flat input data
  val flatSchema = new StructType()
    .add(StructField("id", StringType, false))
    .add(StructField("name", StringType, false))
    .add(StructField("custom_fields.fav_colour", StringType, true))
    .add(StructField("custom_fields.star_sign", StringType, true))

  // Create rows with dummy data
  val row1 = Row("123456", "John Citizen", "Blue", "Scorpio")
  val row2 = Row("990087", "Jane Simth", "Green", "Taurus")
  val flatData = Seq(row1, row2)

  // Convert into a DataFrame
  val dfIn = spark.createDataFrame(spark.sparkContext.parallelize(flatData), flatSchema)

  // Print to console
  dfIn.printSchema()
  dfIn.show()

  // Convert flat data into a nested structure, to be written as either Parquet or JSON
  val dfOut = dfIn.rdd
    .map(
      row => ( /* TODO: Need help with mapping flat data to a nested structure derived from the input column namespaces
                *
                * For example:
                *
                * <id>123456</id>
                * <name>John Citizen</name>
                * <custom_fields>
                *   <fav_colour>Blue</fav_colour>
                *   <star_sign>Scorpio</star_sign>
                * </custom_fields>
                *
                */ ))

  // Stop Spark
  sc.stop()
}

This solution addresses the revised requirement that the JSON output consist of an array of {key, value} objects rather than a single {key1: value1, key2: value2, ...} object. For example:
// FROM:
"custom_fields":{"fav_colour":"Blue", "star_sign":"Scorpio"}
// TO:
"custom_fields":[{"key":"fav_colour", "value":"Blue"}, {"key":"star_sign", "value":"Scorpio"}]
Sample code below:
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for toDF on a local Seq (already in scope in spark-shell)

val dfIn = Seq(
  (123456, "John Citizen", "Blue", "Scorpio"),
  (990087, "Jane Simth", "Green", "Taurus")
).toDF("id", "name", "custom_fields.fav_colour", "custom_fields.star_sign")

val structCols = dfIn.columns.filter(_.contains("."))
// structCols: Array[String] =
//   Array(custom_fields.fav_colour, custom_fields.star_sign)

val structColsMap = structCols.map(_.split("\\.")).
  groupBy(_(0)).mapValues(_.map(_(1)))
// structColsMap: scala.collection.immutable.Map[String,Array[String]] =
//   Map(custom_fields -> Array(fav_colour, star_sign))

val dfExpanded = structColsMap.foldLeft(dfIn){ (accDF, kv) =>
  val cols = kv._2.map( v =>
    struct(lit(v).as("key"), col("`" + kv._1 + "." + v + "`").as("value"))
  )
  accDF.withColumn(kv._1, array(cols: _*))
}

val dfResult = structCols.foldLeft(dfExpanded)(_ drop _)

dfResult.show(false)
// +------+------------+----------------------------------------+
// |id |name |custom_fields |
// +------+------------+----------------------------------------+
// |123456|John Citizen|[[fav_colour,Blue], [star_sign,Scorpio]]|
// |990087|Jane Simth |[[fav_colour,Green], [star_sign,Taurus]]|
// +------+------------+----------------------------------------+
dfResult.printSchema
// root
// |-- id: integer (nullable = false)
// |-- name: string (nullable = true)
// |-- custom_fields: array (nullable = false)
// | |-- element: struct (containsNull = false)
// | | |-- key: string (nullable = false)
// | | |-- value: string (nullable = true)
dfResult.toJSON.show(false)
// +-------------------------------------------------------------------------------------------------------------------------------+
// |value |
// +-------------------------------------------------------------------------------------------------------------------------------+
// |{"id":123456,"name":"John Citizen","custom_fields":[{"key":"fav_colour","value":"Blue"},{"key":"star_sign","value":"Scorpio"}]}|
// |{"id":990087,"name":"Jane Simth","custom_fields":[{"key":"fav_colour","value":"Green"},{"key":"star_sign","value":"Taurus"}]} |
// +-------------------------------------------------------------------------------------------------------------------------------+
Note that we cannot make the value type Any to accommodate a mix of different types, as the Spark DataFrame API doesn't support type Any. As a consequence, the value in the array must be of a single given type (e.g. String). Like the struct-based solution further below, this also handles only up to one nested level.
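If the dotted source columns were of mixed types, one workaround (a hedged sketch, not part of the original answer) is to cast every value to string while building the key/value structs so they all fit the same array<struct<key,value>> shape:
// Sketch only: same foldLeft as above, but with an explicit cast to string
val dfExpandedStr = structColsMap.foldLeft(dfIn){ (accDF, kv) =>
  val cols = kv._2.map( v =>
    struct(lit(v).as("key"), col("`" + kv._1 + "." + v + "`").cast("string").as("value"))
  )
  accDF.withColumn(kv._1, array(cols: _*))
}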

This can be solved with a dedicated case class and a UDF that converts the input data into case class instances. For example:
Define the case class
case class NestedFields(fav_colour: String, star_sign: String)
Define the UDF that takes the original column values as input and returns an instance of NestedFields:
private val asNestedFields = udf((fc: String, ss: String) => NestedFields(fc, ss))
Transform the original DataFrame and drop the flat columns:
val res = dfIn.withColumn("custom_fields", asNestedFields($"`custom_fields.fav_colour`", $"`custom_fields.star_sign`"))
  .drop($"`custom_fields.fav_colour`")
  .drop($"`custom_fields.star_sign`")
It produces
root
|-- id: string (nullable = false)
|-- name: string (nullable = false)
|-- custom_fields: struct (nullable = true)
| |-- fav_colour: string (nullable = true)
| |-- star_sign: string (nullable = true)
+------+------------+---------------+
| id| name| custom_fields|
+------+------------+---------------+
|123456|John Citizen|[Blue, Scorpio]|
|990087| Jane Simth|[Green, Taurus]|
+------+------------+---------------+
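For completeness, when running this snippet outside the spark-shell, roughly the following imports are also assumed (a minimal sketch; the SparkSession value is assumed to be named spark):
import org.apache.spark.sql.functions.udf
import spark.implicits._  // provides the $"..." column syntax used above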

Here's a generalized solution that first assembles a Map of the column names that contain a '.', traverses the Map to add the converted struct columns to the DataFrame, and finally drops the original dotted columns. A slightly more generalized dfIn is used as the sample data.
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for toDF on a local Seq (already in scope in spark-shell)

val dfIn = Seq(
  (123456, "John Citizen", "Blue", "Scorpio", "a", 1),
  (990087, "Jane Simth", "Green", "Taurus", "b", 2)
).toDF("id", "name", "custom_fields.fav_colour", "custom_fields.star_sign", "s.c1", "s.c2")

val structCols = dfIn.columns.filter(_.contains("."))
// structCols: Array[String] =
//   Array(custom_fields.fav_colour, custom_fields.star_sign, s.c1, s.c2)

val structColsMap = structCols.map(_.split("\\.")).
  groupBy(_(0)).mapValues(_.map(_(1)))
// structColsMap: scala.collection.immutable.Map[String,Array[String]] =
//   Map(s -> Array(c1, c2), custom_fields -> Array(fav_colour, star_sign))

val dfExpanded = structColsMap.foldLeft(dfIn){ (accDF, kv) =>
  val cols = kv._2.map(v => col("`" + kv._1 + "." + v + "`").as(v))
  accDF.withColumn(kv._1, struct(cols: _*))
}

val dfResult = structCols.foldLeft(dfExpanded)(_ drop _)
dfResult.show
// +------+------------+-----+--------------+
// |id |name |s |custom_fields |
// +------+------------+-----+--------------+
// |123456|John Citizen|[a,1]|[Blue,Scorpio]|
// |990087|Jane Simth |[b,2]|[Green,Taurus]|
// +------+------------+-----+--------------+
dfResult.printSchema
// root
// |-- id: integer (nullable = false)
// |-- name: string (nullable = true)
// |-- s: struct (nullable = false)
// | |-- c1: string (nullable = true)
// | |-- c2: integer (nullable = false)
// |-- custom_fields: struct (nullable = false)
// | |-- fav_colour: string (nullable = true)
// | |-- star_sign: string (nullable = true)
Note that this solution handles only up to one nested level.
To convert each row to JSON format, consider using toJSON as follows:
dfResult.toJSON.show(false)
// +---------------------------------------------------------------------------------------------------------------------+
// |value |
// +---------------------------------------------------------------------------------------------------------------------+
// |{"id":123456,"name":"John Citizen","s":{"c1":"a","c2":1},"custom_fields":{"fav_colour":"Blue","star_sign":"Scorpio"}}|
// |{"id":990087,"name":"Jane Simth","s":{"c1":"b","c2":2},"custom_fields":{"fav_colour":"Green","star_sign":"Taurus"}} |
// +---------------------------------------------------------------------------------------------------------------------+
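Since the original requirement mentions writing the nested structure out as either Parquet or JSON, a minimal sketch for persisting dfResult might look like this (the output paths are hypothetical):
dfResult.write.mode("overwrite").json("/tmp/nested/json")
dfResult.write.mode("overwrite").parquet("/tmp/nested/parquet")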

Update DataFrame col names from a list avoiding using var

I have a list of defined columns as:
case class ExcelColumn(colName: String, colType: String, colCode: String)
val cols = List(
ExcelColumn("Products Selled", "text", "products_selled"),
ExcelColumn("Total Value", "int", "total_value"),
)
And a file (a CSV whose header columns are Products Selled and Total Value) which is read as a DataFrame.
val df = spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv(filePath)
// the csv file header matches the colName values above
var finalDf = df
  .withColumn("row_id", monotonically_increasing_id)
  .select(cols
    .map(_.colName.trim)
    .map(col): _*)
// convert df column names to colCodes (for Kudu table columns)
cols.foreach(c => finalDf = finalDf.withColumnRenamed(c.colName.trim, c.colCode.trim))
In the last line, I change the DataFrame column name from Products Selled to products_selled. Because of this, finalDf has to be a var.
I want to know whether there is a way to declare finalDf as a val instead of a var.
I tried something like the code below, but withColumnRenamed returns a new DataFrame and I cannot capture that result outside of cols.foreach:
cols.foreach(c => finalDf.withColumnRenamed(c.colName.trim, c.colCode.trim))
You can rename columns using select.
Renaming columns inside select is faster than using foldLeft; check the comparison post.
Try the code below.
case class ExcelColumn(colName: String, colType: String, colCode: String)
val cols = List(
ExcelColumn("Products Selled", "string", "products_selled"),
ExcelColumn("Total Value", "int", "total_value"),
)
val colExpr = cols.map(c => trim(col(c.colName)).as(c.colCode.trim))
If you are storing a valid column data type in the ExcelColumn case class, you can also cast to that data type, like below.
val colExpr = cols.map(c => trim(col(c.colName).cast(c.colType)).as(c.colCode.trim))
val finalDf = df.select(colExpr: _*)
The better way is to use foldLeft with withColumnRenamed
case class ExcelColumn(colName: String, colType: String, colCode: String)
val cols = List(
ExcelColumn("Products Selled", "text", "products_selled"),
ExcelColumn("Total Value", "int", "total_value"),
)
val resultDF = cols.foldLeft(df) { (acc, c) =>
  acc.withColumnRenamed(c.colName.trim, c.colCode.trim)
}
Original Schema:
root
|-- Products Selled: integer (nullable = false)
|-- Total Value: string (nullable = true)
|-- value: integer (nullable = false)
New Schema:
root
|-- products_selled: integer (nullable = false)
|-- total_value: string (nullable = true)
|-- value: integer (nullable = false)
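On Spark 3.4+ there is also withColumnsRenamed, which takes a Map of old-to-new names and avoids both var and foldLeft; a hedged sketch based on the same cols list (renamedDF is an illustrative name):
val renameMap = cols.map(c => c.colName.trim -> c.colCode.trim).toMap
val renamedDF = df.withColumnsRenamed(renameMap)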

Spark-Scala Convert String of Numbers to Double

I am trying to make a dense Vector out of a string, but first I need to convert the values to Double. How do I get them in Double format?
[-- feature: string (nullable = false)]
https://i.stack.imgur.com/u1kWz.png
I have tried:
val new_col = df.withColumn("feature", df("feature").cast(DoubleType))
But it results in a column of nulls.
One approach would be to use a UDF:
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.linalg.DenseVector
import spark.implicits._  // needed for toDF and the $-syntax outside spark-shell

val df = Seq(
  "-1,-1,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0",
  "7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,",
  "12.0,10.0,10.0,10.0,12.0,12.0,10.0,10.0,10.0,12.0",
  "-1,-1,-1,-1,-1,-1,-1,5.0,9.0,9.0"
).toDF("feature")

def stringToVector = udf( (s: String) =>
  new DenseVector(s.split(",").map(_.toDouble))
)

df.withColumn("feature", stringToVector($"feature")).
  show(false)
// +---------------------------------------------------+
// |feature |
// +---------------------------------------------------+
// |[-1.0,-1.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0]|
// |[7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0] |
// |[12.0,10.0,10.0,10.0,12.0,12.0,10.0,10.0,10.0,12.0]|
// |[-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,5.0,9.0,9.0] |
// +---------------------------------------------------+
"first, i need to convert to a double. How do i get it in double format?"
You can simply use the built-in split function and cast the result to array<double>, as below:
import org.apache.spark.sql.functions._
val new_col = df.withColumn("feature", split(df("feature"), ",").cast("array<double>"))
which should give you
root
.....
.....
|-- feature: array (nullable = true)
| |-- element: double (containsNull = true)
.....
.....
I hope the answer is helpful
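If the end goal is an ML vector rather than a plain array column (as the question suggests), a small UDF can still be applied on top of the array<double> column. A hedged sketch using the newer ml.linalg package (toVec and withVector are illustrative names):
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}

// Wrap the array<double> column into an ML dense vector
val toVec = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))
val withVector = new_col.withColumn("feature", toVec(col("feature")))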

create a spark dataframe from a nested json file in scala [duplicate]

I have a json file that looks like this
{
"group" : {},
"lang" : [
[ 1, "scala", "functional" ],
[ 2, "java","object" ],
[ 3, "py","interpreted" ]
]
}
I tried to create a dataframe using
val path = "some/path/to/jsonFile.json"
val df = sqlContext.read.json(path)
df.show()
When I run this, I get
df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
How do I create a DataFrame based on the contents of the "lang" key? I do not care about group{}; all I need is to pull the data out of "lang" and apply a case class like this:
case class ProgLang(id: Int, lang: String, `type`: String)
I have read the post Reading JSON with Apache Spark - `corrupt_record` and understand that each record needs to be on a single line, but in my case I cannot change the file structure.
The JSON format is wrong, so the JSON API of sqlContext reads it as a corrupt record. The correct form is
{"group":{},"lang":[[1,"scala","functional"],[2,"java","object"],[3,"py","interpreted"]]}
and supposing you have it in a file ("/home/test.json"), you can use the following to get the DataFrame you want:
import org.apache.spark.sql.functions._
import sqlContext.implicits._

val df = sqlContext.read.json("/home/test.json")
val df2 = df.withColumn("lang", explode($"lang"))
  .withColumn("id", $"lang"(0))
  .withColumn("langs", $"lang"(1))
  .withColumn("type", $"lang"(2))
  .drop("lang")
  .withColumnRenamed("langs", "lang")

df2.show(false)
You should have
+---+-----+-----------+
|id |lang |type |
+---+-----+-----------+
|1 |scala|functional |
|2 |java |object |
|3 |py |interpreted|
+---+-----+-----------+
Updated
If you don't want to change your input json format as mentioned in your comment below, you can use wholeTextFiles to read the json file and parse it as below
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

val readJSON = sc.wholeTextFiles("/home/test.json")
  .map(x => x._2)
  .map(data => data.replaceAll("\n", ""))

val df = sqlContext.read.json(readJSON)
val df2 = df.withColumn("lang", explode($"lang"))
  .withColumn("id", $"lang"(0).cast(IntegerType))
  .withColumn("langs", $"lang"(1))
  .withColumn("type", $"lang"(2))
  .drop("lang")
  .withColumnRenamed("langs", "lang")

df2.show(false)
df2.printSchema
It should give you the same DataFrame as above, with this schema:
root
|-- id: integer (nullable = true)
|-- lang: string (nullable = true)
|-- type: string (nullable = true)
As of Spark 2.2 you can use multiLine option to deal with the case of multi-line JSONs.
scala> spark.read.option("multiLine", true).json("jsonFile.json").printSchema
root
|-- lang: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
Before Spark 2.2 see How to access sub-entities in JSON file? or Read multiline JSON in Apache Spark.
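To go one step further and apply the asker's ProgLang case class on top of the multiLine read, an untested sketch could look like the following (note that type must be backquoted in Scala, and langDS is an illustrative name):
import org.apache.spark.sql.functions._
import spark.implicits._

case class ProgLang(id: Int, lang: String, `type`: String)

// Explode the nested array and map each element to the case class fields
val langDS = spark.read.option("multiLine", true).json("jsonFile.json")
  .select(explode($"lang").as("l"))
  .select($"l"(0).cast("int").as("id"), $"l"(1).as("lang"), $"l"(2).as("type"))
  .as[ProgLang]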

Create spark data frame from custom data format

I have a text file with the string REC as the record delimiter and a line break as the column delimiter, and every value has its column name attached to it with a comma as the delimiter. Below is the sample data format:
REC
Id,19048
Term,milk
Rank,1
REC
Id,19049
Term,corn
Rank,5
REC is used as the record delimiter. Now I want to create a Spark DataFrame with the column names Id, Term and Rank. Please assist me with this.
Here is working code:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object RecordSeparator extends App {
  var conf = new SparkConf().setAppName("test").setMaster("local[1]")
    .setExecutorEnv("executor-cores", "2")
  var sc = new SparkContext(conf)

  // Tell the Hadoop input format to split records on "REC" instead of newlines
  val hconf = new Configuration
  hconf.set("textinputformat.record.delimiter", "REC")

  val data = sc.newAPIHadoopFile("data.txt",
    classOf[TextInputFormat], classOf[LongWritable],
    classOf[Text], hconf).map(x => x._2.toString.trim).filter(x => x != "")
    .map(x => getRecord(x)).map(x => x.split(","))
    .map(x => record(x(0), x(1), x(2)))

  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  val df = data.toDF()
  df.printSchema()
  df.show(false)

  // Flatten one record block ("Id,19048\nTerm,milk\nRank,1") into "19048,milk,1"
  def getRecord(in: String): String = {
    val ar = in.split("\n").mkString(",").split(",")
    val data = Array(ar(1), ar(3), ar(5))
    data.mkString(",")
  }
}
case class record(Id: String, Term: String, Rank: String)
Output:
root
|-- Id: string (nullable = true)
|-- Term: string (nullable = true)
|-- Rank: string (nullable = true)
+-----+----+----+
|Id   |Term|Rank|
+-----+----+----+
|19048|milk|1   |
|19049|corn|5   |
+-----+----+----+
Supposing you have your file on the "normal" filesystem (not HDFS), you have to write a file parser and then use sc.parallelize to create an RDD and then a DataFrame:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable

object Demo extends App {
  val conf = new SparkConf().setMaster("local[1]").setAppName("Demo")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  case class Record(
    var id: Option[Int] = None,
    var term: Option[String] = None,
    var rank: Option[Int] = None)

  val filename = "data.dat"
  val records = readFile(filename)
  val df = sc.parallelize(records).toDF

  df.printSchema()
  df.show()

  def readFile(filename: String): Seq[Record] = {
    import scala.io.Source
    val records = mutable.ArrayBuffer.empty[Record]
    var currentRecord: Record = null
    for (line <- Source.fromFile(filename).getLines) {
      val tokens = line.split(',')
      currentRecord = tokens match {
        case Array("REC") => Record()
        case Array("Id", id) => {
          currentRecord.id = Some(id.toInt); currentRecord
        }
        case Array("Term", term) => {
          currentRecord.term = Some(term); currentRecord
        }
        case Array("Rank", rank) => {
          currentRecord.rank = Some(rank.toInt); records += currentRecord
          null
        }
      }
    }
    records
  }
}
This gives:
root
|-- id: integer (nullable = true)
|-- term: string (nullable = true)
|-- rank: integer (nullable = true)
+-----+----+----+
| id|term|rank|
+-----+----+----+
|19048|milk| 1|
|19049|corn| 5|
+-----+----+----+
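On more recent Spark versions (2.4+), the text datasource's lineSep option can split records on the REC marker directly, keeping everything in the DataFrame API. An untested sketch (recs is an illustrative name):
import org.apache.spark.sql.functions._
import spark.implicits._

val recs = spark.read.option("lineSep", "REC").text("data.txt")
  .filter(trim($"value") =!= "")                       // drop the empty chunk before the first REC
  .select(split(trim($"value"), "\n").as("lines"))     // one array element per "Name,value" line
  .select(
    split($"lines"(0), ",")(1).as("Id"),
    split($"lines"(1), ",")(1).as("Term"),
    split($"lines"(2), ",")(1).as("Rank"))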

How to extract values from json string?

I have a file with a bunch of columns, and one column called jsonstring is of string type and contains JSON strings. Let's say the format is the following:
{
"key1": "value1",
"key2": {
"level2key1": "level2value1",
"level2key2": "level2value2"
}
}
I want to parse this column with something like jsonstring.key1 and jsonstring.key2.level2key1, returning value1 and level2value1.
How can I do that in Scala or Spark SQL?
With Spark 2.2 you could use the function from_json which does the JSON parsing for you.
from_json(e: Column, schema: String, options: Map[String, String]): Column parses a column containing a JSON string into a StructType or ArrayType of StructTypes with the specified schema.
Combined with the support for flattening nested columns using * (star), that seems the best solution.
// the input dataset (just a single JSON blob)
val jsonstrings = Seq("""{
  "key1": "value1",
  "key2": {
    "level2key1": "level2value1",
    "level2key2": "level2value2"
  }
}""").toDF("jsonstring")

// define the schema of the JSON messages
import org.apache.spark.sql.types._
val key2schema = new StructType()
  .add($"level2key1".string)
  .add($"level2key2".string)
val schema = new StructType()
  .add($"key1".string)
  .add("key2", key2schema)

scala> schema.printTreeString
root
|-- key1: string (nullable = true)
|-- key2: struct (nullable = true)
| |-- level2key1: string (nullable = true)
| |-- level2key2: string (nullable = true)

val messages = jsonstrings
  .select(from_json($"jsonstring", schema) as "json")
  .select("json.*") // <-- flattening nested fields
scala> messages.show(truncate = false)
+------+---------------------------+
|key1 |key2 |
+------+---------------------------+
|value1|[level2value1,level2value2]|
+------+---------------------------+
scala> messages.select("key1", "key2.*").show(truncate = false)
+------+------------+------------+
|key1 |level2key1 |level2key2 |
+------+------------+------------+
|value1|level2value1|level2value2|
+------+------------+------------+
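If only the two fields asked about are needed, the nested paths can also be selected directly instead of flattening everything with json.*; a short sketch reusing the jsonstrings and schema values defined above:
jsonstrings
  .select(from_json($"jsonstring", schema) as "json")
  .select($"json.key1".as("key1"), $"json.key2.level2key1".as("level2key1"))
  .show(truncate = false)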
You can use withColumn + udf + json4s:
import org.json4s.{DefaultFormats, MappingException}
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.functions._

def getJsonContent(jsonstring: String): (String, String) = {
  implicit val formats = DefaultFormats
  val parsedJson = parse(jsonstring)
  val value1 = (parsedJson \ "key1").extract[String]
  val level2value1 = (parsedJson \ "key2" \ "level2key1").extract[String]
  (value1, level2value1)
}

val getJsonContentUDF = udf((jsonstring: String) => getJsonContent(jsonstring))

df.withColumn("parsedJson", getJsonContentUDF(df("jsonstring")))