Spark: 3.0.0
Scala: 2.12.8
My data frame has a column containing a JSON string, and I want to create a new StructType column from it.
temp_json_string
{"name":"test","id":"12","category":[{"products":["A","B"],"displayName":"test_1","displayLabel":"test1"},{"products":["C"],"displayName":"test_2","displayLabel":"test2"}],"createdAt":"","createdBy":""}
root
|-- temp_json_string: string (nullable = true)
Formatted JSON:
{
  "name":"test",
  "id":"12",
  "category":[
    {
      "products":[
        "A",
        "B"
      ],
      "displayName":"test_1",
      "displayLabel":"test1"
    },
    {
      "products":[
        "C"
      ],
      "displayName":"test_2",
      "displayLabel":"test2"
    }
  ],
  "createdAt":"",
  "createdBy":""
}
I want to create a new column of type Struct so I tried:
dataFrame
.withColumn("temp_json_struct", struct(col("temp_json_string")))
.select("temp_json_struct")
Now, I get the schema as:
root
|-- temp_json_struct: struct (nullable = false)
| |-- temp_json_string: string (nullable = true)
Desired result:
root
|-- temp_json_struct: struct (nullable = false)
| |-- name: string (nullable = true)
| |-- category: array (nullable = true)
| | |-- products: array (nullable = true)
| | |-- displayName: string (nullable = true)
| | |-- displayLabel: string (nullable = true)
| |-- createdAt: timestamp (nullable = true)
| |-- updatedAt: timestamp (nullable = true)
json_str_col is the column that holds the JSON string. I had multiple files, so the first line iterates through each row to extract the schema. If you know your schema up front, just replace json_schema with it.
from pyspark.sql.functions import col, from_json

# infer the schema by scanning every JSON string in the column
json_schema = spark.read.json(df.rdd.map(lambda row: row.json_str_col)).schema
df = df.withColumn('new_col', from_json(col('json_str_col'), json_schema))
// import spark implicits for conversion to dataset (.as[String])
import spark.implicits._
val df = ??? // create your dataframe containing the 'temp_json_string' column
// convert Dataset[Row] (a.k.a. DataFrame) to Dataset[String]
val ds = df.select("temp_json_string").as[String]
// read the strings as JSON; Spark infers the schema
spark.read.json(ds)
See the Spark documentation for DataFrameReader.json.
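A minimal sketch (an addition, assuming the question's column name) that combines the two steps: infer the schema once, then attach the parsed struct with from_json next to the original string column.
import org.apache.spark.sql.functions.{col, from_json}

val inferredSchema = spark.read.json(df.select("temp_json_string").as[String]).schema
val parsed = df.withColumn("temp_json_struct", from_json(col("temp_json_string"), inferredSchema))
parsed.printSchema()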
There are at least two different ways to retrieve/discover the schema for a given JSON.
For the illustration, let's create some data first:
import org.apache.spark.sql.types.StructType
import spark.implicits._ // needed below for .toDF / .toDS on a Seq
val jsData = Seq(
("""{
"name":"test","id":"12","category":[
{
"products":[
"A",
"B"
],
"displayName":"test_1",
"displayLabel":"test1"
},
{
"products":[
"C"
],
"displayName":"test_2",
"displayLabel":"test2"
}
],
"createdAt":"",
"createdBy":""}""")
)
Option 1: schema_of_json
The first option is to use the built-in function schema_of_json. The function will return the schema for the given JSON in DDL format:
val json = jsData.toDF("js").collect()(0).getString(0)
val ddlSchema: String = spark.sql(s"select schema_of_json('${json}')")
.collect()(0) //get 1st row
.getString(0) //get 1st col of the row as string
.replace("null", "string") //replace type with string, this occurs since you have "createdAt":""
// struct<category:array<struct<displayLabel:string,displayName:string,products:array<string>>>,createdAt:null,createdBy:null,id:string,name:string>
val schema: StructType = StructType.fromDDL(s"js_schema $ddlSchema")
Note that you might expect schema_of_json to also work at the column level, i.e. schema_of_json(js_col); unfortunately this doesn't work as expected, so we are forced to pass a string instead.
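As a workaround (Spark 2.4+; this is an addition rather than part of the original answer), you can stay in the DataFrame API by wrapping the sample JSON in lit, since schema_of_json only accepts a foldable argument:
import org.apache.spark.sql.functions.{lit, schema_of_json}

val ddlFromApi: String = jsData.toDF("js")
  .select(schema_of_json(lit(json)))
  .collect()(0)
  .getString(0)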
Option 2: use Spark JSON reader (recommended)
import org.apache.spark.sql.functions.from_json
val schema: StructType = spark.read.json(jsData.toDS).schema
// schema.printTreeString
// root
// |-- category: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- displayLabel: string (nullable = true)
// | | |-- displayName: string (nullable = true)
// | | |-- products: array (nullable = true)
// | | | |-- element: string (containsNull = true)
// |-- createdAt: string (nullable = true)
// |-- createdBy: string (nullable = true)
// |-- id: string (nullable = true)
// |-- name: string (nullable = true)
As you can see, here we get a StructType directly, not a DDL string as in the previous case.
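If you ever need to move between the two representations, StructType converts both ways (Spark 2.4+); a small generic illustration, not specific to the question:
import org.apache.spark.sql.types.StructType

val ddl: String = schema.toDDL                        // StructType -> DDL string
val backAgain: StructType = StructType.fromDDL(ddl)   // DDL string -> StructType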
After discovering the schema, we can move on to the next step: converting the JSON data into a struct. To achieve that we use the from_json built-in function:
jsData.toDF("js")
.withColumn("temp_json_struct", from_json($"js", schema))
.printSchema()
// root
// |-- js: string (nullable = true)
// |-- temp_json_struct: struct (nullable = true)
// | |-- category: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- displayLabel: string (nullable = true)
// | | | |-- displayName: string (nullable = true)
// | | | |-- products: array (nullable = true)
// | | | | |-- element: string (containsNull = true)
// | |-- createdAt: string (nullable = true)
// | |-- createdBy: string (nullable = true)
// | |-- id: string (nullable = true)
// | |-- name: string (nullable = true)
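Once the struct column exists, the nested fields can be pulled out with dot notation; a small usage sketch on the same data (field names come from the sample JSON above):
jsData.toDF("js")
  .withColumn("temp_json_struct", from_json($"js", schema))
  .select($"temp_json_struct.name", $"temp_json_struct.category")
  .show(false)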
Related
I have this .json file and I need to convert it into a DataFrame. The file is this:
{
  "id": "517379",
  "created_at": "2020-11-16T04:25:03Z",
  "company": "1707",
  "invoice": [
    {
      "invoice_id": "4102",
      "date": "2020-11-16T04:25:03Z",
      "id": "517379",
      "cantidad": "21992.47",
      "extra_data": {
        "column": "ASDFG",
        "credito": "Crédito"
      }
    }
  ]
}
I need it in this form, as a DataFrame:
id     | created_at           | company | invoice_id | date                 | id     | cantidad | column | credito
517379 | 2020-11-16T04:25:03Z | 1707    | 4102       | 2020-11-16T04:25:03Z | 517379 | 21992.47 | ASDFG  | Crédito
By default, Spark tries to parse each line as a separate JSON document, so if your file contains JSON objects spanning multiple lines, you will get a dataframe with a single _corrupt_record column. To solve this, you need to set the reading option multiline to true.
test.json
[
  {
    "id":"517379",
    "created_at":"2020-11-16T04:25:03Z",
    "company":"1707",
    "invoice":[
      {
        "invoice_id":"4102",
        "date":"2020-11-16T04:25:03Z",
        "id":"517379",
        "cantidad":"21992.47",
        "extra_data":{
          "column":"ASDFG",
          "credito":"Crédito"
        }
      }
    ]
  },
  {
    "id":"1234",
    "created_at":"2020-11-16T04:25:03Z",
    "company":"1707",
    "invoice":[
      {
        "invoice_id":"4102",
        "date":"2020-11-16T04:25:03Z",
        "id":"517379",
        "cantidad":"21992.47",
        "extra_data":{
          "column":"ASDFG",
          "credito":"Crédito"
        }
      }
    ]
  }
]
PySpark code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.option("multiline", "true").json("test.json")
df.printSchema()
root
|-- company: string (nullable = true)
|-- created_at: string (nullable = true)
|-- id: string (nullable = true)
|-- invoice: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- cantidad: string (nullable = true)
| | |-- date: string (nullable = true)
| | |-- extra_data: struct (nullable = true)
| | | |-- column: string (nullable = true)
| | | |-- credito: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- invoice_id: string (nullable = true)
df.show()
+-------+--------------------+------+--------------------+
|company| created_at| id| invoice|
+-------+--------------------+------+--------------------+
| 1707|2020-11-16T04:25:03Z|517379|[{21992.47, 2020-...|
| 1707|2020-11-16T04:25:03Z| 1234|[{21992.47, 2020-...|
+-------+--------------------+------+--------------------+
If you know the schema, it's better to provide it: it avoids the schema-inference pass and parses each field to the type you want.
from pyspark.sql import types as pyspark_types
from pyspark.sql import functions as pyspark_functions

schema = pyspark_types.StructType(fields=[
    pyspark_types.StructField("id", pyspark_types.StringType()),
    pyspark_types.StructField("created_at", pyspark_types.TimestampType()),
    pyspark_types.StructField("company", pyspark_types.StringType()),
    pyspark_types.StructField("invoice", pyspark_types.ArrayType(
        pyspark_types.StructType(fields=[
            pyspark_types.StructField("invoice_id", pyspark_types.StringType()),
            pyspark_types.StructField("date", pyspark_types.TimestampType()),
            pyspark_types.StructField("id", pyspark_types.StringType()),
            pyspark_types.StructField("cantidad", pyspark_types.StringType()),
            pyspark_types.StructField("extra_data", pyspark_types.StructType(fields=[
                pyspark_types.StructField("column", pyspark_types.StringType()),
                pyspark_types.StructField("credito", pyspark_types.StringType())
            ]))
        ])
    )),
])

df = spark.read.option("multiline", "true").json("test.json", schema=schema)
df.printSchema()
root
|-- id: string (nullable = true)
|-- created_at: timestamp (nullable = true)
|-- company: string (nullable = true)
|-- invoice: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- invoice_id: string (nullable = true)
| | |-- date: timestamp (nullable = true)
| | |-- id: string (nullable = true)
| | |-- cantidad: string (nullable = true)
| | |-- extra_data: struct (nullable = true)
| | | |-- column: string (nullable = true)
| | | |-- credito: string (nullable = true)
df.show()
+------+-------------------+-------+--------------------+
| id| created_at|company| invoice|
+------+-------------------+-------+--------------------+
|517379|2020-11-16 04:25:03| 1707|[{4102, 2020-11-1...|
| 1234|2020-11-16 04:25:03| 1707|[{4102, 2020-11-1...|
+------+-------------------+-------+--------------------+
# To get completely denormalized dataframe
df = df.withColumn("invoice", pyspark_functions.explode_outer("invoice")) \
.select("id", "company", "created_at",
"invoice.invoice_id", "invoice.date", "invoice.id", "invoice.cantidad",
"invoice.extra_data.*")
In the second example, Spark doesn't try to infer the schema since it's provided; instead it casts the fields to the types you specified (created_at and date in the second example are timestamps instead of strings).
I have a DataFrame with an array of structs, and I want to filter the columns, i.e. select only certain fields of the structs inside the array. Is that possible, given that I am iterating through rows?
Schema
root
|-- day: long (nullable = true)
|-- table_row: array (nullable = true)
| |-- element: struct (containsNull = true)
| |-- DATE: string (nullable = true)
| |-- ADMISSION_NUM: string (nullable = true)
| |-- SOURCE_CODE: string (nullable = true)
What I am doing is iterating through the rows. Can we select the array's struct fields row-wise? I only want to know how this is possible.
def keepColumnInarray(columns: Set[String], row: Row): Row = {
//Some
}
Example: if I want to keep only the column "DATE", then keepColumnInarray should select just that.
Output Schema
root
|-- day: long (nullable = true)
|-- table_row: array (nullable = true)
| |-- element: struct (containsNull = true)
| |-- DATE: string (nullable = true)
Spark >= 2.4
df.withColumn("table_row", expr("TRANSFORM(table_row, x -> named_struct('DATE', x.DATE))"))
this will convert
table_row: array
struct (nullable = true)
| |-- DATE: string (nullable = true)
| |-- ADMISSION_NUM: string (nullable = true)
| |-- SOURCE_CODE: string (nullable = true)
to
table_row: array
struct (nullable = true)
| |-- DATE: string (nullable = true)
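A hedged variant of the same expression (still Spark >= 2.4) that keeps two fields instead of one; the field names are taken from the schema shown in the question:
import org.apache.spark.sql.functions.expr

df.withColumn("table_row",
  expr("TRANSFORM(table_row, x -> named_struct('DATE', x.DATE, 'ADMISSION_NUM', x.ADMISSION_NUM))"))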
Update-1 (based on comments)
Spark < 2.4
Use the UDF below to select columns:
import scala.collection.mutable

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.functions.{array, col, expr, lit, udf}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

val df = spark.range(2).withColumnRenamed("id", "day")
  .withColumn("table_row", expr("array(named_struct('DATE', 'sample_date'," +
    " 'ADMISSION_NUM', 'sample_adm_num', 'SOURCE_CODE', 'sample_source_code'))"))
df.show(false)
df.printSchema()
//
// +---+---------------------------------------------------+
// |day|table_row |
// +---+---------------------------------------------------+
// |0 |[[sample_date, sample_adm_num, sample_source_code]]|
// |1 |[[sample_date, sample_adm_num, sample_source_code]]|
// +---+---------------------------------------------------+
//
// root
// |-- day: long (nullable = false)
// |-- table_row: array (nullable = false)
// | |-- element: struct (containsNull = false)
// | | |-- DATE: string (nullable = false)
// | | |-- ADMISSION_NUM: string (nullable = false)
// | | |-- SOURCE_CODE: string (nullable = false)
//
def keepColumnInarray(columnsToKeep: Seq[String], rows: mutable.WrappedArray[Row]) = {
rows.map(r => {
new GenericRowWithSchema(r.getValuesMap(columnsToKeep).values.toArray,
StructType(r.schema.filter(s => columnsToKeep.contains(s.name))))
})
}
val keepColumns = udf((columnsToKeep: Seq[String], rows: mutable.WrappedArray[Row]) =>
keepColumnInarray(columnsToKeep, rows)
, ArrayType(StructType(StructField("DATE", StringType) :: Nil)))
val processedDF = df
.withColumn("table_row_new", keepColumns(array(lit("DATE")), col("table_row")))
processedDF.show(false)
processedDF.printSchema()
//
// +---+---------------------------------------------------+---------------+
// |day|table_row |table_row_new |
// +---+---------------------------------------------------+---------------+
// |0 |[[sample_date, sample_adm_num, sample_source_code]]|[[sample_date]]|
// |1 |[[sample_date, sample_adm_num, sample_source_code]]|[[sample_date]]|
// +---+---------------------------------------------------+---------------+
//
// root
// |-- day: long (nullable = false)
// |-- table_row: array (nullable = false)
// | |-- element: struct (containsNull = false)
// | | |-- DATE: string (nullable = false)
// | | |-- ADMISSION_NUM: string (nullable = false)
// | | |-- SOURCE_CODE: string (nullable = false)
// |-- table_row_new: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- DATE: string (nullable = true)
//
I had deeply nested JSON files to process, and in order to do that I had to flatten them, because I couldn't find a way to hash some of the deeply nested fields. This is what my dataframe looks like (after flattening):
scala> flattendedJSON.printSchema
root
|-- header_appID: string (nullable = true)
|-- header_appVersion: string (nullable = true)
|-- header_userID: string (nullable = true)
|-- body_cardId: string (nullable = true)
|-- body_cardStatus: string (nullable = true)
|-- body_cardType: string (nullable = true)
|-- header_userAgent_browser: string (nullable = true)
|-- header_userAgent_browserVersion: string (nullable = true)
|-- header_userAgent_deviceName: string (nullable = true)
|-- body_beneficiary_beneficiaryAccounts_beneficiaryAccountOwner: string (nullable = true)
|-- body_beneficiary_beneficiaryPhoneNumbers_beneficiaryPhoneNumber: string (nullable = true)
And I need to convert it back to the original structure (before flattening):
scala> nestedJson.printSchema
root
|-- header: struct (nullable = true)
| |-- appID: string (nullable = true)
| |-- appVersion: string (nullable = true)
| |-- userAgent: struct (nullable = true)
| | |-- browser: string (nullable = true)
| | |-- browserVersion: string (nullable = true)
| | |-- deviceName: string (nullable = true)
|-- body: struct (nullable = true)
| |-- beneficiary: struct (nullable = true)
| | |-- beneficiaryAccounts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryAccountOwner: string (nullable = true)
| | |-- beneficiaryPhoneNumbers: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryPhoneNumber: string (nullable = true)
| |-- cardId: string (nullable = true)
| |-- cardStatus: string (nullable = true)
| |-- cardType: string (nullable = true)
I've managed to do it with a single level of nesting, but with more levels it doesn't work and I can't find a way to do it properly. Here's what I tried:
val structColumns = flattendedJSON.columns.filter(_.contains("_"))
val structColumnsMap = structColumns.map(_.split("\\_")).
groupBy(_(0)).mapValues(_.map(_(1)))
val dfExpanded = structColumnsMap.foldLeft(flattendedJSON){ (accDF, kv) =>
val cols = kv._2.map(v => col("`" + kv._1 + "_" + v + "`").as(v))
accDF.withColumn(kv._1, struct(cols: _*))
}
val dfResult = structColumns.foldLeft(flattendedJSON)(_ drop _)
It works if I have one nested level (e.g. header_appID), but in the case of header_userAgent_browser, I get an exception:
org.apache.spark.sql.AnalysisException: cannot resolve
'header_userAgent' given input columns: ..
Using Spark 2.3 and Scala 2.11.8
I would recommend using case classes and working with a Dataset, instead of flattening the DF and then trying to convert it back to the old JSON format. Even if it has nested objects, you can define a set of case classes to cast it to. This lets you work with object notation, which makes things easier than working with a DF.
There are tools where you can provide a sample of the JSON and they generate the case classes for you (I use this one: https://json2caseclass.cleverapps.io).
If you still want to convert it from the DF, an alternative could be to create a Dataset using map on your DF. Something like this:
case class NestedNode(fieldC: String, fieldD: String) // for the nested JSON
case class MainNode(fieldA: String, fieldB: NestedNode) // for the nested JSON
case class FlattenData(fa: String, fc: String, fd: String) // the flattened shape

import spark.implicits._ // for .toDF and .as[...]

Seq(
FlattenData("A1", "B1", "C1"),
FlattenData("A2", "B2", "C2"),
FlattenData("A3", "B3", "C3")
).toDF
.as[FlattenData] // Cast it to access with object notation
.map(flattenItem=>{
MainNode(flattenItem.fa, NestedNode(flattenItem.fc, flattenItem.fd) ) // Creating output format
})
In the end, the schema defined by those classes will be used when you write: yourDS.write.mode(your_save_mode).json(your_target_path)
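If you prefer the case-class route for the question's own data, here is a hedged sketch covering just the header_* columns; the case class and value names are made up for illustration and are not from the original answer:
import spark.implicits._

case class UserAgent(browser: String, browserVersion: String, deviceName: String)
case class Header(appID: String, appVersion: String, userID: String, userAgent: UserAgent)
// models only the flattened header_* columns, to keep the example short
case class FlatHeader(header_appID: String, header_appVersion: String, header_userID: String,
                      header_userAgent_browser: String, header_userAgent_browserVersion: String,
                      header_userAgent_deviceName: String)

val rebuiltHeader = flattendedJSON
  .as[FlatHeader]   // extra columns in the DF are simply ignored by the encoder
  .map(f => Header(f.header_appID, f.header_appVersion, f.header_userID,
    UserAgent(f.header_userAgent_browser, f.header_userAgent_browserVersion, f.header_userAgent_deviceName)))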
I have written a sample Spark app where I'm creating a dataframe with a MapType column and writing it to disk. Then I'm reading the same file and printing its schema. But the output file's schema is different from the input schema, and I don't see the MapType in the output. How can I read that output file with the MapType?
Code
import org.apache.spark.sql.{SaveMode, SparkSession}
case class Department(Id:String,Description:String)
case class Person(name:String,department:Map[String,Department])
object sample {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder.master("local").appName("Custom Poc").getOrCreate
import spark.implicits._
val schemaData = Seq(
Person("Persion1", Map("It" -> Department("1", "It Department"), "HR" -> Department("2", "HR Department"))),
Person("Persion2", Map("It" -> Department("1", "It Department")))
)
val df = spark.sparkContext.parallelize(schemaData).toDF()
println("Input schema")
df.printSchema()
df.write.mode(SaveMode.Overwrite).json("D:\\save\\output")
println("Output schema")
spark.read.json("D:\\save\\output\\*.json").printSchema()
}
}
Output
Input schema
root
|-- name: string (nullable = true)
|-- department: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- Id: string (nullable = true)
| | |-- Description: string (nullable = true)
Output schema
root
|-- department: struct (nullable = true)
| |-- HR: struct (nullable = true)
| | |-- Description: string (nullable = true)
| | |-- Id: string (nullable = true)
| |-- It: struct (nullable = true)
| | |-- Description: string (nullable = true)
| | |-- Id: string (nullable = true)
|-- name: string (nullable = true)
Json File
{"name":"Persion1","department":{"It":{"Id":"1","Description":"It Department"},"HR":{"Id":"2","Description":"HR Department"}}}
{"name":"Persion2","department":{"It":{"Id":"1","Description":"It Department"}}}
EDIT:
I added the file-saving part above just to explain my requirement. In the actual scenario I will only be reading the JSON data shown above and working on that dataframe.
You can pass the schema from the previous dataframe while reading the JSON data:
println("Input schema")
df.printSchema()
df.write.mode(SaveMode.Overwrite).json("D:\\save\\output")
println("Output schema")
spark.read.schema(df.schema).json("D:\\save\\output")
Input schema
root
|-- name: string (nullable = true)
|-- department: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- Id: string (nullable = true)
| | |-- Description: string (nullable = true)
Output schema
root
|-- name: string (nullable = true)
|-- department: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- Id: string (nullable = true)
| | |-- Description: string (nullable = true)
Hope this helps!
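For the edited requirement (only reading the JSON, with no previous dataframe to copy the schema from), here is a minimal sketch that builds the MapType schema by hand and passes it to the reader; the field names follow the JSON above, everything else is an assumption:
import org.apache.spark.sql.types._

val departmentSchema = StructType(Seq(
  StructField("Id", StringType),
  StructField("Description", StringType)))

val personSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("department", MapType(StringType, departmentSchema))))

// department now comes back as map<string, struct> instead of a struct of structs
spark.read.schema(personSchema).json("D:\\save\\output").printSchema()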
Data structure:
{"Emp":{"Name":"John", "Sal":"2000", "Address":[{"loc":"Sanjose","Zip":"222"},{"loc":"dayton","Zip":"333"}]}}
Now I want to load the data into a data frame and append Zip to loc. The loc column name should stay the same (loc). The transformed data should look like this:
{"Emp":{"Name":"John", "Sal":"2000", "Address":[{"loc":"Sanjose222","Zip":"222"},{"loc":"dayton333","Zip":"333"}]}}
No RDDs. I need a data frame operation to achieve this, preferably with the withColumn function. How can I do this?
Given a data structure as
val jsonString = """{"Emp":{"Name":"John","Sal":"2000","Address":[{"loc":"Sanjose","Zip":"222"},{"loc":"dayton","Zip":"333"}]}}"""
You can convert it to a dataframe as:
val df = spark.read.json(sc.parallelize(jsonString::Nil))
which would give you
+-----------------------------------------------------+
|Emp |
+-----------------------------------------------------+
|[WrappedArray([222,Sanjose], [333,dayton]),John,2000]|
+-----------------------------------------------------+
//root
// |-- Emp: struct (nullable = true)
// | |-- Address: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- Zip: string (nullable = true)
// | | | |-- loc: string (nullable = true)
// | |-- Name: string (nullable = true)
// | |-- Sal: string (nullable = true)
Now, to get the desired output, you need to split the Emp struct column into separate columns and pass the Address array column to a udf function:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._

// append Zip to loc for every element of the Address array
def attachZipWithLoc = udf((array: Seq[Row]) =>
  array.map(row => address(row.getAs[String]("loc") + row.getAs[String]("Zip"), row.getAs[String]("Zip"))))
df.select($"Emp.*")
.withColumn("Address", attachZipWithLoc($"Address"))
.select(struct($"Name".as("Name"), $"Sal".as("Sal"), $"Address".as("Address")).as("Emp"))
where address used in the udf is a case class:
case class address(loc: String, Zip: String)
which should give you
+-----------------------------------------------------------+
|Emp |
+-----------------------------------------------------------+
|[John,2000,WrappedArray([Sanjose222,222], [dayton333,333])]|
+-----------------------------------------------------------+
//root
// |-- Emp: struct (nullable = false)
// | |-- Name: string (nullable = true)
// | |-- Sal: string (nullable = true)
// | |-- Address: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- loc: string (nullable = true)
// | | | |-- Zip: string (nullable = true)
Now, to get the JSON back you can just use .toJSON, and you should get:
+-----------------------------------------------------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------------------------------------------------+
|{"Emp":{"Name":"John","Sal":"2000","Address":[{"loc":"Sanjose222","Zip":"222"},{"loc":"dayton333","Zip":"333"}]}}|
+-----------------------------------------------------------------------------------------------------------------+
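On Spark 2.4+, a UDF-free alternative is the higher-order transform function; a hedged sketch of the same idea (an addition, not part of the original answer):
import org.apache.spark.sql.functions.{expr, struct}
import spark.implicits._

val result = df.select(
  struct(
    $"Emp.Name".as("Name"),
    $"Emp.Sal".as("Sal"),
    expr("transform(Emp.Address, a -> named_struct('loc', concat(a.loc, a.Zip), 'Zip', a.Zip))").as("Address")
  ).as("Emp"))

result.toJSON.show(false)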