How to convert keys in JSON to lower case? - scala

I have a streaming query that reads data from Kafka in JSON format with 1000+ keys in camel case.
scala> kafka_df.printSchema()
root
|-- jsonData: struct (nullable = true)
| |-- header: struct (nullable = true)
| | |-- batch_id: string (nullable = true)
| | |-- entity: string (nullable = true)
| | |-- time: integer (nullable = true)
| | |-- key: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- message_type: string (nullable = true)
| |-- body: string (nullable = true)
How can I change the keys to lower case recursively and convert the result back to a DataFrame so that I can write it using writeStream?

Try this:
import org.apache.spark.sql.types.{StructField, StructType}

// Recursively lower-case every field name in the schema (nested structs included)
def columnsToLowercase(schema: StructType): StructType = {
  def recurRename(schema: StructType): Seq[StructField] =
    schema.fields.map {
      case StructField(name, dtype: StructType, nullable, meta) =>
        StructField(name.toLowerCase, StructType(recurRename(dtype)), nullable, meta)
      case StructField(name, dtype, nullable, meta) =>
        StructField(name.toLowerCase, dtype, nullable, meta)
    }
  StructType(recurRename(schema))
}
// Re-apply the renamed schema to the same rows
val newDF = sparkSession.createDataFrame(dataFrame.rdd, columnsToLowercase(dataFrame.schema))
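One caveat: calling .rdd on a streaming DataFrame is, as far as I know, rejected by Spark, so the createDataFrame route is mainly useful for batch data. A possible alternative is to cast each column to a type with lower-cased field names, relying on struct-to-struct casts matching fields by position rather than by name. This is only a sketch; `lowerCaseType`, `lowerCaseColumns`, and the sample case classes are made-up names for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Recursively lower-case field names in structs, including structs inside arrays.
def lowerCaseType(dt: DataType): DataType = dt match {
  case st: StructType =>
    StructType(st.fields.map(f =>
      f.copy(name = f.name.toLowerCase, dataType = lowerCaseType(f.dataType))))
  case ArrayType(elem, containsNull) => ArrayType(lowerCaseType(elem), containsNull)
  case other => other
}

// Cast every top-level column to the renamed type and alias it; no RDD round trip.
def lowerCaseColumns(df: DataFrame): DataFrame =
  df.select(df.schema.fields.map { f =>
    col(s"`${f.name}`").cast(lowerCaseType(f.dataType)).as(f.name.toLowerCase)
  }: _*)

// Hypothetical sample data for a local check (not the asker's real schema).
case class MsgHeader(batchId: String, messageType: String)
case class Payload(Header: MsgHeader, Body: String)

val spark = SparkSession.builder().master("local[1]").appName("lowercase-keys").getOrCreate()
import spark.implicits._

val sample = Seq(Payload(MsgHeader("b1", "t1"), "x")).toDF()
val lowered = lowerCaseColumns(sample)
lowered.printSchema()
```

Because this is a plain select, it should also be usable on a DataFrame produced by readStream before handing it to writeStream.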

Related

parsing complex nested json in Spark scala

I have a complex JSON file with the schema below, which I need to convert to a DataFrame in Spark. Since the schema is complex, I am unable to do it completely.
The JSON file has a very complex schema, and using explode with column select might be problematic.
Below is the schema which I am trying to convert:
root
|-- data: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- meta: struct (nullable = true)
| |-- view: struct (nullable = true)
| | |-- approvals: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- reviewedAt: long (nullable = true)
| | | | |-- reviewedAutomatically: boolean (nullable = true)
| | | | |-- state: string (nullable = true)
| | | | |-- submissionDetails: struct (nullable = true)
| | | | | |-- permissionType: string (nullable =
I have used the code below to flatten the data, but there is still nested data which I need to flatten into columns:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

def flattenStructSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap { f =>
    val columnName = if (prefix == null) f.name else prefix + "." + f.name
    f.dataType match {
      case st: StructType => flattenStructSchema(st, columnName)
      case _ => Array(col(columnName).as(columnName.replace(".", "_")))
    }
  }
}
val df2 = df.select(col("meta"))
val df4 = df.select(col("data"))
// show() returns Unit, so keep the DataFrame itself in df3
val df3 = df2.select(flattenStructSchema(df2.schema): _*)
df3.printSchema()
df3.show(10, false)

Spark - Flatten Array of Structs using flatMap

I have a df with schema -
root
|-- arrayCol: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- email: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- qty: long (nullable = true)
| | |-- rqty: long (nullable = true)
| | |-- pids: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- sqty: long (nullable = true)
| | |-- id1: string (nullable = true)
| | |-- id2: string (nullable = true)
| | |-- window: struct (nullable = true)
| | | |-- end: string (nullable = true)
| | | |-- start: string (nullable = true)
| | |-- otherId: string (nullable = true)
|-- primarykey: string (nullable = true)
|-- runtime: string (nullable = true)
I don't want to use explode as it's extremely slow, and wanted to try flatMap instead.
I tried doing -
val ds = df1.as[(Array[StructType], String, String)]
ds.flatMap{ case(x, y, z) => x.map((_, y, z))}.toDF()
This gives me error -
scala.MatchError: org.apache.spark.sql.types.StructType
How do I flatten arrayCol?
Sample data -
{
"primaryKeys":"sfdfrdsdjn",
"runtime":"2020-10-31T13:01:04.813Z",
"arrayCol":[{"id":"qwerty","id1":"dsfdsfdsf","window":{"start":"2020-11-01T10:30:00Z","end":"2020-11-01T12:30:00Z"}, "email":[],"id2":"sdfsdfsdPuyOplzlR1idvfPkv5138g","rqty":3,"sqty":3,"qty":3,"otherId":null}]
}
Expected Output -
primaryKey runtime arrayCol
sfdfrdsdjn 2020-10-31T13:01:04.813Z {"id":"qwerty","id1":"dsfdsfdsf","window":{"start":"2020-11-01T10:30:00Z","end":"2020-11-01T12:30:00Z"}, "email":[],"id2":"sdfsdfsdPuyOplzlR1idvfPkv5138g","rqty":3,"sqty":3,"qty":3,"otherId":null}
I want one row for every element in arrayCol. Just like explode(arrayCol)
You almost had it. When using Spark with Scala, try to use the Dataset API as often as possible. This not only increases readability, but also helps solve these types of issues very quickly.
// toDS() and as[...] need the session implicits: import spark.implicits._
case class ArrayColWindow(end: String, start: String)
case class ArrayCol(id: String, email: Seq[String], qty: Long, rqty: Long, pids: Seq[String],
                    sqty: Long, id1: String, id2: String, window: ArrayColWindow, otherId: String)
case class FullArrayCols(arrayCol: Seq[ArrayCol], primarykey: String, runtime: String)
val inputTest = List(
  FullArrayCols(Seq(ArrayCol("qwerty", Seq(), 3, 3, Seq(), 3, "dsfdsfdsf", "sdfsdfsdPuyOplzlR1idvfPkv5138g",
    // named arguments, because the case class declares end before start
    ArrayColWindow(end = "2020-11-01T12:30:00Z", start = "2020-11-01T10:30:00Z"), null)),
    "sfdfrdsdjn", "2020-10-31T13:01:04.813Z")
).toDS()
val output = inputTest.as[(Seq[ArrayCol], String, String)].flatMap { case (x, y, z) => x.map((_, y, z)) }
output.show(truncate = false)
Alternatively, you could just change
val ds = df1.as[(Array[StructType], String, String)]
to
val ds = df1.as[(Array[String], String, String)]
and you can get rid of the error and see the output you want.

Convert flattened data frame to struct in Spark

I had deeply nested JSON files to process, and in order to do that I had to flatten them because I couldn't find a way to hash some deeply nested fields. This is what my DataFrame looks like (after flattening):
scala> flattendedJSON.printSchema
root
|-- header_appID: string (nullable = true)
|-- header_appVersion: string (nullable = true)
|-- header_userID: string (nullable = true)
|-- body_cardId: string (nullable = true)
|-- body_cardStatus: string (nullable = true)
|-- body_cardType: string (nullable = true)
|-- header_userAgent_browser: string (nullable = true)
|-- header_userAgent_browserVersion: string (nullable = true)
|-- header_userAgent_deviceName: string (nullable = true)
|-- body_beneficiary_beneficiaryAccounts_beneficiaryAccountOwner: string (nullable = true)
|-- body_beneficiary_beneficiaryPhoneNumbers_beneficiaryPhoneNumber: string (nullable = true)
And I need to convert it back to original structure (before flattening):
scala> nestedJson.printSchema
root
|-- header: struct (nullable = true)
| |-- appID: string (nullable = true)
| |-- appVersion: string (nullable = true)
| |-- userAgent: struct (nullable = true)
| | |-- browser: string (nullable = true)
| | |-- browserVersion: string (nullable = true)
| | |-- deviceName: string (nullable = true)
|-- body: struct (nullable = true)
| |-- beneficiary: struct (nullable = true)
| | |-- beneficiaryAccounts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryAccountOwner: string (nullable = true)
| | |-- beneficiaryPhoneNumbers: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryPhoneNumber: string (nullable = true)
| |-- cardId: string (nullable = true)
| |-- cardStatus: string (nullable = true)
| |-- cardType: string (nullable = true)
I've managed to do it with a single level of nesting, but it doesn't work with more levels, and I can't find a way to do it properly. Here's what I tried:
val structColumns = flattendedJSON.columns.filter(_.contains("_"))
val structColumnsMap = structColumns.map(_.split("\\_")).
  groupBy(_(0)).mapValues(_.map(_(1)))
val dfExpanded = structColumnsMap.foldLeft(flattendedJSON) { (accDF, kv) =>
  val cols = kv._2.map(v => col("`" + kv._1 + "_" + v + "`").as(v))
  accDF.withColumn(kv._1, struct(cols: _*))
}
// Drop the flat columns from the expanded DataFrame, not from the original one
val dfResult = structColumns.foldLeft(dfExpanded)(_ drop _)
It works if I have one nested object (e.g. header_appID), but in the case of header_userAgent_browser, I get an exception:
org.apache.spark.sql.AnalysisException: cannot resolve
'header_userAgent' given input columns: ..
Using Spark 2.3 and Scala 2.11.8
I would recommend using case classes to work with a Dataset instead of flattening the DF and then trying to convert it back to the old JSON format. Even if it has nested objects, you can define a set of case classes to cast it to. That lets you work with object notation, which makes things easier than with a DF.
There are tools where you can provide a sample of the JSON and it generates the classes for you (I use this: https://json2caseclass.cleverapps.io).
If you still want to convert it from the DF, an alternative could be to create a Dataset using map on your DF. Something like this:
case class NestedNode(fieldC: String, fieldD: String) // for JSON
case class MainNode(fieldA: String, fieldB: NestedNode) // for JSON
case class FlattenData(fa: String, fc: String, fd: String)
Seq(
  FlattenData("A1", "B1", "C1"),
  FlattenData("A2", "B2", "C2"),
  FlattenData("A3", "B3", "C3")
).toDF
  .as[FlattenData] // Cast it to access with object notation
  .map { flattenItem =>
    MainNode(flattenItem.fa, NestedNode(flattenItem.fc, flattenItem.fd)) // Creating output format
  }
At the end, the schema defined by these classes will be used when you call yourDS.write.mode(your_save_mode).json(your_target_path).
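If you would rather rebuild the structs generically from the flattened column names, here is a rough sketch that splits every name on "_" and nests the pieces recursively with struct. It assumes that "_" only ever acts as the level separator and that no column is both a leaf and a prefix of another column. It produces plain structs only, so the arrays from the original schema (e.g. beneficiaryAccounts) are not restored. `nest` and `unflatten` are made-up helper names, and the sample row is invented.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct}

// Each entry pairs the remaining name segments with the original flat column name.
// Assumes no name is both a leaf and a prefix (e.g. "a" together with "a_b").
def nest(paths: Seq[(List[String], String)]): Seq[Column] =
  paths.map(_._1.head).distinct.map { head =>
    val group = paths.filter(_._1.head == head)
    group match {
      // A single entry with one segment left is a leaf column
      case Seq((_ :: Nil, flat)) => col(s"`$flat`").as(head)
      // Otherwise strip the shared segment and build a struct from the rest
      case _ =>
        struct(nest(group.map { case (segs, flat) => (segs.tail, flat) }): _*).as(head)
    }
  }

def unflatten(df: DataFrame): DataFrame =
  df.select(nest(df.columns.toSeq.map(c => (c.split("_").toList, c))): _*)

val spark = SparkSession.builder().master("local[1]").appName("unflatten").getOrCreate()
import spark.implicits._

// Invented sample row using a few of the column names from the question
val flat = Seq(("app", "chrome", "88", "c1"))
  .toDF("header_appID", "header_userAgent_browser", "header_userAgent_browserVersion", "body_cardId")
val nested = unflatten(flat)
nested.printSchema()
```

Since it works from df.columns at any depth, it avoids the "cannot resolve 'header_userAgent'" problem of taking only the first two segments.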

Schema conversion from String to Array[StructType] using Spark Scala

I have the sample data shown below, and I need to convert two columns (ABS, ALT) from string to Array[StructType] using Spark Scala code. Any help would be much appreciated.
With the help of a UDF, I was able to convert from string to ArrayType, but I need some help converting from string to Array[StructType] for these two columns (ABS, ALT).
VIN          TT  MSG_TYPE  ABS                          ALT
MSGXXXXXXXX  1   SIGL      [{"E":1569XXXXXXX,"V":0.0}]  [{"E":156957XXXXXX,"V":0.0}]
df.currentSchema
root
|-- VIN: string (nullable = true)
|-- TT: long (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- ABS: string (nullable = true)
|-- ALT: string (nullable = true)
df.expectedSchema:
|-- VIN: string (nullable = true)
|-- TT: long (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- ABS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- ALT: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
It also works if you try as below:
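import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{ArrayType, DoubleType, LongType, StructField, StructType}
import spark.implicits._ // for the $"..." column syntax; assumes a SparkSession named spark
val schema = ArrayType(StructType(Seq(StructField("E", LongType), StructField("V", DoubleType))))
val final_df = newDF.withColumn("ABS", from_json($"ABS", schema)).withColumn("ALT", from_json($"ALT", schema))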
final_df.printSchema:
root
|-- VIN: string (nullable = true)
|-- TT: string (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- ABS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = false)
|-- ALT: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = false)
You can use a UDF to parse the JSON and transform it into arrays of structs.
First, define a function that parses the JSON (based on this answer):
import scala.util.parsing.json.JSON

case class Data(E: String, V: Double)
// Extractor helpers that cast the untyped parser output to the expected type
class CC[T] extends Serializable { def unapply(a: Any): Option[T] = Some(a.asInstanceOf[T]) }
object M extends CC[Map[String, Any]]
object L extends CC[List[Any]]
object S extends CC[String]
object D extends CC[Double]
def toStruct(in: String): Array[Data] = {
  if (in == null || in.isEmpty) return new Array[Data](0)
  val result = for {
    Some(L(map)) <- List(JSON.parseFull(in))
    M(data) <- map
    S(e) = data("E") // assumes "E" is a JSON string; a numeric "E" is parsed as Double and would not fit S
    D(v) = data("V")
  } yield {
    Data(e, v)
  }
  result.toArray
}
This function returns an array of Data objects that already have the correct structure. Now we use this function to define a UDF:
val ts: String => Array[Data] = toStruct(_)
import org.apache.spark.sql.functions.udf
val toStructUdf = udf(ts)
Finally we call the udf (for example in a select statement):
val df = ...
val newdf = df.select('VIN, 'TT, 'MSG_TYPE, toStructUdf('ABS).as("ABS"), toStructUdf('ALT).as("ALT"))
newdf.printSchema()
Output:
root
|-- VIN: string (nullable = true)
|-- TT: string (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- ABS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: string (nullable = true)
| | |-- V: double (nullable = false)
|-- ALT: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: string (nullable = true)
| | |-- V: double (nullable = false)

Rename nested struct columns in a Spark DataFrame [duplicate]

I am trying to change the names of a DataFrame's columns in Scala. I can easily change the column names for top-level fields, but I'm having difficulty renaming the fields of nested struct columns.
Below is my DataFrame schema.
|-- _VkjLmnVop: string (nullable = true)
|-- _KaTasLop: string (nullable = true)
|-- AbcDef: struct (nullable = true)
| |-- UvwXyz: struct (nullable = true)
| | |-- _MnoPqrstUv: string (nullable = true)
| | |-- _ManDevyIxyz: string (nullable = true)
But I need the schema like below
|-- vkj_lmn_vop: string (nullable = true)
|-- ka_tas_lop: string (nullable = true)
|-- abc_def: struct (nullable = true)
| |-- uvw_xyz: struct (nullable = true)
| | |-- mno_pqrst_uv: string (nullable = true)
| | |-- man_devy_ixyz: string (nullable = true)
For non-struct columns, I'm changing the column names as below:
def aliasAllColumns(df: DataFrame): DataFrame = {
  df.select(df.columns.map { c =>
    df.col(c)
      .as(
        c.replaceAll("_", "")
          .replaceAll("([A-Z])", "_$1")
          .toLowerCase
          .replaceFirst("_", ""))
  }: _*)
}
aliasAllColumns(file_data_df).show(1)
How can I change the struct column names dynamically?
You can create a recursive method to traverse the DataFrame schema for renaming the columns:
import org.apache.spark.sql.types._
def renameAllCols(schema: StructType, rename: String => String): StructType = {
  def recurRename(schema: StructType): Seq[StructField] = schema.fields.map {
    case StructField(name, dtype: StructType, nullable, meta) =>
      StructField(rename(name), StructType(recurRename(dtype)), nullable, meta)
    case StructField(name, dtype: ArrayType, nullable, meta) if dtype.elementType.isInstanceOf[StructType] =>
      StructField(rename(name), ArrayType(StructType(recurRename(dtype.elementType.asInstanceOf[StructType])), true), nullable, meta)
    case StructField(name, dtype, nullable, meta) =>
      StructField(rename(name), dtype, nullable, meta)
  }
  StructType(recurRename(schema))
}
Testing it with the following example:
import org.apache.spark.sql.functions._
import spark.implicits._
val renameFcn = (s: String) =>
  s.replace("_", "").replaceAll("([A-Z])", "_$1").toLowerCase.dropWhile(_ == '_')
case class C(A_Bc: Int, D_Ef: Int)
val df = Seq(
  (10, "a", C(1, 2), Seq(C(11, 12), C(13, 14)), Seq(101, 102)),
  (20, "b", C(3, 4), Seq(C(15, 16)), Seq(103))
).toDF("_VkjLmnVop", "_KaTasLop", "AbcDef", "ArrStruct", "ArrInt")
val newDF = spark.createDataFrame(df.rdd, renameAllCols(df.schema, renameFcn))
newDF.printSchema
// root
// |-- vkj_lmn_vop: integer (nullable = false)
// |-- ka_tas_lop: string (nullable = true)
// |-- abc_def: struct (nullable = true)
// | |-- a_bc: integer (nullable = false)
// | |-- d_ef: integer (nullable = false)
// |-- arr_struct: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- a_bc: integer (nullable = false)
// | | |-- d_ef: integer (nullable = false)
// |-- arr_int: array (nullable = true)
// | |-- element: integer (containsNull = false)
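To see what the rename function does on its own, here is a quick check in plain Scala (no Spark needed) using names from the question. The last input, "userID", is my own addition to show that every capital letter starts a new segment, so acronyms get split letter by letter.

```scala
// Same rename function as in the answer: strip underscores, prefix each
// capital letter with "_", lower-case everything, drop leading underscores.
val renameFcn = (s: String) =>
  s.replace("_", "").replaceAll("([A-Z])", "_$1").toLowerCase.dropWhile(_ == '_')

val examples = Seq("_VkjLmnVop", "AbcDef", "_ManDevyIxyz", "userID")
examples.foreach(n => println(s"$n -> ${renameFcn(n)}"))
// _VkjLmnVop -> vkj_lmn_vop
// AbcDef -> abc_def
// _ManDevyIxyz -> man_devy_ixyz
// userID -> user_i_d
```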
As far as I know, it's not possible to rename nested fields directly.
One option is to move to a flat object.
However, if you need to keep the structure, you can use spark.sql.functions.struct(*cols):
Creates a new struct column.
Parameters: cols – list of column names (string) or list of Column expressions
You will need to decompose the whole schema, generate the aliases that you need, and then compose it again using the struct function.
It's not the best solution, but it's something :)
PS: I'm attaching the PySpark doc since it contains a better explanation than the Scala one.
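In Scala, the decompose-and-recompose idea could look roughly like the sketch below. It walks the schema, aliases each leaf, and wraps nested fields back up with struct. `renamedCols`, `toSnake`, and the sample case classes are made-up names for illustration. Arrays of structs are left untouched here, and a null struct would come back as a struct of nulls, so treat it as a starting point only.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.StructType

// Build one Column per field: leaves are aliased, structs are rebuilt recursively.
def renamedCols(schema: StructType, rename: String => String, prefix: String = ""): Seq[Column] =
  schema.fields.toSeq.map { f =>
    // Back-tick each segment so unusual field names still resolve
    val path = if (prefix.isEmpty) s"`${f.name}`" else s"$prefix.`${f.name}`"
    f.dataType match {
      case st: StructType => struct(renamedCols(st, rename, path): _*).as(rename(f.name))
      case _              => col(path).as(rename(f.name))
    }
  }

// Hypothetical sample data, loosely following the schema in the question
case class Inner(MnoPqrstUv: String)
case class Outer(UvwXyz: Inner)

val spark = SparkSession.builder().master("local[1]").appName("rename-nested").getOrCreate()
import spark.implicits._

val df = Seq(("a", Outer(Inner("x")))).toDF("_VkjLmnVop", "AbcDef")
val toSnake = (s: String) =>
  s.replace("_", "").replaceAll("([A-Z])", "_$1").toLowerCase.dropWhile(_ == '_')

val out = df.select(renamedCols(df.schema, toSnake): _*)
out.printSchema()
```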