Array of String to Array of Struct in Scala + Spark

I am currently using Spark and Scala 2.11.8
I have the following schema:
root
|-- partnumber: string (nullable = true)
|-- brandlabel: string (nullable = true)
|-- availabledate: string (nullable = true)
|-- descriptions: array (nullable = true)
| |-- element: string (containsNull = true)
I am trying to use a UDF to convert it to the following:
root
|-- partnumber: string (nullable = true)
|-- brandlabel: string (nullable = true)
|-- availabledate: string (nullable = true)
|-- description: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- value: string (nullable = true)
| | |-- code: string (nullable = true)
| | |-- cost: integer (nullable = true)
So source data looks like this:
[WrappedArray(a abc 100,b abc 300)]
[WrappedArray(c abc 400)]
I need to use " " (space) as a delimiter, but don't know how to do this in scala.
def convert(product: Seq[String]): Seq[Row] = {
  ???
}
I am fairly new to Scala, so can someone guide me on how to construct this type of function?
Thanks.

I do not know if I understand your problem correctly, but map could be your friend.
// a custom case class (not org.apache.spark.sql.Row)
case class Row(a: String, b: String, c: Int)

val value = List(List("a", "abc", 123), List("b", "bcd", 321))
value map {
  case List(a: String, b: String, c: Int) => Row(a, b, c)
}
If you have to parse it first:
val value2 = List("a b 123", "c d 345")
value2 map { s =>
  val split = s.split(" ")
  Row(split(0), split(1), split(2).toInt)
}
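To apply this to the DataFrame itself, a UDF along these lines might work. This is a minimal sketch: the Description case class, the descriptions column name, and the DataFrame name df are assumptions based on the schemas shown in the question.
import org.apache.spark.sql.functions.{col, udf}

// hypothetical element type matching the desired struct (value, code, cost)
case class Description(value: String, code: String, cost: Int)

// split each "value code cost" string on spaces and build one struct per element
val convert = udf { (product: Seq[String]) =>
  product.map { s =>
    val parts = s.split(" ")
    Description(parts(0), parts(1), parts(2).toInt)
  }
}

// df is assumed to be the original DataFrame; the column becomes array<struct<value,code,cost>>
val converted = df.withColumn("descriptions", convert(col("descriptions")))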

Related

Spark Scala: convert a nested dataframe to a nested dataset

I have a nested dataframe "inputFlowRecordsAgg" which has the following schema:
root
|-- FlowI.key: string (nullable = true)
|-- FlowS.minFlowTime: long (nullable = true)
|-- FlowS.maxFlowTime: long (nullable = true)
|-- FlowS.flowStartedCount: long (nullable = true)
|-- FlowI.DestPort: integer (nullable = true)
|-- FlowI.SrcIP: struct (nullable = true)
| |-- bytes: binary (nullable = true)
|-- FlowI.DestIP: struct (nullable = true)
| |-- bytes: binary (nullable = true)
|-- FlowI.L4Protocol: byte (nullable = true)
|-- FlowI.Direction: byte (nullable = true)
|-- FlowI.Status: byte (nullable = true)
|-- FlowI.Mac: string (nullable = true)
I want to convert it into a nested dataset of the following case classes:
case class InputFlowV1(val FlowI: FlowI,
val FlowS: FlowS)
case class FlowI(val Mac: String,
val SrcIP: IPAddress,
val DestIP: IPAddress,
val DestPort: Int,
val L4Protocol: Byte,
val Direction: Byte,
val Status: Byte,
var key: String = "")
case class FlowS(var minFlowTime: Long,
var maxFlowTime: Long,
var flowStartedCount: Long)
but when I try converting it using
inputFlowRecordsAgg.as[InputFlowV1]
I get the following exception:
org.apache.spark.sql.AnalysisException: cannot resolve '`FlowI`' given input columns: [FlowI.DestIP,FlowI.Direction, FlowI.key, FlowS.maxFlowTime, FlowI.SrcIP, FlowS.flowStartedCount, FlowI.L4Protocol, FlowI.Mac, FlowI.DestPort, FlowS.minFlowTime, FlowI.Status];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
One comment asked me for the full code, so here it is:
def getReducedFlowR(inputFlowRecords: Dataset[InputFlowV1],
                    @transient spark: SparkSession): Dataset[InputFlowV1] = {
  val inputFlowRecordsAgg = inputFlowRecords.groupBy(column("FlowI.key") as "FlowI.key")
    .agg(min("FlowS.minFlowTime") as "FlowS.minFlowTime", max("FlowS.maxFlowTime") as "FlowS.maxFlowTime",
      sum("FlowS.flowStartedCount") as "FlowS.flowStartedCount",
      first("FlowI.Mac") as "FlowI.Mac",
      first("FlowI.SrcIP") as "FlowI.SrcIP", first("FlowI.DestIP") as "FlowI.DestIP",
      first("FlowI.DestPort") as "FlowI.DestPort",
      first("FlowI.L4Protocol") as "FlowI.L4Protocol",
      first("FlowI.Direction") as "FlowI.Direction", first("FlowI.Status") as "FlowI.Status")

  inputFlowRecordsAgg.printSchema()
  inputFlowRecordsAgg.as[InputFlowV1]
}
The reason is that your case class schema does not match the actual data schema. Please check the case class schema below; if you reshape your data to match it, it will work.
Your case class schema is:
scala> df.printSchema
root
|-- FlowI: struct (nullable = true)
| |-- Mac: string (nullable = true)
| |-- SrcIP: string (nullable = true)
| |-- DestIP: string (nullable = true)
| |-- DestPort: integer (nullable = false)
| |-- L4Protocol: byte (nullable = false)
| |-- Direction: byte (nullable = false)
| |-- Status: byte (nullable = false)
| |-- key: string (nullable = true)
|-- FlowS: struct (nullable = true)
| |-- minFlowTime: long (nullable = false)
| |-- maxFlowTime: long (nullable = false)
| |-- flowStartedCount: long (nullable = false)
Try changing your code as shown below; it should work now.
val inputFlowRecordsAgg = inputFlowRecords.groupBy(column("FlowI.key") as "key")
  .agg(min("FlowS.minFlowTime") as "minFlowTime", max("FlowS.maxFlowTime") as "maxFlowTime",
    sum("FlowS.flowStartedCount") as "flowStartedCount",
    first("FlowI.Mac") as "Mac",
    first("FlowI.SrcIP") as "SrcIP", first("FlowI.DestIP") as "DestIP",
    first("FlowI.DestPort") as "DestPort",
    first("FlowI.L4Protocol") as "L4Protocol",
    first("FlowI.Direction") as "Direction", first("FlowI.Status") as "Status")
  // add this select and adjust it to your columns: it wraps the flat aggregate columns back into the FlowI and FlowS structs
  .select(struct($"key", $"Mac", $"SrcIP", $"DestIP", $"DestPort", $"L4Protocol", $"Direction", $"Status").as("FlowI"),
    struct($"flowStartedCount", $"minFlowTime", $"maxFlowTime").as("FlowS"))
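With the flat aggregate columns wrapped back into FlowI and FlowS structs, the cast at the end of the original method should now resolve, assuming the field types line up with the case classes:
val reduced: Dataset[InputFlowV1] = inputFlowRecordsAgg.as[InputFlowV1]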

Convert flattened data frame to struct in Spark

I had deeply nested JSON files that I had to process, and in order to do that I had to flatten them because I couldn't find a way to hash some of the deeply nested fields. This is what my dataframe looks like (after flattening):
scala> flattendedJSON.printSchema
root
|-- header_appID: string (nullable = true)
|-- header_appVersion: string (nullable = true)
|-- header_userID: string (nullable = true)
|-- body_cardId: string (nullable = true)
|-- body_cardStatus: string (nullable = true)
|-- body_cardType: string (nullable = true)
|-- header_userAgent_browser: string (nullable = true)
|-- header_userAgent_browserVersion: string (nullable = true)
|-- header_userAgent_deviceName: string (nullable = true)
|-- body_beneficiary_beneficiaryAccounts_beneficiaryAccountOwner: string (nullable = true)
|-- body_beneficiary_beneficiaryPhoneNumbers_beneficiaryPhoneNumber: string (nullable = true)
And I need to convert it back to the original structure (before flattening):
scala> nestedJson.printSchema
root
|-- header: struct (nullable = true)
| |-- appID: string (nullable = true)
| |-- appVersion: string (nullable = true)
| |-- userAgent: struct (nullable = true)
| | |-- browser: string (nullable = true)
| | |-- browserVersion: string (nullable = true)
| | |-- deviceName: string (nullable = true)
|-- body: struct (nullable = true)
| |-- beneficiary: struct (nullable = true)
| | |-- beneficiaryAccounts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryAccountOwner: string (nullable = true)
| | |-- beneficiaryPhoneNumbers: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryPhoneNumber: string (nullable = true)
| |-- cardId: string (nullable = true)
| |-- cardStatus: string (nullable = true)
| |-- cardType: string (nullable = true)
I've managed to do it with a single nested field, but when there are more levels it doesn't work, and I can't find a way to do it properly. Here's what I tried:
val structColumns = flattendedJSON.columns.filter(_.contains("_"))
val structColumnsMap = structColumns.map(_.split("\\_")).
groupBy(_(0)).mapValues(_.map(_(1)))
val dfExpanded = structColumnsMap.foldLeft(flattendedJSON){ (accDF, kv) =>
val cols = kv._2.map(v => col("`" + kv._1 + "_" + v + "`").as(v))
accDF.withColumn(kv._1, struct(cols: _*))
}
val dfResult = structColumns.foldLeft(flattendedJSON)(_ drop _)
It works if I have one nested object (e.g. header_appID), but in the case of header_userAgent_browser, I get an exception:
org.apache.spark.sql.AnalysisException: cannot resolve
'header_userAgent' given input columns: ..
Using Spark 2.3 and Scala 2.11.8
I would recommend using case classes to work with a Dataset instead of flattening the DF and then trying to convert it back to the old JSON format. Even if it has nested objects, you can define a set of case classes to cast it. That lets you work with object notation, which makes things easier than working with a DF.
There are tools where you can provide a sample of the JSON and they generate the classes for you (I use this one: https://json2caseclass.cleverapps.io).
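For the schema in the question, the generated case classes might look roughly like this; a sketch based on the nested schema shown above, where the top-level class name and the use of Seq for the arrays are my own assumptions:
case class UserAgent(browser: String, browserVersion: String, deviceName: String)
case class Header(appID: String, appVersion: String, userAgent: UserAgent)
case class BeneficiaryAccount(beneficiaryAccountOwner: String)
case class BeneficiaryPhoneNumber(beneficiaryPhoneNumber: String)
case class Beneficiary(beneficiaryAccounts: Seq[BeneficiaryAccount],
                       beneficiaryPhoneNumbers: Seq[BeneficiaryPhoneNumber])
case class Body(beneficiary: Beneficiary, cardId: String, cardStatus: String, cardType: String)
case class Record(header: Header, body: Body) // "Record" is an assumed name for the top level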
If you still want to convert it from the DF, an alternative could be to create a Dataset using map on your DF. Something like this:
case class NestedNode(fieldC: String, fieldD: String)   // for the JSON output
case class MainNode(fieldA: String, fieldB: NestedNode) // for the JSON output
case class FlattenData(fa: String, fc: String, fd: String)

Seq(
  FlattenData("A1", "B1", "C1"),
  FlattenData("A2", "B2", "C2"),
  FlattenData("A3", "B3", "C3")
).toDF
  .as[FlattenData] // cast it to access fields with object notation
  .map(flattenItem => {
    MainNode(flattenItem.fa, NestedNode(flattenItem.fc, flattenItem.fd)) // creating the output format
  })
In the end, the schema defined by the classes will be used when you write it out: yourDS.write.mode(your_save_mode).json(your_target_path)
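If you would rather stay with DataFrame columns than case classes, nesting can also be rebuilt explicitly with struct, naming each level yourself. A minimal sketch for the header part of the flattened schema above, assuming spark.implicits._ is in scope for the $ syntax; column names are taken from the question, and the body side (including its arrays) would need similar treatment:
import org.apache.spark.sql.functions.struct

val withHeader = flattendedJSON.select(
  struct(
    $"header_appID".as("appID"),
    $"header_appVersion".as("appVersion"),
    struct(
      $"header_userAgent_browser".as("browser"),
      $"header_userAgent_browserVersion".as("browserVersion"),
      $"header_userAgent_deviceName".as("deviceName")
    ).as("userAgent")
  ).as("header")
)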

Reordering fields in nested dataframe

How do I reorder fields in a nested dataframe in scala?
For example, below are the current and expected schemas.
Currently:
root
|-- domain: struct (nullable = false)
| |-- assigned: string (nullable = true)
| |-- core: string (nullable = true)
| |-- createdBy: long (nullable = true)
|-- Event: struct (nullable = false)
| |-- action: string (nullable = true)
| |-- eventid: string (nullable = true)
| |-- dqid: string (nullable = true)
Expected:
root
|-- domain: struct (nullable = false)
| |-- core: string (nullable = true)
| |-- assigned: string (nullable = true)
| |-- createdBy: long (nullable = true)
|-- Event: struct (nullable = false)
| |-- dqid: string (nullable = true)
| |-- eventid: string (nullable = true)
| |-- action: string (nullable = true)
You need to define the schema before you read the dataframe.
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("domain", StructType(Array(
    StructField("core", StringType, true),
    StructField("assigned", StringType, true),
    StructField("createdBy", LongType, true))), true),
  StructField("Event", StructType(Array(
    StructField("dqid", StringType, true),
    StructField("eventid", StringType, true),
    StructField("action", StringType, true))), true)))
Now, you can apply this schema while reading your file.
val df = spark.read.schema(schema).json("path/to/json")
This should work with any nested data.
Hope this helps!
The most efficient approach might be to just select the nested elements and wrap them in a couple of structs, as shown below:
case class Domain(assigned: String, core: String, createdBy: Long)
case class Event(action: String, eventid: String, dqid: String)
val df = Seq(
(Domain("a", "b", 1L), Event("c", "d", "e")),
(Domain("f", "g", 2L), Event("h", "i", "j"))
).toDF("domain", "event")
val df2 = df.select(
struct($"domain.core", $"domain.assigned", $"domain.createdBy").as("domain"),
struct($"event.dqid", $"event.action", $"event.eventid").as("event")
)
df2.printSchema
// root
// |-- domain: struct (nullable = false)
// | |-- core: string (nullable = true)
// | |-- assigned: string (nullable = true)
// | |-- createdBy: long (nullable = true)
// |-- event: struct (nullable = false)
// | |-- dqid: string (nullable = true)
// | |-- action: string (nullable = true)
// | |-- eventid: string (nullable = true)
An alternative would be to apply a row-wise map:
import org.apache.spark.sql.Row
val df2 = df.map{ case Row(Row(as: String, co: String, cr: Long), Row(ac: String, ev: String, dq: String)) =>
((co, as, cr), (dq, ac, ev))
}.toDF("domain", "event")

Schema conversion from String to Array[StructType] using Spark Scala

I have the sample data shown below; I need to convert the columns (ABS, ALT) from string to Array[StructType] using Spark Scala code. Any help would be much appreciated.
With the help of a UDF, I was able to convert from string to ArrayType, but I need some help on converting from string to Array[StructType] for these two columns (ABS, ALT).
VIN          TT  MSG_TYPE  ABS                           ALT
MSGXXXXXXXX  1   SIGL      [{"E":1569XXXXXXX,"V":0.0}]   [{"E":156957XXXXXX,"V":0.0}]
df.currentSchema
root
|-- VIN: string (nullable = true)
|-- TT: long (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- ABS: string (nullable = true)
|-- ALT: string (nullable = true)
df.expectedSchema:
|-- VIN: string (nullable = true)
|-- TT: long (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- ABS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- ALT: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
It also works if you try it as below:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{ArrayType, DoubleType, LongType, StructField, StructType}

val schema = ArrayType(StructType(Seq(StructField("E", LongType), StructField("V", DoubleType))))
val final_df = newDF.withColumn("ABS", from_json($"ABS", schema))
  .withColumn("ALT", from_json($"ALT", schema))
final_df.printSchema:
root
|-- VIN: string (nullable = true)
|-- TT: string (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- ABS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = false)
|-- ALT: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = false)
You can use a UDF to parse the JSON and transform it into arrays of structs.
First, define a function that parses the JSON (based on this answer):
import scala.util.parsing.json.JSON

case class Data(E: String, V: Double)

// generic extractors to pattern-match the untyped result of JSON.parseFull
class CC[T] extends Serializable { def unapply(a: Any): Option[T] = Some(a.asInstanceOf[T]) }
object M extends CC[Map[String, Any]]
object L extends CC[List[Any]]
object S extends CC[String]
object D extends CC[Double]

def toStruct(in: String): Array[Data] = {
  if (in == null || in.isEmpty) return new Array[Data](0)
  val result = for {
    Some(L(list)) <- List(JSON.parseFull(in))
    M(data) <- list
    S(e) = data("E")
    D(v) = data("V")
  } yield {
    Data(e, v)
  }
  result.toArray
}
This function returns an array of Data objects that already have the correct structure. Now we use this function to define a UDF:
import org.apache.spark.sql.functions.udf

val ts: String => Array[Data] = toStruct(_)
val toStructUdf = udf(ts)
Finally, we call the UDF (for example, in a select statement):
val df = ...
val newdf = df.select('VIN, 'TT, 'MSG_TYPE, toStructUdf('ABS).as("ABS"), toStructUdf('ALT).as("ALT"))
newdf.printSchema()
Output:
root
|-- VIN: string (nullable = true)
|-- TT: string (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- ABS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: string (nullable = true)
| | |-- V: double (nullable = false)
|-- ALT: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: string (nullable = true)
| | |-- V: double (nullable = false)

Nested JSON in Spark

I have the following JSON loaded as a DataFrame:
root
|-- data: struct (nullable = true)
| |-- field1: string (nullable = true)
| |-- field2: string (nullable = true)
|-- moreData: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- more1: string (nullable = true)
| | |-- more2: string (nullable = true)
| | |-- more3: string (nullable = true)
I want to get the following RDD from this DataFrame:
RDD[(more1, more2, more3, field1, field2)]
How can I do this? I think I have to use flatMap for the nested JSON?
A combination of explode and dot syntax should do the trick:
import org.apache.spark.sql.functions.explode
case class Data(field1: String, field2: String)
case class MoreData(more1: String, more2: String, more3: String)
val df = sc.parallelize(Seq(
(Data("foo", "bar"), Array(MoreData("a", "b", "c"), MoreData("d", "e", "f")))
)).toDF("data", "moreData")
df.printSchema
// root
// |-- data: struct (nullable = true)
// | |-- field1: string (nullable = true)
// | |-- field2: string (nullable = true)
// |-- moreData: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- more1: string (nullable = true)
// | | |-- more2: string (nullable = true)
// | | |-- more3: string (nullable = true)
val columns = Seq(
$"moreData.more1", $"moreData.more2", $"moreData.more3",
$"data.field1", $"data.field2")
val aRDD = df.withColumn("moreData", explode($"moreData"))
.select(columns: _*)
.rdd
aRDD.collect
// Array[org.apache.spark.sql.Row] = Array([a,b,c,foo,bar], [d,e,f,foo,bar])
Depending on your requirements, you can follow this with map to extract values from the rows:
import org.apache.spark.sql.Row
aRDD.map{case Row(m1: String, m2: String, m3: String, f1: String, f2: String) =>
(m1, m2, m3, f1, f2)}
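If a typed Dataset is preferred over an RDD of tuples, the same extraction can be written with a typed select. This sketch is not part of the original answer; it reuses the df defined above and assumes spark.implicits._ is in scope:
// typed select: each column is given an explicit element type
val typedDS = df.withColumn("moreData", explode($"moreData"))
  .select(
    $"moreData.more1".as[String], $"moreData.more2".as[String], $"moreData.more3".as[String],
    $"data.field1".as[String], $"data.field2".as[String])

// typedDS: Dataset[(String, String, String, String, String)]; typedDS.rdd yields the requested RDD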
See also Querying Spark SQL DataFrame with complex types