Spark - Flatten Array of Structs using flatMap - scala

I have a df with schema -
root
|-- arrayCol: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- email: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- qty: long (nullable = true)
| | |-- rqty: long (nullable = true)
| | |-- pids: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- sqty: long (nullable = true)
| | |-- id1: string (nullable = true)
| | |-- id2: string (nullable = true)
| | |-- window: struct (nullable = true)
| | | |-- end: string (nullable = true)
| | | |-- start: string (nullable = true)
| | |-- otherId: string (nullable = true)
|-- primarykey: string (nullable = true)
|-- runtime: string (nullable = true)
I don't want to use explode as it's extremely slow, and wanted to try flatMap instead.
I tried doing -
val ds = df1.as[(Array[StructType], String, String)]
ds.flatMap{ case(x, y, z) => x.map((_, y, z))}.toDF()
This gives me error -
scala.MatchError: org.apache.spark.sql.types.StructType
How do I flatten arrayCol?
Sample data -
{
"primaryKeys":"sfdfrdsdjn",
"runtime":"2020-10-31T13:01:04.813Z",
"arrayCol":[{"id":"qwerty","id1":"dsfdsfdsf","window":{"start":"2020-11-01T10:30:00Z","end":"2020-11-01T12:30:00Z"}, "email":[],"id2":"sdfsdfsdPuyOplzlR1idvfPkv5138g","rqty":3,"sqty":3,"qty":3,"otherId":null}]
}
Expected Output -
primaryKey runtime arrayCol
sfdfrdsdjn 2020-10-31T13:01:04.813Z {"id":"qwerty","id1":"dsfdsfdsf","window":{"start":"2020-11-01T10:30:00Z","end":"2020-11-01T12:30:00Z"}, "email":[],"id2":"sdfsdfsdPuyOplzlR1idvfPkv5138g","rqty":3,"sqty":3,"qty":3,"otherId":null}
I want one row for every element in arrayCol. Just like explode(arrayCol)

You almost had it. When using Spark with Scala, try to use the Dataset API as often as possible. This not only improves readability, but also helps solve these types of issues very quickly.
// the implicits are needed for toDS(); they are imported automatically in spark-shell
// (otherwise: import spark.implicits._)
case class ArrayColWindow(end: String, start: String)
case class ArrayCol(id: String, email: Seq[String], qty: Long, rqty: Long, pids: Seq[String],
                    sqty: Long, id1: String, id2: String, window: ArrayColWindow, otherId: String)
case class FullArrayCols(arrayCol: Seq[ArrayCol], primarykey: String, runtime: String)

val inputTest = List(
  FullArrayCols(Seq(ArrayCol("qwerty", Seq(), 3, 3, Seq(), 3, "dsfdsfdsf", "sdfsdfsdPuyOplzlR1idvfPkv5138g",
    ArrayColWindow("2020-11-01T10:30:00Z", "2020-11-01T12:30:00Z"), null)),
    "sfdfrdsdjn", "2020-10-31T13:01:04.813Z")
).toDS()

val output = inputTest.as[(Seq[ArrayCol], String, String)].flatMap { case (x, y, z) => x.map((_, y, z)) }
output.show(truncate = false)
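Once that works on the hand-built sample, the same case classes should apply directly to your df1, assuming its columns line up with FullArrayCols (arrayCol, primarykey, runtime) - a sketch, not tested against your real data:
// Cast the real DataFrame to the case classes, then flatMap one output row per array element
val realDs = df1.as[FullArrayCols]
val flattened = realDs
  .flatMap(r => r.arrayCol.map(a => (a, r.primarykey, r.runtime)))
  .toDF("arrayCol", "primarykey", "runtime")
flattened.show(truncate = false)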

You could just change
val ds = df1.as[(Array[StructType], String, String)]
to
val ds = df1.as[(Array[String], String, String)]
and you will get rid of the error and see the output you want.

Related

Spark merge two columns that are arrays of different structs with overlapping field

I have a question I was unable to solve while working with Spark in Scala (or PySpark): how can we merge two fields that are arrays of structs with different fields?
For example, if I have schema like so:
df.printSchema()
root
|-- arrayOne: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: string (nullable = true)
| | |-- b: string (nullable = true)
| | |-- c: string (nullable = true)
| | |-- Q: string (nullable = true)
|-- ArrayTwo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- x: string (nullable = true)
| | |-- y: string (nullable = true)
| | |-- z: string (nullable = true)
| | |-- Q: string (nullable = true)
Can I create a df with the following schema using a UDF:
df.printSchema()
root
|-- arrayOne: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: string (nullable = true)
| | |-- b: string (nullable = true)
| | |-- c: string (nullable = true)
| | |-- Q: string (nullable = true)
|-- ArrayTwo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- x: string (nullable = true)
| | |-- y: string (nullable = true)
| | |-- z: string (nullable = true)
| | |-- Q: string (nullable = true)
|-- ArrayThree: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: string (nullable = true)
| | |-- b: string (nullable = true)
| | |-- c: string (nullable = true)
| | |-- Q: string (nullable = true)
| | |-- x: string (nullable = true)
| | |-- y: string (nullable = true)
| | |-- z: string (nullable = true)
When a, b, c are not null, then x, y, z are null, and vice versa; however, even when x, y, z are null, Q will be non-null and have the same value in both arrays.
The UDF is an important aspect here, as exploding (explode_outer) both fields would be too expensive and would repeat the second array's elements, corrupting the fidelity of the data because of the element Q.
Writing a UDF in Pig Latin or even plain MapReduce would be very easy, but for some reason it is very complicated in the Spark environment, for me at least.
What would be a way to write a UDF that concatenates the two arrays and creates the new struct with the superset of the fields of the two different structs?
Here's a sample test I did. I created two fields of array(struct()) - arr_struct1 and arr_struct2. Using them, I created a new field, arr_struct12, that has all the fields of the previous two array-struct columns. I've retained all columns in the printSchema() output for better understanding.
# assumes: from pyspark.sql import functions as func
data_sdf. \
    withColumn('arr_struct1', func.array(func.struct(func.col('a').alias('a'), func.col('b').alias('b'), func.col('c').alias('c')))). \
    withColumn('arr_struct2', func.array(func.struct(func.col('e').alias('e'), func.col('f').alias('f')))). \
    withColumn('struct1', func.col('arr_struct1')[0]). \
    withColumn('struct2', func.col('arr_struct2')[0]). \
    withColumn('arr_struct12', func.array(func.struct('struct1.*', 'struct2.*'))). \
    printSchema()
# ignore columns a to g in the schema below
# root
# |-- a: long (nullable = true)
# |-- b: long (nullable = true)
# |-- c: long (nullable = true)
# |-- d: long (nullable = true)
# |-- e: long (nullable = true)
# |-- f: long (nullable = true)
# |-- g: long (nullable = true)
# |-- arr_struct1: array (nullable = false)
# | |-- element: struct (containsNull = false)
# | | |-- a: long (nullable = true)
# | | |-- b: long (nullable = true)
# | | |-- c: long (nullable = true)
# |-- arr_struct2: array (nullable = false)
# | |-- element: struct (containsNull = false)
# | | |-- e: long (nullable = true)
# | | |-- f: long (nullable = true)
# |-- struct1: struct (nullable = true)
# | |-- a: long (nullable = true)
# | |-- b: long (nullable = true)
# | |-- c: long (nullable = true)
# |-- struct2: struct (nullable = true)
# | |-- e: long (nullable = true)
# | |-- f: long (nullable = true)
# |-- arr_struct12: array (nullable = false)
# | |-- element: struct (containsNull = false)
# | | |-- a: long (nullable = true)
# | | |-- b: long (nullable = true)
# | | |-- c: long (nullable = true)
# | | |-- e: long (nullable = true)
# | | |-- f: long (nullable = true)
In case you'd like to specify which fields to keep, you can reference each one using col('col_name.element_alias') instead of the *.
data_sdf. \
    withColumn('arr_struct1', func.array(func.struct(func.col('a').alias('a'), func.col('b').alias('b'), func.col('c').alias('c')))). \
    withColumn('arr_struct2', func.array(func.struct(func.col('e').alias('e'), func.col('f').alias('f')))). \
    withColumn('struct1', func.col('arr_struct1')[0]). \
    withColumn('struct2', func.col('arr_struct2')[0]). \
    withColumn('arr_struct12',
               func.array(func.struct(func.col('struct1.a').alias('a'),
                                      func.col('struct1.b').alias('b'),
                                      func.col('struct2.f').alias('f')))). \
    printSchema()
# ignore columns a to g in the schema below
# root
# |-- a: long (nullable = true)
# |-- b: long (nullable = true)
# |-- c: long (nullable = true)
# |-- d: long (nullable = true)
# |-- e: long (nullable = true)
# |-- f: long (nullable = true)
# |-- g: long (nullable = true)
# |-- arr_struct1: array (nullable = false)
# | |-- element: struct (containsNull = false)
# | | |-- a: long (nullable = true)
# | | |-- b: long (nullable = true)
# | | |-- c: long (nullable = true)
# |-- arr_struct2: array (nullable = false)
# | |-- element: struct (containsNull = false)
# | | |-- e: long (nullable = true)
# | | |-- f: long (nullable = true)
# |-- struct1: struct (nullable = true)
# | |-- a: long (nullable = true)
# | |-- b: long (nullable = true)
# | |-- c: long (nullable = true)
# |-- struct2: struct (nullable = true)
# | |-- e: long (nullable = true)
# | |-- f: long (nullable = true)
# |-- arr_struct12: array (nullable = false)
# | |-- element: struct (containsNull = false)
# | | |-- a: long (nullable = true)
# | | |-- b: long (nullable = true)
# | | |-- f: long (nullable = true)
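If both explode and a UDF are off the table and you're on Spark 2.4+, another option (just a sketch, using the column and field names from the question) is to map each array onto the superset struct with transform and then concat the two arrays:
// Sketch for Spark 2.4+: build the superset struct per element with transform(),
// then concatenate the two arrays. The null literals are cast to string so both
// arrays end up with identical element types, which concat() requires.
import org.apache.spark.sql.functions.{concat, expr}

val merged = df.withColumn("ArrayThree",
  concat(
    expr("""transform(arrayOne, s -> named_struct(
      'a', s.a, 'b', s.b, 'c', s.c, 'Q', s.Q,
      'x', cast(null as string), 'y', cast(null as string), 'z', cast(null as string)))"""),
    expr("""transform(ArrayTwo, s -> named_struct(
      'a', cast(null as string), 'b', cast(null as string), 'c', cast(null as string), 'Q', s.Q,
      'x', s.x, 'y', s.y, 'z', s.z))""")
  )
)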
I will share below the solution that worked for me. The solution is a simple UDF that takes two arrays of structs as input and builds a sequence of a new struct that is the superset of the fields of the two structs, as required:
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

case class ItemOne(a: String, b: String, c: String, Q: String)
case class ItemTwo(x: String, y: String, z: String, Q: String)
case class ItemThree(a: String, b: String, c: String,
                     x: String, y: String, z: String, Q: String)

val combineAuctionData = udf((arrayOne: Seq[Row], arrayTwo: Seq[Row]) => {
  val result = new ListBuffer[ItemThree]()
  // Loop over the list of ItemOne rows and map each to an ItemThree (x, y, z left null)
  for (el <- arrayOne) {
    result += ItemThree(el.getString(0), el.getString(1), el.getString(2),
      null, null, null, el.getString(3))
  }
  // Loop over the list of ItemTwo rows and map each to an ItemThree (a, b, c left null)
  for (el <- arrayTwo) {
    result += ItemThree(null, null, null,
      el.getString(0), el.getString(1), el.getString(2), el.getString(3))
  }
  // Return the accumulated Seq of ItemThree
  result.toSeq
}: Seq[ItemThree])
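Applying the UDF is then a single withColumn call (the column names below are the ones from the question, so adjust as needed):
// Hypothetical usage, assuming the source columns are named arrayOne and ArrayTwo
import org.apache.spark.sql.functions.col

val withArrayThree = df.withColumn("ArrayThree", combineAuctionData(col("arrayOne"), col("ArrayTwo")))
withArrayThree.printSchema()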

How to flatten Array of WrappedArray of structs in scala

I have a dataframe with the following schema:
root
|-- id: string (nullable = true)
|-- collect_list(typeCounts): array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- type: string (nullable = true)
| | | |-- count: long (nullable = false)
Example data:
+-----------+----------------------------------------------------------------------------+
|id |collect_list(typeCounts) |
+-----------+----------------------------------------------------------------------------+
|1 |[WrappedArray([B00XGS,6], [B001FY,5]), WrappedArray([B06LJ7,4])]|
|2 |[WrappedArray([B00UFY,3])] |
+-----------+----------------------------------------------------------------------------+
How can I flatten collect_list(typeCounts) to a flat array of structs in Scala? I have read some answers on Stack Overflow for similar questions suggesting UDFs, but I am not sure what the UDF method signature should be for structs.
If you're on Spark 2.4+, instead of using a UDF (which is generally less efficient than native Spark functions) you can apply flatten, like below:
df.withColumn("collect_list(typeCounts)", flatten($"collect_list(typeCounts)"))
As for "I am not sure what the UDF method signature should be for structs": a UDF takes structs as Rows for input and may return them as Scala case classes. To flatten the nested collections, you can create a simple UDF as follows:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{udf, collect_list}

case class TC(`type`: String, count: Long)

val flattenLists = udf { (lists: Seq[Seq[Row]]) =>
  lists.flatMap(_.map { case Row(t: String, c: Long) => TC(t, c) })
}
To test out the UDF, let's assemble a DataFrame with your described schema:
val df = Seq(
  ("1", Seq(TC("B00XGS", 6), TC("B001FY", 5))),
  ("1", Seq(TC("B06LJ7", 4))),
  ("2", Seq(TC("B00UFY", 3)))
).toDF("id", "typeCounts").
  groupBy("id").agg(collect_list("typeCounts"))
df.printSchema
// root
// |-- id: string (nullable = true)
// |-- collect_list(typeCounts): array (nullable = true)
// | |-- element: array (containsNull = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- type: string (nullable = true)
// | | | |-- count: long (nullable = false)
Applying the UDF:
df.
  withColumn("collect_list(typeCounts)", flattenLists($"collect_list(typeCounts)")).
  printSchema
// root
// |-- id: string (nullable = true)
// |-- collect_list(typeCounts): array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- type: string (nullable = true)
// | | |-- count: long (nullable = false)

Convert flattened data frame to struct in Spark

I had deeply nested JSON files which I had to process, and in order to do that I had to flatten them because I couldn't find a way to hash some deeply nested fields. This is what my dataframe looks like (after flattening):
scala> flattendedJSON.printSchema
root
|-- header_appID: string (nullable = true)
|-- header_appVersion: string (nullable = true)
|-- header_userID: string (nullable = true)
|-- body_cardId: string (nullable = true)
|-- body_cardStatus: string (nullable = true)
|-- body_cardType: string (nullable = true)
|-- header_userAgent_browser: string (nullable = true)
|-- header_userAgent_browserVersion: string (nullable = true)
|-- header_userAgent_deviceName: string (nullable = true)
|-- body_beneficiary_beneficiaryAccounts_beneficiaryAccountOwner: string (nullable = true)
|-- body_beneficiary_beneficiaryPhoneNumbers_beneficiaryPhoneNumber: string (nullable = true)
And I need to convert it back to the original structure (before flattening):
scala> nestedJson.printSchema
root
|-- header: struct (nullable = true)
| |-- appID: string (nullable = true)
| |-- appVersion: string (nullable = true)
| |-- userAgent: struct (nullable = true)
| | |-- browser: string (nullable = true)
| | |-- browserVersion: string (nullable = true)
| | |-- deviceName: string (nullable = true)
|-- body: struct (nullable = true)
| |-- beneficiary: struct (nullable = true)
| | |-- beneficiaryAccounts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryAccountOwner: string (nullable = true)
| | |-- beneficiaryPhoneNumbers: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryPhoneNumber: string (nullable = true)
| |-- cardId: string (nullable = true)
| |-- cardStatus: string (nullable = true)
| |-- cardType: string (nullable = true)
I've managed to do it with a single level of nesting, but with more levels it doesn't work and I can't find a way to do it properly. Here's what I tried:
val structColumns = flattendedJSON.columns.filter(_.contains("_"))

val structColumnsMap = structColumns.map(_.split("\\_")).
  groupBy(_(0)).mapValues(_.map(_(1)))

val dfExpanded = structColumnsMap.foldLeft(flattendedJSON){ (accDF, kv) =>
  val cols = kv._2.map(v => col("`" + kv._1 + "_" + v + "`").as(v))
  accDF.withColumn(kv._1, struct(cols: _*))
}

val dfResult = structColumns.foldLeft(flattendedJSON)(_ drop _)
And it works if I have one level of nesting (e.g. header_appID), but in the case of header_userAgent_browser, I get an exception:
org.apache.spark.sql.AnalysisException: cannot resolve 'header_userAgent' given input columns: ..
Using Spark 2.3 and Scala 2.11.8
I would recommend using case classes and working with a Dataset instead of flattening the DF and then trying to convert it back to the old JSON format. Even if it has nested objects, you can define a set of case classes to cast it. This lets you work with object notation, making things easier than with a DF.
There are tools where you can provide a sample of the JSON and they generate the classes for you (I use this one: https://json2caseclass.cleverapps.io).
If you still want to convert it from the DF, an alternative could be to create a Dataset using map on your DF. Something like this:
case class NestedNode(fieldC: String, fieldD: String) // for JSON
case class MainNode(fieldA: String, fieldB: NestedNode) // for JSON
case class FlattenData(fa: String, fc: String, fd: String)

Seq(
  FlattenData("A1", "B1", "C1"),
  FlattenData("A2", "B2", "C2"),
  FlattenData("A3", "B3", "C3")
).toDF
  .as[FlattenData] // Cast it to access with object notation
  .map(flattenItem => {
    MainNode(flattenItem.fa, NestedNode(flattenItem.fc, flattenItem.fd)) // Creating output format
  })
In the end, the schema defined by those classes will be used when you call yourDS.write.mode(your_save_mode).json(your_target_path).
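If you'd rather stay with the DataFrame/column approach from the question, the AnalysisException comes from the fold only ever building one level of struct; deeper levels have to be nested explicitly. A partial, hand-written sketch using the question's column names (only the header side is shown):
// Sketch: rebuild the nesting manually with nested struct() calls,
// using the flattened column names from the question.
import org.apache.spark.sql.functions.struct

val nested = flattendedJSON.select(
  struct(
    $"header_appID".as("appID"),
    $"header_appVersion".as("appVersion"),
    struct(
      $"header_userAgent_browser".as("browser"),
      $"header_userAgent_browserVersion".as("browserVersion"),
      $"header_userAgent_deviceName".as("deviceName")
    ).as("userAgent")
  ).as("header")
  // body (cardId/cardStatus/cardType, and the beneficiary arrays via array(struct(...)))
  // follows the same pattern
)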

Index a map by the value of a different column in Spark

I have a dataframe with the following schema:
|-- A: map (nullable = true)
| |-- key: string
| |-- value: array (valueContainsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- uid: string (nullable = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
|-- keyindex: string (nullable = true)
For example, if I have the following data:
{"A":{
"innerkey_1":[{"uid":"1","price":0.01,"recordtype":"STAT"},
{"uid":"6","price":4.3,"recordtype":"DYN"}],
"innerkey_2":[{"uid":"2","price":2.01,"recordtype":"DYN"},
{"uid":"4","price":6.1,"recordtype":"DYN"}]},
"innerkey_2"}
I use the following schema to read the data into a dataframe:
val schema = new StructType()
  .add("mainkey", MapType(StringType,
    new ArrayType(new StructType()
      .add("uid", StringType)
      .add("price", DoubleType)
      .add("recordtype", StringType), true)))
  .add("keyindex", StringType)
I am trying to figure out if I can use the keyindex to select values from the map. Since the keyindex in the example is "innerkey_2", I want the output to be
[{"uid":"2","price":2.01,"recordtype":"DYN"},
{"uid":"4","price":6.1,"recordtype":"DYN"}]
Thanks for your help!
getItem should do the trick:
scala> val df = Seq(("innerkey2", Map("innerkey2" -> Seq(("1", 0.01, "STAT"))))).toDF("keyindex", "A")
df: org.apache.spark.sql.DataFrame = [keyindex: string, A: map<string,array<struct<_1:string,_2:double,_3:string>>>]
scala> df.select($"A"($"keyindex")).show
+---------------+
| A[keyindex]|
+---------------+
|[[1,0.01,STAT]]|
+---------------+
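Applied to the schema in the question (where the map column shows up as A, or mainkey if read with the StructType above - a sketch either way), that would look like:
val values = df.select($"A"($"keyindex").as("values"))

// and, if one row per struct is needed afterward:
import org.apache.spark.sql.functions.explode
df.select(explode($"A"($"keyindex")).as("value")).show(false)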

How to extract all individual elements from a nested WrappedArray from a DataFrame in Spark

How can I get all individual elements from MEMBERDETAIL?
scala> xmlDF.printSchema
root
|-- MEMBERDETAIL: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- FILE_ID: double (nullable = true)
| | |-- INP_SOURCE_ID: long (nullable = true)
| | |-- NET_DB_CR_SW: string (nullable = true)
| | |-- NET_PYM_AMT: string (nullable = true)
| | |-- ORGNTD_DB_CR_SW: string (nullable = true)
| | |-- ORGNTD_PYM_AMT: double (nullable = true)
| | |-- RCVD_DB_CR_SW: string (nullable = true)
| | |-- RCVD_PYM_AMT: string (nullable = true)
| | |-- RECON_DATE: string (nullable = true)
| | |-- SLNO: long (nullable = true)
scala> xmlDF.head
res147: org.apache.spark.sql.Row = [WrappedArray([1.1610100000001425E22,1,D, 94,842.38,C,0.0,D, 94,842.38,2016-10-10,1], [1.1610100000001425E22,1,D, 33,169.84,C,0.0,D, 33,169.84,2016-10-10,2], [1.1610110000001425E22,1,D, 155,500.88,C,0.0,D, 155,500.88,2016-10-11,3], [1.1610110000001425E22,1,D, 164,952.29,C,0.0,D, 164,952.29,2016-10-11,4], [1.1610110000001425E22,1,D, 203,061.06,C,0.0,D, 203,061.06,2016-10-11,5], [1.1610110000001425E22,1,D, 104,040.01,C,0.0,D, 104,040.01,2016-10-11,6], [2.1610110000001427E22,1,C, 849.14,C,849.14,C, 0.00,2016-10-11,7], [1.1610100000001465E22,1,D, 3.78,C,0.0,D, 3.78,2016-10-10,1], [1.1610100000001465E22,1,D, 261.54,C,0.0,D, ...
After trying many ways, I was only able to get an "Any" object like below, but again was not able to read all the fields separately.
xmlDF.select($"MEMBERDETAIL".getItem(0)).head().get(0)
res56: Any = [1.1610100000001425E22,1,D,94,842.38,C,0.0,D,94,842.38,2016-10-10,1]
And the StructType is as below -
res61: org.apache.spark.sql.DataFrame = [MEMBERDETAIL[0]: struct<FILE_ID:double,INP_SOURCE_ID:bigint,NET_DB_CR_SW:string,NET_PYM_AMT:string,ORGNTD_DB_CR_SW:string,ORGNTD_PYM_AMT:double,RCVD_DB_CR_SW:string,RCVD_PYM_AMT:string,RECON_DATE:string,SLNO:bigint>]
This actually helped me -
xmlDF.selectExpr("explode(MEMBERDETAIL) as e").select("e.FILE_ID", "e.INP_SOURCE_ID", "e.NET_DB_CR_SW", "e.NET_PYM_AMT", "e.ORGNTD_DB_CR_SW", "e.ORGNTD_PYM_AMT", "e.RCVD_DB_CR_SW", "e.RCVD_PYM_AMT", "e.RECON_DATE").show()