Exploding nested DataFrame columns in Spark Scala

The column is named 'col1' and has the following schema:
col1: array (nullable = true)
| |-- A1: struct (containsNull = true)
| | |-- B0: struct (nullable = true)
| | | |-- B01: string (nullable = true)
| | | |-- B02: string (nullable = true)
| | |-- B1: string (nullable = true)
| | |-- B2: string (nullable = true)
| | |-- B3: string (nullable = true)
| | |-- B4: string (nullable = true)
| | |-- B5: string (nullable = true)
I am trying two things. The first is to fetch the value of B2. Code:
val explodeDF = test_df.explode($"col1") { case Row(col1_details:Array[String]) =>
col1_details:Array.map{ col1_details:Array =>
val firstName = col1_details:Array(2).asInstanceOf[String]
val lastName = col1_details:Array(3).asInstanceOf[String]
val email = col1_details:Array(4).asInstanceOf[String]
val salary = col1_details:Array(5).asInstanceOf[String]
notes_details(firstName, lastName, email, salary)
}
}
Error:
error: too many arguments for method apply: (index: Int)Char in class StringOps
col1_details(firstName, lastName, email, salary)
I have tried various snippets and keep getting different errors. Any suggestions on what the mistake is would be very helpful.
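The compile error says that `col1_details(...)` is being applied to a String, so the call is read as indexing into a string rather than building a row. Note also that `DataFrame.explode` has been deprecated since Spark 2.0 in favour of `functions.explode`. A minimal sketch of that route follows, assuming col1 is an array of structs with the fields B0 to B5 shown in the schema. The sample row and the names `item` and `result` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder().master("local[*]").appName("explode-col1").getOrCreate()
import spark.implicits._

// Made-up sample standing in for test_df; only the field names come from the question.
val test_df = spark.read.json(Seq(
  """{"col1":[{"B0":{"B01":"x","B02":"y"},"B1":"b1","B2":"b2","B3":"b3","B4":"b4","B5":"b5"}]}"""
).toDS())

// explode produces one row per array element; struct fields are then addressed by name.
val exploded = test_df.select(explode(col("col1")).as("item"))
val result = exploded.select(col("item.B2").as("B2"), col("item.B0.B01").as("B01"))
result.show(false)
```

Selecting by field name avoids positional access such as `(2)` or `(3)`, which is easy to get wrong when the struct also contains a nested struct like B0.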

Related

Compare two columns in different DataFrames, of types String and Array<string> respectively, in PySpark without using the explode function

I have two dfs:
df1:
sku category cep seller state
4858 BDU 00000 xefd SP
df2:
depth price sku seller infos_product
6.1 5.60 47347 gaha [{1, 86800000, 86...
For df2 I have the following schema:
|-- depth: double (nullable = true)
|-- sku: string (nullable = true)
|-- price: double (nullable = true)
|-- seller: string (nullable = true)
|-- infos_produt: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- modality_id: integer (nullable = true)
| | |-- cep_coleta_ini: integer (nullable = true)
| | |-- cep_coleta_fim: integer (nullable = true)
| | |-- cep_entrega_ini: integer (nullable = true)
| | |-- cep_entrega_fim: integer (nullable = true)
| | |-- cubage_factor_entrega: double (nullable = true)
| | |-- value_coleta: double (nullable = true)
| | |-- value_entrega: double (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
I need to compare these DataFrames, something like this:
condi = [(df1.seller_id == df2.seller) & (df2.infos_produt.state == df1.state)]
df_finish = (df1\
.join(df2, on = condi ,how='left'))
But it returns an error:
AnalysisException: cannot resolve '(infos_produt.`state` = view.coverage_state)' due to data type mismatch: differing types in '(infos_produt.`state` = view.coverage_state)' (array<string> and string).
Can anyone help me?
PS: I would like to solve this without using 'explode', because the data is large and the explode function does not work for me at that size.
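The error arises because `infos_produt.state` on an array of structs yields an array<string>, which cannot be compared to a plain string with `==`. One way to avoid explode is to test membership with `array_contains` in the join condition. The sketch below is written in Scala to match the rest of this page; the same SQL expression string can be passed to `pyspark.sql.functions.expr` in PySpark. It assumes the join key is called `seller` on both sides (the question's snippet uses `seller_id`), and the sample rows and the aliases `a` and `b` are invented.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().master("local[*]").appName("array-contains-join").getOrCreate()
import spark.implicits._

// Made-up rows shaped like df1 and df2 from the question.
val df1 = Seq(("4858", "BDU", "00000", "xefd", "SP"))
  .toDF("sku", "category", "cep", "seller", "state")
val df2 = spark.read.json(Seq(
  """{"depth":6.1,"price":5.6,"sku":"47347","seller":"xefd","infos_produt":[{"modality_id":1,"state":"SP"},{"modality_id":2,"state":"RJ"}]}""",
  """{"depth":1.0,"price":2.0,"sku":"1","seller":"xefd","infos_produt":[{"modality_id":1,"state":"MG"}]}"""
).toDS())

// b.infos_produt.state is an array<string>; array_contains checks whether a.state is in it.
val condition = expr("a.seller = b.seller AND array_contains(b.infos_produt.state, a.state)")
val joined = df1.as("a").join(df2.as("b"), condition, "left")
joined.select("a.sku", "a.state", "b.sku").show(false)
```

Because the seller comparison is still an equality, the membership test only acts as an extra filter on the matched pairs, and no row multiplication from explode is involved.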

Parsing complex nested JSON in Spark Scala

I have a complex JSON file with the schema below that I need to convert to a DataFrame in Spark. Because the schema is complex, I am unable to do it completely.
The JSON file has a very complex schema, and using explode with column select might be problematic.
Below is the schema I am trying to convert:
root
|-- data: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- meta: struct (nullable = true)
| |-- view: struct (nullable = true)
| | |-- approvals: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- reviewedAt: long (nullable = true)
| | | | |-- reviewedAutomatically: boolean (nullable = true)
| | | | |-- state: string (nullable = true)
| | | | |-- submissionDetails: struct (nullable = true)
| | | | | |-- permissionType: string (nullable =
I have used the code below to flatten the data, but there is still nested data that I need to flatten into columns:
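import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Recursively turn every nested struct field into a top-level column (a.b -> a_b).
def flattenStructSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap { f =>
    val columnName = if (prefix == null) f.name else prefix + "." + f.name
    f.dataType match {
      case st: StructType => flattenStructSchema(st, columnName)
      // Arrays are not StructType, so they are kept as-is and stay nested.
      case _ => Array(col(columnName).as(columnName.replace(".", "_")))
    }
  }
}
val df2 = df.select(col("meta"))
val df4 = df.select(col("data"))
// Keep the DataFrame in df3; calling .show() here would assign Unit instead.
val df3 = df2.select(flattenStructSchema(df2.schema): _*)
df3.printSchema()
df3.show(10, false)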
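The flattenStructSchema helper in the question only descends into StructType fields, so an array of structs such as meta.view.approvals is left nested. A possible sketch is to explode that array first and then reuse the same helper on the result. It assumes one output row per approval is acceptable; the sample JSON and the names `approvals` and `flat` are invented, and `explode_outer` is used so that rows with a null or empty array are not dropped.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, explode_outer}
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder().master("local[*]").appName("flatten-approvals").getOrCreate()
import spark.implicits._

// Same helper as in the question: flattens nested structs, leaves arrays untouched.
def flattenStructSchema(schema: StructType, prefix: String = null): Array[Column] =
  schema.fields.flatMap { f =>
    val columnName = if (prefix == null) f.name else prefix + "." + f.name
    f.dataType match {
      case st: StructType => flattenStructSchema(st, columnName)
      case _ => Array(col(columnName).as(columnName.replace(".", "_")))
    }
  }

// Made-up document following the part of the schema that is visible in the question.
val df = spark.read.json(Seq(
  """{"data":[["a","b"]],"meta":{"view":{"approvals":[{"reviewedAt":1,"reviewedAutomatically":true,"state":"approved","submissionDetails":{"permissionType":"READ"}}]}}}"""
).toDS())

// Step 1: one row per array element. Step 2: flatten the struct that each row now holds.
val approvals = df.select(explode_outer(col("meta.view.approvals")).as("approval"))
val flat = approvals.select(flattenStructSchema(approvals.schema): _*)
flat.printSchema()
```

Each additional array level needs its own explode step before the struct flattening is applied again.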

Spark - Flatten Array of Structs using flatMap

I have a df with schema -
root
|-- arrayCol: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- email: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- qty: long (nullable = true)
| | |-- rqty: long (nullable = true)
| | |-- pids: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- sqty: long (nullable = true)
| | |-- id1: string (nullable = true)
| | |-- id2: string (nullable = true)
| | |-- window: struct (nullable = true)
| | | |-- end: string (nullable = true)
| | | |-- start: string (nullable = true)
| | |-- otherId: string (nullable = true)
|-- primarykey: string (nullable = true)
|-- runtime: string (nullable = true)
I don't want to use explode as it's extremely slow, and wanted to try flatMap instead.
I tried doing -
val ds = df1.as[(Array[StructType], String, String)]
ds.flatMap{ case(x, y, z) => x.map((_, y, z))}.toDF()
This gives me error -
scala.MatchError: org.apache.spark.sql.types.StructType
How do I flatten arrayCol?
Sample data -
{
"primaryKeys":"sfdfrdsdjn",
"runtime":"2020-10-31T13:01:04.813Z",
"arrayCol":[{"id":"qwerty","id1":"dsfdsfdsf","window":{"start":"2020-11-01T10:30:00Z","end":"2020-11-01T12:30:00Z"}, "email":[],"id2":"sdfsdfsdPuyOplzlR1idvfPkv5138g","rqty":3,"sqty":3,"qty":3,"otherId":null}]
}
Expected Output -
primaryKey runtime arrayCol
sfdfrdsdjn 2020-10-31T13:01:04.813Z {"id":"qwerty","id1":"dsfdsfdsf","window":{"start":"2020-11-01T10:30:00Z","end":"2020-11-01T12:30:00Z"}, "email":[],"id2":"sdfsdfsdPuyOplzlR1idvfPkv5138g","rqty":3,"sqty":3,"qty":3,"otherId":null}
I want one row for every element in arrayCol. Just like explode(arrayCol)
You almost had it. When using Spark with Scala, try to use the Dataset API as often as possible. This not only improves readability but also helps solve this type of issue very quickly.
// In spark-shell the implicits are already in scope; elsewhere add: import spark.implicits._
case class ArrayColWindow(end: String, start: String)
case class ArrayCol(id: String, email: Seq[String], qty: Long, rqty: Long, pids: Seq[String],
                    sqty: Long, id1: String, id2: String, window: ArrayColWindow, otherId: String)
case class FullArrayCols(arrayCol: Seq[ArrayCol], primarykey: String, runtime: String)
val inputTest = List(
  FullArrayCols(Seq(ArrayCol("qwerty", Seq(), 3, 3, Seq(), 3, "dsfdsfdsf", "sdfsdfsdPuyOplzlR1idvfPkv5138g",
    // Named arguments keep start and end aligned with the sample data.
    ArrayColWindow(end = "2020-11-01T12:30:00Z", start = "2020-11-01T10:30:00Z"), null)),
    "sfdfrdsdjn", "2020-10-31T13:01:04.813Z")
).toDS()
// One output row per element of arrayCol, with primarykey and runtime repeated.
val output = inputTest.as[(Seq[ArrayCol], String, String)].flatMap { case (x, y, z) => x.map((_, y, z)) }
output.show(truncate = false)
Alternatively, you could just change
val ds = df1.as[(Array[StructType], String, String)]
to
val ds = df1.as[(Array[String], String, String)]
and you can get rid of the error and see the output you want.

How to explode a DataFrame schema in Databricks

I have a schema that should be exploded, below is the schema
|-- CaseNumber: string (nullable = true)
|-- Customers: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Contacts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- FirstName: string (nullable = true)
| | | | |-- LastName: string (nullable = true)
I want my schema to be like this,
|-- CaseNumber: string (nullable = true)
|-- FirstName: string (nullable = true)
|-- LastName: string (nullable = true)
or
+----------+---------+--------+
|CaseNumber|FirstName|LastName|
+----------+---------+--------+
|1         |aa       |bb      |
|2         |cc       |dd      |
+----------+---------+--------+
I am new to Databricks; any help would be appreciated. Thanks.
Here is one way to solve it without using the explode command -
// Requires import spark.implicits._ for the encoders (already in scope in notebooks and spark-shell).
case class Contact(FirstName: String, LastName: String)
case class Customer(Contacts: Seq[Contact])
case class MyCase(CaseNumber: String, Customers: Seq[Customer])
val dataset = dataframe.as[MyCase]
// One row per contact, repeating the case number. Assumes neither array is null.
val flat = dataset.flatMap { mycase =>
  for {
    customer <- mycase.Customers
    contact <- customer.Contacts
  } yield (mycase.CaseNumber, contact.FirstName, contact.LastName)
}.toDF("CaseNumber", "FirstName", "LastName")
I think you can still use explode(customersFlat.contacts). I'm sure I did something like this a while ago, so forgive my syntax and let me know whether this works:
df.select(col("caseNumber"), explode(col("customersFlat.contacts")).as("contacts")).select("caseNumber", "contacts.firstName", "contacts.lastName")
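For the schema in the question, Customers and Contacts are both arrays, so a sketch with explode needs one step per array level. The example below assumes the column names exactly as printed in the schema. The two sample rows mirror the desired output table, and the aliases `customer` and `contact` are made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder().master("local[*]").appName("explode-contacts").getOrCreate()
import spark.implicits._

// Made-up input that follows the schema from the question.
val df = spark.read.json(Seq(
  """{"CaseNumber":"1","Customers":[{"Contacts":[{"FirstName":"aa","LastName":"bb"}]}]}""",
  """{"CaseNumber":"2","Customers":[{"Contacts":[{"FirstName":"cc","LastName":"dd"}]}]}"""
).toDS())

val flat = df
  // First level: one row per customer.
  .select(col("CaseNumber"), explode(col("Customers")).as("customer"))
  // Second level: one row per contact of that customer.
  .select(col("CaseNumber"), explode(col("customer.Contacts")).as("contact"))
  // Finally pull the struct fields up into top-level columns.
  .select(col("CaseNumber"), col("contact.FirstName"), col("contact.LastName"))
flat.show(false)
```

If cases without customers or contacts must be kept, `explode_outer` can be substituted for `explode` in both steps.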

How to extract all individual elements from a nested WrappedArray from a DataFrame in Spark

How can I get all individual elements from MEMBERDETAIL?
scala> xmlDF.printSchema
root
|-- MEMBERDETAIL: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- FILE_ID: double (nullable = true)
| | |-- INP_SOURCE_ID: long (nullable = true)
| | |-- NET_DB_CR_SW: string (nullable = true)
| | |-- NET_PYM_AMT: string (nullable = true)
| | |-- ORGNTD_DB_CR_SW: string (nullable = true)
| | |-- ORGNTD_PYM_AMT: double (nullable = true)
| | |-- RCVD_DB_CR_SW: string (nullable = true)
| | |-- RCVD_PYM_AMT: string (nullable = true)
| | |-- RECON_DATE: string (nullable = true)
| | |-- SLNO: long (nullable = true)
scala> xmlDF.head
res147: org.apache.spark.sql.Row = [WrappedArray([1.1610100000001425E22,1,D, 94,842.38,C,0.0,D, 94,842.38,2016-10-10,1], [1.1610100000001425E22,1,D, 33,169.84,C,0.0,D, 33,169.84,2016-10-10,2], [1.1610110000001425E22,1,D, 155,500.88,C,0.0,D, 155,500.88,2016-10-11,3], [1.1610110000001425E22,1,D, 164,952.29,C,0.0,D, 164,952.29,2016-10-11,4], [1.1610110000001425E22,1,D, 203,061.06,C,0.0,D, 203,061.06,2016-10-11,5], [1.1610110000001425E22,1,D, 104,040.01,C,0.0,D, 104,040.01,2016-10-11,6], [2.1610110000001427E22,1,C, 849.14,C,849.14,C, 0.00,2016-10-11,7], [1.1610100000001465E22,1,D, 3.78,C,0.0,D, 3.78,2016-10-10,1], [1.1610100000001465E22,1,D, 261.54,C,0.0,D, ...
After trying many approaches, I can only get an "Any" object like the one below, and I am still not able to read all fields separately.
xmlDF.select($"MEMBERDETAIL".getItem(0)).head().get(0)
res56: Any = [1.1610100000001425E22,1,D,94,842.38,C,0.0,D,94,842.38,2016-10-10,1]
And StructType is like below -
res61: org.apache.spark.sql.DataFrame = [MEMBERDETAIL[0]: struct<FILE_ID:double,INP_SOURCE_ID:bigint,NET_DB_CR_SW:string,NET_PYM_AMT:string,ORGNTD_DB_CR_SW:string,ORGNTD_PYM_AMT:double,RCVD_DB_CR_SW:string,RCVD_PYM_AMT:string,RECON_DATE:string,SLNO:bigint>]
This actually helped me -
xmlDF.selectExpr("explode(MEMBERDETAIL) as e").select("e.FILE_ID", "e.INP_SOURCE_ID", "e.NET_DB_CR_SW", "e.NET_PYM_AMT", "e.ORGNTD_DB_CR_SW", "e.ORGNTD_PYM_AMT", "e.RCVD_DB_CR_SW", "e.RCVD_PYM_AMT", "e.RECON_DATE").show()
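A shorter variant of the same idea, sketched under the assumption that every field of the struct is wanted, is to expand the exploded struct with a star instead of listing each field. The sample data below is invented and uses only three of the fields from the schema.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder().master("local[*]").appName("explode-star").getOrCreate()
import spark.implicits._

// Made-up stand-in for xmlDF with a subset of the MEMBERDETAIL fields.
val xmlDF = spark.read.json(Seq(
  """{"MEMBERDETAIL":[{"FILE_ID":1.0,"SLNO":1,"RECON_DATE":"2016-10-10"},{"FILE_ID":2.0,"SLNO":2,"RECON_DATE":"2016-10-11"}]}"""
).toDS())

// "e.*" expands every field of the exploded struct into its own column.
val members = xmlDF.select(explode(col("MEMBERDETAIL")).as("e")).select("e.*")
members.show(false)

// Individual values can then be read from a Row by field name.
val firstDate = members.orderBy("SLNO").head().getAs[String]("RECON_DATE")
```

This also picks up fields such as SLNO without having to name them explicitly.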