Below is my data structure:
root
|-- platform_build_id: string (nullable = true)
|-- pro: struct (nullable = true)
| |-- av: string (nullable = true)
| |-- avc: string (nullable = true)
i tried using explode function
val flattened = Data_df.withColumn("pro", explode(array($"pro")))
this will work if there is an element inside pro column, but in my case what should i use to get this data into flattened format.
Use .select() with struct column (pro.*) will result flatten format.
Data_df.select("platform_build_id","pro.*").show()
Example:
val s_ds=Seq("""{"platform_build_id":"1234","pro":{"av":"pro_av","avc":"pro_avc"}}""").toDS
spark.read.json(s_ds).printSchema
//root
// |-- platform_build_id: string(nullable = true)
// |-- pro: struct (nullable = true)
// | |-- av: string (nullable = true)
// | |-- avc: string (nullable = true)
spark.read.json(s_ds).select("platform_build_id","pro.*").show()
Result:
+-----------------+------+-------+
|platform_build_id| av| avc|
+-----------------+------+-------+
| 1234|pro_av|pro_avc|
+-----------------+------+-------+
Related
Assume df has the following structure:
root
|-- id: decimal(38,0) (nullable = true)
|-- text: string (nullable = true)
here text contains strings of roughly-XML type records. I'm then able to apply the following steps to extract the necessary entries into a flat table:
First, append the root node, since there is none originally. (Question #1: is this step necessary, or can be omitted?)
val df2 = df.withColumn("text", concat(lit("<root>"),$"text",lit("</root>")))
Next, parsing the XML:
val payloadSchema = schema_of_xml(df.select("text").as[String])
val df3 = spark.read.option("rootTag","root").option("rowTag","row").schema(payloadSchema)xml(df2.select("text").as[String])
This generates df3:
root
|-- row: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
which I finally explode:
val df4 = df3.withColumn("exploded_cols", explode($"row"))
into
root
|-- row: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
|-- exploded_cols: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- value: string (nullable = true)
My goal is the following table:
val df5 = df4.select("exploded_cols.*")
with
root
|-- key: string (nullable = true)
|-- value: string (nullable = true)
Main question:
I want that the final table would also contain the id: decimal(38,0) (nullable = true) entries along with the exploded key, value columns, e.g.,
root
|-- id: decimal(38,0) (nullable = true)
|-- key: string (nullable = true)
|-- value: string (nullable = true)
however, I'm not sure how to call spark.read.option without selecting df2.select("text").as[String] separately into the method (see df3). Is it possible to simplify this script?
This should be straightforward, so I'm not sure a reproducible example is necessary. Also, I'm coming blind from an r background, so I'm missing all the scala basics, but trying to learn as I go.
Use from_xml function of spak-xml library.
val df = // Read source data
val schema = // Define schema of XML text
df.withColumn("xmlData", from_xml("xmlColName", schema))
I had a deep nested JSON files which I had to process, and in order to do that I had to flatten them because couldn't find a way to hash some deep nested fields. This is how my dataframe looks like (after flattening):
scala> flattendedJSON.printSchema
root
|-- header_appID: string (nullable = true)
|-- header_appVersion: string (nullable = true)
|-- header_userID: string (nullable = true)
|-- body_cardId: string (nullable = true)
|-- body_cardStatus: string (nullable = true)
|-- body_cardType: string (nullable = true)
|-- header_userAgent_browser: string (nullable = true)
|-- header_userAgent_browserVersion: string (nullable = true)
|-- header_userAgent_deviceName: string (nullable = true)
|-- body_beneficiary_beneficiaryAccounts_beneficiaryAccountOwner: string (nullable = true)
|-- body_beneficiary_beneficiaryPhoneNumbers_beneficiaryPhoneNumber: string (nullable = true)
And I need to convert it back to original structure (before flattening):
scala> nestedJson.printSchema
root
|-- header: struct (nullable = true)
| |-- appID: string (nullable = true)
| |-- appVersion: string (nullable = true)
| |-- userAgent: struct (nullable = true)
| | |-- browser: string (nullable = true)
| | |-- browserVersion: string (nullable = true)
| | |-- deviceName: string (nullable = true)
|-- body: struct (nullable = true)
| |-- beneficiary: struct (nullable = true)
| | |-- beneficiaryAccounts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryAccountOwner: string (nullable = true)
| | |-- beneficiaryPhoneNumbers: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryPhoneNumber: string (nullable = true)
| |-- cardId: string (nullable = true)
| |-- cardStatus: string (nullable = true)
| |-- cardType: string (nullable = true)
I've managed to do it with single nested field, but if it's more, it can't work and I can't find a way to do it properly. Here's what I tried:
val structColumns = flattendedJSON.columns.filter(_.contains("_"))
val structColumnsMap = structColumns.map(_.split("\\_")).
groupBy(_(0)).mapValues(_.map(_(1)))
val dfExpanded = structColumnsMap.foldLeft(flattendedJSON){ (accDF, kv) =>
val cols = kv._2.map(v => col("`" + kv._1 + "_" + v + "`").as(v))
accDF.withColumn(kv._1, struct(cols: _*))
}
val dfResult = structColumns.foldLeft(flattendedJSON)(_ drop _)
And it's working if I have one nested object (e.g. header_appID), but in case of header_userAgent_browser, I get an exception:
org.apache.spark.sql.AnalysisException: cannot resolve
'header_userAgent' given input columns: ..
Using Spark 2.3 and Scala 2.11.8
I would recommend use case classes to work with a Dataset instead of flatten the DF and then again try to convert to the old json format. Even if it has nested objects you can define a set of case classes to cast it. It allows you to work with an object notation making the things easier than DF.
There are tools where you can provide a sample of the json and it generates the classes for you (I use this: https://json2caseclass.cleverapps.io).
If you anyways want to convert it from the DF, an alternative could be, create a Dataset using map on your DF. Something like this:
case class NestedNode(fieldC: String, fieldD: String) // for JSON
case class MainNode(fieldA: String, fieldB: NestedNode) // for JSON
case class FlattenData(fa: String, fc: String, fd: String)
Seq(
FlattenData("A1", "B1", "C1"),
FlattenData("A2", "B2", "C2"),
FlattenData("A3", "B3", "C3")
).toDF
.as[FlattenData] // Cast it to access with object notation
.map(flattenItem=>{
MainNode(flattenItem.fa, NestedNode(flattenItem.fc, flattenItem.fd) ) // Creating output format
})
At the end, that schema defined with the classes will be used by yourDS.write.mode(your_save_mode).json(your_target_path)
I have a dataframe with following schema :-
scala> final_df.printSchema
root
|-- mstr_prov_id: string (nullable = true)
|-- prov_ctgry_cd: string (nullable = true)
|-- prov_orgnl_efctv_dt: timestamp (nullable = true)
|-- prov_trmntn_dt: timestamp (nullable = true)
|-- prov_trmntn_rsn_cd: string (nullable = true)
|-- npi_rqrd_ind: string (nullable = true)
|-- prov_stts_aray_txt: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- PROV_STTS_KEY: string (nullable = true)
| | |-- PROV_STTS_EFCTV_DT: timestamp (nullable = true)
| | |-- PROV_STTS_CD: string (nullable = true)
| | |-- PROV_STTS_TRMNTN_DT: timestamp (nullable = true)
| | |-- PROV_STTS_TRMNTN_RSN_CD: string (nullable = true)
I am running following code to do basic cleansing but its not working inside "prov_stts_aray_txt" , basically its not going inside array type and performing transformation desire. I want to iterate through out nested all fields(Flat and nested field within Dataframe and perform basic transformation.
for(dt <- final_df.dtypes){
final_df = final_df.withColumn(dt._1,when(upper(trim(col(dt._1))) === "NULL",lit(" ")).otherwise(col(dt._1)))
}
please help.
Thanks
I'm using Spark's MlLib DataFrame ALS functionality on Spark 2.2.0. I had to run my userId and itemId fields through an StringIndexer to get things going
The method 'recommendForAllUsers' returns the following schema
root
|-- recommendations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemIdIndex: long (nullable = true)
| | |-- rating: double (nullable = true)
|-- userIdIndex: string (nullable = true)
This is perfect for my needs (would love not to flatten it) but I need to replace userIdIndex and itemIdIndex with their actual value
for the userIdIndex was ok (I couldn't simply reverse it with IndexToString as the ALS FITTING seems to erase the link between index and value):
df.join(df2, df2("userIdIndex")===df("userIdIndex"), "left")
.select(df2("userId"), df("recommendations"))
where df2 looks like this:
+------------------+--------------------+----------+-----------+-----------+
| userId| itemId| rating|userIdIndex|itemIdIndex|
+------------------+--------------------+----------+-----------+-----------+
|glorified-consumer| item-22302| 3.0| 15.0| 4.0|
the result is this schema:
root
|-- userId: string (nullable = true)
|-- recommendations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemIdIndex: integer (nullable = true)
| | |-- rating: float (nullable = true)
QUESTION: for itemIdIndex, being inside an array of structures.
You can explode the array so that struct is only remained as
val tempdf2 = df2.withColumn("recommendations", explode('recommendations))
which should leave you with schema as
root
|-- userdId: string (nullable = true)
|-- recommendations: struct (nullable = true)
| |-- itemIdIndex: string (nullable = true)
| |-- rating: string (nullable = true)
Do the same for df (the first dataframe)
Then after that you can join them as
tempdf1.join(tempdf2, tempdf1("recommendations.itemIndex") === tempdf2("recommendations.itemIndex"))
I am trying to parse a Wikipedia dump using Databricks XML parser and Spark's pipeline approach. The goal is to compute a feature vector for the text field, which is a nested column.
The schema of the XML is as follows:
root
|-- id: long (nullable = true)
|-- ns: long (nullable = true)
|-- revision: struct (nullable = true)
| |-- comment: string (nullable = true)
| |-- contributor: struct (nullable = true)
| | |-- id: long (nullable = true)
| | |-- ip: string (nullable = true)
| | |-- username: string (nullable = true)
| |-- format: string (nullable = true)
| |-- id: long (nullable = true)
| |-- minor: string (nullable = true)
| |-- model: string (nullable = true)
| |-- parentid: long (nullable = true)
| |-- sha1: string (nullable = true)
| |-- text: struct (nullable = true)
| | |-- _VALUE: string (nullable = true)
| | |-- _bytes: long (nullable = true)
| | |-- _space: string (nullable = true)
| |-- timestamp: string (nullable = true)
|-- title: string (nullable = true)
After reading in the dump with
val raw = spark.read.format("com.databricks.spark.xml").option("rowTag", "page").load("some.xml")
I am able to access the respective text using
raw.select("revision.text._VALUE").show(10)
I have then as my first stage in the Spark pipeline a RegexTokenizer, which needs to access revision.text._VALUE in order to transform the data:
val tokenizer = new RegexTokenizer().
setInputCol("revision.text._VALUE").
setOutputCol("tokens").
setMinTokenLength(3).
setPattern("\\s+|\\/|_|-").
setToLowercase(true)
val pipeline = new Pipeline().setStages(Array(tokenizer))
val model = pipeline.fit(raw)
However, this step fails with:
Name: java.lang.IllegalArgumentException
Message: Field "revision.text._VALUE" does not exist.
Any advise on how to have nested columns in the setInputCol method?
Thanks a lot!
Try creating a temp column before using in RegexTokenizer as
val rawTemp = raw.withColumn("temp", $"revision.text._VALUE")
Then you can use rawTemp dataframe and temp column in RegexTokenizer as
val tokenizer = new RegexTokenizer().
setInputCol("temp").
setOutputCol("tokens").
setMinTokenLength(3).
setPattern("\\s+|\\/|_|-").
setToLowercase(true)
val pipeline = new Pipeline().setStages(Array(tokenizer))
val model = pipeline.fit(rawTemp)
Hope the answer is helpful