Spark Scala: extracting XML from one column

Assume df has the following structure:
root
|-- id: decimal(38,0) (nullable = true)
|-- text: string (nullable = true)
Here text contains strings of roughly XML-formatted records. I'm then able to apply the following steps to extract the necessary entries into a flat table:
First, append a root node, since there is none originally. (Question #1: is this step necessary, or can it be omitted?)
val df2 = df.withColumn("text", concat(lit("<root>"),$"text",lit("</root>")))
Next, parsing the XML:
val payloadSchema = schema_of_xml(df.select("text").as[String])
val df3 = spark.read.option("rootTag","root").option("rowTag","row").schema(payloadSchema).xml(df2.select("text").as[String])
This generates df3:
root
|-- row: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
which I finally explode:
val df4 = df3.withColumn("exploded_cols", explode($"row"))
into
root
|-- row: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
|-- exploded_cols: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- value: string (nullable = true)
My goal is the following table:
val df5 = df4.select("exploded_cols.*")
with
root
|-- key: string (nullable = true)
|-- value: string (nullable = true)
Main question:
I want the final table to also contain the id: decimal(38,0) (nullable = true) column along with the exploded key and value columns, e.g.,
root
|-- id: decimal(38,0) (nullable = true)
|-- key: string (nullable = true)
|-- value: string (nullable = true)
However, I'm not sure how to keep the id column when calling spark.read, since only df2.select("text").as[String] is passed into the method (see df3). Is it possible to simplify this script?
This should be straightforward, so I'm not sure a reproducible example is necessary. Also, I'm coming in blind from an R background, so I'm missing all the Scala basics, but trying to learn as I go.

Use the from_xml function of the spark-xml library.
import com.databricks.spark.xml.functions.from_xml

val df = ???      // read source data
val schema = ???  // define the schema of the XML text
df.withColumn("xmlData", from_xml($"xmlColName", schema))

Related

Handling varying JSON schema when creating a dataframe in PySpark

I've got a Databricks notebook that reads the delta data in JSON format every hour. So let's say at 11AM the schema of the file is as follows:
root
|-- number: string (nullable = true)
|-- company: string (nullable = true)
|-- assignment: struct (nullable = true)
| |-- link: string (nullable = true)
| |-- value: string (nullable = true)
The next hour at 12PM the schema changes to,
root
|-- number: string (nullable = true)
|-- company: struct (nullable = true)
| |-- link: string (nullable = true)
| |-- value: string (nullable = true)
|-- assignment: struct (nullable = true)
| |-- link: string (nullable = true)
| |-- value: string (nullable = true)
Some of the columns change from string to struct and vice versa. So if I select col("company.link") and the incoming schema has company as a string, the code fails.
How do I handle schema changes in PySpark when reading the file, as my end goal is to flatten the JSON to CSV format?
from pyspark.sql.functions import col

def get_dtype(df, colname):
    return [dtype for name, dtype in df.dtypes if name == colname][0]

# df has the exploded JSON data
df2 = df.select("result.number",
                "result.company",
                "result.assignment_group")

df23 = df2
for name, dtype in df2.dtypes:
    if 'struct' in get_dtype(df2, name):
        try:
            df23 = df23.withColumn(name + "_link", col(name + ".link")) \
                       .withColumn(name + "_value", col(name + ".value")) \
                       .drop(name)
        except Exception:
            print("error")
df23.printSchema()
root
|-- number: string (nullable = true)
|-- company: string (nullable = true)
|-- assignment_group_link: string (nullable = true)
|-- assignment_group_value: string (nullable = true)
So this is what I did:
created a function that identifies whether a column is of type struct,
read all the columns from the base dataframe that holds the exploded JSON result,
then looped through the columns and, where one is of type struct, added new columns with the nested values.

How to change datatype of a field in a two-level schema tree?

Now I have a dataframe with schema:
root
|-- id: string (nullable = true)
|-- st_one: struct (nullable = true)
| |-- tid: long (nullable = true)
| |-- st_two: struct (nullable = true)
| | |-- name: string (nullable = true)
| | |-- score: long (nullable = true)
|-- ts: double (nullable = true)
|-- date: string (nullable = true)
I want to change score's type from long to double. Is there any good solution?
BTW, I'm using Scala.
I already know how to do it by "listing" all the fields. I want a more general method that would work even if st_two contained a thousand fields or more.
You can update the struct type column st_one like this:
val df1 = df.withColumn(
  "st_one",
  struct(
    $"st_one.tid",
    struct(
      $"st_one.st_two.name",
      $"st_one.st_two.score".cast("double").as("score")
    ).as("st_two")
  )
)
You can do a complex cast:
val df2 = df.withColumn("st_one", $"st_one".cast("struct<tid:long, st_two:struct<name:string, score:double>>"))
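For the "thousand fields or more" case, neither writing the struct out by hand nor spelling out the full cast string scales well. Here is a hedged sketch of a more generic variant (my own illustration, not part of the answers above; it assumes the field to change is always named score and is currently a long, and df3 is just an illustrative name):
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Recursively rewrite a DataType, turning any field named "score" of type
// long into double and leaving every other field untouched.
def retype(dt: DataType): DataType = dt match {
  case s: StructType =>
    StructType(s.fields.map { f =>
      val newType =
        if (f.name == "score" && f.dataType == LongType) DoubleType
        else retype(f.dataType)
      f.copy(dataType = newType)
    })
  case a: ArrayType => a.copy(elementType = retype(a.elementType))
  case other => other
}

// Cast st_one to the rewritten type; Spark performs the nested cast itself,
// so st_two can contain as many fields as it likes.
val targetType = retype(df.schema("st_one").dataType)
val df3 = df.withColumn("st_one", col("st_one").cast(targetType))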

Convert flattened data frame to struct in Spark

I had deeply nested JSON files which I had to process, and in order to do that I had to flatten them, because I couldn't find a way to hash some deeply nested fields. This is how my dataframe looks (after flattening):
scala> flattendedJSON.printSchema
root
|-- header_appID: string (nullable = true)
|-- header_appVersion: string (nullable = true)
|-- header_userID: string (nullable = true)
|-- body_cardId: string (nullable = true)
|-- body_cardStatus: string (nullable = true)
|-- body_cardType: string (nullable = true)
|-- header_userAgent_browser: string (nullable = true)
|-- header_userAgent_browserVersion: string (nullable = true)
|-- header_userAgent_deviceName: string (nullable = true)
|-- body_beneficiary_beneficiaryAccounts_beneficiaryAccountOwner: string (nullable = true)
|-- body_beneficiary_beneficiaryPhoneNumbers_beneficiaryPhoneNumber: string (nullable = true)
And I need to convert it back to the original structure (before flattening):
scala> nestedJson.printSchema
root
|-- header: struct (nullable = true)
| |-- appID: string (nullable = true)
| |-- appVersion: string (nullable = true)
| |-- userAgent: struct (nullable = true)
| | |-- browser: string (nullable = true)
| | |-- browserVersion: string (nullable = true)
| | |-- deviceName: string (nullable = true)
|-- body: struct (nullable = true)
| |-- beneficiary: struct (nullable = true)
| | |-- beneficiaryAccounts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryAccountOwner: string (nullable = true)
| | |-- beneficiaryPhoneNumbers: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryPhoneNumber: string (nullable = true)
| |-- cardId: string (nullable = true)
| |-- cardStatus: string (nullable = true)
| |-- cardType: string (nullable = true)
I've managed to do it with a single level of nesting, but with more levels it doesn't work and I can't find a way to do it properly. Here's what I tried:
val structColumns = flattendedJSON.columns.filter(_.contains("_"))

val structColumnsMap = structColumns.map(_.split("\\_"))
  .groupBy(_(0)).mapValues(_.map(_(1)))

val dfExpanded = structColumnsMap.foldLeft(flattendedJSON) { (accDF, kv) =>
  val cols = kv._2.map(v => col("`" + kv._1 + "_" + v + "`").as(v))
  accDF.withColumn(kv._1, struct(cols: _*))
}

val dfResult = structColumns.foldLeft(flattendedJSON)(_ drop _)
And it works if I have one level of nesting (e.g. header_appID), but in the case of header_userAgent_browser, I get an exception:
org.apache.spark.sql.AnalysisException: cannot resolve
'header_userAgent' given input columns: ..
Using Spark 2.3 and Scala 2.11.8
I would recommend using case classes and working with a Dataset instead of flattening the DF and then trying to convert it back to the old JSON format. Even if it has nested objects, you can define a set of case classes to cast it. It lets you work with object notation, which makes things easier than with a DF.
There are tools where you can provide a sample of the JSON and they generate the classes for you (I use this: https://json2caseclass.cleverapps.io).
If you still want to convert it from the DF, an alternative could be to create a Dataset using map on your DF. Something like this:
import spark.implicits._

case class NestedNode(fieldC: String, fieldD: String)       // target nested JSON shape
case class MainNode(fieldA: String, fieldB: NestedNode)     // target top-level JSON shape
case class FlattenData(fa: String, fc: String, fd: String)  // flattened input shape

Seq(
  FlattenData("A1", "B1", "C1"),
  FlattenData("A2", "B2", "C2"),
  FlattenData("A3", "B3", "C3")
).toDF
  .as[FlattenData]      // cast it so fields can be accessed with object notation
  .map { flattenItem =>
    MainNode(flattenItem.fa, NestedNode(flattenItem.fc, flattenItem.fd))  // creating the output format
  }
In the end, the schema defined by the classes is what gets written out by yourDS.write.mode(your_save_mode).json(your_target_path).

Use spark dataframe column value as an alias of another column

Using Spark and Scala, I would like to build a struct and use one column's value as the alias of another column.
I have this dataframe
root
|-- type: string (nullable = true)
|-- metadata
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- gender: string (nullable = true)
|-- country: string (nullable = true)
And I would like to have this:
root
|-- metadata
|-- TYPE_VALUE
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- gender: string (nullable = true)
|-- country: string (nullable = true)
In my dataframe, I tried struct($"metadata".as($"type".toString())).alias("metadata"), but it doesn't work; it takes the field name instead of the value.
Well that is not going to work, because that would require a dynamic schema that is not known beforehand.
The best you could do is create a mapping out of it:
import org.apache.spark.sql.functions.map

df.select(
  map('type, 'metadata).as("metadata")
)
With an output like:
+-------------------------------+
|metadata |
+-------------------------------+
|Map(type1 -> [Tom,38,M,NL]) |
|Map(type2 -> [Marijke,37,F,NL])|
+-------------------------------+
res1: Unit = ()
root
|-- metadata: map (nullable = false)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- name: string (nullable = true)
| | |-- age: long (nullable = false)
| | |-- gender: string (nullable = true)
| | |-- country: string (nullable = true)
Or just split the data based on the type and process each type as a separate dataframe, as sketched below.
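A minimal sketch of that splitting idea (my own illustration, not from the answer above; only the type and metadata column names come from the question, the rest is assumed):
import spark.implicits._

// Collect the distinct type values, then handle each slice on its own.
// Within one slice the type is constant, so metadata can simply be
// renamed to that type's value.
val typeValues = df.select("type").distinct.as[String].collect()

val byType: Map[String, org.apache.spark.sql.DataFrame] =
  typeValues.map { t =>
    t -> df.filter($"type" === t).select($"metadata".as(t))
  }.toMap
Each entry of byType can then be processed or written out independently.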

Access nested columns in transformers setInputCol() method

I am trying to parse a Wikipedia dump using the Databricks XML parser and Spark's pipeline approach. The goal is to compute a feature vector for the text field, which is a nested column.
The schema of the XML is as follows:
root
|-- id: long (nullable = true)
|-- ns: long (nullable = true)
|-- revision: struct (nullable = true)
| |-- comment: string (nullable = true)
| |-- contributor: struct (nullable = true)
| | |-- id: long (nullable = true)
| | |-- ip: string (nullable = true)
| | |-- username: string (nullable = true)
| |-- format: string (nullable = true)
| |-- id: long (nullable = true)
| |-- minor: string (nullable = true)
| |-- model: string (nullable = true)
| |-- parentid: long (nullable = true)
| |-- sha1: string (nullable = true)
| |-- text: struct (nullable = true)
| | |-- _VALUE: string (nullable = true)
| | |-- _bytes: long (nullable = true)
| | |-- _space: string (nullable = true)
| |-- timestamp: string (nullable = true)
|-- title: string (nullable = true)
After reading in the dump with
val raw = spark.read.format("com.databricks.spark.xml").option("rowTag", "page").load("some.xml")
I am able to access the respective text using
raw.select("revision.text._VALUE").show(10)
As the first stage of my Spark pipeline I then have a RegexTokenizer, which needs to access revision.text._VALUE in order to transform the data:
val tokenizer = new RegexTokenizer()
  .setInputCol("revision.text._VALUE")
  .setOutputCol("tokens")
  .setMinTokenLength(3)
  .setPattern("\\s+|\\/|_|-")
  .setToLowercase(true)
val pipeline = new Pipeline().setStages(Array(tokenizer))
val model = pipeline.fit(raw)
However, this step fails with:
Name: java.lang.IllegalArgumentException
Message: Field "revision.text._VALUE" does not exist.
Any advice on how to use nested columns in the setInputCol method?
Thanks a lot!
Try creating a temp column before using it in the RegexTokenizer, as follows:
val rawTemp = raw.withColumn("temp", $"revision.text._VALUE")
Then you can use the rawTemp dataframe and the temp column in the RegexTokenizer:
val tokenizer = new RegexTokenizer()
  .setInputCol("temp")
  .setOutputCol("tokens")
  .setMinTokenLength(3)
  .setPattern("\\s+|\\/|_|-")
  .setToLowercase(true)
val pipeline = new Pipeline().setStages(Array(tokenizer))
val model = pipeline.fit(rawTemp)
Hope the answer is helpful