Access nested columns in transformers setInputCol() method - scala

I am trying to parse a Wikipedia dump using the Databricks XML parser and Spark's pipeline approach. The goal is to compute a feature vector for the text field, which is a nested column.
The schema of the XML is as follows:
root
|-- id: long (nullable = true)
|-- ns: long (nullable = true)
|-- revision: struct (nullable = true)
| |-- comment: string (nullable = true)
| |-- contributor: struct (nullable = true)
| | |-- id: long (nullable = true)
| | |-- ip: string (nullable = true)
| | |-- username: string (nullable = true)
| |-- format: string (nullable = true)
| |-- id: long (nullable = true)
| |-- minor: string (nullable = true)
| |-- model: string (nullable = true)
| |-- parentid: long (nullable = true)
| |-- sha1: string (nullable = true)
| |-- text: struct (nullable = true)
| | |-- _VALUE: string (nullable = true)
| | |-- _bytes: long (nullable = true)
| | |-- _space: string (nullable = true)
| |-- timestamp: string (nullable = true)
|-- title: string (nullable = true)
After reading in the dump with
val raw = spark.read.format("com.databricks.spark.xml").option("rowTag", "page").load("some.xml")
I am able to access the respective text using
raw.select("revision.text._VALUE").show(10)
The first stage of my Spark pipeline is then a RegexTokenizer, which needs to access revision.text._VALUE in order to transform the data:
val tokenizer = new RegexTokenizer().
  setInputCol("revision.text._VALUE").
  setOutputCol("tokens").
  setMinTokenLength(3).
  setPattern("\\s+|\\/|_|-").
  setToLowercase(true)
val pipeline = new Pipeline().setStages(Array(tokenizer))
val model = pipeline.fit(raw)
However, this step fails with:
Name: java.lang.IllegalArgumentException
Message: Field "revision.text._VALUE" does not exist.
Any advice on how to use nested columns in the setInputCol() method?
Thanks a lot!

Pipeline stages look the input column up as a literal top-level field name in the schema, so dotted paths into structs are not resolved. Try creating a temp column before using it in the RegexTokenizer:
val rawTemp = raw.withColumn("temp", $"revision.text._VALUE")
Then you can use the rawTemp dataframe and the temp column in the RegexTokenizer:
val tokenizer = new RegexTokenizer().
  setInputCol("temp").
  setOutputCol("tokens").
  setMinTokenLength(3).
  setPattern("\\s+|\\/|_|-").
  setToLowercase(true)
val pipeline = new Pipeline().setStages(Array(tokenizer))
val model = pipeline.fit(rawTemp)
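If you prefer the flattening to live inside the pipeline itself rather than as a separate pre-processing step, a SQLTransformer stage can create the temporary column as part of the pipeline. A minimal sketch (the flattener stage and the temp column name are illustrative, not from the original post):
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{RegexTokenizer, SQLTransformer}
// First stage: pull the nested field up to a top-level column.
// SQLTransformer substitutes the input dataframe for __THIS__.
val flattener = new SQLTransformer().
  setStatement("SELECT *, revision.text._VALUE AS temp FROM __THIS__")
val tokenizer = new RegexTokenizer().
  setInputCol("temp").
  setOutputCol("tokens").
  setMinTokenLength(3).
  setPattern("\\s+|\\/|_|-").
  setToLowercase(true)
val pipeline = new Pipeline().setStages(Array(flattener, tokenizer))
val model = pipeline.fit(raw)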
Hope the answer is helpful

Related

spark scala: extracting xml from one column

Assume df has the following structure:
root
|-- id: decimal(38,0) (nullable = true)
|-- text: string (nullable = true)
Here text contains strings of roughly XML-like records. I'm then able to apply the following steps to extract the necessary entries into a flat table:
First, append a root node, since there is none originally. (Question #1: is this step necessary, or can it be omitted?)
val df2 = df.withColumn("text", concat(lit("<root>"),$"text",lit("</root>")))
Next, parsing the XML:
val payloadSchema = schema_of_xml(df.select("text").as[String])
val df3 = spark.read.option("rootTag","root").option("rowTag","row").schema(payloadSchema).xml(df2.select("text").as[String])
This generates df3:
root
|-- row: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
which I finally explode:
val df4 = df3.withColumn("exploded_cols", explode($"row"))
into
root
|-- row: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
|-- exploded_cols: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- value: string (nullable = true)
My goal is the following table:
val df5 = df4.select("exploded_cols.*")
with
root
|-- key: string (nullable = true)
|-- value: string (nullable = true)
Main question:
I want the final table to also contain the id: decimal(38,0) (nullable = true) entries along with the exploded key, value columns, e.g.,
root
|-- id: decimal(38,0) (nullable = true)
|-- key: string (nullable = true)
|-- value: string (nullable = true)
however, I'm not sure how to keep the id column, since spark.read only receives df2.select("text").as[String] (see df3) and drops everything else. Is it possible to simplify this script?
This should be straightforward, so I'm not sure a reproducible example is necessary. Also, I'm coming in blind from an R background, so I'm missing all the Scala basics, but I'm trying to learn as I go.
Use the from_xml function of the spark-xml library.
import com.databricks.spark.xml.functions.from_xml
val df = // Read source data
val schema = // Define schema of XML text
df.withColumn("xmlData", from_xml($"xmlColName", schema)) // from_xml takes a Column, not a name

rename spark dataframe structType fields

Given a dynamic StructType whose name is not known in advance; it is dynamic, and hence its name changes.
The name is variable, so don't assume "MAIN_COL" in the schema.
root
|-- MAIN_COL: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: string (nullable = true)
| |-- c: string (nullable = true)
| |-- d: string (nullable = true)
| |-- f: long (nullable = true)
| |-- g: long (nullable = true)
| |-- h: long (nullable = true)
| |-- j: long (nullable = true)
How can we write dynamic code to rename the fields of a StructType, using the struct's name as a prefix?
root
|-- MAIN_COL: struct (nullable = true)
| |-- MAIN_COL_a: string (nullable = true)
| |-- MAIN_COL_b: string (nullable = true)
| |-- MAIN_COL_c: string (nullable = true)
| |-- MAIN_COL_d: string (nullable = true)
| |-- MAIN_COL_f: long (nullable = true)
| |-- MAIN_COL_g: long (nullable = true)
| |-- MAIN_COL_h: long (nullable = true)
| |-- MAIN_COL_j: long (nullable = true)
You can use the schema DSL to update nested columns, picking up the struct's name dynamically instead of hard-coding "MAIN_COL":
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._
// The struct column, whatever it happens to be called
val structField = df.schema.fields.head
val prefix = structField.name
val schema: StructType = structField.dataType.asInstanceOf[StructType]
val updatedSchema = StructType(
  schema.fields.map(sf => StructField(prefix + "_" + sf.name, sf.dataType, sf.nullable))
)
val resultDF = df.withColumn(prefix, col(prefix).cast(updatedSchema))
Updated Schema:
root
|-- MAIN_COL: struct (nullable = false)
| |-- MAIN_COL_a: string (nullable = true)
| |-- MAIN_COL_b: string (nullable = true)
| |-- MAIN_COL_c: string (nullable = true)
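Once the fields carry unique prefixed names, the struct can be flattened without column-name collisions, which is usually the point of the renaming. A short usage sketch:
// Flatten the struct; nested fields keep their prefixed names.
val flatDF = resultDF.select(col(prefix + ".*"))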

How to display the string variable in the root using Spark SQL?

Long story short: I am using Spark code in Scala IDE to convert JSON to CSV. I don't have much knowledge of Spark, as I have worked only on RDBMSs like Oracle, TD and DB2. All I was given was the code that converts the JSON data to CSV and how to pass the arguments to retrieve data from the schema.
Now, I am able to fetch the data that is inside a struct and array by using
val val1 = df.select(explode($"data.business").as("ID")).select($"ID.amountTO")
val1.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save(args(2) + "\\Result" + "\\" + timeForpath + "\\val1")
I don't know how to export the columns that are not in a struct but sit directly at the root of the schema, like QAYONOutCome, QA1PartiesComments, etc.
root
|-- QAYONOutCome: string (nullable = true)
|-- QA1PartiesComments: string (nullable = true)
|-- QA1PartiesQID: string (nullable = true)
|-- QA1PartiesResponse: string (nullable = true)
|-- QAHolderTypeComments: string (nullable = true)
|-- QAHolderTypeQID: string (nullable = true)
|-- QAHolderTypeResponse: string (nullable = true)
|-- QAhighRiskComments: string (nullable = true)
|-- QAhighRiskQID: string (nullable = true)
|-- QAhighRiskResponse: string (nullable = true)
|-- QA2ClassComments: string (nullable = true)
|-- QA2ClassQID: string (nullable = true)
|-- QA2ClassResponse: string (nullable = true)
|-- QAoutcomeComments: string (nullable = true)
|-- QAoutcomeQID: string (nullable = true)
|-- QAoutcomeResponse: string (nullable = true)
|-- data: struct (nullable = true)
| |-- business: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- amountTO: string (nullable = true)
| | | |-- ID: string (nullable = true)
| | | |-- Registration: struct (nullable = true)
| | | | |-- country: string (nullable = true)
| | | | |-- id: long (nullable = true)
| | | | |-- line1: string (nullable = true)
| | | | |-- line2: string (nullable = true)
| | | | |-- postCode: string (nullable = true)
Any help is appreciated. Apologies if my question sounds very dumb :(. Please let me know if some more information is needed to provide a solution or clarity. Thanks much in advance.
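For reference, top-level fields are ordinary columns and can be selected directly by name; no explode is needed. A sketch reusing the CSV-writing pattern from the question (column list abbreviated):
// Root-level columns: select them by name, no struct navigation required.
val rootCols = df.select($"QAYONOutCome", $"QA1PartiesComments", $"QA1PartiesQID")
rootCols.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save(args(2) + "\\Result" + "\\" + timeForpath + "\\rootCols")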

Iterating through nested element in spark

I have a dataframe with the following schema:
scala> final_df.printSchema
root
|-- mstr_prov_id: string (nullable = true)
|-- prov_ctgry_cd: string (nullable = true)
|-- prov_orgnl_efctv_dt: timestamp (nullable = true)
|-- prov_trmntn_dt: timestamp (nullable = true)
|-- prov_trmntn_rsn_cd: string (nullable = true)
|-- npi_rqrd_ind: string (nullable = true)
|-- prov_stts_aray_txt: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- PROV_STTS_KEY: string (nullable = true)
| | |-- PROV_STTS_EFCTV_DT: timestamp (nullable = true)
| | |-- PROV_STTS_CD: string (nullable = true)
| | |-- PROV_STTS_TRMNTN_DT: timestamp (nullable = true)
| | |-- PROV_STTS_TRMNTN_RSN_CD: string (nullable = true)
I am running the following code to do basic cleansing, but it's not working inside "prov_stts_aray_txt"; it's not going inside the array type and performing the desired transformation. I want to iterate over all fields, flat and nested, within the dataframe and perform the basic transformation.
for (dt <- final_df.dtypes) {
  final_df = final_df.withColumn(dt._1, when(upper(trim(col(dt._1))) === "NULL", lit(" ")).otherwise(col(dt._1)))
}
Please help.
Thanks
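For reference, the withColumn loop cannot reach inside an array of structs; each element has to be rebuilt. A sketch assuming Spark 2.4+ (for the transform higher-order function), cleansing only the string fields:
import org.apache.spark.sql.functions.expr
// Rebuild each struct in the array, replacing literal "NULL" strings.
val cleaned = final_df.withColumn("prov_stts_aray_txt", expr("""
  transform(prov_stts_aray_txt, x -> named_struct(
    'PROV_STTS_KEY', CASE WHEN upper(trim(x.PROV_STTS_KEY)) = 'NULL' THEN ' ' ELSE x.PROV_STTS_KEY END,
    'PROV_STTS_EFCTV_DT', x.PROV_STTS_EFCTV_DT,
    'PROV_STTS_CD', CASE WHEN upper(trim(x.PROV_STTS_CD)) = 'NULL' THEN ' ' ELSE x.PROV_STTS_CD END,
    'PROV_STTS_TRMNTN_DT', x.PROV_STTS_TRMNTN_DT,
    'PROV_STTS_TRMNTN_RSN_CD', CASE WHEN upper(trim(x.PROV_STTS_TRMNTN_RSN_CD)) = 'NULL' THEN ' ' ELSE x.PROV_STTS_TRMNTN_RSN_CD END
  ))
"""))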

Spark Read json object data as MapType

I have written a sample Spark app where I'm creating a dataframe with a MapType and writing it to disk. Then I'm reading the same file and printing its schema. But the output file's schema is different from the input schema, and I don't see the MapType in the output. How can I read that output file with the MapType?
Code
import org.apache.spark.sql.{SaveMode, SparkSession}

case class Department(Id: String, Description: String)
case class Person(name: String, department: Map[String, Department])

object sample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local").appName("Custom Poc").getOrCreate
    import spark.implicits._

    val schemaData = Seq(
      Person("Persion1", Map("It" -> Department("1", "It Department"), "HR" -> Department("2", "HR Department"))),
      Person("Persion2", Map("It" -> Department("1", "It Department")))
    )
    val df = spark.sparkContext.parallelize(schemaData).toDF()

    println("Input schema")
    df.printSchema()
    df.write.mode(SaveMode.Overwrite).json("D:\\save\\output")

    println("Output schema")
    spark.read.json("D:\\save\\output\\*.json").printSchema()
  }
}
Output
Input schema
root
|-- name: string (nullable = true)
|-- department: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- Id: string (nullable = true)
| | |-- Description: string (nullable = true)
Output schema
root
|-- department: struct (nullable = true)
| |-- HR: struct (nullable = true)
| | |-- Description: string (nullable = true)
| | |-- Id: string (nullable = true)
| |-- It: struct (nullable = true)
| | |-- Description: string (nullable = true)
| | |-- Id: string (nullable = true)
|-- name: string (nullable = true)
Json File
{"name":"Persion1","department":{"It":{"Id":"1","Description":"It Department"},"HR":{"Id":"2","Description":"HR Department"}}}
{"name":"Persion2","department":{"It":{"Id":"1","Description":"It Department"}}}
EDIT:
The file-saving part above is just to explain my requirement. In the actual scenario I will just be reading the JSON data provided above and working on that dataframe.
You can pass the schema from the previous dataframe while reading the JSON data:
println("Input schema")
df.printSchema()
df.write.mode(SaveMode.Overwrite).json("D:\\save\\output")
println("Output schema")
spark.read.schema(df.schema).json("D:\\save\\output")
Input schema
root
|-- name: string (nullable = true)
|-- department: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- Id: string (nullable = true)
| | |-- Description: string (nullable = true)
Output schema
root
|-- name: string (nullable = true)
|-- department: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- Id: string (nullable = true)
| | |-- Description: string (nullable = true)
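If the original dataframe is not available in the real scenario (per the EDIT), the equivalent schema can be built explicitly and passed to the reader. A sketch matching the case classes above:
import org.apache.spark.sql.types._
// Hand-built equivalent of df.schema for Person/Department.
val departmentType = StructType(Seq(
  StructField("Id", StringType),
  StructField("Description", StringType)))
val personSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("department", MapType(StringType, departmentType))))
spark.read.schema(personSchema).json("D:\\save\\output").printSchema()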
Hope this helps!