Access nested columns in transformers setInputCol() method - scala

I am trying to parse a Wikipedia dump using the Databricks XML parser and Spark's pipeline approach. The goal is to compute a feature vector for the text field, which is a nested column.
The schema of the XML is as follows:
root
|-- id: long (nullable = true)
|-- ns: long (nullable = true)
|-- revision: struct (nullable = true)
| |-- comment: string (nullable = true)
| |-- contributor: struct (nullable = true)
| | |-- id: long (nullable = true)
| | |-- ip: string (nullable = true)
| | |-- username: string (nullable = true)
| |-- format: string (nullable = true)
| |-- id: long (nullable = true)
| |-- minor: string (nullable = true)
| |-- model: string (nullable = true)
| |-- parentid: long (nullable = true)
| |-- sha1: string (nullable = true)
| |-- text: struct (nullable = true)
| | |-- _VALUE: string (nullable = true)
| | |-- _bytes: long (nullable = true)
| | |-- _space: string (nullable = true)
| |-- timestamp: string (nullable = true)
|-- title: string (nullable = true)
After reading in the dump with
val raw = spark.read.format("com.databricks.spark.xml").option("rowTag", "page").load("some.xml")
I am able to access the respective text using
raw.select("revision.text._VALUE").show(10)
The first stage of my Spark pipeline is then a RegexTokenizer, which needs to access revision.text._VALUE in order to transform the data:
val tokenizer = new RegexTokenizer().
  setInputCol("revision.text._VALUE").
  setOutputCol("tokens").
  setMinTokenLength(3).
  setPattern("\\s+|\\/|_|-").
  setToLowercase(true)
val pipeline = new Pipeline().setStages(Array(tokenizer))
val model = pipeline.fit(raw)
However, this step fails with:
Name: java.lang.IllegalArgumentException
Message: Field "revision.text._VALUE" does not exist.
Any advice on how to use nested columns in the setInputCol() method?
Thanks a lot!

Pipeline stages look the input column up as a literal top-level field name in the schema, so dotted paths into structs are not resolved. Try creating a temp column before using it in the RegexTokenizer:
val rawTemp = raw.withColumn("temp", $"revision.text._VALUE")
Then you can use the rawTemp dataframe and the temp column in the RegexTokenizer:
val tokenizer = new RegexTokenizer().
  setInputCol("temp").
  setOutputCol("tokens").
  setMinTokenLength(3).
  setPattern("\\s+|\\/|_|-").
  setToLowercase(true)
val pipeline = new Pipeline().setStages(Array(tokenizer))
val model = pipeline.fit(rawTemp)
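If you prefer the flattening to live inside the pipeline itself rather than as a separate pre-processing step, a SQLTransformer stage can create the temporary column as part of the pipeline. A minimal sketch (the flattener stage and the temp column name are illustrative, not from the original post):
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{RegexTokenizer, SQLTransformer}
// First stage: pull the nested field up to a top-level column.
// SQLTransformer substitutes the input dataframe for __THIS__.
val flattener = new SQLTransformer().
  setStatement("SELECT *, revision.text._VALUE AS temp FROM __THIS__")
val tokenizer = new RegexTokenizer().
  setInputCol("temp").
  setOutputCol("tokens").
  setMinTokenLength(3).
  setPattern("\\s+|\\/|_|-").
  setToLowercase(true)
val pipeline = new Pipeline().setStages(Array(flattener, tokenizer))
val model = pipeline.fit(raw)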
Hope the answer is helpful

Related

spark scala: extracting xml from one column

Assume df has the following structure:
root
|-- id: decimal(38,0) (nullable = true)
|-- text: string (nullable = true)
Here text contains strings of roughly XML-like records. I'm then able to apply the following steps to extract the necessary entries into a flat table:
First, append a root node, since there is none originally. (Question #1: is this step necessary, or can it be omitted?)
val df2 = df.withColumn("text", concat(lit("<root>"),$"text",lit("</root>")))
Next, parsing the XML:
val payloadSchema = schema_of_xml(df.select("text").as[String])
val df3 = spark.read.option("rootTag","root").option("rowTag","row").schema(payloadSchema).xml(df2.select("text").as[String])
This generates df3:
root
|-- row: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
which I finally explode:
val df4 = df3.withColumn("exploded_cols", explode($"row"))
into
root
|-- row: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
|-- exploded_cols: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- value: string (nullable = true)
My goal is the following table:
val df5 = df4.select("exploded_cols.*")
with
root
|-- key: string (nullable = true)
|-- value: string (nullable = true)
Main question:
I want the final table to also contain the id: decimal(38,0) (nullable = true) entries along with the exploded key, value columns, e.g.,
root
|-- id: decimal(38,0) (nullable = true)
|-- key: string (nullable = true)
|-- value: string (nullable = true)
however, I'm not sure how to keep the id column, since spark.read only receives df2.select("text").as[String] (see df3) and drops everything else. Is it possible to simplify this script?
This should be straightforward, so I'm not sure a reproducible example is necessary. Also, I'm coming in blind from an R background, so I'm missing all the Scala basics, but I'm trying to learn as I go.
Use the from_xml function of the spark-xml library.
import com.databricks.spark.xml.functions.from_xml
val df = // Read source data
val schema = // Define schema of XML text
df.withColumn("xmlData", from_xml($"xmlColName", schema)) // from_xml takes a Column, not a name

rename spark dataframe structType fields

Given a dynamic StructType whose name is not known in advance; it is dynamic, and hence its name changes.
The name is variable, so don't assume "MAIN_COL" in the schema.
root
|-- MAIN_COL: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: string (nullable = true)
| |-- c: string (nullable = true)
| |-- d: string (nullable = true)
| |-- f: long (nullable = true)
| |-- g: long (nullable = true)
| |-- h: long (nullable = true)
| |-- j: long (nullable = true)
How can we write dynamic code to rename the fields of a StructType, using the struct's name as a prefix?
root
|-- MAIN_COL: struct (nullable = true)
| |-- MAIN_COL_a: string (nullable = true)
| |-- MAIN_COL_b: string (nullable = true)
| |-- MAIN_COL_c: string (nullable = true)
| |-- MAIN_COL_d: string (nullable = true)
| |-- MAIN_COL_f: long (nullable = true)
| |-- MAIN_COL_g: long (nullable = true)
| |-- MAIN_COL_h: long (nullable = true)
| |-- MAIN_COL_j: long (nullable = true)
You can use the schema DSL to update nested columns, picking up the struct's name dynamically instead of hard-coding "MAIN_COL":
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._
// The struct column, whatever it happens to be called
val structField = df.schema.fields.head
val prefix = structField.name
val schema: StructType = structField.dataType.asInstanceOf[StructType]
val updatedSchema = StructType(
  schema.fields.map(sf => StructField(prefix + "_" + sf.name, sf.dataType, sf.nullable))
)
val resultDF = df.withColumn(prefix, col(prefix).cast(updatedSchema))
Updated Schema:
root
|-- MAIN_COL: struct (nullable = false)
| |-- MAIN_COL_a: string (nullable = true)
| |-- MAIN_COL_b: string (nullable = true)
| |-- MAIN_COL_c: string (nullable = true)
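Once the fields carry unique prefixed names, the struct can be flattened without column-name collisions, which is usually the point of the renaming. A short usage sketch:
// Flatten the struct; nested fields keep their prefixed names.
val flatDF = resultDF.select(col(prefix + ".*"))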

How to display the string variable in the root using Spark SQL?

Long story short: I am using Spark code in Scala IDE to convert JSON to CSV. I don't have much knowledge of Spark, as I have worked only on RDBMSs like Oracle, TD and DB2. All I was given was the code that converts the JSON data to CSV and how to pass the arguments to retrieve data from the schema.
Now, I am able to fetch the data that is inside a struct and array by using
val val1 = df.select(explode($"data.business").as("ID")).select($"ID.amountTO")
val1.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save(args(2) + "\\Result" + "\\" + timeForpath + "\\val1")
I don't know how to export the columns that are not in a struct but sit directly at the root of the schema, like QAYONOutCome, QA1PartiesComments, etc.
root
|-- QAYONOutCome: string (nullable = true)
|-- QA1PartiesComments: string (nullable = true)
|-- QA1PartiesQID: string (nullable = true)
|-- QA1PartiesResponse: string (nullable = true)
|-- QAHolderTypeComments: string (nullable = true)
|-- QAHolderTypeQID: string (nullable = true)
|-- QAHolderTypeResponse: string (nullable = true)
|-- QAhighRiskComments: string (nullable = true)
|-- QAhighRiskQID: string (nullable = true)
|-- QAhighRiskResponse: string (nullable = true)
|-- QA2ClassComments: string (nullable = true)
|-- QA2ClassQID: string (nullable = true)
|-- QA2ClassResponse: string (nullable = true)
|-- QAoutcomeComments: string (nullable = true)
|-- QAoutcomeQID: string (nullable = true)
|-- QAoutcomeResponse: string (nullable = true)
|-- data: struct (nullable = true)
| |-- business: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- amountTO: string (nullable = true)
| | | |-- ID: string (nullable = true)
| | | |-- Registration: struct (nullable = true)
| | | | |-- country: string (nullable = true)
| | | | |-- id: long (nullable = true)
| | | | |-- line1: string (nullable = true)
| | | | |-- line2: string (nullable = true)
| | | | |-- postCode: string (nullable = true)
Any help is appreciated. Apologies if my question sounds very dumb :(. Please let me know if some more information is needed to provide a solution or clarity. Thanks much in advance.
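For reference, top-level fields are ordinary columns and can be selected directly by name; no explode is needed. A sketch reusing the CSV-writing pattern from the question (column list abbreviated):
// Root-level columns: select them by name, no struct navigation required.
val rootCols = df.select($"QAYONOutCome", $"QA1PartiesComments", $"QA1PartiesQID")
rootCols.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save(args(2) + "\\Result" + "\\" + timeForpath + "\\rootCols")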

Iterating through nested element in spark

I have a dataframe with the following schema:
scala> final_df.printSchema
root
|-- mstr_prov_id: string (nullable = true)
|-- prov_ctgry_cd: string (nullable = true)
|-- prov_orgnl_efctv_dt: timestamp (nullable = true)
|-- prov_trmntn_dt: timestamp (nullable = true)
|-- prov_trmntn_rsn_cd: string (nullable = true)
|-- npi_rqrd_ind: string (nullable = true)
|-- prov_stts_aray_txt: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- PROV_STTS_KEY: string (nullable = true)
| | |-- PROV_STTS_EFCTV_DT: timestamp (nullable = true)
| | |-- PROV_STTS_CD: string (nullable = true)
| | |-- PROV_STTS_TRMNTN_DT: timestamp (nullable = true)
| | |-- PROV_STTS_TRMNTN_RSN_CD: string (nullable = true)
I am running the following code to do basic cleansing, but it's not working inside "prov_stts_aray_txt"; it's not going inside the array type and performing the desired transformation. I want to iterate over all fields, flat and nested, within the dataframe and perform the basic transformation.
for (dt <- final_df.dtypes) {
  final_df = final_df.withColumn(dt._1, when(upper(trim(col(dt._1))) === "NULL", lit(" ")).otherwise(col(dt._1)))
}
Please help.
Thanks
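For reference, the withColumn loop cannot reach inside an array of structs; each element has to be rebuilt. A sketch assuming Spark 2.4+ (for the transform higher-order function), cleansing only the string fields:
import org.apache.spark.sql.functions.expr
// Rebuild each struct in the array, replacing literal "NULL" strings.
val cleaned = final_df.withColumn("prov_stts_aray_txt", expr("""
  transform(prov_stts_aray_txt, x -> named_struct(
    'PROV_STTS_KEY', CASE WHEN upper(trim(x.PROV_STTS_KEY)) = 'NULL' THEN ' ' ELSE x.PROV_STTS_KEY END,
    'PROV_STTS_EFCTV_DT', x.PROV_STTS_EFCTV_DT,
    'PROV_STTS_CD', CASE WHEN upper(trim(x.PROV_STTS_CD)) = 'NULL' THEN ' ' ELSE x.PROV_STTS_CD END,
    'PROV_STTS_TRMNTN_DT', x.PROV_STTS_TRMNTN_DT,
    'PROV_STTS_TRMNTN_RSN_CD', CASE WHEN upper(trim(x.PROV_STTS_TRMNTN_RSN_CD)) = 'NULL' THEN ' ' ELSE x.PROV_STTS_TRMNTN_RSN_CD END
  ))
"""))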

Spark Read json object data as MapType

I have written a sample Spark app where I'm creating a dataframe with a MapType and writing it to disk. Then I'm reading the same file and printing its schema. But the output file's schema is different from the input schema, and I don't see the MapType in the output. How can I read that output file with the MapType?
Code
import org.apache.spark.sql.{SaveMode, SparkSession}

case class Department(Id: String, Description: String)
case class Person(name: String, department: Map[String, Department])

object sample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local").appName("Custom Poc").getOrCreate
    import spark.implicits._

    val schemaData = Seq(
      Person("Persion1", Map("It" -> Department("1", "It Department"), "HR" -> Department("2", "HR Department"))),
      Person("Persion2", Map("It" -> Department("1", "It Department")))
    )
    val df = spark.sparkContext.parallelize(schemaData).toDF()

    println("Input schema")
    df.printSchema()
    df.write.mode(SaveMode.Overwrite).json("D:\\save\\output")

    println("Output schema")
    spark.read.json("D:\\save\\output\\*.json").printSchema()
  }
}
Output
Input schema
root
|-- name: string (nullable = true)
|-- department: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- Id: string (nullable = true)
| | |-- Description: string (nullable = true)
Output schema
root
|-- department: struct (nullable = true)
| |-- HR: struct (nullable = true)
| | |-- Description: string (nullable = true)
| | |-- Id: string (nullable = true)
| |-- It: struct (nullable = true)
| | |-- Description: string (nullable = true)
| | |-- Id: string (nullable = true)
|-- name: string (nullable = true)
Json File
{"name":"Persion1","department":{"It":{"Id":"1","Description":"It Department"},"HR":{"Id":"2","Description":"HR Department"}}}
{"name":"Persion2","department":{"It":{"Id":"1","Description":"It Department"}}}
EDIT:
The file-saving part above is just to explain my requirement. In the actual scenario I will just be reading the JSON data provided above and working on that dataframe.
You can pass the schema from the previous dataframe while reading the JSON data:
println("Input schema")
df.printSchema()
df.write.mode(SaveMode.Overwrite).json("D:\\save\\output")
println("Output schema")
spark.read.schema(df.schema).json("D:\\save\\output")
Input schema
root
|-- name: string (nullable = true)
|-- department: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- Id: string (nullable = true)
| | |-- Description: string (nullable = true)
Output schema
root
|-- name: string (nullable = true)
|-- department: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- Id: string (nullable = true)
| | |-- Description: string (nullable = true)
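If the original dataframe is not available in the real scenario (per the EDIT), the equivalent schema can be built explicitly and passed to the reader. A sketch matching the case classes above:
import org.apache.spark.sql.types._
// Hand-built equivalent of df.schema for Person/Department.
val departmentType = StructType(Seq(
  StructField("Id", StringType),
  StructField("Description", StringType)))
val personSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("department", MapType(StringType, departmentType))))
spark.read.schema(personSchema).json("D:\\save\\output").printSchema()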
Hope this helps!