Handling varying JSON schema when creating a dataframe in PySpark - pyspark

I have a Databricks notebook that reads delta data in JSON format every hour. So let's say at 11 AM the schema of the file is as follows:
root
|-- number: string (nullable = true)
|-- company: string (nullable = true)
|-- assignment: struct (nullable = true)
| |-- link: string (nullable = true)
| |-- value: string (nullable = true)
The next hour, at 12 PM, the schema changes to:
root
|-- number: string (nullable = true)
|-- company: struct (nullable = true)
| |-- link: string (nullable = true)
| |-- value: string (nullable = true)
|-- assignment: struct (nullable = true)
| |-- link: string (nullable = true)
| |-- value: string (nullable = true)
Some of the columns change from string to struct and vice versa. So if I select col("company.link") and the incoming schema is of type string, the code fails.
How do I handle schema changes in PySpark when reading the file? My end goal is to flatten the JSON to CSV format.

from pyspark.sql.functions import col

def get_dtype(df, colname):
    return [dtype for name, dtype in df.dtypes if name == colname][0]

# df has the exploded JSON data
df2 = df.select("result.number",
                "result.company",
                "result.assignment_group")
df23 = df2
for name, _ in df2.dtypes:
    # Only flatten columns that arrived as a struct; plain strings are kept as-is
    if 'struct' in get_dtype(df2, name):
        try:
            df23 = df23.withColumn(name + "_link", col(name + ".link")) \
                       .withColumn(name + "_value", col(name + ".value")) \
                       .drop(name)
        except Exception as e:
            print("error:", e)
df23.printSchema()
root
|-- number: string (nullable = true)
|-- company: string (nullable = true)
|-- assignment_group_link: string (nullable = true)
|-- assignment_group_value: string (nullable = true)
So this is what I did:
Created a function that identifies whether a column is of type struct.
Read all the columns from the base dataframe that holds the exploded result from the JSON.
Then looped through the columns, and if a column is of type struct, added new columns with the nested values and dropped the original.

Related

Pyspark create temp view from dataframe

I am trying to read a huge CSV through spark.sql.
I created a dataframe from a CSV, and the dataframe seems to be created correctly.
I read the schema and I can perform select and filter.
I would like to create a temp view so I can run the same queries using SQL, which I am more comfortable with, but the temp view seems to be created on the CSV header only.
Where am I making the mistake?
Thanks
>>> df = spark.read.options(header=True,inferSchema=True,delimiter=";").csv("./elenco_dm_tutti_csv_formato_opendata_UltimaVersione.csv")
>>> df.printSchema()
root
|-- TIPO: integer (nullable = true)
|-- PROGRESSIVO_DM_ASS: integer (nullable = true)
|-- DATA_PRIMA_PUBBLICAZIONE: string (nullable = true)
|-- DM_RIFERIMENTO: integer (nullable = true)
|-- GRUPPO_DM_SIMILI: integer (nullable = true)
|-- ISCRIZIONE_REPERTORIO: string (nullable = true)
|-- INIZIO_VALIDITA: string (nullable = true)
|-- FINE_VALIDITA: string (nullable = true)
|-- FABBRICANTE_ASSEMBLATORE: string (nullable = true)
|-- CODICE_FISCALE: string (nullable = true)
|-- PARTITA_IVA_VATNUMBER: string (nullable = true)
|-- CODICE_CATALOGO_FABBR_ASS: string (nullable = true)
|-- DENOMINAZIONE_COMMERCIALE: string (nullable = true)
|-- CLASSIFICAZIONE_CND: string (nullable = true)
|-- DESCRIZIONE_CND: string (nullable = true)
|-- DATAFINE_COMMERCIO: string (nullable = true)
>>> df.count()
1653697
>>> df.createOrReplaceTempView("mask")
>>> spark.sql("select count(*) from mask")
DataFrame[count(1): bigint]
Spark transformations such as sql() are lazy and do not process anything on their own. You need to add an action such as .show() or .collect() to get results.
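For example, calling an action on the same query should print the count reported by df.count() above:
spark.sql("select count(*) from mask").show()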

spark scala: extracting xml from one column

Assume df has the following structure:
root
|-- id: decimal(38,0) (nullable = true)
|-- text: string (nullable = true)
Here text contains strings of roughly XML-like records. I'm then able to apply the following steps to extract the necessary entries into a flat table.
First, append a root node, since there is none originally. (Question #1: is this step necessary, or can it be omitted?)
val df2 = df.withColumn("text", concat(lit("<root>"),$"text",lit("</root>")))
Next, parsing the XML:
val payloadSchema = schema_of_xml(df.select("text").as[String])
val df3 = spark.read.option("rootTag","root").option("rowTag","row").schema(payloadSchema).xml(df2.select("text").as[String])
This generates df3:
root
|-- row: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
which I finally explode:
val df4 = df3.withColumn("exploded_cols", explode($"row"))
into
root
|-- row: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
|-- exploded_cols: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- value: string (nullable = true)
My goal is the following table:
val df5 = df4.select("exploded_cols.*")
with
root
|-- key: string (nullable = true)
|-- value: string (nullable = true)
Main question:
I want the final table to also contain the id: decimal(38,0) (nullable = true) entries along with the exploded key, value columns, e.g.,
root
|-- id: decimal(38,0) (nullable = true)
|-- key: string (nullable = true)
|-- value: string (nullable = true)
however, I'm not sure how to call spark.read.option without passing df2.select("text").as[String] separately into the method (see df3). Is it possible to simplify this script?
This should be straightforward, so I'm not sure a reproducible example is necessary. Also, I'm coming from an R background, so I'm missing the Scala basics, but I'm trying to learn as I go.
Use the from_xml function of the spark-xml library (note that it takes a Column, not a column name):
val df = ???     // Read the source data
val schema = ??? // Define the schema of the XML text
df.withColumn("xmlData", from_xml(col("xmlColName"), schema))

How to change datatype of a field in a two-level schema tree?

Now I have a dataframe with schema:
root
|-- id: string (nullable = true)
|-- st_one: struct (nullable = true)
| |-- tid: long (nullable = true)
| |-- st_two: struct (nullable = true)
| | |-- name: string (nullable = true)
| | |-- score: long (nullable = true)
|-- ts: double (nullable = true)
|-- date: string (nullable = true)
I want to change score's type from long to double. Is there any good solution?
BTW, I'm using Scala.
I already know how to do it by "listing" all the fields. I want a more general method that would work even if st_two contained a thousand fields or more.
You can update the struct type column st_one like this:
val df1 = df.withColumn(
  "st_one",
  struct(
    $"st_one.tid",
    struct(
      $"st_one.st_two.name",
      $"st_one.st_two.score".cast("double").as("score")
    ).as("st_two")
  )
)
You can do a complex cast:
val df2 = df.withColumn("st_one", $"st_one".cast("struct<tid:long, st_two:struct<name:string, score:double>>"))

How to dynamically infer a schema using SparkSession

I have just started learning Spark. I am aware that if we set the inferSchema option to true, the schema is automatically inferred. I am reading a simple CSV file. How do I dynamically infer a schema without specifying any custom schema in my code? The code should be able to build the schema for any incoming dataset.
Is it possible to do so?
I tried using readStream with csv as the format, skipping the inferSchema option altogether, but it seems I need to provide a schema in any case.
val ds1: DataFrame = spark
  .readStream
  .format("csv")
  .load("/home/vaibha/Downloads/C2ImportCalEventSample.csv")
println(ds1.show(2))
You can dynamically infer the schema, but it might get a bit tedious in some cases with the CSV format. More on that here. Referring to the CSV file in your code sample, and assuming it is the same as the one here, something like the below will give you what you need:
scala> val df = spark.read.
| option("header", "true").
| option("inferSchema", "true").
| option("timestampFormat","MM/dd/yyyy").
| csv("D:\\texts\\C2ImportCalEventSample.csv")
df: org.apache.spark.sql.DataFrame = [Start Date : timestamp, Start Time: string ... 15 more fields]
scala> df.printSchema
root
|-- Start Date : timestamp (nullable = true)
|-- Start Time: string (nullable = true)
|-- End Date: timestamp (nullable = true)
|-- End Time: string (nullable = true)
|-- Event Title : string (nullable = true)
|-- All Day Event: string (nullable = true)
|-- No End Time: string (nullable = true)
|-- Event Description: string (nullable = true)
|-- Contact : string (nullable = true)
|-- Contact Email: string (nullable = true)
|-- Contact Phone: string (nullable = true)
|-- Location: string (nullable = true)
|-- Category: integer (nullable = true)
|-- Mandatory: string (nullable = true)
|-- Registration: string (nullable = true)
|-- Maximum: integer (nullable = true)
|-- Last Date To Register: timestamp (nullable = true)
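As for the readStream attempt in the question: file-based streaming sources normally require an explicit schema, but schema inference can be enabled through the spark.sql.streaming.schemaInference setting. A sketch, reusing the path from the question:
spark.conf.set("spark.sql.streaming.schemaInference", "true")
val ds1 = spark
  .readStream
  .format("csv")
  .option("header", "true")
  .load("/home/vaibha/Downloads/C2ImportCalEventSample.csv")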

Convert Struct data type to Map data type in Scala

How can I convert a column with a struct data type to a Map or String? This is the schema:
root
|-- Col1: string (nullable = true)
|-- Col2: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: integer (nullable = false)
The second column causes a problem when I want to dump the dataframe into a file. I have tried many different ways, such as casting to string, but that changed the values in the second column. I also tried to convert Col2 to a map, but I was not successful.
I tried to get the first value of the struct (_1) through a UDF, but it throws an error:
Failed to execute user defined function($anonfun$1: (struct<_1:string,_2:int>) => string)
With spark.sql you can try this, save the result to another dataframe, and then write it to CSV:
Select Col1, Col2._1, Col2._2 from <your table>
In Scala we could do it this way:
val df_new = df_old.select($"Col1", $"Col2._1", $"Col2._2")
You can also use the * notation to expand all the columns of a struct data type.
Schema
root
|-- address: struct (nullable = false)
| |-- street: string (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
Expansion
val df1 = df.select("address.*")
df1.show(false)
df1.printSchema
root
|-- street: string (nullable = true)
|-- city: string (nullable = true)
|-- state: string (nullable = true)
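If the goal really is a Map or a plain string (for example so the dataframe can be written to CSV), the struct can also be serialized instead of expanded. A sketch with built-in functions, using the column names from the question:
import org.apache.spark.sql.functions._

// Struct -> JSON string, which CSV can store as a single column
val asString = df.withColumn("Col2", to_json(col("Col2")))

// Struct -> map<string,string> keyed by the field names
val asMap = df.withColumn(
  "Col2",
  map(lit("_1"), col("Col2._1"), lit("_2"), col("Col2._2").cast("string"))
)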