Nested Schema in Spark/Scala with Multiple Elements - scala

I want to create a predefined schema in Spark/Scala so that I can read JSON files accordingly.
The structure of the schema is as below:
root
|-- arrayCol: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- email: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- id2: string (nullable = true)
| | |-- price: string (nullable = true)
| | |-- qty: long (nullable = true)
| | |-- window: struct (nullable = true)
| | | |-- end: string (nullable = true)
| | | |-- start: string (nullable = true)
|-- primaryKeys: string (nullable = true)
|-- state: string (nullable = true)
I was able to create the schema, but I am stuck at one place where the elements have two sub-elements. This is what I have tried:
import org.apache.spark.sql.types._

val testSchema = StructType(
  List(
    StructField("primaryKeys", StringType, true),
    StructField("state", IntegerType, true),
    StructField("email", ArrayType(StringType, true), true),
    StructField("id", StringType, true),
    StructField("name", StringType, true),
    StructField("id2", StringType, true),
    StructField("price", StringType, true),
    StructField("qty", StringType, true),
    StructField("window", ArrayType(StringType, true), true)
  ))
I am not able to figure out how start and end can be included inside that window element.

It is a nested struct, so the window field should be:
StructField("window", StructType(Seq(
  StructField("end", StringType, true),
  StructField("start", StringType, true)
)), true)
By the way, you can get the schema from case classes as follows:
import org.apache.spark.sql.catalyst.ScalaReflection

case class ArrayColWindow(end: String, start: String)
case class ArrayCol(id: String, email: Seq[String], qty: Long, rqty: Long, pids: Seq[String],
                    sqty: Long, id1: String, id2: String, window: ArrayColWindow, otherId: String)
case class FullArrayCols(arrayCol: Seq[ArrayCol], primarykey: String, runtime: String)

val schema = ScalaReflection.schemaFor[FullArrayCols].dataType.asInstanceOf[StructType]
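A minimal usage sketch (the input path below is a placeholder, not from the question), applying the derived schema instead of relying on schema inference:

// Hypothetical path; the derived schema replaces inference when reading JSON.
val df = spark.read.schema(schema).json("/path/to/input.json")
df.printSchema()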

The way I figured out to make it work is as follows:
val arrayStructureSchema = new StructType()
  .add("primaryKeys", StringType, true)
  .add("runtime", StringType, true)
  .add("email", ArrayType(StringType))
  .add("id", StringType)
  .add("id1", StringType)
  .add("id2", StringType)
  .add("otherId", StringType)
  .add("qty", StringType)
  .add("rqty", StringType)
  .add("sqty", StringType)
  .add("window", new StructType()
    .add("end", StringType)
    .add("start", StringType))

Related

Reading ORC file in Spark with schema returns null values

I am trying to read ORC files from a Spark job. I have defined the below schema based on the output of df.printSchema():
root
|-- application: struct (nullable = true)
| |-- appserver: string (nullable = true)
| |-- buildversion: string (nullable = true)
| |-- frameworkversion: string (nullable = true)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- schemaversion: string (nullable = true)
| |-- sdkversion: string (nullable = true)
| |-- servertimestamp: string (nullable = true)
| |-- trackingversion: string (nullable = true)
| |-- version: string (nullable = true)
Schema defined:
val schema = StructType(Seq(
  StructField("application", new StructType()
    .add(StructField("appserver", StringType, false))
    .add(StructField("buildversion", StringType, true))
    .add(StructField("frameworkversion", StringType, true))
    .add(StructField("id", StringType, true))
    .add(StructField("name", StringType, true))
    .add(StructField("schemaversion", StringType, true))
    .add(StructField("sdkversion", StringType, true))
    .add(StructField("servertimestamp", StringType, true))
    .add(StructField("trackingversion", StringType, true))
    .add(StructField("version", StringType, true)),
    false)
))
The DataFrame output returns null for the application value.
val data = sparkSession
  .read
  .schema(schema)
  .orc("Data/Input/")
  .select($"*")

data.show(100, false)
When the same data is read without a schema defined, it returns valid values. There are a few other fields in the file that I am not defining in the schema, as I am not interested in them. Could this be causing the issue? Can somebody help me understand the problem here?
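No answer is recorded for this question. One hedged workaround (a sketch, not a confirmed fix) is to let Spark infer the full ORC schema and project only the nested fields of interest instead of supplying a partial schema up front; the selected field names below are illustrative:

// Infer the schema from the files, then select just the nested fields needed.
val data = sparkSession.read
  .orc("Data/Input/")
  .select($"application.id", $"application.name", $"application.version")

data.show(100, false)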

Change column names of nested data in bigquery using spark

I'm trying to write some data into BigQuery using Spark Scala. My Spark df looks like:
root
|-- id: string (nullable = true)
|-- cost: double (nullable = false)
|-- nodes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- settled: string (nullable = true)
| | |-- constant: string (nullable = true)
|-- status: string (nullable = true)
I tried to change the struct of the data frame:
val schema = StructType(Array(
  StructField("id", StringType, true),
  StructField("cost", DoubleType, true),
  StructField("nodes", StructType(Array(StructField("settled", StringType), StructField("constant", StringType)))),
  StructField("status", StringType, true)))
val actualDf = spark.createDataFrame(results, schema)
But it didn't work. When this is written into BigQuery, the column names look as follows:
id, cost, nodes.list.element.settled, nodes.list.element.constant, status
Is there a way to change these column names to:
id, cost, settled, constant, status
You can explode the nodes array to get a flattened structure of columns, then write the dataframe to BigQuery.
Example:
import spark.implicits._                        // needed for toDS and the 'nodes symbol
import org.apache.spark.sql.functions.explode

val jsn_ds = Seq("""{"id":1, "cost": "2.0","nodes":[{"settled":"u","constant":"p"}],"status":"s"}""").toDS

spark.read.json(jsn_ds).printSchema
// root
//  |-- cost: string (nullable = true)
//  |-- id: long (nullable = true)
//  |-- nodes: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- constant: string (nullable = true)
//  |    |    |-- settled: string (nullable = true)
//  |-- status: string (nullable = true)

spark.read.json(jsn_ds).
  withColumn("expld", explode('nodes)).
  select("*", "expld.*").
  drop("expld", "nodes").
  show()
// +----+---+------+--------+-------+
// |cost| id|status|constant|settled|
// +----+---+------+--------+-------+
// | 2.0|  1|     s|       p|      u|
// +----+---+------+--------+-------+
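As a hedged follow-up (not part of the original answer), the flattened result can then be written to BigQuery with the spark-bigquery connector, assuming it is on the classpath; the table and bucket names below are placeholders:

val flatDf = spark.read.json(jsn_ds)
  .withColumn("expld", explode('nodes))
  .select("*", "expld.*")
  .drop("expld", "nodes")

flatDf.write
  .format("bigquery")
  .option("table", "my_dataset.my_table")         // placeholder table
  .option("temporaryGcsBucket", "my-temp-bucket") // placeholder bucket for indirect writes
  .mode("append")
  .save()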

How to add missing fields to dataframe based on pre-defined schema?

When working with a spark-streaming application, if all the records in a batch from Kafka miss a common field, the dataframe schema changes every time. I need a fixed dataframe schema for further processing and transformation operations.
I have a pre-defined schema like this:
root
|-- PokeId: string (nullable = true)
|-- PokemonName: string (nullable = true)
|-- PokemonWeight: integer (nullable = false)
|-- PokemonType: string (nullable = true)
|-- PokemonEndurance: float (nullable = false)
|-- Attacks: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- AttackName: string (nullable = true)
| | |-- AttackImpact: long (nullable = true)
But for some streaming sessions I don't get all the columns, and the input schema looks like:
root
|-- PokeId: string (nullable = true)
|-- PokemonName: string (nullable = true)
|-- PokemonWeight: integer (nullable = false)
|-- PokemonEndurance: float (nullable = false)
I am defining my schema like this:
val schema = StructType(Array(
  StructField("PokeId", StringType, true),
  StructField("PokemonName", StringType, true),
  StructField("PokemonWeight", IntegerType, false),
  StructField("PokemonType", StringType, true),
  StructField("PokemonEndurance", FloatType, false),
  StructField("Attacks", ArrayType(StructType(Array(
    StructField("AttackName", StringType),
    StructField("AttackImpact", LongType)
  ))))
))
Now, I don't know how to add the missing columns (with null values) to the input dataframe based on this schema.
I have tried spark-daria for dataframe validation, but it only reports the missing columns in a descriptive error. How can I get the missing columns from it?
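No answer is recorded here. A common approach, sketched below under the assumption that column names in the stream match the schema exactly, is to fold over the schema fields, add a null column cast to the expected type for anything missing, and then reorder the columns to match the schema:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StructType

// Sketch: conform an incoming dataframe to the pre-defined schema.
def conformToSchema(df: DataFrame, schema: StructType): DataFrame = {
  val withMissing = schema.fields.foldLeft(df) { (acc, field) =>
    if (acc.columns.contains(field.name)) acc
    else acc.withColumn(field.name, lit(null).cast(field.dataType)) // missing column becomes typed nulls
  }
  withMissing.select(schema.fieldNames.map(col): _*) // keep the schema's column order
}

// Usage (inputDf is the streaming batch dataframe):
// val fixedDf = conformToSchema(inputDf, schema)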

Create ArrayType column in Spark scala

Structure of the Schema to be created:
|-- col1: boolean (nullable = true)
|-- col2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- col2_1: boolean (nullable = true)
| | |-- col2_2: string (nullable = true)
Code to create schema:
val prodSchema = StructType(Array(StructField("col1", StringType), StructField("col2",ArrayType(Array(StructField("element",StructType(Array(StructField("col2_1",StringType)))))))))
Error:
found : Array[org.apache.spark.sql.types.StructField]
required: org.apache.spark.sql.types.DataType
StructField("col2",ArrayType(Array(StructField("element",StructType(Array(StructField("col2_1",StringType)))))))
Any suggestions on how to correct this schema error?
I think you can write it like this:
val prodSchema =
  StructType(
    List(
      StructField("col1", BooleanType),
      StructField("col2", ArrayType(
        StructType(
          List(
            StructField("col2_1", BooleanType),
            StructField("col2_2", StringType)
          )
        )
      ))
    )
  )
prodSchema.printTreeString()
root
|-- col1: boolean (nullable = true)
|-- col2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- col2_1: boolean (nullable = true)
| | |-- col2_2: string (nullable = true)
Try this:
val schema = StructType(Seq(
  StructField("col1", BooleanType, false),
  StructField("col2", ArrayType(StructType(Seq(
    StructField("col2_1", BooleanType, true),
    StructField("col2_2", StringType, true)
  )))
)))
You could use the Schema DSL to create the schema:
val col2 = new StructType().add($"col2_1".boolean).add($"col2_2".string)
val schema = new StructType()
  .add($"col1".boolean)
  .add($"col2".array(col2))
schema.printTreeString()
root
|-- col1: boolean (nullable = true)
|-- col2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- col2_1: boolean (nullable = true)
| | |-- col2_2: string (nullable = true)
Hope it helps.

How to manipulate my dataframe in Spark?

I have a nested JSON RDD stream coming in from a Kafka topic.
The data looks like this:
{
  "time": "sometext1",
  "host": "somehost1",
  "event": {"category": "sometext2", "computerName": "somecomputer1"}
}
I turned this into a dataframe and the schema looks like
root
|-- event: struct (nullable = true)
| |-- category: string (nullable = true)
| |-- computerName: string (nullable = true)
|-- time: string (nullable = true)
|-- host: string (nullable = true)
I'm trying to save it to a Hive table on HDFS with a schema like this:
category:string
computerName:string
time:string
host:string
This is my first time working with Spark and Scala. I would appreciate it if someone could help me.
Thanks
// Creating RDD
val vals = sc.parallelize(
  """{"time":"sometext1","host":"somehost1","event": {"category":"sometext2","computerName":"somecomputer1"}}""" ::
    Nil)

// Creating schema
val schema = (new StructType)
  .add("time", StringType)
  .add("host", StringType)
  .add("event", (new StructType)
    .add("category", StringType)
    .add("computerName", StringType))

import sqlContext.implicits._
val jsonDF = sqlContext.read.schema(schema).json(vals)
jsonDF.printSchema
root
|-- time: string (nullable = true)
|-- host: string (nullable = true)
|-- event: struct (nullable = true)
| |-- category: string (nullable = true)
| |-- computerName: string (nullable = true)
// Selecting columns
val df = jsonDF.select($"event.*", $"time", $"host")
df.printSchema
root
|-- category: string (nullable = true)
|-- computerName: string (nullable = true)
|-- time: string (nullable = true)
|-- host: string (nullable = true)
df.show
+---------+-------------+---------+---------+
| category| computerName|     time|     host|
+---------+-------------+---------+---------+
|sometext2|somecomputer1|sometext1|somehost1|
+---------+-------------+---------+---------+
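To finish the original ask of saving to a Hive table, a minimal sketch (assuming Hive support is enabled in the session; the database/table name is a placeholder):

// Persist the flattened dataframe as a Hive table; the name is a placeholder.
df.write
  .mode("overwrite")
  .saveAsTable("mydb.events")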