Creating Schema of JSON type and Reading it using Spark in Scala [Error: cannot resolve jsontostructs]

I have a JSON file like below :
{"Codes":[{"CName":"012","CValue":"XYZ1234","CLevel":"0","msg":"","CType":"event"},{"CName":"013","CValue":"ABC1234","CLevel":"1","msg":"","CType":"event"}}
I want to create the schema for this, and if the JSON file is empty ({}), it should be an empty string.
However, when I use df.show, the output of df is:
[[012, XYZ1234, 0, event, ], [013, ABC1234, 1, event, ]]
I created the schema like below:
val schemaF = ArrayType(
StructType(
Array(
StructField("CName", StringType),
StructField("CValue", StringType),
StructField("CLevel", StringType),
StructField("msg", StringType),
StructField("CType", StringType)
)
)
)
When I tried the following:
val df1 = df.withColumn("Codes", from_json('Codes, schemaF))
It gives an AnalysisException:
org.apache.spark.sql.AnalysisException: cannot resolve
'jsontostructs(Codes)' due to data type mismatch: argument 1
requires string type, however, 'Codes' is of
array<struct<CName:string,CValue:string,CLevel:string,CType:string,msg:string>>
type.;; 'Project [valid#51,
jsontostructs(ArrayType(StructType(StructField(CName,StringType,true),
StructField(CValue,StringType,true),
StructField(CLevel,StringType,true), StructField(msg,StringType,true),
StructField(CType,StringType,true)),true), Codes#8,
Some(America/Bogota)) AS errorCodes#77]
Can someone please tell me why and how to resolve this issue?

val schema =
StructType(
Array(
StructField("CName", StringType),
StructField("CValue", StringType),
StructField("CLevel", StringType),
StructField("msg", StringType),
StructField("CType", StringType)
)
)
val df0 = spark.read.schema(schema).json("/path/to/data.json")

Your schema does not correspond to the JSON file you're trying to read. It's missing the field Codes of array type; it should look like this:
val schema = StructType(
  Array(
    StructField(
      "Codes",
      ArrayType(
        StructType(
          Array(
            StructField("CLevel", StringType, true),
            StructField("CName", StringType, true),
            StructField("CType", StringType, true),
            StructField("CValue", StringType, true),
            StructField("msg", StringType, true)
          )
        ),
        true
      ),
      true
    )
  )
)
And you should apply it when reading the JSON, not with the from_json function:
val df = spark.read.schema(schema).json("path/to/json/file")
df.printSchema
//root
// |-- Codes: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- CLevel: string (nullable = true)
// | | |-- CName: string (nullable = true)
// | | |-- CType: string (nullable = true)
// | | |-- CValue: string (nullable = true)
// | | |-- msg: string (nullable = true)
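If you then need to work with the individual entries of Codes, one option (a minimal sketch, not part of the original answer) is to explode the array:
// Sketch: flatten the Codes array into one row per code entry.
import org.apache.spark.sql.functions.explode
import spark.implicits._

val codesDf = df
  .select(explode($"Codes").as("code"))
  .select("code.CName", "code.CValue", "code.CLevel", "code.msg", "code.CType")

codesDf.show()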
EDIT:
For the question in your comment, you can use this schema definition:
val schema = StructType(
  Array(
    StructField(
      "Codes",
      ArrayType(
        StructType(
          Array(
            StructField("CLevel", StringType, true),
            StructField("CName", StringType, true),
            StructField("CType", StringType, true),
            StructField("CValue", StringType, true),
            StructField("msg", StringType, true)
          )
        ),
        true
      ),
      true
    ),
    StructField("lid", StructType(Array(StructField("idNo", StringType, true))), true)
  )
)
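The question also asks for an empty string when the file is just {}. One possible approach (a sketch, not part of the original answer; it assumes a Spark version whose to_json supports arrays of structs, i.e. 2.4+) is to render the parsed array back to a JSON string and fall back to an empty string when the array is missing:
import org.apache.spark.sql.functions.{coalesce, lit, to_json}
import spark.implicits._

// Sketch: Codes is null when the input file only contains {}, so coalesce to "".
val withCodesString = df.withColumn("CodesStr", coalesce(to_json($"Codes"), lit("")))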

Related

Parse the String column to get the data in date format using Spark Scala

I have the below column (TriggeredDateTime) in my .avro file, which is of type String. I need to get the data in yyyy-MM-dd HH:mm:ss format (as shown in the expected output) using Spark-Scala. Could you please let me know whether there is any way to achieve this by writing a UDF, rather than using my approach below? Any help would be much appreciated.
"TriggeredDateTime": {"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}
expected output
+-------------------+
|TriggeredDateTime  |
+-------------------+
|2019-05-16 04:56:19|
+-------------------+
My Approach:
I'm trying to convert the .avro file to JSON format by applying the schema, and then I can try parsing the JSON to get the required results.
DataFrame Sample Data:
[{"vin":"FU7123456XXXXX","basetime":0,"dtctime":189834,"latitude":36.341587,"longitude":140.327676,"dtcs":[{"fmi":1,"spn":2631,"dtc":"470A01","id":1},{"fmi":0,"spn":0,"dtc":"000000","id":61}],"signals":[{"timestamp":78799,"spn":174,"value":45,"name":"PT"},{"timestamp":12345,"spn":0,"value":10.2,"name":"PT"},{"timestamp":194915,"spn":0,"value":0,"name":"PT"}],"sourceEcu":"MCM","TriggeredDateTime":{"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}}]
DataFrame PrintSchema:
initialDF.printSchema
root
|-- vin: string (nullable = true)
|-- basetime: string (nullable = true)
|-- dtctime: string (nullable = true)
|-- latitude: string (nullable = true)
|-- longitude: string (nullable = true)
|-- dtcs: string (nullable = true)
|-- signals: string (nullable = true)
|-- sourceEcu: string (nullable = true)
|-- dtcTriggeredDateTime: string (nullable = true)
Instead of writing a UDF you can use the built-in get_json_object to parse the JSON row and format_string to build the desired output.
import org.apache.spark.sql.functions.{get_json_object, format_string}
val df = Seq(
("""{"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}"""),
("""{"dateTime":{"date":{"year":2018,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}""")
).toDF("TriggeredDateTime")
df.select(
format_string("%s-%s-%s %s:%s:%s",
get_json_object($"TriggeredDateTime", "$.dateTime.date.year").as("year"),
get_json_object($"TriggeredDateTime", "$.dateTime.date.month").as("month"),
get_json_object($"TriggeredDateTime", "$.dateTime.date.day").as("day"),
get_json_object($"TriggeredDateTime", "$.dateTime.time.hour").as("hour"),
get_json_object($"TriggeredDateTime", "$.dateTime.time.minute").as("min"),
get_json_object($"TriggeredDateTime", "$.dateTime.time.second").as("sec")
).as("TriggeredDateTime")
).show(false)
Output:
+-----------------+
|TriggeredDateTime|
+-----------------+
|2019-5-16 4:56:19|
|2018-5-16 4:56:19|
+-----------------+
The function get_json_object converts the JSON string into a JSON object; then, with the proper selector, we extract each part of the date, e.g. $.dateTime.date.year, which we pass as a parameter to the format_string function in order to generate the final date.
UPDATE:
For the sake of completeness, instead of calling get_json_object multiple times, we can use from_json, providing the schema that we already know:
import org.apache.spark.sql.functions.{from_json, format_string}
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
val df = Seq(
("""{"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}"""),
("""{"dateTime":{"date":{"year":2018,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}""")
).toDF("TriggeredDateTime")
val schema = StructType(Seq(
  StructField("dateTime", StructType(Seq(
    StructField("date", StructType(Seq(
      StructField("year", IntegerType, false),
      StructField("month", IntegerType, false),
      StructField("day", IntegerType, false)
    ))),
    StructField("time", StructType(Seq(
      StructField("hour", IntegerType, false),
      StructField("minute", IntegerType, false),
      StructField("second", IntegerType, false),
      StructField("nano", IntegerType, false)
    )))
  ))),
  StructField("offset", StructType(Seq(
    StructField("totalSeconds", IntegerType, false)
  )))
))
df.select(
from_json($"TriggeredDateTime", schema).as("parsedDateTime")
)
.select(
format_string("%s-%s-%s %s:%s:%s",
$"parsedDateTime.dateTime.date.year".as("year"),
$"parsedDateTime.dateTime.date.month".as("month"),
$"parsedDateTime.dateTime.date.day".as("day"),
$"parsedDateTime.dateTime.time.hour".as("hour"),
$"parsedDateTime.dateTime.time.minute".as("min"),
$"parsedDateTime.dateTime.time.second".as("sec")
).as("TriggeredDateTime")
)
.show(false)
// +-----------------+
// |TriggeredDateTime|
// +-----------------+
// |2019-5-16 4:56:19|
// |2018-5-16 4:56:19|
// +-----------------+
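Note that the %s-based formatting does not zero-pad months, days, or hours, whereas the expected output uses yyyy-MM-dd HH:mm:ss. Since the fields parsed with the schema above are integers, one possible tweak (a sketch, not part of the original answer, reusing df and schema from the snippet above) is to switch to numeric format specifiers:
// Sketch: zero-padded variant of the formatting step above.
df.select(from_json($"TriggeredDateTime", schema).as("parsedDateTime"))
  .select(
    format_string("%04d-%02d-%02d %02d:%02d:%02d",
      $"parsedDateTime.dateTime.date.year",
      $"parsedDateTime.dateTime.date.month",
      $"parsedDateTime.dateTime.date.day",
      $"parsedDateTime.dateTime.time.hour",
      $"parsedDateTime.dateTime.time.minute",
      $"parsedDateTime.dateTime.time.second"
    ).as("TriggeredDateTime")
  )
  .show(false)
// |2019-05-16 04:56:19|
// |2018-05-16 04:56:19|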

Spark Structured Streaming Databricks Event Hub Schema Defining issue

I am having an issue with defining the structure for the JSON document.
Now I am trying to apply the same schema on the stream read.
val jsonSchema = StructType([ StructField("associatedEntities", struct<driver:StringType,truck:StringType>, True),
StructField("heading", StringType, True),
StructField("location", struct<accuracyType:StringType,captureDateTime:StringType,cityStateCode:StringType,description:StringType,latitude:DoubleType,longitude:DoubleType,quality:StringType,transmitDateTime:StringType>, True),
StructField("measurements", array<struct<type:StringType,uom:StringType,value:StringType>>, True),
StructField("source", struct<entityType:StringType,key:StringType,vendor:StringType>, True),
StructField("speed", DoubleType, True)])
val df = spark
.readStream
.format("eventhubs")
//.schema(jsonSchema)
.options(ehConf.toMap)
.load()
When I run this cell in the notebook, I get:
:15: error: illegal start of simple expression
val jsonSchema = StructType([ StructField("associatedEntities", struct, True),
Edit: The goal is to get the data into a dataframe. I can get the JSON string from the body of the Event Hub message, but I am not sure what to do from there if I can't get the schema to work.
You get the error message because of your schema definition. The schema definition should look something like this:
import org.apache.spark.sql.types._
val jsonSchema = StructType(
  Seq(
    StructField("associatedEntities", StructType(Seq(
      StructField("driver", StringType),
      StructField("truck", StringType)
    ))),
    StructField("heading", StringType),
    StructField("measurements", ArrayType(StructType(Seq(
      StructField("type", StringType),
      StructField("uom", StringType),
      StructField("value", StringType)
    ))))
  )
)
You can double-check the schema with:
jsonSchema.printTreeString
Giving you the schema back:
root
|-- associatedEntities: struct (nullable = true)
| |-- driver: string (nullable = true)
| |-- truck: string (nullable = true)
|-- heading: string (nullable = true)
|-- measurements: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- type: string (nullable = true)
| | |-- uom: string (nullable = true)
| | |-- value: string (nullable = true)
As mentioned in the comments, you get binary data. So first you get the raw dataframe:
val rawData = spark.readStream
.format("eventhubs")
.option(...)
.load()
You then have to:
- convert the data to a string
- parse the nested JSON
- and flatten it
Define the dataframe with the parsed data:
import org.apache.spark.sql.functions.from_json
import spark.implicits._

val parsedData = rawData
  .selectExpr("cast (Body as string) as json")
  .select(from_json($"json", jsonSchema).as("data"))
  .select("data.*")

StructType from Array

What do I need to do?
Create a schema for a DataFrame that should look like this:
root
|-- doubleColumn: double (nullable = false)
|-- longColumn: long (nullable = false)
|-- col0: double (nullable = true)
|-- col1: double (nullable = true)
...
Columns with prefix col can vary in number. Their names are stored in an array ar: Array[String].
My attempt
val schema = StructType(
StructField("doubleColumn", DoubleType, false) ::
StructField("longColumn", LongType, false) ::
ar.map(item => StructField(item, DoubleType, true)) // how to reduce it?
Nil
)
I have a problem with the commented line; I don't know how to pass this array.
There is no need to reduce anything. You can just prepend the list of known columns:
val schema = StructType(Seq(
StructField("doubleColumn", DoubleType, false),
StructField("longColumn", LongType, false)
) ++ ar.map(item => StructField(item, DoubleType, true))
)
You might also use foldLeft:
ar.foldLeft(StructType(Seq(
  StructField("doubleColumn", DoubleType, false),
  StructField("longColumn", LongType, false)
)))((acc, name) => acc.add(name, DoubleType, true))
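Either way, the resulting schema can then be used like any other StructType. A minimal sketch (the ar values and the sample row below are made up for illustration):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val ar = Array("col0", "col1") // hypothetical dynamic column names

val schema = StructType(Seq(
  StructField("doubleColumn", DoubleType, false),
  StructField("longColumn", LongType, false)
) ++ ar.map(item => StructField(item, DoubleType, true)))

// One sample row matching doubleColumn, longColumn, col0, col1.
val rdd = spark.sparkContext.parallelize(Seq(Row(1.5, 10L, 0.1, 0.2)))
spark.createDataFrame(rdd, schema).printSchema()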

Cast values of a Spark dataframe using a defined StructType

Is there a way to cast all the values of a dataframe using a StructType?
Let me explain my question using an example:
Let's say that we obtained a dataframe after reading from a file (I am providing code which generates this dataframe, but in my real-world project I obtain it by reading from a file):
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import spark.implicits._
val rows1 = Seq(
Row("1", Row("a", "b"), "8.00", Row("1","2")),
Row("2", Row("c", "d"), "9.00", Row("3","4"))
)
val rows1Rdd = spark.sparkContext.parallelize(rows1, 4)
val schema1 = StructType(
Seq(
StructField("id", StringType, true),
StructField("s1", StructType(
Seq(
StructField("x", StringType, true),
StructField("y", StringType, true)
)
), true),
StructField("d", StringType, true),
StructField("s2", StructType(
Seq(
StructField("u", StringType, true),
StructField("v", StringType, true)
)
), true)
)
)
val df1 = spark.createDataFrame(rows1Rdd, schema1)
println("Schema with nested struct")
df1.printSchema()
root
|-- id: string (nullable = true)
|-- s1: struct (nullable = true)
| |-- x: string (nullable = true)
| |-- y: string (nullable = true)
|-- d: string (nullable = true)
|-- s2: struct (nullable = true)
| |-- u: string (nullable = true)
| |-- v: string (nullable = true)
Now let's say that my client provided me with the schema of the data they want (which is equivalent to the schema of the dataframe we read, but with different data types: StringType, IntegerType, ...):
val wantedSchema = StructType(
Seq(
StructField("id", IntegerType, true),
StructField("s1", StructType(
Seq(
StructField("x", StringType, true),
StructField("y", StringType, true)
)
), true),
StructField("d", DoubleType, true),
StructField("s2", StructType(
Seq(
StructField("u", IntegerType, true),
StructField("v", IntegerType, true)
)
), true)
)
)
What's the best way to cast the dataframe's values using the provided StructType?
It would be great if there were a method we could apply to a dataframe that applies the new StructType by casting all the values by itself.
PS: This is a small dataframe used as an example; in my project the dataframe contains many more rows.
If it were a small dataframe with few columns, I could have done the cast easily, but in my case I am looking for a smart solution that casts all the values by applying a StructType, without having to cast each column/value manually in the code.
I will be grateful for any help you can provide. Thanks a lot!
After a lot of research, here's a generic solution to cast a dataframe following a schema:
val castedDf = df1.selectExpr(wantedSchema.map(
field => s"CAST ( ${field.name} As ${field.dataType.sql}) ${field.name}"
): _*)
Here's the schema of the casted dataframe:
castedDf.printSchema
root
|-- id: integer (nullable = true)
|-- s1: struct (nullable = true)
| |-- x: string (nullable = true)
| |-- y: string (nullable = true)
|-- d: double (nullable = true)
|-- s2: struct (nullable = true)
| |-- u: integer (nullable = true)
| |-- v: integer (nullable = true)
I hope this helps someone; I spent 5 days looking for this simple/generic solution.
There's no automatic way to perform the conversion. You can express the conversion logic in Spark SQL to convert everything in one pass; the resulting SQL might get quite big, though, if you have a lot of fields. But at least you get to keep all your transformations in one place.
Example:
df1.selectExpr("CAST (id AS INTEGER) as id",
"STRUCT (s1.x, s1.y) AS s1",
"CAST (d AS DECIMAL) as d",
"STRUCT (CAST (s2.u AS INTEGER), CAST (s2.v AS INTEGER)) as s2").show()
One thing to watch out for is that whenever conversion fails (e.g., when d is not a number), you'll get a NULL. One option is to run some validation prior to the conversion, and then filter out the df1 records to only convert the valid ones.
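For example, a simple validity filter before the cast could look like this (a sketch, not part of the original answer; note that it also drops rows where d is null to begin with):
// Sketch: keep only rows whose d column can actually be cast to double.
val validDf = df1.filter($"d".cast("double").isNotNull)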

Spark: Programmatically creating dataframe schema in scala

I have a smallish dataset that will be the result of a Spark job. I am thinking about converting this dataset to a dataframe for convenience at the end of the job, but have struggled to correctly define the schema. The problem is the last field below (topValues); it is an ArrayBuffer of tuples -- keys and counts.
val innerSchema =
StructType(
Array(
StructField("value", StringType),
StructField("count", LongType)
)
)
val outputSchema =
StructType(
Array(
StructField("name", StringType, nullable=false),
StructField("index", IntegerType, nullable=false),
StructField("count", LongType, nullable=false),
StructField("empties", LongType, nullable=false),
StructField("nulls", LongType, nullable=false),
StructField("uniqueValues", LongType, nullable=false),
StructField("mean", DoubleType),
StructField("min", DoubleType),
StructField("max", DoubleType),
StructField("topValues", innerSchema)
)
)
val result = stats.columnStats.map{ c =>
Row(c._2.name, c._1, c._2.count, c._2.empties, c._2.nulls, c._2.uniqueValues, c._2.mean, c._2.min, c._2.max, c._2.topValues.topN)
}
val rdd = sc.parallelize(result.toSeq)
val outputDf = sqlContext.createDataFrame(rdd, outputSchema)
outputDf.show()
The error I'm getting is a MatchError: scala.MatchError: ArrayBuffer((10,2), (20,3), (8,1)) (of class scala.collection.mutable.ArrayBuffer)
When I debug and inspect my objects, I'm seeing this:
rdd: ParallelCollectionRDD[2]
rdd.data: "ArrayBuffer" size = 2
rdd.data(0): [age,2,6,0,0,3,14.666666666666666,8.0,20.0,ArrayBuffer((10,2), (20,3), (8,1))]
rdd.data(1): [gender,3,6,0,0,2,0.0,0.0,0.0,ArrayBuffer((M,4), (F,2))]
It seems to me that I've accurately described the ArrayBuffer of tuples in my innerSchema, but Spark disagrees.
Any idea how I should be defining the schema?
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val rdd = sc.parallelize(Array(Row(ArrayBuffer(1, 2, 3, 4))))
val df = sqlContext.createDataFrame(
  rdd,
  StructType(Seq(StructField("arr", ArrayType(IntegerType, false), false)))
)
df.printSchema
root
|-- arr: array (nullable = false)
| |-- element: integer (containsNull = false)
df.show
+------------+
| arr|
+------------+
|[1, 2, 3, 4]|
+------------+
As David pointed out, I needed to use an ArrayType. Spark is happy with this:
val outputSchema =
StructType(
Array(
StructField("name", StringType, nullable=false),
StructField("index", IntegerType, nullable=false),
StructField("count", LongType, nullable=false),
StructField("empties", LongType, nullable=false),
StructField("nulls", LongType, nullable=false),
StructField("uniqueValues", LongType, nullable=false),
StructField("mean", DoubleType),
StructField("min", DoubleType),
StructField("max", DoubleType),
StructField("topValues", ArrayType(StructType(Array(
StructField("value", StringType),
StructField("count", LongType)
))))
)
)
import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val searchPath = "/path/to/.csv"
val columns = "col1,col2,col3,col4,col5,col6,col7"
val fields = columns.split(",")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val customSchema = StructType(fields)

var dfPivot = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("inferSchema", "false")
  .schema(customSchema)
  .load(searchPath)
Loading the data with a custom schema is much faster than letting Spark infer the schema, since schema inference requires an extra pass over the data.
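For comparison, the inferred-schema variant (a sketch of the slower option, reusing searchPath from above) would look like this and triggers that additional pass over the file:
// Sketch: schema inference scans the data once just to determine the types.
val dfInferred = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("inferSchema", "true")
  .load(searchPath)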