I am using Structured Streaming (Spark 2.4.0) to read Avro messages from Kafka and the Confluent Schema Registry to retrieve the schema.
I am unable to access the deeply nested fields.
The schema looks like this in compacted .avsc format:
{"type":"record","name":"KafkaMessage","namespace":"avro.pojo","fields":[{"name":"context","type":["null",{"type":"record","name":"Context","fields":[{"name":"businessInteractionId","type":["null","string"]},{"name":"referenceNumber","type":["null","string"]},{"name":"serviceName","type":["null","string"]},{"name":"status","type":["null","string"]},{"name":"sourceSystems","type":["null",{"type":"array","items":{"type":"record","name":"SourceSystem","fields":[{"name":"orderId","type":["null","string"]},{"name":"revisionNumber","type":["null","string"]},{"name":"systemId","type":["null","string"]}]}}]},{"name":"sysDate","type":["null","string"]}]}]}]}
As parsed in Spark:
context
|-- businessInteractionId: string (nullable = true)
|-- referenceNumber: string (nullable = true)
|-- serviceName: string (nullable = true)
|-- sourceSystems: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- orderId: string (nullable = true)
| | |-- revisionNumber: string (nullable = true)
| | |-- systemId: string (nullable = true)
|-- status: string (nullable = true)
|-- sysDate: string (nullable = true)
My approach: cast the returned object to a GenericRecord and the array to GenericData.Array[GenericRecord] (Link).
Code:
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.kafka.common.serialization.Deserializer
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
import io.confluent.kafka.serializers.KafkaAvroDeserializer
import org.apache.spark.sql.functions.col
import spark.implicits._

val client = new CachedSchemaRegistryClient(schemaRegUrl, 100)
val brdDeser = spark.sparkContext.broadcast(
  new KafkaAvroDeserializer(client).asInstanceOf[Deserializer[GenericRecord]])

val results = df.select(col("value").as[Array[Byte]]).map { rawBytes: Array[Byte] =>
  // read the raw bytes from Spark and then use the Confluent deserializer to get the record back
  val deser = brdDeser.value
  val decoded = deser.deserialize(topics, rawBytes)
  val context_GR = decoded.get("context").asInstanceOf[GenericRecord]
  val c_businessInteractionId = context_GR.get("businessInteractionId").toString // this works
  val c1_sourceSystems = context_GR
    .get("sourceSystems")
    .asInstanceOf[GenericData.Array[GenericRecord]]
  val c_orderId = c1_sourceSystems.get(0).get("orderId").toString // NullPointerException
  val c_revisionNumber = c1_sourceSystems.get(0).get("revisionNumber").toString
  val c_systemId = c1_sourceSystems.get(0).get("systemId").toString
  new CaseMessage(c_businessInteractionId, c_orderId, c_revisionNumber, c_systemId)
}
case class CaseMessage(c_businessInteractionId: String,
                       c_orderId: String,
                       c_revisionNumber: String,
                       c_systemId: String)
Each time I receive a java.lang.NullPointerException when it tries to evaluate c_orderId.
This was a data issue. I was able to resolve it by adding a null value check:
val c_orderId = if (c1_sourceSystems.get(0).get("orderId") != null) {
  c1_sourceSystems.get(0).get("orderId").toString
} else {
  "" // assumed fallback; substitute whatever default fits your case
}
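More generally, here is a minimal sketch (not from the original post) of wrapping nullable Avro union fields in an Option so that missing values never throw; the helper name optString is purely illustrative:

import org.apache.avro.generic.GenericRecord

// Safely read a nullable Avro field as Option[String]
def optString(record: GenericRecord, field: String): Option[String] =
  Option(record.get(field)).map(_.toString)

val c_orderId = optString(c1_sourceSystems.get(0), "orderId").getOrElse("")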
I have the following schemas that I read from CSV:
val PersonSchema = StructType(Array(StructField("PersonID",StringType,true), StructField("Name",StringType,true)))
val AddressSchema = StructType(Array(StructField("PersonID",StringType,true), StructField("StreetNumber",StringType,true), StructField("StreetName",StringType,true)))
One person can have multiple addresses and is related through PersonID.
Can someone help transform the records into PersonAddress records as in the following case class definitions?
case class Address(StreetNumber:String, StreetName:String)
case class PersonAddress(PersonID:String, Name:String, Addresses:Array[Address])
I have tried the following, but it throws an exception in the last step:
val results = personData.join(addressData, Seq("PersonID"), "left_outer")
  .groupBy("PersonID", "Name")
  .agg(collect_list(struct("StreetNumber", "StreetName")) as "Addresses")

val personAddresses = results.map(data =>
  PersonAddress(data.getAs("PersonID"), data.getAs("Name"), data.getAs("Addresses")))

personAddresses.show
Gives an error:
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to $line26.$read$$iw$$iw$Address
The easiest solution in this situation would be to use a UDF. First, collect the street numbers and names as two separate lists, then use the UDF to convert everything into a DataFrame of PersonAddress.
val convertToCase = udf((id: String, name: String, streetNumbers: Seq[String], streetNames: Seq[String]) => {
  val addresses = streetNumbers.zip(streetNames)
  PersonAddress(id, name, addresses.map(t => Address(t._1, t._2)).toArray)
})
val results = personData.join(addressData, Seq("PersonID"), "left_outer")
.groupBy("PersonID","Name")
.agg(collect_list($"StreetNumber").as("StreetNumbers"),
collect_list($"StreetName").as("StreetNames"))
val personAddresses = results.select(convertToCase($"PersonID", $"Name", $"StreetNumbers", $"StreetNames").as("Person"))
This will give you a schema as below.
root
|-- Person: struct (nullable = true)
| |-- PersonID: string (nullable = true)
| |-- Name: string (nullable = true)
| |-- Addresses: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- StreetNumber: string (nullable = true)
| | | |-- StreetName: string (nullable = true)
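Alternatively, a sketch that skips the UDF (not part of the original answer, assuming Spark 2.x with the case class encoders from spark.implicits._ in scope): because the struct fields match Address and the aggregated column is named Addresses, the original aggregation can be read directly as a Dataset of PersonAddress.

import org.apache.spark.sql.functions.{collect_list, struct}
import spark.implicits._

// The array<struct<StreetNumber,StreetName>> column maps onto Array[Address]
val personAddressDS = personData.join(addressData, Seq("PersonID"), "left_outer")
  .groupBy("PersonID", "Name")
  .agg(collect_list(struct("StreetNumber", "StreetName")) as "Addresses")
  .as[PersonAddress]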
I have an RDD that has been created from some JSON; each record in the RDD contains key/value pairs. My RDD looks like:
myRdd.foreach(println)
{"sequence":89,"id":8697344444103393,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527636408955},1],
{"sequence":153,"id":8697389197662617,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637852762},1],
{"sequence":155,"id":8697389381205360,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637858607},1],
{"sequence":136,"id":8697374208897843,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637405129},1],
{"sequence":189,"id":8697413135394406,"trackingInfo":{"row":0,"trackId":14272744,"requestId":"284929d9-6147-4924-a19f-4a308730354c-3348447","rank":0,"videoId":80075830,"location":"PostPlay\/Next"},"type":["Play","Action","Session"],"time":527638558756},1],
{"sequence":130,"id":8697373887446384,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637394083}]
I would like to convert each record to a row in a Spark DataFrame; the nested fields in trackingInfo should be their own columns, and the type list should also be its own column.
So far I've tried to split it using a case class:
case class Event(
sequence: String,
id: String,
trackingInfo:String,
location:String,
row:String,
trackId: String,
listrequestId: String,
videoId:String,
rank: String,
requestId: String,
`type`:String,
time: String)
val dataframeRdd = myRdd.map(line => line.split(",")).
map(array => Event(
array(0).split(":")(1),
array(1).split(":")(1),
array(2).split(":")(1),
array(3).split(":")(1),
array(4).split(":")(1),
array(5).split(":")(1),
array(6).split(":")(1),
array(7).split(":")(1),
array(8).split(":")(1),
array(9).split(":")(1),
array(10).split(":")(1),
array(11).split(":")(1)
))
However, I keep getting java.lang.ArrayIndexOutOfBoundsException: 1 errors.
What is the best way to do this? As you can see, record number 5 has a slight difference in the ordering of some attributes. Is it possible to parse based on attribute names instead of splitting on "," etc.?
I'm using Spark 1.6.x
Your JSON RDD contains invalid JSON strings. You need to convert them to valid JSON first:
val validJsonRdd = myRdd.map(x => x.replace(",1],", ",").replace("}]", "}"))
Then you can use the sqlContext to read the valid JSON RDD into a DataFrame:
val df = sqlContext.read.json(validJsonRdd)
which should give you the following DataFrame (I used the invalid JSON you provided in the question):
+----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+
|id |sequence|time |trackingInfo |type |
+----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+
|8697344444103393|89 |527636408955|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697389197662617|153 |527637852762|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697389381205360|155 |527637858607|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697374208897843|136 |527637405129|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697413135394406|189 |527638558756|[null,PostPlay/Next,0,284929d9-6147-4924-a19f-4a308730354c-3348447,0,14272744,80075830] |[Play, Action, Session]|
|8697373887446384|130 |527637394083|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
+----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+
and the schema of the DataFrame is:
root
|-- id: long (nullable = true)
|-- sequence: long (nullable = true)
|-- time: long (nullable = true)
|-- trackingInfo: struct (nullable = true)
| |-- listId: string (nullable = true)
| |-- location: string (nullable = true)
| |-- rank: long (nullable = true)
| |-- requestId: string (nullable = true)
| |-- row: long (nullable = true)
| |-- trackId: long (nullable = true)
| |-- videoId: long (nullable = true)
|-- type: array (nullable = true)
| |-- element: string (containsNull = true)
I hope the answer is helpful
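Since the question also asks for the nested trackingInfo fields as top-level columns, here is a minimal sketch (not part of the original answer) that flattens the struct by selecting each nested field explicitly:

import org.apache.spark.sql.functions.col

// Pull each nested trackingInfo field up into its own top-level column
val flatDf = df.select(
  col("id"), col("sequence"), col("time"), col("type"),
  col("trackingInfo.location").as("location"),
  col("trackingInfo.row").as("row"),
  col("trackingInfo.trackId").as("trackId"),
  col("trackingInfo.listId").as("listId"),
  col("trackingInfo.videoId").as("videoId"),
  col("trackingInfo.rank").as("rank"),
  col("trackingInfo.requestId").as("requestId"))

Depending on your Spark version, a star expansion such as df.select("id", "sequence", "time", "type", "trackingInfo.*") may do the same in one go.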
You can use sqlContext.read.json(myRDD.map(_._2)) to read the JSON into a DataFrame.
I have a complex DataFrame structure and would like to null a column easily. I've created implicit classes that wire up functionality and easily address 2D DataFrame structures, but once the DataFrame becomes more complicated with ArrayType or MapType I've not had much luck. For example:
I have a schema defined as:
StructType(
  StructField(name, StringType, true),
  StructField(data, ArrayType(
    StructType(
      StructField(name, StringType, true),
      StructField(values, MapType(StringType, StringType, true), true)
    ),
    true
  ), true)
)
I'd like to produce a new DF that has the field data.values of MapType set to null, but as this is an element of an array I have not been able to figure out how. I would think it would be similar to:
df.withColumn("data.values", functions.array(functions.lit(null)))
but this ultimately creates a new column of data.values and does not modify the values element of the data array.
Since Spark 1.6, you can use case classes to map your DataFrames (as Datasets). Then you can map your data and transform it to the new schema you want. For example:
case class Root(name: String, data: Seq[Data])
case class Data(name: String, values: Map[String, String])
case class NullableRoot(name: String, data: Seq[NullableData])
case class NullableData(name: String, value: Map[String, String], values: Map[String, String])
import spark.implicits._ // or sqlContext.implicits._ on Spark 1.6, for the case class encoders

val nullableDF = df.as[Root].map { root =>
  val nullableData = root.data.map(data => NullableData(data.name, null, data.values))
  NullableRoot(root.name, nullableData)
}.toDF()
The resulting schema of nullableDF will be:
root
|-- name: string (nullable = true)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- value: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- values: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
I ran into the same issue. Assuming you don't need the result to have any new fields or fields with different types, here is a solution that can do this without having to redefine the whole struct: Change value of nested column in DataFrame
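For reference, a minimal sketch of that struct-rebuilding idea (my own illustration for the schema above, not the linked answer verbatim): rebuild each element of the data array with a UDF, keeping name and setting values to null.

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

case class DataElem(name: String, values: Map[String, String])

// Rebuild each struct in the "data" array with its "values" map nulled out
val nullValues = udf((data: Seq[Row]) =>
  data.map(elem => DataElem(elem.getAs[String]("name"), null))
)

val result = df.withColumn("data", nullValues(col("data")))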