TaskNotSerializable with Spark Dataset transformations - scala

I'm trying to filter a Spark Dataset by one of its fields, but as a result of the transformation I get TaskNotSerializable. Here's what the method looks like:
def filterData(input: Dataset[CustomType], idsList: List[Int]): Dataset[CustomType] =
  input.flatMap { record =>
    val filtered = record.data.filter(rec => idsList.contains(rec.id))
    if (filtered.nonEmpty)
      Seq(record.withFields(filtered))
    else
      Iterable.empty
  }
I tried a similar transformation on a Spark DataFrame and it worked without any issues:
input.withColumn("arr", explode($"data"))
  .filter($"arr.id".isin(idsList: _*)) // List(34, 81, 95)
  .drop("arr")
  .as[CustomType]
How can the Dataset transformation be fixed to avoid this error?
The CustomType structure is the following:
|-- url: string (nullable = true)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: integer (nullable = false)
| | |-- expiration: long (nullable = false)
| | |-- weight: integer (nullable = false)
|-- queryParams: array (nullable = true)
| |-- element: string (containsNull = true)
The serialization stack of the error looks like this:
Serialization stack:
- object not serializable (class: job.DataSink, value: job.DataSink@6dda8f39)
- field (class: job.DataSink$$anonfun$filterData$1, name: $outer, type: class job.DataSink)
- object (class job.DataSink$$anonfun$filterData$1, <function1>)
- field (class: org.apache.spark.sql.Dataset$$anonfun$flatMap$1, name: func$6, type: interface scala.Function1)
- object (class org.apache.spark.sql.Dataset$$anonfun$flatMap$1, <function1>)
- field (class: org.apache.spark.sql.execution.MapPartitionsExec, name: func, type: interface scala.Function1)
- object (class org.apache.spark.sql.execution.MapPartitionsExec, MapPartitions <function1>, obj#392: schemas.data.CustomType
+- Scan[obj#383]
)

From the serialization stack I would say that this function is part of the job.DataSink class, and Spark is trying to serialize the whole class to send it to the workers. You need to have job.DataSink implement Serializable if you want Spark to be able to serialize it.
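For illustration, here are two hedged sketches of what that usually looks like in practice. The class and method names follow the stack trace above, FilterFunctions is a hypothetical name, and an implicit Encoder[CustomType] (e.g. via spark.implicits._) is assumed to be in scope:

// Sketch 1: follow the answer literally and make the enclosing class serializable,
// so the captured $outer reference (the job.DataSink instance) can be shipped to executors.
class DataSink extends Serializable {
  def filterData(input: Dataset[CustomType], idsList: List[Int]): Dataset[CustomType] =
    input.flatMap { record =>
      val filtered = record.data.filter(rec => idsList.contains(rec.id))
      if (filtered.nonEmpty) Seq(record.withFields(filtered)) else Iterable.empty
    }
}

// Sketch 2: alternatively, move the function into a standalone object so the closure
// has no outer class instance to capture in the first place.
object FilterFunctions {
  def filterData(input: Dataset[CustomType], idsList: List[Int]): Dataset[CustomType] =
    input.flatMap { record =>
      val filtered = record.data.filter(rec => idsList.contains(rec.id))
      if (filtered.nonEmpty) Seq(record.withFields(filtered)) else Iterable.empty
    }
}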

Related

Unable to access de-serialized nested avro generic record elements in scala

I am using Structured Streaming (Spark 2.4.0) to read Avro messages through Kafka, using the Confluent Schema Registry to receive/read the schema, and I am unable to access the deeply nested fields.
The schema looks like this in compacted avsc format:
{"type":"record","name":"KafkaMessage","namespace":"avro.pojo","fields":[{"name":"context","type":["null",{"type":"record","name":"Context","fields":[{"name":"businessInteractionId","type":["null","string"]},{"name":"referenceNumber","type":["null","string"]},{"name":"serviceName","type":["null","string"]},{"name":"status","type":["null","string"]},{"name":"sourceSystems","type":["null",{"type":"array","items":{"type":"record","name":"SourceSystem","fields":[{"name":"orderId","type":["null","string"]},{"name":"revisionNumber","type":["null","string"]},{"name":"systemId","type":["null","string"]}]}}]},{"name":"sysDate","type":["null","string"]}]}]}]}
As parsed in Spark:
context
|-- businessInteractionId: string (nullable = true)
|-- referenceNumber: string (nullable = true)
|-- serviceName: string (nullable = true)
|-- sourceSystems: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- orderId: string (nullable = true)
| | |-- revisionNumber: string (nullable = true)
| | |-- systemId: string (nullable = true)
|-- status: string (nullable = true)
|-- sysDate: string (nullable = true)
My approach: cast the returned object to a GenericRecord and the array to GenericData.Array[GenericRecord] (Link).
Code
val client = new CachedSchemaRegistryClient(schemaRegUrl, 100)
val brdDeser = spark.sparkContext.broadcast(
  new KafkaAvroDeserializer(client).asInstanceOf[Deserializer[GenericRecord]])

val results = df.select(col("value").as[Array[Byte]]).map { rawBytes: Array[Byte] =>
  // read the raw bytes from Spark and then use the Confluent deserializer to get the record back
  val deser = brdDeser.value
  val decoded = deser.deserialize(topics, rawBytes)
  val context_GR = decoded.get("context").asInstanceOf[GenericRecord]
  val c_businessInteractionId = context_GR.get("businessInteractionId").toString // this works
  val c1_sourceSystems = context_GR
    .get("sourceSystems")
    .asInstanceOf[GenericData.Array[GenericRecord]]
  val c_orderId = c1_sourceSystems.get(0).get("orderId").toString // NullPointerException
  val c_revisionNumber = c1_sourceSystems.get(0).get("revisionNumber").toString
  val c_systemId = c1_sourceSystems.get(0).get("systemId").toString
  new CaseMessage(c_businessInteractionId, c_orderId, c_revisionNumber, c_systemId)
}
case class CaseMessage(c_businessInteractionId: String,
                       c_orderId: String,
                       c_revisionNumber: String,
                       c_systemId: String)
Each time I receive a java.lang.NullPointerException when it tries to evaluate c_orderId.
This was a data issue. I was able to resolve it by adding a null value check:
val c_orderId = if (c1_sourceSystems.get(0).get("orderId") != null) {
  c1_sourceSystems.get(0).get("orderId").toString
} else {
  "" // assumed default when orderId is absent
}
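For reference, the same guard can be expressed with Option (a sketch using the names from the code above; the result becomes an Option[String] instead of a bare String):

// Wrap the possibly-null Avro field in Option instead of an explicit null check.
val c_orderIdOpt: Option[String] =
  Option(c1_sourceSystems.get(0).get("orderId")).map(_.toString)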

Show call failing with a grouped and aggregated dataframe

I am trying to use sum after groupBy, like this:
val b = a.groupBy($"key").agg(sum($"value"))
The schema of a is the following:
|-- key: string (nullable = true)
|-- value: integer (nullable = false)
while the schema of b is the following:
|-- key: string (nullable = true)
|-- sum(value): long (nullable = true)
But when I do b.show, I get this error.
cannot assign instance of scala.collection.immutable.List$SerializationProxy
to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_
of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
What could be the reason for this error? I am using Spark 2.3.2 and running the code in an Apache Zeppelin note.

De-normalizing data in Spark Scala

I have the following schema that I read from CSV:
val PersonSchema = StructType(Array(StructField("PersonID",StringType,true), StructField("Name",StringType,true)))
val AddressSchema = StructType(Array(StructField("PersonID",StringType,true), StructField("StreetNumber",StringType,true), StructField("StreetName",StringType,true)))
One person can have multiple addresses and is related through PersonID.
Can someone help transform the records into PersonAddress records as in the following case class definition?
case class Address(StreetNumber:String, StreetName:String)
case class PersonAddress(PersonID:String, Name:String, Addresses:Array[Address])
I have tried the following, but it throws an exception in the last step:
val results = personData.join(addressData, Seq("PersonID"), "left_outer")
  .groupBy("PersonID", "Name")
  .agg(collect_list(struct("StreetNumber", "StreetName")) as "Addresses")

val personAddresses = results.map(data =>
  PersonAddress(data.getAs("PersonID"), data.getAs("Name"), data.getAs("Addresses")))

personAddresses.show
Gives an error:
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to $line26.$read$$iw$$iw$Address
The easiest solution in this situation would be to use a UDF. First, collect the street numbers and names as two separate lists, then use the UDF to convert everything into a DataFrame of PersonAddress.
val convertToCase = udf((id: String, name: String, streetNumbers: Seq[String], streetNames: Seq[String]) => {
  // zip the numbers with the names so each pair becomes one Address(StreetNumber, StreetName)
  val addresses = streetNumbers.zip(streetNames)
  PersonAddress(id, name, addresses.map(t => Address(t._1, t._2)).toArray)
})

val results = personData.join(addressData, Seq("PersonID"), "left_outer")
  .groupBy("PersonID", "Name")
  .agg(collect_list($"StreetNumber").as("StreetNumbers"),
       collect_list($"StreetName").as("StreetNames"))

val personAddresses = results.select(convertToCase($"PersonID", $"Name", $"StreetNumbers", $"StreetNames").as("Person"))
This will give you a schema as below.
root
|-- Person: struct (nullable = true)
| |-- PersonID: string (nullable = true)
| |-- Name: string (nullable = true)
| |-- Addresses: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- StreetNumber: string (nullable = true)
| | | |-- StreetName: string (nullable = true)
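For comparison, here is a hedged alternative sketch that keeps the collect_list(struct(...)) aggregation from the question and converts the Rows by hand instead of going through a UDF (assumes Spark 2.x with a SparkSession named spark, and the Address/PersonAddress case classes defined at top level):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{collect_list, struct}
import spark.implicits._ // assumed SparkSession named `spark`

// Keep the struct aggregation, then build the case classes explicitly from each Row,
// which avoids the WrappedArray-to-Address cast that fails in the question.
val grouped = personData.join(addressData, Seq("PersonID"), "left_outer")
  .groupBy("PersonID", "Name")
  .agg(collect_list(struct("StreetNumber", "StreetName")) as "Addresses")

val personAddressesDs = grouped.map { row =>
  val addresses = row.getSeq[Row](2).map(a => Address(a.getString(0), a.getString(1)))
  PersonAddress(row.getString(0), row.getString(1), addresses.toArray)
}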

How to convert RDD of JSONs to Dataframe?

I have an RDD that has been created from some JSON; each record in the RDD contains key/value pairs. My RDD looks like:
myRdd.foreach(println)
{"sequence":89,"id":8697344444103393,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527636408955},1],
{"sequence":153,"id":8697389197662617,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637852762},1],
{"sequence":155,"id":8697389381205360,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637858607},1],
{"sequence":136,"id":8697374208897843,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637405129},1],
{"sequence":189,"id":8697413135394406,"trackingInfo":{"row":0,"trackId":14272744,"requestId":"284929d9-6147-4924-a19f-4a308730354c-3348447","rank":0,"videoId":80075830,"location":"PostPlay\/Next"},"type":["Play","Action","Session"],"time":527638558756},1],
{"sequence":130,"id":8697373887446384,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637394083}]
I would like to convert each record to a row in a Spark DataFrame; the nested fields in trackingInfo should be their own columns, and the type list should be its own column as well.
So far I've tried to split it using a case class:
case class Event(
  sequence: String,
  id: String,
  trackingInfo: String,
  location: String,
  row: String,
  trackId: String,
  listrequestId: String,
  videoId: String,
  rank: String,
  requestId: String,
  `type`: String,
  time: String)

val dataframeRdd = myRdd.map(line => line.split(",")).
  map(array => Event(
    array(0).split(":")(1),
    array(1).split(":")(1),
    array(2).split(":")(1),
    array(3).split(":")(1),
    array(4).split(":")(1),
    array(5).split(":")(1),
    array(6).split(":")(1),
    array(7).split(":")(1),
    array(8).split(":")(1),
    array(9).split(":")(1),
    array(10).split(":")(1),
    array(11).split(":")(1)
  ))
However, I keep getting java.lang.ArrayIndexOutOfBoundsException: 1 errors.
What is the best way to do this? As you can see, record number 5 has a slight difference in the ordering of some attributes. Is it possible to parse based on the attribute names instead of splitting on ",", etc.?
I'm using Spark 1.6.x
Your JSON RDD seems to contain invalid JSON. You need to convert the records to valid JSON first:
val validJsonRdd = myRdd.map(x => x.replace(",1],", ",").replace("}]", "}"))
Then you can use sqlContext to read the valid JSON RDD into a DataFrame:
val df = sqlContext.read.json(validJsonRdd)
which should give you the following DataFrame (I used the invalid JSON you provided in the question):
+----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+
|id |sequence|time |trackingInfo |type |
+----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+
|8697344444103393|89 |527636408955|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697389197662617|153 |527637852762|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697389381205360|155 |527637858607|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697374208897843|136 |527637405129|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697413135394406|189 |527638558756|[null,PostPlay/Next,0,284929d9-6147-4924-a19f-4a308730354c-3348447,0,14272744,80075830] |[Play, Action, Session]|
|8697373887446384|130 |527637394083|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
+----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+
and the schema for the DataFrame is:
root
|-- id: long (nullable = true)
|-- sequence: long (nullable = true)
|-- time: long (nullable = true)
|-- trackingInfo: struct (nullable = true)
| |-- listId: string (nullable = true)
| |-- location: string (nullable = true)
| |-- rank: long (nullable = true)
| |-- requestId: string (nullable = true)
| |-- row: long (nullable = true)
| |-- trackId: long (nullable = true)
| |-- videoId: long (nullable = true)
|-- type: array (nullable = true)
| |-- element: string (containsNull = true)
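If you also want the nested trackingInfo fields as their own top-level columns, as the question asks, here is a small follow-up sketch on the df built above (column names are taken from the schema just shown):

import org.apache.spark.sql.functions.col

// Flatten the trackingInfo struct into top-level columns; `type` is already its own column.
val flattened = df.select(
  col("id"), col("sequence"), col("time"), col("type"),
  col("trackingInfo.location").as("location"),
  col("trackingInfo.row").as("row"),
  col("trackingInfo.trackId").as("trackId"),
  col("trackingInfo.listId").as("listId"),
  col("trackingInfo.videoId").as("videoId"),
  col("trackingInfo.rank").as("rank"),
  col("trackingInfo.requestId").as("requestId")
)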
I hope the answer is helpful
You can use sqlContext.read.json(myRDD.map(_._2)) to read the JSON into a DataFrame.

How to modify a Spark Dataframe with a complex nested structure?

I have a complex DataFrame structure and would like to null out a column easily. I've created implicit classes that wire up functionality and easily address 2D DataFrame structures, but once the DataFrame becomes more complicated with ArrayType or MapType I've not had much luck. For example, I have a schema defined as:
StructType(
  StructField(name, StringType, true),
  StructField(data,
    ArrayType(
      StructType(
        StructField(name, StringType, true),
        StructField(values, MapType(StringType, StringType, true), true)
      ),
      true
    ),
    true
  )
)
I'd like to produce a new DF that has the data.values field (the MapType) set to null, but as this is an element of an array I have not been able to figure out how. I would think it would be similar to:
df.withColumn("data.values", functions.array(functions.lit(null)))
but this ultimately creates a new column of data.values and does not modify the values element of the data array.
Since Spark 1.6, you can use case classes to map your DataFrames (as Datasets). Then you can map your data and transform it into the new schema you want. For example:
case class Root(name: String, data: Seq[Data])
case class Data(name: String, values: Map[String, String])

case class NullableRoot(name: String, data: Seq[NullableData])
case class NullableData(name: String, value: Map[String, String], values: Map[String, String])

val nullableDF = df.as[Root].map { root =>
  val nullableData = root.data.map(data => NullableData(data.name, null, data.values))
  NullableRoot(root.name, nullableData)
}.toDF()
The resulting schema of nullableDF will be:
root
|-- name: string (nullable = true)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- value: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- values: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
I ran into the same issue. Assuming you don't need the result to have any new fields or fields with different types, here is a solution that can do this without having to redefine the whole struct: Change value of nested column in DataFrame
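For concreteness, one possible shape of that idea (a sketch, not the linked answer verbatim): rebuild only the data array with a UDF and null out the values map in each element. DataElem is a hypothetical helper mirroring the element struct from the question's schema:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Hypothetical helper mirroring the element struct of `data` in the question's schema.
case class DataElem(name: String, values: Map[String, String])

// Rebuild each element with its `values` map nulled out; the rest of the row is untouched.
val nullValues = udf((data: Seq[Row]) =>
  if (data == null) null
  else data.map(r => DataElem(r.getAs[String]("name"), null)))

val result = df.withColumn("data", nullValues(df("data")))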