Unable to find encoder for type stored in a Dataset when streaming MongoDB data through Kafka - Scala

I want to tail the MongoDB oplog and stream it through Kafka, so I am using the Debezium Kafka CDC connector, which tails the Mongo oplog with its built-in serialization.
The Schema Registry uses the converters below for serialization:
'key.converter=io.confluent.connect.avro.AvroConverter' and
'value.converter=io.confluent.connect.avro.AvroConverter'
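For context, the connector itself is registered with Kafka Connect along these lines; a minimal sketch, where the host and replica-set values are assumptions on my side (the logical name productCollection matches the topic name used below):
name=mongodb-connector
connector.class=io.debezium.connector.mongodb.MongoDbConnector
mongodb.hosts=rs0/127.0.0.1:27017
mongodb.name=productCollection
collection.whitelist=inventory.Product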
Below are the library dependencies I'm using in the project:
libraryDependencies += "io.confluent" % "kafka-avro-serializer" % "3.1.2"
libraryDependencies += "org.apache.kafka" % "kafka-streams" % "0.10.2.0
Below is the streaming code, which deserializes the Avro data:
import org.apache.spark.sql.{Dataset, SparkSession}
import io.confluent.kafka.schemaregistry.client.rest.RestService
import io.confluent.kafka.serializers.KafkaAvroDeserializer
import org.apache.avro.Schema
import scala.collection.JavaConverters._
object KafkaStream {
  def main(args: Array[String]): Unit = {

    val sparkSession = SparkSession
      .builder
      .master("local")
      .appName("kafka")
      .getOrCreate()

    sparkSession.sparkContext.setLogLevel("ERROR")
    import sparkSession.implicits._

    case class DeserializedFromKafkaRecord(key: String, value: String)

    val schemaRegistryURL = "http://127.0.0.1:8081"
    val topicName = "productCollection.inventory.Product"
    val subjectValueName = topicName + "-value"

    //create a RestService object
    val restService = new RestService(schemaRegistryURL)

    //.getLatestVersion returns an io.confluent.kafka.schemaregistry.client.rest.entities.Schema object
    val valueRestResponseSchema = restService.getLatestVersion(subjectValueName)

    //use the Avro parsing classes to get the Avro schema
    val parser = new Schema.Parser
    val topicValueAvroSchema: Schema = parser.parse(valueRestResponseSchema.getSchema)

    //the key schema is typically just a string, but you can follow the same process for the key as for the value
    val keySchemaString = "\"string\""
    val keySchema = parser.parse(keySchemaString)

    //create a map with the schema registry URL;
    //this is the only required configuration for Confluent's KafkaAvroDeserializer
    val props = Map("schema.registry.url" -> schemaRegistryURL)

    //declare the SerDe vars before the structured streaming map, to avoid a non-serializable class exception
    var keyDeserializer: KafkaAvroDeserializer = null
    var valueDeserializer: KafkaAvroDeserializer = null

    //create a structured streaming DataFrame that reads from the topic
    val rawTopicMessageDF = sparkSession.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "127.0.0.1:9092")
      .option("subscribe", topicName)
      .option("startingOffsets", "earliest")
      .option("maxOffsetsPerTrigger", 20) //remove for prod
      .load()

    rawTopicMessageDF.printSchema()

    //instantiate the SerDe classes if not already done, then deserialize
    val deserializedTopicMessageDS = rawTopicMessageDF.map { row =>
      if (keyDeserializer == null) {
        keyDeserializer = new KafkaAvroDeserializer
        keyDeserializer.configure(props.asJava, true) //isKey = true
      }
      if (valueDeserializer == null) {
        valueDeserializer = new KafkaAvroDeserializer
        valueDeserializer.configure(props.asJava, false) //isKey = false
      }

      //pass the Avro schema; the topic name is actually unused in the source code, just required by the signature. Weird, right?
      val deserializedKeyString = keyDeserializer.deserialize(topicName, row.getAs[Array[Byte]]("key"), keySchema).toString
      val deserializedValueJsonString = valueDeserializer.deserialize(topicName, row.getAs[Array[Byte]]("value"), topicValueAvroSchema).toString

      DeserializedFromKafkaRecord(deserializedKeyString, deserializedValueJsonString)
    }

    val deserializedDSOutputStream = deserializedTopicMessageDS.writeStream
      .outputMode("append")
      .format("console")
      .option("truncate", false)
      .start()
  }
}
The Kafka consumer runs fine and I can see the data tailing from the oplog; however, when I run the above code I get the errors below:
Error:(70, 59) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val deserializedTopicMessageDS = rawTopicMessageDF.map{
and
Error:(70, 59) not enough arguments for method map: (implicit evidence$7: org.apache.spark.sql.Encoder[DeserializedFromKafkaRecord])org.apache.spark.sql.Dataset[DeserializedFromKafkaRecord].
Unspecified value parameter evidence$7.
val deserializedTopicMessageDS = rawTopicMessageDF.map{
Please suggest what I'm missing here.
Thanks in advance.

Just declare your case class DeserializedFromKafkaRecord outside of the main method.
I imagine that when the case class is defined inside main, Spark's implicit-encoder magic does not work properly, since the case class does not exist before the main method executes.
The problem can be reproduced with a simpler example (without Kafka):
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object SimpleTest {

  // declare the case class outside of the main method
  case class CaseClass(value: Int)

  def main(args: Array[String]): Unit = {

    // when the case class is declared here instead
    // of outside main, the program does not compile
    // case class CaseClass(value: Int)

    val sparkSession = SparkSession
      .builder
      .master("local")
      .appName("simpletest")
      .getOrCreate()

    import sparkSession.implicits._

    val df: DataFrame = sparkSession.sparkContext.parallelize(1 to 10).toDF()

    val ds: Dataset[CaseClass] = df.map { row =>
      CaseClass(row.getInt(0))
    }

    ds.show()
  }
}
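Applied to the question's code, the fix is just to move DeserializedFromKafkaRecord up to the object level; a sketch, with the body of main otherwise unchanged:
object KafkaStream {

  // declared at object level, so Spark can derive the implicit Encoder
  case class DeserializedFromKafkaRecord(key: String, value: String)

  def main(args: Array[String]): Unit = {
    // ... the streaming code from the question, unchanged ...
  }
}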

Related

Histogram/Counter metrics for Prometheus in a Spark job readStream/writeStream (Kafka to parquet): how to send metrics from a loop, event, or listener

How can I send Histogram/Counter metrics to Prometheus from a Spark job in:
a loop
foreachBatch
the methods of ForeachWriter
Spark events
or using org.apache.spark.metrics.source.Source in a streaming Spark job?
I'm able to accumulate metrics in collection accumulator(s), but I cannot find a context from which I can send the accumulated metrics without a compilation or runtime error.
Common issue:
22/11/28 14:24:36 ERROR MicroBatchExecution: Query [id = 5d2fc03c-1dbc-4bb1-a821-397586d22cf4, runId = e665dcd2-6e3d-4b03-8684-11844de040f0] terminated with error
org.apache.spark.SparkException: Task not serializable
or
the Spark job stops on the Spark worker about 15 seconds after starting, with different variations of the error messages.
Found solution:
It works in a local environment with a simple spark-submit, but it doesn't work on the cluster: the collection returned by SparkEnv.get.metricsSystem.getSourcesByName is always empty.
https://gist.github.com/ambud/641f8fc25f7f8d3923d6fd10f64b7184
I have only found dubious ways to fix this issue. I don't believe there is no common solution.
package org.apache.spark.metrics.source

import com.codahale.metrics.{Counter, Histogram, MetricRegistry}

class PrometheusMetricSource extends Source {
  override val sourceName: String = "PrometheusMetricSource"
  override val metricRegistry: MetricRegistry = new MetricRegistry
  val myMetric: Histogram = metricRegistry.histogram(MetricRegistry.name("myMetric"))
}
import org.apache.spark.SparkEnv
import org.apache.spark.metrics.source.PrometheusMetricSource
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.{DataFrame, Dataset, ForeachWriter, SparkSession}

object Example {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().appName("My Spark job").getOrCreate()
    import spark.implicits._

    val source: PrometheusMetricSource = new PrometheusMetricSource
    SparkEnv.get.metricsSystem.registerSource(source)

    val df: DataFrame = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my-topic")
      .option("startingOffsets", "earliest")
      .load()

    val ds: Dataset[String] =
      df.select(col("value"))
        .as[String]
        .map { str =>
          source.myMetric.update(1L) // submit metric
          str + "test"
        }

    ds.writeStream
      .foreachBatch { (batchDF: Dataset[String], batchId: Long) =>
        source.myMetric.update(1L) // submit metric
      }
      .foreach(new ForeachWriter[String] {
        def open(partitionId: Long, version: Long): Boolean = true
        def close(errorOrNull: Throwable): Unit = {}
        def process(record: String) = {
          source.myMetric.update(1L) // submit metric
        }
      })
      .outputMode("append")
      .format("parquet")
      .option("path", "/share/parquet")
      .option("checkpointLocation", "/share/checkpoints")
      .start()
      .awaitTermination()
  }
}
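One variant I considered, sketched below, is a StreamingQueryListener (source being the PrometheusMetricSource registered above): its callbacks run on the driver, so the metric source is reachable without the serialization issue, although it only sees aggregate progress data rather than per-record values.
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {}
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // driver-side callback: update the histogram with the batch input row count
    source.myMetric.update(event.progress.numInputRows)
  }
})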

How to perform Unit testing on Spark Structured Streaming?

I would like to know about the unit testing side of Spark Structured Streaming. My scenario is: I am getting data from Kafka, consuming it with Spark Structured Streaming, and applying some transformations on top of the data.
I am not sure how to test this using Scala and Spark. Can someone tell me how to do unit testing in Structured Streaming using Scala? I am new to streaming.
tl;dr Use MemoryStream to add events and memory sink for the output.
The following code should help you get started:
import org.apache.spark.sql.execution.streaming.{LongOffset, MemoryStream}

implicit val sqlCtx = spark.sqlContext
import spark.implicits._

val events = MemoryStream[Event]
val sessions = events.toDS
assert(sessions.isStreaming, "sessions must be a streaming Dataset")

// use the sessions event stream to apply the required transformations
val transformedSessions = ...

val streamingQuery = transformedSessions
  .writeStream
  .format("memory")
  .queryName(queryName)
  .option("checkpointLocation", checkpointLocation)
  .outputMode(queryOutputMode)
  .start

// Add events to the MemoryStream as if they came from Kafka
val batch = Seq(
  eventGen.generate(userId = 1, offset = 1.second),
  eventGen.generate(userId = 2, offset = 2.seconds))
val currentOffset = events.addData(batch)
streamingQuery.processAllAvailable()
events.commit(currentOffset.asInstanceOf[LongOffset])

// check the output
// The output is in the queryName table
// The following code simply shows the result
spark
  .table(queryName)
  .show(truncate = false)
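Note that Event and eventGen above come from my own codebase; they are not Spark APIs. A minimal stand-in could look like this (the field names are just an assumption):
import java.sql.Timestamp
import scala.concurrent.duration._

case class Event(userId: Long, time: Timestamp)

object eventGen {
  // generate an event whose timestamp is shifted by the given offset
  def generate(userId: Long, offset: FiniteDuration): Event =
    Event(userId, new Timestamp(System.currentTimeMillis() + offset.toMillis))
}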
So, I tried to implement the answer from @Jacek, but I couldn't figure out how to create the eventGen object, nor how to test a small streaming application that writes data to the console. I am also using MemoryStream, and here I show a small working example.
The class that I am testing is:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.StreamingQuery
import org.apache.spark.sql.{DataFrame, SparkSession, functions}

object StreamingDataFrames {

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName(StreamingDataFrames.getClass.getSimpleName)
      .master("local[2]")
      .getOrCreate()

    val lines = readData(spark, "socket")
    val streamingQuery = writeData(lines)
    streamingQuery.awaitTermination()
  }

  def readData(spark: SparkSession, source: String = "socket"): DataFrame = {
    val lines: DataFrame = spark.readStream
      .format(source)
      .option("host", "localhost")
      .option("port", 12345)
      .load()
    lines
  }

  def writeData(df: DataFrame, sink: String = "console", queryName: String = "calleventaggs", outputMode: String = "append"): StreamingQuery = {
    println(s"Is this a streaming data frame: ${df.isStreaming}")

    val shortLines: DataFrame = df.filter(functions.length(col("value")) >= 3)

    val query = shortLines.writeStream
      .format(sink)
      .queryName(queryName)
      .outputMode(outputMode)
      .start()
    query
  }
}
I test only the writeData method. This is why I split the query into two methods.
Here is the Spec to test the class. I use a SharedSparkSession class to facilitate opening and closing the Spark context, as shown here.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.{LongOffset, MemoryStream}
import org.github.explore.spark.SharedSparkSession
import org.scalatest.funsuite.AnyFunSuite

class StreamingDataFramesSpec extends AnyFunSuite with SharedSparkSession {

  test("spark structured streaming can read from memory socket") {
    // We can import sql implicits
    implicit val sqlCtx = sparkSession.sqlContext
    import sqlImplicits._

    val events = MemoryStream[String]
    val queryName: String = "calleventaggs"

    // Add events to the MemoryStream as if they came from Kafka
    val batch = Seq(
      "this is a value to read",
      "and this is another value"
    )
    val currentOffset = events.addData(batch)

    val streamingQuery = StreamingDataFrames.writeData(events.toDF(), "memory", queryName)
    streamingQuery.processAllAvailable()
    events.commit(currentOffset.asInstanceOf[LongOffset])

    val result: DataFrame = sparkSession.table(queryName)
    result.show

    streamingQuery.awaitTermination(1000L)
    assertResult(batch.size)(result.count)

    val values = result.take(2)
    assertResult(batch(0))(values(0).getString(0))
    assertResult(batch(1))(values(1).getString(0))
  }
}

Encoder issue - Spark Structured Streaming - works in REPL only

I have a working process to ingest and deserialize Kafka Avro messages using the Schema Registry. It works great in the REPL, but when I try to compile I get:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
[error] .map(x => {
I'm not sure whether I need to modify my object, but why would I need to if the REPL works fine?
object AgentDeserializerWrapper {
  val props = new Properties()
  props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryURL)
  props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, "true")
  val vProps = new kafka.utils.VerifiableProperties(props)
  val deser = new KafkaAvroDecoder(vProps)
  val avro_schema = new RestService(schemaRegistryURL).getLatestVersion(subjectValueNameAgentRead)
  val messageSchema = new Schema.Parser().parse(avro_schema.getSchema)
}

case class DeserializedFromKafkaRecord(value: String)

import spark.implicits._

val agentStringDF = spark
  .readStream
  .format("kafka")
  .option("subscribe", "agent")
  .options(kafkaParams)
  .load()
  .map(x => {
    DeserializedFromKafkaRecord(AgentDeserializerWrapper.deser.fromBytes(x.getAs[Array[Byte]]("value"), AgentDeserializerWrapper.messageSchema).asInstanceOf[GenericData.Record].toString)
  })
Add as[DeserializedFromKafkaRecord] in order to statically type your Dataset:
val agentStringDF = spark
  .readStream
  .format("kafka")
  .option("subscribe", "agent")
  .options(kafkaParams)
  .load()
  .as[DeserializedFromKafkaRecord]
  .map(x => {
    DeserializedFromKafkaRecord(AgentDeserializerWrapper.deser.fromBytes(x.getAs[Array[Byte]]("value"), AgentDeserializerWrapper.messageSchema).asInstanceOf[GenericData.Record].toString)
  })

Use schema to convert AVRO messages with Spark to DataFrame

Is there a way to use a schema to convert Avro messages from Kafka to a DataFrame with Spark? The schema file for user records:
{
  "fields": [
    { "name": "firstName", "type": "string" },
    { "name": "lastName", "type": "string" }
  ],
  "name": "user",
  "type": "record"
}
And here are code snippets, based on the SqlNetworkWordCount example and "Kafka, Spark and Avro - Part 3, Producing and consuming Avro messages", to read in the messages:
object Injection {
  val parser = new Schema.Parser()
  val schema = parser.parse(getClass.getResourceAsStream("/user_schema.json"))
  val injection: Injection[GenericRecord, Array[Byte]] = GenericAvroCodecs.toBinary(schema)
}

...

messages.foreachRDD((rdd: RDD[(String, Array[Byte])]) => {
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  import sqlContext.implicits._

  val df = rdd.map(message => Injection.injection.invert(message._2).get)
    .map(record => User(record.get("firstName").toString, record.get("lastName").toString)).toDF()

  df.show()
})

case class User(firstName: String, lastName: String)
Somehow I can't find any way other than using a case class to convert Avro messages to a DataFrame. Is it possible to use the schema instead? I'm using Spark 1.6.2 and Kafka 0.10.
The complete code, in case you're interested:
import com.twitter.bijection.Injection
import com.twitter.bijection.avro.GenericAvroCodecs
import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}
import org.apache.spark.{SparkConf, SparkContext}

object ReadMessagesFromKafka {

  object Injection {
    val parser = new Schema.Parser()
    val schema = parser.parse(getClass.getResourceAsStream("/user_schema.json"))
    val injection: Injection[GenericRecord, Array[Byte]] = GenericAvroCodecs.toBinary(schema)
  }

  def main(args: Array[String]) {
    val brokers = "127.0.0.1:9092"
    val topics = "test"

    // Create context with 2 second batch interval
    val sparkConf = new SparkConf().setAppName("ReadMessagesFromKafka").setMaster("local[*]")
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Create direct kafka stream with brokers and topics
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val messages = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](
      ssc, kafkaParams, topicsSet)

    messages.foreachRDD((rdd: RDD[(String, Array[Byte])]) => {
      val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
      import sqlContext.implicits._

      val df = rdd.map(message => Injection.injection.invert(message._2).get)
        .map(record => User(record.get("firstName").toString, record.get("lastName").toString)).toDF()

      df.show()
    })

    // Start the computation
    ssc.start()
    ssc.awaitTermination()
  }
}

/** Case class for converting RDD to DataFrame */
case class User(firstName: String, lastName: String)

/** Lazily instantiated singleton instance of SQLContext */
object SQLContextSingleton {
  @transient private var instance: SQLContext = _

  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}
The OP probably resolved the issue already, but for future reference I solved this problem quite generally, so I thought it might be helpful to post it here.
Generally speaking, you should convert the Avro schema to a Spark StructType, convert the objects in your RDD to Row, and then use:
spark.createDataFrame(<RDD[obj] mapped to RDD[Row]>, <schema as StructType>)
In order to convert the Avro schema I used spark-avro like so:
SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]
Converting the RDD was trickier. If your schema is simple, you can probably just do a simple map, something like this:
rdd.map(obj => {
  val seq = Seq(obj.getName(), obj.getAge())
  Row.fromSeq(seq)
})
In this example the object has two fields, name and age.
The important thing is to make sure the elements in the Row match the order and types of the fields in the StructType from before.
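Putting the pieces together with the schema conversion above, a hedged sketch (assuming avroSchema and rdd are defined as described, and spark-avro is on the classpath):
import com.databricks.spark.avro.SchemaConverters
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// convert the Avro schema once, then reuse it for every RDD
val sparkSchema = SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]
val rowRdd = rdd.map(obj => Row.fromSeq(Seq(obj.getName(), obj.getAge())))
val df = spark.createDataFrame(rowRdd, sparkSchema)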
In my particular case I had a much more complex object, which I wanted to handle generically to support future schema changes, so my code was much more complex.
The method suggested by the OP should also work for some cases, but will be hard to apply to complex objects (not primitives or case classes).
Another tip: if you have a class within a class, convert that inner class to a Row as well, so that the wrapping class is converted to something like:
Row(Any,Any,Any,Row,...)
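For instance, with a hypothetical Person object wrapping an Address (personRdd, Person, and Address are illustrative names, not from the code above), the mapping and the matching schema could look roughly like this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val nestedSchema = StructType(Seq(
  StructField("name", StringType, true),
  StructField("age", IntegerType, true),
  StructField("address", StructType(Seq(
    StructField("street", StringType, true),
    StructField("city", StringType, true))), true)))

val rowRdd = personRdd.map { p =>
  // the inner object becomes its own Row, matching the nested StructType
  Row(p.getName, p.getAge, Row(p.getAddress.getStreet, p.getAddress.getCity))
}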
You can also look at the spark-avro project I mentioned earlier for how to convert objects to Rows; I used some of its logic myself.
If anyone reading this needs further help, ask me in the comments and I'll try to help.
A similar problem is also solved here.
Please take a look at this
https://github.com/databricks/spark-avro/blob/master/src/test/scala/com/databricks/spark/avro/AvroSuite.scala
So instead of
val df = rdd.map(message => Injection.injection.invert(message._2).get)
  .map(record => User(record.get("firstName").toString, record.get("lastName").toString)).toDF()
you can try this
val df = spark.read.avro(message._2.get)
I worked on a similar issue, but in Java, so I'm not sure about Scala; take a look at the com.databricks.spark.avro library.
For anyone interested in handling this in a way that copes with schema changes without needing to stop and redeploy your Spark application (assuming your app logic can handle this), see this question/answer.

Spark kryo encoder ArrayIndexOutOfBoundsException

I'm trying to create a Dataset with some geo data using Spark and Esri. If Foo only has a Point field, it works, but if I add other fields beyond the Point, I get an ArrayIndexOutOfBoundsException.
import com.esri.core.geometry.Point
import org.apache.spark.sql.{Encoder, Encoders, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

object Main {

  case class Foo(position: Point, name: String)

  object MyEncoders {
    implicit def PointEncoder: Encoder[Point] = Encoders.kryo[Point]
    implicit def FooEncoder: Encoder[Foo] = Encoders.kryo[Foo]
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("app").setMaster("local"))
    val sqlContext = new SQLContext(sc)

    import MyEncoders.{FooEncoder, PointEncoder}
    import sqlContext.implicits._

    Seq(new Foo(new Point(0, 0), "bar")).toDS.show
  }
}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
  at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:71)
  at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:70)
  at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
  at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:70)
  at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:69)
  at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
  at org.apache.spark.sql.execution.Queryable$class.formatString(Queryable.scala:69)
  at org.apache.spark.sql.Dataset.formatString(Dataset.scala:65)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:263)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:230)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:193)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:201)
  at Main$.main(Main.scala:24)
  at Main.main(Main.scala)
Kryo creates encoders for complex data types based on Spark SQL data types. So check the schema that the Kryo encoder produces:
val enc: Encoder[Foo] = Encoders.kryo[Foo]
println(enc.schema) // StructType(StructField(value,BinaryType,true))
val numCols = enc.schema.fieldNames.length // 1
So you have a single column of data in the Dataset, and it is in binary format. But it is strange that Spark attempts to show the Dataset with more than one column (and that is where the error occurs). To fix this, upgrade your Spark version to 2.0.0.
With Spark 2.0.0 you will still have problems with the column data types. Writing a manual schema may work, if you can write a StructType for the Esri Point class:
val schema = StructType(
  Seq(
    StructField("point", StructType(...), true),
    StructField("name", StringType, true)
  )
)
val rdd = sc.parallelize(Seq(Row(new Point(0, 0), "bar")))
sqlContext.createDataFrame(rdd, schema).toDS