Encoder issue: Spark Structured Streaming works in REPL only (Scala)

I have a working process that ingests and deserializes Kafka Avro messages using the Schema Registry. It works great in the REPL, but when I try to compile it I get:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
[error] .map(x => {
I'm not sure if I need to modify my object, but why would I need to if the REPL works fine?
object AgentDeserializerWrapper {
  val props = new Properties()
  props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryURL)
  props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, "true")
  val vProps = new kafka.utils.VerifiableProperties(props)
  val deser = new KafkaAvroDecoder(vProps)
  val avro_schema = new RestService(schemaRegistryURL).getLatestVersion(subjectValueNameAgentRead)
  val messageSchema = new Schema.Parser().parse(avro_schema.getSchema)
}

case class DeserializedFromKafkaRecord(value: String)

import spark.implicits._

val agentStringDF = spark
  .readStream
  .format("kafka")
  .option("subscribe", "agent")
  .options(kafkaParams)
  .load()
  .map(x => {
    DeserializedFromKafkaRecord(AgentDeserializerWrapper.deser.fromBytes(x.getAs[Array[Byte]]("value"), AgentDeserializerWrapper.messageSchema).asInstanceOf[GenericData.Record].toString)
  })

Add .as[DeserializedFromKafkaRecord] to give your Dataset a static type:
val agentStringDF = spark
  .readStream
  .format("kafka")
  .option("subscribe", "agent")
  .options(kafkaParams)
  .load()
  .as[DeserializedFromKafkaRecord]
  .map(x => {
    DeserializedFromKafkaRecord(AgentDeserializerWrapper.deser.fromBytes(x.getAs[Array[Byte]]("value"), AgentDeserializerWrapper.messageSchema).asInstanceOf[GenericData.Record].toString)
  })

Related

Unable to find encoder for type stored in a Dataset for streaming mongo db data through Kafka

I want to tail the Mongo oplog and stream it through Kafka, so I found the Debezium Kafka CDC connector, which tails the Mongo oplog with its built-in serialization technique.
The Schema Registry uses the converters below for serialization:
'key.converter=io.confluent.connect.avro.AvroConverter' and
'value.converter=io.confluent.connect.avro.AvroConverter'
Below are the library dependencies I'm using in the project
libraryDependencies += "io.confluent" % "kafka-avro-serializer" % "3.1.2"
libraryDependencies += "org.apache.kafka" % "kafka-streams" % "0.10.2.0"
Below is the streaming code which deserializes the Avro data:
import org.apache.spark.sql.{Dataset, SparkSession}
import io.confluent.kafka.schemaregistry.client.rest.RestService
import io.confluent.kafka.serializers.KafkaAvroDeserializer
import org.apache.avro.Schema
import scala.collection.JavaConverters._
object KafkaStream{
def main(args: Array[String]): Unit = {
val sparkSession = SparkSession
.builder
.master("local")
.appName("kafka")
.getOrCreate()
sparkSession.sparkContext.setLogLevel("ERROR")
import sparkSession.implicits._
case class DeserializedFromKafkaRecord(key: String, value: String)
val schemaRegistryURL = "http://127.0.0.1:8081"
val topicName = "productCollection.inventory.Product"
val subjectValueName = topicName + "-value"
//create RestService object
val restService = new RestService(schemaRegistryURL)
//.getLatestVersion returns io.confluent.kafka.schemaregistry.client.rest.entities.Schema object.
val valueRestResponseSchema = restService.getLatestVersion(subjectValueName)
//Use Avro parsing classes to get Avro Schema
val parser = new Schema.Parser
val topicValueAvroSchema: Schema = parser.parse(valueRestResponseSchema.getSchema)
//key schema is typically just string but you can do the same process for the key as the value
val keySchemaString = "\"string\""
val keySchema = parser.parse(keySchemaString)
//Create a map with the Schema registry url.
//This is the only Required configuration for Confluent's KafkaAvroDeserializer.
val props = Map("schema.registry.url" -> schemaRegistryURL)
//Declare SerDe vars before using Spark structured streaming map. Avoids non serializable class exception.
var keyDeserializer: KafkaAvroDeserializer = null
var valueDeserializer: KafkaAvroDeserializer = null
//Create structured streaming DF to read from the topic.
val rawTopicMessageDF = sparkSession.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "127.0.0.1:9092")
.option("subscribe", topicName)
.option("startingOffsets", "earliest")
.option("maxOffsetsPerTrigger", 20) //remove for prod
.load()
rawTopicMessageDF.printSchema()
//instantiate the SerDe classes if not already, then deserialize!
val deserializedTopicMessageDS = rawTopicMessageDF.map{
row =>
if (keyDeserializer == null) {
keyDeserializer = new KafkaAvroDeserializer
keyDeserializer.configure(props.asJava, true) //isKey = true
}
if (valueDeserializer == null) {
valueDeserializer = new KafkaAvroDeserializer
valueDeserializer.configure(props.asJava, false) //isKey = false
}
//Pass the Avro schema.
val deserializedKeyString = keyDeserializer.deserialize(topicName, row.getAs[Array[Byte]]("key"), keySchema).toString //topic name is actually unused in the source code, just required by the signature. Weird right?
val deserializedValueJsonString = valueDeserializer.deserialize(topicName, row.getAs[Array[Byte]]("value"), topicValueAvroSchema).toString
DeserializedFromKafkaRecord(deserializedKeyString, deserializedValueJsonString)
}
val deserializedDSOutputStream = deserializedTopicMessageDS.writeStream
.outputMode("append")
.format("console")
.option("truncate", false)
.start()
The Kafka consumer is running fine and I can see the data tailing from the oplog. However, when I run the above code I get the errors below:
Error:(70, 59) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val deserializedTopicMessageDS = rawTopicMessageDF.map{
and
Error:(70, 59) not enough arguments for method map: (implicit evidence$7: org.apache.spark.sql.Encoder[DeserializedFromKafkaRecord])org.apache.spark.sql.Dataset[DeserializedFromKafkaRecord].
Unspecified value parameter evidence$7.
val deserializedTopicMessageDS = rawTopicMessageDF.map{
Please suggest what I'm missing here.
Thanks in advance.
Just declare your case class DeserializedFromKafkaRecord outside of the main method.
I imagine that when the case class is defined inside main, Spark's implicit encoder derivation does not work properly: the implicit product encoder requires a TypeTag for the type, and the compiler cannot supply one for a class that is local to a method.
The problem can be reproduced with a simpler example (without Kafka):
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object SimpleTest {
  // declare CaseClass outside of main method
  case class CaseClass(value: Int)

  def main(args: Array[String]): Unit = {
    // when case class is declared here instead
    // of outside main, the program does not compile
    // case class CaseClass(value: Int)
    val sparkSession = SparkSession
      .builder
      .master("local")
      .appName("simpletest")
      .getOrCreate()
    import sparkSession.implicits._
    val df: DataFrame = sparkSession.sparkContext.parallelize(1 to 10).toDF()
    val ds: Dataset[CaseClass] = df.map { row =>
      CaseClass(row.getInt(0))
    }
    ds.show()
  }
}

Each query takes more time using Structured Streaming with Spark

I'm using Spark 2.3.0, Scala 2.11.8 and Kafka, and I'm trying to write all the messages from Kafka into Parquet files with Structured Streaming. However, the total time of each query that my implementation runs keeps increasing a lot (Spark stages image).
I would like to know why this happens. I tried different triggers (Continuous, 0 seconds, 1 second, 10 seconds, 10 minutes, etc.) and I always get the same behavior. My code has this structure:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Column, SparkSession}
import com.name.proto.ProtoMessages
import java.io._
import java.text.{DateFormat, SimpleDateFormat}
import java.util.Date
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.streaming.OutputMode
object StructuredStreaming {
def message_proto(value:Array[Byte]): Map[String, String] = {
try {
val dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
val impression_proto = ProtoMessages.TrackingRequest.parseFrom(value)
val json = Map(
"id_req" -> (impression_proto.getIdReq().toString),
"ts_imp_request" -> (impression_proto.getTsRequest().toString),
"is_after" -> (impression_proto.getIsAfter().toString),
"type" -> (impression_proto.getType().toString)
)
return json
}catch{
case e:Exception=>
val pw = new PrintWriter(new File("/home/data/log.log" ))
pw.write(e.toString)
pw.close()
return Map("error" -> "error")
}
}
def main(args: Array[String]){
val proto_impressions_udf = udf(message_proto _)
val spark = SparkSession.builder.appName("Structured Streaming ").getOrCreate()
//fetchOffset.numRetries, fetchOffset.retryIntervalMs
val stream = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "ip:9092")
.option("subscribe", "ssp.impressions")
.option("startingOffsets", "latest")
.option("max.poll.records", "1000000")
.option("auto.commit.interval.ms", "100000")
.option("session.timeout.ms", "10000")
.option("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
.option("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
.option("failOnDataLoss", "false")
.option("latestFirst", "true")
.load()
try{
val query = stream.select(col("value").cast("string"))
.select(proto_impressions_udf(col("value")) as "value_udf")
.select(col("value_udf")("id_req").as("id_req"), col("value_udf")("is_after").as("is_after"),
date_format(col("value_udf")("ts_request"), "yyyy").as("date").as("year"),
date_format(col("value_udf")("ts_request"), "MM").as("date").as("month"),
date_format(col("value_udf")("ts_request"), "dd").as("date").as("day"),
date_format(col("value_udf")("ts_request"), "HH").as("date").as("hour"))
val query2 = query.writeStream.format("parquet")
.option("checkpointLocation", "/home/data/impressions/checkpoint")
.option("path", "/home/data/impressions")
.outputMode(OutputMode.Append())
.partitionBy("year", "month", "day", "hour")
.trigger(Trigger.ProcessingTime("1 seconds"))
.start()
}catch{
case e:Exception=>
val pw = new PrintWriter(new File("/home/data/log.log" ))
pw.write(e.toString)
pw.close()
}
}
}
I attached other images from the Spark UI.
Your problem is related to the batches: you need to choose a trigger interval that gives each batch enough time to be processed, and that depends on your cluster's processing power. The time needed to process each batch also depends on whether the fields you receive are populated: if many fields arrive as null, the batch takes less time to process.
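To see how long each batch actually takes before tuning the trigger, a listener can log the per-batch progress. This is only a sketch: the class name BatchTimingListener is made up for illustration, and it relies on the StreamingQueryListener API that ships with Spark 2.x.

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Hypothetical helper: prints how many rows each micro-batch read
// and how long its phases took.
class BatchTimingListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    // durationMs maps phase names (for example "triggerExecution") to milliseconds.
    println(s"batch=${p.batchId} rows=${p.numInputRows} durations=${p.durationMs}")
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
}
```

Register it with spark.streams.addListener(new BatchTimingListener) before calling start(). If the batches turn out to be too large for the chosen interval, the Kafka source option maxOffsetsPerTrigger can cap how many records each batch reads; the right cap depends on the cluster and is not something this sketch can suggest.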

How to use foreachPartition in Spark 2.2 to avoid Task Serialization error

I have the following working code that uses Structured Streaming (Spark 2.2) in order to read data from Kafka (0.10).
The only issue that I cannot solve is a Task serialization problem when using kafkaProducer inside ForeachWriter.
In my old version of this code, developed for Spark 1.6, I was using foreachPartition and defining a kafkaProducer for each partition to avoid the Task serialization problem.
How can I do it in Spark 2.2?
val df: Dataset[String] = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test")
.option("startingOffsets", "latest")
.option("failOnDataLoss", "true")
.load()
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").as[(String, String)]
.map(_._2)
var mySet = spark.sparkContext.broadcast(Map(
"metadataBrokerList"->metadataBrokerList,
"outputKafkaTopic"->outputKafkaTopic,
"batchSize"->batchSize,
"lingerMS"->lingerMS))
val kafkaProducer = Utils.createProducer(mySet.value("metadataBrokerList"),
mySet.value("batchSize"),
mySet.value("lingerMS"))
val writer = new ForeachWriter[String] {
override def process(row: String): Unit = {
// val result = ...
val record = new ProducerRecord[String, String](mySet.value("outputKafkaTopic"), "1", result);
kafkaProducer.send(record)
}
override def close(errorOrNull: Throwable): Unit = {}
override def open(partitionId: Long, version: Long): Boolean = {
true
}
}
val query = df
.writeStream
.foreach(writer)
.start
query.awaitTermination()
spark.stop()
Write a named implementation of ForeachWriter and then use it. (Avoid anonymous classes that capture non-serializable objects; in your case that is the Kafka producer.)
Example: val writer = new YourForeachWriter[String]
Here is also a helpful article about Spark serialization problems: https://www.cakesolutions.net/teamblogs/demystifying-spark-serialisation-error
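A minimal sketch of such a named writer, assuming String keys and values and the standard Kafka producer client. The class name KafkaStringWriter and its constructor parameters are hypothetical. The point is that only two strings travel to the executors, while the producer itself is created in open(), once per partition, much like the old foreachPartition approach.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter

// Hypothetical named writer: brokers and topic are plain strings, so the
// instance serializes cleanly; the producer is never shipped from the driver.
class KafkaStringWriter(brokers: String, topic: String) extends ForeachWriter[String] {
  // Assigned in open(), on the executor.
  @transient private var producer: KafkaProducer[String, String] = _

  override def open(partitionId: Long, version: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", brokers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true
  }

  override def process(row: String): Unit =
    producer.send(new ProducerRecord[String, String](topic, row))

  override def close(errorOrNull: Throwable): Unit =
    if (producer != null) producer.close()
}
```

It would be used as df.writeStream.foreach(new KafkaStringWriter(metadataBrokerList, outputKafkaTopic)).start(), reusing the variable names from the question.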

Read json from Kafka and write json to other Kafka topic

I'm trying to prepare an application for Spark Streaming (Spark 2.1, Kafka 0.10).
I need to read data from the Kafka topic "input", find the correct data and write the result to the topic "output".
I can read data from Kafka using the KafkaUtils.createDirectStream method.
I converted the RDD to JSON and prepared filters:
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
val elementDstream = messages.map(v => v.value).foreachRDD { rdd =>
val PeopleDf=spark.read.schema(schema1).json(rdd)
import spark.implicits._
PeopleDf.show()
val PeopleDfFilter = PeopleDf.filter(($"value1".rlike("1"))||($"value2" === 2))
PeopleDfFilter.show()
}
I can load data from Kafka and write it "as is" back to Kafka using KafkaProducer:
messages.foreachRDD( rdd => {
rdd.foreachPartition( partition => {
val kafkaTopic = "output"
val props = new HashMap[String, Object]()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)
partition.foreach{ record: ConsumerRecord[String, String] => {
System.out.print("########################" + record.value())
val messageResult = new ProducerRecord[String, String](kafkaTopic, record.value())
producer.send(messageResult)
}}
producer.close()
})
})
However, I cannot integrate those two actions: find the proper values in the JSON and write the findings to Kafka, that is, write PeopleDfFilter in JSON format to the "output" Kafka topic.
I have a lot of input messages in Kafka, which is why I want to use foreachPartition to create the Kafka producer.
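The process is very simple, so why not use Structured Streaming all the way? (The built-in Kafka sink requires Spark 2.2 or later.)
import org.apache.spark.sql.functions.{from_json, to_json}
import spark.implicits._ // for the $"..." column syntax

spark
  // Read the data
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", inservers)
  .option("subscribe", intopic)
  .load()
  // Transform / filter
  .select(from_json($"value".cast("string"), schema).alias("value"))
  .filter(...) // Add the condition
  .select(to_json($"value").alias("value"))
  // Write back
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", outservers)
  .option("topic", outtopic) // the Kafka sink takes "topic", not "subscribe"
  .option("checkpointLocation", checkpointDir) // required by the Kafka sink; checkpointDir is a placeholder
  .start()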
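The transform step can be sketched as a small function that works on both batch and streaming DataFrames. The field types in the schema are an assumption: the question only shows that value1 is matched with rlike and value2 is compared with 2. The object name PeopleFilter is made up for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, from_json, to_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object PeopleFilter {
  // Assumed schema; adjust the types to the real messages.
  val schema: StructType = StructType(Seq(
    StructField("value1", StringType),
    StructField("value2", IntegerType)
  ))

  // Parses the `value` column as JSON, keeps the matching rows and
  // serializes them back into a JSON string column named `value`.
  def filterPeople(kafkaDf: DataFrame): DataFrame =
    kafkaDf
      .select(from_json(col("value").cast("string"), schema).alias("value"))
      .filter(col("value.value1").rlike("1") || col("value.value2") === 2)
      .select(to_json(col("value")).alias("value"))
}
```

The result has a single string column called value, which is the shape a Kafka writer expects. Rows whose JSON cannot be parsed come back as null from from_json and are dropped by the filter.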
Try using Structured Streaming for that. Even if you are on Spark 2.1, you can implement your own Kafka ForeachWriter as follows:
Kafka sink:
import java.util.Properties
// The original imported kafkashaded.org.apache.kafka.clients.producer._, a shaded
// package that only exists in some distributions (such as Databricks).
import org.apache.kafka.clients.producer._
import org.apache.spark.sql.ForeachWriter

class KafkaSink(topic: String, servers: String) extends ForeachWriter[(String, String)] {
  val kafkaProperties = new Properties()
  kafkaProperties.put("bootstrap.servers", servers)
  // Kafka expects the plain class name, so use getName rather than toString
  kafkaProperties.put("key.serializer",
    classOf[org.apache.kafka.common.serialization.StringSerializer].getName)
  kafkaProperties.put("value.serializer",
    classOf[org.apache.kafka.common.serialization.StringSerializer].getName)
  val results = new scala.collection.mutable.HashMap[String, String]
  var producer: KafkaProducer[String, String] = _

  // Called once per partition on the executor, so the producer is created there
  def open(partitionId: Long, version: Long): Boolean = {
    producer = new KafkaProducer(kafkaProperties)
    true
  }

  def process(value: (String, String)): Unit = {
    producer.send(new ProducerRecord(topic, value._1 + ":" + value._2))
  }

  def close(errorOrNull: Throwable): Unit = {
    producer.close()
  }
}
Usage:
val topic = "<topic2>"
val brokers = "<server:ip>"
val writer = new KafkaSink(topic, brokers)
val query =
streamingSelectDF
.writeStream
.foreach(writer)
.outputMode("update")
.trigger(ProcessingTime("25 seconds"))
.start()

Spark Dataset encoder for Kafka Avro decoder message

import spark.implicits._
val ds1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test")
.option("startingOffsets", "latest")
.load()
.as[KafkaMessage]
.select($"value".as[Array[Byte]])
.map(msg=>{
val byteArrayInputStream = new ByteArrayInputStream(msg)
val datumReader:DatumReader[GenericRecord] = new SpecificDatumReader[GenericRecord](messageSchema)
val dataFileReader:DataFileStream[GenericRecord] = new DataFileStream[GenericRecord](byteArrayInputStream, datumReader)
while(dataFileReader.hasNext) {
val userData1: GenericRecord = dataFileReader.next()
userData1.asInstanceOf[org.apache.avro.util.Utf8].toString
}
})
Error:
Error:(49, 9) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
.map(msg=>{
Whenever you do map or other transformations on a Dataset in Structured Streaming, an appropriate encoder for the result type must be available.
For primitive data types, implicit encoders are provided by Spark:
import spark.implicits._
Whereas for other types you need to provide one manually.
So here you can either use implicit encoders:
import scala.reflect.ClassTag
implicit def kryoEncoder[A](implicit ct: ClassTag[A]) = org.apache.spark.sql.Encoders.kryo[A](ct)
... or you can define your own encoder for the type that the map function returns.
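Another option, sketched here under assumptions, is to make the function return a type Spark already has an encoder for. In the question the lambda ends with a while loop, so its result type is Unit, and no encoder exists for that. The hypothetical helper below decodes an Avro container payload (the format DataFileStream reads) into JSON strings. If the messages are not in container format, a different reader would be needed.

```scala
import java.io.ByteArrayInputStream
import org.apache.avro.file.DataFileStream
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import scala.collection.JavaConverters._

object AvroContainerDecoder {
  // Decodes one Avro container payload into JSON strings, one per record.
  // The schema is read from the container header, so none is passed in.
  def decode(bytes: Array[Byte]): List[String] = {
    val stream = new DataFileStream[GenericRecord](
      new ByteArrayInputStream(bytes), new GenericDatumReader[GenericRecord]())
    // toList forces the iterator before the stream is closed.
    try stream.iterator().asScala.map(_.toString).toList
    finally stream.close()
  }
}
```

With spark.implicits._ in scope, a Dataset[Array[Byte]] can then call .flatMap(bytes => AvroContainerDecoder.decode(bytes)), which yields a Dataset[String] through the built-in String encoder.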