Read Avro records from Kafka using Spark DStreams - Scala

I'm using Spark 2.3 and trying to stream data from Kafka using DStreams (we use DStreams for a specific use case that we could not achieve with Structured Streaming).
The Kafka topic contains data in Avro format. I want to read that data using Spark DStreams and interpret it as a JSON string.
I'm trying to do something like this:
val kafkaParams: Map[String, Object] = Map(
"bootstrap.servers" -> "kafka-servers",
"key.serializer" -> classOf[StringSerializer],
"value.serializer" -> classOf[StringSerializer],
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[org.apache.spark.sql.avro.AvroDeserializer],
"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> (false: java.lang.Boolean),
"group.id" -> "group1"
)
val kafkaDstream = KafkaUtils.createDirectStream(
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
)
val processedStream = kafkaDstream.map(record => (record.key(), record.value()))
processedStream.foreachRDD(
someRdd =>
someRdd.foreach(
paths=> {
println(paths._2)
}
)
)
But I don't see the data getting processed. I get the error message below, which I think is because AvroDeserializer is only available from Spark 2.4.0 onwards.
Caused by: org.apache.kafka.common.KafkaException: Could not instantiate class org.apache.spark.sql.avro.AvroDeserializer Does it have a public no-argument constructor?
Any idea how I can achieve this?
Thank you.

Spark's Avro deserializer is not a Kafka deserializer (by the way, you cannot have duplicate keys in your config map). That class is for Spark SQL/Structured Streaming, not for (deprecated) Spark Streaming.
It is unclear how your producer serialized the data, but if it uses the Confluent Schema Registry, you'll need Confluent's own KafkaAvroDeserializer class, and you would then use [String, GenericRecord] as your stream types. Data is never automatically converted to JSON, and using String as the stream type will fail when using an Avro deserializer.
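A minimal sketch of that wiring, assuming the producer used Confluent's Avro serializer and that the Confluent `kafka-avro-serializer` artifact is on the classpath. The function name `avroAsJson` and the `brokers` and `registryUrl` parameters are placeholders for illustration, not part of the original question:

```scala
import io.confluent.kafka.serializers.KafkaAvroDeserializer
import org.apache.avro.generic.GenericRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

// Assumption: values were written with Confluent's Avro serializer and a Schema Registry.
// `brokers` and `registryUrl` are placeholders supplied by the caller.
def avroAsJson(ssc: StreamingContext, topics: Seq[String],
               brokers: String, registryUrl: String): DStream[String] = {
  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> brokers,
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[KafkaAvroDeserializer],
    "schema.registry.url" -> registryUrl,
    "auto.offset.reset" -> "earliest",
    "enable.auto.commit" -> (false: java.lang.Boolean),
    "group.id" -> "group1"
  )
  val stream = KafkaUtils.createDirectStream[String, GenericRecord](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, GenericRecord](topics, kafkaParams)
  )
  // The usual GenericData.Record implementation renders itself as JSON text in toString.
  // Convert to String right away, since ConsumerRecord itself is not serializable.
  stream.map(record => record.value().toString)
}
```

The resulting `DStream[String]` can then be printed or parsed like any other JSON text. If the producer did not use the Schema Registry, a different deserializer would be needed.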

Related

Spark Streaming - Join on multiple kafka stream operation is slow

I have 3 Kafka streams with 600k+ records each, and Spark Streaming takes more than 10 minutes to process simple joins between the streams.
Spark Cluster config:
This is how I'm reading the Kafka streams into temp views in Spark (Scala):
spark.read
.format("kafka")
.option("kafka.bootstrap.servers", "KAFKASERVER")
.option("subscribe", TOPIC1)
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest").load()
.selectExpr("CAST(value AS STRING) as json")
.select( from_json($"json", schema=SCHEMA1).as("data"))
.select($"COL1", $"COL2")
.createOrReplaceTempView("TABLE1")
I join the 3 tables using Spark SQL:
select COL1, COL2 from TABLE1
JOIN TABLE2 ON TABLE1.PK = TABLE2.PK
JOIN TABLE3 ON TABLE2.PK = TABLE3.PK
Execution of Job:
Am I missing some Spark configuration that I should look into?
I ran into the same problem. I found that a join between streams needs more memory than I imagined, and the problem disappeared when I increased the cores per executor.
Unfortunately, there was no test data or expected result that I could experiment with, so I cannot give an exact answer.
@Asteroid's comment is valid: as we can see, the number of tasks for each stage is 1. Normally a Kafka stream uses a receiver to consume the topic, and each receiver creates only one task. One approach is to use multiple receivers, split the partitions, or increase your resources (number of cores) to increase parallelism.
If this still does not work, another way is to use the Kafka API createDirectStream. According to the documentation https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/streaming/kafka/KafkaUtils.html, it creates an input stream that pulls messages directly from the Kafka brokers without using any receiver.
I put together some preliminary sample code for creating a direct stream below. You may want to study it and adapt it to your own needs.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.{Encoders, Row}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "KAFKASERVER",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "use_a_separate_group_id_for_each_stream",
  // Kafka consumer setting; startingOffsets/endingOffsets are Structured Streaming
  // source options and are not understood by the Kafka consumer
  "auto.offset.reset" -> "earliest"
)
val topics = Array(TOPIC1)
val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)
val schema = StructType(Seq(StructField("data", StringType, nullable = true)))
// A var, because each batch is appended to the accumulated DataFrame
var df = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
val dstream = stream.map(_.value())
dstream.foreachRDD { rdd =>
  // Parse this batch's JSON strings and append them to the running DataFrame
  val tdf = spark.read.schema(schema).json(spark.createDataset(rdd)(Encoders.STRING))
  df = df.union(tdf)
  df.createOrReplaceTempView("TABLE1")
}
Some related materials:
https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-2/ (Scroll down to Kafka Consumer Code portion. The other section is irrelevant)
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html (Spark Doc for create direct stream)
Good luck!

How to consume from a different Kafka topic in each batch of a Spark Streaming job?

I am pretty sure that there is no simple way of doing this, but here is my use case:
I have a Spark Streaming job (version 2.1.0) with a 5 second duration for each micro batch.
My goal is to consume data from one different topic at every micro-batch interval, out of a total of 250 Kafka topics. You can take the code below as a simple example:
val groupId:String = "first_group"
val kafka_servers:String = "datanode1:9092,datanode2:9092,datanode3:9092"
val ss:SparkSession = SparkSession.builder().config("spark.streaming.unpersist","true").appName("ConsumerStream_test").getOrCreate()
val ssc:StreamingContext= new StreamingContext(ss.sparkContext,Duration(5000))
val kafka_parameters:Map[String,Object]=Map(
"bootstrap.servers" -> kafka_servers,
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[ByteArrayDeserializer],
"heartbeat.interval.ms" -> (1000:Integer),
"max.poll.interval.ms" -> (100:Integer),
"enable.auto.commit" -> (false: java.lang.Boolean),
"autoOffsetReset" -> OffsetResetStrategy.EARLIEST,
//"connections.max.idle.ms" -> (5000:Integer),
"group.id" -> groupId
)
val r = scala.util.Random
val kafka_list_one_topic=List("topic_"+ r.nextInt(250))
val consumer:DStream[ConsumerRecord[String,Array[Byte]]] = KafkaUtils.createDirectStream(ssc, LocationStrategies.PreferBrokers, ConsumerStrategies.
Subscribe[String, Array[Byte]](kafka_list_one_topic , kafka_parameters))
consumer.foreachRDD( eachRDD => {
// DOING SOMETHING WITH THE DATA...
})
ssc.start()
ssc.awaitTermination()
The issue with this approach is that Spark runs the initial code (everything before the foreachRDD command) only once, in order to create the Kafka consumer DStream. In the following micro-batches, it only runs the foreachRDD statement.
As an example, let's say that r.nextInt(250) returned 40. The Spark Streaming job will connect to topic_40 and process its data. But in the next micro batches, it will still connect to topic_40, and ignore all the commands before the foreachRDD statement.
I guess this is expected, since the code before the foreachRDD statement runs only on the Spark driver.
My question is, is there a way that I can do this without having to relaunch a Spark application every 5 seconds?
Thank you.
My approach would be really simple: if you want it to be really random and don't care about any other consequences, make kafka_list_one_topic a mutable variable and change it in the streaming code.
val r = scala.util.Random
var kafka_list_one_topic=List("topic_"+ r.nextInt(250))
val consumer:DStream[ConsumerRecord[String,Array[Byte]]] =
KafkaUtils.createDirectStream(ssc, LocationStrategies.PreferBrokers,
ConsumerStrategies.
Subscribe[String, Array[Byte]](kafka_list_one_topic , kafka_parameters))
consumer.foreachRDD( eachRDD => {
// DOING SOMETHING WITH THE DATA...
kafka_list_one_topic=List("topic_"+ r.nextInt(250))
})
ssc.start()
ssc.awaitTermination()
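One thing worth verifying before relying on this: `Subscribe` receives the topic list once, when the stream is created, so reassigning the `var` afterwards may have no effect on what the consumer is subscribed to. The plain-Scala sketch below illustrates only the general language mechanism. `FakeSubscription` is a hypothetical stand-in and not part of the Spark or Kafka API:

```scala
// Hypothetical stand-in for a consumer strategy: it keeps the topic list
// it was constructed with. Not a Spark or Kafka class.
final class FakeSubscription(topics: List[String]) {
  def currentTopics: List[String] = topics
}

var topicList = List("topic_40")
val subscription = new FakeSubscription(topicList)

// Reassigning the var points it at a new immutable List;
// the object built earlier still holds the original one.
topicList = List("topic_7")

println(subscription.currentTopics) // List(topic_40)
println(topicList)                  // List(topic_7)
```

Whether the real stream behaves the same way is best confirmed by logging the topic of each consumed record over several batches.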

Spark streaming checkpoint

I am reading messages from Kafka using Spark Kafka direct streaming. I want zero message loss: after Spark restarts, it has to read the missed messages from Kafka. I am using a checkpoint to save all read offsets, so that next time Spark starts reading from the stored offsets. This is my understanding.
I used the code below. I stopped Spark and pushed a few messages to Kafka. After the restart, Spark does not read the missed messages; it only reads the latest messages from Kafka. How can I read the missed messages from Kafka?
val ssc = new StreamingContext(spark.sparkContext, Milliseconds(6000))
ssc.checkpoint("C:/cp")
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "use_a_separate_group_id_for_each_stream",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("test")
val ssc = new StreamingContext(spark.sparkContext, Milliseconds(50))
val msgStream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
Note: The application log shows auto.offset.reset as none instead of latest. Why?
WARN KafkaUtils: overriding auto.offset.reset to none for executor
SBT
scalaVersion := "2.11.8"
val sparkVersion = "2.2.0"
val connectorVersion = "2.0.7"
val kafka_stream_version = "1.6.3"
Windows : 7
If you want to read the missed messages, try committing offsets instead of relying on the checkpoint.
Note that Spark can't read old messages with this property:
"auto.offset.reset" -> "latest"
Try this:
val kafkaParams = Map[String, Object](
//...
"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> (false: java.lang.Boolean)
//...
)
stream.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
//Your processing goes here
//Then commit after completing your process.
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
Hope this helps.
I would suggest not relying on checkpointing; instead, you can use an external data store to save the offsets of your processed Kafka messages. Please follow the link for more insight.
https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/
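The idea of an external offset store can be sketched in plain Scala. Everything here is an assumption for illustration: `OffsetStore`, `InMemoryOffsetStore`, and `TopicPartitionKey` are hypothetical names, and a real job would back the store with a durable system so that the offsets survive a restart.

```scala
import scala.collection.mutable

// Hypothetical key type; in a real job this would be Kafka's TopicPartition.
final case class TopicPartitionKey(topic: String, partition: Int)

// Hypothetical storage interface for the next offset to read per partition.
trait OffsetStore {
  def save(offsets: Map[TopicPartitionKey, Long]): Unit
  def load(topic: String): Map[TopicPartitionKey, Long]
}

// In-memory stand-in; a durable store would be needed to survive restarts.
final class InMemoryOffsetStore extends OffsetStore {
  private val state = mutable.Map.empty[TopicPartitionKey, Long]

  def save(offsets: Map[TopicPartitionKey, Long]): Unit = {
    state ++= offsets
  }

  def load(topic: String): Map[TopicPartitionKey, Long] =
    state.filter { case (key, _) => key.topic == topic }.toMap
}

val store = new InMemoryOffsetStore

// After a batch is fully processed, record the "until" offset of each partition.
store.save(Map(
  TopicPartitionKey("test", 0) -> 120L,
  TopicPartitionKey("test", 1) -> 98L
))

// On restart, read the saved positions and resume from them.
val resumeFrom = store.load("test")
println(resumeFrom(TopicPartitionKey("test", 0))) // 120
```

With the direct stream, the loaded positions would be passed as the starting offsets, for example through the `Subscribe` overload that accepts a map of offsets per partition.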

Query Cassandra table for every Kafka Message

I am trying to query a Cassandra table for every single Kafka message.
Below is the code that I have been working on:
def main(args: Array[String]) {
val spark = SparkSession
.builder()
.master("local[*]")
.appName("Spark SQL basic example")
.config("spark.cassandra.connection.host", "localhost")
.config("spark.cassandra.connection.port", "9042")
.getOrCreate()
val topicsSet = List("Test").toSet
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "12345",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))
val lines = messages.map(_.value)
val lines_myobjects = lines.map(line =>
new Gson().fromJson(line, classOf[myClass]) // The myClass is a simple case class which extends serializable
//This changes every single message into an object
)
Now things get complicated: I cannot work out how to query the Cassandra table for the rows relevant to each Kafka message. Every single Kafka message object has a getter method (for example getuserId).
I have tried multiple ways to get around this. For instance:
val transformed_data = lines_myobjects.map(myobject => {
val forest = spark.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "mytable", "keyspace" -> "mydb"))
.load()
.filter("userid='" + myobject.getuserId + "'")
})
I have also tried ssc.cassandraTable, which did not work either.
The main goal is to get all the rows from the database where the userid matches the userid that comes from the Kafka message.
One thing I would like to mention: even though loading or querying the Cassandra database every time is not efficient, the data in Cassandra changes all the time.
You can't call spark.read or ssc.cassandraTable inside .map(...), because that would mean trying to create a new RDD for each message. It doesn't work like that.
Please consider the following options:
1 - If you can fetch the required data with one or two CQL queries, try using CassandraConnector inside .mapPartitions(...). Something like this:
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql._
// Instantiate CassandraConnector once here, on the driver
val connector = CassandraConnector(spark.sparkContext.getConf)
val transformed_data = lines_myobjects.mapPartitions(it => {
  connector.withSessionDo { session =>
    // toList forces the queries to run while the session is still open
    it.map(myobject => session.execute("CQL QUERY TO GET YOUR DATA HERE", myobject.getuserId)).toList.iterator
  }
})
2 - Otherwise (if you select by primary/partition key) consider .joinWithCassandraTable. Something like this:
import com.datastax.spark.connector._
val mytableRDD = sc.cassandraTable("mydb", "mytable")
val transformed_data = lines_myobjects
.map(myobject => {
Tuple1(myobject.getuserId) // you need to wrap ids to a tuple to do join with Cassandra
})
.joinWithCassandraTable("mydb", "mytable")
// process results here
I would approach this a different way.
Route the data that flows into Cassandra through Kafka (and from Kafka, send it to Cassandra with the Kafka Connect sink).
With your data in Kafka, you can then join between your streams of data, whether in Spark, with Kafka's Streams API, or with KSQL.
Both Kafka Streams and KSQL support the kind of stream-table join that you're doing here. You can see it in action with KSQL here and here.
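For readers unfamiliar with the term, a stream-table join enriches each incoming event with the latest known row for its key. The plain-Scala sketch below illustrates only the semantics. It is not Kafka Streams or KSQL code, and `UserRow`, `Message`, and the sample data are hypothetical:

```scala
// Hypothetical types for illustration only.
final case class UserRow(userId: String, plan: String)
final case class Message(userId: String, payload: String)

// The "table": latest row per key, updated as change events arrive.
var table = Map.empty[String, UserRow]
def upsert(row: UserRow): Unit = {
  table = table.updated(row.userId, row)
}

// The join: look up the current table state for each stream event.
def joinEvent(msg: Message): Option[(Message, UserRow)] =
  table.get(msg.userId).map(row => (msg, row))

upsert(UserRow("u1", "free"))
val first = joinEvent(Message("u1", "login"))   // sees plan "free"
upsert(UserRow("u1", "paid"))                   // the table changes
val second = joinEvent(Message("u1", "click"))  // sees plan "paid"
val missing = joinEvent(Message("u2", "login")) // no row for u2
```

Because the lookup always uses the current table state, a changing Cassandra data set is handled without re-reading the whole table for each message.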

Extract the time stamp from kafka messages in spark streaming?

I am trying to read from a Kafka source. I want to extract the timestamp from each received message to use in Spark Structured Streaming.
kafka(version 0.10.0.0)
spark streaming(version 2.0.1)
spark.read
.format("kafka")
.option("kafka.bootstrap.servers", "your.server.com:9092")
.option("subscribe", "your-topic")
.load()
.select($"timestamp", $"value")
The "timestamp" field is what you are looking for; its type is java.sql.Timestamp. Make sure that you are connecting to a Kafka 0.10 server, as earlier versions have no timestamp.
The full list of fields is described here - http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries
I'd suggest a couple of things:
Suppose you create a stream via the latest Kafka streaming API (Kafka 0.10).
E.g. you use the dependency: "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.1"
Then you create a stream, according to the docs above:
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "broker1:9092,broker2:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[ByteArrayDeserializer],
"group.id" -> "spark-streaming-test",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean))
val sparkConf = new SparkConf()
// suppose you have 60 second window
val ssc = new StreamingContext(sparkConf, Seconds(60))
ssc.checkpoint("checkpoint")
val stream = KafkaUtils.createDirectStream(ssc, PreferConsistent,
Subscribe[String, Array[Byte]](topics, kafkaParams))
Your stream will be a DStream of ConsumerRecord[String,Array[Byte]], and you can get the timestamp, key, and value as simply as:
stream.map { record => (record.timestamp(), record.key(), record.value()) }
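`record.timestamp()` returns a `Long` holding milliseconds since the Unix epoch. If a typed value is preferred, it can be converted with the standard library. A small sketch follows; the helper names `toInstant` and `toSqlTimestamp` and the sample value are illustrative and not from the original answer:

```scala
import java.time.Instant
import java.sql.Timestamp

// Convert a Kafka record timestamp (epoch milliseconds) to java.time.Instant.
def toInstant(epochMillis: Long): Instant = Instant.ofEpochMilli(epochMillis)

// Same value as java.sql.Timestamp, the type of the "timestamp" column in Spark SQL.
def toSqlTimestamp(epochMillis: Long): Timestamp = new Timestamp(epochMillis)

val sample = 1500000000000L // illustrative value
println(toInstant(sample))  // 2017-07-14T02:40:00Z
```

In the stream this could be applied as `stream.map(r => (toInstant(r.timestamp()), r.key(), r.value()))`.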
Hope that helps.