Spark Kafka Streaming multi partition CommitAsync issue - scala

I am reading a message from Kafka topic which has multiple partitions. While reading from message no issue, while Committing the offset range to Kafka, I am getting an error. I tried my level best and not able to resolve this issue.
Code
object ParallelStreamJob {
def main(args: Array[String]): Unit = {
val spark = SparkHelper.getOrCreateSparkSession()
val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
spark.sparkContext.setLogLevel("WARN")
val kafkaStream = {
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "welcome3",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("test2")
val numPartitionsOfInputTopic = 2
val streams = (1 to numPartitionsOfInputTopic) map {
_ => KafkaUtils.createDirectStream[String, String]( ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams) )
}
streams
}
// var offsetRanges = Array[OffsetRange]()
kafkaStream.foreach(rdd=> {
rdd.foreachRDD(conRec=> {
val offsetRanges = conRec.asInstanceOf[HasOffsetRanges].offsetRanges
conRec.foreach(str=> {
println(str.value())
for (o <- offsetRanges) {
println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
}
})
kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
})
})
println(" Spark parallel reader is ready !!!")
ssc.start()
ssc.awaitTermination()
}
}
Error
18/03/19 21:21:30 ERROR JobScheduler: Error running job streaming job 1521512490000 ms.0
java.lang.ClassCastException: scala.collection.immutable.Vector cannot be cast to org.apache.spark.streaming.kafka010.CanCommitOffsets
at com.cts.ignite.inventory.core.ParallelStreamJob$$anonfun$main$1$$anonfun$apply$1.apply(ParallelStreamJob.scala:48)
at com.cts.ignite.inventory.core.ParallelStreamJob$$anonfun$main$1$$anonfun$apply$1.apply(ParallelStreamJob.scala:39)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
at org.a

you can commit the offset like
stream.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
// some time later, after outputs have completed
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
in your case kafkaStream is Seq of stream. change you commit line.
Reference: https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html

change kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) line to rdd.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)

Related

Spark streaming context Kafka Consumer: Read messages of a certain byte size in batches

I have created a Spark Streaming consumer with given sample code as below,
val streamingContext = new StreamingContext(sparkSession.sparkContext, Seconds(10))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "test",
"fetch.max.bytes" -> "65536",
"max.partition.fetch.bytes" -> "8192",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean),
"sasl.jaas.config"-> "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"admin\" password=\"admin\";",
"sasl.mechanism" -> "PLAIN",
"security.protocol" -> "SASL_PLAINTEXT",
)
val topics = Array("test.topic")
val stream = KafkaUtils.createDirectStream[String, String](
streamingContext,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
stream.foreachRDD {
rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
println(offsetRanges.foreach(a => println(a.topic + ":" + a.partition + ":" + a.fromOffset + ":" + a.untilOffset + ":" + a.count())))
val df = rdd.map(a => a.value().split(",")).toDF()
val selectCols = columns.indices.map(i => $"value"(i))
var newDF = df.select(selectCols: _*).toDF(columns: _*)
newDF.write
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "topic.ouput")
.option("kafka.sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"admin\" password=\"admin\";")
.option("kafka.sasl.mechanism", "PLAIN")
.option("kafka.security.protocol", "SASL_PLAINTEXT")
.save()
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
sparkSession.catalog.clearCache()
}
streamingContext.start()
streamingContext.awaitTermination()
I want messages to be read in bytes of certain size. I put the required configs fetch.max.bytes to 64 KB. To test it I'm leaving my producer to push enough data to run for a configurable amount of time while the consumer is down. Then I'm restarting the consumer to fetch messages. But I'm getting messages much more than expected bytes.
Here is a sample output of the program:
test.topic:6:1345075:4163058:2817983
test.topic:0:1339456:4144190:2804734
test.topic:3:1354266:4189336:2835070
test.topic:7:1353542:4186148:2832606
test.topic:5:1355140:4189071:2833931
test.topic:2:1351162:4173375:2822213
test.topic:1:1352801:4184073:2831272
test.topic:4:1348558:4166749:2818191
()
test.topic:6:4163058:4163058:0
test.topic:0:4144190:4144190:0
test.topic:3:4189336:4189336:0
test.topic:7:4186148:4186148:0
test.topic:5:4189071:4189071:0
test.topic:2:4173375:4173375:0
test.topic:1:4184073:4184073:0
test.topic:4:4166749:4166749:0
Also, I'm not sure why in the second output we are getting 0 count for the offsets.

How to turn this simple Spark Streaming code into a Multi threaded one?

I am learning Kafka in Scala. The attached code is just a word count implementation using Kafka and Spark Streaming.
How do I have a separate consumer execution per partition whilst streaming? Please help!
Here is my code:
class ConsumerM(topics: String, bootstrap_server: String, group_name: String) {
Logger.getLogger("org").setLevel(Level.ERROR)
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
.setMaster("local[*]")
.set("spark.executor.memory","1g")
val ssc = new StreamingContext(sparkConf, Seconds(1))
val topicsSet = topics.split(",")
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> bootstrap_server,
ConsumerConfig.GROUP_ID_CONFIG -> group_name,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
"auto.offset.reset" ->"earliest")
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
Assuming your input topic has multiple partitions, then additionally setting local[*] means you'll have one Spark executor per CPU core, and at least one partition can be consumed by each

How to disable KafkaUtils override my kafka topic in spark job

Part of my spark job:
val kafkaParams = Map[String, Object](
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "mytopic",
)
try {
val inputStream = KafkaUtils.createDirectStream(ssc,PreferConsistent, Subscribe[String, String](Array(inputTopic), kafkaParams))
val processedStream = inputStream.map(record => record.value)
processedStream.print()
ssc.start
ssc.awaitTermination
} finally {
...
}
I got the following log:
2018-10-23 14:35:26 WARN KafkaUtils:66 - overriding executor group.id to spark-executor-mytopic
How to disable the topic override?
Any comments welcomed. Thanks

If Kafka Consumer fails (Spark Job), how to fetch the last offset committed by Kafka Consumer. (Scala)

Before I give any details, please note, I am NOT asking how to fetch the latest offset from Console using kafka-run-class.sh kafka.tools.ConsumerOffsetChecker.
I am trying to make a kafka consumer (kafka version 0.10) in Spark (2.3.1) using Scala (2.11.8), which will be fault tolerant. By fault tolerant, I mean, if for some reason the kafka consumer dies and restarts, it should resume consuming the messages from the last offset.
For achieving this, I commit the Kafka offset once it has been consumed using the below code,
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "group_101",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean), /*because messages successfully polled by the consumer may not yet have resulted in a Spark output operation*/
"session.timeout.ms" -> (30000: java.lang.Integer),
"heartbeat.interval.ms" -> (3000: java.lang.Integer)
)
val topic = Array("topic_1")
val offsets = Map(new org.apache.kafka.common.TopicPartition("kafka_cdc_1", 0) -> 2L) /*Edit: Added code to fetch offset*/
val kstream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topic, kafkaParams, offsets) /*Edit: Added offset*/
)
kstream.foreachRDD{ rdd =>
val offsetRange = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
if(!rdd.isEmpty()) {
val rawRdd = rdd.map(record =>
(record.key(),record.value())).map(_._2).toDS()
val df = spark.read.schema(tabSchema).json(rawRdd)
df.createOrReplaceTempView("temp_tab")
df.write.insertInto("hive_table")
}
kstream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRange) /*Doing Async Commit Here */
}
I have tried many thing to fetch the latest Offset for the given topic, but could not get it to work.
Can anyone help me out with the scala code to achieve this, please?
Edit:
In the above code, I am trying to fetch the last offset by using
val offsets = Map(new org.apache.kafka.common.TopicPartition("kafka_cdc_1", 0) -> 2L) /*Edit: Added code to fetch offset*/
but the offset fetched by the above code is 0, not the latest. Is there anyway to fetch the latest offset?
Found the solution to the above issue. Here it is. Hope it helps someone in need.
Language: Scala, Spark Job
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "group_101",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean), /*because messages successfully polled by the consumer may not yet have resulted in a Spark output operation*/
"session.timeout.ms" -> (30000: java.lang.Integer),
"heartbeat.interval.ms" -> (3000: java.lang.Integer)
)
import java.util.Properties
//create a new properties object with Kafaka Parameters as done previously. Note: Both needs to be present. We will use the proprty object just to fetch the last offset
val kafka_props = new Properties()
kafka_props.put("bootstrap.servers", "localhost:9092")
kafka_props.put("key.deserializer",classOf[StringDeserializer])
kafka_props.put("value.deserializer",classOf[StringDeserializer])
kafka_props.put("group.id","group_101")
kafka_props.put("auto.offset.reset","latest")
kafka_props.put("enable.auto.commit",(false: java.lang.Boolean))
kafka_props.put("session.timeout.ms",(30000: java.lang.Integer))
kafka_props.put("heartbeat.interval.ms",(3000: java.lang.Integer))
val topic = Array("topic_1")
/*val offsets = Map(new org.apache.kafka.common.TopicPartition("topic_1", 0) -> 2L) Edit: Added code to fetch offset*/
val topicAndPartition = new org.apache.kafka.common.TopicPartition("topic_1", 0) //Using 0 as the partition because this topic does not have any partitions
val consumer = new KafkaConsumer[String,String](kafka_props) //create a 2nd consumer to fetch last offset
import java.util
consumer.subscribe(util.Arrays.asList("topic_1")) //Subscribe to the 2nd consumer. Without this step, the offsetAndMetadata can't be fetched.
val offsetAndMetadata = consumer.committed(topicAndPartition) //Find last committed offset for the given topicAndPartition
val endOffset = offsetAndMetadata.offset().toLong //fetch the last committed offset from offsetAndMetadata and cast it to Long data type.
val fetch_from_offset = Map(new org.apache.kafka.common.TopicPartition("topic_1", 0) -> endOffset) // create a Map with data type (TopicPartition, Long)
val kstream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topic, kafkaParams, fetch_from_offset) //Pass the offset Map of datatype (TopicPartition, Long) created eariler
)
kstream.foreachRDD{ rdd =>
val offsetRange = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
if(!rdd.isEmpty()) {
val rawRdd = rdd.map(record =>
(record.key(),record.value())).map(_._2).toDS()
val df = spark.read.schema(tabSchema).json(rawRdd)
df.createOrReplaceTempView("temp_tab")
df.write.insertInto("hive_table")
}
kstream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRange) /*Doing Async offset Commit Here */
}

Spark streaming for Kafka: How to get the topic name from Kafka consumer DStream?

I have set up the Spark-Kafka Consumer in Scala that receives messages from multiple topics:
val properties = readProperties()
val streamConf = new SparkConf().setMaster("local[*]").setAppName("Kafka-Stream")
val ssc = new StreamingContext(streamConf, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> properties.getProperty("broker_connection_str"),
"zookeeper.connect" -> properties.getProperty("zookeeper_connection_str"),
"group.id" -> properties.getProperty("group_id"),
"auto.offset.reset" -> properties.getProperty("offset_reset")
)
// Kafka integration with receiver
val msgStream = KafkaUtils.createStream[Array[Byte], String, DefaultDecoder, StringDecoder](
ssc, kafkaParams, Map(properties.getProperty("topic1") -> 1,
properties.getProperty("topic2") -> 2,
properties.getProperty("topic3") -> 3),
StorageLevel.MEMORY_ONLY_SER).map(_._2)
I need to develop corresponding action code for messages (which will be in JSON format) from each topic.
I referred to the following question, but the answer in it didn't help me:
get topic from Kafka message in spark
So, is there any method on the received DStream that can be used to fetch topic name along with the message to determine what action should take place?
Any help on this would be greatly appreciated. Thank you.
See the code below.
You can get topic name and message by foreachRDD, map operation on DStream.
msgStream.foreachRDD(rdd => {
val pairRdd = rdd.map(i => (i.topic(), i.value()))
})
The code below is an example source of createDirectStream that I am using.
val ssc = new StreamingContext(configLoader.sparkConfig, Seconds(conf.getInt(Conf.KAFKA_PULL_INTERVAL)))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> conf.getString(Conf.KAFKA_BOOTSTRAP_SERVERS),
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> conf.getString(Conf.KAFKA_CONSUMER_GID),
"auto.offset.reset" -> conf.getString(Conf.KAFKA_AUTO_OFFSET_RESET),
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics: Array[String] = conf.getString(Conf.KAFKA_TOPICS).split(",")
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)