Returning values from kafka consumer - scala

I am working on a Scala application that uses Kafka, and I need to consume messages from a Kafka topic. Since I am writing a test case, I need to get the record and run assertions on it so that my test case passes. I am using the following code to consume Kafka messages:
import java.util.{Collections, Properties}
import java.util.regex.Pattern
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

object KafkaConsumerSubscribeApp extends App {
  val props: Properties = new Properties()
  props.put("group.id", "test")
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("enable.auto.commit", "true")
  props.put("auto.commit.interval.ms", "1000")

  val consumer = new KafkaConsumer[String, String](props)
  val topics = List("topic_text")
  try {
    consumer.subscribe(topics.asJava)
    while (true) {
      val records = consumer.poll(10)
      for (record <- records.asScala) {
        println("Topic: " + record.topic() +
          ", Key: " + record.key() +
          ", Value: " + record.value() +
          ", Offset: " + record.offset() +
          ", Partition: " + record.partition())
      }
    }
  } catch {
    case e: Exception => e.printStackTrace()
  } finally {
    consumer.close()
  }
}
There are two problems I am facing with this code. First, IntelliJ warns that the poll method is deprecated. How can I modify this code to avoid the deprecation? Second, I want this method to return the message it gets from the Kafka topic; that message is in record.value(). Since while(true) is used, this is an endless loop that keeps listening for messages from the topic. How can I return record.value() from this method so that I can use the data I got from the topic in other methods?

If you want to return the values, you have to modify the code as follows. To remove the deprecation warning on poll, you just need to pass a Duration; for more details, refer to the KafkaConsumer.poll(Duration) documentation.
def readFromKafka(/* arguments */): List[String] = {
  // other code
  consumer.subscribe(util.Collections.singletonList(topic))
  consumer.poll(Duration.ofMillis(5000)).asScala.toList.map(_.value())
}
To keep on reading from Kafka, you have to keep calling the function; for that, you may use a scheduler.
I hope it helps.
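For completeness, here is a minimal self-contained sketch of that idea. The broker address, topic name, group id, 5-second poll timeout and auto.offset.reset setting are only illustrative assumptions, not part of the original answer:

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

object KafkaTestReader {

  private def newConsumer(): KafkaConsumer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")   // assumed broker
    props.put("group.id", "test")                      // assumed group id
    props.put("auto.offset.reset", "earliest")         // read from the beginning in a test
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    new KafkaConsumer[String, String](props)
  }

  /** Polls once and returns the values found, instead of looping forever. */
  def readFromKafka(topic: String): List[String] = {
    val consumer = newConsumer()
    try {
      consumer.subscribe(Collections.singletonList(topic))
      consumer.poll(Duration.ofMillis(5000)).asScala.toList.map(_.value())
    } finally {
      consumer.close()
    }
  }
}

// In a test you can then assert on the returned values, e.g.:
// val values = KafkaTestReader.readFromKafka("topic_text")
// assert(values.contains("expected message"))

Calling readFromKafka repeatedly (for example from a scheduled task) gives you the keep-reading behaviour mentioned above without an endless loop inside the method.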

Related

One to One instant messaging using Kafka

I'm using Scala and Kafka to create a topic-based pub-sub architecture.
My question is: how can I handle the one-to-one messaging part of my application using Kafka topics?
This is my producer class:
class Producer(topic: String, key: String, brokers: String, message: String) {
  val producer = new KafkaProducer[String, String](configuration)

  private def configuration: Properties = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
    props.put(ProducerConfig.ACKS_CONFIG, "all")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getCanonicalName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getCanonicalName)
    props
  }

  def sendMessages(): Unit = {
    val record = new ProducerRecord[String, String](topic, key, message)
    producer.send(record)
    producer.close()
  }
}
And this is my consumer class:
class Consumer(brokers: String, topic: String, groupId: String) {
  val consumer = new KafkaConsumer[String, String](configuration)
  consumer.subscribe(util.Arrays.asList(topic))

  private def configuration: Properties = {
    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getCanonicalName)
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getCanonicalName)
    props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId)
    //props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true)
    props
  }

  def receiveMessages(): Unit = {
    while (true) {
      consumer.poll(Duration.ofSeconds(0)).forEach(record => println(s"Received message: $record"))
    }
  }
}
I also have an auth service that takes care of everything related to authenticating via JWT tokens.
I am confused about how to send messages to specific users. I thought about creating a "Message" class, but I got lost when it comes to how to send these specific users their messages and how to partition them in Kafka for later usage:
class Message {
  def sendMessage(sender_id: String, receiver_id: String, content: String): Unit = {
    val newMessage = new Producer(brokers = KAFKA_BROKER, key = sender_id + " to " + receiver_id, topic = "topic_1", message = content)
    newMessage.sendMessages()
  }

  def loadMessage(): Unit = {
    //
  }
}
My thought was to specify a custom key for all messages belonging to the same conversation, but I couldn't find the right way to retrieve these messages later on, because my consumer returns everything contained in that topic no matter what the key is. Meaning, all the users will eventually get all the messages. I know my modeling seems messy, but I couldn't find the right way to do it; I'm also somewhat confused about the usage of the group_id in the consumer.
Could someone show me the right way to achieve what I'm trying to do here, please?
couldn't find the right way to retrieve these messages later on ... consumer returns everything contained in that topic no matter what the key is
You would need to .assign the Consumer instance to a specific partition, not use .subscribe, which reads all partitions. Or you'd use specific topics for each conversation.
But then you need unique partitions/topics for every conversation that will exist. In a regular chat application where users create/remove rooms randomly, that will not scale for Kafka.
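For illustration, here is a minimal sketch of the assign approach; the broker address, topic name and partition number are only assumptions to show the API:

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import scala.collection.JavaConverters._

object ConversationReader {
  def readConversation(brokers: String, topic: String, partition: Int): List[String] = {
    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getCanonicalName)
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getCanonicalName)

    val consumer = new KafkaConsumer[String, String](props)
    try {
      // assign (not subscribe) pins the consumer to one partition of the topic;
      // no group.id is needed here because we never commit offsets
      val tp = new TopicPartition(topic, partition)
      consumer.assign(Collections.singletonList(tp))
      consumer.seekToBeginning(Collections.singletonList(tp)) // re-read the whole conversation
      consumer.poll(Duration.ofSeconds(5)).asScala.toList.map(_.value())
    } finally {
      consumer.close()
    }
  }
}

This only works if a conversation maps to a known partition (or its own topic), which is exactly the scaling problem described above.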
Ultimately, I'd suggest writing your data to somewhere other than Kafka that you can actually query and index on a "conversationId" and/or user ids, rather than trying to forward those events directly from Kafka into your "chat" application.

Kafka Ensure At Least Once

First project with Kafka, trying to prove that an event will get processed at least once. So far, not seeing evidence that processing is retried.
Structure of dummy app is simple: subscribe, process, publish, commit; if exception, abort transaction and hope it gets retried. I am logging every message.
I expect to see (1) "process messageX" (2) "error for messageX" (3) "process messageX". Instead, I see processing continue beyond messageX, i.e. it does not get re-processed.
What I see is: (1) "process messageX" (2) "error for messageX" (3) "process someOtherMessage".
Using Kafka 2.7.0, Scala 2.12.
What am I missing? Showing relevant parts of dummy app below.
I also tried by removing the producer from the code (and all references to it).
UPDATE 1: I managed to get records re-processed by using the offsets with consumer.seek(), i.e. sending the consumer back to the start of the batch of records. Not sure why simply NOT reaching consumer.commitSync() (because of an exception) does not do this already.
import com.myco.somepackage.{MyEvent, KafkaConfigTxn}
import org.apache.kafka.clients.consumer.{ConsumerRecords, KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.common.{KafkaException, TopicPartition}
import org.slf4j.LoggerFactory

import java.util
import scala.collection.JavaConverters._
import scala.util.control.NonFatal

// Prove that a message can be re-processed if there is an exception
object TopicDrainApp {
  private val logger = LoggerFactory.getLogger(this.getClass)
  private val subTopic = "input.topic"
  private val pubTopic = "output.topic"

  val producer = new KafkaProducer[String, String](KafkaConfigTxn.producerProps)
  producer.initTransactions()
  val consumer = new KafkaConsumer[String, String](KafkaConfigTxn.consumerProps)

  private var lastEventMillis = System.currentTimeMillis
  private val pollIntervalMillis = 1000
  private val pollDuration = java.time.Duration.ofMillis(pollIntervalMillis)

  def main(args: Array[String]): Unit = {
    subscribe(subTopic)
  }

  def subscribe(subTopic: String): Unit = {
    consumer.subscribe(util.Arrays.asList(subTopic))
    while (System.currentTimeMillis - lastEventMillis < 5000L) {
      try {
        val records: ConsumerRecords[String, String] = consumer.poll(pollDuration)
        records.asScala.foreach { record =>
          try {
            lastEventMillis = System.currentTimeMillis
            val event = MyEvent.deserialize(record.value())
            logger.info("ReceivedMyEvent:" + record.value())
            producer.beginTransaction()
            simulateProcessing(event) // [not shown] throw exception to test re-processing
            producer.flush()
            val offsetsToCommit = getOffsetsToCommit(records)
            //consumer.commitSync() // tried this; does not work
            //producer.sendOffsetsToTransaction(offsetsToCommit, "group1") // tried this; does not work
            producer.commitTransaction()
          } catch {
            case e: KafkaException =>
              logger.error(s"rollback ${record.value()}", e)
              producer.abortTransaction()
          }
        }
      } catch {
        case NonFatal(e) => logger.error(e.getMessage, e)
      }
    }
  }

  private def getOffsetsToCommit(records: ConsumerRecords[String, String]): util.Map[TopicPartition, OffsetAndMetadata] = {
    records.partitions().asScala.map { partition =>
      val partitionedRecords = records.records(partition)
      val offset = partitionedRecords.get(partitionedRecords.size - 1).offset
      (partition, new OffsetAndMetadata(offset + 1))
    }.toMap.asJava
  }
}
import java.util.Properties
import org.apache.kafka.clients.CommonClientConfigs
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.clients.producer.ProducerConfig
import scala.collection.JavaConverters._

object KafkaConfigTxn {
  // Only relevant properties are shown
  def commonProperties: Properties = {
    val props = new Properties()
    props.put(CommonClientConfigs.CLIENT_ID_CONFIG, "...")
    props.put(CommonClientConfigs.GROUP_ID_CONFIG, "...")
    props
  }

  def producerProps: Properties = {
    val props = new Properties()
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true") // "enable.idempotence"
    props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "...") // "transactional.id"
    props.put(ProducerConfig.ACKS_CONFIG, "all")
    props.put(ProducerConfig.RETRIES_CONFIG, "3")
    commonProperties.asScala.foreach { case (k, v) => props.put(k, v) }
    props
  }

  def consumerProps: Properties = {
    val props = new Properties()
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")
    props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed") // "isolation.level"
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
    commonProperties.asScala.foreach { case (k, v) => props.put(k, v) }
    props
  }
}
According to the reference I gave you, you need to use sendOffsetsToTransaction in the process, but again your consumer won't get the messages of the aborted transaction, as you are reading only committed transactions.
Transactions were introduced in order to allow exactly-once processing between Kafka and Kafka; that said, Kafka has supported at-least-once and at-most-once delivery semantics from day one.
To get at-least-once behavior, you disable auto commit and commit only when processing has finished successfully. That way, if an exception occurred before the commit, the records are re-read from the last committed offset.
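As an illustration of that manual-commit pattern, here is a minimal sketch; the broker address, topic and group id are assumptions, and the seek-back in the error handler reflects what the question's UPDATE 1 found (a live consumer otherwise continues from its in-memory position rather than the last committed offset):

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import scala.collection.JavaConverters._

object AtLeastOnceDemo {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "at-least-once-demo")      // assumed group id
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")         // manual commit
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("input.topic")) // assumed topic

    while (true) {
      val records = consumer.poll(Duration.ofSeconds(1))
      try {
        records.asScala.foreach(record => process(record.value())) // may throw
        consumer.commitSync() // commit only after the whole batch succeeded
      } catch {
        case e: Exception =>
          e.printStackTrace()
          // nothing was committed; seek back so this same consumer re-reads the failed
          // batch (otherwise re-delivery only happens after a restart or rebalance)
          records.partitions().asScala.foreach { tp =>
            consumer.seek(tp, records.records(tp).get(0).offset())
          }
      }
    }
  }

  private def process(value: String): Unit = println(s"processing $value")
}

Moving commitSync() to before the processing loop would give the at-most-once behavior described next.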
To get at-most-once behavior, you commit before processing starts. That way, if an exception happens, the next time you call poll() you get new messages (but you lose the ones that failed).
Exactly-once is the hardest to achieve in plain Java (not talking about the Spring framework, which makes everything easier): it involves saving offsets to an external DB (usually wherever your processing is done) and reading from there on startup/rebalance.
For a transaction usage example in Java, you might read this excellent guide by Baeldung:
https://www.baeldung.com/kafka-exactly-once
I figured out the correct combination of method calls (subscribe, beginTransaction, process, commit / abortTransaction, etc.) for a demo app. The core of the code is:
def readProcessWrite(subTopic: String, pubTopic: String): Int = {
  var lastEventMillis = System.currentTimeMillis
  val consumer = createConsumer(subTopic)
  val producer = createProducer()
  val groupMetadata = consumer.groupMetadata()
  var numRecords = 0

  while (System.currentTimeMillis - lastEventMillis < 10000L) {
    try {
      val records: ConsumerRecords[String, String] = consumer.poll(pollDuration)
      val offsetsToCommit = getOffsetsToCommit(records)
      // println(s">>> PollRecords: ${records.count()}")
      records.asScala.foreach { record =>
        val currentOffset = record.offset()
        try {
          numRecords += 1
          lastEventMillis = System.currentTimeMillis
          println(s">>> Topic: $subTopic, ReceivedEvent: offset=${record.offset()}, key=${record.key()}, value=${record.value()}")
          producer.beginTransaction()
          val eventOut = simulateProcessing(record.value()) // may throw
          publish(producer, pubTopic, eventOut)
          producer.sendOffsetsToTransaction(offsetsToCommit, groupMetadata)
          consumer.commitSync()
          producer.commitTransaction()
        } catch {
          case e: KafkaException =>
            println(s"---------- rollback ${record.value()}", e)
            producer.abortTransaction()
            offsetsToCommit.forEach { case (topicPartition, _) =>
              consumer.seek(topicPartition, currentOffset)
            }
        }
      }
    } catch {
      case NonFatal(e) => logger.error(e.getMessage, e)
    }
  }

  consumer.close()
  producer.close()
  numRecords
}
// Consumer created with props.put("max.poll.records", "1")
I was able to prove that this will process each event exactly once, even when simulateProcessing() throws an exception. To be precise: when processing works fine, each event is processed exactly once. If there is an exception, the event is re-processed until success. In my case, there is no real reason for the exceptions, so re-processing will always end in success.

Spark kafka producer Introducing duplicate records during kafka ingestion

I have written a Spark Kafka producer which pulls messages from Hive and pushes them into Kafka. Most of the records (messages) are getting duplicated when we ingest into Kafka, though I do not have any duplicates before pushing into Kafka. I have added the configurations related to exactly-once semantics, making the Kafka producer idempotent.
Below is the code snippet I am using for the Kafka producer:
import java.util.{Properties, UUID}
import org.apache.kafka.clients.CommonClientConfigs
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.KafkaException
import org.apache.kafka.common.errors.{AuthorizationException, OutOfOrderSequenceException, ProducerFencedException}
import org.apache.log4j.LogManager
import org.apache.spark.sql.DataFrame

object KafkaWriter {
  lazy val log = LogManager.getLogger("KafkaWriter")

  private def writeKafkaWithoutRepartition(df: DataFrame, topic: String, noOfPartitions: Int,
                                           kafkaBootstrapServer: String): Unit = {
    log.info("Inside Method KafkaWriter::writeKafkaWithoutRepartition no of partitions =" + noOfPartitions)
    df.foreachPartition(
      iter => {
        val properties = getKafkaProducerPropertiesForBulkLoad(kafkaBootstrapServer)
        properties.setProperty(ProducerConfig.TRANSACTIONAL_ID_CONFIG, UUID.randomUUID().toString + "-" + System.currentTimeMillis().toString)
        log.info("Inside Method writeKafkaWithoutRepartition:: inside writekafka :: kafka properties ::" + properties)
        val kafkaProducer = new KafkaProducer[String, String](properties)
        try {
          log.info("kafka producer property enable.idempotence ::" + properties.getProperty(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG))
          kafkaProducer.initTransactions
          kafkaProducer.beginTransaction
          log.info("Inside Method writeKafkaWithoutRepartition:: inside each partition kafka transactions started")
          iter.foreach(row => {
            log.info("Inside Method writeKafkaWithoutRepartition:: inside each iterator record kafka transactions started")
            kafkaProducer.send(new ProducerRecord(topic, row.getAs[String]("value")))
          })
          kafkaProducer.commitTransaction
          log.info("Inside Method writeKafkaWithoutRepartition:: kafka transactions completed")
        } catch {
          case e @ (_: ProducerFencedException) =>
            // We can't recover from these exceptions, so our only option is to close the producer and exit.
            log.error("Exception occured while sending records to kafka ::" + e.getMessage)
            kafkaProducer.close
          case e: KafkaException =>
            // For all other exceptions, just abort the transaction and try again.
            log.error("Exception occured while sending records to kafka ::" + e.getMessage)
            kafkaProducer.abortTransaction
          case ex: Exception =>
            // For all other exceptions, just abort the transaction and try again.
            log.error("Exception occured while sending records to kafka ::" + ex.getMessage)
            kafkaProducer.abortTransaction
        } finally {
          kafkaProducer.close
        }
      })
  }

  def writeWithoutRepartition(df: DataFrame, topic: String, noOfPartitions: Int, kafkaBootstrapServer: String): Unit = {
    var repartitionedDF = df.selectExpr("to_json(struct(*)) AS value")
    log.info("Inside KafkaWriter::writeWithoutRepartition ")
    writeKafkaWithoutRepartition(repartitionedDF, topic, noOfPartitions, kafkaBootstrapServer)
  }

  def getKafkaProducerPropertiesForBulkLoad(kafkaBootstrapServer: String): Properties = {
    val properties = new Properties
    properties.setProperty(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG, kafkaBootstrapServer)
    properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    properties.setProperty(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")
    // properties.setProperty(CommonClientConfigs.REQUEST_TIMEOUT_MS_CONFIG, "400000")
    // properties.setProperty(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, "300000")
    properties.put(ProducerConfig.RETRIES_CONFIG, "1000")
    properties.put(ProducerConfig.ACKS_CONFIG, "all")
    properties.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1")
    properties
  }
}
I set isolation.level to read_committed on the Kafka consumer end.
I tried setting min.insync.replicas to 2 (in my opinion this property might not play an important role, but I still tried).
Spark version: 2.3.1
Kafka client version: 2.2.1
I am also using transactions while producing messages into Kafka (init, begin and commit transactions for each message). I am ingesting around 100 million records at a time; I split the data into smaller chunks, say 100 million divided into 1 million at once, before ingesting into Kafka.
I tried using Structured Streaming, still no luck:
df.selectExpr(s""" '${key}' as key """, "to_json(struct(*)) AS value").write.format("kafka").options(getKafkaProducerProperties(topic)).save
I am not sure if I am missing any configurations on the Kafka producer, broker, or consumer end.
Not sure if I can include any.
Thanks in advance.

Cannot see message while sinking kafka stream and cannot see print message in flink 1.2

My goal is to use Kafka to read in a string in JSON format, apply a filter to the string, and then sink the message out (still in JSON string format).
For testing purpose, my input string message looks like:
{"a":1,"b":2}
And my implementation is:
def main(args: Array[String]): Unit = {
  // parse input arguments
  val params = ParameterTool.fromArgs(args)
  if (params.getNumberOfParameters < 4) {
    println("Missing parameters!\n"
      + "Usage: Kafka --input-topic <topic> --output-topic <topic> "
      + "--bootstrap.servers <kafka brokers> "
      + "--zookeeper.connect <zk quorum> --group.id <some id> [--prefix <prefix>]")
    return
  }

  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.getConfig.disableSysoutLogging
  env.getConfig.setRestartStrategy(RestartStrategies.fixedDelayRestart(4, 10000))
  // create a checkpoint every 5 seconds
  env.enableCheckpointing(5000)
  // make parameters available in the web interface
  env.getConfig.setGlobalJobParameters(params)

  // create a Kafka streaming source consumer for Kafka 0.10.x
  val kafkaConsumer = new FlinkKafkaConsumer010(
    params.getRequired("input-topic"),
    new JSONKeyValueDeserializationSchema(false),
    params.getProperties)
  val messageStream = env.addSource(kafkaConsumer)

  val filteredStream: DataStream[ObjectNode] = messageStream.filter(node => node.get("a").asText.equals("1")
    && node.get("b").asText.equals("2"))

  messageStream.print()
  // Refer to: https://stackoverflow.com/documentation/apache-flink/9004/how-to-define-a-custom-deserialization-schema#t=201708080802319255857
  filteredStream.addSink(new FlinkKafkaProducer010[ObjectNode](
    params.getRequired("output-topic"),
    new SerializationSchema[ObjectNode] {
      override def serialize(element: ObjectNode): Array[Byte] = element.toString.getBytes()
    }, params.getProperties
  ))

  env.execute("Kafka 0.10 Example")
}
As can be seen, I want to print the message stream to the console and sink the filtered messages to Kafka. However, I can see neither of them.
The interesting thing is, if I change the schema of the KafkaConsumer from JSONKeyValueDeserializationSchema to SimpleStringSchema, I can see messageStream printed to the console. Code is shown below:
val kafkaConsumer = new FlinkKafkaConsumer010(
  params.getRequired("input-topic"),
  new SimpleStringSchema,
  params.getProperties)
val messageStream = env.addSource(kafkaConsumer)
messageStream.print()
This makes me think that if I use JSONKeyValueDeserializationSchema, my input message is actually not accepted by Kafka. But this seems weird and quite different from the online documentation (https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/kafka.html).
Hope someone can help me out!
The JSONKeyValueDeserializationSchema() expects a message key with every Kafka message, and I am assuming that no key is supplied when the JSON messages are produced and sent over the Kafka topic.
To solve the issue, try using JSONDeserializationSchema(), which expects only the message value and creates an object node based on the message received.
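A minimal sketch of that change, as a drop-in replacement for the consumer creation in the question's main method (this assumes the Flink 1.2 / Kafka 0.10 connector from the question, where JSONDeserializationSchema lives in org.apache.flink.streaming.util.serialization):

// value-only JSON deserialization: each Kafka message value becomes an ObjectNode,
// so no message key is required
import org.apache.flink.streaming.util.serialization.JSONDeserializationSchema

val kafkaConsumer = new FlinkKafkaConsumer010[ObjectNode](
  params.getRequired("input-topic"),
  new JSONDeserializationSchema(),
  params.getProperties)
val messageStream = env.addSource(kafkaConsumer)

// with the value-only schema the JSON fields are at the top level, so the
// filter can keep using node.get("a") / node.get("b") directly
val filteredStream: DataStream[ObjectNode] = messageStream.filter(node =>
  node.get("a").asText.equals("1") && node.get("b").asText.equals("2"))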

Kafka scala Consumer code - to print consumed records

I am creating a simple Kafka consumer as below, using this URL: https://gist.github.com/akhil/6dfda8a04e33eff91a20.
In that link, the word "asScala" is used to print the consumed records, but it is not recognized in my project. Kindly tell me how to iterate over ConsumerRecords[String, String], which is the poll() method's return type.
import java.util
import java.util.Properties
import org.apache.kafka.clients.consumer.{ConsumerRecords, KafkaConsumer}

object KafkaConsumerEx extends App {
  val topic_name = "newtopic55"
  val consumer_group = "KafkaConsumerBatch"

  val prot = new Properties()
  prot.put("bootstrap.servers", "localhost:9092")
  prot.put("group.id", consumer_group)
  prot.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  prot.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

  val kfk_consumer = new KafkaConsumer[String, String](prot)
  kfk_consumer.subscribe(util.Collections.singleton(topic_name))
  println("here")

  while (true) {
    val consumer_record: ConsumerRecords[String, String] = kfk_consumer.poll(100)
    println("records count : " + consumer_record.count())
    println("records partitions: " + consumer_record.partitions())
    consumer_record.iterator().
  }
}
Thanks in adv.
You can easily do that
for (record <- consumer_record.iterator()) {
  println(s"Here's your $record")
}
Remember to add this import:
import scala.collection.JavaConversions._
Adding another answer, since scala.collection.JavaConversions has been deprecated, as mentioned here.
As for this question, the code could look like this:
import scala.collection.JavaConverters._

for (record <- consumer_record.iterator().asScala) {
  println(s"Here's your $record")
}
while (true) {
  val consumer_records = kfk_consumer.poll(100)
  val record_iter = consumer_records.iterator()
  while (record_iter.hasNext()) {
    val record = record_iter.next()
    println("records partitions: " + record.partition() +
      ", records_data: " + record.value())
  }
}