I have an application that should send a finite number of messages to Kafka and then quit. For some reason, the Kafka connection stays up even if I close the producer. My implementation (in Scala) is more or less:
object Kafka {
private val props = new Properties()
props.put("compression.codec", DefaultCompressionCodec.codec.toString)
props.put("producer.type", "sync")
props.put("metadata.broker.list", "localhost:9092")
props.put("batch.num.messages", "200")
props.put("message.send.max.retries", "3")
props.put("request.required.acks", "-1")
props.put("client.id", "myclient")
private val producer = new Producer[Array[Byte], Array[Byte]](new ProducerConfig(props))
private def encode(msg: Message) = new KeyedMessage("topic", msg.id.getBytes, write(msg).getBytes)
def send(msg: Message) = Try(producer.send(encode(msg)))
def close() = producer.close()
}
Here Message is a simple case class, and how I convert it to byte array is not really relevant.
The messages do arrive, but when I eventually call Kafka.close(), the application does not exit, and the connection does not seem to be released.
Is there a way to explicitly ask Kafka to terminate the connection?
def close() = producer.close()
This only defines a method called "close" that calls producer.close(); it does not run it.
I see no evidence that your code actually closes the producer.
You just need to call:
producer.close
I'm using Scala and Kafka to create a topic-based pub-sub architecture.
My question is: how can I handle the one-to-one messaging part of my application using Kafka topics?
This is my producer class:
class Producer(topic: String, key: String, brokers: String, message: String) {
val producer = new KafkaProducer[String, String](configuration)
private def configuration: Properties = {
val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
props.put(ProducerConfig.ACKS_CONFIG, "all")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getCanonicalName)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getCanonicalName)
props
}
def sendMessages(): Unit = {
val record = new ProducerRecord[String, String](topic, key, message)
producer.send(record)
producer.close()
}
}
And this is my consumer class:
class Consumer(brokers: String, topic: String, groupId: String) {
val consumer = new KafkaConsumer[String, String](configuration)
consumer.subscribe(util.Arrays.asList(topic))
private def configuration: Properties = {
val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getCanonicalName)
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getCanonicalName)
props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId)
//props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true)
props
}
def receiveMessages(): Unit = {
while (true) {
consumer.poll(Duration.ofSeconds(0)).forEach(record => println(s"Received message: $record"))
}
}
}
I also have an auth service that takes care of everything related to authenticating via JWT tokens.
I am confused about how to address messages to specific users. I thought about creating a "Messages" class, but I got lost on how to send messages to these specific users and how to partition the messages in Kafka for later use:
class Message {
def sendMessage(sender_id: String, receiver_id: String, content: String): Unit ={
val newMessage = new Producer(brokers = KAFKA_BROKER,key =sender_id + " to " + receiver_id, topic = "topic_1", message = content)
newMessage.sendMessages()
}
def loadMessage(): Unit ={
//
}
}
My thought was to specify a custom key for all messages belonging to the same conversation but I couldn't find the right way to retrieve these messages later on as my consumer returns everything contained in that topic no matter what the key is. Meaning, all the users will eventually get all the messages. I know my modeling seems messy but I couldn't find the right way to do it, I'm also kinda confused when it comes to the usage of the group_id in the consumer.
Could someone tell me the right way to achieve what I'm trying to do here, please?
couldn't find the right way to retrieve these messages later on ... consumer returns everything contained in that topic no matter what the key is
You would need to .assign the Consumer instance to a specific partition, not use .subscribe, which reads all partitions. Or you'd use specific topics for each conversation.
But then you need unique partitions/topics for every conversation that will exist. In a regular chat application where users create/remove rooms randomly, that will not scale for Kafka.
Ultimately, I'd suggest writing your data somewhere other than Kafka, somewhere you can actually query and index on a "conversationId" and/or user IDs, rather than trying to forward those events directly from Kafka into your "chat" application.
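As a rough illustration of that last suggestion, here is a minimal sketch of a queryable store keyed on a conversation id. The names ChatMessage, Conversations and ConversationStore are hypothetical, and the in-memory Map only stands in for a real database; the assumption is that the Kafka consumer would call save() for each record it reads.

```scala
// Hypothetical names throughout: none of this is part of Kafka or the question's code.
final case class ChatMessage(senderId: String, receiverId: String, content: String)

object Conversations {
  // Order-independent id: "alice" -> "bob" and "bob" -> "alice" share one conversation.
  // Assumes user ids never contain the ':' separator.
  def conversationId(userA: String, userB: String): String =
    Seq(userA, userB).sorted.mkString(":")
}

// In-memory stand-in for a database table indexed on the conversation id.
final class ConversationStore {
  private var byConversation = Map.empty[String, Vector[ChatMessage]]

  // Intended to be called from the consumer loop for every record received.
  def save(msg: ChatMessage): Unit = {
    val id = Conversations.conversationId(msg.senderId, msg.receiverId)
    val existing = byConversation.getOrElse(id, Vector.empty)
    byConversation = byConversation.updated(id, existing :+ msg)
  }

  // The query the chat application needs: one conversation, in arrival order.
  def load(userA: String, userB: String): Vector[ChatMessage] =
    byConversation.getOrElse(Conversations.conversationId(userA, userB), Vector.empty)
}
```

With this, load("bob", "alice") returns both directions of the same conversation and nothing addressed to other users. The ':' separator is an assumption and would need escaping if user ids can contain it.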
First project with Kafka, trying to prove that an event will get processed at least once. So far, not seeing evidence that processing is retried.
Structure of dummy app is simple: subscribe, process, publish, commit; if exception, abort transaction and hope it gets retried. I am logging every message.
I expect to see (1) "process messageX" (2) "error for messageX" (3) "process messageX". Instead, I see processing continue beyond messageX, i.e. it does not get re-processed.
What I see is: (1) "process messageX" (2) "error for messageX" (3) "process someOtherMessage".
Using Kafka 2.7.0, Scala 2.12.
What am I missing? Showing relevant parts of dummy app below.
I also tried by removing the producer from the code (and all references to it).
UPDATE 1: I managed to get records re-processed by using the offsets with consumer.seek(), i.e. sending the consumer back to the start of the batch of records. Not sure why simply NOT reaching consumer.commitSync() (because of an exception) does not do this already.
import com.myco.somepackage.{MyEvent, KafkaConfigTxn}
import org.apache.kafka.clients.consumer.{ConsumerRecords, KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.common.{KafkaException, TopicPartition}
import org.slf4j.LoggerFactory
import java.util
import scala.collection.JavaConverters._
import scala.util.control.NonFatal
// Prove that a message can be re-processed if there is an exception
object TopicDrainApp {
private val logger = LoggerFactory.getLogger(this.getClass)
private val subTopic = "input.topic"
private val pubTopic = "output.topic"
val producer = new KafkaProducer[String, String](KafkaConfigTxn.producerProps)
producer.initTransactions()
val consumer = new KafkaConsumer[String, String](KafkaConfigTxn.consumerProps)
private var lastEventMillis = System.currentTimeMillis
private val pollIntervalMillis = 1000
private val pollDuration = java.time.Duration.ofMillis(pollIntervalMillis)
def main(args: Array[String]): Unit = {
subscribe(subTopic)
}
def subscribe(subTopic: String): Unit = {
consumer.subscribe(util.Arrays.asList(subTopic))
while (System.currentTimeMillis - lastEventMillis < 5000L) {
try {
val records: ConsumerRecords[String, String] = consumer.poll(pollDuration)
records.asScala.foreach { record =>
try {
lastEventMillis = System.currentTimeMillis
val event = MyEvent.deserialize(record.value())
logger.info("ReceivedMyEvent:" + record.value())
producer.beginTransaction()
simulateProcessing(event) // [not shown] throw exception to test re-processing
producer.flush()
val offsetsToCommit = getOffsetsToCommit(records)
//consumer.commitSync() // tried this; does not work
//producer.sendOffsetsToTransaction(offsetsToCommit, "group1") // tried this; does not work
producer.commitTransaction()
} catch {
case e: KafkaException => logger.error(s"rollback ${record.value()}", e)
producer.abortTransaction()
}
}
} catch {
case NonFatal(e) => logger.error(e.getMessage, e)
}
}
}
private def getOffsetsToCommit(records: ConsumerRecords[String, String]): util.Map[TopicPartition, OffsetAndMetadata] = {
records.partitions().asScala.map { partition =>
val partitionedRecords = records.records(partition)
val offset = partitionedRecords.get(partitionedRecords.size - 1).offset
(partition, new OffsetAndMetadata(offset + 1))
}.toMap.asJava
}
}
object KafkaConfigTxn {
// Only relevant properties are shown
def commonProperties: Properties = {
val props = new Properties()
props.put(CommonClientConfigs.CLIENT_ID_CONFIG, "...")
props.put(CommonClientConfigs.GROUP_ID_CONFIG, "...")
props
}
def producerProps: Properties = {
val props = new Properties()
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true") // "enable.idempotence"
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "...") // "transactional.id"
props.put(ProducerConfig.ACKS_CONFIG, "all")
props.put(ProducerConfig.RETRIES_CONFIG, "3")
commonProperties.asScala.foreach { case (k, v) => props.put(k, v) }
props
}
def consumerProps: Properties = {
val props = new Properties()
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")
props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed") // "isolation.level"
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
commonProperties.asScala.foreach { case (k, v) => props.put(k, v) }
props
}
}
According to the reference I gave you, you need to use sendOffsetsToTransaction in the process. But again, your consumer won't get the messages of the aborted transaction, because you are reading only committed transactions.
Transactions were introduced to allow exactly-once processing from Kafka to Kafka. That said, Kafka has supported at-least-once and at-most-once delivery semantics from day one.
To get at-least-once behavior, disable auto commit and commit only when processing has finished successfully. That way, if an exception occurs before the commit, the records are read again from the last committed offset once the consumer restarts or rebalances; within a running consumer, poll() continues from its in-memory position, so you have to seek() back explicitly.
To get at-most-once behavior, commit before processing starts. That way, if an exception happens, the next call to poll() returns new messages (but the messages that were being processed are lost).
Exactly-once is the hardest to achieve in plain Java (leaving aside the Spring framework, which makes everything easier): it involves saving offsets to an external database (usually the one where your processing results are stored) and reading them from there on startup/rebalance.
For a transaction usage example in Java, you might read this excellent guide by Baeldung:
https://www.baeldung.com/kafka-exactly-once
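The difference between committing after and committing before processing can be sketched without a broker. The toy model below is a simulation under stated assumptions (a Vector as the partition, an Int as the committed offset, hypothetical names throughout) and not Kafka client code; it only shows which records survive a failure followed by a restart.

```scala
// Toy model of a single partition. Nothing here talks to Kafka.
object DeliverySemantics {
  // Runs one consumer "session" starting at the committed offset. Processing the
  // record at index failAt throws, which ends the session, much like a crash.
  // Returns the records processed successfully and the committed offset afterwards.
  def runOnce(log: Vector[String], committed: Int, commitFirst: Boolean,
              failAt: Option[Int]): (Vector[String], Int) = {
    var offset = committed
    var lastCommitted = committed
    var processed = Vector.empty[String]
    try {
      while (offset < log.length) {
        if (commitFirst) lastCommitted = offset + 1   // at-most-once: commit, then process
        if (failAt.contains(offset)) throw new RuntimeException(s"failed at $offset")
        processed = processed :+ log(offset)
        if (!commitFirst) lastCommitted = offset + 1  // at-least-once: process, then commit
        offset += 1
      }
    } catch {
      case _: RuntimeException => ()                  // session ends; a restart follows
    }
    (processed, lastCommitted)
  }

  // The first session fails at failAt; the restarted session resumes from the commit.
  def withRestart(log: Vector[String], commitFirst: Boolean, failAt: Int): Vector[String] = {
    val (first, committed) = runOnce(log, 0, commitFirst, Some(failAt))
    val (second, _) = runOnce(log, committed, commitFirst, None)
    first ++ second
  }
}
```

With the log Vector("a", "b", "c") and a failure at offset 1, withRestart with commitFirst = false yields a, b, c, while commitFirst = true yields a, c: the record in flight is lost. In real at-least-once processing, a failure after side effects but before the commit would additionally cause a duplicate.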
Figured out the correct combination of method calls (subscribe, beginTransaction, process, commit / abortTransaction, etc.), for a demo app. The core of the code is
def readProcessWrite(subTopic: String, pubTopic: String): Int = {
var lastEventMillis = System.currentTimeMillis
val consumer = createConsumer(subTopic)
val producer = createProducer()
val groupMetadata = consumer.groupMetadata()
var numRecords = 0
while (System.currentTimeMillis - lastEventMillis < 10000L) {
try {
val records: ConsumerRecords[String, String] = consumer.poll(pollDuration)
val offsetsToCommit = getOffsetsToCommit(records)
// println(s">>> PollRecords: ${records.count()}")
records.asScala.foreach { record =>
val currentOffset = record.offset()
try {
numRecords += 1
lastEventMillis = System.currentTimeMillis
println(s">>> Topic: $subTopic, ReceivedEvent: offset=${record.offset()}, key=${record.key()}, value=${record.value()}")
producer.beginTransaction()
val eventOut = simulateProcessing(record.value()) // may throw
publish(producer, pubTopic, eventOut)
producer.sendOffsetsToTransaction(offsetsToCommit, groupMetadata)
consumer.commitSync()
producer.commitTransaction()
} catch {
case e: KafkaException => println(s"---------- rollback ${record.value()}", e)
producer.abortTransaction()
offsetsToCommit.forEach { case (topicPartition, _) =>
consumer.seek(topicPartition, currentOffset)
}
}
}
} catch {
case NonFatal(e) => logger.error(e.getMessage, e)
}
}
consumer.close()
producer.close()
numRecords
}
// Consumer created with props.put("max.poll.records", "1")
I was able to prove that this will process each event exactly once, even when simulateProcessing() throws an exception. To be precise: when processing works fine, each event is processed exactly once. If there is an exception, the event is re-processed until success. In my case, there is no real reason for the exceptions, so re-processing will always end in success.
I have a simple committable source for a Kafka stream wrapped in RestartSource. It works fine in the happy path, but if I deliberately sever the connection to the Kafka cluster, it throws a connection exception from the underlying Kafka client and reports Kafka Consumer Shut Down. I expected it to restart the stream after ~150 seconds, but it doesn't. Is my understanding/usage of RestartSource incorrect in the code below?
val atomicControl = new AtomicReference[Consumer.Control](NoopControl)
val restartablekafkaSourceWithFlow = {
RestartSource.withBackoff(30.seconds, 120.seconds, 0.2) {
() => {
Consumer.committableSource(consumerSettings.withClientId("clientId"), Subscriptions.topics(Set("someTopic")))
.mapMaterializedValue(c => atomicControl.set(c))
.via(someFlow)
.via(httpFlow)
}
}
}
val committerSink: Sink[(Any, ConsumerMessage.CommittableOffset), Future[Done]] = Committer.sinkWithOffsetContext(CommitterSettings(actorSystem))
val runnableGraph = restartablekafkaSourceWithFlow.toMat(committerSink)(Keep.both)
val control = runnableGraph.mapMaterializedValue(x => Consumer.DrainingControl.apply(atomicControl.get, x._2)).run()
Maybe you are getting an error outside of RestartSource.
You can add recover to see the error, and/or create a decider like the one below and use it in runnableGraph.
private val decider: Supervision.Decider = { e =>
logger.error("Unhandled exception in stream.", e)
Supervision.Resume
}
runnableGraph.withAttributes(supervisionStrategy(decider))
My goal is to use Kafka to read in a string in JSON format, filter it, and then sink the message out (still as a JSON string).
For testing purposes, my input message looks like:
{"a":1,"b":2}
And my implementation is:
def main(args: Array[String]): Unit = {
// parse input arguments
val params = ParameterTool.fromArgs(args)
if (params.getNumberOfParameters < 4) {
println("Missing parameters!\n"
+ "Usage: Kafka --input-topic <topic> --output-topic <topic> "
+ "--bootstrap.servers <kafka brokers> "
+ "--zookeeper.connect <zk quorum> --group.id <some id> [--prefix <prefix>]")
return
}
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.getConfig.disableSysoutLogging
env.getConfig.setRestartStrategy(RestartStrategies.fixedDelayRestart(4, 10000))
// create a checkpoint every 5 seconds
env.enableCheckpointing(5000)
// make parameters available in the web interface
env.getConfig.setGlobalJobParameters(params)
// create a Kafka streaming source consumer for Kafka 0.10.x
val kafkaConsumer = new FlinkKafkaConsumer010(
params.getRequired("input-topic"),
new JSONKeyValueDeserializationSchema(false),
params.getProperties)
val messageStream = env.addSource(kafkaConsumer)
val filteredStream: DataStream[ObjectNode] = messageStream.filter(node => node.get("a").asText.equals("1")
&& node.get("b").asText.equals("2"))
messageStream.print()
// Refer to: https://stackoverflow.com/documentation/apache-flink/9004/how-to-define-a-custom-deserialization-schema#t=201708080802319255857
filteredStream.addSink(new FlinkKafkaProducer010[ObjectNode](
params.getRequired("output-topic"),
new SerializationSchema[ObjectNode] {
override def serialize(element: ObjectNode): Array[Byte] = element.toString.getBytes()
}, params.getProperties
))
env.execute("Kafka 0.10 Example")
}
As can be seen, I want to print message stream to the console and sink the filtered message to kafka. However, I can see neither of them.
The interesting thing is, if I modify the schema of KafkaConsumer from JSONKeyValueDeserializationSchema to SimpleStringSchema, I can see messageStream print to the console. Code as shown below:
val kafkaConsumer = new FlinkKafkaConsumer010(
params.getRequired("input-topic"),
new SimpleStringSchema,
params.getProperties)
val messageStream = env.addSource(kafkaConsumer)
messageStream.print()
This makes me think that when I use JSONKeyValueDeserializationSchema, my input message is not actually accepted by Kafka. But this seems odd and quite different from the online documentation (https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/kafka.html).
Hope someone can help me out!
JSONKeyValueDeserializationSchema() expects a message key with each Kafka message, and I am assuming that no key is supplied when the JSON messages are produced and sent to the Kafka topic.
To solve the issue, try JSONDeserializationSchema(), which expects only the message value and creates an ObjectNode from it.
I am trying to implement a setup where I have multiple web browsers open a websocket connection to my akka-http server in order to read all messages posted to a kafka topic.
So the stream of messages should go this way:
kafka topic -> akka-http -> websocket connection 1
                         -> websocket connection 2
                         -> websocket connection 3
For now I have created a path for the websocket:
val route: Route =
path("ws") {
handleWebSocketMessages(notificationWs)
}
Then I have created a consumer for my kafka topic:
val consumerSettings = ConsumerSettings(system,
new ByteArrayDeserializer, new StringDeserializer)
.withBootstrapServers("localhost:9092")
.withGroupId("group1")
.withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
val source = Consumer
.plainSource(consumerSettings, Subscriptions.topics("topic1"))
And then finally I want to connect this source to the websocket in handleWebSocketMessages
def handleWebSocketMessages: Flow[Message, Message, Any] =
Flow[Message].mapConcat {
case tm: TextMessage =>
TextMessage(source)::Nil
case bm: BinaryMessage =>
// ignore binary messages but drain content to avoid the stream being clogged
bm.dataStream.runWith(Sink.ignore)
Nil
}
Here is the error I get when I try to use source in the TextMessage:
Error:(77, 9) overloaded method value apply with alternatives:
(textStream: akka.stream.scaladsl.Source[String,Any])akka.http.scaladsl.model.ws.TextMessage
(text: String)akka.http.scaladsl.model.ws.TextMessage.Strict
cannot be applied to (akka.stream.scaladsl.Source[org.apache.kafka.clients.consumer.ConsumerRecord[Array[Byte],String],akka.kafka.scaladsl.Consumer.Control])
TextMessage(source)::Nil
I think I'm making numerous mistakes along the way but I would say that the most blocking part is the handleWebSocketMessages.
The first thing to understand is that source is of type Source[ConsumerRecord[K, V], Control].
So it's not something you can pass as an argument to TextMessage.
Now, let's take the websocket's point of view:
An outgoing message is built for each message in the Kafka source. The message will be a TextMessage from a String transformation of the Kafka message.
For each incoming message, just println() it
So, the Flow can be seen as two components: the Source & the Sink.
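// Needs akka.Done and scala.concurrent.Future in scope.
// Incoming websocket messages are only printed; Sink.foreach materializes a Future[Done].
val incomingMessages: Sink[Message, Future[Done]] =
  Sink.foreach(println(_))
// plainSource emits ConsumerRecord values directly, so the payload is consumerRecord.value.
val outgoingMessages: Source[Message, Consumer.Control] =
  source
    .map { consumerRecord => TextMessage(consumerRecord.value) }
val handleWebSocketMessages: Flow[Message, Message, Any] =
  Flow.fromSinkAndSource(incomingMessages, outgoingMessages)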
Hope it helps.