I've got a very basic skeleton Scala application (with Akka, Camel and ActiveMQ) where I want to publish onto an ActiveMQ queue as quickly as possible, but then only consume from that queue at a particular rate (eg. 1 per second).
Here's some code to illustrate that:
MyProducer.scala
class MyProducer extends Actor with Producer with Oneway {
  def endpointUri = "activemq:myqueue"
}
MyConsumer.scala
class MyConsumer extends Actor with Consumer {
  def endpointUri = "activemq:myqueue"
  def receive = {
    case msg: CamelMessage => println("Ping!")
  }
}
In my main method, I then have all the boilerplate to set up Camel and get it talking to ActiveMQ, and then I have:
// Start the consumer
val consumer = system.actorOf(Props[MyConsumer])
val producer = system.actorOf(Props[MyProducer])
// Imagine I call this line 100+ times
producer ! "message"
How can I make it so that MyProducer sends things to ActiveMQ as quickly as possible (i.e. no throttling) whilst making sure that MyConsumer only reads a message every x seconds? I'd like each message to stay on the ActiveMQ queue until the last possible moment (i.e. when it's read by MyConsumer).
So far, I've managed to use a TimerBasedThrottler to process messages at a certain rate, but the consumer still pulls all of the messages off the queue in one big go.
Apologies if I've missed something along the way, I'm relatively new to Akka/Camel.
How many consumers comprise "MyConsumer"?
a) If there is only one, it is unclear why a simple sleep between consuming messages would not work.
If there are multiple consumers, which behavior do you require:
each consumer is throttled to the specified consumption rate. In that case each consumer thread still behaves as mentioned in a)
the overall pool of consumers is throttled to the consumption rate. In that case a central throttler would need to track the inter-message delay and block each consumer until the required delay has elapsed. There is added complexity in handling backlogs, i.e. deciding whether to allow "catch-up". You probably get the drift here.
It may be that you were looking for something else or something more specific in this question. If so, please elaborate.
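The central throttler idea can be sketched in a few lines. This is only a minimal illustration, not Akka or Camel API: `Pacer`, `intervalMillis` and `delayBeforeNext` are hypothetical names, and the sketch deliberately does not allow catch-up after a quiet period.

```scala
// Hypothetical helper: it only decides how long to wait; it does not talk to ActiveMQ.
final class Pacer(intervalMillis: Long) {
  // Earliest time (epoch millis) at which the next message may be taken.
  private var nextAllowedAt: Long = 0L

  // Returns how long the caller should wait before taking the next message
  // and reserves the slot after it. Synchronized so that several consumers
  // can share one instance.
  def delayBeforeNext(nowMillis: Long): Long = synchronized {
    val startAt = math.max(nowMillis, nextAllowedAt)
    nextAllowedAt = startAt + intervalMillis
    startAt - nowMillis
  }
}
```

A consumer would call something like `Thread.sleep(pacer.delayBeforeNext(System.currentTimeMillis()))` before each read. With a single consumer this reduces to the plain sleep from a); with a pool, sharing one `Pacer` throttles the pool as a whole.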
I am polling a Kafka consumer as follows.
val records = consumer.poll(Duration.ofMillis(5000)).asScala.toList
This sometimes returns messages and sometimes does not. I repeatedly call a method that consumes data from the topic. Should I set "max.partition.fetch.bytes" to "5048576", or is the problem somewhere else?
Polling is not guaranteed to return records; it only blocks for up to the given timeout while waiting for them, so an empty result is normal.
You need to check if there is another consumer that is part of the same consumer group that is already consuming messages that you might expect.
In the comments, you mention a test. Kafka provides MockConsumer and MockProducer classes for unit-testing scenarios, and Kafka Streams also has its own testing utilities.
Suppose my producer is writing a message to Topic A. Once the message is in Topic A, I want to copy the same message to Topic B. Is this possible in Kafka?
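Since an empty poll is normal, calling code usually treats it as "try again" rather than as an error. A rough sketch of that loop follows; `poll` is a stand-in for something like `consumer.poll(Duration.ofMillis(5000)).asScala.toList`, and `PollUntil`, `firstNonEmpty` and `maxAttempts` are hypothetical names.

```scala
import scala.annotation.tailrec

object PollUntil {
  // Calls `poll` until it returns records, or gives up after `maxAttempts`
  // empty polls and returns Nil.
  @tailrec
  def firstNonEmpty[A](poll: () => List[A], maxAttempts: Int): List[A] =
    if (maxAttempts <= 0) Nil
    else {
      val batch = poll()
      if (batch.nonEmpty) batch else firstNonEmpty(poll, maxAttempts - 1)
    }
}
```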
If I understand correctly, you just want stream.to("topic-b"), although, that seems strange without doing something to the data.
Note:
The specified topic should be manually created before it is used
I am not clear about what use case you are exactly trying to achieve by simply copying data from one topic to another topic. If both the topics are in the same Kafka cluster then it is never a good idea to have two topics with the same message/content.
I believe the gap here is that the concept of the consumer group in Kafka may not be clear to you. You probably have two actions to perform on each message in the Kafka topic, and you are wondering whether a message that the first application has consumed will still be available to the second application. Kafka solves this common use case with consumer groups.
Let's try to differentiate between other message queues and Kafka, and you will see that you do not need to copy the same data/message between two topics.
In other message queues, like SQS (Simple Queue Service), once a message has been consumed by one consumer it is not available to other consumers. It is the responsibility of the consumer to delete the message safely once it has processed it. This guarantees that the same message is not processed by two consumers, which would lead to inconsistency.
In Kafka, however, it is totally fine to have multiple sets of consumers consuming from the same topic. Each set of consumers forms a group, commonly called a consumer group. Within a group, each message is processed by only one consumer, determined by the partition of the topic the message is on.
The catch here is that we can have multiple consumer groups consuming from the same Kafka topic. Each consumer group processes the messages in whatever way it wants. There is no interference between consumers of two different consumer groups.
To fulfill your use case I believe you might need two consumer groups that can simply process the message in the way they want. You do not essentially have to copy the data between two topics.
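To make the difference concrete, here is a toy in-memory model of that behavior. It is not the Kafka API and ignores partitions entirely; `ToyTopic`, `append` and `poll` are hypothetical names used only for illustration. The point is that each group keeps its own read position over the same log.

```scala
// Toy model only: one shared log, one read position per consumer group.
final class ToyTopic[A] {
  private var log = Vector.empty[A]
  private var offsets = Map.empty[String, Int] // groupId -> next index to read

  def append(msg: A): Unit = log = log :+ msg

  // Returns everything this group has not read yet and advances
  // only this group's position.
  def poll(groupId: String): Vector[A] = {
    val from = offsets.getOrElse(groupId, 0)
    offsets = offsets.updated(groupId, log.length)
    log.drop(from)
  }
}
```

Two groups polling the same `ToyTopic` both see every message, which is why a second copy of the topic is not needed.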
Hope this helps.
There are two immediate options to forward the contents of one topic to another:
by using the streams feature of Kafka to create a forwarding link between the two topics.
by creating a consumer / producer pair and using those to receive and then forward on messages.
I have a short piece of code that shows both (in Scala):
def topologyPlan(): StreamsBuilder = {
  val builder = new StreamsBuilder
  val inputTopic: KStream[String, String] = builder.stream[String, String]("topic2")
  inputTopic.to("topic3")
  builder
}
def run() = {
  val kafkaStreams = createStreams(topologyPlan())
  kafkaStreams.start()
  val kafkaConsumer = createConsumer()
  val kafkaProducer = createProducer()
  kafkaConsumer.subscribe(List("topic1").asJava)
  while (true) {
    val record = kafkaConsumer.poll(Duration.ofSeconds(5)).asScala
    for (data <- record.iterator) {
      kafkaProducer.send(new ProducerRecord[String, String]("topic2", data.value()))
    }
  }
}
Looking at the run method, the first two lines set up a streams object that uses topologyPlan() to listen for messages in 'topic2' and forward them to 'topic3'.
The remaining lines show how a consumer can listen to 'topic1' and use a producer to send the messages onward to 'topic2'.
The final point of the example is that Kafka is flexible enough to let you mix options depending on what you need, so the code above will take messages in 'topic1' and send them to 'topic3' via 'topic2'.
If you want to see the code that sets up consumer, producer and streams, see the full class here.
In my Scala (2.11) stream application I am consuming data from one queue in IBM MQ and writing it to a Kafka topic that has one partition. After consuming the data from the MQ, the message payload is split into 3000 smaller messages that are stored in a sequence of Strings. Then each of these 3000 messages is sent to Kafka (version 2.x) using KafkaProducer.
How would you send those 3000 messages?
I can't increase the number of queues in IBM MQ (not under my control) nor the number of partitions in the topic (ordering of messages is required, and writing a custom partitioner will impact too many consumers of the topic).
The Producer settings are currently:
acks=1
linger.ms=0
batch.size=65536
But optimizing them is probably a question of its own and not part of my current problem.
Currently, I am doing
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
private lazy val kafkaProducer: KafkaProducer[String, String] = new KafkaProducer[String, String](someProperties)
val messages: Seq[String] = Seq(String1, …, String3000)
for (msg <- messages) {
  val future = kafkaProducer.send(new ProducerRecord[String, String](someTopic, someKey, msg))
  // Blocks until the broker has acknowledged this single record
  val recordMetadata = future.get()
}
This does not look like the most elegant or efficient approach to me. Is there a programmatic way to increase throughput?
edit after answer from #radai
Thanks to the answer pointing me in the right direction, I had a closer look at the different Producer methods. The book Kafka - The Definitive Guide lists these methods:
Fire-and-forget
We send a message to the server and don’t really care if it arrives successfully or not. Most of the time, it will arrive successfully, since Kafka is highly available and the producer will retry sending messages automatically. However, some messages will get lost using this method.
Synchronous send
We send a message, the send() method returns a Future object, and we use get() to wait on the future and see if the send() was successful or not.
Asynchronous send
We call the send() method with a callback function, which gets triggered when it receives a response from the Kafka broker.
And now my code looks like this (leaving out error handling and the definition of Callback class):
val asyncProducer = new KafkaProducer[String, String](someProperties)
for (msg <- messages) {
  val record = new ProducerRecord[String, String](someTopic, someKey, msg)
  asyncProducer.send(record, new compareProducerCallback)
}
asyncProducer.flush()
I have compared all the methods for 10000 very small messages. Here are my measurements:
Fire-and-forget: 173683464 ns (about 0.17 s)
Synchronous send: 29195039875 ns (about 29.2 s)
Asynchronous send: 44153826 ns (about 0.04 s)
To be honest, there is probably more potential to optimize all of them by choosing the right properties (batch.size, linger.ms, ...).
The biggest reason I can see for your code to be slow is that you're waiting on every single send future.
Kafka was designed to send batches. By sending one record at a time you're waiting a full round trip for every single record, and you're not getting any benefit from compression.
The "idiomatic" thing to do would be to send everything, and then block on all the resulting futures in a second loop.
Also, if you intend to do this I'd bump linger.ms back up (otherwise your first record would result in a batch of size one, slowing you down overall; see https://en.wikipedia.org/wiki/Nagle%27s_algorithm) and call flush() on the producer once your send loop is done.
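The "send everything, then wait" pattern can be sketched without a broker. In this sketch `send` is a stand-in for a call such as `kafkaProducer.send(record)`, which returns a `java.util.concurrent.Future`; `BatchSend` and `sendAllThenWait` are hypothetical names.

```scala
import java.util.concurrent.{Future => JFuture}

object BatchSend {
  // First pass hands every message to `send` without waiting;
  // second pass blocks on each returned future in order.
  def sendAllThenWait[A, M](messages: Seq[A])(send: A => JFuture[M]): Vector[M] = {
    // toVector forces a strict collection, so every send is issued
    // before the first get() is called.
    val pending: Vector[JFuture[M]] = messages.toVector.map(send)
    pending.map(_.get())
  }
}
```

With the real producer, `send` would wrap the existing `kafkaProducer.send(new ProducerRecord[String, String](someTopic, someKey, msg))` call, so the producer is free to batch records while the futures are still pending.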
We are trying to use Akka Streams with Alpakka Kafka to consume a stream of events in a service. For handling event processing errors we are using Kafka autocommit and more than one queue. For example, if we have the topic user_created, which we want to consume from a products service, we also create user_created_for_products_failed and user_created_for_products_dead_letter. These two extra topics are coupled to a specific Kafka consumer group. If an event fails to be processed, it goes to the failed queue, where we try to consume again in five minutes--if it fails again it goes to dead letters.
On deployment we want to ensure that we don't lose events, so we are trying to stop the stream before stopping the application. As I said, we are using autocommit, but the events that are in flight at that moment have not been processed yet. Once the stream and application are stopped, we can deploy the new code and start the application again.
After reading the documentation, we have seen the KillSwitch feature. The problem that we are seeing in it is that the shutdown method returns Unit instead of Future[Unit] as we expect. We are not sure that we won't lose events using it, because in tests it looks like it goes too fast to be working properly.
As a workaround, we create an ActorSystem for each stream and use the terminate method (which returns a Future[Terminated]). The problem with this solution is that we don't think that creating an ActorSystem per stream will scale well, and terminate takes a lot of time to resolve (in tests it takes up to one minute to shut down).
Have you faced a problem like this? Is there a faster way (compared to ActorSystem.terminate) to stop a stream and ensure that all the events that the Source has emitted have been processed?
From the documentation (emphasis mine):
When using external offset storage, a call to Consumer.Control.shutdown() suffices to complete the Source, which starts the completion of the stream.
val (consumerControl, streamComplete) =
Consumer
.plainSource(consumerSettings,
Subscriptions.assignmentWithOffset(
new TopicPartition(topic, 0) -> offset
))
.via(businessFlow)
.toMat(Sink.ignore)(Keep.both)
.run()
consumerControl.shutdown()
Consumer.Control.shutdown() returns a Future[Done]. From its Scaladoc description:
Shutdown the consumer Source. It will wait for outstanding offset commit requests to finish before shutting down.
Alternatively, if you're using offset storage in Kafka, use Consumer.Control.drainAndShutdown, which also returns a Future. Again from the documentation (which contains more information about what drainAndShutdown does under the covers):
val drainingControl =
Consumer
.committableSource(consumerSettings.withStopTimeout(Duration.Zero), Subscriptions.topics(topic))
.mapAsync(1) { msg =>
business(msg.record).map(_ => msg.committableOffset)
}
.toMat(Committer.sink(committerSettings))(Keep.both)
.mapMaterializedValue(DrainingControl.apply)
.run()
val streamComplete = drainingControl.drainAndShutdown()
The Scaladoc description for drainAndShutdown:
Stop producing messages from the Source, wait for stream completion and shut down the consumer Source so that all consumed messages reach the end of the stream. Failures in stream completion will be propagated, the source will be shut down anyway.
I am probably missing the point of the Kafka Consumer but what I want to do is:
Consumer subscribes to a topic, grabs all messages within the topic and returns a Future with a list of all of those messages
The code I have written to try and accomplish this is
val sink = Sink.fold[List[KafkaMessage], KafkaMessage](List[KafkaMessage]()) { (list, kafkaMessage) =>
  list :+ kafkaMessage
}
def consume(topic: String) =
  Consumer.committableSource(consumerSettings, Subscriptions.topics(topic))
    .map { message =>
      logger.info(s"Consuming ${message.record.value}")
      KafkaMessage(Some(message.record.key()), Some(message.record.value()))
    }
    .buffer(bufferSize, overflowStrategy)
    .runWith(sink)
The Future never completes, though: the stream consumes the expected messages and then continues to poll the topic repeatedly. Is there a way to complete the Future and then close the consumer?
As Kafka is for streaming data, there is no such thing as "all messages", since new data can be appended to a topic at any point.
I guess there are two possible things you could do:
check how many records the last poll returned and terminate when it returns none, or
get the "current end of log" via endOffsets and compare it with how far you have read in each partition. Note that the end offset is the offset of the last record plus one, so you can return once your position has reached it in every partition.
The first approach is simpler, but it is not as reliable as the second. Theoretically, a poll could return zero records even if there are records available (even if the chances of this happening are not very high).
I am not sure how to express this termination condition in Scala, though (as I am not very familiar with Scala).
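For what it's worth, the comparison in the second approach could look roughly like this in Scala. It is only a sketch of the condition, not a full consumer: `Partition` is a stand-in for `org.apache.kafka.common.TopicPartition`, the two maps are assumed to be filled from `consumer.endOffsets(...)` and `consumer.position(...)` (after converting to Scala types), and a partition with no known position is assumed to start at offset 0.

```scala
object CaughtUp {
  // Stand-in for TopicPartition so the sketch has no dependencies.
  final case class Partition(topic: String, number: Int)

  // endOffsets: per partition, the offset the next written record would get.
  // positions:  per partition, the offset of the next record this consumer would read.
  // True once the consumer has reached the end offset of every partition.
  def caughtUp(endOffsets: Map[Partition, Long], positions: Map[Partition, Long]): Boolean =
    endOffsets.forall { case (p, end) => positions.getOrElse(p, 0L) >= end }
}
```

The end offsets would be captured once at the start, since that snapshot is what defines "all messages" for this run; the stream can then be stopped as soon as `caughtUp` returns true.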