In my Scala (2.11) stream application I am consuming data from one queue in IBM MQ and writing it to a Kafka topic that has one partition. After consuming the data from MQ, the message payload is split into 3000 smaller messages that are stored in a Seq of Strings. Each of these 3000 messages is then sent to Kafka (version 2.x) using KafkaProducer.
How would you send those 3000 messages?
I can't increase the number of queues in IBM MQ (not under my control) nor the number of partitions in the topic (ordering of messages is required, and writing a custom partitioner will impact too many consumers of the topic).
The Producer settings are currently:
acks=1
linger.ms=0
batch.size=65536
But optimizing them is probably a question of its own and not part of my current problem.
Currently, I am doing
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
private lazy val kafkaProducer: KafkaProducer[String, String] = new KafkaProducer[String, String](someProperties)
val messages: Seq[String] = Seq(String1, …, String3000)
for (msg <- messages) {
  val future = kafkaProducer.send(new ProducerRecord[String, String](someTopic, someKey, msg))
  val recordMetadata = future.get()
}
This does not look like the most elegant or efficient approach to me. Is there a programmatic way to increase throughput?
Edit after answer from #radai
Thanks to the answer pointing me in the right direction, I had a closer look at the different Producer methods. The book Kafka: The Definitive Guide lists these methods:
Fire-and-forget
We send a message to the server and don’t really care if it arrives successfully or not. Most of the time, it will arrive successfully, since Kafka is highly available and the producer will retry sending messages automatically. However, some messages will get lost using this method.
Synchronous send
We send a message, the send() method returns a Future object, and we use get() to wait on the future and see if the send() was successful or not.
Asynchronous send
We call the send() method with a callback function, which gets triggered when it receives a response from the Kafka broker.
And now my code looks like this (leaving out error handling and the definition of Callback class):
val asyncProducer = new KafkaProducer[String, String](someProperties)
for (msg <- messages) {
  val record = new ProducerRecord[String, String](someTopic, someKey, msg)
  asyncProducer.send(record, new compareProducerCallback)
}
asyncProducer.flush()
I have compared all the methods for 10000 very small messages. Here are my measurements:
Fire-and-forget: 173683464 ns (about 0.17 s)
Synchronous send: 29195039875 ns (about 29.2 s)
Asynchronous send: 44153826 ns (about 0.04 s)
To be honest, there is probably more potential to optimize all of them by choosing the right properties (batch.size, linger.ms, ...).
The biggest reason I can see for your code being slow is that you're waiting on every single send future.
Kafka was designed to send batches. By sending one record at a time you're waiting a full round-trip time for every single record, and you're not getting any benefit from compression.
The "idiomatic" thing to do would be to send everything and then block on all the resulting futures in a second loop.
Also, if you intend to do this, I'd bump linger.ms back up (otherwise your first record would result in a batch of size one, slowing you down overall; see https://en.wikipedia.org/wiki/Nagle%27s_algorithm) and call flush() on the producer once your send loop is done.
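A minimal sketch of that two-phase pattern, kept free of Kafka dependencies so it runs on its own. The names `SendThenWait`, `sendAll` and `fakeSend` are made up for illustration; `send` stands in for something like `msg => kafkaProducer.send(new ProducerRecord(someTopic, someKey, msg))`, which returns a Java Future.

```scala
import java.util.concurrent.{CompletableFuture, Future => JFuture}

object SendThenWait {
  /** Phase 1 hands every message to `send` without waiting;
    * phase 2 blocks on the resulting futures in their original order.
    */
  def sendAll[A, M](messages: Seq[A])(send: A => JFuture[M]): Vector[M] = {
    // toVector forces strict evaluation, so every send is issued before the first get()
    val futures: Vector[JFuture[M]] = messages.toVector.map(send)
    // With a real producer, calling flush() here would push out any batch that is still lingering
    futures.map(_.get()) // get() rethrows a failed send wrapped in an ExecutionException
  }

  def main(args: Array[String]): Unit = {
    // Fake sender that completes immediately; a real one would return Future[RecordMetadata]
    val fakeSend: String => JFuture[Int] =
      msg => CompletableFuture.completedFuture[Int](msg.length)
    println(sendAll(Seq("a", "bb", "ccc"))(fakeSend)) // Vector(1, 2, 3)
  }
}
```

Because nothing blocks during the first phase, the producer is free to group records into batches; the blocking in the second phase only confirms the outcome of each send.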
I have an application that receives an API request and relays it to a Kafka producer. Each request calls the producer to send a message to Kafka. The producer exists throughout the application lifetime and is shared by all requests.
producer.send(new ProducerRecord[String, String](topic, requestBody))
This works OK. Now I want to use an Alpakka Producer for the job instead. The code looks like this:
val kafkaProducer = producerSettings.createKafkaProducer()
val settingsWithProducer = producerSettings.withProducer(kafkaProducer)
val done = Source.single(requestBody)
  .map(value => new ProducerRecord[String, String](topic, value))
  .runWith(Producer.plainSink(settingsWithProducer))
What are the advantages of the Alpakka Producer over the plain, vanilla Producer? I don't know whether the new approach can help me handle a large number of API requests in order at the same time.
For that case of producing a single message to a Kafka topic, the Alpakka Producer sink that you're using doesn't really offer a benefit (the only tangential one might be if you're interested in using Akka Discovery to discover your Kafka brokers). Alpakka's SendProducer might be useful in your Scala code for that case: it exposes a Scala Future instead of a Java Future.
Where the Alpakka Producer sinks and flows shine is in a stream context where there's a sequence of elements that you want produced in order with backpressure, especially if the messages to be produced are the output of a complex stream topology.
I'm taking "large number of API requests" to mean HTTP/gRPC requests coming into your service and each request resulting in producing at most one message to Kafka. You can contort such a thing into a stream (e.g. feeding a stream via a Source.actorRef), but that's probably getting over-elaborate.
As for "in order at the same time": that's something of a contradiction, as "in order" somewhat rules out simultaneity. Are you thinking of a situation where you can partition the requests and want ordering within each partition of requests, but are OK with any ordering across partitions (note that I'm not necessarily implying anything about the partitioning of the Kafka topic you're producing to)? In that case, Akka Streams (and likely actors) and the Producer sinks/flows will likely come in handy.
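To make "ordered within a partition of requests, unordered across partitions" concrete, here is a small sketch using only plain Scala Futures. `KeyedSequencer` and `submit` are hypothetical names, not part of Akka or Alpakka, and `task` stands in for whatever produces the message. An Akka Streams solution would look different; this only illustrates the ordering guarantee.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical helper: tasks sharing a key run one after another,
// while tasks with different keys may overlap.
final class KeyedSequencer(implicit ec: ExecutionContext) {
  // Last scheduled task per key; entries are never evicted in this sketch
  private var tails = Map.empty[String, Future[Unit]]

  def submit(key: String)(task: () => Future[Unit]): Future[Unit] = synchronized {
    val previous = tails.getOrElse(key, Future.successful(()))
    // Start the task only once the previous one for this key has finished, even if it failed
    val next = previous.recover { case _ => () }.flatMap(_ => task())
    tails = tails.updated(key, next)
    next
  }
}

object KeyedSequencerDemo {
  def main(args: Array[String]): Unit = {
    import ExecutionContext.Implicits.global
    val sequencer = new KeyedSequencer
    val seen = new ConcurrentLinkedQueue[String]()
    val done = for {
      i   <- 1 to 3
      key <- Seq("a", "b")
    } yield sequencer.submit(key)(() => Future { seen.add(s"$key$i"); () })
    Await.result(Future.sequence(done), 5.seconds)
    // a1, a2, a3 appear in that order, as do b1, b2, b3; interleaving across keys may vary
    println(seen)
  }
}
```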
I'm aware that you can define a stream-processing Kafka application in the form of a topology that implicitly understands which records have gone through successfully, and can therefore correctly commit the consumer offset so that when the microservice has to be restarted, it will continue reading the input topic without missing messages.
But what happens when I introduce my own processing classes into the stream? For instance, perhaps I need to submit information from the input records to a web service with a big startup time. So I write my own processor class that accumulates, say, 1000 messages and then submits a batch request to the external service, like this:
KStream<String, Prediction> stream = new StreamsBuilder()
    .stream(inputTopic, Consumed.with(Serdes.String(), new MessageSerde()))
    // talk to web service
    .map((k, v) -> new KeyValue<>("", wrapper.consume(v.getPayload())))
    .flatMapValues((ValueMapper<List<Prediction>, Iterable<Prediction>>) value -> value);
// send downstream
stream.peek((k, v) -> metrics.countOutgoingMessage())
    .to(outputTopic, Produced.with(Serdes.String(), new PredictionSerde()));
Assume that the external service can issue zero, one or more predictions of some kind for every input, and that my wrapper submits inputs in batches to increase throughput. It seems to me that KStream cannot possibly keep track of which input record corresponds to which output record, and therefore, no matter how it is implemented, it cannot guarantee that the correct consumer offset for the input topic is committed.
So in this paradigm, how can I give the library hints about which messages have been successfully processed? Or failing that, how can I get access to the consumer offset for the topic and perform commits explicitly so that no data loss can occur?
I think you might have a problem if you are using map: combining remote calls with a DSL operator is not recommended. You might want to look into the Processor API docs. With ProcessorContext you can forward or commit, which could give you the flexibility you need.
We are trying to use Akka Streams with Alpakka Kafka to consume a stream of events in a service. For handling event processing errors we are using Kafka autocommit and more than one queue. For example, if we have the topic user_created, which we want to consume from a products service, we also create user_created_for_products_failed and user_created_for_products_dead_letter. These two extra topics are coupled to a specific Kafka consumer group. If an event fails to be processed, it goes to the failed queue, where we try to consume again in five minutes--if it fails again it goes to dead letters.
On deployment we want to ensure that we don't lose events. So we are trying to stop the stream before stopping the application. As I said, we are using autocommit, but all of these events that are "flying" are not processed yet. Once the stream and application are stopped, we can deploy the new code and start the application again.
After reading the documentation, we have seen the KillSwitch feature. The problem we see with it is that the shutdown method returns Unit instead of the Future[Unit] we expected. We are not sure that we won't lose events using it, because in tests it seems to finish too quickly to be working properly.
As a workaround, we create an ActorSystem for each stream and use the terminate method (which returns a Future[Terminated]). The problem with this solution is that we don't think that creating an ActorSystem per stream will scale well, and terminate takes a long time to resolve (in tests it takes up to one minute to shut down).
Have you faced a problem like this? Is there a faster way (compared to ActorSystem.terminate) to stop a stream and ensure that all the events that the Source has emitted have been processed?
From the documentation (emphasis mine):
When using external offset storage, a call to Consumer.Control.shutdown() suffices to complete the Source, which starts the completion of the stream.
val (consumerControl, streamComplete) =
  Consumer
    .plainSource(consumerSettings,
                 Subscriptions.assignmentWithOffset(
                   new TopicPartition(topic, 0) -> offset
                 ))
    .via(businessFlow)
    .toMat(Sink.ignore)(Keep.both)
    .run()
consumerControl.shutdown()
Consumer.Control.shutdown() returns a Future[Done]. From its Scaladoc description:
Shutdown the consumer Source. It will wait for outstanding offset commit requests to finish before shutting down.
Alternatively, if you're using offset storage in Kafka, use Consumer.Control.drainAndShutdown, which also returns a Future. Again from the documentation (which contains more information about what drainAndShutdown does under the covers):
val drainingControl =
  Consumer
    .committableSource(consumerSettings.withStopTimeout(Duration.Zero), Subscriptions.topics(topic))
    .mapAsync(1) { msg =>
      business(msg.record).map(_ => msg.committableOffset)
    }
    .toMat(Committer.sink(committerSettings))(Keep.both)
    .mapMaterializedValue(DrainingControl.apply)
    .run()
val streamComplete = drainingControl.drainAndShutdown()
The Scaladoc description for drainAndShutdown:
Stop producing messages from the Source, wait for stream completion and shut down the consumer Source so that all consumed messages reach the end of the stream. Failures in stream completion will be propagated, the source will be shut down anyway.
I am probably missing the point of the Kafka Consumer but what I want to do is:
Consumer subscribes to a topic, grabs all messages within the topic and returns a Future with a list of all of those messages
The code I have written to try and accomplish this is
val sink = Sink.fold[List[KafkaMessage], KafkaMessage](List[KafkaMessage]()) { (list, kafkaMessage) =>
  list :+ kafkaMessage
}
def consume(topic: String) =
  Consumer.committableSource(consumerSettings, Subscriptions.topics(topic))
    .map { message =>
      logger.info(s"Consuming ${message.record.value}")
      KafkaMessage(Some(message.record.key()), Some(message.record.value()))
    }
    .buffer(bufferSize, overflowStrategy)
    .runWith(sink)
The Future never completes, though: the stream consumes the necessary messages and then continues to poll the topic repeatedly. Is there a way to complete the Future and then close the consumer?
As Kafka is for streaming data, there is no such thing as "all messages" as new data can be appended to a topic at any point.
I guess there are two possible things you could do:
check how many records were returned by the last poll and terminate when it returns none, or
get the "current end of log" via endOffsets and compare it to the offset of the latest record per partition. If both match, you can return.
The first approach is simpler, but has the disadvantage that it is not as reliable as the second. Theoretically, a poll could return zero records even if there are records available (even if the chances of this happening are not very high).
Not sure, how to express this termination condition in Scala though (as I am not very familiar with Scala).
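For what it's worth, the second termination condition could be sketched in Scala roughly as follows. This is a pure function with no Kafka dependency; `CaughtUp`, `caughtUp` and the use of plain Int partition numbers as keys are simplifications made up for this example. It relies on the fact that the end offset reported by KafkaConsumer.endOffsets is the offset the next record would receive, i.e. the last existing offset plus one.

```scala
object CaughtUp {
  /** True once every partition has been read up to the end offsets captured at start.
    *
    * @param endOffsets   partition -> end offset (the offset the NEXT record would get)
    * @param lastConsumed partition -> offset of the latest record seen so far
    * @param beginOffsets partition -> first available offset; assumed to be 0 if absent
    */
  def caughtUp(endOffsets: Map[Int, Long],
               lastConsumed: Map[Int, Long],
               beginOffsets: Map[Int, Long] = Map.empty): Boolean =
    endOffsets.forall { case (partition, end) =>
      val begin = beginOffsets.getOrElse(partition, 0L)
      // An empty partition has nothing to wait for; otherwise the last record is at end - 1
      end <= begin || lastConsumed.get(partition).exists(_ + 1 >= end)
    }

  def main(args: Array[String]): Unit = {
    println(caughtUp(Map(0 -> 3L), Map(0 -> 2L)))          // true
    println(caughtUp(Map(0 -> 3L, 1 -> 5L), Map(0 -> 2L))) // false
  }
}
```

In a stream, a predicate like this could drive an operator such as takeWhile so that the fold completes. On compacted or transactional topics the last visible record can sit below end - 1, so comparing against the consumer's position may be more robust there.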
I've got a very basic skeleton Scala application (with Akka, Camel and ActiveMQ) where I want to publish onto an ActiveMQ queue as quickly as possible, but then only consume from that queue at a particular rate (eg. 1 per second).
Here's some code to illustrate that:
MyProducer.scala
import akka.actor.Actor
import akka.camel.{Oneway, Producer}
class MyProducer extends Actor with Producer with Oneway {
  def endpointUri = "activemq:myqueue"
}
MyConsumer.scala
import akka.actor.Actor
import akka.camel.{CamelMessage, Consumer}
class MyConsumer extends Actor with Consumer {
  def endpointUri = "activemq:myqueue"
  def receive = {
    case msg: CamelMessage => println("Ping!")
  }
}
In my main method, I then have all the boilerplate to set up Camel and get it talking to ActiveMQ, and then I have:
// Start the consumer
val consumer = system.actorOf(Props[MyConsumer])
val producer = system.actorOf(Props[MyProducer])
// Imagine I call this line 100+ times
producer ! "message"
How can I make it so that MyProducer sends things to ActiveMQ as quickly as possible (ie. no throttling) whilst making sure that MyConsumer only reads a message every x seconds? I'd like each message to stay on the ActiveMQ queue until the last possible moment (ie. when it's read by MyConsumer).
So far, I've managed to use a TimerBasedThrottler to consume at a certain rate, but this still consumes all of the messages in one big go.
Apologies if I've missed something along the way, I'm relatively new to Akka/Camel.
How many consumers make up "MyConsumer"?
a) If there is only one, it is unclear why a simple sleep between consuming messages would not work.
b) If there are multiple consumers, which behavior do you require?
Each consumer is throttled to the specified consumption rate. In that case each consumer thread still behaves as described in a).
The overall pool of consumers is throttled to the consumption rate. In that case a central throttler would need to track the inter-message delay and block each consumer until the required delay has elapsed. There would be added complexity in managing backlogs, to allow "catch-up". You probably get the drift here.
If you were looking for something else or more specific in this question, please elaborate.
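For case a), the "simple sleep" idea can be sketched as a pull loop that takes one message, handles it, and pauses before asking for the next. Everything here is an assumption for illustration: `PacedConsumer`, `drain`, `poll` and `handle` are made-up names, the in-memory queue stands in for the ActiveMQ queue, and wiring this into akka-camel is not shown.

```scala
import scala.collection.mutable

object PacedConsumer {
  /** Pulls one message at a time and waits `intervalMillis` after each one.
    * Stops when `poll` returns None and reports how many messages were handled.
    * `sleep` is injectable so the pause can be replaced in tests.
    */
  def drain[A](poll: () => Option[A],
               intervalMillis: Long,
               sleep: Long => Unit = Thread.sleep(_))(handle: A => Unit): Int = {
    var handled = 0
    var next = poll()
    while (next.isDefined) {
      handle(next.get)
      handled += 1
      sleep(intervalMillis) // the following message is not fetched until the pause ends
      next = poll()
    }
    handled
  }

  def main(args: Array[String]): Unit = {
    val queue = mutable.Queue("m1", "m2", "m3") // stand-in for the broker queue
    val poll = () => if (queue.isEmpty) None else Some(queue.dequeue())
    val count = drain(poll, intervalMillis = 1000L) { msg => println(s"Ping! $msg") }
    println(count) // 3
  }
}
```

Because each message is only fetched after the previous pause, unread messages remain on the queue until their turn, which is the behavior the question asks for. A real consumer would normally keep polling rather than stop on the first empty result.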