I have an application that receives an API request and relays it to Kafka via a Kafka producer. Each request calls the producer to send a message to Kafka. The producer exists throughout the application lifetime and is shared by all requests.
producer.send(new ProducerRecord[String, String](topic, requestBody))
This works OK. Now I want to use an Alpakka Producer for the job instead. The code looks like this:
val kafkaProducer = producerSettings.createKafkaProducer()
val settingsWithProducer = producerSettings.withProducer(kafkaProducer)
val done = Source.single(requestBody)
.map(value => new ProducerRecord[String, String](topic, value))
.runWith(Producer.plainSink(settingsWithProducer))
What are the advantages of the Alpakka Producer over the plain, vanilla Kafka producer? I don't know whether the new approach can help me handle a large number of API requests in order at the same time.
For that case of producing a single message to a Kafka topic, the Alpakka Producer sink that you're using doesn't really offer a benefit (the only tangential one might be if you're interested in using Akka Discovery to discover your Kafka brokers). Alpakka's SendProducer might be useful in your Scala code for that case: it exposes a Scala Future instead of a Java Future.
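For illustration, a minimal sketch of using SendProducer for that single-message case might look like the following (the topic name, bootstrap servers, and ActorSystem wiring are assumptions, not your code):
import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.SendProducer
import org.apache.kafka.clients.producer.{ProducerRecord, RecordMetadata}
import org.apache.kafka.common.serialization.StringSerializer
import scala.concurrent.Future

object ApiRelay {
  implicit val system: ActorSystem = ActorSystem("api-relay")

  val producerSettings =
    ProducerSettings(system, new StringSerializer, new StringSerializer)
      .withBootstrapServers("localhost:9092") // assumed broker address

  // Created once at startup and shared by all requests, like the plain KafkaProducer.
  val sendProducer = SendProducer(producerSettings)

  val topic = "some-topic" // placeholder

  // Returns a Scala Future instead of the Java Future returned by producer.send.
  def relay(requestBody: String): Future[RecordMetadata] =
    sendProducer.send(new ProducerRecord[String, String](topic, requestBody))
}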
Where the Alpakka Producer sinks and flows shine is in a stream context where there's a sequence of elements that you want produced in order with backpressure, especially if the messages to be produced are the output of a complex stream topology.
I'm taking "large number of API requests" to mean HTTP/gRPC requests coming into your service and each request resulting in producing at most one message to Kafka. You can contort such a thing into a stream (e.g. feeding a stream via a Source.actorRef), but that's probably getting over-elaborate.
As for "in order at the same time": that's kind of a contradiction, as "in order" somewhat rules out simultaneity. Are you thinking of a situation where you can partition the requests and then you want ordering within that partition of requests, but are OK with any ordering across partitions (note that I'm not necessarily implying anything about partitioning of the Kafka topic you're producing to)? In that case, Akka Streams (and likely actors) will come in handy and the Producer sinks/flows will likely come in handy.
Related
I have data coming in through RabbitMQ. The data is coming in constantly, multiple messages per second.
I need to forward that data to Kafka.
In my RabbitMQ delivery callback, where I am getting the data from RabbitMQ, I have a Kafka producer that immediately sends the received messages to Kafka.
My question is very simple. Is it better to create a Kafka producer outside of the callback method and use that one producer for all messages or should I create the producer inside the callback method and close it after the message is sent, which means that I am creating a new producer for each message?
It might be a naive question but I am new to Kafka and so far I did not find a definitive answer on the internet.
EDIT : I am using a Java Kafka client.
Creating a Kafka producer is an expensive operation, so using the Kafka producer as a singleton is good practice for both performance and resource utilization.
For Java clients, this is from the docs:
The producer is thread safe and should generally be shared among all threads for best performance.
For librdkafka-based clients (confluent-dotnet, confluent-python, etc.), I can link this related issue, with this quote from it:
Yes, creating a singleton service like that is a good pattern. you definitely should not create a producer each time you want to produce a message - it is approximately 500,000 times less efficient.
The Kafka producer is stateful: it holds metadata (periodically synced from the brokers), a send-message buffer, and so on. So creating a producer for each message is impractical.
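For illustration, a minimal sketch of the long-lived, shared producer (written in Scala against the plain Java client; the broker address and serializer config are assumptions, and the same shape applies directly in Java):
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object KafkaSink {
  private val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // assumed broker address
  props.put("key.serializer", classOf[StringSerializer].getName)
  props.put("value.serializer", classOf[StringSerializer].getName)

  // Created once at startup; reused by every RabbitMQ delivery callback.
  val producer = new KafkaProducer[String, String](props)

  def forward(topic: String, payload: String): Unit =
    producer.send(new ProducerRecord[String, String](topic, payload))

  // Close once on application shutdown, never per message.
  def shutdown(): Unit = producer.close()
}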
I am participating in a challenge, and it says: your first step is to consume a data sample from Apache Kafka.
So they give me a topic name, an API_KEY and an API_SECRET. Oh, and a bootstrap server.
Then they say that if you are unfamiliar with Kafka, there is comprehensive documentation provided by Confluent. So OK, I sign in to Confluent, make a cluster and... what is the next step to consume data?
Here's a basic pattern for putting messages from Kafka into a list in Python.
from json import loads  # needed by the value_deserializer below
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'someTopicName',
    bootstrap_servers=['192.168.1.160:9092'],
    auto_offset_reset='earliest',
    enable_auto_commit=True,
    group_id='my-group',
    value_deserializer=lambda x: loads(x.decode('utf-8')))

print("We have a consumer instantiated")
print(consumer)

messageCache = []

for message in consumer:
    messageCache.append(message.value)
In this case, my Kafka broker is on my private LAN, using the default port, so my bootstrap servers list is just ["192.168.1.160:9092"].
You can use standard counters and if statements to save the list to a file or whatever, since the Kafka stream is assumed to run forever. For example, I have a process that consumes Kafka messages and saves them as a dataframe in Parquet to HDFS every 1,000,000 messages. In that case I wanted to save historical messages to develop an ML model. The great thing about Kafka is that I could write another process that evaluated, and potentially responded to, every message in real time.
I wish to describe the following scenario:
I have a node.js backend application (It uses a single thread event loop).
This is the general architecture of the system:
Producer -> Kafka -> Consumer -> Database
Let's say that the producer sends a message to Kafka, and the purpose of this message is to run a certain query against the database and retrieve the query result.
However, as we all know, Kafka is an asynchronous system. If the producer sends a message to Kafka, it gets a response that the message has been accepted by a Kafka broker. The Kafka broker doesn't wait until the consumer polls the message and processes it.
In this case, how can the producer get the result of the query run against the database?
The flow using Kafka will look like this:
Producer A -> Kafka -> Consumer A -> Database
Consumer A (now acting as a producer) -> Kafka -> Consumer B
The only way for Producer A to be aware of what happened with the message consumed by Consumer A is to produce another message, which will be handled accordingly by whichever consumer is available (in this case, Consumer B).
As you already mentioned, this flow is asynchronous. This can be useful when you have very heavy processing for your query, like report generation or something like that, and the second producer will notify a user's inbox, for example.
If that is not the case, perhaps you should use HTTP, which is synchronous, so you will have the response at the end of processing.
You must create a new flow to communicate the query result:
Consumer (now it's a producer) -> Kafka topic -> Producer (now it's a consumer)
Alternatively, you should consider using a synchronous communication mechanism like HTTP.
I'm trying to design an Akka Stream using Alpakka to read events from kafka topic and put them to the Couchbase.
So far I have the following code and it seems to work somehow:
Consumer
.committableSource(consumerSettings, Subscriptions.topics(topicIn))
.map(profile ⇒ {
RawJsonDocument.create(profile.record.key(), profile.record.value())
})
.via(
CouchbaseFlow.upsertDoc(
sessionSettings,
writeSettings,
bucketName
)
)
.log("Couchbase stream logging")
.runWith(Sink.seq)
By "somehow" I mean that the stream is actually reads events from topic and put them to Couchbase as json documents and it looks even nice despite a fact that I don't understand how to commit consumer offsets to Kafka.
If I've clearly understood the main idea that hides behind Kafka consumer offsets, in case of any failure or restart happens, the stream reads all messages from the last commited offset and, since we haven't committed any, it probably re-reads the records being read at previous session once again.
So am I right in my assumptions? If so, how to handle consumer commits in case of reading from Kafka and publishing to some database? The official Akka Streams documentation provides the examples showing how to deal with such cases using plain Kafka Streams, so I have no idea about how to committing the offsets in my case.
Great thanks!
You will need to commit the offsets in Couchbase in order to obtain "exactly once" semantics.
This should help: https://doc.akka.io/docs/alpakka-kafka/current/consumer.html#offset-storage-external-to-kafka
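A rough sketch of the pattern described on that page, adapted to this question, could look like the following; the two Couchbase helpers, the single-partition assignment, and all configuration are hypothetical placeholders, not Alpakka APIs:
import akka.actor.ActorSystem
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.Consumer
import akka.stream.scaladsl.Sink
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import scala.concurrent.Future

object ExternalOffsetsSketch {
  implicit val system: ActorSystem = ActorSystem("couchbase-offsets")
  import system.dispatcher

  val topicIn = "profiles-in" // placeholder topic name

  val consumerSettings =
    ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("localhost:9092")
      .withGroupId("profiles")

  // Hypothetical helpers you would implement against your bucket:
  def loadOffsetFromCouchbase(): Future[Long] = ???
  def upsertDocAndOffset(key: String, json: String, offset: Long): Future[Unit] = ???

  // Start reading from the offset stored in Couchbase; every write stores the
  // document together with the offset it came from.
  loadOffsetFromCouchbase().foreach { fromOffset =>
    Consumer
      .plainSource(
        consumerSettings,
        Subscriptions.assignmentWithOffset(new TopicPartition(topicIn, 0), fromOffset))
      .mapAsync(1)(record => upsertDocAndOffset(record.key(), record.value(), record.offset()))
      .runWith(Sink.ignore)
  }
}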
I have one Kafka producer and one consumer. The Kafka producer publishes to one topic, and the data is taken and some processing is done. The Kafka consumer reads from another topic about whether the processing of data from topic 1 was successful or not, i.e. topic 2 has success or failure messages. Now I am starting my consumer and then publishing the data to topic 1. I want to make the producer and consumer synchronous, i.e. once the producer publishes the data, the consumer should read the success or failure message for that data, and only then should the producer proceed with the next set of data.
Apache Kafka and Publish/Subscribe messaging in general seek to de-couple producers and consumers through the use of streaming async events. What you are describing is more like a batch job or a synchronous Remote Procedure Call (RPC) where the Producer and Consumer are explicitly coupled together. The standard Apache Kafka Producer/Consumer APIs do not support this Message Exchange Pattern, but you can always write your own simple wrapper on top of the Kafka APIs that uses Correlation IDs, Consumption ACKs, and Request/Response messages to make your own interface that behaves as you wish.
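To illustrate the kind of wrapper described above (this is not a standard Kafka API), here is a rough sketch using correlation IDs and a request/reply topic pair; the topic names, configs, and the blocking reply loop are all assumptions:
import java.time.Duration
import java.util.{Properties, UUID}
import java.util.concurrent.ConcurrentHashMap
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.concurrent.{Future, Promise}
import scala.jdk.CollectionConverters._

class KafkaRequestReply(bootstrap: String, requestTopic: String, replyTopic: String) {
  private val pending = new ConcurrentHashMap[String, Promise[String]]()

  private val producer = {
    val p = new Properties()
    p.put("bootstrap.servers", bootstrap)
    p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    new KafkaProducer[String, String](p)
  }

  // Send a request keyed by a fresh correlation ID and return a Future for the reply.
  def ask(payload: String): Future[String] = {
    val correlationId = UUID.randomUUID().toString
    val promise = Promise[String]()
    pending.put(correlationId, promise)
    producer.send(new ProducerRecord(requestTopic, correlationId, payload))
    promise.future
  }

  // Background thread: poll the reply topic and complete the matching promise.
  private val replyLoop = new Thread(() => {
    val p = new Properties()
    p.put("bootstrap.servers", bootstrap)
    p.put("group.id", "request-reply-" + UUID.randomUUID())
    p.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    p.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](p)
    consumer.subscribe(List(replyTopic).asJava)
    while (true) {
      for (record <- consumer.poll(Duration.ofMillis(500)).asScala) {
        Option(pending.remove(record.key())).foreach(_.success(record.value()))
      }
    }
  })
  replyLoop.setDaemon(true)
  replyLoop.start()
}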
Short answer: You can't do that; Kafka doesn't provide that support.
Long answer: As Hans explained, the Publish/Subscribe messaging model keeps publishers and subscribers completely unaware of each other, and I believe that is where the power of this model lies. A producer can produce without worrying about whether there is any consumer, and a consumer can consume without worrying about how many producers there are.
The closest you can get is to make your producer synchronous, which means you wait until your message is received and acknowledged by the broker.
If you want to do that, flush after every send.
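For example, a minimal sketch of such a synchronous send (shown in Scala against the Java client; the config values are assumptions):
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object SyncSendSketch extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // assumed broker address
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("acks", "all") // wait for the broker-side acknowledgement

  val producer = new KafkaProducer[String, String](props)

  // Block until the broker has acknowledged the record...
  val metadata = producer.send(new ProducerRecord[String, String]("topic1", "some data")).get()
  // ...or send fire-and-forget and flush before moving on to the next set of data:
  // producer.send(record); producer.flush()
  println(s"written to ${metadata.topic()}-${metadata.partition()} at offset ${metadata.offset()}")

  producer.close()
}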