Making Kafka producer and consumer synchronous

I have one Kafka producer and one consumer. The producer publishes to topic 1, where the data is taken and some processing is done. The consumer reads from topic 2, which carries success or failure messages indicating whether the processing of the data from topic 1 succeeded. Currently I start my consumer and then publish the data to topic 1. I want to make the producer and consumer synchronous: once the producer publishes the data, the consumer should read the success or failure message for that data, and only then should the producer proceed with the next set of data.

Apache Kafka, and publish/subscribe messaging in general, seeks to decouple producers and consumers through the use of streaming async events. What you are describing is more like a batch job or a synchronous Remote Procedure Call (RPC) where the producer and consumer are explicitly coupled together. The standard Apache Kafka producer/consumer APIs do not support this message exchange pattern, but you can always write your own simple wrapper on top of the Kafka APIs that uses correlation IDs, consumption ACKs, and request/response messages to build an interface that behaves as you wish.
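For illustration, here is a minimal sketch of such a wrapper with the Java client; the topic names ("topic1", "topic2") and the "correlationId" header are assumptions for this example, not a standard API:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import java.util.UUID;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.header.Header;

    // Sketch of a synchronous request/reply wrapper over Kafka.
    public class SyncKafkaClient {
        private final KafkaProducer<String, String> producer;
        private final KafkaConsumer<String, String> consumer;

        public SyncKafkaClient(Properties producerProps, Properties consumerProps) {
            this.producer = new KafkaProducer<>(producerProps);
            this.consumer = new KafkaConsumer<>(consumerProps);
            this.consumer.subscribe(List.of("topic2")); // success/failure messages
        }

        // Publish one record to topic1, then block until a reply carrying
        // the same correlation id shows up on topic2.
        public String sendAndAwaitResult(String payload) throws Exception {
            String correlationId = UUID.randomUUID().toString();
            ProducerRecord<String, String> request = new ProducerRecord<>("topic1", payload);
            request.headers().add("correlationId", correlationId.getBytes());
            producer.send(request).get(); // wait for the broker ack

            while (true) { // poll until the matching reply arrives
                for (ConsumerRecord<String, String> reply : consumer.poll(Duration.ofSeconds(1))) {
                    Header h = reply.headers().lastHeader("correlationId");
                    if (h != null && correlationId.equals(new String(h.value()))) {
                        return reply.value(); // e.g. "SUCCESS" or "FAILURE"
                    }
                }
            }
        }
    }

Only after sendAndAwaitResult returns would the caller publish the next set of data.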

Short answer: you can't do that; Kafka doesn't provide that support.
Long answer: as Hans explained, the publish/subscribe messaging model keeps publishers and subscribers completely unaware of each other, and I believe that is where the power of this model lies. A producer can produce without worrying about whether there is any consumer, and a consumer can consume without worrying about how many producers there are.
The closest you can get is to make your producer synchronous, which means you wait until your message is received and acknowledged by the broker.
If you want to do that, block on each send and flush after every send.
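With the Java client that looks roughly like this; the topic name is a placeholder, and blocking on the Future returned by send() is what waits for the broker acknowledgment:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class SyncSend {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for full acknowledgment

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("topic1", "key", "value");
                // send() is async; get() blocks until the broker acks (or throws).
                RecordMetadata metadata = producer.send(record).get();
                producer.flush(); // push out anything still sitting in the buffer
            }
        }
    }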

Related

What happens to the kafka messages if the microservice crashes before kafka commit?

I am new to Kafka. I have a Kafka Streams based Java microservice that consumes messages from a Kafka topic produced by a producer and processes them. The commit interval has been set using auto.commit.interval.ms. My question is: if the microservice crashes before a commit, what will happen to the messages that got processed but didn't get committed? Will there be duplicated records? And how do I resolve this duplication, if it happens?
Kafka has exactly-once semantics, which guarantee that records get processed only once. Take a look at this section of Spring Kafka's docs for more details on the Spring support for that. Also, see this section for the support for transactions.
Kafka provides various delivery semantics, and you can choose among them on the basis of the use case you've implemented.
If you're concerned that your messages should not get lost by the consumer service, you should go with the at-least-once delivery semantics.
Now, answering your question on the basis of at-least-once delivery semantics:
If your consumer service crashes before committing the Kafka message, it will re-stream the message once your consumer service is up and running, because the offset for the partition was not committed. The offset for a partition is committed once the message has been processed by the consumer; in simple words, the commit says that the offset has been processed, and Kafka will not send a committed message for the same partition again.
At-least-once delivery semantics are usually good enough for use cases where data duplication is not a big issue or deduplication is possible on the consumer side. For example, with a unique key in each message, a duplicate can be rejected when writing to the database.
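A minimal sketch of that pattern, assuming auto-commit is disabled and a hypothetical insertIfAbsent callback backed by a unique key in the database:

    import java.time.Duration;
    import java.util.function.BiConsumer;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    // At-least-once consumption with consumer-side deduplication.
    void consumeAtLeastOnce(KafkaConsumer<String, String> consumer,
                            BiConsumer<String, String> insertIfAbsent) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                // The unique key makes the write idempotent: a redelivered
                // record violates the unique constraint and is skipped.
                insertIfAbsent.accept(record.key(), record.value());
            }
            consumer.commitSync(); // commit only after successful processing
        }
    }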
There are mainly three types of delivery semantics (a sketch showing where the commit happens in the first two follows this list).
At most once:
Offsets are committed as soon as the message is received by the consumer.
It's a bit risky: if the processing goes wrong, the message is lost.
At least once:
Offsets are committed after the messages are processed, so it's usually the preferred one.
If the processing goes wrong, the message will be read again, as it has not been committed.
The problem with this is duplicate processing of messages, so make sure your processing is idempotent (yes, your application should handle duplicates; Kafka won't help here). Idempotent means that processing a message again will not impact your system.
Exactly once:
Can be achieved for Kafka-to-Kafka communication using the Kafka Streams API. It's not your case.
You can choose the semantics from the above as per your requirement.
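As a rough illustration, the only difference between the first two is where commitSync() sits relative to the processing; consumer and process() here are assumed placeholders:

    // At-most-once: commit BEFORE processing.
    // A crash during processing loses the uncommitted batch.
    ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
    consumer.commitSync();
    batch.forEach(record -> process(record));

    // At-least-once: commit AFTER processing.
    // A crash before the commit re-delivers the batch,
    // so process() must be idempotent.
    ConsumerRecords<String, String> batch2 = consumer.poll(Duration.ofMillis(500));
    batch2.forEach(record -> process(record));
    consumer.commitSync();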

Is it better to keep a Kafka Producer open or to create a new one for each message?

I have data coming in through RabbitMQ. The data is coming in constantly, multiple messages per second.
I need to forward that data to Kafka.
In my RabbitMQ delivery callback, where I am getting the data from RabbitMQ, I have a Kafka producer that immediately sends the received messages to Kafka.
My question is very simple. Is it better to create a Kafka producer outside of the callback method and use that one producer for all messages or should I create the producer inside the callback method and close it after the message is sent, which means that I am creating a new producer for each message?
It might be a naive question, but I am new to Kafka and so far I have not found a definitive answer on the internet.
EDIT: I am using the Java Kafka client.
Creating a Kafka producer is an expensive operation, so using the Kafka producer as a singleton is good practice with respect to performance and resource utilization.
For Java clients, this is from the docs:
The producer is thread safe and should generally be shared among all threads for best performance.
For librdkafka-based clients (confluent-dotnet, confluent-python, etc.), I can link this related issue with this quote from the issue:
Yes, creating a singleton service like that is a good pattern. You definitely should not create a producer each time you want to produce a message; it is approximately 500,000 times less efficient.
The Kafka producer is stateful: it holds metadata (periodically synced from the brokers), a send-message buffer, and so on. Creating a producer for each message is therefore impracticable.
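A minimal sketch of the singleton approach; the topic name and the Forwarder class are illustrative, not a prescribed pattern:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // One long-lived, thread-safe producer shared by every delivery callback.
    public class Forwarder implements AutoCloseable {
        private final KafkaProducer<String, String> producer;

        public Forwarder(Properties kafkaProps) {
            this.producer = new KafkaProducer<>(kafkaProps); // created once
        }

        // Called from the RabbitMQ delivery callback for each message.
        public void forward(String body) {
            producer.send(new ProducerRecord<>("target-topic", body)); // async, batched
        }

        @Override
        public void close() {
            producer.close(); // flush and release resources once, on shutdown
        }
    }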

Kafka exactly-once producer consumer

I am implementing exactly-once semantics for a simple data pipeline, with Kafka as the message broker. I can force the Kafka producer to write each produced record exactly once by setting enable.idempotence=true.
However, on the consumption front I need to guarantee that the consumer reads each record exactly once (I am not interested in storing the consumed records in an external system or in another Kafka topic, just processing). To achieve this, I have to ensure that polled records are processed and their offsets are committed to the __consumer_offsets topic atomically/transactionally (both succeed or fail together).
In such a case, do I need to resort to the Kafka transaction APIs to create a transactional producer in the consumer polling loop, where inside the transaction I perform (1) processing of the consumed records and (2) committing their offsets, before closing the transaction? Or would the normal commitSync/commitAsync serve in such a case?
"on the consumption front I need to guarantee that the consumer reads each record exactly once"
The answer from Gopinath explains well how you can achieve exactly-once between a KafkaProducer and a KafkaConsumer. These configurations (together with the use of the Transaction API in the KafkaProducer) guarantee that all data sent by the producer is stored in Kafka exactly once. However, they do not guarantee that the consumer reads the data exactly once; that, of course, depends on your offset management.
Anyway, I understand your question as wanting to know how the consumer itself can process a consumed message exactly once.
For this you need to manage your offsets on your own in an atomic way. That means you need to build your own "transaction" around:
fetching data from Kafka,
processing data, and
storing processed offsets externally.
The methods commitSync and commitAsync will not get you far here, as they can only ensure at-most-once or at-least-once processing within the consumer. In addition, it is beneficial if your processing is idempotent.
There is a nice blog that explains such an implementation making use of the ConsumerRebalanceListener and storing the offsets in your local file system. A full code example is also provided.
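As a rough illustration of that approach, here is a sketch; loadOffset(), saveOffset(), and process() are hypothetical helpers over your own store, and saveOffset() would need to commit in the same transaction as the processed results to be truly atomic:

    import java.time.Duration;
    import java.util.Collection;
    import java.util.List;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    // Exactly-once-style consumption via self-managed offsets.
    void consumeWithOwnOffsets(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(List.of("input"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Resume from our own store instead of __consumer_offsets.
                for (TopicPartition tp : partitions) {
                    consumer.seek(tp, loadOffset(tp)); // hypothetical helper
                }
            }
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }
        });

        while (true) {
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                process(record); // hypothetical processing step
                // Persist the next offset to read, ideally in the same
                // transaction as the result of process().
                saveOffset(new TopicPartition(record.topic(), record.partition()),
                           record.offset() + 1);
            }
        }
    }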
"do I need to resort to Kafka transaction APIs to create a transactional producer in the consumer polling loop"
The Transaction API is only available for the KafkaProducer and, as far as I am aware, cannot be used for your offset management.
'Exactly-once' functionality in Kafka can be achieved by a combination of these 3 settings:
isolation.level = read_committed
transactional.id = <unique_id>
processing.guarantee = exactly_once
More information on enabling the exactly-once functionality:
https://www.confluent.io/blog/simplified-robust-exactly-one-semantics-in-kafka-2-5/
https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/
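For orientation, a sketch of where each of those settings lives: the first is a consumer config, the second a producer config, and processing.guarantee belongs to Kafka Streams. The transactional id value is a placeholder:

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.streams.StreamsConfig;

    // Consumer: only read records from committed transactions.
    Properties consumerProps = new Properties();
    consumerProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");

    // Producer: a stable transactional.id enables the transaction API
    // (and implies enable.idempotence=true).
    Properties producerProps = new Properties();
    producerProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "my-transactional-id");

    // Kafka Streams: exactly-once processing guarantee
    // (newer clients use "exactly_once_v2").
    Properties streamsProps = new Properties();
    streamsProps.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, "exactly_once");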

Kafka: How to retrieve a response from consumer?

I wish to describe the following scenario:
I have a node.js backend application (it uses a single-threaded event loop).
This is the general architecture of the system:
Producer -> Kafka -> Consumer -> Database
Let's say that the producer sends a message to Kafka, and the purpose of this message is to make a certain query against the database and retrieve the query result.
However, as we all know, Kafka is an asynchronous system. If the producer sends a message to Kafka, it gets a response that the message has been accepted by a Kafka broker; the broker doesn't wait until the consumer polls the message and processes it.
In this case, how can the producer get the query result operated on the database?
The flow using Kafka will look like this: the only way for Producer A to be aware of what happened to the message consumed by Consumer A is for that consumer to produce another message, which will be handled by whatever other consumer is available (in this case, Consumer B).
As you already mentioned, this flow is asynchronous. That can be useful when your query involves very heavy processing, like report generation, with the second producer notifying a user's inbox, for example.
If that is not the case, perhaps you should use HTTP, which is synchronous, so you have the response at the end of processing.
You must create a new flow to communicate the query result (sketched below):
Consumer (now a producer) -> Kafka topic -> Producer (now a consumer)
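A minimal sketch of the responder side of that flow in Java; the topic names and runDatabaseQuery() are assumptions:

    import java.time.Duration;
    import java.util.List;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Consumer A reads the query request, runs it, and produces the result
    // to a second topic that the original producer consumes from.
    void serveQueries(KafkaConsumer<String, String> consumer,
                      KafkaProducer<String, String> producer) {
        consumer.subscribe(List.of("queries"));
        while (true) {
            for (ConsumerRecord<String, String> request : consumer.poll(Duration.ofSeconds(1))) {
                String result = runDatabaseQuery(request.value()); // hypothetical DB call
                // Propagate the key so the requester can match the reply.
                producer.send(new ProducerRecord<>("query-results", request.key(), result));
            }
        }
    }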
You should consider using another synchronous communication mechanism like HTTP.

Message queue (like RabbitMQ) or Kafka for Microservices?

We are starting a new project and are evaluating the tech stack for asynchronous communication between microservices. We are considering RabbitMQ and Kafka for this.
Can anyone shed some light on the key considerations for deciding between the two?
Thanks
The selection depends on what exactly your microservices need; each of the two offers something the other doesn't.
RabbitMQ in a nutshell
Who are the players:
Consumer
Publisher
Exchange
Route
The flow starts from the publisher, which sends a message to an exchange. The exchange is a middleware layer that knows how to route the message to a queue; consumers define which queue they consume from (by defining a binding). RabbitMQ pushes the message to the consumer, and once it has been consumed and an acknowledgment has arrived, the message is removed from the queue.
Any piece in this system can be scaled out: the producer, the consumer, and RabbitMQ itself can be clustered and made highly available.
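A minimal sketch of this flow with the Java AMQP client; the exchange, queue, and routing key names are placeholders:

    import com.rabbitmq.client.BuiltinExchangeType;
    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class RabbitFlow {
        public static void main(String[] args) throws Exception {
            Connection conn = new ConnectionFactory().newConnection();
            Channel channel = conn.createChannel();

            channel.exchangeDeclare("orders-exchange", BuiltinExchangeType.DIRECT);
            channel.queueDeclare("orders-queue", true, false, false, null);
            channel.queueBind("orders-queue", "orders-exchange", "orders"); // binding

            // The publisher sends to the exchange, never directly to the queue.
            channel.basicPublish("orders-exchange", "orders", null, "hello".getBytes());

            // RabbitMQ pushes the message to the consumer; once acknowledged,
            // the message is removed from the queue.
            channel.basicConsume("orders-queue", false, (tag, delivery) -> {
                System.out.println(new String(delivery.getBody()));
                channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
            }, tag -> { });
        }
    }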
Kafka
Who are the players
Consumer / Consumer groups
Producer
Kafka source connect
Kafka sink connect
Topic and topic partition
Kafka stream
Broker
Zookeeper
Kafka is a robust system with several members in the game, but once you understand the flow well, it becomes easy to manage and work with.
A producer sends a message record to a topic; a topic is a category or feed name to which records are published, and it can be partitioned to get better performance. Consumers subscribe to a topic and start to pull messages from it. When a topic is partitioned, each partition gets its own consumer instance; we call all instances of the same consumer a consumer group.
In Kafka, messages always remain in the topic, even once they have been consumed, for as long as the retention policy allows.
Also, Kafka uses sequential disk I/O. This approach boosts Kafka's performance and makes it a leading option among queue implementations, and a safe choice for big data use cases.
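To make the partition/consumer-group relationship concrete, here is a small sketch using the AdminClient; the topic name, partition count, and replication factor are arbitrary:

    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreatePartitionedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // 3 partitions, replication factor 1: up to 3 consumers in the
                // same consumer group can read this topic in parallel, one
                // partition each.
                admin.createTopics(List.of(new NewTopic("events", 3, (short) 1)))
                     .all().get();
            }
        }
    }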
Use Kafka if you need
Time travel/durable/commit log
Many consumers for the same message
High throughput
Stream processing
Replicability
High availability
Message order
Use RabbitMQ if you need:
Flexible routing
Priority Queue
A standard protocol message queue
For more info
In order to select a message broker, I think this list could be really helpful.
- Supported programming languages: you probably should pick one that supports a variety of programming languages.
- Supported messaging standards: does the message broker support any standards, such as AMQP and STOMP, or is it proprietary?
- Messaging order: does the message broker preserve the ordering of messages?
- Delivery guarantees: what kind of delivery guarantees does the broker make?
- Persistence: are messages persisted to disk and able to survive broker crashes?
- Durability: if a consumer reconnects to the message broker, will it receive the messages that were sent while it was disconnected?
- Scalability: how scalable is the message broker?
- Latency: what is the end-to-end latency?
- Competing consumers: does the message broker support competing consumers?
Kafka vs RabbitMQ:
Kafka is a distributed streaming platform working on the pub-sub model; RabbitMQ is a message broker that works on both pub-sub and queue-based models.
Kafka has no out-of-the-box support for retries and DLQ; RabbitMQ supports retries and DLQ out of the box (via DLX).
Kafka consumers can't filter messages specifically; RabbitMQ's topic and header exchanges facilitate consumer-based message filtering.
In Kafka, messages are retained until their retention period expires, even after being read; in RabbitMQ, messages are gone as soon as they are consumed and acknowledged.
Kafka does not support scheduled or delayed message routing; RabbitMQ supports scheduled and delayed routing of messages.
Kafka scales horizontally; RabbitMQ scales vertically.
Kafka uses a pull-based approach; RabbitMQ pushes messages to consumers.
Kafka supports event replay with consumer groups; RabbitMQ has no way to replay events.