Idempotent Kafka Producer writing to multi-partitioned topic

Does an idempotent producer have to be transactional in order to ensure idempotency when publishing to a multi-partitioned topic? After reading the Kafka documentation, I am still unsure whether it does or not.
My environment is a Kafka 1.0 cluster with a Kafka 1.1 client.

An idempotent producer is assigned a producer id (PID), which is sent along with the messages, together with a sequence number kept per partition. With these, the partition leader is able to say 'Oh, I already processed this message' and drop the duplicate. Because deduplication happens per partition, it works just as well when publishing to a multi-partitioned topic.
The idempotent producer and transactional messaging are two different approaches to achieving exactly-once messaging semantics.
So, no!
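For reference, a minimal sketch of enabling it on a plain Java producer (the broker address and topic name are placeholders). Setting enable.idempotence=true is all that is needed; no transactional.id is involved.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Implies acks=all and retries > 0; the broker deduplicates retried
        // batches per partition using the producer id and sequence numbers.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // These records may land on different partitions; deduplication
            // still holds because sequence numbers are tracked per partition.
            producer.send(new ProducerRecord<>("multi-partition-topic", "key-1", "value-1"));
            producer.send(new ProducerRecord<>("multi-partition-topic", "key-2", "value-2"));
        }
    }
}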

Related

Message queue (like RabbitMQ) or Kafka for Microservices?

We are starting a new project where we are evaluating the tech stack for asynchronous communication between microservices. We are considering RabbitMQ and Kafka for this.
Can anyone shed some light on the key considerations for deciding between these two?
Thanks
The selection depends on what exactly your microservices need; each has something the other doesn't.
RabbitMQ in a nutshell
Who are the players:
Consumer
Publisher
Exchange
Route
The flow starts from the Publisher, which sends a message to an Exchange. The Exchange is a middleware layer that knows how to route the message to a queue. Consumers define which queue they consume from (by defining a binding). RabbitMQ pushes the message to the consumer, and once it has been consumed and an acknowledgment has arrived, the message is removed from the queue.
Any piece in this system can be scaled out: the producer, the consumer, and RabbitMQ itself can be clustered and made highly available.
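A minimal sketch of that flow with the RabbitMQ Java client (amqp-client 5.x assumed; the exchange, queue, and routing-key names are made up): the publisher sends to an exchange, a binding routes the message to a queue, and RabbitMQ pushes it to the consumer, which acknowledges it.

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

public class RabbitFlow {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // placeholder broker
        try (Connection conn = factory.newConnection();
             Channel channel = conn.createChannel()) {
            channel.exchangeDeclare("orders", "direct");                       // the Exchange
            channel.queueDeclare("order-processing", true, false, false, null);
            channel.queueBind("order-processing", "orders", "order.created");  // the binding (route)
            channel.basicPublish("orders", "order.created", null,
                    "order #42".getBytes("UTF-8"));                            // the Publisher

            DeliverCallback onMessage = (consumerTag, delivery) -> {
                System.out.println("got: " + new String(delivery.getBody(), "UTF-8"));
                // once acknowledged, the message is removed from the queue
                channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
            };
            channel.basicConsume("order-processing", false, onMessage, consumerTag -> {}); // Consumer (pushed to)
            Thread.sleep(1000); // give the broker a moment to push the delivery before closing
        }
    }
}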
Kafka
Who are the players
Consumer / Consumer groups
Producer
Kafka source connect
Kafka sink connect
Topic and topic partition
Kafka stream
Broker
Zookeeper
Kafka is a robust system with several members in the game, but once you understand the flow well, it becomes easy to manage and work with.
A producer sends a message record to a topic. A topic is a category or feed name to which records are published; it can be partitioned to get better performance. Consumers subscribe to a topic and start to pull messages from it. When a topic is partitioned, each partition is assigned to its own consumer instance; all instances of the same consumer are called a consumer group (see the sketch below).
In Kafka, messages always remain in the topic, even after they have been consumed (for how long is defined by the retention policy).
Also, Kafka uses sequential disk I/O. This approach boosts Kafka's performance, making it a leading option among queue implementations and a safe choice for big data use cases.
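To make the flow concrete, a minimal consumer sketch (the broker address, topic name "orders", and group id "billing-service" are made-up placeholders): every instance started with the same group.id joins the same consumer group and is assigned a share of the topic's partitions.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // all instances sharing this id form one consumer group
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "billing-service");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                // consumers pull; records stay in the topic for other groups to read
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}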
Use Kafka if you need:
Time travel/durable/commit log
Many consumers for the same message
High throughput
Stream processing
Replicability
High availability
Message order
Use RabbitMQ if you need:
Flexible routing
Priority queues
A standard-protocol message queue
For more info
In order to select a message broker, I think this list could be really helpful.
Supported programming languages: You should probably pick one that supports a variety of programming languages.
Supported messaging standards: Does the message broker support any standards, such as AMQP and STOMP, or is it proprietary?
Message ordering: Does the message broker preserve the ordering of messages?
Delivery guarantees: What kind of delivery guarantees does the broker make?
Persistence: Are messages persisted to disk and able to survive broker crashes?
Durability: If a consumer reconnects to the message broker, will it receive the messages that were sent while it was disconnected?
Scalability: How scalable is the message broker?
Latency: What is the end-to-end latency?
Competing consumers: Does the message broker support competing consumers?
Kafka | RabbitMQ
It's a distributed streaming platform, working on the pub-sub model. | It's a message broker that works on pub-sub and queue-based models.
No out-of-the-box support for retries and DLQ. | Supports retries and DLQ out of the box (via DLX).
Consumers can't filter messages specifically. | Topic exchanges and header exchanges facilitate consumer-based message filtering.
Messages are retained until their validity (retention) period expires. | Messages are gone as soon as they are read.
Does not support scheduled or delayed message routing. | Supports scheduled and delayed routing of messages.
Scales horizontally. | Scales vertically.
Pull-based approach. | Push-based approach.
Supports event replay with consumer groups. | No way to replay events.

What is the general recipe for exactly-once semantics in Kafka Connect sinks?

I read that Kafka has added a lot to support exactly-once semantics: things like idempotent producers, transactional producers, read-committed consumers, and the Kafka Streams exactly-once config. From my understanding, though, this does not apply to Kafka Connect sinks? Would the general idea then be that the sink has to store the offsets itself?

Idempotent and Transactions

I am exploring Transactions in Kafka, and I want to understand all the details.
I noticed in Spring-Kafka that idempotence is enabled when you provide a transactionalId:
public void setTransactionIdPrefix(String transactionIdPrefix) {
    Assert.notNull(transactionIdPrefix, "'transactionIdPrefix' cannot be null");
    this.transactionIdPrefix = transactionIdPrefix;
    enableIdempotentBehaviour();
}
At first glance, I assumed Spring-Kafka enables idempotence in transactions because it is "good to have"; I assumed it was there to ensure exactly-once semantics in transactions.
I did a bit more digging and discovered that idempotence is required for transactions to work. This is mentioned in KIP-98:
Note that enable.idempotence must be enabled if a TransactionalId is
configured.
Kafka's idempotence is a feature to avoid duplicated messages, for example when a send is retried after a network error even though the broker had already written the message.
My understanding is that Kafka transactions basically write to an internal topic, and idempotence has to be enabled to avoid duplicates there.
Idempotence enables exactly-once semantics for producers.
Transactions enable exactly-once semantics for the chain consume -> produce.
Is my understanding correct?
What enables exactly-once for consumers only? Committing offsets, idempotence, or transactions?
The idempotent producer enables exactly-once for a single producer session, enforced per partition. Basically, each message send gets stronger guarantees and will not be duplicated in case there's an error.
The transactional producer, on the other hand, lets you group a number of sends (which can be across many partitions) together and have all of them (or none of them) applied. Transactions can also contain offset commits (in the end, committing offsets is the same as writing to a topic).
Because consumers fetch data from Kafka, consuming is sort of exactly-once already. When the consumer asks Kafka for messages from offset N and does not receive them, it will just retry; there can't be any duplication. The only exactly-once need for consumers is committing offsets, and that can be done by the transactional producer (the consumer needs to pass its current offsets to the producer), as sketched below.
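A minimal consume -> process -> produce sketch along those lines. The broker address, the topic names "input" and "output", the group id, the transactional id, and the toUpperCase "processing" step are all made-up placeholders; auto-commit is disabled so offsets are only committed inside the transaction.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalRelay {
    public static void main(String[] args) {
        Properties cp = new Properties();
        cp.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cp.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        cp.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit via the transaction instead
        cp.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cp.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Properties pp = new Properties();
        pp.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        pp.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "my-app-tx-1"); // forces enable.idempotence=true
        pp.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        pp.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
        KafkaProducer<String, String> producer = new KafkaProducer<>(pp);
        consumer.subscribe(Collections.singletonList("input"));
        producer.initTransactions();

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(100);
            if (records.isEmpty()) continue;
            producer.beginTransaction();
            try {
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> r : records) {
                    producer.send(new ProducerRecord<>("output", r.key(), r.value().toUpperCase()));
                    offsets.put(new TopicPartition(r.topic(), r.partition()),
                            new OffsetAndMetadata(r.offset() + 1));
                }
                // consumed offsets commit atomically with the produced records
                producer.sendOffsetsToTransaction(offsets, "my-group");
                producer.commitTransaction();
            } catch (KafkaException e) {
                producer.abortTransaction(); // neither output records nor offsets become visible
            }
        }
    }
}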

Can I get producer client id in kafka consumer?

When creating a Kafka producer, we can assign a client id. What is it used for? Can I get the producer's client id in a consumer, for example, to see which producer produced the message?
No, a consumer cannot get the producer's client-id.
From the Kafka documentation, client-ids are:
An id string to pass to the server when making requests. The purpose
of this is to be able to track the source of requests beyond just
ip/port by allowing a logical application name to be included in
server-side request logging.
They are only used for identifying clients in the broker logs.
No, you'd have to pass it along yourself, as part of the key, the value, or a record header, if you need it on the consumer side (see the sketch below).
Kafka's philosophy is to decouple producers and consumers. A topic can be read by 0-n consumers and be written to by 0-n producers. Kafka is usually used for communication between (micro)service boundaries where services don't care about who produced a message, just about its contents.
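A small sketch of that approach using record headers (available since Kafka 0.11); the header key "producer-client-id", the logical name "billing-producer-1", and the topic name are made up.

import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class ProducerIdentity {
    // producer side: stamp each record with this producer's logical name
    static void sendTagged(KafkaProducer<String, String> producer, String value) {
        ProducerRecord<String, String> record = new ProducerRecord<>("events", null, value);
        record.headers().add("producer-client-id",
                "billing-producer-1".getBytes(StandardCharsets.UTF_8));
        producer.send(record);
    }

    // consumer side: read the stamp back, if present
    static String origin(ConsumerRecord<String, String> record) {
        Header h = record.headers().lastHeader("producer-client-id");
        return h == null ? "unknown" : new String(h.value(), StandardCharsets.UTF_8);
    }
}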

Making Kafka producer and Consumer synchronous

I have one Kafka producer and one consumer. The producer publishes to one topic; the data is taken and some processing is done. The consumer reads from another topic that tells whether the processing of the data from topic 1 was successful or not, i.e. topic 2 carries success or failure messages. Now I start my consumer and then publish the data to topic 1. I want to make the producer and consumer synchronous: once the producer publishes the data, the consumer should read the success or failure message for that data, and only then should the producer proceed with the next set of data.
Apache Kafka, and publish/subscribe messaging in general, seeks to decouple producers and consumers through the use of streaming async events. What you are describing is more like a batch job or a synchronous Remote Procedure Call (RPC), where the producer and consumer are explicitly coupled together. The standard Apache Kafka producer/consumer APIs do not support this message exchange pattern, but you can always write your own simple wrapper on top of the Kafka APIs that uses correlation IDs, consumption ACKs, and request/response messages to build an interface that behaves as you wish (see the sketch below).
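A rough illustration of that wrapper idea, not a standard Kafka API: the request carries a correlation id in a record header, and the caller blocks while polling the reply topic until a response with the same id shows up. The topic names ("topic1", "topic2"), the header key, and the pre-built producer/consumer passed in are all assumptions.

import java.nio.charset.StandardCharsets;
import java.util.UUID;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class SyncRequester {
    private final KafkaProducer<String, String> producer;
    private final KafkaConsumer<String, String> replyConsumer; // already subscribed to "topic2"

    public SyncRequester(KafkaProducer<String, String> producer,
                         KafkaConsumer<String, String> replyConsumer) {
        this.producer = producer;
        this.replyConsumer = replyConsumer;
    }

    // sends one record to "topic1" and blocks until its reply arrives on "topic2"
    public String sendAndAwaitReply(String key, String payload) throws Exception {
        String correlationId = UUID.randomUUID().toString();
        ProducerRecord<String, String> request = new ProducerRecord<>("topic1", key, payload);
        request.headers().add("correlation-id", correlationId.getBytes(StandardCharsets.UTF_8));
        producer.send(request).get(); // wait for the broker to ack the request itself

        while (true) { // a real implementation would add a timeout here
            for (ConsumerRecord<String, String> reply : replyConsumer.poll(100)) {
                Header h = reply.headers().lastHeader("correlation-id");
                if (h != null && correlationId.equals(new String(h.value(), StandardCharsets.UTF_8))) {
                    return reply.value(); // e.g. "success" or "failure"
                }
            }
        }
    }
}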
Short answer: you can't do that; Kafka doesn't provide that support.
Long answer: as Hans explained, the publish/subscribe messaging model keeps publishers and subscribers completely unaware of each other, and I believe that is where the power of this model lies. A producer can produce without worrying about whether there is any consumer, and a consumer can consume without worrying about how many producers there are.
The closest you can get is to make your producer synchronous, meaning you wait until your message is received and acknowledged by the broker.
If you want to do that, block on each send or flush after every send.
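A small sketch of both variants (topic and values are placeholders): send() returns a Future, so get() blocks until that single record is acknowledged, while flush() blocks until everything buffered so far has been acknowledged.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

class SyncSend {
    static void publish(KafkaProducer<String, String> producer) throws Exception {
        // block until the broker acknowledges this one record
        RecordMetadata md = producer.send(new ProducerRecord<>("topic1", "key", "data")).get();
        System.out.println("acked at partition " + md.partition() + ", offset " + md.offset());
        producer.flush(); // or: block until all buffered sends are acknowledged
    }
}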