Confluent Kafka - Rate limiting

Rate limiting: As Kafka is able to generate messages at a much higher rate than MQ can consume, can we have some configuration set up on the Kafka consumer to enable a rate-limited transfer and protect the stability of MQ?
Also, exactly-once semantics: I understand that Kafka supports exactly-once semantics, which would stop the retransfer of messages that have already been consumed by consumers. Can someone guide me on how to set up this configuration?
We are using the Confluent Kafka enterprise version in our organization.

Rate limiting: Kafka is pull-based, so your consumer can read messages at its own pace and transfer them into MQ (but if the second system is constantly slower, the backlog of unprocessed messages in Kafka will grow over time).
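For illustration, here is a minimal sketch of such a throttled bridge using the plain Java consumer; the topic name, group id, and rate limit are made-up values, not anything Kafka prescribes:

```java
import org.apache.kafka.clients.consumer.*;
import java.time.Duration;
import java.util.*;

// A crude throttled bridge: cap the batch size per poll and sleep between
// batches so the downstream MQ is offered at most roughly
// MAX_MSGS_PER_SEC messages per second.
public class ThrottledBridge {
    static final int MAX_MSGS_PER_SEC = 100; // hypothetical downstream limit

    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "mq-bridge");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        // Cap how many records a single poll() may return.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, MAX_MSGS_PER_SEC);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("source-topic")); // example topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    forwardToMq(record.value());
                }
                Thread.sleep(1000); // at most one capped batch per second
            }
        }
    }

    static void forwardToMq(String payload) { /* your MQ producer call goes here */ }
}
```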
Exactly-once semantics: to ensure exactly-once semantics on the consumer side, you need to commit the read offset manually once the message is successfully processed. (The default behavior is an automatic commit of the read offset after a short interval; it can lead to loss of a message if a failure happens after the commit of the read offset but before the processing of the message has finished.)
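A minimal sketch of that manual-commit pattern with the plain Java consumer (topic and group names are examples):

```java
import org.apache.kafka.clients.consumer.*;
import java.time.Duration;
import java.util.*;

// At-least-once: auto-commit is off, and the offset is committed only
// after every record in the batch has been processed successfully.
public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "mq-bridge");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // no periodic commit
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("source-topic")); // example topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // must finish before the commit below
                }
                if (!records.isEmpty()) {
                    consumer.commitSync(); // a crash before this line replays the batch
                }
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) { /* hand off to MQ */ }
}
```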

Related

What happens to the kafka messages if the microservice crashes before kafka commit?

I am new to Kafka. I have a Kafka Streams based Java microservice that consumes messages from a Kafka topic produced by a producer and processes them. The Kafka commit interval has been set using auto.commit.interval.ms. My question is: if the microservice crashes before the commit, what will happen to the messages that got processed but didn't get committed? Will there be duplicated records? And how do I resolve this duplication, if it happens?
Kafka has exactly-once semantics, which guarantee the records will get processed only once. Take a look at this section of Spring Kafka's docs for more details on the Spring support for that. Also, see this section for the support for transactions.
Kafka provides various delivery semantics. Which one to use can be decided on the basis of the use case you've implemented.
If you're concerned that your messages should not get lost by the consumer service, you should go ahead with the at-least-once delivery semantic.
Now answering your question on the basis of at-least-once delivery semantics:
If your consumer service crashes before committing the Kafka message, it will re-read the message once your consumer service is up and running, because the offset for the partition was not committed. Once the message is processed by the consumer, the offset for the partition is committed. In simple words, a committed offset tells Kafka that the message has been processed, and Kafka will not resend a committed message for the same partition.
At-least-once delivery semantics are usually good enough for use cases where data duplication is not a big issue or deduplication is possible on the consumer side. For example, with a unique key in each message, a message can be rejected when it would write duplicate data to the database.
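A sketch of such a rejection using JDBC; the table and column names are made up, and the ON CONFLICT clause is PostgreSQL syntax (other databases have equivalents, e.g. INSERT IGNORE in MySQL):

```java
import java.sql.*;

// Idempotent write: rely on a unique key so replayed messages are ignored.
// Hypothetical table: events(event_id TEXT PRIMARY KEY, payload TEXT).
class EventStore {
    static void storeEvent(Connection db, String eventId, String payload) throws SQLException {
        String sql = "INSERT INTO events (event_id, payload) VALUES (?, ?) " +
                     "ON CONFLICT (event_id) DO NOTHING";
        try (PreparedStatement ps = db.prepareStatement(sql)) {
            ps.setString(1, eventId);
            ps.setString(2, payload);
            int inserted = ps.executeUpdate(); // 0 means this message was a duplicate
            if (inserted == 0) {
                System.out.println("Duplicate message skipped: " + eventId);
            }
        }
    }
}
```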
There are mainly three types of delivery semantics:
At most once -
Offsets are committed as soon as the message is received at the consumer.
It's a bit risky: if the processing goes wrong, the message will be lost.
At least once -
Offsets are committed after the message is processed, so it's usually the preferred one.
If the processing goes wrong, the message will be read again since it has not been committed.
The problem with this is duplicate processing of messages, so make sure your processing is idempotent. (Yes, your application should handle duplicates; Kafka won't help here.)
Idempotent means that processing the same message again will not impact your system.
Exactly once -
Can be achieved for Kafka-to-Kafka communication using the Kafka Streams API.
That's not your case.
You can choose the semantics from the above as per your requirement.
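As a rough illustration of where the commit lands in the first two cases, here is a fragment reusing the consumer and process() handler from the snippets above (not a complete program):

```java
// At-most-once: commit right after poll(), before processing, so a crash
// during processing loses the batch instead of replaying it.
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
consumer.commitSync();      // commit first...
for (ConsumerRecord<String, String> r : records) {
    process(r);             // ...then process (records may be lost on crash)
}

// At-least-once: process first, commit afterwards; a crash before the
// commit replays the batch, so process() must be idempotent.
for (ConsumerRecord<String, String> r : records) {
    process(r);
}
consumer.commitSync();
```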

Kafka offset management: auto vs manual

I'm working on a Spring Boot application which uses Kafka Streams. In my application, I want to manage the Kafka offsets and commit an offset only in the case of successful message processing. This is important to be certain I won't lose messages even if Kafka is restarted or ZooKeeper is down. My current situation is that when my Kafka goes down and comes back up, my consumer starts from the beginning and consumes all the previous messages.
Also, I need to know what the difference is between managing the Kafka offsets automatically using autoCommitOffset and managing them manually using HBase or ZooKeeper or checkpoints.
Also, what are the benefits of managing them manually if there is an automatic config we can use?
You have no guarantee of durability with auto commit
Older Kafka clients did use ZooKeeper for offset storage, but now it is all in the broker to minimize dependencies. The Kafka Streams API has no way to integrate offset storage outside of Kafka itself, so you must use the Consumer API to look up and seek/commit offsets to external storage, if you choose to do so; however, you can still then end up with less-than-optimal message processing.
my current situation is when my Kafka is down and up my consumer starts from the beginning and consumes all the previous messages
Sounds like you set auto.offset.reset=earliest and you never commit any offsets at all...
The auto commit setting does a periodic commit, not "automatic after reading any message".
If you want to guarantee delivery, then you need to set at least acks=1 in the producer and actually do a commitSync in the consumer.
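A sketch of the producer half (the topic name is an example; acks=all would be the stricter choice). The commitSync side looks like the manual-commit consumer shown earlier in this page:

```java
import org.apache.kafka.clients.producer.*;
import java.util.Properties;

// acks=1: the partition leader must write the record before acknowledging;
// acks=all would additionally wait for the in-sync replicas.
public class AckedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.ACKS_CONFIG, "1");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace(); // the leader never acknowledged
                        }
                    });
        } // close() flushes pending sends
    }
}
```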

Message queue (like RabbitMQ) or Kafka for Microservices?

We are starting a new project, where we are evaluating the tech stack for asynchronous communication between microservices. We are considering RabbitMQ and Kafka for this.
Can anyone shed some light on the key considerations for deciding between these two?
Thanks
Selection depends on what exactly your microservices need; each offers something the other doesn't.
RabbitMQ in a nutshell
Who are the players:
Consumer
Publisher
Exchange
Route
The flow starts from the publisher, which sends a message to an exchange. The exchange is a middleware layer that knows how to route the message to a queue; consumers define which queue they consume from (by defining a binding). RabbitMQ pushes the message to the consumer, and once it is consumed and an acknowledgment has arrived, the message is removed from the queue.
Any piece in this system can be scaled out: producer, consumer, and RabbitMQ itself can be clustered and made highly available.
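A minimal sketch of that flow with the RabbitMQ Java client; the exchange, queue, and routing key names are made up:

```java
import com.rabbitmq.client.*;
import java.nio.charset.StandardCharsets;

// Publisher -> exchange -> (binding) -> queue -> consumer, with manual ack.
public class RabbitFlow {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        try (Connection conn = factory.newConnection();
             Channel channel = conn.createChannel()) {

            channel.exchangeDeclare("orders-ex", "direct");               // the exchange
            channel.queueDeclare("orders-q", true, false, false, null);   // a durable queue
            channel.queueBind("orders-q", "orders-ex", "order.created");  // the binding

            // The publisher sends to the exchange, never to the queue directly.
            channel.basicPublish("orders-ex", "order.created", null,
                                 "hello".getBytes(StandardCharsets.UTF_8));

            // RabbitMQ pushes deliveries to the consumer; the message leaves
            // the queue once it is acknowledged.
            channel.basicConsume("orders-q", false, (tag, delivery) -> {
                System.out.println(new String(delivery.getBody(), StandardCharsets.UTF_8));
                channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
            }, tag -> { });

            Thread.sleep(1000); // give the delivery a moment before closing
        }
    }
}
```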
Kafka
Who are the players
Consumer / Consumer groups
Producer
Kafka source connect
Kafka sink connect
Topic and topic partition
Kafka stream
Broker
Zookeeper
Kafka is a robust system and has several players in the game, but once you understand the flow well, it becomes easy to manage and work with.
A producer sends a message record to a topic; a topic is a category or feed name to which records are published. It can be partitioned to get better performance. Consumers subscribe to a topic and start to pull messages from it. When a topic is partitioned, each partition gets its own consumer instance; we call all instances of the same consumer a consumer group.
In Kafka, messages always remain in the topic, even after they have been consumed (the time limit is defined by the retention policy).
Also, Kafka uses sequential disk I/O; this approach boosts the performance of Kafka and makes it a leading option among queue implementations, and a safe choice for big data use cases.
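For illustration, the partition count is fixed when the topic is created; a sketch with the Java AdminClient (the topic name, partition count, and replication factor are examples):

```java
import org.apache.kafka.clients.admin.*;
import java.util.*;

// Create a topic with 6 partitions so up to 6 consumers in the same
// consumer group can read it in parallel (one partition per instance).
public class CreatePartitionedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("orders", 6, (short) 3); // name, partitions, replication
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```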
Use Kafka if you need
Time travel/durable/commit log
Many consumers for the same message
High throughput
Stream processing
Replicability
High availability
Message order
Use RabbitMQ if you need:
Flexible routing
Priority Queue
A standard protocol message queue
For more info
In order to select a message broker, I think this list could be really helpful.
- Supported programming languages: You probably should pick one that supports a variety of programming languages.
- Supported messaging standards: Does the message broker support any standards, such as AMQP and STOMP, or is it proprietary?
- Messaging order: Does the message broker preserve the ordering of messages?
- Delivery guarantees: What kind of delivery guarantees does the broker make?
- Persistence: Are messages persisted to disk and able to survive broker crashes?
- Durability: If a consumer reconnects to the message broker, will it receive the messages that were sent while it was disconnected?
- Scalability: How scalable is the message broker?
- Latency: What is the end-to-end latency?
- Competing consumers: Does the message broker support competing consumers?
Kafka vs RabbitMQ:
Kafka: a distributed streaming platform, working on the pub-sub model. RabbitMQ: a message broker that works on both pub-sub and queue-based models.
Kafka: no out-of-the-box support for retries and DLQ. RabbitMQ: supports retries and DLQ out of the box (via DLX).
Kafka: consumers can't filter messages specifically. RabbitMQ: topic exchanges and header exchanges facilitate consumer-based message filtering.
Kafka: messages are retained until their retention period expires, even after being read. RabbitMQ: messages are gone as soon as they are consumed.
Kafka: does not support scheduled or delayed message routing. RabbitMQ: supports scheduled and delayed routing of messages.
Kafka: scales horizontally. RabbitMQ: scales vertically.
Kafka: pull-based approach. RabbitMQ: push-based approach (the broker pushes deliveries to consumers).
Kafka: supports event replay with consumer groups. RabbitMQ: no way to replay events.

Kafka: Why broker isn't pull based like consumers

I was reading the Kafka docs, where it was mentioned that:
Consumers pull data from the broker by requesting from an offset.
Producers push messages to the broker.
Making Kafka consumers pull-based makes sense: the consumers can drive the pace, and the broker can store the data for a really long time.
However, with producers being push-based, how does Kafka make sure that a speed mismatch between the producer and Kafka won't happen? Producers also don't have persistence by design. This seems to be a bigger problem when producers and brokers are separated over a high-latency network (the internet).
As a distributed commit log, Kafka solves exactly this (impedance mismatch). You produce your events at the rate at which they occur into Kafka, and then you consume them at the rate at which your application can. The data is persisted in Kafka regardless. If your application needs to consume at a greater rate, you scale it out and partition your topic and consume in parallel. Because the data is persisted the only factor is how fast you want to consume the data.

Preventing message loss with Kafka High Level Consumer 0.8.x

A typical kafka consumer looks like the following:
kafka-broker ---> kafka-consumer ---> downstream-consumer (like Elasticsearch)
And according to the documentation for Kafka High Level Consumer:
The ‘auto.commit.interval.ms’ setting is how often updates to the consumed offsets are written to ZooKeeper.
It seems that there can be message loss if the following two things happen:
Offsets are committed just after some messages are retrieved from kafka brokers.
Downstream consumers (say Elastic-Search) fail to process the most recent batch of messages OR the consumer process itself is killed.
It would perhaps be most ideal if the offsets were not committed automatically based on a time interval, but were instead committed through an API. This would make sure that the kafka-consumer can signal the committing of offsets only after it receives an acknowledgement from the downstream-consumer that it has successfully consumed the messages. There could be some replay of messages (if the kafka-consumer dies before committing offsets), but there would at least be no message loss.
Please let me know if such an API exists in the High Level Consumer.
Note: I am aware of the Low Level Consumer API in the 0.8.x version of Kafka, but I do not want to manage everything myself when all I need is just one simple API in the High Level Consumer.
Ref:
AutoCommitTask.run(), look for commitOffsetsAsync
SubscriptionState.allConsumed()
There is a commitOffsets() API in the High Level Consumer API that can be used to solve this.
Also set the option "auto.commit.enable" to "false" so that the offsets are never committed automatically by the Kafka consumer.
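A rough sketch of that combination with the (long-deprecated) 0.8.x high-level consumer, as far as I recall that API; the topic and group names are examples:

```java
import kafka.consumer.*;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.message.MessageAndMetadata;
import java.util.*;

// 0.8.x high-level consumer: auto-commit disabled, offsets committed
// only after the downstream system has acknowledged the message.
public class ManualCommitHighLevelConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181");
        props.put("group.id", "es-indexer");
        props.put("auto.commit.enable", "false"); // never commit automatically

        ConsumerConnector connector =
            Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        Map<String, Integer> topicCount = Collections.singletonMap("logs", 1); // example topic
        KafkaStream<byte[], byte[]> stream =
            connector.createMessageStreams(topicCount).get("logs").get(0);

        for (MessageAndMetadata<byte[], byte[]> msg : stream) {
            indexIntoElasticsearch(msg.message()); // wait for the downstream ack
            connector.commitOffsets();             // then commit to ZooKeeper
        }
    }

    static void indexIntoElasticsearch(byte[] payload) { /* downstream write */ }
}
```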