Kafka transactions - why do I need to replicate? - apache-kafka

I am using Kafka for a circular buffer of the last 24 hours of events.
I have 4 brokers that run on ephemeral cloud instances, so the disk is local; if a broker dies, I lose the data for that broker. I can start the broker again and it can replicate the data from another broker. I have replicas set up for my topic and the offsets topic:
default.replication.factor=2
offsets.topic.replication.factor=2
I'm using transactions to commit the new offsets + new records atomically. My app is side-effect free, so if the transaction fails, I can poll again, get the same records, repeat the processing, and produce the same resultant events.
These are the defaults for the transaction durability properties:
transaction.state.log.min.isr=2
transaction.state.log.replication.factor=3
I feel that in my setup I can set both of these properties to 1, i.e. no replication/durability (as my app is side-effect free). Yet for some reason I can't shake the niggling feeling that I'm wrong.
Am I wrong? Why are the transactions durable in the first place - what scenario does the durability help with?
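For reference, the consume-process-produce loop described above (new offsets committed in the same transaction as the output records) looks roughly like the sketch below; the broker address, topic names, group id and transactional id are placeholders, not part of the original setup.

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.TopicPartition;

public class TransactionalLoop {
    public static void main(String[] args) {
        Properties pp = new Properties();
        pp.put("bootstrap.servers", "broker1:9092");
        pp.put("transactional.id", "buffer-app-1");   // enables transactions (and idempotence)
        pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pp.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Properties cp = new Properties();
        cp.put("bootstrap.servers", "broker1:9092");
        cp.put("group.id", "buffer-app");
        cp.put("enable.auto.commit", "false");         // offsets are committed inside the transaction
        cp.put("isolation.level", "read_committed");   // only read records from committed transactions
        cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(pp);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp)) {
            producer.initTransactions();
            consumer.subscribe(Collections.singletonList("events-in"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) continue;
                producer.beginTransaction();
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> rec : records) {
                    // Side-effect-free processing: re-running this block yields the same output.
                    producer.send(new ProducerRecord<>("events-out", rec.key(), rec.value()));
                    offsets.put(new TopicPartition(rec.topic(), rec.partition()),
                                new OffsetAndMetadata(rec.offset() + 1));
                }
                // Offsets and output records become visible atomically, or not at all.
                producer.sendOffsetsToTransaction(offsets, "buffer-app");
                producer.commitTransaction();
            }
        }
    }
}

If the transaction fails before commitTransaction, nothing becomes visible to read_committed consumers, so re-polling and reprocessing produces the same result, which is the property the question relies on.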

Related

If the partition a Kafka producer tries to send messages to goes offline, can the producer try to send to a different partition?

My Kafka cluster has 5 brokers and the replication factor is 3 for topics. At some time some partitions went offline but eventually they went back online. My questions are:
How many brokers must have been down, given that there were offline partitions? I think with the cluster setup above I can afford to lose 2 brokers at the same time. However, if 2 brokers were down, some partitions would no longer have a quorum; would those partitions go offline in that case?
If there are offline partitions, and a Kafka producer tries to send messages to them and fails, will the producer try a different partition that may be online? The messages have no key in them.
I'm not sure I understood your question completely, but I have the impression that you are mixing up partitions and replicas. Or at least, your question cannot be looked at in isolation on the producer side: as soon as one broker is down, several things happen on the cluster.
Each TopicPartition has one partition leader, and your clients (e.g. producer and consumer) only ever communicate with this one leader, independent of the number of replicas.
In the case where two out of five brokers are not available, Kafka will move the partition leaders as well as the replicas to healthy brokers. In that scenario you should therefore not get into trouble, although it might take some time and retries for a new leader to be elected and new replicas to be created on the healthy brokers. Leader election can happen quickly because you have set the replication factor to three, so even if two brokers go down, one broker should still have the complete data (assuming all partitions were in sync). However, creating two new replicas could take some time depending on the amount of data. For that scenario you need to look into the topic-level configuration min.insync.replicas and the KafkaProducer configuration acks (see below).
I think the following are the most important configurations for your KafkaProducer to handle such a situation:
bootstrap.servers: If you are anticipating regular connection problems with your brokers, you should ensure that you list all five of them. Although it is sufficient to mention only one address (as that broker will then communicate with all other brokers in the cluster), it is safer to have them all listed in case one or even two brokers are not available.
acks: This defaults to 1 and defines the number of acknowledgments the producer requires the partition leader to have received before considering a request as successful. Possible values are 0, 1 and all.
retries: This value defaults to 2147483647 and will cause the client to resend any record whose send fails with a potentially transient error until delivery.timeout.ms is reached.
delivery.timeout.ms: An upper bound on the time to report success or failure after a call to send() returns. This limits the total time that a record will be delayed prior to sending, the time to await acknowledgement from the broker (if expected), and the time allowed for retriable send failures. The producer may report failure to send a record earlier than this config if either an unrecoverable error is encountered, the retries have been exhausted, or the record is added to a batch which reached an earlier delivery expiration deadline. The value of this config should be greater than or equal to the sum of request.timeout.ms and linger.ms.
You will find more details in the documentation on the producer configs.
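To illustrate, a producer configured along those lines could look like this fragment (broker addresses and timeout values are placeholders, not recommendations):

Properties props = new Properties();
// List all five brokers so the client can still bootstrap when one or two are down.
props.put("bootstrap.servers",
          "broker1:9092,broker2:9092,broker3:9092,broker4:9092,broker5:9092");
// Require acknowledgement from all in-sync replicas before a send is considered successful.
props.put("acks", "all");
// Keep retrying transient errors until the delivery timeout expires.
props.put("retries", Integer.MAX_VALUE);
// delivery.timeout.ms must be >= request.timeout.ms + linger.ms.
props.put("delivery.timeout.ms", 120000);
props.put("request.timeout.ms", 30000);
props.put("linger.ms", 0);
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
KafkaProducer<String, String> producer = new KafkaProducer<>(props);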

Kafka Producer Timeout Exception : even with max request timeout and proper batch size

We currently have around 80 applications (around 200 K8s replicas) writing 16-17 million records every day to Kafka, and some of those records were failing intermittently with timeout and rebalance exceptions. The failure rate was less than 0.02%.
We have validated and configured all the parameters properly as suggested by other stackoverflow links and still we are getting multiple issues.
One issue is related to rebalancing; we are facing it on both the producer and the consumer side. For the consumer, we are using auto commit, and sometimes Kafka rebalances and the consumer receives duplicate records. We didn't put in any duplicate check because it would reduce the rate of processing, and the duplicate record rate is less than 0.1%. We are thinking of going to manual commit and offset management using a database. But we need to understand, from the Kafka brokers' perspective, why rebalancing is happening on a daily basis.
Producer Error:
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
The second issue is the TimeoutException. It happens intermittently for some of the apps: the producer tries to send a record, it gets added to the batch, but it is not delivered before the request timeout, which we have increased to 5 minutes. Ideally Kafka should retry at some interval. During debugging, we found that the record accumulator expires previous batches without even trying to send them when the request times out - is that the expected behavior? Can we add a retry for this somehow?
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for xxTopic-41:300014 ms has passed since batch creation.
Configuration:
1. 5 brokers and 3 ZooKeeper nodes - Kafka version 2.2
2. Brokers are running in Kubernetes with a StatefulSet.
3. Each broker has 32 GB memory and 8 CPUs, as recommended by Confluent for production.
4. The topic has 200 partitions and 8 replica consumers.
5. Each consumer handles only around 25-30 threads. Each consumer has 4 GB memory and 4 CPUs.
#Value("${request.timeout:300000}") <br/>
private String requestTimeOut;
#Value("${batch.size:8192}") <br/>
private String batchSize;
#Value("${retries:5}") <br/>
private Integer kafkaRetries;
#Value("${retry.backoff.ms:10000}") <br/>
private Integer kafkaRetryBackoffMs;
As we are from the development team, we don't have much insight into the networking aspect. We need help understanding whether this is related to network congestion or whether we need to improve something in the application itself. We didn't face any issues when the load was less than 10 million per day; with a lot of new apps sending messages and the increased load, we are seeing the two issues mentioned above intermittently.
Regarding the error from the producer side, make sure to include all the brokers that are partition leaders for your topic. You can find out which broker is the leader of a partition by running:
./kafka-topics.sh \
--zookeeper zookeeper-host:2181 \
--describe \
--topic your-topic-name
Topic: your-topic-name PartitionCount:3 ReplicationFactor:1
Topic: your-topic-name Partition: 0 Leader: 2 Replicas: 2 Isr: 2
Topic: your-topic-name Partition: 1 Leader: 0 Replicas: 0 Isr: 0
Topic: your-topic-name Partition: 2 Leader: 1 Replicas: 1 Isr: 1
In the above example, you'd have to provide all the addresses for brokers 0, 1 and 2.
I think I'd have to disagree with Giorgos, though I may be just misunderstanding his point.
Producers and consumers only ever talk to the partition leader - the other replicas are just on standby in case the leader goes down, and they act as consumers to the leader to keep their data up to date.
However, on application startup the client code can connect to any one of the brokers and will find out the partition leaders when it fetches metadata. This metadata is then cached on the client side; if there is a change in leadership at the broker end, that is when you will see a NotLeaderForPartitionException, which prompts the client code to fetch metadata again to get the current set of partition leaders. Leadership election does take time, so there will be some delay during this process, but it is a sign that the replication and broker-side resilience are working correctly.
On the consumer side, manual commit vs. autocommit will make no difference if you use the standard offset commit topic - autocommit just means that each time you poll, the previously processed messages will be committed (actually, possibly not on EVERY poll), but this is likely the same thing you would do manually. Storing offsets in a database will help keep things transactional if processing a message means updating data in that database - in that case you can commit offsets and processed data in the same DB transaction.
Basically, as I'm fairly sure you realise, duplicates are normally an inevitable part of consumer scalability, as it allows any consumer process to pick up a partition and continue from the last committed offset. Duplicates happen when a consumer has processed part of a batch and is then considered to be offline, either because the process has actually died or because it took too long to process a batch. To avoid duplicates you have to ensure that every processed message is associated with a commit in the same transaction. The cost is normally throughput, but as you suggest, committing each message manually rather than at batch level, and storing offsets in the same DB transaction, can prevent duplicate consumption.
On the question of why a rebalance happens, there are only 2 reasons - a change in the number of partitions on the topic, or a perceived change in consumer group membership. There are two possible causes for the latter: the heartbeat thread has stopped, which normally means the consumer application has stopped, or processing a batch has exceeded max.poll.interval.ms (this configuration is intended to stop livelock, where a consumer is alive and sending heartbeats but has stopped polling). The latter is the normal cause of rebalances outside application restarts - inevitably there is sometimes a bit of lag somewhere in any system, so consumer rebalances are generally considered normal if they don't happen too often because of a bit of delay processing a batch.
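If slow batch processing is the suspected trigger, these consumer properties are the usual knobs to look at (values below are illustrative, not recommendations):

Properties props = new Properties();
// Allow more time per poll loop before the broker assumes the consumer is stuck and rebalances.
props.put("max.poll.interval.ms", 600000);   // default is 300000 (5 minutes)
// Fetch fewer records per poll so each batch is processed well within that interval.
props.put("max.poll.records", 100);          // default is 500
// Covers heartbeat loss (process death, network); heartbeats come from a background thread.
props.put("session.timeout.ms", 30000);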
I'm not sure on the producer side issues - in my case I handle duplicates in the consumer, and in the producer I just allow a high number of retries, with acks=all (essential if you can't afford to lose messages) and 1 maximum in-flight request (to ensure ordering). Are the producer timeouts related to the NotLeaderForPartitionException? Is it just because of the leadership election?
(there is some more detail at https://chrisg23.blogspot.com/2020/02/kafka-acks-configuration-apology.html - a slightly rambling blog post, but may be interesting)

How to handle various failure conditions in Kafka

Issue we were facing:
In our system we were logging a ticket in a database with status NEW and also putting it on the Kafka queue for further processing. The processors pick those tickets from the Kafka queue, do the processing and update the status accordingly. We found that some tickets are left in the NEW state forever. So we were wondering whether tickets are failing to get produced onto the queue or are not getting consumed.
Message loss / duplication scenarios (and some other related points):
So I started to dig exhaustively into all the ways we can face message loss and duplication in Kafka. Below I have listed all the possible message loss and duplication scenarios that I could find:
How data loss can occur in different approaches to handle all replicas down
Handle by waiting for leader to come online
Messages sent between the time all replicas went down and the leader coming back online are lost.
Handle by electing a new broker as the leader once it comes online
If the new broker is out of sync with the previous leader, all data written between the time this broker went down and when it was elected the new leader will be lost. As additional brokers come back up, they will see that they have committed messages that do not exist on the new leader and drop those messages.
How data loss can occur when leader goes down, while other replicas may be up
In this case, the Kafka controller will detect the loss of the leader and elect a new leader from the pool of in sync replicas. This may take a few seconds and result in LeaderNotAvailable errors from the client. However, no data loss will occur as long as producers and consumers handle this possibility and retry appropriately.
When a consumer may miss a message
If Kafka is configured to keep messages for a day and a consumer is down for a period of longer than a day, the consumer will lose messages.
Evaluating different approaches to consumer consistency
Message might not be processed when consumer is configured to receive each message at most once
Message might be duplicated / processed twice when consumer is configured to receive each message at least once
No message is processed multiple times or left unprocessed if consumer is configured to receive each message exactly once.
Kafka provides the guarantees below as long as you are producing to one partition and consuming from one partition. All guarantees are off if you are reading from the same partition using two consumers or writing to the same partition using two producers.
Kafka makes the following guarantees about data consistency and availability:
Messages sent to a topic partition will be appended to the commit log in the order they are sent,
a single consumer instance will see messages in the order they appear in the log,
a message is ‘committed’ when all in sync replicas have applied it to their log, and
any committed message will not be lost, as long as at least one in sync replica is alive.
Approach I came up with:
After reading several articles, I felt I should do the following:
If a message is not enqueued, the producer should resend it
For this, the producer should listen for an acknowledgement for each message sent. If no acknowledgement is received, it can retry sending the message.
Producer should be async with callback:
As explained in the last example here (see the sketch below).
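A minimal asynchronous send with a callback might look like the following; the topic name "tickets" and the variables ticketId and ticketJson are hypothetical placeholders:

producer.send(new ProducerRecord<>("tickets", ticketId, ticketJson), (metadata, exception) -> {
    if (exception != null) {
        // Not enqueued after all retries: flag the ticket for a resend or for alerting.
        System.err.println("Failed to enqueue ticket " + ticketId + ": " + exception);
    } else {
        System.out.println("Ticket " + ticketId + " written to partition "
                + metadata.partition() + " at offset " + metadata.offset());
    }
});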
How to avoid duplicates in case the producer retries sending
To avoid duplicates in the queue, set enable.idempotence=true in the producer configs. This makes the producer ensure that exactly one copy of each message is written. It requires the following properties to be set on the producer (see the fragment after this list):
max.in.flight.requests.per.connection<=5
retries>0
acks=all (an acknowledgement is obtained only once all in-sync replicas have committed the message)
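Put together, an idempotent producer configuration might look like this fragment (the broker address is a placeholder):

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092");
props.put("enable.idempotence", true);
props.put("max.in.flight.requests.per.connection", 5);  // must be <= 5
props.put("retries", Integer.MAX_VALUE);                 // must be > 0
props.put("acks", "all");                                // required for idempotence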
Producer should be transactional
As explained here.
Set transactional id to unique id:
producerProps.put("transactional.id", "prod-1");
Because we've enabled idempotence, Kafka will use this transaction id as part of its algorithm to deduplicate any message this producer sends, ensuring idempotency.
Use transaction semantics: init, begin, commit, close
As explained here:
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(record1);
    producer.send(record2);
    producer.commitTransaction();
} catch (ProducerFencedException e) {
    // Another producer with the same transactional.id has taken over; this instance must close.
    producer.close();
} catch (KafkaException e) {
    // Any other error: abort so that consumers using read_committed never see these records.
    producer.abortTransaction();
}
Consumer should be transactional
consumerProps.put("isolation.level", "read_committed");
This ensures that the consumer doesn't read any transactional messages before the transaction completes.
Manually commit offset in consumer
As explained here
Process record and save offsets atomically
Say, by atomically saving both the record processing output and the offsets to a database. For this we need to set auto-commit on the database connection to false and commit manually after persisting both the processing output and the offset. This also requires setting enable.auto.commit to false on the consumer.
Read the initial offset (say, the one to resume from after a restart) from the database
Seek the consumer to this offset and then read from that position (see the sketch below).
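A rough sketch of that consumer side; loadOffsetFromDb, process and saveOutputAndOffsetInOneDbTransaction are hypothetical helpers around your database, and the broker address, group id and topic name are placeholders:

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092");
props.put("group.id", "ticket-processor");
props.put("enable.auto.commit", "false");           // offsets live in the database, not in Kafka
props.put("isolation.level", "read_committed");     // skip records from aborted transactions
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("tickets"), new ConsumerRebalanceListener() {
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // On startup or after a rebalance, resume from the offset stored in the database.
        for (TopicPartition tp : partitions) {
            consumer.seek(tp, loadOffsetFromDb(tp));            // hypothetical helper
        }
    }
});

while (true) {
    for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
        // Hypothetical helper: persists the processing output and rec.offset() + 1
        // in a single database transaction, so both are saved or neither is.
        saveOutputAndOffsetInOneDbTransaction(process(rec), rec);
    }
}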
Doubts I have:
(Some doubts might be basic and could be resolved by just writing code, but I want the word of an experienced Kafka developer.)
Does the consumer need to read the offset from the database only for the initial read (the first one after consumer recovery) or for all reads? I feel it needs to read the offset from the database only on restarts, as explained here.
Do we have to opt for manual partition assignment? Does this approach work only with automatic assignment off? I have this doubt because this example explains storing offsets in MySQL by specifying partitions explicitly.
Do we need both producer-side Kafka transactions and consumer-side database transactions (for storing offsets and processing records atomically)? I feel that for producer idempotence we need the producer to have a unique transactional id, and for that we need to use the Kafka transactional API (init, begin, commit). As a counterpart, the consumer also needs to set isolation.level to read_committed. However, can we ensure no message loss and no duplicate processing without using Kafka transactions? Or are they absolutely necessary?
Should we persist the offset to an external DB as explained above and here,
or send the offset to the transaction as explained here (also, I didn't get what exactly it means to send the offset to the transaction),
or follow the sync/async commit combo explained here?
I feel message loss / duplication scenarios 1 and 2 are handled by points 1 to 4 of the approach I explained above.
I feel message loss / duplication scenario 3 is handled by point 6 of the approach I explained above.
How do we implement the different consumer consistency approaches stated in message loss / duplication scenario 4? Is there any configuration for it, or does it need to be implemented in custom logic inside the consumer?
Message loss / duplication scenario 5 says: "Kafka provides these guarantees as long as you are producing to one partition and consuming from one partition." Is this something to be concerned about while building a correct system?
Is any consideration unnecessary/redundant in the approach I came up with above? Also did I miss any necessary consideration? Did I miss any message loss / duplication scenarios?
Is there any other standard / recommended / preferable approach to ensure no message loss and no duplicate processing than what I came up with above?
Do I have to actually code the above approach using the Kafka APIs, or is there a high-level API built on top of the Kafka API that makes it easy to ensure no message loss and no duplicate processing?
Looking at the issue we were facing (as stated at the very beginning), we were wondering whether we could recover any lost/unprocessed messages from the files in which Kafka stores messages. However, that isn't correct, right?
(Sorry for such an exhaustive post, but I wanted to ask all the related questions in one place to build a big picture of how to design a system around Kafka.)

Kafka: Can consumers read messages before all replicas are in sync?

I'm designing an event driven distributed system.
One of the events we need to distribute needs
1- Low Latency
2- High availability
Durability of the message and consistency between replicas is not that important for this event type.
Reading the Kafka documentation, it seems that consumers need to wait until all in-sync replicas for a partition have applied a message to their logs before the consumer can read it from any replica.
Is my understanding correct? If so, is there a way around it?
If configured improperly, consumers can read data that has not been written to all replicas yet.
As per the book,
Data is only available to consumers after it has been committed to Kafka - meaning it was written to all in-sync replicas.
If you have configured min.insync.replicas=1, then Kafka will not wait for replicas to catch up before serving the data to consumers.
The recommended value for min.insync.replicas depends on the type of application. If you don't care about the data, it can be 1; if it's a critical piece of information, you should configure it to be greater than 1.
There are 2 things you should think about:
Is it alright if the producer's message doesn't reach Kafka? (fire-and-forget strategy with acks=0)
Is it alright if the consumer misses a message? (with min.insync.replicas=1, if a broker goes down you may lose some data)
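In producer terms, that trade-off is expressed through acks; a hedged fragment for the low-latency, low-durability case described in the question (broker address is a placeholder):

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092");
// acks=0: fire and forget, lowest latency, the producer never learns about failures.
// acks=1: wait only for the partition leader; still fast, survives most single failures.
// acks=all: wait for all in-sync replicas; pair with min.insync.replicas > 1 for durability.
props.put("acks", "0");
props.put("linger.ms", 0);   // send immediately instead of batching for throughput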

Why do Kafka consumers connect to zookeeper, and producers get metadata from brokers?

Why is it that consumers connect to ZooKeeper to retrieve the partition locations, while Kafka producers have to connect to one of the brokers to retrieve metadata?
My point is, what exactly is the use of zookeeper when every broker already has all the necessary metadata to tell producers the location to send their messages? Couldn't the brokers send this same information to the consumers?
I can understand why brokers have the metadata, to not have to make a connection to zookeeper each time a new message is sent to them. Is there a function that zookeeper has that I'm missing? I'm finding it hard to think of a reason why zookeeper is really needed within a kafka cluster.
First of all, ZooKeeper is needed only for the high-level consumer. The SimpleConsumer does not require ZooKeeper to work.
The main reasons ZooKeeper is needed by the high-level consumer are tracking consumed offsets and handling load balancing.
Now in more detail.
Regarding offset tracking, imagine the following scenario: you start a consumer, consume 100 messages and shut the consumer down. Next time you start your consumer you'll probably want to resume from your last consumed offset (which is 100), and that means you have to store the maximum consumed offset somewhere. Here's where ZooKeeper kicks in: it stores offsets for every group/topic/partition. So next time you start your consumer it can ask "hey ZooKeeper, what's the offset I should start consuming from?". Kafka is actually moving towards being able to store offsets not only in ZooKeeper but in other storages as well (for now only the ZooKeeper and Kafka offset storages are available, and I'm not sure the Kafka storage is fully implemented).
Regarding load balancing, the volume of messages produced can be too large to handle on one machine and you'll probably want to add computing power at some point. Let's say you have a topic with 100 partitions and, to handle this volume of messages, you have 10 machines. Several questions arise here:
how should these 10 machines divide partitions between each other?
what happens if one of the machines dies?
what happens if you want to add another machine?
And again, here's where ZooKeeper kicks in: it tracks all consumers in the group, and each high-level consumer subscribes to changes in this group. The point is that when a consumer appears or disappears, ZooKeeper notifies all consumers and triggers a rebalance so that they split the partitions nearly equally (to balance the load). This way it guarantees that if one consumer dies, the others will continue processing the partitions that were owned by that consumer.
With Kafka 0.9+ the new consumer API was introduced. New consumers do not need a connection to ZooKeeper, since group balancing is provided by Kafka itself.
You are right, consumers don't need to connect to ZooKeeper since the Kafka 0.9 release. They redesigned the API and a new consumer client was introduced:
the 0.9 release introduces beta support for the newly redesigned consumer client. At a high level, the primary difference in the new consumer is that it removes the distinction between the “high-level” ZooKeeper-based consumer and the “low-level” SimpleConsumer APIs, and instead offers a unified consumer API.
and
Finally this completes a series of projects done in the last few years to fully decouple Kafka clients from Zookeeper, thus entirely removing the consumer client’s dependency on ZooKeeper.
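For illustration, a post-0.9 consumer needs only the broker bootstrap addresses (no zookeeper.connect); group membership and rebalancing are handled by the brokers' group coordinator. A minimal fragment with placeholder broker, group and topic names:

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092");   // brokers only; no zookeeper.connect
props.put("group.id", "my-group");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("my-topic"));
while (true) {
    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
        System.out.printf("partition=%d offset=%d value=%s%n",
                record.partition(), record.offset(), record.value());
    }
}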