Is it possible to force abort a Kafka transaction? - apache-kafka

We have a test Kafka cluster on which we were experimenting with various settings.
One of the settings that was adjusted was transaction.max.timeout.ms, which we set to 7 days.
While that setting was in place we had a network failure on one of the ZK nodes. It was brief, but enough to trigger a broker leader election. That election wasn't clean, as only 6 of the 8 brokers registered when it came back up. We manually triggered another election and everything came up cleanly.
The problem that we have now is that we have a bunch of zombie transactions that have not aborted or committed.
This means that our apps that use transactions/have an isolation level of read_committed are no longer reading from certain partitions.
I know this is because the Last Stable Offset (LSO) is stuck at the point where the transaction was created.
I've tested this by using the console consumer to read from a particular topic:partition offset, which worked fine; as soon as I added --isolation-level read_committed, it returned no records.
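For reference, a minimal sketch of the same check with the Java consumer (broker address, topic and partition are hypothetical placeholders); flipping isolation.level between read_uncommitted and read_committed reproduces the behaviour described above:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LsoCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // hypothetical broker address
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Switch between "read_uncommitted" (returns records) and "read_committed"
        // (returns nothing while a zombie transaction pins the LSO).
        props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("my-topic", 0); // hypothetical topic/partition
            consumer.assign(Collections.singletonList(tp));
            consumer.seekToBeginning(Collections.singletonList(tp));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(10));
            System.out.println("Fetched " + records.count() + " records");
        }
    }
}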
Is there any way to force the transaction coordinator to abort the zombie transactions, or to manually set the LSO? I've even 'purged' the topic by setting retention.ms to 100 and seen the consumer group offset record shift, but read_committed clients still won't read from the partition and the consumer group won't advance past the log rotation.
Thanks

Related

Kafka streams losing messages in State Store during first launch of app

We have been using Kafka Streams to implement a CDC-based app. Attached is the sub-topology of interest.
The table2 topic is created by Debezium, which is connected to a SQL database; it contains 26K records. We take table2 and create a new key, which is simply a conversion of the existing key from string to int. This means we should expect #table2 = #repartition-topic = #state-store, but in practice this does not hold. What we end up with is #table2 = #repartition-topic, but #repartition-topic > #state-store. We actually lose messages and thus corrupt the state store, which leaves the app in an incorrect state. (Please note that there are no inserts into table2, as we paused the connector to verify the cardinalities.)
The above happens only during the first launch, i.e. the app has never been launched before, so internal topics do not exist yet. Restarts of pre-existing apps do not yield any problems.
We have:
Broker on Kafka 3.2.
Client app on 2.8 or 3.2 (we tried both and faced the same issue).
All parameters are left at their defaults except CACHE_MAX_BYTES_BUFFERING_CONFIG, set to 0, and NUM_STREAM_THREADS_CONFIG, set to a value > 1.
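For context, a minimal sketch of the non-default Streams settings described above, assuming a hypothetical application id and bootstrap servers:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsProps {
    static Properties build() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "cdc-app");        // hypothetical
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // hypothetical
        // The two non-default settings mentioned above:
        props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4); // any value > 1 reproduces the issue
        return props;
    }
}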
What actually worked
Using a single thread at first launch: with one thread the problem disappears, and #table2 = #repartition-topic = #state-store holds.
Pre-creating the Kafka internal topics: we noticed that whenever there is a rebalance during the first launch of the Kafka Streams app, the state stores end up missing values. This also happens when you launch multiple pods in K8s, for example. Reading through the logs, we noticed that a rebalance is triggered when we first launch the app. It comes from the fact that the internal topics get created and assigned, hence the rebalance. So by creating the internal topics beforehand (see the sketch below), we avoid that rebalance and we end up with #table2 = #repartition-topic = #state-store.
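As an illustration of this workaround, a rough sketch of pre-creating the internal topics with the AdminClient; the topic names must match what Streams would generate (<application.id>-<name>-repartition / -changelog), and the names, partition counts and configs below are hypothetical placeholders:

import java.util.Arrays;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class PreCreateInternalTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // hypothetical

        try (AdminClient admin = AdminClient.create(props)) {
            // Names and partition counts must match what Streams expects
            // (same partition count as the source topic); check the topology
            // description for the real names. These are hypothetical examples.
            NewTopic repartition = new NewTopic("cdc-app-table2-rekey-repartition", 6, (short) 3);
            NewTopic changelog = new NewTopic("cdc-app-table2-store-changelog", 6, (short) 3)
                    .configs(Collections.singletonMap("cleanup.policy", "compact"));
            admin.createTopics(Arrays.asList(repartition, changelog)).all().get();
        }
    }
}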
What we noticed from the logs
In multi-threaded mode, we noticed that the data loss affects the partitions assigned to the thread chosen by the coordinator as the consumer group leader. What we think is happening is the following:
1. Consumer threads are launched and register with the coordinator.
2. The coordinator assigns topics and chooses the leader among the threads.
3. The app creates the internal topics.
4. Consumers/producers process data. In particular, the consumer leader consumes from the repartition topic, which triggers deletion of those messages without them being flushed to the changelog topic.
5. The leader is notified of the new assignment including the internal topics, which triggers a rebalance.
6. The leader pauses its partitions.
7. The rebalance finishes and the leader resumes its partitions.
8. The leader fetches the oldest offset of the repartition partitions it got assigned. It does not start from zero but from where it was interrupted in step 4, so the chunk of early messages is lost.
Please note that in single-threaded mode there is no data loss, which is odd since the leader is then the only thread.
So my questions are:
Are we misunderstanding what's happening?
What can be the origin of this problem?

Kafka transactions - why do I need to replicate?

I am using Kafka as a circular buffer of the last 24 hours of events.
I have 4 brokers that run on ephemeral cloud instances, so the disk is local: if a broker dies, I lose the data for that broker. I can start the broker again and it can replicate the data from another broker. I have replicas set up for my topic and the offsets topic:
default.replication.factor=2
offsets.topic.replication.factor=2
I'm using transactions to commit the new offsets + new records atomically. My app is side-effect free, so if the transaction fails, I can poll again, get the same records, repeat the processing, and produce the same resulting events.
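For context, a rough sketch of that consume-process-produce loop using the standard transactional producer API; the output topic and the configuration hints in the comments are hypothetical placeholders:

import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class TransactionalLoop {
    // Assumes the consumer has enable.auto.commit=false and isolation.level=read_committed,
    // and the producer has transactional.id set. All names here are hypothetical.
    static void run(KafkaConsumer<String, String> consumer, KafkaProducer<String, String> producer) {
        producer.initTransactions();
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            if (records.isEmpty()) continue;
            producer.beginTransaction();
            try {
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> rec : records) {
                    producer.send(new ProducerRecord<>("output-events", rec.key(), rec.value()));
                    offsets.put(new TopicPartition(rec.topic(), rec.partition()),
                                new OffsetAndMetadata(rec.offset() + 1));
                }
                // Offsets and output records are committed atomically in one transaction.
                producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                producer.commitTransaction();
            } catch (Exception e) {
                // Side-effect free: abort and reprocess (a full implementation would also
                // reset the consumer position to the last committed offsets).
                producer.abortTransaction();
            }
        }
    }
}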
These are the defaults for the transaction durability properties:
transaction.state.log.min.isr=2
transaction.state.log.replication.factor=3
I feel that in my setup I can set both of these properties to 1, i.e. no replication/durability (as my app is side-effect free). Yet I can't shake the niggling feeling that I'm wrong.
Am I wrong? Why are the transactions durable in the first place - what scenario does the durability help with?

Does kafka partition assignment happen across processes?

I have a topic with 20 partitions and 3 processes with consumers (with the same group_id) consuming messages from the topic.
But I am seeing a discrepancy: unless one of the processes commits, the consumers in the other processes do not read any messages.
The consumers in the other processes do consume messages when I set auto-commit to true (which is why I suspect the consumers are being assigned to the first partition in each process).
Can someone please help me out with this issue? And also, how do I consume messages in parallel across processes?
If it is of any use, I am doing this on a Kubernetes pod, where the 3 processes are 3 different mules.
Commit shouldn't make any difference because the committed offset is only used when there is a change in group membership. With three processes there would be some rebalancing while they start up but then when all 3 are running they will each have a fair share of the partitions.
Each time they poll, they keep track in memory of which offset they have consumed on each partition and each poll causes them to fetch from that point on. Whether they commit or not doesn't affect that behaviour.
Autocommit also makes little difference - it just means a commit is done synchronously during a subsequent poll rather than your application code doing it. The only real reason to manually commit is if you spawn other threads to process messages and so need to avoid committing messages that have not actually been processed - doing this is generally not advisable - better to add consumers to increase throughput rather than trying to share out processing within a consumer.
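To illustrate, a minimal sketch of the usual pattern, assuming a hypothetical broker address, group id and topic; each of the 3 processes runs the same loop and the broker spreads the 20 partitions across them:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // hypothetical
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");             // same group id in all 3 processes
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));     // hypothetical topic
            while (true) {
                // Must be called at least every max.poll.interval.ms (default 5 minutes),
                // otherwise the consumer is kicked out of the group and a rebalance happens.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> rec : records) {
                    // process the record here
                }
                consumer.commitSync(); // the committed offset only matters for where a new assignee resumes
            }
        }
    }
}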
One possible explanation is just infrequent polling. You mention that other consumers are picking up partitions, and committing affects behaviour, so I think it is safe to say that rebalances must be happening. Rebalances are caused either by a change in partitions at the broker (presumably not the case here) or by a change in group membership, caused in turn by the heartbeat thread dying (e.g. a pod being stopped) or by a consumer failing to poll for a long time (default 5 minutes, set by max.poll.interval.ms).
After a rebalance, each partition is assigned to a consumer, and if a previous consumer has ever committed an offset for that partition, then the new one will poll from that offset. If not then the new one will poll from either the start of the partition or the high watermark - set by auto.offset.reset - default is latest (high watermark)
So, if you have a consumer that polls but doesn't commit, and doesn't poll again for 5 minutes, then a rebalance happens, a new consumer picks up the partition and starts from the end (skipping any messages up to that point). Its first poll will return nothing as it is starting from the end. If it doesn't poll for 5 minutes, another rebalance happens and the sequence repeats.
That could be the cause - there should be more information about what is going on in your logs - Kafka consumer code puts in plenty of helpful INFO level logging about rebalances.

Apache Kafka Cleanup while consuming messages

Playing around with Apache Kafka and its retention mechanism I'm thinking about following situation:
A consumer fetches the first batch of messages with offsets 1-5.
The cleaner deletes the first 10 messages, so the topic now has offsets 11-15.
In the next poll, the consumer fetches the next batch with offsets 11-15.
As you can see, the consumer lost offsets 6-10.
Question: is such a situation possible at all? In other words, will the cleaner execute while there is an active consumer? If yes, is the consumer able to somehow recognize that gap?
Yes, such a scenario can happen. The exact steps will be a bit different:
Consumer fetches message 1-5
Messages 1-10 are deleted
Consumer tries to fetch message 6 but this offset is out of range
Consumer uses its offset reset policy, auto.offset.reset, to find a new valid offset:
If set to latest, the consumer moves to the end of the partition.
If set to earliest, the consumer moves to offset 11.
If set to none, the consumer throws an exception.
To avoid such scenarios, you should monitor the lead of your consumer group. It's similar to the lag, but the lead indicates how far from the start of the partition the consumer is. Being near the start has the risk of messages being deleted before they are consumed.
If consumers are near the limits, you can dynamically add more consumers or increase the topic retention size/time if needed.
Setting auto.offset.reset to none will throw an exception if this happens; the other values only log it.
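A short sketch of how a consumer could detect the gap itself, assuming auto.offset.reset is set to none so the out-of-range fetch surfaces as an exception (broker, group and topic names are hypothetical):

import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetOutOfRangeException;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GapAwareConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // hypothetical
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "gap-aware-group");      // hypothetical
        // With "none", an out-of-range fetch throws instead of silently resetting.
        // Note: a group with no committed offsets at all will instead fail with
        // NoOffsetForPartitionException on its very first poll.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "none");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));     // hypothetical
            while (true) {
                try {
                    consumer.poll(Duration.ofSeconds(1)); // process records here
                } catch (OffsetOutOfRangeException e) {
                    // The offsets we wanted were already deleted by retention: a gap.
                    for (Map.Entry<TopicPartition, Long> entry : e.offsetOutOfRangePartitions().entrySet()) {
                        System.err.println("Gap on " + entry.getKey() + ", wanted offset " + entry.getValue());
                    }
                    // Decide how to recover, e.g. jump to the earliest available offset:
                    consumer.seekToBeginning(e.offsetOutOfRangePartitions().keySet());
                }
            }
        }
    }
}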
Question: is such a situation possible at all? Will the cleaner execute while there is an active consumer?
Yes, if the messages have exceeded their TTL (time-to-live) period before they are consumed, this situation is possible.
Is the consumer able to somehow recognize that gap?
In cases where you suspect your configuration (high consumer lag, low TTL) might lead to this, the consumer side should track offsets. The kafka-consumer-groups.sh command gives you the position of all consumers in a consumer group as well as how far behind the end of the log they are.
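If you prefer to monitor this from code rather than the CLI, a rough sketch of computing both lag and lead with the AdminClient (broker address and group id are hypothetical):

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagAndLead {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // hypothetical

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed positions of the group (hypothetical group id).
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("my-group")
                         .partitionsToOffsetAndMetadata().get();

            Map<TopicPartition, OffsetSpec> earliestSpec = new HashMap<>();
            Map<TopicPartition, OffsetSpec> latestSpec = new HashMap<>();
            committed.keySet().forEach(tp -> {
                earliestSpec.put(tp, OffsetSpec.earliest());
                latestSpec.put(tp, OffsetSpec.latest());
            });
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> earliest =
                    admin.listOffsets(earliestSpec).all().get();
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(latestSpec).all().get();

            committed.forEach((tp, om) -> {
                long lead = om.offset() - earliest.get(tp).offset(); // distance from the start (deletion risk)
                long lag  = latest.get(tp).offset() - om.offset();   // distance from the end
                System.out.printf("%s lead=%d lag=%d%n", tp, lead, lag);
            });
        }
    }
}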

kafka Burrow reports Multiple Offsets for same consumer_group and same topic

SetUp
myTopic has a single partition.
consumer_group is my Spring Boot app using the spring-kafka client, and there is always a single consumer for that consumer group. spring-kafka version 1.1.8.RELEASE.
I have a single broker node in Kafka. Kafka version 0.10.1.1.
When I query a particular consumer_group using Burrow, I see 15 offset entries for the same topic.
Observations
curl http://burrow-node:8000/v3/kafka/mykafka-1/consumer/my_consumer_grp
"myTopic":[
{"offsets":[
{"offset":6671,"timestamp":1533099130556,"lag":0},
{"offset":6671,"timestamp":1533099135556,"lag":0},
{"offset":6671,"timestamp":1533099140558,"lag":0},
{"offset":6671,"timestamp":1533099145558,"lag":0},
{"offset":6671,"timestamp":1533099150557,"lag":0},
{"offset":6671,"timestamp":1533099155558,"lag":0},
{"offset":6671,"timestamp":1533099160561,"lag":0},
{"offset":6671,"timestamp":1533099165559,"lag":0},
{"offset":6671,"timestamp":1533099170560,"lag":0},
{"offset":6671,"timestamp":1533099175561,"lag":0},
{"offset":6671,"timestamp":1533099180562,"lag":0},
{"offset":6671,"timestamp":1533099185562,"lag":0},
{"offset":6671,"timestamp":1533099190563,"lag":0},
{"offset":6671,"timestamp":1533099195562,"lag":0},
{"offset":6671,"timestamp":1533099200564,"lag":0}
]
}
]
More Observations
When I restarted the app, I didn't see a new offset entry created; only the timestamps kept updating, which is probably due to auto.commit.interval.ms.
When I started producing/consuming, I saw the offset and lag change in one of the entries; later on the other entries caught up, which made me think those are replicas.
offset.retention.minutes is default 1440
Questions
Why do we have 15 offset entries in burrow reports?
If they are replicas, why does a single-partition topic get split into 14 different replicas under __consumer_offsets? Is there any documentation for this?
If they are NOT replicas, what else are they?
Here's my understanding, based on the docs. Burrow stores a configurable number of committed offsets per consumer group as a rolling window: every time a consumer commits, Burrow stores the committed offset and the lag at the time of commit. What you are seeing is likely the result of having applied a storage config something like this (culled from burrow.toml):
[storage.default]
class-name="inmemory"
workers=20
intervals=15
expire-group=604800
min-distance=1
Note that intervals is set to 15.
I believe this feature is simply to provide some history of consumer group commits and associated lags, and has nothing to do with replicas.
EDIT:
The Consumer Lag Evaluation Rules page on the Burrow wiki explains this functionality in more detail. In short, this configurable window of offset/lag data is used to calculate consumer group status.