I'm having an issue with Kafka MirrorMaker.
I stopped MirrorMaker for about 30 minutes for a cluster upgrade, and since the cluster came back up, MirrorMaker has not been able to consume data from the source cluster.
I can see that the lag of MirrorMaker's consumer group is very high, so I'm looking at which parameters to change in order to increase MirrorMaker's buffer size.
I've tried changing MirrorMaker's consumer group, and that lets it start consuming again from the latest messages. When I try to restart the process from the last saved offsets, I see a peak of consumed data, but MirrorMaker is not able to commit offsets; in fact the log is stuck at the line INFO kafka.tools.MirrorMaker$: Committing offsets and no more lines are shown after it.
I think that the problem is related to the huge amount of data to process.
I'm running a cluster with Kafka 0.8.2.1 with this configuration:
dual.commit.enabled=false
offsets.storage=zookeeper
auto.offset.reset=largest
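For reference, these are the kinds of fetch-buffer settings I was thinking of tuning in the consumer config passed to MirrorMaker via --consumer.config (the property names come from the 0.8.2 old-consumer configuration; the values below are only placeholders, not recommendations):

fetch.message.max.bytes=2097152
queued.max.message.chunks=10
socket.receive.buffer.bytes=1048576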
Thanks in advance
Related
I'm in the midst of migrating a Kafka cluster (1.0.0) to a new Kafka cluster (3.1). I'm using MirrorMaker2 to mirror the source cluster to the target cluster. My MirrorMaker2 setup looks something like this:
refresh_groups_interval_seconds = 60
refresh_topics_enabled = true
refresh_topics_interval_seconds = 60
sync_group_offsets_enabled = true
sync_topic_configs_enabled = true
emit_checkpoints_enabled = true
When looking at topics that don't have any migrated consumer groups, everything looks fine. When I migrate a consumer group to consume from the target cluster (Kafka 3.1), some consumer groups are migrated successfully, while some end up with a huge negative lag on some partitions. This results in a lot of
Reader-18: ignoring already consumed offset <message_offset> for <topic>-<partition>
At first I didn't think of this as a big problem; I figured it would eventually catch up, but after some investigation it turned out to be a real problem. I produced a new message on the source cluster, checked which offset and partition that specific message landed on in the target cluster, and noticed that the migrated consumer decided to ignore that new message and log
Reader-18: ignoring already consumed offset <message_offset> for <topic>-<partition>
After that I found https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/src/main/java/com/google/cloud/teleport/kafka/connector/KafkaUnboundedReader.java#L202
So for some reason my consumer thinks its offset is much lower than it should be - on some partitions, not all. Any ideas on what can be wrong?
It should also be mentioned that the offset difference across partitions can be quite large, almost an order of magnitude in some cases.
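For what it's worth, this is roughly how I've been comparing what the migrated group has committed on the target cluster (just a sketch using the Kafka AdminClient; the bootstrap address and group name are placeholders for my real ones):

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;

public class CheckCommittedOffsets {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address for the target (3.1) cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "target-kafka:9092");

        try (Admin admin = Admin.create(props)) {
            // Committed offsets for the migrated group, as seen by the target cluster,
            // which can then be compared per partition against the log-end offsets.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("my-migrated-group")
                         .partitionsToOffsetAndMetadata()
                         .get();
            committed.forEach((tp, om) ->
                    System.out.printf("%s -> committed offset %d%n", tp, om.offset()));
        }
    }
}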
P.S. When migrating, I noticed that I'm unable to update a job; I have to kill the job and start a new one.
I am working on Kafka mirroring and I am using MM2. I am able to start the mirroring process, and all topics and data are replicated from the source to the target cluster.
I need to start the consumers in the target cluster from where they left off in the source cluster. I came across RemoteClusterUtils.translateOffsets, which translates the consumer offsets from the remote cluster to the local one.
On checking the code, I can see that it just consumes the checkpoint topic and returns the offsets for the consumer group we provide; it does not commit them.
Do we need to explicitly commit the offsets (for example with an OffsetCommitRequest) and then start the consumers in the target cluster, or is there some other way to do this?
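For context, this is the kind of flow I had in mind: translate the offsets myself and then commit them on the target cluster. The sketch below is only an illustration; the cluster alias, group name, and bootstrap address are placeholders, and I'm assuming the group is not active on the target while its offsets are altered:

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.mirror.RemoteClusterUtils;

import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

public class TranslateAndCommit {
    public static void main(String[] args) throws Exception {
        // Connection properties for the *target* cluster, where MM2 writes the checkpoint topic.
        Map<String, Object> targetProps = new HashMap<>();
        targetProps.put("bootstrap.servers", "target-kafka:9092"); // placeholder

        // Translate the group's source-cluster offsets into target-local offsets.
        Map<TopicPartition, OffsetAndMetadata> translated =
                RemoteClusterUtils.translateOffsets(
                        targetProps, "source", "my-consumer-group", Duration.ofSeconds(30));

        // translateOffsets only reads the checkpoint topic; the commit itself is up to us,
        // for example via Admin.alterConsumerGroupOffsets on the target cluster.
        try (Admin admin = Admin.create(targetProps)) {
            admin.alterConsumerGroupOffsets("my-consumer-group", translated).all().get();
        }
    }
}

An alternative I'm considering is starting a consumer in that group and calling commitSync with the translated offsets instead of going through the admin API.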
We are observing that Kafka brokers occasionally take much longer to load logs on startup than usual: 40 minutes instead of at most 1 minute. This happens during a rolling restart following the procedure described by Confluent, after the broker has reported that controlled shutdown was successful.
Kafka Setup
Confluent Platform 5.5.0
Kafka Version 2.5.0
3 Replicas (minimum 2 in sync)
Controlled broker shutdown enabled
1TB of AWS EBS for Kafka log storage
Other potentially useful information
We make extensive use of Kafka Streams
We use exactly-once processing and transactional producers/consumers
Observations
It is not always the same broker that takes a long time.
It does not only occur when the broker is the active controller.
A log partition that loads quickly (15 ms) can take a long time (9549 ms) for the same broker a day later.
We experienced this issue before on Kafka 2.4.0 but after upgrading to 2.5.0 it did not occur for a few weeks.
Does anyone have an idea what could be causing this? Or what additional information would be useful to track down the issue?
For a migration of Kafka clusters from AWS to Azure, the challenge is that we are using our own custom offset management for consumers. If I replicate the ZK nodes with offsets, the Kafka mirror will change those offsets. Is there any way to ensure the offsets stay the same so that the migration can be smooth?
I think the problem might be your custom offset management. Without more details on it, it's hard to give suggestions.
The problem I see with trying to copy offsets at all is this: you consume from cluster A, topic T, offset 1000. You copy that topic to a brand-new cluster B, where it now starts at offset 0. Consumers starting at offset 1000 will simply fail in this scenario, or, if at least 1000 messages have already been mirrored, you're effectively skipping that data.
With newer versions of Kafka (post 0.10), MirrorMaker uses the __consumer_offsets topic, not ZooKeeper, since it's built on the newer Java clients.
As for replication tools, uber/uReplicator uses ZooKeeper for offsets.
There are other tools that manage offsets differently, such as Comcast/MirrorTool or salesforce/mirus via the Kafka Connect Framework.
And the enterprise-supported tool would be Confluent Replicator, which has unique ways of handling cluster failover and migrations.
We have v0.8.2 consumers that we are in the process of updating to v0.10.2. These are high-volume applications, so we normally do rolling updates. The problem is that the v0.8 consumers commit offsets to ZooKeeper while the v0.10 consumers commit to Kafka. In our testing, running a mix of v0.8 and v0.10 consumers results in messages being consumed twice. See a more detailed write-up of the problem here: http://www.search-hadoop.com/m/Kafka/uyzND1ymwxk17UWdj?subj=Re+kafka+0+10+offset+storage+on+zookeeper
Has anyone found a workaround so that we can update from v0.8 to v0.10 consumers without taking an outage?
Two things here that I would suggest.
Do the version upgrade and the API upgrade in separate steps. You can do a rolling upgrade to 0.10 and continue using the old consumer API, still committing offsets to ZooKeeper. Then, in a second step, do your API update. This reduces the number of moving parts per step.
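For the broker side of that rolling upgrade, the approach in the upgrade notes (sketched below; the version strings are placeholders for whatever your exact old and new versions are) is to pin the protocol and message format while the binaries are replaced, then bump them in later rolling restarts:

# server.properties while rolling out the 0.10.x binaries
inter.broker.protocol.version=0.8.2
log.message.format.version=0.8.2
# once every broker is on 0.10.x, bump the protocol and do another rolling restart:
# inter.broker.protocol.version=0.10.2
# leave log.message.format.version at the old value until the old consumers are gone,
# then bump it last to avoid down-converting messages for them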
When doing the API update, you'll have to do some kind of full switchover because, as you've observed, the consumer group info and offsets will now be stored in Kafka. One thing that can help with this is described here:
https://kafka.apache.org/documentation/#offsetmigration
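In practice, that migration path (as described in that section of the docs) amounts to a consumer config change plus two rolling bounces, roughly:

# step 1: rolling bounce the consumers with both commit paths enabled
offsets.storage=kafka
dual.commit.enabled=true
# step 2: once the whole group is on the new config, rolling bounce again with
# dual.commit.enabled=false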
However, the consumer groups are coordinated differently (in the old version they are coordinated through ZooKeeper no matter where you commit offsets), so at some point you will need to stop all the old consumers and start up your new API implementation. Migrating your offsets first should make this a very short outage with no data loss, just however long it takes to bring down all the consumers and start up the new ones.