Running two instances of MirrorMaker 2.0 halts data replication for newer topics - apache-kafka

We tried the scenarios below using MirrorMaker 2.0 and want to know whether the outcome of the second scenario is expected.
Scenario 1) We ran a single MirrorMaker 2.0 instance using the properties and start command below.
clusters=a,b
tasks.max=10
a.bootstrap.servers=kf-test-cluster-a:9092
a.config.storage.replication.factor=1
a.offset.storage.replication.factor=1
a.security.protocol=PLAINTEXT
a.status.storage.replication.factor=1
b.bootstrap.servers=kf-test-cluster-b:9092
b.config.storage.replication.factor=1
b.offset.storage.replication.factor=1
b.security.protocol=PLAINTEXT
b.status.storage.replication.factor=1
a->b.checkpoints.topic.replication.factor=1
a->b.emit.checkpoints.enabled=true
a->b.emit.heartbeats.enabled=true
a->b.enabled=true
a->b.groups=group1|group2|group3
a->b.heartbeats.topic.replication.factor=1
a->b.offset-syncs.topic.replication.factor=1
a->b.refresh.groups.interval.seconds=30
a->b.refresh.topics.interval.seconds=10
a->b.replication.factor=2
a->b.sync.topic.acls.enabled=false
a->b.topics=.*
Start command: /usr/bin/connect-mirror-maker.sh connect-mirror-maker.properties &
Verification: Created a new topic "test" on the source cluster (a), produced data to the topic on the source cluster, and ran a consumer on the target cluster (b) against topic "a.test" to verify data replication.
Observation: Worked as expected.
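For reference, a minimal sketch of those verification steps with the stock CLI tools (the partition count is an assumption, the host names come from the properties above, and on Kafka versions before 2.5 the console producer takes --broker-list instead of --bootstrap-server):
# Create the topic on the source cluster (a)
kafka-topics.sh --bootstrap-server kf-test-cluster-a:9092 --create --topic test --partitions 3 --replication-factor 1
# Produce a few test records on the source cluster
kafka-console-producer.sh --bootstrap-server kf-test-cluster-a:9092 --topic test
# Consume the mirrored topic on the target cluster (b)
kafka-console-consumer.sh --bootstrap-server kf-test-cluster-b:9092 --topic a.test --from-beginning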
Scenario 2) Ran a second MirrorMaker 2.0 instance using the same properties as above.
Start command: /usr/bin/connect-mirror-maker.sh connect-mirror-maker.properties &
Verification: Created another topic, "test2", on the source cluster, produced data to the topic on the source cluster, and ran a consumer on the target cluster (b) against topic "a.test2" to verify data replication.
Observation: MM2 replicated the topic to the target cluster; a.test2 was present on target cluster b, but the consumer did not get any records to consume.
In the logs of the newer MirrorMaker 2.0 instance, the MirrorSourceConnector tasks were not restarted after the topic was replicated, whereas with a single instance they were restarted after each topic replication.
NOTE: No errors were seen in the logs.

I observed the same behavior. Your messages are most likely replicated; you can verify this by checking your consumer group offsets. The problem is most likely that your offset lag is 0, meaning your consumer assumes all previous messages have already been consumed. You can reset the offsets or read from the beginning.
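For example, assuming one of the groups from the config above (group1), you can inspect and reset its offsets on the target cluster with the stock tooling (the group must be inactive while resetting):
# Show committed offsets, log-end offsets, and lag on the target cluster
kafka-consumer-groups.sh --bootstrap-server kf-test-cluster-b:9092 --describe --group group1
# Preview an offset reset to the beginning of the mirrored topic; swap --dry-run for --execute to apply it
kafka-consumer-groups.sh --bootstrap-server kf-test-cluster-b:9092 --group group1 --topic a.test2 --reset-offsets --to-earliest --dry-run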
Ideally, the checkpoint heartbeat should contain the latest offsets, but I currently find it to be empty, even though checkpoint/heartbeat replication should be automatic starting with Kafka 2.7.
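One way to check whether checkpoints are actually being written, assuming MM2's default internal topic naming, is to look at the end offsets of the checkpoints topic on the target cluster; also note that writing translated offsets into the target's __consumer_offsets is controlled by a separate property that defaults to false:
# A non-zero end offset means MM2 is emitting checkpoints for source cluster "a"
kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list kf-test-cluster-b:9092 --topic a.checkpoints.internal --time -1
# In connect-mirror-maker.properties (Kafka 2.7+): sync translated group offsets to the target cluster
# a->b.sync.group.offsets.enabled=true
# a->b.sync.group.offsets.interval.seconds=60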

Related

Consumer wrongly ignoring already consumed messages

I'm in the midst of migrating a Kafka cluster (1.0.0) to a new Kafka cluster (3.1). I'm using MirrorMaker2 to mirror the source cluster to the target cluster. My MirrorMaker2 setup looks something like this:
refresh_groups_interval_seconds = 60
refresh_topics_enabled = true
refresh_topics_interval_seconds = 60
sync_group_offsets_enabled = true
sync_topic_configs_enabled = true
emit_checkpoints_enabled = true
When looking at topics that don't have any migrated consumer groups, everything looks fine. When I migrate a consumer group to consume from the target cluster (Kafka 3.1), some consumer groups are migrated successfully, while some get a huge negative lag on some partitions. This results in a lot of
Reader-18: ignoring already consumed offset <message_offset> for <topic>-<partition>
At first I didn't think of this as a big problem; I figured it would eventually catch up. But after some investigation, it is a problem. I produced a new message on the source cluster, checked which offset and partition that specific message landed on in the target cluster, and noticed that the migrated consumer decided to ignore that new message and log
Reader-18: ignoring already consumed offset <message_offset> for <topic>-<partition>
After that I found https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/src/main/java/com/google/cloud/teleport/kafka/connector/KafkaUnboundedReader.java#L202
So for some reason my consumer thinks its offset is much lower than it should be - on some partitions, not all. Any ideas on what can be wrong?
It should also be mentioned that the offset difference on the different partitions can be quite huge, almost stretching to an order of magnitude in difference.
P.S. When migrating, I noticed that I'm unable to update a job; I have to kill the job and start a new one.
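As a diagnostic sketch, assuming a group named my-group and placeholder bootstrap addresses, describing the group on each cluster shows CURRENT-OFFSET against LOG-END-OFFSET; a negative LAG on the target means the translated committed offset is already past the end of that target partition:
# Committed offsets on the source cluster (1.0.0)
kafka-consumer-groups.sh --bootstrap-server source-cluster:9092 --describe --group my-group
# Translated committed offsets on the target cluster (3.1)
kafka-consumer-groups.sh --bootstrap-server target-cluster:9092 --describe --group my-group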

Kafka connect-distributed mode fault tolerance not working

I have created a Kafka Connect cluster with 3 EC2 machines and started 3 connectors (Debezium Postgres source), one on each machine, reading different sets of tables from the Postgres source. On one of the machines, I also started the S3 sink connector. So the changed data from Postgres is moved to the Kafka broker via the source connectors (3), and the S3 sink connector consumes these messages and pushes them to an S3 bucket.
The cluster is working fine, and so are the connectors. When I pause one of the connectors running on one of the EC2 machines, I expected its tasks to be taken over by another connector (postgres-debezium) running on another machine. But that's not happening.
I installed Kafdrop as well to monitor the brokers. I see the 3 internal topics connect-offsets, connect-status and connect-configs getting populated with the necessary offsets, configs, and statuses (when I pause, a paused status message appears).
But somehow the connectors are not taking over the tasks when I pause one.
In what scenario does a connector take over the tasks of another, failed connector? Is pausing the right way to test this, or should we produce an error on one of the connectors so another takes over?
Please guide.
Sounds like it's working as expected.
Pausing has nothing to do with the fault tolerance settings, and it completely stops the tasks. There is nothing to rebalance until the connector is unpaused.
The fault tolerance settings for dead letter queue, skip+log, or halt are for actual runtime exceptions in the connector that you cannot control through the API: for example, a database or S3 network/authentication exception, or a serialization error in the Kafka client.
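For illustration, pausing and resuming go through the Connect REST API, and the error-tolerance behaviour described above is configured per sink connector; the connector name, port, and DLQ topic below are placeholders:
# Pause, resume, and check a connector via the Connect REST API
curl -X PUT http://localhost:8083/connectors/s3-sink/pause
curl -X PUT http://localhost:8083/connectors/s3-sink/resume
curl http://localhost:8083/connectors/s3-sink/status
# Error handling for a sink connector (e.g. the S3 sink): log and skip bad records,
# routing them to a dead letter queue instead of halting the task
# errors.tolerance=all
# errors.log.enable=true
# errors.log.include.messages=true
# errors.deadletterqueue.topic.name=dlq-s3-sink
# errors.deadletterqueue.topic.replication.factor=3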

Why does my Kafka Connect sink cluster only have one worker processing messages?

I've recently setup a local Kafka on my computer for testing and development purposes:
3 brokers
One input topic
A Kafka Connect sink between the topic and Elasticsearch
I managed to configure it in standalone mode, so everything is on localhost, and Kafka Connect was started using the ./connect-standalone.sh script.
What I'm trying to do now is run my connectors in distributed mode, so the Kafka messages can be split between both workers.
I've started the two workers (still everything on the same machine), but when I send messages to my Kafka topic, only one worker (the last one started) is processing them.
So my question is: why is only one worker processing Kafka messages instead of both?
When I kill one of the workers, the other one takes over the message flow, so I think the cluster is set up correctly.
What I think:
I don't put keys in my Kafka messages; could it be related to this?
I'm running everything on localhost; can distributed mode work this way? (I've correctly configured unique fields such as rest.port.)
Resolved:
From Kafka documentation:
The division of work between tasks is shown by the partitions that each task is assigned
If you don't use partitioning (i.e., you push all messages to the same partition), the workers won't be able to divide the messages between them.
You don't need to use message keys; you can just push your messages to different partitions in a round-robin way.
See: https://docs.confluent.io/current/connect/concepts.html#distributed-workers
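As a minimal sketch, assuming the input topic is named input-topic (on older brokers kafka-topics.sh takes --zookeeper instead of --bootstrap-server), give the topic more than one partition and allow the connector at least two tasks so both workers get work:
# Check how many partitions the input topic currently has
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic input-topic
# Increase the partition count so records (and connector tasks) can be spread over both workers
kafka-topics.sh --bootstrap-server localhost:9092 --alter --topic input-topic --partitions 2
# In the sink connector config, also make sure tasks.max is at least 2
# tasks.max=2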

Kafka 0.10 quickstart: consumer fails when "primary" broker is brought down

So I'm trying the Kafka quickstart as per the main documentation. I got the multi-broker cluster example set up and tested per the instructions, and it works: for example, bringing down one broker, the producer and consumer can still send and receive.
However, as per the example, we set up 3 brokers and bring down broker 2 (broker id = 1). Now if I bring all brokers up again but bring down broker 1 (broker id = 0), the consumer just hangs. This only happens with broker 1 (id = 0); it does not happen with broker 2 or 3. I'm testing this on Windows 7.
Is there something special about broker 1? Looking at the configs, they are exactly the same across all 3 brokers except for the id, port number, and log file location.
I thought it was just a problem with the provided console consumer, which doesn't take a broker list, so I wrote a simple Java consumer as per their documentation using the default setup but specifying the list of brokers in the "bootstrap.servers" property, but no dice; I still get the same problem.
The moment I start broker 1 (broker id = 0) back up, the consumers resume working. This isn't highly available/fault-tolerant behavior for the consumer... any help on how to set up an HA/fault-tolerant consumer?
Producers don't seem to have an issue.
If you follow the quickstart, the created topic has only one partition with a single replica, which by default is hosted on the first broker, namely broker 1. That's why the consumer failed when you brought down that broker.
Try creating a topic with multiple replicas (specifying --replication-factor when creating the topic) and rerun your test to see whether that gives higher availability.
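For example, with the 0.10 quickstart tooling (which takes a ZooKeeper address), a replicated topic can be created and inspected like this; the topic name is a placeholder:
# Create a topic whose single partition is replicated to all 3 brokers
kafka-topics.sh --zookeeper localhost:2181 --create --topic my-replicated-topic --partitions 1 --replication-factor 3
# Check which broker is the leader and which replicas are in sync (Isr)
kafka-topics.sh --zookeeper localhost:2181 --describe --topic my-replicated-topic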

Kafka 0.10.0.1 partition reassignment after broker failure

I'm testing Kafka's partition reassignment as a precursor to launching a production system. I have several topics with 9 partitions each and a replication factor of 3. I killed one of the brokers to simulate a failure condition and verified that some topics became under-replicated (verification done via a fork of Yahoo's Kafka Manager, modified to allow adding a version 0.10.0.1 cluster).
I then started a new broker with a different id. I would now like to distribute partitions to this new broker. I attempted to use Kafka Manager's reassign-partitions functionality, but that did not work (possibly due to an improperly modified fork).
I saw that Kafka comes with a bin/kafka-reassign-partitions.sh script, but the docs say I have to manually write out the partition reassignments for each topic in JSON format. Is there a way to handle this without manually deciding which brokers the partitions must go to?
Hmm, what a coincidence that I was doing exactly the same thing today. I don't have an answer you're probably going to like, but I achieved what I wanted in the end.
Ultimately, what I did was execute the kafka-reassign-partitions command with the reassignment the same tool proposed, except that in whatever it generated I replaced the new broker id with the old failed broker id (for some reason the generated JSON moved everything around).
This will fail (or rather, never complete) because the old broker has passed on. I then had to delete the reassignment operation in ZooKeeper (the znode admin/reassign_partitions, or something like that).
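For reference, a rough sketch of the generate/execute/verify workflow and the znode cleanup with the stock 0.10.x tooling; the topics file, broker ids, and ZooKeeper address are placeholders:
# topics.json lists the topics to move, e.g. {"version":1,"topics":[{"topic":"my-topic"}]}
# Ask the tool to propose an assignment onto the surviving brokers plus the new one
kafka-reassign-partitions.sh --zookeeper localhost:2181 --topics-to-move-json-file topics.json --broker-list "1,2,4" --generate
# Save the proposed (or hand-edited) assignment to reassign.json and apply it
kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file reassign.json --execute
# Check progress; a reassignment stuck on a dead broker can be cleared by deleting its znode
kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file reassign.json --verify
zookeeper-shell.sh localhost:2181 delete /admin/reassign_partitions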
Then I restarted Kafka on the new broker, and it magically picked up as the leader of the partition that was looking for a new replacement leader.
I'll let you know if everything is still working tomorrow and if I still have a job ;-)