Avoid duplicate data processing between Kafka data centers - apache-kafka

In an active-active Kafka design, when data is replicated from DC1 to DC2 (within the same or a different zone), how can we make sure that the DC2 consumer does not process data that has already been processed by the DC1 consumer? What Kafka-level configuration is needed?

The consumer only knows about the local offsets topic, and the cluster doesn't know it's being replicated into another, so there is no broker config that would modify consumer behavior with respect to another cluster.
I assume you're using MirrorMaker 2, which can translate topic offsets using timestamp markers so that a secondary failover consumer can pick up where the primary left off. However, this assumes the secondary only runs after the source DC has failed, not in parallel. If both run in parallel then, since it's a separate topic with a separately maintained consumer group, you'll either need a cross-DC distributed lock or a place to centrally store and manage offsets that both consumers will use.
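To illustrate the "central offset store" option, here is a minimal sketch, assuming a hypothetical OffsetStore interface backed by something both DCs can reach (a shared database, for instance); the consumer seeks to the stored position on assignment and records its progress there instead of relying on Kafka's committed offsets. Keep in mind that raw offsets are only meaningful within one cluster, as the answers further down explain, so in practice what you store would need to be something both sides can interpret (keys, timestamps, or per-cluster positions).

```java
import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SharedOffsetConsumer {

    /** Hypothetical store both DCs can reach, e.g. a shared database. */
    public interface OffsetStore {
        Long get(String topic, int partition);            // null if nothing stored yet
        void put(String topic, int partition, long nextOffset);
    }

    public static void run(OffsetStore store, String bootstrapServers) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "shared-group");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // progress lives in the store, not in Kafka
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"), new ConsumerRebalanceListener() {
                @Override public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }
                @Override public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // Resume from the externally stored position rather than the local committed offset.
                    for (TopicPartition tp : partitions) {
                        Long next = store.get(tp.topic(), tp.partition());
                        if (next != null) consumer.seek(tp, next);
                    }
                }
            });

            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    // ... process the record ...
                    // Record the next position so the other DC's consumer knows this record is done.
                    store.put(rec.topic(), rec.partition(), rec.offset() + 1);
                }
            }
        }
    }
}
```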

Related

How are consumers set up in an Active-Active Kafka setup

We have an Active-Active Kafka cluster setup with topic renaming, using MirrorMaker 2.0 as described in https://strimzi.io/blog/2020/03/30/introducing-mirrormaker2/. I believe a topic such as us-email is set up as follows:
dc1:
  us-email
  us-email-dc2 (mirror of dc2)
dc2:
  us-email
  us-email-dc1 (mirror of dc1)
Producers publish to their local DC, and both clusters end up containing the data of both DCs. So far so good.
The consumer app would subscribe to a wildcard topic (us-email-*) to read the data of both DCs. If that's the case, do I set up consumers to read from their respective DCs? Then every message is read twice because of the mirroring. Or is it recommended to point a single consumer group at a single DC at a time to prevent duplication? If so, how does failover happen when that DC fails?
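For reference, a pattern subscription of the kind described might look roughly like the sketch below; the broker address and group name are made up, and the regex has to match whatever mirrored names your MirrorMaker naming scheme actually produces.

```java
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class WildcardSubscribe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "dc1-broker:9092"); // a consumer can only point at one cluster
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "email-consumers");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // Matches the local us-email topic plus any mirrored variants such as us-email-dc2.
        consumer.subscribe(Pattern.compile("us-email.*"));
    }
}
```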
Do consumers in both data centers have to point to a single DC?
Consumers cannot read from more than one list of bootstrap servers, so yes.
Is there manual failover?
Not clear what you mean by manual.
1. If the mirror or the destination brokers fail, the consumer stops reading anything.
2. If the source is down, the mirroring stops, leading back to (1).
Consumers in both DCs will get replicated messages as well.
Mirroring doesn't guarantee exactly-once delivery.
Automatic failover is not possible. Whenever one DC fails, you have to manually update the consumer to read from the other DC. As for consumer offsets, I am not sure whether they are synced so you can continue, or whether the consumer is treated as a new consumer group.

Manually setting Kafka consumer offset

In our project there are Active Kafka servers (PR) and Passive Kafka servers (DR); both clusters are configured with the same group name, topic name and partitions. When switching from PR to DR, the __consumer_offsets topic is set manually on DR.
My question here is, would the Kafka consumer be able to seamlessly consume the messages from where it was last read?
When replicating messages across 2 clusters, it's not possible to ensure offsets stay in sync.
For example, if a topic exists for a little while on the Active cluster the log start offset for some partitions may not be 0 (some records have been deleted by the retention policies). Hence when replicating this topic, offsets between both clusters will not be the same. This can also happen when messages are lost or duplicated as you can't have exactly once semantics when replicating between 2 clusters.
So you can't just replicate the __consumer_offsets topic; that will not work. Consumer group positions have to be explicitly "translated" between the two clusters. While it's possible to reset them "manually" by committing directly, it's not recommended, as finding the new positions is not obvious.
Instead, you should use a replication tool that supports "offset translation" to ensure consumers can seamlessly switch from one cluster to the other.
For example, Mirror Maker 2, the official Kafka tool for mirroring clusters, supports offset translation via RemoteClusterUtils. You can find the details in the KIP.
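A rough sketch of using that translation at failover time follows; the cluster alias "dc1", the broker address, and the group name are placeholders, and this assumes MirrorMaker 2 is emitting checkpoints to the target cluster.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.mirror.RemoteClusterUtils;

public class TranslateGroupOffsets {
    public static void main(String[] args) throws Exception {
        // Properties pointing at the cluster we are failing over TO (where MM2 writes its checkpoints).
        Map<String, Object> props = new HashMap<>();
        props.put("bootstrap.servers", "dc2-broker:9092");

        // Translate the committed offsets of "my-group" from the source cluster alias "dc1"
        // into offsets that are valid on this cluster.
        Map<TopicPartition, OffsetAndMetadata> translated =
                RemoteClusterUtils.translateOffsets(props, "dc1", "my-group", Duration.ofSeconds(30));

        // Commit the translated offsets so the group resumes from there after failover.
        try (Admin admin = Admin.create(props)) {
            admin.alterConsumerGroupOffsets("my-group", translated).all().get();
        }
    }
}
```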
In itself, relying on both clusters having the same offsets is flawed.
An offset is a relative characteristic. It's not part of the message itself; it's literally a position inside a log file. Those Kafka log files also rotate and are subject to retention, and there's no guarantee that the log files of the two clusters are identical at any given point in time. Kafka doesn't claim to solve such an issue.
Besides, it's tricky to solve from a CAP point of view.
And it's also pointless unless you want strict physical replication.
That's why Kafka multi-cluster tools are usually about logical replication. I have not used MirrorMaker (MM), but I've used Replicator (a more advanced commercial tool by Confluent), and it has a feature for this called, who would have guessed, just like the MM one: offset translation.
Replicator does the following:
1. Reads the consumer offset and timestamp information from the __consumer_timestamps topic in the origin cluster to understand a consumer group's progress.
2. Translates the committed offsets in the origin datacenter to the corresponding offsets in the destination datacenter.
3. Writes the translated offsets to the __consumer_offsets topic in the destination cluster, as long as no consumers in that group are connected to the destination cluster.
Note: You do need to add an interceptor to your Kafka Consumers.
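For what it's worth, wiring that in is a single consumer config entry. The interceptor class name below is the one I recall from Confluent's Replicator docs, so verify it against your Replicator version; the broker address and group name are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class ReplicatorConsumerConfig {
    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "origin-broker:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        // The interceptor publishes the group's progress (offset + timestamp) to the
        // __consumer_timestamps topic, which Replicator reads to translate offsets.
        props.put(ConsumerConfig.INTERCEPTOR_CLASSES_CONFIG,
                "io.confluent.connect.replicator.offsets.ConsumerTimestampsInterceptor");
        return props;
    }
}
```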

How are consumer offsets maintained in a mirrored cluster in Kafka?

Let's say I have two Kafka clusters and I am using MirrorMaker to mirror the topics from one cluster to the other. I understand the consumer commits offsets to the __consumer_offsets topic in its Kafka cluster. I need to know what will happen if the primary Kafka cluster goes down. Do we sync the __consumer_offsets topic as well? Because the secondary cluster could have a different number of brokers and other settings, I think.
Please explain how a Kafka mirrored cluster takes care of consumer offsets.
Does the auto.offset.reset setting play a role here?
Update
Since Apache Kafka 2.7.0, MirrorMaker is able to replicate committed offsets. See https://cwiki.apache.org/confluence/display/KAFKA/KIP-545%3A+support+automated+consumer+offset+sync+across+clusters+in+MM+2.0
Original Answer
MirrorMaker does not replicate offsets.
Furthermore, auto.offset.reset is completely unrelated to this: it's a consumer setting that defines where a consumer should start reading in the case that no valid committed offset is found at startup.
The reason for not mirroring offsets is basically that they can be meaningless on the mirror cluster, because it is not guaranteed that messages will have the same offsets in both clusters.
Thus, in the failover case, you need to figure out something "smart" yourself. One way would be to remember the metadata timestamp of your last processed record. This allows you to seek by timestamp on the mirror cluster to find an approximate offset there. (You will need Kafka 0.10 for this.)
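A sketch of that timestamp-based seek on the mirror cluster, assuming placeholder topic, group, broker address, and remembered timestamp:

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class TimestampFailover {
    public static void main(String[] args) {
        long lastProcessedTs = 1_700_000_000_000L; // epoch millis remembered from the primary DC (placeholder)

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "mirror-broker:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = new ArrayList<>();
            for (PartitionInfo pi : consumer.partitionsFor("my-topic")) {
                partitions.add(new TopicPartition(pi.topic(), pi.partition()));
            }
            consumer.assign(partitions);

            // Find, per partition, the earliest offset whose record timestamp is >= the remembered timestamp.
            Map<TopicPartition, Long> query = new HashMap<>();
            partitions.forEach(tp -> query.put(tp, lastProcessedTs));
            Map<TopicPartition, OffsetAndTimestamp> result = consumer.offsetsForTimes(query);

            result.forEach((tp, oat) -> {
                if (oat != null) consumer.seek(tp, oat.offset());   // approximate resume point
                else consumer.seekToEnd(List.of(tp));               // no records newer than the timestamp
            });

            // ... poll() as usual from here
        }
    }
}
```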

Kafka bi-directional sync between data centers across regions

I have a deployment where we are using Kafka to send messages from the services, but we need to have a master Kafka cluster in every region. So once a message is pushed in one data center, it should be synced to the other, and when a message is produced in the other data center, it should be synced back. MirrorMaker can sync from one cluster to another, but how do I achieve bi-directional sync?
Master-master replication is not available in Kafka; Kafka MirrorMaker can only mirror in one direction.
Why?
Kafka MirrorMaker is basically a combination of a producer and a consumer transferring events from one DC to another, and during this process the offsets of the mirrored topic will differ from those in the source cluster. Now, if we wanted bi-directional mirroring, we would have to keep track of which messages were produced at which end, which is hard (and not worth it) without tweaking all your consumers and producers.
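Conceptually (this is not how MirrorMaker is actually implemented, just a sketch of the consumer-plus-producer idea with placeholder broker addresses and topic name), a mirroring loop looks like this, which also shows why the destination assigns its own offsets:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class NaiveMirror {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put("bootstrap.servers", "dc1-broker:9092");          // source cluster
        cProps.put("group.id", "naive-mirror");
        cProps.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        cProps.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        Properties pProps = new Properties();
        pProps.put("bootstrap.servers", "dc2-broker:9092");          // destination cluster
        pProps.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        pProps.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(List.of("topicA"));
            while (true) {
                for (ConsumerRecord<byte[], byte[]> rec : consumer.poll(Duration.ofSeconds(1))) {
                    // The destination broker appends this record at ITS OWN next offset,
                    // so rec.offset() in DC1 generally differs from the offset assigned in DC2.
                    producer.send(new ProducerRecord<>("topicA", rec.key(), rec.value()));
                }
            }
        }
    }
}
```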
There is no way to make MirrorMaker do master-master replication in Kafka; it will only end up in loops.
If you want to achieve your requirement, you might have to keep data-center-specific topics and aggregate them into a master topic.
Say you want to produce messages to topicA from both DC1 and DC2.
Have topicA-DC1 in DC1 and topicA-DC2 in DC2, and have the master topic topicA in both DC1 and DC2.
Your MirrorMaker setup should aggregate messages from topicA-DC1 and topicA-DC2 into the master topic topicA in both data centres.
I suspect the reason MirrorMaker is one-directional is to avoid "loops" of the same event being read from site A into site B and then synced from B back to A.
If you look at this blog post (specifically the "tiers and aggregation" section), a solution is to have "local" and "aggregate" topics, where you use MM to read from the local topic into remote aggregate topic(s).

Why do Kafka consumers connect to zookeeper, and producers get metadata from brokers?

Why is it that consumers connect to ZooKeeper to retrieve the partition locations, while Kafka producers have to connect to one of the brokers to retrieve metadata?
My point is, what exactly is the use of zookeeper when every broker already has all the necessary metadata to tell producers the location to send their messages? Couldn't the brokers send this same information to the consumers?
I can understand why brokers have the metadata, to not have to make a connection to zookeeper each time a new message is sent to them. Is there a function that zookeeper has that I'm missing? I'm finding it hard to think of a reason why zookeeper is really needed within a kafka cluster.
First of all, ZooKeeper is needed only for the high-level consumer. The SimpleConsumer does not require ZooKeeper to work.
The main reason ZooKeeper is needed by a high-level consumer is to track consumed offsets and handle load balancing.
Now, in more detail.
Regarding offset tracking, imagine the following scenario: you start a consumer, consume 100 messages, and shut the consumer down. The next time you start your consumer you'll probably want to resume from your last consumed offset (which is 100), and that means you have to store the maximum consumed offset somewhere. Here's where ZooKeeper kicks in: it stores offsets for every group/topic/partition. So the next time you start your consumer it can ask, "Hey ZooKeeper, what's the offset I should start consuming from?". Kafka is actually moving towards being able to store offsets not only in ZooKeeper, but in other storages as well (for now only ZooKeeper and Kafka offset storage are available, and I'm not sure the Kafka storage is fully implemented).
Regarding load balancing, the number of messages produced can be too large to be handled by one machine, and at some point you'll want to add computing power. Let's say you have a topic with 100 partitions and 10 machines to handle that volume. Several questions arise here:
how should these 10 machines divide partitions between each other?
what happens if one of machines die?
what happens if you want to add another machine?
And again, here's where ZooKeeper kicks in: it tracks all consumers in the group, and each high-level consumer subscribes to changes in that group. The point is that when a consumer appears or disappears, ZooKeeper notifies all consumers and triggers a rebalance so that they split partitions near-equally (to balance load). This way it guarantees that if one consumer dies, the others will continue processing the partitions that were owned by that consumer.
With Kafka 0.9+, the new consumer API was introduced. New consumers do not need a connection to ZooKeeper, since group balancing is provided by Kafka itself.
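A minimal sketch of such a new-API consumer (topic, group, and broker address are placeholders): it only needs bootstrap.servers, and offsets and group coordination are handled by the brokers rather than ZooKeeper.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class NewApiConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Only brokers are needed; no ZooKeeper connection string anywhere.
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"));   // group membership and rebalancing handled by the brokers
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("%s-%d@%d: %s%n", rec.topic(), rec.partition(), rec.offset(), rec.value());
                }
                consumer.commitSync();                 // committed to __consumer_offsets, not ZooKeeper
            }
        }
    }
}
```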
You are right, consumers don't need to connect to ZooKeeper since the Kafka 0.9 release. They redesigned the API and a new consumer client was introduced:
The 0.9 release introduces beta support for the newly redesigned consumer client. At a high level, the primary difference in the new consumer is that it removes the distinction between the "high-level" ZooKeeper-based consumer and the "low-level" SimpleConsumer APIs, and instead offers a unified consumer API.
and
Finally, this completes a series of projects done in the last few years to fully decouple Kafka clients from ZooKeeper, thus entirely removing the consumer client's dependency on ZooKeeper.