Disable MirrorMaker2 offset-sync topics on source Kafka cluster

We're using MirrorMaker2 to replicate some topics from one Kerberized Kafka cluster to another (strictly unidirectionally). We don't control the source Kafka cluster, and we are only given access to describe and read the specific topics that are to be consumed.
MirrorMaker2 creates and maintains a topic (mm2-offset-syncs) in the source cluster to encode cluster-to-cluster offset mappings for each topic-partition being replicated, and it also creates an AdminClient against the source cluster to handle ACL/config propagation. Because MM2 needs authorization to create and write to this topic in the source cluster, and to perform operations through the AdminClient, I'm trying to understand why, and whether, we need these mechanisms in our scenario.
My questions are:
In a strictly unidirectional scenario, what use is this source-cluster offset-sync topic to MirrorMaker?
If it is indeed superfluous, is it possible to disable it, or to operate MM2 without access to create and produce to this topic?
If ACL and config propagation are disabled, is it safe to assume that the AdminClient is not used for anything else?
In the MirrorMaker code, the offset-sync topic is created by MirrorSourceConnector as soon as it starts and is then maintained by MirrorSourceTask. The same goes for the AdminClient in MirrorSourceConnector.
I have found no way to toggle off these features but honestly I might be missing something in my line of thought.
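For reference, ACL and config propagation in MM2 are controlled per replication flow; a minimal sketch of the relevant properties (the src->dst alias pair is an example):

# disable ACL and topic-config propagation for this flow
src->dst.sync.topic.acls.enabled = false
src->dst.sync.topic.configs.enabled = false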

There is an option, introduced in Kafka 3.0, that makes MM2 not create the mm2-offset-syncs topic in the source cluster and instead operate on it in the target cluster.
Thanks to KIP-716: https://cwiki.apache.org/confluence/display/KAFKA/KIP-716%3A+Allow+configuring+the+location+of+the+offset-syncs+topic+with+MirrorMaker2
JIRA and pull request:
https://issues.apache.org/jira/browse/KAFKA-12379
https://github.com/apache/kafka/pull/10221
Tim Berglund mentioned KIP-716 in his Kafka 3.0 release overview: https://www.youtube.com/watch?v=7SDwWFYnhGA&t=462s
So, to make MM2 operate on the mm2-offset-syncs topic in the target cluster, you should:
set option src->dst.offset-syncs.topic.location = target
manually create mm2-offset-syncs.dst.internal topic in the target cluster
start MM2
src and dst are example aliases; replace them with your own.
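Putting the steps together, a minimal mm2.properties sketch (broker addresses are placeholders):

# clusters and their bootstrap servers
clusters = src, dst
src.bootstrap.servers = src-broker:9092
dst.bootstrap.servers = dst-broker:9092
# one-way replication flow
src->dst.enabled = true
src->dst.topics = .*
# KIP-716 (Kafka 3.0+): keep the offset-syncs topic on the target cluster
src->dst.offset-syncs.topic.location = target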
Keep in mind: if the mm2-offset-syncs.dst.internal topic is not created manually in the target cluster, MM2 will still try to create this topic in the source cluster.
In a one-directional replication setup this topic is useless, since it stays empty the whole time, but MM2 requires it anyway.
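If you create the topic manually, note that MM2's internal topics are single-partition and compacted, so the command might look like this (broker address and replication factor are assumptions):

# create the offset-syncs topic on the target cluster up front
kafka-topics.sh --bootstrap-server dst-broker:9092 --create \
  --topic mm2-offset-syncs.dst.internal \
  --partitions 1 --replication-factor 3 \
  --config cleanup.policy=compact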

Related

Same consumer group (S3 sink connector) across two different Kafka Connect clusters

I'm migrating Kafka connectors from an ECS cluster to a new cluster running on Kubernetes. I successfully migrated the Postgres source connectors by deleting them and recreating them on the exact same replication slots. They keep writing to the same topics in the same Kafka cluster, and the S3 connector in the old cluster continues to read from those topics and write records into S3. Everything works as usual.
But now, to move the AWS S3 sink connectors, I first created a non-critical S3 connector in the new cluster with the same name as the one in the old cluster. I was going to wait a few minutes before deleting the old one to avoid missing data. To my surprise, it looks like (based on the UI provided by akhq.io) the worker for that new S3 connector joins the existing consumer group. I was fully expecting duplicated data. Based on the Confluent doc:
All Workers in the cluster use the same three internal topics to share connector configurations, offset data, and status updates. For this reason all distributed worker configurations in the same Connect cluster must have matching config.storage.topic, offset.storage.topic, and status.storage.topic properties.
So from this "same Connect cluster" wording, I thought having the same consumer group ID only worked within the same Connect cluster. But from my observation, it seems you can have multiple consumers in different clusters belonging to the same consumer group?
Based on this article, __consumer_offsets is used by consumers and, unlike the other hidden "offset"-related topics, it doesn't carry any cluster-name designation.
Does that mean I could simply create the S3 sink connectors in the new Kubernetes cluster and then delete the ones in the ECS cluster without duplicating or missing data (as long as they have the same name, and therefore the same consumer group)? I'm not sure if this is the right pattern people usually use.
I'm not familiar with using a Kafka Connect cluster, but I understand that it is a cluster of workers running connectors, independent of the Kafka cluster itself.
In that case, since the connectors use the same Kafka cluster and you are just moving them from ECS to k8s, it should work as you describe. The consumer offsets and the internal Kafka Connect offsets are stored in the Kafka cluster, so it doesn't really matter where the connectors run as long as they connect to the same Kafka cluster. They should restart from the same position, or behave as additional replicas of the same connector, regardless of where they are running.
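One way to verify this yourself: by default a sink connector's consumer group is named connect-<connector name>, so you can watch its offsets and members from either Connect cluster (broker address and connector name are placeholders):

# shows members, current offsets, and lag for the sink connector's group
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group connect-my-s3-sink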

Kafka HA + Flume: How can I use the Kafka HA configuration with Flume?

environment
Apache Kafka 2.7.0
Apache Flume 1.9.0
Problem
Currently, in our architecture, we are using Flume with a Kafka channel (no source) and an HDFS sink.
In the future, we are going to build a Kafka HA cluster using Kafka MirrorMaker.
That way, even if one cluster is shut down, clients can connect to the other cluster and there is no problem from the failure.
To do this, I think we need Flume to subscribe to topics with a regex pattern.
Assume that cluster A and cluster B exist, and both have a topic called ex. MirrorMaker copies each other's ex, so cluster A has topics ex and b.ex, and cluster B has topics ex and a.ex.
For example, while reading ex and b.ex from cluster A, if there is a failure, it should switch to the opposite cluster and read ex and a.ex.
Like below.
test.channels = c1 c2
c1.channels.kafka.topics.regex = .*ex (impossible with a Kafka channel)
...
c1.sources.kafka.topics.regex = .*ex (possible with a Kafka source)
The Flume Kafka source has a property for subscribing to topics by regex pattern, but no such property exists for the Kafka channel.
Is there any good way?
I'd appreciate it if you could suggest a better way. Thank you.
Sure, using a regex or simply a list of both topics would be preferred, but you then end up with data split across different directories based on the topic name, leaving HDFS clients to merge the data back together.
A channel includes a producer; that is why a regex isn't possible.
By going to the opposite cluster
There's no way Flume will automatically do that unless you modify its bootstrap servers config and restart it. The same applies to any Kafka client, really... This isn't exactly what I'd call "highly available", because all clients pointing at the down cluster will experience downtime.
Instead, you should be running a Flume pipeline (or Kafka Connect) against each cluster. That being said, MirrorMaker would then only be making extra copies of your data, or allowing clients to consume data from the other cluster for their own purposes, rather than acting as a backup/fallback.
Aside: it's unclear from the question, but make sure you are using MirrorMaker 2; that would imply you're already running Kafka Connect, and you could therefore install the HDFS sink rather than needing Flume.
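As a sketch of the per-cluster pipeline idea, a Flume agent with a Kafka source (which does support a topics regex) feeding an HDFS sink could look like the following; the agent name, addresses, and paths are illustrative:

# agent "a1" reads matching topics from cluster A and writes to HDFS
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = clusterA:9092
a1.sources.r1.kafka.topics.regex = .*ex
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
# the Kafka source sets a "topic" header, so data lands in per-topic directories
a1.sinks.k1.hdfs.path = hdfs://namenode/data/%{topic}
a1.sinks.k1.channel = c1

Note the per-topic path: this is exactly the "data split across different directories based on the topic name" trade-off mentioned above.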

Does Confluent Schema Registry keep track of the producers to various Kafka topics?

I am trying to plot an overall topology for my Kafka cluster (i.e., producers-->topics-->consumers).
I'm able to obtain the mapping from topics to consumers using the kafka-consumer-groups.sh script.
However, for the mapping from producers to topics, I understand there is no equivalent script in vanilla Kafka.
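For example, with recent Kafka versions a single pass over all groups gives the topic-to-consumer half of the map (broker address is a placeholder):

# lists every consumer group along with the topics/partitions it is reading
kafka-consumer-groups.sh --bootstrap-server kafka:9092 --describe --all-groups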
Question:
Does the Schema Registry allow us to associate metadata with producers and/or topics or otherwise create a mapping of all producers producing to a particular topic?
Schema Registry has no such functionality.
The closest I've seen to something like this is using distributed tracing (the Brave library) or Cloudera's SMM tool, which requires authorized Kafka clients so it can trace requests and map producer client.id values to topics, then consumer instances to groups.
There's also the Stream Registry project, which I helped with in its initial version, with the vision of managing client state/discovery, but I think it took a different direction and the documentation is not maintained.

Kafka MirrorMaker2 automated consumer offset sync

I am using MirrorMaker2 for DR.
Kafka 2.7 should support automated consumer offset sync.
Here is the YAML file I am using (I use Strimzi to create it).
All source cluster topics are replicated in destination cluster.
Also, a ...checkpoints.internal topic is created in the destination cluster containing all the synced source-cluster offsets, BUT I don't see these offsets being translated into the destination cluster's __consumer_offsets topic, which means that when I start a consumer (same consumer group) in the destination cluster, it will start reading messages from the beginning.
My expectation is that after enabling automated consumer offset sync, all consumer offsets from the source cluster are translated and stored in the __consumer_offsets topic in the destination cluster.
Can someone please clarify whether my expectation is correct, and if not, how it should work?
The sync.group.offsets.enabled setting is for MirrorCheckpointConnector.
I'm not entirely sure how Strimzi runs MirrorMaker 2, but I think you need to set it like this:
checkpointConnector:
  config:
    checkpoints.topic.replication.factor: 1
    sync.group.offsets.enabled: "true"
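In Strimzi terms, that config sits on the checkpoint connector of a mirror entry; a trimmed KafkaMirrorMaker2 sketch (names, version numbers, and addresses are illustrative):

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: my-mm2
spec:
  version: 2.7.0
  replicas: 1
  connectCluster: "target"
  clusters:
    - alias: "source"
      bootstrapServers: source-kafka:9092
    - alias: "target"
      bootstrapServers: target-kafka:9092
  mirrors:
    - sourceCluster: "source"
      targetCluster: "target"
      sourceConnector: {}
      checkpointConnector:
        config:
          checkpoints.topic.replication.factor: 1
          sync.group.offsets.enabled: "true"
      topicsPattern: ".*"
      groupsPattern: ".*"

Also keep in mind that the sync runs periodically (sync.group.offsets.interval.seconds, default 60), and translated offsets are only written for groups with no active members in the target cluster, so an idle target-side group is expected before anything shows up in __consumer_offsets.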

Kafka internal topics: Where are the internal topics created - source or target broker?

We are doing a stateful operation. Our cluster is managed, and every time internal topics need to be created we have to ask the admins to unlock topic creation so that the Kafka Streams app can create them. We have control over the target cluster but not the source cluster.
So, I wanted to understand on which cluster - source or target - the internal topics are created.
AFAIK, there is only one cluster that the Kafka Streams app connects to, and all topics - source, target, and internal - are created there.
So far, Kafka Streams applications support connecting to only one cluster, as defined by BOOTSTRAP_SERVERS_CONFIG in the Streams configuration.
As answered above, all source topics reside on those brokers, and all internal topics (changelog/repartition topics) are created in the same cluster. The Kafka Streams app will create the target topic in the same cluster as well.
It will be worth looking into the server logs to understand and analyze the actual root cause.
As the other answers suggest, there is only one cluster that the Kafka Streams application connects to. Internal topics are created by the Kafka Streams application and will only be used by the application that created them. However, there could be security-related configuration on the broker side preventing the streaming application from creating these topics:
If security is enabled on the Kafka brokers, you must grant the underlying clients admin permissions so that they can create internal topics. For more information, see Streams Security.
Quoted from here
Another point to keep in mind is that internal topics are created automatically by the Streams application; there is no explicit configuration for enabling or disabling that auto-creation.
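As a sketch of the kind of grant that avoids the recurring unlock requests: internal topic names are prefixed with the application.id, so a prefixed ACL can cover all of them (the principal, application.id, and broker address are placeholders):

# allow the Streams principal to create and use every topic named <application.id>-*
kafka-acls.sh --bootstrap-server target-kafka:9092 --add \
  --allow-principal User:streams-app \
  --operation All \
  --resource-pattern-type prefixed \
  --topic my-streams-app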