Can Kafka Connect config and offset topics be shared - apache-kafka

Can the two Apache Kafka Connect metadata topics, config.storage.topic and offset.storage.topic, be shared safely between two or more separate Kafka Connect clusters?
While the documentation states for each that:
This must be the same for all workers with the same group.id.
It's not clear from the documentation whether these topics essentially need to be owned by a single Kafka Connect cluster.
Broadly speaking, I can see that my Connect group.id could be used as a prefix for the records in these topics, but it's not clear whether it actually is, nor whether that would be safe to rely on as a way to keep Connect clusters separated from each other's metadata.

Each cluster needs its own dedicated topics.
If you have multiple Kafka Connect clusters trying to use the same config/offset/status topics—even with different group.id set—you'll get a ton of errors and/or unexpected behaviour from Kafka Connect.
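For illustration, worker configurations for two Connect clusters sharing the same Kafka brokers would look something like the following; the topic names and group ids are made up, only the property keys are the real ones:
# worker properties for Connect cluster A
group.id=connect-cluster-a
config.storage.topic=connect-configs-a
offset.storage.topic=connect-offsets-a
status.storage.topic=connect-status-a
# worker properties for Connect cluster B (different group.id AND different topics)
group.id=connect-cluster-b
config.storage.topic=connect-configs-b
offset.storage.topic=connect-offsets-b
status.storage.topic=connect-status-b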

Related

Using Kafka Connect in distributed mode, where are internal topics supposed to exist

As a follow-up to my previous question here, Attempting to run Kafka Connect in distributed mode locally, problem with internal topics, I have started to figure out what might really be going on (I'm learning Kafka as I go).
Kafka Connect, one way or another, requires three internal topics: config, offset, and status. Are these topics supposed to exist in the Kafka cluster where I am consuming data from? For context, what I'm doing is someone else has a Kafka cluster set up that has topics (messages?) for me to consume. I spin up a Kafka Connect cluster on my local machine (to test) and this local instance (we'll call it that going forward) then connects to the remote Kafka cluster (we'll call it the remote cluster) by way of me typing in the bootstrap servers, some callback handler classes, and a connect.jaas file.
Do these three topics need to already exist on the remote cluster? Here I have been trying to create them on my own broker on my local instance, but through continued research, I'm seeing maybe these three internal topics need to be on the remote cluster (where I'm getting my data from). Does the owner of the remote Kafka cluster need to create these three topics for me? Where would they create them exactly? What if their cluster is not a Kafka Connect cluster specifically?
The topics need to be created on the cluster defined by bootstrap.servers in the Connect worker properties. This can be local or remote, depending on what data you actually want the connector tasks to send/receive. Individual Connect tasks cannot override which brokers are being used (it is not possible to use a source connector to write to multiple Kafka clusters, for example).
Recent versions of Kafka Connect will automatically create those internal topics, if the worker is authorized to do so. Otherwise, yes, they'll need to be created using kafka-topics --create with appropriate partition counts and replication factors.
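For example, creating them manually might look like the following; the broker address and topic names are placeholders, and the sizing follows the guidance in the Connect documentation (compacted topics, a single-partition config topic, multi-partition offsets/status topics):
# adjust names, partition counts and replication factor to your environment
kafka-topics --bootstrap-server kafka:9092 --create --topic connect-configs --partitions 1 --replication-factor 3 --config cleanup.policy=compact
kafka-topics --bootstrap-server kafka:9092 --create --topic connect-offsets --partitions 25 --replication-factor 3 --config cleanup.policy=compact
kafka-topics --bootstrap-server kafka:9092 --create --topic connect-status --partitions 5 --replication-factor 3 --config cleanup.policy=compact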
If your data exists in a remote Kafka cluster, the only reason to run a local instance is if you want to use MirrorMaker, for example.
What if their cluster is not a Kafka Connect cluster specifically?
Unclear what this means. Kafka Connect is a client just like a Kafka Streams app or normal producer or consumer. It doesn't store topics itself.

Disable mirrormaker2 offset-sync topics on source kafka cluster

We're using MirrorMaker2 to replicate some topics from one kerberized kafka cluster to another kafka cluster (strictly unidirectional). We don't control the source kafka cluster and we're given only access to describe and read specific topics that are to be consumed.
MirrorMaker2 creates and maintains a topic (mm2-offset-syncs) in the source cluster to encode cluster-to-cluster offset mappings for each topic-partition being replicated and also creates an AdminClient in the source cluster to handle ACL/Config propagation. Because MM2 needs authorization to create and write to these topics in the source cluster, or to perform operations through AdminClient, I'm trying to understand why/if we need these mechanisms in our scenario.
My question is:
In a strictly unidirectional scenario, what is the usefulness of this source-cluster offset-sync topic to MirrorMaker?
If indeed it's superfluous, is it possible to disable it or operate mm2 without access to create/produce to this topic?
If ACL and Config propagation is disabled, is it safe to assume that the AdminClient is not used for anything else?
In the MirrorMaker code, the offset-sync topic is created by the MirrorSourceConnector as soon as it starts, and it is then maintained by the MirrorSourceTask. The same goes for the AdminClient in the MirrorSourceConnector.
I have found no way to toggle these features off, but honestly I might be missing something in my line of thought.
There is an option, introduced in Kafka 3.0, to make MM2 not create the mm2-offset-syncs topic in the source cluster and instead operate on it in the target cluster.
Thanks to KIP-716: https://cwiki.apache.org/confluence/display/KAFKA/KIP-716%3A+Allow+configuring+the+location+of+the+offset-syncs+topic+with+MirrorMaker2
JIRA: https://issues.apache.org/jira/browse/KAFKA-12379
Pull request: https://github.com/apache/kafka/pull/10221
Tim Berglund mentioned KIP-716 in the Kafka 3.0 release video: https://www.youtube.com/watch?v=7SDwWFYnhGA&t=462s
So, to make MM2 operate on the mm2-offset-syncs topic in the target cluster you should:
set option src->dst.offset-syncs.topic.location = target
manually create mm2-offset-syncs.dst.internal topic in the target cluster
start MM2
src and dst are example cluster aliases; replace them with your own (the sketch below uses the same aliases).
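A minimal sketch of the relevant MM2 properties under these assumptions; the bootstrap servers and the topic pattern are placeholders:
clusters = src, dst
src.bootstrap.servers = source-kafka:9092
dst.bootstrap.servers = target-kafka:9092
src->dst.enabled = true
src->dst.topics = orders.*
# KIP-716 (Kafka 3.0+): keep the offset-syncs topic in the target cluster
src->dst.offset-syncs.topic.location = target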
Keep in mind: if the mm2-offset-syncs.dst.internal topic is not created manually in the target cluster, then MM2 will still try to create this topic in the source cluster.
In a one-directional replication setup this topic is useless, because it stays empty the whole time, but MM2 requires it anyway.

Kafka internal topic: Where are the internal topics created - source or target broker?

We are doing a stateful operation. Our cluster is managed, so every time internal topics need to be created we have to ask the admins to unlock the cluster so that the Kafka Streams app can create them. We have control over the target cluster, not the source cluster.
So I wanted to understand on which cluster, source or target, the internal topics are created.
AFAIK, there is only one cluster that the Kafka Streams app connects to, and all topics (source, target, and internal) are created there.
As of now, Kafka Streams applications support connecting to only one cluster, as defined by BOOTSTRAP_SERVERS_CONFIG in the Streams configuration.
As answered above, all source topics reside on those brokers and all internal topics (changelog/repartition topics) are created in the same cluster. The Kafka Streams app will create the target topic in the same cluster as well.
It is worth looking into the server logs to understand and analyze the actual root cause.
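As a rough illustration of the single-cluster point above, the only cluster a Streams app knows about is the one in its own configuration; the values below are placeholders, the keys are the ones behind BOOTSTRAP_SERVERS_CONFIG and APPLICATION_ID_CONFIG:
bootstrap.servers=broker1:9092,broker2:9092
application.id=my-streams-app
# internal changelog/repartition topics are created on this same cluster,
# prefixed with the application.id, e.g. my-streams-app-...-changelog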
As the other answers suggest there should be only one cluster that the Kafka Stream application connects to. Internal topics are created by the Kafka stream application and will only be used by the application that created it. However, there could be some configuration related to security set on the Broker side which could be preventing the streaming application from creating these topics:
If security is enabled on the Kafka brokers, you must grant the underlying clients admin permissions so that they can create the internal topics. For more information, see Streams Security.
Quoted from here
Another point to keep in mind is that the internal topics are automatically created by the Stream application and there is no explicit configuration for auto creation of internal topics.
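If broker-side security is what is blocking the application, one common approach (sketched here with a placeholder principal and application.id) is to grant the Streams principal rights on topics prefixed with its application.id so it can create its internal topics:
# placeholder principal and prefix; adapt to your security setup
kafka-acls --bootstrap-server broker:9092 --add \
  --allow-principal User:my-streams-app \
  --operation All \
  --topic my-streams-app --resource-pattern-type prefixed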

What is the difference between Kafka Cluster and Kafka Broker?

Do Kafka cluster and Kafka broker mean the same thing?
I know a cluster has multiple brokers (is this wrong?).
But when I write code to produce messages, I find this option awkward.
props.put("bootstrap.servers", "kafka001:9092, kafka002:9092, kafka003:9092");
Are these broker addresses or a cluster address? If they are broker addresses, that seems problematic, because we would have to modify the list whenever the number of brokers changes.
(But it does look like broker addresses...)
Additionally, I saw that in Amazon MSK we can add brokers to each AZ.
That means we cannot have many brokers (three or four at most?).
And the guidance is to put these broker addresses into the bootstrap.servers option as a comma-separated list.
Why don't they guide us to use a cluster address or an ARN instead?
A Kafka cluster is a group of Kafka brokers.
When using the Producer API it is not required to list all brokers in the cluster in the bootstrap.servers property. The Producer configuration documentation on bootstrap.servers gives the full details:
A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. This list should be in the form host1:port1,host2:port2,.... Since these servers are just used for the initial connection to discover the full cluster membership (which may change dynamically), this list need not contain the full set of servers (you may want more than one, though, in case a server is down).
All brokers within a cluster share metadata about the other brokers in the same cluster. Therefore, it is sufficient to list even just one broker in the bootstrap.servers property. However, you should still list more than one in case that one broker is unavailable for whatever reason.
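For instance, listing only two of the three brokers from the question would also work; the client discovers kafka003 (and any brokers added later) from whichever broker it reaches first:
// a subset of brokers is enough for bootstrapping
props.put("bootstrap.servers", "kafka001:9092, kafka002:9092");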

Is scaling Kafka Connect the same as scaling a Kafka consumer?

We need to pull data from Kafka and write it to AWS S3. The Kafka cluster is managed by a separate department and we have access only to a specific topic.
Based on the Kafka documentation, it looks like Kafka Connect is an easy solution for me because I don't have any custom message processing logic.
Normally, when we run Kafka consumers, we can run multiple JVMs with the same consumer group for scalability. The JVMs of a specific consumer group can run on the same physical server or on different ones. What would be the case when I want to use Kafka Connect?
Let's say I have 20 partitions of the topic.
How can I run Kafka Connect with 20 instances?
Can I have multiple instances of Kafka Connect running on the same physical instance?
Kafka Connect handles balancing the load across all its workers. With your 20 partitions you could have, for example:
1 Kafka Connect worker, processing 20 partitions
5 Kafka Connect workers, each processing 4 partitions
20 Kafka Connect workers, each processing 1 partition
It depends on your volumes and required throughput.
To run Kafka Connect in Distributed mode across multiple nodes, follow the instructions here and make sure you give them all the same group.id which identifies them as members of the same cluster (and thus eligible for sharing workload of tasks out across them). More config details for distributed mode here.
Even if you're running Kafka Connect on a single node, I would personally recommend running it in Distributed mode as it makes scale-out more simple (you just add additional nodes, but the execution & config remains the same).
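A rough sketch of the worker properties that every node of such a distributed Connect cluster would share; all names and addresses are illustrative:
bootstrap.servers=kafka001:9092,kafka002:9092
group.id=connect-s3-cluster
config.storage.topic=connect-s3-configs
offset.storage.topic=connect-s3-offsets
status.storage.topic=connect-s3-status
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter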
I don't personally see a benefit in running multiple Kafka Connect workers on a single node. Each Kafka Connect worker can run multiple connectors and tasks, as required.
My understanding is that if you only have a single machine, you should launch just one Kafka Connect instance and configure the tasks.max property to the amount of parallelism you'd like to achieve (in your example, 20 might be good). This should allow Kafka Connect to read from your partitions in parallel; see the docs for this here.
You could launch multiple instances on the same machine in theory. It makes sense to do this if you need each instance to consume data from different topics. But if you want the instances to consume data from the same topic, I don't think doing this would benefit you; using separate threads within the same process via tasks.max will give you the same, if not better, performance.
If you want Kafka Connect to run on multiple machines and read data from the same topic, it is possible to run it in distributed mode.
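To tie this back to the 20-partition example, here is a sketch of a sink connector configuration; the connector class shown is Confluent's S3 sink and the topic name is a placeholder, the key setting being tasks.max:
name=s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
topics=my-topic
tasks.max=20
# plus the S3-specific settings (bucket, region, format) the connector requires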