Is there a way to replicate kafka topics from one cluster to another using java? - apache-kafka

I need to replicate Kafka topics from one cluster to another using a Java process. New messages arriving in the source cluster should also be replicated to the destination cluster. Is there any simple Java code to do this?

Related

Same consumer group (s3 sink connector) across two different kafka connect cluster

I'm migrating Kafka connectors from an ECS cluster to a new cluster running on Kubernetes. I successfully migrated the Postgres source connectors over by deleting them and recreating them on the exact replication slots. They keep writing to the same topics in the same Kafka cluster. And the S3 connector in the old cluster continues to read from those and write records into S3. Everything works as usual.
But now, to move the AWS S3 sink connectors, I first created a non-critical S3 connector in the new cluster with the same name as the one in the old cluster. I was going to wait a few minutes before deleting the old one to avoid missing data. To my surprise, it looks like (based on the UI provided by akhq.io) the one worker for that new S3 connector joined the existing consumer group. I was fully expecting to get duplicated data. According to the Confluent docs:
All Workers in the cluster use the same three internal topics to share
connector configurations, offset data, and status updates. For this
reason all distributed worker configurations in the same Connect
cluster must have matching config.storage.topic, offset.storage.topic,
and status.storage.topic properties.
So from this "same Connect cluster", I thought having the same consumer group id only works within the same connect cluster. But from my observation, it seems like you could have multiple consumers in different clusters belonging to the same consumer group?
Based on this article, __consumer_offsets is used by consumers, and unlike other hidden "offset" related topics, it doesn't have any cluster name designation.
Does that mean I could simply create the S3 sink connectors in the new Kubernetes cluster and then delete the ones in the ECS cluster without duplicating or missing data (as long as they have the same name, and therefore the same consumer group)? I'm not sure if this is the pattern people usually use.
I'm not familiar with using a Kafka Connect Cluster but I understand that it is a cluster of connectors that is independent of the Kafka cluster.
In that case, since the connectors are using the same Kafka cluster and you are just moving them from ECS to k8s, it should work as you describe. The consumer offsets and the internal Kafka Connect offsets are stored in the Kafka cluster, so it doesn't really matter where the connectors run as long as they connect to the same Kafka cluster. They should restart from the same position, or behave as additional replicas of the same connector, regardless of where they are running.
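To make the mechanism a bit more concrete: by default a sink connector's consumer group ID is derived from the connector name as connect-<name>, and the brokers track committed offsets per group, not per Connect cluster. Below is a minimal sketch of that naming rule. The class and the connector name s3-sink-orders are hypothetical, and it assumes the group ID has not been overridden in the connector or worker configuration.

```java
// Sketch only: shows why two sink connectors with the same name, pointed at the
// same Kafka cluster, end up in one consumer group. Names here are hypothetical.
final class SinkGroupId {

    // By default Kafka Connect prefixes the sink connector name with "connect-".
    static String defaultGroupId(String connectorName) {
        return "connect-" + connectorName;
    }

    public static void main(String[] args) {
        String onEcs = defaultGroupId("s3-sink-orders"); // worker in the old Connect cluster
        String onK8s = defaultGroupId("s3-sink-orders"); // worker in the new Connect cluster

        // Same group ID on the same Kafka cluster means shared committed offsets.
        System.out.println(onEcs);
        System.out.println("same group: " + onEcs.equals(onK8s));
    }
}
```

While both connectors are running, the group's partitions are split between their tasks, which is why no duplicates show up.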

Using Kafka Connect in distributed mode, where are internal topics supposed to exist

As a follow up to my previous question here Attempting to run Kafka Connect in distributed mode locally, problem with internal topics, I have started to figure out what might really be going on (I'm learning Kafka as I go).
Kafka Connect, one way or another, requires three internal topics: config, offset, and status. Are these topics supposed to exist in the Kafka cluster I am consuming data from? For context, someone else has set up a Kafka cluster with topics (messages?) for me to consume. I spin up a Kafka Connect cluster on my local machine (to test), and this local instance (we'll call it that going forward) then connects to the remote Kafka cluster (we'll call it the remote cluster) using the bootstrap servers, some callback handler classes, and a connect.jaas file that I supply.
Do these three topics need to already exist on the remote cluster? Here I have been trying to create them on my own broker on my local instance, but through continued research, I'm seeing maybe these three internal topics need to be on the remote cluster (where I'm getting my data from). Does the owner of the remote Kafka cluster need to create these three topics for me? Where would they create them exactly? What if their cluster is not a Kafka Connect cluster specifically?
The topics need to be created on the cluster defined by bootstrap.servers in the Connect worker properties. This can be local or remote, depending on what data you actually want the connector tasks to send/receive. Individual Connect tasks cannot override which brokers are being used (it is not possible to use a source connector to write to multiple Kafka clusters, for example).
Recent versions of Kafka Connect will automatically create those internal topics if the worker is authorized to do so. Otherwise, yes, they'll need to be created using kafka-topics --create with appropriate partition counts and replication factors.
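As a rough illustration of the point above, here is a sketch of the distributed worker settings that tie the three internal topics to whatever cluster bootstrap.servers points at. The property keys are the standard Connect worker keys; the host name, group ID, and topic names are placeholders I made up, and a real worker also needs converters and, in your case, the security settings.

```java
import java.util.Properties;
import java.util.TreeSet;

// Sketch of distributed-worker settings. Host, group, and topic names are placeholders.
final class ConnectWorkerConfig {

    static Properties workerProps(String bootstrapServers) {
        Properties p = new Properties();
        // The internal topics live on this cluster, local or remote.
        p.setProperty("bootstrap.servers", bootstrapServers);
        // Workers sharing a group.id (and these three topics) form one Connect cluster.
        p.setProperty("group.id", "my-connect-cluster");
        p.setProperty("config.storage.topic", "my-connect-configs");
        p.setProperty("offset.storage.topic", "my-connect-offsets");
        p.setProperty("status.storage.topic", "my-connect-status");
        // Used only when Connect creates the topics itself.
        p.setProperty("config.storage.replication.factor", "3");
        p.setProperty("offset.storage.replication.factor", "3");
        p.setProperty("status.storage.replication.factor", "3");
        return p;
    }

    public static void main(String[] args) {
        Properties p = workerProps("remote-broker:9092");
        // Print in key order, in the same key=value form as a worker properties file.
        for (String key : new TreeSet<>(p.stringPropertyNames())) {
            System.out.println(key + "=" + p.getProperty(key));
        }
    }
}
```

If the topics are created by hand instead, keep in mind that the config topic must have a single partition and that all three should be compacted.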
If your data exists in a remote Kafka cluster, the only reason to run a local instance is if you want to use MirrorMaker, for example.
What if their cluster is not a Kafka Connect cluster specifically?
Unclear what this means. Kafka Connect is a client just like a Kafka Streams app or normal producer or consumer. It doesn't store topics itself.

Migration Cloudera Kafka (CDK) to Apache Kafka

I am looking to migrate a small 4-node Kafka cluster, with about 300GB of data on each broker, to a new cluster. The problem is that we are currently running Cloudera's flavor of Kafka (CDK) and we would like to run Apache Kafka. For the most part CDK is very similar to Apache Kafka, but I am trying to figure out the best way to migrate. I originally looked at using MirrorMaker, but to my understanding consumers will re-process messages once we cut them over to the new cluster, so I think that is out. I was wondering if we could spin up a new Apache Kafka cluster and add it to the CDK cluster (not sure how this will work yet, if at all), then decommission the CDK servers one at a time. Otherwise I am out of ideas other than spinning up a new Apache Kafka cluster and changing every producer/consumer to point to the new cluster, which I am not really a fan of as it will cause downtime.
Currently running 3.1.0 which is equivalent to Apache Kafka 1.0.1
MirrorMaker would copy the data, but not the consumer offsets, so after the cutover each consumer would start wherever its configured auto.offset.reset policy puts it.
I was wondering if we could spin up a new Apache Kafka cluster and add it to the CDK cluster
If possible, that would be the most effective way to migrate the cluster. For each new broker, give it a unique broker ID and the same Zookeeper connection string as the others, then it'll be part of the same cluster.
Then, you'll need to manually run the partition reassignment tool to move all existing topic partitions off the old brokers and onto the new ones, as data will not be replicated to them automatically.
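The reassignment tool takes a JSON file describing the target replica list for each partition. As a sketch, the helper below builds such a file for one topic by spreading replicas round-robin over the new broker IDs. The class, topic name, and broker IDs are hypothetical, the round-robin placement is just one simple choice, and the tool can also generate a proposed plan itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Sketch: builds reassignment JSON ({"version":1,"partitions":[...]}) for one topic.
// Topic names are not escaped, which is fine for legal Kafka topic characters.
final class ReassignmentPlan {

    static String plan(String topic, int partitionCount, List<Integer> newBrokerIds, int replicationFactor) {
        if (replicationFactor > newBrokerIds.size()) {
            throw new IllegalArgumentException("not enough brokers for the replication factor");
        }
        StringJoiner entries = new StringJoiner(",", "{\"version\":1,\"partitions\":[", "]}");
        for (int p = 0; p < partitionCount; p++) {
            // Round-robin: partition p starts at broker index p and wraps around.
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replicationFactor; r++) {
                replicas.add(Integer.toString(newBrokerIds.get((p + r) % newBrokerIds.size())));
            }
            entries.add("{\"topic\":\"" + topic + "\",\"partition\":" + p
                    + ",\"replicas\":[" + String.join(",", replicas) + "]}");
        }
        return entries.toString();
    }

    public static void main(String[] args) {
        // Hypothetical: move a 3-partition topic onto new brokers 5, 6, 7 with 2 replicas each.
        System.out.println(plan("orders", 3, Arrays.asList(5, 6, 7), 2));
    }
}
```

The resulting file would then be passed to kafka-reassign-partitions with --reassignment-json-file and --execute.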
Alternatively, you could try shutting down the CDK cluster, backing up the data directories onto new brokers, then starting the same version of Kafka from your CDK on those new machines (as the stored log format is important).
Also make sure that you back up a copy of the server.properties files for the new brokers.

Listen to a topic continuously, fetch data, perform some basic cleansing

I need to build a Java-based Kafka streaming application that will listen to a topic X continuously, fetch data, perform some basic cleansing, and write to an Oracle database. The Kafka cluster is outside my domain and I have no ability to deploy any code or configurations in it.
What is the best way to design such a solution? I came across Kafka Streams but was confused as to if it can be used for 'Topic > Process > Topic' scenarios?
I came across Kafka Streams but was confused as to if it can be used for 'Topic > Process > Topic' scenarios?
Absolutely.
For example, excluding the "process" step, it's two lines outside of the configuration setup.
final StreamsBuilder builder = new StreamsBuilder();
builder.stream("streams-plaintext-input").to("streams-pipe-output");
This code is straight from the documentation.
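For the "process" step left out above, the cleansing logic can be a plain function that is unit-tested without any Kafka dependency. Here is a minimal sketch; the class name and the cleansing rules (trim, collapse whitespace, lower-case) are assumptions for illustration, not anything from the question.

```java
import java.util.Locale;

// Sketch of a "basic cleansing" step kept free of Kafka dependencies.
// The rules below are illustrative assumptions only.
final class RecordCleanser {

    // Trims, collapses runs of whitespace to one space, and lower-cases; null stays null.
    static String clean(String value) {
        if (value == null) {
            return null;
        }
        return value.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println("[" + clean("  Hello   WORLD ") + "]");
    }
}
```

Assuming String values, a function like this could be passed to mapValues between stream(...) and to(...).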
If you want to write to any database, you should first check if there is a Kafka Connect plugin to do that for you. Kafka Streams shouldn't really be used to read/write from/to external systems outside of Kafka, as it is latency-sensitive.
In your case, the JDBC Sink Connector would work well.
The Kafka cluster is outside my domain and I have no ability to deploy any code or configurations in it.
Using either solution above, you don't need to, but you will need some machine with Java installed to run a continuous Kafka Streams application and/or Kafka Connect worker.

Load data from separate kafka cluster to Samza?

I am trying to create a Samza job that resembles the Wikipedia example job as closely as I can make it. However, in the "WikipediaFeed" object I am trying to get data from a different Kafka broker than the one that is running when you start the Hello-Samza grid.
Do I have to create a thread safe Kafka consumer inside the "WikipediaFeed" object to consume data from a different Kafka cluster or is there another way I'm not seeing?
Edit 1:
Here is a link to their Wikipedia example.
https://github.com/apache/samza-hello-samza/tree/master/src/main
Thanks
In your example you need to change this config (https://github.com/apache/samza-hello-samza/blob/master/src/main/config/wikipedia-feed.properties):
systems.kafka.consumer.zookeeper.connect=KAFKA_CLUSTER_FRONTING:2181
systems.kafka.producer.bootstrap.servers=KAFKA_CLUSTER_FRONTING:9092
task.inputs=kafka.topic1,kafka.topic2,kafka.topic3
Change the config to point at your fronting Kafka cluster, and add your topics to task.inputs, separated with ",".
Edit:
Just to be clear, you can deploy your Samza job into cluster 1 and consume a Kafka topic from another cluster. You need to change the config in your Samza properties.
For more information: Samza config
Then, if you need to send your messages to another Kafka cluster after processing, you will need to create another system in your config.
For more information: https://samza.apache.org/learn/documentation/0.13/api/overview.html
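A rough sketch of what "another system" could look like, written as Java Properties so it can be printed in properties-file form. It follows the same key pattern as the config above; the second system name kafkaout and the KAFKA_CLUSTER_OUTPUT host are names I invented for illustration, so check the keys against the Samza configuration reference for your version.

```java
import java.util.Properties;
import java.util.TreeSet;

// Sketch: two Kafka systems in one Samza job config. "kafkaout" and the
// KAFKA_CLUSTER_OUTPUT host are hypothetical names, not from the example repo.
final class SamzaTwoSystems {

    static Properties config() {
        Properties p = new Properties();
        String factory = "org.apache.samza.system.kafka.KafkaSystemFactory";

        // Input system: the fronting cluster the job consumes from.
        p.setProperty("systems.kafka.samza.factory", factory);
        p.setProperty("systems.kafka.consumer.zookeeper.connect", "KAFKA_CLUSTER_FRONTING:2181");
        p.setProperty("systems.kafka.producer.bootstrap.servers", "KAFKA_CLUSTER_FRONTING:9092");

        // Output system: a second cluster the job produces to.
        p.setProperty("systems.kafkaout.samza.factory", factory);
        p.setProperty("systems.kafkaout.consumer.zookeeper.connect", "KAFKA_CLUSTER_OUTPUT:2181");
        p.setProperty("systems.kafkaout.producer.bootstrap.servers", "KAFKA_CLUSTER_OUTPUT:9092");

        // Inputs are addressed as <system>.<topic>.
        p.setProperty("task.inputs", "kafka.topic1,kafka.topic2,kafka.topic3");
        return p;
    }

    public static void main(String[] args) {
        Properties p = config();
        for (String key : new TreeSet<>(p.stringPropertyNames())) {
            System.out.println(key + "=" + p.getProperty(key));
        }
    }
}
```

In the task itself, messages would then be sent to a stream on the kafkaout system rather than on kafka.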