Kafka Mirror Maker (Multiple Sources --> One Target) possible? - apache-kafka

Short question to the Kafka pros out there. I have multiple datacenters DC_REMOTE_1, DC_REMOTE_2 and DC_LOCAL, where the remote datacenters are each sending messages to one topic.
In the local datacenter (DC_LOCAL) we are running a MirrorMaker which currently transfers the remote topic (events#dc_remote_1) to the local topic (events#dc_local). Is it possible to configure MirrorMaker so that (events#dc_remote_2) is copied to events#dc_local as well?
This is essentially merging different remote topics into one local topic. Or do we run into problems due to the offset management?
Thanks for your help.

I have worked on a similar requirement: you can have one independent MirrorMaker running in each remote DC (DC_REMOTE_1 and DC_REMOTE_2), which acts as a consumer of that DC's own Kafka cluster and sends the data to DC_LOCAL (this way there are no issues with offset management).
Note that the capacity of the DC_LOCAL Kafka cluster should be roughly twice that of the DC_REMOTE_1/2 clusters.
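For illustration, a minimal sketch of that layout with classic MirrorMaker (hostnames, file names and the events topic name are placeholders); one such process runs per remote DC, and since classic MirrorMaker keeps the topic name, both copies land in the same local topic:

    # consumer.properties -- reads from this remote DC's own cluster (example hosts)
    bootstrap.servers=kafka.dc-remote-1.example:9092
    group.id=mirrormaker-to-dc-local

    # producer.properties -- writes to the aggregate cluster in DC_LOCAL
    bootstrap.servers=kafka.dc-local.example:9092

    # one MirrorMaker process per remote DC, both pointing at the same target cluster
    kafka-mirror-maker.sh --consumer.config consumer.properties \
        --producer.config producer.properties --whitelist "events"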
Let me know if this helps your requirement.

Related

Using Kafka Connect in distributed mode, where are internal topics supposed to exist?

As a follow-up to my previous question, "Attempting to run Kafka Connect in distributed mode locally, problem with internal topics", I have started to figure out what might really be going on (I'm learning Kafka as I go).
Kafka Connect, one way or another, requires three internal topics: config, offset, and status. Are these topics supposed to exist in the Kafka cluster I am consuming data from? For context: someone else has a Kafka cluster set up with topics for me to consume. I spin up a Kafka Connect cluster on my local machine (to test), and this local instance (we'll call it that going forward) connects to the remote Kafka cluster (we'll call it the remote cluster) by way of me supplying the bootstrap servers, some callback handler classes, and a connect.jaas file.
Do these three topics need to already exist on the remote cluster? So far I have been trying to create them on my own broker on my local instance, but through continued research I'm seeing that maybe these three internal topics need to be on the remote cluster (where I'm getting my data from). Does the owner of the remote Kafka cluster need to create these three topics for me? Where exactly would they create them? What if their cluster is not a Kafka Connect cluster specifically?
The topics need to be created on the cluster defined by bootstrap.servers in the Connect worker properties. This can be local or remote, depending on where you actually want the connector tasks to send/receive data. Individual Connect tasks cannot override which brokers are used (it is not possible, for example, to use a source connector to write to multiple Kafka clusters).
Recent versions of Kafka Connect will create those internal topics automatically if the worker is authorized to do so. Otherwise, yes, they'll need to be created using kafka-topics --create with appropriate partition counts and replication factors.
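For example, something along these lines (the topic names here are placeholders and must match config.storage.topic, offset.storage.topic and status.storage.topic in the worker properties; the broker address, partition counts and replication factor are examples):

    kafka-topics --create --bootstrap-server broker:9092 --topic connect-configs \
        --partitions 1 --replication-factor 3 --config cleanup.policy=compact
    kafka-topics --create --bootstrap-server broker:9092 --topic connect-offsets \
        --partitions 25 --replication-factor 3 --config cleanup.policy=compact
    kafka-topics --create --bootstrap-server broker:9092 --topic connect-status \
        --partitions 5 --replication-factor 3 --config cleanup.policy=compact

Note that the config topic must have exactly one partition, and all three topics should be compacted.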
If your data exists in a remote Kafka cluster, the only reason to also run a local Kafka cluster is, for example, if you want to mirror it with MirrorMaker.
What if their cluster is not a Kafka Connect cluster specifically?
Unclear what this means. Kafka Connect is a client just like a Kafka Streams app or a regular producer or consumer. It doesn't store topics itself.

Is scaling Kafka Connect the same as scaling a Kafka consumer?

We need to pull data from Kafka and write it into AWS S3. The Kafka cluster is managed by a separate department and we have access to only a specific topic.
Based on the Kafka documentation, it looks like Kafka Connect is an easy solution for me because I don't have any custom message-processing logic.
Normally when we run a Kafka consumer we can run multiple JVMs with the same consumer group for scalability, and those consumer JVMs can run on the same physical server or on different ones. What is the equivalent when I want to use Kafka Connect?
Let's say I have 20 partitions of the topic.
How can I run Kafka Connect with 20 instances?
Can I have multiple instances of Kafka Connect running on the same physical machine?
Kafka Connect handles balancing the load across all its workers. In your example of 20 partitions, you could have, for example:
1 Kafka Connect worker, processing 20 partitions
5 Kafka Connect workers, each processing 4 partitions
20 Kafka Connect workers, each processing 1 partition
It depends on your volumes and required throughput.
To run Kafka Connect in distributed mode across multiple nodes, follow the instructions here and make sure you give them all the same group.id, which identifies them as members of the same cluster (and thus eligible for sharing the workload of tasks across them). More config details for distributed mode here.
Even if you're running Kafka Connect on a single node, I would personally recommend running it in distributed mode, as it makes scale-out simpler (you just add additional nodes, but the execution and config remain the same).
I don't see a benefit in running multiple Kafka Connect workers on a single node. Each Kafka Connect worker can run multiple tasks and connectors as required.
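For reference, a distributed-mode worker file could look roughly like this (broker addresses, names and converters are example values); every node gets the same file, and the shared group.id is what makes the workers one Connect cluster:

    # connect-distributed.properties (example values)
    bootstrap.servers=broker1:9092,broker2:9092
    # must be identical on every worker that should join this Connect cluster
    group.id=connect-cluster-1
    key.converter=org.apache.kafka.connect.json.JsonConverter
    value.converter=org.apache.kafka.connect.json.JsonConverter
    config.storage.topic=connect-configs
    offset.storage.topic=connect-offsets
    status.storage.topic=connect-status
    # start one worker per node:
    #   connect-distributed.sh connect-distributed.properties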
My understanding is that if you only have a single machine, you should launch just one Kafka Connect instance and set the tasks.max property to the degree of parallelism you'd like to achieve (in your example, 20 might be good). This should allow Kafka Connect to read from your partitions in parallel; see the docs for this here.
You could launch multiple instances on the same machine in theory. It makes sense to do this if you need each instance to consume data from different topics, but if you want the instances to consume data from the same topic, I don't think it would benefit you: using separate threads within the same process via tasks.max will give you the same if not better performance.
If you want Kafka Connect to run on multiple machines and read data from the same topic, it is possible to run it in distributed mode.
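As a concrete sketch for the S3 use case (the names and values are hypothetical, and this assumes the Confluent S3 sink connector plugin is installed on the workers), a connector with tasks.max set to 20 lets Connect spread the 20 partitions over up to 20 tasks:

    # submit the connector to any worker's REST endpoint (host/port are examples)
    curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
      "name": "s3-sink",
      "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "my-topic",
        "tasks.max": "20",
        "s3.bucket.name": "my-bucket",
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "1000"
      }
    }'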

Kafka Cluster Architecture - Mirror Maker

I have a Kafka cluster composed of 5 brokers and 4 MirrorMaker instances to mirror data from 2 different data centers. I know that a Kafka broker requires its own dedicated hardware, especially because it is a disk-I/O-, memory- and CPU-intensive application.
I would like to know whether it could make sense to deploy a MirrorMaker process on a node that is also a Kafka broker, or whether I should consider running the MirrorMaker on:
a dedicated node
a node which hosts a ZooKeeper server
HDFS and other Cloudera services are deployed on different nodes.
Thanks in advance,
Beniamino
MirrorMaker is just a regular Java Producer/Consumer pair.
If you wrote an application to read from the remote data center, would it make sense to run it on its own hardware? Do you have the resources available to do so? I personally wouldn't run it on a broker or ZooKeeper node.
If you're running in a data center with Docker or Kubernetes available, you can deploy all mirroring instances in their own containers. Or you can run all topics in one JVM using a regex whitelist pattern.
However you choose to deploy, it's recommended that the MirrorMaker's consumer pull data from the remote data center and produce to the local cluster.
Confluent has discussions about this topic.
Edit: As of Kafka 2.4, MirrorMaker 2 is built on the Kafka Connect framework and is the recommended deployment going forward.
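For completeness, a minimal MirrorMaker 2 sketch (the cluster aliases and hosts are placeholders); note that by default MM2 prefixes mirrored topics with the source cluster alias:

    # mm2.properties
    clusters = remote, local
    remote.bootstrap.servers = kafka.remote-dc.example:9092
    local.bootstrap.servers = kafka.local-dc.example:9092
    # mirror every topic from "remote" into "local"
    remote->local.enabled = true
    remote->local.topics = .*
    # start it with: connect-mirror-maker.sh mm2.properties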

Kafka partitions and offsets disappeared

My Kafka clients are running in the GCP App Engine Flex environment with autoscaling enabled (GCP keeps the instance count to at least two, and it has mostly been 2 due to low CPU usage). The consumer groups running in those 2 VMs have been consuming messages from various topics with 20 partitions for several months, and recently I noticed that partitions in older topics shrank to just 1 (!) and the offsets for that consumer group were reset to 0. The topic-[partition] directories were also gone from the kafka-logs directory. Strangely, recently created topic partitions are intact. I have 3 different environments (all in GCP) and this happened in all three. We didn't see any lost messages or data problems, but we want to understand what happened to avoid it happening again.
The Kafka broker and ZooKeeper are running on the same, single GCP Compute Engine instance (I know it's not best practice and I plan to improve this), and I suspect it has something to do with a machine restart wiping out some information. However, I verified that the data files are written under the /opt/bitnami/(kafka|bitnami) directory and not /tmp, which can be cleared by machine restarts.
Spring Kafka 1.1.3
kafka client 0.10.1.1
single node kafka broker 0.10.1.0
single node zookeeper 3.4.9
Any insights on this will be appreciated!
Bitnami developer here. I could reproduce the issue and track it down to an init script that was clearing the content of the tmp/kafka-logs/ folder.
We released a new revision of the Kafka installers, virtual machines and cloud images that fixes the issue. The revision that includes the fix is 1.0.0-2.
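If you want to double-check that your own broker isn't writing under a temporary path, a quick check along these lines may help (the file location is an example and varies by install):

    # show where the broker stores partition data
    grep "^log.dirs" /opt/bitnami/kafka/config/server.properties
    # it should point at a persistent location, e.g. log.dirs=/opt/bitnami/kafka/data,
    # not anything under /tmp; the same applies to ZooKeeper's dataDir in zoo.cfg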

How to scale a single-node Kafka to a multi-node cluster?

I am going to install Kafka for company messaging. The plan is to first install Kafka on a single huge machine and scale it out to 4-5 machines (a cluster) later if needed.
I have little experience with Kafka. I want to ask whether it is possible to scale just by changing parameters in the broker configuration and installing ZooKeeper on the newly joined machines.
Or how can I roughly do this in the easiest way? More specifically, with Cloudera Kafka in CDH.
Thanks
To scale Kafka you will have to add more partitions to topics, if needed, using kafka-topics.sh, and then reassign partitions to your new brokers using kafka-reassign-partitions.sh.
The reassignment utility will replicate and move your data automatically. You can do it for a whole topic or for a selected set of partitions.
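Roughly, the steps look like this (topic names, broker IDs and hosts are examples; older brokers take --zookeeper instead of --bootstrap-server):

    # 1. add partitions to an existing topic if needed
    kafka-topics.sh --bootstrap-server broker1:9092 --alter --topic my-topic --partitions 12
    # 2. generate a reassignment plan that includes the new broker IDs
    kafka-reassign-partitions.sh --bootstrap-server broker1:9092 \
        --topics-to-move-json-file topics.json --broker-list "1,2,3,4,5" --generate
    # 3. save the proposed plan to reassignment.json and execute it
    kafka-reassign-partitions.sh --bootstrap-server broker1:9092 \
        --reassignment-json-file reassignment.json --execute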
The complete documentation is here. Just take a look at section 6.