Kafka Cluster Architecture - Mirror Maker - apache-kafka

I have a Kafka cluster composed of 5 brokers and 4 MirrorMaker instances that mirror data from 2 different data centers. I know that a Kafka broker requires its own dedicated hardware, mainly because of its high disk I/O, memory usage, and CPU load.
I would like to know whether it makes sense to deploy a MirrorMaker process on a node that is also a Kafka broker, or whether I should run MirrorMaker on:
a dedicated node
a node which hosts a ZooKeeper server
HDFS and other Cloudera services are deployed on different nodes.
Thanks in advance,
Beniamino

MirrorMaker is just a regular Java Producer/Consumer pair.
If you wrote an application to read from the remote data center, would it make sense to run it on its own hardware? Do you have the resources available to do so? I personally wouldn't run it on a broker or zookeeper.
If you're running in a data center with Docker or Kubernetes available, you can deploy all mirroring instances in their own containers. Or you can run all topics in one JVM using a regex whitelist pattern.
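To make the regex whitelist option concrete, here is a small Python sketch of how one pattern can select many topics for a single MirrorMaker process. The topic names and the pattern are made up for illustration, and MirrorMaker evaluates its whitelist as a Java regular expression, so treat this only as a way to reason about which topics a pattern would cover.

```python
import re

def topics_matching(whitelist, topics):
    """Return the topics whose full name matches the whitelist pattern."""
    pattern = re.compile(whitelist)
    return [t for t in topics if pattern.fullmatch(t)]

# Hypothetical topic names in the remote cluster.
remote_topics = ["orders.eu", "orders.us", "payments.eu", "__consumer_offsets"]

# One pattern covers every "orders." topic, so a single process can mirror them all.
print(topics_matching(r"orders\..*", remote_topics))
```

A broad pattern keeps deployment simple at the cost of putting all mirrored topics in one JVM; narrower patterns let you split the load across several processes.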
However you choose to deploy, it's recommended to run MirrorMaker so that its consumer pulls data from the remote data center and its producer writes to the local cluster.
Confluent has discussions about this topic
Edit: As of Kafka 2.4, MirrorMaker2 is built on the Kafka Connect framework and is the recommended deployment going forward

Related

Where to run Kafka stream processor?

I'm playing around with Apache Kafka a bit and have a functional multi-node cluster configured. I want to now introduce a Kafka Stream Processor. I'll just do something simple, but here's my question: Where do I run it? I know I can run it as a standalone jar on any machine, but is that the correct place to run it? Do I run it on a worker node? Can I run it via the distributed Kafka Connect worker API? I saw documentation that says multiple instances of the same processor will be aware of each other....how? Is that handled in the Java Kafka libraries behind the scenes?
Basically, how do I deploy a processor at scale? Presumably I wouldn't manually start 10 (or 100 or 1000) instances of the same processor.
Assume I am NOT using Kubernetes for this, please. Also assume I am using the community-only packages for the Confluent Platform.
Kafka Connect does not run Kafka Streams applications.
ksqlDB, on the other hand, offers an abstraction layer for Kafka Streams applications and offers an embedded Connect worker.
Otherwise, yes, you simply run the Kafka Streams JAR files, anywhere that has network access to your Kafka cluster. Ideally, not on the cluster itself as it'll be competing for RAM and disk space.
And none of the above require Confluent Platform.
"How do I deploy a processor at scale? Presumably I wouldn't manually start 10 (or 100 or 1000) instances of the same processor."
Well, you can only have as many active threads as there are partitions in your processor's input topics. The total number of threads is controlled by num.stream.threads and the number of Streams processes you start.
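As a rough illustration of that limit, the arithmetic can be sketched in Python. This assumes a simple topology where the input partition count is the only bound on parallelism; the function name and the numbers are hypothetical.

```python
def active_stream_threads(input_partitions, processes, threads_per_process):
    """Threads that can actually be assigned work; any others sit idle."""
    total_threads = processes * threads_per_process
    return min(input_partitions, total_threads)

# 12 input partitions, 4 processes with num.stream.threads=2: all 8 threads are busy.
print(active_stream_threads(12, 4, 2))

# 12 input partitions, 10 processes with num.stream.threads=2: 8 of the 20 threads are idle.
print(active_stream_threads(12, 10, 2))
```

In other words, starting 100 or 1000 instances only helps if the input topics have at least that many partitions.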
If you're not deploying into Kubernetes, then you can still use other options like Puppet, Ansible, Supervisor, Hashicorp Nomad's Java Driver, etc.

Do you need multiple zookeeper instances to run a multiple-broker kafka?

I'm new to kafka.
Kafka is supposed to be used as a distributed service, but the tutorials and blog posts I found online never mention whether there should be one or several ZooKeeper nodes.
The tutorials just start one ZooKeeper instance and then multiple Kafka brokers.
Is this how it is supposed to be done?
ZooKeeper is a centralized coordination service for distributed systems. Clusters use it for distributed synchronization and to maintain shared metadata such as configuration information and naming.
In a typical architecture, a Kafka cluster is served by 3 ZooKeeper nodes. For a very large deployment this can be increased to 5 ZooKeeper nodes, but that adds load, because all metadata-related activity is handled by ZooKeeper and every node has to stay in sync.
It should also be noted that newer Kafka releases reduce the dependency on ZooKeeper in order to improve metadata scalability, remove the complexity of maintaining metadata in an external component, and improve recovery from unexpected shutdowns. With the new approach, controller failover is almost instantaneous. This is achieved by the Kafka Raft metadata mode, termed 'KRaft', which runs Kafka without ZooKeeper by moving the responsibilities previously handled by ZooKeeper into a service inside the Kafka cluster itself, built on the event-based mechanism of the KRaft protocol.
Tutorials generally keep things nice and simple, so one ZooKeeper (often one Kafka broker too). Useful for getting started; useless for any kind of resilience :)
In practice, you are going to need three ZooKeeper nodes minimum.
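The reason three is the practical minimum is that a ZooKeeper ensemble only stays available while a majority of its nodes are up. A short Python sketch of that arithmetic (the function names are just for illustration):

```python
def quorum_size(ensemble_size):
    """Smallest majority of the ensemble."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size):
    """Nodes that can be lost while a majority is still available."""
    return ensemble_size - quorum_size(ensemble_size)

# Columns: ensemble size, quorum size, tolerated failures.
for n in (1, 3, 4, 5, 6):
    print(n, quorum_size(n), tolerated_failures(n))
```

A single node tolerates no failures, three nodes tolerate one, and five tolerate two. An even-sized ensemble tolerates no more failures than the odd size just below it, which is why ensembles are normally sized at 3 or 5.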
If it helps, here is an enterprise reference architecture whitepaper for the deployment of Apache Kafka
Disclaimer: I work for Confluent, who publish the above whitepaper.

Separate zookeeper install or not using kafka 0.10.2?

I would like to use the embedded ZooKeeper 3.4.9 that comes with Kafka 0.10.2, and not install ZooKeeper separately. Each Kafka broker will always have a 1:1 ZooKeeper on localhost.
So if I have 5 brokers on hosts A, B, C, D and E, each with a single Kafka and Zookeeper instance running on them, is it sufficient to just run the Zookeeper provided with Kafka?
What downsides or configuration limitations, if any, does the embedded 3.4.9 Zookeeper have compared to the standalone version?
Here are a few reasons not to run ZooKeeper on the same box as Kafka brokers.
They scale differently
5 ZooKeeper nodes and 5 Kafka brokers work, but 6:6 or 11:11 do not. You don't need more than 5 ZooKeeper nodes even for a quite large Kafka cluster. Unlike Kafka, ZooKeeper replicates data to all nodes, so it gets slower as you add more nodes.
They compete for disk I/O
ZooKeeper is very sensitive to disk I/O latency. You need to have it on a separate physical disk from the Kafka commit log, or you run the risk that heavy publishing to Kafka will slow ZooKeeper down and cause it to drop out of the ensemble, which can cause problems.
They compete for page cache memory
Kafka uses Linux OS page cache to reduce disk I/O. When other apps run on the same box as Kafka you reduce or "pollute" the page cache with other data that takes away from cache for Kafka.
Server failures take down more infrastructure
If the box reboots you lose both a zookeeper and a broker at the same time.
Even though ZooKeeper comes with each Kafka release it does not mean they should run on the same server. Actually, it is advised that in a production environment they run on separate servers.
In the Kafka broker configuration you can specify the ZooKeeper address, and it can be local or remote. This is from broker config (config/server.properties):
# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=localhost:2181
You can replace localhost with any other accessible server name or IP address.
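For a remote ensemble of several nodes, the value follows the format described in the comment above: comma separated host:port pairs, with the optional chroot appended once at the end. A minimal Python sketch that builds such a string, using hypothetical host names zk1 to zk3 and a hypothetical /kafka chroot:

```python
def zookeeper_connect(hosts, port=2181, chroot=""):
    """Build a zookeeper.connect value: comma separated host:port pairs,
    with an optional chroot path appended once at the end."""
    if chroot and not chroot.startswith("/"):
        raise ValueError("chroot must start with '/'")
    return ",".join(f"{host}:{port}" for host in hosts) + chroot

# Hypothetical host names for a three node ensemble on separate servers.
print("zookeeper.connect=" + zookeeper_connect(["zk1", "zk2", "zk3"], chroot="/kafka"))
```

Listing every ensemble member means a broker can still connect when one ZooKeeper node is down.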
We've been running a setup as you described, with 3 to 5 nodes, each running a kafka broker and the zookeeper that comes with kafka distribution on the same nodes. No issues with that setup so far, but our data throughput isn't high.
If we were to scale above 5 nodes we'd separate them, so that we only scale kafka brokers but keep the zookeeper ensemble small. If zookeeper and kafka start competing for I/O too much, then we'd move their data directories to separate drives. If they start competing for CPU, then we'd move them to separate boxes.
All in all, it depends on your expected throughput and how easily you can upgrade your setup if it starts causing contention. You can start small and easy, with kafka and zookeeper co-located as long as you have the flexibility to upgrade your setup with more nodes and introduce separation later on. If you think this will be hard to add later, better start running them separate from the start. We've been running them co-located for 18+ months and haven't encountered resource contention so far.

Run Kafka and Kafka-connect on different servers?

I want to know if Kafka and Kafka-connect can run on different servers? So a connector would be started on server A and send data from a kafka topic on server B to HDFS or S3 etc. Thanks
Yes, and for Production deployments this is typically recommended for resource reasons. Generally you'd deploy a cluster of Kafka Brokers (3+ for HA), and then a cluster of Kafka Connect workers (as many as needed for throughput capacity / resilience) -- all on separate nodes.
For more details, see the Confluent Enterprise Reference Architecture.
Yes, you can do it.
I have a set of Kafka servers, and my Kafka Connect applications run on different machines and write data to HDFS. You have to list your brokers in bootstrap.servers in the worker properties file (config/connect-distributed.properties or config/connect-standalone.properties) instead of localhost:9092.
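As a sketch of that change, the Python helper below rewrites the bootstrap.servers entry in the text of a worker properties file. The helper name and the broker host names kafka1 to kafka3 are assumptions for illustration; editing the file by hand works just as well.

```python
def set_property(properties_text, key, value):
    """Return the properties text with `key` set to `value`,
    replacing an existing assignment or appending a new one."""
    lines = properties_text.splitlines()
    updated = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Skip comments and lines that are not key=value assignments.
        if stripped.startswith("#") or "=" not in stripped:
            continue
        if stripped.split("=", 1)[0].strip() == key:
            lines[i] = f"{key}={value}"
            updated = True
    if not updated:
        lines.append(f"{key}={value}")
    return "\n".join(lines)

# A worker config that still points at a broker on localhost.
worker_config = "bootstrap.servers=localhost:9092\ngroup.id=connect-cluster"

# Hypothetical broker host names on separate machines.
brokers = ["kafka1:9092", "kafka2:9092", "kafka3:9092"]
print(set_property(worker_config, "bootstrap.servers", ",".join(brokers)))
```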

Best Practices for Kafka Cluster Deployment Configuration?

I'm asking for general best practices here:
If I want a five node cluster, do all five nodes run the Confluent Platform Umbrella Packages that include Zookeeper, Kafka, schema-registry?
Is it ever recommended to run the ZooKeeper cluster on separate servers from the Kafka cluster?
If I want to run the Kafka Connect distributed worker, do I run that on all cluster nodes? Do I ever want to run on separate servers? Is Docker recommended for this or is Docker unnecessary?
With Kafka Streaming apps, should they be run on all cluster nodes? Should they be dockerized? Should they ever run on separate nodes?
Is something like Mesos recommended?
It is a best practice to run Kafka Brokers on dedicated servers (or virtual servers). The same is true of Zookeeper.
All the other components of the Confluent Platform can run colocated on common servers or on separate machines.
You would typically run only one Schema Registry (or two if you want fault tolerance). They can run on any machine that can connect back to the Kafka Brokers.
Kafka Connect distributed workers only need to run on machines that you want to host Kafka Connectors. They just need to be able to connect back to the Kafka Brokers.
Kafka Streams apps can run anywhere you want so long as they can connect back to the Kafka Brokers.
All components can run inside docker containers or without docker.
You can use whatever microservices or data center resource management tools you want (or none at all) - it is your choice.