Best Practices for Kafka Cluster Deployment Configuration? - apache-kafka

I'm asking for general best practices here:
If I want a five node cluster, do all five nodes run the Confluent Platform Umbrella Packages that include Zookeeper, Kafka, schema-registry?
Is it ever recommended to run the zookeper cluster on separate servers from the Kafka cluster?
If I want to run the Kafka Connect distributed worker, do I run that on all cluster nodes? Do I ever want to run on separate servers? Is Docker recommended for this or is Docker unnecessary?
With Kafka Streaming apps, should they be run on all cluster nodes? Should they be dockerized? Should they ever run on separate nodes?
Is something like Mesos recommended?

It is a best practice to run Kafka Brokers on dedicated servers (or virtual servers). The same is true of Zookeeper.
All the other components of the Confluent Platform can run colocated on common servers or on separate machines.
You would typically run only one Schema Registry (or two if you want fault tolerance). They can run on any machine that can connect back to the Kafka Brokers.
Kafka Connect distributed workers only need to run on machines that you want to host Kafka Connectors. They just need to be able to connect back to the Kafka Brokers.
Kafka Streams apps can run anywhere you want so long as they can connect back to the Kafka Brokers.
All components can run inside docker containers or without docker.
You can use whatever microservices or data center resource management tools you want (or none at all) - it is your choice.

Related

Where to run Kafka stream processor?

I'm playing around with Apache Kafka a bit and have a functional multi-node cluster configured. I want to now introduce a Kafka Stream Processor. I'll just do something simple, but here's my question: Where do I run it? I know I can run it as a standalone jar on any machine, but is that the correct place to run it? Do I run it on a worker node? Can I run it via the distributed Kafka Connect worker API? I saw documentation that says multiple instances of the same processor will be aware of each other....how? Is that handled in the Java Kafka libraries behind the scenes?
Basically, how do I deploy a processor at scale? Presumably I wouldn't manually start 10 (or 100 or 1000) instances of the same processor.
Assume I am NOT using Kubernetes for this, please. Also assume I am using the community-only packages for the Confluent Platform.
Kafka Connect does not run Kafka Streams applications.
ksqlDB, on the other hand, offers an abstraction layer for Kafka Streams applications and offers an embedded Connect worker.
Otherwise, yes, you simply run the Kafka Streams JAR files, anywhere that has network access to your Kafka cluster. Ideally, not on the cluster itself as it'll be competing for RAM and disk space.
And none of the above require Confluent Platform.
how do I deploy a processor at scale? Presumably I wouldn't manually start 10 (or 100 or 1000) instances of the same processor.
Well, you can only have up-to the number of partitions for your processor's input topics active threads, which you control by num.stream.threads and number of Streams processes.
If you're not deploying into Kubernetes, then you can still use other options like Puppet, Ansible, Supervisor, Hashicorp Nomad's Java Driver, etc.

multiple connectors in kafka to different topics are going to same node

I have created two kafka connectors in kafka-connect which use the same Connector class but have different topics they listen to.
When I launch the process on my node, both the connectors end up creating tasks on this process. However, I would like one node to only handle one connector/topic. How can I limit a topic/connector to a single node? I don't see any configuration in connect-distributed.properties where a process could specify which connector to use.
Thanks
Kafka Connect in distributed mode can run as a cluster of one or more workers. Each worker can run multiple tasks. Depending on how many connectors and workers you are running, you will have tasks running on the same worker. This is deliberate - the idea is that Kafka Connect will manage your tasks and workload for you, across the available workers.
If you want to isolate your processing you can run Kafka Connect as separate Connect clusters, either on the same machine (make sure to use different REST ports), or separate machines.
For more info, see architecture and config for steps to configure separate clusters. Note that a cluster can actually be a single worker, but then you don't have any redundancy in the event of failure.

Kafka connect cluster setup or launching connect workers

I am going through kafka connect, and i am trying to get the concepts.
Let us say I have kafka cluster (nodes k1, k2 and k3) setup and it is running, now i want to run kafka connect workers in different nodes say c1 and c2 in distributed mode.
Few questions.
1) To run or launch kafka connect in distributed mode I need to use command ../bin/connect-distributed.sh, which is available in kakfa cluster nodes, so I need to launch kafka connect from any one of the kafka cluster nodes? or any node from where I launch kafka connect needs to have kafka binaries so that i will be able to use ../bin/connect-distributed.sh
2) I need to copy the my connector plugins to any kafka cluster node( or to all cluster nodes?) from where I do the step 1?
3) how does kafka copies these connector plugins to worker node before starting jvm process on the worker node? because the plugin is the one which has my task code and it needs to be copied to worker in order to start the process in worker.
4) Do i need to install anything in connect cluster nodes c1 and c2, like need to install java or any kafka connect related?
5) In some places it says use confluent platform but i would like to start it with apache kafka connect alone first.
can some one please through some light or even pointer to some resources would also help.
Thank you.
1) In order to have a highly available kafka-connect service you need to run at least two instances of connect-distributed.sh on two distinct machines that have the same group.id. You can find more details regarding the configuration of each worker here. For improved performance, Connect should be ran independently of the broker and Zookeeper machines.
2) Yes, you need to place all your connectors under plugin.path (normally under /usr/share/java/) on every machine that you are planning to run kafka-connect.
3) kafka-connect will load the connectors on startup. You don't need to handle this. Note that if your kafka-connect instance is running and a new connector is added, you need to restart the service.
4) You need to have Java installed on all your machines. For Confluent Platform particularly:
Java 1.7 and 1.8 are supported in this version of Confluent Platform
(Java 1.9 is currently not supported). You should run with the
Garbage-First (G1) garbage collector. For more information, see the
Supported Versions and Interoperability.
5) It depends. Confluent was founded by the original creators of Apache Kafka and it comes as a more complete distribution adding schema management, connectors and clients. It also comes with KSQL which is quite useful if you need to act on certain events. Confluent simply adds on top of the Apache Kafka distribution, it's not a modified version.
Answer given by Giorgos is correct. I ran few connectors and now I understand it better.
I am just trying to put it differently.
In Kafka connect there are two things involved one is Worker and second is connector.Below is on details about running distributed Kafka connect.
Kafka connect Worker is a Java process on which the connector/connect task will run. So first thing is we need to launch worker, to run/launch a worker we need java installed on that machine then we need Kafka connect related sh/bat files to launch worker and kafka libs which will be used by kafka connect worker, for this we will just simply copy/install Kafka in the worker machine, also we need to copy all the connector and connect-task related jars/dependencies in "plugin.path" as defined in the below worker properties file, now worker machine is ready, to start worker we need to invoke ./bin/connect-distributed.sh ./config/connect-distributed.properties, here connect-distributed.properties will have configuration for worker. The same thing has to be repeated in each machine where we need to run Kafka connect.
Now the worker java process is running in all machines, the woker config will have group.id property, the workers which have this same property value will be forming a group/cluster of workers.
Each worker process will expose rest endpoint (default http://localhost:8083/connectors), to launch/start a connector on the running workers, we need do http-post a connector config json, based on the given config the worker will start the connector and the number of tasks in the above group/cluster workers.
Example: Connect post,
curl -X POST -H "Content-Type: application/json" --data '{"name": "local-file-sink", "config": {"connector.class":"FileStreamSinkConnector", "tasks.max":"3", "file":"test.sink.txt", "topics":"connect-test" }}' http://localhost:8083/connectors

Kafka Cluster Architecture - Mirror Maker

I have a Kafka cluster composed by 5 brokers and 4 mirror maker to mirror date from 2 different data centers. I know that a kafka broker requires its own dedicated hardware especially because of the high disk I/O, memory usage and CPU intensive application.
I would like to know if could make sense to deploy a mirror maker process on a node that is even a Kafka broker or if I should consider to have the mirror maker on:
a dedicated node
a node which hostes a zookeeper server
HDFS and others cloudera services are deployed on different nodes.
Thanks in advance,
Beniamino
MirrorMaker is just a regular Java Producer/Consumer pair.
If you wrote an application to read from the remote data center, would it make sense to run it on its own hardware? Do you have the resources available to do so? I personally wouldn't run it on a broker or zookeeper.
If you're running in a data center with Docker or Kubernetes available, you can deploy all mirroring instances in their own containers. Or you can run all topics in one JVM using a regex whitelist pattern.
However you choose to deploy, it's recommended to have the consuming process of the MirrorMaker to be in the remote data center pulling data and producing to the local cluster.
Confluent has discussions about this topic
Edit: As of Kafka 2.4, MirrorMaker2 is built on the Kafka Connect framework and is the recommended deployment going forward

Run Kafka and Kafka-connect on different servers?

I want to know if Kafka and Kafka-connect can run on different servers? So a connector would be started on server A and send data from a kafka topic on server B to HDFS or S3 etc. Thanks
Yes, and for Production deployments this is typically recommended for resource reasons. Generally you'd deploy a cluster of Kafka Brokers (3+ for HA), and then a cluster of Kafka Cluster workers (as many as needed for throughput capacity / resilience) -- all on separate nodes.
For more details, see the Confluent Enterprise Reference Architecture.
Yes, you can do it.
I have my set of kafka servers and kafka connect applications are running in different machines and writing data in hdfs. you have to mention list of brokers in bootstrap.servers under worker properties file (config/connect-distributed.properties or config/connect-standalone.properties) instead of localhost:9092