Where to run Kafka stream processor? - apache-kafka

I'm playing around with Apache Kafka a bit and have a functional multi-node cluster configured. I now want to introduce a Kafka Stream Processor. I'll just do something simple, but here's my question: Where do I run it? I know I can run it as a standalone jar on any machine, but is that the correct place to run it? Do I run it on a worker node? Can I run it via the distributed Kafka Connect worker API? I saw documentation that says multiple instances of the same processor will be aware of each other... how? Is that handled in the Java Kafka libraries behind the scenes?
Basically, how do I deploy a processor at scale? Presumably I wouldn't manually start 10 (or 100 or 1000) instances of the same processor.
Assume I am NOT using Kubernetes for this, please. Also assume I am using the community-only packages for the Confluent Platform.

Kafka Connect does not run Kafka Streams applications.
ksqlDB, on the other hand, offers an abstraction layer for Kafka Streams applications and offers an embedded Connect worker.
Otherwise, yes, you simply run the Kafka Streams JAR files, anywhere that has network access to your Kafka cluster. Ideally, not on the cluster itself as it'll be competing for RAM and disk space.
And none of the above require Confluent Platform.
how do I deploy a processor at scale? Presumably I wouldn't manually start 10 (or 100 or 1000) instances of the same processor.
Well, the number of active stream threads is capped by the number of partitions in your processor's input topics. You control the total thread count through num.stream.threads and the number of Streams processes you start.
If you're not deploying into Kubernetes, then you can still use other options like Puppet, Ansible, Supervisor, Hashicorp Nomad's Java Driver, etc.
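The partition cap described above can be put into numbers. The following is only a rough sketch, assuming a single input topic and a single sub-topology, so that the task count equals the partition count. The class name StreamParallelism and the figures in main are made up for illustration and are not part of Kafka.

```java
/** Back-of-the-envelope sketch: how many stream threads can be busy at once. */
final class StreamParallelism {

    /** Threads started overall: number of processes times num.stream.threads. */
    static int totalThreads(int instances, int threadsPerInstance) {
        return instances * threadsPerInstance;
    }

    /** Threads that can own work; capped by the input partition count. */
    static int activeThreads(int inputPartitions, int instances, int threadsPerInstance) {
        return Math.min(inputPartitions, totalThreads(instances, threadsPerInstance));
    }

    /** Threads that are started but have nothing assigned to them. */
    static int idleThreads(int inputPartitions, int instances, int threadsPerInstance) {
        return totalThreads(instances, threadsPerInstance)
                - activeThreads(inputPartitions, instances, threadsPerInstance);
    }

    public static void main(String[] args) {
        int partitions = 12;        // hypothetical input topic size
        int instances = 4;          // hypothetical number of JVMs running the same jar
        int threadsPerInstance = 4; // hypothetical num.stream.threads value
        System.out.printf("total=%d active=%d idle=%d%n",
                totalThreads(instances, threadsPerInstance),
                activeThreads(partitions, instances, threadsPerInstance),
                idleThreads(partitions, instances, threadsPerInstance));
    }
}
```

In other words, starting more threads than there are input partitions only adds idle threads, so the partition count is what bounds how many instances are worth starting.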

Related

Deploying Kafka Consumers

We are deploying Kafka consumers based on the Java API in separate VMs grouped by usage. Probably 3-4 consumers (not in the same group) per VM, depending on the throughput of these consumers.
Is it best to use this method or to deploy the consumers using Docker? Any pointers would be helpful.
Though you can use the Confluent REST Proxy and others, my question is about consumer deployment.
A VM has too much overhead for simply running one or a few JVM applications. If you have a container platform, then that would be preferred, and it would start the app faster than provisioning new VMs per app.

Running a single kafka s3 sink connector in standalone vs distributed mode

I have a kafka topic "mytopic" with 10 partitions and want to use S3 sink connector to sink records to an S3 bucket. For scaling purposes it should be running on multiple nodes to write partitions data in parallel to the same S3 bucket.
In Kafka connect user guide and actually many other blogs/tutorials it's recommended to run workers in distributed mode instead of standalone to achieve better scalability and fault tolerance:
... distributed mode is more flexible in terms of scalability and offers the added advantage of a highly available service to minimize downtime.
I want to figure out which mode to choose for my use case: having one logical connector running on multiple nodes in parallel. My understanding is the following:
If I run in distributed mode, I will end up having only 1 worker processing all the partitions, since it's considered one connector task.
Instead I should run in standalone mode in multiple nodes. In that case I will have a consumer group and achieve parallel processing of partitions.
In the standalone scenario described above I will actually have fault tolerance: if one instance dies, the consumer group will rebalance and other standalone workers will handle the freed partitions.
Is my understanding correct or am I missing something?
Unfortunately I couldn't find much information on this topic other than this google groups discussion, where the author came to the same conclusion as I did.
In theory, that might work, but you'll end up ssh-ing to multiple machines with basically the same config files, just running connect-standalone instead of the connect-distributed command.
You're missing the part about Connect server task rebalancing, though, which communicates over the Connect server REST ports.
The underlying task code is all the same, only the entrypoint and offset storage are different. So, why not just use distributed if you have multiple machines?
You don't need to run multiple instances of standalone processes. In distributed mode, the Kafka Connect workers take care of distributing the tasks, rebalancing, and offset management; you need to specify the same group id ...
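To make the distributed-mode behaviour above concrete, here is a small sketch of how the connector's tasks.max setting interacts with the 10 partitions of "mytopic". With the default tasks.max of 1, a single task reads every partition, which matches the concern raised in the question; raising it lets the workers share the tasks. The round-robin placement, the class name ConnectTaskSpread, and the worker names are assumptions for illustration only, not Connect's actual assignment algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative model of tasks.max versus partitions for a sink connector. */
final class ConnectTaskSpread {

    /** Sink tasks share one consumer group, so at most one task per partition gets data. */
    static int tasksWithWork(int tasksMax, int topicPartitions) {
        return Math.min(tasksMax, topicPartitions);
    }

    /** Simplified round-robin placement of task ids onto workers (an assumption). */
    static Map<String, List<Integer>> spread(List<String> workers, int tasks) {
        Map<String, List<Integer>> byWorker = new TreeMap<>();
        for (String worker : workers) {
            byWorker.put(worker, new ArrayList<>());
        }
        for (int task = 0; task < tasks; task++) {
            byWorker.get(workers.get(task % workers.size())).add(task);
        }
        return byWorker;
    }

    public static void main(String[] args) {
        List<String> workers = List.of("worker-a", "worker-b", "worker-c"); // hypothetical hosts
        int partitions = 10; // "mytopic" from the question

        // tasks.max=1: one task, hence one worker, handles all partitions.
        System.out.println(spread(workers, tasksWithWork(1, partitions)));
        // tasks.max=10: the tasks are spread over all workers.
        System.out.println(spread(workers, tasksWithWork(10, partitions)));
    }
}
```

If a worker goes away, running the same placement over the remaining workers gives the flavour of what the distributed-mode rebalance does without any manual restarts.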

Kafka Cluster Architecture - Mirror Maker

I have a Kafka cluster composed of 5 brokers and 4 mirror makers to mirror data from 2 different data centers. I know that a Kafka broker requires its own dedicated hardware, especially because of its high disk I/O, memory usage, and CPU load.
I would like to know if it could make sense to deploy a mirror maker process on a node that is also a Kafka broker, or if I should consider having the mirror maker on:
a dedicated node
a node which hosts a ZooKeeper server
HDFS and other Cloudera services are deployed on different nodes.
Thanks in advance,
Beniamino
MirrorMaker is just a regular Java Producer/Consumer pair.
If you wrote an application to read from the remote data center, would it make sense to run it on its own hardware? Do you have the resources available to do so? I personally wouldn't run it on a broker or zookeeper.
If you're running in a data center with Docker or Kubernetes available, you can deploy all mirroring instances in their own containers. Or you can run all topics in one JVM using a regex whitelist pattern.
However you choose to deploy, it's recommended to have the consuming process of the MirrorMaker to be in the remote data center pulling data and producing to the local cluster.
Confluent has discussions about this topic
Edit: As of Kafka 2.4, MirrorMaker2 is built on the Kafka Connect framework and is the recommended deployment going forward

Best Practices for Kafka Cluster Deployment Configuration?

I'm asking for general best practices here:
If I want a five node cluster, do all five nodes run the Confluent Platform Umbrella Packages that include Zookeeper, Kafka, schema-registry?
Is it ever recommended to run the Zookeeper cluster on separate servers from the Kafka cluster?
If I want to run the Kafka Connect distributed worker, do I run that on all cluster nodes? Do I ever want to run on separate servers? Is Docker recommended for this or is Docker unnecessary?
With Kafka Streaming apps, should they be run on all cluster nodes? Should they be dockerized? Should they ever run on separate nodes?
Is something like Mesos recommended?
It is a best practice to run Kafka Brokers on dedicated servers (or virtual servers). The same is true of Zookeeper.
All the other components of the Confluent Platform can run colocated on common servers or on separate machines.
You would typically run only one Schema Registry (or two if you want fault tolerance). They can run on any machine that can connect back to the Kafka Brokers.
Kafka Connect distributed workers only need to run on machines that you want to host Kafka Connectors. They just need to be able to connect back to the Kafka Brokers.
Kafka Streams apps can run anywhere you want so long as they can connect back to the Kafka Brokers.
All components can run inside docker containers or without docker.
You can use whatever microservices or data center resource management tools you want (or none at all) - it is your choice.

Scaling Kafka stream application across multiple users

I have a setup where I'm pushing events to kafka and then running a Kafka Streams application on the same cluster. Is it fair to say that the only way to scale the Kafka Streams application is to scale the kafka cluster itself by adding nodes or increasing Partitions?
In that case, how do I ensure that my consumers will not bring down the cluster and ensure that the critical pipelines are always "on". Is there any concept of Topology Priority which can avoid a possible downtime? I want to be able to expose the streams for anyone to build applications on without compromising the core pipelines. If the solution is to setup another kafka cluster, does it make more sense to use Apache storm instead, for all the adhoc queries? (I understand that a lot of consumers could still cause issues with the kafka cluster, but at least the topology processing is isolated now)
It is not recommended to run your Streams application on the same servers as your brokers (even if this is technically possible). Kafka's Streams API offers an application-based approach -- not a cluster-based approach -- because it's a library and not a framework.
It is not required to scale your Kafka cluster to scale your Streams application. In general, the parallelism of a Streams application is limited by the number of partitions of your app's input topics. It is recommended to over-partition your topic (the overhead for this is rather small) to guard against scaling limitations.
Thus, it is even simpler to "offer anyone to build applications" as everyone owns their application. There is no need to submit apps to a cluster. They can be executed anywhere you like (thus, each team can deploy their Streams application the same way by which they deploy any other application they have). Thus, you have many deployment options from a WAR file, over YARN/Mesos, to containers (like Kubernetes). Whatever works best for you.
Even if frameworks like Flink, Storm, or Samza offer cluster management, you can only use such tools that are integrated with those frameworks (for example, Samza requires YARN -- no other options available). Let's say you already have a Mesos setup: you can reuse it for your Kafka Streams applications -- no need for a dedicated "Kafka Streams cluster" (because there is no such thing).
An application’s processor topology is scaled by breaking it into
multiple tasks.
More specifically, Kafka Streams creates a fixed number of tasks based
on the input stream partitions for the application, with each task
assigned a list of partitions from the input streams (i.e., Kafka
topics).
The assignment of partitions to tasks never changes so that each task
is a fixed unit of parallelism of the application. Tasks can then
instantiate their own processor topology based on the assigned
partitions; they also maintain a buffer for each of its assigned
partitions and process messages one-at-a-time from these record
buffers.
As a result stream tasks can be processed independently and in
parallel without manual intervention.
It is important to understand that Kafka Streams is not a resource
manager, but a library that “runs” anywhere its stream processing
application runs. Multiple instances of the application are executed
either on the same machine, or spread across multiple machines and
tasks can be distributed automatically by the library to those running
application instances.
The assignment of partitions to tasks never changes; if an application
instance fails, all its assigned tasks will be restarted on other
instances and continue to consume from the same stream partitions.
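The task model quoted above can be mimicked with a few lines of plain Java. This is a simplified sketch, not the library's real assignor: it assumes one sub-topology, derives the task count from the largest input topic, and places tasks round-robin. The topic names, instance names, and the class StreamTaskSketch are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of fixed stream tasks being moved between application instances. */
final class StreamTaskSketch {

    /** Task i owns partition i of every input topic that has such a partition. */
    static List<List<String>> buildTasks(Map<String, Integer> partitionsPerTopic) {
        int taskCount = 0;
        for (int partitions : partitionsPerTopic.values()) {
            taskCount = Math.max(taskCount, partitions);
        }
        // Sort topics by name so the output order is deterministic.
        Map<String, Integer> sorted = new TreeMap<>(partitionsPerTopic);
        List<List<String>> tasks = new ArrayList<>();
        for (int i = 0; i < taskCount; i++) {
            List<String> owned = new ArrayList<>();
            for (Map.Entry<String, Integer> topic : sorted.entrySet()) {
                if (i < topic.getValue()) {
                    owned.add(topic.getKey() + "-" + i);
                }
            }
            tasks.add(owned);
        }
        return tasks;
    }

    /** Simplified round-robin placement of task ids onto live instances (an assumption). */
    static Map<String, List<Integer>> place(List<String> liveInstances, int taskCount) {
        Map<String, List<Integer>> byInstance = new TreeMap<>();
        for (String instance : liveInstances) {
            byInstance.put(instance, new ArrayList<>());
        }
        for (int task = 0; task < taskCount; task++) {
            byInstance.get(liveInstances.get(task % liveInstances.size())).add(task);
        }
        return byInstance;
    }

    public static void main(String[] args) {
        // Hypothetical co-partitioned input topics with 4 partitions each.
        List<List<String>> tasks = buildTasks(Map.of("orders", 4, "payments", 4));
        System.out.println("tasks: " + tasks);

        // Two instances share the tasks; then one fails and the other takes over.
        System.out.println("before: " + place(List.of("app-1", "app-2"), tasks.size()));
        System.out.println("after:  " + place(List.of("app-1"), tasks.size()));
    }
}
```

The point of the sketch is that the task-to-partition mapping from buildTasks never changes; only the placement of task ids onto instances does when an instance joins or fails.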
The processing of the stream happens in the machines where the application is running.
I recommend having a look at this guide; it can help you better understand how Kafka Streams works.