Kafka Streams application deployment - embedded vs application management frameworks - apache-kafka

I'm pretty new to Kafka Streams. Right now I'm trying to understand the basic principles of this system.
This is a quote from the following article https://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/
You just use the library in your app, and start as many instances of the app as you like, and Kafka will partition up and balance the work over these instances.
Right now it is not clear to me how this works. Where will the business logic (the computation tasks) of Kafka Streams be executed? Will it run inside my application, or is Kafka Streams just a client for the Kafka cluster that only prepares tasks which are then executed on the cluster? If it runs inside my application, how do I properly scale its computation power? Is it possible to execute it inside YARN or something similar? Given that, is it a good idea to implement the Kafka Streams application as an embedded component of the core application (say, a web application in my case), or should it be implemented as a separate service and deployed to YARN/Mesos (if that is possible) separately from the main web application? Also, how do I prepare a Kafka Streams application to be deployed with YARN/Mesos application management frameworks?

Your stream processing code runs inside your application -- it does not run in the Kafka cluster.
You can deploy it any way you like: YARN, Mesos, Kubernetes, a WAR file, Chef, whatever. The idea is to embed it directly into your application so you avoid setting up a separate processing cluster.
You don't need to prepare a Kafka Streams application for a particular deployment method -- it is completely agnostic to how it gets deployed. For YARN/Mesos you would deploy it like any other Java application within the framework.
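To make "runs inside your application" concrete, here is a minimal sketch of a complete Kafka Streams program; the topic names, application id, and broker address are placeholders, not anything from the question:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class MyStreamsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Instances that share the same application.id split the work between them.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // The "business logic" lives here and executes in this JVM, not on the brokers.
        builder.<String, String>stream("input-topic")
               .mapValues(value -> value.toUpperCase())
               .to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}

Scaling out is just starting another copy of this same JAR (same application.id) on another machine; the instances coordinate through Kafka's consumer group protocol and divide the input partitions among themselves.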

Related

Where to run Kafka stream processor?

I'm playing around with Apache Kafka a bit and have a functional multi-node cluster configured. I want to now introduce a Kafka Stream Processor. I'll just do something simple, but here's my question: Where do I run it? I know I can run it as a standalone jar on any machine, but is that the correct place to run it? Do I run it on a worker node? Can I run it via the distributed Kafka Connect worker API? I saw documentation that says multiple instances of the same processor will be aware of each other... how? Is that handled in the Java Kafka libraries behind the scenes?
Basically, how do I deploy a processor at scale? Presumably I wouldn't manually start 10 (or 100 or 1000) instances of the same processor.
Assume I am NOT using Kubernetes for this, please. Also assume I am using the community-only packages for the Confluent Platform.
Kafka Connect does not run Kafka Streams applications.
ksqlDB, on the other hand, provides an abstraction layer for Kafka Streams applications and includes an embedded Connect worker.
Otherwise, yes, you simply run the Kafka Streams JAR files, anywhere that has network access to your Kafka cluster. Ideally, not on the cluster itself as it'll be competing for RAM and disk space.
And none of the above require Confluent Platform.
how do I deploy a processor at scale? Presumably I wouldn't manually start 10 (or 100 or 1000) instances of the same processor.
Well, you can only have as many active threads as there are partitions across your processor's input topics; you control the thread count with num.stream.threads and the number of Streams processes you run.
If you're not deploying into Kubernetes, then you can still use other options like Puppet, Ansible, Supervisor, Hashicorp Nomad's Java Driver, etc.
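To make the arithmetic concrete (the numbers here are made up): with a 12-partition input topic, 3 instances running 4 threads each gives 12 active tasks, and any extra threads would sit idle. The thread count itself is just another entry in the Streams configuration:

Properties props = new Properties();
// Threads per instance; total parallelism across all instances is still
// capped by the partition count of the input topics.
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);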

Kafka Streams without Sink

I'm currently planning the architecture for an application that reads from a Kafka topic and, after some conversion, puts the data into RabbitMQ.
I'm kind of new to Kafka Streams, and it looks like a good choice for my task. But the problem is that the Kafka server is hosted by another vendor, so I can't even install the Kafka Connect RabbitMQ sink plugin.
Is it possible to write a Kafka Streams application that doesn't have any sink points but just processes the input stream? I could push to RabbitMQ in a foreach operation, but I'm not sure the stream will even work without a sink point.
foreach is a Sink action, so to answer your question directly, no.
However, Kafka Streams should really be limited to Kafka-to-Kafka communication.
Kafka Connect can be installed and run anywhere, if that is what you wanted to use... You can also use other Apache tools like Camel, Spark, NiFi, or Flink to write to RabbitMQ after consuming from Kafka, or write any application in a language of your choice. For example, the Spring Integration or Spring Cloud Stream frameworks allow a single contract across many communication channels.
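If you do decide to publish to RabbitMQ from inside the topology anyway, a foreach is where it would go. Below is a rough sketch using the standard RabbitMQ Java client; the host, queue, and topic names are made up for illustration, and the caveat above still applies (Streams gives you no delivery guarantees toward RabbitMQ):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.ConnectionFactory;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class KafkaToRabbit {
    public static void main(String[] args) throws Exception {
        // Plain RabbitMQ Java client; host and queue name are placeholders.
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbitmq.example.com");
        Channel channel = factory.newConnection().createChannel();
        channel.queueDeclare("converted-events", true, false, false, null);

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kafka-to-rabbit");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("input-topic")
               .foreach((key, value) -> {
                   try {
                       // foreach is the terminal (sink) operation of this topology;
                       // the RabbitMQ publish happens here instead of a .to() call.
                       channel.basicPublish("", "converted-events", null, value.getBytes());
                   } catch (Exception e) {
                       throw new RuntimeException(e);
                   }
               });

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}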

Listen to a topic continuously, fetch data, perform some basic cleansing

I need to build a Java-based Kafka streaming application that will listen to a topic X continuously, fetch data, perform some basic cleansing, and write to an Oracle database. The Kafka cluster is outside my domain, and I have no ability to deploy any code or configuration to it.
What is the best way to design such a solution? I came across Kafka Streams but was confused as to whether it can be used for 'Topic > Process > Topic' scenarios.
I came across Kafka Streams but was confused as to whether it can be used for 'Topic > Process > Topic' scenarios.
Absolutely.
For example, excluding the "process" step, it's two lines outside of the configuration setup.
final StreamsBuilder builder = new StreamsBuilder();
builder.stream("streams-plaintext-input").to("streams-pipe-output");
This code is straight from the documentation.
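With a "process" step added, a basic-cleansing version might look like the sketch below; the topic names and the trim/filter logic are placeholders for whatever the real cleansing is:

final StreamsBuilder builder = new StreamsBuilder();
builder.<String, String>stream("topic-x")
       .filter((key, value) -> value != null && !value.trim().isEmpty()) // drop blank records
       .mapValues(value -> value.trim())                                 // stand-in for the real cleansing
       .to("topic-x-clean");

A Kafka Connect JDBC Sink (see below) can then pick the cleansed records up from topic-x-clean and write them to Oracle.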
If you want to write to any database, you should first check if there is a Kafka Connect plugin to do that for you. Kafka Streams shouldn't really be used to read/write from/to external systems outside of Kafka, as it is latency-sensitive.
In your case, the JDBC Sink Connector would work well.
The Kafka cluster is outside my domain, and I have no ability to deploy any code or configuration to it.
Using either solution above, you don't need to, but you will need some machine with Java installed to run the long-running Kafka Streams application and/or the Kafka Connect worker.

Automatically deploy Kafka Stream app

I'm just starting with Kafka and Kafka Streams applications. I wrote a Kafka Streams app that consumes from one topic, processes the messages, and sends them to another topic.
To the best of my knowledge, the only ways I have found to run the Kafka Streams app I coded are:
Run the Java class from the IDE.
Generate a *.jar file and run it from the command prompt.
I would like to know if there is any way to automatically run Kafka Streams applications on Kafka server startup. For example: copy the *.jar file to some folder of my Kafka installation, and have this streams app run automatically when I start my Kafka server.
Your Kafka broker (server) and your Kafka Streams application are independent from one another. You can start them however you manage processes on your server, whether that's something like init.d or systemd, or container-based solutions like Docker or Kubernetes.
In my experience, if your streams application starts well before your broker or ZooKeeper, then it may time out waiting for them to come online. So you may need to configure the streams process to restart in such a situation.

How to start Kafka server programmatically

We are trying to start the Kafka server (ZooKeeper and broker) programmatically in our application. Is there any API/library available for this?
Yes, you can use embedded Kafka, which will run ZooKeeper and the Kafka server for you. It is generally used for testing Kafka producers/consumers where there is no need to run them explicitly.
For more detail refer
To run it, we write EmbeddedKafka.start() at the start and EmbeddedKafka.stop() at the end.
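As an illustrative sketch only: the spring-kafka-test library offers a similar embedded broker (a different class than the EmbeddedKafka object quoted above, and the exact class names vary between versions; this roughly follows the 2.x API):

import org.springframework.kafka.test.EmbeddedKafkaBroker;

// Start one embedded broker (plus embedded ZooKeeper) for a test,
// with a single pre-created topic.
EmbeddedKafkaBroker broker = new EmbeddedKafkaBroker(1, true, "test-topic");
broker.afterPropertiesSet();                            // boots ZooKeeper and the broker
String bootstrapServers = broker.getBrokersAsString();  // hand this to your clients

// ... run producers/consumers against bootstrapServers ...

broker.destroy();                                       // shuts everything down

As with EmbeddedKafka, this is meant for tests, not for running a production broker inside an application.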