Test Kafka and Flink integration flow - Scala

I would like to test the Kafka/Flink integration with FlinkKafkaConsumer011 and FlinkKafkaProducer011, for example.
The process will be:
read from a Kafka topic with Flink
apply some manipulation with Flink
write into another Kafka topic with Flink
With a string example it would be: read a string from the input topic, convert it to uppercase, and write it into a new topic.
The question is: how do I test this flow?
By test I mean unit/integration tests.
Thanks!

The Flink documentation has a short guide on how you can write unit/integration tests for your transformation operators: link. The doc also has a small section about testing checkpointing and state handling, and about using AbstractStreamOperatorTestHarness.
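For the uppercase example in your question, a stateless operator can be tested by instantiating it directly, with no Flink runtime involved. A minimal sketch in Scala, assuming ScalaTest is on the classpath; UppercaseMapper is a hypothetical operator standing in for your transformation:

import org.apache.flink.api.common.functions.MapFunction
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical operator under test: uppercases each record.
class UppercaseMapper extends MapFunction[String, String] {
  override def map(value: String): String = value.toUpperCase
}

// Stateless functions can be called directly in a plain unit test.
class UppercaseMapperTest extends AnyFunSuite {
  test("converts input to uppercase") {
    assert(new UppercaseMapper().map("hello") == "HELLO")
  }
}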
However, I think you are more interested in end-to-end integration testing (including testing sources and sinks). For that, you can start a Flink mini cluster. Here is a link to example code that starts a Flink mini cluster: link.
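A rough sketch of what the mini-cluster approach looks like in Scala (it needs the flink-test-utils dependency; the tiny in-memory pipeline is just a stand-in for your real Kafka-to-Kafka job):

import org.apache.flink.runtime.testutils.MiniClusterResourceConfiguration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.test.util.MiniClusterWithClientResource

// Sketch: run a job against a local Flink mini cluster instead of a real cluster.
object MiniClusterSketch {
  def main(args: Array[String]): Unit = {
    val flinkCluster = new MiniClusterWithClientResource(
      new MiniClusterResourceConfiguration.Builder()
        .setNumberSlotsPerTaskManager(1)
        .setNumberTaskManagers(1)
        .build())
    flinkCluster.before()                 // start the mini cluster
    try {
      val env = StreamExecutionEnvironment.getExecutionEnvironment
      env.fromElements("a", "b", "c")     // swap in your Kafka source for a real end-to-end test
        .map(_.toUpperCase)
        .print()
      env.execute("mini-cluster-smoke-test")
    } finally {
      flinkCluster.after()                // shut it down
    }
  }
}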
You can also launch a Kafka broker within the JVM and use it for your testing purposes. Flink's Kafka connector does that for its integration tests. Here is sample code that starts the Kafka server: link.
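The linked code relies on Flink's internal test utilities; if you only need a broker inside the test JVM from Scala, a third-party library such as embedded-kafka (io.github.embeddedkafka) is a common alternative. A hedged sketch, with illustrative ports and topic name:

import io.github.embeddedkafka.{EmbeddedKafka, EmbeddedKafkaConfig}

// Sketch: start an in-process Kafka broker, publish one message, and read it back.
object EmbeddedKafkaSketch extends App {
  implicit val kafkaConfig: EmbeddedKafkaConfig =
    EmbeddedKafkaConfig(kafkaPort = 6001, zooKeeperPort = 6000)

  EmbeddedKafka.withRunningKafka {
    EmbeddedKafka.publishStringMessageToKafka("input-topic", "hello")
    val msg = EmbeddedKafka.consumeFirstStringMessageFrom("input-topic")
    println(s"round-tripped message: $msg")
  }
}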
If you are running locally, you can use a simple generator app to produce messages for your source Kafka topic (there are many available; you can generate messages continuously or at a configured interval). Here is an example of how you can set Flink's job global parameters when running locally: Kafka010Example.
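The global-parameters part of that example boils down to roughly this (sketched in Scala; the flag names are only illustrative):

import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala._

// Sketch: read --key value pairs from the command line and register them as
// global job parameters so every operator can see them.
object LocalParamsSketch {
  def main(args: Array[String]): Unit = {
    val params = ParameterTool.fromArgs(args)        // e.g. --input-topic test-in --output-topic test-out
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.getConfig.setGlobalJobParameters(params)

    println(s"reading from ${params.get("input-topic", "test-in")}")
  }
}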
Another alternative is to create an integration environment (as opposed to production) to run your end-to-end testing. You will get a real feel for how your program will behave in a production-like environment. It is always advisable to have a complete parallel testing environment, including test source/sink Kafka topics.

Related

How to import and use kafka-connect-datagen in a Spark application

We need to perform unit testing for our real-time streaming application written in Scala/Spark.
One option is to use embedded-kafka for Kafka test-case simulation.
The other option is to use kafka-connect-datagen - https://github.com/confluentinc/kafka-connect-datagen
The examples found on various blogs use the CLI option.
What I'm looking for is an example of importing kafka-connect-datagen within a Scala application.
I'd appreciate help on any good resource on kafka-connect-datagen OR on simulating a streaming application within a Scala application.
Kafka Connect is meant to run as a standalone service.
You can use the Testcontainers project to start a broker and a Connect worker, then run the datagen connector from there.
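A sketch of the broker part with Testcontainers from Scala (it requires Docker; the image tag is illustrative, and a Connect worker running the datagen connector would be an extra container, for example the cp-kafka-connect image started as a GenericContainer):

import org.testcontainers.containers.KafkaContainer
import org.testcontainers.utility.DockerImageName

// Sketch: start a throwaway Kafka broker in Docker for a test run.
object KafkaContainerSketch extends App {
  val kafka = new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.4.0"))
  kafka.start()
  try {
    // Point your Spark/Kafka clients (or the datagen connector) at this address.
    println(s"broker available at ${kafka.getBootstrapServers}")
  } finally {
    kafka.stop()
  }
}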
Otherwise, for more rigorous testing, write your own KafkaProducer.send calls with data you control.
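The hand-rolled producer approach can be as small as this sketch (broker address, topic name, and record contents are all placeholders):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

// Sketch: produce a fixed batch of test records you fully control.
object TestDataProducer extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", classOf[StringSerializer].getName)
  props.put("value.serializer", classOf[StringSerializer].getName)

  val producer = new KafkaProducer[String, String](props)
  try {
    (1 to 100).foreach { i =>
      // send() is asynchronous; blocking on get() keeps the example deterministic
      producer.send(new ProducerRecord("test-input", s"key-$i", s"value-$i")).get()
    }
  } finally {
    producer.close()
  }
}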

Kafka Streams without Sink

I'm currently planning the architecture for an application that reads from a Kafka topic and, after some conversion, puts data into RabbitMQ.
I'm kind of new to Kafka Streams, and it looks like a good choice for my task. But the problem is that the Kafka server is hosted at another vendor's site, so I can't even install the Kafka Connect RabbitMQ sink plugin.
Is it possible to write a Kafka Streams application that doesn't have any sink points, but just processes the input stream? I could just push to RabbitMQ in a foreach operation, but I'm not sure whether the stream will even work without a sink point.
foreach is a sink action (a terminal operation), so to answer your question directly: no, a topology that ends in foreach still has a sink point.
However, Kafka Streams should really be limited to Kafka-to-Kafka communication.
Kafka Connect can be installed and run anywhere, if that is what you wanted to use... You can also use other Apache tools like Camel, Spark, NiFi, Flink, etc. to write to RabbitMQ after consuming from Kafka, or write an application in a language of your choice. For example, the Spring Integration or Spring Cloud Stream frameworks allow a single contract across many communication channels.
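To make the foreach point above concrete, here is a minimal sketch of the topology you describe, using the Kafka Streams Scala DSL (assuming Kafka 2.6+ import paths); publishToRabbit is a hypothetical placeholder for a real RabbitMQ client call, and keep in mind that such a side effect sits outside Kafka Streams' delivery guarantees:

import java.util.Properties
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

// Sketch: foreach is the terminal operation, so there is no .to(...) topic sink.
object KafkaToRabbitSketch extends App {
  // Placeholder for a real RabbitMQ publish (e.g. via the RabbitMQ Java client).
  def publishToRabbit(value: String): Unit = println(s"would publish: $value")

  val builder = new StreamsBuilder()
  builder.stream[String, String]("input-topic")
    .mapValues(_.trim)                              // "some conversion"
    .foreach((_, value) => publishToRabbit(value))  // terminal op, pushes out of Kafka

  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kafka-to-rabbit")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}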

View consumer and producer statistics in the shell : kafka

I am new to Kafka. I have been given a task to send 2 KB messages with optimized throughput and latency. I really don't know how to benchmark these two metrics or how to set up my cluster. I do not have any cluster monitoring tool to use, other than watching the statistics printed to the terminal when I start the producer and consumer. Can anyone please tell me which script I can use to see relevant statistics on the consumer end while the data flow is in progress?
Make sure you check the command-line tools that come with the Apache Kafka installation (the bin/ directory). Those include kafka-producer-perf-test.sh and kafka-consumer-perf-test.sh, which can help you test your cluster's performance.
This article includes good examples: https://community.cloudera.com/t5/Community-Articles/Kafka-2-3-Performance-testing/ta-p/284767
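For a concrete starting point, the invocations look roughly like this, run from Kafka's bin/ directory (the 2048-byte record size matches your 2 KB requirement; exact flag names can vary slightly between Kafka versions):

./kafka-producer-perf-test.sh --topic perf-test --num-records 100000 --record-size 2048 --throughput -1 --producer-props bootstrap.servers=localhost:9092
./kafka-consumer-perf-test.sh --bootstrap-server localhost:9092 --topic perf-test --messages 100000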

Listen to a topic continuously, fetch data, perform some basic cleansing

I'm to build a Java-based Kafka streaming application that will listen to a topic X continuously, fetch data, perform some basic cleansing, and write to an Oracle database. The Kafka cluster is outside my domain, and I have no ability to deploy any code or configuration to it.
What is the best way to design such a solution? I came across Kafka Streams but was confused as to whether it can be used for 'Topic > Process > Topic' scenarios.
I came across Kafka Streams but was confused as to whether it can be used for 'Topic > Process > Topic' scenarios.
Absolutely.
For example, excluding the "process" step, it's two lines outside of the configuration setup.
final StreamsBuilder builder = new StreamsBuilder();
builder.stream("streams-plaintext-input").to("streams-pipe-output");
This code is straight from the documentation.
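With the 'process' step included it is still only a few lines; here is a sketch in the Kafka Streams Scala DSL (the Java API is equivalent), with illustrative topic names and a trivial cleansing step:

import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._

// Sketch: Topic > Process > Topic with a basic cleansing step in the middle.
object TopicProcessTopicSketch extends App {
  val builder = new StreamsBuilder()
  builder.stream[String, String]("raw-input")
    .mapValues(_.trim.toLowerCase)      // the "process" step
    .to("cleansed-output")

  println(builder.build().describe())   // prints the topology without running it
}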
If you want to write to any database, you should first check whether there is a Kafka Connect plugin that does it for you. Kafka Streams shouldn't really be used to read/write from/to external systems outside of Kafka, as it is latency-sensitive.
In your case, the JDBC Sink Connector would work well.
The Kafka cluster is outside my domain and I have no ability to deploy any code or configuration to it.
With either solution above, you don't need to; but you will need some machine with Java installed to run a continuous Kafka Streams application and/or a Kafka Connect worker.

Kafka Streams application deployment - embedded vs application management frameworks

I'm pretty new to Kafka Streams. Right now I'm trying to understand the basic principles of this system.
This is a quote from the following article https://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/
You just use the library in your app, and start as many instances of the app as you like, and Kafka will partition up and balance the work over these instances.
Right now it is not clear to me how this works. Where will the business logic (the computation tasks) of Kafka Streams be executed? Will it be executed inside my application, or is it just a client for the Kafka cluster that only prepares tasks which are then executed on the Kafka cluster? If not, how do I properly scale the computation power of my Kafka Streams application? Is it possible to execute it inside YARN or something similar? In that case, is it a good idea to implement the Kafka Streams application as an embedded component of the core application (a web application in my case), or should it be implemented as a separate service and deployed to YARN/Mesos (if that is possible) separately from the main web application? Also, how do I prepare a Kafka Streams application to be deployed with YARN/Mesos application management frameworks?
Your stream processing code runs inside your application; it does not run in the Kafka cluster.
You can deploy it any way you like: YARN/Mesos/Kubernetes/WAR/Chef, whatever. The idea is to embed it directly into your application to avoid setting up a separate processing cluster.
You don't need to prepare Kafka Streams for a particular deployment method; it's completely agnostic to how it gets deployed. For YARN/Mesos you would deploy it as any other Java application within the framework.
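As an illustration of the "library embedded in your app" model, here is a sketch in Scala (topic names and config values are illustrative). Scaling out is just starting more copies of the same program with the same application.id; Kafka rebalances the input partitions across the instances, which is why no YARN/Mesos-specific code is needed:

import java.util.Properties
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

// Sketch: this main() IS the processing node; run N instances to scale out.
object EmbeddedStreamsApp extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app")    // shared by all instances
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, "2")             // scale up inside one instance

  val builder = new StreamsBuilder()
  builder.stream[String, String]("input-topic").to("output-topic")   // your business logic goes here

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}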