How to import and use kafka-connect-datagen in a Spark application (Scala)

We need to unit test our real-time streaming application, written in Scala with Spark.
One option is to use embedded-kafka to simulate Kafka in test cases.
The other option is kafka-connect-datagen - https://github.com/confluentinc/kafka-connect-datagen
The examples I found on various blogs only cover the CLI option.
What I'm looking for is an example of importing kafka-connect-datagen within a Scala application.
I'd appreciate any good resource on kafka-connect-datagen, or on simulating a streaming source within a Scala application.

Kafka Connect is meant to run as a standalone service, not be embedded in your application.
You can use the Testcontainers project to start a broker and a Connect worker, then run the datagen connector from there.
Otherwise, for more rigorous testing, write your own KafkaProducer.send calls with data you control.
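A minimal sketch of the "data you control" approach. The broker address, topic name, and record shape below are illustrative assumptions; the actual producer lines are commented out because they require kafka-clients on the classpath and a running broker (e.g. one started by Testcontainers):

```java
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class TestDataProducer {
    // Producer configuration; "localhost:9092" is a placeholder, or the
    // address that a Testcontainers KafkaContainer hands back to you.
    static Properties producerConfig(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }

    // Deterministic test records: because you control the data,
    // your test assertions on the consuming side can be exact.
    static List<String> testRecords(int count) {
        return IntStream.range(0, count)
                .mapToObj(i -> String.format("{\"id\": %d, \"event\": \"click\"}", i))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // With kafka-clients on the classpath you would then do:
        // KafkaProducer<String, String> producer =
        //     new KafkaProducer<>(producerConfig("localhost:9092"));
        // testRecords(10).forEach(v -> producer.send(new ProducerRecord<>("input-topic", v)));
        System.out.println(testRecords(3));
    }
}
```

The point is that the generated data lives in your test code, so assertions against the Spark job's output are deterministic, unlike with randomly generated datagen records.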

Related

How to use Kafka Connect with JDBC sink and source using Python

I want to live-stream from one system to another.
I am using kafka-python and am able to live-stream locally.
I figured out that connectors can handle multiple devices. Can someone suggest a way to use connectors to implement this in Python?
Kafka Connect is a Java framework, not a Python one.
Kafka Connect exposes a REST API, which you can interact with using urllib3 or requests, not kafka-python.
https://kafka.apache.org/documentation/#connect
Once you create a connector, you are welcome to use kafka-python to produce data, which the JDBC sink would consume. Alternatively, you can use pandas, for example, to write to a database, which the JDBC source (or Debezium) would then pick up.
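To illustrate the REST call that creates a connector (shown here in Java with the stdlib HttpClient; in Python the same POST would go through requests), here is a sketch. The connector name, topic, and connection URL in the JSON payload are illustrative assumptions, and the actual network call is commented out:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class CreateConnector {
    // JSON config for a hypothetical JDBC sink connector; all values
    // (name, topics, connection.url) are placeholders to adapt.
    static String connectorJson() {
        return "{"
                + "\"name\": \"jdbc-sink\","
                + "\"config\": {"
                + "\"connector.class\": \"io.confluent.connect.jdbc.JdbcSinkConnector\","
                + "\"topics\": \"my-topic\","
                + "\"connection.url\": \"jdbc:postgresql://db:5432/mydb\""
                + "}}";
    }

    // POST /connectors against the Connect worker's REST endpoint
    // (port 8083 by default).
    static HttpRequest buildRequest(String connectUrl) {
        return HttpRequest.newBuilder(URI.create(connectUrl + "/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connectorJson()))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildRequest("http://localhost:8083");
        // To actually submit it:
        // HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(req.method() + " " + req.uri());
    }
}
```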

Listen to a topic continuously, fetch data, perform some basic cleansing

I'm building a Java-based Kafka streaming application that will listen to a topic X continuously, fetch data, perform some basic cleansing, and write to an Oracle database. The Kafka cluster is outside my domain, and I have no ability to deploy any code or configuration to it.
What is the best way to design such a solution? I came across Kafka Streams but was confused as to whether it can be used for 'Topic > Process > Topic' scenarios.
I came across Kafka Streams but was confused as to whether it can be used for 'Topic > Process > Topic' scenarios?
Absolutely.
For example, excluding the "process" step, it's two lines outside of the configuration setup.
final StreamsBuilder builder = new StreamsBuilder();
builder.stream("streams-plaintext-input").to("streams-pipe-output");
This code comes straight from the Kafka Streams documentation.
If you want to write to any database, you should first check if there is a Kafka Connect plugin to do that for you. Kafka Streams shouldn't really be used to read/write from/to external systems outside of Kafka, as it is latency-sensitive.
In your case, the JDBC Sink Connector would work well.
The kafka cluster is outside my domain and have no ability to deploy any code or configurations in it.
Using either solution above, you don't need to; but you will need some machine with Java installed to run a continuous Kafka Streams application and/or a Kafka Connect worker.
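To make the 'Topic > Process > Topic' shape concrete, here is a sketch with the "process" step filled in. The cleansing rules (trim, collapse whitespace) are illustrative, and the topology lines are commented out because they require kafka-streams on the classpath; only the pure cleansing function runs standalone:

```java
public class CleanseTopology {
    // Basic cleansing, kept as a plain function so it can be unit-tested
    // without any broker. The rules here are illustrative examples.
    static String cleanse(String value) {
        return value == null ? "" : value.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // With kafka-streams on the classpath, the Topic > Process > Topic
        // flow would be roughly:
        // StreamsBuilder builder = new StreamsBuilder();
        // builder.stream("topic-x", Consumed.with(Serdes.String(), Serdes.String()))
        //        .mapValues(CleanseTopology::cleanse)
        //        .to("topic-x-clean");
        System.out.println(cleanse("  hello   world  "));
    }
}
```

Keeping the cleansing logic in its own function also means the database-writing concern stays with Kafka Connect, as the answer recommends.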

Why is Kafka Connect lightweight?

I have been working with Kafka Connect, Spark Streaming, and NiFi with Kafka for streaming data.
I am aware that, unlike the other technologies, Kafka Connect is not a separate application; it is a tool that ships with Kafka.
In distributed mode, all of these technologies implement parallelism through underlying tasks or threads. What makes Kafka Connect efficient when dealing with Kafka, and why is it called lightweight?
It's efficient and lightweight because it uses the built-in Kafka protocols and doesn't require an external resource manager such as YARN. While it is arguably better/easier to deploy Connect in Mesos/Kubernetes/Docker, that is not required.
The Connect API is also maintained by the core Kafka developers rather than people who just want a simple integration into another tool. For example, last time I checked, NiFi cannot access the Kafka message timestamps. And dealing with the Avro Schema Registry seems to be an afterthought in the other tools, compared to using Confluent Certified Connectors.

Test Kafka and Flink integration flow

I would like to test Kafka/Flink integration with, for example, FlinkKafkaConsumer011 and FlinkKafkaProducer011.
The process will be:
read from a Kafka topic with Flink
some manipulation with Flink
write into another Kafka topic with Flink
As a string example: read a string from the input topic, convert it to uppercase, and write it into a new topic.
The question is: how do I test this flow?
By test, I mean unit/integration tests.
Thanks!
The Flink documentation has a short section on how you can write unit/integration tests for your transformation operators: link. The doc also has a short section about testing checkpointing and state handling, and about using AbstractStreamOperatorTestHarness.
However, I think you are more interested in end-to-end integration testing (including testing sources and sinks). For that, you can start a Flink mini cluster. Here is a link to example code that starts a Flink mini cluster: link.
You can also launch a Kafka broker within a JVM and use it for your testing purposes. Flink's Kafka connector does that for its integration tests. Here is sample code starting the Kafka server: link.
If you are running locally, you can use a simple generator app to generate messages for your source Kafka topic (there are many available; you can generate messages continuously or on a configured interval). Here is an example of how you can set Flink's job global parameters when running locally: Kafka010Example.
Another alternative is to create an integration environment (as opposed to production) to run your end-to-end testing. You will get a real feel for how your program will behave in a production-like environment. It is always advisable to have a complete parallel testing environment, including test source/sink Kafka topics.
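Whichever environment you choose, the transformation itself is easiest to test when it is factored out of the Flink job. A sketch of the uppercase example from the question, with the Flink/Kafka wiring commented out because it needs flink-streaming-java and the Kafka connector on the classpath (topic names and properties are illustrative):

```java
public class UppercaseFlow {
    // The transformation under test, kept as a plain function so the
    // unit test needs no cluster; only the end-to-end test does.
    static String toUpper(String in) {
        return in.toUpperCase();
    }

    public static void main(String[] args) {
        // With the Flink dependencies available, the flow would be roughly:
        // StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // env.addSource(new FlinkKafkaConsumer011<>("input-topic", new SimpleStringSchema(), props))
        //    .map(UppercaseFlow::toUpper)
        //    .addSink(new FlinkKafkaProducer011<>("output-topic", new SimpleStringSchema(), props));
        // env.execute("uppercase-flow");
        System.out.println(toUpper("hello"));
    }
}
```

The same function can then be wrapped in a MapFunction and exercised against a Flink mini cluster plus an in-JVM Kafka broker for the integration layer.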

How to start Kafka server programmatically

We are trying to start the Kafka server (ZooKeeper and broker) programmatically in our application. Is there any API/library available for this?
Yes, you can use embedded Kafka, which will run ZooKeeper and the Kafka server for you. It is generally used for testing Kafka producers/consumers when there is no need to run them explicitly.
For more detail, refer to the embedded-kafka library.
To run it, call EmbeddedKafka.start() at the beginning and EmbeddedKafka.stop() at the end.