I need to build an app that reads from Kafka and writes the data to MongoDB.
Most of the time, the data will be written as is, but there will be cases where some processing of the data is needed.
I'm wondering which approach to take:
Use the Kafka Connect MongoDB sink, or use our "old and familiar" approach of building an app with a Kafka consumer that writes the data to Mongo using the MongoDB client (running on K8s).
What are the advantages/disadvantages of using Kafka Connect in terms of monitoring, scaling, debugging and pre-processing of the data?
Thanks
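For context, a rough sketch of what the plain consumer approach described above might look like; this is only a sketch, and the broker address, topic, database and collection names are all placeholder assumptions:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.bson.Document;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class KafkaToMongo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "mongo-writer");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             MongoClient mongo = MongoClients.create("mongodb://localhost:27017")) { // assumed URI
            MongoCollection<Document> collection = mongo.getDatabase("mydb").getCollection("events");
            consumer.subscribe(List.of("input-topic")); // assumed topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // any pre-processing of the record would happen here
                    collection.insertOne(Document.parse(record.value())); // assumes JSON string values
                }
                consumer.commitSync(); // commit offsets only after the writes succeed
            }
        }
    }
}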
Related
I have a requirement to read messages from a topic, enrich each message based on provided configuration (the data required for enrichment is sourced from external systems), and publish the enriched message to an output topic. Messages on both the source and output topics should be in Avro format.
Is this a good use case for a custom Kafka Connector or should I use Kafka Streams?
Why am I considering Kafka Connect?
Lightweight in terms of code and deployment
Configuration driven
Connection and error handling
Scalability
I like the plugin based approach in Connect. If there is a new type of message that needs to be handled I just deploy a new connector without having to deploy a full scale Java app.
Why am I not sure this is a good candidate for Kafka Connect?
Calls to external system
Can Kafka be both source and sink for a connector?
Can we use Avro schemas in connectors?
Performance under load
Cannot do stateful processing (currently there is no requirement)
I have experience with Kafka Streams but not with Connect
Use both?
Use Kafka Connect to source external database into a topic.
Use Kafka Streams to build that topic into a stream/table that can then be manipulated.
Use Kafka Connect to sink back into a database, or other system other than Kafka, as necessary.
Kafka Streams can also be config driven, can use plugins (i.e. via reflection), is just as scalable, and has no different connection modes (to Kafka). Performance should be similar. Error handling is really the only complex part. ksqlDB is entirely "config driven" via SQL statements, and can connect to external Connect clusters, or embed its own.
Avro works for both, yes.
Some connectors are temporarily stateful, as they build in-memory batches, such as the S3 or JDBC sink connectors.
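To make the Streams side concrete, a minimal enrichment topology could look like the sketch below. The topic names, serdes and the enrich() call are all hypothetical placeholders; in practice you would plug in Avro serdes via configuration and call the external system inside the value mapper:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class EnrichmentApp {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("orders-input", Consumed.with(Serdes.String(), Serdes.String())) // hypothetical source topic
               .mapValues(EnrichmentApp::enrich)                                        // the enrichment step
               .to("orders-enriched", Produced.with(Serdes.String(), Serdes.String())); // hypothetical output topic

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enrichment-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        new KafkaStreams(builder.build(), props).start();
    }

    // placeholder for the lookup against the external system that supplies the enrichment data
    private static String enrich(String value) {
        return value;
    }
}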
I need to consume changes coming from MongoDB.
As I explored my options, I noticed that there are two common options to do so:
Consume the MongoDB change streams directly
Use the MongoDB Kafka source connector to publish the messages to a Kafka topic and then consume that topic.
I'm dealing with a high throughput so scalability is important.
What is the right option and why? Thanks
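For reference, option 1 above (consuming the change stream directly with the MongoDB Java driver) could look roughly like the sketch below; the connection URI, database and collection names are assumptions:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.changestream.ChangeStreamDocument;
import org.bson.Document;

public class ChangeStreamConsumer {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) { // assumed URI
            MongoCollection<Document> collection = client.getDatabase("mydb").getCollection("events");
            // watch() opens a change stream cursor; note that change streams require a replica set or sharded cluster
            for (ChangeStreamDocument<Document> change : collection.watch()) {
                // each event carries the operation type and, for inserts, the full document
                System.out.println(change.getOperationType() + ": " + change.getFullDocument());
            }
        }
    }
}

With this option, resuming after failures and scaling out across instances become your application's responsibility, which is a large part of what the source connector option handles for you.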
I am using the Confluent MongoDB Atlas Source Connector to pull data from a MongoDB collection into Kafka. I have noticed that the connector is creating multiple topics in the Kafka cluster. I need the data to be available on one topic so that the consumer application can consume the data from it. How can I do this?
Besides, why is the Kafka connector creating so many topics? Isn't it difficult for consumer applications to retrieve the data with that approach?
Kafka Connect creates 3 internal topics for the whole cluster for managing its own workload. You should never need/want external consumers to use these.
In addition to that, connectors can create their own topics. Debezium, for example, creates a "database history topic", and again, this shouldn't be read outside of the Connect framework.
Most connectors only need to create one topic for the source to pull data into, and that is the topic consumers actually should care about.
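As for getting everything onto one topic: the MongoDB source connector typically writes one topic per watched collection (named roughly prefix.database.collection), so one common approach is the RegexRouter single message transform. The sketch below is only illustrative; the connection URI, prefix and target topic name are assumptions:

name=mongo-source
connector.class=com.mongodb.kafka.connect.MongoSourceConnector
connection.uri=mongodb://localhost:27017
database=mydb
topic.prefix=mongo
# the SMT only sees records produced by this connector, so .* safely matches all of its topics
transforms=mergeTopics
transforms.mergeTopics.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.mergeTopics.regex=.*
transforms.mergeTopics.replacement=mongo-events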
I have a Spring Boot Kafka Streams application which processes all the incoming events, stores them in the state store that Kafka Streams provides internally, and queries them using the interactive query service. Internally, Kafka Streams uses RocksDB for these state stores; I want to replace RocksDB with another configurable database such as MariaDB or MongoDB. Is there a way to do this? If not,
how can I configure the Kafka Streams application to use MongoDB for creating the state stores?
StateStore / KeyValueStore are open interfaces in Kafka Streams which can be used with TopologyBuilder.addStateStore
Yes, you can materialize values to your own store implementation with a database of your choice, but it'll affect processing semantics should there be any database connection issues, particularly with remote databases.
Instead, using a topic as more of a log of transactions, then following that up with Kafka Connect, is the proper approach for external systems.
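In other words, keep the store internal and make the results available on a topic for Connect to sink. A minimal sketch of that pattern (the topic and store names are assumptions, and this sits inside the topology setup):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

// keep the state store internal (RocksDB) and publish the results to a topic;
// a MongoDB sink connector running in Kafka Connect can then persist that topic
StreamsBuilder builder = new StreamsBuilder();
builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()))
       .groupByKey()
       .count(Materialized.as("event-counts"))   // still queryable via interactive queries
       .toStream()
       .to("event-counts", Produced.with(Serdes.String(), Serdes.Long()));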
I need to build a Java-based Kafka Streams application that will listen to a topic X continuously, fetch data, perform some basic cleansing and write to an Oracle database. The Kafka cluster is outside my domain and I have no ability to deploy any code or configurations in it.
What is the best way to design such a solution? I came across Kafka Streams but was confused as to whether it can be used for 'Topic > Process > Topic' scenarios.
I came across Kafka Streams but was confused as to whether it can be used for 'Topic > Process > Topic' scenarios?
Absolutely.
For example, excluding the "process" step, it's two lines outside of the configuration setup.
final StreamsBuilder builder = new StreamsBuilder();
// read from the input topic and write straight to the output topic
builder.stream("streams-plaintext-input").to("streams-pipe-output");
This code is straight from the documentation.
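Adding the "process" step from the question is one more line. Here's a sketch where the cleansing is just a placeholder trim() and the topic names are assumptions:

final StreamsBuilder builder = new StreamsBuilder();
builder.stream("topic-x", Consumed.with(Serdes.String(), Serdes.String()))
       .mapValues(value -> value.trim())   // placeholder for the actual cleansing logic
       .to("topic-x-cleansed", Produced.with(Serdes.String(), Serdes.String()));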
If you want to write to any database, you should first check if there is a Kafka Connect plugin to do that for you. Kafka Streams shouldn't really be used to read/write from/to external systems outside of Kafka, as it is latency-sensitive.
In your case, the JDBC Sink Connector would work well.
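A JDBC sink configuration for that could look roughly like the sketch below; the JDBC URL, credentials and topic name are assumptions:

name=oracle-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
topics=topic-x-cleansed
connection.url=jdbc:oracle:thin:@//dbhost:1521/ORCL
connection.user=app_user
connection.password=app_password
auto.create=true
insert.mode=insert

Note that the JDBC sink needs records with a schema (e.g. Avro, or JSON with schemas enabled), and the Oracle JDBC driver must be on the Connect worker's classpath.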
The Kafka cluster is outside my domain and I have no ability to deploy any code or configurations in it.
Using either solution above, you don't need to, but you will need some machine with Java installed to run a continuous Kafka Streams application and/or Kafka Connect worker.