MongoDB Atlas Source Connector Single Topic

I am using the Confluent MongoDB Atlas Source Connector to pull data from a MongoDB collection into Kafka. I have noticed that the connector is creating multiple topics in the Kafka cluster. I need the data to be available on one topic so that the consumer application can consume the data from that topic. How can I do this?
Besides, why is the Kafka connector creating so many topics? Isn't it difficult for consumer applications to retrieve the data with that approach?

Kafka Connect creates three internal topics for the whole cluster (for connector configs, offsets, and status) to manage its own workload. You should never need or want external consumers to use these.
In addition to that, connectors can create their own topics. Debezium, for example, creates a "database history topic", and again, this shouldn't be read outside of the Connect framework.
Most connectors only need to create one topic per source for the data to be pulled into, and those are the topics consumers should actually care about. The MongoDB source connector creates one topic per watched collection, named <topic.prefix>.<database>.<collection>, which is why you are seeing several of them.
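If you configure the source connector with both a database and a collection, it watches just that one collection and already produces to a single topic. If you need to watch several collections but still land everything on one topic, a framework-level RegexRouter transform can rename every generated topic to a fixed name. Below is a rough sketch of registering such a connector through the Connect REST API (Java 15+ text block); the connection string, database, topic names, and the localhost:8083 Connect address are placeholders, and you would usually narrow the regex rather than collapsing everything with .* :

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateSingleTopicMongoSource {
    public static void main(String[] args) throws Exception {
        // Connector config as it would be POSTed to the Kafka Connect REST API.
        // All names and the connection string are placeholders.
        String body = """
            {
              "name": "mongo-atlas-source",
              "config": {
                "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
                "connection.uri": "mongodb+srv://<user>:<password>@<cluster-host>/",
                "database": "mydb",
                "topic.prefix": "mongo",
                "transforms": "toOneTopic",
                "transforms.toOneTopic.type": "org.apache.kafka.connect.transforms.RegexRouter",
                "transforms.toOneTopic.regex": ".*",
                "transforms.toOneTopic.replacement": "mongo.all-changes"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

The same config block can of course be saved as a JSON file and POSTed with any HTTP client instead; the RegexRouter only renames topics, so records from different collections will end up interleaved on the single topic.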

Related

Publish rdbms table record in kafka topic

I have a workflow where an upstream system generates data, a transformer module applies some business logic to it, and the result is stored in a table. The requirement now is that I need to publish that result to a Kafka topic.
You can use Debezium to pull CDC logs from a few supported databases into a Kafka topic.
Otherwise, Kafka Connect offers many plugins for different data sources, and Confluent Hub is a sort of index where you can search for those
Or simply make your data generator into a Kafka producer instead of just a database client.
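For that last option, the transformer module only needs a few lines with the plain Kafka producer client. This is a minimal sketch, not your actual pipeline: the broker address, topic name, and payload are made up, and it assumes String serialization.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ResultPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");              // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Call this right after the transformer stores its result in the table.
            producer.send(new ProducerRecord<>("transformer-results",  // placeholder topic
                    "record-123", "{\"id\":\"record-123\",\"status\":\"PROCESSED\"}"));
            producer.flush();
        }
    }
}
```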

Kafka Connect or Kafka Streams?

I have a requirement to read messages from a topic, enrich the message based on provided configuration (data required for enrichment is sourced from external systems), and publish the enriched message to an output topic. Messages on both the source and output topics should be in Avro format.
Is this a good use case for a custom Kafka Connector or should I use Kafka Streams?
Why am I considering Kafka Connect?
Lightweight in terms of code and deployment
Configuration driven
Connection and error handling
Scalability
I like the plugin based approach in Connect. If there is a new type of message that needs to be handled I just deploy a new connector without having to deploy a full scale Java app.
Why am I not sure this is a good candidate for Kafka Connect?
Calls to external system
Can Kafka be both source and sink for a connector?
Can we use Avro schemas in connectors?
Performance under load
Cannot do stateful processing (currently there is no requirement)
I have experience with Kafka Streams but not with Connect
Use both?
Use Kafka Connect to source external database into a topic.
Use Kafka Streams to build that topic into a stream/table that can then be manipulated.
Use Kafka Connect to sink back into a database, or other system other than Kafka, as necessary.
Kafka Streams can also be config driven, can use plugins (i.e. via reflection), is just as scalable, and has the same connection model (to Kafka). Performance should be similar. Error handling is really the only complex part. ksqlDB is entirely "config driven" via SQL statements, and can connect to external Connect clusters or embed its own.
Avro works for both, yes.
Some connectors are temporarily stateful, as they build in-memory batches; the S3 and JDBC sink connectors are examples.
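For comparison, the read-enrich-write flow in Kafka Streams is a small amount of topology code. This is only a sketch: it uses String serdes and a stubbed-out enrich method, whereas your case would plug in Avro serdes and the real lookup against the external system (ideally cached or pre-loaded rather than called per record).

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class EnrichmentApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enrichment-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");     // placeholder topic names
        input.mapValues(EnrichmentApp::enrich)                             // stateless per-record enrichment
             .to("enriched-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        streams.start();
    }

    // Placeholder: merge in data from the external system (ideally from a cache
    // or a pre-loaded table rather than a remote call per record).
    private static String enrich(String value) {
        return value + " | enriched";
    }
}
```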

How to group kafka topics in different dbs and collections with mongodb sink connector depending on kafka topic name or message key/value

As the title states, I'm using the Debezium Postgres source connector, and I would like the MongoDB sink connector to group Kafka topics into different collections and databases (separate databases to isolate unrelated data) according to their names. While looking into this I came across the topic.regex connector property in the Mongo docs. Unfortunately, this only creates a collection in Mongo for each Kafka topic successfully matched against the specified regex, and I'm planning on using the same MongoDB server to host many databases captured from multiple Debezium source connectors. Can you help me?
Note: I read about the Mongo sink setting FieldPathNamespaceMapper, but I'm not sure whether it would fit my needs or how to configure it correctly.
topics.regex is a general sink connector property, not unique to Mongo.
If I understand the problem correctly, collections will only get created in the configured database for Kafka topics that actually exist (i.e. match the pattern) and get consumed by the sink.
If you want collection names that don't simply mirror the topic names, you'll still need to consume those topics, but you can explicitly rename them via the RegexRouter transform before records are written to Mongo (there is a config sketch after the next answer).
In Kafka Connect, workers are simple containers that can run multiple connectors. For each connector, the workers generate tasks according to internal rules and your configuration. So, if you take a look at the MongoDB sink connector configuration properties:
https://www.mongodb.com/docs/kafka-connector/current/sink-connector/configuration-properties/all-properties/
You can create different connectors with the same connection.uri, database and collection, or with different values. So you might use the topics.regex or topics parameters to group the topics for a single connector with its own connection.uri, database and collection, and run multiple connectors at the same time. Remember that if tasks.max > 1 in your connector, messages might be read out of order. If that is not a problem, set tasks.max close to the number of MongoDB shards; the worker will adjust the number of tasks automatically.
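To make that concrete, here is a rough sketch of one such per-group sink connector being registered over the Connect REST API. Everything in it is a placeholder (the connector name, URIs, and the pg1.inventory topic prefix): it subscribes to one group of topics via topics.regex, strips the prefix with RegexRouter, and writes into its own database; since no collection is set, each record lands in a collection named after its (renamed) topic. You would create one connector like this per group.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateGroupedMongoSink {
    public static void main(String[] args) throws Exception {
        // One sink connector per "group": its own topics.regex, database, and optionally connection.uri.
        // Placeholders throughout; repeat with different values for each group.
        String body = """
            {
              "name": "mongo-sink-inventory",
              "config": {
                "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
                "connection.uri": "mongodb://mongo1:27017",
                "database": "inventory",
                "topics.regex": "pg1[.]inventory[.].*",
                "transforms": "dropPrefix",
                "transforms.dropPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
                "transforms.dropPrefix.regex": "pg1[.]inventory[.](.*)",
                "transforms.dropPrefix.replacement": "$1"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```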

Why should I use Kafka connect mongo DB source connector over change streams only?

I need to consume changes coming from changes in MongoDB.
As I explored my options, I noticed that there are two common options to do so:
Consume the MongoDB change streams directly
Use the MongoDB Kafka source connector to publish the messages to the Kafka topic and then consume this topic.
I'm dealing with high throughput, so scalability is important.
What is the right option and why? Thanks

How to enable Kafka sink connector to insert data from topics to tables as and when sink is up

I have developed a Kafka sink connector (using confluent-oss-3.2.0-2.11 and the Connect framework) for my data store (Amppol ADS), which stores data from Kafka topics into corresponding tables in my store.
Everything is working as expected as long as the Kafka servers and ADS servers are up and running.
I need help/suggestions about a specific use case where events are getting ingested into Kafka topics while the underlying sink component (ADS) is down.
The expectation here is that whenever the sink servers come back up, records that were ingested earlier into the Kafka topics should be inserted into the tables.
Kindly advise how to handle such a case.
Is there any support available in the Connect framework for this? Or at least some references would be a great help.
Sink connector offsets are maintained in the __consumer_offsets topic on Kafka, under a consumer group tied to your connector name, and when the sink connector restarts it will pick up messages from the previous offset it had stored in that topic.
So you don't have to worry about managing offsets; it's all done by the workers in the Connect framework. In your scenario, just go and restart your sink connector. If the messages were pushed to Kafka by your source connector and are still available in Kafka, the sink connector can be started or restarted at any time.
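If you want to verify this yourself, the sink connector's consumer group is named connect-<connector name> by default, and its committed offsets can be inspected like any other group's. A small sketch using the Kafka Admin client; the broker address and connector name are placeholders.

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ShowSinkOffsets {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        try (Admin admin = Admin.create(props)) {
            // Sink connectors consume with group.id "connect-<connector name>" by default.
            Map<TopicPartition, OffsetAndMetadata> offsets =
                    admin.listConsumerGroupOffsets("connect-my-ads-sink")        // placeholder name
                         .partitionsToOffsetAndMetadata()
                         .get();
            offsets.forEach((tp, om) -> System.out.println(tp + " -> " + om.offset()));
        }
    }
}
```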