Is there a way to configure Debezium to store in Kafka not all the changes from the database but only certain ones? - mongodb

I have MongoDB and I need to send the changes matching a certain query to a Kafka broker. I heard that Debezium tracks changes from a database and stores them in Kafka. But is there a way to configure that process to store not all the changes that happen in the database, but only certain ones?

You can perform some filtering using their single message transform (SMT) Kafka Connect plugin. You can check its documentation to see if it has the features that you need: https://debezium.io/documentation/reference/stable/transformations/filtering.html
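As a rough sketch of what that filter SMT configuration can look like on the MongoDB connector (the condition below is hypothetical, and the scripting-based filter requires a JSR 223 language such as Groovy on the Connect worker's classpath):

    transforms=filter
    transforms.filter.type=io.debezium.transforms.Filter
    transforms.filter.language=jsr223.groovy
    # keep only update events; any Groovy expression over the record is allowed
    transforms.filter.condition=value.op == 'u'

Note that for the MongoDB connector the after field is a JSON string, so filtering on document fields usually means parsing it inside the condition or combining the SMT with the connector's collection include/exclude settings.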

Depending on the source technology, you can.
When using PostgreSQL as a source, for example, you can define which operations to include in the PG publication that Debezium reads.
More info in the Debezium docs.
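As a rough illustration of that approach (table, host, and server names are hypothetical, and some property names differ between Debezium versions, e.g. topic.prefix was database.server.name in older releases), you create the publication yourself with only the operations you want and point the connector at it:

    # SQL run on PostgreSQL beforehand, publishing only inserts and updates:
    #   CREATE PUBLICATION dbz_publication FOR TABLE public.orders
    #     WITH (publish = 'insert, update');
    connector.class=io.debezium.connector.postgresql.PostgresConnector
    database.hostname=pg-host
    database.port=5432
    database.user=debezium
    database.password=debezium
    database.dbname=mydb
    topic.prefix=pgserver1
    plugin.name=pgoutput
    table.include.list=public.orders
    publication.name=dbz_publication
    # disabled so Debezium does not auto-create a publication covering all operations
    publication.autocreate.mode=disabled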

Related

Direct Kafka Topic to Database table

Is there a way to automatically tell Kafka to send all events of a specific topic to a specific table of a database, in order to avoid creating a new consumer that needs to read from that topic and perform the copy explicitly?
You have two options here:
Kafka Connect - this is the standard way to connect your Kafka to a database. There are a lot of connectors. In order to choose one:
The best bet is to use the specific one for your database that is maintained by Confluent.
If you don't have a specific one, the second best option is to use the JDBC connector.
Direct ingestion from the database if your database supports it (for instance, ClickHouse and MemSQL are able to load data coming from a Kafka topic). The difference between this and Kafka Connect is that this way it is fully supported and tested by the DB vendor, and you need to maintain fewer pieces of infrastructure.
Which one is better? It depends on:
your data volume
how much you can (and need!) to parallelize the load
and how much you can tolerate downtime or latencies.
Direct ingestion by the DB usually runs as a single node (consumer) pulling from Kafka.
It is good for mid-to-low data volumes. If it fails (or throttles), you might have latency issues.
Kafka Connect allows you to insert data in parallel into the DB using several workers. If one of the workers fails, the load is redistributed among the others. If you have a lot of data, this is probably the best way to load it into the DB, but you'll need to take care of the Kafka Connect infrastructure unless you're using a managed cloud offering.
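To make the Kafka Connect option concrete, a sink configuration for the Confluent JDBC sink connector looks roughly like this (topic, table key, and connection details are hypothetical placeholders):

    name=orders-jdbc-sink
    connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
    tasks.max=3
    topics=orders
    connection.url=jdbc:postgresql://db-host:5432/mydb
    connection.user=connect_user
    connection.password=connect_password
    # create the target table automatically if it does not exist
    auto.create=true
    # upsert instead of plain insert; requires a key for each record
    insert.mode=upsert
    pk.mode=record_key
    pk.fields=order_id

tasks.max is what gives you the parallel load mentioned above: Connect spreads the tasks across the workers in the cluster.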

CDC (Changed Data Capture) with Couchbase

I have a requirement to capture changes from a stream of data. Below given is my solution.
Data flows into Kafka -> a consumer picks up the data and inserts/updates (trimmed data) into DynamoDB (we have configured DynamoDB Streams). After every insert/update, a stream record is generated with the changed data, which is then interpreted and processed by a Lambda.
Now my question is: if I have to replace DynamoDB with Couchbase, will Couchbase provide CDC out of the box? I am pretty new to Couchbase, and I tried searching for a CDC feature but found no direct documentation.
Any pointers would be very helpful! Thanks!
Couchbase has an officially supported Kafka Connector (documentation here).
I'm not familiar with the "CDC" term, but this Couchbase Kafka connector can act as both a sink and a source. It's not "out of the box" per se, it's a separate connector.
It seems Change Data Capture (CDC) isn't supported in Couchbase as such, but there are features to notify you when documents change. For example, the source Kafka connector uses that mechanism and sends documents when they change, including metadata when configured with DefaultSchemaSourceHandler, which should be close enough to CDC:
https://docs.couchbase.com/kafka-connector/current/quickstart.html#defaultschemasourcehandler
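A very rough sketch of such a source connector configuration is below; property names have changed between connector versions (older releases use connection.cluster_address / connection.bucket, newer ones the couchbase.* names), so treat this as illustrative and check the quickstart above for the exact keys:

    name=couchbase-source
    connector.class=com.couchbase.connect.kafka.CouchbaseSourceConnector
    tasks.max=2
    couchbase.seed.nodes=127.0.0.1
    couchbase.bucket=travel-sample
    couchbase.username=Administrator
    couchbase.password=password
    # emit change events together with metadata and a schema
    couchbase.source.handler=com.couchbase.connect.kafka.handler.source.DefaultSchemaSourceHandler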

Is it possible to use Kafka Connect to mirror an RDBMS table to a Kafka Stream?

I know it's possible to push updates from a database to a Kafka stream using Kafka Connect. My question is, can I create a consumer to write changes from that same stream back into the table without creating an infinite loop?
I'm assuming if I create a consumer that writes updates into the database table, it would trigger Connect to push that update to the stream, etc. Is there a way around this so I can mirror a database table to a stream?
You can stream from a Kafka topic to a database using the JDBC Sink connector for Kafka Connect.
You'd need to build the business logic for avoiding an infinite replication loop into either the connectors or your consumer. For example:
JDBC Source connector uses a WHERE clause to only pull records with a flag set to indicate they are the original record
Custom Single Message Transform in the source connector to drop records with a flag set to indicate they are not the original record
Stream application (e.g. KSQL / Kafka Streams) processes the inbound stream of all database changes to filter out only those with a flag set to indicate they are the original record
This option is inefficient, though, because you're still streaming everything from the database.
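As a sketch of the first option, the Confluent JDBC source connector lets you supply a custom query; the flag column (origin here) and table are hypothetical, and the filter is wrapped in a subquery because the connector appends its own incremental WHERE clause:

    name=orders-jdbc-source
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    connection.url=jdbc:postgresql://db-host:5432/mydb
    connection.user=connect_user
    connection.password=connect_password
    mode=incrementing
    incrementing.column.name=id
    # only pull rows written by the application, not rows written back by the sink
    query=SELECT * FROM (SELECT * FROM orders WHERE origin = 'app') app_rows
    topic.prefix=orders-outbound

The sink side would then write its rows with a different flag value so they are never picked up again.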
Yes. It is possible to configure synchronisation/replication.

Querying MySQL tables using Apache Kafka

I am trying to use Kafka Streams for achieving a use-case.
I have two tables in MySQL - User and Account. And I am getting events from MySQL into Kafka using a Kafka MySQL connector.
I need to get all user-IDs within an account from within Kafka itself.
So I was planning to use a KStream on the MySQL output topic, process it to form an output, and publish it to a topic with the key as the account ID and the value as the user IDs separated by commas.
Then I can use interactive queries to get all user IDs for an account ID, with the get() method of the ReadOnlyKeyValueStore class.
Is this the right way to do this? Is there a better way?
Can KSQL be used here?
You can use Kafka Connect to stream data in from MySQL, e.g. using Debezium. From there you can use Kafka Streams, or KSQL, to transform the data, including re-keying, which I think is what you're looking to do here, as well as joining it to other streams.
If you ingest the data from MySQL into a topic with log compaction set then you are guaranteed to always have the latest value for every key in the topic.
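If you go the KSQL route, a sketch of the re-keying and aggregation could look like the statements below; the topic and column names are hypothetical, and COLLECT_LIST returns an array of user IDs rather than a comma-separated string:

    -- register the MySQL change topic as a stream
    CREATE STREAM users_src (user_id VARCHAR, account_id VARCHAR)
      WITH (KAFKA_TOPIC='mysql-users', VALUE_FORMAT='JSON');

    -- re-key by account and collect the user IDs per account
    CREATE TABLE users_by_account AS
      SELECT account_id, COLLECT_LIST(user_id) AS user_ids
      FROM users_src
      GROUP BY account_id;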
I would take a look at Striim if you want built-in CDC and interactive continuous SQL queries on the streaming data in one UI. More info here:
http://www.striim.com/blog/2017/08/making-apache-kafka-processing-preparation-kafka/

Send different instances of Kafka Connect to different Kafka topic

I have tried to send the information from a Kafka Connect instance in distributed mode with one worker to a specific topic; I have the topic name in the "archive.properties" file that I use when I launch the instance.
But when I launch five or more instances, I see the messages merged across all the topics.
The "solution" I thought of was to make a map to store the relation between ID and topic, but it didn't work.
Is there a specific Kafka Connect implementation to do this?
Thanks.
First, details on how you are running Connect and which connector you are using would be very helpful.
Some connectors support sending data to more than one topic. For example, the Confluent JDBC source connector will send each table to a separate topic. So this could be a limitation of the connector you are using.
Also, depending on the connector and your use case, you may or may not need to run more than one connector. With the JDBC connector, you need one connector per database and it will handle all the tables. If you run two connectors on the same database and the same tables, you'll get duplicates.
In short, hopefully your connector has helpful documentation.
In the next release of Apache Kafka we are adding Single Message Transformations. One of the transformations can modify the target topic based on data in the event - so you can use the transformation to perform event routing.
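Once SMTs are available, topic routing can be configured per connector; a minimal sketch using the stock RegexRouter to push everything a given connector instance produces to its own topic (the topic name is hypothetical, and routing on a field value rather than the topic name needs a custom or add-on transform):

    transforms=route
    transforms.route.type=org.apache.kafka.connect.transforms.RegexRouter
    # rewrite whatever topic the connector would use to a fixed per-instance topic
    transforms.route.regex=.*
    transforms.route.replacement=archive-instance-1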