CDC (Change Data Capture) with Couchbase - apache-kafka

I have a requirement to capture changes from a stream of data. My current solution is given below.
Data flows into Kafka -> a consumer picks up the data and inserts/updates (trimmed data) to DynamoDB (we have configured DynamoDB Streams). After every insert/update, a stream record is generated with the changed data, which is then interpreted and processed by a Lambda.
Now my question is: if I have to replace DynamoDB with Couchbase, will Couchbase provide CDC out of the box? I am pretty new to Couchbase, and I tried searching for a CDC feature but found no direct documentation.
Any pointers would be very helpful! Thanks!

Couchbase has an officially supported Kafka Connector (documentation here).
I'm not familiar with the "CDC" term, but this Couchbase Kafka connector can act as both a sink and a source. It's not "out of the box" per se; it's a separate connector.

It seems Change Data Capture (CDC) isn't supported in Couchbase as such, but there are features to notify you when documents change. For example, the source Kafka connector uses that mechanism and sends documents when they change, including metadata when configured with DefaultSchemaSourceHandler, which should be close enough to CDC:
https://docs.couchbase.com/kafka-connector/current/quickstart.html#defaultschemasourcehandler
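For reference, here is a minimal sketch of what such a source connector configuration could look like, written as a plain Java map only so it can be shown as runnable code. All key names and values (seed nodes, bucket, handler class, topic) are illustrative and depend on the connector version, so verify them against the quickstart linked above.

```java
import java.util.Map;

public class CouchbaseSourceConfigSketch {
    public static void main(String[] args) {
        // Illustrative settings only; key names differ between connector versions.
        Map<String, String> config = Map.of(
            "name", "couchbase-source",
            "connector.class", "com.couchbase.connect.kafka.CouchbaseSourceConnector",
            "couchbase.seed.nodes", "127.0.0.1",
            "couchbase.bucket", "travel-sample",
            "couchbase.username", "Administrator",
            "couchbase.password", "password",
            // Ask the connector to emit documents with Connect schemas and change metadata.
            "couchbase.source.handler",
                "com.couchbase.connect.kafka.handler.source.DefaultSchemaSourceHandler",
            "couchbase.topic", "couchbase-changes"
        );
        // In practice this map would be POSTed to the Kafka Connect REST API
        // (or written to a .properties file for standalone mode).
        config.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```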

Related

Is there a way to configure Debezium to store in Kafka not all the changes from the database but only certain ones?

I have MongoDB and I need to send the changes from a certain query to a Kafka broker. I heard that Debezium tracks changes from the database and stores them in Kafka. But is there a way to configure that process to store not all the changes that happen in the database but only certain ones?
You can perform some filtering using their single message transform (SMT) Kafka Connect plugin. You can check its documentation to see if it has the features that you need: https://debezium.io/documentation/reference/stable/transformations/filtering.html
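As a rough illustration of the filter SMT, here is a hypothetical fragment of connector configuration (shown as a Java map just to keep it runnable). The property names follow the filtering docs linked above, but the condition expression and collection name are made up; check them against the Debezium version you run.

```java
import java.util.Map;

public class DebeziumFilterSmtSketch {
    public static void main(String[] args) {
        // Hypothetical filter SMT settings; verify property names for your Debezium version.
        Map<String, String> filterConfig = Map.of(
            "transforms", "filter",
            "transforms.filter.type", "io.debezium.transforms.Filter",
            "transforms.filter.language", "jsr223.groovy",
            // Keep only non-delete events from a single (made-up) collection.
            "transforms.filter.condition",
                "value.source.collection == 'orders' && value.op != 'd'"
        );
        filterConfig.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```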
Depending on the source technology, you could.
When using PostgreSQL as a source, for example, you can define which operations to include in the PG publication that is read by Debezium.
More info in the Debezium docs.
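To make the PostgreSQL route concrete, a publication limited to certain operations can be created with plain SQL; the sketch below runs that statement over JDBC. The publication name, table, and connection details are placeholders, and Debezium would then need to be pointed at that publication (for example via its publication.name setting).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PgPublicationSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details and table name.
        String url = "jdbc:postgresql://localhost:5432/inventory";
        try (Connection conn = DriverManager.getConnection(url, "postgres", "postgres");
             Statement st = conn.createStatement()) {
            // Publish only INSERT and UPDATE operations for this table,
            // so deletes never reach the Debezium connector reading it.
            st.execute("CREATE PUBLICATION dbz_publication FOR TABLE inventory.orders "
                     + "WITH (publish = 'insert, update')");
        }
    }
}
```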

Why does Kafka Connect work?

I'm trying to wrap my head around how Kafka Connect works and I can't understand one particular thing.
From what I have read and watched, I understand that Kafka Connect allows you to send data into Kafka using Source Connectors and read data from Kafka using Sink Connectors. And the great thing about this is that Kafka Connect somehow abstracts away all the platform-specific things and all you have to care about is having proper connectors. E.g. you can use a PostgreSQL Source Connector to write to Kafka and then use Elasticsearch and Neo4J Sink Connectors in parallel to read the data from Kafka.
My question is: how does this abstraction work? Why are Source and Sink connectors written by different people able to work together? In order to read data from Kafka and write them anywhere, you have to expect some fixed message structure/schema, right? E.g. how does an Elasticsearch Sink know in advance what kind of messages would a PostgreSQL Source produce? What if I replaced PostgreSQL Source with MySQL source? Would the produced messages have the same structure?
It would be logical to assume that Kafka requires some kind of fixed message structure, but according to the documentation the SourceRecord that is sent to Kafka does not necessarily have a fixed structure:
"...can have arbitrary structure and should be represented using org.apache.kafka.connect.data objects (or primitive values). For example, a database connector might specify the sourcePartition as a record containing { "db": "database_name", "table": "table_name" } and the sourceOffset as a Long containing the timestamp of the row."
In order to read data from Kafka and write them anywhere, you have to expect some fixed message structure/schema, right?
Exactly. Refer to the Javadoc on the Struct and Schema classes of the Connect API, as well as the Converter interface.
Of course, those are not strict requirements, but without them the framework doesn't work across different sources and sinks. This is no different from the contract between producers and consumers regarding serialization.
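Here is a small sketch of that contract, using the Connect API classes mentioned above. The schema fields, topic name, and offset values are invented for illustration; the point is that a sink only ever sees the schema'd record, never the source database's own types.

```java
import java.util.Map;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.source.SourceRecord;

public class ConnectSchemaSketch {
    public static void main(String[] args) {
        // A source connector describes each record with a Schema...
        Schema valueSchema = SchemaBuilder.struct().name("customer")
                .field("id", Schema.INT64_SCHEMA)
                .field("name", Schema.STRING_SCHEMA)
                .build();

        // ...and a Struct that conforms to it (field values are made up).
        Struct value = new Struct(valueSchema)
                .put("id", 42L)
                .put("name", "Alice");

        // The partition/offset maps mirror the Javadoc example quoted above.
        SourceRecord record = new SourceRecord(
                Map.of("db", "database_name", "table", "table_name"),
                Map.of("timestamp", 1526000000000L),
                "customers",   // destination topic
                valueSchema,
                value);

        // Whether the source was PostgreSQL or MySQL, a sink connector receives
        // this schema'd record after the configured Converter has deserialized it.
        System.out.println(record);
    }
}
```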

How to do transformations in Kafka (PostgreSQL -> Redshift)

I'm new to Kafka/AWS. My requirement is to load data from several sources into a DW (Redshift).
One of my sources is PostgreSQL. I found a good article on using Kafka to sync data into Redshift.
The article is good enough for syncing the data from PostgreSQL to Redshift, but my requirement is to transform the data before loading it into Redshift.
Can somebody help me with how to transform the data in Kafka (PostgreSQL -> Redshift)?
Thanks in Advance
Jay
Here's an article I just published on exactly this pattern, describing how to use Apache Kafka's Connect API, and KSQL (which is built on Kafka's Streams API) to do streaming ETL: https://www.confluent.io/ksql-in-action-real-time-streaming-etl-from-oracle-transactional-data
You should check out Debezium for streaming events from Postgres into Kafka.
For this, you can use any streaming application, be it Storm, Spark, or Kafka Streams. These applications will consume data from different sources, and the data transformation can be done on the fly. All three have their own advantages and complexities.
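If you go the Kafka Streams route, the transformation between the source topic and the topic your Redshift sink reads can be quite small. The sketch below is a generic example under assumed topic names (pg_orders, orders_transformed) and uses a trivial string transformation as a stand-in for real cleansing logic.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class PgToRedshiftTransform {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pg-to-redshift-transform");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        // Topic names are placeholders: read whatever the Postgres source
        // connector produces, transform, and write to the topic that the
        // Redshift sink consumes.
        builder.stream("pg_orders", Consumed.with(Serdes.String(), Serdes.String()))
               .filter((key, value) -> value != null)
               .mapValues(value -> value.toUpperCase()) // stand-in for real cleansing logic
               .to("orders_transformed", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```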

How does Kafka know when source data has changed?

I can't find a definitive answer, so I figured I would ask the experts. How does Kafka observe and detect what data in a given source has changed? For instance, in a relational database?
Polling comes to mind, but wouldn't it then have to maintain a data set of all primary keys per available table, and then run checks to see if new primary keys are available? Where is this stored, since memory is probably not durable enough?
This is a very general question so you can imagine the answer is "it depends". Kafka isn't tracking this per se; it's done by whatever Kafka client implementation you have. For example, if you implement a Kafka Connect source connector, then you can store offsets in Kafka itself to checkpoint what data has been read. If you are just writing a producer, it's a different story. A pretty general example can be found in the Confluent JDBC source connector. It has multiple modes for loading that can give you an idea of the flexibility: https://docs.confluent.io/current/connect/connect-jdbc/docs/source_connector.html#features
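For a concrete picture of those modes, here is an illustrative configuration for the Confluent JDBC source connector (shown as a Java map purely so it compiles). The property names follow the linked docs, but the connection URL, table, and column names are placeholders.

```java
import java.util.Map;

public class JdbcSourceModeSketch {
    public static void main(String[] args) {
        // Illustrative JDBC source connector settings; verify against the linked docs.
        Map<String, String> config = Map.of(
            "name", "jdbc-orders-source",
            "connector.class", "io.confluent.connect.jdbc.JdbcSourceConnector",
            "connection.url", "jdbc:postgresql://localhost:5432/shop",
            // Detect changes via a strictly increasing id plus an updated-at
            // timestamp; the connector checkpoints the last-seen values as
            // offsets in Kafka rather than keeping them only in memory.
            "mode", "timestamp+incrementing",
            "incrementing.column.name", "id",
            "timestamp.column.name", "updated_at",
            "table.whitelist", "orders",
            "topic.prefix", "pg-"
        );
        config.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```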

Flume or Kafka's equivalent for MongoDB

In the Hadoop world, Flume or Kafka is used to stream or collect data and store it in Hadoop. I am just wondering whether MongoDB has some similar mechanisms or tools to achieve the same?
MongoDB is just the database layer, not the complete solution like the Hadoop ecosystem. I actually use Kafka along with Storm to store data in MongoDB in cases where there is a very large flow of incoming data which needs to be processed and stored.
Although Flume is frequently used and treated as a member of the Hadoop ecosystem, it's not impossible to use it with other sources/sinks, and MongoDB is no exception. In fact, Flume is flexible enough to be extended to create your own custom sources/sinks. See this project, for example: a custom Flume MongoDB sink.