How to let Debezium start reading the binlog from the latest position - apache-kafka

I'm trying to make Debezium start reading the binlog from the end of the file (the latest position) directly.
Could someone help with this, please?

Based on the docs it looks like you can use snapshot.mode=schema_only:
If you don’t need the topics to contain a consistent snapshot of the data but only need them to have the changes since the connector was started, you can use the schema_only option, where the connector only snapshots the schemas (not the data).
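For example, a connector registration that skips the data snapshot could look roughly like the following (a sketch only, using Debezium 1.x MySQL property names; the connector name, hostnames, credentials, and topic names are placeholders):

{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "dbserver1",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory",
    "snapshot.mode": "schema_only"
  }
}

With schema_only the connector snapshots only the table structures and then starts streaming from the binlog position at which it was started, which is effectively the end of the binlog.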

Related

Kafka connect: How to handle database schema/table changes

Wondering if there is a documented process for handling database schema changes. I am using the Debezium source connector for Postgres and the Confluent JDBC sink connector to replicate database changes. I need to make some changes to the database, as below:
Add new columns to an existing table
Modify a column's type and rename it
I am not sure what the best way to do this is. The solution I can think of is:
Stop the source connector
Wait for the sinks to consume all messages
Update the database schema
Start the source and sink connectors
Debezium will automatically add new fields to the record schema for new columns, so you would update your consumers and downstream systems first to prepare for those events. There is no need to stop the source connector.
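For instance, after adding a nullable column the value schema that Debezium emits simply gains one more optional field, which is a backward-compatible change for consumers (a sketch of the relevant fragment only; the table and column names here are made up):

{
  "type": "struct",
  "fields": [
    { "type": "int32",  "optional": false, "field": "id" },
    { "type": "string", "optional": false, "field": "first_name" },
    { "type": "string", "optional": true,  "field": "middle_name" }
  ],
  "optional": true,
  "name": "dbserver1.inventory.customers.Value"
}

Consumers that ignore unknown fields keep working unchanged; only the systems that actually need the new column have to be updated.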
If you change types and names, you might run into backwards-incompatible schema changes, and these operations are generally not recommended. Instead, always add new columns, but "deprecate" and don't use the old ones. After you are done reading events from those old columns in all other systems, drop them.

Mimic `schema_only` snapshot mode with Debezium PostgreSQL connector

The MySQL connector has a snapshot mode schema_only:
the connector runs a snapshot of the schemas and not the data. This setting is useful when you do not need the topics to contain a consistent snapshot of the data but need them to have only the changes since the connector was started.
The PostgreSQL connector does not have the schema_only option. I am wondering if the following pattern would work to mimic that capability:
Use initial for snapshot mode.
Then, for every included table, add the following config:
"snapshot.select.statement.overrides": "schema.table1,schema.table2...",
"snapshot.select.statement.overrides.schema.table1": "SELECT * FROM [schema].[table1] LIMIT 0"
"snapshot.select.statement.overrides.schema.table2": "SELECT * FROM [schema].[table2] LIMIT 0"
...
According to the docs, snapshot.select.statement.overrides:
Specifies the table rows to include in a snapshot. Use the property if you want a snapshot to include only a subset of the rows in a table. This property affects snapshots only. It does not apply to events that the connector reads from the log.
It seems as though the above procedure would include the relevant schemas for Debezium, while preventing any read events from being emitted.
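For concreteness, the full connector config I have in mind would look roughly like this (a sketch with Debezium 1.x property names; connection details, slot name, and table names are placeholders):

{
  "name": "pg-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.dbname": "mydb",
    "database.server.name": "pgserver1",
    "slot.name": "debezium_slot",
    "table.include.list": "schema.table1,schema.table2",
    "snapshot.mode": "initial",
    "snapshot.select.statement.overrides": "schema.table1,schema.table2",
    "snapshot.select.statement.overrides.schema.table1": "SELECT * FROM schema.table1 LIMIT 0",
    "snapshot.select.statement.overrides.schema.table2": "SELECT * FROM schema.table2 LIMIT 0"
  }
}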
Are there complications with this technique that I am not taking into account?

Need to Understand the CDC flow of Debezium Postgres Source Connector

Folks,
I am trying to understand the CDC process of the Debezium Postgres source connector because I suspect data loss. Here is my scenario: I have a connector named "deb-connector" that pulls data from about 10 tables via the replication slot "deb-1" and the "wal2json" plugin, and everything works fine. Now I wanted to add 2 more tables to the sync, so I created a new connector named "deb-connector-updt" with the 2 additional tables, pointing at the same replication slot "deb-1". It took some time initially (as it normally does) for the snapshot to finish and start running. Here are my questions:
As far as the snapshot process is concerned, my understanding is that it is going to take a new snapshot for the second connector (please correct me if I am wrong)?
If it does, the initial process would emit READ events for all records in the database, and all records (2.5K records) should have flowed to Kafka no matter what, right? But I don't see them; instead I see only the CDC records (23 records) flowing.
This can only happen if it doesn't take a new snapshot and basically starts from where it left off (which makes sense only if the snapshot is tied to the slot). Or should I create a new slot whenever I want to add or remove tables from the config (which doesn't make sense at all)?
I would highly appreciate it if the experts could answer this! Thank you!

Saving JDBC db data as shared state Spark

I have an MSSQL table as a data source and I would like to save some kind of processing offset in the form of a timestamp (it is one of the table's columns), so that it would be possible to process the data starting from the latest offset. I would like to save it as some kind of shared state between Spark sessions. I have researched shared state in Spark sessions; however, I did not find a way to store this offset in shared state. So, is it possible to use existing Spark constructs to perform this task?
As far as I know there is no official built-in feature for passing data between sessions in Spark. As an alternative I would consider the following options/suggestions:
First, the offset column must be an indexed field in MSSQL so that it can be queried quickly.
If there is already an in-memory store (e.g. Redis, Apache Ignite) installed and used by your project, I would store the offset there.
I wouldn't use a message queue system such as Kafka, because once you consume a message you would need to resend it, so that wouldn't make sense.
As a solution I would prefer to save it in the filesystem or in Hive, even though that adds extra overhead, since you will have only one value in that table. In the case of the filesystem, of course, the performance would be much better.
Let me know if further information is needed.

Sync postgreSql data with ElasticSearch

Ultimately I want to have a scalable search solution for the data in PostgreSQL. My findings point me towards using Logstash to ship write events from Postgres to Elasticsearch; however, I have not found a usable solution. The solutions I have found involve using the jdbc input to query all data from Postgres on an interval, and delete events are not captured.
I think this is a common use case so I hope you guys could share with me your experience, or give me some pointers to proceed.
If you also need to be notified of DELETEs and delete the respective records in Elasticsearch, it is true that the Logstash jdbc input will not help. You'd have to use a solution that works off the database's transaction log (binlog/WAL), as suggested here.
However, if you still want to use the Logstash jdbc input, what you could do is simply soft-delete records in PostgreSQL, i.e. create a new BOOLEAN column in order to mark your records as deleted. The same flag would then exist in Elasticsearch, and you can exclude those documents from your searches with a simple term query on the deleted field.
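For instance, the exclusion could be a bool query with a must_not clause on that flag (a sketch; "deleted" is whatever name you give the soft-delete column):

{
  "query": {
    "bool": {
      "must_not": [
        { "term": { "deleted": true } }
      ]
    }
  }
}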
Whenever you need to perform some cleanup, you can delete all records flagged deleted in both PostgreSQL and Elasticsearch.
You can also take a look at PGSync.
It's similar to Debezium but a lot easier to get up and running.
PGSync is a change data capture tool for moving data from Postgres to Elasticsearch. It allows you to keep Postgres as your source of truth and expose structured, denormalized documents in Elasticsearch.
You simply define a JSON schema describing the structure of the data in Elasticsearch.
Here is an example schema (you can also have nested objects):
{
  "nodes": {
    "table": "book",
    "columns": [
      "isbn",
      "title",
      "description"
    ]
  }
}
PGSync generates the queries for your documents on the fly, so there is no need to write queries as with Logstash. It also supports and tracks deletion operations.
It uses both a polling and an event-driven model to capture changes made to date and to be notified of changes as they occur: the initial sync polls the database for changes since the last time the daemon was run, and thereafter event notifications (based on triggers and handled by pg_notify) pick up changes to the database.
It has very little development overhead:
Create a schema as described above
Point pgsync at your Postgres database and Elasticsearch cluster
Start the daemon
You can easily create a document that includes multiple relations as nested objects. PGSync tracks any changes for you.
Have a look at the GitHub repo for more details.
You can install the package from PyPI.
Please take a look at Debezium. It's a change data capture (CDC) platform which allows you to stream your data.
I created a simple GitHub repository which shows how it works.