Exclude PostgreSQL batch delete logs in Confluent Debezium connector - PostgreSQL

We have a requirement for the Debezium connector to exclude PostgreSQL delete events generated as part of a batch delete query on a table. This batch runs every month and generates a lot of delete events that are not needed downstream.
We have tried the following:
Adding filter conditions to exclude delete events on the table, but this excludes all other delete events on the table as well, not just the batch deletes -> not a suitable option in prod.
Using txId in a Debezium filter to skip the particular transaction assigned to the batch delete, but this requires a Debezium connector config change every time the txId changes (see the sketch below) -> not a suitable option in prod.
Debezium version - 1.2.1
Source - PostgreSQL 10.18 database
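
For illustration, the second attempt assumes you know the transaction id of the batch delete. One way to capture it at run time is to record txid_current() inside the same transaction; a minimal sketch, where batch_delete_runs, my_table, and the retention predicate are made-up names (note txid_current() is epoch-extended, so only its low 32 bits match the xid that logical decoding reports):

    -- Monthly batch delete, recording its own transaction id so downstream
    -- tooling can be told which transaction to skip.
    BEGIN;
    INSERT INTO batch_delete_runs (tx_id, started_at)
    VALUES (txid_current(), now());
    DELETE FROM my_table
    WHERE created_at < now() - interval '1 month';
    COMMIT;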

Related

WAL disk space grows very large during an ALTER COLUMN statement (CDC Debezium Connector)

We are running a Postgres CDC Debezium Connector, and we ran an ALTER statement on multiple tables, which caused the WAL disk space to grow huge. This made the connector unable to communicate with the DB, and we had to manually delete the replication slot to clear the lag and recover; reprocessing the lost messages was tedious. We already have heartbeat.interval.ms=6000 enabled in our connector config. I would like to know if adding "heartbeat.action.query" will help to keep the replication slot lag down in such cases?
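
Not an answer to whether it is enough in the ALTER scenario, but for reference this is the usual shape of a heartbeat.action.query setup: a tiny dedicated table in the captured database that the connector touches on every heartbeat tick, which gives the replication slot a change to confirm. A minimal sketch; the table name debezium_heartbeat and its single-row layout are illustrative:

    -- One-row heartbeat table in the database the connector captures.
    CREATE TABLE IF NOT EXISTS debezium_heartbeat (
        id integer PRIMARY KEY,
        ts timestamptz NOT NULL
    );
    INSERT INTO debezium_heartbeat (id, ts)
    VALUES (1, now())
    ON CONFLICT (id) DO NOTHING;

    -- Candidate value for heartbeat.action.query; it runs on each
    -- heartbeat.interval.ms tick and produces a change the connector can flush.
    UPDATE debezium_heartbeat SET ts = now() WHERE id = 1;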

Postgres publication record is showing up on the standby too

Is there any way we can avoid the publication record being replicated from the master to the slave? I am trying to upgrade the Postgres version on the standby using pg_wal_replay_pause(), but I can still see records coming in from the master.

o110.pyWriteDynamicFrame. null

I have created a visual job in AWS Glue where I extract data from Snowflake, and my target is a PostgreSQL database in AWS.
I have been able to connect to both Snowflake and Postgres, and I can preview data from both.
I have also been able to get data from Snowflake, write it to S3 as CSV, and then take that CSV and upload it to Postgres.
However, when I try to get data from Snowflake and push it directly to Postgres, I get the below error:
o110.pyWriteDynamicFrame. null
So it means that you can get the data from Snowflake into a DataFrame, but you are failing while writing the data from that DataFrame to Postgres.
You need to check the AWS Glue logs to get a better understanding of why the write to Postgres is failing.
Please check that you have the right version of the jars (needed by Postgres) compatible with the Scala version on the AWS Glue side.

CDC change data capture start time - Postgres replication

I'm using AWS DMS for a Postgres-to-Postgres migration. For ongoing replication with other engines there is a CDC start time parameter where we can specify the point from which changes are picked up for replication, but unfortunately Postgres does not support that parameter.
By default, my assumption is that when you create the CDC task it uses the current time as the CDC start point. But since Postgres does not have the ability to filter the logs by start time, I assume it starts from the beginning of the WAL. Is that right? My goal is to use only the CDC feature instead of a DMS FULL LOAD, but after the pg_dump is restored on the target, how would I make sure no records are missed by CDC?
Thank you!
When a DMS ongoing replication task starts, it creates a replication slot. A replication slot cannot be created with any open transactions. The LSN captured by the slot will be the first LSN read by DMS.
Postgres as a source now also supports a custom CDC start position: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Task.CDC.html
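
To make the start point concrete, here is a minimal SQL sketch of the Postgres side, assuming you create the slot yourself before taking the pg_dump and then hand the position to the DMS task as described in the linked page; the slot name dms_cdc_slot and the test_decoding plugin are illustrative:

    -- Create the logical slot before pg_dump starts; the returned lsn is the
    -- position from which decoding will begin.
    SELECT slot_name, lsn
    FROM pg_create_logical_replication_slot('dms_cdc_slot', 'test_decoding');

    -- Later, check how far the slot has been confirmed/consumed.
    SELECT slot_name, restart_lsn, confirmed_flush_lsn
    FROM pg_replication_slots
    WHERE slot_name = 'dms_cdc_slot';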

Copy data between two servers that are not connected

I have got two PostgreSQL servers on two different computers that are not connected.
Each server holds a database with the same schema.
I would like one of the servers to be the master server: this server should store all data that is inserted on both databases.
For that, I would like to regularly import (on a daily basis, for example) data from the first database into the second database.
It implies that I should be able to:
"dump" into file(s) all data that has been stored in the first database since a given date,
import the exported data into the second database.
I haven't seen any time/date option in pg_dump/pg_restore commands.
So how could I do that?
NB: data are inserted in the database and never updated.
I haven't seen any time/date option in pg_dump/pg_restore commands.
There isn't any, and you can't do it that way. You'd have to dump and restore the whole database.
Alternatives are:
Use WAL-based replication. Have the master write WAL to an archive location using an archive_command. When you want to sync, copy all the new WAL from the master to the replica, which must be created as a pg_basebackup of the master and must have a suitable recovery.conf. The replica will replay the master's WAL to pick up all of the master's recent changes.
Use a custom trigger-based system to record all changes to log tables. COPY those log tables to external files, then copy the files to the replica. Use a custom script to apply the log table change records to the main tables. (A minimal trigger sketch appears at the end of this answer.)
Add a timestamp column to all your tables. Keep a record of when you last synced changes. Do a \COPY (SELECT * FROM sometable WHERE insert_timestamp > 'last_sync_timestamp') TO 'somefile' for each table, probably scripted. Copy the files to the secondary server. There, automate the process of doing a \copy sometable FROM 'somefile' to load the changes from the export files, as in the sketch just below.
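
A psql sketch of that last option, keeping the example names from above; the file name and the last-sync timestamp are made up:

    -- On the source server: export rows inserted since the last sync.
    \copy (SELECT * FROM sometable WHERE insert_timestamp > '2024-01-01 00:00:00') TO 'sometable_delta.csv' WITH (FORMAT csv)

    -- On the secondary server: load the exported rows.
    \copy sometable FROM 'sometable_delta.csv' WITH (FORMAT csv)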
In your situation I'd probably do the WAL-based replication. It does mean that the secondary database must be absolutely read-only though.
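
Going back to the trigger-based alternative, a minimal sketch for a single table named sometable; the log table, function, and trigger names are illustrative, and only INSERT is handled since the data is never updated:

    -- Log table with the same columns as the main table.
    CREATE TABLE sometable_log (LIKE sometable);

    -- Trigger function that copies every new row into the log table.
    CREATE OR REPLACE FUNCTION log_sometable_insert() RETURNS trigger AS $$
    BEGIN
        INSERT INTO sometable_log SELECT NEW.*;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER sometable_log_trg
        AFTER INSERT ON sometable
        FOR EACH ROW EXECUTE PROCEDURE log_sometable_insert();

    -- Periodically: export the log, ship the file, apply it on the other
    -- server, then clear the log on the source.
    \copy sometable_log TO 'sometable_log.csv' WITH (FORMAT csv)
    -- (on the source, after the file has been applied remotely)
    TRUNCATE sometable_log;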