PostgreSQL logical replication - ignore pre-existing data - postgresql

Imagine dropping a subscription and recreating it from scratch. Is it possible to ignore existing data during the first synchronization?
Creating a subscription with (copy_data=false) is not an option because I do want to copy data, I just don't want to copy already existing data.
Example: There is a users table and a corresponding publication on the master. This table has 1 million rows and every minute a new row is added. Then we drop the subscription for a day.
If we recreate the subscription with (copy_data=true), replication will not start due to a conflict with already existing data. If we specify (copy_data=false), 1440 new rows will be missing. How can we synchronize the publisher and the subscriber properly?

You cannot do that, because PostgreSQL has no way of telling when the data were added.
You'd have to reconcile the tables by hand (or INSERT ... ON CONFLICT DO NOTHING).

Unfortunately PostgreSQL does not support nice skip options for conflicts yet, but I believe it will be enhanced in the feature.
Based on #Laurenz Albe answer which recommends the use of the statement:
INSERT ... ON CONFLICT DO NOTHING.
I believe that it would be better to use the following command which also will take care any possible updates on your data before you start the subscription again:
INSERT ... ON CONFLICT UPDATE SET...
Finally I have to say that both are dirty solutions as during the execution of the above statement and the creation of the subscription, new lines may have been arrived which will result in losing them until you perform again the custom sync.
I have seen some other suggested solutions using the LSN number from the Postgresql log file...
For me maybe is elegant and safe to delete all the data from the destination table and create the replication again!

Related

How to actually delete files in Iceberg

I know that in Apache Iceberg I can set limits on number and age of snapshots, and that "deleting" data from the table does not result in underlying data removal, it simply masks or deletes tracking information.
I would like to actually delete the underlying files on delete, however. I know this will make time-travel inconsistent, but it is still a business requirement.
https://iceberg.apache.org/docs/latest/configuration/
As best as I can tell, I'll have to track and manage the physical life-cycle every file independently. Am I missing something?
If you don't care about table history (or time travel) you can simply call the expire_snapshots procedure after each delete.
What you get is a common question for many iceberg users.
We often need an asynchronous task to delete and expire snapshots\data.
If you use spark, you can use https://iceberg.apache.org/docs/latest/spark-procedures/#expire_snapshots, as shay saied.
you can also do this using the java api provided by iceberg https://iceberg.apache.org/docs/latest/api/.
Starting a task for each table is difficult to manage. Tables often have different TTL. In this case, You can add custom configurations to a table. Manually scan all iceberg tables, then determines whether to delete expired snapshots and data based on these configurations.
If you are using Iceberg with Hive (4.0.0-alpha2 + version), you can try expire_snapshot command on beeline.
Like
ALTER TABLE test_table EXECUTE expire_snapshots('2021-12-09 05:39:18.689000000');
Can read:
https://docs.cloudera.com/cdw-runtime/cloud/iceberg-how-to/topics/iceberg-expiring-snapshots.html
Hive Jira adding support:
https://issues.apache.org/jira/browse/HIVE-26354

Found ddl change for a table in CDC management console

our target db is DB2 and source is ORACLE, we found ddl changes in CDC management console and i need to fix the instance in to proper running condition.
Paul Vernon answer assumes that what you are looking for is how to replicate DDL changes. I will assume that you don't want to replicate DDL changes, but just restart the subscription after minor layout changes (for example, after a column size has been increased or a column you are not going to replicate, changes).
If that is the case, right-click the specific table map on your subscription, and update table definition. I am not sure but I think after that, you have to refresh the entire subscription. If the table is very large, you will want to avoid refreshing them all, but that's another question.
Off course, if in the table change, a column has been added and you want to deal with it, you can edit column map and make the specific assignment you want to that column.
I hope this helps.

DB2 updated rows since last check

I want to periodically export data from db2 and load it in another database for analysis.
In order to do this, I would need to know which rows have been inserted/updated since the last time I've exported things from a given table.
A simple solution would probably be to add a timestamp to every table and use that as a reference, but I don't have such a TS at the moment, and I would like to avoid adding it if possible.
Is there any other solution for finding the rows which have been added/updated after a given time (or something else that would solve my issue)?
There is an easy option for a timestamp in Db2 (for LUW) called
ROW CHANGE TIMESTAMP
This is managed by Db2 and could be defined as HIDDEN so existing SELECT * FROM queries will not retrieve the new row which would cause extra costs.
Check out the Db2 CREATE TABLE documentation
This functionality was originally added for optimistic locking but can be used for such situations as well.
There is a similar concept for Db2 z/OS - you have to check that out as I have not tried this one.
Of cause there are other ways to solve it like Replication etc.
That is not possible if you do not have a timestamp column. With a timestamp, you can know which are new or modified rows.
You can also use the TimeTravel feature, in order to get the new values, but that implies a timestamp column.
Another option, is to put the tables in append mode, and then get the rows after a given one. However, this option is not sure after a reorg, and affects the performance and space utilisation.
One possible option is to use SQL replication, but that needs extra tables for staging.
Finally, another option is to read the logs, with the db2ReadLog API, but that implies a development. Also, just appliying the archived logs into the new database is possible, however the database will remain in roll forward pending.

Postgresql replication without DELETE statement

We have a requirement that says we should have a copy of all the items that were in our system at one point. The most simple way to explain it would be replication but ignoring the delete statement (INSERT and UPDATE are ok)
Is this possible ? or maybe the better question would be what is the best approach to tackle this kind of problem?
Make a copy/replica of current database and use triggers via dblink from current database to the replica. Use after insert and after update trigger to insert and update data in replica.
So whenever a row insertion/updation take place in current database it will directly reflect to replica.
I'm not sure that I understand the question completely, but I'll try to help:
First (opposite to #Sunit) - I suggest avoiding triggers. Triggers are introducing additional overhead and impacting performance.
The solution I would use (and I'm actually using in few of my projects with similar demands) - don't use DELETE at all. Instead you can add bit (boolean) column called "Deleted", set its default value to 0 (false), and instead of deleting the row you update this field to 1 (true). You'll also need to change your other queries (SELECT) to include something like "WHERE Deleted = 0".
Another option is to continue using DELETE as usual, and to allow deleting records from both primary and replica, but configure WAL archiving, and store WAL archives in some shared directory. This will allow you to moment-in-time recovery, meaning that you'll be able to restore another PostgreSQL instance to state of your cluster in any moment in time (i.e. before the deletion). This way you'll have a trace of deleted records, but pretty complicated procedure to reach the records. Depending on how often deleted records will be checked in the future (maybe they are not checked at all, but simply kept for just-in-case tracking) this approach my also help.

SQL Transactions - allow read original data before commit (snapshot?)

I am facing an issue, possibly quite easy to solve, I am just new to advanced transaction settings.
Every 30 minutes I am running an INSERT query that is getting latest data from a linked server to my client's server, to a table we can call ImportTable. For this I have a simple job that looks like this:
BEGIN TRAN
DELETE FROM ImportTable
INSERT INTO ImportTable (columns)
SELECT (columns)
FROM QueryGettingResultsFromLinkedServer
COMMIT
The thing is, each time the job runs the ImportTable is locked for the query run time (2-5 minutes) and nobody can read the records. I wish the table to be read-accessible all the time, with as little downtime as possible.
Now, I read that it is possible to allow SNAPSHOT ISOLATION in the database settings that could probably solve my problem (set to FALSE at the moment), but I have never played with different transaction isolation types and as this is not my DB but my client's, I'd rather not alter any database settings if I am not sure if it can break something.
I know I could have an intermediary table that the records are inserted to and then inserted to the final table and that is certainly a possible solution, I was just hoping for something more sophisticated and learning something new in the process.
PS: My client's server & database is fairly new and barely used, so I expect very little impact if I change some settings, but still, I cannot just randomly change various settings for learning purposes.
Many thanks!
Inserts wont normally block the table ,unless it is escalated to table level.In this case,you are deleting table first and inserting data again,why not insert only updated data?.for the query you are using transaction level (rsci)snapshot isolation will help you,but you will have an added impact of row version which means sql will store row versions of rows that changed in tempdb.
please see MCM isolation videos of Kimberely tripp for indepth understanding ,also dont forget to test in stage enviornment.
You are making this harder than it needs to be
The problem is the 2-5 minutes that you let be part of a transaction
It is only a few thousand rows - that part takes like a few milliseconds
If you need ImportTable to be available during those few milliseconds then put it in a SnapShot
Delete ImportTableStaging;
INSERT INTO ImportTableStaging(columns)
SELECT (columns)
FROM QueryGettingResultsFromLinkedServer;
BEGIN TRAN
DELETE FROM ImportTable
INSERT INTO ImportTable (columns) with (tablock)
SELECT (columns)
FROM ImportTableStaging
COMMIT
If you are worried about concurrent update to ImportTableStaging then use a #temp