I need to create a slave for BI purposes, and I need to modify some tables (e.g., remove all passwords or sensitive data). My database is PostgreSQL. I wonder if I can do this in the database layer, or whether I should do it programmatically by writing code to do the replication.
You could use logical replication and have replica-enabled triggers (that fire only during replication) that modify the data when they are applied:
ALTER TABLE mytab DISABLE TRIGGER mytrig;
ALTER TABLE mytab ENABLE REPLICA TRIGGER mytrig;
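For example, a minimal sketch of such a replica-enabled trigger that blanks out a sensitive column on the BI server (the users table, the password column, and all names are made up for illustration):

CREATE FUNCTION strip_password() RETURNS trigger
   LANGUAGE plpgsql AS
$$BEGIN
   NEW.password := NULL;   -- redact the sensitive column before the row is stored
   RETURN NEW;
END;$$;

CREATE TRIGGER strip_password
   BEFORE INSERT OR UPDATE ON users
   FOR EACH ROW EXECUTE FUNCTION strip_password();

ALTER TABLE users ENABLE REPLICA TRIGGER strip_password;

(On PostgreSQL versions older than 11, use EXECUTE PROCEDURE instead of EXECUTE FUNCTION.)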
You have to make sure that no replication conflicts can arise from these modifications. For example, never modify a key column.
Replication conflicts would stop replication and break your system.
The traditional way to solve this problem is to use an ETL process. That way you can have a different data model on the target database and for example pre-aggregate data so that the data warehouse doesn't grow too big and has a data model optimized for analytical queries.
I'm migrating from a proprietary dbms to PG. In the proprietary dbms, "offlining" and "onlining" data partitions is a very lightweight operation. I'm looking to implement similar functionality with PG by backup and restore of individual table (partitions). Obviously I need to avoid a performance regression. So my question is what the fastest way is of:
Backing up a table (partition), both data and indexes
Taking the table offline (meaning that the data is now gone from the database)
Restoring the table (partition), both data and indexes
Once I have some advice I can design more targeted performance comparisons. Thanks in advance for any pointers.
What is fast and needs to be fast is adding or removing a partition (ALTER TABLE ... ATTACH/DETACH PARTITION).
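A minimal sketch, assuming a table sales that is range-partitioned by month with a partition sales_2023_01 (both names are made up):

-- take the partition offline: it becomes an ordinary standalone table
ALTER TABLE sales DETACH PARTITION sales_2023_01;

-- bring it back online later
ALTER TABLE sales ATTACH PARTITION sales_2023_01
   FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');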
After you have detached the partition, you are in no great hurry to back up or export the data. This can be done comfortably with pg_dump.
Similarly, importing the data for a table that is to become a new partition is normally not time critical.
If you need this to happen faster (for example, you want the old partition to be visible in another database as soon as it is detached in the old one), you could use logical replication to replicate the partition to another PostgreSQL database before you detach it. As soon as replication has caught up, you detach or drop the original partition and attach the copy in the other database.
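A rough sketch of that workflow, reusing the hypothetical sales_2023_01 partition from above (publication, subscription, and connection details are placeholders):

-- on the source database: publish just that partition
CREATE PUBLICATION sales_2023_01_pub FOR TABLE sales_2023_01;

-- on the target database: a table with the same structure must already exist
CREATE SUBSCRIPTION sales_2023_01_sub
   CONNECTION 'host=source-host dbname=sourcedb user=repl'
   PUBLICATION sales_2023_01_pub;

-- once the subscription has caught up, drop it on the target ...
DROP SUBSCRIPTION sales_2023_01_sub;

-- ... detach or drop the original partition on the source ...
ALTER TABLE sales DETACH PARTITION sales_2023_01;

-- ... and attach the copy to the partitioned table on the target
ALTER TABLE sales ATTACH PARTITION sales_2023_01
   FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');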
Currently I am setting up master/master replication with Bucardo between 5 nodes at different locations (this should provide location transparency). The database holds ~500 tables which should be replicated. I grouped them into smaller replication herds of at most 50 tables, based on their dependencies on each other. All tables have primary keys defined, and the sequences on each node are set up to provide system-wide unique identities (based on residue classes).
To get an initial database on each node, I made a --data-only custom-format pg_dump into a file and restored it on each node via pg_restore. The Bucardo sync is set up with the bucardo_latest strategy to resolve conflicts. Now when I start syncing, Bucardo first deletes all rows in the origin database and inserts them again from one of the restored nodes, because all restored rows have a "later timestamp" (the point in time when I called pg_restore). This ultimately prevents the initial startup, as Bucardo takes a very long time and also fails, since there are lots of rows to resolve and the timeouts are often too short.
I also have 'last_modified' timestamps on each table which are managed by UPDATE triggers, but as I understand it, pg_dump inserts data via COPY, and therefore these triggers don't get fired.
Which timestamp does bucardo use to find out who is bucardo_latest?
Do I have to call pg_dump with something like set SESSION_REPLICATION_ROLE = 'replica';?
I just want bucardo to keep track of every new change, not execute pseudo-changes caused by the restore.
EDIT: pg_restore has definitely fired several triggers at restore time...as said, I keep track of the user and last modification date in each table, and those values are set to the user and timestamp of when the restore was done. I am aware that I can set SESSION_REPLICATION_ROLE for a plain-text restore via psql. Is this also possible for pg_restore somehow?
The common approach is to do the dump/restore before configuring the replication.
So an option will be:
drop the bucardo schema in each database
do a bucardo remove for each object (most of them allow all, like bucardo remove table all)
dump/restore your data
Configure the replication again. Just make sure that when adding the sync, you set the option onetimecopy=0. It's the default, but I feel safer making it explicit.
Which timestamp does bucardo use to find out who is bucardo_latest?
bucardo handles its own timestamp value. Each table should have a trigger named like bucardo.delta_myschema_mytable that makes an insert into a table named like bucardo.delta_myschema_mytable. This table has a column txntime timestamp with time zone not null default now(), and this is the timestamp used.
Do I have to call pg_dump with something like set SESSION_REPLICATION_ROLE = 'replica';?
AFAIK, if bucardo triggers are already set in the tables, the option --disable-triggers of pg_restore should do the trick.
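If you restore a plain-format dump through psql instead, the rough equivalent is to turn off ordinary triggers for the loading session. A sketch (the dump file name is a placeholder):

SET session_replication_role = 'replica';   -- ordinary triggers, like the Bucardo delta triggers, will not fire
-- \i dump.sql   (load the plain-format dump here)
SET session_replication_role = 'origin';    -- back to normal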
You can also check these articles about working with large databases and the use of session_replication_role
I want to know how an AWS Postgres RDS instance handles replication when I rename schemas to "swap" them within the read/write instance of the database.
Does it replicate this action to the read-replicas by sending on the "alter schema" rename commands I gave to my read/write instance? Or after my renames, does it see wholly different sets of data in the schemas and do a whole new copy of each out to the read-replicas?
For example...
In my RDS instance I have a read/write instance of "my_mega_database" which I want to create read-replicas of for my applications to connect to.
Typically, in "my_mega_database" there are two schemas "my_data" and "my_data_old", whereby "my_data" contains data that was delivered last night, and "my_data_old" contains data from the previous night. Each contains many tables and huge amounts of data.
If I were to do the following...
ALTER SCHEMA my_data_old RENAME TO my_data_tmp;
ALTER SCHEMA my_data RENAME TO my_data_old;
ALTER SCHEMA my_data_tmp RENAME TO my_data;
... I have effectively swapped these around.
My expectation is that these actions are replicated via the postgres WAL (ie: it sends the rename commands out to the replicas) and AWS RDS replication won't try and waste time copying huge amounts of data all over the place.
Is this correct?
(Speaking about PostgreSQL here, but RDS is probably similar.)
Renaming a schema (or any other object) is a small update in a catalog table, and no data are moved. Internally PostgreSQL uses only the numeric object ID, which stays the same.
You might wrap the three statements in a transaction to make the whole magic atomic.
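Using the statements from the question, that could look like this:

BEGIN;
ALTER SCHEMA my_data_old RENAME TO my_data_tmp;
ALTER SCHEMA my_data RENAME TO my_data_old;
ALTER SCHEMA my_data_tmp RENAME TO my_data;
COMMIT;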
The same is true on the standby: it is a trivial (meta)data modification.
The only thing that might be a problem is concurrent sessions holding locks.
I need some advice about the following scenario.
I have multiple embedded systems running PostgreSQL databases at different places, and we have a server running CentOS at our premises.
Each system runs at a remote location and has multiple tables inside its database. These tables have the same names as the server's tables, but each system uses table names different from the other systems, e.g.:
system 1 has tables:
sys1_table1
sys1_table2
system 2 has tables:
sys2_table1
sys2_table2
I want to update the tables sys1_table1, sys1_table2, sys2_table1 and sys2_table2 on the server on every insert done on system 1 and system 2.
One solution is to write a trigger on each table, which will run on every insert into both systems' tables and insert the same data into the server's tables. This trigger will also delete the records in the systems after inserting the data into the server. The problem with this solution is that if the connection to the server cannot be established due to a network issue, then the trigger will not execute or the insert will be lost. I have checked the following solution for this:
Trigger to insert rows in remote database after deletion
The second solution is to replicate the tables from system 1 and system 2 to the server's tables. The problem with replication is that if we delete data from the systems, it'll also delete the records on the server. I could add an alternative trigger on the server's tables that copies changes into a duplicate table, so the replicated table can become empty without affecting the data, but that would result in a very long list of tables if we have more than 200 systems.
The third solution is to create foreign tables using postgres_fdw or dblink and update the data inside the server's tables, but won't this affect the data inside the server when we delete the data inside the systems' tables? And what will happen if there is no connectivity to the server?
The fourth solution is to write an application in Python inside each system which will make a connection to the server's database and write the data in real time; if there is no connectivity to the server, it will store the data inside sys1_table1 or sys2_table2 or whatever table the data belongs to, and after reconnecting, the code will send the table data to the server's tables.
Which option is best for this scenario? I like the trigger solution best, but is there any way to avoid data loss in case of disconnection from the server?
I'd go with the fourth solution, or perhaps with the third, as long as it is triggered from outside the database. That way you can easily survive connection loss.
The first solution with triggers has the problems you already detected. It is also a bad idea to start potentially long operations, like data replication across a network of uncertain quality, inside a database transaction. Long transactions mean long locks and inefficient autovacuum.
The second solution may actually also be an option if you have a recent PostgreSQL version that supports logical replication. You can use a publication WITH (publish = 'insert,update'), so that DELETE and TRUNCATE are not replicated. Replication can deal well with lost connectivity (for a while), but it is not an option if you want the data at the source to be deleted after they have been replicated.
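For reference, a minimal sketch of what that could look like for system 1 (the publication/subscription names and the connection string are made up):

-- on system 1: publish inserts and updates only, so local deletes stay local
CREATE PUBLICATION sys1_pub FOR TABLE sys1_table1, sys1_table2
   WITH (publish = 'insert,update');

-- on the central server: subscribe to system 1
CREATE SUBSCRIPTION sys1_sub
   CONNECTION 'host=system1.example.com dbname=sysdb user=repl'
   PUBLICATION sys1_pub;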
I am trying to build replication from MySQL to Redshift; for this, I am parsing the MySQL binlog. For the initial replication, I take a dump of the MySQL table, convert it into a CSV file, upload it to S3, and then use the Redshift COPY command. For this, the performance is efficient.
After the initial replication, for the continuous sync, when I read the binlog, the inserts and updates have to be run sequentially, which is very slow.
Is there anything that can be done for increasing the performance?
One possible solution that I can think of is to wrap the statements in a transaction and then send the transaction at once, to avoid multiple network calls. But that would not address the problem that single UPDATE and INSERT statements in Redshift run very slowly. A single UPDATE statement is taking 6 s. Knowing the limitations of Redshift (that it is a columnar database and single-row insertion is slow), what can be done to work around those limitations?
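For illustration, batching could mean sending the replayed changes as one multi-statement transaction rather than one round trip per statement (the orders table and its columns are made up):

BEGIN;
UPDATE orders SET status = 'shipped' WHERE order_id = 42;
INSERT INTO orders (order_id, status) VALUES (43, 'new');
-- ... more statements replayed from the binlog ...
COMMIT;

This saves network round trips, but as noted above, each UPDATE still pays Redshift's per-statement cost.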
Edit 1:
Regarding DMS: I want to use Redshift as a warehousing solution that just replicates our MySQL continuously; I don't want to denormalize the data since I have 170+ tables in MySQL. During ongoing replication, DMS shows many errors multiple times a day and fails completely after a day or two, and it's very hard to decipher the DMS error logs. Also, when I drop and reload tables, it deletes the existing tables on Redshift, creates new tables, and then starts inserting data, which causes downtime in my case. What I wanted was to create a new table, then switch the old one with the new one, and delete the old table.
Here is what you need to do to get DMS to work:
1) Create and run a DMS task with "migrate and ongoing replication" and "Drop tables on target".
2) This will probably fail; do not worry. "Stop" the DMS task.
3) On Redshift, make the following changes to the tables:
Change all dates and timestamps to varchar (because the options used by DMS for the Redshift COPY cannot cope with the '0000-00-00 00:00:00' dates that you get in MySQL).
Change all bool columns to varchar, due to a bug in DMS.
4) On DMS, modify the task to "Truncate" in "Target table preparation mode".
5) Restart the DMS task with a full reload.
Now the initial copy and ongoing binlog replication should work.
Make sure you are on the latest replication instance software version.
Make sure you have followed the instructions here exactly:
http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MySQL.html
If your source is Aurora, also make sure you have set binlog_checksum to "none" (this is poorly documented).