I am trying to get a stream of updates for certain tables from my PostgreSQL database. The regular way of getting all updates looks like this:
You create a logical replication slot
SELECT pg_create_logical_replication_slot('my_slot', 'wal2json');
Then you either connect to it using pg_recvlogical or consume it with special SQL queries. This gives you all the actions from the database as JSON (if you used the wal2json plugin or similar), and you can then do whatever you want with that data.
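For example, once the slot exists you can pull changes with plain SQL. A minimal sketch (pg_logical_slot_get_changes consumes the changes, while the peek variant does not):
-- Read and consume all pending changes from the slot as JSON
SELECT * FROM pg_logical_slot_get_changes('my_slot', NULL, NULL);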
But PostgreSQL 10 has the publication/subscription mechanism, which allows us to replicate selected tables only. This is very handy because a lot of useless data is never sent. The process looks like this:
First, you create a publication
CREATE PUBLICATION foo FOR TABLE herp, derp;
Then you subscribe to that publication from another database
CREATE SUBSCRIPTION mysub CONNECTION <connection stuff> PUBLICATION foo;
Under the hood, this creates a replication slot on the master database and starts listening to updates and committing them to the same tables on the second database. This is fine if your job is to replicate some tables, but I want to get a raw stream for my own purposes.
As I mentioned, the CREATE SUBSCRIPTION query creates a replication slot on the master database under the hood, but how can I create one manually, without the subscription and a second database? Here the docs say:
To make this work, create the replication slot separately (using the function pg_create_logical_replication_slot with the plugin name pgoutput)
According to the docs this is possible, but pg_create_logical_replication_slot only creates a regular replication slot. Is the pgoutput plugin responsible for all the magic? If so, it becomes impossible to use other plugins like wal2json with publications.
What am I missing here?
I have limited experience with logical replication and logical decoding in Postgres, so please correct me if anything below is wrong. That said, here is what I have found:
Publication support is provided by the pgoutput plugin, and you use it via plugin-specific options. Other plugins may be able to add such support, but I do not know whether the logical decoding plugin interface exposes sufficient details. I tested the wal2json plugin at version 9e962ba and it does not recognize the publication_names option.
Replication slots are created independently from publications. The publications to be used as a filter are specified when fetching the change stream. It is possible to peek at changes for one publication, then peek at changes for another publication, and observe a different set of changes despite using the same replication slot (I did not find this documented, and I was testing on Aurora with PostgreSQL compatibility, so the behavior could potentially vary).
The plugin output seems to include entries for every begin and commit, even if the transaction did not touch any of the tables included in the publication of interest. It does not, however, include changes to tables other than those included in the publication.
Here is an example of how to use it in Postgres 10+:
-- Create publication
CREATE PUBLICATION cdc;
-- Create slot
SELECT pg_create_logical_replication_slot('test_slot_v1', 'pgoutput');
-- Create example table
CREATE TABLE replication_test_v1
(
id integer NOT NULL PRIMARY KEY,
name text
);
-- Add table to publication
ALTER PUBLICATION cdc ADD TABLE replication_test_v1;
-- Insert example data
INSERT INTO replication_test_v1(id, name) VALUES
(1, 'Number 1')
;
-- Peek changes (does not consume changes)
SELECT pg_logical_slot_peek_binary_changes('test_slot_v1', NULL, NULL, 'publication_names', 'cdc', 'proto_version', '1');
-- Get changes (consumes changes)
SELECT pg_logical_slot_get_binary_changes('test_slot_v1', NULL, NULL, 'publication_names', 'cdc', 'proto_version', '1');
To stream changes out of Postgres to other systems, consider using the Debezium project. It is an open-source distributed platform for change data capture which, among others, provides a PostgreSQL connector. In version 0.10 they added support for the pgoutput plugin. Even if your use case is very different from what the project offers, you can look at their code to see how they interact with the replication API.
After you have created the logical replication slot and the publication, you can create a subscription this way:
CREATE SUBSCRIPTION mysub
CONNECTION <conn stuff>
PUBLICATION foo
WITH (slot_name=my_slot, create_slot=false);
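Putting it together, a minimal end-to-end sketch (the connection string below is a hypothetical placeholder):
-- On the publisher:
SELECT pg_create_logical_replication_slot('my_slot', 'pgoutput');
CREATE PUBLICATION foo FOR TABLE herp, derp;
-- On the subscriber:
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=master.example.com dbname=mydb user=replicator'
    PUBLICATION foo
    WITH (slot_name = my_slot, create_slot = false);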
Not sure if this answers your question.
I need some advice about the following scenario.
I have multiple embedded systems running a PostgreSQL database at different places, and we have a server running CentOS at our premises.
Each system runs at a remote location and has multiple tables inside its database. These tables have the same names as the server's tables, but each system's table names differ from the other systems', e.g.:
system 1 has tables:
sys1_table1
sys1_table2
system 2 has tables:
sys2_table1
sys2_table2
I want to update the tables sys1_table1, sys1_table2, sys2_table1 and sys2_table2 on the server on every insert done on system 1 and system 2.
One solution is to write a trigger on each table which runs on every insert into both systems' tables and inserts the same data into the server's tables. This trigger would also delete the records from the systems after inserting the data into the server. The problem with this solution is that if the connection to the server cannot be established due to a network issue, the trigger will not execute, or the insert will be wasted. I have checked the following solution for this:
Trigger to insert rows in remote database after deletion
The second solution is to replicate the tables from system 1 and system 2 to the server's tables. The problem with replication is that if we delete data from the systems, it also deletes the records on the server. I could add an alternative trigger on the server's tables which updates a duplicate table, so the replicated table can be emptied without affecting the data, but that makes for a long list of tables if we have more than 200 systems.
The third solution is to write a foreign table using postgres_fdw or dblink and update the data inside the server's tables. But won't this affect the data inside the server when we delete the data inside the system's tables? And what will happen if there is no connectivity with the server?
The fourth solution is to write an application in Python inside each system which makes a connection to the server's database and writes the data in real time. If there is no connectivity to the server, it stores the data inside sys1.table1 or sys2.table2 or whatever table the data belongs to, and after reconnecting, the code sends the tables' data to the server's tables.
Which option is best for this scenario? I like the trigger solution best, but is there any way to avoid the data loss in case of disconnection from the server?
I'd go with the fourth solution, or perhaps with the third, as long as it is triggered from outside the database. That way you can easily survive connection loss.
The first solution with triggers has the problems you already detected. It is also a bad idea to start potentially long operations, like data replication across a network of uncertain quality, inside a database transaction. Long transactions mean long locks and inefficient autovacuum.
The second solution may actually also be an option if you have a recent PostgreSQL version that supports logical replication. You can use a publication WITH (publish = 'insert,update'), so that DELETE and TRUNCATE are not replicated. Replication can deal well with lost connectivity (for a while), but it is not an option if you want the data at the source to be deleted after it has been replicated.
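For example, a minimal sketch using the table names from the question (the publication and subscription names are made up, and the connection string is a placeholder):
-- On system 1, publish inserts and updates only:
CREATE PUBLICATION server_sync FOR TABLE sys1_table1, sys1_table2
    WITH (publish = 'insert,update');
-- On the server:
CREATE SUBSCRIPTION sys1_sub
    CONNECTION 'host=system1.example.com dbname=sysdb user=replicator'
    PUBLICATION server_sync;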
Imagine dropping a subscription and recreating it from scratch. Is it possible to ignore existing data during the first synchronization?
Creating a subscription with (copy_data=false) is not an option because I do want to copy data, I just don't want to copy already existing data.
Example: There is a users table and a corresponding publication on the master. This table has 1 million rows, and every minute a new row is added. Now suppose we drop the subscription for a day.
If we recreate the subscription with (copy_data=true), replication will not start due to a conflict with already existing data. If we specify (copy_data=false), 1440 new rows will be missing. How can we synchronize the publisher and the subscriber properly?
You cannot do that, because PostgreSQL has no way of telling when the data were added.
You'd have to reconcile the tables by hand (or INSERT ... ON CONFLICT DO NOTHING).
Unfortunately PostgreSQL does not support nice skip options for conflicts yet, but I believe it will be enhanced in the future.
Based on Laurenz Albe's answer, which recommends the use of the statement:
INSERT ... ON CONFLICT DO NOTHING.
I believe it would be better to use the following command, which will also take care of any possible updates to your data before you start the subscription again:
INSERT ... ON CONFLICT DO UPDATE SET ...
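For example, a minimal upsert sketch against a hypothetical users(id, name) table; note that the conflict target and the SET list must be spelled out:
INSERT INTO users (id, name)
VALUES (1, 'Number 1')
ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name;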
Finally, I have to say that both are dirty solutions: between the execution of the above statement and the creation of the subscription, new rows may have arrived, and they will be missing until you perform the custom sync again.
I have seen some other suggested solutions using the LSN from the PostgreSQL log...
For me, it may be more elegant and safe to delete all the data from the destination table and create the subscription again!
I'd like to understand whether CDC-enabled IBM IMS segments and IBM DB2 table sources would be able to provide both the before and after change values (like the Oracle .OLD and .NEW values in a trigger) so that both could be used for further processing.
Note:
We are supposed to retrieve these values through Informatica PowerExchange, process them, and push them to targets.
As of now, we need to know whether we would be able to retrieve both before-snapshot and after-snapshot values from IBM DB2 and IBM IMS (.OLD and .NEW as in Oracle triggers; not an exactly similar example, but mentioned just to aid understanding).
Any help is much appreciated. Thanks.
I don't believe CDC captures before data in the change messages it compiles from the DBMS log data. Its main purpose is to issue the minimum number of commands needed to replicate the data from one database to another. You'll want to take a snapshot of your replica database prior to processing the change messages if you want to preserve the state of the data so that you can query it.
Alternatively for Db2, it's probably easier to work with the temporal tables feature added in Db2 10, as that allows you to define which changes should drive a snapshot. You can then access the temporal data using a temporal SQL query:
SELECT ... FROM ... <period specification>
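For example, a hedged sketch of such a system-period query against a hypothetical policies table:
-- Rows as they looked at a given point in time
SELECT * FROM policies
    FOR SYSTEM_TIME AS OF TIMESTAMP '2020-01-01 00:00:00';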
Example trigger with old and new referencing...
CREATE TRIGGER danny117
NO CASCADE BEFORE UPDATE ON mylib.myfile
REFERENCING NEW AS N OLD AS O
FOR EACH ROW
-- don't let the claim change and force upper case
-- just do something automatically on update blah...
BEGIN ATOMIC
    SET N.claim = UCASE(O.claim);
END
With regard to PowerExchange 9.1.0 & 9.6:
Before-snapshot data can't be processed via PowerExchange for the DB2 database. I recently worked on a migration project and thought that, like Oracle CDC which uses SCN numbers, there should be something for DB2 to start the logger from any desired point. But to my surprise, Informatica global support confirmed that before-snapshot data can't be captured by PowerExchange.
They talk about materializing and de-materializing targets, which was beyond my knowledge at the time; later I found out they meant the export and import of history data.
Even if you have a table with CDC enabled, you can't capture the before-snapshot data from PWX.
The capture data is read from the DB2 logs, which carry a marker for the operation (U/I/D); that's enough for PowerExchange to proceed.
I have a job in Talend Open Studio which is working fine: it connects a tMSSqlInput to a tMap and then a tMysqlOutput. Very straightforward. My problem is that I need this job to run on a daily basis, but only pick up records that were newly created or modified... any help is highly appreciated!
It seems that you are searching for a Change Data Capture Tool for Talend.
Unfortunately it is only available in the licensed product.
There are several ways to implement your need. I want to show the most popular ones.
CDC from Talend
As Corentin said correctly, you could choose to use CDC (Change Data Capture) from Talend if you use the subscription version.
CDC of MSSQL
Alternatively, you can check whether you can activate or use CDC in your MSSQL Server. This depends on your license. If it is possible, you can use the feature to identify new elements and process them.
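For reference, enabling MSSQL's built-in CDC looks roughly like this (schema and table names are placeholders):
-- Run once per database, then once per table to track
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name = N'mytable',
    @role_name = NULL;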
Triggers
You can also create triggers on your database (if you have access to it). For example, creating a trigger for the INSERT, UPDATE and DELETE cases would help you get the deltas; you could then store those records, or their IDs, separately, as sketched below.
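A rough T-SQL sketch of such a trigger (the audit table mytable_changes is hypothetical):
-- inserted holds new row images, deleted holds old ones;
-- UNION removes the duplicate IDs produced by updates
CREATE TRIGGER trg_mytable_cdc ON dbo.mytable
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    INSERT INTO dbo.mytable_changes (row_id, changed_at)
    SELECT id, GETDATE() FROM inserted
    UNION
    SELECT id, GETDATE() FROM deleted;
END;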
Software driven / API
If your database is connected to a piece of software and you have developers around, you could ask for a service which identifies records on insert/update/delete and exposes them to you. This could be done, e.g., via a REST interface.
Delta via ID
If the primary key is an ID set to auto-increment, you could also check your MySQL table for the biggest number and SELECT only those rows from the source which have a bigger ID than you have already got. This of course depends on the database layout.
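A minimal sketch of that delta query (table names are hypothetical; :last_loaded_id is a bind-parameter placeholder for the MAX(id) fetched from the MySQL target beforehand):
-- Fetch only rows newer than the highest ID already transferred
SELECT * FROM source_table
WHERE id > :last_loaded_id
ORDER BY id;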
I'm doing research on logical decoding, and I've been able to create a slot and replicate all transactions in a database to another one using the streaming replication protocol; it works really well.
But I need to replicate just a single table, not all the tables in the database.
So, my question is: does logical decoding allow filtering the stream for a single table?
My current hunch is to create a custom logical decoding output plugin. Am I wrong?
Update
I've built an output plugin based on contrib/test_decoding from the PostgreSQL sources, and it was a good workaround. However, it wasn't useful for real use cases, so I decided to take some other projects as references to fork and update.
The best one for me was wal2json, so I decided to fork it and add the table filter as an option instead of hardcoding the table names.
Here is the fork and this is the changeset.
How to use
First create the slot with the wal2json plugin:
pg_recvlogical -d postgres --slot test_slot --create-slot -P wal2json
Then start receiving the stream
pg_recvlogical -d postgres --slot test_slot --start -o limit-to=table_foo,table_bar -f -
Now we are ready to receive the updates on table_foo and table_bar only.
This was a really good challenge. I'm not a C developer and I know the code needs some optimization, but for now it works better than expected.
The current version of wal2json has these options:
* `filter-tables` - tables to exclude
* `add-tables` - tables to include
Usage:
pg_recvlogical -d postgres --slot test_slot --start -o add-tables=myschema.mytable,myschema.mytable2 -f -
Reference: https://github.com/eulerto/wal2json#parameters
According to the documentation, you can implement your own replication solutions by using the streaming replication interface commands:
CREATE_REPLICATION_SLOT slot_name LOGICAL options
DROP_REPLICATION_SLOT slot_name
START_REPLICATION SLOT slot_name LOGICAL options
In addition to the interface above, you also need to implement a logical decoding output plugin. In that plugin interface you need to implement the change callback, which listens to all DML operations:
The required change_cb callback is called for every individual row
modification inside a transaction, may it be an INSERT, UPDATE, or
DELETE. Even if the original command modified several rows at once the
callback will be called individually for each row.
This is the function where you check whether a particular table should be replicated. Also be aware that the change callback will NOT handle UNLOGGED and TEMP tables, but I guess that is not a severe limitation.