Sync a Postgres table between Microservices (Projection, logical replication) - postgresql

In a microservice-oriented architecture we need to "sync" or "project" a table, or part of it, from one service to another in near real-time. Given the following scenario:
Service A
SchemaA.account
id | firstName | lastName | createdAt | deletedAt
1 | Hello | Name | 2022-07-05T15:05:39Z | Null
2 | Test | Name | 2022-07-05T16:05:39Z | Null
Service B
SchemaB.account should be synced from SchemaA.account
id | deletedAt
1 | Null
2 | Null
Terminology
"Projection" is a term primarily used in event-sourced systems, so here a bit misleading. Our source is another relational database (Postgres) as the target. I assume logical replication is the correct term here, but I may be wrong.
Requirements
As you can see, I only need to replicate part of the table SchemaA.account, meaning a subset of its columns.
Service B won't have to write to that table, just read. So, the sync can be one-way.
The solution should be as robust and fault-tolerant as possible. Think of Service B being unavailable for some time and still needing to receive the changes from Service A afterwards.
If it is a low-level database solution/tool it must be available in AWS RDS.
Quick! Near real-time, not scheduled syncs.
Possible solutions
I don't want to re-invent the wheel! I guess I would have the most flexibility with SNS + SQS (e.g. Service A publishes a message to SNS on data mutations, an SQS queue of Service B subscribes and applies the data itself). However, I think this creates a lot of overhead.
I am currently lacking the right search terms. The term logical replication seems promising at first glance, but I am not sure the replication tools will solve my case. I don't want to replicate a whole schema to a backup cluster, but to sync data between microservices. pglogical also seems very promising, and instructions on how to enable it in AWS RDS exist.
The question really is that simple: am I on the right track or is there something obvious that I am not considering?

The approach that sounds most promising is Change Data Capture (CDC), which in Postgres leverages the logical replication protocol to capture changes to the data; these changes can then be used to propagate data updates to other services.
For example, one very commonly used tool for implementing CDC in databases like Postgres is Debezium, which has a guide (no idea if it's current, but it should at least be a start) to using Debezium with RDS Postgres. Debezium will publish the changes to Kafka (be careful about the retention settings for topics: the Kafka defaults are often a ticking time bomb for this use case), and some other component will consume the changes from Kafka and, e.g., update the table.
One benefit of this approach is that you can introduce a new column into the source database's table without necessarily having to change anything in the destination database: the change consumer will see the new field, but it is under no obligation to propagate it.
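To make the consuming side concrete, here is a minimal sketch of how a change consumer in Service B could apply events idempotently to the projected table (table and column names are taken from the scenario above; the parameters would be bound from the change event payload, and the upsert is just one possible approach):
-- Apply an insert/update event; only the projected columns are taken from the payload.
INSERT INTO SchemaB.account (id, "deletedAt")
VALUES ($1, $2)
ON CONFLICT (id) DO UPDATE
    SET "deletedAt" = EXCLUDED."deletedAt";
-- Apply a delete event (or turn it into a soft delete, if preferred).
DELETE FROM SchemaB.account WHERE id = $1;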

Related

Can A Postgres Replication Publication And Subscription Exist On The Same Server

I have a request asking for a read-only schema replica for a role in PostgreSQL. After reading the documentation and getting a better understanding of replication in PostgreSQL, I'm trying to identify whether or not I can create the publisher and subscriber within the same database.
Any thoughts on the best approach without having a second server would be greatly appreciated.
You asked two different questions. Same database? No. Since pub/sub requires tables to have the same name (including schema) on both ends, you would be trying to replicate a table onto itself. Using logical replication plugins other than the built-in one might get around this restriction.
Same server? Yes. You can replicate between two databases of the same instance (but see the note in the docs about some extra hoops you need to jump through) or between two instances on the same host. So whichever of those things you meant by "same server", yes, you can.
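For the same-instance, different-database case, a rough sketch looks like this (all names are illustrative; the extra hoop from the docs is that the slot has to be created separately and the subscription created with create_slot = false, because otherwise CREATE SUBSCRIPTION hangs waiting on its own instance):
-- In the source database of the instance:
CREATE PUBLICATION my_pub FOR TABLE my_schema.my_table;
SELECT pg_create_logical_replication_slot('my_sub_slot', 'pgoutput');
-- In the target database on the same instance:
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=localhost dbname=source_db'
    PUBLICATION my_pub
    WITH (create_slot = false, slot_name = 'my_sub_slot');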
But it seems like an odd way to do this. If the access is read only, why does it matter whether it is to a replica of the real data or to the real data itself?

Audit data changes with Debezium

I have a use case where I want to audit the DB table data changes into another table for compliance purposes. Primarily, any changes to the data like Inserts/Updates/Deletes should be audited. I found different options like JaVers, Hibernate Envers, Database triggers, and Debezium.
I am avoiding JaVers and Hibernate Envers, as these will not capture data changes that happen through direct SQL queries or through other applications. The other issue I see is that we need to add the audit-related code to the main application code in the same transaction boundary.
I am also avoiding the usage of database triggers as we are not using triggers at all for any of the deployments.
That leaves me with Debezium, which is promising. But the only concern I have is that we need to use Kafka to leverage Debezium. Is Kafka necessary to use Debezium if both the primary table and the audit table sit in the same DB instance?
Debezium is perfect for auditing, but given it is a source connector, it represents just one part of the data pipeline in your use case. You will capture every table change event (c=create, r=read, u=update, d=delete), store it on a Kafka topic or local disk, and then you need a sink connector (e.g. Camel Kafka SQL or JDBC, kafka-connect-jdbc) to insert into the target table.
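To give a sense of what the sink side could write, here is a minimal sketch of an audit table (the shape is purely illustrative; the actual columns depend on how the sink connector is configured and which event fields you keep):
CREATE TABLE audit_log (
    id             bigserial   PRIMARY KEY,
    source_table   text        NOT NULL,              -- table the event came from
    op             char(1)     NOT NULL,              -- c / r / u / d, as emitted by Debezium
    payload_before jsonb,                             -- row state before the change (NULL for inserts)
    payload_after  jsonb,                             -- row state after the change (NULL for deletes)
    recorded_at    timestamptz NOT NULL DEFAULT now()
);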
For the same-transaction-boundary requirement you can use the Outbox Pattern, if eventual consistency is fine. There is also an Outbox Event Router SMT component that is part of the project.
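The outbox itself is just an ordinary table that the application writes to in the same transaction as the business data; a common shape looks roughly like this (column names follow the defaults the Outbox Event Router expects, but check the Debezium documentation for your version):
CREATE TABLE outbox (
    id            uuid PRIMARY KEY,
    aggregatetype varchar(255) NOT NULL,   -- used to route the event to a topic
    aggregateid   varchar(255) NOT NULL,   -- becomes the Kafka message key
    type          varchar(255) NOT NULL,   -- event type
    payload       jsonb                    -- event body
);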
Note that Debezium can also run embedded in a standalone Java application, storing the offsets on local disk, but you lose the HA capability given by Kafka Connect running in distributed mode. With the embedded mode, you are also switching from a configuration-driven approach to a code-driven one.
I found Debezium to be a very comprehensive solution, and it is open source, backed by Red Hat. That gives it not only credibility, but also the assurance that it will continue to be supported.
It provides a rich configuration to whitelist or blacklist databases/tables/columns (with wildcard patterns), along with controls to limit data in really large columns.
Since it is driven from the binlogs, you not only get the current state, you also get the previous state. This is ideal for audit trails, and you can customize it to build a proper sync to Elastic topics (one per table).
Use of Kafka is necessary to account for HA and for latency when bulk updates are made on the DB, even though the primary and audit tables are in the same DB.

How does pglogical-2 handle logical replication on same table while allowing it to be writeable on both databases?

Based on the above image, there are certain tables I want to be in the Internal Database (right hand side). The other tables I want to be replicated in the external database.
In reality there's only one set of values that SHOULD NOT be replicated across. The rest of the database can be replicated. Basically the actual price columns in the prices table cannot be replicated across. It should stay within the internal database.
Because the vendors are external to the network, they have no access to the internal app.
My plan is to create a replicated version of the same app and allow vendors to submit quotations and picking items.
Let's say the replicated tables are at least quotations and quotation_line_items. These tables should be writeable (in terms of data for INSERTs, UPDATEs, and DELETEs) at both the external database and the internal database. Hence at both databases, the data in the quotations and quotation_line_items table are writeable and should be replicated across in both directions.
The data in the other tables are going to be replicated in a single direction (from internal to external) except for the actual raw prices columns in the prices table.
The quotation_line_items table will have a price_id column. However, the raw price values in the prices table should not appear in the external database.
Ultimately, I want the data to be consistent for the replicated tables in both databases. I am okay with synchronous replication, so a bit of delay (say, a couple of seconds for the write operations) is fine.
I came across pglogical https://github.com/2ndQuadrant/pglogical/tree/REL2_x_STABLE
and they have the concept of PUBLISHER and SUBSCRIBER.
I cannot tell based on the readme which one would be acting as publisher and subscriber and how to configure it for my situation.
That won't work. With the setup you are dreaming of, you will necessarily end up with replication conflicts.
How do you plan to prevent data from being modified in conflicting ways in the two databases? If you say that won't happen, think again.
I believe that you would be much better off using a single database with two users: one that can access the “secret” table and one that cannot.
If you want to restrict access only to certain columns, use a view. Simple views are updateable in PostgreSQL.
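A minimal sketch of the view idea (column and role names are made up to match the description above): expose everything in prices except the raw price columns, and only grant the vendors access to the view:
CREATE VIEW prices_public AS
    SELECT id, item_id, currency      -- everything except the actual price columns
    FROM prices;
REVOKE ALL ON prices FROM vendor_role;
GRANT SELECT ON prices_public TO vendor_role;
-- Because this is a simple view, INSERT/UPDATE privileges could also be granted where needed.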
It is possible with BDR replication, which uses pglogical. On a basic level, by allocating ranges of key ids to each node, writes are possible in both locations without conflict. However, BDR is now a commercial, paid-for product.
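The key-id allocation idea can also be approximated with plain sequences, without BDR; a minimal sketch (the sequence name and interleaving scheme are just one possible choice):
-- On the internal node: hand out odd ids.
CREATE SEQUENCE quotations_id_seq START WITH 1 INCREMENT BY 2;
-- On the external node: hand out even ids, so inserts on the two nodes never collide on the key.
CREATE SEQUENCE quotations_id_seq START WITH 2 INCREMENT BY 2;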

Bidirectional Replication Design: best way to script and execute unmatched row on Source DB to multiple subscriber DBs, sequentially or concurrently?

Thank you for any help or suggestions offered.
I am trying to build my own multi-master replication on PostgreSQL 10 on Windows, for a situation where none of the current 3rd-party tools for PG multi-master replication can be used, and which can also involve another DB platform (Sybase ADS) in a subscriber group. I have the following logic to create bidirectional replication, partially inspired by Bucardo's logic, between 1 publisher and 2 subscribers:
When an INSERT, UPDATE, or DELETE is made on a source table, a trigger on that table adds a row to a meta table created on the source DB; this row acts as a replication transaction to be performed on the 2 subscriber DBs which subscribe to it.
A NOTIFY signal will be sent to a service, or a script written in Python or some other scripting language will monitor the meta table for changes or trigger executions, and will either do a table compare or script the statement to run on each subscriber database.
***I believe that triggers on the subscribers will need to be paused to keep them from pushing their received statements to their subscribers, i.e. if node A and node B both subscribe to each other's table A, then an update to node A's table A should replicate to node B's table A without then replicating back to table A in a bidirectional "ping-pong storm".
There will be a final compare between tables and the transaction will be closed. Re-enable triggers on subscribers if they were paused/disabled when pushing transactions from step 2 addendum.
This will hopefully be able to be done bidirectionally, in order of timestamp, in FIFO order, unless I can figure out a way to create child processes to run the synchronizations concurrently.
For this, I am trying to figure out the best way to set up the service logic---essentially Step 2 above, which has apparently been done using a daemon in Linux, but I have to work in Windows, making it run as, or resembling, a service/agent---or come up with a reasonably easy and efficient design to send the source DB's statements to the subscriber DBs.
Does anyone see that this plan is faulty or may not work?
Disclaimer: I don't know anything about Postgresql but have done plenty of custom replication.
The main problem with bidirectional replication is merge issues.
If the same key is used in both systems with different attributes, which one gets to push their change? If you nominate a master it's easier. Then the slave just gets overwritten every time.
How much latency can you handle? It's much easier to take the 'notify' part out and just have a five-minute Windows Task Scheduler job that inspects log tables and pushes data around.
In other words, this kind of pattern (a SQL sketch follows the list):
Change occurs in a table. A database trigger on that table notes the change and writes the PK of the changed row to a change log table. A ReplicationBatch column in the log table is set to NULL by default
A Windows scheduled task inspects all change log tables to find all changes that happened since the last run and 'reserves' these records by setting their ReplicationBatch column to a replication batch number
i.e. you run UPDATE LogTable SET ReplicationBatch = BatchNumber WHERE ReplicationBatch IS NULL
All records that have been marked are replicated
you run SELECT * FROM LogTable WHERE ReplicationBatch = BatchNumber to get the records to be processed
When complete, the reserved records are marked as complete so the next time around only subsequent changes are replicated. This completion flag might be in the log table or it might be in a ReplicationBatch number table
The main point is that you need to reserve records for replication, so that as you are replicating them out, additional log records can be added in from the source without messing up the batch
Then periodically you clear out the log tables.
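A minimal SQL sketch of that pattern (all table, column, and trigger names are illustrative, the trigger only handles inserts and updates on a single table, and the batch number 42 stands in for whatever the scheduler allocates):
-- Change log table populated by a trigger on the replicated table.
CREATE TABLE my_table_changelog (
    pk_value          integer     NOT NULL,          -- PK of the changed row in my_table
    changed_at        timestamptz NOT NULL DEFAULT now(),
    replication_batch integer                        -- NULL until reserved by a batch
);
CREATE OR REPLACE FUNCTION log_my_table_change() RETURNS trigger AS $$
BEGIN
    INSERT INTO my_table_changelog (pk_value) VALUES (NEW.id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER my_table_changelog_trg
    AFTER INSERT OR UPDATE ON my_table
    FOR EACH ROW EXECUTE PROCEDURE log_my_table_change();
-- The scheduled task reserves unprocessed rows...
UPDATE my_table_changelog SET replication_batch = 42 WHERE replication_batch IS NULL;
-- ...reads them back to replicate them to the subscribers...
SELECT * FROM my_table_changelog WHERE replication_batch = 42;
-- ...and clears them out once the batch has been applied everywhere.
DELETE FROM my_table_changelog WHERE replication_batch = 42;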

manually create replication slot for publication in PostgreSQL 10

I am trying to get a stream of updates for certain tables from my PostgreSQL database. The regular way of getting all updates looks like this:
You create a logical replication slot
SELECT pg_create_logical_replication_slot('my_slot', 'wal2json');
And either connect to it using pg_recvlogical or make special SQL queries. This allows you to get all the actions from the database in JSON (if you used the wal2json plugin or similar) and then do whatever you want with that data.
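The SQL-query route could look like this (a minimal sketch, using the slot created above; with wal2json each change comes back as a JSON document):
-- Read and consume the pending changes from the slot:
SELECT * FROM pg_logical_slot_get_changes('my_slot', NULL, NULL);
-- Or peek at them without consuming:
SELECT * FROM pg_logical_slot_peek_changes('my_slot', NULL, NULL);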
But in PostgreSQL 10 we have Publication/Subscription mechanism which allows us to replicate selected tables only. This is very handy because a lot of useless data is not being sent. The process looks like this:
First, you create a publication
CREATE PUBLICATION foo FOR TABLE herp, derp;
Then you subscribe to that publication from another database
CREATE SUBSCRIPTION mysub CONNECTION <connection stuff> PUBLICATION foo;
This creates a replication slot on the master database under the hood and starts listening to updates and committing them to the same tables on the second database. This is fine if your job is to replicate some tables, but I want to get a raw stream for my own purposes.
As I mentioned, the CREATE SUBSCRIPTION query is creating a replication slot on the master database under the hood, but how can I create one manually without the subscription and a second database? Here the docs say:
To make this work, create the replication slot separately (using the function pg_create_logical_replication_slot with the plugin name pgoutput)
According to the docs, this is possible, but pg_create_logical_replication_slot only creates a regular replication slot. Is the pgoutput plugin responsible for all the magic? If yes, then it becomes impossible to use other plugins like wal2json with publications.
What am I missing here?
I have limited experience with logical replication and logical decoding in Postgres, so please correct me if below is wrong. That being said, here is what I have found:
Publication support is provided by the pgoutput plugin. You use it via plugin-specific options. It may be that other plugins could add such support, but I do not know whether the logical decoding plugin interface exposes sufficient details. I tested it against wal2json plugin version 9e962ba and it does not recognize this option.
Replication slots are created independently from publications. The publications to be used as a filter are specified when fetching the change stream. It is possible to peek at changes for one publication, then peek at changes for another publication, and observe a different set of changes despite using the same replication slot (I did not find this documented, and I was testing on Aurora with PostgreSQL compatibility, so the behavior could potentially vary).
The plugin output seems to include all begin and commit entries, even if the transaction did not touch any of the tables included in the publication of interest. It does not, however, include changes to tables other than those included in the publication.
Here is an example how to use it in Postgres 10+:
-- Create publication
CREATE PUBLICATION cdc;
-- Create slot
SELECT pg_create_logical_replication_slot('test_slot_v1', 'pgoutput');
-- Create example table
CREATE TABLE replication_test_v1
(
id integer NOT NULL PRIMARY KEY,
name text
);
-- Add table to publication
ALTER PUBLICATION cdc ADD TABLE replication_test_v1;
-- Insert example data
INSERT INTO replication_test_v1(id, name) VALUES
(1, 'Number 1')
;
-- Peek changes (does not consume changes)
SELECT pg_logical_slot_peek_binary_changes('test_slot_v1', NULL, NULL, 'publication_names', 'cdc', 'proto_version', '1');
-- Get changes (consumes changes)
SELECT pg_logical_slot_get_binary_changes('test_slot_v1', NULL, NULL, 'publication_names', 'cdc', 'proto_version', '1');
To stream changes out of Postgres to other systems, you can consider using the Debezium project. It is an open-source distributed platform for change data capture, which, among others, provides a PostgreSQL connector. In version 0.10 they added support for the pgoutput plugin. Even if your use case is very different from what the project offers, you can look at their code to see how they interact with the replication API.
After you have created the logical replication slot and the publication, you can create a subscription this way:
CREATE SUBSCRIPTION mysub
CONNECTION <conn stuff>
PUBLICATION foo
WITH (slot_name=my_slot, create_slot=false);
Not sure if this answers your question.