I'm doing research on logical decoding, and I've been able to create a slot and replicate all transactions from one database to another using the streaming replication protocol; it works really well.
But I need to replicate just a single table, not all the tables in the database.
So, my question is: does logical decoding allow filtering the stream to a single table?
My current idea is to create a custom logical decoding output plugin. Am I wrong?
Update
I built an output plugin based on contrib/test_decoding from the PostgreSQL sources, and it was a good workaround. However, it wasn't useful for real use cases, so I decided to take some other projects as references to fork and extend.
The best fit for me was wal2json, so I forked it and added the table filter as an option rather than hardcoding the table names.
Here is the fork and this is the changeset.
How to use
First create the slot with the wal2json plugin:
pg_recvlogical -d postgres --slot test_slot --create-slot -P wal2json
Then start receiving the stream:
pg_recvlogical -d postgres --slot test_slot --start -o limit-to=table_foo,table_bar -f -
Now we are ready to receive the updates on table_foo and table_bar only.
This was a really good challenge. I'm not a C developer, and I know the code needs some optimization, but for now it works better than expected.
The current version of wal2json has these options:
* `filter-tables` - tables to exclude
* `add-tables` - tables to include
Usage:
pg_recvlogical -d postgres --slot test_slot --start -o add-tables=myschema.mytable,myschema.mytable2 -f -
Reference: https://github.com/eulerto/wal2json#parameters
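If you would rather consume the stream from application code than from pg_recvlogical, psycopg2's replication support works against the same slot. A minimal sketch, assuming the slot was created with the wal2json plugin as above (connection string, slot name, and table names are placeholders):

import json

import psycopg2
import psycopg2.extras

# Replication requires a special connection class (names below are placeholders).
conn = psycopg2.connect(
    "dbname=postgres user=postgres",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()

# Attach to the existing slot and pass the same table filter option.
cur.start_replication(
    slot_name="test_slot",
    decode=True,
    options={"add-tables": "myschema.mytable,myschema.mytable2"},
)

def consume(msg):
    change = json.loads(msg.payload)                      # one JSON document per transaction
    print(change)
    msg.cursor.send_feedback(flush_lsn=msg.data_start)    # acknowledge so WAL can be recycled

cur.consume_stream(consume)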
According to the documentation, you can implement your own logical replication solution by using the streaming replication interface commands:
CREATE_REPLICATION_SLOT slot_name LOGICAL options
DROP_REPLICATION_SLOT slot_name
START_REPLICATION SLOT slot_name LOGICAL options
In addition to the interface above, you also need to implement a logical decoding output plugin. In that plugin you need to provide the change callback, which receives all DML operations:
The required change_cb callback is called for every individual row
modification inside a transaction, may it be an INSERT, UPDATE, or
DELETE. Even if the original command modified several rows at once the
callback will be called individually for each row.
This is the function where you check whether a particular table should be replicated. Also be aware that the change callback will NOT be called for UNLOGGED and TEMP tables, but I guess that is not a severe limitation.
I'm working on an ETL process that requires me to pull data from one Postgres table and update data in another Postgres table in a separate environment (same column names). Currently, I am running the Python job on a Windows EC2 instance, and I am using the pangres upsert library to update existing rows and insert new rows.
However, my organization wants me to move the Python ETL script to Managed Apache Airflow on AWS.
I have been learning about DAGs, and most of the tutorials and articles are about querying data from a Postgres table using hooks or operators.
However, I am looking to understand how to update an existing table A incrementally (i.e. upsert) using new records from table B (and ignore/overwrite existing matching rows).
Any chunk of code (DAG) that explains how to perform this simple task would be extremely helpful.
In Apache Airflow, operations are done using operators. You can package any Python code into an operator, but your best bet is always to use a pre-existing open source operator if one already exists. There is an operator for Postgres (https://airflow.apache.org/docs/apache-airflow-providers-postgres/stable/operators/postgres_operator_howto_guide.html).
It will be hard to provide a complete example for your exact situation, but it sounds like the best approach here is to take the SQL that is already in your Python ETL script and use it with the Postgres operator. The documentation I linked is a good example.
It demonstrates inserting data, reading data, and even creating a table as a prerequisite step. Just as lines in a Python script execute one at a time, operators in a DAG execute in a particular order, depending on how they're wired up, as in their example:
create_pet_table >> populate_pet_table >> get_all_pets >> get_birth_date
In their example, populating the pet table won't happen until the create pet table step succeeds, etc.
Since your use case is about copying new data from one table to another, a few tips I can give you:
Use a scheduled DAG to copy the data over in batches. Airflow isn't meant to be used as a streaming system for many small pieces of data.
Use the "logical date" of the DAG run (https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html) in your DAG to know which interval of data that run should process. This fits your requirement that only new data should be copied over during each run. It also gives you repeatable runs in case you need to fix the code and then re-run each run (one batch at a time) after pushing your fix; see the sketch below.
I am trying to get a stream of updates for certain tables from my PostgreSQL database. The regular way of getting all updates looks like this:
You create a logical replication slot:
SELECT pg_create_logical_replication_slot('my_slot', 'wal2json');
Then you either connect to it using pg_recvlogical or issue the corresponding SQL queries. This allows you to get all the actions from the database as JSON (if you used the wal2json plugin or similar) and then do whatever you want with that data.
But in PostgreSQL 10 we have Publication/Subscription mechanism which allows us to replicate selected tables only. This is very handy because a lot of useless data is not being sent. The process looks like this:
First, you create a publication
CREATE PUBLICATION foo FOR TABLE herp, derp;
Then you subscribe to that publication from another database
CREATE SUBSCRIPTION mysub CONNECTION <connection stuff> PUBLICATION foo;
Under the hood this creates a replication slot on the master database, starts listening to updates, and commits them to the same tables on the second database. This is fine if the job is to replicate some tables, but I want a raw stream for my own purposes.
As I mentioned, the CREATE SUBSCRIPTION query creates a replication slot on the master database under the hood, but how can I create one manually, without the subscription and a second database? Here is what the docs say:
To make this work, create the replication slot separately (using the function pg_create_logical_replication_slot with the plugin name pgoutput)
According to the docs, this is possible, but pg_create_logical_replication_slot only creates a regular replication slot. Is the pgoutput plugin responsible for all the magic? If yes, then it becomes impossible to use other plugins like wal2json with publications.
What am I missing here?
I have limited experience with logical replication and logical decoding in Postgres, so please correct me if anything below is wrong. That said, here is what I have found:
Publication support is provided by the pgoutput plugin, and you use it via plugin-specific options. Other plugins may be able to add such support, but I do not know whether the logical decoding plugin interface exposes sufficient details. I tested wal2json at version 9e962ba, and it does not recognize this option.
Replication slots are created independently from publications. The publications to use as a filter are specified when fetching the change stream. It is possible to peek at changes for one publication, then peek at changes for another publication, and observe a different set of changes despite using the same replication slot (I did not find this documented, and I was testing on Aurora with PostgreSQL compatibility, so the behavior could vary).
The plugin output seems to include entries for every begin and commit, even if the transaction did not touch any of the tables in the publication of interest. It does not, however, include changes to tables outside the publication.
Here is an example of how to use it in PostgreSQL 10+:
-- Create publication
CREATE PUBLICATION cdc;
-- Create slot
SELECT pg_create_logical_replication_slot('test_slot_v1', 'pgoutput');
-- Create example table
CREATE TABLE replication_test_v1
(
id integer NOT NULL PRIMARY KEY,
name text
);
-- Add table to publication
ALTER PUBLICATION cdc ADD TABLE replication_test_v1;
-- Insert example data
INSERT INTO replication_test_v1(id, name) VALUES
(1, 'Number 1')
;
-- Peek changes (does not consume changes)
SELECT pg_logical_slot_peek_binary_changes('test_slot_v1', NULL, NULL, 'publication_names', 'cdc', 'proto_version', '1');
-- Get changes (consumes changes)
SELECT pg_logical_slot_get_binary_changes('test_slot_v1', NULL, NULL, 'publication_names', 'cdc', 'proto_version', '1');
To stream changes out of Postgres to other systems, you can consider using the Debezium project. It is an open-source distributed platform for change data capture which, among others, provides a PostgreSQL connector. In version 0.10 they added support for the pgoutput plugin. Even if your use case is very different from what the project offers, you can look at their code to see how they interact with the replication API.
After you have created the logical replication slot and the publication, you can create a subscription this way:
CREATE SUBSCRIPTION mysub
CONNECTION <conn stuff>
PUBLICATION foo
WITH (slot_name=my_slot, create_slot=false);
Not sure if this answers your question.
We have a system which stores data in a postgres database. In some cases, the size of the database has grown to several GBs.
When this system is upgraded, the data in that database is backed up and then restored into the database. Owing to the huge amount of data, the indexing takes a long time to complete (~30 minutes) during restoration, thereby delaying the upgrade process.
Is there a way where the data copy and indexing can be split into two steps, where the data is copied first to complete the upgrade, followed by indexing which can be done at a later time in the background?
Thanks!
There's no built-in way to do it with pg_dump and pg_restore. But pg_restore's -j option helps a lot.
There is CREATE INDEX CONCURRENTLY. But pg_restore doesn't use it.
It would be quite nice to be able to restore everything except secondary indexes that no FK constraints depend on, then restore those as a separate phase using CREATE INDEX CONCURRENTLY. But no such support currently exists; you'd have to write it yourself.
You can, however, filter the table-of-contents used by pg_restore, so you could possibly do some hacky scripting to do the needed work.
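As a rough illustration of that hacky scripting (dump, list file, and database names are made up), the sketch below splits the table of contents of a custom-format dump into an "everything except plain indexes" list and an "indexes only" list, so the second pg_restore can run later in the background:

import subprocess

DUMP = "backup.dump"   # assumed custom-format dump (pg_dump -Fc)

# Grab the dump's table of contents.
toc = subprocess.run(
    ["pg_restore", "-l", DUMP], check=True, capture_output=True, text=True
).stdout.splitlines()

with open("no_indexes.list", "w") as fast, open("indexes_only.list", "w") as later:
    for line in toc:
        # TOC entries look like "1234; 1259 16397 INDEX public my_idx owner";
        # a leading semicolon comments an entry out.
        is_index = " INDEX " in line and not line.startswith(";")
        fast.write((";" + line if is_index else line) + "\n")
        later.write((line if is_index else ";" + line) + "\n")

# Phase 1 (during the upgrade): everything except the plain indexes.
subprocess.run(["pg_restore", "-d", "mydb", "-j", "4", "-L", "no_indexes.list", DUMP], check=True)
# Phase 2 (later, in the background): only the indexes.
subprocess.run(["pg_restore", "-d", "mydb", "-j", "4", "-L", "indexes_only.list", DUMP], check=True)

Note that indexes backing primary key and unique constraints show up as CONSTRAINT entries, so they stay in the first phase, which is what you want for the FK dependencies mentioned above.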
There is an option in pg_dump to separate the data from the index creation.
Here, pre-data refers to the schema, and post-data refers to indexes and triggers.
From the docs:
--section=sectionname
Only dump the named section. The section name can be pre-data, data, or post-data. This option can be specified more than once to select multiple sections. The default is to dump all sections.
The data section contains actual table data, large-object contents, and sequence values. Post-data items include definitions of indexes, triggers, rules, and constraints other than validated check constraints. Pre-data items include all other data definition items.
Maybe this will help :)
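A rough sketch of that split (database and file names are placeholders); pg_restore accepts the same --section switch, so a single custom-format dump can be restored in phases and the post-data step deferred:

import subprocess

DB, DUMP = "mydb", "mydb.dump"   # placeholders

# One custom-format dump; sections are picked at restore time.
subprocess.run(["pg_dump", "-Fc", "-f", DUMP, DB], check=True)

# During the upgrade: schema first, then the data (parallelised with -j).
subprocess.run(["pg_restore", "-d", DB, "--section=pre-data", DUMP], check=True)
subprocess.run(["pg_restore", "-d", DB, "--section=data", "-j", "4", DUMP], check=True)

# Later, in the background: indexes, constraints, and triggers.
subprocess.run(["pg_restore", "-d", DB, "--section=post-data", "-j", "4", DUMP], check=True)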
I have a choice of how to sync ES with the latest changes in my Postgres DB:
1- Postgres LISTEN / NOTIFY:
I would create a trigger -> use pg_notify -> and create a listener in a separate service.
2- Async queries to ES:
I can update Elasticsearch asynchronously after a change in the DB, i.e.:
model.save().then(() => {model.saveES() }).catch()
Which one will scale best?
PS: We tried ZomboDB in production, but it didn't go well; it slowed production down.
Since you are asking about the options, I assume you want to know the possibilities in order to choose the better architecture, so I would like to point you to the advice given by Confluent:
https://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/
I recommend you consider https://github.com/debezium/debezium. It has PostgreSQL support and implements the change data capture model proposed in other posts instead of the dual-write model.
Debezium benefits:
low latency change streaming
stores changes in a replicated log for durability
emits only write events (creates, updates, deletes) which can be consumed and piped into other systems.
UPD: Here is a simple GitHub repository which shows how it works.
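To give an idea of what using it looks like, registering Debezium's PostgreSQL connector with a Kafka Connect instance is roughly the following (host names, credentials, server name, and table list are placeholders; newer Debezium releases rename some of these properties, e.g. table.whitelist became table.include.list):

import json
import requests

# Hypothetical connector configuration; adjust hosts, credentials, and tables.
connector = {
    "name": "my-postgres-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",              # logical decoding plugin to use
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "replicator",
        "database.password": "secret",
        "database.dbname": "mydb",
        "database.server.name": "mydb",         # prefix for the Kafka topics
        "table.whitelist": "public.my_table",   # only stream changes for this table
    },
}

# Kafka Connect exposes a REST API (assumed to be on localhost:8083).
resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()

A consumer then reads the change events from the Kafka topics and applies them to Elasticsearch (or you use a Kafka Connect Elasticsearch sink connector), so the database write and the ES update are decoupled.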
As there is no support for user-defined functions or stored procedures in Redshift, how can I achieve an UPSERT mechanism in Redshift, which is based on ParAccel, a PostgreSQL 8.0.2 fork?
Currently, I'm trying to achieve an UPSERT mechanism using an IF...THEN...ELSE... statement,
e.g.:
IF NOT EXISTS(SELECT...WHERE(SELECT..))
THEN INSERT INTO tblABC() SELECT... FROM tblXYZ
ELSE UPDATE tblABC SET.,.,.,. FROM tblXYZ WHERE...
which gives me an error, since I'm writing this code standalone, without wrapping it in a function or stored procedure.
So, is there any solution to achieve UPSERT?
Thanks
You should probably read this article on upsert by depesz. You can't rely on SERIALIZABLE for this since, AFAIK, ParAccel doesn't support full serializability like Pg 9.1+ does. As outlined in that post, you can't really do what you want purely in the DB anyway.
The short version is that even on current PostgreSQL versions that support writable CTEs it's still hard. On an 8.0 based ParAccel, you're pretty much out of luck.
I'd do a staged merge. COPY the new data to a temporary table on the server, LOCK the destination table, then do an UPDATE ... FROM followed by an INSERT INTO ... SELECT. Doing the data uploads in big chunks and locking the table for the upserts is reasonably in keeping with how Redshift is used anyway.
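A sketch of that staged merge, driven from psycopg2 (connection string, table and column names, and the S3 location are made up; the same statements can be run from any SQL client):

import psycopg2

# Placeholder connection details.
conn = psycopg2.connect("host=my-cluster.example.com port=5439 dbname=mydb user=me password=secret")

with conn.cursor() as cur:
    # 1. Stage the incoming rows; COPY from S3 is the usual Redshift path.
    cur.execute("CREATE TEMP TABLE stage (LIKE tblabc);")
    cur.execute("""
        COPY stage FROM 's3://my-bucket/new-rows.csv'
        CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/my-copy-role'
        CSV;
    """)

    # 2. Lock the target so concurrent upserts cannot interleave.
    cur.execute("LOCK tblabc;")

    # 3. Update the rows that already exist ...
    cur.execute("""
        UPDATE tblabc
        SET col1 = stage.col1, col2 = stage.col2
        FROM stage
        WHERE tblabc.id = stage.id;
    """)

    # 4. ... then insert the ones that do not.
    cur.execute("""
        INSERT INTO tblabc
        SELECT stage.* FROM stage
        LEFT JOIN tblabc ON tblabc.id = stage.id
        WHERE tblabc.id IS NULL;
    """)

conn.commit()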
Another approach is to externally coordinate the upserts via something local to your application cluster. Have all your tools communicate via an external tool where they take an "insert-intent lock" before doing an insert. You want a distributed locking tool appropriate to your system. If everything's running inside one application server, it might be as simple as a synchronized singleton object, as in the sketch below.
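In the single-application-server case, that can be as small as the toy sketch below, where do_upsert is a hypothetical helper running the staged merge above:

import threading

# Process-wide "insert-intent lock": only one upsert batch touches the target table at a time.
_upsert_lock = threading.Lock()

def upsert_batch(conn, rows):
    with _upsert_lock:
        do_upsert(conn, rows)   # hypothetical helper performing the staged merge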