I have 2 PostgreSQL 12 DB servers, say A and B. A is the main DB.
B consists of some foreign tables pointing to A and some materialized views built from joins of those foreign tables. The materialized views refresh nightly, and with increasing data the refreshes over FDWs are taking awfully long, since SQL over FDWs can't be parallelized.
I wanted to know if:
a logical replica (which gives me the ability to have only a few tables replicated) can have some of its own objects (materialized views in my case, so that the refresh does not have to pull and join tables over FDW)
For those familiar with Oracle GoldenGate, is there anything similar for Postgres, i.e. log-based rather than trigger-based? Open source would be better!
Thanks
It sounds like logical replication would indeed be the solution for you:
You can replicate individual tables with it. You should not modify those tables on B, but otherwise B is a normal database where you can have other tables and objects of your own.
Logical replication works by parsing the transaction log, just like you want. So all data modifications are replicated incrementally.
The replicated tables on B will be duplicates of the tables on A, so they are physically present on B (with foreign tables, there are no data on B, and accessing the foreign tables will actually access data on A). So there is no immediate need for materialized views.
Note that there are some limitations to logical replication. Most notably, ALTER TABLE and other DDL statements are not replicated.
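For illustration, a minimal sketch of such a setup (table, publication, connection and view names are all made up):

-- On A (the publisher): publish only the tables you need.
CREATE PUBLICATION pub_for_b FOR TABLE orders, customers;

-- On B (the subscriber): the tables must already exist there with the same schema.
CREATE SUBSCRIPTION sub_from_a
    CONNECTION 'host=server_a dbname=maindb user=repl password=secret'
    PUBLICATION pub_for_b;

-- B can then have objects of its own, e.g. a materialized view over the
-- locally replicated copies, so the refresh no longer goes over an FDW:
CREATE MATERIALIZED VIEW order_summary AS
SELECT c.customer_id, count(*) AS order_count
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
GROUP BY c.customer_id;

-- Nightly job:
REFRESH MATERIALIZED VIEW order_summary;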
I am trying to find the best solution for building a database relation. I need to create a table that will contain data split across other tables from different databases. All the tables have exactly the same structure (same number of columns, names, and types).
In a single database, I would create a parent table with partitions. However, the volume of data is too big to handle in a single database, which is why I am trying to split it. From the Postgres documentation, what I think I am trying to do is "Multiple-Server Parallel Query Execution".
At the moment, the only solution I can think of is to build an API over the database addresses and use it to pull data across the network into the main parent database when needed. I also found the Postgres extension Citus, which might do the job, but I don't know how to implement a unique key across multiple databases (or shards, as Citus calls them).
Is there any better way to do it?
Citus would most likely solve your problem. It lets you enforce unique keys across shards if the key is the distribution column, or if it is a composite key that contains the distribution column.
You can also use a distributed partitioned table in Citus: a table partitioned on some column (a timestamp?) and hash-distributed on some other column (like the one you use in your existing approach). Query parallelization and data collection are then handled by Citus for you.
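As a rough sketch of what that can look like (table, column and partition names are invented; this assumes the Citus extension is installed):

-- Partitioned by time, hash-distributed by device_id; the primary key
-- contains the distribution column, so it can be enforced across shards.
CREATE TABLE events (
    device_id  bigint      NOT NULL,
    event_time timestamptz NOT NULL,
    payload    jsonb,
    PRIMARY KEY (device_id, event_time)
) PARTITION BY RANGE (event_time);

SELECT create_distributed_table('events', 'device_id');

-- Time-based partitions are created as usual, e.g. one per month:
CREATE TABLE events_2023_01 PARTITION OF events
    FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');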
How does PostgreSQL handle multiple concurrent requests to foreign tables?
If two data consumers want to access the same foreign table, do they have to wait and execute their queries sequentially, or is concurrency of queries supported?
The following answer is mostly for the foreign data wrapper for PostgreSQL, postgres_fdw.
If you need information about other foreign data wrappers, that will vary with the implementation of the foreign data wrapper and the capabilities of the underlying data store. For example, to have concurrent (read) requests with file_fdw, you need a file system that allows two processes to open the file for reading simultaneously.
Concurrency of queries against the same foreign table is just like for local tables. It is the remote server that handles the SQL statements, locks modified rows until the transaction finishes, and similar.
So there can be arbitrarily many concurrent readers, and readers won't block writers and vice versa.
If you run UPDATEs or DELETEs with WHERE conditions that cannot be pushed down to the foreign server (check the execution plan), it can happen that you take more locks than you would with a local table.
Imagine a query like this:
UPDATE remote_tab SET col = 0 WHERE <complicated condition that is true for only one row>;
On a local table, this would only lock a single row.
If the condition is too complicated to be pushed down to the foreign server, postgres_fdw will first run a query like this:
SELECT ctid, col FROM remote_tab FOR UPDATE;
That will retrieve and lock all rows of the table.
Then the WHERE condition will be applied locally, and the resulting row is updated on the foreign server:
UPDATE remote_tab SET col = 0 WHERE ctid = ...;
So in this case, concurrency and performance can suffer quite a lot.
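To see which case you are in, check the execution plan; a hypothetical example following the query above:

EXPLAIN (VERBOSE)
UPDATE remote_tab SET col = 0 WHERE <complicated condition that is true for only one row>;

With VERBOSE, postgres_fdw shows the statement it will send in a "Remote SQL:" line. If that line is an UPDATE containing your condition, everything runs on the remote server; if it is a SELECT ... FOR UPDATE without the condition, the rows are fetched and locked first and the filtering happens locally, as described above.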
We have a large table in our Postgres production database which we want to start "sharding" using foreign tables and inheritance.
The desired architecture will be to have 1 (empty) table that defines the schema and several foreign tables inheriting from the empty "parent" table. (possible with Postgres 9.5)
I found this well-written article https://www.depesz.com/2015/04/02/waiting-for-9-5-allow-foreign-tables-to-participate-in-inheritance/ that explains everything about how to do it from scratch.
My question is how to reduce the needed migration of data to a minimum.
We have this 100+ GB table now that should become our first "shard", and in the future we will regularly add new "shards". At some point, the older shards will be moved to another tablespace (on cheaper hardware, since they become less important).
My question now:
Is there a way to "ALTER" an existing table to be a foreign table instead?
There is no way to use ALTER TABLE to do this.
You basically have to do it manually. This is really no different from doing table partitioning: you create your partitions, you load the data, and you direct reads and writes to the partitions.
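As a rough illustration of that manual setup with foreign tables and inheritance (server, credentials, table and column names are all invented; this assumes postgres_fdw and Postgres 9.5 or later):

-- Empty parent table that only defines the schema.
CREATE TABLE measurements (
    id         bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    text
);

-- A shard living on another server, attached via postgres_fdw.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard1.example.com', dbname 'measurements_db');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
    OPTIONS (user 'app', password 'secret');

-- The foreign table inherits from the parent; the CHECK constraint lets the
-- planner skip shards that cannot contain the requested rows.
CREATE FOREIGN TABLE measurements_2016 (
    CHECK (created_at >= '2016-01-01' AND created_at < '2017-01-01')
) INHERITS (measurements)
  SERVER shard1 OPTIONS (table_name 'measurements_2016');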
Now in your case, there are a number of tools I would look at to make sharding less painful. First, if you make sure your tables are split the way you like them, you can use a logical replication solution like Bucardo to replicate the writes while you are moving everything over.
There are some other approaches (parallelized readers and writers) that may save you some time at the expense of db load, but those are niche tools.
There is no native solution for shard management in standard PostgreSQL (and I don't know enough about Postgres-XL in this regard to say how well it can manage changing shard criteria). However, pretty much anything is possible with a little work and knowledge.
I need some expert advice on Postgres
I have a few tables in my database that can grow huge, maybe to a hundred million records, and I have to put some sort of data archiving in place. Say I have a subscriber table and a subscriber_logs table. The subscriber_logs table will grow huge over time, affecting performance. I wanted to create a separate table called archive_subscriber_logs and a scheduled task that reads from subscriber_logs, inserts the data into archive_subscriber_logs, and then deletes the moved data from subscriber_logs.
But my concern is: should I create archive_subscriber_logs in the same database or in a different database? The problem with storing it in a different DB is the foreign key constraints that already exist on the main tables.
Can anyone suggest whether the same DB or a different DB is preferable? Or any other solutions?
Consider table partitioning, which is implemented in Postgres using table inheritance. This will improve performance on very large tables. Of course you would do measurements first to make sure it is worth implementing. The details are in the excellent Postgres documentation.
Using separate databases is not recommended because you won't be able to have foreign key constraints easily.
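If you do keep the archive in the same database, the scheduled move the question describes can be done in a single statement, so rows are deleted and archived atomically; a rough sketch using the table names from the question (the log_time column and the 90-day cutoff are made up, and both tables are assumed to have the same column layout):

-- Move old rows from subscriber_logs into archive_subscriber_logs in one go.
WITH moved AS (
    DELETE FROM subscriber_logs
    WHERE log_time < now() - interval '90 days'
    RETURNING *
)
INSERT INTO archive_subscriber_logs
SELECT * FROM moved;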
I have two databases, db1 and db2, and each has a table. The two databases have the same data in each table.
To keep the databases the same, I need a trigger so that, when new data is inserted into the table in db1, the same data is inserted into the table in db2 simultaneously, thus keeping the two databases always equal.
The tables are identical.
Google Slony; that's it, now you have triggers and replicas.
The other way to do it would be streaming replication.
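If you really want a hand-rolled trigger rather than a replication tool, one rough way is a trigger that writes through a postgres_fdw foreign table pointing at db2 (all names here are invented, and note that every insert in db1 then fails if db2 is unreachable):

-- In db1: point a foreign table at the matching table in db2.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE SERVER db2_srv FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'localhost', dbname 'db2');
CREATE USER MAPPING FOR CURRENT_USER SERVER db2_srv
    OPTIONS (user 'app', password 'secret');
CREATE FOREIGN TABLE items_db2 (id integer, name text)
    SERVER db2_srv OPTIONS (table_name 'items');

-- Trigger that copies every insert on db1's items table into db2.
CREATE FUNCTION copy_to_db2() RETURNS trigger AS $$
BEGIN
    INSERT INTO items_db2 (id, name) VALUES (NEW.id, NEW.name);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER items_copy
    AFTER INSERT ON items
    FOR EACH ROW EXECUTE PROCEDURE copy_to_db2();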