Build table of tables from other databases in Postgres - (Multiple-Server Parallel Query Execution?) - postgresql

I am trying to find the best solution to build a database relation. I need something to create a table that will contain data split across other tables from different databases. All the tables got exactly the same structure (same column number, names and types).
In the single database, I would create a parent table with partitions. However, the volume of the data is too big to do it in a single database that's why I am trying to do a split. From the Postgres documentation what I think I am trying to do is "Multiple-Server Parallel Query Execution".
At the moment the only solution I think to implement is to build API of databases address and use it to get data across the network into the main parent database when needed. I also found Postgres external extension called Citus that might do the job but I don't know how to implement the unique key across multiple databases (or Shards like Citus call it).
Is there any better way to do it?

Citus would most likely solve your problem. It lets you use unique keys across shards if it is the distribution column, or if it is a composite key and contains the distribution column.
You can also use distributed-partitioned table in citus. That is a partitioned table on some column (timestamp ?) and hash distributed table on some other column (like what you use in your existing approach). Query parallelization and data collection would be handled by Citus for you.

Related

PostgreSQL database design on multiple disks

Currently I've one physical machine with few SSD disks and PostgreSQL fresh installation:
I'll load ~1-2Tb of data in few distinct tables (they've not interconnection between themselves) where each table comprises distinct data entity.
I think about two approaches:
Create DB (with corresponding table for data entity) on each disk for each entity.
Create one DB but store each table for corresponding data entity on separate disks.
So, my questions is as follows: what approach is preferred and which can be achieved with less cost?
Eagerly waiting for your advices, comrades
You can answer the question yourself.
Are the data used by the same application?
Are the data from these tables joined?
Should these tables always be started and stopped together and have the same PostgreSQL version?
If yes, then they had best be stored together in a single database. Create three logical volumes that is striped across your SSDs: one for the data, one for pg_wal, one for the logs.
If not, you might be best off with a database or a database cluster per table.

Does PostgreSQL support replicating only a subset of the publishing columns?

I've been reading about logical replication in PostgreSQL, which seems to be a very good solution for sharing a small number of tables among several databases. My case is even simpler, as my subscribers will only use source tables in a read-only fashion.
I know that I can add extra columns to a subscribed table in the subscribing node, but what if I only want to import a subset of the whole set of columns of a source table? Is it possible or will it throw an error?
For example, my source table product, has a lot of columns, many of them irrelevant to my subscriber databases. Would it be feasible to create replicas of product with only the really needed columns at each subscriber?
The built in publication/subscription method does not support this. But the logical replication framework also supports any other decoding plugin you can write (or get someone else to write) and install, so you could make this happen that way. It looks like pglogical already supports this ("Selective replication of table columns at publisher side", but I have never tried to use this feature myself).
As of v15, PostgreSQL supports publishing a table partially, indicating which columns must be replicated out of the whole list of columns.
A case like this can be done now:
CREATE PUBLICATION users_filtered FOR TABLE users (user_id, firstname);
See https://www.postgresql.org/docs/15/sql-createpublication.html

(How) Is it possible to convert tables into foreign tables in Postgres?

We have a large table in our Postgres production database which we want to start "sharding" using foreign tables and inheritance.
The desired architecture will be to have 1 (empty) table that defines the schema and several foreign tables inheriting from the empty "parent" table. (possible with Postgres 9.5)
I found this well written article https://www.depesz.com/2015/04/02/waiting-for-9-5-allow-foreign-tables-to-participate-in-inheritance/ that explains everything on how to do it from scratch.
My question is how to reduce the needed migration of data to a minimum.
We have this 100+ GB table now, that should become our first "shard". And in the future we will regulary add new "shards". At some point, the older shards will be moved to another tablespace (on cheaper hardware since they become less important).
My question now:
Is there a way to "ALTER" an existing table to be a foreign table instead?
No way to use alter table to do this.
You really have to basically do it manually. This is no different (really) than doing table partitioning. You create your partitions, you load the data. You direct reads and writes to the partitions.
Now in your case, in terms of doing sharding there are a number of tools I would look at to make this less painful. First, if you make sure your tables are split the way you like them first, you can use a logical replication solution like Bucardo to replicate the writes while you are moving everything over.
There are some other approaches (parallelized readers and writers) that may save you some time at the expense of db load, but those are niche tools.
There is no native solution for shard management of standard PostgreSQL (and I don't know enough about Postgres-XL in this regard to know how well it can manage changing shard criteria). However pretty much anything is possible with a little work and knowledge.

Postgresql archiving old data

I need some expert advice on Postgres
I have few tables in my database that can grow huge, may be a hundred million records and have to implement some sort of data archiving in place. Say I have a subscriber table and subscriber_logs table. The subscriber_logs table will grow huge with time, affecting performance. I wanted to create a separate table called archive_subscriber_logs and create a scheduled task which will read from subscriber_logs and insert the data into archive_subscriber_logs, then delete the dumped data from subscriber_logs.
But my concern is, should I create the archive_subscriber_logs in the same database or in a different database. The problem with storing in a different db is the foreign key constraints that already exists on the main tables.
Anyone can suggest whether same db or different db is preferable? Or any other solutions?
Consider table partitioning, which is implemented in Postgres using table inheritance. This will improve performance on very large tables. Of course you would do measurements first to make sure it is worth implementing. The details are in the excellent Postgres documentation.
Using separate databases is not recommended because you won't be able to have foreign key constraints easily.

Restore PostgreSQL dump with new primary key values

I've got a problem with a PostgreSQL dump / restore. We have a production appliaction running with PostgresSQL 8.4. I need to create some values in the database in the testing environment and then import just this chunk of data into the production environment. The data is generated by the application and I need to use this approach because it needs testing before going into production.
Now that I described the environment, here is my problem:
In the testing database, I leave nothing but the data I need to move to the production database. The data is spread across multiple tables linked with foreign keys with multiple levels (like a tree). I then use pg_dump to export the desired tables into binary format.
When I try to import, the database will correctly import the root table entries with new primary key values, but does not import any of the data from the other tables. I believe that the problem is that foreign keys on child tables no longer recognizes the new primary keys.
Is there a way to achieve such an import which will update all the primary key values of all affected tables in the tree to correct serial (auto increment) values automatically and also update all foreign keys according to these new primary key values?
I have and idea how to do this with assistance of programming language while connected to both databases, but that would be very problematic to achieve for me since I don't have direct access to customers production server.
Thanks in advance!
That one seems to me like a complex migration issue. You can create PL/pgSQL migration scripts with inserts and use returning to get serials and use as foreign keys for other tables up the tree. I do not know the structure of your tree but in some cases reading sequence values in advance into arrays may be required due to complexity or performance reasons.
Other approach can be to examine production sequence values and estimate sequence values that will not be used in the near future. Fabricate test data in the test environment to have serial values that will not collide with production sequence values. Then load that data into the prod database and adjust sequence values of the prod environment so that test sequence values will not be used. It will leave a gap in your ID sequence so you must examine whether anything (like other processes) rely on the sequence values to be continuos.