PostgreSQL database design on multiple disks

Currently I have one physical machine with a few SSDs and a fresh PostgreSQL installation.
I'll load ~1-2 TB of data into a few distinct tables (there are no relationships between them), where each table holds a distinct data entity.
I'm considering two approaches:
Create a separate database (with the corresponding table) on each disk, one per entity.
Create one database but store each entity's table on a separate disk.
So my question is: which approach is preferable, and which can be achieved at lower cost?
Eagerly awaiting your advice, comrades.

You can answer the question yourself.
Are the data used by the same application?
Are the data from these tables joined?
Should these tables always be started and stopped together and have the same PostgreSQL version?
If yes, then they had best be stored together in a single database. Create three logical volumes that are striped across your SSDs: one for the data, one for pg_wal, and one for the logs.
If not, you might be best off with a database or a database cluster per table.
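
As a minimal sketch of the three-volume layout, assuming the striped volumes already exist and are mounted at the hypothetical paths below, this Python snippet builds the initdb call that places pg_wal on its own volume and the postgresql.conf lines that send the server log to the third one. The paths are placeholders, and the --waldir option name assumes PostgreSQL 10 or later.

```python
import shlex

# Hypothetical mount points for the three striped logical volumes.
DATA_DIR = "/mnt/pgdata/main"
WAL_DIR = "/mnt/pgwal/main"
LOG_DIR = "/mnt/pglog"


def initdb_command(data_dir, wal_dir):
    """Build an initdb call that keeps pg_wal on a separate volume."""
    return shlex.join(["initdb", "-D", data_dir, "--waldir", wal_dir])


def logging_settings(log_dir):
    """Return postgresql.conf lines that write server logs to log_dir."""
    return ["logging_collector = on", f"log_directory = '{log_dir}'"]


if __name__ == "__main__":
    print(initdb_command(DATA_DIR, WAL_DIR))
    for line in logging_settings(LOG_DIR):
        print(line)
```

The generated command and settings would still have to be run and applied by hand; the snippet only shows which knobs map to which volume.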

Related

PostgreSQL: What is the fastest way to backup/restore individual tables or table partitions (data+indexes)

I'm migrating from a proprietary DBMS to PG. In the proprietary DBMS, "offlining" and "onlining" data partitions is a very lightweight operation. I'm looking to implement similar functionality with PG by backing up and restoring individual tables (partitions). Obviously I need to avoid a performance regression. So my question is: what is the fastest way of:
Backing up a table (partition), both data and indexes
Taking the table offline (meaning that the data is now gone from the database)
Restoring the table (partition), both data and indexes
Once I have some advice I can design more targeted performance comparisons. Thanks in advance for any pointers.

What is fast and needs to be fast is adding or removing a partition (ALTER TABLE ... ATTACH/DETACH PARTITION).
After you have detached the partition, you are in no great hurry to back up or export the data. This can be done comfortably with pg_dump.
Similarly, importing the data for a table that is to become a new partition is normally not time critical.
If you need this to happen faster (for example, you want the old partition to be visible in another database as soon as it is detached in the old one), you could use logical replication to replicate the partition to another PostgreSQL database before you detach it. As soon as replication has caught up, you detach or drop the original partition and attach the copy in the other database.
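
The detach, dump, and re-attach sequence could be sketched like this. It is a rough illustration rather than a tested procedure: the table names, the range bounds, and the assumption that the parent is range-partitioned are all made up for the example, and identifiers are interpolated without quoting.

```python
import shlex


def detach_sql(parent, partition):
    """The fast step: take the partition out of the partitioned table."""
    return f"ALTER TABLE {parent} DETACH PARTITION {partition};"


def dump_command(dbname, partition, outfile):
    """Dump the detached table (data and index definitions) at leisure."""
    return shlex.join([
        "pg_dump", "--format=custom",
        f"--table={partition}", f"--file={outfile}", dbname,
    ])


def attach_sql(parent, partition, lower, upper):
    """Re-attach a restored table, assuming range partitioning."""
    return (
        f"ALTER TABLE {parent} ATTACH PARTITION {partition} "
        f"FOR VALUES FROM ('{lower}') TO ('{upper}');"
    )


if __name__ == "__main__":
    # Hypothetical names: parent "measurements", partition "measurements_2023".
    print(detach_sql("measurements", "measurements_2023"))
    print(dump_command("mydb", "measurements_2023", "measurements_2023.dump"))
    print(attach_sql("measurements", "measurements_2023",
                     "2023-01-01", "2024-01-01"))
```

Between the dump and the later attach, the detached table would be dropped and eventually restored with pg_restore; those steps are left out here.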

Build table of tables from other databases in Postgres - (Multiple-Server Parallel Query Execution?)

I am trying to find the best solution to build a database relation. I need something to create a table that will contain data split across other tables from different databases. All the tables have exactly the same structure (same number of columns, same names and types).
In a single database, I would create a parent table with partitions. However, the volume of data is too big for a single database, which is why I am trying to split it. From the Postgres documentation, I think what I am trying to do is "Multiple-Server Parallel Query Execution".
At the moment, the only solution I can think of is to build an API of database addresses and use it to fetch data across the network into the main parent database when needed. I also found an external Postgres extension called Citus that might do the job, but I don't know how to implement a unique key across multiple databases (or shards, as Citus calls them).
Is there any better way to do it?

Citus would most likely solve your problem. It lets you use a unique key across shards if the key is the distribution column, or if it is a composite key that contains the distribution column.
You can also use a distributed partitioned table in Citus: a table that is partitioned on one column (a timestamp?) and hash-distributed on another (like what you use in your existing approach). Citus would handle query parallelization and data collection for you.
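
To see why the unique key has to contain the distribution column, here is a toy model in Python. It is not how Citus is implemented (the CRC32 routing and the class name are assumptions for illustration only); it just shows that when rows are routed by a hash of the distribution column, two rows with the same key always land on the same shard, so a purely local uniqueness check is enough.

```python
import zlib


class ToyShardedTable:
    """Toy model of a hash-distributed table with per-shard unique checks."""

    def __init__(self, shard_count, distribution_column, unique_columns):
        if distribution_column not in unique_columns:
            # Without the distribution column in the key, equal keys could
            # land on different shards and no local check would catch them.
            raise ValueError("unique key must contain the distribution column")
        self.shard_count = shard_count
        self.distribution_column = distribution_column
        self.unique_columns = tuple(unique_columns)
        self.shards = [dict() for _ in range(shard_count)]

    def shard_for(self, row):
        """Route a row by hashing its distribution column value."""
        value = str(row[self.distribution_column]).encode()
        return zlib.crc32(value) % self.shard_count

    def insert(self, row):
        """Insert a row, checking uniqueness only within the target shard."""
        shard = self.shards[self.shard_for(row)]
        key = tuple(row[column] for column in self.unique_columns)
        if key in shard:
            raise KeyError(f"duplicate key {key}")
        shard[key] = row
```

For example, with a hypothetical distribution column device_id and unique key (device_id, ts), inserting the same pair twice is rejected by a single shard, without consulting the others.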

How does pglogical-2 handle logical replication on same table while allowing it to be writeable on both databases?

Based on the above image, there are certain tables I want to be in the Internal Database (right hand side). The other tables I want to be replicated in the external database.
In reality there's only one set of values that SHOULD NOT be replicated across. The rest of the database can be replicated. Basically the actual price columns in the prices table cannot be replicated across. It should stay within the internal database.
Because the vendors are external to the network, they have no access to the internal app.
My plan is to create a replicated version of the same app and allow vendors to submit quotations and picking items.
Let's say the replicated tables are at least quotations and quotation_line_items. These tables should be writeable (in terms of data for INSERTs, UPDATEs, and DELETEs) at both the external database and the internal database. Hence at both databases, the data in the quotations and quotation_line_items table are writeable and should be replicated across in both directions.
The data in the other tables are going to be replicated in a single direction (from internal to external) except for the actual raw prices columns in the prices table.
The quotation_line_items table will have a price_id column. However, the raw price values in the prices table should not appear in the external database.
Ultimately, I want the data to be consistent for the replicated tables on both databases. I am okay with synchronous replication, so a bit of delay (say, a couple of seconds for the write operations) is fine.
I came across pglogical https://github.com/2ndQuadrant/pglogical/tree/REL2_x_STABLE
and they have the concept of PUBLISHER and SUBSCRIBER.
I cannot tell based on the readme which one would be acting as publisher and subscriber and how to configure it for my situation.

That won't work. With the setup you are dreaming of, you will necessarily end up with replication conflicts.
How do you want to prevent data from being modified in a conflicting fashion in the two databases? If you say that that won't happen, think again.
I believe that you would be much better off using a single database with two users: one that can access the “secret” table and one that cannot.
If you want to restrict access only to certain columns, use a view. Simple views are updateable in PostgreSQL.
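
A minimal sketch of the view approach, written as a Python helper that only builds the SQL. The column names (id, item_id), the view name, and the role name are hypothetical; the question only says that the raw price columns of the prices table must stay hidden. It relies on the default behaviour that a view is executed with the privileges of its owner, so the external role needs no rights on the base table.

```python
def restricted_view_sql(table, view, visible_columns, role):
    """Return statements that expose only some columns of table to role.

    Names are interpolated without quoting, so this is only meant for
    trusted, simple identifiers.
    """
    columns = ", ".join(visible_columns)
    return [
        f"CREATE VIEW {view} AS SELECT {columns} FROM {table};",
        f"REVOKE ALL ON {table} FROM {role};",
        f"GRANT SELECT, INSERT, UPDATE, DELETE ON {view} TO {role};",
    ]


if __name__ == "__main__":
    # Hypothetical names for the vendor-facing side of the application.
    for statement in restricted_view_sql(
        "prices", "prices_public", ["id", "item_id"], "vendor_app"
    ):
        print(statement)
```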

It is possible with BDR replication, which uses pglogical. At a basic level, this works by allocating ranges of key IDs to each node, so that writes are possible in both locations without conflict. However, BDR is now a commercial, paid-for product.
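
The idea of per-node key ranges can be illustrated with a small Python toy. This is not BDR's API (the class, function, and node names are invented for the example); it only shows how disjoint ranges keep newly inserted keys from colliding, and it says nothing about conflicting updates to the same existing row.

```python
class KeyRangeAllocator:
    """Hands out IDs from a half-open range [start, stop) owned by one node."""

    def __init__(self, start, stop):
        self._next = start
        self._stop = stop

    def next_id(self):
        if self._next >= self._stop:
            raise RuntimeError("key range exhausted")
        value = self._next
        self._next += 1
        return value


def split_ranges(node_names, range_size):
    """Give every node its own disjoint block of range_size IDs."""
    return {
        name: KeyRangeAllocator(index * range_size, (index + 1) * range_size)
        for index, name in enumerate(node_names)
    }


if __name__ == "__main__":
    nodes = split_ranges(["internal", "external"], 1_000_000)
    print(nodes["internal"].next_id(), nodes["external"].next_id())
```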

How to create read replicas from multiple postgres databases into a single database?

I'd like to preface this by saying I'm not a DBA, so sorry for any gaps in technical knowledge.
I am working within a microservices architecture, where we have a dozen or so applications, each supported by its own Postgres database instance (which is in RDS, if that helps). Each of the microservices' databases contains a few tables. It's safe to assume that there are no naming conflicts across any of the schemas/tables, and that there's no sharding of any data across the databases.
One of the issues we keep running into is wanting to analyze/join data across the databases. Right now, we're relying on a third-party tool that caches our data and makes it possible to query across multiple database sources (via the shared cache).
Is it possible to create read-replicas of the schemas/tables from all of our production databases and have them available to query in a single database?
Are there any other ways to configure Postgres or RDS to make joining across our databases possible?

Is it possible to create read-replicas of the schemas/tables from all of our production databases and have them available to query in a single database?
Yes, that's possible and it's actually quite easy.
Set up one Postgres server that acts as the master.
For each remote server, create a foreign server, which you then use to create foreign tables that make the data accessible from the master server.
If you have multiple tables on multiple servers that should be viewed as a single table in the master, you can set up inheritance to make all those tables appear as one. If you can define a "sharding" key that identifies a distinct attribute between those servers, you can even make Postgres request the data only from the specific server.
All foreign tables can be joined as if they were local tables. Depending on the kind of query, some (or a lot) of the filter and join criteria can even be pushed down to the remote server to distribute the work.
As the Postgres Foreign Data Wrapper is writeable, you can even update the remote tables from the master server.
If the remote access and joins are too slow, you can create materialized views based on the remote tables to create a local copy of the data. This, however, means that it's not a real-time copy and you have to manage the regular refresh of the tables.
Other (more complicated) options are the BDR project or pglogical. It seems that logical replication will be built into the next Postgres version (to be released at the end of this year).
Or you could use a distributed, shared-nothing system like Postgres-XL (which is probably the most complicated system to set up and maintain).
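
The foreign-server approach described above could look roughly like the following, assuming the postgres_fdw extension is what is meant by the Postgres Foreign Data Wrapper. The Python helpers only build the SQL; every server, host, role, and schema name is hypothetical, and IMPORT FOREIGN SCHEMA is used here as a shortcut for creating the foreign tables one by one.

```python
def fdw_setup_sql(server, host, dbname, local_role, remote_user,
                  remote_password, local_schema,
                  port=5432, remote_schema="public"):
    """Return the statements that expose one remote database locally.

    Values are interpolated without escaping, so this sketch is only
    suitable for trusted, simple names.
    """
    return [
        "CREATE EXTENSION IF NOT EXISTS postgres_fdw;",
        f"CREATE SERVER {server} FOREIGN DATA WRAPPER postgres_fdw "
        f"OPTIONS (host '{host}', port '{port}', dbname '{dbname}');",
        f"CREATE USER MAPPING FOR {local_role} SERVER {server} "
        f"OPTIONS (user '{remote_user}', password '{remote_password}');",
        f"CREATE SCHEMA IF NOT EXISTS {local_schema};",
        f"IMPORT FOREIGN SCHEMA {remote_schema} FROM SERVER {server} "
        f"INTO {local_schema};",
    ]


def local_copy_sql(view, foreign_table):
    """Statements for a materialized local copy and its later refresh."""
    return [
        f"CREATE MATERIALIZED VIEW {view} AS SELECT * FROM {foreign_table};",
        f"REFRESH MATERIALIZED VIEW {view};",
    ]


if __name__ == "__main__":
    # Hypothetical microservice database "orders" on a made-up host.
    for statement in fdw_setup_sql(
        "orders_srv", "orders.example.internal", "orders",
        "analyst", "readonly", "secret", "orders_remote",
    ):
        print(statement)
```

Calling fdw_setup_sql once per microservice database, each with its own local schema, keeps the imported tables apart even if names ever do collide.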

Postgresql archiving old data

I need some expert advice on Postgres.
I have a few tables in my database that can grow huge, maybe to a hundred million records, so I have to put some sort of data archiving in place. Say I have a subscriber table and a subscriber_logs table. The subscriber_logs table will grow huge over time, affecting performance. I wanted to create a separate table called archive_subscriber_logs and a scheduled task that reads from subscriber_logs, inserts the data into archive_subscriber_logs, and then deletes the archived data from subscriber_logs.
But my concern is whether I should create archive_subscriber_logs in the same database or in a different one. The problem with storing it in a different database is the foreign key constraints that already exist on the main tables.
Can anyone suggest whether the same database or a different one is preferable? Or any other solutions?

Consider table partitioning, which older Postgres versions implement using table inheritance (version 10 and later also support declarative partitioning). This can improve performance on very large tables. Of course, you would do measurements first to make sure it is worth implementing. The details are in the excellent Postgres documentation.
Using separate databases is not recommended because you won't be able to have foreign key constraints easily.
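
As a rough sketch of what partitioning subscriber_logs by month could look like, assuming PostgreSQL 10 or later with declarative range partitioning and a parent table that is already partitioned on a timestamp column (which the question does not mention), the Python helpers below generate the per-month DDL. Detaching an old partition then takes the place of the copy-and-delete job.

```python
from datetime import date


def next_month(day):
    """Return the first day of the month following day."""
    return date(day.year + (day.month == 12), day.month % 12 + 1, 1)


def monthly_partition_sql(parent, first_day):
    """DDL for one monthly partition; the upper bound is exclusive."""
    upper = next_month(first_day)
    name = f"{parent}_{first_day:%Y_%m}"
    return (
        f"CREATE TABLE {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{first_day.isoformat()}') "
        f"TO ('{upper.isoformat()}');"
    )


def archive_partition_sql(parent, first_day):
    """Detach an old month so it can be kept, dumped, or dropped."""
    name = f"{parent}_{first_day:%Y_%m}"
    return f"ALTER TABLE {parent} DETACH PARTITION {name};"


if __name__ == "__main__":
    print(monthly_partition_sql("subscriber_logs", date(2024, 12, 1)))
    print(archive_partition_sql("subscriber_logs", date(2024, 12, 1)))
```

The detached table stays in the same database under its own name, so it can serve as the archive directly.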
