PostgreSQL data warehouse: create separate databases or different tables within the same database? - postgresql

We seek to run cross-table queries and perform different type of merges. For cross-database queries we need to establish connection every time.

So to answer your question,
we should create multiple(different)tables within same database.
because cross database operation is not supported(for ex. In short you can't do the join on 2 tables in different database)
But if you want to segregate your data within same database you can create different schemas/layer .and create your tables under that.
for ex.
**1st load landingLayer.tablename
2nd transformation goodDataLayer.tablename
3rd transformation widgetLayer.tablename**

Related

PostgreSQL database design on multiple disks

Currently I've one physical machine with few SSD disks and PostgreSQL fresh installation:
I'll load ~1-2Tb of data in few distinct tables (they've not interconnection between themselves) where each table comprises distinct data entity.
I think about two approaches:
Create DB (with corresponding table for data entity) on each disk for each entity.
Create one DB but store each table for corresponding data entity on separate disks.
So, my questions is as follows: what approach is preferred and which can be achieved with less cost?
Eagerly waiting for your advices, comrades
You can answer the question yourself.
Are the data used by the same application?
Are the data from these tables joined?
Should these tables always be started and stopped together and have the same PostgreSQL version?
If yes, then they had best be stored together in a single database. Create three logical volumes that is striped across your SSDs: one for the data, one for pg_wal, one for the logs.
If not, you might be best off with a database or a database cluster per table.

How to replicate rows into different tables of different database in postgresql?

I use postgresql. I have many databases in a server. There is one database which I use the most say 'main'. This 'main' has many tables inside it. And also other databases have many tables inside them.
What I want to do is, whenever a new row is inserted into 'main.users' table I wish to insert the same data into 'users' table of other databases. How shall I do it in postgresql? Similarly I wish to do the same for all actions like UPDATE, DELETE etc.,
I had gone through the "logical replication" concept as suggested by you. In my case I know the source db name up front and I will come to know the target db name as part of the query. So it is going to be dynamic.
How to achieve this? is there any db concept available in postgresql? Or I welcome all other possible ways as well. Please share me some idea on this.
If this is all on the same Postgres instance (aka "cluster"), then I would recommend to use a foreign table to access the tables from the "main" database in the other databases.
Those foreign tables look like "local" tables inside each database, but access the original data in the source database directly, so there is no need to synchronize anything.
Upgrade to a recent PostgreSQL release and use logical replication.
Add a trigger on the table in the master database that uses dblink to access and write the other databases.
Be sure to consider what should be done if the row alreasdy exists remotely, or if the rome server is unreachable.
Also not that updates propogated usign dblink are not rolled back if the inboking transaction is rolled back

Insert data into remote DB tables from multiple databases through trigger or replication or foreign data wrapper

I need some advice about the following scenario.
I have multiple embedded systems supporting PostgreSQL database running at different places and we have a server running on CentOS at our premises.
Each system is running at remote location and has multiple tables inside its database. These tables have the same names as the server's table names, but each system has different table name than the other systems, e.g.:
system 1 has tables:
sys1_table1
sys1_table2
system 2 has tables
sys2_table1
sys2_table2
I want to update the tables sys1_table1, sys1_table2, sys2_table1 and sys2_table2 on the server on every insert done on system 1 and system 2.
One solution is to write a trigger on each table, which will run on every insert of both systems' tables and insert the same data on the server's tables. This trigger will also delete the records in the systems after inserting the data into server. The problem with this solution is that if the connection with the server is not established due to network issue than that trigger will not execute or the insert will be wasted. I have checked the following solution for this
Trigger to insert rows in remote database after deletion
The second solution is to replicate tables from system 1 and system 2 to the server's tables. The problem with replication will be that if we delete data from the systems, it'll also delete the records on the server. I could add the alternative trigger on the server's tables which will update on the duplicate table, hence the replicated table can get empty and it'll not effect the data, but it'll make a long tables list if we have more than 200 systems.
The third solution is to write a foreign table using postgres_fdw or dblink and update the data inside the server's tables, but will this effect the data inside the server when we delete the data inside the system's table, right? And what will happen if there is no connectivity with the server?
The forth solution is to write an application in python inside each system which will make a connection to server's database and write the data in real time and if there is no connectivity to the server than it will store the data inside the sys1.table1 or sys2.table2 or whatever the table the data belongs and after the re-connect, the code will send the tables data into server's tables.
Which option will be best according to this scenario? I like the trigger solution best, but is there any way to avoid the data loss in case of dis-connectivity from the server?
I'd go with the fourth solution, or perhaps with the third, as long as it is triggered from outside the database. That way you can easily survive connection loss.
The first solution with triggers has the problems you already detected. It is also a bad idea to start potentially long operations, like data replication across a network of uncertain quality, inside a database transaction. Long transactions mean long locks and inefficient autovacuum.
The second solution may actually also be an option if you you have a recent PostgreSQL versions that supports logical replication. You can use a publication WITH (publish = 'insert,update'), so that DELETE and TRUNCATE are not replicated. Replication can deal well with lost connectivity (for a while), but it is not an option if you want the data at the source to be deleted after they have been replicated.

Build table of tables from other databases in Postgres - (Multiple-Server Parallel Query Execution?)

I am trying to find the best solution to build a database relation. I need something to create a table that will contain data split across other tables from different databases. All the tables got exactly the same structure (same column number, names and types).
In the single database, I would create a parent table with partitions. However, the volume of the data is too big to do it in a single database that's why I am trying to do a split. From the Postgres documentation what I think I am trying to do is "Multiple-Server Parallel Query Execution".
At the moment the only solution I think to implement is to build API of databases address and use it to get data across the network into the main parent database when needed. I also found Postgres external extension called Citus that might do the job but I don't know how to implement the unique key across multiple databases (or Shards like Citus call it).
Is there any better way to do it?
Citus would most likely solve your problem. It lets you use unique keys across shards if it is the distribution column, or if it is a composite key and contains the distribution column.
You can also use distributed-partitioned table in citus. That is a partitioned table on some column (timestamp ?) and hash distributed table on some other column (like what you use in your existing approach). Query parallelization and data collection would be handled by Citus for you.

How does pglogical-2 handle logical replication on same table while allowing it to be writeable on both databases?

Based on the above image, there are certain tables I want to be in the Internal Database (right hand side). The other tables I want to be replicated in the external database.
In reality there's only one set of values that SHOULD NOT be replicated across. The rest of the database can be replicated. Basically the actual price columns in the prices table cannot be replicated across. It should stay within the internal database.
Because the vendors are external to the network, they have no access to the internal app.
My plan is to create a replicated version of the same app and allow vendors to submit quotations and picking items.
Let's say the replicated tables are at least quotations and quotation_line_items. These tables should be writeable (in terms of data for INSERTs, UPDATEs, and DELETEs) at both the external database and the internal database. Hence at both databases, the data in the quotations and quotation_line_items table are writeable and should be replicated across in both directions.
The data in the other tables are going to be replicated in a single direction (from internal to external) except for the actual raw prices columns in the prices table.
The quotation_line_items table will have a price_id column. However, the raw price values in the prices table should not appear in the external database.
Ultimately, I want the data to be consistent for the replicated tables on both databases. I am okay with synchronous replication, so a bit of delay (say, a couple of second for the write operations) is fine.
I came across pglogical https://github.com/2ndQuadrant/pglogical/tree/REL2_x_STABLE
and they have the concept of PUBLISHER and SUBSCRIBER.
I cannot tell based on the readme which one would be acting as publisher and subscriber and how to configure it for my situation.
That won't work. With the setup you are dreaming of, you will necessarily end up with replication conflicts.
How do you want to prevent that data are modified in a conflicting fashion in the two databases? If you say that that won't happen, think again.
I believe that you would be much better off using a single database with two users: one that can access the “secret” table and one that cannot.
If you want to restrict access only to certain columns, use a view. Simple views are updateable in PostgreSQL.
It is possible with BDR replication which uses pglogical. On a basic level by allocating ranges of key ids to each node so writes are possible in both locations without conflict. However BDR is now a commercial paid for product.