Does a Postgres Server only store data on one machine? - postgresql

I'm new to databases. Reading the Postgres documentation, it seems to say that data is stored on disk, which seems to imply that data is only stored on one machine. Is that correct?

Yes, your understanding is correct.
PostgreSQL does not offer a distributed (e.g. shared-nothing) solution. There are forks (Greenplum, Postgres-XL) and extensions (Citus) that can distribute storage across multiple servers, but this is not available natively in the "vanilla" PostgreSQL version.
You can access and write data on different Postgres servers through a foreign data wrapper, but that's not exactly the same as a proper distributed solution (e.g. foreign tables don't participate correctly in transactions).
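For illustration, a minimal postgres_fdw sketch (the server name, host, credentials, and table names here are invented) that makes a table on another server queryable as if it were local:
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
-- Register the remote server and how to authenticate against it
CREATE SERVER other_server FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'db2.example.com', dbname 'appdata', port '5432');
CREATE USER MAPPING FOR CURRENT_USER SERVER other_server
    OPTIONS (user 'remote_user', password 'remote_password');
-- Expose one remote table locally; it can then be queried like a local table
CREATE FOREIGN TABLE remote_orders (
    id     bigint,
    amount numeric
) SERVER other_server OPTIONS (schema_name 'public', table_name 'orders');
SELECT count(*) FROM remote_orders;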

Related

Postgres architecture for one machine with several apps

I have one machine on which several applications are hosted. Applications work on separated data and don't interact - each application only needs access to its own data. I want to use PostgreSQL as RDBMS. Which one of the following is best and why?
One global Postgres sever, one global database, one schema per application.
One global Postgres server, one database per application.
One Postgres server per application.
Feel free to suggest additional architectures if you think they would be better than the ones above.
The question you need to ask yourself: does any application ever need to access data from another application (in the same SQL statement)? If you can answer that with a clear NO, then you should at least go for separate databases. Cross-database queries aren't that straightforward in Postgres, so if the different applications do need a lot of data from other applications, then solution 1 might be the deployment layout to think about. If this would only concern very few tables, then using foreign data wrappers with different databases might still be a better solution.
Solutions 2 and 3 are more or less the same from the perspective of each application. One thing to keep in mind when deciding between 2 and 3 is availability: some configuration changes to Postgres require a restart of the service. Is an outage of all applications acceptable in that case, even though the change was only necessary for one?
But you can always start with option 2 and then move databases to different servers later.
Another question to ask is whether all applications always use the same (major) Postgres version. With solution 2 you must make sure that all applications are compatible with a new Postgres version if one of them wants to upgrade, e.g. because of new features that the application wants to use.
Solution 1 is stupid: a SQL schema is not a database. Use SQL schemas for one application that has multiple "parts" like "production", "sales", "marketing", "finances"...
As long as the final volume of data won't be too heavy and the number of users won't be too high, use only one PG cluster to facilitate administration tasks.
If the volume of data or the number of users increases, it will be time to separate your different databases onto new, distinct PG clusters...
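As a hedged illustration of solutions 2 and 3 (role, database, and password names are made up), each application gets its own database owned by its own role, so the data stays separated regardless of whether the databases live on one cluster or several:
CREATE ROLE app_billing LOGIN PASSWORD 'change_me';
CREATE DATABASE billing OWNER app_billing;
CREATE ROLE app_reporting LOGIN PASSWORD 'change_me';
CREATE DATABASE reporting OWNER app_reporting;
-- Optionally keep every other role out of each application's database entirely
REVOKE CONNECT ON DATABASE billing FROM PUBLIC;
REVOKE CONNECT ON DATABASE reporting FROM PUBLIC;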

DBLINK vs Postgres_FDW, which one may provide better performance?

I have a use case to distribute data across many databases on many servers, all in postgres tables.
From any given server/db, I may need to query another server/db.
The queries are quite basic, standard selects with where clauses on standard fields.
I have currently implemented postgres_fdw (I'm using Postgres 9.5), but I think the queries are not using indexes on the remote db.
For this use case (a random node may query N other nodes), which is likely my best performance choice based on how each underlying engine actually executes?
The Postgres foreign data wrapper (postgres_fdw) is newer to PostgreSQL, so it tends to be the recommended method. While the functionality in the dblink extension is similar to that of the foreign data wrapper, the foreign data wrapper is more SQL standard compliant and can provide improved performance over dblink connections.
Read this article for more detailed info: Cross Database querying
My solution was simple: I upgraded to Postgres 10, and it appears to push where clauses down to the remote server.
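To make the difference concrete (connection details and table names below are invented, and users_remote is assumed to be a foreign table): with dblink the remote statement is an opaque string you write yourself, whereas with postgres_fdw you query a foreign table and the planner can push the WHERE clause down to the remote server, which lets the remote side use its indexes:
-- dblink: the remote query is shipped as a literal string (requires CREATE EXTENSION dblink)
SELECT * FROM dblink('host=db2 dbname=appdata user=repuser',
                     'SELECT id, name FROM users WHERE id = 42')
       AS t(id int, name text);
-- postgres_fdw: query the foreign table; EXPLAIN VERBOSE shows the "Remote SQL"
-- actually sent, including the pushed-down WHERE clause
EXPLAIN VERBOSE SELECT id, name FROM users_remote WHERE id = 42;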

How to create read replicas from multiple postgres databases into a single database?

I'd like to preface this by saying I'm not a DBA, so sorry for any gaps in technical knowledge.
I am working within a microservices architecture, where we have about a dozen or so applications, each supported by its own Postgres database instance (which is in RDS, if that helps). Each of the microservices' databases contains a few tables. It's safe to assume that there are no naming conflicts across any of the schemas/tables, and that there's no sharding of any data across the databases.
One of the issues we keep running into is wanting to analyze/join data across the databases. Right now, we're relying on a 3rd Party tool that caches our data and makes it possible to query across multiple database sources (via the shared cache).
Is it possible to create read-replicas of the schemas/tables from all of our production databases and have them available to query in a single database?
Are there any other ways to configure Postgres or RDS to make joining across our databases possible?
Is it possible to create read-replicas of the schemas/tables from all of our production databases and have them available to query in a single database?
Yes, that's possible and it's actually quite easy.
Set up one Postgres server that acts as the master.
For each remote server, create a foreign server, which you then use to create foreign tables that make the data accessible from the master server.
If you have multiple tables on multiple servers that should be viewed as a single table in the master, you can set up inheritance to make all those tables appear like one. If you can define a "sharding" key that identifies a distinct attribute between those servers, you can even make Postgres request the data only from the specific server.
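A rough sketch of that inheritance setup (the servers shard1 and shard2 and the table layout are assumptions for illustration): a plain local parent table with one foreign child per remote server, each child carrying a CHECK constraint on the sharding key so that constraint exclusion can skip servers that cannot contain the requested rows:
-- Local parent table on the master
CREATE TABLE measurements (
    shard_id int NOT NULL,
    ts       timestamptz,
    value    numeric
);
-- One foreign child per remote server, constrained on the sharding key
CREATE FOREIGN TABLE measurements_s1 (CHECK (shard_id = 1))
    INHERITS (measurements) SERVER shard1 OPTIONS (table_name 'measurements');
CREATE FOREIGN TABLE measurements_s2 (CHECK (shard_id = 2))
    INHERITS (measurements) SERVER shard2 OPTIONS (table_name 'measurements');
-- With constraint exclusion (the default "partition" setting covers inheritance),
-- this query only contacts the server behind measurements_s1:
SELECT * FROM measurements WHERE shard_id = 1;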
All foreign tables can be joined as if they were local tables. Depending on the kind of query, some (or a lot) of the filter and join criteria can even be pushed down to the remote server to distribute the work.
As the Postgres Foreign Data Wrapper is writeable, you can even update the remote tables from the master server.
If the remote access and joins are too slow, you can create materialized views based on the remote tables to create a local copy of the data. This, however, means that it's not a real-time copy and you have to manage the regular refresh of the tables.
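For example (the foreign table remote_orders and its id column are placeholders), a local snapshot of a foreign table that you refresh on whatever schedule you can tolerate:
CREATE MATERIALIZED VIEW orders_snapshot AS
    SELECT * FROM remote_orders;
-- A unique index lets the view be refreshed without locking out readers
CREATE UNIQUE INDEX ON orders_snapshot (id);
-- Run periodically, e.g. from cron:
REFRESH MATERIALIZED VIEW CONCURRENTLY orders_snapshot;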
Other (more complicated) options are the BDR project or pglogical. It seems that logical replication will be built into the next Postgres version (to be released at the end of this year).
Or you could use a distributed, shared-nothing system like Postgres-XL (which is probably the most complicated system to set up and maintain).

Is it possible to archive WAL files for one PostgreSQL database within a single instance or must I create a second instance?

I run a couple of PostgreSQL databases (9.3), one of which does not need archiving and the other of which I'd rather run in WAL archive mode but can get away with not.
I now have a need for data which is archived.
As far as I can tell the setting is on an instance basis, so I wouldn't be able to just choose which databases to archive and which not, which would indicate that I will need to create a new PostgreSQL instance.
Am I missing something?
Also, FWIW, will I be able to create database links between databases on the two instances?
You cannot choose a single database for archiving - only all databases (or none) in a PostgreSQL instance can be archived. There is no other possibility at the moment.
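The relevant settings live in postgresql.conf and apply to the entire instance; a minimal sketch (the archive directory is just a placeholder):
archive_mode = on                                # changing this requires a server restart
archive_command = 'cp %p /mnt/wal_archive/%f'    # %p = path of the WAL file, %f = its file name
wal_level = archive                              # needed for archiving on 9.3 ("replica" on 9.6+)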
You can send queries to another PostgreSQL instance via the dblink extension or with the Foreign Data Wrapper API. The FDW API should be preferred, although dblink still has some uses.

Replicate selected postgresql tables between two servers?

What would be the best way to replicate individual DB tables from a master PostgreSQL server to a slave machine? It can be done with cron+rsync, or with whatever PostgreSQL might have built in, or some sort of OSS tool, but so far the Postgres docs don't seem to cover how to do table replication. I'm not able to do a full DB replication because some tables have license->IP stuff connected to them, and I can't replicate those on the slave machine. I don't need instant replication; hourly would be acceptable as well.
If I need to just rsync, can someone help identify what files within the /var/lib/pgsql directory would need to be synced, or how I would know what tables they are?
Starting with Postgres 10, logical replication is built into Postgres! This is often a better solution than external tools. The Postgres docs are great and easy to follow. See the quick setup docs, which in essence boil down to running this:
-- On publisher DB
CREATE PUBLICATION mypub FOR TABLE users, departments;
-- On subscriber DB
CREATE SUBSCRIPTION mysub CONNECTION 'dbname=foo host=bar user=repuser' PUBLICATION mypub;
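One extra step the snippet above assumes: the publisher must run with wal_level set to logical (a restart is required after changing it), for example:
-- On the publisher, before creating the publication (PostgreSQL 10+):
ALTER SYSTEM SET wal_level = 'logical';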
You might want to try Bucardo, which is open source software to synchronize rows between tables even if they are in a remote location. It's a very simple piece of software, and it is capable of creating one-way synchronization relationships as well.
Check out http://bucardo.org/wiki/Bucardo
You cannot get anything useful by copying individual table files in the data directory. If you want to replicate selected tables, there are a number of good options.
http://wiki.postgresql.org/wiki/Replication,_Clustering,_and_Connection_Pooling