According to the PostgreSQL documentation:
Databases are physically separated and access control is managed at the connection level.
Are there any further details on how Postgres achieves this physical isolation? Are the files used to store the data in the backend completely separate?
Yes. Each table is stored as a separate file (actually, multiple files). Different databases are in different directories. Indexes, etc. are also stored in one or more separate files.
However, there's a lot of shared state. Some system tables are shared between all databases. The write-ahead log (WAL) is also shared, as is the commit log (pg_clog, renamed pg_xact in PostgreSQL 10). So you cannot just extract one database's files and attach them to another PostgreSQL instance. They're meaningless without some of the shared files.
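To see which system tables are shared, you can ask the catalog itself. This is a minimal sketch that relies on the `relisshared` flag in `pg_class`; it lists the cluster-wide tables (such as `pg_database`), which are kept under PGDATA/global rather than in any one database's directory.

```sql
-- List the catalogs that every database in the cluster shares.
-- relisshared is true for cluster-wide relations such as pg_database.
SELECT relname
FROM pg_class
WHERE relisshared
  AND relkind = 'r'   -- ordinary tables only; remove this line to include their indexes
ORDER BY relname;
```

Running it in any database of the same cluster should return the same list, which is why a single database directory is not self-contained.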
I'm new to databases and reading the Postgres documentation, it seems to mention that data is stored on disk, which seems to imply that data is only stored on one machine. Is that correct?
Yes, your understanding is correct.
PostgreSQL does not offer a distributed solution (e.g. shared-nothing). There are forks (Greenplum, Postgres-XL) and an extension (Citus) that can distribute storage across multiple servers, but this is not available natively in the "vanilla" PostgreSQL version.
You can access and write data on different Postgres servers through a foreign data wrapper, but that's not exactly the same as a proper distributed solution (e.g. foreign tables don't participate correctly in transactions).
I'd like to preface this by saying I'm not a DBA, so sorry for any gaps in technical knowledge.
I am working within a microservices architecture, where we have about a dozen or so applications, each supported by its own Postgres database instance (which is in RDS, if that helps). Each of the microservices' databases contains a few tables. It's safe to assume that there are no naming conflicts across any of the schemas/tables, and that no data is sharded across the databases.
One of the issues we keep running into is wanting to analyze/join data across the databases. Right now, we're relying on a 3rd Party tool that caches our data and makes it possible to query across multiple database sources (via the shared cache).
Is it possible to create read-replicas of the schemas/tables from all of our production databases and have them available to query in a single database?
Are there any other ways to configure Postgres or RDS to make joining across our databases possible?
Is it possible to create read-replicas of the schemas/tables from all of our production databases and have them available to query in a single database?
Yes, that's possible and it's actually quite easy.
Set up one Postgres server that acts as the master.
For each remote server, create a foreign server, which you then use to create foreign tables that make the data accessible from the master server.
If you have multiple tables on multiple servers that should be viewed as a single table on the master, you can set up inheritance to make all those tables appear as one. If you can define a "sharding" key that identifies a distinct attribute for each of those servers, you can even make Postgres request the data only from the specific server.
All foreign tables can be joined as if they were local tables. Depending on the kind of query, some (or a lot) of the filter and join criteria can even be pushed down to the remote server to distribute the work.
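A minimal sketch of that setup with postgres_fdw, run on the central server. Every server name, host, credential, table, and column here is hypothetical and only illustrates the shape of the statements; adapt them to your own schemas.

```sql
-- Run on the central ("master") server. All names below are hypothetical.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- One foreign server per remote database.
CREATE SERVER orders_srv
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'orders-db.example.internal', port '5432', dbname 'orders');

CREATE SERVER customers_srv
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'customers-db.example.internal', port '5432', dbname 'customers');

-- Credentials the central server uses when connecting to each remote.
CREATE USER MAPPING FOR CURRENT_USER SERVER orders_srv
    OPTIONS (user 'readonly_user', password 'secret');
CREATE USER MAPPING FOR CURRENT_USER SERVER customers_srv
    OPTIONS (user 'readonly_user', password 'secret');

-- Foreign tables mirror the column definitions of the remote tables.
CREATE FOREIGN TABLE orders (
    order_id    bigint,
    customer_id bigint,
    total       numeric
) SERVER orders_srv OPTIONS (schema_name 'public', table_name 'orders');

CREATE FOREIGN TABLE customers (
    customer_id bigint,
    name        text
) SERVER customers_srv OPTIONS (schema_name 'public', table_name 'customers');

-- Join across the two remote databases as if the tables were local.
SELECT c.name, sum(o.total) AS total_spent
FROM customers AS c
JOIN orders AS o USING (customer_id)
GROUP BY c.name;
```

On PostgreSQL 9.5 or later, `IMPORT FOREIGN SCHEMA` can create the foreign tables for a whole remote schema instead of declaring each one by hand. Prefixing the query with `EXPLAIN VERBOSE` shows which parts are sent to each remote server.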
As the Postgres Foreign Data Wrapper is writeable, you can even update the remote tables from the master server.
If the remote access and joins are too slow, you can create materialized views based on the remote tables to keep a local copy of the data. However, this means that the copy is not real-time and you have to manage the regular refresh of the views yourself.
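A sketch of the materialized-view approach, assuming a foreign table named `orders` with a unique `order_id` column (both names are hypothetical):

```sql
-- Local snapshot of the remote data; "orders" is a hypothetical foreign table.
CREATE MATERIALIZED VIEW orders_local AS
SELECT * FROM orders;

-- REFRESH ... CONCURRENTLY requires a unique index on the materialized view.
CREATE UNIQUE INDEX ON orders_local (order_id);

-- Re-pull the remote data; schedule this (e.g. from cron) as often as needed.
-- CONCURRENTLY (PostgreSQL 9.4+) lets readers keep querying during the refresh.
REFRESH MATERIALIZED VIEW CONCURRENTLY orders_local;
```

Queries against `orders_local` then run entirely on local storage, at the cost of seeing data only as fresh as the last refresh.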
Other (more complicated) options are the BDR project or pglogical. It seems that logical replication will be built into the next Postgres version (to be released at the end of this year).
Or you could use a distributed, shared-nothing system like Postgres-XL (which is probably the most complicated system to set up and maintain).
I have a mobile/web project, using pg9.3 as database, and linux as server.
The data won't be huge, but it will grow as time goes on.
With the long term in mind, I want to know:
Questions:
1. Is it necessary for me to create tablespace for my database, or just use the default one?
2. If I create new tablespace, what is the proper location on linux to create the folder, and why?
3. If I don't create it now and wait until I have to, will it be easy to migrate the database and its data to a new tablespace at that point?
Just use the default tablespace; do not create new tablespaces. Tablespaces are only useful if you have multiple physical disks, so that you can define which data is stored on which physical disk. The directory where your data is located is not that important for the workings of Postgres, so if you only have one disk there is no point in using tablespaces.
Should your data grow beyond the capacity of one disk, you will have to perform a full data migration anyway to move it to another physical disk, so you can configure tablespaces at that time.
The idea behind defining which data is located on which disk (with tablespaces) is that you can do things like putting a big table which is hardly used on a slow disk, and putting a very intensively used table on a separate, faster disk. But I assume you're not there yet, so don't overcomplicate things.
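For when that time comes, here is a rough sketch of what moving data to a new tablespace looks like. The tablespace name, mount point, and table/index names are hypothetical.

```sql
-- The directory must already exist, be empty, and be owned by the
-- operating-system user that runs PostgreSQL. The path is hypothetical.
CREATE TABLESPACE fast_disk LOCATION '/mnt/fast_disk/pgdata';

-- Move one heavily used table. This rewrites the table's files and holds an
-- ACCESS EXCLUSIVE lock for the duration, so plan it for a quiet period.
ALTER TABLE busy_table SET TABLESPACE fast_disk;

-- Indexes are not moved along with the table; move them explicitly.
ALTER INDEX busy_table_pkey SET TABLESPACE fast_disk;
```

So waiting is cheap: moving an existing table later is a single statement per object, although the table is unavailable while its files are copied.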
I'm trying to set up a distributed processing environment,
with all of the data sitting in a single shared network drive.
I'm not going to write anything to it, and just be reading from it,
so we're considering write-protecting the network drive as well.
I remember when I was working with MSSQL,
I could back up databases to a DVD and load it directly as a read-only database.
If I can do something like that in Postgres,
I should be able to give it an abstraction like a read-only DVD,
and all will be good.
Is something like this possible in Postgres,
if not, any alternatives? (MySQL? sqlite even?)
Or if that's not possible is there some way to specify a shared file system?
(Make it know that other processes are reading from it as well?)
For various reasons, using a parallel dbms is not possible,
and I need two DB processes running parallel...
Any help is greatly appreciated.
Thanks!!
Write-protecting the data directory will cause PostgreSQL to fail to start, as it needs to be able to write postmaster.pid. PostgreSQL also needs to be able to write temporary files and tablespaces, set hint bits, manage the visibility map, and more.
In theory it might be possible to modify the PostgreSQL server to support running on a read-only database, but right now AFAIK this is not supported. Don't expect it to work. You'll need to clone the data directory for each instance.
If you want to run multiple PostgreSQL instances for performance reasons, having them fighting over shared storage would be counter-productive anyway. If the DB is small enough to fit in RAM it'd be OK ... but in that case it's also easy to just clone it to each machine. If the DB isn't big enough to be cached in RAM then both DB instances would be I/O bottlenecked and unlikely to perform any better than (probably slightly worse than) a single DB not subject to storage contention.
There's some chance that you could get it to work by:
1. Moving the constant data into a new tablespace on read-only shared storage
2. Taking a basebackup of the database, minus the newly separated tablespace for shared data
3. Copying the basebackup of the DB to read/write private storage on each host that'll run a DB
4. Mounting the shared storage and linking the tablespace in place where Pg expects it
5. Starting pg
... at least if you force hint-bit setting and VACUUM FREEZE everything in the shared tablespace first. It isn't supported, it isn't tested, it probably won't work, there's no benefit over running private instances, and I sure as hell wouldn't do it, but if you really insist you could try it. Crashes, wrong query results, and other bizarre behaviour are not unlikely.
I've never tried it, but it may be possible to run postgres with a data dir which is mostly on a RO file system if all your use is indeed read-only. You will need to be sure to disable autovacuum. I think even read activity may generate xlog mutation, so you will probably have to symlink the pg_xlog directory onto a writeable file system. Sometimes read queries will spill to disk for large sorts or other temp requirements, so you should also link base/pgsql_tmp to a writeable disk area.
As Richard points out, there are visibility hint bits in the data heap. You may want to try VACUUM FULL FREEZE ANALYZE on the db before putting it on the RO file system.
"Is something like this possible in Postgres, if not, any alternatives? (MySQL? sqlite even?)"
I'm trying to figure out if I can do this with Postgres as well, to port over a system from SQLite. I can confirm that this works just fine with sqlite3 database files on a read-only NFS share. SQLite works nicely for this purpose.
When done with sqlite, we cut over to a new directory with new sqlite files whenever there are updates. We don't ever insert into the in-use database. I'm not sure if inserts would pose any problems (with either database). Caching read-only data at the OS level could be an issue if another database instance mounted the dir read-write. This is something I would ideally like to be able to do.
I'd like to find the physical locations of the tables, views, functions, and table data of a PostgreSQL installation on Linux. In my scenario, PostgreSQL could be installed either on an SD card or on a hard disk. If my tables, views, functions, and data are on the SD card, I want to find their physical locations so I can merge or copy them onto my hard disk whenever I want to replace the storage. I'm hoping the database is stored as plain files.
Also, is it possible to view the contents of the files? I mean, can I access them?
Kevin and Mike already provided pointers where to find the data directory. For the physical location of a table in the file system, use:
SELECT pg_relation_filepath('my_table');
Don't mess with the files directly unless you know exactly what you are doing.
A database as a whole is represented by a subdirectory in PGDATA/base.
If you use tablespaces it gets more complicated. Read details in the chapter Database File Layout in the manual:
For each database in the cluster there is a subdirectory within
PGDATA/base, named after the database's OID in pg_database. This
subdirectory is the default location for the database's files; in
particular, its system catalogs are stored there.
...
Each table and index is stored in a separate file. For ordinary
relations, these files are named after the table or index's filenode
number, which can be found in pg_class.relfilenode.
...
The pg_relation_filepath() function shows the entire path (relative to
PGDATA) of any relation.
The manual about the function pg_relation_filepath().
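To tie the quoted passage together, these queries show how the directory and file names are derived from the catalogs. `my_table` is a placeholder for one of your own tables.

```sql
-- The current database's directory under PGDATA/base is named after its OID.
SELECT oid AS database_dir, datname
FROM pg_database
WHERE datname = current_database();

-- The file for a table is named after its filenode; pg_relation_filepath()
-- returns the complete path relative to PGDATA.
SELECT relname,
       relfilenode,
       pg_relation_filepath(oid::regclass) AS path_relative_to_pgdata
FROM pg_class
WHERE relname = 'my_table';   -- placeholder table name
```

For a table in the default tablespace the path has the form base/<database OID>/<filenode>.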
The query show data_directory; will show you the main data directory. But that doesn't necessarily tell you where things are stored.
PostgreSQL lets you define new tablespaces. A tablespace is a named directory in the filesystem. PostgreSQL lets you store individual tables, indexes, and entire databases in any permissible tablespace. So if a database were created in a specific tablespace, I believe none of its objects would appear in the data directory.
For solid run-time information about where things are stored on disk, you'll probably need to query pg_database, pg_tablespace, or pg_tables from the system catalogs. Tablespace information might also be available in the information_schema views.
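A sketch of the catalog queries that paragraph refers to. It only reads system catalogs, so it is safe to run on any database.

```sql
-- Main data directory of the cluster.
SHOW data_directory;

-- Every tablespace and its directory. The built-in pg_default and pg_global
-- tablespaces return an empty string because they live inside the data directory.
SELECT spcname, pg_tablespace_location(oid) AS location
FROM pg_tablespace;

-- Default tablespace of each database.
SELECT d.datname, t.spcname AS default_tablespace
FROM pg_database AS d
JOIN pg_tablespace AS t ON t.oid = d.dattablespace;

-- Tablespace of each user table; NULL means the database's default tablespace.
SELECT schemaname, tablename, tablespace
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema');
```

`pg_tablespace_location()` is available from PostgreSQL 9.2 onwards.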
But for merging or copying to your hard disk, using these files is almost certainly a Bad Thing. For that kind of work, pg_dump is your friend.
If you're talking about copying the disk files as a form of backup, you should probably read this, especially the section on Continuous Archiving and Point-in-Time Recovery (PITR):
http://www.postgresql.org/docs/current/interactive/backup.html
If you're thinking about trying to directly access and interpret data in the disk files, bypassing the database management system, that is a very bad idea for a lot of reasons. For one, the storage scheme is very complex. For another, it tends to change in every new major release (issued once per year). Thirdly, the ghost of E.F. Codd will probably haunt you; see rules 8, 9, 11, and 12 of Codd's 12 rules.