Is Postgres designed to write to shared data stores?

Here's an important tip about volume sharing in Docker:
Multiple containers can also share one or more data volumes. However,
multiple containers writing to a single shared volume can cause data
corruption. Make sure your applications are designed to write to
shared data stores.
In this context, is Postgres designed to write to shared data stores?
In other words, is it safe to run multiple Postgres containers (possibly with different minor versions) working with the same database files located on a shared data volume?

You cannot have multiple PostgreSQL installations running against the same shared data files; this is a sure recipe for data corruption.
If your need is to upgrade PostgreSQL without downtime, you'll need to use a replication solution that works between different major PostgreSQL versions, so that you can first build a copy of the database with the new version and then switch over quickly in a controlled fashion. This still causes a small outage that has to be handled by the application.
Replication solutions that can be used are external replication tools like Slony-I, or logical replication. Logical replication is fairly new; it will ship with PostgreSQL v10 (which won't help with your current upgrade problem), but you can use it via the pglogical extension from PostgreSQL 9.4 on.
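For orientation, once both ends run a version with native logical replication (v10 or later), the setup looks roughly like the sketch below; pglogical on 9.4+ follows the same publish/subscribe idea through its own functions. The host, database, and user names are placeholders, and the publisher must have wal_level = logical:
# on the old (publisher) server
psql -d mydb -c "CREATE PUBLICATION upgrade_pub FOR ALL TABLES;"
# copy only the schema to the new server first
pg_dump -s mydb | psql -h newhost -d mydb
# on the new (subscriber) server: copies the existing data, then streams changes
psql -h newhost -d mydb -c "CREATE SUBSCRIPTION upgrade_sub CONNECTION 'host=oldhost dbname=mydb user=repluser' PUBLICATION upgrade_pub;"
Once the subscriber has caught up, you point the application at the new server and drop the subscription.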

Related

How to change default location for storing Rundeck Projects

I am using the Rundeck Docker container. It was running well for 2 months and then suddenly crashed. I lost all the data for a project that I had created using the CLI. Is there a way to change the default path used to store all project-related data, including job definitions, resources, etc.?
Out of the box, Rundeck stores project/job data in its internal H2 database. This database is only for testing purposes and will probably fail with a lot of data (storing projects on the filesystem is deprecated right now). The best approach is to use a "real" database like MySQL, PostgreSQL, or Oracle; that way Rundeck stores all project/job data on a robust backend.
Check these MySQL, PostgreSQL, and Oracle Docker environment examples.
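For a rough idea of what that looks like with the official rundeck/rundeck image, the sketch below points it at a PostgreSQL container; the environment variable names follow the linked examples as far as I recall them, and the host, database, and credentials are placeholders, so check the examples for your image version:
# hypothetical values; 'mypostgres' is a PostgreSQL container with a 'rundeck' database already created
docker run -d --name rundeck -p 4440:4440 \
  -e RUNDECK_DATABASE_DRIVER=org.postgresql.Driver \
  -e RUNDECK_DATABASE_URL=jdbc:postgresql://mypostgres:5432/rundeck \
  -e RUNDECK_DATABASE_USERNAME=rundeck \
  -e RUNDECK_DATABASE_PASSWORD=secret \
  rundeck/rundeck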
Of course, having a backup policy for your instance would be ideal to keep all your instance data safe.

pg_dump from Azure Postgres Service which contains large data set

We are facing the well-known pg_dump efficiency problems in terms of speed. We currently have an Azure-hosted PostgreSQL instance, which holds our resources that are being created/updated by SmileCDR. After three months it has grown considerably due to the FHIR objects being saved. Now we want a brand new environment; in that case the persistent data in PostgreSQL has to be extracted and a new database has to be initialized with the old data set.
Please advise:
pg_dump consumes a lot of time, almost a day. How can we speed up the backup/restore process?
What alternatives to pg_dump could we use to achieve the goal?
Important notes:
Flyway is used by SmileCDR for schema versioning in PostgreSQL.
Everything has to be copied from the old database to the new one.
PostgreSQL version is 11, with 2 vCores and 100 GB storage.
FHIR objects are being kept in PostgreSQL.
Suggestions like multiple jobs, no compression, and the directory format have been tried, but they didn't help significantly.
Since you put yourself in the cage of database hosting, you have no alternative to pg_dump. They make it deliberately hard for you to get your data out.
The best you can do is a directory format dump with many processes; that will read data in parallel wherever possible:
pg_dump -F d -j 6 -f backupdir -h ... -p ... -U ... dbname
In this example, I specified 6 processes to run in parallel. This will speed up processing unless you have only one large table and all the others are quite small.
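The restore side can be parallelized the same way against the new server; the connection details and target database name below are placeholders:
# restore the directory-format dump with 6 parallel jobs into the new, empty database
pg_restore -F d -j 6 -h ... -p ... -U ... -d dbname backupdir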
Alternatively, you may use smileutil with the synchronize-fhir-servers command, the system-level bulk export API, or the subscription mechanism. Just a warning that these options may be too slow to migrate a 100 GB database.
As you mentioned, if you can replicate the Azure VM, that may be the fastest option.

Using multiple PostgreSQL servers with a single shared network data directory?

How does PostgreSQL handle running multiple servers on different machines using a shared data directory? Does it automatically handle this under the hood without problems? Is it possible, but requiring some special configuration? Or is this a bad idea in general?
I'm doing some data science on a high-performance machine cluster, where I submit jobs, the job is run by a random machine, and each machine has access to a shared network drive. Currently, I'm using SQLite, where this use case works fine. A single shared SQLite database file can handle multiple connections from different machines without trouble.
I'm now attempting to switch over to PostgreSQL. Intercommunication between the machines of the cluster is surprisingly not straightforward. So while the obvious solution would be to have one server that all the other machines connect to, this might not end up being practical. Ideally, I could just continue doing what I've been doing with the SQLite setup: have each machine run its own PostgreSQL server, which then connects to the shared databases.
No, no, no and yes.
A PostgreSQL installation ("cluster" is the term used in the manuals) expects to be in charge of all of its files. It carefully coordinates access between the multiple processes accessing those files. You are supposed to access PostgreSQL in a client/server manner over a socket (a Unix socket if local, TCP if not).
This is not supported with PostgreSQL. It will lead to corruption and data loss. If you can't simplify your networking, then you'd best stick to SQLite. (Assuming it is actually safe with SQLite, something I haven't verified.)

Do I gain anything by using "proper" replicas for a read-only MongoDB database?

I have a web-app that depends on a read-only MongoDB database. Through trial and error, I discovered that by far the fastest way to run the ETL pipeline that populates the database is to run a local copy of MongoDB, populate the database, stop the database, and tarball the state directory.
To deploy a high-availability "cluster," I create multiple instances (or containers) running the app, each with access to a copy of the state in locally mounted storage. Putting these behind a load balancer with regular health checks and autoscaling (or in a Kubernetes cluster as a ReplicaSet), I get isolation, redundancy, easy rollbacks (using versioned storage), and easy setup in virtually any environment.
The key idea here is that because the database is read-only, it is in a sense a "stateless" application. Thus, I can treat it like any other static provider of information.
There are many apparent advantages to this setup. Nevertheless, I have always had a nagging feeling that I was missing something. Given a read-only context, is there still some reason why it might be better to run a "proper" MongoDB cluster?
If you don't mind outages when the single node goes down, and you don't mind taking the system down during upgrades, then this is probably an OK deployment. You might get a safer dump and restore using mongodump and mongorestore rather than tar, but apart from that this setup should work for a read-only deployment.
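If you do switch from tar to the dump tools, a minimal sketch looks like this (the database name and paths are placeholders):
# dump the freshly populated database from the local ETL mongod
mongodump --db mydb --out /backup/mydb-dump
# restore it into the mongod inside each instance or container at build/startup time
mongorestore --db mydb /backup/mydb-dump/mydb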

How to quickly mirror a Mongo database?

What's a quick and efficient way to transfer a large Mongo database?
I want to transfer a 10GB production Mongo 3.4 database to a staging environment for testing. I used the mongodump/mongorestore tools to test this transfer to my localhost, but it took over 8 hours and consumed a massive amount of CPU and memory, which is something I'd like to avoid in the future. The database doesn't have any indexes, so the mongodump option to exclude indexes doesn't increase performance.
My staging environment will mostly be read-only, but it will still need to write occasionally, so it can't be set up as a permanent read replica of production.
I've read about replica sets, but they seem very complicated to set up and designed for permanent mirroring of a primary to two or more secondaries. I've read some posts about people hacking this to be temporary, so they can do a one-time mirroring, but I can't find any reliable documentation since this isn't the intended usage of the feature. All the guides I've read also say you need at least 3 servers, which seems unintuitive since I only have 2 (production and staging) and don't want to create a third.
Several options exist today (2020-05-06).
Copy Data Directory
If you can take the system offline, you can copy the data directory from one host to another, then set the configuration to point to this directory and start up the new mongod.
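A minimal sketch of that, assuming a typical Linux install with the default data path (adjust to your storage.dbPath and service name):
# on the old host: stop mongod cleanly so the data files are consistent
sudo systemctl stop mongod
# copy the data directory to the new host ('newhost' and the path are placeholders)
rsync -a /var/lib/mongodb/ newhost:/var/lib/mongodb/
# on the new host: point mongod at the copied directory and start it
mongod --dbpath /var/lib/mongodb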
Mongomirror
Mongomirror (https://docs.atlas.mongodb.com/import/mongomirror/) is intended as a tool to migrate from on-premises to Atlas, but it can be leveraged to copy data to another on-premises host. Beware: the connection requires SSL configuration on both source and target for the transfer.
Replicaset
MongoDB has built-in high-availability features using a replica set model (https://docs.mongodb.com/manual/tutorial/deploy-replica-set/). It is not overly complicated and works very well. This option allows the original system to stay online while replication does its magic. Once the replication completes, reconfigure the replica set to refer only to the new host and shut down the original host. This configuration is referred to as a single-node replica set. Having a single-node replica set offers benefits over a stand-alone installation in that the replica set underpinnings (the oplog) are the basis for other features such as change streams (https://docs.mongodb.com/manual/changeStreams/).
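A rough sketch of that flow, with the host names, port, and set name as placeholders and details like authentication and bind addresses left out:
# start both mongods with the same replica set name
mongod --replSet migrate0 --dbpath /data/db
# on the original host: create the set and add the new host as a member
mongo --eval 'rs.initiate()'
mongo --eval 'rs.add("newhost:27017")'
# wait until rs.status() shows the new member as SECONDARY (initial sync done)
mongo --eval 'rs.status()'
# hand over the primary role, then remove the old member from the new primary
mongo --eval 'rs.stepDown()'
mongo --host newhost:27017 --eval 'rs.remove("oldhost:27017")'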
Backup and Restore
As you mentioned you can use mongodump/mongorestore. There is a point in time where the backup must be restored. During this time it is expected the original system is offline and not accepting any additional writes. This method is robust but has downtime associated with it. You could use mongoexport/mongoimport to use a JSON file as an intermediate step but this is not recommended as BSON data types could be lost in translation.
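If some downtime is acceptable, the dump can also be streamed straight from one host into the other without an intermediate file; the host names here are placeholders, and --drop replaces any collections that already exist on the target:
mongodump --host prod-host --archive --gzip | mongorestore --host staging-host --archive --gzip --drop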
Per the Mongo documentation, you should be able to cp/rsync the files to create a backup (if you are able to halt write operations temporarily on your production setup, or if you do this during a maintenance window):
https://docs.mongodb.com/manual/core/backups/#back-up-by-copying-underlying-data-files
Back Up with cp or rsync
If your storage system does not support snapshots, you can copy the files directly using cp, rsync, or a similar tool. Since copying multiple files is not an atomic operation, you must stop all writes to the mongod before copying the files. Otherwise, you will copy the files in an invalid state.
Backups produced by copying the underlying data do not support point in time recovery for replica sets and are difficult to manage for larger sharded clusters. Additionally, these backups are larger because they include the indexes and duplicate underlying storage padding and fragmentation. mongodump, by contrast, creates smaller backups.
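A short sketch of that approach against a standalone mongod (the data path is a placeholder; check storage.dbPath in mongod.conf):
# block writes and flush to disk without shutting the server down
mongo --eval 'db.fsyncLock()'
# copy the frozen data files to a backup location
rsync -a /var/lib/mongodb/ /backup/mongo-copy/
# re-enable writes
mongo --eval 'db.fsyncUnlock()'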
FYI - for replica sets, the third "server" is an arbiter which exists to break the tie when electing a new primary. It does not consume as many resources as the primary/secondaries. Since you are looking to create a staging environment, I would not recommend creating a replica set that includes the production and staging environments. Your primary could fail over to the staging instance, and clients that are meant to access the production instance would end up reading from and writing to the staging instance.