Using multiple PostgreSQL servers with a single shared network data directory? - postgresql

How does PostgreSQL handle running multiple servers on different machines using a shared data directory? Does it automatically handle this under-the-hood without problems? Is it possible, but requiring some special configuration? Or is this a bad idea in general?
I'm doing some data science on a high-performance machine cluster, where I submit jobs, each job is run by an arbitrary machine, and every machine has access to a shared network drive. Currently, I'm using SQLite, where this use case works fine. A single shared SQLite database file can handle multiple connections from different machines without trouble.
I'm now attempting to switch over to PostgreSQL. Intercommunication between the machines of the cluster is surprisingly not straightforward. So while the obvious solution would be to have one server which all the other machines connect to, this might not end up being practical. Ideally, I could just continue doing what I've been doing with the SQLite setup: that is, have each machine run its own PostgreSQL server, which then connects to the shared databases.

No, no, no and yes.
A PostgreSQL installation ("cluster" is the term used in the manuals) expects to be in charge of all of its files. It carefully coordinates access between the multiple processes accessing those files. You are supposed to access PostgreSQL in a client/server manner over a socket (a Unix socket if local, TCP if not).
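For illustration, assuming one designated machine runs the server and every cluster job can reach it over the network (the hostname, database name and credentials below are made up), each job would connect as a client instead of opening the data files:

```python
# Minimal sketch: each cluster job talks to one central PostgreSQL server
# over TCP rather than touching the shared data directory.
# "db-head-node", the database name and the credentials are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="db-head-node",  # the single machine running the PostgreSQL server
    port=5432,
    dbname="experiments",
    user="worker",
    password="secret",
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM results;")  # "results" is a made-up table
    print(cur.fetchone()[0])

conn.close()
```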

This is not supported with PostgreSQL. It will lead to corruption and data loss. If you can't simplify your networking, then you'd best stick to SQLite (assuming it is actually safe with SQLite, something I haven't verified).

Related

Can two postgres services share one common PGDATA folder, one at a time?

Can I share data between two postgres services on separate machines (the PGDATA folder will be in a shared location) while only one service runs at a time?
PostgreSQL has a number of ways to make sure that you cannot start two postmaster processes on the same data directory, but if you mount the same filesystem on two machines, these mechanisms will fail. So you would have to make very sure that you don't start servers on both machines; that would lead to data corruption. Moreover, you'd have to make sure that the remote file system is reliable. A Windows network share isn't, for example.
So, all in all, my only recommendation is "don't do that". For high availability, use a proven shared-nothing architecture like Patroni.

PostgreSQL Multi-master Synchronisation

I have a scenario as follows:
One cloud server is running an application with PGSQL as DB
Multiple local servers are running the same application with PGSQL as DB
User may access the cloud server to read/write data
User may access any of the local servers to read/write data
What I need is synchronisation between all these databases. The synchronisation can be done live if connectivity is available, or as soon as connectivity is restored.
Please guide me with some inputs on where I can start.
Rethink your requirements.
Multimaster replication is full of pitfalls, and it is easy to get your databases out of sync unless you plan carefully. You'd probably be better off with a single master node.
That said, you could look at BDR by 2ndQuadrant, which provides such functionality.

Is Postgres designed to write to shared data stores?

Here's an important tip about volume sharing in Docker:
Multiple containers can also share one or more data volumes. However, multiple containers writing to a single shared volume can cause data corruption. Make sure your applications are designed to write to shared data stores.
In this context, is Postgres designed to write to shared data stores?
In other words, is it safe to run multiple Postgres containers (possibly with different minor versions) working with same database files located at the data volume?
You cannot have multiple PostgreSQL installations running against the same shared data files; this is a sure recipe for data corruption.
If your need is to update PostgreSQL without downtime, you'll need to use a replication solution that works between different major PostgreSQL versions, so that you can first build a copy of the database with the new version and then switch over quickly in a controlled fashion. This still causes a small outage that has to be handled by the application.
Replication solutions that can be used are external replication tools like Slony-I or logical replication. Logical replication is fairly new: it will ship with PostgreSQL v10 (which won't help you with a current upgrade problem), but you can use it with pglogical from PostgreSQL 9.4 on.
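As a rough sketch of how the built-in variant is wired up once both ends support it (host names, database name and credentials below are placeholders):

```python
# Hedged sketch: built-in logical replication between an old and a new
# PostgreSQL instance, once both ends support it. Host names, database,
# users and passwords are placeholders; the source also needs
# wal_level = logical, and the schema must already exist on the target
# (e.g. restored with pg_dump --schema-only).
import psycopg2

# On the old (source) server: publish all tables.
src = psycopg2.connect(host="old-server", dbname="app", user="admin", password="secret")
src.autocommit = True
with src.cursor() as cur:
    cur.execute("CREATE PUBLICATION upgrade_pub FOR ALL TABLES;")

# On the new (target) server: subscribe; the initial data is copied, then
# changes stream in until you are ready to switch the application over.
dst = psycopg2.connect(host="new-server", dbname="app", user="admin", password="secret")
dst.autocommit = True  # CREATE SUBSCRIPTION cannot run inside a transaction block
with dst.cursor() as cur:
    cur.execute("""
        CREATE SUBSCRIPTION upgrade_sub
        CONNECTION 'host=old-server dbname=app user=replicator password=secret'
        PUBLICATION upgrade_pub;
    """)
```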

Learning NoSQL databases using a single machine?

For relational databases I would just pull up a W3Schools tutorial, install MySQL on my machine, and start practicing! How can I learn non-relational databases in a similar way? In most tutorials I read that these databases work with multiple nodes and data centers.
Does this mean that I will be unable to learn and practice, say, Cassandra, using my own single PC?
You do it just like you do with MySQL: you set up a database on your local machine and start experimenting.
Most database systems which focus on sharding and clustering also work as a stand-alone instance. But when you want to test these features specifically, you can often run multiple instances on the same machine. When you also want to try how they behave when they run on different machines, you can use virtualization software like VMware or VirtualBox to set up a bunch of virtual machines and build your virtual datacenter on your desktop.
(I would recommend VMware for business use and VirtualBox for home use.)
I'm a big fan of MongoDB. It's the NoSQL equivalent of MySQL.
Go to the Try It Out link on their home page and you can actually use it in a live session on their website - no download, no configuration, no hassle! Just use it and learn the basics.
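Once you move past the web sandbox and run a local MongoDB instance on the default port, a first session with the Python driver might look like this (the database and collection names are made up):

```python
# Minimal sketch: talking to a single local MongoDB instance with pymongo.
# Assumes mongod is already running on localhost:27017; names are made up.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["playground"]

# Insert a document and read it back.
db.users.insert_one({"name": "alice", "visits": 3})
print(db.users.find_one({"name": "alice"}))

# A simple update and query, enough to get a feel for the document model.
db.users.update_one({"name": "alice"}, {"$inc": {"visits": 1}})
for doc in db.users.find({"visits": {"$gte": 1}}):
    print(doc)
```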
Here's the quick start for Cassandra: http://wiki.apache.org/cassandra/GettingStarted
I don't see any reason you couldn't run that from localhost. I think the point is that you can scale these NoSQL solutions. You might want to check out MongoDB or CouchDB as well. Both are easy to set up and are great NoSQL solutions in my experience.
I would strongly suggest using something like Amazon EC2 for testing NoSQL solutions. You can certainly install a technology like MongoDB locally and create a replica set, but you should put these on different physical machines if you can.
I have installed things like AppFabric, Couchbase and Mongo locally and created clusters, and they always work really well. It's very easy because the networking part of it always goes smoothly.
Once you introduce two physical machines and a real network partition, things get difficult.
You can create instances on EC2 for free, last I checked, if you use their Micro instances. You'll learn a lot.

What's the memcached server?

I'm a beginner learning memcached. The memcached server confused me most. Can I see it as a single server computer, just like a web server? I'm also confused about the relationship between the memcached server and client: are they located on different computers?
I agree with most of what #phihag has answered, but I must clarify a few things.
Memcached stores data according to a key (phihag called it an id, not to be confused with database ids). Data can be of various sizes, so you can store small bits (like one record pulled from the database) or huge chunks of data (like hundreds of records, or entire finished HTML pages).
Memcached is not typically used on the same machine as the application server, because it is designed to be accessed via TCP (it would be accessible via Unix sockets if it were designed to work on the same server) and because it was designed as a pooling server.
The pooling part is interesting: you can have 10 machines running Memcached, each allocating a maximum of 10 GB of RAM for this purpose. 10 * 10 = 100 GB of cache space.
When you write a value into Memcached, only one of the servers (chosen by the client, typically via a hashing algorithm) stores it. When you try to read a value from Memcached, only the server that stored it will send it to you.
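To illustrate, here is a rough Python sketch using the pymemcache client; the node addresses are placeholders, and it is the client library that hashes each key onto exactly one node:

```python
# Hedged sketch: a client-side pool of memcached servers.
# The client hashes each key to pick one server from the list;
# the hostnames below are placeholders.
from pymemcache.client.hash import HashClient

client = HashClient([
    ("cache-node-1", 11211),
    ("cache-node-2", 11211),
    ("cache-node-3", 11211),
])

# This key lands on exactly one of the three nodes...
client.set("user:42:profile", b"serialized profile data", expire=300)

# ...and only that node is asked when the key is read back.
print(client.get("user:42:profile"))
```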
So indeed you can put the database, Memcached, application and file server all on the same machine, and typically you do that for your development sandbox. But you can also put each on a separate machine, or use any other combination of the two.
If you only need one Memcached server, you will probably be OK with hosting it on the same machine as the application code.
If you start using a front-end cache server such as Varnish, or you configure Nginx as one, you will have to configure several Memcached servers to store the data that these front-end caches are caching.
If you distribute your database across multiple servers and your file servers into a CDN, that means your application handles a lot of data in a short period of time, so you'll need more RAM space than could be available in one single application server.
And since extending a Memcached memory pool is as easy as adding the new server's IP to the list, you will be scaling horizontally across many servers (which is Memcached's typical use).
The memcached server is a program which manages the data that memcached stores (not to be confused with a machine, which may also be called a server). In theory, it can run on any computer. However, it is typically run on the same machine that the main application runs on.
The application then uses its memcached client to talk to the memcached server and ask for cached content. This is faster than querying data from a traditional database because:
A memcached server just maps IDs to values, and never needs to scan an entire table
The memcached protocol is simpler: the server doesn't need to parse SQL, and the client doesn't need to construct it
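Putting this together, a typical cache-aside lookup might look like the following sketch; the addresses, the users table and the key format are hypothetical:

```python
# Hedged sketch of the usual cache-aside pattern: try memcached first,
# fall back to the database on a miss, then populate the cache.
# Addresses, key format and the SQL query are all placeholders.
import json
import psycopg2
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))
db = psycopg2.connect(host="localhost", dbname="app", user="app", password="secret")
db.autocommit = True

def get_user(user_id):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: no SQL to parse, no table to scan

    with db.cursor() as cur:       # cache miss: ask the database
        cur.execute("SELECT name, email FROM users WHERE id = %s", (user_id,))
        row = cur.fetchone()

    user = {"name": row[0], "email": row[1]} if row else None
    cache.set(key, json.dumps(user).encode("utf-8"), expire=60)  # keep it warm briefly
    return user
```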
Since memcached does not require the reliability of a database (think of backups, fault isolation, clustering, security, etc.), it can be run on the same machine that the application runs on. While you could run a database on the same machine that the application runs on, doing so is frowned upon for the above reasons.