Postgres Patroni and etcd on the Same Machine - postgresql

Assuming I have 2 postgres servers (1 master and 1 slave) and I'm using Patroni for high availability:
1) I intend to have a three-machine etcd cluster. Is it OK to use the 2 postgres machines for etcd as well, plus another server, or is it preferable to use machines that are not used by Postgres?
2) What are my options for directing read requests to the slave and write requests to the master without using pgpool?
Thanks!

1) Yes, it is best practice to run etcd on the two PostgreSQL machines, with the additional server as the third etcd member.
2) The only safe way to do that is in your application. The application has to be taught to read from one database connection and write to another.
There is no safe way to distinguish a writing query from a non-writing one; consider
SELECT delete_some_rows();
The application also has to be aware that changes will not be visible immediately on the replica.
Streaming replication is of limited use when it comes to scaling...
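To illustrate what "teaching the application" can look like, here is a minimal sketch in which the caller states its intent instead of anything trying to guess from the SQL text. The class name `ReadWriteRouter`, the two connection factories and the `needs_fresh_data` flag are hypothetical names made up for this example; in real code the factories would wrap whatever driver you use.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ReadWriteRouter:
    """Holds one connection factory per role; the caller states its intent."""
    connect_primary: Callable[[], Any]   # hypothetical factory for the master
    connect_replica: Callable[[], Any]   # hypothetical factory for the standby

    def connection(self, *, readonly: bool, needs_fresh_data: bool = False) -> Any:
        # Writes always go to the primary. Reads that must see a change that
        # was just committed also go to the primary, because the replica
        # may lag behind.
        if readonly and not needs_fresh_data:
            return self.connect_replica()
        return self.connect_primary()


# Stand-in factories; real code would open database connections here.
router = ReadWriteRouter(
    connect_primary=lambda: "primary-connection",
    connect_replica=lambda: "replica-connection",
)
report_conn = router.connection(readonly=True)
order_conn = router.connection(readonly=False)
```

The point of the design is that the decision is made where the knowledge is: only the calling code knows whether `SELECT delete_some_rows();` writes, or whether a read has to see the latest commit.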

Related

Zalando operator- load balance read-write pgbouncer

I have installed Postgres cluster using zalando operator.
Also enabled pgbouncer for replicas and master.
But I would like to combine or load balance the replica and master connections,
so that read requests are routed to the read replicas and write requests are routed to the master.
Can anyone help me achieve this?
Thanks in advance.
Tried enabling pgbouncer.
pgbouncer is getting enabled either to master or to slave.
But I need a single point where it can route read requests to slaves and write requests to master.
There is no safe way to distinguish reading and writing statements in PostgreSQL. pgPool tries to do that, but I think any such solution is flaky. You will have to teach your application to direct reads and writes to different data sources.
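A small sketch of why statement-based routing is fragile. The function `looks_read_only` is a deliberately naive heuristic invented for this example (it is not how pgpool or any other tool is implemented); it only shows the kind of mistake that guessing from the statement text invites.

```python
import re


def looks_read_only(sql: str) -> bool:
    """Naive heuristic: treat anything that starts with SELECT as a read."""
    return re.match(r"\s*select\b", sql, flags=re.IGNORECASE) is not None


statements = [
    "SELECT * FROM orders WHERE id = 1;",   # a genuine read
    "SELECT delete_some_rows();",           # writes, but starts with SELECT
    "UPDATE orders SET paid = true;",       # an obvious write
]
for sql in statements:
    target = "replica" if looks_read_only(sql) else "primary"
    print(f"{target:8} <- {sql}")
```

The second statement is sent to the replica even though the function it calls modifies data, so it would fail there with a read-only transaction error.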
I don't think PgBouncer provides any out-of-the-box way to load balance read and write queries. An alternative is to use pgpool as the connection pooler. Pgpool has a setting called load_balance_mode; when it is turned on, pgpool tries to send write queries to the master and balance read queries across the replicas. The pgpool documentation describes load_balance_mode in more detail.

High number of connections to Airflow Metadata DB

I tried to find information about the number of connections that Airflow establishes with the metadata database instance (Postgres in my case).
By running select * from pg_stat_activity I realized it creates at least 7 connections whose states change between idle and idle in transaction. The queries are registered as COMMIT or SELECT 1 (mostly). This was using the LocalExecutor on Airflow 2.1, but I tested with an installation of Airflow 1.10 having the same results.
Is anyone aware of where these connections come from? And, is there a way (and a reason) to change this?
Yes. Airflow will open a large number of connections: basically every process it creates will almost certainly open at least one. This is a known characteristic of Apache Airflow.
If you are using MySQL, this is not a big issue, as MySQL is good at handling many connections (it serves incoming connections with threads). Postgres uses a process-per-connection approach, which is much more resource-hungry.
The recommended way to handle that (Postgres is the most stable backend for Airflow) is to use PgBouncer to proxy those connections to Postgres.
Our official Helm chart has built-in PgBouncer support for Postgres (see the chart's `pgbouncer.enabled` value to check whether it is turned on in your deployment). https://airflow.apache.org/docs/helm-chart/stable/index.html
I highly recommend this approach.
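If you want to compare the connection count before and after putting PgBouncer in front of the database, a small helper like the one below may be enough. It is only a sketch: `connections_by_state` is a name made up here, and it assumes you pass in an open DB-API connection (for example one created with psycopg2) whose user is allowed to read pg_stat_activity.

```python
def connections_by_state(conn):
    """Return {state: count} from pg_stat_activity for a DB-API connection."""
    cur = conn.cursor()
    cur.execute(
        "SELECT state, count(*) FROM pg_stat_activity GROUP BY state"
    )
    # Each row is (state, count); rows with a NULL state end up under None.
    result = dict(cur.fetchall())
    cur.close()
    return result


# Assumed usage, not run here:
#   import psycopg2
#   conn = psycopg2.connect("dbname=airflow")   # connection string is a placeholder
#   print(connections_by_state(conn))
```

When the measurement is taken on the Postgres server itself, the connections held by PgBouncer are what shows up, which is usually a much smaller and steadier number than the count of Airflow processes.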

Check postgresql replication

I have created a replicated PostgreSQL database (master and slave). I did this with an existing Ansible playbook (role), which I don't fully understand yet. The cluster currently consists of only 2 databases on different VMs.
So I want to test this replication now.
Unfortunately I have little experience with PostgreSQL.
How can I check whether the connection between them is stable?
And will the slave really take over if the master fails?
Many thanks for any information, tips & tricks.
Postgresql v. 9.6
PostgreSQL itself does not yet support automatic failover (although there are multiple third-party projects that provide this feature). Therefore, if the deployment you describe uses only stock PostgreSQL, none of the replicas will take over writes after a master failure. They can still answer read queries if they are configured as hot standbys (hot_standby = on).
If you want to check the state of replication, query the pg_stat_replication view on the master.
These official docs will also help you understand Postgres streaming replication and failover better:
https://www.postgresql.org/docs/9.6/warm-standby.html#STREAMING-REPLICATION
https://www.postgresql.org/docs/9.6/warm-standby-failover.html
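As a rough illustration of reading pg_stat_replication: in 9.6 the view has columns such as sent_location and replay_location (renamed to sent_lsn and replay_lsn in version 10), and the difference between them tells you how far the standby is behind. The helper names below are made up for this sketch and the sample values are invented; on the server, pg_xlog_location_diff() does the same arithmetic in 9.6.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/3000060' to an absolute byte position."""
    high, low = lsn.split("/")
    # An LSN is printed as two hexadecimal numbers: the high and low 32 bits.
    return (int(high, 16) << 32) + int(low, 16)


def replication_lag_bytes(sent_location: str, replay_location: str) -> int:
    """Bytes of WAL that were sent but not yet replayed by the standby."""
    return lsn_to_bytes(sent_location) - lsn_to_bytes(replay_location)


# Sample values as they might appear in pg_stat_replication (invented here).
print(replication_lag_bytes("0/3000060", "0/3000000"))  # 96
```

A lag that stays near zero while you write on the master, together with state = 'streaming' in the view, is a reasonable sign that the connection is healthy. It says nothing about failover, which still has to be triggered by you or by a third-party tool.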

Clustering PostgreSQL clusters

This will be confusing for some due to poor terminology choices by the PostgreSQL folks, but please bear with me...
We have a need to be able to support multiple PostgreSQL (PG) clusters, and cluster them on multiple servers using, e.g. repmgr. For example, to support both server availability and also PITR for each PG cluster. A single PG cluster per server is too expensive in many cases, so we multi-tenant (small) customers on separate PG clusters, for data separation, recovery, etc., but also want to be able to support HA via replication/fail-over.
The closest analogy for a PG cluster is a SQL Server instance - each can host multiple DB's, has its own port, etc. Like SQL Server, you can run multiple instances (PG clusters) on the same server, and set up replication for each.
Basic repmgr setup is no problem - that seems fairly clear in the single PG cluster model. But, is there any recommended/supported approach to multiple PG clusters using repmgr? I can kind of imagine faking repmgr into thinking each PG cluster is in effect a separate repmgr cluster (with separate repmgr.conf, connection info/port). But, I'm not yet sure that will work.
I'd typically expect to fail-over all PG clusters on the same server - not one at a time.
I recognize this may not be the best idea in all cases, but am mostly exploring what's possible. I have some alternatives, but this is closest to our current single-node model.
To clarify, I need to support many thousands of customers across many server clusters. Ideally, each cluster uses the same repmgr DB (in the main PG cluster, e.g.), and essentially stands alone from the other server clusters.
Thanks...
Answering my own question, but I hope someone eventually posts a better answer, as I otherwise quite like repmgr. In the end, it appears repmgr just isn't suitable for multiple PG clusters (instances), as there is an implied relationship between the repmgr cluster connection strings and the PG cluster (port). Thus, you'd essentially have to create a separate repmgr environment (DB) for every clustered PG cluster/instance, losing a lot of the operational simplicity that repmgr brings to the table.
I will investigate a more generic solution using Corosync/Pacemaker/etc., as at least in that case the virtual cluster IP handling is built into the solution and doesn't require additional software/resources to pull off.
I'm probably over-simplifying things, but it seems like repmgr was tantalizingly close to solving much of the problem, had it allowed the repmgr DB to be fully independent of the PG cluster and allowed each repmgr cluster to specify its own connection info, not (only) the connection info for the repmgr DB itself.

Can I keep two mongo databases synced?

I have an app that can run in offline mode. If offline it uses a local mongo database, if it has a data connection it will use a remote mongo database.
Is there an easy way to sync these two databases and make sure they both have the union of their collections and documents?
EDIT: Effectively there are two databases that could both have insertions and deletions happening on them that aren't happening on the other. At fixed points in time I would like to have both databases show the union of them both.
For example over a period of time.
DB1.insert(A)
DB1.insert(B)
DB2.insert(C)
DB1.remove(A)
RUN SYNC
DB1 = DB2 = {B, C}
EDIT2: Been doing some reading. It's not the intended purpose, but could they be set up as secondaries in a replica set of the remote and used that way? The problem is that, I think, replica set hosts must be reachable through resolvable DNS names. Not sure how the remote could access localhost.
You could use a replica set, but MongoDB doesn't support master-master replication. Let's assume you have a setup like this:
two nodes with priority 1, which will be used as remote servers
a single arbiter to ensure a majority if one of the remotes dies
5 local dbs with priority set to 0
When your application goes offline, its local node stays a secondary, so you won't be able to perform writes. When you go back online, it will sync changes from the remote dbs, but you still need some way of syncing local changes. One way of dealing with this could be a local fallback db that is used for writes while you are offline. When you go online, you push all new records to the master. Dealing with updates could be a little trickier, but it is doable.
Another problem is that it won't scale if you need to add more applications. If I remember correctly, there is a limit of 12 nodes per replica set (MongoDB 3.0 raised it to 50 members, of which at most 7 can vote). For a small cluster, DNS resolution could be solved by using ssh tunnels.
Another way of dealing with the problem could be a small RESTful service and document timestamps. Whenever the app is online, it can periodically push local inserts to the remote db and pull data from it.
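To make the push/pull idea concrete, here is a toy model of the example from the question. It is only a sketch under simplifying assumptions: the `Store` class is an in-memory stand-in for a database, documents are reduced to their ids, and deletes are remembered as tombstones so that the sync can propagate them. None of this is MongoDB API.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """In-memory stand-in for one database: live documents plus tombstones."""
    docs: set = field(default_factory=set)
    deleted: set = field(default_factory=set)

    def insert(self, doc_id):
        self.docs.add(doc_id)

    def remove(self, doc_id):
        self.docs.discard(doc_id)
        self.deleted.add(doc_id)   # remember the delete so sync can propagate it


def sync(a: Store, b: Store) -> None:
    """Make both stores hold the union of inserts minus the union of deletes."""
    tombstones = a.deleted | b.deleted
    merged = (a.docs | b.docs) - tombstones
    a.docs, b.docs = set(merged), set(merged)
    a.deleted, b.deleted = set(tombstones), set(tombstones)


db1, db2 = Store(), Store()
db1.insert("A")
db1.insert("B")
db2.insert("C")
db1.remove("A")
sync(db1, db2)
print(sorted(db1.docs), sorted(db2.docs))  # ['B', 'C'] ['B', 'C']
```

Without the tombstones, a plain union would bring A back from whichever side still had it. A real implementation would also need timestamps (as suggested above) to decide what happens when a deleted id is inserted again or when the same document is updated on both sides.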