Periodic RDS Postgresql Replication Delays - postgresql

I have been observing that my PostgreSQL read replica shows periodic delay for replication lags. The lag seems to build to up to 30-40 minutes and then automatically goes down to 0. There is a correlation with CPU Utilization but it's nowhere close to CPU limit.
Read traffic comes from a reporting software called DOMO. DOMO periodically copies a large chunk of data & full tables into its warehouse.
Here's AWS Cloudwatch graph. The red line shows Replication Lag in seconds. The blue line shows the CPU load.
Lag vs CPU
Lag vs Network Out
Lag vs Read IOPS
Lag vs Write IOPS
Cloud: Amazon RDS
Instance Size: db.m3.2xlarge
PostgresSQL version: 9.3
Postgres Settings:
Shared Buffers (Set by RDS) = 7.3 GB (956978 * 8KB)
Updates
Tried setting Shared Buffers to 1GB (didn't help)
Updates June, 5 2017
I created a branch new replica for my database and pointed the reporting software (DOMO) at it. Things in the new instance look stable for now. The old replica which has no read traffic now is stable as well. Beginning to suspect some type of AWS config issue or something to do what remaining artifacts in the database (vacuum?).

RDS read replica lag metric isn't updated when there's nothing to replicate. If master database has no changes to replicate, then replica would only be updated on time-forced so called checkpoint - periodic sync of data from write ahead log to the tables.
This would cause the graph to look like above. To see the real graph data you'd have to generate some traffic on the master, for example update some special sequence every minute or even every second - depending how much resolution you need.
Also WAL-generation log of master and network utilization on replica graphs would be interesting - the alternative explanation would be that there are too much traffic (IO or network) for replica to handle and it can only catch-up when traffic stops.

Related

AWS RDS with Postgres : Is OOM killer configured

We are running load test against an application that hits a Postgres database.
During the test, we suddenly get an increase in error rate.
After analysing the platform and application behaviour, we notice that:
CPU of Postgres RDS is 100%
Freeable memory drops on this same server
And in the postgres logs, we see:
2018-08-21 08:19:48 UTC::#:[XXXXX]:LOG: server process (PID XXXX) was terminated by signal 9: Killed
After investigating and reading documentation, it appears one possibility is linux oomkiller running having killed the process.
But since we're on RDS, we cannot access system logs /var/log messages to confirm.
So can somebody:
confirm that oom killer really runs on AWS RDS for Postgres
give us a way to check this ?
give us a way to compute max memory used by Postgres based on number of connections ?
I didn't find the answer here:
http://postgresql.freeideas.cz/server-process-was-terminated-by-signal-9-killed/
https://www.postgresql.org/message-id/CAOR%3Dd%3D25iOzXpZFY%3DSjL%3DWD0noBL2Fio9LwpvO2%3DSTnjTW%3DMqQ%40mail.gmail.com
https://www.postgresql.org/message-id/04e301d1fee9%24537ab200%24fa701600%24%40JetBrains.com
AWS maintains a page with best practices for their RDS service: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_BestPractices.html
In terms of memory allocation, that's the recommendation:
An Amazon RDS performance best practice is to allocate enough RAM so
that your working set resides almost completely in memory. To tell if
your working set is almost all in memory, check the ReadIOPS metric
(using Amazon CloudWatch) while the DB instance is under load. The
value of ReadIOPS should be small and stable. If scaling up the DB
instance class—to a class with more RAM—results in a dramatic drop in
ReadIOPS, your working set was not almost completely in memory.
Continue to scale up until ReadIOPS no longer drops dramatically after
a scaling operation, or ReadIOPS is reduced to a very small amount.
For information on monitoring a DB instance's metrics, see Viewing DB Instance Metrics.
Also, that's their recommendation to troubleshoot possible OS issues:
Amazon RDS provides metrics in real time for the operating system (OS)
that your DB instance runs on. You can view the metrics for your DB
instance using the console, or consume the Enhanced Monitoring JSON
output from Amazon CloudWatch Logs in a monitoring system of your
choice. For more information about Enhanced Monitoring, see Enhanced
Monitoring
There's a lot of good recommendations there, including query tuning.
Note that, as a last resort, you could switch to Aurora, which is compatible with PostgreSQL:
Aurora features a distributed, fault-tolerant, self-healing storage
system that auto-scales up to 64TB per database instance. Aurora
delivers high performance and availability with up to 15 low-latency
read replicas, point-in-time recovery, continuous backup to Amazon S3,
and replication across three Availability Zones.
EDIT: talking specifically about your issue w/ PostgreSQL, check this Stack Exchange thread -- they had a long connection with auto commit set to false.
We had a long connection with auto commit set to false:
connection.setAutoCommit(false)
During that time we were doing a lot
of small queries and a few queries with a cursor:
statement.setFetchSize(SOME_FETCH_SIZE)
In JDBC you create a connection object, and from that connection you
create statements. When you execute the statments you get a result
set.
Now, every one of these objects needs to be closed, but if you close
statement, the entry set is closed, and if you close the connection
all the statements are closed and their result sets.
We were used to short living queries with connections of their own so
we never closed statements assuming the connection will handle the
things once it is closed.
The problem was now with this long transaction (~24 hours) which never
closed the connection. The statements were never closed. Apparently,
the statement object holds resources both on the server that runs the
code and on the PostgreSQL database.
My best guess to what resources are left in the DB is the things
related to the cursor. The statements that used the cursor were never
closed, so the result set they returned never closed as well. This
meant the database didn't free the relevant cursor resources in the
DB, and since it was over a huge table it took a lot of RAM.
Hope it helps!
TLDR: If you need PostgreSQL on AWS and you need rock solid stability, run PostgreSQL on EC2 (for now) and do some kernel tuning for overcommitting
I'll try to be concise, but you're not the only one who has seen this and it is a known (internal to Amazon) issue with RDS and Aurora PostgreSQL.
OOM Killer on RDS/Aurora
The OOM killer does run on RDS and Aurora instances because they are backed by linux VMs and OOM is an integral part of the kernel.
Root Cause
The root cause is that the default Linux kernel configuration assumes that you have virtual memory (swap file or partition), but EC2 instances (and the VMs that back RDS and Aurora) do not have virtual memory by default. There is a single partition and no swap file is defined. When linux thinks it has virtual memory, it uses a strategy called "overcommitting" which means that it allows processes to request and be granted a larger amount of memory than the amount of ram the system actually has. Two tunable parameters govern this behavior:
vm.overcommit_memory - governs whether the kernel allows overcommitting (0=yes=default)
vm.overcommit_ratio - what percent of system+swap the kernel can overcommit. If you have 8GB of ram and 8GB of swap, and your vm.overcommit_ratio = 75, the kernel will grant up to 12GB or memory to processes.
We set up an EC2 instance (where we could tune these parameters) and the following settings completely stopped PostgreSQL backends from getting killed:
vm.overcommit_memory = 2
vm.overcommit_ratio = 75
vm.overcommit_memory = 2 tells linux not to overcommit (work within the constraints of system memory) and vm.overcommit_ratio = 75 tells linux not to grant requests for more than 75% of memory (only allow user processes to get up to 75% of memory).
We have an open case with AWS and they have committed to coming up with a long-term fix (using kernel tuning params or cgroups, etc) but we don't have an ETA yet. If you are having this problem, I encourage you to open a case with AWS and reference case #5881116231 so they are aware that you are impacted by this issue, too.
In short, if you need stability in the near term, use PostgreSQL on EC2. If you must use RDS or Aurora PostgreSQL, you will need to oversize your instance (at additional cost to you) and hope for the best as oversizing doesn't guarantee you won't still have the problem.

Slow replication recovery due to communication problems

We had lately several times the same problems on Google compute engine environment with PostgreSQL streaming replication and I would like to understand reasons and if I can repair it in some smoother way.
From time to time we see some communication problems in Google's internal network in GCE datacenter and they always trigger replication lags between our PG master and its replicas. All machines are Debian-8 and PostgreSQL 9.5.
When situation happens everything seems to be OK - no errors in PG logs on master or replicas just communication between master and replicas seems to be incredibly slow or repeatedly failing so new WAL logs are transfered to replicas with big delays and therefore replication lag is still growing.
Restart of replication from within PostgreSQl or restart of PostgreSQL on replica does not really help - after several WAL logs copied using scp in recovery command communication is back in previous incredibly slow status. Only restart of the whole instance help. When whole VM is restarted communication is back to normal and recovery even from lag many hours long is done in a few minutes. So main reason for this behavior seems to be on OS level. I tried to check net traffic but without finding anything significant. I also do not see anything relevant in any OS log.
Could restart of some OS service help? So I do not need to restart the whole VM? Thank you very much for any ideas.

Mongo Db secondary setup

From last 1 week I am trying to setup replica set for my one node mongodb (3.4.2 version) but facing multiple issues. My primary node currently have around 650 gb of data and every day it is growing by 90 gb. First time I added new secondary node with empty data directory after almost a day it failed with too much of lag in oplog issue. Next time I tried manually copying data. After copy when restarted secondary it started giving me the error that I cannot synch from primary (There was not connection problem I was able to ping). I again retried manual copy procedure but this time it failed with below error. As wired tiger issue is with specific collection file. I copied that file again and retried but it failed again with same issue. Can someone please help me in setting up secondary. Everyday it is becoming more difficult as data is growing and I cannot keep primary down for long time (During manual copy I stop all writes in primary).
2017-03-02T16:08:16.315+0000 E STORAGE [initandlisten] WiredTiger error (-31802) [1488470896:315136][17051:0x7ffdbd3d7dc0], file:mcse.45trace/collection-16-7756455024301269277.wt, WT_SESSION.open_cursor: /app/data/mcse.45trace/collection-16-7756455024301269277.wt: handle-read: pread: failed to read 4096 bytes at offset 86474874880: WT_ERROR: non-specific WiredTiger error
2017-03-02T16:08:16.315+0000 I - [initandlisten] Invariant failure: ret resulted in status UnknownError: -31802: WT_ERROR: non-specific WiredTiger error at src/mongo/db/storage/wiredtiger/wiredtiger_session_cache.cpp 95
If you can solve that first problem with the replication lag, then you will probably get everything running OK. Take a look at the Troubleshooting Replica Sets guide, it has some useful suggestions:
Possible causes of replication lag include:
Network Latency
Check the network routes between the members of your set to ensure that there is no packet loss or network routing issue.
Use tools including ping to test latency between set members and traceroute to expose the routing of packets network endpoints.
Disk Throughput
If the file system and disk device on the secondary is unable to flush data to disk as quickly as the primary, then the secondary will have difficulty keeping state. Disk-related issues are incredibly prevalent on multi-tenant systems, including virtualized instances, and can be transient if the system accesses disk devices over an IP network (as is the case with Amazon’s EBS system.)
Use system-level tools to assess disk status, including iostat or vmstat.
Concurrency
In some cases, long-running operations on the primary can block replication on secondaries. For best results, configure write concern to require confirmation of replication to secondaries. This prevents write operations from returning if replication cannot keep up with the write load.
Use the database profiler to see if there are slow queries or long-running operations that correspond to the incidences of lag.
Appropriate Write Concern
If you are performing a large data ingestion or bulk load operation that requires a large number of writes to the primary, particularly with unacknowledged write concern, the secondaries will not be able to read the oplog fast enough to keep up with changes.
To prevent this, request write acknowledgement write concern after every 100, 1,000, or another interval to provide an opportunity for secondaries to catch up with the primary.
For more information see:
• Write Concern
• Replica Set Write Concern
• Oplog Size
WiredTiger error (-31802) file:xxx.wt
This could be related to corrupted .wt files (e.g. WiredTiger.wt/WiredTiger.turtle) as per SERVER-31076 bug report.
Try running:
mongod --repair --dbpath /path/to/data/db
Also make sure all data/db files have the right read and write permission.

Autoscaling limited by RDS connection

I have some nightly jobs that are running on EC2 and the number of machines is scaled by the number of messages in SQS. My process requires reads from a Postgres RDS database. Now these are the issues I am facing.
Not able to scale beyond a certain number because of the unavailability of connections.
I tried creating a connection pool using pgbouncer, and tried with different settings as well, but it's missing a lot of data on the resultant set.
Make your postgresql RDS install multi AZ. Then you can make read replicas on demand and scale read performance with your load.
To answer the comments:
Some extra "plumbing" is required to make the connections to the read replica. Maybe route53 dynamically updated records as the scaling happens or something like haproxy
The reason I mention multi AZ is that this would help prevent downtime during an auto scaling event bringing up the read replica
It would be simpler (but more costly) to permanently bring up a read replica and use DNS round robin to share the load
See https://aws.amazon.com/blogs/aws/amazon-rds-announcing-read-replicas/ for information on read replicas

High CPU Utilisation on AWS RDS - Postgres

Attempted to migrate my production environment from Native Postgres environment (hosted on AWS EC2) to RDS Postgres (9.4.4) but it failed miserably. The CPU utilisation of RDS Postgres instances shooted up drastically when compared to that of Native Postgres instances.
My environment details goes here
Master: db.m3.2xlarge instance
Slave1: db.m3.2xlarge instance
Slave2: db.m3.2xlarge instance
Slave3: db.m3.xlarge instance
Slave4: db.m3.xlarge instance
[Note: All the slaves were at Level 1 replication]
I had configured Master to receive only write request and this instance was all fine. The write count was 50 to 80 per second and they CPU utilisation was around 20 to 30%
But apart from this instance, all my slaves performed very bad. The Slaves were configured only to receive Read requests and I assume all writes that were happening was due to replication.
Provisioned IOPS on these boxes were 1000
And on an average there were 5 to 7 Read request hitting each slave and the CPU utilisation was 60%.
Where as in Native Postgres, we stay well with in 30% for this traffic.
Couldn't figure whats going wrong on RDS setup and AWS support is not able to provide good leads.
Did anyone face similar things with RDS Postgres?
There are lots of factors, that maximize the CPU utilization on PostgreSQL like:
Free disk space
CPU Usage
I/O usage etc.
I came across with the same issue few days ago. For me the reason was that some transactions was getting stuck and running since long time. Hence forth CPU utilization got inceased. I came to know about this, by running some postgreSql monitoring command:
SELECT max(now() - xact_start) FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'active');
This command shows the time from which a transaction is running. This time should not be greater than one hour. So killing the transaction which was running from long time or that was stuck at any point, worked for me. I followed this post for monitoring and solving my issue. Post includes lots of useful commands to monitor this situation.
I would suggest increasing your work_mem value, as it might be too low, and doing normal query optimization research to see if you're using queries without proper indexes.