AWS RDS: high write IOPS - postgresql

We are using an RDS instance of type m4.xlarge in the ap-south-1 region, running the PostgreSQL 9.6 engine.
Yesterday the write IOPS suddenly skyrocketed. Here are some details from around that time: roughly 20 GB of space filled up within an hour and a half and was then freed automatically, and the number of connections also shot up to about 1500.
(Graphs attached: Write IOPS, Free Storage Space.)
There is no way our application could have written that much data (20 GB). And even if it had, how did it get deleted automatically?
Is there any way of knowing what happened here? Has anyone else faced this issue?
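One place to start looking, as a hedged sketch assuming you can reach the instance with psql: check whether large temporary files were written (sorts or hashes spilling to disk would show up as write IOPS and be freed once the queries finish, and temp_bytes is cumulative since the last stats reset), and what the connections were doing:
psql -c "SELECT datname, temp_files, temp_bytes FROM pg_stat_database ORDER BY temp_bytes DESC;"
psql -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count(*) DESC;"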

Related

AWS RDS Storage increasing unexpectedly

I'm using RDS PostgreSQL with the free tier configuration (a t2.micro instance, 20 GB of general purpose (SSD) database storage, and 20 GB of backup storage).
I’ve created a database with a few tables and the current usage is below 200 MB.
The issue is that the storage has been increasing at a rate of about 1 GB per day since I created the database. As a workaround I tried to stop this behavior by decreasing the backup retention period from 7 days to 0. From yesterday on, no more backups were stored; however, the storage kept growing at almost the same rate.
I have no clue what's going on and I don't know what to do to stop the consumption and avoid exceeding the free tier storage.
Apart from that, I also don't know which storage has been increasing (data or backup storage), because the AWS console only reports the increase as RDS storage without distinguishing between the two. Since I'm not storing much data in the database, I suspect the problem is in the backup and snapshot storage rather than in the data storage itself.
Thanks in advance!
Update 2022-09-13
The curious fact is that I have a budget threshold of 8 GB/month, and on the same day every month (in my case, the 13th) the threshold is exceeded. But at the end of the month it more or less resets by itself (it returns to zero). I don't know what is happening. Does anyone know where that temporary GB usage comes from?
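One way to narrow down which pool is actually shrinking is to pull the FreeStorageSpace metric for the instance over the period in question; a hedged AWS CLI sketch (the instance identifier and dates are placeholders):
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS --metric-name FreeStorageSpace \
  --dimensions Name=DBInstanceIdentifier,Value=my-db-instance \
  --start-time 2022-09-01T00:00:00Z --end-time 2022-09-14T00:00:00Z \
  --period 86400 --statistics Minimum
If FreeStorageSpace stays flat while the billed storage grows, the growth is more likely on the backup/snapshot side than in the database volume itself.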

GCP CloudSQL (PostgreSQL) Crash During Stored Procedure Execution and Failover

I have a stored procedure in GCP Cloud SQL (PostgreSQL v9.0.23). It works fine in lower environments, but when it runs in Production (with significantly more volume), it crashes the database itself, which results in a failover.
When we checked the metrics, we found that memory usage is above 90% just before the crash (15 GB out of the 16 GB of instance memory). The reads/writes are also very high, over 1000 ops per second.
The stored procedure runs some SELECT and INSERT statements. Any suggestions to improve this situation would help.
Thanks in advance.
Since you mention that the Cloud SQL instance runs smoothly under a small workload but crashes in the Production environment where the workload is much more intensive, the issue appears to be the instance size. I would suggest increasing the instance size to match your needs.
You also mention that memory usage is 15 GB out of 16 GB, which is nearly 94%. Per this document, your Cloud SQL instance is not covered by the Cloud SQL SLA if memory usage stays above 90% for more than six hours, so I would suggest keeping memory usage below 90%. I would also suggest keeping CPU utilization within the limits described in this document. To know when your instance crosses any threshold, set up a monitoring alert for those metrics as described here.
If increasing your instance size doesn't help, I would recommend creating a support ticket with Google Cloud Support so that they can investigate in detail.
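If you do resize, here is a hedged gcloud sketch (the instance name and machine tier are placeholders; changing the tier restarts the instance):
gcloud sql instances patch my-postgres-instance --tier=db-custom-8-32768
The db-custom-CPUS-MEMORY_MB form lets you raise memory independently of CPU, within Cloud SQL's per-vCPU memory limits.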

Downtime if increasing memory for Google Cloud SQL

I would like to increase memory for my Google Cloud SQL instance but I am unable to find the downtime for the same. On this page https://cloud.google.com/sql/faq I can find downtime for storage (zero downtime) and CPU (few minutes) but nothing for memory.
Can somebody help me understand whether there is downtime associated with increasing the RAM for the instance?
So I upgraded my RAM, and it seems that the downtime is about 2 minutes, roughly the time it takes to restart the underlying machine.
I hope this helps somebody

"frozen" Google compute engine instance with PostgreSQL

We run several Debian instances with PostgreSQL on Google Compute Engine, and lately we have seen several occurrences of the following problem.
An instance suddenly becomes unresponsive. We cannot SSH into it and we cannot connect to the database. Internal monitoring using Telegraf also stops during that period; no monitoring data is collected.
Google's CPU monitoring shows very low usage during that period. The GCP logs do not show any migration; in fact, they show nothing at all. All internal logs on the instance (the PostgreSQL log, syslog, logs from periodic cron jobs) show the same gap. It looks as if the instance was more or less frozen during that time. So far we have only noticed it on PostgreSQL instances, since these are heavily used.
Instances run these variants of OS and PG:
Debian 9 with PG 11.9
Debian 9 with PG 10.13
These incidents usually last 10-15 minutes, but in one case it was 1:20 hours. At the end of an incident some PostgreSQL process is killed by the OOM killer, yet activity on the database immediately before the incident starts is usually relatively low, and so are CPU and memory usage. So it looks more as if the instance has limited resources when it comes back, if that is even possible.
Any idea what could cause these issues, or what we should look for? As mentioned, there is generally no information in the internal Debian logs for the period of the incident.
UPDATE: To avoid misunderstanding, the instances in question are a data warehouse database running on an n1-highmem-8 machine (8 vCPUs and 52 GB RAM) with a 5 TB SSD, and a database collecting metrics from the internet on a custom machine with 20 vCPUs, 90 GB RAM and a 3 TB SSD. All software is up to date.
UPDATE 2: Neither syslog, nor kern.log, nor messages shows anything for the intervals during which the instance was unresponsive. Immediately after an incident, Telegraf records a huge load average but actually quite low CPU usage, and Google monitoring shows very low CPU usage during the whole incident. Also, immediately after the end of an incident, one of the PostgreSQL processes is always killed by the OOM killer, sending the database into recovery mode.
As for the PostgreSQL work_mem parameter: the instance collecting metrics (20 vCPUs, 90 GB RAM, 3 TB SSD) uses 8MB. It only inserts data but usually runs around 500-1000 connections.
The second instance is the data warehouse analytical database and uses a work_mem of 128MB, because lower values caused very bad query plans for the majority of queries; it usually runs only around 10-30 connections.
There was no unusual number of connections immediately before the incidents on either database.
UPDATE 3: The analytical database had several small incidents of the same character today. During the last one we stopped the instance from the GCP GUI and started it again after a few minutes. Maybe that caused a migration to different hardware. Since this operation the instance has been running OK.
I experienced a similar issue, but with a MySQL instance in GCP. The first time, the problem was related to the VM instance type: I was using an f1-micro machine type and suddenly could not access the instance over SSH. Since that machine type has only 0.6 GB of memory, it soon ran out of memory; I changed it to an e2-medium (the default value) and that resolved my problem that time.
Because the instance was out of memory, the services on it started to fail, which is why I could not access it.
On another occasion I ran into similar issues, but this time the problem was the disk: I only had 10 GB, a process was filling it up, and when a partition ran out of space the instance started to fail again.
I simply resized my disk; it is now 20 GB and the instance is working fine.
Having said that, I suggest increasing your resources as needed to improve performance, because having the problems you describe is a good indicator that your existing machine type is not a good fit for the workloads you run on that instance.
If your situation is the same as mine, you can change the machine type to adjust your memory by following the steps below (a gcloud equivalent is sketched after the steps); please visit the following link for further information about it.
Changing a machine type
1.- Go to the VM Instances page.
2.- In the Name column, click your instance.
From the instance details page, complete the following steps:
a) Click the Stop button to stop the instance, if you have not stopped it yet.
b) After the instance stops, click the Edit button at the top of the page.
c) Under the Machine configuration section, select the machine type you want to use, or create a custom machine type to increase only the Memory.
d) Save your changes and restart your VM instance.
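The same change can be scripted; a hedged gcloud sketch (the instance name and zone are placeholders, and e2-medium is only an example machine type):
gcloud compute instances stop my-instance --zone=us-central1-a
gcloud compute instances set-machine-type my-instance --zone=us-central1-a --machine-type=e2-medium
gcloud compute instances start my-instance --zone=us-central1-a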
You can resize your disk following this guide or with the following command:
gcloud compute disks resize DISK_NAME --size DISK_SIZE
Or with the Console:
Go to the Disks page to see a list of zonal persistent disks in your project.
Click the name of the disk that you want to resize.
On the disk details page, click Edit.
In the Size field, enter the new size for your disk.
Click Save to apply your changes to the disk.
After you resize the disk, you must resize the file system so that the operating system can access the additional space.
Note: Do not resize boot disks beyond 2 TB because this is the limit.
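For that last step, a hedged sketch assuming an ext4 root filesystem on the first partition of /dev/sda (adjust the device and partition for your layout; growpart is provided by the cloud-guest-utils package on Debian):
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1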
Edit1
You mentioned that the logs don’t show information about the issue when the instance is frozen.
Did you try the kernel logs? They could provide a wealth of diagnostic information about this issue.
On Debian, these logs should be in the following path:
/var/log/kern.log
The messages log could also help:
/var/log/messages
You can obtain more information about the logs in this link.
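A hedged sketch for scanning those logs around an incident window for OOM-killer or other kernel events:
sudo grep -iE 'oom|out of memory|killed process' /var/log/kern.log /var/log/syslog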
Also, I think it could be a PostgreSQL configuration problem. For example, you could take a look at work_mem: this parameter specifies the amount of memory to be used by internal sort operations and hash tables before writing to temporary disk files. The default value is four megabytes (4MB).
You can consult this URL to get more information.
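To check and adjust work_mem, a hedged sketch (the 4MB value is only an example; a configuration reload is enough, no restart is needed):
psql -c "SHOW work_mem;"
psql -c "ALTER SYSTEM SET work_mem = '4MB';"
psql -c "SELECT pg_reload_conf();"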
I have also found a good article that explains how to configure PostgreSQL for data warehouse usage.
Another possibility concerns the kernel process responsible for identifying memory that can be paged out; you could configure it to scan smaller chunks of memory more often.
This link explains this configuration in more detail.
Additionally, as far as I know a data warehouse server consumes a lot of resources, so it could be a good idea to check whether your instance has enough resources for your workload.
Edit2
I have found an article that describes a similar problem, and it says:
When you consume more memory than is available on your machine you can start to see out-of-memory errors within your Postgres logs, or in worse cases the OOM killer can start to randomly kill running processes to free up memory. An out-of-memory error in Postgres simply errors out the query you're running, whereas the OOM killer in Linux begins killing running processes, which in some cases might even include Postgres itself.
And this is the recommendation they give.
When you see an out of memory error you either want to increase the overall RAM on the machine itself by upgrading to a larger instance, OR you want to decrease the amount of memory that work_mem uses. Yes, you read that right: when you're out of memory it's better to decrease work_mem rather than increase it, since that is the amount of memory that can be consumed by each process, and too many operations may each be using up to that much memory.
You can read the complete explanation in the article "Configuring memory for Postgres" here; it may help you with this issue.
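As a rough, hedged illustration of why: work_mem is a per-sort/per-hash limit, not a per-server one, so 500 connections each running a single 8 MB sort can already claim around 4 GB, and a few analytical queries at work_mem = 128 MB with several sort or hash nodes apiece can add multiple GB more on top of shared_buffers and the OS cache.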

Do we need Provisioned IOPS for RDS instance that's using 60 IOPS according to monitoring?

We have a PostgreSQL instance serving tens of read/write queries per second.
Instance type: db.m3.2xlarge
Instance Provisioned IOPS (SSD): 1000
Instance storage size: 100 GB; database size is about 5-10 GB.
It serves hundreds of simultaneous clients with read/write queries. Yet when we look at CloudWatch monitoring, it shows IOPS in the range of 20-60.
And read IOPS is around 0!
That can't be right with hundreds of connections and clients performing read/write queries all the time, can it?
The Postgres configuration is standard; we did not turn off fsync.
Is the cache so effective that IOPS is not a factor with a database size of 5 GB?
Or is the AWS monitoring console wrong?
Paying for 1000 provisioned IOPS costs an extra $300 for this DB instance.
And the minimum provisioned IOPS you can buy is 1000.
I am wondering whether we can do without provisioned IOPS.
Or is the AWS monitoring not correct?
Or will the 20 IOPS we're seeing now kill server performance if we move to a non-provisioned-IOPS server?
Or, with a 5 GB database, does it mostly fit in cache so that IOPS is not a factor?
@CraigRinger is correct. If your dataset is small enough to fit entirely in memory, you won't need provisioned IOPS, since insert/update traffic and logs are the only things consuming IOPS.
But in case someone finds this topic: here's what CloudWatch looks like when you've exhausted your GP2 credits. As you can see, the read and write IOPS charts don't tell us much, but the read/write latency charts show massive spikes.
For context, these are two weeks of a PostgreSQL read replica used for analytics. The switch from 100 GB GP2 (300 base IOPS, $11.50/mo) to 100 GB io1 (1000 IOPS, $112.50/mo) happens about two-thirds of the way through these charts (no more latency spikes). The cheaper option would have been to simply increase the amount of GP2 storage. Provisioned IOPS are outrageously overpriced, but predictable behavior during heavy workloads made sense for this instance.
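If you suspect the same thing, the BurstBalance metric shows the remaining GP2 credit directly; a hedged AWS CLI sketch (the instance identifier and dates are placeholders):
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS --metric-name BurstBalance \
  --dimensions Name=DBInstanceIdentifier,Value=my-db-instance \
  --start-time 2018-06-01T00:00:00Z --end-time 2018-06-15T00:00:00Z \
  --period 3600 --statistics Minimum
A balance that drops to zero before the latency spikes is a strong hint that exhausted credits, not the workload itself, caused the slowdown.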
Your DB is almost entirely cached in RAM. (You can confirm this with use of the pg_buffercache extension). Those IOPS numbers are entirely to be expected. I would expect this server to be just fine without provisioned IOPS.
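A hedged sketch of that check with pg_buffercache (it needs superuser or equivalent grants, and 8192 is the default block size):
psql -c "CREATE EXTENSION IF NOT EXISTS pg_buffercache;"
psql -c "SELECT count(*) * 8192 / 1024 / 1024 AS cached_mb FROM pg_buffercache WHERE relfilenode IS NOT NULL;"
Compare the result to your shared_buffers setting and the database size to see how much of the data set is resident.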
If you restart the instance it'll be slow for a little while as it builds the cache back up, but 5GB isn't much for that. Also, having provisioned IOPS actually makes this worse, because as well as setting a minimum I/O rate, PIOPS sets the maximum too. It's a target rate, not a minimum.
By contrast, regular volumes can burst to much higher read rates than piops volumes, so they'll perform better when you're warming the cache back up after a restart.
BTW:
Restarting the database won't slow it much, as it only has to read data from the OS's disk cache back into shared_buffers. It's only if you restart the whole machine that you'll see a slowdown for a while. If you want to simulate this without a restart, you can use Linux's drop_caches feature:
echo 1 | sudo tee -a /proc/sys/vm/drop_caches
This is actually worse than the situation after a restart because it evicts binaries and libraries from memory too. The system will chug very heavily at first, as it reads the very frequently accessed binaries and libraries it's executing back into RAM. Then you'll start to see cache recovery behaviour like you would after a restart.
Also, you have too many connections configured. Install pgbouncer, put it in front of the database, and reduce your max_connections. You'll get better performance.
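A minimal PgBouncer sketch, assuming a local database called mydb and that transaction pooling suits your workload (paths, names, and pool sizes are placeholders):
cat <<'EOF' | sudo tee /etc/pgbouncer/pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 20
max_client_conn = 500
EOF
Point the application at port 6432; PostgreSQL's own max_connections can then be lowered to roughly the pool size plus a margin for admin sessions.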