Postgresql Aurora DB freeable_memory - postgresql

I have a question regarding the freeable memory for AWS Aurora Postgres.
We recently wanted to create an index on one of our dbs and the db died and made a failover to the slave which all worked fine. It looks like the freeable memory dropped by the configured 500mb of maintenance_work_mem and by that went to around 800mb of memory - right after that the 32gig instance died.
1) I am wondering if the memory that is freeable is the overall system memory and if a low memory here could invoke the system oom killer on the AWS Aurora instance? So we may want to plan in more head room for operational tasks and the running of autovacuum jobs to not encounter this issue again?
2) The actual work of the index creation should then have used the free local storage as far as I understood, so the size of the index shouldn't have mattered, right?
Thanks in advance,
Chris

Regarding 1)
Freeable Memory from (https://forums.aws.amazon.com/thread.jspa?threadID=209720)
The freeable memory includes the amount of physical memory left unused
by the system plus the total amount of buffer or page cache memory
that are free and available.
So it's freeable memory across the entire system. While MySQL is the
main consumer of memory on the host we do have internal processes in
addition to the OS that use up a small amount of additional memory.
If you see your freeable memory near 0 or also start seeing swap usage
then you may need to scale up to a larger instance class or adjust
MySQL memory settings. For example decreasing the
innodb_buffer_pool_size (by default set to 75% of physical memory) is
one way example of adjusting MySQL memory settings.
That also means that if the memory gets low its likely to impact the process in some form. In this thread (https://forums.aws.amazon.com/thread.jspa?messageID=881320&#881320) e.g. it was mentioned that it caused the mysql server to restart.
Regarding 2)
This is like it is described in the documentation (https://aws.amazon.com/premiumsupport/knowledge-center/postgresql-aurora-storage-issue/) so I guess its right and the size shouldn't have mattered.
Storage used for temporary data and logs (local storage). All DB
temporary files (for example, logs and temporary tables) are stored in
the instance local storage. This includes sorting operations, hash
tables, and grouping operations that are required by queries.
Each Aurora instance contains a limited amount of local storage that
is determined by the instance class. Typically, the amount of local
storage is twice the amount of memory on the instance. If you perform
a sort or index creation operation that requires more memory than is
available on your instance, Aurora uses the local storage to fulfill
the operation.

Related

GCP CloudSQL (PostgreSQL) Crash During Stored Procedure Execution and Failover

I have a stored procedure in GCP CloudSQL (PostgreSQL v9.0.23). It works find in lower environments; but when it runs in Production (with significantly more volume), it crashes the DB itself which results in a Failover.
When we checked the metrics, what we found out is that the memory is more than 90% just before it crashes (15 GB out of the 16GB instance memory). Also the Read / Writes are very high >1000 Ops per second.
The SP does some select and insert statements. Any suggestions to improve this situation helps.
Thanks in advance.
As you have mentioned that the Cloud SQL instance is running smoothly with a small amount of workload but crashing with the Production environment where more intensive workloads are there, it seems the issue is with the instance size. So I would suggest you increase the instance size as per your need.
Also you have mentioned that the memory usage is 15 GB out of 16 GB which amounts to nearly 94%. As per this document your Cloud SQL instance will not be covered under Cloud SQL SLA if memory usage is over 90% for more than 6 hours of duration. So I would suggest you keep the memory usage within 90%. Also I would suggest keeping the CPU utilization as mentioned in this document. To know when your instance reaches any threshold I will suggest you set a monitoring alert for that metrics as mentioned here.
If increasing your instance size doesn’t help I would recommend you to create a support ticket with Google Cloud Support so that they can investigate in detail.

"frozen" Google compute engine instance with PostgreSQL

We run several Debian instances with PostgreSQL on Google compute engine and lately we have already seen several occurrences of the following problem.
Instance becomes suddenly non responsive. We cannot ssh it and we cannot connect to the database. Internal monitoring using telegraf is also not running during that period, no monitoring data collected.
Google monitoring of CPU activity shows very low usage during that period. GCP logs do not show any migration in fact do not show anything at all. Also all internal logs for instance - postgresql log, syslog, logs from periodical cronjobs - show the same gap. Looks like the instance was sort of frozen during that time. We so far noticed it only with PostgreSQL instances since these are heavily used.
Instances run these variants of OS and PG:
Debian 9 with PG 11.9
Debian 9 with PG 10.13
These incidents usually take 10-15 minutes, but in one case it was 1:20 hours. At the end of the incident some PG process is killed by an OOM killer but activity on the database immediately before the incident starts is usually relatively low, CPU usage and memory usage too. So it looks more like an instance has limited resources when it starts again? If it is even possible...
Any idea what could be the cause of these issues or what shall we look for? As I mentioned generally no info in internal logs on Debian during the period of the incident.
UPDATE: To avoid misunderstanding - instances in question are data warehouse database running on N1-highmen-8 machine (8 CPUs and 52 GB RAM) with 5 TB SSD. Or database collecting metrics from internet - custom machine 20 CPUs with 90 GB RAM and 3 TB SSD. All SW up to date.
UPDATE 2: Neither syslog, nor kern.log nor messages do not show anything for the time intervals during instance was non responsive. Immediately after incident telegraf recorded huge average load on CPUs but actually quite small CPU usage and Google monitoring shows very small CPU usage during the whole incident. Also immediately after the end of the incident always one of postgresql processes is killed by OOM killer causing database to go to the recovery mode.
As for PG work_mem parameter - instance collecting metrics (20 CPUs 90 GB RAM, 3 TB SSD) uses 8MB - it only inserts data but usually runs like 500 - 1000 connections.
Second instance is data warehouse analytical database and uses work_mem 128MB because lower numbers caused very bad query plans on majority of queries and usually runs only like 10 - 30 connections.
There was no unusual number of connections immediately before incidents happened on both databases.
UPDATE 3: Analytical database had today several small incidents of the same character. During the last one we stopped instance from GCP GUI and started it again after few minutes. Maybe it caused migration to the different HW. Since this operation instance is running OK.
I experienced a similar issue but with a MySQL Instance in GCP, the first issue was related with the type of the VM instance I used, I had a f1-micro machine type on this VM Instance and suddenly I wasn’t able to access the ssh. As this type of VM Instance has only 0.6GB of memory, it became out of memory soon, I changed it to a e2-medium that is value by default and it resolved my problems this time.
As the Instance was out of memory the services in the instance started to fail, it was the reason that I can't access my instance.
At another time I started again with similar issues, but this time, the problem was the disk, I only had 10 GB and there was a process filling my disk, when a partition was out of space, the instance started to fail again.
I only resized my disk, now my instance disk is 20GB and is working fine.
Having said that, I suggest increasing your resources per your convenience to enhance your performance, because to have the problems you described is a good indicator that your existing machine type is not a good fit for your workloads you run on that instance.
If your situation is the same as mine, you could change the machine type to adjust your memory and you can follow the next steps for these tasks please visit the following link to get further information about it.
Changing a machine type
1.- Go to the VM Instances page.
2.- In the Name column, click your instance.
From the instance details page, complete the following steps:
a) Click the Stop button to stop the instance, if you have not stopped it yet.
b) After the instance stops, click the Edit button at the top of the page.
c) Under the Machine configuration section, select the machine type you want to use, or create a custom machine type to increase only the Memory.
d) Save your changes and start again your VM Instance.
You can resize your disk following this guide or with the following command:
gcloud compute disks resize DISK_NAME --size DISK_SIZE
Or with the Console:
Go to the Disks page to see a list of zonal persistent disks in your project.
Click the name of the disk that you want to resize.
On the disk details page, click Edit.
In the Size field, enter the new size for your disk.
Click Save to apply your changes to the disk.
After you resize the disk, you must resize the file system so that the operating system can access the additional space.
Note: Do not resize boot disks beyond 2 TB because this is the limit.
Edit1
You mentioned that the logs don’t show information about the issue when the instance is frozen.
Did you try with the kernel logs? I think it could provide a wealth of diagnostic information about this issue.
For Debian, this logs should be in the following path:
/var/log/kern.log
Also the messages log could help
/var/log/messages
You can obtain more information about the logs in this link.
Also, I think it could be a PostgreSQL config problem, for example you could take a look at "work_mem", this parameter specifies the amount of memory to be used by internal sort operations and hash tables before writing to temporary disk files. The value defaults is four megabytes (4MB).
You can consult this URL to get more information.
Also I have found a good article that explains how to configure the PostgreSQL for Data Warehouse Usage
Another option could be that the kernel process in charge of identifying memory that could be paged out. You could configure your process to check smaller chunks more often.
This link explains better this configuration.
Additionally, as far as I know a data warehouse server consumes a lot of resources, so it could be a good idea to check if your Instance has enough resources for your workload.
Edit2
I have found an article that describes a similar problem and it said that:
When you consume more memory than is available on your machine you can start to see out of out of memory errors within your Postgres logs, or in worse cases the OOM killer can start to randomly kill running processes to free up memory. An out of memory error in Postgres simply errors on the query you’re running, where as the the OOM killer in Linux begins killing running processes which in some cases might even include Postgres itself.
And this is the recommendation they give.
When you see an out of memory error you either want to increase the overall RAM on the machine itself by upgrading to a larger instance OR you want to decrease the amount of memory that work_mem uses. Yes, you read that right: out-of-memory it’s better to decrease work_mem instead of increase since that is the amount of memory that can be consumed by each process and too many operations are leveraging up to that much memory.
You could see the complete explanation of this article “Configuring memory for Postgres” here, it may help you with this issue.

How to limit memory size in mongo v3.4

In v4.0.12 or later, wiredTigerMaxCacheOverflowSizeGB option is available to specify the maximum size of "lookaside table" file.
Is there any parameter which can limit the memory usage in mongo v3.4?
wiredTigerMaxCacheOverflowSizeGB limits disk usage, not memory usage.
For memory usage on 3.4 I use:
--wiredTigerCacheSizeGB 0.25
WiredTiger memory usage: see here
Tell MongoDB how much memory exists in the system: see here
Note that generally, limiting the memory available to the database (such as through system-level configuration) is not useful because if such a limit is reached, the database process will typically immediately terminate. Instead one generally would either:
Understand how much memory is required for the workloads being executed, and provision that much memory for the database, or
Limit workloads to use the memory which is available (for example through adding indexes or sharding the data).
Limiting database memory is a database anti-pattern. The primary goal of a database is to keep your working set in memory. if you really want limit memory consumption run MongoDB inside a container with a memory cap.

Is this disk read speed to be expected (Amazon EBS)?

our Amazon EBS backed instance has slowed down considerably (maybe shifted physical host?).
I've checked the instance using top and the CPU use is very low when the process is activated (like 1%). Using iotop I have monitored the disk read speed of postgresql. When there is only one postgresql thread running it's reporting about a 5M/s read speed. Is this rather slow or is this in the parameters of usual disk read speeds?
Thanks
5MB/s its more or less the typical for one single hard drive. I mean sequential accesses of course. If you have only one hard disk then your CPU its fine since one hard disk its not enough to stress out your CPU probably, if you are not reaching any more speed than that even with constant queries then your hard disk its the bottleneck.

PostgreSQL In Memory Database

I want to run my PostgreSQL database server from memory. The reason is that on my new server, I have 24 GB of memory, and hardly any of it is used.
I know I can run this command to make a ramdisk:
mdmfs -s 1024m md2 /mnt
And I could theoretically have PostgreSQL store its data there. But the problem with this is that if the server crashes or reboots, the data will be gone.
Basically, I want the database to be loaded in memory at all times so that it does not have to go to the hard disk drive to read every record, since I have TONS of memory and since memory is faster than hard disk drives.
Is there a way to do this while also having PostgreSQL write to disk so I don't lose any data in case the server goes down? Or is there a way to cache all data in memory?
I'm now using streaming replication which is async. This means my MASTER could be running all in memory, with the separate SLAVE instance using traditional disk.
A machine restart would involve stopping the SLAVE, copying the postgresql data back into ramdisk and then restarting the MASTER followed by the SLAVE. This would be an interesting possibility which compares well with something like REDIS, but with the advantage of redundancy / hotstandby / backup / sql / rich toolset etc.
have you seen the Server Configuration manual chapter? check it out, then google postgresql memory tuning.
I have to believe that Postgres is written in such a way as to take full advantage of available RAM in the server. As you may have guessed by now, there's no reliable way to do this outside of Postgres.
Within Postgres, transactions assure that all operations are atomic, so if the power goes down while you are writing to a Postgres database, you will only lose that particular operation, and not the entire database.
The answer is caching. Look into adding memory to the server, then tuning PostgreSQL to maximize memory usage. Also, the file system cache will help with this, doing some of it automatically. You will be able to speed up performance, almost as if it were in memory except for the first hit, while not having to manage it yourself, and being able to have a database larger than the physical memory.