I have a postgres9.5 installed on a Linux machine and for the last few days, it is continuously showing a huge CPU usage(80%-90%). I try to check pg_stat_activity table to find long-running queries or sessions but there is not anything to blame for and also I check that I have indexed on all my table but still CPU usage is a spike up from the Postgres process. Is there any way to figure out the reason?
Related
We run several Debian instances with PostgreSQL on Google compute engine and lately we have already seen several occurrences of the following problem.
Instance becomes suddenly non responsive. We cannot ssh it and we cannot connect to the database. Internal monitoring using telegraf is also not running during that period, no monitoring data collected.
Google monitoring of CPU activity shows very low usage during that period. GCP logs do not show any migration in fact do not show anything at all. Also all internal logs for instance - postgresql log, syslog, logs from periodical cronjobs - show the same gap. Looks like the instance was sort of frozen during that time. We so far noticed it only with PostgreSQL instances since these are heavily used.
Instances run these variants of OS and PG:
Debian 9 with PG 11.9
Debian 9 with PG 10.13
These incidents usually take 10-15 minutes, but in one case it was 1:20 hours. At the end of the incident some PG process is killed by an OOM killer but activity on the database immediately before the incident starts is usually relatively low, CPU usage and memory usage too. So it looks more like an instance has limited resources when it starts again? If it is even possible...
Any idea what could be the cause of these issues or what shall we look for? As I mentioned generally no info in internal logs on Debian during the period of the incident.
UPDATE: To avoid misunderstanding - instances in question are data warehouse database running on N1-highmen-8 machine (8 CPUs and 52 GB RAM) with 5 TB SSD. Or database collecting metrics from internet - custom machine 20 CPUs with 90 GB RAM and 3 TB SSD. All SW up to date.
UPDATE 2: Neither syslog, nor kern.log nor messages do not show anything for the time intervals during instance was non responsive. Immediately after incident telegraf recorded huge average load on CPUs but actually quite small CPU usage and Google monitoring shows very small CPU usage during the whole incident. Also immediately after the end of the incident always one of postgresql processes is killed by OOM killer causing database to go to the recovery mode.
As for PG work_mem parameter - instance collecting metrics (20 CPUs 90 GB RAM, 3 TB SSD) uses 8MB - it only inserts data but usually runs like 500 - 1000 connections.
Second instance is data warehouse analytical database and uses work_mem 128MB because lower numbers caused very bad query plans on majority of queries and usually runs only like 10 - 30 connections.
There was no unusual number of connections immediately before incidents happened on both databases.
UPDATE 3: Analytical database had today several small incidents of the same character. During the last one we stopped instance from GCP GUI and started it again after few minutes. Maybe it caused migration to the different HW. Since this operation instance is running OK.
I experienced a similar issue but with a MySQL Instance in GCP, the first issue was related with the type of the VM instance I used, I had a f1-micro machine type on this VM Instance and suddenly I wasn’t able to access the ssh. As this type of VM Instance has only 0.6GB of memory, it became out of memory soon, I changed it to a e2-medium that is value by default and it resolved my problems this time.
As the Instance was out of memory the services in the instance started to fail, it was the reason that I can't access my instance.
At another time I started again with similar issues, but this time, the problem was the disk, I only had 10 GB and there was a process filling my disk, when a partition was out of space, the instance started to fail again.
I only resized my disk, now my instance disk is 20GB and is working fine.
Having said that, I suggest increasing your resources per your convenience to enhance your performance, because to have the problems you described is a good indicator that your existing machine type is not a good fit for your workloads you run on that instance.
If your situation is the same as mine, you could change the machine type to adjust your memory and you can follow the next steps for these tasks please visit the following link to get further information about it.
Changing a machine type
1.- Go to the VM Instances page.
2.- In the Name column, click your instance.
From the instance details page, complete the following steps:
a) Click the Stop button to stop the instance, if you have not stopped it yet.
b) After the instance stops, click the Edit button at the top of the page.
c) Under the Machine configuration section, select the machine type you want to use, or create a custom machine type to increase only the Memory.
d) Save your changes and start again your VM Instance.
You can resize your disk following this guide or with the following command:
gcloud compute disks resize DISK_NAME --size DISK_SIZE
Or with the Console:
Go to the Disks page to see a list of zonal persistent disks in your project.
Click the name of the disk that you want to resize.
On the disk details page, click Edit.
In the Size field, enter the new size for your disk.
Click Save to apply your changes to the disk.
After you resize the disk, you must resize the file system so that the operating system can access the additional space.
Note: Do not resize boot disks beyond 2 TB because this is the limit.
Edit1
You mentioned that the logs don’t show information about the issue when the instance is frozen.
Did you try with the kernel logs? I think it could provide a wealth of diagnostic information about this issue.
For Debian, this logs should be in the following path:
/var/log/kern.log
Also the messages log could help
/var/log/messages
You can obtain more information about the logs in this link.
Also, I think it could be a PostgreSQL config problem, for example you could take a look at "work_mem", this parameter specifies the amount of memory to be used by internal sort operations and hash tables before writing to temporary disk files. The value defaults is four megabytes (4MB).
You can consult this URL to get more information.
Also I have found a good article that explains how to configure the PostgreSQL for Data Warehouse Usage
Another option could be that the kernel process in charge of identifying memory that could be paged out. You could configure your process to check smaller chunks more often.
This link explains better this configuration.
Additionally, as far as I know a data warehouse server consumes a lot of resources, so it could be a good idea to check if your Instance has enough resources for your workload.
Edit2
I have found an article that describes a similar problem and it said that:
When you consume more memory than is available on your machine you can start to see out of out of memory errors within your Postgres logs, or in worse cases the OOM killer can start to randomly kill running processes to free up memory. An out of memory error in Postgres simply errors on the query you’re running, where as the the OOM killer in Linux begins killing running processes which in some cases might even include Postgres itself.
And this is the recommendation they give.
When you see an out of memory error you either want to increase the overall RAM on the machine itself by upgrading to a larger instance OR you want to decrease the amount of memory that work_mem uses. Yes, you read that right: out-of-memory it’s better to decrease work_mem instead of increase since that is the amount of memory that can be consumed by each process and too many operations are leveraging up to that much memory.
You could see the complete explanation of this article “Configuring memory for Postgres” here, it may help you with this issue.
We're experiencing some constant outages in our back-end that seem to correlate with peaks of high CPU usage for our Cloud SQL Postgres instance (v9.6)
Taking a look to the cloudsql.googleapis.com/postgres.log, those high CPU peaks seems to also correlate to when the database is running an automatic vacuum of table cloudsqladmin.public.heartbeat
We haven't found any documentation on what this table is and why is running autovacuum so often (our own tables doesn't seem to be affected by it).
Is this normal? Should we tune the values for the autovacuum? Thanks in advance.
By looking at your graphs there is no correlation between the CPU and the cloudsqladmin.public.heartbeat autovacuum.
Lets start by what the cloudsqladmin.public.heartbeat table is, this is a table used by the Cloud SQL High Availability process, this is better explained here:
Each second, the primary instance writes to a system database as a
heartbeat signal.
So the table is used internally to keep track of your instance's health. The autovacuum is triggered based on the doc David shared.
Now, if the Vacuum process generated the CPU spike, you would see the spike every minute/second.
So, straight answers to your questions:
Is this normal? : Yes, the autovacuum and the cloudsqladmin.public.heartbeat table are completely normal from a Cloud SQL internal perspective, they should not impact in any way the Instance.
Should we tune the values for the autovacuum? : No need for that, as mentioned, this process is not the one impacting the CPU Instance, you can hide the similar logs including "cloudsqladmin.public.heartbeat" and analyze the ones left on the time the Spike was presented.
It is worth looking at the backup processes triggered too (there could be one on the same time) Cloud SQL > Instance Details > Backups, but of course, that's a different topic than the one described here :) .
Here's a recommendation that seems very relevant to your situation: https://www.netiq.com/documentation/cloud-manager-2-5/ncm-install/data/vacuum.html
We have a large table (1.6T) and deleted 60% of the records, and want to reclaim that space for the OS and file system. We're running PostgreSQL 9.4 (we're stuck on that pending a major software upgrade).
We need that space, as we're down to 100GB and when materialized views are refreshed we're running out of space on the server.
I tried running VACUUM(FULL, ANALYZE, VERBOSE) schema.tablename and let it run for 24 hours last weekend, but had to cancel it to get the server back online.
I'm running it again this weekend, after deleting the indexes (I'm hoping that will speed it up so it will finish). So far there is no output or indication of progress. I created a tablespace on another SSD array and set it up as temp space using temp_tablespaces = 'name_of_other_tablespaces', but du -chs shows it is still empty.
The query shows active, but since disk usage isn't increasing it just feels like it's just sitting there, making no noise and pretending it's not there.
This is on a server with 512GB of RAM and a RAID 10 array of very fast enterprise SSDs. Is there any way to get progress and know that something is actually happening and that it's working? Any guesses as to duration, or other suggestions?
I found out what was happening, by finally noticing that it was waiting for an autovacuum process to finish, which never happened (autovacuum: VACUUM pg_toast.pg_toast_nnnnn (to prevent wraparound)). Once I killed that the VACUUM ran quite quickly and cleared up over 1TB of space. Time to celebrate!
I have a materialized view based on a very complex (and not very efficient) query. This materialized view is used for BI/visualization. It often takes ~4 minutes to complete the refresh, which is good enough for my needs. Running ANALYZE shows total cost of 2,116,446 with 137,682 rows and 1,976 width.
However, sometimes refresh materialize view XXX just never completes. Looking at the top processes (top in ubuntu), the process will use 100% CPU and 8.1% (of the server's 28GB memory) for a while... then all of a sudden it just disappears from the top list. It usually happens after ~4-5 minutes (although statement_timeout config is disabled). The postgres client just keeps waiting forever, and the view never refreshes.
Running the query behind the view directly (i.e. SELECT ...) will fail as well (same issue).
I'm using version 9.5. I've tried to increase effective_cache_size, shared_buffers, and work_mem in postgres config, but still the same result.
Sometimes, after several attempts, the refresh command will complete successfully. But it's unpredictable and currently just wouldn't work even after multiple attempts / db restarts.
Any suggestions to what might be the problem?
All,
I am running CentOS 6.0 with Postgresql 8.4 and can't seem to figure out how to prevent so much disc swap from occurring. I have 12 gigs of RAM and 4 processors and I am doing some simple updates (1 table at a time). I thought for a minute that the inserts happening in parallel from a script I wrong was causing the large memory usage but when I saw the simple update causing it too I basically threw in the towel and decided to ask for help.
I pasted the conf file here. http://pastebin.com/e0jdBu0J
You can see that I set the buffers relatively low and the connection amounts high. The DB service will not start if I set the shared buffers any higher than 64 megs. Anyone have an idea what may be causing this for me?
Thanks,
Adam
If you're going into swap, increasing shared_buffers will make the problem worse; you'll be taking RAM away from the part that's running out and swapping, instead dedicating memory to the database caching. It's worth fixing SHMMAX etc. just on general principle and for later tuning work, but that's not going to help with this problem.
Guessing at the identify of your memory gobbling source is a crapshoot. Far better to look at data from "top -c" and ps to find which processes are using a lot of it. It's possible for a really bad query to consume way more memory than it should. If you see memory use spike up for a PostgreSQL process running something, check the process ID against the information in pg_stat_tables to see what it's doing.
There are a couple of things that can cause this sort of issue that often surprise people. If you are doing a large number of row updates in a single transaction, and there are foreign key checks or triggers involved, that can run out of memory. The queue of things to check in each of those cases is kept in RAM, and can be surprisingly big.
There are two problems with your PostgreSQL settings that might be related. Databases don't actually work very well if you have a lot more active connections than cores in the server; best performance is normally 2 to 3 active clients per core. And all sorts of things go wrong once you've got more than a few hundred connection. There is some connections^2 behavior that gets ugly there performance wise, and there are some memory issues too. If you really need 1250 connections, you should be using a connection pooler such as pgBouncer or pgpool-II.
And effective_io_concurrency = 1000 is way too high for any hardware on the planet. Useful values for that in a small multiple of how many disks you have in the server. I have no idea what happens as far as memory usage goes when you set it that high, but it's not been tested very well at that range. Normal settings more like 1 to 25. The parameters outlined at Tuning Your PostgreSQL Server are much more important than it is; the concurrency value only impacts one particular type of table scan.
Centos 6 seems to have a very conservative shmmax as a default
Set your shared buffers to that recommended by postgres tuning resources
see for explanation and how to set.
To experiment you can (as root) use sysctl -w kernel.shmmax = n
where n is the value that the startup error message that postgres is trying to allocate on startup. When you identify the value you wish to use permanently then set that in /etc/sysctl.conf