We're experiencing some constant outages in our back-end that seem to correlate with peaks of high CPU usage for our Cloud SQL Postgres instance (v9.6)
Taking a look to the cloudsql.googleapis.com/postgres.log, those high CPU peaks seems to also correlate to when the database is running an automatic vacuum of table cloudsqladmin.public.heartbeat
We haven't found any documentation on what this table is and why is running autovacuum so often (our own tables doesn't seem to be affected by it).
Is this normal? Should we tune the values for the autovacuum? Thanks in advance.
By looking at your graphs there is no correlation between the CPU and the cloudsqladmin.public.heartbeat autovacuum.
Lets start by what the cloudsqladmin.public.heartbeat table is, this is a table used by the Cloud SQL High Availability process, this is better explained here:
Each second, the primary instance writes to a system database as a
heartbeat signal.
So the table is used internally to keep track of your instance's health. The autovacuum is triggered based on the doc David shared.
Now, if the Vacuum process generated the CPU spike, you would see the spike every minute/second.
So, straight answers to your questions:
Is this normal? : Yes, the autovacuum and the cloudsqladmin.public.heartbeat table are completely normal from a Cloud SQL internal perspective, they should not impact in any way the Instance.
Should we tune the values for the autovacuum? : No need for that, as mentioned, this process is not the one impacting the CPU Instance, you can hide the similar logs including "cloudsqladmin.public.heartbeat" and analyze the ones left on the time the Spike was presented.
It is worth looking at the backup processes triggered too (there could be one on the same time) Cloud SQL > Instance Details > Backups, but of course, that's a different topic than the one described here :) .
Here's a recommendation that seems very relevant to your situation: https://www.netiq.com/documentation/cloud-manager-2-5/ncm-install/data/vacuum.html
Related
Update after some research, it seems this question was incorrect - the 100% was representing all cores, not a single core, making the whole question moot. My sincere apologies to the community.
On PostgreSQL 10, PostGIS 2.5.2, without any data modifications (SELECT queries only), I have 40 identical GIS queries running in parallel (with different params), each taking ~20-500ms. Server has lots of RAM, NVME SSDs.
The CPU usage consistently shows 100% of a single core, implying that all queries are stuck waiting for something that cannot execute in parallel, but I am not sure how to find it.
Examining pg_stat_activity multiple times shows that all queries are active, and their wait_event could be one of these cases:
wait_event is NULL for all
a few ClientRead and lock_manager, NULL everything else
a lot of lock_manager, and a few ClientRead and NULLs.
Is there a way to figure out what may be causing this?
That is surprising, as reading queries never lock on anything short of an ACCESS EXCLUSIVE lock that is required by operations like DROP TABLE, TRUNCATE, ALTER TABLE and similar statements.
Perhaps the locks are “light-weight locks” on internal PostgreSQL data structures, which are usually only held for a very short time. I don't know what in a PostGIS query could have high contention on such internal locks, but then you didn't show the statement or its execution plan, nor did you show the exact lock events.
If you have several concurrent queries that each take a long time like 500ms, the definitely should be running in parallel.
Apart from the possibilities of some internal lock contention, I can think of two explanations:
Most of the queries are short enough that a single core suffices to process all the queries. Each connection spends most of its time waiting for the client.
The system is I/O bound, so that most of the CPUs just twiddle their thumbs. That would be indicated by a CPU iowait% of 10 or more.
I wanna have brainstorm with you guys all about scaling option that DB2 have. Hope can helped me to resolve the problem.
I need to scale my DB2 database to anticipate flash crowd transaction to database server. My database can only serve around 200 transaction++ per sec in application term not database tps before my database totally stalled and out of cpu.
What are you guys think, if I want to increase to reach 2000++ or 10 times before, what options i have to scale my database?
Recently i read about pureScale feature. Its look promising but its not flexible solution by mean it just can be deploy on IBM System X and ours is not. Are there other solution like pureScale in shared-everything approach?
The second option maybe database partition. Is database partition or shared-nothing approach can help resolve my problem? Can add processing power to my system?
Thanks and regards,
Fritz
Before you worry about how to scale up (more hardware in 1 server) or out (more servers), look at how to tune your database. Buying your way out of a performance problem is almost always more expensive than spending time to find and fix the performance problem.
Assuming that the process(es) consuming CPU on your database server are the database engine, then high CPU activity and low I/O activity is indicative that you're doing a LOT of reads, but they are just all in memory. Scanning a huge table is still in inefficient, even if that table is stored completely in memory (buffer pools).
Find the SQL statements that are using the most CPU. Look at the explain plans, and figure out how to make them more efficient. There are LOTS of resources on the web for database performance tuning.
All,
I am running CentOS 6.0 with Postgresql 8.4 and can't seem to figure out how to prevent so much disc swap from occurring. I have 12 gigs of RAM and 4 processors and I am doing some simple updates (1 table at a time). I thought for a minute that the inserts happening in parallel from a script I wrong was causing the large memory usage but when I saw the simple update causing it too I basically threw in the towel and decided to ask for help.
I pasted the conf file here. http://pastebin.com/e0jdBu0J
You can see that I set the buffers relatively low and the connection amounts high. The DB service will not start if I set the shared buffers any higher than 64 megs. Anyone have an idea what may be causing this for me?
Thanks,
Adam
If you're going into swap, increasing shared_buffers will make the problem worse; you'll be taking RAM away from the part that's running out and swapping, instead dedicating memory to the database caching. It's worth fixing SHMMAX etc. just on general principle and for later tuning work, but that's not going to help with this problem.
Guessing at the identify of your memory gobbling source is a crapshoot. Far better to look at data from "top -c" and ps to find which processes are using a lot of it. It's possible for a really bad query to consume way more memory than it should. If you see memory use spike up for a PostgreSQL process running something, check the process ID against the information in pg_stat_tables to see what it's doing.
There are a couple of things that can cause this sort of issue that often surprise people. If you are doing a large number of row updates in a single transaction, and there are foreign key checks or triggers involved, that can run out of memory. The queue of things to check in each of those cases is kept in RAM, and can be surprisingly big.
There are two problems with your PostgreSQL settings that might be related. Databases don't actually work very well if you have a lot more active connections than cores in the server; best performance is normally 2 to 3 active clients per core. And all sorts of things go wrong once you've got more than a few hundred connection. There is some connections^2 behavior that gets ugly there performance wise, and there are some memory issues too. If you really need 1250 connections, you should be using a connection pooler such as pgBouncer or pgpool-II.
And effective_io_concurrency = 1000 is way too high for any hardware on the planet. Useful values for that in a small multiple of how many disks you have in the server. I have no idea what happens as far as memory usage goes when you set it that high, but it's not been tested very well at that range. Normal settings more like 1 to 25. The parameters outlined at Tuning Your PostgreSQL Server are much more important than it is; the concurrency value only impacts one particular type of table scan.
Centos 6 seems to have a very conservative shmmax as a default
Set your shared buffers to that recommended by postgres tuning resources
see for explanation and how to set.
To experiment you can (as root) use sysctl -w kernel.shmmax = n
where n is the value that the startup error message that postgres is trying to allocate on startup. When you identify the value you wish to use permanently then set that in /etc/sysctl.conf
I have recently had to install slony (version 2.0.2) at work. Everything works fine, however, my boss would like to lower the cpu usage on slave nodes during replication. Searching on the net does not reveal any blatantly obvious answers to this. Any suggestions that would help reduce CPU usage (or spread the update out over a longer period) would be very much appreciated!
Have you looked into general PostgreSQL tuning here? The server can waste a lot of CPU cycles doing redundant work if it's not given enough resources to work with, and the default config is extremely small. Tuning Your PostgreSQL Server is a useful guide here, shared_buffers and checkpoint_segments are the two parameters you might get some significant improvement from on a slave (many of the rest only really help for improving query time).
Magnus might be right, this could very well just be a symptom of the fact that your database has very high traffic. Slony effectively multiplies the resource usage of any given DML operation: not only is data CRUD'ed to the replication master, but every time that happens, a Slony trigger (think of it as a change listener) generates an identical transaction and forwards it to the Slon process, which runs it on other members of the cluster.
However, there are two other possible explanations/solutions to this issue:
A possible solution might be to run the slon processes on a separate machine from your database hosts. Even if you have a single-master/single-slave replication scheme, it is advantageous in terms of stability, role-segregation, and performance (that’s you) to run the slon replication daemons on a physically different set of hardware (on the same LAN segment, ideally). There is nothing about Slony that says it has to run on the same machine as a given database host, so putting it in a different location (think “traffic controller”) might relieve some of the resource load on your database hosts. This is also a good idea in terms of both machine stability and scalability.
There's also a chance that this is only a temporary problem caused by the fact that you recently started using Slony. When you first subscribe a new node to a replication set, that node (and, to some extent, its parent) experiences VERY heavy CPU load (and possibly disk load as well) during the subscription process. I'm not sure how it works under the covers, but, depending on how much data was already on the node subscribed, Slony will either check the master’s data against every single piece of data present on the slave in tables that are replicated, and copy data down to the slave if it is missing or different. These are potentially CPU-intensive operations. Especially in large databases, the process of subscription can take a very long time (it took over a day for me, but our database is over 20GB), during which CPU load will be very high. A simple way to see what Slony is up to is to use pgAdmin’s Server Status viewer, which, while limited, will give you some useful info here. If there are a lot of “prepare table for replication” or “cleanup table after replication” operations in progress on the node that has a high CPU load, it’s probably because a subscription isn’t complete. pgAdmin’s status viewer isn’t too informative, however; there are more reliable ways of checking subscription progress using Slony directly. Section 4.7.6.4 in the Slony log-analysis documentation might help with that, as would reading the doc for SUBSCRIBE SET (pay special attention to the boxed warning message, and the "Dangerous/Unintuitive Behavior" section. A simple yet definitive hack to tell whether a set is still in the process of subscriptions is to run a MERGE SET and try to merge it with an empty (or not) other set. MERGE SET will fail with a "subscriptions in progress" error if subscription is still running. However, that hack won't work on Slony 2.1; MERGE SET will just wait until subscriptions are finished.
The best way to reduce the CPU usage would be to put less data into the database :-)
Other than that, you can experiment with sync_interval. It may be what you're looking for.
I have a C++ application which is making use of PostgreSQL 8.3 on Windows. We use the libpq interface.
We have a multi-threaded app where each thread opens a connection and keeps using without PQFinish it.
We notice that for each query (especially the SELECT statements) postgres.exe memory consumption would go up. It goes up as high as 1.3 GB. Eventually, postgres.exe crashes and forces our program to create a new connection.
Has anyone experienced this problem before?
EDIT: shared_buffer is currently set to be 128MB in our conf. file.
EDIT2: a workaround that we have in place right now is to call PQfinish for every transaction. But then, this slows down our processing a bit since establishing a connection every time is quite slow.
In PostgreSQL, each connection has a dedicated backend. This backend not only holds connection and session state, but is also an execution engine. Backends aren't particularly cheap to leave lying around, and they cost both memory and synchronization overhead even when idle.
There's an optimum number of actively working backends for any given Pg server on any given workload, where adding more working backends slows things down rather than speeding it up. You want to find that point, and limit the number of backends to around that level. Unfortunately there's no magic recipe for this, it mostly involves benchmarking - on your hardware and with your workload.
If you need more connections than that, you should use a proxy or pooling system that allows you to separate "connection state" from "execution engine". Two popular choices are PgBouncer and PgPool-II . You can maintain light-weight connections from your app to the proxy/pooler, and let it schedule the workload to keep the database server working at its optimum load. If too many queries come in, some wait before being executed instead of competing for resources and slowing down all queries on the server.
See the postgresql wiki.
Note that if your workload is read-mostly, and especially if it has items that don't change often for which you can determine a reliable cache invalidation scheme, you can also potentially use memcached or Redis to reduce your database workload. This requires application changes. PostgreSQL's LISTEN and NOTIFY will help you do sane cache invalidation.
Many database engines have some separation of execution engine and connection state built in to the core database engine's design. Sybase ASE certainly does, and I think Oracle does too, but I'm not too sure about the latter. Unfortunately, because of PostgreSQL's one-process-per-connection model it's not easy for it to pass work around between backends, making it harder for PostgreSQL to do this natively, so most people use a proxy or pool.
I strongly recommend that you read PostgreSQL High Performance. I don't have any relationship/affiliation with Greg Smith or the publisher*, I just think it's great and will be very useful if you're concerned about your DB's performance.
* ... well, I didn't when I wrote this. I work for the same company now.
The memory usage is not necessarily a problem. PostgreSQL uses shared memory for some caching, and this memory does not count towards the size of the process memory usage until it's actually used. The more you use the process, the larger parts of the shared buffers will be active in it's address space.
If you have a large value for shared_buffers, this will happen. If you have it too large, the process can run out of address space and crash, yes.
The problem is probably that you don't close the transaction,
In PostgreSQL even if you do only selects without DML it runs in transaction which need to be rollback.
By adding rollback at the end of the transaction will reduce your memory problem