Postgres queries intermittently running extremely slowly - postgresql

We have some queries that are running extremely slowly intermittently in our production environment. These are JSONB intersection queries which normally return in milliseconds, but are taking 30-90 seconds.
We have tried to look at co-occurring server conditions such as RAM, CPU and query load, but there is nothing obvious. This affects a very small minority of queries - probably less than 1%. This does not appear to be a query optimization issue as the affected queries themselves are varied and in some cases very simple.
We've reproduced the same environment as far as possible on a staging server and loaded it heavily and the issue does not occur.
Can anyone suggest possible steps to investigate what is occurring in Postgres when this happens, or anything else we should consider? We have been working on this for over a week and are running out of ideas.

It is difficult to guess the cause of that problem; one explanation would be locks.
You should use auto_explain to investigate the problem.
In postgresql.conf, use the following settings:
# log if somebody has to wait for a lock for longer than deadlock_timeout (one second by default)
log_lock_waits = on
# log slow statements with their parameters
log_min_duration_statement = 1000
# log the plans of slow statements
shared_preload_libraries = 'auto_explain'
# configuration for auto_explain
auto_explain.log_nested_statements = on
auto_explain.log_min_duration = 1000
Then restart PostgreSQL.
Now all statements that exceed one second will have their plan dumped in the PostgreSQL log, so all you have to do is to wait for the problem to happen again, so that you can analyze it.
You can also get EXPLAIN (ANALYZE, BUFFERS) output if you set
auto_explain.log_buffers = on
auto_explain.log_analyze = on
That would make the log much more valuable, but it will slow down processing considerably, so I'd be reluctant to do it on a production system.
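Once the log starts filling up, a small script can help pick out the worst offenders. The following is only a sketch: it relies on the "duration: ... ms" fragment that log_min_duration_statement and auto_explain write, ignores whatever log_line_prefix you have configured, and the sample lines and the 30-second threshold are made up for illustration.

```python
import re

# Matches the "duration: <ms> ms" fragment that PostgreSQL writes for
# log_min_duration_statement and auto_explain entries. The log_line_prefix
# in front of it varies by installation, so it is deliberately not parsed.
DURATION_RE = re.compile(r"duration: (\d+(?:\.\d+)?) ms")


def slow_entries(lines, threshold_ms=30_000.0):
    """Return (duration_ms, line) pairs at or above threshold_ms, slowest first."""
    hits = []
    for line in lines:
        match = DURATION_RE.search(line)
        if match and float(match.group(1)) >= threshold_ms:
            hits.append((float(match.group(1)), line.rstrip("\n")))
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


# Hypothetical log lines, for illustration only.
SAMPLE_LOG = [
    "2024-01-01 10:00:00 UTC [101] LOG:  duration: 1250.113 ms  statement: SELECT 1",
    "2024-01-01 10:00:05 UTC [102] LOG:  duration: 45210.907 ms  plan:",
    "2024-01-01 10:00:06 UTC [103] LOG:  checkpoint starting: time",
]

if __name__ == "__main__":
    for duration_ms, line in slow_entries(SAMPLE_LOG):
        print(f"{duration_ms / 1000:.1f}s  {line}")
```

In practice you would pass an open log file instead of SAMPLE_LOG and then read the plan that follows each hit.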

Related

btreepage and MessageQueueSend wait events in DB

Team,
Can someone give me more context on the btreepage and MessageQueueSend wait events?
Whenever the query is executing, these two events show up at the top of the list, and at the same time autovacuum was being triggered on many of the TOAST tables.
I verified the execution plan of the query: it uses index scans and took 1 second.
Can you provide more details about these events?
This seems to be resource contention.
With the default setting of autovacuum_max_workers there can never be more than three autovacuum workers at the same time, so if you messed with that setting, that may be your problem.
If three autovacuum workers are enough to impact CPU or I/O performance, get stronger hardware.
More detailed advice is impossible to give, since you are telling us almost nothing about your system and your configuration.

Impact of log_duration in PostgreSQL performance

All,
I'm trying to understand the impact of the PostgreSQL server setting log_duration. Would setting it to ON cause any performance issues? I was trying to set up a Confluence server with a Postgres backend. When this value is set to ON, the response of the service is slow, whereas when I set it to OFF, it is fast.
log_min_duration_statement is set to -1 => No query texts should be logged.
Queries:
Why is it slow when it is set to ON?
Logging => does it log to any log file? If yes, where can I find it on the server?
There is another feature, pg_stat_statements.track, which can be set to TOP or ALL. Why does setting this not cause a performance issue? It is also tracking queries, although it gives only total_time and calls, not individual statements.
Could someone help me understand the impact?
Refer to this, and let me know if you need more info:
log min duration
This allows logging a sample of statements, without incurring excessive
log traffic (which may impact performance). This can be useful when
analyzing workloads with lots of short queries.
The sampling is configured using two new GUC parameters:
log_min_duration_sample - minimum required statement duration
log_statement_sample_rate - sample rate (0.0 - 1.0)
Only statements with duration exceeding log_min_duration_sample are
considered for sampling. To enable sampling, both those GUCs have to
be set correctly.
The existing log_min_duration_statement GUC has a higher priority, i.e.
statements with duration exceeding log_min_duration_statement will be
always logged, irrespective of how the sampling is configured. This
means only configurations
log_min_duration_sample < log_min_duration_statement
do actually sample the statements, instead of logging everything.
The best way to know is to measure it.
For simple select-only queries, such as pgbench -S, the impact is quite high. I get 22,000 selects per second with it off, and 13,000 with it on. For more complex select queries or for DML, the percentage impact will be much less.
Formatting and printing log messages, and the IPC to the log collector, take time. pg_stat_statements doesn't do that for every query; it just increments some counters.
Of course log_duration logs to a log file (or pipe), otherwise there wouldn't be any point in turning it on. The location is highly configurable, look in postgresql.conf for things like log_destination, log_directory, and log_filename. (But if you don't know where to find the logs, why would you even bother to turn this logging on in the first place?)
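To put the pgbench numbers above into perspective, here is a rough back-of-the-envelope calculation. It is a sketch under a simplifying assumption that the answer does not state: the throughput figures are treated as if the queries ran one after another on a single connection. With several concurrent clients the per-query figure would be correspondingly larger. The 10 ms "complex query" is a made-up example value.

```python
def per_query_overhead_us(tps_off, tps_on):
    """Extra time per query, in microseconds, implied by a throughput drop.

    Assumes the queries run serially, so time per query is 1 / throughput.
    """
    return (1.0 / tps_on - 1.0 / tps_off) * 1_000_000


def throughput_drop_pct(tps_off, tps_on):
    """Throughput lost, as a percentage of the throughput with logging off."""
    return (tps_off - tps_on) / tps_off * 100.0


def relative_overhead_pct(query_us, overhead_us):
    """Fixed per-query overhead as a percentage of the query's own run time."""
    return overhead_us / query_us * 100.0


if __name__ == "__main__":
    overhead = per_query_overhead_us(22_000, 13_000)
    print(f"overhead per query: {overhead:.1f} us")
    print(f"throughput drop:    {throughput_drop_pct(22_000, 13_000):.1f} %")
    # Hypothetical 10 ms query paying the same fixed logging cost.
    print(f"share of a 10 ms query: {relative_overhead_pct(10_000, overhead):.2f} %")
```

Under that assumption the logging costs roughly 31 microseconds per statement: a 41% throughput drop for trivial selects, but only about 0.3% of a query that takes 10 ms by itself, which is why the percentage impact shrinks for heavier statements.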

PostgreSQL limits itself to a single core CPU usage (debugging lock issue?)

Update: after some research, it seems this question was incorrect. The 100% was representing all cores, not a single core, making the whole question moot. My sincere apologies to the community.
On PostgreSQL 10, PostGIS 2.5.2, without any data modifications (SELECT queries only), I have 40 identical GIS queries running in parallel (with different params), each taking ~20-500ms. The server has lots of RAM and NVMe SSDs.
The CPU usage consistently shows 100% of a single core, implying that all queries are stuck waiting for something that cannot execute in parallel, but I am not sure how to find it.
Examining pg_stat_activity multiple times shows that all queries are active, and their wait_event could be one of these cases:
wait_event is NULL for all
a few ClientRead and lock_manager, NULL everything else
a lot of lock_manager, and a few ClientRead and NULLs.
Is there a way to figure out what may be causing this?
That is surprising, as reading queries never block on anything short of an ACCESS EXCLUSIVE lock, which is required by operations like DROP TABLE, TRUNCATE, ALTER TABLE and similar statements.
Perhaps the locks are “light-weight locks” on internal PostgreSQL data structures, which are usually only held for a very short time. I don't know what in a PostGIS query could have high contention on such internal locks, but then you didn't show the statement or its execution plan, nor did you show the exact lock events.
If you have several concurrent queries that each take a long time like 500ms, they definitely should be running in parallel.
Apart from the possibilities of some internal lock contention, I can think of two explanations:
Most of the queries are short enough that a single core suffices to process all the queries. Each connection spends most of its time waiting for the client.
The system is I/O bound, so that most of the CPUs just twiddle their thumbs. That would be indicated by a CPU iowait% of 10 or more.
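Sampling pg_stat_activity repeatedly, as described in the question, is easier to interpret if the samples are tallied. The following is a minimal sketch: it assumes you have already collected the wait_event column of the active backends into Python lists (None standing for SQL NULL, which for an active backend means it is not waiting on anything PostgreSQL instruments). The snapshot data is invented for illustration.

```python
from collections import Counter


def tally_wait_events(snapshots):
    """Count wait events across repeated pg_stat_activity snapshots.

    Each snapshot is a list of wait_event values, one per active backend,
    with None standing for SQL NULL (the backend is not waiting).
    """
    counts = Counter()
    for snapshot in snapshots:
        for wait_event in snapshot:
            counts[wait_event or "(not waiting)"] += 1
    return counts


# Hypothetical snapshots of four backends, for illustration only.
SNAPSHOTS = [
    [None, None, None, None],
    ["ClientRead", "lock_manager", None, None],
    ["lock_manager", "lock_manager", "ClientRead", None],
]

if __name__ == "__main__":
    for event, count in tally_wait_events(SNAPSHOTS).most_common():
        print(f"{event:15} {count}")
```

If "(not waiting)" dominates, the backends are mostly busy on the CPU; if one wait event dominates, that is the contention point worth reporting along with the statement and its plan.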

Postgresql explicit VACUUM vs. auto-VACUUM: Differences? Recommendations?

Quick question from a PostgreSQL (relative) newb:
We run a batch process that, as its final step, deletes most of the previous batches.
Disk space is a concern, so we need to ensure that PostgreSQL cleans up after itself.
Other than forcing PostgreSQL to garbage-collect faster, is there any difference between explicitly calling VACUUM at the end of the batch vs. letting the auto-VACUUM daemon handle it? Is there any reason to recommend one approach vs. the other?
Thanks!
Way back when, there was one vacuum, and it was full and blocking. Then the PostgreSQL guys added non-blocking vacuum. But you still had to schedule it yourself.
Then some genius made a daemon that ran vacuum automatically for you when the tables needed it. It uses the exact same vacuum command you or I would use, but has a lot of settings, especially default ones, that make it run slower and less intrusively. Primarily these settings are the number of worker processes (default 3), the cost delay (20ms for autovac, 2ms since PostgreSQL 12; 0ms for regular vac) and the autovacuum cost limit (-1, i.e. use the system setting vacuum_cost_limit, which is 200).
Therefore, regular vacuum is VERY aggressive with no cost delay, and will run as hard and fast as your IO subsystem will let it. It basically competes with your regular workload for IO bandwidth.
Generally you can do one of two things in your situation:
One: Make autovacuum more aggressive. Lowering autovacuum_vacuum_cost_delay from 20 to something in the 2 to 5 range makes it run much faster while still not getting in the way too much.
Two: Run regular vacuums by hand. Since regular vacuums, by default, have no cost_delay, this will be the fastest but also the most disruptive.
The decision is yours, based on usage patterns etc.
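To see how much the cost delay matters, here is a rough calculation of the ceiling that cost-based throttling puts on vacuum. It is a sketch, not a measurement: it assumes every page is a buffer miss charged at vacuum_cost_page_miss = 10 (the default before PostgreSQL 14; newer versions use 2), the default 8 kB block size, and it ignores the time vacuum spends doing the actual work between sleeps, so real throughput is lower.

```python
PAGE_SIZE_KB = 8  # default PostgreSQL block size (assumed)


def max_pages_per_second(cost_limit, cost_delay_ms, cost_per_page):
    """Upper bound on pages processed per second under cost-based throttling.

    Vacuum accumulates cost_per_page for every page; once cost_limit is
    reached it sleeps for cost_delay_ms. A delay of 0 disables throttling.
    """
    if cost_delay_ms == 0:
        return float("inf")
    rounds_per_second = 1000.0 / cost_delay_ms
    return rounds_per_second * cost_limit / cost_per_page


def max_mb_per_second(cost_limit, cost_delay_ms, cost_per_page):
    """The same bound expressed in megabytes per second."""
    pages = max_pages_per_second(cost_limit, cost_delay_ms, cost_per_page)
    return pages * PAGE_SIZE_KB / 1024.0


if __name__ == "__main__":
    for delay_ms in (20, 5, 2, 0):
        rate = max_mb_per_second(200, delay_ms, 10)
        print(f"cost_delay={delay_ms:>2} ms -> at most {rate} MB/s")
```

With a cost limit of 200 and a 20ms delay this works out to at most 1000 pages per second, or about 7.8 MB/s; dropping the delay to 2ms raises the ceiling tenfold, and a delay of 0 (manual VACUUM) removes it entirely.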

Postgres causing swapping on CentOS

All,
I am running CentOS 6.0 with PostgreSQL 8.4 and can't seem to figure out how to prevent so much disk swapping from occurring. I have 12 gigs of RAM and 4 processors and I am doing some simple updates (1 table at a time). I thought for a minute that the inserts happening in parallel from a script I wrote were causing the large memory usage, but when I saw the simple update causing it too I basically threw in the towel and decided to ask for help.
I pasted the conf file here. http://pastebin.com/e0jdBu0J
You can see that I set the buffers relatively low and the connection amounts high. The DB service will not start if I set the shared buffers any higher than 64 megs. Anyone have an idea what may be causing this for me?
Thanks,
Adam
If you're going into swap, increasing shared_buffers will make the problem worse; you'll be taking RAM away from the part that's running out and swapping, instead dedicating memory to the database caching. It's worth fixing SHMMAX etc. just on general principle and for later tuning work, but that's not going to help with this problem.
Guessing at the identity of your memory-gobbling source is a crapshoot. Far better to look at data from "top -c" and ps to find which processes are using a lot of it. It's possible for a really bad query to consume way more memory than it should. If you see memory use spike up for a PostgreSQL process running something, check the process ID against the information in pg_stat_activity to see what it's doing.
There are a couple of things that can cause this sort of issue that often surprise people. If you are doing a large number of row updates in a single transaction, and there are foreign key checks or triggers involved, that can run out of memory. The queue of things to check in each of those cases is kept in RAM, and can be surprisingly big.
There are two problems with your PostgreSQL settings that might be related. Databases don't actually work very well if you have a lot more active connections than cores in the server; best performance is normally 2 to 3 active clients per core. And all sorts of things go wrong once you've got more than a few hundred connections. There is some connections^2 behavior that gets ugly there performance-wise, and there are some memory issues too. If you really need 1250 connections, you should be using a connection pooler such as pgBouncer or pgpool-II.
And effective_io_concurrency = 1000 is way too high for any hardware on the planet. Useful values for that are a small multiple of how many disks you have in the server. I have no idea what happens as far as memory usage goes when you set it that high, but it's not been tested very well at that range. Normal settings are more like 1 to 25. The parameters outlined at Tuning Your PostgreSQL Server are much more important than it is; the concurrency value only impacts one particular type of table scan.
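As a quick illustration of the connection rule of thumb above (2 to 3 active clients per core), here is a small sketch applied to the 4-processor server from the question. The function name is made up; the result is a starting point for a pooler's pool size, not a tuned value.

```python
def active_client_range(cores, low_per_core=2, high_per_core=3):
    """Rule-of-thumb range of concurrently active clients for a server."""
    return cores * low_per_core, cores * high_per_core


def oversubscription(connections, cores, per_core=3):
    """How many times the configured connections exceed the upper bound."""
    return connections / (cores * per_core)


if __name__ == "__main__":
    low, high = active_client_range(4)
    print(f"4 cores: aim for {low} to {high} active clients")
    print(f"1250 connections is {oversubscription(1250, 4):.0f}x the upper bound")
```

For 4 cores that gives 8 to 12 active clients, so 1250 direct connections is about two orders of magnitude more than the hardware can usefully serve at once, which is the gap a pooler is meant to absorb.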
CentOS 6 seems to have a very conservative shmmax as a default.
Set your shared buffers to that recommended by postgres tuning resources
see for explanation and how to set.
To experiment you can (as root) use sysctl -w kernel.shmmax=n
where n is the value that the startup error message says postgres is trying to allocate. When you have identified the value you wish to use permanently, set it in /etc/sysctl.conf