I have some PSQL queries running on RDS. 90% of the time these queries run fine; occasionally, however, they time out at random and never execute. I have enabled logging and auto_explain, but auto_explain only logs plans for queries that complete. If I increase statement_timeout, the queries still time out at random intervals with no explanation.
Has anyone seen this issue before, or does anyone have an idea how to analyse queries that do not complete?
Related
I'm running a count(*) query that executes correctly, but for some reason, when I attempt to export the result to CSV, I encounter SQL Error [40001]. Any ideas what the issue could be?
You are running a long query on a standby server, and some of the modifications replicated from the primary server conflict with it. In particular, VACUUM removed old row versions that your query might still need to see.
PostgreSQL has to make a choice: either delay applying the changes from the primary, or cancel the query that blocks replication.
How PostgreSQL behaves is determined by the parameter max_standby_streaming_delay. The default value gives the query 30 seconds to finish before it is canceled.
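One way to confirm that recovery conflicts are what is cancelling your queries is to look at the per-database conflict counters on the standby; snapshot conflicts correspond to the VACUUM case described above:

-- run this on the standby
select datname, confl_snapshot, confl_lock, confl_bufferpin
from pg_stat_database_conflicts;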
You have three options:
1. Retry the query and hope it succeeds this time.
2. Increase max_standby_streaming_delay on the standby (a configuration sketch follows below). The risk you are running is that replication will fall behind.
3. Set the parameter hot_standby_feedback to on on the standby; then the primary won't VACUUM row versions the standby might still need. The risk you are running is table bloat on the primary, because autovacuum cannot do its job.
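For options 2 and 3, a minimal configuration sketch for the standby (the delay value is illustrative, not a recommendation; on managed hosting such as RDS you would set these through the parameter group instead of editing the file):

# postgresql.conf on the standby -- values are illustrative
max_standby_streaming_delay = 600s   # allow queries up to 10 minutes before cancellation
hot_standby_feedback = on            # tell the primary which row versions the standby still needs

Both parameters take effect on a configuration reload; no restart is needed.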
We have a MongoDB cluster with 5 PSA replica sets and one sharded collection: about 3.5 TB of data and 2 billion documents on the primaries. Average insert rate: 300 rps. Average select rate: 1000 rps. MongoDB version 4.0.6. The collection has only one extra unique index, and all read queries use one of the indexes (no long-running queries).
PROBLEM. Sometimes (4 times in the last 2 months) one of the nodes stops responding to queries that specify a read concern or write concern. The same query without a read/write concern executes successfully, whether run locally or through mongos. The affected queries never complete: no errors, no timeouts, even after restarting the mongos that initiated the query. There are no errors in the mongod logs and none in the system logs. Restarting the node fixes the problem. MongoDB regards the broken node as healthy; rs.status() shows that everything is OK.
We have no idea how to reproduce this problem; much more intense load testing has produced nothing.
We would appreciate any help and suggestions.
We have some queries that intermittently run extremely slowly in our production environment. These are JSONB intersection queries which normally return in milliseconds but occasionally take 30-90 seconds.
We have looked at co-occurring server conditions such as RAM, CPU, and query load, but there is nothing obvious. This affects a very small minority of queries, probably less than 1%. It does not appear to be a query-optimization issue, as the affected queries are varied and in some cases very simple.
We've reproduced the environment as closely as possible on a staging server and loaded it heavily, and the issue does not occur there.
Can anyone suggest possible steps to investigate what is occurring in Postgres when this happens, or anything else we should consider? We have been working on this for over a week and are running out of ideas.
It is difficult to guess the cause of that problem; one explanation would be locks.
You should use auto_explain to investigate the problem.
In postgresql.conf, use the following settings:
# log if somebody has to wait for a lock for more than one second
log_lock_waits = on
# log slow statements with their parameters
log_min_duration_statement = 1000
# load auto_explain so the plans of slow statements can be logged
shared_preload_libraries = 'auto_explain'
# configuration for auto_explain
auto_explain.log_nested_statements = on
auto_explain.log_min_duration = 1000
Then restart PostgreSQL.
Now all statements that run for more than one second will have their plan dumped in the PostgreSQL log; all you have to do is wait for the problem to happen again so that you can analyze it.
You can also get EXPLAIN (ANALYZE, BUFFERS) output if you set
auto_explain.log_buffers = on
auto_explain.log_analyze = on
That would make the log much more valuable, but it will slow down processing considerably, so I'd be reluctant to do it on a production system.
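If you'd rather try these settings before touching shared_preload_libraries, a sketch: auto_explain can also be loaded into a single session (superuser required), which lets you verify the log output without a server restart:

-- load the module for the current session only
LOAD 'auto_explain';
-- log plans of statements in this session that run longer than one second
SET auto_explain.log_min_duration = 1000;
SET auto_explain.log_analyze = on;
SET auto_explain.log_buffers = on;

Any sufficiently slow query you run in that session will then have its plan written to the server log.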
I am trying to measure the load that various databases living on the same Postgres server are incurring, to determine how best to split them up across multiple servers. I devised this query:
select
    now() as now,
    datname as database,
    usename as user,
    count(*) as processes
from pg_stat_activity
where state = 'active'                            -- backends currently executing a query
  and waiting = 'f'                               -- not blocked waiting for a lock (column removed in 9.6)
  and query not like '%from pg_stat_activity%'    -- exclude this monitoring query itself
group by
    datname,
    usename;
But there were surprisingly few active processes!
Digging deeper, I ran a simple query that returned 20k rows and took 5 seconds to complete, according to the client I ran it from. When I queried pg_stat_activity during that time, the process was idle! I repeated this experiment several times.
The Postgres documentation says active means
The backend is executing a query.
and idle means
The backend is waiting for a new client command.
Is it really more nuanced than that? Why was the process running my query not active when I checked in?
If this approach is flawed, what alternatives are there for measuring load at a database granularity other than periodically sampling the number of active processes?
Your expectations regarding active, idle, and idle in transaction are correct. The only explanation I can think of is a large delay in displaying the data client-side: the query has indeed finished on the server and the session is idle, yet you don't see the result in the client.
Regarding load measurement: I would not rely much on the number of active sessions. Catching a fast query in the active state is pure luck. For example, you might sample pg_stat_activity every second and see one active session while, between two samples, one database was queried 10 times and another once; none of those executions is seen, because they started and finished between measurements. Those 10+1 executions (although they mean one database is queried 10 times more often) say little about load: the cluster is so lightly loaded that you cannot even catch the executions. Conversely, catching many active sessions does not necessarily mean the server is heavily loaded.
So at the very least, add now() - query_start to your query to catch longer-running queries. Better, record the execution time of some frequent queries and check whether it degrades over time. Or select the pid and check the resources consumed by that process.
By the way, for longer queries look into pg_stat_statements; watching how its numbers change over time can give you a sense of how the load changes.
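A sketch of both suggestions, assuming pg_stat_statements is loaded via shared_preload_libraries and created with CREATE EXTENSION (column names follow the pre-v13 naming; newer versions use total_exec_time instead of total_time):

-- catch longer-running statements, with how long they have been running
select pid, datname, usename, now() - query_start as runtime, query
from pg_stat_activity
where state = 'active'
order by runtime desc;

-- cumulative per-statement statistics; compare snapshots over time to spot degradation
select calls, total_time, total_time / calls as avg_ms, query
from pg_stat_statements
order by total_time desc
limit 10;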
New Relic shows I/O spikes of up to 100% reads.
The rest of the time it's close to 0.
The spikes don't coincide with our cron jobs (database backup, analyze, etc.), and they almost never match autovacuum activity either (pg_stat_all_tables.last_autovacuum).
Querying pg_stat_activity also shows nothing but the simple queries sent by our Ruby application:
select *
from pg_stat_activity
where query_start between '2017-09-25 09:00:00' and '2017-09-25 11:00:00';
And these queries do not even appear among the slow queries in our log file.
Please advise how to identify which processes cause such I/O load.
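One place to start, as a sketch (assuming you can install the pg_stat_statements extension): rank statements by the blocks they actually read from disk rather than by duration, since an I/O-heavy query need not be slow enough to reach the slow-query log:

-- statements ordered by blocks read from disk (i.e. not found in shared buffers)
select calls,
       shared_blks_read,   -- blocks read from outside shared buffers
       shared_blks_hit,    -- blocks served from shared buffers
       blk_read_time,      -- ms spent reading; populated only when track_io_timing = on
       query
from pg_stat_statements
order by shared_blks_read desc
limit 10;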