New Relic shows I/O spikes of up to 100% reads.
The rest of the time it's close to 0.
The spikes don't coincide with our cron jobs (database backup, analyze, etc.), and they almost never match autovacuum runs either (pg_stat_all_tables.last_autovacuum).
Querying pg_stat_activity also shows nothing but the simple queries sent by our Ruby application:
select *
from pg_stat_activity
where query_start between '2017-09-25 09:00:00' and '2017-09-25 11:00:00';
And these queries are not even among the slow queries in our log file.
Please advise how to identify which processes cause this I/O load.
Related
I have PostgreSQL 9.5 installed on a Linux machine, and for the last few days it has been continuously showing huge CPU usage (80%-90%). I checked pg_stat_activity for long-running queries or sessions, but found nothing to blame. I also checked that all my tables have indexes, but CPU usage from the Postgres process still spikes. Is there any way to figure out the reason?
Team,
Can someone provide more context on the wait events btreepage and MessageQueueSend?
Whenever the query is executing, these two events show at the top of the list, and at the same time autovacuum was being triggered on many of the TOAST tables.
I verified the execution plan of the query: it uses index scans and takes 1 second.
Can you provide more details about these events?
This seems to be resource contention.
With the default setting of autovacuum_max_workers there can never be more than three autovacuum workers at the same time, so if you messed with that setting, that may be your problem.
If three autovacuum workers are enough to impact CPU or I/O performance, get stronger hardware.
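To check whether the setting was changed and how many workers are busy at a given moment, something like the following could be run. This is only a sketch: it relies on autovacuum workers appearing in pg_stat_activity with a query text that starts with "autovacuum:", and it shows a single point in time.

```sql
-- Current worker limit (the default is 3)
SHOW autovacuum_max_workers;

-- Autovacuum workers running right now;
-- their query text starts with "autovacuum:"
SELECT pid, datname, query_start, query
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%';
```

Running the second statement repeatedly while the wait events show up gives a rough idea of whether the two really coincide.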
More detailed advice is impossible to give, since you are telling us almost nothing about your system and your configuration.
Update: after some research, it seems this question was incorrect. The 100% represented all cores, not a single core, which makes the whole question moot. My sincere apologies to the community.
On PostgreSQL 10, PostGIS 2.5.2, without any data modifications (SELECT queries only), I have 40 identical GIS queries running in parallel (with different params), each taking ~20-500ms. The server has lots of RAM and NVMe SSDs.
The CPU usage consistently shows 100% of a single core, implying that all queries are stuck waiting for something that cannot execute in parallel, but I am not sure how to find it.
Examining pg_stat_activity multiple times shows that all queries are active, and their wait_event could be one of these cases:
wait_event is NULL for all
a few ClientRead and lock_manager, NULL everything else
a lot of lock_manager, and a few ClientRead and NULLs.
Is there a way to figure out what may be causing this?
That is surprising, as reading queries never lock on anything short of an ACCESS EXCLUSIVE lock that is required by operations like DROP TABLE, TRUNCATE, ALTER TABLE and similar statements.
Perhaps the locks are “light-weight locks” on internal PostgreSQL data structures, which are usually only held for a very short time. I don't know what in a PostGIS query could have high contention on such internal locks, but then you didn't show the statement or its execution plan, nor did you show the exact lock events.
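To capture the exact wait events, a sampling query along these lines could be run several times while the load is on. It is a sketch that assumes PostgreSQL 9.6 or later, where pg_stat_activity has the wait_event_type and wait_event columns; the alias sessions is just a name chosen here.

```sql
-- One sample of what the active backends are waiting on.
-- A NULL wait_event means the backend is not waiting (it is running on a CPU).
SELECT wait_event_type, wait_event, count(*) AS sessions
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()   -- leave out this monitoring session
GROUP BY wait_event_type, wait_event
ORDER BY sessions DESC;
```

If lock_manager is reported with wait_event_type LWLock, that would point to contention on an internal light-weight lock rather than on a regular table lock.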
If you have several concurrent queries that each take a long time like 500ms, they definitely should be running in parallel.
Apart from the possibilities of some internal lock contention, I can think of two explanations:
Most of the queries are short enough that a single core suffices to process all the queries. Each connection spends most of its time waiting for the client.
The system is I/O bound, so that most of the CPUs just twiddle their thumbs. That would be indicated by a CPU iowait% of 10 or more.
I have some PostgreSQL queries running on RDS. 90% of the time these queries run fine. Occasionally, however, they time out at random and do not complete. I have enabled logging and auto_explain, but auto_explain only logs query plans for queries that complete. If I increase statement_timeout, the queries still time out at random intervals with no explanation.
Has anyone seen this issue before or have any idea how to analyse queries that do not complete?
I am trying to measure the load that various databases living on the same Postgres server are incurring, to determine how to best split them up across multiple servers. I devised this query:
select
now() as now,
datname as database,
usename as user,
count(*) as processes
from pg_stat_activity
where state = 'active'
and waiting = 'f'
and query not like '%from pg_stat_activity%'
group by
datname,
usename;
But there were surprisingly few active processes!
Digging deeper I ran a simple query that returns 20k rows and took 5 seconds to complete, according to the client I ran it from. When I queried pg_stat_activity during that time, the process was idle! I repeated this experiment several times.
The Postgres documentation says active means
The backend is executing a query.
and idle means
The backend is waiting for a new client command.
Is it really more nuanced than that? Why was the process running my query not active when I checked in?
If this approach is flawed, what alternatives are there for measuring load at database granularity, other than periodically sampling the number of active processes?
Your expectations regarding active, idle and idle in transaction are quite right. The only explanation I can think of is a huge delay in displaying the data on the client side: the query has indeed finished on the server and the session is idle, yet you don't see the result in the client yet.
Regarding the load measurement: I would not rely much on the number of active sessions. It is pure luck to catch a fast query in the active state. For example, hypothetically you could check pg_stat_activity every second and see one active session, while between measurements one database was queried 10 times and another once, and none of those executions would be seen because they were active between samples. And those 10+1 active states (although they mean that one database is queried 10 times more often) do not mean you should worry about load at all, because the cluster is so lightly loaded that you can't even catch the executions. By the same token, you can catch many active sessions and it would still not mean that the server is really loaded.
So at least add now() - query_start to your query to catch longer queries. Better still, record the execution time of some frequently run queries and measure whether it degrades over time. Or, better, select the pid and check the resources consumed by that process.
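A minimal sketch of that first suggestion, using only columns that the query in the question already relies on (the alias runtime is just a name chosen here):

```sql
-- Active sessions, longest-running first.
-- The pid can then be looked up with OS tools to see its CPU and I/O usage.
SELECT pid,
       datname,
       usename,
       now() - query_start AS runtime,
       query
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()   -- leave out this monitoring session
ORDER BY runtime DESC;
```

This is still a point-in-time sample, so short queries will be missed most of the time.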
By the way, for longer queries look into pg_stat_statements: watching how its numbers change over time can give you an idea of how the load changes.
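For the per-database split asked about in the question, the cumulative counters in pg_stat_statements could be aggregated roughly like this. It is a sketch under a few assumptions: the extension is listed in shared_preload_libraries and has been created in the database you query from, and the timing column is called total_time (from PostgreSQL 13 on it is total_exec_time instead).

```sql
-- Cumulative calls and execution time per database since the last stats reset.
SELECT d.datname,
       sum(s.calls)      AS calls,
       sum(s.total_time) AS total_ms   -- total_exec_time on PostgreSQL 13+
FROM pg_stat_statements AS s
JOIN pg_database AS d ON d.oid = s.dbid
GROUP BY d.datname
ORDER BY total_ms DESC;
```

Because these are counters rather than samples, storing the result periodically and comparing the differences should show how the load per database develops, including queries too short to catch in pg_stat_activity.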