What could be causing high cpu in google spanner databases? (unresponsive table) - gcloud

I'm facing an issue where 2 out of 10 spanner databases are showing a high CPU usage (above 40%) whereas the others are around %1 each, with almost identical or more data.
I notice one of our tables has become "unresponsive" no queries work against it. We shutdown all apps that connect to those dbs, and we also deleted all current sessions using gcloud sessions list and then gcloud session delete.
However the table is still unresponsive. A simple select like select id from mytable where name = 'test' is not responding (when tested from an app, and also from gcloud web interface), it only happens with that table, which has only a few columns with normal data and no more than 2000 records. We identified the query that could have been the source of the problem, however the table seems to be locked (only count(*) without any where clause works).
I was wondering if there is any way to "unlock" the table, kill those "transactions" that might be causing the issue, or restart those specific spanner databases, or in the worst case scenario restarting the spanner instance.
I have seen the monitoring high cpu documentation, but even if we can identify the cpu is high, we don't really know how to restart or make it back to normal before reviewing the query/ies that could have caused the issue (if that was the case).

High CPU can be caused by user initiated queries, from different types of operations. It is important to notice that your instance is where you allocate resources to be used by the underlying Cloud Spanner databases. This means, that if all of your databases are in the same instance and if some of your databases are hogging the CPU, all your other databases will struggle.
In terms of a locked table, I would be very surprised if a deadlock is the problem here, since Spanner mitigates those issues using "wound-wait" algorithm. What I suspect might be happening is that a long time is necessary to perform the query in that table and it times out. It would be nice to investigate a bit more on your problem:
How long did you wait for your query on the problematic table (before you deemed to be stuck)? It might be a problem where your query just takes long and you are timing out before getting the results. It might be a problem where there are hotspots in your table and transactions are getting aborted often, preventing you from getting results.
What error did you get when performing the query on the table? The error code can tell you more about what might be happening.
Have you tried doing a stale read on the table to see if any data is returned? If lock contention is the problem, this should succeed and return results faster for you. Thus, you can further investigate the lock usage with the statistics table (as shown below).
Inspect query statistics: you can list the queries with the highest CPU usage, find the average latency for a query and find out the queries that timeout the most. There is much more you can do, as seen here. You can also view the query statistics in cloud console. What I would verify is, by reducing the overall CPU, does your query come back without any issues? You might need more resources if so. Or you might need to reduce hotspots in your database.
Inspect Lock statistics: you can investigate lock conflicts in your rows and tables. I think that an interesting query for your problem would be Discovering which row keys and columns had long lock wait times during the selected period. You can then see if your query is reading those row keys and columns and act accordingly.

Related

Time-consuming queries to PostgreSQL Database are duplicated

After restoring db-server from snapshot something strange started happening with our database. Basically it can be described as all time-consuming queries are seems to be duplicated. At least as pg_stat_activity shows it
These lines are almost equal except for their PIDs and client addresses.
Usually I'd think that that's just a mistake of dev team (multiple equal queries at a time in code, cron misconfiguration, etc), but one of those time-consuming selects comes from PowerBI which I believe to be quite reliable in terms of loading data.
Has anybody ever stumbled upon this problem?
Turned out that's the way pg_stat_activity shows parallel workers processing single query. You can make sure that's the case by getting backend_type of these records.

PostgreSQL: Backend processes are active for a long time

now I am hitting a very big road block.
I use PostgreSQL 10 and its new table partitioning.
Sometimes many queries don't return and at the time many backend processes are active when I check backend processes by pg_stat_activity.
First, I thought theses process are just waiting for lock, but these transactions contain only SELECT statements and the other backend doesn't use any query which requires ACCESS EXCLUSIVE lock. And these queries which contain only SELECT statements are no problem in terms of plan. And usually these work well. And computer resources(CPU, memory, IO, Network) are also no problem. Therefore, theses transations should never conflict. And I thoughrouly checked the locks of theses transaction by pg_locks and pg_blocking_pids() and finnaly I couldn't find any lock which makes queries much slower. Many of backends which are active holds only ACCESS SHARE because they use only SELECT.
Now I think these phenomenon are not caused by lock, but something related to new table partition.
So, why are many backends active?
Could anyone help me?
Any comments are highly appreciated.
The blow figure is a part of the result of pg_stat_activity.
If you want any additional information, please tell me.
EDIT
My query dosen't handle large data. The return type is like this:
uuid UUID
,number BIGINT
,title TEXT
,type1 TEXT
,data_json JSONB
,type2 TEXT
,uuid_array UUID[]
,count BIGINT
Because it has JSONB column, I cannot caluculate the exact value, but it is not large JSON.
Normally theses queries are moderately fast(around 1.5s), so it is absolutely no problem, however when other processes work, the phenomenon happens.
If statistic information is wrong, the query are always slow.
EDIT2
This is the stat. There are almost 100 connections, so I couldn't show all stat.
For me it looks like application problem, not postresql's one. active status means that your transaction still was not commited.
So why do you application may not send commit to database?
Try to review when do you open transaction, read data, commit transaction and rollback transaction in your application code.
EDIT:
By the way, to be sure try to check resource usage before problem appear and when your queries start hanging. Try to run top and iotop to check if postgres really start eating your cpu or disk like crazy when problem appears. If not, I will suggest to look for problem in your application.
Thank you everyone.
I finally solved this problem.
I noticed that a backend process holded too many locks. So, when I executed the query SELECT COUNT(*) FROM pg_locks WHERE pid = <pid>, the result is about 10000.
The parameter of locks_per_transactions is 64 and max_connections is about 800.
So, if the number of query that holds many locks is large, the memory shortage occurs(see calculation code of shared memory inside PostgreSQL if you are interested.).
And too many locks were caused when I execute query like SELECT * FROM (partitioned table). Imangine you have a table foo that is partitioned and the number of the table is 1000. And then you can execute SELECT * FROM foo WHERE partion_id = <id> and the backend process will hold about 1000 table locks(and index locks). So, I change the query from SELECT * FROM foo WHERE partition_id = <id> to SELECT * FROM foo_(partitioned_id). As the result, the problem looks solved.
You say
Sometimes many queries don't return
...however when other processes work, the phenomenon happens. If statistic
information is wrong, the query are always slow.
They don't return/are slow when directly connecting to the Postgres instance and running the query you need, or when running the queries from an application? The backend processes that are running, are you able to kill them successfully with pg_terminate_backend($PID) or does that have issues? To rule out issues with the statement itself, make sure statement_timeout is set to a reasonable amount to kill off long-running queries. After that is ruled out, perhaps you are running into a case of an application hanging and never allowing the send calls from PostgreSQL to finish. To avoid a situation like that, if you are able to (depending on OS) you can tune the keep-alive time: https://www.postgresql.org/docs/current/runtime-config-connection.html#GUC-TCP-KEEPALIVES-IDLE (by default is 2 hours)
Let us know if playing with any of that gives any more insight into your issue.
Sorry for late post, As #Konstantin pointed out, this might be because of your application(which is why I asked for your EDIT2). Adding a few excerpts,
table partition has no effect on these locks, that is a totally different concept and does not hold up locks in your case.
In your application, check if the connection properly close() after read() and is in finally block (From Java perspective). I am not sure of your application tier.
Check if SELECT..FOR UPDATE or any similar statement is written erroneously recently which is causing this.
Check if any table has grown in size recently and the column is not Indexed. This is very important and frequent cause of select statements running for some minutes. I'd also suggest using timeouts for select statements in your application. https://www.postgresql.org/docs/9.5/gin-intro.html This can give you a headstart.
Another thing that is fishy to me is the JSONB column, maybe your Jsonb values are pretty long, or the queries are unnecessarily selecting JSONB value even if not required?
Finally, If you don't need some special features of Jsonb data type, then you use JSON data type which is faster (magical maximum, sometimes 50x!)
It looks like the pooled connections not getting closed properly and a few queries might be taking huge time to respond back. As pointed out in other answers, it is the problem with the application and could be connection leak. Most possibly, it might be because of pending transactions over some already pending and unresolved transactions, leading to a number of unclosed transactions.
In addition, PostgreSQL generally has one or more "helper" processes like the stats collector, background writer, autovaccum daemon, walsender, etc, all of which show up as "postgres" instances.
One thing I would suggest you check in which part of the code you have initiated the queries. Try to DRY run your queries outside the application and have some benchmarking of queries performance.
Secondly, you can keep some timeout for certain queries if not all.
Thirdly, you can do kill the idle transactions after certain timeouts by using:
SET SESSION idle_in_transaction_session_timeout = '5min';
I hope it might work. Cheers!

In Postgres, what does pg_stat_database.xact_commit actually mean?

I'm trying to understand SELECT xact_commit FROM pg_stat_database; According to docs, it is "Number of transactions in this database that have been committed". But I turned on logging all queries (log_min_duration = 0) and it seems there are other things besides that can affect xact_commit than just a query. For example, connecting a psql client or typing BEGIN; will increase it by various values. There is a step in my application that runs a single query (as confirmed by the log), but consistently increases the counter by 15-20. Does anyone know anything more specific about what is counted in xact_commit, or if there is a way to count only actual queries?
pg_stat_database.xact_commit really is the number of commits in the database (remember that every statement that is not run in a transaction block actually runs in its own little transaction, so it will cause a commit).
The mystery that remains to be solved is why you see more commits than statements, which seems quite impossible (For example, BEGIN starts a transaction, so by definition it cannot increase xact_commit).
The solution is probably that database activity statistics are collected asynchronously: they are sent to the statistics collector process via an UDP socket, and the statistics collector eventually updates the statistics.
So my guess is that the increased transaction count you see is actually from earlier activities.
Try keeping the database absolutely idle for a while and then try again, then you should see a slower increase.

What about expected performance in Pentaho?

I am using Pentaho to create ETL's and I am very focused on performance. I develop an ETL process that copy 163.000.000 rows from Sql server 2088 to PostgreSQL and it takes 17h.
I do not know how good or bad is this performance. Do you know how to measure if the time that takes some process is good? At least as a reference to know if I need to keep working heavily on performance or not.
Furthermore, I would like to know if it is normal that in the first 2 minutes of ETL process it load 2M rows. I calculate how long will take to load all the rows. The expected result is 6 hours, but then the performance decrease and it takes 17h.
I have been investigating in goole and I do not find any time references neither any explanations about performance.
Divide and conquer, and proceed by elimination.
First, add a LIMIT to your query so it takes 10 minutes instead of 17 hours, this will make it a lot easier to try different things.
Are the processes running on different machines? If so, measure network bandwidth utilization to make sure it isn't a bottleneck. Transfer a huge file, make sure the bandwidth is really there.
Are the processes running on the same machine? Maybe one is starving the other for IO. Are source and destination the same hard drive? Different hard drives? SSDs? You need to explain...
Examine IO and CPU usage of both processes. Does one process max out one cpu core?
Does a process max out one of the disks? Check iowait, iops, IO bandwidth, etc.
How many columns? Two INTs, 500 FLOATs, or a huge BLOB with a 12 megabyte PDF in each row? Performance would vary between these cases...
Now, I will assume the problem is on the POSTGRES side.
Create a dummy table, identical to your target table, which has:
Exact same columns (CREATE TABLE dummy LIKE table)
No indexes, No constraints (I think it is the default, double check the created table)
BEFORE INSERT trigger on it which returns NULL and drop the row.
The rows will be processed, just not inserted.
Is it fast now? OK, so the problem was insertion.
Do it again, but this time using an UNLOGGED TABLE (or a TEMPORARY TABLE). These do not have any crash-resistance because they don't use the journal, but for importing data it's OK.... if it crashes during the insert you're gonna wipe it out and restart anyway.
Still No indexes, No constraints. Is it fast?
If slow => IO write bandwidth issue, possibly caused by something else hitting the disks
If fast => IO is OK, problem not found yet!
With the table loaded with data, add indexes and constraints one by one, find out if you got, say, a CHECK that uses a slow SQL function, or a FK into a table which has no index, that kind of stuff. Just check how long it takes to create the constraint.
Note: on an import like this you would normally add indices and constraints after the import.
My gut feeling is that PG is checkpointing like crazy due to the large volume of data, due to too-low checkpointing settings in the config. Or some issue like that, probably random IO writes related. You put the WAL on a fast SSD, right?
17H is too much. Far too much. For 200 Million rows, 6 hours is even a lot.
Hints for optimization:
Check the memory size: edit the spoon.bat, find the line containing -Xmx and change it to half your machine memory size. Details varies with java version. Example for PDI V7.1.
Check if the query from the source database is not too long (because too complex, or server memory size, or ?).
Check the target commit size (try 25000 for PostgresSQL), the Use batch update for inserts in on, and also that the index and constraints are disabled.
Play with the Enable lazy conversion in the Table input. Warning, you may produce difficult to identify and debug errors due to data casting.
In the transformation property you can tune the Nr of rows in rowset (click anywhere, select Property, then the tab Miscelaneous). On the same tab check the transformation is NOT transactional.

PostgreSQL. Slow queries in log file are fast in psql

I have an application written on Play Framework 1.2.4 with Hibernate(default C3P0 connection pooling) and PostgreSQL database (9.1).
Recently I turned on slow queries logging ( >= 100 ms) in postgresql.conf and found some issues.
But when I tried to analyze and optimize one particular query, I found that it is blazing fast in psql (0.5 - 1 ms) in comparison to 200-250 ms in the log. The same thing happened with the other queries.
The application and database server is running on the same machine and communicating using localhost interface.
JDBC driver - postgresql-9.0-801.jdbc4
I wonder what could be wrong, because query duration in the log is calculated considering only database processing time excluding external things like network turnarounds etc.
Possibility 1: If the slow queries occur occasionally or in bursts, it could be checkpoint activity. Enable checkpoint logging (log_checkpoints = on), make sure the log level (log_min_messages) is 'info' or lower, and see what turns up. Checkpoints that're taking a long time or happening too often suggest you probably need some checkpoint/WAL and bgwriter tuning. This isn't likely to be the cause if the same statements are always slow and others always perform well.
Possibility 2: Your query plans are different because you're running them directly in psql while Hibernate, via PgJDBC, will at least sometimes be doing a PREPARE and EXECUTE (at the protocol level so you won't see actual statements). For this, compare query performance with PREPARE test_query(...) AS SELECT ... then EXPLAIN ANALYZE EXECUTE test_query(...). The parameters in the PREPARE are type names for the positional parameters ($1,$2,etc); the parameters in the EXECUTE are values.
If the prepared plan is different to the one-off plan, you can set PgJDBC's prepare threshold via connection parameters to tell it never to use server-side prepared statements.
This difference between the plans of prepared and unprepared statements should go away in PostgreSQL 9.2. It's been a long-standing wart, but Tom Lane dealt with it for the up-coming release.
It's very hard to say for sure without knowing all the details of your system, but I can think of a couple of possibilities:
The query results are cached. If you run the same query twice in a short space of time, it will almost always complete much more quickly on the second pass. PostgreSQL maintains a cache of recently retrieved data for just this purpose. If you are pulling the queries from the tail of your log and executing them immediately this could be what's happening.
Other processes are interfering. The execution time for a query varies depending on what else is going on in the system. If the queries are taking 100ms during peak hour on your website when a lot of users are connected but only 1ms when you try them again late at night this could be what's happening.
The point is you are correct that the query duration isn't affected by which library or application is calling it, so the difference must be coming from something else. Keep looking, good luck!
There are several possible reasons. First if the database was very busy when the slow queries excuted, the query may be slower. So you may need to observe the load of the OS at that moment for future analysis.
Second the history plan of the sql may be different from the current session plan. So you may need to install auto_explain to see the actual plan of the slow query.