AWS RDS Postgres random reads incredibly slow

We are running an r5.4xlarge RDS PostgreSQL 10.5 instance with 10,000 provisioned IOPS.
When running a query like
select count(f3) from table
where f1 = 'x' and f2 between x and y;
it will take >15 seconds to return for ~20k values.
There is an index covering (f1, f2); f3 is not in the index. EXPLAIN ANALYZE shows that the time is spent on an Index Scan.
It seems that whenever the required data is not in cache and has to be read from disk we get those slow response times.
We get the same results both with the default parameter group and with some values changed as proposed at https://pgtune.leopard.in.ua/#/
Are we having fundamentally wrong expectations towards RDS here, or are we missing some essential detail in the configuration?
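One detail worth sketching: because f3 is not in the index, every row matched by the index scan still needs a random heap fetch, which is exactly what hurts once the data has to come from disk. A minimal, hypothetical fix is a covering index (t stands in for the question's table; PostgreSQL 10 has no INCLUDE clause, so the extra column is simply appended to the key):
-- Appending f3 can enable an index-only scan for count(f3),
-- avoiding the random heap reads (if the visibility map is current).
CREATE INDEX CONCURRENTLY idx_t_f1_f2_f3 ON t (f1, f2, f3);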

Related

What could be causing high CPU in Google Spanner databases? (unresponsive table)

I'm facing an issue where 2 out of 10 Spanner databases are showing high CPU usage (above 40%) whereas the others are at around 1% each, with almost identical or more data.
I noticed that one of our tables has become "unresponsive": no queries work against it. We shut down all apps that connect to those databases, and we also deleted all current sessions using gcloud sessions list and then gcloud sessions delete.
However, the table is still unresponsive. A simple select like select id from mytable where name = 'test' does not respond (when tested from an app and also from the gcloud web interface). It only happens with that table, which has just a few columns of normal data and no more than 2,000 records. We identified the query that could have been the source of the problem, but the table seems to be locked (only count(*) without any WHERE clause works).
I was wondering if there is any way to "unlock" the table, kill the "transactions" that might be causing the issue, or restart those specific Spanner databases, or, in the worst-case scenario, restart the Spanner instance.
I have seen the documentation on monitoring high CPU, but even though we can identify that the CPU is high, we don't really know how to restart things or bring them back to normal before reviewing the query or queries that could have caused the issue (if that was the case).
High CPU can be caused by user-initiated queries from different types of operations. It is important to note that your instance is where you allocate resources to be used by the underlying Cloud Spanner databases. This means that if all of your databases are in the same instance and some of them are hogging the CPU, all your other databases will struggle.
In terms of a locked table, I would be very surprised if a deadlock were the problem here, since Spanner mitigates those issues using the "wound-wait" algorithm. What I suspect is happening is that the query on that table simply takes a long time and times out. It would be good to investigate your problem a bit further:
How long did you wait for your query on the problematic table (before you deemed it stuck)? Your query may just take long, and you are timing out before getting the results. Or there may be hotspots in your table, and transactions are getting aborted often, preventing you from getting results.
What error did you get when performing the query on the table? The error code can tell you more about what might be happening.
Have you tried doing a stale read on the table to see if any data is returned? If lock contention is the problem, this should succeed and return results faster for you. Thus, you can further investigate the lock usage with the statistics table (as shown below).
Inspect query statistics: you can list the queries with the highest CPU usage, find the average latency for a query, and find the queries that time out the most. There is much more you can do, as seen here. You can also view the query statistics in the Cloud Console. What I would verify is: by reducing the overall CPU, does your query come back without any issues? If so, you might need more resources, or you might need to reduce hotspots in your database.
Inspect lock statistics: you can investigate lock conflicts in your rows and tables. An interesting query for your problem would be discovering which row keys and columns had long lock wait times during the selected period (see the sketch below). You can then check whether your query reads those row keys and columns and act accordingly.
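A sketch of such a lock-statistics query, assuming your Spanner version exposes the SPANNER_SYS statistics tables (the 10-minute window is one of several granularities):
-- Row ranges with the longest lock waits in the most recent 10-minute window.
SELECT CAST(s.row_range_start_key AS STRING) AS row_range_start_key,
       s.lock_wait_seconds
FROM spanner_sys.lock_stats_top_10minute AS s
WHERE s.interval_end =
      (SELECT MAX(interval_end) FROM spanner_sys.lock_stats_top_10minute)
ORDER BY s.lock_wait_seconds DESC;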

What about expected performance in Pentaho?

I am using Pentaho to create ETLs, and I am very focused on performance. I developed an ETL process that copies 163,000,000 rows from SQL Server 2008 to PostgreSQL, and it takes 17 hours.
I do not know whether this performance is good or bad. Do you know how to measure whether the time a process takes is reasonable? At least as a reference, to know whether I need to keep working heavily on performance or not.
Furthermore, I would like to know whether it is normal that the ETL process loads 2M rows in its first 2 minutes. Extrapolating from that, the whole load should take about 6 hours, but then performance decreases and it ends up taking 17 hours.
I have been searching on Google and I cannot find any timing references or explanations about performance.
Divide and conquer, and proceed by elimination.
First, add a LIMIT to your query so it takes 10 minutes instead of 17 hours; this will make it a lot easier to try different things.
Are the processes running on different machines? If so, measure network bandwidth utilization to make sure it isn't a bottleneck. Transfer a huge file, make sure the bandwidth is really there.
Are the processes running on the same machine? Maybe one is starving the other for IO. Are source and destination the same hard drive? Different hard drives? SSDs? You need to explain...
Examine IO and CPU usage of both processes. Does one process max out one cpu core?
Does a process max out one of the disks? Check iowait, iops, IO bandwidth, etc.
How many columns? Two INTs, 500 FLOATs, or a huge BLOB with a 12 megabyte PDF in each row? Performance would vary between these cases...
Now, I will assume the problem is on the POSTGRES side.
Create a dummy table, identical to your target table, which has:
Exactly the same columns (CREATE TABLE dummy (LIKE target_table))
No indexes, no constraints (I think that is the default; double-check the created table)
A BEFORE INSERT trigger on it which returns NULL and drops the row.
The rows will be processed, just not inserted.
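A minimal sketch of that setup (target_table is a placeholder for your real table):
-- Dummy table with the same columns but nothing ever stored in it:
CREATE TABLE dummy (LIKE target_table);
-- Returning NULL from a BEFORE trigger silently discards the row,
-- so the whole insert pipeline runs without touching the heap:
CREATE FUNCTION discard_row() RETURNS trigger AS $$
BEGIN
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER drop_everything
    BEFORE INSERT ON dummy
    FOR EACH ROW EXECUTE PROCEDURE discard_row();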
Is it fast now? OK, so the problem was insertion.
Do it again, but this time using an UNLOGGED TABLE (or a TEMPORARY TABLE). These have no crash-resistance because they bypass the write-ahead log, but for importing data that's OK... if it crashes during the insert you're going to wipe it out and restart anyway.
Still No indexes, No constraints. Is it fast?
If slow => IO write bandwidth issue, possibly caused by something else hitting the disks
If fast => IO is OK, problem not found yet!
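For reference, the unlogged variant of the dummy table above (again with a placeholder name):
-- Unlogged tables skip WAL writes entirely; their contents are truncated
-- after a crash, which is acceptable for a throwaway import test.
CREATE UNLOGGED TABLE dummy_unlogged (LIKE target_table);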
With the table loaded with data, add indexes and constraints one by one, and find out whether you've got, say, a CHECK that uses a slow SQL function, or an FK into a table which has no index, that kind of stuff. Just check how long it takes to create each constraint.
Note: on an import like this you would normally add indices and constraints after the import.
My gut feeling is that PG is checkpointing like crazy due to the large volume of data and too-low checkpoint settings in the config. Or some issue like that, probably related to random IO writes. You put the WAL on a fast SSD, right?
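If checkpointing is indeed the cause, a sketch of the settings commonly raised for bulk loads (the values are illustrative, not tuned recommendations):
-- Fewer, more spread-out checkpoints while the import runs:
ALTER SYSTEM SET max_wal_size = '8GB';
ALTER SYSTEM SET checkpoint_timeout = '30min';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
SELECT pg_reload_conf();  -- these settings can be reloaded without a restart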
17 hours is too much. Far too much. For 200 million rows, even 6 hours is a lot.
Hints for optimization:
Check the memory size: edit spoon.bat and find the line containing -Xmx, then change it to half your machine's memory size. Details vary with the Java version (example for PDI V7.1).
Check that the query against the source database is not too slow (because it is too complex, or because of the server's memory size, or something else).
Check the target commit size (try 25,000 for PostgreSQL), that Use batch update for inserts is on, and also that the indexes and constraints are disabled.
Play with Enable lazy conversion in the Table input step. Warning: you may produce errors that are difficult to identify and debug due to data casting.
In the transformation properties you can tune the Nr of rows in rowset (click anywhere, select Properties, then the Miscellaneous tab). On the same tab, check that the transformation is NOT transactional.

Why would there be varying response times for the same query?

Is there a reason why the same query, executed a number of times, has huge variance in response times - from 50% to 200% of the projected response time? The times range from 6 seconds to 20 seconds even though it is the only active query on the database.
Context:
Database on Postgres 9.6 on AWS RDS (with Provisioned IOPS)
Contains one table comprising five numeric columns, indexed on id, holding 200 million rows
The query:
SELECT col1, col2
FROM calculations
WHERE id > 0
AND id < 100000;
The query's explain plan:
Bitmap Heap Scan on calculation (cost=2419.37..310549.65 rows=99005 width=43)
Recheck Cond: ((id > 0) AND (id <= 100000))
-> Bitmap Index Scan on calculation_pkey (cost=0.00..2394.62 rows=99005 width=0)
Index Cond: ((id > 0) AND (id <= 100000))
Is there any reason why a simple query like this isn't more predictable in response time?
Thanks.
When you see something like this in PostgreSQL EXPLAIN ANALYZE:
(cost=2419.37..310549.65)
...it doesn't mean the cost is between 2419.37 and 310549.65. These are in fact two different measures. The first value is the startup cost, and the second value is the total cost. Most of the time you'll care only about the total cost. You should be concerned with the startup cost when that component of the execution plan feeds (for example) an EXISTS clause, where only the first row needs to be returned (in that case you only care about startup cost, not total cost, as execution stops almost immediately after startup).
The PostgreSQL documentation on EXPLAIN goes into greater detail about this.
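A hypothetical illustration using the table from the question - in the EXISTS form the plan can stop after the first matching row, so only the startup cost matters:
-- Full scan of the range: total cost matters.
SELECT col1, col2 FROM calculations WHERE id > 0 AND id < 100000;
-- Existence check: execution stops at the first row, so startup cost matters.
SELECT EXISTS (SELECT 1 FROM calculations WHERE id > 0 AND id < 100000);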
A query may be (and should be, excluding special cases) more predictable in response time when you are the sole user of the server. In the case of a cloud server, you do not know anything about the actual server load, even if your query is the only one running against your database, because the server most likely hosts multiple databases at the same time. Since you asked about response time, there may also be varying network conditions involved in accessing a remote server.
After investigating the historical load, we found that the provisioned IOPS we originally configured had been exhausted during the last set of load tests performed on the environment.
According to Amazon's documentation (http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Storage.html), after that point Amazon does not guarantee consistent execution times, and the SLAs no longer apply.
We confirmed that replicating the database onto a new AWS RDS instance with the same configuration yields consistent response times when executing the query multiple times.

Optimize PostgreSQL read-only tables

I have many read-only tables in a Postgres database. All of these tables can be queried using any combination of columns.
What can I do to optimize queries? Is it a good idea to add indexes to all columns to all tables?
Columns that are used for filtering or joining (or, to a lesser degree, sorting) are of interest for indexing. Columns that are just selected are barely relevant!
For the following query only indexes on a and e may be useful:
SELECT a,b,c,d
FROM tbl_a
WHERE a = $some_value
AND e < $other_value;
Here, f and possibly c are candidates, too:
SELECT a,b,c,d
FROM tbl_a
JOIN tbl_b USING (f)
WHERE a = $some_value
AND e < $other_value
ORDER BY c;
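A sketch of the candidate indexes for the two queries above (single-column indexes for simplicity; a multicolumn index on (a, e) could serve the first query on its own):
CREATE INDEX ON tbl_a (a);  -- equality filter
CREATE INDEX ON tbl_a (e);  -- range filter
CREATE INDEX ON tbl_b (f);  -- join column
CREATE INDEX ON tbl_a (c);  -- may help the ORDER BY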
After creating indexes, test whether they are actually used with EXPLAIN ANALYZE. Also compare execution times with and without the indexes; deleting and recreating indexes is fast and easy. There are also planner parameters to experiment with while running EXPLAIN ANALYZE. The difference may be staggering or nonexistent.
As your tables are read only, index maintenance is cheap. It's merely a question of disk space.
If you really want to know what you are doing, start by reading the docs.
If you don't know what queries to expect ...
Try logging enough queries to find typical use cases. Log queries with the parameter log_statement = all for that. Or just log slow queries using log_min_duration_statement.
Create indexes that might be useful and check the statistics after some time to see what actually gets used. PostgreSQL has a whole infrastructure in place for monitoring statistics (see the sketch below). One convenient way to study statistics (and handle many other tasks) is pgAdmin, where you can choose your table / function / index and see all the data on the "Statistics" tab in the object browser (main window).
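For example, one way to check which indexes actually get scanned (a sketch against the standard statistics views):
-- Indexes with idx_scan = 0 have never been used and are removal candidates.
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;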
Proceed as described above to see if indexes in use actually speed things up.
If the query planner chooses to use one or more of your indexes but with no or adverse effect, then something is probably wrong with your setup and you need to study the basics of performance optimization: vacuum, analyze, cost parameters, memory usage, ...
If you filter by more columns, indexes may help, but not much. Also, indexes may not help for small tables.
First, search for "postgresql tuning" - you will find useful information.
If database can fit in memory - buy enough RAM.
If database can not fit in memory - SSD will help.
If this is not enough and the database is read-only - run 2, 3 or more servers, or partition the database (in the best case so that each partition fits in the memory of its server).
Even if the queries are generated, I think they will not be random. Monitor the database for slow queries and improve only them (see the sketch below).
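A sketch of one way to find the slowest queries, assuming the pg_stat_statements extension is available (it must be listed in shared_preload_libraries; column names vary across PostgreSQL versions - total_time applies up to version 12):
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Top 10 queries by cumulative execution time.
SELECT query, calls, round((total_time / calls)::numeric, 2) AS avg_ms
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;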

Cassandra random read speed

We're still evaluating Cassandra for our data store. As a very simple test, I inserted a value for 4 columns into the Keyspace1/Standard1 column family on my local machine amounting to about 100 bytes of data. Then I read it back as fast as I could by row key. I can read it back at 160,000/second. Great.
Then I put in a million similar records, all with keys of the form X.Y where X in (1..10) and Y in (1..100,000), and queried for random records. Performance fell to 26,000 queries per second. This is still well above the number of queries we need to support (about 1,500/sec).
Finally I put in ten million records, from 1.1 up through 10.1000000, and randomly queried for one of the 10 million records. Performance is abysmal at 60 queries per second and my disk is thrashing like crazy.
I also verified that if I ask for a subset of the data, say the 1,000 records between 3,000,000 and 3,001,000, it returns slowly at first, and then as they get cached it speeds right up to 20,000 queries per second and my disk stops going crazy.
I've read all over that people are storing billions of records in Cassandra and fetching them at 5-6k reads per second, but I can't get anywhere near that with only 10 million records. Any idea what I'm doing wrong? Is there some setting I need to change from the defaults? I'm on an overclocked Core i7 box with 6 GB of RAM, so I don't think it's the machine.
Here's my code to fetch records, which I run in 8 threads, each asking for one value from one column via row key:
// Thrift-era client snippet; utf8Encoding, sRand and client are assumed
// to be initialized elsewhere in the test harness.
ColumnPath cp = new ColumnPath();
cp.Column_family = "Standard1";
cp.Column = utf8Encoding.GetBytes("site");
// Keys are "X.Y" with X in 1..10 and Y in 1..1000000;
// Next(10) yields 0..9, so 1 + Next(10) covers the full 1..10 range.
string key = (1 + sRand.Next(10)) + "." + (1 + sRand.Next(1000000));
ColumnOrSuperColumn logline = client.get("Keyspace1", key, cp, ConsistencyLevel.ONE);
Thanks for any insights
Purely random reads are about worst-case behavior for the caching that your OS (and Cassandra, if you set up a key or row cache) tries to do.
If you look at contrib/py_stress in the Cassandra source distribution, it has a configurable stdev to perform random reads with some keys hotter than others. This will be more representative of most real-world workloads.
Add more Cassandra nodes and give them lots of memory (-Xms / -Xmx). The more Cassandra instances you have, the more the data will be partitioned across the nodes and the more likely it is to be in memory or easily accessed from disk. You'll be very limited trying to scale with a single workstation-class CPU. Also, check the default -Xms/-Xmx settings; I think the default is 1 GB.
It looks like you haven't got enough RAM to store all the records in memory.
If you swap to disk then you are in trouble, and performance is expected to drop significantly, especially with random reads.
You could also try benchmarking some other popular alternatives, like Redis or VoltDB.
VoltDB can certainly handle this level of read performance as well as writes and operates using a cluster of servers. As an in-memory solution you need to build a large enough cluster to hold all of your data in RAM.