Redshift cluster: queries hanging and filling up space - amazon-redshift

I have a Redshift cluster with 3 nodes. Every now and then, with users running queries against it, we end up in the unpleasant situation where some queries run far longer than expected (even simple ones, exceeding 15 minutes), and the cluster storage starts increasing to the point that, if you don't terminate the long-running queries, it reaches 100% storage used.
I wonder why this may happen. My experience is varied: sometimes it's been a single query doing this, and sometimes it's been several different queries running concurrently.

One specific scenario where we saw this happen involved LISTAGG. The result type of LISTAGG is varchar(65535), and while Redshift avoids storing the implicit trailing blanks on disk, the full declared width is required in memory during query processing.
If you have a query that returns a million rows, you end up with 1,000,000 rows times 65,535 bytes per LISTAGG value, which is roughly 65 GB. That can quickly get you into a situation like the one you describe, with queries taking unexpectedly long or failing with "Disk Full" errors.
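As a rough illustration (the table and column names here are invented, not from the original question), a query of this shape reserves the full 65,535-byte width per LISTAGG result while it runs:

-- Hypothetical example: each LISTAGG result is typed varchar(65535), so with
-- ~1,000,000 groups the intermediate result needs on the order of 65 GB of
-- working memory / temp space, no matter how short the actual strings are.
SELECT customer_id,
       LISTAGG(order_id::varchar, ',') WITHIN GROUP (ORDER BY order_date) AS order_ids
FROM orders
GROUP BY customer_id;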
My team discussed this a bit more on our team blog the other day.

This typically happens when a poorly constructed query spills too much data to disk, for instance when a user accidentally specifies a Cartesian product (every row of tblA joined to every row of tblB).
If this happens regularly you can implement a QMR (query monitoring rule) that caps the amount of disk spill a query is allowed to produce before it is aborted.
QMR Documentation: https://docs.aws.amazon.com/redshift/latest/dg/cm-c-wlm-query-monitoring-rules.html
QMR Rule Candidates query: https://github.com/awslabs/amazon-redshift-utils/blob/master/src/AdminScripts/wlm_qmr_rule_candidates.sql
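Before adding a rule, it can help to see how much spill your workload actually produces. A minimal sketch against the SVL_QUERY_METRICS_SUMMARY system view (the candidates script linked above is far more thorough):

-- Recent queries ranked by how many 1 MB blocks of intermediate results they
-- wrote to disk; the worst offenders are candidates for a log/abort QMR rule
-- on the query_temp_blocks_to_disk metric.
SELECT query,
       service_class,
       query_temp_blocks_to_disk,
       query_execution_time
FROM svl_query_metrics_summary
WHERE query_temp_blocks_to_disk IS NOT NULL
ORDER BY query_temp_blocks_to_disk DESC
LIMIT 20;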

Related

Redshift vacuum is not reclaiming space

I have a Redshift cluster that consists of 2 nodes with 160 GB disks.
I'm randomly getting a "Disk full" error when running vacuum or any other query. My disk usage is at 92%. I deleted more than half of the old rows in a table that is 10515 MB in size, but even after rebooting the cluster there's no effect and the table is still the same size, though a count shows the new number of rows. I should have seen at least a small decrease in disk usage, but there's nothing.
Does anyone have any clue what it might be? Is deleting the table the only option in this case?
There are a few possibilities here, but first let me check the facts. You have a 2-node dc2.large cluster and it is 92% disk full. This is too full and needs to be lowered to provide temp space for query execution. You have a table that is 10515 blocks in size. To address the disk space concern you deleted half of the rows in the table in question and then vacuumed the table. Once complete you didn't see any change to the cluster space nor the size of the table, not even one block of difference in table size. Do I have this correct?
The first possibility is that the vacuum did not complete correctly. You mention that you are getting disk full messages even when vacuuming, so could it be that the vacuum you tried is not completing? You see, vacuum needs temp space to sort the table data, and if the cluster has gotten too full then the vacuum can fail. In this case you can run a delete-only vacuum that will not attempt to sort the table, just reclaim disk space. This has a higher likelihood of success in a disk full situation.
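A delete-only vacuum looks like this (schema and table names are placeholders for yours):

-- Reclaims space from deleted rows without the sort phase, so it needs far
-- less temporary disk space than a full vacuum.
VACUUM DELETE ONLY my_schema.my_table;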
Another possibility is that the delete of rows didn't complete correctly or wasn't committed before the vacuum was run. This will cause the vacuum to run on the full set of rows.
It is also possible that the table in question is very wide (many columns). This matters because of how Redshift stores data - each block is 1 MB in size and each column needs at least one block per slice for its data. This cluster has 4 slices, so if this table were 1,500 columns wide (yes, that is silly wide) the table would take up 6,000 blocks just to store the first 4 rows. Then it takes no additional disk space to add rows until these blocks start to fill up. The table size will move in very large chunks, and when removing rows the size may not change except in large chunks. This is unlikely to be what is happening if you are seeing EXACTLY the same number of blocks, but if you are just seeing smaller changes in block count than you expect, this could be in play.
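If you want to double-check the table's size in blocks, a quick query against the SVV_TABLE_INFO system view (the table name is a placeholder) is something like:

-- "size" is reported in 1 MB blocks; "pct_used" is the share of the cluster's
-- storage this table consumes.
SELECT "schema", "table", size, pct_used
FROM svv_table_info
WHERE "table" = 'my_table';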
There could be some other misunderstanding happening. A sort-only vacuum won't free up space. The node type might not be what I think it is. The table could live in S3 and be accessed through Spectrum. But based on the description these don't seem likely.
UNSOLICITED ADVICE: You are on the right track by freeing up disk space, but you need to take more action than reducing this one table. (I expect you realize this and that this is just a start.) You should be operating below 70% disk full in most cases - this varies by workload and table sizes but is a good general rule. That means removing a great deal of data from your disks or increasing your node count (and cost). Migrating some data to S3 and using Spectrum to access it could be an option. If you need more storage without more compute you can look at the storage-optimized nodes, but since you are at the very smallest end of Redshift these likely aren't a win for you. You need to 1) remove unneeded data, 2) move some data to S3 and use Spectrum, or 3) add a node to your cluster.
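For option 2, the general shape of exposing S3 data through Spectrum is roughly the following; the schema name, IAM role, columns, and S3 location are all made-up placeholders:

-- Register an external schema backed by the Glue Data Catalog, then define a
-- table over Parquet files in S3. Queries against it read from S3 rather than
-- from the cluster's local disks.
CREATE EXTERNAL SCHEMA archive_ext
FROM DATA CATALOG
DATABASE 'archive_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE archive_ext.old_rows (
    id         BIGINT,
    created_at TIMESTAMP,
    payload    VARCHAR(256)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/archive/old_rows/';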

Slow bulk read from Postgres Read replica while updating the rows we read

We have on RDS a main Postgres server and a read replica.
We constantly write new data and keep updating the data from the last couple of days.
Reading from the read replica works fine when looking at older data, but reading the data from the last couple of days, which we keep updating on the main server, is painfully slow.
Queries that take 2-3 minutes on old data can time out after 20 minutes when querying data from the last day or two.
Looking at metrics like CPU, I don't see any extra load on the read replica.
Is there a solution for this?
You are accessing over 65 buffers for every visible row found in the index scan (and over 500 buffers for each row that is returned by the index scan, since 90% are filtered out by the mmsi criterion).
One issue is that your index is not as selective as it could be. If you had an index on (day, mmsi) rather than just (day), it should be about 10 times faster.
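A sketch of that index, reusing the table and column names mentioned in this answer (adjust to your schema):

-- Composite index so the mmsi filter is applied inside the index scan instead
-- of discarding ~90% of the rows after fetching them from the heap.
-- CONCURRENTLY avoids blocking writes on the primary while the index builds.
CREATE INDEX CONCURRENTLY idx_simplified_blips_day_mmsi
    ON simplified_blips (day, mmsi);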
But it also looks like you have a massive amount of bloat.
You are probably not vacuuming the table often enough. With your described UPDATE pattern, all the vacuuming needs accumulate in the newest data, but autovacuum's activity counters are evaluated against the full table size, so autovacuum does not run often enough to suit the needs of the new data. You could lower the scale factor for this table:
alter table simplified_blips set (autovacuum_vacuum_scale_factor = 0.01)
Or, if you partition the data based on "day", the partitions for newer days will naturally get vacuumed more often, because the occurrence of updates is judged against the size of each partition rather than being diluted by the size of all the older, inactive partitions. Also, each vacuum run will do less work, as it won't have to scan all of the indexes of the entire table, just the indexes of the active partitions.
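A minimal sketch of that layout using declarative partitioning (PostgreSQL 10+); only day and mmsi come from the question, the rest of the columns are invented:

-- Range-partition by day so autovacuum thresholds are evaluated per partition
-- and old, inactive partitions are left alone.
CREATE TABLE simplified_blips (
    day     date    NOT NULL,
    mmsi    integer NOT NULL,
    payload jsonb
) PARTITION BY RANGE (day);

-- One partition per day, typically created ahead of time by a job or a tool
-- such as pg_partman.
CREATE TABLE simplified_blips_2024_05_01
    PARTITION OF simplified_blips
    FOR VALUES FROM ('2024-05-01') TO ('2024-05-02');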
As suggested, the problem was bloat.
When you update a row in PostgreSQL (an MVCC database), the database creates a new version of the row containing the updated values.
After the update you end up with a "dead record" (AKA a dead tuple).
Every once in a while the database runs autovacuum and cleans the dead tuples out of the table.
Usually the default autovacuum settings are fine, but if your table is really large and updated often you should consider making the autovacuum vacuum and analyze thresholds more aggressive.

Slow transaction processing in PostgreSQL

I have been noticing bad behavior in my PostgreSQL database, but I still can't find any solution or improvement to apply.
The context is simple: let's say I have two tables, CUSTOMERS and ITEMS. On certain days the number of concurrent customers increases and so do the requests on items; customers can consult items, or add or remove quantities of them. However, in the APM I can see each new request getting slower than the previous one, pointing at the query responses from those tables as the biggest time consumer.
If the normal execution of the query is about 200 milliseconds, a few moments later it can be about 20 seconds.
I understand the locking process in PostgreSQL: since many users can be checking the same item, they could even be changing its values, but the response from the database is still far too slow.
So I would like to know if there are ways to improve the performance in the database.
The first time around I used PGTune to get the initial settings and it worked well. I'm on version 11 with 20 GB of RAM, 4 vCPUs and SAN storage; the number of simultaneous customers (not sessions) can reach over 500.
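One diagnostic step worth taking while the slowdown is happening (not from the original post, just a common starting point) is to check whether sessions are actually waiting on locks rather than simply running slowly:

-- Sessions currently waiting on a lock, and the PIDs blocking them
-- (pg_blocking_pids is available in PostgreSQL 9.6+).
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       wait_event,
       state,
       now() - query_start AS running_for,
       query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY running_for DESC;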

How many (maximum) DB2 multi-row fetch cursors can be maintained in a PL/I/COBOL program?

How many DB2 multi-row fetch cursors (at most) can be maintained in a PL/I/COBOL program while still getting good performance?
I have a requirement to maintain 4 cursors in a PL/I program, but I am concerned about the number of multi-row fetch cursors in a single program.
Is there any other way to check whether multi-row fetch is more effective than a normal cursor? I tried with 1000 records but I couldn't see any difference in running time.
IBM published some information (PDF) about multi-row fetch performance when this feature first became available in DB2 8 in 2005. Their data mentions nothing about the number of cursors in a given program, only the number of rows fetched.
From this I infer the number of multi-row fetch cursors itself is not of concern from a performance standpoint -- within reason. If someone pushes the limits of reason with 10,000 such cursors I will not be responsible for their anguish.
The IBM Redbook linked to earlier indicates there is a 40% CPU time improvement retrieving 10 rows per fetch, and a 50% CPU time improvement retrieving 100+ rows per fetch. Caveats:
The performance improvement using multi-row fetch in general depends on:
- Number of rows fetched in one fetch
- Number of columns fetched (more improvement with fewer columns), and the data type and size of the columns
- Complexity of the fetch. The fixed overhead saved for not having to go between the database engine and the application program has a lower percentage impact for complex SQL that has longer path lengths.
If the multi-row fetch reads more rows per statement, it results in CPU time improvement, but after 10 to 100 rows per multi-row fetch the benefit is decreased. The benefit decreases because, if the cost of one API overhead per row is 100% in a single-row statement, it gets divided by the number of rows processed in one SQL statement. So it becomes 10% with 10 rows, 1% with 100 rows, 0.1% for 1000 rows, and then the benefit becomes negligible.
The Redbook also has some discussion of how they did their testing to arrive at their performance figures. In short, they varied the number of rows retrieved and reran their program several times, pretty much what you'd expect.
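For comparison testing, the SQL for a rowset (multi-row fetch) cursor looks roughly like this; the cursor, table, column, and host-variable-array names are invented, and in the actual PL/I or COBOL program each statement would be wrapped in EXEC SQL, with the host arrays declared in working storage:

-- A rowset cursor returns up to 100 rows per FETCH into host-variable arrays
-- in a single API crossing, which is where the CPU savings come from.
DECLARE C1 CURSOR WITH ROWSET POSITIONING FOR
    SELECT ACCT_ID, ACCT_BAL
    FROM   MY_SCHEMA.ACCOUNTS;

OPEN C1;

FETCH NEXT ROWSET FROM C1
    FOR 100 ROWS
    INTO :ACCT-ID-ARRAY, :ACCT-BAL-ARRAY;

CLOSE C1;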

MongoDB upsert operation blocks inconsistently (with syncdelay set to 0)

There is a database with 9 million rows and 3 million distinct entities. This database is loaded every day into MongoDB using the Perl driver. It runs smoothly on the first load; however, from the second, third, etc. load onwards, the process slows down considerably. It blocks for long periods every now and then.
I initially realised that this was because of the automatic flushing to disk every 60 seconds, so I tried setting syncdelay to 0 and disabling journalling. I have indexed the fields that are used for the upsert. I have also observed that the blocking is inconsistent and not always at the same time for the same line.
I have 17 GB of RAM and enough hard disk space. I am replicating across two servers with one arbiter. I do not have any significant processes running in the background. Is there a solution or explanation for such blocking?
UPDATE: The mongostat tool shows in the 'res' column that around 3.6 GB is used.