Slow Redshift query with low cost and number of rows

I have a Redshift query that results in the following query plan being generated:
XN HashAggregate  (cost=4.00..4.06 rows=1 width=213)
  ->  XN Hash Join DS_DIST_ALL_NONE  (cost=0.02..3.97 rows=1 width=213)
        ->  XN Hash Join DS_DIST_NONE  (cost=0.00..3.93 rows=1 width=213)
              ->  XN Seq Scan on response_entities re  (cost=0.00..1.96 rows=157 width=85)
              ->  XN Hash  (cost=0.00..0.00 rows=1 width=208)
                    ->  XN Seq Scan on response_views rv  (cost=0.00..0.00 rows=1 width=208)
        ->  XN Hash  (cost=0.01..0.01 rows=1 width=8)
              ->  XN Seq Scan on dim_date dd  (cost=0.00..0.01 rows=1 width=8)
The query doesn't broadcast or redistribute any data, it has a very low cost, and it doesn't need to read a large number of rows. It actually doesn't return any rows, and none of its steps are disk-based.
The execution details on the AWS console show this:
I'm not including the query because I'm not looking for a reason why this particular query took 3 seconds to complete. I keep seeing execution timelines similar to this one, and I'm trying to understand why, even though each step takes just a couple of milliseconds to complete, the query ends up taking much longer. There are no other concurrent queries being executed.
Is all this time spent just on query compilation? Is this expected? Is there something I'm missing?

Query compilation seems to be the reason for this. This query shows the slowly compiled segments:
select userid, xid, pid, query, segment, locus,
datediff(ms, starttime, endtime) as duration, compile
from svl_compile
where query = 26540
order by query, segment;
More information on svl_compile can be found in the Amazon Redshift documentation.
There is also an article that explains the same issue and how to reduce the number of compilations (or work around them).
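As a rough check, the per-segment compile times can be summed and compared against the query's total runtime. This is a hedged sketch using the svl_compile columns shown above and the same query id 26540:
-- Total compile time and number of compiled segments for one query;
-- if total_compile_ms is close to the query's wall-clock time, compilation is the culprit.
SELECT query,
       SUM(datediff(ms, starttime, endtime)) AS total_compile_ms,
       SUM(compile) AS segments_compiled
FROM svl_compile
WHERE query = 26540
GROUP BY query;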

Related

LockRows plan node taking long time

I have the following query in Postgres (emulating a work queue):
DELETE FROM work_queue
WHERE id IN ( SELECT l.id
FROM work_queue l
WHERE l.delivered = 'f' and l.error = 'f' and l.archived = 'f'
ORDER BY created_at
LIMIT 5000
FOR UPDATE SKIP LOCKED );
While running the above concurrently (4 processes per second) along with a concurrent ingest at the rate of 10K records/second into work_queue, the query effectively bottlenecks on the LockRows node.
Query plan output:
Delete on work_queue  (cost=478.39..39609.09 rows=5000 width=67) (actual time=38734.995..38734.998 rows=0 loops=1)
  ->  Nested Loop  (cost=478.39..39609.09 rows=5000 width=67) (actual time=36654.711..38507.393 rows=5000 loops=1)
        ->  HashAggregate  (cost=477.96..527.96 rows=5000 width=98) (actual time=36654.690..36658.495 rows=5000 loops=1)
              Group Key: "ANY_subquery".id
              ->  Subquery Scan on "ANY_subquery"  (cost=0.43..465.46 rows=5000 width=98) (actual time=36600.963..36638.250 rows=5000 loops=1)
                    ->  Limit  (cost=0.43..415.46 rows=5000 width=51) (actual time=36600.958..36635.886 rows=5000 loops=1)
                          ->  LockRows  (cost=0.43..111701.83 rows=1345680 width=51) (actual time=36600.956..36635.039 rows=5000 loops=1)
                                ->  Index Scan using work_queue_created_at_idx on work_queue l  (cost=0.43..98245.03 rows=1345680 width=51) (actual time=779.706..2690.340 rows=250692 loops=1)
                                      Filter: ((NOT delivered) AND (NOT error) AND (NOT archived))
        ->  Index Scan using work_queue_pkey on work_queue  (cost=0.43..7.84 rows=1 width=43) (actual time=0.364..0.364 rows=1 loops=5000)
              Index Cond: (id = "ANY_subquery".id)
Planning Time: 8.424 ms
Trigger for constraint work_queue_logs_work_queue_id_fkey: time=5490.925 calls=5000
Trigger work_queue_locked_trigger: time=2119.540 calls=1
Execution Time: 46346.471 ms
(corresponding visualization: https://explain.dalibo.com/plan/ZaZ)
Any ideas on improving this? Why should locking rows take so long in the presence of concurrent inserts? Note that if I do not have concurrent inserts into the work_queue table, the query is super fast.
We can see that the index scan returned 250692 rows in order to find 5000 to lock. So apparently it had to skip over 49 other queries' worth of locked rows. That is not going to be very efficient, although if the situation were static it shouldn't be as slow as you see here. But it has to acquire a transient exclusive lock on a section of memory for each attempt. If it is fighting with many other processes for those locks, you can get a cascading collapse of performance.
If you are launching 4 such statements per second with no cap and without waiting for any previous ones to finish, then you have an unstable situation. The more you have running at one time, the more they fight each other and slow down. If the completion rate goes down but the launch interval does not, then you just get more processes fighting with more other processes and each getting slower. So once you get shoved over the edge, it might never recover on its own.
The role of the concurrent insertions is probably just to provide enough noisy load on the system to give the collapse a chance to gain a foothold. And of course without concurrent insertion, your deletes are going to run out of things to delete pretty soon, at which point they will be very fast.
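If you want to confirm that these deletes are piling up and waiting on each other, a hedged sketch against the standard pg_stat_activity view (not something taken from the question) would be:
-- Show the concurrent delete statements and what each one is currently waiting on.
SELECT pid, state, wait_event_type, wait_event,
       now() - query_start AS running_for,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE query ILIKE 'DELETE FROM work_queue%'
ORDER BY query_start;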

Efficient querying/indexing in Postgres for WHERE IN (...) and ORDER BY

I have a table of posts, and each post belongs to a classroom. I want to be able to query for most recent posts across several classrooms, like this:
SELECT * FROM posts
WHERE posts.classroom_id IN (6691, 6693, 6695, 6702)
ORDER BY date desc, created_at desc
LIMIT 30;
Unfortunately, this results in Postgres pulling and sorting tens of thousands of records - it has to get all the posts for each classroom, and sort all of them together, in order to find the 30 most recent overall.
Here's the explain+analyze:
->  Sort  (cost=67525.77..67571.26 rows=18194 width=489) (actual time=9373.376..9373.381 rows=30 loops=1)
      Sort Key: date DESC, created_at DESC
      Sort Method: top-N heapsort  Memory: 62kB
      ->  Bitmap Heap Scan on posts  (cost=350.74..66988.42 rows=18194 width=489) (actual time=41.360..9271.782 rows=42924 loops=1)
            Recheck Cond: (classroom_id = ANY ('{6691,6693,6695,6702}'::integer[]))
            Heap Blocks: exact=29456
            ->  Bitmap Index Scan on optimize_finding_photos_and_tagged_posts_by_classroom  (cost=0.00..346.19 rows=18194 width=0) (actual time=16.205..16.205 rows=42924 loops=1)
                  Index Cond: (classroom_id = ANY ('{6691,6693,6695,6702}'::integer[]))
Planning time: 0.216 ms
Execution time: 9390.323 ms
From various index options, the planner chose one that starts with classroom_id, which makes sense (the subsequent fields in that index are irrelevant). But it seems so inefficient that it has to gather 42,924 rows and sort them all.
It seems it could take a big shortcut by retrieving only the 30 most recent for each classroom, and then sorting those. To facilitate this, I tried adding a new index on [classroom_id, date DESC, created_at DESC], but the planner chose not to use it. Is Postgres just not quite clever enough to use the shortcut I describe? Or is there something I'm overlooking?
So, is there a better way to index or query, so that this kind of lookup can be more efficient?
One more side question: in the explain+analyze, why does the sort node take so little time? I would expect the sorting to be fairly slow/expensive.
Creating a test database...
CREATE TABLE posts( classroom_id INT NOT NULL, date FLOAT NOT NULL, foo TEXT );
INSERT INTO posts SELECT random()*100, random() FROM generate_series( 1,1500000 );
CREATE INDEX posts_cd ON posts( classroom_id, date );
CREATE INDEX posts_date ON posts( date );
VACUUM ANALYZE posts;
Note the "foo" column is there to avoid an index-only scan on posts which would be very fast on this test setup which only contains indexed columns classroom_id,date but would be useless for you since you will select other columns also.
If you have an index on date that you use for other things, like displaying the most recent posts for all classrooms, then you can use it here too:
EXPLAIN ANALYZE SELECT * FROM posts WHERE posts.classroom_id IN (1,2,6)
ORDER BY date desc LIMIT 30;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.29..55.67 rows=30 width=44) (actual time=0.040..0.983 rows=30 loops=1)
  ->  Index Scan Backward using posts_date on posts  (cost=0.29..5447.29 rows=2951 width=44) (actual time=0.039..0.978 rows=30 loops=1)
        Filter: (classroom_id = ANY ('{1,2,6}'::integer[]))
        Rows Removed by Filter: 916
Planning time: 0.117 ms
Execution time: 1.008 ms
This one is a bit risky since the condition on classroom is not indexed: since it will scan the date index backwards, if many classrooms that are excluded by the WHERE condition have recent posts it may have to skip lots of rows in the index before finding the requested rows. My test data distribution is random, but this query may have different performance if your data distribution is different.
Now, without the index on date.
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=10922.61..10922.69 rows=30 width=44) (actual time=41.038..41.049 rows=30 loops=1)
  ->  Sort  (cost=10922.61..11028.44 rows=42331 width=44) (actual time=41.036..41.040 rows=30 loops=1)
        Sort Key: date DESC
        Sort Method: top-N heapsort  Memory: 26kB
        ->  Bitmap Heap Scan on posts  (cost=981.34..9672.39 rows=42331 width=44) (actual time=10.275..33.056 rows=44902 loops=1)
              Recheck Cond: (classroom_id = ANY ('{1,2,6}'::integer[]))
              Heap Blocks: exact=8069
              ->  Bitmap Index Scan on posts_cd  (cost=0.00..970.76 rows=42331 width=0) (actual time=8.613..8.613 rows=44902 loops=1)
                    Index Cond: (classroom_id = ANY ('{1,2,6}'::integer[]))
Planning time: 0.145 ms
Execution time: 41.086 ms
Note I've adjusted the number of rows in the table so the bitmap scan finds about the same number as yours.
It's the same plan you had, including the Top-N heapsort which is much faster than a complete sort (and uses a lot less memory):
One more side question: in the explain+analyze, why does the sort node take so little time?
Basically it only keeps the top N rows in the heapsort buffer, since the rest would be discarded by the LIMIT anyway, so it doesn't have to sort everything. As the rows are fetched they are pushed into the heapsort buffer (or discarded right away if they wouldn't make the top N). So the sort doesn't happen as a separate step after the data to be sorted is gathered; it happens while the data is gathered, which is why it appears to take about the same time as retrieving the data.
Now, my query is a lot faster than yours, while they use the same plan. Several reasons could explain this; for example, I ran it on an SSD, which is fast. But I think the most likely explanation is that your posts table probably contains ... posts ... which means large-ish TEXT data. This means a lot of data has to be fetched and then discarded, keeping only 30 rows. In order to test this I just did:
UPDATE posts SET foo = <992 bytes of text>;
VACUUM ANALYZE posts;
...and the query is much slower, 360ms, and it says:
Heap Blocks: exact=41046
So that's probably your problem. To solve it, the query should not fetch large amounts of data and then discard them, which means we're going to use the primary key... you must have one already, but I forgot to add one to my test table, so here it is.
ALTER TABLE posts ADD post_id SERIAL PRIMARY KEY;
VACUUM ANALYZE posts;
DROP INDEX posts_cd;
CREATE INDEX posts_cdi ON posts( classroom_id, date, post_id );
I add the PK to the index, and drop the previous index, because I want an index-only scan in order to avoid fetching all the data from the table. Scanning only the index involves much less data since it doesn't contain the actual posts. Of course, we only get the PKs, so we have to JOIN back to the main table to get the posts, but that happens only after all the filtering is done, so it's only 30 rows.
EXPLAIN ANALYZE SELECT p.* FROM posts p
JOIN (SELECT post_id FROM posts WHERE posts.classroom_id IN (1,2,6)
ORDER BY date desc LIMIT 30) pids USING (post_id)
ORDER BY date desc LIMIT 30;
Limit  (cost=3212.05..3212.12 rows=30 width=1012) (actual time=38.410..38.421 rows=30 loops=1)
  ->  Sort  (cost=3212.05..3212.12 rows=30 width=1012) (actual time=38.410..38.419 rows=30 loops=1)
        Sort Key: p.date DESC
        Sort Method: quicksort  Memory: 85kB
        ->  Nested Loop  (cost=2957.71..3211.31 rows=30 width=1012) (actual time=38.108..38.329 rows=30 loops=1)
              ->  Limit  (cost=2957.29..2957.36 rows=30 width=12) (actual time=38.092..38.105 rows=30 loops=1)
                    ->  Sort  (cost=2957.29..3067.84 rows=44223 width=12) (actual time=38.092..38.104 rows=30 loops=1)
                          Sort Key: posts.date DESC
                          Sort Method: top-N heapsort  Memory: 26kB
                          ->  Index Only Scan using posts_cdi on posts  (cost=0.43..1651.19 rows=44223 width=12) (actual time=0.023..22.186 rows=44902 loops=1)
                                Index Cond: (classroom_id = ANY ('{1,2,6}'::integer[]))
                                Heap Fetches: 0
              ->  Index Scan using posts_pkey on posts p  (cost=0.43..8.45 rows=1 width=1012) (actual time=0.006..0.006 rows=1 loops=30)
                    Index Cond: (post_id = posts.post_id)
Planning time: 0.305 ms
Execution time: 38.468 ms
OK. Much faster now. This trick is pretty useful: when the table contains lots of data, or even lots of columns, that would have to be lugged around inside the query engine, filtered, and mostly discarded, it is sometimes faster to do the filtering and sorting on only the few small columns that are actually used, and then fetch the rest of the data only for the rows that remain after the filtering is done. Sometimes it is even worth splitting the table in two, with the columns used for filtering and sorting in one table and all the rest in the other table.
To go even faster we can make the query ugly:
SELECT p.* FROM posts p
JOIN (
SELECT * FROM (SELECT post_id, date FROM posts WHERE posts.classroom_id=1 ORDER BY date desc LIMIT 30) a
UNION ALL
SELECT * FROM (SELECT post_id, date FROM posts WHERE posts.classroom_id=2 ORDER BY date desc LIMIT 30) b
UNION ALL
SELECT * FROM (SELECT post_id, date FROM posts WHERE posts.classroom_id=3 ORDER BY date desc LIMIT 30) c
ORDER BY date desc LIMIT 30
) q USING (post_id)
ORDER BY date desc LIMIT 30;
This exploits the fact that, if there is only one classroom_id in the WHERE condition, then Postgres will use an index scan backward on (classroom_id, date) directly. And since I've added post_id to it, it doesn't even need to touch the table. And since the three selects in the union have the same sort order, it combines them with a merge, which means it doesn't even need to sort, or even fetch the rows that are cut off by the outer LIMIT 30.
Limit  (cost=257.97..258.05 rows=30 width=1012) (actual time=0.357..0.367 rows=30 loops=1)
  ->  Sort  (cost=257.97..258.05 rows=30 width=1012) (actual time=0.356..0.364 rows=30 loops=1)
        Sort Key: p.date DESC
        Sort Method: quicksort  Memory: 85kB
        ->  Nested Loop  (cost=1.73..257.23 rows=30 width=1012) (actual time=0.063..0.319 rows=30 loops=1)
              ->  Limit  (cost=1.31..3.28 rows=30 width=12) (actual time=0.050..0.085 rows=30 loops=1)
                    ->  Merge Append  (cost=1.31..7.24 rows=90 width=12) (actual time=0.049..0.081 rows=30 loops=1)
                          Sort Key: posts.date DESC
                          ->  Limit  (cost=0.43..1.56 rows=30 width=12) (actual time=0.024..0.032 rows=12 loops=1)
                                ->  Index Only Scan Backward using posts_cdi on posts  (cost=0.43..531.81 rows=14136 width=12) (actual time=0.024..0.029 rows=12 loops=1)
                                      Index Cond: (classroom_id = 1)
                                      Heap Fetches: 0
                          ->  Limit  (cost=0.43..1.55 rows=30 width=12) (actual time=0.018..0.024 rows=9 loops=1)
                                ->  Index Only Scan Backward using posts_cdi on posts posts_1  (cost=0.43..599.55 rows=15950 width=12) (actual time=0.017..0.023 rows=9 loops=1)
                                      Index Cond: (classroom_id = 2)
                                      Heap Fetches: 0
                          ->  Limit  (cost=0.43..1.56 rows=30 width=12) (actual time=0.006..0.014 rows=11 loops=1)
                                ->  Index Only Scan Backward using posts_cdi on posts posts_2  (cost=0.43..531.81 rows=14136 width=12) (actual time=0.006..0.014 rows=11 loops=1)
                                      Index Cond: (classroom_id = 3)
                                      Heap Fetches: 0
              ->  Index Scan using posts_pkey on posts p  (cost=0.43..8.45 rows=1 width=1012) (actual time=0.006..0.007 rows=1 loops=30)
                    Index Cond: (post_id = posts.post_id)
Planning time: 0.445 ms
Execution time: 0.432 ms
The resulting speedup is pretty ridiculous. I think this should work.
To facilitate this, I tried adding a new index on [classroom_id, date DESC, created_at DESC], but the planner chose not to use it. Is Postgres just not quite clever enough to use the shortcut I describe?
It is just not clever enough. You could write it out explicitly to get the execution you envision. It is ugly, but it should be effective:
(SELECT * FROM posts WHERE classroom_id = 6691 ORDER BY date desc, created_at desc LIMIT 30)
union all
(SELECT * FROM posts WHERE classroom_id = 6693 ORDER BY date desc, created_at desc LIMIT 30)
union all
(SELECT * FROM posts WHERE classroom_id = 6695 ORDER BY date desc, created_at desc LIMIT 30)
union all
(SELECT * FROM posts WHERE classroom_id = 6702 ORDER BY date desc, created_at desc LIMIT 30)
order by date desc, created_at desc LIMIT 30;
One more side question: in the explain+analyze, why does the sort node take so little time? I would expect the sorting to be fairly slow/expensive.
CPUs are very fast, and 40,000 rows is just not very many. Unlike CPUs, however, your storage is not nearly as fast, and stomping all over a very large table to collect 40,000 rows takes a lot of time. There are all kinds of ways to try to address this (other than fixing the planner or rewriting your query): get faster primary storage or more caching, get it to use an index-only scan (is selecting * really necessary?), cluster or partition the table on classroom_id so that rows of the same classroom are located together, etc.
If you don't want to rearrange your data, change your hardware, or rewrite your query, then maybe the simplest thing to try would be to build an index on (date, created_at). That might lead to another plan which is less good than the perfect plan, but much better than the current one: it could use the index to walk the data in already-sorted order, collecting rows that meet the IN condition until it has collected 30.
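For reference, a minimal sketch of that fallback index (the index name is mine); Postgres can scan it backward to produce the date desc, created_at desc ordering:
-- Supports walking posts in (date, created_at) order regardless of classroom_id.
CREATE INDEX posts_date_created_at_idx ON posts (date, created_at);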

Slow on first query

I'm having trouble when I perform the first query on a table. Subsequent queries are much faster, even if I change the date range to look for. I assume that PostgreSQL implements a caching mechanism that allows subsequent queries to be much faster. I can try to warm up the cache so the first user request hits the cache. However, I think I can somehow improve the following query:
SELECT
    y.id, y.title, x.visits, x.score
FROM (
    SELECT
        article_id, visits,
        COALESCE(ROUND((visits / NULLIF(hits, 0)::float)::numeric, 4), 0) score
    FROM (
        SELECT
            article_id, SUM(visits) visits, SUM(hits) hits
        FROM
            article_reports a
        WHERE
            a.site_id = 'XYZ' AND a.date >= '2017-04-13' AND a.date <= '2017-06-28'
        GROUP BY
            article_id
    ) q
    ORDER BY score DESC, visits DESC
    LIMIT 20
) x
INNER JOIN
    articles y ON x.article_id = y.id
Any ideas on how I can improve this? The following is the result of EXPLAIN ANALYZE:
Nested Loop  (cost=84859.76..85028.54 rows=20 width=272) (actual time=12612.596..12612.836 rows=20 loops=1)
  ->  Limit  (cost=84859.34..84859.39 rows=20 width=52) (actual time=12612.502..12612.517 rows=20 loops=1)
        ->  Sort  (cost=84859.34..84880.26 rows=8371 width=52) (actual time=12612.499..12612.503 rows=20 loops=1)
              Sort Key: q.score DESC, q.visits DESC
              Sort Method: top-N heapsort  Memory: 27kB
              ->  Subquery Scan on q  (cost=84218.04..84636.59 rows=8371 width=52) (actual time=12513.168..12602.649 rows=28965 loops=1)
                    ->  HashAggregate  (cost=84218.04..84301.75 rows=8371 width=36) (actual time=12513.093..12536.823 rows=28965 loops=1)
                          Group Key: a.article_id
                          ->  Bitmap Heap Scan on article_reports a  (cost=20122.78..77122.91 rows=405436 width=36) (actual time=135.588..11974.774 rows=398242 loops=1)
                                Recheck Cond: (((site_id)::text = 'XYZ'::text) AND (date >= '2017-04-13'::date) AND (date <= '2017-06-28'::date))
                                Heap Blocks: exact=36911
                                ->  Bitmap Index Scan on index_article_reports_on_site_id_and_article_id_and_date  (cost=0.00..20021.42 rows=405436 width=0) (actual time=125.846..125.846 rows=398479 loops=1)
                                      Index Cond: (((site_id)::text = 'XYZ'::text) AND (date >= '2017-04-13'::date) AND (date <= '2017-06-28'::date))
  ->  Index Scan using articles_pkey on articles y  (cost=0.42..8.44 rows=1 width=128) (actual time=0.014..0.014 rows=1 loops=20)
        Index Cond: (id = q.article_id)
Planning time: 1.443 ms
Execution time: 12613.689 ms
Thanks in advance
There are two levels of "cache" that Postgres uses:
OS file cache
shared buffers.
Important: Postgres directly controls only the second one, and relies on the first one, which is under the OS's control.
The first things I would check are these two settings in postgresql.conf:
effective_cache_size – usually I set it to ~3/4 of all available RAM. Note that this is not a setting that tells Postgres how to allocate memory; it's just a hint to the Postgres planner giving an estimate of the OS file cache size.
shared_buffers – usually I set it to ~1/4 of RAM. This one is an actual allocation setting.
Also, I'd check the other memory-related settings (work_mem, maintenance_work_mem) to understand how much RAM might be consumed, so that my effective_cache_size estimate stays realistic most of the time.
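For illustration only, a hedged sketch of how those settings might look on a hypothetical server with 16 GB of RAM (the numbers are assumptions, not recommendations for your machine); ALTER SYSTEM writes to postgresql.auto.conf instead of editing postgresql.conf by hand:
ALTER SYSTEM SET effective_cache_size = '12GB';   -- ~3/4 of RAM, planner hint only
ALTER SYSTEM SET shared_buffers = '4GB';          -- ~1/4 of RAM, requires a server restart
ALTER SYSTEM SET work_mem = '64MB';               -- per sort/hash operation
ALTER SYSTEM SET maintenance_work_mem = '512MB';
SELECT pg_reload_conf();                          -- applies the reloadable settings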
But if you have just started your Postgres server, the first queries will most probably be slow because there is no data yet in the OS file cache or in shared buffers. You can check this with extended EXPLAIN options:
EXPLAIN (ANALYZE, BUFFERS) SELECT ...
-- you will see how many buffers were fetched from disk ("read") or from cache ("hit")
Here you can find good material on using EXPLAIN: http://www.dalibo.org/_media/understanding_explain.pdf
Additionally, there is an extension aiming to solve "cold cache" problem: pg_prewarm https://www.postgresql.org/docs/current/static/pgprewarm.html
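A minimal sketch of using pg_prewarm, assuming the article_reports table and its index from the question are what need to be warm before the first user request:
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
-- Preload the table (and, optionally, the index the plan uses) into shared buffers.
SELECT pg_prewarm('article_reports');
SELECT pg_prewarm('index_article_reports_on_site_id_and_article_id_and_date');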
Also, working with SSD disks instead of magnetic ones will mean that disk reads will be much faster.
Have fun and a well-working Postgres :-)
If it is the first query after inserting many rows, you should run ANALYZE on the whole database or on the involved tables. Try executing it at the database level.
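For example (table names taken from the query above):
ANALYZE;                  -- whole database
ANALYZE article_reports;  -- or just the involved tables
ANALYZE articles;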

I don't understand explain result of a slow query

I have a slow query (> 1s). Here is the result of an explain analyze on that query:
Nested Loop Left Join  (cost=0.42..32275.13 rows=36 width=257) (actual time=549.409..1106.044 rows=2 loops=1)
  Join Filter: (answer.lt_surveyee_survey_id = lt_surveyee_survey.id)
  ->  Index Scan using lt_surveyee_survey_id_key on lt_surveyee_survey  (cost=0.42..8.44 rows=1 width=64) (actual time=0.108..0.111 rows=1 loops=1)
        Index Cond: (id = 'xxxxx'::citext)
  ->  Seq Scan on answer  (cost=0.00..32266.24 rows=36 width=230) (actual time=549.285..1105.910 rows=2 loops=1)
        Filter: (lt_surveyee_survey_id = 'xxxxx'::citext)
        Rows Removed by Filter: 825315
Planning time: 0.592 ms
Execution time: 1106.124 ms
The xxxxx parts of the result are UUID-like values. I did not build that database, so I have no clue right now. Here is the query:
EXPLAIN ANALYZE SELECT
lt_surveyee_survey.id
-- +Some other fields
FROM lt_surveyee_survey
LEFT JOIN answer ON answer.lt_surveyee_survey_id = lt_surveyee_survey.id
WHERE lt_surveyee_survey.id = 'xxxxx';
Your JOIN is causing the performance drop according to the EXPLAIN ANALYZE output. You can see that there are two different lookups: one took a fraction of a millisecond, and the other took about a second.
The difference is indicated by the beginning of the lines: Index Scan and Seq Scan, where Seq is short for sequential, meaning that all of the rows had to be checked by the DBMS to process the JOIN. The reason why sequential scans would occur is a missing index on the column being checked (answer.lt_surveyee_survey_id in your case).
Adding an index should solve the performance issue.
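A minimal sketch of that index (the index name is mine; the table and column come from the plan above):
-- Lets the join look up answers by lt_surveyee_survey_id instead of scanning every row.
CREATE INDEX answer_lt_surveyee_survey_id_idx ON answer (lt_surveyee_survey_id);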

Select max(sort_key) from tbl_5billion_rows taking too long

I have a Redshift table with 5 billion rows which is going to grow a lot in the near future. When I run a simple query, select max(sort_key) from tbl, it takes 30 seconds. I have only one sort key in the table. I have run VACUUM and ANALYZE on the table recently. The reason I am worried about the 30 seconds is that I use max(sort_key) multiple times in my subqueries. Is there anything I am missing?
Output of EXPLAIN select max(sort_key) from tbl:
XN Aggregate  (cost=55516326.40..55516326.40 rows=1 width=4)
  ->  XN Seq Scan on tbl  (cost=0.00..44413061.12 rows=4441306112 width=4)
Output of EXPLAIN select sort_key from tbl order by sort_key desc limit 1:
XN Limit  (cost=1000756095433.11..1000756095433.11 rows=1 width=4)
  ->  XN Merge  (cost=1000756095433.11..1000767198698.39 rows=4441306112 width=4)
        Merge Key: sort_key
        ->  XN Network  (cost=1000756095433.11..1000767198698.39 rows=4441306112 width=4)
              Send to leader
              ->  XN Sort  (cost=1000756095433.11..1000767198698.39 rows=4441306112 width=4)
                    Sort Key: sort_key
                    ->  XN Seq Scan on tbl  (cost=0.00..44413061.12 rows=4441306112 width=4)
Finding the MAX() of a value requires Amazon Redshift to look through every value in the column. It probably isn't smart enough to realise that the MAX of the Sortkey is right at the end.
You could speed it up by helping the query use Zone Maps, which identify the range of values stored in each block.
If you know that the maximum sortkey is above a particular value, include that in the WHERE clause, eg:
SELECT MAX(sort_key) FROM tbl WHERE sort_key > 50000;
This will dramatically reduce the number of blocks that Redshift needs to retrieve from disk.
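If the same restricted MAX() is needed several times, one hedged option (not from the answer above) is to compute it once and reuse the result, for example via a temporary table; the name max_sk is mine and the 50000 bound is carried over from the example:
-- Compute the maximum once, then reference max_sk in the subqueries
-- instead of repeating MAX(sort_key) over the 5-billion-row table.
CREATE TEMP TABLE max_sk AS
SELECT MAX(sort_key) AS max_sort_key
FROM tbl
WHERE sort_key > 50000;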