Very bad performance of UNION select query in RedShift / ParAccel - amazon-redshift

I have two tables in redshift:
tbl_current_day - about 4.5M rows
tbl_previous_day - about 4.5M rows, with the same data exactly as tbl_current_day
In addition, I have a view called qry_both_days defined as follows:
CREATE OR REPLACE VIEW qry_both_days AS
SELECT * FROM tbl_current_day
UNION SELECT * FROM tbl_previous_day;
When I run a query on one of the separate tables, I get very good performance as expected.
For example, the following query runs in 5 seconds:
select count(distinct person_id) from tbl_current_day;
-- (person_id is of type int)
Explain plan:
XN Aggregate (cost=1224379.82..1224379.82 rows=1 width=4)
-> XN Subquery Scan volt_dt_0 (cost=1224373.80..1224378.61 rows=481 width=4)
-> XN HashAggregate (cost=1224373.80..1224373.80 rows=481 width=4)
-> XN Seq Scan on tbl_current_day (cost=0.00..979499.04 rows=97949904 width=4)
Note that width is 4 bytes, as it's supposed to be, as my column is of type int.
HOWEVER, when I run the same query on qry_both_days it runs 20 times slower, while I would expect it to be only about 2 times slower, since it only has to go over twice as many rows:
select count(distinct person_id) from qry_both_days;
Explain plan:
XN Aggregate (cost=55648338.34..55648338.34 rows=1 width=4)
-> XN Subquery Scan volt_dt_0 (cost=55648335.84..55648337.84 rows=200 width=4)
-> XN HashAggregate (cost=55648335.84..55648335.84 rows=200 width=4)
-> XN Subquery Scan qry_both_days (cost=0.00..54354188.49 rows=517658938 width=4)
-> XN Unique (cost=0.00..49177599.11 rows=517658938 width=190)
-> XN Append (cost=0.00..10353178.76 rows=517658938 width=190)
-> XN Subquery Scan "*SELECT* 1" (cost=0.00..89649.20 rows=4482460 width=190)
-> XN Seq Scan on tbl_current_day (cost=0.00..44824.60 rows=4482460 width=190)
-> XN Subquery Scan "*SELECT* 2" (cost=0.00..90675.00 rows=4533750 width=187)
-> XN Seq Scan on tbl_previous_day (cost=0.00..45337.50 rows=4533750 width=187)
The problem: width is now 190, not 4 bytes as it's supposed to be!!!
Does anybody know how to make Redshift pick only the relevant columns in a UNION SELECT?
Thanks!

UNION used by itself removes duplicate rows, i.e., it applies an implied DISTINCT, as per the SQL spec.
That means that a lot more processing is required to prepare the output.
If you do not want DISTINCT results then you should always use UNION ALL to make sure the database is not checking for potential dupes.
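For example, the view from the question could be recreated with UNION ALL so the duplicate-elimination step disappears (a minimal sketch based on the definition above):
CREATE OR REPLACE VIEW qry_both_days AS
SELECT * FROM tbl_current_day
UNION ALL
SELECT * FROM tbl_previous_day;
Note that UNION ALL keeps duplicate rows, but for a count(distinct person_id) on top of the view the result is the same either way.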

Your view is created as SELECT *, so it always queries all the columns to create data for the view.
Then the outer SELECT runs against the view and only the requested columns are returned.
If you only ever select a limited number of column combinations (say two or three sets that are used all the time), I'd create a separate view for each column set.
Another way (even less elegant than the one before) is to name each view so that its name says which columns are included (let's say sorted and separated with '__'), like qry_both_days__age__name__person_id. Then, before each query, check whether the required view exists, and create it if it doesn't.
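As a sketch of the first suggestion (the view name, the column choice, and the use of UNION ALL are assumptions on top of the question), a narrow view covering just person_id might look like this:
CREATE OR REPLACE VIEW qry_both_days__person_id AS
SELECT person_id FROM tbl_current_day
UNION ALL
SELECT person_id FROM tbl_previous_day;
-- the count then only has to read the person_id column:
select count(distinct person_id) from qry_both_days__person_id;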

Related

Postgres JSONB Retrieval is very slow

A bit at a loss here. First and foremost, I'm not a DBA, nor do I really have any experience with Postgres with the exception of what I'm doing now.
Postgres seems to choke when you want to return jsonb documents for anything more than a couple of hundred rows. When you try to return thousands, query performance becomes terrible. If you go even further and attempt to return multiple jsonb documents following various table joins, forget it.
Here is my scenario with some code:
I have 3 tables - all of them have jsonb models, all complex, and 2 of them are sizeable (8 to 12 kB uncompressed). In this particular operation I need to unnest a jsonb array of elements to work through - this gives me roughly 12k records.
Each record then contains an ID that I use to join another important table - I need to retrieve the jsonb doc from this table. From there, I need to join that table on to another (much smaller) table and also pull the doc from there based on another key.
The output is therefore several columns + 3 jsonb documents ranging from under 1 kB to around 12 kB uncompressed.
Query data retrieval is effectively pointless - I've yet to see the query return data. As soon as I strip away the jsonb doc columns, the query naturally speeds up to seconds or less. One jsonb document bumps the retrieval to 40 seconds in my case, adding a second takes us to 2 minutes, and adding the third takes much longer.
What am I doing wrong? Is there any way to retrieve the jsonb documents in a performant way?
SELECT x.id,
a.doc1,
b.doc2,
c.doc3
FROM ( SELECT id,
(elements.elem ->> 'a'::text)::integer AS a,
(elements.elem ->> 'b'::text)::integer AS b,
(elements.elem ->> 'c'::text)::integer AS c,
(elements.elem ->> 'd'::text)::integer AS d,
(elements.elem ->> 'e'::text)::integer AS e
FROM tab
CROSS JOIN LATERAL jsonb_array_elements(tab.doc -> 'arrayList'::text) WITH ORDINALITY elements(elem, ordinality)) x
LEFT JOIN table2 a ON x.id = a.id
LEFT JOIN table3 b ON a.other_id = b.id
LEFT JOIN table4 c ON b.other_id = c.id;
The tables themselves are fairly standard:
CREATE TABLE a (
    id integer PRIMARY KEY,   -- primary key; types here are illustrative
    other_id integer,         -- foreign key to the next table in the chain
    doc jsonb
);
Nothing special about these tables, they are ids and jsonb documents
A note - we are using Postgres for a few reasons: we do need the relational aspects of PG, but at the same time we need the document storage and retrieval ability for later in our workflow.
Apologies if I've not provided enough data here, I can try to add some more based on any comments
EDIT: added explain:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=465.79..96225.93 rows=11300 width=1843)
Hash Cond: (pr.table_3_id = br.id)
-> Hash Left Join (cost=451.25..95756.86 rows=11300 width=1149)
Hash Cond: (((p.doc ->> 'secondary_id'::text))::integer = pr.id)
-> Nested Loop Left Join (cost=0.44..95272.14 rows=11300 width=1029)
-> Nested Loop (cost=0.01..239.13 rows=11300 width=40)
-> Seq Scan on table_3 (cost=0.00..13.13 rows=113 width=710)
-> Function Scan on jsonb_array_elements elements (cost=0.01..1.00 rows=100 width=32)
-> Index Scan using table_1_pkey on table_1 p (cost=0.43..8.41 rows=1 width=993)
Index Cond: (((elements.elem ->> 'field_id'::text))::integer = id)
-> Hash (cost=325.36..325.36 rows=10036 width=124)
-> Seq Scan on table_2 pr (cost=0.00..325.36 rows=10036 width=124)
-> Hash (cost=13.13..13.13 rows=113 width=710)
-> Seq Scan on table_3 br (cost=0.00..13.13 rows=113 width=710)
(14 rows)
EDIT2: Sorry, been mega busy - I will try to go into more detail - first, the full explain plan (I didn't know about the additional parameters) - I'll leave in the actual table names (I wasn't sure if I was allowed to):
Hash Left Join (cost=465.79..96225.93 rows=11300 width=1726) (actual time=4.669..278.781 rows=12522 loops=1)
Hash Cond: (pr.brand_id = br.id)
Buffers: shared hit=64813
-> Hash Left Join (cost=451.25..95756.86 rows=11300 width=1032) (actual time=4.537..265.749 rows=12522 loops=1)
Hash Cond: (((p.doc ->> 'productId'::text))::integer = pr.id)
Buffers: shared hit=64801
-> Nested Loop Left Join (cost=0.44..95272.14 rows=11300 width=912) (actual time=0.240..39.480 rows=12522 loops=1)
Buffers: shared hit=49964
-> Nested Loop (cost=0.01..239.13 rows=11300 width=40) (actual time=0.230..8.177 rows=12522 loops=1)
Buffers: shared hit=163
-> Seq Scan on brand (cost=0.00..13.13 rows=113 width=710) (actual time=0.003..0.038 rows=113 loops=1)
Buffers: shared hit=12
-> Function Scan on jsonb_array_elements elements (cost=0.01..1.00 rows=100 width=32) (actual time=0.045..0.057 rows=111 loops=113)
Buffers: shared hit=151
-> Index Scan using product_variant_pkey on product_variant p (cost=0.43..8.41 rows=1 width=876) (actual time=0.002..0.002 rows=1 loops=12522)
Index Cond: (((elements.elem ->> 'productVariantId'::text))::integer = id)
Buffers: shared hit=49801
-> Hash (cost=325.36..325.36 rows=10036 width=124) (actual time=4.174..4.174 rows=10036 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 1684kB
Buffers: shared hit=225
-> Seq Scan on product pr (cost=0.00..325.36 rows=10036 width=124) (actual time=0.003..1.836 rows=10036 loops=1)
Buffers: shared hit=225
-> Hash (cost=13.13..13.13 rows=113 width=710) (actual time=0.114..0.114 rows=113 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 90kB
Buffers: shared hit=12
-> Seq Scan on brand br (cost=0.00..13.13 rows=113 width=710) (actual time=0.003..0.043 rows=113 loops=1)
Buffers: shared hit=12
Planning Time: 0.731 ms
Execution Time: 279.952 ms
(29 rows)
Your query is hard to follow for a few reasons:
Your tables are named tab, table2, table3, table4.
Your subquery parses JSON for every single row in the table, projects out some values, and then the outer query never uses those values. The only one that appears to be relevant is id.
Outer joins must be executed in order while inner joins can be freely re-arranged for performance. Without knowing the purpose of this query, it's impossible for me to determine if an outer join is appropriate or not.
The table names and column names in the execution plan do not match the query, so I'm not convinced this plan is accurate.
You did not supply a schema.
That said, I'll do my best.
Things that stand out performance-wise
No WHERE clause
Since there is no WHERE clause, your query will run jsonb_array_elements against every single row of tab, which is what is happening. Aside from extracting the data out of the JSON and storing it in a separate column, I can't imagine much that could be done to optimize it.
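For example, one hedged option (tab_elements is a hypothetical side table; the names follow the simplified query above) is to expand the array once up front rather than on every query:
-- Materialize the JSON array elements into a plain table so later queries
-- join against ordinary integer columns instead of re-expanding the JSON.
CREATE TABLE tab_elements AS
SELECT t.id,
       (e.elem ->> 'a')::integer AS a,
       (e.elem ->> 'b')::integer AS b,
       (e.elem ->> 'c')::integer AS c,
       (e.elem ->> 'd')::integer AS d,
       (e.elem ->> 'e')::integer AS e
FROM tab t
CROSS JOIN LATERAL jsonb_array_elements(t.doc -> 'arrayList') AS e(elem);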
Insufficient indexes to power the joins
The query plan suggests there might be a meaningful cost to the joins. Except for table1, each join is driven by a sequential scan, which means reading every row of those tables. I suspect adding indexes to each table will help. It appears you are joining on id columns, so a simple primary key constraint will improve both your data integrity and your query performance.
alter table tab add primary key (id);
alter table table2 add primary key (id);
alter table table3 add primary key (id);
alter table table4 add primary key (id);
Type conversions
This part of the execution plan shows a double type conversion in your first join:
Index Cond: (((elements.elem ->> 'field_id'::text))::integer = id)
This predicate means that the value is extracted from the JSON element as text, and that text is then converted to an integer so it can match against the integer id column. These conversions can be expensive in compute time and, in some cases, can prevent index usage. It's hard to advise on what to do because I don't know what the actual types are.
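If the JSON value were stored in a real integer column instead (as in the hypothetical tab_elements sketch above), the join predicate would become a plain integer equality with no casts; the column names below are assumptions:
-- Hypothetical: join on an already-extracted integer column, no text casts.
SELECT x.id, a.doc AS doc1
FROM tab_elements x
LEFT JOIN table2 a ON a.id = x.a;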

Slow redshift query with low cost and number of rows

I have a Redshift query that results on the following query plan being generated:
XN HashAggregate (cost=4.00..4.06 rows=1 width=213)
-> XN Hash Join DS_DIST_ALL_NONE (cost=0.02..3.97 rows=1 width=213)
-> XN Hash Join DS_DIST_NONE (cost=0.00..3.93 rows=1 width=213)
-> XN Hash (cost=0.01..0.01 rows=1 width=8)
-> XN Seq Scan on response_entities re (cost=0.00..1.96 rows=157 width=85)
-> XN Hash (cost=0.00..0.00 rows=1 width=208)
-> XN Seq Scan on response_views rv (cost=0.00..0.00 rows=1 width=208)
-> XN Seq Scan on dim_date dd (cost=0.00..0.01 rows=1 width=8)
The query wouldn't broadcast or redistribute any data, it has a very low cost, and it doesn't need to read a large number of rows. It actually doesn't return any rows, and none of its steps are disk-based.
The execution details on the AWS console show the execution timeline (screenshot not reproduced here).
I'm not including the query because I'm not looking for a reason why this particular query took 3 seconds to complete. I keep seeing execution timelines similar to this one, I'm trying to understand why even though each step takes just a couple of milliseconds to complete, the query ends up taking much longer. There are no other concurrent queries being executed.
Is all this time spent just on query compilation? Is this expected? Is there something I'm missing?
Query compilation seems to be the reason for this. This query shows the slow compiled segments:
select userid, xid, pid, query, segment, locus,
datediff(ms, starttime, endtime) as duration, compile
from svl_compile
where query = 26540
order by query, segment;
More information on svl_compile can be found in the Amazon Redshift documentation.
And this article explains the same issue and how to reduce the number of compilations (or work around them).

Select max(sort_key) from tbl_5billion_rows taking too long

I have a Redshift table with 5 billion rows which is going to grow a lot in the near future. When I run a simple query, select max(sort_key) from tbl, it takes 30 seconds. I have only one sort key in the table. I have run VACUUM and ANALYZE on the table recently. The reason I am worried about the 30 seconds is that I use max(sort_key) multiple times in my subqueries. Is there anything I am missing?
EXPLAIN output for select max(sort_key) from tbl:
XN Aggregate (cost=55516326.40..55516326.40 rows=1 width=4)
-> XN Seq Scan on tbl (cost=0.00..44413061.12 rows=4441306112 width=4)
EXPLAIN output for select sort_key from tbl order by sort_key desc limit 1:
XN Limit (cost=1000756095433.11..1000756095433.11 rows=1 width=4)
-> XN Merge (cost=1000756095433.11..1000767198698.39 rows=4441306112 width=4)
Merge Key: sort_key
-> XN Network (cost=1000756095433.11..1000767198698.39 rows=4441306112 width=4)
Send to leader
-> XN Sort (cost=1000756095433.11..1000767198698.39 rows=4441306112 width=4)
Sort Key: sort_key
-> XN Seq Scan on tbl (cost=0.00..44413061.12 rows=4441306112 width=4)
Finding the MAX() of a value requires Amazon Redshift to look through every value in the column. It probably isn't smart enough to realise that the MAX of the Sortkey is right at the end.
You could speed it up by helping the query use Zone Maps, which identify the range of values stored in each block.
If you know that the maximum sort_key is above a particular value, include that in the WHERE clause, e.g.:
SELECT MAX(sort_key) FROM tbl WHERE sort_key > 50000;
This will dramatically reduce the number of blocks that Redshift needs to retrieve from disk.
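Since the question mentions using max(sort_key) several times in subqueries, a hedged variation is to compute the bounded MAX once in a WITH clause and reuse it (the 50000 bound is just the illustrative value from above):
WITH max_key AS (
    SELECT MAX(sort_key) AS max_sort_key
    FROM tbl
    WHERE sort_key > 50000  -- a lower bound you already know to be safe
)
SELECT t.*
FROM tbl t
JOIN max_key m ON t.sort_key = m.max_sort_key;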

Postgres 9.4: How to fix Query Planner's Choice of Hash Join in ANY ARRAY lookup which runs 10x slower

I realize of course that figuring out these issues can be complex and require lots of info but I'm hoping there is a known issue or workaround for this particular case. I've narrowed down the change in the query that causes the sub-optimal query plan (this is running Postgres 9.4).
The following query runs in about 50ms. The tag_device table is a junction table with ~2 million entries, the devices table has about 1.5 million entries and the tags table has about 500,000 entries (Note: the actual IP values are just made up).
WITH inner_query AS (
SELECT * FROM tag_device
INNER JOIN tags ON tag_device.tag_id = tags.id
INNER JOIN devices ON tag_device.device_id = devices.id
WHERE devices.device_ip <<= ANY(ARRAY[
'10.0.0.1', '10.0.0.2', '10.0.0.5', '11.1.1.1', '12.2.2.35','13.0.0.1', '15.0.0.8', '1.160.0.1', '17.1.1.24', '18.2.2.1',
'10.0.0.6', '10.0.0.21', '10.0.0.52', '11.1.1.2', '12.2.2.34','13.0.0.2', '15.0.0.7', '1.160.0.2', '17.1.1.23', '18.2.2.2',
'10.0.0.7', '10.0.0.22', '10.0.0.53', '11.1.1.3', '12.2.2.33','13.0.0.3', '15.0.0.6', '1.160.0.3', '17.1.1.22', '18.2.2.3'
]::iprange[])
)
SELECT * FROM inner_query LIMIT 100 OFFSET 0;
A few things to note. device_ip is using the ip4r module (https://github.com/RhodiumToad/ip4r) to provide ip range lookups and this column has a gist index on it. The above query runs in about 50ms using the following query plan:
Limit (cost=140367.19..140369.19 rows=100 width=239)
CTE inner_query
-> Nested Loop (cost=40147.63..140367.19 rows=56193 width=431)
-> Merge Join (cost=40147.20..113345.15 rows=56193 width=261)
Merge Cond: (tag_device.device_id = devices.id)
-> Index Scan using tag_device_device_id_idx on tag_device (cost=0.43..67481.36 rows=1900408 width=51)
-> Materialize (cost=40136.82..40402.96 rows=53228 width=210)
-> Sort (cost=40136.82..40269.89 rows=53228 width=210)
Sort Key: devices.id
-> Bitmap Heap Scan on devices (cost=1489.12..30498.45 rows=53228 width=210)
Recheck Cond: (device_ip <<= ANY ('{10.0.0.1,10.0.0.2,10.0.0.5,11.1.1.1,12.2.2.2,13.0.0.1,15.0.0.2,1.160.0.5,17.1.1.1,18.2.2.2,10.0.0.1,10.0.0.2,10.0.0.5,11.1.1.1,12.2.2.2,13.0.0.1,15.0.0.2,1.160.0.5,17.1.1.1,18.2.2.2 (...)
-> Bitmap Index Scan on devices_iprange_idx (cost=0.00..1475.81 rows=53228 width=0)
Index Cond: (device_ip <<= ANY ('{10.0.0.1,10.0.0.2,10.0.0.5,11.1.1.1,12.2.2.2,13.0.0.1,15.0.0.2,1.160.0.5,17.1.1.1,18.2.2.2,10.0.0.1,10.0.0.2,10.0.0.5,11.1.1.1,12.2.2.2,13.0.0.1,15.0.0.2,1.160.0.5,17.1.1.1,18.2 (...)
-> Index Scan using tags_id_pkey on tags (cost=0.42..0.47 rows=1 width=170)
Index Cond: (id = tag_device.tag_id)
-> CTE Scan on inner_query (cost=0.00..1123.86 rows=56193 width=239)
If I increase the number of IP addresses in the ARRAY being looked up, then the query plan changes and becomes drastically slower. In the fast version of the query there are 30 items in the array. If I increase this to 80 items in the array, the query plan changes and becomes significantly slower (over 10x). The query remains the same in all other ways. The new query plan makes use of hash joins instead of merge joins and nested loops. Here is the new (much slower) query plan for when the array has 80 items in it instead of 30.
Limit (cost=204482.39..204484.39 rows=100 width=239)
CTE inner_query
-> Hash Join (cost=85839.13..204482.39 rows=146180 width=431)
Hash Cond: (tag_device.tag_id = tags.id)
-> Hash Join (cost=51368.40..145023.34 rows=146180 width=261)
Hash Cond: (tag_device.device_id = devices.id)
-> Seq Scan on tag_device (cost=0.00..36765.08 rows=1900408 width=51)
-> Hash (cost=45580.57..45580.57 rows=138466 width=210)
-> Bitmap Heap Scan on devices (cost=3868.31..45580.57 rows=138466 width=210)
Recheck Cond: (device_ip <<= ANY ('{10.0.0.1,10.0.0.2,10.0.0.5,11.1.1.1,12.2.2.35,13.0.0.1,15.0.0.8,1.160.0.1,17.1.1.24,18.2.2.1,10.0.0.6,10.0.0.21,10.0.0.52,11.1.1.2,12.2.2.34,13.0.0.2,15.0.0.7,1.160.0.2,17.1.1.23,18.2.2.2 (...)
-> Bitmap Index Scan on devices_iprange_idx (cost=0.00..3833.70 rows=138466 width=0)
Index Cond: (device_ip <<= ANY ('{10.0.0.1,10.0.0.2,10.0.0.5,11.1.1.1,12.2.2.35,13.0.0.1,15.0.0.8,1.160.0.1,17.1.1.24,18.2.2.1,10.0.0.6,10.0.0.21,10.0.0.52,11.1.1.2,12.2.2.34,13.0.0.2,15.0.0.7,1.160.0.2,17.1.1.23,18.2 (...)
-> Hash (cost=16928.88..16928.88 rows=475188 width=170)
-> Seq Scan on tags (cost=0.00..16928.88 rows=475188 width=170)
-> CTE Scan on inner_query (cost=0.00..2923.60 rows=146180 width=239)
The above query with its default query plan runs in about 500ms (over 10 times slower). If I turn off hash joins using SET enable_hashjoin = OFF; then the query plan goes back to using merge joins and runs in ~50ms again with 80 items in the array.
Again the only change here is the number of items in the ARRAY that are being looked up.
Does anyone have any thoughts on why the planner is making the poor choice that results in the massive slow down?
The database fits into memory completely and is on SSDs.
I also want to point out that I'm using a CTE because I ran into an issue where the planner would not use the index on the tag_device table when I added in the limit to the query. Basically the issue described here: http://thebuild.com/blog/2014/11/18/when-limit-attacks/.
Thanks!
I see that there is a sort as part of the merge join. Once you get past a certain threshold, the sort operation needed to do the merge join is deemed to be too expensive, and a hash join is estimated to be cheaper. It may be more expensive (time-wise) but cheaper in terms of CPU consumption to run the query this way.
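A hedged way to compare the two plans without changing the session permanently is to wrap the test in a transaction, since SET LOCAL only lasts until commit or rollback (the IP list is shortened here for brevity):
BEGIN;
SET LOCAL enable_hashjoin = off;  -- diagnostic only, not a production fix
EXPLAIN (ANALYZE, BUFFERS)
WITH inner_query AS (
    SELECT * FROM tag_device
    INNER JOIN tags    ON tag_device.tag_id    = tags.id
    INNER JOIN devices ON tag_device.device_id = devices.id
    WHERE devices.device_ip <<= ANY(ARRAY['10.0.0.1', '10.0.0.2']::iprange[])
)
SELECT * FROM inner_query LIMIT 100 OFFSET 0;
ROLLBACK;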

How does PostgreSQL execute a query?

Can anyone explain why PostgreSQL works like this:
If I execute this query
SELECT
*
FROM project_archive_doc as PAD, project_archive_doc as PAD2
WHERE
PAD.id = PAD2.id
it will be a simple JOIN and the EXPLAIN output will look like this:
Hash Join (cost=6.85..13.91 rows=171 width=150)
Hash Cond: (pad.id = pad2.id)
-> Seq Scan on project_archive_doc pad (cost=0.00..4.71 rows=171 width=75)
-> Hash (cost=4.71..4.71 rows=171 width=75)
-> Seq Scan on project_archive_doc pad2 (cost=0.00..4.71 rows=171 width=75)
But if I execute this query:
SELECT *
FROM project_archive_doc as PAD
WHERE
PAD.id = (
SELECT PAD2.id
FROM project_archive_doc as PAD2
WHERE
PAD2.project_id = PAD.project_id
ORDER BY PAD2.created_at
LIMIT 1)
there will be no joins and the EXPLAIN output looks like this:
Seq Scan on project_archive_doc pad (cost=0.00..886.22 rows=1 width=75)
Filter: (id = (SubPlan 1))
SubPlan 1
-> Limit (cost=5.15..5.15 rows=1 width=8)
-> Sort (cost=5.15..5.15 rows=1 width=8)
Sort Key: pad2.created_at
-> Seq Scan on project_archive_doc pad2 (cost=0.00..5.14 rows=1 width=8)
Filter: (project_id = pad.project_id)
Why is it so, and is there any documentation or articles about this?
Without table definitions and data it's hard to be specific for this case. In general, PostgreSQL is like most SQL databases in that it doesn't treat SQL as a step-by-step program for how to execute a query. It's more like a description of what you want the results to be and a hint about how you want the database to produce those results.
PostgreSQL is free to actually execute the query however it can most efficiently do so, so long as it produces the results you want.
Often it has several choices about how to produce a particular result. It will choose between them based on cost estimates.
It can also "understand" that several different ways of writing a particular query are equivalent, and transform one into another where it's more efficient. For example, it can transform an IN (SELECT ...) into a join, because it can prove they're equivalent.
However, sometimes apparently small changes to a query fundamentally change its meaning, and limit what optimisations/transformations PostgreSQL can make. Adding a LIMIT or OFFSET inside a subquery prevents PostgreSQL from flattening it, i.e. combining it with the outer query by transforming it into a join. It also prevents PostgreSQL from moving WHERE clause entries between the subquery and outer query, because that'd change the meaning of the query. Without a LIMIT or OFFSET clause, it can do both these things because they don't change the query's meaning.
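A minimal sketch of that point, reusing the table from this question (the LIMIT in the second subquery exists purely to block flattening):
-- The planner can flatten this uncorrelated IN into a join (e.g. a hash semi join):
EXPLAIN SELECT *
FROM project_archive_doc pad
WHERE pad.id IN (SELECT pad2.id FROM project_archive_doc pad2);
-- The LIMIT forces the subquery to stay a separate SubPlan instead:
EXPLAIN SELECT *
FROM project_archive_doc pad
WHERE pad.id IN (SELECT pad2.id FROM project_archive_doc pad2 LIMIT 1);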
There's some info on the planner in the PostgreSQL documentation.