Let's say I have the following query:
SELECT sum(a), sum(b), sum(a) - sum(b)
FROM salelines
Hopefully it should only need to compute sum(a) and sum(b) once, since the third column could reuse those aggregations. Here is the EXPLAIN:
XN HashAggregate (cost=35.21..41.90 rows=535 width=22)
-> XN Seq Scan on salelines (cost=0.00..15.65 rows=1565 width=22)
However, the interesting thing is that if I change the third column to a form that should not be possible to optimise away:
SELECT sum(a), sum(b), sum(a - b)
FROM salelines
Firstly, I get a different estimated cost, which suggests the query planner is actually treating the two statements differently somewhere:
XN HashAggregate (cost=31.30..36.65 rows=535 width=22)
-> XN Seq Scan on salelines (cost=0.00..15.65 rows=1565 width=22)
But what's most interesting is that this query plan suggests it should actually be faster to do it this way. I understand that the cost does not directly relate to the real-world performance of the query.
My question is:
Is Redshift able to optimise away the repeated expressions, or would it actually be faster to let Redshift compute a third aggregate, given its extremely fast columnar aggregation, instead?
Per your testing, it looks like it doesn't optimize this. At the same time, you can try to optimize it yourself:
WITH
totals as (
SELECT sum(a) as sum_a, sum(b) as sum_b
FROM salelines
)
SELECT sum_a, sum_b, sum_a-sum_b as dif_ab
FROM totals
That would definitely let Redshift skip the extra aggregation step that you'd like to avoid.
Here are the results on a larger table:
SELECT sum(a), sum(b), sum(a) - sum(b)
FROM salelines
XN Aggregate (cost=14455901.45..14455901.45 rows=1 width=20)
-> XN Seq Scan on salelines (cost=0.00..7227950.72 rows=722795072 width=20)
(25.905 + 22.870 + 29.091 + 22.970 + 21.893) / 5 = 24.546 seconds
SELECT sum(a), sum(b), sum(a - b)
FROM salelines
XN Aggregate (cost=12648913.77..12648913.77 rows=1 width=20)
-> XN Seq Scan on salelines (cost=0.00..7227950.72 rows=722795072 width=20)
(22.829 + 22.162 + 23.063 + 19.526 + 22.688) / 5 = 22.054 seconds
The query planner does not give enough output to explain exactly what it's doing, but from these results it would be reasonable to say that sum(a), sum(b), sum(a) - sum(b) probably requires 4 aggregates, whereas sum(a), sum(b), sum(a - b) needs only 3. It would therefore probably be safe to assume that Redshift does not optimise away repeated aggregate expressions like this.
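One caveat before rewriting sum(a) - sum(b) as the cheaper sum(a - b): sum() skips NULLs, so the two forms only agree when no row has exactly one of the two columns NULL. A quick sanity check on the data (assuming a and b are nullable) could be:

-- rows where exactly one of a, b is NULL make sum(a - b)
-- differ from sum(a) - sum(b), because a - b is NULL there
SELECT count(*) AS mixed_null_rows
FROM salelines
WHERE (a IS NULL AND b IS NOT NULL)
   OR (b IS NULL AND a IS NOT NULL);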
Related
My task is to create an index (or indexes) that will optimize the given SQL query for the dvdrental database, without touching the query itself:
EXPLAIN ANALYZE
SELECT title, release_year FROM film f1
WHERE f1.rental_rate > (
SELECT AVG(f2.rental_rate) FROM film f2
WHERE f1.release_year = f2.release_year
);
Output:
Seq Scan on film f1 (cost=0.00..69079.00 rows=333 width=19) (actual time=5.272..164.779 rows=659 loops=1)
  Filter: (rental_rate > (SubPlan 1))
  Rows Removed by Filter: 341
  SubPlan 1
    ->  Aggregate (cost=69.00..69.01 rows=1 width=32) (actual time=0.164..0.164 rows=1 loops=1000)
          ->  Seq Scan on film f2 (cost=0.00..66.50 rows=1000 width=6) (actual time=0.000..0.083 rows=1000 loops=1000)
                Filter: ((f1.release_year)::integer = (release_year)::integer)
Planning Time: 2.987 ms
Execution Time: 164.865 ms
From that I can see that the only thing we can optimize is the sequential scan on film f1, because SubPlan 1 contains an aggregate. I tried plenty of indexes on the column rental_rate, but none of them produced any improvement. If I specify set enable_seqscan = off;, performance only gets worse. Maybe I am missing something here, but how can it possibly be optimized using indexes then?
P.S. Regarding the structure of the film table: it contains 1000 rows, and the column rental_rate contains float values with only 3 distinct values: 0.99, 2.99 and 4.99.
You have 1000 nested queries computing the average, one for each record in the table (you can see this as loops=1000 in the plan).
Extracting these queries into a CTE and then joining against it will speed up your query:
WITH avg_rates as (SELECT AVG(f.rental_rate) avg, f.release_year
FROM film f GROUP BY f.release_year)
SELECT title, f.release_year FROM film f
JOIN avg_rates a on a.release_year = f.release_year
WHERE f.rental_rate > a.avg
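If the query itself truly cannot be changed, another thing that might be worth trying (untested here) is an index matching the correlated subquery, so that SubPlan 1 can read release_year and rental_rate from the index instead of seq-scanning film on every loop; whether the planner actually picks it on a 1000-row table that fits in a few pages is another question:

-- hypothetical index name; covers the subquery's filter column and aggregated column
CREATE INDEX idx_film_release_year_rental_rate
    ON film (release_year, rental_rate);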
I am working on this simple example:
=> create table t1 ( a int, b int, c int );
CREATE TABLE
=> insert into t1 select a, a, a from generate_series(1,100) a;
INSERT 0 100
=> create index i1 on t1(b);
CREATE INDEX
=> vacuum t1;
VACUUM
=> explain analyze select b from t1 where b = 10;
QUERY PLAN
--------------------------------------------------------------------------------------------
Seq Scan on t1 (cost=0.00..2.25 rows=1 width=4) (actual time=0.016..0.035 rows=1 loops=1)
Filter: (b = 10)
Rows Removed by Filter: 99
Planning Time: 0.082 ms
Execution Time: 0.051 ms
(5 rows)
You can see that I select b and filter on b only. I also ran vacuum t1; manually to make sure the visibility map is up to date.
But why does PostgreSQL still do a Seq Scan instead of an index-only scan?
Edited
After adding more rows, it does use an index-only scan:
=> insert into t1 select a, a, a from generate_series(1,2000) a;
=> vacuum t1;
=> explain analyze select b from t1 where b = 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Index Only Scan using i1 on t1 (cost=0.28..4.45 rows=10 width=4) (actual time=0.038..0.039 rows=1 loops=1)
Index Cond: (b = 10)
Heap Fetches: 0
Planning Time: 0.186 ms
Execution Time: 0.058 ms
(5 rows)
It seems like PostgreSQL doesn't like an index-only scan when the number of rows is small.
Since nobody wants to provide a detailed explanation, I will write a simple answer here.
From #a_horse_with_no_name:
100 rows will fit on a single data block, so doing a seq scan will only require a single I/O operation and the index only scan would require the same. Use explain (analyze, buffers) to see more details on the blocks (=buffers) needed by the query
From https://www.postgresql.org/docs/current/indexes-examine.html:
It is especially fatal to use very small test data sets. While selecting 1000 out of 100000 rows could be a candidate for an index, selecting 1 out of 100 rows will hardly be, because the 100 rows probably fit within a single disk page, and there is no plan that can beat sequentially fetching 1 disk page.
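Following that suggestion, both points are easy to confirm on the example table; the relation size in pages and the buffer counts of the plan show that everything fits in a single block (output omitted here, since it varies by build and data):

=> select relpages, reltuples from pg_class where relname = 't1';
=> explain (analyze, buffers) select b from t1 where b = 10;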
I have 2 tables, "transaksi" and "buku". "transaksi" has around 250k rows, and "buku" has around 170k rows. Both tables have a column called "k999a", and neither table has any indexes. Now I check these 2 statements.
Statement 1:
explain select k999a from transaksi where k999a not in (select k999a from buku);
Statement 1 outputs:
Seq Scan on transaksi (cost=0.00..721109017.46 rows=125426 width=9)
  Filter: (NOT (SubPlan 1))
  SubPlan 1
    ->  Materialize (cost=0.00..5321.60 rows=171040 width=8)
          ->  Seq Scan on buku (cost=0.00..3797.40 rows=171040 width=8)
Statement 2:
explain select k999a from transaksi where k999a in (select k999a from buku);
Statement 2 outputs:
Hash Semi Join (cost=6604.40..22664.82 rows=250853 width=9)
  Hash Cond: (transaksi.k999a = buku.k999a)
  ->  Seq Scan on transaksi (cost=0.00..6356.53 rows=250853 width=9)
  ->  Hash (cost=3797.40..3797.40 rows=171040 width=8)
        ->  Seq Scan on buku (cost=0.00..3797.40 rows=171040 width=8)
Why does PostgreSQL do a loop join for the NOT IN query, making the query take such a long time?
PS: PostgreSQL version 9.6.1 on Windows 10
This is to be expected. You may get better performance using WHERE NOT EXISTS instead:
SELECT k999a
FROM transaksi
WHERE NOT EXISTS (
    SELECT 1 FROM buku WHERE buku.k999a = transaksi.k999a
);
Here is a good explanation as to why for each of the methods: https://explainextended.com/2009/09/16/not-in-vs-not-exists-vs-left-join-is-null-postgresql/
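That article also compares a third form, LEFT JOIN ... IS NULL, which, like NOT EXISTS, lets the planner use a hash-based plan instead of a per-row subplan, and which, unlike NOT IN, is not tripped up by NULLs in buku.k999a:

SELECT t.k999a
FROM transaksi t
LEFT JOIN buku b ON b.k999a = t.k999a
WHERE b.k999a IS NULL;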
I have a table in Redshift with a few billion rows which looks like this
CREATE TABLE channels (
    fact_key TEXT NOT NULL distkey,
    job_key BIGINT,
    channel_key TEXT NOT NULL
)
diststyle key
compound sortkey(job_key, channel_key);
When I query by job_key + channel_key my seq scan is properly restricted by the full sortkey if I use specific values for channel_key in my query.
EXPLAIN
SELECT * FROM channels scd
WHERE scd.job_key = 1 AND scd.channel_key IN ('1234', '1235', '1236', '1237')
XN Seq Scan on channels scd (cost=0.00..3178474.92 rows=3428929 width=77)
Filter: ((((channel_key)::text = '1234'::text) OR ((channel_key)::text = '1235'::text) OR ((channel_key)::text = '1236'::text) OR ((channel_key)::text = '1237'::text)) AND (job_key = 1))
However, if I query against channel_key using IN plus a subquery, Redshift does not use the sortkey.
EXPLAIN
SELECT * FROM channels scd
WHERE scd.job_key = 1 AND scd.channel_key IN (select distinct channel_key from other_channel_list where job_key = 14 order by 1)
XN Hash IN Join DS_DIST_ALL_NONE (cost=3.75..3540640.36 rows=899781 width=77)
  Hash Cond: (("outer".channel_key)::text = ("inner".channel_key)::text)
  ->  XN Seq Scan on channels scd (cost=0.00..1765819.40 rows=141265552 width=77)
        Filter: (job_key = 1)
  ->  XN Hash (cost=3.75..3.75 rows=1 width=402)
        ->  XN Subquery Scan "IN_subquery" (cost=0.00..3.75 rows=1 width=402)
              ->  XN Unique (cost=0.00..3.74 rows=1 width=29)
                    ->  XN Seq Scan on other_channel_list (cost=0.00..3.74 rows=1 width=29)
                          Filter: (job_key = 14)
Is it possible to get this to work? My ultimate goal is to turn this into a view so pre-defining my list of channel_keys won't work.
Edit to provide more context:
This is part of a larger query and the results of this get hash joined to some other data. If I hard-code the channel_keys then the input to the hash join is ~2 million rows. If I use the IN condition with the subquery (nothing else changes) then the input to the hash join is 400 million rows. The total query time goes from ~40 seconds to 15+ minutes.
Does this give you a better plan than the subquery version?
with other_channels as (
select distinct channel_key from other_channel_list where job_key = 14 order by 1
)
SELECT *
FROM channels scd
JOIN other_channels ocd on scd.channel_key = ocd.channel_key
WHERE scd.job_key = 1
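If the join form does give a better plan, it can also serve the view use case, since a CTE is allowed inside a Redshift view definition; a sketch with a hypothetical view name:

CREATE VIEW job_channels AS
WITH other_channels AS (
    SELECT DISTINCT channel_key
    FROM other_channel_list
    WHERE job_key = 14
)
SELECT scd.*
FROM channels scd
JOIN other_channels ocd ON scd.channel_key = ocd.channel_key
WHERE scd.job_key = 1;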
If I want to select 0.5% of the rows, or even 5% of the rows, from the following table via its PK, the query planner correctly chooses to use the PK index. Here is the table:
create table weather as
with numbers as(
select generate_series as id from generate_series(0,1048575))
select id,
50 + 50*sin(id) as temperature_in_f,
50 + 50*sin(id) as humidity_in_percent
from numbers;
alter table weather
add constraint pk_weather primary key(id);
vacuum analyze weather;
The stats are up-to-date, and the following query does use the PK index:
explain analyze select sum(w.id), sum(humidity_in_percent), count(*)
from weather as w
where w.id between 1 and 66720;
Suppose, however, that we need to join this table with another, much smaller, one:
create table lightnings
as
select id as weather_id
from weather
where humidity_in_percent between 99.99 and 100;
alter table lightnings
add constraint pk_lightnings
primary key(weather_id);
analyze lightnings;
Here is my join, in four logically equivalent forms:
explain analyze select sum(w.id), count(*) from weather as w
where w.humidity_in_percent between 99.99 and 100
and exists(select * from lightnings as l
where l.weather_id=w.id);
explain analyze select sum(w.id), count(*)
from weather as w
join lightnings as l
on l.weather_id=w.id
where w.humidity_in_percent between 99.99 and 100;
explain analyze select sum(w.id), count(*)
from lightnings as l
join weather as w
on l.weather_id=w.id
where w.humidity_in_percent between 99.99 and 100;
-- replaced explicit join with where clause
explain analyze select sum(w.id), count(*)
from lightnings as l, weather as w
where w.humidity_in_percent between 99.99 and 100
and l.weather_id=w.id;
Unfortunately the query planner resorts to scanning the whole weather table:
"Aggregate (cost=22645.68..22645.69 rows=1 width=4) (actual time=167.427..167.427 rows=1 loops=1)"
" -> Hash Join (cost=180.12..22645.52 rows=32 width=4) (actual time=2.500..166.444 rows=6672 loops=1)"
" Hash Cond: (w.id = l.weather_id)"
" -> Seq Scan on weather w (cost=0.00..22407.64 rows=5106 width=4) (actual time=0.013..158.593 rows=6672 loops=1)"
" Filter: ((humidity_in_percent >= 99.99::double precision) AND (humidity_in_percent <= 100::double precision))"
" Rows Removed by Filter: 1041904"
" -> Hash (cost=96.72..96.72 rows=6672 width=4) (actual time=2.479..2.479 rows=6672 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 235kB"
" -> Seq Scan on lightnings l (cost=0.00..96.72 rows=6672 width=4) (actual time=0.009..0.908 rows=6672 loops=1)"
"Planning time: 0.326 ms"
"Execution time: 167.581 ms"
The query planner's estimate of how many rows in the weather table will be selected is rows=5106. This is reasonably close to the exact value of 6672. If I select this small number of rows from the weather table via id, the PK index is used. If I select the same number of rows via a join with another table, the query planner goes for scanning the whole table.
What am I missing?
select version()
"PostgreSQL 9.4.0"
Edit: if I remove the condition on humidity, the query planner correctly recognizes that the condition on weather.id is quite selective, and chooses to use the index on PK:
explain analyze select sum(w.id), count(*) from weather as w
where exists(select * from lightnings as l
where l.weather_id=w.id);
"Aggregate (cost=14677.84..14677.85 rows=1 width=4) (actual time=37.200..37.200 rows=1 loops=1)"
" -> Nested Loop (cost=0.42..14644.48 rows=6672 width=4) (actual time=0.022..36.189 rows=6672 loops=1)"
" -> Seq Scan on lightnings l (cost=0.00..96.72 rows=6672 width=4) (actual time=0.011..0.868 rows=6672 loops=1)"
" -> Index Only Scan using pk_weather on weather w (cost=0.42..2.17 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=6672)"
" Index Cond: (id = l.weather_id)"
" Heap Fetches: 0"
"Planning time: 0.321 ms"
"Execution time: 37.254 ms"
Yet adding a condition totally confuses the query planner.
Expecting the optimiser to use an index on the PK of the larger table implies that you expect the query to be driven from the smaller table. Of course, you know that the rows that the smaller table will join to in the larger one are the same as those selected by the predicate on it, but the optimiser does not.
Look at the line on the plan:
Hash Join (cost=180.12..22645.52 rows=32 width=4) (actual time=2.500..166.444 rows=6672 loops=1)
It expects 32 rows to result from the join, but 6672 actually result.
Anyway, it pretty much has the option of:
A full scan on the smaller table, and an index lookup on the larger, with the predicate being used to filter out rows subsequent to the join (and expecting most of the rows to then be filtered out).
A full scan on both tables, with rows being removed by the predicate on the larger table, and a hash join of the result.
A scan of the larger table with rows being removed by the predicate, and an index lookup on the smaller table that may fail to find a value.
The second of these has been judged to be the lowest cost, and I think it is correct to do so based on the evidence it has, as hash joins are very efficient for joining many rows.
Of course it would probably be more efficient to place an index on weather(humidity_in_percent,id) in this particular case, but I suspect that this is a modified version of your real situation (the sum of the id column?) so specific advice may not be applicable.
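For what it's worth, that index would look something like this (the name is arbitrary); it can satisfy the humidity filter and supply id straight from the index, so the planner no longer has to scan the whole weather table:

create index idx_weather_humidity_id
    on weather (humidity_in_percent, id);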
I believe the difference you're seeing between the first query, which uses the index, and the other 3, which don't, is in the where clause.
In the first query, your where clause is on w.id, which is indexed.
In the other 3, the effective where clause is on w.humidity_in_percent. I tested the following ...
create index wh_idx on weather(humidity_in_percent);
explain analyse select sum(w.id), count(*) from weather as w
where w.humidity_in_percent between 99.99 and 100
and exists(select * from lightnings as l
where l.weather_id=w.id);
and get a much better plan. I tried to post the actual plan returned, but I'm having trouble formatting it for proper display, sorry.