PostgreSQL: avoid nested loop with join

Please help me improve query performance if possible.
I have the following query:
select
s."CustomerCode",
s."MaterialCode",
fw."Name",
fw."ReverseName",
s."Uc"
from
"Sales" s
left join
"FiscalWeeks" fw on s."SalesDate" between fw."StartedAt" and fw."EndedAt"
And the execution plan is:
"Nested Loop Left Join (cost=0.00..1439970.46 rows=8954562 width=40) (actual time=0.129..114889.581 rows=1492427 loops=1)"
" Join Filter: ((s."SalesDate" >= fw."StartedAt") AND (s."SalesDate" <= fw."EndedAt"))"
" Rows Removed by Join Filter: 79098631"
" Buffers: shared hit=3818 read=10884"
" -> Seq Scan on "Sales" s (cost=0.00..29625.27 rows=1492427 width=26) (actual time=0.098..1216.287 rows=1492427 loops=1)"
" Buffers: shared hit=3817 read=10884"
" -> Materialize (cost=0.00..1.81 rows=54 width=26) (actual time=0.001..0.034 rows=54 loops=1492427)"
" Buffers: shared hit=1"
" -> Seq Scan on "FiscalWeeks" fw (cost=0.00..1.54 rows=54 width=26) (actual time=0.006..0.044 rows=54 loops=1)"
" Buffers: shared hit=1"
"Planning time: 0.291 ms"
"Execution time: 115840.838 ms"
I have the following indexes:
CREATE INDEX "Sales_SalesDate_idx" ON public."Sales" USING btree ("SalesDate");
ALTER TABLE public."FiscalWeeks" ADD CONSTRAINT "FiscalWeekUnique" EXCLUDE USING gist (daterange("StartedAt", "EndedAt", '[]'::text) WITH &&);
The PostgreSQL version is:
"PostgreSQL 9.5.0, compiled by Visual C++ build 1800, 32-bit"
VACUUM ANALYZE was performed.
I think that PostgreSQL does not understand that for each row in the Sales table there is exactly one matching row in FiscalWeeks, and therefore uses a nested loop. How can I tell it that?
Thank you.

The query has to use a nested loop join because of the join condition: hash and merge joins are only possible with equality conditions, and the operators <= and >= do not qualify.
Perhaps you can improve the query by adding an index to "FiscalWeeks" so that a sequential scan can be avoided, and the join condition can be pushed down into the inner loop:
CREATE INDEX ON "FiscalWeeks" ("StartedAt", "EndedAt");
Unrelated to that, but you would make your life easier if you avoided upper-case letters in table and column names.

Postgres is using UNIQUE index instead of FTS index

I have a table with more than 10,000,000 rows. The table has a column, OfficialEnterprise_vatNumber, that should be unique and can be part of a full-text search.
Here are the indexes:
"uq_officialenterprise_vatnumber" "CREATE UNIQUE INDEX uq_officialenterprise_vatnumber ON commonservices.""OfficialEnterprise"" USING btree (""OfficialEnterprise_vatNumber"")"
"ix_officialenterprise_vatnumber" "CREATE INDEX ix_officialenterprise_vatnumber ON commonservices.""OfficialEnterprise"" USING gin (to_tsvector('commonservices.unaccent_dictionary'::regconfig, (""OfficialEnterprise_vatNumber"")::text))"
But if I EXPLAIN a query that should be using the FTS index, like this:
SELECT * FROM commonservices."OfficialEnterprise"
WHERE
to_tsvector('commonservices.unaccent_dictionary', "OfficialEnterprise_vatNumber") @@ to_tsquery('FR:* | IE:*')
ORDER BY "OfficialEnterprise_vatNumber" ASC
LIMIT 100
It shows that the index used is uq_officialenterprise_vatnumber, not ix_officialenterprise_vatnumber.
Is there something I'm missing?
EDIT:
Here is the EXPLAIN ANALYZE statement of the original query.
"Limit (cost=0.43..1460.27 rows=100 width=238) (actual time=6996.976..6997.057 rows=15 loops=1)"
" -> Index Scan using uq_officialenterprise_vatnumber on ""OfficialEnterprise"" (cost=0.43..1067861.32 rows=73149 width=238) (actual time=6996.975..6997.054 rows=15 loops=1)"
" Filter: (to_tsvector('commonservices.unaccent_dictionary'::regconfig, (""OfficialEnterprise_vatNumber"")::text) ## to_tsquery('FR:* | IE:*'::text))"
" Rows Removed by Filter: 1847197"
"Planning Time: 0.185 ms"
"Execution Time: 6997.081 ms"
Here is the EXPLAIN ANALYZE of the query if I add || '0' to the order by.
"Limit (cost=55558.82..55570.49 rows=100 width=270) (actual time=7.069..9.827 rows=15 loops=1)"
" -> Gather Merge (cost=55558.82..62671.09 rows=60958 width=270) (actual time=7.068..9.823 rows=15 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" -> Sort (cost=54558.80..54635.00 rows=30479 width=270) (actual time=0.235..0.238 rows=5 loops=3)"
" Sort Key: (((""OfficialEnterprise_vatNumber"")::text || '0'::text))"
" Sort Method: quicksort Memory: 28kB"
" Worker 0: Sort Method: quicksort Memory: 25kB"
" Worker 1: Sort Method: quicksort Memory: 25kB"
" -> Parallel Bitmap Heap Scan on ""OfficialEnterprise"" (cost=719.16..53393.91 rows=30479 width=270) (actual time=0.157..0.166 rows=5 loops=3)"
" Recheck Cond: (to_tsvector('commonservices.unaccent_dictionary'::regconfig, (""OfficialEnterprise_vatNumber"")::text) ## to_tsquery('FR:* | IE:*'::text))"
" Heap Blocks: exact=6"
" -> Bitmap Index Scan on ix_officialenterprise_vatnumber (cost=0.00..700.87 rows=73149 width=0) (actual time=0.356..0.358 rows=15 loops=1)"
" Index Cond: (to_tsvector('commonservices.unaccent_dictionary'::regconfig, (""OfficialEnterprise_vatNumber"")::text) ## to_tsquery('FR:* | IE:*'::text))"
"Planning Time: 0.108 ms"
"Execution Time: 9.886 ms"
rows=73149...(actual...rows=15)
So it vastly misestimates the number of rows it will find. If it actually found 73149 rows, using the ordering index probably would be faster.
Have you ANALYZEd the table since creating the functional index on it? Doing that might fix the estimation problem, or at least improve it enough to fix the planning decision.
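If not, that is a one-line fix to try first:
ANALYZE commonservices."OfficialEnterprise";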
Yes, doing dummy operations like +0 or ||'' is hackish. They work by forcing the planner to think it can't use the index to fulfill the ORDER BY.
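Applied here, the hack from your EDIT amounts to nothing more than changing the ORDER BY so it no longer matches the indexed column:
ORDER BY "OfficialEnterprise_vatNumber" || '0' ASC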
Doing full text search on a VAT number seems rather misguided in the first place. Maybe this would be best addressed with a LIKE query, or by creating an explicit column to hold the country of origin flag.
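For example, a prefix search could look like this (a sketch; the index name is made up, and with a non-C collation a text_pattern_ops index is needed for LIKE to use an index at all):
CREATE INDEX ix_officialenterprise_vatnumber_prefix
ON commonservices."OfficialEnterprise" ("OfficialEnterprise_vatNumber" text_pattern_ops);

SELECT * FROM commonservices."OfficialEnterprise"
WHERE "OfficialEnterprise_vatNumber" LIKE 'FR%'
OR "OfficialEnterprise_vatNumber" LIKE 'IE%'
ORDER BY "OfficialEnterprise_vatNumber" ASC
LIMIT 100;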

PostgreSQL - ORDER BY with LIMIT not using indexes as expected

We have two tables - event_deltas and deltas_to_retrieve - which both have BTREE indexes on the same two columns:
CREATE TABLE event_deltas
(
event_id UUID REFERENCES events(id) NOT NULL,
version INT NOT NULL,
json_patch JSONB NOT NULL,
PRIMARY KEY (event_id, version)
);
CREATE TABLE deltas_to_retrieve(event_id UUID NOT NULL, version INT NOT NULL);
CREATE UNIQUE INDEX event_id_version ON deltas_to_retrieve (event_id, version);
In terms of table size, deltas_to_retrieve is a tiny lookup table of ~500 rows. The event_deltas table contains ~7,000,000 rows. Due to the size of the latter table, we want to limit how much we retrieve at once. Therefore, the tables are queried as follows:
SELECT ed.event_id, ed.version
FROM deltas_to_retrieve zz, event_deltas ed
WHERE zz.event_id = ed.event_id
AND ed.version > zz.version
ORDER BY ed.event_id, ed.version
LIMIT 5000;
Without the LIMIT, for the example I'm looking at the query returns ~30,000 rows.
What's odd about this query is the impact of the ORDER BY. Due to the existing indexes, the data comes back in the order we want with or without it. I would rather keep the explicit ORDER BY so we're future-proofed against changes, as well as for readability etc. However, as things stand, it has a significant negative impact on performance.
According to the docs:
An important special case is ORDER BY in combination with LIMIT n: an explicit sort will have to process all the data to identify the first n rows, but if there is an index matching the ORDER BY, the first n rows can be retrieved directly, without scanning the remainder at all.
This makes me think that, given the indexes we already have in place, the ORDER BY should not slow down the query at all. However, in practice I'm seeing execution times of ~10s with the ORDER BY and <1s without. I've included the plans outputted by EXPLAIN below:
Without ORDER BY
Just EXPLAIN:
QUERY PLAN
Limit (cost=0.56..20033.38 rows=5000 width=20)
-> Nested Loop (cost=0.56..331980.39 rows=82859 width=20)
-> Seq Scan on deltas_to_retrieve zz (cost=0.00..9.37 rows=537 width=20)
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..616.66 rows=154 width=20)
Index Cond: ((event_id = zz.event_id) AND (version > zz.version))
More detailed EXPLAIN (ANALYZE, BUFFERS):
QUERY PLAN
Limit (cost=0.56..20039.35 rows=5000 width=20) (actual time=3.675..2083.063 rows=5000 loops=1)
" Buffers: shared hit=1450 read=4783, local hit=2"
-> Nested Loop (cost=0.56..1055082.88 rows=263260 width=20) (actual time=3.673..2080.745 rows=5000 loops=1)
" Buffers: shared hit=1450 read=4783, local hit=2"
-> Seq Scan on deltas_to_retrieve zz (cost=0.00..27.00 rows=1700 width=20) (actual time=0.022..0.307 rows=241 loops=1)
Buffers: local hit=2
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..619.07 rows=155 width=20) (actual time=1.317..8.612 rows=21 loops=241)
Index Cond: ((event_id = zz.event_id) AND (version > zz.version))
Heap Fetches: 5000
Buffers: shared hit=1450 read=4783
Planning Time: 1.150 ms
Execution Time: 2084.647 ms
With ORDER BY
Just EXPLAIN:
QUERY PLAN
Limit (cost=0.84..929199.06 rows=5000 width=20)
-> Merge Join (cost=0.84..48924145.53 rows=263260 width=20)
Merge Cond: (ed.event_id = zz.event_id)
Join Filter: (ed.version > zz.version)
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..48873353.76 rows=12318733 width=20)
-> Materialize (cost=0.28..6178.03 rows=1700 width=20)
-> Index Only Scan using event_id_version on deltas_to_retrieve zz (cost=0.28..6173.78 rows=1700 width=20)
More detailed EXPLAIN (ANALYZE, BUFFERS):
QUERY PLAN
Limit (cost=0.84..929199.06 rows=5000 width=20) (actual time=4457.770..506706.443 rows=5000 loops=1)
" Buffers: shared hit=78806 read=1071004 dirtied=148, local hit=63"
-> Merge Join (cost=0.84..48924145.53 rows=263260 width=20) (actual time=4457.768..506704.815 rows=5000 loops=1)
Merge Cond: (ed.event_id = zz.event_id)
Join Filter: (ed.version > zz.version)
" Buffers: shared hit=78806 read=1071004 dirtied=148, local hit=63"
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..48873353.76 rows=12318733 width=20) (actual time=4.566..505443.407 rows=1813438 loops=1)
Heap Fetches: 1814767
Buffers: shared hit=78806 read=1071004 dirtied=148
-> Materialize (cost=0.28..6178.03 rows=1700 width=20) (actual time=0.063..2.524 rows=5000 loops=1)
Buffers: local hit=63
-> Index Only Scan using event_id_version on deltas_to_retrieve zz (cost=0.28..6173.78 rows=1700 width=20) (actual time=0.056..0.663 rows=78 loops=1)
Heap Fetches: 78
Buffers: local hit=63
Planning Time: 1.088 ms
Execution Time: 506709.819 ms
I'm not very experienced at reading these plans, but it's obviously thinking that it needs to retrieve everything, sort it, and then return the top N, rather than just grabbing the first N using the index. It's doing a Seq Scan on the smaller deltas_to_retrieve table rather than an Index Only Scan - is that the problem? That table is very small (~500 rows), so I wonder if it's just not bothering to use the index because of that?
Postgres version: 11.12
Upgrading to Postgres 13 fixed this for us, with the introduction of incremental sort. From some docs on the feature:
Incremental sorting: Sorting is a performance-intensive task, so every improvement in this area can make a difference. Now PostgreSQL 13 introduces incremental sorting, which leverages early-stage sorts of a query and sorts only the incremental unsorted fields, increasing the chances the sorted block will fit in memory and by that, improving performance.
The new query plan from EXPLAIN is as follows, with the query now completing in <500ms consistently:
QUERY PLAN
Limit (cost=71.06..820.32 rows=5000 width=20)
-> Incremental Sort (cost=71.06..15461.82 rows=102706 width=20)
" Sort Key: ed.event_id, ed.version"
Presorted Key: ed.event_id
-> Nested Loop (cost=0.84..6659.05 rows=102706 width=20)
-> Index Only Scan using event_id_version on deltas_to_retrieve zz (cost=0.28..1116.39 rows=541 width=20)
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..8.35 rows=190 width=20)
Index Cond: ((event_id = zz.event_id) AND (version > zz.version))
Note:
Start by running VACUUM ANALYZE on both tables.
Since deltas_to_retrieve only needs to contain the lowest versions, it could be unique on event_id (see the sketch after the query below).
You can simplify the query to:
SELECT event_id, version
FROM event_deltas ed
WHERE EXISTS (
SELECT * FROM deltas_to_retrieve zz
WHERE zz.event_id = ed.event_id
AND zz.version < ed.version
)
ORDER BY event_id, version
LIMIT 5000;
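If you do make deltas_to_retrieve unique on event_id as suggested above, this is one way to do it (a sketch; the DELETE keeps only the lowest version per event_id):
DELETE FROM deltas_to_retrieve zz
USING deltas_to_retrieve older
WHERE older.event_id = zz.event_id
AND older.version < zz.version;

CREATE UNIQUE INDEX ON deltas_to_retrieve (event_id);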

PostgreSQL Slow DISTINCT WHERE

Imagine the following table:
CREATE TABLE drops(
id BIGSERIAL PRIMARY KEY,
loc VARCHAR(5) NOT NULL,
tag INT NOT NULL
);
What I want to do is perform a query that finds all unique locations where the tag matches a value.
SELECT DISTINCT loc
FROM drops
WHERE tag = '1'
GROUP BY loc;
I am not sure whether it is due to the size (it's 9 million rows!) or my inefficiency, but the query takes way too long for users to use it effectively. At the time of writing, the above query took 1:14 minutes.
Are there any tricks or methods I can use to shorten this to a mere few seconds?
Much appreciated!
The execution plan:
"Unique (cost=1967352.72..1967407.22 rows=41 width=4) (actual time=40890.768..40894.984 rows=30 loops=1)"
" -> Group (cost=1967352.72..1967407.12 rows=41 width=4) (actual time=40890.767..40894.972 rows=30 loops=1)"
" Group Key: loc"
" -> Gather Merge (cost=1967352.72..1967406.92 rows=82 width=4) (actual time=40890.765..40895.031 rows=88 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" -> Group (cost=1966352.70..1966397.43 rows=41 width=4) (actual time=40879.910..40883.362 rows=29 loops=3)"
" Group Key: loc"
" -> Sort (cost=1966352.70..1966375.06 rows=8946 width=4) (actual time=40879.907..40881.154 rows=19129 loops=3)"
" Sort Key: loc"
" Sort Method: quicksort Memory: 1660kB"
" -> Parallel Seq Scan on drops (cost=0.00..1965765.53 rows=8946 width=4) (actual time=1.341..40858.553 rows=19129 loops=3)"
" Filter: (tag = 1)"
" Rows Removed by Filter: 3113338"
"Planning time: 0.146 ms"
"Execution time: 40895.280 ms"
The table is indexed on loc and tag.
Your 40 seconds are spent sequentially reading the whole table, throwing away 3113338 rows to keep only 19129.
The remedy is simple:
CREATE INDEX ON drops(tag);
You say you have already done that, but I find it hard to believe. What command did you use?
Change the condition in the query from
WHERE tag = '1'
to
WHERE tag = 1
It happens to work because '1' is a literal, but don't try to compare strings and numbers.
And, as has been mentioned, keep either the DISTINCT or the GROUP BY, but not both.
If you use a GROUP BY clause, there is no need for the DISTINCT keyword; omitting it should speed up the query.
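Putting those fixes together, the query becomes:
SELECT DISTINCT loc
FROM drops
WHERE tag = 1;
If it is still slow once the index on tag exists, a composite index may even allow an index-only scan (a sketch; whether it helps depends on the visibility map and data distribution):
CREATE INDEX ON drops(tag, loc);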

Postgres chooses wrong query plan

I have a problem with a query that uses the wrong query plan. Because of the suboptimal query plan, the query takes almost 20 seconds.
The problem occurs only for a small number of owner_ids. The distribution of the owner_ids is not uniform. The owner_id in the example has 7948 routes. The total number of routes is 2903096.
The database is hosted on Amazon RDS on a server with 34.2 GiB memory, 4vCPU and provisioned IOPS (instance type db.m2.2xlarge). The Postgres version is 9.3.5.
EXPLAIN ANALYZE SELECT
route.id, route_meta.name
FROM
route
INNER JOIN
route_meta
USING (id)
WHERE
route.owner_id = 128905
ORDER BY
route_meta.name
LIMIT
61
Query plan:
"Limit (cost=0.86..58637.88 rows=61 width=24) (actual time=49.731..18828.052 rows=61 loops=1)"
" -> Nested Loop (cost=0.86..7934263.10 rows=8254 width=24) (actual time=49.728..18827.887 rows=61 loops=1)"
" -> Index Scan using route_meta_i_name on route_meta (cost=0.43..289911.22 rows=2902910 width=24) (actual time=0.016..2825.932 rows=1411126 loops=1)"
" -> Index Scan using route_pkey on route (cost=0.43..2.62 rows=1 width=4) (actual time=0.009..0.009 rows=0 loops=1411126)"
" Index Cond: (id = route_meta.id)"
" Filter: (owner_id = 128905)"
" Rows Removed by Filter: 1"
"Total runtime: 18828.214 ms"
If I increase the limit to 100, a better query plan is used. It now takes less than 100 ms.
EXPLAIN ANALYZE SELECT
route.id, route_meta.name
FROM
route
INNER JOIN
route_meta
USING (id)
WHERE
route.owner_id = 128905
ORDER BY
route_meta.name
LIMIT
100
Query plan:
"Limit (cost=79964.98..79965.23 rows=100 width=24) (actual time=93.037..93.294 rows=100 loops=1)"
" -> Sort (cost=79964.98..79985.61 rows=8254 width=24) (actual time=93.033..93.120 rows=100 loops=1)"
" Sort Key: route_meta.name"
" Sort Method: top-N heapsort Memory: 31kB"
" -> Nested Loop (cost=0.86..79649.52 rows=8254 width=24) (actual time=0.039..77.955 rows=7948 loops=1)"
" -> Index Scan using route_i_owner_id on route (cost=0.43..22765.84 rows=8408 width=4) (actual time=0.023..13.839 rows=7948 loops=1)"
" Index Cond: (owner_id = 128905)"
" -> Index Scan using route_meta_pkey on route_meta (cost=0.43..6.76 rows=1 width=24) (actual time=0.003..0.004 rows=1 loops=7948)"
" Index Cond: (id = route.id)"
"Total runtime: 93.444 ms"
I have already tried the following things:
increasing statistics for owner_id (The owner_id in the example is included in the pg_stats)
ALTER TABLE route ALTER COLUMN owner_id SET STATISTICS 1000;
reindex owner_id and name
vacuum analyse
increased work_mem from 1MB to 16MB
when I rewrite the query with row_number() OVER (ORDER BY xxx) AS rn
... WHERE rn <= yyy in a subquery, the specific case is solved; however, it
introduces performance problems with other owner_ids
A similar problem was solved with a combined index, but that seems impossible here because of the different tables.
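For reference, the row_number() rewrite described above would look roughly like this (a hypothetical reconstruction from the description, with 61 standing in for the limit):
SELECT id, name
FROM (
    SELECT route.id, route_meta.name,
           row_number() OVER (ORDER BY route_meta.name) AS rn
    FROM route
    INNER JOIN route_meta USING (id)
    WHERE route.owner_id = 128905
) sub
WHERE rn <= 61;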

Is it possible to answer queries on a view before fully materializing the view?

In short: DISTINCT, MIN, MAX on the left-hand side of a LEFT JOIN should be answerable without doing the join.
I'm using a SQL array type (on Postgres 9.3) to condense several rows of data into a single row, and then a view to return the unnested, normalized view. I do this to save on index costs, as well as to get Postgres to compress the data in the array.
Things work pretty well, but some queries that could be answered without unnesting and exploding the view are quite expensive, because that work is deferred until after the view is materialized. Is there any way to solve this?
Here is the basic table:
CREATE TABLE mt_count_by_day
(
run_id integer NOT NULL,
type character varying(64) NOT NULL,
start_day date NOT NULL,
end_day date NOT NULL,
counts bigint[] NOT NULL,
brand character varying(64), -- assumed column: the views below reference "brand", but the original definition omitted it
CONSTRAINT mt_count_by_day_pkey PRIMARY KEY (run_id, type)
);
An index on ‘type’ just for good measure:
CREATE INDEX runinfo_mt_count_by_day_type_idx on runinfo.mt_count_by_day (type);
Here is the view that uses generate_series and unnest
CREATE OR REPLACE VIEW runinfo.v_mt_count_by_day AS
SELECT mt_count_by_day.run_id,
mt_count_by_day.type,
mt_count_by_day.brand,
generate_series(mt_count_by_day.start_day::timestamp without time zone, mt_count_by_day.end_day - '1 day'::interval, '1 day'::interval) AS row_date,
unnest(mt_count_by_day.counts) AS row_count
FROM runinfo.mt_count_by_day;
What if I want to do DISTINCT on the 'type' column?
explain analyze select distinct(type) from mt_count_by_day;
"HashAggregate (cost=9566.81..9577.28 rows=1047 width=19) (actual time=171.653..172.019 rows=1221 loops=1)"
" -> Seq Scan on mt_count_by_day (cost=0.00..9318.25 rows=99425 width=19) (actual time=0.089..99.110 rows=99425 loops=1)"
"Total runtime: 172.338 ms"
Now what happens if I do the same on the view?
explain analyze select distinct(type) from v_mt_count_by_day;
"HashAggregate (cost=1749752.88..1749763.34 rows=1047 width=19) (actual time=58586.934..58587.191 rows=1221 loops=1)"
" -> Subquery Scan on v_mt_count_by_day (cost=0.00..1501190.38 rows=99425000 width=19) (actual time=0.114..37134.349 rows=68299959 loops=1)"
" -> Seq Scan on mt_count_by_day (cost=0.00..506940.38 rows=99425000 width=597) (actual time=0.113..24907.147 rows=68299959 loops=1)"
"Total runtime: 58587.474 ms"
Is there a way to get postgres to recognize that it can solve this without first exploding the view?
Here, for comparison, we count the number of rows matching a criterion in the table vs. the view. Everything works as expected: Postgres filters the rows down before materializing the view. Not quite the same thing, but this property is what makes our data more manageable.
explain analyze select count(*) from mt_count_by_day where type = 'SOCIAL_GOOGLE'
"Aggregate (cost=157.01..157.02 rows=1 width=0) (actual time=0.538..0.538 rows=1 loops=1)"
" -> Bitmap Heap Scan on mt_count_by_day (cost=4.73..156.91 rows=40 width=0) (actual time=0.139..0.509 rows=122 loops=1)"
" Recheck Cond: ((type)::text = 'SOCIAL_GOOGLE'::text)"
" -> Bitmap Index Scan on runinfo_mt_count_by_day_type_idx (cost=0.00..4.72 rows=40 width=0) (actual time=0.098..0.098 rows=122 loops=1)"
" Index Cond: ((type)::text = 'SOCIAL_GOOGLE'::text)"
"Total runtime: 0.625 ms"
explain analyze select count(*) from v_mt_count_by_day where type = 'SOCIAL_GOOGLE'
"Aggregate (cost=857.11..857.12 rows=1 width=0) (actual time=6.827..6.827 rows=1 loops=1)"
" -> Bitmap Heap Scan on mt_count_by_day (cost=4.73..357.11 rows=40000 width=597) (actual time=0.124..5.294 rows=15916 loops=1)"
" Recheck Cond: ((type)::text = 'SOCIAL_GOOGLE'::text)"
" -> Bitmap Index Scan on runinfo_mt_count_by_day_type_idx (cost=0.00..4.72 rows=40 width=0) (actual time=0.082..0.082 rows=122 loops=1)"
" Index Cond: ((type)::text = 'SOCIAL_GOOGLE'::text)"
"Total runtime: 6.885 ms"
Here is the code required to reproduce this:
CREATE TABLE base_table
(
run_id integer NOT NULL,
type integer NOT NULL,
start_day date NOT NULL,
end_day date NOT NULL,
counts bigint[] NOT NULL,
CONSTRAINT match_check CHECK (end_day > start_day AND (end_day - start_day) = array_length(counts, 1)),
CONSTRAINT base_table_pkey PRIMARY KEY (run_id, type)
);
--Just because...
CREATE INDEX base_type_idx on base_table (type);
CREATE OR REPLACE VIEW v_foo AS
SELECT m.run_id,
m.type,
t.row_date::date,
t.row_count
FROM base_table m
LEFT JOIN LATERAL ROWS FROM (
unnest(m.counts),
generate_series(m.start_day, m.end_day-1, interval '1d')
) t(row_count, row_date) ON true;
insert into base_table
select a.run_id, a.type, '20120101'::date as start_day, '20120401'::date as end_day, b.counts
from (
    SELECT N AS run_id, L AS type
    FROM generate_series(1, 10000) N
    CROSS JOIN generate_series(1, 7) L
    ORDER BY N, L
) a,
(
    SELECT array_agg(generate_series)::bigint[] AS counts
    FROM generate_series(1, 91)
) b;
And the results on 9.4.1:
explain analyze select distinct type from base_table;
"HashAggregate (cost=6750.00..6750.03 rows=3 width=4) (actual time=51.939..51.940 rows=3 loops=1)"
" Group Key: type"
" -> Seq Scan on base_table (cost=0.00..6600.00 rows=60000 width=4) (actual time=0.030..33.655 rows=60000 loops=1)"
"Planning time: 0.086 ms"
"Execution time: 51.975 ms"
explain analyze select distinct type from v_foo;
"HashAggregate (cost=1356600.01..1356600.04 rows=3 width=4) (actual time=9215.630..9215.630 rows=3 loops=1)"
" Group Key: m.type"
" -> Nested Loop Left Join (cost=0.01..1206600.01 rows=60000000 width=4) (actual time=0.112..7834.094 rows=5460000 loops=1)"
" -> Seq Scan on base_table m (cost=0.00..6600.00 rows=60000 width=764) (actual time=0.009..42.694 rows=60000 loops=1)"
" -> Function Scan on t (cost=0.01..10.01 rows=1000 width=0) (actual time=0.091..0.111 rows=91 loops=60000)"
"Planning time: 0.132 ms"
"Execution time: 9215.686 ms"
Generally, the Postgres query planner does "inline" views to optimize the whole query. Per documentation:
One application of the rewrite system is in the realization of views.
Whenever a query against a view (i.e., a virtual table) is made, the
rewrite system rewrites the user's query to a query that accesses the
base tables given in the view definition instead.
But I don't think Postgres is smart enough to conclude that it can reach the same result from the base table without exploding the rows.
You can try this alternative view definition with a LATERAL join. It's cleaner:
CREATE OR REPLACE VIEW runinfo.v_mt_count_by_day AS
SELECT m.run_id, m.type, m.brand
, m.start_day + c.rn - 1 AS row_date
, c.row_count
FROM runinfo.mt_count_by_day m
LEFT JOIN LATERAL unnest(m.counts) WITH ORDINALITY c(row_count, rn) ON true;
It also makes clear that one of (end_day, start_day) is redundant: end_day is always start_day + array_length(counts, 1).
Using LEFT JOIN because that might allow the query planner to ignore the join from your query:
SELECT DISTINCT type FROM v_mt_count_by_day;
Else (with a CROSS JOIN or INNER JOIN) it must evaluate the join to see whether rows from the first table are eliminated.
BTW, it's:
SELECT DISTINCT type ...
not:
SELECT DISTINCT(type) ...
Note that this returns a date instead of the timestamp in your original. Easier, and I guess it's what you want anyway?
LATERAL requires Postgres 9.3+; WITH ORDINALITY requires 9.4+. Details:
PostgreSQL unnest() with element number
ROWS FROM in Postgres 9.4+
To explode both columns in parallel safely:
CREATE OR REPLACE VIEW runinfo.v_mt_count_by_day AS
SELECT m.run_id, m.type, m.brand
     , t.row_date::date, t.row_count
FROM runinfo.mt_count_by_day m
LEFT JOIN LATERAL ROWS FROM (
unnest(m.counts)
, generate_series(m.start_day, m.end_day - 1, interval '1d')
) t(row_count, row_date) ON true;
The main benefit: This would not derail into a Cartesian product if the two SRFs don't return the same number of rows. Instead, the shorter result is padded with NULL values.
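A toy example makes the padding behavior visible (result shown as comments):
SELECT * FROM ROWS FROM (
    unnest(ARRAY[10, 20, 30]),
    generate_series(1, 2)
) t(a, b);
--  a | b
-- 10 | 1
-- 20 | 2
-- 30 | (null)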
Again, I can't say whether this would help the query planner with a faster plan for DISTINCT type without testing.