Make use of GIN index in a view composed from multiple tables - PostgreSQL

I have a view that is composed from elements of various tables:
CREATE OR REPLACE VIEW public.search_view
AS SELECT search_data.key,
search_data.title,
search_data.content
FROM ( SELECT 'foo.'::text || foo.id AS key,
foo.name AS title,
foo.information AS content
FROM foo
UNION ALL
SELECT 'bar.'::text || bar.id AS key,
bar.code AS title,
bar.info AS content
FROM bar
UNION ALL
SELECT 'baz.'::text || baz.id AS key,
baz.title AS title,
baz.text AS content
FROM baz) search_data;
All source tables have a GIN index on the column that is mapped to "content" in the view, e.g.
CREATE INDEX bar__info__gin ON bar USING GIN (info gin_trgm_ops);
The combined table has about 200k entries, each source table between 10k and 80k.
When I now execute a query like this
explain (analyze on, timing on)
select "content", 'lorem ipsum dolor sit amet' <-> content as distance
from search_view sv
where "content" ilike '%lorem%ipsum%dolor%sit%amet%'
order by distance desc
limit 10;
then I can see that the query planner does not make use of the indexed content:
QUERY PLAN |
-----------------------------------------------------------------------------------------------------------------------------------------------------------+
Limit (cost=13515.11..13515.13 rows=10 width=36) (actual time=344.365..344.374 rows=10 loops=1) |
-> Sort (cost=13515.11..13515.15 rows=19 width=36) (actual time=344.364..344.369 rows=10 loops=1) |
Sort Key: (('lorem ipsum dolor sit amet'::text <-> search_data.content)) DESC |
Sort Method: quicksort Memory: 25kB |
-> Subquery Scan on search_data (cost=0.00..13514.70 rows=19 width=36) (actual time=225.391..344.355 rows=11 loops=1) |
Filter: (search_data.content ~~* '%lorem%ipsum%dolor%sit%amet%'::text) |
Rows Removed by Filter: 190048 |
-> Append (cost=0.00..11138.92 rows=190059 width=164) (actual time=0.015..146.100 rows=190055 loops=1) |
-> Seq Scan on foo (cost=0.00..1701.05 rows=40003 width=147) (actual time=0.015..18.544 rows=40003 loops=1) |
-> Seq Scan on bar (cost=0.00..4645.03 rows=100002 width=115) (actual time=0.009..39.029 rows=100002 loops=1) |
-> Subquery Scan on "*SELECT* 3" (cost=0.00..2441.38 rows=50050 width=164) (actual time=0.011..34.077 rows=50050 loops=1) |
-> Seq Scan on baz (cost=0.00..1940.88 rows=50050 width=254) (actual time=0.010..18.936 rows=50050 loops=1)|
Planning Time: 0.437 ms |
Execution Time: 344.421 ms |
When I run the query independently against the source tables, the index is leveraged:
explain (analyze on, timing on)
select "info", 'lorem ipsum dolor sit amet' <-> info as distance
from bar b
where "info" ilike '%lorem%ipsum%dolor%sit%amet%'
order by distance desc
limit 10;
QUERY PLAN |
---------------------------------------------------------------------------------------------------------------------------------------------------------+
Limit (cost=95.14..95.16 rows=5 width=81) (actual time=3.064..3.072 rows=10 loops=1) |
-> Sort (cost=95.14..95.16 rows=5 width=81) (actual time=3.062..3.066 rows=10 loops=1) |
Sort Key: (('lorem ipsum dolor sit amet'::text <-> (info)::text)) DESC |
Sort Method: quicksort Memory: 25kB |
-> Bitmap Heap Scan on bar b (cost=76.04..95.09 rows=5 width=81) (actual time=2.976..3.052 rows=11 loops=1) |
Recheck Cond: ((info)::text ~~* '%lorem%ipsum%dolor%sit%amet%'::text) |
Rows Removed by Index Recheck: 9 |
Heap Blocks: exact=1 |
-> Bitmap Index Scan on bar__info__gin (cost=0.00..76.04 rows=5 width=0) (actual time=2.945..2.946 rows=20 loops=1)|
Index Cond: ((info)::text ~~* '%lorem%ipsum%dolor%sit%amet%'::text) |
Planning Time: 1.063 ms |
Execution Time: 3.215 ms
What am I doing wrong in the view, or how can I get a query against the view to make use of the source tables' indexes?
I am using Postgres 13, by the way.
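One way to check whether the trigram indexes are usable for this predicate at all is to apply the same filter inside each branch of the union by hand. This is only a diagnostic sketch built from the table and column names in the view definition above, not a fix for the view itself:
explain (analyze on, timing on)
select content, 'lorem ipsum dolor sit amet' <-> content as distance
from ( select foo.information as content
       from foo
       where foo.information ilike '%lorem%ipsum%dolor%sit%amet%'
       union all
       select bar.info
       from bar
       where bar.info ilike '%lorem%ipsum%dolor%sit%amet%'
       union all
       select baz.text
       from baz
       where baz.text ilike '%lorem%ipsum%dolor%sit%amet%' ) filtered
order by distance desc
limit 10;
If each branch then shows a Bitmap Index Scan on its GIN index, the remaining problem is purely that the planner does not push the ILIKE filter down through the view's Append.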

Related

Unusually slow query across a join table

I have a query that is strangely slow in Postgres 13 for a database containing only small amounts of data. I have even seen the problem in my test suite where I fabricate some fake data.
SELECT sales.* FROM sales
INNER JOIN members ON members.id = sales.member_id
INNER JOIN members_teams ON members_teams.member_id = members.id
INNER JOIN teams ON teams.id = members_teams.team_id
WHERE teams.id IN (1, 2)
In my test suite I have the following counts of data in the different tables:
| Table | Count |
| -------- | -------------- |
| members | 501 |
| teams | 3 |
| members_teams | 501 |
| sales | 502 |
Here is an example of when it is slow:
Nested Loop (cost=0.75..25.83 rows=1 width=631) (actual time=38226.620..38226.622 rows=0 loops=1)
Join Filter: (members_teams.team_id = teams.id)
-> Nested Loop (cost=0.75..24.82 rows=1 width=635) (actual time=0.082..38220.385 rows=502 loops=1)
Join Filter: (members.id = members_teams.member_id)
Rows Removed by Join Filter: 251000
-> Index Scan using index_members_teams_on_team_id on members_teams (cost=0.25..8.26 rows=1 width=8) (actual time=0.031..0.544 rows=501 loops=1)
-> Nested Loop (cost=0.50..16.54 rows=1 width=635) (actual time=0.014..76.217 rows=502 loops=501)
Join Filter: (sales.member_id = members.id)
Rows Removed by Join Filter: 125250
-> Index Scan using index_sales_on_member_id on sales (cost=0.25..8.26 rows=1 width=631) (actual time=0.005..0.262 rows=502 loops=501)
-> Index Only Scan using members_pkey on members (cost=0.25..8.26 rows=1 width=4) (actual time=0.008..0.124 rows=251 loops=251502)
Heap Fetches: 63001752
-> Seq Scan on teams (cost=0.00..1.00 rows=1 width=4) (actual time=0.005..0.005 rows=0 loops=502)
Filter: (id = ANY ('{1,2}'::integer[]))
Rows Removed by Filter: 3
Planning Time: 0.690 ms
Execution Time: 38226.701 ms
Here is an example of when it is a more normal speed:
Nested Loop (cost=0.75..24.82 rows=1 width=631) (actual time=224.746..224.747 rows=0 loops=1)
Join Filter: (members.id = members_teams.member_id)
-> Nested Loop (cost=0.50..16.54 rows=1 width=635) (actual time=0.047..80.953 rows=502 loops=1)
Join Filter: (sales.member_id = members.id)
Rows Removed by Join Filter: 125250
-> Index Scan using index_sales_on_member_id on sales (cost=0.25..8.26 rows=1 width=631) (actual time=0.015..0.367 rows=502 loops=1)
-> Index Only Scan using members_pkey on members (cost=0.25..8.26 rows=1 width=4) (actual time=0.009..0.131 rows=251 loops=502)
Heap Fetches: 125752
-> Index Only Scan using index_members_teams_on_member_id_and_team_id on members_teams (cost=0.25..8.27 rows=1 width=4) (actual time=0.286..0.286 rows=0 loops=502)
Filter: (team_id = ANY ('{1,2}'::integer[]))
Rows Removed by Filter: 501
Heap Fetches: 251502
Planning Time: 0.481 ms
Execution Time: 224.798 ms
Summary
A key difference seems to be which index it uses for the join table members_teams. Do you have any suggestions for how I can make this consistently performant? I thought about removing the join to teams and filtering on the team_id on the join table, but I'm worried that in the future we may need to use this query with additional constraints from the teams table.
Your estimates seem completely off. Do you have autovacuum disabled, or is your statistics collector malfunctioning? You should get better plans by explicitly collecting statistics:
ANALYZE sales;
ANALYZE members;
ANALYZE members_teams;
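If ANALYZE fixes the plans, it is also worth checking whether autovacuum has been running for these tables at all. A sketch using the standard pg_stat_user_tables view:
select relname, n_live_tup, last_autoanalyze, last_autovacuum
from pg_stat_user_tables
where relname in ('sales', 'members', 'members_teams', 'teams');
If n_live_tup is wildly off or the last_auto* columns are NULL, that would explain the rows=1 estimates in the plans above.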

PostgreSQL 9.4 Index speed very slow on Select

The box running PostgreSQL 9.4 is a 32-core system at 3.5 GHz per core with 128 GB of RAM and mirrored Samsung Pro 850 SSD drives, on FreeBSD with ZFS. There is no reason for performance this poor from PostgreSQL!
psql (9.4.5)
forex=# \d pair_data;
Table "public.pair_data"
Column | Type | Modifiers
------------+-----------------------------+--------------------------------------------------------
id | integer | not null default nextval('pair_data_id_seq'::regclass)
pair_name | character varying(6) |
pair_price | numeric(9,6) |
ts | timestamp without time zone |
Indexes:
"pair_data_pkey" PRIMARY KEY, btree (id)
"date_idx" gin (ts)
"pair_data_gin_idx1" gin (pair_name)
"pair_data_gin_idx2" gin (ts)
With this select:
forex=# explain (analyze, buffers) select pair_name as name, pair_price as price, ts as date from pair_data where pair_name = 'EURUSD' order by ts desc limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=201442.55..201442.57 rows=10 width=22) (actual time=10870.395..10870.430 rows=10 loops=1)
Buffers: shared hit=40 read=139150 dirtied=119
-> Sort (cost=201442.55..203787.44 rows=937957 width=22) (actual time=10870.390..10870.401 rows=10 loops=1)
Sort Key: ts
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=40 read=139150 dirtied=119
-> Bitmap Heap Scan on pair_data (cost=9069.17..181173.63 rows=937957 width=22) (actual time=614.976..8903.913 rows=951858 loops=1)
Recheck Cond: ((pair_name)::text = 'EURUSD'::text)
Rows Removed by Index Recheck: 13624055
Heap Blocks: exact=33464 lossy=105456
Buffers: shared hit=40 read=139150 dirtied=119
-> Bitmap Index Scan on pair_data_gin_idx1 (cost=0.00..8834.68 rows=937957 width=0) (actual time=593.701..593.701 rows=951858 loops=1)
Index Cond: ((pair_name)::text = 'EURUSD'::text)
Buffers: shared hit=40 read=230
Planning time: 0.387 ms
Execution time: 10871.419 ms
(16 rows)
Or this select:
forex=# explain (analyze, buffers)
with intervals as (
  select start, start + interval '4hr' as end
  from generate_series('2015-12-01 15:00', '2016-01-19 16:00', interval '4hr') as start
)
select distinct
  intervals.start as date,
  min(pair_price) over w as low,
  max(pair_price) over w as high,
  first_value(pair_price) over w as open,
  last_value(pair_price) over w as close
from intervals
join pair_data mb
  on mb.pair_name = 'EURUSD'
  and mb.ts >= intervals.start
  and mb.ts < intervals.end
window w as (partition by intervals.start order by mb.ts asc
             rows between unbounded preceding and unbounded following)
order by intervals.start;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=64634864.43..66198331.09 rows=815054 width=23) (actual time=1379732.924..1384602.952 rows=82 loops=1)
Buffers: shared hit=8 read=139210 dirtied=22, temp read=3332596 written=17050
CTE intervals
-> Function Scan on generate_series start (cost=0.00..12.50 rows=1000 width=8) (actual time=0.135..1.801 rows=295 loops=1)
-> Sort (cost=64634851.92..64895429.70 rows=104231111 width=23) (actual time=1379732.918..1381724.179 rows=951970 loops=1)
Sort Key: intervals.start, (min(mb.pair_price) OVER (?)), (max(mb.pair_price) OVER (?)), (first_value(mb.pair_price) OVER (?)), (last_value(mb.pair_price) OVER (?))
Sort Method: external sort Disk: 60808kB
Buffers: shared hit=8 read=139210 dirtied=22, temp read=3332596 written=17050
-> WindowAgg (cost=41474743.35..44341098.90 rows=104231111 width=23) (actual time=1341744.405..1365946.672 rows=951970 loops=1)
Buffers: shared hit=8 read=139210 dirtied=22, temp read=3324995 written=9449
-> Sort (cost=41474743.35..41735321.12 rows=104231111 width=23) (actual time=1341743.502..1343723.884 rows=951970 loops=1)
Sort Key: intervals.start, mb.ts
Sort Method: external sort Disk: 32496kB
Buffers: shared hit=8 read=139210 dirtied=22, temp read=1154778 written=7975
-> Nested Loop (cost=9098.12..21180990.32 rows=104231111 width=23) (actual time=271672.696..1337526.628 rows=951970 loops=1)
Join Filter: ((mb.ts >= intervals.start) AND (mb.ts < intervals."end"))
Rows Removed by Join Filter: 279879180
Buffers: shared hit=8 read=139210 dirtied=22, temp read=1150716 written=3913
-> CTE Scan on intervals (cost=0.00..20.00 rows=1000 width=16) (actual time=0.142..4.075 rows=295 loops=1)
-> Materialize (cost=9098.12..190496.52 rows=938080 width=15) (actual time=2.125..1962.153 rows=951970 loops=295)
Buffers: shared hit=8 read=139210 dirtied=22, temp read=1150716 written=3913
-> Bitmap Heap Scan on pair_data mb (cost=9098.12..181225.12 rows=938080 width=15) (actual time=622.818..11474.987 rows=951970 loops=1)
Recheck Cond: ((pair_name)::text = 'EURUSD'::text)
Rows Removed by Index Recheck: 13623989
Heap Blocks: exact=33485 lossy=105456
Buffers: shared hit=8 read=139210 dirtied=22
-> Bitmap Index Scan on pair_data_gin_idx1 (cost=0.00..8863.60 rows=938080 width=0) (actual time=601.158..601.158 rows=951970 loops=1)
Index Cond: ((pair_name)::text = 'EURUSD'::text)
Buffers: shared hit=8 read=269
Planning time: 0.454 ms
Execution time: 1384653.385 ms
(31 rows)
The pair_data table only contains:
forex=# select count(*) from pair_data;
count
----------
21833886
(1 row)
Why is this doing heap scans when there are indexes? I do not understand what is going on in the query plan. Does anyone have an idea where the problem might be?
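Two things stand out in the first plan: the bitmap goes lossy (Heap Blocks: exact=33464 lossy=105456), which suggests work_mem is too small to hold an exact bitmap for ~950k matching rows, and the top-N heapsort still has to read every matching row before it can return 10. A hedged sketch of an index that matches both the equality filter and the ORDER BY, so the LIMIT 10 could be answered from the index directly (untested here; the index name is made up):
-- composite B-tree matching "where pair_name = ... order by ts desc limit 10"
create index pair_data_name_ts_idx on pair_data (pair_name, ts desc);
A GIN index cannot return rows in ts order, so it can never satisfy the ORDER BY ... LIMIT part of this query on its own.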

PostgreSQL 9.4 query tuning

I have a query that is running too slowly.
select c.vm_name,
round(sum(bytes_sent)*1.8/power(10,9)) gb_sent,
round(sum(bytes_received)*1.8/power(10,9)) gb_received
from groups b,
vms c,
vm_ip_address_histories d,
ip_address_usage_histories e
where b.group_id = c.group_id
and c.vm_id = d.vm_id
and d.ip_address = e.ip_address
and e.datetime >= firstday()
and d.allocation_date <= last_day(sysdate()) and (d.deallocation_date is null or d.deallocation_date > last_day(sysdate()))
and b.customer_id = 29
group by c.vm_name
order by 1;
The function sysdate() returns the current system timestamp without a time zone, and last_day() returns the timestamp representing the last day of the month. I created these because Hibernate doesn't like the Postgres casting notation.
The issue is that the planner is doing full table scans where there are indexes in place. Here is the explain plan for the above query:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=1326387.13..1326391.38 rows=1698 width=24) (actual time=13221.041..13221.042 rows=7 loops=1)
Sort Key: c.vm_name
Sort Method: quicksort Memory: 25kB
-> HashAggregate (cost=1326236.61..1326296.04 rows=1698 width=24) (actual time=13221.008..13221.026 rows=7 loops=1)
Group Key: c.vm_name
-> Hash Join (cost=1309056.97..1325972.10 rows=35268 width=24) (actual time=13131.323..13211.612 rows=13631 loops=1)
Hash Cond: (d.ip_address = e.ip_address)
-> Nested Loop (cost=2.97..6942.24 rows=79 width=15) (actual time=0.249..56.904 rows=192 loops=1)
-> Hash Join (cost=2.69..41.02 rows=98 width=16) (actual time=0.066..0.638 rows=61 loops=1)
Hash Cond: (c.group_id = b.group_id)
-> Seq Scan on vms c (cost=0.00..30.98 rows=1698 width=24) (actual time=0.009..0.281 rows=1698 loops=1)
-> Hash (cost=2.65..2.65 rows=3 width=8) (actual time=0.014..0.014 rows=4 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Seq Scan on groups b (cost=0.00..2.65 rows=3 width=8) (actual time=0.004..0.011 rows=4 loops=1)
Filter: (customer_id = 29)
Rows Removed by Filter: 49
-> Index Scan using xif1vm_ip_address_histories on vm_ip_address_histories d (cost=0.29..70.34 rows=8 width=15) (actual time=0.011..0.921 rows=3 loops=61)
Index Cond: (vm_id = c.vm_id)
Filter: ((allocation_date <= last_day(sysdate())) AND ((deallocation_date IS NULL) OR (deallocation_date > last_day(sysdate()))))
Rows Removed by Filter: 84
-> Hash (cost=1280129.06..1280129.06 rows=1575435 width=23) (actual time=13130.223..13130.223 rows=203702 loops=1)
Buckets: 8192 Batches: 32 Memory Usage: 422kB
-> Seq Scan on ip_address_usage_histories e (cost=0.00..1280129.06 rows=1575435 width=23) (actual time=0.205..13002.776 rows=203702 loops=1)
Filter: (datetime >= firstday())
Rows Removed by Filter: 4522813
Planning time: 0.804 ms
Execution time: 13221.155 ms
(27 rows)
Notice that the planner chooses to perform very expensive full table scans on the largest tables, ip_address_usage_histories and vm_ip_address_histories. I have tried changing the configuration parameter enable_seqscan to off, but that made the problem worse: total execution time went up to 63 seconds.
Here are the describes of the aforementioned tables:
Table "ip_address_usage_histories"
Column | Type | Modifiers
-----------------------------+-----------------------------+-----------
ip_address_usage_history_id | bigint | not null
datetime | timestamp without time zone | not null
ip_address | inet | not null
bytes_sent | bigint | not null
bytes_received | bigint | not null
Indexes:
"ip_address_usage_histories_pkey" PRIMARY KEY, btree (ip_address_usage_history_id)
"ip_address_usage_histories_datetime_ip_address_key" UNIQUE CONSTRAINT, btree (datetime, ip_address)
"uk_mit6tbiu8k62vdae4tmtnwb3f" UNIQUE CONSTRAINT, btree (datetime, ip_address)
Table "vm_ip_address_histories"
Column | Type | Modifiers
--------------------------+-----------------------------+--------------------------------------------------------------------------------------------
vm_ip_address_history_id | bigint | not null default nextval('vm_ip_address_histories_vm_ip_address_history_id_seq'::regclass)
ip_address | inet | not null
allocation_date | timestamp without time zone | not null
deallocation_date | timestamp without time zone |
vm_id | bigint | not null
Indexes:
"vm_ip_address_histories_pkey" PRIMARY KEY, btree (vm_ip_address_history_id)
"xie1vm_ip_address_histories" btree (replicate_date)
"xif1vm_ip_address_histories" btree (vm_id)
Foreign-key constraints:
"vm_ip_address_histories_vm_id_fkey" FOREIGN KEY (vm_id) REFERENCES vms(vm_id) ON DELETE RESTRICT
It appears that Postgres does not have query hints to direct the planner. I also tried the explicit FROM ... INNER JOIN ... ON ... syntax, but that did not improve things either.
Update 1
create or replace function firstday() returns timestamp without time zone as $$
begin
return date_trunc('month',now()::timestamp without time zone)::timestamp without time zone;
end; $$
language plpgsql;
I have not tried to replace this function with a standard function because Postgres doesn't have a function that returns the first day of the month to my knowledge.
The following was embedded in the question, but it reads as an answer.
After changing all of my functions to immutable, the query now runs in 200 ms! All the right things are happening.
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=51865.24..51914.88 rows=1103 width=24) (actual time=178.793..188.223 rows=7 loops=1)
Group Key: c.vm_name
-> Sort (cost=51865.24..51868.00 rows=1103 width=24) (actual time=178.517..180.541 rows=13823 loops=1)
Sort Key: c.vm_name
Sort Method: quicksort Memory: 1464kB
-> Hash Join (cost=50289.49..51809.50 rows=1103 width=24) (actual time=131.278..155.971 rows=13823 loops=1)
Hash Cond: (d.ip_address = e.ip_address)
-> Nested Loop (cost=2.97..272.36 rows=23 width=15) (actual time=0.149..2.310 rows=192 loops=1)
-> Hash Join (cost=2.69..41.02 rows=98 width=16) (actual time=0.046..0.590 rows=61 loops=1)
Hash Cond: (c.group_id = b.group_id)
-> Seq Scan on vms c (cost=0.00..30.98 rows=1698 width=24) (actual time=0.006..0.250 rows=1698 loops=1)
-> Hash (cost=2.65..2.65 rows=3 width=8) (actual time=0.014..0.014 rows=4 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Seq Scan on groups b (cost=0.00..2.65 rows=3 width=8) (actual time=0.004..0.012 rows=4 loops=1)
Filter: (customer_id = 29)
Rows Removed by Filter: 49
-> Index Scan using xif1vm_ip_address_histories on vm_ip_address_histories d (cost=0.29..2.34 rows=2 width=15) (actual time=0.002..0.027 rows=3 loops=61)
Index Cond: (vm_id = c.vm_id)
Filter: ((allocation_date <= '2015-03-31 00:00:00'::timestamp without time zone) AND ((deallocation_date IS NULL) OR (deallocation_date > '2015-03-31 00:00:00'::timestamp without time zone)))
Rows Removed by Filter: 84
-> Hash (cost=46621.83..46621.83 rows=199575 width=23) (actual time=130.762..130.762 rows=206266 loops=1)
Buckets: 8192 Batches: 4 Memory Usage: 2818kB
-> Bitmap Heap Scan on ip_address_usage_histories e (cost=4627.14..46621.83 rows=199575 width=23) (actual time=18.335..69.763 rows=206266 loops=1)
Recheck Cond: (datetime >= '2015-03-01 00:00:00'::timestamp without time zone)
Heap Blocks: exact=3684
-> Bitmap Index Scan on uk_mit6tbiu8k62vdae4tmtnwb3f (cost=0.00..4577.24 rows=199575 width=0) (actual time=17.797..17.797 rows=206935 loops=1)
Index Cond: (datetime >= '2015-03-01 00:00:00'::timestamp without time zone)
Planning time: 0.837 ms
Execution time: 188.301 ms
(29 rows)
I can now see that the planner evaluates the functions and substitutes their results into the WHERE clause, which allows the indexes to be used.
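For reference, this is roughly what the volatility change could look like for the firstday() function from Update 1. A sketch only: the author marked the functions IMMUTABLE, but strictly speaking a function whose result depends on now() should be STABLE, which is normally still enough for the planner to use the indexes:
create or replace function firstday() returns timestamp without time zone as $$
begin
    return date_trunc('month', now()::timestamp without time zone);
end; $$
language plpgsql stable;   -- the author used IMMUTABLE here
With the default VOLATILE category the planner cannot assume the value stays the same during the query, so it falls back to the sequential scans seen in the first plan; with IMMUTABLE the call is folded into the literal timestamps visible in the new plan above.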

Why does my view in PostgreSQL not use the index?

I have a large table (a star catalog), of which I need a subset. I implement the subset
as a union of two tables, making use of a cross index (the grid_catalog table).
The issue is that a query against the view doesn't seem to be using the index; it takes about as long as a scan through the whole table.
A query against the large table goes quickly:
select count(*) from ucac4 where rnm in (select ucac4_rnm from grid_catalog limit 5);
count
-------
5
(1 row)
Time: 12.132 ms
A query against the view does not go quickly, even though I would expect it to.
select count(*) from grid_catalog_view where ident in (select ucac4_rnm from grid_catalog limit 5);
count
-------
5
(1 row)
Time: 1056237.045 ms
An explain of this query yields:
Aggregate (cost=23175810.51..23175810.52 rows=1 width=0)
-> Hash Join (cost=23081888.41..23172893.67 rows=1166734 width=0)
Hash Cond: (ucac4.rnm = public.grid_catalog.ucac4_rnm)
-> Unique (cost=23081888.17..23140224.87 rows=2333468 width=44)
-> Sort (cost=23081888.17..23087721.84 rows=2333468 width=44)
Sort Key: ucac4.ra, ucac4."dec", ucac4.pmrac, ucac4.pmdc, ucac4.rnm, ucac4.nest4, ucac4.nest6, ucac4.nest7, public.grid_catalog.subset
-> Append (cost=63349.87..22763295.24 rows=2333468 width=44)
-> Hash Join (cost=63349.87..22738772.75 rows=2333467 width=44)
Hash Cond: (ucac4.rnm = public.grid_catalog.ucac4_rnm)
-> Seq Scan on ucac4 (cost=0.00..16394129.04 rows=455124304 width=40)
-> Hash (cost=34048.69..34048.69 rows=2344094 width=8)
-> Seq Scan on grid_catalog (cost=0.00..34048.69 rows=2344094 width=8)
Filter: (petrov_prikey IS NULL)
-> Hash Join (cost=415.51..1187.80 rows=1 width=36)
Hash Cond: (petrov.prikey = public.grid_catalog.petrov_prikey)
-> Seq Scan on petrov (cost=0.00..709.15 rows=7215 width=32)
-> Hash (cost=282.08..282.08 rows=10675 width=8)
-> Index Scan using grid_catalog_petrov_prikey_idx on grid_catalog (cost=0.00..282.08 rows=10675 width=8)
-> Hash (cost=0.18..0.18 rows=5 width=4)
-> HashAggregate (cost=0.13..0.18 rows=5 width=4)
-> Limit (cost=0.00..0.07 rows=5 width=4)
-> Seq Scan on grid_catalog (cost=0.00..34048.69 rows=2354769 width=4)
(22 rows)
The explain analyze (requested in a comment) is:
Aggregate (cost=23175810.51..23175810.52 rows=1 width=0) (actual time=1625067.627..1625067.628 rows=1 loops=1)
-> Hash Join (cost=23081888.41..23172893.67 rows=1166734 width=0) (actual time=1621395.200..1625067.618 rows=5 loops=1)
Hash Cond: (ucac4.rnm = public.grid_catalog.ucac4_rnm)
-> Unique (cost=23081888.17..23140224.87 rows=2333468 width=44) (actual time=1620897.932..1624102.849 rows=1597359 loops=1)
-> Sort (cost=23081888.17..23087721.84 rows=2333468 width=44) (actual time=1620897.928..1622191.358 rows=1597359 loops=1)
Sort Key: ucac4.ra, ucac4."dec", ucac4.pmrac, ucac4.pmdc, ucac4.rnm, ucac4.nest4, ucac4.nest6, ucac4.nest7, public.grid_catalog.subset
Sort Method: external merge Disk: 87536kB
-> Append (cost=63349.87..22763295.24 rows=2333468 width=44) (actual time=890293.619..1613769.160 rows=1597359 loops=1)
-> Hash Join (cost=63349.87..22738772.75 rows=2333467 width=44) (actual time=890293.617..1611550.313 rows=1590144 loops=1)
Hash Cond: (ucac4.rnm = public.grid_catalog.ucac4_rnm)
-> Seq Scan on ucac4 (cost=0.00..16394129.04 rows=455124304 width=40) (actual time=886086.630..1359934.589 rows=113780093 loops=1)
-> Hash (cost=34048.69..34048.69 rows=2344094 width=8) (actual time=4203.785..4203.785 rows=1590144 loops=1)
-> Seq Scan on grid_catalog (cost=0.00..34048.69 rows=2344094 width=8) (actual time=0.014..2813.031 rows=1590144 loops=1)
Filter: (petrov_prikey IS NULL)
-> Hash Join (cost=415.51..1187.80 rows=1 width=36) (actual time=101.604..165.749 rows=7215 loops=1)
Hash Cond: (petrov.prikey = public.grid_catalog.petrov_prikey)
-> Seq Scan on petrov (cost=0.00..709.15 rows=7215 width=32) (actual time=58.280..108.043 rows=7215 loops=1)
-> Hash (cost=282.08..282.08 rows=10675 width=8) (actual time=43.276..43.276 rows=7215 loops=1)
-> Index Scan using grid_catalog_petrov_prikey_idx on grid_catalog (cost=0.00..282.08 rows=10675 width=8) (actual time=19.387..37.533 rows=7215 loops=1)
-> Hash (cost=0.18..0.18 rows=5 width=4) (actual time=0.035..0.035 rows=5 loops=1)
-> HashAggregate (cost=0.13..0.18 rows=5 width=4) (actual time=0.026..0.030 rows=5 loops=1)
-> Limit (cost=0.00..0.07 rows=5 width=4) (actual time=0.009..0.017 rows=5 loops=1)
-> Seq Scan on grid_catalog (cost=0.00..34048.69 rows=2354769 width=4) (actual time=0.007..0.009 rows=5 loops=1)
Total runtime: 1625108.504 ms
(24 rows)
Time: 1625466.830 ms
To see the time to scan through the view:
select count(*) from grid_catalog_view;
count
---------
1597359
(1 row)
Time: 1033732.786 ms
My view is defined as:
PS1=# \d grid_catalog_view
View "public.grid_catalog_view"
Column | Type | Modifiers
--------+------------------+-----------
ra | double precision |
dec | double precision |
pmrac | integer |
pmdc | integer |
ident | integer |
nest4 | integer |
nest6 | integer |
nest7 | integer |
subset | integer |
View definition:
SELECT ucac4.ra, ucac4."dec", ucac4.pmrac, ucac4.pmdc, ucac4.rnm AS ident, ucac4.nest4, ucac4.nest6, ucac4.nest7, grid_catalog.subset
FROM ucac4, grid_catalog
WHERE ucac4.rnm = grid_catalog.ucac4_rnm AND grid_catalog.petrov_prikey IS NULL
UNION
SELECT petrov.ra, petrov."dec", 0 AS pmrac, 0 AS pmdc, grid_catalog.petrov_prikey AS ident, petrov.nest4, petrov.nest6, petrov.nest7, grid_catalog.subset
FROM petrov, grid_catalog
WHERE petrov.prikey = grid_catalog.petrov_prikey AND grid_catalog.ucac4_rnm IS NULL;
The large table is defined as:
PS1=# \d ucac4
Table "public.ucac4"
Column | Type | Modifiers
----------+------------------+-----------
radi | bigint |
spdi | bigint |
magm | smallint |
maga | smallint |
sigmag | smallint |
objt | smallint |
cdf | smallint |
... deleted entries not of relevance ...
ra | double precision |
dec | double precision |
x | double precision |
y | double precision |
z | double precision |
nest4 | integer |
nest6 | integer |
nest7 | integer |
Indexes:
"ucac4_pkey" PRIMARY KEY, btree (rnm)
"q3c_ucac4_idx" btree (q3c_ang2ipix(ra, "dec")) CLUSTER
"ucac4_nest4_idx" btree (nest4)
"ucac4_nest6_idx" btree (nest6)
"ucac4_nest7_idx" btree (nest7)
Referenced by:
TABLE "grid_catalog" CONSTRAINT "grid_catalog_ucac4_rnm_fkey" FOREIGN KEY (ucac4_rnm) REFERENCES ucac4(rnm)
Any idea why my index doesn't seem to be used?
As far as I can see, it's a limitation in Postgres: it's hard to make it avoid scanning the whole table when querying a union in this way.
See:
https://www.postgresql-archive.org/Poor-plan-when-joining-against-a-union-containing-a-join-td5747690.html
and
https://www.postgresql-archive.org/Pushing-IN-subquery-down-through-UNION-ALL-td3398684.html
and also maybe related
https://dba.stackexchange.com/questions/47572/in-postgresql-9-3-union-view-with-where-clause-not-taken-into-account
Basically, I guess you need to revisit your view definition. Sorry there's no definitive solution.
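One possible direction for revisiting the view definition, sketched rather than tested: the two branches look mutually exclusive (one requires petrov_prikey IS NULL, the other ucac4_rnm IS NULL), so if they can never produce identical rows, UNION ALL is equivalent to UNION and at least removes the Sort/Unique over ~1.6 million rows (the 87536 kB external merge) seen in the plan:
CREATE OR REPLACE VIEW grid_catalog_view AS
SELECT ucac4.ra, ucac4."dec", ucac4.pmrac, ucac4.pmdc, ucac4.rnm AS ident,
       ucac4.nest4, ucac4.nest6, ucac4.nest7, grid_catalog.subset
FROM ucac4
JOIN grid_catalog ON ucac4.rnm = grid_catalog.ucac4_rnm
WHERE grid_catalog.petrov_prikey IS NULL
UNION ALL
SELECT petrov.ra, petrov."dec", 0 AS pmrac, 0 AS pmdc, grid_catalog.petrov_prikey AS ident,
       petrov.nest4, petrov.nest6, petrov.nest7, grid_catalog.subset
FROM petrov
JOIN grid_catalog ON petrov.prikey = grid_catalog.petrov_prikey
WHERE grid_catalog.ucac4_rnm IS NULL;
Whether the IN (...) condition then gets pushed down into both branches is a separate question (see the threads linked above), but dropping the Unique step alone removes the on-disk sort.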

How to create a Postgres date index properly?

I'm using the Django ORM and PostgreSQL.
The ORM creates this query:
SELECT
(date_part('month', stat_date)) AS "stat_date",
"direct_keywordstat"."banner_id",
SUM("direct_keywordstat"."total") AS "total",
SUM("direct_keywordstat"."clicks") AS "clicks",
SUM("direct_keywordstat"."shows") AS "shows"
FROM "direct_keywordstat"
LEFT OUTER JOIN "direct_banner" ON ("direct_keywordstat"."banner_id" = "direct_banner"."banner_ptr_id")
LEFT OUTER JOIN "platforms_banner" ON ("direct_banner"."banner_ptr_id" = "platforms_banner"."id")
WHERE (
"direct_keywordstat".stat_date BETWEEN E'2009-08-25' AND E'2010-08-25' AND
"direct_keywordstat"."keyword_id" IN (
SELECT U0."id"
FROM "direct_keyword" U0
INNER JOIN "direct_banner" U1 ON (U0."banner_id" = U1."banner_ptr_id")
INNER JOIN "platforms_banner" U2 ON (U1."banner_ptr_id" = U2."id")
INNER JOIN "platforms_campaign" U3 ON (U2."campaign_id" = U3."id")
INNER JOIN "direct_campaign" U4 ON (U3."id" = U4."campaign_ptr_id")
WHERE (
U0."deleted" = E'False' AND
U0."low_ctr" = E'False' AND
U4."status_active" = E'True' AND
U0."banner_id" IN (
SELECT U0."banner_ptr_id"
FROM "direct_banner" U0
INNER JOIN "platforms_banner" U1
ON (U0."banner_ptr_id" = U1."id")
WHERE (
U0."status_show" = E'True' AND
U1."campaign_id" = E'174' )
)
)
)
)
GROUP BY
"direct_keywordstat"."banner_id",
(date_part('month', stat_date)),
"platforms_banner"."title", date_trunc('month', stat_date)
ORDER BY "platforms_banner"."title" ASC, "stat_date" ASC
The problem is that direct_keywordstat contains 3 million+ records, so the query executes in ~15 seconds.
I've tried creating indexes like
CREATE INDEX direct_keywordstat_stat_date on direct_keywordstat using btree(stat_date);
But EXPLAIN ANALYZE shows that the index is not used.
Table schema:
\d direct_keywordstat
Table "public.direct_keywordstat"
Column | Type | Modifiers
-------------+------------------------+-----------------------------------------------------------------
id | integer | not null default nextval('direct_keywordstat_id_seq'::regclass)
keyword_id | integer | not null
banner_id | integer | not null
campaign_id | integer | not null
stat_date | date | not null
region_id | integer | not null
place_type | character varying(30) |
place_name | character varying(100) |
clicks | integer | not null default 0
shows | integer | not null default 0
total | numeric(19,6) | not null
How can I create a useful index?
Or maybe there's a way to optimize this query differently?
The thing is, if the WHERE clause looks like
"direct_keywordstat".clicks BETWEEN 10 AND 3000000
the query executes in 0.8 seconds.
Do you have indexes on these columns:
direct_banner.banner_ptr_id
direct_keywordstat.banner_id
direct_keywordstat.stat_date
Both columns in direct_keywordstat could be combined in a single index; just try it and check the plan.
This is also a problem:
Sort Method: external merge Disk: 20600kB
Check your settings for work_mem; you need at least 20MB for this query.
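A sketch of what those two suggestions could look like (the index name is made up, and the best column order may depend on the rest of the workload):
create index direct_keywordstat_banner_id_stat_date
    on direct_keywordstat (banner_id, stat_date);
-- for the current session, so the 20600kB sort in the plan below no longer spills to disk
set work_mem = '32MB';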
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=727967.61..847401.71 rows=2514402 width=67) (actual time=22010.522..23408.262 rows=5 loops=1)
-> Sort (cost=727967.61..734253.62 rows=2514402 width=67) (actual time=21742.365..23134.748 rows=198978 loops=1)
Sort Key: platforms_banner.title, (date_part('month'::text, (direct_keywordstat.stat_date)::timestamp without time zone)), direct_keywordstat.banner_id, (date_trunc('month'::text, (direct_keywordstat.stat_date)::timestamp with time zone))
Sort Method: external merge Disk: 20600kB
-> Hash Join (cost=1034.02..164165.25 rows=2514402 width=67) (actual time=5159.538..14942.441 rows=198978 loops=1)
Hash Cond: (direct_keywordstat.keyword_id = u0.id)
-> Hash Left Join (cost=365.78..117471.99 rows=2514402 width=71) (actual time=26.672..13101.294 rows=2523151 loops=1)
Hash Cond: (direct_keywordstat.banner_id = direct_banner.banner_ptr_id)
-> Seq Scan on direct_keywordstat (cost=0.00..76247.17 rows=2514402 width=25) (actual time=8.892..9386.010 rows=2523151 loops=1)
Filter: ((stat_date >= '2009-08-25'::date) AND (stat_date <= '2010-08-25'::date))
-> Hash (cost=324.86..324.86 rows=3274 width=50) (actual time=17.754..17.754 rows=2851 loops=1)
-> Hash Left Join (cost=209.15..324.86 rows=3274 width=50) (actual time=10.845..15.385 rows=2851 loops=1)
Hash Cond: (direct_banner.banner_ptr_id = platforms_banner.id)
-> Seq Scan on direct_banner (cost=0.00..66.74 rows=3274 width=4) (actual time=0.004..1.196 rows=2851 loops=1)
-> Hash (cost=173.51..173.51 rows=2851 width=50) (actual time=10.683..10.683 rows=2851 loops=1)
-> Seq Scan on platforms_banner (cost=0.00..173.51 rows=2851 width=50) (actual time=0.004..3.576 rows=2851 loops=1)
-> Hash (cost=641.44..641.44 rows=2144 width=4) (actual time=30.420..30.420 rows=106 loops=1)
-> HashAggregate (cost=620.00..641.44 rows=2144 width=4) (actual time=30.162..30.288 rows=106 loops=1)
-> Hash Join (cost=407.17..614.64 rows=2144 width=4) (actual time=16.152..30.031 rows=106 loops=1)
Hash Cond: (u0.banner_id = u1.banner_ptr_id)
-> Nested Loop (cost=76.80..238.50 rows=6488 width=16) (actual time=8.670..22.343 rows=106 loops=1)
-> HashAggregate (cost=76.80..76.87 rows=7 width=8) (actual time=0.045..0.047 rows=1 loops=1)
-> Nested Loop (cost=0.00..76.79 rows=7 width=8) (actual time=0.033..0.036 rows=1 loops=1)
-> Index Scan using platforms_banner_campaign_id on platforms_banner u1 (cost=0.00..22.82 rows=7 width=4) (actual time=0.019..0.020 rows=1 loops=1)
Index Cond: (campaign_id = 174)
-> Index Scan using direct_banner_pkey on direct_banner u0 (cost=0.00..7.70 rows=1 width=4) (actual time=0.009..0.011 rows=1 loops=1)
Index Cond: (u0.banner_ptr_id = u1.id)
Filter: u0.status_show
-> Index Scan using direct_keyword_banner_id on direct_keyword u0 (cost=0.00..23.03 rows=5 width=8) (actual time=8.620..22.127 rows=106 loops=1)
Index Cond: (u0.banner_id = u0.banner_ptr_id)
Filter: ((NOT u0.deleted) AND (NOT u0.low_ctr))
-> Hash (cost=316.84..316.84 rows=1082 width=8) (actual time=7.458..7.458 rows=403 loops=1)
-> Hash Join (cost=227.00..316.84 rows=1082 width=8) (actual time=3.584..7.149 rows=403 loops=1)
Hash Cond: (u1.banner_ptr_id = u2.id)
-> Seq Scan on direct_banner u1 (cost=0.00..66.74 rows=3274 width=4) (actual time=0.002..1.570 rows=2851 loops=1)
-> Hash (cost=213.48..213.48 rows=1082 width=4) (actual time=3.521..3.521 rows=403 loops=1)
-> Hash Join (cost=23.88..213.48 rows=1082 width=4) (actual time=0.715..3.268 rows=403 loops=1)
Hash Cond: (u2.campaign_id = u3.id)
-> Seq Scan on platforms_banner u2 (cost=0.00..173.51 rows=2851 width=8) (actual time=0.001..1.272 rows=2851 loops=1)
-> Hash (cost=22.95..22.95 rows=74 width=8) (actual time=0.345..0.345 rows=37 loops=1)
-> Hash Join (cost=11.84..22.95 rows=74 width=8) (actual time=0.133..0.320 rows=37 loops=1)
Hash Cond: (u3.id = u4.campaign_ptr_id)
-> Seq Scan on platforms_campaign u3 (cost=0.00..8.91 rows=391 width=4) (actual time=0.006..0.098 rows=196 loops=1)
-> Hash (cost=10.91..10.91 rows=74 width=4) (actual time=0.117..0.117 rows=37 loops=1)
-> Seq Scan on direct_campaign u4 (cost=0.00..10.91 rows=74 width=4) (actual time=0.004..0.097 rows=37 loops=1)
Filter: status_active
Total runtime: 23436.715 ms
(47 rows)