why does my view in postgresql not use the index?

I have a large table (a star catalog) of which I have a subset. I implement the subset
as a view that is a union of two tables, making use of a cross index.
The issue is that a query against the view doesn't seem to be using the index; it takes about as long as a scan through the whole table.
A query against the large table goes quickly:
select count(*) from ucac4 where rnm in (select ucac4_rnm from grid_catalog limit 5);
count
-------
5
(1 row)
Time: 12.132 ms
A query against the view does not go quickly, even though I would expect it to.
select count(*) from grid_catalog_view where ident in (select ucac4_rnm from grid_catalog limit 5);
count
-------
5
(1 row)
Time: 1056237.045 ms
An explain of this query yields:
Aggregate (cost=23175810.51..23175810.52 rows=1 width=0)
-> Hash Join (cost=23081888.41..23172893.67 rows=1166734 width=0)
Hash Cond: (ucac4.rnm = public.grid_catalog.ucac4_rnm)
-> Unique (cost=23081888.17..23140224.87 rows=2333468 width=44)
-> Sort (cost=23081888.17..23087721.84 rows=2333468 width=44)
Sort Key: ucac4.ra, ucac4."dec", ucac4.pmrac, ucac4.pmdc, ucac4.rnm, ucac4.nest4, ucac4.nest6, ucac4.nest7, public.grid_catalog.subset
-> Append (cost=63349.87..22763295.24 rows=2333468 width=44)
-> Hash Join (cost=63349.87..22738772.75 rows=2333467 width=44)
Hash Cond: (ucac4.rnm = public.grid_catalog.ucac4_rnm)
-> Seq Scan on ucac4 (cost=0.00..16394129.04 rows=455124304 width=40)
-> Hash (cost=34048.69..34048.69 rows=2344094 width=8)
-> Seq Scan on grid_catalog (cost=0.00..34048.69 rows=2344094 width=8)
Filter: (petrov_prikey IS NULL)
-> Hash Join (cost=415.51..1187.80 rows=1 width=36)
Hash Cond: (petrov.prikey = public.grid_catalog.petrov_prikey)
-> Seq Scan on petrov (cost=0.00..709.15 rows=7215 width=32)
-> Hash (cost=282.08..282.08 rows=10675 width=8)
-> Index Scan using grid_catalog_petrov_prikey_idx on grid_catalog (cost=0.00..282.08 rows=10675 width=8)
-> Hash (cost=0.18..0.18 rows=5 width=4)
-> HashAggregate (cost=0.13..0.18 rows=5 width=4)
-> Limit (cost=0.00..0.07 rows=5 width=4)
-> Seq Scan on grid_catalog (cost=0.00..34048.69 rows=2354769 width=4)
(22 rows)
The explain analyze (requested in a comment) is:
Aggregate (cost=23175810.51..23175810.52 rows=1 width=0) (actual time=1625067.627..1625067.628 rows=1 loops=1)
-> Hash Join (cost=23081888.41..23172893.67 rows=1166734 width=0) (actual time=1621395.200..1625067.618 rows=5 loops=1)
Hash Cond: (ucac4.rnm = public.grid_catalog.ucac4_rnm)
-> Unique (cost=23081888.17..23140224.87 rows=2333468 width=44) (actual time=1620897.932..1624102.849 rows=1597359 loops=1)
-> Sort (cost=23081888.17..23087721.84 rows=2333468 width=44) (actual time=1620897.928..1622191.358 rows=1597359 loops=1)
Sort Key: ucac4.ra, ucac4."dec", ucac4.pmrac, ucac4.pmdc, ucac4.rnm, ucac4.nest4, ucac4.nest6, ucac4.nest7, public.grid_catalog.subset
Sort Method: external merge Disk: 87536kB
-> Append (cost=63349.87..22763295.24 rows=2333468 width=44) (actual time=890293.619..1613769.160 rows=1597359 loops=1)
-> Hash Join (cost=63349.87..22738772.75 rows=2333467 width=44) (actual time=890293.617..1611550.313 rows=1590144 loops=1)
Hash Cond: (ucac4.rnm = public.grid_catalog.ucac4_rnm)
-> Seq Scan on ucac4 (cost=0.00..16394129.04 rows=455124304 width=40) (actual time=886086.630..1359934.589 rows=113780093 loops=1)
-> Hash (cost=34048.69..34048.69 rows=2344094 width=8) (actual time=4203.785..4203.785 rows=1590144 loops=1)
-> Seq Scan on grid_catalog (cost=0.00..34048.69 rows=2344094 width=8) (actual time=0.014..2813.031 rows=1590144 loops=1)
Filter: (petrov_prikey IS NULL)
-> Hash Join (cost=415.51..1187.80 rows=1 width=36) (actual time=101.604..165.749 rows=7215 loops=1)
Hash Cond: (petrov.prikey = public.grid_catalog.petrov_prikey)
-> Seq Scan on petrov (cost=0.00..709.15 rows=7215 width=32) (actual time=58.280..108.043 rows=7215 loops=1)
-> Hash (cost=282.08..282.08 rows=10675 width=8) (actual time=43.276..43.276 rows=7215 loops=1)
-> Index Scan using grid_catalog_petrov_prikey_idx on grid_catalog (cost=0.00..282.08 rows=10675 width=8) (actual time=19.387..37.533 rows=7215 loops=1)
-> Hash (cost=0.18..0.18 rows=5 width=4) (actual time=0.035..0.035 rows=5 loops=1)
-> HashAggregate (cost=0.13..0.18 rows=5 width=4) (actual time=0.026..0.030 rows=5 loops=1)
-> Limit (cost=0.00..0.07 rows=5 width=4) (actual time=0.009..0.017 rows=5 loops=1)
-> Seq Scan on grid_catalog (cost=0.00..34048.69 rows=2354769 width=4) (actual time=0.007..0.009 rows=5 loops=1)
Total runtime: 1625108.504 ms
(24 rows)
Time: 1625466.830 ms
To see the time to scan through the view:
select count(*) from grid_catalog_view;
count
---------
1597359
(1 row)
Time: 1033732.786 ms
My view is defined as:
PS1=# \d grid_catalog_view
View "public.grid_catalog_view"
Column | Type | Modifiers
--------+------------------+-----------
ra | double precision |
dec | double precision |
pmrac | integer |
pmdc | integer |
ident | integer |
nest4 | integer |
nest6 | integer |
nest7 | integer |
subset | integer |
View definition:
SELECT ucac4.ra, ucac4."dec", ucac4.pmrac, ucac4.pmdc, ucac4.rnm AS ident, ucac4.nest4, ucac4.nest6, ucac4.nest7, grid_catalog.subset
FROM ucac4, grid_catalog
WHERE ucac4.rnm = grid_catalog.ucac4_rnm AND grid_catalog.petrov_prikey IS NULL
UNION
SELECT petrov.ra, petrov."dec", 0 AS pmrac, 0 AS pmdc, grid_catalog.petrov_prikey AS ident, petrov.nest4, petrov.nest6, petrov.nest7, grid_catalog.subset
FROM petrov, grid_catalog
WHERE petrov.prikey = grid_catalog.petrov_prikey AND grid_catalog.ucac4_rnm IS NULL;
The large table is defined as:
PS1=# \d ucac4
Table "public.ucac4"
Column | Type | Modifiers
----------+------------------+-----------
radi | bigint |
spdi | bigint |
magm | smallint |
maga | smallint |
sigmag | smallint |
objt | smallint |
cdf | smallint |
... deleted entries not of relevance ...
ra | double precision |
dec | double precision |
x | double precision |
y | double precision |
z | double precision |
nest4 | integer |
nest6 | integer |
nest7 | integer |
Indexes:
"ucac4_pkey" PRIMARY KEY, btree (rnm)
"q3c_ucac4_idx" btree (q3c_ang2ipix(ra, "dec")) CLUSTER
"ucac4_nest4_idx" btree (nest4)
"ucac4_nest6_idx" btree (nest6)
"ucac4_nest7_idx" btree (nest7)
Referenced by:
TABLE "grid_catalog" CONSTRAINT "grid_catalog_ucac4_rnm_fkey" FOREIGN KEY (ucac4_rnm) REFERENCES ucac4(rnm)
Any idea why my index doesn't seem to be used?

As far as I can see it's a limitation in postgres - it's hard to make it avoid scanning the whole table on a union in this way.
See:
https://www.postgresql-archive.org/Poor-plan-when-joining-against-a-union-containing-a-join-td5747690.html
and
https://www.postgresql-archive.org/Pushing-IN-subquery-down-through-UNION-ALL-td3398684.html
and also maybe related
https://dba.stackexchange.com/questions/47572/in-postgresql-9-3-union-view-with-where-clause-not-taken-into-account
Basically, I guess you need to revisit your view definition. Sorry there's no definitive solution.
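If the two branches of the view really are disjoint (one requires petrov_prikey IS NULL, the other ucac4_rnm IS NULL), one thing worth trying is UNION ALL, which at least removes the Sort/Unique step the planner adds to deduplicate a plain UNION. This is only a sketch and may not fix the predicate push-down issue described in the threads above:
CREATE OR REPLACE VIEW grid_catalog_view AS
SELECT ucac4.ra, ucac4."dec", ucac4.pmrac, ucac4.pmdc, ucac4.rnm AS ident,
       ucac4.nest4, ucac4.nest6, ucac4.nest7, grid_catalog.subset
  FROM ucac4
  JOIN grid_catalog ON ucac4.rnm = grid_catalog.ucac4_rnm
 WHERE grid_catalog.petrov_prikey IS NULL
UNION ALL  -- the branches cannot overlap, so deduplication is unnecessary
SELECT petrov.ra, petrov."dec", 0 AS pmrac, 0 AS pmdc, grid_catalog.petrov_prikey AS ident,
       petrov.nest4, petrov.nest6, petrov.nest7, grid_catalog.subset
  FROM petrov
  JOIN grid_catalog ON petrov.prikey = grid_catalog.petrov_prikey
 WHERE grid_catalog.ucac4_rnm IS NULL;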

Related

postgis/postgresql Using GIST to create a multicolumn index on types geometry(point, 4326) and bigint succeeds, but the query cannot hit all index columns

create extension btree_gist
CREATE INDEX poi_timestamp_midx ON spatio using gist(poi, timestamp);
explain analyze
select count(*) from spatio as a
where ST_DWithin(ST_GeomFromText('Point(30.391324 114.117508)',4326),a.poi,0.01)
and timestamp > 1645727066-2*86400
and timestamp < 1645727066+86400
;
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Aggregate (cost=250.00..250.02 rows=1 width=8) (actual time=13.561..13.563 rows=1 loops=1) |
-> Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=100000 width=8) (actual time=13.541..13.545 rows=32 loops=1) |
Task Count: 32 |
Tuple data received from nodes: 32 bytes |
Tasks Shown: One of 32 |
-> Task |
Tuple data received from node: 1 bytes |
Node: host=10.174.250.40 port=5432 dbname=postgres |
-> Aggregate (cost=33.30..33.31 rows=1 width=8) (actual time=0.087..0.087 rows=1 loops=1) |
-> Index Scan using poi_timestamp_midx_102039 on spatio_102039 a (cost=0.27..33.30 rows=1 width=0) (actual time=0.086..0.086 rows=0 loops=1) |
Index Cond: (poi && st_expand('0101000020E6100000D12346CF2D643E402D41464085875C40'::geometry, '0.01'::double precision)) |
Filter: (("timestamp" > 1645554266) AND ("timestamp" < 1645813466) AND st_dwithin('0101000020E6100000D12346CF2D643E402D41464085875C40'::geometry, poi, '0.01'::double precision))|
Planning Time: 0.133 ms |
Execution Time: 0.102 ms |
Planning Time: 1.730 ms |
Execution Time: 13.630 ms |

Can this self-join be optimized further?

I'm trying to understand if it's possible to optimize the query containing a self-join, and if it is possible - how to do it.
I'm working on a bigger real-life task, but here I extracted a simple sub-task from it to keep focus on a particular issue: optimizing a self-join query.
I have a table called parties. It contains over 85k records and looks like this:
# \d test.parties
Table "test.parties"
Column | Type | Collation | Nullable | Default
-------------+------+-----------+----------+---------
id | uuid | | |
contract_id | uuid | | |
Doing a self-join on contract_id I get this plan:
# explain analyse select p1.id from test.parties p1 join test.parties p2 on p1.contract_id = p2.contract_id;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=20207.87..628157.87 rows=40500000 width=16) (actual time=109.709..184.523 rows=197632 loops=1)
Merge Cond: (p1.contract_id = p2.contract_id)
-> Sort (cost=11181.94..11406.94 rows=90000 width=32) (actual time=55.560..66.173 rows=86332 loops=1)
Sort Key: p1.contract_id
Sort Method: external merge Disk: 3560kB
-> Seq Scan on parties p1 (cost=0.00..1620.00 rows=90000 width=32) (actual time=0.018..14.518 rows=86332 loops=1)
-> Sort (cost=9025.94..9250.94 rows=90000 width=16) (actual time=54.135..74.973 rows=197631 loops=1)
Sort Key: p2.contract_id
Sort Method: external sort Disk: 2544kB
-> Seq Scan on parties p2 (cost=0.00..1620.00 rows=90000 width=16) (actual time=0.009..10.462 rows=86332 loops=1)
Planning Time: 0.167 ms
Execution Time: 199.677 ms
(12 rows)
Adding an index on contract_id I get this plan:
# create index on test.parties(contract_id);
CREATE INDEX
# explain analyse select p1.id from test.parties p1 join test.parties p2 on p1.contract_id = p2.contract_id;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=3084.47..10570.76 rows=192484 width=16) (actual time=32.457..97.662 rows=197632 loops=1)
Hash Cond: (p1.contract_id = p2.contract_id)
-> Seq Scan on parties p1 (cost=0.00..1583.32 rows=86332 width=32) (actual time=0.013..11.293 rows=86332 loops=1)
-> Hash (cost=1583.32..1583.32 rows=86332 width=16) (actual time=32.133..32.133 rows=86332 loops=1)
Buckets: 131072 Batches: 2 Memory Usage: 3048kB
-> Seq Scan on parties p2 (cost=0.00..1583.32 rows=86332 width=16) (actual time=0.007..12.815 rows=86332 loops=1)
Planning Time: 0.444 ms
Execution Time: 110.692 ms
(8 rows)
Is there a way I could get rid of those Seq Scans?
I don't see any index being used in your explain plan, so assuming that you have not yet looked into using indexes, here is one suggestion:
CREATE INDEX idx ON parties (contract_id, id);
This should speed up the join, and it also covers the id value, which is required in the SELECT clause.
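If the planner still prefers sequential scans after adding the index, that is not necessarily wrong: this join returns most of the table, and a hash join over two sequential scans is often the cheapest plan for that shape. A quick, hedged way to check whether an index-only scan is at least possible (the index name here is arbitrary):
CREATE INDEX parties_contract_id_id_idx ON test.parties (contract_id, id);
VACUUM ANALYZE test.parties;  -- refresh statistics and the visibility map, which index-only scans depend on
EXPLAIN ANALYSE
SELECT p1.id
FROM test.parties p1
JOIN test.parties p2 ON p1.contract_id = p2.contract_id;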

PostgreSQL 9.4 query tuning

I have a query that is running too slowly.
select c.vm_name,
round(sum(bytes_sent)*1.8/power(10,9)) gb_sent,
round(sum(bytes_received)*1.8/power(10,9)) gb_received
from groups b,
vms c,
vm_ip_address_histories d,
ip_address_usage_histories e
where b.group_id = c.group_id
and c.vm_id = d.vm_id
and d.ip_address = e.ip_address
and e.datetime >= firstday()
and d.allocation_date <= last_day(sysdate()) and (d.deallocation_date is null or d.deallocation_date > last_day(sysdate()))
and b.customer_id = 29
group by c.vm_name
order by 1;
The function sysdate() returns the current system timestamp without a time zone, and last_day() returns the timestamp that represents the last day of the month. I created these because Hibernate doesn't like the Postgres casting notation.
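The question doesn't show those two helpers; a minimal sketch of what they might look like, with the bodies assumed from the description above:
-- Assumed definitions; only the behaviour is described in the question.
create or replace function sysdate() returns timestamp without time zone as $$
    select now()::timestamp without time zone;
$$ language sql;

create or replace function last_day(ts timestamp without time zone) returns timestamp without time zone as $$
    select date_trunc('month', ts) + interval '1 month' - interval '1 day';
$$ language sql;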
The issue is that the planner is doing full table scans where there are indexes in place. Here is the explain plan for the above query:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=1326387.13..1326391.38 rows=1698 width=24) (actual time=13221.041..13221.042 rows=7 loops=1)
Sort Key: c.vm_name
Sort Method: quicksort Memory: 25kB
-> HashAggregate (cost=1326236.61..1326296.04 rows=1698 width=24) (actual time=13221.008..13221.026 rows=7 loops=1)
Group Key: c.vm_name
-> Hash Join (cost=1309056.97..1325972.10 rows=35268 width=24) (actual time=13131.323..13211.612 rows=13631 loops=1)
Hash Cond: (d.ip_address = e.ip_address)
-> Nested Loop (cost=2.97..6942.24 rows=79 width=15) (actual time=0.249..56.904 rows=192 loops=1)
-> Hash Join (cost=2.69..41.02 rows=98 width=16) (actual time=0.066..0.638 rows=61 loops=1)
Hash Cond: (c.group_id = b.group_id)
-> Seq Scan on vms c (cost=0.00..30.98 rows=1698 width=24) (actual time=0.009..0.281 rows=1698 loops=1)
-> Hash (cost=2.65..2.65 rows=3 width=8) (actual time=0.014..0.014 rows=4 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Seq Scan on groups b (cost=0.00..2.65 rows=3 width=8) (actual time=0.004..0.011 rows=4 loops=1)
Filter: (customer_id = 29)
Rows Removed by Filter: 49
-> Index Scan using xif1vm_ip_address_histories on vm_ip_address_histories d (cost=0.29..70.34 rows=8 width=15) (actual time=0.011..0.921 rows=3 loops=61)
Index Cond: (vm_id = c.vm_id)
Filter: ((allocation_date <= last_day(sysdate())) AND ((deallocation_date IS NULL) OR (deallocation_date > last_day(sysdate()))))
Rows Removed by Filter: 84
-> Hash (cost=1280129.06..1280129.06 rows=1575435 width=23) (actual time=13130.223..13130.223 rows=203702 loops=1)
Buckets: 8192 Batches: 32 Memory Usage: 422kB
-> Seq Scan on ip_address_usage_histories e (cost=0.00..1280129.06 rows=1575435 width=23) (actual time=0.205..13002.776 rows=203702 loops=1)
Filter: (datetime >= firstday())
Rows Removed by Filter: 4522813
Planning time: 0.804 ms
Execution time: 13221.155 ms
(27 rows)
Notice that the planner is choosing to perform very expensive full table scans on the largest tables, ip_address_usage_histories and vm_ip_address_histories. I have tried changing the configuration parameter enable_seqscan to off, but that made the problem worse; total execution time went to 63 seconds.
Here are the definitions of the aforementioned tables:
Table "ip_address_usage_histories"
Column | Type | Modifiers
-----------------------------+-----------------------------+-----------
ip_address_usage_history_id | bigint | not null
datetime | timestamp without time zone | not null
ip_address | inet | not null
bytes_sent | bigint | not null
bytes_received | bigint | not null
Indexes:
"ip_address_usage_histories_pkey" PRIMARY KEY, btree (ip_address_usage_history_id)
"ip_address_usage_histories_datetime_ip_address_key" UNIQUE CONSTRAINT, btree (datetime, ip_address)
"uk_mit6tbiu8k62vdae4tmtnwb3f" UNIQUE CONSTRAINT, btree (datetime, ip_address)
Table "vm_ip_address_histories"
Column | Type | Modifiers
--------------------------+-----------------------------+--------------------------------------------------------------------------------------------
vm_ip_address_history_id | bigint | not null default nextval('vm_ip_address_histories_vm_ip_address_history_id_seq'::regclass)
ip_address | inet | not null
allocation_date | timestamp without time zone | not null
deallocation_date | timestamp without time zone |
vm_id | bigint | not null
Indexes:
"vm_ip_address_histories_pkey" PRIMARY KEY, btree (vm_ip_address_history_id)
"xie1vm_ip_address_histories" btree (replicate_date)
"xif1vm_ip_address_histories" btree (vm_id)
Foreign-key constraints:
"vm_ip_address_histories_vm_id_fkey" FOREIGN KEY (vm_id) REFERENCES vms(vm_id) ON DELETE RESTRICT
It appears that Postgres does not have query hints to direct the planner. I also tried the explicit inner join ... on ... syntax in the from clause, but that did not improve things either.
Update 1
create or replace function firstday() returns timestamp without time zone as $$
begin
return date_trunc('month',now()::timestamp without time zone)::timestamp without time zone;
end; $$
language plpgsql;
I have not tried to replace this function with a standard function because Postgres doesn't have a function that returns the first day of the month to my knowledge.
The following was embedded in the question, but it reads as an answer.
After changing all of my functions to immutable, the query now runs in 200ms! All the right things are happening.
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=51865.24..51914.88 rows=1103 width=24) (actual time=178.793..188.223 rows=7 loops=1)
Group Key: c.vm_name
-> Sort (cost=51865.24..51868.00 rows=1103 width=24) (actual time=178.517..180.541 rows=13823 loops=1)
Sort Key: c.vm_name
Sort Method: quicksort Memory: 1464kB
-> Hash Join (cost=50289.49..51809.50 rows=1103 width=24) (actual time=131.278..155.971 rows=13823 loops=1)
Hash Cond: (d.ip_address = e.ip_address)
-> Nested Loop (cost=2.97..272.36 rows=23 width=15) (actual time=0.149..2.310 rows=192 loops=1)
-> Hash Join (cost=2.69..41.02 rows=98 width=16) (actual time=0.046..0.590 rows=61 loops=1)
Hash Cond: (c.group_id = b.group_id)
-> Seq Scan on vms c (cost=0.00..30.98 rows=1698 width=24) (actual time=0.006..0.250 rows=1698 loops=1)
-> Hash (cost=2.65..2.65 rows=3 width=8) (actual time=0.014..0.014 rows=4 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Seq Scan on groups b (cost=0.00..2.65 rows=3 width=8) (actual time=0.004..0.012 rows=4 loops=1)
Filter: (customer_id = 29)
Rows Removed by Filter: 49
-> Index Scan using xif1vm_ip_address_histories on vm_ip_address_histories d (cost=0.29..2.34 rows=2 width=15) (actual time=0.002..0.027 rows=3 loops=61)
Index Cond: (vm_id = c.vm_id)
Filter: ((allocation_date <= '2015-03-31 00:00:00'::timestamp without time zone) AND ((deallocation_date IS NULL) OR (deallocation_date > '2015-03-31 00:00:00'::timestamp without time zone)))
Rows Removed by Filter: 84
-> Hash (cost=46621.83..46621.83 rows=199575 width=23) (actual time=130.762..130.762 rows=206266 loops=1)
Buckets: 8192 Batches: 4 Memory Usage: 2818kB
-> Bitmap Heap Scan on ip_address_usage_histories e (cost=4627.14..46621.83 rows=199575 width=23) (actual time=18.335..69.763 rows=206266 loops=1)
Recheck Cond: (datetime >= '2015-03-01 00:00:00'::timestamp without time zone)
Heap Blocks: exact=3684
-> Bitmap Index Scan on uk_mit6tbiu8k62vdae4tmtnwb3f (cost=0.00..4577.24 rows=199575 width=0) (actual time=17.797..17.797 rows=206935 loops=1)
Index Cond: (datetime >= '2015-03-01 00:00:00'::timestamp without time zone)
Planning time: 0.837 ms
Execution time: 188.301 ms
(29 rows)
I now see the planner is evaluating the functions and inserting their values into the WHERE clause, which causes the indexes to be used.
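For reference, a sketch of the change described above (signatures assumed; strictly speaking STABLE, not IMMUTABLE, is the correct volatility for anything built on now(), and STABLE is already enough for the planner to evaluate the functions once per query and use the values against the indexes):
alter function firstday() immutable;
alter function sysdate() immutable;
alter function last_day(timestamp without time zone) immutable;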

Slow sql select with index on postgres

I have a production database which is replicated to another host using Londiste. The table looks like this:
# \d+ usermessage
Table "public.usermessage"
Column | Type | Modifiers | Description
-------------------+-------------------+-----------+-------------
id | bigint | not null |
subject | character varying | |
message | character varying | |
read | boolean | |
timestamp | bigint | |
owner | bigint | |
sender | bigint | |
recipient | bigint | |
dao_created | bigint | |
dao_updated | bigint | |
type | integer | |
replymessageid | character varying | |
originalmessageid | character varying | |
replied | boolean | |
mheader | boolean | |
mbody | boolean | |
Indexes:
"usermessage_pkey" PRIMARY KEY, btree (id)
"usermessage_owner_key" btree (owner)
"usermessage_recipient_key" btree (recipient)
"usermessage_timestamp_key" btree ("timestamp")
"usermessage_type_key" btree (type)
Has OIDs: no
If executed on the replicated database, the select is fast as expected; if executed on the production host it's horribly slow. To make things stranger, not all timestamps are slow; some of them are fast on both hosts. The filesystem and the storage behind the production host are fine and not under heavy usage. Any ideas?
replication# explain analyse SELECT COUNT(id) FROM usermessage WHERE owner = 1234567 AND timestamp > 1362077127010;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=263.37..263.38 rows=1 width=8) (actual time=0.059..0.060 rows=1 loops=1)
-> Bitmap Heap Scan on usermessage (cost=259.35..263.36 rows=1 width=8) (actual time=0.055..0.055 rows=0 loops=1)
Recheck Cond: ((owner = 1234567) AND ("timestamp" > 1362077127010::bigint))
-> BitmapAnd (cost=259.35..259.35 rows=1 width=0) (actual time=0.054..0.054 rows=0 loops=1)
-> Bitmap Index Scan on usermessage_owner_key (cost=0.00..19.27 rows=241 width=0) (actual time=0.032..0.032 rows=33 loops=1)
Index Cond: (owner = 1234567)
-> Bitmap Index Scan on usermessage_timestamp_key (cost=0.00..239.82 rows=12048 width=0) (actual time=0.013..0.013 rows=0 loops=1)
Index Cond: ("timestamp" > 1362077127010::bigint)
Total runtime: 0.103 ms
(9 rows)
production# explain analyse SELECT COUNT(id) FROM usermessage WHERE owner = 1234567 AND timestamp > 1362077127010;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=267.39..267.40 rows=1 width=8) (actual time=47536.590..47536.590 rows=1 loops=1)
-> Bitmap Heap Scan on usermessage (cost=263.37..267.38 rows=1 width=8) (actual time=47532.520..47536.579 rows=3 loops=1)
Recheck Cond: ((owner = 1234567) AND ("timestamp" > 1362077127010::bigint))
-> BitmapAnd (cost=263.37..263.37 rows=1 width=0) (actual time=47532.334..47532.334 rows=0 loops=1)
-> Bitmap Index Scan on usermessage_owner_key (cost=0.00..21.90 rows=168 width=0) (actual time=0.123..0.123 rows=46 loops=1)
Index Cond: (owner = 1234567)
-> Bitmap Index Scan on usermessage_timestamp_key (cost=0.00..241.22 rows=12209 width=0) (actual time=47530.255..47530.255 rows=5255617 loops=1)
Index Cond: ("timestamp" > 1362077127010::bigint)
Total runtime: 47536.668 ms
(9 rows)
I am less familiar with PostgreSQL than with MySQL, but
(actual time=0.013..0.013 rows=0 loops=1)
and
(actual time=47530.255..47530.255 rows=5255617 loops=1)
suggest to me that your production DB has far more data matching the timestamp condition, given that the row counts are drastically different.
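If that is the case, one hedged suggestion is an index that matches both predicates together, so the plan no longer depends on how selective the timestamp condition alone happens to be on each host:
-- Combined index: descend to the matching owner first, then apply the timestamp range,
-- instead of bitmap-ANDing a scan over millions of timestamp entries.
CREATE INDEX usermessage_owner_timestamp_key ON usermessage (owner, "timestamp");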

How to create postgres date index properly?

I'm using the Django ORM and PostgreSQL.
The ORM creates a query:
SELECT
(date_part('month', stat_date)) AS "stat_date",
"direct_keywordstat"."banner_id",
SUM("direct_keywordstat"."total") AS "total",
SUM("direct_keywordstat"."clicks") AS "clicks",
SUM("direct_keywordstat"."shows") AS "shows"
FROM "direct_keywordstat"
LEFT OUTER JOIN "direct_banner" ON ("direct_keywordstat"."banner_id" = "direct_banner"."banner_ptr_id")
LEFT OUTER JOIN "platforms_banner" ON ("direct_banner"."banner_ptr_id" = "platforms_banner"."id")
WHERE (
"direct_keywordstat".stat_date BETWEEN E'2009-08-25' AND E'2010-08-25' AND
"direct_keywordstat"."keyword_id" IN (
SELECT U0."id"
FROM "direct_keyword" U0
INNER JOIN "direct_banner" U1 ON (U0."banner_id" = U1."banner_ptr_id")
INNER JOIN "platforms_banner" U2 ON (U1."banner_ptr_id" = U2."id")
INNER JOIN "platforms_campaign" U3 ON (U2."campaign_id" = U3."id")
INNER JOIN "direct_campaign" U4 ON (U3."id" = U4."campaign_ptr_id")
WHERE (
U0."deleted" = E'False' AND
U0."low_ctr" = E'False' AND
U4."status_active" = E'True' AND
U0."banner_id" IN (
SELECT U0."banner_ptr_id"
FROM "direct_banner" U0
INNER JOIN "platforms_banner" U1
ON (U0."banner_ptr_id" = U1."id")
WHERE (
U0."status_show" = E'True' AND
U1."campaign_id" = E'174' )
)
)
)
)
GROUP BY
"direct_keywordstat"."banner_id",
(date_part('month', stat_date)),
"platforms_banner"."title", date_trunc('month', stat_date)
ORDER BY "platforms_banner"."title" ASC, "stat_date" ASC
The problem is that direct_keywordstat contains 3 million+ records, so the query executes in ~15 seconds.
I've tried creating indexes like
CREATE INDEX direct_keywordstat_stat_date on direct_keywordstat using btree(stat_date);
But EXPLAIN ANALYZE shows that the index is not used.
Table schema:
\d direct_keywordstat
Table "public.direct_keywordstat"
Column | Type | Modifiers
-------------+------------------------+-----------------------------------------------------------------
id | integer | not null default nextval('direct_keywordstat_id_seq'::regclass)
keyword_id | integer | not null
banner_id | integer | not null
campaign_id | integer | not null
stat_date | date | not null
region_id | integer | not null
place_type | character varying(30) |
place_name | character varying(100) |
clicks | integer | not null default 0
shows | integer | not null default 0
total | numeric(19,6) | not null
How can I create a useful index?
Or maybe there's a way to optimize this query differently?
The thing is, if the WHERE clause looks like
"direct_keywordstat".clicks BETWEEN 10 AND 3000000
the query executes in 0.8 seconds.
Do you have indexes on these columns:
direct_banner.banner_ptr_id
direct_keywordstat.banner_id
direct_keywordstat.stat_date
Both columns in direct_keywordstat could be combined in a single index; just check that it gets used.
This is also a problem:
Sort Method: external merge Disk: 20600kB
Check your settings for work_mem, you need at least 20MB for this query.
Here it is:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=727967.61..847401.71 rows=2514402 width=67) (actual time=22010.522..23408.262 rows=5 loops=1)
-> Sort (cost=727967.61..734253.62 rows=2514402 width=67) (actual time=21742.365..23134.748 rows=198978 loops=1)
Sort Key: platforms_banner.title, (date_part('month'::text, (direct_keywordstat.stat_date)::timestamp without time zone)), direct_keywordstat.banner_id, (date_trunc('month'::text, (direct_keywordstat.stat_date)::timestamp with time zone))
Sort Method: external merge Disk: 20600kB
-> Hash Join (cost=1034.02..164165.25 rows=2514402 width=67) (actual time=5159.538..14942.441 rows=198978 loops=1)
Hash Cond: (direct_keywordstat.keyword_id = u0.id)
-> Hash Left Join (cost=365.78..117471.99 rows=2514402 width=71) (actual time=26.672..13101.294 rows=2523151 loops=1)
Hash Cond: (direct_keywordstat.banner_id = direct_banner.banner_ptr_id)
-> Seq Scan on direct_keywordstat (cost=0.00..76247.17 rows=2514402 width=25) (actual time=8.892..9386.010 rows=2523151 loops=1)
Filter: ((stat_date >= '2009-08-25'::date) AND (stat_date <= '2010-08-25'::date))
-> Hash (cost=324.86..324.86 rows=3274 width=50) (actual time=17.754..17.754 rows=2851 loops=1)
-> Hash Left Join (cost=209.15..324.86 rows=3274 width=50) (actual time=10.845..15.385 rows=2851 loops=1)
Hash Cond: (direct_banner.banner_ptr_id = platforms_banner.id)
-> Seq Scan on direct_banner (cost=0.00..66.74 rows=3274 width=4) (actual time=0.004..1.196 rows=2851 loops=1)
-> Hash (cost=173.51..173.51 rows=2851 width=50) (actual time=10.683..10.683 rows=2851 loops=1)
-> Seq Scan on platforms_banner (cost=0.00..173.51 rows=2851 width=50) (actual time=0.004..3.576 rows=2851 loops=1)
-> Hash (cost=641.44..641.44 rows=2144 width=4) (actual time=30.420..30.420 rows=106 loops=1)
-> HashAggregate (cost=620.00..641.44 rows=2144 width=4) (actual time=30.162..30.288 rows=106 loops=1)
-> Hash Join (cost=407.17..614.64 rows=2144 width=4) (actual time=16.152..30.031 rows=106 loops=1)
Hash Cond: (u0.banner_id = u1.banner_ptr_id)
-> Nested Loop (cost=76.80..238.50 rows=6488 width=16) (actual time=8.670..22.343 rows=106 loops=1)
-> HashAggregate (cost=76.80..76.87 rows=7 width=8) (actual time=0.045..0.047 rows=1 loops=1)
-> Nested Loop (cost=0.00..76.79 rows=7 width=8) (actual time=0.033..0.036 rows=1 loops=1)
-> Index Scan using platforms_banner_campaign_id on platforms_banner u1 (cost=0.00..22.82 rows=7 width=4) (actual time=0.019..0.020 rows=1 loops=1)
Index Cond: (campaign_id = 174)
-> Index Scan using direct_banner_pkey on direct_banner u0 (cost=0.00..7.70 rows=1 width=4) (actual time=0.009..0.011 rows=1 loops=1)
Index Cond: (u0.banner_ptr_id = u1.id)
Filter: u0.status_show
-> Index Scan using direct_keyword_banner_id on direct_keyword u0 (cost=0.00..23.03 rows=5 width=8) (actual time=8.620..22.127 rows=106 loops=1)
Index Cond: (u0.banner_id = u0.banner_ptr_id)
Filter: ((NOT u0.deleted) AND (NOT u0.low_ctr))
-> Hash (cost=316.84..316.84 rows=1082 width=8) (actual time=7.458..7.458 rows=403 loops=1)
-> Hash Join (cost=227.00..316.84 rows=1082 width=8) (actual time=3.584..7.149 rows=403 loops=1)
Hash Cond: (u1.banner_ptr_id = u2.id)
-> Seq Scan on direct_banner u1 (cost=0.00..66.74 rows=3274 width=4) (actual time=0.002..1.570 rows=2851 loops=1)
-> Hash (cost=213.48..213.48 rows=1082 width=4) (actual time=3.521..3.521 rows=403 loops=1)
-> Hash Join (cost=23.88..213.48 rows=1082 width=4) (actual time=0.715..3.268 rows=403 loops=1)
Hash Cond: (u2.campaign_id = u3.id)
-> Seq Scan on platforms_banner u2 (cost=0.00..173.51 rows=2851 width=8) (actual time=0.001..1.272 rows=2851 loops=1)
-> Hash (cost=22.95..22.95 rows=74 width=8) (actual time=0.345..0.345 rows=37 loops=1)
-> Hash Join (cost=11.84..22.95 rows=74 width=8) (actual time=0.133..0.320 rows=37 loops=1)
Hash Cond: (u3.id = u4.campaign_ptr_id)
-> Seq Scan on platforms_campaign u3 (cost=0.00..8.91 rows=391 width=4) (actual time=0.006..0.098 rows=196 loops=1)
-> Hash (cost=10.91..10.91 rows=74 width=4) (actual time=0.117..0.117 rows=37 loops=1)
-> Seq Scan on direct_campaign u4 (cost=0.00..10.91 rows=74 width=4) (actual time=0.004..0.097 rows=37 loops=1)
Filter: status_active
Total runtime: 23436.715 ms
(47 rows)
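Given that plan (a sequential scan over direct_keywordstat plus a sort spilling to disk), here is a sketch of the two suggestions above; the index name and column order are assumptions, so verify the new plan with EXPLAIN ANALYZE:
-- Combined index covering the join key and the date column used in the BETWEEN filter
CREATE INDEX direct_keywordstat_banner_id_stat_date
    ON direct_keywordstat (banner_id, stat_date);
-- The sort currently spills about 20MB to disk; give it more memory for this session
SET work_mem = '32MB';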