PostgreSQL 9.4 query tuning - postgresql-9.4

I have a query that is running too slowly.
select c.vm_name,
round(sum(bytes_sent)*1.8/power(10,9)) gb_sent,
round(sum(bytes_received)*1.8/power(10,9)) gb_received
from groups b,
vms c,
vm_ip_address_histories d,
ip_address_usage_histories e
where b.group_id = c.group_id
and c.vm_id = d.vm_id
and d.ip_address = e.ip_address
and e.datetime >= firstday()
and d.allocation_date <= last_day(sysdate()) and (d.deallocation_date is null or d.deallocation_date > last_day(sysdate()))
and b.customer_id = 29
group by c.vm_name
order by 1;
The function sysdate() returns the current system timestamp without a time zone, last_day() returns the timestamp that represents the last day of the month. I created these because Hibernate doesn't like the Postgres casting notation.
The issue is that the planner is doing full table scans where there are indexes in place. Here is the explain plan for the above query:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=1326387.13..1326391.38 rows=1698 width=24) (actual time=13221.041..13221.042 rows=7 loops=1)
Sort Key: c.vm_name
Sort Method: quicksort Memory: 25kB
-> HashAggregate (cost=1326236.61..1326296.04 rows=1698 width=24) (actual time=13221.008..13221.026 rows=7 loops=1)
Group Key: c.vm_name
-> Hash Join (cost=1309056.97..1325972.10 rows=35268 width=24) (actual time=13131.323..13211.612 rows=13631 loops=1)
Hash Cond: (d.ip_address = e.ip_address)
-> Nested Loop (cost=2.97..6942.24 rows=79 width=15) (actual time=0.249..56.904 rows=192 loops=1)
-> Hash Join (cost=2.69..41.02 rows=98 width=16) (actual time=0.066..0.638 rows=61 loops=1)
Hash Cond: (c.group_id = b.group_id)
-> Seq Scan on vms c (cost=0.00..30.98 rows=1698 width=24) (actual time=0.009..0.281 rows=1698 loops=1)
-> Hash (cost=2.65..2.65 rows=3 width=8) (actual time=0.014..0.014 rows=4 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Seq Scan on groups b (cost=0.00..2.65 rows=3 width=8) (actual time=0.004..0.011 rows=4 loops=1)
Filter: (customer_id = 29)
Rows Removed by Filter: 49
-> Index Scan using xif1vm_ip_address_histories on vm_ip_address_histories d (cost=0.29..70.34 rows=8 width=15) (actual time=0.011..0.921 rows=3 loops=61)
Index Cond: (vm_id = c.vm_id)
Filter: ((allocation_date <= last_day(sysdate())) AND ((deallocation_date IS NULL) OR (deallocation_date > last_day(sysdate()))))
Rows Removed by Filter: 84
-> Hash (cost=1280129.06..1280129.06 rows=1575435 width=23) (actual time=13130.223..13130.223 rows=203702 loops=1)
Buckets: 8192 Batches: 32 Memory Usage: 422kB
-> Seq Scan on ip_address_usage_histories e (cost=0.00..1280129.06 rows=1575435 width=23) (actual time=0.205..13002.776 rows=203702 loops=1)
Filter: (datetime >= firstday())
Rows Removed by Filter: 4522813
Planning time: 0.804 ms
Execution time: 13221.155 ms
(27 rows)
Notice that the planner is choosing to perform a very expensive full table scans on the largest tables - ip_address_usage_histories and vm_ip_address_histories. I have tried changing the configuration parameter enable_seqscan to off, but that made the problem worse, total execution time went to 63 seconds.
Here are the describes of the aforementioned tables:
Table "ip_address_usage_histories"
Column | Type | Modifiers
-----------------------------+-----------------------------+-----------
ip_address_usage_history_id | bigint | not null
datetime | timestamp without time zone | not null
ip_address | inet | not null
bytes_sent | bigint | not null
bytes_received | bigint | not null
Indexes:
"ip_address_usage_histories_pkey" PRIMARY KEY, btree (ip_address_usage_history_id)
"ip_address_usage_histories_datetime_ip_address_key" UNIQUE CONSTRAINT, btree (datetime, ip_address)
"uk_mit6tbiu8k62vdae4tmtnwb3f" UNIQUE CONSTRAINT, btree (datetime, ip_address)
Table "vm_ip_address_histories"
Column | Type | Modifiers
--------------------------+-----------------------------+--------------------------------------------------------------------------------------------
vm_ip_address_history_id | bigint | not null default nextval('vm_ip_address_histories_vm_ip_address_history_id_seq'::regclass)
ip_address | inet | not null
allocation_date | timestamp without time zone | not null
deallocation_date | timestamp without time zone |
vm_id | bigint | not null
Indexes:
"vm_ip_address_histories_pkey" PRIMARY KEY, btree (vm_ip_address_history_id)
"xie1vm_ip_address_histories" btree (replicate_date)
"xif1vm_ip_address_histories" btree (vm_id)
Foreign-key constraints:
"vm_ip_address_histories_vm_id_fkey" FOREIGN KEY (vm_id) REFERENCES vms(vm_id) ON DELETE RESTRICT
It appears that Postgres does not have query hints to direct the planner. I also tried the from clause inner join ... on ... syntax, but that did not improve things either.
Update 1
create or replace function firstday() returns timestamp without time zone as $$
begin
return date_trunc('month',now()::timestamp without time zone)::timestamp without time zone;
end; $$
language plpgsql;
I have not tried to replace this function with a standard function because Postgres doesn't have a function that returns the first day of the month to my knowledge.

The following was embedded in the question, but it reads as an answer.
After changing the all of my functions to immutable, the query now runs in 200ms! All the right things are happening.
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=51865.24..51914.88 rows=1103 width=24) (actual time=178.793..188.223 rows=7 loops=1)
Group Key: c.vm_name
-> Sort (cost=51865.24..51868.00 rows=1103 width=24) (actual time=178.517..180.541 rows=13823 loops=1)
Sort Key: c.vm_name
Sort Method: quicksort Memory: 1464kB
-> Hash Join (cost=50289.49..51809.50 rows=1103 width=24) (actual time=131.278..155.971 rows=13823 loops=1)
Hash Cond: (d.ip_address = e.ip_address)
-> Nested Loop (cost=2.97..272.36 rows=23 width=15) (actual time=0.149..2.310 rows=192 loops=1)
-> Hash Join (cost=2.69..41.02 rows=98 width=16) (actual time=0.046..0.590 rows=61 loops=1)
Hash Cond: (c.group_id = b.group_id)
-> Seq Scan on vms c (cost=0.00..30.98 rows=1698 width=24) (actual time=0.006..0.250 rows=1698 loops=1)
-> Hash (cost=2.65..2.65 rows=3 width=8) (actual time=0.014..0.014 rows=4 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Seq Scan on groups b (cost=0.00..2.65 rows=3 width=8) (actual time=0.004..0.012 rows=4 loops=1)
Filter: (customer_id = 29)
Rows Removed by Filter: 49
-> Index Scan using xif1vm_ip_address_histories on vm_ip_address_histories d (cost=0.29..2.34 rows=2 width=15) (actual time=0.002..0.027 rows=3 loops=61)
Index Cond: (vm_id = c.vm_id)
Filter: ((allocation_date <= '2015-03-31 00:00:00'::timestamp without time zone) AND ((deallocation_date IS NULL) OR (deallocation_date > '2015-03-31 00:00:00'::timestamp without time zone)))
Rows Removed by Filter: 84
-> Hash (cost=46621.83..46621.83 rows=199575 width=23) (actual time=130.762..130.762 rows=206266 loops=1)
Buckets: 8192 Batches: 4 Memory Usage: 2818kB
-> Bitmap Heap Scan on ip_address_usage_histories e (cost=4627.14..46621.83 rows=199575 width=23) (actual time=18.335..69.763 rows=206266 loops=1)
Recheck Cond: (datetime >= '2015-03-01 00:00:00'::timestamp without time zone)
Heap Blocks: exact=3684
-> Bitmap Index Scan on uk_mit6tbiu8k62vdae4tmtnwb3f (cost=0.00..4577.24 rows=199575 width=0) (actual time=17.797..17.797 rows=206935 loops=1)
Index Cond: (datetime >= '2015-03-01 00:00:00'::timestamp without time zone)
Planning time: 0.837 ms
Execution time: 188.301 ms
(29 rows)
I now see the planner is executing the functions, and using their values to insert into the where clause, which causes the indexes to be used.

Related

Postgres function slower than same ad hoc query

I have had several cases where a Postgres function that returns a table result from a query is much slower than running the actual query. Why is that?
This is one example, but I've found that function is slower than just the query in many cases.
create function trending_names(date_start timestamp with time zone, date_end timestamp with time zone, gender_filter character, country_filter text)
returns TABLE(name_id integer, gender character, country text, score bigint, rank bigint)
language sql
as
$$
select u.name_id,
n.gender,
u.country,
count(u.rank) as score,
row_number() over (order by count(u.rank) desc) as rank
from babynames.user_scores u
inner join babynames.names n on u.name_id = n.id
where u.created_at between date_start and date_end
and u.rank > 0
and n.gender = gender_filter
and u.country = country_filter
group by u.name_id, n.gender, u.country
$$;
This is the query plan for a select from the function:
Function Scan on trending_names (cost=0.25..10.25 rows=1000 width=84) (actual time=1118.673..1118.861 rows=2238 loops=1)
Buffers: shared hit=216509 read=29837
Planning Time: 0.078 ms
Execution Time: 1119.083 ms
Query plan from just running the query. This takes less than half the time.
WindowAgg (cost=44834.98..45593.32 rows=43334 width=25) (actual time=383.387..385.223 rows=2238 loops=1)
Planning Time: 2.512 ms
Execution Time: 387.403 ms
Buffers: shared hit=100446 read=50220
-> Sort (cost=44834.98..44943.31 rows=43334 width=17) (actual time=383.375..383.546 rows=2238 loops=1)
Sort Method: quicksort Memory: 271kB
Sort Key: (count(u.rank)) DESC
Buffers: shared hit=100446 read=50220
-> HashAggregate (cost=41064.22..41497.56 rows=43334 width=17) (actual time=381.088..381.906 rows=2238 loops=1)
" Group Key: u.name_id, u.country, n.gender"
Buffers: shared hit=100446 read=50220
-> Hash Join (cost=5352.15..40630.88 rows=43334 width=13) (actual time=60.710..352.646 rows=36271 loops=1)
Hash Cond: (u.name_id = n.id)
Buffers: shared hit=100446 read=50220
-> Index Scan using user_scores_rank_ix on user_scores u (cost=0.43..35077.55 rows=76796 width=11) (actual time=24.193..287.393 rows=69770 loops=1)
-> Hash (cost=5005.89..5005.89 rows=27667 width=6) (actual time=36.420..36.420 rows=27472 loops=1)
Rows Removed by Filter: 106521
Index Cond: (rank > 0)
Filter: ((created_at >= '2021-01-01 00:00:00+00'::timestamp with time zone) AND (country = 'sv'::text) AND (created_at <= now()))
Buffers: shared hit=99417 read=46856
Buffers: shared hit=1029 read=3364
Buckets: 32768 Batches: 1 Memory Usage: 1330kB
-> Seq Scan on names n (cost=0.00..5005.89 rows=27667 width=6) (actual time=0.022..24.447 rows=27472 loops=1)
Rows Removed by Filter: 21559
Filter: (gender = 'f'::bpchar)
Buffers: shared hit=1029 read=3364
I'm also confused on why it does a Seq scan on names n in the last step since names.id is the primary key and gender is indexed.

Getting high query execution time in Postgres

Table: Customer
Type:
telephone1 | character varying(255)
telephone2 | character varying(255)
location_id | integer
Index:
"idx_customers_location_id" btree (location_id)
"idx_customers_telephone1_txt" btree (telephone1 text_pattern_ops)
"idx_customers_trim_telephone_1" btree (btrim(telephone1::text))
"idx_customers_trim_telephone2" btree (btrim(telephone2::text))
I have a table called customers, total rows are 141182. I was checking values in two columns (telephone1, telephone2), all the column telephone1 has data, but only 8 rows have value for the column telephone2
When I check for the value 1, getting this below execution time.
SELECT customers.id, location_id, telephone1, telephon2 FROM "customers" INNER JOIN "locations" ON
"locations"."id" = "customers"."location_id" WHERE (customers.location_id = 189 AND (telephone1 = '1'
OR telephone2 = '1')) GROUP BY customers.id LIMIT 20 OFFSET 0;
Limit (cost=519.62..519.64 rows=4 width=125) (actual time=25.895..25.898 rows=1 loops=1)
-> GroupAggregate (cost=519.62..519.64 rows=4 width=125) (actual time=25.893..25.896 rows=1 loops=1)
Group Key: customers.id
-> Sort (cost=519.62..519.62 rows=4 width=127) (actual time=25.876..25.879 rows=1 loops=1)
Sort Key: customers.id
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=8.62..519.61 rows=4 width=127) (actual time=10.740..25.869 rows=1 loops=1)
-> Index Scan using locations_pkey on locations (cost=0.06..4.06 rows=1 width=70) (actual time=0.027..0.029 rows=1 loops=1)
Index Cond: (id = 189)
-> Bitmap Heap Scan on customers (cost=8.56..515.54 rows=4 width=61) (actual time=10.707..25.832 rows=1 loops=1)
Recheck Cond: (((telephone1)::text = '1'::text) OR ((telephone2)::text = '1'::text))
Filter: (location_id = 189)
Rows Removed by Filter: 1048
Heap Blocks: exact=1737
-> BitmapOr (cost=8.56..8.56 rows=259 width=0) (actual time=3.445..3.446 rows=0 loops=1)
-> Bitmap Index Scan on idx_customers_telephone1_txt (cost=0.00..2.10 rows=7 width=0) (actual time=0.065..0.066 rows=99 loops=1)
Index Cond: ((telephone1)::text = '1'::text)
-> Bitmap Index Scan on idx_customers_telephone2_txt (cost=0.00..6.47 rows=253 width=0) (actual time=3.378..3.378 rows=1664 loops=1)
Index Cond: ((telephone2)::text = '1'::text)
Planning Time: 0.419 ms
Execution Time: 25.995 ms
When I check for value 0 there is a huge change in the execution time (7753.216 ms)
Limit (cost=0.14..2440.90 rows=10 width=125) (actual time=5900.924..7753.133 rows=4 loops=1)
-> GroupAggregate (cost=0.14..292402.20 rows=1198 width=125) (actual time=5900.922..7753.129 rows=4 loops=1)
Group Key: customers.id
-> Nested Loop (cost=0.14..292395.61 rows=1198 width=127) (actual time=4350.358..7753.087 rows=4 loops=1)
-> Index Scan using customers_pkey on customers (cost=0.09..292387.36 rows=1198 width=61) (actual time=4350.338..7753.054 rows=4 loops=1)
Filter: ((location_id = 189) AND (((telephone1)::text = '0'::text) OR ((telephone2)::text = '0'::text)))
Rows Removed by Filter: 8484280
-> Materialize (cost=0.06..4.06 rows=1 width=70) (actual time=0.005..0.005 rows=1 loops=4)
-> Index Scan using locations_pkey on locations (cost=0.06..4.06 rows=1 width=70) (actual time=0.013..0.013 rows=1 loops=1)
Index Cond: (id = 189)
Planning Time: 0.322 ms
Execution Time: 7753.216 ms
Is there any particular reason, that takes more time to execute for the value 0 ? or anything wrong here?
One more thing I have noticed this issue happens only with column telephone2.
but only 8 rows have value for the column telephone2
Your explain plan indicates otherwise, finding 1664 rows with one specific value for telephone2. Now maybe most of those are not visible, but in that case you really need to VACUUM ANALYZE the table.
Nested Loop (cost=0.14..292395.61 rows=1198 width=127) (actual time=4350.358..7753.087 rows=4 loops=1)
With this second query, it thinks it will find 1198 rows (if run to completion). But it thinks it can stop after the first 20, so that would be 1.67% of index. But instead there are only 4 rows, so it unexpectedly has to scan the entire index without getting to stop early.
Why are the estimates off by so much? I don't know, it could just be stale statistics (again, VACUUM ANALYZE the table), or there could be some interrelation between the columns that make the estimation hard to do even with accurate statistics.
What is the point in joining to locations at all?

PostgreSQL 9.4 Index speed very slow on Select

The box running PostgreSQL 9.4 is a 32 core system at 3.5ghz per core with 128GB or ram with mirrored Samsung pro 850 SSD drives on FreeBSD with ZFS. This is no reason for this poor of performance from PostgreSQL!
psql (9.4.5)
forex=# \d pair_data;
Table "public.pair_data"
Column | Type | Modifiers
------------+-----------------------------+--------------------------------------------------------
id | integer | not null default nextval('pair_data_id_seq'::regclass)
pair_name | character varying(6) |
pair_price | numeric(9,6) |
ts | timestamp without time zone |
Indexes:
"pair_data_pkey" PRIMARY KEY, btree (id)
"date_idx" gin (ts)
"pair_data_gin_idx1" gin (pair_name)
"pair_data_gin_idx2" gin (ts)
With this select:
forex=# explain (analyze, buffers) select pair_name as name, pair_price as price, ts as date from pair_data where pair_name = 'EURUSD' order by ts desc limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=201442.55..201442.57 rows=10 width=22) (actual time=10870.395..10870.430 rows=10 loops=1)
Buffers: shared hit=40 read=139150 dirtied=119
-> Sort (cost=201442.55..203787.44 rows=937957 width=22) (actual time=10870.390..10870.401 rows=10 loops=1)
Sort Key: ts
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=40 read=139150 dirtied=119
-> Bitmap Heap Scan on pair_data (cost=9069.17..181173.63 rows=937957 width=22) (actual time=614.976..8903.913 rows=951858 loops=1)
Recheck Cond: ((pair_name)::text = 'EURUSD'::text)
Rows Removed by Index Recheck: 13624055
Heap Blocks: exact=33464 lossy=105456
Buffers: shared hit=40 read=139150 dirtied=119
-> Bitmap Index Scan on pair_data_gin_idx1 (cost=0.00..8834.68 rows=937957 width=0) (actual time=593.701..593.701 rows=951858 loops=1)
Index Cond: ((pair_name)::text = 'EURUSD'::text)
Buffers: shared hit=40 read=230
Planning time: 0.387 ms
Execution time: 10871.419 ms
(16 rows)
Or this select:
forex=# explain (analyze, buffers) with intervals as ( select start, start + interval '4hr' as end from generate_series('2015-12-01 15:00', '2016-01-19 16:00', interval '4hr') as start ) select distinct intervals.start as date, min(pair_price) over w as low, max(pair_price) over w as high, first_value(pair_price) over w as open, last_value(pair_price) over w as close from intervals join pair_data mb on mb.pair_name = 'EURUSD' and mb.ts >= intervals.start and mb.ts < intervals.end window w as (partition by intervals.start order by mb.ts asc rows between unbounded preceding and unbounded following) order by intervals.start;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=64634864.43..66198331.09 rows=815054 width=23) (actual time=1379732.924..1384602.952 rows=82 loops=1)
Buffers: shared hit=8 read=139210 dirtied=22, temp read=3332596 written=17050
CTE intervals
-> Function Scan on generate_series start (cost=0.00..12.50 rows=1000 width=8) (actual time=0.135..1.801 rows=295 loops=1)
-> Sort (cost=64634851.92..64895429.70 rows=104231111 width=23) (actual time=1379732.918..1381724.179 rows=951970 loops=1)
Sort Key: intervals.start, (min(mb.pair_price) OVER (?)), (max(mb.pair_price) OVER (?)), (first_value(mb.pair_price) OVER (?)), (last_value(mb.pair_price) OVER (?))
Sort Method: external sort Disk: 60808kB
Buffers: shared hit=8 read=139210 dirtied=22, temp read=3332596 written=17050
-> WindowAgg (cost=41474743.35..44341098.90 rows=104231111 width=23) (actual time=1341744.405..1365946.672 rows=951970 loops=1)
Buffers: shared hit=8 read=139210 dirtied=22, temp read=3324995 written=9449
-> Sort (cost=41474743.35..41735321.12 rows=104231111 width=23) (actual time=1341743.502..1343723.884 rows=951970 loops=1)
Sort Key: intervals.start, mb.ts
Sort Method: external sort Disk: 32496kB
Buffers: shared hit=8 read=139210 dirtied=22, temp read=1154778 written=7975
-> Nested Loop (cost=9098.12..21180990.32 rows=104231111 width=23) (actual time=271672.696..1337526.628 rows=951970 loops=1)
Join Filter: ((mb.ts >= intervals.start) AND (mb.ts < intervals."end"))
Rows Removed by Join Filter: 279879180
Buffers: shared hit=8 read=139210 dirtied=22, temp read=1150716 written=3913
-> CTE Scan on intervals (cost=0.00..20.00 rows=1000 width=16) (actual time=0.142..4.075 rows=295 loops=1)
-> Materialize (cost=9098.12..190496.52 rows=938080 width=15) (actual time=2.125..1962.153 rows=951970 loops=295)
Buffers: shared hit=8 read=139210 dirtied=22, temp read=1150716 written=3913
-> Bitmap Heap Scan on pair_data mb (cost=9098.12..181225.12 rows=938080 width=15) (actual time=622.818..11474.987 rows=951970 loops=1)
Recheck Cond: ((pair_name)::text = 'EURUSD'::text)
Rows Removed by Index Recheck: 13623989
Heap Blocks: exact=33485 lossy=105456
Buffers: shared hit=8 read=139210 dirtied=22
-> Bitmap Index Scan on pair_data_gin_idx1 (cost=0.00..8863.60 rows=938080 width=0) (actual time=601.158..601.158 rows=951970 loops=1)
Index Cond: ((pair_name)::text = 'EURUSD'::text)
Buffers: shared hit=8 read=269
Planning time: 0.454 ms
Execution time: 1384653.385 ms
(31 rows)
Table of pair_data only has:
forex=# select count(*) from pair_data;
count
----------
21833886
(1 row)
Why is this doing heap scans when their are indexes? I do not understand what is going on with the query plan? Dose anyone have an idea on where the problem might be?

Postgres Slow group by query with max

I am using postgres 9.1 and I have a table with about 3.5M rows of eventtype (varchar) and eventtime (timestamp) - and some other fields. There are only about 20 different eventtype's and the event time spans about 4 years.
I want to get the last timestamp of each event type. If I run a query like:
select eventtype, max(eventtime)
from allevents
group by eventtype
it takes around 20 seconds. Selecting distinct eventtype's is equally slow. The query plan shows a full sequential scan of the table - not surprising it is slow.
Explain analyse for the above query gives:
HashAggregate (cost=84591.47..84591.68 rows=21 width=21) (actual time=20918.131..20918.141 rows=21 loops=1)
-> Seq Scan on allevents (cost=0.00..66117.98 rows=3694698 width=21) (actual time=0.021..4831.793 rows=3694392 loops=1)
Total runtime: 20918.204 ms
If I add a where clause to select a specific eventtype, it takes anywhere from 40ms to 150ms which is at least decent.
Query plan when selecting specific eventtype:
GroupAggregate (cost=343.87..24942.71 rows=1 width=21) (actual time=98.397..98.397 rows=1 loops=1)
-> Bitmap Heap Scan on allevents (cost=343.87..24871.07 rows=14325 width=21) (actual time=6.820..89.610 rows=19736 loops=1)
Recheck Cond: ((eventtype)::text = 'TEST_EVENT'::text)
-> Bitmap Index Scan on allevents_idx2 (cost=0.00..340.28 rows=14325 width=0) (actual time=6.121..6.121 rows=19736 loops=1)
Index Cond: ((eventtype)::text = 'TEST_EVENT'::text)
Total runtime: 98.482 ms
Primary key is (eventtype, eventtime). I also have the following indexes:
allevents_idx (event time desc, eventtype)
allevents_idx2 (eventtype).
How can I speed up the query?
Results of query play for correlated subquery suggested by #denis below with 14 manually entered values gives:
Function Scan on unnest val (cost=0.00..185.40 rows=100 width=32) (actual time=0.121..8983.134 rows=14 loops=1)
SubPlan 2
-> Result (cost=1.83..1.84 rows=1 width=0) (actual time=641.644..641.645 rows=1 loops=14)
InitPlan 1 (returns $1)
-> Limit (cost=0.00..1.83 rows=1 width=8) (actual time=641.640..641.641 rows=1 loops=14)
-> Index Scan using allevents_idx on allevents (cost=0.00..322672.36 rows=175938 width=8) (actual time=641.638..641.638 rows=1 loops=14)
Index Cond: ((eventtime IS NOT NULL) AND ((eventtype)::text = val.val))
Total runtime: 8983.203 ms
Using the recursive query suggested by #jjanes, the query runs between 4 and 5 seconds with the following plan:
CTE Scan on t (cost=260.32..448.63 rows=101 width=32) (actual time=0.146..4325.598 rows=22 loops=1)
CTE t
-> Recursive Union (cost=2.52..260.32 rows=101 width=32) (actual time=0.075..1.449 rows=22 loops=1)
-> Result (cost=2.52..2.53 rows=1 width=0) (actual time=0.074..0.074 rows=1 loops=1)
InitPlan 1 (returns $1)
-> Limit (cost=0.00..2.52 rows=1 width=13) (actual time=0.070..0.071 rows=1 loops=1)
-> Index Scan using allevents_idx2 on allevents (cost=0.00..9315751.37 rows=3696851 width=13) (actual time=0.070..0.070 rows=1 loops=1)
Index Cond: ((eventtype)::text IS NOT NULL)
-> WorkTable Scan on t (cost=0.00..25.58 rows=10 width=32) (actual time=0.059..0.060 rows=1 loops=22)
Filter: (eventtype IS NOT NULL)
SubPlan 3
-> Result (cost=2.53..2.54 rows=1 width=0) (actual time=0.059..0.059 rows=1 loops=21)
InitPlan 2 (returns $3)
-> Limit (cost=0.00..2.53 rows=1 width=13) (actual time=0.057..0.057 rows=1 loops=21)
-> Index Scan using allevents_idx2 on allevents (cost=0.00..3114852.66 rows=1232284 width=13) (actual time=0.055..0.055 rows=1 loops=21)
Index Cond: (((eventtype)::text IS NOT NULL) AND ((eventtype)::text > t.eventtype))
SubPlan 6
-> Result (cost=1.83..1.84 rows=1 width=0) (actual time=196.549..196.549 rows=1 loops=22)
InitPlan 5 (returns $6)
-> Limit (cost=0.00..1.83 rows=1 width=8) (actual time=196.546..196.546 rows=1 loops=22)
-> Index Scan using allevents_idx on allevents (cost=0.00..322946.21 rows=176041 width=8) (actual time=196.544..196.544 rows=1 loops=22)
Index Cond: ((eventtime IS NOT NULL) AND ((eventtype)::text = t.eventtype))
Total runtime: 4325.694 ms
What you need is a "skip scan" or "loose index scan". PostgreSQL's planner does not yet implement those automatically, but you can trick it into using one by using a recursive query.
WITH RECURSIVE t AS (
SELECT min(eventtype) AS eventtype FROM allevents
UNION ALL
SELECT (SELECT min(eventtype) as eventtype FROM allevents WHERE eventtype > t.eventtype)
FROM t where t.eventtype is not null
)
select eventtype, (select max(eventtime) from allevents where eventtype=t.eventtype) from t;
There may be a way to collapse the max(eventtime) into the recursive query rather than doing it outside that query, but if so I have not hit upon it.
This needs an index on (eventtype, eventtime) in order to be efficient. You can have it be DESC on the eventtime, but that is not necessary. This is efficiently only if eventtype has only a few distinct values (21 of them, in your case).
Based on the question you already have the relevant index.
If upgrading to Postgres 9.3 or an index on (eventtype, eventtime desc) doesn't make a difference, this is a case where rewriting the query so it uses a correlated subquery works very well if you can enumerate all of the event types manually:
select val as eventtype,
(select max(eventtime)
from allevents
where allevents.eventtype = val
) as eventtime
from unnest('{type1,type2,…}'::text[]) as val;
Here's the plans I get when running similar queries:
denis=# select version();
version
-----------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 9.3.1 on x86_64-apple-darwin11.4.2, compiled by Apple LLVM version 4.2 (clang-425.0.28) (based on LLVM 3.2svn), 64-bit
(1 row)
Test data:
denis=# create table test (evttype int, evttime timestamp, primary key (evttype, evttime));
CREATE TABLE
denis=# insert into test (evttype, evttime) select i, now() + (i % 3) * interval '1 min' - j * interval '1 sec' from generate_series(1,10) i, generate_series(1,10000) j;
INSERT 0 100000
denis=# create index on test (evttime, evttype);
CREATE INDEX
denis=# vacuum analyze test;
VACUUM
First query:
denis=# explain analyze select evttype, max(evttime) from test group by evttype; QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=2041.00..2041.10 rows=10 width=12) (actual time=54.983..54.987 rows=10 loops=1)
-> Seq Scan on test (cost=0.00..1541.00 rows=100000 width=12) (actual time=0.009..15.954 rows=100000 loops=1)
Total runtime: 55.045 ms
(3 rows)
Second query:
denis=# explain analyze select val as evttype, (select max(evttime) from test where test.evttype = val) as evttime from unnest('{1,2,3,4,5,6,7,8,9,10}'::int[]) val;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Function Scan on unnest val (cost=0.00..48.39 rows=100 width=4) (actual time=0.086..0.292 rows=10 loops=1)
SubPlan 2
-> Result (cost=0.46..0.47 rows=1 width=0) (actual time=0.024..0.024 rows=1 loops=10)
InitPlan 1 (returns $1)
-> Limit (cost=0.42..0.46 rows=1 width=8) (actual time=0.021..0.021 rows=1 loops=10)
-> Index Only Scan Backward using test_pkey on test (cost=0.42..464.42 rows=10000 width=8) (actual time=0.019..0.019 rows=1 loops=10)
Index Cond: ((evttype = val.val) AND (evttime IS NOT NULL))
Heap Fetches: 0
Total runtime: 0.370 ms
(9 rows)
index on (eventtype, eventtime desc) should help. or reindex on primary key index. I would also recommend replace type of eventtype to enum (if number of types is fixed) or int/smallint. This will decrease size of data and indexes so queries will run faster.

How to create postgres date index properly?

I'm using Django ORM and postgresql.
ORM creates a query:
SELECT
(date_part('month', stat_date)) AS "stat_date",
"direct_keywordstat"."banner_id",
SUM("direct_keywordstat"."total") AS "total",
SUM("direct_keywordstat"."clicks") AS "clicks",
SUM("direct_keywordstat"."shows") AS "shows"
FROM "direct_keywordstat"
LEFT OUTER JOIN "direct_banner" ON ("direct_keywordstat"."banner_id" = "direct_banner"."banner_ptr_id")
LEFT OUTER JOIN "platforms_banner" ON ("direct_banner"."banner_ptr_id" = "platforms_banner"."id")
WHERE (
"direct_keywordstat".stat_date BETWEEN E'2009-08-25' AND E'2010-08-25' AND
"direct_keywordstat"."keyword_id" IN (
SELECT U0."id"
FROM "direct_keyword" U0
INNER JOIN "direct_banner" U1 ON (U0."banner_id" = U1."banner_ptr_id")
INNER JOIN "platforms_banner" U2 ON (U1."banner_ptr_id" = U2."id")
INNER JOIN "platforms_campaign" U3 ON (U2."campaign_id" = U3."id")
INNER JOIN "direct_campaign" U4 ON (U3."id" = U4."campaign_ptr_id")
WHERE (
U0."deleted" = E'False' AND
U0."low_ctr" = E'False' AND
U4."status_active" = E'True' AND
U0."banner_id" IN (
SELECT U0."banner_ptr_id"
FROM "direct_banner" U0
INNER JOIN "platforms_banner" U1
ON (U0."banner_ptr_id" = U1."id")
WHERE (
U0."status_show" = E'True' AND
U1."campaign_id" = E'174' )
)
)
)
)
GROUP BY
"direct_keywordstat"."banner_id",
(date_part('month', stat_date)),
"platforms_banner"."title", date_trunc('month', stat_date)
ORDER BY "platforms_banner"."title" ASC, "stat_date" ASC
Problem is, direct_keywordstat contains 3mln+ records, so the query executes in ~15 seconds.
I've tried creating indexes like
CREATE INDEX direct_keywordstat_stat_date on direct_keywordstat using btree(stat_date);
But EXPLAIN ANALYZE show that index is not used.
Table schema:
\d direct_keywordstat
Table "public.direct_keywordstat"
Column | Type | Modifiers
-------------+------------------------+-----------------------------------------------------------------
id | integer | not null default nextval('direct_keywordstat_id_seq'::regclass)
keyword_id | integer | not null
banner_id | integer | not null
campaign_id | integer | not null
stat_date | date | not null
region_id | integer | not null
place_type | character varying(30) |
place_name | character varying(100) |
clicks | integer | not null default 0
shows | integer | not null default 0
total | numeric(19,6) | not null
How can i create useful index?
Or, maybe, there's a chance to optimize this query other way?
Thing is, if WHERE looks like
"direct_keywordstat".clicks BETWEEN 10 AND 3000000
query executes in 0.8 seconds.
Do you have indexes on these columns:
direct_banner.banner_ptr_id
direct_keywordstat.banner_id
direct_keywordstat.stat_date
Both columns in direct_keywordstat could be combined in a single index, just check
This is also a problem:
Sort Method: external merge Disk:
20600kB
Check your settings for work_mem, you need at least 20MB for this query.
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=727967.61..847401.71 rows=2514402 width=67) (actual time=22010.522..23408.262 rows=5 loops=1)
-> Sort (cost=727967.61..734253.62 rows=2514402 width=67) (actual time=21742.365..23134.748 rows=198978 loops=1)
Sort Key: platforms_banner.title, (date_part('month'::text, (direct_keywordstat.stat_date)::timestamp without time zone)), direct_keywordstat.banner_id, (date_trunc('month'::text, (direct_keywordstat.stat_date)::timestamp with time zone))
Sort Method: external merge Disk: 20600kB
-> Hash Join (cost=1034.02..164165.25 rows=2514402 width=67) (actual time=5159.538..14942.441 rows=198978 loops=1)
Hash Cond: (direct_keywordstat.keyword_id = u0.id)
-> Hash Left Join (cost=365.78..117471.99 rows=2514402 width=71) (actual time=26.672..13101.294 rows=2523151 loops=1)
Hash Cond: (direct_keywordstat.banner_id = direct_banner.banner_ptr_id)
-> Seq Scan on direct_keywordstat (cost=0.00..76247.17 rows=2514402 width=25) (actual time=8.892..9386.010 rows=2523151 loops=1)
Filter: ((stat_date >= '2009-08-25'::date) AND (stat_date <= '2010-08-25'::date))
-> Hash (cost=324.86..324.86 rows=3274 width=50) (actual time=17.754..17.754 rows=2851 loops=1)
-> Hash Left Join (cost=209.15..324.86 rows=3274 width=50) (actual time=10.845..15.385 rows=2851 loops=1)
Hash Cond: (direct_banner.banner_ptr_id = platforms_banner.id)
-> Seq Scan on direct_banner (cost=0.00..66.74 rows=3274 width=4) (actual time=0.004..1.196 rows=2851 loops=1)
-> Hash (cost=173.51..173.51 rows=2851 width=50) (actual time=10.683..10.683 rows=2851 loops=1)
-> Seq Scan on platforms_banner (cost=0.00..173.51 rows=2851 width=50) (actual time=0.004..3.576 rows=2851 loops=1)
-> Hash (cost=641.44..641.44 rows=2144 width=4) (actual time=30.420..30.420 rows=106 loops=1)
-> HashAggregate (cost=620.00..641.44 rows=2144 width=4) (actual time=30.162..30.288 rows=106 loops=1)
-> Hash Join (cost=407.17..614.64 rows=2144 width=4) (actual time=16.152..30.031 rows=106 loops=1)
Hash Cond: (u0.banner_id = u1.banner_ptr_id)
-> Nested Loop (cost=76.80..238.50 rows=6488 width=16) (actual time=8.670..22.343 rows=106 loops=1)
-> HashAggregate (cost=76.80..76.87 rows=7 width=8) (actual time=0.045..0.047 rows=1 loops=1)
-> Nested Loop (cost=0.00..76.79 rows=7 width=8) (actual time=0.033..0.036 rows=1 loops=1)
-> Index Scan using platforms_banner_campaign_id on platforms_banner u1 (cost=0.00..22.82 rows=7 width=4) (actual time=0.019..0.020 rows=1 loops=1)
Index Cond: (campaign_id = 174)
-> Index Scan using direct_banner_pkey on direct_banner u0 (cost=0.00..7.70 rows=1 width=4) (actual time=0.009..0.011 rows=1 loops=1)
Index Cond: (u0.banner_ptr_id = u1.id)
Filter: u0.status_show
-> Index Scan using direct_keyword_banner_id on direct_keyword u0 (cost=0.00..23.03 rows=5 width=8) (actual time=8.620..22.127 rows=106 loops=1)
Index Cond: (u0.banner_id = u0.banner_ptr_id)
Filter: ((NOT u0.deleted) AND (NOT u0.low_ctr))
-> Hash (cost=316.84..316.84 rows=1082 width=8) (actual time=7.458..7.458 rows=403 loops=1)
-> Hash Join (cost=227.00..316.84 rows=1082 width=8) (actual time=3.584..7.149 rows=403 loops=1)
Hash Cond: (u1.banner_ptr_id = u2.id)
-> Seq Scan on direct_banner u1 (cost=0.00..66.74 rows=3274 width=4) (actual time=0.002..1.570 rows=2851 loops=1)
-> Hash (cost=213.48..213.48 rows=1082 width=4) (actual time=3.521..3.521 rows=403 loops=1)
-> Hash Join (cost=23.88..213.48 rows=1082 width=4) (actual time=0.715..3.268 rows=403 loops=1)
Hash Cond: (u2.campaign_id = u3.id)
-> Seq Scan on platforms_banner u2 (cost=0.00..173.51 rows=2851 width=8) (actual time=0.001..1.272 rows=2851 loops=1)
-> Hash (cost=22.95..22.95 rows=74 width=8) (actual time=0.345..0.345 rows=37 loops=1)
-> Hash Join (cost=11.84..22.95 rows=74 width=8) (actual time=0.133..0.320 rows=37 loops=1)
Hash Cond: (u3.id = u4.campaign_ptr_id)
-> Seq Scan on platforms_campaign u3 (cost=0.00..8.91 rows=391 width=4) (actual time=0.006..0.098 rows=196 loops=1)
-> Hash (cost=10.91..10.91 rows=74 width=4) (actual time=0.117..0.117 rows=37 loops=1)
-> Seq Scan on direct_campaign u4 (cost=0.00..10.91 rows=74 width=4) (actual time=0.004..0.097 rows=37 loops=1)
Filter: status_active
Total runtime: 23436.715 ms
(47 rows)
Here it is