Postgres function slower than same ad hoc query

I have had several cases where a Postgres function that returns a table result from a query is much slower than running the actual query. Why is that?
This is one example, but I've found that the function is slower than the plain query in many cases.
create function trending_names(date_start timestamp with time zone, date_end timestamp with time zone, gender_filter character, country_filter text)
returns TABLE(name_id integer, gender character, country text, score bigint, rank bigint)
language sql
as
$$
select u.name_id,
n.gender,
u.country,
count(u.rank) as score,
row_number() over (order by count(u.rank) desc) as rank
from babynames.user_scores u
inner join babynames.names n on u.name_id = n.id
where u.created_at between date_start and date_end
and u.rank > 0
and n.gender = gender_filter
and u.country = country_filter
group by u.name_id, n.gender, u.country
$$;
This is the query plan for a select from the function:
Function Scan on trending_names (cost=0.25..10.25 rows=1000 width=84) (actual time=1118.673..1118.861 rows=2238 loops=1)
Buffers: shared hit=216509 read=29837
Planning Time: 0.078 ms
Execution Time: 1119.083 ms
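The Function Scan node hides what actually runs inside the function. One way to surface the inner plan in the server log is the auto_explain contrib extension with nested-statement logging; this is a hedged sketch, and the parameter values are only guesses taken from the filters visible in the ad hoc plan below.
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;        -- log every statement
SET auto_explain.log_analyze = on;            -- include actual times and row counts
SET auto_explain.log_nested_statements = on;  -- include statements run inside functions

SELECT * FROM trending_names('2021-01-01', now(), 'f', 'sv');
-- the plan chosen for the query inside the function now appears in the log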
Query plan from just running the query. This takes less than half the time.
WindowAgg (cost=44834.98..45593.32 rows=43334 width=25) (actual time=383.387..385.223 rows=2238 loops=1)
Buffers: shared hit=100446 read=50220
-> Sort (cost=44834.98..44943.31 rows=43334 width=17) (actual time=383.375..383.546 rows=2238 loops=1)
Sort Key: (count(u.rank)) DESC
Sort Method: quicksort Memory: 271kB
Buffers: shared hit=100446 read=50220
-> HashAggregate (cost=41064.22..41497.56 rows=43334 width=17) (actual time=381.088..381.906 rows=2238 loops=1)
Group Key: u.name_id, u.country, n.gender
Buffers: shared hit=100446 read=50220
-> Hash Join (cost=5352.15..40630.88 rows=43334 width=13) (actual time=60.710..352.646 rows=36271 loops=1)
Hash Cond: (u.name_id = n.id)
Buffers: shared hit=100446 read=50220
-> Index Scan using user_scores_rank_ix on user_scores u (cost=0.43..35077.55 rows=76796 width=11) (actual time=24.193..287.393 rows=69770 loops=1)
Index Cond: (rank > 0)
Filter: ((created_at >= '2021-01-01 00:00:00+00'::timestamp with time zone) AND (country = 'sv'::text) AND (created_at <= now()))
Rows Removed by Filter: 106521
Buffers: shared hit=99417 read=46856
-> Hash (cost=5005.89..5005.89 rows=27667 width=6) (actual time=36.420..36.420 rows=27472 loops=1)
Buckets: 32768 Batches: 1 Memory Usage: 1330kB
Buffers: shared hit=1029 read=3364
-> Seq Scan on names n (cost=0.00..5005.89 rows=27667 width=6) (actual time=0.022..24.447 rows=27472 loops=1)
Filter: (gender = 'f'::bpchar)
Rows Removed by Filter: 21559
Buffers: shared hit=1029 read=3364
Planning Time: 2.512 ms
Execution Time: 387.403 ms
I'm also confused about why it does a Seq Scan on names n in the last step, since names.id is the primary key and gender is indexed.
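For what it's worth, a plain LANGUAGE sql function is VOLATILE by default, and a volatile set-returning function cannot be inlined, so the query inside it is planned on its own behind the opaque Function Scan and can end up with a different (worse) plan than the ad hoc query. A hedged sketch, not a confirmed diagnosis: the same function declared STABLE (it only reads data), which may let the planner inline the single SELECT into the caller and plan it with the actual parameter values.
-- Hedged sketch: identical body, only the volatility changes. Inlining also
-- requires no SECURITY DEFINER and no SET clauses on the function.
create or replace function trending_names(date_start timestamp with time zone,
                                          date_end timestamp with time zone,
                                          gender_filter character,
                                          country_filter text)
returns TABLE(name_id integer, gender character, country text, score bigint, rank bigint)
language sql
stable
as
$$
select u.name_id,
       n.gender,
       u.country,
       count(u.rank) as score,
       row_number() over (order by count(u.rank) desc) as rank
from babynames.user_scores u
inner join babynames.names n on u.name_id = n.id
where u.created_at between date_start and date_end
  and u.rank > 0
  and n.gender = gender_filter
  and u.country = country_filter
group by u.name_id, n.gender, u.country
$$;
If inlining kicks in, EXPLAIN on a SELECT from the function shows the join plan directly instead of a bare Function Scan.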

Related

Postgres index for aggregate query

SELECT count(e_id) AS count,
e_id
FROM test
WHERE created_at BETWEEN '2021-12-01 00:00:00' AND '2021-12-08 00:00:00'
AND std IN ( '1' )
AND section IN ( 'Sample' )
GROUP BY e_id
ORDER BY count DESC
LIMIT 4
The table has around 1 M records. The query executes in less than 40 ms, but the computation takes a hit at the GROUP BY and the query cost is high.
Limit (cost=26133.76..26133.77 rows=4 width=45) (actual time=52.300..52.303 rows=3 loops=1)
-> Sort (cost=26133.76..26134.77 rows=403 width=45) (actual time=52.299..52.301 rows=3 loops=1)
Sort Key: (count(e_id)) DESC
Sort Method: quicksort Memory: 25kB
-> GroupAggregate (cost=26120.66..26127.72 rows=403 width=45) (actual time=52.287..52.289 rows=3 loops=1)
Group Key: e_id
-> Sort (cost=26120.66..26121.67 rows=404 width=37) (actual time=52.281..52.283 rows=5 loops=1)
Sort Key: e_id
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on test (cost=239.19..26103.17 rows=404 width=37) (actual time=49.339..52.261 rows=5 loops=1)
Recheck Cond: ((section)::text = 'test'::text)
" Filter: ((created_at >= '2021-12-01 00:00:00'::timestamp without time zone) AND (created_at <= '2021-12-08 00:00:00'::timestamp without time zone) AND ((std)::text = ANY ('{1,2}'::text[])))"
Rows Removed by Filter: 38329
Heap Blocks: exact=33997
-> Bitmap Index Scan on index_test_on_section (cost=0.00..239.09 rows=7270 width=0) (actual time=6.815..6.815 rows=38334 loops=1)
Index Cond: ((section)::text = 'test'::text)
How can I optimize the group by and count, so that CPU does not shoot up?
The best index for this query is
CREATE INDEX ON test (section, created_at, std) INCLUDE (e_id);
Then VACUUM the table and try again.
Unless you have shown us the wrong plan, the slow step is not the GROUP BY, but rather the Bitmap Heap Scan.
Your index on "section" returns 38334 rows, of which all but 5 are filtered out. We can't tell whether they are filtered out mostly by the "std" criterion or the "created_at" one. You need a more specific multicolumn index. The one I think is most likely to be effective is on (section, std, created_at).

Postgres Aggregate Query too slow

I have a query which generates a trend response and gets me counts of devices for various dates. The query goes over almost 500k rows. The table has almost 17.5 million records. I have partitioned the table based on id so that it only looks at a specific partition, but it is still quite slow. Each partition has almost 200k records. Any idea how to improve the performance of this?
Query
select start_date, end_date, average, fail_count, warning_count, pass_count from
(
select generate_series(timestamp '2021-01-18 00:00:00', timestamp '2021-02-12 00:00:00', interval '1 day')::date) t(start_date)
LEFT JOIN (
SELECT start_date, end_date, avg(score) as average
, count(*) FILTER (WHERE status = 'Fail') AS fail_count
, count(*) FILTER (WHERE status = 'Warning') AS warning_count
, count(*) FILTER (WHERE status = 'Pass') AS pass_count
FROM performance.tenant_based scd join performance.hierarchy dh on dh.id = scd.id and dh.tag = scd.tag
where dh.parent_id in (0,1,2,3,4,5,6,7,8,9,10) and dh.child_id in (0,1,2,3,4,5,6,7,8,9,10) and dh.desc in ('test')
and dh.id ='ita68f0c03880e4c6694859dfa74f1cdf6' AND start_date >= '2021-01-18 00:00:00' -- same date range as above
AND start_date <= '2021-02-12 00:00:00'
GROUP BY 1,2
) s USING (start_date)
ORDER BY 1;
The Query plan is below
Sort (cost=241350.02..241850.02 rows=200000 width=104) (actual time=3453.888..3453.890 rows=26 loops=1)
Sort Key: (((((generate_series('2021-01-18 00:00:00'::timestamp without time zone, '2021-02-12 00:00:00'::timestamp without time zone, '1 day'::interval)))::date))::timestamp without time zone)
Sort Method: quicksort Memory: 28kB
-> Merge Left Join (cost=201014.95..212802.88 rows=200000 width=104) (actual time=2901.012..3453.867 rows=26 loops=1)
Merge Cond: ((((generate_series('2021-01-18 00:00:00'::timestamp without time zone, '2021-02-12 00:00:00'::timestamp without time zone, '1 day'::interval)))::date) = scd.start_date)
-> Sort (cost=79.85..82.35 rows=1000 width=4) (actual time=0.015..0.024 rows=26 loops=1)
Sort Key: (((generate_series('2021-01-18 00:00:00'::timestamp without time zone, '2021-02-12 00:00:00'::timestamp without time zone, '1 day'::interval)))::date)
Sort Method: quicksort Memory: 26kB
-> Result (cost=0.00..20.02 rows=1000 width=4) (actual time=0.003..0.009 rows=26 loops=1)
-> ProjectSet (cost=0.00..5.02 rows=1000 width=8) (actual time=0.002..0.006 rows=26 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.000 rows=1 loops=1)
-> Materialize (cost=200935.11..209318.03 rows=40000 width=72) (actual time=2900.992..3453.789 rows=25 loops=1)
-> Finalize GroupAggregate (cost=200935.11..208818.03 rows=40000 width=72) (actual time=2900.990..3453.771 rows=25 loops=1)
Group Key: scd.start_date, scd.end_date
-> Gather Merge (cost=200935.11..207569.38 rows=49910 width=72) (actual time=2879.365..3453.827 rows=75 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial GroupAggregate (cost=199935.08..200808.51 rows=24955 width=72) (actual time=2686.465..3228.313 rows=25 loops=3)
Group Key: scd.start_date, scd.end_date
-> Sort (cost=199935.08..199997.47 rows=24955 width=25) (actual time=2664.518..2860.477 rows=1666667 loops=3)
Sort Key: scd.start_date, scd.end_date
Sort Method: external merge Disk: 59840kB
-> Hash Join (cost=44891.11..198112.49 rows=24955 width=25) (actual time=111.653..1817.228 rows=1666667 loops=3)
Hash Cond: (scd.tag = (dh.tag)::text)
-> Append (cost=0.00..145159.33 rows=2083333 width=68) (actual time=0.006..591.818 rows=1666667 loops=3)
-> Parallel Seq Scan on ita68f0c03880e4c6694859dfa74f1cdf6 scd (cost=0.00..145159.33 rows=2083333 width=68) (actual time=0.006..455.525 rows=1666667 loops=3)
Filter: ((start_date >= '2021-01-18 00:00:00'::timestamp without time zone) AND (start_date <= '2021-02-12 00:00:00'::timestamp without time zone) AND ((id)::text = 'ita68f0c03880e4c6694859dfa74f1cdf6'::text))
-> Hash (cost=44638.71..44638.71 rows=20192 width=45) (actual time=111.502..111.502 rows=200000 loops=3)
Buckets: 65536 (originally 32768) Batches: 8 (originally 1) Memory Usage: 3585kB
-> Bitmap Heap Scan on hierarchy dh (cost=1339.01..44638.71 rows=20192 width=45) (actual time=26.542..62.078 rows=200000 loops=3)
Recheck Cond: (((id)::text = 'ita68f0c03880e4c6694859dfa74f1cdf6'::text) AND (parent_id = ANY ('{0,1,2,3,4,5,6,7,8,9,10}'::integer[])) AND (child_id = ANY ('{0,1,2,3,4,5,6,7,8,9,10}'::integer[])) AND ((desc)::text = 'test'::text))
Heap Blocks: exact=5717
-> Bitmap Index Scan on hierarchy_id_region_idx (cost=0.00..1333.96 rows=20192 width=0) (actual time=25.792..25.792 rows=200000 loops=3)
Index Cond: (((id)::text = 'ita68f0c03880e4c6694859dfa74f1cdf6'::text) AND (parent_id = ANY ('{0,1,2,3,4,5,6,7,8,9,10}'::integer[])) AND (child_id = ANY ('{0,1,2,3,4,5,6,7,8,9,10}'::integer[])) AND ((desc)::text = 'test'::text))
Planning time: 0.602 ms
Execution time: 3463.440 ms
After going through several rounds of trial and error we landed on a materialized view for this query. The number of rows the query was scanning was almost 500k+, and neither indexes nor partitioning were helping. We tweaked the above query to create a materialized view and then select on top of it. We are now at 96 ms. The generalized materialized view for the query in this question is shown below.
CREATE MATERIALIZED VIEW performance.daily_trends
TABLESPACE pg_default
AS SELECT s.id,
d.parent_id,
d.child_id,
d.desc,
s.start_date,
s.end_date,
count(*) FILTER (WHERE s.overall_status::text = 'Fail'::text) AS fail_count,
count(*) FILTER (WHERE s.overall_status::text = 'Warning'::text) AS warning_count,
count(*) FILTER (WHERE s.overall_status::text = 'Pass'::text) AS pass_count,
avg(s.score) AS average_score
FROM performance.tenant_based s
JOIN performance.hierarchy d ON s.id::text = d.id::text AND s.tag = d.tag::text
WHERE s.start_date >= (CURRENT_DATE - 45) AND s.start_date <= CURRENT_DATE
GROUP BY s.id, d.parent_id, d.child_id, d.desc, s.start_date, s.end_date
WITH DATA;
Thanks to all who tried to help with this.
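One maintenance step this approach implies (not shown above) is that a materialized view is a snapshot and has to be refreshed to pick up new rows. A hedged sketch; the unique index and its name are assumptions needed only for the CONCURRENTLY variant.
-- Plain refresh: rebuilds the view but blocks readers while it runs.
REFRESH MATERIALIZED VIEW performance.daily_trends;

-- Concurrent refresh: readers keep working, but it requires a unique index
-- covering every row, e.g. over the grouping columns (hypothetical index name).
CREATE UNIQUE INDEX daily_trends_uq
    ON performance.daily_trends (id, parent_id, child_id, "desc", start_date, end_date);
REFRESH MATERIALIZED VIEW CONCURRENTLY performance.daily_trends;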

Postgres Optimizer: Why does it lie about costs? [EDIT] How to pick random_page_cost?

I've got the following issue with Postgres:
I have two tables, A and B:
A has 64 million records
B has 16 million records
A has a b_id field which is indexed --> ix_A_b_id
B has a datetime_field which is indexed --> ix_B_datetime
I have the following query:
SELECT
A.id,
B.some_field
FROM
A
JOIN
B
ON A.b_id = B.id
WHERE
B.datetime_field BETWEEN 'from' AND 'to'
This query is fine when the difference between from and to is small; in that case Postgres uses both indexes and I get results quite fast.
When the difference between the dates is bigger the query slows down a lot, because Postgres decides to use ix_B_datetime only and then does a full scan on the table with 64 M records... which is simply stupid.
I found the point at which the optimizer decides that using a full scan is faster.
For dates between
2019-03-10 17:05:00 and 2019-03-15 01:00:00
it estimates a similar cost to
2019-03-10 17:00:00 and 2019-03-15 01:00:00.
But the fetch time for the first query is about 50 ms, while the second takes almost 2 minutes.
Plans are below
Nested Loop (cost=1.00..3484455.17 rows=113057 width=8)
-> Index Scan using ix_B_datetime on B (cost=0.44..80197.62 rows=28561 width=12)
Index Cond: ((datetime_field >= '2019-03-10 17:05:00'::timestamp without time zone) AND (datetime_field < '2019-03-15 01:00:00'::timestamp without time zone))
-> Index Scan using ix_A_b_id on A (cost=0.56..112.18 rows=701 width=12)
Index Cond: (b_id = B.id)
Hash Join (cost=80615.72..3450771.89 rows=113148 width=8)
Hash Cond: (A.b_id = B.id)
-> Seq Scan on spot (cost=0.00..3119079.50 rows=66652050 width=12)
-> Hash (cost=80258.42..80258.42 rows=28584 width=12)
-> Index Scan using ix_B_datetime on B (cost=0.44..80258.42 rows=28584 width=12)
Index Cond: ((datetime_field >= '2019-03-10 17:00:00'::timestamp without time zone) AND (datetime_field < '2019-03-15 01:00:00'::timestamp without time zone))
So my question is: why does my Postgres lie about the costs? Why does it estimate something as more expensive than it actually is? How can I fix that?
As a temporary measure I had to rewrite the query to always use the index on table A, but I don't like the following solution, because it's hacky and unclear, and it is slower for small chunks of data though much faster for bigger chunks:
with cc as (
select id, some_field from B WHERE B.datetime_field >= '2019-03-08'
AND B.datetime_field < '2019-03-15'
)
SELECT X.id, Y.some_field
FROM (SELECT b_id, id from A where b_id in (SELECT id from cc)) X
JOIN (SELECT id, some_field FROM cc) Y ON X.b_id = Y.id
EDIT:
So, as @a_horse_with_no_name suggested, I've played with RANDOM_PAGE_COST.
I've modified the query to count the number of entries, because fetching everything was unnecessary, so the query now looks like this:
SELECT count(*) FROM (
SELECT
A.id,
B.some_field
FROM
A
JOIN
B
ON A.b_id = B.id
WHERE
B.datetime_field BETWEEN '2019-03-01 00:00:00' AND '2019-03-15 01:00:00'
) A
And I've tested different levels of cost
RANDOM_PAGE_COST=0.25
Aggregate (cost=3491773.34..3491773.35 rows=1 width=8) (actual time=4166.998..4166.999 rows=1 loops=1)
Buffers: shared hit=1939402
-> Nested Loop (cost=1.00..3490398.51 rows=549932 width=0) (actual time=0.041..3620.975 rows=2462836 loops=1)
Buffers: shared hit=1939402
-> Index Scan using ix_B_datetime_field on B (cost=0.44..24902.79 rows=138927 width=8) (actual time=0.013..364.018 rows=313399 loops=1)
Index Cond: ((datetime_field >= '2019-03-01 00:00:00'::timestamp without time zone) AND (datetime_field < '2019-03-15 01:00:00'::timestamp without time zone))
Buffers: shared hit=311461
-> Index Only Scan using A_b_id_index on A (cost=0.56..17.93 rows=701 width=8) (actual time=0.004..0.007 rows=8 loops=313399)
Index Cond: (b_id = B.id)
Heap Fetches: 2462836
Buffers: shared hit=1627941
Planning time: 0.316 ms
Execution time: 4167.040 ms
RANDOM_PAGE_COST=1
Aggregate (cost=3918191.39..3918191.40 rows=1 width=8) (actual time=281236.100..281236.101 rows=1 loops=1)
" Buffers: shared hit=7531789 read=2567818, temp read=693 written=693"
-> Merge Join (cost=102182.07..3916816.56 rows=549932 width=0) (actual time=243755.551..280666.992 rows=2462836 loops=1)
Merge Cond: (A.b_id = B.id)
" Buffers: shared hit=7531789 read=2567818, temp read=693 written=693"
-> Index Only Scan using A_b_id_index on A (cost=0.56..3685479.55 rows=66652050 width=8) (actual time=0.010..263635.124 rows=64700055 loops=1)
Heap Fetches: 64700055
Buffers: shared hit=7220328 read=2567818
-> Materialize (cost=101543.05..102237.68 rows=138927 width=8) (actual time=523.618..1287.145 rows=2503965 loops=1)
" Buffers: shared hit=311461, temp read=693 written=693"
-> Sort (cost=101543.05..101890.36 rows=138927 width=8) (actual time=523.616..674.736 rows=313399 loops=1)
Sort Key: B.id
Sort Method: external merge Disk: 5504kB
" Buffers: shared hit=311461, temp read=693 written=693"
-> Index Scan using ix_B_datetime_field on B (cost=0.44..88589.92 rows=138927 width=8) (actual time=0.013..322.016 rows=313399 loops=1)
Index Cond: ((datetime_field >= '2019-03-01 00:00:00'::timestamp without time zone) AND (datetime_field < '2019-03-15 01:00:00'::timestamp without time zone))
Buffers: shared hit=311461
Planning time: 0.314 ms
Execution time: 281237.202 ms
RANDOM_PAGE_COST=2
Aggregate (cost=4072947.53..4072947.54 rows=1 width=8) (actual time=166896.775..166896.776 rows=1 loops=1)
" Buffers: shared hit=696849 read=2067171, temp read=194524 written=194516"
-> Hash Join (cost=175785.69..4071572.70 rows=549932 width=0) (actual time=29321.835..166332.812 rows=2462836 loops=1)
Hash Cond: (A.B_id = B.id)
" Buffers: shared hit=696849 read=2067171, temp read=194524 written=194516"
-> Seq Scan on A (cost=0.00..3119079.50 rows=66652050 width=8) (actual time=0.008..108959.789 rows=64700055 loops=1)
Buffers: shared hit=437580 read=2014979
-> Hash (cost=173506.11..173506.11 rows=138927 width=8) (actual time=29321.416..29321.416 rows=313399 loops=1)
Buckets: 131072 (originally 131072) Batches: 8 (originally 2) Memory Usage: 4084kB
" Buffers: shared hit=259269 read=52192, temp written=803"
-> Index Scan using ix_B_datetime_field on B (cost=0.44..173506.11 rows=138927 width=8) (actual time=1.676..29158.413 rows=313399 loops=1)
Index Cond: ((datetime_field >= '2019-03-01 00:00:00'::timestamp without time zone) AND (datetime_field < '2019-03-15 01:00:00'::timestamp without time zone))
Buffers: shared hit=259269 read=52192
Planning time: 7.367 ms
Execution time: 166896.824 ms
It's still unclear to me: a cost of 0.25 works best for me, but everywhere I read that for SSD disks it should be 1-1.5 (I'm using an AWS instance with SSDs).
What is weird is that the plan at cost 1 is worse than at both 2 and 0.25.
So what value should I pick? Is there any way to calculate it?
In this case the efficiency ranking is 0.25 > 2 > 1, but what about other cases? How can I be sure that 0.25, which is good for my query, won't break other queries? Do I need to write performance tests for every query I've got?
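For experimentation, random_page_cost does not have to be changed globally. A hedged sketch of the narrower scopes it can be set at; the value 1.1 is only an example, not a recommendation.
-- Per session, just for testing a single query:
SET random_page_cost = 1.1;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM A JOIN B ON A.b_id = B.id
WHERE B.datetime_field BETWEEN '2019-03-01 00:00:00' AND '2019-03-15 01:00:00';
RESET random_page_cost;

-- Per tablespace (e.g. if only some of the data lives on SSD):
ALTER TABLESPACE pg_default SET (random_page_cost = 1.1);

-- Cluster-wide, persisted (takes effect after a config reload):
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();
Independent of the chosen value, the Heap Fetches: 64700055 in the cost=1 plan suggests the visibility map of table A is stale, so a VACUUM could change which plan wins regardless of the setting.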

Postgres' planning takes disproportionate time compared to execution

Postgres 9.6 running on Amazon RDS.
I have 2 tables:
aggregate events - a big table with 6 keys (ids)
campaign metadata - a small table with the campaign definitions.
I join the two in order to filter on metadata like campaign name.
The query produces a report of displays broken down by campaign channel and date (the date granularity is daily).
There are no FKs and no NOT NULL constraints. The report table has multiple lines per day per campaign (because the aggregation is based on a 6-attribute key).
When I join, the query planning grows to 10 s (vs 300 ms).
explain analyze select c.campaign_channel as channel,date as day , sum( displayed ) as displayed
from report_campaigns c
left join events_daily r on r.campaign_id = c.c_id
where provider_id = 7726 and c.p_id = 7726 and c.campaign_name <> 'test'
and date >= '20170513 12:00' and date <= '20170515 12:00'
group by c.campaign_channel,date;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=71461.93..71466.51 rows=229 width=22) (actual time=104.189..114.788 rows=6 loops=1)
Group Key: c.campaign_channel, r.date
-> Sort (cost=71461.93..71462.51 rows=229 width=18) (actual time=100.263..106.402 rows=31205 loops=1)
Sort Key: c.campaign_channel, r.date
Sort Method: quicksort Memory: 3206kB
-> Hash Join (cost=1092.52..71452.96 rows=229 width=18) (actual time=22.149..86.955 rows=31205 loops=1)
Hash Cond: (r.campaign_id = c.c_id)
-> Append (cost=0.00..70245.84 rows=29948 width=20) (actual time=21.318..71.315 rows=31205 loops=1)
-> Seq Scan on events_daily r (cost=0.00..0.00 rows=1 width=20) (actual time=0.005..0.005 rows=0 loops=1)
Filter: ((date >= '2017-05-13 12:00:00'::timestamp without time zone) AND (date <= '2017-05-15 12:00:00'::timestamp without time zone) AND (provider_id =
-> Bitmap Heap Scan on events_daily_20170513 r_1 (cost=685.36..23913.63 rows=1 width=20) (actual time=17.230..17.230 rows=0 loops=1)
Recheck Cond: (provider_id = 7726)
Filter: ((date >= '2017-05-13 12:00:00'::timestamp without time zone) AND (date <= '2017-05-15 12:00:00'::timestamp without time zone))
Rows Removed by Filter: 13769
Heap Blocks: exact=10276
-> Bitmap Index Scan on events_daily_20170513_full_idx (cost=0.00..685.36 rows=14525 width=0) (actual time=2.356..2.356 rows=13769 loops=1)
Index Cond: (provider_id = 7726)
-> Bitmap Heap Scan on events_daily_20170514 r_2 (cost=689.08..22203.52 rows=14537 width=20) (actual time=4.082..21.389 rows=15281 loops=1)
Recheck Cond: (provider_id = 7726)
Filter: ((date >= '2017-05-13 12:00:00'::timestamp without time zone) AND (date <= '2017-05-15 12:00:00'::timestamp without time zone))
Heap Blocks: exact=10490
-> Bitmap Index Scan on events_daily_20170514_full_idx (cost=0.00..685.45 rows=14537 width=0) (actual time=2.428..2.428 rows=15281 loops=1)
Index Cond: (provider_id = 7726)
-> Bitmap Heap Scan on events_daily_20170515 r_3 (cost=731.84..24128.69 rows=15409 width=20) (actual time=4.297..22.662 rows=15924 loops=1)
Recheck Cond: (provider_id = 7726)
Filter: ((date >= '2017-05-13 12:00:00'::timestamp without time zone) AND (date <= '2017-05-15 12:00:00'::timestamp without time zone))
Heap Blocks: exact=11318
-> Bitmap Index Scan on events_daily_20170515_full_idx (cost=0.00..727.99 rows=15409 width=0) (actual time=2.506..2.506 rows=15924 loops=1)
Index Cond: (provider_id = 7726)
-> Hash (cost=1085.35..1085.35 rows=574 width=14) (actual time=0.815..0.815 rows=582 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 37kB
-> Bitmap Heap Scan on report_campaigns c (cost=12.76..1085.35 rows=574 width=14) (actual time=0.090..0.627 rows=582 loops=1)
Recheck Cond: (p_id = 7726)
Filter: ((campaign_name)::text <> 'test'::text)
Heap Blocks: exact=240
-> Bitmap Index Scan on report_campaigns_provider_id (cost=0.00..12.62 rows=577 width=0) (actual time=0.062..0.062 rows=582 loops=1)
Index Cond: (p_id = 7726)
Planning time: 9651.605 ms
Execution time: 115.092 ms
result:
channel | day | displayed
----------+---------------------+-----------
Pin | 2017-05-14 00:00:00 | 43434
Pin | 2017-05-15 00:00:00 | 3325325235
It seems to me this is because the summation forces pre-computation before the left join.
A solution could be to impose the filtering WHERE clauses in two nested sub-SELECTs prior to the left join and summation.
Hope this works:
SELECT channel, day, sum( displayed )
FROM
  (SELECT c_id, campaign_channel AS channel
   FROM report_campaigns
   WHERE p_id = 7726 AND campaign_name <> 'test') AS c
LEFT JOIN
  (SELECT campaign_id, date AS day, displayed
   FROM events_daily
   WHERE provider_id = 7726
     AND date >= '20170513 12:00' AND date <= '20170515 12:00') AS r
  ON r.campaign_id = c.c_id
GROUP BY channel, day;

PostgreSQL 9.4 Index speed very slow on Select

The box running PostgreSQL 9.4 is a 32-core system at 3.5 GHz per core with 128 GB of RAM and mirrored Samsung Pro 850 SSD drives, on FreeBSD with ZFS. There is no reason for performance this poor from PostgreSQL!
psql (9.4.5)
forex=# \d pair_data;
Table "public.pair_data"
Column | Type | Modifiers
------------+-----------------------------+--------------------------------------------------------
id | integer | not null default nextval('pair_data_id_seq'::regclass)
pair_name | character varying(6) |
pair_price | numeric(9,6) |
ts | timestamp without time zone |
Indexes:
"pair_data_pkey" PRIMARY KEY, btree (id)
"date_idx" gin (ts)
"pair_data_gin_idx1" gin (pair_name)
"pair_data_gin_idx2" gin (ts)
With this select:
forex=# explain (analyze, buffers) select pair_name as name, pair_price as price, ts as date from pair_data where pair_name = 'EURUSD' order by ts desc limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=201442.55..201442.57 rows=10 width=22) (actual time=10870.395..10870.430 rows=10 loops=1)
Buffers: shared hit=40 read=139150 dirtied=119
-> Sort (cost=201442.55..203787.44 rows=937957 width=22) (actual time=10870.390..10870.401 rows=10 loops=1)
Sort Key: ts
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=40 read=139150 dirtied=119
-> Bitmap Heap Scan on pair_data (cost=9069.17..181173.63 rows=937957 width=22) (actual time=614.976..8903.913 rows=951858 loops=1)
Recheck Cond: ((pair_name)::text = 'EURUSD'::text)
Rows Removed by Index Recheck: 13624055
Heap Blocks: exact=33464 lossy=105456
Buffers: shared hit=40 read=139150 dirtied=119
-> Bitmap Index Scan on pair_data_gin_idx1 (cost=0.00..8834.68 rows=937957 width=0) (actual time=593.701..593.701 rows=951858 loops=1)
Index Cond: ((pair_name)::text = 'EURUSD'::text)
Buffers: shared hit=40 read=230
Planning time: 0.387 ms
Execution time: 10871.419 ms
(16 rows)
Or this select:
forex=# explain (analyze, buffers) with intervals as ( select start, start + interval '4hr' as end from generate_series('2015-12-01 15:00', '2016-01-19 16:00', interval '4hr') as start ) select distinct intervals.start as date, min(pair_price) over w as low, max(pair_price) over w as high, first_value(pair_price) over w as open, last_value(pair_price) over w as close from intervals join pair_data mb on mb.pair_name = 'EURUSD' and mb.ts >= intervals.start and mb.ts < intervals.end window w as (partition by intervals.start order by mb.ts asc rows between unbounded preceding and unbounded following) order by intervals.start;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=64634864.43..66198331.09 rows=815054 width=23) (actual time=1379732.924..1384602.952 rows=82 loops=1)
Buffers: shared hit=8 read=139210 dirtied=22, temp read=3332596 written=17050
CTE intervals
-> Function Scan on generate_series start (cost=0.00..12.50 rows=1000 width=8) (actual time=0.135..1.801 rows=295 loops=1)
-> Sort (cost=64634851.92..64895429.70 rows=104231111 width=23) (actual time=1379732.918..1381724.179 rows=951970 loops=1)
Sort Key: intervals.start, (min(mb.pair_price) OVER (?)), (max(mb.pair_price) OVER (?)), (first_value(mb.pair_price) OVER (?)), (last_value(mb.pair_price) OVER (?))
Sort Method: external sort Disk: 60808kB
Buffers: shared hit=8 read=139210 dirtied=22, temp read=3332596 written=17050
-> WindowAgg (cost=41474743.35..44341098.90 rows=104231111 width=23) (actual time=1341744.405..1365946.672 rows=951970 loops=1)
Buffers: shared hit=8 read=139210 dirtied=22, temp read=3324995 written=9449
-> Sort (cost=41474743.35..41735321.12 rows=104231111 width=23) (actual time=1341743.502..1343723.884 rows=951970 loops=1)
Sort Key: intervals.start, mb.ts
Sort Method: external sort Disk: 32496kB
Buffers: shared hit=8 read=139210 dirtied=22, temp read=1154778 written=7975
-> Nested Loop (cost=9098.12..21180990.32 rows=104231111 width=23) (actual time=271672.696..1337526.628 rows=951970 loops=1)
Join Filter: ((mb.ts >= intervals.start) AND (mb.ts < intervals."end"))
Rows Removed by Join Filter: 279879180
Buffers: shared hit=8 read=139210 dirtied=22, temp read=1150716 written=3913
-> CTE Scan on intervals (cost=0.00..20.00 rows=1000 width=16) (actual time=0.142..4.075 rows=295 loops=1)
-> Materialize (cost=9098.12..190496.52 rows=938080 width=15) (actual time=2.125..1962.153 rows=951970 loops=295)
Buffers: shared hit=8 read=139210 dirtied=22, temp read=1150716 written=3913
-> Bitmap Heap Scan on pair_data mb (cost=9098.12..181225.12 rows=938080 width=15) (actual time=622.818..11474.987 rows=951970 loops=1)
Recheck Cond: ((pair_name)::text = 'EURUSD'::text)
Rows Removed by Index Recheck: 13623989
Heap Blocks: exact=33485 lossy=105456
Buffers: shared hit=8 read=139210 dirtied=22
-> Bitmap Index Scan on pair_data_gin_idx1 (cost=0.00..8863.60 rows=938080 width=0) (actual time=601.158..601.158 rows=951970 loops=1)
Index Cond: ((pair_name)::text = 'EURUSD'::text)
Buffers: shared hit=8 read=269
Planning time: 0.454 ms
Execution time: 1384653.385 ms
(31 rows)
The pair_data table only has:
forex=# select count(*) from pair_data;
count
----------
21833886
(1 row)
Why is this doing heap scans when there are indexes? I do not understand what is going on in the query plan. Does anyone have an idea where the problem might be?
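One observation on the first query (WHERE pair_name = 'EURUSD' ORDER BY ts DESC LIMIT 10): a GIN index cannot return rows in ts order, so the planner has to bitmap-scan and sort almost a million rows before taking the top 10. A hedged sketch of a plain btree index that matches both the equality filter and the sort; the index name is illustrative.
-- Btree index matching the equality filter and the ORDER BY, so the LIMIT 10
-- can be served by reading just the first few index entries.
CREATE INDEX pair_data_name_ts_idx ON pair_data (pair_name, ts DESC);

EXPLAIN (ANALYZE, BUFFERS)
SELECT pair_name AS name, pair_price AS price, ts AS date
FROM pair_data
WHERE pair_name = 'EURUSD'
ORDER BY ts DESC
LIMIT 10;
The lossy=105456 heap blocks in both bitmap scans also hint that work_mem is too small to keep the bitmaps exact, which is why the recheck has to discard more than 13 M rows.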