Postgres Aggregate Query too slow

I have a query that generates a trend response, giving me counts of devices for various dates. The query scans almost 500k rows. The table has almost 17.5 million records. I have partitioned the table on id so that the planner only has to look at a single partition, but it is still quite slow. Each partition has almost 200k records. Any idea how to improve the performance of this?
Query
select start_date, end_date, average, fail_count, warning_count, pass_count
from (
    select generate_series(timestamp '2021-01-18 00:00:00', timestamp '2021-02-12 00:00:00', interval '1 day')::date
) t(start_date)
LEFT JOIN (
    SELECT start_date, end_date, avg(score) as average
         , count(*) FILTER (WHERE status = 'Fail') AS fail_count
         , count(*) FILTER (WHERE status = 'Warning') AS warning_count
         , count(*) FILTER (WHERE status = 'Pass') AS pass_count
    FROM performance.tenant_based scd
    JOIN performance.hierarchy dh ON dh.id = scd.id AND dh.tag = scd.tag
    WHERE dh.parent_id in (0,1,2,3,4,5,6,7,8,9,10)
      AND dh.child_id in (0,1,2,3,4,5,6,7,8,9,10)
      AND dh.desc in ('test')
      AND dh.id = 'ita68f0c03880e4c6694859dfa74f1cdf6'
      AND start_date >= '2021-01-18 00:00:00' -- same date range as above
      AND start_date <= '2021-02-12 00:00:00'
    GROUP BY 1, 2
) s USING (start_date)
ORDER BY 1;
The Query plan is below
Sort (cost=241350.02..241850.02 rows=200000 width=104) (actual time=3453.888..3453.890 rows=26 loops=1)
Sort Key: (((((generate_series('2021-01-18 00:00:00'::timestamp without time zone, '2021-02-12 00:00:00'::timestamp without time zone, '1 day'::interval)))::date))::timestamp without time zone)
Sort Method: quicksort Memory: 28kB
-> Merge Left Join (cost=201014.95..212802.88 rows=200000 width=104) (actual time=2901.012..3453.867 rows=26 loops=1)
Merge Cond: ((((generate_series('2021-01-18 00:00:00'::timestamp without time zone, '2021-02-12 00:00:00'::timestamp without time zone, '1 day'::interval)))::date) = scd.start_date)
-> Sort (cost=79.85..82.35 rows=1000 width=4) (actual time=0.015..0.024 rows=26 loops=1)
Sort Key: (((generate_series('2021-01-18 00:00:00'::timestamp without time zone, '2021-02-12 00:00:00'::timestamp without time zone, '1 day'::interval)))::date)
Sort Method: quicksort Memory: 26kB
-> Result (cost=0.00..20.02 rows=1000 width=4) (actual time=0.003..0.009 rows=26 loops=1)
-> ProjectSet (cost=0.00..5.02 rows=1000 width=8) (actual time=0.002..0.006 rows=26 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.000 rows=1 loops=1)
-> Materialize (cost=200935.11..209318.03 rows=40000 width=72) (actual time=2900.992..3453.789 rows=25 loops=1)
-> Finalize GroupAggregate (cost=200935.11..208818.03 rows=40000 width=72) (actual time=2900.990..3453.771 rows=25 loops=1)
Group Key: scd.start_date, scd.end_date
-> Gather Merge (cost=200935.11..207569.38 rows=49910 width=72) (actual time=2879.365..3453.827 rows=75 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial GroupAggregate (cost=199935.08..200808.51 rows=24955 width=72) (actual time=2686.465..3228.313 rows=25 loops=3)
Group Key: scd.start_date, scd.end_date
-> Sort (cost=199935.08..199997.47 rows=24955 width=25) (actual time=2664.518..2860.477 rows=1666667 loops=3)
Sort Key: scd.start_date, scd.end_date
Sort Method: external merge Disk: 59840kB
-> Hash Join (cost=44891.11..198112.49 rows=24955 width=25) (actual time=111.653..1817.228 rows=1666667 loops=3)
Hash Cond: (scd.tag = (dh.tag)::text)
-> Append (cost=0.00..145159.33 rows=2083333 width=68) (actual time=0.006..591.818 rows=1666667 loops=3)
-> Parallel Seq Scan on ita68f0c03880e4c6694859dfa74f1cdf6 scd (cost=0.00..145159.33 rows=2083333 width=68) (actual time=0.006..455.525 rows=1666667 loops=3)
Filter: ((start_date >= '2021-01-18 00:00:00'::timestamp without time zone) AND (start_date <= '2021-02-12 00:00:00'::timestamp without time zone) AND ((id)::text = 'ita68f0c03880e4c6694859dfa74f1cdf6'::text))
-> Hash (cost=44638.71..44638.71 rows=20192 width=45) (actual time=111.502..111.502 rows=200000 loops=3)
Buckets: 65536 (originally 32768) Batches: 8 (originally 1) Memory Usage: 3585kB
-> Bitmap Heap Scan on hierarchy dh (cost=1339.01..44638.71 rows=20192 width=45) (actual time=26.542..62.078 rows=200000 loops=3)
Recheck Cond: (((id)::text = 'ita68f0c03880e4c6694859dfa74f1cdf6'::text) AND (parent_id = ANY ('{0,1,2,3,4,5,6,7,8,9,10}'::integer[])) AND (child_id = ANY ('{0,1,2,3,4,5,6,7,8,9,10}'::integer[])) AND ((desc)::text = 'test'::text))
Heap Blocks: exact=5717
-> Bitmap Index Scan on hierarchy_id_region_idx (cost=0.00..1333.96 rows=20192 width=0) (actual time=25.792..25.792 rows=200000 loops=3)
Index Cond: (((id)::text = 'ita68f0c03880e4c6694859dfa74f1cdf6'::text) AND (parent_id = ANY ('{0,1,2,3,4,5,6,7,8,9,10}'::integer[])) AND (child_id = ANY ('{0,1,2,3,4,5,6,7,8,9,10}'::integer[])) AND ((desc)::text = 'test'::text))
Planning time: 0.602 ms
Execution time: 3463.440 ms

After going through several rounds of trial and error we landed on a materialized view for this query. The query was scanning almost 500k+ rows, and neither indexes nor partitioning were helping. We tweaked the above query to create a materialized view and now run a select on top of it. We are now at 96 ms. The generalized materialized view for the query in this question is shown below.
CREATE MATERIALIZED VIEW performance.daily_trends
TABLESPACE pg_default
AS SELECT s.id,
d.parent_id,
d.child_id,
d.desc,
s.start_date,
s.end_date,
count(*) FILTER (WHERE s.overall_status::text = 'Fail'::text) AS fail_count,
count(*) FILTER (WHERE s.overall_status::text = 'Warning'::text) AS warning_count,
count(*) FILTER (WHERE s.overall_status::text = 'Pass'::text) AS pass_count,
avg(s.score) AS average_score
FROM performance.tenant_based s
JOIN performance.hierarchy d ON s.id::text = d.id::text AND s.tag = d.tag::text
WHERE s.start_date >= (CURRENT_DATE - 45) AND s.start_date <= CURRENT_DATE
GROUP BY s.id, d.parent_id, d.child_id, d.desc, s.start_date, s.end_date
WITH DATA;
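Two notes that go with this (not part of the original post): the report query then has to read from the view instead of the base tables, and the view is a snapshot that needs a scheduled refresh. A minimal sketch, assuming the same id filter as above:
-- hypothetical select on top of the materialized view
SELECT start_date, end_date, average_score, fail_count, warning_count, pass_count
FROM performance.daily_trends
WHERE id = 'ita68f0c03880e4c6694859dfa74f1cdf6'
ORDER BY start_date;
-- refresh on a schedule; CONCURRENTLY avoids blocking readers
-- but requires a unique index on the materialized view
REFRESH MATERIALIZED VIEW performance.daily_trends;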
Thanks to all who tried to help with this.

Related

Postgres index for aggregate query

SELECT count(e_id) AS count,
e_id
FROM test
WHERE created_at BETWEEN '2021-12-01 00:00:00' AND '2021-12-08 00:00:00'
AND std IN ( '1' )
AND section IN ( 'Sample' )
GROUP BY e_id
ORDER BY count DESC
LIMIT 4
The table has around 1 M records. The query itself executes in less than 40 ms, but the GROUP BY computation drives up CPU and the query cost is high.
Limit (cost=26133.76..26133.77 rows=4 width=45) (actual time=52.300..52.303 rows=3 loops=1)
-> Sort (cost=26133.76..26134.77 rows=403 width=45) (actual time=52.299..52.301 rows=3 loops=1)
Sort Key: (count(e_id)) DESC
Sort Method: quicksort Memory: 25kB
-> GroupAggregate (cost=26120.66..26127.72 rows=403 width=45) (actual time=52.287..52.289 rows=3 loops=1)
Group Key: e_id
-> Sort (cost=26120.66..26121.67 rows=404 width=37) (actual time=52.281..52.283 rows=5 loops=1)
Sort Key: e_id
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on test (cost=239.19..26103.17 rows=404 width=37) (actual time=49.339..52.261 rows=5 loops=1)
Recheck Cond: ((section)::text = 'test'::text)
" Filter: ((created_at >= '2021-12-01 00:00:00'::timestamp without time zone) AND (created_at <= '2021-12-08 00:00:00'::timestamp without time zone) AND ((std)::text = ANY ('{1,2}'::text[])))"
Rows Removed by Filter: 38329
Heap Blocks: exact=33997
-> Bitmap Index Scan on index_test_on_section (cost=0.00..239.09 rows=7270 width=0) (actual time=6.815..6.815 rows=38334 loops=1)
Index Cond: ((section)::text = 'test'::text)
How can I optimize the group by and count, so that CPU does not shoot up?
The best index for this query is
CREATE INDEX ON test (section, created_at, std) INCLUDE (e_id);
Then VACUUM the table and try again.
Unless you have shown us the wrong plan, the slow step is not the group by, but rather the Bitmap Heap Scan.
Your index on "section" returns 38334 rows, of which all but 5 are filtered out. We can't tell whether they are filtered out mostly by the "std" criterion or by the "created_at" one. You need a more specific multicolumn index. The one I think is most likely to be effective is on (section, std, created_at).
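For reference, that suggestion written out as DDL (a sketch, not part of the original answer; the INCLUDE clause is optional and only matters if you want index-only scans covering e_id):
CREATE INDEX ON test (section, std, created_at) INCLUDE (e_id);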

Postgres function slower than same ad hoc query

I have had several cases where a Postgres function that returns a table result from a query is much slower than running the actual query. Why is that?
Below is one example, but I've found that the function is slower than the plain query in many cases.
create function trending_names(date_start timestamp with time zone, date_end timestamp with time zone, gender_filter character, country_filter text)
returns TABLE(name_id integer, gender character, country text, score bigint, rank bigint)
language sql
as
$$
select u.name_id,
n.gender,
u.country,
count(u.rank) as score,
row_number() over (order by count(u.rank) desc) as rank
from babynames.user_scores u
inner join babynames.names n on u.name_id = n.id
where u.created_at between date_start and date_end
and u.rank > 0
and n.gender = gender_filter
and u.country = country_filter
group by u.name_id, n.gender, u.country
$$;
This is the query plan for a select from the function:
Function Scan on trending_names (cost=0.25..10.25 rows=1000 width=84) (actual time=1118.673..1118.861 rows=2238 loops=1)
Buffers: shared hit=216509 read=29837
Planning Time: 0.078 ms
Execution Time: 1119.083 ms
Query plan from just running the query. This takes less than half the time.
WindowAgg (cost=44834.98..45593.32 rows=43334 width=25) (actual time=383.387..385.223 rows=2238 loops=1)
Buffers: shared hit=100446 read=50220
-> Sort (cost=44834.98..44943.31 rows=43334 width=17) (actual time=383.375..383.546 rows=2238 loops=1)
Sort Key: (count(u.rank)) DESC
Sort Method: quicksort Memory: 271kB
Buffers: shared hit=100446 read=50220
-> HashAggregate (cost=41064.22..41497.56 rows=43334 width=17) (actual time=381.088..381.906 rows=2238 loops=1)
Group Key: u.name_id, u.country, n.gender
Buffers: shared hit=100446 read=50220
-> Hash Join (cost=5352.15..40630.88 rows=43334 width=13) (actual time=60.710..352.646 rows=36271 loops=1)
Hash Cond: (u.name_id = n.id)
Buffers: shared hit=100446 read=50220
-> Index Scan using user_scores_rank_ix on user_scores u (cost=0.43..35077.55 rows=76796 width=11) (actual time=24.193..287.393 rows=69770 loops=1)
Index Cond: (rank > 0)
Filter: ((created_at >= '2021-01-01 00:00:00+00'::timestamp with time zone) AND (country = 'sv'::text) AND (created_at <= now()))
Rows Removed by Filter: 106521
Buffers: shared hit=99417 read=46856
-> Hash (cost=5005.89..5005.89 rows=27667 width=6) (actual time=36.420..36.420 rows=27472 loops=1)
Buckets: 32768 Batches: 1 Memory Usage: 1330kB
Buffers: shared hit=1029 read=3364
-> Seq Scan on names n (cost=0.00..5005.89 rows=27667 width=6) (actual time=0.022..24.447 rows=27472 loops=1)
Filter: (gender = 'f'::bpchar)
Rows Removed by Filter: 21559
Buffers: shared hit=1029 read=3364
Planning Time: 2.512 ms
Execution Time: 387.403 ms
I'm also confused about why it does a Seq Scan on names n in the last step, since names.id is the primary key and gender is indexed.

How to select a date range *and* the entries immediately before and after that range?

I'm working with a table where each row has a timestamp, and that timestamp is unique for a given set of other column values:
CREATE TEMPORARY TABLE time_series (
id SERIAL PRIMARY KEY,
created TIMESTAMP WITH TIME ZONE NOT NULL,
category TEXT,
value INT
);
CREATE UNIQUE INDEX ON time_series (created, category);
INSERT INTO time_series (created, category, value)
VALUES ('2000-01-01 00:00:00Z', 'foo', 1),
('2000-01-01 06:00:00Z', 'bar', 5),
('2000-01-01 12:00:00Z', 'bar', 5),
('2000-01-02 00:00:00Z', 'bar', 5),
('2000-01-02 12:34:45Z', 'bar', 2),
('2000-01-03 00:00:00Z', 'bar', 3),
('2000-01-04 00:00:00Z', 'bar', 3),
('2000-01-04 11:11:11Z', 'foo', 4),
('2000-01-04 22:22:22Z', 'bar', 5),
('2000-01-04 23:23:23Z', 'bar', 4),
('2000-01-05 00:00:00Z', 'foo', 1),
('2000-01-05 23:23:23Z', 'bar', 4);
The timestamps are not spaced uniformly. My task, given an arbitrary start and end datetime, is to get the entries between those datetimes and the entries immediately before and after that range. Basically, how do I simplify this query:
(SELECT created, value
FROM time_series
WHERE category = 'bar'
AND created < '2000-01-02 06:00:00Z'
ORDER BY created DESC
LIMIT 1)
UNION
(SELECT created, value
FROM time_series
WHERE category = 'bar'
AND created >= '2000-01-02 06:00:00Z'
AND created < '2000-01-04 12:00:00Z')
UNION
(SELECT created, value
FROM time_series
WHERE category = 'bar'
AND created >= '2000-01-04 12:00:00Z'
ORDER BY created
LIMIT 1)
ORDER BY created;
        created         | value
------------------------+-------
 2000-01-02 00:00:00+00 |     5
 2000-01-02 12:34:45+00 |     2
 2000-01-03 00:00:00+00 |     3
 2000-01-04 00:00:00+00 |     3
 2000-01-04 22:22:22+00 |     5
The use case is getting the data points to display a graph: I know the datetimes of the left and right edges of the graph, but they will not in general align exactly with created datetimes, so in order to display a graph all the way to the edge I need a data point to either side of the range.
Fiddle
Non-solutions:
I can not simply select the whole range, because it might be huge.
I can not select some arbitrarily long period outside of the given range, because that data set might again be huge or whichever period I select might not be enough to get the next readings.
EDITED:
You can combine UNION ALL with ORDER BY and LIMIT, using the range bounds in the WHERE clauses.
Something like this:
APPROACH 1:
SELECT created,
value
FROM (SELECT created, value
FROM time_series
WHERE category = 'bar'
AND created < '2000-01-02 06:00:00Z'
ORDER BY created DESC LIMIT 1
) AS ub
UNION ALL
SELECT created, value
FROM time_series
WHERE category = 'bar'
AND created >= '2000-01-02 06:00:00Z'
AND created < '2000-01-04 12:00:00Z'
UNION ALL
SELECT created,
value
FROM (SELECT created, value
FROM time_series
WHERE category = 'bar'
AND created >= '2000-01-04 12:00:00Z'
ORDER BY created ASC LIMIT 1
) AS lb
ORDER BY 1;
EXPLAIN ANALYZE from approach 1:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=3.60..3.61 rows=3 width=12) (actual time=0.228..0.237 rows=5 loops=1)
Sort Key: time_series.created
Sort Method: quicksort Memory: 25kB
-> HashAggregate (cost=3.55..3.58 rows=3 width=12) (actual time=0.182..0.195 rows=5 loops=1)
Group Key: time_series.created, time_series.value
-> Append (cost=1.16..3.53 rows=3 width=12) (actual time=0.073..0.163 rows=5 loops=1)
-> Limit (cost=1.16..1.16 rows=1 width=12) (actual time=0.070..0.073 rows=1 loops=1)
-> Sort (cost=1.16..1.16 rows=1 width=12) (actual time=0.065..0.067 rows=1 loops=1)
Sort Key: time_series.created DESC
Sort Method: quicksort Memory: 25kB
-> Seq Scan on time_series (cost=0.00..1.15 rows=1 width=12) (actual time=0.026..0.035 rows=2 loops=1)
Filter: ((created < '2000-01-02 06:00:00+00'::timestamp with time zone) AND (category = 'bar'::text))
Rows Removed by Filter: 8
-> Seq Scan on time_series time_series_1 (cost=0.00..1.18 rows=1 width=12) (actual time=0.007..0.016 rows=3 loops=1)
Filter: ((created >= '2000-01-02 06:00:00+00'::timestamp with time zone) AND (created < '2000-01-04 12:00:00+00'::timestamp with time zone) AND (category = 'bar'::text))
Rows Removed by Filter: 7
-> Limit (cost=1.16..1.16 rows=1 width=12) (actual time=0.051..0.054 rows=1 loops=1)
-> Sort (cost=1.16..1.16 rows=1 width=12) (actual time=0.047..0.049 rows=1 loops=1)
Sort Key: time_series_2.created
Sort Method: quicksort Memory: 25kB
-> Seq Scan on time_series time_series_2 (cost=0.00..1.15 rows=1 width=12) (actual time=0.009..0.016 rows=2 loops=1)
Filter: ((created >= '2000-01-04 12:00:00+00'::timestamp with time zone) AND (category = 'bar'::text))
Rows Removed by Filter: 8
Planning time: 0.388 ms
Execution time: 0.438 ms
(25 rows)
Another similar approach can be used.
APPROACH 2:
SELECT created, value
FROM time_series
WHERE category = 'bar'
AND created >= (SELECT created
FROM time_series
WHERE category = 'bar'
AND created < '2000-01-02 06:00:00Z'
ORDER BY created ASC LIMIT 1)
AND created < (SELECT created
FROM time_series
WHERE category = 'bar'
AND created >= '2000-01-04 12:00:00Z'
ORDER BY created DESC LIMIT 1
)
EXPLAIN ANALYZE from approach 2:
--------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on time_series (cost=2.33..3.50 rows=1 width=12) (actual time=0.143..0.157 rows=6 loops=1)
Filter: ((created >= $0) AND (created < $1) AND (category = 'bar'::text))
Rows Removed by Filter: 4
InitPlan 1 (returns $0)
-> Limit (cost=1.16..1.16 rows=1 width=8) (actual time=0.066..0.069 rows=1 loops=1)
-> Sort (cost=1.16..1.16 rows=1 width=8) (actual time=0.061..0.062 rows=1 loops=1)
Sort Key: time_series_1.created
Sort Method: quicksort Memory: 25kB
-> Seq Scan on time_series time_series_1 (cost=0.00..1.15 rows=1 width=8) (actual time=0.008..0.015 rows=2 loops=1)
Filter: ((created < '2000-01-02 06:00:00+00'::timestamp with time zone) AND (category = 'bar'::text))
Rows Removed by Filter: 8
InitPlan 2 (returns $1)
-> Limit (cost=1.16..1.16 rows=1 width=8) (actual time=0.041..0.044 rows=1 loops=1)
-> Sort (cost=1.16..1.16 rows=1 width=8) (actual time=0.038..0.039 rows=1 loops=1)
Sort Key: time_series_2.created DESC
Sort Method: quicksort Memory: 25kB
-> Seq Scan on time_series time_series_2 (cost=0.00..1.15 rows=1 width=8) (actual time=0.007..0.013 rows=2 loops=1)
Filter: ((created >= '2000-01-04 12:00:00+00'::timestamp with time zone) AND (category = 'bar'::text))
Rows Removed by Filter: 8
Planning time: 0.392 ms
Execution time: 0.288 ms
Since the bound subqueries use LIMIT 1, the query will run fast.
APPROACH 3:
WITH a as (
SELECT created,
value,
lag(created, 1) OVER (ORDER BY created desc) AS ub,
lag(created, -1) OVER (ORDER BY created desc) AS lb
FROM time_series
WHERE category = 'bar'
) SELECT created,
value
FROM a
WHERE ub>='2000-01-02 06:00:00Z'
AND lb<'2000-01-04 12:00:00Z'
ORDER BY created
EXPLAIN ANALYZE from approach 3:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=1.19..1.20 rows=1 width=12) (actual time=0.174..0.181 rows=5 loops=
1)
Sort Key: a.created
Sort Method: quicksort Memory: 25kB
CTE a
-> WindowAgg (cost=1.14..1.16 rows=1 width=28) (actual time=0.075..0.107 rows=7 loops=1)
-> Sort (cost=1.14..1.14 rows=1 width=12) (actual time=0.056..0.067 rows=7 loops=1)
Sort Key: time_series.created DESC
Sort Method: quicksort Memory: 25kB
-> Seq Scan on time_series (cost=0.00..1.12 rows=1 width=12) (actual time=0.018..0.030 rows=7 loops=1)
Filter: (category = 'bar'::text)
Rows Removed by Filter: 3
-> CTE Scan on a (cost=0.00..0.03 rows=1 width=12) (actual time=0.088..0.131 rows=5 loops=1)
Filter: ((ub >= '2000-01-02 06:00:00+00'::timestamp with time zone) AND (lb < '2000-01-04 12:00:00+00'::timestamp with time zone))
Rows Removed by Filter: 2
Planning time: 0.175 ms
Execution time: 0.247 ms
(16 rows)
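A side note beyond the original answer: on this 12-row example everything is a Seq Scan anyway, but on a large table all three approaches benefit from an index that leads with category, so each LIMIT 1 bound lookup becomes a single index probe and the middle range becomes an index range scan:
CREATE INDEX ON time_series (category, created);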

Query gets very slow when the jsonb ?& operator is used

I have the following SQL query that works fast
select
    phone_number.id,
    phone_number.phone_number,
    phone_number.account_id,
    phone_number.used AS used,
    (now() AT TIME ZONE account.timezone)::time AS local_time
from phone_number
INNER JOIN account ON account.id = phone_number.account_id
where phone_number.used = false
    AND phone_number.account_id IN
    (
        SELECT phone_number.account_id
        FROM phone_number
        WHERE insert_timestamp < (now() - interval '10 hours')
    )
    AND (now() AT TIME ZONE account.timezone)::time BETWEEN
        CASE
            WHEN EXTRACT(DOW FROM now() AT TIME ZONE account.timezone) IN (6, 0)
            THEN '15:30'::time
            ELSE '17:30'::time
        END
        AND '22:10'::time
order by random() limit 1
But when I add account.residence_details::jsonb ?& array['city', 'state', 'streetName'] to it, the full query becomes
select
    phone_number.id,
    phone_number.phone_number,
    phone_number.account_id,
    phone_number.used AS used,
    (now() AT TIME ZONE account.timezone)::time AS local_time
from phone_number
INNER JOIN account ON account.id = phone_number.account_id
where phone_number.used = false
    AND phone_number.account_id IN
    (
        SELECT phone_number.account_id
        FROM phone_number
        WHERE insert_timestamp < (now() - interval '10 hours')
    )
    AND (now() AT TIME ZONE account.timezone)::time BETWEEN
        CASE
            WHEN EXTRACT(DOW FROM now() AT TIME ZONE account.timezone) IN (6, 0)
            THEN '15:30'::time
            ELSE '17:30'::time
        END
        AND '22:10'::time
    AND account.residence_details::jsonb ?& array['city', 'state', 'streetName']
order by random() limit 1
The query takes about 1 minute to complete
Below is EXPLAIN ANALYZE for the query WITHOUT account.residence_details::jsonb ?& array['city', 'state', 'streetName']
Limit (cost=15795.97..15795.97 rows=1 width=45) (actual time=382.995..382.995 rows=0 loops=1)
-> Sort (cost=15795.97..15796.18 rows=85 width=45) (actual time=382.993..382.993 rows=0 loops=1)
Sort Key: (random())
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=8742.24..15795.54 rows=85 width=45) (actual time=382.640..382.640 rows=0 loops=1)
Join Filter: (phone_number.account_id = account.id)
-> Hash Join (cost=8741.96..15403.38 rows=850 width=37) (actual time=347.011..368.677 rows=2099 loops=1)
Hash Cond: (phone_number.account_id = phone_number_1.account_id)
-> Seq Scan on phone_number (cost=0.00..6649.74 rows=850 width=29) (actual time=14.499..33.591 rows=2453 loops=1)
Filter: (NOT used)
Rows Removed by Filter: 190152
-> Hash (cost=8629.44..8629.44 rows=9001 width=8) (actual time=332.368..332.369 rows=9581 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 503kB
-> HashAggregate (cost=8539.43..8629.44 rows=9001 width=8) (actual time=320.550..326.757 rows=9581 loops=1)
Group Key: phone_number_1.account_id
-> Seq Scan on phone_number phone_number_1 (cost=0.00..8067.05 rows=188955 width=8) (actual time=0.010..169.126 rows=191615 loops=1)
Filter: (insert_timestamp < (now() - '10:00:00'::interval))
Rows Removed by Filter: 990
-> Index Scan using account_id_idx on account (cost=0.29..0.45 rows=1 width=25) (actual time=0.006..0.006 rows=0 loops=2099)
Index Cond: (id = phone_number_1.account_id)
Filter: (((timezone(timezone, now()))::time without time zone <= '22:10:00'::time without time zone) AND ((timezone(timezone, now()))::time without time zone >= CASE WHEN (date_part('dow'::text, timezone(timezone, now())) = ANY ('{6,0}'::double precision[])) THEN '15:30:00'::time without time zone ELSE '17:30:00'::time without time zone END))
Rows Removed by Filter: 1
Planning time: 2.025 ms
Execution time: 383.794 ms
Below is EXPLAIN ANALYZE for the query with account.residence_details::jsonb ?& array['city', 'state', 'streetName']
Limit (cost=15916.82..15916.83 rows=1 width=45) (actual time=258768.686..258768.696 rows=1 loops=1)
-> Sort (cost=15916.82..15916.83 rows=1 width=45) (actual time=258768.684..258768.685 rows=1 loops=1)
Sort Key: (random())
Sort Method: top-N heapsort Memory: 25kB
-> Nested Loop Semi Join (cost=0.29..15916.81 rows=1 width=45) (actual time=495.076..258755.141 rows=1715 loops=1)
Join Filter: (account.id = phone_number_1.account_id)
Rows Removed by Join Filter: 167271743
-> Nested Loop (cost=0.29..7634.96 rows=1 width=54) (actual time=65.620..229.670 rows=1737 loops=1)
-> Seq Scan on phone_number (cost=0.00..6649.74 rows=850 width=29) (actual time=59.234..98.326 rows=3772 loops=1)
Filter: (NOT used)
Rows Removed by Filter: 190333
-> Index Scan using account_id_idx on account (cost=0.29..1.16 rows=1 width=25) (actual time=0.029..0.029 rows=0 loops=3772)
Index Cond: (id = phone_number.account_id)
Filter: ((residence_details ?& '{city,state,streetName}'::text[]) AND ((timezone(timezone, now()))::time without time zone <= '22:10:00'::time without time zone) AND ((timezone(timezone, now()))::time without time zone >= CASE WHEN (date_part('dow'::text, timezone(timezone, now())) = ANY ('{6,0}'::double precision[])) THEN '15:30:00'::time without time zone ELSE '17:30:00'::time without time zone END))
Rows Removed by Filter: 1
-> Seq Scan on phone_number phone_number_1 (cost=0.00..8067.05 rows=188955 width=8) (actual time=0.004..87.357 rows=96300 loops=1737)
Filter: (insert_timestamp < (now() - '10:00:00'::interval))
Rows Removed by Filter: 21
Planning time: 1.712 ms
Execution time: 258768.781 ms
I cannot figure out why it gets so slow after adding account.residence_details::jsonb ?& array['city', 'state', 'streetName'].
I'd say that the additional condition makes PostgreSQL underestimate the result count of the first join so badly that it chooses a nested loop for the second join by mistake, which is where all the time is spent.
Maybe an index on the expression will help to get better estimates:
CREATE INDEX ON account USING gin (residence_details::jsonb);
ANALYZE account; -- to calculate statistics for the indexed expression
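A quick way to verify the effect (not from the original answer): after the ANALYZE, check how the planner handles the jsonb predicate on its own. The default jsonb_ops GIN index created above can also answer the ?& operator directly, and the row estimate should come out much closer to reality.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM account
WHERE account.residence_details::jsonb ?& array['city', 'state', 'streetName'];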

Postgres planning takes disproportionate time compared to execution

Postgres 9.6 running on Amazon RDS.
I have 2 tables:
aggregate events - a big table with 6 keys (ids)
campaign metadata - a small table with the campaign definitions.
I join the two in order to filter on metadata such as campaign name.
The query produces a report of displayed counts broken down by campaign channel and date (the date granularity is daily).
There are no foreign keys and no NOT NULL constraints. The report table has multiple lines per day per campaign (because the aggregation is based on a 6-attribute key).
When I join, planning time grows to about 10 s (vs. 300 ms).
explain analyze select c.campaign_channel as channel,date as day , sum( displayed ) as displayed
from report_campaigns c
left join events_daily r on r.campaign_id = c.c_id
where provider_id = 7726 and c.p_id = 7726 and c.campaign_name <> 'test'
and date >= '20170513 12:00' and date <= '20170515 12:00'
group by c.campaign_channel,date;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=71461.93..71466.51 rows=229 width=22) (actual time=104.189..114.788 rows=6 loops=1)
Group Key: c.campaign_channel, r.date
-> Sort (cost=71461.93..71462.51 rows=229 width=18) (actual time=100.263..106.402 rows=31205 loops=1)
Sort Key: c.campaign_channel, r.date
Sort Method: quicksort Memory: 3206kB
-> Hash Join (cost=1092.52..71452.96 rows=229 width=18) (actual time=22.149..86.955 rows=31205 loops=1)
Hash Cond: (r.campaign_id = c.c_id)
-> Append (cost=0.00..70245.84 rows=29948 width=20) (actual time=21.318..71.315 rows=31205 loops=1)
-> Seq Scan on events_daily r (cost=0.00..0.00 rows=1 width=20) (actual time=0.005..0.005 rows=0 loops=1)
Filter: ((date >= '2017-05-13 12:00:00'::timestamp without time zone) AND (date <= '2017-05-15 12:00:00'::timestamp without time zone) AND (provider_id =
-> Bitmap Heap Scan on events_daily_20170513 r_1 (cost=685.36..23913.63 rows=1 width=20) (actual time=17.230..17.230 rows=0 loops=1)
Recheck Cond: (provider_id = 7726)
Filter: ((date >= '2017-05-13 12:00:00'::timestamp without time zone) AND (date <= '2017-05-15 12:00:00'::timestamp without time zone))
Rows Removed by Filter: 13769
Heap Blocks: exact=10276
-> Bitmap Index Scan on events_daily_20170513_full_idx (cost=0.00..685.36 rows=14525 width=0) (actual time=2.356..2.356 rows=13769 loops=1)
Index Cond: (provider_id = 7726)
-> Bitmap Heap Scan on events_daily_20170514 r_2 (cost=689.08..22203.52 rows=14537 width=20) (actual time=4.082..21.389 rows=15281 loops=1)
Recheck Cond: (provider_id = 7726)
Filter: ((date >= '2017-05-13 12:00:00'::timestamp without time zone) AND (date <= '2017-05-15 12:00:00'::timestamp without time zone))
Heap Blocks: exact=10490
-> Bitmap Index Scan on events_daily_20170514_full_idx (cost=0.00..685.45 rows=14537 width=0) (actual time=2.428..2.428 rows=15281 loops=1)
Index Cond: (provider_id = 7726)
-> Bitmap Heap Scan on events_daily_20170515 r_3 (cost=731.84..24128.69 rows=15409 width=20) (actual time=4.297..22.662 rows=15924 loops=1)
Recheck Cond: (provider_id = 7726)
Filter: ((date >= '2017-05-13 12:00:00'::timestamp without time zone) AND (date <= '2017-05-15 12:00:00'::timestamp without time zone))
Heap Blocks: exact=11318
-> Bitmap Index Scan on events_daily_20170515_full_idx (cost=0.00..727.99 rows=15409 width=0) (actual time=2.506..2.506 rows=15924 loops=1)
Index Cond: (provider_id = 7726)
-> Hash (cost=1085.35..1085.35 rows=574 width=14) (actual time=0.815..0.815 rows=582 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 37kB
-> Bitmap Heap Scan on report_campaigns c (cost=12.76..1085.35 rows=574 width=14) (actual time=0.090..0.627 rows=582 loops=1)
Recheck Cond: (p_id = 7726)
Filter: ((campaign_name)::text <> 'test'::text)
Heap Blocks: exact=240
-> Bitmap Index Scan on report_campaigns_provider_id (cost=0.00..12.62 rows=577 width=0) (actual time=0.062..0.062 rows=582 loops=1)
Index Cond: (p_id = 7726)
Planning time: 9651.605 ms
Execution time: 115.092 ms
result:
channel | day | displayed
----------+---------------------+-----------
Pin | 2017-05-14 00:00:00 | 43434
Pin | 2017-05-15 00:00:00 | 3325325235
It seems to me this is because the summation forces pre-computation before the left join.
A solution could be to push the filtering WHERE clauses into two nested sub-SELECTs prior to the left join and summation.
Hope this works:
SELECT channel, day, sum(displayed) AS displayed
FROM
    (SELECT c_id, campaign_channel AS channel
     FROM report_campaigns
     WHERE p_id = 7726 AND campaign_name <> 'test') AS c
LEFT JOIN
    (SELECT campaign_id, date AS day, displayed
     FROM events_daily
     WHERE provider_id = 7726
       AND date >= '20170513 12:00' AND date <= '20170515 12:00') AS r
    ON r.campaign_id = c.c_id
GROUP BY channel, day;