In our PostgreSQL 9.6 application, we're currently rewriting one of the most performance-critical parts: a calendar showing posts.
SELECT
    assigned_user_id,
    channel_id,
    id,
    row_number() OVER (
        PARTITION BY DATE(publication_at AT TIME ZONE 'Europe/Vienna')
        ORDER BY publication_at ASC, id ASC
    ) AS "row_number",
    TO_CHAR(publication_at AT TIME ZONE 'Europe/Vienna', 'YYYY-MM-DD') AS "day",
    COUNT(*) OVER (
        PARTITION BY DATE(publication_at AT TIME ZONE 'Europe/Vienna')
    ) AS "count_per_day"
FROM posts
WHERE client_id = 159
  AND publication_at BETWEEN '2016-12-25 23:00:00' AND '2017-02-05 22:59:59'
ORDER BY publication_at AT TIME ZONE 'Europe/Vienna', id ASC
Based on a specific month (in this case: January 2017), we select all posts within a publication_at range (note: the values passed in the WHERE clause are already converted to local time). For other reasons (not relevant here) we also select the row number and the number of posts per local day.
At the moment, our index (posts_client_id_publication_at_assigned_user_id_id) is only partially used, because the AT TIME ZONE expression cannot be served by the index (it's simply not sargable).
Here's the EXPLAIN ANALYSE output from my development machine; the production table has almost 10 million rows:
Sort (cost=268.52..268.74 rows=87 width=92) (actual time=2.464..2.517 rows=524 loops=1)
Sort Key: (timezone('Europe/Vienna'::text, publication_at)), id
Sort Method: quicksort Memory: 98kB
Buffers: shared hit=246 read=6
-> WindowAgg (cost=260.93..265.72 rows=87 width=92) (actual time=0.920..2.306 rows=524 loops=1)
Buffers: shared hit=246 read=6
-> WindowAgg (cost=260.93..263.32 rows=87 width=44) (actual time=0.899..1.173 rows=524 loops=1)
Buffers: shared hit=246 read=6
-> Sort (cost=260.93..261.15 rows=87 width=36) (actual time=0.895..0.952 rows=524 loops=1)
Sort Key: (date(timezone('Europe/Vienna'::text, publication_at))), publication_at, id
Sort Method: quicksort Memory: 65kB
Buffers: shared hit=246 read=6
-> Index Scan using posts_client_id_publication_at_assigned_user_id_id on posts (cost=0.41..258.13 rows=87 width=36) (actual time=0.050..0.774 rows=524 loops=1)
Index Cond: ((client_id = 159) AND (publication_at >= '2016-12-25 23:00:00+01'::timestamp with time zone) AND (publication_at <= '2017-02-05 22:59:59+01'::timestamp with time zone))
Buffers: shared hit=246 read=6
Planning time: 0.343 ms
Execution time: 2.641 ms
I'm aware that we should move the time zone conversion from the column over to the other side of the comparison, so that publication_at itself can still use the index. But I'm simply not sure how to do that with the query above.
Hint: Our users can select different timezones, so we can't simply put one specific time zone value inside the index.
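For what it's worth, here is a minimal sketch of that idea (just an illustration of moving the expression off the sort key, not a full solution): for a fixed time zone the conversion is monotonic, so ordering by the raw publication_at column yields the same chronological order as ordering by the converted value (apart from how ties within a DST fall-back hour are broken). Everything else stays as above:

SELECT
    assigned_user_id,
    channel_id,
    id,
    row_number() OVER (
        PARTITION BY DATE(publication_at AT TIME ZONE 'Europe/Vienna')
        ORDER BY publication_at ASC, id ASC
    ) AS "row_number",
    TO_CHAR(publication_at AT TIME ZONE 'Europe/Vienna', 'YYYY-MM-DD') AS "day",
    COUNT(*) OVER (
        PARTITION BY DATE(publication_at AT TIME ZONE 'Europe/Vienna')
    ) AS "count_per_day"
FROM posts
WHERE client_id = 159
  AND publication_at BETWEEN '2016-12-25 23:00:00' AND '2017-02-05 22:59:59'
ORDER BY publication_at, id ASC   -- raw indexed column instead of the AT TIME ZONE expression

The window functions still sort on the derived local date, so this alone does not make the whole query index-driven; it only keeps the AT TIME ZONE expression out of the final ORDER BY.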
I have the following tables:
Collections (around 100 records)
Events (related to collections, around 6000 per collection)
Sales (related to events, in total around 2m records)
I need to get all sales for a certain collection, sorted by either sales.timestamp DESC (a datetime field) or sales.id DESC (the id is already in insertion order). To filter by collection, however, I first need to join the events table and then filter by e.collection_id.
To help with this I created a separate index on timestamp: idx_sales_timestamp_desc (timestamp DESC NULLS LAST), alongside the usual pkey index on sales.id
EXPLAIN (analyze,buffers) SELECT * from sales s
LEFT JOIN events e ON s.sales_event = e.sales_event
WHERE e.collection_id = 9
ORDER BY s.id DESC -- identical results with s.timestamp
LIMIT 10;
Without the ORDER BY:
Limit (cost=0.85..196.61 rows=10 width=619) (actual time=0.069..2.416 rows=10 loops=1)
Buffers: shared hit=172 read=3
I/O Timings: read=1.810
-> Nested Loop (cost=0.85..122231.34 rows=6244 width=619) (actual time=0.068..2.413 rows=10 loops=1)
Buffers: shared hit=172 read=3
I/O Timings: read=1.810
-> Index Scan using idx_events_collection_id on events e (cost=0.43..32359.71 rows=9551 width=206) (actual time=0.027..0.074 rows=47 loops=1)
Index Cond: (collection_id = 9)
Buffers: shared hit=24
-> Index Scan using idx_sales_sales_event on sales s (cost=0.42..9.39 rows=2 width=413) (actual time=0.049..0.049 rows=0 loops=47)
Index Cond: (sales_event = e.sales_event)
Buffers: shared hit=148 read=3
I/O Timings: read=1.810
Planning:
Buffers: shared hit=20
Planning Time: 0.418 ms
Execution Time: 2.444 ms
With the ORDER BY:
Limit (cost=1001.00..3353.78 rows=10 width=619) (actual time=1084.650..2353.191 rows=10 loops=1)
Buffers: shared hit=81908 read=6967 dirtied=1930
I/O Timings: read=3732.771
-> Gather Merge (cost=1001.00..1470076.56 rows=6244 width=619) (actual time=1084.649..2352.683 rows=10 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=81908 read=6967 dirtied=1930
I/O Timings: read=3732.771
-> Nested Loop (cost=0.98..1468355.82 rows=2602 width=619) (actual time=622.297..1768.414 rows=6 loops=3)
Buffers: shared hit=81908 read=6967 dirtied=1930
I/O Timings: read=3732.771
-> Parallel Index Scan Backward using sales_pkey on sales s (cost=0.42..58907.93 rows=303693 width=413) (actual time=0.237..301.251 rows=6008 loops=3)
Buffers: shared hit=9958 read=1094 dirtied=1479
I/O Timings: read=513.609
-> Index Scan using events_pkey on events e (cost=0.55..4.64 rows=1 width=206) (actual time=0.243..0.243 rows=0 loops=18024)
Index Cond: (sales_event = s.sales_event)
Filter: (collection_id = 9)
Rows Removed by Filter: 0
Buffers: shared hit=71950 read=5873 dirtied=451
I/O Timings: read=3219.161
Planning:
Buffers: shared hit=20
Planning Time: 0.268 ms
Execution Time: 2354.905 ms
sales DDL:
-- Table Definition ----------------------------------------------
CREATE TABLE sales (
id BIGSERIAL PRIMARY KEY,
transaction text,
sales_event text,
price bigint,
name text,
timestamp timestamp with time zone
);
-- Indices -------------------------------------------------------
CREATE UNIQUE INDEX sales_pkey ON sales(id int8_ops);
CREATE INDEX idx_sales_sales_event ON sales(sales_event text_ops);
CREATE INDEX idx_sales_timestamp_desc ON sales(timestamp timestamptz_ops DESC);
Events DDL:
-- Table Definition ----------------------------------------------
CREATE TABLE events (
created_at timestamp with time zone,
updated_at timestamp with time zone,
sales_event text PRIMARY KEY,
collection_id bigint REFERENCES collections(id)
);
-- Indices -------------------------------------------------------
CREATE UNIQUE INDEX events_pkey ON events(sales_event text_ops);
Without the ORDER BY, I'm at around 500 ms. With the ORDER BY, it easily ends up taking 2 s to 3 min or longer, depending on DB load, despite all indexes being used according to the EXPLAIN.
The default order when omitting ORDER BY altogether is not the one I want, and I have kept ANALYZE up to date as well.
How do I solve this in a good way?
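One rewrite that is sometimes tried for this kind of LIMIT + ORDER BY + join pattern is to materialize the matching event keys first, so the join input is fixed before the sort. A sketch, assuming PostgreSQL 12 or later for AS MATERIALIZED; table and column names are the ones from the question:

WITH matching_events AS MATERIALIZED (
    SELECT sales_event
    FROM events
    WHERE collection_id = 9
)
SELECT s.*
FROM sales s
JOIN matching_events e ON s.sales_event = e.sales_event
ORDER BY s.id DESC
LIMIT 10;

The intent is to reproduce the fast (unordered) plan shape: fetch the few matching events, look up their sales via idx_sales_sales_event, then sort and apply the LIMIT. The optimization fence does not guarantee the planner won't still walk sales_pkey backwards, so this is an experiment rather than a sure fix.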
I have a table in Postgres with more than 70 million records that relates temperature to a certain time (day) and place (meteorological station). I need to do some calculations for a given period of time and a set of meteorological stations, such as sum, average, quartile and normal value. The query below takes about 30 seconds to return. How can I improve this wait?
This is the EXPLAIN (ANALYZE, BUFFERS) output for SELECT avg(p) as rain FROM waterbalances GROUP BY extract(month from date), extract(year from date);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=3310337.68..3314085.15 rows=13836 width=24) (actual time=21252.008..21252.624 rows=478 loops=1)
Group Key: (date_part('month'::text, (date)::timestamp without time zone)), (date_part('year'::text, (date)::timestamp without time zone))
Buffers: shared hit=6335 read=734014
-> Gather Merge (cost=3310337.68..3313566.30 rows=27672 width=48) (actual time=21251.984..21261.693 rows=1432 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=15841 read=2195624
-> Sort (cost=3309337.66..3309372.25 rows=13836 width=48) (actual time=21130.846..21130.862 rows=477 loops=3)
Sort Key: (date_part('month'::text, (date)::timestamp without time zone)), (date_part('year'::text, (date)::timestamp without time zone))
Sort Method: quicksort Memory: 92kB
Worker 0: Sort Method: quicksort Memory: 92kB
Worker 1: Sort Method: quicksort Memory: 92kB
Buffers: shared hit=15841 read=2195624
-> Partial HashAggregate (cost=3308109.29..3308386.01 rows=13836 width=48) (actual time=21130.448..21130.618 rows=477 loops=3)
Group Key: date_part('month'::text, (date)::timestamp without time zone), date_part('year'::text, (date)::timestamp without time zone)
Buffers: shared hit=15827 read=2195624
-> Parallel Seq Scan on waterbalances (cost=0.00..3009020.66 rows=39878483 width=24) (actual time=1.528..15460.388 rows=31914000 loops=3)
Buffers: shared hit=15827 read=2195624
Planning Time: 7.621 ms
Execution Time: 21262.552 ms
(20 rows)
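One common way to avoid re-scanning the whole table for every such report is to pre-aggregate into a materialized view and refresh it when new data is loaded. A sketch, reusing the table and column names from the question (this is not from the original post):

CREATE MATERIALIZED VIEW waterbalances_monthly AS
SELECT extract(year from date)  AS year,
       extract(month from date) AS month,
       avg(p)                   AS rain
FROM waterbalances
GROUP BY 1, 2;

-- refresh after loading new data
REFRESH MATERIALIZED VIEW waterbalances_monthly;

If the real queries also filter by station and period, the view would have to group by station (and possibly day) as well, since an average of monthly averages is not the same as the average over an arbitrary subset.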
This is my first question on Stack Overflow, so forgive me if it is not properly structured.
I have a table t_table with a datetime column d_datetime, and I need to filter data from the past 5 days. All of the following work locally, where I have less data:
query 1.
SELECT * FROM t_table
WHERE d_datetime
BETWEEN '2020-08-28T00:00:00.024Z' AND '2020-09-02T00:00:00.024Z';
query 2.
SELECT * FROM t_table
WHERE d_datetime
BETWEEN (NOW() - INTERVAL '5 days') AND NOW();
query 3.
SELECT * FROM t_table
WHERE d_datetime > NOW() - INTERVAL '5 days';
However, when I move to the live database, only the first query runs to completion, in about 10 seconds. I cannot tell why, but the other two just get stuck consuming too much processing power, and I haven't once seen them run to completion, even after waiting up to 5 minutes.
I have tried generating the strings used for d_datetime in the first query automatically, using:
query 4.
SELECT * FROM t_table
WHERE d_datetime
BETWEEN
(TO_CHAR(NOW() - INTERVAL '5 days', 'YYYY-MM-ddThh:MI:SS.024Z'))
AND
(TO_CHAR(NOW(), 'YYYY-MM-ddThh:MI:SS.024Z'))
but it throws the following error:
operator does not exist: timestamp without time zone >= text
My questions are:
Is there any particular reason why query 1 is so fast while the rest take an extremely long time to run on a large dataset?
Why does query 4 fail when it generates practically the same string format as query 1 ('YYYY-MM-ddThh:mm:ss.024Z')?
The following is the EXPLAIN result for the first query:
EXPLAIN SELECT * FROM t_table
WHERE d_datetime
BETWEEN '2020-08-28T00:00:00.024Z' AND '2020-09-02T00:00:00.024Z';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize HashAggregate (cost=31346.37..31788.13 rows=35341 width=22) (actual time=388.622..388.845 rows=6 loops=1)
Output: count(_hyper_12_67688_chunk.octets), _hyper_12_67688_chunk.application, (date_trunc('day'::text, _hyper_12_67688_chunk.entry_time))
Group Key: (date_trunc('day'::text, _hyper_12_67688_chunk.entry_time)), _hyper_12_67688_chunk.application
Buffers: shared hit=17193
-> Gather (cost=27105.45..31081.31 rows=35341 width=22) (actual time=377.109..398.285 rows=11 loops=1)
Output: _hyper_12_67688_chunk.application, (date_trunc('day'::text, _hyper_12_67688_chunk.entry_time)), (PARTIAL count(_hyper_12_67688_chunk.octets))
Workers Planned: 1
Workers Launched: 1
Buffers: shared hit=17193
-> Partial HashAggregate (cost=26105.45..26547.21 rows=35341 width=22) (actual time=174.272..174.535 rows=6 loops=2)
Output: _hyper_12_67688_chunk.application, (date_trunc('day'::text, _hyper_12_67688_chunk.entry_time)), PARTIAL count(_hyper_12_67688_chunk.octets)
Group Key: date_trunc('day'::text, _hyper_12_67688_chunk.entry_time), _hyper_12_67688_chunk.application
Buffers: shared hit=17193
Worker 0: actual time=27.942..28.206 rows=5 loops=1
Buffers: shared hit=579
-> Result (cost=1.73..25272.75 rows=111027 width=18) (actual time=0.805..141.094 rows=94662 loops=2)
Output: _hyper_12_67688_chunk.application, date_trunc('day'::text, _hyper_12_67688_chunk.entry_time), _hyper_12_67688_chunk.octets
Buffers: shared hit=17193
Worker 0: actual time=1.576..23.928 rows=6667 loops=1
Buffers: shared hit=579
-> Parallel Append (cost=1.73..23884.91 rows=111027 width=18) (actual time=0.800..114.488 rows=94662 loops=2)
Buffers: shared hit=17193
Worker 0: actual time=1.572..20.204 rows=6667 loops=1
Buffers: shared hit=579
-> Parallel Bitmap Heap Scan on _timescaledb_internal._hyper_12_67688_chunk (cost=1.73..11.23 rows=8 width=17) (actual time=1.570..1.618 rows=16 loops=1)
Output: _hyper_12_67688_chunk.octets, _hyper_12_67688_chunk.application, _hyper_12_67688_chunk.entry_time
Recheck Cond: ((_hyper_12_67688_chunk.entry_time >= '2020-08-28 05:45:03.024'::timestamp without time zone) AND (_hyper_12_67688_chunk.entry_time <= '2020-09-02 11:45:03.024'::timestamp without time zone))
Filter: ((_hyper_12_67688_chunk.application)::text = 'dns'::text)
Rows Removed by Filter: 32
Buffers: shared hit=11
Worker 0: actual time=1.570..1.618 rows=16 loops=1
Buffers: shared hit=11
-> Bitmap Index Scan on _hyper_12_67688_chunk_dpi_applications_entry_time_idx (cost=0.00..1.73 rows=48 width=0) (actual time=1.538..1.538 rows=48 loops=1)
Index Cond: ((_hyper_12_67688_chunk.entry_time >= '2020-08-28 05:45:03.024'::timestamp without time zone) AND (_hyper_12_67688_chunk.entry_time <= '2020-09-02 11:45:03.024'::timestamp without time zone))
Buffers: shared hit=2
Worker 0: actual time=1.538..1.538 rows=48 loops=1
Buffers: shared hit=2
-> Parallel Index Scan Backward using _hyper_12_64752_chunk_dpi_applications_entry_time_idx on _timescaledb_internal._hyper_12_64752_chunk (cost=0.14..2.36 rows=1 width=44) (actual time=0.040..0.076 rows=52 loops=1)
Output: _hyper_12_64752_chunk.octets, _hyper_12_64752_chunk.application, _hyper_12_64752_chunk.entry_time
Index Cond: ((_hyper_12_64752_chunk.entry_time >= '2020-08-28 05:45:03.024'::timestamp without time zone) AND (_hyper_12_64752_chunk.entry_time <= '2020-09-02 11:45:03.024'::timestamp without time zone))
Filter: ((_hyper_12_64752_chunk.application)::text = 'dns'::text)
Rows Removed by Filter: 52
Buffers: shared hit=
-- cut logs
-> Parallel Seq Scan on _timescaledb_internal._hyper_12_64814_chunk (cost=0.00..2.56 rows=14 width=17) (actual time=0.017..0.038 rows=32 loops=1)
Output: _hyper_12_64814_chunk.octets, _hyper_12_64814_chunk.application, _hyper_12_64814_chunk.entry_time
Filter: ((_hyper_12_64814_chunk.entry_time >= '2020-08-28 05:45:03.024'::timestamp without time zone) AND (_hyper_12_64814_chunk.entry_time <= '2020-09-02 11:45:03.024'::timestamp without time zone) AND ((_hyper_12_64814_chunk.application)::text = 'dns'::text))
Rows Removed by Filter: 40
Buffers: shared hit=2
-> Parallel Seq Scan on _timescaledb_internal._hyper_12_62262_chunk (cost=0.00..2.54 rows=9 width=19) (actual time=0.027..0.039 rows=15 loops=1)
Output: _hyper_12_62262_chunk.octets, _hyper_12_62262_chunk.application, _hyper_12_62262_chunk.entry_time
Filter: ((_hyper_12_62262_chunk.entry_time >= '2020-08-28 05:45:03.024'::timestamp without time zone) AND (_hyper_12_62262_chunk.entry_time <= '2020-09-02 11:45:03.024'::timestamp without time zone) AND ((_hyper_12_62262_chunk.application)::text = 'dns'::text))
Rows Removed by Filter: 37
Buffers: shared hit=2
Planning Time: 3367.445 ms
Execution Time: 417.245 ms
(7059 rows)
The Parallel Index Scan Backward using... log continues for all hypertable chunks in the table.
The other three queries mentioned above as unsuccessful still do not run to completion; they eventually end up filling the memory, so I cannot post their EXPLAIN results, sorry.
Please let me know if my question has not been properly structured. Thanks.
You are using a partitioned table that likely has a lot of partitions, because query planning alone takes more than 3 seconds.
You are probably using PostgreSQL v11 or earlier. v12 introduced partition pruning at execution time, while v11 can only exclude partitions at query planning time.
In your first query the WHERE condition contains constants, so planning-time pruning works. In the other queries the function now() is used, whose result value is only known at query execution time (it is STABLE, not IMMUTABLE), so partition pruning cannot take place at query planning time. Query planning and execution need not happen at the same time – think of prepared statements.
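For illustration, a sketch of what that implies in practice: compute the boundaries first (in the application, or in a separate statement) and send them back as literals, so v11 can prune partitions while planning. The timestamps below are simply the ones from query 1:

-- step 1: compute the boundaries
SELECT now() - interval '5 days' AS from_ts, now() AS to_ts;

-- step 2: run the actual query with the results interpolated as constants
SELECT * FROM t_table
WHERE d_datetime BETWEEN '2020-08-28T00:00:00.024Z' AND '2020-09-02T00:00:00.024Z';

Alternatively, upgrading to v12 or later gives you the runtime partition pruning mentioned above, so the now()-based variants can prune at execution time.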
I need to make a query that returns all requests that are not finished or canceled, from the beginning of the recordings up to a specific date. The way I'm doing it right now takes too much time and returns an error: 'User query might have needed to see row versions that must be removed' (my guess is it's due to a lack of RAM).
Below is the query I'm using, along with some information:
T1, where each new entry is saved, with an ID, creation date, status (open, closed) and other keys for several tables.
T2, where each change made to a request is saved (in progress, waiting, rejected and closed), together with the date of the change and other keys for other tables.
SELECT T1.id_request,
T1.dt_created,
T1.status
FROM T1
LEFT JOIN T2
ON T1.id_request = T2.id_request
WHERE (T1.dt_created >= '2012-01-01 00:00:00' AND T1.dt_created <= '2020-05-31 23:59:59')
AND T1.id_request NOT IN (SELECT T2.id_request
FROM T2
WHERE ((T2.dt_change >= '2012-01-01 00:00:00'
AND T2.dt_change <= '2020-05-31 23:59:59')
OR T2.dt_change IS NULL)
AND T2.status IN ('Closed','Canceled','rejected'))
My idea was to get everything that came in via T1 (I can't just retrieve what is open, since that only works for today, not for a specific past date, which is what I want) between the beginning of the records and, say, the end of May, and then use WHERE T1.id_request NOT IN (the T2.id_request values with status 'Closed' in the same period). But as I've said, it takes forever and returns an error.
I use this same code to get what was open for a specific month (1st to 30th) and it works perfectly fine.
Maybe this is not the best approach, but I couldn't think of any other way (I'm not an expert in SQL). If there's not enough information to provide an answer, feel free to ask.
As requested by @MikeOrganek, here is the EXPLAIN (ANALYZE, BUFFERS) output:
Nested Loop Left Join (cost=27985.55..949402.48 rows=227455 width=20) (actual time=2486.433..54832.280 rows=47726 loops=1)
Buffers: shared hit=293242 read=260670
Seq Scan on T1 (cost=27984.99..324236.82 rows=73753 width=20) (actual time=2467.499..6202.970 rows=16992 loops=1)
Filter: ((dt_created >= '2020-05-01 00:00:00-03'::timestamp with time zone) AND (dt_created <= '2020-05-31 23:59:59-03'::timestamp with time zone) AND (NOT (hashed SubPlan 1)))
Rows Removed by Filter: 6085779
Buffers: shared hit=188489 read=250098
SubPlan 1
Nested Loop (cost=7845.36..27983.13 rows=745 width=4) (actual time=129.379..1856.518 rows=168690 loops=1)
Buffers: shared hit=60760
Seq Scan on T3(cost=0.00..5.21 rows=3 width=8) (actual time=0.057..0.104 rows=3 loops=1)
Filter: ((status_request)::text = ANY ('{Closed,Canceled,rejected}'::text[]))
Rows Removed by Filter: 125
Buffers: shared hit=7
Bitmap Heap Scan on T2(cost=7845.36..9321.70 rows=427 width=8) (actual time=477.324..607.171 rows=56230 loops=3)
Recheck Cond: ((dt_change >= '2020-05-01 00:00:00-03'::timestamp with time zone) AND (dt_change <= '2020-05-31 23:59:59-03'::timestamp with time zone) AND (T2.ID_status= T3.ID_status))
Rows Removed by Index Recheck: 87203
Heap Blocks: exact=36359
Buffers: shared hit=60753
BitmapAnd (cost=7845.36..7845.36 rows=427 width=0) (actual time=473.864..473.864 rows=0 loops=3)
Buffers: shared hit=24394
Bitmap Index Scan on idx_ix_T2_dt_change (cost=0.00..941.81 rows=30775 width=0) (actual time=47.380..47.380 rows=306903 loops=3)
Index Cond: ((dt_change >= '2020-05-01 00:00:00-03'::timestamp with time zone) AND (dt_change<= '2020-05-31 23:59:59-03'::timestamp with time zone))
Buffers: shared hit=2523
Bitmap Index Scan on idx_T2_ID_status (cost=0.00..6895.49 rows=262724 width=0) (actual time=418.942..418.942 rows=2105165 loops=3)
Index Cond: (ID_status = T3.ID_status )
Buffers: shared hit=21871
Index Only Scan using idx_ix_T2_id_request on T2 (cost=0.56..8.30 rows=18 width=4) (actual time=0.369..2.859 rows=3 loops=16992)
Index Cond: (id_request = t17.id_request )
Heap Fetches: 44807
Buffers: shared hit=104753 read=10572
Planning time: 23.424 ms
Execution time: 54841.261 ms
And here is the main difference with dt_change IS NULL:
Planning time: 34.320 ms
Execution time: 230683.865 ms
Thanks
It looks like the OR T2.dt_change IS NULL is very costly, in that it increased the overall execution time by more than a factor of four.
The only option I can see is changing the NOT IN to a NOT EXISTS, as below.
SELECT T1.id_request,
T1.dt_created,
T1.status
FROM T1
LEFT JOIN T2
ON T1.id_request = T2.id_request
WHERE T1.dt_created >= '2012-01-01 00:00:00'
AND T1.dt_created <= '2020-05-31 23:59:59'
AND NOT EXISTS (SELECT 1
FROM T2
WHERE id_request = T1.id_request
AND ( ( dt_change >= '2012-01-01 00:00:00'
AND dt_change <= '2020-05-31 23:59:59')
OR dt_change IS NULL)
AND status IN ('Closed','Canceled','rejected'))
But I expect that to give you only a marginal improvement. Can you please see how much this change helps?
I have a table with a schema that looks like this:
id (uuid; pk)
timestamp (timestamp)
category (bpchar)
flagged_as_spam (bool)
flagged_as_bot (bool)
... (other metadata)
I have an index on this table that looks like this:
CREATE INDEX time_index ON events_table USING btree (flagged_as_bot, flagged_as_spam, category, "timestamp") WHERE ((flagged_as_bot = false) AND (flagged_as_spam = false))
I run queries against this table to generate line charts representing the number of events that occurred each day. However, I want the line chart to be adjusted for the user's timezone. Currently, I have a query that looks like this:
SELECT
date_trunc('day', timestamp + INTERVAL '-5 hour') AS ts,
category,
COUNT(*) AS count
FROM
events_table
WHERE
category = 'the category'
AND flagged_as_bot = FALSE
AND flagged_as_spam = FALSE
AND timestamp >= '2018-05-04T00:00:00'::timestamp
AND timestamp < '2018-10-31T17:57:59.661664'::timestamp
GROUP BY
ts,
category
ORDER BY
1 ASC
In most cases, for categories with under 100,000 records, this is quite fast:
GroupAggregate (cost=8908.56..8958.18 rows=1985 width=70) (actual time=752.886..753.301 rows=124 loops=1)
Group Key: (date_trunc('day'::text, ("timestamp" + '-05:00:00'::interval))), category
-> Sort (cost=8908.56..8913.52 rows=1985 width=62) (actual time=752.878..752.983 rows=797 loops=1)
Sort Key: (date_trunc('day'::text, ("timestamp" + '-05:00:00'::interval)))
Sort Method: quicksort Memory: 137kB
-> Bitmap Heap Scan on listens (cost=552.79..8799.83 rows=1985 width=62) (actual time=748.683..752.568 rows=797 loops=1)
Recheck Cond: ((category = '7248c3b8-727e-4357-a267-e9b0e3e36d4b'::bpchar) AND ("timestamp" >= '2018-05-04 00:00:00'::timestamp without time zone) AND ("timestamp" < '2018-10-31 17:57:59.661664'::timestamp without time zone))
Filter: ((NOT flagged_as_bot) AND (NOT flagged_as_spam))
Rows Removed by Filter: 1576
Heap Blocks: exact=1906
-> Bitmap Index Scan on time_index (cost=0.00..552.30 rows=2150 width=0) (actual time=748.324..748.324 rows=2373 loops=1)
Index Cond: ((category = '7248c3b8-727e-4357-a267-e9b0e3e36d4b'::bpchar) AND ("timestamp" >= '2018-05-04 00:00:00'::timestamp without time zone) AND ("timestamp" < '2018-10-31 17:57:59.661664'::timestamp without time zone))
Planning time: 0.628 ms
Execution time: 753.362 ms
For categories with a very large number of records (>100,000), the index is not used and the query is very slow:
GroupAggregate (cost=1232229.95..1287491.60 rows=2126204 width=70) (actual time=14649.671..17178.955 rows=181 loops=1)
Group Key: (date_trunc('day'::text, ("timestamp" + '-05:00:00'::interval))), category
-> Sort (cost=1232229.95..1238072.10 rows=2336859 width=62) (actual time=14643.887..16031.031 rows=3070695 loops=1)
Sort Key: (date_trunc('day'::text, ("timestamp" + '-05:00:00'::interval)))
Sort Method: external merge Disk: 216200kB
-> Seq Scan on listens (cost=0.00..809314.38 rows=2336859 width=62) (actual time=0.015..9572.722 rows=3070695 loops=1)
Filter: ((NOT flagged_as_bot) AND (NOT flagged_as_spam) AND ("timestamp" >= '2018-05-04 00:00:00'::timestamp without time zone) AND ("timestamp" < '2018-10-31 17:57:59.661664'::timestamp without time zone) AND (category = '3b634b32-bb82-4f56-ada4-f4b7bc4288a5'::bpchar))
Rows Removed by Filter: 8788028
Planning time: 0.239 ms
Execution time: 17228.314 ms
My assumption is that the index is not used because the overhead of using the index is far higher than simply performing a table scan. And of course, I imagine this is because of the use of date_trunc to calculate the date to group by.
I've considered what could be done here. Here are some of my thoughts:
Most simply, I could create an expression index for each timezone offset that I care about (generally GMT/EST/CST/MST/PST). This would take up a good deal of space and each index would be infrequently used, but it would theoretically allow for index-only scans (a sketch follows at the end of this post).
I could create an expression index truncated by hour. I don't know if this would help Postgres to optimize the query, though.
I could calculate each of the date ranges in advance and using some subquery magic, query the event count per range. I also don't know if this will yield any improvement.
Before I go down a rabbit hole, I figured I'd reach out to see if anyone has any thoughts.
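For reference, here is roughly what the first option (one partial expression index per supported offset) could look like, using the -5 hour offset from the query above; the index name is made up:

CREATE INDEX time_index_minus5 ON events_table USING btree
    (category, date_trunc('day', "timestamp" + INTERVAL '-5 hour'))
    WHERE flagged_as_bot = false AND flagged_as_spam = false;

Because date_trunc on a timestamp without time zone is immutable, the expression is indexable, and it matches the GROUP BY expression in the query exactly; one such index would be needed per offset, as noted above.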