Optimizing date queries in PostgreSQL

I'm having a hard time optimizing queries on a very big table. Basically, all of them filter the result set by date:
SELECT * FROM bigtable WHERE date >= '2015-01-01' AND date <= '2016-01-01' ORDER BY date DESC;
Adding the following date index actually makes things worse:
CREATE INDEX CONCURRENTLY bigtable_date_index ON bigtable(date(date));
That is, without the index the query takes about 1 second to run, and with it about 10 seconds. But with bigger ranges and additional filtering it is very slow even without that index.
I'm using PostgreSQL 9.4, and I see that 9.5 has some improvements for sorting that might help?
Would a BRIN index help in this case?

For an index to be effective, it needs to index the same thing you're filtering by. In this case, you're filtering by date, but you appear to have indexed date(date), so the index can't be used.
Either filter your table using date(date):
SELECT * FROM bigtable
WHERE date(date) >= '2015-01-01' AND date(date) <= '2016-01-01'
ORDER BY date(date) DESC;
Or index the naked date:
CREATE INDEX CONCURRENTLY bigtable_date_index ON bigtable(date);
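A quick way to confirm the fix (a sketch; the table and column names are taken from the question) is to run EXPLAIN and check that the new index is actually chosen:
EXPLAIN
SELECT * FROM bigtable
WHERE date >= '2015-01-01' AND date <= '2016-01-01'
ORDER BY date DESC;
-- With the index on the bare column, expect an Index Scan Backward using
-- bigtable_date_index instead of a Seq Scan followed by a Sort; for very
-- wide ranges the planner may still prefer the sequential scan.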

Related

PSQL - Performance of query filtered by intervals on two separate fields

I have got a PostgreSQL table that covers time intervals.
This is a simplified structure of my table
CREATE TABLE intervals (
    name varchar(40),
    time_from timestamp,
    time_to timestamp
);
The table contains millions of records, but if you filter at a specific point of time in the past, the number of records for which
time_from <= [requested time] <= time_to
is always very limited (not more than 3k results). So a query like this one
SELECT *
FROM intervals
WHERE time_from <= '2020-01-01T10:00:00' and time_to >= '2020-01-01T10:00:00'
is supposed to return a relatively small number of results and, in theory, it should be quite fast if I used the correct index. But it's not fast at all.
I tried adding a combined index on time_from and time_to, but the engine doesn't pick it.
Seq Scan on intervals (cost=0.00..156152.46 rows=428312 width=32) (actual time=13.223..3599.840 rows=4981 loops=1)
Filter: ((time_from <= '2020-01-01T10:00:00') AND (time_to >= '2020-01-01T10:00:00'))
Rows Removed by Filter: 2089650
Planning Time: 0.159 ms
Execution Time: 3600.618 ms
What type of index should I add, in order to speed up this query?
A btree index cannot be very efficient here. It can quickly throw out everything whose time_from > '2020-01-01T10:00:00', but that is probably not all that much of the table (at least, not if your table goes back for many years). Once the first column of the index has been consumed in this way, the next column cannot be used very efficiently. It can only jump to a specific part of time_to values within the ties of time_from, and that is just not very useful as there are probably not all that many ties. (At least, not that it can prove to itself while planning your query).
What you need is a GiST index, which specializes in this kind of multi-dimensional thing:
create extension btree_gist;
create index on intervals using gist (time_from,time_to);
This index will support your query as written. Another possibility is to build time ranges and index those, rather than the separate begin and end points.
-- this one does not need btree_gist.
create index on intervals using gist (tsrange(time_from,time_to));
But this index forces you to write the query differently:
SELECT * FROM intervals
WHERE tsrange(time_from,time_to) @> '2020-01-01T10:00:00'::timestamp
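For completeness, a sketch (using the names from the question) of how to verify that the range query actually hits the functional GiST index:
EXPLAIN ANALYZE
SELECT *
FROM intervals
WHERE tsrange(time_from, time_to) @> '2020-01-01T10:00:00'::timestamp;
-- Expect a Bitmap Index Scan (or Index Scan) on the tsrange GiST index
-- instead of the Seq Scan shown in the question.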

Postgres query planner filter order affected by using now on sequential scan

I have a query where, with sequential scans enabled on the Postgres database and now() used in the WHERE clause, the query planner prefers a sequential scan of the table followed by a filter:
EXPLAIN ANALYZE
SELECT
    action_id
FROM
    events
WHERE
    started_at IS NULL
    AND deleted_at IS NULL
    AND due_at < now()
    AND due_at > now() - interval '14 days'
LIMIT 1
FOR UPDATE
SKIP LOCKED;
Example:
https://explain.depesz.com/s/xLlM
Query with enable_seqscan db parameter set to false:
https://explain.depesz.com/s/e8Fe
I am looking to help the query optimiser use the index.
I suspect that because the rows matching the started_at IS NULL AND deleted_at IS NULL filters make up roughly 13% of the total table rows (and the due_at column is completely unique and uncorrelated), the query optimiser is pessimistic about finding a match quickly enough using the index, but in fact that's not the case.
EDIT: For the time being I have restructured the query like so:
SELECT
    id,
    previous_event_id,
    due_at,
    action_id,
    subscription_url
FROM (
    SELECT
        id,
        previous_event_id,
        due_at,
        action_id,
        subscription_url
    FROM events
    WHERE
        started_at IS NULL
        AND deleted_at IS NULL
    LIMIT 100
    FOR UPDATE SKIP LOCKED
) events_to_pick_from
WHERE EXISTS (
    SELECT 1
    FROM events
    WHERE
        events_to_pick_from.due_at < now()
        AND events_to_pick_from.due_at > now() - interval '14 days'
        AND events.action_id = events_to_pick_from.action_id
)
LIMIT 1
FOR UPDATE SKIP LOCKED;
https://explain.depesz.com/s/fz2h
But I would be grateful for other suggestions.
Both queries have the same execution plan.
The difference is that the query with the constants happens to find a row that matches the condition quickly, after reading only 27 rows from the table.
The query using now() does not find a single matching row in the table (actual rows=0), but it has to scan all 7 million rows before it knows for sure.
An index on due_at should improve the performance considerably.
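A minimal sketch of such an index, using the column and table names from the question (the index name is made up); a partial index mirroring the IS NULL filters keeps it small, though a plain index on due_at alone should also help:
CREATE INDEX CONCURRENTLY events_due_at_pending_idx
    ON events (due_at)
    WHERE started_at IS NULL AND deleted_at IS NULL;
-- Only rows that are still pending are indexed, so the due_at range check
-- can locate candidate rows without scanning the whole table.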

How to keep postgres statistics up to date to encourage the best index to be selected

I have a Notifications table with approximately 7,000,000 records where the relevant columns are:
id: integer
time_created: timestamp with time zone
device_id: integer (foreign key to another table)
And the indexes:
CREATE INDEX notifications_device ON notifications (device_id);
CREATE INDEX notifications_time ON notifications (time_created);
And my query:
SELECT COUNT(*) AS "__count"
FROM "notifications"
WHERE ("notifications"."device_id" IN (
        SELECT "id" FROM device WHERE (
            device."device_type" = 'iOS' AND
            device."registration_id" IN (
                'XXXXXXX',
                'YYYYYYY',
                'ZZZZZZZ'
            )
        )
    )
    AND "notifications"."time_created" BETWEEN
        '2020-10-26 00:00:00' AND '2020-10-26 17:33:00')
;
For most of the day, this query will use the index on device_id and run in under 1ms. But once the table is being written to rapidly (logging notifications as they are sent), the planner switches to using the index on time_created and the query blows out to 300ms.
Running an ANALYZE NOTIFICATIONS immediately fixes the problem, and the index on device_id is used again.
The table is pruned to the last 30 days each night, which is why there is a separate index on the time_created column.
Can I fix this issue, so that the planner always chooses the index on device_id, by forcing postgres to maintain better statistics on this table? Alternatively, can I re-write the time_created index (perhaps by using a different index type like BRIN) so that it'd only be considered for a WHERE clause like time_created < ..30 days ago.. and not WHERE time_created BETWEEN midnight and now?
EXPLAIN ANALYZE stats:
Bad Plan (time_created):
Rows Removed by Filter = 20926
Shared Hit Blocks = 143934
Plan Rows = 38338
Actual Rows = 84479
Good Plan (device_id):
Rows Removed by Filter = 95
Shared Hit Blocks = 34
Plan Rows = 1
Actual Rows = 0
I would actually suggest a composite index on the notifications table:
CREATE INDEX idx1 ON notifications (device_id, time_created);
This index would cover both restrictions in the current WHERE clause. I would also add an index on the device table:
CREATE INDEX idx2 ON device (device_type, registration_id, id);
The first two columns of this 3-column index would cover the WHERE clause of the subquery. It also includes the id column to completely cover the SELECT clause. If used, Postgres could more rapidly evaluate the subquery on the device table.
You could also play around with some slight variants of the above two indices, by changing column order. For example, you could also try:
CREATE INDEX idx1 ON notifications (time_created, device_id);
CREATE INDEX idx2 ON device (registration_id , device_type, id);
The table is pruned to the last 30 days each night, which is why there is a separate index on the time_created column.
But, is that a good reason to have the index? Does it matter if the nightly query takes a little longer? Indeed, for deleting 3% of a table, does it even use the index and if it does, does that actually make it faster? Maybe you could replace the index with partitioning, or with nothing.
In any case, you can use this ugly hack to force it not to use the index:
AND "notifications"."time_created" + interval '0 seconds' BETWEEN '2020-10-26 00:00:00' AND '2020-10-26 17:33:00'

Postgres table partitioning not improving query speed

I have a query that's run on a table (TABLE_A) that's partitioned by recTime…
WITH subquery AS (
    SELECT count(*) AS cnt, date_trunc('day', recTime) AS recTime
    FROM TABLE_A
    WHERE (recTime >= to_timestamp('2018-Nov-03 00:00:00','YYYY-Mon-DD HH24:MI:SS')
           AND recTime <= to_timestamp('2018-Nov-03 23:59:59','YYYY-Mon-DD HH24:MI:SS'))
    GROUP BY date_trunc('day', recTime)
)
UPDATE sumDly
SET fixes = subquery.cnt
FROM subquery
WHERE sumDly.Day = subquery.recTime
If I do an EXPLAIN on the query as shown above, it's apparent that the database is doing an index scan on each of the partitions in the parent table. The associated cost is high and the elapsed time is ridiculous.
If I explicitly force the use of the partition that actually holds the data by replacing…
from TABLE_A
with…
from TABLE_A_20181103
Then the plan uses only the required partition, and the query takes only a few minutes (and the returned results are the same as before).
QUESTION - Why does the database want to scan all the partitions in the table? I thought the whole idea of partitioning was to help the database eliminate vast swathes of unneeded data in the first pass, rather than forcing a scan on all the indexes in the individual partitions?
UPDATE - I am using version 10.5 of postgres
SET constraint_exclusion = on and make sure to hard-code the constraints into the query. Note that you have indeed hard-coded the constraints in the query:
...
WHERE (recTime >= to_timestamp('2018-Nov-03 00:00:00','YYYY-Mon-DD HH24:MI:SS')
AND recTime <= to_timestamp('2018-Nov-03 23:59:59','YYYY-Mon-DD HH24:MI:SS'))
...
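One detail worth checking (an observation not spelled out in the answer above): to_timestamp() is only STABLE, not IMMUTABLE, so on 10.x the planner cannot fold it to a constant and therefore cannot exclude partitions at plan time. Writing the bounds as literal timestamps gives it constants it can prune with:
WHERE recTime >= TIMESTAMP '2018-11-03 00:00:00'
  AND recTime <= TIMESTAMP '2018-11-03 23:59:59'
-- With literal constants the plan should touch only TABLE_A_20181103,
-- matching what you saw when querying that partition directly.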

Postgres combining date and time fields, is this efficient

I am selecting rows based on a date range (supplied as strings) using the SQL below, which works, but is this an efficient way of doing it? As you can see, the date and the time are held in separate fields. From my memory of doing Oracle work, as soon as you put a function around an attribute it can't use indexes.
select *
from events
where venue_id = '2'
and EXTRACT(EPOCH FROM (start_date + start_time))
between EXTRACT(EPOCH FROM ('2017-09-01 00:00')::timestamp)
and EXTRACT(EPOCH FROM ('2017-09-30 00:00')::timestamp)
So is there a way of doing this that can use indexes?
Preface: Since your query is limited to a single venue_id, both examples below create a compound index with venue_id first.
If you want an index for improving that query, you can create an expression index:
CREATE INDEX events_start_idx
ON events (venue_id, (EXTRACT(EPOCH FROM (start_date + start_time))));
If you don't want a dedicated expression index, you can create a normal index on the start_date column and add extra logic so the query can use it. The index then limits the access path to the date range, and fringe records with the wrong time of day on the first and last dates are filtered out afterwards.
In the following, I'm also eliminating the unnecessary extraction of epoch.
CREATE INDEX events_venue_start
ON events (venue_id, start_date);
SELECT *
FROM events
WHERE venue_id = '2'
AND start_date BETWEEN '2017-09-01'::date AND '2017-09-30'::date
AND start_date + start_time BETWEEN '2017-09-01 00:00'::timestamp
AND '2017-09-30 00:00'::timestamp
The first two parts of the WHERE clause will use the index to full benefit. The last part is then used to filter the records found by the index.
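A variant not spelled out above, offered as a sketch (the index name is made up): since date + time is an immutable operation, an expression index on the combined timestamp can serve the query directly, as long as the expression in the WHERE clause matches the indexed expression exactly:
CREATE INDEX events_venue_start_ts
    ON events (venue_id, (start_date + start_time));

SELECT *
FROM events
WHERE venue_id = '2'
  AND start_date + start_time BETWEEN '2017-09-01 00:00'::timestamp
                                  AND '2017-09-30 00:00'::timestamp;
-- The planner can use the expression index for both the venue filter and
-- the combined date/time range, without the separate date-only pre-filter.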