Index for self-join on timestamp range and user_id - postgresql

I have a table in a postgresql (10.2) database something like this...
create table test (user_id text, event_time timestamp, ...);
I'd like to use this table in a self join, to match records to other records from the same user_id and an event_time within the next 5 minutes. Something like this...
select
*
from
test as a
inner join
test as b
on
a.user_id = b.user_id
and a.event_time < b.event_time
and a.event_time > b.event_time - interval '5 minutes';
This works fine, but I'd ideally like to make it faster. I've gotten the join to use an index on user_id, but I'm wondering if it's possible to make an index to use both user_id AND the timestamp?
I've tried making a gist index on a tsrange from the event time to the event time plus 5 minutes, but Postgres seemed to just use the user_id index in that case. I tried making a multi-column index on the user_id and the tsrange, but that doesn't seem supported.
Finally, I tried making an index on just the timestamp.
None of that seemed to help.
However, the timestamps cover a long time period, and I'm only interested in a 5-minute window, which intuitively feels like something a good index should help with.
Can this be accomplished?

A multi-column B-tree index on (user_id, event_time) should work. A GiST index on the range would need to include the user_id as well (which requires the btree_gist extension for the text column), and it would be less versatile, since it would only work with the fixed interval of 5 minutes. I wouldn't use it unless what you actually want is to establish an exclusion constraint on the table.
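As a minimal sketch of that suggestion, assuming the table is named test as in the question's query (the index name is made up for illustration):

```sql
-- Hypothetical index name; the columns are taken from the question.
CREATE INDEX test_user_id_event_time_idx ON test (user_id, event_time);

-- Same join as in the question, rearranged so that b.event_time is bounded
-- on both sides by expressions of a.event_time. For each row of a, the
-- planner can then consider a range scan on the index to find matching b rows.
SELECT *
FROM test AS a
INNER JOIN test AS b
    ON  b.user_id = a.user_id
    AND b.event_time > a.event_time
    AND b.event_time < a.event_time + interval '5 minutes';
```

Whether the planner actually picks the index depends on the data distribution; comparing EXPLAIN (ANALYZE, BUFFERS) output before and after creating it is the way to check.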

Related

Get max timestamps efficiently for large table for a set of ids

I have a large PostgreSQL db table (actually lots of partition tables divided up by yearly quarters) that, for simplicity's sake, is defined something like
id bigint
ts (timestamp)
value (float)
For a particular set of ids, what is an efficient way of finding the last timestamp in the table for each specified id?
The table is indexed by (id, timestamp)
If I do something naive like
SELECT sensor_id, MAX(ts)
FROM sensor_values
WHERE ts >= (NOW() + INTERVAL '-100 days') :: TIMESTAMPTZ
GROUP BY 1;
Things are pretty slow.
Is there a way of perhaps narrowing down the times first by a binary search of one id?
(I can assume the timestamps are similar for a particular set of ids.)
I am accessing the db through psycopg so the solution can be in code or SQL if I am missing something easy to speed this up.
The explain for the query can be seen here. https://explain.depesz.com/s/PVqg
Any ideas appreciated.

How to keep postgres statistics up to date to encourage the best index to be selected

I have a Notifications table with approximately 7,000,000 records where the relevant columns are:
id: integer
time_created: timestamp with time zone
device_id: integer (foreign key to another table)
And the indexes:
CREATE INDEX notifications_device ON notifications (device_id);
CREATE INDEX notifications_time ON notifications (time_created);
And my query:
SELECT COUNT(*) AS "__count"
FROM "notifications"
WHERE ("notifications"."device_id" IN (
SELECT "id" FROM device WHERE (
device."device_type" = 'iOS' AND
device."registration_id" IN (
'XXXXXXX',
'YYYYYYY',
'ZZZZZZZ'
)
)
)
AND "notifications"."time_created" BETWEEN
'2020-10-26 00:00:00' AND '2020-10-26 17:33:00')
;
For most of the day, this query will use the index on device_id, and will run in under 1ms. But once the table is written to very quickly (logging notifications sent) the planner switches to using the index on time_created and the query blows out to 300ms.
Running an ANALYZE NOTIFICATIONS immediately fixes the problem, and the index on device_id is used again.
The table is pruned to the last 30 days each night, which is why there is a separate index on the time_created column.
Can I fix this issue, so that the planner always chooses the index on device_id, by forcing postgres to maintain better statistics on this table? Alternatively, can I re-write the time_created index (perhaps by using a different index type like BRIN) so that it'd only be considered for a WHERE clause like time_created < ..30 days ago.. and not WHERE time_created BETWEEN midnight and now?
EXPLAIN ANALYZE stats:
Bad Plan (time_created):
Rows Removed by Filter = 20926
Shared Hit Blocks = 143934
Plan Rows = 38338
Actual Rows = 84479
Good Plan (device_id):
Rows Removed by Filter = 95
Shared Hit Blocks = 34
Plan Rows = 1
Actual Rows = 0
I would actually suggest a composite index on the notifications table:
CREATE INDEX idx1 ON notifications (device_id, time_created);
This index would cover both restrictions in the current WHERE clause. I would also add an index on the device table:
CREATE INDEX idx2 ON device (device_type, registration_id, id);
The first two columns of this 3-column index would cover the WHERE clause of the subquery. It also includes the id column to completely cover the SELECT clause. If used, Postgres could more rapidly evaluate the subquery on the device table.
You could also play around with some slight variants of the above two indices, by changing column order. For example, you could also try:
CREATE INDEX idx1 ON notifications (time_created, device_id);
CREATE INDEX idx2 ON device (registration_id, device_type, id);
The table is pruned to the last 30 days each night, which is why there is a separate index on the time_created column.
But, is that a good reason to have the index? Does it matter if the nightly query takes a little longer? Indeed, for deleting 3% of a table, does it even use the index and if it does, does that actually make it faster? Maybe you could replace the index with partitioning, or with nothing.
In any case, you can use this ugly hack to force it not to use the index:
AND "notifications"."time_created" + interval '0 seconds' BETWEEN '2020-10-26 00:00:00' AND '2020-10-26 17:33:00'
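On the statistics side of the question, one option that is sometimes used is to lower the per-table autovacuum thresholds so that an automatic ANALYZE is triggered sooner after a burst of inserts. A rough sketch: the two storage parameters are real PostgreSQL settings, but the values below are purely illustrative assumptions and would need tuning against the actual write rate.

```sql
-- Auto-ANALYZE fires once the number of changed rows exceeds
-- threshold + scale_factor * (estimated row count).
-- Values here are illustrative only, not a recommendation.
ALTER TABLE notifications SET (
    autovacuum_analyze_scale_factor = 0.01,
    autovacuum_analyze_threshold = 1000
);
```

With the default scale factor of 0.1, a 7,000,000-row table needs roughly 700,000 changed rows before it is re-analyzed, which may explain why the statistics go stale during heavy logging.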

PostgreSQL - How to make a condition with records between the current record date and the same date plus 5 min?

I have something like this. With this piece of code I detect whether a vehicle has been stopped for at least 5 minutes.
It works, but with a large amount of data it starts to get slow.
I did a lot of tests and I'm sure that my problem is in the not exists block.
My table:
CREATE TABLE public.messages
(
id bigint PRIMARY KEY DEFAULT nextval('messages_id_seq'::regclass),
messagedate timestamp with time zone NOT NULL,
vehicleid integer NOT NULL,
driverid integer NOT NULL,
speedeffective double precision NOT NULL,
-- ... few nonsense properties
)
WITH (
OIDS=FALSE
);
ALTER TABLE public.messages OWNER TO postgres;
CREATE INDEX idx_messages_1 ON public.messages
USING btree (vehicleid, messagedate);
And my query:
SELECT
*
FROM
messages m
WHERE
m.speedeffective > 0
and m.next_speedeffective = 0
and not exists( -- my problem
select id
from messages
where
vehicleid = m.vehicleid
and speedeffective > 5 -- I forgot this condition
and messagedate > m.messagedate
and messagedate <= m.messagedate + interval '5 minutes'
)
I can't figure out how to build the condition in a more performant way.
Edit DAY2:
I added a CTE like this first, to use in place of the second table:
WITH messagesx as (
SELECT
vehicleid,
messagedate
FROM
messages
WHERE
speedeffective > 5
)
and now it works better. I think I'm still missing a small detail.
A NOT EXISTS can slow down your query if the subquery ends up being evaluated once for each of the outer rows. Try to express the same logic as a join (I'm rewriting the query here without knowing the table, so I might make a mistake):
SELECT
m1.*
FROM
messages m1
LEFT JOIN
messages m2
-- m2 is any later message from the same vehicle, within 5 minutes, that is moving
ON m1.vehicleid = m2.vehicleid AND m2.speedeffective > 5 AND m2.messagedate > m1.messagedate AND m2.messagedate <= m1.messagedate + interval '5 minutes'
WHERE
m1.speedeffective > 0
and m1.next_speedeffective = 0
and m2.vehicleid IS NULL
Take note that the NOT EXISTS is rewritten as the non-hit of the join condition.
Based on this answer: https://stackoverflow.com/a/36445233/5000827
and reading about NOT IN, NOT EXISTS and LEFT JOIN (where the joined row IS NULL):
For PostgreSQL, NOT EXISTS and LEFT JOIN ... IS NULL are both planned as anti-joins and work the same way. (This is why the result of @CountZukula's answer is almost the same as mine.)
The problem was the kind of join operation the planner chose: nested loop or hash.
So, based on this: https://www.postgresql.org/docs/9.6/static/routine-vacuuming.html
PostgreSQL's VACUUM command has to process each table on a regular basis for several reasons:
To recover or reuse disk space occupied by updated or deleted rows.
To update data statistics used by the PostgreSQL query planner.
To update the visibility map, which speeds up index-only scans.
To protect against loss of very old data due to transaction ID wraparound or multixact ID wraparound.
I ran VACUUM ANALYZE on the messages table, and the same query now runs much faster.
So, with up-to-date statistics from VACUUM ANALYZE, PostgreSQL can choose a better plan.
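For reference, a short sketch of that step against the messages table from the question. The second statement uses the standard pg_stat_user_tables view to check when statistics were last refreshed:

```sql
-- Reclaim dead rows and refresh the planner statistics for the table.
VACUUM ANALYZE messages;

-- See when the table was last analyzed, manually or by autovacuum.
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'messages';
```

If last_autoanalyze is far in the past on a table that is written to heavily, stale statistics are a plausible reason for the planner picking a poor join strategy.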

SARGable way to find records near each other based on time window?

We have events inserted into a table: a start event and an end event. Related events have the same internal_id number and are inserted within a 90-second window. We frequently do a self-join on the table:
create table mytable (id bigint identity, internal_id bigint,
internal_date datetime, event_number int, field_a varchar(50))
select * from mytable a inner join mytable b on a.internal_id = b.internal_id
and a.event_number = 1 and b.event_number = 2
However, we can have millions of linked events each day. Our clustered key is the internal_date, so we can filter down to a partition level, but the performance can still be mediocre:
and a.internal_date >='20120807' and a.internal_date < '20120808'
and b.internal_date >='20120807' and b.internal_date < '20120808'
Is there a SARGable way to narrow it down further?
Adding this doesn't work - non-SARGable:
and a.internal_date <= b.internal_date +.001 --about 90 seconds
and a.internal_date > b.internal_date - .001 --make sure they're within the window
This isn't for a point query, so doing one-offs doesn't help - we're searching for thousands of records and need event details from the start event and the end event.
Thanks!
With this index your query will be much cheaper:
CREATE UNIQUE INDEX idx_iid on mytable(event_number, internal_id)
INCLUDE (id, internal_date, field_a);
The index allows you to seek on event_number rather than doing a clustered index scan, and it also enables a merge join on internal_id rather than a hash join. The uniqueness constraint makes the merge join even cheaper by eliminating the possibility of a many-to-many join.
See this for a more detailed explanation of merge join.

Date Table/Dimension Querying and Indexes

I'm creating a robust date table and want to know the best way to link to it. The primary key clustered index will be on the smart date integer key (per Kimball spec), named DateID. Until now I have been running queries against it like so:
select Foo.orderdate -- a bunch of fields from Foo
,DTE.FiscalYearName
,DTE.FiscalPeriod
,DTE.FiscalYearPeriod
,DTE.FiscalYearWeekName
,DTE.FiscalWeekName
FROM SomeTable Foo
INNER JOIN
DateDatabase.dbo.MyDateTable DTE
ON DTE.date = CAST(FLOOR(CAST(Foo.forderdate AS FLOAT)) AS DATETIME)
Keep in mind that Date is a nonclustered index field with values such as:
2000-01-01 00:00:00.000
It just occurred to me that since I have a clustered integer index (DateID), perhaps I should convert the datetime in my database field to match it and join on that field instead.
What do you folks think?
Also, depending on your first answer, if I am typically pulling those fields from the date table, what kind of index would optimize the retrieval of those fields? A covering index?
Even without changing the database structure, you'd get much better performance using a date range join like this:
select Foo.orderdate -- a bunch of fields from Foo
,DTE.FiscalYearName
,DTE.FiscalPeriod
,DTE.FiscalYearPeriod
,DTE.FiscalYearWeekName
,DTE.FiscalWeekName
FROM SomeTable Foo
INNER JOIN
DateDatabase.dbo.MyDateTable DTE
ON Foo.forderdate >= DTE.date AND Foo.forderdate < DATEADD(dd, 1, DTE.date)
However, if you can change it so that your Foo table includes a DateID field then, yes, you'd get the best performance by joining with that instead of any converted date value or date range.
If you change it to join on DateID, and DateID is the first column of the clustered index of MyDateTable, then it's already covering (a clustered index always includes all other fields).
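A sketch of what that join could look like. It assumes SomeTable has been given a DateID column holding the same integer key as MyDateTable; that column is hypothetical and is not part of the schema shown in the question.

```sql
-- Assumption: Foo.DateID exists and is populated with the same smart
-- integer key that MyDateTable uses as its clustered primary key.
SELECT Foo.forderdate
      ,DTE.FiscalYearName
      ,DTE.FiscalPeriod
      ,DTE.FiscalYearPeriod
FROM SomeTable Foo
INNER JOIN DateDatabase.dbo.MyDateTable DTE
    ON DTE.DateID = Foo.DateID;
```

Because the join is a plain equality on the clustered key, no conversion is applied to either side, so the optimizer can seek directly into MyDateTable.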