SARGable way to find records near each other based on time window? - tsql

We have events inserted into a table - a start event and an end event. Related events have the same internal_id number, and are inserted within a 90-second window. We frequently do a self-join on the table:
create table mytable (id bigint identity, internal_id bigint,
internal_date datetime, event_number int, field_a varchar(50))
select * from mytable a inner join mytable b on a.internal_id = b.internal_id
and a.event_number = 1 and b.event_number = 2
However, we can have millions of linked events each day. Our clustered key is the internal_date, so we can filter down to a partition level, but the performance can still be mediocre:
and a.internal_date >='20120807' and a.internal_date < '20120808'
and b.internal_date >='20120807' and b.internal_date < '20120808'
Is there a SARGable way to narrow it down further?
Adding this doesn't work - non-SARGable:
and a.internal_date <= b.internal_date +.001 --about 90 seconds
and a.internal_date > b.internal_date - .001 --make sure they're within the window
This isn't for a point query, so doing one-offs doesn't help - we're searching for thousands of records and need event details from the start event and the end event.
Thanks!

With this index your query will be much cheaper:
CREATE UNIQUE INDEX idx_iid on mytable(event_number, internal_id)
INCLUDE (id, internal_date, field_a);
The index allows you to seek on event_number rather than doing a clustered index scan, and enables a merge join on internal_id rather than a hash join. The uniqueness guarantee makes the merge join even cheaper by eliminating the possibility of a many-to-many join.
See this for a more detailed explanation of merge join.
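With the index in place, here is a sketch of the query shape that benefits (the column picks are assumed; the 90-second check remains a residual predicate since it compares two columns, but dateadd(second, ...) states the window more precisely than adding .001 days, which is about 86 seconds):
select a.id, a.internal_id, a.internal_date, a.field_a,
       b.internal_date as end_date, b.field_a as end_field_a
from mytable a
inner join mytable b
    on b.internal_id = a.internal_id    -- merge join on the index's internal_id key
where a.event_number = 1                -- both sides seek on event_number
  and b.event_number = 2
  and a.internal_date >= '20120807' and a.internal_date < '20120808'
  and b.internal_date >= dateadd(second, -90, a.internal_date)  -- +/- 90-second window,
  and b.internal_date <  dateadd(second,  90, a.internal_date)  -- applied as a residual filter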


Postgres Index to speed up LEFT OUTER JOIN

Within my DB I have a table prediction_fsd with about 5 million entries. The site table contains approximately 3 million entries. I need to execute queries that look like:
SELECT prediction_fsd.id AS prediction_fsd_id,
prediction_fsd.site_id AS prediction_fsd_site_id,
prediction_fsd.html_hash AS prediction_fsd_html_hash,
prediction_fsd.prediction AS prediction_fsd_prediction,
prediction_fsd.algorithm AS prediction_fsd_algorithm,
prediction_fsd.model_version AS prediction_fsd_model_version,
prediction_fsd.timestamp AS prediction_fsd_timestamp,
site_1.id AS site_1_id,
site_1.url AS site_1_url,
site_1.status AS site_1_status
FROM prediction_fsd
LEFT OUTER JOIN site AS site_1
ON site_1.id = prediction_fsd.site_id
WHERE 95806 = prediction_fsd.site_id
AND prediction_fsd.algorithm = 'xgboost'
ORDER BY prediction_fsd.timestamp DESC
LIMIT 1
At the moment this query takes about 4 seconds. I'd like to reduce that by introducing an index. Which tables and fields should I include in that index? I'm having trouble properly understanding the EXPLAIN ANALYZE output of Postgres.
CREATE INDEX prediction_fsd_site_id_algorithm_timestamp
ON public.prediction_fsd USING btree
(site_id, algorithm, "timestamp" DESC)
TABLESPACE pg_default;
By introducing a combined index as suggested by Frank Heikens, I was able to bring the query execution time down to 0.25 s.
These three SQL lines point to a possible BTREE index to help you.
WHERE 95806 = prediction_fsd.site_id
AND prediction_fsd.algorithm = 'xgboost'
ORDER BY prediction_fsd.timestamp DESC
You're filtering the rows of the table by equality on two columns, and ordering by the third column. So try this index.
CREATE INDEX site_alg_ts ON prediction_fsd
(site_id, algorithm, timestamp DESC);
This BTREE index lets PostgreSQL jump straight to the first eligible row, which also happens to be the row you want, given your ORDER BY ... LIMIT 1 clause.
The query plan in your question says that PostgreSQL did an expensive Parallel Sequential Scan on all five megarows of that table. This index will almost certainly change that to a cheap index lookup.
On the other table, it appears that you already look up rows in it via the primary key id. So you don't need any other index for that one.
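To confirm the plan change after creating the index, you can re-run the query under EXPLAIN (ANALYZE, BUFFERS) - standard PostgreSQL options - and check that the sequential scan is gone (column picks abbreviated here):
EXPLAIN (ANALYZE, BUFFERS)
SELECT prediction_fsd.id, site_1.url
FROM prediction_fsd
LEFT OUTER JOIN site AS site_1 ON site_1.id = prediction_fsd.site_id
WHERE 95806 = prediction_fsd.site_id
  AND prediction_fsd.algorithm = 'xgboost'
ORDER BY prediction_fsd.timestamp DESC
LIMIT 1;
-- expect: Index Scan using site_alg_ts on prediction_fsd
-- instead of: Parallel Seq Scan on prediction_fsd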

How to keep postgres statistics up to date to encourage the best index to be selected

I have a Notifications table with approximately 7,000,000 records where the relevant columns are:
id: integer
time_created: timestamp with time zone
device_id: integer (foreign key to another table)
And the indexes:
CREATE INDEX notifications_device ON notifications (device_id);
CREATE INDEX notifications_time ON notifications (time_created);
And my query:
SELECT COUNT(*) AS "__count"
FROM "notifications"
WHERE ("notifications"."device_id" IN (
SELECT "id" FROM device WHERE (
device."device_type" = 'iOS' AND
device."registration_id" IN (
'XXXXXXX',
'YYYYYYY',
'ZZZZZZZ'
)
)
)
AND "notifications"."time_created" BETWEEN
'2020-10-26 00:00:00' AND '2020-10-26 17:33:00')
;
For most of the day, this query uses the index on device_id and runs in under 1 ms. But once the table is being written to rapidly (logging notifications as they are sent), the planner switches to the index on time_created and the query blows out to 300 ms.
Running an ANALYZE NOTIFICATIONS immediately fixes the problem, and the index on device_id is used again.
The table is pruned to the last 30 days each night, which is why there is a separate index on the time_created column.
Can I fix this issue, so that the planner always chooses the index on device_id, by forcing postgres to maintain better statistics on this table? Alternatively, can I re-write the time_created index (perhaps by using a different index type like BRIN) so that it'd only be considered for a WHERE clause like time_created < ..30 days ago.. and not WHERE time_created BETWEEN midnight and now?
EXPLAIN ANALYZE stats:
Bad Plan (time_created):
Rows Removed by Filter = 20926
Shared Hit Blocks = 143934
Plan Rows = 38338
Actual Rows = 84479
Good Plan (device_id):
Rows Removed by Filter = 95
Shared Hit Blocks = 34
Plan Rows = 1
Actual Rows = 0
I would actually suggest a composite index on the notifications table:
CREATE INDEX idx1 ON notifications (device_id, time_created);
This index would cover both restrictions in the current WHERE clause. I would also add an index on the device table:
CREATE INDEX idx2 ON device (device_type, registration_id, id);
The first two columns of this 3-column index would cover the WHERE clause of the subquery. It also includes the id column to completely cover the SELECT clause. If used, Postgres could more rapidly evaluate the subquery on the device table.
You could also play around with some slight variants of the above two indices, by changing column order. For example, you could also try:
CREATE INDEX idx1 ON notifications (time_created, device_id);
CREATE INDEX idx2 ON device (registration_id , device_type, id);
The table is pruned to the last 30 days each night, which is why there is a separate index on the time_created column.
But, is that a good reason to have the index? Does it matter if the nightly query takes a little longer? Indeed, for deleting 3% of a table, does it even use the index and if it does, does that actually make it faster? Maybe you could replace the index with partitioning, or with nothing.
In any case, you can use this ugly hack to force it not to use the index:
AND "notifications"."time_created" + interval '0 seconds' BETWEEN '2020-10-26 00:00:00' AND '2020-10-26 17:33:00'

PostgreSQL - How to make a condition with records between the current record date and the same date plus 5 min?

I have something like this. With this piece of code I detect whether a vehicle has stopped for at least 5 minutes.
It works, but with a large amount of data it starts to get slow.
I did a lot of tests and I'm sure that my problem is in the not exists block.
My table:
CREATE TABLE public.messages
(
id bigint PRIMARY KEY DEFAULT nextval('messages_id_seq'::regclass),
messagedate timestamp with time zone NOT NULL,
vehicleid integer NOT NULL,
driverid integer NOT NULL,
speedeffective double precision NOT NULL,
-- ... few nonsense properties
)
WITH (
OIDS=FALSE
);
ALTER TABLE public.messages OWNER TO postgres;
CREATE INDEX idx_messages_1 ON public.messages
USING btree (vehicleid, messagedate);
And my query:
SELECT
*
FROM
messages m
WHERE
m.speedeffective > 0
and m.next_speedeffective = 0
and not exists( -- my problem
select id
from messages
where
vehicleid = m.vehicleid
and speedeffective > 5 -- I forgot this condition
and messagedate > m.messagedate
and messagedate <= m.messagedate + interval '5 minutes'
)
I can't figure out how to build the condition in a more performant way.
Edit DAY2:
I added a preliminary CTE like this to pre-filter the rows used in the subquery:
WITH messagesx as (
SELECT
vehicleid,
messagedate
FROM
messages
WHERE
speedeffective > 5
)
and now it works better. I think I'm just missing a little detail.
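For what it's worth, here is one way to wire that CTE into the original NOT EXISTS (a sketch built only from the query above; it assumes the same 5-minute window and that next_speedeffective is one of the omitted columns):
WITH messagesx AS (
    SELECT vehicleid, messagedate
    FROM messages
    WHERE speedeffective > 5   -- pre-filter once, instead of per outer row
)
SELECT m.*
FROM messages m
WHERE m.speedeffective > 0
  AND m.next_speedeffective = 0
  AND NOT EXISTS (
      SELECT 1
      FROM messagesx x
      WHERE x.vehicleid = m.vehicleid
        AND x.messagedate > m.messagedate
        AND x.messagedate <= m.messagedate + interval '5 minutes'
  );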
Typically, a NOT EXISTS will slow down your query, as it may have to probe the table once for each of the outer rows. Try to incorporate the same functionality within a join (I'm trying to rewrite the query here without knowing the table, so I might make a mistake):
SELECT
m1.*
FROM
messages m1
LEFT JOIN
messages m2
ON m1.vehicleid = m2.vehicleid
AND m2.speedeffective > 5
AND m2.messagedate > m1.messagedate
AND m2.messagedate <= m1.messagedate + interval '5 minutes'
WHERE
m1.speedeffective > 0
and m1.next_speedeffective = 0
and m2.vehicleid IS NULL -- anti-join: keep rows with no later message in the window
Take note that the NOT EXISTS is rewritten as a non-match of the join condition (the m2.vehicleid IS NULL check).
Based on this answer: https://stackoverflow.com/a/36445233/5000827
and reading about NOT IN, NOT EXISTS and LEFT JOIN (where the joined side IS NULL):
for PostgreSQL, NOT EXISTS and LEFT JOIN ... IS NULL are both anti-joins and work the same way. (This is the reason why @CountZukula's answer performs almost the same as mine.)
The problem was the kind of operation the planner chose for the anti-join: Nested Loop or Hash.
So, based on this: https://www.postgresql.org/docs/9.6/static/routine-vacuuming.html
PostgreSQL's VACUUM command has to process each table on a regular basis for several reasons:
To recover or reuse disk space occupied by updated or deleted rows.
To update data statistics used by the PostgreSQL query planner.
To update the visibility map, which speeds up index-only scans.
To protect against loss of very old data due to transaction ID wraparound or multixact ID wraparound.
I ran VACUUM ANALYZE on the messages table, and the same query now runs much faster.
So, with fresh statistics from the VACUUM, PostgreSQL can make better planning decisions.
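For reference, the command itself, plus a quick way to check when the table's statistics were last refreshed (pg_stat_user_tables is a standard PostgreSQL statistics view):
VACUUM ANALYZE messages;
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'messages';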

Postgresql 9.4 slow [duplicate]

I have a table
create table big_table (
id serial primary key,
-- other columns here
vote int
);
This table is very big - approximately 70 million rows - and I need to query:
SELECT * FROM big_table
ORDER BY vote [ASC|DESC], id [ASC|DESC]
OFFSET x LIMIT n -- I need this for pagination
As you may know, when x is a large number, queries like this are very slow.
For performance optimization I added indexes:
create index vote_order_asc on big_table (vote asc, id asc);
and
create index vote_order_desc on big_table (vote desc, id desc);
EXPLAIN shows that the above SELECT query uses these indexes, but it's very slow anyway with a large offset.
What can I do to optimize queries with OFFSET in big tables? Maybe PostgreSQL 9.5 or even newer versions have some features? I've searched but didn't find anything.
A large OFFSET is always going to be slow. Postgres has to order all rows and count the visible ones up to your offset. To skip all previous rows directly you could add an indexed row_number to the table (or create a MATERIALIZED VIEW including said row_number) and work with WHERE row_number > x instead of OFFSET x.
However, this approach is only sensible for read-only (or mostly) data. Implementing the same for table data that can change concurrently is more challenging. You need to start by defining desired behavior exactly.
I suggest a different approach for pagination:
SELECT *
FROM big_table
WHERE (vote, id) > (vote_x, id_x) -- ROW values
ORDER BY vote, id -- needs to be deterministic
LIMIT n;
Where vote_x and id_x are from the last row of the previous page (for both DESC and ASC). Or from the first if navigating backwards.
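For example, a hypothetical next-page fetch, assuming the previous page ended at vote = 42, id = 10007 (values are made up):
SELECT *
FROM big_table
WHERE (vote, id) > (42, 10007) -- last row of the previous page
ORDER BY vote, id
LIMIT 20;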
Comparing row values is supported by the index you already have - a feature that complies with the ISO SQL standard, but not every RDBMS supports it.
CREATE INDEX vote_order_asc ON big_table (vote, id);
Or for descending order:
SELECT *
FROM big_table
WHERE (vote, id) < (vote_x, id_x) -- ROW values
ORDER BY vote DESC, id DESC
LIMIT n;
Can use the same index.
I suggest you declare your columns NOT NULL or acquaint yourself with the NULLS FIRST|LAST construct:
PostgreSQL sort by datetime asc, null first?
Note two things in particular:
The ROW values in the WHERE clause cannot be replaced with separated member fields. WHERE (vote, id) > (vote_x, id_x) cannot be replaced with:
WHERE vote >= vote_x
AND id > id_x
That would rule out all rows with id <= id_x, while we only want to do that for the same vote and not for the next. The correct translation would be:
WHERE (vote = vote_x AND id > id_x) OR vote > vote_x
... which doesn't play along with indexes as nicely, and gets increasingly complicated for more columns.
Would be simple for a single column, obviously. That's the special case I mentioned at the outset.
The technique does not work for mixed directions in ORDER BY like:
ORDER BY vote ASC, id DESC
At least I can't think of a generic way to implement this as efficiently. If at least one of both columns is a numeric type, you could use a functional index with an inverted value on (vote, (id * -1)) - and use the same expression in ORDER BY:
ORDER BY vote ASC, (id * -1) ASC
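A sketch of that expression index (index name assumed):
CREATE INDEX vote_id_inv ON big_table (vote, (id * -1));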
Related:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Improve performance for order by with columns from many tables
Note in particular the presentation by Markus Winand I linked to:
"Pagination done the PostgreSQL way"
Have you tried partitioning the table?
Ease of management, improved scalability and availability, and a reduction in blocking are common reasons to partition tables. Improving query performance is not a reason to employ partitioning, though it can be a beneficial side-effect in some cases. In terms of performance, it is important to ensure that your implementation plan includes a review of query performance. Confirm that your indexes continue to appropriately support your queries after the table is partitioned, and verify that queries using the clustered and nonclustered indexes benefit from partition elimination where applicable.
http://sqlperformance.com/2013/09/sql-indexes/partitioning-benefits

optimizing a slow postgresql query against multiple tables

One of our PostgreSQL queries started getting slow (~15 seconds), so we looked at migrating to a Graph database. Early tests show significantly faster speeds, so AWESOME.
Here's the problem- we still need to store a backup of the data in Postgres for non-analytics needs. The Graph database is just for analytics, and we'd prefer for it to remain a secondary data store. Because our business logic changed quite a bit during this migration, two existing tables turned into 4 -- and the current 'backup' selects in Postgres take anywhere from 1 to 6 minutes to run.
I've tried a few ways to optimize this, and the best seems to be turning this into two queries. If anyone can spot obvious mistakes here, I'd love to hear a suggestion. I've tried switching up left/right/inner joins with little difference in the query planner. The join order does make a difference; I think I'm just not getting this right.
I'll go into details.
Goal : Retrieve the last 10 attachments sent to a given person
Database Structure :
CREATE TABLE message (
id SERIAL PRIMARY KEY NOT NULL ,
body_raw TEXT
);
CREATE TABLE attachments (
id SERIAL PRIMARY KEY NOT NULL ,
body_raw TEXT
);
CREATE TABLE message_2_attachments (
message_id INT NOT NULL REFERENCES message(id) ,
attachment_id INT NOT NULL REFERENCES attachments(id)
);
CREATE TABLE mailings (
id SERIAL PRIMARY KEY NOT NULL ,
event_timestamp TIMESTAMP not null ,
recipient_id INT NOT NULL ,
message_id INT NOT NULL REFERENCES message(id)
);
sidenote: the reason why a mailing is abstracted from the message is that a mailing often has more than one recipient /and/ a single message can go out to multiple recipients
This query takes about 5 minutes on a relatively small dataset (the query planner time is the comment above each query):
-- 159374.75
EXPLAIN ANALYZE SELECT attachments.*
FROM attachments
JOIN message_2_attachments ON attachments.id = message_2_attachments.attachment_id
JOIN message ON message_2_attachments.message_id = message.id
JOIN mailings ON mailings.message_id = message.id
WHERE mailings.recipient_id = 1
ORDER BY mailings.event_timestamp desc limit 10 ;
Splitting it up into 2 queries takes only 1/8 the time:
-- 19123.22
EXPLAIN ANALYZE SELECT message_2_attachments.attachment_id
FROM mailings
JOIN message ON mailings.message_id = message.id
JOIN message_2_attachments ON message.id = message_2_attachments.message_id
JOIN attachments ON message_2_attachments.attachment_id = attachments.id
WHERE mailings.recipient_id = 1
ORDER BY mailings.event_timestamp desc limit 10 ;
-- 1.089
EXPLAIN ANALYZE SELECT * FROM attachments WHERE id IN ( results of above query )
I've tried re-writing the queries a handful of times -- different join orders, different types of joins, etc. I can't seem to make this anywhere nearly as efficient in a single query as it can be in two.
UPDATED: GitHub has better formatting, so the full output of EXPLAIN is here - https://gist.github.com/jvanasco/bc1dd38ca06e52c9a090
I plugged the output of your EXPLAIN in here: http://explain.depesz.com/s/hqPT
As you can see, this hash join:
Hash Join (cost=96588.85..158413.71 rows=44473 width=3201) (actual time=22590.630..30761.213 rows=44292 loops=1)
Hash Cond: (message_2_attachment.attachment_id = attachment.id)
is taking a good amount of time. I'd try adding indexes on the foreign keys as well, with:
CREATE INDEX idx_message_2_attachments_attachment_id ON "message_2_attachments" USING btree (attachment_id);
CREATE INDEX idx_message_2_attachments_message_id ON "message_2_attachments" USING btree (message_id);
CREATE INDEX idx_mailings_message_id ON "mailings" USING btree (message_id);
The junction table is missing a primary key. Also, it is advisable to add a reversed index on this PK:
CREATE TABLE message_2_attachments (
message_id INT NOT NULL REFERENCES message(id) ,
attachment_id INT NOT NULL REFERENCES attachments(id)
, PRIMARY KEY (message_id,attachment_id) -- <<== here
);
CREATE UNIQUE INDEX ON message_2_attachments(attachment_id,message_id); -- <<== here
For the mailings table, the situation is not so clear. It looks like some combination of {event_timestamp, recipient_id, message_id} could function as a candidate key. The id field merely functions as a surrogate.
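If that combination really is unique in your data, it could be declared outright (a sketch; the constraint name is assumed, and this only works if no duplicate combinations exist):
ALTER TABLE mailings
    ADD CONSTRAINT mailings_candidate_key
    UNIQUE (recipient_id, message_id, event_timestamp);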