PostgreSQL - How to make a condition with records between the current record date and the same date plus 5 minutes?

I have something like this. With this piece of code I detect whether a vehicle stopped for at least 5 minutes.
It works, but with a large amount of data it starts to get slow.
I did a lot of tests and I'm sure that my problem is in the NOT EXISTS block.
My table:
CREATE TABLE public.messages
(
    id bigint PRIMARY KEY DEFAULT nextval('messages_id_seq'::regclass),
    messagedate timestamp with time zone NOT NULL,
    vehicleid integer NOT NULL,
    driverid integer NOT NULL,
    speedeffective double precision NOT NULL,
    -- ... few nonsense properties
)
WITH (
    OIDS=FALSE
);
ALTER TABLE public.messages OWNER TO postgres;
CREATE INDEX idx_messages_1 ON public.messages
    USING btree (vehicleid, messagedate);
And my query:
SELECT
    *
FROM
    messages m
WHERE
    m.speedeffective > 0
    and m.next_speedeffective = 0
    and not exists( -- my problem
        select id
        from messages
        where
            vehicleid = m.vehicleid
            and speedeffective > 5 -- I forgot this condition
            and messagedate > m.messagedate
            and messagedate <= m.messagedate + interval '5 minutes'
    )
I can't figure out how to build the condition in a more performant way.
Edit (day 2):
I added a CTE like this, to use in the main query:
WITH messagesx as (
    SELECT
        vehicleid,
        messagedate
    FROM
        messages
    WHERE
        speedeffective > 5
)
and now it works better. I think I'm just missing a little detail.
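A sketch of how that CTE might plug into the original query (an assumption about the intended shape, since the edit only shows the CTE itself):
WITH messagesx AS (
    SELECT
        vehicleid,
        messagedate
    FROM
        messages
    WHERE
        speedeffective > 5
)
SELECT
    m.*
FROM
    messages m
WHERE
    m.speedeffective > 0
    and m.next_speedeffective = 0
    and not exists(
        select 1
        from messagesx x
        where
            x.vehicleid = m.vehicleid
            and x.messagedate > m.messagedate
            and x.messagedate <= m.messagedate + interval '5 minutes'
    )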

Typically, a NOT EXISTS can slow down your query when it is executed as a correlated subquery, because that means scanning the table for each of the outer rows. Try to incorporate the same functionality within a join (I'm trying to rewrite the query without knowing the table, so I might make a mistake here):
SELECT
    *
FROM
    messages m1
    LEFT JOIN messages m2
        ON  m1.vehicleid = m2.vehicleid
        AND m2.speedeffective > 5
        AND m2.messagedate > m1.messagedate
        AND m2.messagedate <= m1.messagedate + interval '5 minutes'
WHERE
    m1.speedeffective > 0
    and m1.next_speedeffective = 0
    and m2.vehicleid IS NULL
Note that the NOT EXISTS is rewritten as a non-match of the join condition: the m2.vehicleid IS NULL check keeps only the rows that found no match.

Based on this answer: https://stackoverflow.com/a/36445233/5000827 and on reading about NOT IN, NOT EXISTS and LEFT JOIN (where the joined row IS NULL):
In PostgreSQL, NOT EXISTS and that kind of LEFT JOIN are both planned as an anti-join and work the same way. (This is the reason why @CountZukula's answer performs almost the same as mine.)
The problem was the kind of join operation the planner chose: nested loop vs. hash.
So, based on this: https://www.postgresql.org/docs/9.6/static/routine-vacuuming.html
PostgreSQL's VACUUM command has to process each table on a regular basis for several reasons:
To recover or reuse disk space occupied by updated or deleted rows.
To update data statistics used by the PostgreSQL query planner.
To update the visibility map, which speeds up index-only scans.
To protect against loss of very old data due to transaction ID wraparound or multixact ID wraparound.
I ran VACUUM ANALYZE on the messages table and the same query now runs much faster.
So, with up-to-date statistics from VACUUM ANALYZE, PostgreSQL can make a better decision.
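For anyone hitting the same issue, a minimal way to apply the fix and then verify which plan the planner picks (the query is the one from the question; EXPLAIN (ANALYZE, BUFFERS) is standard PostgreSQL):
VACUUM ANALYZE public.messages;
-- re-check the plan; with fresh statistics the planner may pick a hash anti-join instead of a nested loop
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM messages m
WHERE m.speedeffective > 0
  and m.next_speedeffective = 0
  and not exists(
      select 1
      from messages
      where vehicleid = m.vehicleid
        and speedeffective > 5
        and messagedate > m.messagedate
        and messagedate <= m.messagedate + interval '5 minutes'
  );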

How to keep postgres statistics up to date to encourage the best index to be selected

I have a Notifications table with approximately 7,000,000 records where the relevant columns are:
id: integer
time_created: timestamp with time zone
device_id: integer (foreign key to another table)
And the indexes:
CREATE INDEX notifications_device ON notifications (device_id);
CREATE INDEX notifications_time ON notifications (time_created);
And my query:
SELECT COUNT(*) AS "__count"
FROM "notifications"
WHERE ("notifications"."device_id" IN (
           SELECT "id" FROM device WHERE (
               device."device_type" = 'iOS' AND
               device."registration_id" IN (
                   'XXXXXXX',
                   'YYYYYYY',
                   'ZZZZZZZ'
               )
           )
       )
       AND "notifications"."time_created" BETWEEN
           '2020-10-26 00:00:00' AND '2020-10-26 17:33:00')
;
For most of the day, this query will use the index on device_id, and will run in under 1ms. But once the table is written to very quickly (logging notifications sent) the planner switches to using the index on time_created and the query blows out to 300ms.
Running an ANALYZE NOTIFICATIONS immediately fixes the problem, and the index on device_id is used again.
The table is pruned to the last 30 days each night, which is why there is a separate index on the time_created column.
Can I fix this issue, so that the planner always chooses the index on device_id, by forcing postgres to maintain better statistics on this table? Alternatively, can I re-write the time_created index (perhaps by using a different index type like BRIN) so that it'd only be considered for a WHERE clause like time_created < ..30 days ago.. and not WHERE time_created BETWEEN midnight and now?
EXPLAIN ANALYZE stats:
Bad Plan (time_created):
Rows Removed by Filter = 20926
Shared Hit Blocks = 143934
Plan Rows = 38338
Actual Rows = 84479
Good Plan (device_id):
Rows Removed by Filter = 95
Shared Hit Blocks = 34
Plan Rows = 1
Actual Rows = 0
I would actually suggest a composite index on the notifications table:
CREATE INDEX idx1 ON notifications (device_id, time_created);
This index would cover both restrictions in the current WHERE clause. I would also add an index on the device table:
CREATE INDEX idx2 ON device (device_type, registration_id, id);
The first two columns of this 3-column index would cover the WHERE clause of the subquery. It also includes the id column to completely cover the SELECT clause. If used, Postgres could more rapidly evaluate the subquery on the device table.
You could also play around with some slight variants of the above two indices, by changing column order. For example, you could also try:
CREATE INDEX idx1 ON notifications (time_created, device_id);
CREATE INDEX idx2 ON device (registration_id , device_type, id);
The table is pruned to the last 30 days each night, which is why there is a separate index on the time_created column.
But, is that a good reason to have the index? Does it matter if the nightly query takes a little longer? Indeed, for deleting 3% of a table, does it even use the index and if it does, does that actually make it faster? Maybe you could replace the index with partitioning, or with nothing.
In any case, you can use this ugly hack to force it not to use the index:
AND "notifications"."time_created" + interval '0 seconds' BETWEEN '2020-10-26 00:00:00' AND '2020-10-26 17:33:00'

Postgresql race conditions

I have a table which stores geo position of user. Which looks like this:
|id|coords|create_time|
And I have a controller which saves a record to the database, but a user can save a record only once per 5 hours. A simple "if" check does not work: if you send a request, say, 100 times within 10ms, the check fails because the record is not yet in the DB (saving takes some time). So there is a simple race condition. How can I solve this problem at the database level?
One solution would be to use the SERIALIZABLE transaction isolation level throughout.
Then your transactions could be as simple as:
START TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM mytable
WHERE create_time > current_timestamp - INTERVAL '5 hours';
-- throw an error if the result is not 0
INSERT INTO mytable (coords, create_time) VALUES (..., current_timestamp);
COMMIT;
SERIALIZABLE isolation will guarantee a serialization error in one of two concurrent transactions like that.
Now SERIALIZABLE is simple to use, but it saps performance somewhat, needs a bigger lock table and you have to be ready to repeat transactions that receive a serialization error.
A second solution that works with the default READ COMMITTED isolation level would be an exclusion constraint:
ALTER TABLE mytable ADD EXCLUDE USING gist (
tstzrange(create_time, create_time + INTERVAL '5 hours') WITH &&
);
Here && is the range “overlaps” operator, and the condition would exclude any two entries in the table that are closer together than 5 hours.
tstzrange is a “timestamp with time zone-range” and is the appropriate type if create_time is of that type; for timestamp without time zone use tsrange.
This is automatically safe from race conditions, and one of two concurrent INSERTs would receive a constraint violation error.
If you need to have that overlap check per person, let's assume that there is a person_id column as well. Then you need to extend the exclusion constraint:
CREATE EXTENSION btree_gist; -- for GiST indexes on bigint columns
ALTER TABLE mytable ADD EXCLUDE USING gist (
person_id WITH =,
tstzrange(create_time, create_time + INTERVAL '5 hours') WITH &&
);
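A quick illustration of what happens at the default isolation level with the per-person constraint in place (a sketch; the coords value and the person_id type are made up, since the question doesn't give the column definitions):
INSERT INTO mytable (person_id, coords, create_time)
VALUES (42, '51.5,-0.1', current_timestamp);  -- succeeds

INSERT INTO mytable (person_id, coords, create_time)
VALUES (42, '51.6,-0.2', current_timestamp);  -- fails with SQLSTATE 23P01 (exclusion_violation)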

Update a very large table in PostgreSQL without locking

I have a very large table with 100M rows in which I want to update a column with a value on the basis of another column. The example query to show what I want to do is given below:
UPDATE mytable SET col2 = 'ABCD'
WHERE col1 is not null
This is a master DB in a live environment with multiple slaves, and I want to update it without locking the table or affecting the performance of the live environment. What will be the most effective way to do it? I'm thinking of making a procedure that updates rows in batches of 1000 or 10000 rows, using something like LIMIT, but I'm not quite sure how to do it as I'm not that familiar with Postgres and its pitfalls. Oh, and neither column has an index, but the table has other columns that do.
I would appreciate a sample procedure code.
Thanks.
There is no update without locking, but you can strive to keep the row locks few and short.
You could simply run batches of this:
UPDATE mytable
SET col2 = 'ABCD'
FROM (SELECT id
      FROM mytable
      WHERE col1 IS NOT NULL
        AND col2 IS DISTINCT FROM 'ABCD'
      LIMIT 10000) AS part
WHERE mytable.id = part.id;
Just keep repeating that statement until it modifies less than 10000 rows, then you are done.
Note that mass updates don't lock the table, but of course they lock the updated rows, and the more of them you update, the longer the transaction, and the greater the risk of a deadlock.
To make that performant, an index like this would help:
CREATE INDEX ON mytable (col2) WHERE col1 IS NOT NULL;
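If you want the batches to run unattended, one option (a sketch, not a drop-in solution) is a procedure on PostgreSQL 11 or later, where COMMIT is allowed inside procedures, so each batch commits and releases its row locks before the next one starts. The procedure name is made up for illustration:
CREATE OR REPLACE PROCEDURE update_mytable_in_batches()
LANGUAGE plpgsql
AS $$
DECLARE
    n bigint;
BEGIN
    LOOP
        UPDATE mytable
        SET col2 = 'ABCD'
        FROM (SELECT id
              FROM mytable
              WHERE col1 IS NOT NULL
                AND col2 IS DISTINCT FROM 'ABCD'
              LIMIT 10000) AS part
        WHERE mytable.id = part.id;

        GET DIAGNOSTICS n = ROW_COUNT;
        COMMIT;           -- end the batch, release its row locks
        EXIT WHEN n = 0;  -- nothing left to update
    END LOOP;
END;
$$;

CALL update_mytable_in_batches();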
Just an off-the-wall, out-of-the-box idea. Since the requirement that both col1 and col2 be null to qualify precludes using an index, perhaps building a pseudo-index might be an option. This 'index' would of course be a regular table that only exists for a short period. Additionally, it relieves the lock-time worry.
create table indexer (mytable_id integer primary key);
insert into indexer(mytable_id)
select mytable_id
from mytable
where col1 is null
and col2 is null;
The above creates our 'index' that contains only the qualifying rows. Now wrap an update/delete statement into an SQL function. This function updates the main table, deletes the updated rows from the 'index', and returns the number of rows remaining.
create or replace function set_mytable_col2(rows_to_process_in integer)
    returns bigint
    language sql
as $$
    with idx as
        ( update mytable
             set col2 = 'ABCD'
           where col2 is null
             and mytable_id in (select mytable_id
                                  from indexer
                                 limit rows_to_process_in
                               )
          returning mytable_id
        )
    delete from indexer
     where mytable_id in (select mytable_id from idx);

    select count(*) from indexer;
$$;
When the function returns 0, all rows initially selected have been processed. At this point, repeat the entire process to pick up any rows added or updated that the initial selection didn't identify. It should be a small number, and the process remains available if needed later.
Like I said just an off-the-wall idea.
Edited
I must have read something into it that wasn't there concerning col1. However, the idea remains the same; just change the INSERT statement for 'indexer' to meet your requirements. As for setting the value in the 'index': no, the 'index' contains a single column - the primary key of the big table (and of itself).
Yes, you would need to run it multiple times, unless you give it the total number of rows to process as the parameter. Below is a DO block that would satisfy your condition. It processes 200,000 rows on each pass; change that to fit your needs.
Do $$
declare
rows_remaining bigint;
begin
loop
rows_remaining = set_mytable_col2(200000);
commit;
exit when rows_remaining = 0;
end loop;
end; $$;

optimizing a slow postgresql query against multiple tables

One of our PostgreSQL queries started getting slow (~15 seconds) so we looked at migrating to a Graph database. Early tests show significantly faster speeds, so AWESOME.
Here's the problem- we still need to store a backup of the data in Postgres for non-analytics needs. The Graph database is just for analytics, and we'd prefer for it to remain a secondary data store. Because our business logic changed quite a bit during this migration, two existing tables turned into 4 -- and the current 'backup' selects in Postgres take anywhere from 1 to 6 minutes to run.
I've tried a few ways to optimize this, and the best seems to be turning it into two queries. If anyone can spot obvious mistakes here, I'd love to hear a suggestion. I've tried switching up left/right/inner joins with little difference in the query plan. The join order does make a difference; I think I'm just not getting this right.
I'll go into details.
Goal : Retrieve the last 10 attachments sent to a given person
Database Structure :
CREATE TABLE message (
id SERIAL PRIMARY KEY NOT NULL ,
body_raw TEXT
);
CREATE TABLE attachments (
id SERIAL PRIMARY KEY NOT NULL ,
body_raw TEXT
);
CREATE TABLE message_2_attachments (
message_id INT NOT NULL REFERENCES message(id) ,
attachment_id INT NOT NULL REFERENCES attachments(id)
);
CREATE TABLE mailings (
id SERIAL PRIMARY KEY NOT NULL ,
event_timestamp TIMESTAMP not null ,
recipient_id INT NOT NULL ,
message_id INT NOT NULL REFERENCES message(id)
);
sidenote: the reason why a mailing is abstracted from the message is that a mailing often has more than one recipient /and/ a single message can go out to multiple recipients
This query takes about 5 minutes on a relatively small dataset (the query planner time is in the comment above each query):
-- 159374.75
EXPLAIN ANALYZE SELECT attachments.*
FROM attachments
JOIN message_2_attachments ON attachments.id = message_2_attachments.attachment_id
JOIN message ON message_2_attachments.message_id = message.id
JOIN mailings ON mailings.message_id = message.id
WHERE mailings.recipient_id = 1
ORDER BY mailings.event_timestamp desc limit 10 ;
Splitting it up into two queries takes only 1/8 of the time:
-- 19123.22
EXPLAIN ANALYZE SELECT message_2_attachments.attachment_id
FROM mailings
JOIN message ON mailings.message_id = message.id
JOIN message_2_attachments ON message.id = message_2_attachments.message_id
JOIN attachments ON message_2_attachments.attachment_id = attachments.id
WHERE mailings.recipient_id = 1
ORDER BY mailings.event_timestamp desc limit 10 ;
-- 1.089
EXPLAIN ANALYZE SELECT * FROM attachments WHERE id IN ( results of above query )
I've tried re-writing the queries a handful of times -- different join orders, different types of joins, etc. I can't seem to make this anywhere nearly as efficient in a single query as it can be in two.
UPDATE: GitHub has better formatting, so the full output of EXPLAIN is here - https://gist.github.com/jvanasco/bc1dd38ca06e52c9a090
I plugged the output of your EXPLAIN in here: http://explain.depesz.com/s/hqPT
As you can see, this:
Hash Join (cost=96588.85..158413.71 rows=44473 width=3201) (actual time=22590.630..30761.213 rows=44292 loops=1)
Hash Cond: (message_2_attachment.attachment_id = attachment.id)
is taking a good amount of time. I'd try to add indexes on the foreign keys as well with:
CREATE INDEX idx_message_2_attachments_attachment_id ON "message_2_attachments" USING btree (attachment_id);
CREATE INDEX idx_message_2_attachments_message_id ON "message_2_attachments" USING btree (message_id);
CREATE INDEX idx_mailings_message_id ON "mailings" USING btree (message_id);
The junction table is missing a primary key. Also it is advisable to add a reversed index on this PK:
CREATE TABLE message_2_attachments (
message_id INT NOT NULL REFERENCES message(id) ,
attachment_id INT NOT NULL REFERENCES attachments(id)
, PRIMARY KEY (message_id,attachment_id) -- <<== here
);
CREATE UNIQUE INDEX ON message_2_attachments(attachment_id,message_id); -- <<== here
For the mailings table, the situation is not so clear. It looks like some combination of {event_timestamp, recipient_id, message_id} could function as a candidate key. The id field merely functions as a surrogate.
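If that combination really is unique in your data (worth verifying before trying this), you could declare it explicitly, which also gives the planner a unique index to work with; the constraint name here is just an example:
ALTER TABLE mailings
    ADD CONSTRAINT mailings_natural_key UNIQUE (recipient_id, message_id, event_timestamp);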

SARGable way to find records near each other based on time window?

We have events inserted into a table - a start event and an end event. Related events have the same internal_id number and are inserted within a 90-second window. We frequently do a self-join on the table:
create table mytable (id bigint identity, internal_id bigint,
internal_date datetime, event_number int, field_a varchar(50))
select * from mytable a inner join mytable b on a.internal_id = b.internal_id
and a.event_number = 1 and b.event_number = 2
However, we can have millions of linked events each day. Our clustered key is the internal_date, so we can filter down to a partition level, but the performance can still be mediocre:
and a.internal_date >='20120807' and a.internal_date < '20120808'
and b.internal_date >='20120807' and b.internal_date < '20120808'
Is there a SARGable way to narrow it down further?
Adding this doesn't work - non-SARGable:
and a.internal_date <= b.internal_date +.001 --about 90 seconds
and a.internal_date > b.internal_date - .001 --make sure they're within the window
This isn't for a point query, so doing one-offs doesn't help - we're searching for thousands of records and need event details from the start event and the end event.
Thanks!
With this index your query will be much cheaper:
CREATE UNIQUE INDEX idx_iid on mytable(event_number, internal_id)
INCLUDE (id, internal_date, field_a);
The index allows you to seek on event_number rather than doing a clustered index scan, and it enables a merge join on internal_id rather than a hash join. The uniqueness constraint makes the merge join even cheaper by eliminating the possibility of a many-to-many join.
See this for a more detailed explanation of merge join.
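For completeness, here is how the original self-join might look with that index in place (T-SQL, matching the question): the seek and merge join happen on event_number and internal_id, while the per-day filters and the roughly-90-second check remain as residual predicates over the included columns. A sketch, not a tested rewrite:
SELECT a.*, b.*
FROM mytable a
INNER JOIN mytable b
    ON a.internal_id = b.internal_id
   AND a.event_number = 1
   AND b.event_number = 2
WHERE a.internal_date >= '20120807' AND a.internal_date < '20120808'
  AND b.internal_date >= '20120807' AND b.internal_date < '20120808'
  AND a.internal_date <= DATEADD(SECOND, 90, b.internal_date)  -- roughly the original +.001
  AND a.internal_date >  DATEADD(SECOND, -90, b.internal_date);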