PostgreSQL race conditions

I have a table which stores the geo position of a user. It looks like this:
|id|coords|create_time|
And I have a controller which saves a record to the database, but a user may save a record only once per 5 hours. A simple "if" check does not work: if you send, say, 100 requests within about 10 ms, the check fails because the record is not yet in the database (saving takes some time). So there is a simple race condition. How can I solve this problem at the database level?

One solution would be to use the SERIALIZABLE transaction isolation level throughout.
Then your transactions could be as simple as:
START TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM mytable
WHERE create_time > current_timestamp - INTERVAL '5 hours';
-- throw an error if the result is not 0
INSERT INTO mytable (coords, create_time) VALUES (..., current_timestamp);
COMMIT;
SERIALIZABLE isolation will guarantee a serialization error in one of two concurrent transactions like that.
Now SERIALIZABLE is simple to use, but it saps performance somewhat, needs a bigger lock table and you have to be ready to repeat transactions that receive a serialization error.
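Filling in the "throw an error" step from the snippet above, one way to do it is with a DO block (a sketch; the error message is just an example, and the coords value stays elided as before):
START TRANSACTION ISOLATION LEVEL SERIALIZABLE;
DO $$
DECLARE
    n bigint;
BEGIN
    SELECT count(*) INTO n
    FROM mytable
    WHERE create_time > current_timestamp - INTERVAL '5 hours';
    IF n <> 0 THEN
        -- aborts the transaction; the INSERT below will not be applied
        RAISE EXCEPTION 'position already saved within the last 5 hours';
    END IF;
END
$$;
INSERT INTO mytable (coords, create_time) VALUES (..., current_timestamp);
COMMIT;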
A second solution that works with the default READ COMMITTED isolation level would be an exclusion constraint:
ALTER TABLE mytable ADD EXCLUDE USING gist (
tstzrange(create_time, create_time + INTERVAL '5 hours') WITH &&
);
Here && is the range “overlaps” operator, and the condition excludes any two entries in the table that are closer together than 5 hours.
tstzrange is a “timestamp with time zone” range and is the appropriate type if create_time is of that type; for timestamp without time zone use tsrange.
This is automatically safe from race conditions, and one of two concurrent INSERTs would receive a constraint violation error.
If you need to have that overlap check per person, let's assume that there is a person_id column as well. Then you need to extend the exclusion constraint:
CREATE EXTENSION btree_gist; -- for GiST indexes on bigint columns
ALTER TABLE mytable ADD EXCLUDE USING gist (
person_id WITH =,
tstzrange(create_time, create_time + INTERVAL '5 hours') WITH &&
);
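As a quick illustration (a sketch; the coords values are elided just like above), a second insert for the same person within 5 hours is rejected:
INSERT INTO mytable (person_id, coords, create_time) VALUES (1, ..., current_timestamp); -- succeeds
INSERT INTO mytable (person_id, coords, create_time) VALUES (1, ..., current_timestamp); -- fails:
-- ERROR:  conflicting key value violates exclusion constraint
INSERT INTO mytable (person_id, coords, create_time) VALUES (2, ..., current_timestamp); -- a different person is fine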

Related

Best way to model state changes for point in time queries

I'm working on a system that needs to be able to find the "state" of an item at a particular time in history. The state is binary (either on or off). In this case it's used to determine where to direct a piece of timestamped data (to a particular "keyspace"), based on the timestamp of the data. I'm having a hard time deciding on the best way to model the data.
Method 1 is to use the tstzrange with state being implied by the bounds of the range:
create extension btree_gist;
create table core.range_director (
range tstzrange,
directee_id text,
keyspace text,
-- allow a directee to be directed to multiple keyspaces at once
exclude using gist (directee_id with =, keyspace with =, range with &&)
);
insert into core.range_director values
('[2021-01-15 00:00:00 -0:00,2021-01-20 00:00:00 -0:00)', 'THING_ID', 'KEYSPACE_1'),
('[2021-01-15 00:00:00 -0:00,)', 'THING_ID', 'KEYSPACE_2');
select keyspace from core.range_director
where directee_id = 'THING_ID' and range_director.range @> '2021-01-15'::timestamptz;
-- returns KEYSPACE_1 and KEYSPACE_2
select keyspace from core.range_director
where directee_id = 'THING_ID' and range_director.range @> '2021-01-21'::timestamptz;
-- returns KEYSPACE_2
Method 2 is to have explicit state changes:
create table core.status_director (
status_time timestamptz,
status text,
directee_id text,
keyspace text
); -- not sure what pk to use for this method
insert into core.status_director values
('2021-01-15 00:00:00 -0:00','Open','THING_ID','KEYSPACE_1'),
('2021-01-20 00:00:00 -0:00','Closed','THING_ID','KEYSPACE_1'),
('2021-01-15 00:00:00 -0:00','Open','THING_ID','KEYSPACE_2');
select distinct on(keyspace) keyspace, status from core.status_director
where directee_id = 'THING_ID'
and status_time < '2021-01-16'
order by keyspace, status_time desc;
-- returns KEYSPACE_1:Open KEYSPACE_2:Open
select distinct on(keyspace) keyspace, status from core.status_director
where directee_id = 'THING_ID'
and status_time < '2021-01-21'
order by keyspace, status_time desc;
-- returns KEYSPACE_1:Closed, KEYSPACE_2:Open
-- so, client code has to ensure that it only directs to status=Open keyspaces
Maybe there are other methods that would work as well, but these two seem to make the most sense to me. The benefit of the first method is the really easy query, but the downside is that you now have to update rows to close out a state, whereas in the second method you can just post new states, which seems easier.
The table could conceivably grow into thousands or tens of thousands of rows, but will probably not grow into millions (though does the best method change depending on the expected row count?). I have a couple of similar tables with the same point-in-time "state" queries, so it's really important that I get the model for them right.
My instinct is to go with Method 1, but are there any footguns or performance considerations that I'm not thinking of that would push the use case towards Method 2 (or another method I haven't considered)?
No footguns with Method 1, just great big huge cannons. With that method, how do you determine the current status? You need to scan each status change and toggle the status for each one, or perhaps use something like count(*) % 2: odd gives one state, even the other. What happens if any row gets deleted, or data is purged, and you no longer know how many state transitions there were? With Method 2 you retrieve the greatest date and directly obtain the status.
For myself, I would go with Method 3: that is, Method 1 + Method 2. I would have both a date range for the status and the status value itself. That gives me complex historical analysis, since I have the complete history, as well as direct access to the current status at any time.
So after doing a bunch of research on the topic I found that my case is a variation of a "Valid-Time State Table". See ch. 2 and ch. 5 of Developing Time-Oriented Database Applications in SQL by Richard Snodgrass.
The support for these tables isn't great but it's not terrible either (at least PostgreSQL has tstzranges to work with). Method 1 of my post is largely sufficient - the main wrinkle is between the state table and other tables.
Since PostgreSQL doesn't have native support for these kinds of temporal tables, you have to build referential integrity yourself. There's a bunch of ways to do this, but for anyone in the future looking for some direction, here is an example of what that might look like for a referential query on two bitemporal tables:
create table a (
row_id bigserial, -- to track individual rows
id int,
pov tstzrange, -- period of validity
pop tstzrange -- period of presence
);
create table b (
row_id bigserial,
id int,
pov tstzrange,
pop tstzrange,
a_id int
);
-- are we good?
with each_pov as (
select bool_or(a.pov @> b.pov) as ok
from a
join b on a.id = b.a_id
and upper(a.pop) is null
and upper(b.pop) is null
group by b.pov
) select coalesce(
bool_and(each_pov.ok),
(select count(*) = 0 from b where upper(pop) is null)
) from each_pov;
You can put the query into a constraint trigger on both the main table and the referenced table to get something approaching sequenced referential integrity for the current period of presence.
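For instance, a minimal sketch of such a constraint trigger on b (the function and trigger names are made up, and changes to a would need a similar trigger in the other direction):
create or replace function check_b_pov_covered() returns trigger
language plpgsql as
$$
declare
  ok boolean;
begin
  -- same referential query as above, reduced to a single boolean
  select coalesce(
    bool_and(sub.ok),
    (select count(*) = 0 from b where upper(pop) is null)
  )
  into ok
  from (
    select bool_or(a.pov @> b.pov) as ok
    from a
    join b on a.id = b.a_id
    and upper(a.pop) is null
    and upper(b.pop) is null
    group by b.pov
  ) sub;
  if not ok then
    raise exception 'sequenced referential integrity violation between b and a';
  end if;
  return null;
end
$$;
create constraint trigger b_pov_ri
after insert or update or delete on b
deferrable initially deferred
for each row execute function check_b_pov_covered();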

PostgreSQL - How to make a condition with records between the current record date and the same date plus 5 min?

I have something like this. With this query I detect whether a vehicle stopped for at least 5 minutes.
It works, but with a large amount of data it starts to get slow.
I did a lot of tests and I'm sure that my problem is in the NOT EXISTS block.
My table:
CREATE TABLE public.messages
(
id bigint PRIMARY KEY DEFAULT nextval('messages_id_seq'::regclass),
messagedate timestamp with time zone NOT NULL,
vehicleid integer NOT NULL,
driverid integer NOT NULL,
speedeffective double precision NOT NULL,
-- ... few nonsense properties
)
WITH (
OIDS=FALSE
);
ALTER TABLE public.messages OWNER TO postgres;
CREATE INDEX idx_messages_1 ON public.messages
USING btree (vehicleid, messagedate);
And my query:
SELECT
*
FROM
messages m
WHERE
m.speedeffective > 0
and m.next_speedeffective = 0
and not exists( -- my problem
select id
from messages
where
vehicleid = m.vehicleid
and speedeffective > 5 -- I forgot this condition
and messagedate > m.messagedate
and messagedate <= m.messagedate + interval '5 minutes'
)
I can't figure out how to build the condition in a more performant way.
Edit DAY2:
I added a preliminary CTE like this to use in the main query:
WITH messagesx as (
SELECT
vehicleid,
messagedate
FROM
messages
WHERE
speedeffective > 5
)
and now it works better. I think I'm just missing a little detail.
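Roughly, the rewritten query now looks something like this (a sketch reusing that CTE; the exact final form is what I'm still polishing):
WITH messagesx as (
SELECT vehicleid, messagedate
FROM messages
WHERE speedeffective > 5
)
SELECT m.*
FROM messages m
WHERE m.speedeffective > 0
and m.next_speedeffective = 0
and not exists (
select 1
from messagesx x
where x.vehicleid = m.vehicleid
and x.messagedate > m.messagedate
and x.messagedate <= m.messagedate + interval '5 minutes'
)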
Typically, a 'NOT EXISTS' will slow down your query as it requires a full scan of the table for each of the outer rows. Try to incorporate the same functionality within a join (I'm trying to rewrite the query here, without knowing the table, so I might make a mistake here):
SELECT
m1.*
FROM
messages m1
LEFT JOIN
messages m2
ON m1.vehicleid = m2.vehicleid AND m2.speedeffective > 5 AND m2.messagedate > m1.messagedate AND m2.messagedate <= m1.messagedate + interval '5 minutes'
WHERE
m1.speedeffective > 0
and m1.next_speedeffective = 0
and m2.vehicleid IS NULL
Take note that the NOT EXISTS is rewritten as the non-hit of the join condition.
Based on this answer: https://stackoverflow.com/a/36445233/5000827
and reading about NOT IN, NOT EXISTS and LEFT JOIN (where join is NULL)
For PostgreSQL, NOT EXISTS and LEFT JOIN ... IS NULL are both anti-joins and work the same way. (This is the reason why @CountZukula's answer performs almost the same as mine.)
The problem was the kind of join operation the planner chose: nested loop or hash.
So, based on this: https://www.postgresql.org/docs/9.6/static/routine-vacuuming.html
PostgreSQL's VACUUM command has to process each table on a regular basis for several reasons:
To recover or reuse disk space occupied by updated or deleted rows.
To update data statistics used by the PostgreSQL query planner.
To update the visibility map, which speeds up index-only scans.
To protect against loss of very old data due to transaction ID wraparound or multixact ID wraparound.
I ran VACUUM ANALYZE on the messages table and the same query now runs much faster.
So, after the VACUUM ANALYZE, PostgreSQL can make better planning decisions.
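As a concrete sketch of those two steps (reusing the query from the question; next_speedeffective is one of the columns omitted from the table definition above):
VACUUM ANALYZE messages;
EXPLAIN (ANALYZE, BUFFERS)
SELECT m.*
FROM messages m
WHERE m.speedeffective > 0
and m.next_speedeffective = 0
and not exists (
select 1
from messages
where vehicleid = m.vehicleid
and speedeffective > 5
and messagedate > m.messagedate
and messagedate <= m.messagedate + interval '5 minutes'
);
Once the statistics are up to date, the EXPLAIN output should show a hash anti-join rather than a nested loop.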

Proper table to track employee changes over time?

I have been using Python to do this in memory, but I would like to know the proper way to set up an employee mapping table in Postgres.
row_id | employee_id | other_id | other_dimensions | effective_date | expiration_date | is_current
Unique constraint on (employee_id, other_id), so a new row would be inserted whenever there is a change
I would want the expiration date from the previous row to be updated to the new effective_date minus 1 day, and the is_current should be updated to False
Ultimate purpose is to be able to map each employee back accurately on a given date
Would love to hear some best practices so I can move away from my file-based method where I read the whole roster into memory and use pandas to make changes, then truncate the original table and insert the new one.
Here's a general example built using the column names you provided that I think does more or less what you want. Don't treat it as a literal ready-to-run solution, but rather an example of how to make something like this work that you'll have to modify a bit for your own actual use case.
The rough idea is to make an underlying raw table that holds all your data, and establish a view on top of this that gets used for ordinary access. You can still use the raw table to do anything you need to do to or with the data, no matter how complicated, but the view provides more restrictive access for regular use. Rules are put in place on the view to enforce these restrictions and perform the special operations you want. While it doesn't sound like it's significant for your current application, it's important to note that these restrictions can be enforced via PostgreSQL's roles and privileges and the SQL GRANT command.
We start by making the raw table. Since the is_current column is likely to be used for reference a lot, we'll put an index on it. We'll take advantage of PostgreSQL's SERIAL type to manage our raw table's row_id for us. The view doesn't even need to reference the underlying row_id. We'll default the is_current to a True value as we expect most of the time we'll be adding current records, not past ones.
CREATE TABLE raw_employee (
row_id SERIAL PRIMARY KEY,
employee_id INTEGER,
other_id INTEGER,
other_dimensions VARCHAR,
effective_date DATE,
expiration_date DATE,
is_current BOOLEAN DEFAULT TRUE
);
CREATE INDEX employee_is_current_index ON raw_employee (is_current);
Now we define our view. To most of the world this will be the normal way to access employee data. Internally it's a special SELECT run on-demand against the underlying raw_employee table that we've already defined. If we had reason to, we could further refine this view to hide more data (it's already hiding the low-level row_id as mentioned earlier) or display additional data produced either via calculation or relations with other tables.
CREATE OR REPLACE VIEW employee AS
SELECT employee_id, other_id,
other_dimensions, effective_date, expiration_date,
is_current
FROM raw_employee;
Now our rules. We construct these so that whenever someone tries an operation against our view, internally it'll perform an operation against our raw table according to the restrictions we define. First, INSERT; it mostly just passes the data through without change, but it has to account for the hidden row_id:
CREATE OR REPLACE RULE employee_insert AS ON INSERT TO employee DO INSTEAD
INSERT INTO raw_employee VALUES (
NEXTVAL('raw_employee_row_id_seq'),
NEW.employee_id, NEW.other_id,
NEW.other_dimensions,
NEW.effective_date, NEW.expiration_date,
NEW.is_current
);
The NEXTVAL part enables us to lean on PostgreSQL for row_id handling. Next is our most complicated one: UPDATE. Per your described intent, it has to match against employee_id, other_id pairs and perform two operations: updating the old record to be no longer current, and inserting a new record with updated dates. You didn't specify how you wanted to manage new expiration dates, so I took a guess. It's easy to change it.
CREATE OR REPLACE RULE employee_update AS ON UPDATE TO employee DO INSTEAD (
UPDATE raw_employee SET is_current = FALSE
WHERE raw_employee.employee_id = OLD.employee_id AND
raw_employee.other_id = OLD.other_id;
INSERT INTO raw_employee VALUES (
NEXTVAL('raw_employee_row_id_seq'),
COALESCE(NEW.employee_id, OLD.employee_id),
COALESCE(NEW.other_id, OLD.other_id),
COALESCE(NEW.other_dimensions, OLD.other_dimensions),
COALESCE(NEW.effective_date, OLD.expiration_date - '1 day'::INTERVAL),
COALESCE(NEW.expiration_date, OLD.expiration_date + '1 year'::INTERVAL),
TRUE
);
);
The use of COALESCE enables us to update columns that have explicit updates, but keep old values for ones that don't. Finally, we need to make a rule for DELETE. Since you said you want to ensure you can track employee histories, the best way to do this is also the simplest: we just disable it.
CREATE OR REPLACE RULE employee_delete_protect AS
ON DELETE TO employee DO INSTEAD NOTHING;
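As a quick illustration (a sketch), any attempt to delete through the view is simply ignored:
DELETE FROM employee WHERE employee_id = 1;  -- reports DELETE 0; raw_employee keeps its history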
Now we ought to be able to insert data into our raw table by performing INSERT operations on our view. Here are two sample employees; the first has a few weeks left but the second is about to expire. Note that at this level we don't need to care about the row_id. It's an internal implementation detail of the lower level raw table.
INSERT INTO employee VALUES (
1, 1,
'test', CURRENT_DATE - INTERVAL '1 week', CURRENT_DATE + INTERVAL '3 weeks',
TRUE
);
INSERT INTO employee VALUES (
2, 2,
'another test', CURRENT_DATE - INTERVAL '1 month', CURRENT_DATE,
TRUE
);
The final example is deceptively simple after all the build-up that we've done. It performs an UPDATE operation on the view, and internally it results in an update to the existing employee #2 plus a new entry for employee #2.
UPDATE employee SET expiration_date = CURRENT_DATE + INTERVAL '1 year'
WHERE employee_id = 2 AND other_id = 2;
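To verify (a sketch based on the sample data above), employee #2 should now show one retired row (is_current = FALSE) and one current row with the new expiration date:
SELECT employee_id, other_id, effective_date, expiration_date, is_current
FROM employee
WHERE employee_id = 2
ORDER BY is_current;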
Again I'll stress that this isn't meant to just take and use without modification. There should be enough info here though for you to make something work for your specific case.

Enforce Atomic Operations x Locks

I have a database model that works similarly to a bank account (one table for operations, and a trigger to update the balance). I'm currently using SQL Server 2008 R2.
TABLE OPERATIONS
----------------
VL_CREDIT decimal(10,2)
VL_DEBIT decimal(10,2)
TABLE BALANCE
-------------
DT_OPERATION datetime
VL_CURRENT decimal(10,2)
PROCEDURE INSERT_OPERATION
--------------------------
GET LAST BALANCE BY DATE
CHECK IF VALUE OF OPERATION > BALANCE
IF > RETURN ERROR
ELSE INSERT INTO OPERATION(...,....)
The issue I have is the following:
The procedure to insert the operation has to check the balance to see if there's money available before inserting the operation, so the balance never gets negative. If there's no balance, I return some code to tell the user the balance is not enough.
My concern is: If this procedure gets called multiple times in a row, how can I guarantee that it's atomic?
I have some ideas, but I am not sure which would guarantee it:
BEGIN TRANSACTION on the OPERATION PROCEDURE
Some sort of lock on selecting the BALANCE table, but it must hold until the end of procedure execution
Can you suggest some approach to guarantee that? Thanks in advance.
UPDATE
I read on MSDN (http://technet.microsoft.com/en-us/library/ms187373.aspx) that if my procedure has BEGIN/END TRANSACTION, and the SELECT on table BALANCE has WITH(TABLOCKX), it locks the table until the end of the transaction, so if a subsequent call to this procedure is made during the execution of the first, it will wait, and then guarantee that the value is always the last updated. Will it work? And if so, is it the best practice?
If you're amenable to changing your table structures, I'd build it this way:
create table Transactions (
SequenceNo int not null,
OpeningBalance decimal(38,4) not null,
Amount decimal(38,4) not null,
ClosingBalance as CONVERT(decimal(38,4),OpeningBalance + Amount) persisted,
PrevSequenceNo as CASE WHEN SequenceNo > 1 THEN SequenceNo - 1 END persisted,
constraint CK_Transaction_Sequence CHECK (SequenceNo > 0),
constraint PK_Transaction_Sequence PRIMARY KEY (SequenceNo),
constraint CK_Transaction_NotNegative CHECK (OpeningBalance + Amount >= 0),
constraint UQ_Transaction_BalanceCheck UNIQUE (SequenceNo, ClosingBalance),
constraint FK_Transaction_BalanceCheck FOREIGN KEY
(PrevSequenceNo, OpeningBalance)
references Transactions
(SequenceNo,ClosingBalance)
/* Optional - another check that Transaction 1 has 0 amount and
0 opening balance, if required */
)
Where you just apply credits and debits as +ve or -ve values for Amount. The above structure is enough to enforce the "not going negative" requirement (via CK_Transaction_NotNegative), and it also ensures that you know the current balance (by finding the row with the highest SequenceNo and taking its ClosingBalance value). Together, UQ_Transaction_BalanceCheck and FK_Transaction_BalanceCheck (and the computed columns) ensure that the entire sequence of transactions is valid, and PK_Transaction_Sequence keeps everything building in order.
So, if we populate it with some data:
insert into Transactions (SequenceNo,OpeningBalance,Amount) values
(1,0.0,10.0),
(2,10.0,-5.50),
(3,4.50,2.75)
And now we can attempt an insert (this could be INSERT_PROCEDURE with @NewAmount passed as a parameter):
declare @NewAmount decimal(38,4)
set @NewAmount = -15.50
;With LastTransaction as (
select SequenceNo,ClosingBalance,
ROW_NUMBER() OVER (ORDER BY SequenceNo desc) as rn
from Transactions
)
insert into Transactions (SequenceNo,OpeningBalance,Amount)
select SequenceNo + 1, ClosingBalance, @NewAmount
from LastTransaction
where rn = 1
This insert fails because it would have caused the balance to go negative. But if @NewAmount was small enough, it would have succeeded. And if two inserts are attempted at "the same time", then either a) they're just far enough apart in reality that they both succeed, and the balances are all kept correct, or b) one of them will receive a PK violation error.
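Reading the current balance under this design is then just a matter of taking the latest row (a sketch):
SELECT TOP (1) ClosingBalance
FROM Transactions
ORDER BY SequenceNo DESC;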

Add datetime constraint to a PostgreSQL multi-column partial index

I've got a PostgreSQL table called queries_query, which has many columns.
Two of these columns, created and user_sid, are frequently used together in SQL queries by my application to determine how many queries a given user has done over the past 30 days. It is very, very rare that I query these stats for any time older than the most recent 30 days.
Here is my question:
I've currently created my multi-column index on these two columns by running:
CREATE INDEX CONCURRENTLY some_index_name ON queries_query (user_sid, created)
But I'd like to further restrict the index to only care about those queries in which the created date is within the past 30 days. I've tried doing the following:
CREATE INDEX CONCURRENTLY some_index_name ON queries_query (user_sid, created)
WHERE created >= NOW() - '30 days'::INTERVAL
But this throws an exception stating that my function must be immutable.
I'd love to get this working so that I can optimize my index, and cut back on the resources Postgres needs to do these repeated queries.
You get an exception using now() because the function is not IMMUTABLE (obviously) and, quoting the manual:
All functions and operators used in an index definition must be "immutable" ...
I see two ways to utilize a (much more efficient) partial index:
1. Partial index with condition using constant date:
CREATE INDEX queries_recent_idx ON queries_query (user_sid, created)
WHERE created > '2013-01-07 00:00'::timestamp;
Assuming created is actually defined as timestamp. It wouldn't work to provide a timestamp constant for a timestamptz column (timestamp with time zone). The cast from timestamp to timestamptz (or vice versa) depends on the current time zone setting and is not immutable. Use a constant of matching data type. Understand the basics of timestamps with / without time zone:
Ignoring time zones altogether in Rails and PostgreSQL
Drop and recreate that index at hours with low traffic, maybe with a cron job on a daily or weekly basis (or whatever is good enough for you). Creating an index is pretty fast, especially a partial index that is comparatively small. This solution also doesn't need to add anything to the table.
Assuming no concurrent access to the table, automatic index recreation could be done with a function like this:
CREATE OR REPLACE FUNCTION f_index_recreate()
RETURNS void
LANGUAGE plpgsql AS
$func$
BEGIN
DROP INDEX IF EXISTS queries_recent_idx;
EXECUTE format('
CREATE INDEX queries_recent_idx
ON queries_query (user_sid, created)
WHERE created > %L::timestamp'
, LOCALTIMESTAMP - interval '30 days'); -- timestamp constant
-- , now() - interval '30 days'); -- alternative for timestamptz
END
$func$;
Call:
SELECT f_index_recreate();
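If the pg_cron extension is available (an assumption; a plain OS cron job calling psql works just as well), scheduling the nightly recreation might look like:
SELECT cron.schedule('recreate-queries-recent-idx', '0 3 * * *', 'SELECT f_index_recreate()');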
now() (like you had) is the equivalent of CURRENT_TIMESTAMP and returns timestamptz. Cast to timestamp with now()::timestamp or use LOCALTIMESTAMP instead.
Select today's (since midnight) timestamps only
If you have to deal with concurrent access to the table, use DROP INDEX CONCURRENTLY and CREATE INDEX CONCURRENTLY. But you can't wrap these commands into a function because, per documentation:
... a regular CREATE INDEX command can be performed within a
transaction block, but CREATE INDEX CONCURRENTLY cannot.
So, with two separate transactions:
CREATE INDEX CONCURRENTLY queries_recent_idx2 ON queries_query (user_sid, created)
WHERE created > '2013-01-07 00:00'::timestamp; -- your new condition
Then:
DROP INDEX CONCURRENTLY IF EXISTS queries_recent_idx;
Optionally, rename to old name:
ALTER INDEX queries_recent_idx2 RENAME TO queries_recent_idx;
2. Partial index with condition on "archived" tag
Add an archived tag to your table:
ALTER TABLE queries_query ADD COLUMN archived boolean NOT NULL DEFAULT FALSE;
UPDATE the column at intervals of your choosing to "retire" older rows and create an index like:
CREATE INDEX some_index_name ON queries_query (user_sid, created)
WHERE NOT archived;
Add a matching condition to your queries (even if it seems redundant) to allow them to use the index. Check with EXPLAIN ANALYZE whether the query planner catches on - it should be able to use the index for queries on a newer date. But it won't understand more complex conditions that don't match exactly.
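For example, a query shaped to match that index might look like this (a sketch; the user_sid value is a placeholder):
SELECT count(*)
FROM queries_query
WHERE user_sid = 'some-user'
AND created >= now() - interval '30 days'
AND NOT archived;  -- looks redundant, but lets the planner use the partial index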
You don't have to drop and recreate the index, but the UPDATE on the table may be more expensive than index recreation and the table gets slightly bigger.
I would go with the first option (index recreation). In fact, I am using this solution in several databases. The second incurs more costly updates.
Both solutions retain their usefulness over time, though performance slowly deteriorates as more outdated rows are included in the index.