Avoiding double SELECT/INSERT by INSERTing a placeholder - postgresql

Is it possible to perform a query that will SELECT for some values and if those values do not exist, perform an INSERT and return the very same values - in a single query?
Background:
I am writing an application with a great deal of concurrency. At one point a function will check a database to see if a certain key value exists using SELECT. If the key exists, the function can safely exit. If the value does not exist, the function will perform a REST API call to capture the necessary data, then INSERT the values into the database.
This works fine until it is run concurrently. Two threads (I am using Go, so goroutines) will each independently run the SELECT. Since both queries report that the key does not exist, both will independently perform the REST API call and both will attempt to INSERT the values.
Currently, I avoid double insertion by using a unique constraint. However, I would like to avoid even the double API call by having the first query SELECT for the key value and, if it does not exist, INSERT a placeholder - then return those values. This way, subsequent SELECT queries report that the key value already exists and will not perform the API call or INSERT.
In pseudocode, something like this:
SELECT values FROM my_table WHERE key = KEY_HERE;
if found:
    RETURN the selected values;
if not found:
    INSERT INTO my_table (values, key) VALUES (random_placeholder, KEY_HERE);
    SELECT values FROM my_table WHERE key = KEY_HERE;
The application code will insert a random value so that a routine/thread can determine whether it was the one that generated the new INSERT, and if so go ahead and perform the REST API call.
This is for a Go application using the pgx library.
Thanks!

You could write a stored procedure, and it would be a single query for the client to execute; PostgreSQL, of course, would still execute multiple statements. A PostgreSQL INSERT statement can return values with the RETURNING keyword, so you may not need the second SELECT.
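In fact this can often be collapsed into a single statement with a data-modifying CTE and ON CONFLICT. A minimal sketch, assuming a table my_table(key text PRIMARY KEY, value text) - the "values" column is renamed to value here, since VALUES is a reserved word - with an inserted flag so the caller knows whether it created the placeholder:

WITH ins AS (
    INSERT INTO my_table (key, value)
    VALUES ($1, $2)              -- $2 = the random placeholder
    ON CONFLICT (key) DO NOTHING
    RETURNING value, true AS inserted
)
SELECT value, inserted FROM ins
UNION ALL
SELECT value, false FROM my_table WHERE key = $1
LIMIT 1;

One caveat: under READ COMMITTED, a row committed by a concurrent transaction after this statement's snapshot was taken can leave both branches empty, so the caller should retry on an empty result.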

Lock the table in an appropriate lock mode.
For example, in the strictest possible mode, ACCESS EXCLUSIVE:
BEGIN TRANSACTION;

LOCK elbat
  IN ACCESS EXCLUSIVE MODE;

SELECT *
  FROM elbat
  WHERE id = 1;

-- if no row was returned, make the API call and then:
INSERT INTO elbat
  (id,
   <other columns>)
  VALUES (1,
          <API call return values>);

COMMIT;

-- return the values to the application
Once one transaction has acquired the ACCESS EXCLUSIVE lock, no other transaction can even read from the table until the acquiring transaction ends. And ACCESS EXCLUSIVE won't be granted while any other (even weaker) lock is held. That way the instance of your component that gets the lock first will do the check and the INSERT if necessary. The other one is blocked in the meantime, and by the time it finally gets access, the INSERT has already been done by the first transaction, so it no longer needs to make the API call (unless the first transaction failed and rolled back).
If this is too strict for your use case, you may need to find out which lock level is appropriate for you. Maybe, if you can make every component accessing the database (or at least the table) cooperate (and it sounds like you can), even advisory locks are enough.
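A minimal advisory-lock sketch (the hashtext() call is just one way to map a text key onto the bigint that pg_advisory_xact_lock() expects; the key names are illustrative):

BEGIN;
-- serialize all cooperating sessions working on this key;
-- the lock is released automatically at COMMIT/ROLLBACK
SELECT pg_advisory_xact_lock(hashtext('my_table:' || 'KEY_HERE'));
SELECT * FROM my_table WHERE key = 'KEY_HERE';
-- if no row was returned, make the API call and INSERT here
COMMIT;

Unlike the table lock, this blocks only sessions competing for the same key; plain readers of the table are unaffected.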

Related

node-postgres: multi-query is atomic?

When using pg-promise (based on node-postgres), a multi-query seems to be atomic.
For example, the following PostgreSQL query does not insert any rows at all even though only the second INSERT fails due to a duplicate id. No transactions are used.
insert into mytable (id) values (1); insert into mytable (id) values (1)
This behavior seems counter-intuitive and differs from that of psql. Is this a bug?
My tests indicate that yes, surprisingly, it is atomic, i.e. if one query fails, they all fail, same as inside a transaction.
I will investigate why that is, and post an update, if I find anything. See the open issue.
UPDATE
The investigation has confirmed that it is indeed how PostgreSQL works when multiple queries are sent in a single string.
Documentation for methods multi and multiResult has been amended accordingly:
The operation is atomic, i.e. all queries are executed in a single transaction, unless there are explicit BEGIN/COMMIT commands included in the query string to divide it into multiple transactions.
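To illustrate the difference (a sketch against the same hypothetical mytable):

-- one string, no explicit transaction commands: the duplicate id
-- makes both inserts fail together
insert into mytable (id) values (1); insert into mytable (id) values (1);

-- divided by explicit BEGIN/COMMIT: the first transaction commits and
-- its row survives, even though the second insert then fails
begin; insert into mytable (id) values (1); commit;
insert into mytable (id) values (1);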

psql upsert results in noncontinuous id

I have a PostgreSQL (>9.5) table with primary key id and a unique key col. When I use
INSERT INTO table_a (col) VALUES ('xxx') ON CONFLICT (col) DO NOTHING;
to perform an upsert, let's say a row with id 1 is generated.
If I run the SQL again, nothing will happen, but id 2 will actually be generated and abandoned.
Then if I insert a new record, for example,
INSERT INTO table_a (col) VALUES ('yyy') ON CONFLICT (col) DO NOTHING;
Another row with id 3 will be generated and id 2 is wasted!
Is there anyway to avoid this waste?
Presumably id is a serial. Under the hood this causes a nextval() call on a sequence. A number nextval() has once returned will never be returned again. And the nextval() call happens before checking for possible conflicts.
From "9.16. Sequence Manipulation Functions":
nextval
(...)
Important: To avoid blocking concurrent transactions that obtain numbers from the same sequence, a nextval operation is never rolled back; that is, once a value has been fetched it is considered used and will not be returned again. This is true even if the surrounding transaction later aborts, or if the calling query ends up not using the value. For example an INSERT with an ON CONFLICT clause will compute the to-be-inserted tuple, including doing any required nextval calls, before detecting any conflict that would cause it to follow the ON CONFLICT rule instead. Such cases will leave unused "holes" in the sequence of assigned values. Thus, PostgreSQL sequence objects cannot be used to obtain "gapless" sequences.
In conclusion, that means the answer to your question is no: there is no way to avoid this, unless you generate the values yourself somehow.
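A minimal reproduction of the behavior described above (the sequence is the one a serial column creates implicitly):

CREATE TABLE table_a (id serial PRIMARY KEY, col text UNIQUE);

INSERT INTO table_a (col) VALUES ('xxx') ON CONFLICT (col) DO NOTHING; -- row gets id 1
INSERT INTO table_a (col) VALUES ('xxx') ON CONFLICT (col) DO NOTHING; -- nextval() burns 2, no row
INSERT INTO table_a (col) VALUES ('yyy') ON CONFLICT (col) DO NOTHING; -- row gets id 3

SELECT id, col FROM table_a; -- (1, 'xxx') and (3, 'yyy'); id 2 is gone for good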

Possible to let the stored procedure run one by one even if multiple sessions are calling them in postgresql

In PostgreSQL: multiple sessions want to each get one record from the table, but we need to make sure they don't interfere with each other. I could do it with a message queue: put the data in a queue, and then let each session get data from the queue. But is it doable in PostgreSQL itself? It would be easier for the SQL folks to call a stored procedure. Is there any way to configure a stored procedure so that no concurrent calls happen, or to use some special lock?
I would recommend making sure the stored procedure uses SELECT FOR UPDATE, which should prevent the same row in the table from being claimed by multiple transactions at once.
Per the Postgres doc:
FOR UPDATE causes the rows retrieved by the SELECT statement to be
locked as though for update. This prevents them from being modified or
deleted by other transactions until the current transaction ends. That
is, other transactions that attempt UPDATE, DELETE, SELECT FOR UPDATE,
SELECT FOR NO KEY UPDATE, SELECT FOR SHARE or SELECT FOR KEY SHARE of
these rows will be blocked until the current transaction ends. The FOR
UPDATE lock mode is also acquired by any DELETE on a row, and also by
an UPDATE that modifies the values on certain columns. Currently, the
set of columns considered for the UPDATE case are those that have an
unique index on them that can be used in a foreign key (so partial
indexes and expressional indexes are not considered), but this may
change in the future.
More SELECT info.
So that you don't end up locking all of the rows in the table at once (i.e. by SELECTing all of the records), I would recommend using ORDER BY to sort the table in a consistent manner, and then a LIMIT 1, so that each call only gets the next row in the queue. Also add a WHERE clause that checks a certain column value (e.g. processed), and once a row is processed, set that column to a value that prevents the WHERE clause from picking it up again. Something like the sketch below:
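A sketch of that pattern, with illustrative table and column names (work_queue, payload, processed):

BEGIN;
UPDATE work_queue
SET processed = true
WHERE id = (
    SELECT id
    FROM work_queue
    WHERE processed = false
    ORDER BY id
    LIMIT 1
    FOR UPDATE
)
RETURNING id, payload;
-- ... process the returned row, then ...
COMMIT;

Note that a concurrent caller blocked on the row lock can come back with zero rows once the winner commits (the re-checked WHERE clause no longer matches), so callers should simply retry; on PostgreSQL 9.5+, adding SKIP LOCKED after FOR UPDATE lets waiters jump straight to the next unclaimed row instead.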

Way to migrate a create table with sequence from postgres to DB2

I need to migrate a DDL script from Postgres to DB2, but I need it to work the same as it does in Postgres. There is a table that generates values from a sequence, but the values can also be explicitly given.
Postgres
create sequence hist_id_seq;
create table benchmarksql.history (
hist_id integer not null default nextval('hist_id_seq') primary key,
h_c_id integer,
h_c_d_id integer,
h_c_w_id integer,
h_d_id integer,
h_w_id integer,
h_date timestamp,
h_amount decimal(6,2),
h_data varchar(24)
);
(Note the sequence call in the hist_id column's DEFAULT, which defines the value of the primary key.)
The business logic inserts into the table by explicitly providing an ID, and in other cases, it leaves the database to choose the number.
If I change this in DB2 to GENERATED ALWAYS, it will throw errors because some values are provided explicitly. On the other hand, if I create the table with GENERATED BY DEFAULT, DB2 will throw an error (SQL0803N) when an insert tries to use a value the identity generates, because the "internal sequence" does not take the already-inserted values into account and does not retry with the next value.
And I do not want to restart the sequence each time a provided ID is inserted.
This is the problem in BenchmarkSQL when trying to port it to DB2: https://sourceforge.net/projects/benchmarksql/ (File sqlTableCreates)
How can I implement the same database logic in DB2 as in Postgres (and apparently in Oracle)?
You're operating under a misconception: that sources external to the db get to dictate its internal keys. Ideally/conceptually, autogenerated ids never need to be seen outside of the db, as conceptually there should be unique natural keys for export or reporting. Still, there are times when applications will need to manage some ids, often when setting up related entities (e.g., JPA seems to want to work this way).
However, if you add an id value that you generated from a different source, the db won't be able to manage it. How could it? Attempting to do so isn't efficient - for one thing, it would have to do one of the following:
Be unsafe in the face of multiple clients (attempt to add duplicate keys)
Serialize access to the table (for a potentially slow query, too)
(This usually shows up when people attempt something like SELECT MAX(id) + 1, which would require locking the entire table for thread safety, likely blocking statements that don't even touch that column. Trying to find any "first unused" id - filling gaps - gets even more complicated and problematic.)
Neither is ideal, so it's best to not have the problem in the first place. This is usually done by having id columns be autogenerated, but (as pointed out earlier) there are situations where we may need to know what the id will be before we insert the row into the table. Fortunately, there's a standard SQL object for this: SEQUENCE. It provides a db-managed, thread-safe, fast way to get ids. It appears that in PostgreSQL you can use sequences in the DEFAULT clause for a column, but DB2 doesn't allow it. If you don't want to specify an id every time (it should be autogenerated some of the time), you'll need another way; this is the perfect time to use a BEFORE INSERT trigger:
CREATE TRIGGER Add_Generated_Id
  NO CASCADE BEFORE INSERT ON benchmarksql.history
  REFERENCING NEW AS Incoming_Entity
  FOR EACH ROW
  WHEN (Incoming_Entity.hist_id IS NULL)
  SET Incoming_Entity.hist_id = NEXTVAL FOR hist_id_seq
(something like this - not tested. You didn't specify where in the project this would belong)
So, if you then add a row with something like:
INSERT INTO benchmarksql.history (hist_id, h_data) VALUES(null, 'a')
or
INSERT INTO benchmarksql.history (h_data) VALUES('a')
an id will be generated and attached automatically. Note that ALL ids added to the table must come from the given sequence (as @mustaccio pointed out, this appears to be true even in PostgreSQL), or any UNIQUE CONSTRAINT on the column will start throwing duplicate-key errors. So any time your application needs an id before inserting a row into the table, you'll need some form of:
SELECT NEXT VALUE FOR hist_id_seq
FROM sysibm.sysdummy1
... and that's it, pretty much. This is completely thread and concurrency safe, will not maintain/require long-term locks, nor require serialized access to the table.

Check referential integrity in stored procedure

I have a customer table and an order table in a SQL Server 2000 database.
I don't want an order to be in the order table with a customerID that doesn't exist in the customer table so I have put a foreign key constraint on customerID.
This all works fine but when writing a stored procedure that could possibly violate the constraint, is there a way to check whether the constraint will be violated and, if it will be, skip/rollback the query?
At the minute, all that happens is that the stored procedure returns an error that is displayed on my ASP page; it looks rather ugly, and most users won't understand it.
I would like a more elegant way of handling the error if possible.
Thanks
You have two options:
Add error handling to catch the ugly error, inspect it to see if it's a FK constraint violation, and display a friendly message to the user. This is IMHO the better solution.
Add code in the stored procedure like the following:
if exists (select null from customer where customerid = @customerId)
begin
    -- The customer exists, so insert the order
end
else
begin
    -- Do something to tell your code to display an error message
end
With the second option you will want to watch your transactional consistency. For example, what happens if the customer is deleted after your check is made?
You can inspect the data before attempting the operation, or you can attempt the operation and then check the errors after each statement, then ROLLBACK etc.
But you can handle it entirely within stored procedures and return appropriately to the caller according to your design.
Have a look at this article: http://www.sommarskog.se/error-handling-II.html
In SQL Server 2005 and later, there is also the possibility of using TRY/CATCH:
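A minimal sketch of that approach (SQL Server 2005 or later; the orders columns and the @customerId parameter are illustrative):

BEGIN TRY
    INSERT INTO orders (customerID, orderDate)
    VALUES (@customerId, GETDATE());
END TRY
BEGIN CATCH
    -- 547 is SQL Server's generic constraint-violation error (FK, CHECK)
    IF ERROR_NUMBER() = 547
        SELECT 'That customer does not exist.' AS friendlyError;
    ELSE
        SELECT ERROR_MESSAGE() AS friendlyError;
END CATCH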