How to achieve reliable db read calls with concurrency - postgresql

Consider this very simple table definition:
CREATE TABLE first_name
(
    id integer NOT NULL,
    name varchar(10),
    PRIMARY KEY (id)
);
Now consider you have two rows like:
 id | name
----+------
  1 | Dan
  2 | Jack
Imagine you have X processes that read the max(id) value from time to time and then use it to decide the next sequential id to write as a new record.
The problem is that with multiple such processes, another id may be inserted between the read and the write.
What is the best option in Postgres to guarantee an atomic action of reading the latest id and then writing the next one, when multiple processes are doing the same all the time?
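To illustrate the race, here is the naive pattern (a sketch against the table above; the name 'Eve' is just an example):

-- Two processes can both run this and compute the same next id:
SELECT COALESCE(MAX(id), 0) + 1 FROM first_name;      -- both read 3
INSERT INTO first_name (id, name) VALUES (3, 'Eve');  -- the second insert violates the primary key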
I know we have the serial type (like MySQL's autoincrement), which manages the field automatically in a sequential manner. How will it perform when multiple processes use just the serial definition, with no explicit lock mechanism applied? Is that sufficient? Are we protected against the concurrency problem here?
An example of the table declaration using serial (the second option above):
CREATE TABLE first_name
(
    id serial,
    name varchar(10),
    PRIMARY KEY (id)
);

To get the latest value, just query the serial's sequence (note that currval is session-local and raises an error unless nextval has already been called for that sequence in the current session):
select currval(pg_get_serial_sequence('first_name', 'id'));
for example:
clima=# CREATE TABLE first_name
(
id serial,
name varchar(10)
);
CREATE TABLE
clima=# insert into first_name(name) select 'Diego';
INSERT 0 1
clima=# select currval(pg_get_serial_sequence('first_name', 'id'));
currval
---------
1
(1 row)
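Because currval only reflects this session's own nextval calls, it is safe under concurrency. But if the goal is simply to learn the id of the row you just inserted, a RETURNING clause does it in one statement, a minimal sketch:

insert into first_name(name) values ('Diego') returning id;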
Yes and no. It depends: do you have transactions, and with which scope?

The whole point of SERIAL is that the database solves the concurrency issue for you.
With respect to postgresql:
Here's a page from the postgresql documentation (Data Types/Numeric Types/Serial Types) which tells you that SERIAL columns are built on sequences.
Note: Because smallserial, serial and bigserial are implemented using sequences...
Here we see sequence generators: CREATE SEQUENCE, a standard SQL construct (not PostgreSQL-only) that lets you create your own integer sequences without tying them to an (identity) column. The page discusses the semantics, including the property that not every sequential number will necessarily appear in your sequence, because sequence ids are consumed even if the row isn't actually added to the table, e.g., if the inserting transaction is rolled back.
Because nextval and setval calls are never rolled back, sequence objects cannot be used if "gapless" assignment of sequence numbers is needed. It is possible to build gapless assignment by using exclusive locking of a table containing a counter; but this solution is much more expensive than sequence objects, especially if many transactions need sequence numbers concurrently.
(Also, you can "cache" sequence generation, but then you may see non-sequential sequence ids.)
Finally, here we see that you can also use GENERATED AS IDENTITY to the same effect; that is standard SQL.
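Putting this together, a minimal sketch of a concurrency-safe version of the original table, using the standard identity syntax; RETURNING reads back the generated id without a separate query:

CREATE TABLE first_name
(
    id integer GENERATED ALWAYS AS IDENTITY,
    name varchar(10),
    PRIMARY KEY (id)
);

-- The database assigns the next id atomically; RETURNING reads it back
-- in the same statement, so no separate "read max(id)" step is needed.
INSERT INTO first_name (name) VALUES ('Dan') RETURNING id;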

Related

How can a relational database with foreign key constraints ingest data that may be in the wrong order?

The database is ingesting data from a stream, and all the rows needed to satisfy a foreign key constraint may be late or never arrive.
This could likely be accomplished by using another datastore, one without foreign key constraints, and then, when all the needed data is available, loading it into the database that has the FK constraints. However, this adds complexity and I'd like to avoid it.
We're working on a solution that creates "placeholder" rows to point the foreign key to. When the real data comes in, the placeholder is replaced with real values. Again, this adds complexity, but it's the best solution we've found so far.
How do people typically solve this problem?
Edit: Some sample data which might help explain the problem:
Let's say we have these tables:
-- "order" must be quoted: ORDER is a reserved word
CREATE TABLE "order" (
    id INTEGER NOT NULL,
    order_number INTEGER,
    PRIMARY KEY (id),
    UNIQUE (order_number)
);
CREATE TABLE line_item (
    id INTEGER NOT NULL,
    order_number INTEGER REFERENCES "order"(order_number),
    PRIMARY KEY (id)
);
If I insert an order first, no problem! But let's say I try:
INSERT INTO line_item (order_number) VALUES (123);
before order 123 has been inserted. This will fail the FK constraint, of course. But this might be the order in which I receive the data, since it's read from a stream that collects it from multiple sources.
Also, to address #philpxy's question: I didn't really find much on this. One thing that was mentioned was deferred constraints, a mechanism that postpones the FK checks to the end of a transaction. I don't think that's possible in my case, however, since these insert statements will be run at random times, whenever the data is received.
You have a business workflow problem: line items of individual orders are coming in before the orders themselves have come in. One workaround, perhaps not ideal, would be to create a BEFORE INSERT trigger which checks, for every incoming insert into the line_item table, whether that order already exists in the order table. If not, it first inserts the order record before trying the insert on line_item.
CREATE OR REPLACE FUNCTION "public"."fn_insert_order" () RETURNS trigger AS $$
BEGIN
    INSERT INTO "order" (order_number)
    SELECT NEW.order_number
    WHERE NOT EXISTS (SELECT 1 FROM "order" WHERE order_number = NEW.order_number);
    RETURN NEW;
END;
$$
LANGUAGE plpgsql;

-- trigger
CREATE TRIGGER "trigger_insert_order"
BEFORE INSERT ON line_item FOR EACH ROW
EXECUTE PROCEDURE fn_insert_order();
Note: I am assuming that the id column of the order table is in fact auto-increment, in which case Postgres would automatically assign a value to it when inserting as above. Most likely this is what you want, as having two id columns that both need to be manually assigned does not make much sense.
You could accomplish that with a BEFORE INSERT trigger on line_item.
In that trigger you check whether a matching order exists and, if there is none, insert a dummy row.
That will allow the INSERT to succeed, at the cost of some performance.
To insert rows into "order", use
INSERT INTO "order" ...
ON CONFLICT (order_number) DO UPDATE
SET id = EXCLUDED.id;
Updating a primary key is problematic and may lead to conflicts. One way around that is to use negative ids for artificially generated orders (assuming the real ids are positive). If you have any references to that primary key, you'd have to define the constraint with ON UPDATE CASCADE.
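A minimal sketch combining the two ideas above (the sequence, function, and trigger names are illustrative): a descending sequence supplies the negative placeholder ids, and ON CONFLICT DO NOTHING makes the placeholder insert safe under concurrency:

CREATE SEQUENCE placeholder_order_id_seq INCREMENT BY -1 START WITH -1;

CREATE OR REPLACE FUNCTION fn_placeholder_order() RETURNS trigger AS $$
BEGIN
    -- Insert a placeholder order with a negative id; if an order with this
    -- order_number already exists (real or placeholder), do nothing.
    INSERT INTO "order" (id, order_number)
    VALUES (nextval('placeholder_order_id_seq'), NEW.order_number)
    ON CONFLICT (order_number) DO NOTHING;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trigger_placeholder_order
BEFORE INSERT ON line_item FOR EACH ROW
EXECUTE PROCEDURE fn_placeholder_order();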

PostgreSQL: serial vs identity

To have an integer auto-numbering primary key on a table, you can use SERIAL
But I noticed the table information_schema.columns has a number of identity_ fields, and indeed, you could create a column with a GENERATED specifier...
What's the difference? Were they introduced with different PostgreSQL versions? Is one preferred over the other?
serial is the "old" implementation of auto-generated unique values and has been part of Postgres for ages; however, it is not part of the SQL standard.
To be more compliant with the SQL standard, Postgres 10 introduced the syntax generated as identity.
The underlying implementation is still based on a sequence, but the definition now complies with the SQL standard. One thing the new syntax allows is preventing an accidental override of the value.
Consider the following tables:
create table t1 (id serial primary key);
create table t2 (id integer primary key generated always as identity);
Now when you run:
insert into t1 (id) values (1);
the underlying sequence and the values in the table are no longer in sync. If you then run
insert into t1 default values;
you will get an error, because the sequence was not advanced by the first insert and now tries to generate the value 1 again.
With the second table however,
insert into t2 (id) values (1);
Results in:
ERROR: cannot insert into column "id"
Detail: Column "id" is an identity column defined as GENERATED ALWAYS.
So you can't accidentally "forget" the sequence usage. You can still force this, using the OVERRIDING SYSTEM VALUE option:
insert into t2 (id) overriding system value values (1);
which still leaves you with a sequence that is out-of-sync with the values in the table, but at least you were made aware of that.
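If a sequence does get out of sync, a minimal sketch of re-synchronizing it (pg_get_serial_sequence works for identity columns as well as serial ones; this assumes the table is not empty):

select setval(pg_get_serial_sequence('t2', 'id'), (select max(id) from t2));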
Identity columns have another advantage: they minimize the grants you need to give to a role in order to allow inserts.
While a table using a serial column requires the INSERT privilege on the table and the USAGE privilege on the underlying sequence, this is not needed for tables using identity columns: granting the INSERT privilege is enough.
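A minimal sketch of the difference (the role name app_writer is illustrative; t1_id_seq is the sequence Postgres generates for t1 above):

grant insert on t1 to app_writer;
grant usage on sequence t1_id_seq to app_writer;  -- also needed for the serial column
grant insert on t2 to app_writer;                 -- sufficient for the identity column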
It is recommended to use the new identity syntax rather than serial.

Create custom sequence generator in Postgres to consider column value

I have a table called jobs in Postgres, with columns job_id, username, created_ts, and other job-related information. Column job_id is of serial type.
...
job_id serial NOT NULL,
username character varying(32),
...
username denotes the user who created the job. A user can only see the jobs created by him, and he can also see the job_id. As a result, users see seemingly random job_ids rather than a strict sequence.
I want the sequence to start from 1 for each user; it has to be sequential with respect to username. This will create duplicate job_ids across users, but that's okay.
I have referred to https://www.postgresql.org/docs/9.5/static/sql-createsequence.html, but I couldn't find any way to create a sequence keyed by a column value. Is it possible to create such a sequence generator in Postgres?
I don't know whether you can do that with sequences, but unless someone can think of something better, you could remove the sequence and instead have a BEFORE INSERT trigger that does something like
NEW.job_id := COALESCE((SELECT MAX(job_id) FROM jobs WHERE username = NEW.username), 0) + 1;
As long as you have the appropriate indexes, it should be perfectly fast, though I can foresee issues with concurrency. It depends on how you expect to use it.
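A minimal sketch of that trigger, assuming the jobs table from the question (function and trigger names are illustrative; note the concurrency caveat above: two concurrent inserts for the same username can compute the same job_id):

CREATE OR REPLACE FUNCTION fn_next_job_id() RETURNS trigger AS $$
BEGIN
    -- Compute the next per-user job_id from the current maximum.
    NEW.job_id := COALESCE((SELECT MAX(job_id) FROM jobs
                            WHERE username = NEW.username), 0) + 1;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trigger_next_job_id
BEFORE INSERT ON jobs FOR EACH ROW
EXECUTE PROCEDURE fn_next_job_id();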

Way to migrate a create table with sequence from postgres to DB2

I need to migrate a DDL from Postgres to DB2, and I need it to work the same way as it does in Postgres. There is a table that generates values from a sequence, but the values can also be given explicitly.
Postgres
create sequence hist_id_seq;
create table benchmarksql.history (
hist_id integer not null default nextval('hist_id_seq') primary key,
h_c_id integer,
h_c_d_id integer,
h_c_w_id integer,
h_d_id integer,
h_w_id integer,
h_date timestamp,
h_amount decimal(6,2),
h_data varchar(24)
);
(Note the sequence call in the hist_id column's default, which defines the value of the primary key.)
The business logic sometimes inserts into the table with an explicitly provided ID, and in other cases leaves it to the database to choose the number.
If I change this in DB2 to GENERATED ALWAYS, it throws errors because some values are provided explicitly. On the other hand, if I create the table with GENERATED BY DEFAULT, DB2 throws an error (SQL0803N) when trying to insert with an already used value, because the "internal sequence" does not take the already-inserted values into account, and it does not retry with the next value.
And I do not want to restart the sequence each time a provided ID is inserted.
This is the problem in BenchmarkSQL when trying to port it to DB2: https://sourceforge.net/projects/benchmarksql/ (File sqlTableCreates)
How can I implement in DB2 the same database logic that Postgres (and apparently Oracle) provides?
You're operating under a misconception: that sources external to the db get to dictate its internal keys. Ideally/conceptually, autogenerated ids never need to be seen outside the db, as there should be unique natural keys for export or reporting. Still, there are times when applications need to manage some ids, often when setting up related entities (e.g., JPA seems to want to work this way).
However, if you add an id value that you generated from a different source, the db won't be able to manage it. How could it do so efficiently? Attempting it would do one of the following:
Be unsafe in the face of multiple clients (attempt to add duplicate keys)
Serialize access to the table (for a potentially slow query, too)
(This usually shows up when people attempt something like SELECT MAX(id) + 1, which would require locking the entire table for thread safety, likely including statements that don't even touch that column. Trying to find the "first unused" id, i.e., filling gaps, gets even more complicated and problematic.)
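As a sketch of why that is expensive, in Postgres terms: making SELECT MAX(id) + 1 safe requires an exclusive table lock, serializing every writer (the value 'a' is just an example):

BEGIN;
LOCK TABLE benchmarksql.history IN EXCLUSIVE MODE;
-- No other writer can insert between the MAX() read and this insert.
INSERT INTO benchmarksql.history (hist_id, h_data)
SELECT COALESCE(MAX(hist_id), 0) + 1, 'a' FROM benchmarksql.history;
COMMIT;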
Neither is ideal, so it's best not to have the problem in the first place. This is usually done by having id columns be autogenerated, but (as pointed out earlier) there are situations where we may need to know the id before we insert the row into the table. Fortunately, there's a standard SQL object for this: SEQUENCE. It provides a db-managed, thread-safe, fast way to get ids. In PostgreSQL you can use sequences in the DEFAULT clause for a column, but DB2 doesn't allow that. If you don't want to specify an id every time (it should be autogenerated some of the time), you'll need another way; this is the perfect time to use a BEFORE INSERT trigger:
CREATE TRIGGER Add_Generated_Id
NO CASCADE BEFORE INSERT ON benchmarksql.history
REFERENCING NEW AS Incoming_Entity
FOR EACH ROW
WHEN (Incoming_Entity.hist_id IS NULL)
SET Incoming_Entity.hist_id = NEXT VALUE FOR hist_id_seq
(something like this - not tested. You didn't specify where in the project this would belong)
So, if you then add a row with something like:
INSERT INTO benchmarksql.history (hist_id, h_data) VALUES(null, 'a')
or
INSERT INTO benchmarksql.history (h_data) VALUES('a')
an id will be generated and attached automatically. Note that ALL ids added to the table must come from the given sequence (as #mustaccio pointed out, this appears to be true even in PostgreSQL), or any UNIQUE CONSTRAINT on the column will start throwing duplicate-key errors. So any time your application needs an id before inserting a row in the table, you'll need some form of
SELECT NEXT VALUE FOR hist_id_seq
FROM sysibm.sysdummy1
... and that's it, pretty much. This is completely thread- and concurrency-safe, will not maintain or require long-term locks, and does not require serialized access to the table.
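So the application-side flow, when it needs the id up front, is a two-step sketch (the value 42 is illustrative):

SELECT NEXT VALUE FOR hist_id_seq FROM sysibm.sysdummy1
-- suppose this returns 42; then insert with the fetched id:
INSERT INTO benchmarksql.history (hist_id, h_data) VALUES (42, 'a')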

Values missing in postgres serial field

I run a small site and use PostgreSQL 8.2.17 (the only version available at my host) to store data.
In the last few months there were 3 crashes of the database system on my server, and every time it happened, 31 IDs from a serial field (primary key) in one of the tables went missing. There are now 93 IDs missing.
Table:
CREATE TABLE "REGISTRY"
(
"ID" serial NOT NULL,
"strUID" character varying(11),
"strXml" text,
"intStatus" integer,
"strUIDOrg" character varying(11),
)
It is very important for me that all the ID values are there. What can I do to solve this problem?
You cannot expect a serial column to have no holes.
You can implement gapless key by sacrificing concurrency like this:
create table registry_last_id (value int not null);
insert into registry_last_id values (-1);
create function next_registry_id() returns int language sql volatile
as $$
update registry_last_id set value=value+1 returning value
$$;
create table registry ( id int primary key default next_registry_id(), ... )
But any transaction that tries to insert into the registry table will block until the other inserting transaction finishes and writes its data to disk. This limits you to no more than about 125 inserting transactions per second on a 7500 rpm disk drive.
Also, any DELETE from the registry table will create a gap.
This solution is based on article Gapless Sequences for Primary Keys by A. Elein Mustain, which is somewhat outdated.
Are you missing 93 records or do you have 3 "holes" of 31 missing numbers?
A sequence is not transaction-safe: it will never roll back. Therefore it is not a mechanism for creating a sequence of numbers without holes.
From the manual:
Important: To avoid blocking concurrent transactions that obtain numbers from the same sequence, a nextval operation is never rolled back; that is, once a value has been fetched it is considered used, even if the transaction that did the nextval later aborts. This means that aborted transactions might leave unused "holes" in the sequence of assigned values. setval operations are never rolled back, either.
Thanks to the answers from Matthew Wood and Frank Heikens, I think I have a solution.
Instead of using a serial field, I have to create my own sequence and set its CACHE parameter to 1. This way Postgres will not cache values and each one will be taken directly from the sequence :)
Thanks for all your help :)
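A minimal sketch of that approach (names as in the table above; note that CACHE 1 is in fact the default, and, per the manual excerpt quoted above, nextval is still never rolled back, so gaps from aborted transactions or crashes remain possible):

CREATE SEQUENCE registry_id_seq CACHE 1;
CREATE TABLE "REGISTRY"
(
    -- each value comes straight from the sequence, with no session cache
    "ID" integer NOT NULL DEFAULT nextval('registry_id_seq'),
    "strUID" character varying(11),
    "strXml" text,
    "intStatus" integer,
    "strUIDOrg" character varying(11)
);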