I have a Postgres table that is used to hold users' files. Two users can have a file with the same name, but a user isn't allowed to have two files with the same name. Currently, if a user tries to upload a file with a name they already used, the database will spit out the error below, as it should.
IntegrityError: duplicate key value violates unique constraint "file_user_id_title_key"
What I would like to do is first query the database with the file name and user ID to see if the file name is being used by the user. If the name is already being used, return an error, otherwise write the row.
cur.execute('INSERT INTO files(user_id, title, share)'
'VALUES (%s, %s, %s) RETURNING id;',
(user.id, file.title, file.share))
The problem is that you cannot really do that without opening a race condition:
There is nothing to keep somebody else from inserting a conflicting row between the time you query the table and when you try to insert the row, so the error could still happen (unless you go to extreme measures like locking the table before you do that, which would affect concurrency badly).
Moreover, your proposed technique incurs extra load on the database by adding a superfluous second query.
You are right that you should not confront the user with a database error message, but the correct way to handle this is as follows:
You INSERT the new row like you showed.
You check if you get an error.
If the SQLSTATE of the error is the SQL standard value 23505 (unique_violation), you know that there is already such a file for the user and show the appropriate error to the user.
So you can consider the INSERT statement an atomic operation: it checks whether a matching row already exists and, if not, adds the row.
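If you would rather do that check inside the database, a minimal PL/pgSQL sketch of the same pattern could look like the function below (the function name, parameter names and column types are assumptions); from psycopg2 the equivalent is catching the exception raised by the INSERT and comparing its pgcode with '23505':
-- Sketch only: add_file and its parameter types are made up for illustration.
CREATE OR REPLACE FUNCTION add_file(p_user_id integer, p_title text, p_share boolean)
RETURNS integer AS $$
DECLARE
    v_id integer;
BEGIN
    INSERT INTO files (user_id, title, share)
    VALUES (p_user_id, p_title, p_share)
    RETURNING id INTO v_id;
    RETURN v_id;
EXCEPTION
    WHEN unique_violation THEN   -- SQLSTATE 23505
        RETURN NULL;             -- caller shows "you already have a file with that name"
END;
$$ LANGUAGE plpgsql;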
Related
When I INSERT or UPDATE a list of rows in PostgreSQL and one of them causes an error, how can I know exactly which one (its index in the input list)?
For example, if I have a UNIQUE constraint on the name column, and the name 'two' already exists, I want to know that the constraint violation is caused by the input row at index 1.
INSERT INTO table (id, name) VALUES ('0000', 'one'), ('0001', 'two');
I know PostgreSQL will stop on the first error encountered, and therefore that we can't know all of the problematic rows. That's fine, I just need the first problematic index (if any).
Inserting each row separately is not a possibility since we want to optimize for performance as well.
Postgres gives you exactly what you are asking for: it provides the constraint name, the column(s), and the value(s). However, much of this is in the supplementary detail of the error message, so you need to extract the complete message. See demo.
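As a sketch of what that looks like (the table mytable and its UNIQUE constraint on name are assumed, not taken from the question), the error can be trapped in PL/pgSQL and unpacked with GET STACKED DIAGNOSTICS; the detail text carries the conflicting value, which your client code can match against its input list to recover the index of the first offending row:
-- Sketch: "mytable" with a UNIQUE constraint on "name" is assumed.
DO $$
DECLARE
    v_constraint text;
    v_detail     text;
BEGIN
    INSERT INTO mytable (id, name) VALUES ('0000', 'one'), ('0001', 'two');
EXCEPTION
    WHEN unique_violation THEN
        GET STACKED DIAGNOSTICS
            v_constraint = CONSTRAINT_NAME,       -- e.g. mytable_name_key
            v_detail     = PG_EXCEPTION_DETAIL;   -- e.g. Key (name)=(two) already exists.
        RAISE NOTICE 'violated %: %', v_constraint, v_detail;
END $$;
The same pieces are exposed to client drivers as well (psycopg2, for instance, puts them on the exception's diag attribute), so the application can do the matching without any PL/pgSQL.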
I have a table from which I want to UPSERT into another. When I try to run the query, I get the "cannot affect row a second time" error. So I checked whether my first table has duplicates in the field with the UNIQUE constraint, and it has none. I must be missing something, but since I cannot figure out what (and my query is a bit complex because it includes some JOINs), here is the query; the field with the UNIQUE constraint is "identifiant_immeuble":
with upd(a,b,c,d,e,f,g,h,i,j,k) as(
select id_parcelle, batimentimmeuble,etatimmeuble,nb_loc_hab_ma,nb_loc_hab_ap,nb_loc_pro, dossier.id_dossier, adresse.id_adresse, zapms.geom, 1, batimentimmeuble2
from public.zapms
left join geo_pays_gex.dossier on dossier.designation_siea=zapms.id_dossier
left join geo_pays_gex.adresse on adresse.id_voie=(select id_voie from geo_pays_gex.voie where (voie.designation=zapms.nom_voie or voie.nom_quartier=zapms.nom_quartier) and voie.rivoli=lpad(zapms.rivoli,4,'0'))
and adresse.num_voie=zapms.num_voie
and adresse.insee=zapms.insee_commune::integer
)
insert into geo_pays_gex.bal2(identifiant_immeuble, batimentimmeuble, id_etat_addr, nb_loc_hab_ma, nb_loc_hab_ap, nb_loc_pro, id_dossier, id_adresse, geom, raccordement, batimentimmeuble2)
select a,b,c,d,e,f,g,h,i,j,k from upd
on conflict (identifiant_immeuble) do update
set batimentimmeuble=excluded.batimentimmeuble, id_etat_addr=excluded.id_etat_addr, nb_loc_hab_ma=excluded.nb_loc_hab_ma, nb_loc_hab_ap=excluded.nb_loc_hab_ap, nb_loc_pro=excluded.nb_loc_pro,
id_dossier=excluded.id_dossier, id_adresse=excluded.id_adresse,geom=excluded.geom, raccordement=1, batimentimmeuble2=excluded.batimentimmeuble2
;
As you can see, I use several intermediary tables in this query: one storing the street names (voie), one related to it storing the addresses (adresse, basically house numbers linked through a foreign key to the street names table), and another storing other data related to the project names (dossier).
I don't know what other information I could give to help find an answer; I guess it is better that I do not share the actual content of my tables, since it may fall under privacy regulations or such.
Thanks for your attention.
EDIT: I found a workaround by deleting from the bal2 table the entries that are present in the zapms table, like so:
delete from geo_pays_gex.bal2 where bal2.identifiant_immeuble in (select id_parcelle from zapms);
It is not entirely satisfying though, since I would have preferred to keep track of the data creator, the date of creation, and the fact that the data has been modified (I have some fields to store this information), and here I simply erase all this history... And I have another table with the primary key of the bal2 table as a foreign key. I am still building the database, so I can afford to truncate this table, but in production it wouldn't be possible since I would lose some data.
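For what it's worth, the usual cause of "cannot affect row a second time" with ON CONFLICT is that the SELECT feeding the insert yields the same identifiant_immeuble more than once (for example when one of the LEFT JOINs matches several rows), even when the base table itself has no duplicates. An untested sketch that reuses the CTE from the query above and lists any value produced more than once:
with upd(a,b,c,d,e,f,g,h,i,j,k) as(
select id_parcelle, batimentimmeuble,etatimmeuble,nb_loc_hab_ma,nb_loc_hab_ap,nb_loc_pro, dossier.id_dossier, adresse.id_adresse, zapms.geom, 1, batimentimmeuble2
from public.zapms
left join geo_pays_gex.dossier on dossier.designation_siea=zapms.id_dossier
left join geo_pays_gex.adresse on adresse.id_voie=(select id_voie from geo_pays_gex.voie where (voie.designation=zapms.nom_voie or voie.nom_quartier=zapms.nom_quartier) and voie.rivoli=lpad(zapms.rivoli,4,'0'))
and adresse.num_voie=zapms.num_voie
and adresse.insee=zapms.insee_commune::integer
)
select a as identifiant_immeuble, count(*)
from upd
group by a
having count(*) > 1;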
I need to migrate a DDL from Postgres to DB2, but I need it to behave the same as it does in Postgres. There is a table that generates values from a sequence, but the values can also be given explicitly.
Postgres
create sequence hist_id_seq;
create table benchmarksql.history (
hist_id integer not null default nextval('hist_id_seq') primary key,
h_c_id integer,
h_c_d_id integer,
h_c_w_id integer,
h_d_id integer,
h_w_id integer,
h_date timestamp,
h_amount decimal(6,2),
h_data varchar(24)
);
(Note the sequence call in the hist_id column's default, which defines the value of the primary key.)
The business logic inserts into the table by explicitly providing an ID, and in other cases, it leaves the database to choose the number.
If I change this in DB2 to GENERATED ALWAYS, it will throw errors because some values are provided explicitly. On the other hand, if I create the table with GENERATED BY DEFAULT, DB2 will throw an error (SQL0803N) when it later generates a value that was already inserted explicitly, because the "internal sequence" does not take the already inserted values into account and does not retry with the next value.
And I do not want to restart the sequence each time a provided ID is inserted.
This is the problem in BenchmarkSQL when trying to port it to DB2: https://sourceforge.net/projects/benchmarksql/ (File sqlTableCreates)
How can I implement in DB2 the same behavior that this DDL has in Postgres (and apparently in Oracle)?
You're operating under a misconception: that sources external to the db get to dictate its internal keys. Ideally, autogenerated ids never need to be seen outside of the db, as conceptually there should be unique natural keys for export or reporting. Still, there are times when applications will need to manage some ids, often when setting up related entities (e.g., JPA seems to want to work this way).
However, if you add an id value that you generated from a different source, the db won't be able to manage it efficiently. How could it? Attempting to do so would mean one of the following:
Be unsafe in the face of multiple clients (attempt to add duplicate keys)
Serialize access to the table (for a potentially slow query, too)
(This usually shows up when people attempt something like: SELECT MAX(id) + 1, which would require locking the entire table for thread safety, likely including statements that don't even touch that column. If you try to find any "first-unused" id - trying to fill gaps - this gets more complicated and problematic)
Neither is ideal, so it's best not to have the problem in the first place. This is usually done by having id columns be autogenerated, but (as pointed out earlier) there are situations where we may need to know what the id will be before we insert the row into the table. Fortunately, there's a standard SQL object for this: SEQUENCE. It provides a db-managed, thread-safe, fast way to get ids. It appears that in PostgreSQL you can use sequences in the DEFAULT clause for a column, but DB2 doesn't allow it. If you don't want to specify an id every time (it should be autogenerated some of the time), you'll need another way; this is the perfect time to use a BEFORE INSERT trigger:
CREATE TRIGGER Add_Generated_Id NO CASCADE BEFORE INSERT ON benchmarksql.history
REFERENCING NEW AS Incoming_Entity
FOR EACH ROW
WHEN (Incoming_Entity.hist_id IS NULL)
SET Incoming_Entity.hist_id = NEXT VALUE FOR hist_id_seq
(something like this - not tested. You didn't specify where in the project this would belong)
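For completeness, an untested sketch of the DB2 counterparts of the Postgres objects above is simply the same DDL minus the DEFAULT nextval(...) clause, which DB2 rejects:
create sequence hist_id_seq;
create table benchmarksql.history (
hist_id integer not null primary key,
h_c_id integer,
h_c_d_id integer,
h_c_w_id integer,
h_d_id integer,
h_w_id integer,
h_date timestamp,
h_amount decimal(6,2),
h_data varchar(24)
);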
So, if you then add a row with something like:
INSERT INTO benchmarksql.history (hist_id, h_data) VALUES(null, 'a')
or
INSERT INTO benchmarksql.history (h_data) VALUES('a')
an id will be generated and attached automatically. Note that ALL ids added to the table must come from the given sequence (as #mustaccio pointed out, this appears to be true even in PostgreSQL), or any UNIQUE CONSTRAINT on the column will start throwing duplicate-key errors. So any time your application needs an id before inserting a row in the table, you'll need some form of
SELECT NEXT VALUE FOR hist_id_seq
FROM sysibm.sysdummy1
... and that's it, pretty much. This is completely thread- and concurrency-safe, doesn't require long-term locks, and doesn't serialize access to the table.
I need to insert one table's data into another table. It is not guaranteed that the source table's rows are all valid: some of the NOT NULL fields may contain null values. So with this source table I need to insert all valid rows into the target table, find all invalid rows that failed to insert, and return them.
I know we can do this by validating all rows beforehand, but this is a bulk insert from a CSV parsed by .NET code, so we will not validate it in the database first but insert it directly.
We could also do this by running a loop, but performance would suffer.
So my question is: is there any way to use a single statement that inserts the rows that are valid and skips the rows that have a problem?
BULK INSERT is all-or-nothing. SQL Server does not have the ability to shunt erroneous rows into a separate table, alas.
The best thing you can do is to validate all data thoroughly before inserting it. If the insert still fails (maybe due to a bug) you need to retry all rows one-by-one and log the errors that are occurring.
You can also bulk insert to a temp table and move the rows from there to the final table one-by-one.
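When the invalid rows are just the ones with NULLs in NOT NULL columns, as in the question, the temp-table route can even stay set-based. A rough sketch with made-up table, column, and file names:
-- Staging table mirrors the target but with every column nullable,
-- so every CSV row loads.
CREATE TABLE #staging (id int NULL, name varchar(100) NULL, amount decimal(10,2) NULL);

BULK INSERT #staging
FROM 'C:\data\source.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

-- Rows that satisfy the target's NOT NULL constraints go in...
INSERT INTO dbo.target (id, name, amount)
SELECT id, name, amount
FROM #staging
WHERE id IS NOT NULL AND name IS NOT NULL AND amount IS NOT NULL;

-- ...and the rest come back as the failed rows.
SELECT id, name, amount
FROM #staging
WHERE id IS NULL OR name IS NULL OR amount IS NULL;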
On my table I have a secondary unique key labeled md5. Before inserting, I check to see if the MD5 exists, and if not, insert it, as shown below:
-- Attempt to find this item
SELECT INTO oResults (SELECT domain_id FROM db.domains WHERE "md5"=oMD5);
IF (oResults IS NULL) THEN
-- Attempt to find this domain
INSERT INTO db.domains ("md5", "domain", "inserted")
VALUES (oMD5, oDomain, now());
RETURN currval('db.domains_seq');
END IF;
This works great for single-threaded inserts; my problem is when I have two external applications calling my function concurrently that happen to have the same MD5. I end up with a situation where:
App 1: Sees the MD5 does not exist
App 2: Inserts this MD5 into table
App 1: Now tries to insert the MD5 into the table since it thinks it doesn't exist, but gets an error, because right after it saw that the MD5 wasn't there, App 2 inserted it.
Is there a more effective way of doing this?
Can I catch the error on insert and if so, then select the domain_id?
Thanks in advance!
This also seems to be covered at Insert, on duplicate update in PostgreSQL?
You could just go ahead and try to insert the MD5 and catch the error: if you get a "unique constraint violation" error, ignore it and keep going; if you get some other error, bail out. That way you push the duplicate checking right down to the database and your race condition goes away.
Something like this:
Attempt to insert the MD5 value.
If you get a unique violation error, then ignore it and continue on.
If you get some other error, bail out and complain.
If you don't get an error, then continue on.
Do your SELECT INTO oResults (SELECT domain_id FROM db.domains WHERE "md5"=oMD5) to extract the domain_id.
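In PL/pgSQL, the body of the function could then look something like this (a sketch reusing the names from the question):
-- Try the insert; a concurrent duplicate is simply ignored.
BEGIN
    INSERT INTO db.domains ("md5", "domain", "inserted")
    VALUES (oMD5, oDomain, now());
EXCEPTION
    WHEN unique_violation THEN
        NULL;   -- someone else inserted this md5 first; that's fine
END;

-- Either way the row exists now, so fetch its id.
SELECT domain_id INTO oResults FROM db.domains WHERE "md5" = oMD5;
RETURN oResults;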
There might be a bit of a performance hit but "correct and a little slow" is better than "fast but broken".
Eventually you might end up with more exceptions than successful inserts. Then you could try to insert into the table that references (through a foreign key) your db.domains and trap the FK violation there. If you get an FK violation, do the old "insert and ignore unique violations" on db.domains and then retry the insert that gave you the FK violation. This is the same basic idea; it's just a matter of choosing which approach will probably throw the fewest exceptions and going with that.