PostgreSQL: check if exists vs unique constraint in table

I am inserting a huge number of records (about a million), each containing a phone number and a status, into a PostgreSQL database from Hibernate. I am reading the records from a file, processing each one, and inserting them one at a time. But before each insert I need to check whether that combination of phone number and status already exists in the table.
It seems to me that the fastest way would be a query limited to 1 row, or an EXISTS query; but a colleague suggested instead adding a unique constraint on the table over the phone number and status columns and, whenever the unique rule is violated, simply catching the exception in Hibernate.
Any thoughts on what's the fastest and most reliable method?

It depends on whether the table has only these columns or also others, for example a date. If you don't care which record stays in the database (say, you need the latest combination of number, status, and date), create the unique constraint and recover from the exception that is thrown when a duplicate is inserted.
You could also insert everything, duplicates included, with a surrogate primary key (id), then delete all duplicates except the one you want to keep (GROUP BY ...), and only then create the unique constraint.
A last option depends on the data volume: with only 1M records, you could deduplicate them in the application layer and then save them.
Which option is best depends on how many duplicates there are. If there are only a few, use the first option; if each record may appear ten times, the last option may be best (depending on RAM, since you only hold the currently best record per phone and status).
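A minimal sketch of the first two options, using a hypothetical table phone_records(id, phone_number, status), since the question doesn't give the schema:
-- Option 1: add the constraint up front and catch the exception in Hibernate.
ALTER TABLE phone_records
    ADD CONSTRAINT phone_records_uk UNIQUE (phone_number, status);

-- Option 2: load everything first, then keep one row per combination
-- and add the constraint only afterwards.
DELETE FROM phone_records a
USING phone_records b
WHERE a.id > b.id
  AND a.phone_number = b.phone_number
  AND a.status = b.status;

ALTER TABLE phone_records
    ADD CONSTRAINT phone_records_uk UNIQUE (phone_number, status);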

Related

PostgreSQL - on conflict update for GENERATED ALWAYS AS IDENTITY

I have a table of values which I want to manage with my application ...
let's say this is the table
CREATE TABLE student (
    id_student int4 NOT NULL GENERATED ALWAYS AS IDENTITY,
    id_teacher int2,
    student_name varchar(255),
    age int2,
    CONSTRAINT provider_pk PRIMARY KEY (id_student)
);
In the application, each teacher can see the list of all his/her students .. and they can edit or add new students
I am trying to figure out how to UPSERT data into the table in PostgreSQL ... what I am doing now: each teacher edits on the FE only in JS (without saving each change individually) ... after the edit they click the SAVE button, and that's when I need to store the changes and new records in the DB ...
What I do now is delete all records for that particular teacher and store the new object/array they created (by editing, adding, ... whatever) - it's easy and I don't have to check for changes and new records ... the drawbacks are a brutal waste of the sequence for ID_STUDENT (autogenerated on the DB side) and of course a huge overhead on indexes while inserting (= rebuilding), considering there will be a lot of teachers saving a lot of their students ... that might cause performance issues ... not to mention the fragmentation (HWM), so I would have to VACUUM this table regularly
In Oracle, I could easily use MERGE INTO (which is fantastic for this use case), but MERGE is not available in PostgreSQL :(
The only thing I know about is INSERT ... ON CONFLICT ... DO UPDATE ... but the problem is, how am I supposed to apply this to a GENERATED ALWAYS AS IDENTITY key? I do not provide this sequence value (on top of that, I don't even know the latest number), and therefore I cannot trigger ON CONFLICT (id_student) ...
Is there any nice way out of this sh*t? Or is DELETE / INSERT really the way to go?
You shouldn't be too worried about the data churn – after all, an UPDATE also writes a new version of the row, so it wouldn't be that much different. And the sequence is no problem, because you used bigint for the primary key, right (anything else would have been a mistake)?
If you want to use INSERT ... ON CONFLICT in combination with an auto-generated sequence, you need some way beside the primary key to identify a row, that is, you need a UNIQUE constraint that you can use with ON CONFLICT. If there is no candidate for such a constraint, how can you identify the records for the teacher?
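For illustration, a sketch under an assumption the question doesn't state: if (id_teacher, student_name) happens to be unique, a secondary UNIQUE constraint makes ON CONFLICT usable without ever touching the identity column:
-- hypothetical business key; only valid if a teacher never has two
-- students with the same name
ALTER TABLE student
    ADD CONSTRAINT student_uk UNIQUE (id_teacher, student_name);

INSERT INTO student (id_teacher, student_name, age)
VALUES (7, 'Alice', 15)
ON CONFLICT (id_teacher, student_name)
    DO UPDATE SET age = EXCLUDED.age;
The identity column is simply omitted from the INSERT, so PostgreSQL generates a value only for rows that are actually new.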

Postgres Unique Sequences in one table based on owner/foreign key

I am creating a web application that will store all user information in one database using permissions, roles, and FKs to restrict data access. One of the tables in this application tracks work orders created by each user (i.e. the work order table has an FK to the user table).
I want to ensure that each user has their own uninterrupted sequence of 'work order IDs', assigned when the work order is scheduled. That is, if user 1 creates his first work order, it is assigned #1; if user 2 creates his fifth work order, it is assigned #5.
The work order table has a UUID primary key, so each record is distinguishable, and the user FK has a not-null constraint.
Based on my research so far, it seems like Postgres sequences would be my best answer. I would need to create a sequence for each user and incorporate it into a trigger that stamps the work order record with the next appropriate ID. However, this seems very performance-intensive, and creating a new sequence for every user would bring its own set of challenges.
A second approach could be a second table that tracks each user's latest sequence value: query it, increment it, and update both the work order table and the number-tracking table. However, I think this scenario would be susceptible to race conditions if two users were to convert records at exactly the same time.
I'm unsure what the best way to solve the problem would be. Is there another way that would provide better performance?
Sequences won't work for you, because they are not transactional by design: if an insert with a generated number fails, that number is consumed even after a ROLLBACK.
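A quick demonstration with a throwaway sequence:
CREATE SEQUENCE demo_seq;

BEGIN;
SELECT nextval('demo_seq');   -- returns 1
ROLLBACK;

SELECT nextval('demo_seq');   -- returns 2: the rolled-back call left a gap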
You should create a second table
CREATE TABLE counters (
user_id bigint PRIMARY KEY REFERENCES users ON DELETE CASCADE,
work_order_id bigint NOT NULL DEFAULT 0
);
Then you get the next number for a given user with
UPDATE counters
SET work_order_id = work_order_id + 1
WHERE user_id = ...     -- the user creating the work order
RETURNING work_order_id;
That is atomic and safe from race conditions. Just make sure you run that update and the insert in the same database transaction, then they will either both succeed or both fail and be undone.
This will serialize inserts into the work orders table per user, but gap-less sequences are always a performance problem.
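For illustration, a sketch of both statements in one transaction; the work_orders column names here are assumptions, and gen_random_uuid() needs PostgreSQL 13+ or the pgcrypto extension:
BEGIN;

WITH next_number AS (
    UPDATE counters
    SET work_order_id = work_order_id + 1
    WHERE user_id = 42                     -- the user creating the work order
    RETURNING work_order_id
)
INSERT INTO work_orders (id, user_id, work_order_no)
SELECT gen_random_uuid(), 42, work_order_id
FROM next_number;

COMMIT;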

DB2: access a specific row in a non-unique table for update / delete operations

Can I do row-specific update / delete operations on a DB2 table via SQL, in a NON-UNIQUE primary key context?
The table is a PHYSICAL FILE on the NATIVE SYSTEM of the AS/400.
Like many other files, it was created without a unique key definition, which leads DB2 to the conclusion that the table, or PF, has no unique key.
And that's my problem. I can't alter the structure of the table to insert a unique ID column, because I would have to recompile ALL my correlating programs on the AS/400, which is a serious issue; many things would "perhaps" no longer work. Of course, I could do that refactoring for one table, but our system has thousands of those native FILES, some well done with a unique key, some without a unique definition ...
Well, I work most of the time with DB2 and SQL on those old files, and all files which have a UNIQUE key are no problem for those important update / delete operations.
Is there some way to get an additional column with a truly unique row ID, or a row number, added to every SELECT? And, what is much more important, how can I run an UPDATE against that row number?
I did some research, and meanwhile I assume there is no way to do exact alterations or deletes when no unique key is present. What I would wish for is some additional ID column that is always returned with the table, which I could refer to in my update / delete operations. Perhaps my thinking here has a fallacy, and tables without a unique key are meant to be edited in other ways.
Try the RRN function.
SELECT RRN(EMPLOYEE), LASTNAME   -- RRN() returns each row's relative record number
FROM EMPLOYEE
WHERE ...;
UPDATE EMPLOYEE
SET ...
WHERE RRN(EMPLOYEE) = ...;       -- address the row by that number
Keep in mind that the relative record number reflects the row's physical position in the file, so it can change when the file is reorganized (e.g. with RGZPFM); re-read it rather than caching it.

Alter a SELECT Query to Limit

I'm working on a table with over 1M rows. The software that inserts the data sometimes tries to select all rows, and when it does, it crashes.
I'm not able to modify the software, so I'm trying to implement a fix on the PostgreSQL side.
I want PostgreSQL to limit the results of SELECT queries coming from one special user to 1.
I tried to implement a RULE but haven't managed to make it work. Any suggestions are welcome.
You could rename the table and create a view with the name of the table (selecting from the renamed table).
Then you can include a LIMIT clause in the view definition.
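A sketch with hypothetical names (big_table is the original table; pick whatever row count the software can survive):
ALTER TABLE big_table RENAME TO big_table_data;

CREATE VIEW big_table AS
SELECT *
FROM big_table_data
LIMIT 1000;   -- caps every SELECT that goes through the view
Two caveats: without an ORDER BY in the view, which rows fall inside the limit is arbitrary; and as far as I know a view containing LIMIT is not automatically updatable, so anything that also writes to the table would have to target the renamed table directly.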
There is a chance you need an index. Let me give you a few scenarios:
There is a unique constraint on one of the fields but no corresponding index. When you insert a record, PostgreSQL then has to scan the table to see if an existing record has the same value in that field.
Your software mimics a unique field constraint: before inserting a new record it scans the table for a record with the same value in one of the fields, to check whether such a record already exists. An index on the right field would definitely help.
Your software wants to compute the next "id" value. In this case it runs SELECT MAX(id) to find the next available value, so "id" needs an index.
Try to find out whether indexing one of the table fields helps. You can also trace and analyze the queries submitted to the server and see whether they would benefit from an index; you can enable query logging as described in "How to log PostgreSQL queries?".
Another guess is that your software buffers all records before processing them, and reading 1M records into memory may crash it. Limiting the fetch size (e.g. if your software uses JDBC, you could add the defaultRowFetchSize parameter to the connection string) may help, though I realize you may not have a way to change how the existing software fetches data from the DB.
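For instance, with the PostgreSQL JDBC driver the parameter goes straight into the connection URL (host and database name are placeholders):
jdbc:postgresql://dbhost:5432/mydb?defaultRowFetchSize=1000
Note that pgJDBC only uses cursor-based fetching when autocommit is off; with autocommit on, it still reads the whole result set into memory.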

Postgresql table with one ID column, sorted index, with duplicate primary key

I want to use a PostgreSQL table as a kind of work queue for documents. Each document has an ID and is stored in another, normal table with lots of additional columns. But this question is about creating the table for the work queue.
I want to create a table for this queue without OIDs with just one column: The ID of the document as integer. If an ID of a document exists in this work queue table, it means that the document with that ID is dirty and some processing has to be done.
The extra table is meant to avoid the VACUUM and dead-tuple problems, and the transaction deadlocks, that would emerge if there were just a dirty bit on each document entry in the main document table.
Many parts of my system would mark documents as dirty and therefore insert IDs to process into that table. These inserts would cover many IDs in one transaction. I don't want to use any kind of nested transaction, and there doesn't seem to be any kind of INSERT IF NOT EXISTS command, so I'd rather allow duplicate IDs in the table. Therefore duplicates must be possible for the only column in that table.
The process which works through the queue will delete all processed IDs, and thereby take care of duplicates. (BTW: there is another queue for the next step, so regarding race conditions the idea should be clean and have no problem.)
But I also want the documents to be processed in order: documents with smaller IDs shall always be processed first.
Therefore I want an index which aids LIMIT and ORDER BY on the ID column, the only column in the work queue table.
Ideally, given that I have only one column, this should be the primary key. But a primary key must not have duplicates, so it seems I can't do that.
Without the index, ORDER BY and LIMIT would be slow.
I could add a normal, secondary index on that column. But I fear PostgreSQL would add a second file on disc (it does that for every additional index) and double the amount of disc operations for that table.
What is the best thing to do?
Add a dummy column with something random (like the OID) to stop the primary key from complaining about duplicates? Must I waste that space in my queue table?
Or is adding the second index harmless? Would it become a kind of primary index that sits directly in the primary tuple btree?
Shall I delete everything above this and just leave the following? The original question is distracting and contains too much unrelated information.
I want to have a table in PostgreSQL with these properties:
One column with an integer
Allow duplicates
Efficient ORDER BY+LIMIT on the column
INSERTs should not perform any lookup in that table or maintain any kind of unique index. INSERTs shall just locate the right page in the main file / main btree for this table and insert the row between other rows, ordered by ID.
INSERTs will happen in bulk and must not fail, except for disc full, etc.
There shall not be additional btree files for this table, so no secondary indexes
The rows should occupy not much space, e.g. have no OIDs
I cannot think of a solution that solves all of this.
My only solution would compromise on the last bullet point: add a PRIMARY KEY covering the integer and also a dummy column, like an OID, a timestamp, or a SERIAL.
Another solution would use a hypothetical INSERT IF NOT EXISTS, a nested transaction, or a special INSERT with a WHERE clause. All of these would add a btree lookup when inserting, and they might cause deadlocks.
(Also posted here: https://dba.stackexchange.com/q/45126/7788)
You said
Many parts of my system would mark documents as dirty and therefore
insert IDs to process into that table. Therefore duplicates must be
possible.
and
5 rows with the same ID mean the same thing as 1 or 10 rows with that
same ID: They mean that the document with that ID is dirty.
You don't need duplicates for that. If the only purpose of this table is to identify dirty documents, a single row containing the document's id number is sufficient. There's no compelling reason to allow duplicates.
A single row for each ID number is not sufficient if you need to track which process inserted that row, or order rows by the time they were inserted, but a single column isn't sufficient for that in the first place. So I'm sure a primary key constraint or unique constraint would work fine for you.
Other processes have to ignore duplicate key errors, but that's simple. Those processes have to trap errors anyway; there are a lot of things besides a duplicate key that can prevent an insert statement from succeeding.
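As an aside, on PostgreSQL 9.5 and later the INSERT IF NOT EXISTS the question wished for effectively exists, so no error trapping is needed at all. A sketch against the primary-key version of the table built later in this answer:
INSERT INTO dirty_documents (document_id)
VALUES (42)
ON CONFLICT (document_id) DO NOTHING;   -- duplicates are silently skipped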
An implementation that allows duplicates . . .
create table dirty_documents (
document_id integer not null
);
create index on dirty_documents (document_id);
Insert 100k ID numbers into that table for testing. This will necessarily require updating the index. (Duh.) Include a bunch of duplicates.
insert into dirty_documents
select generate_series(1,100000);
insert into dirty_documents
select generate_series(1, 100);
insert into dirty_documents
select generate_series(1, 50);
insert into dirty_documents
select generate_series(88000, 93245);
insert into dirty_documents
select generate_series(83000, 87245);
Took less than a second on my desktop, which isn't anything special, and which is running three different database servers, two web servers, and playing a Rammstein CD.
Pick the first dirty document ID number for cleaning up.
select min(document_id)
from dirty_documents;
document_id
--
1
Took only 0.136 ms. Now let's delete every row that has document ID 1.
delete from dirty_documents
where document_id = 1;
Took 0.272 ms.
Let's start over.
drop table dirty_documents;
create table dirty_documents (
document_id integer primary key
);
insert into dirty_documents
select generate_series(1,100000);
Took 500 ms. Let's find the first one again.
select min(document_id)
from dirty_documents;
Took .054 ms. That's less than half the time it took using the table that allowed duplicates.
delete from dirty_documents
where document_id = 1;
Also took .054 ms. That's roughly five times faster than the other table.
Let's start over again, and try an unindexed table.
drop table dirty_documents;
create table dirty_documents (
document_id integer not null
);
insert into dirty_documents
select generate_series(1,100000);
insert into dirty_documents
select generate_series(1, 100);
insert into dirty_documents
select generate_series(1, 50);
insert into dirty_documents
select generate_series(88000, 93245);
insert into dirty_documents
select generate_series(83000, 87245);
Get the first document.
select min(document_id)
from dirty_documents;
Took 32.5 ms. Delete those documents . . .
delete from dirty_documents
where document_id = 1;
Took 12 ms.
All of this took me 12 minutes. (I used a stopwatch.) If you want to know what performance will be, build tables and write tests.
Reading between the lines, I think you're trying to implement a work-queueing system.
Stop. Now.
Work queueing is hard. Work queuing in a relational DBMS is very hard. Most of the "clever" solutions people come up with end up serializing work on a lock without them realising it, or they have nasty bugs in concurrent operation.
Use an existing message/task queueing system. ZeroMQ, RabbitMQ, PGQ, etc etc etc etc. There are lots to choose from and they have the significant advantages of (a) working and (b) being efficient. You'll most likely need to run an external helper process or server, but the limitations of the relational database model tend to make that necessary.
The scheme you seem to be envisioning, as best as I can guess, sounds like it'll suffer from hopeless concurrency problems when it comes to failure handling, insert/delete races, etc. Really, do not try to design this yourself, especially when you don't have a really good grasp of the underlying concurrency and performance issues.
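That said, if the queue really has to live in PostgreSQL, the building block that sidesteps the lock-serialization trap described above is FOR UPDATE SKIP LOCKED, available since PostgreSQL 9.5. A minimal sketch against the dirty_documents table from the other answer:
BEGIN;

SELECT document_id
FROM dirty_documents
ORDER BY document_id
LIMIT 1
FOR UPDATE SKIP LOCKED;   -- concurrent workers skip this row instead of blocking

-- ... process the document, then delete the claimed row ...

COMMIT;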