Prevent dupe records in SQL Server - tsql

So I have this stored procedure to insert a message into my database. I wanted to prevent users from posting duplicate messages within a short period of time, whether on accident or on purpose (either a laggy connection or a spammer).
This is what the insert statement looks like:
IF NOT EXISTS (SELECT * FROM tblMessages WHERE message = #message and ip = #ip and datediff(minute,timestamp, getdate()) < 10)
BEGIN
INSERT INTO tblMessages (ip, message, votes, latitude, longitude, location, timestamp, flags, deleted, username, parentId)
VALUES (#ip, #message, 0, #latitude, #longitude, #location, GETDATE(), 0, 0, #username, #parentId)
END
You can see I check to see if the same user has posted the same message within 10 minutes, and if not, I post it. I still saw one dupe come through yesterday. When I checked the timestamp of both messages in the database, they were exactly the same, down to the second, so I'm guessing when this 'exists' check ran on each insert, both came back empty, so they both inserted fine (at basically the same exact time).
What's a way I can prevent this from happening correctly?

I reckon you need a trigger
A unique constraint/index isn't clever enough to deal with the 10 minute gap between posts for a given message and ip.
CREATE TRIGGER TRG_tblMessages_I FRO INSERT
AS
SET NOCOUNT ON;
IF EXISTS (SELECT *
FROM tblMessages M
JOIN INSERTED I ON M.message = I.message and M.ip = I.ip
WHERE
datediff(minute, M.timestamp, I.timestamp) < 10)
BEGIN
RAISERRROR ('blah', 16, 1)
ROLLBACK TRAN
END
Edit: you need an extra condition to ignore the same row you have just inserted (eg using surrogate key)

Actually Derek Kromm isn't far off; Essentially you do want a unique constraint, you just want range for one of the columns.
You can express this as a filtered index which enforces the uniqueness on the columns you want but with a filter to match timestamps within a 10 minutes range.
CREATE NONCLUSTERED INDEX IX_UNC_tblMessages
ON tblMessages (message, ip, timestamp)
WHERE datediff(minute, timestamp, getdate()) < 10)
On the difference between a unique constraint and a filtered index which maintains uniqueness (MSDN):
There are no significant differences between creating a UNIQUE
constraint and creating a unique index independent of a constraint.
Data validation occurs in the same manner and the query optimizer does
not differentiate between a unique index created by a constraint or
manually created. However, you should create a UNIQUE or PRIMARY KEY
constraint on the column when data integrity is the objective. By
doing this the objective of the index will be clear.
The only aspect of this I'm not sure about is the use of getdate(). I'm not sure what effect that will have on the index and performance- this you will want to test for yourself.

Add a unique constraint to the table to absolutely prevent it from happening
ALTER TABLE tblMessages ADD CONSTRAINT uq_tblMessages UNIQUE (message,ip,timestamp)

I think, the easiest way is to use a triggers to check the sender and body of message in existing records in the table.
or, as Derek said, you can use the constraint, but with another condition:
ALTER TABLE tblMessages ADD CONSTRAINT uq_tblMessages UNIQUE (message,ip,username, parentId)
but constraint will generate exception (and you will need to handle it).

Related

Unique partial constraint in Postgres?

I have a table and want to update some of the rows like this:
CREATE TABLE abc(i INTEGER, deleted BOOLEAN);
CREATE UNIQUE INDEX myidx ON abc(i) WHERE NOT deleted;
INSERT INTO abc VALUES (4), (5);
UPDATE abc SET i = i - 1;
Which works ok because of the order in which the UPDATE is processing the rows, but when the UPDATE is attempted like this, it fails:
UPDATE abc SET i = i + 1;
ERROR: 23505: duplicate key value violates unique constraint "myidx"
DETAIL: Key (i)=(4) already exists.
SCHEMA NAME: public
TABLE NAME: abc
CONSTRAINT NAME: myidx
LOCATION: _bt_check_unique, nbtinsert.c:534
Time: 0.472 ms
The reason of the error is, in the middle of the update 2 rows would have had the value i = 4, even though at the end of the update all rows would have had unique values.
So I thought of changing the index into a deferred constraint, but according to the docs, this is not possible as my index is partial (so it only enforces uniqueness on some rows):
A uniqueness restriction covering only some rows cannot be written as a unique constraint, but it is possible to enforce such a restriction by creating a unique partial index.
The docs say to use partial indexes, but those can't be deferred, so I go back to the original problem.
So far my solution would be to set i = NULL whenever I mark deleted = true so it's not considered duplicated by my constraint anymore.
Is there a better solution to this? Maybe a way to make the UPDATE go always in the direction I want?
Please note:
I cannot DELETE the row, that's why the deleted column is there. The actual delete is done after some human validation happens.
Update:
The reason I'm bulk-updating that unique column is because this table contains a sequence that is used in the UI for sorting the records (the users drag and drop the records as they wish). And they can also delete them (so I shift the sequences of elements occurring after the one that was deleted).
The actual columns look more like this (name TEXT, description TEXT, ..., sequence NUMBER).
That sequence row is what in the simplified case I called i. So say I have 3 records with (name, sequence):
("Laptop", 1)
("Mobile", 2)
("Desktop", 3)
And I the user deletes the middle one, I want to end up with:
("Laptop", 1)
("Desktop", 2) // <--- updated here

How can a relational database with foreign key constraints ingest data that may be in the wrong order?

The database is ingesting data from a stream, and all the rows needed to satisfy a foreign key constraint may be late or never arrive.
This can likely be accomplished by using another datastore, one without foreign key constraints, and then when all the needed data is available, read into the database which has fk constraints. However, this adds complexity and I'd like to avoid it.
We're working on a solution that creates "placeholder" rows to point the foreign key to. When the real data comes in, the placeholder is replaced with real values. Again, this adds complexity, but it's the best solution we've found so far.
How do people typically solve this problem?
Edit: Some sample data which might help explain the problem:
Let's say we have these tables:
CREATE TABLE order (
id INTEGER NOT NULL,
order_number,
PRIMARY KEY (id),
UNIQUE (order_number)
);
CREATE TABLE line_item (
id INTEGER NOT NULL,
order_number INTEGER REFERENCES order(order_number),
PRIMARY KEY (id)
);
If I insert an order first, not a problem! But let's say I try:
INSERT INTO line_item (order_number) values (123) before order 123 was inserted. This will fail the fk constraint of course. But this might be the order I get the data, since it's reading from a stream that is collecting this data from multiple sources.
Also, to address #philpxy's question, I didn't really find much on this. One thing that was mentioned was deferred constraints. This is a mechanism that waits to do the fk constraints at the end of a transaction. I don't think it's possible to do that in my case however, since these insert statements will be run at random times whenever the data is received.
You have a business workflow problem, because line items of individual orders are coming in before the orders themselves have come in. One workaround, perhaps not ideal, would be to create a before insert trigger which checks, for every incoming insert to the line_item table, whether that order already exists in the order table. If not, then it will first insert the order record before trying the insert on line_item.
CREATE OR REPLACE FUNCTION "public"."fn_insert_order" () RETURNS trigger AS $$
BEGIN
INSERT INTO "order" (order_number)
SELECT NEW.order_number
WHERE NOT EXISTS (SELECT 1 FROM "order" WHERE order_number = NEW.order_number);
RETURN NEW;
END
$$
LANGUAGE 'plpgsql'
# trigger
CREATE TRIGGER "trigger_insert_order"
BEFORE INSERT ON line_item FOR EACH ROW
EXECUTE PROCEDURE fn_insert_order()
Note: I am assuming that the id column of the order table in fact is auto increment, in which case Postgres would automatically assign a value to it when inserting as above. Most likely, this is what you want, as having two id columns which both need to be manually assigned does not make much sense.
You could accomplish that with a BEFORE INSERT trigger on line_item.
In that trigger you query order if a matching item exists, and if not, you insert a dummy row.
That will allow the INSERT to succeed, at the cost of some performance.
To insert rows into order, use
INSERT INTO order ...
ON CONFLICT ON (order_number) DO UPDATE SET
id = EXCLUDED.id;
Updating a primary key is problematic and may lead to conflicts. One way you could get around that is if you use negative ids for artificially generated orders (assuming that the real ids are positive). If you have any references to that primary key, you'd have to define the constraint with ON UPDATE CASCADE.

serial in postgres is being increased even though I added on conflict do nothing

I'm using Postgres 9.5 and seeing some wired things here.
I've a cron job running ever 5 mins firing a sql statement that is adding a list of records if not existing.
INSERT INTO
sometable (customer, balance)
VALUES
(:customer, :balance)
ON CONFLICT (customer) DO NOTHING
sometable.customer is a primary key (text)
sometable structure is:
id: serial
customer: text
balance: bigint
Now it seems like everytime this job runs, the id field is silently incremented +1. So next time, I really add a field, it is thousands of numbers above my last value. I thought this query checks for conflicts and if so, do nothing but currently it seems like it tries to insert the record, increased the id and then stops.
Any suggestions?
The reason this feels weird to you is that you are thinking of the increment on the counter as part of the insert operation, and therefore the "DO NOTHING" ought to mean "don't increment anything". You're picturing this:
Check values to insert against constraint
If duplicate detected, abort
Increment sequence
Insert data
But in fact, the increment has to happen before the insert is attempted. A SERIAL column in Postgres is implemented as a DEFAULT which executes the nextval() function on a bound SEQUENCE. Before the DBMS can do anything with the data, it's got to have a complete set of columns, so the order of operations is like this:
Resolve default values, including incrementing the sequence
Check values to insert against constraint
If duplicate detected, abort
Insert data
This can be seen intuitively if the duplicate key is in the autoincrement field itself:
CREATE TABLE foo ( id SERIAL NOT NULL PRIMARY KEY, bar text );
-- Insert row 1
INSERT INTO foo ( bar ) VALUES ( 'test' );
-- Reset the sequence
SELECT setval(pg_get_serial_sequence('foo', 'id'), 0, true);
-- Attempt to insert row 1 again
INSERT INTO foo ( bar ) VALUES ( 'test 2' )
ON CONFLICT (id) DO NOTHING;
Clearly, this can't know if there's a conflict without incrementing the sequence, so the "do nothing" has to come after that increment.
As already said by #a_horse_with_no_name and #Serge Ballesta serials are always incremented even if INSERT fails.
You can try to "rollback" serial value to maximum id used by changing the corresponding sequence:
SELECT setval('sometable_id_seq', MAX(id), true) FROM sometable;
As said by #a_horse_with_no_name, that is by design. Serial type fields are implemented under the hood through sequences, and for evident reasons, once you have gotten a new value from a sequence, you cannot rollback the last value. Imagine the following scenario:
sequence is at n
A requires a new value : got n+1
in a concurrent transaction B requires a new value: got n+2
for any reason A rollbacks its transaction - would you feel safe to reset sequence?
That is the reason why sequences (and serial field) just document that in case of rollbacked transactions holes can occur in the returned values. Only unicity is guaranteed.
Well there is technique that allows you to do stuff like that. They call insert mutex. It is old old old, but it works.
https://www.percona.com/blog/2011/11/29/avoiding-auto-increment-holes-on-innodb-with-insert-ignore/
Generally idea is that you do INSERT SELECT and if your values are duplicating the SELECT does not return any results that of course prevents INSERT and the index is not incremented. Bit of mind boggling, but perfectly valid and performant.
This of course completely ignores ON DUPLICATE but one gets back control over the index.

PostgreSQL multi-column unique constraint causes errors

I am new to postgresql and have a question about multiple column unique constraint.
I got this error when tried to add rows to the table:
ERROR: duplicate key value violates unique constraint "i_rb_on"
DETAIL: Key (a_fk, b_fk)=(296, 16) already exists.
I used this code (short version):
INSERT INTO rb_on (a_fk, b_fk) SELECT a.pk, b.pk FROM A, B WHERE NOT EXISTS (SELECT * FROM rb_on WHERE a_fk=a.pk AND b_fk=b.pk);
i_rb_on is unique constraint / columns (a_fk, b_fk).
It seems that my WHERE NOT EXISTS doesn't provide a protection against the duplicate key error for this kind of unique key.
UPDATE:
INSERT INTO tabA (mark_done, log_time, tabB_fk, tabC_fk)
SELECT FALSE, '2003-09-02 04:05:06', tabB.pk, tabC.pk FROM tabB, tabC, tabD, tabE, tabF
WHERE (tabC.sf_id='SUMMER' AND tabC.sf_status IN(0,1)
AND tabE.inventory_status=0)
AND tabF.tabD_fk=tabD.pk
AND tabD.tabE_fk=tabE.pk
AND tabE.tabB_fk=tabB.pk
AND tabF.tabC_fk=tabC.pk
AND NOT EXISTS (SELECT *
FROM tabA
WHERE tabB_fk=tabB.pk AND tabC_fk=tabC.pk);
In tabA unique index:
CREATE UNIQUE INDEX i_tabA
ON tabA
USING btree
(tabB_fk , tabC_fk );
Only one row (of many) must be inserted into the tabA.
Your WHERE NOT EXISTS never provides proper protection against a unique violation. It only seems to most of the time. The WHERE NOT EXISTS can run concurrently with another insert, so the row is still inserted multiple times and all but one of the inserts causes a unique violation.
For that reason it's often better to just run the insert and let the violation happen if the row already exists.
I can't help you with the exact problem described unless you show the data (as SQL CREATE TABLE and INSERTs) and the real query.
BTW, please don't use old style A, B joins. Use A INNER JOIN B ON (...). It makes it easier to tell which join conditions are supposed to apply to which parts of the query, and harder to forget a join condition. You seem to have done that; you're attempting to insert a cartesian product. I suspect it's just an editing mistake in the query.
I added LIMIT 1 to the end: ...WHERE tabB_fk=tabB.pk AND tabC_fk=tabC.pk) LIMIT1 ;
and it did the trick.
I created a function with LIMIT 1 and ...EXCEPTION WHEN unique_violation THEN ... and it also worked.
But when LIMIT 1 and "NOT EXISTS" are used, I think, it is not necessary to use unique_violation error handling.

Postgresql table with one ID column, sorted index, with duplicate primary key

I want to use a PostgreSQL table as a kind of work queue for documents. Each document has an ID and is stored in another, normal table with lots of additional columns. But this question is about creating the table for the work queue.
I want to create a table for this queue without OIDs with just one column: The ID of the document as integer. If an ID of a document exists in this work queue table, it means that the document with that ID is dirty and some processing has to be done.
The extra table shall avoid the VACUUM and dead tuple problems and deadlocks with transactions that would emerge if there was just a dirty bit on each document entry in the main document table.
Many parts of my system would mark documents as dirty and therefore insert IDs to process into that table. These inserts would be for many IDs in one transaction. I don't want to use any kind of nested transactions and there doesn't seem to be any kind of INSERT IF NOT EXISTS command. I'd rather have duplicate IDs in the table. Therefore duplicates must be possible for the only column in that table.
The process which processes the work queue will delete all processes IDs and therefore take care of duplicates. (BTW: There is another queue for the next step, so regarding race conditions the idea should be clean and have no problem)
But also I want the documents to be processed in order: Always shall documents with smaller IDs be processed first.
Therefore I want to have an index which aids LIMIT and ORDER BY on the ID column, the only column in the workqueue table.
Ideally given that I have only one column, this should be the primary key. But the primary key must not have duplicates, so it seems I can't do that.
Without the index, ORDER BY and LIMIT would be slow.
I could add a normal, secondary index on that column. But I fear PostgreSQL would add a second file on disc (PostgreSQL does that for every additional index) and use the double amount of disc operations for that table.
What is the best thing to do?
Add a dummy column with something random (like the OID) in order to make the primary key not complain about duplicates? Must I waste that space in my queue table?
Or is adding the second index harmless, would it become kind of the primary index which is directly in the primary tuple btree?
Shall I delete everything above this and just leave the following? The original question is distracting and contains too much unrelated information.
I want to have a table in PostgreSQL with these properties:
One column with an integer
Allow duplicates
Efficient ORDER BY+LIMIT on the column
INSERTs should not do any query in that table or any kind of unique index. INSERTs shall just locate the best page for the main file/main btree for this table and just insert the row in between to other rows, ordered by ID.
INSERTs will happen in bulk and must not fail, expect for disc full, etc.
There shall not be additional btree files for this table, so no secondary indexes
The rows should occupy not much space, e.g. have no OIDs
I cannot think of a solution that solves all of this.
My only solution would compromise on the last bullet point: Add a PRIMARY KEY covering the integer and also a dummy column, like OIDs, a timestamp or a SERIAL.
Another solution would either use a hypothetical INSERT IF NOT EXISTS, or nested transaction or a special INSERT with a WHERE. All these solutions would add a query of the btree when inserting.
Also they might cause deadlocks.
(Also posted here: https://dba.stackexchange.com/q/45126/7788)
You said
Many parts of my system would mark documents as dirty and therefore
insert IDs to process into that table. Therefore duplicates must be
possible.
and
5 rows with the same ID mean the same thing as 1 or 10 rows with that
same ID: They mean that the document with that ID is dirty.
You don't need duplicates for that. If the only purpose of this table is to identify dirty documents, a single row containing the document's id number is sufficient. There's no compelling reason to allow duplicates.
A single row for each ID number is not sufficient if you need to track which process inserted that row, or order rows by the time they were inserted, but a single column isn't sufficient for that in the first place. So I'm sure a primary key constraint or unique constraint would work fine for you.
Other processes have to ignore duplicate key errors, but that's simple. Those processes have to trap errors anyway--there are a lot of things besides a duplicate key that can prevent an insert statement from succeeding.
An implementation that allows duplicates . . .
create table dirty_documents (
document_id integer not null
);
create index on dirty_documents (document_id);
Insert 100k ID numbers into that table for testing. This will necessarily require updating the index. (Duh.) Include a bunch of duplicates.
insert into dirty_documents
select generate_series(1,100000);
insert into dirty_documents
select generate_series(1, 100);
insert into dirty_documents
select generate_series(1, 50);
insert into dirty_documents
select generate_series(88000, 93245);
insert into dirty_documents
select generate_series(83000, 87245);
Took less than a second on my desktop, which isn't anything special, and which is running three different database servers, two web servers, and playing a Rammstein CD.
Pick the first dirty document ID number for cleaning up.
select min(document_id)
from dirty_documents;
document_id
--
1
Took only 0.136 ms. Now lets delete every row that has document ID 1.
delete from dirty_documents
where document_id = 1;
Took 0.272 ms.
Let's start over.
drop table dirty_documents;
create table dirty_documents (
document_id integer primary key
);
insert into dirty_documents
select generate_series(1,100000);
Took 500 ms. Let's find the first one again.
select min(document_id)
from dirty_documents;
Took .054 ms. That's about half the time it took using a table that allowed duplicates.
delete from dirty_documents
where document_id = 1;
Also took .054 ms. That's roughly 50 times faster than the other table.
Let's start over again, and try an unindexed table.
drop table dirty_documents;
create table dirty_documents (
document_id integer not null
);
insert into dirty_documents
select generate_series(1,100000);
insert into dirty_documents
select generate_series(1, 100);
insert into dirty_documents
select generate_series(1, 50);
insert into dirty_documents
select generate_series(88000, 93245);
insert into dirty_documents
select generate_series(83000, 87245);
Get the first document.
select min(document_id)
from dirty_documents;
Took 32.5 ms. Delete those documents . . .
delete from dirty_documents
where document_id = 1;
Took 12 ms.
All of this took me 12 minutes. (I used a stopwatch.) If you want to know what performance will be, build tables and write tests.
Reading between the lines, I think you're trying to implement a work-queueing system.
Stop. Now.
Work queueing is hard. Work queuing in a relational DBMS is very hard. Most of the "clever" solutions people come up with end up serializing work on a lock without them realising it, or they have nasty bugs in concurrent operation.
Use an existing message/task queueing system. ZeroMQ, RabbitMQ, PGQ, etc etc etc etc. There are lots to choose from and they have the significant advantages of (a) working and (b) being efficient. You'll most likely need to run an external helper process or server, but the limitations of the relational database model tend to make that necessary.
The scheme you seem to be envisioning, as best as I can guess, sounds like it'll suffer from hopeless concurrency problems when it comes to failure handling, insert/delete races, etc. Really, do not try to design this yourself, especially when you don't have a really good grasp of the underlying concurrency and performance issues.