Ignore bad records and still update all the good ones in a stored procedure. Log the errors - tsql

I have to rework a bunch of complex stored procedures in SQL Server to make them ignore all the records that would cause errors at runtime and still insert/update the correct records. I should also track all the error messages in a separate log table. Currently each procedure is 'wrapped' within a transaction and there is a TRY..CATCH block, so in case of any error, the transaction is rolled back. I would like to know how I can change this behavior while keeping efficiency as high as possible.
I have sketched an example to make it easier to test.
--temporary table created for testing purposes
IF OBJECT_ID('tempdb..#temptable') IS NOT NULL
DROP TABLE #temptable
CREATE TABLE #temptable
(
[name] varchar(50),
[divisible] int,
[divider] int,
[result] float
)
GO
--insert some dummy records in #temptable
-- example of a record with good data
INSERT INTO #temptable ([name], [divisible], [divider]) VALUES ('A', 1, 1)
-- example of a record with bad data
INSERT INTO #temptable ([name], [divisible], [divider]) VALUES ('B', 2, 0)
-- another example of a record with good data
INSERT INTO #temptable ([name], [divisible], [divider]) VALUES ('C', 3, 1)
--A dummy example for unhandled error (I know how to handle it otherwise ;-) )
UPDATE #temptable
SET [result] = divisible/divider
SELECT * FROM #temptable
Currently nothing gets updated, because the divide-by-zero error from record B fails the whole statement.
I would like to have the good records (A and C) updated and to log the error message that record B will throw.
Also, please keep in mind that I have the freedom to introduce SSIS in the solution, but I don't want to rewrite all the procedures.
So what would you suggest - cursor, while loop, SSIS, or anything else?

You need a layered approach here:
Layer 1 (runs first) -
Do due diligence on the acceptable data ranges and the unacceptable outliers, e.g. negative numbers, precision limits, maximum positive/negative values. Then write code for a validation phase where you identify such records in your processing table and move them out: in a single "begin tran ... commit tran" section, insert them into the Log table and delete them from the Processing table.
Layer 2 (runs second) -
Proceed to perform your UPDATE statement, which now only touches the remaining good rows.
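As a minimal sketch against the #temptable example above (the #error_log table and the error text are illustrative, not part of the original question), layer 1 can log and remove the offending rows with a single DELETE ... OUTPUT, and layer 2 then runs the original update safely:
-- Layer 1: move the rows that would fail (divider = 0) into a log table
IF OBJECT_ID('tempdb..#error_log') IS NOT NULL
DROP TABLE #error_log
CREATE TABLE #error_log
(
[name] varchar(50),
[error_message] varchar(200),
[logged_at] datetime DEFAULT GETDATE()
)
BEGIN TRAN
DELETE FROM #temptable
OUTPUT deleted.[name], 'Divide by zero: divider = 0', GETDATE()
INTO #error_log ([name], [error_message], [logged_at])
WHERE [divider] = 0
COMMIT TRAN
-- Layer 2: the original update now only sees valid rows (A and C)
UPDATE #temptable
SET [result] = [divisible]/[divider]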
Although this suggestion is outside the scope of your question, I recommend using numeric or decimal instead of the float data type. You may run into precision issues with float.

Related

My PSQL after insert trigger fails to insert into another table when ON DUPLICATE encounters a duplicate

I am slowly working through a feature where I am importing large CSV files. When a file is uploaded, its contents may trigger a uniqueness conflict. I've combed Stack Overflow for similar resources but I still can't seem to get my trigger to update another table when a duplicate entry is found. The following code is what I have currently implemented for this process. Also, this is implemented in a Rails app, but the underlying SQL is the following.
When a user uploads a file, the following happens when it's processed.
CREATE TEMP TABLE codes_temp ON COMMIT DROP AS SELECT * FROM codes WITH NO DATA;
create or replace function log_duplicate_code()
returns trigger
language plpgsql
as
$$
begin
insert into duplicate_codes(id, campaign_id, code_batch_id, code, code_id, created_at, updated_at)
values (gen_random_uuid(), excluded.campaign_id, excluded.code_batch_id, excluded.code, excluded.code_id, now(), now());
return null;
end;
$$
create trigger log_duplicate_code
after insert on codes
for each row execute procedure log_duplicate_code();
INSERT INTO codes SELECT * FROM codes_temp ct
ON CONFLICT (campaign_id, code)
DO update set updated_at = excluded.updated_at;
DROP TRIGGER log_duplicate_code ON codes;
When I try to run this process nothing happens at all. If I were to have a csv file with the value CODE01 and then upload again with CODE01, the duplicate_codes table doesn't get populated at all and I don't understand why. There is no error that gets triggered or anything, so it seems like DO UPDATE ... is doing something. What am I missing here?
I also have some questions that come to my mind even if this were to work as intended. For example, I am uploading millions of these codes, etc.
1) Should my trigger be a statement-level trigger instead of a row-level trigger, for scalability?
2) What if someone else uploads another file that also has millions of codes? I have my code wrapped in a transaction. Would a new, separate trigger be created? Will this conflict with a process that is already running?
####### EDIT #1 #######
Thanks to Adriens' comment I do see that AFTER INSERT does not have the OLD keyword. I updated my code to use EXCLUDED and I receive the following error for the trigger.
ERROR: missing FROM-clause entry for table "excluded" (PG::UndefinedTable)
Finally, here are the S.O. posts I've used to try to tailor my code but I just can't seem to make it work.
####### EDIT #2 #######
I have a little more context on to how this is implemented.
When the CSV is loaded, a staging table called codes_temp is created and dropped at the end of the transaction. This table contains no unique constraints. From what I read, only the actual table that I want to insert codes into should raise the unique constraint error.
In my INSERT statement, the DO UPDATE SET updated_at = excluded.updated_at; doesn't trigger a unique constraint error. As of right now, I don't know if it should or not. I borrowed this logic from the S.O. question "postgresql log into another table with on conflict"; it seemed to me like I had to update something if I specify the DO UPDATE SET clause.
Last, the correct criteria for codes in the database is the following:
For example, this is an example entry in my codes table
id, campaign_id, code
1, 1, CODE01
2, 1, CODE02
3, 1, CODE03
If any of these codes appear again somewhere, they should not be inserted into the codes table but instead need to be inserted into the duplicate_codes table, because they were already uploaded before.
id, campaign_id, code
1, 1, CODE01
2, 1, CODE02
3, 1, CODE03
As for the codes_temp table I don't have any unique constraints, so there is no criteria to select the right one.
postgresql log into another table with on conflict
Postgres insert on conflict update using other table
Postgres on conflict - insert to another table
How to do INSERT INTO SELECT and ON DUPLICATE UPDATE in PostgreSQL 9.5?
Seems to me something like:
INSERT INTO
codes
SELECT
distinct on(campaign_id, code) *
FROM
codes_temp ct
ORDER BY
campaign_id, code, id DESC;
Assuming id was assigned sequentially, the above would select the most recent row into codes.
Then:
INSERT INTO
duplicate_codes
SELECT
*
FROM
codes_temp AS ct
LEFT JOIN
codes
ON
ct.id = codes.id
WHERE
codes.id IS NULL;
The above would select the rows in codes_temp that were not selected into codes and insert them into the duplicates table.
Obviously not tested on your data set. I would create a small test data set that has uniqueness conflicts and test with.
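For instance, a minimal test set-up could look like the following (column lists are trimmed for illustration and assume the unique constraint is on (campaign_id, code); adjust to your real schema):
-- one code already present from an earlier upload
INSERT INTO codes (id, campaign_id, code) VALUES (1, 1, 'CODE01');
-- newly staged rows: the first should end up in duplicate_codes, the second in codes
INSERT INTO codes_temp (id, campaign_id, code) VALUES
(2, 1, 'CODE01'),
(3, 1, 'CODE02');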

Lack of user-defined table types for passing data between stored procedures in PostgreSQL

So I know there are already similar questions on this, but most of them are very old, or they have non-answers, like "why would you even want to do this?", or "table types aren't performant and we don't want them here", or even "you need to rethink your whole approach".
So what I would ideally want to do is to declare a user-defined table type like this:
CREATE TYPE my_table AS TABLE (
a int,
b date);
Then use this in a procedure as a parameter, like this:
CREATE PROCEDURE my_procedure (
my_table_parameter my_table)
Then be able to do stuff like this:
INSERT INTO
my_temp_table
SELECT
m.a,
m.b,
o.useful_data
FROM
my_table m
INNER JOIN my_schema.my_other_table o ON o.a = m.a;
This is for a billing system, let's make it a mobile phone billing system (it isn't but it's similar enough to work). Here's several ways I might call my procedure:
I sometimes want to call my procedure for one row, to create an adhoc bill for one customer. I want to do this while they are on the phone and get them an immediate result. Maybe I just fixed something wrong with their bill, and they're angry!
I sometimes want to bill everyone who's due a bill on a specific date. Maybe this is their preferred billing date, and they're on a monthly billing cycle?
I sometimes want to bill in bulk, but my data comes from a CSV file. Maybe this is a custom billing run that I have no way of understanding the motivation for?
Maybe I want to final bill customers who recently left?
Sometimes I might need to rebill customers because a mistake was made. Maybe a tariff rate was uploaded incorrectly, and everyone on that tariff needs their bill regenerating?
I want to split my code up into modules, it's easier to work like this, and it allows a large degree of reusability. So what I don't want to do is to write n billing systems, where each one handles one of the use cases above. I want a generic billing engine that is a stored procedure, uses set-based queries where possible, and works just about as well for one customer as it does for 1,000,000. If anything I want it optimised for lots of customers, as long as it only takes a few seconds to run for one customer.
If I had SQL Server I would create a user-defined table type, and this would contain a list of customers, the date they need billing to, and maybe anything else that would be useful. But let's just leave it at the simplest case possible, an integer representing a customer and a date to say what date I want to bill them up to, like my example above.
I've spent some days now looking at the options available in PostgreSQL, and these are the conclusions I have reached. I would be extremely grateful for any help with this, correcting my incorrect assumptions, or telling me of another way I have overlooked.
Process-Keyed Table
Create a table that looks like this:
CREATE TABLE customer_list (
process_key int,
customer_id int,
bill_to_date date);
When I want to call my billing system I get a unique key (probably from a sequence), load up the rows with my list of customers/ dates to bill them to, and add the unique key to every row. Now I can simply pass the unique key to my billing engine, and it can scoop up the data at the other side.
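An illustrative calling pattern (the sequence name, the source of customers and the engine's entry point are hypothetical) could be:
CREATE SEQUENCE customer_list_key_seq;
DO
$$
DECLARE
k int := nextval('customer_list_key_seq');
BEGIN
-- load the customers to bill under a fresh key
INSERT INTO customer_list (process_key, customer_id, bill_to_date)
SELECT k, customer_id, current_date
FROM customers_due_today;
-- hand only the key to the billing engine
CALL billing_engine(k);
-- clean up afterwards
DELETE FROM customer_list WHERE process_key = k;
END;
$$;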
This seems the most workable way to proceed, but it's clunky, like something I would have done in SQL Server 20 years ago when there weren't better options; it's prone to leaving data lying around, and it doesn't seem like it would be optimal, as the data will have to be squirted to physical storage, and read back into memory.
Use a Temporary Table
So I'm thinking that I create a temporary table, call it customer_temp, and make it ON COMMIT DROP. When I call my stored procedure to bill customers it picks the data out of the temporary table, does what it needs to do, and then when it ends the table is vacuumed away.
But this doesn't work if I call the billing engine more than once at a time. So I need to give my temporary tables unique names, and also pass this name into my billing engine, which has to use some vile dynamic SQL to get the data into some usable area (probably another temporary table?).
Use a TYPE
When I first saw this I thought I had the answer, but it turns out to not work for multidimensional arrays (or I'm doing something wrong). I quickly learned that for a single dimensional array I could get this working by just pretending that a PostgreSQL TYPE was a user defined table type. But it obviously isn't.
So passing in an array of integers, e.g. customer_list int[]; works fine, and I can use the ARRAY command to populate that array from a query, and then it's easy to access it with =ANY(customer_list) at the other side. It's not ideal, and I bet it's awful for large data sets, but it's neat.
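For illustration, the single-dimensional version looks roughly like this (using the same illustrative customer and pending_bill tables as my prototype further down):
CREATE PROCEDURE bill_customers(customer_list int[])
LANGUAGE plpgsql
AS
$$
BEGIN
UPDATE customer
SET billed = true
WHERE customer_id = ANY(customer_list);  -- the array behaves like an IN list
END;
$$;
DO
$$
DECLARE
ids int[];
BEGIN
ids := ARRAY(SELECT customer_id FROM pending_bill);  -- populate the array from a query
CALL bill_customers(ids);
END;
$$;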
However, it doesn't work for multidimensional arrays, the ARRAY command can't cope with a query that has more than one column in it, and the other side becomes more awkward, needing an UNNEST command to unpack the data into a format where it's usable.
I defined my type like this:
CREATE TYPE customer_list AS (
customer_id int,
bill_to_date date);
...and then used it in my procedure parameter list as customer_list[], which seems to work, but I have no nice way to populate this structure from the calling procedure.
I feel I'm missing something here, as I never got it to work properly as a prototype, but I also feel this is a potential dead end anyway, as it won't cope with large numbers of rows in a performant way, and this isn't what arrays are meant for. The first thing I do with the array at the other side, is unpack it back into a table again, which seems counterintuitive.
Ref Cursors
I read that you can use REF CURSORs, but I'm not quite sure how this works. It seems that you open a cursor in one procedure, and then pass a handle to it to another procedure. It doesn't seem like this is going to be set-based, but I could be wrong, and I just haven't found a way to convert a cursor back into a table again?
Write Everything as One Massive Procedure
I'm not ruling this out, as PostgreSQL seems to be leading me this way. If I write one enormous billing engine that copes with every eventuality, then my only issue will be when this is called using an externally provided list. Every other issue can be solved by just not having to pass data between procedures.
I can probably cope with this by loading the data into a batch table, and feeding this in as one of the options. It's awful, and it's like going back to the 1990s, but if this is what PostgreSQL wants me to do, then so be it.
Final Thoughts
I'm sure I'm going to be asked for code examples, which I will happily provide, but I avoided because this post is already uber-long, and what I'm trying to achieve is actually quite simple I feel.
Having typed all of this out, I'm still feeling that there must be a way of working around the "temporary table names must be unique", as this would work nicely if I found a way to let it be called in a multithreaded way.
Okay, taking the bits I was missing I came up with this, which seems to work:
CREATE TYPE bill_list AS (
customer_id int,
bill_date date);
CREATE TABLE IF NOT EXISTS billing.pending_bill (
customer_id int,
bill_date date);
CREATE TABLE IF NOT EXISTS billing.customer (
customer_id int,
billed boolean,
last_billed date);
INSERT INTO billing.customer
VALUES
(1, false, NULL::date),
(2, false, NULL::date),
(3, false, NULL::date);
INSERT INTO billing.pending_bill
VALUES
(1, '20210108'::date),
(2, '20210105'::date),
(3, '20210104'::date);
CREATE OR REPLACE PROCEDURE billing.bill_customer_list (
pending bill_list[])
LANGUAGE PLPGSQL
AS
$$
BEGIN
UPDATE
billing.customer c
SET
billed = true,
last_billed = p.bill_date
FROM
UNNEST(pending) p
WHERE
p.customer_id = c.customer_id;
END;
$$;
CREATE OR REPLACE PROCEDURE billing.test ()
LANGUAGE PLPGSQL
AS
$$
DECLARE pending bill_list[];
BEGIN
pending := ARRAY(SELECT p FROM billing.pending_bill p);
CALL billing.bill_customer_list (pending);
END;
$$;
Your select in the procedure returns multiple columns. But you want to create an array of a custom type. So your SELECT list needs to return the type, not *.
You don't need the bill_list type either, as every table has a corresponding type and you can simply pass an array of the table's type.
So you can use the following:
CREATE PROCEDURE bill_customer_list (
pending pending_bill[])
LANGUAGE PLPGSQL
AS
$$
BEGIN
UPDATE
customer c
SET
billed = true
FROM unnest(pending) p --<< treat the array as a table
WHERE
p.customer_id = c.customer_id;
END;
$$
;
CREATE PROCEDURE test ()
LANGUAGE PLPGSQL
AS
$$
DECLARE
pending pending_bill[];
BEGIN
pending := ARRAY(SELECT p FROM pending_bill p);
CALL bill_customer_list (pending);
END;
$$
;
The select p returns a record (of the same type as the table) as a single column in the result.
The := is the assignment operator in PL/pgSQL and typically much faster than a SELECT .. INTO variable. Although in this case the performance difference wouldn't matter much I guess.
Online example
If you do want to keep the extra type bill_list around, because it e.g. contains fewer columns than pending_bill, you need to select only the columns that match the type's columns and create a record by enclosing them in parentheses: (a,b) is a single column with an anonymous record type (and two fields), whereas a, b are two distinct columns.
CREATE PROCEDURE test ()
LANGUAGE PLPGSQL
AS
$$
DECLARE
pending bill_list[];
BEGIN
pending := ARRAY(SELECT (customer_id, bill_date) FROM pending_bill p);
CALL bill_customer_list (pending);
END;
$$
;
You should also note that DECLARE starts a block in PL/pgSQL where multiple variables can be defined. There is no need to write one DECLARE for each variable (your formatting of the DECLARE block lets me think that you assumed you need one DECLARE per variable, as is the case in T-SQL).
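For example (purely illustrative), one DECLARE section holding two variables:
DO
$$
DECLARE
pending pending_bill[];   -- array of the table's row type
how_many integer := 0;    -- a second variable in the same DECLARE section
BEGIN
pending := ARRAY(SELECT p FROM pending_bill p);
how_many := coalesce(array_length(pending, 1), 0);
RAISE NOTICE '% pending bills staged', how_many;
END;
$$;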

How to check a sequence efficiently for used and unused values in PostgreSQL

In PostgreSQL (9.3) I have a table defined as:
CREATE TABLE charts
( recid serial NOT NULL,
groupid text NOT NULL,
chart_number integer NOT NULL,
"timestamp" timestamp without time zone NOT NULL DEFAULT now(),
modified timestamp without time zone NOT NULL DEFAULT now(),
donotsee boolean,
CONSTRAINT pk_charts PRIMARY KEY (recid),
CONSTRAINT chart_groupid UNIQUE (groupid),
CONSTRAINT charts_ichart_key UNIQUE (chart_number)
);
CREATE TRIGGER update_modified
BEFORE UPDATE ON charts
FOR EACH ROW EXECUTE PROCEDURE update_modified();
I would like to replace the chart_number with a sequence like:
CREATE SEQUENCE charts_chartnumber_seq START 16047;
So that by trigger or function, adding a new chart record automatically generates a new chart number in ascending order. However, no existing chart record can have its chart number changed and over the years there have been skips in the assigned chart numbers. Hence, before assigning a new chart number to a new chart record, I need to be sure that the "new" chart number has not yet been used and any chart record with a chart number is not assigned a different number.
How can this be done?
Consider not doing it. Read these related answers first:
Gap-less sequence where multiple transactions with multiple tables are involved
Compacting a sequence in PostgreSQL
If you still insist on filling in gaps, here is a rather efficient solution:
1. To avoid searching large parts of the table for the next missing chart_number, create a helper table with all current gaps once:
CREATE TABLE chart_gap AS
SELECT chart_number
FROM generate_series(1, (SELECT max(chart_number) - 1 -- max is no gap
FROM charts)) chart_number
LEFT JOIN charts c USING (chart_number)
WHERE c.chart_number IS NULL;
2. Set charts_chartnumber_seq to the current maximum and convert chart_number to an actual serial column:
SELECT setval('charts_chartnumber_seq', max(chart_number)) FROM charts;
ALTER TABLE charts
ALTER COLUMN chart_number SET NOT NULL
, ALTER COLUMN chart_number SET DEFAULT nextval('charts_chartnumber_seq');
ALTER SEQUENCE charts_chartnumber_seq OWNED BY charts.chart_number;
Details:
How to reset postgres' primary key sequence when it falls out of sync?
Safely and cleanly rename tables that use serial primary key columns in Postgres?
3. While chart_gap is not empty fetch the next chart_number from there.
To resolve possible race conditions with concurrent transactions, without making transactions wait, use advisory locks:
WITH sel AS (
SELECT chart_number, ... -- other input values
FROM chart_gap
WHERE pg_try_advisory_xact_lock(chart_number)
LIMIT 1
FOR UPDATE
)
, ins AS (
INSERT INTO charts (chart_number, ...) -- other target columns
TABLE sel
RETURNING chart_number
)
DELETE FROM chart_gap c
USING ins i
WHERE i.chart_number = c.chart_number;
Alternatively, Postgres 9.5 or later has the handy FOR UPDATE SKIP LOCKED to make this simpler and faster:
...
SELECT chart_number, ... -- other input values
FROM chart_gap
LIMIT 1
FOR UPDATE SKIP LOCKED
...
Detailed explanation:
Postgres UPDATE ... LIMIT 1
Check the result. Once all rows are filled in, this returns 0 rows affected. (you could check in plpgsql with IF NOT FOUND THEN ...). Then switch to a simple INSERT:
INSERT INTO charts (...) -- don't list chart_number
VALUES (...); -- don't provide chart_number
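Not part of the original steps, but as a rough sketch of that plpgsql switch (the function name and the reduced column list are illustrative, using the FOR UPDATE SKIP LOCKED variant from above):
CREATE OR REPLACE FUNCTION insert_chart(_groupid text) RETURNS integer AS
$$
DECLARE
_chart_number integer;
BEGIN
-- try to consume a gap first, skipping rows locked by concurrent transactions
SELECT chart_number INTO _chart_number
FROM chart_gap
LIMIT 1
FOR UPDATE SKIP LOCKED;
IF FOUND THEN
DELETE FROM chart_gap WHERE chart_number = _chart_number;
INSERT INTO charts (chart_number, groupid) VALUES (_chart_number, _groupid);
ELSE
-- no gaps left: let the serial default assign the next value
INSERT INTO charts (groupid) VALUES (_groupid)
RETURNING chart_number INTO _chart_number;
END IF;
RETURN _chart_number;
END;
$$ LANGUAGE plpgsql;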
In PostgreSQL, a SEQUENCE ensures the two requirements you mention, that is:
No repeats
No changes once assigned
But because of how a SEQUENCE works (see manual), it can not ensure no-skips. Among others, the first two reasons that come to mind are:
How a SEQUENCE handles concurrent blocks with INSERTS (you could also add that the concept of Cache also makes this impossible)
Also, user triggered DELETEs are an uncontrollable aspect that a SEQUENCE can not handle by itself.
In both cases, if you still do not want skips (and if you really know what you're doing), you should have a separate structure that assigns IDs (instead of using a SEQUENCE). Basically a system that keeps a list of 'assignable' IDs in a TABLE, with a function to pop out IDs in a FIFO way. That should allow you to control DELETEs etc.
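The shape of such a structure could be as simple as this sketch (names made up, and deliberately ignoring the concurrency handling that makes this hard in practice):
CREATE TABLE assignable_id (id integer PRIMARY KEY);
CREATE FUNCTION pop_assignable_id() RETURNS integer AS
$$
DELETE FROM assignable_id
WHERE id = (SELECT min(id) FROM assignable_id)  -- FIFO: hand out the lowest ID first
RETURNING id;
$$ LANGUAGE sql;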
But again, this should be attempted, only if you really know what you're doing! There's a reason why people don't do SEQUENCEs themselves. There are hard corner-cases (for e.g. concurrent INSERTs) and most probably you're over-engineering your problem case, that probably can be solved in a much better / cleaner way.
Sequence numbers usually have no meaning, so why worry? But if you really want this, then follow the below, cumbersome procedure. Note that it is not efficient; the only efficient option is to forget about the holes and use the sequence.
In order to avoid having to scan the charts table on every insert, you should scan the table once and store the unused chart_number values in a separate table:
CREATE TABLE charts_unused_chart_number AS
SELECT seq.unused
FROM (SELECT max(chart_number) FROM charts) mx,
generate_series(1, mx.max) seq(unused)
LEFT JOIN charts ON charts.chart_number = seq.unused
WHERE charts.recid IS NULL;
The above query generates a contiguous series of numbers from 1 to the current maximum chart_number value, then LEFT JOINs the charts table to it and finds the records where there is no corresponding charts data, meaning that value of the series is unused as a chart_number.
Next you create a trigger that fires on an INSERT on the charts table. In the trigger function, pick a value from the table created in the step above:
CREATE FUNCTION pick_unused_chart_number() RETURNS trigger AS $$
BEGIN
-- Get an unused chart number
SELECT unused INTO NEW.chart_number FROM charts_unused_chart_number LIMIT 1;
-- If the table is empty, get one from the sequence
IF NOT FOUND THEN
NEW.chart_number := nextval('charts_chartnumber_seq');
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER tr_charts_cn
BEFORE INSERT ON charts
FOR EACH ROW EXECUTE PROCEDURE pick_unused_chart_number();
Easy. But the INSERT may fail because of some other trigger aborting the procedure or any other reason. So you need a check to ascertain that the chart_number was indeed inserted:
CREATE FUNCTION verify_chart_number() RETURNS trigger AS $$
BEGIN
-- If you get here, the INSERT was successful, so delete the chart_number
-- from the temporary table.
DELETE FROM charts_unused_chart_number WHERE unused = NEW.chart_number;
RETURN NULL; -- the return value of an AFTER trigger is ignored
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER tr_charts_verify
AFTER INSERT ON charts
FOR EACH ROW EXECUTE PROCEDURE verify_chart_number();
At a certain point the table with unused chart numbers will be empty, whereupon you can (1) ALTER TABLE charts to use the sequence instead of an integer for chart_number; (2) drop the two triggers; and (3) drop the table with unused chart numbers; all in a single transaction.
While what you want is possible, it can't be done using only a SEQUENCE and it requires an exclusive lock on the table, or a retry loop, to work.
You'll need to:
LOCK thetable IN EXCLUSIVE MODE
Find the first free ID by querying for the max id, then doing a left join over generate_series to find the first free entry, if there is one.
If there is a free entry, insert it.
If there is no free entry, call nextval and return the result.
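A rough, untested sketch of those steps against the charts table (the group value is a placeholder):
BEGIN;
LOCK TABLE charts IN EXCLUSIVE MODE;
INSERT INTO charts (chart_number, groupid)
SELECT coalesce(
(SELECT min(s.n)  -- first free number, if any
FROM generate_series(1, (SELECT max(chart_number) FROM charts)) s(n)
LEFT JOIN charts c ON c.chart_number = s.n
WHERE c.chart_number IS NULL),
nextval('charts_chartnumber_seq')  -- otherwise take the next sequence value
),
'some group';
COMMIT;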
Performance will be absolutely horrible, and transactions will be serialized. There'll be no concurrency. Also, unless the LOCK is the first thing you run that affects that table, you'll face deadlocks that cause transaction aborts.
You can make this less bad by using an AFTER DELETE .. FOR EACH ROW trigger that keeps track of entries you delete by INSERTing them into a one-column table that keeps track of spare IDs. You can then SELECT the lowest ID from that table in the ID assignment function used as the default for the column, avoiding the need for the explicit table lock, the left join on generate_series and the max call. Transactions will still be serialized on a lock on the free IDs table. In PostgreSQL you can even solve that using SELECT ... FOR UPDATE SKIP LOCKED. So if you're on 9.5 you can actually make this non-awful, though it'll still be slow.
I strongly advise you to just use a SEQUENCE directly, and not bother with re-using values.

how to emulate "insert ignore" and "on duplicate key update" (sql merge) with postgresql?

Some SQL servers have a feature where INSERT is skipped if it would violate a primary/unique key constraint. For instance, MySQL has INSERT IGNORE.
What's the best way to emulate INSERT IGNORE and ON DUPLICATE KEY UPDATE with PostgreSQL?
With PostgreSQL 9.5, this is now native functionality (like MySQL has had for several years):
INSERT ... ON CONFLICT DO NOTHING/UPDATE ("UPSERT")
9.5 brings support for "UPSERT" operations.
INSERT is extended to accept an ON CONFLICT DO UPDATE / DO NOTHING clause. This clause specifies an alternative action to take in the event of a would-be duplicate violation.
...
Further example of new syntax:
INSERT INTO user_logins (username, logins)
VALUES ('Naomi',1),('James',1)
ON CONFLICT (username)
DO UPDATE SET logins = user_logins.logins + EXCLUDED.logins;
Edit: in case you missed warren's answer, PG9.5 now has this natively; time to upgrade!
Building on Bill Karwin's answer, to spell out what a rule based approach would look like (transferring from another schema in the same DB, and with a multi-column primary key):
CREATE RULE "my_table_on_duplicate_ignore" AS ON INSERT TO "my_table"
WHERE EXISTS(SELECT 1 FROM my_table
WHERE (pk_col_1, pk_col_2)=(NEW.pk_col_1, NEW.pk_col_2))
DO INSTEAD NOTHING;
INSERT INTO my_table SELECT * FROM another_schema.my_table WHERE some_cond;
DROP RULE "my_table_on_duplicate_ignore" ON "my_table";
Note: The rule applies to all INSERT operations until the rule is dropped, so not quite ad hoc.
For those of you that have Postgres 9.5 or higher, the new ON CONFLICT DO NOTHING syntax should work:
INSERT INTO target_table (field_one, field_two, field_three )
SELECT field_one, field_two, field_three
FROM source_table
ON CONFLICT (field_one) DO NOTHING;
For those of us who have an earlier version, this left join will work instead:
INSERT INTO target_table (field_one, field_two, field_three )
SELECT source_table.field_one, source_table.field_two, source_table.field_three
FROM source_table
LEFT JOIN target_table ON source_table.field_one = target_table.field_one
WHERE target_table.field_one IS NULL;
Try to do an UPDATE. If it doesn't modify any row that means it didn't exist, so do an insert. Obviously, you do this inside a transaction.
You can of course wrap this in a function if you don't want to put the extra code on the client side. You also need a retry loop to handle the very rare race condition between the UPDATE and the INSERT.
There's an example of this in the documentation: http://www.postgresql.org/docs/9.3/static/plpgsql-control-structures.html, example 40-2 right at the bottom.
That's usually the easiest way. You can do some magic with rules, but it's likely going to be a lot messier. I'd recommend the wrap-in-function approach over that any day.
This works for single-row, or few-row, values. If you're dealing with large numbers of rows, for example from a subquery, you're best off splitting it into two queries, one for the INSERT and one for the UPDATE (with an appropriate join/subselect of course - no need to write your main filter twice).
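A sketch, modelled closely on that documentation example (the db table and its columns come from the documentation, not from your schema):
CREATE TABLE db (a int PRIMARY KEY, b text);
CREATE FUNCTION merge_db(key int, data text) RETURNS void AS
$$
BEGIN
LOOP
-- first try to update the key
UPDATE db SET b = data WHERE a = key;
IF FOUND THEN
RETURN;
END IF;
-- not there, so try to insert; a concurrent insert can still cause a unique violation
BEGIN
INSERT INTO db (a, b) VALUES (key, data);
RETURN;
EXCEPTION WHEN unique_violation THEN
NULL;  -- do nothing, loop back and try the UPDATE again
END;
END LOOP;
END;
$$ LANGUAGE plpgsql;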
To get the insert ignore logic you can do something like below. I found simply inserting from a select statement of literal values worked best, then you can mask out the duplicate keys with a NOT EXISTS clause. To get the update on duplicate logic I suspect a pl/pgsql loop would be necessary.
INSERT INTO manager.vin_manufacturer
(SELECT * FROM( VALUES
('935',' Citroën Brazil','Citroën'),
('ABC', 'Toyota', 'Toyota'),
('ZOM',' OM','OM')
) as tmp (vin_manufacturer_id, manufacturer_desc, make_desc)
WHERE NOT EXISTS (
--ignore anything that has already been inserted
SELECT 1 FROM manager.vin_manufacturer m where m.vin_manufacturer_id = tmp.vin_manufacturer_id)
)
INSERT INTO mytable(col1,col2)
SELECT 'val1','val2'
WHERE NOT EXISTS (SELECT 1 FROM mytable WHERE col1='val1')
As #hanmari mentioned in his comment, when inserting into a postgres table, ON CONFLICT (..) DO NOTHING is the best code to use for not inserting duplicate data:
query = "INSERT INTO db_table_name(column_name)
VALUES(%s) ON CONFLICT (column_name) DO NOTHING;"
The ON CONFLICT line of code will allow the insert statement to still insert rows of data. The query and values code is an example of inserting data from an Excel file into a postgres db table.
I have constraints added to a postgres table I use to make sure the ID field is unique. Instead of running a delete on duplicate rows of data, I add a line of SQL code that renumbers the ID column, starting at 1.
Example:
q = 'ALTER SEQUENCE id_column_seq RESTART WITH 1'
If my data has an ID field, I do not use it as the primary/serial ID; instead, I create an ID column and set it to serial.
I hope this information is helpful to everyone.
*I have no college degree in software development/coding. Everything I know in coding, I study on my own.
Looks like PostgreSQL supports a schema object called a rule.
http://www.postgresql.org/docs/current/static/rules-update.html
You could create a rule ON INSERT for a given table, making it do NOTHING if a row exists with the given primary key value, or else making it do an UPDATE instead of the INSERT if a row exists with the given primary key value.
I haven't tried this myself, so I can't speak from experience or offer an example.
This solution avoids using rules:
BEGIN
INSERT INTO tableA (unique_column,c2,c3) VALUES (1,2,3);
EXCEPTION
WHEN unique_violation THEN
UPDATE tableA SET c2 = 2, c3 = 3 WHERE unique_column = 1;
END;
but it has a performance drawback (see PostgreSQL.org):
A block containing an EXCEPTION clause is significantly more expensive
to enter and exit than a block without one. Therefore, don't use
EXCEPTION without need.
On bulk, you can always delete the row before the insert. A deletion of a row that doesn't exist doesn't cause an error, so it's safely skipped.
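A minimal sketch of that pattern, reusing the source_table/target_table names from the earlier answer:
BEGIN;
-- remove any rows that are about to be re-inserted
DELETE FROM target_table t
USING source_table s
WHERE t.field_one = s.field_one;
INSERT INTO target_table (field_one, field_two, field_three)
SELECT field_one, field_two, field_three
FROM source_table;
COMMIT;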
For data import scripts, to replace "IF NOT EXISTS", in a way, there's a slightly awkward formulation that nevertheless works:
DO
$do$
BEGIN
PERFORM id
FROM whatever_table;
IF NOT FOUND THEN
-- INSERT stuff
END IF;
END
$do$;

Sequence Generators in T-SQL

We have an Oracle application that uses a standard pattern to populate surrogate keys. We have a series of extrinsic rows (that have specific values for the surrogate keys) and other rows that have intrinsic values.
We use the following Oracle trigger snippet to determine what to do with the Surrogate key on insert:
IF :NEW.SurrogateKey IS NULL THEN
SELECT SurrogateKey_SEQ.NEXTVAL INTO :NEW.SurrogateKey FROM DUAL;
END IF;
If the supplied surrogate key is null then get a value from the nominated sequence, else pass the supplied surrogate key through to the row.
I can't seem to find an easy way to do this in T-SQL. There are all sorts of approaches, but none of them use the notion of a sequence generator like Oracle and other SQL-92 compliant DBs do.
Anybody know of a really efficient way to do this in SQL Server T-SQL? By the way, we're using SQL Server 2008 if that's any help.
You may want to look at IDENTITY. This gives you a column for which the value will be determined when you insert the row.
This may mean that you have to insert the row, and determine the value afterwards, using SCOPE_IDENTITY().
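A minimal illustration (the table is made up):
CREATE TABLE dbo.Widget
(
SurrogateKey int IDENTITY(1,1) PRIMARY KEY,
Name varchar(50) NOT NULL
);
INSERT INTO dbo.Widget (Name) VALUES ('example');
-- the identity value assigned by the insert above
SELECT SCOPE_IDENTITY() AS NewSurrogateKey;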
There is also an article on simulating Oracle Sequences in SQL Server here: http://www.sqlmag.com/Articles/ArticleID/46900/46900.html?Ad=1
Identity is one approach, although it will generate unique identifiers at a per table level.
Another approach is to use unique identifiers, in particular using NEWSEQUENTIALID(), which ensures the generated id is always bigger than the last. The problem with this approach is that you are no longer dealing with integers.
The closest way to emulate the oracle method is to have a separate table with a counter field, and then write a user defined function that queries this field, increments it, and returns the value.
Here is a way to do it using a table to store your last sequence number. The stored proc is very simple, most of the stuff in there is because I'm lazy and don't like surprises should I forget something so...here it is:
----- Create the sequence value table.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[SequenceTbl]
(
[CurrentValue] [bigint]
) ON [PRIMARY]
GO
-----------------Create the stored procedure
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE procedure [dbo].[sp_NextInSequence](@SkipCount BigInt = 1)
AS
BEGIN
BEGIN TRANSACTION
DECLARE @NextInSequence BigInt;
IF NOT EXISTS
(
SELECT
CurrentValue
FROM
SequenceTbl
)
INSERT INTO SequenceTbl (CurrentValue) VALUES (0);
SELECT TOP 1
@NextInSequence = ISNULL(CurrentValue, 0) + 1
FROM
SequenceTbl WITH (HoldLock);
UPDATE SequenceTbl WITH (UPDLOCK)
SET CurrentValue = @NextInSequence + (@SkipCount - 1);
COMMIT TRANSACTION
RETURN @NextInSequence
END;
GO
--------Use the stored procedure in Sql Manager to retrieve a test value.
declare @NextInSequence BigInt
exec @NextInSequence = sp_NextInSequence;
--exec @NextInSequence = sp_NextInSequence <skipcount>;
select NextInSequence = @NextInSequence;
-----Show the current table value.
select * from SequenceTbl;
The astute will notice that there is a parameter (optional) for the stored proc. This is to allow the caller to reserve a block of ID's in the instance that the caller has more than one record that needs a unique id - using the SkipCount, the caller need make only a single call for however many IDs are needed.
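For example, a caller that needs five consecutive IDs can reserve them in one call; the procedure returns the first value of the reserved block:
declare @FirstOfBlock BigInt
exec @FirstOfBlock = sp_NextInSequence 5
select FirstOfBlock = @FirstOfBlock, LastOfBlock = @FirstOfBlock + 4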
The entire "IF EXISTS...INSERT INTO..." block can be removed if you remember to insert a record when the table is created. If you also remember to insert that record with a value (your seed value - a number which will never be used as an ID), you can also remove the ISNULL(...) portion of the select and just use CurrentValue + 1.
Now, before anyone makes a comment, please note that I am a software engineer, not a dba! So, any constructive criticism concerning the use of "Top 1", "With (HoldLock)" and "With (UPDLock)" is welcome. I don't know how well this will scale but this works OK for me so far...