SQLAlchemy fast bulk upsert from another table - postgresql

I would like to insert records from table_b into table_a when they don't already exist in table_a. I already have PostgreSQL code to do this, but now my team has requested that I use an ORM (SQLAlchemy) instead.
INSERT INTO table_a
SELECT
    composite_pk1,
    composite_pk2,
    col_c,
    col_d
FROM table_b
ON CONFLICT (
    composite_pk1,
    composite_pk2
) DO NOTHING
I have nearly a million rows and about 15 columns (not shown in the example). I need this query to be fast, which is why I don't think the solution posted here will work for my use case.
For performance reasons I also want to avoid treating my Python function as a data conduit. I don't want to transfer many rows of table_b over the network to my function just to push them back over the network again to table_a. That is, I would prefer the insert to happen entirely on Postgres, which I already accomplish with my original SQL query.

Probably the fastest way to perform an upsert with the SQLAlchemy ORM is the bulk_update_mappings function, which lets you upsert based on a plain list of dicts.
But the situation you are describing isn't really an upsert: you want to insert rows and, if there is a conflict, do nothing. No update is being done, so it is a simple insert.
Performing an insert that skips any conflicts is simple in SQLAlchemy (assuming you already have your table defined as a model):
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()
engine = create_engine('your_db_connection_string', echo=True)
Session = sessionmaker(bind=engine)
session = Session()

# example column names: pull the rows of table_b into Python ...
data = [{'col1': result.col1, 'col2': result.col2}
        for result in session.query(table_b).all()]

# ... and insert them into table_a, skipping any conflicting rows
insert_query = insert(table_a).values(data).on_conflict_do_nothing()
session.execute(insert_query)
session.commit()
session.close()

Related

Is it safe to use temporary tables when an application may try to create them for independent, but simultaneous processes?

I am hoping that I can articulate this effectively, so here it goes:
I am creating a model which will be run on a platform by users, possibly simultaneously, but each model run is marked by a unique integer identifier. This model will execute a series of PostgreSQL queries and eventually write a result elsewhere.
Now because of the required parallelization of model runs, I have to make sure that the processes will not collide, despite running in the same database. I am at a point now where I have to store a list of records, sorted by a score variable and then operate on them. This is the beginning of the query:
DO
$$
DECLARE row RECORD;
BEGIN
    DROP TABLE IF EXISTS ranked_clusters;

    CREATE TEMP TABLE ranked_clusters AS (
        SELECT
            pl.cluster_id      AS c_id,
            SUM(pl.total_area) AS cluster_score
        FROM
            emob.parking_lots AS pl
        WHERE
            pl.cluster_id IS NOT NULL
            AND run_id = 2005149
        GROUP BY
            pl.cluster_id
        ORDER BY
            cluster_score DESC
    );

    FOR row IN SELECT c_id FROM ranked_clusters LOOP
        RAISE NOTICE 'Cluster %', row.c_id;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
So I create a temporary table called ranked_clusters and then iterate through it, at the moment just logging the identifiers of each record.
I have been careful to only build this list from records which have a run_id value equal to a certain number, so data from the same source, but with a different number will be ignored.
What I am worried about however is that a simultaneous process will also create its own ranked_clusters temporary table, which will collide with the first one, invalidating the results.
So my question is essentially this: Are temporary tables only visible to the session which creates them (or to the cursor object from say, Python)? And is it therefore safe to use a temporary table in this way?
The main reason I ask is because I see that these so-called "temporary" tables seem to persist after I execute the query in PgAdmin III, and the query fails on the next execution because the table already exists. This troubles me because it seems as though the tables are actually globally accessible during their lifetime and would therefore introduce the possibility of a collision when a simultaneous run occurs.
Thanks @a_horse_with_no_name for the explanation but I am not yet convinced that it is safe, because I have been able to execute the following code:
import psycopg2 as pg2

conn = pg2.connect(dbname=CONFIG["GEODB_NAME"],
                   user=CONFIG["GEODB_USER"],
                   password=CONFIG["GEODB_PASS"],
                   host=CONFIG["GEODB_HOST"],
                   port=CONFIG["GEODB_PORT"])
conn.autocommit = True
cur = conn.cursor()

conn2 = pg2.connect(dbname=CONFIG["GEODB_NAME"],
                    user=CONFIG["GEODB_USER"],
                    password=CONFIG["GEODB_PASS"],
                    host=CONFIG["GEODB_HOST"],
                    port=CONFIG["GEODB_PORT"])
conn2.autocommit = True
cur2 = conn.cursor()  # note: this is the typo discussed below -- it should be conn2.cursor()

cur.execute("CREATE TEMPORARY TABLE temptable (tempcol INTEGER); INSERT INTO temptable VALUES (0);")
cur2.execute("SELECT tempcol FROM temptable;")
print(cur2.fetchall())
And I receive the value in temptable despite it being created as a temporary table in a completely different connection as the one which queries it afterwards. Am I missing something here? Because it seems like the temporary table is indeed accessible between connections.
The above had a typo: both cursors were actually being spawned from conn, rather than one from conn and the other from conn2. Individual connections in psycopg2 are not able to access each other's temporary tables, but cursors spawned from the same connection are.
Temporary tables are only visible to the session (=connection) that created them. Even if two sessions create the same table, they won't interfere with each other.
Temporary tables are removed automatically when the session is disconnected.
If you want to automatically remove them when your transaction ends, use the ON COMMIT DROP option when creating the table.
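For example, a minimal sketch applied to the query from the question (only the ON COMMIT clause is new; the body of the SELECT is unchanged, and the table is dropped automatically when the transaction commits):
CREATE TEMP TABLE ranked_clusters ON COMMIT DROP AS (
    SELECT
        pl.cluster_id      AS c_id,
        SUM(pl.total_area) AS cluster_score
    FROM emob.parking_lots AS pl
    WHERE pl.cluster_id IS NOT NULL
      AND run_id = 2005149
    GROUP BY pl.cluster_id
);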
So the answer is: yes, this is safe.
Unrelated, but: you can't store rows "in a sorted way". Rows in a table have no implicit sort order. The only way you can get a guaranteed sort order is to use an ORDER BY when selecting the rows. The order by that is part of your CREATE TABLE AS statement is pretty much useless.
If you have to rely on the sort order of the rows, the only safe way to do that is in the SELECT statement:
FOR row IN SELECT c_id FROM ranked_clusters ORDER BY cluster_score
LOOP
    RAISE NOTICE 'Cluster %', row.c_id;
END LOOP;

How to replace row if primary key already exists ("IntegrityError: duplicate key value") [duplicate]

A very frequently asked question here is how to do an upsert, which is what MySQL calls INSERT ... ON DUPLICATE KEY UPDATE and which the standard supports as part of the MERGE operation.
Given that PostgreSQL doesn't support it directly (before pg 9.5), how do you do this? Consider the following:
CREATE TABLE testtable (
id integer PRIMARY KEY,
somedata text NOT NULL
);
INSERT INTO testtable (id, somedata) VALUES
(1, 'fred'),
(2, 'bob');
Now imagine that you want to "upsert" the tuples (2, 'Joe'), (3, 'Alan'), so the new table contents would be:
(1, 'fred'),
(2, 'Joe'), -- Changed value of existing tuple
(3, 'Alan') -- Added new tuple
That's what people are talking about when discussing an upsert. Crucially, any approach must be safe in the presence of multiple transactions working on the same table - either by using explicit locking, or otherwise defending against the resulting race conditions.
This topic is discussed extensively at Insert, on duplicate update in PostgreSQL?, but that's about alternatives to the MySQL syntax, and it's grown a fair bit of unrelated detail over time. I'm working on definitive answers.
These techniques are also useful for "insert if not exists, otherwise do nothing", i.e. "insert ... on duplicate key ignore".
9.5 and newer:
PostgreSQL 9.5 and newer support INSERT ... ON CONFLICT (key) DO UPDATE (and ON CONFLICT (key) DO NOTHING), i.e. upsert.
Comparison with ON DUPLICATE KEY UPDATE.
Quick explanation.
For usage see the manual - specifically the conflict_action clause in the syntax diagram, and the explanatory text.
Unlike the solutions for 9.4 and older that are given below, this feature works with multiple conflicting rows and it doesn't require exclusive locking or a retry loop.
The commit adding the feature is here and the discussion around its development is here.
If you're on 9.5 and don't need to be backward-compatible you can stop reading now.
9.4 and older:
PostgreSQL doesn't have any built-in UPSERT (or MERGE) facility, and doing it efficiently in the face of concurrent use is very difficult.
This article discusses the problem in useful detail.
In general you must choose between two options:
Individual insert/update operations in a retry loop; or
Locking the table and doing batch merge
Individual row retry loop
Using individual row upserts in a retry loop is the reasonable option if you want many connections concurrently trying to perform inserts.
The PostgreSQL documentation contains a useful procedure that'll let you do this in a loop inside the database. It guards against lost updates and insert races, unlike most naive solutions. It will only work in READ COMMITTED mode and is only safe if it's the only thing you do in the transaction, though. The function won't work correctly if triggers or secondary unique keys cause unique violations.
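For reference, a sketch of that documented retry loop, adapted to the testtable example from the question (the manual's merge_db function works the same way):
CREATE FUNCTION merge_testtable(key INT, data TEXT) RETURNS VOID AS
$$
BEGIN
    LOOP
        -- first try to update the key
        UPDATE testtable SET somedata = data WHERE id = key;
        IF found THEN
            RETURN;
        END IF;
        -- not there, so try to insert the key;
        -- if someone else inserts the same key concurrently we get a unique-key failure
        BEGIN
            INSERT INTO testtable (id, somedata) VALUES (key, data);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- do nothing, and loop to try the UPDATE again
        END;
    END LOOP;
END;
$$
LANGUAGE plpgsql;

SELECT merge_testtable(2, 'Joe');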
This strategy is very inefficient. Whenever practical you should queue up work and do a bulk upsert as described below instead.
Many attempted solutions to this problem fail to consider rollbacks, so they result in incomplete updates. Two transactions race with each other; one of them successfully INSERTs; the other gets a duplicate key error and does an UPDATE instead. The UPDATE blocks waiting for the INSERT to roll back or commit. When it rolls back, the UPDATE condition re-check matches zero rows, so even though the UPDATE commits it hasn't actually done the upsert you expected. You have to check the result row counts and re-try where necessary.
Some attempted solutions also fail to consider SELECT races. If you try the obvious and simple:
-- THIS IS WRONG. DO NOT COPY IT. It's an EXAMPLE.
BEGIN;
UPDATE testtable
SET somedata = 'blah'
WHERE id = 2;
-- Remember, this is WRONG. Do NOT COPY IT.
INSERT INTO testtable (id, somedata)
SELECT 2, 'blah'
WHERE NOT EXISTS (SELECT 1 FROM testtable WHERE testtable.id = 2);
COMMIT;
then when two run at once there are several failure modes. One is the already discussed issue with an update re-check. Another is where both UPDATE at the same time, matching zero rows and continuing. Then they both do the EXISTS test, which happens before the INSERT. Both get zero rows, so both do the INSERT. One fails with a duplicate key error.
This is why you need a re-try loop. You might think that you can prevent duplicate key errors or lost updates with clever SQL, but you can't. You need to check row counts or handle duplicate key errors (depending on the chosen approach) and re-try.
Please don't roll your own solution for this. Like with message queuing, it's probably wrong.
Bulk upsert with lock
Sometimes you want to do a bulk upsert, where you have a new data set that you want to merge into an older existing data set. This is vastly more efficient than individual row upserts and should be preferred whenever practical.
In this case, the process typically looks like this:
CREATE a TEMPORARY table
COPY or bulk-insert the new data into the temp table
LOCK the target table IN EXCLUSIVE MODE. This permits other transactions to SELECT, but not make any changes to the table.
Do an UPDATE ... FROM of existing records using the values in the temp table;
Do an INSERT of rows that don't already exist in the target table;
COMMIT, releasing the lock.
For example, for the example given in the question, using multi-valued INSERT to populate the temp table:
BEGIN;
CREATE TEMPORARY TABLE newvals(id integer, somedata text);
INSERT INTO newvals(id, somedata) VALUES (2, 'Joe'), (3, 'Alan');
LOCK TABLE testtable IN EXCLUSIVE MODE;
UPDATE testtable
SET somedata = newvals.somedata
FROM newvals
WHERE newvals.id = testtable.id;
INSERT INTO testtable
SELECT newvals.id, newvals.somedata
FROM newvals
LEFT OUTER JOIN testtable ON (testtable.id = newvals.id)
WHERE testtable.id IS NULL;
COMMIT;
Related reading
UPSERT wiki page
UPSERTisms in Postgres
Insert, on duplicate update in PostgreSQL?
http://petereisentraut.blogspot.com/2010/05/merge-syntax.html
Upsert with a transaction
Is SELECT or INSERT in a function prone to race conditions?
SQL MERGE on the PostgreSQL wiki
Most idiomatic way to implement UPSERT in Postgresql nowadays
What about MERGE?
SQL-standard MERGE actually has poorly defined concurrency semantics and is not suitable for upserting without locking a table first.
It's a really useful OLAP statement for data merging, but it's not actually a useful solution for concurrency-safe upsert. There's lots of advice to people using other DBMSes to use MERGE for upserts, but it's actually wrong.
Other DBs:
INSERT ... ON DUPLICATE KEY UPDATE in MySQL
MERGE from MS SQL Server (but see above about MERGE problems)
MERGE from Oracle (but see above about MERGE problems)
Here are some examples for insert ... on conflict ... (pg 9.5+):
Insert, on conflict - do nothing.
insert into dummy(id, name, size) values(1, 'new_name', 3)
on conflict do nothing;
Insert, on conflict - do update, specify conflict target via column.
insert into dummy(id, name, size) values(1, 'new_name', 3)
on conflict(id)
do update set name = 'new_name', size = 3;
Insert, on conflict - do update, specify conflict target via constraint name.
insert into dummy(id, name, size) values(1, 'new_name', 3)
on conflict on constraint dummy_pkey
do update set name = 'new_name', size = 4;
I would like to contribute another solution to the single-insertion problem for the pre-9.5 versions of PostgreSQL. The idea is simply to try the insertion first and, if the record is already present, to update it:
do $$
begin
    insert into testtable(id, somedata) values (2, 'Joe');
exception when unique_violation then
    update testtable set somedata = 'Joe' where id = 2;
end $$;
Note that this solution can be applied only if rows are never deleted from the table.
I do not know about the efficiency of this solution, but it seems to me reasonable enough.
SQLAlchemy upsert for Postgres >=9.5
Since the large post above covers many SQL approaches for different Postgres versions (not only pre-9.5, as in the question), I would like to add how to do it in SQLAlchemy if you are using Postgres 9.5 or later. Instead of implementing your own upsert, you can also use SQLAlchemy's functions (which were added in SQLAlchemy 1.1). Personally, I would recommend using these, if possible. Not only because of convenience, but also because it lets PostgreSQL handle any race conditions that might occur.
Cross-posting from another answer I gave yesterday (https://stackoverflow.com/a/44395983/2156909)
SQLAlchemy supports ON CONFLICT now with two methods on_conflict_do_update() and on_conflict_do_nothing():
Copying from the documentation:
from sqlalchemy.dialects.postgresql import insert

stmt = insert(my_table).values(user_email='a@b.com', data='inserted data')
stmt = stmt.on_conflict_do_update(
    index_elements=[my_table.c.user_email],
    index_where=my_table.c.user_email.like('%@gmail.com'),
    set_=dict(data=stmt.excluded.data)
)
conn.execute(stmt)
http://docs.sqlalchemy.org/en/latest/dialects/postgresql.html?highlight=conflict#insert-on-conflict-upsert
MERGE in PostgreSQL v. 15
Since PostgreSQL v. 15 it is possible to use the MERGE command. It was actually presented as one of the main improvements of this new version.
It uses WHEN MATCHED / WHEN NOT MATCHED conditions to choose the behaviour when a row matching the criteria already exists.
It is even better than the standard UPSERT, as the new feature gives full control to INSERT, UPDATE or DELETE rows in bulk.
MERGE INTO customer_account ca
USING recent_transactions t
ON t.customer_id = ca.customer_id
WHEN MATCHED THEN
    UPDATE SET balance = balance + transaction_value
WHEN NOT MATCHED THEN
    INSERT (customer_id, balance)
    VALUES (t.customer_id, t.transaction_value);
WITH UPD AS (
    UPDATE TEST_TABLE SET SOME_DATA = 'Joe' WHERE ID = 2
    RETURNING ID
),
INS AS (
    SELECT '2', 'Joe' WHERE NOT EXISTS (SELECT * FROM UPD)
)
INSERT INTO TEST_TABLE (ID, SOME_DATA) SELECT * FROM INS;
Tested on Postgresql 9.3
Since this question was closed, I'm posting here how to do it using SQLAlchemy. Via recursion, it retries a bulk insert or update to combat race conditions and validation errors.
First the imports
import itertools as it
from functools import partial
from operator import itemgetter
from sqlalchemy.exc import IntegrityError
from app import session
from models import Posts
Now a couple helper functions
def chunk(content, chunksize=None):
    """Groups data into chunks each with (at most) `chunksize` items.
    https://stackoverflow.com/a/22919323/408556
    """
    if chunksize:
        i = iter(content)
        generator = (list(it.islice(i, chunksize)) for _ in it.count())
    else:
        generator = iter([content])

    return it.takewhile(bool, generator)
def gen_resources(records):
    """Yields a dictionary if the record's id already exists, a row object
    otherwise.
    """
    ids = {item[0] for item in session.query(Posts.id)}

    for record in records:
        is_row = hasattr(record, 'to_dict')

        if is_row and record.id in ids:
            # It's a row but the id already exists, so we need to convert it
            # to a dict that updates the existing record. Since it is duplicate,
            # also yield True
            yield record.to_dict(), True
        elif is_row:
            # It's a row and the id doesn't exist, so no conversion needed.
            # Since it's not a duplicate, also yield False
            yield record, False
        elif record['id'] in ids:
            # It's a dict and the id already exists, so no conversion needed.
            # Since it is duplicate, also yield True
            yield record, True
        else:
            # It's a dict and the id doesn't exist, so we need to convert it.
            # Since it's not a duplicate, also yield False
            yield Posts(**record), False
And finally the upsert function
def upsert(data, chunksize=None):
    for records in chunk(data, chunksize):
        resources = gen_resources(records)
        sorted_resources = sorted(resources, key=itemgetter(1))

        for dupe, group in it.groupby(sorted_resources, itemgetter(1)):
            items = [g[0] for g in group]

            if dupe:
                _upsert = partial(session.bulk_update_mappings, Posts)
            else:
                _upsert = session.add_all

            try:
                _upsert(items)
                session.commit()
            except IntegrityError:
                # A record was added or deleted after we checked, so retry
                #
                # modify accordingly by adding additional exceptions, e.g.,
                # except (IntegrityError, ValidationError, ValueError)
                session.rollback()
                upsert(items)
            except Exception as e:
                # Some other error occurred so reduce chunksize to isolate the
                # offending row(s)
                session.rollback()
                num_items = len(items)

                if num_items > 1:
                    upsert(items, num_items // 2)
                else:
                    print('Error adding record {}'.format(items[0]))
Here's how you use it
>>> data = [
...     {'id': 1, 'text': 'updated post1'},
...     {'id': 5, 'text': 'updated post5'},
...     {'id': 1000, 'text': 'new post1000'}]
>>> upsert(data)
The advantage this has over bulk_save_objects is that it can handle relationships, error checking, etc on insert (unlike bulk operations).

Does Postgres support virtual columns? [duplicate]

Does PostgreSQL support computed / calculated columns, like MS SQL Server? I can't find anything in the docs, but as this feature is included in many other DBMSs I thought I might be missing something.
Eg: http://msdn.microsoft.com/en-us/library/ms191250.aspx
Postgres 12 or newer
STORED generated columns are introduced with Postgres 12 - as defined in the SQL standard and implemented by some RDBMS including DB2, MySQL, and Oracle. Or the similar "computed columns" of SQL Server.
Trivial example:
CREATE TABLE tbl (
    int1    int
  , int2    int
  , product bigint GENERATED ALWAYS AS (int1 * int2) STORED
);
fiddle
VIRTUAL generated columns may come with one of the next iterations. (Not in Postgres 15, yet).
Related:
Attribute notation for function call gives error
Postgres 11 or older
Up to Postgres 11 "generated columns" are not supported.
You can emulate VIRTUAL generated columns with a function using attribute notation (tbl.col) that looks and works much like a virtual generated column. That's a bit of a syntax oddity which exists in Postgres for historic reasons and happens to fit the case. This related answer has code examples:
Store common query as column?
The expression (looking like a column) is not included in a SELECT * FROM tbl, though. You always have to list it explicitly.
Can also be supported with a matching expression index - provided the function is IMMUTABLE. Like:
CREATE FUNCTION col(tbl) ... AS ... -- your computed expression here
CREATE INDEX ON tbl(col(tbl));
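To make that concrete, here is a minimal sketch of the emulation (the table, column, and function names are made up for illustration):
CREATE TABLE person (first_name text, last_name text);

CREATE FUNCTION full_name(person) RETURNS text
    LANGUAGE sql IMMUTABLE AS
$$ SELECT $1.first_name || ' ' || $1.last_name $$;

-- attribute notation: reads like a column, but resolves to the function
SELECT p.full_name FROM person p;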
Alternatives
Alternatively, you can implement similar functionality with a VIEW, optionally coupled with expression indexes. Then SELECT * can include the generated column.
"Persisted" (STORED) computed columns can be implemented with triggers in a functionally equivalent way.
Materialized views are a related concept, implemented since Postgres 9.3.
In earlier versions one can manage MVs manually.
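As a rough sketch of the materialized-view route, building on the same example (and assuming a pre-12 tbl that has only int1 and int2):
CREATE MATERIALIZED VIEW tbl_mv AS
SELECT *, int1 * int2 AS product
FROM tbl;

-- recompute the stored values on demand
REFRESH MATERIALIZED VIEW tbl_mv;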
YES you can!! The solution should be easy, safe, and performant...
I'm new to postgresql, but it seems you can create computed columns by using an expression index, paired with a view (the view is optional, but makes life a bit easier).
Suppose my computation is md5(some_string_field), then I create the index as:
CREATE INDEX some_string_field_md5_index ON some_table(MD5(some_string_field));
Now, any queries that act on MD5(some_string_field) will use the index rather than computing it from scratch. For example:
SELECT MAX(some_field) FROM some_table GROUP BY MD5(some_string_field);
You can check this with explain.
However at this point you are relying on users of the table knowing exactly how to construct the column. To make life easier, you can create a VIEW onto an augmented version of the original table, adding in the computed value as a new column:
CREATE VIEW some_table_augmented AS
SELECT *, MD5(some_string_field) as some_string_field_md5 from some_table;
Now any queries using some_table_augmented will be able to use some_string_field_md5 without worrying about how it works; they just get good performance. The view doesn't copy any data from the original table, so it is good memory-wise as well as performance-wise. Note however that you can't update/insert into a view, only into the source table, but if you really want, I believe you can redirect inserts and updates to the source table using rules (I could be wrong on that last point as I've never tried it myself).
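For what it's worth, a rough, untested sketch of such a redirect rule might look like this (an INSTEAD OF trigger would be the more modern way to achieve the same thing):
CREATE RULE some_table_augmented_insert AS
ON INSERT TO some_table_augmented
DO INSTEAD
INSERT INTO some_table (some_string_field)
VALUES (NEW.some_string_field);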
Edit: it seems if the query involves competing indices, the planner engine may sometimes not use the expression-index at all. The choice seems to be data dependant.
One way to do this is with a trigger!
CREATE TABLE computed (
    one SERIAL,
    two INT NOT NULL
);

CREATE OR REPLACE FUNCTION computed_two_trg()
    RETURNS trigger
    LANGUAGE plpgsql
    SECURITY DEFINER
AS $BODY$
BEGIN
    NEW.two = NEW.one * 2;
    RETURN NEW;
END
$BODY$;

CREATE TRIGGER computed_500
    BEFORE INSERT OR UPDATE
    ON computed
    FOR EACH ROW
    EXECUTE PROCEDURE computed_two_trg();
The trigger is fired before the row is updated or inserted. It sets the field we want to compute on the NEW record and then returns that record.
PostgreSQL 12 supports generated columns:
PostgreSQL 12 Beta 1 Released!
Generated Columns
PostgreSQL 12 allows the creation of generated columns that compute their values with an expression using the contents of other columns. This feature provides stored generated columns, which are computed on inserts and updates and are saved on disk. Virtual generated columns, which are computed only when a column is read as part of a query, are not implemented yet.
Generated Columns
A generated column is a special column that is always computed from other columns. Thus, it is for columns what a view is for tables.
CREATE TABLE people (
    ...,
    height_cm numeric,
    height_in numeric GENERATED ALWAYS AS (height_cm * 2.54) STORED
);
db<>fiddle demo
Well, not sure if this is what you mean, but Postgres normally supports a "dummy" ETL syntax.
I created an empty column in a table and then needed to fill it with calculated values depending on the other values in the row.
UPDATE table01
SET column03 = column01*column02; /*e.g. for multiplication of 2 values*/
It is so simple that I suspect it is not what you are looking for. Obviously it is not dynamic: you run it once. But there is no obstacle to putting it into a trigger.
Example of creating an empty virtual column:
, (SELECT *
   FROM (VALUES ('')) A("virtual_col"))
Example of creating two virtual columns with values:
SELECT *
FROM (VALUES
        (45, 'Completed'),
        (1, 'In Progress'),
        (1, 'Waiting'),
        (1, 'Loading')
     ) A("Count", "Status")
ORDER BY "Count" DESC;
I have code that works and uses the term calculated. I'm not on pure PostgreSQL though; we run on PADB.
Here is how it's used:
create table some_table as
select category,
txn_type,
indiv_id,
accum_trip_flag,
max(first_true_origin) as true_origin,
max(first_true_dest ) as true_destination,
max(id) as id,
count(id) as tkts_cnt,
(case when calculated tkts_cnt=1 then 1 else 0 end) as one_way
from some_rando_table
group by 1,2,3,4 ;
A lightweight solution with Check constraint:
CREATE TABLE example (
discriminator INTEGER DEFAULT 0 NOT NULL CHECK (discriminator = 0)
);

How to use UPDATE ... FROM in SQLAlchemy?

I would like to write this kind of statement in SQLAlchemy / Postgres:
UPDATE slots
SET user = 'joe'
FROM (SELECT id FROM slots WHERE user IS NULL
      ORDER BY id LIMIT 1000) AS available
WHERE slots.id = available.id
RETURNING *;
Namely, I would like to update a limited number of rows matching specified criteria.
I was able to do it this way:
limited_slots = select([slots.c.id]).\
    where(slots.c.user == None).\
    order_by(slots.c.id).\
    limit(1000)

stmt = slots.update().returning(slots).\
    values(user='joe').\
    where(slots.c.id.in_(limited_slots))
I don't think it's as efficient as the original SQL query; however, if the database memory is large enough to hold all related segments it shouldn't make much difference.
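In plain SQL, this construct amounts to filtering the UPDATE through an IN subquery instead of the UPDATE ... FROM join in the question, roughly:
UPDATE slots
SET    "user" = 'joe'   -- user is a reserved word, hence the quoting here
WHERE  id IN (SELECT id
              FROM   slots
              WHERE  "user" IS NULL
              ORDER  BY id
              LIMIT  1000)
RETURNING *;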
It's been a while since I used sqlalchemy so consider the following as pseudocode:
for i in session.query(Slots).filter(Slots.user == None):
    i.user = "Joe"
    session.add(i)
session.commit()
I recommend the sqlalchemy ORM tutorial.

how to emulate "insert ignore" and "on duplicate key update" (sql merge) with postgresql?

Some SQL servers have a feature where INSERT is skipped if it would violate a primary/unique key constraint. For instance, MySQL has INSERT IGNORE.
What's the best way to emulate INSERT IGNORE and ON DUPLICATE KEY UPDATE with PostgreSQL?
With PostgreSQL 9.5, this is now native functionality (like MySQL has had for several years):
INSERT ... ON CONFLICT DO NOTHING/UPDATE ("UPSERT")
9.5 brings support for "UPSERT" operations.
INSERT is extended to accept an ON CONFLICT DO UPDATE/IGNORE clause. This clause specifies an alternative action to take in the event of a would-be duplicate violation.
...
Further example of new syntax:
INSERT INTO user_logins (username, logins)
VALUES ('Naomi',1),('James',1)
ON CONFLICT (username)
DO UPDATE SET logins = user_logins.logins + EXCLUDED.logins;
Edit: in case you missed warren's answer, PG9.5 now has this natively; time to upgrade!
Building on Bill Karwin's answer, to spell out what a rule based approach would look like (transferring from another schema in the same DB, and with a multi-column primary key):
CREATE RULE "my_table_on_duplicate_ignore" AS ON INSERT TO "my_table"
WHERE EXISTS(SELECT 1 FROM my_table
WHERE (pk_col_1, pk_col_2)=(NEW.pk_col_1, NEW.pk_col_2))
DO INSTEAD NOTHING;
INSERT INTO my_table SELECT * FROM another_schema.my_table WHERE some_cond;
DROP RULE "my_table_on_duplicate_ignore" ON "my_table";
Note: The rule applies to all INSERT operations until the rule is dropped, so not quite ad hoc.
For those of you that have Postgres 9.5 or higher, the new ON CONFLICT DO NOTHING syntax should work:
INSERT INTO target_table (field_one, field_two, field_three )
SELECT field_one, field_two, field_three
FROM source_table
ON CONFLICT (field_one) DO NOTHING;
For those of us who have an earlier version, this left join will work instead:
INSERT INTO target_table (field_one, field_two, field_three )
SELECT source_table.field_one, source_table.field_two, source_table.field_three
FROM source_table
LEFT JOIN target_table ON source_table.field_one = target_table.field_one
WHERE target_table.field_one IS NULL;
Try to do an UPDATE. If it doesn't modify any row that means it didn't exist, so do an insert. Obviously, you do this inside a transaction.
You can of course wrap this in a function if you don't want to put the extra code on the client side. You also need a retry loop for the very rare race condition in that approach.
There's an example of this in the documentation: http://www.postgresql.org/docs/9.3/static/plpgsql-control-structures.html, example 40-2 right at the bottom.
That's usually the easiest way. You can do some magic with rules, but it's likely going to be a lot messier. I'd recommend the wrap-in-function approach over that any day.
This works for single-row, or few-row, values. If you're dealing with large amounts of rows, for example from a subquery, you're best off splitting it into two queries, one for INSERT and one for UPDATE (as an appropriate join/subselect of course - no need to write your main filter twice).
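A rough sketch of that split, assuming the incoming rows sit in a staging table (all names here are illustrative; combine with the table lock shown in the long answer above if concurrent writers are a concern):
BEGIN;

-- update rows that already exist
UPDATE mytable m
SET    col2 = s.col2
FROM   staging s
WHERE  s.col1 = m.col1;

-- insert rows that don't
INSERT INTO mytable (col1, col2)
SELECT s.col1, s.col2
FROM   staging s
WHERE  NOT EXISTS (SELECT 1 FROM mytable m WHERE m.col1 = s.col1);

COMMIT;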
To get the insert ignore logic you can do something like below. I found simply inserting from a select statement of literal values worked best, then you can mask out the duplicate keys with a NOT EXISTS clause. To get the update on duplicate logic I suspect a pl/pgsql loop would be necessary.
INSERT INTO manager.vin_manufacturer
(SELECT * FROM( VALUES
('935',' Citroën Brazil','Citroën'),
('ABC', 'Toyota', 'Toyota'),
('ZOM',' OM','OM')
) as tmp (vin_manufacturer_id, manufacturer_desc, make_desc)
WHERE NOT EXISTS (
--ignore anything that has already been inserted
SELECT 1 FROM manager.vin_manufacturer m where m.vin_manufacturer_id = tmp.vin_manufacturer_id)
)
INSERT INTO mytable(col1,col2)
SELECT 'val1','val2'
WHERE NOT EXISTS (SELECT 1 FROM mytable WHERE col1='val1')
As @hanmari mentioned in his comment, when inserting into a postgres table, on conflict (..) do nothing is the best code to use for not inserting duplicate data:
query = ("INSERT INTO db_table_name(column_name) "
         "VALUES(%s) ON CONFLICT (column_name) DO NOTHING;")
The ON CONFLICT line of code will allow the insert statement to still insert rows of data. The query and values code is an example of inserting data from an Excel file into a postgres db table.
I have constraints added to a postgres table I use to make sure the ID field is unique. Instead of running a delete on rows of data that is the same, I add a line of sql code that renumbers the ID column starting at 1.
Example:
q = 'ALTER SEQUENCE id_column_seq RESTART WITH 1'  # the sequence name depends on your table
If my data has an ID field, I do not use it as the primary/serial ID; I create a separate ID column and set it to serial.
I hope this information is helpful to everyone.
*I have no college degree in software development/coding. Everything I know in coding, I study on my own.
Looks like PostgreSQL supports a schema object called a rule.
http://www.postgresql.org/docs/current/static/rules-update.html
You could create a rule ON INSERT for a given table, making it do NOTHING if a row exists with the given primary key value, or else making it do an UPDATE instead of the INSERT if a row exists with the given primary key value.
I haven't tried this myself, so I can't speak from experience or offer an example.
This solution avoids using rules:
BEGIN
    INSERT INTO tableA (unique_column, c2, c3) VALUES (1, 2, 3);
EXCEPTION
    WHEN unique_violation THEN
        UPDATE tableA SET c2 = 2, c3 = 3 WHERE unique_column = 1;
END;
but it has a performance drawback (see PostgreSQL.org):
A block containing an EXCEPTION clause is significantly more expensive to enter and exit than a block without one. Therefore, don't use EXCEPTION without need.
For bulk loads, you can always delete the row before the insert. A deletion of a row that doesn't exist doesn't cause an error, so it's safely skipped.
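A minimal sketch of that pattern, assuming the new rows are sitting in a staging table (table and column names are illustrative):
BEGIN;

-- remove any rows that are about to be re-inserted; rows that don't exist are simply skipped
DELETE FROM target_table t
USING  staging s
WHERE  t.id = s.id;

INSERT INTO target_table
SELECT * FROM staging;

COMMIT;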
For data import scripts, to replace "IF NOT EXISTS", in a way, there's a slightly awkward formulation that nevertheless works:
DO
$do$
BEGIN
    PERFORM id
    FROM whatever_table;

    IF NOT FOUND THEN
        -- INSERT stuff
    END IF;
END
$do$;