Related
Explanation:
Let's say we have a table:
userid | points
-------+--------
   123 |      1
   456 |      1
Both userid and points are of type int (or any other numeric data type), and userid is my PK.
Now I would like to perform an update query on this table, and if the row does not exist, insert it: if the user already exists, increment points by 1; otherwise insert the userid with points set to 1 as the default.
I am aware I can perform an upsert like this:
INSERT INTO table(userid, points) VALUES(123, 1)
ON CONFLICT (userid)
DO UPDATE
SET points = table.points + 1
WHERE table.userid = 123;
However, in my case the UPDATE operation is more frequent than inserting a new row. Let's say there are 5000 queries each day and around 4500 of them are UPDATE operations on already existing rows. Doing the opposite of an upsert would be more beneficial, since the conflict would occur only 500 times instead of 4500. I would like to try an UPDATE first and, if it returns UPDATE 0, perform an INSERT.
Is it possible to do the above using RETURNING or FOUND or something else in a single query? Or is the benefit of the above, if it is possible at all, too insignificant, and the upsert is the way to go?
A simple representation of what I want to do using Python and asyncpg (2 queries):
import asyncio
import asyncpg

async def run():
    conn = await asyncpg.connect(user='user', password='password',
                                 database='database')
    output = await conn.execute("UPDATE table SET points = points + 1 WHERE userid = $1", 123)
    if output == "UPDATE 0":
        await conn.execute("INSERT INTO table(userid, points) VALUES($1, $2)", 123, 1)
    await conn.close()

loop = asyncio.get_event_loop()
loop.run_until_complete(run())
Questions I have already checked:
This one is the most similar to my question, but it is for MySQL and unfortunately does not have an answer.
This one kind of works, but it still uses 2 separate queries and relies upon one query failing/being ignored.
I have also read this, this, this, this, this, and this, but they don't answer my questions.
Your proposed code has a race condition: someone could insert a row between the UPDATE and the INSERT, making both fail. The only safe technique is an endless loop that tries both statements until one of them succeeds.
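A minimal sketch of that loop in PL/pgSQL, assuming a hypothetical table named scores in place of the question's table (the pattern follows the classic merge example in the PostgreSQL documentation):
CREATE OR REPLACE FUNCTION bump_points(_userid int) RETURNS void AS
$$
BEGIN
    LOOP
        -- try the frequent case first
        UPDATE scores SET points = points + 1 WHERE userid = _userid;
        IF FOUND THEN
            RETURN;
        END IF;
        -- row not there yet: try to insert it;
        -- a concurrent session may insert it first, so catch the violation and retry
        BEGIN
            INSERT INTO scores(userid, points) VALUES (_userid, 1);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- do nothing; loop again and retry the UPDATE
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
The function keeps retrying until either the UPDATE or the INSERT succeeds, which is what makes it safe against the race described above.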
Since every statement requires a client-server round trip, I doubt that your code will perform better than INSERT ... ON CONFLICT.
Rather than making an unfounded assumption that INSERT ... ON CONFLICT is much slower than UPDATE, you should benchmark both solutions.
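For completeness, if the main appeal of update-first is a single round trip, a data-modifying CTE can express it in one statement. This is only a sketch (again using the hypothetical table name scores), and it is still subject to the race condition described above, so under concurrency it does not replace ON CONFLICT or the retry loop:
WITH upd AS (
    UPDATE scores
    SET    points = points + 1
    WHERE  userid = 123
    RETURNING userid
)
INSERT INTO scores (userid, points)
SELECT 123, 1
WHERE  NOT EXISTS (SELECT 1 FROM upd);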
I would like to insert records into table_a from table_b that don't already exist in table table_a. I already have Postgres SQL code to do this, but now my team has requested that I use an ORM (SQLAlchemy) instead.
INSERT INTO table_a
SELECT
composite_pk1,
composite_pk2,
col_c,
col_d
FROM table_b
ON CONFLICT (
composite_pk1,
composite_pk2
) DO NOTHING
I have nearly a million rows and about 15 columns (not shown in the example). I need this query to be fast, which is why I don't think the solution posted here will work for my use case.
For performance reasons I also want to avoid treating my Python function as a data conduit. I don't want to transfer many rows of table_b over the network to my function just to push them back over the network again to table_a. That is, I would prefer the insert to happen entirely on Postgres, which I already accomplish with my original SQL query.
Probably the fastest way to perform an upsert with the SQLAlchemy ORM is the bulk_update_mappings function, which lets you upsert based simply on a list of dicts.
But the situation you are describing isn't really an upsert: you want to insert rows, and if there is a conflict, do nothing. No update is being done here, so it is a simple insert.
To perform an insert that skips any conflicts is a simple thing in SQLAlchemy (assuming you have your table already defined as a model):
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()
engine = create_engine('your_db_connection_string', echo=True)
Session = sessionmaker(bind=engine)
session = Session()

# example column names
data = [{'col1': result.col1, 'col2': result.col2}
        for result in session.query(table_b).all()]

insert_query = insert(table_a).values(data).on_conflict_do_nothing()
session.execute(insert_query)
session.commit()
session.close()
If I use MyBatis, I can easily get the count of updated rows with something like:
update table set desc = 'xxx' where name = ?
However, if I want to get the updated rows themselves, not the count, how can I achieve this with MyBatis?
MyBatis itself can't do that, because the update happens in the database and no row data is returned.
The only option is to modify the query and make it update and also select the data you need. The exact way to achieve this depends on the database you are using and/or the driver's support.
In Postgres, for example, you can change the query and add a RETURNING clause like this:
UPDATE table
SET desc = 'xxx'
WHERE name = ?
RETURNING *
This turns the query into a select, so you can map it as a select query in MyBatis. Some other databases have similar features.
Another option (if your database and/or JDBC driver supports this) is to run two queries, an update and a select, like this:
<select id="updateAndReturnModified" resultMap="...">
UPDATE table
SET desc = 'xxx'
WHERE name = #{name};
SELECT *
FROM table
WHERE name = #{name};
</select>
This, however, may require a stricter isolation level (READ_COMMITTED, for example, will not work) to make sure the second select sees the state after the update and does not see changes made by some concurrent update. Again, whether you need this or not depends on the database you are using.
Looking through the documentation for the Postgres 9.4 datatype JSONB, it is not immediately obvious to me how to do updates on JSONB columns.
Documentation for JSONB types and functions:
http://www.postgresql.org/docs/9.4/static/functions-json.html
http://www.postgresql.org/docs/9.4/static/datatype-json.html
As an example, I have this basic table structure:
CREATE TABLE test(id serial, data jsonb);
Inserting is easy, as in:
INSERT INTO test(data) values ('{"name": "my-name", "tags": ["tag1", "tag2"]}');
Now, how would I update the 'data' column? This is invalid syntax:
UPDATE test SET data->'name' = 'my-other-name' WHERE id = 1;
Is this documented somewhere obvious that I missed? Thanks.
If you're able to upgrade to PostgreSQL 9.5, the jsonb_set function is available, as others have mentioned.
In each of the following SQL statements, I've omitted the where clause for brevity; obviously, you'd want to add that back.
Update name:
UPDATE test SET data = jsonb_set(data, '{name}', '"my-other-name"');
Replace the tags (as opposed to adding or removing tags):
UPDATE test SET data = jsonb_set(data, '{tags}', '["tag3", "tag4"]');
Replacing the second tag (0-indexed):
UPDATE test SET data = jsonb_set(data, '{tags,1}', '"tag5"');
Append a tag (this will work as long as there are fewer than 999 tags; changing argument 999 to 1000 or above generates an error. This no longer appears to be the case in Postgres 9.5.3; a much larger index can be used):
UPDATE test SET data = jsonb_set(data, '{tags,999999999}', '"tag6"', true);
Remove the last tag:
UPDATE test SET data = data #- '{tags,-1}'
Complex update (delete the last tag, insert a new tag, and change the name):
UPDATE test SET data = jsonb_set(
jsonb_set(data #- '{tags,-1}', '{tags,999999999}', '"tag3"', true),
'{name}', '"my-other-name"');
It's important to note that in each of these examples, you're not actually updating a single field of the JSON data. Instead, you're creating a temporary, modified version of the data, and assigning that modified version back to the column. In practice, the result should be the same, but keeping this in mind should make complex updates, like the last example, more understandable.
In the complex example, there are three transformations and three temporary versions: First, the last tag is removed. Then, that version is transformed by adding a new tag. Next, the second version is transformed by changing the name field. The value in the data column is replaced with the final version.
Ideally, you don't use JSON documents for structured, regular data that you want to manipulate inside a relational database. Use a normalized relational design instead.
JSON is primarily intended to store whole documents that do not need to be manipulated inside the RDBMS. Related:
JSONB with indexing vs. hstore
Updating a row in Postgres always writes a new version of the whole row. That's the basic principle of Postgres' MVCC model. From a performance perspective, it hardly matters whether you change a single piece of data inside a JSON object or all of it: a new version of the row has to be written.
Thus the advice in the manual:
JSON data is subject to the same concurrency-control considerations as
any other data type when stored in a table. Although storing large
documents is practicable, keep in mind that any update acquires a
row-level lock on the whole row. Consider limiting JSON documents to a
manageable size in order to decrease lock contention among updating
transactions. Ideally, JSON documents should each represent an atomic
datum that business rules dictate cannot reasonably be further
subdivided into smaller datums that could be modified independently.
The gist of it: to modify anything inside a JSON object, you have to assign a modified object to the column. Postgres supplies limited means to build and manipulate json data in addition to its storage capabilities. The arsenal of tools has grown substantially with every new release since version 9.2. But the principle remains: you always have to assign a complete modified object to the column, and Postgres always writes a new row version for any update.
Some techniques for working with the tools of Postgres 9.3 or later:
How do I modify fields inside the new PostgreSQL JSON datatype?
This answer has attracted about as many downvotes as all my other answers on SO together. People don't seem to like the idea: a normalized design is superior for regular data. This excellent blog post by Craig Ringer explains in more detail:
"PostgreSQL anti-patterns: Unnecessary json/hstore dynamic columns"
Another blog post by Laurenz Albe, another official Postgres contributor like Craig and myself:
JSON in PostgreSQL: how to use it right
This is coming in 9.5 in the form of jsonb_set by Andrew Dunstan, based on an existing extension, jsonbx, that does work with 9.4.
For those that run into this issue and want a very quick fix (and are stuck on 9.4.5 or earlier), here is a potential solution:
Creation of test table
CREATE TABLE test(id serial, data jsonb);
INSERT INTO test(data) values ('{"name": "my-name", "tags": ["tag1", "tag2"]}');
Update statement to change jsonb value
UPDATE test
SET data = replace(data::TEXT,': "my-name"',': "my-other-name"')::jsonb
WHERE id = 1;
Ultimately, the accepted answer is correct in that you cannot modify an individual piece of a jsonb object (in 9.4.5 or earlier); however, you can cast the jsonb column to a string (::TEXT) and then manipulate the string and cast back to the jsonb form (::jsonb).
There are two important caveats:
this will replace all values equaling "my-name" in the json (in the case you have multiple objects with the same value)
this is not as efficient as jsonb_set would be if you are using 9.5
Update the 'name' attribute:
UPDATE test SET data=data||'{"name":"my-other-name"}' WHERE id = 1;
and if you wanted to remove for example the 'name' and 'tags' attributes:
UPDATE test SET data=data-'{"name","tags"}'::text[] WHERE id = 1;
This question was asked in the context of Postgres 9.4; however, new viewers coming to this question should be aware that in Postgres 9.5, sub-document Create/Update/Delete operations on JSONB fields are natively supported by the database, without the need for extension functions.
See: JSONB modifying operators and functions
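For example, with the test table from the question, the 9.5 functions and operators allow updates like these (a quick, non-exhaustive illustration; the age key is made up):
UPDATE test SET data = jsonb_set(data, '{name}', '"my-other-name"') WHERE id = 1;  -- change a key
UPDATE test SET data = data || '{"age": 42}' WHERE id = 1;                         -- add or overwrite keys
UPDATE test SET data = data - 'name' WHERE id = 1;                                 -- delete a key
UPDATE test SET data = data #- '{tags,0}' WHERE id = 1;                            -- delete an array element by path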
I wrote a small function for myself that works recursively in Postgres 9.4. I had the same problem (good that they solved some of this headache in Postgres 9.5).
Anyway here is the function (I hope it works well for you):
CREATE OR REPLACE FUNCTION jsonb_update(val1 JSONB,val2 JSONB)
RETURNS JSONB AS $$
DECLARE
result JSONB;
v RECORD;
BEGIN
IF jsonb_typeof(val2) = 'null'
THEN
RETURN val1;
END IF;
result = val1;
FOR v IN SELECT key, value FROM jsonb_each(val2) LOOP
IF jsonb_typeof(val2->v.key) = 'object'
THEN
result = result || jsonb_build_object(v.key, jsonb_update(val1->v.key, val2->v.key));
ELSE
result = result || jsonb_build_object(v.key, v.value);
END IF;
END LOOP;
RETURN result;
END;
$$ LANGUAGE plpgsql;
Here is sample use:
select jsonb_update('{"a":{"b":{"c":{"d":5,"dd":6},"cc":1}},"aaa":5}'::jsonb, '{"a":{"b":{"c":{"d":15}}},"aa":9}'::jsonb);
jsonb_update
---------------------------------------------------------------------
{"a": {"b": {"c": {"d": 15, "dd": 6}, "cc": 1}}, "aa": 9, "aaa": 5}
(1 row)
As you can see, it analyzes deep down and updates/adds values where needed.
Maybe:
UPDATE test SET data = '"my-other-name"'::json WHERE id = 1;
It worked in my case, where data is of type json.
Matheus de Oliveira created handy functions for JSON CRUD operations in PostgreSQL. They can be imported using the \i directive. Note the jsonb fork of the functions if jsonb is your data type.
9.3 json https://gist.github.com/matheusoliveira/9488951
9.4 jsonb https://gist.github.com/inindev/2219dff96851928c2282
Updating the whole column worked for me:
UPDATE test SET data='{"name": "my-other-name", "tags": ["tag1", "tag2"]}' where id=1;
I want to do a large update on a table in PostgreSQL, but I don't need the transactional integrity to be maintained across the entire operation, because I know that the column I'm changing is not going to be written to or read during the update. I want to know if there is an easy way in the psql console to make these types of operations faster.
For example, let's say I have a table called "orders" with 35 million rows, and I want to do this:
UPDATE orders SET status = null;
To avoid being diverted to an offtopic discussion, let's assume that all the values of status for the 35 million rows are currently set to the same (non-null) value, thus rendering an index useless.
The problem with this statement is that it takes a very long time to go into effect (solely because of the locking), and all changed rows are locked until the entire update is complete. This update might take 5 hours, whereas something like
UPDATE orders SET status = null WHERE (order_id > 0 and order_id < 1000000);
might take 1 minute. Over 35 million rows, breaking the operation into 35 chunks like this would only take 35 minutes and save me 4 hours and 25 minutes.
I could break it down even further with a script (using pseudocode here):
for (i = 0 to 3500) {
    db_operation("UPDATE orders SET status = null " +
                 "WHERE order_id > " + (i*1000) +
                 " AND order_id < " + ((i+1)*1000));
}
This operation might complete in only a few minutes, rather than 35.
So that comes down to what I'm really asking. I don't want to write a freaking script to break down operations every single time I want to do a big one-time update like this. Is there a way to accomplish what I want entirely within SQL?
Column / Row
... I don't need the transactional integrity to be maintained across
the entire operation, because I know that the column I'm changing is
not going to be written to or read during the update.
Any UPDATE in PostgreSQL's MVCC model writes a new version of the whole row. If concurrent transactions change any column of the same row, time-consuming concurrency issues arise. Details in the manual. Knowing the same column won't be touched by concurrent transactions avoids some possible complications, but not others.
Index
To avoid being diverted to an offtopic discussion, let's assume that
all the values of status for the 35 million columns are currently set
to the same (non-null) value, thus rendering an index useless.
When updating the whole table (or major parts of it) Postgres never uses an index. A sequential scan is faster when all or most rows have to be read. On the contrary: Index maintenance means additional cost for the UPDATE.
Performance
For example, let's say I have a table called "orders" with 35 million
rows, and I want to do this:
UPDATE orders SET status = null;
I understand you are aiming for a more general solution (see below). But to address the actual question asked: this can be dealt with in a matter of milliseconds, regardless of table size:
ALTER TABLE orders DROP column status
, ADD column status text;
The manual (up to Postgres 10):
When a column is added with ADD COLUMN, all existing rows in the table
are initialized with the column's default value (NULL if no DEFAULT
clause is specified). If there is no DEFAULT clause, this is merely a metadata change [...]
The manual (since Postgres 11):
When a column is added with ADD COLUMN and a non-volatile DEFAULT
is specified, the default is evaluated at the time of the statement
and the result stored in the table's metadata. That value will be used
for the column for all existing rows. If no DEFAULT is specified,
NULL is used. In neither case is a rewrite of the table required.
Adding a column with a volatile DEFAULT or changing the type of an
existing column will require the entire table and its indexes to be
rewritten. [...]
And:
The DROP COLUMN form does not physically remove the column, but
simply makes it invisible to SQL operations. Subsequent insert and
update operations in the table will store a null value for the column.
Thus, dropping a column is quick but it will not immediately reduce
the on-disk size of your table, as the space occupied by the dropped
column is not reclaimed. The space will be reclaimed over time as
existing rows are updated.
Make sure you don't have objects depending on the column (foreign key constraints, indexes, views, ...). You would need to drop / recreate those. Barring that, tiny operations on the system catalog table pg_attribute do the job. It requires an exclusive lock on the table, which may be a problem under heavy concurrent load (like Buurman emphasizes in his comment). Barring that, the operation is a matter of milliseconds.
If you have a column default you want to keep, add it back in a separate command. Doing it in the same command applies it to all rows immediately. See:
Add new column without table lock?
To actually apply the default, consider doing it in batches:
Does PostgreSQL optimize adding columns with non-NULL DEFAULTs?
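A sketch of what that could look like for the orders example (the 'new' default value is made up for illustration):
ALTER TABLE orders DROP COLUMN status, ADD COLUMN status text;  -- metadata-only, no default yet
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'new';       -- only affects rows inserted from now on

-- then back-fill existing rows in batches, for example:
UPDATE orders SET status = 'new'
WHERE  order_id BETWEEN 1 AND 1000000
AND    status IS NULL;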
General solution
dblink has been mentioned in another answer. It allows access to "remote" Postgres databases in implicit separate connections. The "remote" database can be the current one, thereby achieving "autonomous transactions": what the function writes in the "remote" db is committed and can't be rolled back.
This allows you to run a single function that updates a big table in smaller parts, with each part committed separately. It avoids building up transaction overhead for very big numbers of rows and, more importantly, releases locks after each part. This lets concurrent operations proceed without much delay and makes deadlocks less likely.
If you don't have concurrent access, this is hardly useful - except to avoid ROLLBACK after an exception. Also consider SAVEPOINT for that case.
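A sketch of the SAVEPOINT variant: everything stays in one transaction, but a failure in a later batch does not throw away the work of earlier batches:
BEGIN;
UPDATE orders SET status = 'foo' WHERE order_id <  1000000;
SAVEPOINT batch_1;
UPDATE orders SET status = 'foo' WHERE order_id >= 1000000 AND order_id < 2000000;
-- if this batch fails: ROLLBACK TO SAVEPOINT batch_1;  -- keeps the first batch's work
COMMIT;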
Disclaimer
First of all, lots of small transactions are actually more expensive. This only makes sense for big tables. The sweet spot depends on many factors.
If you are not sure what you are doing: a single transaction is the safe method. For this to work properly, concurrent operations on the table have to play along. For instance: concurrent writes can move a row to a partition that's supposedly already processed. Or concurrent reads can see inconsistent intermediary states. You have been warned.
Step-by-step instructions
The additional module dblink needs to be installed first:
How to use (install) dblink in PostgreSQL?
Setting up the connection with dblink very much depends on the setup of your DB cluster and the security policies in place. It can be tricky. A related, later answer with more on how to connect with dblink:
Persistent inserts in a UDF even if the function aborts
Create a FOREIGN SERVER and a USER MAPPING as instructed there to simplify and streamline the connection (unless you have one already).
Assuming a serial PRIMARY KEY with or without some gaps.
CREATE OR REPLACE FUNCTION f_update_in_steps()
RETURNS void AS
$func$
DECLARE
_step int; -- size of step
_cur int; -- current ID (starting with minimum)
_max int; -- maximum ID
BEGIN
SELECT INTO _cur, _max min(order_id), max(order_id) FROM orders;
-- 100 slices (steps) hard coded
_step := ((_max - _cur) / 100) + 1; -- rounded, possibly a bit too small
-- +1 to avoid endless loop for 0
PERFORM dblink_connect('myserver'); -- your foreign server as instructed above
FOR i IN 0..200 LOOP -- 200 >> 100 to make sure we exceed _max
PERFORM dblink_exec(
$$UPDATE public.orders
SET status = 'foo'
WHERE order_id >= $$ || _cur || $$
AND order_id < $$ || _cur + _step || $$
AND status IS DISTINCT FROM 'foo'$$); -- avoid empty update
_cur := _cur + _step;
EXIT WHEN _cur > _max; -- stop when done (never loop till 200)
END LOOP;
PERFORM dblink_disconnect();
END
$func$ LANGUAGE plpgsql;
Call:
SELECT f_update_in_steps();
You can parameterize any part according to your needs: the table name, column name, value, ... just be sure to sanitize identifiers to avoid SQL injection:
Table name as a PostgreSQL function parameter
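As a hypothetical illustration of such parameterization (the variables _tbl, _col and _val are made up), the dblink_exec call inside the function could build its statement with format(), using %I for identifiers and %L for literals:
PERFORM dblink_exec(
   format('UPDATE public.%I SET %I = %L WHERE order_id >= %s AND order_id < %s AND %I IS DISTINCT FROM %L'
        , _tbl, _col, _val, _cur, _cur + _step, _col, _val));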
Avoid empty UPDATEs:
How do I (or can I) SELECT DISTINCT on multiple columns?
Postgres uses MVCC (multi-version concurrency control), thus avoiding any locking if you are the only writer; any number of concurrent readers can work on the table, and there won't be any locking.
So if it really takes 5h, it must be for a different reason (e.g. that you do have concurrent writes, contrary to your claim that you don't).
You should delegate this column to another table like this:
create table order_status (
order_id int not null references orders(order_id) primary key,
status int not null
);
Then your operation of setting status=NULL will be instant:
truncate order_status;
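For reads, the delegated column can then be pulled back in with a join; rows without an entry in order_status simply come back with a NULL status:
SELECT o.*, s.status
FROM   orders o
LEFT   JOIN order_status s USING (order_id);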
First of all - are you sure that you need to update all rows?
Perhaps some of the rows already have status NULL?
If so, then:
UPDATE orders SET status = null WHERE status is not null;
As for partitioning the change: that's not possible in pure SQL. All updates run in a single transaction.
One possible way to do it in "pure sql" would be to install dblink, connect to the same database using dblink, and then issue a lot of updates over dblink, but it seems like overkill for such a simple task.
Usually just adding a proper WHERE clause solves the problem. If it doesn't, just partition it manually. Writing a script is too much; you can usually make it a simple one-liner:
perl -e '
for (my $i = 0; $i <= 3500000; $i += 1000) {
printf "UPDATE orders SET status = null WHERE status is not null
and order_id between %u and %u;\n",
$i, $i+999
}
'
I wrapped the lines here for readability; generally it's a single line. The output of the above command can be fed to psql directly:
perl -e '...' | psql -U ... -d ...
Or first to file and then to psql (in case you'd need the file later on):
perl -e '...' > updates.partitioned.sql
psql -U ... -d ... -f updates.partitioned.sql
I am by no means a DBA, but a database design where you'd frequently have to update 35 million rows might have… issues.
A simple WHERE status IS NOT NULL might speed things up quite a bit (provided you have an index on status). Not knowing the actual use case, I'm assuming that if this is run frequently, a great part of the 35 million rows might already have a null status.
However, you can loop inside the database with the plpgsql LOOP statement. I'll just cook up a small example:
CREATE OR REPLACE FUNCTION nullstatus(count INTEGER) RETURNS integer AS $$
DECLARE
i INTEGER := 0;
BEGIN
FOR i IN 0..(count/1000 + 1) LOOP
UPDATE orders SET status = null WHERE (order_id > (i*1000) and order_id <((i+1)*1000));
RAISE NOTICE 'Count: % and i: %', count,i;
END LOOP;
RETURN 1;
END;
$$ LANGUAGE plpgsql;
It can then be run by doing something akin to:
SELECT nullstatus(35000000);
You might want to select the row count first, but beware that computing the exact row count can take a lot of time; the PostgreSQL wiki has an article about slow counting and how to avoid it.
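If a rough number is enough, a common alternative is to read the planner's estimate instead of counting (the figure can be stale until the next ANALYZE):
SELECT reltuples::bigint AS estimated_rows
FROM   pg_class
WHERE  oid = 'orders'::regclass;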
Also, the RAISE NOTICE part is just there to keep track of how far along the script is. If you're not monitoring the notices, or do not care, it would be better to leave it out.
Are you sure this is because of locking? I don't think so; there are many other possible reasons. To find out, you can always try to do just the locking. Try this:
BEGIN;
SELECT NOW();
SELECT * FROM orders FOR UPDATE;
SELECT NOW();
ROLLBACK;
To understand what's really happening you should run an EXPLAIN first (EXPLAIN UPDATE orders SET status...) and/or EXPLAIN ANALYZE. Maybe you'll find out that you don't have enough memory to do the UPDATE efficiently. If so, SET work_mem TO 'xxxMB'; might be a simple solution.
Also, tail the PostgreSQL log to see whether any performance-related problems occur.
I would use CTAS:
begin;
create table T as select col1, col2, ..., <new value>, colN from orders;
drop table orders;
alter table T rename to orders;
commit;
Some options that haven't been mentioned:
Use the new-table trick. Probably what you'd have to do in your case is write some triggers to handle it, so that changes to the original table also get propagated to your table copy, something like that (percona is an example of a tool that does it the trigger way). Another option might be the "create a new column, then replace the old one with it" trick, to avoid locks (unclear if it helps with speed).
Possibly calculate the max ID, then generate "all the queries you need" and pass them in as a single query, like update X set Y = NULL where ID < 10000 and ID >= 0; update X set Y = NULL where ID < 20000 and ID >= 10000; ... Then it might not do as much locking while still being all SQL, though you do have extra logic up front to do it :(
PostgreSQL version 11 handles this for you automatically with the Fast ALTER TABLE ADD COLUMN with a non-NULL default feature. Please do upgrade to version 11 if possible.
An explanation is provided in this blog post.