I have a million-row table in Postgres 13 that needs a one-time update of each row: the (golang) script will read the current column value for each row, transform it, then update the row with the new value, for example:
DECLARE c1 CURSOR FOR SELECT v FROM users;
FETCH c1;
-- read and transform v
UPDATE users SET v = ? WHERE CURRENT OF c1;
-- transaction committed
FETCH c1;
...
I'm familiar with cursors for reading, but have a few requirements for writing that I'm struggling to find the right settings for:
I don't want it all to run in a single huge transaction, which is the default with cursors, since the change set will be large and it will take a while. I'd rather each update be its own transaction, and I can re-run the idempotent script again if it fails for any reason. I'm aware of DECLARE WITH HOLD to have the cursor span transactions, but...
By default the data read by the cursor is "insensitive" (a snapshot from when the cursor was first created), but I would like the latest data for each row with FETCH in case there has been a subsequent update. The solution to that is to use FOR UPDATE in the cursor query to make it "sensitive," but that is not allowed together with WITH HOLD. I would prefer the row lock you get with FOR UPDATE to prevent the read-then-write race condition between FETCH and UPDATE, but it's not mandatory.
How can I iterate all rows and update them one at a time without having to read everything into memory first?
Declare the cursor WITH HOLD, but select the primary key rather than v. Then, in the loop, select the now-current v from the table by primary key (rather than using WHERE CURRENT OF c1), and update the row by primary key as well.
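To make the shape of that loop concrete, here is a minimal sketch. It is not Postgres or Go: it uses Python's built-in sqlite3 as a stand-in so it can run anywhere, and it swaps the WITH HOLD cursor for keyset pagination over the primary key (fetch the next few keys greater than the last one seen). The id column, the transform function and the batch size are assumptions for illustration only.

```python
import sqlite3


def transform(v):
    # Hypothetical transformation; idempotent, so the script can be re-run.
    return v.upper()


def update_each_row(conn, batch_size=2):
    """Walk the table by primary key, re-reading and updating one row per transaction."""
    last_id = 0
    updated = 0
    while True:
        # Fetch only the next batch of keys, never the whole table.
        ids = [r[0] for r in conn.execute(
            "SELECT id FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size))]
        if not ids:
            return updated
        for pk in ids:
            # Re-read the now-current value by key, then write it back by key.
            row = conn.execute("SELECT v FROM users WHERE id = ?", (pk,)).fetchone()
            if row is not None:  # the row may have been deleted in the meantime
                conn.execute("UPDATE users SET v = ? WHERE id = ?",
                             (transform(row[0]), pk))
                updated += 1
            conn.commit()  # each row is its own transaction
        last_id = ids[-1]
```

In Postgres the re-read could be written as SELECT ... FOR UPDATE inside the same transaction as the UPDATE, which should give the row lock the question asks about.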
I have to check when a table is inserted to/updated to see if a column value exists for the same HotelID and different RoomNo in the same table. I'm thinking that an INSTEAD OF trigger on the table would be a good option, but I read that it's a bad idea to update/insert the table the trigger executes on inside the trigger and you should create the trigger on a view instead (which raises more questions for me)
Is it ok to create a trigger like this? Is there a better option?
CREATE TRIGGER dbo.tgr_tblInterfaceRoomMappingUpsert
ON dbo.tblInterfaceRoomMapping
INSTEAD OF INSERT, UPDATE
AS
BEGIN
SET NOCOUNT ON;
DECLARE @txtRoomNo nvarchar(20)
SELECT @txtRoomNo = r.Sonifi_RoomNo
FROM dbo.tblInterfaceRoomMapping r
INNER JOIN INSERTED i
ON r.iHotelID = i.iHotelID
AND r.Sonifi_RoomNo = i.Sonifi_RoomNo
AND r.txtRoomNo <> i.txtRoomNo
IF @txtRoomNo IS NULL
BEGIN
-- Insert/update the record
END
ELSE
BEGIN
-- Raise error
END
END
GO
So it sounds like you only want 1 row per combo of HotelID and Sonifi_RoomNo.
CREATE UNIQUE INDEX UQ_dbo_tblInterfaceRoomMapping
ON dbo.tblInterfaceRoomMapping(iHotelID, Sonifi_RoomNo);
Now if you try and put a second row with the same values, it will bark at you.
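As a rough illustration of what that looks like from the application side, here is a small sketch. It uses Python's built-in sqlite3 as a stand-in for SQL Server (so the table and column names are simplified, hypothetical versions of the ones in the question), but the idea carries over: the unique index rejects the second row, and the caller handles the error.

```python
import sqlite3


def make_db():
    # Simplified, hypothetical version of tblInterfaceRoomMapping.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE room_mapping ("
        "hotel_id INTEGER, sonifi_room_no TEXT, txt_room_no TEXT)")
    conn.execute(
        "CREATE UNIQUE INDEX uq_room_mapping "
        "ON room_mapping (hotel_id, sonifi_room_no)")
    return conn


def try_insert(conn, hotel_id, sonifi_room_no, txt_room_no):
    """Return True if the row was inserted, False if the unique index rejected it."""
    try:
        conn.execute(
            "INSERT INTO room_mapping (hotel_id, sonifi_room_no, txt_room_no) "
            "VALUES (?, ?, ?)",
            (hotel_id, sonifi_room_no, txt_room_no))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # Same hotel and Sonifi room number already mapped: nothing is written.
        conn.rollback()
        return False
```

Because the check lives in the index, it applies to multi-row statements too, which is where the trigger in the question runs into trouble.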
It's (usually) not okay to create a trigger like that.
Your trigger assumes that only single-row inserts or updates will ever occur. Is that guaranteed?
What will the value of @txtRoomNo be if multiple rows are inserted or updated in the same batch?
E.g., if an update against the table produces one row with correct data and one row with incorrect data, how do you think your trigger would cope? Remember that triggers fire once per INSERT/UPDATE statement, not once per row.
Depending on your requirements, you could keep the INSTEAD OF trigger concept; however, I would suggest separate triggers for inserts and for updates.
In each you can then insert / update and include a where not exists clause to only allow valid inserts / updates, ignoring inserting or updating anything invalid.
I would avoid raising an error in the trigger, if you need to handle bad data you could also insert into some logging table with the reverse where exists logic and then handle separately.
Ultimately though, it would be best for the application to check if the roomNo is already used.
I am hoping that I can articulate this effectively, so here it goes:
I am creating a model which will be run on a platform by users, possibly simultaneously, but each model run is marked by a unique integer identifier. This model will execute a series of PostgreSQL queries and eventually write a result elsewhere.
Now because of the required parallelization of model runs, I have to make sure that the processes will not collide, despite running in the same database. I am at a point now where I have to store a list of records, sorted by a score variable and then operate on them. This is the beginning of the query:
DO
$$
DECLARE row RECORD;
BEGIN
DROP TABLE IF EXISTS ranked_clusters;
CREATE TEMP TABLE ranked_clusters AS (
SELECT
pl.cluster_id AS c_id,
SUM(pl.total_area) AS cluster_score
FROM
emob.parking_lots AS pl
WHERE
pl.cluster_id IS NOT NULL
AND
run_id = 2005149
GROUP BY
pl.cluster_id
ORDER BY
cluster_score DESC
);
FOR row IN SELECT c_id FROM ranked_clusters LOOP
RAISE NOTICE 'Cluster %', row.c_id;
END LOOP;
END;
$$ LANGUAGE plpgsql;
So I create a temporary table called ranked_clusters and then iterate through it, at the moment just logging the identifiers of each record.
I have been careful to only build this list from records which have a run_id value equal to a certain number, so data from the same source, but with a different number will be ignored.
What I am worried about however is that a simultaneous process will also create its own ranked_clusters temporary table, which will collide with the first one, invalidating the results.
So my question is essentially this: Are temporary tables only visible to the session which creates them (or to the cursor object from say, Python)? And is it therefore safe to use a temporary table in this way?
The main reason I ask is because I see that these so-called "temporary" tables seem to persist after I execute the query in PgAdmin III, and the query fails on the next execution because the table already exists. This troubles me because it seems as though the tables are actually globally accessible during their lifetime and would therefore introduce the possibility of a collision when a simultaneous run occurs.
Thanks @a_horse_with_no_name for the explanation, but I am not yet convinced that it is safe, because I have been able to execute the following code:
import psycopg2 as pg2
conn = pg2.connect(dbname=CONFIG["GEODB_NAME"],
user=CONFIG["GEODB_USER"],
password=CONFIG["GEODB_PASS"],
host=CONFIG["GEODB_HOST"],
port=CONFIG["GEODB_PORT"])
conn.autocommit = True
cur = conn.cursor()
conn2 = pg2.connect(dbname=CONFIG["GEODB_NAME"],
user=CONFIG["GEODB_USER"],
password=CONFIG["GEODB_PASS"],
host=CONFIG["GEODB_HOST"],
port=CONFIG["GEODB_PORT"])
conn2.autocommit = True
cur2 = conn.cursor()
cur.execute("CREATE TEMPORARY TABLE temptable (tempcol INTEGER); INSERT INTO temptable VALUES (0);")
cur2.execute("SELECT tempcol FROM temptable;")
print(cur2.fetchall())
And I receive the value in temptable despite it being created as a temporary table in a completely different connection from the one that queries it afterwards. Am I missing something here? Because it seems like the temporary table is indeed accessible between connections.
The above had a typo: both cursors were actually being spawned from conn, rather than one from conn and another from conn2. Separate connections in psycopg2 cannot access each other's temporary tables, but cursors spawned from the same connection can.
Temporary tables are only visible to the session (=connection) that created them. Even if two sessions create the same table, they won't interfere with each other.
Temporary tables are removed automatically when the session is disconnected.
If you want to automatically remove them when your transaction ends, use the ON COMMIT DROP option when creating the table.
So the answer is: yes, this is safe.
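If you want to convince yourself without a Postgres server at hand, here is a small sketch of the same experiment. It uses Python's built-in sqlite3 as a stand-in for Postgres and psycopg2 (SQLite also scopes temporary tables to the connection that created them), and the database file name is just a throwaway assumption. Two connections each create a ranked_clusters temp table and only ever see their own rows.

```python
import os
import sqlite3
import tempfile


def temp_tables_are_private():
    """Create a same-named temp table on two connections; return what each one sees."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")  # throwaway database file
        conn1 = sqlite3.connect(path)
        conn2 = sqlite3.connect(path)
        try:
            conn1.execute("CREATE TEMP TABLE ranked_clusters (c_id INTEGER)")
            conn1.execute("INSERT INTO ranked_clusters VALUES (1)")
            # Same table name on the second connection: no "already exists" error.
            conn2.execute("CREATE TEMP TABLE ranked_clusters (c_id INTEGER)")
            conn2.execute("INSERT INTO ranked_clusters VALUES (2)")
            seen1 = [r[0] for r in conn1.execute("SELECT c_id FROM ranked_clusters")]
            seen2 = [r[0] for r in conn2.execute("SELECT c_id FROM ranked_clusters")]
            return seen1, seen2
        finally:
            conn1.close()
            conn2.close()
```

Repeating this against Postgres with two real psycopg2 connections should behave the same way, as long as each cursor really comes from its own connection.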
Unrelated, but: you can't store rows "in a sorted way". Rows in a table have no implicit sort order. The only way you can get a guaranteed sort order is to use an ORDER BY when selecting the rows. The ORDER BY that is part of your CREATE TABLE AS statement is pretty much useless.
If you have to rely on the sort order of the rows, the only safe way to do that is in the SELECT statement:
FOR row IN SELECT c_id FROM ranked_clusters ORDER BY cluster_score DESC
LOOP
RAISE NOTICE 'Cluster %', row.c_id;
END LOOP;
PostgreSQL uses the read committed isolation level. I have a transaction that consists of a single DELETE statement, and this DELETE has a subquery consisting of a SELECT statement that selects the rows to delete.
Is it true that I have to use FOR UPDATE in the select statement to get no conflicts with other transaction?
My thinking is the following: First the corresponding rows are read out from the table and in a second step these rows are deleted, so another transaction could interfere.
And what about a simple DELETE FROM myTable WHERE id = 4 statement? Do I also have to use FOR UPDATE?
Is it true that I have to use FOR UPDATE in the select statement to
get no conflicts with other transaction?
What does "no conflicts with other transaction" mean to you? You can test this by opening two terminals and executing statements in each of them. Interleaved correctly, the DELETE statement will make the "other transaction" (the one that has its isolation level set to READ COMMITTED) wait until the deleting transaction commits or rolls back.
sandbox=# set transaction isolation level read committed;
SET
sandbox=# select * from customer;
date_of_birth
---------------
1996-09-29
1996-09-28
(2 rows)
(Terminal 1)
sandbox=# begin transaction;
BEGIN
sandbox=# delete from customer
sandbox-# where date_of_birth = '1996-09-28';
DELETE 1
(Terminal 2)
sandbox=# update customer
sandbox-# set date_of_birth = '1900-01-01'
sandbox-# where date_of_birth = '1996-09-28';
(Execution pauses here, waiting for transaction in other terminal.)
(Terminal 1)
sandbox=# commit;
COMMIT
sandbox=#
(Terminal 2)
UPDATE 0
sandbox=#
See below for the documentation.
And what about a simple DELETE FROM myTable WHERE id = 4 statement? Do
I also have to use FOR UPDATE?
There's no such statement as DELETE . . . FOR UPDATE.
You need to be sensitive to context when you're reading about database updates. Update can mean any change to a database; it can include inserting, deleting, and updating rows. In the docs cited below, "locked as though for update" is explicitly talking about UPDATE and DELETE statements, among others.
Current docs
FOR UPDATE causes the rows retrieved by the SELECT statement to be
locked as though for update. This prevents them from being modified or
deleted by other transactions until the current transaction ends. That
is, other transactions that attempt UPDATE, DELETE, SELECT FOR UPDATE,
SELECT FOR NO KEY UPDATE, SELECT FOR SHARE or SELECT FOR KEY SHARE of
these rows will be blocked until the current transaction ends. The FOR
UPDATE lock mode is also acquired by any DELETE on a row, and also by
an UPDATE that modifies the values on certain columns. Currently, the
set of columns considered for the UPDATE case are those that have an
unique index on them that can be used in a foreign key (so partial
indexes and expressional indexes are not considered), but this may
change in the future. Also, if an UPDATE, DELETE, or SELECT FOR UPDATE
from another transaction has already locked a selected row or rows,
SELECT FOR UPDATE will wait for the other transaction to complete, and
will then lock and return the updated row (or no row, if the row was
deleted).
Short version: the FOR UPDATE in a sub-select is not necessary because the DELETE implementation already does the necessary locking. It would be redundant.
Ideally you should read and digest Concurrency Control to learn how the concurrency issues are dealt with by the SQL engine.
Specifically for the case you're mentioning, I think these couple of excerpts are the most relevant, in Read Committed Isolation Level:
UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands
behave the same as SELECT in terms of searching for target rows: they
will only find target rows that were committed as of the command start
time.
However, such a target row might have already been updated (or
deleted or locked) by another concurrent transaction by the time it is
found. In this case, the would-be updater will wait for the first
updating transaction to commit or roll back (if it is still in
progress).
So one of your two concurrent DELETEs will be made to wait as soon as it tries to delete a row that the other one has already processed. This wait ends only when the other one commits or rolls back. In a way, that means that the engine "detected the conflict" and serialized the two DELETEs in order to deal with it.
If the first updater rolls back, then its effects are
negated and the second updater can proceed with updating the
originally found row. If the first updater commits, the second updater
will ignore the row if the first updater deleted it, otherwise it will
attempt to apply its operation to the updated version of the row.
In your scenario, after the first DELETE has committed and the second one is woken up, the second one will be unable to delete the row it was waiting for, because that row is no longer current: it's gone. That's not an error in this isolation level. The execution will just go on with the other rows, some of which may also have disappeared. Eventually it will report the actual number of rows deleted by this statement, which may differ from the number that the sub-select initially found before the statement was made to wait.
I have a table in my SQL Server 2008 R2 database, and would like to add a column called LastUpdated, that will automatically be changed every time the row is updated. That way, I can see when each individual row was last updated.
It seems that SQL Server 2008 R2 doesn't have a data type to handle this like earlier versions did, so I'm not sure of the best way to do it. I wondered about using a trigger, but what would happen when the trigger updated the row? Will that fire the trigger again, etc?
To know which row was last updated, you need to create a new column of type DATETIME/DATETIME2 and update it with a trigger. There is no data type that automatically updates itself with date/time information every time the row is updated.
To avoid recursion you can use the UPDATE() function inside the trigger, e.g.
ALTER TRIGGER dbo.SetLastUpdatedBusiness
ON dbo.Businesses
AFTER UPDATE -- not insert!
AS
BEGIN
IF NOT UPDATE(LastUpdated)
BEGIN
UPDATE t
SET t.LastUpdated = CURRENT_TIMESTAMP -- not dbo.LastUpdated!
FROM dbo.Businesses AS t -- not b!
INNER JOIN inserted AS i
ON t.ID = i.ID;
END
END
GO
In modern versions you can trick SQL Server into doing this using temporal tables:
Maintaining LastModified Without Triggers
But this is full of caveats and limitations and was really only making light of multiple other similar posts:
A System-Maintained LastModifiedDate Column
Tracking Row Changes With Temporal Columns
How to add “created” and “updated” timestamps without triggers
Need a datetime column that automatically updates
It's not that easy, unfortunately.
You can add a new DATETIME (or DATETIME2) field to your table, and you can give it a default constraint of GETDATE() - that will set the value when a new row is inserted.
Unfortunately, other than creating an AFTER UPDATE trigger, there is no "out of the box" way to keep it updated all the time. The trigger per se isn't hard to write, but you'll have to write it for each and every table that should have that feature.
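For a feel of how the trigger approach behaves end to end, here is a small runnable sketch. It is not T-SQL: it uses Python's built-in sqlite3 as a stand-in, so the syntax differs (SQLite triggers run per row and use NEW instead of the inserted pseudo-table), and the table and column names are hypothetical. The column list in AFTER UPDATE OF plays a role loosely comparable to a recursion guard, since the trigger's own write to last_updated does not name that column.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE businesses (
            id INTEGER PRIMARY KEY,
            name TEXT,
            last_updated TEXT DEFAULT 'never'
        );
        -- Fires only for updates that set the name column, so the trigger's
        -- own write to last_updated does not fire it again.
        CREATE TRIGGER set_last_updated
        AFTER UPDATE OF name ON businesses
        BEGIN
            UPDATE businesses SET last_updated = datetime('now') WHERE id = NEW.id;
        END;
    """)
    return conn
```

Updating one row's name stamps only that row; rows that were not touched keep their old last_updated value.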
I want to do a large update on a table in PostgreSQL, but I don't need the transactional integrity to be maintained across the entire operation, because I know that the column I'm changing is not going to be written to or read during the update. I want to know if there is an easy way in the psql console to make these types of operations faster.
For example, let's say I have a table called "orders" with 35 million rows, and I want to do this:
UPDATE orders SET status = null;
To avoid being diverted to an offtopic discussion, let's assume that all the values of status for the 35 million columns are currently set to the same (non-null) value, thus rendering an index useless.
The problem with this statement is that it takes a very long time to go into effect (solely because of the locking), and all changed rows are locked until the entire update is complete. This update might take 5 hours, whereas something like
UPDATE orders SET status = null WHERE (order_id > 0 and order_id < 1000000);
might take 1 minute. Over 35 million rows, doing the above in 35 chunks would take only 35 minutes and save me 4 hours and 25 minutes.
I could break it down even further with a script (using pseudocode here):
for (i = 0 to 34999) {  // 35,000 chunks of 1,000 ids each
db_operation("UPDATE orders SET status = null
WHERE (order_id >= " + (i*1000) +
" AND order_id < " + ((i+1)*1000) + ")");
}
This operation might complete in only a few minutes, rather than 35.
So that comes down to what I'm really asking. I don't want to write a freaking script to break down operations every single time I want to do a big one-time update like this. Is there a way to accomplish what I want entirely within SQL?
Column / Row
... I don't need the transactional integrity to be maintained across
the entire operation, because I know that the column I'm changing is
not going to be written to or read during the update.
Any UPDATE in PostgreSQL's MVCC model writes a new version of the whole row. If concurrent transactions change any column of the same row, time-consuming concurrency issues arise. Details in the manual. Knowing the same column won't be touched by concurrent transactions avoids some possible complications, but not others.
Index
To avoid being diverted to an offtopic discussion, let's assume that
all the values of status for the 35 million columns are currently set
to the same (non-null) value, thus rendering an index useless.
When updating the whole table (or major parts of it) Postgres never uses an index. A sequential scan is faster when all or most rows have to be read. On the contrary: Index maintenance means additional cost for the UPDATE.
Performance
For example, let's say I have a table called "orders" with 35 million
rows, and I want to do this:
UPDATE orders SET status = null;
I understand you are aiming for a more general solution (see below). But to address the actual question asked: This can be dealt with in a matter of milliseconds, regardless of table size:
ALTER TABLE orders DROP column status
, ADD column status text;
The manual (up to Postgres 10):
When a column is added with ADD COLUMN, all existing rows in the table
are initialized with the column's default value (NULL if no DEFAULT
clause is specified). If there is no DEFAULT clause, this is merely a metadata change [...]
The manual (since Postgres 11):
When a column is added with ADD COLUMN and a non-volatile DEFAULT
is specified, the default is evaluated at the time of the statement
and the result stored in the table's metadata. That value will be used
for the column for all existing rows. If no DEFAULT is specified,
NULL is used. In neither case is a rewrite of the table required.
Adding a column with a volatile DEFAULT or changing the type of an
existing column will require the entire table and its indexes to be
rewritten. [...]
And:
The DROP COLUMN form does not physically remove the column, but
simply makes it invisible to SQL operations. Subsequent insert and
update operations in the table will store a null value for the column.
Thus, dropping a column is quick but it will not immediately reduce
the on-disk size of your table, as the space occupied by the dropped
column is not reclaimed. The space will be reclaimed over time as
existing rows are updated.
Make sure you don't have objects depending on the column (foreign key constraints, indices, views, ...). You would need to drop and recreate those. The command takes an exclusive lock on the table, which may be a problem under heavy concurrent load (as Buurman emphasizes in his comment). Barring that, the operation only makes tiny changes to the system catalog table pg_attribute and is a matter of milliseconds.
If you have a column default you want to keep, add it back in a separate command. Doing it in the same command applies it to all rows immediately. See:
Add new column without table lock?
To actually apply the default, consider doing it in batches:
Does PostgreSQL optimize adding columns with non-NULL DEFAULTs?
General solution
dblink has been mentioned in another answer. It allows access to "remote" Postgres databases in implicit separate connections. The "remote" database can be the current one, thereby achieving "autonomous transactions": what the function writes in the "remote" db is committed and can't be rolled back.
This makes it possible to run a single function that updates a big table in smaller parts, with each part committed separately. It avoids building up transaction overhead for very big numbers of rows and, more importantly, releases locks after each part. This allows concurrent operations to proceed without much delay and makes deadlocks less likely.
If you don't have concurrent access, this is hardly useful - except to avoid ROLLBACK after an exception. Also consider SAVEPOINT for that case.
Disclaimer
First of all, lots of small transactions are actually more expensive. This only makes sense for big tables. The sweet spot depends on many factors.
If you are not sure what you are doing: a single transaction is the safe method. For this to work properly, concurrent operations on the table have to play along. For instance: concurrent writes can move a row to a partition that's supposedly already processed. Or concurrent reads can see inconsistent intermediary states. You have been warned.
Step-by-step instructions
The additional module dblink needs to be installed first:
How to use (install) dblink in PostgreSQL?
Setting up the connection with dblink depends very much on the setup of your DB cluster and the security policies in place. It can be tricky. A related later answer has more on how to connect with dblink:
Persistent inserts in a UDF even if the function aborts
Create a FOREIGN SERVER and a USER MAPPING as instructed there to simplify and streamline the connection (unless you have one already).
Assuming a serial PRIMARY KEY with or without some gaps.
CREATE OR REPLACE FUNCTION f_update_in_steps()
RETURNS void AS
$func$
DECLARE
_step int; -- size of step
_cur int; -- current ID (starting with minimum)
_max int; -- maximum ID
BEGIN
SELECT INTO _cur, _max min(order_id), max(order_id) FROM orders;
-- 100 slices (steps) hard coded
_step := ((_max - _cur) / 100) + 1; -- rounded, possibly a bit too small
-- +1 to avoid endless loop for 0
PERFORM dblink_connect('myserver'); -- your foreign server as instructed above
FOR i IN 0..200 LOOP -- 200 >> 100 to make sure we exceed _max
PERFORM dblink_exec(
$$UPDATE public.orders
SET status = 'foo'
WHERE order_id >= $$ || _cur || $$
AND order_id < $$ || _cur + _step || $$
AND status IS DISTINCT FROM 'foo'$$); -- avoid empty update
_cur := _cur + _step;
EXIT WHEN _cur > _max; -- stop when done (never loop till 200)
END LOOP;
PERFORM dblink_disconnect();
END
$func$ LANGUAGE plpgsql;
Call:
SELECT f_update_in_steps();
You can parameterize any part according to your needs: the table name, column name, value, ... just be sure to sanitize identifiers to avoid SQL injection:
Table name as a PostgreSQL function parameter
Avoid empty UPDATEs:
How do I (or can I) SELECT DISTINCT on multiple columns?
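To see how the slicing arithmetic in f_update_in_steps() plays out, here is a small sketch of just that part in Python (no database involved). The function name id_slices is made up for illustration; the step formula and the half-open ranges mirror the _step, _cur and _max logic above.

```python
def id_slices(min_id, max_id, slices=100):
    """Return half-open [lo, hi) ranges that together cover min_id..max_id."""
    # Integer division like plpgsql; +1 keeps the step from ever being 0.
    step = (max_id - min_id) // slices + 1
    ranges = []
    cur = min_id
    while cur <= max_id:  # same stop condition as EXIT WHEN _cur > _max
        ranges.append((cur, cur + step))
        cur += step
    return ranges
```

For ids 1 to 1000 this gives a step of 10 and ranges (1, 11), (11, 21) and so on. Each range maps to one UPDATE with order_id >= lo AND order_id < hi, so no id is skipped or processed twice.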
Postgres uses MVCC (multi-version concurrency control), thus avoiding any locking if you are the only writer; any number of concurrent readers can work on the table, and there won't be any locking.
So if it really takes 5h, it must be for a different reason (e.g. that you do have concurrent writes, contrary to your claim that you don't).
You should delegate this column to another table like this:
create table order_status (
order_id int not null references orders(order_id) primary key,
status int not null
);
Then your operation of setting status=NULL will be instant:
truncate order_status;
First of all - are you sure that you need to update all rows?
Perhaps some of the rows already have status NULL?
If so, then:
UPDATE orders SET status = null WHERE status is not null;
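A quick way to see the effect of that WHERE clause is to look at the number of rows the statement reports as changed. The sketch below uses Python's built-in sqlite3 as a stand-in for Postgres, with a tiny hypothetical orders table. The second call touches nothing because every status is already NULL.

```python
import sqlite3


def null_out_status(conn):
    """Set status to NULL only where it is not NULL already; return rows touched."""
    cur = conn.execute("UPDATE orders SET status = NULL WHERE status IS NOT NULL")
    conn.commit()
    return cur.rowcount
```

Without the filter, every row would be rewritten on every run, including the ones that already hold NULL.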
As for partitioning the change: that's not possible in pure SQL. All updates run in a single transaction.
One possible way to do it in "pure SQL" would be to install dblink, connect to the same database using dblink, and then issue a lot of updates over dblink, but it seems like overkill for such a simple task.
Usually just adding a proper WHERE clause solves the problem. If it doesn't, just partition it manually. Writing a script is too much; you can usually do it with a simple one-liner:
perl -e '
for (my $i = 0; $i <= 35000000; $i += 1000) {
printf "UPDATE orders SET status = null WHERE status is not null
and order_id between %u and %u;\n",
$i, $i+999
}
'
I wrapped lines here for readability, generally it's a single line. Output of above command can be fed to psql directly:
perl -e '...' | psql -U ... -d ...
Or first to file and then to psql (in case you'd need the file later on):
perl -e '...' > updates.partitioned.sql
psql -U ... -d ... -f updates.partitioned.sql
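If Perl isn't your thing, the same statement generator can be sketched in Python. The function name batched_updates and its parameters are made up here; it uses half-open ranges (>= lower bound, < upper bound) instead of BETWEEN, which covers the same ids.

```python
def batched_updates(max_id, batch=1000):
    """Yield one UPDATE per half-open id range [lo, lo + batch) up to max_id."""
    for lo in range(0, max_id + 1, batch):
        yield (
            "UPDATE orders SET status = NULL "
            f"WHERE status IS NOT NULL AND order_id >= {lo} AND order_id < {lo + batch};"
        )
```

Printing each statement from batched_updates(35000000) on its own line gives output that can be piped to psql or saved to a file in the same way as above.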
I am by no means a DBA, but a database design where you'd frequently have to update 35 million rows might have… issues.
A simple WHERE status IS NOT NULL might speed up things quite a bit (provided you have an index on status) – not knowing the actual use case, I'm assuming if this is run frequently, a great part of the 35 million rows might already have a null status.
However, you can make loops within the query via the LOOP statement. I'll just cook up a small example:
CREATE OR REPLACE FUNCTION nullstatus(count INTEGER) RETURNS integer AS $$
DECLARE
i INTEGER := 0;
BEGIN
FOR i IN 0..(count/1000 + 1) LOOP
UPDATE orders SET status = null WHERE (order_id >= (i*1000) and order_id < ((i+1)*1000)); -- >= so that multiples of 1000 are not skipped
RAISE NOTICE 'Count: % and i: %', count,i;
END LOOP;
RETURN 1;
END;
$$ LANGUAGE plpgsql;
It can then be run by doing something akin to:
SELECT nullstatus(35000000);
You might want to select the row count, but beware that the exact row count can take a lot of time. The PostgreSQL wiki has an article about slow counting and how to avoid it.
Also, the RAISE NOTICE part is just there to keep track of how far along the script is. If you're not monitoring the notices, or do not care, it would be better to leave it out.
Are you sure this is because of locking? I don't think so, and there are many other possible reasons. To find out, you can always try to do just the locking. Try this:
BEGIN;
SELECT NOW();
SELECT * FROM orders FOR UPDATE;
SELECT NOW();
ROLLBACK;
To understand what's really happening you should run an EXPLAIN first (EXPLAIN UPDATE orders SET status...) and/or EXPLAIN ANALYZE. Maybe you'll find out that you don't have enough memory to do the UPDATE efficiently. If so, SET work_mem TO 'xxxMB'; might be a simple solution.
Also, tail the PostgreSQL log to see whether any performance-related problems occur.
I would use CTAS:
begin;
create table T as select col1, col2, ..., <new value>, colN from orders;
drop table orders;
alter table T rename to orders;
commit;
Some options that haven't been mentioned:
Use the new table trick. Probably what you'd have to do in your case is write some triggers so that changes to the original table also get propagated to your table copy, something like that... (percona is an example of something that does it the trigger way). Another option might be the "create a new column then replace the old one with it" trick, to avoid locks (unclear whether it helps with speed).
Possibly calculate the max ID, then generate "all the queries you need" and pass them in as a single query like update X set Y = NULL where ID < 10000 and ID >= 0; update X set Y = NULL where ID < 20000 and ID >= 10000; ... then it might not do as much locking, and still be all SQL, though you do have extra logic up front to do it :(
PostgreSQL version 11 handles this for you automatically with the Fast ALTER TABLE ADD COLUMN with a non-NULL default feature. Please do upgrade to version 11 if possible.
An explanation is provided in this blog post.