I need to delete the majority (say, 90%) of a very large table (say, 5m rows). The other 10% of this table is frequently read, but not written to.
From "Best way to delete millions of rows by ID", I gather that I should remove any index on the 90% I'm deleting, to speed up the process (except an index I'm using to select the rows for deletion).
From "PostgreSQL locking mode", I see that this operation will acquire a ROW EXCLUSIVE lock on the entire table. But since I'm only reading the other 10%, this ought not matter.
So, is it safe to delete everything in one command (i.e. DELETE FROM table WHERE delete_flag='t')? I'm worried that if the deletion of one row fails, triggering an enormous rollback, then it will affect my ability to read from the table. Would it be wiser to delete in batches?
Indexes are typically useless for operations on 90% of all rows. Sequential scans will be faster either way. (Exotic exceptions apply.)
If you need to allow concurrent reads, you cannot take an exclusive lock on the table. So you also cannot drop any indexes in the same transaction.
You could drop indexes in separate transactions to keep the duration of the exclusive lock at a minimum. In Postgres 9.2 or later you can also use DROP INDEX CONCURRENTLY, which only needs minimal locks. Later use CREATE INDEX CONCURRENTLY to rebuild the index in the background - and only take a very brief exclusive lock.
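Roughly like this (index, table, and column names below are placeholders):
-- cannot run inside a transaction block
DROP INDEX CONCURRENTLY some_idx;
-- ... run the big DELETE ...
-- rebuild afterwards without blocking concurrent reads and writes for long
CREATE INDEX CONCURRENTLY some_idx ON tbl (some_col);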
If you have a stable condition to identify the 10% (or fewer) of rows that stay, I would suggest a partial index on just those rows to get the best of both worlds:
Reading queries can access the table quickly (using the partial index) at all times.
The big DELETE is not going to modify the partial index at all, since none of the indexed rows are involved in the DELETE.
CREATE INDEX foo ON tbl (some_id) WHERE delete_flag = FALSE;
Assuming delete_flag is boolean. You have to include the same predicate in your queries (even if it seems logically redundant) to make sure Postgres can use the partial index.
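For example, a reading query would look something like this (table name and value are placeholders):
SELECT *
FROM   tbl
WHERE  delete_flag = FALSE   -- repeat the index predicate
AND    some_id = 123;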
Delete in batches of a specific size and sleep between deletes:
-- collect the ids of the rows to delete in a temp table
create temp table t as
select id from tbl where ...;   -- your condition for rows to be deleted

create index on t(id);

do $$
declare
    sleep      int := 5;        -- seconds to pause between batches
    batch_size int := 10000;    -- rows per batch
    c          refcursor;
    cur_id     int := 0;
    seq_id     int := 0;        -- rows fetched in the current batch
    del_id     int := 0;        -- rows fetched in total
    ts         timestamp;
begin
    <<top>>
    loop
        raise notice 'sleep % sec', sleep;
        perform pg_sleep(sleep);
        raise notice 'continue..';

        open c for select id from t order by id;

        <<inn>>
        loop
            fetch from c into cur_id;
            seq_id := seq_id + 1;
            del_id := del_id + 1;

            if cur_id is null then
                -- no ids left: delete whatever remains and finish
                raise notice 'going to delete, rows so far: %', del_id;
                ts := current_timestamp;
                close c;
                delete from tbl tb using t where tb.id = t.id;
                delete from t;
                commit;   -- needs Postgres 11+; the DO block must not run inside an outer transaction
                raise notice 'ok: %', current_timestamp - ts;
                exit top;
            elsif seq_id >= batch_size then
                -- a full batch collected: delete everything up to cur_id
                raise notice 'going to delete, rows so far: %', del_id;
                ts := current_timestamp;
                delete from tbl tb using t where t.id = tb.id and t.id <= cur_id;
                delete from t where id <= cur_id;
                close c;
                commit;   -- needs Postgres 11+
                raise notice 'ok: %', current_timestamp - ts;
                seq_id := 0;
                exit inn;
            end if;
        end loop inn;
    end loop top;
end;
$$;
First question here after years of reading answers, so apologies if I've got some things wrong!
I'm creating a very large image mosaic (e.g. a big image made of lots of small images) and I've got it working, but I'm struggling to work out how to assign the candidate images to the target locations in parallel.
I have two tables in a postgres database - tiles and candidates. Each contains 50,000 rows.
For each candidate, I want to assign it to the best target location (tile) that has not already been assigned a candidate.
My tables look like this:
create table candidates
(
id integer not null primary key,
lab_l double precision,
lab_a double precision,
lab_b double precision,
processing_by integer
);
create table tiles
(
id integer not null primary key,
pos_x integer,
pos_y integer,
lab_l double precision,
lab_a double precision,
lab_b double precision,
candidate_id integer
);
To find the best tile for a candidate I'm using this function:
create or replace function update_candidates() RETURNS integer
LANGUAGE SQL
PARALLEL SAFE
AS
$$
WITH updated_candidate AS (
SELECT id, pg_advisory_xact_lock(id)
FROM candidates
WHERE id = (
SELECT id
FROM candidates
FOR UPDATE SKIP LOCKED
LIMIT 1
)
)
UPDATE tiles
SET candidate_id = (SELECT id FROM updated_candidate)
WHERE tiles.id = (SELECT tile_id
FROM (select t.id AS tile_id,
c.id AS candidate_id,
delta_e_cie_2000(t.lab_l, t.lab_a, t.lab_b,
c.lab_l, c.lab_a, c.lab_b) AS diff
FROM tiles t,
candidates c
WHERE c.id = (SELECT id FROM updated_candidate)
AND t.candidate_id IS NULL
ORDER BY diff ASC
LIMIT 1 FOR UPDATE
) AS ticid)
RETURNING id;
$$
;
This uses the delta_e_cie_2000 function to calculate the color difference between the candidate and each of the unused tiles.
To assign all the candidates to tiles I'm using this function:
create or replace function update_all_tiles() RETURNS int
LANGUAGE plpgsql
PARALLEL SAFE
AS
$$
DECLARE
counter integer := 0;
last_id integer := 1;
BEGIN
WHILE last_id IS NOT NULL
LOOP
RAISE NOTICE 'Counter %', counter;
counter := counter + 1;
last_id := update_candidates();
RAISE NOTICE 'Updated %', last_id;
END LOOP;
RETURN counter;
END
$$
;
Running select update_all_tiles(); is working fine on a single thread. It's producing the correct output but it's pretty slow - it takes 20-30 minutes to update 50,000 tiles.
I'm running it on a machine with lots of cores and was hoping to use them all, or at least more than one.
However, if I run select update_all_tiles() in a second session, it runs for a while and then one of the processes outputs something along the lines of:
ERROR: deadlock detected
DETAIL: Process 2541 waits for ShareLock on transaction 1628083; blocked by process 1664.
Process 1664 waits for ShareLock on transaction 1628084; blocked by process 2541.
Ideally, I'd like the update_all_tiles() function itself to run in parallel so I could just call it once and it would use all the cores.
If that's not possible I'd like to be able to run the function in multiple sessions.
Is there a way I can do this while avoiding the deadlock detected error?
Any help would be greatly appreciated.
OK, I solved this specific issue like this:
create or replace function assign_next_candidate() returns int
LANGUAGE plpgsql
PARALLEL SAFE
as
$$
declare
_candidate_id integer := 0;
_tile_id integer := 0;
begin
-- first we get a candidate and lock it (using `FOR UPDATE`) so that no other process will pick it up,
-- and `SKIP LOCKED` so that we don't pick up one that another process is already working on
UPDATE candidates
set processing_by = pg_backend_pid()
WHERE id = (
SELECT id
from candidates
where processing_by is null
FOR UPDATE SKIP LOCKED
LIMIT 1
)
RETURNING id into _candidate_id;
raise notice 'Candidate %', _candidate_id;
-- now we get the least different tile that is not locked and update it, using the candidate_id from the previous query
UPDATE tiles
set candidate_id = _candidate_id
WHERE id = (
select t.id as tile_id
from tiles t,
candidates c
where c.id = _candidate_id
and t.candidate_id is null
order by delta_e_cie_2000(t.lab_l, t.lab_a, t.lab_b,
                          c.lab_l, c.lab_a, c.lab_b) asc -- ascending: least different tile first
limit 1 for update skip locked
)
RETURNING id into _tile_id;
raise notice 'Tile %', _tile_id;
return _tile_id;
end
$$;
select assign_next_candidate();
-- note: the function below does not commit anything until it has finished all assignments
create or replace function assign_all_candidates() RETURNS int
LANGUAGE plpgsql
PARALLEL SAFE
AS
$$
declare
counter integer := 0;
last_id integer := 1;
begin
while last_id is not null
loop
raise notice 'Counter %', counter;
counter := counter + 1;
last_id := assign_next_candidate();
raise notice 'Updated %', last_id;
end loop;
return counter;
end
$$
;
I found this article really helpful: https://www.2ndquadrant.com/en/blog/what-is-select-skip-locked-for-in-postgresql-9-5/
I've now got another issue which is that the queries run in parallel fine but they slow down a lot as they get further through the available rows. I think this is because the changes are not committed inside the loop in assign_all_candidates(), which means that the SKIP LOCKED queries are having to skip over a lot of candidates before finding one they can actually assign. I think I can get round that by breaking out of the loop every X records, committing those updates, and then re-calling assign_all_candidates().
It's still much much faster than it was and seems robust.
Update:
As far as I can tell there is no way to commit within a plpgsql function (at least up to version 13). You can commit within anonymous code blocks, and within procedures, but you can't have parallel procedures.
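For what it's worth, since a procedure (Postgres 11+) can commit inside its loop, the batching could also live server-side and each worker session could CALL something like the following sketch. The procedure name and batch size are made up; it just reuses assign_next_candidate() from above.
create or replace procedure assign_candidates_in_batches(_batch_size int default 1000)
language plpgsql
as $$
declare
    last_id integer := 1;
begin
    while last_id is not null loop
        for i in 1 .. _batch_size loop
            last_id := assign_next_candidate();
            exit when last_id is null;
        end loop;
        commit;   -- releases the row locks taken for this batch
    end loop;
end;
$$;
-- each worker session then runs:
call assign_candidates_in_batches(1000);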
I've ended up creating a function that takes an array/series of IDs and updates each of them. I then call this function from multiple processes (via Python) so that the results are committed after each "block" is processed.
from multiprocessing import Pool

# do_query (not shown here) opens a connection and executes a single query string
queries = [
    "select assign_candidates(array[1,2,3])",
    "select assign_candidates(array[4,5,6])"
]

with Pool(2) as p:
    p.map(do_query, queries)
Where the assign_candidates function looks something like
CREATE FUNCTION assign_candidates(_candidates integer[]) RETURNS integer
PARALLEL SAFE
LANGUAGE plpgsql
AS
$$
DECLARE
_candidate_id int := 0;
BEGIN
foreach _candidate_id in ARRAY _candidates
LOOP
perform assign_candidate(_candidate_id);
END LOOP;
RETURN 1;
END;
$$;
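assign_candidate() isn't shown above; it is presumably a single-candidate variant of assign_next_candidate() that takes the candidate id as a parameter instead of picking one itself. A rough sketch of what it might look like:
create or replace function assign_candidate(_candidate_id integer) returns int
language plpgsql
as $$
declare
    _tile_id integer;
begin
    -- claim the closest unassigned, unlocked tile for this candidate
    update tiles
    set    candidate_id = _candidate_id
    where  id = (
        select t.id
        from   tiles t,
               candidates c
        where  c.id = _candidate_id
        and    t.candidate_id is null
        order  by delta_e_cie_2000(t.lab_l, t.lab_a, t.lab_b,
                                   c.lab_l, c.lab_a, c.lab_b)
        limit  1
        for update of t skip locked
    )
    returning id into _tile_id;

    return _tile_id;
end
$$;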
This "split queries" approach scales fairly linearly as I process more rows, unlike the previous approach, which slowed down a lot as the number of rows grew (because it had to skip over a lot of locked-but-not-committed rows).
It also scaled fairly well across more processors, although doubling the number of processes does not halve the time (I guess because they are also trying to lock the same rows and skipping over each other's locked rows).
For my use case the above approach reduced the processing time from nearly an hour to a few minutes, which was good enough.
Problem statement: I am using the Repository pattern to pull and update records from a PostgreSQL DB (version 11) into a Node.JS API via the pg npm module. If two users try to modify the same record at almost the same time, the changes made by the first user to submit will be overwritten by the second user to submit. I want to prevent the second user's changes from being submitted until they have updated their local copy of the record with the first user's changes.
I know that a DB like CouchDB has a "_rev" property that it uses in this case to detect an attempt to update from a stale snapshot. As I've researched more, I've found this is called optimistic locking (Optimistic locking queue). So I'll add a rev column to my table and use it in my SQL update statement.
UPDATE tableX
SET field1 = value1,
field2 = value2,
...,
rev_field = uuid_generate_v4()
WHERE id_field = id_value
AND rev_field = rev_value
However, if id_value is a match and rev_value isn't, that won't tell my repository code that the record was stale, only that 0 rows were affected by the query.
So I've got a script that I've written in pgAdmin that will detect cases where the update affected 0 rows and then checks the rev_field.
DO $$
DECLARE
i_id numeric;
i_uuid uuid;
v_count numeric;
v_rev uuid;
BEGIN
i_id := 1;
i_uuid := '20b2e135-42d0-4a49-94c0-5557dd09abd1';
UPDATE account_r
SET account_name = 'savings',
rev = uuid_generate_v4()
WHERE account_id = i_id
AND rev = i_uuid;
GET DIAGNOSTICS v_count = ROW_COUNT;
IF v_count < 1 THEN
SELECT rev INTO v_rev
FROM account_r
WHERE account_id = i_id;
IF v_rev <> i_uuid THEN
RAISE EXCEPTION 'revision mismatch';
END IF;
END IF;
RAISE NOTICE 'rows affected: %', v_count;
END $$;
While I'm perfectly comfortable adapting this code into a stored proc and calling that from Node, I'm hoping that there's a solution that's not nearly as complex. On the one hand, moving these functions into the DB will clean up my JS code; on the other hand, this is a lot of boilerplate SQL to write, since it will have to be done for UPDATE and DELETE on each table.
Is there an easier way to get this done? (Perhaps the code at Optimistic locking queue is the easier way?) Should I be looking at an ORM to help reduce the headache here?
There is no need to maintain a rev value. You can get the md5 hash of a table's row.
SQL Fiddle Here
create table mytable (
id int primary key,
some_text text,
some_int int,
some_date timestamptz
);
insert into mytable
values (1, 'First entry', 0, now() - interval '1 day'),
(2, 'Second entry', 1, now()),
(3, 'Third entry', 2, now() + interval '1 day');
select *, md5(mytable::text) from mytable order by id;
The fiddle includes other queries to demonstrate that the calculated md5() is based on the values of the row.
Using that hash for optimistic locking, the updates can take the form:
update mytable
set some_int = -1
where id = 1
and md5(mytable::text) = <md5 hash from select>
returning *
You will still need to check for no return rows, but that could be abstracted away on the Node side.
It looks like result.rowCount contains the number of rows affected, so you will not need the returning * part.
Suppose I have some simple logic:
If the user has had no balance accrual before (which is recorded in the accruals table), we must add $100 to their balance:
START TRANSACTION;
DO LANGUAGE plpgsql $$
DECLARE _accrual accruals;
BEGIN
--LOCK TABLE accruals; -- label A
SELECT * INTO _accrual from accruals WHERE user_id = 1;
IF _accrual.accrual_id IS NOT NULL THEN
RAISE SQLSTATE '22023';
END IF;
UPDATE users SET balance = balance + 100 WHERE user_id = 1;
INSERT INTO accruals (user_id, amount) VALUES (1, 100);
END
$$;
COMMIT;
The problem with this transaction is that it is not safe under concurrency.
Running this transaction in parallel can result in user_id = 1 ending up with balance = 200 and 2 accruals recorded.
How do I test concurrency?
1. I run in session 1: START TRANSACTION; LOCK TABLE accruals;
2. In session 2 and session 3 I run this transaction
3. In session 1: ROLLBACK
The question is: how do I make this completely safe under concurrency and make sure the user gets the $100 only once?
The only way I see is to lock the table (label A in the code sample).
But is there another way?
The simplest way is probably to use the serializable isolation level (by changing default_transaction_isolation). Then one of the processes should get something like "ERROR: could not serialize access due to concurrent update"
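For example:
-- per transaction:
START TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- ... the DO block from above ...
COMMIT;
-- or for every transaction, in the session or in postgresql.conf:
SET default_transaction_isolation = 'serializable';
The transaction that receives the serialization error simply has to be retried by the client.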
If you want to keep the isolation level at 'read committed', then you can just count accruals at the end and throw an error:
START TRANSACTION;
DO LANGUAGE plpgsql $$
DECLARE _accrual accruals;
_count int;
BEGIN
SELECT * INTO _accrual from accruals WHERE user_id = 1;
IF _accrual.accrual_id IS NOT NULL THEN
RAISE SQLSTATE '22023';
END IF;
UPDATE users SET balance = balance + 100 WHERE user_id = 1;
INSERT INTO accruals (user_id, amount) VALUES (1, 100);
select count(*) into _count from accruals where user_id=1;
IF _count >1 THEN
RAISE SQLSTATE '22023';
END IF;
END
$$;
COMMIT;
This works because one process will block the other on the UPDATE (assuming non-zero number of rows get updated), and by the time one process commits to release the blocked process, its inserted row will be visible to the other one.
Formally there is then no need for the first check, but if you don't want a lot of churn due to rolled back INSERT and UPDATE, you might want to keep it.
I'm new to PostgreSQL. Assume that I have a table (tbl_box) with thousands of records, and it is growing. I want to delete 10 rows starting at a specific position (for example, records from the 50th row to the 59th row), so I wrote a function.
You can see below:
-- Function: public.signalreject()
-- DROP FUNCTION public.signalreject();
CREATE OR REPLACE FUNCTION public.signalreject()
RETURNS void AS
$BODY$
DECLARE
rec RECORD;
cur CURSOR
FOR SELECT barcode,id
FROM tbl_box where gf is null order by id desc;
counter int ;
BEGIN
-- Open the cursor
OPEN cur;
counter:=0;
LOOP
-- fetch row into the rec
FETCH cur INTO rec;
-- exit when no more row to fetch
EXIT WHEN NOT FOUND;
counter :=counter+1;
-- build the output
IF counter >= 50 and counter < 60 THEN
delete from tbl_box where barcode = rec.barcode;
END IF;
END LOOP;
-- Close the cursor
CLOSE cur;
END; $BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
ALTER FUNCTION public.signalreject()
OWNER TO Morteza;
I found that the cursor consumes memory and causes high CPU usage. What would you suggest other than a cursor?
Is this a good way to do this?
I need the fastest way, because it is important for me to delete the rows in the shortest possible time.
This seems pretty elaborate; why not just do:
delete from tbl_box
where barcode in
( select barcode
from tbl_box
where gf is null
order by id desc limit 10 offset 49
);
assuming that barcode is unique. We skip 49 rows to start deleting 10 rows from row 50.
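If barcode were not unique, one alternative (not part of the answer above, just a sketch) would be to target the physical row identifiers instead:
delete from tbl_box
where ctid in
    ( select ctid
      from tbl_box
      where gf is null
      order by id desc limit 10 offset 49
    );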
I want to know the number of rows that will be affected by an UPDATE query in a BEFORE per-statement trigger. Is that possible?
The problem is that I want to allow only queries that update up to 4 rows. If the affected row count is 5 or more, I want to raise an error.
I don't want to do this in application code because I need this check at the DB level.
Is this at all possible?
Thanks in advance for any clues on that
Write a function that updates the rows for you or performs a rollback. Sorry for poor style formatting.
create function update_max(varchar, int)
RETURNS void AS
$BODY$
DECLARE
    sql    ALIAS FOR $1;
    max    ALIAS FOR $2;
    rcount INT;
BEGIN
    EXECUTE sql;
    GET DIAGNOSTICS rcount = ROW_COUNT;
    IF rcount > max THEN
        -- raising an exception aborts the transaction, so the
        -- statement that was just executed is rolled back
        RAISE EXCEPTION 'Too many rows affected (%).', rcount;
    END IF;
END;
$BODY$ LANGUAGE plpgsql;
Then call it like
select update_max('update t1 set id=id+10 where id < 4', 3);
where the first parameter is your SQL statement and the second is the maximum number of rows it may affect.
Simon had a good idea but his implementation is unnecessarily complicated. This is my proposition:
create or replace function trg_check_max_4()
returns trigger as $$
begin
    perform true from pg_class
    where relname = 'check_max_4'
      and relnamespace = pg_my_temp_schema();
    if not FOUND then
        create temporary table check_max_4
            (value int check (value <= 4))
            on commit drop;
        insert into check_max_4 values (0);
    end if;
    update check_max_4 set value = value + 1;
    return new;
end; $$ language plpgsql;
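The function still has to be attached to the table you want to protect as a row-level trigger, e.g. (the table name t1 is just an example):
create trigger t1_check_max_4
before update on t1
for each row
execute procedure trg_check_max_4();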
I've created something like this:
begin;
create table test (
id integer
);
insert into test(id) select generate_series(1,100);
create or replace function trg_check_max_4_updated_records()
returns trigger as $$
declare
counter_ integer := 0;
tablename_ text := 'temptable';
begin
raise notice 'trigger fired';
select count(42) into counter_
from pg_catalog.pg_tables where tablename = tablename_;
if counter_ = 0 then
raise notice 'Creating table %', tablename_;
execute 'create temporary table ' || tablename_ || ' (counter integer) on commit drop';
execute 'insert into ' || tablename_ || ' (counter) values(1)';
execute 'select counter from ' || tablename_ into counter_;
raise notice 'Actual value for counter= [%]', counter_;
else
execute 'select counter from ' || tablename_ into counter_;
execute 'update ' || tablename_ || ' set counter = counter + 1';
raise notice 'updating';
execute 'select counter from ' || tablename_ into counter_;
raise notice 'Actual value for counter= [%]', counter_;
if counter_ > 4 then
raise exception 'Cannot change more than 4 rows in one transaction';
end if;
end if;
return new;
end; $$ language plpgsql;
create trigger trg_bu_test before
update on test
for each row
execute procedure trg_check_max_4_updated_records();
update test set id = 10 where id <= 1;
update test set id = 10 where id <= 2;
update test set id = 10 where id <= 3;
update test set id = 10 where id <= 4;
update test set id = 10 where id <= 5;
rollback;
The main idea is to have a trigger on 'before update for each row' that creates (if necessary) a temporary table (that is dropped at the end of transaction). In this table there is just one row with one value, that is the number of updated rows in current transaction. For each update the value is incremented. If the value is bigger than 4, the transaction is stopped.
But I think this is the wrong solution to your problem. What prevents someone from running the bad query you describe twice, so that 8 rows end up changed? And what about deleting rows, or truncating the table?
PostgreSQL has two types of triggers: row and statement triggers. Row triggers only work within the context of a row so you can't use those. Unfortunately, "before" statement triggers don't see what kind of change is about to take place so I don't believe you can use those, either.
Based on that, I would say it's unlikely you'll be able to build that kind of protection into the database using triggers, not unless you don't mind using an "after" trigger and rolling back the transaction if the condition isn't satisfied. Wouldn't mind being proved wrong. :)
Have a look at using Serializable Isolation Level. I believe this will give you a consistent view of the database data within your transaction. Then you can use option #1 that MusiGenesis mentioned, without the timing vulnerability. Test it of course to validate.
I've never worked with postgresql, so my answer may not apply. In SQL Server, your trigger can call a stored procedure which would do one of two things:
Perform a SELECT COUNT(*) to determine the number of records that will be affected by the UPDATE, and then only execute the UPDATE if the count is 4 or less
Perform the UPDATE within a transaction, and only commit the transaction if the returned number of rows affected is 4 or less
No. 1 is timing-vulnerable (the number of records affected by the UPDATE may change between the COUNT(*) check and the actual UPDATE). No. 2 is pretty inefficient if there are many cases where the number of rows updated is greater than 4.