Update a very large table in PostgreSQL without locking - postgresql

I have a very large table with 100M rows in which I want to update a column with a value on the basis of another column. The example query to show what I want to do is given below:
UPDATE mytable SET col2 = 'ABCD'
WHERE col1 is not null
This is a master DB in a live environment with multiple slaves, and I want to update it without locking the table or affecting the performance of the live environment. What is the most effective way to do it? I'm thinking of writing a procedure that updates rows in batches of 1000 or 10000 using something like LIMIT, but I'm not sure how to do it, as I'm not that familiar with Postgres and its pitfalls. Neither column is indexed, but other columns in the table are.
I would appreciate a sample procedure code.
Thanks.

There is no update without locking, but you can strive to keep the row locks few and short.
You could simply run batches of this:
UPDATE mytable
SET col2 = 'ABCD'
FROM (SELECT id
FROM mytable
WHERE col1 IS NOT NULL
AND col2 IS DISTINCT FROM 'ABCD'
LIMIT 10000) AS part
WHERE mytable.id = part.id;
Just keep repeating that statement until it modifies fewer than 10000 rows; then you are done.
Note that mass updates don't lock the table, but of course they lock the updated rows, and the more of them you update, the longer the transaction, and the greater the risk of a deadlock.
To make that performant, an index like this would help:
CREATE INDEX ON mytable (col2) WHERE col1 IS NOT NULL;
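The "keep repeating" step can be driven from a small client script. The following is a minimal sketch, not a definitive implementation: it uses Python's built-in sqlite3 module as a stand-in so that it runs anywhere, so the statement is written with an IN subquery and SQLite's IS NOT instead of UPDATE ... FROM and IS DISTINCT FROM. The function name update_in_batches and the sample data are assumptions made for illustration. Against PostgreSQL, the same loop would issue the statement above through a PostgreSQL driver.

```python
import sqlite3


def update_in_batches(conn, batch_size):
    """Repeat the batched UPDATE until a pass modifies fewer than batch_size rows."""
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE mytable SET col2 = 'ABCD' "
            "WHERE id IN (SELECT id FROM mytable "
            "             WHERE col1 IS NOT NULL AND col2 IS NOT 'ABCD' "
            "             LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # one short transaction per batch keeps row locks brief
        total += cur.rowcount
        if cur.rowcount < batch_size:
            return total


# Hypothetical sample data: 100 rows, col1 is NULL for every fifth row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT)")
conn.executemany(
    "INSERT INTO mytable (id, col1, col2) VALUES (?, ?, NULL)",
    [(i, None if i % 5 == 0 else "x") for i in range(1, 101)],
)
conn.commit()

updated = update_in_batches(conn, batch_size=30)
```

Committing after every batch is the point of the exercise: each transaction holds its row locks only for the duration of one small update.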

Just an off-the-wall, out-of-the-box idea. Since the requirement that both col1 and col2 be null to qualify precludes using an index, building a pseudo index might be an option. This 'index' would of course be a regular table, but it would exist only for a short period. Additionally, this relieves the lock time worry.
create table indexer (mytable_id integer primary key);
insert into indexer(mytable_id)
select mytable_id
from mytable
where col1 is null
and col2 is null;
The above creates our 'index', which contains only the qualifying rows. Now wrap an update/delete statement in an SQL function. This function updates the main table, deletes the updated rows from the 'index', and returns the number of rows remaining.
create or replace function set_mytable_col2(rows_to_process_in integer)
returns bigint
language sql
as $$
with idx as
( update mytable
set col2 = 'ABCD'
where col2 is null
and mytable_id in (select mytable_id
from indexer
limit rows_to_process_in
)
returning mytable_id
)
delete from indexer
where mytable_id in (select mytable_id from idx);
select count(*) from indexer;
$$;
When the function returns 0, all rows initially selected have been processed. At this point, repeat the entire process to pick up any rows added or updated that the initial selection didn't identify. That should be a small number, and the process remains available if it is needed later.
Like I said just an off-the-wall idea.
Edited
I must have read something into it that wasn't there concerning col1. However, the idea remains the same; just change the INSERT statement for 'indexer' to meet your requirements. As for setting it in the 'index': no, the 'index' contains a single column, the primary key of the big table (and of itself).
Yes, you would need to run it multiple times unless you give it the total number of rows to process as the parameter. Below is a DO block that would satisfy your condition. It processes 200,000 rows on each pass; change that to fit your needs. Note that COMMIT inside a DO block requires PostgreSQL 11 or later, and the block must not be run inside an outer transaction.
Do $$
declare
rows_remaining bigint;
begin
loop
rows_remaining = set_mytable_col2(200000);
commit;
exit when rows_remaining = 0;
end loop;
end; $$;

Related

Would it be possible to select random rows with a little preference for a specific column?

I would like to get a random selection of records from my table but I wonder if it would be possible to give a better chance for items that are newly created. I also have pagination so this is why I'm using setseed
Currently I'm only retrieving items randomly and it works quite well, but I need to give a certain "preference" to newly created items.
Here is what I'm doing for now:
SELECT SETSEED(0.16111981), RANDOM();
I don't know what to do and I can't figure what can be a good solution without being an absolute performance disaster.
Firstly I want to explain how we can select random records on a table. On PostgreSQL, we can use random() function in the order by statement. Example:
select * from test_table
order by random()
limit 1;
I am using limit 1 to select only one record. However, with this method query performance will be very poor on large tables (over 100 million rows).
The second way is to select records manually using random(), provided the table has an id field. This approach performs very well.
Let's first write our own randomize function so that it is easy to use in our queries.
CREATE OR REPLACE FUNCTION random_between(low integer, high integer)
RETURNS integer
LANGUAGE plpgsql
STRICT
AS $function$
BEGIN
RETURN floor(random()* (high-low + 1) + low);
END;
$function$;
This function returns a random integer within the range given by the input arguments. We can then write a query that uses it. Example:
select * from test_table
where id = (select random_between(min(id), max(id)) from test_table);
I tested this query on a table with 150 million rows and it performed best, with a duration of 12 ms. If you need several rows rather than one, write where id > instead of where id = and add a limit.
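One thing to watch with the lookup by id: if the id column has gaps (deleted rows, rolled-back inserts), an equality match on a random value can return nothing. The sketch below mirrors the random_between formula in Python and compares an exact lookup with an "at or after" lookup. It uses sqlite3 as a runnable stand-in for PostgreSQL; the helper names pick_exact and pick_at_or_after and the sample ids are assumptions made for illustration.

```python
import math
import random
import sqlite3


def random_between(low, high, rng=random):
    """Python mirror of the plpgsql helper: an integer in [low, high], inclusive."""
    return math.floor(rng.random() * (high - low + 1) + low)


def pick_exact(conn, target):
    """Equality lookup; returns None when the target id falls into a gap."""
    return conn.execute(
        "SELECT id FROM test_table WHERE id = ?", (target,)
    ).fetchone()


def pick_at_or_after(conn, target):
    """First existing row at or after the target id; still served by the id index."""
    return conn.execute(
        "SELECT id FROM test_table WHERE id >= ? ORDER BY id LIMIT 1", (target,)
    ).fetchone()


# Hypothetical sample data with gaps: ids 3, 4, 6, 7 and 8 are missing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_table (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO test_table (id) VALUES (?)", [(i,) for i in (1, 2, 5, 9, 10)])

low, high = conn.execute("SELECT min(id), max(id) FROM test_table").fetchone()
target = random_between(low, high)
row = pick_at_or_after(conn, target)  # always a row, because target <= max(id)
```

The trade-off is that rows sitting right after a gap are picked more often than others, so this is only approximately uniform when the ids are sparse.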
Now, for your preference: I don't know your detailed business logic or the conditions you want to apply to the randomization, so I can only write some sample queries to show the mechanism. PostgreSQL has no built-in function for randomizing data with preferences, so we must write this logic manually. I created a sample table for testing our queries.
CREATE TABLE test_table (
id serial4 NOT NULL,
is_created bool NULL,
action_date date NULL,
CONSTRAINT test_table_pkey PRIMARY KEY (id)
);
CREATE INDEX test_table_id_idx ON test_table USING btree (id);
For example, suppose I want to give more preference only to rows whose action date is close to today. Sample query:
select
id,
is_created,
action_date,
(extract(day from (now()-action_date))) as dif_days
from
test_table
where
id > (select random_between(min(id), max(id)) from test_table)
and
(extract(day from (now()-action_date))) = random_between(0, 6)
limit 1;
In this query, (extract(day from (now()-action_date))) as dif_days returns the difference in days between action_date and today. In the where clause, I first select rows whose id is greater than the random value. Then, with (extract(day from (now()-action_date))) = random_between(0, 6), I keep only the rows whose action_date is at most 6 days ago (maybe 4 days ago or 2 days ago, max 6 days ago).
You can write many queries with this kind of logic (for example, giving more preference based on boolean fields such as closed or opened).

WHERE "id" = nextval(seq) doesn't work properly

Table: test_seq
id (varchar(8)) | raw_data (text)
cd_1            | 'I'm text'
cd_2            | 'I'm more text'
CREATE SEQUENCE cd_seq CYCLE START 1 MAXVALUE 2;
ALTER TABLE test_seq
ALTER COLUMN id SET DEFAULT ('cd_'||nextval('cd_seq'))::VARCHAR(8);
UPDATE test_seq
SET raw_data = 'New Text'
WHERE "id" = 'cd_'||nextval('cd_seq')::VARCHAR(8);
I am making a table that will store raw data as a short-term backup, in case the data ingestion fails for some reason and we need to go back without extracting it again. I'm trying to set up a way to have the records get replaced when we have reached the ID limit.
So if I want 25 records in the table, when the SEQUENCE rolls back from the maximum ('cd_25') to ('cd_1'), I want raw_data to get updated to the new data.
I've come up with the SEQUENCE and the DEFAULT value for the first inserts but my UPDATE command won't update the records even when the "id" matches the 'cd_'||nextval('cd_seq') and it will sometimes UPDATE 9 rows at once.
I checked the values of "id" and 'cd_'||nextval('cd_seq') and they appear to be a match but the WHERE doesn't work properly.
Am I missing something or am I overcomplicating things?
Thank you
While I agree with Adrian Klaver's comments that this approach is pretty fragile due to how sequences work, it can work if:
- you can make sure the column default value is the only call to the sequence,
- you don't mind skipped rows if an insert fails but the sequence still increments its value, and
- you can make sure all inserts handle conflicts like below.
Instead of trying to insert data by updating existing rows - which by the way forces you to prepopulate the table - just actually insert it and handle the conflict.
insert into test_seq
(text_column)
values
('e')
on conflict(id) do update set text_column=excluded.text_column;
This also lets you insert more than one row at once (up to the max size of your table, the length of your sequence), compared to your current update approach, as I do in the test below.
drop sequence if exists cd_seq cascade;
create sequence cd_seq cycle start 1 maxvalue 4;
drop table if exists test_seq cascade;
create table test_seq
(id text primary key default ('cd_'||nextval('cd_seq'))::VARCHAR(8),
text_column text);
insert into test_seq
(text_column)
values
('a'),
('b'),
('c'),
('d')
on conflict(id) do update set text_column=excluded.text_column;
select id, text_column from test_seq;
-- id | text_column
--------+-------------
-- cd_1 | a
-- cd_2 | b
-- cd_3 | c
-- cd_4 | d
--(4 rows)
insert into test_seq
(text_column)
values
('e'),
('f')
on conflict(id) do update set text_column=excluded.text_column;
select id, text_column from test_seq;
-- id | text_column
--------+-------------
-- cd_3 | c
-- cd_4 | d
-- cd_1 | e
-- cd_2 | f
--(4 rows)
If you try to insert more rows than the length of your sequence, you'll get
ERROR: ON CONFLICT DO UPDATE command cannot affect row a second time
HINT: Ensure that no rows proposed for insertion within the same
command have duplicate constrained values.
If in your current solution you gave your update a source table to get multiple rows from and their number also exceeded the sequence length, it wouldn't pose a problem - in conflicting pairs you'd just get the last one. Here's your update, fixed (but still requires that your table is pre-populated):
with new as (
select ('cd_'||nextval('cd_seq'))::VARCHAR(8) id,'g' text_column union all
select ('cd_'||nextval('cd_seq'))::VARCHAR(8) id,'h' text_column union all
select ('cd_'||nextval('cd_seq'))::VARCHAR(8) id,'i' text_column union all
select ('cd_'||nextval('cd_seq'))::VARCHAR(8) id,'j' text_column union all
select ('cd_'||nextval('cd_seq'))::VARCHAR(8) id,'k' text_column)
update test_seq old
set text_column=new.text_column
from new
where old.id=new.id;
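The insert-and-overwrite behaviour can also be tried without a PostgreSQL instance. The sketch below is only an illustration under stated assumptions: it uses Python's sqlite3 module (SQLite supports the same ON CONFLICT ... DO UPDATE clause), and because SQLite has no sequences, an itertools.cycle generator named slot_ids plays the role of the cycling cd_seq. The store helper is hypothetical.

```python
import itertools
import sqlite3

MAX_ROWS = 4  # plays the role of the sequence's MAXVALUE

# Stand-in for the cycling sequence: cd_1, cd_2, cd_3, cd_4, cd_1, ...
slot_ids = itertools.cycle([f"cd_{n}" for n in range(1, MAX_ROWS + 1)])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_seq (id TEXT PRIMARY KEY, text_column TEXT)")


def store(conn, values):
    """Insert each value into the next slot, overwriting the slot if it is taken."""
    for value in values:
        conn.execute(
            "INSERT INTO test_seq (id, text_column) VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET text_column = excluded.text_column",
            (next(slot_ids), value),
        )
    conn.commit()


store(conn, ["a", "b", "c", "d"])  # fills cd_1 .. cd_4
store(conn, ["e", "f"])            # wraps around and overwrites cd_1 and cd_2
rows = conn.execute("SELECT id, text_column FROM test_seq ORDER BY id").fetchall()
```

After both calls the table holds e and f in slots cd_1 and cd_2 and still holds c and d in cd_3 and cd_4, matching the second result set shown in the test above.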

Cap on number of table rows matching a certain condition using Postgres exclusion constraint?

If I have a Postgresql db schema for a tires table like this (a user has many tires):
user_id integer
description text
size integer
color text
created_at timestamp
and I want to enforce a constraint that says "a user can only have 4 tires".
A naive way to implement this would be to do:
SELECT COUNT(*) FROM tires WHERE user_id='123'
, compare the result to 4, and insert if it's lower than 4. This is susceptible to race conditions, which is why it is a naive approach.
I don't want to add a count column. How can I do this (or can I do this) using an exclusion constraint? If it's not possible with exclusion constraints, what is the canonical way?
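The "right" way here is to use locks.
First, take a lock that serializes inserts for the user. Second, insert the new record only if the user has fewer than 4 tires. Note that FOR UPDATE cannot be combined with an aggregate such as count(*), and locking the existing tire rows would not stop a concurrent insert of a new row, so this version uses a transaction-level advisory lock keyed on the user id instead. Every code path that inserts tires must take the same lock. Here is the SQL:
begin;
-- serialize concurrent inserts for this user; released automatically at commit or rollback
select pg_advisory_xact_lock(123);
-- count the user's tires
select count(*)
from tires
where user_id = 123;
-- throw an error if there are more than 3
-- insert new row
insert into tires (user_id, description)
values (123, 'tire123');
commit;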
commit;
You can read more here: How to perform conditional insert based on row count?
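The lock, count, then insert sequence can be sketched as a single function. This is a minimal illustration under stated assumptions, not the PostgreSQL implementation: it runs on Python's sqlite3 module, where BEGIN IMMEDIATE takes SQLite's database-wide write lock and stands in for the lock taken in PostgreSQL. The add_tire function and the MAX_TIRES constant are hypothetical names.

```python
import sqlite3

MAX_TIRES = 4


def add_tire(conn, user_id, description):
    """Count and insert inside one locked transaction; refuse the fifth tire."""
    conn.execute("BEGIN IMMEDIATE")  # SQLite write lock; stand-in for the PostgreSQL lock
    try:
        (count,) = conn.execute(
            "SELECT count(*) FROM tires WHERE user_id = ?", (user_id,)
        ).fetchone()
        if count >= MAX_TIRES:
            raise ValueError(f"user {user_id} already has {count} tires")
        conn.execute(
            "INSERT INTO tires (user_id, description) VALUES (?, ?)",
            (user_id, description),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None puts the connection in autocommit mode so that
# the explicit BEGIN / COMMIT / ROLLBACK statements above are in control.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE tires (user_id INTEGER, description TEXT)")

for n in range(MAX_TIRES):
    add_tire(conn, 123, f"tire{n}")

try:
    add_tire(conn, 123, "one too many")
    rejected = False
except ValueError:
    rejected = True
```

The important property is that the count and the insert happen while the lock is held, so two sessions cannot both observe three tires and both insert a fourth.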

Using SQL "seek" with a UUID for sorting in a PL/pgSQL Query

I have a table that looks like the following:
CREATE TABLE tmp (
id uuid primary key,
other_id uuid,
...
);
This table has millions of entries, and I need to: loop through them all, check and compare the values of some of its fields with the values of another table, and correct the values.
I did not want to use the standard ORDER BY ... LIMIT ... OFFSET ... approach, as its performance suffers greatly for big offsets. Hence, I tried to use the "seek index" approach (example here).
My problem is that I am getting off-by-one errors, and I am not sure (conceptually) how to solve these in PL/pgSQL code. Something like this:
-- Get initial offset
SELECT id INTO _id_offset
FROM tmp
WHERE ...
ORDER BY id DESC
LIMIT 1
WHILE ... LOOP -- Loop until some fixed high value to prevent infinite loop, just in case
SELECT id, other_id, ... INTO rows_to_update
FROM tmp
WHERE id < _id_offset AND (...) -- Latter part is same condition as above
ORDER BY id DESC
FETCH NEXT _batch_size ROWS ONLY
-- Get next offset
SELECT id INTO _id_offset
FROM rows_to_update
ORDER BY id ASC -- ASC to get the "last" id from above. Cannot simply use _batch_size offset as there may be fewer entries left.
LIMIT 1
-- Update relevant records, check # of updated records to see
-- if we can terminate loop early, update loop condition
...
END LOOP;
Unsurprisingly, the first and last entries are skipped due to the < condition. It would have been rather simple to correct this behaviour in application code, but I'm not sure what it should look like in PL/pgSQL.
Is there a simpler way to loop over an entire table in an efficient manner using PL/pgSQL?
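One common way to avoid the skipped first row is to run the first pass without any bound and to apply the id < last_seen condition only from the second pass onward, taking last_seen from the final row of the previous batch. The sketch below shows that loop shape in Python with sqlite3 as a runnable stand-in; UUIDs are stored as text there, and the generator name iter_batches_desc is an assumption made for illustration. The elided filter conditions from the question are left out rather than guessed.

```python
import sqlite3
import uuid


def iter_batches_desc(conn, batch_size):
    """Yield batches in descending id order using keyset (seek) pagination."""
    last_id = None  # no bound on the first pass, so the highest id is included
    while True:
        if last_id is None:
            rows = conn.execute(
                "SELECT id, other_id FROM tmp ORDER BY id DESC LIMIT ?",
                (batch_size,),
            ).fetchall()
        else:
            rows = conn.execute(
                "SELECT id, other_id FROM tmp WHERE id < ? ORDER BY id DESC LIMIT ?",
                (last_id, batch_size),
            ).fetchall()
        if not rows:
            return  # nothing left below last_id
        yield rows
        last_id = rows[-1][0]  # smallest id of this batch, already processed


# Hypothetical sample data: ten rows with random UUIDs stored as text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmp (id TEXT PRIMARY KEY, other_id TEXT)")
ids = [str(uuid.uuid4()) for _ in range(10)]
conn.executemany(
    "INSERT INTO tmp (id, other_id) VALUES (?, ?)",
    [(i, str(uuid.uuid4())) for i in ids],
)

batches = list(iter_batches_desc(conn, batch_size=3))
seen = [row[0] for batch in batches for row in batch]
```

Because the strict < is applied to an id that has already been processed, no row is skipped and none is visited twice; the loop ends when a pass returns no rows.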

PostgreSQL - How to make a condition with records between the current record date and the same date plus 5 min?

I have something like this. With this part of the code, I detect whether a vehicle stopped for at least 5 minutes.
It works, but with a large amount of data it starts to be slow.
I did a lot of tests and I'm sure that my problem is in the not exists block.
My table:
CREATE TABLE public.messages
(
id bigint PRIMARY KEY DEFAULT nextval('messages_id_seq'::regclass),
messagedate timestamp with time zone NOT NULL,
vehicleid integer NOT NULL,
driverid integer NOT NULL,
speedeffective double precision NOT NULL,
-- ... few nonsense properties
)
WITH (
OIDS=FALSE
);
ALTER TABLE public.messages OWNER TO postgres;
CREATE INDEX idx_messages_1 ON public.messages
USING btree (vehicleid, messagedate);
And my query:
SELECT
*
FROM
messages m
WHERE
m.speedeffective > 0
and m.next_speedeffective = 0
and not exists( -- my problem
select id
from messages
where
vehicleid = m.vehicleid
and speedeffective > 5 -- I forgot this condition
and messagedate > m.messagedate
and messagedate <= m.messagedate + interval '5 minutes'
)
I can't figure out how to build the condition in a more performant way.
Edit DAY2:
I added a previous table like this to use in the second table:
WITH messagesx as (
SELECT
vehicleid,
messagedate
FROM
messages
WHERE
speedeffective > 5
)
and now works better. I think that I'm missing a little detail.
A 'NOT EXISTS' can slow down your query when the subquery has to be evaluated for each of the outer rows. Try to incorporate the same functionality within a join (I'm trying to rewrite the query here without knowing the table, so I might make a mistake):
SELECT
m1.*
FROM
messages m1
LEFT JOIN
messages m2
ON m2.vehicleid = m1.vehicleid
AND m2.speedeffective > 5
AND m2.messagedate > m1.messagedate
AND m2.messagedate <= m1.messagedate + interval '5 minutes'
WHERE
m1.speedeffective > 0
and m1.next_speedeffective = 0
and m2.vehicleid IS NULL
Take note that the NOT EXISTS is rewritten as the non-hit of the join condition.
Based on this answer: https://stackoverflow.com/a/36445233/5000827
and reading about NOT IN, NOT EXISTS and LEFT JOIN (where join is NULL)
For PostgreSQL, NOT EXISTS and LEFT JOIN (where the join is NULL) are both anti-joins and work the same way. (This is why the result of @CountZukula's answer is almost the same as mine.)
The problem was the kind of join operation chosen: nested loop or hash.
So, based on this: https://www.postgresql.org/docs/9.6/static/routine-vacuuming.html
PostgreSQL's VACUUM command has to process each table on a regular basis for several reasons:
To recover or reuse disk space occupied by updated or deleted rows.
To update data statistics used by the PostgreSQL query planner.
To update the visibility map, which speeds up index-only scans.
To protect against loss of very old data due to transaction ID wraparound or multixact ID wraparound.
I ran VACUUM ANALYZE on the messages table, and the same query now runs much faster.
So, after the VACUUM ANALYZE, PostgreSQL has up-to-date statistics and can choose a better plan.