POSTGRES (9.6) - Looped Updates to Avoid Running out of Memory

I have a database table where I want to simply update a column's value for all records that have a certain value in another column. I can't seem to figure out how to get batch updates working using postgres queries.
The current table structure is like the following:
placed_orders (
    order_id uuid,
    source char(1),    -- 'A', 'B', or 'C'
    submitted char(1), -- 'Y' or 'N'
    ...
)
I simply want to update all placed_orders rows' source to 'B' where submitted is 'Y', which would usually be as simple as running:
UPDATE placed_orders SET source='B' WHERE submitted='Y' AND source!='B';
The table is pretty big (over 5 million records), and if I run the above query to update all records at once I run into out-of-memory errors, so I'm now looking into updating in batches.
The query below is what I have so far; I'm new to loops, declaring variables, and changing their values:
DO $test$
DECLARE
    batch_size int := 1000;
    current_offset int := 0;
    records_to_update int := 0;
BEGIN
    SELECT COUNT(1) INTO records_to_update FROM placed_orders WHERE submitted='Y' AND source!='B';
    FOR j IN 0..records_to_update BY batch_size LOOP
        UPDATE placed_orders SET source = 'B'
        WHERE order_id IN (SELECT order_id FROM placed_orders
                           WHERE submitted='Y' AND source!='B'
                           ORDER BY order_id ASC
                           OFFSET current_offset LIMIT batch_size);
        current_offset := current_offset + batch_size;
    END LOOP;
END $test$;
When running this query, I still get the memory error. Anyone know how to get it to update in true batches?
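One common batching pattern, sketched here under the assumption that order_id is indexed and that committing between batches is acceptable: a DO block in 9.6 runs in a single transaction and cannot commit per batch, so the statement below would be issued repeatedly from the client (psql, a shell loop, or application code), committing between runs, until it reports UPDATE 0. Because each pass flips source to 'B', no OFFSET bookkeeping is needed.

-- Minimal sketch: re-run from the client, committing between runs, until it reports UPDATE 0.
UPDATE placed_orders
SET    source = 'B'
WHERE  order_id IN (SELECT order_id
                    FROM   placed_orders
                    WHERE  submitted = 'Y' AND source != 'B'
                    LIMIT  1000);  -- arbitrary batch size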

Related

PostgreSQL array of data composite update element using where condition

I have a composite type:
CREATE TYPE mydata_t AS
(
    user_id integer,
    value character(4)
);
Also, I have a table that uses this composite type as an array of mydata_t.
CREATE TABLE tbl
(
    id serial NOT NULL,
    data_list mydata_t[],
    PRIMARY KEY (id)
);
Here I want to update the mydata_t element in data_list whose mydata_t.user_id is 100000.
But I don't know which array element's user_id is equal to 100000.
So I first have to search for the element whose user_id equals 100000; that's my problem, I don't know how to write that query. In fact, I want to update the value of the array element whose user_id equals 100000 (and where the id of tbl is, for example, 1). What would my query be?
Something like this (I know it's wrong !!!)
UPDATE "tbl" SET "data_list"[i]."value"='YYYY'
WHERE "id"=1 AND EXISTS (SELECT ROW_NUMBER() OVER() AS i
FROM unnest("data_list") "d" WHERE "d"."user_id"=10000 LIMIT 1)
For example, this is my tbl data:
Row1 => id = 1, data = ARRAY[ROW(5,'YYYY'),ROW(6,'YYYY')]
Row2 => id = 2, data = ARRAY[ROW(10,'YYYY'),ROW(11,'YYYY')]
Now I want to update tbl where id is 2 and set the value of one of the tbl.data elements to 'XXXX' where the user_id of that element is equal to 11.
In fact, the final result of Row2 will be this:
Row2 => id = 2, data = ARRAY[ROW(10,'YYYY'),ROW(11,'XXXX')]
If you know the current value of value, you can use the array_replace() function to make the change:
UPDATE tbl
SET data_list = array_replace(data_list, (11, 'YYYY')::mydata_t, (11, 'XXXX')::mydata_t)
WHERE id = 2
If you do not know the current value of value, the situation becomes more complex:
UPDATE tbl SET data_list = data_arr
FROM (
    -- UPDATE doesn't allow aggregate functions so aggregate here
    SELECT array_agg(new_data) AS data_arr
    FROM (
        -- For the id value, get the data_list values that are NOT modified
        SELECT (user_id, value)::mydata_t AS new_data
        FROM tbl, unnest(data_list)
        WHERE id = 2 AND user_id != 11
        UNION
        -- Add the values to update
        VALUES ((11, 'XXXX')::mydata_t)
    ) x
) y
WHERE id = 2
You should keep in mind, though, that there is an awful lot of work going on in the background that cannot be optimised. The array of mydata_t values has to be examined from start to finish and you cannot use an index for this. Furthermore, an update actually inserts a new row in the underlying file on disk, and if your array has more than a few entries this involves substantial work. This gets even more problematic when your arrays are larger than the page size of your PostgreSQL server, typically 8kB. All of this happens behind the scenes, so it will work, but at a performance penalty. Even though array_replace sounds like the change is made in place (and in memory it indeed is), the UPDATE command will write a completely new tuple to disk. So if you have 4,000 array elements, at least 40kB of data has to be read (8 bytes for the mydata_t type on a typical system x 4,000 = 32kB in a TOAST file, plus the main page of the table, 8kB) and then written back to disk after the update. A real performance killer.
As @klin pointed out, this design may be more trouble than it is worth. Should you make data_list a table (as I would), the update query becomes:
UPDATE data_list SET value = 'XXXX'
WHERE id = 2 AND user_id = 11
This will have MUCH better performance, especially if you add the appropriate indexes. You could then still create a view to publish the data in an aggregated form with a custom type if your business logic so requires.
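For reference, a possible shape for such a normalized table, sketched with assumed names and the column types taken from mydata_t (this DDL is not spelled out in the answer above):

-- Hypothetical normalized replacement for the mydata_t[] column
CREATE TABLE data_list (
    id      integer NOT NULL REFERENCES tbl (id),
    user_id integer NOT NULL,
    value   character(4),
    PRIMARY KEY (id, user_id)
);

The composite primary key then doubles as the index that turns the UPDATE above into a plain index lookup.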

Fast new row insertion if a value of a column depends on previous value in existing row

I have a table cusers with a primary key:
primary key(uid, lid, cnt)
And I try to insert some values into the table:
insert into cusers (uid, lid, cnt, dyn, ts)
values
(A, B, C, (
select C - cnt
from cusers
where uid = A and lid = B
order by ts desc
limit 1
), now())
on conflict do nothing
Quite often (about 98% of the time) a row cannot be inserted into cusers because it violates the primary key constraint, so the expensive select query would not need to be executed at all. But as far as I can see, PostgreSQL first evaluates the select query for the dyn column and only then rejects the row because of the (uid, lid, cnt) violation.
What is the best way to insert rows quickly in such situation?
Another explanation
I have a system where one row depends on another. Here is an example:
(x, x, 2, 2, <timestamp>)
(x, x, 5, 3, <timestamp>)
Two columns contain an absolute value (2 and 5) and a relative value (2, and 5 - 2). Each time I insert a new row it should:
avoid duplicate rows (see the primary key constraint)
if the new row differs, compute the difference and put it into the dyn column (so I take the last inserted row for the user according to the timestamp and subtract the values).
Another solution I've found is to use returning uid, lid, ts for inserts and get the user ids that were really inserted - this is how I know they differ from existing rows. Then I update the inserted values:
update cusers
set dyn = (
    select max(cnt) - min(cnt)
    from (
        select cnt
        from cusers
        where uid = A and lid = B
        order by ts desc
        limit 2) t
)
where uid = A and lid = B and ts = TS
But it is not a fast approach either, as it seeks all over the ts column to find the two last inserted rows for each user. I need a fast insert query as I insert millions of rows at a time (but I do not write duplicates).
What could the solution be? Maybe I need a new index for this? Thanks in advance.
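One hedged sketch, reusing the question's placeholders A, B, C and TS: add an index that matches the "latest row per (uid, lid)" lookup, and pre-filter duplicates so the dyn subquery is only evaluated for rows that have a chance of being inserted.

-- Assumed index; it serves both the dyn subquery and the duplicate check
create index if not exists cusers_uid_lid_ts_idx on cusers (uid, lid, ts desc);

-- Sketch only: the one-time filter should skip the dyn lookup when the key already exists
insert into cusers (uid, lid, cnt, dyn, ts)
select A, B, C,
       C - (select cnt
            from cusers
            where uid = A and lid = B
            order by ts desc
            limit 1),
       now()
where not exists (select 1 from cusers where uid = A and lid = B and cnt = C)
on conflict do nothing;  -- kept as a guard against concurrent inserts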

How do I reduce the cost of set_bit in Postgres?

I am running PostgreSQL 9.6 and am running an experiment on the following table structure:
CREATE TABLE my_bit_varying_test (
    id SERIAL PRIMARY KEY,
    mr_bit_varying BIT VARYING
);
Just to understand how much performance I could expect if I was resetting bits on 100,000-bit data concurrently, I wrote a small PL/pgSQL block like this:
DO $$
DECLARE
    t BIT VARYING(100000) := B'0';
    idd INT;
BEGIN
    FOR I IN 1..100000 LOOP
        IF I % 2 = 0 THEN
            t := t || B'1';
        ELSE
            t := t || B'0';
        END IF;
    END LOOP;

    INSERT INTO my_bit_varying_test (mr_bit_varying) VALUES (t) RETURNING id INTO idd;

    UPDATE my_bit_varying_test SET mr_bit_varying = set_bit(mr_bit_varying, 100, 1) WHERE id = idd;
    UPDATE my_bit_varying_test SET mr_bit_varying = set_bit(mr_bit_varying, 99, 1) WHERE id = idd;
    UPDATE my_bit_varying_test SET mr_bit_varying = set_bit(mr_bit_varying, 34587, 1) WHERE id = idd;
    UPDATE my_bit_varying_test SET mr_bit_varying = set_bit(mr_bit_varying, 1, 1) WHERE id = idd;

    FOR I IN 1..100000 LOOP
        IF I % 2 = 0 THEN
            UPDATE my_bit_varying_test
            SET mr_bit_varying = set_bit(mr_bit_varying, I, 1)
            WHERE id = idd;
        ELSE
            UPDATE my_bit_varying_test
            SET mr_bit_varying = set_bit(mr_bit_varying, I, 0)
            WHERE id = idd;
        END IF;
    END LOOP;
END
$$;
When I run the PL/pgSQL though, it takes several minutes to complete, and I've narrowed it down to the for loop that is updating the table. Is it running slowly because of the compression on the BIT VARYING column? Is there any way to improve the performance?
Edit: This is a simulated, simplified example. What this is actually for: I have tens of thousands of jobs running that each need to report back their status, which updates every few seconds.
Now, I could normalize it and have a "run status" table that held all the workers and their statuses, but that would involve storing tens of thousands of rows. So, my thought is that I could use a bitmap to store the client and status, and the mask would tell me in order which ones had run and which ones had completed. The front bit would be used as an "error bit" since I don't need to know exactly which client failed, only that a failure exists.
So for example, you might have 5 workers for one job. If they all completed, then the status would be "01111", indicating that all jobs were complete, and none of them failed. If worker number 2 fails, then the status is "111110", indicating that there was an error, and all workers completed except the last one.
So, you can see this as a contrived way of handling large numbers of job statuses. Of course I'm up for other ideas, but even if I go that route, I'd still like to know for the future how to update a bit varying value quickly, because, well, I'm curious.
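For reference, a tiny sketch of how such a mask can be written and read with PostgreSQL's bit string functions; the 6-bit layout (a leading error flag plus one bit per worker) is only an assumed illustration:

-- Hypothetical 5-worker mask: bit 0 = error flag, bits 1-5 = per-worker completion
SELECT set_bit(B'000000', 3, 1) AS after_worker_3_done;  -- '000100'
SELECT get_bit(B'100000', 0)    AS error_flag;           -- 1 means some worker failed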
If it is really the TOAST compression that is your problem, you can simply disable it for that column:
ALTER TABLE my_bit_varying_test ALTER COLUMN mr_bit_varying SET STORAGE EXTERNAL;
You can try a set-based approach to replace the second loop; a set-based approach is usually faster than looping. Use generate_series() to produce the bit indexes and build the new value in a single statement:
-- Build all the alternating bits once and splice them in with overlay();
-- set_bit() positions are 0-based, so position 1 is character position 2.
UPDATE my_bit_varying_test
SET mr_bit_varying = overlay(mr_bit_varying PLACING
        (SELECT string_agg(CASE WHEN gs.i % 2 = 0 THEN '1' ELSE '0' END, '' ORDER BY gs.i)
         FROM generate_series(1, 100000) gs(i))::bit varying
        FROM 2)
WHERE id = idd;
Also consider creating an index on my_bit_varying_test (id), if you don't already have one.

postgresql pg_trgm speed up by where conditions

I use the pg_trgm extension to check the similarity of a text column. I want to speed it up by using additional conditions, but without success; the speed is the same. Here is my example:
create table test (
    id serial,
    descr text,
    yesno text,
    truefalse boolean
);
insert into test SELECT generate_series(1,1000000) AS id,
md5(random()::text) AS descr ;
update test set yesno = 'yes' where id < 500000;
update test set yesno = 'no' where id > 499999;
update test set truefalse = true where id < 100000;
update test set truefalse = false where id > 99999;
CREATE INDEX test_trgm_idx ON test USING gist (descr gist_trgm_ops);
So, when I execute the query, there is no difference whether or not I use the where clause.
select descr <-> '65c141ee1fdeb269d2e393cb1d3e1c09' as dist,
       descr, yesno, truefalse
from test
where yesno = 'yes'
  and truefalse = true
order by dist
limit 10;
Is this right?
After creating the test data, run ANALYZE to make sure the statistics are up to date. Then you can use EXPLAIN to find out.
On my machine it does an index scan on test_trgm_idx to read the rows in distance order so it can stop when the limit is reached. With the where clause it is actually slightly more work, because it has to scan more rows before the limit is reached, though the time difference is not noticeable.
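If the extra conditions are selective and always the same, a partial trigram index is one thing worth trying; this is a hedged sketch rather than part of the answer above, and it only helps queries whose WHERE clause matches the index predicate:

-- Hypothetical partial index covering only the filtered rows
CREATE INDEX test_trgm_partial_idx ON test
    USING gist (descr gist_trgm_ops)
    WHERE yesno = 'yes' AND truefalse;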

SQL Server - How Do I Create Increments in a Query

First off, I'm using SQL Server 2008 R2
I am moving data from one source to another. In this particular case there is a field called SiteID. In the source it's not a required field, but in the destination it is. So it was my thought, when the SiteID from the source is NULL, to sort of create a SiteID "on the fly" during the query of the source data. Something like a combination of the state plus the first 8 characters of a description field plus a ten digit number incremented.
At first I thought it might be easy to use a combination of date/time + nanoseconds but it turns out that several records can be retrieved within a nanosecond leading to duplicate SiteIDs.
My second idea was to create a table that contained an identity field, plus a function that would add a record to increment the identity field and then return it (the function would also delete all records where the identity field is less than the latest, to save space). Unfortunately, after I got it written, when trying to "CREATE" the function I got a notice that INSERTs are not allowed in functions.
I could (and did) convert it to a stored procedure, but stored procedures are not allowed in select queries.
So now I'm stuck.
Is there any way to accomplish what I'm trying to do?
This script may take time to execute depending on the data present in the table, so first run it on a small sample dataset.
DECLARE @TotalMissingSiteID INT = 0,
        @Counter INT = 0,
        @NewID BIGINT;

DECLARE @NewSiteIDs TABLE
(
    SiteID BIGINT -- Check the datatype
);

SELECT @TotalMissingSiteID = COUNT(*)
FROM SourceTable
WHERE SiteID IS NULL;

WHILE (@Counter < @TotalMissingSiteID)
BEGIN
    WHILE (1 = 1)
    BEGIN
        SELECT @NewID = RAND() * 1000000000000000; -- Add your formula to generate new SiteIDs here

        -- To check if the generated SiteID is already present in the table
        IF (ISNULL((SELECT 1
                    FROM SourceTable
                    WHERE SiteID = @NewID), 0) = 0)
            BREAK;
    END

    INSERT INTO @NewSiteIDs (SiteID)
    VALUES (@NewID);

    SET @Counter = @Counter + 1;
END

INSERT INTO DestinationTable (SiteID) -- Add the extra columns here
SELECT ISNULL(MainTable.SiteID, NewIDs.SiteID) SiteID
FROM (
    SELECT SiteID, -- Add the extra columns here
           ROW_NUMBER() OVER (PARTITION BY SiteID
                              ORDER BY SiteID) SerialNumber
    FROM SourceTable
) MainTable
LEFT JOIN (
    SELECT SiteID,
           ROW_NUMBER() OVER (ORDER BY SiteID) SerialNumber
    FROM @NewSiteIDs
) NewIDs
    ON MainTable.SiteID IS NULL
   AND MainTable.SerialNumber = NewIDs.SerialNumber
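As a simpler alternative to the script above, the asker's own idea (state + first 8 characters of the description + a ten-digit increment) can be generated inline with ROW_NUMBER(); this is only a hedged sketch, and the column names State and Description are assumptions.

-- Hypothetical inline SiteID generation; only rows with a NULL SiteID get a generated value
SELECT COALESCE(SiteID,
                State + LEFT(Description, 8) +
                RIGHT('0000000000' +
                      CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS VARCHAR(10)), 10)
       ) AS SiteID
FROM SourceTable;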