How do I reduce the cost of set_bit in Postgres? - postgresql

I am running PostgreSQL 9.6 and am running an experiment on the following table structure:
CREATE TABLE my_bit_varying_test (
    id SERIAL PRIMARY KEY,
    mr_bit_varying BIT VARYING
);
Just to understand how much performance I could expect if I were resetting bits concurrently on 100,000-bit values, I wrote a small PL/pgSQL block like this:
DO $$
DECLARE
    t   BIT VARYING(100000) := B'0';
    idd INT;
BEGIN
    FOR I IN 1..100000 LOOP
        IF I % 2 = 0 THEN
            t := t || B'1';
        ELSE
            t := t || B'0';
        END IF;
    END LOOP;

    INSERT INTO my_bit_varying_test (mr_bit_varying) VALUES (t) RETURNING id INTO idd;

    UPDATE my_bit_varying_test SET mr_bit_varying = set_bit(mr_bit_varying, 100, 1) WHERE id = idd;
    UPDATE my_bit_varying_test SET mr_bit_varying = set_bit(mr_bit_varying, 99, 1) WHERE id = idd;
    UPDATE my_bit_varying_test SET mr_bit_varying = set_bit(mr_bit_varying, 34587, 1) WHERE id = idd;
    UPDATE my_bit_varying_test SET mr_bit_varying = set_bit(mr_bit_varying, 1, 1) WHERE id = idd;

    FOR I IN 1..100000 LOOP
        IF I % 2 = 0 THEN
            UPDATE my_bit_varying_test
            SET mr_bit_varying = set_bit(mr_bit_varying, I, 1)
            WHERE id = idd;
        ELSE
            UPDATE my_bit_varying_test
            SET mr_bit_varying = set_bit(mr_bit_varying, I, 0)
            WHERE id = idd;
        END IF;
    END LOOP;
END
$$;
When I run the PL/pgSQL though, it takes several minutes to complete, and I've narrowed it down to the for loop that is updating the table. Is it running slowly because of the compression on the BIT VARYING column? Is there any way to improve the performance?
Edit: This is a simulated, simplified example. What it's actually for is that I have tens of thousands of jobs running, each of which needs to report its status back every few seconds.
Now, I could normalize it and have a "run status" table that held all the workers and their statuses, but that would involve storing tens of thousands of rows. So my thought is that I could use a bitmap to store the client and status, and the mask would tell me, in order, which ones had run and which ones had completed. The front bit would be used as an "error bit", since I don't need to know exactly which client failed, only that a failure exists.
So for example, you might have 5 workers for one job. If they all completed, the status would be "011111": the leading error bit is 0 and each of the five worker bits is 1, indicating that all workers completed and none failed. If worker number 2 fails, the status might be "110111": the error bit is set and worker 2's bit stays 0 while the rest are 1.
So, you can see this as a contrived way of handling large numbers of job statuses. Of course I'm open to other ideas, but even if I go another route, I'd still like to know how to update a variable-length bit string quickly, because, well, I'm curious.
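To make that concrete, a single status report in this scheme would just be a one-bit update, something like the sketch below (the job_status table and its columns are made up purely to illustrate the pattern; the bit string has to be pre-sized because set_bit() errors on an out-of-range index):

-- Hypothetical schema: one bit string per job, bit 0 = error flag,
-- bit N = "worker N completed"; sized up front for the number of workers.
-- CREATE TABLE job_status (job_id int PRIMARY KEY, status_bits bit varying);

-- Worker 3 of job 17 reports completion:
UPDATE job_status
SET status_bits = set_bit(status_bits, 3, 1)
WHERE job_id = 17;

-- Some worker of job 17 reports a failure: set the leading error bit.
UPDATE job_status
SET status_bits = set_bit(status_bits, 0, 1)
WHERE job_id = 17;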

If it is really the TOAST compression that is your problem, you can disable it for that column (SET STORAGE is a per-column setting, and it only affects values stored from then on):
ALTER TABLE my_bit_varying_test ALTER COLUMN mr_bit_varying SET STORAGE EXTERNAL;
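To check whether compression is actually in play (and whether the storage change makes a difference for newly written values), compare the stored size of the value with its nominal length; pg_column_size() reflects any compression that was applied:

-- A 100,001-bit value is roughly 12.5 kB uncompressed;
-- a much smaller stored_bytes figure means the value is being compressed.
SELECT id,
       pg_column_size(mr_bit_varying) AS stored_bytes,
       length(mr_bit_varying)         AS length_in_bits
FROM my_bit_varying_test;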

You can try a set-based approach to replace the second loop; a set-based approach is usually faster than looping. Note that a plain UPDATE ... FROM generate_series(...) won't do it here, because each target row is updated at most once per statement, so only one of the 100,000 set_bit() calls would actually take effect. Instead, build the whole new value in a single expression, for example with generate_series() and string_agg():
UPDATE my_bit_varying_test
SET mr_bit_varying = substring(mr_bit_varying from 1 for 1) ||
        (SELECT string_agg(CASE WHEN gs.i % 2 = 0 THEN '1' ELSE '0' END, '' ORDER BY gs.i)
         FROM generate_series(1, 100000) gs(i))::bit varying
WHERE id = idd;
This keeps bit 0 unchanged and rewrites bits 1 through 100000 in one statement. An index on id matters for the WHERE clause; here id is the primary key, so it is already indexed.
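A quick sanity check of the result with get_bit(), which uses the same 0-based indexes as set_bit() (assuming the row inserted by the DO block above got id = 1):

SELECT get_bit(mr_bit_varying, 2) AS bit_2,   -- even index, expected 1
       get_bit(mr_bit_varying, 3) AS bit_3,   -- odd index, expected 0
       get_bit(mr_bit_varying, 0) AS bit_0    -- untouched, expected 0
FROM my_bit_varying_test
WHERE id = 1;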

Related

POSTGRES (9.6) - Looped Updates to Avoid Running out of Memory

I have a database table where I want to simply update a column's value for all records that have a certain value in another column. I can't seem to figure out how to get batch updates working using postgres queries.
The current table structure is like the following:
placed_orders (
    order_id uuid,
    source char(1),     -- 'A', 'B', or 'C'
    submitted char(1),  -- 'Y' or 'N'
    ...
)
I want to simply update all placed_orders rows' source to 'B' where submitted is 'Y', which would usually be as simple as running:
UPDATE placed_orders SET source='B' WHERE submitted='Y' AND source!='B';
The table is pretty big, with over 5 million records, and if I just run the above query to update all records at once, I run into out-of-memory errors, so I'm now looking into updating in batches.
The query below is what I have so far; I'm new to loops and to declaring variables and changing their values:
DO $test$
DECLARE
    batch_size int := 1000;
    current_offset int := 0;
    records_to_update int := 0;
BEGIN
    SELECT COUNT(1) INTO records_to_update
    FROM placed_orders
    WHERE submitted = 'Y' AND source != 'B';

    FOR j IN 0..records_to_update BY batch_size LOOP
        UPDATE placed_orders
        SET source = 'B'
        WHERE order_id IN (SELECT order_id
                           FROM placed_orders
                           WHERE submitted = 'Y' AND source != 'B'
                           ORDER BY order_id ASC
                           OFFSET current_offset LIMIT batch_size);
        current_offset := current_offset + batch_size;
    END LOOP;
END $test$;
When running this query, I still get the memory error. Anyone know how to get it to update in true batches?
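A common pattern (not from this thread) is to batch on the key itself and drive the loop from the client, so each batch commits in its own transaction. An OFFSET inside a single DO block doesn't help: on 9.6 the whole block is one transaction, and because updated rows drop out of the WHERE predicate, a growing OFFSET skips rows. A sketch, to be re-run until it reports 0 rows updated:

-- Each execution updates (up to) one batch of rows that still need the change.
WITH batch AS (
    SELECT order_id
    FROM placed_orders
    WHERE submitted = 'Y' AND source <> 'B'
    ORDER BY order_id
    LIMIT 1000
)
UPDATE placed_orders p
SET source = 'B'
FROM batch
WHERE p.order_id = batch.order_id;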

increment sequence in PostgreSQL stored procedure

How to auto increment sequence number once for every run of a stored procedure and how to use it in the where condition of an update statement?
I already assigned a sequence number to the next value in each run, but I'm not able to use it in the where condition.
CREATE OR REPLACE FUNCTION ops.mon_connect_easy()
RETURNS void
LANGUAGE plpgsql
AS $function$
declare
    _inserted_rows bigint = 0;
    run_seq_num bigint = 0;
begin
    -- assigning the sequence number to the variable
    select nextval('ops.mon_connecteasy_seq') into run_seq_num;

    -- use for selecting iteration_id; this is where I'm getting stuck
    update t_contract c
    set end_date = ce.correct_end_date,
        status = 'Active',
        orig_end_date = ce.correct_end_date
    from ops.t_mon_ConnectEasy ce
    where c.contract_id = ce.contract_id
      and run_seq_num = ??;
nextval() advances the sequence and returns the resulting value, so you don't need a separate SELECT ... INTO. Be careful about putting it straight into the WHERE clause, though: nextval() is volatile, so written there it would be re-evaluated (and the sequence advanced) for every row that is checked. Wrapping it in a scalar subquery makes it run only once per statement:
update t_contract c
set end_date = ce.correct_end_date
  , status = 'Active'
  , orig_end_date = ce.correct_end_date
from ops.t_mon_ConnectEasy ce
where c.contract_id = ce.contract_id
  and iteration_id = (select nextval('ops.mon_connecteasy_seq'));
Be aware that concurrent transactions also might advance the sequence, creating virtual gaps in the sequential numbers.
And I have a nagging suspicion that this might not be the best way to achieve your undisclosed goals.
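If you would rather fetch the value once and reuse it, closer to the original attempt, a minimal sketch:

-- Capture the run number once, then reference the variable in the join condition.
do $$
declare
    run_seq_num bigint := nextval('ops.mon_connecteasy_seq');
begin
    update t_contract c
    set end_date = ce.correct_end_date,
        status = 'Active',
        orig_end_date = ce.correct_end_date
    from ops.t_mon_ConnectEasy ce
    where c.contract_id = ce.contract_id
      and iteration_id = run_seq_num;
end
$$;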

Does PostgreSQL SERIAL guarantee no gaps within a single INSERT statement?

Let's start with:
CREATE TABLE "houses" (
"id" serial NOT NULL PRIMARY KEY,
"name" character varying NOT NULL)
Imagine I try to insert multiple records (maybe 10, maybe 1000) into the table in a single statement, concurrently (!) with other sessions doing the same.
INSERT INTO houses (name) VALUES
('B6717'),
('HG120');
Is it guaranteed that when a single thread inserts X records in a single statement (while other threads simultaneously insert other records into the same table), those records will have IDs numbered from A to A+X-1? Or is it possible that A+100 will be taken by thread 1 and A+99 by thread 2?
Inserting 10,000 records at a time from two pgAdmin connections turned out to be enough to show that the serial type does not guarantee continuity within a batch, on my PostgreSQL 9.5:
DO
$do$
BEGIN
    FOR i IN 1..200 LOOP
        EXECUTE format('INSERT INTO houses (name) VALUES %s%s;',
                       repeat('(''a' || i || '''),', 9999),
                       '(''a' || i || ''')');
    END LOOP;
END
$do$;
The above results in quite frequent overlap between ids belonging to two different batches:
SELECT * FROM houses WHERE id BETWEEN 34370435 AND 34370535 ORDER BY id;
34370435;"b29"
34370436;"b29"
34370437;"b29"
34370438;"a100"
34370439;"b29"
34370440;"b29"
34370441;"a100"
...
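You can also spot the interleaving directly in SQL rather than eyeballing the listing; since every batch in the DO block above uses a distinct name, a batch whose ids are contiguous spans exactly as many ids as it has rows:

SELECT name,
       count(*)              AS rows_in_batch,
       max(id) - min(id) + 1 AS id_span   -- equals rows_in_batch only if contiguous
FROM houses
GROUP BY name
HAVING max(id) - min(id) + 1 <> count(*)
ORDER BY name;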
I thought this was going to be harder to prove, but it turns out it is not guaranteed.
I used a Ruby script to have 4 threads insert thousands of records simultaneously and checked whether the records created by a single statement had gaps in their ids; they did.
4.times.map do |t|
  Thread.new do
    100.times do |u|
      House.import(1000.times.map do |i|
        {
          tenant: "#{t}-#{u}",
          name: i,
        }
      end)
    end
  end
end.each(&:join)
House.distinct.pluck(:tenant).all? do |t|
  recs = House.where(tenant: t).order('id').to_a
  recs.first.id - recs.first.name.to_i == recs.last.id - recs.last.name.to_i
end
Example of the gaps:
[#<House:0x00007fd2341b5e00
id: 177002,
tenant: "0-43",
name: "0",>,
#<House:0x00007fd2341b5c48
id: 177007,
tenant: "0-43",
name: "1">,
...
As you can see, the id jumped by 5 (177002 to 177007) between the first and second rows inserted within the same single INSERT statement.
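As a practical aside: if you need to know exactly which ids a single INSERT produced, ask for them with RETURNING instead of relying on contiguity:

INSERT INTO houses (name)
VALUES ('B6717'), ('HG120')
RETURNING id, name;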

Updating column with unique constraint/index in Firebird in multiple rows

I have a table that looks like this:
CREATE TABLE SCHEDULE (
    RCDNO INTEGER NOT NULL,
    MASTCONNO INTEGER NOT NULL,
    ROWNO INTEGER DEFAULT 0 NOT NULL,
    EDITTIMESTAMP TIMESTAMP
);

ALTER TABLE SCHEDULE ADD CONSTRAINT PK_SCHEDULE PRIMARY KEY (RCDNO);
CREATE UNIQUE INDEX IDX_SCHEDULE_3 ON SCHEDULE (MASTCONNO, ROWNO);
I can run a query like this:
select mastconno, rowno, rcdno, edittimestamp
from schedule
where mastconno = 12
order by rowno desc
and I get this:
Due to a bug in the app code, the time portion of edittimestamp is missing, but that's of no consequence here. The records are nonetheless listed in descending order of entry by ROWNO; that's what the design of this table is meant to facilitate.
What I tried to do is this...
update SCHEDULE
set ROWNO = (ROWNO + 1)
where MASTCONNO = 12
... in preparation for the insert of a new ROWNO=0 record, I get this error:
attempt to store duplicate value (visible to active transactions) in unique index "IDX_SCHEDULE_3".
Problematic key value is ("MASTCONNO" = 12, "ROWNO" = 3).
Incidentally, I have an exact copy of the table in MS SQL Server, and I didn't have this problem there. This seems to be specific to the way Firebird works.
So then I tried this, hoping that Firebird would "feed" values from the IN predicate to the UPDATE in a non-offending order.
update SCHEDULE
set ROWNO = (ROWNO + 1)
where MASTCONNO = 12 and
ROWNO in (select ROWNO from SCHEDULE
where MASTCONNO = 12 order by ROWNO DESC)
Sadly the response was the same as before. ROWNO 3 is being duplicated by the update statement.
With the unique index (MASTCONNO, ROWNO) in place, I have tested the following:
update SCHEDULE
set ROWNO = (ROWNO + 1)
where MASTCONNO = 12
order by ROWNO DESC
Works correctly!
update SCHEDULE
set ROWNO = (ROWNO - 1)
where MASTCONNO = 12
order by ROWNO ASC
Also works correctly!
Thanks very much Mark Rotteveel.
Indeed, Firebird runs constraint checks and triggers per row, not per statement or per transaction.
So you would have to use the same loop-based approach you would use in application code dealing with lists and arrays.
You have to use EXECUTE BLOCK and a loop that runs in the proper direction.
http://www.firebirdsql.org/file/documentation/reference_manuals/fblangref25-en/html/fblangref25-dml-execblock.html
You take the maximum ROWNO of interest and increment it, then do the same for the previous value, then the one before that, and so on.
EXECUTE BLOCK
AS
DECLARE ID INTEGER;
BEGIN
    ID = (SELECT MAX(ROWNO) FROM SCHEDULE WHERE MASTCONNO = 12);
    /*
    ID = NULL;
    SELECT MAX(ROWNO) FROM SCHEDULE WHERE MASTCONNO = 12 INTO :ID;
    IF (ID IS NULL) raise-some-error-somehow
    */
    WHILE (ID >= 0) DO
    BEGIN
        UPDATE SCHEDULE SET ROWNO = (ID + 1)
        WHERE MASTCONNO = 12 AND :ID = ROWNO;

        ID = ID - 1;
    END
END
DANGER!
If, while you are doing this, another transaction (maybe from another program working on another computer) inserts a new row with MASTCONNO = 12 and commits it, you have a problem.
Firebird is a multi-version server, not a table-locking server, so there is nothing on the server that prohibits this insert just because your procedure is working on the table. You then have a race condition: the faster transaction commits, and the slower one fails with a unique index violation.
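If that race is a real concern, one heavy-handed mitigation (a sketch, not from the original answer) is to reserve the table when you start the transaction that runs the renumbering block, using Firebird's RESERVING clause, so concurrent writers wait instead of racing you:

-- Sketch only: start the renumbering transaction with the table reserved.
SET TRANSACTION WAIT ISOLATION LEVEL SNAPSHOT
    RESERVING SCHEDULE FOR PROTECTED WRITE;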
You may also use a FOR SELECT loop instead of the WHILE loop.
https://www.firebirdsql.org/file/documentation/reference_manuals/fblangref25-en/html/fblangref25-psql-coding.html#fblangref25-psql-forselect
Like this:
EXECUTE BLOCK
AS
BEGIN
    FOR
        SELECT * FROM SCHEDULE
        WHERE MASTCONNO = 12
        ORDER BY ROWNO DESCENDING  -- proper loop direction: ordering is key!
        AS CURSOR PTR
    DO
        UPDATE SCHEDULE SET ROWNO = ROWNO + 1
        WHERE CURRENT OF PTR;
END
However, in Firebird 3 cursor-based positioning became rather slow.

Can Entity Framework assign wrong Identity column value in case of high concurency additions

We have an auto-increment identity column Id as part of our User object. For a campaign we just ran for a client we had up to 600 signups per minute. This is the code block doing the insert:
using (var ctx = new {{ProjectName}}_Entities())
{
    int userId = ctx.Users.Where(u => u.Email.Equals(request.Email))
                          .Select(u => u.Id)
                          .SingleOrDefault();
    if (userId == 0)
    {
        var user = new User() { /* Initializing user properties here */ };
        ctx.Users.Add(user);
        ctx.SaveChanges();
        userId = user.Id;
    }
    ...
}
Then we use the userId to insert data into another table. What happened during high load is that there were multiple rows with the same userId even though there shouldn't have been. It seems like the above code returned the same Identity (int) value for multiple inserts.
I read through a few blog/forum posts saying that there might be an issue with SCOPE_IDENTITY(), which Entity Framework uses to return the auto-increment value after an insert.
They say a possible workaround would be writing an insert procedure for User with INSERT ... OUTPUT INSERTED.Id, which I'm familiar with.
Anybody else experienced this issue? Any suggestion on how this should be handled with Entity Framework?
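For reference, the INSERT ... OUTPUT workaround mentioned above would look roughly like this (a sketch; the procedure, table and column names are illustrative, not the actual schema):

-- OUTPUT INSERTED.Id returns the identity value generated by this
-- specific INSERT, so it cannot pick up another session's value.
CREATE PROCEDURE dbo.p_CreateUser
    @Email nvarchar(256)
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.Users (Email)
    OUTPUT INSERTED.Id
    VALUES (@Email);
END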
UPDATE 1:
After further analyzing the data I'm almost 100% positive this is the problem. The identity column skipped auto-increment values 48 times in total (for example 2727, then 2728 missing, then 2729, ...), and we have exactly 48 duplicates in the other table.
It seems like EF returned a random identity value for each row it wasn't able to insert for some reason.
Anybody have any idea what could possible be going on here?
UPDATE 2:
Possibly important info I didn't mention is that this happened on Azure Website with Azure SQL. We had 4 instances running at the time it happened.
UPDATE 3:
Stored Proc:
CREATE PROCEDURE [dbo].[p_ClaimCoupon]
    @CampaignId int,
    @UserId int,
    @Flow tinyint
AS
DECLARE @myCoupons TABLE
(
    [Id] BIGINT NOT NULL,
    [Code] CHAR(11) NOT NULL,
    [ExpiresAt] DATETIME NOT NULL,
    [ClaimedBefore] BIT NOT NULL
)

INSERT INTO @myCoupons
SELECT TOP(1) c.Id, c.Code, c.ExpiresAt, 1
FROM Coupons c
WHERE c.CampaignId = @CampaignId AND c.UserId = @UserId

DECLARE @couponCount int = (SELECT COUNT(*) FROM @myCoupons)

IF @couponCount > 0
BEGIN
    SELECT *
    FROM @myCoupons
END
ELSE
BEGIN
    UPDATE TOP(1) Coupons
    SET UserId = @UserId, IsClaimed = 1, ClaimedAt = GETUTCDATE(), Flow = @Flow
    OUTPUT DELETED.Id, DELETED.Code, DELETED.ExpiresAt, CAST(0 AS BIT) as [ClaimedBefore]
    WHERE CampaignId = @CampaignId AND IsClaimed = 0
END

RETURN 0
Called like this from the same EF context:
var coupon = ctx.Database.SqlQuery<CouponViewModel>(
    "EXEC p_ClaimCoupon @CampaignId, @UserId, @Flow",
    new SqlParameter("CampaignId", {{CampaignId}}),
    new SqlParameter("UserId", {{userId}}),
    new SqlParameter("Flow", {{Flow}})).FirstOrDefault();
No, that's not possible. For one, that would be an egregious bug in EF; you are not the first one to put this kind of insert rate through it. Also, SCOPE_IDENTITY is explicitly safe and is the recommended practice.
These statements assume you are using a SQL Server IDENTITY column as the ID.
I admit I don't know how Azure SQL Database synchronizes the generation of unique, sequential IDs, but intuitively it must be costly, especially at your rates.
If non-sequential IDs are an option, you might want to consider generating UUIDs at the application level. I know this doesn't answer your direct question, but it would improve performance (unverified) and bypass your problem.
Update: Scratch that, Azure SQL Database isn't distributed; it's simply replicated from a single primary node. So there is no real performance gain to expect from alternatives to IDENTITY keys, and presumably the number of instances is not significant to your problem.
I think your problem may be here:
UPDATE TOP(1) Coupons
SET UserId = @UserId, IsClaimed = 1, ClaimedAt = GETUTCDATE(), Flow = @Flow
OUTPUT DELETED.Id, DELETED.Code, DELETED.ExpiresAt, CAST(0 AS BIT) as [ClaimedBefore]
WHERE CampaignId = @CampaignId AND IsClaimed = 0
This will update the UserId of the first unclaimed record it finds in the campaign. It doesn't look robust to me in the event that inserting a user failed. Are you sure that is correct?
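If the concern is a failed user insert leaving userId at 0, plus concurrent claimers grabbing the same row, a defensive variant might look like this; the guard and the locking hints are additions for illustration, not part of the original procedure:

UPDATE TOP(1) Coupons WITH (UPDLOCK, READPAST, ROWLOCK)
SET UserId = @UserId, IsClaimed = 1, ClaimedAt = GETUTCDATE(), Flow = @Flow
OUTPUT DELETED.Id, DELETED.Code, DELETED.ExpiresAt, CAST(0 AS BIT) AS [ClaimedBefore]
WHERE CampaignId = @CampaignId AND IsClaimed = 0
  AND @UserId > 0;   -- refuse to claim a coupon for a missing user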