How to disallow parallel INSERTs in PostgreSQL? - postgresql

I have an ON INSERT trigger in PostgreSQL 9.2, which does some calculation and injects extra data into every row being inserted. The problem is that I don't want any two INSERT transactions to happen in parallel. I want them to go one by one, because my calculations have to be incremental and take into account previous calculation results. Is it possible to achieve this somehow?
I'm trying to create a rolling balance on a list of payments:
id | amount | balance
----------------------
1 | 50 | 50
2 | 130 | 180
3 | -75 | 105
4 | 15 | 120
The balance has to be calculated on every INSERT as the previous balance plus the new payment amount. If INSERTs happen in parallel, I get duplicates in the balance column, which is logical. I need a way to force them to execute in strict sequential order.

SERIALIZABLE transactions didn't help, mostly because of the problem explained here. I simply get errors on most transactions: could not serialize access due to read/write dependencies among transactions.
What did help was an explicit LOCK before every INSERT, inside the same transaction (LOCK TABLE only works within a transaction block):
LOCK TABLE receipt IN SHARE ROW EXCLUSIVE MODE;
INSERT INTO receipt ...
All inserts now happen one after another.
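Why the lock helps can be shown without a database. The following is only a minimal sketch: the `Receipts` class is a hypothetical in-memory stand-in for the receipt table, and a `threading.Lock` plays the role of `LOCK TABLE`. The point is that "read previous balance" and "insert new row" must form one critical section.

```python
import threading


class Receipts:
    """Hypothetical in-memory stand-in for the receipt table."""

    def __init__(self):
        self.rows = []  # each row is (id, amount, balance)
        self._table_lock = threading.Lock()  # plays the role of LOCK TABLE

    def insert(self, amount):
        # Reading the previous balance and appending the new row happen
        # under one lock, so no two writers see the same previous balance.
        with self._table_lock:
            previous = self.rows[-1][2] if self.rows else 0
            self.rows.append((len(self.rows) + 1, amount, previous + amount))


def run_concurrent_inserts(amounts):
    """Insert every amount from its own thread and return the final rows."""
    table = Receipts()
    threads = [threading.Thread(target=table.insert, args=(a,)) for a in amounts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return table.rows
```

The order in which the threads get the lock is not fixed, but every row's balance equals the previous row's balance plus its own amount, which is the property the trigger needs.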

PostgreSQL 13 - Performance Improvement to delete large table data

I am using PostgreSQL 13 and have intermediate-level experience with PostgreSQL.
I have a table named tbl_employee. It stores employee details for a number of customers.
Below is my table structure, with the data type, index name and index access method of each column:
Column | Data Type | Index name | Idx Access Type
-------------+-----------------------------+---------------------------+---------------------------
id | bigint | |
name | character varying | |
customer_id | bigint | idx_customer_id | btree
is_active | boolean | idx_is_active | btree
is_delete | boolean | idx_is_delete | btree
I want to delete the employees of a specific customer by customer_id.
The table holds 1,800,000+ records in total.
When I execute the query below for customer_id 1001, it returns 85,000.
SELECT COUNT(*) FROM tbl_employee WHERE customer_id=1001;
When I delete this customer's records with the query below, it takes 2 hours and 45 minutes.
DELETE FROM tbl_employee WHERE customer_id=1001
Problem
I expect this query to take less than 1 minute. Is it normal for it to take this long, or is there a way to optimise it and reduce the execution time?
Below is Explain output of delete query
The values of seq_page_cost = 1 and random_page_cost = 4.
Below are no.of pages occupied by the table "tbl_employee" from pg_class.
Please guide. Thanks
While this statement is running:
DELETE FROM tbl_employee WHERE customer_id=1001
is any other operation accessing this table? If this statement is the only one accessing the table, I don't think it should take that long.
In RDBMS systems each SQL statement is also a transaction, unless it's wrapped in BEGIN; and COMMIT; to make multi-statement transactions.
It's possible your multirow DELETE statement is generating a very large transaction that's forcing PostgreSQL to thrash -- to spill its transaction logs from RAM to disk.
You can try repeating this statement until you've deleted all the rows you need to delete:
DELETE FROM tbl_employee WHERE customer_id=1001 LIMIT 1000;
Doing it this way will keep your transactions smaller, and may avoid the thrashing.
PostgreSQL does not support LIMIT in a DELETE statement, so
DELETE FROM tbl_employee WHERE customer_id=1001 LIMIT 1000;
will not work.
To make the batch delete smaller, you can try this:
DELETE FROM tbl_employee WHERE ctid IN (SELECT ctid FROM tbl_employee where customer_id=1001 limit 1000)
Repeat this until there is nothing left to delete.
Here "ctid" is a system column of PostgreSQL tables. It identifies the physical location of a row version within its table.
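The "repeat until nothing is left" loop could look roughly like this. It is a sketch under stated assumptions, not a tuned solution: it runs against an in-memory SQLite database so it is self-contained, and SQLite's `rowid` stands in for PostgreSQL's `ctid`. The helper name `delete_in_batches` and the demo data are made up for illustration.

```python
import sqlite3


def delete_in_batches(conn, customer_id, batch_size=1000):
    """Delete one customer's rows in small transactions; return rows deleted."""
    total = 0
    while True:
        # SQLite stand-in: rowid is used where PostgreSQL would use ctid.
        cur = conn.execute(
            "DELETE FROM tbl_employee WHERE rowid IN ("
            " SELECT rowid FROM tbl_employee WHERE customer_id = ? LIMIT ?)",
            (customer_id, batch_size),
        )
        conn.commit()  # every batch is committed as its own transaction
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


def demo():
    """Build a small made-up table and delete customer 1001 in batches."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tbl_employee ("
        " id INTEGER PRIMARY KEY, name TEXT, customer_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO tbl_employee (name, customer_id) VALUES (?, ?)",
        [("emp%d" % i, 1001 if i % 2 else 2002) for i in range(5000)],
    )
    conn.commit()
    deleted = delete_in_batches(conn, 1001, batch_size=1000)
    remaining = conn.execute("SELECT COUNT(*) FROM tbl_employee").fetchone()[0]
    conn.close()
    return deleted, remaining
```

Committing after each batch is the design choice that matters: it keeps every transaction small, at the cost of the delete no longer being atomic as a whole.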

Why does PostgreSQL serializable transaction think this as conflict?

As I understand it, PostgreSQL uses some kind of monitoring to guess whether there is a conflict at the serializable isolation level. Many examples are about modifying the same resource in concurrent transactions, and there serializable transactions work great. But I want to test concurrency in another way.
I decided to test 2 users each modifying their own account balance, hoping PostgreSQL would be smart enough not to treat that as a conflict, but the result is not what I wanted.
Below is my table. There are 4 accounts belonging to 2 users; each user has a checking account and a saving account.
create table accounts (
id serial primary key,
user_id int,
type varchar,
balance numeric
);
insert into accounts (user_id, type, balance) values
(1, 'checking', 1000),
(1, 'saving', 1000),
(2, 'checking', 1000),
(2, 'saving', 1000);
The table data is like this:
id | user_id | type | balance
----+---------+----------+---------
1 | 1 | checking | 1000
2 | 1 | saving | 1000
3 | 2 | checking | 1000
4 | 2 | saving | 1000
Now I run 2 concurrent transactions, one per user. In each transaction I reduce the checking account by some amount and check that user's total balance. If it's greater than 1000, I commit; otherwise I roll back.
The user 1's example:
begin;
-- Reduce checking account for user 1
update accounts set balance = balance - 200 where user_id = 1 and type = 'checking';
-- Make sure user 1's total balance > 1000, then commit
select sum(balance) from accounts where user_id = 1;
commit;
User 2's transaction is the same, except for user_id = 2 in the WHERE clause:
begin;
update accounts set balance = balance - 200 where user_id = 2 and type = 'checking';
select sum(balance) from accounts where user_id = 2;
commit;
I commit user 1's transaction first, and it succeeds as expected. When I then commit user 2's transaction, it fails:
ERROR: could not serialize access due to read/write dependencies among transactions
DETAIL: Reason code: Canceled on identification as a pivot, during commit attempt.
HINT: The transaction might succeed if retried.
My questions are:
Why does PostgreSQL think these 2 transactions conflict? I added a user_id condition to every statement and never modify user_id, but that has no effect.
Does that mean serializable transactions don't allow concurrent transactions on the same table, even if their reads and writes don't conflict?
Doing something per user is very common. Should I avoid serializable transactions for operations that happen very frequently?
You can fix this problem with the following index:
CREATE INDEX accounts_user_idx ON accounts(user_id);
Since there is so little data in your example table, you will have to tell PostgreSQL to use an index scan:
SET enable_seqscan=off;
Now your example will work!
If that seems like black magic, take a look at the query execution plans of your SELECT and UPDATE statements.
Without the index both will use a sequential scan on the table, thereby reading all rows in the table. So both transactions will end up with a SIReadLock on the whole table.
This triggers the serialization failure.
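To my knowledge serializable is the highest isolation level and therefore allows the least concurrency. PostgreSQL does not literally run such transactions one after the other, though: they execute concurrently, and one of them is cancelled with a serialization failure when a dangerous read/write dependency is detected.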
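Since the HINT in the error says the transaction might succeed if retried, applications using serializable transactions normally wrap them in a retry loop. A minimal sketch follows; `SerializationFailure` is a hypothetical stand-in for whatever exception your driver raises for SQLSTATE 40001, and `run_with_retry` is a made-up helper name.

```python
import time


class SerializationFailure(Exception):
    """Hypothetical stand-in for the driver's SQLSTATE 40001 error."""


def run_with_retry(transaction, max_attempts=5, base_delay=0.0):
    """Call transaction() until it succeeds or max_attempts is reached.

    transaction is assumed to run BEGIN ... COMMIT itself and to roll back
    before raising SerializationFailure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except SerializationFailure:
            if attempt == max_attempts:
                raise  # give up and surface the last failure
            time.sleep(base_delay * attempt)  # simple linear backoff
```

The whole transaction has to be re-run from the start, including the SELECT that drives the commit-or-rollback decision, because the data it read may have changed.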

How to update large amount of rows in PostgreSQL?

I need to update thousands of rows in a table. For example, I have 1000 rows with ids - 1, 2.. 1000:
mytable:
| id | value1 | value2 |
| 1 | Null | Null |
| 2 | Null | Null |
...
| 1000 | Null | Null |
Now I need to change the first 10 rows. I can do it like this:
UPDATE mytable SET value1=42, value2=111 WHERE id=1
...
UPDATE mytable SET value1=42, value2=111 WHERE id=10
This requires too many requests and is not very fast, so I decided on this optimization:
UPDATE mytable SET value1=42 WHERE id in (1, 2, 3.. 10)
UPDATE mytable SET value2=111 WHERE id in (1, 2, 3.. 10)
Note: in this case I could actually write SET value1=42, value2=111, but in real-world applications the sets of ids are not the same: for some rows I need to set value1, for others value2, and for some subset I need to set both. Because of that I need two queries.
The problem is that I have a very large number of ids. These queries are about 1 MB each!
Q1: Is this the right way to optimize these updates?
Q2: Is it OK to send queries that large? Can I get a faster update by dividing the query into several smaller parts?
I can't select the rows with a WHERE condition; I just have lots of row ids in my program.
Create a TEMPORARY TABLE and populate it with your target ids and new values. Then use UPDATE with FROM clause to join to that target and do it in a single command.
In general, whenever you have large numbers of id/values like this life gets easier if you move them into the database first.
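A rough sketch of the staging-table approach, run against in-memory SQLite so it is self-contained. The table name `pending`, the helper `apply_updates` and the convention that NULL in the staging table means "leave this column unchanged" are assumptions made for the example. SQLite is driven with a correlated subquery here for portability; the PostgreSQL `UPDATE ... FROM` form is shown in the comment.

```python
import sqlite3


def apply_updates(conn, updates):
    """Apply (id, value1, value2) tuples in one UPDATE; None keeps a column."""
    conn.execute(
        "CREATE TEMP TABLE pending ("
        " id INTEGER PRIMARY KEY, value1 INTEGER, value2 INTEGER)"
    )
    conn.executemany("INSERT INTO pending VALUES (?, ?, ?)", updates)
    # PostgreSQL form of the same statement:
    #   UPDATE mytable m
    #   SET value1 = COALESCE(p.value1, m.value1),
    #       value2 = COALESCE(p.value2, m.value2)
    #   FROM pending p
    #   WHERE m.id = p.id;
    conn.execute(
        "UPDATE mytable SET"
        " value1 = COALESCE("
        "  (SELECT p.value1 FROM pending p WHERE p.id = mytable.id), value1),"
        " value2 = COALESCE("
        "  (SELECT p.value2 FROM pending p WHERE p.id = mytable.id), value2)"
        " WHERE id IN (SELECT id FROM pending)"
    )
    conn.commit()
    conn.execute("DROP TABLE pending")


def demo():
    """Update three of 1000 rows, each with a different set of columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mytable ("
        " id INTEGER PRIMARY KEY, value1 INTEGER, value2 INTEGER)"
    )
    conn.executemany(
        "INSERT INTO mytable (id) VALUES (?)", [(i,) for i in range(1, 1001)]
    )
    conn.commit()
    apply_updates(conn, [(1, 42, 111), (2, 42, None), (3, None, 111)])
    rows = conn.execute(
        "SELECT id, value1, value2 FROM mytable WHERE id <= 4 ORDER BY id"
    ).fetchall()
    conn.close()
    return rows
```

One staging row per id covers the "value1 for some rows, value2 for others, both for a subset" case from the question without sending separate id lists per column.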
Q1: Is this the right way to optimize these updates?
It should be still possible to write it in one single query using the CASE ... WHEN syntactic construct:
UPDATE mytable SET
value1 =
CASE
WHEN id IN ( 1, 2, 3, 10) THEN 42
WHEN id IN (11,12,13, 20) THEN 43
ELSE value1
END,
value2 =
CASE
WHEN id IN ( 1, 2, 3, 10) THEN 42
WHEN id IN (11,12,13, 20) THEN 43
ELSE value2
END;
etc.
You mentioned that you may have to update different columns for different rows, and the above lets you do that in one single query.
Update: I overlooked the fact that speed was your main concern (you said "optimize"), and my answer is not correct in that regard. Using a temporary table as explained in the chosen answer leads to much better performance.
Q2: Is it OK to send queries that large? Can I get a faster update by dividing the query into several smaller parts?
I don't think PostgreSQL should have much of a problem handling a large query (even one much larger than 1 MB). Remember that SQL database initialization scripts can be far larger than 1 MB.

DB2 table partitioning and delete old records based on condition

I have a table with few million records.
___________________________________________________________
| col1 | col2 | col3 | some_indicator | last_updated_date |
-----------------------------------------------------------
| | | | yes | 2009-06-09.12.2345|
-----------------------------------------------------------
| | | | yes | 2009-07-09.11.6145|
-----------------------------------------------------------
| | | | no | 2009-06-09.12.2345|
-----------------------------------------------------------
I have to delete records that are older than a month and have some_indicator=no.
I also have to delete records that are older than a year and have some_indicator=yes. This job will run every day.
Can I use the DB2 partitioning feature for this requirement?
How can I partition the table using the last_updated_date column and the two some_indicator values above?
One partition should contain the records that fall under the monthly delete criterion, and the other the records that fall under the yearly one.
Are there any performance issues associated with table partitioning if this table is frequently read and upserted?
Any other best practices for this requirement would be a great help.
I haven't done much with partitioning (I've mostly worked with DB2 on the iSeries), but from what I understand, you don't generally want to be shuffling things between partitions (ie - making the partition '1 month ago'). I'm not even sure if it's even possible. If it was, you'd have to scan some (potentially large) portion of your table every day, just to move it (select, insert, delete, in a transaction).
Besides which, partitioning is a DB Admin problem, and it sounds like you just have a DB User problem - namely, deleting 'old' records. I'd just do this in a couple of statements:
DELETE FROM myTable
WHERE some_indicator = 'no'
AND last_updated_date < TIMESTAMP(CURRENT_DATE - 1 MONTH, TIME('00:00:00'))
and
DELETE FROM myTable
WHERE some_indicator = 'yes'
AND last_updated_date < TIMESTAMP(CURRENT_DATE - 1 YEAR, TIME('00:00:00'))
.... and you can pretty much ignore using a transaction, as you want the rows gone.
(as a side note, using 'yes' and 'no' for indicators is terrible. If you're not on a version that has a logical (boolean) type, store character '0' (false) and '1' (true))

lock the rows until next select postgres

Is there a way in Postgres to lock rows until the next SELECT query is executed from the same system? Note that there will be no update of the locked rows.
The scenario is something like this.
Suppose table1 contains data like
id | txt
-------------------
1 | World
2 | Text
3 | Crawler
4 | Solution
5 | Nation
6 | Under
7 | Padding
8 | Settle
9 | Begin
10 | Large
11 | Someone
12 | Dance
If sys1 executes
select * from table1 order by id limit 5;
then rows with id 1 to 5 should be locked against other systems that execute SELECT statements concurrently.
If sys1 later executes another SELECT query like
select * from table1 where id>10 order by id limit 5;
then the previously locked rows should be released.
I don't think this is possible. You cannot block a read only access to a table (unless that select is done FOR UPDATE)
As far as I can tell, the only chance you have is to use the pg_advisory_lock() function.
http://www.postgresql.org/docs/current/static/functions-admin.html#FUNCTIONS-ADVISORY-LOCKS
But this requires a "manual" release of the locks obtained through it. You won't get an automatic unlocking with that.
To lock the rows you would need something like this:
select pg_advisory_lock(id), *
from
(
select * from table1 order by id limit 5
) t
(Note the use of the derived table for the LIMIT part. See the manual link I posted for an explanation)
Then you need to store the retrieved IDs and later call pg_advisory_unlock() for each ID.
If each process is always releasing all IDs at once, you could simply use pg_advisory_unlock_all() instead. Then you will not need to store the retrieved IDs.
Note that this will not prevent others from reading the rows using "normal" selects. It will only work if every process that accesses that table uses the same pattern of obtaining the locks.
It looks like you really have a transaction which transcends the borders of your database, and all the change happens in an another system.
My idea is to use SELECT ... FOR UPDATE NOWAIT to lock the relevant rows, then offload the data into the other system, then roll back to unlock the rows. No two SELECT ... FOR UPDATE queries will lock the same row, and the second one will fail immediately rather than wait and then proceed.
But you don't seem to mark offloaded records in any way; I don't see why two non-consecutive selects won't happily select overlapping range. So I'd still update the records with a flag and/or a target user name and would only select records with the flag unset.
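The flag idea could be sketched like this, again against in-memory SQLite so the example runs on its own. The `claimed_by` column and the `claim_rows` helper are hypothetical names. On PostgreSQL with several connections the claim would need row locking as well, for example an `UPDATE ... RETURNING` driven by a subquery that uses `FOR UPDATE SKIP LOCKED` (available from 9.5); that part is not modelled here.

```python
import sqlite3

WORDS = ["World", "Text", "Crawler", "Solution", "Nation", "Under",
         "Padding", "Settle", "Begin", "Large", "Someone", "Dance"]


def make_table():
    """Create table1 from the question plus a hypothetical claimed_by flag."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table1 ("
        " id INTEGER PRIMARY KEY, txt TEXT, claimed_by TEXT)"
    )
    conn.executemany("INSERT INTO table1 (txt) VALUES (?)", [(w,) for w in WORDS])
    conn.commit()
    return conn


def claim_rows(conn, worker, batch_size=5):
    """Mark the next unclaimed rows for worker and return their ids."""
    with conn:  # select and update are committed as one transaction
        ids = [row[0] for row in conn.execute(
            "SELECT id FROM table1 WHERE claimed_by IS NULL"
            " ORDER BY id LIMIT ?", (batch_size,))]
        conn.executemany(
            "UPDATE table1 SET claimed_by = ? WHERE id = ?",
            [(worker, i) for i in ids],
        )
    return ids
```

Because claimed rows are skipped by the `claimed_by IS NULL` filter, two consecutive callers receive disjoint id ranges instead of overlapping ones.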
I tried both select...for update and pg_try_advisory_lock and managed to get near my requirement.
/* rows do get locked, but LIMIT is the problem */
select * from table1 where pg_try_advisory_lock( id) limit 5;
.
.
$_SESSION['rows'] = $rowcount; // no of row to process
.
.
/* after processing each word */
$_SESSION['rows'] -=1;
.
.
/*and finally unlock locked rows*/
if ($_SESSION['rows']===0)
select pg_advisory_unlock_all();
But there are two problems with this:
1. Since LIMIT is applied before the lock, every instance keeps trying to lock the same rows.
2. I'm not sure whether pg_advisory_unlock_all will unlock only the rows locked by the current instance or those locked by all instances.