How does PostgreSQL handle multiple concurrent requests to foreign tables?
If two data consumers want to access the same foreign table, do they have to wait and execute their queries sequentially, or are concurrent queries supported?
The following answer is mostly for the foreign data wrapper for PostgreSQL, postgres_fdw.
If you need information about other foreign data wrappers, that will vary with the implementation of the foreign data wrapper and the capabilities of the underlying data store. For example, to have concurrent (read) requests with file_fdw, you need a file system that allows two processes to open the file for reading simultaneously.
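To make the file_fdw case concrete, here is a minimal setup sketch (server, table, and file names are made up) in which two sessions would read the same file through a foreign table at the same time:
CREATE EXTENSION file_fdw;
CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE measurements (
    id      integer,
    reading numeric
) SERVER csv_files
  OPTIONS (filename '/data/measurements.csv', format 'csv');
-- Two sessions running SELECT * FROM measurements concurrently both just open
-- the file for reading; whether that works is up to the file system.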
Concurrency of queries against the same foreign table works just like it does for local tables. It is the remote server that handles the SQL statements, locks modified rows until the transaction finishes, and so on.
So there can be arbitrarily many concurrent readers, and readers won't block writers and vice versa.
If you run UPDATEs or DELETEs with WHERE conditions that cannot be pushed down to the foreign server (check the execution plan), you can end up with more locks than you would have on a local table.
Imagine a query like this:
UPDATE remote_tab SET col = 0 WHERE <complicated condition that is true for only one row>;
On a local table, this would only lock a single row.
If the condition is too complicated to be pushed down to the foreign server, postgres_fdw will first run a query like this:
SELECT ctid, col FROM remote_tab FOR UPDATE;
That will retrieve and lock all rows of the table.
Then the WHERE condition will be applied locally, and the resulting row is updated on the foreign server:
UPDATE remote_tab SET col = 0 WHERE ctid = ...;
So in this case, concurrency and performance can suffer quite a lot.
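To check whether a condition is pushed down, look at the plan with EXPLAIN (VERBOSE); a rough sketch, reusing the remote_tab example above:
EXPLAIN (VERBOSE) UPDATE remote_tab SET col = 0 WHERE <complicated condition>;
-- If the condition is pushed down, it appears in the "Remote SQL" of the Foreign Scan;
-- otherwise the Remote SQL is a SELECT ... FOR UPDATE and the condition shows up as a local Filter.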
Related
I am trying to create multiple records in a database table, which requires me to count the number of existing rows in the table before creating them.
There will be concurrent calls of this function, and I need to ensure that each request is atomic. I am using transactions with the isolation level SERIALIZABLE. However, when both transactions start at exactly the same time, before the isolation level is set, the transactions actually run in parallel.
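A minimal sketch of that pattern (table and column names are hypothetical): under SERIALIZABLE, one of two transactions that read the same count and then insert is rolled back with a serialization failure and has to be retried.
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM items WHERE batch_id = 42;        -- count the existing rows
INSERT INTO items (batch_id, position) VALUES (42, <count + 1>);
COMMIT;  -- one of the concurrent transactions fails here (or at the INSERT) with a serialization error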
We're trying to stream data to postgres 11, using the following query:
INSERT INTO identifier_to_item
values (:id, :identifier_value, :identifier_type, :identifier_manufacturer, :delivery_timestamp_utc, :item)
ON CONFLICT (identifier_value, manufacturer, type) DO UPDATE
SET item = :item, delivery_timestamp_utc = :delivery_timestamp_utc
WHERE identifier_to_item.delivery_timestamp_utc < :delivery_timestamp_utc
Basically "insert record in the table, if it already exists -> optionally override some fields based on the data already stored in the database".
We would like to hook this query up to a message queue and run it in a highly concurrent environment with several instances. It is possible that the same row will be accessed from different connections using this query. For us it is critical that only the items with the highest delivery timestamp eventually make it into the table.
According to documentation:
ON CONFLICT DO UPDATE guarantees an atomic INSERT or UPDATE outcome; provided there is no independent error, one of those two outcomes is guaranteed, even under high concurrency.
but is accessing the fields in the UPDATE ... WHERE part also atomic and thread-safe? Is this statement using some kind of pessimistic row/table locking?
PostgreSQL does not use threads on the server side.
PostgreSQL does not implement pessimistic/optimistic row-level locking: it is left to the application to decide whether to use pessimistic or optimistic locking.
PostgreSQL does not escalate row level locks to table lock.
From the documentation:
ON CONFLICT DO UPDATE guarantees an atomic INSERT or UPDATE outcome; provided there is no independent error, one of those two outcomes is guaranteed, even under high concurrency.
It does not mention what happens on ON CONFLICT DO NOTHING.
As a test, I did an INSERT ... ON CONFLICT DO NOTHING with 10 threads thousands of times and did not see any errors.
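The statement used for such a test would look something like this, reusing the identifier_to_item example from the question:
INSERT INTO identifier_to_item
values (:id, :identifier_value, :identifier_type, :identifier_manufacturer, :delivery_timestamp_utc, :item)
ON CONFLICT (identifier_value, manufacturer, type) DO NOTHING;
-- concurrent executions either insert the row or silently skip it; none of them raise an error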
I need some advice about the following scenario.
I have multiple embedded systems supporting PostgreSQL database running at different places and we have a server running on CentOS at our premises.
Each system runs at a remote location and has multiple tables inside its database. These tables have the same names as the server's tables, but each system's table names differ from those of the other systems, e.g.:
system 1 has tables:
sys1_table1
sys1_table2
system 2 has tables:
sys2_table1
sys2_table2
I want to update the tables sys1_table1, sys1_table2, sys2_table1 and sys2_table2 on the server on every insert done on system 1 and system 2.
One solution is to write a trigger on each table, which will run on every insert into both systems' tables and insert the same data into the server's tables. This trigger will also delete the records in the systems after inserting the data into the server. The problem with this solution is that if the connection to the server cannot be established due to a network issue, then the trigger will not execute or the insert will be wasted. I have checked the following solution for this:
Trigger to insert rows in remote database after deletion
The second solution is to replicate the tables from system 1 and system 2 to the server's tables. The problem with replication is that if we delete data from the systems, it will also delete the records on the server. I could add an alternative trigger on the server's tables which will update a duplicate table, so the replicated table can become empty without affecting the data, but that would make for a long list of tables if we have more than 200 systems.
The third solution is to write a foreign table using postgres_fdw or dblink and update the data inside the server's tables, but won't this affect the data inside the server when we delete the data inside the system's table? And what will happen if there is no connectivity with the server?
The fourth solution is to write an application in Python inside each system which will make a connection to the server's database and write the data in real time; if there is no connectivity to the server, it will store the data inside sys1_table1 or sys2_table2 or whatever table the data belongs to, and after reconnecting, the code will send the table data to the server's tables.
Which option is best for this scenario? I like the trigger solution best, but is there any way to avoid data loss in case of lost connectivity to the server?
I'd go with the fourth solution, or perhaps with the third, as long as it is triggered from outside the database. That way you can easily survive connection loss.
The first solution with triggers has the problems you already detected. It is also a bad idea to start potentially long operations, like data replication across a network of uncertain quality, inside a database transaction. Long transactions mean long locks and inefficient autovacuum.
The second solution may actually also be an option if you have a recent PostgreSQL version that supports logical replication. You can use a publication WITH (publish = 'insert,update'), so that DELETE and TRUNCATE are not replicated. Replication can deal well with lost connectivity (for a while), but it is not an option if you want the data at the source to be deleted after it has been replicated.
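A rough sketch of what that could look like, with made-up names and connection details:
-- on system 1:
CREATE PUBLICATION sys1_pub FOR TABLE sys1_table1, sys1_table2
    WITH (publish = 'insert,update');
-- on the central server:
CREATE SUBSCRIPTION sys1_sub
    CONNECTION 'host=system1.example.com dbname=sysdb user=repluser'
    PUBLICATION sys1_pub;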
I'm migrating data from one table to another in an environment where long locks or downtime are not acceptable, in total about 80000 rows. Essentially, the query boils down to this simple case:
INSERT INTO table_2
SELECT * FROM table_1
JOIN table_3 on table_1.id = table_3.id
All 3 tables are being read from and could have an insert at any time. I want to just run the query above, but I'm not sure how the locking works and whether the tables will be totally inaccessible during the operation. My understanding tells me that only the affected rows (newly inserted) will be locked. Table 1 is just being selected, so no harm, and concurrent inserts are safe so table 2 should be freely accessible.
Is this understanding correct, and can I run this query in a production environment without fear? If it's not safe, what is the standard way to accomplish this?
You're fine.
If you're interested in the details, you can read up on multiversion concurrency control, or on the details of the Postgres MVCC implementation, or how its various locking modes interact, but the implications for your case are nicely summarised in the docs:
reading never blocks writing and writing never blocks reading
In short, every record stored in the database has some version number attached to it, and every query knows which versions to consider and which to ignore.
This means that an INSERT can safely write to a table without locking it, as any concurrent queries will simply ignore the new rows until the inserting transaction decides to commit.
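If you want to convince yourself, a quick experiment along these lines (reusing the table names from the question, with placeholder values) shows that the bulk insert does not block concurrent readers or writers:
-- session 1: keep the bulk copy open in a transaction
BEGIN;
INSERT INTO table_2
SELECT * FROM table_1
JOIN table_3 on table_1.id = table_3.id;
-- do not commit yet

-- session 2: both of these complete without waiting
SELECT count(*) FROM table_2;        -- sees only the old rows
INSERT INTO table_2 VALUES (...);    -- not blocked, unless it hits the same unique key as an uncommitted row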
I have a PostgreSQL 9.2.2 database that serves orders to my ERP system. Among other things, the database tables contain boolean columns indicating whether a customer has been added. The code I use extracts the rows from the database and sends them to our ERP system one at a time (single-threaded). My code works perfectly in this regard; however, over the past year our volume has grown enough to require a multi-threaded solution.
I don't think the MVCC modes will work for me because the added_customer column is only updated once a customer has been successfully added. The default MVCC modes could cause the same row to be worked on at the same time resulting in duplicate web service calls. What I want to avoid is duplicate web service calls to our ERP system as they can be rather heavy, although admittedly I am not an expert on MVCC nor the other modes that PostgreSQL provides.
My question is: How can I be sure that a row, or series of rows returned in one select statement are excluded from other queries to the database in separate threads?
You will need to record the fact that the rows are being processed somehow. You will also need to deal with concurrent attempts to mark them as being processed and handle failures with sending them to your ERP system.
You may find SELECT ... FOR UPDATE useful to get a set of rows and simultaneously lock them against updates. One approach might be for each thread to select a target row, try to add its ID to a "processing" table, then remove it in the same transaction in which you update added_customer.
If a thread fetches no candidate rows, or fails to insert, it just needs to sleep briefly and try again. If anything goes badly wrong, you should have rows left in the "processing" table that you can inspect/correct.
Of course the other option is to just grab a set of candidate rows and spawn a separate process/thread for each that communicates with the ERP. That keeps the database fetching single-threaded while allowing multiple channels to the ERP.
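A sketch of the first approach (every table and column name except added_customer is hypothetical; processing.order_id should have a unique constraint so that a second thread claiming the same row fails its INSERT and retries):
-- claim one unprocessed row and record that it is being worked on
BEGIN;
SELECT id FROM orders
WHERE NOT added_customer
  AND id NOT IN (SELECT order_id FROM processing)
ORDER BY id
LIMIT 1
FOR UPDATE;
INSERT INTO processing (order_id) VALUES (<the id just selected>);
COMMIT;
-- call the ERP web service outside any transaction, then:
BEGIN;
UPDATE orders SET added_customer = true WHERE id = <the id just selected>;
DELETE FROM processing WHERE order_id = <the id just selected>;
COMMIT;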
You can add a column user_is_processed to the table. It can hold the process ID of the backend that updates the record.
Then use a small serializable transaction to set user_is_processed, effectively locking the row for processing.
Something like:
START TRANSACTION ISOLATION LEVEL SERIALIZABLE;
UPDATE user_table
SET user_is_processed = pg_backend_pid()
WHERE <some condition>
AND user_is_processed IS NULL; -- no one is processing it now
COMMIT;
The key thing here: with SERIALIZABLE, only one transaction can successfully update the record (all other concurrent SERIALIZABLE updates will fail with ERROR: could not serialize access due to concurrent update).
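The losing backends simply catch that error and retry; the winner does the actual work and afterwards clears the flag, for example (added_customer is the column from the question above, the rest follows this sketch):
-- after the external call succeeds, in an ordinary transaction:
UPDATE user_table
SET added_customer = true,
    user_is_processed = NULL
WHERE user_is_processed = pg_backend_pid();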