How does PostgreSQL lock tables when inserting and selecting?

I'm migrating data from one table to another, about 80,000 rows in total, in an environment where long locks or downtime are not acceptable. Essentially the query boils down to this simple case:
INSERT INTO table_2
SELECT * FROM table_1
JOIN table_3 on table_1.id = table_3.id
All 3 tables are being read from and could have an insert at any time. I want to just run the query above, but I'm not sure how the locking works and whether the tables will be totally inaccessible during the operation. My understanding tells me that only the affected rows (newly inserted) will be locked. Table 1 is just being selected, so no harm, and concurrent inserts are safe so table 2 should be freely accessible.
Is this understanding correct, and can I run this query in a production environment without fear? If it's not safe, what is the standard way to accomplish this?

You're fine.
If you're interested in the details, you can read up on multiversion concurrency control, or on the details of the Postgres MVCC implementation, or how its various locking modes interact, but the implications for your case are nicely summarised in the docs:
reading never blocks writing and writing never blocks reading
In short, every record stored in the database has some version number attached to it, and every query knows which versions to consider and which to ignore.
This means that an INSERT can safely write to a table without locking it, as any concurrent queries will simply ignore the new rows until the inserting transaction decides to commit.
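If you want a concrete sanity check before running this against production, you can watch the lock modes the statement takes while it runs. This is only a sketch using the table names from the question; the expected output reflects the fact that INSERT takes ROW EXCLUSIVE on its target and SELECT takes ACCESS SHARE on its sources, and neither mode conflicts with concurrent reads or inserts.
-- Session A: start the migration but hold the transaction open
BEGIN;
INSERT INTO table_2
SELECT * FROM table_1
JOIN table_3 ON table_1.id = table_3.id;

-- Session B: inspect the locks session A is holding
SELECT relation::regclass AS locked_table, mode, granted
FROM pg_locks
WHERE relation IN ('table_1'::regclass, 'table_2'::regclass, 'table_3'::regclass);
-- Expect AccessShareLock on table_1/table_3 and RowExclusiveLock on table_2;
-- neither blocks SELECTs or other INSERTs on these tables.

-- Session A: finish
COMMIT;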

Related

How to perform a Bulk Insert in Sybase SQL

I need to insert a big amount of data (some millions of rows) and I need to do it quickly.
I read about bulk insert via ODBC in .NET and Java, but I need to perform it directly on the database.
I also read about batch inserts ("Batch Insert, Example"), but what I have tried has not seemed to work.
I'm executing an INSERT ... SELECT, but it's taking something like 0.360 s per row; this is very slow and I need to make some improvements here.
I would really appreciate some guidance here with examples and documentation if possible.
DATABASE: SYBASE ASE 15.7
Expanding on some of the comments ...
blocking, slow disk IO, and any other 'wait' events (i.e., anything other than actual insert/update activity) can be ascertained from the master..monProcessWaits table (where SPID = spid_of_your_insert_update_process) [see the P&T manual for Monitoring Tables (aka MDA tables)]; a sample query is sketched after this list
master..monProcessObject and master..monProcessStatement will show logical/physical IOs for currently running queries [again, see P&T manual for MDA tables]
master..monSysStatement will show logical/physical IOs for recently completed queries [again, see P&T manual for MDA tables]
for UPDATE statements you'll want to take a look at the query plan to see if you're suffering from a poor join order; also of key importance ... direct (fast/good) updates vs deferred (slow/bad) updates; deferred updates can occur for many reasons ... some fixable, some not ... updating indexed columns, poor join order, updates that cause page splits and/or row forwardings
RI (PK/FK) constraints can be viewed with sp_helpconstraint table_name; query plans will also show the under-the-covers joins required when performing RI (PK/FK) validations during inserts/updates/deletes
triggers are a bit harder to locate (an official sp_helptrigger doesn't show up until ASE 16); check sysobjects.[ins|upd|del]trig where name = your_table - these hold the object id(s) of any insert/update/delete triggers on the table; also check sysobjects records where type = 'TR' and deltrig = object_id(your_table) - these provide support for additional insert/update/delete triggers (I don't recall at the moment if this is just ASE 16+)
if triggers are being fired, need to review the associated query plans to make sure the inserted and deleted tables (if referenced) are driving any queries where these pseudo tables are joined with permanent tables
There are likely some areas I'm forgetting (off the top of my head) ... the key takeaway is that there could be many reasons for 'slow' DML statements.
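If you want to start with the wait events, a query along these lines is a reasonable first look (a sketch only; the column names are taken from the documented MDA tables and are worth double-checking against your ASE 15.7 installation, and <your_spid> is the spid of your insert/update process):
-- what the process is waiting on (anything other than actual work shows up here)
select w.WaitEventID, i.Description, w.Waits, w.WaitTime
from master..monProcessWaits w
join master..monWaitEventInfo i on i.WaitEventID = w.WaitEventID
where w.SPID = <your_spid>
order by w.WaitTime desc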
One (relatively) quick way to find out if RI (PK/FK) constraints or triggers are at play ...
set showplan on
go
-- your insert/update/delete statement(s) here
go
Then review the resulting query plan(s); if you see references to any tables other than the ones explicitly listed in the insert/update/delete statements then you're likely dealing with RI constraints and/or triggers.

How do transactions work in the context of reads to the database?

I am using transactions to make changes to a SQL database. As I understand it, this means that changes to the database will happen in an all-or-nothing fashion. What I want to know is, does this have any guarantees for reads? For example, suppose I have some (pseudo)-code like this:
1) start TRANSACTION
2) INSERT INTO users ... // insert some data
3) count = SELECT COUNT(*) FROM ... // count something in the database
4) if count > 10: // do something based on the read
5) INSERT INTO other_table ... // write based on the read
6) COMMIT TRANSACTION
In this code, I'm doing an INSERT, followed by a SELECT, and then conditionally doing another INSERT based on the outcome of the SELECT.
So my question is, if another process modifies the database between steps (3) and (5), what happens to the count variable, and to my transaction?
If it makes a difference, I am using PostgreSQL.
As Xin pointed out, it depends on the isolation level.
At the default READ COMMITTED level, records from other sessions will become visible as they are committed; you would see the same records if you didn't start a transaction at all (though of course, other processes would see your inserts appear at different times).
With REPEATABLE READ, your queries will not see any records committed by other sessions after your transaction starts. But while you don't have to worry about the result of SELECT COUNT(*) changing during your transaction, you can't assume that this result will still be accurate by the time you commit.
Using SERIALIZABLE provides the strongest guarantee: if your script does the right thing when given exclusive access to the database, then it will do the right thing in the presence of other serialisable transactions (or it will fail outright). However, this means that all transactions which might interfere with yours must be using the same isolation level (which comes at a cost), and all must be prepared to retry their transaction in the event of a serialisation failure.
When serialisable transactions are not an option, you generally guard against race conditions by explicitly locking things against concurrent writes. It's often enough to lock a selection of records, but you can't exactly lock the result of a COUNT(*); in your case, you'd probably need to lock the whole table.
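To make that concrete, the serialisable version of the pseudo-code might look like this (a sketch only; the column names and values are placeholders, and the caller has to be ready to retry the whole transaction if it fails with a serialisation error, SQLSTATE 40001):
START TRANSACTION ISOLATION LEVEL SERIALIZABLE;

INSERT INTO users (name) VALUES ('alice');

-- the count is evaluated against this transaction's snapshot
SELECT COUNT(*) FROM users;

-- only executed if the application-side check on the count passes
INSERT INTO other_table (note) VALUES ('more than 10 users');

COMMIT;
-- on a serialisation failure, roll back and rerun the whole transaction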
I am not working with PostgreSQL, but I think I can answer your question. Think of every query as running in parallel. I say so because there are two transactions: while you insert into a, others can insert into b; then, when you check b, whether you can see the new data depends on your isolation setting (read committed or dirty read).
Also please note that databases provide locks: you can lock a table to prevent others from altering it before you commit your transaction.
Hope this helps.

SQL Transactions - allow reading original data before commit (snapshot?)

I am facing an issue that is probably quite easy to solve; I am just new to advanced transaction settings.
Every 30 minutes I run an INSERT query that pulls the latest data from a linked server into a table on my client's server, which we can call ImportTable. For this I have a simple job that looks like this:
BEGIN TRAN
DELETE FROM ImportTable
INSERT INTO ImportTable (columns)
SELECT (columns)
FROM QueryGettingResultsFromLinkedServer
COMMIT
The thing is, each time the job runs, ImportTable is locked for the duration of the query (2-5 minutes) and nobody can read its records. I want the table to be read-accessible all the time, with as little downtime as possible.
Now, I read that it is possible to allow SNAPSHOT ISOLATION in the database settings, which could probably solve my problem (it is set to FALSE at the moment), but I have never played with different transaction isolation levels, and as this is not my DB but my client's, I'd rather not alter any database settings unless I am sure it won't break something.
I know I could have an intermediary table that the records are inserted to and then inserted to the final table and that is certainly a possible solution, I was just hoping for something more sophisticated and learning something new in the process.
PS: My client's server & database is fairly new and barely used, so I expect very little impact if I change some settings, but still, I cannot just randomly change various settings for learning purposes.
Many thanks!
Inserts won't normally block the table, unless the lock is escalated to table level. In this case you are deleting the whole table first and inserting the data again - why not insert only the changed data? For the query you are running, read committed snapshot isolation (RCSI) will help, but it has the added impact of row versioning, which means SQL Server will store versions of the changed rows in tempdb; the relevant settings are sketched below.
Please see the MCM isolation videos by Kimberly Tripp for an in-depth understanding, and don't forget to test in a staging environment first.
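For reference, the database-level switches being described look like this (a sketch for SQL Server; YourClientDb is a placeholder, and flipping READ_COMMITTED_SNAPSHOT needs either an otherwise idle database or WITH ROLLBACK IMMEDIATE, which kills open transactions, so agree on it with the client first):
-- row-versioned reads for the default READ COMMITTED level (RCSI)
ALTER DATABASE YourClientDb SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

-- optionally also allow explicit SNAPSHOT isolation per session
ALTER DATABASE YourClientDb SET ALLOW_SNAPSHOT_ISOLATION ON;

-- with ALLOW_SNAPSHOT_ISOLATION enabled, a reader can opt in explicitly
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
SELECT * FROM ImportTable;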
You are making this harder than it needs to be.
The problem is the 2-5 minutes that you are letting be part of the transaction.
It is only a few thousand rows - that part takes just a few milliseconds.
If you need ImportTable to be available during those few milliseconds, then use snapshot isolation.
DELETE FROM ImportTableStaging;
INSERT INTO ImportTableStaging (columns)
SELECT (columns)
FROM QueryGettingResultsFromLinkedServer;
BEGIN TRAN
DELETE FROM ImportTable
INSERT INTO ImportTable WITH (TABLOCK) (columns)
SELECT (columns)
FROM ImportTableStaging
COMMIT
If you are worried about concurrent updates to ImportTableStaging, then use a #temp table, as sketched below.
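With a #temp table, the job could look roughly like this (a sketch reusing the placeholder names from the answer above):
-- stage outside the transaction; the slow linked-server pull happens here
SELECT (columns)
INTO #ImportTableStaging
FROM QueryGettingResultsFromLinkedServer;

-- swap the data inside a short transaction
BEGIN TRAN
DELETE FROM ImportTable
INSERT INTO ImportTable WITH (TABLOCK) (columns)
SELECT (columns)
FROM #ImportTableStaging
COMMIT

DROP TABLE #ImportTableStaging;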

PostgreSQL: OK to allow errors?

Before I try to insert a row into a PostgreSQL table, should I query whether the insert would violate a constraint?
I do check first when a failed insert would cause unwanted side effects (e.g., consuming an auto-increment value).
But, if there are no possible side effects, is it OK to just blindly try to insert into a table? Or, is it better practice to prevent errors by anticipating them when possible (as advised in Objective-C)?
Also, when performing the insert inside an SQL function, will other queries (e.g., CTEs) inside the function get rolled back if the insert fails?
In general, testing beforehand is not a good idea because it requires you to explicitly lock tables to prevent other clients from changing or inserting data between your test and your insert. Explicit locking is bad for concurrency.
Serials getting auto-incremented by failed inserts is generally not a problem. Just don't assume the values inserted into the database are consecutive.
A database and Obj-C are two completely different things. Let the database check for problems; it is much easier to add the appropriate constraints to your schema than it is to check everything in your client program.
The default is to roll back to the start of the transaction, but you can control that with savepoints and ROLLBACK TO SAVEPOINT (sketched below). However, a CTE is part of the query, and queries are always rolled back completely when part of them fails. You might be able to work around that by splitting the CTE off into a separate query that creates a temp table; then you can use the temp table instead of the CTE.
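Here is a minimal sketch of the savepoint pattern, assuming a hypothetical items table with a unique constraint on name:
BEGIN;

SAVEPOINT before_insert;
INSERT INTO items (name) VALUES ('widget');

-- if the INSERT raised an error (e.g. a unique_violation), the transaction is
-- aborted until you return to the savepoint:
--   ROLLBACK TO SAVEPOINT before_insert;
-- if it succeeded, discard the savepoint instead:
RELEASE SAVEPOINT before_insert;

-- either way, the rest of the transaction carries on normally
COMMIT;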

How can I be sure that a row, or series of rows, returned in one select statement is excluded from other queries to the database in separate threads

I have a PostgreSQL 9.2.2 database that serves orders to my ERP system. The database tables contain, among other things, boolean columns indicating whether a customer has been added. The code I use extracts the rows from the database and sends them to our ERP system one at a time (single threaded). My code works perfectly in this regard; however, over the past year our volume has grown enough to require a multi-threaded solution.
I don't think the MVCC modes will work for me because the added_customer column is only updated once a customer has been successfully added. The default MVCC modes could cause the same row to be worked on at the same time resulting in duplicate web service calls. What I want to avoid is duplicate web service calls to our ERP system as they can be rather heavy, although admittedly I am not an expert on MVCC nor the other modes that PostgreSQL provides.
My question is: how can I be sure that a row, or series of rows, returned in one SELECT statement is excluded from other queries to the database in separate threads?
You will need to record the fact that the rows are being processed somehow. You will also need to deal with concurrent attempts to mark them as being processed and handle failures with sending them to your ERP system.
You may find SELECT ... FOR UPDATE useful to get a set of rows and simultaneously lock them against updates. One approach might be for each thread to select a target row, try to add its ID to a "processing" table, then remove it in the same transaction you update added_customer.
If a thread fetches no candidate rows, or fails to insert then it just needs to sleep briefly and try again. If anything goes badly wrong then you should have rows left in the "processing" table that you can inspect/correct.
Of course the other option is to just grab a set of candidate rows and spawn a separate process/thread for each that communicates with the ERP. That keeps the database fetching single-threaded while allowing multiple channels to the ERP.
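A sketch of the SELECT ... FOR UPDATE plus "processing" table idea described above; the orders table, its id column and the processing table (with a primary key on order_id) are hypothetical names standing in for your actual schema:
BEGIN;

-- claim one unprocessed row and lock it against concurrent claims
SELECT id
FROM orders
WHERE added_customer = false
ORDER BY id
LIMIT 1
FOR UPDATE;

-- record that this row is being worked on; a concurrent thread that picked the
-- same id will fail here on the primary key and can sleep briefly and retry
INSERT INTO processing (order_id, claimed_at) VALUES (<that id>, now());

COMMIT;

-- ... call the ERP web service outside any open transaction ...

BEGIN;
UPDATE orders SET added_customer = true WHERE id = <that id>;
DELETE FROM processing WHERE order_id = <that id>;
COMMIT;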
You can add a column user_is_proccesed to the table. It can hold the process ID of the backend that updates the record.
Then use a small serializable transaction to set user_is_proccesed, effectively locking the row for processing.
Something like:
START TRANSACTION ISOLATION LEVEL SERIALIZABLE;
UPDATE user_table
SET user_is_proccesed = pg_backend_pid()
WHERE <some condition>
AND user_is_proccesed IS NULL; -- no one is processing it now
COMMIT;
The key thing here: with SERIALIZABLE, only one transaction can successfully update the record (all other concurrent serializable updates will fail with ERROR: could not serialize access due to concurrent update), so the losing transactions simply retry in a new transaction.