Difference between INSERT and COPY - postgresql

As per the documentation,
Loading a large number of rows using COPY is almost always faster than using INSERT, even if PREPARE is used and multiple insertions are batched into a single transaction.
Why is COPY faster than INSERT, even when multiple insertions are batched into a single transaction?

Quite a number of reasons, actually, but the main ones are:
Typically, client applications wait for confirmation of one INSERT's success before sending the next. So there's a round-trip delay for each INSERT, plus scheduling delays, etc. (PgJDBC supports pipelining INSERTs in batches, but I'm not aware of any other clients that do).
Each INSERT has to go through the whole executor. Use of a prepared statement bypasses the need to run the parser, rewriter and planner, but there's still executor state to set up and tear down for each row. COPY does some setup once, and has an extremely low overhead for each row, especially where no triggers are involved.
The first point is the most significant. It's all about network round-trips and rescheduling delays.

This is because COPY is a single statement, while each INSERT is a separate statement. Each statement is normally subject to logging (see the manual), even inside a single transaction, so running many INSERTs is slower than running a single COPY.
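To make the "one statement, one stream" point concrete, here is a minimal sketch of how a client could prepare rows for a single COPY instead of sending one INSERT per row. It only builds the payload in COPY's default text format (tab-delimited, \N for NULL) and does not talk to a server. The helper names, the sample rows, and the table name items are assumptions made for illustration.

```python
import io

def copy_escape(value):
    """Render one value in COPY text format (NULL as \\N, special chars escaped)."""
    if value is None:
        return "\\N"
    text = str(value)
    # Escape the backslash first so the escapes added below are not doubled.
    return (text.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n")
                .replace("\r", "\\r"))

def rows_to_copy_buffer(rows):
    """Build one in-memory stream holding every row, one line per row."""
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join(copy_escape(v) for v in row) + "\n")
    buf.seek(0)
    return buf

# Hypothetical sample data: (id, name)
rows = [(1, "alice"), (2, None), (3, "tab\there")]
buf = rows_to_copy_buffer(rows)
payload = buf.getvalue()

# With a driver such as psycopg2 (an assumption, not used here), the whole
# buffer would be sent in one statement and one stream, for example:
#   cur.copy_expert("COPY items (id, name) FROM STDIN", buf)
```

The whole batch travels as one stream, so the client does not wait for a confirmation per row, which is the round-trip cost described in the first answer.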

Related

Can reads occur whilst executing a bulk \copy batch of inserts

I plan to be batch inserting a large volume of rows into a Postgres table using the \copy command once per minute. My benchmarks show I should be able to insert about 40k rows per second, and I plan to do this for 3 or 4 seconds each minute.
Are read queries on the table blocked or impacted whilst the \copy dump is occurring? And I wonder the same for inserts as well?
I'm assuming as well that tables which aren't being \copy'd into will face no blocking issues.
The manual:
The main advantage of using the MVCC model of concurrency control
rather than locking is that in MVCC locks acquired for querying
(reading) data do not conflict with locks acquired for writing data,
and so reading never blocks writing and writing never blocks reading.
That's the beauty of the MVCC model used by Postgres.
So, no, readers are not blocked. Neither in the target table, nor in any other table.
Impacted? Well, bulk loading large amounts of data incurs considerable load on the system (especially I/O) which potentially impacts all other processes competing for the same resources. So if your system is already reaching some limits, readers may be impacted this way.
Rows written by your COPY command (by way of psql's \copy) are not visible to other transactions until the transaction is committed.
Concurrent INSERT commands are not blocked either - unless you have UNIQUE (or PK) constraints / indexes where writes do compete. Avoid race conditions with overlapping unique values! And performance can be impacted even with non-unique indexes as writing to indexes involves some short-term locking.
Generally, keep indexes on your table to a minimum if you plan huge bulk writes every minute. Every index incurs additional costs for the write - and may bloat more than the table if write patterns are unfavorable. Autovacuum may have a hard time keeping up.

PostgreSQL trigger an event on table update

I'm new to PostgreSQL and I would like to know, or get some leads on, how to:
Emit an event (call an API) when a table is updated
My problem is: I have an SSO that inserts a row into an event table when a user does something (login, register, update info). I need to use these inserts in another solution (a loyalty program) in real time.
For now I have in mind to query the table every minute (in Node.js) and compare the size of the table with its size the previous minute. I think that is not a good way :)
You can do that with a trigger in principle. If the API is external to the database, you'd need a trigger function written in C or a language like PL/Perl or PL/Python that can perform the action you need.
However, unless this action can be guaranteed to be fast, it may not be a good idea to run it in a trigger. The trigger runs in the same transaction as the triggering statement, so if your trigger happens to run for a long time, you end up with a long database transaction. This has two main disadvantages:
Locks are held for a long time, which harms concurrency and hence performance, and also increases the risk of deadlocks.
Autovacuum cannot remove dead rows that were still active when the transaction started, which can lead to excessive table bloat on busy tables.
To avoid that risk, it is often better to use a queuing system: The trigger creates an entry in the queue, which is a fast action, and worker processes read and process these queue entries asynchronously outside the database.
Implementing a queue in a database is notoriously difficult, so you may want to look for existing solutions.
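As a rough illustration of the queue idea only, the sketch below separates the cheap "enqueue" step (what the trigger would do) from the slow "handle" step (what a worker would do later). It is not a production queue: it assumes a single worker, uses Python's built-in sqlite3 as a stand-in so it runs without a PostgreSQL server, and the table name event_queue and both function names are hypothetical. With several workers on PostgreSQL you would additionally need row locking, for example SELECT ... FOR UPDATE SKIP LOCKED.

```python
import sqlite3

def enqueue(conn, payload):
    """What the trigger would do: one cheap insert into the queue table."""
    conn.execute("INSERT INTO event_queue (payload) VALUES (?)", (payload,))

def drain_queue(conn, handler, batch_size=100):
    """Worker step: fetch pending entries, handle them, mark them done."""
    rows = conn.execute(
        "SELECT id, payload FROM event_queue WHERE done = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for entry_id, payload in rows:
        handler(payload)  # the slow external call happens here, not in a trigger
        conn.execute("UPDATE event_queue SET done = 1 WHERE id = ?", (entry_id,))
    conn.commit()
    return len(rows)

# sqlite3 stands in for PostgreSQL so the sketch is runnable as-is.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event_queue ("
    "id INTEGER PRIMARY KEY, payload TEXT, done INTEGER DEFAULT 0)"
)
enqueue(conn, "login:42")
enqueue(conn, "register:43")

delivered = []  # stands in for the external API being called
processed = drain_queue(conn, delivered.append)
```

The point of the split is that the inserting transaction only pays for one extra row, while the time spent in handler never extends it.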

Does trigger(after insert) on table slow down inserting into this table

I have a big table (bo_sip_cti_event) which is too large to even run queries on, so I made a table with the same structure (bo_sip_cti_event_day) and added an AFTER INSERT trigger on bo_sip_cti_event that copies all the same values into bo_sip_cti_event_day. Now I am wondering whether I have significantly slowed down inserts into bo_sip_cti_event.
So generally, does trigger after insert slow down operations on this table?
Yes, the trigger must slow down inserts.
The reason is that relational databases are ACID compliant: All actions, including side-effects like triggers, must be completed before the update transaction completes. So triggers must be executed synchronously, and that consumes CPU, and in your case I/O too, which ultimately takes more time. There's no getting around it.
The answer is yes: it is additional overhead, so obviously it takes time to finish the transaction with the additional trigger execution.
Your design makes me wonder if:
You explored all options to speed up your large table. Even billions of rows can be handled quite well if you have proper indexes etc. But it all depends on the table, the design, the data and the queries.
What exactly your trigger is doing. The table name "_day" raises questions when and where and how exactly this table is cleaned out at midnight. Hopefully not inside the trigger function, and hopefully not with a "DELETE FROM".

postgresql concurrent queries as stored procedures

I have 2 stored procedures that work on the same tables.
The first one runs for several hours and the second one is instant.
If I start the first one and then run the second one (on a second connection), the second procedure waits for the first one to finish.
It is harmless for my data if both run at the same time. How can I make that happen?
The fact that the shorter query is blocked while being on a second connection suggests that the longer query is getting an exclusive lock on the table during the query.
That suggests it is doing writes, as if they were both reads there shouldn't be any locking issues. PgAdmin can show what locks are active during the longer query and also if the shorter query is indeed blocked on the longer one.
If the longer query is indeed doing writes, it's possible that you may be able to reduce the lock contention -- by chunking it, for example, which could allow readers in between chunked updates/inserts -- but if it's an operation that requires an exclusive write lock, then it will block everybody until it's done.
It's also possible that you may be able to optimize the query such that it needs to be a lower-level lock that isn't exclusive, but that would all depend on the specifics of what the query is doing and your data.
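The chunking idea can be sketched as follows, assuming the long-running write can be split on an integer key. The function only computes the key ranges; the commented statement, the table name big_table and the column id are hypothetical and would depend on what the procedure actually does.

```python
def chunk_ranges(first_id, last_id, chunk_size):
    """Yield inclusive (start, end) id ranges covering first_id..last_id."""
    start = first_id
    while start <= last_id:
        end = min(start + chunk_size - 1, last_id)
        yield start, end
        start = end + 1

ranges = list(chunk_ranges(1, 10, 4))

# Each range would then run as its own short transaction, for example
# (hypothetical statement, executed through whatever driver is in use):
#   UPDATE big_table SET ... WHERE id BETWEEN <start> AND <end>;
#   COMMIT;
# Locks are released at every COMMIT, so other sessions get a turn
# between chunks.
```

Whether this helps depends on the lock the operation needs: it shortens how long row-level locks are held, but it does nothing for a statement that takes an exclusive lock on the whole table.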

How to get high performance under a large transaction (postgresql)

I have about 2 million rows that need to be inserted into PostgreSQL, but the performance is poor. Can I get a faster insert by splitting the large transaction into smaller ones (actually, I'd rather not do this)? Or are there any other sensible solutions?
No, the main idea to have it much faster is doing all inserts in one transaction. Multiple transactions, or using no transaction, is much slower.
And try to use copy, which is even faster: http://www.postgresql.org/docs/9.1/static/sql-copy.html
If you really have to use inserts, you can also try dropping all indexes on this table, and creating them after loading the data.
This can be interesting as well: http://www.postgresql.org/docs/9.1/static/populate.html
Possible methods to improve performance:
Use the COPY command.
Try to decrease the isolation level for the transaction if your data can deal with the consequences.
Tweak the PostgreSQL server configuration. The default memory limits are very low and will cause disk thrashing even on a server with gigabytes of free memory.
Turn off disk barriers (e.g. nobarrier flag for the ext4 file system) and/or fsync on the PostgreSQL server. Warning: this is usually unsafe but will improve your performance a lot.
Drop all the indexes on your table before inserting the data. Some indexes require a lot of work to keep up to date while rows are added. PostgreSQL may be able to build the indexes faster at the end than by continuously updating them in parallel with the insertion process. Unfortunately, there's no single command to "save" the current indexes and restore them later, so record their definitions (for example from the indexdef column of the pg_indexes view) before dropping them.
Splitting the insert job into a series of smaller transactions will help only if you have to retry the transaction because of data dependency issues with parallel transactions. If the transaction succeeds on the first try, splitting it into several smaller transactions run in sequence will only decrease your performance.
In my experience you CAN improve INSERT time-to-completion by splitting a large transaction into smaller ones, but only if the table you are inserting to has NO indexes or constraints applied, and NO default field values that would have to contend for a shared resource under multiple concurrent transactions. In that case, splitting the insert into several distinct parts and submitting each concurrently as separate processes will complete the job in significantly less time.
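For the "drop the indexes, load, then recreate them" suggestion, one possible way to keep the definitions around is sketched below. It assumes the (indexname, indexdef) pairs were read beforehand from PostgreSQL's pg_indexes view for the target table. The function name and the sample index are hypothetical, the index is assumed to be reachable through the search_path, and indexes that back PRIMARY KEY or UNIQUE constraints are deliberately not handled, because those cannot be removed with DROP INDEX.

```python
def index_rebuild_plan(index_rows):
    """Split (indexname, indexdef) pairs into DROP and CREATE statement lists."""
    drops = [
        'DROP INDEX "{}";'.format(name.replace('"', '""'))  # quote the identifier
        for name, _ in index_rows
    ]
    creates = [definition + ";" for _, definition in index_rows]
    return drops, creates

# Hypothetical result of:
#   SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'items';
saved = [
    ("items_name_idx",
     "CREATE INDEX items_name_idx ON public.items USING btree (name)"),
]
drops, creates = index_rebuild_plan(saved)

# Intended order: run the drops, bulk load the data, then run the creates.
```

Saving the definitions before dropping anything matters because the view no longer lists an index once it has been dropped.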