PostgreSQL: Optimizing bulk INSERT INTO/SELECT FROM - postgresql

In my PostgreSQL schema I have a jobs table and an accounts table. Once a day I need to schedule a job for each account by inserting one row per account into the jobs table. This can be done with a simple INSERT INTO .. SELECT FROM statement, but is there any empirical way to know whether this bulk insert is straining my DB and whether I should chunk the inserts instead?
Postgres often does miraculous work, so I have no idea whether bulk inserting 500k records at once is better than, say, 100 batches of 5k. The bulk insert works today but can take minutes to complete.
One additional data point: the jobs table has a uniqueness constraint on account ID, so this statement includes an ON CONFLICT clause too.

In PostgreSQL, it doesn't matter how many rows you modify in a single transaction, so it is preferable to do everything in a single statement; that way you don't end up with half the work done in case of a failure. The only consideration is that transactions should not run for too long, but if this happens once a day, it is no problem.
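The single-statement approach described in the question can be sketched as follows. The table and column names (jobs, accounts, account_id, scheduled_at) are assumptions, since the question does not show the schema:

```sql
-- One statement does all the work atomically; the uniqueness
-- constraint on account_id is handled by ON CONFLICT, so accounts
-- that already have a job for today are simply skipped.
INSERT INTO jobs (account_id, scheduled_at)
SELECT a.id, now()
FROM accounts AS a
ON CONFLICT (account_id) DO NOTHING;
```

If the statement fails for any reason, the whole insert rolls back and no partial batch of jobs is left behind.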

Related

Temp table updates are slower than normal table updates in PostgreSQL

I have a situation where updates on my temp table are slow. Here is the scenario:
A temp table is created once per session; from then on, all insert, update and delete operations run against it until the session ends.
First I insert the rows, and based on those rows I update other columns, but these updates are slow compared to a normal table. I checked the performance by swapping in a normal table: the normal table takes around 50 to 60 seconds, whereas the temp table takes nearly 5 minutes.
I tried ANALYZE on the temp table and got improved performance: with ANALYZE, the updates complete in about 50 seconds.
I tried types as well, but no luck.
The record count in the temp table is 480.
Can anyone help improve the performance of the temp table without ANALYZE, or suggest an alternative to bulk collect and bulk insert into user-defined types?
All the above operations are in PostgreSQL.
The lack of information in your question forces me to guess, but if all other things are equal, the difference is probably that you don't have accurate statistics on the temporary table. For regular tables, autovacuum takes care of that automatically, but autovacuum cannot process temporary tables (they are only visible to the session that created them), so you have to call ANALYZE explicitly to gather table statistics.
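The fix the answer describes is a single statement run after the initial bulk load; the table name here is a placeholder:

```sql
-- After the initial bulk INSERT into the temporary table,
-- gather statistics so the planner can choose good plans
-- for the subsequent UPDATEs and DELETEs.
ANALYZE my_temp_table;
```

There is no way to have this happen automatically for a temp table; it has to be run in the session, typically right after the bulk insert and before the first large UPDATE.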

Insert bulk data in PostgreSQL / TimescaleDB and manage errors

I have a script that selects rows from InfluxDB and bulk inserts them into TimescaleDB.
I am inserting the data 2000 rows at a time, to make it faster.
The thing is, when I get one error, all 2000 rows are ignored. Is it possible to insert the 1999 good rows and ignore the failing one?
Since PostgreSQL implements ACID transactions, the entire transaction is rolled back on an error. The minimal granularity of a transaction is one statement, e.g., an INSERT INTO statement with a batch of values, and this is the default. So if a failure happens, it is not possible to ignore it and commit the rest.
I assume you use an INSERT INTO statement. It provides an ON CONFLICT clause, which can be used if the observed error is due to a conflict.
Another workaround is to load the data into a temporary staging table first and then insert it into the hypertable after cleaning.
BTW, have you looked at the Outflux tool from Timescale to see if it can help?
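The staging-table workaround from the answer can be sketched like this. The table name measurements and the CSV path are placeholders, and this assumes the failing rows are constraint violations (ON CONFLICT does not catch other errors such as NOT NULL or type failures, which would have to be cleaned in the staging table first):

```sql
-- Load the raw batch into a session-local staging table with the
-- same column layout as the target hypertable.
CREATE TEMP TABLE staging (LIKE measurements);

COPY staging FROM '/tmp/batch.csv' WITH (FORMAT csv);

-- Move the rows across; rows that violate a unique or exclusion
-- constraint are silently skipped instead of aborting the batch.
INSERT INTO measurements
SELECT * FROM staging
ON CONFLICT DO NOTHING;
```

This keeps the 2000-row batching for speed while letting individual conflicting rows drop out instead of failing the whole statement.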

Should I use a database trigger on a table with insert-heavy operations?

My use-case is that I need to copy a few columns from TABLE A to another table, TABLE B, and also derive the values of a few other columns of TABLE B through some calculation.
As per current estimates, around 50,000 rows will be inserted daily into TABLE A.
TABLE B should be updated with all the data before end of day.
Hence, I can either use a trigger that is invoked on INSERT into TABLE A, or schedule a job at EOD that reads all the data in bulk from TABLE A, does the calculation and inserts into TABLE B.
As I am new to triggers, I am not sure which option I should pick for this use-case. Any suggestion as to which would be the better approach?
From what I have read so far about triggers, they can slow down DB performance if they are invoked frequently.
Since around 50,000 insert operations will happen daily, can I assume that 50,000 counts as a heavy workload where triggers would not be beneficial?
EDIT 1: the 50,000 daily insert operations will grow to 100,000.
Postgres DB is used.
If you are doing bulk COPY into an unindexed table, adding a simple trigger will slow you down by a lot (like 5 fold). But if you are using single-row INSERTs or the table is indexed, the marginal slow down of adding a simple trigger will be pretty small.
50,000 inserts per day is not very many. You should have no trouble using a trigger on that, unless the trigger has to scan a large table on each call or something.
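The trigger option the answer recommends can be sketched as follows. The table names (table_a, table_b), columns, and the derived-value calculation are placeholders, since the question does not show the schema:

```sql
-- Trigger function: copy selected columns of the new TABLE A row
-- into TABLE B and compute a derived column along the way.
CREATE FUNCTION copy_to_table_b() RETURNS trigger AS $$
BEGIN
    INSERT INTO table_b (col1, col2, derived)
    VALUES (NEW.col1, NEW.col2, NEW.col1 * 2);  -- placeholder calculation
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Fire once per inserted row, after the insert succeeds.
CREATE TRIGGER a_to_b
AFTER INSERT ON table_a
FOR EACH ROW EXECUTE FUNCTION copy_to_table_b();
```

With this in place, TABLE B stays up to date continuously instead of waiting for an end-of-day job; the per-row overhead of a simple function like this is small at 50,000-100,000 inserts per day.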

What is the difference between inserting duplicate rows and inserting new rows

When I use VoltDB (an in-memory database), I noticed a strange phenomenon: inserting duplicate rows (500 ms) takes more time than inserting new rows (10 ms), even though the table is partitioned. To my knowledge, inserting duplicate data should return faster, because there is no need to perform a real insert.
It looks like what you are measuring is the response time back to the client, which may not reflect the execution time. It may depend on what the client does with success results vs. error results, or it may just be random network latency, or how many other transactions were queued ahead of your requests when you sent them.
To measure the actual time it takes to insert a new record or to attempt to insert a duplicate record (and for it to fail and return a unique constraint violation), you may need to isolate and repeat enough transactions to get a good measurement from the built-in statistics.
Supposing you have a partitioned table with a primary key. Here is how I would measure this:
Call "exec #Statistics PROCEDUREPROFILE 1;", but you can ignore these initial results.
Insert 50-100 unique records into your table by calling the TABLENAME.insert procedure.
Call "exec #Statistics PROCEDUREPROFILE 1;" and look at the avg_execution_time column for the TABLENAME.insert procedure. This average should be just for the unique inserts you did between calls to #Statistics.
Insert 50-100 duplicate records using the TABLENAME.insert procedure.
Call "exec #Statistics PROCEDUREPROFILE 1;" and look at the avg_execution_time column for the TABLENAME.insert procedure. This average should be just for the duplicate inserts you did between calls to #Statistics.
I haven't tested unique inserts vs. insert constraint violations for speed myself. In general, inserts are fast regardless of outcome, but I suspect your instinct is correct: the failed duplicate inserts may execute faster, since the engine has to check for a duplicate either way, and if it finds one there is no need to insert.
Disclaimer: I work for VoltDB.

Postgres parallel/efficient load huge amount of data psycopg

I want to load many rows from a CSV file.
The files contain data like "article_name, article_time, start_time, end_time".
There is a constraint on the table: for the same article name, I don't insert a new row if the new article_time falls within an existing [start_time, end_time] range for that article.
I.e., don't insert row y if there exists a range [start_time_x, end_time_x] such that article_time_y lies inside [start_time_x, end_time_x] and article_name_y = article_name_x.
I tried with psycopg by selecting the existing article names and checking manually for an overlap: too slow.
I tried again with psycopg, this time by setting an 'exclude using ...' constraint and inserting with "on conflict do nothing" (so that it does not fail), but still too slow.
I tried the same thing but inserting many values in each call to execute (psycopg): it got a little better (1M rows processed in almost 10 minutes), but still not as fast as it needs to be for the amount of data I have (500M+).
I tried to parallelize by running the same script many times on different files, but the timing didn't improve, I guess because of the locks taken on the table each time we write.
Is there any way to create a lock only on rows containing the same article_name (and not a lock on the whole table)?
Could you please help with any ideas to make this parallelizable and/or more time efficient?
Thanks a lot, folks.
Your idea with the exclusion constraint and INSERT ... ON CONFLICT is good.
You could improve the speed as follows:
Do it all in a single transaction.
Like Vao Tsun suggested, maybe COPY the data into a staging table first and do it all with a single SQL statement.
Remove all indexes except the exclusion constraint from the table where you modify data and re-create them when you are done.
Speed up insertion by disabling autovacuum and raising max_wal_size (or checkpoint_segments on older PostgreSQL versions) while you load the data.
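The COPY-into-a-staging-table advice above can be sketched like this. All names (articles, no_overlap, the CSV path) are placeholders, and this assumes the question's condition can be expressed as a range overlap on timestamptz columns; the btree_gist extension is needed so the GiST index can combine equality on article_name with the range operator:

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- Reject rows whose time range overlaps an existing range
-- for the same article.
ALTER TABLE articles ADD CONSTRAINT no_overlap
    EXCLUDE USING gist (article_name WITH =,
                        tstzrange(start_time, end_time) WITH &&);

-- Bulk-load the raw CSV into a session-local staging table,
-- then move everything across in one statement; rows that
-- violate the exclusion constraint are skipped, not fatal.
CREATE TEMP TABLE staging (LIKE articles);
COPY staging FROM '/path/to/file.csv' WITH (FORMAT csv);

INSERT INTO articles
SELECT * FROM staging
ON CONFLICT ON CONSTRAINT no_overlap DO NOTHING;
```

Note that ON CONFLICT with an exclusion constraint only supports DO NOTHING, not DO UPDATE, which is exactly what the question asks for here.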