I have a script that selects rows from InfluxDB and bulk inserts them into TimescaleDB.
I am inserting the data in batches of 2000 rows to make it faster.
The thing is, when one row causes an error, all 2000 rows are rejected. Is it possible to insert the 1999 good rows and ignore the failing one?
Since PostgreSQL implements ACID transactions, the entire transaction is rolled back on an error. The minimal granularity of a transaction is a single statement, e.g., an INSERT INTO statement with a batch of values, and that is the default. So if a failure happens, it is not possible to ignore the failing row and commit the rest of the statement.
I assume you use an INSERT INTO statement. It provides an ON CONFLICT clause, which can be used if the observed error is due to a conflict (e.g., a unique constraint violation).
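For example, a minimal sketch of that clause, assuming a hypothetical hypertable metrics with a unique constraint on (time, device_id); this only helps when the failures are constraint conflicts, not other errors:

    INSERT INTO metrics (time, device_id, value)
    VALUES
        ('2024-01-01 00:00:00+00', 'dev-1', 1.0),
        ('2024-01-01 00:00:00+00', 'dev-1', 1.0)   -- duplicate of the first row
    ON CONFLICT (time, device_id) DO NOTHING;      -- duplicates are skipped, the rest of the batch commits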
Another workaround is to bulk load the batch into a temporary (staging) table first and then insert into the hypertable after cleaning the data.
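A sketch of that staging-table workaround, again assuming the hypothetical metrics hypertable and that the failures come from rows with NULLs in required columns:

    CREATE TEMP TABLE metrics_staging (LIKE metrics INCLUDING DEFAULTS);

    -- load the 2000-row batch into the staging table here (COPY or multi-row INSERT),
    -- then move over only the rows that pass validation:
    INSERT INTO metrics (time, device_id, value)
    SELECT time, device_id, value
    FROM metrics_staging
    WHERE time IS NOT NULL
      AND device_id IS NOT NULL;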
BTW, have you looked at the Outflux tool from Timescale to see if it can help?
In my PostgreSQL schema I have a jobs table and an accounts table. Once a day I need to schedule a job for each account by inserting a row per account into the jobs table. This can be done with a simple INSERT INTO .. SELECT FROM statement, but is there any empirical way to know whether I am straining my DB with this bulk insert and whether I should chunk the inserts instead?
Postgres often does miraculous work, so I have no idea whether bulk inserting 500k records at a time is better than, for example, 100 batches of 5k. The bulk insert works today but can take minutes to complete.
One additional data point: the jobs table has a uniqueness constraint on account ID, so this statement includes an ON CONFLICT clause too.
In PostgreSQL, it doesn't matter how many rows you modify in a single transaction, so it is preferable to do everything in a single statement and avoid ending up with half the work done in case of a failure. The only consideration is that transactions should not take too long, but if this happens once a day, it is no problem.
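As a sketch of the single-statement approach (column names are hypothetical, and DO NOTHING stands in for whatever your ON CONFLICT action is):

    INSERT INTO jobs (account_id, scheduled_for)
    SELECT id, current_date
    FROM accounts
    ON CONFLICT (account_id) DO NOTHING;  -- accounts that already have a job are skipped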
I assume this question has been asked before, but unfortunately I cannot find the answer to my question.
I have a table, and I am using an UPDATE statement to update a column. Simultaneously I am running a CREATE TABLE ... AS SELECT query that retrieves data from the same table and column that is being updated.
My questions are: can this lead to wrong results in the output of the CREATE TABLE statement? Does the UPDATE query finish first and only then does the CREATE TABLE ... AS SELECT execute? I just know that the CREATE TABLE statement is taking much longer to execute.
In PostgreSQL, readers never block writers and vice versa. This is guaranteed by PostgreSQL's MVCC implementation, which keeps old row versions around.
If the updating transaction isn't finished yet, the reading transaction will see the old value, and the result is consistent.
There is nothing inside PostgreSQL that should slow down the SELECT statement noticeably, but of course I/O contention is a possible explanation.
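A sketch of what that looks like with two concurrent sessions (table and column names are made up):

    -- Session 1: long-running update, not yet committed
    BEGIN;
    UPDATE measurements SET score = score * 2;

    -- Session 2, running at the same time: it is not blocked by the UPDATE
    -- and sees the row versions as they were before the UPDATE
    CREATE TABLE measurements_snapshot AS
    SELECT * FROM measurements;

    -- Session 1:
    COMMIT;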
I am trying to bulk load records from a temp table into another table using an INSERT ... SELECT statement with an ON CONFLICT ... DO UPDATE strategy.
I want to load as many records as possible. Currently, if there are any foreign key violations, no records get inserted and everything gets rolled back. Is there a way to insert the valid records and skip the faulty ones?
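Roughly, the statement looks like this (table and column names below are just placeholders):

    INSERT INTO orders (order_id, customer_id, amount)
    SELECT order_id, customer_id, amount
    FROM orders_stage
    ON CONFLICT (order_id)
    DO UPDATE SET customer_id = EXCLUDED.customer_id,
                  amount      = EXCLUDED.amount;
    -- one staged row with a customer_id that violates the foreign key
    -- rolls back the whole statement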
In https://dba.stackexchange.com/a/46477 I saw a strategy of involving the referenced (foreign) table in the query to skip the faulty rows. I don't want to do that either, as I may have many foreign keys on that table and it would make my query more complex and table-specific. I would like it to be generic.
Sample use case: if I have 100 rows in the temp table and, say, rows number 5 and 7 cause insertion failures, I want to insert the remaining 98 records and identify which two rows failed.
I want to avoid inserting record by record and catching the error each time, as that is not efficient. The whole point of this exercise is to avoid loading the table row by row.
Oracle provides support for catching bulk errors in a single shot.
Sample https://stackoverflow.com/a/36430893/8575780
https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:1422998100346727312
I have already explored loading with COPY; it catches NOT NULL constraint violations and other data type errors, but when a foreign key violation happens, nothing gets committed.
I am looking for something closer to what pgloader does when it faces an error.
https://pgloader.readthedocs.io/en/latest/pgloader.html#batches-and-retry-behaviour
I want to load many rows from a CSV file.
The files contain data like this: "article_name,article_time,start_time,end_time"
There is a constraint on the table: for the same article name, I don't insert a new row if its article_time falls within an existing [start_time,end_time] range for that article.
I.e., don't insert row y if there exists a row x with article_name_y = article_name_x and article_time_y inside the range [start_time_x,end_time_x].
I tried with psycopg by selecting the existing article names and checking manually whether there is an overlap --> too long
I tried again with psycopg, this time by adding an EXCLUDE USING ... constraint and inserting with ON CONFLICT DO NOTHING specified (so that the insert does not fail), but it was still too long
I tried the same thing, but this time inserting many values in each call to execute (psycopg): it got a little better (1M rows processed in almost 10 minutes), but still not as fast as it needs to be for the amount of data I have (500M+ rows)
I tried to parallelize by running the same script several times on different files, but the timing didn't get any better, I guess because of the locks on the table each time we want to write something
Is there any way to take a lock only on the rows containing the same article_name (and not a lock on the whole table)?
Could you please help with any idea to make this parallelizable and/or more time-efficient?
Lots of thanks folks
Your idea with the exclusion constraint and INSERT ... ON CONFLICT is good.
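Such a constraint might look roughly like this (table and column names are assumptions; the btree_gist extension is needed for the equality on article_name, and forbidding overlapping [start_time,end_time] ranges per article enforces your rule as long as each row's article_time lies inside its own range, although it is slightly stricter):

    CREATE EXTENSION IF NOT EXISTS btree_gist;

    CREATE TABLE articles (
        article_name text        NOT NULL,
        article_time timestamptz NOT NULL,
        start_time   timestamptz NOT NULL,
        end_time     timestamptz NOT NULL,
        EXCLUDE USING gist (
            article_name WITH =,
            tstzrange(start_time, end_time) WITH &&
        )
    );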
You could improve the speed as follows:
Do it all in a single transaction.
Like Vao Tsun suggested, maybe COPY the data into a staging table first and do it all with a single SQL statement (see the sketch after this list).
Remove all indexes except the exclusion constraint from the table where you modify data and re-create them when you are done.
Speed up insertion by disabling autovacuum and raising max_wal_size (or checkpoint_segments on older PostgreSQL versions) while you load the data.
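Putting the first two suggestions together, a rough sketch (assuming the hypothetical articles table and constraint sketched above, and a placeholder CSV path):

    BEGIN;

    CREATE TEMP TABLE articles_staging (LIKE articles);

    -- bulk load the CSV into the unconstrained staging table
    -- (from a client, psql's \copy or psycopg's COPY ... FROM STDIN does the same)
    COPY articles_staging (article_name, article_time, start_time, end_time)
    FROM '/path/to/file.csv' WITH (FORMAT csv);

    -- a single statement moves everything over; rows that violate the
    -- exclusion constraint are skipped instead of aborting the load
    INSERT INTO articles
    SELECT * FROM articles_staging
    ON CONFLICT DO NOTHING;

    COMMIT;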
I need to insert data from one table into another table. It is not guaranteed that the source table has all rows in a valid state; some of the NOT NULL fields contain NULL values. From this source table I need to insert all valid rows into the target table, and also find all invalid rows which failed to insert and return them.
I know we could do this by validating all rows beforehand. But as this is a bulk insert from a CSV parsed by .NET code, we will not validate it on the DB side but insert it directly.
We could also do this by running a loop, but performance might suffer.
So my question is: is there any way to use a single statement that inserts the valid rows and skips the rows which have a problem?
BULK INSERT is all-or-nothing. SQL Server does not have the ability to shunt erroneous rows into a separate table, alas.
The best thing you can do is to validate all data thoroughly before inserting it. If the insert still fails (maybe due to a bug), you need to retry all rows one by one and log the errors that occur.
You can also bulk insert into a temp table and move the rows from there to the final table one by one.
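For the specific NULL case in the question, the "validate first" route can be done set-based from a staging table; a minimal sketch (table and column names are made up):

    -- move only the rows that satisfy the NOT NULL columns
    INSERT INTO dbo.TargetTable (Id, Name, CreatedAt)
    SELECT Id, Name, CreatedAt
    FROM #Staging
    WHERE Name IS NOT NULL
      AND CreatedAt IS NOT NULL;

    -- return the rows that would have failed, so the caller can report them
    SELECT Id, Name, CreatedAt
    FROM #Staging
    WHERE Name IS NULL
       OR CreatedAt IS NULL;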