Is it possible to insert and replace rows with pgloader? - postgresql

My use case is the following: I have data coming from a CSV file and I need to load it into a table (so far so good, nothing new here). It might happen that the same data is sent again with updated column values, in which case I would like to insert it and replace the existing row in case of a duplicate.
So my table is as follows:
CREATE TABLE codes (
    code TEXT NOT NULL,
    position_x INT,
    position_y INT,
    PRIMARY KEY (code)
);
And incoming csv file is like this:
TEST01,1,1
TEST02,1,2
TEST03,1,3
TEST04,1,4
It might happen that sometime in the future I get another csv file with:
TEST01,1,1000 <<<<< updated value
TEST05,1,5
TEST06,1,6
TEST07,1,7
Right now what is happening is when I run for the first file, everything is fine, but when I execute for the second one I'm getting an error:
2017-04-26T10:33:51.306000+01:00 ERROR Database error 23505: duplicate key value violates unique constraint "codes_pkey"
DETAIL: Key (code)=(TEST01) already exists.
I load data using:
pgloader csv.load
And my csv.load file looks like this:
LOAD CSV
FROM 'codes.csv' (code, position_x, position_y)
INTO postgresql://localhost:5432/codes?tablename=codes (code, position_x, position_y)
WITH fields optionally enclosed by '"',
fields terminated by ',';
Is what I'm trying to do possible with pgloader?
I also tried dropping the constraints for the primary key, but then I end up with duplicate entries in the table.
Thanks a lot for your help.

No, you can't. As per the reference documentation:
To work around that (load exceptions, e.g. PK violations), pgloader cuts the data into batches of 25000 rows each, so that when a problem occurs it's only impacting that many rows of data.
The text in parentheses is mine.
The best you can do is load the csv into a table with the same structure and then merge the data with a query (EXCEPT, OUTER JOIN ... WHERE NULL, and so on); see the sketch below.
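A minimal sketch of that staging-and-merge workaround, assuming the load file is pointed at a staging table (e.g. ?tablename=codes_staging); the merge here uses PostgreSQL's INSERT ... ON CONFLICT (available since 9.5) instead of EXCEPT/OUTER JOIN, which gives the desired insert-or-replace behaviour:
-- staging table with the same shape as codes, but no primary key,
-- so pgloader can load any file without hitting duplicate-key errors
CREATE TABLE codes_staging (
    code TEXT NOT NULL,
    position_x INT,
    position_y INT
);

-- after pgloader has loaded the csv into codes_staging, merge it into codes;
-- DISTINCT ON keeps one row per code so the same row is not updated twice
INSERT INTO codes (code, position_x, position_y)
SELECT DISTINCT ON (code) code, position_x, position_y
FROM   codes_staging
ON CONFLICT (code) DO UPDATE
SET    position_x = EXCLUDED.position_x,
       position_y = EXCLUDED.position_y;

TRUNCATE codes_staging;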

Related

Know which index caused an error during a bulk INSERT or bulk UPDATE in PostgreSQL

When I INSERT or UPDATE a list of rows in PostgreSQL and one of them causes an error, how can I know which one exactly (its index in the input list)?
For example, if I have a UNIQUE constraint on the name column, and if name two already exists, I want to know that the constraint violation is caused by the input row at index 1.
INSERT INTO table (id, name) VALUES ('0000', 'one'), ('0001', 'two');
I know PostgreSQL will stop on the first error encountered, and therefore that we can't know all of the problematic rows. That's fine; I just need the first problematic index (if any).
Inserting each row separately is not a possibility since we want to optimize for performance as well.
Postgres gives you exactly what you are asking for, and more: it provides the constraint name, the column(s), and the value(s). However, much of this is in the subsequent detail lines of the error message, so you need to extract the complete message. See demo.
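As an illustration of how to get at those details from SQL itself, here is a hypothetical demo (the table t and its rows are made up); a PL/pgSQL block can surface the message, detail and constraint name via GET STACKED DIAGNOSTICS:
CREATE TABLE t (id text PRIMARY KEY, name text UNIQUE);
INSERT INTO t VALUES ('0000', 'one'), ('0001', 'two');

DO $$
DECLARE
    msg    text;
    detail text;
    cname  text;
BEGIN
    -- the second row collides with the existing name 'two'
    INSERT INTO t (id, name) VALUES ('0002', 'three'), ('0003', 'two');
EXCEPTION WHEN unique_violation THEN
    GET STACKED DIAGNOSTICS
        msg    = MESSAGE_TEXT,
        detail = PG_EXCEPTION_DETAIL,
        cname  = CONSTRAINT_NAME;
    RAISE NOTICE 'message: % | detail: % | constraint: %', msg, detail, cname;
END $$;
-- NOTICE: message: duplicate key value violates unique constraint "t_name_key"
--         | detail: Key (name)=(two) already exists. | constraint: t_name_key
The detail line names the offending value, from which the index of the row in your input list can be derived.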

copy csv postgres ignore rows that violate constraints

I have a .csv file with ~300,000 rows, some of which violate certain constraints I set in my postgres database. Is there a way to copy my .csv file into the database and have postgres filter out the rows that violate the constraints? I do not want these rows to show up in the database.
If this is not possible, is there any other way to solve this problem?
what I'm doing right now is
COPY blocksequences FROM '/tmp/blocksequences.csv' CSV HEADER;
And I get
ERROR: new row for relation "blocksequences" violates check constraint "blocksequences_partid3_check"
DETAIL: Failing row contains (M001-M049-S186, M001, null, M049, S186).
CONTEXT: COPY blocksequences, line 680: "M001-M049-S186,M001,,M049,S186"
The reason for the error: the column that contains M049 is not allowed to have that string. Many other rows have violations like this.
I read a little about EXCEPTION WHEN check_violation ... DO NOTHING; am I on the right track here? It seems like that might only be a MySQL thing.
Usually this is done this way (see the sketch below):
create a temporary table with the same structure as the destination one, but without constraints,
copy the data to the temporary table with the COPY command,
copy the rows that do fulfill the constraints from the temp table to the destination one, using an INSERT command with conditions in the WHERE clause based on the table constraints,
drop the temporary table.
When dealing with really large CSV files or very limited server resources, use the file_fdw extension instead of temporary tables. It's a much more efficient way, but it requires server access to the CSV file (while copying to a temporary table can be done over the network).
In Postgres 12 you can use the WHERE clause in COPY FROM.
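A minimal sketch of that workflow, assuming a staging table named blocksequences_staging and an illustrative filter condition (the real WHERE clause must mirror whatever blocksequences_partid3_check actually enforces):
-- 1. staging table with the same columns but none of the constraints
CREATE TEMP TABLE blocksequences_staging AS
SELECT * FROM blocksequences WITH NO DATA;

-- 2. every CSV row loads, because nothing is enforced here
COPY blocksequences_staging FROM '/tmp/blocksequences.csv' CSV HEADER;

-- 3. move only the rows that satisfy the destination's rules
INSERT INTO blocksequences
SELECT *
FROM   blocksequences_staging
WHERE  partid3 LIKE 'S%';   -- placeholder: copy the real check condition here

-- 4. clean up (temp tables also vanish at the end of the session)
DROP TABLE blocksequences_staging;

-- In Postgres 12+ the same filter can be applied directly while copying:
-- COPY blocksequences FROM '/tmp/blocksequences.csv' WITH (FORMAT csv, HEADER) WHERE partid3 LIKE 'S%';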

Duplicate Key error when using INSERT DEFAULT

I am getting a duplicate key error, DB2 SQL Error: SQLCODE=-803, SQLSTATE=23505, when I try to INSERT records. The primary key is one column, INTEGER 4, Generated, and it is the first column.
The insert looks like this: INSERT INTO SCHEMA.TABLE1 values (DEFAULT, ?, ?, ...)
It's my understanding that using the value DEFAULT will just let DB2 auto-generate the key at the time of insert, which is what I want. This works most of the time, but sometimes/randomly I get the duplicate key error. Thoughts?
More specifically, I'm running against DB2 9.7.0.3, using Scriptella to copy a bunch of records from one database to another. Sometimes I can process a bunch with no problems, other times I'll get the error right away, other times after 2 records, or 20 records, or 30 records, etc. Does not seem to be a pattern, nor is it the same record every time. If I change the data to copy 1 record instead of a bunch, sometimes I'll get the error one time, then it's fine the next time.
I thought maybe some other process was inserting records during my batch program, and creating keys at the same time. However, the tables I'm copying TO should not have any other users/processes trying to INSERT records during this same time frame, although there could be READS happening.
Edit: adding create info:
Create table SCHEMA.TABLE1 (
SYSTEM_USER_KEY INTEGER NOT NULL
generated by default as identity (start with 1 increment by 1 cache 20),
COL2...,
)
alter table SCHEMA.TABLE1
add constraint SYSTEM_USER_SYSTEM_USER_KEY_IDX
Primary Key (SYSTEM_USER_KEY);
You most likely have records in your table with IDs that are bigger than the next value of your identity sequence. To find out what value your sequence is currently at, run the following query.
select s.nextcachefirstvalue-s.cache, s.nextcachefirstvalue-s.increment
from syscat.COLIDENTATTRIBUTES as a inner join syscat.sequences as s on a.seqid=s.seqid
where a.tabschema='SCHEMA'
and a.TABNAME='TABLE1'
and a.COLNAME='SYSTEM_USER_KEY'
So basically what happened is that somehow you got records in your table with IDs that are bigger than the current last value of your identity sequence, so sooner or later those IDs will collide with identity-generated IDs.
There are different ways this could have happened. One possibility is that data was loaded which already contained values for the ID column, or that records were inserted with an explicit value for the ID. Another option is that the identity sequence was reset to start at a lower value than the max ID in the table.
Whatever the cause, you may also want the fix:
SELECT MAX(<primary_key_column>) FROM <schema>.<table>;
ALTER TABLE <schema>.<table> ALTER COLUMN <primary_key_column> RESTART WITH <number from previous query + 1>;
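For the table from the question this would look like the following; the value 1543 is made up purely for illustration:
SELECT MAX(SYSTEM_USER_KEY) FROM SCHEMA.TABLE1;        -- suppose this returns 1543
ALTER TABLE SCHEMA.TABLE1
    ALTER COLUMN SYSTEM_USER_KEY RESTART WITH 1544;    -- max + 1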

Need a bulk insert tip

I need to insert one table's data into another table. It is not guaranteed that the source table has all rows correct; some of the NOT NULL fields may contain NULL values. From this source table I need to insert all valid rows into the target table, and find all invalid rows which failed to insert and return them.
I know we can do this by validating all rows beforehand. But since this is a bulk insert from a CSV parsed by .NET code, we will not validate it in the database but insert directly.
We could also do this by running a loop, but performance would take a hit.
So my question is: is there any way to use a single statement that inserts the rows which are valid and skips the rows which have a problem?
BULK INSERT is all-or-nothing. SQL Server does not have the ability to shunt erroneous rows into a separate table, alas.
The best thing you can do is to validate all data thoroughly before inserting it. If the insert still fails (maybe due to a bug), you need to retry all rows one by one and log the errors that occur.
You can also bulk insert to a temp table and move the rows from there to the final table one-by-one.
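A rough sketch of that staging approach; every name here (dbo.Staging, dbo.Target, the two columns and the file path) is an assumption for illustration:
-- staging table without constraints, so every row loads
CREATE TABLE dbo.Staging (col1 INT NULL, col2 NVARCHAR(100) NULL);

BULK INSERT dbo.Staging
FROM 'C:\data\input.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

-- a single statement that inserts only the rows satisfying the target's NOT NULL rule
INSERT INTO dbo.Target (col1, col2)
SELECT col1, col2 FROM dbo.Staging WHERE col1 IS NOT NULL;

-- the remaining rows are the invalid ones to return to the caller
SELECT col1, col2 FROM dbo.Staging WHERE col1 IS NULL;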

PostgreSQL bulk insert with ActiveRecord

I have a lot of records that originally come from MySQL. I massaged the data so it can be successfully inserted into PostgreSQL using ActiveRecord. I can easily do this with row-based insertions, i.e. one row at a time, but that is very slow. I want to do a bulk insert, but it fails if any of the rows contains invalid data. Is there any way I can achieve a bulk insert with only the invalid rows failing instead of the whole batch?
COPY
When using SQL COPY for bulk insert (or its equivalent \copy in the psql client), failure is not an option. COPY cannot skip illegal lines. You have to match your input format to the table you import to.
If data itself (not decorators) is violating your table definition, there are ways to make this a lot more tolerant though. For instance: create a temporary staging table with all columns of type text. COPY to it, then fix offending rows with SQL commands before converting to the actual data type and inserting into the actual target table.
Consider this related answer:
How to bulk insert only new rows in PostreSQL
Or this more advanced case:
"ERROR: extra data after last expected column" when using PostgreSQL COPY
If NULL values are offending, remove the NOT NULL constraint from your target table temporarily. Fix the rows after COPY, then reinstate the constraint. Or take the route with the staging table, if you cannot afford to soften your rules temporarily.
Sample code:
ALTER TABLE tbl ALTER COLUMN col DROP NOT NULL;
COPY ...
-- repair, like ..
-- UPDATE tbl SET col = 0 WHERE col IS NULL;
ALTER TABLE tbl ALTER COLUMN col SET NOT NULL;
Or you just fix the source file. COPY tells you the number of the offending line. Use an editor of your preference and fix it, then retry. I like to use vim for that.
INSERT
For an INSERT (as asked in the comments), the check for NULL values is trivial.
To skip rows with a NULL value:
INSERT INTO tbl (col1, ...)
SELECT col1, ...
FROM   source_table
WHERE  col1 IS NOT NULL;
To insert something else instead of a NULL value (an empty string in my example):
INSERT INTO tbl (col1, ...)
SELECT COALESCE(col1, ''), ...
FROM   source_table;
A common work-around for this is to import the data into a TEMPORARY or UNLOGGED table with no constraints and, where data in the input is sufficiently bogus, text typed columns.
You can then do INSERT INTO ... SELECT queries against the data to populate the real table with a big query that cleans up the data during import. You can use a lot of CASE statements for this. The idea is to transform the data in one pass.
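For example, a sketch of such a clean-up pass; the staging table raw_rows with all-text columns, the target table items and the specific CASE rules are assumptions:
-- everything lands here as text first, with no constraints in the way
CREATE UNLOGGED TABLE raw_rows (id text, qty text, price text);
-- ... COPY the massaged MySQL export into raw_rows ...

-- one pass that casts, repairs and filters while inserting into the real table
INSERT INTO items (id, qty, price)
SELECT id::int,
       CASE WHEN qty ~ '^\d+$' THEN qty::int ELSE 0 END,
       CASE WHEN price IS NULL OR price = '' THEN NULL ELSE price::numeric END
FROM   raw_rows
WHERE  id ~ '^\d+$';   -- skip rows whose id is not a valid integer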
You might be able to do many of the fixes in Ruby as you read the data in, then push the data to PostgreSQL using COPY ... FROM STDIN. This is possible with Ruby's Pg gem, see e.g. https://bitbucket.org/ged/ruby-pg/src/tip/sample/copyfrom.rb .
For more complicated cases, look at Pentaho Kettle or Talend Studio ETL tools.