copy csv postgres ignore rows that violate constraints - postgresql

I have a .csv file with ~300,000 rows, some of which violate certain constraints I set in my postgres database. Is there a way to copy my .csv file into the database and have postgres filter out the rows that violate the constraints? I do not want these rows to show up in the database.
If this is not possible, is there any other way to solve this problem?
What I'm doing right now is:
COPY blocksequences FROM '/tmp/blocksequences.csv' CSV HEADER;
And I get
ERROR: new row for relation "blocksequences" violates check constraint "blocksequences_partid3_check"
DETAIL: Failing row contains (M001-M049-S186, M001, null, M049, S186).
CONTEXT: COPY blocksequences, line 680: "M001-M049-S186,M001,,M049,S186"
Reason for the error: the column that contains M049 is not allowed to have that string. Many other rows have similar violations.
I read a little about exception handling on check violations (something like "when check violation, do nothing") -- am I on the right track here? It seems like that may only be a MySQL thing.

Usually this is done in the following way:
1. Create a temporary table with the same structure as the destination one, but without constraints.
2. Copy the data to the temporary table with the COPY command.
3. Copy the rows that do fulfill the constraints from the temp table to the destination one, using an INSERT command with conditions in the WHERE clause based on the table constraints (see the sketch after this list).
4. Drop the temporary table.
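A minimal sketch of those steps, using the table from the question; the column name partid3 is inferred from the constraint name, and the WHERE condition is a hypothetical stand-in for the real check constraint:
-- LIKE copies columns and NOT NULL constraints, but not CHECK constraints.
CREATE TEMP TABLE blocksequences_staging (LIKE blocksequences);
COPY blocksequences_staging FROM '/tmp/blocksequences.csv' CSV HEADER;
-- The condition must mirror the definition of blocksequences_partid3_check.
INSERT INTO blocksequences
SELECT * FROM blocksequences_staging
WHERE partid3 IS DISTINCT FROM 'M049';
DROP TABLE blocksequences_staging;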
When dealing with really large CSV files or very limited server resources, use the file_fdw extension instead of temporary tables. It's a much more efficient way, but it requires that the server have access to the CSV file (while copying to a temporary table can be done over the network).
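A hedged sketch of the file_fdw route; the column names are guesses based on the failing row shown in the question:
CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;
-- The Postgres server process must be able to read this path.
CREATE FOREIGN TABLE blocksequences_csv (
    seq_id  text,
    partid1 text,
    partid2 text,
    partid3 text,
    partid4 text
) SERVER csv_files
OPTIONS (filename '/tmp/blocksequences.csv', format 'csv', header 'true');
INSERT INTO blocksequences
SELECT * FROM blocksequences_csv
WHERE partid3 IS DISTINCT FROM 'M049';  -- mirror the real constraint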
In Postgres 12 you can use the WHERE clause in COPY FROM.
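For example (Postgres 12 or later, again with a placeholder condition):
COPY blocksequences FROM '/tmp/blocksequences.csv'
WITH (FORMAT csv, HEADER)
WHERE partid3 IS DISTINCT FROM 'M049';
Note that COPY ... WHERE only filters on expressions over the incoming row; it does not skip constraint-violating rows automatically, so the condition still has to restate the constraint by hand.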

Related

Implications of using ADD COLUMN on large dataset

Docs for Redshift say:
ALTER TABLE locks the table for reads and writes until the operation completes.
My question is:
Say I have a table with 500 million rows and I want to add a column. This sounds like a heavy operation that could lock the table for a long time - yes? Or is it actually a quick operation, since Redshift is a columnar DB? Or does it depend on whether the column is nullable / has a default value?
I find that adding (and dropping) columns is a very fast operation even on tables with many billions of rows, regardless of whether there is a default value or it's just NULL.
As you suggest, I believe this is a feature of it being a columnar database, so the rest of the table is undisturbed. It simply creates empty (or nearly empty) column blocks for the new column on each node.
I added an integer column with a default to a table of around 65M rows in Redshift recently and it took about a second to process. This was on a dw2.large (SSD type) single node cluster.
Just remember that you can only add a column to the end (right) of the table; you have to use temporary tables etc. if you want to insert a column somewhere in the middle.
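For reference, the operation under discussion is just the plain form below; the table and column names are illustrative:
-- Fast on Redshift: only new, nearly empty column blocks are created.
ALTER TABLE big_table ADD COLUMN new_flag INTEGER DEFAULT 0;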
Personally, I have found that rebuilding the table works best. I do it in the following way:
1. Create a new table N_OLD_TABLE.
2. Define the datatypes/compression encodings in the new table.
3. Insert the data: INSERT INTO N_OLD_TABLE (old_columns) SELECT old_columns FROM OLD_TABLE.
4. Rename OLD_TABLE to OLD_TABLE_BKP.
5. Rename N_OLD_TABLE to OLD_TABLE.
This is a much faster process. It doesn't block the table, and you always have a backup of the old table in case anything goes wrong.
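A sketch of that rebuild; the table name, columns, and encodings are placeholders:
-- 1. New table with the desired shape and compression encodings.
CREATE TABLE n_old_table (
    col1    INTEGER ENCODE az64,
    col2    VARCHAR(64) ENCODE lzo,
    new_col INTEGER DEFAULT 0
);
-- 2. Backfill the existing columns.
INSERT INTO n_old_table (col1, col2)
SELECT col1, col2 FROM old_table;
-- 3. Swap names, keeping the old table as a backup.
ALTER TABLE old_table RENAME TO old_table_bkp;
ALTER TABLE n_old_table RENAME TO old_table;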

Ignore duplicates when importing from CSV

I'm using a PostgreSQL database, and after creating my table I have to populate it from a CSV file. However, the CSV file is corrupted: it violates the primary key rule, so the database throws an error and I'm unable to populate the table. Any ideas how to tell the database to ignore the duplicates when importing from the CSV? Writing a script to remove them from the CSV file is not acceptable. Any workarounds are welcome too. Thank you! :)
In PostgreSQL, duplicate rows are not permitted if they violate a unique constraint.
I think your best option is to import your CSV file into a temp table that has no such constraint, delete the duplicate values from it, and finally import from this temp table into your final table.
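A minimal sketch of that, assuming a target table items with primary key id (both names are made up for illustration):
-- LIKE copies the columns but not the primary key constraint.
CREATE TEMP TABLE items_staging (LIKE items);
COPY items_staging FROM '/tmp/items.csv' CSV HEADER;
-- Keep one row per key value, then load the survivors.
INSERT INTO items
SELECT DISTINCT ON (id) * FROM items_staging
ORDER BY id;
DROP TABLE items_staging;
On Postgres 9.5 and later you can skip the deduplication step and load with INSERT ... SELECT ... ON CONFLICT DO NOTHING against the final table instead.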

Postgres: how can I work with a CSV file without copying it to the DB

What I do now is use COPY for the csv file I want to work with, then when I finish, I delete the table.
COPY mytable FROM 'D:/test.csv' WITH CSV HEADER DELIMITER AS ','
do my work
drop table mytable;
Is there any other preferred/professional way?
Depending on what the work is, the file_fdw extension may be suitable. It lets you access a CSV file as if it were a table.
There are some major downsides to this, though: it's slow, and you can't create indexes on it. So it's often much better to just COPY into an UNLOGGED table, do the work, and drop the table.
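The UNLOGGED variant of that workflow might look like this; the table definition is a stand-in:
-- UNLOGGED skips write-ahead logging, so the COPY is much faster;
-- the data is not crash-safe, which is fine for a throwaway table.
CREATE UNLOGGED TABLE scratch (id integer, value text);
COPY scratch FROM 'D:/test.csv' WITH CSV HEADER DELIMITER AS ',';
CREATE INDEX ON scratch (id);  -- indexes work here, unlike with file_fdw
-- ... do the work ...
DROP TABLE scratch;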

How to UPDATE table from csv file?

How to update table from csv file in PostgreSQL? (version 9.2.4)
The COPY command is for inserting, but I need to update a table. How can I update a table from a CSV file without a temp table?
I don't want to copy from the CSV file to a temp table and then update the table from that temp table.
And is there no MERGE command, like in Oracle?
The simple and fast way is with a temporary staging table, like detailed in this closely related answer:
How to update selected rows with values from a CSV file in Postgres?
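In outline, the staging-table route looks like this (table and column names are assumptions):
-- Temporary staging table shaped like the CSV.
CREATE TEMP TABLE tbl_staging (id integer, val text);
COPY tbl_staging FROM '/tmp/data.csv' CSV HEADER;
-- Update matching rows in the real table.
UPDATE tbl t
SET    val = s.val
FROM   tbl_staging s
WHERE  t.id = s.id;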
If you don't "want" that for some unknown reason, there are more ways:
A foreign data wrapper with file_fdw.
You can run UPDATE commands directly using this one.
pg_read_file(). For special use cases.
Details in this related answer:
Read data from a text file inside a trigger
There is no MERGE command in Postgres, let alone for COPY.
Discussion about whether and how to add it is ongoing. Check out the Postgres Wiki for details.

PostgreSQL Creating an Insert Trigger which Remaps Columns

I'm wondering if I can use a trigger on a table to "ignore" columns that are in a COPY statement from STDIN but which are not in the target table. Sorry if the wording/syntax of the question is off, but here is an explanation of what I'm trying to say. I'm new to triggers, so any advice is helpful.
I'm using the PostGIS Shapefile importer to copy shapefiles to the spatial tables in my PostgreSQL database.
This creates a COPY statement which contains all the fields in the shapefile, something like:
COPY "public"."stations" ("column1","column2","column3","column4", geom) FROM stdin;
column1 and column2 are in the file but not in the target table, so the COPY fails.
Is there a way to create a trigger to create something that would have the same result as:
COPY "public"."stations" ("column3","column4", geom) FROM stdin;
No, you cannot skip columns that are present in the input file. This will error out before triggers are even invoked. And you cannot use rules either. I quote the manual:
COPY FROM will invoke any triggers and check constraints on the destination table. However, it will not invoke rules.
You can either edit the file or use a temporary staging table:
COPY to a temporary table with matching columns.
Use INSERT to write the desired columns to the final target table(s) - or the whole range of SQL DML commands for more sophisticated matters (a sketch follows below).
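A hedged sketch of that staging approach, reusing the column names from the question (the staging table and its column types are assumptions; the geometry type comes from PostGIS):
-- Staging table matching the shapefile dump, extra columns included.
CREATE TEMP TABLE stations_staging (
    column1 text,
    column2 text,
    column3 text,
    column4 text,
    geom    geometry
);
-- Fed by the importer's data stream, just like the original COPY.
COPY stations_staging ("column1","column2","column3","column4", geom) FROM stdin;
-- Keep only the columns the target table actually has.
INSERT INTO "public"."stations" ("column3","column4", geom)
SELECT column3, column4, geom FROM stations_staging;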