Best way to prevent duplicate data on copy csv postgresql

This is more of a conceptual question because I'm planning how best to achieve our goals here.
I have a postgresql/postgis table with 5 columns. I'll be appending data to it from a csv file every 10 minutes or so via the copy command. There will likely be some duplicate rows, so I'd like to copy the data from the csv file into the table while keeping any duplicate entries out. An entry is a duplicate when three columns all match: "latitude", "longitude" and "time". Should I make a composite key from all three columns? If I do that, will copy just throw an error when it hits a duplicate? The copy will run automatically, so I want it to load the rows that aren't duplicates and skip the ones that are. Is there a way to do this?
Also, I of course want it to look for duplicates in the most efficient way. I don't need to look through the whole table (which will be quite large) for duplicates, just the past 20 minutes or so, based on the timestamp on the row. I've already indexed the table on the time column.
Thanks for any help!

Upsert
The Answer by Linoff is correct but can be simplified a bit by the new "UPSERT" feature in Postgres 9.5 (a.k.a. MERGE). That feature is implemented in Postgres as the INSERT ... ON CONFLICT syntax.
Rather than explicitly checking for a violation of the unique index, we can let the ON CONFLICT clause detect the violation. Then we DO NOTHING, meaning we abandon the effort to INSERT without bothering to attempt an UPDATE. So if we cannot insert a row, we just move on to the next one.
We get the same results as Linoff's code but lose the WHERE clause.
INSERT INTO bigtable(col1, … )
SELECT col1, …
FROM stagingtable st
ON CONFLICT idx_bigtable_col1_col2_col
DO NOTHING
;

I think I would take the following approach.
First, create an index on the three columns that you care about:
create unique index idx_bigtable_col1_col2_col3 on bigtable(col1, col2, col3);
Then, load the data into a staging table using copy. Finally, you can do:
insert into bigtable(col1, . . . )
select col1, . . .
from stagingtable st
where (col1, col2, col3) not in (select col1, col2, col3 from bigtable);
Assuming no other data modifications are going on, this should accomplish what you want. Checking for duplicates using the index should be ok from a performance perspective.
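The staging-table steps above can be sketched end to end in Python. This is only an illustration under assumptions: it uses the standard-library sqlite3 module as a stand-in for PostgreSQL so that it runs without a server, the payload column and the make_bigtable and load_without_duplicates helpers are made up for the example, and NOT EXISTS is swapped in for the row-value NOT IN of the answer. The GROUP BY is also an addition: the anti-join only filters against rows already in bigtable, so two identical keys inside one batch would otherwise still hit the unique index.

```python
import sqlite3


def make_bigtable(conn):
    """Create the target table and the three-column unique index from the answer."""
    conn.execute("CREATE TABLE bigtable (col1, col2, col3, payload)")
    conn.execute(
        "CREATE UNIQUE INDEX idx_bigtable_col1_col2_col3 "
        "ON bigtable (col1, col2, col3)"
    )


def load_without_duplicates(conn, rows):
    """Stage rows, copy over only the new keys, and return how many were inserted."""
    before = conn.execute("SELECT COUNT(*) FROM bigtable").fetchone()[0]
    # Hypothetical staging table; in PostgreSQL this is where COPY would load to.
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS stagingtable (col1, col2, col3, payload)"
    )
    conn.execute("DELETE FROM stagingtable")
    conn.executemany("INSERT INTO stagingtable VALUES (?, ?, ?, ?)", rows)
    conn.execute(
        """
        INSERT INTO bigtable (col1, col2, col3, payload)
        SELECT st.col1, st.col2, st.col3, MIN(st.payload)
        FROM stagingtable st
        WHERE NOT EXISTS (
            SELECT 1 FROM bigtable b
            WHERE b.col1 = st.col1 AND b.col2 = st.col2 AND b.col3 = st.col3
        )
        GROUP BY st.col1, st.col2, st.col3
        """
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM bigtable").fetchone()[0]
    return after - before
```

Calling load_without_duplicates twice with overlapping batches should insert each key only once; the second call returns just the number of keys that were new.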
An alternative method is to emulate MySQL's "on duplicate key update" behavior to ignore such records. Bill Karwin suggests implementing a rule in an answer to this question. The documentation for rules is here. Something similar could also be done with triggers.

The method posted by Basil Bourque was great, but there was a slight syntax error.
Based on the documentation, I modified it to the following, which works:
INSERT INTO bigtable(col1, … )
SELECT col1, …
FROM stagingtable st
ON CONFLICT (col1, col2, col3)
DO NOTHING
;
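For the automated 10-minute load in the question, the staging table and ON CONFLICT DO NOTHING could be combined roughly as follows. This is a sketch, not a tested implementation: it assumes conn is an open psycopg2 connection, that bigtable has a unique index on (latitude, longitude, "time"), and that the csv file has a header row and the same column order as bigtable. The function name copy_csv_skipping_duplicates and the three SQL constants are invented for the example.

```python
# Temporary staging table with the same columns as bigtable; dropped at commit.
STAGING_SQL = (
    "CREATE TEMP TABLE stagingtable "
    "(LIKE bigtable INCLUDING DEFAULTS) ON COMMIT DROP"
)
# Assumes the csv file starts with a header line.
COPY_SQL = "COPY stagingtable FROM STDIN WITH (FORMAT csv, HEADER true)"
# Relies on a unique index over the three key columns named in the question.
MERGE_SQL = (
    "INSERT INTO bigtable "
    "SELECT * FROM stagingtable "
    'ON CONFLICT (latitude, longitude, "time") DO NOTHING'
)


def copy_csv_skipping_duplicates(conn, csv_file):
    """Load csv_file into bigtable via a staging table; return rows inserted.

    conn is assumed to be a psycopg2 connection and csv_file an open
    file object for the csv data.
    """
    with conn:  # psycopg2: commit on success, roll back on an exception
        with conn.cursor() as cur:
            cur.execute(STAGING_SQL)
            cur.copy_expert(COPY_SQL, csv_file)  # stream the file through COPY
            cur.execute(MERGE_SQL)
            return cur.rowcount  # rows skipped as duplicates are not counted
```

Because the conflict check goes through the unique index, there is no need to scan the whole table or to restrict the check to the last 20 minutes by hand.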

Related

Troubleshooting an insert statement, fails without error

I am trying to do what should be a pretty straightforward insert statement in a postgres database. It is not working, but it's also not erroring out, so I don't know how to troubleshoot.
This is the statement:
INSERT INTO my_table (col1, col2) select col1,col2 FROM my_table_temp;
There are around 200m entries in the temp table, and 50m entries in my_table. The temp table has no index or constraints, but both columns in my_table have btree indexes, and col1 has a foreign key constraint.
I ran the first query for about 20 days. Last time I tried a similar insert of around 50m, it took 3 days, so I expected it to take a while, but not a month. Moreover, my_table isn't getting longer. Queried 1 day apart, the following produces the same exact number.
select count(*) from my_table;
So it isn't inserting at all. But it also didn't error out. And looking at system resource usage, it doesn't seem to be doing much of anything at all, the process isn't drawing resources.
Looking at other running queries, nothing else that I have permissions to view is touching either table, and I'm the only one who uses them.
I'm not sure how to troubleshoot since there's no error. It's just not doing anything. Any thoughts about things that might be going wrong, or things to check, would be very helpful.

For the sake of anyone stumbling onto this question in the future:
After a lengthy discussion (see linked discussion from the comments above), the issue turned out to be related to psycopg2 buffering the query in memory.
Another useful note: inserting into a table with indices is slow, so it can help to remove them before bulk loads, and then add them again after.

In my case it was a date format issue. I commented out the date attribute before inserting into the DB and it worked.

In my case it was a TRIGGER on the same table I was updating and it failed without errors.
Deactivated the trigger and the update worked flawlessly.

How can I remove extra characters from a column?

I have a table with Customer/Phone/City/State/Zip/etc..
Occasionally, I'll be importing the info from a .csv file, and sometimes the zipcode is formatted like this: xxxxx-xxxx and I only need it to be a general, 5 digit zip code.
How can I delete the last 5 characters without having to do it from Excel, cell by cell (which is what I'm doing now)?
Thanks
EDIT: This is what I used after Craig's suggestion and it worked. However, some of the zip entries are Canadian postal codes, which are often formatted x1x-x2x. Running this deletes the last character in those fields.
How could I remedy this?

You'll need to do one of these three things:
use an ETL tool to filter the data during insert;
COPY into a TEMPORARY or UNLOGGED table, then do an INSERT INTO real_table SELECT ... that transforms the data with a suitable substring(...) call; or
write a simple Perl/Python/whatever script that reads the csv, transforms it as desired, and inserts the results into PostgreSQL. I'd use Python with the csv module and psycopg2's copy_from.
Such an insert into ... select might look like:
INSERT INTO real_table(col1, col2, zip)
SELECT
col1,
col2,
substring(zip from 1 for 5)
FROM temp_table;
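The script option could look something like the sketch below, which also addresses the Canadian codes mentioned in the question's edit: only values that match the US ZIP+4 pattern are shortened, and everything else passes through unchanged. The column name zip, the sample rows, and the helper names normalize_zip and clean_csv are assumptions made for the example.

```python
import csv
import io
import re

# Matches US ZIP+4 only; anything else (for example Canadian codes) is left alone.
ZIP_PLUS_4 = re.compile(r"^(\d{5})-\d{4}$")


def normalize_zip(value):
    """Return the 5-digit ZIP for a ZIP+4 value, otherwise the stripped input."""
    value = value.strip()
    match = ZIP_PLUS_4.match(value)
    return match.group(1) if match else value


def clean_csv(src, dst, zip_column="zip"):
    """Copy CSV rows from src to dst, normalizing the (assumed) zip column."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[zip_column] = normalize_zip(row[zip_column])
        writer.writerow(row)


# Made-up sample data: one ZIP+4, one Canadian-style code, one plain ZIP.
raw = io.StringIO("customer,zip\nAnn,12345-6789\nBob,K1A-0B1\nCy,54321\n")
cleaned = io.StringIO()
clean_csv(raw, cleaned)
print(cleaned.getvalue())
```

The cleaned file object can then be handed to COPY, for instance through psycopg2's copy_from as suggested above.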

Insert and update records in one TSQL statement?

I have a table BigTable and a table LittleTable. I want to move a copy of some records from BigTable into LittleTable and then (for these records) set BigTable.ExportedFlag to T (indicating that a copy of the record has been moved to little table).
Is there any way to do this in one statement?
I know I can do a transaction to:
moves the records from big table based on a where clause
updates big table setting exported to T based on this same where clause.
I've also looked into a MERGE statement, which does not seem quite right, because I don't want to change values in little table, just move records to little table.
I've looked into an OUTPUT clause after the update statement but can't find a useful example. I don't understand why Pinal Dave is using Inserted.ID, Inserted.TEXTVal, Deleted.ID, Deleted.TEXTVal instead of Updated.TextVal. Is the update considered an insertion or deletion?
I found this post TSQL: UPDATE with INSERT INTO SELECT FROM saying "AFAIK, you cannot update two different tables with a single sql statement."
Is there a clean single statement to do this? I am looking for a correct, maintainable SQL statement. Do I have to wrap two statements in a single transaction?

You can use the OUTPUT clause as long as LittleTable meets the requirements to be the target of an OUTPUT ... INTO
UPDATE BigTable
SET ExportedFlag = 'T'
OUTPUT inserted.Col1, inserted.Col2 INTO LittleTable(Col1,Col2)
WHERE <some_criteria>
It makes no difference whether you use INSERTED or DELETED. The only column that differs between them is the one you are updating (deleted.ExportedFlag has the before value and inserted.ExportedFlag will be 'T').

syntax for COPY in postgresql

INSERT INTO contacts_lists (contact_id, list_id)
SELECT contact_id, 67544
FROM plain_contacts
Here I want to use the COPY command instead of INSERT to reduce the time it takes to insert the values. I fetch the data with a SELECT. How can I insert it into a table using COPY in PostgreSQL? Could you please give an example? Any other suggestion for reducing the insert time is also welcome.

As your rows are already in the database (because you apparently can SELECT them), then using COPY will not increase the speed in any way.
To be able to use COPY you first have to write the values into a text file, which is then read into the database. But if you can SELECT them, writing to a text file is a completely unnecessary step and will slow down your insert, not speed it up.
Your statement is as fast as it gets. The only thing that might speed it up (apart from buying a faster hard disk) is to remove any index on contacts_lists that contains the column contact_id or list_id and re-create the index once the insert is finished.

You can find the syntax described in many places, I'm sure. One of those is this wiki article.
It looks like it would basically be:
COPY (SELECT contact_id, 67544 FROM plain_contacts) TO '/path/to/some_file'
And
COPY contacts_lists (contact_id, list_id) FROM '/path/to/some_file'
But I'm just reading from the resources that Google turned up. Give it a try and post back if you need help with a specific problem.

how to emulate "insert ignore" and "on duplicate key update" (sql merge) with postgresql?

Some SQL servers have a feature where INSERT is skipped if it would violate a primary/unique key constraint. For instance, MySQL has INSERT IGNORE.
What's the best way to emulate INSERT IGNORE and ON DUPLICATE KEY UPDATE with PostgreSQL?

With PostgreSQL 9.5, this is now native functionality (like MySQL has had for several years):
INSERT ... ON CONFLICT DO NOTHING/UPDATE ("UPSERT")
9.5 brings support for "UPSERT" operations.
INSERT is extended to accept an ON CONFLICT DO UPDATE/IGNORE clause. This clause specifies an alternative action to take in the event of a would-be duplicate violation.
...
Further example of new syntax:
INSERT INTO user_logins (username, logins)
VALUES ('Naomi',1),('James',1)
ON CONFLICT (username)
DO UPDATE SET logins = user_logins.logins + EXCLUDED.logins;
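To watch the DO UPDATE branch fire without a PostgreSQL server, the same statement can be run twice against SQLite, which accepts a very similar upsert syntax from version 3.24 on. This is a stand-in sketch only: the table definition is assumed, and SQLite's behavior is not guaranteed to match PostgreSQL in every detail.

```python
import sqlite3

# Same statement as above; "excluded" refers to the row that failed to insert.
UPSERT_SQL = """
    INSERT INTO user_logins (username, logins)
    VALUES ('Naomi', 1), ('James', 1)
    ON CONFLICT (username)
    DO UPDATE SET logins = user_logins.logins + excluded.logins
"""

conn = sqlite3.connect(":memory:")
# Assumed schema: username must be unique for ON CONFLICT (username) to apply.
conn.execute("CREATE TABLE user_logins (username TEXT PRIMARY KEY, logins INTEGER)")
conn.execute(UPSERT_SQL)  # first run: both rows are inserted
conn.execute(UPSERT_SQL)  # second run: both rows conflict and are updated
totals = dict(conn.execute("SELECT username, logins FROM user_logins"))
print(totals)
```

After the second run each user should have a login count of 2 rather than a duplicate row.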

Edit: in case you missed warren's answer, PG9.5 now has this natively; time to upgrade!
Building on Bill Karwin's answer, to spell out what a rule based approach would look like (transferring from another schema in the same DB, and with a multi-column primary key):
CREATE RULE "my_table_on_duplicate_ignore" AS ON INSERT TO "my_table"
WHERE EXISTS(SELECT 1 FROM my_table
WHERE (pk_col_1, pk_col_2)=(NEW.pk_col_1, NEW.pk_col_2))
DO INSTEAD NOTHING;
INSERT INTO my_table SELECT * FROM another_schema.my_table WHERE some_cond;
DROP RULE "my_table_on_duplicate_ignore" ON "my_table";
Note: The rule applies to all INSERT operations until the rule is dropped, so not quite ad hoc.

For those of you that have Postgres 9.5 or higher, the new ON CONFLICT DO NOTHING syntax should work:
INSERT INTO target_table (field_one, field_two, field_three )
SELECT field_one, field_two, field_three
FROM source_table
ON CONFLICT (field_one) DO NOTHING;
For those of us who have an earlier version, this left join will work instead:
INSERT INTO target_table (field_one, field_two, field_three )
SELECT source_table.field_one, source_table.field_two, source_table.field_three
FROM source_table
LEFT JOIN target_table ON source_table.field_one = target_table.field_one
WHERE target_table.field_one IS NULL;

Try to do an UPDATE. If it doesn't modify any row that means it didn't exist, so do an insert. Obviously, you do this inside a transaction.
You can of course wrap this in a function if you don't want to put the extra code on the client side. You also need a retry loop to handle the very rare race condition where another session inserts the same row between your UPDATE and your INSERT.
There's an example of this in the documentation: http://www.postgresql.org/docs/9.3/static/plpgsql-control-structures.html, example 40-2 right at the bottom.
That's usually the easiest way. You can do some magic with rules, but it's likely going to be a lot messier. I'd recommend the wrap-in-function approach over that any day.
This works for single-row or few-row values. If you're dealing with a large number of rows, for example from a subquery, you're best off splitting it into two queries, one for INSERT and one for UPDATE (as an appropriate join/subselect of course; there's no need to write your main filter twice).

To get the insert ignore logic you can do something like below. I found simply inserting from a select statement of literal values worked best, then you can mask out the duplicate keys with a NOT EXISTS clause. To get the update on duplicate logic I suspect a pl/pgsql loop would be necessary.
INSERT INTO manager.vin_manufacturer
(SELECT * FROM( VALUES
('935',' Citroën Brazil','Citroën'),
('ABC', 'Toyota', 'Toyota'),
('ZOM',' OM','OM')
) as tmp (vin_manufacturer_id, manufacturer_desc, make_desc)
WHERE NOT EXISTS (
--ignore anything that has already been inserted
SELECT 1 FROM manager.vin_manufacturer m where m.vin_manufacturer_id = tmp.vin_manufacturer_id)
)

INSERT INTO mytable(col1,col2)
SELECT 'val1','val2'
WHERE NOT EXISTS (SELECT 1 FROM mytable WHERE col1='val1')

As @hanmari mentioned in his comment, when inserting into a Postgres table, ON CONFLICT (..) DO NOTHING is the best code to use to avoid inserting duplicate data:
query = """INSERT INTO db_table_name(column_name)
VALUES(%s) ON CONFLICT (column_name) DO NOTHING;"""
The ON CONFLICT line allows the insert statement to keep inserting the remaining rows of data. The query and values code is an example of inserting data from an Excel file into a Postgres db table.
I have constraints added to a Postgres table to make sure the ID field is unique. Instead of running a delete on rows of data that are the same, I add a line of SQL code that restarts the ID column's sequence at 1.
Example (the sequence name assumes the default naming for a serial column called id_column on db_table_name):
q = 'ALTER SEQUENCE db_table_name_id_column_seq RESTART WITH 1'
If my data has an ID field, I do not use it as the primary ID/serial ID; I create a separate ID column and set it to serial.
I hope this information is helpful to everyone.
*I have no college degree in software development/coding. Everything I know in coding, I study on my own.

Looks like PostgreSQL supports a schema object called a rule.
http://www.postgresql.org/docs/current/static/rules-update.html
You could create a rule ON INSERT for a given table, making it do NOTHING if a row exists with the given primary key value, or else making it do an UPDATE instead of the INSERT if a row exists with the given primary key value.
I haven't tried this myself, so I can't speak from experience or offer an example.

This solution avoids using rules:
BEGIN
INSERT INTO tableA (unique_column,c2,c3) VALUES (1,2,3);
EXCEPTION
WHEN unique_violation THEN
UPDATE tableA SET c2 = 2, c3 = 3 WHERE unique_column = 1;
END;
but it has a performance drawback (see PostgreSQL.org):
A block containing an EXCEPTION clause is significantly more expensive
to enter and exit than a block without one. Therefore, don't use
EXCEPTION without need.

In bulk, you can always delete the row before the insert. Deleting a row that doesn't exist doesn't cause an error, so it's safely skipped.

For data import scripts, to replace "IF NOT EXISTS", in a way, there's a slightly awkward formulation that nevertheless works:
DO
$do$
BEGIN
PERFORM id
FROM whatever_table;
IF NOT FOUND THEN
-- INSERT stuff
END IF;
END
$do$;
