How can I ignore duplicate rows while storing a dataframe in a postgres DB with Blaze's Odo?
For example, I store the first 3 rows like this:
>>> odo(df[:3], 'postgresql:///my_db::my_table')
my_table has a column ID as its primary key. If I then add a few more rows, this time including the previously stored last row, I want that row to be skipped and the others inserted, instead of getting that IntegrityError.
>>> odo(df[2:5], 'postgresql:///my_db::my_table')
IntegrityError: (psycopg2.IntegrityError) duplicate key value violates unique constraint ...
How can I do that? Loading the ID values from the DB and checking for duplicates seems expensive if the table has millions of rows. Is there a better alternative?
Something like this:
INSERT...ON DUPLICATE KEY IGNORE
Blaze: 0.8.3, Postgres: 9.4.4, Psycopg2: 2.6.1
The data model for odo only supports appending, not merging. You'll need to either remove the duplicates before passing the frame through odo, or use the database to remove them. Try adding an auto-increment field and setting it as the primary key. This avoids the IntegrityError on insertion, and you can remove the duplicates after that.
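A minimal sketch of the database-side route on PostgreSQL 9.4 (ON CONFLICT only arrived in 9.5), assuming the new rows are first loaded into a hypothetical staging table my_table_staging and that the payload columns are called col_a and col_b:

-- load the new batch into the staging table first, e.g. via
-- odo(df[2:5], 'postgresql:///my_db::my_table_staging')

-- then copy over only the rows whose ID is not already present
INSERT INTO my_table (id, col_a, col_b)
SELECT s.id, s.col_a, s.col_b
FROM my_table_staging s
WHERE NOT EXISTS (SELECT 1 FROM my_table m WHERE m.id = s.id);

TRUNCATE my_table_staging;  -- clear the staging table for the next batch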
Related
Firstly, I have a table USERS in the database with almost 30 million records. I have a separate index on each column, but some columns have only 2 or 3 non-NULL values while the rest are NULL, yet their index size is 847 MB, only a little less than the index on a column that holds a unique value for each row.
Does anyone know why this is?
Secondly, in PostgreSQL every primary key gets an index by default. What would be the consequences of deleting that index?
What is that index really used for?
Since I only search on values in other columns, would it be safe to delete the primary key's index?
NULL values are stored in indexes just like all other values, so the first part is not surprising.
You cannot delete the primary key index; what you could do is drop the primary key constraint. But then you can no longer be certain that no duplicate rows get added to the table. If you think that is no problem, look at the many questions asking for help with exactly that problem.
Every table should have a primary key.
But it might be a good idea to get rid of some other indexes if you don't need them.
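To find candidates for removal, a sketch against PostgreSQL's statistics views (note that idx_scan is cumulative since the last statistics reset, so a zero only means "not used since then"):

SELECT schemaname, relname, indexrelname, idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0                 -- never used since the stats reset
ORDER BY pg_relation_size(indexrelid) DESC;

Indexes that back primary key or unique constraints will also show up here; those have to stay.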
There is no separate thing called a primary key index; what you are seeing is the unique index that backs the primary key.
First of all you need to understand the difference between a primary key and an index. You can have only one primary key per table; it is the unique identifier of each row and does not allow NULLs. An index is used to speed up fetching on a particular column, and a unique index does allow NULLs (in PostgreSQL even multiple NULLs, since NULLs are never considered equal to each other). Deleting an ordinary index will not impact anything apart from performance; whether to have an index is a design decision.
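A throwaway example showing that PostgreSQL backs a primary key with a unique index (the table name demo is made up):

CREATE TABLE demo (id int PRIMARY KEY);

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'demo';
-- indexname | demo_pkey
-- indexdef  | CREATE UNIQUE INDEX demo_pkey ON public.demo USING btree (id)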
I am trying to bulk load records from a temp table into a target table using an INSERT ... SELECT statement with an ON CONFLICT DO UPDATE strategy.
I want to load as many records as possible. Currently, if there are any foreign key violations, no records get inserted and everything gets rolled back. Is there a way to insert the valid records and skip the faulty ones?
In https://dba.stackexchange.com/a/46477 I saw the strategy of joining the foreign table in the query to filter out the faulty rows. I don't want to do that either, as I may have many foreign keys on the table and it would make my query more complex and table-specific; I would like it to be generic.
Sample use case: if I have 100 rows in the temp table and rows 5 and 7 cause insertion failures, I want to insert the remaining 98 records and identify which two rows failed.
I want to avoid inserting record by record and catching each error, as that is not efficient; the whole point of this exercise is to avoid loading the table row by row.
Oracle provides support for catching bulk errors in one shot; see for example:
https://stackoverflow.com/a/36430893/8575780
https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:1422998100346727312
I have already explored loading with COPY; it catches NOT NULL constraints and other data type errors, but when a foreign key violation happens, nothing gets committed.
I am looking for something closer to what pgloader does when it faces an error:
https://pgloader.readthedocs.io/en/latest/pgloader.html#batches-and-retry-behaviour
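For PostgreSQL, one generic way to approximate pgloader's batch-retry behaviour is recursive bisection in PL/pgSQL: try the whole range in one statement, and on a foreign key violation split it in half and retry, logging single rows that still fail. A minimal sketch, assuming hypothetical tables temp_tbl(rn, col1, col2) with a contiguous row number rn, a target target_tbl(col1, col2), and a log table failed_rows:

CREATE TABLE IF NOT EXISTS failed_rows (rn bigint);

CREATE OR REPLACE FUNCTION insert_range(lo bigint, hi bigint)
RETURNS void LANGUAGE plpgsql AS $$
BEGIN
    -- try the whole range in a single statement
    INSERT INTO target_tbl (col1, col2)
    SELECT t.col1, t.col2
    FROM temp_tbl t
    WHERE t.rn BETWEEN lo AND hi;
EXCEPTION
    WHEN foreign_key_violation THEN
        IF lo = hi THEN
            -- a single faulty row: log it and move on
            INSERT INTO failed_rows VALUES (lo);
        ELSE
            -- bisect the batch and retry each half
            PERFORM insert_range(lo, (lo + hi) / 2);
            PERFORM insert_range((lo + hi) / 2 + 1, hi);
        END IF;
END;
$$;

SELECT insert_range(1, 100);  -- inserts the 98 good rows, logs rows 5 and 7

Each BEGIN ... EXCEPTION block runs in its own subtransaction, so a failed batch rolls back cleanly before being retried; in the error-free case the whole load is still a single statement.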
I am creating a table in PostgreSQL 9.5 where id is the primary key. While inserting rows into the table, if anyone tries to insert a duplicate id, I want it to be ignored instead of raising an exception. Is there any way to arrange, at table creation time, that duplicate entries get ignored?
There are many techniques to resolve the duplicate insertion issue in the insertion query itself, e.g. using ON CONFLICT DO NOTHING or a WHERE EXISTS clause. But I want to handle this at table creation, so that the person writing the insertion query doesn't need to bother with it at all.
Creating a RULE is one possible solution. Are there other possible solutions? Maybe something like this:
`CREATE TABLE dbo.foo (bar int PRIMARY KEY WITH (FILLFACTOR=90, IGNORE_DUP_KEY = ON))`
Although this exact statement doesn't work on PostgreSQL 9.5 on my machine.
Add a BEFORE INSERT trigger, or a rule ON INSERT ... DO INSTEAD; otherwise it has to be handled by the inserting query. Both solutions cost extra resources on each insert.
An alternative is to provide a function that takes the values as arguments and checks for duplicates before inserting, so that end users call the function instead of a plain INSERT statement.
A WHERE EXISTS sub-query is not atomic, by the way, so you can still get an exception after the check.
ON CONFLICT DO NOTHING, available since 9.5, is still the best solution.
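A minimal sketch of the rule approach, assuming a hypothetical table my_tbl with primary key column id:

CREATE RULE ignore_duplicate_id AS
    ON INSERT TO my_tbl
    WHERE EXISTS (SELECT 1 FROM my_tbl WHERE id = NEW.id)
    DO INSTEAD NOTHING;

As noted above, the rule's EXISTS check is not atomic under concurrent inserts, which is why the query-side ON CONFLICT DO NOTHING remains the more robust choice.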
This may be marked as a duplicate, but I ran into an issue when I referred to
Create Unqiue case-insensitive constraint on two varchar fields
I have a table std_tbl with some duplicate records in one of its columns, say Column_One.
I created a unique constraint on that column:
ALTER TABLE std_tbl
ADD CONSTRAINT Unq_Column_One
UNIQUE (Column_One) ENABLE NOVALIDATE;
I used ENABLE NOVALIDATE because I want to keep the existing duplicate records while validating future records for duplicates.
But the constraint does not account for case: if the value of Column_One is 'abcd', it still allows 'Abcd' and 'ABCD' to be inserted into the table.
I want this behaviour to be case-insensitive, so that case is ignored when validating the data. For this I came up with this solution:
CREATE UNIQUE INDEX Unq_Column_One_indx ON std_tbl (LOWER(Column_One));
But it is giving me the error:
ORA-01452: cannot CREATE UNIQUE INDEX; duplicate keys found
Please help me out...
This occurs when you try to execute a CREATE UNIQUE INDEX statement on one or more columns that contain duplicate values.
Two ways to resolve (that I know of):
Remove the UNIQUE keyword from your CREATE UNIQUE INDEX statement and rerun the command (i.e. if the values need not be unique).
If they must be unique, delete the extraneous records that are causing the duplicate values and rerun the CREATE UNIQUE INDEX statement.
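To see which values are blocking the index before choosing option 2, a quick check against the asker's table:

SELECT LOWER(Column_One) AS lowered_value, COUNT(*) AS occurrences
FROM std_tbl
GROUP BY LOWER(Column_One)
HAVING COUNT(*) > 1;

Once those duplicates are deleted or consolidated, the CREATE UNIQUE INDEX on LOWER(Column_One) will go through.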
I am getting a duplicate key error, DB2 SQL Error: SQLCODE=-803, SQLSTATE=23505, when I try to INSERT records. The primary key is one column, INTEGER 4, Generated, and it is the first column.
The insert looks like this: INSERT INTO SCHEMA.TABLE1 VALUES (DEFAULT, ?, ?, ...)
It's my understanding that using the value DEFAULT will just let DB2 auto-generate the key at the time of insert, which is what I want. This works most of the time, but sometimes/randomly I get the duplicate key error. Thoughts?
More specifically, I'm running against DB2 9.7.0.3, using Scriptella to copy a bunch of records from one database to another. Sometimes I can process a whole batch with no problems; other times I'll get the error right away, or after 2 records, or 20, or 30, etc. There does not seem to be a pattern, nor is it the same record every time. If I change the job to copy 1 record instead of a batch, sometimes I'll get the error once and then it's fine the next time.
I thought maybe some other process was inserting records during my batch program, and creating keys at the same time. However, the tables I'm copying TO should not have any other users/processes trying to INSERT records during this same time frame, although there could be READS happening.
Edit: adding create info:
Create table SCHEMA.TABLE1 (
SYSTEM_USER_KEY INTEGER NOT NULL
generated by default as identity (start with 1 increment by 1 cache 20),
COL2...,
)
alter table SCHEMA.TABLE1
add constraint SYSTEM_USER_SYSTEM_USER_KEY_IDX
Primary Key (SYSTEM_USER_KEY);
You most likely have records in your table with IDs that are bigger than the next value of your identity sequence. To find out what value your sequence is currently at, run the following query:
select s.nextcachefirstvalue-s.cache, s.nextcachefirstvalue-s.increment
from syscat.COLIDENTATTRIBUTES as a inner join syscat.sequences as s on a.seqid=s.seqid
where a.tabschema='SCHEMA'
and a.TABNAME='TABLE1'
and a.COLNAME='SYSTEM_USER_KEY'
So basically what happened is that somehow you got records into your table with IDs that are bigger than the current last value of your identity sequence, so sooner or later those IDs collide with identity-generated ones.
There are different ways this could have happened. One possibility is that data was loaded that already contained values for the ID column, or that records were inserted with an explicit value for the ID. Another option is that the identity sequence was reset to start at a lower value than the maximum ID in the table.
Whatever the cause, you may also want the fix:
SELECT MAX(<primary_key_column>) FROM <table>;
ALTER TABLE <table> ALTER COLUMN <primary_key_column> RESTART WITH <number from previous query + 1>;
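For the table above that would look something like this (the 42 is made up; use whatever the MAX query actually returns):

SELECT MAX(SYSTEM_USER_KEY) FROM SCHEMA.TABLE1;    -- say this returns 42

ALTER TABLE SCHEMA.TABLE1
  ALTER COLUMN SYSTEM_USER_KEY RESTART WITH 43;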