Ignore row if duplicate at CSV import - postgresql

I was wondering if this is possible. If a row for some reason cannot be imported,
e.g. a duplicate primary key, a wrong input type, etc., can it be ignored so the import moves on to the next row?
I'm getting this:
ERROR: duplicate key value violates unique constraint "team_pkey"
DETAIL: Key (team)=(DEN) already exists.
CONTEXT: COPY team, line 23: "DEN,Denver,Rockets,A"
There are a lot of mistakes in the file, and it's a pretty big one, so is it possible to skip the rows that can't be inserted?

A solution that handles the duplicate key issue is described in To ignore duplicate keys during 'copy from' in postgresql - in short, COPY into an unconstrained temp table, then SELECT DISTINCT ON (uniquefield) from it into the destination table.
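A minimal sketch of that approach, assuming the target table is team with primary key column team (the staging-table name and file path are made up, and ON CONFLICT needs PostgreSQL 9.5+):

-- Staging table with the same columns but no constraints, so key clashes don't abort the load.
CREATE TEMP TABLE team_staging (LIKE team);

COPY team_staging FROM '/path/to/teams.csv' WITH (FORMAT csv);

-- Keep one row per key (which duplicate survives is arbitrary without an ORDER BY)
-- and silently skip keys already present in the destination.
INSERT INTO team
SELECT DISTINCT ON (team) *
FROM team_staging
ON CONFLICT (team) DO NOTHING;

Note that rows with a wrong input type will still abort the COPY into the staging table; only duplicate keys are handled this way.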
Another way would involve using pgLoader. Unfortunately the documentation seems to have disappeared from the website, but there are several tutorial articles on the author's site. It has rich functionality for reading data with issues, and can do things like store rejected lines in a separate file, transform fields, and so on.
Something that may not be obvious immediately: pgLoader version 2 is written in Python, version 3 is written in Lisp. Both can be obtained from the GitHub page.

Constrain a column across tables

I'm using Postgres to store formulae and elements of formulae across two tables. Basically, you have something like:
Elements Table

symbol | content
-------+--------
pi     | 3.1.45
lune   | 42
Formula Table

symbol  | content
--------+----------
area    | pi*r^2
rugsize | area*lune
So, formulae can use elements but also other formulae in their content field. For this reason (and for general reduction of confusion) I would like to make symbol unique across both tables.
I can, of course, in the code that's doing the insertion, look up the entries before adding them and refuse to add a duplicate symbol. (I probably will do this, but I don't want the database reliant on that.) I could also require a tag within the formula table to specify when it's using another formula:
symbol  | content
--------+-------------
rugsize | f(area)*lune
I'm not crazy about that since it puts a burden on the user to remember that, or on the coder to secretly add and remove the "f()".
Everything I found on Stack and elsewhere went the other way: forcing a column value to be present in another table, except for one suggestion that the unique items be kept in a separate table:
symbol
-------
area
lune
pi
rugsize
And then...actually, I'm still not sure how that would work at the DB level.
So, is there a way to do this with constraints or foreign keys, or must I write a trigger for each table to look into the other table?
Addition: I've simplified greatly here, but the elements table is much more complex than I'm showing and has little in common with the formulae table.
Edited to add the above addition and to try to fix the one-column "symbol" table which looks fine in the editing preview but does not format correctly on the actual page.
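For reference, here is one way the separate-table suggestion can work at the DB level (a sketch with illustrative names, not taken from the question): give every symbol a row in a shared table with a kind discriminator, and have each child table reference (symbol, kind) with its own kind fixed. Since symbol is the primary key of the shared table, a symbol can only ever have one kind, so it can appear in at most one of the child tables.

CREATE TABLE symbols (
    symbol text PRIMARY KEY,
    kind   text NOT NULL CHECK (kind IN ('element', 'formula')),
    UNIQUE (symbol, kind)   -- composite target for the foreign keys below
);

CREATE TABLE elements (
    symbol  text PRIMARY KEY,
    kind    text NOT NULL DEFAULT 'element' CHECK (kind = 'element'),
    content text,
    FOREIGN KEY (symbol, kind) REFERENCES symbols (symbol, kind)
);

CREATE TABLE formulae (
    symbol  text PRIMARY KEY,
    kind    text NOT NULL DEFAULT 'formula' CHECK (kind = 'formula'),
    content text,
    FOREIGN KEY (symbol, kind) REFERENCES symbols (symbol, kind)
);

The cost is that every insert must touch two tables, but no triggers are needed.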

INTERBASE - INDEX ISSUE

Good day,
I run many databases under InterBase XE7 and, now, 2017.
Lately I have seen some strange behavior on one of the databases:
a table with a primary key was found hosting many rows with duplicate key values, as in the image below.
We can see that SCRIPTTYPE is a primary key column, and it contains MATRIX many times, with no spaces or strange characters (I checked).
I was able to back up / restore without issues.
I am puzzled by this, and I am wondering if anybody has encountered something similar?
And if so, how was it dealt with?
Thanks.
[screenshot omitted: rows of the table, showing the SCRIPTTYPE primary key column containing MATRIX many times]
Run this query to be sure:
select SCRIPTTYPE, count(*)
from yourtable
group by SCRIPTTYPE
order by 2 desc
If you get a count > 1, then I'd argue it's simply a bug and you should contact support. For them to assist you they'll almost certainly need your database, so you should be prepared to provide that. Based on your description, you should be able to drop all tables except that one, and drop all columns except your key, then backup and restore to get the simplest test case.
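A sketch of that reduction, with assumed names (SCRIPTS as the problem table, EXTRACOL as a column to discard):

/* Repeat for every extra table and every non-key column. */
DROP TABLE SOME_OTHER_TABLE;
ALTER TABLE SCRIPTS DROP EXTRACOL;

Then back up and restore the reduced database with gbak, something like:

gbak -b mydb.ib mydb.ibk
gbak -c mydb.ibk mydb_restored.ib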

How to remove gaps due to the identity column jump in SQL Server 2017?

I use SQL Server with ASP.NET Core and EF Core. After each record is added, the identity column's value jumps by about 1000, creating a gap between the current row and the previously added row.
Questions:
1) Is there any way to prevent this?
2) How can I delete the gaps that have already been created?
3) If I use a GUID for key columns to prevent the issue, is there a problem (performance or any other problem)?
4) Is there some way to handle it on the server side, or with EF Core?
Thank you in advance for your help.
For the reason for the 1000-value gaps, see Aaron Bertrand's answer.
It doesn't really make sense to "want" to delete the gaps. The content of an identity column contains no semantic information. It correlates to nothing "in the world" outside the database. The gaps are as meaningless as the values themselves.
I don't see how a uniqueidentifier would "prevent" that issue. A uniqueidentifier may be "meaningfully" sortable (if you use newsequentialid()), but there's no sense in which any particular value is "one more" than a previous value.
You can certainly try to build your own key generating algorithm that does not produce gaps, but you will run into concurrency issues (also mentioned by Mr Bertrand).
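For what it's worth, one commonly cited mitigation for the cache-related jumps (not part of the answer above, and still not gapless: rollbacks and deletes leave holes) is a SEQUENCE declared with NO CACHE instead of an identity column. A sketch with illustrative names:

CREATE SEQUENCE dbo.TransactionDemoSeq
    AS INT
    START WITH 1
    INCREMENT BY 1
    NO CACHE;   -- no pre-allocated block, so a service restart cannot skip values

CREATE TABLE dbo.TransactionDemo
(
    TransactionId INT NOT NULL
        CONSTRAINT DF_TransactionDemo_Id DEFAULT (NEXT VALUE FOR dbo.TransactionDemoSeq)
        CONSTRAINT PK_TransactionDemo PRIMARY KEY,
    Amount DECIMAL(18, 2) NOT NULL
);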
A workaround trick:
CREATE OR ALTER TRIGGER TGR_Transaction_Identity_Fix
ON [dbo].[TBL_Transaction]
FOR INSERT
AS
BEGIN
    -- Reseed the identity to the current maximum so the next insert continues from there.
    DECLARE @ReseedVal INT;
    SELECT @ReseedVal = MAX(TransactionId) FROM [dbo].[TBL_Transaction];
    DBCC CHECKIDENT ('dbo.TBL_Transaction', RESEED, @ReseedVal);
END
This trigger will reset the identity seed on each insert. Note that it adds work to every insert and is not safe under concurrent inserts.

Avoid duplicate inserts without unique constraint in target table?

Source & target tables are similar.
The target table has a UUID field that is computed in tMap; however, the flow should not insert duplicate persons into the target, i.e. persons should be unique on (firstname, lastname, dob, gender). I tried marking those columns as keys in tMap, as in the screenshot below, but that does not prevent duplicate inserts. How can I avoid duplicate inserts without adding a unique constraint on the target?
I also tried "using field" in the target.
Edit: Solution as suggested below:
The CDC components in the paid version of Talend Studio for Data Integration undoubtedly address this.
In Open Studio, you can roll your own change data capture based on the composite unique key (firstname, lastname, dob, gender):
1) Use tUniqueRow on the data coming from stage_geno_patients, unique on the following columns: firstname, lastname, dob, gender.
2) Feed that into a tMap.
3) Add another query as input to the tMap to perform look-ups against the table behind "patients_test", finding a match on firstname, lastname, dob, gender. That lookup should "Reload for each row", looking up against values from the staging row.
4) In the case of no match, detect it and then insert the staging row of data into the table behind "patients_test" (a plain-SQL sketch of this step follows below).
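For reference, the no-match insert expressed as plain SQL rather than Talend configuration (table and column names are taken from the question; the uuid column stands in for the value computed in tMap, and this assumes the staging rows were already de-duplicated as in step 1):

INSERT INTO patients_test (uuid, firstname, lastname, dob, gender)
SELECT s.uuid, s.firstname, s.lastname, s.dob, s.gender
FROM stage_geno_patients s
WHERE NOT EXISTS (
    SELECT 1
    FROM patients_test t
    WHERE t.firstname = s.firstname
      AND t.lastname  = s.lastname
      AND t.dob       = s.dob
      AND t.gender    = s.gender
);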
Q: Are you going to update information, also? Or, is the goal only to perform unique inserts where the data is not already present?

Insert record in table if it does not exist in iPhone app

I am obtaining a JSON array from a URL and inserting data into a table. Since the contents of the URL are subject to change, I want to make a second connection to the URL, check for updates, and insert new records into my table using sqlite3.
The issues that I face are:
1) My table doesn't have a primary key.
2) The URL lists the changes for the same day. Hence, if I run my app multiple times, when I insert values into my database I get duplicate entries. I want some check so that the day's duplicated entries are removed (or never inserted). The problem could be solved by adding a constraint, but since the URL itself has duplicated values, I find that difficult.
If you have no primary key, or anything else unique to each record, the only way I can see is this: when the new data comes in, go through the new entries and, for each one, check whether the exact same data already exists in the database. If it doesn't, add it; if it does, skip it.
You could even create a unique key yourself for each entry as a concatenation of every column of the table; that way you can quickly check whether the entry already exists in the database.
I see two possibilities depending on your setup:
You have a column set up as UNIQUE (this can be through a PRIMARY KEY or not). In this case, you can use the ON CONFLICT clause (see the sketch after this list):
http://www.sqlite.org/lang_conflict.html
If you find this construct a little confusing, you can instead use "INSERT OR REPLACE" or "INSERT OR IGNORE" as described here:
http://www.sqlite.org/lang_insert.html
You do not have a column set up as UNIQUE. In this case, you will need to SELECT first to check for duplicate data, and based on the result INSERT, UPDATE, or do nothing.
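A minimal sketch of the first possibility, with an assumed table and columns:

-- A UNIQUE constraint across the columns that define "the same record".
CREATE TABLE IF NOT EXISTS entries (
    title TEXT,
    body  TEXT,
    day   TEXT,
    UNIQUE (title, body, day)
);

-- Re-running the import is now harmless: duplicate rows are silently skipped.
INSERT OR IGNORE INTO entries (title, body, day)
VALUES ('some title', 'some body', '2013-05-01');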
A more common & robust way to handle this is to associate a timestamp with each data item on the server. When your app interrogates the server it provides the timestamp corresponding to the last time it synced. The server then queries its database and returns all values that are timestamped later than the timestamp provided by the app. Then it also returns a new timestamp value for the app to store, to use on the next sync.
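On the server side, that query could be as simple as the following (hypothetical schema; parameter syntax varies by database):

-- :last_sync is the timestamp the app sent with the request.
SELECT id, title, body, updated_at
FROM items
WHERE updated_at > :last_sync
ORDER BY updated_at;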