Avoid duplicate inserts without unique constraint in target table? - postgresql

Source & target tables are similar.
The target table has a UUID field that is computed in tMap; however, the flow should not insert duplicate persons into the target, i.e. unique on (firstname, lastname, dob, gender). I tried marking those columns as key in tMap, as in the screenshot below, but that does not prevent duplicate inserts. How can I avoid duplicate inserts without adding a unique constraint on the target?
I also tried "using field" in target.
Edit: Solution as suggested below:

The CDC components in the paid version of Talend Studio for Data Integration undoubtedly address this.
In Open Studio, you can roll your own change data capture based on the composite unique key (firstname, lastname, dob, gender).
Use tUniqRow on the data coming from stage_geno_patients, unique on the following columns: firstname, lastname, dob, gender
Feed that into a tMap
Add another query as a lookup input to the tMap, to perform look-ups against the table behind "patients_test" and find a match on firstname, lastname, dob, gender. That lookup should be set to "Reload at each row", looking up against values from the staging row
In the case of no match, detect it and then insert the staging row of data into the table behind "patients_test" (the sketch below shows the same no-match insert expressed in plain SQL)
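For reference, a minimal sketch of that logic run directly against PostgreSQL via psycopg2, outside of Talend. The table names stage_geno_patients and patients_test come from the question; the id column, the connection string, and the use of gen_random_uuid() (built in from PostgreSQL 13, otherwise via the pgcrypto extension) are assumptions for illustration.

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")  # illustrative connection string
    with conn, conn.cursor() as cur:
        cur.execute("""
            INSERT INTO patients_test (id, firstname, lastname, dob, gender)
            SELECT gen_random_uuid(),              -- stands in for the tMap-computed UUID
                   s.firstname, s.lastname, s.dob, s.gender
            FROM (                                 -- tUniqRow equivalent: dedupe the staging rows
                SELECT DISTINCT firstname, lastname, dob, gender
                FROM stage_geno_patients
            ) AS s
            WHERE NOT EXISTS (                     -- tMap lookup / "no match" filter
                SELECT 1
                FROM patients_test t
                WHERE t.firstname = s.firstname
                  AND t.lastname  = s.lastname
                  AND t.dob       = s.dob
                  AND t.gender    = s.gender
            )
        """)

Note that without a unique constraint, two concurrent runs could still race past the NOT EXISTS check, so the job should not run twice in parallel.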
Q: Are you going to update information, also? Or, is the goal only to perform unique inserts where the data is not already present?

Related

Copy data from Cosmos Db to table storage fails on custom RowKey

I'm trying to get a very simple data migration to work, where I want 3 fields from Cosmos Db documents to be inserted as entities in Table Storage.
The challenge seems to be that I want an Id from the document to also be the value of the partition key and the row key.
I took the Copy data activity, defined Cosmos Db as source, table storage as sink and defined mappings to get the right data into the right field.
In the sink you can specify what to do with partition key and row key.
When I specify the partition key to be the id from the document, it works.
However, when I do the same for row key (instead of a generated identifier), I get this error "The specified AzureTableRowKeyName 'UserId' doesn't exist in the source data".
The weird thing is that there appears to be no problem regarding the partition key for that value.
Can anyone point me in the right direction?
Thanks to @BhanunagasaiVamsi-MT for pointing me in the right direction.
For completeness sake, I'm dropping my solution here, although the link in that post also explains it.
You need to:
specify additional columns, based on the source data
select these columns as rowkey or partitionkey in the sink
assign the additional fields to rowkey and partitionkey in the mapping (it feels like a duplicate thing to do, but if you don't, you get the error mentioned in the question)
I tried to reproduce the same in my environment and got the same error, as below:
If I specify a unique identifier, it works fine.
Note: Specify the name of the column whose values are used as the row key. If not specified, a GUID is used for each row.
For more information, refer to this Microsoft documentation.

How to know in Talend if tMySQLInput will overwrite data?

I have an already existing Talend Open Studio tMySQLInput component with some SQL code inside it, in order to retrieve some joined columns, linked to a tMySQLOutput component (pointing to an already existing MySQL table) with a few records.
QUESTION:
Will the "tMySQLInput" component overwrite the already existing table data that the tMySQLOutput component relates to? I mean is there an option to check in the tMySQLInput our output in order to say, overwrite each time this job is executed ?
Thank you all.
Yes, there is an option in tMySQLOutput where you can specify what action you want to perform on your table. Follow these steps:
Go to the Component tab of tMySQLOutput; it will open the basic settings of this component.
If you look closer, you will find Action on table. This is the action you can perform on the table pointed to by tMySQLOutput. It has options such as Default, Drop and create table, etc.
Then you have Action on data. These are the operations you can perform on the data, such as Insert, Update, etc.
In your case, I suppose you can choose Action on table as Default and Action on data as Insert. The Default action does nothing to the table, and the Insert option inserts the records at the end of the table. But in the case of Insert, if you have duplicate rows, the job will stop the moment it finds a duplicate row.

CDC multiple insert/delete of the same identity value

I have a table T that contains an ID set as identity and primary key. I enabled CDC on the table and then later added an XML field that I didn't care about capturing, so I did not do anything further (to recreate the capture table and/or migrate old capture data).
I now have a stored procedure that (among other things) updates only the newly created field (no other field) in table T. I notice that instead of recording an update (operation=3 followed by operation=4), CDC records a delete (operation=1) followed by an insert (operation=2), and all fields are the same (of course, since none of them was updated).
I actually noticed this because I had the same identity value inserted and/or deleted more than once, which should not be possible (unless identity_insert is on, which it is not).
Why does CDC record operation=1 instead of 3 and operation=2 instead of 4?
Is this documented anywhere or is it a bug?
The reason you are seeing a delete/insert pair (operations 1/2) as opposed to an update pair (3/4) is that you are updating a "set" of data on a column that ALSO has a unique constraint.
For SQL Server to make sense of this without violating the unique constraint, it deletes the row and reinserts it (with the "update").
More information on this: it's not an issue or a defect; it's the way SQL Server works, and CDC innocently logs it as it sees it. Remember, CDC is just a subscriber and replicates things as they happen.
If you need to see an update, you may have to look for the 1/2 "pair" and not ONLY operation codes 3/4.
Some great articles:
Bounded Update is the term used to describe certain types of UPDATE statements from the publisher that will replicate as DELETE/INSERT pairs on the subscriber. We perform a bounded update for every set-based update that changes a column that is part of a unique index or constraint. In other words, if an UPDATE statement touches more than one row and modifies a column that has any UNIQUE constraints, the UPDATE statement is sent to the subscriber as a DELETE/INSERT pair ... read more here
https://support.microsoft.com/en-us/kb/238254
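If you do need to reconstruct logical updates from those delete/insert rows, a rough sketch of the pairing idea is below (Python, not an official CDC feature). It assumes the rows were fetched from cdc.fn_cdc_get_all_changes_<capture_instance> and pairs a delete with an insert of the same key inside the same transaction LSN, which is a heuristic rather than a guarantee:

    # rows: list of dicts fetched from the CDC change table (e.g. via pyodbc),
    # each carrying __$start_lsn, __$operation and the captured columns.
    def collapse_bounded_updates(rows, key="id"):
        by_txn = {}
        for r in rows:
            by_txn.setdefault(r["__$start_lsn"], []).append(r)

        result = []
        for txn_rows in by_txn.values():
            deletes = {r[key]: r for r in txn_rows if r["__$operation"] == 1}
            inserts = {r[key]: r for r in txn_rows if r["__$operation"] == 2}
            for k, ins in inserts.items():
                if k in deletes:
                    # delete + insert of the same key in one transaction:
                    # treat it as an update (the "bounded update" case)
                    result.append({"kind": "update", "before": deletes.pop(k), "after": ins})
                else:
                    result.append({"kind": "insert", "row": ins})
            for d in deletes.values():
                result.append({"kind": "delete", "row": d})
            for r in txn_rows:
                # genuine updates still arrive as 3 (before image) / 4 (after image)
                if r["__$operation"] == 4:
                    result.append({"kind": "update", "after": r})
        return result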

Insert record in table if it does not exist in iPhone app

I am obtaining a JSON array from a URL and inserting data into a table. Since the contents of the URL are subject to change, I want to make a second connection to the URL, check for updates, and insert new records into my table using sqlite3.
The issues that I face are:
1) My table doesn't have a primary key.
2) The URL lists the changes for the same day. Hence, if I run my app multiple times, when I insert values into my database, I get duplicate entries. I want to check for that day's duplicated entries so they can be removed. The problem could be solved by adding a constraint, but since the URL itself has duplicated values, I find it difficult.
The only way I can see to do it, if you have no primary key or anything else unique to each record, is to go through the new entries as they come in and, for each one, check whether the exact same data already exists in the database. If it doesn't, you add it; if it does, you skip over it.
You could even create a unique key yourself for each entry, as a concatenation of every column of the table. That way you can quickly check whether the entry already exists in the database (see the sketch below).
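A minimal sketch of that idea, using Python's built-in sqlite3 module as a stand-in for the iOS sqlite3 C API; the items table, its columns, and the hashed "row key" are made up for illustration:

    import hashlib
    import json
    import sqlite3

    conn = sqlite3.connect("app.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS items
                    (name TEXT, price REAL, day TEXT, row_hash TEXT)""")

    def row_hash(record):
        # concatenate all columns in a stable order and hash them into a synthetic key
        return hashlib.sha1(json.dumps(record, sort_keys=True).encode()).hexdigest()

    def insert_if_new(record):
        h = row_hash(record)
        exists = conn.execute("SELECT 1 FROM items WHERE row_hash = ?", (h,)).fetchone()
        if exists is None:
            conn.execute("INSERT INTO items (name, price, day, row_hash) VALUES (?, ?, ?, ?)",
                         (record["name"], record["price"], record["day"], h))
            conn.commit()

    insert_if_new({"name": "apple", "price": 1.0, "day": "2012-05-01"})
    insert_if_new({"name": "apple", "price": 1.0, "day": "2012-05-01"})  # duplicate, skipped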
I see two possibilities depending on your setup:
You have a column set up as UNIQUE (this can be through a PRIMARY KEY or not). In this case, you can use the ON CONFLICT clause (a sketch follows after these two options):
http://www.sqlite.org/lang_conflict.html
If you find this construct a little confusing, you can instead use "INSERT OR REPLACE" or "INSERT OR IGNORE" as described here:
http://www.sqlite.org/lang_insert.html
You do not have a column set up as UNIQUE. In this case, you will need to SELECT first to check for duplicate data and, based on the result, INSERT, UPDATE, or do nothing.
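A minimal sketch of the first possibility, again with Python's sqlite3 as a stand-in and a made-up table: a UNIQUE index lets the database reject duplicates, and INSERT OR IGNORE skips them silently. The second possibility (no UNIQUE column) looks like the hash-check sketch shown earlier.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE changes (name TEXT, price REAL, day TEXT)")
    # the UNIQUE index is what lets OR IGNORE detect duplicates
    conn.execute("CREATE UNIQUE INDEX changes_natural_key ON changes (name, price, day)")

    rows = [("apple", 1.0, "2012-05-01"),
            ("apple", 1.0, "2012-05-01")]   # duplicate on purpose
    conn.executemany("INSERT OR IGNORE INTO changes (name, price, day) VALUES (?, ?, ?)", rows)
    conn.commit()
    print(conn.execute("SELECT COUNT(*) FROM changes").fetchone()[0])  # prints 1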
A more common & robust way to handle this is to associate a timestamp with each data item on the server. When your app interrogates the server it provides the timestamp corresponding to the last time it synced. The server then queries its database and returns all values that are timestamped later than the timestamp provided by the app. Then it also returns a new timestamp value for the app to store, to use on the next sync.
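A minimal sketch of that timestamp handshake, with a hypothetical items table and an in-memory database standing in for the server side:

    import sqlite3

    server_db = sqlite3.connect(":memory:")
    server_db.execute("CREATE TABLE items (name TEXT, updated_at TEXT)")
    server_db.executemany("INSERT INTO items VALUES (?, ?)",
                          [("apple", "2012-05-01T10:00:00"),
                           ("pear", "2012-05-02T09:30:00")])

    def sync(last_sync):
        # return only rows changed since last_sync, plus the marker to store for next time
        rows = server_db.execute(
            "SELECT name, updated_at FROM items WHERE updated_at > ?",
            (last_sync,)).fetchall()
        new_marker = max((r[1] for r in rows), default=last_sync)
        return rows, new_marker

    rows, marker = sync("2012-05-01T12:00:00")   # the app sends its stored timestamp
    # rows == [("pear", "2012-05-02T09:30:00")]; the app stores marker for the next sync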

SqlDataAdapter Update

Can anyone help me understand why this error occurs when I update using SqlDataAdapter with a join query?
Dynamic SQL generation is not supported against multiple base tables.
You have a "join" in your main query for your dataset (The first one in the TableAdapter with a check by it). You can't automatically generate insert/update/delete logic for a TableAdapter when the main query has multiple tables referenced in the query via a join. The designer isn't smart enough to figure out which table you want to send updates to in that case, that is why you get the error message.
Solution. Ensure that your main query only references the table you want the designer to write insert/update/delete code for. Your secondary queries may reference as many tables as you want.
In my case, I was trying to set a value for the identity column in my DataRow. I simply deleted the code that set the value for the identity column, and it worked.
My Scenario:
Database:
uin [primary, identity]
name
address
Whenever I tried to set datarow("uin"), the error occurred. But it works fine with datarow("name") and datarow("address").
Hope it works for you too.