DataStage SCD stage is updating old records instead of inserting rows for unmatched business keys - datastage

We have a dimension build with 1.8 million records, and the SCD stage is updating old records instead of inserting a new record when the business key is not found.
We need immediate help, as this is a production issue.
The destination table has an identity column. We are not using the surrogate key for inserts, but we are using it for updates, and this is causing a lot of trouble.

This could be because the correct keys are not specified in the SCD stage.

Related

Update column in a dataset only if matching record exists in another dataset in Tableau Prep Builder

Is there any way to do this? I am basically trying to do the equivalent of a SQL UPDATE ... SET when a matching record for one or more key fields exists in another dataset.
I tried using Joins and Merge. Joins seem to need more steps, and Merge appends records instead of updating the corresponding rows.

Performing a delete in an ADF mapping data flow - Updated with potential solution

I am trying to do upsert and delete in a mapping data flow.
There is a dimension table, DimCustomer.
It is being populated with data from a file.
If the Sha2 hash does not match, then upsert.
If a CustomerID is missing from the rawSource data, then delete.
The upsert works, but the delete does not. It's likely because in the sink I have selected the CustomerID column as the key, which means it can never delete a record if the entire record, including the key, is missing from the source.
Is there a prescribed design pattern for this scenario?
The easiest solution I can think of is a second data flow, in which the only CustomerIDs sent to the sink are the ones with no matching CustomerID in the source (effectively a right outer join), but I want to see if this is indeed the best way to do this.
Update:
The best solution I can come up with is to add an additional column to the above data flow, the formula for which is coalesce(RawCustomerData#CustomerID,DimCustomer#CustomerID)
This ensures there is a CustID column that always has a value.
In the sink, I change the mapping so that this CustID maps to the sink CustomerID.
The delete now works as expected. I am still unsure if this is the best solution, but it works and doesn't appear to cause a major performance issue.
In my experience, that is the best solution: adding a new column solves the problem much more easily than the other options. The simplest approach that works is the best one. You don't need to create another data flow activity or redesign the Alter Row logic.
Your Solution:
Add an additional column CustID, the formula for which is: coalesce(RawCustomerData#CustomerID,DimCustomer#CustomerID)
This ensures there is a CustID column that always has a value. In the sink, you change the mapping so that this custID maps to the sink CustomerID.
The delete now works as expected.
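As a rough illustration of why the coalesced key helps, here is a minimal Python sketch of the same idea outside ADF. It assumes the source and the dimension are joined with a full outer join on CustomerID and that each row carries a precomputed hash. The function names (full_outer_join, coalesce, plan_actions) and the "Hash" field are hypothetical, not part of ADF.

```python
def full_outer_join(source_rows, dim_rows):
    """Yield (source_row, dim_row) pairs; either side is None when unmatched."""
    src = {r["CustomerID"]: r for r in source_rows}
    dim = {r["CustomerID"]: r for r in dim_rows}
    for key in sorted(src.keys() | dim.keys()):
        yield src.get(key), dim.get(key)


def coalesce(*values):
    """Return the first value that is not None, like SQL COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None


def plan_actions(source_rows, dim_rows):
    """Return a list of (CustID, action) pairs for the sink."""
    plan = []
    for s, d in full_outer_join(source_rows, dim_rows):
        src_id = s["CustomerID"] if s is not None else None
        dim_id = d["CustomerID"] if d is not None else None
        # The coalesced key is never None, so the sink can always find the row.
        cust_id = coalesce(src_id, dim_id)
        if s is None:
            action = "delete"   # key exists only in the dimension
        elif d is None or s["Hash"] != d["Hash"]:
            action = "upsert"   # new row, or the hash changed
        else:
            action = "skip"     # unchanged
        plan.append((cust_id, action))
    return plan


source = [{"CustomerID": 1, "Hash": "a"}, {"CustomerID": 2, "Hash": "x"}]
dim = [
    {"CustomerID": 1, "Hash": "a"},
    {"CustomerID": 2, "Hash": "b"},
    {"CustomerID": 3, "Hash": "c"},
]
print(plan_actions(source, dim))
```

Without the coalesce, the delete row for customer 3 would reach the sink with an empty key taken from the source side, which matches the behaviour described in the question.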

Spark Delta Table Updates

I am working in a Microsoft Azure Databricks environment using sparksql and pyspark.
I have a Delta table on a lake where data is partitioned by, say, file_date. Every partition contains files storing millions of records per day with no primary/unique key. All these records have a "status" column which is either NULL (if everything looks good on that record) or not NULL (say, if a lookup mapping for a particular column is not found). Additionally, my process has another folder called "mapping", which is refreshed periodically (let's say nightly, to keep it simple) and is where the mappings are found.
On a daily basis, there is a good chance that about 100-200 rows error out (status column not NULL). From these files, a downstream job pulls all the valid records each day (hence the partitioning by file_date) and sends them for further processing, ignoring those 100-200 errored records, which wait for the correct mapping file to be received. In addition to the valid records, the downstream job should also check whether a mapping has been found for the errored records and, if so, pass them on as well (after, of course, updating the data lake with the appropriate mapping and status).
What is the best way to go? Ideally, I would first update the Delta table directly with the correct mapping and set the status column to "available_for_reprocessing". My downstream job would then pull the valid data for the day plus the "available_for_reprocessing" data and, after processing, set the status to "processed". But this seems to be super difficult using Delta.
I was looking at "https://docs.databricks.com/delta/delta-update.html", and the update example there only shows a simple update with constants, not an update driven by other tables.
The other option, and the most inefficient one, is to pull ALL the data (both processed and errored) for the last, say, 30 days, get the mapping for the errored records, and write the dataframe back into the Delta lake using the replaceWhere option. This is super inefficient, as we are reading everything (hundreds of millions of records) and writing everything back just to process, say, 1000 records at most. If you search for deltaTable = DeltaTable.forPath(spark, "/data/events/") at "https://docs.databricks.com/delta/delta-update.html", the example provided is for very simple updates. Without a unique key, it is impossible to update specific records as well. Can someone please help?
I use pyspark or can use sparksql, but I am lost.
If you want to update one column ('status') on the condition that all lookups are now correct for rows where they weren't correct before (where 'status' is currently incorrect), I think the UPDATE command along with EXISTS can help you solve this. It isn't mentioned in the update documentation, but it works for both delete and update operations, effectively allowing you to update/delete records based on joins.
For your scenario, I believe the SQL command would look something like this:
UPDATE your_db.table_name AS a
SET status = 'correct'
WHERE EXISTS
(
  SELECT *
  FROM your_db.table_name AS b
  JOIN lookup_table_1 AS t1 ON t1.lookup_column_a = b.lookup_column_a
  JOIN lookup_table_2 AS t2 ON t2.lookup_column_b = b.lookup_column_b
  -- ... add further lookups if needed
  WHERE
    b.status = 'incorrect' AND
    -- correlate the subquery with the row being updated
    a.lookup_column_a = b.lookup_column_a AND
    a.lookup_column_b = b.lookup_column_b
)
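To try the correlated EXISTS pattern without a cluster, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in. It is not Delta Lake, and Delta's behaviour may differ. The table names (records, mapping), the column lookup_key and the status value 'mapping_missing' are made up for the example, and the query is a simplified single-lookup variant of the one above.

```python
import sqlite3


def build_demo():
    """Create an in-memory database with a few records and one lookup table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE records (file_date TEXT, lookup_key TEXT, status TEXT);
        CREATE TABLE mapping (lookup_key TEXT, mapped_value TEXT);
        INSERT INTO records VALUES
            ('2020-01-01', 'A', NULL),
            ('2020-01-01', 'B', 'mapping_missing'),
            ('2020-01-01', 'C', 'mapping_missing');
        INSERT INTO mapping VALUES ('A', 'alpha'), ('B', 'beta');
        """
    )
    return conn


def reprocess_errored(conn):
    """Flag errored rows whose lookup key now has a mapping; return rows changed."""
    cur = conn.execute(
        """
        UPDATE records
        SET status = 'available_for_reprocessing'
        WHERE status = 'mapping_missing'
          AND EXISTS (
              SELECT 1
              FROM mapping AS m
              WHERE m.lookup_key = records.lookup_key
          )
        """
    )
    conn.commit()
    return cur.rowcount


conn = build_demo()
print(reprocess_errored(conn))
print(conn.execute(
    "SELECT lookup_key, status FROM records ORDER BY lookup_key"
).fetchall())
```

Only row B is flagged here: A was never in error, and C still has no mapping. No unique key is needed, because the subquery is matched on the lookup column itself.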
Merge did the trick...
MERGE INTO deptdelta AS maindept
USING updated_dept_location AS upddept
ON upddept.dno = maindept.dno
WHEN MATCHED THEN UPDATE SET maindept.dname = upddept.updated_name, maindept.location = upddept.updated_location

loop and change all record numbers in Firebird database

I have a database table with unique record numbers created by a generator, but because of an error in the code (setting generators) the record numbers suddenly became large, as many numbers were skipped. I would like to rewrite all record numbers, starting with 1 and finishing with the total number of records. Doing this from the application would take a lot of time.
As I see from the Firebird documentation, it should be a simple task using a loop, but I have no experience with Firebird programming and only use simple SQL statements. Can somebody help?
Actually, there is no need to program a loop; a simple update statement should do. First, reset the generator:
SET GENERATOR my_GEN TO 0;
and then update all the records, assigning each a new ID:
update tab set recno = gen_id(my_GEN, 1) order by recno asc;
It assumes that all references to the recno field are via foreign key with ON UPDATE CASCADE, otherwise you either mess up your data or the update fails.
During this operation there should be no other users in the database!
That being said, you really shouldn't care about gaps in your record numbers.
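To make the effect of the two statements concrete, here is a small Python model of it. This is only an illustration, not Firebird code: the Generator class mimics SET GENERATOR and gen_id, and remap_children is a hypothetical helper that shows what ON UPDATE CASCADE would have to do for rows that reference recno.

```python
class Generator:
    """Tiny stand-in for a Firebird generator (sequence)."""

    def __init__(self):
        self.value = 0

    def set_to(self, value):
        # Models: SET GENERATOR my_GEN TO <value>;
        self.value = value

    def gen_id(self, step):
        # Models: gen_id(my_GEN, <step>), which increments and returns the new value
        self.value += step
        return self.value


def renumber(recnos):
    """Map each old record number to its new one, in ascending order of recno."""
    gen = Generator()
    gen.set_to(0)
    return {old: gen.gen_id(1) for old in sorted(recnos)}


def remap_children(child_refs, mapping):
    """Rewrite references to recno, which ON UPDATE CASCADE does in the database."""
    return [mapping[ref] for ref in child_refs]


mapping = renumber([5, 120, 17, 9000])
print(mapping)
print(remap_children([120, 120, 5], mapping))
```

If the references in other tables are not rewritten along with the mapping, they end up pointing at the wrong rows or at rows that no longer exist, which is the risk the answer warns about.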

How to insert autoincremented master/slave records using ScalaQuery?

Classic issue, new framework, hence the problem.
PostgreSQL + Scala + ScalaQuery. I have a Master table with a serial (autoincrement) id and a Slave table, also with a serial id.
I need to insert one master record and several slaves. I have to do it within a transaction (to be able to cancel everything), so I cannot run a query after inserting the master to find out its id. As far as I can see, the SQ "insert" method does not return any reference to the inserted master record.
So how do I do it?
The SQ examples cover this, but without an autoincremented field, so that solution (pre-set ids) is not applicable here.
If I understand it correctly, this is not possible in an automatic way for now. If you are not afraid, it can be done this way: obtain the id of the last insert (for each master record insertion):
postgreSQL function for last inserted ID
Then using it in SQ:
http://groups.google.com/group/scalaquery/browse_thread/thread/faa7d3e5842da82e
This code shows the MySql way. I'm posting it to the list for
posterity's sake.
val scopeIdentity = SimpleFunction.nullaryLong
val inserted = Actions.insert(
"cat", "eats", "dog")
//Print out the count of inserted records.
println(inserted)
//Print out the primary key for the last inserted record.
println(Query(scopeIdentity).first)
//Regards //Bryan
But since for autoincremented fields you have to use projections that exclude the autoinc fields, and then insert tuples instead of named record types, the question is whether it is better to wait until SQ supports this directly.
Note that I am an SQ newbie, so I might be misinforming you.
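For what it is worth, here is the general "read back the generated id inside the same transaction" pattern as a minimal Python sketch. It uses the built-in sqlite3 module as a stand-in, so it is neither ScalaQuery nor PostgreSQL: cursor.lastrowid plays the role that the last-inserted-id function (or INSERT ... RETURNING id in PostgreSQL) would play. The table layout and the function names make_schema and insert_master_with_slaves are assumptions for the example.

```python
import sqlite3


def make_schema():
    """Create in-memory master/slave tables with autoincremented ids."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE master (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            name TEXT NOT NULL
        );
        CREATE TABLE slave (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            master_id INTEGER NOT NULL REFERENCES master(id),
            name TEXT NOT NULL
        );
        """
    )
    return conn


def insert_master_with_slaves(conn, master_name, slave_names):
    """Insert one master and its slaves atomically; return the new master id."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT INTO master (name) VALUES (?)", (master_name,)
        )
        master_id = cur.lastrowid  # id generated for the row just inserted
        conn.executemany(
            "INSERT INTO slave (master_id, name) VALUES (?, ?)",
            [(master_id, name) for name in slave_names],
        )
    return master_id


conn = make_schema()
print(insert_master_with_slaves(conn, "m1", ["a", "b"]))
```

The id is read on the same connection and inside the same transaction, so a failure in any slave insert also rolls back the master row.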