Performing a delete in an ADF mapping data flow - Updated with potential solution - azure-data-factory

I am trying to do upsert and delete in a mapping data flow.
There is a dimension table, DimCustomer.
It is being populated with data from a file.
If a Sha2 hash does not match then upsert.
if CustomerID is missing from the rawSource data, then delete (see image below for settings)
The upsert works, but the delete does not. Its likely because in the sink i have selected the customerID column as the key, but this means it can never delete a record if the entire record, including key is missing from source.
Is there a prescribed design pattern for this scenario?
The easiest solution i can think of is a 2nd dataflow, in which the only customerID's sent to the sink are ones where there is no matching customerID in the source (effectively a right outer join), but want to see if this is indeed the best way to do this.
Update:
The best solution i can come up with for this is, to the above dataflow, add an additional column, the formula for which is coalesce(RawCustomerData#CustomerID,DimCustomer#CustomerID)
This ensures there is a CustID column that always has a value.
In the sink, i change the mapping so that this custID maps to the sink CustomerID.
The delete now works as expected. Still unsure if this is the best solution but it works and doesn't appear to cause a major performance issue.

Per my experience, I think that's the best solution, add a new column can solve the problem is much easier than other operations. The way which simplest and effective is the best solution. You don't need create another data flow actives to achieve it or re-design the Alter active logic.
Your Solution:
Add an additional column CustID, the formula for which is: coalesce(RawCustomerData#CustomerID,DimCustomer#CustomerID)
This ensures there is a CustID column that always has a value. In the sink, you change the mapping so that this custID maps to the sink CustomerID.
The delete now works as expected.

Related

kdb: getting one row from HDB

For a normal table, we can select one row using select[1] from t. How can I do this for HDB?
I tried select[1] from t where date=2021.02.25 but it gives error
Not yet implemented: it probably makes sense, but it’s not defined nor implemented, and needs more thinking about as the language evolves
select[n] syntax works only if table is already loaded in memory.
The easiest way to get 1st row of HDB table is:
1#select from t where date=2021.02.25
select[n] will work if applied on already loaded data, e.g.
select[1] from select from t where date=2021.02.25
I've done this before for ad-hoc queries by using the virtual index i, which should avoid the cost of pulling all data into memory just to select a couple of rows. If your query needs to map constraints in first before pulling a subset, this is a reasonable solution.
It will however pull N rows for each date partition selected due to the way that q queries work under the covers. So YMMV and this might not be the best solution if it was behind an API for example.
/ 5 rows (i[5] is the 6th row)
select from t where date=2021.02.25, sum=`abcd, price=1234.5, i<i[5]
If your table is date partitioned, you can simply run
select col1,col2 from t where date=2021.02.25,i=0
That will get the first record from 2021.02.25's partition, and avoid loading every record into memory.
Per your first request (which is different to above) select[1] from t, you can achieve that with
.Q.ind[t;enlist 0]

Spark Delta Table Updates

I am working in Microsoft Azure Databricks environment using sparksql and pyspark.
So I have a delta table on a lake where data is partitioned by say, file_date. Every partition contains files storing millions of records per day with no primary/unique key. All these records have a "status" column which can either contain values NULL (if everything looks good on that specific record) or Not null (say if a particular lookup mapping for a particular column is not found). Additionally, my process contains another folder called "mapping" which gets refreshed on a periodic basis, lets say nightly to make it simple, from where mappings are found.
On a daily basis, there is a good chance that about 100~200 rows get errored out (status column containing not null values). From these files, on a daily basis, (hence is the partition by file_date) , a downstream job pulls all the valid records and sends it for further processing ignoring those 100-200 errored records, waiting for the correct mapping file to be received. The downstream job, in addition to the valid status records, should also try and see if a mapping is found for the errored records and if present, take it down further as well (after of course, updating the data lake with the appropriate mapping and status).
What is the best way to go? The best way is to directly first update the delta table/lake with the correct mapping and update the status column to say "available_for_reprocessing" and my downstream job, pull the valid data for the day + pull the "available_for_reprocessing" data and after processing, update back with the status as "processed". But this seems to be super difficult using delta.
I was looking at "https://docs.databricks.com/delta/delta-update.html" and the update example there is just giving an example for a simple update with constants to update, not for updates from multiple tables.
The other but the most inefficient is, say pull ALL the data (both processed and errored) for the last say 30 days , get the mapping for the errored records and write the dataframe back into the delta lake using the replaceWhere option. This is super inefficient as we are reading everything (hunderds of millions of records) and writing everything back just to process say a 1000 records at the most. If you search for deltaTable = DeltaTable.forPath(spark, "/data/events/") at "https://docs.databricks.com/delta/delta-update.html", the example provided is for very simple updates. Without a unique key, it is impossible to update specific records as well. Can someone please help?
I use pyspark or can use sparksql but I am lost
If you want to update 1 column ('status') on the condition that all lookups are now correct for rows where they weren't correct before (where 'status' is currently incorrect), I think UPDATE command along with EXISTS can help you solve this. It isn't mentioned in the update documentation, but it works both for delete and update operations, effectively allowing you to update/delete records on joins.
For your scenario I believe the sql command would look something like this:
UPDATE your_db.table_name AS a
SET staus = 'correct'
WHERE EXISTS
(
SELECT *
FROM your_db.table_name AS b
JOIN lookup_table_1 AS t1 ON t1.lookup_column_a = b.lookup_column_a
JOIN lookup_table_2 AS t2 ON t2.lookup_column_b = b.lookup_column_b
-- ... add further lookups if needed
WHERE
b.staus = 'incorrect' AND
a.lookup_column_a = b.lookup_column_a AND
a.lookup_column_b = b.lookup_column_b
)
Merge did the trick...
MERGE INTO deptdelta AS maindept
USING updated_dept_location AS upddept
ON upddept.dno = maindept.dno
WHEN MATCHED THEN UPDATE SET maindept.dname = upddept.updated_name, maindept.location = upddept.updated_location

Updating the table through tOracleOutput in Talend using an additional SQL query

I have a job where I am getting a flow into tOracleOutput where I am updating the table. Now, I have to update that table using an SQL statement, which I guess we have option in Advanced settings of tOracleOuptut, but I don't know how to use it or you can say that I am not getting the settings properly. I referred to official documentation but could not understand. Can any one explain the fields like Name, SQL expression, Position, Reference Column in a better way?
the SQL query which I am using is:
update set COL1=SOMETHING1
where COL2=SOMETHING2
Now, value for COL1 is coming from the flow but COL2 is some column in the table which is not coming from the flow.
Have a look to tOracleRow for such a case.
Hope this helps.
TRF
Using tOracleOutput is helpful when a ready data source (table or file (...) with same columns as destination) the more elaborate your query is, the more you should do as TRF said (and use tOracleRow), but here's an example to your question:
file contain 3 column,
DB table of destination contains 4 column, where the 4th is the date of update, (the first 3 are identical to the input)
so you add the destination's column's name in Name and put the SQL function for the date (eg: SYSDATE) and where to put it (Position) in reference to a column of your choice (Reference Column)
In my view it helps avoid using tMap for a miserable additional column when you want to Insert, but you want to Update, in which case the component doesn't offer the additional column section, plus I don't think you can add the WHERE clause here
Hope it helps

Talend tMap Set Default Value for Rejected Inner Joins and connect them with the main data flow

i've got the following problem.
I have several tMaps, each has a lookup and at the end all the data is written in a db. The following mockup shall illustrate it:
There can be values in the main data stream which are not found in the lookup tables. For this values there is a rejected path which catches them from the specific tMap.
Requirements:
In case of a rejected inner join the looked up value shall be set to a default value (for example 0, which could be done in the schema of the tMap) and after that these "corrected" records should be added to the "normal" main data flow and process the next lookup.
The tUnite component is not able to handle this cases because it can not exist in a data flow loop.
Does anybody got an idea how to solve this problem?
Cheers.
The answer was so easy that i didn't got it in the first conception. I just have to change the join model from inner to left-join so all the formal rejected values will have a null value in it. Afterwards i can check the columns in the tmap and set them on a default value if they are null.
row1.id == null ? 0 : row1.id
Cheers.
If I understand correctly what you are trying to accomplish you will have to have staging files or staging tables on the database. Once you get the rejected rows, write them on a file or table. The accepted files will go also to a staging table(different than the rejected). Then you can union both tables or files by reading them. The key point is having a staging structure. I attach a picture what how would it be. In the picture the staging structure is a mysql table.
Let me know if it helps!

How to tell if record has changed in Postgres

I have a bit of an "upsert" type of question... but, I want to throw it out there because it's a little bit different than any that I've read on stackoverflow.
Basic problem.
I'm working on moving from mysql to PostgreSQL 9.1.5 (hosted on Heroku). As a part of that, I need to import multiple CSV files everyday. Some of the data is sales information and is almost guaranteed to be new and need to be inserted. But, other parts of the data is almost guaranteed to be the same. For example, the csv files (note plural) will have POS (point of sale) information in them. This rarely changes (and is most likely only via additions). Then there is product information. There are about 10,000 products (vast majority will be unchanged, but it's possible to have both additions and updates).
The final item (but is important), is that I have a requirement to be able to provide an audit trail/information for any given item. For example, if I add a new POS record, I need to be able to trace that back to the file it was found in. If I change a UPC code or description of a product, then I need to be able to trace it back to the import (and file) where the change came from.
Solution that I'm contemplating.
Since the data is provided to me via CSV, then I'm working around the idea that COPY will be the best/fastest way. The structure of the data in the files is not exactly what I have in the database (i.e. final destination). So, I'm copying them into tables in the staging schema that match the CSV (note: one schema per datasource). The tables in the staging schemas will have a before insert row triggers. These triggers can decide what to do with the data (insert, update or ignore).
For the tables that are most likely to contain new data, then it will try to insert first. If the record is already there, then it will return NULL (and stop the insert into the staging table). For tables that rarely change, then it will query the table and see if the record is found. If it is, then I need a way to see if any of the fields are changed. (because remember, I need to show that the record was modified by import x from file y) I obviously can just boiler plate out the code and test each column. But, was looking for something a little more "eloquent" and more maintainable than that.
In a way, what I'm kind of doing is combining a importing system with an audit trail system. So, in researching audit trails, I reviewed the following wiki.postgresql.org article. It seems like the hstore might be a nice way of getting changes (and being able to easily ignore some columns in the table that aren't important - e.g. "last_modified")
I'm about 90% sure it will all work... I've created some testing tables etc and played around with it.
My question?
Is a better, more maintainable way of accomplishing this task of finding the maybe 3 records out of 10K that require a change to the database. I could certainly write a python script (or something else) that reads the file and tries to figure out what to do with each record, but that feels horribly inefficient and will lead to lots of round trips.
A few final things:
I don't have control over the input files. I would love it if they only sent me the deltas, but they don't and it's completely outside of my control or influence.
he system is grow and new data sources are likely to be added that will greatly increase the amount of data being processed (so, I'm trying to keep things efficient)
I know this is not nice, simple SO question (like "how to sort a list in python") but I believe one of the great things about SO is that you can ask hard questions and people will share their thoughts about how they think the best way to solve it is.
I have lots of similar operations. What I do is COPY to temporary staging tables:
CREATE TEMP TABLE target_tmp AS
SELECT * FROM target_tbl LIMIT 0; -- only copy structure, no data
COPY target_tmp FROM '/path/to/target.csv';
For performance, run ANALYZE - temp. tables are not analyzed by autovacuum!
ANALYZE target_tmp;
Also for performance, maybe even create an index or two on the temp table, or add a primary key if the data allows for that.
ALTER TABLE ADD CONSTRAINT target_tmp_pkey PRIMARY KEY(target_id);
You don't need the performance stuff for small imports.
Then use the full scope of SQL commands to digest the new data.
For instance, if the primary key of the target table is target_id ..
Maybe DELETE what isn't there any more?
DELETE FROM target_tbl t
WHERE NOT EXISTS (
SELECT 1 FROM target_tmp t1
WHERE t1.target_id = t.target_id
);
Then UPDATE what's already there:
UPDATE target_tbl t
SET col1 = t1.col1
FROM target_tmp t1
WHERE t.target_id = t1.target_id
To avoid empty UPDATEs, simply add:
...
AND col1 IS DISTINCT FROM t1.col1; -- repeat for relevant columns
Or, if the whole row is relevant:
...
AND t IS DISTINCT FROM t1; -- check the whole row
Then INSERT what's new:
INSERT INTO target_tbl(target_id, col1)
SELECT t1.target_id, t1.col1
FROM target_tmp t1
LEFT JOIN target_tbl t USING (target_id)
WHERE t.target_id IS NULL;
Clean up if your session goes on (temp tables are dropped at end of session automatically):
DROP TABLE target_tmp;
Or use ON COMMIT DROP or similar with CREATE TEMP TABLE.
Code untested, but should work in any modern version of PostgreSQL except for typos.