Use Azure Data Factory to Conditionally Split data to different tables - azure-data-factory

I want to use Azure Data Factory to split data like the sample below into different tables, based on the Name column. Ideally this would be dynamic, so that if a new Name value is added, its rows are automatically split out to a separate table. I know how to specify a Conditional Split manually; I'm just wondering if there's any way to write an expression (or similar) that would dynamically split these into separate tables, i.e. tbl_apple would have the first three rows, tbl_banana the next two, etc.?
Thanks!
Name     Number  label
apple    1       a
apple    2       a
apple    3       a
banana   001     b
banana   002     b
carrot   0       dfb
carrot   1       dfb
carrot   2       dfb
carrot   3       dfb
plum     010     p
avocado  021     v
avocado  022     v

You can use a data flow script to define a conditional split, but a dynamic split condition isn't possible.
The syntax for a conditional split script is:
<incomingStream>
    split(
        <conditionalExpression1>
        <conditionalExpression2>
        ...
        disjoint: {true | false}
    ) ~> <splitTx>@(stream1, stream2, ..., <defaultStream>)
If you want the split to be dynamic, you need to manage it programmatically. You can use any data manipulation language, such as SQL or Python, to read all the unique values from the table and split the data based on them.
Use a Custom activity to run such scripts:
To move data to/from a data store that the service does not support,
or to transform/process data in a way that isn't supported by the
service, you can create a Custom activity with your own data movement
or transformation logic and use the activity in a pipeline.
For example, with the Python pandas module you can find the unique values in a DataFrame column using the syntax below:
<DataFrame_Name>.<Column_Name>.unique()
This returns all the unique values in the given column as an array.
You can then loop over those values and store the records for each unique value in a separate table.
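The loop described above can be sketched as follows. This is only an illustration of the idea, not an ADF feature: it uses Python's built-in sqlite3 module as a stand-in for the real target database, and the source table name `fruit` and the helper `split_by_name` are hypothetical names chosen for the example. In practice a script like this would run inside the Custom activity against your own data store.

```python
import re
import sqlite3


def split_by_name(conn, source_table="fruit", column="Name"):
    """Copy the rows for each distinct value of `column` into tbl_<value>.

    `source_table` and `column` are assumed to be trusted identifiers.
    """
    cur = conn.cursor()
    names = [row[0] for row in cur.execute(
        f'SELECT DISTINCT "{column}" FROM "{source_table}" ORDER BY 1')]
    created = []
    for name in names:
        # Table names cannot be bound as parameters, so sanitize the value.
        target = "tbl_" + re.sub(r"\W+", "_", str(name))
        cur.execute(f'DROP TABLE IF EXISTS "{target}"')
        # Create an empty table with the same columns, then fill it.
        cur.execute(
            f'CREATE TABLE "{target}" AS SELECT * FROM "{source_table}" WHERE 0')
        cur.execute(
            f'INSERT INTO "{target}" SELECT * FROM "{source_table}" '
            f'WHERE "{column}" = ?', (name,))
        created.append(target)
    conn.commit()
    return created


# Sample data from the question, loaded into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fruit (Name TEXT, Number TEXT, label TEXT)")
conn.executemany("INSERT INTO fruit VALUES (?, ?, ?)", [
    ("apple", "1", "a"), ("apple", "2", "a"), ("apple", "3", "a"),
    ("banana", "001", "b"), ("banana", "002", "b"),
    ("carrot", "0", "dfb"), ("carrot", "1", "dfb"),
    ("carrot", "2", "dfb"), ("carrot", "3", "dfb"),
    ("plum", "010", "p"),
    ("avocado", "021", "v"), ("avocado", "022", "v"),
])

tables = split_by_name(conn)
apple_rows = conn.execute("SELECT COUNT(*) FROM tbl_apple").fetchone()[0]
print(tables, apple_rows)
```

Because the list of distinct values is read at run time, a new Name value gets its own table on the next run without any change to the script.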

Related

Aggregate on Redshift SUPER type

Context
I'm trying to find the best way to represent and aggregate a high-cardinality column in Redshift. The source is event-based and looks something like this:
user  timestamp            event_type
1     2021-01-01 12:00:00  foo
1     2021-01-01 15:00:00  bar
2     2021-01-01 16:00:00  foo
2     2021-01-01 19:00:00  foo
Where:
the number of users is very large
a single user can have very large numbers of events, but is unlikely to have many different event types
the number of different event_type values is very large, and constantly growing
I want to aggregate this data into a much smaller dataset with a single record (document) per user. These documents will then be exported. The aggregations of interest are things like:
Number of events
Most recent event time
But also:
Number of events for each event_type
It is this latter case that I am finding difficult.
Solutions I've considered
The simple "columnar-DB-friendly" approach to this problem would be to have an aggregate column for each event type:
user  nb_events  ...  nb_foo  nb_bar
1     2          ...  1       1
2     2          ...  2       0
But I don't think this is an appropriate solution here, since the event_type field is dynamic and may have hundreds or thousands of values (and Redshift has an upper limit of 1600 columns). Moreover, there may be multiple types of aggregations on this event_type field (not just count).
A second approach would be to keep the data in its vertical form, where there is not one row per user but rather one row per (user, event_type). However, this really just postpones the issue - at some point the data still needs to be aggregated into a single record per user to achieve the target document structure, and the problem of column explosion still exists.
A much more natural (I think) representation of this data is as a sparse array/document/SUPER:
user  nb_events  ...  count_by_event_type (SUPER)
1     2          ...  {"foo": 1, "bar": 1}
2     2          ...  {"foo": 2}
This also pretty much exactly matches the intended SUPER use case described by the AWS docs:
When you need to store a relatively small set of key-value pairs, you might save space by storing the data in JSON format. Because JSON strings can be stored in a single column, using JSON might be more efficient than storing your data in tabular format. For example, suppose you have a sparse table, where you need to have many columns to fully represent all possible attributes, but most of the column values are NULL for any given row or any given column. By using JSON for storage, you might be able to store the data for a row in key:value pairs in a single JSON string and eliminate the sparsely-populated table columns.
So this is the approach I've been trying to implement. But I haven't quite been able to achieve what I'm hoping to, mostly due to difficulties populating and aggregating the SUPER column. These are described below:
Questions
Q1:
How can I insert into this kind of SUPER column from another SELECT query? All Redshift docs only really discuss SUPER columns in the context of initial data load (e.g. by using json_parse), but never discuss the case where this data is generated from another Redshift query. I understand that this is because the preferred approach is to load SUPER data but convert it to columnar data as soon as possible.
Q2:
How can I re-aggregate this kind of SUPER column, while retaining the SUPER structure? Until now, I've discussed a simplified example which only aggregates by user. In reality, there are other dimensions of aggregation, and some analyses of this table will need to re-aggregate the values shown in the table above. By analogy, the desired output might look something like (aggregating over all users):
nb_events  ...  count_by_event_type (SUPER)
4          ...  {"foo": 3, "bar": 1}
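To pin down the semantics being asked for (this is not a Redshift solution), the per-user aggregation and the re-aggregation over all users can be sketched in plain Python. The function names `aggregate_by_user` and `reaggregate` are made up for this illustration; the input rows are the sample events from the Context section.

```python
from collections import Counter

# Sample events from the question: (user, timestamp, event_type).
events = [
    (1, "2021-01-01 12:00:00", "foo"),
    (1, "2021-01-01 15:00:00", "bar"),
    (2, "2021-01-01 16:00:00", "foo"),
    (2, "2021-01-01 19:00:00", "foo"),
]


def aggregate_by_user(rows):
    """Build one document per user, with a sparse count per event_type."""
    docs = {}
    for user, ts, event_type in rows:
        doc = docs.setdefault(user, {
            "nb_events": 0,
            "last_event": ts,
            "count_by_event_type": Counter(),
        })
        doc["nb_events"] += 1
        # Timestamps in this format compare correctly as strings.
        doc["last_event"] = max(doc["last_event"], ts)
        doc["count_by_event_type"][event_type] += 1
    return docs


def reaggregate(docs):
    """Merge the per-user documents, summing the counts key by key."""
    total = {"nb_events": 0, "count_by_event_type": Counter()}
    for doc in docs.values():
        total["nb_events"] += doc["nb_events"]
        # Counter.update adds counts rather than replacing them.
        total["count_by_event_type"].update(doc["count_by_event_type"])
    return total


per_user = aggregate_by_user(events)
overall = reaggregate(per_user)
print(dict(per_user[1]["count_by_event_type"]))
print(overall["nb_events"], dict(overall["count_by_event_type"]))
```

The key property is that the merge step only needs the already-aggregated documents, which is what re-aggregating the SUPER column would have to do.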
I can get close to achieving this re-aggregation with a query like (where the listagg of key-value string pairs is a stand-in for the SUPER type construction that I don't know how to do):
select
    sum(nb_events) nb_events,
    (
        select listagg(s)
        from (
            select
                k::text || ':' || sum(v)::text as s
            from my_aggregated_table inner_query,
                unpivot inner_query.count_by_event_type as v at k
            group by k
        ) a
    ) count_by_event_type
from my_aggregated_table outer_query
But Redshift doesn't support this kind of correlated query:
[0A000] ERROR: This type of correlated subquery pattern is not supported yet
Q3:
Are there any alternative approaches to consider? Normally I'd handle this kind of problem with Spark, which I find much more flexible for these kinds of problems. But if possible it would be great to stick with Redshift, since that's where the source data is.

Azure Data Factory: Flattening/normalizing a column from CSV file using Azure Data Factory activity

I have pulled a CSV file from one of our sources using ADF, and it has one column called "attributes" which contains multiple fields (in the form of key-value pairs). I want to expand that column into separate fields (columns). Below is a sample:
leadId activityDate activityTypeId campaignId primaryAttributeValue attributes
1234 2020-06-22T00:00:44Z 46 33686 Mail {"Description":"Clicked: https://stepuptostepout.com/","Source":"Lead action","Date":"2020-06-21 19:00:44"}
5678 2020-06-22T00:01:54Z 13 33128 SMS {"Reason":"Changed","New Value":110,"Old Value":null,"Source":"Marketo Flow Action"}
Here the attributes column has different key-value pairs, and I want them in separate columns so that I can store them in Azure SQL Database:
attributes
{"Reason":"Changed","New Value":110,"Old Value":null,"Source":"Marketo"}
I want them as:
Reason New Value Old Value Source
Changed 110 null Marketo
I am using Azure Data Factory. Please help!
Updating this:
One more thing I have noticed in my data is that the keys are not uniform, also if there is one key (say 'Source') present for one lead ID it might not be present/missing in the other leadId, making this more complicated. Hence having a separate column for each Attribute Key might not be a good idea.
Thus, we can have a separate table for 'attribute' field with lead ID, AttributeKey, AttributeValue as columns (we can join this with our main table using LeadID). The Attribute table will look like:
LeadID AttributeKey AttributeValue
5678 Reason Changed
5678 New Value 110
5678 Old Value null
5678 Source Marketo
Can you help me achieve this using ADF?
You can use a data flow to do this. Below is my test sample.
Setting of source1
Setting of Filter1
instr(attributes,'Reason') != 0
Setting of DerivedColumn1
Here is my expression; it is fairly complex:
@(Reason=translate(split(split(attributes,',')[1],':')[2],'"',''),
NewValue=translate(split(split(attributes,',')[2],':')[2],'"',''),
OldValue=translate(split(split(attributes,',')[3],':')[2],'"',''),
Source=translate(translate(split(split(attributes,',')[4],':')[2],'"',''),'}',''))
Setting of Select1
Here is the result:
By the way, if your file were JSON, this would probably be simpler than with CSV.
Hope this helps :).
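For comparison, the LeadID / AttributeKey / AttributeValue table from the question's update can be sketched outside of data flows with a real JSON parser. This is a minimal illustration in Python, not an ADF expression; `explode_attributes` is a hypothetical helper name, and the two input rows are the samples from the question. Parsing the JSON avoids a weakness of splitting on ',' and ':', which breaks when a value itself contains those characters (as the URL in the first sample does).

```python
import json

# (leadId, attributes) pairs taken from the sample rows in the question.
rows = [
    (1234, '{"Description":"Clicked: https://stepuptostepout.com/",'
           '"Source":"Lead action","Date":"2020-06-21 19:00:44"}'),
    (5678, '{"Reason":"Changed","New Value":110,"Old Value":null,'
           '"Source":"Marketo Flow Action"}'),
]


def explode_attributes(rows):
    """Yield one (LeadID, AttributeKey, AttributeValue) tuple per JSON key."""
    for lead_id, attributes in rows:
        for key, value in json.loads(attributes).items():
            yield (lead_id, key, value)


attribute_rows = list(explode_attributes(rows))
for row in attribute_rows:
    print(row)
```

Keys that are missing for a given lead simply produce no row, so non-uniform keys need no special handling. JSON null comes back as Python None, which maps naturally to SQL NULL when loading into a database.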

Transpose data using Talend

I have this kind of data:
I need to transpose this data into something like this using Talend:
Help would be much appreciated.
dbh's suggestion should indeed work, but I did not try it.
However, I have another solution which doesn't require changing the input format and is not too complicated to implement. The job has only two transformation components (tDenormalize and tMap).
The job looks like the following:
Explanation :
Your input is read from a CSV file (could be a database or any other kind of input)
The tDenormalize component denormalizes your value column (column 2) based on the id column (column 1), separating the values with a specific delimiter (";" in my case), which results in 2 rows.
tMap: splits the aggregated column into multiple columns, using Java's String.split() method and spreading the resulting array across multiple columns. The tMap should look like this:
Since Talend doesn't accept storing Array objects, make sure to store the split String as an Object. Then cast that Object to an array on the right side of the map.
That approach should give you the expected result.
IMPORTANT:
tDenormalize might shuffle the rows, meaning that for bigger inputs you might get unsorted output. Make sure to sort it if needed, or use tDenormalizeSortedRow instead.
tDenormalize is similar to an aggregation component, meaning it scans the whole input before processing, which can cause performance issues with particularly big inputs (tens of millions of records).
Your input is probably wrong (you have 5 entries with 1 as id, and 6 entries with 2 as id). 6 columns are expected meaning you should always have 6 lines per id. If not, then you should implement dbh's solution, and you probably HAVE TO add a column with a key.
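The two steps of the job (denormalize by id, then split the joined string into columns) can be sketched in Python to show what each step produces. This is not Talend code: `denormalize` and `split_columns` are hypothetical stand-ins for tDenormalize and the tMap expression, and the input rows are invented because the original screenshots are not available.

```python
# Hypothetical (id, value) input rows; id 2 deliberately has one value fewer.
rows = [(1, "a"), (1, "b"), (1, "c"), (2, "d"), (2, "e")]


def denormalize(rows, delimiter=";"):
    """Group values by id and join them, like tDenormalize does."""
    grouped = {}
    for key, value in rows:
        grouped.setdefault(key, []).append(value)
    return [(key, delimiter.join(values)) for key, values in grouped.items()]


def split_columns(denormalized, width, delimiter=";"):
    """Split the joined string back into `width` separate columns."""
    result = []
    for key, joined in denormalized:
        parts = joined.split(delimiter)
        # Pad ids that have fewer values than expected with None.
        parts += [None] * (width - len(parts))
        result.append((key, *parts))
    return result


denormalized = denormalize(rows)
transposed = split_columns(denormalized, width=3)
print(denormalized)
print(transposed)
```

The padding step shows why an uneven number of lines per id matters: without a key column there is no way to know which field is the missing one, only that the last columns come out empty.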
You can use Talend's tPivotToColumnsDelimited component to achieve this. You will most likely need an additional column in your data to represent the field name.
Like "Identifier, field name, value "
Then you can use this component to pivot the data and write a file as output. If you need to process the data further, read the resulting file with tFileInputDelimited.
See docs and an example at
https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide521EN/13.43+tPivotToColumnsDelimited
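The "Identifier, field name, value" layout can be illustrated with a small Python sketch of a keyed pivot. It only shows the shape of the transformation, not how tPivotToColumnsDelimited is implemented; the `pivot` function and the sample records are hypothetical.

```python
# Hypothetical (identifier, field name, value) records.
records = [
    ("1", "color", "red"),
    ("1", "size", "M"),
    ("2", "color", "blue"),
]


def pivot(records, missing=""):
    """Turn (identifier, field, value) rows into one row per identifier."""
    fields = []   # column order follows first appearance of each field name
    table = {}
    for ident, field, value in records:
        if field not in fields:
            fields.append(field)
        table.setdefault(ident, {})[field] = value
    header = ["id"] + fields
    body = [[ident] + [cols.get(f, missing) for f in fields]
            for ident, cols in table.items()]
    return [header] + body


pivoted = pivot(records)
for line in pivoted:
    print(";".join(line))
```

With an explicit field name per row, a missing value leaves a gap in the right column instead of shifting the remaining values.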

How to assign foreign keys in Access within imported table from Excel

I am going to use an Access database instead of Excel, but I need to import data from one huge Excel sheet into several pre-prepared, normalized tables in Access. The core Access table mainly holds foreign keys to other tables (some other fields are, of course, text or dates).
What is the easiest way to perform the import? I cannot import directly because, for example, the Access field 'Country' does not hold the string "United States"; it must hold foreign key no. 84 from the table tblCountries. I am thinking about using the DLOOKUP function in Excel to replace the strings with FKs... Do you know a simpler method?
Thank you, Martin
You don’t mention how you will get the Excel data into several Access tables, so I will assume you will import the entire Excel file into ONE large table then break out the data from there. I assume the imported data may NOT match with existing Access keys (i.e. misspellings, new values, etc.) so you will need to locate those so you can make corrections. This will involve creating a number of ‘unmatched queries’ then a number of ‘Update queries’, finally you can use Append queries to pull data from your import table into the final resting place. Using your example, you have imported ‘Country = United States’, but you need to relate that value to key “84”?
Let’s set some examples:
Assume you imported your Excel data into one large Access table. Also assume your import has three fields you need to get keys for.
You already have several control tables in Access similar to the following:
a. tblRegion: contains RegionCode, RegionName (i.e. 1=Pacific, 2=North America, 3=Asia, …)
b. tblCountry: contains CountryCode, Country, Region (i.e. 84 | United States | 2)
c. tblProductType: contains ProdCode, ProductType (i.e. VEH | vehicles; ELE | electrical; etc.)
d. Assume your imported data has fields
Here are the steps I would take:
If your Excel file does not already have columns to hold the key values (i.e. 84), add them before the import. Or after the import, modify the table to add the columns.
Create an ‘Unmatched query’ for each key field you need to relate (use ‘Query Wizard’, ‘Find Unmatched Query Wizard’). This will show you all imported data that does not have a match in your key table, and you will need to correct those values, e.g.:
SELECT tblFromExcel.Country, tblFromExcel.Region, tblFromExcel.ProductType, tblFromExcel.SomeData
FROM tblFromExcel LEFT JOIN tblCountry ON tblFromExcel.[Country] = tblCountry.[CountryName]
WHERE (((tblCountry.CountryName) Is Null));
Update the FK with matching values:
UPDATE tblCountry
INNER JOIN tblFromExcel ON tblCountry.CountryName = tblFromExcel.Country
SET tblFromExcel.CountryFK = [CountryNbr];
Repeat the above Unmatched / Matched for all other key fields.
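The unmatched-then-update sequence can be tried out end to end with the sketch below. It uses Python's built-in sqlite3 rather than Access, so the UPDATE is written as a correlated subquery instead of Access's UPDATE ... INNER JOIN form. The table and column names follow the queries above; the sample rows (including the misspelled country and the key 33 for France) are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblCountry (CountryNbr INTEGER PRIMARY KEY, CountryName TEXT);
    CREATE TABLE tblFromExcel (Country TEXT, SomeData TEXT, CountryFK INTEGER);
    INSERT INTO tblCountry VALUES (84, 'United States'), (33, 'France');
    INSERT INTO tblFromExcel (Country, SomeData) VALUES
        ('United States', 'row 1'),
        ('Untied States', 'row 2'),   -- deliberate misspelling
        ('France', 'row 3');
""")

# Step 1: find imported values that have no match in the key table.
unmatched = conn.execute("""
    SELECT e.Country
    FROM tblFromExcel AS e
    LEFT JOIN tblCountry AS c ON e.Country = c.CountryName
    WHERE c.CountryName IS NULL
""").fetchall()

# Step 2: fill in the foreign key wherever the name matches.
conn.execute("""
    UPDATE tblFromExcel
    SET CountryFK = (SELECT c.CountryNbr
                     FROM tblCountry AS c
                     WHERE c.CountryName = tblFromExcel.Country)
""")

result = conn.execute(
    "SELECT Country, CountryFK FROM tblFromExcel ORDER BY rowid").fetchall()
print(unmatched)
print(result)
```

Rows that stay unmatched keep a NULL foreign key, which makes them easy to find and correct before running the final Append queries.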

How to delete data from an RDBMS using Talend ELT jobs?

What is the best way to delete from a table using Talend?
I'm currently using a tELTJDBCoutput with the action set to Delete.
It looks like Talend always generates a DELETE ... WHERE EXISTS (<your generated query>) query.
So I am wondering whether we have to map the actual field values, or can just put a fixed value of 1 (even in only one field) in the tELTmap mapping.
To me, mapping real values looks useless, since with WHERE EXISTS only the WHERE clause of the subquery matters.
Is there a better way to delete using ELT components?
My current job is set up like so:
The tELTMAP component with real data values looks like:
But I can also do the same thing with the following configuration:
Am I missing the reason why we should put something in the fields?
The following answer demonstrates how to perform deletes using ETL operations, where the data is extracted from the database, read into memory, transformed and then fed back into the database. After clarification, the OP specifically wants information on how this would differ for ELT operations.
If you need to delete certain records from a table then you can use the normal database output components.
In the following example, the use case is to take an updated data set, check which records from the old data set no longer appear in the new one, and then delete those rows from the old data set. This might be used for refreshing data from a live system to a non-live system, or for any other case where you need to manually move data deltas from one database to another.
We set up our job like so:
Which has two tMySqlConnection components that connect to two different databases (potentially on different hosts), one containing our new data set and one containing our old data set.
We then select the relevant data from the old data set and inner join it using a tMap against the new data set, capturing any rejects from the inner join (rows that exist in the old data set but not in the new data set):
We are only interested in the key for the output as we will delete with a WHERE query on this unique key. Notice as well that the key has been selected for the id field. This needs to be done for updates and deletes.
And then we simply need to tell Talend to delete these rows from the relevant table by configuring our tMySqlOutput component properly:
Alternatively you can simply specify some constraint that would be used to delete the records as if you had built the DELETE statement manually. This can then be fed in as the key via a main link to your tMySqlOutput component.
For instance I might want to read in a CSV with a list of email addresses, first names and last names of people who are opting out of being contacted and then make all of these fields a key and connect this to the tMySqlOutput and Talend will generate a DELETE for every row that matches the email address, first name and last name of the records in the database.
In the first example shown in your question:
you are specifically only selecting (for the deletion) products where SOME_TABLE.CODE_COUNTRY is equal to JS_OPP.CODE_COUNTRY and SOME_TABLE.FK_USER is equal to JS_OPP.FK_USER in your where clause, and the data you send to the delete statement sets CODE_COUNTRY equal to JS_OPP.CODE_COUNTRY and FK_USER equal to JS_OPP.FK_USER.
If you were to put a tLogRow (or some other output) directly after your tELTxMap you would be presented with something that looks like:
.------------+-------.
|     tLogRow_1      |
|=-----------+------=|
|CODE_COUNTRY|FK_USER|
|=-----------+------=|
|GBR         |1      |
|GBR         |2      |
|USA         |3      |
'------------+-------'
In your second example:
You are setting CODE_COUNTRY to an integer of 1 (your database will then translate this to a VARCHAR "1"). This would then mean the output from the component would instead look like:
.------------.
|tLogRow_1   |
|=----------=|
|CODE_COUNTRY|
|=----------=|
|1           |
|1           |
|1           |
'------------'
In your use case this would mean that the deletion should only delete the rows where the CODE_COUNTRY is equal to "1".
You might want to test this a bit further though because the ELT components are sometimes a little less straightforward than they seem to be.
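One way to run such a test in isolation is sketched below. It assumes the generated statement really has the DELETE ... WHERE EXISTS (subquery) shape described in the question, which should be checked against the SQL your job actually generates. Under that assumption, standard SQL ignores the select list of an EXISTS subquery, so mapping real columns or a constant 1 gives the same result. The sketch uses Python's built-in sqlite3; the table names come from the question, while the rows (including the extra FRA row) and the helper `delete_with_exists` are hypothetical.

```python
import sqlite3


def delete_with_exists(select_list):
    """Run the DELETE with the given select list; return the surviving rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE SOME_TABLE (CODE_COUNTRY TEXT, FK_USER INTEGER);
        CREATE TABLE JS_OPP (CODE_COUNTRY TEXT, FK_USER INTEGER);
        INSERT INTO SOME_TABLE VALUES
            ('GBR', 1), ('GBR', 2), ('USA', 3), ('FRA', 4);
        INSERT INTO JS_OPP VALUES ('GBR', 1), ('GBR', 2), ('USA', 3);
    """)
    # Only the select list varies; the correlation in WHERE is fixed.
    conn.execute(f"""
        DELETE FROM SOME_TABLE
        WHERE EXISTS (
            SELECT {select_list}
            FROM JS_OPP
            WHERE SOME_TABLE.CODE_COUNTRY = JS_OPP.CODE_COUNTRY
              AND SOME_TABLE.FK_USER = JS_OPP.FK_USER
        )
    """)
    return conn.execute(
        "SELECT CODE_COUNTRY, FK_USER FROM SOME_TABLE ORDER BY FK_USER"
    ).fetchall()


with_columns = delete_with_exists("JS_OPP.CODE_COUNTRY, JS_OPP.FK_USER")
with_constant = delete_with_exists("1")
print(with_columns, with_constant)
```

If the statement Talend generates matches on the mapped output values instead of using a correlated EXISTS, the two mappings would no longer be equivalent, which is why inspecting the generated query is the deciding check.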