Combining / merging all tMaps into a single table - Talend

I have the following Talend job where each flow is an API call to get a currency exchange rate.
The fields within each tMap are:
- FrommCurr
- ToCurr (each tMap has a hardcoded string value: "EUR", "USD", or "CAD")
- Date
- DateTime
I would like to combine all of these into a single table before inserting. What is the component that will let me combine them? Or should I insert each tMap's output into the database separately? The database I will be inserting into is MSSQL.

As long as all the final tMap components produce exactly the same schema, you can merge their outputs into a single stream of rows with tUnite.
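Outside of Talend, the pattern tUnite implements is simply "concatenate row sets that share a schema, then load them once". Purely as an illustration of that union-then-insert pattern (this is not Talend code; the pyodbc connection string, table name and sample values below are made-up placeholders), a minimal Python sketch:

import pyodbc  # any MSSQL client library would do; pyodbc is just an example

# Three row sets with the same schema (FrommCurr, ToCurr, Date, DateTime),
# standing in for the outputs of the three tMap components.
eur_rows = [("GBP", "EUR", "2022-01-01", "2022-01-01T09:00:00")]
usd_rows = [("GBP", "USD", "2022-01-01", "2022-01-01T09:00:00")]
cad_rows = [("GBP", "CAD", "2022-01-01", "2022-01-01T09:00:00")]

# The "tUnite" step: concatenate the streams into one list of rows.
all_rows = eur_rows + usd_rows + cad_rows

# The "single insert" step: one bulk insert into MSSQL (placeholder DSN/table).
conn = pyodbc.connect("DSN=my_mssql;UID=user;PWD=secret")
cur = conn.cursor()
cur.executemany(
    "INSERT INTO dbo.ExchangeRates (FrommCurr, ToCurr, [Date], [DateTime]) VALUES (?, ?, ?, ?)",
    all_rows,
)
conn.commit()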


Modifying json in azure data factory to give nested list

I'm retrieving data from SQL Server in Azure Data Factory.
The API I'm passing it to requires the JSON in a specific format.
I have been unable to get the data in the required format so far, trying "FOR JSON" output in T-SQL.
Is there a way to do this in Data Factory with the data it retrieved from SQL Server?
SQL Server Output
EntityName   CustomField1   CustomField2
----------   ------------   ------------
AA01         NGO21          2022-01-01
AA02         BS34           2022-03-01
How it appears in Data Factory
[
  {"EntityName": "AA01", "CustomField1": "NGO21", "CustomField2": "2022-01-01"},
  {"EntityName": "AA02", "CustomField1": "BS34", "CustomField2": "2022-03-01"}
]
Required output
[
  {
    "EntityName": "AA01",
    "OtherFields": [
      {"customFieldName": "CustomField1", "customFieldValue": "NGO21"},
      {"customFieldName": "CustomField2", "customFieldValue": "2022-01-01"}
    ]
  },
  {
    "EntityName": "AA02",
    "OtherFields": [
      {"customFieldName": "CustomField1", "customFieldValue": "BS34"},
      {"customFieldName": "CustomField2", "customFieldValue": "2022-03-01"}
    ]
  }
]
I managed to do it in ADF. The solution is quite long; I think it would be much easier if you wrote a stored procedure.
Here is a quick demo that I built.
The idea is to build the JSON structure as requested and then use the collect function to build the array.
We have two arrays: one for EntityName and one for OtherFields.
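To make the target shape concrete before walking through the ADF activities, here is a minimal Python sketch of the same reshape (not the data flow itself, just the equivalent logic in code, using the rows from the question):

import json

# Flat rows as they arrive from SQL Server (values taken from the question).
rows = [
    {"EntityName": "AA01", "CustomField1": "NGO21", "CustomField2": "2022-01-01"},
    {"EntityName": "AA02", "CustomField1": "BS34", "CustomField2": "2022-03-01"},
]

output = []
for row in rows:
    # Everything except EntityName becomes a name/value pair in OtherFields;
    # this is the reshape the Derived Column + collect() steps perform in ADF.
    other_fields = [
        {"customFieldName": name, "customFieldValue": value}
        for name, value in row.items()
        if name != "EntityName"
    ]
    output.append({"EntityName": row["EntityName"], "OtherFields": other_fields})

print(json.dumps(output, indent=2))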
Prepare the data:
First, I added column names to the corresponding rows; we will use this info later on to build our JSON. I used a Derived Column activity to fill the rows with the column names.
Splitting Columns:
In order to build the JSON structure as requested, I split the data into two parallel flows.
The first flow selects CustomFieldName1 and CustomFieldValue1, and the second flow selects CustomFieldName2 and CustomFieldValue2, like so:
SelectColumn2 Activity:
SelectColumn1 Activity:
Note: please keep EntityName; we will union the data on it later in the flow.
OtherFields Column:
In order to build the JSON, we need to use the sub-columns feature in a Derived Column activity; that guarantees the JSON structure.
Add a new column named 'OtherFields' and open the Expression Builder:
Add two sub-columns, CustomFieldName and CustomFieldValue; set CustomFieldName1 as the value of the CustomFieldName sub-column and CustomFieldValue1 as the value of the CustomFieldValue sub-column, like so:
Add a Derived Column activity and repeat the same steps for CustomFieldName2 and CustomFieldValue2.
Union:
Now we have two flows, one extracting field 1 and the other extracting field 2, and we need to union the data (you can do it by position or by name).
In order to create an array of JSON objects we need to aggregate the data; this transforms the complex data type {} into an array [].
Aggregate Activity:
Group by -> 'EntityName'
Aggregates -> collect(OtherFields)
Building Outer Json:
As described in the question above, we need a JSON that consists of two columns: {"EntityName": "", "OtherFields": []}.
In order to do this, we again need to add a Derived Column activity with two sub-columns.
Also, in order to combine all the JSON objects into one JSON array, we need a common value to aggregate by. Since the rows have different values, I added a dummy column with a constant value of 1; this guarantees that all the JSON objects end up under the same array.
Aggregate Data Json Activity:
The output is an array of JSON objects, so we need to aggregate the data column:
Group by -> dummy
Aggregates -> collect(data)
SelectDataColumn Activity:
Select the data column, because we want it to be our output.
Finally, write to sink.
P.S.: you can extract the data value so you won't end up with a wrapping "data" key.
Output:
ADF activities:
You can read more about it here:
https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-column-pattern
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-union
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-derived-column

Group records by columns and pick only the max created date record using Mapping Data Flows in Azure Data Factory

Hi there. In Mapping Data Flows in Azure Data Factory, I tried to create a data flow in which we used an Aggregate transformation to group the records. There is a column called [createddate]; we want to pick the record with the max created date, and the output should show all the columns.
Any advice would be appreciated.
The Aggregate transformation will only output the columns participating in the aggregation (the group-by and aggregate-function columns). To include all of the columns in the Aggregate output, add a new column pattern to your Aggregate (see the pic below; change column1, column2, ... to the names of the columns you are already using in your aggregation).
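For reference, the equivalent "keep the whole row with the latest createddate per group" logic, expressed in pandas rather than in the data flow itself (the column names and values here are just placeholders), looks like this:

import pandas as pd

# Placeholder data: group by key columns and keep the full row that has the
# latest createddate in each group.
df = pd.DataFrame({
    "customer": ["A", "A", "B"],
    "region": ["EU", "EU", "US"],
    "createddate": pd.to_datetime(["2023-01-01", "2023-03-01", "2023-02-01"]),
    "amount": [10, 20, 30],
})

group_cols = ["customer", "region"]  # the columns you group by in the Aggregate
latest = df.loc[df.groupby(group_cols)["createddate"].idxmax()]
print(latest)  # all columns are retained, one row per group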

How do I query Postgresql with IDs from a parquet file in an Data Factory pipeline

I have an Azure pipeline that moves data from one point to another in parquet files. I need to join some data from a PostgreSQL database that is in an AWS tenancy, by a unique ID. I am using a dataflow to create the unique ID I need from two separate columns using a concatenation. I am trying to create a where clause, e.g.
select * from tablename where unique_id in ('id1','id2','id3'...)
I can do a lookup query to the database, but I can't figure out how to create the list of IDs in a parameter that I can use in the select statement out of the dataflow output. I tried using a Set Variable activity and was going to put that into a ForEach, but Set Variable doesn't like the output of the dataflow (object instead of array): "The variable 'xxx' of type 'Array' cannot be initialized or updated with value of type 'Object'. The variable 'xxx' only supports values of types 'Array'." I've used a flatten to try to transform it to an array, but I think the sink operation is putting it back into JSON?
What is a workable approach to getting the IDs into a string that I can put into a lookup query?
Some notes:
The parquet file has a small number of unique IDs compared to the total unique IDs in the database.
If this were an Azure PostgreSQL database I could just use a join in the dataflow, but the generic PostgreSQL driver isn't available in dataflows. I can't copy the entire database over to Azure just to do the join, and I need the dataflow in Azure for non-technical reasons.
Edit:
For clarity's sake, I am trying to replace local Python code that does the following:
import pandas as pd

query = "select * from mytable where id_number in "
df = pd.read_parquet("input_file.parquet")
df['id_number'] = df.country_code + df.id
df_other_data = pd.read_sql(query + str(tuple(df.id_number)), conn)
I'd like to replace this locally executing code with ADF. In the ADF process, I have to replace the transformation of the IDs, which seems easy enough in a couple of different ways. Once I have the IDs in the proper format in a column in a dataset, I can't figure out how to query a database that isn't supported by Data Flow and restrict it to only the IDs I need, so I don't bring down the entire database.
ADF variables can only store simple types, so instead we can define an Array-type parameter in ADF and set a default value. ADF parameters support elements of any type, including complex JSON structures.
For example:
Define a JSON array:
[{"name": "Steve","id": "001","tt_1": 0,"tt_2": 4,"tt3_": 1},{"name": "Tom","id": "002","tt_1": 10,"tt_2": 8,"tt3_": 1}]
Define an Array-type parameter and set its default value:
With the parameter defined this way, we will not get the type error.
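Once the IDs are available as an array (whether as a parameter default like the one above or collected from the dataflow output), what the lookup query ultimately needs is a single string of quoted IDs. A small Python sketch of that final step (the variable names and values are illustrative, not ADF objects):

# IDs as they might come out of the upstream step (illustrative values).
ids = ["AU123", "US456", "NZ789"]

# Build the quoted, comma-separated list for the IN clause.
# In a real pipeline you would also need to escape any quotes in the values.
id_list = ", ".join("'{}'".format(i) for i in ids)
query = "select * from mytable where id_number in ({})".format(id_list)

print(query)
# select * from mytable where id_number in ('AU123', 'US456', 'NZ789')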

Mapping Data Flow Common Data Model source connector datetime/timestamp columns nullified?

We are using Azure Data Factory Mapping data flow to read from Common Data Model (model.json).
We use a dynamic pattern, where Entity is parameterised, we do not project any columns, and we have selected Allow schema drift.
Problem: We are having an issue with the Source in the mapping data flow (the source type is Common Data Model). All the datetime/timestamp columns are read as null in the source activity.
We also tried, in the Projection tab, Infer drifted column types, where we provide a format for timestamp columns. However, it satisfies only certain timestamp columns, since in the source each datetime column has a different timestamp format, for example:
11/20/2020 12:45:01 PM
2020-11-20T03:18:45Z
2018-01-03T07:24:20.0000000+00:00
Question: How do we prevent the datetime columns from becoming null? Ideally, we do not want Mapping Data Flow to typecast any columns - is there a way to just read all columns as strings?
Some screenshots
In Projection tab - we do not specify schema - to allow schema drift and to dynamically load more than 1 entities.
In Data Preview tab
ModifiedOn, SinkCreatedOn, SinkModifiedOn - all of these are system columns and will definitely have values in them.
This is now resolved, following a separate conversation with the Azure Data Factory team.
Firstly, there is no way to 'stringify' all the columns in the Source, because the CDM connector gets its metadata from model.json (if needed this file can be manipulated, although that is not ideal for my scenario).
To stop the datetime/timestamp columns becoming null: under the Projection tab, select Infer drifted column types, and then you can add multiple time formats that you expect to come from CDM. You can either select them from the dropdown or, if your particular datetime format is not listed in the dropdown (which was my case), edit the code behind the data flow (i.e. the data flow script) to add your format (see second screenshot).
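As a side note on why a single format cannot cover the sample values in the question, here is a small Python illustration (outside ADF, using strptime patterns rather than ADF's own format strings) of matching each value against its own pattern:

from datetime import datetime

# The first two sample values from the question, each needing its own pattern.
samples = [
    ("11/20/2020 12:45:01 PM", "%m/%d/%Y %I:%M:%S %p"),
    ("2020-11-20T03:18:45Z", "%Y-%m-%dT%H:%M:%S%z"),
]

for value, fmt in samples:
    parsed = datetime.strptime(value, fmt)
    print(value, "->", parsed.isoformat())

# The third sample, 2018-01-03T07:24:20.0000000+00:00, carries a 7-digit
# fraction that strptime's %f (at most 6 digits) cannot parse directly, which
# is exactly why a list of the expected formats has to be maintained somewhere.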

DynamoDB column with tilde and query using JPA

I have a table column with a tilde in the value, like below:
vendorAndDate - Column name
Chipotle~08-26-2020 - column value
I want to query by month ("vendorAndPurchaseDate like '%~08%2020'") and by year ending with 2020 ("vendorAndPurchaseDate like '%2020'"). I am using Spring Data JPA to query the values. I have not worked with tilde-delimited column values before. Please point me in the right direction or to some examples.
You cannot.
If vendorAndPurchaseDate is your partition key, you need to pass the whole value.
If vendorAndPurchaseDate is your range (sort) key, you can only perform
=, <, <=, >, >=, between, and begins_with operations, along with a partition key.
Reference: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.html
DynamoDB does not support this type of wildcard query.
Let's consider a more DynamoDB way of handling this type of query. It sounds like you want to support 2 access patterns:
Get Item by month
Get Item by year
You don't describe your Primary Keys (Partition Key/Sort Key), so I'm going to make some assumptions to illustrate one way to address these access patterns.
Your attribute appears to be a composite key, consisting of <vendor>~<date>, where the date is expressed by MM-DD-YYYY. I would recommend storing your date fields in YYYY-MM-DD format, which would allow you to exploit the sort-ability of the date field. An example will make this much clearer. Imagine your table looked like this:
I'm calling your vendorAndDate attribute SK, since I'm using it as a Sort Key in this example. This table structure allows me to implement your two access patterns by executing the following queries (in pseudocode to remain language agnostic):
Access Pattern 1: Fetch all Chipotle records for August 2020
query from MyTable where PK = "Vendors" and SK between Chipotle~2020-08-00 and Chipotle~2020-08-31
Access Pattern 2: Fetch all Chipotle records for 2020
query from MyTable where PK = "Vendors" and SK between Chipotle~2020-01-01 and Chipotle~2020-12-31
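Expressed with boto3, just to make the pseudocode concrete (the question itself uses Spring Data, and the table and key names below follow the assumptions above), the two queries would look like this:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("MyTable")  # assumed table name

# Access pattern 1: all Chipotle records for August 2020.
august = table.query(
    KeyConditionExpression=Key("PK").eq("Vendors")
    & Key("SK").between("Chipotle~2020-08-00", "Chipotle~2020-08-31")
)

# Access pattern 2: all Chipotle records for 2020.
year_2020 = table.query(
    KeyConditionExpression=Key("PK").eq("Vendors")
    & Key("SK").between("Chipotle~2020-01-01", "Chipotle~2020-12-31")
)

print(august["Items"], year_2020["Items"])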
Because dates stored in ISO8601 format (e.g. YYYY-MM-DD...) are lexicographically sortable, you can perform range queries in DynamoDB in this way.
Again, I've made some assumptions about your data and access patterns for the purpose of illustrating the technique of using lexicographically sortable timestamps to implement range queries.