Data flow single row transformation - azure-data-factory

I have a transformation; can this be achieved with a data flow?
Thanks
ANuj Gupta

If the schema of the column data CN=SERVICE NOW,OU=TOOL;CN=PYTHON,OU=LANGUAGE;CN=ADF,OU=CLOUD is fixed, then you can use a Data Flow derived column expression to achieve it.
I made an example to get the output; here's the dataset:
Data Flow derived column expressions:
Col1 column value: col1 --> {Col1 }
a column value: SERVICE NOW-->substring(split({ Col2},';')[1], 5, length(split({ Col2},';')[1])-12)
b column value: PYTHON --> substring(split({ Col2},';')[2], 4, length(split({ Col2},';')[1])-17)
c column value: ADF --> substring(split({ Col2},';')[3], 4, length(split({ Col2},';')[1])-20)
Screenshots:
But if the data is dynamic, this conversion can't be done in Data Factory with fixed expressions like these.

Logic: you can get the index of the first occurrence of 'CN=' and the first occurrence of a comma, and take the word between them, i.e. SERVICE NOW.
Similarly for others.
If I get time I will try to edit this with the actual syntax!

The strings are delimited by ';'. First split by that (so you get an array) in a derived column transform. You can then use the 'map' function to extract the string before the last '='. You will have an array of values (ADF/Python, etc.). Then you use a flatten transform to convert the array elements into rows.
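If the pattern is consistent, a minimal sketch of such a derived column expression could look like this (assuming the source column is named Col2, as in the question):
names = map(split(Col2, ';'), replace(split(#item, ',')[1], 'CN=', ''))
This produces an array like ['SERVICE NOW', 'PYTHON', 'ADF'], which a downstream flatten transform on the names column can unroll into one row per value.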

Related

MD5/SHA Field Dataset in Data Fusion

I need to concatenate a few string values in order to obtain the SHA256 hash of the combined string. I've seen Data Fusion has a plugin to do the job:
The documentation, however, is very poor and nothing I've tried seems to work. I created a table in BQ with the string fields I need to concatenate, but the output is the same as the input. Can anyone provide an example of how to use this plugin?
EDIT
Below is the example.
This is what the workflow looks like:
For the testing purposes, I added one column with the following string:
2022-01-01T00:00:00+01:00
And here's the output:
You can use Wrangler to concatenate the string values.
I tried your scenario adding Wrangler to the Pipeline:
Joining 2 Columns:
I named the column new_col, using , as delimiter:
Output:
What you described can be achieved by 2 Wranglers:
The first Wrangler will be what #angela-b described. Use the merge directive to create a new column with the concatenation of two columns. Example directive that joins column a and b using , as the delimiter and stores the result in column a_b:
merge a b a_b ,
The second Wrangler will use the hash directive which will hash the column in place using a specified algorithm. Example of a directive that hashes column a_b using MD5:
hash :a_b 'MD5' true
Remember to set the last parameter encode to true so that you get a string output instead of a byte array.
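Putting the two directives together into one recipe could look like this; the 'SHA-256' algorithm name is an assumption, chosen because the question asks for SHA256 rather than the MD5 shown above:
merge a b a_b ,
hash :a_b 'SHA-256' true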

Get string after ',' delimiter comma or special characters

The field name is message, table name is log.
Data Examples:
Values for message:
"(wsname,cmdcode,stacode,data,order_id) values (hyd-l904149,2,1,,1584425657892);"
"(wsname,cmdcode,stacode,data,order_id) values (hyd-l93mt54,2,1,,1584427657892);"
(command_execute,order_id,workstation,cmdcode,stacode,application_to_kill,application_parameters) values (kill, 1583124192811, hyd-psag314, 10, 2, tsws.exe, -u production ); "
In the log table I need separate columns: wsname with the values hyd-l904149, hyd-l93mt54 and hyd-psag314; cmdcode with the values 2, 2 and 10; and stacode with the values 1, 1 and 2, e.g.:
wsname cmdcode stacode
hyd-l904149 2 1
hyd-l93mt54 2 1
hyd-psag314 10 2
Use regexp_matches to extract the left and right parts of the values clause, then regexp_split_to_array to split these parts by commas, then filter rows containing wsname using the = any(your_array) construct, then select the required columns from the array.
Or, as an alternative solution, fix the data so it is a syntactically valid part of an insert statement, create auxiliary tables, insert the data into them, and then just select.
As I mentioned in the comment section, PostgreSQL has a built-in function for this:
split_part(string, delimiter, field_number)
http://www.sqlfiddle.com/#!15/eb1df/1
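A minimal sketch of how split_part picks fields out of the values list, assuming the part between the second pair of parentheses has already been isolated (the literal below stands in for that isolated string):
select split_part(vals, ',', 1) as wsname,
       split_part(vals, ',', 2) as cmdcode,
       split_part(vals, ',', 3) as stacode
from (select 'hyd-l904149,2,1,,1584425657892' as vals) t;
Note that the third example row lists its columns in a different order, which is why the key/value approach shown below is more robust than fixed positions.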
As the JSON capabilities of the unsupported version 9.3 are very limited, I would install the hstore extension, and then do it like this:
select coalesce(vals -> 'wsname', vals -> 'workstation') as wsname,
vals -> 'cmdcode' as cmdcode,
vals -> 'stacode' as stacode
from (
select hstore(regexp_split_to_array(e[1], '\s*,\s*'), regexp_split_to_array(e[2], '\s*,\s*')) as vals
from log l,
regexp_matches(l.message, '\(([^\)]+)\)\s+values\s+\(([^\)]+)\)') as x(e)
) t
regexp_matches() splits the message into two arrays: one for the list of column names and one for the matching values. These arrays are used to create a key/value pair so that I can access the value for each column by the column name.
If you know that the positions of the columns are always the same, you can remove the use of the hstore type. But that would require quite a huge CASE expression to test where the actual columns appear.
Online example
With a modern, supported version of Postgres, I would use jsonb_object(text[], text[]) passing the two arrays resulting from the regexp_matches() call.
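A sketch of what that could look like: the same query as above, with hstore swapped for jsonb_object and values read with the ->> operator (not tested against the original data):
select coalesce(vals ->> 'wsname', vals ->> 'workstation') as wsname,
       vals ->> 'cmdcode' as cmdcode,
       vals ->> 'stacode' as stacode
from (
  select jsonb_object(regexp_split_to_array(e[1], '\s*,\s*'),
                      regexp_split_to_array(e[2], '\s*,\s*')) as vals
  from log l,
       regexp_matches(l.message, '\(([^\)]+)\)\s+values\s+\(([^\)]+)\)') as x(e)
) t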

Pivot data in Talend

I have some data which I need to pivot in Talend. This is a sample:
brandname,metric,value
A,xyz,2
B,xyz,2
A,abc,3
C,def,1
C,ghi,6
A,ghi,1
Now I need this data to be pivoted on the metric column like this:
brandname,abc,def,ghi,xyz
A,3,null,1,2
B,null,null,null,2
C,null,1,6,null
Currently I am using tPivotToColumnsDelimited to pivot the data to a file and then reading it back from that file. However, having to store the data in an external file and read it back is messy and adds unnecessary overhead.
Is there a way to do this with Talend without writing to an external file? I tried to use tDenormalize but as far as I understand, it will return the rows as 1 column which is not what I need. I also looked for some 3rd party component in TalendExchange but couldn't find anything useful.
Thank you for your help.
Assuming that your metrics are fixed, you can use their names as columns of the output. The pivot has two parts: first, a tMap that transposes the value of each input row in into the corresponding column of the output row out, and second, a tAggregate that groups the tMap's output rows by brandname.
In the tMap you'd have to fill the columns conditionally, for example for the output column named "abc":
out.abc = "abc".equals(in.metric)?in.value:null
In the tAggregate you'd have to group by out.brandname and aggregate each column as sum ignoring nulls.
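The other metric columns follow the same pattern (a sketch, assuming the output columns are named after the metrics in the sample data):
out.def = "def".equals(in.metric)?in.value:null
out.ghi = "ghi".equals(in.metric)?in.value:null
out.xyz = "xyz".equals(in.metric)?in.value:null
The tAggregate then collapses these sparse rows into one row per brand, producing the desired output.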

Multi valued dimension on comma delimited string

We have a dimension which holds its value as a comma-delimited string (e.g. "t1,t2,t3"). Is there a way to get this dimension treated as a multi-valued dimension without storing the values as JSON arrays?
Note: If we have to correct them and load as JSON arrays, all the historical data for the past 6 months has to be fixed
Thanks,
Sathish
As per the documentation, multi-valued fields are generated during data ingestion. So if you have ingested your data as a single comma-delimited string, the only way to treat it as a multi-value field is to re-ingest your data; that is, if you want to filter by that value or use it in groupBy queries. If you just want to extract a sub-portion of the data, then extraction functions can be of use.
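For example, a query-time dimension spec with a regex extraction function could pull out just the first value from the comma-delimited string (the dimension name tags and output name first_tag are illustrative, not from the question):
{
  "type": "extraction",
  "dimension": "tags",
  "outputName": "first_tag",
  "extractionFn": { "type": "regex", "expr": "^([^,]+)" }
}
This only transforms the single string value, though; proper multi-value filtering and grouping still require re-ingestion.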

Transpose data using Talend

I have this kind of data:
I need to transpose this data into something like this using Talend:
Help would be much appreciated.
dbh's suggestion should work indeed, but I did not try it.
However, I have another solution which doesn't require changing the input format and is not too complicated to implement. Indeed, the job has only 2 transformation components (tDenormalize and tMap).
The job looks like the following:
Explanation :
Your input is read from a CSV file (could be a database or any other kind of input)
The tDenormalize component will denormalize your column value (column 2), based on the value in the id column (column 1), separating fields with a specific delimiter (";" in my case), resulting in 2 rows as shown.
tMap: split the aggregated column into multiple columns by using Java's String.split() method and spreading the resulting array into multiple columns. The tMap should look like this:
Since Talend doesn't accept storing Array objects, make sure to store the split String in Object format. Then, cast that object into an Array on the right side of the map (see the sketch after this explanation).
That approach should give you the expected result.
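A minimal sketch of those tMap expressions, with illustrative names (the incoming denormalized column is assumed to be row2.value and the delimiter ";"):
Var.parts (type Object): row2.value.split(";")
out.col1: ((String[]) Var.parts)[0]
out.col2: ((String[]) Var.parts)[1]
out.col3: ((String[]) Var.parts)[2]
Each output column casts the stored Object back to String[] and picks one element of the split result.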
IMPORTANT:
tDenormalize might shuffle the rows, meaning that for bigger inputs you might encounter unsorted output. Make sure to sort it if needed, or use tDenormalizeSortedRow instead.
tDenormalize is similar to an aggregation component, meaning it scans the whole input before processing, which can cause performance issues with particularly big inputs (tens of millions of records).
Your input is probably wrong (you have 5 entries with 1 as id, and 6 entries with 2 as id). 6 columns are expected meaning you should always have 6 lines per id. If not, then you should implement dbh's solution, and you probably HAVE TO add a column with a key.
You can use Talend's tPivotToColumnsDelimited component to achieve this. You will most likely need an additional column in your data to represent the field name.
Like "Identifier, field name, value "
Then you can use this component to pivot the data and write a file as output. If you need to process the data further, read the resulting file with tFileInputDelimited.
See docs and an example at
https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide521EN/13.43+tPivotToColumnsDelimited