MD5/SHA Field Dataset in Data Fusion - google-cloud-data-fusion

I need to concatenate a few string values in order to obtain the SHA256 hash of the result. I've seen that Data Fusion has a plugin to do the job:
The documentation, however, is very sparse and nothing I've tried seems to work. I created a table in BQ with the string fields I need to concatenate, but the output is the same as the input. Can anyone provide an example of how to use this plugin?
EDIT
Below is the example.
This is what the workflow looks like:
For testing purposes, I added one column with the following string:
2022-01-01T00:00:00+01:00
And here's the output:

You can use Wrangler to concatenate the string values.
I tried your scenario by adding a Wrangler step to the pipeline:
Joining 2 columns:
I named the new column new_col, using , as the delimiter:
Output:

What you described can be achieved with 2 Wrangler steps:
The first Wrangler will be what @angela-b described. Use the merge directive to create a new column with the concatenation of two columns. Example directive that joins columns a and b using , as the delimiter and stores the result in column a_b:
merge a b a_b ,
The second Wrangler will use the hash directive which will hash the column in place using a specified algorithm. Example of a directive that hashes column a_b using MD5:
hash :a_b 'MD5' true
Remember to set the last parameter encode to true so that you get a string output instead of a byte array.
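Putting the two steps together for the SHA-256 case from the original question, the directives would look roughly like this (column names a, b, and a_b are placeholders, and 'SHA-256' is assumed to be an accepted algorithm name following the Java MessageDigest convention):
merge a b a_b ,
hash :a_b 'SHA-256' true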

Related

Extract table name from more than one schema length in Data Factory Expression

I need to extract the table name from schema.table_name. I have more than one schema, and the schema length is unknown, e.g. Finance.Reporting or Setup.System. I want to extract Reporting and System from these strings using an expression in Data Factory.
You can use the split() function to split the string on the delimiter, which returns an array, and then take the second value from that array.
Note: Array index starts from 0.
@split('Finance.Reporting','.')[1]
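The same approach works with a dynamic value; for example, assuming a hypothetical pipeline parameter named fullTableName that holds a value like Finance.Reporting:
@split(pipeline().parameters.fullTableName, '.')[1]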

Data flow single row transformation

I have a transformation; can this be achieved with a data flow?
Thanks
Anuj Gupta
If the schema of the column data CN=SERVICE NOW,OU=TOOL;CN=PYTHON,OU=LANGUAGE;CN=ADF,OU=CLOUD is fixed, then you can use a Data Flow derived column expression to achieve it.
I just made an example to get the output; here's the dataset:
Data Flow derived column expressions:
Col1 column value: col1 --> {Col1 }
a column value: SERVICE NOW-->substring(split({ Col2},';')[1], 5, length(split({ Col2},';')[1])-12)
b column value: PYTHON --> substring(split({ Col2},';')[2], 4, length(split({ Col2},';')[1])-17)
c column value: ADF --> substring(split({ Col2},';')[3], 4, length(split({ Col2},';')[1])-20)
Screenshots:
But if the data is dynamic, we can't do the conversion in Data Factory; it's unachievable this way.
Logic: You can get the index of the first occurrence of 'CN=' and the first occurrence of a comma to get the first word between them, i.e. SERVICE NOW.
Similarly for the others.
If I get time I will try to edit this with the actual syntax!
The strings are delimited by ;. First split on that (so you get an array) in a derived column transform. You can then use the map function to extract the value from each element (the part between the = and the following comma). You will have an array of values (ADF, PYTHON, etc.). Then use a flatten transform to convert the array elements to rows.
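As a rough sketch of that approach (assuming the source column is named Col2 and each segment carries its value between the first = and the following comma), the derived column expression could look like:
map(split(Col2, ';'), substring(#item, instr(#item, '=') + 1, instr(#item, ',') - instr(#item, '=') - 1))
A flatten transform on the resulting array column then turns the extracted values into rows.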

How to Validate Data issue for fixed length file in Azure Data Factory

I am reading a fixed-width file in a Mapping Data Flow and loading it to a table. I want to validate the fields, data types, and lengths of the fields that I am extracting in the Derived Column using substring.
How can I achieve this in ADF?
Use a Conditional Split and add a condition for each property of the field that you wish to test. For data type checking, we literally just landed new isInteger(), isString(), ... functions today. The docs are still in the printing press, but you'll find them in the expression builder. For length, use length().
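For example, a Conditional Split condition that checks whether a hypothetical extracted column acct_id is an integer of exactly 10 characters might look like this (column name and length are illustrative):
isInteger(acct_id) && length(acct_id) == 10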

How to pass an array for column pattern matching in mapping dataflow derived column from CSV file through pipeline?

I have a mapping data flow with a derived column, where I want to use a column pattern for matching against an array of columns using in().
The data flow is executed in a pipeline, where I set the parameter $owColList_md5 based on a variable that I've populated from a single-line CSV file containing a comma-separated string.
If I have a single column name in the CSV file/variable enclosed in single quotes and have the "Expression" checkbox ticked, it works.
The problem is getting it to work with multiple columns. There seem to be parsing problems when the variable holds multiple items, each enclosed in single quotes, or potentially with the comma separating them. This often causes errors executing the data flow, with messages like "store is not defined" etc.
I've tried having ''col1'',''col2'' and "col1","col2" (2x single quotes and double quotes) in the CSV file. I've also tried having the file without quotes, trying to replace the comma with escaped quotes (using ) in the derived column pattern expression with no luck.
How do you populate this array in the derived column from the data flow parameter, which is based on the comma-separated string of column names in the CSV file / pipeline variable, in a working way?
While array types are not supported as data flow parameters, passing in a comma-separated string can work if you use the instr() function to match.
Say you have two columns, col1 and col2. Pass in a parameter with value '"col1","col2"'.
Then use instr($<yourparamname>, '"' + name + '"') > 0 to see if the column name exists within the string you pass in. Note: You do not need the double quotes, but they can be useful if you have column names that are subsets of other column names, such as id1 and id11.
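Applied to the parameter name from the question, the matching condition in the column pattern would be along these lines (with the pipeline passing in a value such as '"col1","col2"'):
instr($owColList_md5, '"' + name + '"') > 0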
Hope this helps!

How to use a dynamic comma separated String value as input for a List()?

I'm building a Spark Scala application that dynamically lists all tables in a SQL Server database and then loads them to Apache Kudu.
I'm building a dynamic string variable that tracks the primary key columns for each table. The primary keys are comma-separated within the variable. The following is an example of my variable value:
PrimaryKeys=storeId,storeNum,custId
The following is a required function that takes a List[String] as input (so passing the raw PrimaryKeys string is definitely not correct):
setRangePartitionColumns(List("storeId","storeNum","custId").asJava)
If I just use the PrimaryKeys variable for the List input (like the following), it only works for a single column (and would fail in this example with 3 comma-separated values):
setRangePartitionColumns(List(PrimaryKeys).asJava)
The following is another example, but using a Seq(). I'm supposed to put the same primary key column names in the same format below. Manually typing the column names works fine; however, I cannot figure out how to dynamically pass in the variable values:
kuduContext.createTable(tableName, df.schema, Seq(PrimaryKey), kuduTableOptions)
Any idea how I can parse the PrimaryKeys variable dynamically and feed it into either function, regardless of the number of comma-separated values included?
Any assistance is greatly appreciated.
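A minimal Scala sketch of the splitting step, assuming the variable and function names from the question above; the Kudu calls themselves are left as comments since they depend on the surrounding application:
import scala.collection.JavaConverters._

// Comma-separated primary key string, built dynamically elsewhere in the application
val PrimaryKeys = "storeId,storeNum,custId"

// Split on commas, trim whitespace, and drop empty entries to get a List[String]
val pkList: List[String] = PrimaryKeys.split(",").map(_.trim).filter(_.nonEmpty).toList

// Where a java.util.List[String] is required:
// setRangePartitionColumns(pkList.asJava)

// Where a Scala Seq[String] is required:
// kuduContext.createTable(tableName, df.schema, pkList, kuduTableOptions)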