Azure Data Factory DataFlow md5 specific columns

First of all, I have my array of columns parameter called $array_merge_keys
$array_merge_keys = ['Column1', 'Column2', 'NoColumnInSomeCases']
Then I want to hash them. If the third column, NoColumnInSomeCases, does not exist, I would like to treat it as null (or some other string); otherwise, use its value.
However, when I pass them to byNames(), it returns NULL because the last column does not exist, even though the first and second still have values. I would expect byNames($array_merge_keys) to always return a value so I can hash it.
Since that problem cannot be solved directly, I fell back to filtering for only the columns that exist:
filter(columnNames('', true()), contains(['Column1', 'Column2', 'NoColumnInSomeCases'], #item_1 == #item)) => ['Column1', 'Column2']
But that runs into another problem: byNames() cannot be computed on the fly. It fails with 'byNames' does not accept column or argument parameters:
array(byNames(filter(columnNames('', true()), contains(['Column1', 'Column2', 'NoColumnInSomeCases'], #item_1 == #item))))
Spark job failed:
{"runId":"649f28bf-35af-4472-a170-1b6ece50c551","sessionId":"a26089f4-b0f4-4d24-8b79-d2a91a9c52af","status":"Failed","payload":{"statusCode":400,"shortMessage":"DF-EXPR-030 at Derive 'CreateTypeFromFile'(Line 35/Col 36): Column name function 'byNames' does not accept column or argument parameters","detailedMessage":"Failure 2022-04-13 05:26:31.317 failed DebugManager.processJob, run=649f28bf-35af-4472-a170-1b6ece50c551, errorMessage=DF-EXPR-030 at Derive 'CreateTypeFromFile'(Line 35/Col 36): Column name function 'byNames' does not accept column or argument parameters"}}
RunId: 649f28bf-35af-4472-a170-1b6ece50c551
I have tried many approaches, including creating a new derived column (in an earlier stream) to store ['Column1', 'Column2'], but it says that a column cannot be referenced within the byNames() function.
Is there an elegant solution?

It is true that byName() cannot evaluate with late binding. You need to either use a Select transformation to set the columns in the stream you wish to hash first or send in the column names via a parameter. Since that is "early column binding", byName() will work.
You can use a get metadata activity in the pipeline to inspect which columns are present in your source before calling the data flow, allowing you to send a pipeline parameter with just those columns you wish to hash.
Alternatively, you can create a new branch, use a select matching rule, then hash the row based on those columns (see example below).
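For intuition, here is a minimal PySpark sketch of that early-binding idea, since mapping data flows execute on Spark. The dataframe contents, the "|" separator, and the column list are illustrative assumptions, not the data flow's actual internals:

from pyspark.sql import SparkSession
from pyspark.sql.functions import md5, concat_ws, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", "b")], ["Column1", "Column2"])  # NoColumnInSomeCases is absent

wanted = ["Column1", "Column2", "NoColumnInSomeCases"]
present = [c for c in wanted if c in df.columns]  # bind to existing columns up front

# Hash only the columns that actually exist in this run's schema.
df = df.withColumn("merge_key_hash", md5(concat_ws("|", *[col(c) for c in present])))
df.show(truncate=False)

The key point is that the column list is resolved before any column reference is built, which is exactly what the Select transformation or pipeline parameter achieves in the data flow.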

Related

How to add null value in Azure Datafactory Derived columns expression builder

I am currently using Azure Data Factory, in which I am creating a derived column. Since the field will always be blank, I want the value to be NULL.
Currently, in the derived column, I am adding expressions such as toString("null") and toString(null()), but the value appears as a string. I only want null to appear, without quotes, in the JSON document.
I have reproduced the above and got the results below.
I tried assigning null() to a column and it gave an error like the one below.
So, in an ADF data flow, null() has to be wrapped in a conversion function such as toInteger() or toString().
When I assign toString(null()) where id is 4 in a derived column of the data flow, with a JSON sink, it gives me the output below.
You can see the row with id==4 skipped the null-valued key in the JSON. If you assign toString(null()) in every row, the key will be skipped in every row.
You can go through this link by @ShaikMaheer-MSFT to understand more about this.
AFAIK, the workaround is to store the null as the string 'null' so that the key appears in the JSON, and later handle it as per your requirement.
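To make the sink behavior concrete, here is a plain-Python illustration of the difference between a null-valued key and an absent key; the ADF JSON sink effectively writes the second form when the value is null:

import json

# A key whose value is None serializes as null...
print(json.dumps({"id": 4, "name": None}))  # {"id": 4, "name": null}
# ...whereas the sink simply omits the key, producing the equivalent of:
print(json.dumps({"id": 4}))                # {"id": 4}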

Pass Column to Function where the column can be empty

I am trying to pass a column to a function where sometimes the column passed can be empty or blank.
For example
from pyspark.sql.functions import col, lit, when

def test(df, segment):
    # Emit an empty string when the segment name is blank; otherwise use the column.
    score_df = df.withColumn(
        'model_segment',
        when(lit(segment) == '', lit('')).otherwise(col(segment))
    )
    return score_df
This works
test(df,'my_existing_column').show()
However this errors
test(df,'').show()
This fails with the message: cannot resolve '``' given input columns
I get why it is doing that, but how would I go about handling this kind of scenario?
You can get the list of dataframe fields with df.columns, then check whether the incoming parameter exists in that list. If it exists, execute the show action; otherwise, raise a custom exception saying the field does not exist.
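A minimal sketch of that suggestion, keeping the question's empty-string behavior for a blank name and raising for genuinely unknown names (the ValueError choice is an assumption):

from pyspark.sql.functions import col, lit

def test(df, segment):
    # Blank segment: keep the original behavior and emit an empty string.
    if segment == '':
        return df.withColumn('model_segment', lit(''))
    # Unknown segment: fail fast with a clear message instead of
    # Spark's "cannot resolve" analysis error.
    if segment not in df.columns:
        raise ValueError(f"Field '{segment}' does not exist in the dataframe")
    return df.withColumn('model_segment', col(segment))

With this guard, test(df, 'my_existing_column').show() and test(df, '').show() both succeed, while a misspelled column name raises immediately with a readable message.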

Creating an array of columns from an array of column names in data flow

How can I create an array of columns from an array of column names in dataflow?
The following creates an array of sorted column names, with the exception of the last column:
sort(slice(columnNames(), 1, size(columnNames()) - 1), compare(#item1, #item2))
I want to get an array of the columns for this array of column names. I tried this:
toString(byNames(sort(slice(columnNames(), 1, size(columnNames()) - 1), compare(#item1, #item2))))
But I keep getting the error:
Column name function 'byNames' does not accept column or argument parameters
Please can anyone help me with a workaround for this?
Update:
It seems using columnNames() in any way (directly or assigned to a parameter) leads to an error, because at runtime on Spark it is fed to the byNames() function. Since there is no way to re-introduce it as a parameter or assign a variable directly in Data Flow, the following works for me:
Have an empty string-array parameter in the Data Flow.
Use the sha2 function as usual in a derived column with the parameter: sha2(256, byNames($cols))
Create a pipeline; in it, use Get Metadata to get the Structure, from which you can get the column names.
For each column, append it to a variable inside a ForEach activity.
Next, connect the Data Flow and pass in the variable containing the column names (sketched below).
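A rough plain-Python sketch of what the Get Metadata and ForEach steps accomplish on the pipeline side (the structure values are illustrative; Get Metadata's Structure output is a list of name/type pairs):

# Illustrative shape of Get Metadata's Structure output:
structure = [
    {"name": "Column1", "type": "String"},
    {"name": "Column2", "type": "String"},
]

# The ForEach + Append Variable steps collect just the names:
column_names = []
for item in structure:
    column_names.append(item["name"])

# column_names is then passed to the data flow's $cols array parameter,
# where sha2(256, byNames($cols)) consumes it.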
The documentation for the byNames function states 'Computed inputs are not supported but you can use parameter substitutions'. This explains why you should use a parameter as input to create the array used in the byNames function:
Example, where the $cols parameter holds the list of columns:
sha2(256,byNames(split($cols,',')))
You can use computed column names as input by creating the array before it is used in the function. Instead of building the expression inline in the function call, set the column values in a parameter first and then use the parameter in your function directly afterwards.
For a parameter $cols of type array:
$cols = sort(slice(columnNames(), 1, size(columnNames()) - 1), compare(#item1, #item2))
toString(byNames($cols))
Refer: byNames

Convert varchar parameter with CSV into column values postgres

I have a Postgres query with one input parameter of type varchar.
The value of that parameter is used in the WHERE clause.
Until now only a single value was sent to the query, but now we need to send multiple values so that they can be used with an IN clause.
Earlier:
value = 'abc'
where data = value  -- current usage
Now:
value = 'abc,def,ghk'
where data in (value)  -- intended usage
I tried many ways, e.g. providing the value as
value = 'abc','def','ghk'
or
value = "abc","def","ghk"
etc. But none of them works and the query returns no results, even though there is matching data available. If I provide the values directly in the IN clause, I see the data.
I think I should somehow split the comma-separated parameter string into multiple values, but I am not sure how to do that.
Please note it's a Postgres DB.
You can split the input string into an array, something like this:
where data = ANY(string_to_array('abc,def,ghk',','))
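If the CSV value is passed in from application code, the same expression works with a bound parameter. A minimal sketch using psycopg2 (the connection details and table name are assumptions):

import psycopg2

value = 'abc,def,ghk'
conn = psycopg2.connect(dbname='mydb')  # hypothetical connection details
with conn.cursor() as cur:
    # string_to_array splits the CSV server-side; = ANY then behaves
    # like an IN clause over the resulting elements.
    cur.execute(
        "SELECT * FROM mytable WHERE data = ANY(string_to_array(%s, ','))",
        (value,),
    )
    rows = cur.fetchall()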

web2py db query select showing field name

When I have the following query:
str(db(db.items.id==int(row)).select(db.items.imageName)) + "\n"
The output includes the field name:
items.imageName
homegear\homegear.jpg
How do I remove it so that the field name is not included, just the selected image name?
I tried referencing it like a list: [1] gives me an out-of-range error, and with [0] I end up with:
<Row {'imageName': 'homegear\\homegear.jpg'}>
The above is not a list; what object is it, and how can I reference into it?
Thanks!
John
db(db.items.id==int(row)).select(db.items.imageName) returns a Rows object, and its __str__ method converts it to CSV output, which is what you are seeing.
A Rows object contains Row objects, and a Row object contains field values. To access an individual field value, you must first index the Rows object to extract the Row, and then get the individual field value as an attribute of the Row. So, in this case, it would be:
db(db.items.id==int(row)).select(db.items.imageName)[0].imageName
or:
db(db.items.id==int(row)).select(db.items.imageName).first().imageName
The advantage of rows.first() over rows[0] is that the former returns None in case there are no rows, whereas the latter will generate an exception (this doesn't help in the above case, because the subsequent attempt to access the .imageName attribute would raise an exception in either case if there were no rows).
Note, even when the select returns just a single row with a single field, you still have to explicitly extract the row and the field value as above.
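Putting that together, a short sketch (db and db.items come from the question's model; the print is illustrative):

# Fetch the row (or None) and read the field defensively.
row_obj = db(db.items.id == int(row)).select(db.items.imageName).first()
image_name = row_obj.imageName if row_obj else None
print(image_name)  # e.g. homegear\homegear.jpg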