Data Factory activity to convert output to proper JSON

I am running my ADF pipeline with a data flow and I am getting JSON output like this:
{"key1":"value1","key2":"[vaq:233,popo:basic5542]"}
However, my actual requirement is to have something like this:
{"key1":"value1","key2":["vaq:233","popo:basic5542"]}
Note the placement of the double quotes for key "key2". In my Data Factory pipeline I am using a Derived Column transformation in the data flow, and for key2 I am doing concat("[", Data1, ",popo:basic5542]"), where Data1 has the value vaq:233.
How can I get the double quotes placed correctly here?

Instead of the concat function, you could use the below expression and check whether it meets your requirement:
array(Data1,"popo:basic5542")
Output:
["vaq:233","popo:basic5542"]
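If the goal is valid JSON downstream, the difference is easy to see outside ADF. A minimal Python sketch, assuming Data1 holds "vaq:233":

import json

data1 = "vaq:233"  # assumed sample value of the Data1 column

# concat builds one flat string, so the per-element quotes are lost
print(json.dumps({"key1": "value1", "key2": "[" + data1 + ",popo:basic5542]"}))
# {"key1": "value1", "key2": "[vaq:233,popo:basic5542]"}

# a real array serializes each element with its own quotes
print(json.dumps({"key1": "value1", "key2": [data1, "popo:basic5542"]}))
# {"key1": "value1", "key2": ["vaq:233", "popo:basic5542"]}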

Considering popo:basic5542 is a static value, you can try the below expression:
concat("[","\"",Data1,"\"",",","\"","popo:basic5542","\"","]")
Or, if you are getting popo and basic5542 dynamically, you can try the below:
concat("[","\"",Data1,"\"",",","\"",popo,":",basic5542,"\"","]")
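As a quick check that the escaped quotes yield a parseable JSON array, here is the equivalent string built in Python (again assuming Data1 = "vaq:233"):

import json

data1 = "vaq:233"  # assumed sample value
# mirrors concat("[","\"",Data1,"\"",",","\"","popo:basic5542","\"","]")
key2 = '["' + data1 + '","popo:basic5542"]'
print(key2)              # ["vaq:233","popo:basic5542"]
print(json.loads(key2))  # ['vaq:233', 'popo:basic5542'] - parses as a JSON array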


What is the equivalent to Kusto's CountOf() function in Azure Data Factory?

My requirement is to extract a string from filenames using an ADF variable. I need to extract the string up to the final underscore '_', and the number of underscores varies per filename, as seen in the examples below.
abc_xyz_20221221.txt --> abc_xyz
abc_xyz_a1_20221221.txt --> abc_xyz_a1
abc_c_ab_a1_20221221.txt --> abc_c_ab_a1
abc_c_ab_a1_a11_20221221.txt --> abc_c_ab_a1_a11
I tried to get the position of the final underscore using indexof(), but it does not accept negative values. So I wrote the logic below, which works in KQL (Azure Data Explorer) but fails in ADF because there is no countof() in this tool. Is there an equivalent function in ADF, or can you suggest how to achieve the same result in ADF?
substring("abc_xyz_20221221.txt", 0,
indexof("abc_xyz_20221221.txt", "_", 0,
strlen("abc_xyz_20221221.txt"),
countof("abc_xyz_20221221.txt", '_')))
You can also try it like this, using split and join inside a ForEach activity.
Array for ForEach activity:
["abc_xyz_20221221.txt","abc_xyz_a1_20221221.txt","abc_c_ab_a1_20221221.txt","abc_c_ab_a1_a11_20221221.txt"]
Append variable inside ForEach:
@join(take(split(item(), '_'),add(length(split(item(), '_')),-1)),'_')
Result in an array variable: ["abc_xyz","abc_xyz_a1","abc_c_ab_a1","abc_c_ab_a1_a11"]
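A minimal Python sketch of what that split/take/join expression computes, for each name in the array:

filenames = ["abc_xyz_20221221.txt", "abc_xyz_a1_20221221.txt",
             "abc_c_ab_a1_20221221.txt", "abc_c_ab_a1_a11_20221221.txt"]
for name in filenames:
    parts = name.split('_')          # split(item(), '_')
    kept = parts[:len(parts) - 1]    # take(..., length(...) - 1)
    print('_'.join(kept))            # join(..., '_')
# abc_xyz
# abc_xyz_a1
# abc_c_ab_a1
# abc_c_ab_a1_a11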
As mentioned by @Joel Cochran, use the below expression with lastIndexOf() in the Append variable activity inside ForEach.
@substring(item(),0,lastindexof(item(),'_'))
This is just a simpler form of what @Rakesh called out above, the only difference being that his implementation iterates. In my case the file name is stored in a variable named foo:
@substring(variables('foo'),0,lastindexof(variables('foo'),'_'))
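The lastIndexOf approach maps directly to Python's rfind, as a quick sketch:

foo = "abc_c_ab_a1_a11_20221221.txt"  # sample file name
# equivalent of substring(foo, 0, lastindexof(foo, '_'))
print(foo[:foo.rfind('_')])  # abc_c_ab_a1_a11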

Azure Data Factory - Data Wrangling with Data Flow - Array bug

I have a tricky firewall log file to wrangle using Azure Data Factory. The file consists of four tab-separated columns: Date and Time, Source, IP, and Data.
The Data column consists of key-value pairs separated with equal signs and text delimited by double-quotes. The challenge is that the data column is inconsistent and contains any number of key-value pair combinations.
Three lines from the source file:
2022-02-13 00:59:59 Local7.Notice 192.168.40.1 date=2022-02-13 time=00:59:59 devname="NoHouse" devid="FG100ETK18006624" eventtime=1644706798637882880 tz="+0200" logid="0000000013" type="traffic" subtype="forward" level="notice" vd="root" srcip=192.168.41.200 srcport=58492 srcintf="port1" srcintfrole="undefined" dstip=216.239.36.55 dstport=443 dstintf="wan1" dstintfrole="undefined" srccountry="Reserved" dstcountry="United States" sessionid=137088638 proto=6 action="client-rst" policyid=5 policytype="policy" poluuid="c2a960c4-ac1b-51e6-8011-6f00cb1fddf2" policyname="All LAN over WAN1" service="HTTPS" trandisp="snat" transip=196.213.203.122 transport=58492 appcat="unknown" applist="block-p2p" duration=6 sentbyte=3222 rcvdbyte=1635 sentpkt=14 rcvdpkt=8 srchwvendor="Microsoft" devtype="Computer" osname="Debian" mastersrcmac="00:15:5d:29:b4:06" srcmac="00:15:5d:29:b4:06" srcserver=0
2022-02-13 00:59:59 Local7.Notice 192.168.40.1 date=2022-02-13 time=00:59:59 devname="NoHouse" devid="FG100ETK18006624" eventtime=1644706798657887422 tz="+0200" logid="0000000013" type="traffic" subtype="forward" level="notice" vd="root" srcip=192.168.41.200 srcport=58496 srcintf="port1" srcintfrole="undefined" dstip=216.239.36.55 dstport=443 dstintf="wan1" dstintfrole="undefined" srccountry="Reserved" dstcountry="United States" sessionid=137088640 proto=6 action="client-rst" policyid=5 policytype="policy" poluuid="c2a960c4-ac1b-51e6-8011-6f00cb1fddf2" policyname="All LAN over WAN1" service="HTTPS" trandisp="snat" transip=196.213.203.122 transport=58496 appcat="unknown" applist="block-p2p" duration=6 sentbyte=3410 rcvdbyte=1791 sentpkt=19 rcvdpkt=11 srchwvendor="Microsoft" devtype="Computer" osname="Debian" mastersrcmac="00:15:5d:29:b4:06" srcmac="00:15:5d:29:b4:06" srcserver=0
2022-02-13 00:59:59 Local7.Notice 192.168.40.1 date=2022-02-13 time=00:59:59 devname="NoHouse" devid="FG100ETK18006624" eventtime=1644706798670487613 tz="+0200" logid="0001000014" type="traffic" subtype="local" level="notice" vd="root" srcip=192.168.41.180 srcname="GKHYPERV01" srcport=138 srcintf="port1" srcintfrole="undefined" dstip=192.168.41.255 dstport=138 dstintf="root" dstintfrole="undefined" srccountry="Reserved" dstcountry="Reserved" sessionid=137088708 proto=17 action="deny" policyid=0 policytype="local-in-policy" service="udp/138" trandisp="noop" app="netbios forward" duration=0 sentbyte=0 rcvdbyte=0 sentpkt=0 rcvdpkt=0 appcat="unscanned" srchwvendor="Intel" osname="Windows" srcswversion="10 / 2016" mastersrcmac="a0:36:9f:9b:de:b6" srcmac="a0:36:9f:9b:de:b6" srcserver=0
My strategy for wrangling this data set is as follows.
1. Source the data file from Azure Data Lake using a tab-delimited CSV dataset. This successfully delivers the source data in four columns to my data flow.
2. Add a Surrogate Key transformation to add an incrementing key value to each row of data.
3. Add a Derived Column with the following function:
regexSplit(Column_4,'\s(?=(?:[^"]*(["])[^"]*\1)*[^"]*$)')
This splits the data by spaces while ignoring the spaces inside double quotes (a Python sanity check of this regex follows the list).
4. Unfold the array, which creates a new record for each item while preserving the other column values:
unfold(SplitBySpace)
5. Split the key-value pairs into their respective keys and values on the equals (=) delimiter.
6. The final step would then be to unpivot the data back into columns, with the respective values grouped by the surrogate key added in step 2.
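As a sanity check of that regex outside ADF, here is a minimal Python sketch. It uses a simplified form of the pattern (no capturing group, so re.split does not interleave the captured quote into the result), and the devname value "No House" is made up to put a space inside quotes:

import re

# split on whitespace only when an even number of double quotes lies
# ahead, i.e. only when the space sits outside a quoted value
pattern = r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)'
line = 'date=2022-02-13 devname="No House" srcip=192.168.41.200'
print(re.split(pattern, line))
# ['date=2022-02-13', 'devname="No House"', 'srcip=192.168.41.200']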
This all sounds good, but unfortunately step 5 fails with the following error: "Indexing is only allowed on the array and map types".
The output after step 4: according to the Inspect tab, the unfold function returns an array. I would expect a string here!
Now in step 5, I split by "=" with the expression split(unfoldSplitBySpace, '='), but this errors in the expression builder with the message "Split expect string type of argument".
Changing the expression to split(unfoldSplitBySpace[1], '=') removes the error from the expression builder.
But then the Spark execution engine errors with "Indexing is only allowed on the array and map types".
The problem.
According to the Azure Data Factory UI, the output of the unfold() function is an array type, but when accessing the array elements, or in any other function, the Spark engine does not recognise the object as an array type.
Is this a bug in the execution, or is there a problem in my understanding of how Data Factory and the Spark engine handle arrays?
The split() function splits a string into multiple values based on a delimiter and returns an array type.
If you are splitting the value at a particular index of an array, reference the index within brackets [].
Example:
Here I have an array value ["employee=Robert", "D"], and using split(), I am splitting the value at index 1 on =.
split(value[1], '=')
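Note that data flow array indexes are 1-based, so value[1] above is "employee=Robert". A quick Python sketch of the same split (Python is 0-based, hence index 0):

value = ["employee=Robert", "D"]
# data flow arrays are 1-based, so value[1] there corresponds
# to value[0] here
print(value[0].split('='))  # ['employee', 'Robert']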
Microsoft Q&A provided an answer.
First, it looks like a bug.
Second, there is a workaround: cast the array output to a string with toString(unfold(SplitBySpace)).
https://learn.microsoft.com/en-us/answers/questions/860243/azure-data-factory-data-wrangling-with-data-flow-a.html#answer-865321

How to extract the value from a JSON object in Azure Data Factory

I have an ADF pipeline where the final output from a Set variable activity is something like this: {name:test, value:1234}.
The input coming to this variable is:
{
"variableName": "test",
"value": "test:1234"
}
The expression provided in the Set variable item column is @item().ColumnName, and the ColumnName in my JSON file is something like this: "ColumnName":"test:1234".
How can I change it so that I get only 1234? I am only interested in the value part here.
It looks like you need to split the value on the colon, which you can do using Azure Data Factory (ADF) expressions and functions: the split function, which splits a string into an array, and the last function, which gets the last item from the array. This works quite neatly in this case:
@last(split(variables('varWorking'), ':'))
Sample result: 1234
Change the variable name to suit your case. You can also use string methods like indexOf to locate the colon and grab the rest of the string from there. A sample expression would be something like this:
@substring(variables('varWorking'),add(indexof(variables('varWorking'), ':'),1),4)
It's a bit more complicated but may work for you, depending on the requirement.
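Both expressions boil down to ordinary string operations; a quick Python check on the sample value (note the second approach hard-codes the length 4, which only works for four-character values):

value = "test:1234"

# equivalent of last(split(value, ':'))
print(value.split(':')[-1])        # 1234

# equivalent of substring(value, add(indexof(value, ':'), 1), 4)
start = value.index(':') + 1
print(value[start:start + 4])      # 1234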
It seems like you are using it inside an iterator, since you have item(). However, I tried with a simple JSON lookup value:
@last(split(activity('Lookup').output.value[0].ColumnName,':'))

Azure data flow - convert string JSON array to JSON array

I'm using Azure data flow, and I want to pass a JSON array as the first parameter to the at() function.
The error is:
At function takes an array or a map for the first parameter.
The urls value is:
[{"url":"http://url1.com"},{"url":"http://url2.com"},{"url":"http://url3.com"},{"url":"http://url4.com"}]
Why does it consider the urls value a string?
You can convert it to an array using a Derived Column transformation.
In the Derived Column, first replace the square brackets with blank, then split the string on the comma (,) delimiter with the split function.
Derived Column preview:
Now you can use this array in the at() function in your expression.
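As a rough Python sketch of what that Derived Column does (it works here only because the URLs themselves contain no commas):

urls = '[{"url":"http://url1.com"},{"url":"http://url2.com"}]'

# replace the square brackets with blank, then split on commas;
# safe only while the values themselves contain no commas
inner = urls.replace('[', '').replace(']', '')
print(inner.split(','))
# ['{"url":"http://url1.com"}', '{"url":"http://url2.com"}']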

How to use regexExtract() to extract specified values in data flow

The source data looks like the photo: a column named genres holding a list of genre entries. I am new to data flow and its expression language. How can I use regexExtract() (or any other expression function) to extract only the genre names?
The expected output should be:
Animation
Comedy
Family
Adventure
Fantasy
...
Thanks!
You can use this expression split(split(genres,"'name':'")[2],"'")[1] to achieve this.
I created a CSV file containing your sample data.
Using the above expression in a Derived Column transformation returns your expected value.
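For reference, a Python sketch of the nested split, assuming the genres column holds a string shaped like the sample below. Data flow array indexes are 1-based, so the [2] and [1] above become [1] and [0] in Python:

genres = "[{'id':16,'name':'Animation'},{'id':35,'name':'Comedy'}]"  # assumed shape

# data flow: split(split(genres, "'name':'")[2], "'")[1]
first = genres.split("'name':'")[1]   # "Animation'},{'id':35,..."
print(first.split("'")[0])            # Animation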