How to use regexExtract() to extract specified values in a data flow - azure-data-factory

The source data looks like the photo. I am new to data flows and the expression language. I wonder how to use regexExtract() (or any other expression function) to extract only the genres' names.
The expected output should be:
Animation
Comedy
Family
Adventure
Fantasy
...
Thanks!

You can use the expression split(split(genres,"'name':'")[2],"'")[1] to achieve this.
I created a csv file containing your sample data.
Using the above expression in a Derived Column transformation gives the expected value.
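For intuition, here is a rough Python equivalent of the same idea. The sample string below only assumes the genres column looks like the photo's apparent format (a list of {'id': ..., 'name': ...} fragments stored as text), so the exact delimiters are an assumption; also note that Python indexing is 0-based while data flow expression arrays are 1-based.

import re

# Hypothetical sample value; the real column layout comes from the photo,
# so this exact format is an assumption.
genres = "[{'id': 16, 'name':'Animation'}, {'id': 35, 'name':'Comedy'}, {'id': 10751, 'name':'Family'}]"

# Rough equivalent of split(split(genres,"'name':'")[2],"'")[1], 0-based here:
print(genres.split("'name':'")[1].split("'")[0])   # Animation

# A regex in the spirit of regexExtract() that pulls out every name at once:
print(re.findall(r"'name':'([^']+)'", genres))     # ['Animation', 'Comedy', 'Family']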

Related

Data Factory activity to convert to proper JSON

I am running my ADF pipeline with a Data Flow and I am getting JSON output like this:
{"key1":"value1","key2":"[vaq:233,popo:basic5542]"}
However, my actual requirement is to have something like this.
{"key1":"value1","key2":["vaq:233","popo:basic5542"]}
Note the placement of the double quotes for key "key2". In my Data Factory pipeline I am using a Derived Column transformation in the Data Flow, and for key2 I am doing concat("[",Data1,",popo:basic5542]"), where Data1 has the value vaq:233.
How can I get the double quotes in there?
You could use the expression below, instead of the concat function, and check whether it meets your requirement.
array(Data1,"popo:basic5542")
Output:
["pea:P1013","popo:basic5542"]
Considering popo:basic5542 is a static value, you can try the expression below.
concat("[","\"",Data1,"\"",",","\"","popo:basic5542","\"","]")
Or, if you are getting popo and basic5542 dynamically, you can try the below.
concat("[","\"",Data1,"\"",",","\"",popo,":",basic5542,"\"","]")

How to extract the value from a json object in Azure Data Factory

In my ADF pipeline, the final output from a Set Variable activity is something like this: {name:test, value:1234}.
The input coming into this variable is:
{
"variableName": "test",
"value": "test:1234"
}
The expression provided in the Set Variable Item column is @item().ColumnName, and the ColumnName in my JSON file is something like this: "ColumnName":"test:1234"
How can I change it so that I get only 1234? I am only interested in the value coming in here.
It looks like you need to split the value by the colon, which you can do using Azure Data Factory (ADF) expressions and functions: the split function, which splits a string into an array, and the last function, which gets the last item from the array. This works quite neatly in this case:
@last(split(variables('varWorking'), ':'))
Change the variable name to suit your case. You can also use string functions like lastIndexOf or indexof to locate the colon and grab the rest of the string from there. A sample expression would be something like this:
@substring(variables('varWorking'),add(indexof(variables('varWorking'), ':'),1),4)
It's a bit more complicated but may work for you, depending on the requirement.
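As a quick sanity check outside ADF, both ideas can be sketched in Python; the variable value "test:1234" is taken from the question.

var_working = "test:1234"  # hypothetical value of the pipeline variable

# Equivalent of @last(split(..., ':')): split on the colon and keep the last piece
print(var_working.split(':')[-1])                 # 1234

# Equivalent of the substring/indexof idea: everything after the first colon
# (the ADF sample above hard-codes a length of 4 instead of taking the rest)
print(var_working[var_working.index(':') + 1:])   # 1234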
It seems like you are using this inside an iterator, since you have item(); however, I tried it with a simple JSON lookup value:
@last(split(activity('Lookup').output.value[0].ColumnName,':'))

Using the toInteger function with locale and format parameters

I've got a data flow with a csv file as source. The column NewPositive is a string and it contains numbers formatted in European style, with a dot as thousands separator, e.g. 1.019 meaning 1019.
If I use the function toInteger to convert my NewPositive column to an int via toInteger(NewPositive,'#.###','de'), I only get the thousands digit, e.g. 1 for 1.019, and not the rest. Why? For testing I tried creating a constant column: toInteger('1.019','#.###','de'), and it gives 1019 as expected. So why does the function not work for my column? The column is trimmed, and comparing the first value with the equality function, equals('1.019',NewPositive), returns true.
Please note: I know it's very easy to create a workaround with toInteger(replace(NewPositive,'.','')), but I want to learn how to use the toInteger function with the locale and format parameters.
Here is sample data:
Dato;NewPositive
2021-08-20;1.234
2021-08-21;1.789
I was able to repro this and it looks like a bug to me. I have reported it to the ADF team and will let you know once I hear back from them. Since you already have a workaround, please go ahead with that to unblock yourself.
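For comparison, the same European-formatted sample parses as expected outside ADF. A minimal pandas sketch (used here only to illustrate the intended behaviour, not as an ADF fix):

import io
import pandas as pd

# The semicolon-delimited sample data from the question
data = """Dato;NewPositive
2021-08-20;1.234
2021-08-21;1.789"""

# Declaring the thousands separator explicitly yields proper integers
df = pd.read_csv(io.StringIO(data), sep=';', thousands='.', decimal=',')
print(df['NewPositive'].tolist())   # [1234, 1789]
print(df['NewPositive'].dtype)      # int64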

Text classification using Weka

I'm a beginner with Weka and I'm trying to use it for text classification. I have seen how to use the StringToWordVector filter for classification. My question is: is there any way to add more features to the text I'm classifying? For example, if I wanted to add POS tags and named-entity tags to the text, how would I use these features in a classifier?
It depends on the format of your dataset and the preprocessing steps you perform. For instance, let us suppose that you have pre-POS-tagged your texts, looking like:
The_det dog_n barks_v ._p
So you can build a specific tokenizer (see weka.core.tokenizers) to generate two tokens per word: one would be "The" and the other would be "The_det", so you keep the tag information.
If you want only tagged words, then you can just ensure that "_" is not a delimiter in weka.core.tokenizers.WordTokenizer.
My advice is to keep both the words and the tagged words, so a simple way would be to write a script that joins the texts and the tagged texts, as sketched below. From a file containing "The dog barks" and another one containing "The_det dog_n barks_v ._p", it would generate a file with "The The_det dog dog_n barks barks_v . ._p". You may even forget about the order unless you are going to make use of n-grams.
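A minimal Python sketch of that joining step, assuming the plain and tagged texts stay line-aligned and whitespace-tokenized (in practice you would read the two files line by line):

plain = "The dog barks ."
tagged = "The_det dog_n barks_v ._p"

# Interleave each word with its tagged counterpart
merged = " ".join(
    f"{word} {tagged_word}"
    for word, tagged_word in zip(plain.split(), tagged.split())
)
print(merged)  # The The_det dog dog_n barks barks_v . ._p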

DataFrame keying using the pandas groupby method

I'm new to pandas and trying to learn how to work with it. I'm having a problem when trying to use an example I saw in one of Wes's videos and notebooks on my own data. I have a csv file that looks like this:
filePath,vp,score
E:\Audio\7168965711_5601_4.wav,Cust_9709495726,-2
E:\Audio\7168965711_5601_4.wav,Cust_9708568031,-80
E:\Audio\7168965711_5601_4.wav,Cust_9702445777,-2
E:\Audio\7168965711_5601_4.wav,Cust_7023544759,-35
E:\Audio\7168965711_5601_4.wav,Cust_9702229339,-77
E:\Audio\7168965711_5601_4.wav,Cust_9513243289,25
E:\Audio\7168965711_5601_4.wav,Cust_2102513187,18
E:\Audio\7168965711_5601_4.wav,Cust_6625625104,-56
E:\Audio\7168965711_5601_4.wav,Cust_6073165338,-40
E:\Audio\7168965711_5601_4.wav,Cust_5105831247,-30
E:\Audio\7168965711_5601_4.wav,Cust_9513082770,-55
E:\Audio\7168965711_5601_4.wav,Cust_5753907026,-79
E:\Audio\7168965711_5601_4.wav,Cust_7403410322,11
E:\Audio\7168965711_5601_4.wav,Cust_4062144116,-70
I load it into a DataFrame and then group it by "filePath" and "vp"; the code is:
res = df.groupby(['filePath','vp']).size()
res.index
and the output is:
[E:\Audio\7168965711_5601_4.wav Cust_2102513187,
Cust_4062144116, Cust_5105831247,
Cust_5753907026, Cust_6073165338,
Cust_6625625104, Cust_7023544759,
Cust_7403410322, Cust_9513082770,
Cust_9513243289, Cust_9702229339,
Cust_9702445777, Cust_9708568031,
Cust_9709495726]
Now I'm trying to access the index like a dict, as I saw in examples, but when I do
res['Cust_4062144116']
I get an error:
KeyError: 'Cust_4062144116'
I do succeed in getting a result when I use the file path, but as I understand it, and as I saw in previous examples, I should be able to use the vp keys as well, shouldn't I?
Sorry if it's a trivial one, I just can't understand why it works in one example but not in the other.
Rutger, you are not correct. It is possible to partially index a MultiIndex Series; I simply did it the wrong way.
The first level of the index is the file name (e.g. E:\Audio\7168965711_5601_4.wav above) and the second level is vp. Meaning, for each file name I have multiple vps.
Now, this is correct:
res['E:\Audio\7168965711_5601_4.wav']
and will return:
Cust_2102513187 2
Cust_4062144116 8
....
but trying to index by the inner index (the Cust_ indexes) will fail.
You group by two columns and therefore get a MultiIndex in return. This means you also have to slice using those two columns, not with a single index value.
Your .size() on the groupby object converts it into a Series. If you force it into a DataFrame, you can use the .xs method to slice a single level:
res = pd.DataFrame(df.groupby(['filePath','vp']).size())
res.xs('Cust_4062144116', level=1)
That works. If you want to keep it as a Series, boolean indexing can help, something like:
res[res.index.get_level_values(1) == 'Cust_4062144116']
The last option is a bit less readable, but sometimes also more flexible; you could test for multiple values at once, for example:
res[res.index.get_level_values(1).isin(['Cust_4062144116', 'Cust_6073165338'])]
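Putting the two options together on a toy frame shaped like the question's data (file names shortened and scores made up for the sketch):

import pandas as pd

df = pd.DataFrame({
    'filePath': ['a.wav', 'a.wav', 'b.wav'],
    'vp': ['Cust_4062144116', 'Cust_6073165338', 'Cust_4062144116'],
    'score': [-70, -40, 11],
})

res = df.groupby(['filePath', 'vp']).size()

# Partial indexing on the first level works directly:
print(res['a.wav'])

# Selecting on the second level via .xs on the DataFrame version:
print(pd.DataFrame(res).xs('Cust_4062144116', level=1))

# Or via boolean indexing on the Series itself:
print(res[res.index.get_level_values(1) == 'Cust_4062144116'])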