How to capture the null CustName values and their CustIDs in one file, and the remaining CustIDs in another file - DataStage

CustID   CustName
10       Ally
20       null
30       null
40       Liza
50       null
60       Mark

You need to generate an artificial key (e.g. line number) on each file. Then have the source of CustID as the stream input to a Lookup stage, and the source of CustName as the reference input of the Lookup stage, where the lookup key is LineNumber. Set the Lookup Failed rule to suit your own needs.
One way to generate line numbers is a Column Generator stage operating in sequential mode.
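Outside DataStage, the same line-number join can be sketched in Python with pandas for illustration (the file names are assumptions; in the job itself the Lookup stage performs this join):

import pandas as pd

# Each source file has a single column; row order is the implicit key.
ids = pd.read_csv("cust_ids.txt", header=None, names=["CustID"])        # assumed file name
names = pd.read_csv("cust_names.txt", header=None, names=["CustName"])  # assumed file name

# Generate the artificial key: the line number of each record.
ids["LineNumber"] = range(len(ids))
names["LineNumber"] = range(len(names))

# Equivalent of the Lookup stage keyed on LineNumber; how="left" keeps
# CustIDs whose lookup fails (no matching CustName row).
joined = ids.merge(names, on="LineNumber", how="left")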

You can use a Transformer stage with two output links, using the output link constraints to check for null values and split the stream.
As the constraints, just write IsNull(DSLink2.CustName) and IsNotNull(DSLink2.CustName) respectively.
Note: you can also write !IsNull(col) or Not(IsNull(col)) instead of IsNotNull(col).
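The two-constraint split, sketched the same way in pandas (customers.csv is an assumed combined input with CustID and CustName columns; pandas reads the literal string "null" as NaN by default):

import pandas as pd

df = pd.read_csv("customers.csv")  # assumed combined input

# IsNull(CustName) feeds one output link, IsNotNull(CustName) the other.
nulls = df[df["CustName"].isna()]
rest = df[df["CustName"].notna()]
nulls.to_csv("null_custnames.csv", index=False)  # assumed output name
rest.to_csv("other_custids.csv", index=False)    # assumed output name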

Related

Only saving files without null values in NiFi

An absolute newbie here, trying out NiFi and PostgreSQL on Docker Compose.
I have a sample CSV file with 4 columns.
I want to split this CSV file into two based on whether a row contains a null value or not.
Grade ,BreedNm ,Gender ,Price
C++ ,beef_cattle ,Female ,10094
C++ ,milk_cow ,Female ,null
null ,beef_cattle ,Male ,12704
B++ ,milk_cow ,Female ,16942
For example, the table above should be split into two tables, one containing rows 1 and 4 and the other rows 2 and 3, and each saved into a PostgreSQL table.
Below is what I have tried so far. I was trying to:
1. split the FlowFile into two, keeping rows without null values on one side and rows with null values on the other;
2. write each of them into a table, named 'valid' and 'invalid' respectively.
But I do not know how to split the CSV file and save the results as PostgreSQL tables through NiFi. Can anyone help?
What you could do is use RouteOnContent with the "Content Must Contain Match" match requirement, with the match being null. Anything that matches null would be routed one way, and anything not matching null would be routed another way. Not sure if it's possible the way you're doing it, but that is one possibility. The match could be something like (.*?)null
I used a QueryRecord processor with two SQL statements, one selecting the rows with a null value and the other the rows without, and it worked as intended!
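For comparison, here is a minimal sketch of the same valid/invalid split and PostgreSQL load outside NiFi, in Python with pandas and SQLAlchemy (the connection string and input file name are assumptions; the valid/invalid table names come from the question):

import pandas as pd
from sqlalchemy import create_engine

# pandas treats the literal string "null" as NaN by default.
df = pd.read_csv("cattle.csv")  # assumed input file

engine = create_engine("postgresql://user:pass@localhost:5432/mydb")  # assumed connection
df.dropna().to_sql("valid", engine, if_exists="replace", index=False)
df[df.isna().any(axis=1)].to_sql("invalid", engine, if_exists="replace", index=False)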

Use Azure Data Factory to Conditionally Split data to different tables

I want to use Azure Data Factory to split data similar to the data below into different tables, based on the Name column. Ideally this could be done dynamically, so that if a new Name value is added, its data is automatically split out into a separate table. I know how to manually specify a Conditional Split; I'm just wondering if there's any way to write an expression or similar that would dynamically split these into separate tables, i.e. tbl_apple would have the first three rows, tbl_banana the next two, etc.?
Thanks!
Name      Number   label
apple     1        a
apple     2        a
apple     3        a
banana    001      b
banana    002      b
carrot    0        dfb
carrot    1        dfb
carrot    2        dfb
carrot    3        dfb
plum      010      p
avocado   021      v
avocado   022      v
You can use Data Flow Script for a conditional split, but a dynamic split condition isn't possible.
You can refer to the syntax below to write a conditional split script:
<incomingStream>
split(
    <conditionalExpression1>,
    <conditionalExpression2>,
    ...
    disjoint: {true | false}
) ~> <splitTx>@(stream1, stream2, ..., <defaultStream>)
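Filled in with the sample data above, a static version of that split might look like the following (the stream and output names are placeholders, not from the question):

source1
split(
    Name == 'apple',
    Name == 'banana',
    disjoint: false
) ~> SplitByName@(apples, bananas, others)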
If you want it done dynamically, you need to manage it programmatically. You can choose any data manipulation language, such as SQL or Python, read all the unique values in from the table, and split based on those.
Use a Custom activity to run such scripts.
To move data to/from a data store that the service does not support, or to transform/process data in a way that isn't supported by the service, you can create a Custom activity with your own data movement or transformation logic and use the activity in a pipeline.
For example, using the Python pandas module you can find the unique values in a DataFrame. Refer to the syntax below:
<DataFrame_Name>.<Column_Name>.unique()
This returns all the unique values in the given column as an array.
Now you can loop over the list and store the records for each unique value in a separate table.
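A minimal sketch of that loop in pandas, assuming the source table has been exported to fruit.csv and using a local SQLite database to stand in for the real target (both are assumptions):

import pandas as pd
from sqlalchemy import create_engine

df = pd.read_csv("fruit.csv")               # assumed export of the source table
engine = create_engine("sqlite:///split.db")  # assumed stand-in target

# One table per distinct Name value, e.g. tbl_apple, tbl_banana, ...
for name in df["Name"].unique():
    df[df["Name"] == name].to_sql(f"tbl_{name}", engine, if_exists="replace", index=False)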

Mapping Data Flows Dynamic Column Updates

I have a text input source. This has over 100 columns, so I won't show all of them here; a cut-down view of the data would be:
CustomerNo   DOB          DOD    Status
01418495     01/02/1940   NULL   1
01418496     01/01/1930   NULL   1
The users want to be able to update/override any of these columns during processing by providing another input text file containing the PK (CustomerNo) and the key/value pairs of the columns to be updated, e.g.:
CustomerNo   Variable   New Value
01418495     DOB        01/12/1941
01418496     DOD        01/01/2021
01418496     Status     0
Can this data be used to create dynamic columns somehow that update the customer records regardless of which columns they want to update? In the example above this would result in:
CustomerNo   DOB          DOD          Status
01418495     01/12/1941   NULL         1
01418496     01/01/1930   01/01/2021   0
I have looked at the documentation but don't see any examples of how something like this could be achieved. Thanks in advance for any advice.
You would use a technique similar to what I describe in this video: https://www.youtube.com/watch?v=q7W6J-DUuJY. What I've done is create a file with rules that contain expressions, and then apply those rules dynamically inside my data flow.
The key to making this work is using the expr() function to dynamically evaluate the expression from the external file.
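The underlying pattern, sketched in pandas rather than a data flow (the file names are assumptions; in the data flow itself the rules file would be joined in and expr() would evaluate the rule text):

import pandas as pd

# dtype=str preserves leading zeros in CustomerNo, e.g. "01418495".
customers = pd.read_csv("customers.csv", dtype=str)  # assumed base file
updates = pd.read_csv("updates.csv", dtype=str)      # assumed override file

# Apply each (CustomerNo, Variable, New Value) override to the matching
# row and column, regardless of which column is being updated.
for _, row in updates.iterrows():
    mask = customers["CustomerNo"] == row["CustomerNo"]
    customers.loc[mask, row["Variable"]] = row["New Value"]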

Azure Data Factory: Flattening/normalizing a column from a CSV file using an Azure Data Factory activity

I have pulled a CSV file from one of our sources using ADF, and there is one column called "attributes" which contains multiple fields (in the form of key-value pairs). Now I want to expand that column into different fields (columns). Below is a sample:
leadId activityDate activityTypeId campaignId primaryAttributeValue attributes
1234 2020-06-22T00:00:44Z 46 33686 Mail {"Description":"Clicked: https://stepuptostepout.com/","Source":"Lead action","Date":"2020-06-21 19:00:44"}
5678 2020-06-22T00:01:54Z 13 33128 SMS {"Reason":"Changed","New Value":110,"Old Value":null,"Source":"Marketo Flow Action"}
Here the attributes column has different key-value pairs, and I want them as separate columns so that I can store them in an Azure SQL Database:
attributes
{"Reason":"Changed","New Value":110,"Old Value":null,"Source":"Marketo"}
I want them as:
Reason New Value Old Value Source
Changed 110 null Marketo
I am using Azure Data Factory. Please help!
Updating this:
One more thing I have noticed in my data is that the keys are not uniform; if a key (say 'Source') is present for one leadId, it might be missing for another, making this more complicated. Hence having a separate column for each attribute key might not be a good idea.
Thus, we can have a separate table for the 'attributes' field, with LeadID, AttributeKey and AttributeValue as columns (we can join this with our main table using LeadID). The attribute table will look like:
LeadID AttributeKey AttributeValue
5678 Reason Changed
5678 New Value 110
5678 Old Value null
5678 Source Marketo
Can you help me with how I can achieve this using ADF?
You can use a data flow to do this. Below is my test sample.
Setting of source1
Setting of Filter1
instr(attributes,'Reason') != 0
Setting of DerivedColumn1
Here is my expression; it's complex:
#(Reason=translate(split(split(attributes,',')[1],':')[2],'"',''),
NewValue=translate(split(split(attributes,',')[2],':')[2],'"',''),
OldValue=translate(split(split(attributes,',')[3],':')[2],'"',''),
Source=translate(translate(split(split(attributes,',')[4],':')[2],'"',''),'}',''))
Setting of Select1
Here is the result:
By the way, if your file were JSON, this might be simpler to do than with CSV.
Hope this can help you :).
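Since the attributes strings are JSON, the unpivoted attribute table described in the question's update can also be built outside ADF; here is a rough pandas sketch (leads.csv is an assumed CSV export of the source data, with leadId and attributes columns as in the sample):

import json
import pandas as pd

df = pd.read_csv("leads.csv")  # assumed input file

# Explode each JSON attributes string into (LeadID, AttributeKey, AttributeValue) rows.
rows = [
    {"LeadID": r.leadId, "AttributeKey": k, "AttributeValue": v}
    for r in df.itertuples()
    for k, v in json.loads(r.attributes).items()
]
attribute_table = pd.DataFrame(rows)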

How to retrieve a list of Columns from a single row in Cassandra?

The below is a sample of my Cassandra CF.
column1 column2 column3 ......
row1 : name:abay,value:10 name:benny,value:7 name:catherine,value:24 ................
ComparatorType:utf8
How can I fetch columns with the names ('abay', 'john', 'peter', 'allen') from this row in a single query using the Hector API?
The number of names in the list may vary every time.
I know that I can get them in sorted order using a SliceQuery, but there are cases when I need to fetch data randomly, as mentioned above.
Kindly help me.
Based on your query, it seems you have two options.
If you only need to run this query occasionally, you can get all columns for the row and filter them on the client. If you have at most a few thousand columns, this should be ok for an occasional query.
If you need to run this frequently, you'll want to write the data such that you can query using name as the key. This probably means you'll have to write the data twice into two CFs, where one is by your current key, and the other is by name. This is a common Cassandra tactic.
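A toy illustration of that dual-write tactic, using plain Python dicts in place of the two column families (purely conceptual; no Hector-specific code is implied):

# Two "column families": one keyed by row key, one keyed by name.
by_row_key = {}
by_name = {}

def insert(row_key, name, value):
    # Write the same data twice so either access path is a direct lookup.
    by_row_key.setdefault(row_key, {})[name] = value
    by_name[name] = {"row": row_key, "value": value}

insert("row1", "abay", 10)
insert("row1", "benny", 7)

# Fetching an arbitrary set of names no longer needs a slice over the row.
wanted = ["abay", "benny"]
result = {n: by_name[n] for n in wanted if n in by_name}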