Mapping Data Flows Dynamic Column Updates - azure-data-factory

I have a text input source. This has over 100 columns so I won't show all of them here - a cut-down view of the data would be:
CustomerNo   DOB          DOD    Status
01418495     01/02/1940   NULL   1
01418496     01/01/1930   NULL   1
The users want to be able to update/override any of these columns during processing by providing another input text file containing the PK (CustomerNo) and the key/value pairs of the columns to be updated, e.g.:
CustomerNo   Variable   New Value
01418495     DOB        01/12/1941
01418496     DOD        01/01/2021
01418496     Status     0
Can this data be used to create dynamic columns somehow that update the customer records regardless of which columns they want to update? In the example above this would result in:
CustomerNo   DOB          DOD          Status
01418495     01/12/1941   NULL         1
01418496     01/01/1930   01/01/2021   0
I have looked at the documentation but don't see any examples of how something like this could be achieved. Thanks in advance for any advice.

You would use a technique similar to what I describe in this video: https://www.youtube.com/watch?v=q7W6J-DUuJY. What I've done is create a file of rules that contain expressions, and then apply those rules dynamically inside my data flow.
The key to making this work is the expr() function, which dynamically evaluates the expression read from the external file.
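A minimal sketch of that pattern, assuming the override file has been joined to the customer stream on CustomerNo so that each row also carries the rule columns (called Variable and NewValue below; both names are hypothetical): a Derived Column transformation with a column pattern can swap in the override wherever the column name matches the rule.

Matching condition:  true()
Output column name:  $0
Value expression:    iif($0 == Variable && !isNull(NewValue), toString(NewValue), toString($$))

Here $0 is the matched column's original name and $$ is its current value, so every column keeps its value unless it is the one named in the rule. And if the override file carries whole expressions rather than plain values, a (hypothetical) ruleExpression column can be evaluated per row with expr(ruleExpression), which is the technique the video demonstrates.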

Related

Combining columns in Qlik

I have an Excel sheet that has two separate columns of data that I need combined for my table in Qlik. I know it would be easy to combine the two in the Excel document, but because this is a data feed I would prefer not to do any extra work. One column has the first name and the other the last. Thank you.
I tried to concat but that did not work.
It sounds like what you're trying to achieve is string concatenation, where you'd combine two or more strings together. It'd be the same concept for fields, as long as their values can be coerced to a string type. In Qlik, this is done very simply by using the ampersand & operator. You can use this in three places:
Data Manager
If done in the Data Manager, you are creating a new calculated field. This can be done by editing the table that you're loading in from Excel, selecting Add field >> Calculated field, and then using an expression like this:
first_name & ' ' & last_name
What that expression is doing is taking the field first_name and concatenating its values with a space ' ' and then concatenating the values of the last_name field.
So your new field, which I'm naming full_name here, would look like this:
first_name   last_name   full_name
Chris        Schaeuble   Chris Schaeuble
Stefan       Stoichev    Stefan Stoichev
Austin       Spivey      Austin Spivey
Then, after you load the data, you will have a new field with the combined names.
Data Load Editor
Doing this in the Data Load Editor will also result in a new field, using the exact same expression:
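A minimal sketch of that script (the table name and file path are hypothetical):

// Build full_name while loading the two name fields from Excel.
Customers:
LOAD
    first_name,
    last_name,
    first_name & ' ' & last_name AS full_name
FROM [lib://DataFiles/customers.xlsx]
(ooxml, embedded labels, table is Sheet1);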
Chart expression
The other option you have is to use this expression "on-the-fly" in a chart without creating a new column in the app data model like the first two options. Just use that same expression from above in a chart field expression and you'll get the same result.
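For instance, as a calculated dimension (the leading equals sign tells Qlik to evaluate the string as an expression):

=first_name & ' ' & last_name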

How to update an attribute inside a JSONB column in PostgreSQL while keeping the data already inside

I have a situation where, in a table called tokens, I have a column called data.
The data column consists of something like this as a '{}'::jsonb:
{"recipientId": "xxxxxxxx"}
My goal is to update the old data to the new DB design and requirements:
{"recipientIds": ["xxxxxxxx"]}
The reason is that the naming was changed and the value will be an array of recipients.
I don't know how to achieve this:
change recipientId to recipientIds
change the value format to an array, without losing the data
Also, this needs to be done only where I have a type in ('INVITE_ECONSENT_SIGNATURE', 'INVITE_ECONSENT_RECIPIENT').
The table is a simple one which contains a few columns; data is the only one typed as '{}'::jsonb.
id   type    data
1    type1   data1
2    type2   data1
Edit: here is what I tried, which partially solved my problem, but I cannot understand how to set the value to be [value]:
update "token"
set "data" = data - 'recipientId' || jsonb_build_object('recipientIds', data -> 'recipientId')
where "type" in ('INVITE_ECONSENT_RECIPIENT')
I can now get recipientIds: value, but I need recipientIds: [value].
You were close; you need to pass an array as the second parameter of the jsonb_build_object() function:
(data - 'recipientId')||jsonb_build_object(
'recipientIds',
jsonb_build_array(data -> 'recipientId')
)
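Put together with the type filter from the question, the whole statement would look like this (a sketch; the data ? 'recipientId' guard is my addition, so that rows that have already been migrated are skipped):

-- Rename recipientId to recipientIds, wrapping the old scalar value in an array.
UPDATE "token"
SET "data" = (data - 'recipientId')
          || jsonb_build_object('recipientIds', jsonb_build_array(data -> 'recipientId'))
WHERE "type" IN ('INVITE_ECONSENT_SIGNATURE', 'INVITE_ECONSENT_RECIPIENT')
  AND data ? 'recipientId';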

Azure Data Factory: Flattening/normalizing a column from a CSV file using Azure Data Factory activity

I have pulled a CSV file from one of our sources using ADF, and there is one column called "attributes" which contains multiple fields (in the form of key-value pairs). Now I want to expand that column into different fields (columns). Below is a sample of that:
leadId activityDate activityTypeId campaignId primaryAttributeValue attributes
1234 2020-06-22T00:00:44Z 46 33686 Mail {"Description":"Clicked: https://stepuptostepout.com/","Source":"Lead action","Date":"2020-06-21 19:00:44"}
5678 2020-06-22T00:01:54Z 13 33128 SMS {"Reason":"Changed","New Value":110,"Old Value":null,"Source":"Marketo Flow Action"}
Here the attributes column has different key-value pairs, and I want them in different columns so that I can store them in an Azure SQL Database:
attributes
{"Reason":"Changed","New Value":110,"Old Value":null,"Source":"Marketo"}
I want them as:
Reason New Value Old Value Source
Changed 110 null Marketo
I am using Azure Data Factory. Please help!
Update:
One more thing I have noticed in my data is that the keys are not uniform; a key (say 'Source') that is present for one leadId might be missing for another, making this more complicated. Hence having a separate column for each attribute key might not be a good idea.
Thus, we can have a separate table for the 'attributes' field with LeadID, AttributeKey, AttributeValue as columns (we can join this with our main table using LeadID). The attribute table will look like:
LeadID AttributeKey AttributeValue
5678 Reason Changed
5678 New Value 110
5678 Old Value null
5678 Source Marketo
Can you help me with how I can achieve this using ADF?
You can use a data flow to do this. Below is my test sample.
Setting of source1
Setting of Filter1
instr(attributes,'Reason') != 0
Setting of DerivedColumn1
Here is my expression; it's complex:
@(Reason=translate(split(split(attributes,',')[1],':')[2],'"',''),
NewValue=translate(split(split(attributes,',')[2],':')[2],'"',''),
OldValue=translate(split(split(attributes,',')[3],':')[2],'"',''),
Source=translate(translate(split(split(attributes,',')[4],':')[2],'"',''),'}',''))
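Each field works the same way: split attributes on commas, split the chosen chunk on ':', take the part after the colon (data flow array indexes are 1-based), and strip the quote characters, plus the trailing '}' for the last field, with translate(). Traced for the Reason field of the second sample row:

split(attributes, ',')[1]
    returns: {"Reason":"Changed"
split(split(attributes, ',')[1], ':')[2]
    returns: "Changed"
translate(split(split(attributes, ',')[1], ':')[2], '"', '')
    returns: Changed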
Setting of Select1
Here is the result:
By the way, if your file is JSON, it may be simpler to do this than with CSV.
Hope this can help you:).

Add contents of one column into another without overwriting

I can copy the contents of one column to another using the SQL UPDATE easily. But I need to do it without deleting the content already there, so in essence I want to append one column to another without overwriting the other's original content.
I have a column called notes; then, for some unknown reason, after several months I added another column called product_notes, and after 2 days realised that I have two sets of notes I urgently need to merge.
Usually when making a note we just add to any note already there with a form. I need to put these two columns together like that, keeping any note in the first column, e.g.
Column notes = Out of stock Pete 040618--- ordered 200 units Jade
050618 --- 200 units received Lila 080618
and
Column product_notes = 5 units left Dave 120618 --- unit 10724 unacceptable quality noted in list Dave 130618
I need to put them together with our spacer of --- without losing the first column's content so the result needs to be like this for my test case:
Column notes = Out of stock Pete 040618--- ordered 200 units Jade
050618 --- 200 units received Lila 080618 --- 5 units left Dave 120618 --- unit 10724 unacceptable quality noted in list Dave 130618
It's simple -
update table1 set notes = notes || '---' || product_notes;
The solution provided by @MaheshHViraktamath is fine, but the problem with simple string concatenation is that if any of the items being concatenated are NULL, the whole result becomes NULL.
Another potential issue is if either field is empty. In that case you might get a result of field a--- or ---field b.
To guard against the first scenario (without putting checks in the WHERE clause) you can use CONCAT_WS like so: CONCAT_WS('---', notes, product_notes). This will combine the two (or however many you put in there) fields with the first parameter, i.e. '---'. If either of those two fields are NULL, the separator won't be used, so you won't get a result with the separator prepended or appended.
There are two issues with the above: if both fields are NULL, the result isn't NULL but an empty string. To handle this case just put it in a NULLIF: NULLIF(CONCAT_WS('---', notes, product_notes), '') so that NULL is returned if both fields are NULL.
The other issue is if either field is empty, the separator will still be used. To guard against this scenario (and only you will know whether it's a scenario worth guarding against, or if this is even desired, based on your data), put each field in a NULLIF as well: NULLIF(CONCAT_WS('---', NULLIF(notes, ''), NULLIF(product_notes, '')), '')
As a result you get: UPDATE your_table SET notes = NULLIF(CONCAT_WS('---', NULLIF(notes, ''), NULLIF(product_notes, '')), '');
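For illustration, here is how the guarded expression behaves in each of the edge cases discussed, as a standalone query you can run against sample values (the expected result is noted next to each row):

SELECT NULLIF(CONCAT_WS('---', NULLIF(notes, ''), NULLIF(product_notes, '')), '')
FROM (VALUES
    ('note a', 'note b'),  -- 'note a---note b'
    ('note a', NULL),      -- 'note a'
    ('note a', ''),        -- 'note a' (separator suppressed)
    (NULL, NULL),          -- NULL, not ''
    ('', '')               -- NULL
) AS t(notes, product_notes);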

HBase - rowkey basics

NOTE: It's only been a few hours since I began with HBase, and I come from an RDBMS background :P
I have an RDBMS-like table CUSTOMERS having the following columns:
CUSTOMER_ID STRING
CUSTOMER_NAME STRING
CUSTOMER_EMAIL STRING
CUSTOMER_ADDRESS STRING
CUSTOMER_MOBILE STRING
I have thought of the following HBase equivalent:
table: CUSTOMERS, rowkey: CUSTOMER_ID
column family: CUSTOMER_INFO
columns: NAME, EMAIL, ADDRESS, MOBILE
From whatever I have read, a primary key in an RDBMS table is roughly similar to an HBase table's rowkey. Accordingly, I want to keep CUSTOMER_ID as the rowkey.
My questions are dumb and straightforward:
1. Irrespective of whether I use a shell command or the HBaseAdmin Java class, how do I define the rowkey? I didn't find anything to do it either in the shell or in the HBaseAdmin class (something like HBaseAdmin.createSuperKey(...)).
2. Given an HBase table, how do I determine the rowkey details, i.e. which values are used as the rowkey?
3. I understand that rowkey design is a critical thing. Suppose a customer id receives values like CUST_12345, CUST_34434 and so on: how will HBase use the rowkey to decide in which region particular rows reside (assuming that the region concept is similar to DB horizontal partitioning)?
*** Edited to add a sample code snippet
I'm simply trying to create one row in the customer table using 'put' in the shell. I did this:
hbase(main):011:0> put 'CUSTOMERS', 'CUSTID12345', 'CUSTOMER_INFO:NAME','Omkar Joshi'
0 row(s) in 0.1030 seconds
hbase(main):012:0> scan 'CUSTOMERS'
ROW COLUMN+CELL
CUSTID12345 column=CUSTOMER_INFO:NAME, timestamp=1365600052104, value=Omkar Joshi
1 row(s) in 0.0500 seconds
hbase(main):013:0> put 'CUSTOMERS', 'CUSTID614', 'CUSTOMER_INFO:NAME','Prachi Shah', 'CUSTOMER_INFO:EMAIL','Prachi.Shah@lntinfotech.com'
ERROR: wrong number of arguments (6 for 5)
Here is some help for this command:
Put a cell 'value' at specified table/row/column and optionally
timestamp coordinates. To put a cell value into table 't1' at
row 'r1' under column 'c1' marked with the time 'ts1', do:
hbase> put 't1', 'r1', 'c1', 'value', ts1
hbase(main):014:0> put 'CUSTOMERS', 'CUSTID12345', 'CUSTOMER_INFO:EMAIL','Omkar.Joshi@lntinfotech.com'
0 row(s) in 0.0160 seconds
hbase(main):015:0>
hbase(main):016:0* scan 'CUSTOMERS'
ROW COLUMN+CELL
CUSTID12345 column=CUSTOMER_INFO:EMAIL, timestamp=1365600369284, value=Omkar.Joshi@lntinfotech.com
CUSTID12345 column=CUSTOMER_INFO:NAME, timestamp=1365600052104, value=Omkar Joshi
1 row(s) in 0.0230 seconds
As put takes a maximum of 5 arguments, I was not able to figure out how to insert an entire row with one put command. This results in incremental versions of the same row, which isn't required, and I'm not sure whether CUSTOMER_ID is being used as the rowkey!
Thanks and regards!
1. You don't; the rowkey (and any other column, for that matter) is a byte array, so you can put whatever you want there, even encapsulate sub-entities.
2. Not sure I understand that one: each value is stored as key + column family + column qualifier + datetime + value, so the key is right there.
3. HBase figures out which region a record will go to as it goes. When regions get too big, it repartitions. Also, from time to time, when there's too much junk, HBase performs compactions to rearrange the files. You can control that by pre-partitioning yourself, which is something you should definitely think about in the future. However, since it seems you are just starting out with HBase, you can begin by letting HBase take care of that. Once you understand your usage patterns and data better, you will probably want to go over that again.
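On the shell aside in the question: the shell's put writes one cell at a time, but the Java client API lets you add several columns to a single Put, so the whole row goes in as one mutation. A sketch using the classic HTable/Put API from the HBase version in the question (this API is deprecated in modern releases):

// Write NAME and EMAIL for one rowkey in a single mutation.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutCustomer {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "CUSTOMERS");
        Put put = new Put(Bytes.toBytes("CUSTID614"));  // the rowkey
        put.add(Bytes.toBytes("CUSTOMER_INFO"), Bytes.toBytes("NAME"),
                Bytes.toBytes("Prachi Shah"));
        put.add(Bytes.toBytes("CUSTOMER_INFO"), Bytes.toBytes("EMAIL"),
                Bytes.toBytes("Prachi.Shah@lntinfotech.com"));
        table.put(put);  // both cells arrive with one row mutation
        table.close();
    }
}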
You can read more about HBase schema design in the Apache HBase reference guide.