Talend: Transform JSON lines to columns, extracting column names from JSON

I have a JSON REST response with a structure somewhat like this one:
{
"data" : [
{
"fields" : [
{ "label" : "John", "value" : "John" },
{ "label" : "Smith", "value" : "/person/4315" },
{ "label" : "43", "value" : "43" },
{ "label" : "London", "value" : "/city/54" }
]
},
{
"fields" : [
{ "label" : "Albert", "value" : "Albert" },
{ "label" : "Einstein", "value" : "/person/154" },
{ "label" : "141", "value" : "141" },
{ "label" : "Princeton", "value" : "/city/9541" }
]
}
],
"columns" : ["firstname", "lastname", "age", "city"]
}
I'm looking for a way to transform this data into rows like:
| firstname_label | firstname_value | lastname_label | lastname_value | age_label | age_value | city_label | city_value |
|-----------------|-----------------|----------------|----------------|-----------|-----------|------------|------------|
| John            | John            | Smith          | /person/4315   | 43        | 43        | London     | /city/54   |
| Albert          | Albert          | Einstein       | /person/154    | 141       | 141       | Princeton  | /city/9541 |
Of course the number of columns and their names may change, so I don't know the schema before runtime.
I could probably write Java to handle this, but I'd like to know if there's a more standard way.
I'm new to Talend and have spent hours trying, but since my attempts were probably completely wrong I won't describe them here.
Thanks for your help.

Here's a completely dynamic solution I put together.
First, you need to read the JSON in order to get the column list; this is done in tExtractJSONFields_2.
Then you store the columns and their positions in a tHashOutput (you need to unhide it in File > Project properties > Designer > Palette settings). In tMap_2, you get the position of each column using a sequence:
Numeric.sequence("s", 1, 1)
The output of this subjob is:
|position|column   |
|--------|---------|
|1       |firstname|
|2       |lastname |
|3       |age      |
|4       |city     |
The second step is to read the JSON again, in order to parse the fields property.
As in step 1, you need to add a position to each field, relative to the columns. Here's the expression I used to get the sequence:
(Numeric.sequence("s1", 0, 1) % ((Integer)globalMap.get("tHashOutput_1_NB_LINE"))) + 1
Note that I'm using a different sequence name, because sequences keep their value throughout the job. I'm using the number of columns from tHashOutput_1 in order to keep things dynamic.
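To illustrate what this expression computes (a standalone sketch in plain Scala, not Talend code; the column count of 4 comes from the example above): taking a 0-based running counter modulo the number of columns and adding 1 cycles through the column positions for each record's fields.
// Hypothetical illustration of the modulo arithmetic used above
val nbColumns = 4                                         // firstname, lastname, age, city
val positions = (0 until 8).map(n => (n % nbColumns) + 1)
// positions == Vector(1, 2, 3, 4, 1, 2, 3, 4)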
Here's the output from this subjob:
|position|label    |value        |
|--------|---------|-------------|
|1       |John     |John         |
|2       |Smith    |/person/4315 |
|3       |43       |43           |
|4       |London   |/city/54     |
|1       |Albert   |Albert       |
|2       |Einstein |/person/154  |
|3       |141      |141          |
|4       |Princeton|/city/9541   |
In the last subjob, you need to join the fields data with the columns, using the column position stored alongside each of them.
In tSplitRow_1 I generate 2 rows for each incoming row. Each row is a key-value pair. The first row's key is <columnName>_label (like firstname_label, lastname_label), its value being the label from the fields. The second row's key is <columnName>_value, and its value is the value from the fields.
Once again, we need to add a position to our data in tMap_4, using this expression:
(Numeric.sequence("s2", 0, 1) / ((Integer)globalMap.get("tHashOutput_1_NB_LINE") * 2)) + 1
Note that since we have twice as many rows coming out of tSplitRow, I multiply the number of columns by 2.
This will attribute the same ID for the data that needs to be on the same row in the output file.
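To make the arithmetic concrete, here is a similar standalone sketch (again plain Scala, not the Talend expression itself) showing how dividing a 0-based counter by twice the column count groups the 16 key-value rows of this example into ids 1 and 2:
// Hypothetical illustration: 4 columns * 2 rows per column = 8 key-value rows per record
val nbColumns = 4
val ids = (0 until 16).map(n => (n / (nbColumns * 2)) + 1)
// ids == Vector(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2)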
The output of this tMap will be like:
|id|col_label      |col_value    |
|--|---------------|-------------|
|1 |firstname_label|John         |
|1 |firstname_value|John         |
|1 |lastname_label |Smith        |
|1 |lastname_value |/person/4315 |
|1 |age_label      |43           |
|1 |age_value      |43           |
|1 |city_label     |London       |
|1 |city_value     |/city/54     |
|2 |firstname_label|Albert       |
|2 |firstname_value|Albert       |
|2 |lastname_label |Einstein     |
|2 |lastname_value |/person/154  |
|2 |age_label      |141          |
|2 |age_value      |141          |
|2 |city_label     |Princeton    |
|2 |city_value     |/city/9541   |
This leads us to the last component, tPivotToColumnsDelimited, which pivots our rows to columns using the unique ID.
And the final result is a CSV file like:
id;firstname_label;firstname_value;lastname_label;lastname_value;age_label;age_value;city_label;city_value
1;John;John;Smith;/person/4315;43;43;London;/city/54
2;Albert;Albert;Einstein;/person/154;141;141;Princeton;/city/9541
Note that you end up with an extraneous column at the beginning (the row id), which can easily be removed by reading the file back and dropping it.
I tried adding a new column along with the corresponding fields in the input JSON, and it works as expected.

Related

How to implement UniqueCount in Spark Scala

I am trying to implement UniqueCount in Spark Scala.
Below is the transformation I am trying to implement:
case when ([last_revision]=1) and ([source]="AR") then UniqueCount([review_uuid]) OVER ([encounter_id]) end
Input
|last_revision|source|review_uuid |encounter_id|
|-------------|------|--------------|------------|
|1 |AR |123-1234-12345|7654 |
|1 |AR |123-7890-45678|7654 |
|1 |MR |789-1234-12345|7654 |
Expected Output
|last_revision|source|review_uuid |encounter_id|reviews_per_encounter|
|-------------|------|--------------|------------|---------------------|
|1 |AR |123-1234-12345|7654 |2 |
|1 |AR |123-7890-45678|7654 |2 |
|1 |MR |789-1234-12345|7654 |null |
My code:
.withColumn("reviews_per_encounter", when(col("last_revision") === "1" && col("source") === "AR", size(collect_set(col("review_uuid")).over(Window.partitionBy(col("encounter_id"))))))
My Output:
|last_revision|source|review_uuid |encounter_id|reviews_per_encounter|
|-------------|------|--------------|------------|---------------------|
|1 |AR |123-1234-12345|7654 |3 |
|1 |AR |123-7890-45678|7654 |3 |
|1 |MR |789-1234-12345|7654 |null |
Schema:
last_revision : integer
source : string
review_uuid : string
encounter_id : string
reviews_per_encounter : integer
In place of the expected 2 I am getting the value 3; I am not sure what mistake I am making here.
Please help. Thanks
The output makes perfect sense. As I commented, this is because this:
size(collect_set(col("review_uuid")))
Means:
give me the count of unique review_uuids in the whole dataframe (result: 3)
What you're looking for is:
give me the count of unique review_uuids only if the source in the corresponding row is equal to "AR" and "last_revision" is 1 (result: 2)
Notice the difference: this doesn't actually need window functions and over. You can achieve it using either subqueries or a self join; here's how you can do it with a self left join:
import org.apache.spark.sql.functions._

df.join(
  // distinct review_uuid count computed only over rows where source is "AR"
  // and last_revision is 1 (a one-row dataframe)
  df.where(col("last_revision") === lit(1) && col("source") === "AR")
    .select(count_distinct(col("review_uuid")) as "reviews_per_encounter"),
  // attach that count only to rows matching the same condition
  col("last_revision") === lit(1) && col("source") === "AR",
  "left"
)
Output:
+-------------+------+-----------+------------+---------------------+
|last_revision|source|review_uuid|encounter_id|reviews_per_encounter|
+-------------+------+-----------+------------+---------------------+
| 1| AR| 12345| 7654| 2|
| 1| AR| 45678| 7654| 2|
| 1| MR| 78945| 7654| null|
+-------------+------+-----------+------------+---------------------+
(I used some random uuid's, they were too long to copy :) )
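As a side note (not part of the answer above, just a hedged sketch): if the count really must be computed per encounter_id, as the OVER ([encounter_id]) in the original expression suggests, one possible variant keeps the window but makes the collect_set conditional, so only AR rows with last_revision = 1 contribute.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Sketch, untested: collect_set ignores nulls, so wrapping review_uuid in a
// conditional when() means only matching rows are counted per encounter_id.
val isAr = col("last_revision") === 1 && col("source") === "AR"
val w = Window.partitionBy("encounter_id")

df.withColumn(
  "reviews_per_encounter",
  when(isAr, size(collect_set(when(isAr, col("review_uuid"))).over(w)))
)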

Convert every value of a dataframe

I need to modify the values of every column of a dataframe so that they are all enclosed in double quotes after mapping, while the dataframe still retains its original structure with the headers.
I tried mapping the values by converting the rows to sequences, but the output dataframe loses its headers.
With this read in as input dataframe:
|prodid|name |city|
+------+-------+----+
|1 |Harshit|VNS |
|2 |Mohit |BLR |
|2 |Mohit |RAO |
|2 |Mohit |BTR |
|3 |Rohit |BOM |
|4 |Shobhit|KLK |
I tried the following code.
val columns = df.columns
df.map{ row =>
  row.toSeq.map{ col => "\"" + col + "\"" }
}.toDF(columns:_*)
But it throws an error stating there is only one column, i.e. value, in the mapped dataframe.
This is the actual result (if I remove ".toDF(columns:_*)"):
| value|
+--------------------+
|["1", "Harshit", ...|
|["2", "Mohit", "B...|
|["2", "Mohit", "R...|
|["2", "Mohit", "B...|
|["3", "Rohit", "B...|
|["4", "Shobhit", ...|
+--------------------+
And my expected result is something like:
|prodid|name |city |
+------+---------+------+
|"1" |"Harshit"|"VNS" |
|"2" |"Mohit" |"BLR" |
|"2" |"Mohit" |"RAO" |
|"2" |"Mohit" |"BTR" |
|"3" |"Rohit" |"BOM" |
|"4" |"Shobhit"|"KLK" |
Note: there are only 3 headers in this example, but my original data has a lot of headers, so manually typing each and every one of them is not an option in case the file header changes. How do I get this modified dataframe from that?
Edit: I need the quotes on all values except the integers, so the output should be something like:
|prodid|name |city |
+------+---------+------+
|1 |"Harshit"|"VNS" |
|2 |"Mohit" |"BLR" |
|2 |"Mohit" |"RAO" |
|2 |"Mohit" |"BTR" |
|3 |"Rohit" |"BOM" |
|4 |"Shobhit"|"KLK" |
Might be easier to use select instead:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val df = Seq((1, "Harshit", "VNS"), (2, "Mohit", "BLR"))
  .toDF("prodid", "name", "city")

// keep integer columns as-is, wrap everything else in double quotes
df.select(df.schema.fields.map {
  case StructField(name, IntegerType, _, _) => col(name)
  case StructField(name, _, _, _) => format_string("\"%s\"", col(name)) as name
}: _*).show()
Output:
+------+---------+-----+
|prodid| name| city|
+------+---------+-----+
| 1|"Harshit"|"VNS"|
| 2| "Mohit"|"BLR"|
+------+---------+-----+
Note that there are other numeric types as well, such as LongType and DoubleType, so you might need to handle those too, or alternatively just quote StringType only.
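If only string columns should be quoted, whatever the numeric types of the other columns, a variant of the same pattern (a sketch under that assumption) can match on StringType instead:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Sketch: quote only string columns; every other type (IntegerType, LongType,
// DoubleType, ...) is passed through unchanged.
df.select(df.schema.fields.map {
  case StructField(name, StringType, _, _) => format_string("\"%s\"", col(name)) as name
  case StructField(name, _, _, _)          => col(name)
}: _*).show()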

except operation on two dataframe having a map column

I have two dataframes, dfA and dfB. I want to remove all occurrences of dfB's rows from dfA. The problem, however, is that they have a column of datatype map, and the except operation doesn't work well with that.
|id |fee_amount |optional            |
|---|-----------|--------------------|
|1  |10.00      |{1 -> abc, 2 -> def}|
|2  |20.0       |{3 -> pqr, 5 -> stu}|
I was thinking I could drop the column somehow and add it back but it won't work because I wouldn't know which rows got removed from dfA. Options?
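No answer is recorded here, but one commonly suggested direction (a hedged, untested sketch only; the column names are taken from the sample above) is to serialize the map column into a comparable form with to_json and then use a left anti join in place of except:
import org.apache.spark.sql.functions._

// Sketch: make the map column comparable by serializing it, anti-join on all
// columns, then drop the helper column. Caveats: unlike except, this keeps
// duplicate rows from dfA, and to_json depends on the map's key order, so
// logically equal maps may not match if their keys were inserted differently.
def withJson(df: org.apache.spark.sql.DataFrame) =
  df.withColumn("optional_json", to_json(col("optional")))

val joinCols = Seq("id", "fee_amount", "optional_json")

val result = withJson(dfA)
  .join(withJson(dfB).select(joinCols.map(col): _*), joinCols, "left_anti")
  .drop("optional_json")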

Clean dirty data

I have three variables (ID, Name and City) and need to generate a new variable, flag.
Something is wrong with some of the observations. I need to find the wrong observations and create the flag. The variable flag indicates which column contains the wrong observation.
Suppose there is at most one bad observation in each row.
Given this dirty data:
|ID |Name |City
|1 |IBM |D
|1 |IBM |D
|2 |IBM |D
|3 |Google |F
|3 |Microsoft |F
|3 |Google |F
|8 |Microsoft |A
|8 |Microsoft |B
|8 |Microsoft |A
Result
|ID |Name |City |flag
|1 |IBM |D |0
|1 |IBM |D |0
|2 |IBM |D |1
|3 |Google |F |0
|3 |Microsoft |F |2
|3 |Google |F |0
|8 |Microsoft |A |0
|8 |Microsoft |B |3
|8 |Microsoft |A |0
Here is an answer in Stata that rests on many assumptions that you pointed out in the comments but not in the initial question:
clear all
input float ID str9 Name str1 City
1 "IBM" "D"
1 "IBM" "D"
2 "IBM" "D"
3 "Google" "F"
3 "Microsoft" "F"
3 "Google" "F"
8 "Microsoft" "A"
8 "Microsoft" "B"
8 "Microsoft" "A"
end
// tag duplicated observations: right == 0 marks the single observation
// in a group that has no exact duplicate (the suspect one)
duplicates tag, gen(right)
gen flag = 0
encode Name, gen(Name_n)
encode City, gen(City_n)
qui sum
forvalues start = 1(3)`r(N)' {
    local end = `start'+2
    // check if ID is all the same
    qui sum ID in `start'/`end'
    if `r(sd)' != 0 {
        replace flag = 1 in `start'/`end' if right == 0
        continue
    }
    // check if Name is all the same
    qui sum Name_n in `start'/`end'
    if `r(sd)' != 0 {
        replace flag = 2 in `start'/`end' if right == 0
        continue
    }
    // check if City is all the same
    qui sum City_n in `start'/`end'
    if `r(sd)' != 0 {
        replace flag = 3 in `start'/`end' if right == 0
        continue
    }
}
drop right Name_n City_n
The intuition is this: because the observations are grouped in threes, two of the three are always right, there is only one issue per group, and they are sorted by ID (which can be wrong, but never greater than the next correct ID). So we can first check for duplicates; if an observation is duplicated, it is right.
Next, in the forvalues loop, we go through each group of three to see which variable has the wrong value; when we find it, we replace flag with the appropriate number.
This code is based on Eric's answer.
clear all
input float ID str9 Name str1 City
1 "IBM" "D"
1 "IBM" "D"
2 "IBM" "D"
3 "Google" "F"
3 "Microsoft" "F"
3 "Google" "F"
8 "Microsoft" "A"
8 "Microsoft" "B"
8 "Microsoft" "A"
end
encode Name, gen(Name_n)
encode City, gen(City_n)
// tag duplicates for each pair of variables and for all three together
duplicates tag ID Name, gen(col_12)
duplicates tag ID City, gen(col_13)
duplicates tag Name City, gen(col_23)
duplicates tag ID Name City, gen(col_123)
// generate the flag
gen flag = 0
replace flag = 1 if col_123 == 0 & col_23 ~= 0
replace flag = 2 if col_123 == 0 & col_13 ~= 0
replace flag = 3 if col_123 == 0 & col_12 ~= 0
drop Name_n City_n col_*

Display %ROWCOUNT value in a select statement

How is the result of %ROWCOUNT displayed in an SQL statement?
Example:
Select top 10 * from myTable
I would like the results to have a row count for each row returned in the result set, e.g.:
+----------+--------+---------+
|rowNumber |Column1 |Column2 |
+----------+--------+---------+
|1 |A |B |
|2 |C |D |
+----------+--------+---------+
There is no simple way to do it. You can add an SQL procedure with this functionality and use it in your SQL statements.
For example, class:
Class Sample.Utils Extends %RegisteredObject
{
ClassMethod RowNumber(Args...) As %Integer [ SqlProc, SqlName = "ROW_NUMBER" ]
{
    quit $increment(%rownumber)
}
}
and then, you can use it in this way:
SELECT TOP 10 Sample.ROW_NUMBER(id) rowNumber, id,name,dob
FROM sample.person
ORDER BY ID desc
You will get something like below
+-----------+-------+-------------------+-----------+
|rowNumber |ID |Name |DOB |
+-----------+-------+-------------------+-----------+
|1 |200 |Quigley,Neil I. |12/25/1999 |
|2 |199 |Zevon,Imelda U. |04/22/1955 |
|3 |198 |O'Brien,Frances I. |12/03/1944 |
|4 |197 |Avery,Bart K. |08/20/1933 |
|5 |196 |Ingleman,Angelo F. |04/14/1958 |
|6 |195 |Quilty,Frances O. |09/12/2012 |
|7 |194 |Avery,Susan N. |05/09/1935 |
|8 |193 |Hanson,Violet L. |05/01/1973 |
|9 |192 |Zemaitis,Andrew H. |03/07/1924 |
|10 |191 |Presley,Liza N. |12/27/1978 |
+-----------+-------+-------------------+-----------+
If you are willing to rewrite your query, you can use a view counter to do what you are looking for (see the InterSystems documentation on the %VID field).
The short version is that you move your query into a FROM clause subquery and use the special field %vid.
SELECT v.%vid AS Row_Counter, Name
FROM (SELECT TOP 10 Name FROM Sample.Person ORDER BY Name) v
Row_Counter Name
1 Adam,Thelma P.
2 Adam,Usha J.
3 Adams,Milhouse A.
4 Allen,Xavier O.
5 Avery,James R.
6 Avery,Kyra G.
7 Bach,Ted J.
8 Bachman,Brian R.
9 Basile,Angelo T.
10 Basile,Chad L.