Row number function in Wrangler in Cloud Data Fusion - google-cloud-data-fusion

I have data in a GCS bucket, and I want to create a new row_number() column and find the max record from the original data. For example, below is my raw data:
ID MEMBER_ID SERVICE
3 234 xyz
4 234 abc
1 123 hyts
4 876 bts
10 876 xyz
And I want the output below written to my BigQuery table:
ID MEMBER_ID SERVICE
4 234 abc
1 123 hyts
10 876 xyz
Can you please suggest a possible way to do this in Cloud Data Fusion?

You can use the Deduplicate plugin (you can find it under the Analytics section in Pipeline Studio) to get the row with the max ID for each MEMBER_ID.
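A rough sketch of the plugin configuration for this case (exact property labels may vary with the plugin version):
Unique fields:    MEMBER_ID
Filter operation: ID, with function MAX
With that set up, the plugin keeps one row per MEMBER_ID, namely the one with the largest ID, which gives the expected output above; a BigQuery sink after the Deduplicate stage then writes the result to your table.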

Related

How to solve the below scenario using a transformer loop or anything else in DataStage

My data is in a single column coming from a file, like below.
Source_data --- (this is the column name)
CUSTOMER 15
METER 8
METERStatement 1
READING 1
METER 56
Meterstatement 14
Reading 5
Reading 6
Reading 7
CUSTOMER 38
METER 24
METERStatement 1
READING 51
CUSTOMER 77
METER 38
READING 9
I want the output data to be like below in one column
CUSTOMER 15 METER 8 METERStatement 1 READING 1
CUSTOMER 15 METER 56 Meterstatement 14 Reading 5
CUSTOMER 15 METER 56 Meterstatement 14 Reading 6
CUSTOMER 15 METER 56 Meterstatement 14 Reading 7
CUSTOMER 38 METER 24 Meterstatement 1 Reading 51
CUSTOMER 77 METER 38 'pad 100 spaces' Reading 9
I am trying to solve this by reading the transformer looping documentation but could not figure out an actual solution. Anything helps. Thank you all.
Yes, this could be solved within a Transformer stage.
Concatenation is done with ":".
So use a stage variable to concatenate the input until a new "Meter" or "Customer" row comes up.
Save the "Customer" in a second stage variable in case it does not change.
Use a condition to only output the rows where a "Reading" exists.
Reset the concatenated string when a "Reading" has been processed.
I guess you want the padding for missing fields in general - you could do these checks in separate stage variables. You have to store the previous item in order to know what is missing - and maybe even more if two consecutive items could be missing.
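A rough sketch of the stage variable logic (pseudocode only, not exact DataStage syntax; the names svType, svCustomer and svMeterBlock are made up for illustration):
svType:       first word of in.Source_data, e.g. Field(in.Source_data, " ", 1)
svCustomer:   If svType = "CUSTOMER" Then in.Source_data Else svCustomer
svMeterBlock: If svType = "METER" Then in.Source_data
              Else If svType is a METERStatement Then svMeterBlock : " " : in.Source_data
              Else svMeterBlock
Output link constraint:   svType is a READING (compare case-insensitively)
Output column derivation: svCustomer : " " : svMeterBlock : " " : in.Source_data
The padding for a missing METERStatement would be handled inside svMeterBlock (append the 100 spaces when a READING arrives directly after a METER), as described above.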

Spotfire data difference: same column

I have the following table:
Id Claim_id Date
4 111 10/08/2017
5 333 27/08/2017
2 111 07/08/2017
3 222 08/08/2017
1 444 03/07/2017
7 333 02/09/2017
6 333 28/08/2017
There are more rows (dates) associated with the same Claim_id; column "Id" is based on column "Date" (more recent dates have a greater Id).
I need to create a Calculated Column given by the date difference over claim_id, with the following output:
Id Claim_id Date Days
3 111 10/08/2017 3
1 333 27/08/2017
2 111 07/08/2017
4 222 08/08/2017
7 444 03/07/2017
6 333 02/09/2017 5
5 333 28/08/2017 1
I have tried to use the code given here: Spotfire date difference using over function but it doesn't work (it produces wrong values).
I think that, maybe, it's because my table is not sorted, but I can't order it because I have no access to the source database.
How can I modify that expression?
Thank you!
Valentina
@V.Ang - One way to do this is by adding a column 'decreasing_count'.
What this does is count the number of instances of each Claim_id by date: the row with the highest date is counted first, followed by the next instance of the same Claim_id with a lower date, and so on. The advantage of this column is that your data does not need to be sorted for this solution to work.
Now, using this 'decreasing_count' column, calculate the difference of dates.
decreasing_count column expression:
Count([Claim_id]) over (Intersect([Claim_id],AllNext([Date])))
Note: This column works in the background. You need not display it in the table
Days calculated column expression:
Days([Date] - Min([Date]) over (Intersect([Claim_id],Next([decreasing_count]))))
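For illustration, the three Claim_id 333 rows from the sample would come out as below (values worked out by hand from the two expressions above; worth double-checking in Spotfire):
Date        decreasing_count  Days
02/09/2017  1                 5
28/08/2017  2                 1
27/08/2017  3                 (empty, no earlier date for this claim)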
Hope this helps!

Combine rows from 2 files and write to DB using Spring Batch

I have File1.csv, with the columns id,name,age.
File2.csv has the columns id,designation. In both files, id refers to the same value and is unique.
Sample data
File1.csv
id name age
101 abc 30
102 def 25
File2.csv
id designation
101 manager
102 Assistant manager
Spring Batch should read the files simultaneously, combine the data, and write it to the DB as below:
id name age designation
101 abc 30 manager
102 def 25 Assistant manager
How do I read 2 files simultaneously in Spring Batch?
You have to implement a Reader that merges the two files together.
Have a look at my answer here: Aggregating processor or aggregating reader, where I have linked to other answers to a similar question.
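A minimal sketch of such a merging reader, assuming both CSV files are sorted by id and contain the same ids; the Person, Designation and CombinedRecord classes and the two delegate FlatFileItemReaders are placeholders you would define yourself:

import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemStreamReader;
import org.springframework.batch.item.file.FlatFileItemReader;

// Wraps two FlatFileItemReaders and emits one combined record per id.
public class MergingItemReader implements ItemStreamReader<CombinedRecord> {

    private final FlatFileItemReader<Person> personReader;           // File1.csv: id,name,age
    private final FlatFileItemReader<Designation> designationReader; // File2.csv: id,designation

    public MergingItemReader(FlatFileItemReader<Person> personReader,
                             FlatFileItemReader<Designation> designationReader) {
        this.personReader = personReader;
        this.designationReader = designationReader;
    }

    @Override
    public void open(ExecutionContext ctx) {
        // Open both delegates so they manage their own files and restart state.
        personReader.open(ctx);
        designationReader.open(ctx);
    }

    @Override
    public CombinedRecord read() throws Exception {
        Person p = personReader.read();
        Designation d = designationReader.read();
        if (p == null || d == null) {
            return null; // end of data: stop the step when either file is exhausted
        }
        // Both files are assumed to be in the same id order; fail fast if they drift apart.
        if (!p.getId().equals(d.getId())) {
            throw new IllegalStateException("Files out of sync at id " + p.getId());
        }
        return new CombinedRecord(p.getId(), p.getName(), p.getAge(), d.getDesignation());
    }

    @Override
    public void update(ExecutionContext ctx) {
        personReader.update(ctx);
        designationReader.update(ctx);
    }

    @Override
    public void close() {
        personReader.close();
        designationReader.close();
    }
}

The writer side could then be a normal database writer (for example a JdbcBatchItemWriter) that inserts the combined record into the target table.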

"Inserting" Records into Fields from a Database Feed

So the background to this is that I'm trying to create a survival curve based on a database feed, following the directions here.
What I have so far is three calculated fields, as below. Patient ID is not a calculated field or necessary for the survival analysis, but I believe it could be useful for this question. For reference, there are about 20,000 unique patients.
Patient ID | Time | Censor | Group
Id1 3 0 1
Id2 8 0 2
Id3 1 1 1
Id4 3 1 1
Id5 11 0 1
Id5 7 1 2
What I would like to do is insert two records (one for each group), like this:
Patient ID | Time | Censor | Group | Link
0 1
0 2
Id1 3 0 1 link
Id2 8 0 2 link
Id3 1 1 1 link
Id4 3 1 1 link
Id5 11 0 1 link
Id5 7 1 2 link
I unsuccessfully tried to create an Excel spreadsheet with these base attributes to union with the columns; however, an Excel spreadsheet does not appear to be able to union with a database.
My next idea is to find 2 of the 20,000 patients where I can create a calculated field along these lines (not sure this is feasible in Tableau, please excuse my syntax):
IF [Patient ID] = Id3 THEN [TIME] = 0 AND [CENSOR] IS NULL
END
and then a [Link] calculated formula:
IF [Patient ID] = Id3 THEN NULL
ELSE "link"
END
Any help would be appreciated. I would like to avoid inserting these records into the database.
The best/easiest option is to use an outer join to your Excel workbook; this is a new feature in Tableau version 10 (cross-database joins).
Then, once the dataset is combined, you can build business logic through a filter or calculated field based on the absence or presence of the Excel data.
http://www.tableau.com/about/blog/2016/7/integrate-your-data-cross-database-joins-56724
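Once the join is in place, the Link field from your example could be derived from whether the row came from the database side; a minimal sketch, assuming the two extra rows exist only in the Excel sheet, so their database [Patient ID] is null after the join:
IF ISNULL([Patient ID]) THEN NULL ELSE "link" END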

Pentaho Spoon transformation from Excel file

I have yearly data in my Excel file in the following format:
Country \ Years 1980 1981 ... 2010
Abkhazia 234 334 ... 456
Afghanistan 466 789 ... 732
...
And I want to transform my data into 3 different tables and load them into a Postgres database.
The tables should look something like this:
First table - country:
id | name
1 | Abkhazia
2 | Afghanistan
Second table - dates:
id | date
1 | 1980
2 | 1981
And the third is a table where all the data is stored, keyed by country and date:
country_id date_id data
1 1 234
1 2 334
2 1 466
2 2 789
... ... ...
Any ideas how I could achieve my goal?
Assuming the source Excel structure is as described above (I have custom built this):
There are basically 3 parts to your question. I will break the transformation down into parts for better understanding:
1. Loading Table - Country
This is pretty straightforward based on the data given in the Excel file. Simply take an Excel Input step >> add an Add sequence step and name the sequence Country ID >> select only the Country Name and Country ID >> load into the Country table using a Table Output step.
2. Loading Table - Year:
The idea here is to get the Year ID in row-wise format instead of the columns given in the Excel source data. PDI version 5 and above provides a very useful step called Metadata Structure. This step allows you to get the structure of your table. In this case, we need to have the year columns pulled, ignoring the country column.
Follow the steps as below:
Read the Excel Data >> Get the Metadata structure of your source >> Filter Out the Country Column (which is available in row at position=1) >> Add a Sequence Number. Name it YearID >> Finally Load the Year Table.
3. Loading the Final Table - Country and Year along with Data:
The way to bring all the column data values down to row level in PDI is to use the Row Normalizer step. Use this step to produce a normalized output. Now follow the steps below:
Read the Excel source data >> use a Row Normalizer step to normalize the rows based on the years >> do a Stream Lookup against the Country and Year tables above to fetch the CountryID and YearID respectively >> finally load the necessary column data into a Table Output step.
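For illustration, with the sample data above, the Row Normalizer output would look roughly like this (the field names are whatever you choose in the step):
Country      Year  Data
Abkhazia     1980  234
Abkhazia     1981  334
Afghanistan  1980  466
Afghanistan  1981  789
The two Stream Lookups then swap Country and Year for CountryID and YearID, giving the final rows (1, 1, 234), (1, 2, 334), (2, 1, 466), (2, 2, 789) shown in your third table.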
Hope it helps :)
I have placed the code in a GitHub repo along with the data file which I have used. It's here.
Also, I just realized that I have used the wrong naming conventions as per your question. Consider date_id as YearID; and instead of the plain id columns, I have used countryid and yearid.