How to fill down values using Azure Data Factory

Sorry for the basic question. I am coming from a Power Query background and have started using ADF for a new project. First I tried wrangling data flows, where fill down is not supported. Now I am trying a mapping data flow, and I can't find in the documentation how to fill down a value.
See the example below: I have the ID column and am looking to add FILL_ID.

This data flow script snippet will do the trick:
source1 derive(dummy = 1) ~> DerivedColumn1
DerivedColumn1 window(over(dummy),
    asc(movie, true),
    startRowOffset: -1L,
    endRowOffset: 0L,
    Rating2 = first(coalesce(Rating))) ~> Window1
Window1 derive(Rating = iif(isNull(Rating), Rating2, Rating)) ~> DerivedColumn2
Create a new data flow
Add a Source transformation that points to your text file
Click the Script button at the top right of the browser UI
Hit Enter to create a newline at the bottom of the script
Paste the above snippet and click OK
You should now see a Derived Column, a Window, and another Derived Column. Go into the Window transformation and change my column names to yours for the sort and the coalesce() function. Then, in the 2nd Derived Column, pick the names of your columns.
The first Derived Column creates a dummy variable that you'll need because your use case is to pick the previous non-null value across the entire dataset.
The Window sorts the data because your use case requires it, and the window column uses coalesce() to find the first non-null value in the frame.
The 2nd Derived Column swaps in the previous value if the current one is NULL.

You can use a Derived Column transformation.
1. Add a new column or select an existing column from your source.
2. Enter an expression. If the value of your column is null (you can check this using Data preview), you can use the iifNull() function. For more about expressions in data flows, refer to the documentation.
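For illustration, a minimal expression for such a column (a sketch, assuming ID is a string column; 'N/A' is just a placeholder default):

FILL_ID = iifNull(ID, 'N/A')

Note that this substitutes a fixed value for nulls rather than filling down the previous non-null row, which is what the Window approach above achieves.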

Related

How to format the negative values in dataflow?

I have the below column in my table.
I need the output as below.
I am using a data flow in Azure Data Factory and am unable to get the above output. I used a derived column but had no success. I tried the replace function, but the result is not correct. Can anyone advise how to format this in a data flow?
The source is taken into the data flow.
A Derived Column transformation is added after the source.
A new column is added, and the expression is given as:
iif(left(id, 1) == '-', replace(replace(id, 'USD', ''), '-', '-$'), concat('$', replace(id, 'USD', '')))
Output of the Derived Column transformation:
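To make the behavior concrete, here is what the expression produces for some sample values (values are assumed for illustration, inferred from the expression):

id        new column
-12USD    -$12
12USD     $12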

Azure Data Factory Mapping Dataflow add Rownumber

I thought this would be fairly straightforward, but I can't really find a simple way of doing it. I want to add a unique row number to a source dataset in an ADF Mapping Data Flow. In SSIS I would have done this with a Script Component, but as far as I can see there's no option for that in ADF. I've looked for suitable functions in the Derived Column expression editor and also the Aggregate component, but there doesn't appear to be one.
Any ideas how this could be achieved?
Thanks
Many options:
Add a surrogate key transform
Hash row columns in Derived Column using SHA2
Use the rowNumber() function in a Window transformation
Give those a shot and let us know what you think
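For the surrogate key option, a minimal data flow script sketch (the stream and column names are placeholders):

source1 keyGenerate(output(rowNum as long),
    startAt: 1L) ~> SurrogateKey1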
I did it like this:
1. Add a column with the same value in all rows (I used an integer with value = 1).
2. Add a Window transformation, using the column created in step 1 in the Over tab.
3. Add a column in the Window columns tab with any name and rowNumber() as the expression.
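The same approach expressed in data flow script (a sketch; 'id' is a placeholder sort column, replace it with whatever ordering you want the numbering to follow):

source1 derive(dummy = 1) ~> AddDummy
AddDummy window(over(dummy),
    asc(id, true),
    rowNum = rowNumber()) ~> AddRowNumber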

Column defined in source Dataset could not be found in the actual source

I have an ADF Copy Data activity, and I'm getting the following error at runtime:
My source is defined as follows:
In my data set, the column is defined as shown below:
As you can see from the second image, the column IsLiftStation is defined in the source. Any idea why ADF cannot find the column?
I've had the same error. You can solve this by either selecting all columns (*) in the source and then mapping the ones you want to the sink schema, or by clearing the mapping, in which case the Copy activity will auto-map to columns in the sink schema (best if columns have the same names in source and sink). Either of these approaches works.
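For reference, an explicit mapping lives in the Copy activity's translator JSON and looks roughly like this (a sketch; IsLiftStation is the column from the question, the rest follows the standard TabularTranslator shape):

"translator": {
    "type": "TabularTranslator",
    "mappings": [
        { "source": { "name": "IsLiftStation" }, "sink": { "name": "IsLiftStation" } }
    ]
}

Clearing the mapping in the UI corresponds to removing the translator block entirely, which triggers the auto-mapping behavior described above.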
Unfortunately, clicking the import schema button in the mapping tab doesn't work. It does produce the correct column mappings based on the columns in the source query, but I still get the original error 'the column could not be located in the actual source' after doing this mapping.
Could you check whether there is a column named 'ae_type_id' in your schema? If so, could you remove that column and try again? The columns in the schema must be aligned with the columns in the query.
The issue is caused by an incomplete schema in one of the data sources. My solution is:
Step through the data flow, selecting the first schema, and use Import projection.
Go to the flow and run Data Preview.
Repeat for each step.
In my case, there were trailing commas in one of the CSV files. This caused automated column names to be created in the import allowing me to fix the data file.
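For illustration, this is the kind of malformed CSV that causes it (hypothetical data): the trailing commas make the parser see extra, unnamed columns, which are then auto-named on import.

id,name,IsLiftStation
1,Station A,true,
2,Station B,false,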

How to assign csv field value to SQL query written inside table input step in Pentaho Spoon

I am pretty new to Pentaho, so my question might sound very novice.
I have written a transformation in which I am using a CSV file input step and a Table input step.
Steps I followed:
Initially, I created a parameter in the transformation properties. The parameter birthdate doesn't have any default value set.
I have used this parameter in the PostgreSQL query in the Table input step in the following manner:
select * from person where EXTRACT(YEAR FROM birthdate) > ${birthdate};
I am reading the CSV file using the CSV file input step. How do I assign the birthdate value present in my CSV file to the parameter I created in the transformation?
(OR)
Could you guide me through the process of assigning the CSV field value directly to the SQL query used in the Table input step, without the use of a parameter?
TL;DR: I recommend using a Database join step, as in my third suggestion below.
First idea - Using Table Input as originally asked
Well, you don't need any parameter for that, unless you are going to provide the value for the parameter when launching the transformation. If you need to read the value from a CSV, you can do it with this approach:
First, read your CSV and make sure your rows are ok.
After that, use a Select values step to keep only the columns to be used as parameters.
In the Table input, use a placeholder (?) to mark where the data goes, point 'Insert data from step' at the incoming step, and check 'Execute for each row', so it runs once per row it receives from the source step.
Just keep in mind that the order of the columns received by the Table input (the columns coming out of the Select values step) is the order in which they are bound to the placeholders (?). This should not be a problem in your case, which uses only one placeholder, but keep it in mind as you ramp up with Pentaho.
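With birthdate as the only incoming column, the Table input query from the question becomes (same query, with the parameter replaced by a placeholder):

select * from person where EXTRACT(YEAR FROM birthdate) > ?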
Second idea - Using a Database lookup
This is another approach where you can't customize the query sent to the database, but you may get better performance because you can set the 'Enable cache' flag. If you don't need to use a function in your where clause, this is really recommended.
Third idea - Using a Database join
That is my recommended approach if you need a function in your where clause. It looks a lot like the Table input approach, but you can skip the Select values step, choose which columns to use (you can even repeat the same column several times), and enable the 'Outer join' flag, which keeps rows for which the query returns no result.
Pro tip: if the transformation runs too slowly, try running multiple copies of the step (see the documentation), and obviously make sure the table has the appropriate indexes in place.
Yes, there's a way of assigning the value directly without the use of a parameter. Do as follows.
Use a 'Block this step until steps finish' step to halt the Table input step until the CSV file input step completes.
Following is how you configure each step.
Note:
The Postgres query should be: select * from person where EXTRACT(YEAR FROM birthdate) > ?::integer
Check 'Execute for each row' and 'Replace variables in script' in the Table input step.
Select only the birthdate column in the CSV file input step.

Autoincrement using Sequences is not working as expected

I am currently working on a job something like this:
The design is to extract some data about customers (say, first name and last name) to one Excel file, while other data (say, address) goes to another Excel file. I added an identity column in each tMap with Numeric.sequence("s1",1,1), but one file is getting 1,3,5,7,9,11,13,... and the other is getting 2,4,6,8,10,12,...
I need both Excel files to have the same identity values 1,2,3,4,5,6,...,N
so that I can map the records.
Can somebody guide me on this?
Edit:
The autoincrement returns 1,2,3,4,5,6,... as expected when there is only one tMap component in the job, but not when two tMaps are used.
This is because the numeric sequence is static. Since you have only one sequence called "s1", it is incremented twice at every iteration (once for each tMap it's invoked in).
Just use unique labels (i.e. "s1" and "s2") to force the use of two independent sequences, which solves your problem.
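For example, in the identity column expressions of the two tMaps (Talend expressions in Java syntax):

// first tMap: identity column expression
Numeric.sequence("s1", 1, 1)
// second tMap: same routine, different sequence name
Numeric.sequence("s2", 1, 1)

Each named sequence keeps its own counter, so both files get 1,2,3,...,N.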