ADF Unpivot Dynamically With New Column - azure-data-factory

There is an Excel worksheet in which I want to unpivot all the columns after "Currency Code" into rows. The number of columns that need to be unpivoted might vary, and new columns might be added after "NetIUSD". Is there a way to dynamically unpivot this worksheet with unknown columns?
It worked when I projected all the fields, defined the data type of all the numerical fields as "double", and set the unpivot column data type to "double" as well. However, additional columns might be added to the source file, and I won't be able to define their data types ahead of time; if a new column has a data type other than "double", it throws an error that the new column is not of the same unpivot data type.

I tried to repro this in a data flow with sample input details.
Take the Unpivot transformation and configure the unpivot settings as follows.
Ungroup by: Code, Currency_code
Unpivot column: Currency
Unpivoted Columns: Column arrangement: Normal
Column name: Amount
Type: string
Data Preview
All columns other than those listed in Ungroup by are unpivoted dynamically, even if additional fields are added later.

I can confirm Aswin's answer. I had the same issue: the data flow failed when new columns appeared dynamically. The cause was the data type of the unpivoted columns; changing it to string made everything run smoothly.
Imported projection does not affect this case. I've tried both an imported and a manually defined projection, and both work with the "string" data type.

Related

reduce function not working in derived column in adf mapping data flow

I am trying to create a derived column based on a matching condition and sum the values of the matching columns dynamically, so I am using the reduce function in an ADF mapping data flow derived column. But the column is not getting created even though the transformation expression looks correct.
Columns from source
Derived column logic
Derived column data preview without the new columns as per logic
I can see only the fields from the source, but not the derived column fields. If I use only array($$), the fields do get created.
Derived column data preview with logic only array($$)
How can I get a derived column with the summation of all the fields matching the condition?
We receive 48 weeks of forecast data, and it needs to be aggregated to a monthly basis.
eg: Input data
Output data:
JAN
----
506 -- this is the value for the first record, i.e. (94 + 105 + 109 + 103 + 95)
The problem is that array($$) in the reduce function contains only one element, so the reduce function cannot accumulate the contents of the matching columns correctly.
You can solve this by using two derived columns and a data flow parameter as follows:
Create derived columns with pattern matching for each month-week as you did before, but put the reference $$ into the value field instead of the reduce(...) function.
This will create derived columns like 0jan, 1jan, etc., containing copies of the original values. For example, Week 0 (1 Jan - 7 Jan) => 0jan with value 95.
This step gives you a predefined set of column names for each week, which you can use to summarize the values with specific column names.
Define Data Flow parameters for each month containing the month-week column names in a string array, like this:
ColNamesJan = ['0jan', '1jan', etc.], ColNamesFeb = ['0feb', '1feb', etc.], and so on.
You will use these column names in a reduce function to summarize the month-week columns to monthly column in the next step.
Create a derived column for each month, which will contain the monthly totals, and use the following reduce function to sum the weekly values:
reduce(array(byNames($ColNamesJan)), 0, #acc + toInteger(toString(#item)), #result)
Replace the parameter name accordingly.
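Putting the three steps together, the relevant settings might look roughly like this (a sketch from my test setup; the 0jan/1jan column names and the five-week split for January are assumptions based on the naming described above):

    Data flow parameter (type string[]):
        ColNamesJan = ['0jan', '1jan', '2jan', '3jan', '4jan']
    Derived column for the January total:
        Jan = reduce(array(byNames($ColNamesJan)), 0, #acc + toInteger(toString(#item)), #result)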
I was able to summarize the columns dynamically with the above solution.
Please let me know if you need more information (e.g. screenshots) to reproduce the solution.
Update -- Here are the screenshots from my test environment.
Data source (data preview):
Derived columns with pattern matching (settings)
Derived columns with pattern matching (data preview)
Data flow parameter:
Derived column for monthly sum (settings):
Derived column for monthly sum (data preview):

how to replace null values in dynamic table with 'mean' or 'unknown' as per the column data type in azure data factory?

I have data from two data sources, i.e. SQL and PostgreSQL. For every table, I want to replace null values with the column mean if the column type is integer, and with 'Unknown' if the column type is string.
I have tried using a derived column, but I am not sure how to pass the column names dynamically.
I created a pipeline with a Lookup activity and a ForEach activity that calls a data flow.
The migration is from SQL to Postgres, so I need to validate the tables as well as the null values.
You have two cases here: the first is replacing null values in a string column with 'unknown', and the second is replacing null values in an integer column with the mean of the values in that column.
Main idea:
Add a derived column to replace the null values in the string column with 'unknown'.
Fix the null values in the integer column by replacing null with zero; those zeros are then replaced with the mean value once it is calculated in the Window transformation.
Here is a quick demo that I built in ADF.
First, I created a dataset with 3 columns (name, height, address), where height is an integer and address is a string, like so:
ADF:
Derived Column activity:
Modified the address and height columns as mentioned above.
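For reference, the derived-column expressions could look roughly like this (a sketch using the standard iif and isNull expression functions; address and height are the sample column names above):

    address : iif(isNull(address), 'unknown', address)
    height  : iif(isNull(height), 0, height)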
Window activity:
In the Window transformation, the idea is to replace the zeros with the mean value. To make the difference visible, I added a new column named 'newHeight', but you can overwrite the original height column instead.
In the window settings -> window columns:
I added a new column, newHeight, with the value:
case(height == 0, divide(sum(height), count(height)), toLong(height))
Output:
Please read more about the Window transformation here:
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-window

Can I create a parameter in a Local?

I created a Derived Column with an expression
(dummy sample)
iif(columnX == 'true', 1, 0)
This expression will be useful in other Derived Columns, so I'd like to create a Local with this expression, but in place of columnX I want to pass a parameter pointing to another column.
Is that possible? How?
I tried creating a data flow parameter (param2) and adding your expression to it, using a different parameter (param1) in place of columnX.
But I was not able to change the parameter value in the expression later; it only took the default value. I also did not find any documentation on assigning a column value to a parameter in a data flow.
The only approach I can think of is to use the expression multiple times in different derived columns, substituting a different column for columnX each time.
Derived column1: Added expression to the new column (col1)
Preview of Derived column1: Evaluating expression against Sample1 column.
Derived column2: Reusing the same column name col1 to evaluate the expression against the Sample2 column. If the previous value is needed, you can assign the previous value of col1 to a new column (previous_col1) in derived column2, as shown in the snip below.
Preview of derived column2
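As a sketch, the workaround boils down to repeating the same expression with a different input column each time (Sample1 and Sample2 are the sample column names from my test; col1 and previous_col1 are the derived columns shown above):

    Derived column1:   col1 = iif(Sample1 == 'true', 1, 0)
    Derived column2:   previous_col1 = col1
                       col1 = iif(Sample2 == 'true', 1, 0)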

Pivot data in mapping dataflows

I want to pivot the values into columns named after the description using mapping data flows.
My example
So the value column maps to a new column for each description value.
It should look like this
I understand that the grouping will be on the IDName, ID and DateTime columns, and I have removed the columns I don't need.
I'd like to know what goes where in the Pivot settings.
Thanks
Step 1: Create the data flow.
Step 2: Add the original dataset as the source.
Step 3: In the Pivot settings, group by the ID, IDName and DateTime columns (the Pivot key and Pivoted columns settings are covered in the note after these steps).
You will get the data preview as expected.
Step 4: Configure the sink with a CSV output dataset and write the data to the target file.
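On the 'what goes where' part of the question: besides the Group by columns, the Pivot settings also have a Pivot key tab, which here would take the Description column, and a Pivoted columns tab, which takes an aggregate expression over the value column. A minimal sketch of that expression, assuming the column is named Value and holds numeric text:

    sum(toInteger(Value))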

How to populate a Spark DataFrame column based on another column's value?

I have a use case where I need to select certain columns from a dataframe containing at least 30 columns and millions of rows.
I'm loading this data from a Cassandra table using Scala and Apache Spark.
I selected the required columns using: df.select("col1","col2","col3","col4")
Now I have to perform a basic groupBy operation to group the data by src_ip, src_port, dst_ip and dst_port, and I also want the latest value from the received_time column of the original dataframe.
I want a dataframe with distinct src_ip values, their count, and the latest received_time in a new column called last_seen.
I know how to use .withColumn, and I think .map() could also be used here.
Since I'm relatively new to this field, I really don't know how to proceed. I could really use your help to get this task done.
Assuming you have a dataframe df with the columns src_ip, src_port, dst_ip, dst_port and received_time, you can try:

import org.apache.spark.sql.functions.{col, count, max}

val mydf = df.groupBy(col("src_ip"), col("src_port"), col("dst_ip"), col("dst_port"))
  .agg(count("received_time").as("row_count"), max(col("received_time")).as("max_received_time"))

The snippet above calculates, for each group-by key, the count of received timestamps as well as the maximum timestamp for that group.
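If you specifically want one row per distinct src_ip with its row count and the latest timestamp named last_seen, a variant of the same idea might look like this (a sketch, assuming the same df as above):

import org.apache.spark.sql.functions.{col, count, lit, max}

// one row per distinct src_ip, with the number of rows seen and the latest received_time
val bySrcIp = df.groupBy(col("src_ip"))
  .agg(count(lit(1)).as("count"), max(col("received_time")).as("last_seen"))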